AI Tools for Site Reliability Engineers
AI tools that help SREs track CVEs, audit endpoint security, generate incident runbooks, and research SLO and reliability patterns.
Works in Chat, Cowork and Code
Service endpoint health monitoring
Audit SSL certificate expiry, DNS record accuracy, and HTTP response health across your entire service mesh in a single query. Catch a certificate expiring during a holiday freeze or a dangling CNAME from a decommissioned service before they cause an outage.
api.myapp.com: SSL 87 days, healthy. grpc.myapp.com: SSL expires in 9 days (URGENT — renew before freeze). ws.myapp.com: healthy, 45 days. internal-api.myapp.com: CNAME points to old-internal.elb.amazonaws.com (load balancer deleted — dangling DNS). metrics.myapp.com: SSL healthy 120 days. Actions: renew grpc cert, remove dangling CNAME.
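For readers who want to see what sits behind a report like this, here is a minimal Python sketch of the certificate-expiry part only, using the standard library. The function names and the 21-day warning threshold are illustrative assumptions (the threshold mirrors the sample prompt on this page), not a description of how the tool itself works.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` and a certificate's notAfter timestamp."""
    # ssl.cert_time_to_seconds parses the "Mar 29 00:00:00 2025 GMT" style
    # string that SSLSocket.getpeercert() returns under "notAfter".
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def classify(days_left: int, warn_days: int = 21) -> str:
    """Label a certificate by remaining lifetime (warn_days is an assumed default)."""
    if days_left < 0:
        return "EXPIRED"
    return "URGENT" if days_left <= warn_days else "healthy"


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a TLS connection and return the peer certificate's notAfter field."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A caller would combine the three, for example `classify(days_until_expiry(fetch_not_after("api.myapp.com"), datetime.now(timezone.utc)))`. Dangling-CNAME detection needs a DNS lookup plus a check that the target still resolves, which this sketch does not attempt.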
Security posture checks
Audit production endpoints for security header misconfigurations that fail SOC 2 audits — missing HSTS, overly permissive CORS, TLS 1.0 still enabled, and error responses that leak stack traces.
Findings: HSTS max-age is 86400 (1 day); it should be at least 31536000 (1 year). CORS: Access-Control-Allow-Origin: * on /api/events. TLS 1.0 is still enabled (disable it). 500 error returns {"error": "PrismaClientKnownRequestError: ...at /app/src/..."}, which exposes the file path and ORM. Priority: fix error handling first (P1), then TLS 1.0, then CORS.
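The header portion of an audit like this can be approximated with a small function over a response's headers. The sketch below is an assumption-laden illustration: the function name, the finding strings, and the one-year HSTS minimum are choices made here, and it does not cover TLS version probing or stack-trace detection.

```python
import re


def audit_headers(headers: dict[str, str], min_hsts_age: int = 31536000) -> list[str]:
    """Return human-readable findings for a few common security headers."""
    # HTTP header names are case-insensitive, so normalise them first.
    h = {name.lower(): value for name, value in headers.items()}
    findings = []

    hsts = h.get("strict-transport-security")
    if hsts is None:
        findings.append("HSTS header missing")
    else:
        match = re.search(r'max-age\s*=\s*"?(\d+)', hsts, re.IGNORECASE)
        age = int(match.group(1)) if match else 0
        if age < min_hsts_age:
            findings.append(f"HSTS max-age is {age}, below {min_hsts_age}")

    if h.get("access-control-allow-origin", "").strip() == "*":
        findings.append("CORS allows any origin (*)")

    if h.get("x-content-type-options", "").strip().lower() != "nosniff":
        findings.append("X-Content-Type-Options is not nosniff")

    if "content-security-policy" not in h:
        findings.append("CSP header missing")

    return findings
```

The `headers` argument can come from any HTTP client; for instance, `dict(response.headers)` from `urllib.request.urlopen` has the right shape.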
Infrastructure CVE triage
Check Kubernetes, container runtime, and node pool packages for critical CVEs before upgrading production infrastructure. A high-severity CVE in containerd or runc can allow container escape, giving an attacker access to the host node.
runc@1.1.11: CVE-2024-21626 (CVSS 8.6), container escape via file descriptor leak. Upgrade to 1.1.12. containerd@1.7.13: clean. kubernetes@1.28.3: CVE-2023-5528 (CVSS 7.2), insecure volume mount on Windows nodes (Linux not affected). kube-proxy, kubelet: clean. If running Linux nodes: runc is the critical upgrade. Windows nodes: also patch kubernetes.
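The triage step at the end of that report (filter by severity, then by which node operating systems the cluster actually runs) can be sketched as follows. The `Finding` type, the `triage` function, and the CVSS 7.0 cutoff are hypothetical names and defaults chosen for illustration; the two entries reuse the CVE IDs and scores from the sample output above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    package: str
    cve: str
    cvss: float
    platforms: frozenset  # node operating systems the CVE applies to


# Illustrative data taken from the sample report, not a live lookup.
FINDINGS = [
    Finding("kubernetes", "CVE-2023-5528", 7.2, frozenset({"windows"})),
    Finding("runc", "CVE-2024-21626", 8.6, frozenset({"linux"})),
]


def triage(findings, node_oses, min_cvss=7.0):
    """Keep findings at or above min_cvss that affect an OS in the cluster,
    ordered from most to least severe."""
    relevant = [
        f for f in findings
        if f.cvss >= min_cvss and f.platforms & set(node_oses)
    ]
    return sorted(relevant, key=lambda f: f.cvss, reverse=True)
```

Calling `triage(FINDINGS, {"linux"})` keeps only the runc entry, while a mixed cluster, `{"linux", "windows"}`, also keeps the Kubernetes one.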
Runbook and incident diagram generation
Generate incident response flowcharts, system architecture diagrams for runbooks, and postmortem timelines from a text description. Get Mermaid diagrams that render in PagerDuty runbook links, Confluence, and GitHub.
Generated Mermaid flowchart with 9 decision nodes. Conditional branches: lag ≤30s goes to query analysis path, lag >30s goes to replica promotion. Recovery verification shown as loop with 3-minute check interval. Postmortem ticket creation annotated as final step with Jira link placeholder.
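To make the output format concrete, here is a small Python sketch that emits Mermaid flowchart text from a node and edge list. The `to_mermaid` helper and the node IDs are hypothetical, and the "Escalate to on-call DBA" branch is an assumption added so the decision node has two exits; the other steps follow the Postgres failover prompt on this page. It produces a much smaller chart than the 9-node example described above.

```python
def to_mermaid(nodes, edges):
    """Render a top-down Mermaid flowchart.

    nodes: {node_id: (text, is_decision)}
    edges: [(source_id, target_id, label_or_None)]
    """
    lines = ["flowchart TD"]
    for node_id, (text, is_decision) in nodes.items():
        # Braces draw a decision diamond, square brackets a plain box.
        left, right = ("{", "}") if is_decision else ("[", "]")
        lines.append(f'    {node_id}{left}"{text}"{right}')
    for source, target, label in edges:
        arrow = f"-->|{label}|" if label else "-->"
        lines.append(f"    {source} {arrow} {target}")
    return "\n".join(lines)


NODES = {
    "alert": ("Alert fires", False),
    "check": ("Check replication status", False),
    "lag": ("Replica lag under 30s?", True),
    "promote": ("Auto-promote via Patroni", False),
    "escalate": ("Escalate to on-call DBA", False),  # assumed branch
    "redirect": ("Redirect app to new primary", False),
    "verify": ("Verify connections", False),
    "ticket": ("Write postmortem ticket", False),
}

EDGES = [
    ("alert", "check", None),
    ("check", "lag", None),
    ("lag", "promote", "yes"),
    ("lag", "escalate", "no"),
    ("promote", "redirect", None),
    ("redirect", "verify", None),
    ("verify", "ticket", None),
]

diagram = to_mermaid(NODES, EDGES)
```

Printing `diagram` gives text that can be pasted into any Mermaid renderer; node text is quoted so punctuation such as the question mark is treated as label text.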
SLO and reliability pattern research
Research SLO burn rate alerting, error budget policies, and chaos engineering approaches from Google SRE, Stripe, and Netflix engineering blogs. Build evidence-based reliability practices instead of guessing.
The Google SRE Workbook recommends multiwindow burn-rate alerts. For a 99.9% SLO over 30 days: 1-hour burn rate >14.4x (consumes 2% of the error budget in 1 hour) and 6-hour burn rate >6x (consumes 5% of the error budget in 6 hours), both of which page; 3-day burn rate >1x (consumes 10% of the error budget in 3 days) opens a ticket. Prometheus: alert on sum(rate(errors[1h]))/sum(rate(requests[1h])) > 14.4 * (1-0.999).
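The 14.4x and 6x thresholds are not arbitrary: a burn rate is the fraction of error budget you are willing to spend in a window, scaled by how many such windows fit in the SLO period. The Python sketch below works through that arithmetic, assuming a 30-day SLO period; the function names are illustrative.

```python
def burn_rate(budget_fraction: float, window_hours: float,
              slo_period_hours: float = 30 * 24) -> float:
    """Burn rate that consumes `budget_fraction` of the error budget
    within `window_hours`, for an SLO measured over `slo_period_hours`."""
    return budget_fraction * slo_period_hours / window_hours


def error_ratio_threshold(rate: float, slo_target: float) -> float:
    """Error ratio at which the given burn rate is reached.
    This is the right-hand side of the Prometheus comparison."""
    return rate * (1 - slo_target)


fast = burn_rate(0.02, 1)   # 2% of budget in 1 hour, about 14.4
slow = burn_rate(0.05, 6)   # 5% of budget in 6 hours, about 6

# Alert thresholds on the error ratio for two SLO targets.
threshold_999 = error_ratio_threshold(fast, 0.999)    # about 0.0144
threshold_9995 = error_ratio_threshold(fast, 0.9995)  # about 0.0072
```

The burn-rate multipliers stay the same when the SLO tightens from 99.9% to 99.95%; only the error-ratio threshold halves, because the error budget itself is half the size.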
Dependency supply chain auditing
Audit the Helm charts, Kubernetes operators, and infrastructure tools your team is adding to the cluster. SREs own the cluster, and an undetected supply chain compromise in an operator running as cluster-admin exposes every workload and secret in it.
argocd@2.10: maintained by CNCF Argo project, no advisories. external-secrets@0.9: clean, active maintainers. keda@2.14: clean, CNCF project. velero@1.13: CVE-2024-37082 (CVSS 6.5) — backup restoration path traversal. Upgrade to 1.14.0 before deploying. Others safe.
Ready-to-use prompts
Check SSL expiry and DNS health for these production endpoints: api.myapp.com, grpc.myapp.com, ws.myapp.com, admin.myapp.com, metrics.myapp.com. Flag certs expiring within 21 days and any dangling CNAME records.
Audit https://api.myapp.com for security header compliance: HSTS (check max-age value), CORS policy, CSP, X-Content-Type-Options, and whether 500 error responses expose stack traces or framework names.
Check these Kubernetes components for CVEs before a node pool upgrade: kubernetes@1.30.0, containerd@1.7.15, runc@1.1.13, coredns@1.11.1, kube-proxy@1.30.0. Flag anything CVSS 7+.
Generate a Mermaid incident flowchart for a Postgres primary failure: alert fires → check replication status → if replica lag <30s auto-promote via Patroni → redirect app to new primary → verify connections → write postmortem ticket.
Research multi-window burn rate alerting for 99.9% and 99.95% availability SLOs. Include Google SRE recommended window sizes, burn rate thresholds, and Prometheus alerting rule examples.
Audit these Kubernetes operators before cluster installation: argocd@2.10, external-secrets@0.9, keda@2.14, cert-manager@1.14, velero@1.13. Check maintainer activity and known CVEs.
Compare Istio, Linkerd, and Cilium service mesh for a 50-microservice cluster: mTLS overhead on p99 latency, observability integration with Prometheus, and operational complexity for a 3-person SRE team.
Research chaos engineering approaches for testing database failover: Gremlin vs Chaos Monkey vs AWS Fault Injection Service. Include how to safely test Aurora Global Database promotion and impact on connection pool recovery time.
Tools to power your best work
165+ tools.
One conversation.
Everything site reliability engineers need from AI, connected to the assistant you already use. No extra apps, no switching tabs.
Weekly infrastructure health check
Every week, verify SSL/DNS health, audit security headers, and scan infrastructure packages for new CVEs.
Node pool upgrade preparation
Before upgrading a Kubernetes node pool, check for CVEs in the new version, validate supply chain health, and update runbook diagrams.
Post-incident documentation
After an incident is resolved, generate the timeline diagram, research reliability improvements, and document the runbook changes.
Frequently Asked Questions
How does DNS & Domain help with SRE work beyond just checking certificate expiry?
DNS & Domain checks A records, CNAME targets, MX records, WHOIS registration expiry, and SSL/TLS certificate details in one pass. For SREs, dangling CNAMEs (pointing to deleted load balancers) and short HSTS max-age values are common issues it surfaces that SSL monitoring alone misses.
Can the Vulnerability Database alert on new CVEs for packages we are already running?
The Vulnerability Database is a lookup tool — you query it on demand. For continuous CVE monitoring, run it as a weekly automated check against your pinned package versions. For real-time alerts, integrate with GitHub Dependabot or Snyk separately.
Does Diagram Generator support sequence diagrams for incident timelines?
Yes. Specify the incident sequence in your prompt — "event A happened, then system B failed, SRE paged at T+5, remediation at T+22" — and Diagram Generator produces a PlantUML or Mermaid sequence diagram with actor annotations and timeline markers.
Does Deep Research have access to the Google SRE book content?
Deep Research synthesizes from web-accessible sources. The Google SRE book is available at sre.google/sre-book and its content is widely referenced in engineering blogs and articles that Deep Research can find and synthesize. For direct chapter citations, specify the topic and it will find the relevant SRE principles.
What does Supply Chain Risk check for that regular CVE scanning misses?
Supply Chain Risk looks for anomalies beyond known CVEs — sudden maintainer changes, unusual recent release patterns, dependency confusion attacks, and abandoned packages that may be squatted. These supply chain vectors are increasingly common in infrastructure tooling and are not covered by the CVE database.