AI Tools for Site Reliability Engineers

AI tools that help SREs track CVEs, audit endpoint security, generate incident runbooks, and research SLO and reliability patterns.

Get started for free

Works in Chat, Cowork and Code

api.myapp.com
SSL 87 days remaining · DNS healthy
grpc.myapp.com
SSL expires in 9 days — URGENT: renew before freeze
ws.myapp.com
SSL 45 days remaining · DNS healthy
internal-api.myapp.com
CNAME points to deleted load balancer — dangling DNS

Service endpoint health monitoring

Audit SSL certificate expiry, DNS record accuracy, and HTTP response health across your entire service mesh in a single query. Catch a certificate expiring during a holiday freeze or a dangling CNAME from a decommissioned service before they cause an outage.

Check SSL expiry and DNS health for all production endpoints: api.myapp.com, grpc.myapp.com, ws.myapp.com, internal-api.myapp.com, metrics.myapp.com.

api.myapp.com: SSL 87 days, healthy. grpc.myapp.com: SSL expires in 9 days (URGENT — renew before freeze). ws.myapp.com: SSL 45 days, healthy. internal-api.myapp.com: CNAME points to old-internal.elb.amazonaws.com (load balancer deleted — dangling DNS). metrics.myapp.com: SSL healthy 120 days. Actions: renew grpc cert, remove dangling CNAME.

ToolRouter check_dns
api.myapp.com
SSL 87 days · DNS healthy
grpc.myapp.com
SSL expires in 9 days — URGENT: renew before freeze
internal-api.myapp.com
CNAME points to deleted load balancer — dangling DNS
metrics.myapp.com
SSL 120 days · DNS healthy
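The same sweep is easy to script as a standalone check. A minimal sketch using only the Python standard library (the hostnames and the 21-day warning threshold are illustrative, not part of the tool):

```python
import ssl
import socket
from datetime import datetime, timezone

def days_until_expiry(not_after: str, now: datetime) -> int:
    """Parse an OpenSSL-style notAfter string, e.g. 'Jun  1 12:00:00 2025 GMT'."""
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    return (expires.replace(tzinfo=timezone.utc) - now).days

def cert_not_after(host: str, port: int = 443) -> str:
    """Fetch the leaf certificate's notAfter field via a TLS handshake."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]

def report(hosts: list[str], warn_days: int = 21) -> None:
    now = datetime.now(timezone.utc)
    for host in hosts:
        days = days_until_expiry(cert_not_after(host), now)
        status = "URGENT" if days <= warn_days else "ok"
        print(f"{host}: {days} days remaining ({status})")

# report(["api.myapp.com", "grpc.myapp.com"])  # illustrative hosts
```

A script like this catches expiry but not dangling CNAMEs; resolving each record and comparing targets against your live load balancers still takes a DNS-aware check.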

Security posture checks

Audit production endpoints for security header misconfigurations that fail SOC 2 audits — missing HSTS, overly permissive CORS, TLS 1.0 still enabled, and error responses that leak stack traces.

Audit https://api.myapp.com — check security headers, TLS configuration, CORS policy, and whether 500 errors expose stack traces or framework versions.

Findings: HSTS max-age is 86400 (1 day) — should be 31536000 minimum. CORS: Access-Control-Allow-Origin: * on /api/events. TLS 1.0 enabled (deprecate). 500 error returns {"error": "PrismaClientKnownRequestError: ...at /app/src/..."} — exposes file path and ORM. Priority: fix error handling first (P1), then TLS 1.0, then CORS.

ToolRouter scan_url
HSTS max-age
86400 (1 day) — should be 31536000 minimum
CORS
Access-Control-Allow-Origin: * on /api/events
TLS 1.0
Still enabled — deprecate for SOC 2 compliance
500 error leak
Returns ORM error + file path — fix first (P1)
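The header checks in a scan like this reduce to a few deterministic rules. A sketch of that rule set as a pure function over a captured response-header dict (the specific thresholds and header choices mirror the example findings and common SOC 2 expectations, not the tool's exact policy):

```python
def audit_headers(headers: dict[str, str]) -> list[str]:
    """Flag common security-header misconfigurations; returns a list of findings."""
    findings = []
    hsts = headers.get("Strict-Transport-Security", "")
    if not hsts:
        findings.append("HSTS missing")
    elif "max-age=" in hsts:
        max_age = int(hsts.split("max-age=")[1].split(";")[0])
        if max_age < 31536000:  # one year is the usual minimum
            findings.append(f"HSTS max-age {max_age} below 31536000")
    if headers.get("Access-Control-Allow-Origin") == "*":
        findings.append("CORS allows any origin")
    if "X-Content-Type-Options" not in headers:
        findings.append("X-Content-Type-Options missing")
    return findings
```

Feeding in the example endpoint's headers (`max-age=86400`, wildcard CORS) reproduces two of the findings above; TLS version and error-body leaks need an active probe rather than a header check.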

Infrastructure CVE triage

Check Kubernetes, container runtime, and node pool packages for critical CVEs before upgrading production infrastructure. A CVE in containerd or runc at CVSS 9+ can allow container escape.

Check for CVEs in: kubernetes@1.29.3, containerd@1.7.13, runc@1.1.12, kubelet@1.29.3, kube-proxy@1.29.3.

runc@1.1.12: CVE-2024-21626 (CVSS 8.6) — container escape via file descriptor leak. Upgrade to 1.1.13. containerd@1.7.13: clean. kubernetes@1.29.3: CVE-2023-5528 (CVSS 7.2) — insecure volume mount on Windows nodes (Linux not affected). kube-proxy, kubelet: clean. If running Linux nodes: runc is the critical upgrade. Windows nodes: also patch kubernetes.

ToolRouter search_cves
runc@1.1.12 — CRITICAL
CVE-2024-21626 (CVSS 8.6) — container escape · upgrade to 1.1.13
kubernetes@1.29.3
CVE-2023-5528 (CVSS 7.2) — Windows nodes only · Linux unaffected
containerd@1.7.13
Clean — no critical or high CVEs
kube-proxy, kubelet
Clean — no CVEs above CVSS 5.0
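Triage decisions like "runc is the critical upgrade on Linux nodes" follow a simple policy: keep findings at or above a CVSS floor that apply to your node OS. A sketch of that policy (the `Finding` records are illustrative, mirroring the example scan results):

```python
from dataclasses import dataclass

@dataclass
class Finding:
    package: str
    cve: str
    cvss: float
    platforms: tuple[str, ...]  # OSes the CVE applies to

def triage(findings: list[Finding], node_os: str, floor: float = 7.0) -> list[Finding]:
    """Keep findings at/above the CVSS floor that apply to the node OS, worst first."""
    hits = [f for f in findings if f.cvss >= floor and node_os in f.platforms]
    return sorted(hits, key=lambda f: f.cvss, reverse=True)

# Illustrative records mirroring the example results above.
FINDINGS = [
    Finding("runc@1.1.12", "CVE-2024-21626", 8.6, ("linux",)),
    Finding("kubernetes@1.29.3", "CVE-2023-5528", 7.2, ("windows",)),
]
```

Running `triage(FINDINGS, "linux")` surfaces only the runc escape; a Windows node pool surfaces the volume-mount CVE instead.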

Runbook and incident diagram generation

Generate incident response flowcharts, system architecture diagrams for runbooks, and postmortem timelines from a text description. Get Mermaid diagrams that render in PagerDuty runbook links, Confluence, and GitHub.

Create an incident response flowchart for a database latency spike: alert fires at p99>500ms → page SRE on-call → check replica lag → if lag>30s promote replica → redirect read traffic → verify recovery → page off → write postmortem.

Generated Mermaid flowchart with 9 decision nodes. Conditional branches: lag ≤30s goes to query analysis path, lag >30s goes to replica promotion. Recovery verification shown as loop with 3-minute check interval. Postmortem ticket creation annotated as final step with Jira link placeholder.

ToolRouter render_diagram
Alert
p99 > 500ms fires → page SRE on-call
Check
Query replica lag — if >30s: promote replica
Remediate
Redirect read traffic → verify recovery (3-min check loop)
Close
Page off → open postmortem Jira ticket
Output
9-node Mermaid flowchart with conditional branches — ready to embed
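The flowchart described above might look like this in Mermaid (node labels paraphrase the prompt; adjust the branch conditions to your own runbook):

```mermaid
flowchart TD
    A[Alert: p99 > 500ms] --> B[Page SRE on-call]
    B --> C{Replica lag > 30s?}
    C -- yes --> D[Promote replica]
    C -- no --> E[Analyze slow queries]
    D --> F[Redirect read traffic]
    F --> G{Recovered?}
    G -- no, recheck in 3 min --> G
    G -- yes --> H[Page off]
    H --> I[Open postmortem ticket]
```

The same source renders unchanged in Confluence, GitHub, and anywhere else that accepts Mermaid fences.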

SLO and reliability pattern research

Research SLO burn rate alerting, error budget policies, and chaos engineering approaches from Google SRE, Stripe, and Netflix engineering blogs. Build evidence-based reliability practices instead of guessing.

Research multi-window burn rate alerting for a 99.9% monthly availability SLO. What alerting windows does Google SRE recommend, and how do I set Prometheus alert thresholds?

Google SRE recommends multi-window burn rate alerting: page when the 1-hour burn rate exceeds 14.4x (consuming 2% of the monthly error budget in an hour), open a ticket when the 6-hour burn rate exceeds 6x (5% of the budget), and route the 24-hour window to Slack. Prometheus: alert on sum(rate(errors[1h]))/sum(rate(requests[1h])) > 14.4 * (1-0.999).

ToolRouter research
1h window (page)
Burn rate >14.4× — consumes 2% error budget/hour
6h window (ticket)
Burn rate >6× — consumes 5% error budget in 6 hours
24h window (Slack)
Burn rate >1× — steady consumption, needs monitoring
Prometheus
sum(rate(errors[1h])) / sum(rate(requests[1h])) > 14.4 * 0.001
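Expressed as Prometheus alerting rules, the paging and ticketing windows above might look like the following (the metric name `http_requests_total` and the `code=~"5.."` error selector are assumptions; substitute your own request and error series):

```yaml
groups:
  - name: slo-burn-rate
    rules:
      - alert: FastBurn   # page: 2% of the monthly error budget in 1 hour
        expr: >
          sum(rate(http_requests_total{code=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h])) > 14.4 * (1 - 0.999)
        labels:
          severity: page
      - alert: SlowBurn   # ticket: 5% of the monthly error budget in 6 hours
        expr: >
          sum(rate(http_requests_total{code=~"5.."}[6h]))
            / sum(rate(http_requests_total[6h])) > 6 * (1 - 0.999)
        labels:
          severity: ticket
```

The full Google SRE workbook recipe pairs each long window with a short one (e.g. 1h with 5m) to cut alert reset lag; the sketch above keeps only the long windows for clarity.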

Dependency supply chain auditing

Audit the Helm charts, Kubernetes operators, and infrastructure tools your team is adding to the cluster. SREs own the cluster, and a supply chain compromise in an operator running as cluster-admin is catastrophic.

Audit these before adding to the cluster: argocd@2.10, external-secrets@0.9, keda@2.14, velero@1.13.

argocd@2.10: maintained by CNCF Argo project, no advisories. external-secrets@0.9: clean, active maintainers. keda@2.14: clean, CNCF project. velero@1.13: CVE-2024-37082 (CVSS 6.5) — backup restoration path traversal. Upgrade to 1.14.0 before deploying. Others safe.

ToolRouter audit_packages
argocd@2.10
Clean — maintained by CNCF Argo project · no advisories
external-secrets@0.9
Clean — active maintainers · no advisories
keda@2.14
Clean — CNCF project · no advisories
velero@1.13
CVE-2024-37082 (CVSS 6.5) — path traversal · upgrade to 1.14.0 first

Ready-to-use prompts

Check all SSL certs

Check SSL expiry and DNS health for these production endpoints: api.myapp.com, grpc.myapp.com, ws.myapp.com, admin.myapp.com, metrics.myapp.com. Flag certs expiring within 21 days and any dangling CNAME records.

Audit API security headers

Audit https://api.myapp.com for security header compliance: HSTS (check max-age value), CORS policy, CSP, X-Content-Type-Options, and whether 500 error responses expose stack traces or framework names.

Kubernetes CVE check

Check these Kubernetes components for CVEs before a node pool upgrade: kubernetes@1.30.0, containerd@1.7.15, runc@1.1.13, coredns@1.11.1, kube-proxy@1.30.0. Flag anything CVSS 7+.

Database incident runbook

Generate a Mermaid incident flowchart for a Postgres primary failure: alert fires → check replication status → if replica lag <30s auto-promote via Patroni → redirect app to new primary → verify connections → write postmortem ticket.

SLO burn rate alerting

Research multi-window burn rate alerting for 99.9% and 99.95% availability SLOs. Include Google SRE recommended window sizes, burn rate thresholds, and Prometheus alerting rule examples.

Audit Kubernetes operators

Audit these Kubernetes operators before cluster installation: argocd@2.10, external-secrets@0.9, keda@2.14, cert-manager@1.14, velero@1.13. Check maintainer activity and known CVEs.

Service mesh research

Compare Istio, Linkerd, and Cilium service mesh for a 50-microservice cluster: mTLS overhead on p99 latency, observability integration with Prometheus, and operational complexity for a 3-person SRE team.

Chaos engineering approaches

Research chaos engineering approaches for testing database failover: Gremlin vs Chaos Monkey vs AWS Fault Injection Service. Include how to safely test Aurora Global Database promotion and impact on connection pool recovery time.

Tools to power your best work

Web Search icon
Web Search
Web, news, images & maps — one tool

165+ tools.
One conversation.

Everything site reliability engineers need from AI, connected to the assistant you already use. No extra apps, no switching tabs.

Weekly infrastructure health check

Every week, verify SSL/DNS health, audit security headers, and scan infrastructure packages for new CVEs.

1
DNS & Domain icon
DNS & Domain
Check SSL expiry and DNS records for all production domains
2
Security Scanner icon
Security Scanner
Audit production endpoints for security header compliance
3
Vulnerability Database icon
Vulnerability Database
Check Kubernetes and infrastructure packages for new CVEs

Node pool upgrade preparation

Before upgrading a Kubernetes node pool, check for CVEs in the new version, validate supply chain health, and update runbook diagrams.

1
Vulnerability Database icon
Vulnerability Database
Check new Kubernetes and container runtime versions for CVEs
2
Supply Chain Risk icon
Supply Chain Risk
Audit any new operators being added alongside the upgrade
3
Diagram Generator icon
Diagram Generator
Update cluster architecture diagram with new node pool config

Post-incident documentation

After an incident is resolved, generate the timeline diagram, research reliability improvements, and document the runbook changes.

1
Diagram Generator icon
Diagram Generator
Generate incident timeline and failure mode diagram
2
Deep Research icon
Deep Research
Research patterns to prevent this class of failure
3
Vulnerability Database icon
Vulnerability Database
Check if the incident was caused by a known CVE

Frequently Asked Questions

How does DNS & Domain help with SRE work beyond just checking certificate expiry?

DNS & Domain checks A records, CNAME targets, MX records, WHOIS registration expiry, and SSL/TLS certificate details in one pass. For SREs, dangling CNAMEs (pointing to deleted load balancers) and short HSTS max-age values are common issues it surfaces that SSL monitoring alone misses.

Can the Vulnerability Database alert on new CVEs for packages we are already running?

The Vulnerability Database is a lookup tool — you query it on demand. For continuous CVE monitoring, run it as a weekly automated check against your pinned package versions. For real-time alerts, integrate with GitHub Dependabot or Snyk separately.

Does Diagram Generator support sequence diagrams for incident timelines?

Yes. Specify the incident sequence in your prompt — "event A happened, then system B failed, SRE paged at T+5, remediation at T+22" — and Diagram Generator produces a PlantUML or Mermaid sequence diagram with actor annotations and timeline markers.

Does Deep Research have access to the Google SRE book content?

Deep Research synthesizes from web-accessible sources. The Google SRE book is available at sre.google/sre-book and its content is widely referenced in engineering blogs and articles that Deep Research can find and synthesize. For direct chapter citations, specify the topic and it will find the relevant SRE principles.

What does Supply Chain Risk check for that regular CVE scanning misses?

Supply Chain Risk looks for anomalies beyond known CVEs — sudden maintainer changes, unusual recent release patterns, dependency confusion attacks, and abandoned packages that may be squatted. These supply chain vectors are increasingly common in infrastructure tooling and are not covered by the CVE database.

More AI tools by profession

Give your AI superpowers.

Get started for free

Works in Chat, Cowork and Code