AI Tools for Site Reliability Engineers

Q: How does DNS & Domain help with SRE work beyond just checking certificate expiry?

DNS & Domain checks A records, CNAME targets, MX records, WHOIS registration expiry, and SSL/TLS certificate details in one pass. For SREs, dangling CNAMEs (pointing to deleted load balancers) and short HSTS max-age values are common issues it surfaces that SSL monitoring alone misses.
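
As a rough illustration of what a dangling-CNAME check involves, the sketch below flags records whose CNAME target no longer resolves. It is not how DNS & Domain is implemented: `find_dangling`, `system_resolves`, and the `resolver` parameter are hypothetical names, and the record mapping is assumed to come from your own zone export.

```python
import socket
from typing import Callable, Dict, List


def system_resolves(hostname: str) -> bool:
    """Return True if the OS resolver can resolve hostname to an address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def find_dangling(cnames: Dict[str, str],
                  resolver: Callable[[str], bool] = system_resolves) -> List[str]:
    """Return record names whose CNAME target no longer resolves.

    cnames maps a record name to its CNAME target. The mapping is assumed
    to come from your own zone export; nothing here queries DNS for it.
    """
    return [name for name, target in cnames.items() if not resolver(target)]
```

A target that fails to resolve is only one signal. Some deleted cloud resources keep resolving for a while, so a clean result from this sketch does not rule out a dangling record.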

Q: Can the Vulnerability Database alert on new CVEs for packages we are already running?

The Vulnerability Database is a lookup tool: you query it on demand. For recurring CVE monitoring, schedule a weekly automated check against your pinned package versions. For real-time alerts, integrate separately with GitHub Dependabot or Snyk.
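
The bookkeeping for a weekly check can be sketched as below: parse the pinned versions, run a lookup for each, and report only the CVE IDs that earlier runs have not already recorded. The `lookup` callable is a placeholder for however you query the Vulnerability Database; its real interface is not described here, and `parse_pin` and `new_cves` are hypothetical helper names.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple


def parse_pin(spec: str) -> Tuple[str, str]:
    """Split a 'name@version' pin into (name, version)."""
    name, sep, version = spec.rpartition("@")
    if not sep or not name or not version:
        raise ValueError(f"expected name@version, got {spec!r}")
    return name, version


def new_cves(pins: Iterable[str],
             lookup: Callable[[str, str], Iterable[str]],
             already_seen: Set[str]) -> Dict[str, List[str]]:
    """Return CVE IDs per pin that were not reported in earlier runs.

    lookup(name, version) is a placeholder for your own on-demand query;
    already_seen holds CVE IDs recorded by previous weekly runs.
    """
    fresh: Dict[str, List[str]] = {}
    for spec in pins:
        name, version = parse_pin(spec)
        unseen = sorted(set(lookup(name, version)) - already_seen)
        if unseen:
            fresh[spec] = unseen
    return fresh
```

Persisting `already_seen` between runs (a file or a small table) is left out; the point is that a scheduled lookup plus a diff gives you "new since last week" without a streaming alert feed.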

Q: Does Diagram Generator support sequence diagrams for incident timelines?

Yes. Specify the incident sequence in your prompt — "event A happened, then system B failed, SRE paged at T+5, remediation at T+22" — and Diagram Generator produces a PlantUML or Mermaid sequence diagram with actor annotations and timeline markers.
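
For a sense of what the Mermaid output looks like, the sketch below turns a list of timestamped events into `sequenceDiagram` source. It is an illustration only, not Diagram Generator's actual output format: the `Event` tuple shape and the `incident_sequence` name are assumptions, and actor names are assumed to be single tokens without spaces.

```python
from typing import Iterable, List, Tuple

Event = Tuple[int, str, str, str]  # (minutes after start, sender, receiver, message)


def incident_sequence(events: Iterable[Event]) -> str:
    """Render incident events as Mermaid sequenceDiagram source text."""
    ordered = sorted(events, key=lambda e: e[0])
    participants: List[str] = []
    # Declare actors in order of first appearance so the columns read left to right.
    for _, sender, receiver, _ in ordered:
        for actor in (sender, receiver):
            if actor not in participants:
                participants.append(actor)
    lines = ["sequenceDiagram"]
    lines += [f"    participant {p}" for p in participants]
    lines += [f"    {s}->>{r}: T+{t} {msg}" for t, s, r, msg in ordered]
    return "\n".join(lines)
```

Calling `incident_sequence([(0, "ServiceA", "SystemB", "request fails"), (5, "Alerting", "SRE", "page on-call")])` yields text you can paste into any Mermaid renderer.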

Q: Does Deep Research have access to the Google SRE book content?

Deep Research synthesizes from web-accessible sources. The Google SRE book is available at sre.google/sre-book and its content is widely referenced in engineering blogs and articles that Deep Research can find and synthesize. For direct chapter citations, specify the topic and it will find the relevant SRE principles.

Q: What does Supply Chain Risk check for that regular CVE scanning misses?

Supply Chain Risk looks for anomalies beyond known CVEs — sudden maintainer changes, unusual recent release patterns, dependency confusion attacks, and abandoned packages that may be squatted. These supply chain vectors are increasingly common in infrastructure tooling and are not covered by the CVE database.
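
Two of these signals, abandonment and maintainer changes, can be approximated with simple heuristics. The sketch below is a hypothetical illustration, not Supply Chain Risk's method: `PackageSnapshot` is an invented structure for metadata you would collect yourself, and the 365-day staleness cutoff is an arbitrary assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, List


@dataclass(frozen=True)
class PackageSnapshot:
    """Hypothetical metadata you would collect yourself from a registry."""
    name: str
    last_release: date
    maintainers: FrozenSet[str]
    previous_maintainers: FrozenSet[str]


def risk_signals(pkg: PackageSnapshot, today: date,
                 stale_after_days: int = 365) -> List[str]:
    """Return heuristic warnings; an empty list does not prove safety."""
    signals: List[str] = []
    if (today - pkg.last_release).days > stale_after_days:
        signals.append("possibly abandoned")
    # No overlap between the old and new maintainer sets suggests a handover.
    if pkg.previous_maintainers and not (pkg.maintainers & pkg.previous_maintainers):
        signals.append("complete maintainer turnover")
    return signals
```

Neither signal is conclusive on its own. Stable projects can go a year without a release, and legitimate handovers happen, so treat the output as a prompt for manual review.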

AI tools that help SREs track CVEs, audit endpoint security, generate incident runbooks, and research SLO and reliability patterns.

DNS & Domain

api.myapp.com

SSL 87 days remaining · DNS healthy

grpc.myapp.com

SSL expires in 9 days — URGENT: renew before freeze

ws.myapp.com

SSL 45 days remaining · DNS healthy

internal-api.myapp.com

CNAME points to deleted load balancer — dangling DNS

Service endpoint health monitoring

Audit SSL certificate expiry, DNS record accuracy, and HTTP response health across your entire service mesh in a single query. Catch a certificate expiring during a holiday freeze or a dangling CNAME from a decommissioned service before they cause an outage.

DNS & Domain

Check SSL expiry and DNS health for all production endpoints: api.myapp.com, grpc.myapp.com, ws.myapp.com, internal-api.myapp.com, metrics.myapp.com.

api.myapp.com: SSL 87 days, healthy. grpc.myapp.com: SSL expires in 9 days (URGENT — renew before freeze). ws.myapp.com: healthy, 45 days. internal-api.myapp.com: CNAME points to old-internal.elb.amazonaws.com (load balancer deleted — dangling DNS). metrics.myapp.com: SSL healthy 120 days. Actions: renew grpc cert, remove dangling CNAME.

ToolRouter check_dns

api.myapp.com

SSL 87 days · DNS healthy

grpc.myapp.com

SSL expires 9 days — URGENT: renew before freeze

internal-api.myapp.com

CNAME points to deleted load balancer — dangling DNS

metrics.myapp.com

SSL 120 days · DNS healthy
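
The "days remaining" figures above come down to simple date arithmetic on the certificate's notAfter field. A minimal sketch, assuming you already have the notAfter string in the format Python's `ssl.SSLSocket.getpeercert()` returns; the function names are hypothetical, and the 21-day urgency threshold is an assumption borrowed from the ready-to-use prompt on this page.

```python
import ssl
from datetime import datetime, timezone


def days_remaining(not_after: str, now: datetime) -> int:
    """Whole days until a certificate's notAfter time.

    not_after uses the format found in ssl.SSLSocket.getpeercert(),
    for example 'Jun  1 12:00:00 2025 GMT'. now must be timezone-aware.
    """
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days


def classify(days: int, urgent_within: int = 21) -> str:
    """Label a certificate using an assumed 21-day urgency threshold."""
    if days < 0:
        return "expired"
    return "urgent" if days <= urgent_within else "healthy"
```

Passing `now` explicitly keeps the function deterministic, which makes it easy to test against a planned freeze date instead of today's date.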

Security posture checks

Audit production endpoints for security header misconfigurations that fail SOC 2 audits — missing HSTS, overly permissive CORS, TLS 1.0 still enabled, and error responses that leak stack traces.

Security Scanner

Audit https://api.myapp.com — check security headers, TLS configuration, CORS policy, and whether 500 errors expose stack traces or framework versions.

Findings: HSTS max-age is 86400 (1 day) — should be 31536000 minimum. CORS: Access-Control-Allow-Origin: * on /api/events. TLS 1.0 enabled (deprecate). 500 error returns {"error": "PrismaClientKnownRequestError: ...at /app/src/..."} — exposes file path and ORM. Priority: fix error handling first (P1), then TLS 1.0, then CORS.

ToolRouter scan_url

HSTS max-age

86400 (1 day) — should be 31536000 minimum

CORS

Access-Control-Allow-Origin: * on /api/events

TLS 1.0

Still enabled — deprecate for SOC 2 compliance

500 error leak

Returns ORM error + file path — fix first (P1)
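
The HSTS and CORS findings above are checks you can reproduce against any captured set of response headers. The sketch below is a simplified illustration rather than Security Scanner's logic: `audit_headers` is a hypothetical name, it covers only those two headers, and it does not test TLS versions or error-response leakage.

```python
import re
from typing import Dict, List

ONE_YEAR = 31536000  # seconds


def audit_headers(headers: Dict[str, str]) -> List[str]:
    """Return findings for HSTS and CORS from a response-header mapping."""
    # Header names are case-insensitive, so normalize before lookup.
    h = {k.lower(): v for k, v in headers.items()}
    findings: List[str] = []

    hsts = h.get("strict-transport-security")
    if hsts is None:
        findings.append("HSTS header missing")
    else:
        match = re.search(r'max-age\s*=\s*"?(\d+)', hsts, re.IGNORECASE)
        if match is None:
            findings.append("HSTS max-age missing")
        elif int(match.group(1)) < ONE_YEAR:
            findings.append(f"HSTS max-age {match.group(1)} is below {ONE_YEAR}")

    if h.get("access-control-allow-origin", "").strip() == "*":
        findings.append("CORS allows any origin")
    return findings
```

Feeding it `{"Strict-Transport-Security": "max-age=86400", "Access-Control-Allow-Origin": "*"}` reports both the short max-age and the wildcard origin.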

Infrastructure CVE triage

Check Kubernetes, container runtime, and node pool packages for critical CVEs before upgrading production infrastructure. A high-severity CVE in containerd or runc can allow container escape.

Vulnerability Database

Check for CVEs in: kubernetes@1.29.3, containerd@1.7.13, runc@1.1.11, kubelet@1.29.3, kube-proxy@1.29.3.

runc@1.1.11: CVE-2024-21626 (CVSS 8.6), container escape via file descriptor leak. Upgrade to 1.1.12. containerd@1.7.13: clean. kubernetes@1.29.3: CVE-2023-5528 (CVSS 7.2), insecure volume mount on Windows nodes (Linux not affected). kube-proxy, kubelet: clean. If running Linux nodes: runc is the critical upgrade. Windows nodes: also patch kubernetes.

ToolRouter search_cves

runc@1.1.11 — CRITICAL

CVE-2024-21626 (CVSS 8.6) — container escape · upgrade to 1.1.12

kubernetes@1.29.3

CVE-2023-5528 (CVSS 7.2) — Windows nodes only · Linux unaffected

containerd@1.7.13

Clean — no critical or high CVEs

kube-proxy, kubelet

Clean — no CVEs above CVSS 5.0
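
When triaging a list of findings like this, it helps to map scores to the standard CVSS v3.x qualitative bands and filter on a threshold. A minimal sketch, assuming findings arrive as (package, CVE ID, score) tuples; that tuple shape and the function names are hypothetical, and the default cutoff of 7.0 mirrors the "Flag anything CVSS 7+" prompt on this page.

```python
from typing import Iterable, List, Tuple


def severity(score: float) -> str:
    """Map a CVSS v3.x base score to its qualitative rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"CVSS score out of range: {score}")
    if score == 0.0:
        return "none"
    if score < 4.0:
        return "low"
    if score < 7.0:
        return "medium"
    if score < 9.0:
        return "high"
    return "critical"


def flag(findings: Iterable[Tuple[str, str, float]], minimum: float = 7.0) -> List[str]:
    """Return 'package: CVE (rating)' lines at or above minimum, worst first."""
    hits = sorted((f for f in findings if f[2] >= minimum), key=lambda f: -f[2])
    return [f"{pkg}: {cve} ({severity(score)})" for pkg, cve, score in hits]
```

Under these bands a score of 8.6 rates as "high", so a "critical" label on such a finding is best read as an upgrade priority rather than a CVSS rating.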

Runbook and incident diagram generation

Generate incident response flowcharts, system architecture diagrams for runbooks, and postmortem timelines from a text description. Get Mermaid diagrams that render in PagerDuty runbook links, Confluence, and GitHub.

Diagram Generator

Create an incident response flowchart for a database latency spike: alert fires at p99>500ms → page SRE on-call → check replica lag → if lag>30s promote replica → redirect read traffic → verify recovery → page off → write postmortem.

Generated Mermaid flowchart with 9 nodes. Conditional branches: lag ≤30s goes to query analysis path, lag >30s goes to replica promotion. Recovery verification shown as loop with 3-minute check interval. Postmortem ticket creation annotated as final step with Jira link placeholder.

ToolRouter render_diagram

Alert

p99 > 500ms fires → page SRE on-call

Check

Query replica lag — if >30s: promote replica

Remediate

Redirect read traffic → verify recovery (3-min check loop)

Page off → open postmortem Jira ticket

Output

9-node Mermaid flowchart with conditional branches — ready to embed
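
For readers who have not seen Mermaid source, the runbook above could be expressed roughly as the text this sketch produces. The node IDs, labels, and the `latency_runbook_flowchart` helper are assumptions made for illustration; the diagram Diagram Generator actually returns may be structured differently.

```python
def latency_runbook_flowchart(lag_threshold_s: int = 30) -> str:
    """Return Mermaid flowchart source for the latency-spike runbook."""
    lines = [
        "flowchart TD",
        '    alert["Alert: p99 > 500ms"] --> page["Page SRE on-call"]',
        # Curly braces make a decision (diamond) node in Mermaid.
        f'    page --> lag{{"Replica lag > {lag_threshold_s}s?"}}',
        '    lag -->|yes| promote["Promote replica"]',
        '    lag -->|no| analyze["Analyze queries"]',
        '    promote --> redirect["Redirect read traffic"]',
        '    redirect --> verify{"Recovered?"}',
        # The self-loop stands in for the 3-minute recheck interval.
        "    verify -->|no| verify",
        '    verify -->|yes| resolve["Page off"]',
        '    resolve --> postmortem["Write postmortem"]',
    ]
    return "\n".join(lines)
```

Printing the result and pasting it into a Mermaid-aware renderer gives a nine-node chart with two decision points. Keeping the threshold as a parameter makes it easy to regenerate the diagram when the runbook changes.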

SLO and reliability pattern research

Research SLO burn rate alerting, error budget policies, and chaos engineering approaches from Google SRE, Stripe, and Netflix engineering blogs. Build evidence-based reliability practices instead of guessing.

Deep Research

Research multi-window burn rate alerting for a 99.9% monthly availability SLO. What alerting windows does Google SRE recommend, and how do I set Prometheus alert thresholds?

Google SRE recommends multi-window burn rate alerts: 1-hour burn rate >14.4x (consumes 2% of a 30-day error budget in 1 hour) and 6-hour burn rate >6x (consumes 5% of the budget in 6 hours). Prometheus: alert on sum(rate(errors[1h]))/sum(rate(requests[1h])) > 14.4 * (1-0.999). Page for 1h window (fast burn), ticket for 6h window (slow burn). Slack notifications for 24h window.

ToolRouter research

1h window (page)

Burn rate >14.4× — consumes 2% error budget/hour

6h window (ticket)

Burn rate >6× — consumes 5% error budget in 6 hours

24h window (Slack)

Burn rate >1× — steady consumption, needs monitoring

Prometheus

sum(rate(errors[1h])) / sum(rate(requests[1h])) > 14.4 * 0.001
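
The percentages and the Prometheus threshold above follow from two short formulas: budget consumed equals burn rate times window length divided by SLO period, and the alerting error ratio equals burn rate times (1 - SLO). A minimal sketch, assuming a 30-day SLO period; the function names are hypothetical.

```python
HOURS_IN_30_DAYS = 30 * 24  # assumes a 30-day SLO window


def budget_consumed(burn_rate: float, window_hours: float,
                    period_hours: float = HOURS_IN_30_DAYS) -> float:
    """Fraction of the error budget spent at burn_rate over window_hours."""
    return burn_rate * window_hours / period_hours


def error_ratio_threshold(burn_rate: float, slo: float) -> float:
    """Error ratio at which the alert fires: burn rate times (1 - SLO)."""
    return burn_rate * (1.0 - slo)
```

With these, `budget_consumed(14.4, 1)` is about 0.02 and `budget_consumed(6, 6)` is 0.05, matching the 2% and 5% figures, while `error_ratio_threshold(14.4, 0.999)` is about 0.0144, the right-hand side of the Prometheus expression. A 1x burn rate over 24 hours works out to roughly 3.3% of the budget.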

Dependency supply chain auditing

Audit the Helm charts, Kubernetes operators, and infrastructure tools your team is adding to the cluster. SREs own the cluster, and an undetected supply chain compromise in an operator running as cluster-admin can be catastrophic.

Supply Chain Risk

Audit these before adding to the cluster: argocd@2.10, external-secrets@0.9, keda@2.14, velero@1.13.

argocd@2.10: maintained by CNCF Argo project, no advisories. external-secrets@0.9: clean, active maintainers. keda@2.14: clean, CNCF project. velero@1.13: CVE-2024-37082 (CVSS 6.5) — backup restoration path traversal. Upgrade to 1.14.0 before deploying. Others safe.

ToolRouter audit_packages

argocd@2.10

Clean — maintained by CNCF Argo project · no advisories

external-secrets@0.9

Clean — active maintainers · no advisories

keda@2.14

Clean — CNCF project · no advisories

velero@1.13

CVE-2024-37082 (CVSS 6.5) — path traversal · upgrade to 1.14.0 first

Ready-to-use prompts

Check all SSL certs

Check SSL expiry and DNS health for these production endpoints: api.myapp.com, grpc.myapp.com, ws.myapp.com, admin.myapp.com, metrics.myapp.com. Flag certs expiring within 21 days and any dangling CNAME records.

Audit API security headers

Audit https://api.myapp.com for security header compliance: HSTS (check max-age value), CORS policy, CSP, X-Content-Type-Options, and whether 500 error responses expose stack traces or framework names.

Kubernetes CVE check

Check these Kubernetes components for CVEs before a node pool upgrade: kubernetes@1.30.0, containerd@1.7.15, runc@1.1.13, coredns@1.11.1, kube-proxy@1.30.0. Flag anything CVSS 7+.

Database incident runbook

Generate a Mermaid incident flowchart for a Postgres primary failure: alert fires → check replication status → if replica lag <30s auto-promote via Patroni → redirect app to new primary → verify connections → write postmortem ticket.

SLO burn rate alerting

Research multi-window burn rate alerting for 99.9% and 99.95% availability SLOs. Include Google SRE recommended window sizes, burn rate thresholds, and Prometheus alerting rule examples.

Audit Kubernetes operators

Audit these Kubernetes operators before cluster installation: argocd@2.10, external-secrets@0.9, keda@2.14, cert-manager@1.14, velero@1.13. Check maintainer activity and known CVEs.

Service mesh research

Compare Istio, Linkerd, and Cilium service mesh for a 50-microservice cluster: mTLS overhead on p99 latency, observability integration with Prometheus, and operational complexity for a 3-person SRE team.

Chaos engineering approaches

Research chaos engineering approaches for testing database failover: Gremlin vs Chaos Monkey vs AWS Fault Injection Service. Include how to safely test Aurora Global Database promotion and impact on connection pool recovery time.

Tools to power your best work

DNS & Domain: DNS, WHOIS, SSL & domain checks

Security Scanner: Scan URLs, IPs, domains and files for threats

Web Search: Web, news, images & maps in one tool

Vulnerability Database: Search CVEs & track new advisories

Deep Research: AI research reports with citations

Diagram Generator: Render Mermaid, PlantUML & more

Supply Chain Risk: Package, dependency & exploit risk

225+ tools.
One conversation.

Everything site reliability engineers need from AI, connected to the assistant you already use. No extra apps, no switching tabs.

Weekly infrastructure health check

Every week, verify SSL/DNS health, audit security headers, and scan infrastructure packages for new CVEs.

DNS & Domain: Check SSL expiry and DNS records for all production domains

Security Scanner: Audit production endpoints for security header compliance

Vulnerability Database: Check Kubernetes and infrastructure packages for new CVEs

Node pool upgrade preparation

Before upgrading a Kubernetes node pool, check for CVEs in the new version, validate supply chain health, and update runbook diagrams.

Vulnerability Database: Check new Kubernetes and container runtime versions for CVEs

Supply Chain Risk: Audit any new operators being added alongside the upgrade

Diagram Generator: Update cluster architecture diagram with new node pool config

Post-incident documentation

After an incident is resolved, generate the timeline diagram, research reliability improvements, and document the runbook changes.

Diagram Generator: Generate incident timeline and failure mode diagram

Deep Research: Research patterns to prevent this class of failure

Vulnerability Database: Check if the incident was caused by a known CVE
