# Agent Runbooks and Incident Response

Status: public · Confidence: medium (0.83) · Basis: verified_sources

## TL;DR

Runbooks and incident response guides are among the sources an agent consults most often during operations, because they define what to inspect, who owns the response, and which actions are safe during an outage.

## Core Explanation

An agent handling an operational incident should not improvise from logs alone. Runbooks encode known checks, escalation paths, rollback steps, and service-specific constraints. Incident response guidance adds structure around roles, communication, impact assessment, and post-incident learning.
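As a rough illustration of what "encoding" these items could look like, the sketch below models a runbook as structured data that an agent can query instead of parsing free text. The schema (`Runbook`, `RunbookStep`, `next_escalation`) and the example service and contact names are hypothetical and do not come from any of the cited sources.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class RunbookStep:
    """One known check or remediation step (hypothetical schema)."""
    description: str
    destructive: bool = False  # True if the step changes or deletes state


@dataclass(frozen=True)
class Runbook:
    """Illustrative runbook record; all field names are assumptions."""
    service: str
    checks: Tuple[RunbookStep, ...]
    rollback: Tuple[RunbookStep, ...]
    escalation_path: Tuple[str, ...]  # ordered contacts, first responder first
    constraints: Tuple[str, ...] = ()  # service-specific limits, as plain text


def next_escalation(runbook: Runbook, already_paged: Set[str]) -> Optional[str]:
    """Return the first contact on the escalation path not yet paged."""
    for contact in runbook.escalation_path:
        if contact not in already_paged:
            return contact
    return None  # escalation path exhausted


# Example record with made-up values.
example = Runbook(
    service="checkout-api",
    checks=(RunbookStep("Check the error-rate dashboard"),),
    rollback=(RunbookStep("Roll back to the previous release", destructive=True),),
    escalation_path=("primary-oncall", "secondary-oncall", "service-owner"),
    constraints=("Do not restart during settlement window",),
)
```

Marking each step as `destructive` or not at authoring time is one possible design choice: it lets the agent decide when approval is needed without having to infer risk from the step's wording.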

The important engineering boundary is authority. A runbook can recommend an action, but it does not authorize one. Before acting, the agent should verify current service state, confirm that it has the required permissions, preserve evidence, and avoid destructive remediation without approval.
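The checks in this paragraph can be sketched as a small gate that runs before any runbook action is executed. This is a minimal sketch under stated assumptions, not a definitive implementation: `ProposedAction`, `Context`, `Decision`, and `evaluate` are hypothetical names, and a real system would obtain state, permissions, and approvals from its own tooling.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional, Tuple


class Decision(Enum):
    PROCEED = "proceed"
    NEEDS_APPROVAL = "needs_approval"
    BLOCKED = "blocked"


@dataclass(frozen=True)
class ProposedAction:
    """An action recommended by a runbook (hypothetical schema)."""
    name: str
    destructive: bool
    expected_state: str  # service state the runbook step assumes


@dataclass(frozen=True)
class Context:
    """What the agent has verified right now (hypothetical schema)."""
    observed_state: str
    agent_permissions: FrozenSet[str]
    evidence_saved: bool
    approved_by: Optional[str] = None  # human approver, if any


def evaluate(action: ProposedAction, ctx: Context) -> Tuple[Decision, str]:
    """Apply the four checks in order and return a decision with a reason."""
    # 1. Verify current service state against what the runbook assumes.
    if ctx.observed_state != action.expected_state:
        return Decision.BLOCKED, "observed state does not match runbook assumption"
    # 2. Confirm the agent is permitted to perform this action.
    if action.name not in ctx.agent_permissions:
        return Decision.BLOCKED, "agent lacks permission for this action"
    # 3. Preserve evidence (logs, snapshots) before changing anything.
    if not ctx.evidence_saved:
        return Decision.BLOCKED, "evidence has not been preserved"
    # 4. Destructive remediation requires explicit approval.
    if action.destructive and ctx.approved_by is None:
        return Decision.NEEDS_APPROVAL, "destructive action requires approval"
    return Decision.PROCEED, "all checks passed"
```

For example, a destructive rollback with matching state, valid permissions, and saved evidence still returns `NEEDS_APPROVAL` until `approved_by` is set. Returning a reason string alongside the decision keeps the outcome explainable in the incident record.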

## Source-Mapped Facts

- Google SRE documentation says clear role assignment during incidents helps responders avoid duplicated work and missed responsibilities. ([source](https://sre.google/sre-book/managing-incidents/))
- NIST Special Publication 800-61 Revision 2 provides guidance for computer security incident handling. ([source](https://csrc.nist.gov/pubs/sp/800/61/r2/final))
- Atlassian incident response documentation describes incident response as a process for identifying, investigating, and resolving incidents. ([source](https://www.atlassian.com/incident-management/incident-response))

## Further Reading

- [Google SRE Managing Incidents](https://sre.google/sre-book/managing-incidents/)
- [NIST Computer Security Incident Handling Guide](https://csrc.nist.gov/pubs/sp/800/61/r2/final)
- [Atlassian Incident Response](https://www.atlassian.com/incident-management/incident-response)