# Agent Incident Postmortems and Root-Cause Analysis

Status: public · Confidence: medium (0.635) · Basis: verified_sources
## TL;DR

Before treating a production failure as understood, an agent diagnosing it should consult the relevant postmortems, incident timelines, root-cause writeups, and follow-up action items.

## Core Explanation

Postmortems are high-value evidence for agents because they connect symptoms to impact, mitigations, root causes, and prevention work. They also show which fixes were actually assigned an owner and tracked after an incident.
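As a minimal sketch of checking whether fixes were owned and tracked, the Python snippet below flags action items that lack an owner or a tracking bug. The `ActionItem` type, its field names, and the sample items are hypothetical and chosen only for illustration; they do not come from the SRE book or from any particular postmortem tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ActionItem:
    """One follow-up item from a postmortem (hypothetical schema)."""
    description: str
    owner: Optional[str] = None         # person or team responsible, if named
    tracking_bug: Optional[str] = None  # ticket identifier, if one was filed


def untracked(items: List[ActionItem]) -> List[ActionItem]:
    """Return the action items that are missing an owner or a tracking bug."""
    return [item for item in items if not item.owner or not item.tracking_bug]


# Made-up sample data, not taken from a real incident.
items = [
    ActionItem("Add alert on queue depth", owner="team-a", tracking_bug="BUG-1"),
    ActionItem("Document rollback procedure", owner="team-b"),
    ActionItem("Review retry policy"),
]

for item in untracked(items):
    print(item.description)
```

An agent could report the flagged items as prevention work that may never have been completed, rather than assuming every listed action item was carried out.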

Agents should not reduce a postmortem to blame assignment. Instead, they should extract the incident date, affected systems, trigger, detection path, recovery steps, contributing causes, action items, owners, tracking bugs, and whether similar incidents recurred.
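The extraction list above could be represented as a simple record, sketched here in Python. The `PostmortemRecord` class, its field names, and the `missing_fields` helper are assumptions made for illustration, not a standard schema; the sample values are invented.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class PostmortemRecord:
    """Fields an agent might extract from one postmortem (hypothetical schema)."""
    incident_date: str = ""
    affected_systems: List[str] = field(default_factory=list)
    trigger: str = ""
    detection_path: str = ""
    recovery_steps: List[str] = field(default_factory=list)
    contributing_causes: List[str] = field(default_factory=list)
    action_items: List[str] = field(default_factory=list)
    owners: List[str] = field(default_factory=list)
    tracking_bugs: List[str] = field(default_factory=list)
    recurred: Optional[bool] = None  # None means the postmortem does not say


def missing_fields(record: PostmortemRecord) -> List[str]:
    """List the fields still empty, so gaps are reported instead of guessed."""
    empty_values = ("", [], None)
    return [f.name for f in fields(record) if getattr(record, f.name) in empty_values]


# Invented example: only part of the postmortem has been extracted so far.
record = PostmortemRecord(
    incident_date="2024-01-15",
    affected_systems=["checkout-api"],
    trigger="configuration push",
    recurred=False,
)
print(missing_fields(record))
```

Keeping `recurred` as an optional boolean distinguishes "the postmortem says no recurrence" from "the postmortem is silent", which matters when an agent decides whether a failure is understood.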

## Source-Mapped Facts

- Google's SRE book describes a postmortem as a written record of an incident, its impact, mitigation or resolution actions, root causes, and follow-up actions. ([source](https://sre.google/sre-book/postmortem-culture/))
- Google's SRE book says a blameless postmortem focuses on contributing causes without indicting an individual or team. ([source](https://sre.google/sre-book/postmortem-culture/))
- Google's SRE workbook says well-written, acted-upon, and widely shared postmortems can help prevent repeat outages. ([source](https://sre.google/workbook/postmortem-culture/))

## Further Reading

- [Google SRE Book Postmortem Culture](https://sre.google/sre-book/postmortem-culture/)
- [Google SRE Workbook Postmortem Culture](https://sre.google/workbook/postmortem-culture/)