19.4 Postmortems: writing a useful incident report


Goal: convert an incident into learning

A good postmortem is not paperwork. It’s how you prevent recurrence and improve systems over time.

The goal is to capture:

  • what happened (timeline),
  • impact,
  • root cause and contributing factors,
  • what worked and what didn’t,
  • follow-ups that reduce future risk.

Postmortems are a reliability feature

Teams that write good postmortems get faster and calmer over time. Teams that skip them tend to repeat the same incidents with different symptoms.

Postmortem principles (blameless, specific, actionable, truthful)

  • Blameless: focus on system causes, not individual fault.
  • Specific: concrete times, metrics, diffs, and outcomes.
  • Actionable: follow-ups have owners and deadlines.
  • Truthful: include uncertainty where it exists; don’t invent a narrative.

Incident report template

Incident title:

Summary (3–6 sentences):

Impact:
- Who was affected:
- What was affected:
- Duration:
- Severity:

Timeline (UTC):
- T0: detection
- ...
- Resolution

Root cause:

Contributing factors:
- ...

Detection:
- How did we notice?
- Which alerts/logs/metrics?

Resolution:
- What changed?
- Verification steps:

What went well:
- ...

What went poorly:
- ...

Action items:
- [ ] Action (owner, due date)
- [ ] ...
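The "Timeline (UTC)" and "Duration" fields are easy to get wrong when notes come from people in different time zones. The following is a minimal sketch, assuming each note carries an ISO 8601 timestamp with an explicit offset. The helper names (`to_utc`, `build_timeline`) and the sample events are hypothetical, and the computed duration is only first-to-last recorded event, which may be shorter than the real impact window.

```python
from datetime import datetime, timedelta, timezone


def to_utc(stamp: str) -> datetime:
    """Parse an ISO 8601 timestamp that includes an offset and convert it to UTC."""
    parsed = datetime.fromisoformat(stamp)
    if parsed.tzinfo is None:
        # Refuse to guess: a naive timestamp is an unknown, not a fact.
        raise ValueError(f"timestamp has no UTC offset: {stamp}")
    return parsed.astimezone(timezone.utc)


def build_timeline(events):
    """Sort (timestamp, description) pairs by UTC time.

    Returns the formatted timeline lines and the span between the
    first and last recorded event.
    """
    normalized = sorted((to_utc(stamp), text) for stamp, text in events)
    lines = [f"- {ts:%Y-%m-%d %H:%M} UTC: {text}" for ts, text in normalized]
    span = normalized[-1][0] - normalized[0][0]
    return lines, span


# Hypothetical notes collected from three people in three time zones.
events = [
    ("2024-03-05T10:42:00+01:00", "rollback deployed, error rate recovers"),
    ("2024-03-05T09:15:00+00:00", "alert fires: elevated 5xx rate"),
    ("2024-03-05T04:30:00-05:00", "on-call engineer acknowledges"),
]

lines, span = build_timeline(events)
print("\n".join(lines))
print(f"Detection to resolution: {span}")
```

Rejecting timestamps without an offset is a deliberate choice here: it keeps the timeline from silently mixing local times, at the cost of some manual cleanup.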

How to use the model to draft (safely)

LLMs are useful for drafting postmortems because they can organize messy notes quickly. Use them safely:

  • paste redacted notes (no secrets, no PII)
  • demand that the model flag uncertainty by writing “unknown”
  • require a clear separation between facts and hypotheses
  • review carefully: the model may invent a clean narrative
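A first redaction pass over the notes can be scripted before a human review. The sketch below is illustrative only: the patterns and the `redact` helper are assumptions, they cover just a few obvious shapes (email addresses, IPv4 addresses, bearer tokens, `key=value` secrets), and they will miss things such as names or customer IDs. Treat it as a safety net, not a replacement for reading the notes before pasting them.

```python
import re

# Illustrative patterns only; a real list should be reviewed for your own data.
REDACTION_PATTERNS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
    (re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/-]+=*"), "Bearer [TOKEN]"),
    (re.compile(r"(?i)\b(password|api[_-]?key|secret)\s*[=:]\s*\S+"), r"\1=[REDACTED]"),
]


def redact(text: str) -> str:
    """Apply each pattern in order and return the redacted text."""
    for pattern, replacement in REDACTION_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


# Hypothetical raw note from an incident channel.
raw = "alice@example.com saw 500s from 10.0.0.12; api_key=abc123 in logs"
print(redact(raw))
```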

Copy-paste prompt

Draft a postmortem using the template below.

Rules:
- Use only the facts I provide.
- If something is unknown, mark it as unknown.
- Do not invent details to make a nicer story.

Facts/notes (redacted):
...

Template:
(paste template)

Models like tidy stories

Incidents are messy. A model may invent missing steps to “complete” the story. Force it to mark unknowns instead.
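One way to make sure the rules travel with every request is to assemble the prompt in code rather than retyping it. This is a minimal sketch that only builds the prompt text from the copy-paste prompt above. The function name `build_postmortem_prompt` is hypothetical, and sending the prompt to a model is deliberately left out because that depends on the tool you use.

```python
RULES = (
    "Use only the facts I provide.",
    "If something is unknown, mark it as unknown.",
    "Do not invent details to make a nicer story.",
)


def build_postmortem_prompt(redacted_notes: str, template: str) -> str:
    """Combine the fixed rules, the redacted notes, and the template into one prompt."""
    if not redacted_notes.strip():
        # With no facts, the model can only invent a story.
        raise ValueError("no notes provided")
    rules = "\n".join(f"- {rule}" for rule in RULES)
    return (
        "Draft a postmortem using the template below.\n\n"
        f"Rules:\n{rules}\n\n"
        f"Facts/notes (redacted):\n{redacted_notes.strip()}\n\n"
        f"Template:\n{template.strip()}\n"
    )


# Hypothetical inputs; in practice, pass the full redacted notes and template.
prompt = build_postmortem_prompt("09:15 UTC alert fired", "Incident title:")
print(prompt)
```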

Follow-up actions that actually prevent repeats

High-leverage action categories:

  • Tests: regression tests and characterization tests
  • Validation: input validation and schema validation
  • Observability: add missing logs/metrics and alert thresholds
  • Rate limiting/caching: reduce retry storms and overload
  • Runbooks: document “how to diagnose this class of failure”

Each action item should name a concrete change, an owner, and a due date.
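Ownership and deadlines can be checked mechanically before a postmortem is published. The sketch below assumes action items follow the template's checklist shape with the due date written as `YYYY-MM-DD`. The date format, the `lint_action_items` name, and the sample items are assumptions made for illustration.

```python
import re
from datetime import date

# Matches lines such as: - [ ] Action (owner, 2024-04-01)
ACTION_ITEM = re.compile(
    r"^- \[(?P<done>[ xX])\] (?P<action>.+?) "
    r"\((?P<owner>[^,()]+),\s*(?P<due>\d{4}-\d{2}-\d{2})\)\s*$"
)


def lint_action_items(lines):
    """Return (line, problem) pairs for checklist lines lacking an owner or a valid due date."""
    problems = []
    for line in lines:
        if not line.startswith("- ["):
            continue  # not a checklist line, ignore it
        match = ACTION_ITEM.match(line)
        if match is None:
            problems.append((line, "expected '- [ ] Action (owner, YYYY-MM-DD)'"))
            continue
        try:
            date.fromisoformat(match["due"])
        except ValueError:
            problems.append((line, "due date is not a real calendar date"))
    return problems


# Hypothetical action items from a draft report.
items = [
    "- [ ] Add regression test for empty payload (dana, 2024-04-01)",
    "- [ ] Improve monitoring",
    "- [ ] Write runbook for queue backlog (lee, 2024-02-30)",
]

for line, problem in lint_action_items(items):
    print(f"{problem}: {line}")
```

A check like this catches missing owners and dates, but not vagueness: "Improve monitoring (dana, 2024-04-01)" would pass, so a reviewer still has to judge whether the action is concrete.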

Where to go next