44.4 Incident handling for AI mistakes

This section covers how to classify AI incidents by severity, a response runbook for each severity level, and a post-mortem template.

Incident Types

| Severity | Example | Response Time |
|----------|---------|---------------|
| P0 | PII leaked, harmful content | 15 min |
| P1 | Wrong financial advice, refund promises | 1 hour |
| P2 | Incorrect but not harmful responses | 24 hours |
| P3 | Style issues, minor inaccuracies | Sprint backlog |
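The severity levels and response times above could be encoded so that tooling can compute a response deadline for each report. This is a minimal sketch, not a prescribed implementation: the names `Severity`, `RESPONSE_TARGET`, and `response_deadline` are hypothetical, and P3 is modeled as having no clock because it goes to the sprint backlog.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Severity(Enum):
    P0 = "P0"  # PII leaked, harmful content
    P1 = "P1"  # wrong financial advice, refund promises
    P2 = "P2"  # incorrect but not harmful responses
    P3 = "P3"  # style issues, minor inaccuracies


# Response targets copied from the severity table.
# P3 has no fixed target: it is handled through the sprint backlog.
RESPONSE_TARGET = {
    Severity.P0: timedelta(minutes=15),
    Severity.P1: timedelta(hours=1),
    Severity.P2: timedelta(hours=24),
    Severity.P3: None,
}


def response_deadline(severity: Severity, reported_at: datetime) -> Optional[datetime]:
    """Return the latest time a first response is due, or None for backlog items."""
    target = RESPONSE_TARGET[severity]
    if target is None:
        return None
    return reported_at + target
```

For example, `response_deadline(Severity.P0, datetime(2024, 1, 15, 12, 0))` gives a deadline 15 minutes after the report, while a P3 report returns `None`.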

Runbook

## AI Incident Response Runbook

### P0: Critical (Safety/Legal)
1. Immediately disable the AI feature
2. Page on-call + engineering lead + legal
3. Collect evidence (logs, user report)
4. Assess blast radius (how many users affected?)
5. Prepare customer communication
6. Fix and validate before re-enabling

### P1: High (Business Impact)
1. Assess if rollback is needed
2. Page on-call + product owner
3. Document affected interactions
4. Deploy fix or rollback
5. Follow up with affected users

### P2-P3: Standard Flow
1. Create ticket with evidence (for P2, within the 24-hour response time)
2. Prioritize in backlog
3. Fix in next sprint
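One way to make the runbook actionable is to store the steps as data and automate the only step that cannot wait: disabling the AI feature for a P0. The sketch below assumes a hypothetical in-memory `FeatureFlags` store and a hypothetical flag name `ai_assistant`; a real deployment would call its own feature-flag or configuration service instead.

```python
from dataclasses import dataclass, field

# Runbook steps copied from the runbook text, keyed by severity.
STANDARD_FLOW = [
    "Create ticket with evidence",
    "Prioritize in backlog",
    "Fix in next sprint",
]

RUNBOOK_STEPS = {
    "P0": [
        "Immediately disable the AI feature",
        "Page on-call + engineering lead + legal",
        "Collect evidence (logs, user report)",
        "Assess blast radius (how many users affected?)",
        "Prepare customer communication",
        "Fix and validate before re-enabling",
    ],
    "P1": [
        "Assess if rollback is needed",
        "Page on-call + product owner",
        "Document affected interactions",
        "Deploy fix or rollback",
        "Follow up with affected users",
    ],
    "P2": STANDARD_FLOW,
    "P3": STANDARD_FLOW,
}


@dataclass
class FeatureFlags:
    """Hypothetical in-memory flag store, standing in for a real flag service."""
    enabled: dict = field(default_factory=lambda: {"ai_assistant": True})

    def disable(self, name: str) -> None:
        self.enabled[name] = False


def open_incident(severity: str, flags: FeatureFlags, feature: str = "ai_assistant") -> list:
    """Return the checklist for this severity; for P0, turn the feature off first."""
    if severity not in RUNBOOK_STEPS:
        raise ValueError(f"unknown severity: {severity}")
    if severity == "P0":
        # P0 step 1 is automated so it never waits on a human.
        flags.disable(feature)
    return list(RUNBOOK_STEPS[severity])
```

Only the P0 kill switch is automated here. For P1 the first step is a judgment call about rollback, so the function returns the checklist and leaves the flag untouched.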

Post-Mortem

## Post-Mortem Template

**Incident:** AI promised unauthorized refund
**Date:** 2024-01-15
**Duration:** 2 hours
**Users affected:** 47

### What happened?
A prompt change removed the "$50 refund limit" rule, and the AI then promised refunds it was not authorized to give.

### Root cause
PR merged without eval suite run.

### What we'll fix
1. Mandatory eval run before merge
2. Add test case for refund limits
3. Add specific safety rule check
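The refund-limit test case and safety rule check could start as a simple output check like the one below. It is a rough sketch under stated assumptions: the function names are hypothetical, the $50 limit comes from the rule named in the incident, and the check flags any dollar amount above the limit in a response rather than detecting refunds specifically, so a production check would need to be more precise.

```python
import re

REFUND_LIMIT_USD = 50.0  # the "$50 refund limit" rule from the incident

# Matches dollar amounts such as "$50", "$ 75.00", or "$1,200.50".
_AMOUNT = re.compile(r"\$\s?(\d[\d,]*(?:\.\d+)?)")


def dollar_amounts(text: str) -> list:
    """Extract every dollar amount mentioned in the text as a float."""
    return [float(match.replace(",", "")) for match in _AMOUNT.findall(text)]


def violates_refund_limit(response: str, limit: float = REFUND_LIMIT_USD) -> bool:
    """Return True if the response mentions any dollar amount above the limit."""
    return any(amount > limit for amount in dollar_amounts(response))


def test_refund_limit_cases():
    # Regression cases for the incident: a promise above $50 must be flagged.
    assert violates_refund_limit("I've issued a full refund of $200 to your card.")
    assert not violates_refund_limit("I can refund up to $50 for this order.")
    assert not violates_refund_limit("Sorry, I can't help with that.")
```

Run against model responses to refund-related prompts in the eval suite, a check like this would fail the build when a prompt change drops the limit, which is the gap the root cause describes.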
