19.5 Preventing recurrence: monitoring and alerts
Goal: detect issues before users do
Prevention is mostly detection + guardrails. Your goal is to know:
- when error rates spike,
- when latency spikes,
- when cost per success increases,
- when invalid outputs or blocks increase,
- when a new prompt version causes regressions.
For LLM apps, the first signals are usually rate limits, timeouts, invalid outputs, and cost spikes from retries or context bloat.
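To make those signals countable, each call should resolve to exactly one outcome category. A minimal sketch in Python; the exception classes and the `validate` hook are placeholders for whatever your SDK and schema layer actually provide:

```python
class RateLimitError(Exception):
    """Placeholder for your SDK's 429/rate-limit error."""

class BlockedError(Exception):
    """Placeholder for a content-filter/safety block."""

def categorize_outcome(call, validate):
    """Run one model call and return its outcome category.

    `call` is a zero-arg wrapper around the SDK request; `validate`
    checks the output against your schema. Both are assumptions here.
    """
    try:
        output = call()
    except TimeoutError:
        return "timeout"
    except RateLimitError:
        return "rate_limit"
    except BlockedError:
        return "blocked"
    return "ok" if validate(output) else "invalid_output"
```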
What to monitor (LLM app edition)
High-leverage metrics:
- success rate: % ok outcomes
- error rate by category: timeout vs rate_limit vs invalid_output vs blocked
- latency percentiles: p50/p95 for model calls and end-to-end request
- calls per success: how many attempts per successful result
- tokens per success: cost proxy and context bloat detector
- prompt version distribution: which versions are producing failures
Also track tool-call metrics if you use tools (Part V Section 16): tool error rate and tool latency.
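One way to wire these up is labeled counters and histograms. The sketch below uses the prometheus_client library; the metric names and label sets are our own choices, not a standard:

```python
from prometheus_client import Counter, Histogram

# Outcomes labeled by category and prompt version, so error rate can be
# sliced by both dimensions from a single counter.
OUTCOMES = Counter(
    "llm_outcomes_total",
    "Model call outcomes",
    ["category", "prompt_version"],
)

# Model-call latency; p50/p95 come from the bucketed histogram.
LATENCY = Histogram(
    "llm_call_seconds",
    "Model call latency in seconds",
    ["prompt_version"],
)

# Token usage, split by direction.
TOKENS = Counter(
    "llm_tokens_total",
    "Tokens consumed",
    ["direction", "prompt_version"],  # direction: input | output
)

def record(category, prompt_version, seconds, tokens_in, tokens_out):
    OUTCOMES.labels(category, prompt_version).inc()
    LATENCY.labels(prompt_version).observe(seconds)
    TOKENS.labels("input", prompt_version).inc(tokens_in)
    TOKENS.labels("output", prompt_version).inc(tokens_out)
```

Ratios such as calls per success and tokens per success are then derived at query time (total outcomes, or total tokens, divided by ok outcomes) rather than tracked as separate metrics.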
Alert design (signal, not noise)
Good alerts are:
- actionable (“do X now”),
- rare (don’t spam),
- tied to user impact.
Examples of useful alerts:
- error rate > baseline for 5 minutes
- p95 latency > threshold for 10 minutes
- invalid_output rate spikes after prompt deployment
- tokens per success doubles (likely context/retry issue)
High traffic is not an incident. Alert on error rate, latency, and outcome categories.
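"Error rate above baseline for 5 minutes" is usually expressed as a rule in your monitoring system, but the logic is simple enough to sketch in-process; the baseline and window values here are illustrative:

```python
import time
from collections import deque

class ErrorRateAlert:
    """Fire when error rate stays above baseline for a sustained window."""

    def __init__(self, baseline=0.02, window_s=300):
        self.baseline = baseline   # acceptable error fraction
        self.window_s = window_s   # rolling window and sustain duration
        self.events = deque()      # (timestamp, is_error)
        self.breach_start = None

    def observe(self, is_error, now=None):
        now = now or time.time()
        self.events.append((now, is_error))
        # Drop events older than the window.
        while self.events and self.events[0][0] < now - self.window_s:
            self.events.popleft()
        rate = sum(1 for _, e in self.events if e) / len(self.events)
        if rate > self.baseline:
            self.breach_start = self.breach_start or now
            # Alert only once the breach has lasted the whole window.
            return now - self.breach_start >= self.window_s
        self.breach_start = None
        return False
```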
Dashboards and runbooks
Dashboards answer “what’s happening.” Runbooks answer “what do we do next.”
For LLM apps, a minimal runbook should include:
- how to distinguish rate limiting from timeouts from auth failures
- where to find prompt version and schema version
- how to roll back to a previous prompt version (sketched after this list)
- how to enable safe debug logging temporarily
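Rollback is easiest when prompts are versioned data and the active version is a single pointer you can move. A hypothetical sketch; the registry shape and version names are assumptions, not a prescribed design:

```python
# Hypothetical prompt registry: versions are immutable, "active" is a pointer.
PROMPTS = {
    "v7": "You are a support assistant. Answer in JSON ...",
    "v8": "You are a support assistant. Think step by step ...",
}
ACTIVE_VERSION = "v8"

def rollback(to_version):
    """Point the app back at a known-good prompt version."""
    global ACTIVE_VERSION
    if to_version not in PROMPTS:
        raise ValueError(f"unknown prompt version: {to_version}")
    ACTIVE_VERSION = to_version  # in production: a config/flag flip, not a global

# Incident response: rollback("v7"), then watch invalid_output rate recover.
```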
Synthetic checks and eval probes
Once you have stable prompts and schemas, add synthetic checks:
- run a small set of known inputs periodically
- verify schema validity and basic quality signals
- alert if outputs break format or key criteria
This catches regressions quickly, especially after prompt/model changes.
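A synthetic check can be a small scheduled script. A sketch, assuming a hypothetical call_model() wrapper and an output contract of required JSON keys:

```python
import json

# Known inputs and the keys a valid answer must contain.
PROBES = [
    {"input": "Summarize: the sky is blue.", "required_keys": {"summary"}},
    {"input": "Extract entities: Alice met Bob.", "required_keys": {"entities"}},
]

def run_probes(call_model, alert):
    """Run each probe and alert on any format/contract failure."""
    for probe in PROBES:
        raw = call_model(probe["input"])      # hypothetical model wrapper
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            alert(f"probe output is not JSON: {probe['input']!r}")
            continue
        missing = probe["required_keys"] - data.keys()
        if missing:
            alert(f"probe missing keys {missing}: {probe['input']!r}")

# e.g. run_probes(call_model, alert=lambda msg: print("ALERT:", msg))
# on a schedule, and after every prompt or model change.
```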
Budgets and guardrails (cost + safety)
Prevention also means hard limits:
- cap retries
- cap concurrency
- cap input sizes
- cache/dedup repeated calls
- budget tool calls and side effects
Budgets stop one failure mode from turning into a cascade.
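These limits compose naturally around the model call. A sketch with illustrative numbers; call_model() and the in-memory cache are stand-ins to adapt to your stack:

```python
import threading
from functools import lru_cache

MAX_RETRIES = 2            # retry budget per request
MAX_CONCURRENCY = 8        # in-flight model calls at once
MAX_INPUT_CHARS = 20_000   # input-size budget

_slots = threading.Semaphore(MAX_CONCURRENCY)

def call_model(prompt):
    """Placeholder for the real model call."""
    raise NotImplementedError

@lru_cache(maxsize=1024)   # dedup identical calls; args must be hashable
def _cached_call(prompt):
    return call_model(prompt)

def budgeted_call(prompt):
    if len(prompt) > MAX_INPUT_CHARS:
        raise ValueError("input exceeds budget; truncate or reject upstream")
    last_error = None
    for _ in range(1 + MAX_RETRIES):
        with _slots:        # blocks when the concurrency budget is spent
            try:
                return _cached_call(prompt)
            except TimeoutError as exc:   # retry only transient failures
                last_error = exc
    raise last_error
```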
Prevention checklist
- Metrics for success rate, latency, error categories, tokens per success.
- Alerts on error spikes and latency spikes (not raw volume).
- Prompt version and schema version logged in all requests.
- Rollback path exists for prompt versions.
- Retry and concurrency budgets enforced.
- Runbook exists for common failure modes.