29.2 Circuit breakers and fallback modes
Goal: prevent cascading failures when the model is unhealthy
When the model or provider is unhealthy, naive retry behavior can turn a partial outage into a full outage:
- requests pile up,
- retries amplify load,
- latency spikes across the app,
- cost skyrockets,
- users lose trust.
Circuit breakers and fallback modes are how you keep the rest of your system healthy during model trouble.
Circuit breakers (what they are)
A circuit breaker is a policy that gates model calls based on recent health. It has three states:
- Closed: normal operation; calls go through.
- Open: model calls are blocked (or reduced) because the failure rate is high.
- Half-open: a small number of test calls are allowed through to check whether the model has recovered.
The goal is to stop making failing calls long enough for recovery, and to protect your app from retry storms.
Even if you can “afford” failures, you often can’t afford unlimited retries during outages. Breakers put a hard cap on the cost and load you burn while the provider is down.
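The three states above can be sketched as a small state machine. This is a minimal, single-threaded illustration, not a production implementation: the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `cooldown_seconds` are hypothetical, and for brevity it trips on consecutive failures rather than on a failure rate over a window.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is blocked because the breaker is open."""


class CircuitBreaker:
    # All names and defaults here are illustrative assumptions.
    def __init__(self, failure_threshold=5, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock  # injectable so tests can control time
        self.state = State.CLOSED
        self.consecutive_failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                # Still cooling down: fail fast without touching the model.
                raise CircuitOpenError("model calls are blocked")
            # Cooldown elapsed: let a trial call through.
            self.state = State.HALF_OPEN
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success closes the breaker and resets the failure count.
        self.consecutive_failures = 0
        self.state = State.CLOSED

    def _on_failure(self):
        self.consecutive_failures += 1
        # A failed trial call reopens immediately; otherwise wait for the threshold.
        if (self.state is State.HALF_OPEN
                or self.consecutive_failures >= self.failure_threshold):
            self.state = State.OPEN
            self.opened_at = self.clock()
```

Callers would wrap each model request in `breaker.call(...)` and treat `CircuitOpenError` as the cue to use a fallback. A concurrent service would also need locking around the state transitions.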
Signals to trip breakers
Trip breakers based on objective signals, not feelings:
- Error rate: % of 5xx or timeouts over a rolling window.
- Rate limits: sustained 429 responses.
- Latency: p95/p99 above threshold for a window.
- Validation failure rate: sudden increase in invalid JSON/schema failures (can indicate model/prompt mismatch).
- Cost spikes: token usage per request exceeding budgets.
Different breakers can exist for different features or routes (e.g., “summarizer” vs “RAG Q&A”).
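As one way to turn the error-rate signal into a trip decision, the sketch below keeps a rolling window of recent outcomes and reports when the failure share crosses a threshold. The class name `RollingErrorRate`, the `min_samples` guard, and the default values are assumptions chosen for illustration, not recommended settings.

```python
import time
from collections import deque
from typing import Optional


def is_failure(status_code: Optional[int], timed_out: bool) -> bool:
    """Count timeouts, 429s, and 5xx responses as failures."""
    if timed_out or status_code is None:
        return True
    return status_code == 429 or 500 <= status_code <= 599


class RollingErrorRate:
    # Hypothetical helper; thresholds below are placeholders.
    def __init__(self, window_seconds=60.0, threshold=0.5, min_samples=20,
                 clock=time.monotonic):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_samples = min_samples  # avoid tripping on a handful of calls
        self.clock = clock
        self.samples = deque()  # (timestamp, failed) pairs, oldest first

    def record(self, failed: bool) -> None:
        now = self.clock()
        self.samples.append((now, failed))
        self._evict(now)

    def _evict(self, now: float) -> None:
        # Drop samples that have aged out of the window.
        while self.samples and now - self.samples[0][0] > self.window_seconds:
            self.samples.popleft()

    def error_rate(self) -> float:
        self._evict(self.clock())
        if not self.samples:
            return 0.0
        failures = sum(1 for _, failed in self.samples if failed)
        return failures / len(self.samples)

    def should_trip(self) -> bool:
        rate = self.error_rate()
        return len(self.samples) >= self.min_samples and rate >= self.threshold


# One tracker per feature, so a failing summarizer does not block RAG Q&A.
trackers = {"summarizer": RollingErrorRate(), "rag_qa": RollingErrorRate()}
```

The same shape works for the other signals: record latency or validation outcomes instead of a boolean and compare a percentile or a rate against its own threshold.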
Fallback modes (practical options)
Fallback modes should be designed upfront. Common options:
- Return a safe minimal response: “not available right now” + next steps.
- Use cached results: show last known answer (with a freshness warning).
- Switch models: fall back to a cheaper/faster model or a different provider.
- Reduce scope: smaller output, fewer retrieved chunks, no reranking.
- Non-AI baseline: keyword search + links to docs.
- Queue and async: accept request and deliver later (email/job queue).
In degraded mode, do not relax guardrails. Systems under stress are the ones most likely to leak data and invent facts. Prefer abstention and safe alternatives.
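One way to wire these options together is an ordered fallback chain that tries each handler in turn and ends with the safe minimal response. The sketch below assumes hypothetical handler functions and a hypothetical `passes_guardrails` check; the key point it illustrates is that the same guardrail check applies to every fallback, not only to the primary model.

```python
from typing import Callable, Optional, Sequence, Tuple

SAFE_MESSAGE = (
    "This feature is not available right now. "
    "Please try again later or search the docs."
)

Handler = Callable[[str], Optional[str]]


def answer_with_fallbacks(
    question: str,
    handlers: Sequence[Tuple[str, Handler]],
    passes_guardrails: Callable[[str], bool],
) -> dict:
    """Try handlers in order; return the first output that passes guardrails."""
    for index, (name, handler) in enumerate(handlers):
        try:
            candidate = handler(question)
        except Exception:
            # This option is unavailable; move on to the next one.
            continue
        # Guardrails are NOT relaxed for fallbacks.
        if candidate is not None and passes_guardrails(candidate):
            return {"source": name, "text": candidate, "degraded": index > 0}
    # Nothing usable: abstain with the safe minimal response.
    return {"source": "safe_minimal", "text": SAFE_MESSAGE, "degraded": True}
```

A typical ordering might be primary model, cached answer, then a non-AI keyword search. The `degraded` flag lets the UI show a reduced-capability notice.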
UX for degraded modes
Degraded-mode UX should be:
- transparent: clearly indicate reduced capability.
- actionable: suggest a next step (try again, search docs, contact support).
- non-alarming: present it as “temporarily unavailable” rather than “system broken.”
- consistent: same failure state maps to same user experience.
For RAG, a good fallback is often: show top relevant source chunks and let the user click through, instead of generating an answer.
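Consistency is easier to keep when every failure state maps to exactly one notice in a single table. The sketch below is an illustration under assumed names (`FailureState`, `DegradedNotice`, `render_degraded`) and placeholder wording; for RAG it also appends source links in place of a generated answer.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Tuple


class FailureState(Enum):
    PROVIDER_DOWN = "provider_down"
    RATE_LIMITED = "rate_limited"
    TIMEOUT = "timeout"


@dataclass(frozen=True)
class DegradedNotice:
    message: str    # transparent and non-alarming
    next_step: str  # actionable


# One entry per state, so the same failure always produces the same experience.
NOTICES = {
    FailureState.PROVIDER_DOWN: DegradedNotice(
        "Generated answers are temporarily unavailable.",
        "You can search the docs or contact support.",
    ),
    FailureState.RATE_LIMITED: DegradedNotice(
        "This feature is temporarily busy.",
        "Please try again in a minute.",
    ),
    FailureState.TIMEOUT: DegradedNotice(
        "This request is temporarily taking too long.",
        "Please try again or search the docs.",
    ),
}


def render_degraded(
    state: FailureState,
    source_chunks: Iterable[Tuple[str, str]] = (),
) -> str:
    """Build the user-facing text; source_chunks are (title, url) pairs."""
    notice = NOTICES[state]
    lines = [notice.message, notice.next_step]
    chunks = list(source_chunks)
    if chunks:
        # RAG fallback: show sources to click through instead of an answer.
        lines.append("Relevant sources:")
        lines.extend(f"- {title}: {url}" for title, url in chunks)
    return "\n".join(lines)
```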
Testing fallback behavior
Fallbacks must be tested like any other product behavior:
- Simulate outages: force timeouts and 5xx responses.
- Simulate rate limits: force 429 responses and confirm backoff.
- Verify breakers trip: ensure they open at thresholds and recover correctly.
- Verify safe messaging: no leaks, no confusing half-answers.
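A fake client that replays scripted failures makes these checks cheap to automate. The sketch below uses Python's standard `unittest`; `FakeModelClient`, `HttpError`, and the small `ask` function under test are all hypothetical stand-ins for your own client and request path.

```python
import unittest

SAFE_MESSAGE = "Temporarily unavailable. Please try again shortly."


class HttpError(Exception):
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


class FakeModelClient:
    """Test double that replays scripted outcomes instead of calling a provider."""

    def __init__(self, outcomes):
        self.outcomes = list(outcomes)
        self.calls = 0

    def complete(self, prompt):
        self.calls += 1
        outcome = self.outcomes.pop(0)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome


def ask(client, prompt, max_attempts=3, sleep=lambda seconds: None):
    """Retry 429s with exponential backoff; fall back on 5xx and timeouts."""
    delay = 1.0
    for attempt in range(max_attempts):
        try:
            return client.complete(prompt)
        except HttpError as err:
            if err.status != 429:
                break  # 5xx: do not retry, go straight to the fallback
            if attempt < max_attempts - 1:
                sleep(delay)
                delay *= 2
        except TimeoutError:
            break
    return SAFE_MESSAGE


class FallbackTests(unittest.TestCase):
    def test_5xx_returns_safe_message_without_retry(self):
        client = FakeModelClient([HttpError(503)])
        self.assertEqual(ask(client, "hi"), SAFE_MESSAGE)
        self.assertEqual(client.calls, 1)

    def test_timeout_returns_safe_message(self):
        client = FakeModelClient([TimeoutError()])
        self.assertEqual(ask(client, "hi"), SAFE_MESSAGE)

    def test_429_backs_off_then_succeeds(self):
        delays = []
        client = FakeModelClient([HttpError(429), HttpError(429), "answer"])
        self.assertEqual(ask(client, "hi", sleep=delays.append), "answer")
        self.assertEqual(delays, [1.0, 2.0])
```

Saved as a test module, this runs with `python -m unittest`. Breaker thresholds and recovery can be tested in the same style by injecting a fake clock rather than waiting for real time to pass.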