14.5 Hardening: timeouts, retries, and fallbacks
Goal: make the app reliable under real usage
Hardening is where you turn a demo into a product. Your goals:
- avoid hanging requests,
- avoid retry storms,
- handle invalid outputs gracefully,
- keep costs predictable,
- make failures understandable to users.
This page focuses on practical reliability primitives you can add without overbuilding.
Once the pipeline works, spend the next few iterations on reliability rather than new features. Reliability is how you keep momentum long-term.
Timeouts (hard vs soft)
At minimum, set a hard timeout for the model call. Two useful concepts:
- Hard timeout: abort the request and return a timeout outcome.
- Soft timeout: if you exceed a threshold, return a fallback (partial result, cached result, or “try again”).
For v1, a hard timeout with clear UX is enough.
Make the timeout configurable via an environment variable so you can adjust it without code changes.
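A minimal sketch of a hard timeout, assuming the model call is an async function. The environment variable name `MODEL_TIMEOUT_SECONDS`, the 30-second default, and the shape of the returned dict are illustrative assumptions, not part of any provider SDK.

```python
import asyncio
import os


def load_timeout_seconds(default: float = 30.0) -> float:
    """Read the timeout from MODEL_TIMEOUT_SECONDS (hypothetical variable name)."""
    raw = os.environ.get("MODEL_TIMEOUT_SECONDS", "")
    try:
        value = float(raw)
    except ValueError:
        return default  # unset or malformed: fall back to the default
    return value if value > 0 else default


async def call_with_hard_timeout(call, timeout_s: float) -> dict:
    """Await call(); cancel it and return a timeout outcome if it runs too long.

    `call` is a placeholder for your own zero-argument async model call.
    """
    try:
        result = await asyncio.wait_for(call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return {"outcome": "timeout", "result": None}
    return {"outcome": "ok", "result": result}
```

`asyncio.wait_for` cancels the awaited call when the deadline passes, so the request does not keep running in the background. Returning a timeout outcome instead of raising keeps it in the same category scheme as other failures.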
Retries (policy, backoff, jitter)
Retries help with transient failures but are harmful when uncontrolled.
A safe default policy
- Max attempts: 3
- Retryable: rate_limit, transient network errors, some timeouts
- Non-retryable: auth_error, invalid_request, blocked
- Invalid output: at most 1 retry using a stricter “repair” prompt
- Backoff: exponential + jitter
Always log attempt count and outcome category. Otherwise you’ll misdiagnose cost spikes and latency spikes.
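The policy above could be sketched roughly as follows. It assumes your model call is wrapped in a function that returns an `(outcome_category, result)` pair. The category name `network_error`, the base delay, and the delay cap are assumptions chosen for illustration, and all timeouts are treated as retryable for simplicity.

```python
import random
import time

MAX_ATTEMPTS = 3
# Categories treated as transient; "network_error" is an assumed name.
RETRYABLE = {"rate_limit", "network_error", "timeout"}


def backoff_delay(attempt, base=0.5, cap=8.0, rng=random.random):
    """Exponential backoff with full jitter: random delay in [0, min(cap, base * 2**(attempt-1))]."""
    return rng() * min(cap, base * 2 ** (attempt - 1))


def call_with_retries(call, repair_call=None, sleep=time.sleep, log=print):
    """Run call() under the retry policy.

    call and repair_call are hypothetical zero-argument functions that
    return (category, result). Returns (category, result, attempts_used).
    """
    next_call, repair_used = call, False
    for attempt in range(1, MAX_ATTEMPTS + 1):
        category, result = next_call()
        log(f"attempt={attempt} outcome={category}")  # always log both
        if category == "ok" or attempt == MAX_ATTEMPTS:
            break
        if category == "invalid_output" and repair_call and not repair_used:
            # At most one retry with the stricter "repair" prompt.
            next_call, repair_used = repair_call, True
        elif category in RETRYABLE:
            sleep(backoff_delay(attempt))
        else:
            break  # auth_error, invalid_request, blocked: do not retry
    return category, result, attempt
```

Passing `sleep` and `log` as parameters is only a convenience for testing; in production code you would likely wire in your own logger.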
Fallbacks (what to do when it fails)
A fallback is how your product stays usable when the ideal path fails. Practical fallback options:
- User retry: show a retry button with guidance.
- Return partial output: if you got some structured fields, return them with a warning.
- Fallback model: switch to a smaller/faster model for a second attempt (careful: may reduce quality).
- Fallback format: if strict JSON fails, ask for a simpler schema (still validate).
For v1, the most important fallback is a clean “try again” UX with a request id.
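One way to combine a fallback model with a clean "try again" response is sketched below. `primary` and `fallback` are hypothetical callables that take the input text and return `(category, result)`; the response fields are assumptions about what your UI might need.

```python
import uuid


def summarize_with_fallback(text, primary, fallback=None):
    """Try the primary call, then an optional fallback, and always attach a request id."""
    request_id = uuid.uuid4().hex
    category, result = primary(text)
    source = "primary"
    # A blocked request should not be re-sent to another model.
    if category not in ("ok", "blocked") and fallback is not None:
        category, result = fallback(text)
        source = "fallback"
    if category == "ok":
        return {"status": "ok", "summary": result, "source": source,
                "request_id": request_id}
    return {"status": "error", "category": category,
            "can_retry": category != "blocked", "request_id": request_id}
```

Recording which path produced the result (`source`) makes it easier to notice later whether the fallback model is lowering quality.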
Caching and dedup (save cost + reduce throttling)
Summarization often repeats: users resubmit the same text, or your UI retries on refresh.
Two pragmatic techniques:
- Dedup in-flight requests: if the same input arrives twice at once, run one call and share the result.
- Cache recent results: key by (prompt_version + schema_version + normalized input hash).
Be careful with privacy: caching may store user content. Prefer caching only in dev, or cache by hashed keys and store outputs with strict access controls.
If you cache summaries that contain user data, treat the cache like sensitive storage: encryption, access controls, retention limits.
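Both techniques can be sketched in a few lines, assuming an asyncio-based service. The whitespace-only normalization and the class name `InFlightDedup` are illustrative choices; note that hashing the key does not protect the cached output itself, which still needs the access controls described above.

```python
import asyncio
import hashlib


def cache_key(prompt_version: str, schema_version: str, text: str) -> str:
    """Hash (prompt_version + schema_version + normalized input) into a cache key."""
    normalized = " ".join(text.split())  # collapse whitespace only (an assumption)
    payload = "\x00".join([prompt_version, schema_version, normalized])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class InFlightDedup:
    """Share one running call among concurrent requests with the same key."""

    def __init__(self):
        self._tasks = {}

    async def run(self, key, factory):
        """factory is a zero-argument async function that performs the model call."""
        task = self._tasks.get(key)
        if task is None:
            task = asyncio.create_task(factory())
            self._tasks[key] = task
            # Forget the task once it finishes so later requests start fresh.
            task.add_done_callback(lambda _: self._tasks.pop(key, None))
        # shield: one caller being cancelled does not cancel the shared call.
        return await asyncio.shield(task)
```

A result cache would sit in front of this, keyed by the same `cache_key` value.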
Input limits and guardrails
Guardrails reduce cost and reduce failures:
- max input length,
- rate limit user requests (especially for web apps),
- cap concurrency,
- validate inputs before model calls,
- refuse obviously unsupported inputs early (v1 is allowed to be strict).
User experience under failure
Failure-aware UX is part of reliability:
- Timeout: “This took too long. Try again.”
- Rate limit: “We’re temporarily rate limited. Wait a moment.”
- Invalid output: “We couldn’t parse the response. Try again.”
- Blocked: “We can’t help with that request.” + safe alternatives.
Also: show request ids for support/debugging, but avoid showing internal provider details.
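The messages above can live in one lookup table so that every failure path produces the same wording. This is a sketch: the generic default message and the `can_retry` field are assumptions, and internal provider details are deliberately left out of the returned payload.

```python
USER_MESSAGES = {
    "timeout": "This took too long. Try again.",
    "rate_limit": "We're temporarily rate limited. Wait a moment.",
    "invalid_output": "We couldn't parse the response. Try again.",
    "blocked": "We can't help with that request.",
}
# Assumed wording for categories not listed above.
DEFAULT_MESSAGE = "Something went wrong. Try again."


def user_facing_error(category: str, request_id: str) -> dict:
    """Map an internal outcome category to a safe payload for the UI."""
    return {
        "message": USER_MESSAGES.get(category, DEFAULT_MESSAGE),
        "request_id": request_id,            # shown for support/debugging
        "can_retry": category != "blocked",  # blocked requests get no retry button
    }
```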
Rollout strategy (reduce risk)
Even for small projects, you can roll out safely:
- start in dev with verbose logs and small input limits,
- use a staging environment with production-like configs,
- roll out to a small group of users,
- watch error categories and latency,
- only then expand.
This is how you avoid turning a prototype into a product incident.
Hardening checklist
- Timeouts configured and enforced.
- Retry policy implemented (caps + backoff + jitter).
- Invalid output handled (parse + schema validation + optional repair retry).
- Blocked/refused handled as a normal outcome state.
- Logs include request id, prompt version, model, latency, outcome category.
- Input limits and concurrency limits applied.
- Fallback UX states implemented end-to-end.
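For the logging item in the checklist, a structured record per attempt might look like the sketch below. The logger name and field names are assumptions; the fields mirror the ones listed in the checklist plus the attempt count.

```python
import json
import logging
import time

logger = logging.getLogger("summarizer")  # assumed logger name


def log_outcome(request_id, prompt_version, model, started_at, category, attempt):
    """Emit one JSON log line per attempt and return the record.

    started_at is a time.monotonic() value captured before the model call.
    """
    record = {
        "request_id": request_id,
        "prompt_version": prompt_version,
        "model": model,
        "latency_ms": round((time.monotonic() - started_at) * 1000),
        "outcome": category,
        "attempt": attempt,
    }
    logger.info(json.dumps(record))
    return record
```

The record holds identifiers and measurements only, not user content, which keeps logs out of the sensitive-storage category.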