14.5 Hardening: timeouts, retries, and fallbacks

Goal: make the app reliable under real usage

Hardening is where you turn a demo into a product. Your goals:

avoid hanging requests,
avoid retry storms,
handle invalid outputs gracefully,
keep costs predictable,
make failures understandable to users.

This page focuses on practical reliability primitives you can add without overbuilding.

Hardening is part of the vibe loop

Once the pipeline works, the next loops should focus on reliability, not more features. Reliability is how you keep momentum long-term.

Timeouts (hard vs soft)

At minimum, set a hard timeout for the model call. Two useful concepts:

Hard timeout: abort the request and return a timeout outcome.
Soft timeout: if you exceed a threshold, return a fallback (partial result, cached result, or “try again”).

For v1, a hard timeout with clear UX is enough.

Expose timeout as config

Make the timeout configurable via an environment variable so you can adjust it without code changes.
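As an illustration, a hard timeout read from the environment might look like the sketch below. The variable name `MODEL_TIMEOUT_S`, the 30-second default, and the helper `call_with_hard_timeout` are assumptions made for this example. If your HTTP or provider client accepts its own timeout parameter, prefer that: a Python thread cannot be forcibly stopped, so this wrapper only stops waiting for the result.

```python
import concurrent.futures
import os

# Hypothetical env var name; the 30-second default is an assumption.
MODEL_TIMEOUT_S = float(os.environ.get("MODEL_TIMEOUT_S", "30"))

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_hard_timeout(fn, *args, timeout_s=None):
    """Run fn(*args) and return ("ok", result) or ("timeout", None)."""
    limit = MODEL_TIMEOUT_S if timeout_s is None else timeout_s
    future = _executor.submit(fn, *args)
    try:
        return "ok", future.result(timeout=limit)
    except concurrent.futures.TimeoutError:
        # Best effort only: a call that is already running keeps running
        # in its worker thread, but the caller is released immediately.
        future.cancel()
        return "timeout", None
```

Returning an outcome category instead of raising keeps "timeout" a normal state that the UI can render.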

Retries (policy, backoff, jitter)

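Retries help with transient failures but are harmful when uncontrolled: every extra attempt adds cost and latency, and many clients retrying at once can turn a brief outage into a retry storm.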

A safe default policy

Max attempts: 3
Retryable: rate_limit, transient network errors, some timeouts
Non-retryable: auth_error, invalid_request, blocked
Invalid output: at most 1 retry using a stricter “repair” prompt
Backoff: exponential + jitter
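The policy above could be sketched roughly as follows. The function names, the `(outcome, value)` return convention, and the base and cap delays are assumptions for illustration; the outcome names follow the list above, with `network_error` and `timeout` standing in for "transient network errors" and "some timeouts". The single "repair" retry for invalid output is not covered here.

```python
import random
import time

# Outcome categories that are worth another attempt.
# Anything else (auth_error, invalid_request, blocked, ok) is final.
RETRYABLE = {"rate_limit", "network_error", "timeout"}


def backoff_delay(attempt, base_s=0.5, cap_s=8.0, rng=random.random):
    """Exponential backoff with full jitter.

    The ceiling doubles with each attempt (base, 2*base, 4*base, ...),
    is capped at cap_s, and the actual delay is a random fraction of it.
    """
    return rng() * min(cap_s, base_s * 2 ** (attempt - 1))


def call_with_retries(call, max_attempts=3, sleep=time.sleep):
    """Invoke call() until it returns a final outcome or attempts run out.

    call() must return (outcome, value). Returns (outcome, value, attempts).
    """
    for attempt in range(1, max_attempts + 1):
        outcome, value = call()
        if outcome not in RETRYABLE or attempt == max_attempts:
            return outcome, value, attempt
        sleep(backoff_delay(attempt))
```

Jitter matters because without it, clients that failed together retry together. Passing `sleep` as a parameter lets tests run without real delays.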

Retries must be observable

Always log the attempt count and outcome category for every call. Without them, a cost or latency spike caused by retries is easy to mistake for a traffic spike or a slow provider.
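One way to make attempts observable is a single structured log line per attempt. This is a minimal sketch: the logger name, the field names, and the `log_attempt` helper are assumptions, not a required schema.

```python
import json
import logging

logger = logging.getLogger("model_calls")  # hypothetical logger name


def log_attempt(request_id, attempt, outcome, latency_ms):
    """Emit one JSON log line per attempt and return it."""
    record = {
        "request_id": request_id,
        "attempt": attempt,
        "outcome": outcome,
        "latency_ms": round(latency_ms, 1),
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line
```

With one line per attempt, counting lines where `attempt` is greater than 1 shows how much traffic comes from retries.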

Fallbacks (what to do when it fails)

A fallback is how your product stays usable when the ideal path fails. Practical fallback options:

User retry: show a retry button with guidance.
Return partial output: if you got some structured fields, return them with a warning.
Fallback model: switch to a smaller/faster model for a second attempt (careful: may reduce quality).
Fallback format: if strict JSON fails, ask for a simpler schema (still validate).

For v1, the most important fallback is a clean “try again” UX with a request id.
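To make the idea concrete, here is a sketch of a fallback chain that tries a strict format first, then a simpler schema, and finally returns a "try again" state with a request id. The strategy tuple layout, the status names, and the helper names are assumptions for this example, and the validation is deliberately minimal (a JSON object with the required keys).

```python
import json
import uuid


def parse_summary(raw, required_keys):
    """Return a dict if raw is a JSON object with all required keys, else None."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return None
    if not isinstance(data, dict):
        return None
    if not all(key in data for key in required_keys):
        return None
    return data


def summarize_with_fallbacks(text, strategies):
    """Try each (name, generate_fn, required_keys) strategy in order.

    Every strategy's output is validated, including the simpler fallbacks.
    """
    request_id = uuid.uuid4().hex
    for name, generate, required_keys in strategies:
        data = parse_summary(generate(text), required_keys)
        if data is not None:
            return {"status": "ok", "strategy": name,
                    "data": data, "request_id": request_id}
    # Nothing validated: hand the UI a clean "try again" state.
    return {"status": "try_again", "request_id": request_id}
```

The request id is generated before any attempt so that both success and failure responses can be traced in logs.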

Caching and dedup (save cost + reduce throttling)

Summarization often repeats: users resubmit the same text, or your UI retries on refresh.

Two pragmatic techniques:

Dedup in-flight requests: if the same input arrives twice at once, run one call and share the result.
Cache recent results: key by (prompt_version + schema_version + normalized input hash).
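Both techniques can be sketched in a small in-memory class. The names `cache_key` and `DedupCache` are hypothetical, normalization here means collapsing whitespace only (an assumption), and there is no eviction or expiry, so treat this as an illustration rather than a production cache.

```python
import hashlib
import threading


def cache_key(prompt_version, schema_version, text):
    """Key by prompt version, schema version, and a hash of the normalized input."""
    normalized = " ".join(text.split())  # assumption: collapse whitespace only
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"{prompt_version}:{schema_version}:{digest}"


class _InFlight:
    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None


class DedupCache:
    def __init__(self):
        self._lock = threading.Lock()
        self._results = {}    # key -> finished result
        self._in_flight = {}  # key -> _InFlight shared by concurrent callers

    def get_or_compute(self, key, compute):
        with self._lock:
            if key in self._results:
                return self._results[key]
            entry = self._in_flight.get(key)
            owner = entry is None
            if owner:
                entry = self._in_flight[key] = _InFlight()
        if not owner:
            # Another caller is already running this input: wait and share.
            entry.done.wait()
            if entry.error is not None:
                raise entry.error
            return entry.result
        try:
            entry.result = compute()
        except Exception as exc:
            entry.error = exc  # failures are shared but not cached
            raise
        else:
            with self._lock:
                self._results[key] = entry.result
            return entry.result
        finally:
            with self._lock:
                del self._in_flight[key]
            entry.done.set()
```

Only the hash of the input appears in the key, but the cached value is still the model output, so the storage concerns in this section apply.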

Be careful with privacy: caching may store user content. Prefer caching only in dev, or cache by hashed keys and store outputs with strict access controls.

Caching can leak data

If you cache summaries that contain user data, treat the cache like sensitive storage: encryption, access controls, retention limits.

Input limits and guardrails

Guardrails reduce both cost and failures:

max input length,
rate limit user requests (especially for web apps),
cap concurrency,
validate inputs before model calls,
refuse obviously unsupported inputs early (v1 is allowed to be strict).
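A few of these guardrails fit in a thin wrapper around the model call, sketched below. The limits, the outcome names `input_too_long` and `busy`, and the function names are assumptions for illustration; per-user rate limiting is not shown.

```python
import threading

MAX_INPUT_CHARS = 20_000   # assumption: pick a limit that fits your model
MAX_CONCURRENT_CALLS = 4   # assumption: cap on simultaneous model calls

_slots = threading.BoundedSemaphore(MAX_CONCURRENT_CALLS)


def validate_input(text):
    """Return None if the input is acceptable, else an outcome category."""
    if not isinstance(text, str) or not text.strip():
        return "invalid_request"
    if len(text) > MAX_INPUT_CHARS:
        return "input_too_long"
    return None


def guarded_call(text, call_model):
    """Validate first, then call the model only if a concurrency slot is free."""
    problem = validate_input(text)
    if problem is not None:
        return problem, None      # rejected before any model cost is incurred
    if not _slots.acquire(blocking=False):
        return "busy", None       # at capacity: fail fast instead of queueing
    try:
        return "ok", call_model(text)
    finally:
        _slots.release()
```

Failing fast when all slots are taken is one design choice; a short bounded wait is another, at the price of extra latency.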

User experience under failure

Failure-aware UX is part of reliability:

Timeout: “This took too long. Try again.”
Rate limit: “We’re temporarily rate limited. Wait a moment.”
Invalid output: “We couldn’t parse the response. Try again.”
Blocked: “We can’t help with that request.” + safe alternatives.

Also show a request id so users can quote it to support and you can find the matching logs, but avoid exposing internal provider details.
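The mapping from outcome category to user-facing text can live in one place, as in this sketch. The category keys, the generic fallback message, and the `user_facing_error` helper are assumptions; the messages are the ones listed above.

```python
USER_MESSAGES = {
    "timeout": "This took too long. Try again.",
    "rate_limit": "We’re temporarily rate limited. Wait a moment.",
    "invalid_output": "We couldn’t parse the response. Try again.",
    "blocked": "We can’t help with that request.",
}

# Assumption: wording for categories that have no specific message.
GENERIC_MESSAGE = "Something went wrong. Try again."


def user_facing_error(outcome, request_id):
    """Build the payload shown to the user for a failed request.

    Only the message and the request id are included; provider names,
    raw error strings, and stack traces stay in the logs.
    """
    return {
        "message": USER_MESSAGES.get(outcome, GENERIC_MESSAGE),
        "request_id": request_id,
    }
```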

Rollout strategy (reduce risk)

Even for small projects, you can roll out safely:

start in dev with verbose logs and small input limits,
use a staging environment with production-like configs,
roll out to a small group of users,
watch error categories and latency,
only then expand.

This is how you avoid turning a prototype into a product incident.
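One common way to pick the "small group of users" is a deterministic percentage bucket, sketched below. This is a swapped-in illustration, not something the steps above prescribe: the function name `in_rollout`, the salt value, and hashing by user id are all assumptions.

```python
import hashlib


def in_rollout(user_id, percent, salt="hardening-v1"):
    """Return True if this user falls inside the first `percent` of buckets.

    The same user always lands in the same bucket (0 to 99), so raising
    `percent` only adds users and never removes anyone already included.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

Start with a small percentage, watch error categories and latency, and raise the number only when they look healthy.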

Hardening checklist

Timeouts configured and enforced.
Retry policy implemented (caps + backoff + jitter).
Invalid output handled (parse + schema validation + optional repair retry).
Blocked/refused handled as a normal outcome state.
Logs include request id, prompt version, model, latency, outcome category.
Input limits and concurrency limits applied.
Fallback UX states implemented end-to-end.
