29.1 Timeouts, retries, and idempotency


Goal: keep the app responsive under model variability

Model calls can be slow, flaky, rate-limited, or temporarily unavailable.

Your goal is not “make the model never fail.” Your goal is:

  • the app never hangs,
  • failures are bounded and recoverable,
  • retries don’t create duplicate side effects,
  • users get a useful fallback when the model can’t respond.

Timeouts: what to time out (everything)

Every external step should have a timeout:

  • Retrieval timeout: vector search, reranking, keyword search.
  • Model timeout: completion call (including streaming).
  • Tool/API timeout: any downstream API calls initiated by the app.
  • End-to-end timeout: total time budget for the user request.

Practical rules:

  • Set an end-to-end budget first: e.g., 8s for “interactive,” 30s for “analysis.”
  • Allocate per-stage budgets: retrieval 1s, rerank 1s, model 5s, validation 0.2s (example).
  • Reserve time for fallbacks: don’t spend 100% of the budget on retries.
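The budget rules above can be sketched as a small helper that hands each stage its share of one end-to-end deadline. This is a hypothetical `Deadline` class, not a library API; the point is that stages draw from a shared budget instead of using independent fixed timeouts:

```python
import time

class Deadline:
    """Tracks one end-to-end time budget shared across pipeline stages."""

    def __init__(self, total_seconds: float):
        self.expires_at = time.monotonic() + total_seconds

    def remaining(self) -> float:
        # Seconds left in the overall budget (never negative).
        return max(0.0, self.expires_at - time.monotonic())

    def stage_timeout(self, want_seconds: float) -> float:
        # A stage may ask for its ideal timeout, but never gets more
        # than what remains end-to-end.
        return min(want_seconds, self.remaining())

# Example: an 8s "interactive" budget split across stages.
deadline = Deadline(8.0)
retrieval_timeout = deadline.stage_timeout(1.0)  # retrieval: up to 1s
model_timeout = deadline.stage_timeout(5.0)      # model: up to 5s
```

A nice side effect of this shape: if retrieval runs long, the model stage automatically gets a smaller timeout, so the end-to-end deadline still holds.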

Timeouts need cancellation

A timeout that doesn’t cancel work still burns cost. Use cancellation signals where supported and ensure background work is stopped or ignored safely.
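As a minimal sketch of timeout-plus-cancellation: `asyncio.wait_for` cancels the awaited task when the deadline passes, so the in-flight work actually stops instead of running on in the background. The `slow_model_call` coroutine here is a stand-in for a real completion call:

```python
import asyncio

async def slow_model_call() -> str:
    # Stand-in for a real (streaming) completion call.
    await asyncio.sleep(10)
    return "answer"

async def call_with_timeout(timeout_s: float) -> str:
    try:
        # wait_for cancels the underlying coroutine on timeout,
        # so the "model call" stops burning time and cost.
        return await asyncio.wait_for(slow_model_call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return "fallback: model did not respond in time"

print(asyncio.run(call_with_timeout(0.1)))
```

If your HTTP client or SDK supports cancellation tokens or request aborts, wire the same signal through to the network layer; cancelling only your local task while the server keeps generating still wastes provider-side cost.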

Retries: when they help and when they hurt

Retries are helpful when failures are transient:

  • network hiccups,
  • rate-limit responses (after waiting),
  • temporary provider errors.

Retries are harmful when failures are persistent or logical:

  • invalid requests,
  • schema violations caused by prompt design,
  • permissions problems,
  • bad retrieval returning irrelevant chunks.

Practical retry rules:

  • Retry only a small number of times: 1–2 is often enough.
  • Use exponential backoff + jitter: avoid synchronized retry storms.
  • Differentiate errors: retry on 429/5xx/timeouts, not on 4xx “bad request.”
  • Retry with a modified strategy: fewer chunks, stricter schema reminder, or a smaller model.

Retry ≠ repeat

Good retries change something: wait longer, reduce prompt size, switch model, or fall back. Repeating the same call often repeats the same failure.
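The retry rules above can be combined into one small loop: classify the error, retry only transient statuses, and back off exponentially with full jitter. This is a sketch under assumptions, not a real client; `call` is a hypothetical zero-arg function that raises `TransientError` with an HTTP-style status on failure:

```python
import random
import time

RETRYABLE_STATUS = {429, 500, 502, 503, 504}

class TransientError(Exception):
    def __init__(self, status: int):
        self.status = status

def call_with_retries(call, max_attempts: int = 3, base_delay: float = 0.5):
    """Retry transient failures with exponential backoff + full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientError as e:
            if e.status not in RETRYABLE_STATUS or attempt == max_attempts:
                raise  # 4xx "bad request" or budget exhausted: don't retry
            # Full jitter: sleep a random fraction of the exponential cap,
            # so synchronized clients don't retry in lockstep.
            time.sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

Libraries such as tenacity offer the same pattern with more policy options; the essential parts are the error classification and the jitter.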

Idempotency: safe retries without duplicate side effects

Idempotency answers: “If we repeat this request, will we accidentally do it twice?”

For pure generation (text output), retries are usually safe. For actions (tool calls, DB writes), retries can be dangerous.

Idempotency patterns:

  • Idempotency key: attach a unique request id to downstream writes so duplicates are rejected.
  • Read vs write split: allow retries for reads; require special handling for writes.
  • Two-phase commit style: generate a proposal first, then require explicit execution.
  • At-least-once vs exactly-once: design your system to tolerate duplicates when exactly-once is hard.
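The idempotency-key pattern can be sketched with a toy write API: the caller attaches a unique request id, and the service stores the first result and replays it for any duplicate. This is a hypothetical in-memory service, not a real payment API, though services like Stripe expose the same idea via an `Idempotency-Key` header:

```python
class PaymentsAPI:
    """Toy downstream write API that enforces idempotency keys."""

    def __init__(self):
        self._seen: dict = {}  # idempotency key -> stored result

    def charge(self, idempotency_key: str, amount: int) -> dict:
        if idempotency_key in self._seen:
            # Duplicate retry: replay the original result,
            # do NOT perform the side effect again.
            return self._seen[idempotency_key]
        result = {"charged": amount, "id": idempotency_key}
        self._seen[idempotency_key] = result
        return result

api = PaymentsAPI()
first = api.charge("req-123", 50)
retry = api.charge("req-123", 50)  # safe: same result, no second charge
```

With this in place, the retry loop from earlier can safely wrap writes as well as reads, as long as the same key is reused across attempts of the same logical request.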

Streaming complicates retries

If you stream partial output to users, you need policies for mid-stream failure: restart from scratch, resume, or fall back to a summary. Decide this upfront.

Budgets: max retries, max tokens, max latency

Reliability requires hard limits:

  • Max retries: bound how many attempts you make.
  • Max tokens: cap output size and total token usage per request.
  • Max context: cap number of retrieved chunks and total context included.
  • Max time: end-to-end deadline that includes retries.

Budgets keep your system stable during outages and protect you from surprise bills.
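One way to keep these limits consistent is to make them an explicit, immutable config object rather than scattered constants. The numbers below are illustrative, echoing the earlier 8s interactive / 30s analysis examples; tune them per product surface:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RequestBudget:
    """Hard limits applied to one user request, retries included."""
    max_retries: int = 2
    max_output_tokens: int = 1024
    max_context_chunks: int = 8
    max_total_seconds: float = 8.0

# Two illustrative profiles.
INTERACTIVE = RequestBudget()
ANALYSIS = RequestBudget(max_retries=1, max_output_tokens=4096,
                         max_context_chunks=20, max_total_seconds=30.0)
```

Passing a single budget object through the pipeline makes it hard for any one stage to silently exceed the limits, and makes outage behavior easy to reason about.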

Practical patterns (wrappers and policies)

Most teams end up with a wrapper like:

  • compose prompt (with budgets),
  • call model (with timeout),
  • validate output (schema, citations),
  • retry with stricter prompt or reduced context,
  • fallback to not_found / needs_clarification / degraded response.

Make this wrapper consistent across the codebase. Reliability comes from standardization.
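The wrapper above can be sketched in a few lines. Everything here is hypothetical plumbing: `retrieve`, `call_model`, and `validate` are callables you supply, and the prompt and fallback shapes are placeholders for your own conventions:

```python
def answer(question, retrieve, call_model, validate, max_retries=2):
    """Sketch of a standard reliability wrapper.

    retrieve(question, k) -> list of context chunks
    call_model(prompt)    -> raw text (may raise TimeoutError)
    validate(raw)         -> parsed result, or None on schema failure
    """
    k = 8  # start with a generous context, shrink on retry
    for attempt in range(max_retries + 1):
        try:
            chunks = retrieve(question, k)
            prompt = f"Answer strictly as JSON.\n{chunks}\n{question}"
            raw = call_model(prompt)
            parsed = validate(raw)
            if parsed is not None:
                return parsed  # valid output: done
        except TimeoutError:
            pass  # fall through to a modified retry
        k = max(2, k // 2)  # retry with reduced context, not a blind repeat
    return {"status": "not_found"}  # bounded, explicit fallback
```

Note that each retry changes something (here, the context size), and the loop always terminates in either a validated answer or an explicit degraded response, never a hang.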

Where to go next