13.4 Logging and error handling patterns for LLM calls
On this page
- Goal: make LLM calls diagnosable and resilient
- Error taxonomy (what can go wrong)
- Timeouts (must-have)
- Retries with backoff (must be disciplined)
- Invalid output handling (schemas, parsing failures)
- Safety blocks and refusals (normal outcome)
- Logging fields that actually help
- User-facing error behavior (don’t leak details)
- Minimal metrics (the few that matter)
- Copy-paste templates
- Where to go next
Goal: make LLM calls diagnosable and resilient
LLM calls will fail sometimes. Your job is to make failure:
- visible: you can tell what happened and why,
- bounded: failures don’t cascade and take down the app,
- recoverable: retries and fallbacks exist,
- safe: logs don’t leak sensitive data.
This page gives you concrete patterns you can implement in your wrapper layer (13.3).
If you don’t categorize errors, set timeouts, and cap retries, you’ll experience LLM behavior as random. With the right plumbing, it becomes predictable.
Error taxonomy (what can go wrong)
Start with a simple taxonomy. A practical set:
- auth_error: bad credentials, wrong project, missing permissions.
- rate_limit: 429s, tokens-per-minute, concurrency throttles.
- timeout: request took too long; network hung.
- network_error: transient connectivity issues.
- blocked: safety refusal / filtered content.
- invalid_request: your request is malformed (bad params, too long).
- invalid_output: output can’t be parsed/validated (bad JSON, schema mismatch).
- unknown: catch-all with request id for investigation.
The important part is that your code returns categories, not just “error.” Categories drive correct retries and correct UX.
Your wrapper should return an explicit status or error_category so callers can handle it. Don’t bury it in log text.
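A minimal sketch of this taxonomy in Python, assuming a wrapper that hands back a result object; the names ErrorCategory and LLMResult are illustrative, not from any specific library:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ErrorCategory(str, Enum):
    """Outcome categories the wrapper returns instead of a bare 'error'."""
    OK = "ok"
    AUTH_ERROR = "auth_error"
    RATE_LIMIT = "rate_limit"
    TIMEOUT = "timeout"
    NETWORK_ERROR = "network_error"
    BLOCKED = "blocked"
    INVALID_REQUEST = "invalid_request"
    INVALID_OUTPUT = "invalid_output"
    UNKNOWN = "unknown"


@dataclass
class LLMResult:
    """What the wrapper hands back to callers: an explicit category, not log text."""
    category: ErrorCategory
    text: Optional[str] = None        # model output when category is OK
    request_id: Optional[str] = None  # correlation id for investigation
    detail: Optional[str] = None      # safe, non-sensitive detail ("missing field X")
```

Callers branch on result.category to pick the right retry and UX path, instead of string-matching log messages.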
Timeouts (must-have)
Without timeouts:
- requests hang,
- concurrency grows,
- users spam retries,
- your app becomes unstable.
Practical guidance:
- set a default timeout for all model calls,
- expose timeout as a config value (env var),
- log timeouts as their own category,
- prefer a fast failure + retry message over infinite spinners.
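A minimal sketch of these rules, assuming Python with the requests library; the env var MODEL_TIMEOUT_SECONDS and the endpoint URL are placeholders:

```python
import os

import requests

# Expose the timeout as config; fall back to a sane default.
DEFAULT_TIMEOUT_S = float(os.environ.get("MODEL_TIMEOUT_SECONDS", "30"))


def call_model(payload: dict, timeout_s: float = DEFAULT_TIMEOUT_S) -> dict:
    """Call the model endpoint with a hard timeout instead of hanging forever."""
    try:
        resp = requests.post(
            "https://example.invalid/v1/generate",  # placeholder endpoint
            json=payload,
            timeout=timeout_s,  # applies to both connect and read phases
        )
        if resp.status_code == 429:
            return {"category": "rate_limit"}
        if resp.status_code in (401, 403):
            return {"category": "auth_error"}
        resp.raise_for_status()
        return {"category": "ok", "body": resp.json()}
    except requests.Timeout:
        # Timeouts get their own category so they show up in metrics.
        return {"category": "timeout"}
    except requests.ConnectionError:
        return {"category": "network_error"}
    except requests.RequestException:
        return {"category": "unknown"}
```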
Retries with backoff (must be disciplined)
Retries should be:
- selective: only retry retryable errors (rate limit, transient network, some timeouts),
- backed off: exponential backoff with jitter,
- capped: maximum attempts to prevent storms,
- observable: log attempt count and category.
When not to retry automatically
- auth_error: won’t fix itself.
- invalid_request: your code/spec is wrong.
- blocked: repeating the same unsafe request won’t help.
- invalid_output: sometimes retrying helps (if model flaked), but cap aggressively and consider switching to a stricter schema or lower temperature.
Blind retries amplify load, increase cost, and often make rate limiting worse. Always use backoff + caps.
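A minimal sketch of a disciplined retry loop, assuming the call_model sketch from the timeout section above; the constants are illustrative starting points:

```python
import logging
import random
import time

logger = logging.getLogger("llm_calls")

RETRYABLE = {"rate_limit", "network_error", "timeout"}
MAX_ATTEMPTS = 3
BASE_DELAY_S = 1.0


def call_with_retries(payload: dict) -> dict:
    """Selective, capped, backed-off retries around call_model (sketched above)."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        result = call_model(payload)
        category = result["category"]
        logger.info("attempt=%d category=%s", attempt, category)  # observable
        if category == "ok" or category not in RETRYABLE:
            return result  # success, or non-retryable error: stop immediately
        if attempt < MAX_ATTEMPTS:
            # Exponential backoff with full jitter prevents retry storms.
            delay = BASE_DELAY_S * (2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))
    return result  # last failure after hitting the attempt cap
```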
Invalid output handling (schemas, parsing failures)
LLM outputs are not guaranteed to match your expectations, even if the model is “good.” Handle invalid outputs as a normal error category.
A robust invalid-output flow
- Attempt parse: JSON parse or structured parse.
- Validate schema: required fields, enums, types.
- If invalid: return invalid_output with safe details (e.g., “missing field X”).
- Optionally retry once: with a stricter “repair” prompt or lower temperature.
- Fallback: return a user-friendly error or a safe partial output.
Most invalid outputs come from ambiguous prompts or weak schemas. Tighten the contract before blaming the model.
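A minimal sketch of the parse-then-validate flow, assuming the model returns JSON with a couple of required fields; the field names are illustrative:

```python
import json

REQUIRED_FIELDS = {"title", "summary"}  # illustrative schema


def parse_output(raw_text: str) -> dict:
    """Parse and validate model output; treat failures as invalid_output."""
    # Step 1: attempt parse.
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return {"category": "invalid_output", "detail": "output is not valid JSON"}
    # Step 2: validate schema (required fields, types).
    if not isinstance(data, dict):
        return {"category": "invalid_output", "detail": "output is not a JSON object"}
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        # Safe detail: field names only, never the raw model output.
        return {"category": "invalid_output", "detail": f"missing fields: {sorted(missing)}"}
    return {"category": "ok", "data": data}
```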
Safety blocks and refusals (normal outcome)
Safety behavior should be handled like any other outcome type:
- return blocked status,
- show a refusal-aware UX state,
- offer safe alternatives or clarifying questions,
- log category codes/metadata (not raw content).
Do not treat safety blocks as “mysterious errors.” They’re part of normal product behavior.
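A minimal sketch of treating a block as a normal outcome; the response copy is illustrative:

```python
def handle_blocked(result: dict) -> dict:
    """Turn a blocked outcome into a refusal-aware UX state, not a crash."""
    if result["category"] != "blocked":
        return result
    # Log only the category and correlation metadata, never the raw content.
    return {
        "category": "blocked",
        "user_message": (
            "We can't help with that request. "
            "Try rephrasing it, or ask a related question we can answer."
        ),
    }
```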
Logging fields that actually help
Early logs should answer: “what happened, which prompt/model, how long did it take, what category did it fail with?”
Minimal log fields (good default)
- request_id: correlation id
- timestamp
- app_env: dev/staging/prod
- model: name/version
- prompt_version: id/version string
- latency_ms
- attempt (retry attempt number)
- outcome_category: ok / rate_limit / timeout / blocked / invalid_output / ...
- token_estimates: input/output sizes (approx is fine early)
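A minimal sketch of emitting these fields as one structured JSON record per call, using Python's standard logging module; the field values and logger name are illustrative:

```python
import json
import logging
import time

logger = logging.getLogger("llm_calls")


def log_llm_call(request_id: str, outcome_category: str, latency_ms: int,
                 attempt: int, model: str, prompt_version: str,
                 input_tokens: int, output_tokens: int,
                 app_env: str = "dev") -> None:
    """Emit one structured record per call; no raw prompts or outputs."""
    # Callers generate request_id once per user request, e.g. str(uuid.uuid4()).
    record = {
        "request_id": request_id,
        "timestamp": time.time(),
        "app_env": app_env,                # dev/staging/prod
        "model": model,                    # name/version
        "prompt_version": prompt_version,  # reproduce behavior without full text
        "latency_ms": latency_ms,
        "attempt": attempt,                # retry attempt number
        "outcome_category": outcome_category,
        "token_estimates": {"input": input_tokens, "output": output_tokens},
    }
    logger.info(json.dumps(record))
```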
What to avoid logging by default
- raw prompts (unless you have strict controls),
- raw user inputs (often sensitive),
- raw model outputs (may contain user data),
- headers and credentials.
Log prompt_version and schema_version so you can reproduce behavior without storing the full text everywhere.
User-facing error behavior (don’t leak details)
Users need clarity, not stack traces. Good UX rules:
- Be specific at a high level: “rate limited” vs “something went wrong.”
- Give next steps: “try again in a moment” or “reduce input size.”
- Don’t leak internals: no raw provider errors or request payloads in UI.
- Make retry explicit: a retry button that respects backoff is better than encouraging spam clicks.
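A minimal sketch of a category-to-message mapping that follows these rules; the copy is illustrative:

```python
USER_MESSAGES = {
    "rate_limit": "We're handling a lot of requests right now. Please try again in a moment.",
    "timeout": "This request took too long. Please try again.",
    "invalid_request": "Your input couldn't be processed. Try reducing its size.",
    "blocked": "We can't help with that request. Try rephrasing it.",
}

# Internals (auth_error, unknown, raw provider errors) never reach the UI.
FALLBACK_MESSAGE = "Something went wrong on our side. Please try again later."


def user_message_for(category: str) -> str:
    return USER_MESSAGES.get(category, FALLBACK_MESSAGE)
```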
Minimal metrics (the few that matter)
If you track only a few things, track:
- success rate: % ok
- p50/p95 latency: how long calls take
- error rate by category: rate_limit vs timeout vs invalid_output
- calls per success: retries inflate this
- tokens per success: cost proxy
These metrics will tell you where to invest: prompt size, caching, retries, model selection, or schema tightening.
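A minimal sketch of computing these from per-call log records like the ones above, using only the standard library; statistics.quantiles needs at least two records:

```python
from collections import Counter
from statistics import quantiles


def summarize(calls: list[dict]) -> dict:
    """Compute the handful of metrics that matter from per-call log records."""
    ok = [c for c in calls if c["outcome_category"] == "ok"]
    cuts = quantiles(sorted(c["latency_ms"] for c in calls), n=100)
    n_ok = len(ok) or 1  # avoid division by zero when nothing succeeded
    total_tokens = sum(
        c["token_estimates"]["input"] + c["token_estimates"]["output"] for c in calls
    )
    return {
        "success_rate": len(ok) / len(calls),
        "p50_latency_ms": cuts[49],
        "p95_latency_ms": cuts[94],
        "errors_by_category": {
            k: v
            for k, v in Counter(c["outcome_category"] for c in calls).items()
            if k != "ok"
        },
        "calls_per_success": len(calls) / n_ok,    # retries inflate this
        "tokens_per_success": total_tokens / n_ok,  # rough cost proxy
    }
```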
Copy-paste templates
Template: retry policy (drop-in text)
Retry policy:
- Retryable: rate_limit, transient network errors, some timeouts
- Max attempts: 3
- Backoff: exponential + jitter
- Non-retryable: auth_error, invalid_request, blocked
- Invalid output: at most 1 retry with stricter schema/repair prompt
- Log: request_id, attempt, outcome_category, latency_ms
Template: outcome categories
Outcome categories:
- ok
- blocked
- rate_limit
- timeout
- network_error
- invalid_request
- invalid_output
- auth_error
- unknown
Template: user-facing copy
We couldn’t complete this request (timeout / rate limit).
Please wait a moment and try again.
If this keeps happening, reduce input size or try later.