29.5 Observability: traces, metrics, and prompt logs

What to trace, measure, and log so that AI features stay debuggable and their costs stay visible.

Goal: make failures debuggable and costs visible

LLM apps fail in ways that are hard to reproduce unless you log the right context.

Observability is how you answer:

  • “Why did it answer that?”
  • “Why is it slow today?”
  • “Why did costs spike?”
  • “Which prompt version caused this regression?”

This page gives you the minimal observability setup that makes AI features maintainable.

Core principle: every answer should be explainable

For an AI feature, “explainable” means you can reconstruct:

  • the user input and context (redacted if necessary),
  • the retrieval results (chunk ids + scores + doc versions),
  • the prompt version and model settings,
  • the output (or at least the validated JSON),
  • the validation outcome and retry behavior.

This is the difference between systematic debugging and random prompt tweaking.
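One way to keep this checklist honest is to write a single structured record per answer. The sketch below is a minimal Python example; the field names are chosen for illustration, not a required schema, so map it onto whatever logging pipeline you already have.

```python
from dataclasses import dataclass, field

@dataclass
class AnswerRecord:
    """Everything needed to reconstruct how one answer was produced.

    Field names are illustrative, not a mandated schema.
    """
    request_id: str
    user_input_redacted: str                     # PII/secrets removed before logging
    retrieved_chunks: list[dict] = field(default_factory=list)  # {"chunk_id", "score", "doc_version"}
    prompt_version: str = ""
    model: str = ""
    model_settings: dict = field(default_factory=dict)          # temperature, max tokens, ...
    output_json: dict | None = None              # the validated output, not the raw stream
    validation_passed: bool = False
    retries: int = 0
```

Whether this lives in your tracing system, a database, or structured logs matters less than storing it consistently, keyed by request id.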

Link to auditing

For grounded systems, observability overlaps with audit logs. The same data that helps you debug also helps you prove which sources influenced answers.

Metrics to track (baseline dashboard)

Start with a small set of metrics (a minimal sketch of emitting them follows the list):

  • Request volume: requests per minute per route/feature.
  • Latency: p50/p95/p99 end-to-end latency.
  • Error rate: timeouts, 429s, 5xx, validation failures.
  • Retry rate: how often retries happen and how often they succeed.
  • Not-found rate: how often the system abstains (a sudden shift here often points to retrieval issues).
  • Token usage: input tokens, output tokens, cost per request.
  • Cache hit rate: for retrieval and generation caches.
  • RAG-specific: average retrieved chunks, average prompt context size.
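
As referenced above, here is a minimal sketch of emitting a few of these as counters and histograms, assuming the OpenTelemetry Python metrics API; any metrics client works, and the metric names and attributes are placeholders, not a standard.

```python
from opentelemetry import metrics

meter = metrics.get_meter("llm.app")  # instrument name is arbitrary

requests = meter.create_counter("llm.requests", description="Requests per route/feature")
latency = meter.create_histogram("llm.latency_ms", unit="ms", description="End-to-end latency")
tokens_in = meter.create_counter("llm.tokens.input")
tokens_out = meter.create_counter("llm.tokens.output")

def record_request(route: str, elapsed_ms: float,
                   input_tokens: int, output_tokens: int,
                   error: str | None = None) -> None:
    attrs = {"route": route, "error": error or "none"}
    requests.add(1, attrs)
    latency.record(elapsed_ms, attrs)
    tokens_in.add(input_tokens, attrs)
    tokens_out.add(output_tokens, attrs)
```

The percentiles (p50/p95/p99) and rates come from your metrics backend aggregating the histogram and counters; the application only needs to emit raw measurements.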

Tracing: break down latency by pipeline stage

Trace spans should separate major stages:

  • request received
  • input normalization
  • retrieval
  • reranking (optional)
  • prompt composition
  • model call
  • validation
  • response rendering

This lets you answer “is slowness coming from retrieval or the model?” immediately.
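
Here is a sketch of stage-level spans, assuming the OpenTelemetry tracing API; the span names, attributes, and placeholder pipeline functions are ours, not a convention you have to adopt.

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm.app")

# Placeholder stages -- substitute your real retriever, prompt builder,
# model client, and validator.
def retrieve(q): return [{"chunk_id": "doc-1#3", "score": 0.82}]
def compose_prompt(q, chunks): return f"Context: {chunks}\nQuestion: {q}"
def call_model(prompt): return '{"answer": "..."}'
def validate(raw): return raw

def answer(question: str) -> str:
    with tracer.start_as_current_span("input_normalization"):
        normalized = question.strip()

    with tracer.start_as_current_span("retrieval") as span:
        chunks = retrieve(normalized)
        span.set_attribute("rag.chunk_count", len(chunks))

    with tracer.start_as_current_span("prompt_composition") as span:
        prompt = compose_prompt(normalized, chunks)
        span.set_attribute("prompt.version", "v12")   # whatever version id you use

    with tracer.start_as_current_span("model_call") as span:
        raw = call_model(prompt)
        span.set_attribute("model.name", "your-model")

    with tracer.start_as_current_span("validation"):
        return validate(raw)
```

With spans named like this, a single trace view shows whether a slow request spent its time in retrieval, prompt composition, or the model call.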

Prompt and output logs (safe logging)

Log enough to debug, but not so much that you leak data.

Practical safe logging rules:

  • Log identifiers and versions: prompt_version, model, corpus/index version.
  • Log chunk ids, not full text: store references to sources; store text only in controlled systems.
  • Redact user inputs: remove PII/secrets where needed.
  • Log validation outcomes: schema pass/fail, citation checks, retries.
  • Sample full payloads: only in controlled environments or with strict retention.

Logs are part of your threat model

Prompt logs can contain sensitive user and company data. If you don’t have a policy, default to logging ids/hashes and keep raw text out of logs.
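
A minimal sketch of that default, using only the standard library: log identifiers, versions, and a hash of the input instead of raw text. The record shape is an assumption, not a mandated format.

```python
import hashlib
import json
import logging

logger = logging.getLogger("llm.prompt_log")

def fingerprint(text: str) -> str:
    """Stable hash so identical inputs can be correlated without storing the text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

def log_request(request_id: str, user_input: str, chunks: list[dict],
                prompt_version: str, model: str,
                validation_passed: bool, retries: int) -> None:
    record = {
        "request_id": request_id,
        "input_hash": fingerprint(user_input),        # no raw user text in logs
        "chunk_ids": [c["chunk_id"] for c in chunks],  # references, not chunk text
        "prompt_version": prompt_version,
        "model": model,
        "validation_passed": validation_passed,
        "retries": retries,
    }
    logger.info(json.dumps(record))
```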

Alerts and SLOs

Alerts should focus on user impact and cost:

  • Latency SLO breach: p95 above threshold.
  • Error spike: timeouts/429s/5xx above baseline.
  • Validation failure spike: schema failures suddenly increase (prompt/model mismatch).
  • Cost spike: tokens per request or daily spend exceeds budget.
  • Cache drop: cache hit rate collapses (can indicate key/version bugs).

When alerts fire, you should have an incident workflow and runbook (see Part VI incident response patterns).
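
Most metrics backends can express these alerts directly. If yours can't yet, a periodic check like the sketch below is enough to start; the thresholds are invented for illustration and should come from your own SLOs and budget.

```python
from statistics import quantiles

def p95(latencies_ms: list[float]) -> float:
    # quantiles(..., n=20) returns 19 cut points; index 18 is the 95th percentile
    return quantiles(latencies_ms, n=20)[18]

def check_slos(latencies_ms: list[float], error_count: int, request_count: int,
               daily_spend_usd: float) -> list[str]:
    """Return human-readable alerts. Thresholds are illustrative placeholders."""
    alerts = []
    if len(latencies_ms) >= 2 and p95(latencies_ms) > 4000:
        alerts.append("p95 latency above 4s SLO")
    if request_count and error_count / request_count > 0.02:
        alerts.append("error rate above 2% baseline")
    if daily_spend_usd > 200:
        alerts.append("daily spend above budget")
    return alerts
```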

Debug workflow: from user report to root cause

When a user reports “it answered wrong” or “it’s slow”:

  1. Find the request: by request id or timestamp.
  2. Inspect retrieval: which chunks were retrieved and why?
  3. Inspect prompt version and model: did a recent deployment change behavior?
  4. Inspect validation: did output barely pass, or did retries occur?
  5. Reproduce with the same artifacts: same set of sources, same prompt version, and the same model if possible.
  6. Fix the correct layer: retrieval, chunking, prompt, validator, or UX policy.

This workflow turns “LLMs are weird” into “the pipeline did X and we can change it.”
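
If the records above are queryable by request id, steps 1 through 4 collapse into a single lookup. The helper below is a sketch against a hypothetical log_store accessor; substitute whatever log or trace backend you actually use.

```python
def debug_summary(log_store, request_id: str) -> str:
    """Pull one request's record and summarize the usual suspects.

    log_store.get(request_id) is a hypothetical accessor returning the
    structured record written at request time (see the logging sketch above).
    """
    rec = log_store.get(request_id)
    return "\n".join([
        f"prompt_version: {rec['prompt_version']}  model: {rec['model']}",
        f"retrieved chunks: {', '.join(rec['chunk_ids']) or 'none'}",
        f"validation passed: {rec['validation_passed']}  retries: {rec['retries']}",
    ])
```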

Where to go next