27.5 Snapshot testing with careful update workflows


Goal: detect behavioral drift without brittle tests

Snapshot testing captures an output artifact and alerts you when it changes.

For AI features, snapshot tests are useful when you want:

  • a cheap “something changed” signal,
  • diff-based review of behavior shifts,
  • coverage across many cases without writing assertions for every detail.

Snapshot tests vs golden tests

The difference is mostly intent:

  • Golden tests: encode specific decisions and invariants, often with targeted assertions.
  • Snapshot tests: capture broader outputs and rely on diff review.

In practice, teams often use both:

  • goldens for contracts and critical edge cases,
  • snapshots for wider coverage and drift detection.

When snapshot testing is useful

Snapshot testing works best when outputs are:

  • structured: JSON is easier to diff than free-form text.
  • normalized: stable ordering and formatting reduce noise.
  • bounded: outputs are small enough to store and diff.
  • reviewable: diffs are small enough for humans to understand.

Great candidates:

  • structured summaries,
  • extraction outputs,
  • tool call plans,
  • RAG answers with citations.
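For any of these candidates, the mechanics are the same: render the output deterministically, store it once, and compare on every run. A minimal file-based sketch (the `check_snapshot` helper and its signature are illustrative, not from any particular framework):

```python
import json
from pathlib import Path


def check_snapshot(name: str, output: dict, snapshot_dir: Path) -> bool:
    """Compare `output` to the stored snapshot; record it on first run.

    Returns True when the output matches (or a new baseline was just
    created), False when the output has drifted from the snapshot.
    """
    path = snapshot_dir / f"{name}.json"
    rendered = json.dumps(output, indent=2, sort_keys=True) + "\n"
    if not path.exists():
        path.write_text(rendered)  # first run: record the baseline
        return True
    return path.read_text() == rendered
```

On a mismatch, a real harness would print the diff rather than just return False; snapshot plugins for pytest and similar runners do this for you.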

Snapshot risks (and how teams ruin them)

Snapshot testing fails when:

  • Updates are auto-approved: snapshots become meaningless.
  • Snapshots are huge: diffs are unreadable, so reviewers ignore them.
  • Outputs are unstable: randomness creates noisy diffs.
  • No rubric exists: reviewers don’t know what “better” means.

Snapshots without review are worse than no snapshots

If your team blindly updates snapshots, you’re training everyone to ignore regressions. Build a review ritual or don’t bother.

Designing snapshots for LLM outputs

Design snapshots to minimize noise and maximize signal:

  • Use structured output: snapshot JSON, not prose.
  • Canonicalize: stable key ordering, stable array ordering, normalized whitespace.
  • Fix randomness: prefer deterministic settings for snapshot runs when possible.
  • Snapshot only the “contracted” part: avoid including timestamps, request ids, or other changing fields.
  • Split snapshots: store one snapshot per case to keep diffs small.
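The canonicalization and field-stripping steps above can be sketched as one recursive pass over the decoded JSON. The volatile field names below are placeholders; substitute whatever your pipeline actually emits:

```python
import json

# Illustrative names: list whatever changing fields your pipeline emits
# (timestamps, request ids, latencies, ...).
VOLATILE_FIELDS = {"timestamp", "request_id"}


def canonicalize(value):
    """Recursively normalize decoded JSON before snapshotting it."""
    if isinstance(value, dict):
        return {
            k: canonicalize(value[k])
            for k in sorted(value)
            if k not in VOLATILE_FIELDS
        }
    if isinstance(value, list):
        items = [canonicalize(v) for v in value]
        # Sort by serialized form for stable ordering. Only do this for
        # arrays whose order carries no meaning (tags, citation sets),
        # not for ranked results or ordered steps.
        return sorted(items, key=lambda v: json.dumps(v, sort_keys=True))
    if isinstance(value, str):
        return " ".join(value.split())  # collapse whitespace runs
    return value
```

Run every output through `canonicalize` before comparing or storing it, so diffs reflect behavior changes rather than serialization noise.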

For RAG, a useful snapshot often includes:

  • question,
  • retrieved chunk ids (and versions),
  • answer JSON with citations.

This lets you see if changes are coming from retrieval or generation.
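One way to assemble such a record (a hypothetical helper; the field names are one reasonable convention, not a standard):

```python
def rag_snapshot(question: str, retrieved: list, answer: dict) -> dict:
    """Build the snapshot record for one RAG case.

    `retrieved` is a list of (chunk_id, version) pairs from the
    retriever; `answer` is the parsed answer JSON with its citations.
    """
    return {
        "question": question,
        "retrieved": [
            {"chunk_id": cid, "version": ver} for cid, ver in retrieved
        ],
        "answer": answer,
    }
```

A diff that touches `retrieved` points at the retrieval side; a diff confined to `answer` points at generation.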

A safe snapshot update workflow

Use rules that force intentional updates:

  1. Snapshot updates require a reason: tie to prompt change, model change, or bug fix.
  2. Review the diff with a rubric: what improved? what regressed?
  3. Limit blast radius: prefer small changes; avoid “rewrite everything” prompt updates.
  4. Track drift over time: if snapshots change frequently, your system is unstable or too sensitive.
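Rules 1 and 4 can be enforced mechanically. A sketch of an update helper that refuses reason-free updates and logs every update to a sidecar file (the helper and log format are assumptions, not a standard tool):

```python
import json
from datetime import date
from pathlib import Path


def update_snapshot(path: Path, new_output: dict, reason: str) -> None:
    """Overwrite a snapshot, refusing updates that carry no reason."""
    if not reason.strip():
        raise ValueError(
            "snapshot updates require a reason "
            "(prompt change, model change, or bug fix)"
        )
    # Append to a sidecar log; a fast-growing log is the drift signal
    # from rule 4.
    log_path = path.with_suffix(".log")
    with log_path.open("a") as f:
        f.write(f"{date.today().isoformat()}\t{reason}\n")
    path.write_text(json.dumps(new_output, indent=2, sort_keys=True) + "\n")
```

Counting lines in the log per month gives you a crude but honest drift-rate metric.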

Pair snapshots with automated gates

Use automated checks (schema, citations, quote containment) to fail fast, then use snapshots to review higher-level drift.
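The ordering matters: cheap deterministic gates run first, and the snapshot diff is consulted only when they pass. A sketch, assuming an answer dict with `text` and `citations` keys and any snapshot-comparison callable:

```python
def gate_then_snapshot(answer: dict, retrieved_ids: set, snapshot_check) -> bool:
    """Fail fast on cheap automated gates, then check the snapshot."""
    # Gate 1: schema -- the contracted keys must be present.
    if not {"text", "citations"} <= answer.keys():
        raise AssertionError("schema gate: missing required keys")
    # Gate 2: every citation must point at a retrieved chunk.
    if not set(answer["citations"]) <= retrieved_ids:
        raise AssertionError("citation gate: citation to unknown chunk")
    # Gates passed: the snapshot comparison now reviews higher-level drift.
    return snapshot_check(answer)
```

A schema-or-citation failure is an unambiguous bug and should fail CI outright; only the snapshot mismatch is a judgment call that goes to human review.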

Where to go next