27.5 Snapshot testing with careful update workflows
Goal: detect behavioral drift without brittle tests
Snapshot testing captures an output artifact and alerts you when it changes.
For AI features, snapshot tests are useful when you want:
- a cheap “something changed” signal,
- diff-based review of behavior shifts,
- coverage across many cases without writing assertions for every detail.
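As a minimal sketch (not tied to any particular snapshot framework), a snapshot test can be as small as the following. `generate_summary` is a hypothetical stand-in for the AI feature under test; the helper records the snapshot on the first run, then fails on any later difference so a human reviews the diff:

```python
# Minimal hand-rolled snapshot helper: store the output once, then fail
# whenever it changes so a human reviews the diff.
import json
from pathlib import Path

SNAPSHOT_DIR = Path("tests/snapshots")

def generate_summary(ticket_text: str) -> dict:
    # Hypothetical stand-in for the real AI feature (normally a model call).
    return {"summary": "User cannot log in after password reset.", "priority": "high"}

def assert_matches_snapshot(name: str, value: dict) -> None:
    path = SNAPSHOT_DIR / f"{name}.json"
    current = json.dumps(value, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(current)  # first run: record the snapshot
        return
    stored = path.read_text()
    assert current == stored, f"Snapshot {name!r} changed; review the diff before updating."

def test_ticket_summary_snapshot():
    output = generate_summary("Customer reports login failures after password reset.")
    assert_matches_snapshot("ticket_summary_login_failure", output)
```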
Snapshot tests vs golden tests
The difference is mostly one of intent:
- Golden tests: encode specific decisions and invariants, often with targeted assertions.
- Snapshot tests: capture broader outputs and rely on diff review.
In practice, teams often use both:
- goldens for contracts and critical edge cases,
- snapshots for wider coverage and drift detection.
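The intent difference is easy to see in code. A sketch, assuming a hypothetical `extract_invoice` feature and reusing the `assert_matches_snapshot` helper from above:

```python
SAMPLE_INVOICE_TEXT = "Invoice 1042 from Acme GmbH, total 1240.00 EUR"

def extract_invoice(text: str) -> dict:
    # Hypothetical stand-in for the real extraction feature.
    return {"vendor": "Acme GmbH", "invoice_id": "1042", "total": 1240.0, "currency": "EUR"}

# Golden test: encodes specific decisions and invariants with targeted assertions.
def test_invoice_extraction_golden():
    result = extract_invoice(SAMPLE_INVOICE_TEXT)
    assert result["currency"] == "EUR"             # encoded decision
    assert result["total"] == 1240.0               # critical value
    assert {"vendor", "total"} <= set(result)      # contract: required fields

# Snapshot test: captures the whole output and leans on diff review.
def test_invoice_extraction_snapshot():
    result = extract_invoice(SAMPLE_INVOICE_TEXT)
    assert_matches_snapshot("invoice_extraction_sample", result)
```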
When snapshot testing is useful
Snapshot testing works best when outputs are:
- structured: JSON is easier to diff than free-form text.
- normalized: stable ordering and formatting reduce noise.
- bounded: outputs are not massive.
- reviewable: diffs are small enough for humans to understand.
Great candidates:
- structured summaries,
- extraction outputs,
- tool call plans,
- RAG answers with citations.
Snapshot risks (and how teams ruin them)
Snapshot testing fails when:
- Updates are auto-approved: snapshots become meaningless.
- Snapshots are huge: diffs are unreadable, so reviewers ignore them.
- Outputs are unstable: randomness creates noisy diffs.
- No rubric exists: reviewers don’t know what “better” means.
If your team blindly updates snapshots, you’re training everyone to ignore regressions. Build a review ritual or don’t bother.
Designing snapshots for LLM outputs
Design snapshots to minimize noise and maximize signal:
- Use structured output: snapshot JSON, not prose.
- Canonicalize: stable key ordering, stable array ordering, normalized whitespace (see the sketch after this list).
- Fix randomness: prefer deterministic settings for snapshot runs when possible.
- Snapshot only the “contracted” part: avoid including timestamps, request ids, or other changing fields.
- Split snapshots: store one snapshot per case to keep diffs small.
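To make "canonicalize" and "snapshot only the contracted part" concrete, here is one possible canonicalization pass; the volatile field names are examples, not a required schema:

```python
# Normalize an output before snapshotting: drop volatile fields, collapse
# whitespace, and (optionally) sort arrays whose order is not contractual.
import json

VOLATILE_FIELDS = {"timestamp", "request_id", "latency_ms"}  # example names

def canonicalize(value, sort_lists: bool = False):
    if isinstance(value, dict):
        return {k: canonicalize(v, sort_lists)
                for k, v in value.items() if k not in VOLATILE_FIELDS}
    if isinstance(value, list):
        items = [canonicalize(v, sort_lists) for v in value]
        if sort_lists:  # only for arrays where order is not part of the contract
            items.sort(key=lambda x: json.dumps(x, sort_keys=True))
        return items
    if isinstance(value, str):
        return " ".join(value.split())  # normalize whitespace and newlines
    return value

def to_snapshot_text(value) -> str:
    # sort_keys + indent gives stable key ordering and a line-oriented diff.
    return json.dumps(canonicalize(value), indent=2, sort_keys=True, ensure_ascii=False)
```

Snapshotting `to_snapshot_text(output)` once per case keeps each diff small and focused on the fields you actually promised.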
For RAG, a useful snapshot often includes:
- question,
- retrieved chunk ids (and versions),
- answer JSON with citations.
This lets you see if changes are coming from retrieval or generation.
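A sketch of how such a record might be assembled; the field names are illustrative rather than a fixed schema:

```python
def build_rag_snapshot(question: str, retrieved_chunks, answer_json: dict) -> dict:
    """Assemble a per-case RAG snapshot; retrieved_chunks is a list of (chunk_id, version) pairs."""
    return {
        "question": question,
        "retrieved_chunks": [f"{chunk_id}@{version}" for chunk_id, version in retrieved_chunks],
        "answer": answer_json,  # structured answer, including its citations
    }
```

If retrieval moves, the `retrieved_chunks` lines change in the diff; if only generation moves, the diff stays inside `answer`.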
A safe snapshot update workflow
Use rules that force intentional updates:
- Snapshot updates require a reason: tie it to a prompt change, a model change, or a bug fix (one enforcement sketch follows this list).
- Review the diff with a rubric: what improved? what regressed?
- Limit blast radius: prefer small changes; avoid “rewrite everything” prompt updates.
- Track drift over time: if snapshots change frequently, your system is unstable or too sensitive.
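One way to enforce the first rule is to gate snapshot writes behind explicit flags and record the reason next to the snapshot. A sketch extending the earlier helper, with arbitrary environment variable names:

```python
# Refuse to overwrite a snapshot unless an update is explicitly requested and a
# reason is supplied; the reason file lands in the same review as the snapshot diff.
import os
from pathlib import Path

def write_snapshot(path: Path, content: str) -> None:
    allowed = os.environ.get("UPDATE_SNAPSHOTS") == "1"
    reason = os.environ.get("SNAPSHOT_UPDATE_REASON", "").strip()
    if not (allowed and reason):
        raise AssertionError(
            "Snapshot differs from the stored version. To update intentionally, rerun with "
            "UPDATE_SNAPSHOTS=1 and SNAPSHOT_UPDATE_REASON='prompt change / model upgrade / bug fix'."
        )
    path.write_text(content)
    path.with_suffix(".reason.txt").write_text(reason + "\n")
```

With this in place, a failing comparison stays red until someone reruns the test with both variables set, and the recorded reason is reviewed alongside the diff.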
Use automated checks (schema, citations, quote containment) to fail fast, then use snapshots to review higher-level drift.
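A sketch of that layering for a RAG answer, reusing the earlier helpers; `load_case` and `answer_question` are hypothetical stand-ins for your test fixtures and the feature under test:

```python
# Deterministic checks fail fast with precise messages; the snapshot diff is
# reserved for higher-level drift that needs human judgment.
def check_schema(output: dict) -> None:
    missing = {"answer", "citations"} - set(output)      # example contract
    assert not missing, f"missing required fields: {missing}"

def check_citations(output: dict, retrieved_ids: set) -> None:
    bad = [c for c in output["citations"] if c not in retrieved_ids]
    assert not bad, f"citations not among retrieved chunks: {bad}"

def check_quote_containment(output: dict, source_text: str) -> None:
    for quote in output.get("quotes", []):               # quoted spans must be verbatim
        assert quote in source_text, f"quote not found in source: {quote!r}"

def test_refund_answer():
    case = load_case("refund_window")                    # hypothetical fixture loader
    output = answer_question(case["question"])           # hypothetical feature under test
    check_schema(output)
    check_citations(output, set(case["retrieved_ids"]))
    check_quote_containment(output, case["source_text"])
    assert_matches_snapshot("refund_window", canonicalize(output))
```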