27.5 Snapshot testing with careful update workflows


Goal: detect behavioral drift without brittle tests

Snapshot testing captures an output artifact and alerts you when it changes.

For AI features, snapshot tests are useful when you want:

  • a cheap “something changed” signal,
  • diff-based review of behavior shifts,
  • coverage across many cases without writing assertions for every detail.

Snapshot tests vs golden tests

The difference is mostly intent:

  • Golden tests: encode specific decisions and invariants, often with targeted assertions.
  • Snapshot tests: capture broader outputs and rely on diff review.

In practice, teams often use both:

  • goldens for contracts and critical edge cases,
  • snapshots for wider coverage and drift detection.

When snapshot testing is useful

Snapshot testing works best when outputs are:

  • structured: JSON is easier to diff than free-form text.
  • normalized: stable ordering and formatting reduce noise.
  • bounded: outputs are small enough to store and diff.
  • reviewable: diffs are small enough for humans to understand.

Great candidates:

  • structured summaries,
  • extraction outputs,
  • tool call plans,
  • RAG answers with citations.
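For any of these candidates, the mechanics are the same: render the output deterministically, store it once, and compare on every run. A minimal file-based sketch (the `check_snapshot` helper and its signature are illustrative, not from any particular framework):

```python
import json
from pathlib import Path


def check_snapshot(name: str, output: dict, snapshot_dir: Path) -> bool:
    """Compare `output` to the stored snapshot; record it on first run.

    Returns True when the output matches (or a new baseline was just
    created), False when the output has drifted from the snapshot.
    """
    path = snapshot_dir / f"{name}.json"
    rendered = json.dumps(output, indent=2, sort_keys=True) + "\n"
    if not path.exists():
        path.write_text(rendered)  # first run: record the baseline
        return True
    return path.read_text() == rendered
```

On a mismatch, a real harness would print the diff rather than just return False; snapshot plugins for pytest and similar runners do this for you.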

Snapshot risks (and how teams ruin them)

Snapshot testing fails when:

  • Updates are auto-approved: snapshots become meaningless.
  • Snapshots are huge: diffs are unreadable, so reviewers ignore them.
  • Outputs are unstable: randomness creates noisy diffs.
  • No rubric exists: reviewers don’t know what “better” means.

Snapshots without review are worse than no snapshots

If your team blindly updates snapshots, you’re training everyone to ignore regressions. Build a review ritual or don’t bother.

Designing snapshots for LLM outputs

Design snapshots to minimize noise and maximize signal:

  • Use structured output: snapshot JSON, not prose.
  • Canonicalize: stable key ordering, stable array ordering, normalized whitespace.
  • Fix randomness: prefer deterministic settings for snapshot runs when possible.
  • Snapshot only the “contracted” part: avoid including timestamps, request ids, or other changing fields.
  • Split snapshots: store one snapshot per case to keep diffs small.
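The canonicalization and field-stripping steps above can be sketched as one recursive pass over the decoded JSON. The volatile field names below are placeholders; substitute whatever your pipeline actually emits:

```python
import json

# Illustrative names: list whatever changing fields your pipeline emits
# (timestamps, request ids, latencies, ...).
VOLATILE_FIELDS = {"timestamp", "request_id"}


def canonicalize(value):
    """Recursively normalize decoded JSON before snapshotting it."""
    if isinstance(value, dict):
        return {
            k: canonicalize(value[k])
            for k in sorted(value)
            if k not in VOLATILE_FIELDS
        }
    if isinstance(value, list):
        items = [canonicalize(v) for v in value]
        # Sort by serialized form for stable ordering. Only do this for
        # arrays whose order carries no meaning (tags, citation sets),
        # not for ranked results or ordered steps.
        return sorted(items, key=lambda v: json.dumps(v, sort_keys=True))
    if isinstance(value, str):
        return " ".join(value.split())  # collapse whitespace runs
    return value
```

Run every output through `canonicalize` before comparing or storing it, so diffs reflect behavior changes rather than serialization noise.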

For RAG, a useful snapshot often includes:

  • question,
  • retrieved chunk ids (and versions),
  • answer JSON with citations.

This lets you see if changes are coming from retrieval or generation.
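One way to assemble such a record (a hypothetical helper; the field names are one reasonable convention, not a standard):

```python
def rag_snapshot(question: str, retrieved: list, answer: dict) -> dict:
    """Build the snapshot record for one RAG case.

    `retrieved` is a list of (chunk_id, version) pairs from the
    retriever; `answer` is the parsed answer JSON with its citations.
    """
    return {
        "question": question,
        "retrieved": [
            {"chunk_id": cid, "version": ver} for cid, ver in retrieved
        ],
        "answer": answer,
    }
```

A diff that touches `retrieved` points at the retrieval side; a diff confined to `answer` points at generation.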

A safe snapshot update workflow

Use rules that force intentional updates:

  1. Snapshot updates require a reason: tie to prompt change, model change, or bug fix.
  2. Review the diff with a rubric: what improved? what regressed?
  3. Limit blast radius: prefer small changes; avoid “rewrite everything” prompt updates.
  4. Track drift over time: if snapshots change frequently, your system is unstable or too sensitive.
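Rules 1 and 4 can be enforced mechanically. A sketch of an update helper that refuses reason-free updates and logs every update to a sidecar file (the helper and log format are assumptions, not a standard tool):

```python
import json
from datetime import date
from pathlib import Path


def update_snapshot(path: Path, new_output: dict, reason: str) -> None:
    """Overwrite a snapshot, refusing updates that carry no reason."""
    if not reason.strip():
        raise ValueError(
            "snapshot updates require a reason "
            "(prompt change, model change, or bug fix)"
        )
    # Append to a sidecar log; a fast-growing log is the drift signal
    # from rule 4.
    log_path = path.with_suffix(".log")
    with log_path.open("a") as f:
        f.write(f"{date.today().isoformat()}\t{reason}\n")
    path.write_text(json.dumps(new_output, indent=2, sort_keys=True) + "\n")
```

Counting lines in the log per month gives you a crude but honest drift-rate metric.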

Pair snapshots with automated gates

Use automated checks (schema, citations, quote containment) to fail fast, then use snapshots to review higher-level drift.
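The ordering matters: cheap deterministic gates run first, and the snapshot diff is consulted only when they pass. A sketch, assuming an answer dict with `text` and `citations` keys and any snapshot-comparison callable:

```python
def gate_then_snapshot(answer: dict, retrieved_ids: set, snapshot_check) -> bool:
    """Fail fast on cheap automated gates, then check the snapshot."""
    # Gate 1: schema -- the contracted keys must be present.
    if not {"text", "citations"} <= answer.keys():
        raise AssertionError("schema gate: missing required keys")
    # Gate 2: every citation must point at a retrieved chunk.
    if not set(answer["citations"]) <= retrieved_ids:
        raise AssertionError("citation gate: citation to unknown chunk")
    # Gates passed: the snapshot comparison now reviews higher-level drift.
    return snapshot_check(answer)
```

A schema-or-citation failure is an unambiguous bug and should fail CI outright; only the snapshot mismatch is a judgment call that goes to human review.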

Where to go next