28.4 Regression detection across prompt versions

Goal: detect regressions before users do

Prompt changes are code changes. They need regression testing.

Regression detection answers:

  • “What changed?”
  • “Did it get better or worse?”
  • “Which cases regressed?”
  • “Can we safely ship?”

The meta-rule

Every time you change prompts, models, retrieval, or schema, run the eval set and review diffs. This is how you keep velocity without losing trust.

Treat prompts like code (versioning rules)

Version your prompts explicitly:

  • Prompt files: prompts live as files in the repo, not only in chat history.
  • Semantic versions: bump version when behavior changes meaningfully.
  • Changelogs: record what changed and why (“improved not_found behavior”, “reduced verbosity”).
  • Link to eval runs: store results or references for each version bump.
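
One lightweight way to do this is a small metadata record committed next to each prompt file. A minimal sketch in Python; the `PromptVersion` fields, names, and paths are illustrative, not a prescribed format:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class PromptVersion:
    """Metadata committed next to a prompt file (illustrative fields)."""
    name: str        # e.g. "answer_with_citations"
    version: str     # semantic version, bumped when behavior changes meaningfully
    changelog: str   # what changed and why
    eval_run: str    # pointer to the eval results recorded for this version
    path: Path       # the prompt file itself, versioned in the repo

# Hypothetical record for one prompt in the repo.
ANSWER_PROMPT = PromptVersion(
    name="answer_with_citations",
    version="1.3.0",
    changelog="Improved not_found behavior; reduced verbosity.",
    eval_run="evals/runs/answer_with_citations-1.3.0.json",
    path=Path("prompts/answer_with_citations.txt"),
)

def load_prompt(p: PromptVersion) -> str:
    """Read the prompt text; the version metadata travels with it."""
    return p.path.read_text(encoding="utf-8")
```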

Also version the environment:

  • model name/version,
  • retrieval strategy and embedding model version,
  • corpus version/hash,
  • validator/schema versions.

Otherwise “regression” might be caused by a corpus change or a model upgrade.
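
A small environment manifest written alongside each eval run makes that attribution possible. A sketch, assuming run artifacts live in a directory and the corpus is a folder of files; all field names and values below are placeholders:

```python
import hashlib
import json
from pathlib import Path

def corpus_hash(corpus_dir: str) -> str:
    """Hash corpus contents so silent document changes are detectable."""
    digest = hashlib.sha256()
    for path in sorted(Path(corpus_dir).rglob("*")):
        if path.is_file():
            digest.update(path.read_bytes())
    return digest.hexdigest()[:16]

def write_env_manifest(run_dir: str, corpus_dir: str) -> dict:
    """Record what the eval run depended on (placeholder values)."""
    manifest = {
        "model": "provider-model-2024-06-01",         # model name/version
        "embedding_model": "embed-v2",                # embedding model version
        "retrieval": "hybrid, top_k=8",               # retrieval strategy
        "corpus_hash": corpus_hash(corpus_dir),       # corpus version/hash
        "schema_version": "answer_schema_v4",         # validator/schema version
        "prompt_versions": {"answer_with_citations": "1.3.0"},
    }
    Path(run_dir, "environment.json").write_text(
        json.dumps(manifest, indent=2), encoding="utf-8"
    )
    return manifest
```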

Regression signals to track

Track signals that matter for your product’s promise:

  • Schema validity rate: % of outputs that parse and validate.
  • Citation integrity rate: % of claims with valid citations and quotes.
  • Not-found accuracy: does it abstain when evidence is missing?
  • Conflict rate: are conflicts detected and surfaced appropriately?
  • Rubric scores: correctness, faithfulness, clarity.
  • Latency/cost: p95 time and token usage per request.

Pick a small set of “must not regress” metrics for shipping gates.
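
These rates can be computed directly from recorded eval results. A minimal sketch, assuming each case is a dict with `valid_schema`, `citations_ok`, `expected_not_found`, and `answered` flags (hypothetical field names):

```python
def regression_signals(cases: list[dict]) -> dict:
    """Compute shipping-gate rates from recorded eval cases (assumes a non-empty set)."""
    total = len(cases)
    not_found_cases = [c for c in cases if c["expected_not_found"]]
    correct_abstentions = sum(not c["answered"] for c in not_found_cases)
    return {
        "schema_validity_rate": sum(c["valid_schema"] for c in cases) / total,
        "citation_integrity_rate": sum(c["citations_ok"] for c in cases) / total,
        "not_found_accuracy": (
            correct_abstentions / len(not_found_cases) if not_found_cases else 1.0
        ),
    }
```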

Quality gates (what blocks shipping)

Define gates that block shipping, such as:

  • Schema gate: any decrease in schema-valid outputs blocks.
  • Faithfulness gate: any increase in unfaithful answers blocks.
  • Safety gate: any policy violation blocks.
  • Latency gate: p95 latency above a threshold blocks.

Gate on the most important promise

For grounded systems, that promise is faithfulness. Don’t ship “better tone” at the cost of invented citations.
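
Gates like these reduce to a comparison between a stored baseline and the candidate run. A sketch with placeholder thresholds and metric names:

```python
def check_gates(baseline: dict, candidate: dict, p95_latency_ms: float) -> list[str]:
    """Return gate failures; an empty list means the change may ship."""
    failures = []
    if candidate["schema_validity_rate"] < baseline["schema_validity_rate"]:
        failures.append("schema gate: validity rate decreased")
    if candidate["citation_integrity_rate"] < baseline["citation_integrity_rate"]:
        failures.append("faithfulness gate: citation integrity decreased")
    if candidate.get("policy_violations", 0) > 0:
        failures.append("safety gate: policy violation present")
    if p95_latency_ms > 2000:  # placeholder latency budget in milliseconds
        failures.append("latency gate: p95 above threshold")
    return failures
```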

Workflow: PRs, CI, and review

A production-shaped workflow:

  1. Keep changes small: one hypothesis per change.
  2. Run evals in CI: on the fixed eval set; store results as artifacts.
  3. Diff outputs: highlight which cases changed and how (see the sketch after this list).
  4. Review with rubric: focus on regressed cases first.
  5. Decide: ship or iterate, with a clear reason.
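
Step 3 can be as simple as grouping cases by whether their result dropped, rose, or stayed flat between the two runs. A sketch, assuming both runs are keyed by a stable case id and carry a per-case `score` (an assumption, not a required format):

```python
def diff_runs(baseline: dict[str, dict], candidate: dict[str, dict]) -> dict:
    """Group shared cases into regressed / improved / unchanged by per-case score."""
    regressed, improved, unchanged = [], [], []
    for case_id, base in baseline.items():
        cand = candidate.get(case_id)
        if cand is None:
            continue  # case removed or renamed; review it separately
        if cand["score"] < base["score"]:
            regressed.append(case_id)
        elif cand["score"] > base["score"]:
            improved.append(case_id)
        else:
            unchanged.append(case_id)
    return {"regressed": regressed, "improved": improved, "unchanged": unchanged}
```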

Even if your CI can’t call models, you can still run deterministic gates (schema validation, quote containment on recorded outputs) and run full evals as a separate job.
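
A sketch of such a deterministic check over recorded outputs, assuming each record is a JSON line with `answer`, `quotes`, and the `sources` it cites (illustrative field names):

```python
import json

def quote_contained(quote: str, sources: list[str]) -> bool:
    """A quote passes only if it appears verbatim in one of the cited sources."""
    return any(quote in source for source in sources)

def deterministic_checks(recorded_outputs_path: str) -> bool:
    """Replay recorded outputs through structural and quote-containment checks."""
    ok = True
    with open(recorded_outputs_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # Minimal structural check; a full schema validator would slot in here.
            if not isinstance(record.get("answer"), str):
                ok = False
            for quote in record.get("quotes", []):
                if not quote_contained(quote, record.get("sources", [])):
                    ok = False
    return ok
```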

Handling drift from models and corpora

Two common drift sources:

  • Model drift: provider updates change output style or behavior.
  • Corpus drift: docs change, retrieval returns different chunks, answers change.

Practical mitigations:

  • Pin versions when possible: model version, embedding version.
  • Log versions always: even if you can’t pin, you can attribute regressions.
  • Separate eval types: prompt-only eval with frozen sources vs end-to-end eval with retrieval.
  • Canary deploy: run new versions on a small traffic slice and monitor.
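
The "log versions always" mitigation can be a one-line append per request. A minimal sketch with assumed field names:

```python
import datetime
import json

def log_request_versions(log_path: str, model: str, embedding_model: str,
                         corpus_hash: str, prompt_version: str) -> None:
    """Append the versions behind a request so later drift can be attributed."""
    entry = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model": model,
        "embedding_model": embedding_model,
        "corpus_hash": corpus_hash,
        "prompt_version": prompt_version,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
```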

Debugging regressions systematically

When something regresses:

  1. Is it retrieval? Compare retrieved chunks and filters.
  2. Is it prompt composition? Check whether instructions/sources were packed correctly.
  3. Is it generation? Check model version/settings and output structure.
  4. Is it validation? Did schema/citation checks change?

Regression debugging is pipeline debugging. Avoid random prompt tweaks until you know which layer failed.
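
This layer-by-layer check can be partly automated by comparing two logged runs in pipeline order and reporting the first divergence. A sketch, assuming each run logs retrieved chunk ids, a hash of the composed prompt, model settings, and validator versions (all assumed names):

```python
def first_divergent_layer(old_run: dict, new_run: dict) -> str | None:
    """Walk the pipeline in order and name the first layer whose inputs changed."""
    layers = [
        ("retrieval", "retrieved_chunk_ids"),
        ("prompt composition", "composed_prompt_hash"),
        ("generation", "model_settings"),
        ("validation", "validator_versions"),
    ]
    for layer_name, key in layers:
        if old_run.get(key) != new_run.get(key):
            return layer_name
    return None  # no layer-level difference; compare final outputs directly
```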
