19.2 Reproducing bugs with minimal test cases

Goal: one-command reproduction

Your goal is to produce a reproduction that is:

  • reliable (fails consistently),
  • small (minimal inputs),
  • fast (runs quickly),
  • portable (a teammate can run it).

Ideally it is one command or one test.

Repro is how you stop guessing

Without a repro, you can’t know if a fix worked. With a repro, you can iterate safely and quickly.

Why reproduction is the highest leverage step

Reproduction converts an incident from “mysterious behavior” into “a failing check.” Once you have a failing check:

  • hypotheses become testable,
  • fixes become verifiable,
  • regressions become preventable.

The minimal reproducible error (MRE) procedure

  1. Start from a failing case: one request id / one input sample.
  2. Make it local: reproduce outside production, in dev or staging if possible.
  3. Reduce variables: disable concurrency, retries, and randomness.
  4. Pin versions: runtime version, dependency versions, prompt versions.
  5. Minimize input: remove irrelevant parts while keeping failure.
  6. Write it down: one command with expected vs actual output.

Once this is done, you can ask the model for fixes with much more confidence, because every proposed fix can be checked against the repro.
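
As a rough illustration of step 6, a repro can be a small script that runs the minimized case once and reports expected vs actual output. This is only a sketch: `average`, the seed value, and the sample input are hypothetical stand-ins for your own failing code path, not part of any real project.

```python
import platform
import random

SEED = 1234  # assumption: any fixed value works; it pins randomness (step 3)


def average(values):
    """Hypothetical stand-in for the code under investigation.

    Deliberately buggy: floor division drops the fractional part.
    """
    return sum(values) // len(values)


def run_repro() -> int:
    """Run the minimized case once; return 0 if it passes, 1 if it fails."""
    # Seeding has no effect on this toy function; it is shown because
    # real repros often depend on random state.
    random.seed(SEED)

    minimal_input = [1, 2]  # smallest input that still fails (step 5)
    expected = 1.5
    actual = average(minimal_input)

    # Record the runtime version, then expected vs actual (steps 4 and 6).
    print(f"python:   {platform.python_version()}")
    print(f"input:    {minimal_input!r}")
    print(f"expected: {expected!r}")
    print(f"actual:   {actual!r}")
    return 0 if actual == expected else 1
```

Saved as a script that ends with `sys.exit(run_repro())` under a main guard, this becomes a single command (for example `python repro.py`, where the file name is hypothetical). The nonzero exit code on failure lets a shell script or CI job use it as a check.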

Shrinking a repro (delta debugging mindset)

When inputs are large (documents, payloads), use a shrink loop:

  • remove half the input, rerun
  • if it still fails, keep shrinking
  • if it stops failing, add back the last removed chunk and try a different removal

This is a practical way to isolate the minimal trigger.
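
The shrink loop above can be sketched in a few lines. This is a simplified greedy variant in the spirit of delta debugging, not the full ddmin algorithm. The `still_fails` callback is an assumption: it stands for whatever reruns your repro on a candidate input and returns True when the bug still shows.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def shrink(items: Sequence[T], still_fails: Callable[[List[T]], bool]) -> List[T]:
    """Greedily remove chunks of the input while the failure persists."""
    current = list(items)
    if not still_fails(current):
        raise ValueError("start from an input that actually fails")

    chunk = max(len(current) // 2, 1)  # begin by removing half the input
    while True:
        start = 0
        removed_any = False
        while start < len(current):
            candidate = current[:start] + current[start + chunk:]
            if still_fails(candidate):
                # Still failing without this chunk: keep the smaller input.
                current = candidate
                removed_any = True
            else:
                # Failure vanished: keep the chunk and try the next one.
                start += chunk
        if chunk > 1:
            chunk //= 2  # retry with finer-grained removals
        elif not removed_any:
            return current  # no single element can be removed any more


# Hypothetical example: the bug triggers only when both "x" and "y" are present.
lines = ["a", "x", "b", "c", "y", "d"]
minimal = shrink(lines, lambda xs: "x" in xs and "y" in xs)
print(minimal)
```

Each call to `still_fails` is one rerun of the repro, so the loop is only practical when the repro is fast. The result is minimal in the sense that removing any single remaining element stops the failure; it is not guaranteed to be the smallest possible input.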

Ask the model to help shrink

The model can propose which parts of an input are likely irrelevant. You still validate each suggestion by rerunning the repro.

Turn repro into a failing test

Once you have a one-command repro, convert it into a test:

  • a unit test if the bug is in pure logic
  • an integration test if the bug is in wiring/IO
  • a golden test if output shape must remain stable

The test is your regression lock.
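
As a sketch of the unit-test case, the minimized input becomes a single assertion that fails before the fix and passes after it. Everything named here is hypothetical: `dedupe_buggy` and `dedupe_fixed` stand in for your implementation before and after the change, and the small `run` helper exists only so the example can show both outcomes.

```python
import io
import unittest


def dedupe_buggy(items):
    """Hypothetical implementation with the bug: loses first-seen order."""
    return sorted(set(items))


def dedupe_fixed(items):
    """Hypothetical fix: dict keys preserve insertion order."""
    return list(dict.fromkeys(items))


def make_regression_test(impl):
    """Build a test case bound to one implementation."""

    class DedupeRegression(unittest.TestCase):
        def test_preserves_first_seen_order(self):
            # Minimal input taken from the shrunk repro.
            self.assertEqual(impl(["b", "a", "b"]), ["b", "a"])

    return DedupeRegression


def run(impl) -> bool:
    """Run the regression test against impl; True means it passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(
        make_regression_test(impl)
    )
    # Capture the runner's report so only the boolean result is surfaced.
    result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    return result.wasSuccessful()


print("before fix:", run(dedupe_buggy))
print("after fix: ", run(dedupe_fixed))
```

In a real codebase the test would import the actual function rather than take it as a parameter. Confirming that the test fails on the unfixed code is what shows it really captures the bug.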

Copy-paste prompts

Prompt: create an MRE plan

Help me create a minimal reproducible error (MRE) for this bug.

Evidence:
- Expected: ...
- Actual: ...
- Logs/output: ...

Task:
1) Propose a step-by-step plan to reproduce it reliably.
2) Propose how to shrink the input to a minimal case.
3) Propose what a regression test should look like (unit vs integration).
Stop after the plan.

Prompt: write the regression test only

Write a regression test for this bug. Do NOT change implementation yet.

Repro:
- Input: ...
- Expected: ...
- Actual: ...

Output: diff-only changes (tests only)
