8.2 The "hypothesize → test → iterate" loop

This section turns debugging with an AI assistant into a repeatable loop: reproduce, hypothesize, test, iterate, fix, and lock the result with a test.

The core debugging loop

Debugging with AI works best when you force structure:

  1. Reproduce the failure reliably.
  2. Hypothesize a small set of plausible causes.
  3. Test hypotheses with quick checks.
  4. Iterate based on results (discard wrong hypotheses).
  5. Fix with the smallest diff that resolves the issue.
  6. Lock with a regression test.

This is “scientific method for software,” scaled down to minutes.

Why this matters with LLMs

Without structure, the model will propose plausible fixes and you’ll try them one by one. That feels like progress but often wastes time. The loop makes the model do diagnostic work, not just code generation.

Step 1: reproduce reliably

If you can’t reproduce, you can’t debug. Your first job is to make the failure happen on demand:

  • reduce concurrency, retries, and “randomness,”
  • use the same input every time,
  • write a failing test if possible,
  • capture exact error output.

Once you have a one-command reproduction, you’ve already done most of the hard part.
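
For example, a one-command reproduction can be as small as a single failing test. Everything named here (the `billing` module and `parse_amount`) is a hypothetical stand-in for your own code:

```python
# test_repro.py -- run with: pytest test_repro.py
# Fixed input, no retries, no randomness: the failure happens on every run.
from billing import parse_amount  # hypothetical module under test


def test_empty_input_is_handled():
    # Expected behavior: empty input parses to 0.0.
    # Actual (before the fix): parse_amount("") returns None and callers crash.
    assert parse_amount("") == 0.0
```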

Step 2: generate hypotheses (without guessing randomly)

Ask the model for a short ranked list of hypotheses. Importantly, require each hypothesis to be connected to evidence.

Good hypotheses are specific:

  • “This function returns None when input is empty, and the caller doesn’t handle it.”
  • “The parser treats unary minus as binary minus in this token sequence.”
  • “The CLI exits with code 0 because an exception is swallowed.”

Bad hypotheses are vague:

  • “It’s probably a bug in your code.”
  • “Maybe the environment is wrong.”
  • “Try reinstalling dependencies.”

Force ranking

Ask the model to rank hypotheses by likelihood and impact. Ranking prevents it from listing 20 ideas with no prioritization.
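
A ranked list might look like this (the hypotheses below are illustrative, not from a real session):

```text
1. Likely, high impact: parse_amount("") returns None and the caller assumes a number.
2. Possible: a locale-dependent decimal separator breaks float parsing.
3. Unlikely: a stale build artifact means the running code is older than the source.
```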

Step 3: design tests to confirm/deny

For each hypothesis, demand a confirming and denying check:

  • Confirming test: “If hypothesis is true, we should observe X.”
  • Denying test: “If hypothesis is false, we should observe Y.”

Examples of quick tests:

  • add one log line to confirm a code path (see the probe sketched after this list),
  • add one unit test for a suspected edge case,
  • inspect a value at a boundary (before/after parsing),
  • run a minimal reproduction with a modified input.
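
For instance, the “one log line” check above, applied to hypothesis #1 from the ranked list (parse_amount("") returning None), can be a throwaway probe. The names are hypothetical:

```python
# probe.py -- throwaway check, not a permanent test
import logging

from billing import parse_amount  # hypothetical module under test

logging.basicConfig(level=logging.DEBUG)

result = parse_amount("")
# Confirming observation: result is None -> hypothesis #1 stands.
# Denying observation: result is a number (or a clear exception) -> discard #1.
logging.debug("parse_amount('') returned %r", result)
```
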
Avoid “fixes” that don’t test a hypothesis

If you apply changes without a hypothesis and an observation, you’re doing random-walk debugging.

Step 4: iterate with evidence

Run one check at a time. Then update the model with results:

  • what you ran,
  • what you observed,
  • which hypotheses are now less likely,
  • what the next check should be.

This keeps the model anchored to reality and avoids “narrative debugging.”
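
A results update can be as short as this (illustrative, continuing the running example):

```text
Ran: python probe.py
Observed: parse_amount('') returned None, no exception raised.
Update: hypothesis #1 confirmed; #2 and #3 dropped.
Next: smallest fix inside parse_amount, plus a regression test for empty input.
```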

Step 5: implement the smallest fix

Once a hypothesis is confirmed, fix it with minimal scope:

  • diff-only changes,
  • avoid refactors during the fix,
  • prefer changing one function over re-architecting,
  • preserve behavior outside the bug.
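
In the running example, the smallest fix touches exactly one function (again, `parse_amount` is hypothetical):

```python
# billing.py (only the changed function shown; the rest of the module is untouched)
def parse_amount(text: str) -> float:
    # Fix: handle empty input explicitly instead of falling through and returning None.
    if not text.strip():
        return 0.0
    return float(text)
```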

Step 6: lock the fix (regression test)

If you don’t lock the fix, it will regress. Locking means:

  • add a test that fails before the fix,
  • keep the test after the fix (forever),
  • run tests in CI (eventually) so regressions get caught immediately.

Regression tests are documentation

A good regression test explains “what went wrong” and “what must never happen again.” That’s how teams build reliability over time.
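
In the running example, the reproduction from step 1 is promoted to a permanent regression test, and its docstring records what went wrong:

```python
# test_billing.py -- kept forever, run on every change
from billing import parse_amount  # hypothetical module under test


def test_parse_amount_empty_input_regression():
    """Regression: parse_amount("") used to return None, crashing callers
    that expected a number. Empty input must parse to 0.0."""
    assert parse_amount("") == 0.0
```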

Copy-paste prompt templates

Template A: hypotheses + tests only (no code)

We have a bug. Do NOT write code yet.

Goal:
[expected behavior]

Reproduction:
```sh
[command]
```

Actual output:
```text
[output]
```

Relevant code:
(paste the minimal relevant code)

Task:
1) Provide 3–5 ranked hypotheses for the root cause.
2) For each hypothesis, propose a confirming test and a denying test.
3) Stop and wait for my results.

Template B: implement the smallest fix

Based on the confirmed hypothesis (#N), implement the smallest fix.

Constraints:
- Diff-only changes
- No refactors beyond what’s required
- Add/keep a regression test

Output:
- Unified diff only

Where to go next