28.3 Pairwise comparisons for model/prompt tuning


Goal: choose winners reliably when differences are subtle

When you tweak prompts or swap models, differences are often subtle and subjective.

Pairwise comparison solves this by asking a simpler question:

Given the same input, which output is better?

This is often easier and more consistent than assigning absolute scores.

Why pairwise comparisons work so well

Pairwise is powerful because:

  • humans are better at comparing than at scoring on a scale,
  • it reduces “calibration drift” across reviewers,
  • it makes tradeoffs visible quickly (clarity vs completeness),
  • it supports A/B decisions for shipping.

Pairwise is how you avoid bikeshedding

If reviewers argue about what “4/5 clarity” means, switch to pairwise. Ask: which one is better for the user and why?

Setting up a fair comparison

To make pairwise fair:

  • Same inputs: run A and B on the exact same eval cases.
  • Randomize order: don’t always show A first.
  • Blind reviewers: hide which prompt/model produced which output (see the sketch after this list).
  • Control retrieval: for RAG, either freeze sources or log and compare retrieved chunks as well.
  • Normalize formatting: so reviewers judge content, not whitespace.
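
A minimal sketch of the blinding, order randomization, and whitespace normalization steps, assuming A and B outputs have already been generated and are keyed by case id. The function and field names here are illustrative, not from the guide:

```python
import random


def normalize(text):
    """Collapse whitespace so reviewers judge content, not formatting."""
    return "\n".join(" ".join(line.split()) for line in text.strip().splitlines())


def build_blinded_pairs(cases, outputs_a, outputs_b, seed=0):
    """Pair A/B outputs per case, randomize display order, and hide provenance.

    cases: iterable of case ids; outputs_a / outputs_b: dicts of case id -> output text.
    Returns review items (reviewers only ever see "left"/"right") plus an answer key
    for un-blinding after judgments are collected.
    """
    rng = random.Random(seed)
    review_items, answer_key = [], {}
    for case_id in cases:
        left_is_a = rng.random() < 0.5  # randomize which side A appears on
        left_src, right_src = (outputs_a, outputs_b) if left_is_a else (outputs_b, outputs_a)
        review_items.append({
            "case_id": case_id,
            "left": normalize(left_src[case_id]),
            "right": normalize(right_src[case_id]),
        })
        answer_key[case_id] = {"left": "A" if left_is_a else "B",
                               "right": "B" if left_is_a else "A"}
    return review_items, answer_key
```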

Decide what you’re comparing:

  • prompt A vs prompt B,
  • model X vs model Y,
  • retrieval strategy 1 vs 2,
  • reranker on vs off.

Pairwise rubric (what reviewers decide)

Keep it simple. For each case, the reviewer records (see the record sketch after the callout below):

  • Winner: A / B / tie
  • Reason tag: correctness / faithfulness / clarity / completeness / safety / other
  • Optional note: one sentence when non-obvious

Faithfulness should dominate for grounded systems

If one answer is more eloquent but less faithful to sources, it should lose. Encode that rule explicitly for reviewers.
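
One way to capture each decision is a small record per case. The class, field names, and validation below are an illustrative sketch that mirrors the rubric above; adapt them to whatever review tool you use:

```python
from dataclasses import dataclass
from typing import Optional

REASON_TAGS = {"correctness", "faithfulness", "clarity", "completeness", "safety", "other"}


@dataclass
class PairwiseJudgment:
    case_id: str
    winner: str                  # "A", "B", or "tie" (recorded after un-blinding left/right)
    reason: str                  # one tag from REASON_TAGS; per the rule above, a less
                                 # faithful answer loses no matter how eloquent it is
    note: Optional[str] = None   # one sentence, only when the call is non-obvious

    def __post_init__(self):
        if self.winner not in {"A", "B", "tie"}:
            raise ValueError(f"winner must be 'A', 'B', or 'tie', got {self.winner!r}")
        if self.reason not in REASON_TAGS:
            raise ValueError(f"unknown reason tag: {self.reason!r}")
```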

Aggregating results (win rate, Elo, confidence)

Simple aggregation (see the sketch after this list):

  • Win rate: percentage of cases where A beats B.
  • Ties: track them separately; a high tie rate means the change is low-impact.
  • By-category wins: break results down by reason tag, so a gain in clarity that comes with a regression in correctness stays visible.
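
A sketch of the simple aggregation over a list of judgments like the ones above. It assumes the hypothetical PairwiseJudgment records from the previous sketch, and it uses one common convention for win rate (ties excluded from the denominator); adjust to your own definition:

```python
from collections import Counter


def aggregate(judgments):
    """Win rate, tie rate, and a by-reason breakdown over PairwiseJudgment records."""
    total = len(judgments)
    outcomes = Counter(j.winner for j in judgments)
    decided = outcomes["A"] + outcomes["B"]
    by_reason = Counter((j.reason, j.winner) for j in judgments if j.winner != "tie")
    return {
        "a_win_rate": outcomes["A"] / decided if decided else 0.0,  # ties excluded here
        "tie_rate": outcomes["tie"] / total if total else 0.0,      # high => low-impact change
        "by_reason": dict(by_reason),  # e.g. ("clarity", "A") -> 7, ("correctness", "B") -> 3
    }
```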

More advanced (both sketched after this list):

  • Elo rating: useful if you compare many variants over time.
  • Confidence intervals: useful if you need statistical confidence for deployment decisions.
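
If you do track many variants over time, a standard Elo update is straightforward; the K-factor below is a conventional default, not a recommendation from the guide. The interval uses a normal approximation for a binomial win rate, which is a rough but serviceable gate at modest sample sizes:

```python
import math


def elo_update(rating_a, rating_b, score_a, k=32):
    """Standard Elo update; score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


def win_rate_ci(wins, decided, z=1.96):
    """Normal-approximation 95% interval for a win rate (ties excluded from `decided`)."""
    if decided == 0:
        return (0.0, 0.0)
    p = wins / decided
    half_width = z * math.sqrt(p * (1.0 - p) / decided)
    return (max(0.0, p - half_width), min(1.0, p + half_width))
```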

In practice, win rate + category breakdown is enough for most teams.

A practical workflow you can run weekly

  1. Pick the change: one prompt change or one model swap.
  2. Run A and B: on the same 25-case eval set.
  3. Review pairwise: 10–15 minutes per reviewer for a subset.
  4. Aggregate: compute win rates and reasons.
  5. Decide: ship, reject, or iterate with a clear hypothesis (a decision-rule sketch follows below).

Pairwise encourages small changes because small changes are easier to compare and attribute.
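
One way to make step 5 mechanical is a small decision rule. The thresholds below are placeholders to illustrate the shape of the rule, not recommended values; tune them to your risk tolerance and always check the by-category breakdown before shipping:

```python
def decide(a_win_rate, tie_rate, ship_threshold=0.6, reject_threshold=0.4, max_tie_rate=0.8):
    """Turn aggregated pairwise results into a ship / reject / iterate call."""
    if tie_rate >= max_tie_rate:
        return "iterate"  # mostly ties: the change is low-impact, sharpen the hypothesis
    if a_win_rate >= ship_threshold:
        return "ship"
    if a_win_rate <= reject_threshold:
        return "reject"
    return "iterate"
```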

Pitfalls and how to avoid them

  • Comparing multiple changes at once: you won’t know what caused the improvement.
  • Letting retrieval differ: you might be comparing retrieval luck rather than prompts.
  • Review fatigue: limit batch size and use sampling.
  • Overfitting: don’t tune prompts only to your tiny eval set; rotate in fresh cases.
