28.3 Pairwise comparisons for model/prompt tuning
On this page
- Goal: choose winners reliably when differences are subtle
- Why pairwise comparisons work so well
- Setting up a fair comparison
- Pairwise rubric (what reviewers decide)
- Aggregating results (win rate, Elo, confidence)
- A practical workflow you can run weekly
- Pitfalls and how to avoid them
- Where to go next
Goal: choose winners reliably when differences are subtle
When you tweak prompts or swap models, differences are often subtle and subjective.
Pairwise comparison solves this by asking a simpler question:
Given the same input, which output is better?
This is often easier and more consistent than assigning absolute scores.
Why pairwise comparisons work so well
Pairwise is powerful because:
- humans are better at comparing than at scoring on a scale,
- it reduces “calibration drift” across reviewers,
- it surfaces tradeoffs quickly (e.g., clarity vs. completeness),
- it supports A/B decisions for shipping.
If reviewers argue about what “4/5 clarity” means, switch to pairwise. Ask: which one is better for the user and why?
Setting up a fair comparison
To make pairwise fair:
- Same inputs: run A and B on the exact same eval cases.
- Randomize order: don’t always show A first.
- Blind reviewers: hide which prompt/model produced which output.
- Control retrieval: for RAG, either freeze sources or log and compare retrieved chunks as well.
- Normalize formatting: make the two outputs look alike so reviewers judge content, not whitespace.
Decide what you’re comparing:
- prompt A vs prompt B,
- model X vs model Y,
- retrieval strategy 1 vs 2,
- reranker on vs off.
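One way to operationalize these rules is to build the review batch up front: pair outputs by case ID, randomize which side is shown first, and keep the A/B mapping in a key that reviewers never see. Here is a minimal sketch in Python; the field names and data layout are assumptions for illustration, not a prescribed format:

```python
import random

def build_review_batch(cases, outputs_a, outputs_b, seed=0):
    """Pair outputs per case, randomize display order, and hide which variant is which.

    cases      -- list of dicts with "id" and "input" keys (assumed layout)
    outputs_a  -- dict: case id -> output text from variant A
    outputs_b  -- dict: case id -> output text from variant B
    Returns (batch, key): `batch` goes to reviewers, `key` stays out of the review tool.
    """
    rng = random.Random(seed)  # fixed seed makes the batch reproducible
    batch, key = [], {}
    for case in cases:
        cid = case["id"]
        left_is_a = rng.random() < 0.5  # randomize which side is shown first
        left, right = ((outputs_a[cid], outputs_b[cid]) if left_is_a
                       else (outputs_b[cid], outputs_a[cid]))
        batch.append({
            "case_id": cid,
            "input": case["input"],
            "left": left.strip(),   # light whitespace normalization; judge content, not layout
            "right": right.strip(),
        })
        key[cid] = {"left": "A" if left_is_a else "B",
                    "right": "B" if left_is_a else "A"}
    return batch, key

# batch, key = build_review_batch(cases, outputs_a, outputs_b)
# send `batch` to reviewers; keep `key` for un-blinding during aggregation
```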
Pairwise rubric (what reviewers decide)
Keep it simple. For each case:
- Winner: A / B / tie
- Reason tag: correctness / faithfulness / clarity / completeness / safety / other
- Optional note: one sentence when non-obvious
If one answer is more eloquent but less faithful to sources, it should lose. Encode that rule explicitly for reviewers.
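Each decision can then be captured as a small, uniform record. The fields below mirror the rubric; the exact schema is an illustration, not a requirement:

```python
from dataclasses import dataclass
from typing import Optional

REASON_TAGS = {"correctness", "faithfulness", "clarity", "completeness", "safety", "other"}

@dataclass
class Judgment:
    """One reviewer's decision on one case, mirroring the rubric above."""
    case_id: str
    winner: str                 # "A", "B", or "tie" (resolved from left/right via the hidden key)
    reason: str                 # one tag from REASON_TAGS
    note: Optional[str] = None  # one sentence, only when the call is non-obvious

    def __post_init__(self):
        if self.winner not in {"A", "B", "tie"}:
            raise ValueError(f"winner must be 'A', 'B', or 'tie', got {self.winner!r}")
        if self.reason not in REASON_TAGS:
            raise ValueError(f"unknown reason tag: {self.reason!r}")
```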
Aggregating results (win rate, Elo, confidence)
Simple aggregation:
- Win rate: percentage of cases where A beats B.
- Ties: track them separately; a high tie rate suggests the change is low-impact.
- By-category wins: break results down by reason tag so that, for example, a gain in clarity paired with a regression in correctness is visible.
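Over a list of judgment records like the one sketched above, the simple aggregation is only a few lines (again a sketch, assuming the hypothetical Judgment fields from the previous example):

```python
from collections import Counter

def aggregate(judgments):
    """Win rate, tie rate, and a per-reason breakdown for a list of Judgment records."""
    outcomes = Counter(j.winner for j in judgments)   # counts of "A", "B", "tie"
    decided = outcomes["A"] + outcomes["B"]           # one common convention: exclude ties
    by_reason = Counter((j.reason, j.winner) for j in judgments if j.winner != "tie")
    return {
        "win_rate_a": outcomes["A"] / decided if decided else 0.0,
        "tie_rate": outcomes["tie"] / len(judgments) if judgments else 0.0,
        "by_reason": dict(by_reason),                 # e.g. {("correctness", "B"): 3, ...}
    }
```

Whether ties count toward the denominator is a convention; pick one and keep it fixed across runs so win rates stay comparable.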
More advanced:
- Elo rating: useful if you compare many variants over time.
- Confidence intervals: useful if you need statistical confidence for deployment decisions.
In practice, win rate + category breakdown is enough for most teams.
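If you do end up comparing many variants over time, the Elo update is small enough to write yourself, and a normal-approximation interval on the win rate gives a rough sense of how much the result could move. Both are sketches, not a substitute for a proper statistics library:

```python
import math

def elo_update(rating_a, rating_b, score_a, k=32):
    """Standard Elo update; score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

def win_rate_interval(wins, decided, z=1.96):
    """Approximate 95% confidence interval for the win rate over non-tie comparisons."""
    if decided == 0:
        return (0.0, 1.0)
    p = wins / decided
    half_width = z * math.sqrt(p * (1.0 - p) / decided)
    return (max(0.0, p - half_width), min(1.0, p + half_width))
```

On a 25-case eval set the interval will be wide (roughly ±20 percentage points around a 55% win rate), which is worth knowing before reading a small edge as a real improvement.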
A practical workflow you can run weekly
- Pick the change: one prompt change or one model swap.
- Run A and B: on the same 25-case eval set.
- Review pairwise: 10–15 minutes per reviewer for a subset.
- Aggregate: compute win rates and reasons.
- Decide: ship, reject, or iterate with a clear hypothesis.
Pairwise encourages small changes because small changes are easier to compare and attribute.
Pitfalls and how to avoid them
- Comparing multiple changes at once: you won’t know what caused the improvement.
- Letting retrieval differ: you might be comparing retrieval luck rather than prompts.
- Review fatigue: limit batch size and use sampling.
- Overfitting: don’t tune prompts only to your tiny eval set; rotate in fresh cases.