28.3 Pairwise comparisons for model/prompt tuning


Goal: choose winners reliably when differences are subtle

When you tweak prompts or swap models, differences are often subtle and subjective.

Pairwise comparison solves this by asking a simpler question:

Given the same input, which output is better?

This is often easier and more consistent than assigning absolute scores.

Why pairwise comparisons work so well

Pairwise is powerful because:

  • humans are better at comparing than at scoring on a scale,
  • it reduces “calibration drift” across reviewers,
  • it makes tradeoffs visible quickly (clarity vs completeness),
  • it supports A/B decisions for shipping.

Pairwise is how you avoid bikeshedding

If reviewers argue about what “4/5 clarity” means, switch to pairwise. Ask: which one is better for the user and why?

Setting up a fair comparison

To make pairwise fair:

  • Same inputs: run A and B on the exact same eval cases.
  • Randomize order: don’t always show A first.
  • Blind reviewers: hide which prompt/model produced which output (see the sketch after this list).
  • Control retrieval: for RAG, either freeze sources or log and compare retrieved chunks as well.
  • Normalize formatting: so reviewers judge content, not whitespace.
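
A minimal sketch of the blinding, order randomization, and whitespace normalization steps, assuming A and B outputs have already been generated and are keyed by case id. The function and field names here are illustrative, not from the guide:

```python
import random


def normalize(text):
    """Collapse whitespace so reviewers judge content, not formatting."""
    return "\n".join(" ".join(line.split()) for line in text.strip().splitlines())


def build_blinded_pairs(cases, outputs_a, outputs_b, seed=0):
    """Pair A/B outputs per case, randomize display order, and hide provenance.

    cases: iterable of case ids; outputs_a / outputs_b: dicts of case id -> output text.
    Returns review items (reviewers only ever see "left"/"right") plus an answer key
    for un-blinding after judgments are collected.
    """
    rng = random.Random(seed)
    review_items, answer_key = [], {}
    for case_id in cases:
        left_is_a = rng.random() < 0.5  # randomize which side A appears on
        left_src, right_src = (outputs_a, outputs_b) if left_is_a else (outputs_b, outputs_a)
        review_items.append({
            "case_id": case_id,
            "left": normalize(left_src[case_id]),
            "right": normalize(right_src[case_id]),
        })
        answer_key[case_id] = {"left": "A" if left_is_a else "B",
                               "right": "B" if left_is_a else "A"}
    return review_items, answer_key
```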

Decide what you’re comparing:

  • prompt A vs prompt B,
  • model X vs model Y,
  • retrieval strategy 1 vs 2,
  • reranker on vs off.

Pairwise rubric (what reviewers decide)

Keep it simple. For each case, the reviewer records (see the record sketch after the callout below):

  • Winner: A / B / tie
  • Reason tag: correctness / faithfulness / clarity / completeness / safety / other
  • Optional note: one sentence when non-obvious

Faithfulness should dominate for grounded systems

If one answer is more eloquent but less faithful to sources, it should lose. Encode that rule explicitly for reviewers.
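
One way to capture each decision is a small record per case. The class, field names, and validation below are an illustrative sketch that mirrors the rubric above; adapt them to whatever review tool you use:

```python
from dataclasses import dataclass
from typing import Optional

REASON_TAGS = {"correctness", "faithfulness", "clarity", "completeness", "safety", "other"}


@dataclass
class PairwiseJudgment:
    case_id: str
    winner: str                  # "A", "B", or "tie" (recorded after un-blinding left/right)
    reason: str                  # one tag from REASON_TAGS; per the rule above, a less
                                 # faithful answer loses no matter how eloquent it is
    note: Optional[str] = None   # one sentence, only when the call is non-obvious

    def __post_init__(self):
        if self.winner not in {"A", "B", "tie"}:
            raise ValueError(f"winner must be 'A', 'B', or 'tie', got {self.winner!r}")
        if self.reason not in REASON_TAGS:
            raise ValueError(f"unknown reason tag: {self.reason!r}")
```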

Aggregating results (win rate, Elo, confidence)

Simple aggregation (see the sketch after this list):

  • Win rate: percentage of cases where A beats B.
  • Ties: track them separately; a high tie rate means the change is low-impact.
  • By-category wins: break results down by reason tag, so a gain in clarity that comes with a regression in correctness stays visible.
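
A sketch of the simple aggregation over a list of judgments like the ones above. It assumes the hypothetical PairwiseJudgment records from the previous sketch, and it uses one common convention for win rate (ties excluded from the denominator); adjust to your own definition:

```python
from collections import Counter


def aggregate(judgments):
    """Win rate, tie rate, and a by-reason breakdown over PairwiseJudgment records."""
    total = len(judgments)
    outcomes = Counter(j.winner for j in judgments)
    decided = outcomes["A"] + outcomes["B"]
    by_reason = Counter((j.reason, j.winner) for j in judgments if j.winner != "tie")
    return {
        "a_win_rate": outcomes["A"] / decided if decided else 0.0,  # ties excluded here
        "tie_rate": outcomes["tie"] / total if total else 0.0,      # high => low-impact change
        "by_reason": dict(by_reason),  # e.g. ("clarity", "A") -> 7, ("correctness", "B") -> 3
    }
```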

More advanced (both sketched after this list):

  • Elo rating: useful if you compare many variants over time.
  • Confidence intervals: useful if you need statistical confidence for deployment decisions.
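
If you do track many variants over time, a standard Elo update is straightforward; the K-factor below is a conventional default, not a recommendation from the guide. The interval uses a normal approximation for a binomial win rate, which is a rough but serviceable gate at modest sample sizes:

```python
import math


def elo_update(rating_a, rating_b, score_a, k=32):
    """Standard Elo update; score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


def win_rate_ci(wins, decided, z=1.96):
    """Normal-approximation 95% interval for a win rate (ties excluded from `decided`)."""
    if decided == 0:
        return (0.0, 0.0)
    p = wins / decided
    half_width = z * math.sqrt(p * (1.0 - p) / decided)
    return (max(0.0, p - half_width), min(1.0, p + half_width))
```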

In practice, win rate + category breakdown is enough for most teams.

A practical workflow you can run weekly

  1. Pick the change: one prompt change or one model swap.
  2. Run A and B: on the same 25-case eval set.
  3. Review pairwise: 10–15 minutes per reviewer for a subset.
  4. Aggregate: compute win rates and reasons.
  5. Decide: ship, reject, or iterate with a clear hypothesis (a decision-rule sketch follows below).

Pairwise encourages small changes because small changes are easier to compare and attribute.
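
One way to make step 5 mechanical is a small decision rule. The thresholds below are placeholders to illustrate the shape of the rule, not recommended values; tune them to your risk tolerance and always check the by-category breakdown before shipping:

```python
def decide(a_win_rate, tie_rate, ship_threshold=0.6, reject_threshold=0.4, max_tie_rate=0.8):
    """Turn aggregated pairwise results into a ship / reject / iterate call."""
    if tie_rate >= max_tie_rate:
        return "iterate"  # mostly ties: the change is low-impact, sharpen the hypothesis
    if a_win_rate >= ship_threshold:
        return "ship"
    if a_win_rate <= reject_threshold:
        return "reject"
    return "iterate"
```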

Pitfalls and how to avoid them

  • Comparing multiple changes at once: you won’t know what caused the improvement.
  • Letting retrieval differ: you might be comparing retrieval luck rather than prompts.
  • Review fatigue: limit batch size and use sampling.
  • Overfitting: don’t tune prompts only to your tiny eval set; rotate in fresh cases.
