28.2 Scoring outputs with rubrics

Goal: make “better” measurable

If you can’t define “better,” you can’t improve reliably.

A rubric is a set of scoring criteria that lets you compare outputs across:

  • prompt versions,
  • model versions,
  • retrieval strategies,
  • system changes.

The goal is not perfect measurement. The goal is consistent measurement.

Why rubrics beat vibes

Without a rubric:

  • reviewers disagree without realizing it,
  • prompt changes become subjective debates,
  • teams “optimize for tone” and miss correctness regressions,
  • nobody can explain why a change was accepted.

With a rubric:

  • you can compare runs and track trends,
  • you can make tradeoffs explicit (clarity vs completeness),
  • you can prioritize what matters (faithfulness over eloquence).

Keep rubrics small

Pick 3–5 dimensions that matter most. Huge rubrics create fatigue and inconsistent scoring.

Common rubric dimensions (choose 3–5)

Pick dimensions that align with your feature’s promise:

  • Correctness: does the answer match the intended truth?
  • Faithfulness/grounding: are claims supported by sources?
  • Completeness: does it cover key aspects (rules + exceptions)?
  • Clarity: is it easy to understand and act on?
  • Safety/policy: does it avoid disallowed content and handle refusals?
  • Format quality: valid JSON, schema compliance, correct citations.
  • Efficiency: concision, token usage, and latency impact (if relevant).

For RAG systems, faithfulness is usually the top dimension.
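The "3–5 dimensions" constraint can even be enforced by construction. A minimal sketch (the `Rubric` class and its weights are illustrative, not part of the guide):

```python
from dataclasses import dataclass, field


@dataclass
class Rubric:
    """A small rubric: 3-5 scoring dimensions, each mapped to a weight."""
    dimensions: dict[str, float] = field(default_factory=dict)

    def __post_init__(self):
        # Huge rubrics create fatigue and inconsistent scoring.
        if not 3 <= len(self.dimensions) <= 5:
            raise ValueError("pick 3-5 dimensions")


# A RAG-style rubric: faithfulness weighted highest.
rag_rubric = Rubric({
    "faithfulness": 2.0,
    "correctness": 1.5,
    "clarity": 1.0,
    "abstention": 1.0,
})
```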

Scoring scales and anchors

Use a small scale with explicit anchors. Two practical options:

  • 0–2 scale: 0 = fail, 1 = partial, 2 = pass.
  • 1–5 scale: more nuance, but harder to score consistently.

Whatever scale you choose, define anchors:

  • What counts as a “pass”?
  • What is a “hard fail”?
  • What is “partial”?

Anchors are what keep scoring consistent across reviewers.
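One way to keep anchors consistent is to store them next to the scale itself, so every reviewer checks against the same definitions. A minimal sketch for one dimension (the anchor wording is illustrative; write your own in your domain's terms):

```python
# A 0-2 scale with explicit anchors for one dimension.
FAITHFULNESS_ANCHORS = {
    0: "hard fail: claims unsupported by cited quotes, or citations invented",
    1: "partial: mostly supported, but at least one claim is weak or overstated",
    2: "pass: every claim is clearly supported by a citation and quote",
}


def describe(score: int) -> str:
    """Return the anchor text a reviewer should check a score against."""
    return FAITHFULNESS_ANCHORS[score]
```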

Example rubric (grounded Q&A)

This is a practical 0–2 rubric for an “answer with references” system:

  • Faithfulness (0–2)
    • 0: claims not supported by cited quotes; invented citations; quotes don’t match sources.
    • 1: mostly supported, but one claim is weakly supported or overstated.
    • 2: all claims clearly supported by citations and quotes.
  • Correctness (0–2)
    • 0: incorrect or misleading; misses key exception; wrong policy interpretation.
    • 1: mostly correct but incomplete or slightly off in nuance.
    • 2: correct and aligned with sources.
  • Clarity (0–2)
    • 0: hard to understand; confusing structure; unusable.
    • 1: understandable but verbose or poorly structured.
    • 2: clear, structured, actionable.
  • Abstention behavior (0–2)
    • 0: guesses when evidence is missing.
    • 1: partially abstains but still implies certainty.
    • 2: correctly uses not_found / needs clarification when required.

Always include at least one “hard fail” dimension

For grounded systems, that dimension is usually faithfulness. If faithfulness is 0, you should not ship even if tone improved.
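That gating rule can be made explicit in code. A sketch, assuming 0–2 scores keyed by dimension name, with faithfulness as the hard-fail dimension (the threshold value is illustrative):

```python
HARD_FAIL_DIMENSIONS = {"faithfulness"}


def should_ship(scores: dict[str, int], pass_threshold: float = 1.5) -> bool:
    """Gate a change: any hard-fail dimension at 0 blocks shipping,
    regardless of how well the other dimensions scored."""
    if any(scores.get(dim, 0) == 0 for dim in HARD_FAIL_DIMENSIONS):
        return False
    # Otherwise require a decent average across all dimensions.
    return sum(scores.values()) / len(scores) >= pass_threshold


# Improved tone cannot rescue a faithfulness hard fail:
should_ship({"faithfulness": 0, "correctness": 2, "clarity": 2})  # False
```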

Workflow: scoring without wasting time

Practical scoring workflow:

  1. Calibrate: review 3–5 cases together and align on what scores mean.
  2. Score in batches: 10–20 cases per session to avoid fatigue.
  3. Record notes only for failures: don’t write essays for “2/2/2”.
  4. Promote failures: add common failure patterns to golden tests or fuzz corpora.
  5. Use pairwise when close: for subtle improvements, pairwise is faster than absolute scoring (28.3).
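Steps 3 and 4 above can be sketched together: write notes only for failing cases, and collect those failures for promotion to golden tests. A minimal sketch with hypothetical field names:

```python
def triage(scored_cases: list[dict]) -> list[dict]:
    """Return only the cases worth writing notes about: any dimension
    below 2 on a 0-2 scale. Perfect cases need no essay."""
    return [c for c in scored_cases if any(v < 2 for v in c["scores"].values())]


batch = [
    {"id": "q1", "scores": {"faithfulness": 2, "clarity": 2}},
    {"id": "q2", "scores": {"faithfulness": 0, "clarity": 2}},
]
failures = triage(batch)  # only q2; a candidate for the golden set
```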

Model-assisted scoring (use carefully)

You can use a model to help grade outputs, but treat it as an assistant, not a judge.

Good uses:

  • flag potential faithfulness issues (“this quote doesn’t support this claim”),
  • pre-sort outputs into “obvious pass” vs “needs human review”,
  • extract structured grading notes for reviewers.

Risky uses:

  • letting the model be the final arbiter for correctness without human checks,
  • grading subjective criteria without calibration.

If you use model-assisted grading, validate it against human judgment periodically.
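Periodic validation can be as simple as measuring agreement on a shared sample. A sketch assuming both graders produce 0–2 scores for the same cases (the agreement floor you pick is up to you):

```python
def agreement_rate(human: list[int], model: list[int]) -> float:
    """Fraction of cases where the model grader matched the human score.
    If this drifts low, recalibrate before trusting model grades."""
    if len(human) != len(model):
        raise ValueError("both graders must score the same cases")
    matches = sum(h == m for h, m in zip(human, model))
    return matches / len(human)


# 4 of 5 cases agree -> 0.8; set a floor before letting the model
# pre-sort outputs into "obvious pass" vs "needs human review".
agreement_rate([2, 1, 0, 2, 2], [2, 1, 1, 2, 2])  # 0.8
```

Plain agreement is the simplest option; for more reviewers or skewed score distributions, a chance-corrected statistic such as Cohen's kappa is a common upgrade.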

Where to go next