28.1 Build a tiny eval set that matters


Goal: 25 great eval cases (not 10,000 mediocre ones)

A tiny eval set is the fastest path to measurable improvement.

The goal is a set of cases that:

  • represents real user intent,
  • covers your riskiest failure modes,
  • is small enough to run frequently,
  • is strong enough to catch regressions.

Why 25?

25 is large enough to cover variety and small enough that people will actually run it and review diffs. Scale later.

Principles of a high-leverage eval set

  • Realistic: cases look like actual user requests (language, ambiguity, messiness).
  • High-signal: each case teaches you something; avoid filler.
  • Risk-weighted: include high-impact cases (where wrong answers are costly).
  • Edge-aware: include tricky cases (exceptions, conflicts, negations, “not found”).
  • Stable: cases don’t change every week; stability makes regressions detectable.
  • Actionable: failures point to a fixable layer (retrieval, prompt, validator, UX policy).

Where eval cases come from (real > invented)

Best sources:

  • Production logs: anonymized user queries (with consent and redaction).
  • Support tickets: what people actually ask and what confused them.
  • Internal stakeholders: “top 10 questions we always get.”
  • Known incidents: failures you never want to repeat.

If you must invent cases, use a constraint: each invented case must correspond to a plausible real user scenario and be labeled as “synthetic.”

Avoid “eval theater”

If eval cases are written in perfect prompt-engineer language, they won’t catch real failures. Real inputs are messy.

Coverage map: what you must include

Use a coverage map to ensure variety. For most AI features, include:

  • Happy path: common cases that should work well.
  • Ambiguity: questions that should trigger clarification.
  • Not found: questions outside the corpus/task scope.
  • Edge constraints: long inputs, short inputs, unusual formatting.
  • High-risk: policy/compliance/security sensitive cases.
  • Adversarial: injection-like attempts and format breakers.
  • Conflicts (RAG): contradictory sources that should be surfaced.

If your system has multiple modes, include a few cases per mode (summarize, extract, answer-with-sources, etc.).
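
The coverage categories above are easy to check mechanically. A minimal sketch, assuming each case is stored as a dictionary with a category field (the field name and category labels here are illustrative, not a required schema):

from collections import Counter

# Categories mirror the coverage map above; rename to match your own taxonomy.
REQUIRED_CATEGORIES = {
    "happy_path", "ambiguity", "not_found",
    "edge_constraints", "high_risk", "adversarial", "conflict",
}

def coverage_report(cases):
    # Count cases per category and list categories with no coverage yet.
    counts = Counter(case["category"] for case in cases)
    missing = REQUIRED_CATEGORIES - set(counts)
    return counts, missing

cases = [
    {"id": "hp-01", "category": "happy_path"},
    {"id": "nf-01", "category": "not_found"},
]
counts, missing = coverage_report(cases)
print(dict(counts))          # {'happy_path': 1, 'not_found': 1}
print("missing:", sorted(missing))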

Case format (what to store per case)

Store enough information to reproduce behavior. A practical case format includes:

  • id: stable identifier.
  • input: user query or task input.
  • context: user role/tenant, app mode, constraints (if relevant).
  • expected outcome type: answered / not_found / needs_clarification / conflict / refused.
  • rubric focus: which dimensions matter (correctness, faithfulness, clarity).
  • notes: why this case exists; what it’s testing.
  • optional labels: expected relevant doc_ids/chunk_ids (for retrieval eval).

For RAG cases, you may also store:

  • a frozen set of source chunks (for prompt-level eval), or
  • expected relevant chunk ids (for retrieval-level eval).
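
As a sketch of what one stored case can look like, assuming a Python dataclass as the storage schema; the field names follow the list above, and the example values are invented for illustration:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EvalCase:
    id: str                          # stable identifier
    input: str                       # user query or task input
    expected_outcome: str            # answered / not_found / needs_clarification / conflict / refused
    rubric_focus: list[str]          # which dimensions matter, e.g. ["correctness", "faithfulness"]
    notes: str                       # why this case exists; what it is testing
    context: Optional[dict] = None   # user role/tenant, app mode, constraints (if relevant)
    expected_chunk_ids: list[str] = field(default_factory=list)  # for retrieval-level eval
    synthetic: bool = False          # True if invented rather than taken from real traffic

case = EvalCase(
    id="nf-03",
    input="whats the refund policy for enterprise trials??",
    expected_outcome="not_found",
    rubric_focus=["faithfulness", "clarity"],
    notes="Out-of-corpus question; the system should say it cannot find this, not guess.",
    synthetic=True,
)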

Workflow: build, run, refine

  1. Collect 50 candidates: pull them from real sources (logs, tickets).
  2. Cluster them: group by intent and risk.
  3. Select 25: maximize coverage and risk-weighting.
  4. Write short notes: “what this catches.”
  5. Run your system: capture outputs and failures (see the runner sketch below).
  6. Refine cases: remove redundant ones, add missing edges, keep the set small.

The purpose is iteration: the eval set becomes your product’s “quality map.”
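
For step 5, a minimal runner sketch, assuming cases are stored as dictionaries with the fields described earlier and a run_system(case) entry point that you supply; both the interface and the eval_run.json filename are assumptions, not a prescribed setup:

import json
from datetime import datetime, timezone

def run_eval(cases, run_system):
    # Run every case, record the output, and flag outcome-type mismatches.
    results = []
    for case in cases:
        output = run_system(case)  # assumed to return {"text": ..., "outcome": ...}
        results.append({
            "id": case["id"],
            "expected_outcome": case["expected_outcome"],
            "observed_outcome": output.get("outcome"),
            "outcome_matches": output.get("outcome") == case["expected_outcome"],
            "output_text": output.get("text"),
        })
    failures = [r for r in results if not r["outcome_matches"]]
    run_record = {
        "ran_at": datetime.now(timezone.utc).isoformat(),
        "n_cases": len(results),
        "n_failures": len(failures),
        "results": results,
    }
    with open("eval_run.json", "w") as f:
        json.dump(run_record, f, indent=2)
    return failures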

Maintaining the eval set over time

Rules that keep the eval set useful:

  • Keep it curated: don’t let it grow without a reason.
  • Promote failures: any serious production failure becomes an eval case.
  • Retire obsolete cases: when product scope changes, remove cases that no longer apply.
  • Track versions: record the prompt version, model version, and corpus version with every run; otherwise results aren’t comparable across runs.
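
A small sketch of the kind of version metadata worth stamping onto every run record (the identifiers are placeholders; use whatever versioning scheme you already have):

# Attach these to every run record (e.g. merged into run_record from the runner above)
# so two runs can be compared apples to apples.
run_metadata = {
    "eval_set_version": "evalset-v3",          # bump when cases are added or retired
    "prompt_version": "answer_prompt_v12",
    "model_version": "provider-model-2025-01",
    "corpus_version": "docs-snapshot-2025-01-15",
}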

Copy-paste prompts

Prompt: propose eval cases from requirements

Help me build a small eval set (25 cases) for an AI feature.

Feature description: ...
Primary user tasks: ...
Risks if wrong: ...
Possible system states: answered / not_found / needs_clarification / conflict / refused

Task:
1) Propose 25 eval cases as a table with:
   - id
   - input
   - expected outcome type
   - why it matters (risk/edge case)
2) Ensure coverage across: happy path, ambiguity, not found, adversarial, high-risk.
