28.1 Build a tiny eval set that matters


Goal: 25 great eval cases (not 10,000 mediocre ones)

A tiny eval set is the fastest path to measurable improvement.

The goal is a set of cases that:

  • represents real user intent,
  • covers your riskiest failure modes,
  • is small enough to run frequently,
  • is strong enough to catch regressions.

Why 25?

25 is large enough to cover variety and small enough that people will actually run it and review diffs. Scale later.

Principles of a high-leverage eval set

  • Realistic: cases look like actual user requests (language, ambiguity, messiness).
  • High-signal: each case teaches you something; avoid filler.
  • Risk-weighted: include high-impact cases (where wrong answers are costly).
  • Edge-aware: include tricky cases (exceptions, conflicts, negations, “not found”).
  • Stable: cases don’t change every week; stability makes regressions detectable.
  • Actionable: failures point to a fixable layer (retrieval, prompt, validator, UX policy).

Where eval cases come from (real > invented)

Best sources:

  • Production logs: anonymized user queries (with consent and redaction).
  • Support tickets: what people actually ask and what confused them.
  • Internal stakeholders: “top 10 questions we always get.”
  • Known incidents: failures you never want to repeat.

If you must invent cases, use a constraint: each invented case must correspond to a plausible real user scenario and be labeled as “synthetic.”

Avoid “eval theater”

If eval cases are written in perfect prompt-engineer language, they won’t catch real failures. Real inputs are messy.

Coverage map: what you must include

Use a coverage map to ensure variety. For most AI features, include:

  • Happy path: common cases that should work well.
  • Ambiguity: questions that should trigger clarification.
  • Not found: questions outside the corpus/task scope.
  • Edge constraints: long inputs, short inputs, unusual formatting.
  • High-risk: policy/compliance/security sensitive cases.
  • Adversarial: injection-like attempts and format breakers.
  • Conflicts (RAG): contradictory sources that should be surfaced.

If your system has multiple modes, include a few cases per mode (summarize, extract, answer-with-sources, etc.).
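
The coverage categories above are easy to check mechanically. A minimal sketch, assuming each case is stored as a dictionary with a category field (the field name and category labels here are illustrative, not a required schema):

from collections import Counter

# Categories mirror the coverage map above; rename to match your own taxonomy.
REQUIRED_CATEGORIES = {
    "happy_path", "ambiguity", "not_found",
    "edge_constraints", "high_risk", "adversarial", "conflict",
}

def coverage_report(cases):
    # Count cases per category and list categories with no coverage yet.
    counts = Counter(case["category"] for case in cases)
    missing = REQUIRED_CATEGORIES - set(counts)
    return counts, missing

cases = [
    {"id": "hp-01", "category": "happy_path"},
    {"id": "nf-01", "category": "not_found"},
]
counts, missing = coverage_report(cases)
print(dict(counts))          # {'happy_path': 1, 'not_found': 1}
print("missing:", sorted(missing))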

Case format (what to store per case)

Store enough information to reproduce behavior. A practical case format includes:

  • id: stable identifier.
  • input: user query or task input.
  • context: user role/tenant, app mode, constraints (if relevant).
  • expected outcome type: answered / not_found / needs_clarification / conflict / refused.
  • rubric focus: which dimensions matter (correctness, faithfulness, clarity).
  • notes: why this case exists; what it’s testing.
  • optional labels: expected relevant doc_ids/chunk_ids (for retrieval eval).

For RAG cases, you may also store:

  • a frozen set of source chunks (for prompt-level eval), or
  • expected relevant chunk ids (for retrieval-level eval).
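
As a sketch of what one stored case can look like, assuming a Python dataclass as the storage schema; the field names follow the list above, and the example values are invented for illustration:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EvalCase:
    id: str                          # stable identifier
    input: str                       # user query or task input
    expected_outcome: str            # answered / not_found / needs_clarification / conflict / refused
    rubric_focus: list[str]          # which dimensions matter, e.g. ["correctness", "faithfulness"]
    notes: str                       # why this case exists; what it is testing
    context: Optional[dict] = None   # user role/tenant, app mode, constraints (if relevant)
    expected_chunk_ids: list[str] = field(default_factory=list)  # for retrieval-level eval
    synthetic: bool = False          # True if invented rather than taken from real traffic

case = EvalCase(
    id="nf-03",
    input="whats the refund policy for enterprise trials??",
    expected_outcome="not_found",
    rubric_focus=["faithfulness", "clarity"],
    notes="Out-of-corpus question; the system should say it cannot find this, not guess.",
    synthetic=True,
)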

Workflow: build, run, refine

  1. Collect 50 candidates: pull them from real sources (logs, tickets).
  2. Cluster them: group by intent and risk.
  3. Select 25: maximize coverage and risk-weighting.
  4. Write short notes: “what this catches.”
  5. Run your system: capture outputs and failures (see the runner sketch below).
  6. Refine cases: remove redundant ones, add missing edges, keep the set small.

The purpose is iteration: the eval set becomes your product’s “quality map.”
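
For step 5, a minimal runner sketch, assuming cases are stored as dictionaries with the fields described earlier and a run_system(case) entry point that you supply; both the interface and the eval_run.json filename are assumptions, not a prescribed setup:

import json
from datetime import datetime, timezone

def run_eval(cases, run_system):
    # Run every case, record the output, and flag outcome-type mismatches.
    results = []
    for case in cases:
        output = run_system(case)  # assumed to return {"text": ..., "outcome": ...}
        results.append({
            "id": case["id"],
            "expected_outcome": case["expected_outcome"],
            "observed_outcome": output.get("outcome"),
            "outcome_matches": output.get("outcome") == case["expected_outcome"],
            "output_text": output.get("text"),
        })
    failures = [r for r in results if not r["outcome_matches"]]
    run_record = {
        "ran_at": datetime.now(timezone.utc).isoformat(),
        "n_cases": len(results),
        "n_failures": len(failures),
        "results": results,
    }
    with open("eval_run.json", "w") as f:
        json.dump(run_record, f, indent=2)
    return failures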

Maintaining the eval set over time

Rules that keep the eval set useful:

  • Keep it curated: don’t let it grow without a reason.
  • Promote failures: any serious production failure becomes an eval case.
  • Retire obsolete cases: when product scope changes, remove cases that no longer apply.
  • Track versions: record the prompt version, model version, and corpus version with every run; otherwise results aren’t comparable across runs.
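
A small sketch of the kind of version metadata worth stamping onto every run record (the identifiers are placeholders; use whatever versioning scheme you already have):

# Attach these to every run record (e.g. merged into run_record from the runner above)
# so two runs can be compared apples to apples.
run_metadata = {
    "eval_set_version": "evalset-v3",          # bump when cases are added or retired
    "prompt_version": "answer_prompt_v12",
    "model_version": "provider-model-2025-01",
    "corpus_version": "docs-snapshot-2025-01-15",
}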

Copy-paste prompts

Prompt: propose eval cases from requirements

Help me build a small eval set (25 cases) for an AI feature.

Feature description: ...
Primary user tasks: ...
Risks if wrong: ...
Possible system states: answered / not_found / needs_clarification / conflict / refused

Task:
1) Propose 25 eval cases as a table with:
   - id
   - input
   - expected outcome type
   - why it matters (risk/edge case)
2) Ensure coverage across: happy path, ambiguity, not found, adversarial, high-risk.
