27.1 What you can unit test vs what you must evaluate statistically
Overview and links for this section of the guide.
On this page
- Goal: choose the right testing strategy
- The split: deterministic vs probabilistic
- What you can unit test (high-leverage invariants)
- What you must evaluate statistically (quality and usefulness)
- Hybrid approach: deterministic gates + evals
- Decision checklist
- Copy-paste prompts (planning tests and evals)
- Where to go next
Goal: choose the right testing strategy
AI features fail when teams either try to test them like deterministic code or don’t test them at all.
The right approach is a split:
- Unit test what must always hold (contracts and invariants).
- Evaluate what varies (quality, usefulness, and tradeoffs).
This page gives you a practical rubric for that split.
The split: deterministic vs probabilistic
Ask one question:
Is there an outcome that must be true for every input?
If yes, that’s an invariant. You should unit test it.
If no (because quality is subjective, admits many valid outputs, or depends on nuance), that’s evaluation territory.
Unit tests are how you enforce “the system can’t break the app.” Evals are how you enforce “the system remains useful.”
What you can unit test (high-leverage invariants)
These are the parts of an AI feature that should be deterministic and therefore testable:
1) Output structure and parsing
- Valid JSON: response parses.
- Schema compliance: required fields exist; types are correct; enums are valid.
- Normalization: canonical sorting, trimming, stable formatting.
- Validation failures are handled: retries or safe fallbacks trigger correctly.
- Partial outputs: your code handles missing optional fields safely.
These tests live in your application code (validators, parsers), not in the model.
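For example, a schema test never needs to call the model at all. A minimal sketch (pytest with pydantic; `TaskSummary` and `parse_model_response` are hypothetical stand-ins for your own schema and parser):

```python
# A minimal sketch; TaskSummary and parse_model_response are hypothetical
# stand-ins for your own output schema and parsing code.
import json
from typing import Literal

import pytest
from pydantic import BaseModel, ValidationError

class TaskSummary(BaseModel):
    title: str
    priority: Literal["low", "medium", "high"]
    tags: list[str] = []          # optional: code must tolerate it being absent

def parse_model_response(raw: str) -> TaskSummary:
    """Parse and validate the model's raw text into the app's schema."""
    return TaskSummary.model_validate(json.loads(raw))

def test_valid_response_parses_and_fills_optional_fields():
    raw = '{"title": "Fix login bug", "priority": "high"}'
    result = parse_model_response(raw)
    assert result.priority == "high"
    assert result.tags == []      # missing optional field handled safely

def test_invalid_enum_is_rejected():
    raw = '{"title": "Fix login bug", "priority": "urgent!!"}'
    with pytest.raises(ValidationError):
        parse_model_response(raw)  # caller should retry or fall back
```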
2) Prompt construction and context packing
- Prompt templates render correctly: variables inserted, no missing placeholders.
- Context budget enforcement: you never exceed your intended size limits.
- Source packaging: chunk ids and boundaries are preserved; sources are clearly separated from instructions.
- Redaction rules: sensitive fields are removed/obscured before inclusion.
These unit tests prevent subtle “we accidentally stopped sending the constraints” regressions.
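A sketch of what these tests can look like, assuming a hypothetical `build_prompt` helper and a character-based context budget:

```python
# A sketch of prompt-construction tests; build_prompt, TEMPLATE, and
# MAX_CONTEXT_CHARS are hypothetical stand-ins for your own template and budget.
import string

import pytest

MAX_CONTEXT_CHARS = 8_000

TEMPLATE = string.Template(
    "Answer using ONLY the sources below.\n"
    "--- SOURCES ---\n$sources\n--- END SOURCES ---\n"
    "Question: $question"
)

def build_prompt(question: str, sources: str) -> str:
    prompt = TEMPLATE.substitute(question=question, sources=sources)
    if len(prompt) > MAX_CONTEXT_CHARS:
        raise ValueError("context budget exceeded")
    return prompt

def test_all_placeholders_are_filled():
    prompt = build_prompt("What is the refund policy?", "[doc-12 v3] Refunds...")
    assert "$question" not in prompt and "$sources" not in prompt

def test_sources_are_separated_from_instructions():
    prompt = build_prompt("q", "[doc-1 v1] text")
    assert "--- SOURCES ---" in prompt and "--- END SOURCES ---" in prompt

def test_context_budget_is_enforced():
    with pytest.raises(ValueError):
        build_prompt("q", "x" * MAX_CONTEXT_CHARS)
```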
3) Retrieval and permissions (for RAG)
- Metadata filters work: tenant/team/role restrictions are enforced before the model sees text.
- Doc type filters work: canonical sources are preferred when required.
- Deletion handling works: inactive chunks are not retrievable.
- Audit logs are produced: you record retrieved chunk ids and versions.
These are security-critical and should be unit tested with synthetic data.
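A sketch using synthetic data, with a hypothetical in-memory index standing in for your vector store and retriever:

```python
# A sketch with synthetic data; Chunk, INDEX, and retrieve() are hypothetical
# stand-ins for your chunk metadata, vector store, and retriever.
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str
    tenant_id: str
    active: bool
    text: str

INDEX = [
    Chunk("c1", tenant_id="acme",   active=True,  text="Acme pricing..."),
    Chunk("c2", tenant_id="globex", active=True,  text="Globex pricing..."),
    Chunk("c3", tenant_id="acme",   active=False, text="Deleted Acme doc..."),
]

def retrieve(query: str, tenant_id: str) -> list[Chunk]:
    # metadata filters must be applied BEFORE any text reaches the model
    return [c for c in INDEX if c.tenant_id == tenant_id and c.active]

def test_tenant_filter_is_enforced():
    results = retrieve("pricing", tenant_id="acme")
    assert results and all(c.tenant_id == "acme" for c in results)

def test_deleted_chunks_are_not_retrievable():
    retrieved_ids = {c.chunk_id for c in retrieve("pricing", tenant_id="acme")}
    assert "c3" not in retrieved_ids
```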
4) Safety and policy gates
- Not-found and abstention behavior: when sources are missing, the system does not fabricate.
- Refusal-aware UX: policy refusals result in safe user messages and escalation paths.
- Tool budgets: if tool calling exists, budgets and allowlists prevent runaway loops.
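A sketch of the first two gates, using hypothetical `answer_question` and `render_for_user` functions:

```python
# A sketch of abstention and refusal gates; NOT_FOUND_MESSAGE, answer_question,
# and render_for_user are hypothetical names for your own orchestration and UX code.
NOT_FOUND_MESSAGE = "I couldn't find this in the available sources."

def answer_question(question: str, retrieved_chunks: list[str], call_model) -> str:
    if not retrieved_chunks:
        return NOT_FOUND_MESSAGE   # abstain instead of letting the model improvise
    return call_model(question, retrieved_chunks)

def render_for_user(model_output: str) -> str:
    # refusal-aware UX: map policy refusals to a safe message and escalation path
    if model_output.startswith("REFUSED:"):
        return "I can't help with that here. Please contact support."
    return model_output

def test_abstains_when_no_sources_are_found():
    def must_not_be_called(*args, **kwargs):
        raise AssertionError("model must not be called without sources")
    assert answer_question("refund policy?", [], must_not_be_called) == NOT_FOUND_MESSAGE

def test_refusals_become_safe_user_messages():
    assert "contact support" in render_for_user("REFUSED: policy violation")
```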
5) Cost and latency budgets
You can unit test guardrails that keep performance predictable:
- Timeouts are set.
- The number of chunks included is capped.
- Retry counts are bounded.
- Caching keys are computed safely.
- Logs do not include raw secrets or full source dumps.
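A sketch of these guardrail tests, assuming a hypothetical config object and logging helper:

```python
# A sketch of budget and logging guardrail tests; GenerationConfig and
# redact_for_logging are hypothetical stand-ins for your settings and log helpers.
from dataclasses import dataclass

@dataclass(frozen=True)
class GenerationConfig:
    request_timeout_s: float = 20.0
    max_chunks: int = 8
    max_retries: int = 2

CONFIG = GenerationConfig()
SENSITIVE_KEYS = {"api_key", "source_text"}

def redact_for_logging(payload: dict) -> dict:
    return {k: "***" if k in SENSITIVE_KEYS else v for k, v in payload.items()}

def test_budgets_are_bounded():
    assert 0 < CONFIG.request_timeout_s <= 30
    assert 0 < CONFIG.max_chunks <= 10
    assert CONFIG.max_retries <= 3

def test_logs_exclude_secrets_and_raw_sources():
    logged = redact_for_logging({"api_key": "sk-123", "source_text": "...", "model": "m1"})
    assert logged["api_key"] == "***" and logged["source_text"] == "***"
```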
Unit tests must be fast, deterministic, and stable. If your unit tests call the model directly, they will be flaky and the team will stop running them. Use mocks or recorded fixtures instead, as in the sketch below.
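A sketch of the recorded-fixture approach, with a hypothetical fake client replaying a saved response:

```python
# A sketch of a recorded-fixture test; FakeModelClient, summarize(), and the
# fixture contents are hypothetical. No network call ever happens.
import json
from pathlib import Path

class FakeModelClient:
    """Replays a recorded response instead of calling the real API."""
    def __init__(self, fixture_path: Path):
        self._response = json.loads(fixture_path.read_text())

    def complete(self, prompt: str) -> str:
        return self._response["text"]

def summarize(ticket: str, client) -> str:
    return client.complete(f"Summarize this support ticket:\n{ticket}")

def test_summarize_with_recorded_fixture(tmp_path):   # tmp_path is a pytest fixture
    fixture = tmp_path / "summary_fixture.json"
    fixture.write_text(json.dumps({"text": "Login fails after password reset."}))
    out = summarize("User cannot log in...", FakeModelClient(fixture))
    assert "Login fails" in out                        # fast, deterministic, offline
```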
What you must evaluate statistically (quality and usefulness)
Some aspects of AI output are not well-defined as “pass/fail” across all inputs:
- Helpfulness: does it address the user’s intent?
- Correctness in open-ended tasks: many valid outputs exist; edge cases vary.
- Clarity and tone: subjective and audience-dependent.
- Summarization quality: coverage vs concision tradeoffs.
- Reasoning quality: whether the model chose the right approach, not just whether its output compiles.
- UX satisfaction: whether users feel the system is reliable.
These require evaluation sets, rubrics, and often human review (Section 28).
Statistical evaluation is also needed because:
- small prompt changes can shift behavior in surprising ways,
- model/provider upgrades can cause regressions,
- retrieval and corpus changes alter answers even when code stays the same.
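A minimal harness sketch, assuming hypothetical `run_feature` and `score_with_rubric` functions (the rubric scorer can be automated checks, an LLM judge, or human grades loaded from disk):

```python
# A minimal eval-harness sketch; run_feature, score_with_rubric, and the JSONL
# case fields ("input", "expectations") are hypothetical assumptions.
import json
from pathlib import Path

def run_eval(cases_path: Path, run_feature, score_with_rubric) -> float:
    """Return the average rubric score over a JSONL file of eval cases."""
    cases = [json.loads(line) for line in cases_path.read_text().splitlines() if line.strip()]
    scores = [score_with_rubric(run_feature(c["input"]), c["expectations"]) for c in cases]
    return sum(scores) / len(scores)

# Usage idea: track this number per prompt/model version and fail CI on a drop.
# average = run_eval(Path("evals/cases.jsonl"), run_feature, score_with_rubric)
# assert average >= 0.80, f"eval regression: average score {average:.2f}"
```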
Hybrid approach: deterministic gates + evals
The most practical approach is hybrid:
- Deterministic gates: validate schema, citations, permissions, budgets.
- Evals: measure quality on a curated set of real cases.
- Human review for high-risk outputs: gate shipping when impact is high.
This creates a strong default posture:
- If output violates contract → reject and retry/fallback.
- If output passes contract but quality is poor → improve with eval-driven iteration.
Unit tests are like compiler checks: they keep the system from breaking. Evals are like benchmarks: they measure the thing you actually care about.
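A sketch of that default posture, with hypothetical `call_model` and `validate_contract` functions:

```python
# A sketch of the hybrid posture: deterministic gates reject contract violations
# before quality is even considered. call_model, validate_contract, and FALLBACK
# are hypothetical names for your own generation, gating, and degrade-mode code.
FALLBACK = {"status": "not_found", "message": "No supported answer was found."}

def generate_with_gates(request, call_model, validate_contract, max_retries: int = 2):
    for _ in range(max_retries + 1):
        candidate = call_model(request)
        ok, reason = validate_contract(candidate)   # schema, citations, permissions, budgets
        if ok:
            return candidate                        # quality is then measured by evals
        # reason can be logged here for debugging
    return FALLBACK                                 # contract not met: degrade safely
```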
Decision checklist
Use this checklist for any AI feature:
- What are the non-negotiables? (schema, safety, citations, permissions)
- What must never happen? (data leakage, invented citations, unsafe tool use)
- What is “good enough”? (rubric + eval set)
- What is the fallback? (not found, ask a question, degrade mode)
- What changes frequently? (prompts, corpora, models)
- How will you detect regressions? (eval harness run on every change)
- How will you debug failures? (logs: retrieved chunks, prompt version, model version)
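A sketch of the trace record implied by the last item, with hypothetical field names:

```python
# A sketch of a debug trace; the field names are hypothetical, but every answer
# should be traceable to the prompt version, model version, and chunks behind it.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AnswerTrace:
    prompt_version: str
    model_version: str
    retrieved_chunk_ids: list[str]
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

def test_every_answer_records_a_trace():
    trace = AnswerTrace("prompt-v12", "model-2025-01", ["c1", "c7"])
    assert trace.prompt_version and trace.model_version and trace.retrieved_chunk_ids
```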
Copy-paste prompts (planning tests and evals)
Prompt: define unit-testable invariants
We are building an AI feature. Help us define unit-testable invariants.
Context:
- Feature description: ...
- Output format (schema): ...
- Safety/permissions requirements: ...
- Failure fallback behavior: ...
Task:
1) List the invariants that must always hold (schema, citations, safety, budgets).
2) For each invariant, propose a unit test idea (what input, what assertion).
3) List what should NOT be unit tested (needs eval instead) and why.
Prompt: define an eval plan
Help us design an evaluation plan for this AI feature.
Context:
- Users: ...
- Primary tasks: ...
- Risks of wrong output: ...
- Output rubric dimensions we care about: ...
Task:
1) Propose 25 eval cases (realistic prompts/inputs).
2) Propose a scoring rubric (clear criteria).
3) Recommend how to run regressions across prompt versions.