27.3 Property-based tests for robustness
Overview and links for this section of the guide.
On this page
Goal: prove invariants hold across many inputs
Golden tests cover specific examples. Property-based tests cover classes of inputs.
The idea is simple:
- Generate lots of inputs automatically (including weird ones).
- Assert that invariants always hold (schema, constraints, safety, permissions).
For AI features, property-based testing is one of the best ways to prevent “we never considered this input shape” failures.
What property-based testing is (in plain terms)
Instead of writing:
- Input A → expect output X
- Input B → expect output Y
You write:
- For any input in this class, the output must satisfy properties P1–P5.
This is perfect for systems where the exact content may vary but the rules must hold.
You’re not testing whether the model is “smart.” You’re testing whether your system behaves safely and predictably for many inputs.
High-value properties for AI systems
These properties catch a surprising number of production incidents:
- Always-parseable: outputs are valid JSON (or safely handled when not).
- Schema-compliant: required fields exist and types match.
- Budget-bounded: output length and number of items are capped.
- No forbidden strings: no secrets, no system prompt leakage, no internal tokens.
- Deterministic post-processing: normalizers and validators are stable and idempotent.
- Permissions safety (RAG): retrieved results never include forbidden tenants/roles.
- Citation integrity: cited chunk ids must exist in retrieved set; quotes must exist in chunk text.
- Fail-closed behavior: if constraints aren’t met, system returns not_found / needs_clarification / refused, not a guessed answer.
Concrete example properties
Property: output is either valid or safely rejected
For any input string:
- Either output parses and validates against schema,
- or your code returns a safe fallback and logs the validation failure.
Property: normalization is idempotent
For any valid output JSON:
normalize(normalize(x)) === normalize(x)
This prevents subtle bugs where repeated processing changes meaning or ordering.
Property: retrieval respects permissions
For any user context and any query:
- All retrieved chunks must match the user’s permissions filters.
These are deterministic tests against your retrieval code and metadata filters, and they are security-critical.
Property: citations are internally consistent
For any generated answer that claims to have citations:
- every cited chunk id must be among the provided sources,
- every quoted snippet must be a substring of that chunk’s text.
This is a cheap way to catch invented citations.
Property: prompt injection strings don’t break contracts
For any input containing injection-like patterns (“ignore instructions”, “reveal secrets”, etc.):
- output still adheres to schema,
- system still refuses or abstains appropriately,
- no disallowed content appears in logs or outputs.
Boundaries: what this cannot test
Property-based testing is not a substitute for evaluation of usefulness.
It cannot reliably test:
- whether the answer is the “best” answer,
- whether the summary is “helpful,”
- whether the tone matches your brand.
Those belong in eval harnesses and human review (Section 28).
Workflow: start small and keep it cheap
To keep property tests from becoming expensive and flaky:
- Focus on deterministic code paths: validators, context packing, retrieval filters.
- Avoid live model calls: property tests should be fast.
- Use a small fuzz corpus: generate input strings but keep runtime bounded.
- Make failures actionable: log the minimal failing case so it’s easy to reproduce.