27.3 Property-based tests for robustness

Overview and links for this section of the guide.

Goal: prove invariants hold across many inputs

Golden tests cover specific examples. Property-based tests cover classes of inputs.

The idea is simple:

  • Generate lots of inputs automatically (including weird ones).
  • Assert that invariants always hold (schema, constraints, safety, permissions).

For AI features, property-based testing is one of the best ways to prevent “we never considered this input shape” failures.

What property-based testing is (in plain terms)

Instead of writing:

  • Input A → expect output X
  • Input B → expect output Y

You write:

  • For any input in this class, the output must satisfy properties P1–P5.

This is perfect for systems where the exact content may vary but the rules must hold.

Property tests are contract tests

You’re not testing whether the model is “smart.” You’re testing whether your system behaves safely and predictably for many inputs.

High-value properties for AI systems

These properties catch a surprising number of production incidents:

  • Always-parseable: outputs are valid JSON (or safely handled when not).
  • Schema-compliant: required fields exist and types match.
  • Budget-bounded: output length and number of items are capped.
  • No forbidden strings: no secrets, no system prompt leakage, no internal tokens.
  • Deterministic post-processing: normalizers and validators are stable and idempotent.
  • Permissions safety (RAG): retrieved results never include forbidden tenants/roles.
  • Citation integrity: cited chunk ids must exist in retrieved set; quotes must exist in chunk text.
  • Fail-closed behavior: if constraints aren’t met, system returns not_found / needs_clarification / refused, not a guessed answer.

Concrete example properties

Property: output is either valid or safely rejected

For any input string:

  • Either output parses and validates against schema,
  • or your code returns a safe fallback and logs the validation failure.

Property: normalization is idempotent

For any valid output JSON:

  • normalize(normalize(x)) === normalize(x)

This prevents subtle bugs where repeated processing changes meaning or ordering.

Property: retrieval respects permissions

For any user context and any query:

  • All retrieved chunks must match the user’s permissions filters.

These are deterministic tests against your retrieval code and metadata filters, and they are security-critical.

Property: citations are internally consistent

For any generated answer that claims to have citations:

  • every cited chunk id must be among the provided sources,
  • every quoted snippet must be a substring of that chunk’s text.

This is a cheap way to catch invented citations.

Property: prompt injection strings don’t break contracts

For any input containing injection-like patterns (“ignore instructions”, “reveal secrets”, etc.):

  • output still adheres to schema,
  • system still refuses or abstains appropriately,
  • no disallowed content appears in logs or outputs.

Boundaries: what this cannot test

Property-based testing is not a substitute for evaluation of usefulness.

It cannot reliably test:

  • whether the answer is the “best” answer,
  • whether the summary is “helpful,”
  • whether the tone matches your brand.

Those belong in eval harnesses and human review (Section 28).

Workflow: start small and keep it cheap

To keep property tests from becoming expensive and flaky:

  1. Focus on deterministic code paths: validators, context packing, retrieval filters.
  2. Avoid live model calls: property tests should be fast.
  3. Use a small fuzz corpus: generate input strings but keep runtime bounded.
  4. Make failures actionable: log the minimal failing case so it’s easy to reproduce.

Where to go next