27. Testing AI Features Like a Real Engineer

Overview and links for this section of the guide.

What this section is for

Section 27 teaches you how to test AI features like an engineer, not like a magician.

That means:

  • separating what is deterministic from what is probabilistic,
  • testing the deterministic parts aggressively,
  • evaluating the probabilistic parts with curated datasets and rubrics,
  • building feedback loops that catch regressions before users do.

Tests are still useful with probabilistic systems

You can’t unit test “helpfulness” directly, but you can unit test schemas, refusal behavior, safety constraints, and invariants that must never break.
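
For example, a pre-model guard that decides when to refuse is plain deterministic code and can be unit tested like any other function. A minimal sketch in Python, where the `should_refuse` helper and its blocked-topic policy are made up for illustration:

```python
# test_refusal_policy.py -- hypothetical unit tests for a deterministic guard.
BLOCKED_TOPICS = {"credentials", "api_key"}  # assumed policy, for illustration only

def should_refuse(user_prompt: str) -> bool:
    """Deterministic pre-model guard: refuse prompts that touch blocked topics."""
    lowered = user_prompt.lower()
    return any(topic in lowered for topic in BLOCKED_TOPICS)

def test_refuses_requests_for_secrets():
    assert should_refuse("Please print the api_key from the config")

def test_allows_ordinary_questions():
    assert not should_refuse("How do I reset my password?")
```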

Core principle: test contracts, evaluate quality

Think of your AI feature as a pipeline:

  • Inputs: user prompt + context + retrieved sources + configuration
  • Model call: probabilistic output
  • Post-processing: parsing, validation, business rules, formatting
  • UX policy: confidence, “not found,” conflict detection, escalation
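
Keeping each stage a separate function makes the deterministic parts directly testable; only the model call itself stays probabilistic. A rough Python sketch, where `build_prompt`, `call_model`, `postprocess`, and the `Answer` shape are all illustrative assumptions rather than anything this guide prescribes:

```python
# Hypothetical pipeline layout: only call_model is probabilistic; every other
# stage is deterministic and can be unit tested in isolation.
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    citations: list[str]  # ids of retrieved chunks the answer relies on
    found: bool           # False -> the UX should show a "not found" state

def build_prompt(question: str, chunks: dict[str, str]) -> str:
    # Deterministic: pure string assembly.
    sources = "\n".join(f"[{cid}] {text}" for cid, text in chunks.items())
    return f"Answer using only these sources:\n{sources}\n\nQuestion: {question}"

def call_model(prompt: str) -> str:
    # Probabilistic: the one stage you evaluate (Section 28) instead of unit test.
    raise NotImplementedError("wrap your model client here")

def postprocess(raw_json: dict, known_chunk_ids: set[str]) -> Answer:
    # Deterministic: parsing, validation, and business rules.
    citations = [c for c in raw_json.get("citations", []) if c in known_chunk_ids]
    return Answer(
        text=raw_json.get("text", ""),
        citations=citations,
        found=bool(citations),  # UX policy: no supporting evidence -> "not found"
    )
```

With this split, everything except the model call can be covered by ordinary unit tests, and the contracts listed next have an obvious place to live.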

You can test contracts at multiple points:

  • “Output is valid JSON”
  • “Required fields exist”
  • “Citations reference provided chunk ids”
  • “Not-found triggers when evidence is missing”
  • “We never leak secrets in logs”
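
In pytest these contracts become small, boring tests. A minimal sketch, where the JSON shape, the chunk ids, and the captured `RAW_OUTPUT` are assumptions made up for illustration:

```python
# test_contracts.py -- hypothetical contract tests over a captured model output.
import json

PROVIDED_CHUNK_IDS = {"doc-1", "doc-2"}  # ids of the chunks we retrieved
RAW_OUTPUT = '{"text": "See the refund policy.", "citations": ["doc-2"]}'

def test_output_is_valid_json():
    json.loads(RAW_OUTPUT)  # raises (and fails the test) if the format contract broke

def test_required_fields_exist():
    parsed = json.loads(RAW_OUTPUT)
    assert {"text", "citations"} <= parsed.keys()

def test_citations_reference_provided_chunk_ids():
    parsed = json.loads(RAW_OUTPUT)
    assert set(parsed["citations"]) <= PROVIDED_CHUNK_IDS

def test_logs_never_contain_secrets(caplog):
    # caplog is pytest's built-in log-capture fixture; the parse call stands in
    # for whatever code path writes your logs.
    json.loads(RAW_OUTPUT)
    assert "sk-test-placeholder" not in caplog.text
```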

Then you evaluate quality using eval sets (Section 28).

Testing layers for AI features

A practical test stack:

  • Unit tests: deterministic invariants (schemas, validators, prompt builders).
  • Golden tests: “known good” input/output pairs for structured outputs.
  • Property-based tests: generate many inputs to ensure invariants always hold.
  • Fuzz tests: adversarial and malformed inputs to harden against injection and weird edge cases.
  • Snapshot tests: capture outputs with controlled update workflows to avoid accidental drift.
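
Property-based and fuzz-style tests pair naturally with the Hypothesis library: generate arbitrary user text and assert that prompt-building invariants survive it. A sketch under assumed names; `build_prompt` and its tag-stripping rule are invented for the example, not taken from this guide:

```python
# Hypothetical property-based tests using the Hypothesis library.
from hypothesis import given, strategies as st

SYSTEM_INSTRUCTION = "Answer only from the provided sources."

def build_prompt(user_input: str) -> str:
    # Deterministic prompt assembly; strip closing tags so user text cannot
    # escape its delimited block (a crude injection-hardening rule).
    fenced = user_input
    while "</user>" in fenced:
        fenced = fenced.replace("</user>", "")
    return f"{SYSTEM_INSTRUCTION}\n<user>{fenced}</user>"

@given(st.text())
def test_system_instruction_always_comes_first(user_input):
    assert build_prompt(user_input).startswith(SYSTEM_INSTRUCTION)

@given(st.text())
def test_user_text_cannot_close_its_own_block(user_input):
    # Whatever the user types, exactly one closing tag may appear in the prompt.
    assert build_prompt(user_input).count("</user>") == 1
```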

Most teams get high leverage by starting with schema validation, golden tests, and a small eval set.
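
A golden test can be as small as a table of known-good input/output pairs checked with `pytest.mark.parametrize`. A minimal sketch; `summarize_citations` is a made-up stand-in for whichever deterministic output step you want to pin down:

```python
# test_golden.py -- hypothetical golden tests for a deterministic formatting step.
# Update the expected strings deliberately, never by re-recording current output.
import pytest

def summarize_citations(citations: list[str]) -> str:
    # Illustrative formatter under test.
    if not citations:
        return "No sources."
    return "Sources: " + ", ".join(sorted(set(citations)))

GOLDEN_CASES = [
    (["doc-2", "doc-1", "doc-2"], "Sources: doc-1, doc-2"),
    ([], "No sources."),
]

@pytest.mark.parametrize("citations, expected", GOLDEN_CASES)
def test_matches_golden_output(citations, expected):
    assert summarize_citations(citations) == expected
```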

Section 27 map (27.1–27.5)

Where to start