31.5 Red-teaming your own prompts

Overview and links for this section of the guide.

Goal: proactively find failures before users do

Red-teaming is structured adversarial testing. It’s how you answer:

  • “How could this feature be abused?”
  • “Will it leak data under pressure?”
  • “Can prompt injection bypass our constraints?”
  • “Will our tool calling do something unsafe?”

The goal is not paranoia. The goal is confidence: you know what your system does in worst-case scenarios.

Make it a habit

One-time red-teaming helps. Continuous red-teaming (a maintained corpus run on every change) is what keeps systems safe as they evolve.

What “red-teaming prompts” means (practically)

In this guide, “red-teaming prompts” means:

  • designing adversarial inputs (user prompts and retrieved docs),
  • running them against your system,
  • checking outputs against safety and contract requirements,
  • turning failures into permanent regression cases.

Scope: what you are testing

You are testing the whole pipeline, not just the model:

  • input layer: sanitization, allowlists, rate limits, length caps.
  • retrieval layer (RAG): permissions filters, corpus boundaries, injection handling.
  • generation layer: schema adherence, citations, abstention behavior.
  • tool layer: permission checks, allowlists, approval gates, budgets.
  • logging layer: redaction and retention.

Build a red-team corpus (small, high-impact)

Start with 25–50 cases. Include categories that match your threat model:

  • Injection attempts: attempts to override rules or request hidden instructions.
  • Exfiltration attempts: attempts to obtain secrets or restricted content.
  • Tool abuse attempts: attempts to trigger unsafe actions or broaden scope.
  • RAG injection: retrieved “SOURCES” containing hostile instructions.
  • Format breakers: inputs that try to break JSON/schema behavior.
  • Ambiguity traps: prompts that should trigger clarification, not guessing.

Write each case with an expected safe outcome: refused, not_found, needs_clarification, or safe limited answer with citations.

Don’t publish an “attack cookbook”

Your red-team corpus is a security artifact. Store it appropriately and don’t turn it into public exploit instructions. Use it to test defenses, not to teach attackers.

Workflow: run, label, fix, regress

Use a tight loop:

  1. Run corpus: against current prompt/tool/retrieval versions.
  2. Label outcomes: pass/fail + category (leak, injection, unsafe tool, schema failure).
  3. Fix the right layer: input validation, retrieval filters, prompt composition, tool design, logging.
  4. Add regression tests: turn the failure into an automated gate.
  5. Repeat: re-run corpus after every change.

Gates: what must never pass

Define “never” rules:

  • No secrets in outputs or logs.
  • No restricted data returned to unauthorized users.
  • No tool calls outside allowlists.
  • No schema violations for structured outputs.
  • No invented citations for grounded answers.

These are the tests that should block shipping.

Integrate into Part IX eval workflows

Red-teaming becomes powerful when it becomes routine:

  • store red-team cases next to your eval set,
  • run them in CI or as a pre-merge check,
  • review diffs and failures like you review failing tests,
  • keep the corpus small and high-signal.
Outcome

Your system becomes safer over time because every real failure becomes a permanent regression case.

Where to go next