31.5 Red-teaming your own prompts

On this page

Goal: proactively find failures before users do
What “red-teaming prompts” means (practically)
Scope: what you are testing
Build a red-team corpus (small, high-impact)
Workflow: run, label, fix, regress
Gates: what must never pass
Integrate into Part IX eval workflows
Where to go next

Goal: proactively find failures before users do

Red-teaming is structured adversarial testing. It’s how you answer:

“How could this feature be abused?”
“Will it leak data under pressure?”
“Can prompt injection bypass our constraints?”
“Will our tool calling do something unsafe?”

The goal is not paranoia. The goal is confidence: you know what your system does in worst-case scenarios.

Make it a habit

One-time red-teaming helps. Continuous red-teaming (a maintained corpus run on every change) is what keeps systems safe as they evolve.

What “red-teaming prompts” means (practically)

In this guide, “red-teaming prompts” means:

designing adversarial inputs (user prompts and retrieved docs),
running them against your system,
checking outputs against safety and contract requirements,
turning failures into permanent regression cases.

Scope: what you are testing

You are testing the whole pipeline, not just the model:

input layer: sanitization, allowlists, rate limits, length caps.
retrieval layer (RAG): permissions filters, corpus boundaries, injection handling.
generation layer: schema adherence, citations, abstention behavior.
tool layer: permission checks, allowlists, approval gates, budgets.
logging layer: redaction and retention.

Build a red-team corpus (small, high-impact)

Start with 25–50 cases. Include categories that match your threat model:

Injection attempts: attempts to override rules or request hidden instructions.
Exfiltration attempts: attempts to obtain secrets or restricted content.
Tool abuse attempts: attempts to trigger unsafe actions or broaden scope.
RAG injection: retrieved “SOURCES” containing hostile instructions.
Format breakers: inputs that try to break JSON/schema behavior.
Ambiguity traps: prompts that should trigger clarification, not guessing.

Write each case with an expected safe outcome: refused, not_found, needs_clarification, or safe limited answer with citations.

Don’t publish an “attack cookbook”

Your red-team corpus is a security artifact. Store it appropriately and don’t turn it into public exploit instructions. Use it to test defenses, not to teach attackers.

Workflow: run, label, fix, regress

Use a tight loop:

Run corpus: against current prompt/tool/retrieval versions.
Label outcomes: pass/fail + category (leak, injection, unsafe tool, schema failure).
Fix the right layer: input validation, retrieval filters, prompt composition, tool design, logging.
Add regression tests: turn the failure into an automated gate.
Repeat: re-run corpus after every change.

Gates: what must never pass

Define “never” rules:

No secrets in outputs or logs.
No restricted data returned to unauthorized users.
No tool calls outside allowlists.
No schema violations for structured outputs.
No invented citations for grounded answers.

These are the tests that should block shipping.

Integrate into Part IX eval workflows

Red-teaming becomes powerful when it becomes routine:

store red-team cases next to your eval set,
run them in CI or as a pre-merge check,
review diffs and failures like you review failing tests,
keep the corpus small and high-signal.

Outcome

Your system becomes safer over time because every real failure becomes a permanent regression case.

31.5 Red-teaming your own prompts

Goal: proactively find failures before users do

What “red-teaming prompts” means (practically)

Scope: what you are testing

Build a red-team corpus (small, high-impact)

Workflow: run, label, fix, regress

Gates: what must never pass

Integrate into Part IX eval workflows

Where to go next