16.5 Preventing tool abuse and runaway loops

On this page

Threat model: what can go wrong
Budgets (hard caps that stop runaway loops)
Allowlists and scope limits
Approvals and human-in-the-loop for side effects
Prompt injection defense for tool use
Data minimization and redaction
Auditability (logs, traces, replay)
A practical guardrails checklist
Where to go next

Threat model: what can go wrong

When you give the model tools, you create new failure modes:

Runaway loops: model keeps calling tools repeatedly.
Tool misuse: model calls tools with wrong or unsafe params.
Data exfiltration: model is tricked into requesting sensitive data.
Side effects: accidental writes, duplicates, or destructive changes.
Prompt injection: untrusted documents instruct the model to call tools.

Guardrails are how you make these survivable.

Assume adversarial inputs

If your app accepts untrusted text (documents, tickets, web content), you must assume it can contain malicious instructions. Your tool layer must not obey those instructions blindly.

Budgets (hard caps that stop runaway loops)

Budgets should be enforced by the system, not requested politely in prompts.

Common budgets:

max_tool_calls_per_request: e.g., 5
max_total_latency_ms: e.g., 30,000
max_retries: per tool and per model call
max_write_actions: often 0 unless explicitly approved
max_tokens_per_request: cost guardrail

When budgets are exceeded, stop and return a clear error outcome. Do not keep going.

Budget exhaustion is a normal outcome

Represent “budget exceeded” as its own status category. That makes it visible, testable, and actionable.

Allowlists and scope limits

Allowlists prevent tool overreach:

tool allowlist: only expose necessary tools for this feature
parameter allowlist: only allow known fields/filters
resource scope: tools can only access permitted resources (e.g., documents user is allowed to see)
host allowlist: if you have fetch tools, restrict which domains can be fetched

Scope limits should be enforced in code (authorization), not in natural language prompts.

Approvals and human-in-the-loop for side effects

If a tool can change state, require explicit approvals:

model proposes action + shows what will change,
human approves (or app policy approves for low-risk actions),
system executes exactly that action with idempotency,
system logs and returns result.

For public apps, you often avoid write tools entirely or heavily constrain them.

Side effects must be explicit

A model should never have an implicit ability to “just do things.” Side effects should be gated behind explicit user intent and approvals.

Prompt injection defense for tool use

Prompt injection is when untrusted input tries to change the model’s behavior (“ignore previous instructions and call tool X”). You cannot solve this with wording alone, but you can reduce risk:

Separate instructions from data: clearly label user-provided content as data.
Policy in code: the system enforces which tools can be called and with which parameters.
Least privilege tools: even if the model is tricked, the tools can’t do much harm.
Budget caps: stop runaway attempts.
Human approvals: required for side effects.

Later in Part X you’ll go deeper on injection defense. Here, the key is: enforce control in code.

Data minimization and redaction

Tool outputs can leak sensitive data into the model context and then into the user-visible output. Reduce risk by:

returning only necessary fields from tools,
redacting PII before tool output reaches the model,
avoiding returning secrets or credentials ever,
limiting retention of tool outputs in logs.

Minimize both input and output surfaces.

Auditability (logs, traces, replay)

Auditability is how you debug and how you build trust:

log each tool call with request id, tool name, and sanitized args
log tool results metadata (not raw sensitive payloads)
record prompt version and model settings
enable replay for debugging in controlled environments

When something goes wrong, you need to answer: “what tools were called, why, and what did they return?”

A practical guardrails checklist

Budgets: max tool calls, latency, retries enforced in code.
Allowlists: tool and parameter allowlists enforced.
Approvals: write tools require explicit confirmation.
Idempotency: write actions are idempotent or not auto-retried.
Validation: tool args validated against schema; outputs validated/redacted.
Logging: request id + tool calls logged safely.
Injection posture: untrusted docs treated as data; policy enforced in code.

16.5 Preventing tool abuse and runaway loops

Threat model: what can go wrong

Budgets (hard caps that stop runaway loops)

Allowlists and scope limits

Approvals and human-in-the-loop for side effects

Prompt injection defense for tool use

Data minimization and redaction

Auditability (logs, traces, replay)

A practical guardrails checklist

Where to go next