Part XIV — Team Workflows and "Shipping With Adults in the Room" › 43. Collaboration Patterns
43.1 Prompt reviews like code reviews
Why Review Prompts
Prompts are code. They determine system behavior, shape user experience, and can introduce bugs just as easily as any function. They deserve the same review rigor as the rest of your codebase.
// This innocent change broke everything
- "You are a helpful customer support agent."
+ "You are an extremely helpful and eager customer support agent who always says yes."
// Result: Agent started promising refunds we couldn't honor
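An incident like this can be turned into a permanent regression check. Below is a minimal sketch of such a check: `promisesExcessiveRefund` is a hypothetical guard (not from the original) that scans a model response for refund amounts above the $50 escalation threshold. A real eval would run the prompt against the model and assert on the responses; the regex is a stand-in for that assertion step.

```typescript
// Hypothetical guard derived from the incident above: flag any response
// that promises a refund over the escalation limit. A real eval suite
// would generate `response` by running the prompt against the model.
const REFUND_LIMIT = 50;

function promisesExcessiveRefund(response: string): boolean {
  // Match a dollar amount appearing shortly after refund language.
  const match = response.match(/refund(?:ed)?(?:\s+\w+){0,4}\s+\$(\d+(?:\.\d{2})?)/i);
  if (!match) return false;
  return parseFloat(match[1]) > REFUND_LIMIT;
}
```

Checks like this are cheap to write right after a prompt-caused incident, and they keep the next "innocent change" from reintroducing the same behavior.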
Review Process
// prompts/customer-support.md
---
version: 2.3.0
author: [email protected]
reviewers: [[email protected], [email protected]]
eval_set: golden-support-v2
last_tested: 2024-01-15
accuracy: 94.2%
---
You are a customer support agent for Acme Corp.
## Rules
- Never promise refunds > $50 without escalation
- Always verify order ID before discussing details
- Escalate any mention of legal action
## Tone
- Professional but warm
- Apologize for issues, don't blame the customer
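The metadata header makes the file machine-checkable: CI can parse it and refuse prompts that are missing a version, eval set, or reviewers. Here is a minimal frontmatter parser, assuming the simple `key: value` format of the example above (a production setup would use a real YAML parser):

```typescript
// Minimal sketch: split the "---"-delimited frontmatter header from the
// prompt body and read its key: value pairs. Assumes flat keys as in
// the example file; nested YAML would need a proper parser.
interface PromptMeta {
  [key: string]: string;
}

function parseFrontmatter(source: string): { meta: PromptMeta; body: string } {
  const match = source.match(/^---\n([\s\S]*?)\n---\n?([\s\S]*)$/);
  if (!match) return { meta: {}, body: source };
  const meta: PromptMeta = {};
  for (const line of match[1].split("\n")) {
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    meta[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return { meta, body: match[2] };
}

const file = `---
version: 2.3.0
eval_set: golden-support-v2
---
You are a customer support agent for Acme Corp.`;

const { meta, body } = parseFrontmatter(file);
```

Once parsed, enforcing "every prompt names an eval set" is a one-line CI assertion on `meta`.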
Review Checklist
## Prompt Review Checklist
### Safety
- [ ] No new capabilities that could be misused
- [ ] Sensitive data handling is appropriate
- [ ] Escalation rules are maintained
- [ ] No prompt injection vulnerabilities
### Quality
- [ ] Clear, unambiguous instructions
- [ ] Examples are representative
- [ ] Edge cases are handled
- [ ] Output format is specified
### Testing
- [ ] Eval suite passes (>= previous accuracy)
- [ ] New test cases added for new behavior
- [ ] Manual spot-check completed
- [ ] No regression on existing capabilities
### Documentation
- [ ] Version number updated
- [ ] Changelog entry added
- [ ] Breaking changes documented
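Some of the documentation items can be enforced automatically rather than relying on reviewers to remember them. The sketch below (hypothetical helpers, not from the original) checks two of them: the version number was bumped, and the changelog mentions the new version.

```typescript
// Naive semver comparison for illustration; a real setup would use a
// semver library.
function semverGreater(a: string, b: string): boolean {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    if ((pa[i] ?? 0) !== (pb[i] ?? 0)) return (pa[i] ?? 0) > (pb[i] ?? 0);
  }
  return false;
}

// Returns a list of checklist failures; empty means the automated
// documentation checks pass.
function checkMetadata(oldVersion: string, newVersion: string, changelog: string): string[] {
  const failures: string[] = [];
  if (!semverGreater(newVersion, oldVersion)) {
    failures.push("version number not bumped");
  }
  if (!changelog.includes(newVersion)) {
    failures.push("no changelog entry for new version");
  }
  return failures;
}
```

Automating the mechanical items frees reviewers to focus on the judgment calls: safety, quality, and tone.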
Tooling
// prompt-diff.ts
// Show meaningful diffs for prompt changes
function diffPrompts(oldPrompt: string, newPrompt: string): PromptDiff {
// Structural diff, not just text diff
const oldSections = parsePromptSections(oldPrompt);
const newSections = parsePromptSections(newPrompt);
return {
addedSections: newSections.filter(s => !oldSections.find(o => o.name === s.name)),
removedSections: oldSections.filter(s => !newSections.find(n => n.name === s.name)),
modifiedSections: findModifiedSections(oldSections, newSections),
ruleChanges: diffRules(oldPrompt, newPrompt),
toneChanges: detectToneShift(oldPrompt, newPrompt)
};
}
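`diffPrompts` leans on `parsePromptSections`, which is not shown. A minimal version splits the prompt on markdown `## ` headings, as used in the example prompt file earlier, collecting any text before the first heading into a "preamble" section:

```typescript
interface PromptSection {
  name: string;
  content: string;
}

// Minimal sketch of the section parser diffPrompts relies on: one
// section per "## " heading, with leading text under "preamble".
// Assumes prompts follow the heading convention of the example file.
function parsePromptSections(prompt: string): PromptSection[] {
  const sections: PromptSection[] = [];
  let current: PromptSection = { name: "preamble", content: "" };
  for (const line of prompt.split("\n")) {
    if (line.startsWith("## ")) {
      if (current.content.trim()) sections.push(current);
      current = { name: line.slice(3).trim(), content: "" };
    } else {
      current.content += line + "\n";
    }
  }
  if (current.content.trim() || current.name !== "preamble") sections.push(current);
  return sections;
}

const sections = parsePromptSections(
  "You are an agent.\n## Rules\n- rule one\n## Tone\n- warm"
);
```

Diffing at the section level is what makes review comments actionable: "the Rules section changed" is far more useful than a raw text diff across the whole prompt.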
// In CI: fail if accuracy drops by more than two points
// (accuracy stored as a fraction here, e.g. 0.942)
async function promptCICheck(pr: PullRequest): Promise<{ passed: boolean; reason?: string }> {
const changedPrompts = pr.files.filter(f => f.path.startsWith('prompts/'));
for (const prompt of changedPrompts) {
const baseline = await getBaseline(prompt.path);
const newResults = await runEvalSuite(prompt.content);
if (newResults.accuracy < baseline.accuracy - 0.02) {
return { passed: false, reason: `Accuracy dropped: ${baseline.accuracy} → ${newResults.accuracy}` };
}
}
return { passed: true };
}