33.1 Cutting prompt size without losing accuracy

Prompt size is one of the most direct levers on latency: fewer input tokens means less prefill work before the first token arrives. This section covers why size matters, practical reduction techniques, and how to measure the impact.

Why Size Matters

Larger prompts mean more tokens to process, which slows time-to-first-token (TTFT) and lengthens total generation time.

// Illustrative figures; actual TTFT varies by model, provider, and hardware
Small prompt (1K tokens): ~300ms TTFT
Medium prompt (10K tokens): ~500ms TTFT
Large prompt (100K tokens): ~2s TTFT
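
Any reduction effort starts with knowing your current token count. The model's own tokenizer gives exact numbers; for quick budgeting, a character-based heuristic is often close enough. A minimal sketch (the ~4-characters-per-token ratio is a rule of thumb for English text, not a guarantee):

// Rough estimate: ~4 characters per token for English text.
// Use the provider's tokenizer when an exact count matters.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

console.log(estimateTokens('Analyze this code:')); // ~5 tokens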

Reduction Techniques

// 1. Only include relevant context
// Before: Send entire file
const prompt = `Fix the bug in this file:\n${entireFile}`;

// After: Send relevant section
const relevantSection = extractFunction(file, 'buggyFunction');
const prompt = `Fix the bug:\n${relevantSection}`;
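
extractFunction above is a stand-in. A rough sketch of one possible implementation using naive brace matching follows; a real version would use a proper parser (such as the TypeScript compiler API), since this sketch breaks on braces inside strings or comments:

// Naive sketch: find `function <name>` and return source up to the
// matching closing brace.
function extractFunction(source: string, name: string): string {
  const start = source.indexOf(`function ${name}`);
  if (start === -1) return source; // not found: fall back to the whole file
  const open = source.indexOf('{', start);
  if (open === -1) return source.slice(start);
  let depth = 0;
  for (let i = open; i < source.length; i++) {
    if (source[i] === '{') depth++;
    else if (source[i] === '}' && --depth === 0) {
      return source.slice(start, i + 1);
    }
  }
  return source.slice(start);
}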

// 2. Compress examples
// Before: 5 detailed examples (500 tokens each)
// After: 3 minimal examples (100 tokens each)
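
Concretely, compressed few-shot examples might look like this sketch (the examples and format are placeholders): terse input/output pairs instead of full annotated transcripts.

// Minimal few-shot examples: three short pairs often anchor the
// output format as well as five long ones.
const examples = [
  { input: 'refund not received', output: 'billing' },
  { input: 'app crashes on launch', output: 'bug' },
  { input: 'how do I export data?', output: 'how-to' },
];

const fewShot = examples
  .map((e) => `Q: ${e.input}\nA: ${e.output}`)
  .join('\n\n');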

// 3. Remove redundancy
// Before: "Please read the following code carefully and analyze..."
// After: "Analyze this code:"

// 4. Use abbreviations in system prompts
// Before: "You must always respond in JSON format..."
// After: "Output: JSON only"

// 5. Trim chat history
function trimHistory(messages: Message[], maxTokens: number) {
  let tokens = 0;
  const kept: Message[] = [];

  // Walk from newest to oldest, keeping messages until the budget is hit.
  // Iterate by index so the caller's array is not mutated (.reverse()
  // would have reversed it in place).
  for (let i = messages.length - 1; i >= 0; i--) {
    tokens += countTokens(messages[i]);
    if (tokens > maxTokens) break;
    kept.unshift(messages[i]);
  }

  return kept;
}
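
trimHistory assumes a Message type and a countTokens helper. One minimal wiring, reusing the estimateTokens heuristic from above (the field names here are assumptions, not a fixed schema):

interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Assumed helper: approximate the token cost of a message from its text.
const countTokens = (msg: Message): number => estimateTokens(msg.content);

const conversation: Message[] = [
  { role: 'user', content: 'What is TTFT?' },
  { role: 'assistant', content: 'Time to first token.' },
  { role: 'user', content: 'How do I reduce it?' },
];

const recent = trimHistory(conversation, 4000); // keep at most ~4K tokens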

Measuring Impact

async function measureLatencyBySize() {
  const sizes = [1000, 5000, 10000, 50000];

  for (const size of sizes) {
    // generatePromptOfSize and model.generate are placeholders for
    // your prompt builder and model client.
    const prompt = generatePromptOfSize(size);
    const start = Date.now();
    await model.generate(prompt);
    // Measures total latency (TTFT + generation), not TTFT alone.
    const elapsed = Date.now() - start;
    console.log(`${size} tokens: ${elapsed}ms`);
  }
}
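
Timing a full generate call measures total latency. To isolate TTFT, you need a streaming API and a timer stopped at the first chunk. A sketch assuming a hypothetical client.stream that returns an async iterable of text chunks (adapt to your SDK's actual streaming interface):

// Hypothetical streaming client; substitute your SDK's streaming call.
declare const client: { stream(prompt: string): AsyncIterable<string> };

async function measureTTFT(prompt: string): Promise<number> {
  const start = Date.now();
  for await (const _chunk of client.stream(prompt)) {
    return Date.now() - start; // time to the first chunk
  }
  return Date.now() - start; // stream produced no chunks
}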
