33.1 Cutting prompt size without losing accuracy
Why Size Matters
Larger prompts mean more tokens to process, which means a longer time-to-first-token (TTFT) and slower overall generation.
// Illustrative numbers: prefill cost grows roughly linearly with prompt size,
// but fixed overhead dominates for small prompts
Small prompt (1K tokens):   ~300ms TTFT
Medium prompt (10K tokens): ~500ms TTFT
Large prompt (100K tokens): ~2s TTFT
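You can only cut what you can measure, so it helps to count tokens before sending. A minimal sketch using the js-tiktoken package (one of several tokenizer libraries; the model name is an example and should match the model you actually call):

import { encodingForModel } from 'js-tiktoken';

// Count tokens the same way the model's tokenizer does
const enc = encodingForModel('gpt-4');
function countTokens(text: string): number {
  return enc.encode(text).length;
}

console.log(countTokens('Analyze this code:')); // prints the token count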
Reduction Techniques
// 1. Only include relevant context
// Before: Send entire file
const prompt = `Fix the bug in this file:\n${entireFile}`;
// After: Send relevant section
const relevantSection = extractFunction(file, 'buggyFunction');
const prompt = `Fix the bug:\n${relevantSection}`;
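// extractFunction above is a hypothetical helper, not a library call.
// A possible sketch: a naive brace-matching scan that pulls out one named function.
function extractFunction(source: string, name: string): string {
  const start = source.indexOf(`function ${name}`);
  if (start === -1) return source; // not found: fall back to the whole file
  let i = source.indexOf('{', start);
  if (i === -1) return source;
  let depth = 0;
  for (; i < source.length; i++) {
    if (source[i] === '{') depth++;
    if (source[i] === '}') depth--;
    if (depth === 0) break; // matching closing brace found
  }
  return source.slice(start, i + 1);
}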
// 2. Compress examples
// Before: 5 detailed examples (500 tokens each)
// After: 3 minimal examples (100 tokens each)
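// e.g. a sketch of "minimal" (illustrative placeholder content):
// terse input/output pairs instead of narrated walkthroughs
const examples = [
  { in: 'refund within 30 days?', out: 'yes' },
  { in: 'refund after 90 days?', out: 'no' },
  { in: 'exchange instead of refund?', out: 'yes' },
];
const fewShot = examples.map(e => `Q: ${e.in}\nA: ${e.out}`).join('\n');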
// 3. Remove redundancy
// Before: "Please read the following code carefully and analyze..."
// After: "Analyze this code:"
// 4. Use abbreviations in system prompts
// Before: "You must always respond in JSON format..."
// After: "Output: JSON only"
// 5. Trim chat history
function trimHistory(messages: Message[], maxTokens: number): Message[] {
  let tokens = 0;
  const kept: Message[] = [];
  // Walk backwards from the most recent message, keeping as many as fit.
  // slice() copies the array so reverse() doesn't mutate the caller's messages.
  for (const msg of messages.slice().reverse()) {
    tokens += countTokens(msg);
    if (tokens > maxTokens) break;
    kept.unshift(msg);
  }
  return kept;
}
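One caveat with trimming from the oldest end: a system message at the start of the history is the first thing dropped. A minimal variant, assuming the system prompt (if any) is messages[0] and that Message has a role field, pins it and trims only the conversation turns:

function trimHistoryKeepSystem(messages: Message[], maxTokens: number): Message[] {
  const [first, ...rest] = messages;
  // No leading system message: fall back to plain trimming
  if (!first || first.role !== 'system') return trimHistory(messages, maxTokens);
  // Reserve the system message's tokens, then trim the remaining turns
  const budget = maxTokens - countTokens(first);
  return [first, ...trimHistory(rest, budget)];
}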
Measuring Impact
async function measureLatencyBySize() {
  const sizes = [1000, 5000, 10000, 50000];
  for (const size of sizes) {
    // generatePromptOfSize is a placeholder for building a prompt of ~size tokens
    const prompt = generatePromptOfSize(size);
    const start = Date.now();
    await model.generate(prompt);
    // Note: this measures total request latency, not TTFT specifically
    const elapsed = Date.now() - start;
    console.log(`${size} tokens: ${elapsed}ms`);
  }
}
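To isolate TTFT rather than total latency, time the arrival of the first streamed token. A sketch assuming an OpenAI-style streaming SDK (model name is a placeholder; adapt to your provider):

import OpenAI from 'openai';

const client = new OpenAI();

// Measure time-to-first-token by timing the first streamed content chunk
async function measureTTFT(prompt: string): Promise<number> {
  const start = Date.now();
  const stream = await client.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: prompt }],
    stream: true,
  });
  for await (const chunk of stream) {
    if (chunk.choices[0]?.delta?.content) {
      return Date.now() - start; // first content token arrived
    }
  }
  return Date.now() - start; // stream ended without content
}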