39.5 Prompt compression and distillation
Why Compress
Long prompts cost more and may exceed context limits. Compression reduces tokens while preserving meaning.
Before: 5,000 tokens → $0.05 per call
After: 1,500 tokens → $0.015 per call (70% savings)
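The arithmetic above generalizes into a quick back-of-the-envelope estimator. A minimal sketch: the ~4-characters-per-token ratio is a rough heuristic for English text (use your provider's tokenizer for real counts), and the per-1K-token price is an assumption, not a real rate.

```typescript
// Rough token count: ~4 characters per token is a common English-text heuristic.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// pricePer1K: what your provider charges per 1,000 input tokens (assumed value)
function estimateCost(tokens: number, pricePer1K: number): number {
  return (tokens / 1000) * pricePer1K;
}

// Example: compressing a 5,000-token prompt to 1,500 tokens at $0.01 / 1K tokens
const before = estimateCost(5000, 0.01); // $0.05 per call
const after = estimateCost(1500, 0.01);  // $0.015 per call
console.log(`Savings: ${(100 * (1 - after / before)).toFixed(0)}%`);
```

Run this against your own prompt lengths and pricing before deciding whether compression is worth the engineering effort.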
Techniques
1. Reference Instead of Inline
// Before: Include full examples
"Example 1: [500 tokens of code]
Example 2: [500 tokens of code]
Example 3: [500 tokens of code]"
// After: Reference examples instead of repeating them
"Follow the patterns shown in examples #1-3 in the system context."
// Store the examples once in the system prompt, or bake them in via fine-tuning
2. Summarize Context
// Before: Full file contents
const prompt = files.map(f => f.content).join('\n');
// After: Summarize
async function summarizeForContext(files: File[]) {
  return Promise.all(files.map(async f => {
    if (f.content.length < 500) return f.content;
    return model.generate({
      prompt: `Summarize this file in 100 words, focusing on key functions and types:\n${f.content}`,
      maxTokens: 150
    });
  }));
}
3. Remove Redundancy
// Before: Verbose prompt
"You are an AI assistant that helps with code review.
As a code review assistant, your job is to review code.
When reviewing code, you should look for bugs and issues.
Make sure to check for bugs when you review the code."
// After: Concise
"You are a code reviewer. Find bugs, security issues, and style problems."
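Redundancy like this can also be caught mechanically with near-duplicate sentence filtering. A minimal sketch, not part of the original example: the sentence-splitting regex and the 0.6 Jaccard-overlap threshold are arbitrary assumptions to tune per prompt.

```typescript
// Drop sentences whose word set heavily overlaps an earlier sentence's.
function dedupeSentences(prompt: string, threshold = 0.6): string {
  const sentences = prompt.split(/(?<=[.!?])\s+/).filter(s => s.trim());
  const kept: string[] = [];
  const seen: Set<string>[] = [];
  for (const s of sentences) {
    const words = new Set(s.toLowerCase().match(/[a-z']+/g) ?? []);
    // Jaccard overlap against every sentence we've already kept
    const isDupe = seen.some(prev => {
      const overlap = Array.from(words).filter(w => prev.has(w)).length;
      const union = new Set([...Array.from(words), ...Array.from(prev)]).size;
      return union > 0 && overlap / union >= threshold;
    });
    if (!isDupe) {
      kept.push(s);
      seen.push(words);
    }
  }
  return kept.join(' ');
}
```

Treat this as a first pass only: word-overlap filtering can drop sentences that repeat vocabulary but carry new constraints, so review the output before shipping the compressed prompt.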
Distillation
Train a smaller/cheaper model to mimic a larger model's behavior:
// distillation.ts
async function createDistillationDataset(
  prompts: string[],
  teacherModel: string,  // e.g., 'gemini-1.5-pro'
  studentModel: string   // e.g., 'gemini-1.5-flash'
) {
  const dataset = [];
  for (const prompt of prompts) {
    const teacherResponse = await generate({
      model: teacherModel,
      prompt
    });
    dataset.push({
      input: prompt,
      output: teacherResponse.text,
      // Include chain of thought if available
      reasoning: teacherResponse.thinking
    });
  }
  // Fine-tune the student model on the teacher's outputs
  await fineTune({
    model: studentModel,
    trainingData: dataset
  });
}
// Result: Flash model that performs like Pro for your use case
Measure Before Compressing
Always measure accuracy before and after compression. Some compression techniques degrade quality. Find the right balance for your use case.
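A measurement loop can be as simple as running the full and compressed prompt variants over the same fixed eval set and comparing pass rates. A minimal sketch: the model client is injected as a plain function so any provider fits, and every name here (`Generate`, `EvalCase`, `compareCompression`) is illustrative, not a real library API.

```typescript
type Generate = (prompt: string) => Promise<string>;

interface EvalCase {
  input: string;
  // Returns true if the model's answer is acceptable for this case
  check: (output: string) => boolean;
}

// Run one prompt template over the eval set; returns accuracy in [0, 1].
async function accuracy(
  generate: Generate,
  template: (input: string) => string,
  cases: EvalCase[]
): Promise<number> {
  let passed = 0;
  for (const c of cases) {
    const output = await generate(template(c.input));
    if (c.check(output)) passed++;
  }
  return passed / cases.length;
}

// Compare the full prompt against its compressed variant on identical cases.
async function compareCompression(
  generate: Generate,
  fullTemplate: (input: string) => string,
  compressedTemplate: (input: string) => string,
  cases: EvalCase[]
) {
  const full = await accuracy(generate, fullTemplate, cases);
  const compressed = await accuracy(generate, compressedTemplate, cases);
  return { full, compressed, drop: full - compressed };
}
```

If `drop` exceeds your quality budget, back off the compression (for example, summarize fewer files or restore an example) and re-run the same cases.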