33.1 Cutting prompt size without losing accuracy
Identifying Prompt Bloat
Big prompts are slow twice over. The prefill stage (reading the prompt) processes input tokens in parallel, so it is fast per token, but time-to-first-token still grows with prompt length because the model must attend to every token before it can start generating. A bloated prompt also invites a long-winded response, and output tokens are produced one at a time: at a typical decode rate of a few dozen tokens per second, an unrequested 300-token explanation adds several seconds by itself.
Common sources of bloat:
- Copy-pasted entire files when only a function signature was needed (see the sketch after this list).
- Excessive XML tags used for structure (JSON is tighter).
- Over-polite instructions ("Please, if you would be so kind...").
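One way to tackle the first item is to send the model interfaces instead of implementations. Here is a minimal sketch using Python's standard ast module that keeps each def and class but replaces every function body with `...`; the file path in the usage line is hypothetical:

```python
import ast

def signatures_only(source: str) -> str:
    """Return only top-level def/class stubs, with every function body
    replaced by `...`, so the prompt carries interfaces, not code."""
    tree = ast.parse(source)
    stubs = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Blank out every function body, including methods inside classes.
            for inner in ast.walk(node):
                if isinstance(inner, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    inner.body = [ast.Expr(ast.Constant(...))]
            stubs.append(ast.unparse(node))
    return "\n\n".join(stubs)

# Usage (hypothetical file): paste the stubs, not the whole module.
source = open("billing/service.py").read()
print(signatures_only(source))
```

For a few-hundred-line module, the stubs are often a tenth of the size of the full source, and they are usually all the model needs to write a caller or a test.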
Compression Techniques
- Remove "Chatter": Delete polite phrases. "Write a function to..." is better than "I was wondering if you could help me write a function to..."
- Use Reference IDs: Instead of repeating a filename 10 times, say "File A" and define it once.
- Ask for Brevity: Explicitly instruct: "Do not explain. Return code only." Because output tokens are generated one at a time, this often saves more time than any amount of prompt trimming.
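To see the first two techniques in numbers, here is a minimal sketch using the tiktoken library; the file path is hypothetical and the encoding choice is an assumption that depends on your model:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumption: a cl100k-style tokenizer

verbose = (
    "I was wondering if you could possibly help me write a function that "
    "parses the config file located at src/app/config/settings.yaml, and "
    "also validates src/app/config/settings.yaml against the schema, please?"
)
compressed = (
    "File A = src/app/config/settings.yaml.\n"
    "Write a function that parses File A and validates it against the schema. "
    "Do not explain. Return code only."
)

print(len(enc.encode(verbose)), "tokens before")
print(len(enc.encode(compressed)), "tokens after")
```

Counting tokens rather than characters is the honest comparison, since latency and cost are billed per token.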
The "Code Only" Rule
The fastest way to reduce latency is to stop the model from explaining itself. Output is generated token by token, so every paragraph of explanation is seconds of pure waiting. If you just need the diff, ask for the diff.
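A minimal sketch of the rule in practice, assuming the OpenAI Python SDK (v1.x); the model name and code snippet are placeholders:

```python
from openai import OpenAI  # assumes the OpenAI Python SDK v1.x

client = OpenAI()  # reads OPENAI_API_KEY from the environment

snippet = "def load(cfg):\n    return parse(cfg)\n"  # placeholder code

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: substitute your model
    max_tokens=400,       # hard cap on output as a latency backstop
    messages=[
        {"role": "system",
         "content": "Return only a unified diff. No prose, no code fences, "
                    "no explanation."},
        {"role": "user",
         "content": "Rename `cfg` to `config` in this function:\n" + snippet},
    ],
)
print(response.choices[0].message.content)
```

Pairing the instruction with an output-token cap is a useful belt-and-suspenders move: if the model ignores the "diff only" directive, the cap still bounds the worst-case wait.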