33. Latency Optimization
Latency Breakdown
Understand where time goes before optimizing:
┌─────────────────────────────────────────────────────────────────┐
│  REQUEST LATENCY BREAKDOWN                                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  CLIENT                                                         │
│   └─ Network to server ────────────────────── ~50ms             │
│                                                                 │
│  SERVER (your code)                                             │
│   ├─ Build prompt ─────────────────────────── ~10ms             │
│   ├─ Fetch context (DB, RAG) ──────────────── ~100-500ms        │
│   └─ Pre-processing ───────────────────────── ~20ms             │
│                                                                 │
│  LLM API                                                        │
│   ├─ Network to API ───────────────────────── ~50ms             │
│   ├─ Queue wait ───────────────────────────── ~0-200ms          │
│   ├─ Time to first token (TTFT) ───────────── ~200-500ms        │
│   └─ Token generation ─────────────────────── ~1-5s             │
│                                                                 │
│  TOTAL: ~1.4s - 6.3s (sum of the ranges above)                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
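To get these numbers for your own service instead of relying on typical figures, you can time each stage of a request. Below is a minimal sketch that uses only the Python standard library. `StageTimer` and the stage names are hypothetical, and `time.sleep()` stands in for the real work of each stage:

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Records wall-clock duration (in ms) for each named stage of a request."""

    def __init__(self) -> None:
        self.durations_ms: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Accumulate, so a stage entered twice reports its combined time.
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.durations_ms[name] = self.durations_ms.get(name, 0.0) + elapsed_ms

    def report(self) -> str:
        total = sum(self.durations_ms.values())
        lines = [f"{name:<20} {ms:8.1f} ms" for name, ms in self.durations_ms.items()]
        lines.append(f"{'TOTAL':<20} {total:8.1f} ms")
        return "\n".join(lines)


# Usage: each sleep is a placeholder for a real stage (prompt build, DB/RAG, API call).
timer = StageTimer()
with timer.stage("build_prompt"):
    time.sleep(0.01)
with timer.stage("fetch_context"):
    time.sleep(0.10)
with timer.stage("llm_call"):
    time.sleep(0.20)
print(timer.report())
```

Logging this report per request, or exporting the durations to your metrics system, shows which row of the breakdown dominates in practice.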
Optimization Levers
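| Lever | Impact | Effort |
|---|---|---|
| Streaming responses | Users see output at roughly TTFT instead of after full generation; total time is unchanged | Low |
| Use a Flash (faster, smaller) model | ~2-3x faster | Low |
| Parallel context fetch | ~2-5x faster prep (wait for the slowest fetch, not the sum) | Medium |
| Reduce prompt size | ~1.5-2x faster TTFT (less input to process) | Medium |
| Result caching | ~100x faster on a cache hit (no API call) | Medium |
| Connection pooling | ~50-100ms saved per request (no new TCP/TLS handshake) | Low |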
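Streaming does not shorten the request; it moves the first visible output from the end of generation to roughly TTFT. The sketch below measures both points on a simulated stream. `fake_token_stream`, `consume_stream`, and the timings are hypothetical stand-ins for a real streaming API response:

```python
import time
from typing import Iterator, Optional


def fake_token_stream(n_tokens: int, ttft_s: float, per_token_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response."""
    time.sleep(ttft_s)  # time to first token (prompt processing)
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_s)  # generation time for each later token
        yield f"tok{i} "


def consume_stream(stream: Iterator[str]) -> tuple[Optional[float], float, str]:
    """Return (seconds until first token, total seconds, full text)."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for token in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        parts.append(token)  # in a real app, flush this chunk to the client here
    total = time.perf_counter() - start
    return first_token_at, total, "".join(parts)


ttft, total, text = consume_stream(fake_token_stream(20, ttft_s=0.2, per_token_s=0.02))
print(f"first token after {ttft:.2f}s, full response after {total:.2f}s")
```

With streaming, the user starts reading at `ttft`; without it, nothing appears until `total`.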
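Parallel context fetch pays off because the context sources in the breakdown are usually independent I/O waits. Here is a minimal sketch with `asyncio.gather`, assuming your data sources expose async clients. The three `fetch_*` functions and their delays are hypothetical, with `asyncio.sleep` standing in for DB and vector-store calls:

```python
import asyncio
import time


# Hypothetical context sources; asyncio.sleep simulates network/DB latency.
async def fetch_user_profile(user_id: str) -> dict:
    await asyncio.sleep(0.15)
    return {"user_id": user_id, "plan": "pro"}


async def fetch_rag_chunks(query: str) -> list[str]:
    await asyncio.sleep(0.30)
    return [f"chunk about {query}"]


async def fetch_chat_history(user_id: str) -> list[str]:
    await asyncio.sleep(0.10)
    return ["previous message"]


async def gather_context(user_id: str, query: str) -> dict:
    # The three fetches run concurrently, so prep time is roughly the slowest
    # fetch (~0.30s here) rather than the sum of all three (~0.55s).
    profile, chunks, history = await asyncio.gather(
        fetch_user_profile(user_id),
        fetch_rag_chunks(query),
        fetch_chat_history(user_id),
    )
    return {"profile": profile, "chunks": chunks, "history": history}


start = time.perf_counter()
context = asyncio.run(gather_context("u123", "refund policy"))
elapsed = time.perf_counter() - start
print(f"context ready in {elapsed:.2f}s")
```

If one source is optional, passing `return_exceptions=True` to `asyncio.gather` returns a failed fetch's exception as a result instead of raising it, so the request can proceed without that piece of context.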
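One way to reduce prompt size is to cap retrieved context at a token budget and keep only the most relevant chunks. This sketch assumes chunks arrive as `(score, text)` pairs from a vector search. `approx_tokens` uses a rough 4-characters-per-token heuristic rather than a real tokenizer, and both function names are hypothetical:

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic (assumption): ~4 characters per token for English text.
    # Use your model's real tokenizer when you need exact counts.
    return max(1, len(text) // 4)


def select_chunks(chunks: list[tuple[float, str]], budget_tokens: int) -> list[str]:
    """Keep the highest-scoring chunks that fit within budget_tokens.

    chunks: (relevance_score, text) pairs, e.g. from a vector search.
    """
    selected = []
    used = 0
    for _score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = approx_tokens(text)
        if used + cost > budget_tokens:
            continue  # skip chunks that would overflow; a smaller one may still fit
        selected.append(text)
        used += cost
    return selected


retrieved = [
    (0.91, "Refunds are issued within 14 days of purchase."),
    (0.42, "Our office is closed on public holidays."),
    (0.77, "Refund requests need the original order number."),
]
result = select_chunks(retrieved, budget_tokens=22)
print(result)
```

Here the two refund chunks fit the budget and the low-relevance office-hours chunk is dropped, so the model processes less input before its first token.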
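Result caching only helps when identical requests recur, so the cache key must capture everything that affects the output. Below is a minimal in-memory sketch with a TTL. `ResponseCache`, `fake_llm`, and the model name `"flash"` are hypothetical, and a production setup would more likely use a shared store such as Redis:

```python
import hashlib
import json
import time
from typing import Callable


class ResponseCache:
    """In-memory cache of LLM responses keyed by model, prompt, and params, with a TTL."""

    def __init__(self, ttl_seconds: float = 3600.0) -> None:
        self.ttl_seconds = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}

    @staticmethod
    def _key(model: str, prompt: str, params: dict) -> str:
        # Everything that changes the output must be part of the key.
        payload = json.dumps({"model": model, "prompt": prompt, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model: str, prompt: str, params: dict,
                    call_llm: Callable[[str, str, dict], str]) -> str:
        key = self._key(model, prompt, params)
        hit = self._store.get(key)
        if hit is not None and time.monotonic() - hit[0] < self.ttl_seconds:
            return hit[1]  # cache hit: no network or model time at all
        response = call_llm(model, prompt, params)
        self._store[key] = (time.monotonic(), response)
        return response


calls = []


def fake_llm(model: str, prompt: str, params: dict) -> str:
    # Hypothetical stand-in for the real API call; records each invocation.
    calls.append(prompt)
    return f"answer to: {prompt}"


cache = ResponseCache(ttl_seconds=60)
first = cache.get_or_call("flash", "What is our refund window?", {"temperature": 0}, fake_llm)
second = cache.get_or_call("flash", "What is our refund window?", {"temperature": 0}, fake_llm)
print(first == second, len(calls))
```

Caching fits best with deterministic settings such as temperature 0. At higher temperatures, a cached answer hides the variation users might otherwise get.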