33. Latency Optimization


Why speed matters in AI apps

AI models are inherently slow compared to database queries. A complex reasoning task can take 5-10 seconds. In user interface terms, 10 seconds is an eternity. Users perceive anything over 1 second as a delay and anything over 10 seconds as "broken."

Since we can't change the speed of light or the fundamental inference speed of the model, we have to use engineering tricks to mask latency or reduce the work required.

Perceived Latency vs. Actual Latency

Streaming the first token within about 0.5s makes a response feel immediate, even if the full answer takes 10s to finish. In interactive apps, always optimize for "time to first token" (TTFT) rather than total generation time.
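As an illustration, here is a minimal streaming sketch assuming the google-generativeai Python SDK; the API key, model name, and prompt are placeholders, and the timing code simply compares TTFT against total latency.

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")            # placeholder key
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name

start = time.perf_counter()
first_chunk_at = None

# stream=True yields chunks as they are generated instead of
# blocking until the complete response is ready.
for chunk in model.generate_content("Explain quantum tunneling.", stream=True):
    if first_chunk_at is None:
        first_chunk_at = time.perf_counter()
    print(chunk.text, end="", flush=True)

total = time.perf_counter() - start
print(f"\nTTFT: {first_chunk_at - start:.2f}s, total: {total:.2f}s")
```

Even when the total time is unchanged, showing the first chunk within half a second changes how the wait feels.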

The Physics of Latency

Latency comes from three places:

  1. Network: Sending the prompt to Google's servers and receiving the response.
  2. Prefill (Input Processing): The model reading your prompt. This is usually fast because input tokens are processed in parallel.
  3. Decoding (Output Generation): The model generating the answer token by token. This is the slow part, because each new token depends on all the tokens before it.
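To build intuition for why decoding dominates, here is a back-of-envelope estimate; the throughput and overhead numbers are illustrative assumptions, not measured figures for any particular model.

```python
# Rough latency model: prefill processes input tokens largely in parallel,
# while decoding emits output tokens one at a time.
PREFILL_TOKENS_PER_S = 5_000   # assumed prefill throughput
DECODE_TOKENS_PER_S = 50       # assumed decode throughput
NETWORK_OVERHEAD_S = 0.1       # assumed round-trip overhead

def estimated_latency(input_tokens: int, output_tokens: int) -> float:
    """Very rough end-to-end latency estimate in seconds."""
    prefill = input_tokens / PREFILL_TOKENS_PER_S
    decode = output_tokens / DECODE_TOKENS_PER_S
    return NETWORK_OVERHEAD_S + prefill + decode

# A 2,000-token prompt with a 500-token answer:
# ~0.1s network + ~0.4s prefill + ~10s decode -> decoding dominates.
print(f"{estimated_latency(2_000, 500):.1f}s")
```

Shaving tokens off the prompt helps a little; shaving tokens off the requested output helps a lot.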

Optimization Strategies

In this chapter, we'll explore how to attack each source of latency:

  • Prompt Engineering: Shorter prompts process faster.
  • Streaming: Showing progress immediately.
  • Parallelism: Doing work while the user is typing or while the model is thinking.
  • Caching: Reusing answers to skip the model entirely (see the sketch after this list).
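As a taste of the caching idea, here is a minimal exact-match cache keyed on a hash of the prompt. The call_model function is a hypothetical stand-in for your real model call; production systems usually add expiry, and often semantic (embedding-based) matching so near-duplicate prompts also hit the cache.

```python
import hashlib

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for the real (slow) model call."""
    return f"model answer to: {prompt}"

_cache: dict[str, str] = {}

def cached_generate(prompt: str) -> str:
    """Return a stored answer when the exact same prompt has been seen before."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)  # only pay for inference on a miss
    return _cache[key]

print(cached_generate("What is TTFT?"))  # miss: calls the model
print(cached_generate("What is TTFT?"))  # hit: returns instantly from the cache
```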

Where to go next