33. Latency Optimization

This section breaks down where request latency actually goes and ranks the optimization levers by impact and effort, so you fix the slowest stage first rather than guessing.

Latency Breakdown

Understand where time goes before optimizing:

┌─────────────────────────────────────────────────────────────────┐
│                    REQUEST LATENCY BREAKDOWN                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  CLIENT                                                          │
│  └─ Network to server ────────────────────── ~50ms              │
│                                                                  │
│  SERVER (your code)                                              │
│  ├─ Build prompt ─────────────────────────── ~10ms              │
│  ├─ Fetch context (DB, RAG) ──────────────── ~100-500ms         │
│  └─ Pre-processing ───────────────────────── ~20ms              │
│                                                                  │
│  LLM API                                                         │
│  ├─ Network to API ───────────────────────── ~50ms              │
│  ├─ Queue wait ───────────────────────────── ~0-200ms           │
│  ├─ Time to first token (TTFT) ───────────── ~200-500ms         │
│  └─ Token generation ─────────────────────── ~1-5s              │
│                                                                  │
│  TOTAL: 1.5s - 7s typical                                        │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
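Before pulling any lever, measure your own pipeline; the numbers above are typical, not guaranteed. A minimal per-stage timing sketch using only the Python standard library (the stage names and the simulated work are illustrative, not part of any real pipeline):

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    """Record wall-clock duration in milliseconds for a named stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000

# Illustrative stages mirroring the breakdown above.
with timed("build_prompt"):
    prompt = "Summarize: " + "some document text"

with timed("fetch_context"):
    time.sleep(0.01)  # stand-in for a DB/RAG call

print({stage: round(ms, 1) for stage, ms in timings.items()})
```

Logging these per-stage numbers on every request makes it obvious whether your bottleneck is context fetching, queue wait, or token generation, which determines which lever below pays off.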

Optimization Levers

Lever                     Impact                         Effort
Streaming responses       ~10x perceived speed           Low
Use a Flash model         2-3x faster generation         Low
Parallel context fetch    2-5x faster prep               Medium
Reduce prompt size        1.5-2x faster                  Medium
Result caching            ~100x faster (on cache hit)    Medium
Connection pooling        50-100ms saved per request     Low
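Parallel context fetch is often the cheapest large win on the server side: independent lookups (user profile, RAG chunks, conversation history) can run concurrently instead of back-to-back. A sketch using asyncio, where the three fetch functions are hypothetical placeholders and the sleeps simulate I/O wait:

```python
import asyncio
import time

# Hypothetical stand-ins for independent context sources;
# each sleep simulates ~50ms of I/O (DB query, vector search, ...).
async def fetch_profile() -> str:
    await asyncio.sleep(0.05)
    return "profile"

async def fetch_rag_chunks() -> str:
    await asyncio.sleep(0.05)
    return "chunks"

async def fetch_history() -> str:
    await asyncio.sleep(0.05)
    return "history"

async def build_context() -> list[str]:
    # gather() runs all three concurrently: total wait is roughly the
    # slowest single fetch (~50ms), not the ~150ms sum of running them
    # sequentially.
    return list(await asyncio.gather(
        fetch_profile(), fetch_rag_chunks(), fetch_history()
    ))

start = time.perf_counter()
context = asyncio.run(build_context())
elapsed = time.perf_counter() - start
print(context, f"{elapsed * 1000:.0f}ms")
```

This pattern only helps when the fetches are truly independent; if one lookup's input depends on another's output, those two must stay sequential and only the remainder can be parallelized.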

Where to go next