33. Latency Optimization


Why speed matters in AI apps

AI models are inherently slow compared to database queries. A complex reasoning task can take 5-10 seconds. In user interface terms, 10 seconds is an eternity. Users perceive anything over 1 second as a delay and anything over 10 seconds as "broken."

Since we can't change the speed of light or the fundamental inference speed of the model, we have to use engineering tricks to mask latency or reduce the work required.

Perceived Latency vs. Actual Latency

Streaming the first token within about 0.5s makes a response feel immediate, even if the full answer takes 10s to finish. In interactive apps, always optimize for "time to first token" (TTFT) rather than total generation time.
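As an illustration, here is a minimal streaming sketch assuming the google-generativeai Python SDK; the API key, model name, and prompt are placeholders, and the timing code simply compares TTFT against total latency.

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")            # placeholder key
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name

start = time.perf_counter()
first_chunk_at = None

# stream=True yields chunks as they are generated instead of
# blocking until the complete response is ready.
for chunk in model.generate_content("Explain quantum tunneling.", stream=True):
    if first_chunk_at is None:
        first_chunk_at = time.perf_counter()
    print(chunk.text, end="", flush=True)

total = time.perf_counter() - start
print(f"\nTTFT: {first_chunk_at - start:.2f}s, total: {total:.2f}s")
```

Even when the total time is unchanged, showing the first chunk within half a second changes how the wait feels.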

The Physics of Latency

Latency comes from three places:

  1. Network: Sending the prompt to Google's servers and receiving the response.
  2. Prefill (Input Processing): The model reading your prompt. This is usually fast because input tokens are processed in parallel.
  3. Decoding (Output Generation): The model generating the answer token by token. This is the slow part, because each new token depends on all the tokens before it.
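To build intuition for why decoding dominates, here is a back-of-envelope estimate; the throughput and overhead numbers are illustrative assumptions, not measured figures for any particular model.

```python
# Rough latency model: prefill processes input tokens largely in parallel,
# while decoding emits output tokens one at a time.
PREFILL_TOKENS_PER_S = 5_000   # assumed prefill throughput
DECODE_TOKENS_PER_S = 50       # assumed decode throughput
NETWORK_OVERHEAD_S = 0.1       # assumed round-trip overhead

def estimated_latency(input_tokens: int, output_tokens: int) -> float:
    """Very rough end-to-end latency estimate in seconds."""
    prefill = input_tokens / PREFILL_TOKENS_PER_S
    decode = output_tokens / DECODE_TOKENS_PER_S
    return NETWORK_OVERHEAD_S + prefill + decode

# A 2,000-token prompt with a 500-token answer:
# ~0.1s network + ~0.4s prefill + ~10s decode -> decoding dominates.
print(f"{estimated_latency(2_000, 500):.1f}s")
```

Shaving tokens off the prompt helps a little; shaving tokens off the requested output helps a lot.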

Optimization Strategies

In this chapter, we'll explore how to attack each source of latency:

  • Prompt Engineering: Shorter prompts process faster.
  • Streaming: Showing progress immediately.
  • Parallelism: Doing work while the user is typing or while the model is thinking.
  • Caching: Reusing answers to skip the model entirely (see the sketch after this list).
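As a taste of the caching idea, here is a minimal exact-match cache keyed on a hash of the prompt. The call_model function is a hypothetical stand-in for your real model call; production systems usually add expiry, and often semantic (embedding-based) matching so near-duplicate prompts also hit the cache.

```python
import hashlib

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for the real (slow) model call."""
    return f"model answer to: {prompt}"

_cache: dict[str, str] = {}

def cached_generate(prompt: str) -> str:
    """Return a stored answer when the exact same prompt has been seen before."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)  # only pay for inference on a miss
    return _cache[key]

print(cached_generate("What is TTFT?"))  # miss: calls the model
print(cached_generate("What is TTFT?"))  # hit: returns instantly from the cache
```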

Where to go next