33.4 Caching: what to cache and what not to
Semantic Caching
Exact caching (returning a stored answer only when the incoming query string is identical to a previous one) rarely works in chat, because users phrase the same intent differently ("Hello" vs "Hi there").
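To make the limitation concrete, here is a minimal sketch of an exact-match cache; the dictionary and sample strings are illustrative, not from any particular library:

```python
# A naive exact-match cache: it only helps when the query string repeats
# verbatim (after trivial normalization).
exact_cache: dict[str, str] = {}

def lookup_exact(query: str) -> str | None:
    return exact_cache.get(query.strip().lower())

exact_cache["hello"] = "Hi! How can I help you today?"
print(lookup_exact("Hello"))     # hit, thanks to lowercasing
print(lookup_exact("Hi there"))  # miss (None), despite identical intent
```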
Semantic caching uses embeddings instead:

1. A user asks "How do I reset my password?"
2. You embed this query.
3. You search your cache (a vector DB) for similar previous queries.
4. You find "How can I change my password?", which has a cached answer.
5. If the similarity is above a threshold (e.g. > 0.95), you return the cached answer immediately, with near-zero latency and no LLM call.
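A minimal sketch of that lookup, assuming a 0.95 threshold and an in-memory list standing in for the vector DB. The `embed` function here is a toy hashed bag-of-words so the snippet runs on its own; in production you would call your embedding provider's API instead:

```python
import hashlib

import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy embedding: hashed bag-of-words, L2-normalized. Stand-in only;
    replace with a real embedding model in production."""
    vec = np.zeros(256)
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % 256
        vec[idx] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# In-memory stand-in for a vector DB: (embedding, cached answer) pairs.
semantic_cache: list[tuple[np.ndarray, str]] = []

def store(query: str, answer: str) -> None:
    semantic_cache.append((embed(query), answer))

def lookup(query: str, threshold: float = 0.95) -> str | None:
    """Return a cached answer if a previous query is similar enough."""
    q = embed(query)
    best_score, best_answer = 0.0, None
    for vec, answer in semantic_cache:
        score = float(np.dot(q, vec))  # cosine similarity (unit-norm vectors)
        if score > best_score:
            best_score, best_answer = score, answer
    return best_answer if best_score >= threshold else None

store("How can I change my password?", "Go to Settings > Security > Reset.")
# With a real embedding model, this paraphrase should score above the
# threshold and hit the cache; the toy hash embedding here will not.
print(lookup("How do I reset my password?"))
```

A real vector DB would replace the linear scan with an approximate nearest-neighbor search, but the lookup logic is the same.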
Context Caching (The Feature)
Google AI Studio and the Gemini API support Context Caching. This is different: instead of caching answers, it caches the prefix of your prompt (system instructions, huge files, few-shot examples) on the provider's servers.
- Scenario: You are building a "Chat with Book" app. You upload "War and Peace" (500k tokens).
- Without cache: you pay to process all 500k tokens on every single question.
- With cache: you pay to cache "War and Peace" once; subsequent queries pay only for the new question plus the answer (and a storage fee while the cache is alive).
This dramatically reduces cost and prefill latency: over 100 questions, you prefill the 500k-token prefix once instead of 50M tokens in total.
Use it when you have a static context of more than ~32k tokens that is reused across many requests. Cached content usually has a minimum lifetime you pay for (e.g. 1 hour of storage), so it's not worth it for one-off scripts.
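As a sketch of how this looks in practice, here is the flow using the `google-generativeai` Python SDK's caching module as documented for Gemini 1.5; the model name, file path, and API key are placeholders, and the SDK surface may have changed since:

```python
import datetime

import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="YOUR_API_KEY")

# Send the static prefix once; the provider caches it server-side.
book_text = open("war_and_peace.txt").read()  # ~500k tokens of static context

cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",
    display_name="war-and-peace",
    system_instruction="Answer questions using only the provided book.",
    contents=[book_text],
    ttl=datetime.timedelta(hours=1),  # storage is billed for the cache's lifetime
)

# Each subsequent request pays only for the new question + answer tokens.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("Who is Pierre Bezukhov?")
print(response.text)
```

The `ttl` controls how long you pay to keep the prefix cached; every `generate_content` call against the cached model reuses the 500k-token prefix without re-sending or re-processing it.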