29.4 Caching strategies (prompt+context caching)

Goal: reduce cost and latency safely

LLM apps are often limited by cost and latency. Caching is the highest-leverage tool to improve both.

But caching can also break correctness and privacy if done carelessly.

The goal is to cache what is safe and stable, and to version cache keys so you don’t serve stale or cross-tenant answers.

What you can cache (and what you should not)

High-value caching targets:

  • Embeddings: chunk embeddings and query embeddings.
  • Retrieval results: top-k chunk ids for frequent queries (with filters/versioning).
  • Reranking results: selected chunk ids for frequent queries.
  • Prompt templates: rendered system prompts or “house rules.”
  • Final answers: for repeated identical queries in the same context (careful!).
  • Derived artifacts: doc summaries, constraint extracts, chunk indexes.

Things you usually should not cache broadly:

  • Answers containing sensitive information unless you have strong access control and isolation.
  • Cross-tenant cached outputs (high leakage risk).
  • Outputs without version keys (stale answers silently ship).

Caching is a data-leak risk multiplier

If you cache incorrectly, you can leak one user’s content to another. Treat cache design as a security problem, not just a performance trick.

Cache keys and correctness (version everything)

The cache key must include everything that changes the answer:

  • User context: tenant, role, permissions filters.
  • Prompt version: prompt template id/version.
  • Model version: model name and settings that affect output.
  • Corpus/version: doc hash, index version, embedding version.
  • Retrieval parameters: top-k, filters, reranker settings.

If you don’t include versions, you will serve stale outputs after updates.
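
A minimal sketch of a key builder along these lines, using a canonical hash over all the inputs above. Every name here (tenant_id, prompt_version, index_version, and so on) is illustrative, not a fixed schema:

```python
# Hypothetical sketch: build a deterministic cache key from everything
# that can change the answer. Field names are illustrative.
import hashlib
import json

def cache_key(
    tenant_id: str,
    role: str,
    prompt_version: str,
    model: str,
    model_params: dict,
    index_version: str,
    embedding_version: str,
    retrieval_params: dict,
    query: str,
) -> str:
    payload = {
        "tenant": tenant_id,
        "role": role,
        "prompt_version": prompt_version,
        "model": model,
        "model_params": model_params,
        "index_version": index_version,
        "embedding_version": embedding_version,
        "retrieval_params": retrieval_params,
        "query": query,
    }
    # Canonical JSON (sorted keys) so the same inputs always hash the same way.
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```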

Privacy and multi-tenant safety

Rules that reduce risk:

  • Partition caches by tenant: separate namespaces or separate stores.
  • Never cache secrets: redact before caching or avoid caching those outputs.
  • Cache ids, not content: store chunk ids and retrieval decisions, not full text, when possible.
  • Encrypt at rest: if caching contains sensitive derived data.
  • Log access to sensitive caches: treat it like any other sensitive store.
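
A minimal sketch of tenant partitioning under these rules, using an in-memory store for illustration; a real deployment might instead use separate Redis databases or per-tenant key prefixes:

```python
# Hypothetical sketch: one cache namespace per tenant, storing chunk ids
# rather than chunk text. Not tied to any specific cache backend.
from collections import defaultdict

class TenantPartitionedCache:
    def __init__(self):
        # Separate dict per tenant, so entries can never collide across tenants.
        self._stores: dict[str, dict[str, list[str]]] = defaultdict(dict)

    def get(self, tenant_id: str, key: str) -> list[str] | None:
        return self._stores[tenant_id].get(key)

    def put(self, tenant_id: str, key: str, chunk_ids: list[str]) -> None:
        # Store retrieval decisions (ids), not full text.
        self._stores[tenant_id][key] = list(chunk_ids)

cache = TenantPartitionedCache()
cache.put("tenant-a", "q:refund-policy", ["doc1#3", "doc1#7"])
assert cache.get("tenant-b", "q:refund-policy") is None  # no cross-tenant hits
```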

TTL and invalidation strategies

Two main approaches:

  • TTL-based: cache expires after N minutes/hours.
  • Version-based: cache key includes version; new versions naturally miss cache.

Version-based invalidation is usually safer for correctness. TTL is useful for cost control and for bounding memory growth.
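
A small sketch of how the two combine, assuming the key already contains versions; the TTL layer only bounds memory and catches anything the versioning misses:

```python
# Hypothetical sketch: TTL expiry layered on top of version-based keys.
import time

class TTLCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._entries: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.time() - stored_at > self.ttl:
            del self._entries[key]  # expired; treat as a miss
            return None
        return value

    def put(self, key: str, value) -> None:
        self._entries[key] = (time.time(), value)
```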

Caching patterns by pipeline stage

Retrieval caching

  • Cache top-k chunk ids for frequent queries.
  • Key includes: query, filters, index version, embedding version.
  • Benefit: reduces vector DB load and latency.
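
A minimal sketch of this pattern; vector_search stands in for whatever vector DB client you use, and the module-level dict stands in for a real cache store:

```python
# Hypothetical sketch: cache top-k chunk ids keyed on query, filters,
# and index/embedding versions.
import hashlib, json

def retrieval_key(query, filters, index_version, embedding_version, top_k):
    blob = json.dumps(
        [query, filters, index_version, embedding_version, top_k],
        sort_keys=True,
    )
    return hashlib.sha256(blob.encode()).hexdigest()

_retrieval_cache: dict[str, list[str]] = {}

def cached_retrieve(query, filters, index_version, embedding_version, top_k, vector_search):
    key = retrieval_key(query, filters, index_version, embedding_version, top_k)
    if key in _retrieval_cache:
        return _retrieval_cache[key]
    chunk_ids = vector_search(query, filters=filters, top_k=top_k)
    _retrieval_cache[key] = chunk_ids
    return chunk_ids
```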

Generation caching

  • Cache final answers for repeated identical prompts in identical contexts.
  • Key includes: prompt version, sources/chunk ids, model version, user context.
  • Benefit: large cost savings for repeated questions.

For grounded systems, caching based on “sources set” is often safer than caching based only on “question string.”
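
A sketch of a sources-set key under that approach; generate stands in for your model call, and the field names are illustrative:

```python
# Hypothetical sketch: cache final answers keyed on the *set* of source
# chunk ids plus prompt/model versions and tenant, so the same question
# over changed sources misses the cache.
import hashlib, json

def answer_key(question, chunk_ids, prompt_version, model, tenant_id):
    blob = json.dumps(
        {
            "question": question,
            "sources": sorted(chunk_ids),  # order-independent "sources set"
            "prompt_version": prompt_version,
            "model": model,
            "tenant": tenant_id,
        },
        sort_keys=True,
    )
    return hashlib.sha256(blob.encode()).hexdigest()

_answer_cache: dict[str, str] = {}

def cached_answer(question, chunk_ids, prompt_version, model, tenant_id, generate):
    key = answer_key(question, chunk_ids, prompt_version, model, tenant_id)
    if key not in _answer_cache:
        _answer_cache[key] = generate(question, chunk_ids)
    return _answer_cache[key]
```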

Artifact caching

  • Cache doc summaries, constraint extracts, chunk indexes.
  • Invalidate when doc_hash changes.
  • Benefit: reduces repeated long-context processing.
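
A minimal sketch, with summarize standing in for a long-context model call; keying on a hash of the document content means edits invalidate the entry automatically:

```python
# Hypothetical sketch: derived artifacts (e.g. per-document summaries)
# keyed on doc_hash, so a changed document never reuses a stale summary.
import hashlib

_summary_cache: dict[str, str] = {}

def doc_hash(doc_text: str) -> str:
    return hashlib.sha256(doc_text.encode("utf-8")).hexdigest()

def cached_summary(doc_text: str, summarize) -> str:
    key = doc_hash(doc_text)
    if key not in _summary_cache:
        _summary_cache[key] = summarize(doc_text)
    return _summary_cache[key]
```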

Where to go next