29.4 Caching strategies (prompt+context caching)

Goal: reduce cost and latency safely

LLM apps are often limited by cost and latency. Caching is the highest-leverage tool to improve both.

But caching can also break correctness and privacy if done carelessly.

The goal is to cache what is safe and stable, and to version cache keys so you don’t serve stale or cross-tenant answers.

What you can cache (and what you should not)

High-value caching targets:

  • Embeddings: chunk embeddings and query embeddings.
  • Retrieval results: top-k chunk ids for frequent queries (with filters/versioning).
  • Reranking results: selected chunk ids for frequent queries.
  • Prompt templates: rendered system prompts or “house rules.”
  • Final answers: for repeated identical queries in the same context (careful!).
  • Derived artifacts: doc summaries, constraint extracts, chunk indexes.

Things you usually should not cache broadly:

  • Answers containing sensitive information unless you have strong access control and isolation.
  • Cross-tenant cached outputs (high leakage risk).
  • Outputs without version keys (stale answers silently ship).

Caching is a data-leak risk multiplier

If you cache incorrectly, you can leak one user’s content to another. Treat cache design as a security problem, not just a performance trick.

Cache keys and correctness (version everything)

The cache key must include everything that changes the answer:

  • User context: tenant, role, permissions filters.
  • Prompt version: prompt template id/version.
  • Model version: model name and settings that affect output.
  • Corpus/version: doc hash, index version, embedding version.
  • Retrieval parameters: top-k, filters, reranker settings.

If you don’t include versions, you will serve stale outputs after updates.
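
A minimal sketch of a key builder along these lines, using a canonical hash over all the inputs above. Every name here (tenant_id, prompt_version, index_version, and so on) is illustrative, not a fixed schema:

```python
# Hypothetical sketch: build a deterministic cache key from everything
# that can change the answer. Field names are illustrative.
import hashlib
import json

def cache_key(
    tenant_id: str,
    role: str,
    prompt_version: str,
    model: str,
    model_params: dict,
    index_version: str,
    embedding_version: str,
    retrieval_params: dict,
    query: str,
) -> str:
    payload = {
        "tenant": tenant_id,
        "role": role,
        "prompt_version": prompt_version,
        "model": model,
        "model_params": model_params,
        "index_version": index_version,
        "embedding_version": embedding_version,
        "retrieval_params": retrieval_params,
        "query": query,
    }
    # Canonical JSON (sorted keys) so the same inputs always hash the same way.
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```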

Privacy and multi-tenant safety

Rules that reduce risk:

  • Partition caches by tenant: separate namespaces or separate stores.
  • Never cache secrets: redact before caching or avoid caching those outputs.
  • Cache ids, not content: store chunk ids and retrieval decisions, not full text, when possible.
  • Encrypt at rest: if caching contains sensitive derived data.
  • Log access to sensitive caches: treat it like any other sensitive store.
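
A minimal sketch of tenant partitioning under these rules, using an in-memory store for illustration; a real deployment might instead use separate Redis databases or per-tenant key prefixes:

```python
# Hypothetical sketch: one cache namespace per tenant, storing chunk ids
# rather than chunk text. Not tied to any specific cache backend.
from collections import defaultdict

class TenantPartitionedCache:
    def __init__(self):
        # Separate dict per tenant, so entries can never collide across tenants.
        self._stores: dict[str, dict[str, list[str]]] = defaultdict(dict)

    def get(self, tenant_id: str, key: str) -> list[str] | None:
        return self._stores[tenant_id].get(key)

    def put(self, tenant_id: str, key: str, chunk_ids: list[str]) -> None:
        # Store retrieval decisions (ids), not full text.
        self._stores[tenant_id][key] = list(chunk_ids)

cache = TenantPartitionedCache()
cache.put("tenant-a", "q:refund-policy", ["doc1#3", "doc1#7"])
assert cache.get("tenant-b", "q:refund-policy") is None  # no cross-tenant hits
```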

TTL and invalidation strategies

Two main approaches:

  • TTL-based: cache expires after N minutes/hours.
  • Version-based: cache key includes version; new versions naturally miss cache.

Version-based invalidation is usually safer for correctness. TTL is useful for cost control and for bounding memory growth.
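
A small sketch of how the two combine, assuming the key already contains versions; the TTL layer only bounds memory and catches anything the versioning misses:

```python
# Hypothetical sketch: TTL expiry layered on top of version-based keys.
import time

class TTLCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._entries: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.time() - stored_at > self.ttl:
            del self._entries[key]  # expired; treat as a miss
            return None
        return value

    def put(self, key: str, value) -> None:
        self._entries[key] = (time.time(), value)
```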

Caching patterns by pipeline stage

Retrieval caching

  • Cache top-k chunk ids for frequent queries.
  • Key includes: query, filters, index version, embedding version.
  • Benefit: reduces vector DB load and latency.
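
A minimal sketch of this pattern; vector_search stands in for whatever vector DB client you use, and the module-level dict stands in for a real cache store:

```python
# Hypothetical sketch: cache top-k chunk ids keyed on query, filters,
# and index/embedding versions.
import hashlib, json

def retrieval_key(query, filters, index_version, embedding_version, top_k):
    blob = json.dumps(
        [query, filters, index_version, embedding_version, top_k],
        sort_keys=True,
    )
    return hashlib.sha256(blob.encode()).hexdigest()

_retrieval_cache: dict[str, list[str]] = {}

def cached_retrieve(query, filters, index_version, embedding_version, top_k, vector_search):
    key = retrieval_key(query, filters, index_version, embedding_version, top_k)
    if key in _retrieval_cache:
        return _retrieval_cache[key]
    chunk_ids = vector_search(query, filters=filters, top_k=top_k)
    _retrieval_cache[key] = chunk_ids
    return chunk_ids
```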

Generation caching

  • Cache final answers for repeated identical prompts in identical contexts.
  • Key includes: prompt version, sources/chunk ids, model version, user context.
  • Benefit: large cost savings for repeated questions.

For grounded systems, caching based on “sources set” is often safer than caching based only on “question string.”
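
A sketch of a sources-set key under that approach; generate stands in for your model call, and the field names are illustrative:

```python
# Hypothetical sketch: cache final answers keyed on the *set* of source
# chunk ids plus prompt/model versions and tenant, so the same question
# over changed sources misses the cache.
import hashlib, json

def answer_key(question, chunk_ids, prompt_version, model, tenant_id):
    blob = json.dumps(
        {
            "question": question,
            "sources": sorted(chunk_ids),  # order-independent "sources set"
            "prompt_version": prompt_version,
            "model": model,
            "tenant": tenant_id,
        },
        sort_keys=True,
    )
    return hashlib.sha256(blob.encode()).hexdigest()

_answer_cache: dict[str, str] = {}

def cached_answer(question, chunk_ids, prompt_version, model, tenant_id, generate):
    key = answer_key(question, chunk_ids, prompt_version, model, tenant_id)
    if key not in _answer_cache:
        _answer_cache[key] = generate(question, chunk_ids)
    return _answer_cache[key]
```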

Artifact caching

  • Cache doc summaries, constraint extracts, chunk indexes.
  • Invalidate when doc_hash changes.
  • Benefit: reduces repeated long-context processing.
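
A minimal sketch, with summarize standing in for a long-context model call; keying on a hash of the document content means edits invalidate the entry automatically:

```python
# Hypothetical sketch: derived artifacts (e.g. per-document summaries)
# keyed on doc_hash, so a changed document never reuses a stale summary.
import hashlib

_summary_cache: dict[str, str] = {}

def doc_hash(doc_text: str) -> str:
    return hashlib.sha256(doc_text.encode("utf-8")).hexdigest()

def cached_summary(doc_text: str, summarize) -> str:
    key = doc_hash(doc_text)
    if key not in _summary_cache:
        _summary_cache[key] = summarize(doc_text)
    return _summary_cache[key]
```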

Where to go next