24.2 Choosing chunk size, overlap, and metadata


Goal: chunks that retrieve well and cite cleanly

Chunking is where most RAG systems are won or lost.

Your goal is not to “split text into pieces.” It is to produce chunks that are:

  • retrievable: relevant chunks are likely to appear in top-k results,
  • understandable: a chunk makes sense on its own,
  • citable: you can reference a stable id and quote evidence,
  • stable over time: small doc edits don’t rewrite every chunk id.

The key tradeoff: precision vs completeness

Chunk size is a tradeoff:

  • Smaller chunks improve retrieval precision (less unrelated text) but risk missing needed context.
  • Larger chunks preserve context but reduce precision and waste context budget.

You want chunks that contain a complete “answer unit” for the kinds of questions users ask.

Design chunking around questions

Chunk boundaries should reflect how users ask questions: “How do refunds work?” needs the refund rules and their exceptions together in one chunk, not whichever single paragraph happens to match.

Chunk size guidelines (practical ranges)

Token counts vary by language and formatting, but these are useful starting points:

  • Reference docs / policies: ~200–500 tokens per chunk (often one subsection).
  • API docs: one endpoint per chunk (include request, response, errors).
  • Tutorials: one step or concept per chunk (include prerequisites).
  • Logs/transcripts: time-based chunks (e.g., 1–3 minutes) with overlap.

When in doubt: start medium (a subsection), then adjust based on retrieval failures.
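As a concrete starting point, here is a minimal sketch in Python of heading-based chunking with a size budget. It assumes markdown-style “#” headings and uses a whitespace word count as a rough stand-in for tokens; swap in your embedding model’s tokenizer in practice.

import re

def split_by_headings(text: str) -> list[str]:
    """Split a document into sections at markdown-style headings."""
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    return [s.strip() for s in sections if s.strip()]

def merge_into_chunks(sections: list[str], max_tokens: int = 500) -> list[str]:
    """Merge adjacent short sections so each chunk stays under the budget.

    Sections that already exceed max_tokens are emitted whole; split those
    further, e.g. with the sliding window shown in the overlap discussion.
    """
    chunks, buffer = [], ""
    for section in sections:
        candidate = f"{buffer}\n\n{section}".strip() if buffer else section
        if len(candidate.split()) <= max_tokens:
            buffer = candidate          # still under budget: keep accumulating
        else:
            if buffer:
                chunks.append(buffer)
            buffer = section            # start a new chunk at this section
    if buffer:
        chunks.append(buffer)
    return chunks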

Smells that your chunks are the wrong size

  • Too small: retrieved chunks mention terms but don’t contain the rule/answer.
  • Too large: retrieved chunks contain the answer but bury it among unrelated text; the model misses it.
  • Boundary split: definitions are in one chunk and rules are in another, so retrieval returns only half.

Overlap: when it helps and when it hurts

Overlap repeats a small amount of text on both sides of a chunk boundary so that context spanning the boundary is not lost.

Overlap helps when:

  • you chunk by fixed sizes (sliding window),
  • the document has cross-paragraph dependencies,
  • you’re chunking transcripts/logs with time continuity.

Overlap hurts when:

  • it causes the same content to appear multiple times in retrieval results,
  • it wastes context budget with duplicated text,
  • it makes citations messy (“same quote appears in multiple chunks”).

Practical default: 10–20% overlap for sliding-window chunking; 0% overlap for clean heading-based chunking.
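Here is a minimal sliding-window sketch with proportional overlap, again approximating tokens with a whitespace split; the 400-token window and 15% overlap are placeholder defaults, not recommendations for every corpus.

def sliding_window_chunks(text: str, window: int = 400, overlap_ratio: float = 0.15) -> list[str]:
    """Emit fixed-size chunks where each window repeats ~overlap_ratio of the previous one."""
    words = text.split()
    step = max(1, int(window * (1 - overlap_ratio)))  # 400 words at 15% overlap -> advance by 340
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + window]
        if piece:
            chunks.append(" ".join(piece))
        if start + window >= len(words):
            break  # last window already covers the tail; avoid a tiny duplicate chunk
    return chunks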

Metadata you should store (minimum viable)

Without metadata, you can’t filter, cite, or audit. Store at least:

  • doc_id: stable identifier for the source document.
  • chunk_id: stable identifier for the chunk (unique within doc_id).
  • title/path: heading hierarchy or section path.
  • source location: page number, heading, or paragraph range.
  • created_at / version: document version or timestamp.
  • permissions tags: tenant, role, classification (public/internal/restricted).
  • content type: policy, api, runbook, ticket, transcript.
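For illustration, a single chunk record carrying these fields might look like the sketch below; every value is hypothetical, and the field names should be adapted to your vector store’s schema.

chunk_record = {
    "doc_id": "policy-refunds",                                   # stable per source document
    "chunk_id": "policy-refunds/rules/exceptions",                # stable per chunk within the doc
    "title_path": ["Refund policy", "Rules", "Exceptions"],       # heading hierarchy
    "source_location": {"page": 3, "heading": "Exceptions"},      # for citations
    "created_at": "2024-05-01",
    "doc_version": "v7",
    "permissions": {"tenant": "acme", "roles": ["support"], "classification": "internal"},
    "content_type": "policy",
    "text": "Refunds are not available when ...",
}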

Permissions belong in metadata

Don’t “filter later.” Retrieval must enforce access control before the model ever sees the text.
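One way to make that concrete is to filter candidate chunks against the caller’s identity before anything is passed to the model. The user fields below are assumptions that should map onto your own access model; where your vector store supports metadata filters, push this check into the query itself so disallowed chunks are never returned at all.

def allowed(chunk: dict, user: dict) -> bool:
    """True if this user may see this chunk, based on the permissions metadata above."""
    perms = chunk["permissions"]
    role_ok = not perms["roles"] or bool(set(perms["roles"]) & set(user["roles"]))
    return (
        perms["tenant"] == user["tenant"]
        and perms["classification"] in user["allowed_classifications"]
        and role_ok
    )

def retrieve_for_user(candidates: list[dict], user: dict, k: int = 5) -> list[dict]:
    """Drop chunks the user may not see, then keep the top-k that remain."""
    return [c for c in candidates if allowed(c, user)][:k]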

Stable chunk ids and versioning

Chunk ids should be stable and meaningful. Two common strategies:

  • Structure-based ids: derive from heading path (e.g., security/keys/rotation).
  • Content-hash ids: hash of normalized text (stable only if text doesn’t change).

In practice, many systems use a hybrid:

  • ids derived from structure when available,
  • plus a version field when the content changes.

Always store:

  • doc_version and/or a doc_hash
  • chunk_hash (for detecting changes)
  • embedding_model_version (for re-embedding decisions)
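A sketch of the hybrid strategy, assuming a heading path per chunk and a deliberately simple normalization step (lowercase, collapsed whitespace); adjust both to your pipeline.

import hashlib

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def chunk_identity(doc_id: str, heading_path: list[str], text: str,
                   doc_version: str, embedding_model_version: str) -> dict:
    """Structure-based chunk_id plus hashes and versions for change detection."""
    chunk_id = "/".join([doc_id] + [h.lower().replace(" ", "-") for h in heading_path])
    return {
        "chunk_id": chunk_id,                                                 # stable: derived from structure
        "doc_version": doc_version,
        "chunk_hash": hashlib.sha256(normalize(text).encode()).hexdigest(),   # changes only when the text changes
        "embedding_model_version": embedding_model_version,                   # drives re-embedding decisions
    }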

How to evaluate chunking quality

Chunking quality is measurable through retrieval behavior:

  • Retrieval spot-check: for 25 real questions, inspect top 5 retrieved chunks.
  • Coverage: does at least one retrieved chunk contain the answer?
  • Redundancy: are you retrieving near-duplicates due to overlap?
  • Citation clarity: can you point to the exact chunk/quote for claims?

If retrieval is consistently missing the right chunk, fix chunking and metadata before tuning prompts.
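The spot-check can be a small script. The sketch below assumes a hand-maintained eval set of (question, answer snippet) pairs and a retrieve_fn(question, k) you already have; the 0.8 similarity threshold for “near-duplicate” is an arbitrary starting point.

from difflib import SequenceMatcher

def spot_check(eval_set, retrieve_fn, k=5):
    """Measure coverage (answer text present in a retrieved chunk) and redundancy (near-duplicate chunks)."""
    covered, redundant_pairs, total_pairs = 0, 0, 0
    for question, answer_snippet in eval_set:
        texts = [c["text"] for c in retrieve_fn(question, k)]
        if any(answer_snippet.lower() in t.lower() for t in texts):
            covered += 1
        for i in range(len(texts)):
            for j in range(i + 1, len(texts)):
                total_pairs += 1
                if SequenceMatcher(None, texts[i], texts[j]).ratio() > 0.8:
                    redundant_pairs += 1
    return {
        "coverage": covered / len(eval_set),
        "redundancy": redundant_pairs / total_pairs if total_pairs else 0.0,
    }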

Copy-paste prompts

Prompt: design chunking for a corpus

I am building a RAG system. Help me design a chunking strategy.

Corpus description:
- Doc types: [policies/api docs/runbooks/transcripts/etc]
- Typical questions users ask: [list 10]
- Constraints: [need citations, permissions, update rate]

Task:
1) Propose chunking rules per doc type (where to split, target size).
2) Recommend overlap rules (if any) and why.
3) Define minimum metadata fields we must store (for filtering + citations).
4) List common chunking failure modes for this corpus and how to detect them.
