22.1 Chunking strategies that preserve meaning
Goal: chunk without destroying meaning
Chunking is how you turn a long document into pieces that can be:
- retrieved later,
- cited reliably,
- summarized with less hallucination,
- kept within a context budget.
The goal is not “split every N characters.” The goal is: each chunk should be understandable and useful on its own.
Principles of good chunks
- Semantic coherence: keep one idea or subtopic per chunk.
- Self-contained context: include definitions and prerequisites when possible.
- Stable boundaries: chunk boundaries should not change wildly with minor doc edits.
- Retrieval friendliness: include terms users will query for (headings help).
- Minimal redundancy: overlap helps, but too much overlap wastes budget and confuses answers.
Splitting by raw length often cuts definitions from their usage, tables from their explanations, and policies from their exceptions.
Chunking methods you can actually use
1) Structure-first (headings/sections)
If your doc has headings, use them. This is the highest-leverage method:
- split on h2/h3 sections (or markdown headings),
- keep the heading path as metadata (e.g., “Security > Secrets > Rotation”),
- merge tiny sections with a neighbor; split huge sections by paragraphs.
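The steps above can be sketched in a few lines. This is a minimal, hypothetical sketch that handles only plain markdown `##`/`###` headings (real documents may need an HTML or markdown parser); the function name and `min_chars` threshold are illustrative choices, not a standard API:

```python
import re

def chunk_by_headings(markdown_text, min_chars=200):
    """Split markdown on ##/### headings, tracking the heading path."""
    chunks = []
    path = []   # current heading hierarchy, e.g. ["Security", "Secrets"]
    buf = []    # lines belonging to the current section

    def flush():
        text = "\n".join(buf).strip()
        if not text:
            return
        # Merge tiny sections into the previous chunk instead of
        # emitting a fragment that can't stand on its own.
        if chunks and len(text) < min_chars:
            chunks[-1]["text"] += "\n\n" + text
        else:
            chunks.append({"title_path": list(path), "text": text})

    for line in markdown_text.splitlines():
        m = re.match(r"^(#{2,3})\s+(.*)", line)
        if m:
            flush()
            buf = []
            depth = len(m.group(1)) - 1   # "##" -> depth 1, "###" -> depth 2
            path = path[: depth - 1] + [m.group(2).strip()]
        else:
            buf.append(line)
    flush()
    return chunks
```

Keeping `title_path` on every chunk is what later lets you render citations like “Security > Secrets > Rotation” instead of an opaque offset.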
2) Sliding window with overlap (for messy text)
For logs, transcripts, or unstructured text, use a sliding window:
- choose a target chunk size,
- include an overlap (10–20%) to preserve continuity,
- preserve time ranges (for logs/transcripts) as metadata.
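A sliding window is simple enough to write directly. The sketch below is illustrative (the function name and the whitespace-snapping heuristic are assumptions, not a library API); it keeps character offsets so you can map each chunk back to its source location:

```python
def sliding_window_chunks(text, chunk_size=1000, overlap_ratio=0.15):
    """Fixed-size windows with ~10-20% overlap between neighbors."""
    overlap = int(chunk_size * overlap_ratio)
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # Snap the boundary back to the last space so words aren't cut in half.
        if end < len(text):
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
        chunks.append({"start": start, "end": end, "text": text[start:end]})
        if end == len(text):
            break
        # Advance, re-reading `overlap` characters; guard against stalling.
        start = max(end - overlap, start + 1)
    return chunks
```

For logs and transcripts, you would attach the covered time range to each record the same way `start`/`end` offsets are attached here.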
3) Semantic chunking (split by topic shifts)
Use a model-assisted pass to find topic shifts:
- ask the model to label paragraphs by topic,
- group adjacent paragraphs with the same label,
- use those groups as chunks.
This works well for long prose but requires careful validation, because the model can mis-label boundaries.
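Once the model-assisted pass has produced one topic label per paragraph, the grouping step is deterministic. A minimal sketch, assuming labels arrive as a parallel list (the labeling call itself is upstream and not shown):

```python
from itertools import groupby

def group_by_topic(paragraphs, labels):
    """Merge adjacent paragraphs that share a topic label into one chunk."""
    assert len(paragraphs) == len(labels), "one label per paragraph"
    chunks = []
    # groupby only merges *adjacent* equal labels, which is exactly
    # what we want: a topic that recurs later starts a new chunk.
    for label, group in groupby(zip(paragraphs, labels), key=lambda pair: pair[1]):
        text = "\n\n".join(p for p, _ in group)
        chunks.append({"topic": label, "text": text})
    return chunks
```

Validating the output is as important as the grouping: spot-check boundaries against the original document before trusting the labels.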
Metadata: the difference between “text” and “knowledge”
Without metadata, you can’t ground answers. At minimum, store:
- doc_id: which document this came from.
- chunk_id: a stable id (don’t rely on array index alone).
- title_path: heading hierarchy (if available).
- source location: page number, paragraph index, or byte range.
- timestamp/version: doc version or last updated time.
Example chunk record (conceptual):
{
  "doc_id": "policy-security-v3",
  "chunk_id": "3.2-secrets-rotation",
  "title_path": ["Security", "Secrets", "Rotation"],
  "source": { "page": 12, "start_paragraph": 4, "end_paragraph": 9 },
  "text": "..."
}
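One way to make `chunk_id` stable without relying on array position is to derive it from content-addressable fields. A hypothetical scheme (the field choices are assumptions; adapt them to your own records):

```python
import hashlib

def stable_chunk_id(doc_id, title_path, start_paragraph):
    """Derive an id from where the chunk lives, not its array index,
    so ids survive reordering or insertion of unrelated chunks."""
    key = "|".join([doc_id, *title_path, str(start_paragraph)])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]
```

Note the trade-off: an id keyed on `start_paragraph` will change if paragraphs are inserted above the chunk, so pick the fields that are most stable for your editing workflow.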
How to evaluate chunk quality
Use a quick checklist:
- Can a human understand it alone? If not, it’s missing context.
- Does it contain both rules and exceptions? Policy chunks should include caveats.
- Does it include keywords a user would search for? Headings matter.
- Is it too large? Huge chunks reduce retrieval precision and increase cost.
- Is it too small? Tiny chunks lose meaning and increase retrieval noise.
Then test with a few representative queries: can you retrieve the chunk you expect?
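Such a retrieval check can be automated as a smoke test. The sketch below uses crude lexical overlap as the scorer purely for illustration (a real system would use BM25 or embeddings); the function names and `(query, expected_chunk_id)` case format are assumptions:

```python
def keyword_score(query, chunk_text):
    """Fraction of query words that appear in the chunk. Crude on purpose."""
    q = set(query.lower().split())
    c = set(chunk_text.lower().split())
    return len(q & c) / max(1, len(q))

def smoke_test_retrieval(chunks, cases):
    """cases: list of (query, expected_chunk_id). Returns the failures."""
    failures = []
    for query, expected in cases:
        best = max(chunks, key=lambda ch: keyword_score(query, ch["text"]))
        if best["chunk_id"] != expected:
            failures.append((query, expected, best["chunk_id"]))
    return failures
```

Run it whenever chunking parameters change: if a representative query stops retrieving the chunk you expect, the boundary or metadata regressed.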
Copy-paste prompts
Prompt: propose chunk boundaries for a document
I have a long document. I want to chunk it for retrieval.
Requirements:
- Chunks should be semantically coherent and understandable on their own.
- Prefer splitting on headings. If headings are missing, split on topic shifts.
- Output chunk boundaries with stable ids.
Return JSON:
{
  "chunks": [{
    "chunk_id": string,
    "title_path": string[],
    "start_hint": string,
    "end_hint": string,
    "notes": string
  }]
}
Ask clarifying questions if needed (doc type, average length, intended queries).