25.2 Indexing pipeline: ingest → chunk → embed → store
Goal: build an indexer you can run repeatedly
The indexing pipeline is the foundation of RAG. If it’s brittle, everything downstream becomes unreliable.
Your indexer should be:
- repeatable: you can re-run it and get the same results for unchanged docs,
- incremental: it updates only what changed,
- auditable: you can trace each embedding back to a doc and version,
- safe: it doesn’t leak or mislabel permissions.
Pipeline stages (ingest → chunk → embed → store)
1) Ingest
Ingest means: load documents, extract text, and attach metadata.
Decisions to make:
- doc_id: stable identifier (path, URL, database id).
- doc_version: timestamp or version number (or a content hash).
- permissions tags: tenant/team/role/classification.
- extraction: how you handle PDFs, HTML, and office docs; keep both the raw file and the extracted text if possible (a minimal sketch follows).
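A minimal ingest sketch in Python. It assumes text extraction has already happened; using a sha256 content hash as doc_version and the tag format shown are illustrative choices, not requirements:

```python
import hashlib

def ingest_document(doc_id: str, source_uri: str, title: str,
                    extracted_text: str, permissions_tags: list[str]) -> dict:
    # The content hash doubles as the version: unchanged text produces an
    # unchanged version, independent of timestamps or filesystem quirks.
    doc_hash = hashlib.sha256(extracted_text.encode("utf-8")).hexdigest()
    return {
        "doc_id": doc_id,                      # stable identifier (path, URL, db id)
        "doc_version": doc_hash,               # content-addressed version
        "title": title,
        "source_uri": source_uri,
        "permissions_tags": permissions_tags,  # e.g. ["tenant:acme", "role:support"]
        "extracted_text": extracted_text,
        "doc_hash": doc_hash,
    }
```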
2) Chunk
Chunking should produce stable chunk ids and metadata (see 24.2). For each chunk, store:
- doc_id, doc_version
- chunk_id, chunk_hash
- title path / section path
- chunk text
- permissions tags (copied from doc)
If your chunk ids aren’t stable, your citations and audits will drift over time.
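A sketch of building chunk records, assuming the splitting itself (see 24.2) already produced the chunk texts and their title paths; the chunk_id scheme here is one illustrative way to keep ids stable across re-runs:

```python
import hashlib

def chunk_document(doc: dict, chunk_texts: list[str],
                   title_paths: list[str]) -> list[dict]:
    chunks = []
    for i, text in enumerate(chunk_texts):
        chunk_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
        # Deterministic id: same doc, same position, same text -> same id,
        # so re-runs produce identical ids and citations don't drift.
        chunk_id = f"{doc['doc_id']}#{i}-{chunk_hash[:12]}"
        chunks.append({
            "doc_id": doc["doc_id"],
            "doc_version": doc["doc_version"],
            "chunk_id": chunk_id,
            "chunk_hash": chunk_hash,
            "title_path": title_paths[i],
            "text": text,
            "permissions_tags": doc["permissions_tags"],  # copied from the doc
        })
    return chunks
```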
3) Embed
Embedding converts chunk text into vectors. Requirements:
- batching: embed in batches for speed and cost.
- retry strategy: handle transient failures without duplicating records.
- versioning: store embedding model name/version and preprocessing version.
- rate limits: respect quotas and plan for backoff (see the sketch below).
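A sketch of batched embedding with retries and backoff. Here embed_fn is a hypothetical wrapper around your provider's batch API (a list of strings in, one vector per string out), and the model names are placeholders:

```python
import time

def embed_chunks(chunks: list[dict], embed_fn,
                 batch_size: int = 64, max_retries: int = 3) -> list[dict]:
    records = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                vectors = embed_fn([c["text"] for c in batch])
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)  # exponential backoff for rate limits
        for chunk, vector in zip(batch, vectors):
            records.append({
                "chunk_id": chunk["chunk_id"],  # keying on chunk_id keeps retries from duplicating
                "embedding_model": "example-embed-model",  # record your real model name
                "embedding_version": "v1",                 # bump on preprocessing changes
                "vector": vector,
                "created_at": time.time(),
            })
    return records
```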
4) Store
You typically store two things:
- Chunk store: chunk_id → text + metadata (for citations and audits).
- Vector index: chunk_id → embedding vector + metadata filters.
Even if your vector DB can store text, it’s often useful to keep a separate chunk store to simplify audit logs and versioning.
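A sketch of the dual write, with in-memory dicts standing in for the real chunk store (a database table) and vector index:

```python
# Hypothetical in-memory stores; swap in your database and vector index.
chunk_store: dict[str, dict] = {}   # chunk_id -> text + metadata (citations, audits)
vector_index: dict[str, dict] = {}  # chunk_id -> vector + filterable metadata

def store_all(chunks: list[dict], embeddings: list[dict]) -> None:
    for chunk in chunks:
        chunk_store[chunk["chunk_id"]] = chunk  # upsert keyed by chunk_id
    for emb in embeddings:
        meta = chunk_store[emb["chunk_id"]]
        vector_index[emb["chunk_id"]] = {
            "vector": emb["vector"],
            # Copy filterable fields next to the vector so permissions
            # filtering can happen inside the index at query time.
            "doc_id": meta["doc_id"],
            "permissions_tags": meta["permissions_tags"],
        }
```

Upserting by chunk_id in both stores is what makes re-runs safe: a retry or re-index overwrites records instead of appending duplicates.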
Data model: documents, chunks, embeddings
A simple conceptual data model:
```
Document {
  doc_id
  doc_version
  title
  source_uri
  permissions_tags
  extracted_text
  doc_hash
}

Chunk {
  doc_id
  doc_version
  chunk_id
  chunk_hash
  title_path
  text
  permissions_tags
}

Embedding {
  chunk_id
  embedding_model
  embedding_version
  vector
  created_at
}
```
Key invariant: chunk_id must map to exactly one chunk text for a given version.
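A small check for that invariant, assuming the dict-shaped chunk records used in the sketches above:

```python
def check_chunk_invariant(chunks: list[dict]) -> None:
    # Within one doc_version, a chunk_id must map to exactly one text.
    seen: dict[tuple, str] = {}
    for c in chunks:
        key = (c["chunk_id"], c["doc_version"])
        if key in seen and seen[key] != c["text"]:
            raise ValueError(f"{c['chunk_id']} maps to multiple texts in one version")
        seen[key] = c["text"]
```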
Idempotency, retries, and versioning
Indexing is a batch job. Treat it like production software:
- Idempotency: re-running indexing on unchanged docs should not create duplicates.
- Change detection: use doc_hash/chunk_hash to detect what changed.
- Partial failures: if embedding fails halfway, you can resume safely.
- Version upgrades: embedding model upgrades require planned re-embedding.
doc_hash and chunk_hash let you do incremental updates without trusting timestamps or filesystem quirks.
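A sketch of hash-based change detection, assuming you persist a doc_id → doc_hash map from each run (where you store that map is up to you):

```python
def plan_incremental_run(docs: list[dict],
                         indexed_hashes: dict[str, str]) -> list[dict]:
    # indexed_hashes maps doc_id -> doc_hash recorded by the previous run.
    to_index = []
    for doc in docs:
        if indexed_hashes.get(doc["doc_id"]) == doc["doc_hash"]:
            continue          # unchanged: skip re-chunking and re-embedding
        to_index.append(doc)  # new or modified: re-chunk and re-embed
    return to_index
```

Because chunk ids are deterministic, re-indexing a changed doc overwrites its surviving chunks; remember to also delete the doc's stale chunk_ids (for sections that disappeared) before writing the new ones.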
Quality checks and smoke tests
Before you move on to query-time retrieval, run basic checks:
- Doc count: did you ingest the expected number of docs?
- Chunk count: are chunk counts reasonable (no accidental 1-char chunks)?
- Metadata: do chunks include required tags (doc_id, chunk_id, permissions)?
- Sample retrieval: pick 5 questions and confirm retrieval returns plausible chunks (a sketch follows).
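A sketch that bundles these checks, where retrieve_fn is a placeholder for your query-time retrieval call and the 20-character floor is an arbitrary assumption:

```python
def smoke_test(docs, chunks, questions, retrieve_fn, expected_doc_count):
    assert len(docs) == expected_doc_count, "doc count mismatch"
    # 20 chars is an arbitrary floor to catch accidental tiny chunks.
    assert all(len(c["text"]) > 20 for c in chunks), "suspiciously short chunks"
    required = {"doc_id", "chunk_id", "permissions_tags"}
    assert all(required <= c.keys() for c in chunks), "missing required metadata"
    for q in questions:
        hits = retrieve_fn(q, k=3)
        print(q, "->", [h["chunk_id"] for h in hits])  # eyeball plausibility
```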
Copy-paste prompts
Prompt: design an indexing pipeline
```
Help me design an indexing pipeline for a RAG system.

Inputs:
- Corpus: [types, where stored, sensitivity]
- Need citations: yes/no
- Need permissions filtering: yes/no
- Update frequency: [daily/weekly/etc]

Output:
1) Data model (documents, chunks, embeddings) with required fields.
2) Chunking strategy per doc type.
3) Idempotent indexing approach (hashing/versioning).
4) Failure handling (retries, partial runs).
5) Smoke tests to validate the index.
```
Ship points
- Ship point 1: the indexer runs end-to-end on a small corpus.
- Ship point 2: re-running the indexer doesn't duplicate data.
- Ship point 3: retrieval returns plausible chunks for 5–10 questions.