25.2 Indexing pipeline: ingest → chunk → embed → store


Goal: build an indexer you can run repeatedly

The indexing pipeline is the foundation of RAG. If it’s brittle, everything downstream becomes unreliable.

Your indexer should be:

  • repeatable: you can re-run it and get the same results for unchanged docs,
  • incremental: it updates only what changed,
  • auditable: you can trace each embedding back to a doc and version,
  • safe: it doesn’t leak or mislabel permissions.

Pipeline stages (ingest → chunk → embed → store)

1) Ingest

Ingest means: load documents, extract text, and attach metadata.

Decisions to make (a code sketch follows this list):

  • doc_id: stable identifier (path, URL, database id).
  • doc_version: timestamp or version number (or a content hash).
  • permissions tags: tenant/team/role/classification.
  • extraction: what you do for PDFs, HTML, docs; keep raw + extracted text if possible.
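
A minimal ingest sketch, assuming local files decoded as UTF-8 (a real pipeline would branch on PDF/HTML/etc. for extraction); here the content hash doubles as the version:

import hashlib
from pathlib import Path

def ingest_file(path: str, permissions_tags: list[str]) -> dict:
    # Hash the raw bytes so the version reflects the source exactly.
    raw = Path(path).read_bytes()
    doc_hash = hashlib.sha256(raw).hexdigest()
    # Placeholder extraction; swap in a real extractor per file type.
    text = raw.decode("utf-8", errors="replace")
    return {
        "doc_id": path,                # stable identifier
        "doc_version": doc_hash[:12],  # content hash as version
        "doc_hash": doc_hash,
        "permissions_tags": permissions_tags,
        "extracted_text": text,
    }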

2) Chunk

Chunking should produce stable chunk ids and metadata (see 24.2). Store:

  • doc_id, doc_version
  • chunk_id, chunk_hash
  • title path / section path
  • chunk text
  • permissions tags (copied from doc)

If your chunk ids aren’t stable, your citations and audits will drift over time.
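
One sketch of a stable scheme: derive the id from doc_id, section path, and position. The exact salt is an assumption, not the only option; the point is determinism, so re-chunking an unchanged doc yields the same ids.

import hashlib

def make_chunk(doc: dict, title_path: str, position: int, text: str) -> dict:
    # Deterministic id: same doc, section, and position -> same chunk_id.
    chunk_id = hashlib.sha256(
        f"{doc['doc_id']}|{title_path}|{position}".encode("utf-8")
    ).hexdigest()[:16]
    return {
        "doc_id": doc["doc_id"],
        "doc_version": doc["doc_version"],
        "chunk_id": chunk_id,
        "chunk_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "title_path": title_path,
        "text": text,
        "permissions_tags": doc["permissions_tags"],  # copied from the doc
    }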

3) Embed

Embedding converts chunk text into vectors. Requirements (sketched in code below):

  • batching: embed in batches for speed and cost.
  • retry strategy: handle transient failures without duplicating records.
  • versioning: store embedding model name/version and preprocessing version.
  • rate limits: respect quotas (plan for backoff).
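
A batching-and-retry sketch; embed_batch stands in for whatever client call you use, the model/version strings are placeholders, and the backoff schedule is illustrative. Duplicate protection comes later, from keying the store by chunk_id.

import time

def embed_chunks(chunks: list[dict], embed_batch, model: str = "embed-model-v1",
                 batch_size: int = 64, max_retries: int = 5) -> list[dict]:
    records = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                vectors = embed_batch([c["text"] for c in batch])
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)  # exponential backoff on transient errors
        for chunk, vector in zip(batch, vectors):
            records.append({
                "chunk_id": chunk["chunk_id"],
                "embedding_model": model,   # recorded for audits and re-embeds
                "embedding_version": "v1",  # bump when preprocessing changes
                "vector": vector,
                "created_at": time.time(),
            })
    return records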

4) Store

You typically store two things (see the sketch after this list):

  • Chunk store: chunk_id → text + metadata (for citations and audits).
  • Vector index: chunk_id → embedding vector + metadata filters.

Even if your vector DB can store text, it’s often useful to keep a separate chunk store to simplify audit logs and versioning.
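
A sketch of the two stores using plain dicts as stand-ins; a real system swaps in a document DB and a vector index, but keying both by chunk_id is what makes re-runs overwrite rather than duplicate:

chunk_store: dict[str, dict] = {}   # chunk_id -> text + metadata
vector_index: dict[str, dict] = {}  # chunk_id -> vector + filterable metadata

def store_all(chunks: list[dict], embeddings: list[dict]) -> None:
    for chunk in chunks:
        chunk_store[chunk["chunk_id"]] = chunk  # upsert: re-runs overwrite
    for emb in embeddings:
        meta = chunk_store[emb["chunk_id"]]
        vector_index[emb["chunk_id"]] = {
            "vector": emb["vector"],
            "doc_id": meta["doc_id"],
            "permissions_tags": meta["permissions_tags"],  # query-time filters
        }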

Data model: documents, chunks, embeddings

A simple conceptual data model:

Document {
  doc_id
  doc_version
  title
  source_uri
  permissions_tags
  extracted_text
  doc_hash
}

Chunk {
  doc_id
  doc_version
  chunk_id
  chunk_hash
  title_path
  text
  permissions_tags
}

Embedding {
  chunk_id
  embedding_model
  embedding_version
  vector
  created_at
}

Key invariant: chunk_id must map to exactly one chunk text for a given version.
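
A quick check of that invariant you can run over the chunk list before storing (a sketch; field names follow the model above):

from collections import defaultdict

def check_chunk_invariant(chunks: list[dict]) -> None:
    # Each (chunk_id, doc_version) pair must map to exactly one text.
    seen = defaultdict(set)
    for c in chunks:
        seen[(c["chunk_id"], c["doc_version"])].add(c["chunk_hash"])
    conflicts = [key for key, hashes in seen.items() if len(hashes) > 1]
    assert not conflicts, f"chunk_id collisions: {conflicts[:5]}"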

Idempotency, retries, and versioning

Indexing is a batch job. Treat it like production software:

  • Idempotency: re-running indexing on unchanged docs should not create duplicates.
  • Change detection: use doc_hash/chunk_hash to detect what changed.
  • Partial failures: if embedding fails halfway, you can resume safely.
  • Version upgrades: embedding model upgrades require planned re-embedding.

Use hashes as your “truth”

doc_hash and chunk_hash let you do incremental updates without trusting timestamps or filesystem quirks.
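
A sketch of hash-based change detection; previous_hashes is assumed to be loaded from whatever you persisted on the last run:

def docs_to_reindex(docs: list[dict], previous_hashes: dict[str, str]) -> list[dict]:
    # New docs have no previous hash; changed docs have a different one.
    return [d for d in docs if previous_hashes.get(d["doc_id"]) != d["doc_hash"]]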

Quality checks and smoke tests

Before you move on to query-time retrieval, run basic checks (a test sketch follows the list):

  • Doc count: did you ingest the expected number of docs?
  • Chunk count: are chunk counts reasonable (no accidental 1-char chunks)?
  • Metadata: do chunks include required tags (doc_id, chunk_id, permissions)?
  • Sample retrieval: pick 5 questions and confirm retrieval returns plausible chunks.
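
A smoke-test sketch against a dict-shaped chunk store like the one above; the 20-character minimum and the required-tag set are placeholders to tune, and the sample-retrieval check still needs a manual pass with your own questions.

def smoke_test(chunk_store: dict[str, dict], expected_doc_count: int) -> None:
    doc_ids = {c["doc_id"] for c in chunk_store.values()}
    assert len(doc_ids) == expected_doc_count, "doc count mismatch"
    tiny = [c for c in chunk_store.values() if len(c["text"]) < 20]
    assert not tiny, f"{len(tiny)} suspiciously short chunks"
    required = {"doc_id", "chunk_id", "permissions_tags"}
    for c in chunk_store.values():
        assert required <= c.keys(), f"missing metadata on {c['chunk_id']}"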

Copy-paste prompts

Prompt: design an indexing pipeline

Help me design an indexing pipeline for a RAG system.

Inputs:
- Corpus: [types, where stored, sensitivity]
- Need citations: yes/no
- Need permissions filtering: yes/no
- Update frequency: [daily/weekly/etc]

Output:
1) Data model (documents, chunks, embeddings) with required fields.
2) Chunking strategy per doc type.
3) Idempotent indexing approach (hashing/versioning).
4) Failure handling (retries, partial runs).
5) Smoke tests to validate the index.

Ship points

  • Ship point 1: indexer runs end-to-end on a small corpus.
  • Ship point 2: re-running indexer doesn’t duplicate data.
  • Ship point 3: retrieval returns plausible chunks for 5–10 questions.

Where to go next