25.2 Indexing pipeline: ingest → chunk → embed → store


Goal: build an indexer you can run repeatedly

The indexing pipeline is the foundation of RAG. If it’s brittle, everything downstream becomes unreliable.

Your indexer should be:

  • repeatable: you can re-run it and get the same results for unchanged docs,
  • incremental: it updates only what changed,
  • auditable: you can trace each embedding back to a doc and version,
  • safe: it doesn’t leak or mislabel permissions.

Pipeline stages (ingest → chunk → embed → store)

1) Ingest

Ingest means: load documents, extract text, and attach metadata.

Decisions to make (a code sketch follows this list):

  • doc_id: stable identifier (path, URL, database id).
  • doc_version: timestamp or version number (or a content hash).
  • permissions tags: tenant/team/role/classification.
  • extraction: what you do for PDFs, HTML, docs; keep raw + extracted text if possible.
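
A minimal ingest sketch, assuming local files decoded as UTF-8 (a real pipeline would branch on PDF/HTML/etc. for extraction); here the content hash doubles as the version:

import hashlib
from pathlib import Path

def ingest_file(path: str, permissions_tags: list[str]) -> dict:
    # Hash the raw bytes so the version reflects the source exactly.
    raw = Path(path).read_bytes()
    doc_hash = hashlib.sha256(raw).hexdigest()
    # Placeholder extraction; swap in a real extractor per file type.
    text = raw.decode("utf-8", errors="replace")
    return {
        "doc_id": path,                # stable identifier
        "doc_version": doc_hash[:12],  # content hash as version
        "doc_hash": doc_hash,
        "permissions_tags": permissions_tags,
        "extracted_text": text,
    }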

2) Chunk

Chunking should produce stable chunk ids and metadata (see 24.2). Store:

  • doc_id, doc_version
  • chunk_id, chunk_hash
  • title path / section path
  • chunk text
  • permissions tags (copied from doc)

If your chunk ids aren’t stable, your citations and audits will drift over time.
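
One sketch of a stable scheme: derive the id from doc_id, section path, and position. The exact salt is an assumption, not the only option; the point is determinism, so re-chunking an unchanged doc yields the same ids.

import hashlib

def make_chunk(doc: dict, title_path: str, position: int, text: str) -> dict:
    # Deterministic id: same doc, section, and position -> same chunk_id.
    chunk_id = hashlib.sha256(
        f"{doc['doc_id']}|{title_path}|{position}".encode("utf-8")
    ).hexdigest()[:16]
    return {
        "doc_id": doc["doc_id"],
        "doc_version": doc["doc_version"],
        "chunk_id": chunk_id,
        "chunk_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "title_path": title_path,
        "text": text,
        "permissions_tags": doc["permissions_tags"],  # copied from the doc
    }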

3) Embed

Embedding converts chunk text into vectors. Requirements (sketched in code below):

  • batching: embed in batches for speed and cost.
  • retry strategy: handle transient failures without duplicating records.
  • versioning: store embedding model name/version and preprocessing version.
  • rate limits: respect quotas (plan for backoff).
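
A batching-and-retry sketch; embed_batch stands in for whatever client call you use, the model/version strings are placeholders, and the backoff schedule is illustrative. Duplicate protection comes later, from keying the store by chunk_id.

import time

def embed_chunks(chunks: list[dict], embed_batch, model: str = "embed-model-v1",
                 batch_size: int = 64, max_retries: int = 5) -> list[dict]:
    records = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                vectors = embed_batch([c["text"] for c in batch])
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)  # exponential backoff on transient errors
        for chunk, vector in zip(batch, vectors):
            records.append({
                "chunk_id": chunk["chunk_id"],
                "embedding_model": model,   # recorded for audits and re-embeds
                "embedding_version": "v1",  # bump when preprocessing changes
                "vector": vector,
                "created_at": time.time(),
            })
    return records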

4) Store

You typically store two things (see the sketch after this list):

  • Chunk store: chunk_id → text + metadata (for citations and audits).
  • Vector index: chunk_id → embedding vector + metadata filters.

Even if your vector DB can store text, it’s often useful to keep a separate chunk store to simplify audit logs and versioning.
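
A sketch of the two stores using plain dicts as stand-ins; a real system swaps in a document DB and a vector index, but keying both by chunk_id is what makes re-runs overwrite rather than duplicate:

chunk_store: dict[str, dict] = {}   # chunk_id -> text + metadata
vector_index: dict[str, dict] = {}  # chunk_id -> vector + filterable metadata

def store_all(chunks: list[dict], embeddings: list[dict]) -> None:
    for chunk in chunks:
        chunk_store[chunk["chunk_id"]] = chunk  # upsert: re-runs overwrite
    for emb in embeddings:
        meta = chunk_store[emb["chunk_id"]]
        vector_index[emb["chunk_id"]] = {
            "vector": emb["vector"],
            "doc_id": meta["doc_id"],
            "permissions_tags": meta["permissions_tags"],  # query-time filters
        }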

Data model: documents, chunks, embeddings

A simple conceptual data model:

Document {
  doc_id
  doc_version
  title
  source_uri
  permissions_tags
  extracted_text
  doc_hash
}

Chunk {
  doc_id
  doc_version
  chunk_id
  chunk_hash
  title_path
  text
  permissions_tags
}

Embedding {
  chunk_id
  embedding_model
  embedding_version
  vector
  created_at
}

Key invariant: chunk_id must map to exactly one chunk text for a given version.
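
A quick check of that invariant you can run over the chunk list before storing (a sketch; field names follow the model above):

from collections import defaultdict

def check_chunk_invariant(chunks: list[dict]) -> None:
    # Each (chunk_id, doc_version) pair must map to exactly one text.
    seen = defaultdict(set)
    for c in chunks:
        seen[(c["chunk_id"], c["doc_version"])].add(c["chunk_hash"])
    conflicts = [key for key, hashes in seen.items() if len(hashes) > 1]
    assert not conflicts, f"chunk_id collisions: {conflicts[:5]}"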

Idempotency, retries, and versioning

Indexing is a batch job. Treat it like production software:

  • Idempotency: re-running indexing on unchanged docs should not create duplicates.
  • Change detection: use doc_hash/chunk_hash to detect what changed.
  • Partial failures: if embedding fails halfway, you can resume safely.
  • Version upgrades: embedding model upgrades require planned re-embedding.

Use hashes as your “truth”

doc_hash and chunk_hash let you do incremental updates without trusting timestamps or filesystem quirks.
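
A sketch of hash-based change detection; previous_hashes is assumed to be loaded from whatever you persisted on the last run:

def docs_to_reindex(docs: list[dict], previous_hashes: dict[str, str]) -> list[dict]:
    # New docs have no previous hash; changed docs have a different one.
    return [d for d in docs if previous_hashes.get(d["doc_id"]) != d["doc_hash"]]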

Quality checks and smoke tests

Before you move on to query-time retrieval, run basic checks (a test sketch follows the list):

  • Doc count: did you ingest the expected number of docs?
  • Chunk count: are chunk counts reasonable (no accidental 1-char chunks)?
  • Metadata: do chunks include required tags (doc_id, chunk_id, permissions)?
  • Sample retrieval: pick 5 questions and confirm retrieval returns plausible chunks.
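
A smoke-test sketch against a dict-shaped chunk store like the one above; the 20-character minimum and the required-tag set are placeholders to tune, and the sample-retrieval check still needs a manual pass with your own questions.

def smoke_test(chunk_store: dict[str, dict], expected_doc_count: int) -> None:
    doc_ids = {c["doc_id"] for c in chunk_store.values()}
    assert len(doc_ids) == expected_doc_count, "doc count mismatch"
    tiny = [c for c in chunk_store.values() if len(c["text"]) < 20]
    assert not tiny, f"{len(tiny)} suspiciously short chunks"
    required = {"doc_id", "chunk_id", "permissions_tags"}
    for c in chunk_store.values():
        assert required <= c.keys(), f"missing metadata on {c['chunk_id']}"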

Copy-paste prompts

Prompt: design an indexing pipeline

Help me design an indexing pipeline for a RAG system.

Inputs:
- Corpus: [types, where stored, sensitivity]
- Need citations: yes/no
- Need permissions filtering: yes/no
- Update frequency: [daily/weekly/etc]

Output:
1) Data model (documents, chunks, embeddings) with required fields.
2) Chunking strategy per doc type.
3) Idempotent indexing approach (hashing/versioning).
4) Failure handling (retries, partial runs).
5) Smoke tests to validate the index.

Ship points

  • Ship point 1: indexer runs end-to-end on a small corpus.
  • Ship point 2: re-running indexer doesn’t duplicate data.
  • Ship point 3: retrieval returns plausible chunks for 5–10 questions.

Where to go next