33.3 Parallelizing retrieval and preprocessing

This section shows how to cut end-to-end latency in a RAG app by running retrieval and preprocessing steps concurrently instead of one after another.

The Serial Trap

A naive RAG app often does this:

1. Receive User Query
2. (Wait) Generate Embedding for Query
3. (Wait) Search Vector DB
4. (Wait) Fetch Full Documents
5. (Wait) Call LLM
6. Return Response

This "waterfall" kills performance: each step blocks on a network round trip while everything else sits idle, so the user's latency is the sum of all four waits.
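To make the cost concrete, here is a minimal asyncio sketch of the waterfall. The helpers (embed_query, search_vectors, fetch_documents, call_llm) are hypothetical stand-ins for your embedding API, vector DB, document store, and LLM client; the sleeps model their network latency:

```python
import asyncio

# Hypothetical stand-ins for real network calls; the sleeps model I/O latency.
async def embed_query(query):
    await asyncio.sleep(0.1)
    return [0.0] * 768

async def search_vectors(vector):
    await asyncio.sleep(0.2)
    return ["doc-1", "doc-2"]

async def fetch_documents(doc_ids):
    await asyncio.sleep(0.2)
    return [f"full text of {d}" for d in doc_ids]

async def call_llm(prompt):
    await asyncio.sleep(1.0)
    return "answer"

async def answer_serial(query: str) -> str:
    # Every await blocks the whole request, so latencies simply add up:
    # 0.1 + 0.2 + 0.2 + 1.0 ≈ 1.5 s before the user sees anything.
    vector = await embed_query(query)
    doc_ids = await search_vectors(vector)
    docs = await fetch_documents(doc_ids)
    return await call_llm(query + "\n\n" + "\n".join(docs))

print(asyncio.run(answer_serial("why is my pipeline slow?")))
```

Some of these steps genuinely depend on each other (you can't fetch documents before you have IDs), but a surprising amount of the work does not, which is what the next patterns exploit.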

Parallelizing the Pipeline

Much of this work is independent, so you can do many things at once:

  • Speculative Retrieval: Start searching your docs as soon as the user stops typing (debounce), before they even hit Enter (see the debounce sketch after this list).
  • Parallel Chunks: If you need to summarize 5 documents, send 5 separate requests to the model in parallel (map-reduce) rather than asking it to summarize them one by one.
  • Hybrid Search: Run your keyword search (e.g., Elasticsearch) and vector search (e.g., Pinecone) at the same time, then merge the results. These last two patterns are sketched right after this list.
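A sketch of the last two bullets, assuming hypothetical keyword_search, vector_search, and summarize helpers wrapping your real Elasticsearch, Pinecone, and LLM clients. asyncio.gather puts both searches, and all the summaries, in flight at once:

```python
import asyncio

# Hypothetical async wrappers around your real clients; each is one round trip.
async def keyword_search(query):
    await asyncio.sleep(0.2)
    return ["doc-3", "doc-1"]

async def vector_search(query):
    await asyncio.sleep(0.3)
    return ["doc-1", "doc-2"]

async def summarize(doc_id):
    await asyncio.sleep(1.0)
    return f"summary of {doc_id}"

async def hybrid_retrieve(query: str) -> list[str]:
    # Hybrid Search: both backends run at the same time, so the cost is
    # max(0.2, 0.3) instead of 0.2 + 0.3.
    keyword_hits, vector_hits = await asyncio.gather(
        keyword_search(query), vector_search(query)
    )
    # Naive merge: deduplicate while preserving order (use a proper fusion
    # method like reciprocal rank fusion in practice).
    return list(dict.fromkeys(keyword_hits + vector_hits))

async def map_reduce_summaries(doc_ids: list[str]) -> str:
    # Parallel Chunks: one request per document, all in flight at once,
    # followed by a single reduce step over the partial summaries.
    partials = await asyncio.gather(*(summarize(d) for d in doc_ids))
    return "\n".join(partials)

async def main():
    docs = await hybrid_retrieve("quarterly revenue")
    print(await map_reduce_summaries(docs))

asyncio.run(main())
```

With five documents, the map step costs roughly one summary's latency instead of five, at the price of five concurrent requests against your rate limits.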
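Speculative retrieval, the first bullet, is essentially a debounce: each keystroke cancels the previous speculative search and schedules a new one after a short quiet period. A minimal sketch, where the quiet period and the search callable (e.g., hybrid_retrieve above) are assumptions you would tune:

```python
import asyncio

class SpeculativeRetriever:
    """Debounced retrieval: fires only after the user pauses typing."""

    def __init__(self, search, quiet_period: float = 0.3):
        self._search = search          # async callable, e.g. hybrid_retrieve
        self._quiet_period = quiet_period
        self._pending: asyncio.Task | None = None

    def on_keystroke(self, partial_query: str) -> None:
        # Each keystroke cancels the previous speculative search...
        if self._pending is not None:
            self._pending.cancel()
        # ...and schedules a new one to start after the quiet period.
        self._pending = asyncio.create_task(self._debounced(partial_query))

    async def _debounced(self, query: str):
        await asyncio.sleep(self._quiet_period)  # cancelled if typing resumes
        return await self._search(query)

    async def result(self):
        # Called when the user hits Enter; the search is often already done.
        return await self._pending if self._pending else None
```

By the time the user presses Enter, retrieval for their final query has typically been running since their last pause, so perceived latency shrinks by roughly the length of the search.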

The "Optimistic" UI

If you know the user will likely need a specific tool (e.g., they opened the "SQL Editor" tab), start pre-loading the schema context in the background so it's ready when they ask a question.
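One way to express this with asyncio: kick off a background task the moment the tab opens, then await it (usually already finished) when the question arrives. load_sql_schema here is a hypothetical loader for your database metadata:

```python
import asyncio

async def load_sql_schema():
    # Hypothetical: introspect the database and format the schema for the prompt.
    await asyncio.sleep(0.5)
    return "CREATE TABLE orders (id INT, total DECIMAL, ...);"

class SqlEditorSession:
    def __init__(self):
        self._schema_task: asyncio.Task | None = None

    def on_tab_opened(self) -> None:
        # Fire and forget: start fetching context before any question is asked.
        self._schema_task = asyncio.create_task(load_sql_schema())

    async def ask(self, question: str) -> str:
        if self._schema_task is None:
            # Fallback: the user asked before the tab event fired; load now.
            self._schema_task = asyncio.create_task(load_sql_schema())
        # Awaiting an already-finished task returns instantly, so the schema
        # fetch usually costs the user nothing.
        schema = await self._schema_task
        return f"{schema}\n\nUser question: {question}"
```

The worst case is that you loaded context the user never needed, which is usually a cheap trade against shaving a full round trip off every question they do ask.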
