24.4 Ranking and re-ranking intuition

Goal: improve relevance beyond naive “top-k vectors”

Embedding similarity is a great baseline, but it often retrieves:

  • chunks that are semantically similar but don't actually answer the question,
  • chunks with the same buzzwords but wrong scope (e.g., policy vs implementation),
  • the right doc but the wrong subsection.

Ranking and reranking are how you turn “kind of related” into “the best evidence.”

A simple retrieval pipeline mental model

Think of retrieval as two phases:

  1. Candidate generation: get a shortlist (fast, broad recall).
  2. Candidate selection: choose the best few (slower, higher precision).

Common pattern:

  • Generate 50–200 candidates with a fast method (vector and/or keyword).
  • Rerank the top 20–50 with a stronger method.
  • Include 5–12 chunks in the final prompt (depending on context budget).
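
A minimal sketch of that two-phase shape, with the candidate generator and reranker passed in as plain callables (the names and default k values here are illustrative, not recommendations):

```python
from typing import Callable, Sequence

def retrieve(query: str,
             candidate_gen: Callable[[str, int], Sequence[str]],  # fast: vector and/or keyword search
             rerank_score: Callable[[str, str], float],           # slower: scores a (query, chunk) pair
             candidate_k: int = 100, rerank_k: int = 30, prompt_k: int = 8) -> list[str]:
    # Phase 1: candidate generation — fast, broad recall.
    candidates = list(candidate_gen(query, candidate_k))

    # Phase 2: candidate selection — slower, higher precision.
    # Only rerank the head of the candidate list to bound latency and cost.
    head = candidates[:rerank_k]
    reranked = sorted(head, key=lambda chunk: rerank_score(query, chunk), reverse=True)

    # Prompt k ≠ retrieval k: include only a few chunks in the final prompt.
    return reranked[:prompt_k]
```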

Retrieval k ≠ prompt k

Retrieve broadly to avoid missing evidence; include narrowly to stay within budget and reduce distraction.

Hybrid search (keyword + vector)

Keyword search (BM25) is good when:

  • exact terms matter (error codes, parameter names, product SKUs),
  • users use the same terms as the docs,
  • you need high precision on specific entities.

Vector search is good when:

  • users paraphrase,
  • synonyms and related concepts matter,
  • the corpus uses different wording than users.

Hybrid search combines them. Practical ways to do it:

  • Union: take top-k from both, dedupe, then rerank.
  • Weighted score: combine normalized BM25 and vector similarity (sketched below).
  • Conditional: use keyword when the query contains “hard tokens” (IDs, error codes), else vector.
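
A minimal sketch of the weighted-score variant, assuming you already have raw BM25 and vector scores keyed by chunk id (the weight and min-max normalization are illustrative, not tuned values):

```python
def minmax(scores: dict[str, float]) -> dict[str, float]:
    """Normalize raw scores to [0, 1] so BM25 and cosine scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {cid: (s - lo) / span for cid, s in scores.items()}

def hybrid_scores(bm25: dict[str, float], vec: dict[str, float],
                  w_keyword: float = 0.4) -> dict[str, float]:
    """Weighted union: chunks found by only one method keep that method's contribution."""
    bm25_n, vec_n = minmax(bm25), minmax(vec)
    all_ids = bm25_n.keys() | vec_n.keys()
    return {
        cid: w_keyword * bm25_n.get(cid, 0.0) + (1 - w_keyword) * vec_n.get(cid, 0.0)
        for cid in all_ids
    }
```

Sort the combined scores descending, dedupe, and pass the result to your reranker.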

Reranking: why it helps and how to use it

Reranking answers the question: “Given this query, which of these candidates is truly most relevant?”

Reranking helps because it can use richer signals than embedding similarity:

  • exact match of constraints and exceptions,
  • scope alignment (policy vs tutorial vs ticket),
  • entity relationships (“this endpoint returns that field”),
  • negation and nuance that embeddings can blur.

Practical reranking strategies:

  • Cross-encoder reranker: a dedicated model that scores (query, chunk) pairs (example below).
  • LLM-as-reranker: ask the model to pick the top N chunks with reasons (works, but watch latency and cost).
  • Heuristic reranker: boost chunks with matching doc types, recency, or canonical tags.
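
A minimal cross-encoder sketch, assuming the sentence-transformers library; the checkpoint name is illustrative, so pick whatever fits your latency budget:

```python
from sentence_transformers import CrossEncoder

# Cross-encoders score (query, chunk) pairs jointly: slower than embedding
# similarity, but much better at fine-grained relevance judgments.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative model name

def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[tuple[str, float]]:
    pairs = [(query, chunk) for chunk in chunks]
    scores = reranker.predict(pairs)  # one relevance score per (query, chunk) pair
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]
```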

Do not rerank without evaluation

Rerankers can feel better while silently dropping the key chunk. Measure recall and faithfulness on an eval set.

Metadata filters and “authority” weighting

RAG systems usually need to prefer “more authoritative” sources:

  • canonical policies > informal notes,
  • current version > deprecated version,
  • official docs > tickets and chats.

Encode this in metadata and ranking rules:

  • Filter: restrict retrieval to doc types for a given question.
  • Boost: increase scores for canonical tags.
  • Decay: reduce scores for stale content unless explicitly requested.
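
One way to express boost and decay as a post-retrieval score adjustment; a minimal sketch assuming each chunk carries `doc_type` and a timezone-aware `updated_at` in its metadata (field names and weights are illustrative):

```python
from datetime import datetime, timezone

# Illustrative authority weights; tune against your own eval set.
DOC_TYPE_BOOST = {"policy": 1.3, "official_doc": 1.2, "ticket": 0.8, "chat": 0.7}

def adjust_score(base_score: float, doc_type: str, updated_at: datetime,
                 half_life_days: float = 365.0) -> float:
    """Boost authoritative doc types and decay stale content."""
    boost = DOC_TYPE_BOOST.get(doc_type, 1.0)
    age_days = (datetime.now(timezone.utc) - updated_at).days  # updated_at must be tz-aware
    decay = 0.5 ** (age_days / half_life_days)  # exponential staleness decay
    return base_score * boost * (0.5 + 0.5 * decay)  # never fully zero out old content
```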

Tuning k and context packing

Two practical tuning knobs:

  • Candidate k: how many chunks you retrieve before reranking.
  • Prompt k: how many chunks you include in the final prompt.

Symptoms and fixes:

  • Answer misses key exception: increase candidate k; improve chunking so exceptions are included.
  • Answer rambles or contradicts itself: decrease prompt k; improve reranking; tighten prompts.
  • Answer is slow/costly: reduce candidate k; use caching; use cheaper reranker.
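
A minimal packing sketch for the prompt-k side: include reranked chunks in score order until a token budget is exhausted. The 4-characters-per-token estimate is a rough assumption; use your real tokenizer in practice:

```python
def pack_context(ranked_chunks: list[str], token_budget: int = 3000,
                 max_chunks: int = 12) -> list[str]:
    """Greedily include chunks in rank order until the budget or chunk cap is hit."""
    packed, used = [], 0
    for chunk in ranked_chunks:
        est_tokens = len(chunk) // 4  # rough heuristic; swap in a real tokenizer
        if used + est_tokens > token_budget or len(packed) >= max_chunks:
            break
        packed.append(chunk)
        used += est_tokens
    return packed
```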

How to evaluate ranking changes

Track both retrieval quality and answer quality:

  • Retrieval: recall@k and MRR on labeled questions.
  • Answer faithfulness: do citations actually support claims?
  • Not-found accuracy: does it abstain when evidence is missing?
  • Latency/cost: time per query and token usage.

Small ranking tweaks can create large downstream effects; treat them like code changes with tests.
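
A minimal sketch of recall@k and MRR, assuming each labeled question pairs the retrieved chunk ids (in rank order) with the set of ids known to contain the answer:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant chunks that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant chunk (0 if none retrieved)."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

# Average both metrics over the labeled question set and track them per ranking change.
```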

Copy-paste prompts

Prompt: LLM rerank top candidates

You are reranking retrieved document chunks for a question.

Rules:
- Select the best 5 chunks for answering the question.
- Prefer authoritative/canonical sources if they contain the answer.
- Explain the selection in 1 sentence per chosen chunk.

Question: [question]

Candidates (id + type + text):
[id: ... | type: ...]
```text
...
```

Return JSON:
{ "selected": [{ "chunk_id": string, "reason": string }] }

Where to go next