25.1 Spec: "Answer questions about my docs with references"

Goal: a spec that makes RAG measurable

RAG projects fail when “it answered my question once” becomes the success metric.

Your spec should define:

  • what the system must do,
  • what it must never do,
  • how you’ll evaluate it,
  • what “done” looks like.

Spec-first keeps you from wandering into infrastructure too early

If you can’t state your output contract, “not found” behavior, and citation requirements, don’t start picking vector databases.

User stories (what people actually want)

Pick a small set of concrete stories. Example patterns:

  • “Where does it say that?” Users want an answer and a quote they can verify.
  • “What’s the rule and the exception?” Users want nuance, not a single sentence.
  • “Is this allowed?” Users want policy-grounded guidance with escalation when unclear.
  • “Which doc covers this?” Users want the right source, not a generated paragraph.

Notice the common thread: traceability and safe uncertainty.

Core requirements (grounding contract)

These are the behaviors that make the system trustworthy:

  • Sources-only: answers must be supported by retrieved chunks.
  • Citations: every important claim includes chunk ids and short quotes.
  • Not found: if the corpus doesn’t contain the answer, the system says so.
  • Conflict handling: if sources conflict, surface the conflict.
  • Permissions: retrieval respects access control before generation.
  • Deterministic structure: output must be machine-parseable (schema validation).
  • Audit logging: store which chunks influenced the answer.

Decide your “abstain” policy now

If you don’t allow abstention, the system will guess. Guessing is the root of trust loss.
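
A minimal sketch of that decision in Python. Here retrieve, generate_answer, and the MIN_RELEVANCE threshold are hypothetical stand-ins for your own retriever, LLM call, and tuned cutoff; the "not found" shape matches the output contract defined below.

MIN_RELEVANCE = 0.35  # assumed cutoff below which we refuse to answer; tune it on your eval set

def answer_or_abstain(question: str, user_id: str) -> dict:
    # retrieve() is a stand-in for permission-aware retrieval (see core requirements).
    chunks = retrieve(question, user_id=user_id, top_k=8)
    usable = [c for c in chunks if c["score"] >= MIN_RELEVANCE]
    if not usable:
        # Abstain instead of guessing: return the contract's "not found" shape.
        return {
            "answer": {"summary": "", "bullets": []},
            "not_found": True,
            "confidence": "low",
            "missing_info": [question],
            "conflicts": [],
            "follow_up_question": "Which document or topic should I look in?",
            "retrieval": {"used_chunk_ids": []},
        }
    # generate_answer() must cite chunk ids from `usable` only.
    return generate_answer(question, usable)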

Non-goals (scope control)

Write explicit non-goals to avoid building a giant agent system by accident:

  • not a general web search assistant,
  • not a replacement for legal/security review,
  • not a “chat about anything” bot,
  • not an auto-editor of source documents,
  • not a tool-executing agent (unless explicitly designed with approvals).

Output contract (JSON shape)

Define a stable output format early. Here is a practical baseline:

{
  "answer": {
    "summary": string,
    "bullets": [{
      "claim": string,
      "sources": [{ "chunk_id": string, "quote": string }]
    }]
  },
  "not_found": boolean,
  "confidence": "high" | "medium" | "low",
  "missing_info": string[],
  "conflicts": [{
    "topic": string,
    "source_a": { "chunk_id": string, "quote": string },
    "source_b": { "chunk_id": string, "quote": string }
  }],
  "follow_up_question": string|null,
  "retrieval": {
    "used_chunk_ids": string[]
  }
}

Notes:

  • confidence is UX-oriented and can be derived from heuristics; don’t treat it as a calibrated probability.
  • retrieval.used_chunk_ids helps debugging and auditing (also store scores in logs).
  • follow_up_question is how you handle ambiguity without guessing.
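
If you want to enforce the contract mechanically, a partial JSON Schema plus the jsonschema Python library is enough to start. The sketch below is one way to wire it up, not the only one; it omits the conflicts field for brevity and encodes the "at least one citation per bullet" rule via minItems.

from jsonschema import Draft202012Validator

CONTRACT_SCHEMA = {
    "type": "object",
    "required": ["answer", "not_found", "confidence", "retrieval"],
    "properties": {
        "answer": {
            "type": "object",
            "required": ["summary", "bullets"],
            "properties": {
                "summary": {"type": "string"},
                "bullets": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "required": ["claim", "sources"],
                        "properties": {
                            "claim": {"type": "string"},
                            "sources": {
                                "type": "array",
                                "minItems": 1,  # every claim needs at least one citation
                                "items": {
                                    "type": "object",
                                    "required": ["chunk_id", "quote"],
                                    "properties": {
                                        "chunk_id": {"type": "string"},
                                        "quote": {"type": "string"},
                                    },
                                },
                            },
                        },
                    },
                },
            },
        },
        "not_found": {"type": "boolean"},
        "confidence": {"enum": ["high", "medium", "low"]},
        "missing_info": {"type": "array", "items": {"type": "string"}},
        "follow_up_question": {"type": ["string", "null"]},
        "retrieval": {
            "type": "object",
            "required": ["used_chunk_ids"],
            "properties": {
                "used_chunk_ids": {"type": "array", "items": {"type": "string"}}
            },
        },
    },
}

def validate_response(payload: dict) -> list[str]:
    """Return human-readable schema violations; an empty list means the response is valid."""
    validator = Draft202012Validator(CONTRACT_SCHEMA)
    return [error.message for error in validator.iter_errors(payload)]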

Acceptance criteria (definition of done)

Define “done” as measurable checks, not vibes:

  • Schema: 100% of responses parse and validate against schema.
  • Citations: 100% of bullets include at least one citation.
  • Faithfulness: in manual spot-checks, citations actually support claims.
  • Not found: for questions outside the corpus, the system abstains.
  • Latency: p95 response time under your target (set a number).
  • Permissions: no retrieval/generation leaks restricted content in tests.
  • Audit: each answer records model version, prompt version, and chunk ids used.
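
The first few criteria translate directly into an automated check over a batch of logged responses. A sketch, assuming the validate_response helper from the schema sketch above and a record format you choose yourself:

def run_acceptance_checks(records: list[dict]) -> dict:
    """records: [{"question": ..., "expect_not_found": bool, "response": dict}, ...]"""
    if not records:
        return {}
    total = len(records)
    schema_ok = sum(1 for r in records if not validate_response(r["response"]))
    cited = sum(
        1
        for r in records
        if all(b["sources"] for b in r["response"].get("answer", {}).get("bullets", []))
    )
    abstained = sum(
        1
        for r in records
        if not r["expect_not_found"] or r["response"].get("not_found") is True
    )
    return {
        "schema_pass_rate": schema_ok / total,      # target: 1.0
        "citation_pass_rate": cited / total,        # target: 1.0
        "abstention_pass_rate": abstained / total,  # "not found" questions must abstain
    }

Faithfulness and permissions still need manual spot-checks and targeted tests; a script like this only covers the mechanical criteria.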

Edge cases you must decide upfront

Decide policies for these now to avoid hidden behavior:

  • Contradictions: do you ask a question, pick the newest, or escalate?
  • Multiple valid answers: do you list options with sources?
  • Missing evidence: do you return “not found” or ask for clarification first?
  • Long answers: do you cap bullets, summarize, or paginate?
  • Private docs: how do you filter by user/team/tenant?
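
For the private-docs case in particular, the safest pattern is to apply the access filter inside retrieval so restricted chunks never reach the prompt. A minimal sketch, assuming a hypothetical vector store client and filter syntax:

def retrieve_for_user(query: str, user: dict, store, top_k: int = 8) -> list[dict]:
    # Filter at query time; store.search and the filter keys are stand-ins
    # for whatever your vector store actually exposes.
    allowed = {
        "tenant_id": user["tenant_id"],
        "visibility": {"$in": ["public"] + user.get("groups", [])},
    }
    return store.search(query=query, filter=allowed, top_k=top_k)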

Your first eval set (start now)

Create an eval set immediately. Don’t wait until after you build the pipeline.

Good eval questions:

  • come from real user needs,
  • have a clear “should be answerable” vs “not found,”
  • cover common and high-risk topics,
  • include tricky cases (exceptions, negations, conflicts).

Start with 25 questions: enough to catch major regressions, small enough to review by hand.
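
A starter format that keeps the eval set versioned next to your prompts: a plain list of questions with an expected "answerable" vs "not found" label. The entries below are illustrative placeholders, not real corpus questions.

EVAL_SET = [
    {
        "question": "What is the refund window for annual plans?",  # illustrative
        "expect_not_found": False,
        "expected_chunk_ids": ["billing-policy-004"],  # hypothetical chunk id
    },
    {
        "question": "Does the expense policy have an exception for conference travel?",
        "expect_not_found": False,
        "expected_chunk_ids": ["expense-policy-012"],
    },
    {
        "question": "What is our policy on cryptocurrency payroll?",
        "expect_not_found": True,  # deliberately outside the corpus
        "expected_chunk_ids": [],
    },
]

Run these through the pipeline, store the responses, and feed the (question, expectation, response) records into the acceptance checks sketched earlier.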

Copy-paste prompts

Prompt: turn rough idea into a RAG spec

Help me write a spec for a RAG system that answers questions about my documents with references.

Inputs:
- Users and use cases: ...
- Corpus (types, size, update rate, sensitivity): ...
- Must-have behaviors (citations, not-found, permissions): ...

Output:
1) User stories
2) Core requirements and non-goals
3) Output JSON schema (include citations per claim)
4) Acceptance criteria (measurable)
5) First eval set: 25 questions + expected "answerable" vs "not found"
