25.1 Spec: "Answer questions about my docs with references"
Goal: a spec that makes RAG measurable
RAG projects fail when “it answered my question once” becomes the success metric.
Your spec should define:
- what the system must do,
- what it must never do,
- how you’ll evaluate it,
- what “done” looks like.
If you can’t state your output contract, “not found” behavior, and citation requirements, don’t start picking vector databases.
User stories (what people actually want)
Pick a small set of concrete stories. Example patterns:
- “Where does it say that?” Users want an answer and a quote they can verify.
- “What’s the rule and the exception?” Users want nuance, not a single sentence.
- “Is this allowed?” Users want policy-grounded guidance with escalation when unclear.
- “Which doc covers this?” Users want the right source, not a generated paragraph.
Notice the common thread: traceability and safe uncertainty.
Core requirements (grounding contract)
These are the behaviors that make the system trustworthy:
- Sources-only: answers must be supported by retrieved chunks.
- Citations: every important claim includes chunk ids and short quotes.
- Not found: if the corpus doesn’t contain the answer, the system says so.
- Conflict handling: if sources conflict, surface the conflict.
- Permissions: retrieval respects access control before generation.
- Deterministic structure: output must be machine-parseable (schema validation).
- Audit logging: store which chunks influenced the answer.
If you don’t allow abstention, the system will guess. Guessing is how trust is lost.
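One way to keep sources-only and abstention from being aspirational is a mechanical guard between generation and the user. A minimal sketch in Python, assuming the JSON output contract defined below; the function name and dict shapes are illustrative, not a fixed API:

# Minimal grounding guard. Assumes the JSON output contract defined in the
# next section; function name and dict shapes are illustrative.
def enforce_grounding(answer: dict, retrieved_ids: set[str]) -> dict:
    """Refuse to ship an answer that violates the sources-only rule."""
    if answer.get("not_found"):
        return answer  # abstention is always an acceptable outcome

    for bullet in answer.get("answer", {}).get("bullets", []):
        sources = bullet.get("sources", [])
        if not sources:
            raise ValueError(f"uncited claim: {bullet.get('claim')!r}")
        for source in sources:
            if source["chunk_id"] not in retrieved_ids:
                raise ValueError(f"citation to a chunk that was never retrieved: {source['chunk_id']}")
    return answer

If the guard trips, retry with a stricter prompt or return a "not found" response; never ship the uncited claim.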
Non-goals (scope control)
Write explicit non-goals to avoid building a giant agent system by accident:
- not a general web search assistant,
- not a replacement for legal/security review,
- not a “chat about anything” bot,
- not an auto-editor of source documents,
- not a tool-executing agent (unless explicitly designed with approvals).
Output contract (JSON shape)
Define a stable output format early. Here is a practical baseline:
{
  "answer": {
    "summary": string,
    "bullets": [{
      "claim": string,
      "sources": [{ "chunk_id": string, "quote": string }]
    }]
  },
  "not_found": boolean,
  "confidence": "high" | "medium" | "low",
  "missing_info": string[],
  "conflicts": [{
    "topic": string,
    "source_a": { "chunk_id": string, "quote": string },
    "source_b": { "chunk_id": string, "quote": string }
  }],
  "follow_up_question": string | null,
  "retrieval": {
    "used_chunk_ids": string[]
  }
}
Notes:
- confidence is UX-oriented and can be derived from heuristics; don’t treat it as a calibrated probability.
- retrieval.used_chunk_ids helps debugging and auditing (also store scores in logs).
- follow_up_question is how you handle ambiguity without guessing.
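If you enforce the contract in code rather than by convention, mirror it with a typed model so schema drift fails loudly. A minimal sketch, assuming Pydantic v2 (any JSON Schema validator works just as well); field names match the contract above:

# The output contract as typed models, assuming Pydantic v2.
from typing import Literal, Optional
from pydantic import BaseModel

class Source(BaseModel):
    chunk_id: str
    quote: str

class Bullet(BaseModel):
    claim: str
    sources: list[Source]

class Answer(BaseModel):
    summary: str
    bullets: list[Bullet]

class Conflict(BaseModel):
    topic: str
    source_a: Source
    source_b: Source

class Retrieval(BaseModel):
    used_chunk_ids: list[str]

class RagResponse(BaseModel):
    answer: Answer
    not_found: bool
    confidence: Literal["high", "medium", "low"]
    missing_info: list[str]
    conflicts: list[Conflict]
    follow_up_question: Optional[str]
    retrieval: Retrieval

# RagResponse.model_validate_json(raw_model_output) raises on any contract violation.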
Acceptance criteria (definition of done)
Define “done” as measurable checks, not vibes:
- Schema: 100% of responses parse and validate against schema.
- Citations: 100% of bullets include at least one citation.
- Faithfulness: in manual spot-checks, citations actually support claims.
- Not found: for questions outside the corpus, the system abstains.
- Latency: p95 response time under your target (set a number).
- Permissions: no retrieval/generation leaks restricted content in tests.
- Audit: each answer records model version, prompt version, and chunk ids used.
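Most of these checks can run as one automated gate over your eval set, with faithfulness left for manual spot-checks. A rough sketch, reusing the RagResponse model above; run_pipeline, the case fields, and the latency target are placeholders for your own setup:

# Sketch of an automated acceptance gate over an eval set. `run_pipeline` is
# whatever calls your RAG system and returns (raw_json, latency_seconds);
# RagResponse is the model from the output-contract sketch. The default p95
# target is a placeholder, not a recommendation.
def acceptance_gate(eval_cases: list[dict], run_pipeline, p95_target_seconds: float = 3.0) -> None:
    latencies = []
    for case in eval_cases:
        raw_json, seconds = run_pipeline(case["question"])
        response = RagResponse.model_validate_json(raw_json)  # schema: 100% must parse and validate
        latencies.append(seconds)

        if case["expected"] == "not_found":
            assert response.not_found, f"should abstain: {case['question']}"
        else:
            # citations: every bullet carries at least one source
            assert all(b.sources for b in response.answer.bullets), f"uncited bullet for: {case['question']}"

    latencies.sort()
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    assert p95 <= p95_target_seconds, f"p95 latency {p95:.2f}s exceeds target"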
Edge cases you must decide upfront
Decide policies for these now to avoid hidden behavior:
- Contradictions: do you ask a clarifying question, prefer the newest source, or escalate?
- Multiple valid answers: do you list options with sources?
- Missing evidence: do you return “not found” or ask for clarification first?
- Long answers: do you cap bullets, summarize, or paginate?
- Private docs: how do you filter by user/team/tenant?
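For the private-docs question, the safest pattern is to filter at retrieval time on chunk metadata so restricted content never reaches the model. A generic sketch; vector_search, the metadata fields, and the filter syntax stand in for whatever your store actually provides, not a specific API:

# Illustrative pre-retrieval permission filter. The point is that restricted
# chunks never reach the prompt, so they cannot be cited or leaked.
def retrieve_for_user(query: str, user: dict, k: int = 8) -> list[dict]:
    metadata_filter = {
        "tenant_id": user["tenant_id"],             # hard tenant isolation
        "allowed_groups": {"$in": user["groups"]},  # team/group access (syntax varies by store)
    }
    return vector_search(query, filter=metadata_filter, top_k=k)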
Your first eval set (start now)
Create an eval set immediately. Don’t wait until after you build the pipeline.
Good eval questions:
- come from real user needs,
- have a clear “should be answerable” vs. “not found” expectation,
- cover common and high-risk topics,
- include tricky cases (exceptions, negations, conflicts).
Start with 25 questions. That’s enough to catch major regressions and small enough to review quickly.
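A plain list in code (or a JSONL file) is enough to start; the questions, file names, and keys below are purely illustrative:

# Illustrative eval cases; the questions, file names, and keys are made up.
# What matters is that every case carries an expected outcome.
EVAL_CASES = [
    {"question": "What is the refund window for annual plans?",
     "expected": "answerable", "expected_source": "billing-policy.md"},
    {"question": "Can contractors access the production VPN?",
     "expected": "answerable", "expected_source": "access-policy.md"},
    {"question": "What is the policy on company cars?",  # deliberately outside the corpus
     "expected": "not_found", "expected_source": None},
]

Feed the same cases into the acceptance gate above so every regression check exercises both answerable and not-found behavior.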
Copy-paste prompts
Prompt: turn a rough idea into a RAG spec
Help me write a spec for a RAG system that answers questions about my documents with references.
Inputs:
- Users and use cases: ...
- Corpus (types, size, update rate, sensitivity): ...
- Must-have behaviors (citations, not-found, permissions): ...
Output:
1) User stories
2) Core requirements and non-goals
3) Output JSON schema (include citations per claim)
4) Acceptance criteria (measurable)
5) First eval set: 25 questions + expected "answerable" vs "not found"