26.5 Auditing: storing which chunks influenced answers
Goal: make every answer explainable later
If a user asks “why did it say that?”, you should be able to answer with evidence:
- which sources were retrieved,
- which were included in the prompt,
- which were cited in the answer,
- which model/prompt versions were used.
This is how you debug issues and build trust.
Why auditing matters (beyond compliance)
Auditing is valuable even if you don’t have compliance requirements:
- Debugging: reproduce failures and fix the right layer (retrieval vs prompt vs corpus).
- Quality improvement: find common “not found” gaps and update docs.
- Regression detection: identify when a new prompt version changed behavior.
- Security: detect suspicious queries or attempted prompt injection patterns.
Audit logs can become a privacy liability
Prefer logging identifiers and hashes over raw content, and be deliberate about whether you store raw user inputs and source text at all.
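As a minimal sketch of hashing an identifier before it reaches the log, assuming a Node/TypeScript service and a secret held outside the audit store (the names AUDIT_HASH_SECRET and hashActorId are illustrative):

import { createHmac } from "node:crypto";

// Secret "pepper" held in a secrets manager, never alongside the audit store.
// AUDIT_HASH_SECRET is an illustrative name.
const AUDIT_HASH_SECRET = process.env.AUDIT_HASH_SECRET ?? "";

// Keyed hash (HMAC) rather than a plain hash, so identifiers cannot be
// brute-forced offline by someone who only has the logs.
function hashActorId(rawId: string): string {
  return createHmac("sha256", AUDIT_HASH_SECRET).update(rawId).digest("hex");
}

// Usage: store the hash, not the raw id.
// audit.actor.user_id = hashActorId(request.userId);

The keyed hash still lets you correlate requests from the same user across log entries without revealing who that user is to anyone holding only the logs.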
What to log (minimum viable)
At minimum, log:
- request metadata: timestamp, request id, user/tenant id (or hashed), environment.
- question: raw or redacted question text (see the redaction sketch after this list).
- retrieval: query variants, filters used, top-k chunk ids and scores.
- prompt version: identifier for the prompt template and rules.
- model version: model name, parameters (temperature), safety settings.
- answer: final JSON output plus validation status.
- citations: which chunk ids were cited.
Optionally log:
- token usage and latency breakdown,
- reranking decisions,
- “not found” missing_info fields,
- conflict detection outputs.
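Where the raw question cannot be stored, a simple pattern-based redaction pass before logging might look like the sketch below; the regex patterns and the redactQuestion helper are illustrative, not exhaustive:

// Mask obvious identifiers before the question text reaches the audit log.
// These patterns are illustrative; real deployments usually combine
// pattern-based masking with a PII-detection service.
const EMAIL = /[\w.+-]+@[\w-]+\.[\w.-]+/g;
const PHONE = /\+?\d[\d\s().-]{7,}\d/g;

function redactQuestion(text: string): { text: string; redacted: boolean } {
  const masked = text.replace(EMAIL, "[email]").replace(PHONE, "[phone]");
  return { text: masked, redacted: masked !== text };
}

// Example: redactQuestion("Why was jane@example.com denied?")
// -> { text: "Why was [email] denied?", redacted: true }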
Audit log schema (practical)
{
  "request_id": string,
  "timestamp": string,
  "actor": { "user_id": string|null, "tenant_id": string|null },
  "question": { "text": string, "redacted": boolean },
  "retrieval": {
    "filters": object,
    "queries": string[],
    "candidates": [{ "chunk_id": string, "score": number, "doc_version": string|null }]
  },
  "generation": {
    "model": string,
    "model_version": string|null,
    "prompt_version": string,
    "temperature": number|null
  },
  "result": {
    "status": "answered" | "not_found" | "needs_clarification" | "conflict" | "restricted" | "refused" | "error",
    "validated": boolean,
    "answer_json": object|null,
    "cited_chunk_ids": string[]
  }
}
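For reference, here is one way the same schema could be expressed as a TypeScript type, plus a minimal append-only writer. This is a sketch assuming a JSON-lines file sink ("audit.log"); a production system would write to a dedicated, access-controlled audit store.

import { appendFileSync } from "node:fs";

type ResultStatus =
  | "answered" | "not_found" | "needs_clarification"
  | "conflict" | "restricted" | "refused" | "error";

interface AuditRecord {
  request_id: string;
  timestamp: string;
  actor: { user_id: string | null; tenant_id: string | null };
  question: { text: string; redacted: boolean };
  retrieval: {
    filters: Record<string, unknown>;
    queries: string[];
    candidates: { chunk_id: string; score: number; doc_version: string | null }[];
  };
  generation: {
    model: string;
    model_version: string | null;
    prompt_version: string;
    temperature: number | null;
  };
  result: {
    status: ResultStatus;
    validated: boolean;
    answer_json: Record<string, unknown> | null;
    cited_chunk_ids: string[];
  };
}

// Append one record per request as a single JSON line.
// A real deployment would send this to an access-controlled audit store instead.
function writeAuditRecord(record: AuditRecord, path = "audit.log"): void {
  appendFileSync(path, JSON.stringify(record) + "\n");
}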
Privacy and retention considerations
Decide:
- What’s logged: raw text vs redacted vs hashed identifiers.
- Where it’s stored: secure store with access control and audit.
- How long it’s retained: short by default; longer only with policy.
- Deletion: ability to delete logs when required.
For sensitive corpora, consider storing:
- chunk ids and doc versions only,
- not the raw chunk text, which stays in the document store under stricter access controls (the sketch below resolves ids back to text at review time).
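When only chunk ids and doc versions are logged, a reviewer resolves them back to text at investigation time through the document store's own access controls. A sketch, using the AuditRecord type from the previous section and a hypothetical fetchChunk interface:

// Hypothetical interface to the document store, which enforces its own
// access control and records who looked at which chunk and why.
interface ChunkStore {
  fetchChunk(chunkId: string, docVersion: string | null, reason: string): Promise<string>;
}

// Resolve the cited chunks of an audit record to text for a review session.
// The raw text never lives in the audit log itself.
async function resolveCitedChunks(
  record: AuditRecord,
  store: ChunkStore,
  reason: string
): Promise<Map<string, string>> {
  const versions = new Map(
    record.retrieval.candidates.map((c) => [c.chunk_id, c.doc_version])
  );
  const resolved = new Map<string, string>();
  for (const id of record.result.cited_chunk_ids) {
    resolved.set(id, await store.fetchChunk(id, versions.get(id) ?? null, reason));
  }
  return resolved;
}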
Debug workflow: “why did it answer that?”
When an answer is wrong or suspicious, debug from logs:
- Check retrieval: were the right chunks retrieved? were filters correct?
- Check context packing: were the best chunks included in the prompt?
- Check citations: do citations match claims? are quotes accurate?
- Check versions: was the corpus stale? did doc versions change?
- Check prompt/model changes: did a new version alter behavior?
This workflow prevents random prompt tweaking and forces root-cause fixes.
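Parts of this triage can be automated against the audit record itself. The sketch below assumes the AuditRecord type from earlier; the score threshold is illustrative, and the flags are hints about which layer to inspect next, not verdicts:

// First-pass triage of a suspicious answer from its audit record.
function triage(record: AuditRecord): string[] {
  const flags: string[] = [];
  const candidateIds = new Set(record.retrieval.candidates.map((c) => c.chunk_id));

  // Citations that were never retrieved suggest hallucinated or stale citations.
  for (const id of record.result.cited_chunk_ids) {
    if (!candidateIds.has(id)) flags.push(`cited chunk ${id} not among retrieval candidates`);
  }

  // An "answered" status with no citations should not have passed validation.
  if (record.result.status === "answered" && record.result.cited_chunk_ids.length === 0) {
    flags.push("answered without any cited chunks");
  }

  // Uniformly weak retrieval scores point at retrieval or corpus gaps, not the prompt.
  const scores = record.retrieval.candidates.map((c) => c.score);
  if (scores.length > 0 && Math.max(...scores) < 0.3) {
    flags.push("all retrieval scores below 0.3 (illustrative threshold)");
  }

  // Always record which versions produced the behavior.
  flags.push(`prompt_version=${record.generation.prompt_version}, model=${record.generation.model}`);
  return flags;
}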