# How to diagnose RAG failures from traces
If a RAG system fails in production, the first question to ask is "what broke in this trace?". Until you can answer that, no scorer or dashboard will help you.
In practice, weak RAG systems usually fail in ordinary, inspectable ways. Retrieval misses the governing document. Chunking splits the clause that matters. The model answers from partial evidence and smooths over the gap with fluent prose.
This post is a debugging sequence to work out what's going wrong. It starts with one trace, classifies the failure precisely, and only then turns that diagnosis into a metric worth automating.
## About this post
If you want the short version, jump to Failure classes, From traces to metrics, and Default workflow.
The core point is simple: evaluate relationships, not outputs in isolation.
For each request, inspect:
- The question
- The retrieved evidence
- The answer
Then ask:
- Did retrieval find the right evidence at all?
- Was the evidence sufficient to answer fully?
- Did the answer stay grounded in that evidence?
- Did the answer resolve the actual user need?
- Should the system have answered at all?
Once you know which relationship broke, the next engineering move is usually obvious.
## Start with one trace
The fastest way to make this concrete is to inspect a real failure end to end.
Suppose an internal HR assistant gets this question:
Can part-time employees combine paid parental leave with annual leave, and if so, under what conditions?
Now look at the top retrieval results from a naive vector search:
```json
[
  {
    "rank": 1,
    "chunk_id": "handbook-112",
    "source": "employee-handbook.pdf",
    "score": 0.842,
    "text": "Annual leave can be taken in blocks or as single days with manager approval. The standard approval process applies to all permanent employees."
  },
  {
    "rank": 2,
    "chunk_id": "benefits-041",
    "source": "benefits-faq.md",
    "score": 0.801,
    "text": "The company supports new parents with paid parental leave. Speak to HR if your circumstances are complex."
  },
  {
    "rank": 3,
    "chunk_id": "policy-287",
    "source": "parental-leave-policy.pdf",
    "score": 0.784,
    "text": "... employees may take parental leave in multiple blocks subject to operational approval. Requests must be submitted through HRIS ..."
  }
]
```
And here is the answer the model produces:
Yes. Part-time employees can combine 12 weeks of paid parental leave with annual leave, subject to manager approval.
This looks plausible at a glance. That is exactly why trace review matters.
The trace exposes three concrete faults:
- Retrieval never surfaced the clause that defines the paid entitlement for part-time employees.
- The parental leave chunk is incomplete and does not establish whether annual leave can be combined with paid parental leave.
- The answer invents "12 weeks", even though that number never appears in the retrieved evidence.
At that point the problem stops being "RAG quality is low" and becomes a bounded pipeline failure.
## Failure classes
These six failure classes map cleanly to engineering fixes:
| Failure mode | What broke | Typical fix |
|---|---|---|
| Failed recall | The right document never enters the candidate set | Hybrid search, metadata filters, query rewriting, better indexing |
| Failed sufficiency | Retrieved chunks are relevant but incomplete | Better chunk boundaries, overlap tuning, follow-up retrieval |
| Failed grounding | The answer contains unsupported claims | Tighter prompts, structured generation, grounded judges |
| Failed response adequacy | The evidence is present but the answer misses the actual question | Better answer planning, explicit response templates |
| Unanswerable question | The corpus cannot support a confident answer | Refusal policy, escalation path, answerability checks |
| Broken citations | Citations exist but do not support the claims they annotate | Span-level citation checks, stricter source mapping |
The value of this taxonomy is operational. Each class points to a different part of the stack. In practice, a single trace may contain more than one failure mode. The taxonomy is a debugging aid - production failures don't generally arrive one at a time. Difficult cases still need domain judgement, especially when the dispute is about interpretation rather than retrieval.
## Retrieval first
The first question is blunt: did retrieval bring back the evidence required to answer the question?
In the HR trace, the answer is no. The chunks are on topic, but "on topic" is not the same as "sufficient to answer". That distinction matters more than most teams admit.
In policy-heavy systems, the relevant unit is often the document or section, not the individual chunk. A highly ranked fragment can still be insufficient if the governing exception sits two pages later or in the next table.
For first-pass retrieval, Recall@k is usually the first metric that matters. You want to know whether the governing source entered the candidate set at all. Once recall is acceptable, ranking quality matters more and Precision@k, MRR, and NDCG@k become useful.
Recall@k: The proportion of queries where at least one truly relevant result appears in the top k. Use this first to check whether retrieval is even surfacing the governing source.
Precision@k: The proportion of the top k results that are relevant. This tells you how much noise you are feeding into generation.
MRR: Mean Reciprocal Rank - averages 1 / rank for the first relevant result. It rewards systems that place the first useful source near the top.
NDCG@k: Normalised Discounted Cumulative Gain at k - scores ranking quality when relevance is graded, giving more credit to highly relevant items near the top.
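These metrics are simple enough to compute directly. A minimal sketch using chunk IDs as relevance labels; the `policy-290` ID below is a hypothetical chunk holding the governing clause, not something from the trace:

```python
import math

def recall_at_k(relevant: set[str], ranked: list[str], k: int) -> float:
    """1.0 if any truly relevant ID appears in the top k, else 0.0 (per query)."""
    return 1.0 if relevant & set(ranked[:k]) else 0.0

def precision_at_k(relevant: set[str], ranked: list[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    return sum(1 for c in ranked[:k] if c in relevant) / k

def reciprocal_rank(relevant: set[str], ranked: list[str]) -> float:
    """1 / rank of the first relevant result; 0.0 if nothing relevant was retrieved."""
    for i, c in enumerate(ranked, start=1):
        if c in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(gains: dict[str, float], ranked: list[str], k: int) -> float:
    """NDCG@k for graded relevance, with log2 position discounting."""
    dcg = sum(gains.get(c, 0.0) / math.log2(i + 1)
              for i, c in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

# For the HR trace: the governing clause never entered the top 3.
ranked = ["handbook-112", "benefits-041", "policy-287"]
print(recall_at_k({"policy-290"}, ranked, 3))  # 0.0 - a failed-recall trace
```

Averaging these per-query values across a labelled query set gives you the aggregate numbers worth putting on a dashboard.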
If retrieval is weak, generation metrics are mostly decorative. You cannot prompt your way out of missing evidence.
Typical retrieval fixes are not glamorous:
- Hybrid search instead of vectors alone
- Metadata filters for policy type, version, region, or employee class
- Better chunk boundaries around tables, clauses, and exceptions
- Query rewriting for short or ambiguous questions
- Reranking that prefers primary documents over generic summaries
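One common way to implement the hybrid-search fix is reciprocal rank fusion (RRF), which merges a vector ranking and a keyword ranking without having to calibrate their score scales. A sketch, where `policy-290` is a hypothetical governing chunk that the keyword side surfaces:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score each ID by the sum of 1/(k + rank) over rankings."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["handbook-112", "benefits-041", "policy-287"]
keyword_hits = ["policy-290", "policy-287", "handbook-112"]  # BM25 finds the exact term
fused = rrf_fuse([vector_hits, keyword_hits])
# the governing clause now enters the fused candidate set
```

Chunks ranked by both retrievers float to the top, and a chunk found by only one retriever still survives into the candidate set, which is exactly what a failed-recall trace needs.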
## Groundedness next
Once retrieval is reasonable, check whether the answer stayed inside the evidence it was given.
This is a groundedness question, not a correctness vibe check. In a source-grounded system, a claim is unsupported if the retrieved evidence does not back it. It does not matter whether the model guessed correctly from pretraining.
In the example trace, 12 weeks is an unsupported claim. That is the failure.
This is the smallest structured output you should use to evaluate it:
```python
from pydantic import BaseModel

class GroundingEval(BaseModel):
    is_supported: bool
    supporting_quotes: list[str]
    unsupported_claims: list[str]
```
Minimal judge prompt:
```text
You are evaluating whether an answer is supported by retrieved evidence.
Given the user question, the retrieved chunks, and the answer, return JSON matching GroundingEval.
Only mark a claim as supported if a quoted span from the retrieved chunks directly supports it.
List every important unsupported claim explicitly.
```
Example judge output:
```json
{
  "is_supported": false,
  "supporting_quotes": [
    "Annual leave can be taken in blocks or as single days with manager approval.",
    "Employees may take parental leave in multiple blocks subject to operational approval."
  ],
  "unsupported_claims": [
    "Part-time employees can combine paid parental leave with annual leave.",
    "Part-time employees receive 12 weeks of paid parental leave."
  ]
}
```
This is far more useful than a generic faithfulness score because it tells you what failed and why.
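Wiring the judge up is mostly plumbing. A self-contained sketch, assuming pydantic v2; `call_llm` is a placeholder you would swap for your real model client, and the stub here just returns a canned verdict so the flow runs end to end:

```python
import json
from pydantic import BaseModel

class GroundingEval(BaseModel):
    is_supported: bool
    supporting_quotes: list[str]
    unsupported_claims: list[str]

JUDGE_PROMPT = (
    "You are evaluating whether an answer is supported by retrieved evidence. "
    "Return JSON matching GroundingEval. Only mark a claim as supported if a "
    "quoted span from the retrieved chunks directly supports it."
)

def call_llm(prompt: str) -> str:
    # Placeholder: replace with a real model call. Canned verdict for the sketch.
    return json.dumps({
        "is_supported": False,
        "supporting_quotes": [],
        "unsupported_claims": ["Part-time employees receive 12 weeks of paid parental leave."],
    })

def judge_grounding(question: str, chunks: list[str], answer: str) -> GroundingEval:
    prompt = f"{JUDGE_PROMPT}\n\nQuestion: {question}\nChunks: {chunks}\nAnswer: {answer}"
    # Schema validation rejects malformed judge output instead of logging junk scores.
    return GroundingEval.model_validate_json(call_llm(prompt))
```

Validating against the schema at the boundary means a misbehaving judge fails loudly in your eval pipeline rather than silently polluting a dashboard.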
## Sufficiency matters
A lot of teams collapse relevance and sufficiency into one fuzzy concept. That loses an important distinction.
Recall fails when the right source never enters the candidate set. Sufficiency fails when it does enter, but arrives incomplete for the task. Evidence can be relevant and still be insufficient.
In the HR trace, we have a partial annual leave rule, a vague FAQ, and a clipped policy chunk. That is related material. It is not enough to support the answer.
This is the structured output for that check:
```python
from pydantic import BaseModel

class SufficiencyEval(BaseModel):
    has_enough_evidence: bool
    missing_facts: list[str]
    follow_up_queries: list[str]
```
Example output:
```json
{
  "has_enough_evidence": false,
  "missing_facts": [
    "The paid parental leave entitlement for part-time employees.",
    "Whether annual leave can be combined with paid parental leave.",
    "Any approval or sequencing constraints for combining the two leave types."
  ],
  "follow_up_queries": [
    "part-time parental leave entitlement site:intranet.company policy pdf",
    "combine annual leave with parental leave site:intranet.company policy pdf"
  ]
}
```
When this check fails repeatedly, model changes are usually the wrong first response. Query rewriting, chunk overlap, and source selection are higher leverage.
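When `has_enough_evidence` comes back false at runtime, the `follow_up_queries` can drive a second retrieval pass before anything reaches the model. A sketch with a toy keyword retriever standing in for your real search call; the corpus and chunk IDs are hypothetical:

```python
def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Toy word-overlap retriever, a stand-in for the real search backend."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(text.lower().split())), cid)
              for cid, text in corpus.items()]
    return [cid for score, cid in sorted(scored, reverse=True)[:k] if score > 0]

def expand_evidence(chunk_ids: list[str], follow_up_queries: list[str],
                    corpus: dict[str, str]) -> list[str]:
    """Second retrieval pass driven by the sufficiency judge's follow-up queries."""
    seen = list(chunk_ids)
    for query in follow_up_queries:
        for cid in retrieve(query, corpus):
            if cid not in seen:
                seen.append(cid)
    return seen
```

The point is the control flow, not the retriever: the sufficiency verdict becomes an input to retrieval rather than a score that sits in a log.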
## Response adequacy failures
An answer can be grounded and still fail.
This happens when the system retrieves useful evidence, stays faithful to it, and then answers a nearby question instead of the actual one.
Here is a simple example:
Question:
"Can contractors expense home office equipment, and if so, is manager approval required?"
```json
[
  {
    "chunk_id": "expense-014",
    "source": "contractor-expense-policy.pdf",
    "text": "Contractors may expense approved home office equipment up to $500 per quarter."
  },
  {
    "chunk_id": "expense-018",
    "source": "contractor-expense-policy.pdf",
    "text": "All contractor equipment expenses require prior manager approval and a receipt."
  }
]
```
Answer:
"The contractor expense policy covers home office equipment and sets a $500 quarterly limit."
That answer is grounded. It is also inadequate, because it dodges the question about manager approval.
In practice, use one blunt check:
After reading the user’s question and the response, would a reviewer say the response fully answered the question?
If not, the system has a response adequacy failure.
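That blunt check is easy to make structured in the same style as the other judges. One possible shape, assuming pydantic; the field names are illustrative, not a standard:

```python
from pydantic import BaseModel

class AdequacyEval(BaseModel):
    fully_answered: bool
    unaddressed_parts: list[str]

# The contractor trace above would come back roughly like this:
verdict = AdequacyEval(
    fully_answered=False,
    unaddressed_parts=["Whether manager approval is required for the expense."],
)
```

Listing the unaddressed parts explicitly is what makes the eval actionable: it tells you which half of a multi-part question the system keeps dropping.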
## Refusal quality
Sometimes the correct answer is no answer.
If the evidence is insufficient, the system should refuse to answer confidently. In high-trust systems that is not a broken outcome. It is the correct outcome.
Model that explicitly:
```python
from pydantic import BaseModel

class AnswerabilityEval(BaseModel):
    should_answer: bool
    reason: str
    safest_next_step: str
```
The hard part is not the schema. The hard part is the UX.
Bad refusal:
I do not know.
Better refusal:
I cannot find a policy that confirms whether part-time employees can combine paid parental leave with annual leave. I found the general annual leave policy and part of the parental leave policy, but neither answers that specific point. If you need a definitive answer, HR should confirm it. I can link the relevant policy documents if useful.
Good refusal UX does three things:
- It says what is missing.
- It names the closest relevant evidence that was found.
- It routes the user to the next useful action.
That is how you fail without bluffing.
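Turning an answerability verdict into that kind of refusal is mostly a formatting job. A sketch of a hypothetical helper that enforces the three properties; the function name and message template are illustrative:

```python
def compose_refusal(missing_point: str, closest_sources: list[str], next_step: str) -> str:
    """Say what's missing, name the nearest evidence found, route the user onward."""
    found = ", ".join(closest_sources) if closest_sources else "no closely related policy"
    return (
        f"I cannot find a policy that confirms {missing_point}. "
        f"The closest material I found: {found}. "
        f"{next_step}"
    )

msg = compose_refusal(
    "whether part-time employees can combine paid parental leave with annual leave",
    ["employee-handbook.pdf", "parental-leave-policy.pdf"],
    "If you need a definitive answer, HR should confirm it.",
)
```

Because the message is assembled from the eval's fields rather than free-generated, a refusal can never silently drop the "what's missing" or "next step" parts.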
## Citation checks
If your system emits citations, evaluate the citations themselves.
A citation is only useful if it supports the claim it is attached to. It is easy to build a system that looks trustworthy because every sentence has a source badge beside it, while the underlying evidence chain is weak.
In the HR trace, citing the annual leave clause beside the claim about 12 weeks of paid parental leave would still be a citation failure. The badge is present. The support is not.
This is the structured output to use:
```python
from pydantic import BaseModel

class CitationCheck(BaseModel):
    sentence: str
    chunk_ids: list[str]
    supporting_quotes: list[str]
    is_supported: bool

class CitationEval(BaseModel):
    citations: list[CitationCheck]
```
Example output:
```json
{
  "citations": [
    {
      "sentence": "Part-time employees can combine 12 weeks of paid parental leave with annual leave.",
      "chunk_ids": ["handbook-112", "policy-287"],
      "supporting_quotes": [],
      "is_supported": false
    }
  ]
}
```
That gives you something much more useful than "citations were present". It lets you verify whether each important sentence mapped to the exact evidence that justifies it.
Sentence-level checks are usually a practical approximation, not a perfect standard. Some claims are distributed across multiple sentences or require multi-hop support across more than one source.
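The `supporting_quotes` field also enables a cheap deterministic cross-check on the judge itself: a quote that does not appear verbatim in the cited chunks is fabricated evidence. A naive exact-substring sketch; real text would need whitespace and punctuation normalisation:

```python
def quotes_check_out(quotes: list[str], chunks: dict[str, str],
                     chunk_ids: list[str]) -> bool:
    """True only if there is at least one quote and each one appears in a cited chunk."""
    cited = [chunks[cid] for cid in chunk_ids if cid in chunks]
    return bool(quotes) and all(any(q in text for text in cited) for q in quotes)

chunks = {
    "handbook-112": "Annual leave can be taken in blocks or as single days "
                    "with manager approval.",
}
# The example citation above has no supporting quotes, so it fails this check too.
```

Running this after the LLM judge catches the failure mode where the judge marks a claim supported but quotes text that exists nowhere in the retrieved evidence.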
## Runtime versus offline
Do not assume every eval belongs in the request path.
For most products, these checks should run asynchronously on logged traces. The user gets the response, the trace is stored, and eval workers score groundedness, sufficiency, citations, and refusal quality afterwards. That is the right default for most support, search, and internal knowledge tools because it avoids turning one user request into three extra model calls.
There are exceptions. In medical, legal, financial, or other high-trust flows, a synchronous sufficiency or answerability gate can be worth the latency because the cost of an unsupported answer is higher than the cost of waiting longer.
The distinction is simple:
- Runtime checks are guardrails
- Asynchronous checks are diagnostics and regression metrics
If you blur those together, you either ship a slow product or a blind one.
## From traces to metrics
Trace review and dashboards are not competing ideas. They are stages in the same workflow.
You also need a real place to inspect traces. In production that usually means an observability stack rather than terminal logs. LangSmith, Langfuse and Braintrust are obvious managed options. Phoenix is a good open-source choice for local iteration. If your broader telemetry already lives in Datadog, keeping LLM traces beside application signals is often good enough. The tool matters less than the data shape.
For each request, we want:
- One trace per run
- Spans for retrieval and model calls
- Retrieved chunk IDs
- Prompt inputs
- Outputs
- Latency
- Eval results attached to the same run
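Whatever tool you pick, the per-run record can be as plain as a small dataclass; the field names here are illustrative, not any vendor's schema:

```python
from dataclasses import dataclass, field

@dataclass
class TraceRun:
    run_id: str
    question: str
    retrieved_chunk_ids: list[str]
    prompt: str
    output: str
    latency_ms: float
    # Eval results attached to the same run, e.g. {"grounding": False}
    evals: dict[str, bool] = field(default_factory=dict)
```

If your logging can populate this shape, every eval in this post has somewhere to write its verdict, and regressions become queries over stored runs.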
This is the lifecycle to use:
- Review 30 to 50 traces manually.
- Cluster the failures into a small set of repeatable modes.
- Write targeted structured evals for those exact modes.
- Validate those evals against a human-labelled set.
- Put the validated evals on a dashboard and watch for regressions across retrieval, prompt, or model changes.
If you skip straight to the dashboard, you usually end up with generic scores that correlate weakly with the thing you care about. If you start with traces, the automated metric has a job to do.
## Evaluate the judge
LLM-as-a-judge is useful. It is not ground truth.
Judges are useful because they scale a rubric, not because they replace human judgement. Every eval encodes an operational preference about what you want the system to do.
Before you trust a new judge on thousands of traces, test it against a human baseline:
- Label 100 traces by hand.
- Compare the judge output with the human labels.
- Inspect where it disagrees and why.
- Tighten the rubric, schema, or examples until the judge is reliable enough for the job.
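The comparison step in that loop is a plain agreement computation. A sketch over boolean labels that also returns the disagreeing trace indices, since those are the ones worth reading by hand:

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> tuple[float, list[int]]:
    """Agreement rate plus the trace indices where judge and human disagree."""
    assert len(human) == len(judge), "labels must align one-to-one with traces"
    disagreements = [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]
    return 1 - len(disagreements) / len(human), disagreements

rate, to_inspect = judge_agreement(
    human=[True, True, False, False],
    judge=[True, False, False, False],
)
# rate == 0.75; trace index 1 disagrees and needs a look
```

For imbalanced label sets, raw agreement flatters the judge; checking false-positive and false-negative counts separately (or Cohen's kappa) is a sensible next step.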
If your grounding judge misses obvious unsupported claims, do not put it on a dashboard and call it science. Fix the judge first.
## Default workflow
The default flow is straightforward:
- Start with one trace, not one dashboard.
- Separate recall, sufficiency, grounding, response adequacy, answerability, and citation failures.
- Fix retrieval before generation whenever the evidence is missing or incomplete.
- Treat unsupported claims as the primary unit of failure.
- Run most evals asynchronously, and reserve runtime gates for higher-trust flows.
- Validate every judge against human labels before using it as a regression metric.
That sequence is what stops RAG evaluation turning into metric collection without diagnosis.