# How to diagnose RAG failures from traces
If a RAG system fails in production, the first question to ask is "what broke in this trace?". Until you can answer that, no scorer or dashboard will help you.
In practice, weak RAG systems usually fail in ordinary, inspectable ways. Retrieval misses the governing document. Chunking splits the clause that matters. The model answers from partial evidence and smooths over the gap with fluent prose.
This post is a debugging sequence to work out what's going wrong. It starts with one trace, classifies the failure precisely, and only then turns that diagnosis into a metric worth automating.
## About this post
If you want the short version, jump to Failure classes, From traces to metrics, and Default workflow.
The core point is simple: evaluate relationships, not outputs in isolation.
For each request, inspect:
- The question
- The retrieved evidence
- The answer
Then ask:
- Did retrieval find the right evidence at all?
- Was the evidence sufficient to answer fully?
- Did the answer stay grounded in that evidence?
- Did the answer resolve the actual user need?
- Should the system have answered at all?
Once you know which relationship broke, the next engineering move is usually obvious.
## Start with one trace
The fastest way to make this concrete is to inspect a real failure end to end.
Suppose an internal HR assistant gets this question:
Can part-time employees combine paid parental leave with annual leave, and if so, under what conditions?
Now look at the top retrieval results from a naive vector search:
```json
[
  {
    "rank": 1,
    "chunk_id": "handbook-112",
    "source": "employee-handbook.pdf",
    "score": 0.842,
    "text": "Annual leave can be taken in blocks or as single days with manager approval. The standard approval process applies to all permanent employees."
  },
  {
    "rank": 2,
    "chunk_id": "benefits-041",
    "source": "benefits-faq.md",
    "score": 0.801,
    "text": "The company supports new parents with paid parental leave. Speak to HR if your circumstances are complex."
  },
  {
    "rank": 3,
    "chunk_id": "policy-287",
    "source": "parental-leave-policy.pdf",
    "score": 0.784,
    "text": "... employees may take parental leave in multiple blocks subject to operational approval. Requests must be submitted through HRIS ..."
  }
]
```
And here is the answer the model produces:
Yes. Part-time employees can combine 12 weeks of paid parental leave with annual leave, subject to manager approval.
This looks plausible at a glance. That is exactly why trace review matters.
The trace exposes three concrete faults:
- Retrieval never surfaced the clause that defines the paid entitlement for part-time employees.
- The parental leave chunk is incomplete and does not establish whether annual leave can be combined with paid parental leave.
- The answer invents "12 weeks", even though that number never appears in the retrieved evidence.
At that point the problem stops being "RAG quality is low" and becomes a bounded pipeline failure.
## Failure classes
These six failure classes map cleanly to engineering fixes:
| Failure mode | What broke | Typical fix |
|---|---|---|
| Failed recall | The right document never enters the candidate set | Hybrid search, metadata filters, query rewriting, better indexing |
| Failed sufficiency | Retrieved chunks are relevant but incomplete | Better chunk boundaries, overlap tuning, follow-up retrieval |
| Failed grounding | The answer contains unsupported claims | Tighter prompts, structured generation, grounded judges |
| Failed response adequacy | The evidence is present but the answer misses the actual question | Better answer planning, explicit response templates |
| Unanswerable question | The corpus cannot support a confident answer | Refusal policy, escalation path, answerability checks |
| Broken citations | Citations exist but do not support the claims they annotate | Span-level citation checks, stricter source mapping |
The value of this taxonomy is operational. Each class points to a different part of the stack. In practice, a single trace may contain more than one failure mode. The taxonomy is a debugging aid - production failures don't generally arrive one at a time. Difficult cases still need domain judgement, especially when the dispute is about interpretation rather than retrieval.
## Retrieval first
The first question is blunt: did retrieval bring back the evidence required to answer the question?
In the HR trace, the answer is no. The chunks are on topic, but "on topic" is not the same as "sufficient to answer". That distinction matters more than most teams admit.
In policy-heavy systems, the relevant unit is often the document or section, not the individual chunk. A highly ranked fragment can still be insufficient if the governing exception sits two pages later or in the next table.
For first-pass retrieval, Recall@k is usually the first metric that matters. You want to know whether the governing source entered the candidate set at all. Once recall is acceptable, ranking quality matters more and Precision@k, MRR, and NDCG@k become useful.
Recall@k: The proportion of queries where at least one truly relevant result appears in the top k. Use this first to check whether retrieval is even surfacing the governing source.
Precision@k: The proportion of the top k results that are relevant. This tells you how much noise you are feeding into generation.
MRR: Mean Reciprocal Rank - averages 1 / rank for the first relevant result. It rewards systems that place the first useful source near the top.
NDCG@k: Normalised Discounted Cumulative Gain at k - scores ranking quality when relevance is graded, giving more credit to highly relevant items near the top.
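These metrics are simple enough to compute directly. A minimal sketch using chunk IDs as relevance labels; the `policy-290` ID below is a hypothetical chunk holding the governing clause, not something from the trace:

```python
import math

def recall_at_k(relevant: set[str], ranked: list[str], k: int) -> float:
    """1.0 if any truly relevant ID appears in the top k, else 0.0 (per query)."""
    return 1.0 if relevant & set(ranked[:k]) else 0.0

def precision_at_k(relevant: set[str], ranked: list[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    return sum(1 for c in ranked[:k] if c in relevant) / k

def reciprocal_rank(relevant: set[str], ranked: list[str]) -> float:
    """1 / rank of the first relevant result; 0.0 if nothing relevant was retrieved."""
    for i, c in enumerate(ranked, start=1):
        if c in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(gains: dict[str, float], ranked: list[str], k: int) -> float:
    """NDCG@k for graded relevance, with log2 position discounting."""
    dcg = sum(gains.get(c, 0.0) / math.log2(i + 1)
              for i, c in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

# For the HR trace: the governing clause never entered the top 3.
ranked = ["handbook-112", "benefits-041", "policy-287"]
print(recall_at_k({"policy-290"}, ranked, 3))  # 0.0 - a failed-recall trace
```

Averaging these per-query values across a labelled query set gives you the aggregate numbers worth putting on a dashboard.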
If retrieval is weak, generation metrics are mostly decorative. You cannot prompt your way out of missing evidence.
Typical retrieval fixes are not glamorous:
- Hybrid search instead of vectors alone
- Metadata filters for policy type, version, region, or employee class
- Better chunk boundaries around tables, clauses, and exceptions
- Query rewriting for short or ambiguous questions
- Reranking that prefers primary documents over generic summaries
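One common way to implement the hybrid-search fix is reciprocal rank fusion (RRF), which merges a vector ranking and a keyword ranking without having to calibrate their score scales. A sketch, where `policy-290` is a hypothetical governing chunk that the keyword side surfaces:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score each ID by the sum of 1/(k + rank) over rankings."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["handbook-112", "benefits-041", "policy-287"]
keyword_hits = ["policy-290", "policy-287", "handbook-112"]  # BM25 finds the exact term
fused = rrf_fuse([vector_hits, keyword_hits])
# the governing clause now enters the fused candidate set
```

Chunks ranked by both retrievers float to the top, and a chunk found by only one retriever still survives into the candidate set, which is exactly what a failed-recall trace needs.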
## Groundedness next
Once retrieval is reasonable, check whether the answer stayed inside the evidence it was given.
This is a groundedness question, not a correctness vibe check. In a source-grounded system, a claim is unsupported if the retrieved evidence does not back it. It does not matter whether the model guessed correctly from pretraining.
In the example trace, 12 weeks is an unsupported claim. That is the failure.
This is the smallest structured output you should use to evaluate it:
```python
from pydantic import BaseModel

class GroundingEval(BaseModel):
    is_supported: bool
    supporting_quotes: list[str]
    unsupported_claims: list[str]
```
Minimal judge prompt:
```text
You are evaluating whether an answer is supported by retrieved evidence.
Given the user question, the retrieved chunks, and the answer, return JSON matching GroundingEval.
Only mark a claim as supported if a quoted span from the retrieved chunks directly supports it.
List every important unsupported claim explicitly.
```
Example judge output:
```json
{
  "is_supported": false,
  "supporting_quotes": [
    "Annual leave can be taken in blocks or as single days with manager approval.",
    "Employees may take parental leave in multiple blocks subject to operational approval."
  ],
  "unsupported_claims": [
    "Part-time employees can combine paid parental leave with annual leave.",
    "Part-time employees receive 12 weeks of paid parental leave."
  ]
}
```
This is far more useful than a generic faithfulness score because it tells you what failed and why.
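Wiring the judge up is mostly plumbing. A self-contained sketch, assuming pydantic v2; `call_llm` is a placeholder you would swap for your real model client, and the stub here just returns a canned verdict so the flow runs end to end:

```python
import json
from pydantic import BaseModel

class GroundingEval(BaseModel):
    is_supported: bool
    supporting_quotes: list[str]
    unsupported_claims: list[str]

JUDGE_PROMPT = (
    "You are evaluating whether an answer is supported by retrieved evidence. "
    "Return JSON matching GroundingEval. Only mark a claim as supported if a "
    "quoted span from the retrieved chunks directly supports it."
)

def call_llm(prompt: str) -> str:
    # Placeholder: replace with a real model call. Canned verdict for the sketch.
    return json.dumps({
        "is_supported": False,
        "supporting_quotes": [],
        "unsupported_claims": ["Part-time employees receive 12 weeks of paid parental leave."],
    })

def judge_grounding(question: str, chunks: list[str], answer: str) -> GroundingEval:
    prompt = f"{JUDGE_PROMPT}\n\nQuestion: {question}\nChunks: {chunks}\nAnswer: {answer}"
    # Schema validation rejects malformed judge output instead of logging junk scores.
    return GroundingEval.model_validate_json(call_llm(prompt))
```

Validating against the schema at the boundary means a misbehaving judge fails loudly in your eval pipeline rather than silently polluting a dashboard.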
## Sufficiency matters
A lot of teams collapse relevance and sufficiency into one fuzzy concept. That loses an important distinction.
Recall fails when the right source never enters the candidate set. Sufficiency fails when it does enter, but arrives incomplete for the task. Evidence can be relevant and still be insufficient.
In the HR trace, we have a partial annual leave rule, a vague FAQ, and a clipped policy chunk. That is related material. It is not enough to support the answer.
This is the structured output for that check:
```python
from pydantic import BaseModel

class SufficiencyEval(BaseModel):
    has_enough_evidence: bool
    missing_facts: list[str]
    follow_up_queries: list[str]
```
Example output:
```json
{
  "has_enough_evidence": false,
  "missing_facts": [
    "The paid parental leave entitlement for part-time employees.",
    "Whether annual leave can be combined with paid parental leave.",
    "Any approval or sequencing constraints for combining the two leave types."
  ],
  "follow_up_queries": [
    "part-time parental leave entitlement site:intranet.company policy pdf",
    "combine annual leave with parental leave site:intranet.company policy pdf"
  ]
}
```
When this check fails repeatedly, model changes are usually the wrong first response. Query rewriting, chunk overlap, and source selection are higher leverage.
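When `has_enough_evidence` comes back false at runtime, the `follow_up_queries` can drive a second retrieval pass before anything reaches the model. A sketch with a toy keyword retriever standing in for your real search call; the corpus and chunk IDs are hypothetical:

```python
def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Toy word-overlap retriever, a stand-in for the real search backend."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(text.lower().split())), cid)
              for cid, text in corpus.items()]
    return [cid for score, cid in sorted(scored, reverse=True)[:k] if score > 0]

def expand_evidence(chunk_ids: list[str], follow_up_queries: list[str],
                    corpus: dict[str, str]) -> list[str]:
    """Second retrieval pass driven by the sufficiency judge's follow-up queries."""
    seen = list(chunk_ids)
    for query in follow_up_queries:
        for cid in retrieve(query, corpus):
            if cid not in seen:
                seen.append(cid)
    return seen
```

The point is the control flow, not the retriever: the sufficiency verdict becomes an input to retrieval rather than a score that sits in a log.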
## Response adequacy failures
An answer can be grounded and still fail.
This happens when the system retrieves useful evidence, stays faithful to it, and then answers a nearby question instead of the actual one.
Here is a simple example:
Question:
"Can contractors expense home office equipment, and if so, is manager approval required?"
```json
[
  {
    "chunk_id": "expense-014",
    "source": "contractor-expense-policy.pdf",
    "text": "Contractors may expense approved home office equipment up to $500 per quarter."
  },
  {
    "chunk_id": "expense-018",
    "source": "contractor-expense-policy.pdf",
    "text": "All contractor equipment expenses require prior manager approval and a receipt."
  }
]
```
Answer:
"The contractor expense policy covers home office equipment and sets a $500 quarterly limit."
That answer is grounded. It is also inadequate, because it dodges the question about manager approval.
In practice, use one blunt check:
After reading the user’s question and the response, would a reviewer say the response fully answered the question?
If not, the system has a response adequacy failure.
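That blunt check is easy to make structured in the same style as the other judges. One possible shape, assuming pydantic; the field names are illustrative, not a standard:

```python
from pydantic import BaseModel

class AdequacyEval(BaseModel):
    fully_answered: bool
    unaddressed_parts: list[str]

# The contractor trace above would come back roughly like this:
verdict = AdequacyEval(
    fully_answered=False,
    unaddressed_parts=["Whether manager approval is required for the expense."],
)
```

Listing the unaddressed parts explicitly is what makes the eval actionable: it tells you which half of a multi-part question the system keeps dropping.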
## Refusal quality
Sometimes the correct answer is no answer.
If the evidence is insufficient, the system should refuse to answer confidently. In high-trust systems that is not a broken outcome. It is the correct outcome.
Model that explicitly:
```python
from pydantic import BaseModel

class AnswerabilityEval(BaseModel):
    should_answer: bool
    reason: str
    safest_next_step: str
```
The hard part is not the schema. The hard part is the UX.
Bad refusal:
I do not know.
Better refusal:
I cannot find a policy that confirms whether part-time employees can combine paid parental leave with annual leave. I found the general annual leave policy and part of the parental leave policy, but neither answers that specific point. If you need a definitive answer, HR should confirm it. I can link the relevant policy documents if useful.
Good refusal UX does three things:
- It says what is missing.
- It names the closest relevant evidence that was found.
- It routes the user to the next useful action.
That is how you fail without bluffing.
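Turning an answerability verdict into that kind of refusal is mostly a formatting job. A sketch of a hypothetical helper that enforces the three properties; the function name and message template are illustrative:

```python
def compose_refusal(missing_point: str, closest_sources: list[str], next_step: str) -> str:
    """Say what's missing, name the nearest evidence found, route the user onward."""
    found = ", ".join(closest_sources) if closest_sources else "no closely related policy"
    return (
        f"I cannot find a policy that confirms {missing_point}. "
        f"The closest material I found: {found}. "
        f"{next_step}"
    )

msg = compose_refusal(
    "whether part-time employees can combine paid parental leave with annual leave",
    ["employee-handbook.pdf", "parental-leave-policy.pdf"],
    "If you need a definitive answer, HR should confirm it.",
)
```

Because the message is assembled from the eval's fields rather than free-generated, a refusal can never silently drop the "what's missing" or "next step" parts.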
## Citation checks
If your system emits citations, evaluate the citations themselves.
A citation is only useful if it supports the claim it is attached to. It is easy to build a system that looks trustworthy because every sentence has a source badge beside it, while the underlying evidence chain is weak.
In the HR trace, citing the annual leave clause beside the claim about 12 weeks of paid parental leave would still be a citation failure. The badge is present. The support is not.
This is the structured output to use:
```python
from pydantic import BaseModel

class CitationCheck(BaseModel):
    sentence: str
    chunk_ids: list[str]
    supporting_quotes: list[str]
    is_supported: bool

class CitationEval(BaseModel):
    citations: list[CitationCheck]
```
Example output:
```json
{
  "citations": [
    {
      "sentence": "Part-time employees can combine 12 weeks of paid parental leave with annual leave.",
      "chunk_ids": ["handbook-112", "policy-287"],
      "supporting_quotes": [],
      "is_supported": false
    }
  ]
}
```
That gives you something much more useful than "citations were present". It lets you verify whether each important sentence mapped to the exact evidence that justifies it.
Sentence-level checks are usually a practical approximation, not a perfect standard. Some claims are distributed across multiple sentences or require multi-hop support across more than one source.
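The `supporting_quotes` field also enables a cheap deterministic cross-check on the judge itself: a quote that does not appear verbatim in the cited chunks is fabricated evidence. A naive exact-substring sketch; real text would need whitespace and punctuation normalisation:

```python
def quotes_check_out(quotes: list[str], chunks: dict[str, str],
                     chunk_ids: list[str]) -> bool:
    """True only if there is at least one quote and each one appears in a cited chunk."""
    cited = [chunks[cid] for cid in chunk_ids if cid in chunks]
    return bool(quotes) and all(any(q in text for text in cited) for q in quotes)

chunks = {
    "handbook-112": "Annual leave can be taken in blocks or as single days "
                    "with manager approval.",
}
# The example citation above has no supporting quotes, so it fails this check too.
```

Running this after the LLM judge catches the failure mode where the judge marks a claim supported but quotes text that exists nowhere in the retrieved evidence.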
## Runtime versus offline
Do not assume every eval belongs in the request path.
For most products, these checks should run asynchronously on logged traces. The user gets the response, the trace is stored, and eval workers score groundedness, sufficiency, citations, and refusal quality afterwards. That is the right default for most support, search, and internal knowledge tools because it avoids turning one user request into three extra model calls.
There are exceptions. In medical, legal, financial, or other high-trust flows, a synchronous sufficiency or answerability gate can be worth the latency because the cost of an unsupported answer is higher than the cost of waiting longer.
The distinction is simple:
- Runtime checks are guardrails
- Asynchronous checks are diagnostics and regression metrics
If you blur those together, you either ship a slow product or a blind one.
## From traces to metrics
Trace review and dashboards are not competing ideas. They are stages in the same workflow.
You also need a real place to inspect traces. In production that usually means an observability stack rather than terminal logs. LangSmith, Langfuse and Braintrust are obvious managed options. Phoenix is a good open-source choice for local iteration. If your broader telemetry already lives in Datadog, keeping LLM traces beside application signals is often good enough. The tool matters less than the data shape.
For each request, we want:
- One trace per run
- Spans for retrieval and model calls
- Retrieved chunk IDs
- Prompt inputs
- Outputs
- Latency
- Eval results attached to the same run
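Whatever tool you pick, the per-run record can be as plain as a small dataclass; the field names here are illustrative, not any vendor's schema:

```python
from dataclasses import dataclass, field

@dataclass
class TraceRun:
    run_id: str
    question: str
    retrieved_chunk_ids: list[str]
    prompt: str
    output: str
    latency_ms: float
    # Eval results attached to the same run, e.g. {"grounding": False}
    evals: dict[str, bool] = field(default_factory=dict)
```

If your logging can populate this shape, every eval in this post has somewhere to write its verdict, and regressions become queries over stored runs.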
This is the lifecycle to use:
- Review 30 to 50 traces manually.
- Cluster the failures into a small set of repeatable modes.
- Write targeted structured evals for those exact modes.
- Validate those evals against a human-labelled set.
- Put the validated evals on a dashboard and watch for regressions across retrieval, prompt, or model changes.
If you skip straight to the dashboard, you usually end up with generic scores that correlate weakly with the thing you care about. If you start with traces, the automated metric has a job to do.
## Evaluate the judge
LLM-as-a-judge is useful. It is not ground truth.
Judges are useful because they scale a rubric, not because they replace human judgement. Every eval encodes an operational preference about what you want the system to do.
Before you trust a new judge on thousands of traces, test it against a human baseline:
- Label 100 traces by hand.
- Compare the judge output with the human labels.
- Inspect where it disagrees and why.
- Tighten the rubric, schema, or examples until the judge is reliable enough for the job.
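The comparison step in that loop is a plain agreement computation. A sketch over boolean labels that also returns the disagreeing trace indices, since those are the ones worth reading by hand:

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> tuple[float, list[int]]:
    """Agreement rate plus the trace indices where judge and human disagree."""
    assert len(human) == len(judge), "labels must align one-to-one with traces"
    disagreements = [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]
    return 1 - len(disagreements) / len(human), disagreements

rate, to_inspect = judge_agreement(
    human=[True, True, False, False],
    judge=[True, False, False, False],
)
# rate == 0.75; trace index 1 disagrees and needs a look
```

For imbalanced label sets, raw agreement flatters the judge; checking false-positive and false-negative counts separately (or Cohen's kappa) is a sensible next step.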
If your grounding judge misses obvious unsupported claims, do not put it on a dashboard and call it science. Fix the judge first.
## Default workflow
The default flow is straightforward:
- Start with one trace, not one dashboard.
- Separate recall, sufficiency, grounding, response adequacy, answerability, and citation failures.
- Fix retrieval before generation whenever the evidence is missing or incomplete.
- Treat unsupported claims as the primary unit of failure.
- Run most evals asynchronously, and reserve runtime gates for higher-trust flows.
- Validate every judge against human labels before using it as a regression metric.
That sequence is what stops RAG evaluation turning into metric collection without diagnosis.