RAG Evaluation

RAG evaluation measures the quality of retrieval-augmented generation systems across retrieval accuracy, context relevance, answer faithfulness, and end-to-end response quality. It treats retrieval and generation as separate concerns with distinct metrics, then evaluates their combined behavior.

Why RAG needs its own evaluation approach

A RAG system can fail in ways that standard LLM evaluation does not cover. The model might generate a correct-sounding answer from wrong documents (retrieval failure). It might receive the right documents but hallucinate details not in them (faithfulness failure). It might faithfully summarize the retrieved context but miss the point of the question (relevance failure). Evaluating only the final answer hides which component is broken.

RAG evaluation separates these failure modes so you can diagnose and fix them independently.

Retrieval metrics

Recall@K. Of all relevant documents for a query, what fraction appears in the top K retrieval results? Low recall means the model never sees the information it needs. This is the single most important retrieval metric -- if the right documents are not retrieved, generation quality cannot compensate.

Precision@K. Of the K documents returned, what fraction is actually relevant? Low precision means the context is polluted with irrelevant information, which can distract the model and reduce answer quality.

Mean Reciprocal Rank (MRR). The average, across queries, of the reciprocal of the rank at which the first relevant document appears. Higher MRR means relevant context surfaces earlier in the results, where the model is more likely to attend to it.
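The three retrieval metrics above all operate on a ranked list of retrieved document IDs plus a set of known-relevant IDs. A minimal sketch (function names are illustrative, not from any particular library):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top K results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top K results that are actually relevant."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / k

def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant document, or 0 if none is retrieved.
    MRR is this value averaged over a set of queries."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

For example, if the retriever returns `["d3", "d1", "d7"]` and the relevant set is `{"d1", "d2"}`, then recall@3 is 0.5 (one of two relevant docs found), precision@3 is 1/3, and the reciprocal rank is 0.5 (first hit at rank 2).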

Generation metrics

Faithfulness. The percentage of claims in the generated answer that are supported by the retrieved context. This is the primary defense against hallucination in RAG systems. A faithfulness score of 0.95 means 95% of the answer's claims are grounded in what the retriever found.
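The computation reduces to extracting claims from the answer and checking each one against the retrieved context. In practice the grounding check is an LLM judge; the sketch below stubs it with a naive substring match purely to make the scoring logic concrete (all names here are illustrative):

```python
def faithfulness_score(claims, context, is_supported):
    """Fraction of answer claims grounded in the retrieved context.

    is_supported(claim, context) -> bool is the grounding check; in a
    real pipeline this would be an LLM judge, not a string match.
    """
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)

def naive_judge(claim, context):
    """Toy stand-in for an LLM judge: case-insensitive containment."""
    return claim.lower() in context.lower()
```

With context "The Eiffel Tower is 330 metres tall and stands in Paris." and the two claims "the Eiffel Tower is 330 metres tall" and "it was built in 1850", the score is 0.5: one claim grounded, one hallucinated.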

Answer relevancy. Whether the generated answer actually addresses the user's question. A response can be faithful to the context (every claim is supported) but irrelevant to the query (it latches onto a tangential detail). Relevancy measures alignment between the question and the answer.
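One common way to score relevancy (used by RAGAS, among others) is to have an LLM generate questions the answer would plausibly be responding to, then compare their embeddings against the original question: a high mean similarity means the answer addresses what was asked. A minimal sketch of the scoring step, assuming embeddings already exist (the embedding model and the question-generation step are out of scope here):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def answer_relevancy(question_emb, regenerated_question_embs):
    """Mean similarity between the original question's embedding and
    embeddings of questions regenerated from the answer."""
    sims = [cosine(question_emb, emb) for emb in regenerated_question_embs]
    return sum(sims) / len(sims) if sims else 0.0
```

An answer that latches onto a tangential detail produces regenerated questions that sit far from the original in embedding space, pulling the mean similarity down.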

End-to-end metrics

Context precision. Evaluated across the full pipeline -- whether the retrieved context, as seen by the generation model, is relevant to the question. Combines retrieval quality with context assembly quality.

Context recall. Whether the retrieved context contains all the facts needed to fully answer the question. Unlike retrieval recall (which checks document IDs), context recall checks semantic coverage.

The open-source RAGAS framework provides standardized implementations of these metrics and is widely used as a starting point for RAG evaluation.

RAG evaluation in practice

Production RAG evaluation follows the same three-stage pattern as general LLM evaluation:

  1. Development-time. Engineers test retrieval changes (embedding models, chunk sizes, re-ranking) against retrieval fixtures with known-relevant documents. Fast, cheap, no LLM judge needed for retrieval metrics.
  2. Pre-deploy. The CI/CD pipeline runs both retrieval and generation evaluations. Faithfulness and relevancy scores are gated -- if they drop below thresholds or regress against the baseline, the deploy is blocked.
  3. Production monitoring. Scheduled evaluation runs against the live index catch drift from document updates, embedding model changes, and usage pattern shifts.
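The pre-deploy gate in step 2 can be sketched as a simple threshold-and-regression check; the metric names, thresholds, and regression tolerance below are illustrative, not values from the source:

```python
THRESHOLDS = {"faithfulness": 0.90, "answer_relevancy": 0.80}  # illustrative floors
MAX_REGRESSION = 0.02  # allowed drop relative to the baseline run

def gate(scores, baseline):
    """Return a list of failure reasons; an empty list means the deploy may proceed."""
    failures = []
    for metric, floor in THRESHOLDS.items():
        score = scores[metric]
        if score < floor:
            failures.append(f"{metric} {score:.2f} below threshold {floor:.2f}")
        elif score < baseline.get(metric, 0.0) - MAX_REGRESSION:
            failures.append(f"{metric} regressed from baseline {baseline[metric]:.2f}")
    return failures
```

A CI job would run the evaluation suite, call `gate` with the fresh scores and the stored baseline, and fail the pipeline if the returned list is non-empty.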

The RAG evaluation guide walks through the full approach, including fixture design, metric selection, and automation strategies. For teams building RAG test infrastructure, the RAG testing framework guide covers implementation details with code examples.

RAG evaluation connects to the broader LLMOps lifecycle -- retrieval quality scores feed into eval gates that block risky deploys, and faithfulness metrics become part of production LLM observability. For teams evaluating RAG tooling, see how Coverge compares in our Langfuse alternative and Braintrust alternative pages.