RAG evaluation: how to measure retrieval quality, faithfulness, and answer relevance
By Coverge Team
You built a RAG pipeline. It answers questions using your documents. In the demo, it looks great. Then you deploy it, and within a week someone screenshots an answer where your system confidently cites a paragraph that says the opposite of what it claims. The retrieved context was right. The generation was wrong. And your team has no metric that caught it.
This is the central problem of RAG evaluation: you are not testing one model. You are testing a system with multiple failure points — retrieval, context assembly, generation, and the interactions between them. A RAG pipeline can fail because it retrieved the wrong chunks, because it retrieved the right chunks but ignored them, because the chunks themselves were poorly split, or because the question was ambiguous and the system did not ask for clarification.
Search volume for "rag evaluation" sits at 260 monthly searches with 12% year-over-year growth. That growth tracks with a broader shift: teams that built RAG prototypes in 2024 are now running them in production and discovering that "it works on my documents" is not the same as "it works reliably at scale." The questions have moved from "how do I build RAG" to "how do I know my RAG is working."
This guide covers the metrics that matter for RAG evaluation, how to implement them with RAGAS and other frameworks, how to evaluate your chunking strategy, and how to set up continuous monitoring for retrieval quality in production.
Why RAG evaluation is different from general LLM evaluation
If you have read our LLM evaluation guide, you know the basics of evaluating language model outputs — reference-based metrics, LLM-as-a-judge patterns, criteria-based checks. RAG evaluation builds on all of that but adds a critical dimension: you need to evaluate the retrieval step independently from the generation step.
Consider a simple failure: a user asks "What is our refund policy for enterprise customers?" and your system returns a generic consumer refund policy. The generation model does its job perfectly — it summarizes the retrieved context accurately. But the answer is wrong because retrieval failed. If you only evaluate the final answer, you might blame the LLM when the real problem is your vector search or chunking strategy.
The reverse is equally common. Retrieval returns the perfect chunks, but the model ignores key details or hallucinates additional conditions that are not in the source material. Evaluating only the retrieval step would show everything is fine.
Effective RAG evaluation decomposes the pipeline into stages and measures each one, then also measures the end-to-end result. This is the insight behind frameworks like RAGAS.
The core metrics: what to measure and why
RAG evaluation metrics fall into three categories: retrieval quality, generation faithfulness, and answer quality. Here is what each one captures and when it matters.
Retrieval metrics
Context recall measures whether the retrieval step found all the relevant information needed to answer the question. If the ground truth answer requires information from three different document sections and retrieval only found two, context recall is low. This metric requires ground truth annotations — you need to know what the ideal retrieved context looks like.
```python
# Context recall: what fraction of the ground truth answer
# is attributable to the retrieved context?
# High context recall = retrieval found the right documents
# Low context recall = retrieval missed important information
```
Context precision measures whether the retrieved chunks are actually relevant to the question. If you retrieve ten chunks and only three are relevant, a naive precision is 0.3. Low precision means your context window is cluttered with irrelevant text, which wastes tokens and can confuse the generation model. In practice the metric is also rank-aware: relevant chunks should appear earlier in the retrieval results, so a relevant chunk ranked first contributes more than one ranked last.
Context relevance is a softer version of precision. Instead of binary relevant/not-relevant, it scores how relevant each chunk is to the question on a continuous scale. This is typically evaluated using an LLM judge.
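As a concrete illustration (a simplified sketch, not the RAGAS implementation), a rank-aware precision over a list of per-chunk relevance judgments can be computed as the mean of precision@k at each relevant rank:

```python
def rank_aware_precision(relevance: list[bool]) -> float:
    """Average precision@k over the positions where the retrieved
    chunk is relevant. Rewards ranking relevant chunks early.
    Returns 0.0 when no retrieved chunk is relevant."""
    hits = 0
    precisions = []
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# The same relevant set scores higher when ranked first
print(rank_aware_precision([True, True, False]))   # 1.0
print(rank_aware_precision([False, False, True]))  # 0.333...
```

Note how the second call is penalized even though one relevant chunk was found: it appeared last, so precision@3 is only 1/3.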
Generation metrics
Faithfulness (also called groundedness) measures whether the generated answer is supported by the retrieved context. This is the hallucination detector. A faithfulness score breaks the answer into individual claims and checks whether each claim can be attributed to the provided context. A model that generates accurate information from its training data but not from the retrieved context scores low on faithfulness — because the point of RAG is to ground answers in your documents, not in the model's parametric knowledge.
```python
# Faithfulness decomposition:
# 1. Split the answer into individual claims
# 2. For each claim, check if the retrieved context supports it
# 3. Faithfulness = supported_claims / total_claims
#
# Example:
# Answer: "Our enterprise plan costs $500/month and includes 24/7 support."
# Context mentions $500/month but says nothing about support hours.
# Claim 1 (price): supported -> 1
# Claim 2 (support): not supported -> 0
# Faithfulness: 0.5
```
Answer relevance measures whether the generated answer actually addresses the question. A model might faithfully summarize retrieved context but miss the point of the question entirely. Answer relevance checks the alignment between the question asked and the answer provided, independent of whether the answer is factually grounded.
End-to-end metrics
Answer correctness compares the final answer against a ground truth reference. This is the metric that matters most to users but is the hardest to compute accurately for open-ended questions. It typically combines semantic similarity with factual overlap.
Answer similarity uses embedding-based comparison between the generated answer and a reference answer. Less granular than correctness but cheaper to compute and useful for regression detection.
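A minimal sketch of embedding-based answer similarity, using a toy bag-of-words "embedding" purely for illustration (in a real pipeline you would substitute your embedding provider's API for `toy_embed`):

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model: bag-of-words counts
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def answer_similarity(generated: str, reference: str) -> float:
    return cosine_similarity(toy_embed(generated), toy_embed(reference))

score = answer_similarity(
    "The maximum upload size is 100MB.",
    "Uploads are limited to 100MB.",
)
print(round(score, 3))
```

With a real embedding model the two sentences above would score high despite sharing almost no surface tokens, which is exactly why embedding-based similarity is useful for regression detection.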
Implementing evaluation with RAGAS
RAGAS (Retrieval Augmented Generation Assessment) is the most widely adopted framework for RAG evaluation. It provides implementations of all the metrics above and works with any RAG pipeline.
Setting up RAGAS
```python
# Install ragas
# pip install ragas
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    context_entity_recall,
    answer_similarity,
    answer_correctness,
)
from datasets import Dataset

# Prepare your evaluation dataset
# Each sample needs: question, answer, contexts, and ground_truth
eval_data = {
    "question": [
        "What is the maximum file upload size?",
        "How do I reset my API key?",
        "What regions are supported for data residency?",
    ],
    "answer": [
        "The maximum file upload size is 100MB for standard plans and 1GB for enterprise.",
        "You can reset your API key from the Settings page under API Keys.",
        "We support data residency in US, EU, and APAC regions.",
    ],
    "contexts": [
        [
            "File uploads are limited to 100MB on standard plans. Enterprise customers can upload files up to 1GB. Files larger than the limit will be rejected with a 413 error.",
        ],
        [
            "API keys can be managed from the Settings page. Navigate to Settings > API Keys to view, create, or revoke keys. For security, revoked keys cannot be restored.",
        ],
        [
            "Coverge supports data residency in three regions: US (us-east-1), EU (eu-west-1), and APAC (ap-southeast-1). Contact sales for additional region requirements.",
            "Data residency ensures that all customer data is stored and processed within the selected region.",
        ],
    ],
    "ground_truth": [
        "The maximum file upload size is 100MB for standard plans and 1GB for enterprise plans.",
        "Navigate to Settings > API Keys to reset your API key.",
        "Supported data residency regions are US, EU, and APAC.",
    ],
}

dataset = Dataset.from_dict(eval_data)

# Run evaluation with selected metrics
result = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        answer_correctness,
    ],
)

# Results as a pandas DataFrame
df = result.to_pandas()
print(df[["question", "faithfulness", "answer_relevancy", "context_precision", "context_recall"]])
```
Interpreting RAGAS scores
RAGAS metrics return scores between 0 and 1. Here is a calibration guide based on what we see in production pipelines:
| Metric | Good | Acceptable | Needs work |
|---|---|---|---|
| Faithfulness | > 0.9 | 0.7 - 0.9 | < 0.7 |
| Answer relevancy | > 0.85 | 0.7 - 0.85 | < 0.7 |
| Context precision | > 0.8 | 0.6 - 0.8 | < 0.6 |
| Context recall | > 0.8 | 0.6 - 0.8 | < 0.6 |
| Answer correctness | > 0.8 | 0.6 - 0.8 | < 0.6 |
These thresholds are starting points. What "good" looks like depends on your domain and risk tolerance. A medical question-answering system needs faithfulness above 0.95. An internal knowledge base for engineering docs might be fine at 0.85.
The more important pattern is tracking trends over time, which is where LLM regression testing and baseline comparison become essential. A faithfulness score that drops from 0.92 to 0.85 after a chunking strategy change is a signal, regardless of whether 0.85 is "good enough" in absolute terms.
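One way to operationalize these thresholds is a simple gate that flags any metric below its floor. A sketch, using the "acceptable" floors from the table above as defaults you would tune per domain:

```python
# Minimum acceptable floors, taken from the calibration table above
THRESHOLDS = {
    "faithfulness": 0.7,
    "answer_relevancy": 0.7,
    "context_precision": 0.6,
    "context_recall": 0.6,
}

def check_thresholds(scores: dict[str, float],
                     thresholds: dict[str, float] = THRESHOLDS) -> list[str]:
    """Return the names of metrics that fall below their floor.
    A missing metric counts as 0.0, i.e. an automatic failure."""
    return [metric for metric, floor in thresholds.items()
            if scores.get(metric, 0.0) < floor]

failures = check_thresholds({
    "faithfulness": 0.85,
    "answer_relevancy": 0.65,
    "context_precision": 0.8,
    "context_recall": 0.75,
})
print(failures)  # ['answer_relevancy']
```

An empty return value means the run passes the gate; anything else is a list of metrics to investigate before shipping.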
Evaluating your chunking strategy
Chunking is one of the highest-impact decisions in a RAG pipeline, and also one of the least evaluated. Most teams pick a chunking strategy (fixed-size, recursive, semantic) and a chunk size (512 tokens, 1024 tokens) based on intuition or a blog post, then never measure whether it was the right choice.
What chunking affects
Your chunking strategy directly impacts multiple evaluation metrics:
Context recall suffers when chunks split relevant information across boundaries. If a paragraph explaining your pricing tiers gets split into two chunks and only one is retrieved, the answer will be incomplete.
Context precision suffers when chunks are too large and include irrelevant information alongside the relevant content. A 2000-token chunk might contain the answer in one sentence but pad the context with unrelated material.
Faithfulness can be affected when chunk boundaries create misleading context. A chunk that starts mid-paragraph might lose the qualifying statement that preceded a claim.
Running chunking experiments
The right way to evaluate chunking is to treat it as a hyperparameter and run controlled experiments:
```python
from typing import Callable

from ragas import evaluate
from ragas.metrics import context_precision, context_recall, faithfulness, answer_correctness
from datasets import Dataset

def evaluate_chunking_strategy(
    strategy_name: str,
    documents: list[str],
    eval_questions: list[dict],
    chunk_fn: Callable,
    retriever_fn: Callable,
    generator_fn: Callable,
) -> dict:
    """
    Run RAG evaluation for a specific chunking strategy.
    chunk_fn: takes documents, returns chunks
    retriever_fn: takes question + chunks, returns relevant contexts
    generator_fn: takes question + contexts, returns answer
    """
    chunks = chunk_fn(documents)
    questions = []
    answers = []
    contexts = []
    ground_truths = []
    for q in eval_questions:
        retrieved = retriever_fn(q["question"], chunks)
        answer = generator_fn(q["question"], retrieved)
        questions.append(q["question"])
        answers.append(answer)
        contexts.append(retrieved)
        ground_truths.append(q["ground_truth"])
    dataset = Dataset.from_dict({
        "question": questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truths,
    })
    result = evaluate(
        dataset,
        metrics=[context_precision, context_recall, faithfulness, answer_correctness],
    )
    return {
        "strategy": strategy_name,
        "num_chunks": len(chunks),
        "avg_chunk_size": sum(len(c) for c in chunks) / len(chunks),
        **{m.name: result[m.name] for m in [context_precision, context_recall, faithfulness, answer_correctness]},
    }

# Compare strategies
strategies = [
    ("fixed_512", lambda docs: fixed_size_chunk(docs, size=512, overlap=50)),
    ("fixed_1024", lambda docs: fixed_size_chunk(docs, size=1024, overlap=100)),
    ("recursive_512", lambda docs: recursive_chunk(docs, size=512, overlap=50)),
    ("semantic", lambda docs: semantic_chunk(docs, threshold=0.8)),
]

results = []
for name, chunk_fn in strategies:
    r = evaluate_chunking_strategy(name, documents, eval_questions, chunk_fn, retriever, generator)
    results.append(r)
    print(f"{name}: precision={r['context_precision']:.3f} recall={r['context_recall']:.3f} faithfulness={r['faithfulness']:.3f}")
```
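The chunking helpers referenced above (`fixed_size_chunk`, `recursive_chunk`, `semantic_chunk`) are assumed to exist in your codebase. A minimal sketch of the fixed-size variant, splitting on whitespace tokens for simplicity (a production chunker should count model tokens, not words):

```python
def fixed_size_chunk(documents: list[str], size: int = 512,
                     overlap: int = 50) -> list[str]:
    """Naive fixed-size chunking with overlap, using whitespace
    tokens as a stand-in for model tokens. Consecutive chunks
    share the last `overlap` tokens of the previous chunk."""
    chunks = []
    step = size - overlap
    for doc in documents:
        words = doc.split()
        for start in range(0, len(words), step):
            chunk = " ".join(words[start:start + size])
            if chunk:
                chunks.append(chunk)
            if start + size >= len(words):
                break
    return chunks

# 1000 distinct tokens -> 3 chunks; chunk 2 starts at token 462,
# overlapping the last 50 tokens of chunk 1
doc = " ".join(f"w{i}" for i in range(1000))
chunks = fixed_size_chunk([doc], size=512, overlap=50)
print(len(chunks), chunks[1].split()[0])  # 3 w462
```

Even a naive implementation like this is enough to run the comparison loop above and see how precision and recall shift as chunk size changes.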
Common chunking failure patterns
Across the dozens of production RAG pipelines we have evaluated, certain failure patterns recur:
Tables and structured data split across chunks. A pricing table chunked at a fixed token boundary loses its structure. The model sees half a table and hallucinates the rest. Fix: use document-aware chunking that preserves structural elements.
Headers separated from their content. A section header ends up at the tail of one chunk while the actual content starts the next chunk. The content chunk loses semantic context. Fix: ensure headers are always included with their following content.
Overlapping chunks creating duplicate retrieval. Aggressive overlap (more than 20% of chunk size) means the same text appears in multiple chunks. Retrieval returns near-duplicate contexts, wasting token budget. Fix: tune overlap to 10-15% or use semantic deduplication at retrieval time.
Chunks too small for complex topics. Splitting a detailed explanation into 256-token chunks means the model needs multiple chunks to get the full picture, but retrieval might only return some of them. Fix: for technical documentation, 512-1024 tokens usually works better than smaller sizes.
Beyond RAGAS: other evaluation tools
RAGAS is not the only option. Here is how other tools in the ecosystem handle RAG evaluation.
DeepEval
DeepEval provides RAG evaluation metrics similar to RAGAS but integrates with pytest, which makes it easier to add RAG tests to existing test suites. If you are already looking at DeepEval as an alternative to other eval tools, its RAG metrics are a strong reason to consider it.
```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    FaithfulnessMetric,
    ContextualRelevancyMetric,
    ContextualRecallMetric,
    AnswerRelevancyMetric,
)

def test_rag_response():
    test_case = LLMTestCase(
        input="What is the refund policy?",
        actual_output="Full refunds are available within 30 days of purchase.",
        expected_output="Customers can get a full refund within 30 days.",
        retrieval_context=[
            "Refund Policy: Customers may request a full refund within 30 days of their purchase date. After 30 days, only partial refunds are available."
        ],
    )
    faithfulness = FaithfulnessMetric(threshold=0.8)
    relevancy = ContextualRelevancyMetric(threshold=0.7)
    recall = ContextualRecallMetric(threshold=0.7)
    answer_rel = AnswerRelevancyMetric(threshold=0.8)
    assert_test(test_case, [faithfulness, relevancy, recall, answer_rel])
```
The pytest integration means you can run RAG evaluation as part of your CI pipeline and block merges when quality drops below thresholds. This is where the connection between evaluation and deployment gates becomes concrete.
Custom evaluation with LLM-as-a-judge
For domain-specific RAG pipelines, generic metrics sometimes miss what matters. A legal RAG system needs to evaluate whether citations are to the correct statute sections. A medical RAG system needs to check whether dosage information is precise. In these cases, building custom evaluation prompts using an LLM judge is often more effective than generic metrics:
```python
FAITHFULNESS_JUDGE_PROMPT = """You are evaluating whether an AI assistant's answer
is fully supported by the provided source documents.

Question: {question}

Source documents:
{contexts}

AI Answer: {answer}

For each claim in the answer:
1. Identify the specific claim
2. Find the supporting text in the source documents (quote it)
3. Rate support: SUPPORTED, PARTIALLY_SUPPORTED, or NOT_SUPPORTED

Then provide an overall faithfulness score from 0.0 to 1.0.

Respond in JSON format:
{{
  "claims": [
    {{"claim": "...", "source_quote": "...", "rating": "..."}}
  ],
  "overall_score": 0.0,
  "reasoning": "..."
}}"""
```
This approach gives you fine-grained control over what "faithful" means in your domain and produces explainable results — you can see exactly which claims were unsupported and why.
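Once the judge responds, you still need to parse and sanity-check its verdict. A sketch, assuming the JSON schema from the prompt above, with a half-credit weighting for PARTIALLY_SUPPORTED (that weighting is an assumption, not a standard):

```python
import json

def parse_judge_response(raw: str) -> dict:
    """Parse the judge's JSON verdict and recompute the overall
    score from the per-claim ratings, so a miscalibrated
    overall_score from the model cannot slip through.
    PARTIALLY_SUPPORTED counts as half credit (an assumption)."""
    verdict = json.loads(raw)
    weights = {"SUPPORTED": 1.0, "PARTIALLY_SUPPORTED": 0.5,
               "NOT_SUPPORTED": 0.0}
    claims = verdict.get("claims", [])
    if claims:
        verdict["overall_score"] = sum(
            weights.get(c["rating"], 0.0) for c in claims
        ) / len(claims)
    return verdict

# Example judge output (in practice this comes back from your LLM call)
raw = json.dumps({
    "claims": [
        {"claim": "price is $500/month", "source_quote": "...", "rating": "SUPPORTED"},
        {"claim": "includes 24/7 support", "source_quote": "", "rating": "NOT_SUPPORTED"},
    ],
    "overall_score": 0.9,
    "reasoning": "support claim unsupported",
})
print(parse_judge_response(raw)["overall_score"])  # 0.5
```

Recomputing the score from the claim-level ratings is a cheap guard against judges that produce inconsistent summaries of their own analysis.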
Continuous RAG monitoring in production
Evaluation datasets catch known failure modes. Production monitoring catches unknown ones. The gap between "works on eval set" and "works in production" is where most RAG pipelines stumble.
What to monitor
Retrieval latency and empty results. Track p50 and p99 latency for your vector search. Also track the rate of queries that return no results or results below a relevance threshold. A spike in empty retrievals usually means a new category of questions your knowledge base does not cover.
Retrieval relevance scores. Log the similarity scores from your vector search for every query. Plot the distribution over time. A leftward shift means retrieval quality is degrading — possibly because your embeddings are stale while documents have been updated, or because user questions are drifting from the topics your knowledge base covers.
Faithfulness on a sample of production queries. You cannot run faithfulness evaluation on every production request (it requires an LLM judge call, which adds cost and latency). But you can sample 1-5% of requests and run async evaluation. Log the results and alert when the rolling average drops below your threshold.
User feedback signals. Thumbs up/down, follow-up questions (which often indicate the first answer was not sufficient), copy-paste actions (positive signal), and explicit corrections. These are noisy individually but powerful in aggregate.
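One way to aggregate these noisy signals is a weighted per-request score. A sketch with illustrative weights (the function name and weight values are assumptions you would calibrate against labeled outcomes):

```python
def feedback_score(events: list[dict]) -> float:
    """Collapse noisy per-request feedback events into one score
    in [-1, 1]. Weight values are illustrative, not calibrated."""
    weights = {
        "thumbs_up": 1.0,
        "thumbs_down": -1.0,
        "copy": 0.5,          # user copied the answer: weak positive
        "follow_up": -0.25,   # immediate rephrasing: weak negative
        "correction": -1.0,   # explicit user correction: strong negative
    }
    if not events:
        return 0.0
    total = sum(weights.get(e["type"], 0.0) for e in events)
    return max(-1.0, min(1.0, total / len(events)))

score = feedback_score([
    {"type": "copy"}, {"type": "follow_up"}, {"type": "thumbs_up"},
])
print(round(score, 3))  # 0.417
```

Individually each event is weak evidence; averaged over hundreds of requests per day, the trend line of this score is a useful leading indicator for quality regressions.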
Building a monitoring pipeline
```python
import asyncio
from datetime import datetime, timezone

async def monitor_rag_request(
    question: str,
    retrieved_contexts: list[str],
    retrieval_scores: list[float],
    generated_answer: str,
    latency_ms: float,
):
    """Log RAG metrics for every request, run eval on a sample."""
    # Always log these
    metrics = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question_length": len(question),
        "num_contexts_retrieved": len(retrieved_contexts),
        "avg_retrieval_score": sum(retrieval_scores) / len(retrieval_scores) if retrieval_scores else 0,
        "min_retrieval_score": min(retrieval_scores) if retrieval_scores else 0,
        "retrieval_latency_ms": latency_ms,
        "answer_length": len(generated_answer),
        "empty_retrieval": len(retrieved_contexts) == 0,
    }
    await log_metrics("rag_monitoring", metrics)

    # Sample-based deep evaluation (e.g., 2% of requests)
    if should_sample(rate=0.02):
        eval_result = await run_async_evaluation(
            question=question,
            contexts=retrieved_contexts,
            answer=generated_answer,
        )
        await log_metrics("rag_deep_eval", {
            **metrics,
            "faithfulness": eval_result["faithfulness"],
            "answer_relevancy": eval_result["answer_relevancy"],
        })
        # Alert on low faithfulness
        if eval_result["faithfulness"] < 0.7:
            await send_alert(
                f"Low faithfulness detected: {eval_result['faithfulness']:.2f}",
                context={"question": question, "answer": generated_answer},
            )
```
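`should_sample` in the snippet above is left undefined. One reasonable implementation keys the decision off a request identifier, so retries of the same request get a consistent sampling decision (note this variant takes a key, unlike the keyless call above, and is an illustration rather than a prescribed API):

```python
import hashlib

def should_sample(key: str, rate: float = 0.02) -> bool:
    """Deterministic sampling: hash the request key into [0, 1)
    and compare against the rate. The same key always yields the
    same decision, unlike random.random()-based sampling."""
    digest = hashlib.sha256(key.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

# Roughly 2% of distinct keys are selected
sampled = sum(should_sample(f"req-{i}", rate=0.02) for i in range(10_000))
print(sampled)  # roughly 200 of 10,000
```

Deterministic sampling also makes incidents reproducible: if a sampled request triggered an alert, re-running the same request id is guaranteed to be sampled again.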
Setting SLOs for RAG quality
Defining service level objectives for RAG quality forces your team to agree on what "good enough" means and creates accountability for maintaining it:
| SLO | Target | Measurement |
|---|---|---|
| Retrieval non-empty rate | > 95% | Percentage of queries returning at least 1 context above relevance threshold |
| Mean faithfulness (sampled) | > 0.85 | Rolling 24h average of sampled faithfulness scores |
| Answer relevancy (sampled) | > 0.80 | Rolling 24h average of sampled relevancy scores |
| Retrieval p99 latency | < 500ms | 99th percentile vector search latency |
| User satisfaction (thumbs up ratio) | > 75% | Among users who provide feedback |
When an SLO is breached, you have a concrete signal to investigate. Was it a data ingestion issue? A model update that changed generation behavior? A new category of questions your knowledge base does not handle? The SLO does not tell you the cause, but it tells you something needs attention.
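A sketch of how a rolling-window SLO check might look in code (the class is illustrative; production systems would usually compute this in the metrics backend over time-based windows rather than a fixed sample count):

```python
from collections import deque

class RollingSLO:
    """Track a rolling window of sampled scores and flag breaches
    of a minimum-average target."""

    def __init__(self, target: float, window: int = 100):
        self.target = target
        self.scores = deque(maxlen=window)  # old scores drop off automatically

    def record(self, score: float) -> None:
        self.scores.append(score)

    def breached(self) -> bool:
        if not self.scores:
            return False  # no data yet: nothing to alert on
        return sum(self.scores) / len(self.scores) < self.target

# Mean faithfulness SLO of 0.85 over the last 50 samples
slo = RollingSLO(target=0.85, window=50)
for s in [0.9] * 40 + [0.6] * 10:
    slo.record(s)
print(slo.breached())  # True: rolling mean is 0.84
```

The window size is a trade-off: short windows alert fast but are noisy on a 2% sample rate; longer windows smooth out noise at the cost of slower detection.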
End-to-end RAG pipeline testing
Unit-testing individual components (retrieval, generation) is necessary but not sufficient. You also need end-to-end tests that exercise the full pipeline — the same path a user request takes.
Regression test suites
Build a curated set of questions that represent your most important use cases and known edge cases. Run the full RAG pipeline on this set before every deployment:
```python
import json
from pathlib import Path

# rag_regression_tests.json contains curated test cases
REGRESSION_SUITE = json.loads(Path("rag_regression_tests.json").read_text())

def run_rag_regression(pipeline, threshold: float = 0.8):
    """
    Run the full RAG pipeline on the regression suite.
    Fail if any metric drops below threshold.
    """
    results = []
    for case in REGRESSION_SUITE:
        # Run full pipeline
        response = pipeline.query(case["question"])
        # Evaluate
        scores = evaluate_single(
            question=case["question"],
            answer=response.answer,
            contexts=response.retrieved_contexts,
            ground_truth=case["expected_answer"],
        )
        results.append({
            "question": case["question"],
            "category": case.get("category", "general"),
            **scores,
        })

    # Aggregate by category
    categories = set(r["category"] for r in results)
    for cat in categories:
        cat_results = [r for r in results if r["category"] == cat]
        avg_faithfulness = sum(r["faithfulness"] for r in cat_results) / len(cat_results)
        avg_correctness = sum(r["answer_correctness"] for r in cat_results) / len(cat_results)
        print(f"Category '{cat}': faithfulness={avg_faithfulness:.3f}, correctness={avg_correctness:.3f}")
        if avg_faithfulness < threshold:
            raise AssertionError(
                f"Faithfulness below threshold for category '{cat}': "
                f"{avg_faithfulness:.3f} < {threshold}"
            )
    return results
```
Testing retrieval and generation independently
When an end-to-end test fails, you need to know which component caused the failure. Structure your tests to isolate each stage:
```python
def test_retrieval_quality(retriever, eval_cases):
    """Test retrieval independently using known-good context mappings."""
    for case in eval_cases:
        retrieved = retriever.retrieve(case["question"], top_k=5)
        retrieved_texts = [doc.text for doc in retrieved]
        # Check if expected documents were retrieved
        for expected_doc in case["expected_documents"]:
            assert any(
                expected_doc in text for text in retrieved_texts
            ), f"Missing expected document for: {case['question']}"

def test_generation_with_golden_context(generator, eval_cases):
    """Test generation using perfect retrieval (golden contexts)."""
    for case in eval_cases:
        # Feed known-good context to isolate generation quality
        answer = generator.generate(
            question=case["question"],
            contexts=case["golden_contexts"],
        )
        faithfulness_score = evaluate_faithfulness(
            answer=answer,
            contexts=case["golden_contexts"],
        )
        assert faithfulness_score > 0.9, (
            f"Generation unfaithful even with golden context: "
            f"{faithfulness_score:.3f} for: {case['question']}"
        )
```
This decomposition is invaluable for debugging. If generation tests pass with golden context but fail with actual retrieval, you know the problem is in retrieval. If generation tests fail even with golden context, the issue is in your prompt template or model choice.
Advanced patterns
Evaluating multi-hop RAG
Some questions require information from multiple documents that must be synthesized together. These multi-hop queries are where RAG pipelines fail most often:
```python
# Multi-hop example:
# Q: "How does our premium plan pricing compare to competitors mentioned in the Q3 report?"
# Requires: 1) pricing page data 2) Q3 competitor analysis report
# The answer must synthesize information from both sources

def evaluate_multi_hop(pipeline, multi_hop_cases):
    """Evaluate questions that require multiple retrieval hops."""
    results = []
    for case in multi_hop_cases:
        response = pipeline.query(case["question"])
        # Check that contexts from multiple source documents were retrieved
        source_docs = set(ctx.metadata["source"] for ctx in response.contexts)
        required_sources = set(case["required_sources"])
        missing_sources = required_sources - source_docs
        if missing_sources:
            print(f"Missing sources for multi-hop query: {missing_sources}")
        # Multi-hop faithfulness: answer should synthesize from all sources,
        # not just parrot one source and ignore others
        synthesis_score = evaluate_synthesis(
            answer=response.answer,
            contexts_by_source={
                src: [c for c in response.contexts if c.metadata["source"] == src]
                for src in source_docs
            },
        )
        results.append({
            "question": case["question"],
            "missing_sources": missing_sources,
            "synthesis_score": synthesis_score,
        })
    return results
```
Evaluating citation accuracy
If your RAG system provides citations (and it should), evaluate whether citations point to the correct source and whether cited passages actually support the claims they are attached to:
```python
def evaluate_citations(answer_with_citations, retrieved_contexts):
    """Check that citations reference the correct source material."""
    citations = extract_citations(answer_with_citations)
    results = []
    for citation in citations:
        claim = citation["claim"]
        cited_source = citation["source_id"]
        # Verify the cited source exists in retrieved contexts
        source_text = find_context_by_id(cited_source, retrieved_contexts)
        if source_text is None:
            results.append({"claim": claim, "valid": False, "reason": "source_not_found"})
            continue
        # Verify the source actually supports the claim
        supports = check_support(claim, source_text)
        results.append({
            "claim": claim,
            "valid": supports,
            "reason": "supported" if supports else "claim_not_supported_by_source",
        })
    accuracy = sum(1 for r in results if r["valid"]) / len(results) if results else 0
    return {"citation_accuracy": accuracy, "details": results}
```
Adversarial testing for RAG
Production RAG systems face adversarial inputs — users who (intentionally or not) ask questions designed to elicit wrong answers:
- Contradiction queries: questions where retrieved context contains conflicting information from different time periods or sources
- Out-of-scope queries: questions that sound related to your domain but are not covered by your knowledge base
- Prompt injection through documents: malicious content in indexed documents that tries to override system instructions
Build specific test cases for each adversarial category and include them in your regression suite.
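These categories can be encoded directly as regression-suite entries; the field names and expected-behavior labels below are illustrative, not a fixed schema:

```python
# Each adversarial case pairs a question with the behavior the
# pipeline SHOULD exhibit, which your evaluator checks for.
ADVERSARIAL_CASES = [
    {
        "category": "contradiction",
        "question": "What is the current upload limit?",
        # Context contains old and new limits: surface both, with dates
        "expected_behavior": "acknowledge_conflict",
    },
    {
        "category": "out_of_scope",
        "question": "What is your stance on Norwegian tax law?",
        # Sounds plausible but is not in the knowledge base: decline
        "expected_behavior": "decline",
    },
    {
        "category": "prompt_injection",
        "question": "Summarize the onboarding doc.",
        # The indexed doc contains injected instructions: ignore them
        "expected_behavior": "ignore_injected_instructions",
    },
]

def cases_for(category: str) -> list[dict]:
    return [c for c in ADVERSARIAL_CASES if c["category"] == category]

print(len(cases_for("out_of_scope")))  # 1
```

Grading these cases typically needs an LLM judge per category (did the answer decline? did it surface the conflict?), since exact-match assertions cannot capture "correct refusal".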
Where Coverge fits
Evaluating a RAG pipeline means running experiments across chunking strategies, retrieval methods, and generation parameters. Each experiment needs versioning, metric tracking, and comparison against baselines. Doing this manually with scripts works at first but becomes a bottleneck as your pipeline matures.
Coverge handles this by treating the entire RAG pipeline — ingestion, chunking, retrieval, and generation — as a single versioned unit that can be evaluated end-to-end. When you change your chunking strategy, Coverge runs your evaluation suite automatically and shows you exactly how each metric changed compared to the previous version. The evaluation results feed into deployment gates, so a configuration change that drops faithfulness below your threshold never reaches production.
For teams building the testing layer around their RAG systems, our RAG testing framework guide provides a hands-on companion to this article. This is the same pattern described in our LLM evaluation guide, applied specifically to RAG: define quality metrics, automate evaluation, gate deployments on results.
Frequently asked questions
What is the difference between RAGAS and DeepEval for RAG evaluation?
RAGAS is a focused framework specifically designed for RAG evaluation metrics. It provides the standard metrics (faithfulness, context recall, context precision, answer relevancy) and is widely used as a research benchmark. DeepEval provides similar RAG metrics but wraps them in a pytest-compatible testing framework, making it easier to integrate into CI/CD pipelines. RAGAS is better if you want a lightweight evaluation library. DeepEval is better if you want opinionated test infrastructure. Many teams use both — RAGAS for exploratory analysis and DeepEval for automated testing.
How many evaluation examples do I need for reliable RAG metrics?
For development iteration, 50-100 well-curated examples covering your main use cases give you directionally useful signals. For production confidence, aim for 200-500 examples stratified across question types, difficulty levels, and document categories. The key is coverage, not volume — 100 examples that span your query distribution are worth more than 1000 examples that all test the same pattern. Update your evaluation set monthly as you discover new failure modes in production.
Should I evaluate RAG components separately or only end-to-end?
Both. End-to-end evaluation tells you whether your system is working for users. Component-level evaluation tells you why it is or is not working. Start with end-to-end metrics to establish a baseline, then add component-level tests as you need to debug specific failures. At minimum, track retrieval relevance scores independently from generation faithfulness — this decomposition is the single most useful debugging tool for RAG pipelines.
How do I evaluate RAG when I do not have ground truth answers?
You can still measure faithfulness and answer relevancy without ground truth — these metrics only need the question, retrieved context, and generated answer. Context recall and answer correctness do require ground truth. For bootstrapping a ground truth dataset, use a strong model (like Claude or GPT-4) to generate reference answers from your documents, then have domain experts review and correct a subset. This hybrid approach gets you a usable evaluation set much faster than manual annotation from scratch.
What is a good faithfulness score?
It depends on your use case and risk tolerance. For customer-facing applications where incorrect information has real consequences (healthcare, finance, legal), target faithfulness above 0.95. For internal knowledge bases or low-stakes use cases, 0.85 is often acceptable. The absolute number matters less than the trend — track faithfulness over time and investigate any drops. A faithfulness score that was stable at 0.92 and suddenly drops to 0.85 after a model update is a problem regardless of whether 0.85 is "good enough" on paper.
How often should I re-run RAG evaluation?
Run your full evaluation suite on every change to the pipeline — chunking strategy, embedding model, prompt template, retrieval parameters, or generation model. Run a lighter smoke test on every deployment. In production, continuously sample and evaluate 1-5% of requests for faithfulness and answer relevancy. The continuous monitoring catches drift that discrete evaluations miss, like a gradual degradation as your knowledge base grows stale relative to user questions.
Can I use RAG evaluation metrics to compare different RAG architectures?
Yes, and this is one of the highest-value uses of RAG evaluation. When comparing architectures — say, vector search vs. hybrid search, or single-stage vs. multi-stage retrieval — run the same evaluation suite against both and compare metrics side by side. Make sure your evaluation set is representative of production traffic, not cherry-picked examples that favor one architecture. Pay special attention to performance on edge cases and multi-hop queries, which is where architectural differences show up most clearly.