RAG testing framework: how to test retrieval-augmented generation end to end
By Coverge Team
RAG systems have a testing problem. You can't just check the final answer because a correct answer might be based on the wrong context (lucky hallucination), and a wrong answer might be caused by bad retrieval rather than bad generation. The two stages — find the right documents, then generate a good answer from them — interact in ways that make end-to-end assertions unreliable.
Most teams respond by not testing their RAG pipelines systematically. They spot-check a few queries, eyeball the outputs, and ship. This works until it doesn't — and when it doesn't, debugging is painful because you can't tell whether the retrieval or the generation broke.
A proper RAG testing framework tests each stage independently and then tests them together. This article walks through the approach, the metrics, and the automation needed to test RAG systems with confidence.
Why RAG testing is harder than testing a plain LLM call
A plain LLM call has one failure mode: the model produces a bad response. A RAG system has at least four.
Retrieval miss. The relevant documents exist in your index but the retrieval step did not find them. The model generates an answer from irrelevant context, and the output is either a hallucination or a refusal.
Retrieval noise. The relevant documents are retrieved, but they are buried in irrelevant results. The model's attention is split, and the answer is less precise than it would be with cleaner context.
Context window overflow. The retrieved documents are correct but too long. They get truncated or the model loses track of information deep in the context. The answer is incomplete because the model could not attend to the relevant section.
Generation failure. The retrieval is perfect — the right documents, well-ranked, within the context window — but the model still produces a bad answer. Maybe the prompt is poorly structured, or the model misinterprets the context, or it hallucinates despite having the right information.
Testing only the final output conflates all four failure modes. When a test fails, you don't know which stage broke. Separate testing solves this.
Testing retrieval separately
Retrieval testing is the most deterministic part of RAG testing. Given a query, you know which documents should be retrieved (you define this in your test fixtures). The assertion is: did the retriever return the right documents?
Retrieval metrics
Three metrics cover the retrieval quality space.
Recall@K. Of all the relevant documents for this query, what fraction did the retriever return in the top K results? If there are 3 relevant documents and the retriever returned 2 of them in the top 5, recall@5 is 0.67.
function recallAtK(
  retrieved: string[],
  relevant: string[],
  k: number
): number {
  const topK = retrieved.slice(0, k);
  const found = relevant.filter((doc) => topK.includes(doc));
  return found.length / relevant.length;
}
Precision@K. Of the K documents the retriever returned, what fraction are actually relevant? If the top 5 results contain 2 relevant and 3 irrelevant documents, precision@5 is 0.4.
function precisionAtK(
  retrieved: string[],
  relevant: string[],
  k: number
): number {
  const topK = retrieved.slice(0, k);
  const found = topK.filter((doc) => relevant.includes(doc));
  return found.length / k;
}
Mean Reciprocal Rank (MRR). Where does the first relevant document appear in the ranked results? If the first relevant document is at position 3, the reciprocal rank is 1/3. Averaged across queries, this tells you how quickly users (and models) get to useful information.
function reciprocalRank(
  retrieved: string[],
  relevant: string[]
): number {
  for (let i = 0; i < retrieved.length; i++) {
    if (relevant.includes(retrieved[i])) {
      return 1 / (i + 1);
    }
  }
  return 0;
}
For most RAG systems, recall@K is the metric that matters most. Low precision means the model sees some irrelevant context — annoying but usually survivable. Low recall means the model never sees the relevant context — fatal for answer quality.
Building retrieval test fixtures
A retrieval test fixture maps queries to their known-relevant documents.
type RetrievalFixture = {
  id: string;
  query: string;
  relevantDocumentIds: string[];
  metadata: {
    category: string;
    difficulty: "easy" | "medium" | "hard";
  };
};

const fixtures: RetrievalFixture[] = [
  {
    id: "ret-001",
    query: "What is the refund policy for enterprise contracts?",
    relevantDocumentIds: [
      "doc-enterprise-terms-v3",
      "doc-refund-policy-2025",
    ],
    metadata: {
      category: "policy_lookup",
      difficulty: "easy",
    },
  },
  {
    id: "ret-002",
    query: "How does the billing API handle currency conversion?",
    relevantDocumentIds: [
      "doc-billing-api-reference",
      "doc-currency-handling",
      "doc-international-billing-guide",
    ],
    metadata: {
      category: "technical_reference",
      difficulty: "medium",
    },
  },
  {
    id: "ret-003",
    query: "Can I downgrade mid-cycle?",
    relevantDocumentIds: ["doc-subscription-changes"],
    metadata: {
      category: "policy_lookup",
      difficulty: "hard", // colloquial phrasing, needs semantic match
    },
  },
];
The hard part is creating the relevance judgments. For a small corpus (under 500 documents), you can do this manually. For larger corpora, start with a few dozen hand-labeled fixtures and expand them using a two-pass approach: first retrieve with a high K (50+), then have a human or strong LLM label each result as relevant or not.
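The second labeling pass can be sketched as a small harness. Everything here is illustrative: `expandFixture`, the `Candidate` shape, and the `labelRelevance` callback (a human reviewer or a strong LLM judge behind some API) are assumed names, not part of any specific library.

```typescript
// Two-pass fixture expansion: wide retrieval first, then per-result labeling.
type Candidate = { documentId: string; snippet: string };

async function expandFixture(
  query: string,
  candidates: Candidate[], // top-50+ results from the wide first-pass retrieval
  labelRelevance: (query: string, snippet: string) => Promise<boolean>
): Promise<string[]> {
  const relevant: string[] = [];
  // Second pass: label each wide-net candidate as relevant or not
  for (const c of candidates) {
    if (await labelRelevance(query, c.snippet)) {
      relevant.push(c.documentId);
    }
  }
  return relevant; // becomes relevantDocumentIds in the fixture
}
```

Labeling sequentially (rather than in parallel) keeps this friendly to rate-limited judge APIs; batch it if your labeler allows.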
Running retrieval tests
// Helper assumed in scope: arithmetic mean of a list of scores
const mean = (xs: number[]): number =>
  xs.reduce((a, b) => a + b, 0) / Math.max(xs.length, 1);

async function runRetrievalTests(
  retriever: Retriever,
  fixtures: RetrievalFixture[],
  k: number = 10
): Promise<RetrievalReport> {
  const results = await Promise.all(
    fixtures.map(async (fixture) => {
      const retrieved = await retriever.search(fixture.query, k);
      const retrievedIds = retrieved.map((r) => r.documentId);
      return {
        fixtureId: fixture.id,
        recall: recallAtK(retrievedIds, fixture.relevantDocumentIds, k),
        precision: precisionAtK(retrievedIds, fixture.relevantDocumentIds, k),
        mrr: reciprocalRank(retrievedIds, fixture.relevantDocumentIds),
        retrievedIds,
      };
    })
  );
  return {
    results,
    aggregates: {
      meanRecall: mean(results.map((r) => r.recall)),
      meanPrecision: mean(results.map((r) => r.precision)),
      meanMRR: mean(results.map((r) => r.mrr)),
    },
  };
}
Retrieval tests are fast and cheap — no LLM calls needed. Run them on every PR that touches anything in the retrieval stack: embedding models, chunking strategies, index configuration, query preprocessing.
Testing generation separately
Generation testing isolates the LLM's response quality by feeding it known-good context. This removes retrieval as a variable.
Fixed-context generation tests
For each test case, provide the exact context the model should use, and evaluate only the generation quality.
type GenerationFixture = {
  id: string;
  query: string;
  context: string; // known-good retrieved context
  scoringCriteria: {
    faithfulness: string; // rubric for LLM judge
    relevance: string;
    completeness: string;
  };
  referenceAnswer?: string;
};

const genFixtures: GenerationFixture[] = [
  {
    id: "gen-001",
    query: "What is the refund policy for enterprise contracts?",
    context: `Enterprise Refund Policy (v3, effective Jan 2026):
Enterprise customers may request a full refund within 30 days of
contract signing. After 30 days, refunds are prorated based on
remaining contract term. Annual contracts require 60 days written
notice for cancellation. Custom enterprise agreements may have
different terms as specified in the SOW.`,
    scoringCriteria: {
      faithfulness:
        "All claims in the response must be supported by the provided context. No fabricated details about pricing or timelines.",
      relevance:
        "The response must directly answer the refund policy question without unnecessary information about other policies.",
      completeness:
        "Must mention: 30-day full refund window, prorated refunds after 30 days, 60-day notice for annual contracts.",
    },
  },
];
This test tells you: given perfect context, can the model produce a correct answer? If it can't, the problem is in your prompt engineering or model selection, not your retrieval.
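As a sketch, a harness for these fixtures might look like the following. The `generate` and `judge` callbacks are assumptions standing in for your model call and LLM judge, and the trimmed `GenerationFixture` type mirrors the fixture shape above with the criteria generalized to a string map.

```typescript
// Minimal fixed-context generation harness (illustrative names throughout)
type GenerationFixture = {
  id: string;
  query: string;
  context: string; // known-good retrieved context
  scoringCriteria: Record<string, string>; // criterion name -> judge rubric
};

async function runGenerationFixture(
  fixture: GenerationFixture,
  generate: (query: string, context: string) => Promise<string>,
  judge: (response: string, rubric: string) => Promise<number>
): Promise<{ response: string; scores: Record<string, number> }> {
  // The model only ever sees the fixture's known-good context,
  // so any failure here is a generation failure, not a retrieval one
  const response = await generate(fixture.query, fixture.context);

  const scores: Record<string, number> = {};
  for (const [criterion, rubric] of Object.entries(fixture.scoringCriteria)) {
    scores[criterion] = await judge(response, rubric);
  }
  return { response, scores };
}
```

Injecting the two callbacks keeps the harness testable with mocks and makes swapping judges (or models) a one-line change.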
Faithfulness testing
Faithfulness — whether the response is grounded in the provided context — is the most important generation metric for RAG evaluation. The original RAG paper from Meta AI showed that retrieval-augmented models hallucinate less than pure generative models, but "less" is not "never." A response that sounds good but contains information not in the context is a hallucination, and hallucinations are the primary failure mode of RAG generation.
async function scoreFaithfulness(
  query: string,
  context: string,
  response: string
): Promise<{ score: number; unsupportedClaims: string[] }> {
  // Step 1: Extract claims from the response
  const claims = await extractClaims(response);

  // Step 2: Check each claim against the context
  const results = await Promise.all(
    claims.map(async (claim) => {
      const supported = await checkClaimSupport(claim, context);
      return { claim, supported };
    })
  );

  const unsupported = results
    .filter((r) => !r.supported)
    .map((r) => r.claim);

  return {
    score: 1 - unsupported.length / Math.max(claims.length, 1),
    unsupportedClaims: unsupported,
  };
}
This claim-level approach gives you actionable debugging information. Instead of a single score, you get the specific claims that are not grounded in the context. That tells you exactly where the model is hallucinating.
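The two helpers used above, `extractClaims` and `checkClaimSupport`, are LLM calls in practice. To make the interfaces concrete, here is a deliberately naive, deterministic stand-in for each (sentence splitting and word overlap); treat the heuristics as placeholders, not as a real faithfulness checker.

```typescript
// Placeholder for LLM claim extraction: crude sentence splitting
async function extractClaims(response: string): Promise<string[]> {
  return response
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Placeholder for an LLM support check: treat a claim as supported if
// most of its content words (4+ letters) appear in the context
async function checkClaimSupport(
  claim: string,
  context: string
): Promise<boolean> {
  const words = claim.toLowerCase().match(/[a-z]{4,}/g) ?? [];
  if (words.length === 0) return true; // no content words, nothing to verify
  const ctx = context.toLowerCase();
  const hits = words.filter((w) => ctx.includes(w)).length;
  return hits / words.length >= 0.7;
}
```

The real versions would prompt a judge model per claim; keeping the same async signatures means you can swap the heuristics out without touching `scoreFaithfulness`.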
RAGAS metrics for end-to-end evaluation
RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework that provides a standardized set of metrics for evaluating RAG systems. It measures both the retrieval and generation stages, and it does not require human-labeled ground truth for most metrics.
Core RAGAS metrics
Context Precision. Measures whether the retrieved context is relevant to the question. High context precision means the retriever is not polluting the context with irrelevant documents. This is evaluated by checking whether each piece of retrieved context is useful for answering the question.
Context Recall. Measures whether the retrieved context contains all the information needed to answer the question. Unlike retrieval recall@K (which checks document IDs), context recall checks semantic coverage — does the retrieved text contain the facts needed for the answer?
Faithfulness. Measures whether every claim in the generated answer is supported by the retrieved context. RAGAS extracts individual statements from the answer and verifies each one against the context. A faithfulness score of 0.85 means 85% of the claims are grounded.
Answer Relevancy. Measures whether the generated answer actually addresses the question. A response can be faithful to the context but irrelevant to the query (e.g., the model latches onto a tangential detail in the context).
Using RAGAS in your test suite
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
)
from datasets import Dataset

# Prepare your evaluation dataset
eval_data = {
    "question": [
        "What is the refund policy for enterprise contracts?",
        "How does the billing API handle currency conversion?",
    ],
    "answer": [
        # Generated answers from your RAG pipeline
        generated_answers[0],
        generated_answers[1],
    ],
    "contexts": [
        # Retrieved contexts for each question
        [retrieved_chunks[0]],
        [retrieved_chunks[1]],
    ],
    "ground_truth": [
        # Optional: human-written reference answers
        "Enterprise customers get a full refund within 30 days...",
        "The billing API converts currencies using daily ECB rates...",
    ],
}

dataset = Dataset.from_dict(eval_data)

results = evaluate(
    dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
)

print(results)
# {'context_precision': 0.92, 'context_recall': 0.87,
#  'faithfulness': 0.94, 'answer_relevancy': 0.89}
RAGAS uses LLM-as-a-judge under the hood, so it has the same variance characteristics as other LLM-based evaluations. Run each evaluation multiple times for critical test cases. The RAGAS documentation covers additional metrics and configuration options for specific use cases. The RAGAS GitHub repository is the best place to follow updates.
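One way to handle that variance is to score critical cases several times and look at both the mean and the spread. A sketch, where `scoreOnce` wraps whatever judge call you use (the function name and return shape are assumptions):

```typescript
// Repeat an LLM-judge scoring call and aggregate to reduce variance
async function stableScore(
  scoreOnce: () => Promise<number>,
  runs: number = 3
): Promise<{ mean: number; spread: number }> {
  const scores: number[] = [];
  for (let i = 0; i < runs; i++) {
    scores.push(await scoreOnce());
  }
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const spread = Math.max(...scores) - Math.min(...scores);
  // A large spread means the judge itself is unstable on this case,
  // which is worth flagging before trusting the mean
  return { mean, spread };
}
```

Gating on the mean while alerting on the spread separates "the pipeline got worse" from "the judge is noisy on this case."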
When to use RAGAS vs. custom metrics
RAGAS is a good default for teams starting their RAG evaluation practice. It gives you a standard set of metrics with reasonable implementations and saves you from building claim extraction and verification from scratch.
Build custom metrics when:
- Your domain has specific correctness criteria (medical accuracy, legal citation requirements, numerical precision)
- You need deterministic checks that don't require LLM calls (format compliance, citation format, source attribution)
- The RAGAS metrics don't align with your quality dimensions (e.g., you care about conciseness or reading level)
For a deeper treatment of how these metrics fit into evaluation-gated deployment pipelines, see the LLM regression testing guide.
Building test fixtures for RAG
Test fixtures for RAG are more complex than for single-LLM-call applications because you need to control the document corpus, the retrieval results, and the generation inputs independently.
Fixture architecture
fixtures/
  corpus/
    documents.json      # The test document corpus
    embeddings.json     # Pre-computed embeddings for the corpus
  retrieval/
    queries.json        # Queries with known-relevant documents
  generation/
    contexts.json       # Fixed contexts for generation testing
  end-to-end/
    cases.json          # Full pipeline test cases
Test corpus management
Your test corpus should be a controlled subset of your production corpus. Include documents that cover your test scenarios, plus enough noise documents to make retrieval non-trivial.
type TestCorpus = {
  documents: {
    id: string;
    content: string;
    metadata: Record<string, string>;
    chunks: {
      id: string;
      content: string;
      embedding?: number[];
    }[];
  }[];
};

async function buildTestCorpus(
  productionDocs: Document[],
  fixtureQueries: RetrievalFixture[]
): Promise<TestCorpus> {
  // Include all documents referenced by fixtures
  const relevantDocIds = new Set(
    fixtureQueries.flatMap((f) => f.relevantDocumentIds)
  );
  const relevantDocs = productionDocs.filter((d) =>
    relevantDocIds.has(d.id)
  );

  // Add noise documents (2-3x the relevant set)
  const noiseDocs = productionDocs
    .filter((d) => !relevantDocIds.has(d.id))
    .slice(0, relevantDocs.length * 3);

  const allDocs = [...relevantDocs, ...noiseDocs];

  // Chunk and embed all documents
  const corpus = await Promise.all(
    allDocs.map(async (doc) => ({
      id: doc.id,
      content: doc.content,
      metadata: doc.metadata,
      chunks: await chunkAndEmbed(doc),
    }))
  );

  return { documents: corpus };
}
Pre-compute embeddings and store them in the fixture. This makes retrieval tests fast (no embedding API calls) and deterministic (same embeddings every time). When you change your embedding model or chunking strategy, regenerate the fixture embeddings and re-run the retrieval tests — the delta between old and new results shows you the impact of the change.
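The before/after comparison can be a simple per-query diff of recall scores. A sketch with assumed shapes (`QueryRecall` and `retrievalDelta` are illustrative names, not from any library):

```typescript
// Diff per-query recall between two retrieval test runs
type QueryRecall = { fixtureId: string; recall: number };

function retrievalDelta(
  before: QueryRecall[],
  after: QueryRecall[]
): { fixtureId: string; delta: number }[] {
  const prior = new Map(before.map((r) => [r.fixtureId, r.recall]));
  return after
    .map((r) => ({
      fixtureId: r.fixtureId,
      delta: r.recall - (prior.get(r.fixtureId) ?? 0),
    }))
    .filter((d) => d.delta !== 0) // only queries whose recall moved
    .sort((a, b) => a.delta - b.delta); // worst regressions first
}
```

Sorting worst-first makes the review workflow obvious: the top of the list is where the new embedding model or chunking strategy hurt.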
Snapshot testing for RAG
Snapshot testing — comparing current outputs to a stored reference — works well for RAG when combined with semantic comparison rather than exact matching.
async function snapshotTest(
  pipeline: RagPipeline,
  fixture: EndToEndFixture,
  storedSnapshot: Snapshot
): Promise<SnapshotResult> {
  const result = await pipeline.run(fixture.query);

  // Semantic similarity between current and stored response
  const similarity = await embeddingSimilarity(
    result.answer,
    storedSnapshot.answer
  );

  // Did the retrieved documents change?
  const retrievalOverlap = jaccardSimilarity(
    result.retrievedDocIds,
    storedSnapshot.retrievedDocIds
  );

  return {
    answerSimilarity: similarity,
    retrievalOverlap,
    changed: similarity < 0.9 || retrievalOverlap < 0.8,
    currentAnswer: result.answer,
    storedAnswer: storedSnapshot.answer,
  };
}
When a snapshot test detects a change, don't auto-fail — flag it for review. The change might be an improvement. Let a human (or your LLM judge) decide whether to accept the new output as the updated snapshot.
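For reference, the `jaccardSimilarity` helper used in `snapshotTest` is plain set overlap. One reasonable implementation:

```typescript
// Jaccard similarity: |A ∩ B| / |A ∪ B| over two ID lists
function jaccardSimilarity(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  const intersection = [...setA].filter((x) => setB.has(x)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 1 : intersection / union; // two empty sets count as identical
}
```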
Automating RAG quality checks
CI integration
# .github/workflows/rag-tests.yml
name: RAG Quality Gate

on:
  pull_request:
    paths:
      - "src/retrieval/**"
      - "src/generation/**"
      - "src/rag/**"
      - "prompts/**"
      - "fixtures/**"

jobs:
  retrieval-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run retrieval quality tests
        run: bun run test:retrieval
      - name: Check retrieval thresholds
        run: |
          RECALL=$(jq '.aggregates.meanRecall' retrieval-report.json)
          if (( $(echo "$RECALL < 0.85" | bc -l) )); then
            echo "Retrieval recall below threshold: $RECALL < 0.85"
            exit 1
          fi

  generation-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run generation quality tests
        run: bun run test:generation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

  end-to-end:
    runs-on: ubuntu-latest
    needs: [retrieval-tests, generation-tests]
    steps:
      - uses: actions/checkout@v4
      - name: Run end-to-end RAG evaluation
        run: bun run test:rag-e2e
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Compare against baseline
        run: bun run test:rag-compare-baseline
The path filter matters — only run the expensive RAG tests when RAG-related code changes. A frontend CSS change should not trigger a 10-minute RAG evaluation. For the broader pattern of eval-gated CI/CD for LLM applications, the same tiered approach applies — cheap tests on every PR, expensive tests on merge.
Monitoring retrieval index changes
Your RAG quality can degrade without any code changes if the document index changes. New documents are added, old ones are updated, embeddings are recomputed. Set up a scheduled test that runs against the production index.
// Daily RAG health check against production index
async function ragHealthCheck(): Promise<void> {
  const retriever = createRetriever({ index: "production" });
  const generator = createGenerator({ model: "production" });
  const pipeline = createRagPipeline(retriever, generator);

  const fixtures = await loadFixtures("end-to-end/cases.json");
  const baseline = await loadBaseline("rag-baseline.json");
  const results = await runEndToEndEval(pipeline, fixtures);

  const regressions = detectRegressions(results, baseline, {
    faithfulness: 0.03,
    relevance: 0.05,
    recall: 0.05,
  });

  if (regressions.length > 0) {
    await notify({
      channel: "rag-quality",
      message: `RAG quality regression detected in production index:\n${formatRegressions(regressions)}`,
    });
  }
}
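The `detectRegressions` call above compares per-fixture results; as a simplified sketch, here is an aggregate-level version that flags any metric dropping below its baseline by more than the allowed tolerance (the types and names are assumptions):

```typescript
// Flag metrics that dropped below baseline by more than their tolerance
type MetricScores = Record<string, number>;

function detectRegressions(
  current: MetricScores,
  baseline: MetricScores,
  tolerances: MetricScores
): { metric: string; baseline: number; current: number }[] {
  return Object.entries(tolerances)
    .filter(([metric, tol]) => {
      const was = baseline[metric];
      const now = current[metric];
      // Only flag drops beyond tolerance; improvements never alert
      return was !== undefined && now !== undefined && was - now > tol;
    })
    .map(([metric]) => ({
      metric,
      baseline: baseline[metric],
      current: current[metric],
    }));
}
```

The per-metric tolerances absorb normal run-to-run judge noise so the alert channel only fires on real drift.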
Building a quality dashboard
Track these metrics over time to spot trends before they become incidents:
- Retrieval recall@K and precision@K (per query category)
- Faithfulness score distribution (overall and per category)
- Context precision (is the retriever returning clean results?)
- Answer relevancy (are the answers on-topic?)
- Latency breakdown (retrieval time vs. generation time)
When faithfulness drifts down over weeks, that tells you something is changing in your index or model behavior. The dashboard makes this visible before it crosses a threshold and pages someone.
Putting it together: a RAG testing strategy
Here is the testing strategy in order of implementation priority:
1. Retrieval recall tests (day 1). Build 20 fixtures mapping queries to relevant documents. Run recall@10 on every PR that touches retrieval. Gate on recall >= 0.85. This is cheap, fast, and catches the most damaging failures.
2. Faithfulness testing (week 1). Build 15 generation fixtures with fixed context. Score faithfulness with an LLM judge. Gate on faithfulness >= 0.90. This catches hallucinations — the failure mode users notice fastest.
3. RAGAS end-to-end evaluation (week 2). Wire up the full RAGAS metric suite on 30 end-to-end cases. Run on merges to main. This gives you the complete picture: retrieval + generation quality + their interaction.
4. Baseline regression detection (week 3). Store baselines from successful deployments. Compare every new evaluation against the baseline. Block deploys on statistically significant regressions.
5. Scheduled production monitoring (month 1). Run the test suite against the production index daily. Alert on drift. This catches regressions caused by index changes rather than code changes.
For a deeper dive into evaluation metrics and tooling options, see our RAG evaluation guide. This connects to broader LLMOps best practices around separating build from deploy — the same scoring functions, judge prompts, and threshold configurations should be reused across your pre-deploy and post-deploy evaluation layers.
What Coverge does differently
Coverge treats RAG quality as a deployment governance concern, not just a testing concern. When a pipeline includes retrieval-augmented generation, Coverge runs the full eval suite — retrieval metrics, faithfulness scoring, and end-to-end quality checks — as part of the deployment proof bundle. The proof bundle captures which retrieval fixtures were tested, what scores each metric produced, and who approved the deployment.
If retrieval recall drops after an index update or faithfulness degrades after a model swap, the eval gate blocks deployment automatically. The team sees exactly which test cases regressed and on which dimensions, making the regression actionable. For teams already tracking LLM regression baselines, Coverge extends the same baseline-comparison approach to RAG-specific metrics — context precision, recall, and faithfulness all get the same statistical significance testing and per-case tracking that generation metrics receive.
FAQ
What is a RAG testing framework?
A RAG testing framework is a structured approach to evaluating retrieval-augmented generation systems across both the retrieval and generation stages. It includes test fixtures (queries with known-relevant documents and expected behaviors), metrics for each stage (recall, precision, faithfulness, relevancy), baseline comparison for regression detection, and CI integration to gate deployments on quality thresholds.
How do I test retrieval and generation separately?
For retrieval, build fixtures that map queries to known-relevant document IDs and measure recall@K, precision@K, and MRR. No LLM calls needed — this is fast and deterministic. For generation, provide fixed known-good context and evaluate only the LLM's output quality using faithfulness, relevance, and completeness criteria. Separate testing tells you exactly which stage broke when end-to-end quality drops.
What are the key RAGAS metrics?
RAGAS provides four core metrics: context precision (is the retrieved context relevant?), context recall (does the context contain all needed information?), faithfulness (are all claims in the answer supported by the context?), and answer relevancy (does the answer address the question?). Together they cover both the retrieval and generation quality dimensions of a RAG system.
How often should I run RAG tests?
Tier by cost: retrieval tests (no LLM calls) on every PR that touches retrieval code, generation faithfulness tests on every PR that touches prompts or generation logic, full RAGAS end-to-end evaluation on merges to main, and scheduled production monitoring daily or weekly. Only the end-to-end and production monitoring runs require significant LLM API spend.
What retrieval recall threshold should I target?
For most production RAG systems, recall@10 >= 0.85 is a reasonable starting point. Below 0.80, you will see frequent answer quality issues from missing context. Above 0.95 is excellent but hard to achieve on diverse query sets. The right threshold depends on your domain — medical or legal RAG systems should target higher recall because missing context has worse consequences.
How do I test RAG when my document index changes?
Schedule a daily or weekly test run that executes your test suite against the production index rather than a frozen test corpus. Compare results against the last known-good baseline. If retrieval metrics drop after an index update, you know the new or modified documents are affecting retrieval quality. Store the index state (document count, last-modified timestamps) alongside your baselines for debugging.