Updated: April 15, 2026

LLM regression testing: catching quality drift before your users do

By Coverge Team

You ship a prompt change on Tuesday. Wednesday morning, your support queue fills with tickets about the summarizer producing garbage for long documents. You didn't change anything about long documents -- you tweaked the system prompt for short-form queries. But the model's behavior shifted in ways you didn't test for, and now production is broken. Research on non-deterministic LLM behavior confirms that even temperature-zero outputs can vary across runs, making regression detection fundamentally harder than in traditional software.

This is what regression testing is supposed to prevent. In traditional software, regressions are changes that break previously working functionality. The same concept applies to LLM applications, but the mechanics are different. You can't assert on exact outputs. Quality degrades gradually rather than failing outright. And the failure surface is enormous because a single prompt change affects every query.

LLM regression testing is the practice of measuring whether a change -- to prompts, retrieval configs, model versions, or orchestration logic -- made things worse for cases that were already working. Done right, it catches the Tuesday-to-Wednesday disaster before it ships.

What makes LLM regressions different

In a traditional web application, a regression is a broken test. The function returned 4 when it should have returned 5. The API endpoint started throwing 500s. These failures are binary and immediate.

LLM regressions are different in three specific ways.

Quality is continuous, not binary

An LLM response is not right or wrong. It exists on a spectrum. Your summarizer might go from producing "good" summaries to "slightly worse" summaries -- still technically correct, but less concise, or missing a key detail 15% of the time instead of 5%. These partial regressions are hard to catch with pass/fail tests. You need scoring systems that detect shifts in a distribution.
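One way to make "detect shifts in a distribution" concrete: instead of asserting pass/fail per case, track the fraction of cases that score below a quality bar and compare that fraction between runs. A minimal sketch -- the scores, bar, and tolerance here are illustrative, not prescriptive:

```typescript
// Fraction of cases falling below a quality bar -- the "miss rate".
// This catches a shift like "missing a key detail 15% of the time
// instead of 5%" that a pass/fail assertion on any one case would not.
function missRate(scores: number[], bar: number): number {
  const misses = scores.filter((s) => s < bar).length;
  return misses / scores.length;
}

// Flag a regression when the miss rate grows beyond a tolerance.
function missRateRegressed(
  baseline: number[],
  current: number[],
  bar: number,
  tolerance: number
): boolean {
  return missRate(current, bar) - missRate(baseline, bar) > tolerance;
}
```

The same idea generalizes to any distributional statistic -- medians and tail percentiles are covered in the drift-detection section below.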

Regressions are correlated across dimensions

When you change a prompt, the regression often shows up in a dimension you were not optimizing for. You tighten the output format, and factual accuracy drops because the model spends its "attention budget" on formatting constraints. You improve accuracy on medical questions, and creative writing quality degrades. LLM behaviors are not independent modules -- they share a single underlying model, and changes leak across boundaries.

The blast radius is large and unpredictable

A code change to a function affects the call sites of that function. A prompt change affects every query that touches that prompt. If your system has a single system prompt used across all query types, the blast radius of any change is 100% of traffic. And you can't predict which queries will be affected based on the diff alone.

Building golden datasets

A golden dataset is the foundation of LLM regression testing. It is a curated set of inputs, expected behaviors (not exact outputs), and scoring criteria that represent the quality baseline of your system.

Structure of a golden dataset

Each entry in a golden dataset needs four components:

type GoldenCase = {
  id: string;
  input: {
    query: string;
    context?: string; // for RAG systems
    conversationHistory?: Message[]; // for multi-turn
  };
  referenceOutput?: string; // a known-good response for comparison
  scoringCriteria: {
    dimension: string; // e.g., "factual_accuracy", "format_compliance"
    evaluator: "deterministic" | "llm_judge" | "embedding_similarity";
    threshold: number;
    rubric?: string; // for LLM judges
  }[];
  metadata: {
    category: string; // e.g., "long_document_summary", "code_generation"
    addedDate: string;
    source: "manual" | "production_failure" | "adversarial";
    priority: "critical" | "standard";
  };
};

The referenceOutput field is optional but valuable. It gives you a comparison target for embedding-based similarity scoring and helps LLM judges calibrate their assessments. When you have a known-good response from a previous model version, save it.
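Embedding-based comparison against referenceOutput usually reduces to a cosine similarity between two vectors. A sketch, assuming an embed() function supplied by whichever embedding API you use -- the function name is a placeholder, not a real SDK call:

```typescript
// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0,
    normA = 0,
    normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score a candidate response against the stored referenceOutput.
// `embed` is a placeholder for your embedding provider's API.
async function similarityScore(
  candidate: string,
  referenceOutput: string,
  embed: (text: string) => Promise<number[]>
): Promise<number> {
  const [a, b] = await Promise.all([embed(candidate), embed(referenceOutput)]);
  return cosineSimilarity(a, b);
}
```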

Sourcing golden dataset entries

Start with three sources.

Production failures. Every bug report, every support ticket, every case where a user said "this is wrong" becomes a golden dataset entry. These are the highest-signal cases because they represent real failure modes. After investigation, annotate each case with the correct behavior (or at least the scoring criteria that would catch the failure), and add it to the dataset permanently.

Coverage-driven curation. Map out the categories of queries your system handles, following the same coverage mindset as evaluation-driven development. If your application does document summarization, those categories might be: short documents, long documents, technical documents, legal documents, multi-language documents, documents with tables, documents with images described in text. Create at least 5 golden cases per category. This is the equivalent of code coverage -- you want to exercise every meaningful path through your system.

Adversarial inputs. Add inputs designed to break your system. Edge cases, ambiguous queries, queries that test safety filters, queries at the boundary of your system's capabilities. These cases catch the regressions that optimistic testing misses.

// Building a golden dataset from production failures
async function addProductionFailure(
  incident: SupportTicket,
  correctBehavior: string
): Promise<GoldenCase> {
  const newCase: GoldenCase = {
    id: `prod-${incident.id}`,
    input: {
      query: incident.userQuery,
      context: incident.retrievedContext,
    },
    referenceOutput: correctBehavior,
    scoringCriteria: [
      {
        dimension: "correctness",
        evaluator: "llm_judge",
        threshold: 0.8,
        rubric: `The response must: ${correctBehavior}. It must NOT: ${incident.failureDescription}`,
      },
    ],
    metadata: {
      category: classifyQuery(incident.userQuery),
      addedDate: new Date().toISOString(),
      source: "production_failure",
      priority: "critical",
    },
  };

  await goldenDataset.insert(newCase);
  return newCase;
}

Dataset size guidelines

Bigger is not always better. A 50-case golden dataset that covers your critical paths well is more useful than a 2,000-case dataset full of redundant queries. Here is a practical starting point:

  • Critical paths: 20-30 cases covering the most common query types
  • Production failures: every failure, no cap (typically 10-30 cases after a few months)
  • Edge cases: 15-20 adversarial or boundary inputs
  • Per-category minimum: 5 cases per query category

That gives you roughly 50-100 cases total. Run the full suite on every merge to main. If your suite takes more than 10 minutes, tier it -- run the critical and production-failure cases on every PR, and the full suite on merges.
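Tiering can be as simple as filtering the dataset on the priority and source fields before a run. A sketch against a minimal slice of the GoldenCase shape defined earlier (the tier names are illustrative):

```typescript
// Minimal slice of the GoldenCase shape -- just enough to tier on.
type TierableCase = {
  id: string;
  metadata: { priority: "critical" | "standard"; source: string };
};

// PRs run only critical and production-failure cases; merges run everything.
function selectCases<T extends TierableCase>(
  dataset: T[],
  tier: "pr" | "merge"
): T[] {
  if (tier === "merge") return dataset;
  return dataset.filter(
    (c) =>
      c.metadata.priority === "critical" ||
      c.metadata.source === "production_failure"
  );
}
```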

Detecting quality drift

Quality drift is the slow degradation of output quality over time. It happens because of model version updates, changes to upstream dependencies, data distribution shifts, or the accumulation of small prompt changes that individually look fine but collectively degrade performance.

Baseline snapshots

Every successful deployment creates a new baseline. Store the eval results -- per-case scores, aggregate metrics, distribution statistics -- alongside the deployment artifact.

type BaselineSnapshot = {
  deploymentId: string;
  timestamp: string;
  commitSha: string;
  modelVersion: string;
  results: {
    caseId: string;
    scores: Record<string, number>;
  }[];
  aggregates: {
    metric: string;
    mean: number;
    p5: number;
    p50: number;
    p95: number;
    stddev: number;
  }[];
};

async function captureBaseline(
  deploymentId: string,
  evalResults: EvalResult[]
): Promise<BaselineSnapshot> {
  const aggregates = computeAggregates(evalResults);
  const snapshot: BaselineSnapshot = {
    deploymentId,
    timestamp: new Date().toISOString(),
    commitSha: process.env.GIT_SHA!,
    modelVersion: process.env.MODEL_VERSION!,
    results: evalResults.map((r) => ({
      caseId: r.caseId,
      scores: r.scores,
    })),
    aggregates,
  };

  await storage.saveBaseline(snapshot);
  return snapshot;
}

Statistical regression detection

Comparing two numbers (current mean vs. baseline mean) is too noisy for LLM evaluation. A single outlier case can swing the mean. Use distributional comparison instead.

function detectRegression(
  current: BaselineSnapshot,
  baseline: BaselineSnapshot,
  config: RegressionConfig
): RegressionReport {
  const regressions: RegressionDetail[] = [];

  for (const metric of config.trackedMetrics) {
    const baselineScores = baseline.results.map(
      (r) => r.scores[metric]
    );
    const currentScores = current.results.map(
      (r) => r.scores[metric]
    );

    // Compare medians rather than means (less sensitive to outliers)
    const baselineMedian = percentile(baselineScores, 50);
    const currentMedian = percentile(currentScores, 50);
    const medianDelta = currentMedian - baselineMedian;

    // Check the tail: did the worst cases get worse?
    const baselineP5 = percentile(baselineScores, 5);
    const currentP5 = percentile(currentScores, 5);
    const tailDelta = currentP5 - baselineP5;

    // Per-case regression: which specific cases got worse?
    const caseRegressions = findCaseRegressions(
      current.results,
      baseline.results,
      metric,
      config.perCaseTolerance
    );

    if (
      medianDelta < -config.medianTolerance ||
      tailDelta < -config.tailTolerance ||
      caseRegressions.length > config.maxRegressedCases
    ) {
      regressions.push({
        metric,
        medianDelta,
        tailDelta,
        regressedCases: caseRegressions,
      });
    }
  }

  return {
    passed: regressions.length === 0,
    regressions,
    summary: formatRegressionSummary(regressions),
  };
}

Three signals matter:

  1. Median shift. If the median score drops by more than your tolerance (typically 0.03-0.05 on a 0-1 scale), something changed globally.
  2. Tail degradation. If the 5th percentile drops, your worst cases got worse. This often matters more than the median because tail cases are the ones that generate support tickets.
  3. Per-case regression count. If more than N individual cases regressed (their score dropped by more than the per-case tolerance), the change has a wide blast radius even if the aggregate metrics look acceptable.

Drift monitoring in production

Regression testing happens pre-deploy, but drift happens continuously. Model providers update endpoints. User behavior shifts. Retrieved documents change. Set up a scheduled job that runs your golden dataset against production on a regular cadence -- daily for high-stakes systems, weekly for lower-stakes ones.

// Scheduled drift check
async function checkDrift(): Promise<void> {
  const currentBaseline = await storage.getLatestBaseline();
  const freshResults = await runEvalSuite(goldenDataset, {
    environment: "production",
  });

  const report = detectRegression(
    snapshotFromResults(freshResults),
    currentBaseline,
    DRIFT_CONFIG
  );

  if (!report.passed) {
    await alerting.send({
      channel: "llm-quality",
      severity: "warning",
      message: `Quality drift detected:\n${report.summary}`,
    });
  }

  await storage.saveDriftCheck(report);
}

This catches the regressions that pre-deploy testing misses: model version changes from your provider, shifts in your RAG index, and slow accumulation of configuration changes across services. Drift monitoring is a key component of any mature LLM observability practice.

Building an automated regression suite

Putting it all together into a suite that runs in CI/CD requires a few structural decisions.

Suite organization

Organize your regression suite by layer, not by feature. Each layer has its own performance characteristics and cost profile:

regression/
  deterministic/        # Format, schema, length checks -- free, fast
    format-compliance.ts
    output-schema.ts
    safety-filters.ts
  semantic/             # Embedding similarity -- cheap, medium speed
    response-similarity.ts
    topic-drift.ts
  judge/                # LLM-as-a-judge -- expensive, slow
    factual-accuracy.ts
    helpfulness.ts
    instruction-following.ts
  golden-dataset/
    cases.json          # The golden dataset
    baselines/          # Historical baseline snapshots

Run deterministic checks on every push. Run semantic checks on every PR. Run judge-based checks on merges to main. This tiered approach keeps feedback fast during development while still catching subtle regressions before deploy.

CI/CD integration

Wire the regression suite into your existing pipeline as a gate between build and deploy. Here is a GitHub Actions example:

# .github/workflows/llm-regression.yml
name: LLM Regression Gate

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  deterministic-checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run format and schema checks
        run: bun run regression:deterministic

  semantic-checks:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    steps:
      - uses: actions/checkout@v4
      - name: Run semantic similarity checks
        run: bun run regression:semantic
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

  full-regression:
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    needs: [deterministic-checks]
    steps:
      - uses: actions/checkout@v4
      - name: Run full regression suite
        run: bun run regression:full
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

      - name: Compare against baseline
        run: bun run regression:compare

      - name: Gate deployment
        run: |
          if [ -f regression-report.json ]; then
            PASSED=$(jq '.passed' regression-report.json)
            if [ "$PASSED" != "true" ]; then
              echo "Regression detected. Blocking deployment."
              jq '.regressions' regression-report.json
              exit 1
            fi
          fi

      - name: Update baseline on pass
        if: success()
        run: bun run regression:update-baseline

The key integration point with your eval-gated CI/CD pipeline is the comparison step. The regression suite does not just check absolute quality -- it compares against the stored baseline and blocks deployment if quality dropped. This is a more sensitive signal than threshold checks alone, because it detects relative degradation even when absolute scores look acceptable.

Handling flaky evaluations

LLM-as-a-judge evaluations are inherently noisy. As OpenAI's documentation on reproducible outputs notes, even deterministic settings do not guarantee identical results across runs. The same case scored by the same judge model can get 0.8 on one run and 0.7 on the next. This noise causes false positives in regression detection.

Three strategies reduce flakiness:

Run each case multiple times. Score each golden case 3 times and take the median. This adds cost but dramatically reduces variance. For a 100-case dataset, going from 1 run to 3 runs increases your judge API cost by 3x but cuts false-positive regressions by roughly 60%.
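The median-of-N approach is a few lines of code. In this sketch, scoreCase stands in for whatever judge call your eval harness makes -- it is a placeholder, not a real API:

```typescript
// Median of a sample of scores (exact for odd-length samples).
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length / 2)];
}

// Score a case `runs` times and take the median -- a simple variance
// reducer for noisy LLM-as-a-judge scores. `scoreCase` is a placeholder
// for your judge call.
async function stableScore(
  caseId: string,
  scoreCase: (caseId: string) => Promise<number>,
  runs = 3
): Promise<number> {
  const scores = await Promise.all(
    Array.from({ length: runs }, () => scoreCase(caseId))
  );
  return median(scores);
}
```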

Use wider tolerances for noisy metrics. If your helpfulness judge has high variance (stddev > 0.1 across repeated runs), use a wider regression tolerance for that metric. Tight tolerances on noisy metrics just cause alert fatigue.

Separate deterministic from stochastic failures. If a case fails deterministic checks (wrong format, safety violation, schema mismatch), that is a real regression regardless of run-to-run variance. Reserve the statistical comparison for stochastic judge-based metrics.

Connecting regression testing to your eval stack

Regression testing is one layer of a broader evaluation strategy. Here is how the pieces fit together:

  • Unit evals test individual components (a single prompt template, a single retrieval function) in isolation
  • Regression testing (this article) tests the integrated system against a golden dataset, comparing new results to a known baseline
  • Online evaluation samples production traffic and scores it in near-real-time, catching issues that pre-deploy testing cannot

The regression suite draws on the same scoring functions and judge prompts as your other eval layers. If you are building evaluation infrastructure, build it as a shared library that all three layers use. If you are evaluating platforms that provide this tooling, check whether their evaluation SDK supports both pre-deploy regression testing and post-deploy monitoring with the same configuration.

Regression testing also feeds directly into your CI/CD pipeline as the quality gate between build and deploy. The pipeline orchestration decides when to run the regression suite; the suite itself decides whether the deployment should proceed. For teams building RAG systems, the same regression approach extends to retrieval metrics like context recall and faithfulness, which is covered in our RAG evaluation guide. And for a broader view of the operational practices that keep LLMOps deployments reliable, see our LLMOps best practices.

Common mistakes

Testing on the training distribution only. If your golden dataset only contains queries your system handles well, you will never catch regressions on edge cases. Include queries at the boundary of your system's capabilities -- they regress first.

Never pruning the dataset. A golden dataset that grows without pruning becomes slow and expensive. Remove cases that have never regressed and no longer test a meaningful scenario. Keep the dataset focused on cases that actually discriminate between good and bad deployments.

Ignoring the baseline update strategy. If you update the baseline after every deployment (including bad ones that slipped through), you are normalizing degraded quality. Only update the baseline after deployments that are confirmed good -- either through manual review or after a stabilization period in production.

Treating all regressions equally. A 0.02 drop in helpfulness for creative writing is not the same as a 0.02 drop in factual accuracy for medical queries. Weight your regression signals by business impact. Flag critical-path regressions as blocking, and surface non-critical regressions as warnings.
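Impact weighting can be encoded as a per-category severity map consulted when a regression is reported. The categories and severities below are illustrative -- yours come from your own business priorities:

```typescript
type Severity = "blocking" | "warning";

// Illustrative mapping from query category to how seriously a
// regression in that category should be treated.
const CATEGORY_SEVERITY: Record<string, Severity> = {
  medical_queries: "blocking",
  billing_questions: "blocking",
  creative_writing: "warning",
};

// Regressions in unmapped categories default to warnings, so new
// categories never silently block deploys.
function regressionSeverity(category: string): Severity {
  return CATEGORY_SEVERITY[category] ?? "warning";
}
```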


FAQ

What is LLM regression testing?

LLM regression testing measures whether changes to an LLM application -- prompt updates, model version swaps, retrieval config changes, or code modifications -- degrade output quality for previously working cases. It compares evaluation scores against a stored baseline from the last known-good deployment, flagging any statistical decrease in quality metrics like accuracy, faithfulness, or format compliance.

How do I build a golden dataset for LLM testing?

Start with three sources: production failures (every bug report becomes a test case), coverage-driven curation (at least 5 cases per query category your system handles), and adversarial inputs (edge cases designed to expose weaknesses). Each entry should include the input, optional reference output, scoring criteria with thresholds, and metadata about the case's source and priority. A starting dataset of 50-100 cases is enough for most systems.

What is quality drift in LLM applications?

Quality drift is the gradual degradation of LLM output quality over time, caused by model provider updates, changes in retrieved documents, shifts in user behavior, or accumulated configuration changes. Unlike sudden regressions from a specific code change, drift happens continuously and is only visible through regular scheduled evaluation against a stable golden dataset -- comparing current production behavior against a known-good baseline.

How do I integrate LLM regression testing into CI/CD?

Tier your regression suite by cost and speed. Run deterministic checks (format, schema, length) on every push -- they are free and fast. Run semantic similarity checks on every pull request. Run the full regression suite with LLM-as-a-judge scoring on merges to main. Compare results against the stored baseline, and block deployment if quality metrics drop below tolerance thresholds. Store the new baseline only after a confirmed-good deployment.

How many test cases do I need in a regression suite?

Start with 50-100 cases: 20-30 covering critical paths, all production failures (no cap), 15-20 adversarial inputs, and at least 5 per query category. Quality of coverage matters more than raw count. A focused 50-case suite that exercises every meaningful path is more effective than 2,000 redundant cases. Scale up when you identify coverage gaps through production incidents.

How do I handle flaky LLM evaluations in CI?

Three strategies: run each case 3 times and take the median score to reduce variance, use wider regression tolerances for inherently noisy metrics (like helpfulness judgments), and separate deterministic failures (format violations, schema mismatches) from stochastic judge-based metrics. Deterministic failures are always real regressions; statistical comparison is only needed for subjective quality dimensions.