LLM CI/CD: why your deployment pipeline needs an eval gate
By Coverge Team
Your CI/CD pipeline probably looks something like this: push code, run unit tests, run integration tests, build an artifact, deploy to staging, run smoke tests, promote to production. This workflow — what Martin Fowler described as continuous delivery — has been refined over two decades of software engineering and it works. For deterministic software.
LLM applications break this workflow in a fundamental way. The same input can produce different outputs across runs. A prompt change that improves responses for one class of queries can degrade another. Model provider updates can shift behavior without any change on your side. The test suite that gave you confidence in traditional software gives you false confidence here.
This is not a reason to abandon CI/CD. It is a reason to extend it. The core idea — automated quality gates between code change and production — is more important for LLM applications, not less. You just need different gates.
Why traditional CI/CD fails for LLMs
Traditional CI/CD relies on assertions: given input X, expect output Y. If the output matches, the test passes. If it doesn't, the test fails. This binary model works when your system is deterministic.
LLM applications have at least four properties that break this model.
Non-deterministic outputs
Even with temperature set to 0, LLM outputs are not perfectly reproducible. The same prompt can produce slightly different wording across API calls due to batching, hardware differences, and quantization effects. An assertion like expect(response).toBe("The capital of France is Paris.") will fail intermittently even when the model is performing correctly — it might return "Paris is the capital of France" or "The capital of France is Paris, which is also its largest city."
Multi-dimensional quality
A traditional function either returns the right answer or it doesn't. An LLM response can be factually correct but poorly formatted, or well-formatted but incomplete, or complete but tonally wrong. Quality is a vector, not a boolean. A CI pipeline that reduces this to pass/fail loses the information you need to make deployment decisions.
Slow feedback loops
A unit test runs in milliseconds. An LLM eval that sends 200 test cases through an API, waits for responses, and scores them takes minutes or tens of minutes. If your CI pipeline runs evals on every commit, developers wait. If it only runs them on merge, regressions slip through.
Coupled components
In a traditional application, you can test a function in isolation. In an LLM application, the output quality depends on the interaction between prompts, retrieval pipelines, model versions, and post-processing logic. Changing the chunk size in your RAG pipeline changes what context the model sees, which changes every response. Unit testing any single component misses these interaction effects.
Building an eval-gated pipeline
An eval-gated pipeline replaces binary test assertions with scored evaluations. Instead of "did the output match the expected string," it asks "did the output meet quality thresholds across multiple dimensions." Here is what the architecture looks like.
Step 1: Define your eval suite
An eval suite is a dataset of inputs paired with scoring criteria — not expected outputs. The difference matters.
// Bad: exact-match assertions that break with non-deterministic outputs
const tests = [
  { input: "Summarize this contract", expected: "This contract establishes..." }
];
// Good: scoring criteria that tolerate variation
const evalCases = [
  {
    input: "Summarize this contract",
    context: contractText,
    criteria: {
      factualAccuracy: "Summary must mention all parties, effective date, and termination clause",
      completeness: "Must cover payment terms and liability sections",
      length: { min: 100, max: 300 },
      format: "Plain prose, no bullet points"
    }
  }
];
Start with the failure modes you care about most. If your application summarizes documents, your eval cases should cover: factual accuracy (does the summary contain hallucinated information?), completeness (did it miss key sections?), format compliance (does it follow the output schema?), and safety (does it leak sensitive content from the context?).
Step 2: Build deterministic checks first
Before you reach for LLM-as-a-judge, extract every quality dimension that can be checked programmatically. These checks are fast, reliable, and free.
interface EvalResult {
  case_id: string;
  scores: Record<string, number>;
  pass: boolean;
}

function runDeterministicChecks(response: string, criteria: Criteria): Partial<EvalResult['scores']> {
  const scores: Record<string, number> = {};

  // Format compliance
  if (criteria.format === 'json') {
    try { JSON.parse(response); scores.format = 1.0; }
    catch { scores.format = 0.0; }
  }

  // Length bounds
  const wordCount = response.split(/\s+/).length;
  if (criteria.length) {
    scores.length = (wordCount >= criteria.length.min && wordCount <= criteria.length.max) ? 1.0 : 0.0;
  }

  // Refusal detection
  const refusalPatterns = /I cannot|I'm unable|as an AI|I don't have access/i;
  scores.noFalseRefusal = refusalPatterns.test(response) ? 0.0 : 1.0;

  // PII leakage check
  const piiPatterns = /\b\d{3}-\d{2}-\d{4}\b|\b\d{16}\b/;
  scores.noPiiLeakage = piiPatterns.test(response) ? 0.0 : 1.0;

  return scores;
}
Deterministic checks catch a surprising amount. In practice, format compliance failures and length violations account for a large share of production bugs in LLM applications. Catching them without an API call saves both time and money.
Step 3: Add LLM-as-a-judge for subjective quality
For dimensions like factual accuracy, helpfulness, and tone, use a separate LLM call to score the output using the LLM-as-a-judge pattern. The judge model should be at least as capable as the model being evaluated — using GPT-4o to judge GPT-4o-mini outputs works; the reverse does not.
async function judgeFactualAccuracy(
  input: string,
  context: string,
  response: string,
  criteria: string
): Promise<number> {
  const judgePrompt = `You are evaluating an AI response for factual accuracy.
Input: ${input}
Source context: ${context}
AI response: ${response}
Evaluation criteria: ${criteria}
Score the response from 0.0 to 1.0:
- 1.0: All claims are supported by the source context
- 0.7: Minor omissions but no fabricated information
- 0.4: Contains at least one unsupported claim
- 0.0: Contains fabricated information contradicting the source
Return only the numeric score.`;

  const result = await llm.complete({ prompt: judgePrompt, temperature: 0 });
  return parseFloat(result.trim());
}
Calibrate the judge against human ratings. Take 50 response pairs that humans have scored, run the judge on them, and check agreement. If the judge disagrees with humans more than 20% of the time, your rubric needs refinement — the criteria are probably too vague or the scoring scale too granular.
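One way to run that calibration check, sketched with illustrative names (judgeAgreement and the sample ratings below are not from any specific library): count a case as agreement when the judge and the human land within a small tolerance of each other, then compare the agreement rate against the 80% target.

```typescript
// Sketch: agreement between judge and human scores on a calibration set.
// The function name and sample ratings are illustrative assumptions.
function judgeAgreement(
  humanScores: number[],
  judgeScores: number[],
  tolerance: number = 0.1
): number {
  if (humanScores.length === 0 || humanScores.length !== judgeScores.length) {
    throw new Error("Score arrays must be non-empty and the same length");
  }
  const agreeing = humanScores.filter(
    (h, i) => Math.abs(h - judgeScores[i]) <= tolerance
  ).length;
  return agreeing / humanScores.length;
}

// 4 of 5 judge scores land within 0.1 of the human rating,
// right at the edge of the 80% agreement target.
const agreement = judgeAgreement(
  [1.0, 0.7, 0.4, 0.0, 0.7],
  [0.9, 0.7, 0.7, 0.0, 0.75]
);
console.log(agreement); // 0.8
```

If the rate comes in below the target, tighten the rubric before trusting the judge to gate deployments.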
Step 4: Set thresholds and gate deployments
This is where eval becomes CI/CD. Define minimum thresholds per metric and fail the pipeline if any threshold is breached.
# .coverge/eval-config.yaml
eval_gate:
  thresholds:
    factual_accuracy: 0.85
    format_compliance: 1.0
    no_pii_leakage: 1.0
    completeness: 0.75
    latency_p95_ms: 3000
  min_eval_cases: 100
  fail_on_regression: true
  regression_tolerance: 0.05  # fail if any metric drops >5% vs baseline
The fail_on_regression flag is the essence of an eval gate. Even if a change keeps every metric above its absolute threshold, a sudden 4% drop in factual accuracy signals that something changed and deserves investigation before deploying. Compare each eval run against a stored baseline from the last successful deployment.
Step 5: Wire it into your pipeline
Here is a simplified GitHub Actions workflow that demonstrates the eval gate concept:
# .github/workflows/llm-ci.yaml
name: LLM CI/CD
on: [push]

jobs:
  standard-checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run typecheck
      - run: npm run lint
      - run: npm test

  eval-gate:
    needs: standard-checks
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - name: Run eval suite
        run: npm run eval -- --output results.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Check thresholds
        run: npm run eval:check -- --results results.json --config .coverge/eval-config.yaml
      - name: Upload eval artifact
        uses: actions/upload-artifact@v4
        with:
          name: eval-results-${{ github.sha }}
          path: results.json

  deploy-canary:
    needs: eval-gate
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to 5% traffic
        run: ./scripts/deploy.sh --canary --percentage 5
      - name: Monitor canary (15 min)
        run: ./scripts/monitor-canary.sh --duration 900 --fail-on-regression
      - name: Promote or rollback
        run: ./scripts/promote-or-rollback.sh
The key structural change from a traditional pipeline: eval-gate sits between standard checks and deployment. Standard tests verify that your code compiles and your logic is correct. The eval gate verifies that your system produces good outputs. Both must pass before code reaches production.
This is the approach Coverge takes — every AI pipeline change goes through compilation, graph validation, and an eval suite before a human approves it. The eval results, approval decision, and deploy metadata are packaged into an immutable proof bundle that serves as the audit trail. The pipeline does not reach production without passing the gate and getting sign-off.
Testing non-deterministic outputs
The hardest part of LLM CI/CD is writing evaluations that are reliable enough to gate deployments. A flaky eval gate that randomly fails is worse than no gate — teams will learn to ignore it or skip it.
Statistical evaluation over single runs
Never make a deployment decision based on a single LLM response. Run each eval case multiple times and aggregate the scores.
async function evaluateWithConfidence(
  evalCase: EvalCase,
  runs: number = 5
): Promise<{ mean: number; stddev: number; pass: boolean }> {
  const scores: number[] = [];
  for (let i = 0; i < runs; i++) {
    const response = await runPipeline(evalCase.input);
    const score = await scoreResponse(response, evalCase.criteria);
    scores.push(score);
  }
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const variance = scores.reduce((a, b) => a + (b - mean) ** 2, 0) / (scores.length - 1);
  const stddev = Math.sqrt(variance);
  return {
    mean,
    stddev,
    pass: mean >= evalCase.threshold && stddev < 0.15
  };
}
The standard deviation check catches an important failure mode: a pipeline that scores 0.9 half the time and 0.5 half the time has a mean of 0.7, which might pass your threshold. But the high variance means users are getting unpredictable quality. Fail the eval if variance is too high, even when the mean looks acceptable.
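The arithmetic behind that failure mode, using the same sample-variance formula as evaluateWithConfidence (the variable names here are illustrative):

```typescript
// A bimodal pipeline: scores alternating 0.9 / 0.5 across eval runs.
const bimodalScores = [0.9, 0.5, 0.9, 0.5];
const bimodalMean =
  bimodalScores.reduce((a, b) => a + b, 0) / bimodalScores.length;
const bimodalStddev = Math.sqrt(
  bimodalScores.reduce((a, b) => a + (b - bimodalMean) ** 2, 0) /
  (bimodalScores.length - 1)
);
console.log(bimodalMean.toFixed(2));   // 0.70 — could clear a 0.7 mean threshold
console.log(bimodalStddev.toFixed(2)); // 0.23 — fails the stddev < 0.15 gate
```

The mean alone would pass; the variance gate is what exposes the inconsistency.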
Semantic similarity over exact matching
When you do need to compare outputs against reference answers, use embedding-based similarity rather than string matching.
async function semanticSimilarity(response: string, reference: string): Promise<number> {
  const [respEmbedding, refEmbedding] = await Promise.all([
    embed(response),
    embed(reference)
  ]);
  return cosineSimilarity(respEmbedding, refEmbedding);
}

// Use a threshold like 0.85 instead of exact equality
const score = await semanticSimilarity(response, referenceAnswer);
const passes = score >= 0.85;
This tolerates variation in phrasing while still catching substantive differences. "The contract is effective from January 1" and "Effective date: Jan 1" are semantically similar. "The contract has no effective date" is not.
Regression detection over absolute scoring
The most reliable eval signal is not "is this response good" but "is this response worse than what we had before." Store eval results from the last known good deployment as a baseline. Compare each new eval run against that baseline.
function detectRegression(
  current: EvalResults,
  baseline: EvalResults,
  tolerance: number = 0.05
): { regressed: boolean; details: string[] } {
  const details: string[] = [];
  let regressed = false;
  for (const metric of Object.keys(baseline.aggregates)) {
    const delta = current.aggregates[metric] - baseline.aggregates[metric];
    if (delta < -tolerance) {
      regressed = true;
      details.push(
        `${metric}: ${baseline.aggregates[metric].toFixed(3)} → ${current.aggregates[metric].toFixed(3)} (Δ${delta.toFixed(3)})`
      );
    }
  }
  return { regressed, details };
}
Regression detection is more reliable than absolute thresholds because it adapts to your system's actual performance level. If your factual accuracy has been running at 0.92, a drop to 0.87 is a meaningful signal even though 0.87 might look fine in absolute terms.
Deployment strategies for AI systems
Once your eval gate passes, you still need a deployment strategy that accounts for the uncertainty inherent in LLM applications. Pre-deploy evaluation cannot catch every production issue because production traffic is more diverse than any test suite.
Canary deployments
Route a small percentage of traffic (1-5%) to the new version. Monitor quality metrics for a defined window — the duration depends on your traffic volume and how quickly you can detect regressions with statistical significance.
The math matters here. If your baseline error rate is 2% and you want to detect a doubling to 4%, you need roughly 1,500 requests through the canary to detect the difference with 95% confidence. At 100 requests per minute to the canary, that takes 15 minutes. At 10 requests per minute, it takes 2.5 hours. Plan your canary duration based on actual traffic, not gut feel.
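The standard two-proportion sample-size formula makes this concrete. A sketch, assuming ~95% confidence and 80% power (the z-values below encode those assumptions; the exact count shifts with different power targets, which is why the figure above is a rough one):

```typescript
// Requests needed to distinguish baseline error rate p1 from degraded
// rate p2 with a two-proportion test. Assumes ~95% confidence, 80% power.
function canarySampleSize(p1: number, p2: number): number {
  const zAlpha = 1.96; // two-sided 95% confidence
  const zBeta = 0.84;  // 80% power
  const pBar = (p1 + p2) / 2;
  const numerator =
    zAlpha * Math.sqrt(2 * pBar * (1 - pBar)) +
    zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2));
  return Math.ceil(numerator ** 2 / (p2 - p1) ** 2);
}

// Detecting 2% → 4% comes out around 1,100–1,200 canary requests with
// these z-values; add a safety margin and you land near 1,500.
const needed = canarySampleSize(0.02, 0.04);
console.log(needed);
```

Note how the required count collapses for larger effects: a jump from 2% to 6% needs only a few hundred requests, which is why severe regressions surface fast while subtle ones need a long canary window.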
Blue-green with automated comparison
Maintain two identical environments. Route all traffic to green (current production). Deploy the new version to blue. Run a shadow copy of production traffic through both environments and compare outputs.
This is more expensive than canary deployment — you are running two environments and paying for double the LLM API calls — but it gives you full production traffic without any user exposure to the new version. The comparison can run for hours or days before you switch traffic.
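A shadow comparison can be sketched as follows. The pipelines, similarity function, and logger are injected parameters because the concrete implementations (your production and candidate pipelines) are assumptions, not a specific framework's API:

```typescript
interface PipelineResult {
  output: string;
  latencyMs: number;
}
type Pipeline = (input: string) => Promise<PipelineResult>;

// Run green (production) and blue (candidate) on the same request,
// log a comparison, and always serve the user green's response.
async function shadowCompare(
  input: string,
  green: Pipeline,
  blue: Pipeline,
  similarity: (a: string, b: string) => Promise<number>,
  log: (entry: object) => Promise<void>
): Promise<string> {
  const greenResult = await green(input);
  let blueResult: PipelineResult | null = null;
  try {
    blueResult = await blue(input); // shadow call
  } catch {
    blueResult = null; // a blue failure must never affect the user
  }
  if (blueResult) {
    const score = await similarity(greenResult.output, blueResult.output);
    await log({
      input,
      similarity: score,
      greenMs: greenResult.latencyMs,
      blueMs: blueResult.latencyMs,
    });
  }
  return greenResult.output; // the user only ever sees green's response
}
```

The awaits are sequential here for clarity; running both pipelines concurrently would hide most of the shadow call's latency.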
Progressive rollout with feature flags
For changes that affect a specific feature rather than the entire application, use feature flags to progressively increase exposure:
async function handleRequest(req: Request): Promise<Response> {
  const useNewPipeline = await featureFlag('new-summarizer-v2', {
    userId: req.userId,
    percentage: getGlobalRolloutPercentage()
  });
  const pipeline = useNewPipeline ? newSummarizer : currentSummarizer;
  const response = await pipeline.run(req.input);

  // Log which pipeline served this request for analysis
  await logPipelineMetrics({
    pipeline: useNewPipeline ? 'v2' : 'v1',
    latency: response.latencyMs,
    tokenCount: response.tokens,
    userId: req.userId
  });

  return response;
}
The advantage of feature flags is granularity. You can roll out to internal users first, then beta users, then 10%, 50%, 100%. At each stage, compare quality metrics between the two cohorts. If the new version is worse for any segment, roll back that segment without affecting others.
Rollback triggers
Regardless of which deployment strategy you use, define automatic rollback triggers. These should fire without human intervention.
Common triggers:
- Error rate exceeds 2x the baseline for more than 5 minutes
- Latency p95 exceeds the SLA for more than 10 minutes
- Quality score (from online eval sampling) drops more than 10% below baseline
- Cost per request exceeds 1.5x the projected budget
Wire these triggers into your deployment system so rollback happens in seconds, not the 45 minutes it takes when someone has to be paged, open a laptop, find the right commit, and deploy manually.
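A minimal sketch of how those triggers could be encoded, assuming metrics have already been aggregated over the relevant time windows upstream (the interface shapes and thresholds below mirror the list above but are otherwise illustrative):

```typescript
interface WindowMetrics {
  errorRate: number;      // fraction of failed requests in the window
  latencyP95Ms: number;
  qualityScore: number;   // from online eval sampling, 0.0–1.0
  costPerRequest: number; // dollars
}

interface Baseline {
  errorRate: number;
  latencySlaMs: number;
  qualityScore: number;
  budgetPerRequest: number;
}

// Returns the list of fired triggers; any non-empty result means
// the deployment system should roll back without human intervention.
function shouldRollback(current: WindowMetrics, baseline: Baseline): string[] {
  const reasons: string[] = [];
  if (current.errorRate > 2 * baseline.errorRate) reasons.push("error-rate");
  if (current.latencyP95Ms > baseline.latencySlaMs) reasons.push("latency-sla");
  if (current.qualityScore < baseline.qualityScore * 0.9) reasons.push("quality-regression");
  if (current.costPerRequest > 1.5 * baseline.budgetPerRequest) reasons.push("cost-overrun");
  return reasons;
}
```

Returning the fired trigger names, rather than a bare boolean, gives the rollback alert an immediate explanation of what went wrong.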
Practical considerations
Cost management
Running evals on every push gets expensive fast. A 200-case eval suite with LLM-as-a-judge scoring costs roughly $2-5 per run with GPT-4o. If your team pushes 50 times per day, that is $100-250 per day on CI evals alone.
Tier your eval strategy: run deterministic checks on every push (free and fast), run a small eval subset on every PR (50 cases, ~$1), run the full suite on merges to main (200+ cases, ~$5). This keeps feedback loops tight during development without burning through your eval budget. If you are evaluating hosted eval platforms, factor in their per-run pricing on top of the LLM API cost — some charge per eval, others per trace. Building a solid LLM evaluation practice upfront will save you from costly production incidents.
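Applied to a GitHub Actions workflow like the one above, the tiering can be expressed with job-level conditions. A sketch, where the --cases flag is a hypothetical parameter of your own eval runner, not a real tool's CLI:

```yaml
# Hypothetical tiering: small eval subset on PRs, full suite on main
eval-subset:
  if: github.event_name == 'pull_request'
  needs: standard-checks
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - run: npm ci
    - run: npm run eval -- --cases 50 --output results.json

eval-full:
  if: github.ref == 'refs/heads/main'
  needs: standard-checks
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - run: npm ci
    - run: npm run eval -- --cases all --output results.json
```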
Eval dataset maintenance
An eval suite that never changes becomes a Goodhart's Law trap — your team optimizes for the specific cases in the suite rather than for general quality. Rotate cases regularly. Add every production failure as a new eval case. Remove cases that have never failed and no longer test a relevant scenario.
The golden rule: if a bug reaches production, the query that exposed it becomes a permanent eval case. Your eval suite should be a living record of every failure mode your system has encountered.
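A sketch of that promotion step, with illustrative types (no specific framework assumed): every production incident becomes an eval case tagged with its provenance, so the suite records where each case came from.

```typescript
interface ProductionFailure {
  input: string;
  context: string;
  failureMode: string; // e.g. "hallucinated termination clause"
  incidentId: string;
}

interface EvalCase {
  input: string;
  context: string;
  criteria: Record<string, string>;
  source: string; // provenance: which incident produced this case
}

// Convert a production failure into a permanent eval case. The seeded
// criteria are a starting point; refine them by hand before committing.
function toEvalCase(failure: ProductionFailure): EvalCase {
  return {
    input: failure.input,
    context: failure.context,
    criteria: { mustNotRepeat: failure.failureMode },
    source: `incident:${failure.incidentId}`,
  };
}
```

The provenance field also makes pruning safer later: cases with no incident behind them are the first candidates for rotation.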
Handling model provider updates
When OpenAI updates gpt-4o or Anthropic ships a new Claude version, your eval gate catches the impact automatically — that is the whole point. But you need to handle these updates proactively rather than waiting for a regression.
Pin model versions explicitly in your config (e.g., gpt-4o-2024-08-06 instead of gpt-4o). When a new model version is available, create a branch that updates the pin, run the full eval suite, and review the results before merging. This gives you a deliberate upgrade path instead of a surprise behavior change.
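For example, a pinned model in an application config might look like this (the config shape is illustrative; the dated name follows OpenAI's model snapshot convention mentioned above):

```yaml
# Hypothetical application config with an explicit model pin
model:
  provider: openai
  name: gpt-4o-2024-08-06   # dated snapshot, not the floating "gpt-4o" alias
  temperature: 0
```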
This ties directly into the LLMOps practice of versioning everything — model versions are one of many components that need explicit tracking and controlled updates.
FAQ
What is LLM CI/CD?
LLM CI/CD extends traditional continuous integration and deployment with evaluation gates designed for non-deterministic AI systems. Instead of only running unit tests and type checks, the pipeline scores LLM outputs against quality criteria — factual accuracy, format compliance, safety — and blocks deployment if scores drop below thresholds or regress versus the production baseline.
Why can't I use regular unit tests for LLM applications?
Unit tests assert exact outputs: given input X, expect output Y. LLM outputs vary across runs even with identical inputs. The same prompt can produce different phrasing, different levels of detail, or different structure while being equally correct. You need evaluation methods that score quality across dimensions rather than asserting exact matches — semantic similarity, rubric-based LLM-as-a-judge scoring, and statistical aggregation across multiple runs.
How do I test non-deterministic LLM outputs reliably?
Three techniques work well together. First, run each eval case multiple times (3-5 runs) and check both mean score and variance — high variance signals unreliable quality even when the mean looks good. Second, use semantic similarity (embedding cosine distance) instead of exact string matching to tolerate phrasing variation. Third, compare against a baseline from the last successful deployment rather than relying on absolute thresholds, since regression detection is more reliable than fixed targets.
How much does an eval-gated pipeline cost to run?
A 200-case eval suite with LLM-as-a-judge scoring costs roughly $2-5 per run using GPT-4o. Tier your strategy to manage costs: deterministic checks on every push (free), a small eval subset on PRs ($1), and the full suite on merges to main ($5). For a team pushing 50 commits per day, this works out to approximately $50-75/day — a fraction of the cost of a single production incident caused by an untested deployment.
What deployment strategy should I use for LLM applications?
Start with canary deployments: route 1-5% of traffic to the new version and monitor quality metrics for a statistically significant window. If you need zero user exposure to untested changes, use blue-green deployment with shadow traffic comparison. For feature-level changes, progressive rollout with feature flags gives the most granular control. Regardless of strategy, define automatic rollback triggers — error rate spikes, latency SLA breaches, quality score regressions — so recovery happens in seconds without human intervention.
How does LLM CI/CD relate to LLMOps?
LLM CI/CD implements two of the core LLMOps best practices: "eval before deploy" and "separate build from deploy." For teams running RAG pipelines, the eval gate should include retrieval-specific metrics like faithfulness and context recall, which our RAG testing framework guide covers in depth. The eval gate ensures that every change to any component — prompts, retrieval config, model version, orchestration logic — is scored before it reaches production. The pipeline structure enforces a clear separation between building an artifact, evaluating it, and deploying it. Together with production monitoring and automated rollback, this forms the operational foundation that LLMOps is built on.