Proof Bundle

A proof bundle is an immutable record that packages evaluation results, approval decisions, and deployment metadata into a single auditable artifact for AI pipeline governance. Think of it as the receipt for a production deployment: not just what was deployed, but the evidence that it was safe to deploy.

What a proof bundle contains

A proof bundle captures everything needed to answer the question: "Why was this version of the pipeline allowed into production?"

Evaluation results. The full output of every evaluation suite that ran against this pipeline version. Not just pass/fail — the individual scores, the test cases that were run, the rubrics that were used, and the thresholds that were configured. If an LLM-as-a-Judge scored responses, the judge's scores and rationales are included.

Pipeline version. A content-addressable identifier for the exact pipeline configuration that was evaluated: prompt versions, model identifiers, retrieval parameters, guardrail settings, post-processing logic. This is the full snapshot, not just a git SHA — because parts of the pipeline configuration may live outside your code repository.
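One common way to build such a content-addressable identifier is to hash a canonical serialization of the full configuration snapshot. A minimal sketch (the field names and `pv-` prefix here are illustrative, not Coverge's actual format):

```python
import hashlib
import json

def pipeline_version_id(config: dict) -> str:
    """Derive a content-addressable ID from the full pipeline config.

    Canonical JSON (sorted keys, fixed separators) makes the hash
    deterministic: the same configuration always yields the same ID.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return "pv-" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Example snapshot: everything that shapes behavior, not just code.
config = {
    "prompt_version": "summarize-v12",
    "model": "gpt-4o-2024-08-06",
    "retrieval": {"top_k": 8, "index": "docs-2024-10"},
    "guardrails": {"toxicity_filter": True},
}
vid = pipeline_version_id(config)
```

Because the ID is derived from content rather than from a commit pointer, a change to any setting, even one living outside the repository, produces a different identifier.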

Approval decision. Who or what approved the deployment. This might be an automated policy ("all eval scores above threshold, auto-approve"), a human approval with the approver's identity and timestamp, or a combination. The approval is cryptographically tied to the evaluation results, so you cannot approve one set of results and deploy a different version.
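One way to realize that cryptographic tie, sketched here with an HMAC purely for illustration (a real system might use asymmetric signatures and a different payload layout), is to sign a digest of the exact eval results together with the pipeline version:

```python
import hashlib
import hmac
import json

def _digest(eval_results: dict) -> str:
    """Deterministic digest of one specific set of eval results."""
    return hashlib.sha256(
        json.dumps(eval_results, sort_keys=True).encode()
    ).hexdigest()

def approve(eval_results: dict, pipeline_version: str, policy: str, key: bytes) -> dict:
    """Bind an approval to these results and this version, and sign it."""
    results_digest = _digest(eval_results)
    payload = f"{pipeline_version}:{results_digest}:{policy}"
    return {
        "pipeline_version": pipeline_version,
        "results_digest": results_digest,
        "policy": policy,
        "signature": hmac.new(key, payload.encode(), hashlib.sha256).hexdigest(),
    }

def verify(approval: dict, eval_results: dict, key: bytes) -> bool:
    """Reject if the results were swapped or the approval was tampered with."""
    if _digest(eval_results) != approval["results_digest"]:
        return False  # these are not the results that were approved
    payload = (f"{approval['pipeline_version']}:"
               f"{approval['results_digest']}:{approval['policy']}")
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(approval["signature"], expected)
```

The point of the construction is that an approval over one set of results cannot be re-attached to a different version or a different set of scores.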

Deployment target. Where this pipeline version was deployed: production, staging, a specific customer tenant, a canary group. The target matters because a version that is approved for staging is not necessarily approved for production.

Timestamps. When the evaluation ran, when the approval was granted, when the deployment was executed. These establish a timeline that auditors can verify.

How it differs from a deployment log

A deployment log tells you that a deployment happened. A proof bundle tells you that a deployment was justified.

Your CI/CD system's deployment log might record: "Pipeline v47 deployed to production at 14:32 UTC by GitHub Actions." That is useful for debugging, but it tells an auditor nothing about quality.

A proof bundle records: "Pipeline v47 was evaluated against 342 test cases across 5 evaluation suites. Accuracy scored 0.91 (threshold: 0.85). Toxicity scored 0.02 (threshold: 0.05). Latency p95 was 1.2s (threshold: 2.0s). All thresholds passed. Deployment was auto-approved by policy 'production-standard' at 14:30 UTC. Deployed to production at 14:32 UTC."

The difference is that the proof bundle establishes a chain of evidence: this specific version was tested, the tests showed it met quality bars, and the deployment was authorized based on those test results.
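The threshold check behind the example above can be sketched as follows. Note that direction matters: accuracy must meet or exceed its threshold, while toxicity and latency must stay below theirs (the result structure here is illustrative):

```python
# Example results from the v47 scenario: each metric carries its score,
# its configured threshold, and which direction counts as passing.
RESULTS = {
    "accuracy":    {"score": 0.91, "threshold": 0.85, "direction": "higher"},
    "toxicity":    {"score": 0.02, "threshold": 0.05, "direction": "lower"},
    "latency_p95": {"score": 1.2,  "threshold": 2.0,  "direction": "lower"},
}

def all_thresholds_pass(results: dict) -> bool:
    """True only if every metric passes in its configured direction."""
    for metric in results.values():
        if metric["direction"] == "higher":
            ok = metric["score"] >= metric["threshold"]
        else:
            ok = metric["score"] <= metric["threshold"]
        if not ok:
            return False
    return True
```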

Why proof bundles matter for compliance

Regulatory frameworks like the EU AI Act require that operators of high-risk AI systems maintain records of their system's operations and can demonstrate compliance with quality standards. The NIST AI Risk Management Framework similarly emphasizes the need for documented risk management and testing practices. The question is not "do you test your AI?" but "can you prove you tested this specific version that is running in production right now?"

Proof bundles answer that question by design. Each production deployment has a corresponding proof bundle that an auditor can inspect. The bundle is immutable — it cannot be retroactively modified to change scores or add approvals that did not happen.

For teams going through SOC 2, ISO 27001, or industry-specific compliance audits, proof bundles provide the documentation layer that maps AI-specific quality controls to audit requirements. Rather than the team assembling evidence manually before an audit, every deployment automatically produces its own evidence package.

How proof bundles connect to the deployment pipeline

In Coverge, proof bundles are a natural output of the deployment process, not an extra step:

  1. A developer pushes a change to a pipeline configuration
  2. An eval gate runs the configured evaluation suites
  3. The eval gate produces structured results with scores per dimension
  4. If all thresholds pass, the system checks the approval policy
  5. The approval (automated or human) is recorded
  6. A proof bundle is assembled from the eval results, approval, pipeline version snapshot, and deployment target
  7. The bundle is stored immutably and linked to the deployment record
  8. The deployment proceeds

If the eval gate fails, no proof bundle is created and no deployment happens. There is no way to produce a proof bundle without passing evaluation — the bundle is evidence of passage, not a formality.
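The flow above can be sketched as a single function whose structure enforces that invariant: the bundle is assembled only on the branch where the gate passed. The callables `run_evals`, `check_policy`, and `store` stand in for product APIs and are assumptions of this sketch:

```python
def deploy_with_proof(config: dict, target: str, run_evals, check_policy, store) -> str:
    """Run the gate, and produce a proof bundle only if it passes.

    run_evals(config)   -> dict with eval scores and an "all_passed" flag
    check_policy(res)   -> approval record (automated or human)
    store(bundle)       -> bundle ID from immutable storage
    """
    results = run_evals(config)            # steps 2-3: eval gate runs suites
    if not results["all_passed"]:
        # Gate failed: no bundle is ever created, no deployment happens.
        raise RuntimeError("eval gate failed; deployment blocked")
    approval = check_policy(results)       # steps 4-5: record the approval
    bundle = {                             # step 6: assemble the bundle
        "pipeline_version": results["version_id"],
        "eval_results": results,
        "approval": approval,
        "deployment_target": target,
    }
    return store(bundle)                   # step 7: persist, then deploy (step 8)
```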

Querying proof bundles

Proof bundles are most useful when they are queryable. When something goes wrong in production, the first thing you pull up is the proof bundle for the current deployment. Common queries: "which pipeline version is running and what were its eval scores?", "show all deployments where accuracy dipped below 0.90", "compare the eval results between v46 and v47."
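The queries above are easy to express once bundles are structured records. A minimal in-memory sketch (a real store would expose these as API or SQL queries; the record shape here is illustrative):

```python
# Hypothetical stored bundles, reduced to the fields the queries need.
bundles = [
    {"version": "pv-46", "target": "production", "scores": {"accuracy": 0.88}},
    {"version": "pv-47", "target": "production", "scores": {"accuracy": 0.91}},
]

# "Show all production deployments where accuracy dipped below 0.90."
low_accuracy = [
    b for b in bundles
    if b["target"] == "production" and b["scores"]["accuracy"] < 0.90
]

# "Compare the eval results between two versions" for one metric.
def score_delta(older: dict, newer: dict, metric: str) -> float:
    return newer["scores"][metric] - older["scores"][metric]
```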

This queryability turns governance from a checkbox exercise into an operational tool.

Further reading

  • AI audit trails — the broader logging strategy that proof bundles fit into
  • Eval gates — the quality checkpoints that produce the eval results in a proof bundle
  • AI governance — the framework that proof bundles support
  • LLM CI/CD — how proof bundles integrate into deployment pipelines
  • LLM evaluation — the evaluation process that generates scores for the bundle