Updated: April 15, 2026

AI agent testing: why traditional testing breaks and what to do instead

By Coverge Team

Your unit tests pass. Your integration tests pass. Your type checker is green. You deploy your agent to production, and within an hour it hallucinates a product that does not exist, recommends a competitor's pricing page, and tells a customer their refund has been processed when the refund API was never called.

This is the fundamental problem with testing AI agents: the techniques that work for deterministic software do not transfer. An agent that passes every test you wrote can still fail in ways you never anticipated, because its behavior is shaped by model weights, prompt context, and runtime inputs that combine in ways no finite test suite can enumerate.

Search interest in "ai agent testing" reached 44 monthly queries in early 2026, growing 228% year-over-year. The raw number is small, but the growth rate signals that teams are moving past "does my prompt work?" and into "how do I know my agent system is reliable enough to ship?"

This guide covers why traditional testing approaches break for agents, how to build eval-based test suites for multi-agent pipelines, and how proof bundles turn test evidence into deployable artifacts. For the broader agent platform picture, see the AI agent platform guide. For evaluation methodology beyond agents, see the LLM evaluation guide.

Why traditional testing fails for agents

Traditional software testing rests on determinism. Given input X, the function returns Y. You assert that Y matches expectations. If it does, the code works. If it does not, you have a bug.

Agents violate this assumption at every level.

Non-deterministic outputs. The same prompt with the same input can produce different outputs across runs. Temperature settings, model updates, and even token sampling variance mean you cannot write expect(agent.run("summarize this")).toBe("exact string"). The output space is too large for exact matching.

Emergent control flow. A well-tested agent might decide to call tools in an unexpected order, retry a step it has never retried before, or interpret an ambiguous input in a way no test case covered. The control flow is not defined in your code — it emerges from the model's reasoning at runtime.

Cascading failures in pipelines. When agent B consumes agent A's output, a subtle quality degradation in agent A can cause a complete failure in agent B. The individual agent tests pass because each agent works fine with well-formed input. The pipeline fails because real inter-agent communication is messy.

Stateful context accumulation. Agents that run multi-turn conversations or accumulate context across steps build up internal state that is invisible to per-call tests. The 50th message in a conversation might trigger a failure that no short test conversation would catch.

Model version drift. Your tests pass today. The model provider ships a minor update next Tuesday. Your tests still pass, but the agent's behavior in production has shifted enough that downstream systems break. Traditional tests do not catch behavioral drift — they catch regressions against exact expectations.

The result is that teams build a false sense of security. They have 95% test coverage, green CI, and agents that fail unpredictably in production. The LangChain State of Agent Engineering survey (1,340 respondents, December 2025) quantified this gap: 89% of teams have observability in place, but only 52% run automated evaluations. Nearly 23% of organizations with agents in production run no evaluations at all. Teams watch their agents more than they test them.

Testing non-deterministic systems

If you cannot assert exact outputs, what do you assert? The answer is properties, not values.

Property-based assertions

Instead of checking that the agent returns a specific string, check properties of the response:

  • Structural correctness. Does the response contain the required fields? Does the JSON parse? Does the tool call match the expected schema?
  • Semantic relevance. Is the response about the right topic? Does it address the user's question? This requires an LLM-as-a-judge or embedding similarity check, not a string comparison.
  • Safety boundaries. Does the response avoid forbidden content? Does it stay within the guardrails you defined? Does it refuse requests it should refuse?
  • Factual grounding. When the agent cites data, can you trace that data back to a source document? This matters especially for RAG-based agents — see our RAG evaluation glossary for terminology.
  • Cost and latency bounds. Does the agent complete within acceptable time and token budgets? A correct answer that costs $2 per query is still a failure in most production contexts.
A property-based test harness for these checks might look like the following sketch, where agent and scoreRelevance stand in for your own runner and judge model:

declare const agent: { run: (input: string) => Promise<{ response: string; toolCalls: Array<{ name: string; args: Record<string, unknown> }>; tokenUsage: number; latencyMs: number }> };
declare function scoreRelevance(query: string, response: string): Promise<number>;

interface TestCase {
  input: string;
  requiredProperties: {
    maxTokens?: number;
    maxLatencyMs?: number;
    minRelevance?: number;
    forbiddenPatterns?: RegExp[];
    requiredToolCalls?: string[];
  };
}

async function runPropertyTest(tc: TestCase) {
  const result = await agent.run(tc.input);
  const props = tc.requiredProperties;
  const failures: string[] = [];

  if (props.maxTokens != null && result.tokenUsage > props.maxTokens) {
    failures.push(`Token usage ${result.tokenUsage} exceeds max ${props.maxTokens}`);
  }
  if (props.maxLatencyMs != null && result.latencyMs > props.maxLatencyMs) {
    failures.push(`Latency ${result.latencyMs}ms exceeds max ${props.maxLatencyMs}ms`);
  }
  if (props.minRelevance != null) {
    const relevance = await scoreRelevance(tc.input, result.response);
    if (relevance < props.minRelevance) {
      failures.push(`Relevance ${relevance} below minimum ${props.minRelevance}`);
    }
  }
  if (props.forbiddenPatterns) {
    for (const pattern of props.forbiddenPatterns) {
      if (pattern.test(result.response)) {
        failures.push(`Response matched forbidden pattern: ${pattern}`);
      }
    }
  }
  if (props.requiredToolCalls) {
    const calledTools = result.toolCalls.map((call) => call.name);
    for (const required of props.requiredToolCalls) {
      if (!calledTools.includes(required)) {
        failures.push(`Expected tool call '${required}' not found`);
      }
    }
  }

  return { passed: failures.length === 0, failures };
}

Statistical pass criteria

A single test run is not meaningful for a non-deterministic system. Run each test case multiple times and evaluate the distribution.

If your agent answers correctly 92 out of 100 runs, that is a 92% pass rate — and you need to decide if that is acceptable for your use case. Customer-facing agents might need 98%+. Internal research assistants might accept 85%. The pass threshold is a product decision, not an engineering one.
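
A minimal sketch of this statistical gate, assuming a per-attempt runner you supply (here a hypothetical runOnce callback that reports pass/fail for a single run):

```typescript
// Sketch: run a non-deterministic test case N times and gate on the observed
// pass rate. `runOnce` is a hypothetical per-attempt runner you supply.
interface StatisticalResult {
  passes: number;
  runs: number;
  passRate: number;
  accepted: boolean;
}

async function statisticalGate(
  runOnce: () => Promise<boolean>,
  runs: number,
  threshold: number // e.g. 0.98 for customer-facing agents
): Promise<StatisticalResult> {
  let passes = 0;
  for (let i = 0; i < runs; i++) {
    if (await runOnce()) passes++;
  }
  const passRate = passes / runs;
  return { passes, runs, passRate, accepted: passRate >= threshold };
}
```

In practice, runOnce would wrap a property test like runPropertyTest above; the threshold argument is where the product decision gets encoded.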

Promptfoo handles this well — you define test cases with assertions, run them across multiple configurations, and get a matrix of pass/fail results. It supports LLM-graded assertions, which means you can write "the response should be helpful and accurate" as a test assertion and have a judge model score it. Promptfoo has over 20,000 GitHub stars, supports red-teaming with 50+ adversarial attack plugins, and was acquired by OpenAI in March 2026 — a signal that the foundation model labs view eval tooling as strategic infrastructure. For a deeper comparison, see our Promptfoo alternative analysis.

DeepEval takes a metrics-first approach with 33+ evaluation metrics including G-Eval, faithfulness, answer relevancy, contextual recall, hallucination detection, and agent-specific metrics like goal accuracy, tool correctness, and step efficiency. Each metric produces a score between 0 and 1 with an explanation of why the score was assigned. DeepEval's tool correctness evaluation checks at three strictness levels: correct tool name, correct name plus parameters, and correct name plus parameters plus output. It has over 14,000 GitHub stars, 3 million monthly downloads, and integrates directly with pytest. See our DeepEval alternative comparison.

Offline eval vs production monitoring

Testing happens in two phases that serve different purposes.

Offline evaluation (pre-deployment)

Offline evals run against a fixed dataset before you deploy. They answer: "Is this version at least as good as what is currently in production?"

Build your offline eval suite around:

  • Golden datasets. Curated input-output pairs where the expected output has been human-reviewed. Start with 50-100 cases. Grow them based on production failures.
  • Regression tests. Specific inputs that caused past failures. Every production bug becomes a test case. This is the single most valuable testing practice for agents.
  • Adversarial inputs. Inputs designed to break the agent — prompt injections, out-of-scope requests, edge cases in tool calling. Promptfoo's red-teaming plugins generate these automatically.
  • Behavioral consistency checks. Rephrase the same question 5 different ways. The agent should give semantically equivalent answers. Large variance across rephrasings signals brittleness.
A regression harness that compares the current production version against a candidate might be sketched like this, with scoreAnswer standing in for your judge:

declare const currentAgent: { run: (input: string) => Promise<{ response: string }> };
declare const candidateAgent: { run: (input: string) => Promise<{ response: string }> };
declare function scoreAnswer(input: string, response: string, reference: string): Promise<number>;

interface GoldenCase {
  input: string;
  referenceOutput: string;
  minScore: number;
}

async function regressionTest(goldenSet: GoldenCase[]) {
  const results = await Promise.all(
    goldenSet.map(async (gc) => {
      const [currentResult, candidateResult] = await Promise.all([
        currentAgent.run(gc.input),
        candidateAgent.run(gc.input),
      ]);
      const [currentScore, candidateScore] = await Promise.all([
        scoreAnswer(gc.input, currentResult.response, gc.referenceOutput),
        scoreAnswer(gc.input, candidateResult.response, gc.referenceOutput),
      ]);
      return {
        input: gc.input,
        currentScore,
        candidateScore,
        regression: candidateScore < currentScore - 0.05,
        belowMinimum: candidateScore < gc.minScore,
      };
    })
  );

  const regressions = results.filter((r) => r.regression);
  const belowMinimum = results.filter((r) => r.belowMinimum);

  return {
    passed: regressions.length === 0 && belowMinimum.length === 0,
    totalCases: results.length,
    regressions,
    belowMinimum,
  };
}

Production monitoring (post-deployment)

Production monitoring catches what offline evals miss — the inputs you did not anticipate, the model drift you did not simulate, the edge cases that only appear at scale.

Key production signals to monitor:

  • User feedback signals. Thumbs up/down, regeneration requests, conversation abandonment. These are noisy but directional.
  • LLM-as-a-judge scoring. Sample a percentage of production responses and score them asynchronously. This catches quality drift before users report it. The LLM evaluation guide covers judge implementation patterns.
  • Tool call success rates. Are the agent's tool calls succeeding? A spike in tool errors often indicates the agent is constructing bad arguments.
  • Cost anomalies. A sudden increase in tokens-per-request might mean the agent is looping or the model is generating verbose outputs after an update.
  • Trace anomalies. Unusual span durations, unexpected tool call sequences, or missing spans in the trace. The AI agent observability guide covers instrumentation in detail.
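
The judge-sampling signal above can be sketched as a simple sampler that diverts a fraction of production traffic into an async scoring queue. All names here (makeSampler, ProductionSample) are illustrative, not any particular library's API:

```typescript
// Sketch: sample a fraction of production responses for asynchronous
// LLM-as-a-judge scoring. A sampleRate of 0.05 scores roughly 5% of traffic.
// The `random` parameter is injectable so the sampler can be tested.
interface ProductionSample {
  input: string;
  response: string;
}

function makeSampler(sampleRate: number, random: () => number = Math.random) {
  const queue: ProductionSample[] = [];
  return {
    // Call on every production response; enqueues a subset for scoring.
    maybeEnqueue(sample: ProductionSample): boolean {
      if (random() < sampleRate) {
        queue.push(sample);
        return true;
      }
      return false;
    },
    // A background worker drains the queue and sends batches to the judge.
    drain(): ProductionSample[] {
      return queue.splice(0, queue.length);
    },
  };
}
```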

The feedback loop matters: production failures become regression tests, which become part of your offline eval suite, which gates the next deployment. Without this loop, your eval suite is frozen in time while production evolves.

Building test suites for multi-agent pipelines

Single-agent testing is hard. Multi-agent pipeline testing is a different order of difficulty because you need to test not just each agent, but their interactions.

Layer your tests

Layer 1: Individual agent evals. Test each agent in isolation with well-formed inputs. This catches regressions in individual agents but misses integration failures.

Layer 2: Interface contract tests. Define the schema for each agent's output — what fields it must contain, what format they must be in, what ranges are acceptable. Test that each agent's output conforms to the schema that the next agent expects. This catches "agent A changed its output format and broke agent B" without running the full pipeline.

Layer 3: Pipeline integration evals. Run the full pipeline end-to-end with golden inputs. Measure quality at the final output, but also capture intermediate outputs for debugging. When a pipeline eval fails, the intermediate captures tell you which agent in the chain caused the failure.

Layer 4: Adversarial pipeline tests. Feed deliberately bad output from agent A into agent B. Does agent B handle it gracefully, or does it hallucinate its way to a confident-sounding wrong answer? These tests catch the cascading failure patterns that individual agent tests miss.

Layers 2 and 3 might be combined in one harness for a research-then-write pipeline, checking the research agent's contract before scoring the end-to-end output:

declare const researchAgent: { run: (input: string) => Promise<{ sources: Array<{ url: string; snippet: string }>; summary: string }> };
declare const writerAgent: { run: (context: { sources: Array<{ url: string; snippet: string }>; summary: string; topic: string }) => Promise<{ draft: string; citations: string[] }> };
declare function scoreGroundedness(draft: string, sources: Array<{ snippet: string }>): Promise<number>;

async function testPipelineIntegration(topic: string) {
  const research = await researchAgent.run(topic);

  const contractChecks = {
    hasSources: research.sources.length > 0,
    sourcesHaveUrls: research.sources.every((s) => s.url.startsWith("http")),
    sourcesHaveSnippets: research.sources.every((s) => s.snippet.length > 0),
    hasSummary: research.summary.length > 50,
  };

  const contractFailures = Object.entries(contractChecks)
    .filter(([, passed]) => !passed)
    .map(([check]) => check);

  if (contractFailures.length > 0) {
    return { stage: "research", passed: false, failures: contractFailures };
  }

  const draft = await writerAgent.run({ sources: research.sources, summary: research.summary, topic });

  const groundedness = await scoreGroundedness(draft.draft, research.sources);

  return {
    stage: "full-pipeline",
    passed: groundedness > 0.7 && draft.citations.length > 0,
    groundedness,
    citationCount: draft.citations.length,
  };
}

Testing inter-agent communication

The hardest bugs to catch are in the data flowing between agents. Agent A might return technically valid output that agent B interprets incorrectly. A few patterns help:

  • Log intermediate outputs. Always capture what each agent passed to the next. This is not just for debugging — it is your primary forensic tool when pipeline quality drops.
  • Score intermediate quality. Do not wait until the final output to measure quality. Score the research agent's sources for relevance before the writer agent consumes them. Catch garbage early.
  • Inject known-bad intermediates. Build test cases where you manually feed agent B a degraded version of agent A's output. Measure how gracefully the pipeline degrades. A good pipeline produces lower-quality output; a bad pipeline hallucinates.
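
The known-bad injection pattern can be sketched as a small corruption helper. The Source shape mirrors the pipeline example above; degradeSources and the degradation modes are illustrative assumptions:

```typescript
// Sketch: deliberately degrade the research agent's output before handing it
// to a downstream writer, to test graceful degradation.
interface Source { url: string; snippet: string }

type Degradation = "truncate-snippets" | "drop-half" | "empty";

function degradeSources(sources: Source[], mode: Degradation): Source[] {
  switch (mode) {
    case "truncate-snippets":
      // Keep the URLs but gut the content the writer depends on.
      return sources.map((s) => ({ ...s, snippet: s.snippet.slice(0, 10) }));
    case "drop-half":
      return sources.slice(0, Math.ceil(sources.length / 2));
    case "empty":
      return [];
  }
}
```

Run the pipeline once per degradation mode and compare groundedness scores. A well-behaved writer should score lower on degraded input; a writer that scores the same, with confident citations, is hallucinating.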

For observability patterns that support multi-agent debugging, see our guide on multi-agent orchestration.

Regression testing with golden datasets

Golden datasets are the backbone of agent evaluation. They are curated sets of inputs paired with reference outputs that have been human-reviewed and approved. But building and maintaining them requires discipline.

Building your first golden dataset

Start small. Fifty well-curated examples beat five hundred hastily assembled ones. For each example:

  1. The input — a real query or task, not a synthetic one. Pull from production logs if you have them.
  2. The reference output — what a good response looks like. This does not need to be the only acceptable response, but it should represent the quality bar.
  3. Evaluation criteria — what properties matter for this specific case. Some cases test factual accuracy. Others test tone. Others test tool usage. Make it explicit.
  4. Difficulty tier — tag each case as easy, medium, or hard. Easy cases are table stakes. Hard cases are stretch goals. This lets you track improvement at each difficulty level separately.
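
The four fields above can be captured as a typed record, which also makes per-tier tracking straightforward. The field names here are illustrative assumptions, not a standard format:

```typescript
// Illustrative shape for one golden dataset entry, mirroring the four fields
// described above.
interface GoldenEntry {
  id: string;
  input: string;            // real production query, not synthetic
  referenceOutput: string;  // the quality bar, not the only acceptable answer
  criteria: string[];       // e.g. ["factual-accuracy", "tone", "tool-usage"]
  difficulty: "easy" | "medium" | "hard";
}

// Track pass rates per difficulty tier so an "easy" regression stands out
// even when the aggregate pass rate looks healthy.
function passRateByTier(
  entries: GoldenEntry[],
  passedIds: Set<string>
): Record<GoldenEntry["difficulty"], number> {
  const rates = { easy: 0, medium: 0, hard: 0 };
  const counts = { easy: 0, medium: 0, hard: 0 };
  for (const e of entries) {
    counts[e.difficulty]++;
    if (passedIds.has(e.id)) rates[e.difficulty]++;
  }
  for (const tier of ["easy", "medium", "hard"] as const) {
    rates[tier] = counts[tier] > 0 ? rates[tier] / counts[tier] : 1;
  }
  return rates;
}
```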

Growing the dataset from production

Every production failure that gets reported by a user should become a golden dataset entry. The workflow:

  1. User reports a bad response.
  2. An engineer reviews the input and the response.
  3. The engineer writes a reference output or defines the correct behavior.
  4. The case enters the golden dataset tagged with the failure mode (hallucination, wrong tool, safety violation, format error).
  5. The next eval run includes this case.

Over time, your golden dataset becomes a living record of every failure mode your agent has encountered. A dataset of 500 production-sourced examples, built over six months, catches more real-world failures than a dataset of 5,000 synthetic examples built in an afternoon.

Snapshot testing for behavioral drift

Traditional snapshot testing compares exact outputs. For agents, you need semantic snapshots: capture the output properties — topics mentioned, entities referenced, tool calls made, sentiment — and compare those across versions.

When you upgrade a model or change a prompt, run your golden dataset against both the old and new configuration. Flag any case where the semantic properties diverge beyond a threshold. This catches behavioral drift that property tests alone might miss because the new output passes all the property checks but is qualitatively different.
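
A minimal sketch of the divergence check, assuming you have already extracted properties into snapshot records (how you extract them, via NER, trace parsing, or a judge model, is up to your stack; detectDrift and its shapes are illustrative):

```typescript
// Sketch: compare semantic snapshots across versions. A snapshot records
// extracted properties rather than the raw output string.
interface SemanticSnapshot {
  entities: string[];
  toolCalls: string[];
}

// Jaccard similarity between two property sets: |A ∩ B| / |A ∪ B|.
function jaccard(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  if (setA.size === 0 && setB.size === 0) return 1;
  let shared = 0;
  for (const x of setA) if (setB.has(x)) shared++;
  return shared / (setA.size + setB.size - shared);
}

function detectDrift(
  oldSnap: SemanticSnapshot,
  newSnap: SemanticSnapshot,
  threshold = 0.8
): { entitySim: number; toolSim: number; drifted: boolean } {
  const entitySim = jaccard(oldSnap.entities, newSnap.entities);
  const toolSim = jaccard(oldSnap.toolCalls, newSnap.toolCalls);
  return { entitySim, toolSim, drifted: entitySim < threshold || toolSim < threshold };
}
```

A drifted case is not automatically a failure; it is a flag for human review, since the new behavior may be equally good but different.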

The proof bundle concept

Testing agents is not just an engineering problem. As AI systems move into regulated industries, you need to prove that your agent was tested before it was deployed. Not "we have a CI pipeline" — actual evidence that specific tests ran, what the results were, and who approved the deployment.

A proof bundle is an immutable artifact that captures:

  • What was tested. The exact eval suite version, golden dataset version, and configuration.
  • What the results were. Pass/fail for every test case, scores for every metric, latency and cost data.
  • What version was deployed. The exact model version, prompt version, and code version.
  • Who approved it. The human who reviewed the results and approved the deployment.
  • When it happened. Timestamps for the eval run, the approval, and the deployment.
In TypeScript terms, a proof bundle might be shaped like this:

interface ProofBundle {
  id: string;
  timestamp: string;

  pipeline: {
    version: string;
    commitHash: string;
    modelVersions: Record<string, string>;
    promptHashes: Record<string, string>;
  };

  evaluation: {
    suiteVersion: string;
    goldenDatasetVersion: string;
    totalCases: number;
    passedCases: number;
    failedCases: number;
    metrics: Record<string, number>;
    regressions: Array<{ caseId: string; currentScore: number; previousScore: number }>;
  };

  approval: {
    approver: string;
    timestamp: string;
    notes: string;
  };

  artifacts: {
    evalResultsUrl: string;
    traceUrl: string;
    costReport: { totalCost: number; perAgentCost: Record<string, number> };
  };
}

This concept comes from the intersection of CI/CD and regulatory compliance. In traditional software, your CI pipeline is the proof — the build passed, the tests passed, it shipped. For AI systems, the proof needs to be richer because the tests are probabilistic and the behavior is non-deterministic. A proof bundle says "we ran 500 eval cases, 487 passed, the 13 failures were reviewed and accepted as edge cases by this person on this date."
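
One way to make the "immutable" claim concrete is to hash the bundle's canonical JSON and store the hash separately; any later edit to the bundle changes the hash. This is a sketch, not a compliance mechanism, and canonicalJson here is a hand-rolled helper rather than a standard:

```typescript
import { createHash } from "node:crypto";

// Serialize with sorted object keys so the same bundle always produces the
// same bytes, regardless of property insertion order.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalJson).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => a.localeCompare(b))
      .map(([k, v]) => `${JSON.stringify(k)}:${canonicalJson(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// Tamper-evidence: record this hash at approval time; recompute it whenever
// the bundle is read back.
function bundleHash(bundle: object): string {
  return createHash("sha256").update(canonicalJson(bundle)).digest("hex");
}
```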

The EU AI Act, which begins enforcement for high-risk systems in August 2026, requires documentation of testing methodology and results for AI systems. A proof bundle is not a legal document, but it is the engineering artifact that feeds the legal documentation. See our AI governance engineering guide for how proof bundles fit into broader governance workflows, and the AI audit trail guide for immutable record-keeping requirements.

Agent testing tools comparison

| Tool | Focus | Agent support | Key strength | Pricing |
| --- | --- | --- | --- | --- |
| Promptfoo | Red-teaming + eval | Multi-step agent evaluation, 50+ adversarial plugins, MCP support | 20K+ stars, acquired by OpenAI (March 2026), CI/CD-first | Free (open-source) |
| DeepEval | Metrics-first eval | 8 agentic metrics, tool correctness at 3 strictness levels, MCP metrics | 14K+ stars, 3M monthly downloads, pytest integration | Free tier + paid |
| Braintrust | Eval + observability | Pipeline tracing with eval, traces-to-datasets | $80M Series B at $800M valuation, GitHub Action for PR eval | Usage-based |
| RAGAS | RAG evaluation | ToolCallF1, agent goal accuracy (agent evals "coming soon") | 13K+ stars, open-source, focused RAG metrics | Free (open-source) |
| Patronus AI | Automated eval | Generative Simulators for dynamic agent testing | $40M raised, adaptive simulation environments | Enterprise |
| Okareo | Agent simulation | Synthetic user personas ("Drivers"), loop detection | Found 25% of production traffic shows looping behavior | Usage-based |

For production deployments, most teams combine tools: Promptfoo or DeepEval for offline eval in CI, Braintrust or a custom pipeline for production monitoring, and targeted tools like RAGAS for RAG-specific quality. No single tool covers the full testing lifecycle yet.

Putting it together: the agent testing workflow

A mature agent testing workflow looks like this:

  1. Developer writes or modifies an agent. Changes go through standard code review.
  2. CI runs offline evals. The golden dataset, regression tests, and adversarial inputs all execute. Results are compared against the previous version. Any regression blocks the merge.
  3. Staging environment runs integration evals. The full pipeline runs end-to-end with production-like data. Interface contracts are validated. Inter-agent quality is scored.
  4. Human reviews the eval results. Not a rubber stamp — someone looks at the failed cases, the edge cases, and the quality distribution.
  5. Proof bundle is generated. The eval results, code version, model versions, and approval are bundled into an immutable artifact.
  6. Deployment proceeds. The proof bundle travels with the deployment as a record of what was tested and approved.
  7. Production monitoring begins. Sampled responses are scored. Anomalies trigger alerts. Failures feed back into the golden dataset.

This workflow adds overhead compared to "push to main and deploy." But the alternative is finding out your agent is broken when a customer screenshots its bad response and posts it on social media.

FAQ

What is AI agent testing?

AI agent testing is the practice of validating that AI agents — systems that use LLMs to reason, select tools, and take actions — behave correctly and reliably. Unlike traditional software testing, agent testing must handle non-deterministic outputs, emergent control flow, and multi-step reasoning chains. It combines property-based assertions, eval metrics like relevance and faithfulness, adversarial testing, and production monitoring to build confidence that an agent system works as intended.

Why do unit tests not work for AI agents?

Unit tests assert exact outputs for given inputs. Agents produce different outputs for the same input because LLMs are probabilistically sampled, context windows accumulate state, and tool call patterns emerge at runtime. You need property-based testing (asserting properties of the output rather than exact values) and statistical evaluation (running each test case multiple times and evaluating the pass rate distribution) instead of traditional assert-equals patterns.

How do you test a multi-agent pipeline?

Layer your testing: (1) individual agent evals with well-formed inputs, (2) interface contract tests verifying each agent's output schema matches the next agent's expected input, (3) end-to-end pipeline evals measuring final output quality while capturing intermediate outputs for debugging, and (4) adversarial injection tests where you feed deliberately bad intermediate outputs to downstream agents to test graceful degradation. See our multi-agent orchestration guide for related patterns.

What is a golden dataset for agent testing?

A golden dataset is a curated set of input-output pairs where the reference outputs have been human-reviewed. Each entry includes the input, a reference output representing the quality bar, specific evaluation criteria, and a difficulty tier. Start with 50-100 real queries from production logs, not synthetic data. Grow the dataset by converting every production failure into a new test case. Over time, the golden dataset becomes a living record of every failure mode your agent has encountered.

What tools are best for AI agent testing in 2026?

Promptfoo and DeepEval are the most popular open-source options. Promptfoo (20K+ GitHub stars, acquired by OpenAI in March 2026) excels at red-teaming with 50+ adversarial attack plugins and integrates well with CI/CD pipelines. DeepEval (14K+ stars, 3M monthly downloads) offers 33+ evaluation metrics with pytest integration and tool correctness evaluation at three strictness levels. For production monitoring, Braintrust ($80M Series B) combines eval with observability. Most teams combine multiple tools because no single tool covers offline eval, production monitoring, and adversarial testing.

What is a proof bundle?

A proof bundle is an immutable artifact that captures everything about an agent deployment's test evidence: the eval suite version, golden dataset version, test results, model versions, code version, who approved the deployment, and when. It connects CI/CD practices with regulatory compliance requirements. As AI regulation increases — the EU AI Act begins enforcement for high-risk systems in August 2026 — proof bundles provide the engineering evidence that feeds legal documentation. See our AI governance engineering guide for the broader governance workflow.

How often should I re-run agent evaluations?

Run offline evals on every code change (prompt, tool definition, or orchestration logic) in CI. Run a full eval suite including adversarial tests before any production deployment. Run sampled production evaluations continuously to catch model drift and distribution shift. Re-run your complete golden dataset weekly even without code changes, because model provider updates can silently change behavior. If any eval run shows regression, investigate before the next deployment.