Updated: April 15, 2026

AI agent observability: tracing, debugging, and monitoring multi-agent systems

By Coverge Team

You already have LLM observability. You trace model calls, capture token counts, and watch latency percentiles. Then you deploy a multi-agent pipeline — a research agent feeding a writer agent feeding a review agent — and your existing instrumentation falls apart. The trace shows three isolated LLM calls. It does not show that agent two received bad input from agent one, that the reviewer approved garbage because it only saw the final draft, or that the whole chain cost $0.47 per request because the research agent ran four retrieval loops instead of one.

AI agent observability is not "more LLM observability." It is a different problem. LLM observability answers "what did the model do?" Agent observability answers "what did the system decide, and why did it decide that?"

Search interest in "ai agent observability" reached 168 monthly queries in early 2026, growing 193% year-over-year. That growth tracks with the shift from single-model prototypes to multi-agent production systems — and the realization that existing tools do not capture what teams need to debug agent failures.

This article covers how agent observability differs from LLM observability, how to propagate traces across agent boundaries, what to log at each node, and how to connect agent traces to evaluation and compliance workflows. For the broader observability picture — traces, metrics, logs, and platform comparisons — see the LLM observability guide. For how observability fits into the agent platform decision, see the AI agent platform guide.

What makes agent observability different

LLM observability instruments a single call: prompt in, completion out, tokens consumed, latency measured. The unit of work is the inference request.

Agent observability instruments a decision chain. An agent receives a task, reasons about it, selects tools, calls models (sometimes multiple times), interprets results, and produces an output that another agent consumes. The unit of work is the agent execution — which might contain five LLM calls, three tool invocations, and a retry loop.

Three properties make agent systems harder to observe than standalone LLM calls:

Non-deterministic control flow. A single agent might call a model, decide to use a tool based on the response, call the model again with the tool output, and repeat this loop an unpredictable number of times. You cannot pre-define the trace structure because the execution path depends on runtime decisions.

Inter-agent data flow. When agent B consumes agent A's output, you need to trace across that boundary. If agent B hallucinates, the root cause might be agent A producing ambiguous context three steps earlier. Without cross-agent tracing, you are debugging each agent in isolation — which is like debugging a distributed system by reading logs from one service at a time.

Compounding cost and latency. Each agent multiplies LLM calls. A four-agent pipeline where each agent averages two model calls means eight inference requests per user query. Without per-agent cost attribution, you see a single expensive request with no breakdown of where the money went.
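
Per-agent cost attribution can be sketched as a small aggregation over the token counts each agent span reports. The price constants below are illustrative placeholders, not real model rates:

```typescript
// Sketch: attribute cost to each agent from its aggregated token usage.
// PRICE_* values are assumed for illustration, not actual provider pricing.
interface AgentUsage {
  agent: string;
  inputTokens: number;
  outputTokens: number;
}

const PRICE_PER_INPUT_TOKEN = 3 / 1_000_000;   // assumed $3 per 1M input tokens
const PRICE_PER_OUTPUT_TOKEN = 15 / 1_000_000; // assumed $15 per 1M output tokens

function attributeCosts(usages: AgentUsage[]): Map<string, number> {
  const costs = new Map<string, number>();
  for (const u of usages) {
    const cost =
      u.inputTokens * PRICE_PER_INPUT_TOKEN +
      u.outputTokens * PRICE_PER_OUTPUT_TOKEN;
    // Sum across multiple executions of the same agent within one trace.
    costs.set(u.agent, (costs.get(u.agent) ?? 0) + cost);
  }
  return costs;
}
```

With this breakdown attached to spans, the "single expensive request" decomposes into per-agent line items.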

Traditional application performance monitoring (APM) tools like Datadog or New Relic can capture HTTP latency and error rates for your agent service. But they do not understand what happened inside the agent — the reasoning steps, the tool selections, the intermediate outputs that shaped the final result. That is the gap agent-specific observability fills.

Span propagation across agent boundaries

The foundation of agent observability is distributed tracing — the same concept that powers microservice observability, adapted for agent systems. Each agent execution becomes a span. Child spans capture individual LLM calls, tool invocations, and reasoning steps within that agent. The parent trace ties every agent span into a single, queryable execution.

The trace structure for multi-agent systems

A well-instrumented multi-agent pipeline produces a trace that looks like this:

trace: user-query-abc123
├── span: agent.research (2.4s, $0.12)
│   ├── span: gen_ai.chat — query planning (0.3s)
│   ├── span: tool.vector_search (0.8s)
│   ├── span: tool.web_search (0.6s)
│   └── span: gen_ai.chat — synthesize findings (0.7s)
├── span: agent.writer (1.8s, $0.08)
│   ├── span: gen_ai.chat — draft generation (1.2s)
│   └── span: gen_ai.chat — self-review (0.6s)
└── span: agent.reviewer (1.1s, $0.05)
    ├── span: gen_ai.chat — quality check (0.8s)
    └── span: gen_ai.chat — scoring (0.3s)

This structure answers the questions that flat LLM logging cannot: which agent was slowest, which agent was most expensive, what data flowed between agents, and where the pipeline broke when it broke.

Implementing cross-agent tracing with OpenTelemetry

The OpenTelemetry GenAI semantic conventions define standard attributes for LLM calls — gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.response.finish_reasons. These conventions are still experimental as of early 2026, but they have become the de facto standard for LLM instrumentation.

The spec now includes a dedicated agent spans page that defines agent-specific operations: create_agent (span name: create_agent {gen_ai.agent.name}), invoke_agent (span name: invoke_agent {gen_ai.agent.name}), and execute_tool (span name: execute_tool {tool.name}). Agent spans use gen_ai.agent.name and gen_ai.agent.id as standard attributes. For in-process agents, span kind is INTERNAL; for remote agent-to-agent calls, it is CLIENT.

In practice, the pattern is to wrap each agent execution in a span and nest the standard gen_ai.chat spans as children:

import { trace, context, SpanKind, propagation } from "@opentelemetry/api";

declare const model: { chat: (opts: { messages: Array<{ role: string; content: string }> }) => Promise<{ content: string; usage: { input_tokens: number; output_tokens: number } }> };

const tracer = trace.getTracer("agent-pipeline");

async function runAgent(
  agentName: string,
  task: string,
  parentContext?: Record<string, string>
) {
  // Restore a serialized trace context (e.g. headers handed over by an
  // upstream agent) so this span joins the caller's trace instead of
  // starting a new one.
  let activeContext = context.active();
  if (parentContext) {
    activeContext = propagation.extract(context.active(), parentContext);
  }

  return context.with(activeContext, () =>
    tracer.startActiveSpan(
      `agent.${agentName}`,
      { kind: SpanKind.INTERNAL },
      async (agentSpan) => {
        agentSpan.setAttribute("agent.name", agentName);
        agentSpan.setAttribute("agent.input", task.substring(0, 500));

        const result = await tracer.startActiveSpan(
          "gen_ai.chat",
          async (llmSpan) => {
            llmSpan.setAttribute("gen_ai.system", "anthropic");
            llmSpan.setAttribute("gen_ai.request.model", "claude-sonnet-4-20250514");

            const response = await model.chat({
              messages: [{ role: "user", content: task }],
            });

            llmSpan.setAttribute("gen_ai.usage.input_tokens", response.usage.input_tokens);
            llmSpan.setAttribute("gen_ai.usage.output_tokens", response.usage.output_tokens);
            llmSpan.end();
            return response.content;
          }
        );

        agentSpan.setAttribute("agent.output", result.substring(0, 500));
        agentSpan.end();
        return result;
      }
    )
  );
}

The key mechanism is propagation.extract and propagation.inject. When agent A finishes and hands off to agent B, you extract the trace context from agent A's span and inject it into agent B's context. This links the spans into a single trace, even if the agents run in different processes or services.

For in-process agent pipelines (all agents in the same Node.js or Python process), OpenTelemetry handles context propagation automatically through context.active(). For agents that communicate over HTTP or message queues, you need to explicitly propagate the traceparent header:

import { propagation, context } from "@opentelemetry/api";

function getTraceHeaders(): Record<string, string> {
  // Serialize the active span context (the W3C traceparent header) into a
  // plain object that can travel as HTTP headers or message metadata.
  const carrier: Record<string, string> = {};
  propagation.inject(context.active(), carrier);
  return carrier;
}

async function orchestrate(query: string) {
  const researchResult = await runAgent("research", query);

  const traceHeaders = getTraceHeaders();

  const writeResult = await runAgent(
    "writer",
    `Write based on this research: ${researchResult}`,
    traceHeaders
  );

  return writeResult;
}
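
For reference, the carrier that propagation.inject fills is a single traceparent header in the W3C Trace Context format: version, trace ID, parent span ID, and flags. In practice you let the OpenTelemetry propagator parse it, but a minimal sketch of what the receiving side extracts (omitting the spec's all-zeros validity checks) looks like this:

```typescript
// Sketch: parse a W3C traceparent header into its parts.
// Format: {version}-{traceId:32 hex}-{parentSpanId:16 hex}-{flags:2 hex}.
// A real propagator also rejects all-zero trace and span IDs.
interface TraceParent {
  version: string;
  traceId: string;  // 32 lowercase hex chars
  parentId: string; // 16 lowercase hex chars
  flags: string;    // 2 hex chars; "01" means sampled
}

function parseTraceparent(header: string): TraceParent | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  return { version: m[1], traceId: m[2], parentId: m[3], flags: m[4] };
}
```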

What to attach to each span

The GenAI semantic conventions cover the LLM-specific attributes. For agent spans, you need additional context that makes traces useful for debugging:

Attribute                  | Why it matters                            | Example
agent.name                 | Identifies which agent in the pipeline    | "research", "writer"
agent.input (truncated)    | Shows what the agent received             | First 500 chars of task
agent.output (truncated)   | Shows what the agent produced             | First 500 chars of result
agent.tool_calls           | Records which tools the agent invoked     | ["vector_search", "web_search"]
agent.iteration_count      | How many reasoning loops the agent ran    | 3
agent.model                | Which model this agent used               | "claude-sonnet-4-20250514"
gen_ai.usage.input_tokens  | Token consumption per LLM call            | 1,247
gen_ai.usage.output_tokens | Token output per LLM call                 | 834
agent.cost_usd             | Computed cost for this agent's execution  | 0.0142
agent.decision             | Routing decisions for supervisor agents   | "delegated_to: code_agent"

Truncate inputs and outputs. Full prompt and completion text in span attributes will blow up your trace storage costs. Log the full text to a separate store (S3, a blob column) and reference it from the span with a pointer.
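
One way to sketch the truncate-and-reference pattern (storeFullText here is a placeholder for whatever blob store you use):

```typescript
// Sketch: keep a short preview on the span and a pointer to the full text.
// storeFullText is a hypothetical hook into your blob store (S3, a blob
// column, etc.) that returns a reference to the stored object.
interface TextAttribute {
  preview: string;      // first N chars, cheap enough to put on the span
  truncated: boolean;
  fullTextRef?: string; // pointer to external storage when truncated
}

function toSpanAttribute(
  text: string,
  maxChars = 500,
  storeFullText?: (text: string) => string
): TextAttribute {
  if (text.length <= maxChars) {
    return { preview: text, truncated: false };
  }
  return {
    preview: text.slice(0, maxChars),
    truncated: true,
    fullTextRef: storeFullText?.(text),
  };
}
```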

What to log at each node in the pipeline

Different positions in a multi-agent pipeline need different observability emphasis.

Entry node: the router or supervisor

The first agent in the pipeline usually classifies the request and decides which downstream agents to invoke. Log:

  • The routing decision and why. If the supervisor chose to skip the code analysis agent because it classified the query as "non-technical," record that classification. When the pipeline produces a bad result on a technical query, you will find the misclassification instantly.
  • Input preprocessing results. PII scrubbing, input validation, guardrail checks. If a guardrail blocked a legitimate request, the trace should show it.
  • Selected pipeline variant. If your system has multiple agent configurations (fast path vs. thorough path), record which one was selected and what triggered the selection.
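
Recording the routing decision as structured data can be sketched as follows — the keyword classifier, labels, and agent names here are toy illustrations, not a real router:

```typescript
// Sketch: a supervisor records its routing decision as span-ready attributes.
// The regex classifier is a toy stand-in for whatever classification the
// supervisor actually runs (often an LLM call).
interface RoutingDecision {
  classification: string;
  selectedAgents: string[];
  skippedAgents: string[];
  reason: string;
}

function routeQuery(query: string): RoutingDecision {
  const technical = /\b(code|bug|stack trace|api)\b/i.test(query);
  return technical
    ? {
        classification: "technical",
        selectedAgents: ["research", "code_analysis", "writer"],
        skippedAgents: [],
        reason: "matched technical keywords",
      }
    : {
        classification: "non-technical",
        selectedAgents: ["research", "writer"],
        skippedAgents: ["code_analysis"],
        reason: "no technical keywords matched",
      };
}
```

Attaching the full RoutingDecision to the supervisor's span is what lets you find the misclassification instantly when a technical query gets the non-technical path.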

Middle nodes: workers and processors

Worker agents do the heavy lifting — retrieval, generation, analysis. Log:

  • Tool call inputs and outputs. When a research agent calls a vector database, log the query and the number of results returned. When it calls a web search API, log the query and the URLs retrieved. The tool results are the context that shapes the agent's response.
  • Intermediate reasoning. If the agent runs multiple LLM calls (plan → execute → reflect), log each call as a child span. A common failure mode is the reflect step contradicting the execute step, causing a retry loop that burns tokens.
  • Quality signals. If the agent has self-assessment logic (confidence scoring, output validation), log the scores. These become the early warning system for degradation — when confidence scores trend downward, the pipeline is about to start producing bad output.

Exit node: the final responder

The last agent assembles the final output. Log:

  • The final output and its provenance. Which upstream agent outputs were combined, and how. If the exit agent summarized three research documents, log which documents and what was excluded.
  • Output guardrail results. Content safety checks, format validation, schema conformance. A passed guardrail check is as important to log as a failed one — it confirms the pipeline worked end-to-end.
  • End-to-end metrics. Total pipeline latency, total cost, total token consumption. These go into your SLO dashboards.

Connecting traces to evals

Observability tells you what happened. Evaluation tells you whether what happened was good. The connection between the two is where agent observability becomes genuinely useful instead of just "more dashboards."

Online evaluation scoring

Attach evaluation scores to production traces. When a user query flows through your multi-agent pipeline, run lightweight eval checks on the output and record the scores as span attributes:

import { trace } from "@opentelemetry/api";

declare function scoreRelevance(query: string, response: string): Promise<number>;
declare function scoreFaithfulness(retrievedContext: string, response: string): Promise<number>;

const tracer = trace.getTracer("agent-pipeline");

async function evaluateAndTrace(
  query: string,
  agentOutput: string,
  retrievedContext: string
) {
  return tracer.startActiveSpan("eval.online", async (span) => {
    const startTime = Date.now();
    const relevance = await scoreRelevance(query, agentOutput);
    const faithfulness = await scoreFaithfulness(retrievedContext, agentOutput);

    span.setAttribute("eval.relevance_score", relevance);
    span.setAttribute("eval.faithfulness_score", faithfulness);
    span.setAttribute("eval.latency_ms", Date.now() - startTime);
    span.setAttribute("eval.passed", relevance > 0.7 && faithfulness > 0.8);
    span.end();

    return { relevance, faithfulness };
  });
}

This creates a feedback loop: traces with low eval scores get flagged for review. You examine the trace, find which agent produced the problematic output, and add the failing case to your offline eval dataset. Your evaluation pipeline gets a new test case from production, and your observability gets a new quality signal. For the specifics of LLM evaluation metrics and how to build offline eval suites for agent systems, see our AI agent testing guide.

Identifying degradation patterns

With eval scores attached to traces, you can query for patterns that signal degradation:

  • Per-agent quality trends. If the research agent's relevance scores drop while other agents stay stable, the problem is isolated. Maybe the vector database index is stale, or a retrieval reranker was updated.
  • Quality vs. cost tradeoffs. Traces with higher token consumption do not always produce higher quality. Plotting eval scores against cost per trace reveals whether your agents are doing useful work or just burning tokens on unnecessary reasoning loops.
  • Latency-quality correlation. Longer agent executions sometimes indicate the agent is stuck in a retry loop, which can mean it is struggling with the input. Traces where a single agent takes 3x its median latency are worth investigating even if the final output passes eval.
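
The 3x-median heuristic from the last point can be expressed directly as a query over agent span latencies:

```typescript
// Sketch: flag agent executions slower than a multiple of the agent's
// median latency. The factor of 3 matches the heuristic above.
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function flagSlowExecutions(latenciesMs: number[], factor = 3): number[] {
  const med = median(latenciesMs);
  return latenciesMs.filter((l) => l > factor * med);
}
```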

Tying observability to compliance

For teams operating in regulated environments — finance, healthcare, government — agent observability is not just a debugging tool. It is a compliance requirement.

The EU AI Act requires "traceability" for high-risk AI systems. Article 12 specifically mandates logging capabilities that allow the monitoring of the system's operation and the identification of the AI system's decisions. A multi-agent system where you cannot reconstruct why a specific output was produced does not meet that bar.

What compliance auditors want to see

Auditors do not read OpenTelemetry spans. They want evidence that your system has:

  1. Complete decision lineage. For any output, you can trace back through every agent that contributed, what data each agent received, and what model version produced each intermediate result.
  2. Immutable records. Traces cannot be retroactively modified. Append-only storage with integrity guarantees (checksums, write-once object storage).
  3. Version pinning. Each trace records which prompt version and model version every agent used. If you update a prompt and quality degrades, you can identify exactly when the change took effect.
  4. Retention policies. Traces are retained for the duration your regulatory framework requires. The EU AI Act requires providers of high-risk systems to keep automatically generated logs for at least six months (Article 19), and SOC 2 audit periods typically cover 12 months.

Agent observability produces the raw data. The compliance layer transforms that data into audit-ready artifacts. For a deeper treatment of how to structure audit trails for multi-agent systems, see our AI audit trail guide. For the full governance picture — version control as governance, eval gates as policy enforcement — see the AI governance engineering guide.

Proof bundles: packaging observability for auditors

A proof bundle is a self-contained artifact that packages everything an auditor needs to evaluate a specific pipeline execution:

interface ProofBundle {
  traceId: string;
  timestamp: string;
  pipeline: {
    version: string;
    agents: Array<{
      name: string;
      modelVersion: string;
      promptVersion: string;
    }>;
  };
  execution: {
    spans: Array<{
      agentName: string;
      input: string;
      output: string;
      toolCalls: Array<{ tool: string; query: string; result: string }>;
      tokenUsage: { input: number; output: number };
      durationMs: number;
    }>;
    totalCostUsd: number;
    totalDurationMs: number;
  };
  evaluation: {
    scores: Record<string, number>;
    passed: boolean;
    evaluatorVersion: string;
  };
  integrity: {
    checksum: string;
    storageLocation: string;
  };
}

This is the bridge between your engineering observability (traces, spans, metrics) and what regulatory review actually requires (evidence that decisions were traceable, versioned, and evaluated). Your observability infrastructure generates the data; the proof bundle structures it for consumption by non-engineers.
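
The integrity checksum can be produced by hashing a canonical serialization of the bundle, so the result is stable regardless of JSON key order. A minimal Node.js sketch:

```typescript
import { createHash } from "node:crypto";

// Sketch: compute an integrity checksum over a canonical JSON serialization.
// Sorting keys before hashing makes the checksum deterministic even if the
// bundle is rebuilt with fields in a different order.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => a.localeCompare(b))
      .map(([k, v]) => `${JSON.stringify(k)}:${canonicalize(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

function bundleChecksum(bundle: object): string {
  return createHash("sha256").update(canonicalize(bundle)).digest("hex");
}
```

Store the checksum alongside the bundle in write-once storage; an auditor (or your own tooling) can recompute it to verify the record was not modified after the fact.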

The tooling options for agent observability

Several platforms handle agent-level tracing, though the depth of agent-specific support varies.

Langfuse (24,100+ GitHub stars, 23M+ monthly SDK installs) provides a hierarchical tracing model — traces contain spans, spans contain generations (individual LLM calls) and events. Its @observe() decorator in Python and manual SDK for TypeScript make it straightforward to instrument agent pipelines. Langfuse also supports OpenTelemetry ingestion via an OTLP endpoint, so you can forward OTel spans from your agent instrumentation directly. The Agent Graphs feature (GA since late 2025) visualizes multi-agent execution paths as node graphs, with inline tool call visibility showing all available tools at each generation. The self-hosted option gives you control over trace data residency — relevant for compliance use cases. See our Langfuse comparison for a detailed feature breakdown.

Arize Phoenix (9,100+ GitHub stars) is an open-source observability tool that auto-instruments popular frameworks (LangChain, LlamaIndex, OpenAI SDK, Claude Agent SDK, Vercel AI SDK). It uses OpenInference, its own telemetry standard built on OpenTelemetry, and defines ten span kinds including a dedicated AGENT kind alongside LLM, TOOL, RETRIEVER, GUARDRAIL, and EVALUATOR. Phoenix connects traces directly to evaluation metrics, letting you score individual spans and visualize quality trends. The Agent Graph view maps multi-agent execution as a node graph. The commercial Arize platform processes over 1 trillion spans per month across customers. See our Arize comparison for more detail.

Helicone focuses on LLM gateway-level observability — it captures every model call as a proxy and adds cost tracking, caching, and rate limiting. For agent observability specifically, Helicone captures individual LLM calls well but does not natively understand agent-level groupings. You would layer it with a trace-level tool.

Braintrust combines evaluation with observability. Its logging captures LLM calls within experiment runs, and its online eval features let you score production traces. Braintrust raised $80M in early 2026, signaling a bet that eval-integrated observability is the direction the market is heading.

Feature             | Langfuse                               | Arize Phoenix                          | Helicone                   | Braintrust
Agent-level spans   | Native (traces → spans → generations)  | Via OpenInference auto-instrumentation | LLM calls only             | Via experiment logging
OTel support        | OTLP ingestion                         | OTel + OpenInference                   | Proxy-based capture        | SDK-based
Eval integration    | Score attachment to traces             | Built-in eval visualizations           | No                         | Core feature
Self-hosted option  | Yes (Docker)                           | Yes (open source)                      | No                         | No
Cost tracking       | Per-generation token costs             | Per-span costs                         | Per-request gateway-level  | Per-experiment
Compliance features | Data residency via self-hosting        | Open source data control               | Limited                    | Enterprise features

None of these tools solve agent observability out of the box for every architecture. Most teams end up combining a tracing platform (Langfuse or Phoenix for agent-level traces) with a gateway (Helicone or Portkey for LLM-level capture) and an eval framework (DeepEval or Promptfoo for quality scoring). For a pricing-oriented comparison of these platforms, see our LLMOps tools pricing comparison. The instrumentation code you write — the span definitions, the context propagation, the attribute selection — matters more than which platform you send the data to.

Getting started: a practical instrumentation checklist

If you are adding observability to an existing multi-agent system, start with these steps in order:

  1. Add a root trace per user request. Every agent invocation within a single user request should share a trace ID. This is the minimum viable instrumentation.

  2. Wrap each agent in a span. Record agent name, input (truncated), output (truncated), duration, and cost. Do not instrument individual LLM calls yet — get the agent-level view first.

  3. Propagate context between agents. If agents run in-process, OpenTelemetry handles this automatically. If agents communicate over HTTP or queues, inject and extract trace headers.

  4. Add LLM-level spans inside each agent. Use the GenAI semantic conventions: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens. Now each agent span has children showing every model call.

  5. Attach eval scores to exit spans. Run lightweight quality checks on the final output and record scores as span attributes. This creates the trace-to-eval connection.

  6. Set up alerting on quality degradation. When average eval scores for a specific agent drop below a threshold, page someone. This is the agent equivalent of a traditional error rate alert.

  7. Build compliance exports. If you operate in a regulated environment, automate the generation of proof bundles from trace data. Do not wait until an auditor asks for them.

The order matters. Teams that start with step 4 (detailed LLM instrumentation) before step 1 (root traces) end up with granular data they cannot correlate. Start from the top of the pipeline and work down.
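
The degradation alert in step 6 can start as something very simple: a rolling average over each agent's recent eval scores. The threshold and window size here are illustrative, not recommendations:

```typescript
// Sketch: trigger an alert when an agent's rolling average eval score
// falls below a threshold. Threshold and window are illustrative defaults.
function shouldAlert(recentScores: number[], threshold = 0.7, window = 50): boolean {
  const windowed = recentScores.slice(-window);
  if (windowed.length === 0) return false;
  const avg = windowed.reduce((sum, score) => sum + score, 0) / windowed.length;
  return avg < threshold;
}
```

Run this per agent, not per pipeline — a pipeline-wide average can mask one agent degrading while the others hold steady.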

FAQ

What is AI agent observability?

AI agent observability is the practice of instrumenting multi-agent AI systems to understand what each agent decided, what data it received, what tools it used, and how its output influenced downstream agents. It extends LLM observability — which tracks individual model calls — to cover the decision chains, data flows, and cost attribution across an entire agent pipeline. The goal is to make agent system behavior inspectable and debuggable in production, not just in development.

How does agent observability differ from LLM observability?

LLM observability instruments individual model calls: prompt in, completion out, tokens consumed. Agent observability instruments decision chains across multiple agents. A single agent execution might include several LLM calls, tool invocations, and reasoning loops. Agent observability captures the relationships between these operations — which agent's output fed into which agent's input — and attributes cost, latency, and quality to each agent rather than to individual model calls.

What OpenTelemetry conventions should I use for agent tracing?

Use the GenAI semantic conventions for LLM-specific attributes (gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens). The spec now includes agent-specific conventions with operations like invoke_agent and execute_tool, plus attributes gen_ai.agent.name and gen_ai.agent.id. For additional context beyond the standard attributes, add custom attributes like agent.input, agent.output, agent.tool_calls, and agent.cost_usd. Nest gen_ai.chat spans as children of agent spans.

How do I propagate traces across agent boundaries?

For in-process agents (same runtime), OpenTelemetry propagates context automatically through context.active(). For agents communicating over HTTP, inject the trace context into request headers using propagation.inject() on the sending side, and extract it with propagation.extract() on the receiving side. The W3C traceparent header is the standard carrier. For message queue architectures, attach trace headers as message metadata.

Which tools support agent-level observability?

Langfuse provides native hierarchical tracing (traces → spans → generations) with OTel ingestion support. Arize Phoenix auto-instruments popular frameworks and connects traces to evaluations. Both support self-hosting for data residency requirements. Helicone and Portkey capture LLM calls at the gateway level but do not natively group them into agent-level spans. Most production setups combine a tracing platform with a gateway and an eval framework.

How does agent observability help with compliance?

The EU AI Act requires traceability for high-risk AI systems — you must be able to reconstruct why a specific output was produced. Agent observability provides the raw data: which agents ran, what inputs they received, what models and prompt versions they used, and what they produced. Combined with immutable storage and proof bundles, this data satisfies audit requirements for decision lineage, version pinning, and quality evidence. See our AI audit trail and AI governance engineering guide for the full compliance picture.

What should I log for each agent in a pipeline?

At minimum: agent name, truncated input and output, tool calls with their results, iteration count, model version, token usage, cost, and duration. For supervisor agents, also log routing decisions. For worker agents, log intermediate reasoning steps as child spans. For exit agents, log end-to-end pipeline metrics and output guardrail results. Truncate full prompt and completion text to avoid blowing up storage costs — store the full text separately and reference it from the span.