Updated: April 15, 2026

LLM observability guide: traces, metrics, and monitoring for production AI systems

By Coverge Team

Your LLM pipeline works in dev. It passes eval. It handles the golden dataset at 94% accuracy. Then it ships to production, and within a week you are debugging a customer complaint about a response that makes no sense — and you have no idea which prompt version, which retrieval context, or which model call produced it.

This is the gap that observability fills. Not logging. Not metrics dashboards. Observability — the ability to understand what your system is doing by examining its outputs.

Traditional application observability gives you request traces, error rates, and latency percentiles. LLM observability needs all of that plus things traditional systems never had to worry about: the quality of generated text, the relevance of retrieved context, the cost of each inference call, and the decision chain across multi-agent workflows. The signal-to-noise ratio matters differently when a "successful" HTTP 200 response can still be completely wrong.

"LLM observability" hit 720 monthly searches in early 2026, growing 22% year-over-year. The growth tracks a pattern: teams that shipped their first LLM features in 2024 are now dealing with production incidents they cannot debug with traditional tools.

This guide covers how to build an observability practice for LLM systems — from the foundational data model (traces, metrics, logs) through the tooling choices to the feedback loop that connects production monitoring back to evaluation and governance. If you are still getting oriented with the broader LLMOps discipline, our what is LLMOps overview covers how observability fits alongside evaluation, deployment, and governance.

Traces vs metrics vs logs: the LLM observability data model

The three pillars of traditional observability — traces, metrics, and logs — all apply to LLM systems. But each one captures different information, and the relative importance shifts compared to traditional software.

Traces: the backbone of LLM observability

A trace follows a single request from entry to completion. In a traditional web service, a trace might span an API gateway, an application server, and a database query. In an LLM pipeline, a trace spans:

  • User input preprocessing (PII scrubbing, input validation, guardrail checks)
  • Retrieval (embedding generation, vector search, reranking)
  • Prompt assembly (template rendering, context injection, system prompt construction)
  • LLM inference (the actual model call, including retries and fallbacks)
  • Output processing (guardrail checks, structured output parsing, response formatting)
  • Tool calls (if the model invokes tools, each tool execution is a child span)

Each step is a span. The full collection of spans is the trace. The power of tracing is that when something goes wrong, you can see exactly which step produced the bad output and what inputs it received.

import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("rag-pipeline");

async function handleQuery(userQuery: string) {
  return tracer.startActiveSpan("rag.query", async (rootSpan) => {
    try {
      rootSpan.setAttribute("user.query", userQuery);
      rootSpan.setAttribute("pipeline.version", "v47");

      // Retrieval span (vectorSearch and avgScore are your own helpers)
      const context = await tracer.startActiveSpan(
        "rag.retrieve",
        async (retrieveSpan) => {
          try {
            retrieveSpan.setAttribute("retrieval.strategy", "hybrid");
            retrieveSpan.setAttribute("retrieval.top_k", 5);

            const chunks = await vectorSearch(userQuery);
            retrieveSpan.setAttribute("retrieval.chunks_returned", chunks.length);
            retrieveSpan.setAttribute("retrieval.avg_score", avgScore(chunks));
            return chunks;
          } finally {
            retrieveSpan.end();
          }
        }
      );

      // LLM inference span
      const response = await tracer.startActiveSpan(
        "rag.generate",
        async (genSpan) => {
          try {
            genSpan.setAttribute("gen_ai.system", "openai");
            genSpan.setAttribute("gen_ai.request.model", "gpt-4o-2024-11-20");
            genSpan.setAttribute("gen_ai.request.temperature", 0.1);

            const result = await callLLM(userQuery, context);

            genSpan.setAttribute("gen_ai.usage.input_tokens", result.inputTokens);
            genSpan.setAttribute("gen_ai.usage.output_tokens", result.outputTokens);
            genSpan.setAttribute("gen_ai.response.finish_reason", result.finishReason);
            return result;
          } finally {
            genSpan.end();
          }
        }
      );

      rootSpan.setStatus({ code: SpanStatusCode.OK });
      return response;
    } catch (err) {
      rootSpan.setStatus({ code: SpanStatusCode.ERROR, message: String(err) });
      throw err;
    } finally {
      rootSpan.end(); // spans must end even when a step throws
    }
  });
}

Without traces, debugging production LLM issues means reading through unstructured logs trying to correlate events by timestamp. With traces, you click on a failing request and see the full execution timeline — which retrieval chunks were used, what prompt was assembled, how the model responded, and how long each step took.

Metrics: the aggregate view

Metrics give you the aggregate picture that traces cannot. A trace tells you what happened to one request. Metrics tell you what is happening to all requests.

The metrics that matter for LLM systems:

| Category | Metric | Why it matters |
| --- | --- | --- |
| Latency | Time to first token (TTFT) | User-perceived responsiveness |
| Latency | Total generation time | End-to-end request duration |
| Latency | Retrieval latency (p50, p95, p99) | Bottleneck identification |
| Cost | Tokens per request (input + output) | Budget tracking and anomaly detection |
| Cost | Cost per request (by model) | Financial monitoring |
| Cost | Cost per pipeline version | Regression detection after changes |
| Quality | Online eval scores (sampled) | Production quality monitoring |
| Quality | Guardrail trigger rate | Safety signal |
| Quality | Retrieval relevance scores | Context quality over time |
| Reliability | Error rate by error type | Model API failures, timeout rates |
| Reliability | Retry rate | Provider stability signal |
| Reliability | Fallback activation rate | How often the primary model fails |

The key insight: traditional SRE metrics (latency, error rate, throughput) are necessary but not sufficient. An LLM system can have perfect uptime, low latency, and zero errors while producing garbage output. Quality metrics are the gap. You need to measure how good the responses are, not just whether they arrived.

Logs: the raw record

Logs capture the details that do not fit neatly into spans or metrics — the actual prompt text, the full model response, the retrieved context chunks, the guardrail evaluation results.

For LLM systems, logging has a tension that traditional systems do not face: the data is expensive. A single LLM request might involve a 10,000-token prompt and a 2,000-token response. Log every prompt and response for every request, and your log storage costs can rival your inference costs.

The practical approach:

  • Always log: request metadata (model, parameters, pipeline version), token counts, latency, error information, guardrail results
  • Sample log: full prompt and response text (10-25% sampling rate for high-traffic systems)
  • Always log on failure: every request that triggers a guardrail, produces an error, or gets flagged by quality scoring gets its full content logged regardless of sampling rate
  • Never log: raw user PII (scrub before logging), API keys or credentials, full embedding vectors (too large, not useful for debugging)

The sampling strategy matters because you need enough full-content logs to debug issues and build eval datasets, but you do not need 100% of them at scale.
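
The decision rules above fit in one function. A minimal sketch — the field names and the 10% baseline rate are illustrative, and the random source is injected so the policy is testable:

```typescript
interface RequestOutcome {
  guardrailTriggered: boolean;
  errored: boolean;
  qualityFlagged: boolean;
}

// Decide whether to store full prompt/response content for a request.
// Failures always get full content; everything else is sampled.
function shouldLogFullContent(
  outcome: RequestOutcome,
  samplingRate = 0.1, // 10% baseline for a high-traffic system
  rng: () => number = Math.random
): boolean {
  if (outcome.guardrailTriggered || outcome.errored || outcome.qualityFlagged) {
    return true; // always log on failure, regardless of sampling rate
  }
  return rng() < samplingRate;
}
```

Metadata (model, tokens, latency) still gets logged for every request; this gate only controls the expensive full-content payload.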

OpenTelemetry for AI: the emerging standard

The OpenTelemetry project — the CNCF standard for distributed tracing — is building GenAI-specific semantic conventions. These conventions define how LLM calls should be instrumented, what attributes to capture, and how to structure spans for AI workloads.

GenAI semantic conventions

The OpenTelemetry GenAI semantic conventions define a standard vocabulary for LLM observability:

Span naming: gen_ai.{operation} — for example, gen_ai.chat for a chat completion call.

Standard attributes:

  • gen_ai.system — the AI provider (openai, anthropic, google_vertex_ai)
  • gen_ai.request.model — the model requested (claude-sonnet-4-20250514)
  • gen_ai.response.model — the model that actually responded (providers may route differently)
  • gen_ai.usage.input_tokens — tokens in the prompt
  • gen_ai.usage.output_tokens — tokens in the completion
  • gen_ai.response.finish_reason — why generation stopped (stop, length, tool_calls)

Events on spans: prompt and completion content are modeled as events attached to the span, not as span attributes. This is intentional — it allows sampling of content independently from span metadata.

import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("gen-ai-pipeline");

async function chatCompletion(messages: Message[]) {
  return tracer.startActiveSpan("gen_ai.chat", async (span) => {
    try {
      // Standard GenAI attributes
      span.setAttribute("gen_ai.system", "anthropic");
      span.setAttribute("gen_ai.request.model", "claude-sonnet-4-20250514");
      span.setAttribute("gen_ai.request.max_tokens", 4096);
      span.setAttribute("gen_ai.request.temperature", 0.3);

      // Log prompt as an event (allows independent sampling)
      span.addEvent("gen_ai.content.prompt", {
        "gen_ai.prompt": JSON.stringify(messages),
      });

      // `anthropic` is an initialized Anthropic SDK client
      const response = await anthropic.messages.create({
        model: "claude-sonnet-4-20250514",
        max_tokens: 4096,
        temperature: 0.3,
        messages,
      });

      // Log completion as an event
      span.addEvent("gen_ai.content.completion", {
        "gen_ai.completion": JSON.stringify(response.content),
      });

      // Standard response attributes
      span.setAttribute("gen_ai.response.model", response.model);
      span.setAttribute("gen_ai.usage.input_tokens", response.usage.input_tokens);
      span.setAttribute("gen_ai.usage.output_tokens", response.usage.output_tokens);
      span.setAttribute("gen_ai.response.finish_reason", response.stop_reason);

      return response;
    } finally {
      span.end(); // end the span even if the API call throws
    }
  });
}

Why OpenTelemetry matters for LLM observability

Adopting OpenTelemetry conventions gives you three things:

  1. Vendor portability. If you instrument with OTel, you can send traces to any backend — Jaeger, Grafana Tempo, Langfuse, Arize Phoenix, or any OTel-compatible platform. You are not locked into a single observability vendor.

  2. Ecosystem integration. Your LLM traces join the same distributed trace as your web server, database, and cache spans. A single trace can show: "The API request hit the gateway (50ms) → called the RAG pipeline (200ms) → retrieved context from Postgres (30ms) → called Claude (1,200ms) → ran output guardrails (80ms) → returned to user." Full-stack visibility.

  3. Auto-instrumentation. Libraries like opentelemetry-instrumentation-openai and opentelemetry-instrumentation-anthropic automatically instrument SDK calls without code changes. You install the package, configure the exporter, and every LLM call generates a properly attributed span.

The ecosystem is not fully mature — the GenAI conventions are still stabilizing, and not every LLM observability platform supports OTel natively. But the direction is clear: OTel will be the standard, and teams that adopt it now avoid a migration later.

Span propagation in agent workflows

Single-model RAG pipelines are the simple case. The hard case is multi-agent workflows where multiple LLM calls happen in sequence or parallel, each potentially triggering tool calls that themselves involve LLM calls. If you are building on an AI agent platform, observability is the capability that turns opaque agent behavior into debuggable execution traces.

The challenge: agent execution graphs

Consider a research agent workflow:

  1. Orchestrator agent receives a user query and decides which specialist agents to invoke
  2. Research agent searches a knowledge base and summarizes findings (2-3 LLM calls)
  3. Analysis agent takes the research output and generates recommendations (1-2 LLM calls)
  4. Review agent checks the analysis for factual consistency (1 LLM call)
  5. Orchestrator synthesizes the results into a final response (1 LLM call)

That is 5-8 LLM calls, plus retrieval operations, tool executions, and inter-agent communication. Without proper span propagation, each of these appears as an independent trace — you can see that 8 LLM calls happened, but you cannot see that they were part of the same user request or understand their dependency relationships.

Proper span propagation creates a tree:

user.request (root)
├── orchestrator.plan
├── research.agent (parallel with analysis prep)
│   ├── rag.retrieve
│   ├── gen_ai.chat (summarize findings)
│   └── gen_ai.chat (extract key facts)
├── analysis.agent
│   ├── gen_ai.chat (generate recommendations)
│   └── gen_ai.chat (format output)
├── review.agent
│   └── gen_ai.chat (consistency check)
└── orchestrator.synthesize
    └── gen_ai.chat (final response)

This tree structure is what lets you answer questions like: "The final response was wrong because the research agent's retrieval returned irrelevant chunks, which caused the analysis agent to hallucinate a recommendation."

For a deeper treatment of tracing across agent boundaries, see our AI agent observability guide.

Implementing span propagation

The key is passing the trace context through every agent boundary. With OpenTelemetry, this happens automatically within a single process (the active span context propagates to child spans). Across process or service boundaries, you need to explicitly propagate the context:

import { context, propagation, trace } from "@opentelemetry/api";

// Agent A: extract context to pass to Agent B
function callAgentB(task: AgentTask) {
  const carrier: Record<string, string> = {};
  propagation.inject(context.active(), carrier);

  // Pass carrier alongside the task payload
  return agentBClient.execute({
    task,
    traceContext: carrier,
  });
}

// Agent B: restore context from Agent A
function handleTask(request: { task: AgentTask; traceContext: Record<string, string> }) {
  const parentContext = propagation.extract(context.active(), request.traceContext);

  return context.with(parentContext, () => {
    const tracer = trace.getTracer("agent-b");
    return tracer.startActiveSpan("agent-b.execute", async (span) => {
      try {
        span.setAttribute("agent.name", "research");
        span.setAttribute("agent.task_type", request.task.type);

        return await processTask(request.task);
      } finally {
        span.end(); // end the span even if the task fails
      }
    });
  });
}

If your agents communicate via message queues (Kafka, SQS, Redis streams), the trace context goes into message headers. If they communicate via HTTP, it goes into HTTP headers (the W3C Trace Context standard). The mechanism is standard — the challenge is making sure every agent boundary handles propagation correctly.
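
What actually travels in those headers is the W3C `traceparent` value. A minimal parser — a sketch to show the format; in production the OTel propagator handles this for you:

```typescript
interface TraceParent {
  version: string;
  traceId: string; // 16 bytes, lowercase hex
  spanId: string;  // 8 bytes, lowercase hex (the parent span)
  sampled: boolean;
}

// Parse a W3C traceparent header: "{version}-{traceId}-{spanId}-{flags}".
// Shown for illustration; use the OpenTelemetry propagator in real code.
function parseTraceparent(header: string): TraceParent | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  return {
    version: m[1],
    traceId: m[2],
    spanId: m[3],
    sampled: (parseInt(m[4], 16) & 0x01) === 1, // bit 0 is the sampled flag
  };
}
```

Every hop that drops or mangles this one header value breaks the trace tree, which is why propagation deserves infrastructure-level testing.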

For patterns on orchestrating multi-agent workflows, see our multi-agent orchestration guide.

Cost tracking: the metric nobody had before LLMs

Traditional software has infrastructure costs — servers, databases, bandwidth. But the cost per request is approximately zero. With LLMs, every request costs money, and the cost varies dramatically based on the model, prompt length, and output length.

Why per-request cost tracking matters

A prompt regression can silently double your LLM spend. Imagine a retrieval change that includes more context chunks in the prompt — quality might improve, but input tokens per request jump from 2,000 to 6,000. At scale, that triples your input-token spend, and nobody notices because quality went up and error rates stayed flat.

Cost tracking needs to happen at three levels:

Per-request. Every LLM call records its input tokens, output tokens, model used, and calculated cost. This is the raw data.

Per-pipeline-version. Aggregate cost metrics by pipeline version so you can see that v48 costs 40% more per request than v47. This catches the silent regressions.

Per-feature. If your application has multiple AI features (search, summarization, chat), track costs by feature so you can see which ones are driving spend and whether their unit economics work.

interface CostRecord {
  traceId: string;
  spanId: string;
  timestamp: Date;
  model: string;
  provider: string;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
  pipelineVersion: string;
  feature: string;
}

function calculateCost(
  model: string,
  inputTokens: number,
  outputTokens: number
): number {
  // Pricing as of early 2026 — keep this updated
  const pricing: Record<string, { input: number; output: number }> = {
    "gpt-4o-2024-11-20": { input: 2.50 / 1_000_000, output: 10.0 / 1_000_000 },
    "claude-sonnet-4-20250514": { input: 3.0 / 1_000_000, output: 15.0 / 1_000_000 },
    "claude-haiku-4-5-20251001": { input: 0.80 / 1_000_000, output: 4.0 / 1_000_000 },
    "gemini-2.0-flash": { input: 0.10 / 1_000_000, output: 0.40 / 1_000_000 },
  };

  const rates = pricing[model];
  if (!rates) return 0;
  return inputTokens * rates.input + outputTokens * rates.output;
}
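
Per-pipeline-version aggregation is a small step on top of these records. A sketch of how the "v48 costs 40% more than v47" comparison might be computed — the `CostRow` shape is a trimmed-down version of `CostRecord` above:

```typescript
interface CostRow {
  pipelineVersion: string;
  costUsd: number;
}

// Average cost per request for each pipeline version.
function avgCostByVersion(rows: CostRow[]): Map<string, number> {
  const totals = new Map<string, { sum: number; n: number }>();
  for (const r of rows) {
    const t = totals.get(r.pipelineVersion) ?? { sum: 0, n: 0 };
    t.sum += r.costUsd;
    t.n += 1;
    totals.set(r.pipelineVersion, t);
  }
  const avg = new Map<string, number>();
  totals.forEach((t, version) => avg.set(version, t.sum / t.n));
  return avg;
}

// Relative change of a candidate version against a baseline:
// 0.4 means the candidate costs 40% more per request.
function costChange(avg: Map<string, number>, baseline: string, candidate: string): number {
  const base = avg.get(baseline);
  const cand = avg.get(candidate);
  if (base === undefined || cand === undefined || base === 0) return 0;
  return (cand - base) / base;
}
```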

Cost anomaly detection

Beyond tracking, you need alerting. Set up alerts for:

  • Cost per request exceeds threshold — catches prompt regressions and retrieval blowups
  • Daily spend exceeds budget — catches runaway loops or unexpected traffic spikes
  • Cost per pipeline version changes by more than 20% — catches regressions from version changes
  • Model mismatch — catches cases where a fallback to a more expensive model triggers without anyone noticing

The cost anomaly is the canary. When cost spikes, something changed — and that something often correlates with a quality or reliability problem.
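
Those alert rules can be sketched as a single check — the thresholds and the stats shape here are illustrative assumptions:

```typescript
interface DailyCostStats {
  dailySpendUsd: number;
  avgCostPerRequestUsd: number;
  prevVersionAvgCostUsd: number; // baseline from the previous pipeline version
}

// Evaluate the alert rules above; returns the names of any triggered alerts.
function costAlerts(
  stats: DailyCostStats,
  dailyBudgetUsd: number,
  perRequestCapUsd: number
): string[] {
  const alerts: string[] = [];
  if (stats.avgCostPerRequestUsd > perRequestCapUsd) {
    alerts.push("cost-per-request-over-threshold");
  }
  if (stats.dailySpendUsd > dailyBudgetUsd) {
    alerts.push("daily-budget-exceeded");
  }
  const change =
    (stats.avgCostPerRequestUsd - stats.prevVersionAvgCostUsd) /
    stats.prevVersionAvgCostUsd;
  if (Math.abs(change) > 0.2) {
    alerts.push("version-cost-change-over-20pct");
  }
  return alerts;
}
```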

Latency monitoring for LLM systems

LLM latency is fundamentally different from traditional API latency. A database query takes 5ms or 500ms, and you can tell immediately if it is slow. An LLM response takes 1-15 seconds depending on the model, prompt length, output length, and provider load. Users tolerate this — they are used to "AI thinking" — but within that tolerance band, latency still matters.

What to measure

Time to first token (TTFT): the time from request to the first byte of the streaming response. This is the user-perceived latency — how long they wait before seeing anything. TTFT is mostly a function of prompt size and provider queue depth.

Tokens per second (TPS): the generation speed once streaming starts. This determines how fast the response "types out." Typical ranges: 30-80 TPS for GPT-4o, 50-100 TPS for Claude Sonnet, 100-200 TPS for smaller/flash models.

Total latency: end-to-end, from user request to complete response. This includes preprocessing, retrieval, prompt assembly, inference, and post-processing. For RAG pipelines, the retrieval step often dominates total latency.

Retrieval latency: time spent in vector search, reranking, and context assembly. In many pipelines, this is the most optimizable component — embedding caching, index tuning, and reranker selection can cut retrieval latency by 50-80%.
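
TTFT and TPS fall out of the token arrival timestamps of a streamed response. A minimal sketch — the timing shape is an assumption; in practice these timestamps come from your streaming callback:

```typescript
interface StreamTiming {
  requestStartMs: number;
  tokenTimestampsMs: number[]; // arrival time of each streamed token
}

// Time to first token: how long the user waits before seeing anything.
function ttftMs(t: StreamTiming): number {
  if (t.tokenTimestampsMs.length === 0) return NaN;
  return t.tokenTimestampsMs[0] - t.requestStartMs;
}

// Generation speed over the streaming window, in tokens per second.
function tokensPerSecond(t: StreamTiming): number {
  const n = t.tokenTimestampsMs.length;
  if (n < 2) return NaN;
  const windowMs = t.tokenTimestampsMs[n - 1] - t.tokenTimestampsMs[0];
  if (windowMs <= 0) return NaN;
  return ((n - 1) / windowMs) * 1000;
}
```

Recording both as span attributes lets you separate "the provider was slow to start" (high TTFT) from "the provider generated slowly" (low TPS).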

Latency budgets

Set latency budgets by pipeline stage so you know where to optimize:

| Stage | Typical range | Budget target |
| --- | --- | --- |
| Input validation + guardrails | 10-50ms | < 100ms |
| Embedding generation | 20-100ms | < 150ms |
| Vector search | 10-200ms | < 200ms |
| Reranking | 50-300ms | < 300ms |
| Prompt assembly | 1-10ms | < 20ms |
| LLM inference (TTFT) | 200-2000ms | < 1500ms |
| LLM generation (total) | 1-15s | < 8s |
| Output guardrails | 10-100ms | < 200ms |
| Total pipeline | 1.5-18s | < 10s |

When total latency exceeds your budget, the trace tells you which stage is the bottleneck. Without stage-level timing, you are guessing.

Quality scoring in production

This is the metric category that separates LLM observability from traditional observability. Your system can be fast, cheap, and reliable while producing terrible output. Quality scoring catches that.

Online eval: scoring production responses

Offline eval (running test suites before deployment) tells you how a pipeline version should perform. Online eval (scoring production responses) tells you how it actually performs. The gap between the two is where production issues hide.

Online eval approaches:

LLM-as-a-judge (sampled). Take a sample of production requests (5-15%) and run them through a judge model that scores for relevance, faithfulness, and helpfulness. This gives you a continuous quality signal but adds latency and cost if run synchronously.

interface OnlineEvalResult {
  traceId: string;
  pipelineVersion: string;
  timestamp: Date;
  scores: {
    relevance: number;    // 0-1: does the response answer the question?
    faithfulness: number; // 0-1: is the response grounded in the context?
    helpfulness: number;  // 0-1: would a user find this useful?
    safety: number;       // 0-1: does the response avoid harmful content?
  };
  judgeModel: string;
  judgeLatencyMs: number;
}

async function scoreProduction(
  query: string,
  context: string[],
  response: string,
  traceId: string
): Promise<OnlineEvalResult> {
  const start = Date.now();

  const judgment = await anthropic.messages.create({
    model: "claude-haiku-4-5-20251001",
    max_tokens: 500,
    system: `Score the following response on four dimensions.
Return JSON: { relevance: 0-1, faithfulness: 0-1, helpfulness: 0-1, safety: 0-1 }
Relevance: does the response address the user's question?
Faithfulness: is every claim in the response supported by the provided context?
Helpfulness: would the user find this response useful and actionable?
Safety: does the response avoid harmful, biased, or inappropriate content?`,
    messages: [
      {
        role: "user",
        content: `Query: ${query}\n\nContext:\n${context.join("\n---\n")}\n\nResponse: ${response}`,
      },
    ],
  });

  const scores = JSON.parse(judgment.content[0].type === "text" ? judgment.content[0].text : "{}");
  return {
    traceId,
    pipelineVersion: getCurrentPipelineVersion(),
    timestamp: new Date(),
    scores,
    judgeModel: "claude-haiku-4-5-20251001",
    judgeLatencyMs: Date.now() - start,
  };
}

Heuristic scoring. For high-volume systems where LLM-as-a-judge is too expensive, use heuristic proxies: response length distribution (sudden changes signal problems), retrieval score distributions, refusal rate, output format compliance, and entity extraction checks.
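
A few of those proxies sketched as code — the refusal phrase list and the length threshold are illustrative, not a vetted set:

```typescript
interface HeuristicFlags {
  tooShort: boolean;
  looksLikeRefusal: boolean;
  invalidJson: boolean; // only meaningful for structured-output pipelines
}

// Illustrative refusal phrases; tune against your own traffic.
const REFUSAL_PATTERNS = [/i can('|no)t help/i, /i('m| am) unable to/i, /as an ai/i];

// Cheap quality proxies that run on every request, no judge model needed.
function heuristicScore(response: string, expectJson: boolean): HeuristicFlags {
  let invalidJson = false;
  if (expectJson) {
    try {
      JSON.parse(response);
    } catch {
      invalidJson = true;
    }
  }
  return {
    tooShort: response.trim().length < 20,
    looksLikeRefusal: REFUSAL_PATTERNS.some((p) => p.test(response)),
    invalidJson,
  };
}
```

Track the flag rates as metrics; a sudden shift in any of them is a cheap early-warning signal that justifies spending judge-model budget on the affected traffic.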

User feedback. Thumbs up/down, regeneration requests, and session abandonment rates are quality signals. They are noisy and delayed, but they are ground truth — the user is telling you whether the system worked. Connect feedback events to traces so you can see what went wrong in the requests users flagged.

Quality dashboards

Build dashboards that show quality alongside traditional metrics:

  • Quality over time: daily average relevance, faithfulness, and safety scores, broken down by pipeline version
  • Quality by segment: scores segmented by query type, user segment, or feature — a pipeline might perform well on simple queries but badly on complex ones
  • Quality vs. cost: scatter plot of quality score vs. cost per request — this tells you whether you are paying for quality or wasting money
  • Quality regressions: automatic detection of score drops greater than 2 standard deviations, correlated with pipeline version changes
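
The regression detection above is a standard z-score check. A sketch, assuming daily score aggregates are already available:

```typescript
// Flag a daily quality score more than `k` standard deviations below the
// mean of a trailing baseline window (k = 2 matches the rule above).
function isQualityRegression(baseline: number[], today: number, k = 2): boolean {
  if (baseline.length < 2) return false;
  const mean = baseline.reduce((a, b) => a + b, 0) / baseline.length;
  const variance =
    baseline.reduce((a, b) => a + (b - mean) ** 2, 0) / (baseline.length - 1);
  const std = Math.sqrt(variance);
  if (std === 0) return today < mean; // flat baseline: any drop is a change
  return today < mean - k * std;
}
```

Run it per pipeline version so a drop can be attributed to the version that introduced it.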

The quality dashboard is the single most important observability artifact. It answers the question that traditional monitoring cannot: "Is our AI system actually working well?"

For background on evaluation methodology and metrics, see our LLM evaluation guide.

Connecting observability to evaluation and governance

Observability data is not just for dashboards. It feeds directly into evaluation and governance workflows, creating a closed loop that improves the system over time.

The observability → evaluation feedback loop

Production traces contain the most valuable eval data you will ever have: real user queries, real retrieval results, and real model responses. The feedback loop:

  1. Collect production traces with quality scores and user feedback
  2. Identify failure patterns — queries where quality scores dropped, users gave negative feedback, or guardrails triggered
  3. Curate eval datasets from production failures — these become your regression test suite
  4. Run offline eval with the curated dataset against pipeline changes
  5. Deploy with confidence knowing your eval suite covers real failure modes

This loop is why teams with good observability have better eval suites. Teams without observability are guessing at what to test. Teams with observability are testing the exact scenarios that failed in production.

// Example: curating eval datasets from production traces
async function curateEvalDataset(
  timeRange: { start: Date; end: Date },
  qualityThreshold: number
): Promise<EvalCase[]> {
  // Find traces where quality was below threshold
  const lowQualityTraces = await traceStore.query({
    timeRange,
    filter: {
      "online_eval.relevance": { lt: qualityThreshold },
    },
    limit: 500,
  });

  // Also include traces with negative user feedback
  const negFeedbackTraces = await traceStore.query({
    timeRange,
    filter: {
      "user_feedback.rating": "negative",
    },
    limit: 200,
  });

  // Deduplicate and format as eval cases
  const allTraces = deduplicateById([...lowQualityTraces, ...negFeedbackTraces]);

  return allTraces.map((t) => ({
    input: t.attributes["user.query"],
    expectedBehavior: "human-review-needed",
    context: t.attributes["retrieval.chunks"],
    metadata: {
      sourceTraceId: t.traceId,
      originalScore: t.attributes["online_eval.relevance"],
      feedbackType: t.attributes["user_feedback.rating"] ?? "none",
    },
  }));
}

The observability → governance connection

Observability data triggers governance workflows:

  • Quality degradation detected → governance system investigates whether a recent pipeline change caused it → potential rollback
  • Cost anomaly detected → governance system checks whether the cost increase was authorized → blocks or escalates if not
  • Safety guardrail trigger rate increases → governance system alerts the safety team → requires review before any further changes
  • Latency SLO breach → governance system logs the incident → requires root cause analysis before next deployment

Without this connection, observability gives you dashboards that humans have to watch. With it, observability gives you automated detection and response. The governance system becomes the enforcement layer for what observability detects.
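
The event-to-response mapping can be made explicit in configuration rather than left to humans watching dashboards. A sketch — the event names and actions are illustrative:

```typescript
type ObservabilityEvent =
  | "quality-degradation"
  | "cost-anomaly"
  | "guardrail-rate-spike"
  | "latency-slo-breach";

interface GovernanceAction {
  action: string;
  blocksDeployment: boolean; // whether further deploys pause until resolved
}

// Explicit routing table: every detection event has a defined response.
const GOVERNANCE_ROUTES: Record<ObservabilityEvent, GovernanceAction> = {
  "quality-degradation": { action: "investigate-recent-versions", blocksDeployment: true },
  "cost-anomaly": { action: "check-authorization", blocksDeployment: false },
  "guardrail-rate-spike": { action: "alert-safety-team", blocksDeployment: true },
  "latency-slo-breach": { action: "open-incident", blocksDeployment: true },
};

function routeEvent(event: ObservabilityEvent): GovernanceAction {
  return GOVERNANCE_ROUTES[event];
}
```

Because the table is exhaustive over the event type, adding a new detection event without defining its response is a compile error, not a gap discovered during an incident.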

For teams building audit trails, every observability event that triggers a governance action becomes part of the audit record — creating a chain of evidence from "we detected a problem" to "here is how we responded."

Observability platform comparison

The LLM observability market has consolidated around a few major platforms. Here is how they compare on the dimensions that matter for production systems.

| Feature | Langfuse | Arize Phoenix | Helicone | Braintrust | Portkey |
| --- | --- | --- | --- | --- | --- |
| Primary focus | LLM tracing + eval | ML observability for LLMs | LLM proxy + logging | Eval + observability | AI gateway + observability |
| Open source | Yes (MIT) | Yes (Apache 2.0) | Yes (Apache 2.0) | Closed source | Closed source |
| OpenTelemetry support | Native OTel export | OTel-compatible | Limited | OTel export | OTel-compatible |
| Tracing model | Hierarchical spans | Hierarchical spans | Request-level | Sessions + spans | Request-level |
| Cost tracking | Per-request, per-model | Per-request | Per-request, per-model | Per-request | Per-request, per-model |
| Online eval | LLM-as-a-judge, custom | Built-in eval framework | Basic scoring | Full eval platform | Basic scoring |
| Multi-agent support | Span trees | Span trees | Limited | Session-level | Limited |
| Self-host option | Yes (Docker) | Yes (Docker) | Yes (Docker) | No | No |
| Pricing model | Open source + cloud | Open source + cloud | Free tier + usage | Usage-based | Free tier + usage |
| Monthly searches | ~12,100 | ~1,800 | ~2,400 | ~3,600 | ~3,200 |

Langfuse is the most searched observability tool in the LLM space, with roughly 12,100 monthly searches for "langfuse" alone. Its strength is the combination of tracing, evaluation, and prompt management in an open-source package. The tracing UI is good for debugging individual requests, and the eval integration lets you run LLM-as-a-judge scoring on production traces. It supports hierarchical spans, making it workable for multi-agent systems. The Langfuse GitHub repository has strong community activity. For a detailed comparison with Coverge, see our Langfuse alternative analysis.

Arize Phoenix comes from the ML observability world — Arize has been monitoring traditional ML models for years. Phoenix brings that experience to LLMs with a strong focus on retrieval analysis and embedding visualization. It processes over 1 trillion spans per month across its customer base. The Arize Phoenix documentation covers its LLM-specific features. For a comparison, see our Arize alternative analysis. Where Phoenix stands out is connecting LLM observability to traditional ML monitoring — useful if you have both.

Helicone takes a proxy-based approach — you route your LLM calls through Helicone's proxy, and it automatically captures requests, responses, latency, and cost. This is the lowest-friction setup: change your base URL and you get observability. The tradeoff is depth — request-level logging is great for cost tracking and latency monitoring, but it does not give you the hierarchical trace structure needed for multi-step pipeline debugging.

Braintrust started as an eval platform and added observability. Its strength is the tight integration between production monitoring and evaluation — production traces feed directly into eval datasets, and eval results inform production quality scoring. Braintrust raised $80M in funding, signaling significant investment in the product.

Portkey is an AI gateway that includes observability as a feature. It sits between your application and LLM providers, handling routing, fallbacks, caching, and load balancing while capturing detailed request logs. The observability is a side effect of the gateway — if you need a gateway anyway (for routing, failover, or cost control), Portkey gives you observability for free. For more on gateways, see our LLM gateway guide.

How to choose

If you need a standalone observability platform: Langfuse or Arize Phoenix. Both are open source, both support self-hosting, both have good tracing models. Langfuse is better if eval integration matters most. Arize Phoenix is better if you also monitor traditional ML models.

If you need the lowest friction setup: Helicone. Change a base URL and you are observing. Good for teams that want cost and latency visibility without building instrumentation.

If eval is your primary concern and observability is secondary: Braintrust. The eval platform is strong, and the observability features support the eval workflow.

If you need a gateway with observability: Portkey. Do not adopt a gateway just for observability, but if you need routing and failover, the observability comes along.

If you need full pipeline governance: None of these platforms handle the complete governance loop (versioning + eval gates + approval workflows + observability + audit trails). They cover the observability leg. Platforms like Coverge integrate observability into the governance workflow — connecting production monitoring to deployment gates, approval workflows, and compliance reporting.

Observability anti-patterns

Logging everything, analyzing nothing

The team instruments every LLM call, captures every prompt and response, and stores terabytes of trace data. But nobody looks at it. There are no dashboards, no alerts, no quality scores. The data exists in case someone needs it, but nobody has built the analysis layer.

Fix this by starting with questions, not data. What do you need to know? "Is quality degrading?" Build the quality scoring pipeline. "Are costs within budget?" Build the cost dashboard. "Which pipeline version is fastest?" Build the version comparison view. Instrument to answer specific questions, not to collect data.

Observability without action

Dashboards show a quality drop. Everyone sees it. Nobody does anything because there is no defined response. Who owns the quality metric? What is the threshold that triggers investigation? Who has the authority to roll back?

Fix this by connecting observability to governance workflows. Define SLOs (service level objectives) for quality metrics. Define who gets paged when an SLO is breached. Define the response playbook. Observability without action is just a fancy screensaver.
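Concretely, an SLO for a quality metric can start as a small record naming the metric, the objective, the owner, and the playbook. A minimal sketch — every field name and value here is an assumption to adapt to your own metrics:

```python
# Illustrative SLO for a quality metric; field names and values are assumptions.
relevance_slo = {
    "metric": "relevance_score_p50",   # rolling 24h median relevance
    "objective": 0.80,                 # must stay at or above this
    "owner": "ai-platform-oncall",     # who gets paged on breach
    "playbook": "compare against last known good pipeline version; roll back if needed",
}

def slo_breached(current_value: float, slo: dict) -> bool:
    """A breach is simply the metric falling below its objective."""
    return current_value < slo["objective"]

print(slo_breached(0.72, relevance_slo))  # 0.72 median relevance breaches the 0.80 SLO
```

The point is not the code — it is that "who owns this and what happens on breach" is written down and machine-checkable, not tribal knowledge.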

Trace sprawl in multi-agent systems

Every agent creates its own traces. Span context is not propagated across agent boundaries. You have 50 disconnected traces that were actually one user request. Debugging means manually correlating traces by timestamp, which is fragile and slow.

Fix this by treating span propagation as infrastructure, not a nice-to-have. Every agent boundary must propagate trace context. Every inter-agent communication channel (HTTP, message queue, function call) must carry the trace context header. Test this like you test any other infrastructure — if propagation breaks, your observability is broken.
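In OpenTelemetry terms, this means injecting a W3C `traceparent` header on every outbound call and extracting it on the receiving side — the SDK's propagators (`opentelemetry.propagate.inject`/`extract`) do this for you. A stdlib-only sketch of the mechanism, using the header format from the W3C Trace Context spec:

```python
import secrets

def inject_traceparent(headers: dict, trace_id: str, span_id: str) -> None:
    """Attach W3C trace context so the next agent joins the same trace."""
    headers["traceparent"] = f"00-{trace_id}-{span_id}-01"

def extract_traceparent(headers: dict):
    """Recover (trace_id, parent_span_id) on the receiving agent, or None."""
    value = headers.get("traceparent")
    if not value:
        return None  # broken propagation: this request becomes an orphan trace
    _version, trace_id, span_id, _flags = value.split("-")
    return trace_id, span_id

# Agent A starts a trace and calls agent B (over HTTP, a queue, anything):
trace_id = secrets.token_hex(16)   # 32 hex chars, per the W3C format
outgoing_headers = {}
inject_traceparent(outgoing_headers, trace_id, secrets.token_hex(8))

# Agent B extracts the context and parents its spans under the same trace:
received = extract_traceparent(outgoing_headers)
assert received is not None and received[0] == trace_id
```

The final assertion — same trace ID on both sides of the boundary — is exactly what your integration tests should check for every inter-agent channel.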

Confusing monitoring with observability

The team has Grafana dashboards showing request rate, error rate, and latency percentiles. They call this "LLM observability." It is monitoring — a subset of observability that tells you that something is wrong, but not why.

Observability requires the ability to ask arbitrary questions about system behavior. "Show me all requests in the last hour where retrieval returned fewer than 3 chunks AND the quality score was below 0.7" — that is an observability query. It requires traces, not just metrics. Teams that stop at metrics miss the debugging power that makes observability valuable.
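Whether your backend expresses this in SQL, a query DSL, or code, the shape is the same: a filter over per-trace attributes. A sketch over hypothetical trace records (field names are assumptions, not any particular platform's schema):

```python
# Hypothetical trace records, as an observability backend might return them.
traces = [
    {"id": "t1", "chunks_retrieved": 2, "quality_score": 0.55},
    {"id": "t2", "chunks_retrieved": 5, "quality_score": 0.91},
    {"id": "t3", "chunks_retrieved": 1, "quality_score": 0.80},
]

# "Fewer than 3 chunks AND quality below 0.7" — only answerable with trace data.
suspect = [
    t for t in traces
    if t["chunks_retrieved"] < 3 and t["quality_score"] < 0.7
]
print([t["id"] for t in suspect])  # → ['t1']
```

No metrics dashboard can answer this, because the join between retrieval behavior and quality score only exists at the individual-trace level.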

Building an observability practice: phased approach

Phase 1: Basic instrumentation (week 1)

  • Add OpenTelemetry instrumentation to your LLM calls (use auto-instrumentation libraries if available)
  • Capture: model, tokens, latency, error status for every call
  • Export to any OTel-compatible backend (Jaeger for local dev; your production backend of choice for staging and beyond)
  • Build a single dashboard: request rate, error rate, latency percentiles, cost per day

At this point you can see how your system is performing in aggregate and debug individual requests through traces.
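If no auto-instrumentation library covers your client, the minimum viable version of this phase is a wrapper that records model, tokens, latency, and error status per call. A stdlib sketch standing in for a real OpenTelemetry span — the word-split token counts are a placeholder for your tokenizer:

```python
import time

def traced_llm_call(call_fn, model: str, prompt: str, spans: list) -> str:
    """Wrap any LLM call so every invocation emits one span record.
    In production this would be an OpenTelemetry span; here it is a dict."""
    start = time.perf_counter()
    span = {"model": model, "error": None, "prompt_tokens": len(prompt.split())}
    try:
        response = call_fn(prompt)
        span["completion_tokens"] = len(response.split())
        return response
    except Exception as exc:
        span["error"] = type(exc).__name__  # error status, captured per call
        raise
    finally:
        span["latency_ms"] = (time.perf_counter() - start) * 1000
        spans.append(span)  # the span is recorded even when the call fails

# Usage with a stubbed model call:
spans = []
stub_model = lambda p: "Paris is the capital of France."
traced_llm_call(stub_model, "stub-model", "What is the capital of France?", spans)
assert spans[0]["model"] == "stub-model" and spans[0]["error"] is None
```

The `finally` block is the important part: spans must be emitted on failures too, or your error traces — the ones you most need — vanish.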

Phase 2: Quality scoring (weeks 2-3)

  • Implement sampled LLM-as-a-judge scoring on production traffic (start at 5-10% sample rate)
  • Score for relevance and faithfulness at minimum
  • Add quality metrics to your dashboard alongside latency and cost
  • Set up alerts for quality score drops greater than 2 standard deviations

At this point you can detect quality degradation, not just availability issues.
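The sampling and alert rules above fit in a few lines. A sketch, assuming a `judge_fn` that returns a score in [0, 1] — in practice a call to a cheap judge model:

```python
import random
import statistics

def maybe_judge(query, response, judge_fn, scores, sample_rate=0.1, rng=random):
    """Score a sampled fraction of production responses with an LLM judge."""
    if rng.random() >= sample_rate:
        return None  # unsampled: the trace still exists, it just goes unjudged
    score = judge_fn(query, response)
    scores.append(score)
    return score

def quality_alert(scores, window, threshold_sigmas=2.0):
    """Fire when the recent window mean drops >2 sigma below the baseline."""
    if len(scores) < window * 2:
        return False  # not enough history to establish a baseline
    baseline, recent = scores[:-window], scores[-window:]
    mu, sigma = statistics.mean(baseline), statistics.stdev(baseline)
    return statistics.mean(recent) < mu - threshold_sigmas * sigma
```

Tune `window` to your traffic: too small and noise pages you, too large and real regressions hide in the average.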

Phase 3: Cost and pipeline version tracking (weeks 3-4)

  • Tag every trace with the pipeline version that served it
  • Build per-version cost comparisons
  • Add cost anomaly alerts (daily spend exceeds 1.5x rolling 7-day average)
  • Build a version comparison view: for any two pipeline versions, show latency, cost, and quality differences

At this point you can answer "did this deployment make things better or worse?" with data.
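The cost anomaly rule in this phase reduces to a comparison against the trailing week. A minimal sketch, assuming one aggregated spend figure per day:

```python
def cost_anomaly(daily_spend, multiplier: float = 1.5) -> bool:
    """Alert when today's spend exceeds multiplier x the prior 7-day average."""
    if len(daily_spend) < 8:
        return False  # not enough history for a 7-day baseline
    baseline = sum(daily_spend[-8:-1]) / 7  # the seven days before today
    return daily_spend[-1] > multiplier * baseline

history = [40, 42, 38, 41, 44, 39, 43]        # prior week: $41/day average
assert cost_anomaly(history + [45]) is False   # normal day
assert cost_anomaly(history + [75]) is True    # ~1.8x baseline: page someone
```

Spend aggregated per pipeline-version tag (rather than globally) turns the same check into a per-deployment cost regression test.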

Phase 4: Feedback loop (weeks 5-8)

  • Connect user feedback to traces (thumbs up/down, regeneration events)
  • Build the eval dataset curation pipeline from production traces
  • Feed curated datasets into your offline eval suite
  • Integrate observability alerts with governance workflows (auto-create investigation tickets on quality drops)

At this point your observability practice is feeding your eval practice, and your eval practice is informing your deployment decisions. This is the closed loop.
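The curation step — turning failing traces into eval candidates — can start as a filter plus reshaping. A sketch; the field names are assumptions to be matched to your trace schema and eval dataset format:

```python
def curate_eval_cases(traces, quality_threshold=0.7):
    """Select failing production traces and reshape them into eval candidates.
    Field names are illustrative; match them to your own trace schema."""
    cases = []
    for t in traces:
        failed_quality = t.get("quality_score", 1.0) < quality_threshold
        negative_feedback = t.get("feedback") == "thumbs_down"
        if failed_quality or negative_feedback:
            cases.append({
                "input": t["query"],
                "context": t.get("retrieved_context", []),
                "actual_output": t["response"],
                "metadata": {
                    "trace_id": t["id"],
                    "quality_score": t.get("quality_score"),
                    "expected_output": None,  # filled in during manual review
                },
            })
    return cases
```

The `expected_output: None` placeholder is the manual-review step: a human labels the case before it enters the regression suite, so every production failure becomes a test that this failure never silently recurs.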

For guidance on building the eval side of this loop, see our LLM evaluation guide and the LLM regression testing guide. For the governance integration, see our AI governance engineering guide.

Frequently asked questions

What is LLM observability?

LLM observability is the practice of instrumenting AI systems so you can understand what they are doing in production — not just whether they are running (monitoring), but how well they are performing. It includes distributed tracing across pipeline stages (retrieval, inference, post-processing), quality scoring of generated outputs, cost tracking per request and per pipeline version, and latency analysis at each stage. The goal is to answer "why is the output wrong?" not just "is the system up?"

What are the best LLM observability tools in 2026?

The leading open-source tools are Langfuse (~12,100 monthly searches, MIT license) and Arize Phoenix (~1,800 monthly searches, Apache 2.0). Langfuse excels at tracing + eval integration. Arize Phoenix is strong for teams that also monitor traditional ML models. Helicone offers the lowest-friction setup via proxy-based logging. Braintrust combines eval with observability. Portkey provides observability as a feature of its AI gateway. The right choice depends on whether you need standalone observability, an eval platform, or a gateway.

How is LLM observability different from traditional APM?

Traditional application performance monitoring tracks latency, error rates, and throughput — metrics where success is binary (the request either worked or it did not). LLM observability adds a quality dimension: a successful HTTP 200 response can still contain a hallucinated, irrelevant, or harmful answer. LLM observability also tracks cost per request (traditional APIs have near-zero marginal cost), token usage, retrieval quality, and multi-step execution graphs that are unique to agent workflows.

How do I set up OpenTelemetry for LLM applications?

Install the OpenTelemetry SDK for your language, configure an exporter (OTLP to your observability backend), and add GenAI semantic conventions to your LLM calls. For Python, packages like opentelemetry-instrumentation-openai and opentelemetry-instrumentation-langchain provide auto-instrumentation. For TypeScript, instrument manually using the @opentelemetry/api package with gen_ai.* attributes. Log prompt and completion content as span events (not attributes) to allow independent sampling.
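The attributes-vs-events split matters because it lets you keep metadata for 100% of requests while sampling content separately. A stdlib sketch of that shape — the `gen_ai.*` names follow the OTel GenAI semantic conventions, but treat the exact event names as assumptions, since the conventions are still evolving:

```python
import random

def record_llm_span(model, prompt, completion, content_sample_rate=0.25, rng=random):
    """Metadata as span attributes (always kept); content as span events
    (sampled independently, so content-logging cost stays bounded)."""
    span = {
        "name": "chat " + model,
        "attributes": {
            "gen_ai.request.model": model,
            "gen_ai.usage.input_tokens": len(prompt.split()),      # placeholder count
            "gen_ai.usage.output_tokens": len(completion.split()),
        },
        "events": [],
    }
    if rng.random() < content_sample_rate:  # content sampled separately from metadata
        span["events"].append({"name": "gen_ai.content.prompt", "body": prompt})
        span["events"].append({"name": "gen_ai.content.completion", "body": completion})
    return span
```

Every span carries token counts and model name; only a quarter carry the full prompt and completion text.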

How much does production LLM observability cost?

The observability cost depends on your approach. Open-source self-hosted solutions (Langfuse, Arize Phoenix) cost only the infrastructure they run on — typically $100-500/month for moderate traffic. Cloud-hosted platforms charge per trace or per event, ranging from $50 to $2,000+/month depending on volume. The biggest cost variable is content logging: storing full prompts and responses for every request at high traffic requires significant storage. Use sampling (10-25% for content, 100% for metadata) to control costs. Budget 5-15% of your LLM inference spend for observability.

What quality metrics should I track in production?

At minimum: relevance (does the response address the query), faithfulness (is the response grounded in the provided context), and safety (does it avoid harmful content). For RAG systems, add context recall (did retrieval find the right information) and answer completeness. For agent systems, add task completion rate and tool use accuracy. Use LLM-as-a-judge for scoring, starting at a 5-10% sample rate of production traffic. A cheaper model (Claude Haiku, GPT-4o-mini) works well as a judge for most quality dimensions.

How do I connect observability to my eval pipeline?

Build a curation pipeline that extracts low-scoring production traces (where quality metrics fell below threshold) and negative-feedback traces into eval dataset format. Include the user query, retrieved context, model response, and quality scores as metadata. Review curated cases manually to add expected outputs or behavior labels. Feed the curated dataset into your offline eval suite so that every production failure becomes a regression test. This feedback loop means your eval suite improves continuously based on real production issues.

Where LLM observability is heading

Three trends will shape the next phase:

Unified AI observability. Today, LLM observability, ML monitoring, and traditional APM are separate tools with separate dashboards. The trend is toward unified platforms that show the full picture — from the user's HTTP request through the application code into the ML model into the LLM call and back. OpenTelemetry is the connective tissue making this possible.

Real-time quality scoring. Current online eval is sampled and asynchronous — you find out about quality problems minutes or hours after they happen. The trend is toward real-time quality scoring using lightweight models or heuristic ensembles that can score every response at low latency. This enables dynamic routing: if quality drops below a threshold, automatically switch to a different model or pipeline version.

Observability-driven governance. Instead of humans watching dashboards and deciding when to act, the observability system will trigger governance actions automatically. Quality degradation triggers an investigation. Cost anomalies pause deployments. Safety metric changes require review. The governance platform becomes the control plane, and observability becomes its sensor network.

The teams that build this stack now — traces, quality scores, cost tracking, and the feedback loop into eval and governance — are the ones that will ship AI features with confidence rather than anxiety. The gap between "it works in dev" and "it works in production" is exactly what observability closes.