LLM Tracing

LLM tracing is the practice of recording the full execution path of a language model request -- from the initial user input through prompt construction, model calls, tool use, retrieval steps, and response generation -- as a structured trace. The concept extends distributed tracing from microservices to AI-specific operations. Each step in the pipeline becomes a span in the trace, and the trace gives you a complete picture of what happened, how long it took, and what it cost.

What traces capture

A trace for a single LLM request typically records:

Prompt construction. The assembly of the final prompt from templates, system instructions, few-shot examples, retrieved context, and user input. The trace should capture the complete prompt sent to the model, not just the template. This connects directly to prompt management -- tracing validates that the right prompt version was assembled. This is where debugging happens most often.

Retrieval steps. If the pipeline includes RAG, the trace captures the search query, the chunks retrieved, their relevance scores, and any re-ranking. When a response hallucinates, the retrieval span is usually where you find the problem.

Model calls. The request sent to the model provider, the response, latency, token counts, the model identifier, and parameters like temperature. If the pipeline makes multiple model calls, each is a separate span.

Tool use. If the model invokes tools (function calls, API calls, code execution), each invocation is a span with its input, output, latency, and errors. In multi-agent orchestration, tool use spans can nest deeply as agents delegate to sub-agents.

Post-processing and output. Guardrail checks, format validation, content filtering, and the final response delivered to the user with total latency and cost metadata.
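The steps above can be sketched as a structured trace record. This is a minimal illustration, not a standard schema: the field names, span names, and values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical span record -- field names are illustrative, not a standard schema.
@dataclass
class Span:
    name: str            # e.g. "prompt_construction", "vector_search", "llm_call"
    span_type: str       # "chain", "retrieval", "generation", "tool"
    input: Any
    output: Any
    latency_ms: float
    metadata: dict = field(default_factory=dict)

@dataclass
class Trace:
    trace_id: str
    spans: list = field(default_factory=list)

    def total_latency_ms(self) -> float:
        return sum(s.latency_ms for s in self.spans)

# One request's trace, following the steps described above.
trace = Trace(trace_id="req-123")
trace.spans.append(Span("prompt_construction", "chain",
                        {"template": "support_v3", "user_input": "reset password?"},
                        "<final assembled prompt>", 2.1))
trace.spans.append(Span("vector_search", "retrieval",
                        "reset password?", ["chunk-17", "chunk-42"], 48.0,
                        {"relevance_scores": [0.91, 0.83]}))
trace.spans.append(Span("llm_call", "generation",
                        "<final assembled prompt>", "<model response>", 820.0,
                        {"model": "gpt-4o", "input_tokens": 412, "output_tokens": 96}))
```

Note that the complete assembled prompt and the retrieved chunks are stored on the spans themselves, so a later reader can reconstruct exactly what the model saw.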

Span types in LLM traces

Borrowing from distributed tracing conventions, LLM traces organize operations into typed spans:

  • Generation spans represent model inference calls. They capture the prompt, completion, model parameters, token usage, and latency.
  • Retrieval spans represent search or lookup operations. They capture the query, results, and relevance scores.
  • Tool spans represent external function calls or API invocations triggered by the model.
  • Chain spans represent orchestration logic — the glue code that connects steps in the pipeline.
  • Agent spans represent autonomous agent loops that may include multiple generation, tool, and retrieval spans.

These span types let you filter and aggregate traces in ways that matter: "show me all retrieval spans where latency exceeded 500ms" or "show me generation spans where token usage exceeded 4k output tokens." The OpenTelemetry semantic conventions for GenAI are standardizing these span types across the ecosystem.
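The two example queries above amount to simple filters over typed spans. A sketch over hypothetical in-memory span records (a tracing platform would run the equivalent query against its span store):

```python
# Hypothetical span records; field names are illustrative.
spans = [
    {"type": "retrieval",  "name": "vector_search", "latency_ms": 640, "tokens_out": None},
    {"type": "generation", "name": "draft_answer",  "latency_ms": 950, "tokens_out": 4800},
    {"type": "retrieval",  "name": "rerank",        "latency_ms": 120, "tokens_out": None},
    {"type": "generation", "name": "summarize",     "latency_ms": 310, "tokens_out": 850},
]

# "All retrieval spans where latency exceeded 500ms"
slow_retrievals = [s for s in spans
                   if s["type"] == "retrieval" and s["latency_ms"] > 500]

# "Generation spans where token usage exceeded 4k output tokens"
heavy_generations = [s for s in spans
                     if s["type"] == "generation" and s["tokens_out"] > 4000]
```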

OpenTelemetry for LLMs

OpenTelemetry (OTel) is the standard instrumentation framework for distributed tracing, and the LLM ecosystem has adopted it with semantic conventions for model calls, token counts, and prompt content.

The practical benefit: your LLM traces integrate with the same observability infrastructure your backend team already runs. When a user reports a slow response, you can follow the trace from the API endpoint through the LLM pipeline to the model provider. Platforms like Langfuse, Arize, and Coverge support OTel-based ingestion, so you are not locked into a single vendor.
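As a sketch, a generation span under the OTel GenAI semantic conventions carries attributes like the following. The attribute names below follow the draft conventions and may evolve; the values are illustrative, and in real code the attributes would be set on an SDK span rather than built as a plain dict.

```python
# Attribute names follow the OpenTelemetry GenAI semantic conventions
# (still evolving); values are illustrative.
generation_span_attributes = {
    "gen_ai.operation.name": "chat",
    "gen_ai.system": "openai",          # the model provider
    "gen_ai.request.model": "gpt-4o",
    "gen_ai.request.temperature": 0.2,
    "gen_ai.response.model": "gpt-4o-2024-08-06",
    "gen_ai.usage.input_tokens": 412,
    "gen_ai.usage.output_tokens": 96,
}

# With the OpenTelemetry SDK these would be set on a span, roughly:
#   with tracer.start_as_current_span("chat gpt-4o") as span:
#       for key, value in generation_span_attributes.items():
#           span.set_attribute(key, value)
```

Because the attribute names are standardized, any OTel-compatible backend can aggregate token usage and latency across providers without custom parsing.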

Why tracing matters for debugging

Without traces, debugging an LLM application is guesswork. "The model gave a bad answer" could mean a dozen things: prompt bug, irrelevant retrieval context, hallucination, guardrail misfire, or a model version change.

With traces, you pull up the specific request, inspect each span, and pinpoint the failure. If the retrieval span shows the top-3 chunks were off-topic, the problem is retrieval, not generation. If the prompt span shows duplicated system instructions, the problem is template logic. This is especially important for agent-based systems, where execution paths are dynamic and vary per request.

Why tracing matters for compliance

AI governance frameworks increasingly require that organizations explain how an AI system produced a specific output. When an auditor asks "what data did the model use to generate this recommendation?", retrieval spans answer that question. When a customer disputes an AI-generated decision, the trace shows the full chain of operations.

In Coverge, traces are linked to the pipeline version and proof bundle that was active when the request was processed, connecting runtime behavior to deployment governance.

Practical considerations

Sampling. Most teams trace 100% of requests in development and sample 10-20% in production, plus 100% of errors and latency outliers.
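That policy can be expressed as a single sampling decision per request. A minimal sketch, assuming a 15% base rate and a 2-second latency threshold (both illustrative defaults, not recommendations):

```python
import random
from typing import Optional

def should_trace(is_error: bool, latency_ms: float,
                 base_rate: float = 0.15,
                 latency_threshold_ms: float = 2000.0,
                 rng: Optional[random.Random] = None) -> bool:
    # Always keep errors and latency outliers; sample the rest at base_rate.
    if is_error or latency_ms > latency_threshold_ms:
        return True
    return (rng or random).random() < base_rate
```

In development, setting base_rate to 1.0 reproduces the trace-everything behavior.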

Sensitive data. Traces capture prompts and completions, which may contain PII. Your tracing pipeline needs retention policies, access control, and redaction. Do not send customer data to a third-party service without understanding where it is stored.
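A minimal redaction sketch, masking e-mail addresses before spans leave the process. Real pipelines cover many more PII classes (names, card numbers, account IDs) and often use dedicated detection libraries rather than a single regex:

```python
import re

# Illustrative pattern; production redaction needs broader PII coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact(text: str) -> str:
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

span_payload = {"prompt": "Reset the password for jane.doe@example.com please"}
span_payload["prompt"] = redact(span_payload["prompt"])
```

Redacting before export means the raw PII never reaches the tracing backend, which simplifies retention and access-control obligations.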

Overhead. Tracing instrumentation should be asynchronous and non-blocking. Export spans in batches, not synchronously per-request. A single trace with full prompt text can be several KB, so storage costs add up at scale.
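The batched, non-blocking export pattern can be sketched as a queue plus a background worker. Class and parameter names are hypothetical; production systems would use an existing exporter such as the OpenTelemetry SDK's batch span processor rather than rolling their own:

```python
import queue
import threading

class BatchSpanExporter:
    """Sketch of a non-blocking exporter: record() only enqueues;
    a daemon thread flushes spans in batches."""

    def __init__(self, export_fn, batch_size=50, flush_interval_s=5.0):
        self._export_fn = export_fn          # e.g. one HTTP POST per batch
        self._batch_size = batch_size
        self._flush_interval_s = flush_interval_s
        self._queue = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def record(self, span: dict) -> None:
        # Called on the request path; never blocks on network I/O.
        self._queue.put(span)

    def _run(self) -> None:
        batch = []
        while True:
            try:
                batch.append(self._queue.get(timeout=self._flush_interval_s))
                if len(batch) < self._batch_size:
                    continue             # keep filling until the batch is full
            except queue.Empty:
                pass                     # interval elapsed; flush what we have
            if batch:
                self._export_fn(batch)   # one network call per batch
                batch = []
```

The daemon thread means a crash or shutdown can drop the in-flight batch; real exporters add an explicit flush-on-shutdown hook.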

Further reading