Updated: April 15, 2026

LangChain in production: the operational playbook for shipping real applications

By Coverge Team

LangChain is easy to prototype with. A few lines of code, a model call, a retrieval chain — you have a working demo in an afternoon. The gap between that demo and a production system that handles real traffic without surprise failures is where most teams get stuck.

The framework has matured significantly. LangChain 1.0 and LangGraph 1.0 went GA in October 2025 with a no-breaking-changes pledge until 2.0. The older AgentExecutor is deprecated in favor of LangGraph's stateful graph runtime. LangSmith added OpenTelemetry support in March 2025, connecting to existing observability pipelines. These changes addressed legitimate criticisms about stability and vendor lock-in — but the framework itself is only part of the production story.

This guide covers the operational layer that sits on top of LangChain: how to observe what your application is doing, evaluate whether it is doing it well, version the components that change, and govern how changes reach users. These are core LLMOps practices that apply to any production LLM application.

Start with observability or you are flying blind

The single most important production decision you make with LangChain is instrumentation. When a chain produces a bad response, you need to see the full trace — which retriever ran, what documents came back, what the prompt looked like after template rendering, which model version responded, and how long each step took. Without this, debugging an LLM application in production is guesswork.

LangSmith tracing

LangSmith is the obvious choice if you are already in the LangChain ecosystem. It captures every node, tool call, and model interaction as a distributed trace. You get dashboards for cost, latency, and error rates out of the box.

The integration is minimal — set two environment variables and traces flow automatically:

export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=your-key

What matters more than the setup is what you do with the traces. Three things to configure immediately:

Latency alerts. Set a P95 latency threshold for your critical paths. LangChain applications often have multi-step chains where one slow retrieval or one large model response cascades into a timeout. A target of P95 under 5 seconds is reasonable for most user-facing applications — adjust based on your SLA.
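A P95 check like this can run over a sliding window of trace latencies. This is an illustrative stdlib sketch, not a LangSmith API — the function and threshold names are assumptions; wire the real alert into whatever monitoring system you use.

```python
import statistics

SLA_P95_SECONDS = 5.0  # illustrative target; adjust to your SLA

def p95_breached(latencies_s: list[float], threshold: float = SLA_P95_SECONDS) -> bool:
    """Return True when the 95th-percentile latency exceeds the threshold."""
    if len(latencies_s) < 20:  # too few samples for a stable P95
        return False
    p95 = statistics.quantiles(latencies_s, n=100)[94]  # 95th percentile
    return p95 > threshold

# Example: a 5% slow tail (one slow retrieval step) drags P95 past the SLA
window = [1.2] * 95 + [9.0] * 5
print(p95_breached(window))  # → True
```

Note that P95 hides nothing about the median: the window above is mostly fast, and only the tail breaches the target — exactly the cascade pattern described above.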

Error rate monitoring. Track tool error rate separately from model error rate. A healthy production application should maintain a tool error rate below 3%. If your web search tool fails 10% of the time, that is a tool reliability problem, not a model problem — and the fix is different.

Cost tracking per trace. LLM costs accumulate fast in multi-step agents. A single LangGraph agent that calls GPT-4o three times, runs a retrieval step, and makes a tool call can cost $0.05-0.15 per invocation. At 10,000 daily requests, that is $500-1,500/day. You need per-trace cost visibility to catch regressions — a prompt change that accidentally doubles the context window doubles your cost before you notice.
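Per-trace cost attribution can be sketched as a simple aggregation over the model calls in a trace. The per-million-token rates below are placeholders, not current pricing — look up your provider's rates, which change frequently; all names here are illustrative.

```python
# (input_rate, output_rate) in USD per million tokens — placeholder values
RATES_PER_M = {
    "gpt-4o-2024-08-06": (2.50, 10.00),
}

def trace_cost(calls: list[dict]) -> float:
    """Sum estimated USD cost across all model calls in one trace."""
    total = 0.0
    for call in calls:
        in_rate, out_rate = RATES_PER_M[call["model"]]
        total += call["input_tokens"] / 1e6 * in_rate
        total += call["output_tokens"] / 1e6 * out_rate
    return total

# Three model calls in one agent run, 4K context each
trace = [{"model": "gpt-4o-2024-08-06",
          "input_tokens": 4000, "output_tokens": 500}] * 3
print(round(trace_cost(trace), 4))  # → 0.045
```

The useful part is the shape, not the numbers: alert on this per-trace figure, and a prompt change that doubles the context window shows up as a doubled `input_tokens` cost the same day.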

OpenTelemetry for existing stacks

If your team already runs Datadog, Grafana, or another observability platform, you do not need to abandon it for LangSmith. LangSmith's OpenTelemetry support means you can export LangChain traces to your existing tooling. This matters in organizations where the platform team has already standardized on an observability stack and is not going to adopt a new dashboard for one service.

The tradeoff: generic observability tools show you latency, errors, and throughput. They do not show you prompt contents, model outputs, or evaluation scores. You get infrastructure visibility but lose the LLM-specific context. For many teams, the answer is both — LangSmith for development and debugging, OpenTelemetry exports for production alerting and SLA tracking.

Evaluation: the quality gate you probably do not have

Observability tells you that your application is running. Evaluation tells you that it is running well. Most LangChain applications in production have the first but not the second.

Offline evaluation with datasets

LangSmith lets you build evaluation datasets — collections of inputs paired with reference outputs or scoring criteria — and run your chain against them. This is your regression test suite for LLM behavior.

Build datasets from three sources:

  1. Production traces. When a user reports a bad response, save the input and context as a test case. Over time your dataset becomes a living record of every failure mode your system has encountered.

  2. Synthetic generation. Use a stronger model to generate edge cases your production model should handle. "What happens when the user asks about a topic outside the knowledge base?" "What if the retrieval returns conflicting documents?"

  3. Manual curation. Product managers and domain experts can write test cases that cover business-critical scenarios. These often catch things engineers miss — the customer who asks questions in Spanish, the query that triggers a safety refusal incorrectly.

Run evaluations on every PR that touches prompts, retrieval config, or model settings. This is the eval-gated pipeline concept applied specifically to LangChain. The eval does not need to block every commit — deterministic checks (type, format, length) on every push, a 50-case eval subset on PRs, full suite on merges to main.
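The deterministic tier of that gate is cheap enough to run on every push because it needs no model calls. A minimal sketch, with illustrative check names and limits — scored LLM-as-a-judge evals would run separately on PRs:

```python
import json

def deterministic_checks(output: str, max_chars: int = 2000) -> dict[str, bool]:
    """Model-free checks: non-empty, within length, parseable as JSON."""
    checks = {
        "non_empty": bool(output.strip()),
        "within_length": len(output) <= max_chars,
        "valid_json": True,
    }
    try:
        json.loads(output)
    except ValueError:
        checks["valid_json"] = False
    return checks

result = deterministic_checks('{"answer": "42"}')
print(all(result.values()))  # → True
```

In CI, a failing check blocks the push; the 50-case scored subset only runs once these pass, so the expensive tier never wastes judge tokens on malformed output.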

Online evaluation against live traffic

Offline evaluation has a coverage problem: your test dataset will never represent the full diversity of production traffic. LangSmith supports online evaluators that score a sample of live responses against quality criteria.

This catches the long tail — the query patterns that no one anticipated when building the test dataset. Set up online evaluators for your most important quality dimensions: factual accuracy for RAG applications, format compliance for structured output, safety for user-facing chatbots.

The practical question is how large a sample to evaluate. Scoring every production response is expensive (each judge call costs tokens). Scoring 5-10% of traffic gives you statistical signal without doubling your LLM costs.

Versioning: everything that affects output quality

A LangChain application has at least four moving parts that affect output quality: prompts, model versions, retrieval configuration, and orchestration logic. Changing any one of them changes behavior. Versioning everything is not pedantry — it is how you answer "what changed?" when quality drops.

Prompt versioning

LangSmith has built-in prompt versioning with chain-awareness. You can A/B test prompt variants against the same evaluation dataset and compare scores before committing to a change. This is better than storing prompts in a YAML file in your repo (which works but lacks the scoring integration).

If you are not using LangSmith for prompt management, the minimum viable approach is: store prompts in version-controlled files, tag each deployed version, and log the prompt version in every trace. When quality drops, you can correlate the degradation with the prompt change that caused it.
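The correlation step in that minimum viable approach can be as simple as a content-addressed version ID logged with every trace. A sketch with illustrative names — the metadata field and prompt text are assumptions:

```python
import hashlib

def prompt_version(prompt_text: str) -> str:
    """Content-addressed version ID: same text yields same ID across deploys."""
    return hashlib.sha256(prompt_text.encode()).hexdigest()[:12]

PROMPT = "You are a support assistant. Answer from the provided context only."
metadata = {
    "prompt_version": prompt_version(PROMPT),
    "model": "gpt-4o-2024-08-06",
}
# Attach `metadata` to every trace; when quality drops, group traces by
# prompt_version to pinpoint which prompt change caused the regression.
print(metadata["prompt_version"])
```

Hashing the text itself (rather than a manually bumped number) means a forgotten version bump cannot silently ship two different prompts under one label.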

Model version pinning

LangChain makes it easy to specify a model — ChatOpenAI(model="gpt-4o") — but that string is an alias that changes underneath you. When OpenAI ships a new gpt-4o snapshot, your behavior changes with no code change on your side.

Pin to dated snapshots: gpt-4o-2024-08-06 instead of gpt-4o. When a new snapshot is available, create a branch, update the pin, run the full eval suite, and review the results before merging. This turns a surprise behavior change into a deliberate upgrade with data.

Retrieval configuration

For RAG applications, the retrieval pipeline is often the biggest lever on output quality — bigger than the prompt, bigger than the model. Document the following in your config and version it alongside your code:

  • Chunk size and overlap
  • Embedding model and version
  • Number of retrieved documents (top-k)
  • Reranking model and threshold
  • Metadata filters

A change from 512-token chunks to 1024-token chunks changes every response your application produces. Treat it like a code change: branch, evaluate, compare, merge.
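The knobs above can be captured in one versioned config object so that "what changed?" has a mechanical answer. Field names and defaults here are illustrative assumptions, not a LangChain API:

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RetrievalConfig:
    chunk_size: int = 512
    chunk_overlap: int = 64
    embedding_model: str = "text-embedding-3-small"
    top_k: int = 5
    rerank_threshold: float = 0.3

    def version(self) -> str:
        """Stable hash: any knob change produces a new version ID."""
        blob = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()[:12]

old = RetrievalConfig()
new = RetrievalConfig(chunk_size=1024)
print(old.version() != new.version())  # → True
```

Log the config version in every trace alongside the prompt and model versions; a chunk-size change then shows up in the trace data the same way a code deploy does.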

Deployment governance: who can ship what

Operational maturity is not just about tooling — it is about process. Who can change a prompt in production? What review does a model version upgrade require? These are governance questions that become critical as your LangChain application handles real business workflows.

The LangGraph Platform option

LangChain offers a managed deployment platform (now called LangSmith Deployment) that handles infrastructure for LangGraph applications. Three tiers: Cloud SaaS, Hybrid (SaaS control plane with self-hosted data plane), and fully self-hosted via Helm charts on Kubernetes.

The pricing is usage-based at $0.001 per node execution with a free tier up to 100K nodes/month. The Plus plan runs $39/user/month and includes a dev deployment instance. Self-hosting the platform requires a Kubernetes cluster with at least 16 vCPU / 64 GB RAM plus managed databases — roughly $950-1,150/month for infrastructure alone.

For teams evaluating this against LangSmith's observability pricing (or considering alternatives like Braintrust), the platform cost is additive. You pay for LangSmith (observability) plus the deployment platform (infrastructure). The alternative is deploying LangGraph yourself using standard container orchestration.

Self-hosted deployment

Most teams deploy LangChain applications as standard containerized services — Docker, Kubernetes, whatever your team already uses. The framework does not impose deployment requirements. Your FastAPI or Express wrapper around a LangGraph agent deploys the same way as any other web service.

The LangChain-specific considerations for self-hosted deployment:

State persistence. If your LangGraph agents use checkpointing (and they should, for fault tolerance), you need a database tier. PostgreSQL is the recommended backend for production — it gives you durability, queryable history, and disaster recovery. Redis works for high-throughput scenarios where sub-millisecond checkpoint retrieval matters. MemorySaver is dev-only.
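The checkpointing pattern itself is simple: persist agent state per thread and step, and resume from the latest checkpoint after a crash. This is a generic stdlib sketch using SQLite as a stand-in for Postgres so it is self-contained — in a real deployment LangGraph's checkpointer fills this role; the schema and function names here are illustrative.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a durable Postgres tier
conn.execute("""CREATE TABLE checkpoints (
    thread_id TEXT, step INTEGER, state TEXT,
    PRIMARY KEY (thread_id, step))""")

def save_checkpoint(thread_id: str, step: int, state: dict) -> None:
    """Durably record agent state after each step."""
    conn.execute("INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
                 (thread_id, step, json.dumps(state)))

def latest_checkpoint(thread_id: str):
    """Fetch the most recent state for a thread, or None if none exists."""
    row = conn.execute(
        "SELECT state FROM checkpoints WHERE thread_id = ? "
        "ORDER BY step DESC LIMIT 1", (thread_id,)).fetchone()
    return json.loads(row[0]) if row else None

save_checkpoint("thread-1", 1, {"messages": ["hi"]})
save_checkpoint("thread-1", 2, {"messages": ["hi", "hello!"]})
print(latest_checkpoint("thread-1"))  # resume from step 2 after a crash
```

The queryable history is the operational payoff: you can inspect exactly what state an agent held at the step where it failed, which an in-memory saver cannot give you.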

Dependency management. LangChain's dependency graph is heavy. Teams have reported container sizes and deployment times inflating significantly. Pin your dependencies strictly and consider whether you need the full langchain package or just langchain-core and langgraph with specific provider integrations.

Memory management. The default conversation memory in LangChain stores full conversation history, which grows without bound. Teams have reported significant cost reductions — in some cases 20-30% of LLM spend — after replacing the default memory with custom solutions that summarize older messages or drop low-relevance turns. For production, implement a memory strategy: sliding window, summarization, or token-budget-based truncation.
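A token-budget truncation strategy can be sketched in a few lines: keep the system message, then as many recent turns as fit the budget. The token count here is a crude characters-per-token heuristic for illustration — use your model's actual tokenizer in production; all names are assumptions.

```python
def approx_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token. Replace with a real tokenizer."""
    return max(1, len(text) // 4)

def truncate_history(messages: list[dict], budget: int = 2000) -> list[dict]:
    """Keep the system message plus the newest turns that fit the token budget."""
    system, turns = messages[0], messages[1:]
    kept, used = [], approx_tokens(system["content"])
    for msg in reversed(turns):  # walk newest-first
        cost = approx_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))

history = [{"role": "system", "content": "Be concise."}] + [
    {"role": "user", "content": "x" * 400}] * 50
trimmed = truncate_history(history, budget=500)
print(len(trimmed))  # → 5: the system message plus the 4 newest turns that fit
```

Summarization strategies replace the dropped turns with a model-generated digest instead of discarding them; the budget logic stays the same.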

Approval workflows

For applications where incorrect outputs have business consequences — customer-facing bots, document generation, decision support — add a human approval step before deploying changes.

LangGraph's built-in checkpointing supports human-in-the-loop patterns at the agent level (interrupt execution, wait for approval, resume). But deployment approval is a layer above that: who can merge a prompt change to main, and what evidence do they need to see?

At minimum, require eval results attached to every PR that modifies prompts or model configuration. The reviewer should see the eval scores for the current version versus the proposed version before approving. This is not bureaucracy — it is the same principle as requiring tests to pass before merging code. You are just extending it to LLM-specific quality dimensions.

Coverge takes this further — every pipeline change goes through compilation, graph validation, and an eval suite. The results, approval decision, and deploy metadata are packaged into an immutable proof bundle. The pipeline cannot reach production without passing the gate and getting human sign-off. That level of rigor may not be necessary for every application, but for regulated industries and high-stakes use cases, the audit trail pays for itself.

Common production pitfalls

After observability, evaluation, versioning, and governance, here are the tactical issues that trip teams up:

Abstraction overhead. LangChain's wrappers add latency. Measure your end-to-end latency with and without the framework — if the overhead is material (some teams report 1+ second from memory wrappers alone), consider using langchain-core directly or dropping to raw API calls for latency-critical paths.

Prompt-model coupling. A prompt optimized for GPT-4o may not work well with Claude 3.5 Sonnet. Good prompt management practices help mitigate this. If you plan to switch models (for cost, latency, or capability reasons), test the prompt against the target model's eval suite before switching. Do not assume transferability.

Retrieval drift. Your knowledge base changes over time — new documents added, old ones updated, embeddings recomputed. If your eval dataset references specific documents, evals can pass even when production retrieval quality has degraded. Periodically refresh your eval dataset with current production queries.

Cost surprises. Multi-step agents are expensive by nature. A LangGraph agent that chains three model calls averages 3x the cost of a simple completion. Add tool calls and retrieval, and per-request costs climb fast. Set cost alerts at the trace level, not just the monthly bill level, so you catch regressions the day they happen — not at the end of the month.


FAQ

Is LangChain production-ready?

Yes. LangChain 1.0 and LangGraph 1.0 went GA in October 2025 with an explicit stability pledge: no breaking changes until 2.0. The older AgentExecutor is deprecated — LangGraph is now the recommended runtime for production agents. The framework is used in production at companies ranging from startups to enterprises, though many teams use only langchain-core and langgraph to avoid dependency bloat from the full package.

What observability tools work with LangChain in production?

LangSmith is the first-party option, providing distributed tracing, cost dashboards, latency monitoring, and evaluation against live traffic. Since March 2025, LangSmith also supports OpenTelemetry export, meaning traces can flow into existing tools like Datadog, Grafana, or New Relic. Third-party alternatives include Langfuse (open source, see our Langfuse pricing breakdown), Arize Phoenix, and Helicone — all integrate with LangChain via callbacks or OpenTelemetry.

How do I evaluate LangChain applications before deploying?

Build evaluation datasets from production failures, synthetic generation, and manual curation. Run deterministic checks (format, length, safety) on every push. Run scored evaluations — using LLM-as-a-judge or reference-based comparison — on pull requests that touch prompts, retrieval config, or model settings. This is the eval-gated pipeline pattern described in our LLM CI/CD guide. LangSmith provides native eval tooling, or you can use open-source frameworks like DeepEval, RAGAS, or promptfoo.

How do I version prompts in LangChain?

LangSmith offers built-in prompt versioning with A/B testing against evaluation datasets. Without LangSmith, store prompts in version-controlled files in your repository, tag deployed versions in your config, and log the prompt version in every trace. The critical requirement is correlation — when quality drops, you need to identify which prompt version caused the change and roll back to the previous version.

Should I use LangGraph Platform or self-host?

LangGraph Platform (LangSmith Deployment) handles infrastructure and scales automatically, starting with a free tier at 100K node executions/month. The Plus plan is $39/user/month. Self-hosting via Helm charts requires a 16+ vCPU / 64+ GB RAM Kubernetes cluster (~$950-1,150/month minimum for infrastructure). Choose the platform if you want managed operations and are comfortable with usage-based pricing. Self-host if you need full data control, run in regulated environments, or already have container orchestration expertise.

How does LangChain production relate to LLMOps?

Taking LangChain to production is an exercise in LLMOps. The framework provides the application layer — chains, agents, retrieval — but production readiness depends on the operational layer: observability (tracing every request), evaluation (scoring quality before and after deployment), versioning (tracking every component that affects output), and governance (controlling who can change what). These are the same LLMOps best practices that apply regardless of framework.