Updated: April 15, 2026

LLM gateway: routing, failover, and cost control for production AI systems

By Coverge Team

Your application calls an LLM. The provider has an outage. Your users see errors for 47 minutes until someone notices and manually switches to a backup model. Or worse — the provider is not down, just slow. P99 latency creeps from 3 seconds to 12 seconds and nobody flags it because the health check still returns 200.

This is the problem an LLM gateway solves. It sits between your application and your model providers, handling routing, failover, rate limiting, cost tracking, and guardrails at the infrastructure layer. Instead of scattering provider-specific logic across your application code, you centralize it in a gateway and let your application focus on what it does with the model's output.

Search volume for "llm gateway" is 858 monthly searches with 33% year-over-year growth — the fastest-growing infrastructure term in the AI ops space. That growth reflects a transition from single-provider prototypes to multi-provider production systems where reliability, cost, and compliance matter.

This guide covers what an LLM gateway does, when you need one versus application-level controls, how to choose between the major options, and what changes when you are routing for agent systems instead of simple completions.

What an LLM gateway actually does

An LLM gateway is a proxy layer that intercepts every request between your application and LLM providers. At its simplest, it is a reverse proxy with LLM-specific features. At its most capable, it is a policy engine that shapes how your organization interacts with AI models.

Core capabilities

Unified API. Your application sends requests using one format. The gateway translates them to provider-specific formats — OpenAI, Anthropic, Google, Mistral, Cohere, open-source models on your own infrastructure. Switching providers means changing a gateway config, not rewriting application code.

Automatic failover. When a provider is down or responding slowly, the gateway routes to a fallback. Good gateways detect degradation before complete failure — if latency exceeds a threshold or error rates spike, they shift traffic to a healthy provider without waiting for a hard failure.
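
Degradation-aware selection can be sketched in a few lines. The ProviderHealth shape and thresholds below are illustrative, not any particular gateway's API:

```python
from dataclasses import dataclass

@dataclass
class ProviderHealth:
    name: str
    p99_latency_ms: float  # rolling p99 over a recent window
    error_rate: float      # fraction of failed requests in the window

def pick_provider(providers, max_p99_ms=8000, max_error_rate=0.05):
    """Return the first provider in priority order whose recent stats
    look healthy; degraded providers are skipped even if their health
    check still returns 200."""
    for p in providers:
        if p.p99_latency_ms <= max_p99_ms and p.error_rate <= max_error_rate:
            return p.name
    # Everything looks degraded: prefer the primary over failing outright.
    return providers[0].name
```

The key point is the second condition: a provider with a slow p99 gets skipped even though it is technically "up."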

Load balancing. Distribute requests across multiple providers or API keys to maximize throughput and avoid rate limits. This is especially relevant for teams running high-volume pipelines that exceed single-key limits.

Cost tracking. Every request logs token usage, model, cost per token, and total spend. The gateway aggregates this by team, project, feature, or any dimension you tag. Without a gateway, cost tracking means parsing logs from every service that calls an LLM and hoping you did not miss any.

Rate limiting and quotas. Set per-team or per-project spending limits. Hard caps prevent runaway costs from bugs or prompt injection. Soft caps trigger alerts. This is table stakes for any organization with more than one team using LLMs.
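
The hard/soft cap check itself is conceptually simple; the value of the gateway is applying it to every request automatically. A minimal sketch:

```python
def check_budget(spent_usd: float, hard_cap_usd: float, soft_ratio: float = 0.8) -> str:
    """Classify month-to-date spend against a team's quota:
    'block' enforces the hard cap, 'alert' fires the soft-cap warning."""
    if spent_usd >= hard_cap_usd:
        return "block"
    if spent_usd >= hard_cap_usd * soft_ratio:
        return "alert"
    return "ok"
```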

Caching. Identical requests (same model, same prompt, same parameters) return cached responses instead of making a new API call. Semantic caching goes further, returning cached responses for semantically similar requests. Cache hit rates of 20-40% are typical for applications with repeated query patterns.
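
Exact-match caching reduces to computing a deterministic key over the request. A sketch (a production gateway would also scope keys by user or tenant and apply per-use-case TTLs):

```python
import hashlib
import json

def cache_key(model: str, messages: list, **params) -> str:
    """Exact-match cache key: identical model + messages + parameters
    hash to the same key, so the gateway can return a stored response."""
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,          # stable ordering so equal requests match
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Note that any parameter change (temperature, max_tokens) produces a different key, which is what you want: those requests can return different responses.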

Guardrails. Request-level checks that run before a prompt reaches the model (input guardrails) or before a response reaches the user (output guardrails). PII detection, content filtering, topic restrictions, and format validation. For more on guardrails architecture, see our guardrails guide.

Audit logging. Every request and response logged with metadata — who sent it, which model handled it, what it cost, how long it took, whether any guardrails triggered. This is the foundation for observability and compliance.

Gateway vs. application-level controls

Not every team needs a gateway. Here is when application-level controls are sufficient and when a gateway becomes necessary.

Application-level is fine when

You are calling one provider from one service. You have a single codebase, a single team, and a single model. Adding a gateway layer adds operational overhead (another service to run, monitor, and maintain) without much benefit. Put your retry logic, timeout handling, and cost logging directly in your application.
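
At this stage, the retry logic can be as small as a helper like this (a sketch; wiring it to your actual provider call is up to your application):

```python
import time

def call_with_retries(fn, max_attempts=3, base_delay_s=0.5, sleep=time.sleep):
    """Minimal application-level retry with exponential backoff,
    the kind of logic a gateway would otherwise centralize."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            sleep(base_delay_s * (2 ** attempt))
```

The injected sleep parameter keeps the helper testable; in production you leave the default.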

You are still prototyping. If you are figuring out which model to use or how to structure your prompts, optimizing the infrastructure layer is premature. Ship the prototype, validate the use case, then add infrastructure when the use case proves out.

A gateway becomes necessary when

Multiple teams call LLMs independently. Without a gateway, each team builds their own retry logic, their own cost tracking, their own rate limiting. Standards diverge. Cost visibility fragments. A gateway gives you a single control plane across all teams.

You use multiple providers. The moment you need fallback between Anthropic and OpenAI, or you want to route different use cases to different models, managing this in application code gets messy fast. Provider APIs change at different rates, have different error formats, and handle rate limiting differently. The gateway absorbs this complexity.

Compliance or audit requirements exist. If you need a complete log of every LLM interaction — what was sent, what was received, who triggered it, which data was included — a gateway is the natural enforcement point. It sees every request. Application-level logging misses the requests that bypass your standard code paths.

Cost is a real concern. If your monthly LLM spend has a budget attached to it, you need centralized cost tracking and the ability to set limits. Application-level cost tracking requires every team to implement it correctly. A gateway makes it automatic.

You are running agent systems. Agents make unpredictable numbers of LLM calls. A single user request might trigger 3 calls or 30, depending on the task. Without gateway-level controls, a misbehaving agent can burn through your budget in minutes. This is where gateway-level rate limiting and cost caps move from "nice to have" to "preventing incidents."

Gateway comparison

The LLM gateway market has consolidated around a few major options, each with a different philosophy. Here is how they compare as of early 2026.

Feature | Portkey | LiteLLM | Helicone | Custom (DIY)
Deployment | Cloud-hosted or self-hosted | Self-hosted (OSS) or cloud | Cloud-hosted or self-hosted proxy | Self-hosted
Unified API | Yes, OpenAI-compatible | Yes, OpenAI-compatible | Yes, proxy-based | Whatever you build
Providers supported | 200+ | 100+ | Major providers | Whatever you integrate
Failover | Automatic with configurable fallback chains | Automatic with fallback | Basic retry | Whatever you build
Load balancing | Weighted, latency-based, cost-optimized | Weighted round-robin | N/A | Whatever you build
Caching | Semantic + exact match | Exact match | Exact match | Whatever you build
Guardrails | Built-in + custom hooks | Via callbacks | Content moderation | Whatever you build
Cost tracking | Per-request, per-team, per-project | Per-request, per-key | Per-request, per-user | Whatever you build
Audit logging | Full request/response logging | Logging via callbacks | Full request/response logging | Whatever you build
Rate limiting | Per-key, per-team, per-model | Per-key, per-model | Per-user, per-key | Whatever you build
Latency overhead | ~20-50ms (cloud), minimal (self-hosted) | Minimal (self-hosted) | ~30-60ms (cloud) | Depends on implementation
Scale | Processes 1T+ tokens/day across customers | Widely deployed, varies | Growing adoption | Your responsibility
Open source | Enterprise features paid | Yes (Apache 2.0) | Yes (Apache 2.0) | N/A

Portkey

Portkey is the most feature-complete option. It processes over 1 trillion tokens per day across its customer base, which means the routing and failover logic has been tested at scales most individual organizations will not reach. The gateway supports conditional routing (route based on prompt content, user tier, cost constraints), which is valuable for organizations that need different reliability tiers for different use cases.

The tradeoff is complexity. Portkey's configuration surface is large, and teams that only need basic failover and cost tracking might find it overbuilt. The self-hosted option addresses data residency concerns but requires more operational investment.

# Portkey gateway example
from portkey_ai import Portkey

client = Portkey(
    api_key="your-portkey-key",
    config={
        "strategy": {
            "mode": "fallback",
            "on_status_codes": [429, 500, 502, 503],
        },
        "targets": [
            {
                "provider": "anthropic",
                "api_key": "your-anthropic-key",
                "override_params": {"model": "claude-sonnet-4-20250514"},
                "weight": 1,
            },
            {
                "provider": "openai",
                "api_key": "your-openai-key",
                "override_params": {"model": "gpt-4o"},
                "weight": 0,  # fallback only
            },
        ],
    },
)

# Same interface regardless of which provider handles the request
response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Explain RAG evaluation metrics."}],
    max_tokens=500,
)

LiteLLM

LiteLLM is the open-source standard for unified LLM access. If your primary need is a consistent API across providers with basic load balancing and fallback, LiteLLM does this well without requiring a managed service. It is particularly popular in self-hosted environments where data cannot leave your infrastructure.

The tradeoff is that advanced features (semantic caching, conditional routing, built-in guardrails) require more custom code on top of LiteLLM's base. The proxy server provides the gateway functionality; the Python SDK alone is a client library, not a gateway.

# LiteLLM proxy configuration (config.yaml)
# model_list:
#   - model_name: "claude-primary"
#     litellm_params:
#       model: "anthropic/claude-sonnet-4-20250514"
#       api_key: "your-anthropic-key"
#   - model_name: "gpt-fallback"
#     litellm_params:
#       model: "openai/gpt-4o"
#       api_key: "your-openai-key"
# router_settings:
#   routing_strategy: "latency-based-routing"
#   num_retries: 3
#   fallbacks: [{"claude-primary": ["gpt-fallback"]}]

# Client code talks to the LiteLLM proxy through the standard OpenAI SDK
import openai

client = openai.OpenAI(
    api_key="your-litellm-key",
    base_url="http://localhost:4000",  # LiteLLM proxy
)

response = client.chat.completions.create(
    model="claude-primary",  # proxy routes this, with fallback, per config
    messages=[{"role": "user", "content": "Explain LLM gateways."}],
)

Helicone

Helicone started as an observability platform and added gateway features. Its strength is the analytics layer — detailed dashboards for cost, latency, usage patterns, and user-level tracking. If your primary motivation for a gateway is visibility rather than routing sophistication, Helicone gives you the analytics with less configuration overhead than Portkey.

The tradeoff is less mature routing and failover compared to purpose-built gateways. It works best as a complement to application-level routing rather than a replacement.

Building your own

Building a custom gateway makes sense when your requirements are narrow and specific — maybe you only use one provider and just need caching and cost logging, or you have unusual routing logic that does not fit existing tools.

The hidden cost is maintenance. Provider APIs change, new models launch with different token counting, rate limit formats evolve. Every change requires gateway updates. Teams that build custom gateways often underestimate this ongoing maintenance burden.

If you build custom, start with LiteLLM as a foundation rather than starting from scratch. You get the provider abstraction for free and can add your custom logic on top.

Audit logging at the gateway level

For organizations with compliance requirements — SOC 2, HIPAA, EU AI Act, internal governance policies — the gateway is the natural point for audit logging. It sees every request, which means it can enforce logging consistency regardless of which application made the call.

What to log

A complete audit log for LLM interactions should capture:

{
  "request_id": "req_abc123",
  "timestamp": "2026-04-14T10:30:00Z",
  "user_id": "user_456",
  "team": "support-automation",
  "project": "ticket-classifier",
  
  // Request details
  "model_requested": "claude-sonnet-4-20250514",
  "model_served": "claude-sonnet-4-20250514",
  "provider": "anthropic",
  "prompt_tokens": 1250,
  "prompt_hash": "sha256:...",  // For PII-safe logging
  
  // Response details
  "completion_tokens": 340,
  "total_tokens": 1590,
  "latency_ms": 1840,
  "status": "success",
  
  // Cost
  "cost_usd": 0.0127,
  
  // Guardrails
  "guardrails_triggered": [],
  "pii_detected": false,
  
  // Routing
  "routing_strategy": "primary",
  "failover_attempted": false,
  "cache_hit": false
}

For environments where prompts and completions contain sensitive data, log prompt hashes instead of full text, or log full text to an encrypted store with access controls. The gateway can make this decision at the infrastructure layer rather than relying on each application to implement PII handling correctly.
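
The hash-or-full-text decision can be sketched as a small helper (the sensitivity flag and field shape are illustrative):

```python
import hashlib

def loggable_prompt(prompt: str, contains_sensitive_data: bool) -> str:
    """Return what the audit log should store for this prompt: the raw
    text when it is safe, otherwise a stable sha256 fingerprint that
    still lets you correlate identical prompts across requests."""
    if not contains_sensitive_data:
        return prompt
    return "sha256:" + hashlib.sha256(prompt.encode("utf-8")).hexdigest()
```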

Connecting audit logs to compliance

Gateway audit logs become the data source for compliance reporting. When an auditor asks "show me all LLM interactions involving customer data in Q1," you query the gateway logs rather than stitching together logs from a dozen services. This connection between infrastructure logging and compliance reporting is covered in depth in our AI governance guide.

Choosing a gateway for agent systems

Agent systems introduce specific challenges that simple completion APIs do not face. An agent might make dozens of LLM calls per task, use tool calling that triggers external API requests, and branch unpredictably based on intermediate results. Here is what to prioritize when choosing a gateway for agent workloads.

Cost containment is non-negotiable

A single agent run can generate 10-50 LLM calls. Multiply by concurrent users and you get cost profiles that are hard to predict. Your gateway needs:

  • Per-session cost caps. Kill an agent session that exceeds a cost threshold. This prevents runaway agents from burning through budget.
  • Per-user or per-team quotas. Spread budget across teams with hard or soft limits.
  • Real-time cost visibility. You cannot wait for an end-of-month bill to discover a cost problem. The gateway should surface cost data in real time or near-real time.
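
A per-session cap reduces to a small accumulator that the gateway charges on every call. A sketch, not any specific gateway's API:

```python
class SessionBudget:
    """Hypothetical per-session cost cap for an agent run."""

    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record the cost of one LLM call; kill the session on overrun."""
        self.spent_usd += cost_usd
        if self.spent_usd > self.cap_usd:
            raise RuntimeError(
                f"session budget exceeded: ${self.spent_usd:.4f} > ${self.cap_usd:.2f}"
            )
```

The important property is that the check runs on every call, so a runaway agent is stopped mid-run rather than discovered on the bill.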

Trace-aware routing

Agent traces span multiple LLM calls. The gateway should understand that calls within the same trace are related, which enables:

  • Consistent model routing within a trace. If an agent starts a task on Claude, switching mid-task to GPT-4o can cause behavior changes. The gateway should support session affinity.
  • Trace-level cost and latency tracking. Knowing that a single agent run cost $0.45 across 23 calls is more useful than seeing 23 individual call costs.

Failover that understands tool calling

When an agent uses tool calling, a provider failover mid-conversation can break the tool schema format. Anthropic's tool calling format differs from OpenAI's. A gateway that handles failover for agent systems needs to translate tool schemas across providers, not just message formats.

Integration with observability

Agent observability requires tracking the full execution graph — which tools were called, in what order, what the intermediate results were, and how they influenced subsequent LLM calls. Your gateway should export traces in a format compatible with your observability stack, following OpenTelemetry GenAI semantic conventions for LLM tracing. OpenTelemetry is the emerging standard here, and gateways that support OTLP export integrate cleanly with the rest of your monitoring infrastructure.
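
For illustration, here is how a gateway request might map onto GenAI semantic-convention attribute names. The conventions are still evolving, so verify the exact names against the current OpenTelemetry spec before depending on them:

```python
def genai_span_attributes(provider: str, model: str,
                          input_tokens: int, output_tokens: int) -> dict:
    """Map one gateway request onto OpenTelemetry GenAI semantic-convention
    attribute names so any OTLP-compatible backend can aggregate usage."""
    return {
        "gen_ai.system": provider,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
```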

Gateway anti-patterns

Putting business logic in the gateway

The gateway should handle infrastructure concerns: routing, failover, logging, rate limiting. It should not contain prompt templates, output parsing, or business-specific validation. When teams put business logic in the gateway, they create a deployment coupling — every prompt change requires a gateway deployment, and the gateway team becomes a bottleneck for the product team.

Over-caching

Aggressive caching saves money but can serve stale responses. For applications where freshness matters (current events, real-time data), caching needs to be scoped carefully. Set TTLs per-model and per-use-case rather than globally. And never cache responses for queries that include user-specific context — a cached response about User A's account should not be served to User B.

Ignoring latency overhead

Every gateway hop adds latency. For cloud-hosted gateways, this is typically 20-60ms. For most applications, this is negligible. For latency-sensitive applications (real-time chat, voice assistants), it matters. Measure the actual latency impact in your environment before committing to a hosted gateway. Self-hosted gateways add less latency but more operational work.

Single gateway as single point of failure

If all LLM traffic flows through one gateway instance and that instance goes down, your entire AI system is offline. Run your gateway with redundancy — multiple instances behind a load balancer, health checks, and automated failover at the gateway layer itself.

Setting up a gateway: practical steps

Start simple

  1. Pick the gateway that matches your primary need. Need routing and failover? Portkey or LiteLLM. Need visibility and analytics? Helicone. Need full control? LiteLLM self-hosted.
  2. Configure a primary provider and one fallback. Do not try to optimize across five providers on day one. Get reliable failover working between two.
  3. Enable cost tracking and audit logging immediately. These provide value from day one and are painful to backfill.
  4. Set a per-team or per-project spending alert. Not a hard cap initially — just an alert. Understand your baseline cost before setting limits.

Then iterate

  1. Add caching for high-volume, repeated queries. Measure cache hit rates and cost savings.
  2. Add guardrails for production use cases. Start with PII detection and content safety; add domain-specific checks as needed.
  3. Integrate with your observability stack. Export gateway metrics and traces to your existing monitoring tools.
  4. Tune routing based on data. After a few weeks of traffic, you will have data on provider latency, reliability, and cost. Use this to optimize routing strategies.

Where Coverge fits

An LLM gateway handles the infrastructure layer — routing requests to the right provider, tracking costs, enforcing rate limits. Coverge operates at the pipeline layer above the gateway — building, evaluating, and deploying the AI pipelines that generate those requests.

The two are complementary. Your gateway ensures requests reach a model reliably and affordably. Coverge ensures the AI pipeline sending those requests is tested, versioned, and deployed with quality gates. Gateway logs feed into Coverge's observability layer, giving you a complete picture from pipeline logic down to provider performance.

For teams running agent systems, this layering is especially valuable. The gateway handles per-request concerns (which provider, what cost, did guardrails trigger). Coverge handles per-pipeline concerns (did the agent produce correct results, is it regressing, should this version be deployed). For teams managing multi-agent workflows, our multi-agent orchestration guide covers how gateway-level routing interacts with agent-level coordination.

Frequently asked questions

Do I need an LLM gateway if I only use one provider?

Probably not yet, but keep the option open. A single-provider setup does not benefit from routing or failover. You still benefit from centralized cost tracking and audit logging, but you can get those from the provider's own dashboard and API logs. The moment you add a second provider, or a second team that independently calls LLMs, a gateway pays for itself quickly.

How much latency does an LLM gateway add?

Cloud-hosted gateways typically add 20-60ms of latency per request. Self-hosted gateways add less — usually under 10ms for same-region deployments. Compare this to LLM response times of 500ms-5s and the overhead is rarely significant. The exception is streaming responses, where even small per-chunk latency can affect perceived responsiveness. Measure in your specific environment.

Can I use LiteLLM as a gateway in production?

Yes. LiteLLM's proxy server mode functions as a full gateway — unified API, provider translation, load balancing, and fallback. Many production deployments use LiteLLM as their gateway layer. The main gap compared to Portkey is in managed observability features and advanced routing policies. If you are comfortable running and operating the proxy server yourself, LiteLLM is a solid production choice.

How do gateways handle streaming responses?

Most gateways support server-sent events (SSE) streaming pass-through. The gateway receives the streaming response from the provider and forwards each chunk to your application. Token counting and cost tracking happen as chunks arrive. Failover during streaming is tricky — if a provider fails mid-stream, the gateway typically cannot seamlessly switch to another provider without restarting the request. The best approach is to detect slow streaming early (long gaps between chunks) and restart the full request on a fallback.
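
The slow-stream detection described above can be sketched as a wrapper that checks the gap between consecutive chunks. This is a simplification: a real gateway would enforce the timeout concurrently (e.g. a per-chunk asyncio.wait_for) rather than after a chunk arrives. The injected clock is there only to make the sketch testable:

```python
import time
from typing import Callable, Iterable, Iterator

def stream_with_stall_check(
    chunks: Iterable[str],
    max_gap_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[str]:
    """Yield chunks, raising TimeoutError when the gap between consecutive
    chunks exceeds max_gap_s, the signal to restart on a fallback provider."""
    last = clock()
    for chunk in chunks:
        now = clock()
        if now - last > max_gap_s:
            raise TimeoutError(f"stream stalled for {now - last:.1f}s")
        last = now
        yield chunk
```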

What is the difference between an LLM gateway and an API gateway like Kong or AWS API Gateway?

An LLM gateway is a specialized API gateway with LLM-specific features. A generic API gateway can handle routing and rate limiting, but it does not understand token counting, model-specific failover, prompt caching, or LLM cost tracking. You could build LLM features on top of a generic API gateway using plugins, but you would be rebuilding what LLM-specific gateways already provide. Some teams run both — a generic API gateway at the edge for authentication and general traffic management, and an LLM gateway behind it for model-specific concerns.

How do I evaluate whether my gateway is performing well?

Track these metrics: gateway-added latency (p50, p99), failover success rate (when primary is down, does fallback work?), cache hit rate, cost per request over time, and error rate. Set alerts on gateway latency spikes and failover failures. A healthy gateway should be invisible — your application should not know or care that requests are being routed, cached, or failed over behind the scenes.

Should guardrails live in the gateway or the application?

Both. Gateway-level guardrails handle universal policies: PII detection, content safety, cost caps. These apply to every LLM request regardless of which application sent it. Application-level guardrails handle use-case-specific policies: output format validation, domain-specific safety checks, business logic constraints. For a deeper look at guardrails architecture, see our guardrails guide.