AI agent monitoring: SLOs, anomaly detection, and production alerting for agent pipelines
By Coverge Team
Your agent pipeline has been running in production for three weeks. It processes 2,000 tasks per day. You get a Slack message: "The agent's responses seem worse lately." You check the logs. No errors. No timeouts. Every request completed successfully. But "seem worse" is not wrong — the agent has been silently degrading for five days because a model update shifted its tool-calling behavior, and it now takes twice as many steps to complete the same tasks. Cost is up 40%. Quality is down. And you had no alert because nothing technically failed.
This is the gap between running an agent and monitoring an agent. Running means the process executes without crashing. Monitoring means you know whether it is performing well, and you find out quickly when it stops.
Search volume for "ai agent monitoring" is 70 monthly searches with 133% year-over-year growth — the fastest growth rate of any term in this category. The number is small because the term is new, not because the need is small. Every team running agents in production needs monitoring. Most are searching for it under adjacent terms like "llm monitoring" or "ai observability."
This guide covers the distinction between monitoring and observability for agents, how to define meaningful SLOs, what anomaly detection looks like for non-deterministic systems, and how production alerting connects to AI governance and compliance reporting.
Monitoring vs. observability: they are not the same thing
These terms get conflated constantly. For agent systems, the distinction matters because they solve different problems at different times.
Monitoring tells you whether the system is healthy right now. It is dashboards, alerts, and SLOs. It answers: "Is the agent performing within acceptable bounds?" Monitoring is about detecting problems as they happen or shortly after.
Observability tells you why the system behaved a certain way. It is traces, logs, and the ability to ask arbitrary questions about past behavior. It answers: "Why did this specific agent run take 45 seconds and cost $0.12?" Observability is about diagnosing problems after they are detected.
You need both, but they serve different operational moments:
| | Monitoring | Observability |
|---|---|---|
| When | Continuous, real-time | On-demand, after-the-fact |
| Question | "Is something wrong?" | "What went wrong and why?" |
| Audience | On-call engineer, ops team | Investigating engineer |
| Output | Alerts, dashboards, SLO reports | Traces, query results, root cause analysis |
| Data model | Aggregated metrics | Detailed traces and logs |
| Cost | Low (metrics are cheap) | Higher (traces are verbose) |
For a deep dive into the observability side, see our LLM observability guide and AI agent observability guide. This guide focuses on the monitoring side: what to measure, what thresholds to set, and when to alert.
Defining SLOs for agent pipelines
Service Level Objectives bring discipline to monitoring. Without SLOs, every metric is equally important (meaning none are). With SLOs, you declare what "working" means for your specific system and hold yourself accountable to that definition.
Why traditional SLOs do not work for agents
Traditional software SLOs focus on availability (uptime), latency (response time), and error rate (failed requests). These matter for agents too, but they miss the dimensions that actually determine whether an agent is useful:
Task completion rate. An agent that responds quickly and never errors but fails to complete the user's task is not working. Traditional SLOs would say it is fine.
Output quality. A chatbot agent that returns fast, confident, wrong answers has great latency and zero errors. Its quality is terrible. You need to measure quality, not just infrastructure.
Cost per task. An agent that takes 30 LLM calls to do something that should take 5 is working, but badly. Per-task cost is a behavioral SLO that traditional monitoring does not capture.
Step efficiency. How many steps does the agent take to complete a task? Is that number stable? A sudden increase in average steps is often the first sign of a problem — the agent is struggling, retrying, or going down wrong paths.
Agent SLO framework
Here is a framework for defining agent SLOs across five dimensions:
1. Reliability SLOs
These are closest to traditional SLOs and cover infrastructure-level health:
| SLO | Target | Measurement |
|---|---|---|
| Availability | 99.5% | Percentage of requests that receive a response (not timeout or crash) |
| Error rate | < 2% | Percentage of requests that return an error to the user |
| Latency (p50) | < 5s | Median time from request to response |
| Latency (p99) | < 30s | 99th percentile time from request to response |
2. Quality SLOs
These measure whether the agent's outputs are actually good:
| SLO | Target | Measurement |
|---|---|---|
| Task completion rate | > 90% | Percentage of tasks the agent completes successfully |
| Output accuracy (sampled) | > 85% | Sampled evaluation of output correctness |
| Faithfulness (for RAG agents) | > 0.85 | RAGAS faithfulness on sampled responses |
| User satisfaction | > 75% | Thumbs-up rate among users who provide feedback |
Quality SLOs require sampling — you cannot evaluate every response in real time. Sample 2-5% of responses and run async LLM evaluation. The sample rate can be higher for new deployments and lower for stable ones.
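As a concrete sketch of that sampling hook — the `maybe_enqueue_for_eval` helper and the in-memory queue are hypothetical; in production the queue would be a task broker feeding your async evaluator:

```python
import random

def should_sample(sample_rate: float = 0.05) -> bool:
    """Decide whether this response gets queued for async quality evaluation."""
    return random.random() < sample_rate

def maybe_enqueue_for_eval(response: dict, queue: list, sample_rate: float = 0.05) -> None:
    # Sampling keeps evaluation cost proportional to the sample rate,
    # not to total traffic.
    if should_sample(sample_rate):
        queue.append(response)
```

Raising `sample_rate` for new deployments and lowering it for stable ones is then a one-parameter change.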
3. Efficiency SLOs
These catch behavioral degradation:
| SLO | Target | Measurement |
|---|---|---|
| Mean steps per task | < 8 | Average number of LLM calls per completed task |
| Cost per task (p50) | < $0.05 | Median cost per task completion |
| Cost per task (p95) | < $0.25 | 95th percentile cost per task |
| Tool call success rate | > 95% | Percentage of tool calls that succeed |
4. Behavioral SLOs
These detect changes in how the agent operates:
| SLO | Target | Measurement |
|---|---|---|
| Refusal rate | < 5% | Percentage of tasks the agent refuses or cannot handle |
| Escalation rate | < 10% | Percentage of tasks escalated to a human |
| Retry rate | < 15% | Percentage of LLM calls that require retry |
| Tool usage distribution | Within 2 std dev of baseline | How tool usage patterns compare to the established baseline |
5. Compliance SLOs
For regulated environments:
| SLO | Target | Measurement |
|---|---|---|
| Guardrail trigger rate | < 3% | Percentage of responses that trigger safety guardrails |
| PII leak rate | 0% | Percentage of responses containing detected PII |
| Audit log completeness | 100% | Percentage of agent runs with complete audit trails |
Setting initial SLO targets
Setting SLOs for a new agent system when you have no baseline data requires a bootstrap process:
1. Deploy without SLOs for 1-2 weeks. Collect metric data on every dimension above. Do not alert on anything yet — just measure.
2. Analyze the baseline. Calculate the p50, p95, and p99 for each metric. This is your current performance.
3. Set SLOs slightly tighter than baseline. If your baseline task completion rate is 88%, set the SLO at 85%. This gives you room for normal variation while catching real degradation.
4. Tighten over time. As you improve the agent, ratchet SLO targets up. Review targets quarterly.
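The baseline-to-target step can be sketched in a few lines; `suggest_slo_targets` and its 3-point margin are illustrative, not a standard:

```python
import numpy as np

def suggest_slo_targets(completion_rates: list[float], margin: float = 0.03) -> dict:
    """Suggest an initial SLO slightly looser than observed baseline performance.

    Hypothetical helper: completion_rates are daily task-completion rates
    collected during the no-alert bootstrap period.
    """
    baseline = float(np.percentile(completion_rates, 50))
    return {
        "baseline_p50": baseline,
        # Target sits a small margin below baseline so normal variation
        # does not breach it, while real degradation still does.
        "suggested_target": round(baseline - margin, 3),
    }
```

With a 0.88 baseline this suggests the 0.85 target from the example above.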
The worst thing you can do is set aspirational SLOs on day one. An SLO that is always breached is an SLO that gets ignored.
Anomaly detection for agent behavior
Agent behavior is non-deterministic. The same input can produce different outputs, different step counts, and different tool call sequences. This makes traditional threshold-based alerting noisy — you get alerts for normal variation, not actual problems.
Anomaly detection for agents needs to account for this inherent variation. Here is how to approach it.
Statistical baselines
For metrics with stable distributions (latency, cost, step count), compute rolling statistical baselines and alert on deviations:
```python
import numpy as np
from collections import deque

class AgentMetricMonitor:
    """Monitor agent metrics with rolling statistical baselines."""

    def __init__(self, window_size: int = 1000, alert_threshold_sigma: float = 3.0):
        self.window_size = window_size
        self.alert_threshold = alert_threshold_sigma
        self.metrics: dict[str, deque] = {}

    def record(self, metric_name: str, value: float) -> dict | None:
        """Record a metric value and check for anomaly."""
        if metric_name not in self.metrics:
            self.metrics[metric_name] = deque(maxlen=self.window_size)
        window = self.metrics[metric_name]
        # Need enough data for a baseline
        if len(window) < 100:
            window.append(value)
            return None
        mean = np.mean(window)
        std = np.std(window)
        # Avoid division by zero for constant metrics
        if std < 1e-6:
            window.append(value)
            return None
        z_score = (value - mean) / std
        window.append(value)
        if abs(z_score) > self.alert_threshold:
            return {
                "metric": metric_name,
                "value": value,
                "mean": mean,
                "std": std,
                "z_score": z_score,
                "direction": "high" if z_score > 0 else "low",
            }
        return None

monitor = AgentMetricMonitor(window_size=1000, alert_threshold_sigma=3.0)

# After each agent run
anomaly = monitor.record("steps_per_task", run.step_count)
if anomaly:
    send_alert(f"Anomalous step count: {anomaly['value']} (mean: {anomaly['mean']:.1f}, z={anomaly['z_score']:.1f})")

anomaly = monitor.record("cost_per_task", run.total_cost)
if anomaly:
    send_alert(f"Anomalous cost: ${anomaly['value']:.4f} (mean: ${anomaly['mean']:.4f})")
```
Distribution shift detection
Individual anomalies are noisy. Distribution shifts are signal. Instead of alerting on single outliers, detect when the entire distribution of a metric shifts:
```python
import numpy as np
from scipy import stats

def detect_distribution_shift(
    recent_values: list[float],
    baseline_values: list[float],
    significance: float = 0.01,
) -> dict:
    """
    Detect if recent metric values come from a different distribution
    than the baseline using the Kolmogorov-Smirnov test.
    """
    statistic, p_value = stats.ks_2samp(baseline_values, recent_values)
    shift_detected = p_value < significance
    return {
        "shift_detected": shift_detected,
        "p_value": p_value,
        "statistic": statistic,
        "baseline_mean": np.mean(baseline_values),
        "recent_mean": np.mean(recent_values),
        "delta_mean": np.mean(recent_values) - np.mean(baseline_values),
    }

# Compare last hour against last week's baseline
result = detect_distribution_shift(
    recent_values=get_metric_values("steps_per_task", period="1h"),
    baseline_values=get_metric_values("steps_per_task", period="7d"),
)
if result["shift_detected"]:
    send_alert(
        f"Distribution shift in steps_per_task: "
        f"mean moved from {result['baseline_mean']:.1f} to {result['recent_mean']:.1f} "
        f"(p={result['p_value']:.4f})"
    )
```
Tool usage pattern monitoring
Changes in how the agent uses its tools are one of the earliest indicators of behavioral issues. If an agent that normally uses tool A in 60% of tasks suddenly starts using it in 30%, something changed — the model's behavior shifted, or the tool's availability changed, or the input distribution changed.
```python
from collections import Counter

def monitor_tool_usage_patterns(
    recent_runs: list[dict],
    baseline_distribution: dict[str, float],
    threshold: float = 0.15,
) -> list[dict]:
    """
    Detect shifts in tool usage patterns.
    baseline_distribution: {"tool_a": 0.6, "tool_b": 0.3, "tool_c": 0.1}
    """
    # Count tool usage in recent runs
    tool_counts = Counter()
    total_tool_calls = 0
    for run in recent_runs:
        for tool_call in run["tool_calls"]:
            tool_counts[tool_call["name"]] += 1
            total_tool_calls += 1
    if total_tool_calls == 0:
        return []
    recent_distribution = {
        tool: count / total_tool_calls
        for tool, count in tool_counts.items()
    }
    alerts = []
    for tool, baseline_rate in baseline_distribution.items():
        recent_rate = recent_distribution.get(tool, 0)
        delta = abs(recent_rate - baseline_rate)
        if delta > threshold:
            alerts.append({
                "tool": tool,
                "baseline_rate": baseline_rate,
                "recent_rate": recent_rate,
                "delta": delta,
                "direction": "increased" if recent_rate > baseline_rate else "decreased",
            })
    return alerts
```
Production alerting patterns
Good alerting for agent systems follows the same principles as traditional alerting, with adaptations for non-deterministic behavior.
Alert hierarchy
Not all problems are equally urgent. Structure your alerts in tiers:
P1 (page immediately):
- Agent availability below SLO (system is down or partially down)
- PII detected in agent output (compliance violation in progress)
- Cost per task exceeds hard cap (runaway agent)
- Error rate above 10% (widespread failure)
P2 (alert within 15 minutes, investigate same day):
- Quality SLO breach (sampled accuracy below target)
- Distribution shift in step count or cost (behavioral change detected)
- Guardrail trigger rate above threshold
- Tool call failure rate above 20%
P3 (daily summary, investigate within 48 hours):
- Cost trending upward outside normal growth
- Task completion rate declining week-over-week
- User satisfaction score declining
- New tool usage patterns detected
Alert fatigue prevention
Agent systems are inherently noisy. Here is how to keep alerts meaningful:
Require sustained deviation, not point anomalies. A single expensive agent run is not an alert. Five consecutive expensive runs are. Use sustained breach windows (e.g., "alert if cost exceeds threshold for 5 consecutive runs" rather than "alert on any expensive run").
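A sustained-breach check can be as simple as a consecutive counter. This sketch assumes a "higher is worse" metric such as cost per task; the class name is illustrative:

```python
class SustainedBreachAlerter:
    """Fire only after N consecutive values exceed the threshold."""

    def __init__(self, threshold: float, required_consecutive: int = 5):
        self.threshold = threshold
        self.required = required_consecutive
        self.streak = 0

    def record(self, value: float) -> bool:
        """Return True when the breach has been sustained long enough."""
        if value > self.threshold:
            self.streak += 1
        else:
            # One normal value resets the streak, so point anomalies never fire.
            self.streak = 0
        return self.streak >= self.required
```

One instance per metric, fed after each run, replaces a whole class of noisy point-anomaly alerts.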
Alert on SLO burn rate, not raw metrics. Instead of alerting when latency exceeds 10 seconds, alert when you are consuming your latency error budget faster than expected. This accounts for normal variation and only fires when sustained problems threaten your SLO.
```python
def check_slo_burn_rate(
    metric_values: list[float],
    slo_target: float,
    window_hours: int,
    budget_hours: int = 720,  # 30-day budget
) -> dict:
    """
    Check if the SLO error budget is being consumed too fast.
    metric_values are per-bucket "goodness" values (e.g. success ratios)
    covering the last window_hours; values below slo_target count as
    breaches against the budget_hours error budget.
    """
    breaches = sum(1 for v in metric_values if v < slo_target)
    breach_rate = breaches / len(metric_values) if metric_values else 0
    allowed_breach_rate = 1 - slo_target
    burn_rate = breach_rate / allowed_breach_rate if allowed_breach_rate > 0 else float('inf')
    # burn_rate > 1 means we're consuming budget faster than sustainable
    # burn_rate > 14.4 means we'd exhaust the 30-day budget in ~2 days (fast burn)
    # burn_rate > 6 means we'd exhaust the 30-day budget in ~5 days (medium burn)
    return {
        "burn_rate": burn_rate,
        "breach_rate": breach_rate,
        "alert": burn_rate > 6,
        "severity": "P1" if burn_rate > 14.4 else "P2" if burn_rate > 6 else "OK",
    }
```
Group correlated alerts. When a model provider has an outage, you will see latency alerts, error rate alerts, cost alerts (from retries), and quality alerts simultaneously. Group these into a single incident rather than firing five independent alerts.
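A minimal grouping pass can bucket alerts by time proximity. The `time` field and the 10-minute window below are assumptions; real incident tooling would also correlate by pipeline and root-cause hints:

```python
from datetime import timedelta

def group_alerts_into_incidents(alerts: list[dict], window_minutes: int = 10) -> list[list[dict]]:
    """Group alerts that fire within a short window into one incident.

    Each alert dict carries a "time" datetime. Correlated alerts (latency,
    error rate, cost, quality) from one underlying outage land in one group.
    """
    if not alerts:
        return []
    alerts = sorted(alerts, key=lambda a: a["time"])
    window = timedelta(minutes=window_minutes)
    incidents = [[alerts[0]]]
    for alert in alerts[1:]:
        # Chain alerts: each joins the current incident if it is close
        # enough to the previous alert, otherwise it opens a new incident.
        if alert["time"] - incidents[-1][-1]["time"] <= window:
            incidents[-1].append(alert)
        else:
            incidents.append([alert])
    return incidents
```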
Alerting on model provider changes
One of the most common causes of agent degradation is model provider updates. The model name stays the same, but the weights change. Your agent's behavior shifts. You have no control over when this happens.
Monitor for proxy signals that indicate a model change:
- Sudden shift in output token length distribution
- Change in tool calling patterns (new model version may call tools differently)
- Shift in latency distribution (model updates can change inference speed)
- Change in refusal patterns (updated safety training)
When you detect a potential model change, trigger a full evaluation run against your test suite, following the patterns described in our AI agent testing guide. This gives you a quality assessment within minutes instead of waiting for user complaints over days.
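One way to wire those proxy signals into an evaluation trigger is to require more than one of them to fire at once; the two-signal rule below is a heuristic, not a standard:

```python
def check_model_change_signals(signals: dict[str, bool]) -> bool:
    """Decide whether proxy signals suggest a silent model update.

    signals maps each proxy check (token-length shift, tool-pattern shift,
    latency shift, refusal shift) to whether it fired. Requiring two or
    more concurrent signals keeps single-metric noise from triggering a
    full evaluation run.
    """
    fired = sum(1 for hit in signals.values() if hit)
    return fired >= 2
```

When this returns True, kick off the full test-suite evaluation rather than paging a human directly.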
How monitoring feeds compliance reporting
For organizations subject to AI governance requirements (EU AI Act, SOC 2, internal AI policies), monitoring data is the raw material for compliance reporting. The connection is direct: monitoring metrics prove your AI system is operating within defined parameters.
From SLOs to compliance evidence
Each SLO maps to a compliance requirement:
| SLO | Compliance requirement |
|---|---|
| Quality above threshold | System is functioning as intended |
| PII leak rate at 0% | Data protection obligations met |
| Guardrail trigger rate within bounds | Safety measures are effective |
| Audit log completeness at 100% | Full traceability maintained |
| Cost within budget | Organizational controls working |
When an auditor asks "how do you ensure your AI system produces reliable outputs?" you point to your quality SLOs, the monitoring data showing adherence, and the incident history showing how breaches were detected and resolved. This is concrete evidence, not a policy document.
For a deeper look at AI governance from an engineering perspective, see our AI governance engineering guide.
Automated compliance reports
Build compliance reports that pull directly from your monitoring data:
```python
def generate_compliance_report(
    pipeline_name: str,
    period_start: str,
    period_end: str,
) -> dict:
    """Generate a compliance report from monitoring data."""
    metrics = fetch_monitoring_metrics(pipeline_name, period_start, period_end)
    incidents = fetch_incidents(pipeline_name, period_start, period_end)
    return {
        "pipeline": pipeline_name,
        "reporting_period": {"start": period_start, "end": period_end},
        "slo_adherence": {
            "availability": {
                "target": 0.995,
                "actual": metrics["availability"],
                "met": metrics["availability"] >= 0.995,
            },
            "quality": {
                "target": 0.85,
                "actual": metrics["sampled_accuracy"],
                "met": metrics["sampled_accuracy"] >= 0.85,
            },
            "pii_leaks": {
                "target": 0,
                "actual": metrics["pii_leak_count"],
                "met": metrics["pii_leak_count"] == 0,
            },
            "audit_completeness": {
                "target": 1.0,
                "actual": metrics["audit_log_completeness"],
                "met": metrics["audit_log_completeness"] >= 1.0,
            },
        },
        "incidents": [
            {
                "date": i["date"],
                "slo_breached": i["slo"],
                "duration_minutes": i["duration_minutes"],
                "root_cause": i["root_cause"],
                "resolution": i["resolution"],
            }
            for i in incidents
        ],
        "total_executions": metrics["total_executions"],
        "total_cost": metrics["total_cost"],
    }
```
This report is generated automatically, not written manually. The data comes from the same monitoring system that powers your alerts and dashboards. Compliance becomes a read operation on existing data rather than a separate workstream.
Building a monitoring stack
Components you need
- Metrics collection. Every agent run produces metrics: latency, step count, cost, tool calls, token usage. Collect these at the application level and ship them to a time-series database (Prometheus, InfluxDB, or a managed service).
- Quality sampling. Async evaluation of a sample of production outputs. Run faithfulness, relevance, or custom quality checks. Store results alongside execution metrics.
- Dashboards. Visualize SLO adherence, metric trends, and anomaly indicators. Grafana, Datadog, or a custom dashboard. The dashboard should answer "how are my agents performing?" in under 30 seconds.
- Alerting. PagerDuty, OpsGenie, or Slack alerts based on SLO burn rate and anomaly detection. Tiered by severity with clear escalation paths.
- Incident management. When an alert fires, how do you investigate? The monitoring system should link to the observability system — click an alert and see the relevant traces and logs.
OpenTelemetry for agent monitoring
The OpenTelemetry GenAI semantic conventions provide a standardized way to instrument agent systems. Using OTEL means your agent metrics integrate with any OTEL-compatible backend — Jaeger, Zipkin, Grafana Tempo, Datadog, or Arize Phoenix.
```python
import time

from opentelemetry import trace, metrics
from opentelemetry.semconv.ai import SpanAttributes

tracer = trace.get_tracer("agent-pipeline")
meter = metrics.get_meter("agent-pipeline")

# Define metrics
task_duration = meter.create_histogram(
    name="agent.task.duration",
    description="Duration of agent task completion",
    unit="s",
)
task_steps = meter.create_histogram(
    name="agent.task.steps",
    description="Number of steps per agent task",
    unit="1",
)
task_cost = meter.create_histogram(
    name="agent.task.cost",
    description="Cost per agent task",
    unit="usd",
)
task_completion = meter.create_counter(
    name="agent.task.completions",
    description="Number of completed agent tasks",
)

async def run_agent_task(task_input: str):
    with tracer.start_as_current_span("agent.task") as span:
        span.set_attribute("agent.task.input_length", len(task_input))
        start = time.time()
        result = await agent.execute(task_input)
        duration = time.time() - start

        # Record metrics
        task_duration.record(duration)
        task_steps.record(result.step_count)
        task_cost.record(result.total_cost)
        task_completion.add(1, {"status": "success" if result.completed else "failure"})

        # Set span attributes for observability
        span.set_attribute("agent.task.steps", result.step_count)
        span.set_attribute("agent.task.cost", result.total_cost)
        span.set_attribute("agent.task.completed", result.completed)
        span.set_attribute(SpanAttributes.GEN_AI_USAGE_INPUT_TOKENS, result.total_input_tokens)
        span.set_attribute(SpanAttributes.GEN_AI_USAGE_OUTPUT_TOKENS, result.total_output_tokens)
        return result
```
Where Coverge fits
Coverge builds monitoring into the pipeline lifecycle. When you deploy an agent pipeline through Coverge, monitoring is not something you set up separately — it is part of the deployment. SLOs are defined alongside the pipeline configuration, evaluation runs continuously on sampled production traffic, and alerts fire when quality degrades.
The monitoring data feeds directly into Coverge's version management. When a quality alert fires, you can see which pipeline version is deployed, compare its evaluation results against previous versions, and roll back in seconds if needed. The monitoring layer and the deployment layer are connected, not separate tools that you bridge with custom integration.
This closed loop — deploy, monitor, detect regression, roll back, investigate, fix, redeploy — is what production agent operations looks like. Each step feeds the next, and the monitoring data becomes the evidence base for both operational decisions and compliance reporting.
Frequently asked questions
How is monitoring AI agents different from monitoring traditional microservices?
The core difference is non-determinism. A microservice that returns different results for the same input is broken. An agent that returns different results for the same input is behaving normally. This means you cannot set static thresholds for most quality metrics — you need statistical baselines and distribution-based anomaly detection. Additionally, agent failures are often semantic (wrong answer, not wrong status code), which requires quality-aware monitoring, not just infrastructure monitoring.
What sampling rate should I use for quality monitoring?
Start with 5% of production traffic and adjust based on volume and cost. For low-volume pipelines (under 100 requests/day), sample everything. For high-volume pipelines (thousands/day), 1-2% gives you sufficient statistical signal while keeping evaluation costs manageable. Increase the sampling rate temporarily after deployments (10-20% for the first hour) to catch regressions faster.
How do I set SLOs when I do not know what "good" looks like yet?
Deploy without SLOs for two weeks. Collect metrics on every dimension. Analyze the data to understand normal ranges. Set initial SLOs at the 10th percentile of your observed performance (meaning your system already meets the SLO 90% of the time). This gives you a baseline that catches real degradation without constant false alarms. Tighten as you improve the agent.
Should I alert on every guardrail trigger?
No. Guardrail triggers are expected — they mean your safety layer is working. Alert on the trigger rate, not individual triggers. If 2% of responses trigger a guardrail and that is within your expected range, everything is fine. If the rate suddenly jumps to 8%, that is an alert. Also alert on specific high-severity guardrail categories (PII leaks, harmful content) at the individual trigger level.
How do I monitor costs for agent systems where costs per task vary wildly?
Track cost at the task level, not the request level. A task that costs $0.02 one time and $0.15 the next is normal if the tasks are different complexities. What matters is the distribution of costs over time. Monitor cost percentiles (p50, p95, p99) and alert on distribution shifts. Set hard caps at the per-task level to prevent runaway agents from burning budget on a single task.
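A per-task hard cap can be enforced inline in the agent loop. This `TaskBudgetGuard` sketch (a hypothetical name, not a library API) raises once accumulated call costs cross the cap:

```python
class TaskBudgetGuard:
    """Abort an agent task once its accumulated cost crosses a hard cap."""

    def __init__(self, hard_cap_usd: float = 1.00):
        self.hard_cap = hard_cap_usd
        self.spent = 0.0

    def charge(self, call_cost_usd: float) -> None:
        """Record one LLM call's cost; raise once the cap is exceeded."""
        self.spent += call_cost_usd
        if self.spent > self.hard_cap:
            # Failing the single runaway task is cheaper than letting it
            # silently consume the whole pipeline budget.
            raise RuntimeError(
                f"Task budget exceeded: ${self.spent:.2f} > ${self.hard_cap:.2f}"
            )
```

Instantiate one guard per task and call `charge` after every LLM call; the agent loop catches the exception and marks the task failed-over-budget.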
What tools should I use for agent monitoring?
For metrics collection and dashboards, any time-series monitoring stack works — Prometheus + Grafana, Datadog, or New Relic. For quality evaluation sampling, tools like Arize Phoenix, Langfuse, or Braintrust integrate well. The key gap in most existing tools is agent-specific anomaly detection — you will likely need custom code for tool usage pattern monitoring and behavioral shift detection. OpenTelemetry provides the instrumentation standard that makes it easier to send data to any backend.
How does agent monitoring change as I scale from one agent to many?
At one agent, monitoring is a dashboard and a few alerts. At ten agents, you need standardized metrics, shared SLO definitions, and aggregate views across all agents. At fifty agents, you need automated anomaly detection, automated compliance reporting, and a team that owns the monitoring infrastructure. The shift from one to many is also when you need to standardize your instrumentation — ad hoc monitoring code in each agent does not scale. Define a monitoring contract (required metrics, standard labels, evaluation hooks) that every agent pipeline must implement.