Category: Builder's Corner
Keyword: AI agent monitoring best practices
Ensuring SLA-grade reliability for AI agents in production is non-negotiable for any organization that depends on automated decision-making or customer-facing automation. This tutorial provides a deep dive into actionable strategies, concrete tooling, and best practices for robust AI agent monitoring. We'll walk through real-world configurations and code, so you can implement production-grade observability and alerting for your agents.
For a broader discussion of agent frameworks and how your choice impacts monitoring, see our parent pillar: Choosing the Right AI Agent Framework: LangSmith, Haystack Agents, and CrewAI Compared.
Prerequisites
- Basic Knowledge: Python (3.9+), Docker, Linux CLI, and familiarity with AI agent frameworks (LangSmith, Haystack, or CrewAI).
- Production Agent: A deployed AI agent (e.g., using FastAPI, Flask, or similar) running on a cloud VM or Kubernetes.
- Monitoring Tools:
- Prometheus (v2.41+)
- Grafana (v10+)
- Optional: OpenTelemetry Collector (v0.89+)
- Access: Ability to modify agent source code and deploy configuration changes.
1. Define SLA Metrics for Your AI Agents
- Identify key reliability metrics. At a minimum, you should monitor:
  - Agent response latency (p95, p99)
  - Success/failure rates
  - Token usage (for LLM-based agents)
  - External dependency errors (e.g., vector DB, LLM API)
- Example SLA:
  - 99% of agent responses must complete in under 2 seconds
  - Failure rate must stay below 0.1%

Track these metrics at both the infrastructure and the application level. For more on how framework choice affects metric collection, see Choosing the Right AI Agent Framework: LangSmith, Haystack Agents, and CrewAI Compared.
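The two-line example SLA above can be checked mechanically. Here's a minimal standard-library sketch (the sample data and the `check_sla` function are invented for illustration):

```python
import statistics

def check_sla(latencies_s, failures, total_requests):
    """Check the example SLA: p99 latency under 2 s, failure rate below 0.1%."""
    # quantiles(n=100) yields the 1st..99th percentiles; index 98 is p99
    p99 = statistics.quantiles(latencies_s, n=100)[98]
    failure_rate = failures / total_requests
    return p99 < 2.0, failure_rate < 0.001

# Hypothetical sample: mostly fast responses with a slow tail
sample = [0.2] * 950 + [1.5] * 45 + [3.0] * 5
ok_latency, ok_failures = check_sla(sample, failures=0, total_requests=1000)
```

In production you would derive these numbers from Prometheus queries rather than in-process samples, but the thresholds themselves stay this simple.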
2. Instrument Your Agent with Prometheus Metrics
- Add Prometheus instrumentation to your agent code. We'll use the `prometheus_client` Python package:

```bash
pip install prometheus_client
```

- Example: FastAPI agent instrumentation. Add the following to your `main.py`:

```python
import time

from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
from starlette.responses import Response

app = FastAPI()

REQUEST_LATENCY = Histogram('agent_response_latency_seconds', 'Agent response latency', ['endpoint'])
REQUEST_COUNT = Counter('agent_requests_total', 'Total agent requests', ['endpoint', 'status_code'])
FAILURE_COUNT = Counter('agent_failures_total', 'Total agent failures', ['endpoint', 'error_type'])

@app.middleware("http")
async def prometheus_middleware(request: Request, call_next):
    start = time.time()
    try:
        response = await call_next(request)
        REQUEST_COUNT.labels(endpoint=request.url.path, status_code=response.status_code).inc()
        return response
    except Exception as e:
        FAILURE_COUNT.labels(endpoint=request.url.path, error_type=type(e).__name__).inc()
        raise
    finally:
        REQUEST_LATENCY.labels(endpoint=request.url.path).observe(time.time() - start)

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```

Description: This middleware tracks request count, latency, and failures per endpoint, and exposes metrics at `/metrics` for scraping.

- For other frameworks (Flask, Django, etc.): see the `prometheus_client` official documentation.
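To see what the `Histogram` metric above actually records, here is a toy stdlib-only model of Prometheus's cumulative bucketing (the bucket bounds are chosen for illustration and are not the library's exact defaults):

```python
import bisect

# Hypothetical bucket upper bounds, in seconds
BUCKETS = [0.1, 0.25, 0.5, 1.0, 2.5, 5.0, float("inf")]

class MiniHistogram:
    """Toy cumulative histogram mimicking Prometheus semantics:
    each observation increments every bucket whose bound >= value."""
    def __init__(self, buckets=BUCKETS):
        self.buckets = list(buckets)
        self.counts = [0] * len(self.buckets)
        self.total = 0
        self.sum = 0.0

    def observe(self, value):
        self.total += 1
        self.sum += value
        # Find the first bucket that can hold the value; increment it
        # and every wider bucket above it (cumulative counts)
        idx = bisect.bisect_left(self.buckets, value)
        for i in range(idx, len(self.counts)):
            self.counts[i] += 1

h = MiniHistogram()
for latency in (0.05, 0.3, 0.3, 1.7):
    h.observe(latency)
```

This cumulative shape is exactly why `histogram_quantile` in PromQL can estimate percentiles from the `_bucket` series later in this tutorial.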
3. Deploy Prometheus and Grafana for Monitoring
- Start Prometheus and Grafana with Docker Compose (`docker-compose.yml`):

```yaml
version: "3"
services:
  prometheus:
    image: prom/prometheus:v2.41.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana:10.0.0
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
```

- Create `prometheus.yml` to scrape your agent:

```yaml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'ai_agent'
    static_configs:
      - targets: ['host.docker.internal:8000']  # Replace with your agent's host:port
```

- Start the stack:

```bash
docker compose up -d
```
- Verify Prometheus is scraping metrics:
  - Visit http://localhost:9090/targets and check that your agent appears as "UP".
- Configure Grafana:
  - Log in at http://localhost:3000 (default credentials: admin/admin)
  - Add Prometheus as a data source (http://prometheus:9090)
  - Create dashboards for `agent_response_latency_seconds` and `agent_failures_total`

Screenshot Description: Grafana dashboard showing p95 latency and failure rates over time.
4. Set Up SLA Alerts in Grafana
- Create alert rules for SLA violations:
  - In Grafana, go to Alerting > Alert Rules
  - Example: alert if p99 latency exceeds 2 seconds for 5 minutes. Use this query and set the alert's threshold condition to 2:

```promql
histogram_quantile(0.99, sum(rate(agent_response_latency_seconds_bucket[5m])) by (le))
```

- Configure notification channels: Email, Slack, PagerDuty, etc.
- Test your alert by simulating agent slowness or failure.
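To build intuition for what the `histogram_quantile` query above computes, this stdlib-only sketch mimics its linear interpolation over cumulative buckets (the bucket data is made up, and PromQL edge cases are simplified):

```python
def histogram_quantile(q, buckets):
    """Approximate a quantile from cumulative (upper_bound, count) pairs,
    mirroring PromQL's linear interpolation within the winning bucket."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Simplification: fall back to the last finite bound
                return prev_bound
            # Linear interpolation inside this bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# Hypothetical cumulative counts: 900 of 1000 requests under 0.5 s
data = [(0.1, 400), (0.5, 900), (2.0, 990), (float("inf"), 1000)]
p99 = histogram_quantile(0.99, data)
```

Note the practical consequence: the accuracy of any percentile alert is bounded by your bucket layout, so make sure a bucket boundary sits at (or near) your SLA threshold.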
5. Add Distributed Tracing with OpenTelemetry (Optional, Advanced)
- Why tracing? Metrics tell you what happened; traces tell you why. For multi-step agents (e.g., CrewAI planners), traces help pinpoint slow or failing sub-components.
- Instrument your agent with OpenTelemetry:

```bash
pip install opentelemetry-api opentelemetry-sdk opentelemetry-instrumentation-fastapi
```

```python
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)
```

- Run an OpenTelemetry Collector and send traces to Grafana Tempo or Jaeger. See the OpenTelemetry Collector docs for setup.
Screenshot Description: Trace visualization showing each step of an agent workflow with timing and error details.
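If you want the core idea of spans without pulling in the SDK, timed, nested units of work can be sketched with the standard library alone (all names here are illustrative, not OpenTelemetry APIs):

```python
import time
from contextlib import contextmanager

SPANS = []   # collected (name, parent, duration_s) records
_stack = []  # current span nesting

@contextmanager
def span(name):
    """Record the wall-clock duration of a block, tracking its parent span."""
    parent = _stack[-1] if _stack else None
    _stack.append(name)
    start = time.perf_counter()
    try:
        yield
    finally:
        _stack.pop()
        SPANS.append((name, parent, time.perf_counter() - start))

# Hypothetical two-step agent workflow
with span("handle_request"):
    with span("retrieve_context"):
        time.sleep(0.01)
    with span("call_llm"):
        time.sleep(0.02)
```

A real tracer adds trace/span IDs, context propagation across services, and an exporter, but the parent/child timing model is the same.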
6. Monitor External Dependencies
- Track dependency errors and latency: wrap LLM API calls and database queries with metric counters and histograms.

```python
import time

from prometheus_client import Counter, Histogram

LLM_FAILURES = Counter('llm_failures_total', 'LLM API failures', ['provider', 'error_type'])
LLM_LATENCY = Histogram('llm_latency_seconds', 'LLM API latency', ['provider'])

def call_llm_api(provider, *args, **kwargs):
    start = time.time()
    try:
        # Replace with your actual LLM call
        result = external_llm_call(*args, **kwargs)
        return result
    except Exception as e:
        LLM_FAILURES.labels(provider=provider, error_type=type(e).__name__).inc()
        raise
    finally:
        LLM_LATENCY.labels(provider=provider).observe(time.time() - start)
```

- Visualize dependency health in Grafana.
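Alongside counting dependency failures, production agents usually retry transient errors before surfacing them. A minimal sketch of retry with exponential backoff, using only the standard library (the function names are invented for illustration):

```python
import time

def call_with_retry(fn, attempts=3, base_delay_s=0.1):
    """Retry a flaky external call with exponential backoff.
    Re-raises the last error once attempts are exhausted."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            # Back off 0.1 s, 0.2 s, 0.4 s, ... between attempts
            time.sleep(base_delay_s * (2 ** attempt))

# Hypothetical dependency that fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient upstream error")
    return "ok"

result = call_with_retry(flaky, base_delay_s=0.001)
```

If you adopt something like this, count retries as well as failures; a rising retry rate is often the earliest SLA warning signal.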
7. Implement Log Aggregation and Correlation
- Centralize logs for debugging SLA breaches: use the ELK stack, Loki, or a managed logging platform.
- Correlate logs with metrics and traces: include trace IDs in logs for cross-referencing.

```python
import logging

from opentelemetry.trace import get_current_span

def log_with_trace_id(message):
    span = get_current_span()
    ctx = span.get_span_context()
    # get_current_span() never returns None; check context validity instead
    trace_id = format(ctx.trace_id, '032x') if ctx.is_valid else "no-trace"
    logging.info(f"{message} | trace_id={trace_id}")
```
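A common variant of the pattern above is to attach the trace ID to every record via a `LoggerAdapter` instead of a helper function, so existing `log.info(...)` call sites need no changes. A self-contained stdlib sketch, using a hard-coded trace ID in place of a real span context:

```python
import logging
from io import StringIO

# Capture output in a string so the example is self-checking
stream = StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(message)s | trace_id=%(trace_id)s"))
logger = logging.getLogger("agent-demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

class TraceAdapter(logging.LoggerAdapter):
    """Attach a trace_id to every record without changing call sites."""
    def process(self, msg, kwargs):
        kwargs.setdefault("extra", {})["trace_id"] = self.extra["trace_id"]
        return msg, kwargs

# Hypothetical trace id; in practice, read it from the active span context
log = TraceAdapter(logger, {"trace_id": "4bf92f3577b34da6"})
log.info("retrieval finished")
```

With the trace ID in a structured log field, Loki or Elasticsearch can jump straight from an alert's trace to its matching log lines.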
Common Issues & Troubleshooting
- Problem: Prometheus can't scrape /metrics (target DOWN)
  - Solution: Check the agent's port and network settings. If running locally with Docker, use host.docker.internal on Mac/Windows, or set up a bridge network on Linux.
- Problem: No metrics in Grafana
  - Solution: Verify Prometheus is scraping metrics (http://localhost:9090/targets). Check the /metrics endpoint in a browser for output.
- Problem: High latency or error spikes
  - Solution: Use traces to drill down into slow steps. Check dependency health metrics and logs for correlated failures.
- Problem: Alert fatigue (too many alerts)
  - Solution: Tune alert thresholds and durations; alert on p95/p99 rather than averages; group similar alerts.
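The p95/p99-over-averages advice is easy to demonstrate: in a skewed latency sample, the mean can look healthy while the tail breaches the SLA. A stdlib sketch with invented numbers:

```python
import statistics

# Hypothetical latency sample: healthy on average, bad in the tail
latencies = [0.2] * 980 + [6.0] * 20

mean = statistics.fmean(latencies)
# quantiles(n=100) yields the 1st..99th percentiles; index 98 is p99
p99 = statistics.quantiles(latencies, n=100)[98]

# The mean sits far below a 2 s SLA target, while 1% of users wait 6 s:
# averaging hides exactly the breach a p99 alert would catch.
```

Here the mean is about 0.32 s, comfortably "green", yet the p99 is 6 s. An average-based alert would never fire for these users.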
Next Steps
- Iterate: Regularly review and refine your metrics and alerting as your agent evolves.
- Automate: Integrate monitoring setup into CI/CD pipelines for consistent deployments.
- Expand: Explore advanced observability, such as anomaly detection and SLO burn rate alerts.
- Learn More: For a broader perspective on agent architectures and how they affect monitoring and reliability, see our parent pillar: Choosing the Right AI Agent Framework: LangSmith, Haystack Agents, and CrewAI Compared.
With these AI agent monitoring best practices, you can confidently operate your agents at SLA-grade reliability in production. Stay vigilant, iterate on your observability, and your agents will serve users reliably at scale.
