Agent Observability That Doesn't Lie

Why this matters
Most “agent observability” is either:
- too shallow (a chat transcript and a couple logs), or
- too noisy (every token logged, every tool payload stored, no signal)
Neither works in production.
If you’re serious about operating agents, you need observability that answers three questions quickly:
- What happened? (forensics)
- Why did it happen? (debuggability)
- How often does it happen? (reliability)
OpenTelemetry exists to standardize how you instrument, generate, and export telemetry across traces, metrics, and logs. [1] W3C Trace Context defines how trace context propagates across service boundaries. [2]
Agents add two new requirements:
- tool calls are part of your “distributed trace”
- “decisioning” is a first-class component (not just business logic)
This article is a practical blueprint.
TL;DR
- Instrument agents like distributed systems:
- traces for causality (what triggered what)
- metrics for health (p95 latency, error rates)
- logs for human context (but redacted)
- Propagate a single trace across:
- agent runtime -> MCP gateway -> MCP tool servers -> upstream APIs
- Capture decision summaries, not chain-of-thought.
- Treat cost as a production signal: emit per-run and per-tool cost metrics.
- Use semantic conventions where possible to keep telemetry queryable. [3]
- Don’t turn observability into a data breach: OWASP highlights sensitive info disclosure and prompt injection as key risks. [7]
Contents
- What to observe in an agent system
- A trace model for agents
- Metrics that matter
- Logs and redaction
- Audit events vs debug logs
- Dashboards and alerts
- A production checklist
- References
What to observe in an agent system
Agents have four observable subsystems:
- Planner/Reasoner (creates the plan, chooses tools)
- Tool execution (calls MCP tools and interprets results)
- Memory/state (what was stored or retrieved)
- Policy/budget (what was allowed or blocked)
If you only observe tool execution (#2), you’ll miss why the agent chose the wrong tool. If you only observe the planner (#1), you’ll miss production failures.
You need the full chain.
A trace model for agents
The core idea
A single “agent run” is a distributed trace spanning:
- model calls
- tool calls
- downstream system calls
Use W3C Trace Context (traceparent, tracestate) to propagate the trace across boundaries. [2]
Suggested spans (minimum viable)
Root span
- agent.run - attributes: agent.name, tenant, user, session, goal_hash
Planner
- agent.plan - attributes: planner.model, plan.step_count
Model calls
- llm.call - attributes: model, prompt_tokens, completion_tokens, latency_ms
Tool selection
- agent.tool_select - attributes: selector.version, candidate_count, selected_count
Tool call
- tool.call - attributes: tool.name, tool.class (read/write/danger), tool.server, status
Policy
- policy.check - attributes: policy.rule_id, decision (allow/deny), reason_code
Memory
- memory.read / memory.write - attributes: store, keys, bytes
Why spans > logs
Spans give you causality:
- which tool call caused a failure
- which step blew the budget
- which upstream dependency was slow
With OpenTelemetry, you can emit traces and metrics using the same SDK approach. [1][4]
Metrics that matter
Tool health metrics
- tool_calls_total{tool,status}
- tool_latency_ms_bucket{tool}
- tool_timeouts_total{tool}
- tool_retries_total{tool}
Agent run health metrics
- agent_runs_total{status}
- agent_run_latency_ms_bucket{agent}
- agent_steps_total_bucket{agent}
Cost metrics (treat cost like reliability)
- llm_tokens_total{model,type=prompt|completion}
- llm_cost_usd_total{model}
- run_cost_usd_bucket{agent}
Policy metrics
- policy_denied_total{rule_id}
- danger_tool_attempt_total{tool}
Semantic conventions help your metrics stay queryable and consistent across systems. OpenTelemetry documents semantic conventions for HTTP spans/metrics, for example. [3][5]
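A label-keyed counter is enough to prototype these metrics before wiring up a real metrics SDK. A sketch: the metric names match the lists above; the per-1K-token prices and model name are placeholders, not real rates.

```python
from collections import Counter

tool_calls_total = Counter()    # keyed by (tool, status)
llm_cost_usd_total = Counter()  # keyed by model

# Placeholder prices per 1K tokens; real values come from your provider.
PRICE_PER_1K = {"example-model": {"prompt": 0.003, "completion": 0.015}}

def record_tool_call(tool: str, status: str) -> None:
    """Increment tool_calls_total{tool,status}."""
    tool_calls_total[(tool, status)] += 1

def record_llm_cost(model: str, prompt_tokens: int, completion_tokens: int) -> None:
    """Accumulate llm_cost_usd_total{model} from token counts."""
    p = PRICE_PER_1K[model]
    llm_cost_usd_total[model] += (prompt_tokens / 1000) * p["prompt"] \
                               + (completion_tokens / 1000) * p["completion"]

record_tool_call("invoice.lookup", "ok")
record_llm_cost("example-model", prompt_tokens=2000, completion_tokens=500)
```

Treating cost as just another counter means the same dashboards and alerting paths that watch error rates can watch spend.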
Logs and redaction
Logs should add human context, not become a data lake of secrets.
Rules I like:
- Do not log prompts by default.
- Do not log tool payloads by default.
- Log summaries and hashes: goal_hash, plan_hash, tool_args_hash
- Log structured error reasons: validation_error, upstream_rate_limited, auth_failed, policy_denied
For agent systems, OWASP highlights sensitive information disclosure and insecure output handling. Logging is one of the easiest ways to accidentally create both. [7]
“Debug mode” that isn’t dangerous
If you must support deeper logs:
- only enable per tenant/user for a limited window
- auto-expire
- redact aggressively
- never store raw secrets
Audit events vs debug logs
Treat them as different products:
Audit events (for governance)
- immutable-ish records of side effects
- minimal sensitive data
- always on
- long retention
Example audit fields:
- who: tenant/user/client
- what: tool + action class (create/update/delete)
- when: timestamp
- where: environment
- result: success/failure
- resource IDs (safe identifiers)
- idempotency keys / plan IDs
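The who/what/when/result shape above maps directly onto an immutable record; field names here are illustrative:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)  # frozen: audit records must not be mutated after emit
class AuditEvent:
    tenant: str
    user: str
    tool: str
    action_class: str  # create / update / delete
    timestamp: str     # ISO 8601, UTC
    environment: str
    result: str        # success / failure
    resource_id: str   # safe identifier, never a secret
    plan_id: str       # ties the side effect back to the agent run

event = AuditEvent(
    tenant="acme", user="u-42", tool="invoice.update", action_class="update",
    timestamp="2024-01-01T00:00:00Z", environment="prod",
    result="success", resource_id="inv-1001", plan_id="plan-7",
)
```

Note what is absent: no prompts, no payloads, no free-text. That is what lets audit events stay always-on with long retention.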
Debug logs (for engineers)
- short retention
- more context
- highly controlled access
Mixing these two is how you end up with “SharePoint logs full of PII” that no one wants to touch.
Dashboards and alerts
Dashboards (start simple)
- Tool reliability
- top tools by error rate
- top tools by p95 latency
- timeouts per tool
- Agent success
- success rate by agent type
- “stuck runs” (runs exceeding max duration)
- average steps per run
- Cost
- cost per run
- cost per tenant
- top drivers (which tools/model calls)
Alerts (avoid noise)
Alert on what is actionable:
- tool error rate spikes for critical tools
- tool latency p95 spikes beyond SLO
- budget exceeded spike (runaway behavior)
- policy denied spike (possible prompt injection attempt)
If you use SLOs and error budgets, Google’s SRE material is a practical reference for turning SLOs into alerting strategies. [6]
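The workbook’s multiwindow burn-rate idea reduces to a small calculation: page only when the error budget is being consumed fast in both a short and a long window. The 14.4 threshold below is the workbook’s commonly cited example (2% of a 30-day budget spent in one hour); tune it to your SLO.

```python
def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    error_budget = 1.0 - slo_target
    return observed_error_ratio / error_budget

def should_page(short_window_ratio: float, long_window_ratio: float,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only when both windows burn fast; the long window cuts noise,
    the short window makes the alert reset quickly once things recover."""
    return (burn_rate(short_window_ratio, slo_target) >= threshold
            and burn_rate(long_window_ratio, slo_target) >= threshold)
```

For agents, the same shape applies to cost: a “budget burn” alert on run_cost_usd catches runaway loops long before the monthly bill does.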
A production checklist
Tracing
- Every agent run has a trace ID.
- Trace context propagates across MCP boundaries (W3C Trace Context). [2]
- Tool calls are spans with stable tool identifiers.
Metrics
- Tool success/error/latency metrics exist.
- Agent run success/latency/steps metrics exist.
- Cost metrics exist and are monitored.
Logging
- Default logs are redacted summaries, not raw payloads.
- Debug logging is time-bounded and access-controlled.
Audit
- Audit events exist for all side-effecting tools.
- Audit records include “who/what/when/result” without leaking secrets.
Security
- Observability does not become a secret exfil path (OWASP risks considered). [7]
References
[1] OpenTelemetry - Documentation (overview): https://opentelemetry.io/docs/
[2] W3C - Trace Context: https://www.w3.org/TR/trace-context/
[3] OpenTelemetry - Semantic conventions for HTTP (spans/metrics/logs): https://opentelemetry.io/docs/specs/semconv/http/
[4] OpenTelemetry Go - Instrumentation docs: https://opentelemetry.io/docs/languages/go/instrumentation/
[5] OpenTelemetry - Semantic conventions for HTTP metrics: https://opentelemetry.io/docs/specs/semconv/http/http-metrics/
[6] Google SRE Workbook - Alerting on SLOs: https://sre.google/workbook/alerting-on-slos/
[7] OWASP - Top 10 for Large Language Model Applications: https://owasp.org/www-project-top-10-for-large-language-model-applications/