Agent Observability That Doesn't Lie

Sat, 20 Dec 2025 12:00:00 -0500

Why this matters

Most “agent observability” is either:

too shallow (a chat transcript and a couple logs), or
too noisy (every token logged, every tool payload stored, no signal)

Neither works in production.

If you’re serious about operating agents, you need observability that answers three questions quickly:

What happened? (forensics)
Why did it happen? (debuggability)
How often does it happen? (reliability)

OpenTelemetry exists to standardize how you instrument, generate, and export telemetry across traces, metrics, and logs. [1] W3C Trace Context defines how trace context propagates across service boundaries. [2]

Agents add two new requirements:

tool calls are part of your “distributed trace”
“decisioning” is a first-class component (not just business logic)

This article is a practical blueprint.

TL;DR

Instrument agents like distributed systems:
traces for causality (what triggered what)
metrics for health (p95 latency, error rates)
logs for human context (but redacted)
Propagate a single trace across:
agent runtime -> MCP gateway -> MCP tool servers -> upstream APIs
Capture decision summaries, not chain-of-thought.
Treat cost as a production signal: emit per-run and per-tool cost metrics.
Use semantic conventions where possible to keep telemetry queryable. [3]
Don’t turn observability into a data breach: OWASP highlights sensitive info disclosure and prompt injection as key risks. [7]

What to observe in an agent system
A trace model for agents
Metrics that matter
Logs and redaction
Audit events vs debug logs
Dashboards and alerts
A production checklist
References

What to observe in an agent system

Agents have four observable subsystems:

Planner/Reasoner (creates the plan, chooses tools)
Tool execution (calls MCP tools and interprets results)
Memory/state (what was stored or retrieved)
Policy/budget (what was allowed or blocked)

If you only observe tool execution, you’ll miss why the agent chose the wrong tool. If you only observe the planner, you’ll miss production failures.

You need the full chain.

A trace model for agents

The core idea

A single “agent run” is a distributed trace that spans:

model calls
tool calls
downstream system calls

Use W3C Trace Context (traceparent, tracestate) to propagate the trace across boundaries. [2]
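
To make the propagation concrete, here is a minimal sketch of the traceparent header itself, using only the Python standard library. The function names (new_traceparent, child_traceparent) are made up for illustration and the sketch only handles version 00 of the header. In a real system an OpenTelemetry propagator would normally do this for you.

```python
import re
import secrets

# W3C traceparent, version 00:
#   00-<32 hex trace-id>-<16 hex parent-id>-<2 hex trace-flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace (for example at the beginning of an agent run)."""
    trace_id = secrets.token_hex(16)   # 16 random bytes -> 32 hex chars
    parent_id = secrets.token_hex(8)   # 8 random bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{parent_id}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Build the header for the next hop: same trace-id, fresh parent-id."""
    match = TRACEPARENT_RE.match(incoming)
    # All-zero trace-id or parent-id is invalid per the spec.
    if match is None or set(match.group(1)) == {"0"} or set(match.group(2)) == {"0"}:
        return new_traceparent()  # unusable header: start a new trace instead
    trace_id, _parent_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# The agent runtime mints the header; each hop (gateway, tool server) derives a child.
root = new_traceparent()
gateway_hop = child_traceparent(root)
```

Every hop keeps the same trace-id, which is what lets the backend stitch the agent run, the gateway, and the tool server into one trace.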

Suggested spans (minimum viable)

Root span

agent.run
attributes: agent.name, tenant, user, session, goal_hash

Planner

agent.plan
attributes: planner.model, plan.step_count

Model calls

llm.call
attributes: model, prompt_tokens, completion_tokens, latency_ms

Tool selection

agent.tool_select
attributes: selector.version, candidate_count, selected_count

Tool call

tool.call
attributes: tool.name, tool.class (read/write/danger), tool.server, status

Policy

policy.check
attributes: policy.rule_id, decision (allow/deny), reason_code

Memory

memory.read / memory.write
attributes: store, keys, bytes
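
The span list above can be sketched as a tiny in-memory recorder. This is not the OpenTelemetry API; it is a standard-library stand-in meant only to show the parent/child shape of an agent run. The span helper, the FINISHED list, and the example attribute values (support-bot, search_tickets) are all assumptions for illustration.

```python
import contextvars
import time
import uuid
from contextlib import contextmanager

_current_span = contextvars.ContextVar("current_span", default=None)
FINISHED = []  # stand-in for an exporter; spans land here when they end


@contextmanager
def span(name, **attributes):
    """Record a span whose parent is whatever span is currently open."""
    parent = _current_span.get()
    record = {
        "name": name,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"] if parent else None,
        "attributes": dict(attributes),
        "status": "ok",
    }
    token = _current_span.set(record)
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record["status"] = "error"
        record["attributes"]["error.type"] = type(exc).__name__
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000.0
        _current_span.reset(token)
        FINISHED.append(record)


def run_agent():
    """One agent run: plan, check policy, call a tool. Values are made up."""
    with span("agent.run", **{"agent.name": "support-bot", "goal_hash": "ab12"}) as root:
        with span("agent.plan", **{"plan.step_count": 1}):
            pass
        with span("policy.check", decision="allow", reason_code="read_only"):
            pass
        with span("tool.call", **{"tool.name": "search_tickets", "tool.class": "read"}):
            pass
    return root
```

Child spans finish before the root, so FINISHED ends with agent.run, and every child carries the root's span_id as its parent_id. That parent link is the causality that flat logs don't give you.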

Why spans > logs

Spans give you causality:

which tool call caused a failure
which step blew the budget
which upstream dependency was slow

With OpenTelemetry, the same SDK emits both traces and metrics, so you don’t need a separate instrumentation stack for each. [1][4]

Metrics that matter

Tool health metrics

tool_calls_total{tool,status}
tool_latency_ms_bucket{tool}
tool_timeouts_total{tool}
tool_retries_total{tool}
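
As a rough sketch of what sits behind these names, the following keeps a counter per (tool, status) and a bucketed latency histogram per tool, then derives an error rate and a p95 upper bound. The bucket bounds and function names are assumptions. In production you would use a metrics SDK rather than module-level dictionaries.

```python
from bisect import bisect_left
from collections import Counter

# Assumed bucket upper bounds in milliseconds; tune to your own latency profile.
LATENCY_BOUNDS_MS = [50, 100, 250, 500, 1000, 2500]

tool_calls_total = Counter()      # key: (tool, status)
tool_latency_buckets = Counter()  # key: (tool, bucket index)


def record_tool_call(tool, status, latency_ms):
    tool_calls_total[(tool, status)] += 1
    # Index of the first bound >= latency; len(bounds) means the overflow bucket.
    idx = bisect_left(LATENCY_BOUNDS_MS, latency_ms)
    tool_latency_buckets[(tool, idx)] += 1


def error_rate(tool):
    total = sum(n for (t, _), n in tool_calls_total.items() if t == tool)
    errors = sum(n for (t, s), n in tool_calls_total.items() if t == tool and s != "ok")
    return errors / total if total else 0.0


def p95_upper_bound_ms(tool):
    """Upper bound of the bucket that contains the 95th percentile."""
    counts = [tool_latency_buckets[(tool, i)] for i in range(len(LATENCY_BOUNDS_MS) + 1)]
    total = sum(counts)
    if total == 0:
        return None
    running = 0
    for i, count in enumerate(counts):
        running += count
        if running >= 0.95 * total:
            return LATENCY_BOUNDS_MS[i] if i < len(LATENCY_BOUNDS_MS) else float("inf")
```

A bucketed histogram only tells you which bucket p95 falls into, not the exact value, so the choice of bounds sets the resolution of your latency alerts.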

Agent run health metrics

agent_runs_total{status}
agent_run_latency_ms_bucket{agent}
agent_steps_total_bucket{agent}

Cost metrics (treat cost like reliability)

llm_tokens_total{model,type=prompt|completion}
llm_cost_usd_total{model}
run_cost_usd_bucket{agent}
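
Here is one way the cost metrics could be fed, assuming you know your per-token prices. The model name and the prices below are placeholders, not real rates, and RunBudget is a hypothetical helper showing how a per-run cost total can double as a runaway guard.

```python
from collections import defaultdict

# Placeholder prices in USD per 1,000 tokens; substitute your provider's real rates.
PRICE_PER_1K_USD = {"example-model": {"prompt": 0.5, "completion": 1.5}}

llm_tokens_total = defaultdict(int)      # key: (model, type)
llm_cost_usd_total = defaultdict(float)  # key: model


def record_llm_call(model, prompt_tokens, completion_tokens):
    """Update token and cost counters; return the cost of this single call."""
    price = PRICE_PER_1K_USD[model]
    cost = (prompt_tokens * price["prompt"] + completion_tokens * price["completion"]) / 1000
    llm_tokens_total[(model, "prompt")] += prompt_tokens
    llm_tokens_total[(model, "completion")] += completion_tokens
    llm_cost_usd_total[model] += cost
    return cost


class RunBudget:
    """Tracks spend for one agent run against a fixed USD limit."""

    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd):
        """Add a cost; return True while the run is still within budget."""
        self.spent_usd += cost_usd
        return self.spent_usd <= self.limit_usd
```

At the end of a run, spent_usd is the value you would record into the run_cost_usd histogram. A False from charge is the signal behind a “budget exceeded” counter.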

Policy metrics

policy_denied_total{rule_id}
danger_tool_attempt_total{tool}

Semantic conventions help your metrics stay queryable and consistent across systems. OpenTelemetry documents semantic conventions for HTTP spans/metrics, for example. [3][5]

Logs and redaction

Logs should add human context, not become a data lake of secrets.

Rules I like:

Do not log prompts by default.
Do not log tool payloads by default.
Log summaries and hashes:
goal_hash, plan_hash, tool_args_hash
Log structured error reasons:
validation_error, upstream_rate_limited, auth_failed, policy_denied
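
A minimal sketch of the “summaries and hashes” rule: hash a canonical JSON form of the tool arguments and log only the hash plus a structured reason code. The helper names and the unknown_error fallback are assumptions; the reason codes are the ones listed above.

```python
import hashlib
import json

ERROR_REASONS = {"validation_error", "upstream_rate_limited", "auth_failed", "policy_denied"}


def stable_hash(value) -> str:
    """Hash a JSON-serializable value so that key order does not matter."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def tool_log_record(tool, args, error_reason=None):
    """Build a log record that never contains the raw tool payload."""
    if error_reason is not None and error_reason not in ERROR_REASONS:
        error_reason = "unknown_error"  # keep the reason field a closed vocabulary
    return {
        "event": "tool.call",
        "tool": tool,
        "tool_args_hash": stable_hash(args),
        "error_reason": error_reason,
    }


record = tool_log_record("send_email", {"to": "a@example.com", "body": "hi"}, "auth_failed")
print(json.dumps(record, sort_keys=True))
```

The hash lets you correlate repeated calls with identical arguments without storing them. One caveat: a plain hash of a low-entropy value (a short ID, an email address) can be guessed by brute force, so a keyed hash such as HMAC is worth considering for sensitive fields.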

For agent systems, OWASP highlights sensitive information disclosure and insecure output handling. Logging is one of the easiest ways to accidentally create both. [7]

“Debug mode” that isn’t dangerous

If you must support deeper logs:

only enable per tenant/user for a limited window
auto-expire
redact aggressively
never store raw secrets

Audit events vs debug logs

Treat them as different products:

Audit events (for governance)

immutable-ish records of side effects
minimal sensitive data
always on
long retention

Example audit fields:

who: tenant/user/client
what: tool + action class (create/update/delete)
when: timestamp
where: environment
result: success/failure
resource IDs (safe identifiers)
idempotency keys / plan IDs
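
The fields above map naturally onto a small immutable record. This is one possible shape, not a standard schema. The class name, field names, and the example tool name in the comment are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

ACTION_CLASSES = {"create", "update", "delete"}


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class AuditEvent:
    tenant: str           # who
    user: str             # who
    tool: str             # what, e.g. a hypothetical "crm.update_contact"
    action_class: str     # what: create / update / delete
    timestamp: str        # when (ISO 8601, UTC)
    environment: str      # where
    result: str           # success / failure
    resource_ids: tuple   # safe identifiers only, never payloads
    idempotency_key: str


def make_audit_event(tenant, user, tool, action_class, environment, result,
                     resource_ids, idempotency_key, now=None):
    if action_class not in ACTION_CLASSES:
        raise ValueError(f"unknown action class: {action_class!r}")
    if result not in {"success", "failure"}:
        raise ValueError(f"unknown result: {result!r}")
    now = now or datetime.now(timezone.utc)
    return AuditEvent(tenant, user, tool, action_class, now.isoformat(),
                      environment, result, tuple(resource_ids), idempotency_key)


def to_json_line(event):
    """Serialize for an append-only sink (one JSON object per line)."""
    return json.dumps(asdict(event), sort_keys=True)
```

A frozen dataclass only prevents accidental mutation in process. The “immutable-ish” guarantee in storage still has to come from the sink, for example append-only or write-once storage.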

Debug logs (for engineers)

short retention
more context
highly controlled access

Mixing these two is how you end up with “SharePoint logs full of PII” that no one wants to touch.

Dashboards and alerts

Dashboards (start simple)

Tool reliability

top tools by error rate
top tools by p95 latency
timeouts per tool

Agent success

success rate by agent type
“stuck runs” (runs exceeding max duration)
average steps per run

Cost

cost per run
cost per tenant
top drivers (which tools/model calls)

Alerts (avoid noise)

Alert on what is actionable:

tool error rate spikes for critical tools
tool latency p95 spikes beyond SLO
budget exceeded spike (runaway behavior)
policy denied spike (possible prompt injection attempt)
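
A spike alert of this kind can be sketched as a sliding-window error rate with a minimum sample count, so that one failure out of two calls doesn't page anyone. The class name and the default window, threshold, and minimum are assumptions. Real alerting usually lives in your metrics backend, not in application code.

```python
from collections import deque


class ErrorRateAlert:
    """Fires when the error rate over the last `window` calls exceeds `threshold`."""

    def __init__(self, window=100, threshold=0.2, min_calls=20):
        self.outcomes = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold
        self.min_calls = min_calls

    def observe(self, ok: bool) -> bool:
        """Record one call outcome; return True if the alert should fire."""
        self.outcomes.append(ok)
        n = len(self.outcomes)
        if n < self.min_calls:
            return False  # too few samples to be actionable
        errors = n - sum(self.outcomes)
        return errors / n > self.threshold


# One alert instance per critical tool keeps noisy, low-stakes tools out of the pager.
alert = ErrorRateAlert(window=50, threshold=0.1, min_calls=10)
```

The same shape works for the other alerts in the list: feed it “was this run over budget” or “was this call policy-denied” instead of success and failure.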

If you use SLOs and error budgets, Google’s SRE material is a practical reference for turning SLOs into alerting strategies. [6]

A production checklist

Tracing

Every agent run has a trace ID.
Trace context propagates across MCP boundaries (W3C Trace Context). [2]
Tool calls are spans with stable tool identifiers.

Metrics

Tool success/error/latency metrics exist.
Agent run success/latency/steps metrics exist.
Cost metrics exist and are monitored.

Logging

Default logs are redacted summaries, not raw payloads.
Debug logging is time-bounded and access-controlled.

Audit

Audit events exist for all side-effecting tools.
Audit records include “who/what/when/result” without leaking secrets.

Security

Observability does not become a secret exfil path (OWASP risks considered). [7]

References

[1] OpenTelemetry - Documentation (overview): https://opentelemetry.io/docs/
[2] W3C - Trace Context: https://www.w3.org/TR/trace-context/
[3] OpenTelemetry - Semantic conventions for HTTP (spans/metrics/logs): https://opentelemetry.io/docs/specs/semconv/http/
[4] OpenTelemetry Go - Instrumentation docs: https://opentelemetry.io/docs/languages/go/instrumentation/
[5] OpenTelemetry - Semantic conventions for HTTP metrics: https://opentelemetry.io/docs/specs/semconv/http/http-metrics/
[6] Google SRE Workbook - Alerting on SLOs: https://sre.google/workbook/alerting-on-slos/
[7] OWASP - Top 10 for Large Language Model Applications: https://owasp.org/www-project-top-10-for-large-language-model-applications/

Opentelemetry | Roy Gabriel