Agent Observability That Doesn't Lie

December 20, 2025 · 5 min read

Why this matters

Most “agent observability” is either:

  • too shallow (a chat transcript and a couple of logs), or
  • too noisy (every token logged, every tool payload stored, no signal)

Neither works in production.

If you’re serious about operating agents, you need observability that answers three questions quickly:

  1. What happened? (forensics)
  2. Why did it happen? (debuggability)
  3. How often does it happen? (reliability)

OpenTelemetry exists to standardize how you instrument, generate, and export telemetry across traces, metrics, and logs. [1] W3C Trace Context defines how trace context propagates across service boundaries. [2]

Agents add two new requirements:

  • tool calls are part of your “distributed trace”
  • “decisioning” is a first-class component (not just business logic)

This article is a practical blueprint.


TL;DR

  • Instrument agents like distributed systems:
      • traces for causality (what triggered what)
      • metrics for health (p95 latency, error rates)
      • logs for human context (but redacted)
  • Propagate a single trace across:
      • agent runtime -> MCP gateway -> MCP tool servers -> upstream APIs
  • Capture decision summaries, not chain-of-thought.
  • Treat cost as a production signal: emit per-run and per-tool cost metrics.
  • Use semantic conventions where possible to keep telemetry queryable. [3]
  • Don’t turn observability into a data breach: OWASP highlights sensitive info disclosure and prompt injection as key risks. [7]


What to observe in an agent system

Agents have four observable subsystems:

  1. Planner/Reasoner (creates the plan, chooses tools)
  2. Tool execution (calls MCP tools and interprets results)
  3. Memory/state (what was stored or retrieved)
  4. Policy/budget (what was allowed or blocked)

If you only observe #2, you’ll miss why the agent chose the wrong tool. If you only observe #1, you’ll miss production failures.

You need the full chain.


A trace model for agents

The core idea

A single “agent run” is a distributed trace:

  • it spans model calls
  • tool calls
  • downstream system calls

Use W3C Trace Context (traceparent, tracestate) to propagate the trace across boundaries. [2]
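To make the propagation concrete, here is a minimal sketch of what a W3C `traceparent` header value looks like on the wire, using only the Go standard library. In production you would let the OpenTelemetry SDK's propagators build and parse this for you [1][2]; this just shows the format every hop must forward unchanged.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// BuildTraceparent assembles a W3C traceparent header value:
// version "00", a 16-byte trace ID, an 8-byte parent span ID,
// and trace flags ("01" = sampled).
func BuildTraceparent(traceID [16]byte, spanID [8]byte, sampled bool) string {
	flags := "00"
	if sampled {
		flags = "01"
	}
	return fmt.Sprintf("00-%s-%s-%s",
		hex.EncodeToString(traceID[:]),
		hex.EncodeToString(spanID[:]),
		flags)
}

func main() {
	var traceID [16]byte
	var spanID [8]byte
	rand.Read(traceID[:])
	rand.Read(spanID[:])
	// The same trace ID is forwarded on every hop:
	// agent runtime -> MCP gateway -> tool server -> upstream API.
	fmt.Println(BuildTraceparent(traceID, spanID, true))
}
```

Each service keeps the trace ID, generates a new span ID for its own work, and forwards the result, so the whole agent run stitches into one trace.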

Suggested spans (minimum viable)

Root span

  • agent.run
  • attributes: agent.name, tenant, user, session, goal_hash

Planner

  • agent.plan
  • attributes: planner.model, plan.step_count

Model calls

  • llm.call
  • attributes: model, prompt_tokens, completion_tokens, latency_ms

Tool selection

  • agent.tool_select
  • attributes: selector.version, candidate_count, selected_count

Tool call

  • tool.call
  • attributes: tool.name, tool.class (read/write/danger), tool.server, status

Policy

  • policy.check
  • attributes: policy.rule_id, decision (allow/deny), reason_code

Memory

  • memory.read / memory.write
  • attributes: store, keys, bytes
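To show the shape of the tree, here is a deliberately tiny in-memory stand-in for a span, using the span names from the list above. It is not the OpenTelemetry API; in production you would use the OTel SDK [1][4]. The agent and tool names are made up for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Span is a minimal stand-in for an OTel span: enough to show the
// parent/child structure of one agent run.
type Span struct {
	Name     string
	Attrs    map[string]string
	Start    time.Time
	End      time.Time
	Children []*Span
}

// StartChild opens a child span under the receiver.
func (s *Span) StartChild(name string, attrs map[string]string) *Span {
	c := &Span{Name: name, Attrs: attrs, Start: time.Now()}
	s.Children = append(s.Children, c)
	return c
}

func main() {
	// Root span: one agent run.
	root := &Span{
		Name:  "agent.run",
		Attrs: map[string]string{"agent.name": "billing-agent"}, // hypothetical agent
		Start: time.Now(),
	}

	plan := root.StartChild("agent.plan", map[string]string{"plan.step_count": "3"})
	plan.End = time.Now()

	tool := root.StartChild("tool.call", map[string]string{
		"tool.name":  "invoices.read", // hypothetical tool
		"tool.class": "read",
		"status":     "ok",
	})
	tool.End = time.Now()

	root.End = time.Now()
	for _, c := range root.Children {
		fmt.Printf("%s -> %s\n", root.Name, c.Name)
	}
}
```

The point is the hierarchy: every planner step, model call, and tool call hangs off the same `agent.run` root, so one trace ID finds all of them.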

Why spans > logs

Spans give you causality:

  • which tool call caused a failure
  • which step blew the budget
  • which upstream dependency was slow

With OpenTelemetry, you can emit traces and metrics using the same SDK approach. [1][4]


Metrics that matter

Tool health metrics

  • tool_calls_total{tool,status}
  • tool_latency_ms_bucket{tool}
  • tool_timeouts_total{tool}
  • tool_retries_total{tool}
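A metric like tool_calls_total{tool,status} is just a labeled counter. A minimal stdlib sketch of that idea (in production you would use the OpenTelemetry metrics SDK or a Prometheus client rather than rolling your own):

```go
package main

import (
	"fmt"
	"sync"
)

// Counter is a minimal labeled counter: one monotonically increasing
// value per unique label combination, e.g. {tool="search", status="error"}.
type Counter struct {
	mu     sync.Mutex
	counts map[string]int64
}

func NewCounter() *Counter {
	return &Counter{counts: make(map[string]int64)}
}

// Inc increments the series identified by the (tool, status) labels.
func (c *Counter) Inc(tool, status string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[tool+"|"+status]++
}

// Get reads the current value for one label combination.
func (c *Counter) Get(tool, status string) int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[tool+"|"+status]
}

func main() {
	toolCalls := NewCounter()
	toolCalls.Inc("search", "ok")
	toolCalls.Inc("search", "ok")
	toolCalls.Inc("search", "error")
	fmt.Println(toolCalls.Get("search", "ok")) // 2
}
```

Keeping the label set small and stable (tool name, status code class) is what keeps these metrics cheap to store and easy to query.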

Agent run health metrics

  • agent_runs_total{status}
  • agent_run_latency_ms_bucket{agent}
  • agent_steps_total_bucket{agent}

Cost metrics (treat cost like reliability)

  • llm_tokens_total{model,type=prompt|completion}
  • llm_cost_usd_total{model}
  • run_cost_usd_bucket{agent}
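The per-run cost figure is simple arithmetic over token counts. A sketch of how you might compute the value emitted as run_cost_usd; the prices below are placeholders, not real vendor pricing:

```go
package main

import "fmt"

// ModelPrice holds USD cost per 1K tokens.
// These numbers are illustrative placeholders only.
type ModelPrice struct {
	PromptPer1K     float64
	CompletionPer1K float64
}

// RunCostUSD computes the cost of one model call (or, summed over
// calls, one run) from token counts.
func RunCostUSD(price ModelPrice, promptTokens, completionTokens int64) float64 {
	return float64(promptTokens)/1000*price.PromptPer1K +
		float64(completionTokens)/1000*price.CompletionPer1K
}

func main() {
	price := ModelPrice{PromptPer1K: 0.003, CompletionPer1K: 0.015} // hypothetical
	cost := RunCostUSD(price, 12000, 2000)
	// 12 * 0.003 + 2 * 0.015 = 0.066
	fmt.Printf("run_cost_usd=%.4f\n", cost)
}
```

Emitting this per run and per tool is what lets the cost dashboards and "budget exceeded" alerts later in this article work at all.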

Policy metrics

  • policy_denied_total{rule_id}
  • danger_tool_attempt_total{tool}

Semantic conventions help your metrics stay queryable and consistent across systems. OpenTelemetry documents semantic conventions for HTTP spans/metrics, for example. [3][5]


Logs and redaction

Logs should add human context, not become a data lake of secrets.

Rules I like:

  • Do not log prompts by default.
  • Do not log tool payloads by default.
  • Log summaries and hashes:
      • goal_hash, plan_hash, tool_args_hash
  • Log structured error reasons:
      • validation_error, upstream_rate_limited, auth_failed, policy_denied
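A sketch of the hash-instead-of-payload rule: a stable digest of tool arguments lets you correlate log lines ("same args, same hash") without storing the payload. The field names in the log line are illustrative.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// ArgsHash returns a short, stable hash of tool arguments.
// json.Marshal sorts map keys, which is what keeps the hash stable
// across calls with the same arguments.
func ArgsHash(args map[string]any) string {
	b, _ := json.Marshal(args)
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:8]) // a short prefix is enough for correlation
}

func main() {
	h := ArgsHash(map[string]any{"invoice_id": "inv_123", "amount": 42}) // hypothetical args
	// Log the hash and a structured reason code, never the raw payload.
	fmt.Printf(`{"event":"tool.call","tool":"invoices.update","tool_args_hash":%q,"reason_code":"validation_error"}`+"\n", h)
}
```

If two failing runs share a tool_args_hash, you know they hit the same input without ever having written that input to disk.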

For agent systems, OWASP highlights sensitive information disclosure and insecure output handling. Logging is one of the easiest ways to accidentally create both. [7]

“Debug mode” that isn’t dangerous

If you must support deeper logs:

  • only enable per tenant/user for a limited window
  • auto-expire
  • redact aggressively
  • never store raw secrets

Audit events vs debug logs

Treat them as different products:

Audit events (for governance)

  • immutable-ish records of side effects
  • minimal sensitive data
  • always on
  • long retention

Example audit fields:

  • who: tenant/user/client
  • what: tool + action class (create/update/delete)
  • when: timestamp
  • where: environment
  • result: success/failure
  • resource IDs (safe identifiers)
  • idempotency keys / plan IDs

Debug logs (for engineers)

  • short retention
  • more context
  • highly controlled access

Mixing the two is how you end up with “SharePoint logs full of PII” that no one wants to touch.


Dashboards and alerts

Dashboards (start simple)

  1. Tool reliability
      • top tools by error rate
      • top tools by p95 latency
      • timeouts per tool
  2. Agent success
      • success rate by agent type
      • “stuck runs” (runs exceeding max duration)
      • average steps per run
  3. Cost
      • cost per run
      • cost per tenant
      • top drivers (which tools/model calls)

Alerts (avoid noise)

Alert on what is actionable:

  • tool error rate spikes for critical tools
  • tool latency p95 spikes beyond SLO
  • budget exceeded spike (runaway behavior)
  • policy denied spike (possible prompt injection attempt)

If you use SLOs and error budgets, Google’s SRE material is a practical reference for turning SLOs into alerting strategies. [6]
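A sketch of the burn-rate idea from that material [6]: compare the observed error ratio in a window with the ratio your SLO allows, and page when the ratio of the two is high. The 14.4 threshold is the Workbook's classic fast-burn example (a 1-hour window burning ~2% of a 30-day budget); your numbers will differ.

```go
package main

import "fmt"

// BurnRate compares the observed error ratio in a window with the
// error ratio allowed by the SLO. A 99.9% SLO allows 0.1% errors;
// a burn rate of 14.4 sustained for 1h consumes ~2% of a 30-day
// error budget.
func BurnRate(errors, total float64, slo float64) float64 {
	if total == 0 {
		return 0
	}
	budget := 1 - slo // allowed error ratio
	return (errors / total) / budget
}

func main() {
	// 18 errors out of 1000 requests against a 99.9% SLO:
	// 1.8% observed vs 0.1% allowed -> burn rate 18.
	rate := BurnRate(18, 1000, 0.999)
	fmt.Printf("burn_rate=%.1f page=%v\n", rate, rate > 14.4)
}
```

Alerting on burn rate rather than raw error count is what keeps these alerts actionable: a brief blip under budget stays quiet, a runaway agent pages immediately.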


A production checklist

Tracing

  • Every agent run has a trace ID.
  • Trace context propagates across MCP boundaries (W3C Trace Context). [2]
  • Tool calls are spans with stable tool identifiers.

Metrics

  • Tool success/error/latency metrics exist.
  • Agent run success/latency/steps metrics exist.
  • Cost metrics exist and are monitored.

Logging

  • Default logs are redacted summaries, not raw payloads.
  • Debug logging is time-bounded and access-controlled.

Audit

  • Audit events exist for all side-effecting tools.
  • Audit records include “who/what/when/result” without leaking secrets.

Security

  • Observability does not become a secret exfil path (OWASP risks considered). [7]

References

[1] OpenTelemetry - Documentation (overview): https://opentelemetry.io/docs/
[2] W3C - Trace Context: https://www.w3.org/TR/trace-context/
[3] OpenTelemetry - Semantic conventions for HTTP (spans/metrics/logs): https://opentelemetry.io/docs/specs/semconv/http/
[4] OpenTelemetry Go - Instrumentation docs: https://opentelemetry.io/docs/languages/go/instrumentation/
[5] OpenTelemetry - Semantic conventions for HTTP metrics: https://opentelemetry.io/docs/specs/semconv/http/http-metrics/
[6] Google SRE Workbook - Alerting on SLOs: https://sre.google/workbook/alerting-on-slos/
[7] OWASP - Top 10 for Large Language Model Applications: https://owasp.org/www-project-top-10-for-large-language-model-applications/

Authors
DevOps Architect · Applied AI Engineer
I’ve spent 20 years building systems across embedded devices, microcontrollers, PLCs, security platforms, fintech, SRE, and platform architecture. Today I focus on production AI systems in Go: multi-agent orchestration, MCP server ecosystems, and the DevOps platforms that keep them running. I care about systems that work under pressure: observable, recoverable, and built to last.