<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Opentelemetry | Roy Gabriel</title><link>https://roygabriel.dev/tags/opentelemetry/</link><description>Roy Gabriel: DevOps Architect &amp; Applied AI Engineer. Technical blog on Go, MCP servers, Kubernetes, and production AI systems.</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Fri, 27 Feb 2026 03:18:04 +0000</lastBuildDate><atom:link href="https://roygabriel.dev/tags/opentelemetry/index.xml" rel="self" type="application/rss+xml"/><item><title>Agent Observability That Doesn't Lie</title><link>https://roygabriel.dev/blog/agent-observability-that-doesnt-lie/</link><pubDate>Sat, 20 Dec 2025 12:00:00 -0500</pubDate><guid>https://roygabriel.dev/blog/agent-observability-that-doesnt-lie/</guid><description>&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;Most &amp;ldquo;agent observability&amp;rdquo; is either:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;too shallow&lt;/strong&gt; (a chat transcript and a couple of logs), or&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;too noisy&lt;/strong&gt; (every token logged, every tool payload stored, no signal)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Neither works in production.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;re serious about operating agents, you need observability that answers three questions quickly:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;What happened?&lt;/strong&gt; (forensics)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Why did it happen?&lt;/strong&gt; (debuggability)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How often does it happen?&lt;/strong&gt; (reliability)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;OpenTelemetry exists to standardize how you instrument, generate, and export telemetry across traces, metrics, and logs. [1] W3C Trace Context defines how trace context propagates across service boundaries. [2]&lt;/p&gt;</description><content:encoded>&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;Most &amp;ldquo;agent observability&amp;rdquo; is either:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;too shallow&lt;/strong&gt; (a chat transcript and a couple of logs), or&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;too noisy&lt;/strong&gt; (every token logged, every tool payload stored, no signal)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Neither works in production.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;re serious about operating agents, you need observability that answers three questions quickly:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;What happened?&lt;/strong&gt; (forensics)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Why did it happen?&lt;/strong&gt; (debuggability)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How often does it happen?&lt;/strong&gt; (reliability)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;OpenTelemetry exists to standardize how you instrument, generate, and export telemetry across traces, metrics, and logs. [1] W3C Trace Context defines how trace context propagates across service boundaries. [2]&lt;/p&gt;
&lt;p&gt;Agents add two new requirements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;tool calls are part of your &amp;ldquo;distributed trace&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;decisioning&amp;rdquo; is a first-class component (not just business logic)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This article is a practical blueprint.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="tldr"&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Instrument agents like distributed systems:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;traces&lt;/strong&gt; for causality (what triggered what)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;metrics&lt;/strong&gt; for health (p95 latency, error rates)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;logs&lt;/strong&gt; for human context (but redacted)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Propagate a single trace across:
&lt;ul&gt;
&lt;li&gt;agent runtime -&amp;gt; MCP gateway -&amp;gt; MCP tool servers -&amp;gt; upstream APIs&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Capture &lt;strong&gt;decision summaries&lt;/strong&gt;, not chain-of-thought.&lt;/li&gt;
&lt;li&gt;Treat cost as a production signal: emit per-run and per-tool cost metrics.&lt;/li&gt;
&lt;li&gt;Use semantic conventions where possible to keep telemetry queryable. [3]&lt;/li&gt;
&lt;li&gt;Don&amp;rsquo;t turn observability into a data breach: OWASP highlights sensitive info disclosure and prompt injection as key risks. [7]&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="contents"&gt;Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#what-to-observe-in-an-agent-system"&gt;What to observe in an agent system&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-trace-model-for-agents"&gt;A trace model for agents&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#metrics-that-matter"&gt;Metrics that matter&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#logs-and-redaction"&gt;Logs and redaction&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#audit-events-vs-debug-logs"&gt;Audit events vs debug logs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#dashboards-and-alerts"&gt;Dashboards and alerts&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-production-checklist"&gt;A production checklist&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#references"&gt;References&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="what-to-observe-in-an-agent-system"&gt;What to observe in an agent system&lt;/h2&gt;
&lt;p&gt;Agents have four observable subsystems:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Planner/Reasoner&lt;/strong&gt; (creates the plan, chooses tools)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tool execution&lt;/strong&gt; (calls MCP tools and interprets results)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory/state&lt;/strong&gt; (what was stored or retrieved)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Policy/budget&lt;/strong&gt; (what was allowed or blocked)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you only observe #2, you&amp;rsquo;ll miss why the agent chose the wrong tool.
If you only observe #1, you&amp;rsquo;ll miss production failures.&lt;/p&gt;
&lt;p&gt;You need the full chain.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="a-trace-model-for-agents"&gt;A trace model for agents&lt;/h2&gt;
&lt;h3 id="the-core-idea"&gt;The core idea&lt;/h3&gt;
&lt;p&gt;A single &amp;ldquo;agent run&amp;rdquo; is a distributed trace that spans:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;model calls&lt;/li&gt;
&lt;li&gt;tool calls&lt;/li&gt;
&lt;li&gt;downstream system calls&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Use W3C Trace Context (&lt;code&gt;traceparent&lt;/code&gt;, &lt;code&gt;tracestate&lt;/code&gt;) to propagate the trace across boundaries. [2]&lt;/p&gt;
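&lt;p&gt;To make the propagation step concrete, here is a minimal Python sketch of the &lt;code&gt;traceparent&lt;/code&gt; header format (version, 16-byte trace ID, 8-byte parent span ID, flags). It handles version &lt;code&gt;00&lt;/code&gt; only and ignores &lt;code&gt;tracestate&lt;/code&gt;. The helper names &lt;code&gt;new_traceparent&lt;/code&gt; and &lt;code&gt;child_traceparent&lt;/code&gt; are hypothetical; in practice an OpenTelemetry propagator would do this for you.&lt;/p&gt;

```python
import re
import secrets

# Version 00 layout: 00-(32 hex trace ID)-(16 hex parent span ID)-(2 hex flags).
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace with a random trace ID and span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(incoming):
    """Keep the caller's trace ID and flags, mint a new span ID for the next hop."""
    match = TRACEPARENT_RE.match((incoming or "").strip())
    if match is None:
        # Missing or malformed header: start a fresh trace rather than fail the call.
        return new_traceparent()
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        # All-zero IDs are invalid in the spec.
        return new_traceparent()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Agent runtime starts the trace; the gateway continues it on the outgoing call.
agent_headers = {"traceparent": new_traceparent()}
gateway_headers = {"traceparent": child_traceparent(agent_headers["traceparent"])}
```

&lt;p&gt;Every hop shares the same trace ID, which is what lets a backend stitch the agent runtime, gateway, and tool server spans into one trace.&lt;/p&gt;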
&lt;h3 id="suggested-spans-minimum-viable"&gt;Suggested spans (minimum viable)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Root span&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;agent.run&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;attributes: &lt;code&gt;agent.name&lt;/code&gt;, &lt;code&gt;tenant&lt;/code&gt;, &lt;code&gt;user&lt;/code&gt;, &lt;code&gt;session&lt;/code&gt;, &lt;code&gt;goal_hash&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Planner&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;agent.plan&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;attributes: &lt;code&gt;planner.model&lt;/code&gt;, &lt;code&gt;plan.step_count&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Model calls&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;llm.call&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;attributes: &lt;code&gt;model&lt;/code&gt;, &lt;code&gt;prompt_tokens&lt;/code&gt;, &lt;code&gt;completion_tokens&lt;/code&gt;, &lt;code&gt;latency_ms&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Tool selection&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;agent.tool_select&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;attributes: &lt;code&gt;selector.version&lt;/code&gt;, &lt;code&gt;candidate_count&lt;/code&gt;, &lt;code&gt;selected_count&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Tool call&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;tool.call&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;attributes: &lt;code&gt;tool.name&lt;/code&gt;, &lt;code&gt;tool.class&lt;/code&gt; (read/write/danger), &lt;code&gt;tool.server&lt;/code&gt;, &lt;code&gt;status&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Policy&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;policy.check&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;attributes: &lt;code&gt;policy.rule_id&lt;/code&gt;, &lt;code&gt;decision&lt;/code&gt; (allow/deny), &lt;code&gt;reason_code&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Memory&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;memory.read&lt;/code&gt; / &lt;code&gt;memory.write&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;attributes: &lt;code&gt;store&lt;/code&gt;, &lt;code&gt;keys&lt;/code&gt;, &lt;code&gt;bytes&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
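&lt;p&gt;As a rough illustration of how these spans nest, the sketch below records them in memory with parent/child links. It is not the OpenTelemetry SDK: &lt;code&gt;Recorder&lt;/code&gt; and &lt;code&gt;Span&lt;/code&gt; are hypothetical stand-ins, and the attribute values are made up. With a real tracer the shape of the calls would be similar.&lt;/p&gt;

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    attributes: dict = field(default_factory=dict)
    duration_ms: float = 0.0


class Recorder:
    """Collects finished spans in memory; a real setup would export them."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.finished = []
        self._stack = []

    @contextmanager
    def span(self, name, attributes=None):
        # The innermost open span becomes the parent of the new one.
        parent_id = self._stack[-1].span_id if self._stack else None
        current = Span(name, self.trace_id, uuid.uuid4().hex[:16], parent_id, dict(attributes or {}))
        self._stack.append(current)
        started = time.perf_counter()
        try:
            yield current
        finally:
            current.duration_ms = (time.perf_counter() - started) * 1000
            self._stack.pop()
            self.finished.append(current)


recorder = Recorder()
with recorder.span("agent.run", {"agent.name": "triage", "goal_hash": "sha256:ab12"}):
    with recorder.span("agent.plan", {"plan.step_count": 2}):
        pass
    with recorder.span("policy.check", {"decision": "allow", "reason_code": "read_only"}):
        pass
    with recorder.span("tool.call", {"tool.name": "search_tickets", "tool.class": "read"}) as call:
        call.attributes["status"] = "ok"
```

&lt;p&gt;Each child span carries the root&amp;rsquo;s span ID as its parent, so a query for one &lt;code&gt;agent.run&lt;/code&gt; returns the plan, the policy decision, and the tool call together.&lt;/p&gt;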
&lt;h3 id="why-spans--logs"&gt;Why spans &amp;gt; logs&lt;/h3&gt;
&lt;p&gt;Spans give you causality:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;which tool call caused a failure&lt;/li&gt;
&lt;li&gt;which step blew the budget&lt;/li&gt;
&lt;li&gt;which upstream dependency was slow&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With OpenTelemetry, the same SDK can emit both traces and metrics, so span and metric instrumentation can sit side by side at each step. [1][4]&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="metrics-that-matter"&gt;Metrics that matter&lt;/h2&gt;
&lt;h3 id="tool-health-metrics"&gt;Tool health metrics&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;tool_calls_total{tool,status}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tool_latency_ms_bucket{tool}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tool_timeouts_total{tool}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tool_retries_total{tool}&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
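&lt;p&gt;A minimal in-memory sketch of the first two metrics, assuming Prometheus-style cumulative histogram buckets. The bucket bounds and the &lt;code&gt;record_tool_call&lt;/code&gt; helper are assumptions for illustration; a real service would use a metrics client library instead of plain counters.&lt;/p&gt;

```python
from collections import Counter

# Assumed latency bucket upper bounds in milliseconds; tune to your tools.
LATENCY_BUCKETS_MS = [50, 100, 250, 500, 1000, 2500]

tool_calls_total = Counter()        # key: (tool, status)
tool_latency_ms_bucket = Counter()  # key: (tool, upper bound)


def record_tool_call(tool, status, latency_ms):
    tool_calls_total[(tool, status)] += 1
    # Cumulative buckets: a sample counts toward every bound that is >= its value.
    for bound in LATENCY_BUCKETS_MS:
        if bound >= latency_ms:
            tool_latency_ms_bucket[(tool, bound)] += 1
    # The +Inf bucket counts every sample.
    tool_latency_ms_bucket[(tool, float("inf"))] += 1


def error_rate(tool):
    total = sum(n for (name, _status), n in tool_calls_total.items() if name == tool)
    errors = sum(
        n for (name, status), n in tool_calls_total.items()
        if name == tool and status != "ok"
    )
    return errors / total if total else 0.0


record_tool_call("search_tickets", "ok", 80)
record_tool_call("search_tickets", "ok", 120)
record_tool_call("search_tickets", "error", 900)
record_tool_call("search_tickets", "timeout", 3000)
print(error_rate("search_tickets"))
```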
&lt;h3 id="agent-run-health-metrics"&gt;Agent run health metrics&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;agent_runs_total{status}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;agent_run_latency_ms_bucket{agent}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;agent_run_steps_bucket{agent}&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="cost-metrics-treat-cost-like-reliability"&gt;Cost metrics (treat cost like reliability)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;llm_tokens_total{model,type=prompt|completion}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;llm_cost_usd_total{model}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;run_cost_usd_bucket{agent}&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
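&lt;p&gt;Per-run cost can be derived from the token counts already on each &lt;code&gt;llm.call&lt;/code&gt; span. The sketch below shows the arithmetic. The model name and the per-1,000-token prices are placeholders, not real rates.&lt;/p&gt;

```python
from decimal import Decimal

# Hypothetical prices in USD per 1,000 tokens; substitute your provider's rates.
PRICE_PER_1K = {
    "example-model": {"prompt": Decimal("0.003"), "completion": Decimal("0.015")},
}


def llm_call_cost_usd(model, prompt_tokens, completion_tokens):
    """Cost of one model call, computed with Decimal to avoid float drift."""
    price = PRICE_PER_1K[model]
    weighted = (
        Decimal(prompt_tokens) * price["prompt"]
        + Decimal(completion_tokens) * price["completion"]
    )
    return weighted / Decimal(1000)


def run_cost_usd(calls):
    """Sum the cost of every llm.call recorded in one agent run."""
    return sum((llm_call_cost_usd(**call) for call in calls), Decimal("0"))


calls = [
    {"model": "example-model", "prompt_tokens": 1200, "completion_tokens": 300},
    {"model": "example-model", "prompt_tokens": 800, "completion_tokens": 100},
]
total = run_cost_usd(calls)
```

&lt;p&gt;The per-call value would feed &lt;code&gt;llm_cost_usd_total&lt;/code&gt;, and the per-run sum would be observed into the &lt;code&gt;run_cost_usd&lt;/code&gt; histogram.&lt;/p&gt;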
&lt;h3 id="policy-metrics"&gt;Policy metrics&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;policy_denied_total{rule_id}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;danger_tool_attempt_total{tool}&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
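&lt;p&gt;One way these two counters might be fed is from the policy check itself. The sketch assumes a hypothetical tool registry using the read/write/danger classes from the trace model. The rule IDs and reason codes are invented for illustration.&lt;/p&gt;

```python
from collections import Counter

policy_denied_total = Counter()        # label: rule_id
danger_tool_attempt_total = Counter()  # label: tool

# Hypothetical registry mapping tool names to the read/write/danger classes.
TOOL_CLASS = {
    "search_tickets": "read",
    "update_ticket": "write",
    "delete_project": "danger",
}


def policy_check(tool, approved=False):
    """Return (decision, rule_id, reason_code) and update the policy counters."""
    tool_class = TOOL_CLASS.get(tool)
    if tool_class is None:
        result = ("deny", "tools.unknown", "tool_not_registered")
    elif tool_class == "danger":
        # Count every attempt on a dangerous tool, allowed or not.
        danger_tool_attempt_total[tool] += 1
        if approved:
            result = ("allow", "tools.danger", "human_approved")
        else:
            result = ("deny", "tools.danger", "approval_required")
    else:
        result = ("allow", "tools.default", "class_permitted")
    decision, rule_id, _reason_code = result
    if decision == "deny":
        policy_denied_total[rule_id] += 1
    return result


print(policy_check("search_tickets"))
print(policy_check("delete_project"))
```

&lt;p&gt;Returning a reason code alongside the decision keeps the &lt;code&gt;policy.check&lt;/code&gt; span and the counter labels consistent with each other.&lt;/p&gt;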
&lt;p&gt;Semantic conventions help your metrics stay queryable and consistent across systems. OpenTelemetry documents semantic conventions for HTTP spans/metrics, for example. [3][5]&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="logs-and-redaction"&gt;Logs and redaction&lt;/h2&gt;
&lt;p&gt;Logs should add human context, not become a data lake of secrets.&lt;/p&gt;
&lt;p&gt;Rules I like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Do not log prompts by default.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Do not log tool payloads by default.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Log summaries and hashes:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;goal_hash&lt;/code&gt;, &lt;code&gt;plan_hash&lt;/code&gt;, &lt;code&gt;tool_args_hash&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Log &lt;strong&gt;structured error reasons&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;validation_error&lt;/code&gt;, &lt;code&gt;upstream_rate_limited&lt;/code&gt;, &lt;code&gt;auth_failed&lt;/code&gt;, &lt;code&gt;policy_denied&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For agent systems, OWASP highlights sensitive information disclosure and insecure output handling. Logging is one of the easiest ways to accidentally create both. [7]&lt;/p&gt;
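&lt;p&gt;A small sketch of the &amp;ldquo;summaries and hashes&amp;rdquo; rule: the log record carries a stable hash of the tool arguments and the argument names, never the values. The function names are hypothetical. Note that a plain SHA-256 of low-entropy input (a short ID, a yes/no flag) can be guessed by brute force, so a keyed hash such as HMAC may be a better fit where that matters.&lt;/p&gt;

```python
import hashlib
import json


def stable_hash(value):
    """Hash a JSON-serializable value so that key order does not change the result."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"), default=str)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def tool_log_record(tool, args, error_reason=None):
    """Build a structured log record that is safe to keep by default."""
    return {
        "event": "tool.call",
        "tool": tool,
        "tool_args_hash": stable_hash(args),
        "arg_keys": sorted(args),      # names only, no values
        "error_reason": error_reason,  # e.g. "upstream_rate_limited"
    }


record = tool_log_record(
    "update_ticket",
    {"id": 42, "body": "customer phone is 555-0100"},
    error_reason="upstream_rate_limited",
)
print(json.dumps(record))
```

&lt;p&gt;Two runs that call a tool with the same arguments produce the same hash, which is usually enough to correlate retries and duplicates without storing the payload.&lt;/p&gt;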
&lt;h3 id="debug-mode-that-isnt-dangerous"&gt;&amp;ldquo;Debug mode&amp;rdquo; that isn&amp;rsquo;t dangerous&lt;/h3&gt;
&lt;p&gt;If you must support deeper logs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;only enable per tenant/user for a limited window&lt;/li&gt;
&lt;li&gt;auto-expire&lt;/li&gt;
&lt;li&gt;redact aggressively&lt;/li&gt;
&lt;li&gt;never store raw secrets&lt;/li&gt;
&lt;/ul&gt;
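&lt;p&gt;The &amp;ldquo;limited window&amp;rdquo; and &amp;ldquo;auto-expire&amp;rdquo; rules could look roughly like this. &lt;code&gt;DebugWindows&lt;/code&gt; is a hypothetical in-process helper and the 15-minute cap is an assumed default; a multi-instance deployment would need shared storage for the expiry times.&lt;/p&gt;

```python
import time


class DebugWindows:
    """Per-tenant debug logging that switches itself off after a capped window."""

    def __init__(self, max_seconds=900):
        self.max_seconds = max_seconds  # hard cap on any single window
        self._expires_at = {}

    def enable(self, tenant, seconds, now=None):
        now = time.time() if now is None else now
        # Never grant more than the cap, whatever the caller asks for.
        self._expires_at[tenant] = now + min(seconds, self.max_seconds)

    def is_enabled(self, tenant, now=None):
        now = time.time() if now is None else now
        expires_at = self._expires_at.get(tenant)
        if expires_at is None:
            return False
        if now >= expires_at:
            # Expired: forget the window so it cannot linger.
            del self._expires_at[tenant]
            return False
        return True


windows = DebugWindows()
windows.enable("tenant-a", seconds=3600, now=1000.0)  # capped to 900 seconds
```

&lt;p&gt;Callers check &lt;code&gt;is_enabled&lt;/code&gt; before emitting any deeper log line, and redaction still applies inside the window.&lt;/p&gt;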
&lt;hr&gt;
&lt;h2 id="audit-events-vs-debug-logs"&gt;Audit events vs debug logs&lt;/h2&gt;
&lt;p&gt;Treat them as different products:&lt;/p&gt;
&lt;h3 id="audit-events-for-governance"&gt;Audit events (for governance)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;immutable-ish records of side effects&lt;/li&gt;
&lt;li&gt;minimal sensitive data&lt;/li&gt;
&lt;li&gt;always on&lt;/li&gt;
&lt;li&gt;long retention&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example audit fields:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;who: tenant/user/client&lt;/li&gt;
&lt;li&gt;what: tool + action class (create/update/delete)&lt;/li&gt;
&lt;li&gt;when: timestamp&lt;/li&gt;
&lt;li&gt;where: environment&lt;/li&gt;
&lt;li&gt;result: success/failure&lt;/li&gt;
&lt;li&gt;resource IDs (safe identifiers)&lt;/li&gt;
&lt;li&gt;idempotency keys / plan IDs&lt;/li&gt;
&lt;/ul&gt;
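&lt;p&gt;The fields above map naturally onto a small, frozen record serialized as one JSON line per side effect. The sketch below is one possible shape; the class name, field names, and example values are assumptions, and real immutability has to come from the storage layer, not from the in-process object.&lt;/p&gt;

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    tenant: str
    user: str
    tool: str
    action: str            # create / update / delete
    environment: str
    result: str            # success / failure
    resource_ids: tuple    # safe identifiers only, never payloads
    idempotency_key: str
    timestamp: str


def audit_event(**fields):
    """Stamp the event with a UTC timestamp at creation time."""
    return AuditEvent(timestamp=datetime.now(timezone.utc).isoformat(), **fields)


def to_json_line(event):
    return json.dumps(asdict(event), sort_keys=True)


event = audit_event(
    tenant="acme",
    user="u-123",
    tool="update_ticket",
    action="update",
    environment="prod",
    result="success",
    resource_ids=("ticket-42",),
    idempotency_key="plan-7-step-2",
)
print(to_json_line(event))
```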
&lt;h3 id="debug-logs-for-engineers"&gt;Debug logs (for engineers)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;short retention&lt;/li&gt;
&lt;li&gt;more context&lt;/li&gt;
&lt;li&gt;highly controlled access&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Mixing the two is how you end up with &amp;ldquo;SharePoint logs full of PII&amp;rdquo; that no one wants to touch.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="dashboards-and-alerts"&gt;Dashboards and alerts&lt;/h2&gt;
&lt;h3 id="dashboards-start-simple"&gt;Dashboards (start simple)&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Tool reliability&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;top tools by error rate&lt;/li&gt;
&lt;li&gt;top tools by p95 latency&lt;/li&gt;
&lt;li&gt;timeouts per tool&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agent success&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;success rate by agent type&lt;/li&gt;
&lt;li&gt;&amp;ldquo;stuck runs&amp;rdquo; (runs exceeding max duration)&lt;/li&gt;
&lt;li&gt;average steps per run&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;cost per run&lt;/li&gt;
&lt;li&gt;cost per tenant&lt;/li&gt;
&lt;li&gt;top drivers (which tools/model calls)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id="alerts-avoid-noise"&gt;Alerts (avoid noise)&lt;/h3&gt;
&lt;p&gt;Alert on what is actionable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;tool error rate spikes for critical tools&lt;/li&gt;
&lt;li&gt;tool latency p95 spikes beyond SLO&lt;/li&gt;
&lt;li&gt;spikes in budget-exceeded runs (runaway behavior)&lt;/li&gt;
&lt;li&gt;spikes in policy denials (possible prompt injection attempt)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you use SLOs and error budgets, Google&amp;rsquo;s SRE material is a practical reference for turning SLOs into alerting strategies. [6]&lt;/p&gt;
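&lt;p&gt;For the SLO route, the core calculation is the burn rate: the observed error rate divided by the error budget (one minus the SLO target). A minimal sketch follows. The function names are hypothetical, and the default threshold of 14.4 is an assumption that corresponds to spending 2% of a 30-day budget within one hour; pick values that match your own windows.&lt;/p&gt;

```python
def burn_rate(errors, total, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


def should_page(errors, total, slo_target=0.999, threshold=14.4):
    """Page only when the budget is burning much faster than the SLO allows."""
    return burn_rate(errors, total, slo_target) >= threshold


# 20 failed tool calls out of 1,000 against a 99.9% SLO burns budget about 20x too fast.
print(should_page(errors=20, total=1000))
# 5 failures out of 1,000 is over budget but below the paging threshold.
print(should_page(errors=5, total=1000))
```

&lt;p&gt;Alerting on burn rate instead of raw error counts keeps a brief blip on a low-traffic tool from paging anyone.&lt;/p&gt;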
&lt;hr&gt;
&lt;h2 id="a-production-checklist"&gt;A production checklist&lt;/h2&gt;
&lt;h3 id="tracing"&gt;Tracing&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Every agent run has a trace ID.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Trace context propagates across MCP boundaries (W3C Trace Context). [2]&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Tool calls are spans with stable tool identifiers.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="metrics"&gt;Metrics&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Tool success/error/latency metrics exist.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Agent run success/latency/steps metrics exist.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Cost metrics exist and are monitored.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="logging"&gt;Logging&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Default logs are redacted summaries, not raw payloads.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Debug logging is time-bounded and access-controlled.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="audit"&gt;Audit&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Audit events exist for all side-effecting tools.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Audit records include &amp;ldquo;who/what/when/result&amp;rdquo; without leaking secrets.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="security"&gt;Security&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Observability does not become a secret exfil path (OWASP risks considered). [7]&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;p&gt;[1] OpenTelemetry - Documentation (overview): &lt;a href="https://opentelemetry.io/docs/" target="_blank" rel="noopener noreferrer"&gt;https://opentelemetry.io/docs/&lt;/a&gt;&lt;br&gt;
[2] W3C - Trace Context: &lt;a href="https://www.w3.org/TR/trace-context/" target="_blank" rel="noopener noreferrer"&gt;https://www.w3.org/TR/trace-context/&lt;/a&gt;&lt;br&gt;
[3] OpenTelemetry - Semantic conventions for HTTP (spans/metrics/logs): &lt;a href="https://opentelemetry.io/docs/specs/semconv/http/" target="_blank" rel="noopener noreferrer"&gt;https://opentelemetry.io/docs/specs/semconv/http/&lt;/a&gt;&lt;br&gt;
[4] OpenTelemetry Go - Instrumentation docs: &lt;a href="https://opentelemetry.io/docs/languages/go/instrumentation/" target="_blank" rel="noopener noreferrer"&gt;https://opentelemetry.io/docs/languages/go/instrumentation/&lt;/a&gt;&lt;br&gt;
[5] OpenTelemetry - Semantic conventions for HTTP metrics: &lt;a href="https://opentelemetry.io/docs/specs/semconv/http/http-metrics/" target="_blank" rel="noopener noreferrer"&gt;https://opentelemetry.io/docs/specs/semconv/http/http-metrics/&lt;/a&gt;&lt;br&gt;
[6] Google SRE Workbook - Alerting on SLOs: &lt;a href="https://sre.google/workbook/alerting-on-slos/" target="_blank" rel="noopener noreferrer"&gt;https://sre.google/workbook/alerting-on-slos/&lt;/a&gt;&lt;br&gt;
[7] OWASP - Top 10 for Large Language Model Applications: &lt;a href="https://owasp.org/www-project-top-10-for-large-language-model-applications/" target="_blank" rel="noopener noreferrer"&gt;https://owasp.org/www-project-top-10-for-large-language-model-applications/&lt;/a&gt;
&lt;/p&gt;</content:encoded></item></channel></rss>