<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Architecture | Roy Gabriel</title><link>https://roygabriel.dev/tags/architecture/</link><description>Roy Gabriel: DevOps Architect &amp; Applied AI Engineer. Technical blog on Go, MCP servers, Kubernetes, and production AI systems.</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Fri, 27 Feb 2026 03:18:04 +0000</lastBuildDate><atom:link href="https://roygabriel.dev/tags/architecture/index.xml" rel="self" type="application/rss+xml"/><item><title>When Enterprise Defaults Become Enterprise Debt</title><link>https://roygabriel.dev/blog/enterprise-defaults-enterprise-debt/</link><pubDate>Sat, 07 Feb 2026 09:00:00 -0500</pubDate><guid>https://roygabriel.dev/blog/enterprise-defaults-enterprise-debt/</guid><description>&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;&lt;strong&gt;Note on examples:&lt;/strong&gt; The scenarios below are &lt;strong&gt;anonymized composites&lt;/strong&gt;. They&amp;rsquo;re not a critique of any one organization; they&amp;rsquo;re patterns that repeat across industries.
The goal isn&amp;rsquo;t to &amp;ldquo;modernize for fun.&amp;rdquo; It&amp;rsquo;s to protect speed-to-market &lt;em&gt;and&lt;/em&gt; reliability as systems and organizations scale.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;Most enterprises don&amp;rsquo;t lose because they picked the &amp;ldquo;wrong&amp;rdquo; framework or cloud provider. They lose because old defaults - once rational - become invisible policy.&lt;/p&gt;</description><content:encoded>
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;&lt;strong&gt;Note on examples:&lt;/strong&gt; The scenarios below are &lt;strong&gt;anonymized composites&lt;/strong&gt;. They&amp;rsquo;re not a critique of any one organization; they&amp;rsquo;re patterns that repeat across industries.
The goal isn&amp;rsquo;t to &amp;ldquo;modernize for fun.&amp;rdquo; It&amp;rsquo;s to protect speed-to-market &lt;em&gt;and&lt;/em&gt; reliability as systems and organizations scale.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;Most enterprises don&amp;rsquo;t lose because they picked the &amp;ldquo;wrong&amp;rdquo; framework or cloud provider. They lose because old defaults - once rational - become invisible policy.&lt;/p&gt;
&lt;p&gt;The 90s and early 2000s optimized for constraints that were real at the time:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;hardware was expensive&lt;/li&gt;
&lt;li&gt;automation was immature&lt;/li&gt;
&lt;li&gt;environments were scarce&lt;/li&gt;
&lt;li&gt;security controls were largely manual&lt;/li&gt;
&lt;li&gt;uptime was achieved by cautious change, not by safe change&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Those constraints have shifted. But many organizations still run on &lt;strong&gt;architectural and governance defaults&lt;/strong&gt; designed for a different era.&lt;/p&gt;
&lt;p&gt;The result is predictable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;innovation slows&lt;/strong&gt; (lead time grows)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;quality degrades&lt;/strong&gt; (late integration + big-bang changes)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;reliability suffers&lt;/strong&gt; (risk is batched, blast radius expands)&lt;/li&gt;
&lt;li&gt;engineers spend more time navigating the system than improving it&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you want a single sentence summary: &lt;strong&gt;old patterns don&amp;rsquo;t just slow delivery - they also create the conditions for outages.&lt;/strong&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="tldr"&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Retire &amp;ldquo;analysis as delivery.&amp;rdquo; Timebox discovery and ship thin vertical slices.&lt;/li&gt;
&lt;li&gt;Treat cloud primitives as &lt;em&gt;primitives&lt;/em&gt;, not research projects (e.g., object storage is solved).&lt;/li&gt;
&lt;li&gt;Default to &lt;strong&gt;containers + orchestration&lt;/strong&gt; for most stateless services; use VMs deliberately, not reflexively. [5]&lt;/li&gt;
&lt;li&gt;Replace ticket queues and boards with &lt;strong&gt;guardrails + paved roads + policy-as-code&lt;/strong&gt;. [7][8]&lt;/li&gt;
&lt;li&gt;Measure what matters: &lt;strong&gt;lead time, deploy frequency, change failure rate, MTTR&lt;/strong&gt;. [1][2]&lt;/li&gt;
&lt;li&gt;Modernization works best as an incremental program, not a rewrite (Strangler Fig pattern). [12]&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="contents"&gt;Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#pattern-1-analysis-as-a-substitute-for-delivery"&gt;Pattern 1: Analysis as a substitute for delivery&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-2-reinventing-commodity-infrastructure"&gt;Pattern 2: Reinventing commodity infrastructure&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-3-vm-first-thinking-as-the-default"&gt;Pattern 3: VM-first thinking as the default&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-4-ticket-driven-infrastructure"&gt;Pattern 4: Ticket-driven infrastructure&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-5-change-advisory-board-for-routine-changes"&gt;Pattern 5: Change Advisory Board for routine changes&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-6-the-shared-database-empire"&gt;Pattern 6: The shared database empire&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-7-central-integration-as-a-chokepoint"&gt;Pattern 7: Central integration as a chokepoint&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-8-perma-pocs-and-innovation-theater"&gt;Pattern 8: Perma-POCs and innovation theater&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#replace-committees-with-guardrails"&gt;Replace committees with guardrails&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#modernize-without-a-rewrite"&gt;Modernize without a rewrite&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#verification-how-you-know-its-working"&gt;Verification: how you know it&amp;rsquo;s working&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-practical-checklist"&gt;A practical checklist&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#references"&gt;References&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="pattern-1-analysis-as-a-substitute-for-delivery"&gt;Pattern 1: Analysis as a substitute for delivery&lt;/h2&gt;
&lt;h3 id="what-it-looks-like"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;A team spends months (sometimes a year) on &amp;ldquo;analysis&amp;rdquo; for a capability that nobody can use, or learn from, until it&amp;rsquo;s built - usually in the hope of eliminating all risk up front.&lt;/p&gt;
&lt;p&gt;Common examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;multi-tenant &amp;ldquo;high availability image storage&amp;rdquo; designed from scratch&lt;/li&gt;
&lt;li&gt;designing bespoke event systems when managed queues exist&lt;/li&gt;
&lt;li&gt;writing 40-page architecture documents before the first running slice exists&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-it-existed"&gt;Why it existed&lt;/h3&gt;
&lt;p&gt;When provisioning took weeks and environments were scarce, analysis was a rational risk-reducer.&lt;/p&gt;
&lt;h3 id="the-hidden-tax"&gt;The hidden tax&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;You push real learning to the end (integration failures happen late).&lt;/li&gt;
&lt;li&gt;Decisions get made with imaginary constraints, not measured ones.&lt;/li&gt;
&lt;li&gt;Teams optimize for &amp;ldquo;approval&amp;rdquo; rather than &amp;ldquo;outcome.&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-replacement-pattern"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Timebox discovery and require a running slice early.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A strong default:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1-2 week spike to validate constraints&lt;/li&gt;
&lt;li&gt;a thin vertical slice in production (even behind a flag)&lt;/li&gt;
&lt;li&gt;iterate based on real telemetry and user feedback&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="transition-step-low-drama"&gt;Transition step (low drama)&lt;/h3&gt;
&lt;p&gt;Create an &amp;ldquo;RFC-lite&amp;rdquo; template:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;problem statement + constraints&lt;/li&gt;
&lt;li&gt;1-2 options with tradeoffs&lt;/li&gt;
&lt;li&gt;a &lt;strong&gt;plan to measure&lt;/strong&gt; (latency, cost, reliability)&lt;/li&gt;
&lt;li&gt;a &lt;strong&gt;thin-slice milestone&lt;/strong&gt; date&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="pattern-2-reinventing-commodity-infrastructure"&gt;Pattern 2: Reinventing commodity infrastructure&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-1"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;Teams treat widely proven primitives as novel:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;object storage&lt;/li&gt;
&lt;li&gt;queues&lt;/li&gt;
&lt;li&gt;identity&lt;/li&gt;
&lt;li&gt;metrics + tracing&lt;/li&gt;
&lt;li&gt;load balancing&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A classic symptom: &amp;ldquo;We need to design HA multi-tenant object storage,&amp;rdquo; as if durable object storage isn&amp;rsquo;t already a standard building block.&lt;/p&gt;
&lt;h3 id="why-it-existed-1"&gt;Why it existed&lt;/h3&gt;
&lt;p&gt;On-prem and early hosting eras forced you to build a lot yourself.&lt;/p&gt;
&lt;h3 id="the-hidden-tax-1"&gt;The hidden tax&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Reinventing primitives becomes a multi-quarter project.&lt;/li&gt;
&lt;li&gt;Reliability becomes your problem (and you will be on call for it).&lt;/li&gt;
&lt;li&gt;The business pays for the same capability twice: once in time, and again in incidents.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-replacement-pattern-1"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Default to &lt;strong&gt;managed or proven primitives&lt;/strong&gt; unless you have a documented reason not to.&lt;/p&gt;
&lt;p&gt;For example, Amazon S3 documents design targets of 99.999999999% (11 nines) durability and 99.99% availability for its standard storage class; other providers publish comparable targets. [11]&lt;/p&gt;
&lt;h3 id="transition-step"&gt;Transition step&lt;/h3&gt;
&lt;p&gt;Maintain a &amp;ldquo;Reference Implementations&amp;rdquo; catalog:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;How we do object storage&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;How we do queues&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;How we do auth&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;How we do telemetry&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If the default is documented and supported, teams stop re-litigating fundamentals.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="pattern-3-vm-first-thinking-as-the-default"&gt;Pattern 3: VM-first thinking as the default&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-2"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;Everything runs on VMs because &amp;ldquo;that&amp;rsquo;s what we do,&amp;rdquo; even when the workload is a stateless API, worker, or event consumer.&lt;/p&gt;
&lt;h3 id="why-it-existed-2"&gt;Why it existed&lt;/h3&gt;
&lt;p&gt;VMs were the universal unit of deployment for a long time, and they map cleanly to org boundaries (&amp;ldquo;this server is mine&amp;rdquo;).&lt;/p&gt;
&lt;h3 id="the-hidden-tax-2"&gt;The hidden tax&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;drift (snowflake servers)&lt;/li&gt;
&lt;li&gt;slow rollouts&lt;/li&gt;
&lt;li&gt;inconsistent security posture&lt;/li&gt;
&lt;li&gt;wasted compute due to poor bin-packing&lt;/li&gt;
&lt;li&gt;limited standardization across services&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-replacement-pattern-2"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;For many enterprise services, &lt;strong&gt;containers orchestrated by Kubernetes&lt;/strong&gt; are a strong default for stateless workloads. The Kubernetes documentation describes Deployments as a good fit for stateless application workloads, where any Pod is interchangeable and can be replaced if needed. [5]&lt;/p&gt;
&lt;p&gt;This doesn&amp;rsquo;t mean &amp;ldquo;Kubernetes for everything,&amp;rdquo; but it does mean:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;prefer declarative workloads with health checks and rollout controls&lt;/li&gt;
&lt;li&gt;keep VMs for deliberate cases (legacy constraints, special licensing, unique state, or when orchestration adds no value)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="transition-step-1"&gt;Transition step&lt;/h3&gt;
&lt;p&gt;Start with &amp;ldquo;Kubernetes-first for new stateless services,&amp;rdquo; not a migration mandate.&lt;/p&gt;
&lt;p&gt;Then build operational guardrails:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;resource requests/limits so services behave predictably under load [6]&lt;/li&gt;
&lt;li&gt;standardized readiness/liveness probes&lt;/li&gt;
&lt;li&gt;standard ingress + auth patterns&lt;/li&gt;
&lt;/ul&gt;
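&lt;p&gt;Guardrails like these can be checked by code instead of by a reviewer. The sketch below is a minimal illustration, not a real admission controller: it assumes the Deployment manifest has already been parsed into a Python dict, and the function name &lt;code&gt;guardrail_violations&lt;/code&gt; and its message format are hypothetical.&lt;/p&gt;

```python
"""Policy-as-code sketch: check a parsed Deployment manifest for the
guardrails listed above. Names and messages are illustrative assumptions,
not part of any Kubernetes tooling."""

REQUIRED_RESOURCES = ("requests", "limits")
REQUIRED_PROBES = ("readinessProbe", "livenessProbe")


def guardrail_violations(manifest: dict) -> list[str]:
    """Return one human-readable message per missing guardrail."""
    violations = []
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "unnamed")
        resources = container.get("resources", {})
        for key in REQUIRED_RESOURCES:
            if not resources.get(key):
                violations.append(f"{name}: missing resources.{key}")
        for probe in REQUIRED_PROBES:
            if probe not in container:
                violations.append(f"{name}: missing {probe}")
    return violations


if __name__ == "__main__":
    deployment = {
        "kind": "Deployment",
        "spec": {"template": {"spec": {"containers": [
            {
                "name": "api",
                "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}},
                "readinessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
            }
        ]}}},
    }
    # Reports the missing limits and the missing liveness probe.
    for message in guardrail_violations(deployment):
        print(message)
```

&lt;p&gt;Run in CI against rendered manifests, a check of this shape turns the guardrail into fast, consistent feedback. In a cluster, the same rules would more likely live in an admission policy engine.&lt;/p&gt;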
&lt;hr&gt;
&lt;h2 id="pattern-4-ticket-driven-infrastructure"&gt;Pattern 4: Ticket-driven infrastructure&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-3"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;Need a database? Ticket.
Need an environment? Ticket.
Need DNS? Ticket.
Need a queue? Ticket.&lt;/p&gt;
&lt;p&gt;Eventually, the ticketing system becomes the true control plane.&lt;/p&gt;
&lt;h3 id="why-it-existed-3"&gt;Why it existed&lt;/h3&gt;
&lt;p&gt;It&amp;rsquo;s a reasonable response when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;environments are scarce&lt;/li&gt;
&lt;li&gt;changes are risky&lt;/li&gt;
&lt;li&gt;platform knowledge is specialized&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-hidden-tax-3"&gt;The hidden tax&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;queues become normalized (&amp;ldquo;it takes 3 weeks to get a namespace&amp;rdquo;)&lt;/li&gt;
&lt;li&gt;teams route around the platform&lt;/li&gt;
&lt;li&gt;reliability doesn&amp;rsquo;t improve; delivery just slows&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-replacement-pattern-3"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Self-service via &lt;strong&gt;GitOps&lt;/strong&gt; and platform &amp;ldquo;paved roads.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;OpenGitOps is a set of open-source standards and best practices that help organizations adopt a structured, standardized approach to GitOps. [7] The point isn&amp;rsquo;t a specific tool - it&amp;rsquo;s the principle: &lt;strong&gt;desired state is declarative and auditable.&lt;/strong&gt;&lt;/p&gt;
&lt;h3 id="transition-step-2"&gt;Transition step&lt;/h3&gt;
&lt;p&gt;Pick one high-frequency request and eliminate it:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;create a service with a standard ingress/auth/telemetry&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;provision a queue&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;create a dev environment&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Make the paved road the path of least resistance.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="pattern-5-change-advisory-board-for-routine-changes"&gt;Pattern 5: Change Advisory Board for routine changes&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-4"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;Every change - routine or risky - requires synchronous approval.&lt;/p&gt;
&lt;h3 id="why-it-existed-4"&gt;Why it existed&lt;/h3&gt;
&lt;p&gt;When changes were large, rare, and manual, centralized review reduced catastrophic surprises.&lt;/p&gt;
&lt;h3 id="the-hidden-tax-4"&gt;The hidden tax&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;you batch changes (bigger releases are riskier)&lt;/li&gt;
&lt;li&gt;emergency changes bypass process (creating inconsistency)&lt;/li&gt;
&lt;li&gt;&amp;ldquo;approval&amp;rdquo; becomes the goal rather than &lt;strong&gt;evidence of safety&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;DORA&amp;rsquo;s guidance on streamlining change approval emphasizes making the regular change process fast and reliable enough that it can handle emergencies, and reframes how CAB fits into continuous delivery. [3] Continuous delivery literature makes a similar point: smaller, more frequent changes reduce risk and ease remediation. [4]&lt;/p&gt;
&lt;h3 id="the-replacement-pattern-4"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Move to &lt;strong&gt;evidence-based change approval&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;automated tests&lt;/li&gt;
&lt;li&gt;policy-as-code checks&lt;/li&gt;
&lt;li&gt;progressive delivery (canaries, phased rollouts)&lt;/li&gt;
&lt;li&gt;real-time telemetry tied to the release&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="transition-step-3"&gt;Transition step&lt;/h3&gt;
&lt;p&gt;Keep CAB, but change its scope:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;focus on high-risk changes and cross-team coordination&lt;/li&gt;
&lt;li&gt;use automation and metrics for routine changes&lt;/li&gt;
&lt;/ul&gt;
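&lt;p&gt;One way to picture the narrower CAB scope is as a routing rule over the evidence a pipeline already produces. This is a sketch under stated assumptions: the &lt;code&gt;ChangeEvidence&lt;/code&gt; fields, the &lt;code&gt;approval_route&lt;/code&gt; function, and the three outcomes are hypothetical, and the real criteria belong to your own risk model.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvidence:
    """Evidence the pipeline attaches to a change (fields are assumptions)."""
    tests_passed: bool
    policy_checks_passed: bool
    progressive_rollout: bool   # canary or phased rollout is configured
    high_risk: bool             # e.g. schema migration or cross-team impact


def approval_route(evidence: ChangeEvidence) -> str:
    """Return 'blocked', 'cab-review', or 'auto-approve' for a change."""
    if not (evidence.tests_passed and evidence.policy_checks_passed):
        return "blocked"        # missing evidence gets fixed, not escalated
    if evidence.high_risk:
        return "cab-review"     # CAB keeps risky and cross-team changes
    if evidence.progressive_rollout:
        return "auto-approve"   # routine change backed by full evidence
    return "cab-review"         # no rollout safety net, so ask a human


if __name__ == "__main__":
    routine = ChangeEvidence(True, True, True, False)
    print(approval_route(routine))
```

&lt;p&gt;The design choice worth noting is that a failed check never reaches a meeting: the board only sees changes that already have evidence behind them.&lt;/p&gt;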
&lt;hr&gt;
&lt;h2 id="pattern-6-the-shared-database-empire"&gt;Pattern 6: The shared database empire&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-5"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;A central database is shared by many services.
Teams coordinate schema changes across multiple apps and releases.&lt;/p&gt;
&lt;p&gt;Microservices.io describes the &amp;ldquo;shared database&amp;rdquo; pattern explicitly: multiple services access a single database directly. [10]&lt;/p&gt;
&lt;h3 id="why-it-existed-5"&gt;Why it existed&lt;/h3&gt;
&lt;p&gt;It&amp;rsquo;s simple at first:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;one place for data&lt;/li&gt;
&lt;li&gt;easy joins&lt;/li&gt;
&lt;li&gt;one backup plan&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-hidden-tax-5"&gt;The hidden tax&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;coupling spreads everywhere&lt;/li&gt;
&lt;li&gt;every change becomes cross-team work&lt;/li&gt;
&lt;li&gt;reliability suffers because one DB problem becomes everyone&amp;rsquo;s problem&lt;/li&gt;
&lt;li&gt;schema evolution becomes political&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-replacement-pattern-5"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Prefer service-owned data boundaries. Microservices.io&amp;rsquo;s &amp;ldquo;database per service&amp;rdquo; pattern describes keeping a service&amp;rsquo;s data private and accessible only via its API. [9]&lt;/p&gt;
&lt;h3 id="transition-step-4"&gt;Transition step&lt;/h3&gt;
&lt;p&gt;You don&amp;rsquo;t have to &amp;ldquo;microservices everything.&amp;rdquo;
Start by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;carving out new tables owned by one service&lt;/li&gt;
&lt;li&gt;introducing an API boundary&lt;/li&gt;
&lt;li&gt;migrating consumers gradually&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="pattern-7-central-integration-as-a-chokepoint"&gt;Pattern 7: Central integration as a chokepoint&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-6"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;All integrations must go through a single shared integration layer/team (classic ESB gravity).&lt;/p&gt;
&lt;h3 id="why-it-existed-6"&gt;Why it existed&lt;/h3&gt;
&lt;p&gt;Centralizing integration gave consistency when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;protocols were messy&lt;/li&gt;
&lt;li&gt;tooling was expensive&lt;/li&gt;
&lt;li&gt;teams lacked automation&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-hidden-tax-6"&gt;The hidden tax&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;integration lead times explode&lt;/li&gt;
&lt;li&gt;teams stop experimenting&lt;/li&gt;
&lt;li&gt;one backlog becomes everyone&amp;rsquo;s bottleneck&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-replacement-pattern-6"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Standardize:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;interfaces&lt;/strong&gt; (auth, tracing, deployment, contract testing)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;platform guardrails&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&amp;hellip;not every internal implementation detail.&lt;/p&gt;
&lt;h3 id="transition-step-5"&gt;Transition step&lt;/h3&gt;
&lt;p&gt;Carve out one &amp;ldquo;self-service integration&amp;rdquo; paved road:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;standard service template&lt;/li&gt;
&lt;li&gt;standard auth&lt;/li&gt;
&lt;li&gt;standard telemetry&lt;/li&gt;
&lt;li&gt;contracts + examples&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="pattern-8-perma-pocs-and-innovation-theater"&gt;Pattern 8: Perma-POCs and innovation theater&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-7"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;Prototypes exist forever, never becoming production systems.&lt;/p&gt;
&lt;p&gt;Especially common with AI initiatives:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;impressive demos&lt;/li&gt;
&lt;li&gt;no production constraints&lt;/li&gt;
&lt;li&gt;no ownership for operability&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-it-existed-7"&gt;Why it existed&lt;/h3&gt;
&lt;p&gt;POCs are a safe way to explore unknowns.&lt;/p&gt;
&lt;h3 id="the-hidden-tax-7"&gt;The hidden tax&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;teams lose trust (&amp;ldquo;innovation never ships&amp;rdquo;)&lt;/li&gt;
&lt;li&gt;production teams inherit half-baked work&lt;/li&gt;
&lt;li&gt;opportunity cost compounds&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-replacement-pattern-7"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;From day one, require:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;an owner&lt;/li&gt;
&lt;li&gt;a production path&lt;/li&gt;
&lt;li&gt;a thin slice in a real environment&lt;/li&gt;
&lt;li&gt;explicit safety requirements (timeouts, budgets, telemetry)&lt;/li&gt;
&lt;/ul&gt;
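&lt;p&gt;The safety requirements can start very small. The sketch below wraps whatever the prototype calls with a spend budget and per-call telemetry. Everything in it is hypothetical (&lt;code&gt;GuardedClient&lt;/code&gt;, &lt;code&gt;BudgetExceeded&lt;/code&gt;, the abstract cost units), and it only &lt;em&gt;records&lt;/em&gt; whether a call overran its timeout; actually cancelling a slow call depends on the client library and is left out.&lt;/p&gt;

```python
import time


class BudgetExceeded(RuntimeError):
    """Raised when a call would push spend past the configured budget."""


class GuardedClient:
    """Wraps a callable with a spend budget and per-call telemetry.

    `call_fn` stands in for whatever the prototype invokes (a model API,
    a search backend, and so on). All names here are assumptions.
    """

    def __init__(self, call_fn, budget_units: float, timeout_s: float):
        self.call_fn = call_fn
        self.budget_units = budget_units
        self.timeout_s = timeout_s
        self.spent_units = 0.0
        self.telemetry = []   # one tuple per call: (seconds, cost, overran)

    def call(self, request, cost_units: float):
        # Refuse the call up front rather than discovering overspend later.
        if self.spent_units + cost_units > self.budget_units:
            raise BudgetExceeded(f"budget of {self.budget_units} units exhausted")
        started = time.monotonic()
        result = self.call_fn(request)
        elapsed = time.monotonic() - started
        self.spent_units += cost_units
        self.telemetry.append((elapsed, cost_units, elapsed > self.timeout_s))
        return result


if __name__ == "__main__":
    client = GuardedClient(lambda text: text.upper(), budget_units=2, timeout_s=5.0)
    print(client.call("hello", cost_units=1))
    print(len(client.telemetry), "call(s) recorded")
```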
&lt;h3 id="transition-step-6"&gt;Transition step&lt;/h3&gt;
&lt;p&gt;Make &amp;ldquo;POC exit criteria&amp;rdquo; mandatory:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;what metrics prove value?&lt;/li&gt;
&lt;li&gt;what is the minimum shippable slice?&lt;/li&gt;
&lt;li&gt;what must be true for reliability and security?&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="replace-committees-with-guardrails"&gt;Replace committees with guardrails&lt;/h2&gt;
&lt;p&gt;A recurring theme: &lt;strong&gt;humans are expensive control planes&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The modern move is to convert &amp;ldquo;tribal rules&amp;rdquo; into:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;templates&lt;/li&gt;
&lt;li&gt;automation&lt;/li&gt;
&lt;li&gt;policy-as-code&lt;/li&gt;
&lt;li&gt;paved paths&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Microsoft&amp;rsquo;s platform engineering work describes &amp;ldquo;paved paths&amp;rdquo; within an internal developer platform as recommended paths to production that guide developers through requirements without sacrificing velocity. [8]&lt;/p&gt;
&lt;p&gt;Guardrails beat gatekeepers because guardrails are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;consistent&lt;/li&gt;
&lt;li&gt;fast&lt;/li&gt;
&lt;li&gt;auditable&lt;/li&gt;
&lt;li&gt;scalable&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="modernize-without-a-rewrite"&gt;Modernize without a rewrite&lt;/h2&gt;
&lt;p&gt;Big-bang rewrites are expensive and risky. Incremental modernization is usually the winning move.&lt;/p&gt;
&lt;p&gt;The Strangler Fig pattern is a well-known approach: wrap or route traffic so you can replace parts of a legacy system gradually. [12]&lt;/p&gt;
&lt;p&gt;Practical approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;put a facade in front of the legacy surface&lt;/li&gt;
&lt;li&gt;carve off one slice at a time&lt;/li&gt;
&lt;li&gt;measure outcomes&lt;/li&gt;
&lt;li&gt;keep rollback easy&lt;/li&gt;
&lt;/ul&gt;
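&lt;p&gt;The four steps above can be sketched as a small routing facade. This is an illustration, not a production proxy: handlers are plain callables standing in for upstream services, and the class and method names (&lt;code&gt;StranglerFacade&lt;/code&gt;, &lt;code&gt;migrate&lt;/code&gt;, &lt;code&gt;rollback&lt;/code&gt;) are hypothetical.&lt;/p&gt;

```python
class StranglerFacade:
    """Routes each request path to a migrated handler or to the legacy system.

    In a real system the handlers would be upstream services behind a proxy
    or gateway; here they are callables so the routing logic stays visible.
    """

    def __init__(self, legacy_handler):
        self.legacy_handler = legacy_handler
        self.migrated = {}                         # path prefix -> new handler
        self.hits = {"legacy": 0, "migrated": 0}   # crude outcome measurement

    def migrate(self, prefix: str, handler) -> None:
        """Carve off one slice: send this prefix to the new implementation."""
        self.migrated[prefix] = handler

    def rollback(self, prefix: str) -> None:
        """Removing the route restores legacy behaviour for that slice."""
        self.migrated.pop(prefix, None)

    def handle(self, path: str) -> str:
        # Longest matching prefix wins, so nested slices can move independently.
        matches = [p for p in self.migrated if path.startswith(p)]
        if matches:
            self.hits["migrated"] += 1
            return self.migrated[max(matches, key=len)](path)
        self.hits["legacy"] += 1
        return self.legacy_handler(path)


if __name__ == "__main__":
    facade = StranglerFacade(lambda path: f"legacy:{path}")
    facade.migrate("/invoices", lambda path: f"new:{path}")
    print(facade.handle("/invoices/42"))   # new:/invoices/42
    print(facade.handle("/orders/7"))      # legacy:/orders/7
    facade.rollback("/invoices")
    print(facade.handle("/invoices/42"))   # legacy:/invoices/42
```

&lt;p&gt;Because rollback is a one-line route removal, each slice can be tried in production without committing to it.&lt;/p&gt;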
&lt;p&gt;This isn&amp;rsquo;t glamorous. It works.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="verification-how-you-know-its-working"&gt;Verification: how you know it&amp;rsquo;s working&lt;/h2&gt;
&lt;p&gt;If you want to avoid &amp;ldquo;modernization theater,&amp;rdquo; measure.&lt;/p&gt;
&lt;p&gt;DORA&amp;rsquo;s metrics guidance is a solid baseline: deployment frequency, lead time for changes, change failure rate, and time to restore service (MTTR). [1] The 2024 DORA report continues to focus on the organizational capabilities that drive high performance. [2]&lt;/p&gt;
&lt;p&gt;A simple evidence loop:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Pick one value stream (one product or platform slice).&lt;/li&gt;
&lt;li&gt;Baseline the four DORA metrics.&lt;/li&gt;
&lt;li&gt;Remove one friction point (one pattern).&lt;/li&gt;
&lt;li&gt;Re-measure.&lt;/li&gt;
&lt;/ol&gt;
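&lt;p&gt;Step 2 does not need a metrics platform to get started. The sketch below computes a baseline from a list of deployment records. The record shape (&lt;code&gt;Deployment&lt;/code&gt;), the function name &lt;code&gt;dora_baseline&lt;/code&gt;, and the use of medians are assumptions for illustration, not DORA&amp;rsquo;s official definitions.&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    """One production deployment (record shape is an assumption)."""
    committed_at: datetime                   # first commit in the change
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it degrade service?
    restored_at: Optional[datetime] = None   # when service was restored


def dora_baseline(deployments: list[Deployment], window_days: int) -> dict:
    """Compute the four metrics for one value stream over a time window."""
    if not deployments:
        raise ValueError("need at least one deployment to baseline")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restores = [d.restored_at - d.deployed_at
                for d in failures if d.restored_at is not None]
    return {
        "deploys_per_week": len(deployments) / (window_days / 7),
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restores) if restores else None,
    }


if __name__ == "__main__":
    start = datetime(2026, 1, 5, 9, 0)
    history = [
        Deployment(start, start + timedelta(hours=2)),
        Deployment(start, start + timedelta(hours=4), failed=True,
                   restored_at=start + timedelta(hours=5)),
    ]
    for name, value in dora_baseline(history, window_days=7).items():
        print(name, value)
```

&lt;p&gt;Running the same function before and after removing a friction point gives the re-measurement in step 4 a like-for-like comparison.&lt;/p&gt;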
&lt;p&gt;If your metrics don&amp;rsquo;t move, you didn&amp;rsquo;t remove the real constraint.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="a-practical-checklist"&gt;A practical checklist&lt;/h2&gt;
&lt;p&gt;If you&amp;rsquo;re trying to retire &amp;ldquo;enterprise debt&amp;rdquo; safely:&lt;/p&gt;
&lt;h3 id="delivery"&gt;Delivery&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Timebox analysis; require a running slice early.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Prefer small changes and frequent releases; avoid batching.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="platform"&gt;Platform&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Provide a paved road for common workflows (service template, auth, telemetry). [8]&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Remove ticket queues for repeatable requests (self-service + GitOps). [7]&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="reliability"&gt;Reliability&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Standardize timeouts, retries, budgets, and resource requests/limits. [6]&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Use progressive delivery where risk is high.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="architecture"&gt;Architecture&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Reduce shared DB coupling; establish service-owned boundaries. [9][10]&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Modernize incrementally (Strangler Fig), not via big-bang rewrites. [12]&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="governance"&gt;Governance&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Replace routine approvals with evidence: tests + policy-as-code + telemetry. [3][4]&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;p&gt;[1] DORA - &amp;ldquo;DORA&amp;rsquo;s software delivery performance metrics (guide)&amp;rdquo;. &lt;a href="https://dora.dev/guides/dora-metrics/" target="_blank" rel="noopener noreferrer"&gt;https://dora.dev/guides/dora-metrics/&lt;/a&gt;
[2] DORA - &amp;ldquo;Accelerate State of DevOps Report 2024&amp;rdquo;. &lt;a href="https://dora.dev/research/2024/dora-report/" target="_blank" rel="noopener noreferrer"&gt;https://dora.dev/research/2024/dora-report/&lt;/a&gt;
[3] DORA - &amp;ldquo;Streamlining change approval (capability)&amp;rdquo;. &lt;a href="https://dora.dev/capabilities/streamlining-change-approval/" target="_blank" rel="noopener noreferrer"&gt;https://dora.dev/capabilities/streamlining-change-approval/&lt;/a&gt;
[4] ContinuousDelivery.com - &amp;ldquo;Continuous Delivery and ITIL: Change Management&amp;rdquo;. &lt;a href="https://continuousdelivery.com/2010/11/continuous-delivery-and-itil-change-management/" target="_blank" rel="noopener noreferrer"&gt;https://continuousdelivery.com/2010/11/continuous-delivery-and-itil-change-management/&lt;/a&gt;
[5] Kubernetes docs - &amp;ldquo;Workloads (Deployments are a good fit for stateless workloads)&amp;rdquo;. &lt;a href="https://kubernetes.io/docs/concepts/workloads/" target="_blank" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/concepts/workloads/&lt;/a&gt;
[6] Kubernetes docs - &amp;ldquo;Resource Management for Pods and Containers (requests/limits)&amp;rdquo;. &lt;a href="https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/" target="_blank" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/&lt;/a&gt;
[7] OpenGitOps - &amp;ldquo;What is OpenGitOps?&amp;rdquo; and project background. &lt;a href="https://opengitops.dev/" target="_blank" rel="noopener noreferrer"&gt;https://opengitops.dev/&lt;/a&gt;
and &lt;a href="https://opengitops.dev/about/" target="_blank" rel="noopener noreferrer"&gt;https://opengitops.dev/about/&lt;/a&gt;
[8] Microsoft Engineering Blog - &amp;ldquo;Building paved paths: the journey to platform engineering&amp;rdquo;. &lt;a href="https://devblogs.microsoft.com/engineering-at-microsoft/building-paved-paths-the-journey-to-platform-engineering/" target="_blank" rel="noopener noreferrer"&gt;https://devblogs.microsoft.com/engineering-at-microsoft/building-paved-paths-the-journey-to-platform-engineering/&lt;/a&gt;
[9] Microservices.io - &amp;ldquo;Database per service&amp;rdquo; pattern. &lt;a href="https://microservices.io/patterns/data/database-per-service" target="_blank" rel="noopener noreferrer"&gt;https://microservices.io/patterns/data/database-per-service&lt;/a&gt;
[10] Microservices.io - &amp;ldquo;Shared database&amp;rdquo; pattern. &lt;a href="https://microservices.io/patterns/data/shared-database.html" target="_blank" rel="noopener noreferrer"&gt;https://microservices.io/patterns/data/shared-database.html&lt;/a&gt;
[11] AWS documentation - &amp;ldquo;Data protection in Amazon S3 (durability/availability design goals)&amp;rdquo;. &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/DataDurability.html" target="_blank" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/AmazonS3/latest/userguide/DataDurability.html&lt;/a&gt;
[12] Martin Fowler - &amp;ldquo;Strangler Fig Application&amp;rdquo; (legacy modernization pattern). &lt;a href="https://martinfowler.com/bliki/StranglerFigApplication.html" target="_blank" rel="noopener noreferrer"&gt;https://martinfowler.com/bliki/StranglerFigApplication.html&lt;/a&gt;
&lt;/p&gt;</content:encoded></item><item><title>Stop Shipping Slide Decks</title><link>https://roygabriel.dev/blog/stop-shipping-slide-decks/</link><pubDate>Sat, 31 Jan 2026 11:15:00 -0500</pubDate><guid>https://roygabriel.dev/blog/stop-shipping-slide-decks/</guid><description>&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;&lt;strong&gt;Position:&lt;/strong&gt; This is not &amp;ldquo;documentation bad.&amp;rdquo;
This is &amp;ldquo;documentation is a tool.&amp;rdquo; If it increases lead time, hides truth, or replaces learning, it&amp;rsquo;s not helping.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;In software, the real &amp;ldquo;source of truth&amp;rdquo; is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;running systems&lt;/li&gt;
&lt;li&gt;code and configuration&lt;/li&gt;
&lt;li&gt;production telemetry&lt;/li&gt;
&lt;li&gt;incident history&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Documentation should reduce uncertainty and speed up decisions. But two artifacts routinely do the opposite in large organizations:&lt;/p&gt;</description><content:encoded>
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;&lt;strong&gt;Position:&lt;/strong&gt; This is not &amp;ldquo;documentation bad.&amp;rdquo;
This is &amp;ldquo;documentation is a tool.&amp;rdquo; If it increases lead time, hides truth, or replaces learning, it&amp;rsquo;s not helping.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;In software, the real &amp;ldquo;source of truth&amp;rdquo; is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;running systems&lt;/li&gt;
&lt;li&gt;code and configuration&lt;/li&gt;
&lt;li&gt;production telemetry&lt;/li&gt;
&lt;li&gt;incident history&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Documentation should reduce uncertainty and speed up decisions. But two artifacts routinely do the opposite in large organizations:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;the 40-page slide deck&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;the Word doc living somewhere in SharePoint that nobody can find&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These artifacts often become &lt;em&gt;deliverables&lt;/em&gt; - a substitute for building. They make it possible to spend months &amp;ldquo;progressing&amp;rdquo; without ever encountering reality.&lt;/p&gt;
&lt;p&gt;And here&amp;rsquo;s the part most orgs miss:&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;If you&amp;rsquo;re going to fail, you want to fail &lt;strong&gt;quickly and cheaply&lt;/strong&gt;, not slowly and expensively. [4]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That doesn&amp;rsquo;t mean reckless shipping. It means running a tight learning loop and letting reality correct you early - before you&amp;rsquo;ve sunk quarters of time into the wrong solution.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="tldr"&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Decks are great for storytelling. They are bad as an engineering system of record.&lt;/li&gt;
&lt;li&gt;&amp;ldquo;SharePoint architecture docs&amp;rdquo; become a &lt;strong&gt;document cemetery&lt;/strong&gt;: hard to find, hard to diff, and easy to ignore.&lt;/li&gt;
&lt;li&gt;The Agile Manifesto explicitly values &lt;strong&gt;working software over comprehensive documentation&lt;/strong&gt;. [1] And one Agile principle states that working software is the primary measure of progress. [2]&lt;/li&gt;
&lt;li&gt;Replace decks/docs-as-deliverables with:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RFC-lite&lt;/strong&gt; (1-2 pages) + a &lt;strong&gt;running thin slice&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ADRs&lt;/strong&gt; (Architecture Decision Records) to capture decisions + tradeoffs [5][6]&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Docs-as-code&lt;/strong&gt; (Markdown in the repo, reviewed like code)&lt;/li&gt;
&lt;li&gt;diagrams that are versioned and easy to update&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Measure improvement with system outcomes (lead time, deploy frequency, change failure rate, MTTR). [3]&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="contents"&gt;Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#pattern-1-deck-driven-development"&gt;Pattern 1: Deck-driven development&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-2-sharepoint-document-cemeteries"&gt;Pattern 2: SharePoint document cemeteries&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-3-architecture-as-narrative-not-decisions"&gt;Pattern 3: Architecture as narrative, not decisions&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-4-design-phase-gating"&gt;Pattern 4: &amp;ldquo;Design phase&amp;rdquo; gating&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-5-documentation-that-never-gets-pruned"&gt;Pattern 5: Documentation that never gets pruned&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#what-to-do-instead-a-documentation-system-that-ships"&gt;What to do instead: a documentation system that ships&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#verification-how-you-know-its-working"&gt;Verification: how you know it&amp;rsquo;s working&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-practical-checklist"&gt;A practical checklist&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#references"&gt;References&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="pattern-1-deck-driven-development"&gt;Pattern 1: Deck-driven development&lt;/h2&gt;
&lt;h3 id="what-it-looks-like"&gt;What it looks like&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;A 40-page deck is created to describe a system that doesn&amp;rsquo;t exist yet.&lt;/li&gt;
&lt;li&gt;The deck gets reviewed by multiple groups.&lt;/li&gt;
&lt;li&gt;Approval is treated as progress.&lt;/li&gt;
&lt;li&gt;When implementation starts, the world has changed - or key constraints were missed.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-it-exists"&gt;Why it exists&lt;/h3&gt;
&lt;p&gt;Decks are socially useful:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;they compress complexity into a narrative&lt;/li&gt;
&lt;li&gt;they help leaders &amp;ldquo;see&amp;rdquo; a plan&lt;/li&gt;
&lt;li&gt;they make uncertainty feel controlled&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-hidden-tax"&gt;The hidden tax&lt;/h3&gt;
&lt;p&gt;Decks are a poor engineering artifact because they&amp;rsquo;re:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;low fidelity&lt;/strong&gt;: they rarely contain executable truth&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;hard to maintain&lt;/strong&gt;: updates are manual and usually lag reality&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;hard to diff&lt;/strong&gt;: you can&amp;rsquo;t easily review what changed and why&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;easy to perform&lt;/strong&gt;: a deck can look complete while the design is still untested&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;not tied to code&lt;/strong&gt;: no direct path from &amp;ldquo;decision&amp;rdquo; -&amp;gt; &amp;ldquo;implementation&amp;rdquo; -&amp;gt; &amp;ldquo;verification&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The worst outcome isn&amp;rsquo;t that the deck is wrong. It&amp;rsquo;s that the deck delays the point where you discover what&amp;rsquo;s wrong.&lt;/p&gt;
&lt;h3 id="the-replacement-pattern"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Use decks for storytelling &lt;strong&gt;after&lt;/strong&gt; you have reality. Use engineering artifacts to discover reality.&lt;/p&gt;
&lt;p&gt;A strong default:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RFC-lite&lt;/strong&gt; (1-2 pages)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;a runnable thin slice&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;measurable verification&lt;/strong&gt; (latency, cost envelope, failure mode)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This aligns with Agile&amp;rsquo;s emphasis on working software as a real measure of progress. [2]&lt;/p&gt;
&lt;h3 id="transition-step-low-drama"&gt;Transition step (low drama)&lt;/h3&gt;
&lt;p&gt;Replace &amp;ldquo;deck required for approval&amp;rdquo; with &amp;ldquo;evidence required for approval&amp;rdquo;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;link to the RFC&lt;/li&gt;
&lt;li&gt;link to a running demo / branch / sandbox&lt;/li&gt;
&lt;li&gt;explicit constraints + tradeoffs&lt;/li&gt;
&lt;li&gt;an exit criteria checklist for the slice&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="pattern-2-sharepoint-document-cemeteries"&gt;Pattern 2: SharePoint document cemeteries&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-1"&gt;What it looks like&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Architecture docs exist as Word/PDF files in SharePoint.&lt;/li&gt;
&lt;li&gt;Multiple versions exist (&amp;ldquo;Final_v7_REAL_FINAL.docx&amp;rdquo;).&lt;/li&gt;
&lt;li&gt;Search works poorly unless you already know what to search for.&lt;/li&gt;
&lt;li&gt;Nobody updates the doc because it&amp;rsquo;s painful and risky (&amp;ldquo;what if I change the blessed doc?&amp;rdquo;).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-it-exists-1"&gt;Why it exists&lt;/h3&gt;
&lt;p&gt;It&amp;rsquo;s an enterprise default:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;SharePoint is &amp;ldquo;official&amp;rdquo;&lt;/li&gt;
&lt;li&gt;Word docs feel formal&lt;/li&gt;
&lt;li&gt;it&amp;rsquo;s familiar to non-engineering stakeholders&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-hidden-tax-1"&gt;The hidden tax&lt;/h3&gt;
&lt;p&gt;SharePoint docs typically fail at the things engineering needs most:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;discoverability&lt;/strong&gt; (people don&amp;rsquo;t know where to look)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ownership&lt;/strong&gt; (no clear maintainer)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;reviewability&lt;/strong&gt; (diffs and PR discussion are weak)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;linking to reality&lt;/strong&gt; (code, configs, dashboards, runbooks)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;keeping current&lt;/strong&gt; (documentation drift becomes the norm)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So teams stop trusting docs and rely on tribal knowledge - until they page someone at 2 a.m.&lt;/p&gt;
&lt;h3 id="the-replacement-pattern-1"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Treat documentation as part of the codebase:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Markdown in the repo&lt;/li&gt;
&lt;li&gt;reviewed via PR like code&lt;/li&gt;
&lt;li&gt;versioned with implementation&lt;/li&gt;
&lt;li&gt;linked to:
&lt;ul&gt;
&lt;li&gt;APIs (OpenAPI specs)&lt;/li&gt;
&lt;li&gt;dashboards&lt;/li&gt;
&lt;li&gt;runbooks&lt;/li&gt;
&lt;li&gt;incident writeups&lt;/li&gt;
&lt;li&gt;ADRs&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Google&amp;rsquo;s documentation best practices make the point directly: a small set of fresh, accurate docs is better than a large pile in disrepair. [7]&lt;/p&gt;
&lt;h3 id="transition-step"&gt;Transition step&lt;/h3&gt;
&lt;p&gt;You don&amp;rsquo;t have to &amp;ldquo;migrate all docs.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Start with a triage:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Identify the top 10 documents people actually need.&lt;/li&gt;
&lt;li&gt;Recreate them as Markdown in a &lt;code&gt;docs/&lt;/code&gt; folder with an index.&lt;/li&gt;
&lt;li&gt;Leave the rest as archived references, not living truth.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="pattern-3-architecture-as-narrative-not-decisions"&gt;Pattern 3: Architecture as narrative, not decisions&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-2"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;The doc describes a target architecture but doesn&amp;rsquo;t answer:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;why this approach?&lt;/li&gt;
&lt;li&gt;what alternatives were considered?&lt;/li&gt;
&lt;li&gt;what tradeoffs were accepted?&lt;/li&gt;
&lt;li&gt;what constraints matter most?&lt;/li&gt;
&lt;li&gt;what did we decide not to do?&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-it-exists-2"&gt;Why it exists&lt;/h3&gt;
&lt;p&gt;Narratives are easier than decision logs. It&amp;rsquo;s simpler to write &amp;ldquo;the system will&amp;hellip;&amp;rdquo; than to record the messy reality of tradeoffs.&lt;/p&gt;
&lt;h3 id="the-hidden-tax-2"&gt;The hidden tax&lt;/h3&gt;
&lt;p&gt;When decisions aren&amp;rsquo;t recorded, teams re-litigate them repeatedly. The same arguments come back every quarter - often because new people joined and the reasoning isn&amp;rsquo;t captured.&lt;/p&gt;
&lt;h3 id="the-replacement-pattern-adrs"&gt;The replacement pattern: ADRs&lt;/h3&gt;
&lt;p&gt;Use &lt;strong&gt;Architecture Decision Records (ADRs)&lt;/strong&gt;: short, structured notes that capture an important decision with its context and consequences. [5] The practice is commonly attributed to Michael Nygard&amp;rsquo;s 2011 write-up. [6]&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ADRs are the opposite of a 40-slide deck:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;small&lt;/li&gt;
&lt;li&gt;specific&lt;/li&gt;
&lt;li&gt;diffable&lt;/li&gt;
&lt;li&gt;linkable to code changes&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="transition-step-1"&gt;Transition step&lt;/h3&gt;
&lt;p&gt;Start with one ADR per &amp;ldquo;architecturally significant decision&amp;rdquo;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;database choice&lt;/li&gt;
&lt;li&gt;messaging pattern&lt;/li&gt;
&lt;li&gt;tenancy model&lt;/li&gt;
&lt;li&gt;auth model&lt;/li&gt;
&lt;li&gt;deployment model&lt;/li&gt;
&lt;li&gt;data boundary decisions&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="pattern-4-design-phase-gating"&gt;Pattern 4: &amp;ldquo;Design phase&amp;rdquo; gating&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-3"&gt;What it looks like&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;We can&amp;rsquo;t start implementation until the analysis is complete.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;The analysis expands to include every possible future case.&lt;/li&gt;
&lt;li&gt;The design grows more &amp;ldquo;complete&amp;rdquo; and less true.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-it-exists-3"&gt;Why it exists&lt;/h3&gt;
&lt;p&gt;Enterprises are understandably afraid of failure.&lt;/p&gt;
&lt;h3 id="the-hidden-tax-3"&gt;The hidden tax&lt;/h3&gt;
&lt;p&gt;This approach doesn&amp;rsquo;t eliminate failure. It defers it - making it more expensive.&lt;/p&gt;
&lt;p&gt;Lean Startup describes progress as validated learning and emphasizes moving quickly through a build-measure-learn loop. [4] The point isn&amp;rsquo;t startups. The point is learning fast when you&amp;rsquo;re uncertain.&lt;/p&gt;
&lt;h3 id="the-replacement-pattern-2"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Timebox design, then validate with a thin slice:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;write the RFC-lite doc&lt;/li&gt;
&lt;li&gt;implement the smallest realistic end-to-end path&lt;/li&gt;
&lt;li&gt;measure the constraints&lt;/li&gt;
&lt;li&gt;then expand&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="transition-step-2"&gt;Transition step&lt;/h3&gt;
&lt;p&gt;Define &amp;ldquo;analysis exit criteria&amp;rdquo;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;measurable constraints validated (not theorized)&lt;/li&gt;
&lt;li&gt;spike code exists&lt;/li&gt;
&lt;li&gt;a plan for incremental rollout exists&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="pattern-5-documentation-that-never-gets-pruned"&gt;Pattern 5: Documentation that never gets pruned&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-4"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;Docs accumulate but aren&amp;rsquo;t maintained:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;outdated architecture diagrams&lt;/li&gt;
&lt;li&gt;old runbooks&lt;/li&gt;
&lt;li&gt;stale onboarding guides&lt;/li&gt;
&lt;li&gt;dead links&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-it-exists-4"&gt;Why it exists&lt;/h3&gt;
&lt;p&gt;Pruning isn&amp;rsquo;t rewarded. Writing new docs feels productive; deleting old docs feels risky.&lt;/p&gt;
&lt;h3 id="the-hidden-tax-4"&gt;The hidden tax&lt;/h3&gt;
&lt;p&gt;Stale docs are worse than no docs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;they mislead&lt;/li&gt;
&lt;li&gt;they increase cognitive load&lt;/li&gt;
&lt;li&gt;they create false confidence&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-replacement-pattern-3"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Adopt &amp;ldquo;minimum viable documentation&amp;rdquo; and prune regularly. [7]&lt;/p&gt;
&lt;p&gt;The rule I like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If a doc isn&amp;rsquo;t maintained, label it &lt;strong&gt;ARCHIVED&lt;/strong&gt; and explain why.&lt;/li&gt;
&lt;li&gt;If a doc is required, tie it to ownership and change workflow.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="transition-step-3"&gt;Transition step&lt;/h3&gt;
&lt;p&gt;Make docs part of PR hygiene:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;if a change affects behavior, the docs update ships in the same PR&lt;/li&gt;
&lt;li&gt;run link checks in CI&lt;/li&gt;
&lt;li&gt;keep an index page updated&lt;/li&gt;
&lt;/ul&gt;
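&lt;p&gt;A link check does not need heavy tooling. As a minimal sketch, assuming the docs live under &lt;code&gt;docs/&lt;/code&gt; as Markdown and that only relative file links are checked (the function names &lt;code&gt;relativeTargets&lt;/code&gt; and &lt;code&gt;brokenLinks&lt;/code&gt; are illustrative, not from any particular tool), a small Go program can fail the build when a link points at a missing file:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"regexp"
	"strings"
)

// linkPattern matches inline Markdown links like [text](target).
var linkPattern = regexp.MustCompile(`\[[^\]]*\]\(([^)\s]+)\)`)

// relativeTargets returns link targets that point at local files,
// skipping absolute URLs, mailto links, and in-page anchors.
func relativeTargets(markdown string) []string {
	var targets []string
	for _, m := range linkPattern.FindAllStringSubmatch(markdown, -1) {
		target := m[1]
		if strings.Contains(target, "://") || strings.HasPrefix(target, "mailto:") || strings.HasPrefix(target, "#") {
			continue
		}
		// Drop a trailing "#section" fragment before checking the file.
		if i := strings.Index(target, "#"); i >= 0 {
			target = target[:i]
		}
		targets = append(targets, target)
	}
	return targets
}

// brokenLinks walks root and reports "file -> target" for each
// relative link whose target does not exist on disk.
func brokenLinks(root string) ([]string, error) {
	var broken []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() || filepath.Ext(path) != ".md" {
			return nil
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		for _, target := range relativeTargets(string(data)) {
			resolved := filepath.Join(filepath.Dir(path), target)
			if _, statErr := os.Stat(resolved); statErr != nil {
				broken = append(broken, path+" -> "+target)
			}
		}
		return nil
	})
	return broken, err
}

func main() {
	root := "docs" // assumed location of the docs folder
	if _, err := os.Stat(root); err != nil {
		fmt.Println("no docs directory found; nothing to check")
		return
	}
	broken, err := brokenLinks(root)
	if err != nil {
		fmt.Println("link check failed:", err)
		os.Exit(2)
	}
	for _, b := range broken {
		fmt.Println("broken link:", b)
	}
	if len(broken) > 0 {
		os.Exit(1) // non-zero exit fails the CI job
	}
}
```

&lt;p&gt;This sketch deliberately ignores external URLs, which are slower and flakier to verify; a team that needs those checked would likely reach for a dedicated link checker instead.&lt;/p&gt;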
&lt;hr&gt;
&lt;h2 id="what-to-do-instead-a-documentation-system-that-ships"&gt;What to do instead: a documentation system that ships&lt;/h2&gt;
&lt;p&gt;Here&amp;rsquo;s a simple &amp;ldquo;docs system&amp;rdquo; that works in practice.&lt;/p&gt;
&lt;h3 id="a-repo-structure-that-scales"&gt;A repo structure that scales&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;/README.md # entry point: what this is + how to run it
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;/docs/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; index.md # &amp;#34;start here&amp;#34; documentation map
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; rfc/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 0001-tenancy-model.md
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 0002-storage-approach.md
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; adr/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 0001-use-postgres.md
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 0002-adopt-opentelemetry.md
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; architecture/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; context.md # C4-ish: context + boundaries
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; containers.md # top-level services
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; deployment.md # runtime &amp;amp; environments
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; runbooks/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; oncall.md
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; incident-response.md
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; api/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; openapi.yaml
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="replace-40-slides-with-two-artifacts"&gt;Replace 40 slides with two artifacts&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;RFC-lite (1-2 pages)&lt;/strong&gt;: the &amp;ldquo;what&amp;rdquo; and &amp;ldquo;why&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Thin slice demo&lt;/strong&gt;: the reality check&lt;/li&gt;
&lt;/ol&gt;
&lt;h4 id="rfc-lite-template-copypaste"&gt;RFC-lite template (copy/paste)&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-markdown" data-lang="markdown"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gh"&gt;# RFC: &amp;lt;title&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gh"&gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;## Problem
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;&lt;/span&gt;What are we trying to solve? Who is affected?
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;## Constraints
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;&lt;/span&gt;Latency, cost, compliance, tenancy, uptime, environments.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;## Proposal
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;&lt;/span&gt;What are we building? What does &amp;#34;done&amp;#34; mean?
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;## Alternatives considered
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;&lt;/span&gt;Option A / B / C with short tradeoffs.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;## Risks and mitigations
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;&lt;/span&gt;What could go wrong? How will we contain blast radius?
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;## Verification
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;&lt;/span&gt;How will we measure success in production?
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h4 id="adr-template-copypaste"&gt;ADR template (copy/paste)&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-markdown" data-lang="markdown"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gh"&gt;# ADR-XXXX: &amp;lt;decision&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gh"&gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;## Status
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;&lt;/span&gt;Proposed | Accepted | Deprecated
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;## Context
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;&lt;/span&gt;What drove this decision? What constraints matter?
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;## Decision
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;&lt;/span&gt;What did we decide?
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;## Consequences
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;&lt;/span&gt;What do we gain? What do we lose? What changes later?
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
&lt;h2 id="verification-how-you-know-its-working"&gt;Verification: how you know it&amp;rsquo;s working&lt;/h2&gt;
&lt;p&gt;If you replace decks and doc cemeteries with real engineering artifacts, you should see:&lt;/p&gt;
&lt;h3 id="delivery-metrics-improve"&gt;Delivery metrics improve&lt;/h3&gt;
&lt;p&gt;Track the same system-level outcomes DORA promotes: lead time, deploy frequency, change failure rate, and time to restore service. [3]&lt;/p&gt;
&lt;h3 id="fewer-handoffs-and-fewer-alignment-meetings"&gt;Fewer handoffs and fewer &amp;ldquo;alignment meetings&amp;rdquo;&lt;/h3&gt;
&lt;p&gt;If teams can self-serve context from living docs, coordination cost drops.&lt;/p&gt;
&lt;h3 id="faster-first-reality"&gt;Faster &amp;ldquo;first reality&amp;rdquo;&lt;/h3&gt;
&lt;p&gt;A simple heuristic:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How long from idea -&amp;gt; first runnable thin slice?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If that number is months, the system is optimized for analysis, not learning.&lt;/p&gt;
&lt;h3 id="docs-stay-alive"&gt;Docs stay alive&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;docs updated alongside code&lt;/li&gt;
&lt;li&gt;fewer stale &amp;ldquo;final_v7&amp;rdquo; files&lt;/li&gt;
&lt;li&gt;fewer tribal-knowledge escalations&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="a-practical-checklist"&gt;A practical checklist&lt;/h2&gt;
&lt;p&gt;If you want to kill deck-driven delivery without starting a culture war:&lt;/p&gt;
&lt;h3 id="stop-treating-decks-as-deliverables"&gt;Stop treating decks as deliverables&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Architecture reviews require an RFC + a runnable slice.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Decks are optional; evidence is not.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="fix-document-discoverability"&gt;Fix document discoverability&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; One &lt;code&gt;docs/index.md&lt;/code&gt; that links to the docs that matter.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Make the repo the source of truth for technical docs.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="capture-decisions-not-fantasies"&gt;Capture decisions, not fantasies&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Add ADRs for major decisions and link them to PRs. [5][6]&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="timebox-analysis"&gt;Timebox analysis&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Set analysis exit criteria.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Optimize for early learning and quick failure when uncertainty is high. [4]&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="keep-docs-small-and-alive"&gt;Keep docs small and alive&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Prune regularly; archive what&amp;rsquo;s stale.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Run link checks in CI.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Treat docs like bonsai: maintained and trimmed, not accumulated. [7]&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;p&gt;[1] Manifesto for Agile Software Development (values; &amp;ldquo;Working software over comprehensive documentation&amp;rdquo;). &lt;a href="https://agilemanifesto.org/" target="_blank" rel="noopener noreferrer"&gt;https://agilemanifesto.org/&lt;/a&gt;
[2] Principles behind the Agile Manifesto (&amp;ldquo;Working software is the primary measure of progress&amp;rdquo;). &lt;a href="https://agilemanifesto.org/principles.html" target="_blank" rel="noopener noreferrer"&gt;https://agilemanifesto.org/principles.html&lt;/a&gt;
[3] DORA - &amp;ldquo;DORA&amp;rsquo;s software delivery performance metrics (guide)&amp;rdquo;. &lt;a href="https://dora.dev/guides/dora-metrics/" target="_blank" rel="noopener noreferrer"&gt;https://dora.dev/guides/dora-metrics/&lt;/a&gt;
[4] Lean Startup principles (Build-Measure-Learn; learning quickly; failing fast/cheaply as a concept). &lt;a href="https://theleanstartup.com/principles" target="_blank" rel="noopener noreferrer"&gt;https://theleanstartup.com/principles&lt;/a&gt;
[5] ADR - Architectural Decision Records (what ADRs are). &lt;a href="https://adr.github.io/" target="_blank" rel="noopener noreferrer"&gt;https://adr.github.io/&lt;/a&gt;
[6] Michael Nygard - &amp;ldquo;Documenting Architecture Decisions&amp;rdquo; (2011; ADR practice origin/popularization). &lt;a href="https://www.cognitect.com/blog/2011/11/15/documenting-architecture-decisions" target="_blank" rel="noopener noreferrer"&gt;https://www.cognitect.com/blog/2011/11/15/documenting-architecture-decisions&lt;/a&gt;
[7] Google Documentation Guide - Best practices (&amp;ldquo;Minimum Viable Documentation&amp;rdquo;; keep docs short, fresh, and pruned). &lt;a href="https://google.github.io/styleguide/docguide/best_practices.html" target="_blank" rel="noopener noreferrer"&gt;https://google.github.io/styleguide/docguide/best_practices.html&lt;/a&gt;
&lt;/p&gt;</content:encoded></item><item><title>Durable Agents with Temporal: Retries, Idempotency, and Long-Running State</title><link>https://roygabriel.dev/blog/durable-agents-with-temporal/</link><pubDate>Sat, 06 Dec 2025 12:00:00 -0500</pubDate><guid>https://roygabriel.dev/blog/durable-agents-with-temporal/</guid><description>&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;Agents are often framed as &amp;ldquo;reason + tools.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;In production, the actual problem is &lt;strong&gt;execution&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;calls fail&lt;/li&gt;
&lt;li&gt;networks flake&lt;/li&gt;
&lt;li&gt;credentials expire&lt;/li&gt;
&lt;li&gt;humans need to approve steps&lt;/li&gt;
&lt;li&gt;tasks take hours/days&lt;/li&gt;
&lt;li&gt;systems restart&lt;/li&gt;
&lt;li&gt;you need a forensic trail of what happened&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If your agent runtime is &amp;ldquo;one process with a loop,&amp;rdquo; you will eventually lose state and repeat a side effect that already happened.&lt;/p&gt;
&lt;p&gt;This is why workflow engines exist.&lt;/p&gt;</description><content:encoded>&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;Agents are often framed as &amp;ldquo;reason + tools.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;In production, the actual problem is &lt;strong&gt;execution&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;calls fail&lt;/li&gt;
&lt;li&gt;networks flake&lt;/li&gt;
&lt;li&gt;credentials expire&lt;/li&gt;
&lt;li&gt;humans need to approve steps&lt;/li&gt;
&lt;li&gt;tasks take hours/days&lt;/li&gt;
&lt;li&gt;systems restart&lt;/li&gt;
&lt;li&gt;you need a forensic trail of what happened&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If your agent runtime is &amp;ldquo;one process with a loop,&amp;rdquo; you will eventually lose state and repeat a side effect that already happened.&lt;/p&gt;
&lt;p&gt;This is why workflow engines exist.&lt;/p&gt;
&lt;p&gt;Temporal&amp;rsquo;s model - durable workflows with deterministic execution and event history - maps incredibly well to tool-using agents. Temporal explicitly requires workflow code to be deterministic and provides APIs for versioning long-running workflows. [1][2]&lt;/p&gt;
&lt;p&gt;This article is a production pattern: &lt;strong&gt;use Temporal to make agents durable.&lt;/strong&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="tldr"&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Represent an agent run as a &lt;strong&gt;Temporal Workflow&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Make tool calls &lt;strong&gt;Activities&lt;/strong&gt; (retryable, timeout-bounded).&lt;/li&gt;
&lt;li&gt;Put side-effecting tools behind:
&lt;ul&gt;
&lt;li&gt;idempotency keys&lt;/li&gt;
&lt;li&gt;preview -&amp;gt; apply&lt;/li&gt;
&lt;li&gt;durable &amp;ldquo;exactly-once&amp;rdquo; semantics (from the workflow&amp;rsquo;s perspective)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Use Temporal&amp;rsquo;s retry policies for Activities and explicit failure handling. [3]&lt;/li&gt;
&lt;li&gt;Use event history and replay for forensics (Temporal events are first-class). [4]&lt;/li&gt;
&lt;li&gt;Use workflow versioning for safe evolution of long-running agents. [2]&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="contents"&gt;Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#why-agents-need-durable-execution"&gt;Why agents need durable execution&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#mapping-an-agent-to-temporal"&gt;Mapping an agent to Temporal&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#determinism-and-why-it-matters"&gt;Determinism and why it matters&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#retries-timeouts-and-idempotency"&gt;Retries, timeouts, and idempotency&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#human-in-the-loop-as-a-first-class-step"&gt;Human-in-the-loop as a first-class step&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#replay-audit-and-debugging"&gt;Replay, audit, and debugging&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#versioning-evolving-agents-safely"&gt;Versioning: evolving agents safely&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-production-checklist"&gt;A production checklist&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#references"&gt;References&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="why-agents-need-durable-execution"&gt;Why agents need durable execution&lt;/h2&gt;
&lt;p&gt;A few failure modes you&amp;rsquo;ll recognize:&lt;/p&gt;
&lt;h3 id="partial-side-effects"&gt;Partial side effects&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;agent creates a ticket&lt;/li&gt;
&lt;li&gt;process dies before storing the ticket ID&lt;/li&gt;
&lt;li&gt;agent retries and creates a duplicate&lt;/li&gt;
&lt;/ul&gt;
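&lt;p&gt;The usual guard against this duplicate is an idempotency key. The sketch below is a simplified illustration, not a real ticketing API: &lt;code&gt;TicketService&lt;/code&gt; and &lt;code&gt;CreateTicket&lt;/code&gt; are hypothetical names, and an in-memory map stands in for whatever durable store the downstream system would actually use.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

// TicketService is a hypothetical downstream system. It remembers which
// idempotency keys it has already served, so a retried request returns
// the original ticket instead of creating a second one.
type TicketService struct {
	mu      sync.Mutex
	byKey   map[string]string // idempotency key -> ticket ID
	created int               // number of tickets actually created
}

// NewTicketService returns a service with an empty key table.
func NewTicketService() *TicketService {
	s := new(TicketService)
	s.byKey = make(map[string]string)
	return s
}

// CreateTicket creates a ticket at most once per idempotency key.
// The title is accepted for realism but not stored in this sketch.
func (s *TicketService) CreateTicket(key, title string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if id, ok := s.byKey[key]; ok {
		return id // retry of a request we already handled
	}
	s.created++
	id := fmt.Sprintf("TICKET-%d", s.created)
	s.byKey[key] = id
	return id
}

func main() {
	svc := NewTicketService()

	// The key is derived from stable inputs (run ID + step name),
	// so a retry after a crash reuses the same key.
	key := "run-42/create-ticket"

	first := svc.CreateTicket(key, "Rotate expired credentials")
	// Simulate the agent dying before it stored the ID, then retrying.
	second := svc.CreateTicket(key, "Rotate expired credentials")

	fmt.Println(first, second, svc.created)
}
```

&lt;p&gt;The retry gets back the same ticket ID and only one ticket is created. The important design choice is that the key comes from the run and the step, not from a timestamp or a random value generated at call time.&lt;/p&gt;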
&lt;h3 id="long-running-waits"&gt;Long-running waits&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;wait for PR approvals&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;wait for a CI pipeline&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;wait for a meeting to complete&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If your agent can&amp;rsquo;t wait durably, it becomes a polling daemon.&lt;/p&gt;
&lt;h3 id="human-approval"&gt;Human approval&lt;/h3&gt;
&lt;p&gt;Some steps should not run without a human sign-off:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;apply to prod&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;send email&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;delete resources&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You need durable pause/resume with clean audit.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="mapping-an-agent-to-temporal"&gt;Mapping an agent to Temporal&lt;/h2&gt;
&lt;h3 id="workflow--agent-run"&gt;Workflow = agent run&lt;/h3&gt;
&lt;p&gt;One agent run becomes a single Temporal Workflow Execution. Temporal workflows are designed for long-running, durable coordination. [5]&lt;/p&gt;
&lt;p&gt;Inside the workflow you model steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;interpret goal&lt;/li&gt;
&lt;li&gt;choose tools&lt;/li&gt;
&lt;li&gt;call tools&lt;/li&gt;
&lt;li&gt;react to results&lt;/li&gt;
&lt;li&gt;request approvals&lt;/li&gt;
&lt;li&gt;finalize output&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="activities--tool-calls-and-external-io"&gt;Activities = tool calls and external IO&lt;/h3&gt;
&lt;p&gt;All external calls should be Activities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;MCP tool calls&lt;/li&gt;
&lt;li&gt;HTTP calls&lt;/li&gt;
&lt;li&gt;DB writes&lt;/li&gt;
&lt;li&gt;notifications&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Why? Activities are where retries and timeouts belong. Temporal defines retry policies as configuration for how and when to retry failures. [3]&lt;/p&gt;
&lt;h3 id="signals--external-events"&gt;Signals = external events&lt;/h3&gt;
&lt;p&gt;Use signals for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;human approvals&lt;/li&gt;
&lt;li&gt;&amp;ldquo;cancel&amp;rdquo;&lt;/li&gt;
&lt;li&gt;updated user intent&lt;/li&gt;
&lt;li&gt;out-of-band events (&amp;ldquo;incident resolved&amp;rdquo;)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="queries--introspection"&gt;Queries = introspection&lt;/h3&gt;
&lt;p&gt;Expose workflow state:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;current step&lt;/li&gt;
&lt;li&gt;last tool call&lt;/li&gt;
&lt;li&gt;pending approvals&lt;/li&gt;
&lt;li&gt;budget remaining&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="determinism-and-why-it-matters"&gt;Determinism and why it matters&lt;/h2&gt;
&lt;p&gt;Temporal requires workflow code to be deterministic. [1] Determinism is what allows Temporal to replay history and rebuild state after worker crashes.&lt;/p&gt;
&lt;p&gt;Practical consequence:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Don&amp;rsquo;t do IO in workflow code.&lt;/li&gt;
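&lt;li&gt;Don&amp;rsquo;t read the current time directly in workflow code (use Temporal APIs, such as &lt;code&gt;workflow.Now&lt;/code&gt; in the Go SDK).&lt;/li&gt;
&lt;li&gt;Don&amp;rsquo;t call random generators without deterministic control (for example, record the value once via &lt;code&gt;workflow.SideEffect&lt;/code&gt; in the Go SDK).&lt;/li&gt;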
&lt;li&gt;Keep workflow logic as &amp;ldquo;orchestration,&amp;rdquo; not execution.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you violate determinism, replay can fail with non-determinism errors, because the code no longer matches the recorded history. Temporal&amp;rsquo;s docs and community discussions emphasize this constraint and the need for careful changes. [1][2]&lt;/p&gt;
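&lt;p&gt;To see why replay needs determinism, here is a toy sketch in plain Python. It is not Temporal&amp;rsquo;s API: &lt;code&gt;ReplayableRun&lt;/code&gt; and &lt;code&gt;agent_steps&lt;/code&gt; are hypothetical names, and the &amp;ldquo;history&amp;rdquo; is just a list. The idea it illustrates is that non-deterministic values are recorded once and read back on replay.&lt;/p&gt;

```python
import random


class ReplayableRun:
    """Toy event-sourced runner: side effects are recorded once, then replayed."""

    def __init__(self, history=None):
        self.history = list(history or [])  # recorded results, in call order
        self.cursor = 0                     # next history slot to consume

    def side_effect(self, fn):
        # On replay, return the recorded value instead of calling fn again.
        if len(self.history) > self.cursor:
            value = self.history[self.cursor]
        else:
            value = fn()
            self.history.append(value)
        self.cursor += 1
        return value


def agent_steps(run):
    # Orchestration only: every non-deterministic value goes through side_effect.
    token = run.side_effect(lambda: random.randint(0, 10**9))
    plan = run.side_effect(lambda: "create_ticket")
    return (token, plan)


first = ReplayableRun()
original = agent_steps(first)

# Simulate a worker crash: rebuild state purely from the recorded history.
replayed = agent_steps(ReplayableRun(first.history))
print(original == replayed)
```

&lt;p&gt;If &lt;code&gt;agent_steps&lt;/code&gt; called &lt;code&gt;random.randint&lt;/code&gt; directly, the replayed run would compute a different token and diverge from what was recorded.&lt;/p&gt;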
&lt;hr&gt;
&lt;h2 id="retries-timeouts-and-idempotency"&gt;Retries, timeouts, and idempotency&lt;/h2&gt;
&lt;h3 id="retry-policies-activities"&gt;Retry policies (Activities)&lt;/h3&gt;
&lt;p&gt;Temporal retry policies control backoff and retry behavior for activity failures. [3]&lt;/p&gt;
&lt;p&gt;Use them intentionally:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;retries for transient failures (rate limits, timeouts)&lt;/li&gt;
&lt;li&gt;limited retries for &amp;ldquo;probably broken&amp;rdquo; failures&lt;/li&gt;
&lt;li&gt;exponential backoff with jitter (avoid thundering herd)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="timeouts-are-not-optional"&gt;Timeouts are not optional&lt;/h3&gt;
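&lt;p&gt;Set explicit timeouts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ScheduleToStart: how long a task may wait in the queue before a worker picks it up&lt;/li&gt;
&lt;li&gt;StartToClose: how long a single attempt may run&lt;/li&gt;
&lt;li&gt;ScheduleToClose: how long the Activity may take overall, retries included&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Without a bounded overall timeout or a retry limit, a failing Activity can retry &amp;ldquo;forever&amp;rdquo; in practice.&lt;/p&gt;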
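&lt;p&gt;A minimal sketch of the retry-plus-deadline idea, in plain Python rather than the Temporal SDK. The function name and parameters (&lt;code&gt;call_with_retries&lt;/code&gt;, &lt;code&gt;schedule_to_close&lt;/code&gt;, and so on) are illustrative assumptions, and jitter is left out to keep it short.&lt;/p&gt;

```python
import time


def call_with_retries(fn, max_attempts=5, initial_interval=0.01,
                      backoff=2.0, schedule_to_close=1.0, sleep=time.sleep):
    """Retry fn with exponential backoff, bounded by attempts and a total deadline."""
    deadline = time.monotonic() + schedule_to_close
    interval = initial_interval
    attempt = 0
    while True:
        attempt += 1
        try:
            return fn(), attempt
        except Exception:
            out_of_attempts = attempt >= max_attempts
            out_of_time = time.monotonic() + interval > deadline
            if out_of_attempts or out_of_time:
                raise  # give up: surface the last failure to the caller
            sleep(interval)
            interval *= backoff


calls = {"n": 0}


def flaky():
    # Hypothetical tool call that fails twice, then succeeds.
    calls["n"] += 1
    if 3 > calls["n"]:
        raise TimeoutError("transient")
    return "ok"


result, attempts = call_with_retries(flaky, sleep=lambda seconds: None)
print(result, attempts)
```

&lt;p&gt;Both bounds matter: the attempt limit stops &amp;ldquo;probably broken&amp;rdquo; calls early, and the deadline caps total time even when each attempt is slow.&lt;/p&gt;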
&lt;h3 id="idempotency-keys-for-side-effects"&gt;Idempotency keys for side effects&lt;/h3&gt;
&lt;p&gt;Your workflow can be retried/replayed. Your Activity can be retried. Upstream systems can time out after performing the operation.&lt;/p&gt;
&lt;p&gt;For side-effecting tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;generate an idempotency key in the workflow, deterministically (for example, from the workflow ID and step number)&lt;/li&gt;
&lt;li&gt;pass it into the tool Activity&lt;/li&gt;
&lt;li&gt;store &amp;ldquo;operation result&amp;rdquo; in workflow state&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When the Activity retries, it reuses the key so the upstream system deduplicates.&lt;/p&gt;
&lt;p&gt;This is the difference between &amp;ldquo;retries&amp;rdquo; and &amp;ldquo;duplicates.&amp;rdquo;&lt;/p&gt;
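&lt;p&gt;Here is a small sketch of that flow, assuming an upstream that deduplicates on a caller-supplied key. &lt;code&gt;TicketSystem&lt;/code&gt; and &lt;code&gt;idempotency_key&lt;/code&gt; are hypothetical stand-ins, not a real API.&lt;/p&gt;

```python
import hashlib


def idempotency_key(workflow_id, step):
    # Deterministic: the same run and step always produce the same key.
    return hashlib.sha256(f"{workflow_id}:{step}".encode()).hexdigest()[:16]


class TicketSystem:
    """Hypothetical upstream that deduplicates on the idempotency key."""

    def __init__(self):
        self.by_key = {}

    def create_ticket(self, key, title):
        if key not in self.by_key:
            self.by_key[key] = {"id": len(self.by_key) + 1, "title": title}
        return self.by_key[key]


upstream = TicketSystem()
key = idempotency_key("agent-run-42", step=3)

first = upstream.create_ticket(key, "Rotate credentials")
# The worker dies before storing the ID; the retry reuses the same key.
retry = upstream.create_ticket(key, "Rotate credentials")

print(first["id"] == retry["id"], len(upstream.by_key))
```

&lt;p&gt;The key is derived from the run and the step, not generated randomly at call time, so a replayed workflow produces the same key again.&lt;/p&gt;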
&lt;hr&gt;
&lt;h2 id="human-in-the-loop-as-a-first-class-step"&gt;Human-in-the-loop as a first-class step&lt;/h2&gt;
&lt;p&gt;For dangerous operations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;pause&lt;/li&gt;
&lt;li&gt;ask for approval with the plan summary&lt;/li&gt;
&lt;li&gt;resume when approved&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A Temporal workflow that is waiting on a signal does not hold a thread or process the way a traditional blocked program would. Its state lives in the event history, so the wait survives worker restarts.&lt;/p&gt;
&lt;p&gt;This is one of the cleanest ways to build a &amp;ldquo;preview -&amp;gt; approve -&amp;gt; apply&amp;rdquo; flow without building a bunch of custom state machinery.&lt;/p&gt;
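&lt;p&gt;The shape of a pause-and-resume step can be sketched with a Python generator. This is only an in-memory illustration with hypothetical names; unlike Temporal, nothing here is persisted, so it would not survive a process restart.&lt;/p&gt;

```python
def apply_change_workflow(plan):
    """Generator-based sketch: yield to pause, resume when a decision is sent in."""
    audit = [("planned", plan)]
    # Pause here until someone sends a decision; no thread blocks while waiting.
    decision = yield {"status": "awaiting_approval", "plan": plan}
    audit.append(("decision", decision))
    if decision == "approve":
        audit.append(("applied", plan))
        status = "applied"
    else:
        status = "rejected"
    yield {"status": status, "audit": audit}


run = apply_change_workflow("scale web to 5 replicas")
pending = next(run)             # runs until the approval gate
final = run.send("approve")     # the "signal" resumes the workflow
print(pending["status"], final["status"])
```

&lt;p&gt;The audit list is the part worth copying: the plan summary, the decision, and the outcome are recorded in order, which is what a reviewer needs later.&lt;/p&gt;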
&lt;hr&gt;
&lt;h2 id="replay-audit-and-debugging"&gt;Replay, audit, and debugging&lt;/h2&gt;
&lt;p&gt;Temporal records each step of a workflow execution (activities scheduled and completed, signals received, timers fired) as events in the workflow&amp;rsquo;s event history. [4]&lt;/p&gt;
&lt;p&gt;This yields production superpowers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;reconstruct exactly what happened&lt;/li&gt;
&lt;li&gt;understand why a step was taken&lt;/li&gt;
&lt;li&gt;replay a run to test a bug fix&lt;/li&gt;
&lt;li&gt;implement &amp;ldquo;reset&amp;rdquo; patterns (carefully)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For agents, this is the difference between &amp;ldquo;the model did something weird&amp;rdquo; and &amp;ldquo;step 7 called tool X with args Y after tool Z returned response R.&amp;rdquo;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="versioning-evolving-agents-safely"&gt;Versioning: evolving agents safely&lt;/h2&gt;
&lt;p&gt;Agent logic will change. Prompts will change. Tool contracts will change.&lt;/p&gt;
&lt;p&gt;If you have long-running agents, you need a strategy that doesn&amp;rsquo;t break in-flight executions.&lt;/p&gt;
&lt;p&gt;Temporal provides workflow versioning mechanisms because determinism means you can&amp;rsquo;t simply change workflow logic without thought. [2]&lt;/p&gt;
&lt;p&gt;Production approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;keep existing executions on old code paths&lt;/li&gt;
&lt;li&gt;route new executions to new paths&lt;/li&gt;
&lt;li&gt;migrate intentionally&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This prevents &amp;ldquo;deploy broke every running workflow.&amp;rdquo;&lt;/p&gt;
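&lt;p&gt;A rough sketch of the version-gate idea follows. It is loosely modeled on the pattern described in the Go SDK versioning docs [2], but the function names, the dict-based history, and the step contents are all assumptions made for illustration.&lt;/p&gt;

```python
def get_version(history, change_id, max_supported):
    """Return the version recorded for change_id, recording max_supported on first use."""
    if change_id not in history:
        history[change_id] = max_supported   # new executions take the newest path
    return history[change_id]


def run_step(history, deployed_version):
    version = get_version(history, "approval-step", deployed_version)
    if version == 1:
        return "email approver"              # old path, kept for in-flight runs
    return "post approval request to chat"   # new path


in_flight = {"approval-step": 1}   # started before the deploy
fresh = {}                         # started after the deploy

print(run_step(in_flight, deployed_version=2))
print(run_step(fresh, deployed_version=2))
```

&lt;p&gt;The old branch can be deleted only once no execution with the old marker is still running, which is the &amp;ldquo;migrate intentionally&amp;rdquo; step.&lt;/p&gt;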
&lt;hr&gt;
&lt;h2 id="a-production-checklist"&gt;A production checklist&lt;/h2&gt;
&lt;h3 id="architecture"&gt;Architecture&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Agent runs modeled as workflows; tool calls as activities.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; External events modeled as signals; state exposed via queries.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="determinism"&gt;Determinism&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; No IO in workflow code (only orchestration).&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Workflow changes use versioning strategy. [2]&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="reliability"&gt;Reliability&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Retry policies defined for Activities. [3]&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Timeouts defined and bounded.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Idempotency keys used for side-effecting actions.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="governance"&gt;Governance&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Human approval gates exist for dangerous operations.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Audit trails include plan summaries and results.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="operability"&gt;Operability&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Event history used for debugging and incident analysis. [4]&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;p&gt;[1] Temporal - Workflow Definition (determinism requirement): &lt;a href="https://docs.temporal.io/workflow-definition" target="_blank" rel="noopener noreferrer"&gt;https://docs.temporal.io/workflow-definition&lt;/a&gt;&lt;br&gt;
[2] Temporal Go SDK - Versioning (evolving deterministic workflows safely): &lt;a href="https://docs.temporal.io/develop/go/versioning" target="_blank" rel="noopener noreferrer"&gt;https://docs.temporal.io/develop/go/versioning&lt;/a&gt;&lt;br&gt;
[3] Temporal - Retry Policies (how and when retries happen): &lt;a href="https://docs.temporal.io/encyclopedia/retry-policies" target="_blank" rel="noopener noreferrer"&gt;https://docs.temporal.io/encyclopedia/retry-policies&lt;/a&gt;&lt;br&gt;
[4] Temporal - Events reference (event history): &lt;a href="https://docs.temporal.io/references/events" target="_blank" rel="noopener noreferrer"&gt;https://docs.temporal.io/references/events&lt;/a&gt;&lt;br&gt;
[5] Temporal - Workflows overview: &lt;a href="https://docs.temporal.io/workflows" target="_blank" rel="noopener noreferrer"&gt;https://docs.temporal.io/workflows&lt;/a&gt;
&lt;/p&gt;</content:encoded></item><item><title>Tool Discovery at Scale: Solving the Million Tool Problem</title><link>https://roygabriel.dev/blog/million-tool-problem/</link><pubDate>Sat, 15 Nov 2025 12:00:00 -0500</pubDate><guid>https://roygabriel.dev/blog/million-tool-problem/</guid><description>&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;Tool-using agents are powerful &lt;em&gt;because&lt;/em&gt; they can do real work: read systems, change systems, orchestrate workflows.&lt;/p&gt;
&lt;p&gt;The trap is what I call the &lt;strong&gt;Million Tool Problem&lt;/strong&gt;:&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;The moment you have &amp;ldquo;enough tools,&amp;rdquo; tool selection becomes harder than tool execution.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;At small scale, you can stuff tool schemas into the prompt and hope the model chooses correctly. At scale, that approach breaks:&lt;/p&gt;</description><content:encoded>&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;Tool-using agents are powerful &lt;em&gt;because&lt;/em&gt; they can do real work: read systems, change systems, orchestrate workflows.&lt;/p&gt;
&lt;p&gt;The trap is what I call the &lt;strong&gt;Million Tool Problem&lt;/strong&gt;:&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;The moment you have &amp;ldquo;enough tools,&amp;rdquo; tool selection becomes harder than tool execution.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;At small scale, you can stuff tool schemas into the prompt and hope the model chooses correctly. At scale, that approach breaks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;token budgets explode&lt;/li&gt;
&lt;li&gt;accuracy drops (models confuse similar tools)&lt;/li&gt;
&lt;li&gt;latency rises (bigger prompts, more reasoning)&lt;/li&gt;
&lt;li&gt;safety degrades (wrong tool, wrong args, wrong side effects)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This isn&amp;rsquo;t hypothetical. Tool-use research exists because selection is hard. Benchmarks like ToolBench and AgentBench exist specifically to evaluate this capability in interactive settings. [3][6]&lt;/p&gt;
&lt;p&gt;This post is a production-first design for tool discovery that stays:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;fast&lt;/strong&gt; (low latency, bounded prompt size)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;safe&lt;/strong&gt; (tool contracts and policy gates)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;debuggable&lt;/strong&gt; (you can explain why a tool was chosen)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;maintainable&lt;/strong&gt; (tool catalogs evolve constantly)&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="tldr"&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Tool discovery is an &lt;strong&gt;IR problem + a policy problem&lt;/strong&gt;, not a prompt trick.&lt;/li&gt;
&lt;li&gt;Use a &lt;strong&gt;3-stage selector&lt;/strong&gt;:
&lt;ol&gt;
&lt;li&gt;coarse filter (tags / domain / allowlist)&lt;/li&gt;
&lt;li&gt;retrieval (BM25 + embeddings)&lt;/li&gt;
&lt;li&gt;rerank (LLM or learned ranker)&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;Treat tool descriptions as a product:
&lt;ul&gt;
&lt;li&gt;consistent naming&lt;/li&gt;
&lt;li&gt;sharp &amp;ldquo;when to use&amp;rdquo; / &amp;ldquo;when not to use&amp;rdquo;&lt;/li&gt;
&lt;li&gt;examples of correct arguments&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Add &lt;strong&gt;tool quality scoring&lt;/strong&gt; (latency, error rate, drift, safety incidents).&lt;/li&gt;
&lt;li&gt;Build a tight evaluation harness (ToolBench/StableToolBench ideas apply). [3][4]&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="contents"&gt;Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#why-include-all-tools-fails"&gt;Why &amp;ldquo;include all tools&amp;rdquo; fails&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-3-stage-tool-selector"&gt;The 3-stage tool selector&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#tool-metadata-that-makes-models-smarter"&gt;Tool metadata that makes models smarter&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#ranking-bm25--embeddings--rerank"&gt;Ranking: BM25 + embeddings + rerank&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#safety-allowlists-danger-gates-and-budgets"&gt;Safety: allowlists, &amp;ldquo;danger gates,&amp;rdquo; and budgets&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#quality-scoring-and-tool-quarantine"&gt;Quality scoring and tool quarantine&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#debuggability-explainable-tool-selection"&gt;Debuggability: explainable tool selection&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-minimal-reference-architecture"&gt;A minimal reference architecture&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-production-checklist"&gt;A production checklist&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#references"&gt;References&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="why-include-all-tools-fails"&gt;Why &amp;ldquo;include all tools&amp;rdquo; fails&lt;/h2&gt;
&lt;h3 id="token-and-latency-pressure"&gt;Token and latency pressure&lt;/h3&gt;
&lt;p&gt;Even if your tool schemas are &amp;ldquo;small,&amp;rdquo; they add up. Once you cross a few dozen tools, you spend more tokens describing tools than describing the task.&lt;/p&gt;
&lt;h3 id="confusability"&gt;Confusability&lt;/h3&gt;
&lt;p&gt;Tools with similar names or overlapping domains cause selection errors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;search_events&lt;/code&gt; vs &lt;code&gt;list_events&lt;/code&gt; vs &lt;code&gt;get_event&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;create_task&lt;/code&gt; vs &lt;code&gt;create_issue&lt;/code&gt; vs &lt;code&gt;create_ticket&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-long-tail-problem"&gt;The long tail problem&lt;/h3&gt;
&lt;p&gt;Most catalogs have a long tail:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;10 tools get used daily&lt;/li&gt;
&lt;li&gt;100 tools get used weekly&lt;/li&gt;
&lt;li&gt;1,000 tools are niche, but critical when needed&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is exactly the kind of situation information retrieval was invented for.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="the-3-stage-tool-selector"&gt;The 3-stage tool selector&lt;/h2&gt;
&lt;p&gt;Think like a search engine: enforce policy first (Stage 0), then narrow the candidates in three ranking stages.&lt;/p&gt;
&lt;h3 id="stage-0-policy-filter-mandatory"&gt;Stage 0: Policy filter (mandatory)&lt;/h3&gt;
&lt;p&gt;Before ranking, enforce policy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;which tools is this client allowed to call?&lt;/li&gt;
&lt;li&gt;which tools are enabled for this tenant/environment?&lt;/li&gt;
&lt;li&gt;which tools are safe for this context (read-only mode, incident mode, etc.)?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;MCP makes tool discovery explicit: a client lists the available tools and their input schemas (&lt;code&gt;tools/list&lt;/code&gt;). That listing is an interface you can mediate with policy. [1]&lt;/p&gt;
&lt;h3 id="stage-1-coarse-routing-cheap"&gt;Stage 1: Coarse routing (cheap)&lt;/h3&gt;
&lt;p&gt;Route into the right &amp;ldquo;tool neighborhood&amp;rdquo; using:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;tags (&lt;code&gt;kubernetes&lt;/code&gt;, &lt;code&gt;calendar&lt;/code&gt;, &lt;code&gt;email&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;domains (&amp;ldquo;devops&amp;rdquo;, &amp;ldquo;productivity&amp;rdquo;, &amp;ldquo;security&amp;rdquo;)&lt;/li&gt;
&lt;li&gt;environment (&amp;ldquo;prod&amp;rdquo; vs &amp;ldquo;dev&amp;rdquo;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Goal: reduce the candidate set from 10,000 -&amp;gt; 300.&lt;/p&gt;
&lt;h3 id="stage-2-retrieval-bm25--embeddings"&gt;Stage 2: Retrieval (BM25 + embeddings)&lt;/h3&gt;
&lt;p&gt;Run a hybrid search over:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;tool name&lt;/li&gt;
&lt;li&gt;tool description&lt;/li&gt;
&lt;li&gt;parameter names&lt;/li&gt;
&lt;li&gt;example calls&lt;/li&gt;
&lt;li&gt;&amp;ldquo;when not to use&amp;rdquo; hints&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Hybrid search is pragmatic:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;lexical retrieval (BM25-style) is great for exact matches and acronyms [9]&lt;/li&gt;
&lt;li&gt;embeddings are great for semantic similarity [7]&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Goal: 300 -&amp;gt; 30.&lt;/p&gt;
&lt;h3 id="stage-3-rerank-expensive-accurate"&gt;Stage 3: Rerank (expensive, accurate)&lt;/h3&gt;
&lt;p&gt;Rerank the top-K tools using:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;an LLM judge (cheap if K is small)&lt;/li&gt;
&lt;li&gt;or a learned ranker&lt;/li&gt;
&lt;li&gt;or deterministic rules + a smaller LLM tie-breaker&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Goal: 30 -&amp;gt; 5.&lt;/p&gt;
&lt;p&gt;Then the agent sees a small, high-quality tool set.&lt;/p&gt;
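&lt;p&gt;The stages above can be sketched as one function. This is a skeleton under stated assumptions: the dict fields (&lt;code&gt;tags&lt;/code&gt;, &lt;code&gt;writes&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt;) are hypothetical, and word overlap stands in for real hybrid retrieval and reranking.&lt;/p&gt;

```python
def select_tools(query, tools, allowed, tags, k_retrieve=30, k_final=5):
    """Policy filter, coarse routing, retrieval, rerank: each stage shrinks the set."""
    words = set(query.lower().split())

    # Stage 0: policy. Tools the caller may not use never reach ranking.
    candidates = [t for t in tools if t["name"] in allowed]

    # Stage 1: coarse routing by tag.
    candidates = [t for t in candidates if tags.intersection(t["tags"])]

    # Stage 2: retrieval. Placeholder score = word overlap with the description.
    def overlap(tool):
        return len(words.intersection(tool["description"].lower().split()))

    retrieved = sorted(candidates, key=overlap, reverse=True)[:k_retrieve]

    # Stage 3: rerank. Placeholder rule: prefer read-only tools on ties.
    reranked = sorted(retrieved, key=lambda t: (overlap(t), not t["writes"]),
                      reverse=True)
    return [t["name"] for t in reranked[:k_final]]


catalog = [
    {"name": "list_events", "tags": ["calendar"], "writes": False,
     "description": "list calendar events in a date range"},
    {"name": "delete_event", "tags": ["calendar"], "writes": True,
     "description": "delete a calendar event by id"},
    {"name": "get_pod", "tags": ["kubernetes"], "writes": False,
     "description": "get a kubernetes pod"},
]
picked = select_tools("list events next week", catalog,
                      allowed={"list_events", "get_pod"}, tags={"calendar"})
print(picked)
```

&lt;p&gt;Note the ordering: &lt;code&gt;delete_event&lt;/code&gt; is removed by policy before any scoring happens, so no ranking mistake can surface it.&lt;/p&gt;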
&lt;hr&gt;
&lt;h2 id="tool-metadata-that-makes-models-smarter"&gt;Tool metadata that makes models smarter&lt;/h2&gt;
&lt;p&gt;If you want better tool selection, stop treating tool schemas as &amp;ldquo;just types.&amp;rdquo; Add metadata that improves discrimination.&lt;/p&gt;
&lt;h3 id="tool-card-fields-recommended"&gt;Tool card fields (recommended)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name&lt;/strong&gt;: stable, verb-first&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: one sentence&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;When to use&lt;/strong&gt;: 2-4 bullets&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;When NOT to use&lt;/strong&gt;: 2-4 bullets (this is underrated)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Side effects&lt;/strong&gt;: none / read-only / creates / updates / deletes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Required arguments&lt;/strong&gt;: and why they&amp;rsquo;re required&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Examples&lt;/strong&gt;: 2-3 example invocations with realistic args&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Error modes&lt;/strong&gt;: rate limit, auth, not found, validation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This reduces tool confusion because it gives the model &lt;em&gt;differentiating features&lt;/em&gt; to tell similar tools apart.&lt;/p&gt;
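&lt;p&gt;One way to hold those fields is a small record type. The sketch below is an assumption about shape, not a standard: &lt;code&gt;ToolCard&lt;/code&gt; and &lt;code&gt;search_text&lt;/code&gt; are hypothetical names, and the example card is made up.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class ToolCard:
    name: str                # stable, verb-first
    purpose: str             # one sentence
    when_to_use: list
    when_not_to_use: list
    side_effects: str        # "none", "read-only", "creates", "updates", "deletes"
    required_args: dict      # argument name mapped to why it is required
    examples: list = field(default_factory=list)
    error_modes: list = field(default_factory=list)

    def search_text(self):
        # Flatten the discriminating fields into one string for indexing.
        parts = [self.name, self.purpose, *self.when_to_use,
                 *("NOT for: " + hint for hint in self.when_not_to_use),
                 *self.required_args]
        return "\n".join(parts)


card = ToolCard(
    name="search_events",
    purpose="Find calendar events matching a text query.",
    when_to_use=["the user describes an event but has no ID"],
    when_not_to_use=["you already have the event ID (use get_event)"],
    side_effects="read-only",
    required_args={"query": "free-text search terms"},
)
print(card.search_text())
```

&lt;p&gt;Indexing the &amp;ldquo;when NOT to use&amp;rdquo; hints alongside the description gives retrieval and reranking something to separate &lt;code&gt;search_events&lt;/code&gt; from &lt;code&gt;get_event&lt;/code&gt;.&lt;/p&gt;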
&lt;hr&gt;
&lt;h2 id="ranking-bm25--embeddings--rerank"&gt;Ranking: BM25 + embeddings + rerank&lt;/h2&gt;
&lt;h3 id="lexical-retrieval-bm25"&gt;Lexical retrieval (BM25)&lt;/h3&gt;
&lt;p&gt;BM25 and probabilistic retrieval approaches are foundational in search. [9]&lt;/p&gt;
&lt;p&gt;Practical benefit: it handles queries like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;S3&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;JWT&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;PodDisruptionBudget&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;Cron&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&amp;hellip;where embeddings can be inconsistent.&lt;/p&gt;
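&lt;p&gt;For a feel of what lexical scoring does with exact terms, here is a compact Okapi BM25 sketch in plain Python. The tokenizer is a naive whitespace split and the tool texts are invented; a production system would more likely use a search library.&lt;/p&gt;

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with Okapi BM25 (whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    avg_len = sum(len(toks) for toks in tokenized) / len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            n = doc_freq[term]
            idf = math.log((len(docs) - n + 0.5) / (n + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "rotate_jwt_signing_key rotate the JWT signing key",
    "list_s3_buckets list S3 buckets in an account",
    "get_pdb read a PodDisruptionBudget for a workload",
]
scores = bm25_scores("JWT", docs)
best = docs[scores.index(max(scores))]
print(best)
```

&lt;p&gt;Only the document that contains the exact token scores above zero, which is the behavior you want for acronyms and resource names.&lt;/p&gt;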
&lt;h3 id="embeddings"&gt;Embeddings&lt;/h3&gt;
&lt;p&gt;Sentence embeddings (like SBERT-style approaches) are designed to enable efficient semantic similarity search. [7]&lt;/p&gt;
&lt;p&gt;Practical benefit: it handles intent queries like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;delete all tasks due tomorrow&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;find calendar conflicts next week&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;check if deployment is stuck&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="approximate-nearest-neighbor-indexing"&gt;Approximate nearest neighbor indexing&lt;/h3&gt;
&lt;p&gt;At scale, you&amp;rsquo;ll want ANN indexing (FAISS is a well-known library in this space). [8]&lt;/p&gt;
&lt;h3 id="rerank"&gt;Rerank&lt;/h3&gt;
&lt;p&gt;This is where you incorporate:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;tool quality score&lt;/li&gt;
&lt;li&gt;tenant policy&lt;/li&gt;
&lt;li&gt;&amp;ldquo;danger tool&amp;rdquo; gating&lt;/li&gt;
&lt;li&gt;recent tool drift&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Reranking is also where you can enforce &amp;ldquo;don&amp;rsquo;t pick write tools unless necessary.&amp;rdquo;&lt;/p&gt;
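&lt;p&gt;A deterministic reranker that folds in those signals might look like the sketch below. The field names, weights, and penalties are illustrative assumptions, not tuned values.&lt;/p&gt;

```python
def rerank(candidates, needs_write, drift_penalty=0.2, write_penalty=0.5):
    """Combine retrieval relevance with quality, drift, and write gating."""
    scored = []
    for tool in candidates:
        if tool["dangerous"] and not needs_write:
            continue  # danger gate: never surface unless the task requires it
        score = tool["relevance"] * tool["quality"]
        if tool["drifted_recently"]:
            score -= drift_penalty
        if tool["writes"] and not needs_write:
            score -= write_penalty
        scored.append((score, tool["name"]))
    return [name for score, name in sorted(scored, reverse=True)]


candidates = [
    {"name": "search_events", "relevance": 0.80, "quality": 0.95,
     "writes": False, "dangerous": False, "drifted_recently": False},
    {"name": "update_event", "relevance": 0.85, "quality": 0.90,
     "writes": True, "dangerous": False, "drifted_recently": False},
    {"name": "delete_event", "relevance": 0.90, "quality": 0.90,
     "writes": True, "dangerous": True, "drifted_recently": False},
]
ranked = rerank(candidates, needs_write=False)
print(ranked)
```

&lt;p&gt;Here the most &amp;ldquo;relevant&amp;rdquo; tool by retrieval score is the delete tool, and it never appears because the task did not call for a write.&lt;/p&gt;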
&lt;hr&gt;
&lt;h2 id="safety-allowlists-danger-gates-and-budgets"&gt;Safety: allowlists, &amp;ldquo;danger gates,&amp;rdquo; and budgets&lt;/h2&gt;
&lt;p&gt;Tool discovery is not neutral. It&amp;rsquo;s an &lt;em&gt;authorization problem&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Your selector should be policy-aware:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Read-only mode&lt;/strong&gt;: only surface read tools&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No-delete mode&lt;/strong&gt;: deletes never appear&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prod incident mode&lt;/strong&gt;: allow observation tools, restrict mutation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Human approval mode&lt;/strong&gt;: show write tools, but require confirmation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Also: build budgets into selection.
If a tool is expensive (slow, rate-limited, high blast radius), rank it lower unless strongly justified.&lt;/p&gt;
&lt;p&gt;For tool-using agents, OWASP highlights prompt injection and excessive agency as key risks - exactly the failure modes you get when tools are over-exposed without gates. [10]&lt;/p&gt;
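&lt;p&gt;The modes above can be expressed as a lookup from mode to permitted side-effect classes. This is a simplified sketch: the mode names, the rule table, and treating &amp;ldquo;prod incident&amp;rdquo; as read-only are assumptions made for illustration.&lt;/p&gt;

```python
READ, CREATE, UPDATE, DELETE = "read", "create", "update", "delete"

# Which side-effect classes each mode may surface (illustrative, not exhaustive).
MODE_RULES = {
    "read_only": {READ},
    "no_delete": {READ, CREATE, UPDATE},
    "prod_incident": {READ},
    "human_approval": {READ, CREATE, UPDATE, DELETE},
}


def surface_tools(tools, mode):
    """Return (tool name, needs_confirmation) pairs permitted in this mode."""
    allowed = MODE_RULES[mode]
    surfaced = []
    for name, effect in tools:
        if effect not in allowed:
            continue
        needs_confirmation = mode == "human_approval" and effect != READ
        surfaced.append((name, needs_confirmation))
    return surfaced


tools = [("get_pod", READ), ("scale_deployment", UPDATE),
         ("delete_namespace", DELETE)]
print(surface_tools(tools, "no_delete"))
print(surface_tools(tools, "human_approval"))
```

&lt;p&gt;Keeping the rules in data rather than in prompts means the policy can be reviewed, tested, and changed without touching the agent.&lt;/p&gt;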
&lt;hr&gt;
&lt;h2 id="quality-scoring-and-tool-quarantine"&gt;Quality scoring and tool quarantine&lt;/h2&gt;
&lt;p&gt;You need a &lt;strong&gt;tool quality score&lt;/strong&gt; because tools drift:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;upstream APIs change&lt;/li&gt;
&lt;li&gt;auth breaks&lt;/li&gt;
&lt;li&gt;quotas shift&lt;/li&gt;
&lt;li&gt;tool server regressions happen&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Track per tool:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;p50 / p95 latency&lt;/li&gt;
&lt;li&gt;error rate&lt;/li&gt;
&lt;li&gt;timeout rate&lt;/li&gt;
&lt;li&gt;&amp;ldquo;invalid argument&amp;rdquo; rate (often a selection problem)&lt;/li&gt;
&lt;li&gt;&amp;ldquo;unsafe attempt&amp;rdquo; rate (policy violations)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Then take action:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;quarantine tools with regression spikes&lt;/li&gt;
&lt;li&gt;degrade to read-only tools during outages&lt;/li&gt;
&lt;li&gt;route to backups (alternate implementations)&lt;/li&gt;
&lt;/ul&gt;
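&lt;p&gt;A minimal sketch of error-rate tracking with a quarantine decision follows. The class name, window size, and thresholds are assumptions; latency percentiles and the other signals listed above are left out for brevity.&lt;/p&gt;

```python
from collections import deque


class ToolHealth:
    """Rolling window of recent call outcomes for one tool."""

    def __init__(self, window=100, max_error_rate=0.2, min_calls=10):
        self.outcomes = deque(maxlen=window)   # True = success, False = failure
        self.max_error_rate = max_error_rate
        self.min_calls = min_calls

    def record(self, ok):
        self.outcomes.append(ok)

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def quarantined(self):
        # Require a minimum sample so one early failure does not quarantine a tool.
        enough = len(self.outcomes) >= self.min_calls
        return enough and self.error_rate() > self.max_error_rate


health = ToolHealth()
for ok in [True] * 8 + [False] * 4:
    health.record(ok)
print(round(health.error_rate(), 2), health.quarantined())
```

&lt;p&gt;Because the window is bounded, a tool leaves quarantine on its own once enough recent calls succeed, assuming something (a probe or a small traffic share) keeps calling it.&lt;/p&gt;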
&lt;hr&gt;
&lt;h2 id="debuggability-explainable-tool-selection"&gt;Debuggability: explainable tool selection&lt;/h2&gt;
&lt;p&gt;If you can&amp;rsquo;t answer &lt;strong&gt;&amp;ldquo;why did the agent pick that tool?&amp;rdquo;&lt;/strong&gt;, you won&amp;rsquo;t be able to operate the system.&lt;/p&gt;
&lt;p&gt;Log (or attach to traces) the selection evidence:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;query text&lt;/li&gt;
&lt;li&gt;candidate tools (top 30)&lt;/li&gt;
&lt;li&gt;retrieval scores&lt;/li&gt;
&lt;li&gt;rerank scores&lt;/li&gt;
&lt;li&gt;policy filters applied&lt;/li&gt;
&lt;li&gt;final selected tools and why&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This also becomes training data later.&lt;/p&gt;
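&lt;p&gt;As a sketch, the evidence can be one structured log line per selection. The field names and values below are hypothetical; what matters is that every item from the list above has a place.&lt;/p&gt;

```python
import json


def selection_record(query, candidates, policy_filters, selected, reason):
    """Build one JSON log line describing why tools were selected."""
    return json.dumps({
        "query": query,
        "candidates": [
            {"tool": name, "retrieval": retrieval, "rerank": rerank}
            for name, retrieval, rerank in candidates
        ],
        "policy_filters": policy_filters,
        "selected": selected,
        "reason": reason,
    }, sort_keys=True)


line = selection_record(
    query="find calendar conflicts next week",
    candidates=[("search_events", 0.82, 0.91), ("list_events", 0.79, 0.64)],
    policy_filters=["read_only"],
    selected=["search_events"],
    reason="highest rerank score among read-only tools",
)
print(line)
```

&lt;p&gt;Logging the rejected candidates with their scores is what makes the record useful later, both for debugging and as labeled data.&lt;/p&gt;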
&lt;hr&gt;
&lt;h2 id="a-minimal-reference-architecture"&gt;A minimal reference architecture&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;+-----------------------------+
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;|   Agent runtime (planner)   |
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;+-----------------------------+
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;               |
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;               v
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;+-----------------------------+
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;|    Tool Selector Service    |
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;|  - policy filter            |
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;|  - hybrid retrieval         |
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;|  - rerank                   |
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;|  - tool quality weighting   |
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;+-----------------------------+
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;               |  returns top-K tools + schemas
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;               v
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;+-----------------------------+
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;|       Agent execution       |
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;|  - calls tools via MCP      |
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;+-----------------------------+
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Where MCP fits: MCP provides a standardized way for clients to discover tools and invoke them. [1]&lt;/p&gt;
&lt;p&gt;The selector doesn&amp;rsquo;t replace MCP. It makes MCP usable at scale.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="a-production-checklist"&gt;A production checklist&lt;/h2&gt;
&lt;h3 id="tool-catalog-hygiene"&gt;Tool catalog hygiene&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Stable naming conventions.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; &amp;ldquo;When NOT to use&amp;rdquo; bullets exist.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Examples exist for the top tools.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Tool side effects are classified.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="selection-pipeline"&gt;Selection pipeline&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Mandatory policy filter before ranking.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Hybrid retrieval (lexical + embeddings). [7][9]&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Rerank top-K with quality + policy.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Candidate set bounded (K is small).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="safety"&gt;Safety&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Dangerous tools are gated and not surfaced by default.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Budget-aware ranking exists.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; OWASP LLM risks considered in tool exposure strategy. [10]&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="operability"&gt;Operability&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Selection decisions are explainable (log evidence).&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Tool quality scoring exists and drives quarantine.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Selection regressions are covered by evals (next article).&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;p&gt;[1] Model Context Protocol (MCP) - Specification (Protocol Revision 2025-11-25): &lt;a href="https://modelcontextprotocol.io/specification/2025-11-25" target="_blank" rel="noopener noreferrer"&gt;https://modelcontextprotocol.io/specification/2025-11-25&lt;/a&gt;&lt;br&gt;
[2] MCP - Transports (including stdio and Streamable HTTP): &lt;a href="https://modelcontextprotocol.io/specification/2025-03-26/basic/transports" target="_blank" rel="noopener noreferrer"&gt;https://modelcontextprotocol.io/specification/2025-03-26/basic/transports&lt;/a&gt;&lt;br&gt;
[3] ToolLLM / ToolBench (tool-use dataset + evaluation): &lt;a href="https://arxiv.org/abs/2307.16789" target="_blank" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2307.16789&lt;/a&gt;&lt;br&gt;
[4] StableToolBench (stable tool-use benchmarking): &lt;a href="https://arxiv.org/abs/2403.07714" target="_blank" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2403.07714&lt;/a&gt;&lt;br&gt;
[5] tau-bench (tool-agent-user interaction benchmark): &lt;a href="https://arxiv.org/abs/2406.12045" target="_blank" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2406.12045&lt;/a&gt;&lt;br&gt;
[6] AgentBench (evaluating LLMs as agents): &lt;a href="https://arxiv.org/abs/2308.03688" target="_blank" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2308.03688&lt;/a&gt;&lt;br&gt;
[7] Sentence-BERT (efficient semantic similarity search via embeddings): &lt;a href="https://arxiv.org/abs/1908.10084" target="_blank" rel="noopener noreferrer"&gt;https://arxiv.org/abs/1908.10084&lt;/a&gt;&lt;br&gt;
[8] FAISS / Billion-scale similarity search with GPUs: &lt;a href="https://arxiv.org/abs/1702.08734" target="_blank" rel="noopener noreferrer"&gt;https://arxiv.org/abs/1702.08734&lt;/a&gt; and &lt;a href="https://github.com/facebookresearch/faiss" target="_blank" rel="noopener noreferrer"&gt;https://github.com/facebookresearch/faiss&lt;/a&gt;&lt;br&gt;
[9] Robertson (BM25 and probabilistic relevance framework): &lt;a href="https://dl.acm.org/doi/abs/10.1561/1500000019" target="_blank" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/abs/10.1561/1500000019&lt;/a&gt;&lt;br&gt;
[10] OWASP - Top 10 for Large Language Model Applications: &lt;a href="https://owasp.org/www-project-top-10-for-large-language-model-applications/" target="_blank" rel="noopener noreferrer"&gt;https://owasp.org/www-project-top-10-for-large-language-model-applications/&lt;/a&gt;
&lt;/p&gt;</content:encoded></item><item><title>The Service Template That Prevents Incidents</title><link>https://roygabriel.dev/blog/paved-road-service-template/</link><pubDate>Sat, 25 Oct 2025 12:00:00 -0500</pubDate><guid>https://roygabriel.dev/blog/paved-road-service-template/</guid><description>&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;Most enterprises try to standardize software delivery with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;PDFs&lt;/li&gt;
&lt;li&gt;Confluence pages&lt;/li&gt;
&lt;li&gt;slide decks&lt;/li&gt;
&lt;li&gt;architecture review boards&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It doesn&amp;rsquo;t scale.&lt;/p&gt;
&lt;p&gt;Teams don&amp;rsquo;t move faster because the &lt;em&gt;rules&lt;/em&gt; exist. Teams move faster because the &lt;strong&gt;defaults&lt;/strong&gt; exist.&lt;/p&gt;
&lt;p&gt;Platform engineering language captures this well: paved roads / golden paths reduce cognitive load and make the &amp;ldquo;right way&amp;rdquo; the easy way. [1][2]
The CNCF Platforms White Paper makes the case for internal platforms as a lever that impacts value streams indirectly - through better flow and developer experience. [3]&lt;/p&gt;</description><content:encoded>&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;Most enterprises try to standardize software delivery with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;PDFs&lt;/li&gt;
&lt;li&gt;Confluence pages&lt;/li&gt;
&lt;li&gt;slide decks&lt;/li&gt;
&lt;li&gt;architecture review boards&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It doesn&amp;rsquo;t scale.&lt;/p&gt;
&lt;p&gt;Teams don&amp;rsquo;t move faster because the &lt;em&gt;rules&lt;/em&gt; exist. Teams move faster because the &lt;strong&gt;defaults&lt;/strong&gt; exist.&lt;/p&gt;
&lt;p&gt;Platform engineering language captures this well: paved roads / golden paths reduce cognitive load and make the &amp;ldquo;right way&amp;rdquo; the easy way. [1][2]
The CNCF Platforms White Paper makes the case for internal platforms as a lever that impacts value streams indirectly - through better flow and developer experience. [3]&lt;/p&gt;
&lt;p&gt;This article is a practical blueprint for the thing that actually changes outcomes:&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;A service template that bakes reliability, security, and operability into day-one defaults.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;h2 id="tldr"&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Build one paved road for APIs: repo template + CI pipeline + runtime defaults&lt;/li&gt;
&lt;li&gt;Include &amp;ldquo;boring&amp;rdquo; but critical capabilities:
&lt;ul&gt;
&lt;li&gt;health probes, resource requests/limits, disruption budgets [4][5][6]&lt;/li&gt;
&lt;li&gt;tracing/metrics/logging via OpenTelemetry [7]&lt;/li&gt;
&lt;li&gt;timeouts, retries, rate limits&lt;/li&gt;
&lt;li&gt;standardized deployment and rollout&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Measure success with outcomes (DORA metrics): lead time, deploy frequency, change failure rate, MTTR. [8]&lt;/li&gt;
&lt;li&gt;Optimize for day 2 to day 50, not just &amp;ldquo;hello world.&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="contents"&gt;Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#what-a-paved-road-is-and-isnt"&gt;What a paved road is (and isn&amp;rsquo;t)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-api-service-template-required-capabilities"&gt;The API service template: required capabilities&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-reference-repository-structure"&gt;A reference repository structure&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#kubernetes-defaults-that-save-you-later"&gt;Kubernetes defaults that save you later&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#observability-by-default"&gt;Observability by default&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#security-by-default"&gt;Security by default&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#rollouts-and-operational-controls"&gt;Rollouts and operational controls&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#how-to-roll-this-out-without-a-platform-revolt"&gt;How to roll this out without a platform revolt&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-production-checklist"&gt;A production checklist&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#references"&gt;References&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="what-a-paved-road-is-and-isnt"&gt;What a paved road is (and isn&amp;rsquo;t)&lt;/h2&gt;
&lt;h3 id="a-paved-road-is"&gt;A paved road is&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;a &lt;strong&gt;recommended&lt;/strong&gt; path to production&lt;/li&gt;
&lt;li&gt;preconfigured defaults that make safe delivery easy&lt;/li&gt;
&lt;li&gt;automation that eliminates repetitive decisions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Microsoft describes this in internal developer platform terms: recommended and supported development paths, incrementally paved through an internal platform. [2]&lt;/p&gt;
&lt;h3 id="a-paved-road-is-not"&gt;A paved road is not&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;a mandate that blocks all other approaches&lt;/li&gt;
&lt;li&gt;a committee process&lt;/li&gt;
&lt;li&gt;a doc nobody reads&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If your paved road becomes a gate, teams will route around it.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="the-api-service-template-required-capabilities"&gt;The API service template: required capabilities&lt;/h2&gt;
&lt;p&gt;Here&amp;rsquo;s what &amp;ldquo;enterprise production API&amp;rdquo; should mean out of the box.&lt;/p&gt;
&lt;h3 id="operability"&gt;Operability&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;structured logging with correlation IDs&lt;/li&gt;
&lt;li&gt;metrics (request rate/latency/errors)&lt;/li&gt;
&lt;li&gt;tracing across inbound/outbound calls [7]&lt;/li&gt;
&lt;li&gt;runtime config and feature flags&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="reliability"&gt;Reliability&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;timeouts everywhere&lt;/li&gt;
&lt;li&gt;bounded retries with backoff&lt;/li&gt;
&lt;li&gt;health probes (liveness/readiness/startup) [5]&lt;/li&gt;
&lt;li&gt;graceful shutdown&lt;/li&gt;
&lt;li&gt;rate limits / concurrency caps&lt;/li&gt;
&lt;/ul&gt;
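&lt;p&gt;A minimal Python sketch of &amp;ldquo;bounded retries with backoff,&amp;rdquo; assuming the wrapped call is idempotent and that any exception counts as retryable (a real template would retry only transient errors). The name &lt;code&gt;retry_with_backoff&lt;/code&gt; and its parameters are illustrative, not part of any particular template:&lt;/p&gt;

```python
import random
import time


def retry_with_backoff(call, attempts=4, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, jitter=random.random):
    """Call `call()` up to `attempts` times, sleeping between failures.

    The delay doubles after each failure, is capped at `max_delay`, and is
    scaled by a factor between 0.5 and 1.0 so clients do not retry in lockstep.
    """
    attempts = max(1, attempts)  # always make at least one attempt
    last_error = None
    for attempt in range(attempts):
        try:
            return call()
        except Exception as err:  # assumption: every exception is retryable
            last_error = err
            if attempt == attempts - 1:
                break  # retry budget exhausted: no sleep after the final try
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * (0.5 + jitter() / 2))
    raise last_error
```

&lt;p&gt;The retry count and the delay cap are both explicit arguments, so the worst-case added latency is easy to reason about. Injecting &lt;code&gt;sleep&lt;/code&gt; and &lt;code&gt;jitter&lt;/code&gt; keeps the helper testable without real waiting.&lt;/p&gt;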
&lt;h3 id="platform-fit"&gt;Platform fit&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Kubernetes-ready manifests&lt;/li&gt;
&lt;li&gt;resource requests/limits [4]&lt;/li&gt;
&lt;li&gt;PodDisruptionBudget for availability during maintenance [6]&lt;/li&gt;
&lt;li&gt;standardized rollout strategy&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="security"&gt;Security&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;auth middleware&lt;/li&gt;
&lt;li&gt;input validation&lt;/li&gt;
&lt;li&gt;secret injection patterns (no secrets in repo)&lt;/li&gt;
&lt;li&gt;least privilege service accounts&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="delivery"&gt;Delivery&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;CI pipeline: lint/test/build/scan&lt;/li&gt;
&lt;li&gt;SBOM generation&lt;/li&gt;
&lt;li&gt;deploy automation (GitOps or pipeline)&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="a-reference-repository-structure"&gt;A reference repository structure&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;├── cmd/service/  # main
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;├── internal/  # business logic
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;├── pkg/  # shared libs (optional)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;├── api/  # OpenAPI spec, schemas
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;├── deploy/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│   ├── k8s/  # manifests (or Helm/Kustomize)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│   └── policy/  # OPA/constraints (optional)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;├── docs/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│   ├── index.md
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│   └── runbooks/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;├── Makefile
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;└── .github/workflows/  # CI
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Key idea: the template is not just code - it is the full production story:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;how to run locally&lt;/li&gt;
&lt;li&gt;how to deploy&lt;/li&gt;
&lt;li&gt;how to observe&lt;/li&gt;
&lt;li&gt;how to operate on-call&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="kubernetes-defaults-that-save-you-later"&gt;Kubernetes defaults that save you later&lt;/h2&gt;
&lt;h3 id="1-resource-requests-and-limits"&gt;1) Resource requests and limits&lt;/h3&gt;
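&lt;p&gt;Kubernetes scheduling and stability depend on requests and limits: the scheduler places pods based on requests, and limits are enforced at runtime (CPU is throttled; a container that exceeds its memory limit can be OOM-killed). A pod&amp;rsquo;s effective requests and limits are derived from its containers&amp;rsquo; values. [4]&lt;/p&gt;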
&lt;p&gt;Template default:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;set conservative requests&lt;/li&gt;
&lt;li&gt;set safe limits&lt;/li&gt;
&lt;li&gt;provide guidance for right-sizing&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="2-probes"&gt;2) Probes&lt;/h3&gt;
&lt;p&gt;Kubernetes supports liveness, readiness, and startup probes. The docs describe how to configure them and why they matter. [5]&lt;/p&gt;
&lt;p&gt;Template default:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;readinessProbe&lt;/code&gt; ensures traffic only goes to ready pods&lt;/li&gt;
&lt;li&gt;&lt;code&gt;livenessProbe&lt;/code&gt; catches deadlocks / stuck processes&lt;/li&gt;
&lt;li&gt;&lt;code&gt;startupProbe&lt;/code&gt; holds off liveness and readiness checks until a slow-booting service has started, which prevents early restarts&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="3-disruption-budgets"&gt;3) Disruption budgets&lt;/h3&gt;
&lt;p&gt;PodDisruptionBudgets limit concurrent disruptions during voluntary maintenance. [6]&lt;/p&gt;
&lt;p&gt;Template default:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;include a PDB for replicated services&lt;/li&gt;
&lt;li&gt;define min available or max unavailable&lt;/li&gt;
&lt;/ul&gt;
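&lt;p&gt;One way to keep these three defaults from drifting is a small check in CI. The Python sketch below assumes the Deployment and PodDisruptionBudget manifests have already been parsed into dictionaries (for example from YAML or JSON); &lt;code&gt;missing_defaults&lt;/code&gt; is a hypothetical helper, while the field names it inspects are the standard Kubernetes ones:&lt;/p&gt;

```python
def missing_defaults(deployment, pdb=None):
    """Return a list of gaps; an empty list means the manifests pass."""
    gaps = []
    spec = deployment.get("spec", {})
    pod_spec = spec.get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "?")
        resources = container.get("resources", {})
        for kind in ("requests", "limits"):
            if not resources.get(kind):
                gaps.append(f"{name}: resources.{kind} not set")
        for probe in ("readinessProbe", "livenessProbe", "startupProbe"):
            if probe not in container:
                gaps.append(f"{name}: {probe} not set")
    # Assumption: anything other than exactly one replica counts as replicated.
    replicated = spec.get("replicas", 1) != 1
    if replicated and pdb is None:
        gaps.append("replicated service has no PodDisruptionBudget")
    if pdb is not None:
        pdb_spec = pdb.get("spec", {})
        if "minAvailable" not in pdb_spec and "maxUnavailable" not in pdb_spec:
            gaps.append("PodDisruptionBudget sets neither minAvailable nor maxUnavailable")
    return gaps
```

&lt;p&gt;The check only verifies that the fields exist. Whether the values are sensible still needs right-sizing guidance and review.&lt;/p&gt;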
&lt;hr&gt;
&lt;h2 id="observability-by-default"&gt;Observability by default&lt;/h2&gt;
&lt;p&gt;If you do one thing: instrument the template so every service ships with telemetry.&lt;/p&gt;
&lt;p&gt;OpenTelemetry provides the framework for standard traces/metrics/logs. [7]&lt;/p&gt;
&lt;p&gt;Template defaults:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;standard HTTP server instrumentation&lt;/li&gt;
&lt;li&gt;propagation of trace context (W3C headers)&lt;/li&gt;
&lt;li&gt;request logs include trace IDs&lt;/li&gt;
&lt;li&gt;golden dashboard:
&lt;ul&gt;
&lt;li&gt;RPS&lt;/li&gt;
&lt;li&gt;p95 latency&lt;/li&gt;
&lt;li&gt;error rate&lt;/li&gt;
&lt;li&gt;saturation (CPU/memory)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
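&lt;p&gt;To make &amp;ldquo;request logs include trace IDs&amp;rdquo; concrete, here is a hedged Python sketch that reads the W3C &lt;code&gt;traceparent&lt;/code&gt; header and adds the IDs to a structured log line. In practice the OpenTelemetry SDK handles propagation; the helper names and the lowercase-keyed &lt;code&gt;headers&lt;/code&gt; dictionary are assumptions made for illustration:&lt;/p&gt;

```python
import json

HEX = set("0123456789abcdef")


def parse_traceparent(header):
    """Return (trace_id, span_id) from a version-00 traceparent value, else None."""
    parts = header.strip().split("-")
    if len(parts) != 4:
        return None
    version, trace_id, span_id, flags = parts
    if (len(version), len(trace_id), len(span_id), len(flags)) != (2, 32, 16, 2):
        return None
    if not set("".join(parts)).issubset(HEX) or version == "ff":
        return None  # only lowercase hex is valid; version ff is forbidden
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return trace_id, span_id


def request_log_line(method, path, status, headers):
    """Build one JSON log line, including trace IDs when the header is valid."""
    record = {"method": method, "path": path, "status": status}
    parsed = parse_traceparent(headers.get("traceparent", ""))
    if parsed is not None:
        record["trace_id"], record["span_id"] = parsed
    return json.dumps(record, sort_keys=True)
```

&lt;p&gt;With the trace ID in every request log, an on-call engineer can jump from a log line to the matching trace without guessing at timestamps.&lt;/p&gt;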
&lt;hr&gt;
&lt;h2 id="security-by-default"&gt;Security by default&lt;/h2&gt;
&lt;p&gt;Avoid &amp;ldquo;security guidance documents.&amp;rdquo; Make secure defaults.&lt;/p&gt;
&lt;p&gt;Template defaults:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;auth middleware with standardized claims/roles mapping&lt;/li&gt;
&lt;li&gt;structured validation for request bodies&lt;/li&gt;
&lt;li&gt;outbound allowlists (where feasible)&lt;/li&gt;
&lt;li&gt;secret injection via environment/secret store (no plain text)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Your paved road becomes a security accelerator because teams start secure.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="rollouts-and-operational-controls"&gt;Rollouts and operational controls&lt;/h2&gt;
&lt;p&gt;Default rollout patterns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;canary or progressive delivery when needed&lt;/li&gt;
&lt;li&gt;safe rollback&lt;/li&gt;
&lt;li&gt;feature flags for risky changes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Default operational controls:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;rate limiting&lt;/li&gt;
&lt;li&gt;concurrency limits&lt;/li&gt;
&lt;li&gt;timeouts and circuit breakers&lt;/li&gt;
&lt;li&gt;&amp;ldquo;maintenance mode&amp;rdquo; toggle&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="how-to-roll-this-out-without-a-platform-revolt"&gt;How to roll this out without a platform revolt&lt;/h2&gt;
&lt;p&gt;This is the part platform teams often miss.&lt;/p&gt;
&lt;h3 id="1-make-it-optional---but-obviously-better"&gt;1) Make it optional - but obviously better&lt;/h3&gt;
&lt;p&gt;If adopting the template reduces weeks of work to hours, teams will choose it.&lt;/p&gt;
&lt;h3 id="2-provide-migration-paths"&gt;2) Provide migration paths&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;minimal adoption: observability + probes&lt;/li&gt;
&lt;li&gt;medium: deploy manifests + CI&lt;/li&gt;
&lt;li&gt;full: service template + libraries&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="3-measure-outcomes-not-adoption"&gt;3) Measure outcomes, not adoption&lt;/h3&gt;
&lt;p&gt;Use DORA metrics to show impact: lead time, deploy frequency, change failure rate, time to restore service. [8]&lt;/p&gt;
&lt;p&gt;If the paved road doesn&amp;rsquo;t move these, it&amp;rsquo;s not paved.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="a-production-checklist"&gt;A production checklist&lt;/h2&gt;
&lt;h3 id="template"&gt;Template&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Repo template includes CI, deploy, docs, runbooks.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Observability instrumentation included by default. [7]&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="kubernetes"&gt;Kubernetes&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Resource requests/limits included. [4]&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Liveness/readiness/startup probes included. [5]&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; PodDisruptionBudget included for replicated services. [6]&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="reliability-1"&gt;Reliability&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Timeouts and bounded retries are standard.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Graceful shutdown is implemented.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Rate limiting/concurrency caps exist.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="security-1"&gt;Security&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Auth middleware included.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Secrets handled via secure injection (not repo).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="outcomes"&gt;Outcomes&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; DORA metrics tracked to validate improvement. [8]&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;p&gt;[1] CNCF - What is platform engineering? (golden paths/paved roads framing): &lt;a href="https://www.cncf.io/blog/2025/11/19/what-is-platform-engineering/" target="_blank" rel="noopener noreferrer"&gt;https://www.cncf.io/blog/2025/11/19/what-is-platform-engineering/&lt;/a&gt;
[2] Microsoft Learn - What is platform engineering? (paved paths / internal developer platform): &lt;a href="https://learn.microsoft.com/en-us/platform-engineering/what-is-platform-engineering" target="_blank" rel="noopener noreferrer"&gt;https://learn.microsoft.com/en-us/platform-engineering/what-is-platform-engineering&lt;/a&gt;
[3] CNCF TAG App Delivery - Platforms White Paper: &lt;a href="https://tag-app-delivery.cncf.io/whitepapers/platforms/" target="_blank" rel="noopener noreferrer"&gt;https://tag-app-delivery.cncf.io/whitepapers/platforms/&lt;/a&gt;
[4] Kubernetes - Resource Management for Pods and Containers (requests/limits): &lt;a href="https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/" target="_blank" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/&lt;/a&gt;
[5] Kubernetes - Configure Liveness, Readiness and Startup Probes: &lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/" target="_blank" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/&lt;/a&gt;
[6] Kubernetes - Specifying a Disruption Budget for your Application (PDB): &lt;a href="https://kubernetes.io/docs/tasks/run-application/configure-pdb/" target="_blank" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/tasks/run-application/configure-pdb/&lt;/a&gt;
[7] OpenTelemetry - Documentation (instrumentation and telemetry): &lt;a href="https://opentelemetry.io/docs/" target="_blank" rel="noopener noreferrer"&gt;https://opentelemetry.io/docs/&lt;/a&gt;
[8] DORA - DORA&amp;rsquo;s software delivery performance metrics: &lt;a href="https://dora.dev/guides/dora-metrics/" target="_blank" rel="noopener noreferrer"&gt;https://dora.dev/guides/dora-metrics/&lt;/a&gt;
&lt;/p&gt;</content:encoded></item><item><title>The Real Security Model for Agents</title><link>https://roygabriel.dev/blog/real-security-model-for-agents/</link><pubDate>Sat, 18 Oct 2025 12:00:00 -0500</pubDate><guid>https://roygabriel.dev/blog/real-security-model-for-agents/</guid><description>&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;If you ship tool-using agents, you are shipping:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;an execution engine&lt;/li&gt;
&lt;li&gt;with access to external systems&lt;/li&gt;
&lt;li&gt;controlled by untrusted inputs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is the same security posture as any automation platform - except the &amp;ldquo;operator&amp;rdquo; is probabilistic.&lt;/p&gt;
&lt;p&gt;OWASP&amp;rsquo;s Top 10 for LLM Applications makes it clear: prompt injection, insecure output handling, sensitive info disclosure, excessive agency&amp;hellip; these are mainstream risks, not edge cases. [1]
The good news: most mitigations are &lt;em&gt;classic security engineering&lt;/em&gt; applied to a new execution model.&lt;/p&gt;</description><content:encoded>&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;If you ship tool-using agents, you are shipping:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;an execution engine&lt;/li&gt;
&lt;li&gt;with access to external systems&lt;/li&gt;
&lt;li&gt;controlled by untrusted inputs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is the same security posture as any automation platform - except the &amp;ldquo;operator&amp;rdquo; is probabilistic.&lt;/p&gt;
&lt;p&gt;OWASP&amp;rsquo;s Top 10 for LLM Applications makes it clear: prompt injection, insecure output handling, sensitive info disclosure, excessive agency&amp;hellip; these are mainstream risks, not edge cases. [1]
The good news: most mitigations are &lt;em&gt;classic security engineering&lt;/em&gt; applied to a new execution model.&lt;/p&gt;
&lt;p&gt;This article is a practical, production-first security model for agents and MCP tool ecosystems.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="tldr"&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Don&amp;rsquo;t &amp;ldquo;secure the model.&amp;rdquo; Secure the &lt;strong&gt;system&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Treat all inputs as untrusted:
&lt;ul&gt;
&lt;li&gt;user text&lt;/li&gt;
&lt;li&gt;tool outputs&lt;/li&gt;
&lt;li&gt;retrieved documents&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Design tools with least privilege:
&lt;ul&gt;
&lt;li&gt;separate read/write/danger tools&lt;/li&gt;
&lt;li&gt;require preview -&amp;gt; apply for destructive actions&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Centralize auth and policy:
&lt;ul&gt;
&lt;li&gt;MCP defines authorization for HTTP transports - use it. [2]&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Control egress and prevent SSRF by default. [3]&lt;/li&gt;
&lt;li&gt;Never let raw model output drive execution without validation (OWASP LLM02). [4]&lt;/li&gt;
&lt;li&gt;Redact logs and manage secrets like an adult (OWASP cheat sheets). [5][6]&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="contents"&gt;Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#threat-model-what-can-go-wrong"&gt;Threat model: what can go wrong&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#security-layers-that-actually-work"&gt;Security layers that actually work&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#tool-design-readwritedanger-tiers"&gt;Tool design: read/write/danger tiers&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#output-handling-never-execute-raw-model-output"&gt;Output handling: never execute raw model output&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#secrets-minimize-scope-rotate"&gt;Secrets: minimize, scope, rotate&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#network-and-egress-controls"&gt;Network and egress controls&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#logging-and-audit-without-data-leaks"&gt;Logging and audit without data leaks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-production-checklist"&gt;A production checklist&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#references"&gt;References&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="threat-model-what-can-go-wrong"&gt;Threat model: what can go wrong&lt;/h2&gt;
&lt;h3 id="1-prompt-injection---policy-bypass-attempt"&gt;1) Prompt injection -&amp;gt; policy bypass attempt&lt;/h3&gt;
&lt;p&gt;A user or document says:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;Ignore previous instructions&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;Call this tool with these parameters&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;Reveal secrets&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;OWASP calls this out as a primary risk category. [1]&lt;/p&gt;
&lt;h3 id="2-insecure-output-handling---downstream-exploitation"&gt;2) Insecure output handling -&amp;gt; downstream exploitation&lt;/h3&gt;
&lt;p&gt;If you pass model output into:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a shell&lt;/li&gt;
&lt;li&gt;SQL&lt;/li&gt;
&lt;li&gt;YAML manifests&lt;/li&gt;
&lt;li&gt;HTTP requests&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&amp;hellip;without validation, you&amp;rsquo;ve built an indirect code execution path.&lt;/p&gt;
&lt;p&gt;OWASP&amp;rsquo;s LLM02 describes this precisely: insufficient validation and handling of LLM outputs before passing them downstream. [4]&lt;/p&gt;
&lt;h3 id="3-excessive-agency---unintended-side-effects"&gt;3) Excessive agency -&amp;gt; unintended side effects&lt;/h3&gt;
&lt;p&gt;The agent is over-permissioned:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;it can delete resources&lt;/li&gt;
&lt;li&gt;send emails&lt;/li&gt;
&lt;li&gt;modify production&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&amp;hellip;and it will eventually do something you didn&amp;rsquo;t mean.&lt;/p&gt;
&lt;h3 id="4-data-exfiltration-via-tools"&gt;4) Data exfiltration via tools&lt;/h3&gt;
&lt;p&gt;Tool outputs are rich and often sensitive:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;calendar events&lt;/li&gt;
&lt;li&gt;emails&lt;/li&gt;
&lt;li&gt;internal tickets&lt;/li&gt;
&lt;li&gt;source code&lt;/li&gt;
&lt;li&gt;cluster configs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Exfil happens through:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;model responses&lt;/li&gt;
&lt;li&gt;logs&lt;/li&gt;
&lt;li&gt;&amp;ldquo;helpful&amp;rdquo; summaries&lt;/li&gt;
&lt;li&gt;tool chaining&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="5-network-abuse--ssrf"&gt;5) Network abuse / SSRF&lt;/h3&gt;
&lt;p&gt;Any &amp;ldquo;fetch URL&amp;rdquo; capability is an SSRF invitation unless you constrain egress. OWASP&amp;rsquo;s SSRF cheat sheet is still relevant. [3]&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="security-layers-that-actually-work"&gt;Security layers that actually work&lt;/h2&gt;
&lt;p&gt;Security in agent systems is defense-in-depth:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Identity&lt;/strong&gt; (who is calling?)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authorization&lt;/strong&gt; (what can they do?)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Contracts&lt;/strong&gt; (what does a tool accept/return?)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Validation&lt;/strong&gt; (are inputs/outputs safe?)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Egress control&lt;/strong&gt; (where can the system talk to?)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Audit&lt;/strong&gt; (what happened?)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kill switches&lt;/strong&gt; (how do you stop it fast?)&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h2 id="tool-design-readwritedanger-tiers"&gt;Tool design: read/write/danger tiers&lt;/h2&gt;
&lt;h3 id="tiering-is-mandatory"&gt;Tiering is mandatory&lt;/h3&gt;
&lt;p&gt;Split tools by side effects:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Read tools&lt;/strong&gt;: list/search/get&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Write tools&lt;/strong&gt;: create/update with bounded scope&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Danger tools&lt;/strong&gt;: deletes, bulk updates, privileged actions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Then enforce policy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Read tools are widely available&lt;/li&gt;
&lt;li&gt;Write tools require explicit scopes and tighter budgets&lt;/li&gt;
&lt;li&gt;Danger tools require:
&lt;ul&gt;
&lt;li&gt;preview -&amp;gt; apply&lt;/li&gt;
&lt;li&gt;confirmation tokens&lt;/li&gt;
&lt;li&gt;additional policy checks&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="preview---apply-pattern"&gt;Preview -&amp;gt; Apply pattern&lt;/h3&gt;
&lt;p&gt;For dangerous operations:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;plan_*&lt;/code&gt; returns a plan summary + &lt;code&gt;plan_id&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;apply_*&lt;/code&gt; requires &lt;code&gt;plan_id&lt;/code&gt; + user confirmation&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This prevents &amp;ldquo;drive-by deletes&amp;rdquo; and supports audit.&lt;/p&gt;
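&lt;p&gt;A minimal Python sketch of the two-step flow, assuming an in-memory store and a caller-supplied delete function. &lt;code&gt;PlanStore&lt;/code&gt;, &lt;code&gt;plan_delete&lt;/code&gt;, and &lt;code&gt;apply_delete&lt;/code&gt; are hypothetical names; a real server would persist plans, expire them, and verify the confirmation out of band:&lt;/p&gt;

```python
import secrets


class PlanStore:
    """In-memory plan registry (illustrative only)."""

    def __init__(self):
        self._plans = {}

    def plan_delete(self, actor, resource_ids):
        """Preview step: record what would be deleted, return summary and plan_id."""
        plan_id = secrets.token_urlsafe(16)
        self._plans[plan_id] = {"actor": actor, "resource_ids": list(resource_ids)}
        summary = f"would delete {len(resource_ids)} resource(s)"
        return {"plan_id": plan_id, "summary": summary}

    def apply_delete(self, actor, plan_id, confirmed, delete_fn):
        """Apply step: needs a known plan, the same actor, and explicit confirmation."""
        plan = self._plans.get(plan_id)
        if plan is None:
            raise PermissionError("unknown or already-used plan_id")
        if plan["actor"] != actor:
            raise PermissionError("plan belongs to a different actor")
        if not confirmed:
            raise PermissionError("user confirmation required")
        del self._plans[plan_id]  # single use: a replayed apply is rejected
        for resource_id in plan["resource_ids"]:
            delete_fn(resource_id)
        return plan["resource_ids"]
```

&lt;p&gt;The apply step takes no resource list of its own. It can only execute what the preview recorded, so the model cannot widen the blast radius between the two calls.&lt;/p&gt;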
&lt;hr&gt;
&lt;h2 id="output-handling-never-execute-raw-model-output"&gt;Output handling: never execute raw model output&lt;/h2&gt;
&lt;p&gt;This is the most common real-world failure.&lt;/p&gt;
&lt;h3 id="rule-model-output-is-data-not-code"&gt;Rule: model output is data, not code&lt;/h3&gt;
&lt;p&gt;If the agent is generating:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Kubernetes YAML&lt;/li&gt;
&lt;li&gt;SQL statements&lt;/li&gt;
&lt;li&gt;curl commands&lt;/li&gt;
&lt;li&gt;Terraform changes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&amp;hellip;treat the output as untrusted data.&lt;/p&gt;
&lt;p&gt;OWASP&amp;rsquo;s LLM02 guidance exists because people keep wiring LLM output directly into execution paths. [4]&lt;/p&gt;
&lt;h3 id="safer-alternative-structured-intent---validated-execution"&gt;Safer alternative: structured intent -&amp;gt; validated execution&lt;/h3&gt;
&lt;p&gt;Instead of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;LLM writes YAML -&amp;gt; apply&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;LLM proposes a structured change request (schema)&lt;/li&gt;
&lt;li&gt;server validates:
&lt;ul&gt;
&lt;li&gt;allowlisted fields&lt;/li&gt;
&lt;li&gt;bounded ranges&lt;/li&gt;
&lt;li&gt;namespace/tenant scope&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;server executes with known-safe libraries&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is where &amp;ldquo;tool contracts&amp;rdquo; win.&lt;/p&gt;
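&lt;p&gt;As a sketch of the validation step, assume the model proposes a &amp;ldquo;scale deployment&amp;rdquo; request as a JSON object. The field names, the 1 to 20 replica bound, and &lt;code&gt;validate_scale_request&lt;/code&gt; are hypothetical choices for this Python example, not a prescribed schema:&lt;/p&gt;

```python
import re

ALLOWED_FIELDS = {"namespace", "deployment", "replicas"}
ALLOWED_REPLICAS = range(1, 21)  # bounded range: 1 to 20 inclusive
NAME_PATTERN = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def validate_scale_request(request, tenant_namespaces):
    """Return a cleaned copy of the request, or raise ValueError."""
    if not isinstance(request, dict):
        raise ValueError("request must be an object")
    # Allowlisted fields: reject anything missing or unexpected.
    if set(request) != ALLOWED_FIELDS:
        raise ValueError(f"expected exactly these fields: {sorted(ALLOWED_FIELDS)}")
    replicas = request["replicas"]
    # Bounded range; type() also rejects booleans, which Python treats as ints.
    if type(replicas) is not int or replicas not in ALLOWED_REPLICAS:
        raise ValueError("replicas must be an integer from 1 to 20")
    namespace = request["namespace"]
    # Tenant scope: the caller may only touch namespaces it owns.
    if not isinstance(namespace, str) or namespace not in tenant_namespaces:
        raise ValueError("namespace is outside the caller's tenant scope")
    deployment = request["deployment"]
    if not isinstance(deployment, str) or NAME_PATTERN.fullmatch(deployment) is None:
        raise ValueError("deployment must be a lowercase DNS-style name")
    return {"namespace": namespace, "deployment": deployment, "replicas": replicas}
```

&lt;p&gt;Only the cleaned dictionary is handed to the execution layer, which then calls a typed client rather than applying model-written YAML.&lt;/p&gt;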
&lt;hr&gt;
&lt;h2 id="secrets-minimize-scope-rotate"&gt;Secrets: minimize, scope, rotate&lt;/h2&gt;
&lt;p&gt;Secrets are the other common failure path.&lt;/p&gt;
&lt;h3 id="minimum-viable-rules"&gt;Minimum viable rules&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Never&lt;/strong&gt; put long-lived secrets in prompts.&lt;/li&gt;
&lt;li&gt;Prefer short-lived tokens and scoped credentials.&lt;/li&gt;
&lt;li&gt;Inject secrets server-side, not in the model context.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;OWASP&amp;rsquo;s Secrets Management Cheat Sheet is a good baseline for central storage, rotation, auditing, and least privilege. [5]&lt;/p&gt;
&lt;h3 id="scope-secrets-to-tenants-and-tools"&gt;Scope secrets to tenants and tools&lt;/h3&gt;
&lt;p&gt;Instead of &amp;ldquo;one OAuth token for everything,&amp;rdquo; mint:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;per tenant&lt;/li&gt;
&lt;li&gt;per tool category&lt;/li&gt;
&lt;li&gt;short TTL&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When something goes wrong, you want the blast radius small and revocation easy.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="network-and-egress-controls"&gt;Network and egress controls&lt;/h2&gt;
&lt;p&gt;If your agent system can reach the open internet or internal networks, you need guardrails.&lt;/p&gt;
&lt;h3 id="egress-allowlists"&gt;Egress allowlists&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;allowlist domains for integrations&lt;/li&gt;
&lt;li&gt;block metadata IP ranges&lt;/li&gt;
&lt;li&gt;re-validate after redirects&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;OWASP&amp;rsquo;s SSRF prevention guidance provides practical patterns for validation and blocking internal addresses. [3]&lt;/p&gt;
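&lt;p&gt;A hedged Python sketch of these three rules. The allowlisted hostnames are placeholders, the helper names are hypothetical, and DNS resolution is assumed to happen elsewhere, with the resolved addresses passed in so they can be checked too:&lt;/p&gt;

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"api.example.com", "hooks.example.org"}  # placeholder integrations


def is_internal(address):
    """True for private, loopback, link-local (cloud metadata), or reserved IPs."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return True  # fail closed on anything that is not a valid IP
    return ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved


def egress_allowed(url, resolved_ips=()):
    """Allow only HTTPS to allowlisted hosts that do not resolve inward."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.scheme != "https" or host not in ALLOWED_HOSTS:
        return False
    return not any(is_internal(ip) for ip in resolved_ips)


def redirect_chain_allowed(urls):
    """Re-validate every hop, not just the first URL."""
    return all(egress_allowed(url) for url in urls)
```

&lt;p&gt;Checking the parsed hostname rather than the raw string matters: a URL such as &lt;code&gt;https://api.example.com@169.254.169.254/&lt;/code&gt; looks allowlisted at a glance but actually targets the metadata address.&lt;/p&gt;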
&lt;h3 id="separate-network-planes"&gt;Separate network planes&lt;/h3&gt;
&lt;p&gt;Keep tool servers in a network segment that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;can reach only what they need&lt;/li&gt;
&lt;li&gt;cannot reach internal admin endpoints&lt;/li&gt;
&lt;li&gt;cannot reach secrets stores directly unless necessary&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="logging-and-audit-without-data-leaks"&gt;Logging and audit without data leaks&lt;/h2&gt;
&lt;p&gt;Logging is security. Logging is also a leak vector.&lt;/p&gt;
&lt;p&gt;OWASP&amp;rsquo;s Logging Cheat Sheet calls out that logs may contain personal and sensitive information and must be protected from misuse. [6]&lt;/p&gt;
&lt;h3 id="practical-logging-rules"&gt;Practical logging rules&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;do not log raw prompts by default&lt;/li&gt;
&lt;li&gt;do not log raw tool payloads by default&lt;/li&gt;
&lt;li&gt;log structured summaries:
&lt;ul&gt;
&lt;li&gt;tool name&lt;/li&gt;
&lt;li&gt;action class&lt;/li&gt;
&lt;li&gt;resource IDs (safe identifiers)&lt;/li&gt;
&lt;li&gt;status&lt;/li&gt;
&lt;li&gt;latency&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;store audit events separately from debug logs&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="audit-events-always-on"&gt;Audit events (always on)&lt;/h3&gt;
&lt;p&gt;Every write/danger tool should emit:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;who / what / when / result&lt;/li&gt;
&lt;li&gt;plan_id / idempotency_key&lt;/li&gt;
&lt;li&gt;before/after resource identifiers (not content)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Audit is what makes &amp;ldquo;agents in production&amp;rdquo; defensible to security and compliance teams.&lt;/p&gt;
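&lt;p&gt;One possible shape for such an event, sketched in Python. The field names mirror the list above; &lt;code&gt;audit_event&lt;/code&gt; and its fixed field set are assumptions for illustration, and where the JSON line is shipped is left open:&lt;/p&gt;

```python
import json
from datetime import datetime, timezone

AUDIT_FIELDS = ("actor", "tool", "action_class", "resource_ids", "result",
                "plan_id", "idempotency_key")


def audit_event(now=None, **fields):
    """Return one JSON audit line; unknown fields are refused, not logged."""
    unexpected = set(fields) - set(AUDIT_FIELDS)
    if unexpected:
        # Refusing extras keeps prompts and payload content out of the audit trail.
        raise ValueError(f"not an audit field: {sorted(unexpected)}")
    event = {name: fields.get(name) for name in AUDIT_FIELDS}
    event["at"] = (now or datetime.now(timezone.utc)).isoformat()
    return json.dumps(event, sort_keys=True)
```

&lt;p&gt;Because the schema is closed, adding raw content to an audit record requires a deliberate code change rather than an accidental extra keyword argument.&lt;/p&gt;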
&lt;hr&gt;
&lt;h2 id="a-production-checklist"&gt;A production checklist&lt;/h2&gt;
&lt;h3 id="identity-and-authorization"&gt;Identity and authorization&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Strong auth for clients.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Least-privilege scopes per tool.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; MCP HTTP authorization flow implemented where applicable. [2]&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="tool-contracts"&gt;Tool contracts&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Tools tiered: read/write/danger.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Preview -&amp;gt; apply for dangerous actions.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Schema validation + bounded arguments.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="output-handling"&gt;Output handling&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; No raw model output is executed without validation (OWASP LLM02). [4]&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="secrets"&gt;Secrets&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Secrets never placed in prompts.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Short-lived, scoped tokens used.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Rotation/audit practices exist (OWASP Secrets Mgmt). [5]&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="network"&gt;Network&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Egress allowlists exist.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; SSRF protections implemented. [3]&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="logging-and-audit"&gt;Logging and audit&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Logs are redacted and access-controlled.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Audit events exist for all side-effecting tools.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Log systems protected per OWASP guidance. [6]&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;p&gt;[1] OWASP - Top 10 for Large Language Model Applications (v1.1): &lt;a href="https://owasp.org/www-project-top-10-for-large-language-model-applications/" target="_blank" rel="noopener noreferrer"&gt;https://owasp.org/www-project-top-10-for-large-language-model-applications/&lt;/a&gt;
[2] Model Context Protocol (MCP) - Authorization (Protocol Revision 2025-11-25): &lt;a href="https://modelcontextprotocol.io/specification/2025-11-25/basic/authorization" target="_blank" rel="noopener noreferrer"&gt;https://modelcontextprotocol.io/specification/2025-11-25/basic/authorization&lt;/a&gt;
[3] OWASP - SSRF Prevention Cheat Sheet: &lt;a href="https://cheatsheetseries.owasp.org/cheatsheets/Server_Side_Request_Forgery_Prevention_Cheat_Sheet.html" target="_blank" rel="noopener noreferrer"&gt;https://cheatsheetseries.owasp.org/cheatsheets/Server_Side_Request_Forgery_Prevention_Cheat_Sheet.html&lt;/a&gt;
[4] OWASP GenAI Security Project - LLM02: Insecure Output Handling: &lt;a href="https://genai.owasp.org/llmrisk2023-24/llm02-insecure-output-handling/" target="_blank" rel="noopener noreferrer"&gt;https://genai.owasp.org/llmrisk2023-24/llm02-insecure-output-handling/&lt;/a&gt;
[5] OWASP - Secrets Management Cheat Sheet: &lt;a href="https://cheatsheetseries.owasp.org/cheatsheets/Secrets_Management_Cheat_Sheet.html" target="_blank" rel="noopener noreferrer"&gt;https://cheatsheetseries.owasp.org/cheatsheets/Secrets_Management_Cheat_Sheet.html&lt;/a&gt;
[6] OWASP - Logging Cheat Sheet: &lt;a href="https://cheatsheetseries.owasp.org/cheatsheets/Logging_Cheat_Sheet.html" target="_blank" rel="noopener noreferrer"&gt;https://cheatsheetseries.owasp.org/cheatsheets/Logging_Cheat_Sheet.html&lt;/a&gt;
&lt;/p&gt;</content:encoded></item></channel></rss>