<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Platform-Engineering | Roy Gabriel</title><link>https://roygabriel.dev/tags/platform-engineering/</link><description>Roy Gabriel: DevOps Architect &amp; Applied AI Engineer. Technical blog on Go, MCP servers, Kubernetes, and production AI systems.</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Fri, 27 Feb 2026 03:18:04 +0000</lastBuildDate><atom:link href="https://roygabriel.dev/tags/platform-engineering/index.xml" rel="self" type="application/rss+xml"/><item><title>When Enterprise Defaults Become Enterprise Debt</title><link>https://roygabriel.dev/blog/enterprise-defaults-enterprise-debt/</link><pubDate>Sat, 07 Feb 2026 09:00:00 -0500</pubDate><guid>https://roygabriel.dev/blog/enterprise-defaults-enterprise-debt/</guid><description>&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;&lt;strong&gt;Note on examples:&lt;/strong&gt; The scenarios below are &lt;strong&gt;anonymized composites&lt;/strong&gt;. They&amp;rsquo;re not a critique of any one organization; they&amp;rsquo;re patterns that repeat across industries.
The goal isn&amp;rsquo;t to &amp;ldquo;modernize for fun.&amp;rdquo; It&amp;rsquo;s to protect speed-to-market &lt;em&gt;and&lt;/em&gt; reliability as systems and organizations scale.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;Most enterprises don&amp;rsquo;t lose because they picked the &amp;ldquo;wrong&amp;rdquo; framework or cloud provider. They lose because old defaults - once rational - become invisible policy.&lt;/p&gt;</description><content:encoded>
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;&lt;strong&gt;Note on examples:&lt;/strong&gt; The scenarios below are &lt;strong&gt;anonymized composites&lt;/strong&gt;. They&amp;rsquo;re not a critique of any one organization; they&amp;rsquo;re patterns that repeat across industries.
The goal isn&amp;rsquo;t to &amp;ldquo;modernize for fun.&amp;rdquo; It&amp;rsquo;s to protect speed-to-market &lt;em&gt;and&lt;/em&gt; reliability as systems and organizations scale.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;Most enterprises don&amp;rsquo;t lose because they picked the &amp;ldquo;wrong&amp;rdquo; framework or cloud provider. They lose because old defaults - once rational - become invisible policy.&lt;/p&gt;
&lt;p&gt;The 90s and early 2000s optimized for constraints that were real at the time:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;hardware was expensive&lt;/li&gt;
&lt;li&gt;automation was immature&lt;/li&gt;
&lt;li&gt;environments were scarce&lt;/li&gt;
&lt;li&gt;security controls were largely manual&lt;/li&gt;
&lt;li&gt;uptime was achieved by cautious change, not by safe change&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Those constraints have shifted. But many organizations still run on &lt;strong&gt;architectural and governance defaults&lt;/strong&gt; designed for a different era.&lt;/p&gt;
&lt;p&gt;The result is predictable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;innovation slows&lt;/strong&gt; (lead time grows)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;quality degrades&lt;/strong&gt; (late integration + big-bang changes)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;reliability suffers&lt;/strong&gt; (risk is batched, blast radius expands)&lt;/li&gt;
&lt;li&gt;engineers spend more time navigating the system than improving it&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you want a single-sentence summary: &lt;strong&gt;old patterns don&amp;rsquo;t just slow delivery - they also create the conditions for outages.&lt;/strong&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="tldr"&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Retire &amp;ldquo;analysis as delivery.&amp;rdquo; Timebox discovery and ship thin vertical slices.&lt;/li&gt;
&lt;li&gt;Treat cloud primitives as &lt;em&gt;primitives&lt;/em&gt;, not research projects (e.g., object storage is solved).&lt;/li&gt;
&lt;li&gt;Default to &lt;strong&gt;containers + orchestration&lt;/strong&gt; for most stateless services; use VMs deliberately, not reflexively. [5]&lt;/li&gt;
&lt;li&gt;Replace ticket queues and approval boards with &lt;strong&gt;guardrails + paved roads + policy-as-code&lt;/strong&gt;. [7][8]&lt;/li&gt;
&lt;li&gt;Measure what matters: &lt;strong&gt;lead time, deploy frequency, change failure rate, MTTR&lt;/strong&gt;. [1][2]&lt;/li&gt;
&lt;li&gt;Modernization works best as an incremental program, not a rewrite (Strangler Fig pattern). [12]&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="contents"&gt;Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#pattern-1-analysis-as-a-substitute-for-delivery"&gt;Pattern 1: Analysis as a substitute for delivery&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-2-reinventing-commodity-infrastructure"&gt;Pattern 2: Reinventing commodity infrastructure&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-3-vm-first-thinking-as-the-default"&gt;Pattern 3: VM-first thinking as the default&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-4-ticket-driven-infrastructure"&gt;Pattern 4: Ticket-driven infrastructure&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-5-change-advisory-board-for-routine-changes"&gt;Pattern 5: Change Advisory Board for routine changes&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-6-the-shared-database-empire"&gt;Pattern 6: The shared database empire&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-7-central-integration-as-a-chokepoint"&gt;Pattern 7: Central integration as a chokepoint&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-8-perma-pocs-and-innovation-theater"&gt;Pattern 8: Perma-POCs and innovation theater&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#replace-committees-with-guardrails"&gt;Replace committees with guardrails&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#modernize-without-a-rewrite"&gt;Modernize without a rewrite&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#verification-how-you-know-its-working"&gt;Verification: how you know it&amp;rsquo;s working&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-practical-checklist"&gt;A practical checklist&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#references"&gt;References&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="pattern-1-analysis-as-a-substitute-for-delivery"&gt;Pattern 1: Analysis as a substitute for delivery&lt;/h2&gt;
&lt;h3 id="what-it-looks-like"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;A team spends months (sometimes a year) doing &amp;ldquo;analysis&amp;rdquo; for a capability that won&amp;rsquo;t be used until it&amp;rsquo;s built - often with the intention of eliminating all risk up front.&lt;/p&gt;
&lt;p&gt;Common examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;multi-tenant &amp;ldquo;high availability image storage&amp;rdquo; designed from scratch&lt;/li&gt;
&lt;li&gt;designing bespoke event systems when managed queues exist&lt;/li&gt;
&lt;li&gt;writing 40-page architecture documents before the first running slice exists&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-it-existed"&gt;Why it existed&lt;/h3&gt;
&lt;p&gt;When provisioning took weeks and environments were scarce, analysis was a rational risk-reducer.&lt;/p&gt;
&lt;h3 id="the-hidden-tax"&gt;The hidden tax&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;You push real learning to the end (integration failures happen late).&lt;/li&gt;
&lt;li&gt;Decisions get made with imaginary constraints, not measured ones.&lt;/li&gt;
&lt;li&gt;Teams optimize for &amp;ldquo;approval&amp;rdquo; rather than &amp;ldquo;outcome.&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-replacement-pattern"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Timebox discovery and require a running slice early.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A strong default:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1-2 week spike to validate constraints&lt;/li&gt;
&lt;li&gt;a thin vertical slice in production (even behind a flag)&lt;/li&gt;
&lt;li&gt;iterate based on real telemetry and user feedback&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="transition-step-low-drama"&gt;Transition step (low drama)&lt;/h3&gt;
&lt;p&gt;Create an &amp;ldquo;RFC-lite&amp;rdquo; template:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;problem statement + constraints&lt;/li&gt;
&lt;li&gt;1-2 options with tradeoffs&lt;/li&gt;
&lt;li&gt;a &lt;strong&gt;plan to measure&lt;/strong&gt; (latency, cost, reliability)&lt;/li&gt;
&lt;li&gt;a &lt;strong&gt;thin-slice milestone&lt;/strong&gt; date&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="pattern-2-reinventing-commodity-infrastructure"&gt;Pattern 2: Reinventing commodity infrastructure&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-1"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;Teams treat widely proven primitives as novel:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;object storage&lt;/li&gt;
&lt;li&gt;queues&lt;/li&gt;
&lt;li&gt;identity&lt;/li&gt;
&lt;li&gt;metrics + tracing&lt;/li&gt;
&lt;li&gt;load balancing&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A classic symptom: &amp;ldquo;We need to design HA multi-tenant object storage,&amp;rdquo; as if durable object storage isn&amp;rsquo;t already a standard building block.&lt;/p&gt;
&lt;h3 id="why-it-existed-1"&gt;Why it existed&lt;/h3&gt;
&lt;p&gt;On-prem and early hosting eras forced you to build a lot yourself.&lt;/p&gt;
&lt;h3 id="the-hidden-tax-1"&gt;The hidden tax&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Reinventing primitives becomes a multi-quarter project.&lt;/li&gt;
&lt;li&gt;Reliability becomes your problem (and you will be on call for it).&lt;/li&gt;
&lt;li&gt;The business pays for the same capability twice: once in time, and again in incidents.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-replacement-pattern-1"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Default to &lt;strong&gt;managed or proven primitives&lt;/strong&gt; unless you have a documented reason not to.&lt;/p&gt;
&lt;p&gt;For example, Amazon S3 documents a design target of 99.999999999% (11 nines) object durability and 99.99% availability for its standard storage class, and other major providers publish comparable targets (provider details vary). [11]&lt;/p&gt;
&lt;h3 id="transition-step"&gt;Transition step&lt;/h3&gt;
&lt;p&gt;Maintain a &amp;ldquo;Reference Implementations&amp;rdquo; catalog:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;How we do object storage&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;How we do queues&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;How we do auth&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;How we do telemetry&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If the default is documented and supported, teams stop re-litigating fundamentals.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="pattern-3-vm-first-thinking-as-the-default"&gt;Pattern 3: VM-first thinking as the default&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-2"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;Everything runs on VMs because &amp;ldquo;that&amp;rsquo;s what we do,&amp;rdquo; even when the workload is a stateless API, worker, or event consumer.&lt;/p&gt;
&lt;h3 id="why-it-existed-2"&gt;Why it existed&lt;/h3&gt;
&lt;p&gt;VMs were the universal unit of deployment for a long time, and they map cleanly to org boundaries (&amp;ldquo;this server is mine&amp;rdquo;).&lt;/p&gt;
&lt;h3 id="the-hidden-tax-2"&gt;The hidden tax&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;drift (snowflake servers)&lt;/li&gt;
&lt;li&gt;slow rollouts&lt;/li&gt;
&lt;li&gt;inconsistent security posture&lt;/li&gt;
&lt;li&gt;wasted compute due to poor bin-packing&lt;/li&gt;
&lt;li&gt;limited standardization across services&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-replacement-pattern-2"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;For many enterprise services, &lt;strong&gt;containers orchestrated by Kubernetes&lt;/strong&gt; are a strong default for stateless workloads. The Kubernetes documentation describes Deployments as a good fit for managing stateless applications where Pods are interchangeable and replaceable. [5]&lt;/p&gt;
&lt;p&gt;This doesn&amp;rsquo;t mean &amp;ldquo;Kubernetes for everything,&amp;rdquo; but it does mean:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;prefer declarative workloads with health checks and rollout controls&lt;/li&gt;
&lt;li&gt;keep VMs for deliberate cases (legacy constraints, special licensing, unique state, or when orchestration adds no value)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="transition-step-1"&gt;Transition step&lt;/h3&gt;
&lt;p&gt;Start with &amp;ldquo;Kubernetes-first for new stateless services,&amp;rdquo; not a migration mandate.&lt;/p&gt;
&lt;p&gt;Then build operational guardrails:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;resource requests/limits so services behave predictably under load [6]&lt;/li&gt;
&lt;li&gt;standardized readiness/liveness probes&lt;/li&gt;
&lt;li&gt;standard ingress + auth patterns&lt;/li&gt;
&lt;/ul&gt;
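&lt;p&gt;One way to make these guardrails concrete is a small policy check that runs before a manifest is applied. The sketch below is illustrative only: it assumes the container spec has already been parsed into a Python dict, the function name &lt;code&gt;guardrail_violations&lt;/code&gt; is hypothetical, and requiring both CPU and memory limits is a policy choice that some teams deliberately relax. Only the field names (&lt;code&gt;resources.requests&lt;/code&gt;, &lt;code&gt;resources.limits&lt;/code&gt;, &lt;code&gt;readinessProbe&lt;/code&gt;, &lt;code&gt;livenessProbe&lt;/code&gt;) come from the Kubernetes container spec.&lt;/p&gt;

```python
def guardrail_violations(container):
    """Return human-readable guardrail violations for one container spec.

    `container` is assumed to be a dict shaped like a Kubernetes container
    entry. The rules checked here are an example policy, not a standard.
    """
    violations = []

    # Requests and limits make scheduling and behavior under load predictable.
    resources = container.get("resources", {})
    for section in ("requests", "limits"):
        values = resources.get(section, {})
        for resource in ("cpu", "memory"):
            if resource not in values:
                violations.append(f"missing resources.{section}.{resource}")

    # Probes let the orchestrator route traffic and restart safely.
    for probe in ("readinessProbe", "livenessProbe"):
        if probe not in container:
            violations.append(f"missing {probe}")

    return violations


if __name__ == "__main__":
    incomplete = {"name": "api", "resources": {"requests": {"cpu": "100m"}}}
    for violation in guardrail_violations(incomplete):
        print(violation)
```

&lt;p&gt;An empty list means the container passes. Returning every violation at once, rather than stopping at the first, keeps the feedback loop short for the team fixing the manifest.&lt;/p&gt;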
&lt;hr&gt;
&lt;h2 id="pattern-4-ticket-driven-infrastructure"&gt;Pattern 4: Ticket-driven infrastructure&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-3"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;Need a database? Ticket.
Need an environment? Ticket.
Need DNS? Ticket.
Need a queue? Ticket.&lt;/p&gt;
&lt;p&gt;Eventually, the ticketing system becomes the true control plane.&lt;/p&gt;
&lt;h3 id="why-it-existed-3"&gt;Why it existed&lt;/h3&gt;
&lt;p&gt;It&amp;rsquo;s a reasonable response when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;environments are scarce&lt;/li&gt;
&lt;li&gt;changes are risky&lt;/li&gt;
&lt;li&gt;platform knowledge is specialized&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-hidden-tax-3"&gt;The hidden tax&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;queues become normalized (&amp;ldquo;it takes 3 weeks to get a namespace&amp;rdquo;)&lt;/li&gt;
&lt;li&gt;teams route around the platform&lt;/li&gt;
&lt;li&gt;reliability doesn&amp;rsquo;t improve; delivery just slows&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-replacement-pattern-3"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Self-service via &lt;strong&gt;GitOps&lt;/strong&gt; and platform &amp;ldquo;paved roads.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;OpenGitOps describes itself as a set of open-source standards and best practices that help organizations adopt a structured, standardized approach to implementing GitOps. [7] The point isn&amp;rsquo;t a specific tool - it&amp;rsquo;s the principle: &lt;strong&gt;desired state is declarative and auditable.&lt;/strong&gt;&lt;/p&gt;
&lt;h3 id="transition-step-2"&gt;Transition step&lt;/h3&gt;
&lt;p&gt;Pick one high-frequency request and eliminate it:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;create a service with a standard ingress/auth/telemetry&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;provision a queue&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;create a dev environment&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Make the paved road the path of least resistance.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="pattern-5-change-advisory-board-for-routine-changes"&gt;Pattern 5: Change Advisory Board for routine changes&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-4"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;Every change - routine or risky - requires synchronous approval.&lt;/p&gt;
&lt;h3 id="why-it-existed-4"&gt;Why it existed&lt;/h3&gt;
&lt;p&gt;When changes were large, rare, and manual, centralized review reduced catastrophic surprises.&lt;/p&gt;
&lt;h3 id="the-hidden-tax-4"&gt;The hidden tax&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;you batch changes (bigger releases are riskier)&lt;/li&gt;
&lt;li&gt;emergency changes bypass process (creating inconsistency)&lt;/li&gt;
&lt;li&gt;&amp;ldquo;approval&amp;rdquo; becomes the goal rather than &lt;strong&gt;evidence of safety&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;DORA&amp;rsquo;s guidance on streamlining change approval emphasizes making the regular change process fast and reliable enough that it can handle emergencies, and reframes how CAB fits into continuous delivery. [3] Continuous delivery literature makes a similar point: smaller, more frequent changes reduce risk and ease remediation. [4]&lt;/p&gt;
&lt;h3 id="the-replacement-pattern-4"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Move to &lt;strong&gt;evidence-based change approval&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;automated tests&lt;/li&gt;
&lt;li&gt;policy-as-code checks&lt;/li&gt;
&lt;li&gt;progressive delivery (canaries, phased rollouts)&lt;/li&gt;
&lt;li&gt;real-time telemetry tied to the release&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="transition-step-3"&gt;Transition step&lt;/h3&gt;
&lt;p&gt;Keep CAB, but change its scope:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;focus on high-risk changes and cross-team coordination&lt;/li&gt;
&lt;li&gt;use automation and metrics for routine changes&lt;/li&gt;
&lt;/ul&gt;
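&lt;p&gt;To illustrate how evidence can replace a synchronous meeting for routine changes, here is a minimal sketch of a routing decision. Everything in it is an assumption made for illustration: the &lt;code&gt;ChangeEvidence&lt;/code&gt; fields, the &lt;code&gt;review_route&lt;/code&gt; function, the route names, and the default error-rate tolerance would all need to be defined by your own pipeline and risk policy.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class ChangeEvidence:
    """Evidence collected by the pipeline for one change (hypothetical shape)."""
    tests_passed: bool
    policy_checks_passed: bool
    canary_error_rate: float    # fraction of failed requests on the canary
    baseline_error_rate: float  # same metric on the current stable version
    high_risk: bool = False     # e.g. schema change or cross-team impact


def review_route(evidence, max_error_rate_increase=0.005):
    """Decide how a change proceeds based on evidence, not on a meeting."""
    # Failing automated checks blocks the change outright.
    if not (evidence.tests_passed and evidence.policy_checks_passed):
        return "blocked"
    # A canary that is measurably worse than baseline is rolled back.
    increase = evidence.canary_error_rate - evidence.baseline_error_rate
    if increase > max_error_rate_increase:
        return "rollback"
    # Only high-risk changes still go to the board.
    if evidence.high_risk:
        return "cab-review"
    return "auto-approved"


if __name__ == "__main__":
    routine = ChangeEvidence(True, True, 0.002, 0.001)
    print(review_route(routine))
```

&lt;p&gt;The order of the checks is the design choice: automated evidence runs first for every change, and human review is reserved for the small set of changes flagged as high risk.&lt;/p&gt;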
&lt;hr&gt;
&lt;h2 id="pattern-6-the-shared-database-empire"&gt;Pattern 6: The shared database empire&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-5"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;A central database is shared by many services.
Teams coordinate schema changes across multiple apps and releases.&lt;/p&gt;
&lt;p&gt;Microservices.io describes the &amp;ldquo;shared database&amp;rdquo; pattern explicitly: multiple services access a single database directly. [10]&lt;/p&gt;
&lt;h3 id="why-it-existed-5"&gt;Why it existed&lt;/h3&gt;
&lt;p&gt;It&amp;rsquo;s simple at first:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;one place for data&lt;/li&gt;
&lt;li&gt;easy joins&lt;/li&gt;
&lt;li&gt;one backup plan&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-hidden-tax-5"&gt;The hidden tax&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;coupling spreads everywhere&lt;/li&gt;
&lt;li&gt;every change becomes cross-team work&lt;/li&gt;
&lt;li&gt;reliability suffers because one DB problem becomes everyone&amp;rsquo;s problem&lt;/li&gt;
&lt;li&gt;schema evolution becomes political&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-replacement-pattern-5"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Prefer service-owned data boundaries. Microservices.io&amp;rsquo;s &amp;ldquo;database per service&amp;rdquo; pattern describes keeping a service&amp;rsquo;s data private and accessible only via its API. [9]&lt;/p&gt;
&lt;h3 id="transition-step-4"&gt;Transition step&lt;/h3&gt;
&lt;p&gt;You don&amp;rsquo;t have to &amp;ldquo;microservices everything.&amp;rdquo;
Start by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;carving out new tables owned by one service&lt;/li&gt;
&lt;li&gt;introducing an API boundary&lt;/li&gt;
&lt;li&gt;migrating consumers gradually&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="pattern-7-central-integration-as-a-chokepoint"&gt;Pattern 7: Central integration as a chokepoint&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-6"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;All integrations must go through a single shared integration layer/team (classic ESB gravity).&lt;/p&gt;
&lt;h3 id="why-it-existed-6"&gt;Why it existed&lt;/h3&gt;
&lt;p&gt;Centralizing integration gave consistency when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;protocols were messy&lt;/li&gt;
&lt;li&gt;tooling was expensive&lt;/li&gt;
&lt;li&gt;teams lacked automation&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-hidden-tax-6"&gt;The hidden tax&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;integration lead times explode&lt;/li&gt;
&lt;li&gt;teams stop experimenting&lt;/li&gt;
&lt;li&gt;one backlog becomes everyone&amp;rsquo;s bottleneck&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-replacement-pattern-6"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Standardize:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;interfaces&lt;/strong&gt; (auth, tracing, deployment, contract testing)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;platform guardrails&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&amp;hellip;not every internal implementation detail.&lt;/p&gt;
&lt;h3 id="transition-step-5"&gt;Transition step&lt;/h3&gt;
&lt;p&gt;Carve out one &amp;ldquo;self-service integration&amp;rdquo; paved road:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;standard service template&lt;/li&gt;
&lt;li&gt;standard auth&lt;/li&gt;
&lt;li&gt;standard telemetry&lt;/li&gt;
&lt;li&gt;contracts + examples&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="pattern-8-perma-pocs-and-innovation-theater"&gt;Pattern 8: Perma-POCs and innovation theater&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-7"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;Prototypes exist forever, never becoming production systems.&lt;/p&gt;
&lt;p&gt;Especially common with AI initiatives:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;impressive demos&lt;/li&gt;
&lt;li&gt;no production constraints&lt;/li&gt;
&lt;li&gt;no ownership for operability&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-it-existed-7"&gt;Why it existed&lt;/h3&gt;
&lt;p&gt;POCs are a safe way to explore unknowns.&lt;/p&gt;
&lt;h3 id="the-hidden-tax-7"&gt;The hidden tax&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;teams lose trust (&amp;ldquo;innovation never ships&amp;rdquo;)&lt;/li&gt;
&lt;li&gt;production teams inherit half-baked work&lt;/li&gt;
&lt;li&gt;opportunity cost compounds&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-replacement-pattern-7"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;From day one, require:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;an owner&lt;/li&gt;
&lt;li&gt;a production path&lt;/li&gt;
&lt;li&gt;a thin slice in a real environment&lt;/li&gt;
&lt;li&gt;explicit safety requirements (timeouts, budgets, telemetry)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="transition-step-6"&gt;Transition step&lt;/h3&gt;
&lt;p&gt;Make &amp;ldquo;POC exit criteria&amp;rdquo; mandatory:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;what metrics prove value?&lt;/li&gt;
&lt;li&gt;what is the minimum shippable slice?&lt;/li&gt;
&lt;li&gt;what must be true for reliability and security?&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="replace-committees-with-guardrails"&gt;Replace committees with guardrails&lt;/h2&gt;
&lt;p&gt;A recurring theme: &lt;strong&gt;humans are expensive control planes&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The modern move is to convert &amp;ldquo;tribal rules&amp;rdquo; into:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;templates&lt;/li&gt;
&lt;li&gt;automation&lt;/li&gt;
&lt;li&gt;policy-as-code&lt;/li&gt;
&lt;li&gt;paved paths&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Microsoft&amp;rsquo;s platform engineering work describes &amp;ldquo;paved paths&amp;rdquo; within an internal developer platform as recommended paths to production that guide developers through requirements without sacrificing velocity. [8]&lt;/p&gt;
&lt;p&gt;Guardrails beat gatekeepers because guardrails are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;consistent&lt;/li&gt;
&lt;li&gt;fast&lt;/li&gt;
&lt;li&gt;auditable&lt;/li&gt;
&lt;li&gt;scalable&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="modernize-without-a-rewrite"&gt;Modernize without a rewrite&lt;/h2&gt;
&lt;p&gt;Big-bang rewrites are expensive and risky. Incremental modernization is usually the winning move.&lt;/p&gt;
&lt;p&gt;The Strangler Fig pattern is a well-known approach: wrap or route traffic so you can replace parts of a legacy system gradually. [12]&lt;/p&gt;
&lt;p&gt;Practical approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;put a facade in front of the legacy surface&lt;/li&gt;
&lt;li&gt;carve off one slice at a time&lt;/li&gt;
&lt;li&gt;measure outcomes&lt;/li&gt;
&lt;li&gt;keep rollback easy&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This isn&amp;rsquo;t glamorous. It works.&lt;/p&gt;
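&lt;p&gt;The facade step can be sketched in a few lines. This is a simplified, in-process illustration rather than a production router: the &lt;code&gt;StranglerFacade&lt;/code&gt; class and its method names are hypothetical, handlers are plain callables standing in for the legacy system and the new services, and routing is by path prefix only.&lt;/p&gt;

```python
class StranglerFacade:
    """Routes each request to a migrated handler if one exists, else to legacy."""

    def __init__(self, legacy_handler):
        self._legacy = legacy_handler
        self._migrated = {}  # path prefix -> handler for the new implementation

    def migrate(self, prefix, handler):
        """Carve off one slice: send this prefix to the new implementation."""
        self._migrated[prefix] = handler

    def rollback(self, prefix):
        """Keep rollback easy: removing the route restores legacy behavior."""
        self._migrated.pop(prefix, None)

    def handle(self, path):
        # Check longer prefixes first so the most specific route wins.
        for prefix in sorted(self._migrated, key=len, reverse=True):
            if path.startswith(prefix):
                return self._migrated[prefix](path)
        return self._legacy(path)


if __name__ == "__main__":
    facade = StranglerFacade(lambda path: "legacy:" + path)
    facade.migrate("/orders", lambda path: "new:" + path)
    print(facade.handle("/orders/1"))
    print(facade.handle("/invoices/9"))
```

&lt;p&gt;Because callers only ever talk to the facade, each slice can move to the new implementation, or back to the legacy one, without a coordinated release.&lt;/p&gt;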
&lt;hr&gt;
&lt;h2 id="verification-how-you-know-its-working"&gt;Verification: how you know it&amp;rsquo;s working&lt;/h2&gt;
&lt;p&gt;If you want to avoid &amp;ldquo;modernization theater,&amp;rdquo; measure.&lt;/p&gt;
&lt;p&gt;DORA&amp;rsquo;s metrics guidance is a solid baseline: deployment frequency, lead time for changes, change failure rate, and time to restore service (MTTR). [1] The 2024 DORA report continues to focus on the organizational capabilities that drive high performance. [2]&lt;/p&gt;
&lt;p&gt;A simple evidence loop:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Pick one value stream (one product or platform slice).&lt;/li&gt;
&lt;li&gt;Baseline the four DORA metrics.&lt;/li&gt;
&lt;li&gt;Remove one friction point (one pattern).&lt;/li&gt;
&lt;li&gt;Re-measure.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If your metrics don&amp;rsquo;t move, you didn&amp;rsquo;t remove the real constraint.&lt;/p&gt;
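&lt;p&gt;Baselining does not require a dedicated tool to get started. The sketch below computes the four metrics from a list of deployment records. The record shape, the &lt;code&gt;dora_snapshot&lt;/code&gt; function, and the choice to measure restore time from deployment to restoration are assumptions for illustration; DORA defines the metrics, not this data model.&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None


def _hours(delta):
    return delta.total_seconds() / 3600


def dora_snapshot(deployments, window_days):
    """Summarize the four DORA metrics for one value stream and time window."""
    if not deployments:
        raise ValueError("need at least one deployment to compute a baseline")

    lead_times = [_hours(d.deployed_at - d.committed_at) for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [
        _hours(d.restored_at - d.deployed_at)
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time_hours": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        # None means no failed deployment was restored in this window.
        "median_restore_hours": median(restore_times) if restore_times else None,
    }
```

&lt;p&gt;Run it once before removing a friction point and once after, over windows of the same length, and compare the two snapshots.&lt;/p&gt;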
&lt;hr&gt;
&lt;h2 id="a-practical-checklist"&gt;A practical checklist&lt;/h2&gt;
&lt;p&gt;If you&amp;rsquo;re trying to retire &amp;ldquo;enterprise debt&amp;rdquo; safely:&lt;/p&gt;
&lt;h3 id="delivery"&gt;Delivery&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Timebox analysis; require a running slice early.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Prefer small changes and frequent releases; avoid batching.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="platform"&gt;Platform&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Provide a paved road for common workflows (service template, auth, telemetry). [8]&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Remove ticket queues for repeatable requests (self-service + GitOps). [7]&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="reliability"&gt;Reliability&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Standardize timeouts, retries, budgets, and resource requests/limits. [6]&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Use progressive delivery where risk is high.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="architecture"&gt;Architecture&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Reduce shared DB coupling; establish service-owned boundaries. [9][10]&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Modernize incrementally (Strangler Fig), not via big-bang rewrites. [12]&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="governance"&gt;Governance&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Replace routine approvals with evidence: tests + policy-as-code + telemetry. [3][4]&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;p&gt;[1] DORA - &amp;ldquo;DORA&amp;rsquo;s software delivery performance metrics (guide)&amp;rdquo;. &lt;a href="https://dora.dev/guides/dora-metrics/" target="_blank" rel="noopener noreferrer"&gt;https://dora.dev/guides/dora-metrics/&lt;/a&gt;
[2] DORA - &amp;ldquo;Accelerate State of DevOps Report 2024&amp;rdquo;. &lt;a href="https://dora.dev/research/2024/dora-report/" target="_blank" rel="noopener noreferrer"&gt;https://dora.dev/research/2024/dora-report/&lt;/a&gt;
[3] DORA - &amp;ldquo;Streamlining change approval (capability)&amp;rdquo;. &lt;a href="https://dora.dev/capabilities/streamlining-change-approval/" target="_blank" rel="noopener noreferrer"&gt;https://dora.dev/capabilities/streamlining-change-approval/&lt;/a&gt;
[4] ContinuousDelivery.com - &amp;ldquo;Continuous Delivery and ITIL: Change Management&amp;rdquo;. &lt;a href="https://continuousdelivery.com/2010/11/continuous-delivery-and-itil-change-management/" target="_blank" rel="noopener noreferrer"&gt;https://continuousdelivery.com/2010/11/continuous-delivery-and-itil-change-management/&lt;/a&gt;
[5] Kubernetes docs - &amp;ldquo;Workloads (Deployments are a good fit for stateless workloads)&amp;rdquo;. &lt;a href="https://kubernetes.io/docs/concepts/workloads/" target="_blank" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/concepts/workloads/&lt;/a&gt;
[6] Kubernetes docs - &amp;ldquo;Resource Management for Pods and Containers (requests/limits)&amp;rdquo;. &lt;a href="https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/" target="_blank" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/&lt;/a&gt;
[7] OpenGitOps - &amp;ldquo;What is OpenGitOps?&amp;rdquo; and project background. &lt;a href="https://opengitops.dev/" target="_blank" rel="noopener noreferrer"&gt;https://opengitops.dev/&lt;/a&gt;
and &lt;a href="https://opengitops.dev/about/" target="_blank" rel="noopener noreferrer"&gt;https://opengitops.dev/about/&lt;/a&gt;
[8] Microsoft Engineering Blog - &amp;ldquo;Building paved paths: the journey to platform engineering&amp;rdquo;. &lt;a href="https://devblogs.microsoft.com/engineering-at-microsoft/building-paved-paths-the-journey-to-platform-engineering/" target="_blank" rel="noopener noreferrer"&gt;https://devblogs.microsoft.com/engineering-at-microsoft/building-paved-paths-the-journey-to-platform-engineering/&lt;/a&gt;
[9] Microservices.io - &amp;ldquo;Database per service&amp;rdquo; pattern. &lt;a href="https://microservices.io/patterns/data/database-per-service" target="_blank" rel="noopener noreferrer"&gt;https://microservices.io/patterns/data/database-per-service&lt;/a&gt;
[10] Microservices.io - &amp;ldquo;Shared database&amp;rdquo; pattern. &lt;a href="https://microservices.io/patterns/data/shared-database.html" target="_blank" rel="noopener noreferrer"&gt;https://microservices.io/patterns/data/shared-database.html&lt;/a&gt;
[11] AWS documentation - &amp;ldquo;Data protection in Amazon S3 (durability/availability design goals)&amp;rdquo;. &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/DataDurability.html" target="_blank" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/AmazonS3/latest/userguide/DataDurability.html&lt;/a&gt;
[12] Martin Fowler - &amp;ldquo;Strangler Fig Application&amp;rdquo; (legacy modernization pattern). &lt;a href="https://martinfowler.com/bliki/StranglerFigApplication.html" target="_blank" rel="noopener noreferrer"&gt;https://martinfowler.com/bliki/StranglerFigApplication.html&lt;/a&gt;
&lt;/p&gt;</content:encoded></item><item><title>Stop Shipping Slide Decks</title><link>https://roygabriel.dev/blog/stop-shipping-slide-decks/</link><pubDate>Sat, 31 Jan 2026 11:15:00 -0500</pubDate><guid>https://roygabriel.dev/blog/stop-shipping-slide-decks/</guid><description>&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;&lt;strong&gt;Position:&lt;/strong&gt; This is not &amp;ldquo;documentation bad.&amp;rdquo;
This is &amp;ldquo;documentation is a tool.&amp;rdquo; If it increases lead time, hides truth, or replaces learning, it&amp;rsquo;s not helping.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;In software, the real &amp;ldquo;source of truth&amp;rdquo; is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;running systems&lt;/li&gt;
&lt;li&gt;code and configuration&lt;/li&gt;
&lt;li&gt;production telemetry&lt;/li&gt;
&lt;li&gt;incident history&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Documentation should reduce uncertainty and speed up decisions. But two artifacts routinely do the opposite in large organizations:&lt;/p&gt;</description><content:encoded>
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;&lt;strong&gt;Position:&lt;/strong&gt; This is not &amp;ldquo;documentation bad.&amp;rdquo;
This is &amp;ldquo;documentation is a tool.&amp;rdquo; If it increases lead time, hides truth, or replaces learning, it&amp;rsquo;s not helping.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;In software, the real &amp;ldquo;source of truth&amp;rdquo; is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;running systems&lt;/li&gt;
&lt;li&gt;code and configuration&lt;/li&gt;
&lt;li&gt;production telemetry&lt;/li&gt;
&lt;li&gt;incident history&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Documentation should reduce uncertainty and speed up decisions. But two artifacts routinely do the opposite in large organizations:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;the 40-page slide deck&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;the Word doc living somewhere in SharePoint that nobody can find&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These artifacts often become &lt;em&gt;deliverables&lt;/em&gt; - a substitute for building. They make it possible to spend months &amp;ldquo;progressing&amp;rdquo; without ever encountering reality.&lt;/p&gt;
&lt;p&gt;And here&amp;rsquo;s the part most orgs miss:&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;If you&amp;rsquo;re going to fail, you want to fail &lt;strong&gt;quickly and cheaply&lt;/strong&gt;, not slowly and expensively. [4]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That doesn&amp;rsquo;t mean reckless shipping. It means running a tight learning loop and letting reality correct you early - before you&amp;rsquo;ve sunk quarters of time into the wrong solution.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="tldr"&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Decks are great for storytelling. They are bad as an engineering system of record.&lt;/li&gt;
&lt;li&gt;&amp;ldquo;SharePoint architecture docs&amp;rdquo; become a &lt;strong&gt;document cemetery&lt;/strong&gt;: hard to find, hard to diff, and easy to ignore.&lt;/li&gt;
&lt;li&gt;The Agile Manifesto explicitly values &lt;strong&gt;working software over comprehensive documentation&lt;/strong&gt;. [1] And one Agile principle states that working software is the primary measure of progress. [2]&lt;/li&gt;
&lt;li&gt;Replace decks/docs-as-deliverables with:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RFC-lite&lt;/strong&gt; (1-2 pages) + a &lt;strong&gt;running thin slice&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ADRs&lt;/strong&gt; (Architecture Decision Records) to capture decisions + tradeoffs [5][6]&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Docs-as-code&lt;/strong&gt; (Markdown in the repo, reviewed like code)&lt;/li&gt;
&lt;li&gt;diagrams that are versioned and easy to update&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Measure improvement with system outcomes (lead time, deploy frequency, change failure rate, time to restore service). [3]&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="contents"&gt;Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#pattern-1-deck-driven-development"&gt;Pattern 1: Deck-driven development&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-2-sharepoint-document-cemeteries"&gt;Pattern 2: SharePoint document cemeteries&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-3-architecture-as-narrative-not-decisions"&gt;Pattern 3: Architecture as narrative, not decisions&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-4-design-phase-gating"&gt;Pattern 4: &amp;ldquo;Design phase&amp;rdquo; gating&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-5-documentation-that-never-gets-pruned"&gt;Pattern 5: Documentation that never gets pruned&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#what-to-do-instead-a-documentation-system-that-ships"&gt;What to do instead: a documentation system that ships&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#verification-how-you-know-its-working"&gt;Verification: how you know it&amp;rsquo;s working&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-practical-checklist"&gt;A practical checklist&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#references"&gt;References&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="pattern-1-deck-driven-development"&gt;Pattern 1: Deck-driven development&lt;/h2&gt;
&lt;h3 id="what-it-looks-like"&gt;What it looks like&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;A 40-page deck is created to describe a system that doesn&amp;rsquo;t exist yet.&lt;/li&gt;
&lt;li&gt;The deck gets reviewed by multiple groups.&lt;/li&gt;
&lt;li&gt;Approval is treated as progress.&lt;/li&gt;
&lt;li&gt;When implementation starts, the world has changed - or key constraints were missed.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-it-exists"&gt;Why it exists&lt;/h3&gt;
&lt;p&gt;Decks are socially useful:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;they compress complexity into a narrative&lt;/li&gt;
&lt;li&gt;they help leaders &amp;ldquo;see&amp;rdquo; a plan&lt;/li&gt;
&lt;li&gt;they make uncertainty feel controlled&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-hidden-tax"&gt;The hidden tax&lt;/h3&gt;
&lt;p&gt;Decks are a poor engineering artifact because they&amp;rsquo;re:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;low fidelity&lt;/strong&gt;: they rarely contain executable truth&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;hard to maintain&lt;/strong&gt;: updates are manual and usually lag reality&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;hard to diff&lt;/strong&gt;: you can&amp;rsquo;t easily review what changed and why&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;easy to perform&lt;/strong&gt;: a deck can look complete while the design is still untested&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;not tied to code&lt;/strong&gt;: no direct path from &amp;ldquo;decision&amp;rdquo; -&amp;gt; &amp;ldquo;implementation&amp;rdquo; -&amp;gt; &amp;ldquo;verification&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The worst outcome isn&amp;rsquo;t that the deck is wrong. It&amp;rsquo;s that the deck delays the point where you discover what&amp;rsquo;s wrong.&lt;/p&gt;
&lt;h3 id="the-replacement-pattern"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Use decks for storytelling &lt;strong&gt;after&lt;/strong&gt; you have reality. Use engineering artifacts to discover reality.&lt;/p&gt;
&lt;p&gt;A strong default:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RFC-lite&lt;/strong&gt; (1-2 pages)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;a runnable thin slice&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;measurable verification&lt;/strong&gt; (latency, cost envelope, failure mode)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This aligns with the Agile principle that working software is the primary measure of progress. [2]&lt;/p&gt;
&lt;h3 id="transition-step-low-drama"&gt;Transition step (low drama)&lt;/h3&gt;
&lt;p&gt;Replace &amp;ldquo;deck required for approval&amp;rdquo; with &amp;ldquo;evidence required for approval&amp;rdquo;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;link to the RFC&lt;/li&gt;
&lt;li&gt;link to a running demo / branch / sandbox&lt;/li&gt;
&lt;li&gt;explicit constraints + tradeoffs&lt;/li&gt;
&lt;li&gt;an exit criteria checklist for the slice&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="pattern-2-sharepoint-document-cemeteries"&gt;Pattern 2: SharePoint document cemeteries&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-1"&gt;What it looks like&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Architecture docs exist as Word/PDF files in SharePoint.&lt;/li&gt;
&lt;li&gt;Multiple versions exist (&amp;ldquo;Final_v7_REAL_FINAL.docx&amp;rdquo;).&lt;/li&gt;
&lt;li&gt;Search works poorly unless you already know what to search for.&lt;/li&gt;
&lt;li&gt;Nobody updates the doc because it&amp;rsquo;s painful and risky (&amp;ldquo;what if I change the blessed doc?&amp;rdquo;).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-it-exists-1"&gt;Why it exists&lt;/h3&gt;
&lt;p&gt;It&amp;rsquo;s an enterprise default:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;SharePoint is &amp;ldquo;official&amp;rdquo;&lt;/li&gt;
&lt;li&gt;Word docs feel formal&lt;/li&gt;
&lt;li&gt;it&amp;rsquo;s familiar to non-engineering stakeholders&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-hidden-tax-1"&gt;The hidden tax&lt;/h3&gt;
&lt;p&gt;SharePoint docs typically fail at the things engineering needs most:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;discoverability&lt;/strong&gt; (people don&amp;rsquo;t know where to look)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ownership&lt;/strong&gt; (no clear maintainer)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;reviewability&lt;/strong&gt; (diffs and PR discussion are weak)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;linking to reality&lt;/strong&gt; (code, configs, dashboards, runbooks)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;keeping current&lt;/strong&gt; (documentation drift becomes the norm)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So teams stop trusting docs and rely on tribal knowledge - until they page someone at 2 a.m.&lt;/p&gt;
&lt;h3 id="the-replacement-pattern-1"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Treat documentation as part of the codebase:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Markdown in the repo&lt;/li&gt;
&lt;li&gt;reviewed via PR like code&lt;/li&gt;
&lt;li&gt;versioned with implementation&lt;/li&gt;
&lt;li&gt;linked to:
&lt;ul&gt;
&lt;li&gt;APIs (OpenAPI specs)&lt;/li&gt;
&lt;li&gt;dashboards&lt;/li&gt;
&lt;li&gt;runbooks&lt;/li&gt;
&lt;li&gt;incident writeups&lt;/li&gt;
&lt;li&gt;ADRs&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Google&amp;rsquo;s documentation best practices make the point directly: a small set of fresh, accurate docs is better than a large pile in disrepair. [7]&lt;/p&gt;
&lt;h3 id="transition-step"&gt;Transition step&lt;/h3&gt;
&lt;p&gt;You don&amp;rsquo;t have to &amp;ldquo;migrate all docs.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Start with a triage:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Identify the top 10 documents people actually need.&lt;/li&gt;
&lt;li&gt;Recreate them as Markdown in a &lt;code&gt;docs/&lt;/code&gt; folder with an index.&lt;/li&gt;
&lt;li&gt;Leave the rest as archived references, not living truth.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="pattern-3-architecture-as-narrative-not-decisions"&gt;Pattern 3: Architecture as narrative, not decisions&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-2"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;The doc describes a target architecture but doesn&amp;rsquo;t answer:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;why this approach?&lt;/li&gt;
&lt;li&gt;what alternatives were considered?&lt;/li&gt;
&lt;li&gt;what tradeoffs were accepted?&lt;/li&gt;
&lt;li&gt;what constraints matter most?&lt;/li&gt;
&lt;li&gt;what did we decide not to do?&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-it-exists-2"&gt;Why it exists&lt;/h3&gt;
&lt;p&gt;Narratives are easier than decision logs. It&amp;rsquo;s simpler to write &amp;ldquo;the system will&amp;hellip;&amp;rdquo; than to record the messy reality of tradeoffs.&lt;/p&gt;
&lt;h3 id="the-hidden-tax-2"&gt;The hidden tax&lt;/h3&gt;
&lt;p&gt;When decisions aren&amp;rsquo;t recorded, teams re-litigate them repeatedly. The same arguments come back every quarter - often because new people joined and the reasoning isn&amp;rsquo;t captured.&lt;/p&gt;
&lt;h3 id="the-replacement-pattern-adrs"&gt;The replacement pattern: ADRs&lt;/h3&gt;
&lt;p&gt;Use &lt;strong&gt;Architecture Decision Records (ADRs)&lt;/strong&gt;: short, structured notes that capture an important decision with its context and consequences. [5] The practice is commonly attributed to Michael Nygard&amp;rsquo;s 2011 write-up. [6]&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ADRs are the opposite of a 40-slide deck:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;small&lt;/li&gt;
&lt;li&gt;specific&lt;/li&gt;
&lt;li&gt;diffable&lt;/li&gt;
&lt;li&gt;linkable to code changes&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="transition-step-1"&gt;Transition step&lt;/h3&gt;
&lt;p&gt;Start with one ADR per &amp;ldquo;architecturally significant decision&amp;rdquo;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;database choice&lt;/li&gt;
&lt;li&gt;messaging pattern&lt;/li&gt;
&lt;li&gt;tenancy model&lt;/li&gt;
&lt;li&gt;auth model&lt;/li&gt;
&lt;li&gt;deployment model&lt;/li&gt;
&lt;li&gt;data boundary decisions&lt;/li&gt;
&lt;/ul&gt;
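&lt;p&gt;To keep that first ADR cheap to start, a small helper can number and scaffold the file. This is a minimal sketch, not a standard tool: the function name &lt;code&gt;new_adr&lt;/code&gt; is hypothetical, and the file naming and section headings are assumed to follow the &lt;code&gt;docs/adr/0001-use-postgres.md&lt;/code&gt; layout and the ADR template shown later in this post.&lt;/p&gt;

```python
from pathlib import Path
import re

# Section layout copied from the ADR template in this post.
# New records start in the "Proposed" status (an assumption of this sketch).
ADR_BODY = """# ADR-{number:04d}: {title}

## Status
Proposed

## Context
What drove this decision? What constraints matter?

## Decision
What did we decide?

## Consequences
What do we gain? What do we lose? What changes later?
"""


def new_adr(adr_dir: Path, title: str) -> Path:
    """Create the next numbered ADR file in adr_dir and return its path."""
    adr_dir.mkdir(parents=True, exist_ok=True)
    # Existing files are assumed to look like 0001-use-postgres.md;
    # collect their numeric prefixes so the next number can be derived.
    numbers = [
        int(match.group(1))
        for path in adr_dir.glob("*.md")
        if (match := re.match(r"(\d{4})-", path.name))
    ]
    number = max(numbers, default=0) + 1
    # Turn the title into a lowercase, hyphen-separated file name.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    target = adr_dir / f"{number:04d}-{slug}.md"
    target.write_text(ADR_BODY.format(number=number, title=title), encoding="utf-8")
    return target
```

&lt;p&gt;Calling &lt;code&gt;new_adr(Path("docs/adr"), "Use Postgres")&lt;/code&gt; in an empty directory would create &lt;code&gt;0001-use-postgres.md&lt;/code&gt;. The resulting file is meant to be filled in and reviewed in the same PR as the change it explains.&lt;/p&gt;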
&lt;hr&gt;
&lt;h2 id="pattern-4-design-phase-gating"&gt;Pattern 4: &amp;ldquo;Design phase&amp;rdquo; gating&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-3"&gt;What it looks like&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;We can&amp;rsquo;t start implementation until the analysis is complete.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;The analysis expands to include every possible future case.&lt;/li&gt;
&lt;li&gt;The design grows more &amp;ldquo;complete&amp;rdquo; and less true.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-it-exists-3"&gt;Why it exists&lt;/h3&gt;
&lt;p&gt;Enterprises are understandably afraid of failure.&lt;/p&gt;
&lt;h3 id="the-hidden-tax-3"&gt;The hidden tax&lt;/h3&gt;
&lt;p&gt;This approach doesn&amp;rsquo;t eliminate failure. It defers it - making it more expensive.&lt;/p&gt;
&lt;p&gt;Lean Startup describes progress as validated learning and emphasizes moving quickly through a build-measure-learn loop. [4] The point isn&amp;rsquo;t startups. The point is learning fast when you&amp;rsquo;re uncertain.&lt;/p&gt;
&lt;h3 id="the-replacement-pattern-2"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Timebox design, then validate with a thin slice:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;write the RFC-lite doc&lt;/li&gt;
&lt;li&gt;implement the smallest realistic end-to-end path&lt;/li&gt;
&lt;li&gt;measure the constraints&lt;/li&gt;
&lt;li&gt;then expand&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="transition-step-2"&gt;Transition step&lt;/h3&gt;
&lt;p&gt;Define &amp;ldquo;analysis exit criteria&amp;rdquo;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;measurable constraints validated (not theorized)&lt;/li&gt;
&lt;li&gt;spike code exists&lt;/li&gt;
&lt;li&gt;a plan for incremental rollout exists&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="pattern-5-documentation-that-never-gets-pruned"&gt;Pattern 5: Documentation that never gets pruned&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-4"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;Docs accumulate but aren&amp;rsquo;t maintained:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;outdated architecture diagrams&lt;/li&gt;
&lt;li&gt;old runbooks&lt;/li&gt;
&lt;li&gt;stale onboarding guides&lt;/li&gt;
&lt;li&gt;dead links&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-it-exists-4"&gt;Why it exists&lt;/h3&gt;
&lt;p&gt;Pruning isn&amp;rsquo;t rewarded. Writing new docs feels productive; deleting old docs feels risky.&lt;/p&gt;
&lt;h3 id="the-hidden-tax-4"&gt;The hidden tax&lt;/h3&gt;
&lt;p&gt;Stale docs are often worse than no docs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;they mislead&lt;/li&gt;
&lt;li&gt;they increase cognitive load&lt;/li&gt;
&lt;li&gt;they create false confidence&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-replacement-pattern-3"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Adopt &amp;ldquo;minimum viable documentation&amp;rdquo; and prune regularly. [7]&lt;/p&gt;
&lt;p&gt;The rule I like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If a doc isn&amp;rsquo;t maintained, label it &lt;strong&gt;ARCHIVED&lt;/strong&gt; and explain why.&lt;/li&gt;
&lt;li&gt;If a doc is required, tie it to ownership and change workflow.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="transition-step-3"&gt;Transition step&lt;/h3&gt;
&lt;p&gt;Make docs part of PR hygiene:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;if a change affects behavior, the docs update ships with it&lt;/li&gt;
&lt;li&gt;run link checks in CI&lt;/li&gt;
&lt;li&gt;keep an index page updated&lt;/li&gt;
&lt;/ul&gt;
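&lt;p&gt;A link check does not need heavy tooling to be useful. The sketch below is one possible shape for it, under stated assumptions: docs are Markdown files under a single folder, only relative inline links of the form &lt;code&gt;[text](target)&lt;/code&gt; are checked, and external URLs are skipped. The function name &lt;code&gt;broken_links&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
from pathlib import Path
import re

# Captures the target of a Markdown inline link: [text](target)
LINK_TARGET = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def broken_links(docs_dir: Path) -> list[tuple[str, str]]:
    """Return (file, target) pairs for relative links whose target file is missing."""
    broken = []
    for doc in sorted(docs_dir.rglob("*.md")):
        for target in LINK_TARGET.findall(doc.read_text(encoding="utf-8")):
            # Ignore the #fragment part; only the file path is verified.
            path = target.split("#", 1)[0]
            # Skip same-page anchors, external URLs, and mailto links.
            if not path or "://" in path or path.startswith("mailto:"):
                continue
            # Relative links are resolved against the linking file's folder.
            if not (doc.parent / path).exists():
                broken.append((doc.relative_to(docs_dir).as_posix(), target))
    return broken
```

&lt;p&gt;In a CI job, a thin wrapper would call &lt;code&gt;broken_links(Path("docs"))&lt;/code&gt;, print each pair, and exit non-zero when the list is not empty, so a dead link fails the PR that introduced it.&lt;/p&gt;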
&lt;hr&gt;
&lt;h2 id="what-to-do-instead-a-documentation-system-that-ships"&gt;What to do instead: a documentation system that ships&lt;/h2&gt;
&lt;p&gt;Here&amp;rsquo;s a simple &amp;ldquo;docs system&amp;rdquo; that works in practice.&lt;/p&gt;
&lt;h3 id="a-repo-structure-that-scales"&gt;A repo structure that scales&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;/README.md # entry point: what this is + how to run it
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;/docs/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  index.md # &amp;#34;start here&amp;#34; documentation map
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  rfc/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    0001-tenancy-model.md
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    0002-storage-approach.md
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  adr/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    0001-use-postgres.md
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    0002-adopt-opentelemetry.md
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  architecture/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    context.md # C4-ish: context + boundaries
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    containers.md # top-level services
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    deployment.md # runtime &amp;amp; environments
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  runbooks/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    oncall.md
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    incident-response.md
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  api/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    openapi.yaml
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="replace-40-slides-with-two-artifacts"&gt;Replace 40 slides with two artifacts&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;RFC-lite (1-2 pages)&lt;/strong&gt;: the &amp;ldquo;what&amp;rdquo; and &amp;ldquo;why&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Thin slice demo&lt;/strong&gt;: the reality check&lt;/li&gt;
&lt;/ol&gt;
&lt;h4 id="rfc-lite-template-copypaste"&gt;RFC-lite template (copy/paste)&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-markdown" data-lang="markdown"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gh"&gt;# RFC: &amp;lt;title&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gh"&gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;## Problem
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;&lt;/span&gt;What are we trying to solve? Who is affected?
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;## Constraints
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;&lt;/span&gt;Latency, cost, compliance, tenancy, uptime, environments.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;## Proposal
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;&lt;/span&gt;What are we building? What does &amp;#34;done&amp;#34; mean?
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;## Alternatives considered
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;&lt;/span&gt;Option A / B / C with short tradeoffs.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;## Risks and mitigations
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;&lt;/span&gt;What could go wrong? How will we contain blast radius?
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;## Verification
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;&lt;/span&gt;How will we measure success in production?
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h4 id="adr-template-copypaste"&gt;ADR template (copy/paste)&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-markdown" data-lang="markdown"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gh"&gt;# ADR-XXXX: &amp;lt;decision&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gh"&gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;## Status
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;&lt;/span&gt;Proposed | Accepted | Deprecated
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;## Context
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;&lt;/span&gt;What drove this decision? What constraints matter?
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;## Decision
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;&lt;/span&gt;What did we decide?
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;## Consequences
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gu"&gt;&lt;/span&gt;What do we gain? What do we lose? What changes later?
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
&lt;h2 id="verification-how-you-know-its-working"&gt;Verification: how you know it&amp;rsquo;s working&lt;/h2&gt;
&lt;p&gt;If you replace decks and doc cemeteries with real engineering artifacts, you should see:&lt;/p&gt;
&lt;h3 id="delivery-metrics-improve"&gt;Delivery metrics improve&lt;/h3&gt;
&lt;p&gt;Track the same system-level outcomes DORA promotes: lead time, deploy frequency, change failure rate, and time to restore service. [3]&lt;/p&gt;
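&lt;p&gt;As a rough illustration of how those four outcomes can be derived from deployment records, here is a minimal sketch. It is not DORA tooling: the &lt;code&gt;Deployment&lt;/code&gt; record and &lt;code&gt;dora_summary&lt;/code&gt; function are hypothetical, lead time is assumed to run from commit to production deploy, and time to restore is approximated as the gap between a failed deploy and the recorded restore time.&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_summary(deploys: list[Deployment], window_days: int) -> dict:
    """Summarise the four delivery outcomes over a window of window_days days."""
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    failures = [d for d in deploys if d.failed]
    # Assumption: the failure starts at deploy time, so restore time is
    # measured from deployed_at. Unresolved failures are left out.
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "median_lead_time": median(lead_times) if lead_times else None,
        "deploys_per_day": len(deploys) / window_days,
        "change_failure_rate": len(failures) / len(deploys) if deploys else None,
        "median_time_to_restore": median(restores) if restores else None,
    }
```

&lt;p&gt;Medians are used here instead of means so that a single slow change or long outage does not dominate the summary. The trend across windows matters more than any single value.&lt;/p&gt;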
&lt;h3 id="fewer-handoffs-and-fewer-alignment-meetings"&gt;Fewer handoffs and fewer &amp;ldquo;alignment meetings&amp;rdquo;&lt;/h3&gt;
&lt;p&gt;If teams can self-serve context from living docs, coordination cost drops.&lt;/p&gt;
&lt;h3 id="faster-first-reality"&gt;Faster &amp;ldquo;first reality&amp;rdquo;&lt;/h3&gt;
&lt;p&gt;A simple heuristic:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How long from idea -&amp;gt; first runnable thin slice?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If that number is months, the system is optimized for analysis, not learning.&lt;/p&gt;
&lt;h3 id="docs-stay-alive"&gt;Docs stay alive&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;docs updated alongside code&lt;/li&gt;
&lt;li&gt;fewer stale &amp;ldquo;final_v7&amp;rdquo; files&lt;/li&gt;
&lt;li&gt;fewer tribal-knowledge escalations&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="a-practical-checklist"&gt;A practical checklist&lt;/h2&gt;
&lt;p&gt;If you want to kill deck-driven delivery without starting a culture war:&lt;/p&gt;
&lt;h3 id="stop-treating-decks-as-deliverables"&gt;Stop treating decks as deliverables&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Architecture reviews require an RFC + a runnable slice.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Decks are optional; evidence is not.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="fix-document-discoverability"&gt;Fix document discoverability&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; One &lt;code&gt;docs/index.md&lt;/code&gt; that links to the docs that matter.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Make the repo the source of truth for technical docs.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="capture-decisions-not-fantasies"&gt;Capture decisions, not fantasies&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Add ADRs for major decisions and link them to PRs. [5][6]&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="timebox-analysis"&gt;Timebox analysis&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Set analysis exit criteria.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Optimize for early learning and quick failure when uncertainty is high. [4]&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="keep-docs-small-and-alive"&gt;Keep docs small and alive&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Prune regularly; archive what&amp;rsquo;s stale.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Run link checks in CI.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Treat docs like bonsai: maintained and trimmed, not accumulated. [7]&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;p&gt;[1] Manifesto for Agile Software Development (values; &amp;ldquo;Working software over comprehensive documentation&amp;rdquo;). &lt;a href="https://agilemanifesto.org/" target="_blank" rel="noopener noreferrer"&gt;https://agilemanifesto.org/&lt;/a&gt;&lt;br&gt;
[2] Principles behind the Agile Manifesto (&amp;ldquo;Working software is the primary measure of progress&amp;rdquo;). &lt;a href="https://agilemanifesto.org/principles.html" target="_blank" rel="noopener noreferrer"&gt;https://agilemanifesto.org/principles.html&lt;/a&gt;&lt;br&gt;
[3] DORA - &amp;ldquo;DORA&amp;rsquo;s software delivery performance metrics (guide)&amp;rdquo;. &lt;a href="https://dora.dev/guides/dora-metrics/" target="_blank" rel="noopener noreferrer"&gt;https://dora.dev/guides/dora-metrics/&lt;/a&gt;&lt;br&gt;
[4] Lean Startup principles (Build-Measure-Learn; learning quickly; failing fast/cheaply as a concept). &lt;a href="https://theleanstartup.com/principles" target="_blank" rel="noopener noreferrer"&gt;https://theleanstartup.com/principles&lt;/a&gt;&lt;br&gt;
[5] ADR - Architectural Decision Records (what ADRs are). &lt;a href="https://adr.github.io/" target="_blank" rel="noopener noreferrer"&gt;https://adr.github.io/&lt;/a&gt;&lt;br&gt;
[6] Michael Nygard - &amp;ldquo;Documenting Architecture Decisions&amp;rdquo; (2011; ADR practice origin/popularization). &lt;a href="https://www.cognitect.com/blog/2011/11/15/documenting-architecture-decisions" target="_blank" rel="noopener noreferrer"&gt;https://www.cognitect.com/blog/2011/11/15/documenting-architecture-decisions&lt;/a&gt;&lt;br&gt;
[7] Google Documentation Guide - Best practices (&amp;ldquo;Minimum Viable Documentation&amp;rdquo;; keep docs short, fresh, and pruned). &lt;a href="https://google.github.io/styleguide/docguide/best_practices.html" target="_blank" rel="noopener noreferrer"&gt;https://google.github.io/styleguide/docguide/best_practices.html&lt;/a&gt;
&lt;/p&gt;</content:encoded></item><item><title>When Management Layers Become Latency</title><link>https://roygabriel.dev/blog/when-management-layers-become-latency/</link><pubDate>Sat, 24 Jan 2026 10:30:00 -0500</pubDate><guid>https://roygabriel.dev/blog/when-management-layers-become-latency/</guid><description>&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;&lt;strong&gt;Note on examples:&lt;/strong&gt; The scenarios below are &lt;strong&gt;anonymized composites&lt;/strong&gt;. This isn&amp;rsquo;t &amp;ldquo;management bad.&amp;rdquo;
Good management is an accelerator. The problem is when management becomes &lt;strong&gt;layers of translation&lt;/strong&gt; between reality and decisions.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;In production systems, adding hops between a request and a response increases latency, failure modes, and debugging time.&lt;/p&gt;
&lt;p&gt;Organizations behave the same way.&lt;/p&gt;
&lt;p&gt;When engineering work flows through too many intermediary layers - tech leads, scrum masters, managers, senior managers, project managers, directors, senior directors, VPs, and beyond - the organization starts to exhibit the same symptoms as an over-proxied network:&lt;/p&gt;</description><content:encoded>
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;&lt;strong&gt;Note on examples:&lt;/strong&gt; The scenarios below are &lt;strong&gt;anonymized composites&lt;/strong&gt;. This isn&amp;rsquo;t &amp;ldquo;management bad.&amp;rdquo;
Good management is an accelerator. The problem is when management becomes &lt;strong&gt;layers of translation&lt;/strong&gt; between reality and decisions.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;In production systems, adding hops between a request and a response increases latency, failure modes, and debugging time.&lt;/p&gt;
&lt;p&gt;Organizations behave the same way.&lt;/p&gt;
&lt;p&gt;When engineering work flows through too many intermediary layers - tech leads, scrum masters, managers, senior managers, project managers, directors, senior directors, VPs, and beyond - the organization starts to exhibit the same symptoms as an over-proxied network:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;long lead times&lt;/li&gt;
&lt;li&gt;lost context (&amp;ldquo;telephone game&amp;rdquo; requirements)&lt;/li&gt;
&lt;li&gt;local optimization (everyone looks busy; value doesn&amp;rsquo;t move)&lt;/li&gt;
&lt;li&gt;coordination overhead that scales faster than delivery&lt;/li&gt;
&lt;li&gt;engineers feeling like nothing they build reaches production&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The painful part is that the org can look &lt;strong&gt;healthy&lt;/strong&gt; on paper (status is green, roadmaps are full) while the product fails to meet real expectations.&lt;/p&gt;
&lt;p&gt;This article is about the mechanics behind that failure - and the replacement patterns that restore flow.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="tldr"&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Layers create handoffs.&lt;/strong&gt; Handoffs create queues. Queues create lead time.&lt;/li&gt;
&lt;li&gt;More roles don&amp;rsquo;t automatically increase throughput; coordination cost can dominate (Brooks&amp;rsquo;s Law). [6]&lt;/li&gt;
&lt;li&gt;Fast flow requires &lt;strong&gt;end-to-end ownership&lt;/strong&gt; with minimal handoffs (stream-aligned teams). [3][4]&lt;/li&gt;
&lt;li&gt;Measure outcomes at the system level (DORA metrics), not &amp;ldquo;activity&amp;rdquo; (story points, number of meetings). [1]&lt;/li&gt;
&lt;li&gt;Don&amp;rsquo;t turn metrics into targets (Goodhart&amp;rsquo;s Law). [7]&lt;/li&gt;
&lt;li&gt;Burnout often rises when delivery is painful and risky; improving delivery capability predicts lower burnout. [2][8]&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="contents"&gt;Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#pattern-1-translation-layers-replace-direct-truth"&gt;Pattern 1: Translation layers replace direct truth&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-2-status-becomes-the-work"&gt;Pattern 2: Status becomes the work&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-3-more-people-is-treated-like-a-throughput-solution"&gt;Pattern 3: &amp;ldquo;More people&amp;rdquo; is treated like a throughput solution&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-4-projectization-and-temporary-teams"&gt;Pattern 4: Projectization and temporary teams&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-5-governance-by-meeting-instead-of-guardrail"&gt;Pattern 5: Governance by meeting instead of guardrail&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-6-metrics-as-targets"&gt;Pattern 6: Metrics as targets&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-7-engineers-are-abstracted-away-from-production"&gt;Pattern 7: Engineers are abstracted away from production&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#replacement-patterns-that-work"&gt;Replacement patterns that work&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#verification-how-you-know-the-org-is-healing"&gt;Verification: how you know the org is healing&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-practical-checklist"&gt;A practical checklist&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#references"&gt;References&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="pattern-1-translation-layers-replace-direct-truth"&gt;Pattern 1: Translation layers replace direct truth&lt;/h2&gt;
&lt;h3 id="what-it-looks-like"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;A customer need or operational pain moves through a chain:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;customer -&amp;gt; product -&amp;gt; program -&amp;gt; project -&amp;gt; delivery manager -&amp;gt; engineering manager -&amp;gt; tech lead -&amp;gt; engineers&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;By the time it arrives at the team, it&amp;rsquo;s been translated multiple times and often loses:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the actual user story&lt;/li&gt;
&lt;li&gt;the constraints&lt;/li&gt;
&lt;li&gt;the real priority&lt;/li&gt;
&lt;li&gt;the &amp;ldquo;why&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-it-exists"&gt;Why it exists&lt;/h3&gt;
&lt;p&gt;Layering feels safe:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;fewer people &amp;ldquo;bother&amp;rdquo; engineers&lt;/li&gt;
&lt;li&gt;leaders get curated information&lt;/li&gt;
&lt;li&gt;decision makers see clean narratives&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-hidden-tax"&gt;The hidden tax&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Misalignment becomes normal.&lt;/li&gt;
&lt;li&gt;Engineers build the wrong thing &lt;em&gt;efficiently&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Product expectations aren&amp;rsquo;t met, not because engineers can&amp;rsquo;t build, but because the input signal is degraded.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-replacement-pattern"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Shorten the feedback loop.&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ensure teams have direct access to:
&lt;ul&gt;
&lt;li&gt;customer signals (support tickets, usage, interviews)&lt;/li&gt;
&lt;li&gt;operational signals (incidents, latency, error budgets)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Make the &amp;ldquo;why&amp;rdquo; non-optional: put it in the ticket, the PRD, and the kickoff.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;If a team can&amp;rsquo;t explain &amp;ldquo;why this exists,&amp;rdquo; it shouldn&amp;rsquo;t ship yet.&lt;/strong&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="pattern-2-status-becomes-the-work"&gt;Pattern 2: Status becomes the work&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-1"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;Organizations that struggle to ship often compensate with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;more meetings&lt;/li&gt;
&lt;li&gt;more dashboards&lt;/li&gt;
&lt;li&gt;more decks&lt;/li&gt;
&lt;li&gt;more &amp;ldquo;alignment sessions&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The output looks like progress, but the production system doesn&amp;rsquo;t change.&lt;/p&gt;
&lt;h3 id="why-it-exists-1"&gt;Why it exists&lt;/h3&gt;
&lt;p&gt;When uncertainty is high, visibility is comforting.&lt;/p&gt;
&lt;h3 id="the-hidden-tax-1"&gt;The hidden tax&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Attention becomes scarce.&lt;/li&gt;
&lt;li&gt;Engineers fragment into &amp;ldquo;meeting responders.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;Work is spread across too many concurrent initiatives (WIP explosion).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-replacement-pattern-1"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Reduce status overhead by making &lt;strong&gt;the system visible&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;CI/CD dashboards&lt;/li&gt;
&lt;li&gt;production telemetry&lt;/li&gt;
&lt;li&gt;an engineering scorecard based on system outcomes (not activity)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;DORA&amp;rsquo;s metrics are widely used as system-level indicators for delivery performance: deployment frequency, lead time, change failure rate, and time to restore service. [1]&lt;/p&gt;
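&lt;p&gt;As a rough sketch of how these four numbers can be derived from deployment records: the record fields, the sample data, and the &lt;code&gt;dora_summary&lt;/code&gt; helper below are hypothetical and are not taken from DORA&amp;rsquo;s guide. Real pipelines would pull the timestamps from CI/CD and incident tooling.&lt;/p&gt;

```python
from datetime import datetime
from statistics import median

# Hypothetical deployment records for one service over one week.
# Field names are illustrative assumptions, not a standard schema.
DEPLOYS = [
    {"committed": datetime(2026, 1, 5, 9), "deployed": datetime(2026, 1, 5, 15),
     "failed": False, "restored": None},
    {"committed": datetime(2026, 1, 6, 10), "deployed": datetime(2026, 1, 7, 10),
     "failed": True, "restored": datetime(2026, 1, 7, 12)},
    {"committed": datetime(2026, 1, 8, 8), "deployed": datetime(2026, 1, 8, 12),
     "failed": False, "restored": None},
    {"committed": datetime(2026, 1, 9, 9), "deployed": datetime(2026, 1, 9, 11),
     "failed": False, "restored": None},
]


def dora_summary(deploys, window_days):
    """Summarize the four delivery metrics for one service over a time window."""
    if not deploys:
        raise ValueError("need at least one deployment record")
    # Lead time here means commit-to-production for each deployment.
    lead_times = [d["deployed"] - d["committed"] for d in deploys]
    failures = [d for d in deploys if d["failed"]]
    # Restore time is measured only for failed deployments that were recovered.
    restore_times = [d["restored"] - d["deployed"] for d in failures if d["restored"]]
    return {
        "deploys_per_week": len(deploys) / (window_days / 7),
        "median_lead_time_hours": median(t.total_seconds() for t in lead_times) / 3600,
        "change_failure_rate": len(failures) / len(deploys),
        "median_restore_hours": (
            median(t.total_seconds() for t in restore_times) / 3600
            if restore_times else None
        ),
    }


summary = dora_summary(DEPLOYS, window_days=7)
print(summary)
```

&lt;p&gt;The sketch reports medians rather than means so that one slow change does not dominate the picture. Treat the output as a conversation starter about constraints, not as a score.&lt;/p&gt;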
&lt;hr&gt;
&lt;h2 id="pattern-3-more-people-is-treated-like-a-throughput-solution"&gt;Pattern 3: &amp;ldquo;More people&amp;rdquo; is treated like a throughput solution&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-2"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;A late initiative triggers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;new managers&lt;/li&gt;
&lt;li&gt;new project managers&lt;/li&gt;
&lt;li&gt;new engineers&lt;/li&gt;
&lt;li&gt;more coordination rituals&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-it-exists-2"&gt;Why it exists&lt;/h3&gt;
&lt;p&gt;It&amp;rsquo;s intuitive: more people should mean more output.&lt;/p&gt;
&lt;h3 id="the-hidden-tax-2"&gt;The hidden tax&lt;/h3&gt;
&lt;p&gt;Software delivery has a coordination component. Every added person increases the number of communication paths, the onboarding load on existing staff, and the synchronization needed to keep work coherent.&lt;/p&gt;
&lt;p&gt;Brooks&amp;rsquo;s Law captures this succinctly: adding manpower to a late software project can make it later. [6]&lt;/p&gt;
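&lt;p&gt;A quick worked example of why coordination cost grows faster than headcount. It assumes the worst case where every pair of people may need to coordinate, which gives &lt;code&gt;n * (n - 1) / 2&lt;/code&gt; paths; the function name and team sizes are illustrative only.&lt;/p&gt;

```python
def communication_paths(people):
    """Number of distinct pairs in a group of `people`: n * (n - 1) / 2."""
    return people * (people - 1) // 2


# Growing a team from 5 to 12 people multiplies the possible pairings.
for size in (5, 8, 12):
    print(size, "people:", communication_paths(size), "possible paths")
```

&lt;p&gt;Real teams never use every path, but the shape of the curve is the point: headcount grows linearly while potential coordination grows roughly with its square.&lt;/p&gt;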
&lt;h3 id="the-replacement-pattern-2"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Before adding headcount, reduce coordination load:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;clarify ownership&lt;/li&gt;
&lt;li&gt;shrink scope to a thin vertical slice&lt;/li&gt;
&lt;li&gt;eliminate handoffs&lt;/li&gt;
&lt;li&gt;stabilize requirements long enough to ship&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Then scale with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;duplication (more teams owning similar streams)&lt;/li&gt;
&lt;li&gt;platform leverage (paved roads), not more meetings&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="pattern-4-projectization-and-temporary-teams"&gt;Pattern 4: Projectization and temporary teams&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-3"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;Engineers are repeatedly reorganized into short-lived &amp;ldquo;project teams,&amp;rdquo; and after delivery they are moved again.&lt;/p&gt;
&lt;h3 id="why-it-exists-3"&gt;Why it exists&lt;/h3&gt;
&lt;p&gt;Projects are easy to budget, track, and narrate.&lt;/p&gt;
&lt;h3 id="the-hidden-tax-3"&gt;The hidden tax&lt;/h3&gt;
&lt;p&gt;Temporary teams produce:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;fragile ownership&lt;/li&gt;
&lt;li&gt;weak operability&lt;/li&gt;
&lt;li&gt;&amp;ldquo;throw it over the wall&amp;rdquo; incentives&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Fast flow requires teams that own outcomes end-to-end with minimal handoffs.&lt;/p&gt;
&lt;p&gt;Team Topologies describes &lt;strong&gt;stream-aligned teams&lt;/strong&gt; as owning a slice of value end-to-end with no handoffs. [3][4]&lt;/p&gt;
&lt;h3 id="the-replacement-pattern-3"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Prefer &lt;strong&gt;stable teams&lt;/strong&gt; aligned to a value stream (product/service), with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;clear ownership&lt;/li&gt;
&lt;li&gt;operational responsibility (&amp;ldquo;you build it, you run it&amp;rdquo;)&lt;/li&gt;
&lt;li&gt;direct feedback from users and production&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="pattern-5-governance-by-meeting-instead-of-guardrail"&gt;Pattern 5: Governance by meeting instead of guardrail&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-4"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;Instead of &amp;ldquo;how do we make safe delivery easy,&amp;rdquo; governance becomes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;approval steps&lt;/li&gt;
&lt;li&gt;committees&lt;/li&gt;
&lt;li&gt;sign-off chains&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-it-exists-4"&gt;Why it exists&lt;/h3&gt;
&lt;p&gt;Risk is real, and leaders want control.&lt;/p&gt;
&lt;h3 id="the-hidden-tax-4"&gt;The hidden tax&lt;/h3&gt;
&lt;p&gt;Humans are expensive control planes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;slow&lt;/li&gt;
&lt;li&gt;inconsistent&lt;/li&gt;
&lt;li&gt;difficult to audit at scale&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-replacement-pattern-4"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Convert rules into guardrails:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;policy-as-code&lt;/li&gt;
&lt;li&gt;templates&lt;/li&gt;
&lt;li&gt;paved paths&lt;/li&gt;
&lt;li&gt;automated checks in CI/CD&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is how you scale safety without scaling meetings.&lt;/p&gt;
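&lt;p&gt;A minimal sketch of what a rule turned into a guardrail can look like as an automated CI check. The manifest fields, the rules, and the &lt;code&gt;check_manifest&lt;/code&gt; helper are all hypothetical; a real setup would more likely use a dedicated policy engine, but the principle is the same.&lt;/p&gt;

```python
# Hypothetical rules that a review committee might otherwise enforce by hand.
REQUIRED_FIELDS = ("owner", "runbook_url", "rollback")
ALLOWED_EXPOSURE = {"internal", "public"}


def check_manifest(manifest):
    """Return a list of violations; an empty list means the change passes."""
    violations = [
        f"missing required field: {field}"
        for field in REQUIRED_FIELDS
        if not manifest.get(field)
    ]
    exposure = manifest.get("exposure", "internal")
    if exposure not in ALLOWED_EXPOSURE:
        violations.append(f"unknown exposure: {exposure}")
    # Only the risky case needs a human step; everything else is self-serve.
    if exposure == "public" and not manifest.get("security_review"):
        violations.append("public services need a recorded security review")
    return violations


good = {
    "owner": "payments-team",
    "runbook_url": "https://example.invalid/runbook",
    "rollback": "automatic",
}
bad = {"owner": "payments-team", "exposure": "public"}

print(check_manifest(good))
print(check_manifest(bad))
```

&lt;p&gt;In a pipeline, the job would fail when the list is non-empty and print the violations, so the author gets the same answer every time, in seconds, with an audit trail.&lt;/p&gt;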
&lt;hr&gt;
&lt;h2 id="pattern-6-metrics-as-targets"&gt;Pattern 6: Metrics as targets&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-5"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;Teams are pressured to hit:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;story points&lt;/li&gt;
&lt;li&gt;&amp;ldquo;velocity&amp;rdquo;&lt;/li&gt;
&lt;li&gt;number of deployments&lt;/li&gt;
&lt;li&gt;&amp;ldquo;percent complete&amp;rdquo;&lt;/li&gt;
&lt;li&gt;tickets closed&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Then behavior adapts to the metric.&lt;/p&gt;
&lt;h3 id="why-it-exists-5"&gt;Why it exists&lt;/h3&gt;
&lt;p&gt;Leaders need a dashboard.&lt;/p&gt;
&lt;h3 id="the-hidden-tax-5"&gt;The hidden tax&lt;/h3&gt;
&lt;p&gt;When a measure becomes a target, it can stop being a good measure (Goodhart&amp;rsquo;s Law). [7]&lt;/p&gt;
&lt;p&gt;Examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;inflate points&lt;/li&gt;
&lt;li&gt;ship low-value changes to increase deploy count&lt;/li&gt;
&lt;li&gt;avoid hard work because it hurts &amp;ldquo;throughput&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-replacement-pattern-5"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Use metrics diagnostically at the system level (not as individual KPIs).&lt;/p&gt;
&lt;p&gt;If you adopt DORA metrics, use them to identify constraints and improve flow - not as quarterly targets for teams. [1][9]&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="pattern-7-engineers-are-abstracted-away-from-production"&gt;Pattern 7: Engineers are abstracted away from production&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-6"&gt;What it looks like&lt;/h3&gt;
&lt;p&gt;A team builds a system, but:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;another team deploys it&lt;/li&gt;
&lt;li&gt;another team runs it&lt;/li&gt;
&lt;li&gt;another team handles incidents&lt;/li&gt;
&lt;li&gt;another team owns the roadmap&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Engineers eventually conclude: &amp;ldquo;Nothing I build actually ships.&amp;rdquo;&lt;/p&gt;
&lt;h3 id="why-it-exists-6"&gt;Why it exists&lt;/h3&gt;
&lt;p&gt;Specialization can be useful, but excessive separation breaks feedback loops.&lt;/p&gt;
&lt;h3 id="the-hidden-tax-6"&gt;The hidden tax&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;teams don&amp;rsquo;t learn from production&lt;/li&gt;
&lt;li&gt;quality declines because consequences are indirect&lt;/li&gt;
&lt;li&gt;&amp;ldquo;deployment pain&amp;rdquo; rises: shipping becomes stressful and disruptive&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;DORA describes &lt;em&gt;deployment pain&lt;/em&gt; as the fear and anxiety around deploying and links it to poorer delivery performance and culture. [8] DORA also reports that continuous delivery predicts lower levels of burnout and less deployment pain. [2]&lt;/p&gt;
&lt;h3 id="the-replacement-pattern-6"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Re-connect engineers to production:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;give teams operational ownership for what they build&lt;/li&gt;
&lt;li&gt;make telemetry and incident review part of engineering&lt;/li&gt;
&lt;li&gt;reduce fear by making releases small, frequent, and observable&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="replacement-patterns-that-work"&gt;Replacement patterns that work&lt;/h2&gt;
&lt;p&gt;These are the patterns I&amp;rsquo;ve seen consistently restore delivery flow without chaos.&lt;/p&gt;
&lt;h3 id="1-clarify-decision-rights-and-keep-them-close-to-the-work"&gt;1) Clarify decision rights (and keep them close to the work)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;One accountable owner per initiative (not &amp;ldquo;everyone is accountable&amp;rdquo;)&lt;/li&gt;
&lt;li&gt;Engineers participate in tradeoff decisions early (scope, sequencing, risk)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="2-design-teams-for-flow-not-for-org-charts"&gt;2) Design teams for flow (not for org charts)&lt;/h3&gt;
&lt;p&gt;Organizations build systems that mirror their communication structures (Conway&amp;rsquo;s Law). [5]
If your org is siloed and layered, your architecture often becomes siloed and layered too.&lt;/p&gt;
&lt;p&gt;Design teams so the desired architecture is the &lt;em&gt;path of least resistance&lt;/em&gt;.&lt;/p&gt;
&lt;h3 id="3-prefer-stream-aligned-teams--platform-leverage"&gt;3) Prefer stream-aligned teams + platform leverage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Stream-aligned teams own outcomes end-to-end (no handoffs). [3][4]&lt;/li&gt;
&lt;li&gt;Platform teams reduce cognitive load by providing paved roads (auth, telemetry, CI/CD). [4]&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="4-replace-alignment-meetings-with-shared-artifacts"&gt;4) Replace &amp;ldquo;alignment meetings&amp;rdquo; with shared artifacts&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;one-page decision records&lt;/li&gt;
&lt;li&gt;clear &amp;ldquo;definition of done&amp;rdquo;&lt;/li&gt;
&lt;li&gt;demos that show working software in a real environment&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="5-turn-delivery-into-a-calm-repeatable-process"&gt;5) Turn delivery into a calm, repeatable process&lt;/h3&gt;
&lt;p&gt;When delivery is painful, people add layers to manage fear.
Fix the source:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;tests&lt;/li&gt;
&lt;li&gt;automation&lt;/li&gt;
&lt;li&gt;progressive delivery&lt;/li&gt;
&lt;li&gt;observable releases&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That&amp;rsquo;s how you reduce burnout sustainably. [2][8]&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="verification-how-you-know-the-org-is-healing"&gt;Verification: how you know the org is healing&lt;/h2&gt;
&lt;p&gt;Don&amp;rsquo;t rely on vibes. Use evidence.&lt;/p&gt;
&lt;h3 id="delivery-outcomes-system-level"&gt;Delivery outcomes (system-level)&lt;/h3&gt;
&lt;p&gt;Start with DORA metrics to track flow and stability. [1]&lt;/p&gt;
&lt;h3 id="product-outcomes"&gt;Product outcomes&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;adoption (are users actually using the thing?)&lt;/li&gt;
&lt;li&gt;retention (does usage persist?)&lt;/li&gt;
&lt;li&gt;reduced operational toil (do incidents go down?)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="team-outcomes"&gt;Team outcomes&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;fewer emergency escalations&lt;/li&gt;
&lt;li&gt;fewer &amp;ldquo;status-only&amp;rdquo; meetings&lt;/li&gt;
&lt;li&gt;improved on-call experience (lower deployment pain) [8]&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If lead time drops but burnout rises, you probably &amp;ldquo;optimized the dashboard&amp;rdquo; instead of the system (see Goodhart). [7]&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="a-practical-checklist"&gt;A practical checklist&lt;/h2&gt;
&lt;p&gt;If your org feels &amp;ldquo;management-heavy,&amp;rdquo; try this in order:&lt;/p&gt;
&lt;h3 id="reduce-translation-layers"&gt;Reduce translation layers&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Put engineers in the room (or thread) with real users/operators at least weekly.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Require the &amp;ldquo;why&amp;rdquo; to be written and reviewed before build starts.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="reduce-handoffs"&gt;Reduce handoffs&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Map the value stream and count handoffs.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Remove one handoff per quarter; make it a goal.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="reduce-wip"&gt;Reduce WIP&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Limit concurrent initiatives per team.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Finish before starting.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="convert-meetings-into-guardrails"&gt;Convert meetings into guardrails&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Replace approvals with automated checks where possible.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Create paved paths so the safe way is the easy way.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="reconnect-teams-to-production"&gt;Reconnect teams to production&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Teams own what they ship.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Tie incident learning back to design decisions.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Make releases smaller and more frequent.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;p&gt;[1] DORA - &amp;ldquo;DORA&amp;rsquo;s software delivery performance metrics (guide)&amp;rdquo;. &lt;a href="https://dora.dev/guides/dora-metrics/" target="_blank" rel="noopener noreferrer"&gt;https://dora.dev/guides/dora-metrics/&lt;/a&gt;
[2] DORA - &amp;ldquo;Capabilities: Continuous delivery&amp;rdquo; (notes relationship to burnout and deployment pain). &lt;a href="https://dora.dev/capabilities/continuous-delivery/" target="_blank" rel="noopener noreferrer"&gt;https://dora.dev/capabilities/continuous-delivery/&lt;/a&gt;
[3] Team Topologies - &amp;ldquo;Key Concepts&amp;rdquo; (stream-aligned teams; no handoffs). &lt;a href="https://teamtopologies.com/key-concepts" target="_blank" rel="noopener noreferrer"&gt;https://teamtopologies.com/key-concepts&lt;/a&gt;
[4] IT Revolution - &amp;ldquo;The Four Team Types from Team Topologies&amp;rdquo; (stream-aligned teams own end-to-end). &lt;a href="https://itrevolution.com/articles/four-team-types/" target="_blank" rel="noopener noreferrer"&gt;https://itrevolution.com/articles/four-team-types/&lt;/a&gt;
[5] Splunk - &amp;ldquo;Conway&amp;rsquo;s Law Explained&amp;rdquo; (systems mirror communication structures; includes original quote). &lt;a href="https://www.splunk.com/en_us/blog/learn/conways-law.html" target="_blank" rel="noopener noreferrer"&gt;https://www.splunk.com/en_us/blog/learn/conways-law.html&lt;/a&gt;
[6] Brooks&amp;rsquo;s Law (coined in &lt;em&gt;The Mythical Man-Month&lt;/em&gt;): &amp;ldquo;Adding manpower to a late software project makes it later.&amp;rdquo; &lt;a href="https://en.wikipedia.org/wiki/Brooks%27s_law" target="_blank" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Brooks%27s_law&lt;/a&gt;
[7] CNA - &amp;ldquo;Goodhart&amp;rsquo;s Law&amp;rdquo; (when a measure becomes a target, it ceases to be a good measure). &lt;a href="https://www.cna.org/analyses/2022/09/goodharts-law" target="_blank" rel="noopener noreferrer"&gt;https://www.cna.org/analyses/2022/09/goodharts-law&lt;/a&gt;
[8] DORA - &amp;ldquo;Capabilities: Well-being&amp;rdquo; (deployment pain and its relationship to performance/culture). &lt;a href="https://dora.dev/capabilities/well-being/" target="_blank" rel="noopener noreferrer"&gt;https://dora.dev/capabilities/well-being/&lt;/a&gt;
[9] SEI (CMU) - &amp;ldquo;How to Misuse and Abuse DORA Metrics&amp;rdquo; (metric anti-patterns). &lt;a href="https://www.sei.cmu.edu/library/how-to-misuse-and-abuse-dora-metrics/" target="_blank" rel="noopener noreferrer"&gt;https://www.sei.cmu.edu/library/how-to-misuse-and-abuse-dora-metrics/&lt;/a&gt;
&lt;/p&gt;</content:encoded></item><item><title>Agile Isn't Dead. Agile Compliance Is.</title><link>https://roygabriel.dev/blog/agile-compliance-is-dead/</link><pubDate>Wed, 31 Dec 2025 12:00:00 -0500</pubDate><guid>https://roygabriel.dev/blog/agile-compliance-is-dead/</guid><description>&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;&lt;strong&gt;Note on examples:&lt;/strong&gt; The scenarios below are &lt;strong&gt;anonymized composites&lt;/strong&gt;.
This isn&amp;rsquo;t &amp;ldquo;Agile bad.&amp;rdquo; It&amp;rsquo;s &amp;ldquo;Agile the brand is often used to justify systems that do the opposite of Agile&amp;rsquo;s intent.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;Agile isn&amp;rsquo;t a set of meetings. It&amp;rsquo;s a physics statement:&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;&lt;strong&gt;Shorter feedback loops reduce risk.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Most enterprises didn&amp;rsquo;t fail Agile. They replaced Agile with a bureaucracy that uses Agile vocabulary:&lt;/p&gt;</description><content:encoded>
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;&lt;strong&gt;Note on examples:&lt;/strong&gt; The scenarios below are &lt;strong&gt;anonymized composites&lt;/strong&gt;.
This isn&amp;rsquo;t &amp;ldquo;Agile bad.&amp;rdquo; It&amp;rsquo;s &amp;ldquo;Agile the brand is often used to justify systems that do the opposite of Agile&amp;rsquo;s intent.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;Agile isn&amp;rsquo;t a set of meetings. It&amp;rsquo;s a physics statement:&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;&lt;strong&gt;Shorter feedback loops reduce risk.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Most enterprises didn&amp;rsquo;t fail Agile. They replaced Agile with a bureaucracy that uses Agile vocabulary:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;Sprint&amp;rdquo; becomes a reporting interval&lt;/li&gt;
&lt;li&gt;&amp;ldquo;Velocity&amp;rdquo; becomes a performance metric&lt;/li&gt;
&lt;li&gt;&amp;ldquo;Planning&amp;rdquo; becomes a negotiation&lt;/li&gt;
&lt;li&gt;&amp;ldquo;Definition of done&amp;rdquo; becomes a checklist&lt;/li&gt;
&lt;li&gt;&amp;ldquo;Agile transformation&amp;rdquo; becomes a multi-year program&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The result is predictable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;delivery slows&lt;/li&gt;
&lt;li&gt;quality degrades&lt;/li&gt;
&lt;li&gt;reliability suffers&lt;/li&gt;
&lt;li&gt;engineers burn out&lt;/li&gt;
&lt;li&gt;product expectations aren&amp;rsquo;t met&lt;/li&gt;
&lt;li&gt;leadership gets more dashboards and fewer outcomes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This post is a production-first teardown of Agile theater - and a replacement model that actually ships.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="tldr"&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Agile is about &lt;strong&gt;learning quickly&lt;/strong&gt;, not &lt;strong&gt;predicting perfectly&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Scrum is useful when it reduces uncertainty. It&amp;rsquo;s harmful when it becomes a compliance system.&lt;/li&gt;
&lt;li&gt;If you treat sprints as contracts, you&amp;rsquo;ll get &lt;strong&gt;scrumfall&lt;/strong&gt;: waterfall dependencies with sprint-shaped reporting.&lt;/li&gt;
&lt;li&gt;Replace &amp;ldquo;Agile compliance&amp;rdquo; with:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Flow&lt;/strong&gt; (small batches, limit WIP)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Continuous delivery&lt;/strong&gt; (safe, frequent releases) [4]&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evidence-based planning&lt;/strong&gt; (measure outcomes; adjust quickly) [5]&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Use system metrics (DORA) to verify improvement: lead time, deploy frequency, change failure rate, MTTR. [6]&lt;/li&gt;
&lt;li&gt;Beware Goodhart&amp;rsquo;s Law: metrics used as targets will be gamed. [7]&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="contents"&gt;Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#agile-the-physics-vs-agile-the-bureaucracy"&gt;Agile the physics vs Agile the bureaucracy&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-1-sprints-as-contracts"&gt;Pattern 1: Sprints as contracts&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-2-velocity-as-a-performance-metric"&gt;Pattern 2: Velocity as a performance metric&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-3-backlog-bloat-as-a-museum-of-anxiety"&gt;Pattern 3: Backlog bloat as a museum of anxiety&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-4-ceremonies-become-the-work"&gt;Pattern 4: Ceremonies become the work&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-5-dependencies-turn-scrum-into-fiction"&gt;Pattern 5: Dependencies turn Scrum into fiction&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-6-definition-of-done-without-production"&gt;Pattern 6: Definition of done without production&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-7-product-ownership-by-proxy"&gt;Pattern 7: Product ownership by proxy&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#whats-better-flow--cd--evidence"&gt;What&amp;rsquo;s better: Flow + CD + evidence&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#transition-plan-30-days-without-a-revolution"&gt;Transition plan: 30 days without a revolution&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#verification-how-you-know-its-working"&gt;Verification: how you know it&amp;rsquo;s working&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-practical-checklist"&gt;A practical checklist&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#references"&gt;References&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="agile-the-physics-vs-agile-the-bureaucracy"&gt;Agile the physics vs Agile the bureaucracy&lt;/h2&gt;
&lt;p&gt;The Agile Manifesto values working software over comprehensive documentation and emphasizes collaboration and responding to change. [1] One of its principles states that &lt;strong&gt;working software is the primary measure of progress&lt;/strong&gt;. [2]&lt;/p&gt;
&lt;p&gt;Those ideas are still correct.&lt;/p&gt;
&lt;p&gt;What broke in enterprises is implementation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Agile became &lt;strong&gt;process&lt;/strong&gt; instead of &lt;strong&gt;feedback&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Agile artifacts became &lt;strong&gt;deliverables&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;teams were optimized for &lt;strong&gt;predictability theater&lt;/strong&gt; instead of throughput and learning&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In short: Agile got turned into compliance.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="pattern-1-sprints-as-contracts"&gt;Pattern 1: Sprints as contracts&lt;/h2&gt;
&lt;h3 id="what-it-looks-like"&gt;What it looks like&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Sprint planning is treated as a commitment contract.&lt;/li&gt;
&lt;li&gt;Changing scope is seen as failure, even when reality changes.&lt;/li&gt;
&lt;li&gt;Teams avoid surfacing unknowns because unknowns disrupt &amp;ldquo;commitment.&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-it-happens"&gt;Why it happens&lt;/h3&gt;
&lt;p&gt;Leaders want predictability. Sprints feel like a way to buy it.&lt;/p&gt;
&lt;h3 id="the-hidden-tax"&gt;The hidden tax&lt;/h3&gt;
&lt;p&gt;When you turn sprints into contracts, teams adapt:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;reduce exploration&lt;/li&gt;
&lt;li&gt;defer integration&lt;/li&gt;
&lt;li&gt;accept low-quality shortcuts&lt;/li&gt;
&lt;li&gt;split work into artificial &amp;ldquo;done-looking&amp;rdquo; chunks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You don&amp;rsquo;t eliminate uncertainty. You hide it until the end.&lt;/p&gt;
&lt;h3 id="the-replacement-pattern"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Use cadence as a heartbeat, not as a contract:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Plan in small chunks.&lt;/li&gt;
&lt;li&gt;Commit to &lt;strong&gt;outcomes and constraints&lt;/strong&gt;, not a stack of tickets.&lt;/li&gt;
&lt;li&gt;Treat scope as a lever; treat time as a constraint.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="pattern-2-velocity-as-a-performance-metric"&gt;Pattern 2: Velocity as a performance metric&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-1"&gt;What it looks like&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Story points are treated as a measure of productivity.&lt;/li&gt;
&lt;li&gt;Velocity is compared across teams.&lt;/li&gt;
&lt;li&gt;Teams feel pressure to &amp;ldquo;go faster&amp;rdquo; by increasing points delivered.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-it-happens-1"&gt;Why it happens&lt;/h3&gt;
&lt;p&gt;Velocity is a number. Numbers are tempting.&lt;/p&gt;
&lt;h3 id="the-hidden-tax-1"&gt;The hidden tax&lt;/h3&gt;
&lt;p&gt;Story points are a local measure with no consistent meaning across teams. When you attach incentives, teams optimize for the metric:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;inflate estimates&lt;/li&gt;
&lt;li&gt;split work to maximize points&lt;/li&gt;
&lt;li&gt;avoid hard, high-leverage work&lt;/li&gt;
&lt;li&gt;ship low-value changes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is a textbook Goodhart&amp;rsquo;s Law failure mode: when a measure becomes a target, it ceases to be a good measure. [7]&lt;/p&gt;
&lt;h3 id="the-replacement-pattern-1"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Measure the system, not the story:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;lead time&lt;/li&gt;
&lt;li&gt;cycle time&lt;/li&gt;
&lt;li&gt;deploy frequency&lt;/li&gt;
&lt;li&gt;change failure rate&lt;/li&gt;
&lt;li&gt;MTTR&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Use metrics diagnostically, not as quarterly targets.&lt;/p&gt;
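&lt;p&gt;To make the difference between lead time and cycle time concrete, here is a small sketch. It assumes lead time runs from request to production and cycle time from work started to production; definitions vary between tools, and the item fields and &lt;code&gt;median_days&lt;/code&gt; helper are hypothetical.&lt;/p&gt;

```python
from datetime import datetime
from statistics import median

# Hypothetical work items with three timestamps each.
ITEMS = [
    {"requested": datetime(2026, 1, 1), "started": datetime(2026, 1, 10),
     "in_production": datetime(2026, 1, 12)},
    {"requested": datetime(2026, 1, 2), "started": datetime(2026, 1, 20),
     "in_production": datetime(2026, 1, 23)},
    {"requested": datetime(2026, 1, 5), "started": datetime(2026, 1, 6),
     "in_production": datetime(2026, 1, 10)},
]


def median_days(items, start_key, end_key):
    """Median elapsed whole days between two timestamps across all items."""
    return median((item[end_key] - item[start_key]).days for item in items)


lead_time = median_days(ITEMS, "requested", "in_production")  # includes queue time
cycle_time = median_days(ITEMS, "started", "in_production")   # active work only
queue_time = median_days(ITEMS, "requested", "started")       # waiting before work starts

print("median lead time (days):", lead_time)
print("median cycle time (days):", cycle_time)
print("median queue time (days):", queue_time)
```

&lt;p&gt;When lead time is much larger than cycle time, as in this sample data, the constraint is the queue in front of the team rather than the speed of the team itself. No amount of pressure on velocity fixes that.&lt;/p&gt;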
&lt;hr&gt;
&lt;h2 id="pattern-3-backlog-bloat-as-a-museum-of-anxiety"&gt;Pattern 3: Backlog bloat as a museum of anxiety&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-2"&gt;What it looks like&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Thousands of backlog items exist &amp;ldquo;for visibility.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;Nothing gets deleted.&lt;/li&gt;
&lt;li&gt;Refinement happens continuously, but priorities change weekly.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-it-happens-2"&gt;Why it happens&lt;/h3&gt;
&lt;p&gt;Backlogs feel like control: &amp;ldquo;We haven&amp;rsquo;t forgotten.&amp;rdquo;&lt;/p&gt;
&lt;h3 id="the-hidden-tax-2"&gt;The hidden tax&lt;/h3&gt;
&lt;p&gt;A giant backlog increases planning cost and reduces focus. Teams stop trusting priorities and operate on side-channel requests.&lt;/p&gt;
&lt;p&gt;My favorite framing:&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;If everything is in the backlog, nothing is prioritized. It&amp;rsquo;s just a museum of anxiety.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id="the-replacement-pattern-2"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Adopt a tight horizon model:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Now:&lt;/strong&gt; what we&amp;rsquo;re building&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Next:&lt;/strong&gt; what&amp;rsquo;s likely next&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Later:&lt;/strong&gt; ideas (low-investment capture)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Refine Now/Next. Archive the rest.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="pattern-4-ceremonies-become-the-work"&gt;Pattern 4: Ceremonies become the work&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-3"&gt;What it looks like&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Standups become status meetings for managers.&lt;/li&gt;
&lt;li&gt;Planning takes hours.&lt;/li&gt;
&lt;li&gt;Refinement is endless.&lt;/li&gt;
&lt;li&gt;Retrospectives generate action items that never get resourced.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-it-happens-3"&gt;Why it happens&lt;/h3&gt;
&lt;p&gt;Ceremonies are easy to schedule. Delivery capability is harder to build.&lt;/p&gt;
&lt;h3 id="the-hidden-tax-3"&gt;The hidden tax&lt;/h3&gt;
&lt;p&gt;Attention becomes fragmented. Engineers become &amp;ldquo;meeting responders.&amp;rdquo; Work is spread across too many initiatives at once.&lt;/p&gt;
&lt;p&gt;This is how you get:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;slow delivery&lt;/li&gt;
&lt;li&gt;low quality&lt;/li&gt;
&lt;li&gt;burnout&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-replacement-pattern-3"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Keep only the meetings that reduce uncertainty:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;shorter planning&lt;/li&gt;
&lt;li&gt;true async refinement&lt;/li&gt;
&lt;li&gt;standup for coordination within the team (not reporting)&lt;/li&gt;
&lt;li&gt;retros with real ownership and budget&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Then invest in the thing ceremonies can&amp;rsquo;t replace: &lt;strong&gt;engineering capability&lt;/strong&gt; (tests, pipelines, observability, automation).&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="pattern-5-dependencies-turn-scrum-into-fiction"&gt;Pattern 5: Dependencies turn Scrum into fiction&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-4"&gt;What it looks like&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Every story depends on another team.&lt;/li&gt;
&lt;li&gt;&amp;ldquo;Blocked&amp;rdquo; is normal.&lt;/li&gt;
&lt;li&gt;Integration is deferred to later sprints.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-it-happens-4"&gt;Why it happens&lt;/h3&gt;
&lt;p&gt;Organizations are siloed. Systems mirror communication structures (Conway&amp;rsquo;s Law). [8]&lt;/p&gt;
&lt;h3 id="the-hidden-tax-4"&gt;The hidden tax&lt;/h3&gt;
&lt;p&gt;You get scrumfall: waterfall dependencies, sprint-shaped reporting.&lt;/p&gt;
&lt;p&gt;A two-week sprint can&amp;rsquo;t save a three-month dependency queue.&lt;/p&gt;
&lt;h3 id="the-replacement-pattern-4"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Design for end-to-end ownership and flow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;reduce handoffs&lt;/li&gt;
&lt;li&gt;remove or automate cross-team gates&lt;/li&gt;
&lt;li&gt;create platform paved roads so teams can self-serve [9]&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When dependencies can&amp;rsquo;t be eliminated, make them explicit and manage them like risk, not like hope.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="pattern-6-definition-of-done-without-production"&gt;Pattern 6: Definition of done without production&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-5"&gt;What it looks like&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;Done&amp;rdquo; means &amp;ldquo;merged.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;QA is a phase.&lt;/li&gt;
&lt;li&gt;Observability is optional.&lt;/li&gt;
&lt;li&gt;Releases happen &amp;ldquo;later.&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-it-happens-5"&gt;Why it happens&lt;/h3&gt;
&lt;p&gt;Shipping is painful. So teams avoid it.&lt;/p&gt;
&lt;h3 id="the-hidden-tax-5"&gt;The hidden tax&lt;/h3&gt;
&lt;p&gt;If &amp;ldquo;done&amp;rdquo; doesn&amp;rsquo;t include production, you accumulate:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;integration debt&lt;/li&gt;
&lt;li&gt;release debt&lt;/li&gt;
&lt;li&gt;incident debt&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Reliability declines because feedback arrives late.&lt;/p&gt;
&lt;p&gt;Continuous delivery&amp;rsquo;s core argument is that keeping software deployable and releasing frequently reduces risk and enables faster feedback. [4]&lt;/p&gt;
&lt;h3 id="the-replacement-pattern-5"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Upgrade your definition of done:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;deployed to a real environment&lt;/li&gt;
&lt;li&gt;observable (metrics/logs/traces)&lt;/li&gt;
&lt;li&gt;rollback path exists&lt;/li&gt;
&lt;li&gt;runbook exists for major failure modes&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="pattern-7-product-ownership-by-proxy"&gt;Pattern 7: Product ownership by proxy&lt;/h2&gt;
&lt;h3 id="what-it-looks-like-6"&gt;What it looks like&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Engineers rarely talk to users/operators.&lt;/li&gt;
&lt;li&gt;&amp;ldquo;Product&amp;rdquo; is a chain of intermediaries.&lt;/li&gt;
&lt;li&gt;Requirements arrive as polished tickets without the &amp;ldquo;why.&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-it-happens-6"&gt;Why it happens&lt;/h3&gt;
&lt;p&gt;The organization tries to protect engineers from churn.&lt;/p&gt;
&lt;h3 id="the-hidden-tax-6"&gt;The hidden tax&lt;/h3&gt;
&lt;p&gt;This degrades the input signal. Engineers build the wrong thing efficiently - and then everyone is surprised it didn&amp;rsquo;t land.&lt;/p&gt;
&lt;h3 id="the-replacement-pattern-6"&gt;The replacement pattern&lt;/h3&gt;
&lt;p&gt;Bring engineers closer to reality:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;listen to customer calls&lt;/li&gt;
&lt;li&gt;review usage telemetry&lt;/li&gt;
&lt;li&gt;participate in discovery&lt;/li&gt;
&lt;li&gt;keep the &amp;ldquo;why&amp;rdquo; attached to every build&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;No one should ship something they can&amp;rsquo;t explain.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="whats-better-flow--cd--evidence"&gt;What&amp;rsquo;s better: Flow + CD + evidence&lt;/h2&gt;
&lt;p&gt;If Agile compliance is the disease, what&amp;rsquo;s the cure?&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s not &amp;ldquo;a different framework.&amp;rdquo; It&amp;rsquo;s an operating model:&lt;/p&gt;
&lt;h3 id="1-flow-small-batches-limited-wip"&gt;1) Flow: small batches, limited WIP&lt;/h3&gt;
&lt;p&gt;Lean/Kanban concepts focus on limiting work in progress and optimizing for flow. [3]&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Finish work before starting more.&lt;/li&gt;
&lt;li&gt;Reduce batch size.&lt;/li&gt;
&lt;li&gt;Make queues visible.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="2-continuous-delivery-make-change-safe"&gt;2) Continuous Delivery: make change safe&lt;/h3&gt;
&lt;p&gt;Continuous delivery is a capability: keep changes small, deployable, and observable so you can release frequently with lower risk. [4]&lt;/p&gt;
&lt;p&gt;This includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;CI&lt;/li&gt;
&lt;li&gt;automated testing&lt;/li&gt;
&lt;li&gt;progressive delivery (when needed)&lt;/li&gt;
&lt;li&gt;rollback/roll-forward discipline&lt;/li&gt;
&lt;li&gt;telemetry tied to releases&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="3-evidence-based-planning-bets-not-contracts"&gt;3) Evidence-based planning: bets, not contracts&lt;/h3&gt;
&lt;p&gt;Lean Startup&amp;rsquo;s build-measure-learn loop emphasizes validated learning - ship something real, measure, and adjust. [5]&lt;/p&gt;
&lt;p&gt;For enterprises, the translation is simple:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Plan in small bets&lt;/li&gt;
&lt;li&gt;Validate early&lt;/li&gt;
&lt;li&gt;Re-plan based on evidence, not politics&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="transition-plan-30-days-without-a-revolution"&gt;Transition plan: 30 days without a revolution&lt;/h2&gt;
&lt;p&gt;You don&amp;rsquo;t need to burn the framework down. You need to change what you reward and what you ship.&lt;/p&gt;
&lt;h3 id="week-1-make-work-visible-as-flow"&gt;Week 1: Make work visible as flow&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Map the value stream from idea -&amp;gt; production.&lt;/li&gt;
&lt;li&gt;Count handoffs.&lt;/li&gt;
&lt;li&gt;Measure current lead time.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="week-2-reduce-batch-size"&gt;Week 2: Reduce batch size&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Pick one initiative.&lt;/li&gt;
&lt;li&gt;Cut it to a thin vertical slice that can ship.&lt;/li&gt;
&lt;li&gt;Define &amp;ldquo;done&amp;rdquo; as &amp;ldquo;in production, measurable.&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="week-3-reduce-wip"&gt;Week 3: Reduce WIP&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Stop starting new work.&lt;/li&gt;
&lt;li&gt;Finish the slice.&lt;/li&gt;
&lt;li&gt;Remove one blocking dependency with a paved path or automation.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="week-4-close-the-feedback-loop"&gt;Week 4: Close the feedback loop&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Ship.&lt;/li&gt;
&lt;li&gt;Measure.&lt;/li&gt;
&lt;li&gt;Run a retro focused on system constraints (not blame).&lt;/li&gt;
&lt;li&gt;Repeat.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you do this and nothing improves, you learned something valuable: the constraint is elsewhere.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="verification-how-you-know-its-working"&gt;Verification: how you know it&amp;rsquo;s working&lt;/h2&gt;
&lt;p&gt;You should see movement in system outcomes.&lt;/p&gt;
&lt;p&gt;DORA describes four key delivery performance metrics: lead time for changes, deployment frequency, change failure rate, and time to restore service. [6]&lt;/p&gt;
&lt;p&gt;Signs of real improvement:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;lead time drops (less queueing and fewer handoffs)&lt;/li&gt;
&lt;li&gt;deploy frequency rises (smaller batches, calmer releases)&lt;/li&gt;
&lt;li&gt;change failure rate drops (better tests and safer rollouts)&lt;/li&gt;
&lt;li&gt;time to restore service (MTTR) drops (better observability and operability)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And importantly: teams report less &amp;ldquo;deployment pain&amp;rdquo; and less burnout as delivery becomes calmer and more reliable. [10]&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="a-practical-checklist"&gt;A practical checklist&lt;/h2&gt;
&lt;p&gt;If you&amp;rsquo;re stuck in Agile theater, try this:&lt;/p&gt;
&lt;h3 id="stop-measuring-activity"&gt;Stop measuring activity&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Stop comparing velocity across teams.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Stop treating story points as productivity.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="shrink-feedback-loops"&gt;Shrink feedback loops&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Ship a thin slice to production early (behind a flag if needed).&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Put engineers closer to users/operators.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="reduce-handoffs-and-wip"&gt;Reduce handoffs and WIP&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Limit concurrent initiatives.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Remove one handoff per quarter.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="invest-in-delivery-capability"&gt;Invest in delivery capability&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; CI, tests, and deployment automation.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Observability tied to releases.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Safer rollouts and rollback paths.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="use-metrics-as-signals-not-targets"&gt;Use metrics as signals, not targets&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Track DORA metrics at the system level. [6]&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Avoid metric gaming (Goodhart). [7]&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;p&gt;[1] Manifesto for Agile Software Development (values). &lt;a href="https://agilemanifesto.org/" target="_blank" rel="noopener noreferrer"&gt;https://agilemanifesto.org/&lt;/a&gt;
[2] Principles behind the Agile Manifesto (&amp;ldquo;Working software is the primary measure of progress&amp;rdquo;). &lt;a href="https://agilemanifesto.org/principles.html" target="_blank" rel="noopener noreferrer"&gt;https://agilemanifesto.org/principles.html&lt;/a&gt;
[3] Kanban Guide (principles and practices oriented around flow and WIP). &lt;a href="https://kanbanguides.org/english/" target="_blank" rel="noopener noreferrer"&gt;https://kanbanguides.org/english/&lt;/a&gt;
[4] Continuous Delivery (concepts; keep software deployable, release frequently). &lt;a href="https://continuousdelivery.com/" target="_blank" rel="noopener noreferrer"&gt;https://continuousdelivery.com/&lt;/a&gt;
[5] The Lean Startup - Principles (Build-Measure-Learn; validated learning). &lt;a href="https://theleanstartup.com/principles" target="_blank" rel="noopener noreferrer"&gt;https://theleanstartup.com/principles&lt;/a&gt;
[6] DORA - &amp;ldquo;DORA&amp;rsquo;s software delivery performance metrics (guide)&amp;rdquo;. &lt;a href="https://dora.dev/guides/dora-metrics/" target="_blank" rel="noopener noreferrer"&gt;https://dora.dev/guides/dora-metrics/&lt;/a&gt;
[7] CNA - &amp;ldquo;Goodhart&amp;rsquo;s Law&amp;rdquo; (when a measure becomes a target, it ceases to be a good measure). &lt;a href="https://www.cna.org/analyses/2022/09/goodharts-law" target="_blank" rel="noopener noreferrer"&gt;https://www.cna.org/analyses/2022/09/goodharts-law&lt;/a&gt;
[8] Splunk - &amp;ldquo;Conway&amp;rsquo;s Law Explained&amp;rdquo; (systems mirror communication structures; includes original quote). &lt;a href="https://www.splunk.com/en_us/blog/learn/conways-law.html" target="_blank" rel="noopener noreferrer"&gt;https://www.splunk.com/en_us/blog/learn/conways-law.html&lt;/a&gt;
[9] Microsoft Engineering Blog - &amp;ldquo;Building paved paths: the journey to platform engineering&amp;rdquo;. &lt;a href="https://devblogs.microsoft.com/engineering-at-microsoft/building-paved-paths-the-journey-to-platform-engineering/" target="_blank" rel="noopener noreferrer"&gt;https://devblogs.microsoft.com/engineering-at-microsoft/building-paved-paths-the-journey-to-platform-engineering/&lt;/a&gt;
[10] DORA - &amp;ldquo;Capabilities: Well-being&amp;rdquo; (deployment pain and relationship to performance/culture). &lt;a href="https://dora.dev/capabilities/well-being/" target="_blank" rel="noopener noreferrer"&gt;https://dora.dev/capabilities/well-being/&lt;/a&gt;
&lt;/p&gt;</content:encoded></item><item><title>From Stdio to Enterprise: The MCP Gateway Pattern</title><link>https://roygabriel.dev/blog/mcp-gateway-pattern/</link><pubDate>Sat, 22 Nov 2025 12:00:00 -0500</pubDate><guid>https://roygabriel.dev/blog/mcp-gateway-pattern/</guid><description>&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;&lt;strong&gt;As-of note:&lt;/strong&gt; MCP evolves quickly. This article references the MCP spec revision &lt;strong&gt;2025-11-25&lt;/strong&gt;. Validate details against the current spec before shipping changes. [1][2][3]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;Local MCP servers over &lt;strong&gt;stdio&lt;/strong&gt; are an amazing developer experience: you install a tool server, the host (Claude Desktop / Claude Code / an agent runtime) launches it, and you&amp;rsquo;re productive in minutes. [2]&lt;/p&gt;
&lt;p&gt;But as soon as MCP becomes &lt;em&gt;shared infrastructure&lt;/em&gt; - multiple clients, multiple users, multiple environments - the &amp;ldquo;local tool server&amp;rdquo; model runs into the same constraints every integration layer hits:&lt;/p&gt;</description><content:encoded>
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;&lt;strong&gt;As-of note:&lt;/strong&gt; MCP evolves quickly. This article references the MCP spec revision &lt;strong&gt;2025-11-25&lt;/strong&gt;. Validate details against the current spec before shipping changes. [1][2][3]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;Local MCP servers over &lt;strong&gt;stdio&lt;/strong&gt; are an amazing developer experience: you install a tool server, the host (Claude Desktop / Claude Code / an agent runtime) launches it, and you&amp;rsquo;re productive in minutes. [2]&lt;/p&gt;
&lt;p&gt;But as soon as MCP becomes &lt;em&gt;shared infrastructure&lt;/em&gt; - multiple clients, multiple users, multiple environments - the &amp;ldquo;local tool server&amp;rdquo; model runs into the same constraints every integration layer hits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Who is allowed to call what tool?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How do you prevent one noisy user from melting shared dependencies?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How do you audit tool side effects?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How do you roll out tool changes without breaking clients?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How do you keep secrets out of prompts, logs, and screenshots?&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is where the &lt;strong&gt;MCP Gateway Pattern&lt;/strong&gt; shows up.&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;A gateway is not &amp;ldquo;another service.&amp;rdquo; It&amp;rsquo;s a &lt;strong&gt;capability boundary&lt;/strong&gt;: the place where you enforce policy, budgets, and observability for tool use at scale.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;h2 id="tldr"&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Stdio is great for local, single-user, low-blast-radius setups.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;HTTP transports&lt;/strong&gt; (Streamable HTTP) enable multi-client servers - but they also require real auth and multi-tenant safety. [2][3]&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;MCP gateway&lt;/strong&gt; sits between clients and tool servers to provide:
&lt;ul&gt;
&lt;li&gt;authentication &amp;amp; authorization&lt;/li&gt;
&lt;li&gt;tenant isolation&lt;/li&gt;
&lt;li&gt;rate limits / concurrency / cost budgets&lt;/li&gt;
&lt;li&gt;consistent tool schemas + safety gates&lt;/li&gt;
&lt;li&gt;audit logs and observability&lt;/li&gt;
&lt;li&gt;routing, versioning, rollout controls&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Build the gateway to be boring: small surface area, strict validation, explicit policies, great telemetry.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="contents"&gt;Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#when-stdio-stops-being-enough"&gt;When stdio stops being enough&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-mcp-gateway-pattern"&gt;The MCP Gateway Pattern&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#responsibilities-of-a-gateway"&gt;Responsibilities of a gateway&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#reference-architecture"&gt;Reference architecture&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#policy-patterns-that-actually-work"&gt;Policy patterns that actually work&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#scaling-and-isolation-strategies"&gt;Scaling and isolation strategies&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#observability-and-audit"&gt;Observability and audit&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#rollouts-and-versioning"&gt;Rollouts and versioning&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-production-checklist"&gt;A production checklist&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#references"&gt;References&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="when-stdio-stops-being-enough"&gt;When stdio stops being enough&lt;/h2&gt;
&lt;p&gt;MCP supports multiple transports; stdio is common for local servers. [2] In that model, the host controls process lifetime and secrets typically come from the environment on the local machine.&lt;/p&gt;
&lt;p&gt;Stdio starts to strain when you need:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;multi-client concurrency&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;shared tenancy&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;central policy enforcement&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;centralized audit&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;fleet-level rollout controls&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At that point, you&amp;rsquo;re effectively building a platform. The platform needs a stable ingress point with consistent security and operational behavior.&lt;/p&gt;
&lt;p&gt;MCP&amp;rsquo;s &lt;strong&gt;HTTP-based transports&lt;/strong&gt; (like Streamable HTTP) are designed for servers that can handle multiple connections and enable streaming/notifications. [2] MCP also defines an authorization flow for HTTP-based transports. [3]&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s the entry point for a gateway.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="the-mcp-gateway-pattern"&gt;The MCP Gateway Pattern&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Definition:&lt;/strong&gt; An MCP gateway is an MCP server (or MCP-adjacent ingress layer) that:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;authenticates and authorizes the client&lt;/li&gt;
&lt;li&gt;routes requests to one or more downstream MCP servers (or tool backends)&lt;/li&gt;
&lt;li&gt;enforces budgets and safety gates&lt;/li&gt;
&lt;li&gt;emits consistent telemetry and audit records&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;It looks like an API gateway, but what it brokers is &amp;ldquo;tool capability&amp;rdquo; rather than &amp;ldquo;REST endpoints.&amp;rdquo;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="responsibilities-of-a-gateway"&gt;Responsibilities of a gateway&lt;/h2&gt;
&lt;h3 id="1-authentication-and-authorization"&gt;1) Authentication and authorization&lt;/h3&gt;
&lt;p&gt;If you expose MCP servers over HTTP, you need strong auth. MCP includes an authorization framework at the transport layer for HTTP-based transports. [3]&lt;/p&gt;
&lt;p&gt;Practical gateway rules:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Authenticate every client&lt;/strong&gt; (bearer tokens, mTLS, OAuth-derived access tokens).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authorize per tool&lt;/strong&gt;, not per server.&lt;/li&gt;
&lt;li&gt;Prefer &lt;strong&gt;least privilege&lt;/strong&gt; scopes:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;calendar.read&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;calendar.write&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;email.read&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;email.send&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;k8s.readonly&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;k8s.apply&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;For high-impact tools: require explicit confirmation tokens and/or multi-party approval.&lt;/li&gt;
&lt;/ul&gt;
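&lt;p&gt;A minimal Go sketch of per-tool authorization, assuming a static tool-to-scope table. The tool names and the &lt;code&gt;Authorize&lt;/code&gt; helper are hypothetical; only the scope names come from the list above.&lt;/p&gt;

```go
package main

import "fmt"

// toolScopes maps each tool name to the single scope it requires.
// The tool names are illustrative, not part of any real server.
var toolScopes = map[string]string{
	"calendar_list_events":  "calendar.read",
	"calendar_create_event": "calendar.write",
	"email_send":            "email.send",
	"k8s_get_pods":          "k8s.readonly",
	"k8s_apply_manifest":    "k8s.apply",
}

// Authorize returns nil when the granted scopes cover the tool.
// Unknown tools are denied by default.
func Authorize(granted []string, tool string) error {
	required, ok := toolScopes[tool]
	if !ok {
		return fmt.Errorf("tool %q is not registered", tool)
	}
	for _, s := range granted {
		if s == required {
			return nil
		}
	}
	return fmt.Errorf("tool %q requires scope %q", tool, required)
}

func main() {
	granted := []string{"calendar.read", "k8s.readonly"}
	fmt.Println(Authorize(granted, "calendar_list_events") == nil) // true
	fmt.Println(Authorize(granted, "k8s_apply_manifest"))          // denied: missing k8s.apply
}
```

&lt;p&gt;Keying the check on the tool rather than the server means a client with &lt;code&gt;k8s.readonly&lt;/code&gt; can reach the Kubernetes server without ever being able to apply a manifest.&lt;/p&gt;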
&lt;h3 id="2-tool-contract-enforcement"&gt;2) Tool contract enforcement&lt;/h3&gt;
&lt;p&gt;MCP tools are invoked by an LLM-driven client. That means tool arguments are &lt;strong&gt;untrusted&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The gateway is the ideal place to enforce:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;schema validation&lt;/li&gt;
&lt;li&gt;payload size caps&lt;/li&gt;
&lt;li&gt;allowlists and blocklists&lt;/li&gt;
&lt;li&gt;&amp;ldquo;danger gates&amp;rdquo; (preview/apply, confirmations)&lt;/li&gt;
&lt;li&gt;&amp;ldquo;semantic validation&amp;rdquo; (not just types - e.g., limits required, date ranges bounded)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;MCP&amp;rsquo;s spec is grounded in structured schemas; treat those schemas as contracts. [1]&lt;/p&gt;
&lt;h3 id="3-budgets-and-backpressure"&gt;3) Budgets and backpressure&lt;/h3&gt;
&lt;p&gt;Agents can trigger bursty tool calls. Without backpressure you get the classic cascade:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;upstream rate limits&lt;/li&gt;
&lt;li&gt;DB pool exhaustion&lt;/li&gt;
&lt;li&gt;thread/goroutine explosion&lt;/li&gt;
&lt;li&gt;timeouts everywhere&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At the gateway you can enforce:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;per-tenant rate limits&lt;/li&gt;
&lt;li&gt;per-tool concurrency limits&lt;/li&gt;
&lt;li&gt;timeouts and deadline propagation&lt;/li&gt;
&lt;li&gt;queue depth caps (bounded memory)&lt;/li&gt;
&lt;li&gt;circuit breakers for flaky dependencies&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is where you keep &amp;ldquo;one user spamming tools&amp;rdquo; from becoming &amp;ldquo;everyone is down.&amp;rdquo;&lt;/p&gt;
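&lt;p&gt;One way to sketch a per-tool concurrency cap in Go is a mutex-guarded counter that rejects immediately instead of queueing. The &lt;code&gt;ConcurrencyLimiter&lt;/code&gt; type is hypothetical; a real gateway would likely combine it with rate limits and deadlines.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

// ConcurrencyLimiter caps in-flight calls per key (for example, per tool).
type ConcurrencyLimiter struct {
	mu       sync.Mutex
	max      int
	inFlight map[string]int
}

// NewConcurrencyLimiter allows at most max concurrent calls per key.
func NewConcurrencyLimiter(max int) *ConcurrencyLimiter {
	l := new(ConcurrencyLimiter)
	l.max = max
	l.inFlight = make(map[string]int)
	return l
}

// TryAcquire reserves a slot for key, or returns false right away so the
// caller can reject the request instead of queueing without bound.
func (l *ConcurrencyLimiter) TryAcquire(key string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inFlight[key] >= l.max {
		return false
	}
	l.inFlight[key]++
	return true
}

// Release frees a slot previously reserved with TryAcquire.
func (l *ConcurrencyLimiter) Release(key string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inFlight[key] > 0 {
		l.inFlight[key]--
	}
}

func main() {
	lim := NewConcurrencyLimiter(2)
	fmt.Println(lim.TryAcquire("k8s_apply")) // true
	fmt.Println(lim.TryAcquire("k8s_apply")) // true
	fmt.Println(lim.TryAcquire("k8s_apply")) // false: cap reached
	lim.Release("k8s_apply")
	fmt.Println(lim.TryAcquire("k8s_apply")) // true again
}
```

&lt;p&gt;Failing fast is a deliberate choice here: a rejected call is cheap for the client to retry, while an unbounded queue hides overload until memory runs out.&lt;/p&gt;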
&lt;h3 id="4-secret-handling-and-redaction"&gt;4) Secret handling and redaction&lt;/h3&gt;
&lt;p&gt;Gateways are a natural place to centralize:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;secret injection (short-lived tokens per tenant)&lt;/li&gt;
&lt;li&gt;output redaction (strip tokens, emails, PII fields)&lt;/li&gt;
&lt;li&gt;logging policies (never log raw tool payloads by default)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For agent systems, OWASP lists prompt injection and sensitive information disclosure among its major risk categories. [7]&lt;/p&gt;
&lt;p&gt;Your gateway should assume that anything returned by a tool could be coerced into exfiltration if you&amp;rsquo;re careless.&lt;/p&gt;
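&lt;p&gt;A minimal Go sketch of output redaction, assuming tool results arrive as decoded JSON objects. The key list and the &lt;code&gt;Redact&lt;/code&gt; helper are hypothetical, and matching on field names alone will miss secrets embedded in free text.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// sensitiveKeys lists substrings of field names to mask (illustrative only).
var sensitiveKeys = []string{"token", "secret", "password", "authorization", "email"}

// Redact returns a copy of payload with sensitive values masked.
// Nested objects are walked recursively; arrays and scalars are copied as-is.
func Redact(payload map[string]any) map[string]any {
	out := make(map[string]any, len(payload))
	for k, v := range payload {
		if isSensitive(k) {
			out[k] = "[REDACTED]"
			continue
		}
		if nested, ok := v.(map[string]any); ok {
			out[k] = Redact(nested)
			continue
		}
		out[k] = v
	}
	return out
}

// isSensitive reports whether a field name contains a sensitive substring.
func isSensitive(key string) bool {
	lower := strings.ToLower(key)
	for _, s := range sensitiveKeys {
		if strings.Contains(lower, s) {
			return true
		}
	}
	return false
}

func main() {
	result := map[string]any{
		"access_token": "abc123",
		"user":         map[string]any{"email": "ada@example.com", "name": "Ada"},
		"count":        3,
	}
	fmt.Println(Redact(result))
}
```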
&lt;h3 id="5-observability-and-audit"&gt;5) Observability and audit&lt;/h3&gt;
&lt;p&gt;Operationally, the gateway is your best place to emit consistent:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;request logs&lt;/li&gt;
&lt;li&gt;tool call metrics&lt;/li&gt;
&lt;li&gt;traces across tool chains&lt;/li&gt;
&lt;li&gt;audit events for side effects&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;OpenTelemetry is the de facto standard for collecting and exporting telemetry. [5] W3C Trace Context defines headers like &lt;code&gt;traceparent&lt;/code&gt;/&lt;code&gt;tracestate&lt;/code&gt; for trace propagation across services. [6]&lt;/p&gt;
&lt;p&gt;If you want an enterprise to trust agents, you need the forensic trail.&lt;/p&gt;
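&lt;p&gt;For illustration, the sketch below validates an incoming &lt;code&gt;traceparent&lt;/code&gt; header and builds the header for the downstream call using only the Go standard library. It handles version &lt;code&gt;00&lt;/code&gt; only, and the &lt;code&gt;TraceParent&lt;/code&gt; type is hypothetical; in production you would more likely rely on an OpenTelemetry propagator.&lt;/p&gt;

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// TraceParent holds the fields of a version-00 traceparent header.
type TraceParent struct {
	TraceID  string
	ParentID string
	Flags    string
	Sampled  bool
}

// ParseTraceParent validates "00-{trace-id}-{parent-id}-{flags}".
func ParseTraceParent(header string) (TraceParent, error) {
	parts := strings.Split(strings.TrimSpace(header), "-")
	if len(parts) != 4 {
		return TraceParent{}, errors.New("traceparent: expected 4 dash-separated fields")
	}
	if parts[0] != "00" {
		return TraceParent{}, fmt.Errorf("traceparent: unsupported version %q", parts[0])
	}
	wantLen := []int{2, 32, 16, 2}
	for i, p := range parts {
		if len(p) != wantLen[i] || p != strings.ToLower(p) {
			return TraceParent{}, fmt.Errorf("traceparent: field %d is malformed", i)
		}
		if _, err := hex.DecodeString(p); err != nil {
			return TraceParent{}, fmt.Errorf("traceparent: field %d is not hex", i)
		}
	}
	if parts[1] == strings.Repeat("0", 32) || parts[2] == strings.Repeat("0", 16) {
		return TraceParent{}, errors.New("traceparent: all-zero trace-id or parent-id")
	}
	flags, _ := hex.DecodeString(parts[3]) // already validated above
	// The lowest bit of the flags byte is the sampled flag.
	return TraceParent{TraceID: parts[1], ParentID: parts[2], Flags: parts[3], Sampled: flags[0]%2 == 1}, nil
}

// Child returns the header to send downstream: same trace, new parent span ID.
func (t TraceParent) Child(spanID string) string {
	return fmt.Sprintf("00-%s-%s-%s", t.TraceID, spanID, t.Flags)
}

func main() {
	tp, err := ParseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		panic(err)
	}
	fmt.Println(tp.Sampled)                   // true
	fmt.Println(tp.Child("b7ad6b7169203331")) // same trace-id, new parent-id
}
```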
&lt;h3 id="6-routing-and-discovery-at-scale"&gt;6) Routing and discovery at scale&lt;/h3&gt;
&lt;p&gt;The gateway becomes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the routing table (&amp;ldquo;tool X lives in cluster Y&amp;rdquo;)&lt;/li&gt;
&lt;li&gt;the discovery system (&amp;ldquo;list tools available for tenant Z&amp;rdquo;)&lt;/li&gt;
&lt;li&gt;the version broker (&amp;ldquo;tool schema v3 for client A, v4 for client B&amp;rdquo;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is also where you can implement &amp;ldquo;tool quality&amp;rdquo; policies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;quarantine tools with high error rates&lt;/li&gt;
&lt;li&gt;fallback to read-only alternatives&lt;/li&gt;
&lt;li&gt;degrade gracefully under partial outages&lt;/li&gt;
&lt;/ul&gt;
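&lt;p&gt;A quarantine policy can start as a simple error-rate threshold. The Go sketch below is one possible shape; the &lt;code&gt;ToolHealth&lt;/code&gt; type, the minimum sample size, and the threshold are all assumptions to tune against real traffic.&lt;/p&gt;

```go
package main

import "fmt"

// ToolHealth is a hypothetical rolling-window counter for one tool.
type ToolHealth struct {
	Calls  int
	Errors int
}

// ShouldQuarantine reports whether a tool should be pulled from routing.
// It requires at least minCalls samples so that one early failure on a
// rarely used tool does not trigger quarantine.
func ShouldQuarantine(h ToolHealth, minCalls int, maxErrorRate float64) bool {
	if h.Calls == 0 || minCalls > h.Calls {
		return false
	}
	return float64(h.Errors)/float64(h.Calls) > maxErrorRate
}

func main() {
	fmt.Println(ShouldQuarantine(ToolHealth{Calls: 200, Errors: 90}, 50, 0.25)) // true
	fmt.Println(ShouldQuarantine(ToolHealth{Calls: 200, Errors: 10}, 50, 0.25)) // false
	fmt.Println(ShouldQuarantine(ToolHealth{Calls: 3, Errors: 3}, 50, 0.25))    // false: too few samples
}
```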
&lt;hr&gt;
&lt;h2 id="reference-architecture"&gt;Reference architecture&lt;/h2&gt;
&lt;p&gt;Here&amp;rsquo;s a simple, effective gateway architecture:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;+------------------------------+
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;| Agent host / IDE / runtime   |
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;| (MCP client)                 |
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;+------------------------------+
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;               |  Streamable HTTP / JSON-RPC [2][4]
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;               v
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;+----------------------------------------------+
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;| MCP Gateway                                  |
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;|  - AuthN/Z [3]                               |
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;|  - Schema + safety gates                     |
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;|  - Budgets (rate, concurrency, cost)         |
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;|  - Audit + telemetry (OTel) [5][6]           |
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;|  - Routing + tool registry                   |
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;+----------------------------------------------+
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;               |
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        +------+-----------------+
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        v                        v
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;+----------------+      +------------------+
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;| MCP Server A   |      | MCP Server B     |
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;| (calendar)     |      | (k8s, github...) |
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;+----------------+      +------------------+
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        v                        v
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  Upstream APIs            Upstream APIs
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Key design decision: &lt;strong&gt;the gateway should not contain business logic&lt;/strong&gt;. It enforces policy and routes tool calls. Tool semantics live in tool servers.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="policy-patterns-that-actually-work"&gt;Policy patterns that actually work&lt;/h2&gt;
&lt;h3 id="pattern-read-vs-write-tool-classes"&gt;Pattern: Read vs write tool classes&lt;/h3&gt;
&lt;p&gt;Classify tools into tiers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Read-only:&lt;/strong&gt; listing, searching, fetching&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Write-safe:&lt;/strong&gt; creates/updates that are naturally reversible&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dangerous:&lt;/strong&gt; deletes, bulk updates, destructive actions, privileged ops&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Then enforce different rules per tier:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Read-only: wide availability, higher concurrency&lt;/li&gt;
&lt;li&gt;Write-safe: lower concurrency, stronger audit, idempotency keys&lt;/li&gt;
&lt;li&gt;Dangerous: preview/apply, explicit confirmations, restricted scopes&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="pattern-preview---apply"&gt;Pattern: Preview -&amp;gt; Apply&lt;/h3&gt;
&lt;p&gt;For any tool that can cause harm:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;plan_*&lt;/code&gt; returns a plan + summary + &lt;code&gt;plan_id&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;apply_*&lt;/code&gt; requires &lt;code&gt;plan_id&lt;/code&gt; (and optionally a user confirmation token)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is the &amp;ldquo;terraform plan/apply&amp;rdquo; mental model applied to tools.&lt;/p&gt;
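&lt;p&gt;A minimal in-memory Go sketch of the preview/apply handshake. The &lt;code&gt;PlanStore&lt;/code&gt; type is hypothetical, and it assumes plans are single-use and bound to the tenant that created them; a real gateway would also expire plans and persist them.&lt;/p&gt;

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
)

// Plan describes a previewed change awaiting confirmation.
type Plan struct {
	ID      string
	Tenant  string
	Summary string
}

// PlanStore keeps pending plans in memory (illustrative only).
type PlanStore struct {
	mu    sync.Mutex
	plans map[string]Plan
}

// NewPlanStore returns an empty store.
func NewPlanStore() *PlanStore {
	s := new(PlanStore)
	s.plans = make(map[string]Plan)
	return s
}

// Preview records what would happen and returns a plan with a random ID.
func (s *PlanStore) Preview(tenant, summary string) (Plan, error) {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return Plan{}, err
	}
	p := Plan{ID: hex.EncodeToString(buf), Tenant: tenant, Summary: summary}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.plans[p.ID] = p
	return p, nil
}

// Apply consumes a plan exactly once, and only for the tenant that created it.
func (s *PlanStore) Apply(tenant, planID string) (Plan, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	p, ok := s.plans[planID]
	if !ok || p.Tenant != tenant {
		return Plan{}, errors.New("unknown, foreign, or already applied plan_id")
	}
	delete(s.plans, planID)
	return p, nil
}

func main() {
	store := NewPlanStore()
	plan, err := store.Preview("tenant-a", "delete 3 calendar events")
	if err != nil {
		panic(err)
	}
	_, err = store.Apply("tenant-a", plan.ID)
	fmt.Println(err == nil) // true: first apply succeeds
	_, err = store.Apply("tenant-a", plan.ID)
	fmt.Println(err != nil) // true: replay is rejected
}
```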
&lt;h3 id="pattern-allowlisted-egress-ssrf-containment"&gt;Pattern: Allowlisted egress (SSRF containment)&lt;/h3&gt;
&lt;p&gt;If tools can fetch URLs or call arbitrary endpoints, treat it as SSRF risk. OWASP&amp;rsquo;s SSRF prevention guidance is a useful baseline. [8]&lt;/p&gt;
&lt;p&gt;At the gateway, enforce:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;allowlisted domains&lt;/li&gt;
&lt;li&gt;deny rules (by IP/CIDR) for internal and cloud-metadata ranges&lt;/li&gt;
&lt;li&gt;redirect re-validation&lt;/li&gt;
&lt;/ul&gt;
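&lt;p&gt;The checks above might be sketched in Go as follows. The allowlisted host, the deny ranges, and the &lt;code&gt;CheckEgress&lt;/code&gt; helper are assumptions for illustration. This sketch inspects only the URL; it does not resolve DNS, so a real deployment should also check the resolved address at connection time and call the same check on every redirect hop.&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/url"
)

// allowedHosts is the egress allowlist (hypothetical entry).
var allowedHosts = map[string]bool{"api.example.com": true}

// blockedCIDRs covers loopback, private, and link-local ranges; the
// link-local range includes the common cloud metadata address.
var blockedCIDRs = mustParseCIDRs(
	"127.0.0.0/8", "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
	"169.254.0.0/16", "::1/128", "fc00::/7", "fe80::/10",
)

func mustParseCIDRs(cidrs ...string) []*net.IPNet {
	var out []*net.IPNet
	for _, c := range cidrs {
		_, n, err := net.ParseCIDR(c)
		if err != nil {
			panic(err)
		}
		out = append(out, n)
	}
	return out
}

// CheckEgress validates a target URL against the allowlist and deny ranges.
func CheckEgress(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	if u.Scheme != "https" {
		return errors.New("only https targets are allowed")
	}
	host := u.Hostname()
	if ip := net.ParseIP(host); ip != nil {
		for _, n := range blockedCIDRs {
			if n.Contains(ip) {
				return fmt.Errorf("ip %s is in a blocked range", ip)
			}
		}
		return errors.New("raw IP targets are not allowlisted")
	}
	if !allowedHosts[host] {
		return fmt.Errorf("host %q is not allowlisted", host)
	}
	return nil
}

func main() {
	fmt.Println(CheckEgress("https://api.example.com/v1/items") == nil) // true
	fmt.Println(CheckEgress("https://169.254.169.254/latest/meta-data"))
	fmt.Println(CheckEgress("https://evil.example.net/"))
}
```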
&lt;h3 id="pattern-tenant-bound-tokens"&gt;Pattern: Tenant-bound tokens&lt;/h3&gt;
&lt;p&gt;Instead of giving tool servers &amp;ldquo;global&amp;rdquo; credentials, mint tenant-scoped tokens and inject them for each call.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;reduces blast radius&lt;/li&gt;
&lt;li&gt;makes audit meaningful&lt;/li&gt;
&lt;li&gt;enables &amp;ldquo;kill switch&amp;rdquo; revocation per tenant&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="scaling-and-isolation-strategies"&gt;Scaling and isolation strategies&lt;/h2&gt;
&lt;p&gt;A gateway is where multi-tenancy becomes real. Choose an isolation model:&lt;/p&gt;
&lt;h3 id="option-a-process-isolation-per-tool-server-simple-strong-isolation"&gt;Option A: Process isolation per tool server (simple, strong isolation)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;each integration is its own process/container&lt;/li&gt;
&lt;li&gt;faults stay contained&lt;/li&gt;
&lt;li&gt;rollouts per integration are easy&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Tradeoff: more processes to manage.&lt;/p&gt;
&lt;h3 id="option-b-shared-server-with-strong-tenant-sandboxing"&gt;Option B: Shared server with strong tenant sandboxing&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;single multi-tenant server handles many clients&lt;/li&gt;
&lt;li&gt;cheaper to run&lt;/li&gt;
&lt;li&gt;requires rigorous isolation inside the process&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Tradeoff: higher risk if a bug leaks across tenants.&lt;/p&gt;
&lt;h3 id="option-c-hybrid"&gt;Option C: Hybrid&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;sensitive&amp;rdquo; integrations are isolated&lt;/li&gt;
&lt;li&gt;&amp;ldquo;low-risk&amp;rdquo; read-only tools can be multi-tenant&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Most enterprises end up here.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="observability-and-audit"&gt;Observability and audit&lt;/h2&gt;
&lt;h3 id="what-to-emit-minimum-viable"&gt;What to emit (minimum viable)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Metrics&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;tool_calls_total{tool, tenant, status}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tool_latency_ms{tool}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;rate_limited_total{tenant}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;budget_exceeded_total{tenant, budget_type}&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Traces&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;request span (client -&amp;gt; gateway)&lt;/li&gt;
&lt;li&gt;tool execution span (gateway -&amp;gt; server)&lt;/li&gt;
&lt;li&gt;downstream spans (server -&amp;gt; upstream API)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Audit events&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;who (tenant/user/client)&lt;/li&gt;
&lt;li&gt;what (tool + summarized parameters)&lt;/li&gt;
&lt;li&gt;when&lt;/li&gt;
&lt;li&gt;result (success/failure)&lt;/li&gt;
&lt;li&gt;side effect IDs (resource IDs, plan_id, idempotency_key)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;OpenTelemetry&amp;rsquo;s Go docs are a good reference for instrumentation patterns. [5]&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="rollouts-and-versioning"&gt;Rollouts and versioning&lt;/h2&gt;
&lt;p&gt;Tool contracts drift. Clients upgrade at different times. Gateways can reduce pain by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;pinning tool schema versions per client&lt;/li&gt;
&lt;li&gt;supporting additive changes first (new fields optional)&lt;/li&gt;
&lt;li&gt;allowing parallel tool versions for a period&lt;/li&gt;
&lt;li&gt;enabling canary rollouts per tenant&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you do nothing else: &lt;strong&gt;never deploy a breaking tool change to 100% of tenants at once.&lt;/strong&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="a-production-checklist"&gt;A production checklist&lt;/h2&gt;
&lt;h3 id="security"&gt;Security&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; AuthN required for all HTTP-based access. [3]&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; AuthZ enforced per tool (least privilege).&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Tool inputs validated and bounded.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Dangerous tools require preview/apply and explicit confirmations.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Egress allowlists exist for URL/network tools. [8]&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="reliability"&gt;Reliability&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Per-tenant rate limiting and per-tool concurrency caps.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Timeouts everywhere; deadlines propagate.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Bounded queues (no unbounded memory growth).&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Circuit breakers for flaky dependencies.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="operability"&gt;Operability&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Traces propagate end-to-end (W3C Trace Context). [6]&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Metrics and logs are consistent and redacted.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Audit events exist for side effects.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="delivery"&gt;Delivery&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Tool schemas versioned; canary rollouts supported.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Quarantine and fallback policies exist for failing tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;p&gt;[1] Model Context Protocol (MCP) - Specification (Protocol Revision 2025-11-25): &lt;a href="https://modelcontextprotocol.io/specification/2025-11-25" target="_blank" rel="noopener noreferrer"&gt;https://modelcontextprotocol.io/specification/2025-11-25&lt;/a&gt;&lt;br&gt;
[2] MCP - Transports (including Streamable HTTP): &lt;a href="https://modelcontextprotocol.io/specification/2025-03-26/basic/transports" target="_blank" rel="noopener noreferrer"&gt;https://modelcontextprotocol.io/specification/2025-03-26/basic/transports&lt;/a&gt;&lt;br&gt;
[3] MCP - Authorization (HTTP-based transports): &lt;a href="https://modelcontextprotocol.io/specification/2025-11-25/basic/authorization" target="_blank" rel="noopener noreferrer"&gt;https://modelcontextprotocol.io/specification/2025-11-25/basic/authorization&lt;/a&gt;&lt;br&gt;
[4] JSON-RPC 2.0 Specification: &lt;a href="https://www.jsonrpc.org/specification" target="_blank" rel="noopener noreferrer"&gt;https://www.jsonrpc.org/specification&lt;/a&gt;&lt;br&gt;
[5] OpenTelemetry Go - Instrumentation docs: &lt;a href="https://opentelemetry.io/docs/languages/go/instrumentation/" target="_blank" rel="noopener noreferrer"&gt;https://opentelemetry.io/docs/languages/go/instrumentation/&lt;/a&gt;&lt;br&gt;
[6] W3C - Trace Context: &lt;a href="https://www.w3.org/TR/trace-context/" target="_blank" rel="noopener noreferrer"&gt;https://www.w3.org/TR/trace-context/&lt;/a&gt;&lt;br&gt;
[7] OWASP - Top 10 for Large Language Model Applications: &lt;a href="https://owasp.org/www-project-top-10-for-large-language-model-applications/" target="_blank" rel="noopener noreferrer"&gt;https://owasp.org/www-project-top-10-for-large-language-model-applications/&lt;/a&gt;&lt;br&gt;
[8] OWASP - SSRF Prevention Cheat Sheet: &lt;a href="https://cheatsheetseries.owasp.org/cheatsheets/Server_Side_Request_Forgery_Prevention_Cheat_Sheet.html" target="_blank" rel="noopener noreferrer"&gt;https://cheatsheetseries.owasp.org/cheatsheets/Server_Side_Request_Forgery_Prevention_Cheat_Sheet.html&lt;/a&gt;
&lt;/p&gt;</content:encoded></item><item><title>The Service Template That Prevents Incidents</title><link>https://roygabriel.dev/blog/paved-road-service-template/</link><pubDate>Sat, 25 Oct 2025 12:00:00 -0500</pubDate><guid>https://roygabriel.dev/blog/paved-road-service-template/</guid><description>&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;Most enterprises try to standardize software delivery with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;PDFs&lt;/li&gt;
&lt;li&gt;Confluence pages&lt;/li&gt;
&lt;li&gt;slide decks&lt;/li&gt;
&lt;li&gt;architecture review boards&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It doesn&amp;rsquo;t scale.&lt;/p&gt;
&lt;p&gt;Teams don&amp;rsquo;t move faster because the &lt;em&gt;rules&lt;/em&gt; exist. Teams move faster because the &lt;strong&gt;defaults&lt;/strong&gt; exist.&lt;/p&gt;
&lt;p&gt;Platform engineering language captures this well: paved roads / golden paths reduce cognitive load and make the &amp;ldquo;right way&amp;rdquo; the easy way. [1][2]
The CNCF Platforms White Paper makes the case for internal platforms as a lever that impacts value streams indirectly - through better flow and developer experience. [3]&lt;/p&gt;</description><content:encoded>&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;Most enterprises try to standardize software delivery with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;PDFs&lt;/li&gt;
&lt;li&gt;Confluence pages&lt;/li&gt;
&lt;li&gt;slide decks&lt;/li&gt;
&lt;li&gt;architecture review boards&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It doesn&amp;rsquo;t scale.&lt;/p&gt;
&lt;p&gt;Teams don&amp;rsquo;t move faster because the &lt;em&gt;rules&lt;/em&gt; exist. Teams move faster because the &lt;strong&gt;defaults&lt;/strong&gt; exist.&lt;/p&gt;
&lt;p&gt;Platform engineering language captures this well: paved roads / golden paths reduce cognitive load and make the &amp;ldquo;right way&amp;rdquo; the easy way. [1][2]
The CNCF Platforms White Paper makes the case for internal platforms as a lever that impacts value streams indirectly - through better flow and developer experience. [3]&lt;/p&gt;
&lt;p&gt;This article is a practical blueprint for the thing that actually changes outcomes:&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;A service template that bakes reliability, security, and operability into day-one defaults.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;h2 id="tldr"&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Build one paved road for APIs:
&lt;ul&gt;
&lt;li&gt;repo template + CI pipeline + runtime defaults&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Include &amp;ldquo;boring&amp;rdquo; but critical capabilities:
&lt;ul&gt;
&lt;li&gt;health probes, resource requests/limits, disruption budgets [4][5][6]&lt;/li&gt;
&lt;li&gt;tracing/metrics/logging via OpenTelemetry [7]&lt;/li&gt;
&lt;li&gt;timeouts, retries, rate limits&lt;/li&gt;
&lt;li&gt;standardized deployment and rollout&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Measure success with outcomes (DORA metrics): lead time, deploy frequency, change failure rate, MTTR. [8]&lt;/li&gt;
&lt;li&gt;Optimize for day 2 to day 50, not just &amp;ldquo;hello world.&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="contents"&gt;Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#what-a-paved-road-is-and-isnt"&gt;What a paved road is (and isn&amp;rsquo;t)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-api-service-template-required-capabilities"&gt;The API service template: required capabilities&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-reference-repository-structure"&gt;A reference repository structure&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#kubernetes-defaults-that-save-you-later"&gt;Kubernetes defaults that save you later&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#observability-by-default"&gt;Observability by default&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#security-by-default"&gt;Security by default&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#rollouts-and-operational-controls"&gt;Rollouts and operational controls&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#how-to-roll-this-out-without-a-platform-revolt"&gt;How to roll this out without a platform revolt&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-production-checklist"&gt;A production checklist&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#references"&gt;References&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="what-a-paved-road-is-and-isnt"&gt;What a paved road is (and isn&amp;rsquo;t)&lt;/h2&gt;
&lt;h3 id="a-paved-road-is"&gt;A paved road is&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;a &lt;strong&gt;recommended&lt;/strong&gt; path to production&lt;/li&gt;
&lt;li&gt;preconfigured defaults that make safe delivery easy&lt;/li&gt;
&lt;li&gt;automation that eliminates repetitive decisions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Microsoft describes this in internal developer platform terms: recommended and supported development paths, incrementally paved through an internal platform. [2]&lt;/p&gt;
&lt;h3 id="a-paved-road-is-not"&gt;A paved road is not&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;a mandate that blocks all other approaches&lt;/li&gt;
&lt;li&gt;a committee process&lt;/li&gt;
&lt;li&gt;a doc nobody reads&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If your paved road becomes a gate, teams will route around it.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="the-api-service-template-required-capabilities"&gt;The API service template: required capabilities&lt;/h2&gt;
&lt;p&gt;Here&amp;rsquo;s what &amp;ldquo;enterprise production API&amp;rdquo; should mean out of the box.&lt;/p&gt;
&lt;h3 id="operability"&gt;Operability&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;structured logging with correlation IDs&lt;/li&gt;
&lt;li&gt;metrics (request rate/latency/errors)&lt;/li&gt;
&lt;li&gt;tracing across inbound/outbound calls [7]&lt;/li&gt;
&lt;li&gt;runtime config and feature flags&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="reliability"&gt;Reliability&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;timeouts everywhere&lt;/li&gt;
&lt;li&gt;bounded retries with backoff&lt;/li&gt;
&lt;li&gt;health probes (liveness/readiness/startup) [5]&lt;/li&gt;
&lt;li&gt;graceful shutdown&lt;/li&gt;
&lt;li&gt;rate limits / concurrency caps&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="platform-fit"&gt;Platform fit&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Kubernetes-ready manifests&lt;/li&gt;
&lt;li&gt;resource requests/limits [4]&lt;/li&gt;
&lt;li&gt;PodDisruptionBudget for availability during maintenance [6]&lt;/li&gt;
&lt;li&gt;standardized rollout strategy&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="security"&gt;Security&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;auth middleware&lt;/li&gt;
&lt;li&gt;input validation&lt;/li&gt;
&lt;li&gt;secret injection patterns (no secrets in repo)&lt;/li&gt;
&lt;li&gt;least privilege service accounts&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="delivery"&gt;Delivery&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;CI pipeline: lint/test/build/scan&lt;/li&gt;
&lt;li&gt;SBOM generation&lt;/li&gt;
&lt;li&gt;deploy automation (GitOps or pipeline)&lt;/li&gt;
&lt;/ul&gt;
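&lt;p&gt;A minimal sketch of the lint/test/build/scan stages, assuming a Go service hosted on GitHub and built with GitHub Actions (neither is required by the list above). Using &lt;code&gt;go vet&lt;/code&gt; for lint and &lt;code&gt;govulncheck&lt;/code&gt; for the scan step are illustrative choices; SBOM generation and deploy automation are left out.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;# .github/workflows/ci.yaml (hypothetical file name)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;name: ci
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;on:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  pull_request:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  push:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    branches: [main]
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;jobs:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  build:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    runs-on: ubuntu-latest
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    steps:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;      - uses: actions/checkout@v4
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;      - uses: actions/setup-go@v5
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        with:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;          go-version-file: go.mod   # pin the toolchain to the version in go.mod
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;      - name: Lint
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        run: go vet ./...
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;      - name: Test
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        run: go test -race ./...
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;      - name: Build
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        run: go build ./...
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;      - name: Scan
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        run: go run golang.org/x/vuln/cmd/govulncheck@latest ./...
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;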
&lt;hr&gt;
&lt;h2 id="a-reference-repository-structure"&gt;A reference repository structure&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;|-- cmd/service/          # main
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;|-- internal/             # business logic
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;|-- pkg/                  # shared libs (optional)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;|-- api/                  # OpenAPI spec, schemas
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;|-- deploy/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;|   |-- k8s/              # manifests (or Helm/Kustomize)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;|   `-- policy/           # OPA/constraints (optional)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;|-- docs/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;|   |-- index.md
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;|   `-- runbooks/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;|-- Makefile
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;`-- .github/workflows/    # CI
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Key idea: the template is not just code - it is the full production story:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;how to run locally&lt;/li&gt;
&lt;li&gt;how to deploy&lt;/li&gt;
&lt;li&gt;how to observe&lt;/li&gt;
&lt;li&gt;how to operate on-call&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="kubernetes-defaults-that-save-you-later"&gt;Kubernetes defaults that save you later&lt;/h2&gt;
&lt;h3 id="1-resource-requests-and-limits"&gt;1) Resource requests and limits&lt;/h3&gt;
&lt;p&gt;Kubernetes scheduling and stability depend on requests/limits. The official docs explain how pod requests/limits are derived from container values. [4]&lt;/p&gt;
&lt;p&gt;Template default:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;set conservative requests&lt;/li&gt;
&lt;li&gt;set safe limits&lt;/li&gt;
&lt;li&gt;provide guidance for right-sizing&lt;/li&gt;
&lt;/ul&gt;
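&lt;p&gt;As a sketch, the container-level default might look like the fragment below. The container name, image, and every number are illustrative assumptions, meant as a starting point to right-size from observed usage rather than as recommended values.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;# Fragment of a Deployment pod spec; all values are placeholders.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;containers:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  - name: service
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    image: registry.example.com/service:1.0.0   # hypothetical image
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    resources:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;      requests:          # what the scheduler reserves on a node
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        cpu: 100m
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        memory: 128Mi
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;      limits:            # hard ceiling enforced at runtime
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        cpu: 500m
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        memory: 256Mi
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Exceeding the memory limit gets the container killed, while exceeding the CPU limit only throttles it, so the two limits deserve separate tuning.&lt;/p&gt;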
&lt;h3 id="2-probes"&gt;2) Probes&lt;/h3&gt;
&lt;p&gt;Kubernetes supports liveness, readiness, and startup probes. The docs describe how to configure them and why they matter. [5]&lt;/p&gt;
&lt;p&gt;Template default:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;readinessProbe&lt;/code&gt; ensures traffic only goes to ready pods&lt;/li&gt;
&lt;li&gt;&lt;code&gt;livenessProbe&lt;/code&gt; catches deadlocks / stuck processes&lt;/li&gt;
&lt;li&gt;&lt;code&gt;startupProbe&lt;/code&gt; prevents early restarts for slow boot services&lt;/li&gt;
&lt;/ul&gt;
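&lt;p&gt;The three probes could be wired up roughly as follows. The endpoint paths (&lt;code&gt;/healthz&lt;/code&gt;, &lt;code&gt;/readyz&lt;/code&gt;), the port, and the timing values are assumptions for illustration; the service itself has to implement the endpoints.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;# Fragment of a Deployment pod spec; paths, port, and timings are placeholders.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;containers:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  - name: service
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    ports:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;      - name: http
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        containerPort: 8080
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    startupProbe:            # liveness and readiness wait until this succeeds
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;      httpGet:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        path: /healthz
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        port: http
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;      periodSeconds: 5
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;      failureThreshold: 30   # 30 x 5s = up to 150s to finish booting
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    readinessProbe:          # failing removes the pod from Service endpoints
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;      httpGet:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        path: /readyz
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        port: http
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;      periodSeconds: 10
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;      failureThreshold: 3
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    livenessProbe:           # failing restarts the container
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;      httpGet:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        path: /healthz
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        port: http
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;      periodSeconds: 10
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;      timeoutSeconds: 2
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;      failureThreshold: 3
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Keeping the liveness endpoint free of downstream dependency checks is a common design choice: a database outage should make pods unready, not trigger a restart loop.&lt;/p&gt;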
&lt;h3 id="3-disruption-budgets"&gt;3) Disruption budgets&lt;/h3&gt;
&lt;p&gt;PodDisruptionBudgets limit concurrent disruptions during voluntary maintenance. [6]&lt;/p&gt;
&lt;p&gt;Template default:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;include a PDB for replicated services&lt;/li&gt;
&lt;li&gt;define min available or max unavailable&lt;/li&gt;
&lt;/ul&gt;
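&lt;p&gt;A minimal PodDisruptionBudget sketch, assuming the pods carry an &lt;code&gt;app: service&lt;/code&gt; label (the name and label are hypothetical):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;apiVersion: policy/v1
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;kind: PodDisruptionBudget
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;metadata:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  name: service
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;spec:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  maxUnavailable: 1        # at most one pod evicted at a time during voluntary disruptions
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  selector:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    matchLabels:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;      app: service         # must match the Deployment pod labels
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;A budget specifies either &lt;code&gt;maxUnavailable&lt;/code&gt; or &lt;code&gt;minAvailable&lt;/code&gt;, not both. Setting &lt;code&gt;minAvailable&lt;/code&gt; equal to the replica count blocks node drains entirely, which is one reason &lt;code&gt;maxUnavailable: 1&lt;/code&gt; can be the easier template default.&lt;/p&gt;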
&lt;hr&gt;
&lt;h2 id="observability-by-default"&gt;Observability by default&lt;/h2&gt;
&lt;p&gt;If you do one thing: instrument the template so every service ships with telemetry.&lt;/p&gt;
&lt;p&gt;OpenTelemetry provides the framework for standard traces/metrics/logs. [7]&lt;/p&gt;
&lt;p&gt;Template defaults:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;standard HTTP server instrumentation&lt;/li&gt;
&lt;li&gt;propagation of trace context (W3C headers)&lt;/li&gt;
&lt;li&gt;request logs include trace IDs&lt;/li&gt;
&lt;li&gt;golden dashboard:
&lt;ul&gt;
&lt;li&gt;RPS&lt;/li&gt;
&lt;li&gt;p95 latency&lt;/li&gt;
&lt;li&gt;error rate&lt;/li&gt;
&lt;li&gt;saturation (CPU/memory)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
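&lt;p&gt;One way a template can carry these defaults is through the environment variables defined by the OpenTelemetry specification, set once in the deployment manifest. This is a sketch: the collector address is hypothetical, and how much of this an SDK picks up automatically varies by language, so the service code may still need to opt in.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;# Fragment of a container spec; the endpoint below is a placeholder.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;env:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  - name: OTEL_SERVICE_NAME
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    value: service
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  - name: OTEL_EXPORTER_OTLP_ENDPOINT
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    value: http://otel-collector.observability:4317
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  - name: OTEL_PROPAGATORS
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    value: tracecontext,baggage   # W3C Trace Context plus baggage
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;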
&lt;hr&gt;
&lt;h2 id="security-by-default"&gt;Security by default&lt;/h2&gt;
&lt;p&gt;Avoid &amp;ldquo;security guidance documents.&amp;rdquo; Make secure defaults.&lt;/p&gt;
&lt;p&gt;Template defaults:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;auth middleware with standardized claims/roles mapping&lt;/li&gt;
&lt;li&gt;structured validation for request bodies&lt;/li&gt;
&lt;li&gt;outbound allowlists (where feasible)&lt;/li&gt;
&lt;li&gt;secret injection via environment/secret store (no plain text)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Your paved road becomes a security accelerator because teams start secure.&lt;/p&gt;
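&lt;p&gt;For the secret injection default, a sketch using a Kubernetes Secret reference might look like this. The variable, Secret name, and key are hypothetical; the Secret itself would be created by a secret store integration or an operator, never committed to the repo.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;# Fragment of a Deployment pod spec; names are placeholders.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;containers:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  - name: service
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    env:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;      - name: DATABASE_PASSWORD
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        valueFrom:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;          secretKeyRef:          # value is read from the Secret at pod start
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;            name: service-secrets
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;            key: database-password
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;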
&lt;hr&gt;
&lt;h2 id="rollouts-and-operational-controls"&gt;Rollouts and operational controls&lt;/h2&gt;
&lt;p&gt;Default rollout patterns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;canary or progressive delivery when needed&lt;/li&gt;
&lt;li&gt;safe rollback&lt;/li&gt;
&lt;li&gt;feature flags for risky changes&lt;/li&gt;
&lt;/ul&gt;
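&lt;p&gt;As a baseline for the rollout patterns above, a plain Deployment rolling update can be made conservative with a few fields. This is a sketch with illustrative values; it is not a canary, which would need additional tooling on top.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;# Fragment of a Deployment spec; values are placeholders.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;spec:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  replicas: 3
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  minReadySeconds: 10        # a new pod must stay ready this long before it counts
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  revisionHistoryLimit: 5    # keep old ReplicaSets around for rollback
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  strategy:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    type: RollingUpdate
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    rollingUpdate:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;      maxSurge: 1            # add one new pod at a time
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;      maxUnavailable: 0      # never drop below the desired replica count
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;With these settings a bad release stalls on the first unready pod instead of replacing healthy ones, and &lt;code&gt;kubectl rollout undo&lt;/code&gt; can return to a retained revision.&lt;/p&gt;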
&lt;p&gt;Default operational controls:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;rate limiting&lt;/li&gt;
&lt;li&gt;concurrency limits&lt;/li&gt;
&lt;li&gt;timeouts and circuit breakers&lt;/li&gt;
&lt;li&gt;&amp;ldquo;maintenance mode&amp;rdquo; toggle&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="how-to-roll-this-out-without-a-platform-revolt"&gt;How to roll this out without a platform revolt&lt;/h2&gt;
&lt;p&gt;This is the part platform teams often miss.&lt;/p&gt;
&lt;h3 id="1-make-it-optional---but-obviously-better"&gt;1) Make it optional - but obviously better&lt;/h3&gt;
&lt;p&gt;If adopting the template reduces weeks of work to hours, teams will choose it.&lt;/p&gt;
&lt;h3 id="2-provide-migration-paths"&gt;2) Provide migration paths&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;minimal adoption: observability + probes&lt;/li&gt;
&lt;li&gt;medium: deploy manifests + CI&lt;/li&gt;
&lt;li&gt;full: service template + libraries&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="3-measure-outcomes-not-adoption"&gt;3) Measure outcomes, not adoption&lt;/h3&gt;
&lt;p&gt;Use DORA metrics to show impact: lead time, deploy frequency, change failure rate, time to restore service. [8]&lt;/p&gt;
&lt;p&gt;If the paved road doesn&amp;rsquo;t move these, it&amp;rsquo;s not paved.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="a-production-checklist"&gt;A production checklist&lt;/h2&gt;
&lt;h3 id="template"&gt;Template&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Repo template includes CI, deploy, docs, runbooks.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Observability instrumentation included by default. [7]&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="kubernetes"&gt;Kubernetes&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Resource requests/limits included. [4]&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Liveness/readiness/startup probes included. [5]&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; PodDisruptionBudget included for replicated services. [6]&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="reliability-1"&gt;Reliability&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Timeouts and bounded retries are standard.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Graceful shutdown is implemented.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Rate limiting/concurrency caps exist.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="security-1"&gt;Security&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Auth middleware included.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Secrets handled via secure injection (not repo).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="outcomes"&gt;Outcomes&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; DORA metrics tracked to validate improvement. [8]&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;p&gt;[1] CNCF - What is platform engineering? (golden paths/paved roads framing): &lt;a href="https://www.cncf.io/blog/2025/11/19/what-is-platform-engineering/" target="_blank" rel="noopener noreferrer"&gt;https://www.cncf.io/blog/2025/11/19/what-is-platform-engineering/&lt;/a&gt;&lt;br&gt;
[2] Microsoft Learn - What is platform engineering? (paved paths / internal developer platform): &lt;a href="https://learn.microsoft.com/en-us/platform-engineering/what-is-platform-engineering" target="_blank" rel="noopener noreferrer"&gt;https://learn.microsoft.com/en-us/platform-engineering/what-is-platform-engineering&lt;/a&gt;&lt;br&gt;
[3] CNCF TAG App Delivery - Platforms White Paper: &lt;a href="https://tag-app-delivery.cncf.io/whitepapers/platforms/" target="_blank" rel="noopener noreferrer"&gt;https://tag-app-delivery.cncf.io/whitepapers/platforms/&lt;/a&gt;&lt;br&gt;
[4] Kubernetes - Resource Management for Pods and Containers (requests/limits): &lt;a href="https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/" target="_blank" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/&lt;/a&gt;&lt;br&gt;
[5] Kubernetes - Configure Liveness, Readiness and Startup Probes: &lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/" target="_blank" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/&lt;/a&gt;&lt;br&gt;
[6] Kubernetes - Specifying a Disruption Budget for your Application (PDB): &lt;a href="https://kubernetes.io/docs/tasks/run-application/configure-pdb/" target="_blank" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/tasks/run-application/configure-pdb/&lt;/a&gt;&lt;br&gt;
[7] OpenTelemetry - Documentation (instrumentation and telemetry): &lt;a href="https://opentelemetry.io/docs/" target="_blank" rel="noopener noreferrer"&gt;https://opentelemetry.io/docs/&lt;/a&gt;&lt;br&gt;
[8] DORA - DORA&amp;rsquo;s software delivery performance metrics: &lt;a href="https://dora.dev/guides/dora-metrics/" target="_blank" rel="noopener noreferrer"&gt;https://dora.dev/guides/dora-metrics/&lt;/a&gt;
&lt;/p&gt;</content:encoded></item></channel></rss>