Platform-Engineering | Roy Gabriel

When Enterprise Defaults Become Enterprise Debt

Sat, 07 Feb 2026 09:00:00 -0500

Note on examples: The scenarios below are anonymized composites. They’re not a critique of any one organization; they’re patterns that repeat across industries. The goal isn’t to “modernize for fun.” It’s to protect speed-to-market and reliability as systems and organizations scale.

Why this matters

Most enterprises don’t lose because they picked the “wrong” framework or cloud provider. They lose because old defaults - once rational - become invisible policy.

The 90s and early 2000s optimized for constraints that were real at the time:

hardware was expensive
automation was immature
environments were scarce
security controls were largely manual
uptime was achieved by cautious change, not by safe change

Those constraints have shifted. But many organizations still run on architectural and governance defaults designed for a different era.

The result is predictable:

innovation slows (lead time grows)
quality degrades (late integration + big-bang changes)
reliability suffers (risk is batched, blast radius expands)
engineers spend more time navigating the system than improving it

If you want a single-sentence summary: old patterns don’t just slow delivery - they also create the conditions for outages.

TL;DR

Retire “analysis as delivery.” Timebox discovery and ship thin vertical slices.
Treat cloud primitives as primitives, not research projects (e.g., object storage is solved).
Default to containers + orchestration for most stateless services; use VMs deliberately, not reflexively. [5]
Replace ticket queues and approval boards with guardrails + paved roads + policy-as-code. [7][8]
Measure what matters: lead time, deploy frequency, change failure rate, MTTR. [1][2]
Modernization works best as an incremental program, not a rewrite (Strangler Fig pattern). [12]

Pattern 1: Analysis as a substitute for delivery
Pattern 2: Reinventing commodity infrastructure
Pattern 3: VM-first thinking as the default
Pattern 4: Ticket-driven infrastructure
Pattern 5: Change Advisory Board for routine changes
Pattern 6: The shared database empire
Pattern 7: Central integration as a chokepoint
Pattern 8: Perma-POCs and innovation theater
Replace committees with guardrails
Modernize without a rewrite
Verification: how you know it’s working
A practical checklist
References

Pattern 1: Analysis as a substitute for delivery

What it looks like

A team spends months (sometimes a year) doing “analysis” for a capability that won’t be used until it’s built - often with the intention of eliminating all risk up front.

Common examples:

multi-tenant “high availability image storage” designed from scratch
designing bespoke event systems when managed queues exist
writing 40-page architecture documents before the first running slice exists

Why it existed

When provisioning took weeks and environments were scarce, analysis was a rational risk-reducer.

The hidden tax

You push real learning to the end (integration failures happen late).
Decisions get made with imaginary constraints, not measured ones.
Teams optimize for “approval” rather than “outcome.”

The replacement pattern

Timebox discovery and require a running slice early.

A strong default:

1-2 week spike to validate constraints
a thin vertical slice in production (even behind a flag)
iterate based on real telemetry and user feedback

Transition step (low drama)

Create an “RFC-lite” template:

problem statement + constraints
1-2 options with tradeoffs
a plan to measure (latency, cost, reliability)
a thin-slice milestone date

Pattern 2: Reinventing commodity infrastructure

What it looks like

Teams treat widely-proven primitives as novel:

object storage
queues
identity
metrics + tracing
load balancing

A classic symptom: “We need to design HA multi-tenant object storage,” as if durable object storage isn’t already a standard building block.

Why it existed

On-prem and early hosting eras forced you to build a lot yourself.

The hidden tax

Reinventing primitives becomes a multi-quarter project.
Reliability becomes your problem (and you will be on call for it).
The business pays for the same capability twice: once in time, and again in incidents.

The replacement pattern

Default to managed or proven primitives unless you have a documented reason not to.

For example, modern object storage services are explicitly designed for very high durability and availability (provider details vary). [11]

Transition step

Maintain a “Reference Implementations” catalog:

“How we do object storage”
“How we do queues”
“How we do auth”
“How we do telemetry”

If the default is documented and supported, teams stop re-litigating fundamentals.

Pattern 3: VM-first thinking as the default

What it looks like

Everything runs on VMs because “that’s what we do,” even when the workload is a stateless API, worker, or event consumer.

Why it existed

VMs were the universal unit of deployment for a long time, and they map cleanly to org boundaries (“this server is mine”).

The hidden tax

drift (snowflake servers)
slow rollouts
inconsistent security posture
wasted compute due to poor bin-packing
limited standardization across services

The replacement pattern

For many enterprise services, containers orchestrated by Kubernetes are a strong default for stateless workloads. Kubernetes itself describes Deployments as a good fit for managing stateless applications where Pods are interchangeable and replaceable. [5]

This doesn’t mean “Kubernetes for everything,” but it does mean:

prefer declarative workloads with health checks and rollout controls
keep VMs for deliberate cases (legacy constraints, special licensing, unique state, or when orchestration adds no value)

Transition step

Start with “Kubernetes-first for new stateless services,” not a migration mandate.

Then build operational guardrails:

resource requests/limits so services behave predictably under load [6]
standardized readiness/liveness probes
standard ingress + auth patterns
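
Guardrails like these are easier to keep when they are checked automatically instead of living in a wiki. The sketch below is a minimal illustration, not an admission controller: it assumes a Deployment manifest has already been parsed into a Python dict, and `guardrail_violations` is a hypothetical helper name. The field names it inspects (`resources.requests`, `resources.limits`, `readinessProbe`, `livenessProbe`) are the standard Kubernetes container fields.

```python
def guardrail_violations(deployment: dict) -> list[str]:
    """Return human-readable guardrail violations for a parsed Deployment manifest.

    Assumption: `deployment` is the manifest already loaded into a dict
    (for example from YAML); this helper is illustrative only.
    """
    containers = (
        deployment.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("containers", [])
    )
    violations = []
    if not containers:
        violations.append("no containers defined")
    for container in containers:
        name = container.get("name", "<unnamed>")
        resources = container.get("resources", {})
        # Requests and limits keep the service predictable under load.
        for kind in ("requests", "limits"):
            if not resources.get(kind):
                violations.append(f"{name}: missing resources.{kind}")
        # Probes let the orchestrator route traffic and restart safely.
        for probe in ("readinessProbe", "livenessProbe"):
            if probe not in container:
                violations.append(f"{name}: missing {probe}")
    return violations
```

Run in CI against every manifest in a service repo, a check like this turns a review comment into a failing build with a specific message.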

Pattern 4: Ticket-driven infrastructure

What it looks like

Need a database? Ticket. Need an environment? Ticket. Need DNS? Ticket. Need a queue? Ticket.

Eventually, the ticketing system becomes the true control plane.

Why it existed

It’s a reasonable response when:

environments are scarce
changes are risky
platform knowledge is specialized

The hidden tax

queues become normalized (“it takes 3 weeks to get a namespace”)
teams route around the platform
reliability doesn’t improve; delivery just slows

The replacement pattern

Self-service via GitOps and platform “paved roads.”

OpenGitOps describes itself as a set of open-source standards and best practices that help organizations adopt a structured, standardized approach to implementing GitOps. [7] The point isn’t a specific tool - it’s the principle: desired state is declarative and auditable.
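
To make “desired state is declarative” concrete, here is a minimal sketch of the comparison at the heart of a reconcile loop. It is not how any particular GitOps tool is implemented; `reconcile_plan` and the resource names are hypothetical, and both states are assumed to be plain dicts keyed by resource name.

```python
def reconcile_plan(desired: dict[str, dict], actual: dict[str, dict]) -> dict[str, list[str]]:
    """Compare declared state (e.g. from Git) with observed state.

    Returns the names of resources to create, update, or delete so that
    the running system converges on what is declared.
    """
    return {
        "create": sorted(desired.keys() - actual.keys()),
        "update": sorted(
            name for name in desired.keys() & actual.keys()
            if desired[name] != actual[name]
        ),
        "delete": sorted(actual.keys() - desired.keys()),
    }


# Hypothetical example: one queue is declared but missing, one database has drifted.
declared = {"orders-queue": {"retention_days": 7}, "orders-db": {"tier": "small"}}
observed = {"orders-db": {"tier": "large"}, "old-cache": {}}
print(reconcile_plan(declared, observed))
```

Because the plan is derived from files under version control, the audit trail is the commit history rather than a ticket thread.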

Transition step

Pick one high-frequency request and eliminate it:

“create a service with a standard ingress/auth/telemetry”
“provision a queue”
“create a dev environment”

Make the paved road the path of least resistance.

Pattern 5: Change Advisory Board for routine changes

What it looks like

Every change - routine or risky - requires synchronous approval.

Why it existed

When changes were large, rare, and manual, centralized review reduced catastrophic surprises.

The hidden tax

you batch changes (bigger releases are riskier)
emergency changes bypass process (creating inconsistency)
“approval” becomes the goal rather than evidence of safety

DORA’s guidance on streamlining change approval emphasizes making the regular change process fast and reliable enough that it can handle emergencies, and reframes how CAB fits into continuous delivery. [3] Continuous delivery literature makes a similar point: smaller, more frequent changes reduce risk and ease remediation. [4]

The replacement pattern

Move to evidence-based change approval:

automated tests
policy-as-code checks
progressive delivery (canaries, phased rollouts)
real-time telemetry tied to the release
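
The four kinds of evidence above can be expressed as a small routing rule. This is a sketch under stated assumptions, not a policy engine: the `ChangeEvidence` fields and the three outcomes are hypothetical names chosen for illustration, and a real pipeline would populate them from CI results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvidence:
    """Evidence attached to a change (hypothetical fields)."""
    tests_passed: bool
    policy_violations: int        # count reported by policy-as-code checks
    has_canary_plan: bool         # progressive delivery is configured
    has_release_dashboard: bool   # telemetry is tied to this release
    high_risk: bool = False       # e.g. cross-team or hard-to-reverse change


def review_route(change: ChangeEvidence) -> tuple[str, list[str]]:
    """Return (decision, reasons): "auto-approve", "needs-review", or "blocked"."""
    missing = []
    if not change.tests_passed:
        missing.append("automated tests failing")
    if change.policy_violations > 0:
        missing.append(f"{change.policy_violations} policy violation(s)")
    if not change.has_canary_plan:
        missing.append("no progressive delivery plan")
    if not change.has_release_dashboard:
        missing.append("no release telemetry")
    if missing:
        return "blocked", missing
    if change.high_risk:
        # Evidence is complete, but humans still review the risky cases.
        return "needs-review", ["high-risk change: route to human review"]
    return "auto-approve", []
```

The design choice worth noting is the order: missing evidence blocks a change before risk is even considered, so human review time is spent only on changes that are already well-evidenced.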

Transition step

Keep CAB, but change its scope:

focus on high-risk changes and cross-team coordination
use automation and metrics for routine changes

Pattern 6: The shared database empire

What it looks like

A central database is shared by many services. Teams coordinate schema changes across multiple apps and releases.

Microservices.io describes the “shared database” pattern explicitly: multiple services access a single database directly. [10]

Why it existed

It’s simple at first:

one place for data
easy joins
one backup plan

The hidden tax

coupling spreads everywhere
every change becomes cross-team work
reliability suffers because one DB problem becomes everyone’s problem
schema evolution becomes political

The replacement pattern

Prefer service-owned data boundaries. Microservices.io’s “database per service” pattern describes keeping a service’s data private and accessible only via its API. [9]

Transition step

You don’t have to “microservices everything.” Start by:

carving out new tables owned by one service
introducing an API boundary
migrating consumers gradually

Pattern 7: Central integration as a chokepoint

What it looks like

All integrations must go through a single shared integration layer/team (classic ESB gravity).

Why it existed

Centralizing integration gave consistency when:

protocols were messy
tooling was expensive
teams lacked automation

The hidden tax

integration lead times explode
teams stop experimenting
one backlog becomes everyone’s bottleneck

The replacement pattern

Standardize:

interfaces (auth, tracing, deployment, contract testing)
platform guardrails

…not every internal implementation detail.

Transition step

Carve out one “self-service integration” paved road:

standard service template
standard auth
standard telemetry
contracts + examples

Pattern 8: Perma-POCs and innovation theater

What it looks like

Prototypes exist forever, never becoming production systems.

Especially common with AI initiatives:

impressive demos
no production constraints
no ownership for operability

Why it existed

POCs are a safe way to explore unknowns.

The hidden tax

teams lose trust (“innovation never ships”)
production teams inherit half-baked work
opportunity cost compounds

The replacement pattern

From day one, require:

an owner
a production path
a thin slice in a real environment
explicit safety requirements (timeouts, budgets, telemetry)

Transition step

Make “POC exit criteria” mandatory:

what metrics prove value?
what is the minimum shippable slice?
what must be true for reliability and security?

Replace committees with guardrails

A recurring theme: humans are expensive control planes.

The modern move is to convert “tribal rules” into:

templates
automation
policy-as-code
paved paths

Microsoft’s platform engineering work describes “paved paths” within an internal developer platform as recommended paths to production that guide developers through requirements without sacrificing velocity. [8]

Guardrails beat gatekeepers because guardrails are:

consistent
fast
auditable
scalable

Modernize without a rewrite

Big-bang rewrites are expensive and risky. Incremental modernization is usually the winning move.

The Strangler Fig pattern is a well-known approach: wrap or route traffic so you can replace parts of a legacy system gradually. [12]

Practical approach:

put a facade in front of the legacy surface
carve off one slice at a time
measure outcomes
keep rollback easy

This isn’t glamorous. It works.
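
The facade step can be sketched as a routing table that starts out sending everything to the legacy system and moves one path prefix at a time. This is an illustrative model only; `StranglerFacade`, the backend names, and the prefix-matching rule are assumptions, and in practice the same logic usually lives in a gateway or proxy configuration.

```python
from dataclasses import dataclass, field


@dataclass
class StranglerFacade:
    """Routes each request path to the legacy system or a migrated slice."""
    legacy_backend: str
    migrated: dict[str, str] = field(default_factory=dict)  # path prefix -> new backend

    def migrate(self, prefix: str, backend: str) -> None:
        """Carve off one slice: send this prefix to the new backend."""
        self.migrated[prefix] = backend

    def rollback(self, prefix: str) -> None:
        """Keep rollback easy: removing the entry restores legacy routing."""
        self.migrated.pop(prefix, None)

    def route(self, path: str) -> str:
        """Pick the backend for a path; the longest matching prefix wins."""
        matches = [
            prefix for prefix in self.migrated
            if path == prefix or path.startswith(prefix.rstrip("/") + "/")
        ]
        if matches:
            return self.migrated[max(matches, key=len)]
        return self.legacy_backend
```

Because rollback is a single table change, each slice can be measured in production and reverted without touching the legacy code.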

Verification: how you know it’s working

If you want to avoid “modernization theater,” measure.

DORA’s metrics guidance is a solid baseline: deployment frequency, lead time for changes, change failure rate, and time to restore service (MTTR). [1] The 2024 DORA report continues to focus on the organizational capabilities that drive high performance. [2]

A simple evidence loop:

Pick one value stream (one product or platform slice).
Baseline the four DORA metrics.
Remove one friction point (one pattern).
Re-measure.

If your metrics don’t move, you didn’t remove the real constraint.
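
Baselining does not need a tool purchase. The sketch below computes the four metrics from a list of deployment records; the `Deployment` record shape is hypothetical, and measuring time to restore from the moment of deployment is a simplifying assumption (many teams measure from incident detection instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False                      # caused a failure in production
    restored_at: Optional[datetime] = None    # when service was restored


def dora_metrics(deployments: list[Deployment], window_days: int) -> dict:
    """Summarize the four metrics over a window of `window_days` days."""
    if not deployments:
        raise ValueError("need at least one deployment to baseline")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]
    return {
        "deploys_per_week": len(deployments) / (window_days / 7),
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


# Hypothetical two-week baseline for one value stream.
start = datetime(2026, 1, 5, 9, 0)
baseline = dora_metrics(
    [
        Deployment(start, start + timedelta(hours=30)),
        Deployment(start, start + timedelta(hours=50), failed=True,
                   restored_at=start + timedelta(hours=53)),
    ],
    window_days=14,
)
print(baseline)
```

Run the same function before and after removing a friction point, on the same value stream, so the comparison reflects the change rather than a different mix of work.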

A practical checklist

If you’re trying to retire “enterprise debt” safely:

Delivery

Timebox analysis; require a running slice early.
Prefer small changes and frequent releases; avoid batching.

Platform

Provide a paved road for common workflows (service template, auth, telemetry). [8]
Remove ticket queues for repeatable requests (self-service + GitOps). [7]

Reliability

Standardize timeouts, retries, budgets, and resource requests/limits. [6]
Use progressive delivery where risk is high.

Architecture

Reduce shared DB coupling; establish service-owned boundaries. [9][10]
Modernize incrementally (Strangler Fig), not via big-bang rewrites. [12]

Governance

Replace routine approvals with evidence: tests + policy-as-code + telemetry. [3][4]

References

[1] DORA - “DORA’s software delivery performance metrics (guide)”. https://dora.dev/guides/dora-metrics/
[2] DORA - “Accelerate State of DevOps Report 2024”. https://dora.dev/research/2024/dora-report/
[3] DORA - “Streamlining change approval (capability)”. https://dora.dev/capabilities/streamlining-change-approval/
[4] ContinuousDelivery.com - “Continuous Delivery and ITIL: Change Management”. https://continuousdelivery.com/2010/11/continuous-delivery-and-itil-change-management/
[5] Kubernetes docs - “Workloads (Deployments are a good fit for stateless workloads)”. https://kubernetes.io/docs/concepts/workloads/
[6] Kubernetes docs - “Resource Management for Pods and Containers (requests/limits)”. https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
[7] OpenGitOps - “What is OpenGitOps?” and project background. https://opengitops.dev/ and https://opengitops.dev/about/
[8] Microsoft Engineering Blog - “Building paved paths: the journey to platform engineering”. https://devblogs.microsoft.com/engineering-at-microsoft/building-paved-paths-the-journey-to-platform-engineering/
[9] Microservices.io - “Database per service” pattern. https://microservices.io/patterns/data/database-per-service
[10] Microservices.io - “Shared database” pattern. https://microservices.io/patterns/data/shared-database.html
[11] AWS documentation - “Data protection in Amazon S3 (durability/availability design goals)”. https://docs.aws.amazon.com/AmazonS3/latest/userguide/DataDurability.html
[12] Martin Fowler - “Strangler Fig Application” (legacy modernization pattern). https://martinfowler.com/bliki/StranglerFigApplication.html

Stop Shipping Slide Decks

Sat, 31 Jan 2026 11:15:00 -0500

Position: This is not “documentation bad.” This is “documentation is a tool.” If it increases lead time, hides truth, or replaces learning, it’s not helping.

Why this matters

In software, the real “source of truth” is:

running systems
code and configuration
production telemetry
incident history

Documentation should reduce uncertainty and speed up decisions. But two artifacts routinely do the opposite in large organizations:

the 40-page slide deck
the Word doc living somewhere in SharePoint that nobody can find

These artifacts often become deliverables - a substitute for building. They make it possible to spend months “progressing” without ever encountering reality.

And here’s the part most orgs miss:

If you’re going to fail, you want to fail quickly and cheaply, not slowly and expensively. [4]

That doesn’t mean reckless shipping. It means running a tight learning loop and letting reality correct you early - before you’ve sunk quarters of time into the wrong solution.

TL;DR

Decks are great for storytelling. They are bad as an engineering system of record.
“SharePoint architecture docs” become a document cemetery: hard to find, hard to diff, and easy to ignore.
The Agile Manifesto explicitly values working software over comprehensive documentation. [1] And one Agile principle states that working software is the primary measure of progress. [2]
Replace decks/docs-as-deliverables with:
  RFC-lite (1-2 pages) + a running thin slice
  ADRs (Architecture Decision Records) to capture decisions + tradeoffs [5][6]
  Docs-as-code (Markdown in the repo, reviewed like code)
  diagrams that are versioned and easy to update
Measure improvement with system outcomes (lead time, deploy frequency, change failure rate, MTTR). [3]

Pattern 1: Deck-driven development
Pattern 2: SharePoint document cemeteries
Pattern 3: Architecture as narrative, not decisions
Pattern 4: “Design phase” gating
Pattern 5: Documentation that never gets pruned
What to do instead: a documentation system that ships
Verification: how you know it’s working
A practical checklist
References

Pattern 1: Deck-driven development

What it looks like

A 40-page deck is created to describe a system that doesn’t exist yet.
The deck gets reviewed by multiple groups.
Approval is treated as progress.
When implementation starts, the world has changed - or key constraints were missed.

Why it exists

Decks are socially useful:

they compress complexity into a narrative
they help leaders “see” a plan
they make uncertainty feel controlled

The hidden tax

Decks are a poor engineering artifact because they’re:

low fidelity: they rarely contain executable truth
hard to maintain: updates are manual and usually lag reality
hard to diff: you can’t easily review what changed and why
easy to perform: a deck can look complete while the design is still untested
not tied to code: no direct path from “decision” -> “implementation” -> “verification”

The worst outcome isn’t that the deck is wrong. It’s that the deck delays the point where you discover what’s wrong.

The replacement pattern

Use decks for storytelling after you have reality. Use engineering artifacts to discover reality.

A strong default:

RFC-lite (1-2 pages)
a runnable thin slice
measurable verification (latency, cost envelope, failure mode)

This aligns with Agile’s emphasis on working software as a real measure of progress. [2]

Transition step (low drama)

Replace “deck required for approval” with “evidence required for approval”:

link to the RFC
link to a running demo / branch / sandbox
explicit constraints + tradeoffs
an exit criteria checklist for the slice

Pattern 2: SharePoint document cemeteries

What it looks like

Architecture docs exist as Word/PDF files in SharePoint.
Multiple versions exist (“Final_v7_REAL_FINAL.docx”).
Search works poorly unless you already know what to search for.
Nobody updates the doc because it’s painful and risky (“what if I change the blessed doc?”).

Why it exists

It’s an enterprise default:

SharePoint is “official”
Word docs feel formal
it’s familiar to non-engineering stakeholders

The hidden tax

SharePoint docs typically fail at the things engineering needs most:

discoverability (people don’t know where to look)
ownership (no clear maintainer)
reviewability (diffs and PR discussion are weak)
linking to reality (code, configs, dashboards, runbooks)
keeping current (documentation drift becomes the norm)

So teams stop trusting docs and rely on tribal knowledge - until they page someone at 2 a.m.

The replacement pattern

Treat documentation as part of the codebase:

Markdown in the repo
reviewed via PR like code
versioned with implementation
linked to:
  APIs (OpenAPI specs)
  dashboards
  runbooks
  incident writeups
  ADRs

Google’s documentation best practices make the point directly: a small set of fresh, accurate docs is better than a large pile in disrepair. [7]

Transition step

You don’t have to “migrate all docs.”

Start with a triage:

Identify the top 10 documents people actually need.
Recreate them as Markdown in a docs/ folder with an index.
Leave the rest as archived references, not living truth.

Pattern 3: Architecture as narrative, not decisions

What it looks like

The doc describes a target architecture but doesn’t answer:

why this approach?
what alternatives were considered?
what tradeoffs were accepted?
what constraints matter most?
what did we decide not to do?

Why it exists

Narratives are easier than decision logs. It’s simpler to write “the system will…” than to record the messy reality of tradeoffs.

The hidden tax

When decisions aren’t recorded, teams re-litigate them repeatedly. The same arguments come back every quarter - often because new people joined and the reasoning isn’t captured.

The replacement pattern: ADRs

Use Architecture Decision Records (ADRs): short, structured notes that capture an important decision with its context and consequences. [5] The practice is commonly attributed to Michael Nygard’s 2011 write-up. [6]

ADRs are the opposite of a 40-slide deck:

small
specific
diffable
linkable to code changes

Transition step

Start with one ADR per “architecturally significant decision”:

database choice
messaging pattern
tenancy model
auth model
deployment model
data boundary decisions

Pattern 4: “Design phase” gating

What it looks like

“We can’t start implementation until the analysis is complete.”
The analysis expands to include every possible future case.
The design grows more “complete” and less true.

Why it exists

Enterprises are understandably afraid of failure.

The hidden tax

This approach doesn’t eliminate failure. It defers it - making it more expensive.

Lean Startup describes progress as validated learning and emphasizes moving quickly through a build-measure-learn loop. [4] The point isn’t startups. The point is learning fast when you’re uncertain.

The replacement pattern

Timebox design, then validate with a thin slice:

write the RFC-lite doc
implement the smallest realistic end-to-end path
measure the constraints
then expand

Transition step

Define “analysis exit criteria”:

measurable constraints validated (not theorized)
spike code exists
a plan for incremental rollout exists

Pattern 5: Documentation that never gets pruned

What it looks like

Docs accumulate but aren’t maintained:

outdated architecture diagrams
old runbooks
stale onboarding guides
dead links

Why it exists

Pruning isn’t rewarded. Writing new docs feels productive; deleting old docs feels risky.

The hidden tax

Stale docs are worse than no docs:

they mislead
they increase cognitive load
they create false confidence

The replacement pattern

Adopt “minimum viable documentation” and prune regularly. [7]

The rule I like:

If a doc isn’t maintained, label it ARCHIVED and explain why.
If a doc is required, tie it to ownership and change workflow.

Transition step

Make docs part of PR hygiene:

if the change affects behavior, docs update ships with it
run link checks in CI
keep an index page updated
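
A link check does not need a dedicated tool to get started. The sketch below scans Markdown files for relative links and reports targets that do not exist on disk. It is deliberately minimal: `broken_links` is a hypothetical helper, the regular expression handles only inline `[text](target)` links, and external URLs and in-page anchors are skipped rather than verified.

```python
import re
from pathlib import Path

# Matches inline Markdown links of the form [text](target).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def broken_links(docs_root: Path) -> list[tuple[Path, str]]:
    """Return (file, target) pairs for relative links whose target is missing."""
    broken = []
    for md_file in sorted(docs_root.rglob("*.md")):
        text = md_file.read_text(encoding="utf-8")
        for target in LINK_RE.findall(text):
            if target.startswith(("http://", "https://", "mailto:", "#")):
                continue  # external links and in-page anchors are not checked here
            relative = target.split("#", 1)[0]  # drop any "#section" suffix
            if not (md_file.parent / relative).exists():
                broken.append((md_file, target))
    return broken


def check(docs_root: str = "docs") -> int:
    """Print broken links and return a CI-friendly exit code (0 = clean)."""
    problems = broken_links(Path(docs_root))
    for source, target in problems:
        print(f"{source}: broken link -> {target}")
    return 1 if problems else 0
```

A CI job can call `check()` and pass its return value to `sys.exit`, so a dead link fails the same pull request that introduced it.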

What to do instead: a documentation system that ships

Here’s a simple “docs system” that works in practice.

A repo structure that scales

/README.md                      # entry point: what this is + how to run it
/docs/
  index.md                      # "start here" documentation map
  rfc/
    0001-tenancy-model.md
    0002-storage-approach.md
  adr/
    0001-use-postgres.md
    0002-adopt-opentelemetry.md
  architecture/
    context.md                  # C4-ish: context + boundaries
    containers.md               # top-level services
    deployment.md               # runtime & environments
  runbooks/
    oncall.md
    incident-response.md
  api/
    openapi.yaml

Replace 40 slides with two artifacts

RFC-lite (1-2 pages): the “what” and “why”
Thin slice demo: the reality check

RFC-lite template (copy/paste)

# RFC: <title>

## Problem
What are we trying to solve? Who is affected?

## Constraints
Latency, cost, compliance, tenancy, uptime, environments.

## Proposal
What are we building? What does "done" mean?

## Alternatives considered
Option A / B / C with short tradeoffs.

## Risks and mitigations
What could go wrong? How will we contain blast radius?

## Verification
How will we measure success in production?

ADR template (copy/paste)

# ADR-XXXX: <decision>

## Status
Proposed | Accepted | Deprecated

## Context
What drove this decision? What constraints matter?

## Decision
What did we decide?

## Consequences
What do we gain? What do we lose? What changes later?

Verification: how you know it’s working

If you replace decks and doc cemeteries with real engineering artifacts, you should see:

Delivery metrics improve

Track the same system-level outcomes DORA promotes: lead time, deploy frequency, change failure rate, and time to restore service. [3]

Fewer handoffs and fewer “alignment meetings”

If teams can self-serve context from living docs, coordination cost drops.

Faster “first reality”

A simple heuristic:

How long from idea -> first runnable thin slice?

If that number is months, the system is optimized for analysis, not learning.

Docs stay alive

docs updated alongside code
fewer stale “final_v7” files
fewer tribal-knowledge escalations

A practical checklist

If you want to kill deck-driven delivery without starting a culture war:

Stop treating decks as deliverables

Architecture reviews require an RFC + a runnable slice.
Decks are optional; evidence is not.

Fix document discoverability

One docs/index.md that links to the docs that matter.
Make the repo the source of truth for technical docs.

Capture decisions, not fantasies

Add ADRs for major decisions and link them to PRs. [5][6]

Timebox analysis

Set analysis exit criteria.
Optimize for early learning and quick failure when uncertainty is high. [4]

Keep docs small and alive

Prune regularly; archive what’s stale.
Run link checks in CI.
Treat docs like bonsai: maintained and trimmed, not accumulated. [7]

References

[1] Manifesto for Agile Software Development (values; “Working software over comprehensive documentation”). https://agilemanifesto.org/
[2] Principles behind the Agile Manifesto (“Working software is the primary measure of progress”). https://agilemanifesto.org/principles.html
[3] DORA - “DORA’s software delivery performance metrics (guide)”. https://dora.dev/guides/dora-metrics/
[4] Lean Startup principles (Build-Measure-Learn; learning quickly; failing fast/cheaply as a concept). https://theleanstartup.com/principles
[5] ADR - Architectural Decision Records (what ADRs are). https://adr.github.io/
[6] Michael Nygard - “Documenting Architecture Decisions” (2011; ADR practice origin/popularization). https://www.cognitect.com/blog/2011/11/15/documenting-architecture-decisions
[7] Google Documentation Guide - Best practices (“Minimum Viable Documentation”; keep docs short, fresh, and pruned). https://google.github.io/styleguide/docguide/best_practices.html

When Management Layers Become Latency

Sat, 24 Jan 2026 10:30:00 -0500

Note on examples: The scenarios below are anonymized composites. This isn’t “management bad.” Good management is an accelerator. The problem is when management becomes layers of translation between reality and decisions.

Why this matters

In production systems, adding hops between a request and a response increases latency, failure modes, and debugging time.

Organizations behave the same way.

When engineering work flows through too many intermediary layers - tech leads, scrum masters, managers, senior managers, project managers, directors, senior directors, VPs, and beyond - the organization starts to exhibit the same symptoms as an over-proxied network:

long lead times
lost context (“telephone game” requirements)
local optimization (everyone looks busy; value doesn’t move)
coordination overhead that scales faster than delivery
engineers feeling like nothing they build reaches production

The painful part is that the org can look healthy on paper (status is green, roadmaps are full) while the product fails to meet real expectations.

This article is about the mechanics behind that failure - and the replacement patterns that restore flow.

TL;DR

Layers create handoffs. Handoffs create queues. Queues create lead time.
More roles don’t automatically increase throughput; coordination cost can dominate (Brooks’s Law). [6]
Fast flow requires end-to-end ownership with minimal handoffs (stream-aligned teams). [3][4]
Measure outcomes at the system level (DORA metrics), not “activity” (story points, number of meetings). [1]
Don’t turn metrics into targets (Goodhart’s Law). [7]
Burnout often rises when delivery is painful and risky; improving delivery capability predicts lower burnout. [2][8]

Pattern 1: Translation layers replace direct truth
Pattern 2: Status becomes the work
Pattern 3: “More people” is treated like a throughput solution
Pattern 4: Projectization and temporary teams
Pattern 5: Governance by meeting instead of guardrail
Pattern 6: Metrics as targets
Pattern 7: Engineers are abstracted away from production
Replacement patterns that work
Verification: how you know the org is healing
A practical checklist
References

Pattern 1: Translation layers replace direct truth

What it looks like

A customer need or operational pain moves through a chain:

customer -> product -> program -> project -> delivery manager -> engineering manager -> tech lead -> engineers

By the time it arrives at the team, it’s been translated multiple times and often loses:

the actual user story
the constraints
the real priority
the “why”

Why it exists

Layering feels safe:

fewer people “bother” engineers
leaders get curated information
decision makers see clean narratives

The hidden tax

Misalignment becomes normal.
Engineers build the wrong thing efficiently.
Product expectations aren’t met, not because engineers can’t build - but because the input signal is degraded.

The replacement pattern

Shorten the feedback loop.

Ensure teams have direct access to:
  customer signals (support tickets, usage, interviews)
  operational signals (incidents, latency, error budgets)
Make the “why” non-optional: put it in the ticket, the PRD, and the kickoff.

If a team can’t explain “why this exists,” it shouldn’t ship yet.

Pattern 2: Status becomes the work

What it looks like

Organizations that struggle to ship often compensate with:

more meetings
more dashboards
more decks
more “alignment sessions”

The output looks like progress, but the production system doesn’t change.

Why it exists

When uncertainty is high, visibility is comforting.

The hidden tax

Attention becomes scarce.
Engineers fragment into “meeting responders.”
Work becomes multi-tasked across too many initiatives (WIP explosion).

The replacement pattern

Reduce status overhead by making the system visible:

CI/CD dashboards
production telemetry
an engineering scorecard based on system outcomes (not activity)

DORA’s metrics are widely used as system-level indicators for delivery performance: deployment frequency, lead time, change failure rate, and time to restore service. [1]

Pattern 3: “More people” is treated like a throughput solution

What it looks like

A late initiative triggers:

new managers
new project managers
new engineers
more coordination rituals

Why it exists

It’s intuitive: more people should mean more output.

The hidden tax

Software delivery has a coordination component. Adding people increases communication paths, onboarding, and synchronization.

Brooks’s Law captures this succinctly: adding manpower to a late software project can make it later. [6]
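
The arithmetic behind this is simple. If every person may need to coordinate with every other person, a group of n people has n(n-1)/2 possible communication paths. The sketch below just evaluates that formula; the all-to-all assumption is a worst case, and `communication_paths` is an illustrative name.

```python
def communication_paths(people: int) -> int:
    """Pairwise communication paths in a group where anyone may talk to anyone."""
    if people < 0:
        raise ValueError("people must be non-negative")
    return people * (people - 1) // 2


# Doubling headcount roughly quadruples the potential coordination paths.
for size in (5, 10, 20):
    print(size, communication_paths(size))
```

Going from 5 to 10 people moves the count from 10 paths to 45, which is why reducing coordination load tends to pay off before adding headcount does.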

The replacement pattern

Before adding headcount, reduce coordination load:

clarify ownership
shrink scope to a thin vertical slice
eliminate handoffs
stabilize requirements long enough to ship

Then scale with:

duplication (more teams owning similar streams)
platform leverage (paved roads), not more meetings

Pattern 4: Projectization and temporary teams

What it looks like

Engineers are repeatedly reorganized into short-lived “project teams,” and after delivery they are moved again.

Why it exists

Projects are easy to budget, track, and narrate.

The hidden tax

Temporary teams produce:

fragile ownership
weak operability
“throw it over the wall” incentives

Fast flow requires teams that own outcomes end-to-end with minimal handoffs.

Team Topologies describes stream-aligned teams as owning a slice of value end-to-end with no handoffs. [3][4]

The replacement pattern

Prefer stable teams aligned to a value stream (product/service), with:

clear ownership
operational responsibility (“you build it, you run it”)
direct feedback from users and production

Pattern 5: Governance by meeting instead of guardrail

What it looks like

Instead of “how do we make safe delivery easy,” governance becomes:

approval steps
committees
sign-off chains

Why it exists

Risk is real, and leaders want control.

The hidden tax

Humans are expensive control planes:

slow
inconsistent
difficult to audit at scale

The replacement pattern

Convert rules into guardrails:

policy-as-code
templates
paved paths
automated checks in CI/CD

This is how you scale safety without scaling meetings.
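To show what "rules as guardrails" can look like, here is a minimal sketch of a policy check that could run in CI against a Kubernetes Deployment manifest loaded as a dict. The three rules (pinned image tag, resource requests/limits, readiness probe) are illustrative assumptions, not a recommended policy set, and a real setup would more likely use a dedicated policy engine.

```python
def check_deployment(manifest: dict) -> list[str]:
    """Return a list of human-readable policy violations (empty means pass)."""
    violations = []
    containers = (
        manifest.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("containers", [])
    )
    if not containers:
        violations.append("no containers defined")
    for c in containers:
        name = c.get("name", "<unnamed>")
        # Only look for a tag in the last path segment, so a registry port
        # such as "registry:5000/app" is not mistaken for a tag.
        last_segment = c.get("image", "").rsplit("/", 1)[-1]
        tag = last_segment.rsplit(":", 1)[1] if ":" in last_segment else ""
        if tag in ("", "latest"):
            violations.append(f"{name}: image must be pinned to a non-latest tag")
        resources = c.get("resources", {})
        for kind in ("requests", "limits"):
            if not resources.get(kind):
                violations.append(f"{name}: missing resources.{kind}")
        if "readinessProbe" not in c:
            violations.append(f"{name}: missing readinessProbe")
    return violations
```

A pipeline step would fail the build when the list is non-empty and print the violations, so the author gets the answer in minutes instead of waiting for a review meeting.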

Pattern 6: Metrics as targets

What it looks like

Teams are pressured to hit:

story points
“velocity”
number of deployments
“percent complete”
tickets closed

Then behavior adapts to the metric.

Why it exists

Leaders need a dashboard.

The hidden tax

When a measure becomes a target, it can stop being a good measure (Goodhart’s Law). [7]

Examples:

inflate points
ship low-value changes to increase deploy count
avoid hard work because it hurts “throughput”

The replacement pattern

Use metrics diagnostically at the system level (not as individual KPIs).

If you adopt DORA metrics, use them to identify constraints and improve flow - not as quarterly targets for teams. [1][9]

Pattern 7: Engineers are abstracted away from production

What it looks like

A team builds a system, but:

another team deploys it
another team runs it
another team handles incidents
another team owns the roadmap

Engineers eventually conclude: “Nothing I build actually ships.”

Why it exists

Specialization can be useful, but excessive separation breaks feedback loops.

The hidden tax

teams don’t learn from production
quality declines because consequences are indirect
“deployment pain” rises: shipping becomes stressful and disruptive

DORA describes deployment pain as fear/anxiety around deploying and links it to poorer delivery performance and culture. [8] DORA also notes continuous delivery predicts lower levels of burnout and reduces deployment pain. [2]

The replacement pattern

Re-connect engineers to production:

give teams operational ownership for what they build
make telemetry and incident review part of engineering
reduce fear by making releases small, frequent, and observable

Replacement patterns that work

These are the patterns I’ve seen consistently restore delivery flow without chaos.

1) Clarify decision rights (and keep them close to the work)

One accountable owner per initiative (not “everyone is accountable”)
Engineers participate in tradeoff decisions early (scope, sequencing, risk)

2) Design teams for flow (not for org charts)

Organizations build systems that mirror their communication structures (Conway’s Law). [5] If your org is siloed and layered, your architecture often becomes siloed and layered too.

Design teams so the desired architecture is the path of least resistance.

3) Prefer stream-aligned teams + platform leverage

Stream-aligned teams own outcomes end-to-end (no handoffs). [3][4]
Platform teams reduce cognitive load by providing paved roads (auth, telemetry, CI/CD). [4]

4) Replace “alignment meetings” with shared artifacts

one-page decision records
clear “definition of done”
demos that show working software in a real environment

5) Turn delivery into a calm, repeatable process

When delivery is painful, people add layers to manage fear. Fix the source:

tests
automation
progressive delivery
observable releases

That’s how you reduce burnout sustainably. [2][8]

Verification: how you know the org is healing

Don’t rely on vibes. Use evidence.

Delivery outcomes (system-level)

Start with DORA metrics to track flow and stability. [1]

Product outcomes

adoption (are users actually using the thing?)
retention (does usage persist?)
reduced operational toil (do incidents go down?)

Team outcomes

fewer emergency escalations
fewer “status-only” meetings
improved on-call experience (lower deployment pain) [8]

If lead time drops but burnout rises, you probably “optimized the dashboard” instead of the system (see Goodhart). [7]

A practical checklist

If your org feels “management-heavy,” try this in order:

Reduce translation layers

Put engineers in the room (or thread) with real users/operators at least weekly.
Require the “why” to be written and reviewed before build starts.

Reduce handoffs

Map the value stream and count handoffs.
Remove one handoff per quarter; make it a goal.

Reduce WIP

Limit concurrent initiatives per team.
Finish before starting.

Convert meetings into guardrails

Replace approvals with automated checks where possible.
Create paved paths so the safe way is the easy way.

Reconnect teams to production

Teams own what they ship.
Tie incident learning back to design decisions.
Make releases smaller and more frequent.

References

[1] DORA - “DORA’s software delivery performance metrics (guide)”. https://dora.dev/guides/dora-metrics/
[2] DORA - “Capabilities: Continuous delivery” (notes relationship to burnout and deployment pain). https://dora.dev/capabilities/continuous-delivery/
[3] Team Topologies - “Key Concepts” (stream-aligned teams; no handoffs). https://teamtopologies.com/key-concepts
[4] IT Revolution - “The Four Team Types from Team Topologies” (stream-aligned teams own end-to-end). https://itrevolution.com/articles/four-team-types/
[5] Splunk - “Conway’s Law Explained” (systems mirror communication structures; includes original quote). https://www.splunk.com/en_us/blog/learn/conways-law.html
[6] Brooks’s Law (coined in The Mythical Man-Month): “Adding manpower to a late software project makes it later.” https://en.wikipedia.org/wiki/Brooks%27s_law
[7] CNA - “Goodhart’s Law” (when a measure becomes a target, it ceases to be a good measure). https://www.cna.org/analyses/2022/09/goodharts-law
[8] DORA - “Capabilities: Well-being” (deployment pain and its relationship to performance/culture). https://dora.dev/capabilities/well-being/
[9] SEI (CMU) - “How to Misuse and Abuse DORA Metrics” (metric anti-patterns). https://www.sei.cmu.edu/library/how-to-misuse-and-abuse-dora-metrics/

Agile Isn't Dead. Agile Compliance Is.

Wed, 31 Dec 2025 12:00:00 -0500

Note on examples: The scenarios below are anonymized composites. This isn’t “Agile bad.” It’s “Agile the brand is often used to justify systems that do the opposite of Agile’s intent.”

Why this matters

Agile isn’t a set of meetings. It’s a physics statement:

Shorter feedback loops reduce risk.

Most enterprises didn’t fail Agile. They replaced Agile with a bureaucracy that uses Agile vocabulary:

“Sprint” becomes a reporting interval
“Velocity” becomes a performance metric
“Planning” becomes a negotiation
“Definition of done” becomes a checklist
“Agile transformation” becomes a multi-year program

The result is predictable:

delivery slows
quality degrades
reliability suffers
engineers burn out
product expectations aren’t met
leadership gets more dashboards and fewer outcomes

This post is a production-first teardown of Agile theater - and a replacement model that actually ships.

TL;DR

Agile is about learning quickly, not predicting perfectly.
Scrum is useful when it reduces uncertainty. It’s harmful when it becomes a compliance system.
If you treat sprints as contracts, you’ll get scrumfall: waterfall dependencies with sprint-shaped reporting.
Replace “Agile compliance” with:
Flow (small batches, limit WIP)
Continuous delivery (safe, frequent releases) [4]
Evidence-based planning (measure outcomes; adjust quickly) [5]
Use system metrics (DORA) to verify improvement: lead time, deploy frequency, change failure rate, MTTR. [6]
Beware Goodhart’s Law: metrics used as targets will be gamed. [7]

Agile the physics vs Agile the bureaucracy
Pattern 1: Sprints as contracts
Pattern 2: Velocity as a performance metric
Pattern 3: Backlog bloat as a museum of anxiety
Pattern 4: Ceremonies become the work
Pattern 5: Dependencies turn Scrum into fiction
Pattern 6: Definition of done without production
Pattern 7: Product ownership by proxy
What’s better: Flow + CD + evidence
Transition plan: 30 days without a revolution
Verification: how you know it’s working
A practical checklist
References

Agile the physics vs Agile the bureaucracy

The Agile Manifesto values working software over comprehensive documentation and emphasizes collaboration and responding to change. [1] One of its principles states that working software is the primary measure of progress. [2]

Those ideas are still correct.

What broke in enterprises is implementation:

Agile became process instead of feedback
Agile artifacts became deliverables
teams were optimized for predictability theater instead of throughput and learning

In short: Agile got turned into compliance.

Pattern 1: Sprints as contracts

What it looks like

Sprint planning is treated as a commitment contract.
Changing scope is seen as failure, even when reality changes.
Teams avoid surfacing unknowns because unknowns disrupt “commitment.”

Why it happens

Leaders want predictability. Sprints feel like a way to buy it.

The hidden tax

When you turn sprints into contracts, teams adapt:

reduce exploration
defer integration
accept low-quality shortcuts
split work into artificial “done-looking” chunks

You don’t eliminate uncertainty. You hide it until the end.

The replacement pattern

Use cadence as a heartbeat, not as a contract:

Plan in small chunks.
Commit to outcomes and constraints, not a stack of tickets.
Treat scope as a lever; treat time as a constraint.

Pattern 2: Velocity as a performance metric

What it looks like

Story points become productivity.
Velocity is compared across teams.
Teams feel pressure to “go faster” by increasing points delivered.

Why it happens

Velocity is a number. Numbers are tempting.

The hidden tax

Story points are a local measure with no consistent meaning across teams. When you attach incentives, teams optimize for the metric:

inflate estimates
split work to maximize points
avoid hard, high-leverage work
ship low-value changes

This is a textbook Goodhart’s Law failure mode: when a measure becomes a target, it ceases to be a good measure. [7]

The replacement pattern

Measure the system, not the story:

lead time
cycle time
deploy frequency
change failure rate
MTTR

Use metrics diagnostically, not as quarterly targets.

Pattern 3: Backlog bloat as a museum of anxiety

What it looks like

Thousands of backlog items exist “for visibility.”
Nothing gets deleted.
Refinement happens continuously, but priorities change weekly.

Why it happens

Backlogs feel like control: “We haven’t forgotten.”

The hidden tax

A giant backlog increases planning cost and reduces focus. Teams stop trusting priorities and operate on side-channel requests.

My favorite framing:

If everything is in the backlog, nothing is prioritized. It’s just a museum of anxiety.

The replacement pattern

Adopt a tight horizon model:

Now: what we’re building
Next: what’s likely next
Later: ideas (low-investment capture)

Refine Now/Next. Archive the rest.

Pattern 4: Ceremonies become the work

What it looks like

Standups become status meetings for managers.
Planning takes hours.
Refinement is endless.
Retrospectives generate action items that never get resourced.

Why it happens

Ceremonies are easy to schedule. Delivery capability is harder to build.

The hidden tax

Attention becomes fragmented. Engineers become “meeting responders.” Work gets multi-tasked across initiatives.

This is how you get:

slow delivery
low quality
burnout

The replacement pattern

Keep only the meetings that reduce uncertainty:

shorter planning
true async refinement
standup for coordination within the team (not reporting)
retros with real ownership and budget

Then invest in the thing ceremonies can’t replace: engineering capability (tests, pipelines, observability, automation).

Pattern 5: Dependencies turn Scrum into fiction

What it looks like

Every story depends on another team.
“Blocked” is normal.
Integration is deferred to later sprints.

Why it happens

Organizations are siloed. Systems mirror communication structures (Conway’s Law). [8]

The hidden tax

You get scrumfall: waterfall dependencies, sprint-shaped reporting.

A two-week sprint can’t save a three-month dependency queue.

The replacement pattern

Design for end-to-end ownership and flow:

reduce handoffs
remove or automate cross-team gates
create platform paved roads so teams can self-serve [9]

When dependencies can’t be eliminated, make them explicit and manage them like risk, not like hope.

Pattern 6: Definition of done without production

What it looks like

“Done” means “merged.”
QA is a phase.
Observability is optional.
Releases happen “later.”

Why it happens

Shipping is painful. So teams avoid it.

The hidden tax

If “done” doesn’t include production, you accumulate:

integration debt
release debt
incident debt

Reliability declines because feedback arrives late.

Continuous delivery’s core argument is that keeping software deployable and releasing frequently reduces risk and enables faster feedback. [4]

The replacement pattern

Upgrade your definition of done:

deployed to a real environment
observable (metrics/logs/traces)
rollback path exists
runbook exists for major failure modes

Pattern 7: Product ownership by proxy

What it looks like

Engineers rarely talk to users/operators.
“Product” is a chain of intermediaries.
Requirements arrive as polished tickets without the “why.”

Why it happens

The organization tries to protect engineers from churn.

The hidden tax

This degrades the input signal. Engineers build the wrong thing efficiently - and then everyone is surprised it didn’t land.

The replacement pattern

Bring engineers closer to reality:

listen to customer calls
review usage telemetry
participate in discovery
keep the “why” attached to every build

No one should ship something they can’t explain.

What’s better: Flow + CD + evidence

If Agile compliance is the disease, what’s the cure?

It’s not “a different framework.” It’s an operating model:

1) Flow: small batches, limited WIP

Lean/Kanban concepts focus on limiting work in progress and optimizing for flow. [3]

Finish work in progress before starting new work.
Reduce batch size.
Make queues visible.
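A quick way to see why WIP limits matter is Little's Law from queueing theory, which is not cited in the references above and is added here only as an illustration: for a stable system, average lead time is roughly average WIP divided by average throughput. The numbers below are made up.

```python
def expected_lead_time_days(wip_items: float, throughput_per_day: float) -> float:
    """Little's Law: average lead time = average WIP / average throughput.

    Only meaningful as a long-run average for a reasonably stable system.
    """
    if throughput_per_day <= 0:
        raise ValueError("throughput must be positive")
    return wip_items / throughput_per_day


# Same team, same throughput (3 items/day); only the amount of work in flight changes.
print(expected_lead_time_days(30, 3))  # 10.0 days with 30 items in progress
print(expected_lead_time_days(12, 3))  # 4.0 days with 12 items in progress
```

The design point: with throughput unchanged, cutting work in flight is the cheapest lever for shortening lead time.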

2) Continuous Delivery: make change safe

Continuous delivery is a capability: keep changes small, deployable, and observable so you can release frequently with lower risk. [4]

This includes:

CI
automated testing
progressive delivery (when needed)
rollback/roll-forward discipline
telemetry tied to releases

3) Evidence-based planning: bets, not contracts

Lean Startup’s build-measure-learn loop emphasizes validated learning - ship something real, measure, and adjust. [5]

For enterprises, the translation is simple:

Plan in small bets
Validate early
Use evidence to re-plan, not politics

Transition plan: 30 days without a revolution

You don’t need to burn the framework down. You need to change what you reward and what you ship.

Week 1: Make work visible as flow

Map the value stream from idea -> production.
Count handoffs.
Measure current lead time.

Week 2: Reduce batch size

Pick one initiative.
Cut it to a thin vertical slice that can ship.
Define “done” as “in production, measurable.”

Week 3: Reduce WIP

Stop starting new work.
Finish the slice.
Remove one blocking dependency with a paved path or automation.

Week 4: Close the feedback loop

Ship.
Measure.
Run a retro focused on system constraints (not blame).
Repeat.

If you do this and nothing improves, you learned something valuable: the constraint is elsewhere.

Verification: how you know it’s working

You should see movement in system outcomes:

DORA describes four key delivery performance metrics: lead time for changes, deployment frequency, change failure rate, and time to restore service. [6]

Signs of real improvement:

lead time drops (less queueing and fewer handoffs)
deploy frequency rises (smaller batches, calmer releases)
change failure rate drops (better tests and safer rollouts)
MTTR drops (better observability and operability)

And importantly: teams report less “deployment pain” and less burnout as delivery becomes calmer and more reliable. [10]

A practical checklist

If you’re stuck in Agile theater, try this:

Stop measuring activity

Stop comparing velocity across teams.
Stop treating story points as productivity.

Shrink feedback loops

Ship a thin slice to production early (behind a flag if needed).
Put engineers closer to users/operators.

Reduce handoffs and WIP

Limit concurrent initiatives.
Remove one handoff per quarter.

Invest in delivery capability

CI, tests, deployment automation
observability tied to releases
safer rollouts and rollback paths

Use metrics as signals, not targets

Track DORA metrics at the system level. [6]
Avoid metric gaming (Goodhart). [7]

References

[1] Manifesto for Agile Software Development (values). https://agilemanifesto.org/
[2] Principles behind the Agile Manifesto (“Working software is the primary measure of progress”). https://agilemanifesto.org/principles.html
[3] Kanban Guide (principles and practices oriented around flow and WIP). https://kanbanguides.org/english/
[4] Continuous Delivery (concepts; keep software deployable, release frequently). https://continuousdelivery.com/
[5] The Lean Startup - Principles (Build-Measure-Learn; validated learning). https://theleanstartup.com/principles
[6] DORA - “DORA’s software delivery performance metrics (guide)”. https://dora.dev/guides/dora-metrics/
[7] CNA - “Goodhart’s Law” (when a measure becomes a target, it ceases to be a good measure). https://www.cna.org/analyses/2022/09/goodharts-law
[8] Splunk - “Conway’s Law Explained” (systems mirror communication structures; includes original quote). https://www.splunk.com/en_us/blog/learn/conways-law.html
[9] Microsoft Engineering Blog - “Building paved paths: the journey to platform engineering”. https://devblogs.microsoft.com/engineering-at-microsoft/building-paved-paths-the-journey-to-platform-engineering/
[10] DORA - “Capabilities: Well-being” (deployment pain and relationship to performance/culture). https://dora.dev/capabilities/well-being/

From Stdio to Enterprise: The MCP Gateway Pattern

Sat, 22 Nov 2025 12:00:00 -0500

As-of note: MCP evolves quickly. This article references the MCP spec revision 2025-11-25. Validate details against the current spec before shipping changes. [1][2][3]

Why this matters

Local MCP servers over stdio are an amazing developer experience: you install a tool server, the host (Claude Desktop / Claude Code / an agent runtime) launches it, and you’re productive in minutes. [2]

But as soon as MCP becomes shared infrastructure - multiple clients, multiple users, multiple environments - the “local tool server” model runs into the same constraints every integration layer hits:

Who is allowed to call what tool?
How do you prevent one noisy user from melting shared dependencies?
How do you audit tool side effects?
How do you roll out tool changes without breaking clients?
How do you keep secrets out of prompts, logs, and screenshots?

This is where the MCP Gateway Pattern shows up.

A gateway is not “another service.” It’s a capability boundary: the place where you enforce policy, budgets, and observability for tool use at scale.

TL;DR

Stdio is great for local, single-user, low-blast-radius setups.
HTTP transports (Streamable HTTP) enable multi-client servers - but they also require real auth and multi-tenant safety. [2][3]
An MCP gateway sits between clients and tool servers to provide:
authentication & authorization
tenant isolation
rate limits / concurrency / cost budgets
consistent tool schemas + safety gates
audit logs and observability
routing, versioning, rollout controls
Build the gateway to be boring: small surface area, strict validation, explicit policies, great telemetry.

When stdio stops being enough
The MCP Gateway Pattern
Responsibilities of a gateway
Reference architecture
Policy patterns that actually work
Scaling and isolation strategies
Observability and audit
Rollouts and versioning
A production checklist
References

When stdio stops being enough

MCP supports multiple transports; stdio is common for local servers. [2] In that model, the host controls process lifetime and secrets typically come from the environment on the local machine.

Stdio starts to strain when you need:

multi-client concurrency
shared tenancy
central policy enforcement
centralized audit
fleet-level rollout controls

At that point, you’re effectively building a platform. The platform needs a stable ingress point with consistent security and operational behavior.

MCP’s HTTP-based transports (like Streamable HTTP) are designed for servers that can handle multiple connections and enable streaming/notifications. [2] MCP also defines an authorization flow for HTTP-based transports. [3]

That’s the entry point for a gateway.

The MCP Gateway Pattern

Definition: An MCP gateway is an MCP server (or MCP-adjacent ingress layer) that:

authenticates and authorizes the client
routes requests to one or more downstream MCP servers (or tool backends)
enforces budgets and safety gates
emits consistent telemetry and audit records

It looks like an API gateway, but the payload is “tool capability” not “REST endpoints.”

Responsibilities of a gateway

1) Authentication and authorization

If you expose MCP servers over HTTP, you need strong auth. MCP includes an authorization framework at the transport layer for HTTP-based transports. [3]

Practical gateway rules:

Authenticate every client (bearer tokens, mTLS, OAuth-derived access tokens).
Authorize per tool, not per server.
Prefer least privilege scopes:
calendar.read
calendar.write
email.read
email.send
k8s.readonly
k8s.apply
For high-impact tools: require explicit confirmation tokens and/or multi-party approval.
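A minimal sketch of per-tool authorization, assuming the gateway has already authenticated the caller and extracted its granted scopes. The tool names and the tool-to-scope mapping are hypothetical; only the scope strings come from the list above.

```python
# Hypothetical registry: each tool declares the single scope it requires.
TOOL_SCOPES = {
    "calendar_list_events": "calendar.read",
    "calendar_create_event": "calendar.write",
    "email_send": "email.send",
    "k8s_get_pods": "k8s.readonly",
    "k8s_apply_manifest": "k8s.apply",
}


def authorize(tool: str, granted_scopes: set[str]) -> bool:
    """Allow a call only if the tool is known and its scope was granted."""
    required = TOOL_SCOPES.get(tool)
    if required is None:
        return False  # default-deny: unknown tools are never callable
    return required in granted_scopes
```

The default-deny branch is the important design choice: a tool added to a downstream server stays unreachable until someone registers its scope explicitly.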

2) Tool contract enforcement

MCP tools are invoked by an LLM-driven client. That means tool arguments are untrusted.

The gateway is the ideal place to enforce:

schema validation
payload size caps
allowlists and blocklists
“danger gates” (preview/apply, confirmations)
“semantic validation” (not just types - e.g., limits required, date ranges bounded)

MCP’s spec is grounded in structured schemas; treat those schemas as contracts. [1]
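To illustrate the difference between type checks and semantic validation, here is a sketch for a hypothetical `search_events` tool whose arguments are `limit`, `start`, and `end`. The field names and the specific caps are assumptions chosen for the example; in practice the structural part would usually come from the tool's JSON Schema.

```python
import json
from datetime import date

MAX_PAYLOAD_BYTES = 16 * 1024  # assumed cap on serialized arguments
MAX_LIMIT = 100                # assumed upper bound on result count
MAX_RANGE_DAYS = 31            # assumed upper bound on the date window


def validate_search_args(args: dict) -> list[str]:
    """Return validation errors for the hypothetical search_events tool."""
    if len(json.dumps(args).encode("utf-8")) > MAX_PAYLOAD_BYTES:
        return ["payload too large"]
    errors = []
    unknown = set(args) - {"limit", "start", "end"}
    if unknown:
        errors.append(f"unknown fields: {sorted(unknown)}")
    limit = args.get("limit")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(limit, int) or isinstance(limit, bool) or not 1 <= limit <= MAX_LIMIT:
        errors.append(f"limit is required and must be an integer in 1..{MAX_LIMIT}")
    try:
        start = date.fromisoformat(args["start"])
        end = date.fromisoformat(args["end"])
    except (KeyError, TypeError, ValueError):
        errors.append("start and end are required ISO dates (YYYY-MM-DD)")
    else:
        if end < start:
            errors.append("end must not be before start")
        elif (end - start).days > MAX_RANGE_DAYS:
            errors.append(f"date range must be at most {MAX_RANGE_DAYS} days")
    return errors
```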

3) Budgets and backpressure

Agents can trigger bursty tool calls. Without backpressure you get the classic cascade:

upstream rate limits
DB pool exhaustion
thread/goroutine explosion
timeouts everywhere

At the gateway you can enforce:

per-tenant rate limits
per-tool concurrency limits
timeouts and deadline propagation
queue depth caps (bounded memory)
circuit breakers for flaky dependencies

This is where you keep “one user spamming tools” from becoming “everyone is down.”
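Per-tenant rate limiting is often implemented as a token bucket. The sketch below is one such implementation under simplifying assumptions: it is in-memory, single-process, and not thread-safe, and the class names are made up for the example. A shared gateway would typically need a distributed store or a sidecar for this.

```python
import time


class TokenBucket:
    """Allows bursts up to `burst` calls, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill based on elapsed time, never exceeding the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class TenantLimiter:
    """Keeps one bucket per tenant so a noisy tenant only throttles itself."""

    def __init__(self, rate_per_sec: float, burst: int, clock=time.monotonic):
        self._args = (rate_per_sec, burst, clock)
        self._buckets: dict[str, TokenBucket] = {}

    def allow(self, tenant: str) -> bool:
        bucket = self._buckets.get(tenant)
        if bucket is None:
            bucket = self._buckets[tenant] = TokenBucket(*self._args)
        return bucket.allow()
```

When `allow` returns False, the gateway should reject quickly (for example with a retry hint) rather than queue the call, which keeps memory bounded.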

4) Secret handling and redaction

Gateways are a natural place to centralize:

secret injection (short-lived tokens per tenant)
output redaction (strip tokens, emails, PII fields)
logging policies (never log raw tool payloads by default)

For agent systems, OWASP highlights risks like prompt injection and sensitive info disclosure as major categories. [7]

Your gateway should assume that anything returned by a tool could be coerced into exfiltration if you’re careless.
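As a sketch of output redaction, the function below walks a JSON-like tool result and masks sensitive keys, bearer tokens, and email addresses. The key names and regular expressions are illustrative assumptions and will miss real-world cases; treat pattern-based redaction as one layer, not a guarantee.

```python
import re

# Assumed patterns for the example; tune and extend for your own data.
_PATTERNS = [
    (re.compile(r"(?i)bearer\s+[a-z0-9._~+/=-]+"), "Bearer [REDACTED]"),
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[REDACTED_EMAIL]"),
]
_SENSITIVE_KEYS = {"password", "token", "api_key", "authorization"}


def redact(value):
    """Return a copy of a JSON-like value with sensitive content masked."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if str(k).lower() in _SENSITIVE_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, str):
        for pattern, replacement in _PATTERNS:
            value = pattern.sub(replacement, value)
        return value
    return value  # numbers, booleans, None pass through unchanged
```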

5) Observability and audit

Operationally, the gateway is your best place to emit consistent:

request logs
tool call metrics
traces across tool chains
audit events for side effects

OpenTelemetry is the de facto standard for collecting and exporting telemetry. [5] W3C Trace Context defines headers like traceparent/tracestate for trace propagation across services. [6]
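To make the header concrete, here is a small sketch that parses a version-00 `traceparent` value (version, trace ID, parent ID, flags). It is for illustration only: it deliberately rejects other versions, and in a real gateway you would normally rely on an OpenTelemetry propagator rather than hand-rolled parsing.

```python
import re
from typing import NamedTuple, Optional

# version "00": 2 hex - 32 hex trace-id - 16 hex parent-id - 2 hex flags
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a version-00 traceparent header; return None if it is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid; this sketch only understands version 00.
    if version != "00" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

On an invalid or missing header, the usual approach is to start a new trace rather than fail the request.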

If you want an enterprise to trust agents, you need the forensic trail.

6) Routing and discovery at scale

The gateway becomes:

the routing table (“tool X lives in cluster Y”)
the discovery system (“list tools available for tenant Z”)
the version broker (“tool schema v3 for client A, v4 for client B”)

This is also where you can implement “tool quality” policies:

quarantine tools with high error rates
fallback to read-only alternatives
degrade gracefully under partial outages

Reference architecture

Here’s a simple, effective gateway architecture:

+------------------------------+
|  Agent host / IDE / runtime  |
|  (MCP client)                |
+------------------------------+
               |  Streamable HTTP / JSON-RPC [2][4]
               v
+----------------------------------------+
|  MCP Gateway                           |
|  - AuthN/Z [3]                         |
|  - Schema + safety gates               |
|  - Budgets (rate, concurrency, cost)   |
|  - Audit + telemetry (OTel) [5][6]     |
|  - Routing + tool registry             |
+----------------------------------------+
               |
        +------+---------------+
        v                      v
+----------------+   +--------------------+
|  MCP Server A  |   |  MCP Server B      |
|  (calendar)    |   |  (k8s, github...)  |
+----------------+   +--------------------+
        |                      |
        v                      v
  Upstream APIs          Upstream APIs

Key design decision: the gateway should not contain business logic. It enforces policy and routes tool calls. Tool semantics live in tool servers.

Policy patterns that actually work

Pattern: Read vs write tool classes

Classify tools into tiers:

Read-only: listing, searching, fetching
Write-safe: creates/updates that are naturally reversible
Dangerous: deletes, bulk updates, destructive actions, privileged ops

Then enforce different rules per tier:

Read-only: wide availability, higher concurrency
Write-safe: lower concurrency, stronger audit, idempotency keys
Dangerous: preview/apply, explicit confirmations, restricted scopes

Pattern: Preview -> Apply

For any tool that can cause harm:

plan_* returns a plan + summary + plan_id
apply_* requires plan_id (and optionally a user confirmation token)

This is the “terraform plan/apply” mental model applied to tools.
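A minimal sketch of the preview/apply flow for a hypothetical bulk-delete tool. The `PlanStore` class, its method names, and the in-memory storage are assumptions for the example; a real gateway would persist plans with an expiry and could also require a user confirmation token.

```python
import uuid
from typing import Callable


class PlanStore:
    """Holds pending plans so apply can only execute what was previewed."""

    def __init__(self):
        self._plans: dict[str, dict] = {}

    def plan_delete(self, tenant: str, resource_ids: list[str]) -> dict:
        plan_id = str(uuid.uuid4())
        self._plans[plan_id] = {"tenant": tenant, "resource_ids": list(resource_ids)}
        return {
            "plan_id": plan_id,
            "summary": f"will delete {len(resource_ids)} resource(s)",
            "resource_ids": list(resource_ids),
        }

    def apply_delete(self, tenant: str, plan_id: str, delete_fn: Callable[[str], None]) -> list[str]:
        plan = self._plans.get(plan_id)
        # Reject unknown plans and plans created by a different tenant.
        if plan is None or plan["tenant"] != tenant:
            raise PermissionError("unknown plan_id for this tenant")
        del self._plans[plan_id]  # plans are single-use
        for resource_id in plan["resource_ids"]:
            delete_fn(resource_id)
        return plan["resource_ids"]
```

Because apply takes only a `plan_id`, the model cannot widen the blast radius between preview and execution; it can only run exactly what was shown.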

Pattern: Allowlisted egress (SSRF containment)

If tools can fetch URLs or call arbitrary endpoints, treat it as SSRF risk. OWASP’s SSRF prevention guidance is a useful baseline. [8]

At the gateway, enforce:

allowlisted domains
blocks on internal and cloud-metadata IP/CIDR ranges
redirect re-validation
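The checks above can be sketched with the standard library. The allowlisted hostnames are placeholders, and the sketch assumes the caller resolves DNS itself, runs `is_blocked_address` on every resolved IP, and repeats both checks for each redirect hop. It is a starting point, not a complete SSRF defense.

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"api.example.com", "calendar.example.com"}  # placeholder allowlist


def is_egress_allowed(url: str) -> bool:
    """Allow only https URLs whose hostname is explicitly allowlisted."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URL (for example, a bad IPv6 literal)
    if parts.scheme != "https":
        return False
    host = (parts.hostname or "").lower()
    # Literal IPs are never in the hostname allowlist, so they are rejected too.
    return host in ALLOWED_HOSTS


def is_blocked_address(ip: str) -> bool:
    """True for addresses a tool should never reach (apply to resolved IPs)."""
    addr = ipaddress.ip_address(ip)
    # Link-local covers 169.254.0.0/16, which includes the common
    # cloud metadata endpoint 169.254.169.254.
    return addr.is_private or addr.is_loopback or addr.is_link_local
```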

Pattern: Tenant-bound tokens

Instead of giving tool servers “global” credentials, mint tenant-scoped tokens and inject them for each call.

reduces blast radius
makes audit meaningful
enables “kill switch” revocation per tenant

Scaling and isolation strategies

A gateway is where multi-tenancy becomes real. Choose an isolation model:

Option A: Process isolation per tool server (simple, strong isolation)

each integration is its own process/container
faults stay contained
rollouts per integration are easy

Tradeoff: more processes to manage.

Option B: Shared server with strong tenant sandboxing

single multi-tenant server handles many clients
cheaper to run
requires rigorous isolation inside the process

Tradeoff: higher risk if a bug leaks across tenants.

Option C: Hybrid

“sensitive” integrations are isolated
“low-risk” read-only tools can be multi-tenant

Most enterprises end up here.

Observability and audit

What to emit (minimum viable)

Metrics

tool_calls_total{tool, tenant, status}
tool_latency_ms{tool}
rate_limited_total{tenant}
budget_exceeded_total{tenant, budget_type}

Traces

request span (client -> gateway)
tool execution span (gateway -> server)
downstream spans (server -> upstream API)

Audit events

who (tenant/user/client)
what (tool + summarized parameters)
when
result (success/failure)
side effect IDs (resource IDs, plan_id, idempotency_key)
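One way to keep audit records consistent is to give them a fixed shape. The sketch below maps the who/what/when/result/side-effect fields onto a dataclass; the field names and the JSON serialization are assumptions for the example, not a standard format.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    tenant: str            # who: tenant
    user: str              # who: end user
    client: str            # who: calling client
    tool: str              # what: tool name
    params_summary: dict   # what: summarized parameters, never raw payloads
    result: str            # "success" or "failure"
    side_effect_ids: dict = field(default_factory=dict)  # resource IDs, plan_id, idempotency_key
    at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())  # when (UTC)

    def to_json(self) -> str:
        """Serialize with stable key order so records are easy to diff and index."""
        return json.dumps(asdict(self), sort_keys=True)
```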

OpenTelemetry’s Go docs are a good reference for instrumentation patterns. [5]

Rollouts and versioning

Tool contracts drift. Clients upgrade at different times. Gateways can reduce pain by:

pinning tool schema versions per client
supporting additive changes first (new fields optional)
allowing parallel tool versions for a period
enabling canary rollouts per tenant

If you do nothing else: never deploy a breaking tool change to 100% of tenants at once.
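A sketch of how a gateway might combine version pinning with a per-tenant canary. It assumes tenants are assigned to a stable bucket (0-99) by hashing their ID, so the same tenant always lands on the same side of the rollout; the function names and version labels are made up for the example.

```python
import hashlib


def canary_bucket(tenant_id: str) -> int:
    """Map a tenant to a stable bucket in 0..99 (same input, same bucket)."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def schema_version_for(
    tenant_id: str,
    pinned: dict[str, str],
    stable: str,
    canary: str,
    canary_percent: int,
) -> str:
    """Pick the tool schema version to serve to a tenant."""
    if tenant_id in pinned:
        return pinned[tenant_id]  # explicit pins always win
    return canary if canary_bucket(tenant_id) < canary_percent else stable
```

Raising `canary_percent` from 0 to 100 in steps moves tenants over gradually, and setting it back to 0 is the rollback.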

A production checklist

Security

AuthN required for all HTTP-based access. [3]
AuthZ enforced per tool (least privilege).
Tool inputs validated and bounded.
Dangerous tools require preview/apply and explicit confirmations.
Egress allowlists exist for URL/network tools. [8]

Reliability

Per-tenant rate limiting and per-tool concurrency caps.
Timeouts everywhere; deadlines propagate.
Bounded queues (no unbounded memory growth).
Circuit breakers for flaky dependencies.

Operability

Traces propagate end-to-end (W3C Trace Context). [6]
Metrics and logs are consistent and redacted.
Audit events exist for side effects.

Delivery

Tool schemas versioned; canary rollouts supported.
Quarantine and fallback policies exist for failing tools.

References

[1] Model Context Protocol (MCP) - Specification (Protocol Revision 2025-11-25): https://modelcontextprotocol.io/specification/2025-11-25
[2] MCP - Transports (including Streamable HTTP): https://modelcontextprotocol.io/specification/2025-03-26/basic/transports
[3] MCP - Authorization (HTTP-based transports): https://modelcontextprotocol.io/specification/2025-11-25/basic/authorization
[4] JSON-RPC 2.0 Specification: https://www.jsonrpc.org/specification
[5] OpenTelemetry Go - Instrumentation docs: https://opentelemetry.io/docs/languages/go/instrumentation/
[6] W3C - Trace Context: https://www.w3.org/TR/trace-context/
[7] OWASP - Top 10 for Large Language Model Applications: https://owasp.org/www-project-top-10-for-large-language-model-applications/
[8] OWASP - SSRF Prevention Cheat Sheet: https://cheatsheetseries.owasp.org/cheatsheets/Server_Side_Request_Forgery_Prevention_Cheat_Sheet.html

The Service Template That Prevents Incidents

Sat, 25 Oct 2025 12:00:00 -0500

Why this matters

Most enterprises try to standardize software delivery with:

PDFs
Confluence pages
slide decks
architecture review boards

It doesn’t scale.

Teams don’t move faster because the rules exist. Teams move faster because the defaults exist.

Platform engineering language captures this well: paved roads / golden paths reduce cognitive load and make the “right way” the easy way. [1][2] The CNCF Platforms White Paper makes the case for internal platforms as a lever that impacts value streams indirectly - through better flow and developer experience. [3]

This article is a practical blueprint for the thing that actually changes outcomes:

A service template that bakes reliability, security, and operability into day-one defaults.

TL;DR

Build one paved road for APIs:
repo template + CI pipeline + runtime defaults
Include “boring” but critical capabilities:
health probes, resource requests/limits, disruption budgets [4][5][6]
tracing/metrics/logging via OpenTelemetry [7]
timeouts, retries, rate limits
standardized deployment and rollout
Measure success with outcomes (DORA metrics): lead time, deploy frequency, change failure rate, MTTR. [8]
Optimize for day 2 to day 50, not just “hello world.”

What a paved road is (and isn’t)
The API service template: required capabilities
A reference repository structure
Kubernetes defaults that save you later
Observability by default
Security by default
Rollouts and operational controls
How to roll this out without a platform revolt
A production checklist
References

What a paved road is (and isn’t)

A paved road is

a recommended path to production
preconfigured defaults that make safe delivery easy
automation that eliminates repetitive decisions

Microsoft describes this in internal developer platform terms: recommended and supported development paths, incrementally paved through an internal platform. [2]

A paved road is not

a mandate that blocks all other approaches
a committee process
a doc nobody reads

If your paved road becomes a gate, teams will route around it.

The API service template: required capabilities

Here’s what “enterprise production API” should mean out of the box.

Operability

structured logging with correlation IDs
metrics (request rate/latency/errors)
tracing across inbound/outbound calls [7]
runtime config and feature flags

Reliability

timeouts everywhere
bounded retries with backoff
health probes (liveness/readiness/startup) [5]
graceful shutdown
rate limits / concurrency caps

Platform fit

Kubernetes-ready manifests
resource requests/limits [4]
PodDisruptionBudget for availability during maintenance [6]
standardized rollout strategy

Security

auth middleware
input validation
secret injection patterns (no secrets in repo)
least privilege service accounts

Delivery

CI pipeline: lint/test/build/scan
SBOM generation
deploy automation (GitOps or pipeline)

A reference repository structure

.
├── cmd/service/          # main
├── internal/             # business logic
├── pkg/                  # shared libs (optional)
├── api/                  # OpenAPI spec, schemas
├── deploy/
│   ├── k8s/              # manifests (or Helm/Kustomize)
│   └── policy/           # OPA/constraints (optional)
├── docs/
│   ├── index.md
│   └── runbooks/
├── Makefile
└── .github/workflows/    # CI

Key idea: the template is not just code - it is the full production story:

how to run locally
how to deploy
how to observe
how to operate on-call

Kubernetes defaults that save you later

1) Resource requests and limits

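Kubernetes scheduling and stability depend on requests and limits: the scheduler places pods based on requests, while limits are enforced at runtime (CPU is throttled; a container that exceeds its memory limit can be OOM-killed). The official docs explain how pod-level requests/limits are derived from container values. [4]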

Template default:

set conservative requests
set safe limits
provide guidance for right-sizing

2) Probes

Kubernetes supports liveness, readiness, and startup probes. The docs describe how to configure them and why they matter. [5]

Template default:

readinessProbe ensures traffic only goes to ready pods
livenessProbe catches deadlocks / stuck processes
startupProbe prevents early restarts for slow boot services
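
On the service side, the probes above need endpoints to hit. Here is a minimal Go sketch of what the template could ship. The paths `/healthz` and `/readyz` are a common convention assumed here, not something Kubernetes requires, and the `Health` type is hypothetical; the probe definitions in the manifest must point at whatever paths you choose.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// Health tracks whether the service should receive traffic.
// The type and the endpoint paths are illustrative assumptions.
type Health struct {
	ready atomic.Bool
}

// SetReady is called with true once dependencies are warm, and with false
// at the start of shutdown so the pod drains before it exits.
func (h *Health) SetReady(v bool) { h.ready.Store(v) }

// Routes returns a mux with liveness and readiness handlers.
func (h *Health) Routes() *http.ServeMux {
	mux := http.NewServeMux()
	// Liveness: the process is up and can serve a trivial request.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Readiness: 503 until the service has declared itself ready.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if !h.ready.Load() {
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// probe issues an in-memory GET and returns the status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	h := &Health{}
	mux := h.Routes()
	fmt.Println(probe(mux, "/healthz"), probe(mux, "/readyz")) // before the service is ready
	h.SetReady(true)
	fmt.Println(probe(mux, "/healthz"), probe(mux, "/readyz")) // after the service is ready
}
```

Keeping liveness trivial and putting dependency checks behind readiness is a deliberate choice in this sketch: a failing dependency should take the pod out of rotation, not trigger a restart loop.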

3) Disruption budgets

PodDisruptionBudgets limit concurrent disruptions during voluntary maintenance. [6]

Template default:

include a PDB for replicated services
define minAvailable or maxUnavailable

Observability by default

If you do one thing: instrument the template so every service ships with telemetry.

OpenTelemetry provides the framework for standard traces/metrics/logs. [7]

Template defaults:

standard HTTP server instrumentation
propagation of trace context (W3C headers)
request logs include trace IDs
golden dashboard: RPS, p95 latency, error rate, saturation (CPU/memory)

Security by default

Don’t rely on “security guidance documents.” Ship secure defaults instead.

Template defaults:

auth middleware with standardized claims/roles mapping
structured validation for request bodies
outbound allowlists (where feasible)
secret injection via environment/secret store (no plain text)

Your paved road becomes a security accelerator because teams start secure.

Rollouts and operational controls

Default rollout patterns:

canary or progressive delivery when needed
safe rollback
feature flags for risky changes

Default operational controls:

rate limiting
concurrency limits
timeouts and circuit breakers
“maintenance mode” toggle

How to roll this out without a platform revolt

This is the part platform teams often miss.

1) Make it optional - but obviously better

If adopting the template reduces weeks of work to hours, teams will choose it.

2) Provide migration paths

minimal adoption: observability + probes
medium: deploy manifests + CI
full: service template + libraries

3) Measure outcomes, not adoption

Use DORA metrics to show impact: lead time, deploy frequency, change failure rate, time to restore service. [8]

If the paved road doesn’t move these, it’s not paved.

A production checklist

Template

Repo template includes CI, deploy, docs, runbooks.
Observability instrumentation included by default. [7]

Kubernetes

Resource requests/limits included. [4]
Liveness/readiness/startup probes included. [5]
PodDisruptionBudget included for replicated services. [6]

Reliability

Timeouts and bounded retries are standard.
Graceful shutdown is implemented.
Rate limiting/concurrency caps exist.

Security

Auth middleware included.
Secrets handled via secure injection (not repo).

Outcomes

DORA metrics tracked to validate improvement. [8]

References

[1] CNCF - What is platform engineering? (golden paths/paved roads framing): https://www.cncf.io/blog/2025/11/19/what-is-platform-engineering/
[2] Microsoft Learn - What is platform engineering? (paved paths / internal developer platform): https://learn.microsoft.com/en-us/platform-engineering/what-is-platform-engineering
[3] CNCF TAG App Delivery - Platforms White Paper: https://tag-app-delivery.cncf.io/whitepapers/platforms/
[4] Kubernetes - Resource Management for Pods and Containers (requests/limits): https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
[5] Kubernetes - Configure Liveness, Readiness and Startup Probes: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
[6] Kubernetes - Specifying a Disruption Budget for your Application (PDB): https://kubernetes.io/docs/tasks/run-application/configure-pdb/
[7] OpenTelemetry - Documentation (instrumentation and telemetry): https://opentelemetry.io/docs/
[8] DORA - DORA’s software delivery performance metrics: https://dora.dev/guides/dora-metrics/
