Architecture | Roy Gabriel

When Enterprise Defaults Become Enterprise Debt

Sat, 07 Feb 2026 09:00:00 -0500

Note on examples: The scenarios below are anonymized composites. They’re not a critique of any one organization; they’re patterns that repeat across industries. The goal isn’t to “modernize for fun.” It’s to protect speed-to-market and reliability as systems and organizations scale.

Why this matters

Most enterprises don’t lose because they picked the “wrong” framework or cloud provider. They lose because old defaults - once rational - become invisible policy.

The 90s and early 2000s optimized for constraints that were real at the time:

hardware was expensive
automation was immature
environments were scarce
security controls were largely manual
uptime was achieved by cautious change, not by safe change

Those constraints have shifted. But many organizations still run on architectural and governance defaults designed for a different era.

The result is predictable:

innovation slows (lead time grows)
quality degrades (late integration + big-bang changes)
reliability suffers (risk is batched, blast radius expands)
engineers spend more time navigating the system than improving it

If you want a single-sentence summary: old patterns don’t just slow delivery - they also create the conditions for outages.

TL;DR

Retire “analysis as delivery.” Timebox discovery and ship thin vertical slices.
Treat cloud primitives as primitives, not research projects (e.g., object storage is solved).
Default to containers + orchestration for most stateless services; use VMs deliberately, not reflexively. [5]
Replace ticket queues and boards with guardrails + paved roads + policy-as-code. [7][8]
Measure what matters: lead time, deploy frequency, change failure rate, MTTR. [1][2]
Modernization works best as an incremental program, not a rewrite (Strangler Fig pattern). [12]

Pattern 1: Analysis as a substitute for delivery
Pattern 2: Reinventing commodity infrastructure
Pattern 3: VM-first thinking as the default
Pattern 4: Ticket-driven infrastructure
Pattern 5: Change Advisory Board for routine changes
Pattern 6: The shared database empire
Pattern 7: Central integration as a chokepoint
Pattern 8: Perma-POCs and innovation theater
Replace committees with guardrails
Modernize without a rewrite
Verification: how you know it’s working
A practical checklist
References

Pattern 1: Analysis as a substitute for delivery

What it looks like

A team spends months (sometimes a year) doing “analysis” for a capability that won’t be used until it’s built - often with the intention of eliminating all risk up front.

Common examples:

multi-tenant “high availability image storage” designed from scratch
designing bespoke event systems when managed queues exist
writing 40-page architecture documents before the first running slice exists

Why it existed

When provisioning took weeks and environments were scarce, analysis was a rational risk-reducer.

The hidden tax

You push real learning to the end (integration failures happen late).
Decisions get made with imaginary constraints, not measured ones.
Teams optimize for “approval” rather than “outcome.”

The replacement pattern

Timebox discovery and require a running slice early.

A strong default:

1-2 week spike to validate constraints
a thin vertical slice in production (even behind a flag)
iterate based on real telemetry and user feedback

Transition step (low drama)

Create an “RFC-lite” template:

problem statement + constraints
1-2 options with tradeoffs
a plan to measure (latency, cost, reliability)
a thin-slice milestone date

Pattern 2: Reinventing commodity infrastructure

What it looks like

Teams treat widely proven primitives as novel:

object storage
queues
identity
metrics + tracing
load balancing

A classic symptom: “We need to design HA multi-tenant object storage,” as if durable object storage isn’t already a standard building block.

Why it existed

On-prem and early hosting eras forced you to build a lot yourself.

The hidden tax

Reinventing primitives becomes a multi-quarter project.
Reliability becomes your problem (and you will be on call for it).
The business pays for the same capability twice: once in time, and again in incidents.

The replacement pattern

Default to managed or proven primitives unless you have a documented reason not to.

For example, Amazon S3 documents a design target of 99.999999999% (11 nines) object durability, and other major providers publish comparable targets (provider details vary). [11]

Transition step

Maintain a “Reference Implementations” catalog:

“How we do object storage”
“How we do queues”
“How we do auth”
“How we do telemetry”

If the default is documented and supported, teams stop re-litigating fundamentals.

Pattern 3: VM-first thinking as the default

What it looks like

Everything runs on VMs because “that’s what we do,” even when the workload is a stateless API, worker, or event consumer.

Why it existed

VMs were the universal unit of deployment for a long time, and they map cleanly to org boundaries (“this server is mine”).

The hidden tax

drift (snowflake servers)
slow rollouts
inconsistent security posture
wasted compute due to poor bin-packing
limited standardization across services

The replacement pattern

For many enterprise services, containers orchestrated by Kubernetes are a strong default for stateless workloads. Kubernetes itself describes Deployments as a good fit for managing stateless applications where Pods are interchangeable and replaceable. [5]

This doesn’t mean “Kubernetes for everything,” but it does mean:

prefer declarative workloads with health checks and rollout controls
keep VMs for deliberate cases (legacy constraints, special licensing, unique state, or when orchestration adds no value)

Transition step

Start with “Kubernetes-first for new stateless services,” not a migration mandate.

Then build operational guardrails:

resource requests/limits so services behave predictably under load [6]
standardized readiness/liveness probes
standard ingress + auth patterns
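As one way to turn these guardrails into an automated check, here is a minimal sketch that inspects a Deployment manifest (already parsed into a Python dict) for resource requests/limits and probes. The field names follow the Kubernetes manifest schema; the function name `guardrail_violations` and the exact rule set (for example, requiring CPU limits) are assumptions for illustration, not a recommended policy.

```python
from typing import Any


def guardrail_violations(deployment: dict[str, Any]) -> list[str]:
    """Return human-readable guardrail violations for a Deployment manifest."""
    violations: list[str] = []
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        resources = container.get("resources", {})
        # Requests inform scheduling; limits cap usage. This sketch requires both.
        for kind in ("requests", "limits"):
            for resource in ("cpu", "memory"):
                if resource not in resources.get(kind, {}):
                    violations.append(f"{name}: missing resources.{kind}.{resource}")
        # Probes let the orchestrator gate traffic and restart stuck containers.
        for probe in ("readinessProbe", "livenessProbe"):
            if probe not in container:
                violations.append(f"{name}: missing {probe}")
    return violations


# Hypothetical manifest with two gaps: no CPU limit and no liveness probe.
manifest = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [{
        "name": "api",
        "resources": {
            "requests": {"cpu": "100m", "memory": "128Mi"},
            "limits": {"memory": "256Mi"},
        },
        "readinessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
    }]}}},
}

for violation in guardrail_violations(manifest):
    print(violation)
```

A check like this can run in CI against rendered manifests, so the feedback arrives in the pull request rather than in a review meeting.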

Pattern 4: Ticket-driven infrastructure

What it looks like

Need a database? Ticket. Need an environment? Ticket. Need DNS? Ticket. Need a queue? Ticket.

Eventually, the ticketing system becomes the true control plane.

Why it existed

It’s a reasonable response when:

environments are scarce
changes are risky
platform knowledge is specialized

The hidden tax

queues become normalized (“it takes 3 weeks to get a namespace”)
teams route around the platform
reliability doesn’t improve; delivery just slows

The replacement pattern

Self-service via GitOps and platform “paved roads.”

OpenGitOps describes itself as a set of open-source standards and best practices that help organizations adopt a structured, standardized approach to GitOps. [7] The point isn’t a specific tool - it’s the principle: desired state is declarative, versioned, and auditable.
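To make "desired state is declarative" concrete, here is a minimal sketch of the comparison at the heart of a reconcile loop: declared resources on one side, observed resources on the other, and a plan derived from the difference. The function name `plan_changes` and the resource names are hypothetical; real GitOps tooling does considerably more (ordering, health checks, drift alerts).

```python
def plan_changes(desired: dict[str, dict], actual: dict[str, dict]) -> dict[str, list[str]]:
    """Compare declared state with observed state and list what must change."""
    return {
        "create": sorted(name for name in desired if name not in actual),
        "update": sorted(
            name for name in desired
            if name in actual and desired[name] != actual[name]
        ),
        "delete": sorted(name for name in actual if name not in desired),
    }


# Declared in Git (hypothetical resources).
desired = {
    "orders-queue": {"retention_days": 7},
    "billing-namespace": {"team": "billing"},
}
# Observed in the running environment.
actual = {
    "orders-queue": {"retention_days": 3},
    "legacy-queue": {"retention_days": 1},
}

plan = plan_changes(desired, actual)
print(plan)
```

Because the plan is derived from versioned files, the audit trail is the Git history itself: who changed the desired state, when, and with which review.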

Transition step

Pick one high-frequency request and eliminate it:

“create a service with a standard ingress/auth/telemetry”
“provision a queue”
“create a dev environment”

Make the paved road the path of least resistance.

Pattern 5: Change Advisory Board for routine changes

What it looks like

Every change - routine or risky - requires synchronous approval.

Why it existed

When changes were large, rare, and manual, centralized review reduced catastrophic surprises.

The hidden tax

you batch changes (bigger releases are riskier)
emergency changes bypass process (creating inconsistency)
“approval” becomes the goal rather than evidence of safety

DORA’s guidance on streamlining change approval emphasizes making the regular change process fast and reliable enough that it can handle emergencies, and reframes how CAB fits into continuous delivery. [3] Continuous delivery literature makes a similar point: smaller, more frequent changes reduce risk and ease remediation. [4]

The replacement pattern

Move to evidence-based change approval:

automated tests
policy-as-code checks
progressive delivery (canaries, phased rollouts)
real-time telemetry tied to the release
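The list above can be sketched as a small routing function: evidence goes in, an approval route comes out. Everything here is an assumption for illustration, including the `ChangeEvidence` fields, the 1% canary threshold, and the route names; the point is only that routine changes can be decided by recorded evidence while risky ones still reach a human.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvidence:
    tests_passed: bool
    policy_checks_passed: bool
    canary_error_rate: float       # fraction of failed requests seen in the canary
    touches_high_risk_area: bool   # hypothetical flag, e.g. schema migration or auth


def approval_route(evidence: ChangeEvidence, max_canary_error_rate: float = 0.01) -> str:
    """Decide how a change proceeds based on evidence rather than a meeting."""
    if not (evidence.tests_passed and evidence.policy_checks_passed):
        return "blocked"
    if evidence.canary_error_rate > max_canary_error_rate:
        return "rollback"
    if evidence.touches_high_risk_area:
        return "human-review"      # this is where a narrowed CAB still adds value
    return "auto-approve"


routine = ChangeEvidence(True, True, canary_error_rate=0.001, touches_high_risk_area=False)
print(approval_route(routine))
```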

Transition step

Keep CAB, but change its scope:

focus on high-risk changes and cross-team coordination
use automation and metrics for routine changes

Pattern 6: The shared database empire

What it looks like

A central database is shared by many services. Teams coordinate schema changes across multiple apps and releases.

Microservices.io describes the “shared database” pattern explicitly: multiple services access a single database directly. [10]

Why it existed

It’s simple at first:

one place for data
easy joins
one backup plan

The hidden tax

coupling spreads everywhere
every change becomes cross-team work
reliability suffers because one DB problem becomes everyone’s problem
schema evolution becomes political

The replacement pattern

Prefer service-owned data boundaries. Microservices.io’s “database per service” pattern describes keeping a service’s data private and accessible only via its API. [9]

Transition step

You don’t have to “microservices everything.” Start by:

carving out new tables owned by one service
introducing an API boundary
migrating consumers gradually

Pattern 7: Central integration as a chokepoint

What it looks like

All integrations must go through a single shared integration layer/team (classic ESB gravity).

Why it existed

Centralizing integration gave consistency when:

protocols were messy
tooling was expensive
teams lacked automation

The hidden tax

integration lead times explode
teams stop experimenting
one backlog becomes everyone’s bottleneck

The replacement pattern

Standardize:

interfaces (auth, tracing, deployment, contract testing)
platform guardrails

…not every internal implementation detail.

Transition step

Carve out one “self-service integration” paved road:

standard service template
standard auth
standard telemetry
contracts + examples

Pattern 8: Perma-POCs and innovation theater

What it looks like

Prototypes exist forever, never becoming production systems.

Especially common with AI initiatives:

impressive demos
no production constraints
no ownership for operability

Why it existed

POCs are a safe way to explore unknowns.

The hidden tax

teams lose trust (“innovation never ships”)
production teams inherit half-baked work
opportunity cost compounds

The replacement pattern

From day one, require:

an owner
a production path
a thin slice in a real environment
explicit safety requirements (timeouts, budgets, telemetry)

Transition step

Make “POC exit criteria” mandatory:

what metrics prove value?
what is the minimum shippable slice?
what must be true for reliability and security?

Replace committees with guardrails

A recurring theme: humans are expensive control planes.

The modern move is to convert “tribal rules” into:

templates
automation
policy-as-code
paved paths

Microsoft’s platform engineering work describes “paved paths” within an internal developer platform as recommended paths to production that guide developers through requirements without sacrificing velocity. [8]

Guardrails beat gatekeepers because guardrails are:

consistent
fast
auditable
scalable

Modernize without a rewrite

Big-bang rewrites are expensive and risky. Incremental modernization is usually the winning move.

The Strangler Fig pattern is a well-known approach: wrap or route traffic so you can replace parts of a legacy system gradually. [12]

Practical approach:

put a facade in front of the legacy surface
carve off one slice at a time
measure outcomes
keep rollback easy

This isn’t glamorous. It works.
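A minimal sketch of the facade idea, assuming requests can be routed by path prefix: migrated prefixes go to new handlers, everything else falls through to the legacy system, and rollback is a single call. The class name `StranglerFacade` and the handlers are hypothetical; in practice this role is often played by a reverse proxy or API gateway rather than application code.

```python
from typing import Callable

Handler = Callable[[str], str]


class StranglerFacade:
    """Routes migrated path prefixes to new handlers; the rest goes to legacy."""

    def __init__(self, legacy: Handler) -> None:
        self._legacy = legacy
        self._migrated: dict[str, Handler] = {}

    def migrate(self, prefix: str, handler: Handler) -> None:
        self._migrated[prefix] = handler

    def rollback(self, prefix: str) -> None:
        # Rolling back is just removing the route; legacy handles it again.
        self._migrated.pop(prefix, None)

    def handle(self, path: str) -> str:
        # Longest matching prefix wins, so a sub-path can move before its parent.
        for prefix in sorted(self._migrated, key=len, reverse=True):
            if path.startswith(prefix):
                return self._migrated[prefix](path)
        return self._legacy(path)


facade = StranglerFacade(legacy=lambda path: f"legacy:{path}")
facade.migrate("/invoices", lambda path: f"new:{path}")

print(facade.handle("/invoices/42"))
print(facade.handle("/orders/7"))
```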

Verification: how you know it’s working

If you want to avoid “modernization theater,” measure.

DORA’s metrics guidance is a solid baseline: deployment frequency, lead time for changes, change failure rate, and time to restore service (MTTR). [1] The 2024 DORA report continues to focus on the organizational capabilities that drive high performance. [2]

A simple evidence loop:

Pick one value stream (one product or platform slice).
Baseline the four DORA metrics.
Remove one friction point (one pattern).
Re-measure.

If your metrics don’t move, you didn’t remove the real constraint.
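For the baseline step, a rough sketch of computing the four metrics from deployment records might look like this. The `Deployment` record shape, the use of medians, and the sample data are assumptions; DORA's guide defines the metrics, not this particular calculation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None   # set only for failed deployments


def dora_summary(deployments: list[Deployment], window_days: int) -> dict[str, float]:
    """Summarize the four DORA metrics over a window of deployment records."""
    if not deployments:
        raise ValueError("need at least one deployment to compute a baseline")
    lead_hours = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deployments
    ]
    failures = [d for d in deployments if d.failed]
    restore_hours = [
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in failures if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time_hours": median(lead_hours),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore_hours": median(restore_hours) if restore_hours else 0.0,
    }


t0 = datetime(2026, 1, 5, 9, 0)
hour, day = timedelta(hours=1), timedelta(days=1)
sample = [
    Deployment(t0, t0 + 4 * hour),
    Deployment(t0 + day, t0 + day + 8 * hour, failed=True,
               restored_at=t0 + day + 9 * hour),
    Deployment(t0 + 2 * day, t0 + 2 * day + 6 * hour),
    Deployment(t0 + 3 * day, t0 + 3 * day + 2 * hour),
]
summary = dora_summary(sample, window_days=7)
print(summary)
```

Run the same calculation before and after removing a friction point; the comparison matters more than the absolute numbers.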

A practical checklist

If you’re trying to retire “enterprise debt” safely:

Delivery

Timebox analysis; require a running slice early.
Prefer small changes and frequent releases; avoid batching.

Platform

Provide a paved road for common workflows (service template, auth, telemetry). [8]
Remove ticket queues for repeatable requests (self-service + GitOps). [7]

Reliability

Standardize timeouts, retries, budgets, and resource requests/limits. [6]
Use progressive delivery where risk is high.

Architecture

Reduce shared DB coupling; establish service-owned boundaries. [9][10]
Modernize incrementally (Strangler Fig), not via big-bang rewrites. [12]

Governance

Replace routine approvals with evidence: tests + policy-as-code + telemetry. [3][4]

References

[1] DORA - “DORA’s software delivery performance metrics (guide)”. https://dora.dev/guides/dora-metrics/
[2] DORA - “Accelerate State of DevOps Report 2024”. https://dora.dev/research/2024/dora-report/
[3] DORA - “Streamlining change approval (capability)”. https://dora.dev/capabilities/streamlining-change-approval/
[4] ContinuousDelivery.com - “Continuous Delivery and ITIL: Change Management”. https://continuousdelivery.com/2010/11/continuous-delivery-and-itil-change-management/
[5] Kubernetes docs - “Workloads (Deployments are a good fit for stateless workloads)”. https://kubernetes.io/docs/concepts/workloads/
[6] Kubernetes docs - “Resource Management for Pods and Containers (requests/limits)”. https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
[7] OpenGitOps - “What is OpenGitOps?” and project background. https://opengitops.dev/ and https://opengitops.dev/about/
[8] Microsoft Engineering Blog - “Building paved paths: the journey to platform engineering”. https://devblogs.microsoft.com/engineering-at-microsoft/building-paved-paths-the-journey-to-platform-engineering/
[9] Microservices.io - “Database per service” pattern. https://microservices.io/patterns/data/database-per-service
[10] Microservices.io - “Shared database” pattern. https://microservices.io/patterns/data/shared-database.html
[11] AWS documentation - “Data protection in Amazon S3 (durability/availability design goals)”. https://docs.aws.amazon.com/AmazonS3/latest/userguide/DataDurability.html
[12] Martin Fowler - “Strangler Fig Application” (legacy modernization pattern). https://martinfowler.com/bliki/StranglerFigApplication.html

Stop Shipping Slide Decks

Sat, 31 Jan 2026 11:15:00 -0500

Position: This is not “documentation bad.” This is “documentation is a tool.” If it increases lead time, hides truth, or replaces learning, it’s not helping.

Why this matters

In software, the real “source of truth” is:

running systems
code and configuration
production telemetry
incident history

Documentation should reduce uncertainty and speed up decisions. But two artifacts routinely do the opposite in large organizations:

the 40-page slide deck
the Word doc living somewhere in SharePoint that nobody can find

These artifacts often become deliverables - a substitute for building. They make it possible to spend months “progressing” without ever encountering reality.

And here’s the part most orgs miss:

If you’re going to fail, you want to fail quickly and cheaply, not slowly and expensively. [4]

That doesn’t mean reckless shipping. It means running a tight learning loop and letting reality correct you early - before you’ve sunk quarters of time into the wrong solution.

TL;DR

Decks are great for storytelling. They are bad as an engineering system of record.
“SharePoint architecture docs” become a document cemetery: hard to find, hard to diff, and easy to ignore.
The Agile Manifesto explicitly values working software over comprehensive documentation. [1] And one Agile principle states that working software is the primary measure of progress. [2]
Replace decks/docs-as-deliverables with:
RFC-lite (1-2 pages) + a running thin slice
ADRs (Architecture Decision Records) to capture decisions + tradeoffs [5][6]
Docs-as-code (Markdown in the repo, reviewed like code)
diagrams that are versioned and easy to update
Measure improvement with system outcomes (lead time, deploy frequency, change failure rate, MTTR). [3]

Pattern 1: Deck-driven development
Pattern 2: SharePoint document cemeteries
Pattern 3: Architecture as narrative, not decisions
Pattern 4: “Design phase” gating
Pattern 5: Documentation that never gets pruned
What to do instead: a documentation system that ships
Verification: how you know it’s working
A practical checklist
References

Pattern 1: Deck-driven development

What it looks like

A 40-page deck is created to describe a system that doesn’t exist yet.
The deck gets reviewed by multiple groups.
Approval is treated as progress.
When implementation starts, the world has changed - or key constraints were missed.

Why it exists

Decks are socially useful:

they compress complexity into a narrative
they help leaders “see” a plan
they make uncertainty feel controlled

The hidden tax

Decks are a poor engineering artifact because they’re:

low fidelity: they rarely contain executable truth
hard to maintain: updates are manual and usually lag reality
hard to diff: you can’t easily review what changed and why
easy to perform: a deck can look complete while the design is still untested
not tied to code: no direct path from “decision” -> “implementation” -> “verification”

The worst outcome isn’t that the deck is wrong. It’s that the deck delays the point where you discover what’s wrong.

The replacement pattern

Use decks for storytelling after you have reality. Use engineering artifacts to discover reality.

A strong default:

RFC-lite (1-2 pages)
a runnable thin slice
measurable verification (latency, cost envelope, failure mode)

This aligns with Agile’s emphasis on working software as a real measure of progress. [2]

Transition step (low drama)

Replace “deck required for approval” with “evidence required for approval”:

link to the RFC
link to a running demo / branch / sandbox
explicit constraints + tradeoffs
an exit-criteria checklist for the slice

Pattern 2: SharePoint document cemeteries

What it looks like

Architecture docs exist as Word/PDF files in SharePoint.
Multiple versions exist (“Final_v7_REAL_FINAL.docx”).
Search works poorly unless you already know what to search for.
Nobody updates the doc because it’s painful and risky (“what if I change the blessed doc?”).

Why it exists

It’s an enterprise default:

SharePoint is “official”
Word docs feel formal
it’s familiar to non-engineering stakeholders

The hidden tax

SharePoint docs typically fail at the things engineering needs most:

discoverability (people don’t know where to look)
ownership (no clear maintainer)
reviewability (diffs and PR discussion are weak)
linking to reality (code, configs, dashboards, runbooks)
keeping current (documentation drift becomes the norm)

So teams stop trusting docs and rely on tribal knowledge - until they page someone at 2 a.m.

The replacement pattern

Treat documentation as part of the codebase:

Markdown in the repo
reviewed via PR like code
versioned with implementation
linked to:
APIs (OpenAPI specs)
dashboards
runbooks
incident writeups
ADRs

Google’s documentation best practices make the point directly: a small set of fresh, accurate docs is better than a large pile in disrepair. [7]

Transition step

You don’t have to “migrate all docs.”

Start with a triage:

Identify the top 10 documents people actually need.
Recreate them as Markdown in a docs/ folder with an index.
Leave the rest as archived references, not living truth.

Pattern 3: Architecture as narrative, not decisions

What it looks like

The doc describes a target architecture but doesn’t answer:

why this approach?
what alternatives were considered?
what tradeoffs were accepted?
what constraints matter most?
what did we decide not to do?

Why it exists

Narratives are easier than decision logs. It’s simpler to write “the system will…” than to record the messy reality of tradeoffs.

The hidden tax

When decisions aren’t recorded, teams re-litigate them repeatedly. The same arguments come back every quarter - often because new people joined and the reasoning isn’t captured.

The replacement pattern: ADRs

Use Architecture Decision Records (ADRs): short, structured notes that capture an important decision with its context and consequences. [5] The practice is commonly attributed to Michael Nygard’s 2011 write-up. [6]

ADRs are the opposite of a 40-slide deck:

small
specific
diffable
linkable to code changes

Transition step

Start with one ADR per “architecturally significant decision”:

database choice
messaging pattern
tenancy model
auth model
deployment model
data boundary decisions

Pattern 4: “Design phase” gating

What it looks like

“We can’t start implementation until the analysis is complete.”
The analysis expands to include every possible future case.
The design grows more “complete” and less true.

Why it exists

Enterprises are understandably afraid of failure.

The hidden tax

This approach doesn’t eliminate failure. It defers it - making it more expensive.

Lean Startup describes progress as validated learning and emphasizes moving quickly through a build-measure-learn loop. [4] The point isn’t startups. The point is learning fast when you’re uncertain.

The replacement pattern

Timebox design, then validate with a thin slice:

write the RFC-lite doc
implement the smallest realistic end-to-end path
measure the constraints
then expand

Transition step

Define “analysis exit criteria”:

measurable constraints validated (not theorized)
spike code exists
a plan for incremental rollout exists

Pattern 5: Documentation that never gets pruned

What it looks like

Docs accumulate but aren’t maintained:

outdated architecture diagrams
old runbooks
stale onboarding guides
dead links

Why it exists

Pruning isn’t rewarded. Writing new docs feels productive; deleting old docs feels risky.

The hidden tax

Stale docs are worse than no docs:

they mislead
they increase cognitive load
they create false confidence

The replacement pattern

Adopt “minimum viable documentation” and prune regularly. [7]

The rule I like:

If a doc isn’t maintained, label it ARCHIVED and explain why.
If a doc is required, tie it to ownership and change workflow.

Transition step

Make docs part of PR hygiene:

if the change affects behavior, docs update ships with it
run link checks in CI
keep an index page updated
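A link check does not need heavy tooling to start. The following is a minimal sketch that scans Markdown files under a docs folder and reports relative links whose targets do not exist. The function name `broken_relative_links` and the simplified link regex are assumptions; external URLs and in-page anchors are deliberately skipped.

```python
import re
from pathlib import Path

# Simplified pattern for inline Markdown links: [text](target)
LINK_PATTERN = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def broken_relative_links(docs_root: Path) -> list[tuple[str, str]]:
    """Return (file, target) pairs for relative links that point at nothing."""
    broken: list[tuple[str, str]] = []
    for md_file in sorted(docs_root.rglob("*.md")):
        text = md_file.read_text(encoding="utf-8")
        for target in LINK_PATTERN.findall(text):
            if target.startswith(("http://", "https://", "mailto:", "#")):
                continue  # external links and in-page anchors are out of scope here
            path_part = target.split("#", 1)[0]
            if not (md_file.parent / path_part).exists():
                broken.append((md_file.relative_to(docs_root).as_posix(), target))
    return broken


if __name__ == "__main__":
    problems = broken_relative_links(Path("docs"))
    for source, target in problems:
        print(f"{source}: broken link -> {target}")
    raise SystemExit(1 if problems else 0)
```

Exiting non-zero on any broken link is what lets a CI job fail the pull request that introduced it.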

What to do instead: a documentation system that ships

Here’s a simple “docs system” that works in practice.

A repo structure that scales

/README.md                      # entry point: what this is + how to run it
/docs/
  index.md                      # "start here" documentation map
  rfc/
    0001-tenancy-model.md
    0002-storage-approach.md
  adr/
    0001-use-postgres.md
    0002-adopt-opentelemetry.md
  architecture/
    context.md                  # C4-ish: context + boundaries
    containers.md               # top-level services
    deployment.md               # runtime & environments
  runbooks/
    oncall.md
    incident-response.md
  api/
    openapi.yaml

Replace 40 slides with two artifacts

RFC-lite (1-2 pages): the “what” and “why”
Thin slice demo: the reality check

RFC-lite template (copy/paste)

# RFC: <title>

## Problem
What are we trying to solve? Who is affected?

## Constraints
Latency, cost, compliance, tenancy, uptime, environments.

## Proposal
What are we building? What does "done" mean?

## Alternatives considered
Option A / B / C with short tradeoffs.

## Risks and mitigations
What could go wrong? How will we contain blast radius?

## Verification
How will we measure success in production?

ADR template (copy/paste)

# ADR-XXXX: <decision>

## Status
Proposed | Accepted | Deprecated

## Context
What drove this decision? What constraints matter?

## Decision
What did we decide?

## Consequences
What do we gain? What do we lose? What changes later?

Verification: how you know it’s working

If you replace decks and doc cemeteries with real engineering artifacts, you should see:

Delivery metrics improve

Track the same system-level outcomes DORA promotes: lead time, deploy frequency, change failure rate, and time to restore service. [3]

Fewer handoffs and fewer “alignment meetings”

If teams can self-serve context from living docs, coordination cost drops.

Faster “first reality”

A simple heuristic:

How long from idea -> first runnable thin slice?

If that number is months, the system is optimized for analysis, not learning.

Docs stay alive

docs updated alongside code
fewer stale “final_v7” files
fewer tribal-knowledge escalations

A practical checklist

If you want to kill deck-driven delivery without starting a culture war:

Stop treating decks as deliverables

Architecture reviews require an RFC + a runnable slice.
Decks are optional; evidence is not.

Fix document discoverability

One docs/index.md that links to the docs that matter.
Make the repo the source of truth for technical docs.

Capture decisions, not fantasies

Add ADRs for major decisions and link them to PRs. [5][6]

Timebox analysis

Set analysis exit criteria.
Optimize for early learning and quick failure when uncertainty is high. [4]

Keep docs small and alive

Prune regularly; archive what’s stale.
Run link checks in CI.
Treat docs like bonsai: maintained and trimmed, not accumulated. [7]

References

[1] Manifesto for Agile Software Development (values; “Working software over comprehensive documentation”). https://agilemanifesto.org/
[2] Principles behind the Agile Manifesto (“Working software is the primary measure of progress”). https://agilemanifesto.org/principles.html
[3] DORA - “DORA’s software delivery performance metrics (guide)”. https://dora.dev/guides/dora-metrics/
[4] Lean Startup principles (Build-Measure-Learn; learning quickly; failing fast/cheaply as a concept). https://theleanstartup.com/principles
[5] ADR - Architectural Decision Records (what ADRs are). https://adr.github.io/
[6] Michael Nygard - “Documenting Architecture Decisions” (2011; ADR practice origin/popularization). https://www.cognitect.com/blog/2011/11/15/documenting-architecture-decisions
[7] Google Documentation Guide - Best practices (“Minimum Viable Documentation”; keep docs short, fresh, and pruned). https://google.github.io/styleguide/docguide/best_practices.html

Durable Agents with Temporal: Retries, Idempotency, and Long-Running State

Sat, 06 Dec 2025 12:00:00 -0500

Why this matters

Agents are often framed as “reason + tools.”

In production, the actual problem is execution:

calls fail
networks flake
credentials expire
humans need to approve steps
tasks take hours/days
systems restart
you need a forensic trail of what happened

If your agent runtime is “one process with a loop,” you will eventually lose state and do the wrong side effect twice.

This is why workflow engines exist.

Temporal’s model - durable workflows with deterministic execution and event history - maps incredibly well to tool-using agents. Temporal explicitly requires workflow code to be deterministic and provides APIs for versioning long-running workflows. [1][2]

This article is a production pattern: use Temporal to make agents durable.

TL;DR

Represent an agent run as a Temporal Workflow.
Make tool calls Activities (retryable, timeout-bounded).
Put side-effecting tools behind:
idempotency keys
preview -> apply
durable “exactly-once” semantics (from the workflow’s perspective)
Use Temporal’s retry policies for Activities and explicit failure handling. [3]
Use event history and replay for forensics (Temporal events are first-class). [4]
Use workflow versioning for safe evolution of long-running agents. [2]

Why agents need durable execution
Mapping an agent to Temporal
Determinism and why it matters
Retries, timeouts, and idempotency
Human-in-the-loop as a first-class step
Replay, audit, and debugging
Versioning: evolving agents safely
A production checklist
References

Why agents need durable execution

A few failure modes you’ll recognize:

Partial side effects

agent creates a ticket
process dies before storing the ticket ID
agent retries and creates a duplicate

Long-running waits

“wait for PR approvals”
“wait for a CI pipeline”
“wait for a meeting to complete”

If your agent can’t wait durably, it becomes a polling daemon.

Human approval

Some steps should not be automated:

“apply to prod”
“send email”
“delete resources”

You need durable pause/resume with clean audit.

Mapping an agent to Temporal

Workflow = agent run

One agent run becomes a single Temporal Workflow Execution. Temporal workflows are designed for long-running, durable coordination. [5]

Inside the workflow you model steps:

interpret goal
choose tools
call tools
react to results
request approvals
finalize output

Activities = tool calls and external IO

All external calls should be Activities:

MCP tool calls
HTTP calls
DB writes
notifications

Why? Activities are where retries and timeouts belong. Temporal defines retry policies as configuration for how and when to retry failures. [3]

Signals = external events

Use signals for:

human approvals
“cancel”
updated user intent
out-of-band events (“incident resolved”)

Queries = introspection

Expose workflow state:

current step
last tool call
pending approvals
budget remaining

Determinism and why it matters

Temporal requires workflow code to be deterministic. [1] Determinism is what allows Temporal to replay history and rebuild state after worker crashes.

Practical consequence:

Don’t do IO in workflow code.
Don’t read the current time directly in workflow code (use Temporal APIs).
Don’t call random generators without deterministic control.
Keep workflow logic as “orchestration,” not execution.

If you violate determinism, you can hit non-determinism errors on replay: the code asks for a step that the recorded history does not contain. Temporal’s docs and community discussions emphasize this constraint and the need for careful changes. [1][2]
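To show why determinism enables recovery, here is a toy replay model in plain Python. It is not the Temporal SDK: `ReplayContext`, `execute_activity`, and `NonDeterminismError` are hypothetical names for this sketch. Recorded results are returned from history instead of re-running the side effect, and code that asks for a different step than history recorded fails loudly.

```python
from typing import Any, Callable, Optional


class NonDeterminismError(Exception):
    """Raised when replayed code asks for a different step than history recorded."""


class ReplayContext:
    def __init__(self, history: Optional[list[tuple[str, Any]]] = None) -> None:
        self.history = list(history or [])   # recorded (activity name, result) pairs
        self._cursor = 0
        self.executed: list[str] = []        # activities actually run this time

    def execute_activity(self, name: str, fn: Callable[[], Any]) -> Any:
        if self._cursor < len(self.history):
            # Replay: reuse the recorded result, do not repeat the side effect.
            recorded_name, result = self.history[self._cursor]
            if recorded_name != name:
                raise NonDeterminismError(
                    f"history has {recorded_name!r}, code asked for {name!r}"
                )
        else:
            # First execution of this step: run it and record the outcome.
            result = fn()
            self.history.append((name, result))
            self.executed.append(name)
        self._cursor += 1
        return result


def agent_run(ctx: ReplayContext) -> str:
    ticket = ctx.execute_activity("create_ticket", lambda: "TICKET-1")
    return ctx.execute_activity("notify", lambda: f"notified about {ticket}")


first = ReplayContext()
agent_run(first)

# Simulate a crash after the first step: only "create_ticket" was recorded.
recovered = ReplayContext(history=first.history[:1])
result = agent_run(recovered)
print(result, recovered.executed)
```

In the recovered run only `notify` actually executes; the ticket is not created twice because its result comes from history.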

Retries, timeouts, and idempotency

Retry policies (Activities)

Temporal retry policies control backoff and retry behavior for activity failures. [3]

Use them intentionally:

retries for transient failures (rate limits, timeouts)
limited retries for “probably broken” failures
exponential backoff with jitter (avoid thundering herd)
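As a worked example of what a backoff policy produces, the sketch below computes the wait before each retry from four settings whose names mirror the fields Temporal's retry policy documentation describes (initial interval, backoff coefficient, maximum interval, maximum attempts). The functions themselves are plain Python written for this article, and the `jittered` helper is an assumption about how jitter could be layered on, not a description of Temporal's behavior.

```python
import random


def retry_delays(
    initial_interval: float,
    backoff_coefficient: float,
    maximum_interval: float,
    maximum_attempts: int,
) -> list[float]:
    """Seconds to wait before each retry; the first attempt has no delay."""
    delays = []
    for retry in range(maximum_attempts - 1):
        delay = initial_interval * backoff_coefficient ** retry
        delays.append(min(delay, maximum_interval))   # cap the exponential growth
    return delays


def jittered(delay: float, rng: random.Random, spread: float = 0.2) -> float:
    """Scale a delay by a random factor in [1 - spread, 1 + spread]."""
    return delay * rng.uniform(1 - spread, 1 + spread)


schedule = retry_delays(
    initial_interval=1.0,
    backoff_coefficient=2.0,
    maximum_interval=10.0,
    maximum_attempts=6,
)
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 10.0]
```

Six attempts means five retries, and the cap keeps the last wait at 10 seconds instead of 16.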

Timeouts are not optional

Set explicit timeouts:

ScheduleToStart
StartToClose
ScheduleToClose

Without a bounding timeout or a maximum attempt count, a failing Activity can keep retrying indefinitely.

Idempotency keys for side effects

Your workflow can be retried/replayed. Your Activity can be retried. Upstream systems can time out after performing the operation.

For side-effecting tools:

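generate an idempotency key in the workflow, derived deterministically (for example, from the workflow ID and step number) so that replays produce the same key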
pass it into the tool Activity
store “operation result” in workflow state

When the Activity retries, it reuses the key so the upstream system deduplicates.

This is the difference between “retries” and “duplicates.”
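A small simulation makes the mechanism visible. `TicketService` is a hypothetical stand-in for an upstream API that deduplicates on an idempotency key, and `idempotency_key` shows one assumed way to derive a stable key from a workflow ID and step number. The first call succeeds upstream but its response is lost; the retry reuses the key and gets the same ticket back.

```python
class TicketService:
    """Stand-in for an upstream API that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self._by_key: dict[str, str] = {}
        self.created = 0

    def create_ticket(self, idempotency_key: str, title: str) -> str:
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]   # same key: return the first result
        self.created += 1
        ticket_id = f"TICKET-{self.created}"
        self._by_key[idempotency_key] = ticket_id
        return ticket_id


def idempotency_key(workflow_id: str, step: int) -> str:
    # Derived from stable inputs, so retries and replays produce the same key.
    return f"{workflow_id}:step-{step}"


def call_with_lost_response(service: TicketService, key: str, title: str) -> str:
    service.create_ticket(key, title)          # upstream performs the operation...
    raise TimeoutError("response lost")        # ...but the caller never sees the result


service = TicketService()
key = idempotency_key("agent-run-42", step=3)
try:
    ticket_id = call_with_lost_response(service, key, "Rotate credentials")
except TimeoutError:
    ticket_id = service.create_ticket(key, "Rotate credentials")   # retry, same key

print(ticket_id, service.created)
```

This only works if the upstream system honors the key; when it does not, the Activity needs its own lookup-before-create step.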

Human-in-the-loop as a first-class step

For dangerous operations:

pause
ask for approval with the plan summary
resume when approved

Temporal workflows can wait for signals without holding threads like a traditional process would.

This is one of the cleanest ways to build “preview -> approve -> apply” without building a bunch of custom state machinery.
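The control flow can be sketched with plain `asyncio`, with one important caveat: this in-memory version is not durable and loses its state if the process restarts, which is exactly the gap a workflow engine fills. `ApprovalGate` and `preview_approve_apply` are hypothetical names; the gate's `signal` method stands in for an external approval signal.

```python
import asyncio


class ApprovalGate:
    """In-memory stand-in for a durable approval signal."""

    def __init__(self) -> None:
        self._decided = asyncio.Event()
        self.approved = False

    def signal(self, approved: bool) -> None:
        self.approved = approved
        self._decided.set()

    async def wait(self, timeout_seconds: float) -> bool:
        try:
            await asyncio.wait_for(self._decided.wait(), timeout_seconds)
        except asyncio.TimeoutError:
            return False   # no decision in time is treated as "not approved"
        return self.approved


async def preview_approve_apply(
    gate: ApprovalGate, plan: str, timeout_seconds: float = 1.0
) -> str:
    preview = f"PLAN: {plan}"                  # what the approver would be shown
    if not await gate.wait(timeout_seconds):
        return f"skipped ({preview})"
    return f"applied ({preview})"


async def demo() -> str:
    gate = ApprovalGate()
    run = asyncio.create_task(preview_approve_apply(gate, "delete 3 unused buckets"))
    await asyncio.sleep(0)                     # let the run reach the wait
    gate.signal(approved=True)                 # the human decision arrives
    return await run


outcome = asyncio.run(demo())
print(outcome)
```

Treating a timeout as "not approved" is a design choice for this sketch; dangerous operations should fail closed.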

Replay, audit, and debugging

Temporal events are recorded as part of the workflow’s event history. [4]

This yields production superpowers:

reconstruct exactly what happened
understand why a step was taken
replay a run to test a bug fix
implement “reset” patterns (carefully)

For agents, this is the difference between:

“the model did something weird” and
“step 7 called tool X with args Y after tool Z returned response R”

Versioning: evolving agents safely

Agent logic will change. Prompts will change. Tool contracts will change.

If you have long-running agents, you need a strategy that doesn’t break in-flight executions.

Temporal provides workflow versioning mechanisms because determinism means you can’t simply change workflow logic without thought. [2]

Production approach:

keep existing executions on old code paths
route new executions to new paths
migrate intentionally

This prevents “deploy broke every running workflow.”
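
The idea can be sketched without any SDK. This is not Temporal's versioning API, just the routing principle; `Execution`, `CODE_PATHS`, and the step names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Execution:
    run_id: str
    pinned_version: int   # recorded once when the run starts, never changed afterwards

def plan_steps_v1(task):
    return ["fetch", "summarize"]

def plan_steps_v2(task):
    return ["fetch", "validate", "summarize"]

CODE_PATHS = {1: plan_steps_v1, 2: plan_steps_v2}
CURRENT_VERSION = 2

def start_execution(run_id):
    # New executions are routed to the newest code path.
    return Execution(run_id, CURRENT_VERSION)

def next_steps(execution, task):
    # In-flight executions keep the code path they started with.
    return CODE_PATHS[execution.pinned_version](task)

old_run = Execution("run-old", pinned_version=1)
new_run = start_execution("run-new")
```

Old code paths are deleted only after the executions pinned to them have finished or been migrated on purpose.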

A production checklist

Architecture

Agent runs modeled as workflows; tool calls as activities.
External events modeled as signals; state exposed via queries.

Determinism

No IO in workflow code (only orchestration).
Workflow changes use versioning strategy. [2]

Reliability

Retry policies defined for Activities. [3]
Timeouts defined and bounded.
Idempotency keys used for side-effecting actions.

Governance

Human approval gates exist for dangerous operations.
Audit trails include plan summaries and results.

Operability

Event history used for debugging and incident analysis. [4]

References

[1] Temporal - Workflow Definition (determinism requirement): https://docs.temporal.io/workflow-definition
[2] Temporal Go SDK - Versioning (evolving deterministic workflows safely): https://docs.temporal.io/develop/go/versioning
[3] Temporal - Retry Policies (how and when retries happen): https://docs.temporal.io/encyclopedia/retry-policies
[4] Temporal - Events reference (event history): https://docs.temporal.io/references/events
[5] Temporal - Workflows overview: https://docs.temporal.io/workflows

Tool Discovery at Scale: Solving the Million Tool Problem

Sat, 15 Nov 2025 12:00:00 -0500

Why this matters

Tool-using agents are powerful because they can do real work: read systems, change systems, orchestrate workflows.

The trap is what I call the Million Tool Problem:

The moment you have “enough tools,” tool selection becomes harder than tool execution.

At small scale, you can stuff tool schemas into the prompt and hope the model chooses correctly. At scale, that approach breaks:

token budgets explode
accuracy drops (models confuse similar tools)
latency rises (bigger prompts, more reasoning)
safety degrades (wrong tool, wrong args, wrong side effects)

This isn’t hypothetical. Tool-use research exists because selection is hard. Benchmarks like ToolBench and AgentBench exist specifically to evaluate this capability in interactive settings. [3][6]

This post is a production-first design for tool discovery that stays:

fast (low latency, bounded prompt size)
safe (tool contracts and policy gates)
debuggable (you can explain why a tool was chosen)
maintainable (tool catalogs evolve constantly)

TL;DR

Tool discovery is an IR problem + a policy problem, not a prompt trick.
Use a 3-stage selector:

coarse filter (tags / domain / allowlist)
retrieval (BM25 + embeddings)
rerank (LLM or learned ranker)

Treat tool descriptions as a product:
consistent naming
sharp “when to use” / “when not to use”
examples of correct arguments
Add tool quality scoring (latency, error rate, drift, safety incidents).
Build a tight evaluation harness (ToolBench/StableToolBench ideas apply). [3][4]

Why “include all tools” fails
The 3-stage tool selector
Tool metadata that makes models smarter
Ranking: BM25 + embeddings + rerank
Safety: allowlists, “danger gates,” and budgets
Quality scoring and tool quarantine
Debuggability: explainable tool selection
A minimal reference architecture
A production checklist
References

Why “include all tools” fails

Token and latency pressure

Even if your tool schemas are “small,” they add up. Once you cross a few dozen tools, you spend more tokens describing tools than describing the task.

Confusability

Tools with similar names or overlapping domains cause selection errors:

search_events vs list_events vs get_event
create_task vs create_issue vs create_ticket

The long tail problem

Most catalogs have a long tail:

10 tools get used daily
100 tools get used weekly
1,000 tools are niche, but critical when needed

This is exactly the kind of situation information retrieval was invented for.

The 3-stage tool selector

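Think like a search engine: a mandatory policy filter first (Stage 0), then three ranking stages that each shrink the candidate set.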

Stage 0: Policy filter (mandatory)

Before ranking, enforce policy:

which tools is this client allowed to call?
which tools are enabled for this tenant/environment?
which tools are safe for this context (read-only mode, incident mode, etc.)?

MCP makes tool discovery explicit: clients list a server’s tools and their input schemas (the tools/list request). That’s an interface you can mediate with policy. [1]

Stage 1: Coarse routing (cheap)

Route into the right “tool neighborhood” using:

tags (kubernetes, calendar, email)
domains (“devops”, “productivity”, “security”)
environment (“prod” vs “dev”)

Goal: reduce the candidate set from 10,000 -> 300.

Stage 2: Retrieval (BM25 + embeddings)

Run a hybrid search over:

tool name
tool description
parameter names
example calls
“when not to use” hints

Hybrid search is pragmatic:

lexical retrieval (BM25-style) is great for exact matches and acronyms [9]
embeddings are great for semantic similarity [7]

Goal: 300 -> 30.
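
One common way to merge a lexical ranking and an embedding ranking is reciprocal rank fusion (RRF). The post does not prescribe a fusion method, so treat this as one option; the tool names are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists; each list contributes 1 / (k + rank) per item."""
    scores = {}
    for ranking in rankings:
        for rank, tool in enumerate(ranking, start=1):
            scores[tool] = scores.get(tool, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by name for stable output.
    return sorted(scores, key=lambda tool: (-scores[tool], tool))

lexical = ["k8s_get_pdb", "k8s_list_pods", "s3_list_objects"]      # e.g. BM25 order
semantic = ["k8s_rollout_status", "k8s_get_pdb", "k8s_list_pods"]  # e.g. embedding order
fused = reciprocal_rank_fusion([lexical, semantic])
```

RRF uses only ranks, so the two retrievers do not need comparable score scales.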

Stage 3: Rerank (expensive, accurate)

Rerank the top-K tools using:

an LLM judge (cheap if K is small)
or a learned ranker
or deterministic rules + a smaller LLM tie-breaker

Goal: 30 -> 5.

Then the agent sees a small, high-quality tool set.
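
The stages above can be sketched as a small pipeline. Everything here is a toy under stated assumptions: the `Tool` shape is hypothetical, retrieval is plain word overlap standing in for BM25 plus embeddings, and rerank is a simple "reads before writes" rule standing in for an LLM or learned ranker.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tool:
    name: str
    tags: frozenset
    description: str
    writes: bool = False

def policy_filter(tools, allowed_names, read_only):
    # Stage 0: drop anything the caller may not see, before any ranking happens.
    return [t for t in tools if t.name in allowed_names and not (read_only and t.writes)]

def coarse_route(tools, wanted_tags):
    # Stage 1: keep only tools in the right "neighborhood".
    return [t for t in tools if t.tags & wanted_tags]

def retrieve(tools, query, limit):
    # Stage 2 (toy): word overlap between the query and the description.
    words = set(query.lower().split())
    scored = [(len(words & set(t.description.lower().split())), t) for t in tools]
    scored = [(score, t) for score, t in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1].name))
    return [t for _, t in scored[:limit]]

def rerank(tools, limit):
    # Stage 3 (toy): stable sort that puts read tools ahead of write tools.
    return sorted(tools, key=lambda t: t.writes)[:limit]

def select_tools(catalog, query, allowed_names, wanted_tags, read_only=False):
    candidates = policy_filter(catalog, allowed_names, read_only)
    candidates = coarse_route(candidates, wanted_tags)
    candidates = retrieve(candidates, query, limit=30)
    return rerank(candidates, limit=5)

catalog = [
    Tool("list_events", frozenset({"calendar"}), "list calendar events in a date range"),
    Tool("delete_event", frozenset({"calendar"}), "delete a calendar event", writes=True),
    Tool("list_pods", frozenset({"kubernetes"}), "list pods in a namespace"),
    Tool("send_email", frozenset({"email"}), "send an email message", writes=True),
]
allowed = {"list_events", "delete_event", "list_pods"}
picked = select_tools(catalog, "list calendar events next week", allowed, {"calendar"})
```

Because policy runs first, a tool the caller is not allowed to use never reaches a ranking stage, let alone the prompt.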

Tool metadata that makes models smarter

If you want better tool selection, stop treating tool schemas as “just types.” Add metadata that improves discrimination.

Tool card fields (recommended)

Name: stable, verb-first
Purpose: one sentence
When to use: 2-4 bullets
When NOT to use: 2-4 bullets (this is underrated)
Side effects: none / read-only / creates / updates / deletes
Required arguments: and why they’re required
Examples: 2-3 example invocations with realistic args
Error modes: rate limit, auth, not found, validation

This reduces tool confusion because it gives the model features that tell similar tools apart.
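
A tool card maps naturally onto a small record type. This is a sketch: the class, the `search_text` helper, and the example tools (`delete_event`, `update_event`) are hypothetical, not part of any tool protocol.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCard:
    name: str
    purpose: str
    when_to_use: list
    when_not_to_use: list
    side_effects: str          # "none", "read-only", "creates", "updates", or "deletes"
    required_args: dict        # argument name -> why it is required
    examples: list = field(default_factory=list)
    error_modes: list = field(default_factory=list)

    def search_text(self):
        """Flatten the card into one string for lexical or embedding indexing."""
        parts = [self.name, self.purpose, *self.when_to_use,
                 *("NOT: " + hint for hint in self.when_not_to_use),
                 *self.required_args, *self.examples]
        return "\n".join(parts)

card = ToolCard(
    name="delete_event",
    purpose="Delete one calendar event by ID.",
    when_to_use=["The user names a specific event to remove"],
    when_not_to_use=["The user wants to reschedule (use update_event)"],
    side_effects="deletes",
    required_args={"event_id": "identifies exactly one event"},
    examples=['delete_event(event_id="evt_123")'],
    error_modes=["not found", "auth"],
)
text = card.search_text()
```

Indexing the "when NOT to use" hints alongside the description gives retrieval and rerank something to separate near-duplicate tools with.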

Ranking: BM25 + embeddings + rerank

Lexical retrieval (BM25)

BM25 and probabilistic retrieval approaches are foundational in search. [9]

Practical benefit: it handles queries like:

“S3”
“JWT”
“PodDisruptionBudget”
“Cron”

…where embeddings can be inconsistent.
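
For intuition, here is a compact Okapi BM25 scorer over tool descriptions. It is a teaching sketch with the usual default parameters (`k1=1.5`, `b=0.75`); the tool strings are invented, and a production system would use a search library instead.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(doc) for doc in docs]
    avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            f = counts[term]
            if f == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores

tools = [
    "s3_list_objects: list objects in an S3 bucket",
    "calendar_find_conflicts: find overlapping calendar events",
    "k8s_get_pdb: read a PodDisruptionBudget in a namespace",
]
scores = bm25_scores("PodDisruptionBudget", tools)
best = tools[scores.index(max(scores))]
```

An exact token like "PodDisruptionBudget" matches one description and nothing else, which is the behavior you want for identifiers and acronyms.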

Embeddings

Sentence embeddings (like SBERT-style approaches) are designed to enable efficient semantic similarity search. [7]

Practical benefit: it handles intent queries like:

“delete all tasks due tomorrow”
“find calendar conflicts next week”
“check if deployment is stuck”

Approximate nearest neighbor indexing

At scale, you’ll want ANN indexing (FAISS is a well-known library in this space). [8]

Rerank

This is where you incorporate:

tool quality score
tenant policy
“danger tool” gating
recent tool drift

Reranking is also where you can enforce “don’t pick write tools unless necessary.”

Safety: allowlists, “danger gates,” and budgets

Tool discovery is not neutral. It’s an authorization problem.

Your selector should be policy-aware:

Read-only mode: only surface read tools
No-delete mode: deletes never appear
Prod incident mode: allow observation tools, restrict mutation
Human approval mode: show write tools, but require confirmation

Also: build budgets into selection. If a tool is expensive (slow, rate-limited, high blast radius), rank it lower unless strongly justified.
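
A sketch of mode-aware surfacing plus a budget penalty. The mode table is an assumption (for example, it simplifies "prod incident mode" to read-only), and the `effect` labels and `penalty` weight are hypothetical.

```python
MODE_ALLOWED_EFFECTS = {
    "read_only": {"read"},
    "no_delete": {"read", "create", "update"},
    "prod_incident": {"read"},   # simplification: observation only
    "human_approval": {"read", "create", "update", "delete"},
}

def surface(tools, mode):
    """Return (tool_name, needs_confirmation) pairs the selector may expose in this mode."""
    allowed = MODE_ALLOWED_EFFECTS[mode]
    visible = []
    for tool in tools:
        if tool["effect"] not in allowed:
            continue   # the tool never appears, so it cannot be selected
        needs_confirmation = mode == "human_approval" and tool["effect"] != "read"
        visible.append((tool["name"], needs_confirmation))
    return visible

def budget_adjusted(relevance, cost, penalty=0.5):
    """Lower the ranking score of expensive tools; cost is normalised to [0, 1]."""
    return relevance * (1.0 - penalty * cost)

tools = [
    {"name": "get_task", "effect": "read"},
    {"name": "delete_task", "effect": "delete"},
]
```

Filtering at surfacing time is stronger than asking the model not to call a tool: what is not listed cannot be chosen.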

For tool-using agents, OWASP highlights prompt injection and excessive agency as key risks - exactly the failure modes you get when tools are over-exposed without gates. [10]

Quality scoring and tool quarantine

You need a tool quality score because tools drift:

upstream APIs change
auth breaks
quotas shift
tool server regressions happen

Track per tool:

p50 / p95 latency
error rate
timeout rate
“invalid argument” rate (often a selection problem)
“unsafe attempt” rate (policy violations)

Then take action:

quarantine tools with regression spikes
degrade to read-only tools during outages
route to backups (alternate implementations)
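
A minimal sketch of scoring and quarantine. The weights, thresholds (`latency_budget_ms`, `min_calls`, `spike_factor`), and the `ToolStats` shape are assumptions to tune against your own traffic.

```python
from dataclasses import dataclass

@dataclass
class ToolStats:
    calls: int
    errors: int
    timeouts: int
    invalid_args: int
    p95_latency_ms: float

def quality_score(stats, latency_budget_ms=2000.0):
    """Score in [0, 1]; higher is healthier."""
    if stats.calls == 0:
        return 1.0   # no evidence yet, do not penalise
    failure_rate = (stats.errors + stats.timeouts + stats.invalid_args) / stats.calls
    latency_penalty = min(1.0, stats.p95_latency_ms / latency_budget_ms) * 0.2
    return max(0.0, 1.0 - failure_rate - latency_penalty)

def should_quarantine(current, baseline, min_calls=50, spike_factor=3.0):
    """Quarantine when the error rate spikes well above the tool's own baseline."""
    if current.calls < min_calls:
        return False   # too little traffic to judge
    current_rate = current.errors / current.calls
    baseline_rate = max(baseline.errors / max(baseline.calls, 1), 0.01)
    return current_rate >= spike_factor * baseline_rate

baseline = ToolStats(calls=1000, errors=10, timeouts=0, invalid_args=0, p95_latency_ms=300)
healthy = ToolStats(calls=100, errors=1, timeouts=0, invalid_args=0, p95_latency_ms=300)
regressed = ToolStats(calls=100, errors=20, timeouts=5, invalid_args=0, p95_latency_ms=900)
```

Comparing against the tool's own baseline avoids quarantining tools that are simply noisier than average.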

Debuggability: explainable tool selection

If you can’t answer “why did the agent pick that tool?”, you won’t be able to operate the system.

Log (or attach to traces) the selection evidence:

query text
candidate tools (top 30)
retrieval scores
rerank scores
policy filters applied
final selected tools and why

This also becomes training data later.
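
The evidence list above fits in one structured record per selection. The field names and example values are hypothetical; emit the record as a log line or attach it to a trace span.

```python
import json

def selection_record(query, candidates, policy_filters, selected, reason):
    """Build one explainable record; candidates are (tool, retrieval, rerank) tuples."""
    return {
        "query": query,
        "candidates": [
            {"tool": name, "retrieval": round(retrieval, 4), "rerank": round(rerank, 4)}
            for name, retrieval, rerank in candidates
        ],
        "policy_filters": policy_filters,
        "selected": selected,
        "reason": reason,
    }

record = selection_record(
    query="check if deployment is stuck",
    candidates=[("k8s_rollout_status", 0.8123, 0.93), ("k8s_list_pods", 0.7011, 0.61)],
    policy_filters=["tenant_allowlist", "read_only"],
    selected=["k8s_rollout_status"],
    reason="highest rerank score among read tools",
)
line = json.dumps(record, sort_keys=True)
```

If queries can carry sensitive text, store a redacted or hashed form instead of the raw string.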

A minimal reference architecture

+-------------------------------+
|    Agent runtime (planner)    |
+-------------------------------+
                |
                v
+-------------------------------+
|     Tool Selector Service     |
|  - policy filter              |
|  - hybrid retrieval           |
|  - rerank                     |
|  - tool quality weighting     |
+-------------------------------+
                |  returns top-K tools + schemas
                v
+-------------------------------+
|        Agent execution        |
|  - calls tools via MCP        |
+-------------------------------+

Where MCP fits: MCP provides a standardized way for clients to discover tools and invoke them. [1]

The selector doesn’t replace MCP. It makes MCP usable at scale.

A production checklist

Tool catalog hygiene

Stable naming conventions.
“When NOT to use” bullets exist.
Examples exist for the top tools.
Tool side effects are classified.

Selection pipeline

Mandatory policy filter before ranking.
Hybrid retrieval (lexical + embeddings). [7][9]
Rerank top-K with quality + policy.
Candidate set bounded (K is small).

Safety

Dangerous tools are gated and not surfaced by default.
Budget-aware ranking exists.
OWASP LLM risks considered in tool exposure strategy. [10]

Operability

Selection decisions are explainable (log evidence).
Tool quality scoring exists and drives quarantine.
Selection regressions are covered by evals (next article).

References

[1] Model Context Protocol (MCP) - Specification (Protocol Revision 2025-11-25): https://modelcontextprotocol.io/specification/2025-11-25
[2] MCP - Transports (including stdio and Streamable HTTP): https://modelcontextprotocol.io/specification/2025-03-26/basic/transports
[3] ToolLLM / ToolBench (tool-use dataset + evaluation): https://arxiv.org/abs/2307.16789
[4] StableToolBench (stable tool-use benchmarking): https://arxiv.org/abs/2403.07714
[5] tau-bench (tool-agent-user interaction benchmark): https://arxiv.org/abs/2406.12045
[6] AgentBench (evaluating LLMs as agents): https://arxiv.org/abs/2308.03688
[7] Sentence-BERT (efficient semantic similarity search via embeddings): https://arxiv.org/abs/1908.10084
[8] FAISS / Billion-scale similarity search with GPUs: https://arxiv.org/abs/1702.08734 and https://github.com/facebookresearch/faiss
[9] Robertson (BM25 and probabilistic relevance framework): https://dl.acm.org/doi/abs/10.1561/1500000019
[10] OWASP - Top 10 for Large Language Model Applications: https://owasp.org/www-project-top-10-for-large-language-model-applications/

The Service Template That Prevents Incidents

Sat, 25 Oct 2025 12:00:00 -0500

Why this matters

Most enterprises try to standardize software delivery with:

PDFs
Confluence pages
slide decks
architecture review boards

It doesn’t scale.

Teams don’t move faster because the rules exist. Teams move faster because the defaults exist.

Platform engineering language captures this well: paved roads / golden paths reduce cognitive load and make the “right way” the easy way. [1][2] The CNCF Platforms White Paper makes the case for internal platforms as a lever that impacts value streams indirectly - through better flow and developer experience. [3]

This article is a practical blueprint for the thing that actually changes outcomes:

A service template that bakes reliability, security, and operability into day-one defaults.

TL;DR

Build one paved road for APIs:
repo template + CI pipeline + runtime defaults
Include “boring” but critical capabilities:
health probes, resource requests/limits, disruption budgets [4][5][6]
tracing/metrics/logging via OpenTelemetry [7]
timeouts, retries, rate limits
standardized deployment and rollout
Measure success with outcomes (DORA metrics): lead time, deploy frequency, change failure rate, MTTR. [8]
Optimize for day 2 to day 50, not just “hello world.”

What a paved road is (and isn’t)
The API service template: required capabilities
A reference repository structure
Kubernetes defaults that save you later
Observability by default
Security by default
Rollouts and operational controls
How to roll this out without a platform revolt
A production checklist
References

What a paved road is (and isn’t)

A paved road is

a recommended path to production
preconfigured defaults that make safe delivery easy
automation that eliminates repetitive decisions

Microsoft describes this in internal developer platform terms: recommended and supported development paths, incrementally paved through an internal platform. [2]

A paved road is not

a mandate that blocks all other approaches
a committee process
a doc nobody reads

If your paved road becomes a gate, teams will route around it.

The API service template: required capabilities

Here’s what “enterprise production API” should mean out of the box.

Operability

structured logging with correlation IDs
metrics (request rate/latency/errors)
tracing across inbound/outbound calls [7]
runtime config and feature flags

Reliability

timeouts everywhere
bounded retries with backoff
health probes (liveness/readiness/startup) [5]
graceful shutdown
rate limits / concurrency caps

Platform fit

Kubernetes-ready manifests
resource requests/limits [4]
PodDisruptionBudget for availability during maintenance [6]
standardized rollout strategy

Security

auth middleware
input validation
secret injection patterns (no secrets in repo)
least privilege service accounts

Delivery

CI pipeline: lint/test/build/scan
SBOM generation
deploy automation (GitOps or pipeline)

A reference repository structure

.
|-- cmd/service/          # main
|-- internal/             # business logic
|-- pkg/                  # shared libs (optional)
|-- api/                  # OpenAPI spec, schemas
|-- deploy/
|   |-- k8s/              # manifests (or Helm/Kustomize)
|   `-- policy/           # OPA/constraints (optional)
|-- docs/
|   |-- index.md
|   `-- runbooks/
|-- Makefile
`-- .github/workflows/    # CI

Key idea: the template is not just code - it is the full production story:

how to run locally
how to deploy
how to observe
how to operate on-call

Kubernetes defaults that save you later

1) Resource requests and limits

Kubernetes scheduling and stability depend on requests/limits. The official docs explain how pod requests/limits are derived from container values. [4]

Template default:

set conservative requests
set safe limits
provide guidance for right-sizing

2) Probes

Kubernetes supports liveness, readiness, and startup probes. The docs describe how to configure them and why they matter. [5]

Template default:

readinessProbe ensures traffic only goes to ready pods
livenessProbe catches deadlocks / stuck processes
startupProbe prevents early restarts for slow boot services

3) Disruption budgets

PodDisruptionBudgets limit concurrent disruptions during voluntary maintenance. [6]

Template default:

include a PDB for replicated services
define min available or max unavailable

Observability by default

If you do one thing: instrument the template so every service ships with telemetry.

OpenTelemetry provides the framework for standard traces/metrics/logs. [7]

Template defaults:

standard HTTP server instrumentation
propagation of trace context (W3C headers)
request logs include trace IDs
golden dashboard:
RPS
p95 latency
error rate
saturation (CPU/memory)
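
In practice the OpenTelemetry SDK handles propagation for you. This sketch only shows what travels on the wire: the W3C `traceparent` header, and how a request log line picks up the trace ID. The helper names are hypothetical.

```python
import re
import secrets

# version "00" - 16-byte trace ID - 8-byte parent span ID - 1-byte flags, lowercase hex
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def new_traceparent(sampled=True):
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(header):
    match = TRACEPARENT.match(header.strip().lower())
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None   # all-zero IDs are invalid
    return {"trace_id": trace_id, "span_id": span_id,
            "sampled": (int(flags, 16) & 1) == 1}

def log_fields(header, method, path, status):
    """Structured request log entry that carries the trace ID when one is present."""
    context = parse_traceparent(header) or {}
    return {"method": method, "path": path, "status": status,
            "trace_id": context.get("trace_id")}

incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
entry = log_fields(incoming, "GET", "/orders", 200)
```

With the trace ID in every log line, a dashboard alert can be followed to the exact requests behind it.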

Security by default

Avoid “security guidance documents.” Make secure defaults.

Template defaults:

auth middleware with standardized claims/roles mapping
structured validation for request bodies
outbound allowlists (where feasible)
secret injection via environment/secret store (no plain text)

Your paved road becomes a security accelerator because teams start secure.

Rollouts and operational controls

Default rollout patterns:

canary or progressive delivery when needed
safe rollback
feature flags for risky changes

Default operational controls:

rate limiting
concurrency limits
timeouts and circuit breakers
“maintenance mode” toggle
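
Rate limiting in a template is often a token bucket. This is a minimal single-process sketch (not thread-safe, no distributed state); the class and parameter names are illustrative.

```python
import time

class TokenBucket:
    """Allow `rate_per_sec` requests on average, with bursts of up to `burst`."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock            # injectable so tests can control time
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

fake_now = [0.0]
bucket = TokenBucket(rate_per_sec=2, burst=2, clock=lambda: fake_now[0])
burst_results = [bucket.allow(), bucket.allow(), bucket.allow()]
fake_now[0] = 0.5                     # half a second later, one token has refilled
after_wait = [bucket.allow(), bucket.allow()]
```

Rejected requests should get a clear "try later" response rather than queueing without bound.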

How to roll this out without a platform revolt

This is the part platform teams often miss.

1) Make it optional - but obviously better

If adopting the template reduces weeks of work to hours, teams will choose it.

2) Provide migration paths

minimal adoption: observability + probes
medium: deploy manifests + CI
full: service template + libraries

3) Measure outcomes, not adoption

Use DORA metrics to show impact: lead time, deploy frequency, change failure rate, time to restore service. [8]

If the paved road doesn’t move these, it’s not paved.
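
The four metrics can be computed from plain deployment records. This is a sketch under simple assumptions: the record fields are hypothetical, lead time is commit-to-deploy, and restore time is measured from the failed deploy.

```python
from datetime import datetime, timedelta
from statistics import median

def dora_summary(deploys, window_days):
    """deploys: dicts with commit_at, deployed_at, failed, restored_at (None unless failed)."""
    lead_times = [d["deployed_at"] - d["commit_at"] for d in deploys]
    failures = [d for d in deploys if d["failed"]]
    restore_times = [d["restored_at"] - d["deployed_at"] for d in failures if d["restored_at"]]
    return {
        "median_lead_time_hours": median(lt.total_seconds() for lt in lead_times) / 3600,
        "deploys_per_day": len(deploys) / window_days,
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_restore_minutes": (
            sum(rt.total_seconds() for rt in restore_times) / len(restore_times) / 60
            if restore_times else None
        ),
    }

t0 = datetime(2025, 10, 1, 9, 0)

def deploy(day, lead_hours, failed=False, restore_minutes=None):
    commit_at = t0 + timedelta(days=day)
    deployed_at = commit_at + timedelta(hours=lead_hours)
    restored_at = deployed_at + timedelta(minutes=restore_minutes) if failed else None
    return {"commit_at": commit_at, "deployed_at": deployed_at,
            "failed": failed, "restored_at": restored_at}

records = [deploy(0, 2), deploy(1, 4, failed=True, restore_minutes=30),
           deploy(2, 6), deploy(3, 8)]
summary = dora_summary(records, window_days=7)
```

Compare the same summary for services on and off the template to see whether the paved road is actually moving outcomes.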

A production checklist

Template

Repo template includes CI, deploy, docs, runbooks.
Observability instrumentation included by default. [7]

Kubernetes

Resource requests/limits included. [4]
Liveness/readiness/startup probes included. [5]
PodDisruptionBudget included for replicated services. [6]

Reliability

Timeouts and bounded retries are standard.
Graceful shutdown is implemented.
Rate limiting/concurrency caps exist.

Security

Auth middleware included.
Secrets handled via secure injection (not repo).

Outcomes

DORA metrics tracked to validate improvement. [8]

References

[1] CNCF - What is platform engineering? (golden paths/paved roads framing): https://www.cncf.io/blog/2025/11/19/what-is-platform-engineering/
[2] Microsoft Learn - What is platform engineering? (paved paths / internal developer platform): https://learn.microsoft.com/en-us/platform-engineering/what-is-platform-engineering
[3] CNCF TAG App Delivery - Platforms White Paper: https://tag-app-delivery.cncf.io/whitepapers/platforms/
[4] Kubernetes - Resource Management for Pods and Containers (requests/limits): https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
[5] Kubernetes - Configure Liveness, Readiness and Startup Probes: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
[6] Kubernetes - Specifying a Disruption Budget for your Application (PDB): https://kubernetes.io/docs/tasks/run-application/configure-pdb/
[7] OpenTelemetry - Documentation (instrumentation and telemetry): https://opentelemetry.io/docs/
[8] DORA - DORA’s software delivery performance metrics: https://dora.dev/guides/dora-metrics/

The Real Security Model for Agents

Sat, 18 Oct 2025 12:00:00 -0500

Why this matters

If you ship tool-using agents, you are shipping:

an execution engine
with access to external systems
controlled by untrusted inputs

That is the same security posture as any automation platform - except the “operator” is probabilistic.

OWASP’s Top 10 for LLM Applications makes it clear: prompt injection, insecure output handling, sensitive info disclosure, excessive agency… these are mainstream risks, not edge cases. [1] The good news: most mitigations are classic security engineering applied to a new execution model.

This article is a practical, production-first security model for agents and MCP tool ecosystems.

TL;DR

Don’t “secure the model.” Secure the system.
Treat all inputs as untrusted:
user text
tool outputs
retrieved documents
Design tools with least privilege:
separate read/write/danger tools
require preview -> apply for destructive actions
Centralize auth and policy:
MCP defines authorization for HTTP transports - use it. [2]
Control egress and prevent SSRF by default. [3]
Never let raw model output drive execution without validation (OWASP LLM02). [4]
Redact logs and manage secrets like an adult (OWASP cheat sheets). [5][6]

Threat model: what can go wrong
Security layers that actually work
Tool design: read/write/danger tiers
Output handling: never execute raw model output
Secrets: minimize, scope, rotate
Network and egress controls
Logging and audit without data leaks
A production checklist
References

Threat model: what can go wrong

1) Prompt injection -> policy bypass attempt

A user or document says:

“Ignore previous instructions”
“Call this tool with these parameters”
“Reveal secrets”

OWASP calls this out as a primary risk category. [1]

2) Insecure output handling -> downstream exploitation

If you pass model output into:

a shell
SQL
YAML manifests
HTTP requests

…without validation, you’ve built an indirect code execution path.

OWASP’s LLM02 describes this precisely: insufficient validation and handling of LLM outputs before passing them downstream. [4]

3) Excessive agency -> unintended side effects

The agent is over-permissioned:

it can delete resources
send emails
modify production

…and it will eventually do something you didn’t mean.

4) Data exfiltration via tools

Tool outputs are rich and often sensitive:

calendar events
emails
internal tickets
source code
cluster configs

Exfil happens through:

model responses
logs
“helpful” summaries
tool chaining

5) Network abuse / SSRF

Any “fetch URL” capability is an SSRF invitation unless you constrain egress. OWASP’s SSRF cheat sheet is still relevant. [3]

Security layers that actually work

Security in agent systems is defense-in-depth:

Identity (who is calling?)
Authorization (what can they do?)
Contracts (what does a tool accept/return?)
Validation (are inputs/outputs safe?)
Egress control (where can the system talk to?)
Audit (what happened?)
Kill switches (how do you stop it fast?)

Tool design: read/write/danger tiers

Tiering is mandatory

Split tools by side effects:

Read tools: list/search/get
Write tools: create/update with bounded scope
Danger tools: deletes, bulk updates, privileged actions

Then enforce policy:

Read tools are widely available
Write tools require explicit scopes and tighter budgets
Danger tools require:
preview -> apply
confirmation tokens
additional policy checks

Preview -> Apply pattern

For dangerous operations:

plan_* returns a plan summary + plan_id
apply_* requires plan_id + user confirmation

This prevents “drive-by deletes” and supports audit.
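
A minimal in-memory sketch of the pattern. `PlanStore`, its method names, and the single-use rule are assumptions; a real service would persist plans, expire them, and record an audit event on apply.

```python
import secrets

class PlanStore:
    def __init__(self):
        self._plans = {}   # plan_id -> {"user": ..., "resource_ids": [...]}

    def plan_delete(self, user, resource_ids):
        """Preview step: record what would be deleted and return a summary."""
        plan_id = secrets.token_urlsafe(16)
        self._plans[plan_id] = {"user": user, "resource_ids": list(resource_ids)}
        summary = f"Will delete {len(resource_ids)} resource(s): {', '.join(resource_ids)}"
        return plan_id, summary

    def apply_delete(self, user, plan_id, confirmed, delete_fn):
        """Apply step: only runs a stored plan, for the same user, after confirmation."""
        plan = self._plans.get(plan_id)
        if plan is None or plan["user"] != user:
            raise PermissionError("unknown plan_id for this user")
        if not confirmed:
            raise PermissionError("user confirmation required")
        del self._plans[plan_id]   # single use: a plan cannot be replayed
        for resource_id in plan["resource_ids"]:
            delete_fn(resource_id)
        return plan["resource_ids"]

store = PlanStore()
deleted = []
plan_id, summary = store.plan_delete("alice", ["task-1", "task-2"])
```

The apply step deletes exactly what the plan recorded, so the model cannot widen the scope between preview and apply.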

Output handling: never execute raw model output

This is the most common real-world failure.

Rule: model output is data, not code

If the agent is generating:

kubernetes YAML
SQL statements
curl commands
Terraform changes

…treat the output as untrusted data.

OWASP’s LLM02 guidance exists because people keep wiring LLM output directly into execution paths. [4]

Safer alternative: structured intent -> validated execution

Instead of:

LLM writes YAML -> apply

Do:

LLM proposes a structured change request (schema)
server validates:
allowlisted fields
bounded ranges
namespace/tenant scope
server executes with known-safe libraries

This is where “tool contracts” win.
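
The server-side validation step can be sketched for one change type, scaling a deployment. The field names, the replica bounds, and the tenant namespace list are hypothetical.

```python
ALLOWED_FIELDS = {"namespace", "deployment", "replicas"}
REPLICA_RANGE = range(1, 21)   # bounded: 1 to 20 inclusive

def validate_scale_request(request, tenant_namespaces):
    """Return a list of validation errors; an empty list means the request may execute."""
    errors = []
    unknown = set(request) - ALLOWED_FIELDS
    if unknown:
        errors.append(f"unknown fields: {sorted(unknown)}")
    missing = ALLOWED_FIELDS - set(request)
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    replicas = request.get("replicas")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(replicas, bool) or not isinstance(replicas, int) or replicas not in REPLICA_RANGE:
        errors.append("replicas must be an integer between 1 and 20")
    if request.get("namespace") not in tenant_namespaces:
        errors.append("namespace is outside the caller's scope")
    return errors

good = {"namespace": "team-a", "deployment": "api", "replicas": 3}
bad = {"namespace": "kube-system", "deployment": "api", "replicas": 500, "command": "rm -rf /"}
good_errors = validate_scale_request(good, tenant_namespaces={"team-a"})
bad_errors = validate_scale_request(bad, tenant_namespaces={"team-a"})
```

The model can only fill in a small, typed form. Anything outside the form is rejected before it reaches an execution path.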

Secrets: minimize, scope, rotate

Secrets are the other common failure path.

Minimum viable rules

Never put long-lived secrets in prompts.
Prefer short-lived tokens and scoped credentials.
Inject secrets server-side, not in the model context.

OWASP’s Secrets Management Cheat Sheet is a good baseline for central storage, rotation, auditing, and least privilege. [5]

Scope secrets to tenants and tools

Instead of “one OAuth token for everything,” mint:

per tenant
per tool category
short TTL

When something goes wrong, you want the blast radius small and revocation easy.

Network and egress controls

If your agent system can reach the open internet or internal networks, you need guardrails.

Egress allowlists

allowlist domains for integrations
block cloud metadata endpoints (for example 169.254.169.254) and internal IP ranges
re-validate after redirects

OWASP’s SSRF prevention guidance provides practical patterns for validation and blocking internal addresses. [3]
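
A sketch of the check to run before each outbound request, and again for every redirect hop. The allowlisted host is hypothetical, and the caller is assumed to resolve DNS and pass in the address it will actually connect to.

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"api.example.com"}   # hypothetical integration allowlist

def is_internal_address(address):
    """True for private, loopback, link-local (cloud metadata), and other non-public ranges."""
    ip = ipaddress.ip_address(address)
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_multicast or ip.is_reserved or ip.is_unspecified)

def egress_allowed(url, resolved_ip):
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.scheme != "https":
        return False
    if host not in ALLOWED_HOSTS:
        return False
    # An allowlisted name can still resolve to an internal address, so check the IP too.
    return not is_internal_address(resolved_ip)
```

Checking the resolved address, not only the hostname, is what stops an allowlisted name from being pointed at an internal service.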

Separate network planes

Keep tool servers in a network segment that:

can reach only what they need
cannot reach internal admin endpoints
cannot reach secrets stores directly unless necessary

Logging and audit without data leaks

Logging is security. Logging is also a leak vector.

OWASP’s Logging Cheat Sheet calls out that logs may contain personal and sensitive information and must be protected from misuse. [6]

Practical logging rules

do not log raw prompts by default
do not log raw tool payloads by default
log structured summaries:
tool name
action class
resource IDs (safe identifiers)
status
latency
store audit events separately from debug logs

Audit events (always on)

Every write/danger tool should emit:

who / what / when / result
plan_id / idempotency_key
before/after resource identifiers (not content)
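
A sketch of both record types: a debug-log summary that keeps only safe fields, and an always-on audit event. The field names and the shape of `call` are assumptions.

```python
import json
from datetime import datetime, timezone

SAFE_FIELDS = ("tool", "action_class", "resource_ids", "status", "latency_ms")

def log_summary(call):
    """Debug-log entry: only allowlisted fields, never raw prompts or payloads."""
    return {key: call[key] for key in SAFE_FIELDS if key in call}

def audit_event(actor, call, plan_id=None, idempotency_key=None, now=None):
    """Audit entry for a write/danger tool: who, what, when, result, identifiers only."""
    now = now or datetime.now(timezone.utc)
    return {
        "who": actor,
        "what": f"{call['tool']}:{call['action_class']}",
        "when": now.isoformat(),
        "result": call["status"],
        "plan_id": plan_id,
        "idempotency_key": idempotency_key,
        "resource_ids": call["resource_ids"],
    }

call = {
    "tool": "delete_event",
    "action_class": "danger",
    "resource_ids": ["evt_123"],
    "status": "ok",
    "latency_ms": 84,
    "payload": {"title": "Board meeting: acquisition terms"},   # sensitive, must not be logged
}
summary_line = json.dumps(log_summary(call), sort_keys=True)
event = audit_event("alice", call, plan_id="plan_42", idempotency_key="idem_7",
                    now=datetime(2025, 10, 18, 12, 0, tzinfo=timezone.utc))
```

An allowlist of fields fails safe: a new sensitive field added to `call` later stays out of the logs until someone deliberately adds it.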

Audit is what makes “agents in production” defensible to security and compliance teams.

A production checklist

Identity and authorization

Strong auth for clients.
Least-privilege scopes per tool.
MCP HTTP authorization flow implemented where applicable. [2]

Tool contracts

Tools tiered: read/write/danger.
Preview -> apply for dangerous actions.
Schema validation + bounded arguments.

Output handling

No raw model output is executed without validation (OWASP LLM02). [4]

Secrets

Secrets never placed in prompts.
Short-lived, scoped tokens used.
Rotation/audit practices exist (OWASP Secrets Mgmt). [5]

Network

Egress allowlists exist.
SSRF protections implemented. [3]

Logging and audit

Logs are redacted and access-controlled.
Audit events exist for all side-effecting tools.
Log systems protected per OWASP guidance. [6]

References

[1] OWASP - Top 10 for Large Language Model Applications (v1.1): https://owasp.org/www-project-top-10-for-large-language-model-applications/
[2] Model Context Protocol (MCP) - Authorization (Protocol Revision 2025-11-25): https://modelcontextprotocol.io/specification/2025-11-25/basic/authorization
[3] OWASP - SSRF Prevention Cheat Sheet: https://cheatsheetseries.owasp.org/cheatsheets/Server_Side_Request_Forgery_Prevention_Cheat_Sheet.html
[4] OWASP GenAI Security Project - LLM02: Insecure Output Handling: https://genai.owasp.org/llmrisk2023-24/llm02-insecure-output-handling/
[5] OWASP - Secrets Management Cheat Sheet: https://cheatsheetseries.owasp.org/cheatsheets/Secrets_Management_Cheat_Sheet.html
[6] OWASP - Logging Cheat Sheet: https://cheatsheetseries.owasp.org/cheatsheets/Logging_Cheat_Sheet.html
