Cost Is a Reliability Problem

December 13, 2025 · 5 min read

Why this matters

Traditional reliability focuses on uptime. AI systems add a second axis:

Your system can be “up” while your budget is on fire.

A runaway agent doesn’t always crash services. Sometimes it:

  • loops tool calls
  • retries incorrectly
  • escalates to larger models repeatedly
  • expands context windows unnecessarily
  • performs expensive searches without stopping

The result: surprise bills, throttling, and eventually hard outages when quotas are hit.

Google’s SRE framing around error budgets is a useful mental model: budgets create a control mechanism that balances stability with velocity. [1][2] FinOps frames cost management as a collaboration practice between engineering, finance, and business. [3]

This article is the practical bridge: use budgets and guardrails like you would for reliability.


TL;DR

  • Treat cost as an SLO: define acceptable spend per run / per tenant / per day.
  • Enforce budgets at multiple layers:
      • per request/run
      • per tool
      • per tenant
      • per environment
  • Use hard limits + soft limits:
      • soft: degrade model/tool choices
      • hard: stop the run and ask for approval
  • Add cost circuit breakers:
      • abort on runaway loops
      • quarantine tools causing repeated retries
  • Make cost visible (metrics + dashboards) so teams can improve it.
  • Align with FinOps: shared accountability, not “billing surprises.” [3]


Cost failure modes in agent systems

1) Infinite or long loops

Common triggers:

  • ambiguous tool outputs
  • brittle parsing
  • “try again” reflexes
  • non-idempotent retries

2) Tool spam

Agents sometimes “search until confident.” If you don’t cap it, you get 20+ tool calls on a single request.

3) Model escalation cascades

If your policy says “if uncertain, use a better model,” you can create a cost escalator:

  • cheap model -> “uncertain” -> expensive model
  • expensive model -> still uncertain -> more calls

4) Context growth

If you keep appending tool outputs to the prompt, costs grow superlinearly and performance can degrade.

5) External quotas become outages

Even if cost is acceptable, external services (email APIs, GitHub, calendars) can rate limit you. Cost and reliability are coupled.


Define cost SLOs and budgets

Start with simple “production truths”:

  • How much is one agent run allowed to cost?
  • What is an acceptable daily spend per tenant?
  • What is the max “blast radius” of a single request?

This maps cleanly to SRE’s error budget concept: budgets constrain unsafe behavior while preserving velocity. [2]

Example cost SLOs (pragmatic)

  • Per run: <= $0.10 (p95), <= $0.50 (max)
  • Per tenant/day: <= $50/day
  • Per user/day: <= $5/day
  • Per expensive tool: <= 3 calls per run

These aren’t universal. They’re explicit. That’s what matters.
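The SLOs above can be written down as data, so they are versioned and reviewable rather than scattered across code. A minimal sketch; the struct shape and method name are assumptions, and the numbers are the illustrative ones from the list:

```go
package main

import "fmt"

// CostSLO makes spend targets explicit. Values are illustrative,
// not recommendations.
type CostSLO struct {
	PerRunP95USD    float64 // target for typical runs
	PerRunMaxUSD    float64 // hard cap per run
	PerTenantDayUSD float64 // daily ceiling per tenant
}

// WithinRunBudget reports whether a single run's cost is inside the hard cap.
func (s CostSLO) WithinRunBudget(runCostUSD float64) bool {
	return runCostUSD <= s.PerRunMaxUSD
}

func main() {
	slo := CostSLO{PerRunP95USD: 0.10, PerRunMaxUSD: 0.50, PerTenantDayUSD: 50}
	fmt.Println(slo.WithinRunBudget(0.42)) // true: under the $0.50 hard cap
	fmt.Println(slo.WithinRunBudget(0.60)) // false: over the hard cap
}
```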


Budget layers: run, tool, tenant, environment

1) Per-run budget

Tracks:

  • max model tokens
  • max tool calls
  • max wall-clock time
  • max “expensive operations” count

This is the most important budget: it is where you stop runaway behavior early.

2) Per-tool budget

Some tools are inherently expensive:

  • large searches
  • long-running jobs
  • heavy data exports

Budget these separately:

  • max calls
  • max payload size
  • max time range

3) Per-tenant budget

Without this, your best customers can melt your infra.

Per-tenant limits:

  • requests/min
  • concurrent runs
  • daily cost cap
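A daily cost cap per tenant can be a simple admission check before each run. In production the counters would live in shared storage and reset at day boundaries; this in-memory sketch shows only the shape:

```go
package main

import (
	"fmt"
	"sync"
)

// TenantCaps enforces a daily spend ceiling per tenant.
type TenantCaps struct {
	mu         sync.Mutex
	dailyCap   float64
	spentToday map[string]float64
}

func NewTenantCaps(capUSD float64) *TenantCaps {
	return &TenantCaps{dailyCap: capUSD, spentToday: map[string]float64{}}
}

// Allow charges the estimated cost and reports whether the tenant
// stays under its daily cap.
func (t *TenantCaps) Allow(tenant string, estUSD float64) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.spentToday[tenant]+estUSD > t.dailyCap {
		return false
	}
	t.spentToday[tenant] += estUSD
	return true
}

func main() {
	caps := NewTenantCaps(50)
	fmt.Println(caps.Allow("acme", 49)) // true
	fmt.Println(caps.Allow("acme", 2))  // false: would exceed $50/day
}
```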

4) Per-environment budget

Environments have different rules:

  • dev: cheap, permissive, more logging
  • prod: bounded, gated, auditable

This is where you implement “read-only mode” during incidents.


Soft limits vs hard limits

Soft limits (degrade gracefully)

When approaching budget:

  • switch to cheaper models
  • reduce context size (summarize)
  • narrow tool search range
  • skip non-essential steps

Hard limits (stop the run)

When budget is exceeded:

  • stop tool calls
  • stop escalation
  • request user confirmation / approval
  • produce a partial answer with an explanation

This is exactly the “control mechanism” idea behind error budgets: it gives the system permission to shift focus when constraints are exceeded. [1]
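The soft/hard split reduces to a small decision function the orchestrator consults before each step. The 80% soft threshold below is an assumption; tune it per workload:

```go
package main

import "fmt"

// Action is the enforcement decision for the next step of a run.
type Action int

const (
	Proceed Action = iota
	Degrade // soft limit: cheaper model, smaller context, narrower search
	Halt    // hard limit: stop and request approval
)

// NextAction maps budget consumption to an enforcement decision.
func NextAction(spentUSD, budgetUSD float64) Action {
	switch {
	case spentUSD >= budgetUSD:
		return Halt
	case spentUSD >= 0.8*budgetUSD:
		return Degrade
	default:
		return Proceed
	}
}

func main() {
	fmt.Println(NextAction(0.05, 0.50)) // 0 (Proceed)
	fmt.Println(NextAction(0.45, 0.50)) // 1 (Degrade)
	fmt.Println(NextAction(0.55, 0.50)) // 2 (Halt)
}
```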


Circuit breakers for runaway behavior

Add circuit breakers that detect “this is going bad”:

  • loop detector: same tool called with similar args repeatedly
  • retry storm: high retry count for a tool within a run
  • no progress: plan step count increases without new evidence
  • latency breaker: tool p95 spikes beyond threshold

When triggered:

  • stop the run
  • quarantine the tool for this run
  • degrade to safe alternatives
  • emit high-signal telemetry
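The loop detector above can be as simple as counting repeated (tool, arguments) pairs within a run. Matching on an exact argument hash is a simplification; fuzzier similarity is also possible:

```go
package main

import "fmt"

// LoopBreaker trips when the same tool is called with identical
// arguments more than maxRepeats times in one run.
type LoopBreaker struct {
	maxRepeats int
	seen       map[string]int
}

func NewLoopBreaker(maxRepeats int) *LoopBreaker {
	return &LoopBreaker{maxRepeats: maxRepeats, seen: map[string]int{}}
}

// Record counts the call and returns true when it looks like a runaway loop.
func (l *LoopBreaker) Record(tool, argsHash string) bool {
	key := tool + "|" + argsHash
	l.seen[key]++
	return l.seen[key] > l.maxRepeats
}

func main() {
	lb := NewLoopBreaker(2)
	for i := 1; i <= 3; i++ {
		if lb.Record("search", "q=invoices") {
			fmt.Printf("breaker tripped on call %d\n", i) // trips on call 3
		}
	}
}
```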

Cost-aware tool and model selection

Cost control is easier if it’s designed into selection:

  • Rank tools with a “cost weight” (latency + upstream cost + risk)
  • Prefer read-only tools unless a write is required
  • Use caches for common retrieval results
  • Use deterministic summarization boundaries for tool outputs

If you already implement a tool selector (see “Million Tool Problem”), cost becomes another rerank feature.
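Treating cost as a rerank feature can look like the sketch below: each candidate's relevance score is discounted by a cost penalty, so a slightly less relevant but much cheaper tool can win. The field names and the 0.3 penalty factor are assumptions:

```go
package main

import (
	"fmt"
	"sort"
)

// Tool is a selector candidate: a relevance score plus a cost weight
// combining latency, upstream cost, and risk.
type Tool struct {
	Name       string
	Relevance  float64 // higher is better
	CostWeight float64 // higher is more expensive
}

// RerankByCost sorts candidates by relevance minus a cost penalty.
func RerankByCost(tools []Tool, penalty float64) {
	sort.Slice(tools, func(i, j int) bool {
		si := tools[i].Relevance - penalty*tools[i].CostWeight
		sj := tools[j].Relevance - penalty*tools[j].CostWeight
		return si > sj
	})
}

func main() {
	tools := []Tool{
		{"big_search", 0.90, 2.0},    // score 0.90 - 0.3*2.0 = 0.30
		{"cached_lookup", 0.80, 0.1}, // score 0.80 - 0.3*0.1 = 0.77
	}
	RerankByCost(tools, 0.3)
	fmt.Println(tools[0].Name) // cached_lookup wins despite lower relevance
}
```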


Dashboards and alerts

This is where FinOps and SRE meet: cost is an operational signal.

Dashboards

  • spend/day by tenant
  • cost per run distribution
  • top cost drivers (tools and models)
  • runaway breaker triggers

Alerts

  • daily spend exceeded
  • sudden spend spikes (slope alerts)
  • high frequency of loop breaker events
  • high fraction of runs hitting hard limits
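One of the alert conditions above, a sudden spend spike, can be sketched as a comparison against a trailing average. The 3x factor is an assumption; in practice you would feed this from your metrics pipeline:

```go
package main

import "fmt"

// SpikeAlert flags a sudden spend increase: the latest window exceeds
// the trailing average by a multiplier.
func SpikeAlert(history []float64, latest, factor float64) bool {
	if len(history) == 0 {
		return false
	}
	var sum float64
	for _, v := range history {
		sum += v
	}
	avg := sum / float64(len(history))
	return latest > factor*avg
}

func main() {
	hourly := []float64{1.0, 1.2, 0.9, 1.1} // trailing average ~$1.05/hour
	fmt.Println(SpikeAlert(hourly, 4.8, 3)) // true: ~4.6x the trailing average
	fmt.Println(SpikeAlert(hourly, 1.3, 3)) // false: normal variation
}
```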

AWS’s Well-Architected Cost Optimization pillar frames cost optimization as a continual process across the workload lifecycle. That mindset applies here too. [4]


A production checklist

Budgets

  • Per-run cost and tool-call budgets exist.
  • Per-tenant daily caps exist.
  • Per-tool “expensive operation” caps exist.

Enforcement

  • Soft limits degrade gracefully (cheaper models, narrower queries).
  • Hard limits stop and request approval.
  • Circuit breakers detect loops/retry storms.

Telemetry

  • Cost metrics emitted per run and per tenant.
  • Breaker events recorded and alertable.

Culture

  • Cost management is a shared practice (FinOps), not a surprise invoice. [3]

References

[1] Google SRE Workbook - Example Error Budget Policy: https://sre.google/workbook/error-budget-policy/
[2] Google SRE Book - Embracing Risk (error budgets as control mechanism): https://sre.google/sre-book/embracing-risk/
[3] FinOps Foundation - What is FinOps? (definition and principles): https://www.finops.org/introduction/what-is-finops/
[4] AWS Well-Architected Framework - Cost Optimization pillar: https://docs.aws.amazon.com/wellarchitected/latest/framework/cost-optimization.html

Authors
DevOps Architect · Applied AI Engineer
I’ve spent 20 years building systems across embedded systems, microcontrollers, PLCs, security platforms, fintech, SRE, and platform architecture. Today I focus on production AI systems in Go: multi-agent orchestration, MCP server ecosystems, and the DevOps platforms that keep them running. I care about systems that work under pressure: observable, recoverable, and built to last.