Cost Is a Reliability Problem

Why this matters
Traditional reliability focuses on uptime. AI systems add a second axis, cost:
Your system can be “up” while your budget is on fire.
A runaway agent doesn’t always crash services. Sometimes it:
- loops tool calls
- retries incorrectly
- escalates to larger models repeatedly
- expands context windows unnecessarily
- performs expensive searches without stopping
The result: surprise bills, throttling, and eventually hard outages when quotas are hit.
Google’s SRE framing around error budgets is a useful mental model: budgets create a control mechanism that balances stability with velocity. [1][2] FinOps frames cost management as a collaboration practice between engineering, finance, and business. [3]
This article is the practical bridge: use budgets and guardrails like you would for reliability.
TL;DR
- Treat cost as an SLO: define acceptable spend per run / per tenant / per day.
- Enforce budgets at multiple layers:
  - per request/run
  - per tool
  - per tenant
  - per environment
- Use hard limits + soft limits:
  - soft: degrade model/tool choices
  - hard: stop the run and ask for approval
- Add cost circuit breakers:
  - abort on runaway loops
  - quarantine tools causing repeated retries
- Make cost visible (metrics + dashboards) so teams can improve it.
- Align with FinOps: shared accountability, not “billing surprises.” [3]
Contents
- Cost failure modes in agent systems
- Define cost SLOs and budgets
- Budget layers: run, tool, tenant, environment
- Soft limits vs hard limits
- Circuit breakers for runaway behavior
- Cost-aware tool and model selection
- Dashboards and alerts
- A production checklist
- References
Cost failure modes in agent systems
1) Infinite or long loops
Common triggers:
- ambiguous tool outputs
- brittle parsing
- “try again” reflexes
- non-idempotent retries
2) Tool spam
Agents sometimes “search until confident.” If you don’t cap it, you can end up with 20+ tool calls on a single request.
3) Model escalation cascades
If your policy says “if uncertain, use a better model,” you can create a cost escalator:
- cheap model -> “uncertain” -> expensive model
- expensive model -> still uncertain -> more calls
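One way to break the cascade is to make escalation a bounded resource. The sketch below is illustrative only: the tier names in `MODEL_LADDER` and the `max_escalations` default are assumptions, not values from any particular provider.

```python
from typing import Optional

# Hypothetical tier names, cheapest first; substitute your own models.
MODEL_LADDER = ["small", "medium", "large"]


def next_model(current: str, escalations_used: int,
               max_escalations: int = 1) -> Optional[str]:
    """Return the next tier up, or None when escalation is no longer allowed."""
    idx = MODEL_LADDER.index(current)
    at_top = idx == len(MODEL_LADDER) - 1
    if at_top or escalations_used >= max_escalations:
        return None  # caller should answer with what it has, or ask the user
    return MODEL_LADDER[idx + 1]
```

Returning `None` forces the caller to handle "still uncertain" explicitly instead of spending more by default.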
4) Context growth
If you keep appending tool outputs to the prompt, every later model call re-sends them, so total token cost grows superlinearly with the number of steps and performance can degrade.
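A back-of-envelope sketch makes the growth concrete. It assumes every step re-sends the full prompt, each tool output adds a fixed number of tokens, and there is no prompt caching or summarization; the token counts are made up for illustration.

```python
def cumulative_input_tokens(base_tokens: int, tokens_per_step: int, steps: int) -> int:
    """Total input tokens across a run when every step re-sends the whole history."""
    total = 0
    context = base_tokens
    for _ in range(steps):
        total += context             # this call pays for everything accumulated so far
        context += tokens_per_step   # the tool output is appended for the next call
    return total


for steps in (5, 10, 20):
    print(steps, cumulative_input_tokens(1_000, 2_000, steps))
```

With these inputs the totals are 25,000, 100,000, and 400,000 tokens: doubling the number of steps quadruples the input tokens you pay for.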
5) External quotas become outages
Even if cost is acceptable, external services (email APIs, GitHub, calendars) can rate limit you. Cost and reliability are coupled.
Define cost SLOs and budgets
Start with simple “production truths”:
- How much is one agent run allowed to cost?
- What is an acceptable daily spend per tenant?
- What is the max “blast radius” of a single request?
This maps cleanly to SRE’s error budget concept: budgets constrain unsafe behavior while preserving velocity. [2]
Example cost SLOs (pragmatic)
- Per run: <= $0.10 (p95), <= $0.50 (max)
- Per tenant/day: <= $50/day
- Per user/day: <= $5/day
- Expensive tools: <= 3 calls per run
These aren’t universal. They’re explicit. That’s what matters.
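Writing the SLOs down as data makes them checkable. This is a minimal sketch: the `CostSLO` class and `check_runs` helper are hypothetical names, the defaults simply mirror the example numbers above, and the percentile uses the nearest-rank method on a non-empty list of per-run costs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CostSLO:
    run_p95_usd: float = 0.10
    run_max_usd: float = 0.50
    tenant_daily_usd: float = 50.0
    user_daily_usd: float = 5.0
    max_expensive_tool_calls: int = 3


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile (integer math avoids float rounding)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def check_runs(run_costs: List[float], slo: CostSLO) -> Dict[str, bool]:
    """Compare observed per-run costs against the per-run SLO targets."""
    return {
        "p95_ok": p95(run_costs) <= slo.run_p95_usd,
        "max_ok": max(run_costs) <= slo.run_max_usd,
    }


costs = [0.02] * 95 + [0.09] * 4 + [0.60]
print(check_runs(costs, CostSLO()))
```

In this sample the p95 target holds but the single $0.60 run breaches the max, which is the kind of outlier the hard cap exists to catch.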
Budget layers: run, tool, tenant, environment
1) Per-run budget
Tracks:
- max model tokens
- max tool calls
- max wall-clock time
- max “expensive operations” count
This is the most important budget: it is where you stop runaway behavior early.
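A per-run budget can be a small object that the agent loop charges after every model or tool call. The sketch below is one possible shape, not a reference implementation; `RunBudget` and `BudgetExceeded` are hypothetical names, and the limits are whatever you choose.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when any per-run limit is crossed."""


@dataclass
class RunBudget:
    max_tokens: int
    max_tool_calls: int
    max_seconds: float
    max_expensive_ops: int
    tokens: int = 0
    tool_calls: int = 0
    expensive_ops: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int = 0, tool_calls: int = 0, expensive_ops: int = 0) -> None:
        """Record usage, then fail fast if the run is now over any limit."""
        self.tokens += tokens
        self.tool_calls += tool_calls
        self.expensive_ops += expensive_ops
        self.check()

    def check(self) -> None:
        usage = {
            "tokens": (self.tokens, self.max_tokens),
            "tool_calls": (self.tool_calls, self.max_tool_calls),
            "expensive_ops": (self.expensive_ops, self.max_expensive_ops),
            "seconds": (time.monotonic() - self.started_at, self.max_seconds),
        }
        for name, (used, limit) in usage.items():
            if used > limit:
                raise BudgetExceeded(f"{name}: {used} > {limit}")
```

Raising an exception keeps enforcement in one place: the agent loop catches `BudgetExceeded` once and decides how to wind the run down.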
2) Per-tool budget
Some tools are inherently expensive:
- large searches
- long-running jobs
- heavy data exports
Budget these separately:
- max calls
- max payload size
- max time range
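The three per-tool limits can be checked before the call is dispatched. This sketch assumes the tool takes a byte payload and a query time range; `ToolBudget` and `admit` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Tuple


@dataclass
class ToolBudget:
    max_calls: int
    max_payload_bytes: int
    max_time_range: timedelta
    calls: int = 0

    def admit(self, payload: bytes, start: datetime, end: datetime) -> Tuple[bool, str]:
        """Return (allowed, reason). Only admitted calls count against max_calls."""
        if self.calls >= self.max_calls:
            return False, "max calls reached"
        if len(payload) > self.max_payload_bytes:
            return False, "payload too large"
        if end - start > self.max_time_range:
            return False, "time range too wide"
        self.calls += 1
        return True, "ok"
```

Returning a reason string lets the agent narrow the request (a smaller range, a smaller payload) instead of blindly retrying.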
3) Per-tenant budget
Without this, your best customers can melt your infra.
Per-tenant limits:
- requests/min
- concurrent runs
- daily cost cap
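As a rough sketch, two of these limits (concurrent runs and the daily cost cap) fit in a small admission gate. The `TenantLimiter` below is hypothetical, keeps state in memory, and is not thread-safe; a real deployment would likely need a shared store and would add requests/min rate limiting, which is omitted here.

```python
from collections import defaultdict
from datetime import date
from typing import Optional


class TenantLimiter:
    def __init__(self, daily_cap_usd: float, max_concurrent_runs: int):
        self.daily_cap_usd = daily_cap_usd
        self.max_concurrent_runs = max_concurrent_runs
        self._spend = defaultdict(float)  # (tenant, day) -> USD spent
        self._active = defaultdict(int)   # tenant -> runs in flight

    def try_start(self, tenant: str, today: Optional[date] = None) -> bool:
        """Admit a new run only if the tenant is under both limits."""
        day = today or date.today()
        if self._active[tenant] >= self.max_concurrent_runs:
            return False
        if self._spend[(tenant, day)] >= self.daily_cap_usd:
            return False
        self._active[tenant] += 1
        return True

    def finish(self, tenant: str, cost_usd: float, today: Optional[date] = None) -> None:
        """Release the concurrency slot and record what the run cost."""
        day = today or date.today()
        self._active[tenant] = max(0, self._active[tenant] - 1)
        self._spend[(tenant, day)] += cost_usd
```

Because cost is recorded when a run finishes, a single run can overshoot the daily cap; pairing this gate with a per-run budget bounds that overshoot.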
4) Per-environment budget
Environments have different rules:
- dev: cheap, permissive, more logging
- prod: bounded, gated, auditable
This is where you implement “read-only mode” during incidents.
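One way to express this is a policy record per environment plus an incident switch. Everything in the sketch is an assumption for illustration: the `EnvPolicy` fields, the dollar values, and the idea that incident state arrives as a boolean flag.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EnvPolicy:
    max_run_usd: float
    allow_writes: bool
    require_write_approval: bool
    verbose_logging: bool


# Illustrative values only.
POLICIES = {
    "dev": EnvPolicy(max_run_usd=0.05, allow_writes=True,
                     require_write_approval=False, verbose_logging=True),
    "prod": EnvPolicy(max_run_usd=0.50, allow_writes=True,
                      require_write_approval=True, verbose_logging=False),
}


def effective_policy(env: str, incident_active: bool) -> EnvPolicy:
    """During an incident, the environment drops to read-only mode."""
    policy = POLICIES[env]
    if incident_active:
        policy = replace(policy, allow_writes=False)
    return policy


def tool_allowed(policy: EnvPolicy, tool_is_write: bool) -> bool:
    """Read-only tools always pass; write tools need allow_writes."""
    return policy.allow_writes or not tool_is_write
```

Keeping the policies frozen and deriving the incident variant with `replace` means read-only mode is a pure function of configuration and incident state, which is easy to audit.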
Soft limits vs hard limits
Soft limits (degrade gracefully)
When approaching budget:
- switch to cheaper models
- reduce context size (summarize)
- narrow tool search range
- skip non-essential steps
Hard limits (stop the run)
When budget is exceeded:
- stop tool calls
- stop escalation
- request user confirmation / approval
- produce a partial answer with an explanation
This mirrors the “control mechanism” idea behind error budgets: it gives the system permission to shift focus when constraints are exceeded. [1]
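The two tiers reduce to a small decision function. In this sketch the 80% soft threshold is an assumed default, and the hard stop fires as soon as spend reaches the budget so that the next call cannot push it over.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    DEGRADE = "degrade"  # soft limit: cheaper model, smaller context, fewer steps
    STOP = "stop"        # hard limit: halt, return a partial answer, ask for approval


def budget_action(spent_usd: float, budget_usd: float,
                  soft_fraction: float = 0.8) -> Action:
    """Map current spend to the action the agent loop should take next."""
    if spent_usd >= budget_usd:
        return Action.STOP
    if spent_usd >= soft_fraction * budget_usd:
        return Action.DEGRADE
    return Action.CONTINUE
```

Calling this before every model or tool call means degradation kicks in while there is still budget left to finish the answer.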
Circuit breakers for runaway behavior
Add circuit breakers that detect “this is going bad”:
- loop detector: same tool called with similar args repeatedly
- retry storm: high retry count for a tool within a run
- no progress: plan step count increases without new evidence
- latency breaker: tool p95 spikes beyond threshold
When triggered:
- stop the run
- quarantine the tool for this run
- degrade to safe alternatives
- emit high-signal telemetry
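The loop detector is the simplest of these breakers to sketch. Here "similar args" is approximated by normalizing strings (lowercasing, collapsing whitespace) and comparing canonical JSON; that notion of similarity is an assumption, and `LoopBreaker` is a hypothetical name. Arguments are assumed to be JSON-serializable.

```python
import json
from collections import Counter
from typing import Any


def _normalize(value: Any) -> Any:
    """Make near-identical arguments compare equal."""
    if isinstance(value, str):
        return " ".join(value.lower().split())
    if isinstance(value, dict):
        return {k: _normalize(value[k]) for k in sorted(value)}
    if isinstance(value, list):
        return [_normalize(v) for v in value]
    return value


def fingerprint(tool: str, args: dict) -> str:
    return tool + ":" + json.dumps(_normalize(args), sort_keys=True)


class LoopBreaker:
    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.seen = Counter()

    def record(self, tool: str, args: dict) -> bool:
        """Record a call; return True when the breaker trips."""
        key = fingerprint(tool, args)
        self.seen[key] += 1
        return self.seen[key] > self.max_repeats
```

Create one breaker per run and call `record` before dispatching each tool call; when it returns `True`, apply the responses listed above.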
Cost-aware tool and model selection
Cost control is easier if it’s designed into selection:
- Rank tools with a “cost weight” (latency + upstream cost + risk)
- Prefer read-only tools unless a write is required
- Use caches for common retrieval results
- Use deterministic summarization boundaries for tool outputs
If you already implement a tool selector (see “Million Tool Problem”), cost becomes another rerank feature.
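As a sketch of the "cost weight" idea, the rerank below subtracts a weighted cost from a relevance score. The weights, the `Candidate` fields, and the linear scoring formula are all assumptions chosen for illustration; risk is reduced to a flat penalty for write tools.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    relevance: float     # 0..1, from your existing selector
    latency_s: float     # typical latency in seconds
    upstream_usd: float  # expected upstream cost per call
    is_write: bool


def cost_weight(c: Candidate, w_latency: float = 0.1,
                w_usd: float = 10.0, w_risk: float = 0.5) -> float:
    """Combine latency, upstream cost, and risk into one number."""
    risk = 1.0 if c.is_write else 0.0
    return w_latency * c.latency_s + w_usd * c.upstream_usd + w_risk * risk


def rerank(cands: List[Candidate], cost_penalty: float = 0.5) -> List[Candidate]:
    """Order candidates by relevance minus a cost penalty, best first."""
    return sorted(cands,
                  key=lambda c: c.relevance - cost_penalty * cost_weight(c),
                  reverse=True)
```

With these weights, a slightly less relevant cached read-only lookup can outrank a slow, paid search, which is the behavior the list above asks for.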
Dashboards and alerts
This is where FinOps and SRE meet: cost is an operational signal.
Dashboards
- spend/day by tenant
- cost per run distribution
- top cost drivers (tools and models)
- runaway breaker triggers
Alerts
- daily spend exceeded
- sudden spend spikes (slope alerts)
- high frequency of loop breaker events
- high fraction of runs hitting hard limits
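For the spend-spike alert, a simple stand-in for a true slope calculation is to compare the most recent hours against the preceding baseline. The sketch below uses that ratio test; the window, factor, and minimum-spend floor are assumed defaults you would tune.

```python
from typing import List


def spend_spike(hourly_spend: List[float], window: int = 3,
                factor: float = 2.0, min_usd: float = 1.0) -> bool:
    """True when the recent average exceeds `factor` times the baseline average."""
    if len(hourly_spend) < 2 * window:
        return False  # not enough history to judge
    recent = hourly_spend[-window:]
    baseline = hourly_spend[:-window]
    recent_avg = sum(recent) / window
    baseline_avg = sum(baseline) / len(baseline)
    # The floor keeps tiny absolute amounts from paging anyone.
    return recent_avg >= min_usd and recent_avg > factor * baseline_avg
```

The absolute floor matters as much as the ratio: a jump from $0.01 to $0.10 per hour is a tenfold increase but rarely worth an alert.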
AWS’s Well-Architected Cost Optimization pillar frames cost optimization as a continual process across the workload lifecycle. That mindset applies here too. [4]
A production checklist
Budgets
- Per-run cost and tool-call budgets exist.
- Per-tenant daily caps exist.
- Per-tool “expensive operation” caps exist.
Enforcement
- Soft limits degrade gracefully (cheaper models, narrower queries).
- Hard limits stop and request approval.
- Circuit breakers detect loops/retry storms.
Telemetry
- Cost metrics emitted per run and per tenant.
- Breaker events recorded and alertable.
Culture
- Cost management is a shared practice (FinOps), not a surprise invoice. [3]
References
[1] Google SRE Workbook - Example Error Budget Policy: https://sre.google/workbook/error-budget-policy/
[2] Google SRE Book - Embracing Risk (error budgets as control mechanism): https://sre.google/sre-book/embracing-risk/
[3] FinOps Foundation - What is FinOps? (definition and principles): https://www.finops.org/introduction/what-is-finops/
[4] AWS Well-Architected Framework - Cost Optimization pillar: https://docs.aws.amazon.com/wellarchitected/latest/framework/cost-optimization.html