Cost Is a Reliability Problem

Why this matters

Traditional reliability focuses on uptime. AI systems add a second axis:

Your system can be “up” while your budget is on fire.

A runaway agent doesn’t always crash services. Sometimes it:

loops tool calls
retries incorrectly
escalates to larger models repeatedly
expands context windows unnecessarily
performs expensive searches without stopping

The result: surprise bills, throttling, and eventually hard outages when quotas are hit.

Google’s SRE framing around error budgets is a useful mental model: budgets create a control mechanism that balances stability with velocity. [1][2] FinOps frames cost management as a collaboration practice between engineering, finance, and business. [3]

This article is the practical bridge: use budgets and guardrails like you would for reliability.

TL;DR

Treat cost as an SLO: define acceptable spend per run / per tenant / per day.
Enforce budgets at multiple layers:
per request/run
per tool
per tenant
per environment
Use hard limits + soft limits:
soft: degrade model/tool choices
hard: stop the run and ask for approval
Add cost circuit breakers:
abort on runaway loops
quarantine tools causing repeated retries
Make cost visible (metrics + dashboards) so teams can improve it.
Align with FinOps: shared accountability, not “billing surprises.” [3]

Cost failure modes in agent systems
Define cost SLOs and budgets
Budget layers: run, tool, tenant, environment
Soft limits vs hard limits
Circuit breakers for runaway behavior
Cost-aware tool and model selection
Dashboards and alerts
A production checklist
References

Cost failure modes in agent systems

1) Infinite or long loops

Common triggers:

ambiguous tool outputs
brittle parsing
“try again” reflexes
non-idempotent retries

2) Tool spam

Agents sometimes “search until confident.” If you don’t cap it, you get 20+ tool calls on a single request.

3) Model escalation cascades

If your policy says “if uncertain, use a better model,” you can create a cost escalator:

cheap model -> “uncertain” -> expensive model
expensive model -> still uncertain -> more calls

4) Context growth

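If you keep appending tool outputs to the prompt, each model call resends the whole history, so cumulative input-token cost grows roughly quadratically with the number of steps. Long, noisy contexts can also degrade answer quality.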
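
A back-of-the-envelope calculation makes the growth concrete. This is a sketch under simple assumptions: every step appends a fixed-size tool output and resends the full history. The function name and token counts are made up for illustration.

```python
def cumulative_input_tokens(steps: int, base_prompt: int, tokens_per_output: int) -> int:
    """Total input tokens billed across a run when every step resends the full history."""
    total = 0
    context = base_prompt
    for _ in range(steps):
        total += context              # each model call pays for the whole context so far
        context += tokens_per_output  # the tool output is appended for the next call
    return total


# Hypothetical numbers: a 1,000-token prompt and 2,000 tokens appended per step.
for steps in (5, 10, 20):
    print(steps, cumulative_input_tokens(steps, 1_000, 2_000))
```

With these numbers the totals are 25,000, 100,000, and 400,000 input tokens: doubling the step count quadruples the bill. Summarizing or truncating tool outputs flattens that curve.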

5) External quotas become outages

Even if cost is acceptable, external services (email APIs, GitHub, calendars) can rate limit you. Cost and reliability are coupled.

Define cost SLOs and budgets

Start with simple “production truths”:

How much is one agent run allowed to cost?
What is an acceptable daily spend per tenant?
What is the max “blast radius” of a single request?

This maps cleanly to SRE’s error budget concept: budgets constrain unsafe behavior while preserving velocity. [2]

Example cost SLOs (pragmatic)

Per run: <= $0.10 (p95), <= $0.50 (max)
Per tenant/day: <= $50/day
Per user/day: <= $5/day
Expensive tools: <= 3 calls per run

These aren’t universal. They’re explicit. That’s what matters.
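
One way to keep SLOs explicit is to write them down as data and check observed run costs against them. The sketch below is illustrative only: the `CostSLO` class, its field names, and the nearest-rank percentile helper are assumptions, and the defaults simply reuse the example numbers above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostSLO:
    run_p95_usd: float = 0.10
    run_max_usd: float = 0.50
    tenant_daily_usd: float = 50.0
    user_daily_usd: float = 5.0
    max_expensive_tool_calls: int = 3


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile (integer math avoids float rounding surprises)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def run_cost_violations(slo: CostSLO, run_costs_usd: list[float]) -> list[str]:
    """Return the names of the per-run SLOs that the observed costs break."""
    problems = []
    if p95(run_costs_usd) > slo.run_p95_usd:
        problems.append("run_p95_usd")
    if max(run_costs_usd) > slo.run_max_usd:
        problems.append("run_max_usd")
    return problems


# Hypothetical sample: 18 cheap runs, one pricier run, one outlier.
costs = [0.02] * 18 + [0.08, 0.60]
print(run_cost_violations(CostSLO(), costs))
```

In this sample the p95 stays under $0.10, but the single $0.60 run breaks the per-run maximum, which is exactly the kind of outlier a hard limit is meant to catch.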

Budget layers: run, tool, tenant, environment

1) Per-run budget

Tracks:

max model tokens
max tool calls
max wall-clock time
max “expensive operations” count

This is the most important budget: it is where you stop runaway behavior early.
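
A minimal sketch of a per-run budget might look like the following. The `RunBudget` and `BudgetExceeded` names are hypothetical, and the idea is only that every model or tool call charges the budget and the run fails fast once any limit is crossed.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a run crosses one of its limits."""


@dataclass
class RunBudget:
    max_tokens: int
    max_tool_calls: int
    max_seconds: float
    max_expensive_ops: int
    tokens: int = 0
    tool_calls: int = 0
    expensive_ops: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int = 0, tool_calls: int = 0, expensive_ops: int = 0) -> None:
        """Record usage, then fail fast if any limit is now exceeded."""
        self.tokens += tokens
        self.tool_calls += tool_calls
        self.expensive_ops += expensive_ops
        self.check()

    def check(self) -> None:
        usage = {
            "tokens": (self.tokens, self.max_tokens),
            "tool_calls": (self.tool_calls, self.max_tool_calls),
            "expensive_ops": (self.expensive_ops, self.max_expensive_ops),
            "seconds": (time.monotonic() - self.started_at, self.max_seconds),
        }
        for name, (used, limit) in usage.items():
            if used > limit:
                raise BudgetExceeded(f"{name}: {used} > {limit}")
```

The agent loop would call `charge(...)` after each step and treat `BudgetExceeded` as a signal to stop and return a partial answer rather than as an unexpected crash.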

2) Per-tool budget

Some tools are inherently expensive:

large searches
long-running jobs
heavy data exports

Budget these separately:

max calls
max payload size
max time range
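
As a sketch, per-tool limits can live in a small table that is consulted before each call. The tool names, limit values, and the `rejection_reason` helper below are all hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class ToolLimit:
    max_calls: int
    max_payload_bytes: int
    max_time_range: timedelta


# Hypothetical tools and limits, for illustration only.
TOOL_LIMITS = {
    "log_search": ToolLimit(3, 1_000_000, timedelta(days=7)),
    "data_export": ToolLimit(1, 50_000_000, timedelta(days=30)),
}


def rejection_reason(tool: str, calls_so_far: int, payload_bytes: int,
                     time_range: timedelta) -> Optional[str]:
    """Return why a call should be refused, or None if it fits the tool's budget."""
    limit = TOOL_LIMITS.get(tool)
    if limit is None:
        return None  # unbudgeted tools pass here; a stricter policy could deny them
    if calls_so_far >= limit.max_calls:
        return "max calls reached"
    if payload_bytes > limit.max_payload_bytes:
        return "payload too large"
    if time_range > limit.max_time_range:
        return "time range too wide"
    return None
```

Returning a reason instead of a bare boolean lets the agent see why a call was refused, for example so it can narrow the time range and try again within budget.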

3) Per-tenant budget

Without this, your best customers can melt your infra.

Per-tenant limits:

requests/min
concurrent runs
daily cost cap
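
A rough sketch of the concurrency and daily-cap checks is below. It is in-memory and single-process for brevity, and the `TenantLedger` name is hypothetical. A real deployment would presumably keep these counters in shared storage, and requests/min would need a separate rate limiter that is not shown here.

```python
from collections import defaultdict
from datetime import date


class TenantLedger:
    def __init__(self, daily_cap_usd: float, max_concurrent_runs: int):
        self.daily_cap_usd = daily_cap_usd
        self.max_concurrent_runs = max_concurrent_runs
        self.spend = defaultdict(float)  # (tenant, day) -> USD spent that day
        self.active = defaultdict(int)   # tenant -> runs currently in flight

    def try_start_run(self, tenant: str, today: date) -> bool:
        """Admit a run only if the tenant is under both its cap and its concurrency limit."""
        if self.spend[(tenant, today)] >= self.daily_cap_usd:
            return False
        if self.active[tenant] >= self.max_concurrent_runs:
            return False
        self.active[tenant] += 1
        return True

    def finish_run(self, tenant: str, today: date, cost_usd: float) -> None:
        """Release the concurrency slot and book the run's cost against the day."""
        self.active[tenant] -= 1
        self.spend[(tenant, today)] += cost_usd
```

Because cost is booked when a run finishes, in-flight runs can overshoot the cap slightly. Pairing this with a per-run budget bounds that overshoot.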

4) Per-environment budget

Environments have different rules:

dev: cheap, permissive, more logging
prod: bounded, gated, auditable

This is where you implement “read-only mode” during incidents.

Soft limits vs hard limits

Soft limits (degrade gracefully)

When approaching budget:

switch to cheaper models
reduce context size (summarize)
narrow tool search range
skip non-essential steps

Hard limits (stop the run)

When budget is exceeded:

stop tool calls
stop escalation
request user confirmation / approval
produce a partial answer with an explanation

This mirrors the “control mechanism” idea behind error budgets: when a constraint is exceeded, the system has explicit permission to stop pursuing the original goal and shift to safer behavior. [1]
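
The soft/hard split can be sketched as a single decision function. The 80% soft threshold, the `Action` enum, and the use of integer cents are illustrative assumptions, not recommendations.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    DEGRADE = "degrade"  # soft limit: cheaper model, smaller context, narrower search
    STOP = "stop"        # hard limit: halt the run and ask for approval


def decide(spent_cents: int, budget_cents: int, soft_fraction: float = 0.8) -> Action:
    """Map current spend to an action; integer cents avoid float comparison surprises."""
    if spent_cents >= budget_cents:
        return Action.STOP
    if spent_cents >= soft_fraction * budget_cents:
        return Action.DEGRADE
    return Action.CONTINUE
```

Calling `decide` before every model or tool call keeps the policy in one place: the agent loop only needs to handle the three actions.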

Circuit breakers for runaway behavior

Add circuit breakers that detect “this is going bad”:

loop detector: same tool called with similar args repeatedly
retry storm: high retry count for a tool within a run
no progress: plan step count increases without new evidence
latency breaker: tool p95 spikes beyond threshold

When triggered:

stop the run
quarantine the tool for this run
degrade to safe alternatives
emit high-signal telemetry
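
As an illustration, here is a minimal loop detector. It simplifies “similar args” to exact equality after JSON normalization; fuzzy matching would be a separate design choice. The `LoopBreaker` name and default threshold are hypothetical.

```python
import json
from collections import Counter


class LoopBreaker:
    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.seen = Counter()  # (tool, normalized args) -> call count within this run

    def record(self, tool: str, args: dict) -> bool:
        """Record a tool call; return True when the breaker trips."""
        # sort_keys makes the key independent of argument order
        key = (tool, json.dumps(args, sort_keys=True))
        self.seen[key] += 1
        return self.seen[key] > self.max_repeats
```

One breaker instance per run is assumed. When `record` returns True, the run would stop or quarantine that tool and emit a breaker event for the dashboards.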

Cost-aware tool and model selection

Cost control is easier if it’s designed into selection:

Rank tools with a “cost weight” (latency + upstream cost + risk)
Prefer read-only tools unless a write is required
Use caches for common retrieval results
Summarize or truncate tool outputs at fixed, deterministic size limits before they enter the context

If you already implement a tool selector (see “Million Tool Problem”), cost becomes another rerank feature.
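
A sketch of cost as a rerank feature follows. It assumes the existing selector already produces a relevance score in [0, 1] and that latency, upstream cost, and risk have been normalized to the same range. The equal weighting, the `penalty` value, and the tool names are arbitrary illustrations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    relevance: float      # score from the existing tool selector, 0..1
    latency: float        # normalized 0..1
    upstream_cost: float  # normalized 0..1
    risk: float           # normalized 0..1


def cost_weight(c: Candidate) -> float:
    """Average the three cost signals into a single 0..1 weight."""
    return (c.latency + c.upstream_cost + c.risk) / 3


def rerank(candidates: list[Candidate], penalty: float = 0.5) -> list[Candidate]:
    """Order candidates by relevance minus a cost penalty, best first."""
    return sorted(candidates,
                  key=lambda c: c.relevance - penalty * cost_weight(c),
                  reverse=True)


# Hypothetical tools: the export is slightly more relevant but far more expensive.
tools = [
    Candidate("full_export", 0.90, 0.9, 0.9, 0.6),
    Candidate("cached_lookup", 0.80, 0.1, 0.1, 0.1),
]
print([c.name for c in rerank(tools)])
```

With the default penalty the cached lookup ranks first. Setting `penalty=0` recovers the pure relevance ordering, which makes the trade-off easy to tune and to explain.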

Dashboards and alerts

This is where FinOps and SRE meet: cost is an operational signal.

Dashboards

spend/day by tenant
cost per run distribution
top cost drivers (tools and models)
runaway breaker triggers

Alerts

daily spend exceeded
sudden spend spikes (slope alerts)
high frequency of loop breaker events
high fraction of runs hitting hard limits
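
A spend-spike check can be as simple as comparing the latest interval to a trailing baseline. The sketch below assumes hourly spend samples; the window, factor, and noise floor are placeholder values to tune against real traffic.

```python
from statistics import mean


def spend_spike(hourly_spend_usd: list[float], window: int = 6,
                factor: float = 3.0, floor_usd: float = 1.0) -> bool:
    """True when the latest hour exceeds `factor` times the trailing-window average."""
    if len(hourly_spend_usd) <= window:
        return False  # not enough history to form a baseline
    baseline = mean(hourly_spend_usd[-window - 1:-1])
    latest = hourly_spend_usd[-1]
    # the floor keeps tiny absolute amounts from paging anyone
    return latest >= floor_usd and latest > factor * baseline
```

In practice this logic would more likely live in the alerting system's own rule language. The point is that a relative threshold plus an absolute floor catches spikes well before the daily cap does.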

AWS’s Well-Architected Cost Optimization pillar frames cost optimization as a continual process across the workload lifecycle. That mindset applies here too. [4]

A production checklist

Budgets

Per-run cost and tool-call budgets exist.
Per-tenant daily caps exist.
Per-tool “expensive operation” caps exist.

Enforcement

Soft limits degrade gracefully (cheaper models, narrower queries).
Hard limits stop and request approval.
Circuit breakers detect loops/retry storms.

Telemetry

Cost metrics emitted per run and per tenant.
Breaker events recorded and alertable.

Culture

Cost management is a shared practice (FinOps), not a surprise invoice. [3]

References

[1] Google SRE Workbook - Example Error Budget Policy: https://sre.google/workbook/error-budget-policy/
[2] Google SRE Book - Embracing Risk (error budgets as control mechanism): https://sre.google/sre-book/embracing-risk/
[3] FinOps Foundation - What is FinOps? (definition and principles): https://www.finops.org/introduction/what-is-finops/
[4] AWS Well-Architected Framework - Cost Optimization pillar: https://docs.aws.amazon.com/wellarchitected/latest/framework/cost-optimization.html


Authors

Roy Gabriel

DevOps Architect · Applied AI Engineer

I’ve spent 20 years building systems across embedded systems, microcontrollers, PLCs, security platforms, fintech, SRE, and platform architecture. Today I focus on production AI systems in Go: multi-agent orchestration, MCP server ecosystems, and the DevOps platforms that keep them running. I care about systems that work under pressure: observable, recoverable, and built to last.
