<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Finops | Roy Gabriel</title><link>https://roygabriel.dev/tags/finops/</link><description>Roy Gabriel: DevOps Architect &amp; Applied AI Engineer. Technical blog on Go, MCP servers, Kubernetes, and production AI systems.</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Fri, 27 Feb 2026 03:18:04 +0000</lastBuildDate><atom:link href="https://roygabriel.dev/tags/finops/index.xml" rel="self" type="application/rss+xml"/><item><title>Cost Is a Reliability Problem</title><link>https://roygabriel.dev/blog/cost-is-a-reliability-problem/</link><pubDate>Sat, 13 Dec 2025 12:00:00 -0500</pubDate><guid>https://roygabriel.dev/blog/cost-is-a-reliability-problem/</guid><description>&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;Traditional reliability focuses on uptime. AI systems add a second axis:&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;&lt;strong&gt;Your system can be &amp;ldquo;up&amp;rdquo; while your budget is on fire.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A runaway agent doesn&amp;rsquo;t always crash services. Sometimes it:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;loops tool calls&lt;/li&gt;
&lt;li&gt;retries incorrectly&lt;/li&gt;
&lt;li&gt;escalates to larger models repeatedly&lt;/li&gt;
&lt;li&gt;expands context windows unnecessarily&lt;/li&gt;
&lt;li&gt;performs expensive searches without stopping&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The result: surprise bills, throttling, and eventually hard outages when quotas are hit.&lt;/p&gt;</description><content:encoded>&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;Traditional reliability focuses on uptime. AI systems add a second axis:&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;&lt;strong&gt;Your system can be &amp;ldquo;up&amp;rdquo; while your budget is on fire.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A runaway agent doesn&amp;rsquo;t always crash services. Sometimes it:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;loops tool calls&lt;/li&gt;
&lt;li&gt;retries incorrectly&lt;/li&gt;
&lt;li&gt;escalates to larger models repeatedly&lt;/li&gt;
&lt;li&gt;expands context windows unnecessarily&lt;/li&gt;
&lt;li&gt;performs expensive searches without stopping&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The result: surprise bills, throttling, and eventually hard outages when quotas are hit.&lt;/p&gt;
&lt;p&gt;Google&amp;rsquo;s SRE framing around &lt;strong&gt;error budgets&lt;/strong&gt; is a useful mental model: budgets create a control mechanism that balances stability with velocity. [1][2]
FinOps frames cost management as a collaboration practice between engineering, finance, and business. [3]&lt;/p&gt;
&lt;p&gt;This article is the practical bridge: &lt;strong&gt;use budgets and guardrails like you would for reliability.&lt;/strong&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="tldr"&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Treat cost as an SLO: define acceptable spend per run / per tenant / per day.&lt;/li&gt;
&lt;li&gt;Enforce budgets at multiple layers:
&lt;ul&gt;
&lt;li&gt;per request/run&lt;/li&gt;
&lt;li&gt;per tool&lt;/li&gt;
&lt;li&gt;per tenant&lt;/li&gt;
&lt;li&gt;per environment&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Use hard limits + soft limits:
&lt;ul&gt;
&lt;li&gt;soft: degrade model/tool choices&lt;/li&gt;
&lt;li&gt;hard: stop the run and ask for approval&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Add cost circuit breakers:
&lt;ul&gt;
&lt;li&gt;abort on runaway loops&lt;/li&gt;
&lt;li&gt;quarantine tools causing repeated retries&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Make cost visible (metrics + dashboards) so teams can improve it.&lt;/li&gt;
&lt;li&gt;Align with FinOps: shared accountability, not &amp;ldquo;billing surprises.&amp;rdquo; [3]&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="contents"&gt;Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#cost-failure-modes-in-agent-systems"&gt;Cost failure modes in agent systems&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#define-cost-slos-and-budgets"&gt;Define cost SLOs and budgets&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#budget-layers-run-tool-tenant-environment"&gt;Budget layers: run, tool, tenant, environment&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#soft-limits-vs-hard-limits"&gt;Soft limits vs hard limits&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#circuit-breakers-for-runaway-behavior"&gt;Circuit breakers for runaway behavior&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#cost-aware-tool-and-model-selection"&gt;Cost-aware tool and model selection&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#dashboards-and-alerts"&gt;Dashboards and alerts&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-production-checklist"&gt;A production checklist&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#references"&gt;References&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="cost-failure-modes-in-agent-systems"&gt;Cost failure modes in agent systems&lt;/h2&gt;
&lt;h3 id="1-infinite-or-long-loops"&gt;1) Infinite or long loops&lt;/h3&gt;
&lt;p&gt;Common triggers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ambiguous tool outputs&lt;/li&gt;
&lt;li&gt;brittle parsing&lt;/li&gt;
&lt;li&gt;&amp;ldquo;try again&amp;rdquo; reflexes&lt;/li&gt;
&lt;li&gt;non-idempotent retries&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="2-tool-spam"&gt;2) Tool spam&lt;/h3&gt;
&lt;p&gt;Agents sometimes &amp;ldquo;search until confident.&amp;rdquo;
If you don&amp;rsquo;t cap it, you get 20+ tool calls on a single request.&lt;/p&gt;
&lt;h3 id="3-model-escalation-cascades"&gt;3) Model escalation cascades&lt;/h3&gt;
&lt;p&gt;If your policy says &amp;ldquo;if uncertain, use a better model,&amp;rdquo; you can create a cost escalator:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;cheap model -&amp;gt; &amp;ldquo;uncertain&amp;rdquo; -&amp;gt; expensive model&lt;/li&gt;
&lt;li&gt;expensive model -&amp;gt; still uncertain -&amp;gt; more calls&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="4-context-growth"&gt;4) Context growth&lt;/h3&gt;
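&lt;p&gt;If you keep appending tool outputs to the prompt, every later model call re-sends the whole history, so cumulative token cost grows roughly quadratically with the number of steps, and performance can degrade.&lt;/p&gt;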
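&lt;p&gt;A small worked example of that growth, assuming each step appends a fixed amount of tool output and every model call re-sends the full context with no prompt caching. The function name and the token figures are illustrative, not measurements.&lt;/p&gt;

```python
def cumulative_prompt_tokens(base_tokens: int, tokens_per_step: int, steps: int) -> int:
    """Total prompt tokens sent across a run when each step re-sends the full history."""
    total = 0
    context = base_tokens
    for _ in range(steps):
        total += context            # the whole context is sent on this model call
        context += tokens_per_step  # the tool output is appended for the next call
    return total


# Hypothetical numbers: 1,000-token base prompt, 500 tokens of tool output per step.
ten = cumulative_prompt_tokens(base_tokens=1000, tokens_per_step=500, steps=10)
twenty = cumulative_prompt_tokens(base_tokens=1000, tokens_per_step=500, steps=20)
print(ten, twenty)  # 32500 115000
```

&lt;p&gt;Doubling the step count from 10 to 20 raises prompt tokens by about 3.5x under these assumptions, which is why summarizing or truncating tool outputs matters.&lt;/p&gt;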
&lt;h3 id="5-external-quotas-become-outages"&gt;5) External quotas become outages&lt;/h3&gt;
&lt;p&gt;Even if cost is acceptable, external services (email APIs, GitHub, calendars) can rate limit you.
Cost and reliability are coupled.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="define-cost-slos-and-budgets"&gt;Define cost SLOs and budgets&lt;/h2&gt;
&lt;p&gt;Start with simple &amp;ldquo;production truths&amp;rdquo;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How much is one agent run allowed to cost?&lt;/li&gt;
&lt;li&gt;What is an acceptable daily spend per tenant?&lt;/li&gt;
&lt;li&gt;What is the max &amp;ldquo;blast radius&amp;rdquo; of a single request?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This maps cleanly to SRE&amp;rsquo;s error budget concept: budgets constrain unsafe behavior while preserving velocity. [2]&lt;/p&gt;
&lt;h3 id="example-cost-slos-pragmatic"&gt;Example cost SLOs (pragmatic)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Per run:&lt;/strong&gt; &amp;lt;= $0.10 (p95), &amp;lt;= $0.50 (max)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Per tenant/day:&lt;/strong&gt; &amp;lt;= $50/day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Per user/day:&lt;/strong&gt; &amp;lt;= $5/day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Expensive tool calls per run:&lt;/strong&gt; &amp;lt;= 3&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These aren&amp;rsquo;t universal. They&amp;rsquo;re explicit. That&amp;rsquo;s what matters.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="budget-layers-run-tool-tenant-environment"&gt;Budget layers: run, tool, tenant, environment&lt;/h2&gt;
&lt;h3 id="1-per-run-budget"&gt;1) Per-run budget&lt;/h3&gt;
&lt;p&gt;Tracks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;max model tokens&lt;/li&gt;
&lt;li&gt;max tool calls&lt;/li&gt;
&lt;li&gt;max wall-clock time&lt;/li&gt;
&lt;li&gt;max &amp;ldquo;expensive operations&amp;rdquo; count&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Most important budget.&lt;/strong&gt; This is where you stop runaway behavior early.&lt;/p&gt;
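&lt;p&gt;A minimal sketch of a per-run budget that tracks the four limits listed above. The class name RunBudget, its fields, and the default values are hypothetical and chosen only for illustration; a real agent loop would call the record methods wherever it invokes a model or a tool.&lt;/p&gt;

```python
import time
from dataclasses import dataclass, field


@dataclass
class RunBudget:
    # Illustrative defaults, not recommendations.
    max_tokens: int = 50_000
    max_tool_calls: int = 10
    max_seconds: float = 60.0
    max_expensive_ops: int = 3
    tokens: int = 0
    tool_calls: int = 0
    expensive_ops: int = 0
    started: float = field(default_factory=time.monotonic)

    def record_model_call(self, tokens: int) -> None:
        self.tokens += tokens

    def record_tool_call(self, expensive: bool = False) -> None:
        self.tool_calls += 1
        if expensive:
            self.expensive_ops += 1

    def exceeded(self) -> list[str]:
        """Return the name of every limit this run has gone past."""
        elapsed = time.monotonic() - self.started
        checks = {
            "tokens": self.tokens > self.max_tokens,
            "tool_calls": self.tool_calls > self.max_tool_calls,
            "seconds": elapsed > self.max_seconds,
            "expensive_ops": self.expensive_ops > self.max_expensive_ops,
        }
        return [name for name, over in checks.items() if over]


budget = RunBudget(max_tool_calls=2)
for _ in range(3):
    budget.record_tool_call()
print(budget.exceeded())  # ['tool_calls']
```

&lt;p&gt;Checking the budget before each model or tool call, rather than after the run, is what lets it stop runaway behavior early.&lt;/p&gt;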
&lt;h3 id="2-per-tool-budget"&gt;2) Per-tool budget&lt;/h3&gt;
&lt;p&gt;Some tools are inherently expensive:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;large searches&lt;/li&gt;
&lt;li&gt;long-running jobs&lt;/li&gt;
&lt;li&gt;heavy data exports&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Budget these separately:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;max calls&lt;/li&gt;
&lt;li&gt;max payload size&lt;/li&gt;
&lt;li&gt;max time range&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="3-per-tenant-budget"&gt;3) Per-tenant budget&lt;/h3&gt;
&lt;p&gt;Without this, your best customers can melt your infra.&lt;/p&gt;
&lt;p&gt;Per-tenant limits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;requests/min&lt;/li&gt;
&lt;li&gt;concurrent runs&lt;/li&gt;
&lt;li&gt;daily cost cap&lt;/li&gt;
&lt;/ul&gt;
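&lt;p&gt;One way the daily cost cap could look, as an in-memory sketch. TenantDailyCap and try_charge are hypothetical names, and a production version would presumably keep the counters in a shared store rather than in process memory.&lt;/p&gt;

```python
from collections import defaultdict
from datetime import date
from typing import Optional


class TenantDailyCap:
    def __init__(self, cap_usd: float) -> None:
        self.cap_usd = cap_usd
        # Spend keyed by (tenant, day); in-memory only for this sketch.
        self.spend = defaultdict(float)

    def try_charge(self, tenant: str, usd: float, day: Optional[date] = None) -> bool:
        """Record the charge and return True, or return False if it would pass the cap."""
        key = (tenant, day or date.today())
        if self.spend[key] + usd > self.cap_usd:
            return False  # reject without recording the charge
        self.spend[key] += usd
        return True


cap = TenantDailyCap(cap_usd=50.0)
day_one = date(2025, 12, 13)
first = cap.try_charge("acme", 30.0, day_one)   # accepted
second = cap.try_charge("acme", 25.0, day_one)  # rejected: 55 would pass the cap
third = cap.try_charge("acme", 20.0, day_one)   # accepted: lands exactly on the cap
print(first, second, third)  # True False True
```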
&lt;h3 id="4-per-environment-budget"&gt;4) Per-environment budget&lt;/h3&gt;
&lt;p&gt;Environments have different rules:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;dev: cheap, permissive, more logging&lt;/li&gt;
&lt;li&gt;prod: bounded, gated, auditable&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is where you implement &amp;ldquo;read-only mode&amp;rdquo; during incidents.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="soft-limits-vs-hard-limits"&gt;Soft limits vs hard limits&lt;/h2&gt;
&lt;h3 id="soft-limits-degrade-gracefully"&gt;Soft limits (degrade gracefully)&lt;/h3&gt;
&lt;p&gt;When approaching budget:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;switch to cheaper models&lt;/li&gt;
&lt;li&gt;reduce context size (summarize)&lt;/li&gt;
&lt;li&gt;narrow tool search range&lt;/li&gt;
&lt;li&gt;skip non-essential steps&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="hard-limits-stop-the-run"&gt;Hard limits (stop the run)&lt;/h3&gt;
&lt;p&gt;When budget is exceeded:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;stop tool calls&lt;/li&gt;
&lt;li&gt;stop escalation&lt;/li&gt;
&lt;li&gt;request user confirmation / approval&lt;/li&gt;
&lt;li&gt;produce a partial answer with an explanation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is exactly the &amp;ldquo;control mechanism&amp;rdquo; idea behind error budgets: it gives the system permission to shift focus when constraints are exceeded. [1]&lt;/p&gt;
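&lt;p&gt;The soft/hard split can be sketched as a single decision function. The names Action and check_budget and the 80% soft threshold are assumptions made for this example; the sketch also treats reaching the cap as a hard stop.&lt;/p&gt;

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    DEGRADE = "degrade"  # soft limit: cheaper model, smaller context, narrower search
    STOP = "stop"        # hard limit: halt tool calls and ask for approval


def check_budget(spent_usd: float, budget_usd: float, soft_ratio: float = 0.8) -> Action:
    """Map current spend to an action; soft_ratio is an illustrative threshold."""
    if spent_usd >= budget_usd:
        return Action.STOP
    if spent_usd >= soft_ratio * budget_usd:
        return Action.DEGRADE
    return Action.CONTINUE


for spent in (0.10, 0.45, 0.50):
    print(spent, check_budget(spent, budget_usd=0.50).value)
```

&lt;p&gt;With a $0.50 run budget this prints continue, degrade, and stop for the three spend levels.&lt;/p&gt;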
&lt;hr&gt;
&lt;h2 id="circuit-breakers-for-runaway-behavior"&gt;Circuit breakers for runaway behavior&lt;/h2&gt;
&lt;p&gt;Add circuit breakers that detect &amp;ldquo;this is going bad&amp;rdquo;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;loop detector&lt;/strong&gt;: same tool called with similar args repeatedly&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;retry storm&lt;/strong&gt;: high retry count for a tool within a run&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;no progress&lt;/strong&gt;: plan step count increases without new evidence&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;latency breaker&lt;/strong&gt;: tool p95 spikes beyond threshold&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When triggered:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;stop the run&lt;/li&gt;
&lt;li&gt;quarantine the tool for this run&lt;/li&gt;
&lt;li&gt;degrade to safe alternatives&lt;/li&gt;
&lt;li&gt;emit high-signal telemetry&lt;/li&gt;
&lt;/ul&gt;
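&lt;p&gt;A minimal sketch of the loop detector. It simplifies &amp;ldquo;similar args&amp;rdquo; to identical arguments after canonical JSON serialization, which is an assumption of this example; fuzzy matching would need a similarity measure that is not shown here. LoopDetector and max_repeats are hypothetical names.&lt;/p&gt;

```python
import json
from collections import Counter


class LoopDetector:
    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        self.seen = Counter()

    def record(self, tool: str, args: dict) -> bool:
        """Record a tool call; return True once the same call exceeds max_repeats."""
        # sort_keys makes the key independent of argument order.
        key = (tool, json.dumps(args, sort_keys=True))
        self.seen[key] += 1
        return self.seen[key] > self.max_repeats


detector = LoopDetector(max_repeats=2)
results = [
    detector.record("search", {"q": "invoice 42", "limit": 10}),
    detector.record("search", {"limit": 10, "q": "invoice 42"}),
    detector.record("search", {"q": "invoice 42", "limit": 10}),
]
print(results)  # [False, False, True]
```

&lt;p&gt;When record returns True, the run loop would apply the responses listed above: stop, quarantine the tool for this run, and emit telemetry.&lt;/p&gt;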
&lt;hr&gt;
&lt;h2 id="cost-aware-tool-and-model-selection"&gt;Cost-aware tool and model selection&lt;/h2&gt;
&lt;p&gt;Cost control is easier if it&amp;rsquo;s designed into selection:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Rank tools with a &amp;ldquo;cost weight&amp;rdquo; (latency + upstream cost + risk)&lt;/li&gt;
&lt;li&gt;Prefer read-only tools unless a write is required&lt;/li&gt;
&lt;li&gt;Use caches for common retrieval results&lt;/li&gt;
&lt;li&gt;Use deterministic summarization boundaries for tool outputs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you already implement a tool selector (see &amp;ldquo;Million Tool Problem&amp;rdquo;), cost becomes another rerank feature.&lt;/p&gt;
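&lt;p&gt;A sketch of cost as a rerank feature, assuming an existing selector already produces a relevance score per tool. ToolCandidate, the weights, and the penalty factor are all hypothetical values picked for illustration and would need tuning against real traffic.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class ToolCandidate:
    name: str
    relevance: float     # 0..1, assumed to come from the existing selector
    latency_s: float     # typical latency in seconds
    upstream_usd: float  # typical upstream cost per call
    risk: float          # 0..1, e.g. higher for tools that write


def cost_weight(t: ToolCandidate, w_latency: float = 0.1,
                w_usd: float = 10.0, w_risk: float = 1.0) -> float:
    """Combine latency, upstream cost, and risk into one illustrative weight."""
    return w_latency * t.latency_s + w_usd * t.upstream_usd + w_risk * t.risk


def rerank(candidates: list[ToolCandidate], cost_penalty: float = 0.5) -> list[ToolCandidate]:
    """Order candidates by relevance minus a penalty proportional to cost weight."""
    return sorted(candidates,
                  key=lambda t: t.relevance - cost_penalty * cost_weight(t),
                  reverse=True)


candidates = [
    ToolCandidate("full_search", relevance=0.90, latency_s=3.0, upstream_usd=0.02, risk=0.1),
    ToolCandidate("cached_lookup", relevance=0.80, latency_s=0.2, upstream_usd=0.0, risk=0.0),
]
ranked_names = [t.name for t in rerank(candidates)]
print(ranked_names)  # ['cached_lookup', 'full_search']
```

&lt;p&gt;Here the slightly less relevant cached lookup wins because the full search carries a much larger cost weight.&lt;/p&gt;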
&lt;hr&gt;
&lt;h2 id="dashboards-and-alerts"&gt;Dashboards and alerts&lt;/h2&gt;
&lt;p&gt;This is where FinOps and SRE meet: cost is an operational signal.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dashboards&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;spend/day by tenant&lt;/li&gt;
&lt;li&gt;cost per run distribution&lt;/li&gt;
&lt;li&gt;top cost drivers (tools and models)&lt;/li&gt;
&lt;li&gt;runaway breaker triggers&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Alerts&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;daily spend exceeded&lt;/li&gt;
&lt;li&gt;sudden spend spikes (slope alerts)&lt;/li&gt;
&lt;li&gt;high frequency of loop breaker events&lt;/li&gt;
&lt;li&gt;high fraction of runs hitting hard limits&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;AWS&amp;rsquo;s Well-Architected Cost Optimization pillar frames cost optimization as a continual process across the workload lifecycle. That mindset applies here too. [4]&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="a-production-checklist"&gt;A production checklist&lt;/h2&gt;
&lt;h3 id="budgets"&gt;Budgets&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Per-run cost and tool-call budgets exist.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Per-tenant daily caps exist.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Per-tool &amp;ldquo;expensive operation&amp;rdquo; caps exist.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="enforcement"&gt;Enforcement&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Soft limits degrade gracefully (cheaper models, narrower queries).&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Hard limits stop and request approval.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Circuit breakers detect loops/retry storms.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="telemetry"&gt;Telemetry&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Cost metrics emitted per run and per tenant.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Breaker events recorded and alertable.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="culture"&gt;Culture&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Cost management is a shared practice (FinOps), not a surprise invoice. [3]&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;p&gt;[1] Google SRE Workbook - Example Error Budget Policy: &lt;a href="https://sre.google/workbook/error-budget-policy/" target="_blank" rel="noopener noreferrer"&gt;https://sre.google/workbook/error-budget-policy/&lt;/a&gt;&lt;br&gt;
[2] Google SRE Book - Embracing Risk (error budgets as control mechanism): &lt;a href="https://sre.google/sre-book/embracing-risk/" target="_blank" rel="noopener noreferrer"&gt;https://sre.google/sre-book/embracing-risk/&lt;/a&gt;&lt;br&gt;
[3] FinOps Foundation - What is FinOps? (definition and principles): &lt;a href="https://www.finops.org/introduction/what-is-finops/" target="_blank" rel="noopener noreferrer"&gt;https://www.finops.org/introduction/what-is-finops/&lt;/a&gt;&lt;br&gt;
[4] AWS Well-Architected Framework - Cost Optimization pillar: &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/cost-optimization.html" target="_blank" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/wellarchitected/latest/framework/cost-optimization.html&lt;/a&gt;
&lt;/p&gt;</content:encoded></item></channel></rss>