Agent governance · Cost & FinOps
AI agent cost management: a three-axis decomposition (LLM + SaaS-tool + infra)
"AI agent cost management" is three different problems wearing one name. LLM-token cost is one axis, SaaS-tool cost is another, and infrastructure cost is the third. Each has a different blast radius, a different control surface, and a different category of tool that fixes it. Naming them separately is how you stop spending months on a control that addresses 4% of your monthly bill while ignoring the 80% you can't see.
TL;DR
Decompose every dollar an agent spends into three axes.
- Axis 1 — LLM cost is what you pay OpenAI, Anthropic, or your own GPU bill for tokens. Bounded by model rate, usually $0.01 to $0.10 per call; controlled by an LLM gateway (LiteLLM, Portkey, Helicone).
- Axis 2 — SaaS-tool cost is what your agent spends on Stripe, Twilio, Resend, or Shopify when it acts on the world. Unbounded — a stuck refund loop is $15 × 1,440 calls/day × 30 days, which is $648,000 in a month from one mistake. Controlled by a SaaS-tool governance proxy (this is what Keybrake is).
- Axis 3 — Infra cost is compute, storage, vector DB, network. Mostly bounded by your cloud-bill ceilings; controlled by your existing FinOps stack.
The mistake almost every team makes is putting Axis 1 controls in front of an Axis 2 problem.
Why "cost management" alone is the wrong framing
The phrase AI agent cost management is in trouble before the conversation starts, because the people typing it are usually triaging a specific incident — and the incident only ever sits on one axis at a time. A CTO who watched the OpenAI bill jump from $200 to $4,000 in a week is not on the same axis as a CTO who watched Stripe charge a customer twice and refund five times in a fifteen-minute window. The first is an LLM-token problem; the second is a money-moving SaaS-tool problem. They share zero controls. They share zero remediation patterns. The toolkit that solves one of them does nothing for the other.
So the first move is structural: stop calling it cost management in the singular. Call it three-axis cost management. Pick the axis the incident lives on. Apply the control that lives on the same axis. The rest of this page is the map.
Axis 1 — LLM cost
This is the one most teams notice first because the bill arrives once a month with a single number on it. You pay OpenAI, Anthropic, Google, or your own GPU rental for the tokens your agent's model generates and consumes. The cost-per-call is bounded by the context window: a single GPT-4o call costs roughly $0.01 for a short request, up to $0.10 or so for a long-context coding task. The unit is tokens, the parser is the response's usage object, the runaway pattern is "agent loops on a tool call and re-injects the full conversation history each time."
What the LLM-gateway category does about it: rate-limit per agent or per key, cap monthly tokens or dollars per virtual key, route models by cost tier (use Haiku 4.5 for the cheap calls, Opus 4.7 for the hard ones), cache identical prompts, and surface the per-call cost in a dashboard. Players: LiteLLM, Portkey, Helicone, Bifrost, OpenRouter — each with a slightly different stance (LiteLLM is a fan-out proxy, Portkey is a routing-and-policy plane, Helicone is observability-first).
Worst-case shape: a stuck context-stuffing loop on Opus 4.7. At $15/Mtok input and a 200K-token context refilled by a re-injecting agent once every five seconds, that's 200K × 12/min × $15/M ≈ $36/min ≈ $2,160/hr. Bad, but bounded by your Anthropic spend cap if you set one. The bleed is measured in tens of thousands of dollars per day, not millions, because per-call cost is bounded by the context window.
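The back-of-envelope arithmetic is worth keeping as a reusable function. A minimal sketch, with the context size, refill rate, and per-Mtok rate as the example's assumptions rather than universal figures:

```python
def llm_runaway_per_hour(context_tokens: int, refills_per_min: float,
                         usd_per_mtok: float) -> float:
    """Input-token cost per hour for a loop that re-sends the whole context."""
    tokens_per_min = context_tokens * refills_per_min
    return tokens_per_min * 60 * usd_per_mtok / 1_000_000

# 200K-token context, one refill every 5 seconds, $15/Mtok input:
print(f"${llm_runaway_per_hour(200_000, 12, 15.0):,.0f}/hr")  # → $2,160/hr
```

The useful property of the function is that every Axis 1 runaway is linear in the refill rate; there is no multiplicative term, which is exactly why this axis is the bounded one.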
This is the axis the popular "AI agent cost" content is about. It is also the axis with the smallest blast radius.
Axis 2 — SaaS-tool cost
This is the axis that ships incidents. When your agent calls Stripe to create a charge, Twilio to send an SMS, Resend to send an email, Shopify to update a product, or Postmark to deliver a transactional, the agent is spending real money on the world's behalf. The unit is dollars per action, not tokens, and the response is the only authoritative source.
Per-call costs across the v1 vendors:
- Stripe charges — ~$15 average for a SaaS subscription charge, but anything from a $0.50 micro-transaction to a $5,000 enterprise renewal. The cost is the amount field on the charge response.
- Twilio SMS — $0.0079 per US SMS, parsed from the price field on the message resource after the status callback (the initial 201 doesn't have it yet).
- Resend email — ~$0.0004 per email at the 50K-tier paid plan, computed from your tier table because there's no per-email cost field on the API response.
- OpenAI API calls — ~$0.01 average per GPT-4o call, parsed from the usage object × the model rate.
Now the runaway math. At one call every 400ms — typical for an agent in a tight loop — that's 2.5 calls/second, 9,000/hour, 216,000/day. At Stripe's $15 average charge, a stuck-refund loop running unattended for one day is $3.24 million. At Twilio's $0.0079, the same loop is $1,706/day — bad but survivable. At Resend, $86/day — annoying. The blast radius is not even on the LLM axis. Our agent blowout calculator lets you slide the call rate per vendor and see the 24-hour cost in real time.
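The runaway math above can be checked in a few lines. The per-call averages are the figures quoted in this section; your own vendor mix will differ:

```python
CALL_INTERVAL_S = 0.4                          # one call every 400ms
CALLS_PER_DAY = int(86_400 / CALL_INTERVAL_S)  # 216,000 calls/day

AVG_COST_USD = {            # averages from the list above; illustrative
    "stripe_charge": 15.0,
    "twilio_sms": 0.0079,
    "resend_email": 0.0004,
}

for vendor, unit_cost in AVG_COST_USD.items():
    print(f"{vendor}: ${CALLS_PER_DAY * unit_cost:,.2f}/day")
# stripe_charge: $3,240,000.00/day
```

The point of running it per vendor is that the loop shape is identical in every row; only the per-call price changes, and the per-call price is what turns an annoyance into an extinction event.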
What the LLM-gateway category does about it: nothing. LiteLLM and Portkey were not built to govern a Stripe charge. They were built for OpenAI-compatible model traffic; pointing them at api.stripe.com fails on three technical fronts (path schema, response parsing, auth envelope) — we cover this in the LiteLLM-for-Stripe page. The rate-limiter is not the answer either: rate-limiting a Stripe charge to 1 per second still bills $15 per second.
What actually contains Axis 2: a SaaS-tool governance proxy. Issue a vault key, attach a policy (per-day USD cap, endpoint allowlist, customer allowlist, mid-run-revocable expires_at), have the agent call proxy.keybrake.com/stripe/v1/charges, parse the cost from the response, log it, enforce the cap at write-time. This is a different category from the LLM gateway. It exists because the unit is dollars-per-action, the runaway is multiplicative across calls, and the parser has to understand vendor-specific response shapes. The 2026 agent governance stack walks through where this proxy sits relative to the LLM gateway in the full architecture.
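As a sketch of what write-time enforcement means, the following shows a per-day USD cap plus an endpoint allowlist checked before the vendor call goes out. The Policy fields and exception names are illustrative, not Keybrake's actual API:

```python
from dataclasses import dataclass

@dataclass
class Policy:
    daily_cap_usd: float
    endpoint_allowlist: set
    spent_today_usd: float = 0.0

class CapExceeded(Exception): pass
class EndpointDenied(Exception): pass

def authorize(policy: Policy, endpoint: str, cost_usd: float) -> None:
    """Reject the call before it reaches the vendor if it would break policy."""
    if endpoint not in policy.endpoint_allowlist:
        raise EndpointDenied(endpoint)
    if policy.spent_today_usd + cost_usd > policy.daily_cap_usd:
        raise CapExceeded(f"${policy.daily_cap_usd}/day cap would be breached")
    policy.spent_today_usd += cost_usd  # in practice, commit after the vendor call succeeds

p = Policy(daily_cap_usd=100.0, endpoint_allowlist={"/v1/charges"})
authorize(p, "/v1/charges", 15.0)  # allowed; $85 of headroom remains today
```

One real-world wrinkle the sketch glosses over: for request-priced actions (a Stripe charge carries its amount in the request) the cost is known before the call; for response-priced actions (a Twilio SMS) the proxy enforces against the running total and reconciles after parsing the response.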
Axis 3 — Infra cost
Where the agent runs: the compute that rents its CPU and memory; the storage that holds its database, vector index, and document cache; the network egress. Mostly the same FinOps shape as a normal web app: AWS / Fly / Railway / Render bills that scale with traffic and are capped by instance ceilings.
Per-agent ranges in the wild:
- Compute — $5-50/mo for a small persistent agent VM; $0-thousands for an agent that schedules ephemeral GPU jobs (this is where the math gets nasty if your "agent" trains a model).
- Vector DB — Pinecone $70/mo starter, Postgres+pgvector self-hosted $5-50/mo, Weaviate Cloud variable.
- Object storage — usually pennies; the gotcha is egress not storage.
- Browser-use / Selenium-grid / sandboxed code execution — $20-200/mo per concurrent slot if you're running a hosted browser provider; cheaper self-hosted but with operational cost.
Worst-case shape: an agent that recursively spawns sub-agents without a depth limit. Each sub-agent gets its own VM or container; with two children per agent, ten levels deep is 2^10 = 1,024 instances; on a $0.05/hour instance class that's $51/hour, $1,228/day. Bad, but most cloud accounts have an instance limit somewhere that catches it before the bill does.
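The cheapest guard against that shape is a depth counter the spawner itself enforces. A minimal sketch — the header name and limit are assumptions, not a standard:

```python
MAX_DEPTH = 3  # assumption: set to your deepest legitimate delegation chain

def spawn_subagent(parent_headers: dict) -> dict:
    """Provision a child agent, refusing to go past the depth limit."""
    depth = int(parent_headers.get("x-agent-spawn-depth", "0"))
    if depth >= MAX_DEPTH:
        raise RuntimeError(f"spawn depth {depth} hit the limit of {MAX_DEPTH}")
    # ...provision the child's VM/container here...
    return {"x-agent-spawn-depth": str(depth + 1)}

headers = {}
for _ in range(MAX_DEPTH):
    headers = spawn_subagent(headers)  # three levels succeed
# a fourth spawn_subagent(headers) now raises instead of provisioning
```

Because the counter rides in the propagated headers, the limit holds even when children are provisioned by different services; the cloud-side instance quota stays as the backstop, not the first line.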
What controls Axis 3: your existing FinOps stack. Cloud spend caps, instance-count limits, autoscaling rules with floors and ceilings, budget alerts in AWS Cost Explorer or GCP Billing. None of this is agent-specific. If you have your normal application's cloud spend under control, your agent's infra cost is the same problem with the same solutions. The mistake here is the opposite of the Axis 1/2 mistake: people sometimes only have FinOps controls, assume that covers the agent, and discover the hard way that AWS has no idea what an agent run is or what a Stripe charge cost.
Cost shapes — when each axis dominates
Three plausible monthly bills for three different agent shapes. Each one is dominated by a single axis, and the dominant axis tells you which control category to invest in first.
| Agent shape | Axis 1 (LLM) | Axis 2 (SaaS-tool) | Axis 3 (infra) | Dominant |
|---|---|---|---|---|
| Long-running research agent (reads docs, writes summaries) | $800/mo | $0/mo | $30/mo | LLM |
| Customer-support agent on Stripe + Resend (refunds, receipts, follow-ups) | $120/mo | $2,400/mo expected · $648,000 worst-case if stuck | $25/mo | SaaS-tool |
| Self-hosted agent with own fine-tuned 70B model on rented GPUs | ~$0 (own model) | $80/mo | $3,200/mo (GPU) | Infra |
The customer-support row is the one to internalise. In steady state, the expected SaaS-tool bill is twenty times the LLM bill — and the worst-case is more than five thousand times the steady-state LLM bill. If your team has spent more time picking an LLM gateway than picking a SaaS-tool governance proxy, you've optimised the wrong axis.
Attribution — tying cost back to one agent run
The other half of cost management is asking which run cost what. An answer to "our OpenAI bill jumped" is useless if you can't say which of last week's 4,000 agent runs caused it. The pattern is the same across all three axes: every component in the stack records an agent_run_id that the agent generates once and propagates as a header. The join then lives in SQL.
Concretely: the agent sets x-agent-run-id: run_2026_04_25_8a3f on every outbound call. The LLM gateway records the run id alongside the model call and token cost (Axis 1). The SaaS-tool governance proxy records it alongside the parsed dollar cost (Axis 2). Your application logs record it alongside the latency and the application-side computation (Axis 3 — partial; cloud bills don't have per-run granularity unless you tag instances). Now SELECT SUM(cost_usd) FROM agent_call_audit WHERE agent_run_id = ? gives you the answer. Our audit-trail page covers the four-column MVP schema, and the long-form schema post has the full sixteen-column reference with indexes and the five queries that earn it.
Without the join key, the three axes are three disconnected line items on three different bills. With the join key, they're three columns of a single per-run cost row.
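A minimal end-to-end demo of that join, using sqlite as a stand-in for your warehouse. The four columns are in the spirit of the MVP schema referenced above; the exact column names here are illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE agent_call_audit (
    agent_run_id TEXT, axis TEXT, vendor TEXT, cost_usd REAL)""")

run = "run_2026_04_25_8a3f"
rows = [
    (run, "llm",   "openai", 0.04),    # Axis 1: gateway writes token cost
    (run, "saas",  "stripe", 15.00),   # Axis 2: proxy writes parsed amount
    (run, "saas",  "resend", 0.0004),
    (run, "infra", "aws",    0.002),   # Axis 3: amortised per-run share
]
db.executemany("INSERT INTO agent_call_audit VALUES (?,?,?,?)", rows)

(total,) = db.execute(
    "SELECT SUM(cost_usd) FROM agent_call_audit WHERE agent_run_id = ?",
    (run,)).fetchone()
print(f"run cost: ${total:.4f}")  # → run cost: $15.0424
```

Notice where the dollars sit in the output: the Axis 2 row dwarfs the other three, which is the customer-support-agent cost shape from the table above showing up per run instead of per month.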
The single number CFOs actually want
Operationally, the question is cost per agent run. The CFO doesn't care that you've split it into three axes; they care whether the unit economics work. If your agent makes one customer-support resolution worth (let's say) $5 in retention, the run can cost up to that, in total across the three axes, before the agent costs more than it earns.
The sum is simple once the join key is set: cost_per_run = sum_axis_1_tokens × rate + sum_axis_2_parsed_cost + amortised_axis_3. The hard part is having all three numbers; that's the audit trail's job. Once you have the per-run cost, the next level up is cost per agent version — comparing v3.4.1 of your support agent against v3.4.2, because cost-of-goods-sold has to come down release-over-release the same way it does for software in general.
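The formula is one line of code once the audit trail supplies the three axis totals for a run. A sketch with illustrative numbers:

```python
def cost_per_run(axis1_tokens: int, usd_per_mtok: float,
                 axis2_parsed_usd: float, axis3_amortised_usd: float) -> float:
    """Sum the three axes into the single number the CFO asks for."""
    return (axis1_tokens * usd_per_mtok / 1_000_000
            + axis2_parsed_usd
            + axis3_amortised_usd)

# e.g. 60K tokens at $3/Mtok, $15.0004 of Stripe + Resend, $0.002 of infra:
print(round(cost_per_run(60_000, 3.0, 15.0004, 0.002), 4))  # 15.1824
```

Against a $5-per-resolution retention value, this hypothetical run is upside-down by a factor of three, and the decomposition tells you immediately that shaving tokens won't fix it; the Stripe amount will.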
Teams that don't have the per-run number end up arguing about averages. Teams that have it can put a cost-per-resolution chart on a wall and run the agent product like a unit-economics business.
Three antipatterns
Putting an LLM gateway in front of an Axis 2 incident. The Stripe bill is the problem. The team reaches for LiteLLM because that's the agent-cost tool they've heard of. Six weeks later they have a great LLM dashboard and the Stripe blowout still happens, because LiteLLM's rate-limiter doesn't know what a Stripe charge response is. Different axis. Different category.
Setting a daily LLM cap and calling it a "kill switch." A $200/day OpenAI cap stops Axis 1. It does nothing about Axis 2; the agent can have $0 of remaining LLM budget and still issue a Stripe refund. The four real kill-switch patterns live on the SaaS-tool axis and have measured propagation latencies (Stripe key revoke median 45s, p95 3m12s; Twilio 30s-2m; Resend near-instant). An LLM cap does not replace them.
Treating cloud-billing alerts as the cost control. An AWS Budget Alert at $5,000/month catches Axis 3 at the end of the month. By the time the alert fires, the cost has already been incurred. Axis 1 and Axis 2 don't even appear there — the model bill comes from OpenAI or Anthropic, the Stripe charges move via Stripe's payout schedule, and neither shows up in your AWS console. Cloud-billing alerts are a backstop, not a primary control.
How the three controls compose
The 2026 stack we recommend looks like a three-layer cake, plus the audit table that joins them. Each layer's controls live on a single axis, and each layer is replaceable independently of the others.
- LLM gateway in front of model calls — handles Axis 1. Pick from the five-option open-source review; we don't ship one. Caps on tokens and per-virtual-key spend; routing across model tiers; observability on prompts and completions.
- SaaS-tool governance proxy in front of vendor APIs — handles Axis 2. Keybrake is this layer. Per-day USD cap per vendor, endpoint allowlist, customer-scope allowlist, mid-run revoke under one second, parsed cost per call.
- Cloud FinOps stack below both — handles Axis 3. AWS / GCP / Azure billing alerts, instance limits, autoscaling ceilings, budget reviews. Existing tooling.
- Audit table joining all three on agent_run_id — produces the per-run cost that lets you ask whether the unit economics work. Four-column MVP schema; full reference in the schema post.
If one of the three control layers is missing, your cost-management story has a blind spot whose width equals the size of the axis you didn't cover. The customer-support row above is the canonical case for why missing layer 2 is the most expensive of the three blind spots.
Related questions
Doesn't an LLM gateway with rate-limiting solve cost management?
It solves Axis 1. It does not solve Axis 2 (SaaS-tool spend) and is not designed to. LiteLLM, Portkey, Helicone, OpenRouter — all great tools — speak the OpenAI-compatible model API and operate on token-based units. They cannot rate-limit a Stripe charge to less than $15 per call because the rate-limiter has no concept of the response's amount field. The largest cost incident your agent will ever cause is almost always on the SaaS-tool axis, and the LLM gateway is silent about it.
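The arithmetic behind that last point fits in three lines: a perfectly enforced 1-call-per-second rate limit, held for a day, still compounds at the charge amount.

```python
# Rate limiting caps the call rate, not the dollars each call moves.
calls_per_day = 1 * 86_400     # one Stripe charge per second, all day
avg_charge_usd = 15.0          # the average used throughout this page
print(f"${calls_per_day * avg_charge_usd:,.0f}/day")  # → $1,296,000/day
```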
Can I just set Stripe spend caps in Stripe itself?
Stripe Restricted Keys give you scope (which endpoints, which resources) but no per-day USD cap, no parameter-level allowlist, and no sub-second mid-run revocation. The 10-control coverage matrix walks through what Stripe-native covers and what it doesn't — final count is 3 Yes, 2 Partial, 5 No. The five "No" controls are the ones that contain the cost-blowout cases; that's the gap a SaaS-tool governance proxy fills.
Should I worry about Axis 1 (LLM) at all if Axis 2 dominates the bill?
You should still control it, but the order matters. Build Axis 2 first because a single bad day there ends the company; then Axis 1 because cost-per-run unit economics depend on it; then Axis 3 because most teams already have it via existing FinOps. Reverse-order investment is the most common mistake — teams polish Axis 1 dashboards while Axis 2 has zero protection because the LLM-cost content was easier to find on Google.
How small does an agent project have to be to skip this entirely?
A side project with no money-moving outbound calls (no Stripe, no Twilio, no Resend) and a small monthly LLM budget can skip the proxies. Set a $50/month spend cap on your OpenAI key, set a cloud-billing alert at $100, ship. The minute you wire in a vendor that costs real money per call — even just Resend at fractions of a cent — the math changes, because the runaway shape stops being bounded by token-rate and starts being bounded by your bank-account size.
Where does fine-tuning or prompt-caching cost fit?
Fine-tuning is Axis 1 with a long depreciation tail — you pay once to train and amortise across calls. Prompt-caching is an Axis 1 cost reducer; both Anthropic and OpenAI offer it, and both make cached prompt tokens cheaper than recomputing them at typical agent context lengths. Neither affects Axes 2 or 3. The audit row should still record the per-call cost so the cache savings show up in the per-run number — otherwise you can't measure whether the cache is paying for itself.
Further reading
- The 2026 agent governance stack: which proxy goes where — the four-layer composition (LLM traffic / LLM observability / SaaS-tool governance / agent identity) with measures-in / prevents framing per layer.
- LiteLLM alternative for Stripe — why pointing an LLM gateway at a SaaS-tool API fails on three technical fronts, and the dual-proxy alternative.
- LiteLLM alternatives — honest open-source review — five-option review of Portkey, Helicone, LangGate, OpenRouter proxy, and Bifrost (the Axis-1 toolbox).
- AI agent kill-switch — patterns and stop-latency — the four real ways to stop a running agent on the Axis-2 surface, with measured propagation numbers per vendor.
- AI agent audit trail — what belongs in one — the four-column MVP schema for joining cost rows on agent_run_id.
- The anatomy of an AI agent audit trail (long form) — sixteen-column reference, six indexes, five operational queries with full SQL.
- Agent blowout calculator — interactive tool: pick a vendor and a calls-per-minute slider, see the 24-hour cost on Axis 2 with and without a cap.
- Rotate vs revoke: a 2am playbook for a stuck agent — Axis-2 incident response with two side-by-side timelines and a per-vendor propagation table.