Agent governance · Cost & FinOps
AI agent cost management: a three-axis decomposition (LLM + SaaS-tool + infra)
"AI agent cost management" is three different problems wearing one name. LLM-token cost is one axis, SaaS-tool cost is another, and infrastructure cost is the third. Each has a different blast radius, a different control surface, and a different category of tool that fixes it. Naming them separately is how you stop spending months on a control that addresses 4% of your monthly bill while ignoring the 80% you can't see.
TL;DR
Decompose every dollar an agent spends into three axes.
- Axis 1 — LLM cost is what you pay OpenAI, Anthropic, or your own GPU bill for tokens. Bounded by model rate, usually $0.01 to $0.10 per call; controlled by an LLM gateway (LiteLLM, Portkey, Helicone).
- Axis 2 — SaaS-tool cost is what your agent spends on Stripe, Twilio, Resend, or Shopify when it acts on the world. Unbounded — a stuck refund loop is $15 × 1,440 calls/day × 30 days, which is $648,000 in a month from one mistake. Controlled by a SaaS-tool governance proxy (this is what Keybrake is).
- Axis 3 — Infra cost is compute, storage, vector DB, network. Mostly bounded by your cloud-bill ceilings; controlled by your existing FinOps stack.
The mistake almost every team makes is putting Axis 1 controls in front of an Axis 2 problem.
Why "cost management" alone is the wrong framing
The phrase AI agent cost management is in trouble before the conversation starts, because the people typing it are usually triaging a specific incident — and the incident only ever sits on one axis at a time. A CTO who watched the OpenAI bill jump from $200 to $4,000 in a week is not on the same axis as a CTO who watched Stripe charge a customer twice and refund five times in a fifteen-minute window. The first is an LLM-token problem; the second is a money-moving SaaS-tool problem. They share zero controls. They share zero remediation patterns. The toolkit that solves one of them does nothing for the other.
So the first move is structural: stop calling it cost management in the singular. Call it three-axis cost management. Pick the axis the incident lives on. Apply the control that lives on the same axis. The rest of this page is the map.
Axis 1 — LLM cost
This is the one most teams notice first because the bill arrives once a month with a single number on it. You pay OpenAI, Anthropic, Google, or your own GPU rental for the tokens your agent's model generates and consumes. The cost-per-call is bounded by the context window: a single GPT-4o call costs roughly $0.01 for a short request, up to $0.10 or so for a long-context coding task. The unit is tokens, the parser is the response's usage object, the runaway pattern is "agent loops on a tool call and re-injects the full conversation history each time."
What the LLM-gateway category does about it: rate-limit per agent or per key, cap monthly tokens or dollars per virtual key, route models by cost tier (use Haiku 4.5 for the cheap calls, Opus 4.7 for the hard ones), cache identical prompts, and surface the per-call cost in a dashboard. Players: LiteLLM, Portkey, Helicone, Bifrost, OpenRouter — each with a slightly different stance (LiteLLM is a fan-out proxy, Portkey is a routing-and-policy plane, Helicone is observability-first).
Worst-case shape: a stuck context-stuffing loop on Opus 4.7. At $15/Mtok input and a 200K-token context refilled by a re-injecting agent once every five seconds, that's 200K × 12/min × $15/M ≈ $36/min ≈ $2,160/hr. Bad, but bounded by your Anthropic spend cap if you set one. The bleed is measured in tens of thousands of dollars per day, not millions, because per-call cost is bounded by the context window.
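The back-of-envelope arithmetic is worth keeping as a reusable function. A minimal sketch, with the context size, refill rate, and per-Mtok rate as the example's assumptions rather than universal figures:

```python
def llm_runaway_per_hour(context_tokens: int, refills_per_min: float,
                         usd_per_mtok: float) -> float:
    """Input-token cost per hour for a loop that re-sends the whole context."""
    tokens_per_min = context_tokens * refills_per_min
    return tokens_per_min * 60 * usd_per_mtok / 1_000_000

# 200K-token context, one refill every 5 seconds, $15/Mtok input:
print(f"${llm_runaway_per_hour(200_000, 12, 15.0):,.0f}/hr")  # → $2,160/hr
```

The useful property of the function is that every Axis 1 runaway is linear in the refill rate; there is no multiplicative term, which is exactly why this axis is the bounded one.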
This is the axis the popular "AI agent cost" content is about. It is also the axis with the smallest blast radius.
Axis 2 — SaaS-tool cost
This is the axis that ships incidents. When your agent calls Stripe to create a charge, Twilio to send an SMS, Resend to send an email, Shopify to update a product, or Postmark to deliver a transactional, the agent is spending real money on the world's behalf. The unit is dollars per action, not tokens, and the response is the only authoritative source.
Per-call costs across the v1 vendors:
- Stripe charges — ~$15 average for a SaaS subscription charge, but anything from a $0.50 micro-transaction to a $5,000 enterprise renewal. The cost is the amount field on the charge response.
- Twilio SMS — $0.0079 per US SMS, parsed from the price field on the message resource after the status callback (the initial 201 doesn't have it yet).
- Resend email — ~$0.0004 per email at the 50K-tier paid plan, computed from your tier table because there's no per-email cost field on the API response.
- OpenAI API calls — ~$0.01 average per GPT-4o call, parsed from the usage object × the model rate.
Now the runaway math. At one call every 400ms — typical for an agent in a tight loop — that's 2.5 calls/second, 9,000/hour, 216,000/day. At Stripe's $15 average charge, a stuck-refund loop running unattended for one day is $3.24 million. At Twilio's $0.0079, the same loop is $1,706/day — bad but survivable. At Resend, $86/day — annoying. The blast radius is not even on the LLM axis. Our agent blowout calculator lets you slide the call rate per vendor and see the 24-hour cost in real time.
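The runaway math above can be checked in a few lines. The per-call averages are the figures quoted in this section; your own vendor mix will differ:

```python
CALL_INTERVAL_S = 0.4                          # one call every 400ms
CALLS_PER_DAY = int(86_400 / CALL_INTERVAL_S)  # 216,000 calls/day

AVG_COST_USD = {            # averages from the list above; illustrative
    "stripe_charge": 15.0,
    "twilio_sms": 0.0079,
    "resend_email": 0.0004,
}

for vendor, unit_cost in AVG_COST_USD.items():
    print(f"{vendor}: ${CALLS_PER_DAY * unit_cost:,.2f}/day")
# stripe_charge: $3,240,000.00/day
```

The point of running it per vendor is that the loop shape is identical in every row; only the per-call price changes, and the per-call price is what turns an annoyance into an extinction event.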
What the LLM-gateway category does about it: nothing. LiteLLM and Portkey were not built to govern a Stripe charge. They were built for OpenAI-compatible model traffic; pointing them at api.stripe.com fails on three technical fronts (path schema, response parsing, auth envelope) — we cover this in the LiteLLM-for-Stripe page. The rate-limiter is not the answer either: rate-limiting a Stripe charge to 1 per second still bills $15 per second.
What actually contains Axis 2: a SaaS-tool governance proxy. Issue a vault key, attach a policy (per-day USD cap, endpoint allowlist, customer allowlist, mid-run-revocable expires_at), have the agent call proxy.keybrake.com/stripe/v1/charges, parse the cost from the response, log it, enforce the cap at write-time. This is a different category from the LLM gateway. It exists because the unit is dollars-per-action, the runaway is multiplicative across calls, and the parser has to understand vendor-specific response shapes. The 2026 agent governance stack walks through where this proxy sits relative to the LLM gateway in the full architecture.
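As a sketch of what write-time enforcement means, the following shows a per-day USD cap plus an endpoint allowlist checked before the vendor call goes out. The Policy fields and exception names are illustrative, not Keybrake's actual API:

```python
from dataclasses import dataclass

@dataclass
class Policy:
    daily_cap_usd: float
    endpoint_allowlist: set
    spent_today_usd: float = 0.0

class CapExceeded(Exception): pass
class EndpointDenied(Exception): pass

def authorize(policy: Policy, endpoint: str, cost_usd: float) -> None:
    """Reject the call before it reaches the vendor if it would break policy."""
    if endpoint not in policy.endpoint_allowlist:
        raise EndpointDenied(endpoint)
    if policy.spent_today_usd + cost_usd > policy.daily_cap_usd:
        raise CapExceeded(f"${policy.daily_cap_usd}/day cap would be breached")
    policy.spent_today_usd += cost_usd  # in practice, commit after the vendor call succeeds

p = Policy(daily_cap_usd=100.0, endpoint_allowlist={"/v1/charges"})
authorize(p, "/v1/charges", 15.0)  # allowed; $85 of headroom remains today
```

One real-world wrinkle the sketch glosses over: for request-priced actions (a Stripe charge carries its amount in the request) the cost is known before the call; for response-priced actions (a Twilio SMS) the proxy enforces against the running total and reconciles after parsing the response.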
Axis 3 — Infra cost
Where the agent runs: the compute that rents its CPU and memory; the storage that holds its database, vector index, and document cache; the network egress. Mostly the same FinOps shape as a normal web app: AWS / Fly / Railway / Render bills that scale with traffic and are capped by instance ceilings.
Per-agent ranges in the wild:
- Compute — $5-50/mo for a small persistent agent VM; $0-thousands for an agent that schedules ephemeral GPU jobs (this is where the math gets nasty if your "agent" trains a model).
- Vector DB — Pinecone $70/mo starter, Postgres+pgvector self-hosted $5-50/mo, Weaviate Cloud variable.
- Object storage — usually pennies; the gotcha is egress not storage.
- Browser-use / Selenium-grid / sandboxed code execution — $20-200/mo per concurrent slot if you're running a hosted browser provider; cheaper self-hosted but with operational cost.
Worst-case shape: an agent that recursively spawns sub-agents without a depth limit. Each sub-agent gets its own VM or container; with two children per agent, ten levels deep is 2^10 = 1,024 instances; on a $0.05/hour instance class that's $51/hour, $1,228/day. Bad, but most cloud accounts have an instance limit somewhere that catches it before the bill does.
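The cheapest guard against that shape is a depth counter the spawner itself enforces. A minimal sketch — the header name and limit are assumptions, not a standard:

```python
MAX_DEPTH = 3  # assumption: set to your deepest legitimate delegation chain

def spawn_subagent(parent_headers: dict) -> dict:
    """Provision a child agent, refusing to go past the depth limit."""
    depth = int(parent_headers.get("x-agent-spawn-depth", "0"))
    if depth >= MAX_DEPTH:
        raise RuntimeError(f"spawn depth {depth} hit the limit of {MAX_DEPTH}")
    # ...provision the child's VM/container here...
    return {"x-agent-spawn-depth": str(depth + 1)}

headers = {}
for _ in range(MAX_DEPTH):
    headers = spawn_subagent(headers)  # three levels succeed
# a fourth spawn_subagent(headers) now raises instead of provisioning
```

Because the counter rides in the propagated headers, the limit holds even when children are provisioned by different services; the cloud-side instance quota stays as the backstop, not the first line.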
What controls Axis 3: your existing FinOps stack. Cloud spend caps, instance-count limits, autoscaling rules with floors and ceilings, budget alerts in AWS Cost Explorer or GCP Billing. None of this is agent-specific. If you have your normal application's cloud spend under control, your agent's infra cost is the same problem with the same solutions. The mistake here is the opposite of the Axis 1/2 mistake: people sometimes only have FinOps controls, assume that covers the agent, and discover the hard way that AWS has no idea what an agent run is or what a Stripe charge cost.
Cost shapes — when each axis dominates
Three plausible monthly bills for three different agent shapes. Each one is dominated by a single axis, and the dominant axis tells you which control category to invest in first.
| Agent shape | Axis 1 (LLM) | Axis 2 (SaaS-tool) | Axis 3 (infra) | Dominant |
|---|---|---|---|---|
| Long-running research agent (reads docs, writes summaries) | $800/mo | $0/mo | $30/mo | LLM |
| Customer-support agent on Stripe + Resend (refunds, receipts, follow-ups) | $120/mo | $2,400/mo expected · $648,000 worst-case if stuck | $25/mo | SaaS-tool |
| Self-hosted agent with own fine-tuned 70B model on rented GPUs | ~$0 (own model) | $80/mo | $3,200/mo (GPU) | Infra |
The customer-support row is the one to internalise. In steady state, the expected SaaS-tool bill is twenty times the LLM bill — and the worst-case is more than five thousand times the steady-state LLM bill. If your team has spent more time picking an LLM gateway than picking a SaaS-tool governance proxy, you've optimised the wrong axis.
Attribution — tying cost back to one agent run
The other half of cost management is asking which run cost what. An answer to "our OpenAI bill jumped" is useless if you can't say which of last week's 4,000 agent runs caused it. The pattern is the same across all three axes: every component in the stack records an agent_run_id that the agent generates once and propagates as a header. The join then lives in SQL.
Concretely: the agent sets x-agent-run-id: run_2026_04_25_8a3f on every outbound call. The LLM gateway records the run id alongside the model call and token cost (Axis 1). The SaaS-tool governance proxy records it alongside the parsed dollar cost (Axis 2). Your application logs record it alongside the latency and the application-side computation (Axis 3 — partial; cloud bills don't have per-run granularity unless you tag instances). Now SELECT SUM(cost_usd) FROM agent_call_audit WHERE agent_run_id = ? gives you the answer. Our audit-trail page covers the four-column MVP schema, and the long-form schema post has the full sixteen-column reference with indexes and the five queries that earn it.
Without the join key, the three axes are three disconnected line items on three different bills. With the join key, they're three columns of a single per-run cost row.
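A minimal end-to-end demo of that join, using sqlite as a stand-in for your warehouse. The four columns are in the spirit of the MVP schema referenced above; the exact column names here are illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE agent_call_audit (
    agent_run_id TEXT, axis TEXT, vendor TEXT, cost_usd REAL)""")

run = "run_2026_04_25_8a3f"
rows = [
    (run, "llm",   "openai", 0.04),    # Axis 1: gateway writes token cost
    (run, "saas",  "stripe", 15.00),   # Axis 2: proxy writes parsed amount
    (run, "saas",  "resend", 0.0004),
    (run, "infra", "aws",    0.002),   # Axis 3: amortised per-run share
]
db.executemany("INSERT INTO agent_call_audit VALUES (?,?,?,?)", rows)

(total,) = db.execute(
    "SELECT SUM(cost_usd) FROM agent_call_audit WHERE agent_run_id = ?",
    (run,)).fetchone()
print(f"run cost: ${total:.4f}")  # → run cost: $15.0424
```

Notice where the dollars sit in the output: the Axis 2 row dwarfs the other three, which is the customer-support-agent cost shape from the table above showing up per run instead of per month.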
The single number CFOs actually want
Operationally, the question is cost per agent run. The CFO doesn't care that you've split it into three axes; they care whether the unit economics work. If your agent makes one customer-support resolution worth (let's say) $5 in retention, the run can cost up to that, in total across the three axes, before the agent costs more than it earns.
The sum is simple once the join key is set: cost_per_run = sum_axis_1_tokens × rate + sum_axis_2_parsed_cost + amortised_axis_3. The hard part is having all three numbers; that's the audit trail's job. Once you have the per-run cost, the next level up is cost per agent version — comparing v3.4.1 of your support agent against v3.4.2, because cost-of-goods-sold has to come down release-over-release the same way it does for software in general.
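The formula is one line of code once the audit trail supplies the three axis totals for a run. A sketch with illustrative numbers:

```python
def cost_per_run(axis1_tokens: int, usd_per_mtok: float,
                 axis2_parsed_usd: float, axis3_amortised_usd: float) -> float:
    """Sum the three axes into the single number the CFO asks for."""
    return (axis1_tokens * usd_per_mtok / 1_000_000
            + axis2_parsed_usd
            + axis3_amortised_usd)

# e.g. 60K tokens at $3/Mtok, $15.0004 of Stripe + Resend, $0.002 of infra:
print(round(cost_per_run(60_000, 3.0, 15.0004, 0.002), 4))  # 15.1824
```

Against a $5-per-resolution retention value, this hypothetical run is upside-down by a factor of three, and the decomposition tells you immediately that shaving tokens won't fix it; the Stripe amount will.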
Teams that don't have the per-run number end up arguing about averages. Teams that have it can put a cost-per-resolution chart on a wall and run the agent product like a unit-economics business.
Three antipatterns
Putting an LLM gateway in front of an Axis 2 incident. The Stripe bill is the problem. The team reaches for LiteLLM because that's the agent-cost tool they've heard of. Six weeks later they have a great LLM dashboard and the Stripe blowout still happens, because LiteLLM's rate-limiter doesn't know what a Stripe charge response is. Different axis. Different category.
Setting a daily LLM cap and calling it a "kill switch." A $200/day OpenAI cap stops Axis 1. It does nothing about Axis 2; the agent can have $0 of remaining LLM budget and still issue a Stripe refund. The four real kill-switch patterns live on the SaaS-tool axis and have measured propagation latencies (Stripe key revoke median 45s, p95 3m12s; Twilio 30s-2m; Resend near-instant). An LLM cap does not replace them.
Treating cloud-billing alerts as the cost control. An AWS Budget Alert at $5,000/month catches Axis 3 at the end of the month. By the time the alert fires, the cost has already been incurred. Axis 1 and Axis 2 don't even appear there — the model bill comes from OpenAI or Anthropic, the Stripe charges move via Stripe's payout schedule, and neither shows up in your AWS console. Cloud-billing alerts are a backstop, not a primary control.
How the three controls compose
The 2026 stack we recommend looks like a three-layer cake, plus the audit table that joins them. Each layer's controls live on a single axis, and each layer is replaceable independently of the others.
- LLM gateway in front of model calls — handles Axis 1. Pick from the five-option open-source review; we don't ship one. Caps on tokens and per-virtual-key spend; routing across model tiers; observability on prompts and completions.
- SaaS-tool governance proxy in front of vendor APIs — handles Axis 2. Keybrake is this layer. Per-day USD cap per vendor, endpoint allowlist, customer-scope allowlist, mid-run revoke under one second, parsed cost per call.
- Cloud FinOps stack below both — handles Axis 3. AWS / GCP / Azure billing alerts, instance limits, autoscaling ceilings, budget reviews. Existing tooling.
- Audit table joining all three on agent_run_id — produces the per-run cost that lets you ask whether the unit economics work. Four-column MVP schema; full reference in the schema post.
If one of the three control layers is missing, your cost-management story has a blind spot whose width equals the size of the axis you didn't cover. The customer-support row above is the canonical case for why missing layer 2 is the most expensive of the three blind spots.
Related questions
Doesn't an LLM gateway with rate-limiting solve cost management?
It solves Axis 1. It does not solve Axis 2 (SaaS-tool spend) and is not designed to. LiteLLM, Portkey, Helicone, OpenRouter — all great tools — speak the OpenAI-compatible model API and operate on token-based units. They cannot rate-limit a Stripe charge to less than $15 per call because the rate-limiter has no concept of the response's amount field. The largest cost incident your agent will ever cause is almost always on the SaaS-tool axis, and the LLM gateway is silent about it.
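The arithmetic behind that last point fits in three lines: a perfectly enforced 1-call-per-second rate limit, held for a day, still compounds at the charge amount.

```python
# Rate limiting caps the call rate, not the dollars each call moves.
calls_per_day = 1 * 86_400     # one Stripe charge per second, all day
avg_charge_usd = 15.0          # the average used throughout this page
print(f"${calls_per_day * avg_charge_usd:,.0f}/day")  # → $1,296,000/day
```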
Can I just set Stripe spend caps in Stripe itself?
Stripe Restricted Keys give you scope (which endpoints, which resources) but no per-day USD cap, no parameter-level allowlist, and no sub-second mid-run revocation. The 10-control coverage matrix walks through what Stripe-native covers and what it doesn't — final count is 3 Yes, 2 Partial, 5 No. The five "No" controls are the ones that contain the cost-blowout cases; that's the gap a SaaS-tool governance proxy fills.
Should I worry about Axis 1 (LLM) at all if Axis 2 dominates the bill?
You should still control it, but the order matters. Build Axis 2 first because a single bad day there ends the company; then Axis 1 because cost-per-run unit economics depend on it; then Axis 3 because most teams already have it via existing FinOps. Reverse-order investment is the most common mistake — teams polish Axis 1 dashboards while Axis 2 has zero protection because the LLM-cost content was easier to find on Google.
How small does an agent project have to be to skip this entirely?
A side project with no money-moving outbound calls (no Stripe, no Twilio, no Resend) and a small monthly LLM budget can skip the proxies. Set a $50/month spend cap on your OpenAI key, set a cloud-billing alert at $100, ship. The minute you wire in a vendor that costs real money per call — even just Resend at fractions of a cent — the math changes, because the runaway shape stops being bounded by token-rate and starts being bounded by your bank-account size.
Where does fine-tuning or prompt-caching cost fit?
Fine-tuning is Axis 1 with a long depreciation tail — you pay once to train and amortise across calls. Prompt-caching is an Axis 1 cost reducer; both Anthropic and OpenAI offer it, and both make cached prompt tokens cheaper than recomputing them at typical agent context lengths. Neither affects Axes 2 or 3. The audit row should still record the per-call cost so the cache savings show up in the per-run number — otherwise you can't measure whether the cache is paying for itself.
Further reading
- The 2026 agent governance stack: which proxy goes where — the four-layer composition (LLM traffic / LLM observability / SaaS-tool governance / agent identity) with measures-in / prevents framing per layer.
- LiteLLM alternative for Stripe — why pointing an LLM gateway at a SaaS-tool API fails on three technical fronts, and the dual-proxy alternative.
- LiteLLM alternatives — honest open-source review — five-option review of Portkey, Helicone, LangGate, OpenRouter proxy, and Bifrost (the Axis-1 toolbox).
- AI agent kill-switch — patterns and stop-latency — the four real ways to stop a running agent on the Axis-2 surface, with measured propagation numbers per vendor.
- AI agent audit trail — what belongs in one — the four-column MVP schema for joining cost rows on agent_run_id.
- The anatomy of an AI agent audit trail (long form) — sixteen-column reference, six indexes, five operational queries with full SQL.
- Agent blowout calculator — interactive tool: pick a vendor and a calls-per-minute slider, see the 24-hour cost on Axis 2 with and without a cap.
- Rotate vs revoke: a 2am playbook for a stuck agent — Axis-2 incident response with two side-by-side timelines and a per-vendor propagation table.