
AI agent budget enforcement: pre-call spending limits that actually stop runaway agents

Budget enforcement is not the same as budget alerting. Alerting notifies you after spend has occurred — "your Stripe charges exceeded $500 today." Enforcement fires before the charge is created — "this agent run would exceed its $500 cap; the call is blocked." For AI agents that call Stripe, Twilio, or Resend autonomously, only enforcement prevents the runaway charge. Alerting limits the damage; enforcement prevents it. This page maps the four monitoring patterns that don't provide enforcement, explains what enforcement actually requires, and shows how per-run vault keys implement it without changing your agent's tool code.

TL;DR

The four common AI agent spend-monitoring patterns — cloud billing alarms, vendor threshold emails, agent-side counters, and LLM observability dashboards — all fire after the vendor charge is created. None can block the next charge. Pre-call enforcement requires a layer that sits between the agent and the vendor API, inspects the request before forwarding it, checks cumulative spend for this specific run, and returns a 429 if the cap would be exceeded. Per-run vault keys issued via Keybrake implement this pattern: each vault key carries a daily_usd_cap that is checked and decremented atomically before every proxied call.

The four monitoring patterns and why they don't enforce

Pattern	When it fires	Latency	Stops the next charge?
Cloud billing alarms (GCP, AWS, Azure)	After spend is ingested into the billing system	8–48 hours after the vendor charge	No — spend is done and invoiced before the alarm fires
Vendor threshold emails (Stripe, Twilio)	After your billing threshold is crossed at the account level	15–60 minutes post-threshold; account-level, not per-agent	No — the threshold is cumulative across all agents; can't identify which run to stop
Agent-side counters (in-process tracking)	Before the vendor call — but only in a single process instance	Zero latency within the process; resets on restart	Partially — works in a single-process agent; fails with parallel agents or on process restart; doesn't aggregate across fleet
LLM observability tools (LangSmith, Helicone, etc.)	After the LLM tool call completes	Real-time for LLM token cost; does not track vendor API spend (Stripe, Twilio)	No — tracks LLM cost, not vendor API cost; no enforcement hook

What enforcement actually requires

Effective pre-call budget enforcement for AI agents requires four properties — none of which the monitoring patterns above provide:

Request interception — the enforcement layer must receive the vendor API request before it reaches Stripe, Twilio, or Resend. Observing requests after they succeed (billing logs, LLM traces) provides monitoring, not enforcement. The enforcement layer must be in the call path, not on the observation path.
Stateful spend accumulation — the enforcement layer must maintain a cumulative spend counter per agent run that persists across requests and is updated atomically. An in-process counter fails for parallel agents (multiple processes each with their own counter that don't aggregate) and on process restart (counter resets to zero). The stateful counter must be external to the agent process.
Cost awareness before forwarding: the enforcement layer must know the cost of the request before deciding whether to forward it. For Stripe, the cost is the amount field in the request body, expressed in the currency's smallest unit (cents for USD). For Twilio, it's a function of the destination number's rate. The enforcement layer must parse the request, estimate the cost, check it against the remaining cap, and either forward or block, all with negligible added latency on the call path.
Per-run granularity: enforcement at the account level ("your Stripe charges this month exceeded $10,000") is far too coarse for a per-run cap ("this specific billing run should not exceed $500"). Per-run enforcement requires a distinct credential per run, not a shared API key, so that the accumulation counter is scoped to the run rather than the account.
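The cost-awareness and forward-or-block steps above can be sketched as a pure decision function. This is a minimal illustration, not Keybrake's implementation: the names `estimate_cost_usd`, `decide`, and `Decision` are hypothetical, the SMS rates are placeholder numbers rather than real Twilio prices, and the Stripe branch assumes a USD charge.

```python
from dataclasses import dataclass
from decimal import Decimal

# Placeholder per-message rates in USD, keyed by destination prefix.
# These are assumptions for illustration, not real Twilio pricing.
ASSUMED_SMS_RATES_USD = {"+1": Decimal("0.01"), "+44": Decimal("0.05")}
DEFAULT_SMS_RATE_USD = Decimal("0.10")


def estimate_cost_usd(vendor: str, body: dict) -> Decimal:
    """Estimate what a request will cost before it is forwarded."""
    if vendor == "stripe":
        # Stripe amounts are in the smallest currency unit (cents for USD).
        return Decimal(body["amount"]) / 100
    if vendor == "twilio":
        to = body["To"]
        for prefix, rate in ASSUMED_SMS_RATES_USD.items():
            if to.startswith(prefix):
                return rate
        return DEFAULT_SMS_RATE_USD
    raise ValueError(f"no cost model for vendor {vendor!r}")


@dataclass
class Decision:
    forward: bool           # False means the proxy answers 429 itself
    remaining_usd: Decimal  # cap headroom after this decision


def decide(vendor: str, body: dict, spent_usd: Decimal, cap_usd: Decimal) -> Decision:
    """Forward only if the estimated cost fits under the run's remaining cap."""
    cost = estimate_cost_usd(vendor, body)
    if spent_usd + cost > cap_usd:
        return Decision(forward=False, remaining_usd=cap_usd - spent_usd)
    return Decision(forward=True, remaining_usd=cap_usd - spent_usd - cost)
```

With $490 already spent against a $500 cap, a $20 Stripe charge is blocked while a $10 charge is forwarded and leaves zero headroom. On its own this function is not safe under concurrency; the check and the counter update still have to happen as one atomic step in a store outside the agent process.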

Three levels of cap granularity

AI agent budget enforcement operates at three distinct granularity levels, each answering a different question:

Level	Question	Example	Enforcement mechanism
Per-run cap	How much can this specific agent execution spend?	This nightly billing run should not charge more than $5,000 total	Per-run vault key with `daily_usd_cap: 5000`
Per-fleet cap	How much can all agents of this type spend in a time window?	All billing agents combined should not exceed $50,000/day	Team-level cap in Keybrake applied to all keys with a shared label prefix
Per-vendor cap	How much should we spend on this vendor API across all agents this month?	Total Twilio spend across all agents should not exceed $2,000/month	Keybrake vendor-level aggregate cap across all keys for the Twilio vendor

Most teams start with per-run caps (row 1) because they're the most actionable: a per-run cap is a direct translation of "this batch job processes N customers at up to $X each, so the maximum expected spend is N × $X plus a safety buffer." Per-fleet and per-vendor caps are safeguards against simultaneous runaway across multiple runs.
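One way to picture how the three levels combine is that a request must fit under every cap that applies to it. The sketch below is an assumption about how such a layered check could work, not a description of Keybrake's internals; `first_exceeded_cap` and the scope names are hypothetical.

```python
from decimal import Decimal
from typing import Dict, Optional, Tuple


def first_exceeded_cap(
    cost_usd: Decimal,
    scopes: Dict[str, Tuple[Decimal, Decimal]],
) -> Optional[str]:
    """Return the first scope whose cap this cost would exceed, else None.

    `scopes` maps a scope name to (spent_usd, cap_usd), checked in order.
    """
    for name, (spent, cap) in scopes.items():
        if spent + cost_usd > cap:
            return name
    return None


scopes = {
    "run": (Decimal("4990"), Decimal("5000")),      # this nightly billing run
    "fleet": (Decimal("30000"), Decimal("50000")),  # all billing agents today
    "vendor": (Decimal("1200"), Decimal("2000")),   # this vendor, this month
}
blocked_by = first_exceeded_cap(Decimal("20"), scopes)
```

Here a $20 request is stopped by the per-run cap even though the fleet and vendor levels still have headroom, which is why the narrowest cap is usually the one that fires first.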

The agent-side counter failure mode

Agent-side counters (tracking spend in a Python dict, a Redis key owned by the agent process, or an in-memory accumulator) are the most common DIY enforcement attempt and the most common failure mode:

# Naive agent-side counter: breaks under concurrency and in parallel deployments
BUDGET_USD = 500.0   # cap for this run
spent_usd = 0.0      # lives in process memory only

async def charge_customer(customer_id: str, amount_cents: int) -> dict:
    global spent_usd
    amount_usd = amount_cents / 100

    if spent_usd + amount_usd > BUDGET_USD:
        return {"error": "budget_exceeded"}

    # RACE CONDITION: two concurrent $20 calls both read spent_usd = 470.0,
    # both compute 470 + 20 = 490, which is under 500, so BOTH proceed,
    # creating $510 total spend against a $500 cap.
    # call_stripe is the agent's existing Stripe helper (not shown here).
    response = await call_stripe(customer_id, amount_cents)
    spent_usd += amount_usd   # Updated after the await: race window between check and write
    return response

This pattern has three failure modes that make it unsuitable as enforcement:

Race condition — two concurrent async coroutines read the same spent_usd value before either updates it. Both see the cap as not exceeded. Both proceed. Spend goes over cap by up to one charge amount per concurrent call.
Process restart reset — the counter lives in process memory. If the agent crashes and restarts (or if Kubernetes reschedules the Pod, or if Cloud Run Jobs retries the task container), the counter resets to zero. The restarted process has no knowledge of what the prior execution charged.
Multi-process blindness — a parallel agent deployment with 20 worker processes each maintains its own counter. Each worker enforces its own cap independently. A $500 total cap becomes 20 independent $500 caps if each worker uses its own counter — effective total cap is $10,000, not $500.
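The race condition can be reproduced without any vendor at all. The following self-contained sketch replaces the Stripe call with `asyncio.sleep(0)`, which only yields to the event loop; `fake_vendor_call` and the $50 budget are illustrative assumptions.

```python
import asyncio

BUDGET_USD = 50.0
spent_usd = 0.0


async def fake_vendor_call() -> None:
    # Stand-in for the network round trip to the vendor; yields to the event loop.
    await asyncio.sleep(0)


async def charge(amount_usd: float) -> bool:
    global spent_usd
    if spent_usd + amount_usd > BUDGET_USD:   # check against a stale value
        return False
    await fake_vendor_call()                  # other coroutines run their checks here
    spent_usd += amount_usd                   # the update lands after every check passed
    return True


async def main() -> float:
    # Five concurrent $20 charges against a $50 budget.
    await asyncio.gather(*(charge(20.0) for _ in range(5)))
    return spent_usd


total = asyncio.run(main())
```

Each coroutine runs its check before any of them reaches the update, so all five charges go through and `total` ends at 100.0 against a 50.0 budget. Moving the increment above the `await` would close this particular window within one process, but it would not help with restarts or with multiple worker processes.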

Pre-call enforcement with vault keys

import httpx

KEYBRAKE_BASE = "https://proxy.keybrake.com"

async def charge_customer(
    customer_id: str,
    amount_cents: int,
    vault_key: str,   # issued per agent run, not shared
) -> dict:
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            f"{KEYBRAKE_BASE}/stripe/v1/payment_intents",
            headers={"Authorization": f"Bearer {vault_key}"},
            # Form-encoded, as Stripe's v1 API expects (assumes the proxy forwards the body unchanged)
            data={
                "amount": amount_cents,
                "currency": "usd",
                "customer": customer_id,
            },
        )

    if resp.status_code == 429:
        body = resp.json()
        if body.get("code") == "cap_exhausted":
            # Cap enforced at the proxy — charge was NOT sent to Stripe
            return {
                "error": "cap_exhausted",
                "message": "Agent run budget exhausted. Do not retry.",
                "spent_usd": body.get("spent_usd"),
                "cap_usd": body.get("cap_usd"),
            }

    resp.raise_for_status()
    return resp.json()

The proxy reads the amount field from the request body before forwarding. It checks whether current_accumulated_spend + request_amount would exceed daily_usd_cap. If yes, it returns 429 immediately — the request never reaches Stripe. The Stripe account is never charged. The accumulation check is atomic: a compare-and-swap operation on the spend counter means concurrent requests from parallel agent instances cannot both slip past the cap in the race window. The counter is external to the agent process — it persists across restarts and aggregates across all parallel workers holding the same vault key.
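The atomic check-and-update described above can be illustrated with an in-memory stand-in. This is a sketch under stated assumptions: `RunBudget` and `try_reserve` are hypothetical names, and a `threading.Lock` plays the role that a compare-and-swap or transaction would play in a real external store shared by all workers.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class RunBudget:
    """In-memory stand-in for an external, atomically updated spend counter."""

    def __init__(self, cap_cents: int) -> None:
        self.cap_cents = cap_cents
        self.spent_cents = 0
        self._lock = threading.Lock()

    def try_reserve(self, amount_cents: int) -> bool:
        """Check and increment in one critical section; False maps to a 429."""
        with self._lock:
            if self.spent_cents + amount_cents > self.cap_cents:
                return False
            self.spent_cents += amount_cents
            return True


budget = RunBudget(cap_cents=50_000)  # $500 cap, tracked in integer cents

# Forty concurrent $20 attempts from twenty workers sharing one budget.
with ThreadPoolExecutor(max_workers=20) as pool:
    outcomes = list(pool.map(lambda _: budget.try_reserve(2_000), range(40)))
```

Because the check and the increment cannot be interleaved, exactly 25 of the 40 attempts succeed and the counter stops at the cap. Reserving before forwarding also raises a design question the sketch leaves open: whether to release the reservation when the vendor call itself fails.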

How Keybrake fits

Keybrake provides pre-call budget enforcement for AI agents that call Stripe, Twilio, and Resend. Each agent run receives a vault key issued with a daily_usd_cap. The vault key is the per-run credential — distinct from the shared Stripe API key, revocable without affecting other runs, and auto-expiring per its TTL. Every proxied request triggers an atomic spend check before forwarding: if the cap would be exceeded, Keybrake returns 429 with code: cap_exhausted. The audit log records every proxied call with the vault key's label (the agent run ID), the request cost, and the cumulative spend at call time — giving you the per-run spend reconstruction that cloud billing alarms and vendor dashboards don't provide.


Related questions

Can I use Stripe's native rate limits as a form of budget enforcement?

No. Stripe's API rate limits cap request velocity (requests per second), not dollar spend. A single Stripe API call can create a $10,000 charge — Stripe's rate limits will allow it as long as you haven't exceeded the per-second request count. Rate limits protect Stripe's infrastructure from traffic overload; they have no relationship to the financial impact of the requests they're allowing. Vendor rate limits and budget caps are orthogonal controls that address different risks.

How do per-run caps interact with Stripe's idempotency keys?

They're complementary and non-overlapping. Idempotency keys prevent duplicate charges on retry — if a request fails after Stripe processes it but before the response returns, retrying with the same idempotency key deduplicates the charge at Stripe. Spend caps prevent a single run from spending more than its budget across any number of distinct charges. A run that makes 50 unique charges (each with a distinct idempotency key) can still exhaust its spend cap — the idempotency keys prevent duplicates within the 50, but the cap limits the total to the budget. Use both: idempotency keys for retry safety, spend caps for budget enforcement.
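A small sketch of using both together: derive a stable idempotency key for each logical charge and send it alongside the vault key. `Idempotency-Key` is Stripe's documented header for idempotent POST requests; the helper names are hypothetical, and whether the proxy passes the header through unchanged is an assumption here.

```python
import hashlib


def idempotency_key(run_id: str, customer_id: str, amount_cents: int) -> str:
    """Derive a stable key so a retry of the same logical charge reuses it."""
    raw = f"{run_id}:{customer_id}:{amount_cents}".encode()
    return hashlib.sha256(raw).hexdigest()


def charge_headers(vault_key: str, run_id: str, customer_id: str, amount_cents: int) -> dict:
    """Headers for one charge: the per-run credential plus a retry-safe key."""
    return {
        "Authorization": f"Bearer {vault_key}",
        "Idempotency-Key": idempotency_key(run_id, customer_id, amount_cents),
    }
```

A retry of the same charge in the same run produces the same key, so Stripe deduplicates it, while charges to different customers get different keys and each count against the run's cap.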

What happens to the audit log when a charge is blocked by the cap?

Keybrake records every request in the audit log, including blocked requests. A blocked request (cap exhausted) gets a 429 status in the audit log, the cumulative spend at the time of the block, and the cap that was in effect. This gives you a complete picture of what the agent attempted and where it hit the enforcement boundary, which is useful for post-incident analysis ("the agent attempted to charge $620 against a $500 cap; the proxy blocked the request that would have created the 501st dollar of spend at 02:14:37 UTC"). The Stripe dashboard, by contrast, only shows charges that were actually created; it has no record of requests that were blocked before reaching Stripe.

How do I set the right cap value for a new agent workflow?

Start with a theoretical maximum: number of operations × maximum cost per operation × 1.2 safety margin. For a billing agent processing 500 customers with a maximum charge of $10 each: 500 × $10 × 1.2 = $6,000 theoretical max. In practice, run the agent in a staging environment with the cap set to 2× the expected spend for a test slice (e.g., 10 customers at $10 = $100, cap at $200). Check the audit log to verify actual spend matches expectations. Then set the production cap at the theoretical maximum. If the agent hits the cap in normal operation, increase the cap — but investigate the reason before increasing, as hitting the cap unexpectedly is the signal that enforcement caught something.
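The sizing arithmetic above is small enough to keep next to the job definition. A minimal sketch, with hypothetical function names and `Decimal` used to keep the dollar amounts exact:

```python
from decimal import Decimal


def theoretical_cap_usd(operations: int, max_cost_usd: str, safety_margin: str = "1.2") -> Decimal:
    """operations x maximum cost per operation x safety margin."""
    return operations * Decimal(max_cost_usd) * Decimal(safety_margin)


def staging_cap_usd(test_operations: int, expected_cost_usd: str) -> Decimal:
    """Twice the expected spend for a small staging slice."""
    return 2 * test_operations * Decimal(expected_cost_usd)


production_cap = theoretical_cap_usd(500, "10")  # 500 customers, $10 max each
staging_cap = staging_cap_usd(10, "10")          # 10 customers at $10 each
```

For the numbers in the answer above, this gives a production cap of $6,000 and a staging cap of $200.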