
AI agent rate limiting: enforcing vendor API spend limits before they're hit

When engineers search for "AI agent rate limiting," they're usually asking one of two different questions. The first: how do I handle 429s from vendor APIs when my agent calls them too fast? The second: how do I enforce my own dollar limits on what the agent can spend before the vendor's limits become relevant? These questions have different answers — and conflating them leads to rate-limit handling code that looks correct but still allows runaway agent spend. This page covers both meanings, the retry-storm failure mode that connects them, and why pre-call proxy enforcement is the right architectural response to both.

TL;DR

Reactive rate limiting (handling vendor 429s) and proactive spend enforcement (capping your own spend before the vendor's limit fires) are complementary, not alternatives. Reactive-only handling creates retry storms when agents misinterpret budget exhaustion as a transient error. Proactive-only enforcement without vendor 429 handling leaves agents failing loudly on rate limits you didn't anticipate. A proxy enforcement layer gives you both: pre-call spend cap enforcement that fires before the vendor's limit is reached, plus structured 429 responses that your agent can distinguish from transient errors.

Two meanings of "rate limiting" for AI agents

Vendor APIs return 429s for two distinct reasons, and an operator-run proxy adds a third. Each category calls for a different agent response:

| Rate limit type | What triggers it | Vendor example | Correct agent response |
|---|---|---|---|
| Request-rate limits | Too many API calls per second or per minute | Stripe: 25 read req/s, 100 write req/s per secret key | Respect Retry-After header; use exponential backoff; reduce concurrency |
| Spend / volume limits | Daily dollar cap, monthly charge volume, or quota exceeded | Twilio: account spending limit; Resend: monthly email quota | Stop immediately; alert operator; do NOT retry |
| Proxy-enforced caps | Pre-call dollar cap set by the operator, enforced before the request reaches the vendor | Keybrake: per-agent daily_usd_cap exceeded | Stop immediately; distinguish from transient errors by response header |

The critical distinction: request-rate limits are transient, so waiting a moment and retrying works. Spend limits are not transient: a retry cannot succeed until the cap resets or is raised, and every retry consumes request-rate quota. Agents that treat all 429s as transient and retry aggressively will hit the spend limit again and again, piling rate-limit errors on top of the original budget problem.

The retry-storm failure mode

Here is the failure sequence that plays out when an agent uses a naive retry-on-429 strategy against a vendor spend limit:

  1. An AI billing agent fans out 1,000 Stripe charges simultaneously.
  2. After $400 in charges, Stripe's account-level spending limit fires, and all further calls return 429.
  3. The agent's retry handler sees 429, waits 1 second, and retries. The 429 repeats.
  4. The agent backs off exponentially (1s, 2s, 4s, 8s), but each retry is another Stripe call that fails immediately with 429. No new charges are made, but each retry consumes quota against the per-second request-rate limit.
  5. The per-second rate limit fires. Now retries get 429s for two different reasons, spend exhaustion and rate limiting, and the agent's error logs show a mix of both, making incident diagnosis harder.
  6. Meanwhile, the operator is looking at a Stripe dashboard showing $400 in charges and 800 failed retries, with no way to tell which original customer IDs were charged and which weren't, because the agent's retry logic didn't record which calls succeeded before the cap hit.

The root cause isn't the retry logic — it's the absence of a distinction between "vendor says slow down" (transient, retry) and "vendor says you're over budget" (terminal, stop).
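The distinction can be made explicit in the retry loop itself. The following is a minimal sketch, not a definitive implementation: the `Response` shape, the `over_budget` flag (assumed to be parsed from the vendor's error body), and the `BudgetExhausted` exception are all hypothetical names introduced for illustration.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Response:
    """Minimal stand-in for an HTTP response (hypothetical shape)."""
    status: int
    retry_after: Optional[float] = None  # parsed Retry-After header, in seconds
    over_budget: bool = False            # assumed flag parsed from the error body


class BudgetExhausted(Exception):
    """Terminal 429: stop the run and alert the operator, do not retry."""


def call_with_backoff(send: Callable[[], Response], max_attempts: int = 5,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> Response:
    for attempt in range(max_attempts):
        resp = send()
        if resp.status != 429:
            return resp
        if resp.over_budget:
            # Terminal: a retry cannot succeed and only burns rate quota.
            raise BudgetExhausted("spend limit reached; stopping")
        if attempt == max_attempts - 1:
            break
        # Transient: prefer the vendor's Retry-After, else exponential backoff.
        if resp.retry_after is not None:
            delay = resp.retry_after
        else:
            delay = base_delay * 2 ** attempt
        sleep(delay + random.uniform(0, 0.1 * delay))  # small jitter
    raise RuntimeError("rate limited: retries exhausted")
```

With this shape, a spend-limit 429 ends the run after a single call, while a request-rate 429 is retried a bounded number of times.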

Why reactive rate limiting alone isn't enough

Adding smarter 429 handling — inspecting error bodies, honoring Retry-After, distinguishing error codes — solves the retry-storm problem but doesn't solve the underlying spend risk. Your agent can now handle a spend-limit 429 gracefully (stop, alert, don't retry), but you've already spent the money by the time the 429 fires. Vendor spend limits are enforced after the call succeeds on their side — Stripe charges the card, then tells you the account limit was hit on the next call.

For many use cases this is acceptable. For AI agents making money-moving calls, it isn't. The window between "the charge succeeded" and "the spend limit fired" can be hundreds of thousands of dollars on a misconfigured fan-out.

Pre-call enforcement: closing the gap reactive handling leaves open

Pre-call enforcement checks the spend cap before forwarding the request to the vendor. If the cumulative spend for a vault key has reached its cap, the proxy returns 429 before the charge is executed, not after.

Pre-call enforcement and reactive 429 handling are complementary:

| Approach | Fires when | Money already spent? | Retry correct? |
|---|---|---|---|
| Reactive only (handle vendor 429) | After vendor's cap fires | Yes: charge was processed before 429 | No: retrying would exceed cap further |
| Pre-call enforcement (proxy cap) | Before forwarding to vendor | No: call was never made | No: cap is exhausted, not transient |
| Both together | Proxy cap fires first; vendor cap never reached | No | Agent can cleanly distinguish cap-hit vs transient 429 |
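The ordering that makes pre-call enforcement work (check first, forward second) can be sketched in a few lines. This is an illustration under stated assumptions, not Keybrake's actual implementation: `VaultKey`, `proxy_call`, and the idea that the proxy knows the call's cost up front are hypothetical, and this version rejects any call that would push spend past the cap.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VaultKey:
    """Hypothetical per-agent key with a cap, tracked in integer cents."""
    label: str
    cap_cents: int
    spent_cents: int = 0


def proxy_call(key: VaultKey, cost_cents: int, forward: Callable[[], int]) -> int:
    """Return an HTTP-style status; `forward` performs the real vendor call."""
    # Pre-call gate: reject before the vendor ever sees the request.
    if key.spent_cents + cost_cents > key.cap_cents:
        return 429
    status = forward()
    if 200 <= status < 300:
        key.spent_cents += cost_cents  # count spend only for successful calls
    return status
```

Because the rejection happens before `forward` runs, a capped call never reaches the vendor and no charge is made. This single-threaded sketch omits the atomic check-and-update that concurrent agent instances would need.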

What good AI agent rate limiting looks like in practice

A well-implemented AI agent rate limiting strategy has four layers:

  1. Per-agent vault key with a dollar cap: issue a vault key per agent run or per orchestrator job with a cap set to the maximum expected spend plus a safety margin. The proxy enforces this cap pre-call, atomically across concurrent agent instances.
  2. Endpoint allowlist: restrict the vault key to the specific vendor endpoints the agent legitimately needs. A billing agent should only be able to call POST /v1/payment_intents — not POST /v1/refunds or DELETE /v1/customers.
  3. Idempotency keys on money-moving calls: use a stable idempotency key (agent_run_id + customer_id + amount) on every Stripe charge, Twilio send, and Resend delivery. Retries on transient errors become no-ops at the vendor layer — no duplicate charges, no duplicate sends.
  4. Structured 429 handling with error-code inspection: distinguish proxy cap exhaustion (stop, alert, do not retry) from vendor request-rate limiting (wait for Retry-After, retry with backoff) from transient network errors (retry immediately with jitter).

This four-layer approach eliminates both the runaway spend risk (pre-call cap enforcement) and the retry-storm risk (structured 429 handling with correct retry logic per error type).
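Layers 2 and 3 are small enough to sketch directly. The example below assumes a billing agent limited to one endpoint and derives a stable idempotency key from the three fields named above; `ALLOWED`, `is_allowed`, and `idempotency_key` are hypothetical names, and hashing the fields with SHA-256 is one possible choice rather than a requirement.

```python
import hashlib

# Hypothetical allowlist for a billing agent: (method, path) pairs only.
ALLOWED = {("POST", "/v1/payment_intents")}


def is_allowed(method: str, path: str) -> bool:
    """True only for endpoints the agent legitimately needs."""
    return (method.upper(), path) in ALLOWED


def idempotency_key(agent_run_id: str, customer_id: str, amount_cents: int) -> str:
    """Stable key: the same run, customer, and amount always give the same value."""
    raw = f"{agent_run_id}:{customer_id}:{amount_cents}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

Stripe accepts such a value in the `Idempotency-Key` request header. Because the key depends only on the run, customer, and amount, a retry of the same charge reuses the same key instead of creating a second charge.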

How Keybrake fits

Keybrake provides layer 1 and layer 2 of the four-layer stack: per-agent vault keys with dollar caps and endpoint allowlists, enforced pre-call at the proxy layer. Layer 3 (idempotency keys) is implemented in your agent code — one line per vendor call. Layer 4 (structured 429 handling) is implemented in your error handler — inspect X-Keybrake-Cap-Hit for cap exhaustion vs vendor-native 429 for rate limiting vs network error for transient failures. The proxy emits a structured audit log with every call's vendor, endpoint, parsed cost, vault key, and enforcement action — giving you the per-call data you need for incident investigation without instrumenting each agent individually.
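A layer 4 handler along these lines might look like the sketch below. It assumes that the mere presence of the `X-Keybrake-Cap-Hit` header on a 429 marks cap exhaustion (the header's value format is not specified here), and that a network failure shows up as a missing status; the `Action` enum and `classify` function are hypothetical names.

```python
from enum import Enum
from typing import Mapping, Optional


class Action(Enum):
    STOP_AND_ALERT = "stop_and_alert"        # proxy cap exhausted
    BACKOFF_AND_RETRY = "backoff_and_retry"  # vendor request-rate limit
    RETRY_WITH_JITTER = "retry_with_jitter"  # network failure, no response
    NOT_RATE_LIMITED = "not_rate_limited"    # handle through the normal path


def classify(status: Optional[int], headers: Mapping[str, str]) -> Action:
    """`status` is None when the request failed before any response arrived."""
    if status is None:
        return Action.RETRY_WITH_JITTER
    if status != 429:
        return Action.NOT_RATE_LIMITED
    names = {name.lower() for name in headers}  # header names are case-insensitive
    if "x-keybrake-cap-hit" in names:
        return Action.STOP_AND_ALERT
    return Action.BACKOFF_AND_RETRY
```

Keeping the classification in one function means the retry loop never has to guess: only `BACKOFF_AND_RETRY` and `RETRY_WITH_JITTER` lead to another attempt.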


Related questions

What's the difference between rate limiting and spend capping for AI agents?

Rate limiting controls call frequency: how many requests per second or minute the agent can make. Spend capping controls total cost: how many dollars the agent can spend in a time window. Both are important, but they're orthogonal. An agent can stay within rate limits while spending far more than intended: a well-paced agent that charges $100/min for 5 minutes never exceeds its request-rate limit, yet it spends $500, ten times a $50 spend cap. Spend capping is the more critical safety control for money-moving agents because it bounds the worst-case financial outcome, not just the worst-case request throughput.

How does a proxy-enforced spend cap interact with Stripe's own spending limits?

Stripe's account-level spending limits are a backstop enforced after charges succeed. A proxy cap is a pre-call gate enforced before charges are attempted. The proxy cap fires first — if your vault key cap is $300 and Stripe's account limit is $500, the proxy stops at $300 and Stripe's limit never fires. This is the correct layering: your proxy cap is the operational control you set per-agent or per-run; Stripe's limit is the organization-level backstop for all agents combined. Set proxy caps per-agent at the expected per-run budget, and keep the Stripe account limit as the hard ceiling across all agents combined.

Can the same proxy cap apply across multiple agent instances running in parallel?

Yes — this is one of the core use cases. Issue one vault key before dispatching a group of parallel agent instances and pass it to all of them. All instances share the same vault key and its cap. The proxy enforces the cap atomically: once cumulative spend across all instances hits the limit, further calls from any instance return 429. This prevents the scenario where 10 agents each think they have a $50 cap individually but collectively spend $500 — the shared cap is enforced at $50 total across all 10, not $50 per agent.
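The atomicity requirement is the important part: the check and the update have to happen as one step, or two instances can both pass the check at once. The following in-process sketch uses a lock to stand in for whatever the proxy does server-side; `SharedCap`, `try_spend`, and `agent_instance` are hypothetical names, and ten threads play the role of ten agent instances sharing a $50 cap.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SharedCap:
    """In-process stand-in for a proxy-side cap shared by all agent instances."""

    def __init__(self, cap_cents: int) -> None:
        self.cap_cents = cap_cents
        self.spent_cents = 0
        self._lock = threading.Lock()

    def try_spend(self, cost_cents: int) -> bool:
        # Check and update under one lock so two instances cannot both pass.
        with self._lock:
            if self.spent_cents + cost_cents > self.cap_cents:
                return False
            self.spent_cents += cost_cents
            return True


cap = SharedCap(cap_cents=5000)  # $50 shared across every instance


def agent_instance(instance_id: int) -> int:
    # Each instance attempts ten $1 calls and counts those that were allowed.
    return sum(cap.try_spend(100) for _ in range(10))


with ThreadPoolExecutor(max_workers=10) as pool:
    allowed = sum(pool.map(agent_instance, range(10)))

print(allowed, cap.spent_cents)  # 50 5000
```

Of the 100 attempted $1 calls, exactly 50 are allowed regardless of how the threads interleave, so total spend stops at $50 rather than $50 per instance.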

What's the correct retry strategy for an AI agent after a proxy cap-exhaustion 429?

Don't retry. A proxy cap-exhaustion 429 means the budget for this agent run is exhausted — the cap was set intentionally by the operator and hitting it means the agent has done what it was budgeted to do. The correct response is to: (1) record which items in the batch were completed before the cap hit; (2) surface the cap-exhaustion event to the operator with the vault key label, total spend, and remaining items; (3) stop the current run. A new run with a new vault key and a new budget can be triggered by the operator after reviewing the first run's output — not automatically by the agent's retry logic.
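These three steps can be sketched as a batch runner that records progress and stops at the first cap-exhaustion signal. The `CapExhausted` exception, the `RunReport` fields, and the `charge` callable are hypothetical; a real agent would raise the exception from its 429 handler.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class CapExhausted(Exception):
    """Hypothetical signal that the proxy returned a cap-exhaustion 429."""


@dataclass
class RunReport:
    vault_key_label: str
    completed: List[str] = field(default_factory=list)
    remaining: List[str] = field(default_factory=list)
    total_spent_cents: int = 0
    cap_hit: bool = False


def run_batch(label: str, items: List[str],
              charge: Callable[[str], int]) -> RunReport:
    """`charge` returns the amount spent in cents or raises CapExhausted."""
    report = RunReport(vault_key_label=label)
    for index, item in enumerate(items):
        try:
            spent = charge(item)
        except CapExhausted:
            # Stop the run; leave unprocessed items for the operator to review.
            report.cap_hit = True
            report.remaining = items[index:]
            break
        report.completed.append(item)
        report.total_spent_cents += spent
    return report
```

The returned report carries everything the operator needs to decide whether to start a new run: the vault key label, total spend, and the exact items left unprocessed.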

How should I choose the right dollar cap value for a vault key?

Start with expected spend × 1.5 as a safety margin. For a billing agent that should charge $10,000 in a run, set the cap at $15,000. The safety margin absorbs legitimate variance (more customers than expected, slightly higher charge amounts) while still bounding the worst case. For agents where you don't know expected spend in advance (open-ended agentic loops), set the cap to the maximum acceptable spend for the entire run; this becomes your financial kill switch. Adjust caps downward after observing actual spend in the first few runs: a generous cap still bounds the worst case, while a cap set too low causes unnecessary run failures.
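As a worked example of the 1.5× rule, the helper below computes a cap in integer cents to avoid floating-point rounding on money. The function name and the `margin_percent` parameter are hypothetical.

```python
def suggested_cap_cents(expected_spend_cents: int, margin_percent: int = 50) -> int:
    """Expected spend plus a safety margin, in integer cents."""
    return expected_spend_cents * (100 + margin_percent) // 100


print(suggested_cap_cents(1_000_000))  # 1500000, i.e. $15,000 for a $10,000 run
```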
