Incident response · 11 min read
Rotate vs revoke: a 2am playbook for a stuck AI agent
It is 02:14 on a Saturday and PagerDuty has just made the noise it makes. The dashboard says your refund-issuing agent has fired four hundred Stripe calls in the last six minutes. You have a Stripe Dashboard tab open in one window and a terminal in the other. The fastest move is not the move most teams reach for. This is a minute-by-minute playbook for the two paths — rotate the upstream key, or revoke a scoped one — that get conflated under "kill the agent" and turn out to be two-to-three orders of magnitude apart in how fast they actually stop the bleeding.
Rotate and revoke are not synonyms
Half of the incidents I have watched go bad start with a vocabulary mistake. The on-call engineer tabs to the vendor's dashboard and clicks delete this API key, then sits there refreshing their internal "calls per minute" graph and watches it not move. They assume the deletion failed. They click it again. They escalate. The agent keeps firing. By the time they understand what is happening, the spend cap they never set is a postmortem item.
What is happening is that they reached for rotation and got rotation's latency. The two moves they should have been deciding between were:
- Rotate — invalidate the upstream secret at the vendor itself. Stripe's sk_live_…, Twilio's auth token, OpenAI's sk-proj-…. The credential ceases to be valid for anyone, anywhere, including all your other legitimate consumers of that key. Latency is set by the vendor's edge cache and is not under your control.
- Revoke — invalidate a scoped credential at a layer between the agent and the vendor. A short-lived vault_key_… on a proxy, an OAuth access token at your own authorization server, a Stripe Restricted Key whose scope you happen to be willing to delete because no other agent shares it. Latency is set by your own software.
The first is what people reach for, because it is the one with a button on the vendor's website. The second is what they should have set up a month ago, because it is the one that takes effect on the next packet. The rest of this post is about the gap between those two and what you do at 2am if your stack only contains the first.
The propagation tail you didn't ask about
When you click delete on a Stripe API key, what happens? The Dashboard call hits Stripe's control plane and the secret is marked invalidated in their primary datastore. That state then has to propagate to every edge node that authenticates incoming requests. Stripe documents this as "key changes are live within minutes." In our own measurements of sk_live keys against charges.create, the median propagation is 45 seconds and the 95th-percentile tail is three minutes and twelve seconds.
That tail does not sound like a long time. It is. An agent in a tight retry loop calling charges.create every 400 ms is firing 150 calls per minute. A three-minute tail is 450 calls that go through after the moment you clicked Delete. Every one is a real charge against a real customer card, with a real refund obligation when Monday's complaints arrive. The same shape, with different numbers, applies to every other vendor we touch:
| Vendor | Median rotate latency | p95 tail | Calls leaked at 1/400ms |
|---|---|---|---|
| Stripe | ~45s | ~3m12s | ~480 |
| Twilio | ~30s | ~2m | ~300 |
| OpenAI | ~1m | ~5m | ~750 |
| Resend | near-instant | ~5s | ~12 |
Resend is the outlier — they invalidate synchronously across the surface that authenticates your POST /emails, and the leak is bounded by the in-flight requests that were already past the edge. Everyone else has the cache. The cache is what makes rotate the wrong first move.
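The leaked-calls arithmetic behind that table is simple enough to keep in a runbook. A minimal sketch (the function name and numbers are illustrative; the tail figures are the ones measured above):

```python
# Back-of-envelope leak estimate: calls that still succeed between the
# moment you click Delete and the moment the vendor's edge stops
# honoring the old key.
def leaked_calls(tail_seconds: float, call_interval_seconds: float) -> int:
    """Calls a tight retry loop gets through during the propagation tail."""
    return int(tail_seconds / call_interval_seconds)

# A 1-per-400ms loop against each vendor's p95 tail:
print(leaked_calls(192, 0.4))  # Stripe, 3m12s tail -> 480
print(leaked_calls(120, 0.4))  # Twilio, 2m tail    -> 300
print(leaked_calls(5, 0.4))    # Resend, 5s tail    -> 12
```

The same function, run against your own agent's observed call rate, gives you the worst-case damage bound before you click anything.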
The full breakdown by vendor — including what each platform documents, what we observed, and which pattern wins on stop latency — is on our kill-switch latency reference. The numbers there agree with the table above; this post is what you do with them at 2am.
Revoke is what rotate should have been
The architectural answer to a several-minute propagation tail is to not put the credential the agent uses on the vendor's edge in the first place. Issue a scoped intermediary credential — call it whatever you like; we call it a vault key — and let your own software hold the mapping from that intermediary to the real upstream secret. Now the question "is this credential still valid?" is answered by your own datastore, on your own server, in front of the request. There is no edge cache. The next packet sees the new state.
This is not a Keybrake-specific idea, although Keybrake is one shape of it. AWS does it for STS-issued temporary credentials; Snowflake does it for OAuth access tokens minted against a federated identity; any team that has built a multi-tenant SaaS has probably built some version of it. The common factor: a layer that can say no on the next request, by changing a row in a database that the layer owns.
For an agent calling Stripe through such a layer, the wall-clock latency from operator clicks revoke to next agent call returns 401 is bounded by how fast that policy change reaches the in-memory state of every proxy instance handling traffic for that vault key. In our deployment that bound is one second; in the median case it is sub-second. The leaked-calls column for the table above collapses from hundreds to single digits.
You can read more on the architectural shape — where this layer fits relative to your LLM gateway, your observability tool, and your identity layer — in our piece on the 2026 agent governance stack. The short version: it is the third layer of four, and it is the one most teams haven't built yet.
The 2am playbook, minute by minute
What follows is the response sequence we walk through when one of our test agents misbehaves on staging. The two columns are the two paths: vault-key revoke through a proxy you operate, and upstream rotate at the vendor itself. They run differently.
Path A — you have a vault key in front of the vendor
t = 0
Page fires. The text says something like stripe.charges.create rate up 94x in the last 5m. You open the policy editor for the vault key the agent runs under. There is one of these per agent or per agent run, never one per organization.
t + 5s
You set the policy to revoked. The proxy's in-memory state updates immediately on the local instance and starts replicating to peers. Replication is bounded at one second by the sync cadence; in practice it is hundreds of milliseconds.
t + 6s
The next outbound call from the agent hits a proxy instance. The instance reads the policy as revoked and returns 401 Unauthorized to the agent without forwarding to Stripe. The agent's HTTP client sees a 401, the SDK retries twice (still 401), the SDK gives up, the loop errors. The audit row gets written with policy_verdict = key_revoked and vendor_status = 0.
t + 30s
You query the audit log to scope the damage. The shape we use is documented in our audit-trail schema post; the query is a SELECT against agent_call_audit filtered on vault_key_id and the last 30 minutes, ordered by started_at. You sum cost_usd_parsed, count policy_verdict = 'allow' rows, and group by customer_scope_id to see whose money moved.
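The damage-scoping query can be sketched end-to-end. This uses an in-memory SQLite table; the table and column names follow this post's audit schema, the rows and the vault-key id are made up:

```python
import sqlite3

# Illustrative audit rows for one vault key during the incident window.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE agent_call_audit (
    vault_key_id TEXT, customer_scope_id TEXT,
    policy_verdict TEXT, cost_usd_parsed REAL, started_at TEXT)""")
con.executemany(
    "INSERT INTO agent_call_audit VALUES (?,?,?,?,?)",
    [("vk_1", "cus_A", "allow",       25.0, "2024-01-06T02:10:00"),
     ("vk_1", "cus_A", "allow",       25.0, "2024-01-06T02:11:00"),
     ("vk_1", "cus_B", "allow",       10.0, "2024-01-06T02:12:00"),
     ("vk_1", "cus_B", "key_revoked",  0.0, "2024-01-06T02:14:30")])

# Whose money moved, and how much: only rows the policy let through count.
rows = con.execute("""
    SELECT customer_scope_id, SUM(cost_usd_parsed), COUNT(*)
    FROM agent_call_audit
    WHERE vault_key_id = 'vk_1'
      AND policy_verdict = 'allow'
      AND started_at >= '2024-01-06T01:45:00'
    GROUP BY customer_scope_id
    ORDER BY 2 DESC""").fetchall()
for scope, total_usd, n_calls in rows:
    print(scope, total_usd, n_calls)
```

The key_revoked row is excluded on purpose: it never reached Stripe, so it is evidence the revoke worked, not damage to refund.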
t + 90s
You notify the agent runner (Temporal, Airflow, a cron, whatever started the job) that the agent's credential is dead. Most runners do this for you when the job exits with a non-zero status; check that yours does. If the agent is stuck retrying inside its own loop with no exit condition, kill the process directly.
t + 5m
You write the one-paragraph "what happened" message to the channel that everyone is now watching. You can write it from the audit query output without guessing. Sleep, fix in the morning.
Path B — you only have the upstream key
t = 0
Page fires. You navigate to the vendor's API-keys page. For Stripe that means signing in, finding the right account if you have multiple, finding the right environment, finding the right key. Two-factor prompt fires. The actual key-deletion click is at roughly t + 60s, not t + 5s, because the dashboard is slower than your terminal.
t + 60s
You click delete. Stripe's control plane acknowledges. Their edge nodes do not yet know.
t + 60s to t + 4m
The propagation tail. The agent keeps firing successfully. You can see it in your own metrics — calls-per-minute graph is unchanged. The temptation to click delete a second time, or rotate "one more time to be sure," is strong. Don't; you'll just create a second cache-invalidation race.
t + 4m to t + 8m
Calls start failing as the propagation completes. The agent's retry logic exhausts, the loop errors out. The runner sees the failures and pages a second time.
t + 10m
You realize that every other consumer of that Stripe key is now also broken. Your billing webhook handler can't refund failed charges; your reconciliation cron is throwing 401s; if you have a customer-facing dashboard that uses the same key, that broke too. You're now triaging an outage you caused while triaging the agent.
t + 30m to morning
Generate a new sk_live, distribute it to every consumer, redeploy services that pull credentials at boot. You will probably miss one. The Slack message you write at this point is much harder to write from log rows alone, because the upstream vendor's logs do not have agent_run_id and the agent's own logs do not have cost_usd_parsed.
The two paths are the same incident. The first ends at 02:20 with the agent dead and the rest of your stack untouched. The second ends an hour later with the agent dead, your reconciliation cron broken, and a key-distribution chore on Monday morning's calendar.
When rotate is still the right call
Rotation is not a smell. It is the right move in three specific situations:
- The credential is suspected of being leaked. If the agent itself isn't the problem — a former contractor's laptop, a scraped GitHub commit, a logging misconfiguration that wrote the secret to a shared log — you have to invalidate at the vendor because you do not control where the credential is being used. Revoke at a proxy only stops things that route through your proxy; a leaked secret in the wild needs the upstream rotation.
- You don't have a vault key in front of this vendor yet. The architectural answer takes setup. If your agent is calling Stripe directly with a long-lived key, your rotation latency is what it is, and the playbook above is the one you have. The fix is to put a layer in front, not to argue with the cache.
- Compliance requires it. SOC 2 Type II auditors sometimes prefer to see "we rotated the upstream key" on an incident timeline because the rotation is observable in the vendor's own audit log; a proxy-side revoke is observable only in your audit log, which the auditor also has to trust. Most modern auditors accept proxy-side revoke given a clean log; some old-school ones don't. Know which you have.
In every other case, revoke at the proxy and rotate later — at your leisure, on Monday, when the only thing burning is your laptop fan.
Two anti-playbooks to recognize
There are two response patterns that look like incident response and aren't.
The "let's add a circuit breaker right now" anti-playbook
It is 02:18. The agent is still firing. Someone in the channel says "let me add a feature flag, push, redeploy, and the agent will check it on next call." This is technically a kill-switch (it is pattern 3 in the kill-switch reference). It is also a deploy under pressure on a Saturday at 2am, against a codebase you cannot run a full test suite on, with a CI pipeline that takes 14 minutes to ship. The agent does another 2,100 calls during the deploy. Don't deploy at 2am; revoke (or rotate) at 2am, and add the flag as a Monday improvement.
The "let me first understand what is happening" anti-playbook
The on-call engineer opens the Stripe Dashboard, then the agent's logs, then the LLM proxy's traces, then a database client to look at the agent's input table, then a Slack channel to ask which on-call started the agent. Twenty minutes pass. The agent keeps firing. Stop the bleed first; understand second. The audit log will still be there when you've revoked. The conversation about whether the agent's reward function was wrong, or the prompt was wrong, or the input was wrong, can happen at the standup. The only thing that decays in real time is the bank account.
The morning after
The work that happens after the page is what makes the incident a one-time event instead of a recurring one. Three things to do before lunch.
Reconstruct the run from the audit table. Not from memory, not from Slack, not from a screenshot of a graph — from the agent_call_audit rows. Group on agent_run_id, filter on the time window, order by started_at, and read top to bottom. The query is in our audit schema post; if you don't have that table yet, build it before the next agent runs in production. The programmatic audit-trail page has the four-column minimum if the full sixteen feels excessive.
Decide if you owe customer refunds. Group the audit rows by customer_scope_id, sum cost_usd_parsed for each. Now you know who the agent moved money for, and how much, and whether any of it was outside what their actual support ticket asked for. The hardest version of this work is when you don't have customer_scope_id in your audit, because then you're reconciling Stripe charges against support tickets by hand.
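If you'd rather do this pass over exported rows than in SQL, the grouping is a few lines of Python. A sketch with illustrative row dicts, shaped like the audit schema described above:

```python
from collections import defaultdict

# Exported audit rows; real ones come from agent_call_audit.
audit_rows = [
    {"customer_scope_id": "cus_A", "policy_verdict": "allow",       "cost_usd_parsed": 25.0},
    {"customer_scope_id": "cus_A", "policy_verdict": "allow",       "cost_usd_parsed": 25.0},
    {"customer_scope_id": "cus_B", "policy_verdict": "allow",       "cost_usd_parsed": 10.0},
    {"customer_scope_id": "cus_B", "policy_verdict": "key_revoked", "cost_usd_parsed": 0.0},
]

moved = defaultdict(float)
for row in audit_rows:
    if row["policy_verdict"] == "allow":  # only calls that reached the vendor
        moved[row["customer_scope_id"]] += row["cost_usd_parsed"]

print(dict(moved))  # {'cus_A': 50.0, 'cus_B': 10.0}
```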
Write the postmortem with two recommendations. One to prevent this specific failure mode (the prompt was bad, the input was bad, the cap was too high). One to compress the next incident's timeline (revoke instead of rotate, vault key per agent run, audit table populated). The second one is the one that makes you faster every time. The first one trains you to expect that this failure mode is the only one you'll see, which is rarely true.
What you set up in calm hours
Everything in the playbook above is a runtime move that depends on prior setup. The relevant setup is, in our opinion, four things:
- One vault key per agent or per agent run. Not one key per organization shared across all agents — that gives you the rotation problem when you only need to revoke one consumer. Per-run is best when you can; per-agent is acceptable; per-org is the failure mode this whole post is about.
- An audit table with policy_verdict, cost_usd_parsed, customer_scope_id, agent_run_id. The four columns that turn an HTTP log into an audit trail. The audit-trail entry-point page walks through why each one earns its keep.
- A revoke action that takes effect on the next packet. Built in-house against your own proxy if you have one; built against a hosted layer like Keybrake if you don't. Either way, the key property is that the action is enforced at the network boundary, not by cooperative flag-reading inside the agent's code.
- A monitor on calls-per-minute per vault key, with an alert at 5x baseline. The page in path A above only fires if you have a monitor. Use whatever monitoring you have; the threshold matters more than the tool. Most agents have a stable rate; an outlier rate is signal even before the cap fires.
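The 5x-baseline monitor in the last item is a sliding-window counter. A minimal sketch — the class name, baseline, and window length are illustrative, and in production this lives in your metrics stack rather than in the proxy process:

```python
from collections import deque

class RateMonitor:
    """Alert when calls-per-minute for one vault key exceeds 5x baseline."""

    def __init__(self, baseline_per_min: float, multiplier: float = 5.0):
        self.threshold = baseline_per_min * multiplier
        self.calls: deque[float] = deque()  # timestamps in the last 60s

    def record(self, now: float) -> bool:
        """Record one call; return True if the last-minute rate breached."""
        self.calls.append(now)
        while self.calls and self.calls[0] < now - 60:
            self.calls.popleft()
        return len(self.calls) > self.threshold

mon = RateMonitor(baseline_per_min=6)  # alert above 30 calls/min
alerts = [mon.record(i * 0.4) for i in range(40)]  # a 1-per-400ms loop
print(any(alerts))  # the runaway rate trips the alert well inside a minute
```

A healthy agent at baseline never breaches; the 400 ms retry loop from earlier in this post trips the alert on its 31st call, about twelve seconds in.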
None of these are visible at 2am. All of them determine whether 2am is forty minutes or four. Build them on a Tuesday afternoon instead.
For when you're touching Stripe specifically
The Stripe-flavoured version of this whole problem is in our five-control checklist for handing an AI agent a Stripe API key. The post predates this one and gives the practical setup for vault keys, scopes, and the per-customer allowlist that prevents the worst class of cross-tenant blast. The mid-run-revoke property is also the row marked Partial in the ten-control coverage matrix for native Stripe Restricted Keys — Stripe's own primitive partially addresses it (you can delete a Restricted Key fast) but inherits the rotation tail (the cache still serves the old key for several minutes). That's why even Restricted Keys benefit from sitting behind a proxy.
If your agent is talking to Stripe via the Stripe Agent Toolkit over MCP, the relevant insertion is the STRIPE_API_BASE env-var swap; the catalogue of which toolkit verbs are dangerous and the proxy-insertion technique are on the Stripe Agent Toolkit MCP page. Restated as it relates to this post: the toolkit gives the agent fourteen verbs, any of which can be the runaway loop; revoke at the proxy stops all fourteen at once.
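For an agent using stripe-python directly rather than the toolkit, the equivalent insertion is one line of client config. A sketch — the proxy URL and vault key are placeholders, and setting the base via env var (as the toolkit page describes) versus SDK attribute are two routes to the same place:

```python
import os

# Point Stripe traffic at the layer you own instead of api.stripe.com.
# The URL is illustrative; STRIPE_API_BASE is the env var named above.
os.environ["STRIPE_API_BASE"] = "https://stripe-proxy.internal.example"

# With stripe-python, the SDK-level equivalent is the api_base attribute:
#   import stripe
#   stripe.api_key = "vault_key_abc..."   # the scoped key, not sk_live
#   stripe.api_base = os.environ["STRIPE_API_BASE"]

print(os.environ["STRIPE_API_BASE"])
```

Once every call routes through that base, the revoke path in this post applies regardless of which of the fourteen verbs the agent is abusing.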
The one-line summary
Rotate is the move that stops the credential everywhere, eventually. Revoke at a layer you own is the move that stops the credential here, now. At 2am, you want here-and-now. Build the layer.
Get Keybrake when v1 ships
Pre-launch waitlist for the SaaS-API governance proxy. Vault keys take effect on the next packet — sub-second revoke, no upstream rotation, no self-inflicted outage of your other consumers. We'll email you a working code sample for Stripe, Twilio, and Resend the day v1 lands.