Cohere Command R Stripe Integration: Restricted API Keys, Spend Caps, and Agent Governance
Cohere's Command R and Command R+ models make it easy to build billing-capable agents with tool calling: define a charge_stripe tool, pass it to co.chat(), and run a while response.tool_calls: loop until the model stops invoking tools. Three specific failure modes emerge in production. First, Command R+ can return multiple tool_calls entries in a single response, so two Stripe charges fire back to back before any tool result is registered. Second, retries compound the multi-step loop: whether they come from the Cohere SDK's RequestOptions(max_retries=N) or from an application-level wrapper, a retry after a network error has no memory of the charge that already completed, and charge_stripe is called again. Third, Cohere's chat API is stateless, so sessions reconstructed from a stored chat_history can replay completed billing operations when an ambiguous follow-up prompt references prior context.
The standard Cohere + Stripe setup
A typical Cohere billing agent looks like this with the v1 Python SDK:
import cohere
import stripe

co = cohere.Client(api_key=COHERE_API_KEY)
stripe.api_key = STRIPE_KEY  # ← bare key, shared by all calls

TOOLS = [{
    "name": "charge_stripe",
    "description": "Charge a customer for their monthly subscription.",
    "parameter_definitions": {
        "customer_id": {"description": "Stripe customer ID", "type": "str", "required": True},
        "amount_cents": {"description": "Amount to charge in cents", "type": "int", "required": True},
        "billing_period": {"description": "Billing period identifier e.g. '2026-Q2'", "type": "str", "required": True}
    }
}]

def run_billing_agent(message: str, chat_history: list) -> str:
    response = co.chat(
        model="command-r-plus-08-2024",
        message=message,
        tools=TOOLS,
        chat_history=chat_history,
    )
    while response.tool_calls:
        tool_results = []
        for tc in response.tool_calls:  # ← iterates ALL tool calls in one response
            result = execute_tool(tc.name, tc.parameters)
            tool_results.append({
                "call": tc,
                "outputs": [{"result": result}]
            })
        response = co.chat(
            model="command-r-plus-08-2024",
            message="",
            chat_history=response.chat_history,
            tools=TOOLS,
            tool_results=tool_results,
        )
    return response.text
Clean, readable, and correct for single-tool-call scenarios. The problems surface when the model returns more than one tool call, when an exception is raised mid-iteration, or when the session is reconstructed from a stored history.
Failure mode 1: parallel tool_calls emit two charges in one response
Command R+ supports parallel tool calling. When a billing task has scope ambiguity — "process all Q2 outstanding invoices," "charge both the starter and pro tier customers," "handle this batch of five accounts" — the model may return two charge_stripe entries in a single response.tool_calls list instead of one at a time.
What goes wrong: the for tc in response.tool_calls: loop executes both charges sequentially before any Stripe result is registered. The first stripe.Charge.create() completes. The second raises a transient APIConnectionError (Stripe accepted the request but the network response never arrived). The caller has no record of the first charge completing. The outer retry fires the full loop again. Both charges run a second time. Customer billed twice.
The model has no way to know this happened. From its perspective, it emitted two tool calls and received no results — so on the retry it generates the same two tool calls again. The Stripe duplicate-charge protection (same card, same amount within a few seconds) may or may not catch it depending on whether the customer ID and amount combination triggers the heuristic.
Here's a minimal reproduction with Command R+:
response = co.chat(
model="command-r-plus-08-2024",
message="Process all Q2 outstanding invoices for accounts A100 and A101",
tools=TOOLS,
)
# response.tool_calls may be:
# [
# ToolCall(name='charge_stripe', parameters={'customer_id': 'cus_A100', 'amount_cents': 4900, 'billing_period': '2026-Q2'}),
# ToolCall(name='charge_stripe', parameters={'customer_id': 'cus_A101', 'amount_cents': 4900, 'billing_period': '2026-Q2'}),
# ]
#
# Both fire in the for-loop. If the second raises APIConnectionError,
# the retry re-runs all tool calls — including the one that already charged cus_A100.
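The fix is an idempotency key that is stable across all retries. If stripe.Charge.create() receives the same idempotency key as a prior completed request, Stripe returns the saved result of the original charge rather than creating a new one. Stripe retains idempotency keys for a limited window (its documentation says keys can be pruned once they are at least 24 hours old), so the key protects against retries inside that window, not against replays that arrive days later.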
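To make those semantics concrete, here is a toy in-memory model of idempotent charge creation. It only illustrates the behavior described above and is not Stripe's implementation; `FakeChargeStore` and its method names are hypothetical, and the key strings are placeholders.

```python
import itertools


class FakeChargeStore:
    """Toy stand-in for an idempotent charge endpoint (hypothetical, not Stripe)."""

    def __init__(self):
        self._by_key = {}               # idempotency key -> stored charge
        self._ids = itertools.count(1)

    def create(self, customer, amount, idempotency_key):
        # A repeated key returns the stored result instead of charging again.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        charge = {"id": f"ch_{next(self._ids)}", "customer": customer, "amount": amount}
        self._by_key[idempotency_key] = charge
        return charge

    @property
    def charge_count(self):
        return len(self._by_key)


store = FakeChargeStore()
first = store.create("cus_A100", 4900, idempotency_key="cus_A100:4900:2026-Q2")
retry = store.create("cus_A100", 4900, idempotency_key="cus_A100:4900:2026-Q2")
other = store.create("cus_A101", 4900, idempotency_key="cus_A101:4900:2026-Q2")
```

The retried call for cus_A100 returns the first charge, while cus_A101 gets its own charge because its key differs.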
Failure mode 2: RequestOptions(max_retries=N) compounds the multi-step loop
The Cohere Python SDK exposes request_options on every API call for configuring timeout, retries, and headers. It is common to add retry logic at the SDK level for resilience:
from cohere.core import RequestOptions
response = co.chat(
model="command-r-plus-08-2024",
message=message,
tools=TOOLS,
chat_history=chat_history,
request_options=RequestOptions(max_retries=3, timeout_in_seconds=30),
)
What goes wrong: in a multi-step loop, co.chat() is called twice per iteration — once to get tool calls, once to pass tool results back. The second call (passing tool_results) can fail after Stripe has already charged the customer. When RequestOptions(max_retries=3) retries the second co.chat() call, the Cohere API receives the same message with the same tool results. The model sees the confirmed charge result and continues correctly. But if the failure happens on the first co.chat() call in the iteration — the one that emitted the tool calls — the SDK retries without knowing which tool calls already executed. Worse: if the application itself wraps the whole loop in a retry decorator, a network error after the Stripe charge completes causes the outer retry to re-run the loop from the beginning, calling charge_stripe again with no idempotency key and no memory of the prior charge.
There are two failure layers here. The SDK-level retry is mostly safe if you apply it only to the tool-result submission call. The dangerous layer is application-level retry on the whole loop:
# `retry` stands for any generic retry decorator; in SDK v5 the base
# exception is cohere.core.ApiError.
@retry(max_attempts=3, exceptions=(requests.Timeout, cohere.core.ApiError))
def run_billing_agent(message, chat_history):
    # If any co.chat() call raises, the decorator re-runs this entire function.
    # Stripe was already charged in the first attempt. Second attempt charges again.
    response = co.chat(model=MODEL, message=message, tools=TOOLS, chat_history=chat_history)
    while response.tool_calls:
        results = [execute_tool(tc.name, tc.parameters) for tc in response.tool_calls]
        response = co.chat(model=MODEL, message="", tool_results=results, ...)
    return response.text
Every framework covered in this series has this same outer-retry problem. The fix is always the same: an idempotency key derived from the billing operation's content, not from the run attempt. A content-hash key derived from (customer_id, amount_cents, billing_period) is identical whether it's the first attempt or the third, so Stripe collapses all retries into a single charge.
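A complementary way to shrink the blast radius of a retry wrapper is to scope it to the model call alone, so tool execution sits outside the retried region. The sketch below uses stand-in functions rather than the Cohere SDK, and `call_with_retry` is a hypothetical helper. It does not replace the idempotency key, which remains the guard once anything re-runs the tool.

```python
import time


def call_with_retry(fn, attempts=3, delay_s=0.0, retry_on=(ConnectionError,)):
    """Retry one side-effect-free call; re-raise after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise
            time.sleep(delay_s)


charges = []              # records every executed charge (stand-in for Stripe)
model_calls = {"n": 0}    # counts calls to the stand-in model endpoint


def flaky_model_call():
    # Stand-in for the co.chat() call that submits tool results:
    # fails once with a network error, then succeeds.
    model_calls["n"] += 1
    if model_calls["n"] == 1:
        raise ConnectionError("simulated network error")
    return "done"


def run_once():
    charges.append(("cus_A100", 4900, "2026-Q2"))   # the tool runs exactly once
    return call_with_retry(flaky_model_call)         # only the model call is retried


result = run_once()
```

Here the model call is attempted twice but the charge is recorded once, because the retry never wraps the tool execution.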
Failure mode 3: chat_history accumulation replays billing on resumed sessions
Cohere's chat API is stateless. There is no server-side session. The caller reconstructs context on every call by passing chat_history, a list of prior USER, CHATBOT, SYSTEM, and TOOL turns. A typical customer-service billing agent stores this history in a database and reloads it when the customer opens a new support conversation.
What goes wrong: stored chat_history contains the prior tool call and its result: a CHATBOT turn with tool_calls (charge_stripe, {customer_id: ..., amount_cents: ..., billing_period: "2026-Q2"}) followed by a TOOL turn with tool_results ({status: "succeeded", charge_id: "ch_..."}). This history is fed back into the next session. The customer sends a follow-up message: "can you also handle June?" or "please retry that." The model sees a completed billing tool call in its context window. It has no signal that "retry" is ambiguous, so it may call charge_stripe again with the same or updated arguments. Nothing deduplicates the call, because it belongs to a new conversation turn rather than a retry of the same SDK call. Stripe creates a new charge.
Here is the chat history structure that creates the replay risk:
# Chat history stored in DB after a successful billing session
stored_history = [
{"role": "USER", "message": "Process Q1 invoice for cus_A100"},
{"role": "CHATBOT", "message": "", "tool_calls": [
{"name": "charge_stripe", "parameters": {"customer_id": "cus_A100", "amount_cents": 4900, "billing_period": "2026-Q1"}}
]},
{"role": "TOOL", "tool_results": [{"call": ..., "outputs": [{"status": "succeeded", "charge_id": "ch_xyz"}]}]},
{"role": "CHATBOT", "message": "Successfully charged $49.00 for Q1. Let me know if you need anything else."},
]
# New session, customer follows up
response = co.chat(
model="command-r-plus-08-2024",
message="Now process Q2 as well", # Legitimate new request
chat_history=stored_history, # ← Prior billing is visible to the model
tools=TOOLS,
)
# Model sees: Q1 charge already done via charge_stripe.
# "Q2 as well" → calls charge_stripe again for 2026-Q2. Correct.
# But: "same as last time" → may call charge_stripe with 2026-Q1 again. Duplicate.
# "retry if it failed" → same as above. Duplicate charge if Q1 succeeded.
The fix has two parts. First, a check_existing_charge tool (using a read-only audit vault key) gives the model a way to look up prior charge status before creating a new one. Second, a content-hash idempotency key collapses any duplicate charge_stripe calls with the same (customer_id, amount_cents, billing_period) tuple into a single Stripe charge, regardless of how many conversation turns produced the call.
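The lookup half of that fix could be sketched as below. The article does not show check_existing_charge, so this is one possible shape under a loud assumption: charges carry the billing period in `metadata["billing_period"]`, which only holds if the charge was created with that metadata set. `find_existing_charge` is a hypothetical helper that works on plain dict-like records so it can be exercised without network access.

```python
def find_existing_charge(charges, customer_id, amount_cents, billing_period):
    """Return a summary of the first succeeded matching charge, or None.

    `charges` is any iterable of dict-like charge records. Matching on
    metadata["billing_period"] is an assumption, not a Stripe default.
    """
    for ch in charges:
        meta = ch.get("metadata") or {}
        if (
            ch.get("customer") == customer_id
            and ch.get("amount") == amount_cents
            and ch.get("status") == "succeeded"
            and meta.get("billing_period") == billing_period
        ):
            return {"exists": True, "charge_id": ch["id"]}
    return None


def check_existing_charge(charges, customer_id, amount_cents, billing_period):
    """Tool-facing wrapper: always returns a dict the model can read."""
    found = find_existing_charge(charges, customer_id, amount_cents, billing_period)
    return found or {"exists": False}


sample = [
    {"id": "ch_xyz", "customer": "cus_A100", "amount": 4900,
     "status": "succeeded", "metadata": {"billing_period": "2026-Q1"}},
]
q1 = check_existing_charge(sample, "cus_A100", 4900, "2026-Q1")
q2 = check_existing_charge(sample, "cus_A100", 4900, "2026-Q2")
```

In a real agent the records would come from a list call made with the read-only audit key, and the tool would be registered alongside charge_stripe so the model can consult it before charging.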
The two-layer fix
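The pattern that closes all three failure modes combines a content-hash idempotency key with a per-run vault key from a spend-cap proxy. Neither layer alone is sufficient: the idempotency key deduplicates repeated operations, and the vault key bounds what any single run can spend.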
Layer 1: content-hash idempotency key
Derive the idempotency key from the billing operation's content, not from the request attempt. The same (customer_id, amount_cents, billing_period) tuple always produces the same key — so parallel tool calls, application-level retries, and chat-history replays all collapse to a single Stripe charge:
import hashlib
import stripe

def make_idempotency_key(customer_id: str, amount_cents: int, billing_period: str) -> str:
    payload = f"{customer_id}:{amount_cents}:{billing_period}:cohere-billing"
    return hashlib.sha256(payload.encode()).hexdigest()[:40]

def charge_stripe_tool(customer_id: str, amount_cents: int, billing_period: str) -> dict:
    idem_key = make_idempotency_key(customer_id, amount_cents, billing_period)
    try:
        charge = stripe.Charge.create(
            amount=amount_cents,
            currency="usd",
            customer=customer_id,
            idempotency_key=idem_key,
        )
        return {"status": "succeeded", "charge_id": charge.id}
    except stripe.error.StripeError as e:
        # Return the error as a tool result; do NOT re-raise.
        # Re-raising causes the Cohere loop to surface an exception,
        # which application-level retry wrappers treat as retriable.
        return {"status": "error", "message": str(e)}
Layer 2: per-run vault keys via Keybrake proxy
A restricted Stripe key limits which endpoints the agent can call, but it does not limit how much it can charge. A vault key from the proxy adds a daily USD cap per key — so a runaway billing loop for one customer cannot exhaust the day's budget for all customers, and a compromised key cannot drain the account:
import cohere
import stripe

# make_idempotency_key, get_vault_key, TOOLS, and COHERE_API_KEY are defined elsewhere.
co = cohere.Client(api_key=COHERE_API_KEY)

def make_billing_tool(vault_key: str):
    """Returns a charge_stripe callable bound to one vault key."""
    stripe_client = stripe.StripeClient(
        api_key=vault_key,
        # StripeClient takes base_addresses rather than base_url. No trailing
        # slash: the SDK appends paths such as /v1/charges to this base.
        base_addresses={"api": "https://proxy.keybrake.com/stripe"},
    )

    def charge_stripe_tool(customer_id: str, amount_cents: int, billing_period: str) -> dict:
        idem_key = make_idempotency_key(customer_id, amount_cents, billing_period)
        try:
            charge = stripe_client.charges.create(params={
                "amount": amount_cents,
                "currency": "usd",
                "customer": customer_id,
            }, options={"idempotency_key": idem_key})
            return {"status": "succeeded", "charge_id": charge.id}
        except stripe.StripeError as e:
            return {"status": "error", "message": str(e)}

    return charge_stripe_tool

def run_billing_agent(message: str, chat_history: list) -> str:
    vault_key = get_vault_key("billing")  # per-run key from Keybrake
    charge_fn = make_billing_tool(vault_key)
    response = co.chat(
        model="command-r-plus-08-2024",
        message=message,
        tools=TOOLS,
        chat_history=chat_history,
    )
    while response.tool_calls:
        tool_results = []
        for tc in response.tool_calls:
            if tc.name == "charge_stripe":
                result = charge_fn(**tc.parameters)
            else:
                # Unknown tool: report it instead of reusing a stale result.
                result = {"status": "error", "message": f"unknown tool {tc.name}"}
            tool_results.append({"call": tc, "outputs": [result]})
        response = co.chat(
            model="command-r-plus-08-2024",
            message="",
            chat_history=response.chat_history,
            tools=TOOLS,
            tool_results=tool_results,
        )
    return response.text
The one-line proxy override is stripe.StripeClient(api_key=vault_key, base_addresses={"api": "https://proxy.keybrake.com/stripe"}); stripe-python's StripeClient accepts base_addresses rather than a base_url argument. The proxy enforces the endpoint allowlist (billing vault key: POST /v1/charges only) and the daily USD cap (billing vault key cap = expected max single-run charge). An audit vault key (GET /v1/charges only, no cap) powers the check_existing_charge lookup tool that guards against chat_history replay.
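The two proxy-side checks can be pictured with a toy policy object. This is a sketch of the idea only, not Keybrake's implementation: `VaultKeyPolicy` is hypothetical, the 403 status for a disallowed endpoint is an assumption, and spend is recorded at authorization time for simplicity (a real proxy would presumably reconcile against the upstream result).

```python
from dataclasses import dataclass


@dataclass
class VaultKeyPolicy:
    """Hypothetical per-key policy: allowed (method, path) pairs plus a daily cap."""
    allowed: frozenset
    daily_cap_cents: int
    spent_today_cents: int = 0

    def authorize(self, method: str, path: str, amount_cents: int = 0):
        """Return (http_status, reason); record spend only when allowed."""
        if (method, path) not in self.allowed:
            return 403, "endpoint not allowed"
        if self.spent_today_cents + amount_cents > self.daily_cap_cents:
            return 429, "daily cap exceeded"
        self.spent_today_cents += amount_cents
        return 200, "ok"


billing = VaultKeyPolicy(
    allowed=frozenset({("POST", "/v1/charges")}),
    daily_cap_cents=5000,
)
first = billing.authorize("POST", "/v1/charges", 4900)    # within the cap
second = billing.authorize("POST", "/v1/charges", 4900)   # would exceed the cap
refund = billing.authorize("POST", "/v1/refunds", 4900)   # not on the allowlist
```

With a cap sized to one expected charge, the first request passes and a second charge in the same run is refused before it reaches Stripe.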
Comparison: raw key vs restricted key vs vault key
| Property | Raw key (sk_live_) | Restricted key | Vault key (proxy) |
|---|---|---|---|
| Endpoint allowlist | All Stripe endpoints | Selected resource types | Exact method+path (POST /v1/charges) |
| Daily USD cap | None | None | Per-key cap enforced at proxy |
| Per-run isolation | Module-level global — all calls share | Same global problem | New key per co.chat() loop run |
| Parallel tool call guard | No dedup — two charges fire | No dedup | Idempotency key collapses duplicates |
| SDK/app retry guard | No guard — re-fires charge | No guard | Content-hash idem key across all retries |
| Chat history replay guard | No guard | No guard | Audit vault key powers pre-charge lookup; idem key collapses replays |
| Audit log | Stripe dashboard only | Stripe dashboard only | Per-request structured log at proxy (customer, agent run ID, key, amount, timestamp) |
Pytest enforcement suite
import hashlib
from unittest.mock import MagicMock, patch

import stripe

# Hypothetical module name: import these from wherever your agent code defines them.
from billing_agent import charge_stripe_tool, get_vault_key, make_billing_tool

def make_idempotency_key(customer_id, amount_cents, billing_period):
    payload = f"{customer_id}:{amount_cents}:{billing_period}:cohere-billing"
    return hashlib.sha256(payload.encode()).hexdigest()[:40]

def test_idempotency_key_is_deterministic():
    k1 = make_idempotency_key("cus_A100", 4900, "2026-Q2")
    k2 = make_idempotency_key("cus_A100", 4900, "2026-Q2")
    assert k1 == k2

def test_different_periods_produce_different_keys():
    k1 = make_idempotency_key("cus_A100", 4900, "2026-Q2")
    k2 = make_idempotency_key("cus_A100", 4900, "2026-Q3")
    assert k1 != k2

def test_stripe_error_returned_not_raised():
    # Patch the client class so the tool's closure holds a mock client.
    with patch("stripe.StripeClient") as client_cls:
        client_cls.return_value.charges.create.side_effect = (
            stripe.APIConnectionError("timeout"))
        charge_fn = make_billing_tool("vault_key_for_tests")
        result = charge_fn("cus_A100", 4900, "2026-Q2")
    assert result["status"] == "error"
    assert "timeout" in result["message"]
    # No exception propagated, so no application-level retry trigger

def test_parallel_tool_calls_deduplicated():
    calls = []

    def fake_charge(**kwargs):
        # Record the idempotency key the tool actually sent to Stripe.
        calls.append(kwargs["idempotency_key"])
        return MagicMock(id="ch_test")

    # Simulate two parallel tool_calls for the same billing operation
    with patch("stripe.Charge.create", side_effect=fake_charge):
        result_a = charge_stripe_tool("cus_A100", 4900, "2026-Q2")
        result_b = charge_stripe_tool("cus_A100", 4900, "2026-Q2")
    # Same idempotency key used for both calls
    assert calls[0] == calls[1]

def test_per_run_vault_keys_are_distinct():
    key_a = get_vault_key("billing")
    key_b = get_vault_key("billing")
    assert key_a != key_b  # Each run issues a fresh vault key from the proxy
Gap analysis
1. Cohere v2 API (ClientV2) messages format
The v2 API uses an OpenAI-compatible messages list instead of chat_history. The chat-history replay risk is identical — stored messages are reconstructed on session resume. The parallel tool-call and retry failure modes also apply unchanged. Apply the same idempotency key and vault key patterns; only the SDK call signature differs (co.chat(messages=[...], tools=[...]) vs co.chat(message=..., chat_history=[...])).
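Beyond the call signature, the tool definition itself changes shape in v2. As a rough sketch, assuming the v2 function-style schema with JSON Schema parameters, a v1 definition could be converted like this; `to_v2_tool` and the type map are hypothetical helpers, and the map covers only the simple scalar types.

```python
V1_TO_JSON_SCHEMA = {"str": "string", "int": "integer", "float": "number", "bool": "boolean"}


def to_v2_tool(v1_tool: dict) -> dict:
    """Convert a v1 tool dict (parameter_definitions) to a v2 function-style tool."""
    props, required = {}, []
    for name, spec in v1_tool["parameter_definitions"].items():
        props[name] = {
            "type": V1_TO_JSON_SCHEMA[spec["type"]],
            "description": spec.get("description", ""),
        }
        if spec.get("required"):
            required.append(name)
    return {
        "type": "function",
        "function": {
            "name": v1_tool["name"],
            "description": v1_tool["description"],
            "parameters": {"type": "object", "properties": props, "required": required},
        },
    }


v1_tool = {
    "name": "charge_stripe",
    "description": "Charge a customer for their monthly subscription.",
    "parameter_definitions": {
        "customer_id": {"description": "Stripe customer ID", "type": "str", "required": True},
        "amount_cents": {"description": "Amount to charge in cents", "type": "int", "required": True},
        "billing_period": {"description": "Billing period identifier", "type": "str", "required": True},
    },
}
v2_tool = to_v2_tool(v1_tool)
```

In v2 the tool-call arguments typically arrive as a JSON string rather than a dict, so they would need parsing before being handed to the charge function. The idempotency key is computed from the parsed values either way.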
2. Command A and future model releases
Cohere's Command A model (2025) has a 256k context window and is optimized for agentic tasks. A larger context window increases the chat-history replay risk: more prior billing operations fit in context, and the model has more historical evidence to draw on when deciding whether to re-execute a tool. Content-hash idempotency keys are context-window-agnostic — the key is derived from the operation's content, not the conversation position.
3. Cohere Embed + Rerank in billing pipelines
Some billing agents use Cohere Embed to retrieve relevant invoice records from a vector store before calling charge_stripe. If the retrieval step returns the same invoice record twice (near-duplicate embeddings, re-indexed documents), the agent may call charge_stripe twice for the same invoice in a single run. A content-hash idempotency key derived from the invoice record's (customer_id, amount_cents, billing_period) fields deduplicates this at the Stripe layer regardless of how many retrieval results the model consumed.
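The idempotency key is the backstop, but duplicates can also be dropped before the model ever sees them. A minimal sketch, assuming each retrieved record is a dict with the three fields named above (`dedupe_invoices` is a hypothetical helper):

```python
def dedupe_invoices(records):
    """Keep the first record for each (customer_id, amount_cents, billing_period)."""
    seen, unique = set(), []
    for rec in records:
        key = (rec["customer_id"], rec["amount_cents"], rec["billing_period"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


retrieved = [
    {"customer_id": "cus_A100", "amount_cents": 4900, "billing_period": "2026-Q2", "doc": "v1"},
    {"customer_id": "cus_A100", "amount_cents": 4900, "billing_period": "2026-Q2", "doc": "v2"},
    {"customer_id": "cus_A101", "amount_cents": 4900, "billing_period": "2026-Q2", "doc": "v1"},
]
unique = dedupe_invoices(retrieved)
```

The re-indexed copy of the cus_A100 invoice is dropped, so only two invoices reach the model.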
4. Structured generation and tool-call schema mismatch
Command R+ uses a trained tool-calling format. When the model's output does not match the declared parameter_definitions schema (wrong type for amount_cents, missing billing_period), the Cohere SDK raises a cohere.BadRequestError or returns a malformed ToolCall object. Application code that catches this and retries the full co.chat() call restarts the model from the user message — if the model previously succeeded in calling charge_stripe before hitting the schema error on a subsequent tool, the retry re-executes the successful charge. Validate tool output types before passing them to Stripe; never retry the full loop on schema errors after any Stripe call has completed.
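The validation step could look roughly like the following. The specific rules (a `cus_` prefix, a strictly positive integer amount) are illustrative assumptions rather than requirements stated by Cohere or Stripe, and `validate_charge_params` is a hypothetical helper that never calls Stripe itself.

```python
def validate_charge_params(params: dict) -> dict:
    """Return cleaned parameters or raise ValueError."""
    required = ("customer_id", "amount_cents", "billing_period")
    missing = [k for k in required if k not in params]
    if missing:
        raise ValueError(f"missing parameters: {missing}")
    customer_id = params["customer_id"]
    amount = params["amount_cents"]
    period = params["billing_period"]
    if not isinstance(customer_id, str) or not customer_id.startswith("cus_"):
        raise ValueError("customer_id must be a string starting with 'cus_'")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, int) or amount <= 0:
        raise ValueError("amount_cents must be a positive integer")
    if not isinstance(period, str) or not period:
        raise ValueError("billing_period must be a non-empty string")
    return {"customer_id": customer_id, "amount_cents": amount, "billing_period": period}


ok = validate_charge_params(
    {"customer_id": "cus_A100", "amount_cents": 4900, "billing_period": "2026-Q2"}
)
```

A failed validation is best returned to the model as an error tool result, for the same reason the charge function returns Stripe errors instead of raising them.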
FAQ
Does Cohere's force_single_step=True prevent parallel tool calls?
force_single_step=True (v1 API) restricts the model to a single round of tool calling: it emits its tool calls once and then answers from the results, instead of chaining further tool-calling steps. It does not limit how many tool calls that one response contains, so parallel charge_stripe calls remain possible and failure mode 1 stays open. It also does nothing for SDK/app-level retry (failure mode 2) or chat_history replay (failure mode 3). If your billing logic must be strictly sequential, enforce that in your own loop, for example by rejecting any response that carries more than one charge_stripe call, and apply idempotency keys regardless.
Can I use Stripe's built-in idempotency without a content-hash key?
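You can pass any string as the idempotency key. A UUID generated once per loop run works if the retry logic always uses the same UUID. The problem is that the UUID is lost on application restart, container redeploy, or after an uncaught exception clears the local variable, and the next attempt generates a new one. A content-hash key derived from (customer_id, amount_cents, billing_period) survives all of these events because it is recomputed from data, not stored state. Either kind of key only deduplicates while Stripe still retains it (keys can be pruned once they are at least 24 hours old), so longer-lived replays still need the pre-charge lookup.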
How do I handle two legitimate charges for the same customer in the same billing period?
Add a disambiguator to the key: (customer_id, amount_cents, billing_period, charge_type) where charge_type is "subscription", "overage", "setup-fee", etc. This keeps the key stable across retries while allowing multiple distinct charges per period.
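A sketch of the extended key function, reusing the hashing scheme from Layer 1. The default value for `charge_type` is an assumption made here so existing call sites keep working. Note that adding the field changes every key, so it should not be rolled out in the middle of a retry window.

```python
import hashlib


def make_idempotency_key(customer_id, amount_cents, billing_period,
                         charge_type="subscription"):
    """Content-hash key with a charge_type disambiguator."""
    payload = f"{customer_id}:{amount_cents}:{billing_period}:{charge_type}:cohere-billing"
    return hashlib.sha256(payload.encode()).hexdigest()[:40]


sub_key = make_idempotency_key("cus_A100", 4900, "2026-Q2", "subscription")
sub_retry = make_idempotency_key("cus_A100", 4900, "2026-Q2", "subscription")
overage_key = make_idempotency_key("cus_A100", 4900, "2026-Q2", "overage")
```

A retried subscription charge reuses `sub_key`, while an overage of the same amount in the same period gets a distinct key and therefore a distinct charge.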
What happens if the vault key daily cap is exhausted mid-batch?
The proxy returns 429 Daily cap exceeded. The charge_stripe_tool function catches this as a StripeError and returns {"status": "error", "message": "daily cap exceeded"}. The model receives this as a tool result and can either stop the loop or report the cap to the caller. The key distinction from an uncapped key: the proxy enforces the cap on a single vault key. Other customers' billing runs use their own vault keys with their own caps — one runaway batch does not exhaust the shared Stripe account.
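For batch runs driven by application code rather than by the model, the same error result can be used to stop early. The sketch below is one possible shape: `run_batch` is hypothetical, and detecting the cap by looking for "cap" in the message is a crude assumption that a real integration would replace with a check on the status code or error type.

```python
def run_batch(charge_fn, invoices):
    """Charge invoices in order; stop at the first cap error and report the rest."""
    processed, skipped = [], []
    for i, inv in enumerate(invoices):
        result = charge_fn(**inv)
        if result["status"] == "error" and "cap" in result["message"].lower():
            skipped = invoices[i:]          # includes the invoice that hit the cap
            break
        processed.append((inv, result))
    return processed, skipped


def fake_charge(customer_id, amount_cents, billing_period):
    # Stand-in for the vault-key charge tool: the cap trips on the third customer.
    if customer_id == "cus_C":
        return {"status": "error", "message": "Daily cap exceeded"}
    return {"status": "succeeded", "charge_id": "ch_" + customer_id}


invoices = [
    {"customer_id": c, "amount_cents": 4900, "billing_period": "2026-Q2"}
    for c in ("cus_A", "cus_B", "cus_C", "cus_D")
]
processed, skipped = run_batch(fake_charge, invoices)
```

The skipped invoices can be retried on a later run with the same content-hash keys, so any charge that did go through is not repeated.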
Does this pattern work with Cohere's multi-agent connectors API?
Cohere's connectors (server-side data retrieval integrations) do not expose tool-calling in the same way. For custom tool use with the Cohere connectors API, the same pattern applies: wrap the Stripe call in a connector handler that computes a content-hash idempotency key before calling stripe.Charge.create(). Per-run vault keys require that the connector handler receive the vault key per request, not at initialization time.
Should I use request_options=RequestOptions(max_retries=0) to disable SDK retries?
Disabling SDK retries is a reasonable safeguard for the tool-result submission call (the second co.chat() in the loop). For the initial call (getting tool calls), SDK retries are mostly safe — the model hasn't called any tools yet. The most important layer to protect is the application-level retry wrapper around the whole loop. That retry must not re-run charge_stripe without an idempotency key. Whether or not SDK retries are enabled, a content-hash idempotency key in the tool function is the correct guard.