
AutoGen Stripe Agent: Restricted API Keys, Spend Caps, and Multi-Agent Governance

By Keybrake · June 13, 2026 · 9 min read

AutoGen's conversation-driven architecture makes it straightforward to register a Stripe function tool and let an AssistantAgent call it across a multi-turn exchange. What you don't get is a cap on how much that conversation can spend on Stripe, a way to cut one conversation without rotating your key everywhere, or an audit trail that attributes each charge to the specific conversation and agent turn that created it.

This post covers the gaps specific to AutoGen's architecture — the conversation loop, the GroupChat function registry, and the code execution mode — and shows the governance pattern that closes all three.

The standard AutoGen Stripe tool pattern

AutoGen registers Stripe access as a function tool, split across two decorators: one that teaches the LLM the function signature (@register_for_llm) and one that actually executes it (@register_for_execution).

import os
import stripe
import autogen

config_list = [{"model": "gpt-4o", "api_key": os.environ["OPENAI_API_KEY"]}]

assistant = autogen.AssistantAgent(
    name="billing_assistant",
    system_message="You process billing operations for overdue accounts.",
    llm_config={"config_list": config_list},
)

user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=15,
    code_execution_config=False,
)

@user_proxy.register_for_execution()
@assistant.register_for_llm(description="Create a Stripe charge for a customer")
def create_stripe_charge(
    amount_cents: int,
    customer_id: str,
    description: str = "",
) -> str:
    stripe.api_key = os.environ["STRIPE_SECRET_KEY"]  # ← live key, no constraints
    charge = stripe.Charge.create(
        amount=amount_cents,
        currency="usd",
        customer=customer_id,
        description=description,
    )
    return f"Charge created: {charge.id}, amount: ${amount_cents / 100:.2f}"

The conversation starts with a task. The assistant decides when to call create_stripe_charge. The proxy executes it. This repeats until the task is done or max_consecutive_auto_reply is hit.

The Stripe key is a live secret, shared across every conversation, with no spend cap attached to it.

Three failure modes specific to AutoGen

Failure 1: the conversation retry storm. The assistant is asked to "charge all customers with balance > $100." A lookup returns 200 customers. The conversation loop iterates: call tool, get response, call tool, get response. The assistant misreads an error as a partial failure and retries. max_consecutive_auto_reply=15 doesn't save you: it counts replies rather than charges, a single reply can carry several tool calls, and the counter resets whenever a human reply arrives or a new chat is initiated. The result is 400 charges instead of 200 before anyone intervenes. In production, this plays out faster than any human monitoring cadence.

Failure 2: a GroupChat agent calls the shared function. You have three agents in a GroupChat: a billing agent, a reconciliation agent, and an analytics agent. The function is registered globally. The reconciliation agent, reasoning about a discrepancy, calls create_stripe_charge with a corrective amount it derived incorrectly. The billing agent did nothing wrong; the shared function registry gave every agent in the group the same Stripe access. Rotating the key now affects all three.

Failure 3: no per-conversation audit trail. Something looks wrong in next month's Stripe dashboard — a cluster of charges at unusual amounts. You need to know which AutoGen conversation created them. Stripe's logs show the charge, the key fingerprint, and the timestamp. Nothing in those logs shows the conversation ID, the agent turn count, or the task description that prompted the charge. Reconstructing the incident means correlating timestamps between your LLM provider's logs and Stripe's logs — if you kept them.

None of these are AutoGen bugs. They're places where the framework's job (coordinating agents, routing tool calls, managing conversation state) ends and your job (enforcing financial policy) begins.

Step 1: replace the live key with a Stripe restricted key

The first control is a Stripe restricted API key scoped to exactly the resources the billing agent uses. A restricted key limits which Stripe API endpoints the key can reach at all — independent of whether the agent decides to call them.

For a billing agent that creates charges and reads customer data:

# Minimum Stripe permissions for an AutoGen billing agent

Charges:              Write   ← create charges
Customers:            Read    ← look up customer records
Refunds:              None    ← agent cannot issue refunds
PaymentIntents:       Write   ← if using Payment Intents flow
Subscriptions:        None
Balance:              None
All other resources:  None

If the reconciliation agent in a GroupChat calls the billing function and tries to access /v1/subscriptions, the restricted key returns a 403 at the Stripe API layer — before the charge is made. That's the right layer for scope enforcement.

See the complete Stripe restricted key permissions reference for all ~60 resource toggles and the minimum permission sets for five agent archetypes.

A restricted key limits blast radius. It doesn't cap spend volume — the agent can still make unlimited charges within its permitted endpoints — and it doesn't give you per-conversation revoke. For that, you need a proxy layer.

Step 2: issue per-conversation vault keys

The governance pattern that closes the remaining gaps is a reverse proxy that sits between your AutoGen tool and the Stripe API. Each conversation gets its own short-lived vault key. The vault key carries a policy: a daily spend cap, an allowed endpoint list, and an expiry time.

The proxy has three jobs:

Spend enforcement — rejects calls that would exceed the daily cap before forwarding to Stripe
Key isolation — the vault key is per-conversation; revoking it stops only that conversation, not every other agent using Stripe
Audit log — every proxied call gets a row with timestamp, endpoint, HTTP status, parsed cost, and a correlation ID you can attach to the AutoGen conversation
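The three jobs above can be sketched as a single policy check. This is a minimal in-memory illustration under stated assumptions, not Keybrake's implementation: VaultPolicy, authorize, and every field name are hypothetical, amounts are tracked in cents, and the daily reset of the counter is left out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class VaultPolicy:
    """Hypothetical per-conversation policy attached to one vault key."""
    daily_cap_cents: int
    allowed_endpoints: frozenset
    expires_at: datetime
    spent_cents: int = 0
    revoked: bool = False
    audit_log: list = field(default_factory=list)

    def authorize(self, endpoint: str, amount_cents: int, now: datetime) -> bool:
        """Decide whether a call may be forwarded, and record the decision."""
        if self.revoked:
            reason = "revoked"                # key isolation: only this key stops
        elif now >= self.expires_at:
            reason = "expired"
        elif endpoint not in self.allowed_endpoints:
            reason = "endpoint_not_allowed"
        elif self.spent_cents + amount_cents > self.daily_cap_cents:
            reason = "cap_exceeded"           # spend enforcement before forwarding
        else:
            reason = "ok"
            self.spent_cents += amount_cents  # reserve the amount up front
        # Audit log: one row per call, allowed or not.
        self.audit_log.append((now.isoformat(), endpoint, amount_cents, reason))
        return reason == "ok"


now = datetime(2026, 6, 13, tzinfo=timezone.utc)
policy = VaultPolicy(
    daily_cap_cents=1_000,                    # $10.00 cap for the example
    allowed_endpoints=frozenset({"/v1/charges"}),
    expires_at=now + timedelta(hours=1),
)
first = policy.authorize("/v1/charges", 600, now)   # fits under the cap
second = policy.authorize("/v1/charges", 600, now)  # would total $12.00, rejected
```

The rejected call never reaches Stripe, but it still lands in audit_log, which is what makes a retry storm visible after the fact.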

From the tool's perspective, the only change is two lines — the key and the base URL:

import stripe

# Before
stripe.api_key = os.environ["STRIPE_SECRET_KEY"]  # live key

# After
stripe.api_key = conversation_vault_key           # per-conversation vault key
stripe.api_base = "https://proxy.keybrake.com/stripe"

AutoGen doesn't know a proxy is involved. The tool's function signature is unchanged. The LLM's tool description is unchanged. Only the key and base URL are different.

Updated AutoGen tool with governance

Here's the full pattern — vault key provisioned per conversation, injected into the tool at registration time:

import os
import uuid
import stripe
import autogen

config_list = [{"model": "gpt-4o", "api_key": os.environ["OPENAI_API_KEY"]}]

def create_billing_conversation(task: str, daily_cap_usd: float = 500.0):
    """Start a governed AutoGen billing conversation with a per-run vault key."""

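    # Each conversation gets its own ID for audit correlation
    conversation_id = str(uuid.uuid4())
    # Read from the environment here for brevity. In production, provision a fresh
    # vault key per conversation; the cap lives in the key's policy at the proxy,
    # so daily_cap_usd is informational in this sketch.
    vault_key = os.environ["KEYBRAKE_VAULT_KEY"]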

    assistant = autogen.AssistantAgent(
        name="billing_assistant",
        system_message=(
            "You process billing operations. "
            "Use create_stripe_charge for all Stripe charges. "
            f"Conversation ID for audit: {conversation_id}"
        ),
        llm_config={"config_list": config_list},
    )

    user_proxy = autogen.UserProxyAgent(
        name="user_proxy",
        human_input_mode="NEVER",
        max_consecutive_auto_reply=10,
        code_execution_config=False,
    )

    @user_proxy.register_for_execution()
    @assistant.register_for_llm(description="Create a Stripe charge for a customer")
    def create_stripe_charge(
        amount_cents: int,
        customer_id: str,
        description: str = "",
    ) -> str:
        stripe.api_key = vault_key                              # scoped to this conversation
        stripe.api_base = "https://proxy.keybrake.com/stripe"  # proxy enforces spend cap

        charge = stripe.Charge.create(
            amount=amount_cents,
            currency="usd",
            customer=customer_id,
            description=f"[conv:{conversation_id}] {description}",
        )
        return f"Charge created: {charge.id}, amount: ${amount_cents / 100:.2f}"

    # Initiate conversation
    user_proxy.initiate_chat(assistant, message=task)
    return conversation_id

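Each call to create_billing_conversation gets its own conversation ID, threaded into the Stripe description for audit reconstruction. Isolation of the key and the spend cap depends on provisioning a distinct vault key per call: the sample reads one pre-provisioned key from KEYBRAKE_VAULT_KEY for brevity, so pass in a freshly issued key to get a separate cap and a separate kill switch for each conversation.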
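The description field is free text, which makes it awkward to filter on later. Stripe also accepts a metadata dictionary of string keys and values on charges, so one option is to carry the conversation ID there as well. The helper below is a sketch: build_charge_params, the turn argument, and the metadata key names are hypothetical, and the caller is assumed to track the turn number itself.

```python
def build_charge_params(amount_cents, customer_id, conversation_id, turn, description=""):
    """Build keyword arguments for stripe.Charge.create with audit metadata."""
    if amount_cents <= 0:
        raise ValueError("amount_cents must be positive")
    return {
        "amount": amount_cents,
        "currency": "usd",
        "customer": customer_id,
        # Same prefix convention as the tool above, trimmed when description is empty.
        "description": f"[conv:{conversation_id}] {description}".strip(),
        # Stripe stores metadata values as strings, so the turn is converted.
        "metadata": {
            "autogen_conversation_id": conversation_id,
            "autogen_turn": str(turn),
        },
    }


params = build_charge_params(500, "cus_123", "abc", 3, "late fee")
```

Inside the tool, the call would become stripe.Charge.create(**params), leaving the function signature that the LLM sees unchanged.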

GroupChat pattern: one vault key per agent role

For GroupChat setups where multiple agents share access to Stripe, the governance pattern extends naturally: provision a separate vault key per agent role, each with different policy parameters.

import os
import autogen

# Separate vault keys per role — different caps, different endpoint allowlists
billing_vault_key   = os.environ["KEYBRAKE_BILLING_VAULT_KEY"]    # $500/day cap, Charges+PaymentIntents
reconcile_vault_key = os.environ["KEYBRAKE_RECONCILE_VAULT_KEY"]  # $0 cap (read-only), no Charges write

groupchat = autogen.GroupChat(
    agents=[billing_agent, reconciliation_agent, analytics_agent],
    messages=[],
    max_round=20,
)
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config={"config_list": config_list})

# Each agent's Stripe function uses its own vault key
# Reconciliation agent's function is bound to reconcile_vault_key — $0 write cap
# Even if it hallucinates a charge call, the proxy rejects it before Stripe sees it

The reconciliation agent's vault key carries a zero-dollar write cap. Even if the agent decides to call create_stripe_charge, the proxy rejects the call at the policy layer — not because the code stopped it, but because the key doesn't have budget.
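One way to bind a key to a role is a small factory that closes over the vault key and returns the function to register. This is a sketch rather than the only way to do it: make_charge_tool and send_charge are hypothetical names, and send_charge stands in for whatever code performs the proxied Stripe call, injected so the binding can be checked without network access.

```python
from typing import Callable


def make_charge_tool(vault_key: str, send_charge: Callable[..., str]) -> Callable[..., str]:
    """Return a charge function bound to one role's vault key."""

    def create_stripe_charge(amount_cents: int, customer_id: str, description: str = "") -> str:
        # The key is fixed at factory time; the agent cannot choose another one.
        return send_charge(
            api_key=vault_key,
            amount_cents=amount_cents,
            customer_id=customer_id,
            description=description,
        )

    return create_stripe_charge


def fake_send(**kwargs) -> str:
    """Stand-in for the proxied Stripe call, used only for illustration."""
    return f"sent with {kwargs['api_key']}"


billing_tool = make_charge_tool("vk_billing_example", fake_send)
reconcile_tool = make_charge_tool("vk_reconcile_example", fake_send)
```

Each returned function would then be registered only with the agent for that role, so the reconciliation agent's calls always travel with the zero-cap key.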

Comparison: raw key vs restricted key vs vault key

Control	Raw live key	Stripe restricted key	Vault key + proxy
Endpoint scope (what Stripe resources the key can touch)	All resources	Configured per key	Proxy allowlist + Stripe restricted key
Per-conversation spend cap	None	None	Daily USD cap enforced before forwarding
Per-agent isolation in GroupChat	None (shared key)	None (shared key)	Separate vault key per role
Kill switch for one conversation	Rotate full key (affects all agents)	Rotate full key (affects all agents)	Revoke vault key — sub-second, conversation-isolated
Per-call audit log with agent context	Stripe dashboard only (no agent attribution)	Stripe dashboard only (no agent attribution)	Proxy audit table: conversation ID, turn, cost, endpoint
Retry storm protection	None	None (charges still go through)	Spend cap stops accumulation before it compounds

Testing the governance layer

The proxy policy — spend cap, endpoint allowlist, expiry — should be verified in tests before any conversation reaches production Stripe. Here's a minimal pytest suite that confirms the policy is enforced:

import pytest
import stripe
import os

PROXY_BASE = "https://proxy.keybrake.com/stripe"
TEST_VAULT_KEY = os.environ["KEYBRAKE_TEST_VAULT_KEY"]  # test key with $10 cap
LIVE_KEY = os.environ["STRIPE_SECRET_KEY"]

def stripe_via_proxy():
    """Return a stripe module configured to use the test vault key + proxy."""
    s = stripe
    s.api_key = TEST_VAULT_KEY
    s.api_base = PROXY_BASE
    return s

def test_charge_within_cap_succeeds():
    """A $5 charge under the $10 daily cap should succeed."""
    s = stripe_via_proxy()
    # Use stripe test customer ID from test fixtures
    charge = s.Charge.create(
        amount=500,  # $5.00
        currency="usd",
        customer="cus_test_fixture",
        description="pytest: within-cap charge",
    )
    assert charge.id.startswith("ch_")

def test_charge_exceeds_cap_is_blocked():
    """A $20 charge against a $10 daily cap should be rejected by the proxy."""
    s = stripe_via_proxy()
    # stripe-python maps HTTP 402 to CardError and 403 to PermissionError,
    # so catch the shared base class and check the status code instead.
    with pytest.raises(stripe.error.StripeError) as exc_info:
        s.Charge.create(
            amount=2000,  # $20.00, exceeds the $10 daily cap
            currency="usd",
            customer="cus_test_fixture",
        )
    assert "cap" in str(exc_info.value).lower() or exc_info.value.http_status in (402, 403)

def test_disallowed_endpoint_is_blocked():
    """A subscription endpoint the policy doesn't permit should be blocked."""
    s = stripe_via_proxy()
    with pytest.raises(stripe.error.PermissionError):
        s.Subscription.list()  # not in billing-agent allowlist

def test_live_key_not_accepted_by_proxy():
    """The proxy should reject the live key — only vault keys are valid."""
    s = stripe
    s.api_key = LIVE_KEY
    s.api_base = PROXY_BASE
    with pytest.raises(stripe.error.AuthenticationError):
        s.Charge.create(amount=100, currency="usd", customer="cus_test_fixture")

Gap analysis: what this pattern still doesn't cover

The restricted key + vault key + proxy combination closes the spend, scope, and audit gaps. Three narrower gaps remain worth knowing about:

Parameter-level enforcement. A restricted Stripe key limits which endpoints the key can reach. It doesn't limit the parameters sent to those endpoints — specifically, it can't cap the amount field on a /v1/charges call. A billing agent with Charges:Write can create a $10,000 charge and a $1 charge with the same key. The proxy's spend cap catches this in aggregate (the daily total hits the cap), but a single outsized charge can still go through before the cap triggers.
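A per-charge ceiling is easy to add at the top of the tool function, before any Stripe call. The sketch below assumes a $200 ceiling. MAX_SINGLE_CHARGE_CENTS and check_charge_amount are hypothetical names, and the right ceiling depends on your billing model.

```python
MAX_SINGLE_CHARGE_CENTS = 20_000  # hypothetical ceiling: $200.00 per charge


def check_charge_amount(amount_cents, ceiling_cents=MAX_SINGLE_CHARGE_CENTS):
    """Raise ValueError if one charge is malformed, non-positive, or above the ceiling."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(amount_cents, bool) or not isinstance(amount_cents, int):
        raise ValueError("amount_cents must be an integer number of cents")
    if amount_cents <= 0:
        raise ValueError("amount_cents must be positive")
    if amount_cents > ceiling_cents:
        raise ValueError(
            f"charge of ${amount_cents / 100:.2f} exceeds the per-charge ceiling "
            f"of ${ceiling_cents / 100:.2f}"
        )
```

Calling check_charge_amount(amount_cents) as the first line of create_stripe_charge stops an outsized charge inside the tool, independent of how much of the daily cap is left.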

Customer scope. Neither a restricted key nor a vault key can limit which customer_id values the agent can charge. If the agent is supposed to process one customer and hallucinates a second customer ID, the proxy has no way to know that was wrong. Allowlisting specific customer IDs is application-layer logic that has to live in the tool function or a pre-call validation wrapper.
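A pre-call validation wrapper for customer scope can be as small as a closure over the IDs the task is allowed to touch. This is a sketch under that assumption: make_customer_guard is a hypothetical helper, and the allowlist is assumed to come from your own application when the conversation is created.

```python
def make_customer_guard(allowed_customer_ids):
    """Return a validator that accepts only the customers named for this task."""
    allowed = frozenset(allowed_customer_ids)

    def validate(customer_id: str) -> str:
        if customer_id not in allowed:
            # Built-in PermissionError, not stripe.error.PermissionError.
            raise PermissionError(
                f"customer {customer_id!r} is not in this task's allowlist"
            )
        return customer_id

    return validate


guard = make_customer_guard(["cus_A1", "cus_B2"])
```

Inside the tool, customer=guard(customer_id) rejects a hallucinated ID before the request is built.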

Cross-conversation accumulation. The vault key's daily spend cap is per-key. If you provision a new vault key for each conversation (the correct pattern), each conversation gets a fresh cap. A high-volume pipeline running 100 conversations a day at $50 each would pass through: every conversation stays within its own $500 cap, even though the aggregate is $5,000. Set the per-key cap to the per-conversation budget rather than the whole day's budget, and track the aggregate separately if you also need a ceiling across conversations.
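To make the arithmetic concrete, here is a sketch of a shared daily budget that is consulted before a conversation's vault key is issued. PipelineBudget and the $2,000 daily limit are hypothetical; the point is only that the aggregate has to be tracked somewhere other than the per-key cap.

```python
from dataclasses import dataclass


@dataclass
class PipelineBudget:
    """Hypothetical shared daily budget across all conversations."""
    daily_limit_cents: int
    reserved_cents: int = 0

    def reserve(self, conversation_cap_cents: int) -> bool:
        """Reserve one conversation's cap; refuse if the day's total would be exceeded."""
        if self.reserved_cents + conversation_cap_cents > self.daily_limit_cents:
            return False
        self.reserved_cents += conversation_cap_cents
        return True


budget = PipelineBudget(daily_limit_cents=200_000)  # $2,000 per day in total
# 100 conversations each ask for a $50 (5_000 cent) cap.
granted = sum(budget.reserve(5_000) for _ in range(100))
# granted == 40: the remaining 60 requests are refused instead of adding up to $5,000.
```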

FAQ

Does AutoGen support async function tools? Does the proxy handle async calls?

AutoGen supports async function tools: with the registration API used in this post, register an async def and start the conversation with a_initiate_chat. The proxy handles both sync and async calls identically, since it's a standard HTTPS reverse proxy. If you use asyncio with AutoGen, recent stripe-python releases expose async variants of the request methods with an _async suffix (for example, stripe.Charge.create_async), and they work with the proxy the same way: set api_key to the vault key and api_base to the proxy URL, then await the async methods.

What happens if the proxy is down? Does AutoGen retry automatically?

AutoGen itself doesn't retry a failed function call; any retry is the LLM deciding to call the tool again. If the proxy is unreachable, stripe-python raises stripe.error.APIConnectionError; if the proxy responds with a 5xx, stripe-python raises stripe.error.APIError. Your function can catch both and return a user-friendly string to the assistant, which the assistant can then reason about. The max_consecutive_auto_reply setting limits how many retries happen before the conversation terminates. Design your system prompt to treat proxy errors as terminal ("if the payment system is unavailable, stop and report") rather than "retry until success."
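One way to make proxy failures terminal at the tool layer is a small decorator that turns selected exceptions into a fixed "stop and report" message. This is a sketch: terminal_on_failure and the message text are hypothetical, and you would pass whichever stripe-python exception classes you want treated as terminal. The example uses the built-in ConnectionError so that it runs without Stripe installed.

```python
import functools


def terminal_on_failure(*error_types):
    """Decorator: convert the given exceptions into a 'stop and report' tool result."""

    def decorator(tool):
        @functools.wraps(tool)  # keeps the name and signature the LLM schema is built from
        def wrapper(*args, **kwargs):
            try:
                return tool(*args, **kwargs)
            except error_types as exc:
                return (
                    "PAYMENT_SYSTEM_UNAVAILABLE: do not retry. "
                    f"Stop and report this error: {type(exc).__name__}"
                )

        return wrapper

    return decorator


@terminal_on_failure(ConnectionError)
def charge_via_unreachable_proxy(amount_cents: int) -> str:
    """Illustrative tool body that always fails as if the proxy were down."""
    raise ConnectionError("proxy unreachable")


result = charge_via_unreachable_proxy(500)
```

Pairing this return string with a matching instruction in the system message gives the assistant a clear signal to end the task instead of calling the tool again.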

How do I provision vault keys per AutoGen conversation in production?

Call the Keybrake admin API before starting the conversation. The typical pattern in a FastAPI service looks like: receive the task request → call POST /admin/vault-keys with a policy (daily cap, allowed endpoints, TTL) → receive the vault key → pass it into create_billing_conversation() via the closure. The vault key is single-use by the conversation and expires at the TTL regardless of whether the conversation completes.
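The provisioning step might look like the sketch below, which builds the request without sending it. Only the POST /admin/vault-keys path and the three policy inputs (daily cap, allowed endpoints, TTL) come from the description above. The JSON field names, the bearer-token header, the build_vault_key_request helper, and the example base URL are all assumptions for illustration, so check the admin API for the real schema.

```python
import json
import urllib.request


def build_vault_key_request(admin_base_url, admin_token, daily_cap_usd,
                            allowed_endpoints, ttl_seconds):
    """Build (but do not send) a POST /admin/vault-keys request."""
    payload = {
        # Hypothetical field names; the real schema may differ.
        "daily_cap_usd": daily_cap_usd,
        "allowed_endpoints": list(allowed_endpoints),
        "ttl_seconds": ttl_seconds,
    }
    return urllib.request.Request(
        url=f"{admin_base_url.rstrip('/')}/admin/vault-keys",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {admin_token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_vault_key_request(
    admin_base_url="https://keybrake.example",  # placeholder host
    admin_token="admin-token-placeholder",
    daily_cap_usd=50.0,
    allowed_endpoints=["/v1/charges", "/v1/customers"],
    ttl_seconds=900,
)
```

Sending the request and reading the vault key from the response is left out, since the response shape is not described here.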

Can I use this pattern with AutoGen's code execution mode?

AutoGen's code execution mode (when code_execution_config is set to a non-False value) lets the agent write and execute arbitrary Python. That's a separate and larger surface area — code execution can import stripe with any key from any source, including os.environ["STRIPE_SECRET_KEY"]. If you use code execution, the vault key pattern doesn't protect you at the code layer. The recommended approach is to disable code execution (code_execution_config=False) for any agent with financial tool access, and route all Stripe operations through registered function tools where you control the key.

How is this different from just setting max_consecutive_auto_reply to a low number?

max_consecutive_auto_reply limits the number of turns in one exchange before human input is requested. It doesn't limit per-turn spend. An agent with 5 auto-replies and a $1,000 charge per reply can spend $5,000 before the limit fires. The spend cap at the proxy layer fires in dollars, not in turns — which is the constraint that actually matters for financial risk.
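The difference between the two limits shows up in a few lines of arithmetic. The simulate function below is a hypothetical illustration, not AutoGen or proxy code: it walks a list of attempted charges and stops either when the turns run out or when the next charge would cross a dollar cap.

```python
def simulate(charges_cents, max_turns, cap_cents=None):
    """Return total cents charged when limited by turns and, optionally, a dollar cap."""
    total = 0
    for amount in charges_cents[:max_turns]:
        if cap_cents is not None and total + amount > cap_cents:
            break  # the proxy would reject this call
        total += amount
    return total


replies = [100_000] * 5  # five replies, one $1,000 charge each

turn_limited = simulate(replies, max_turns=5)
# turn_limited == 500_000 cents ($5,000): the turn limit alone lets everything through

capped = simulate(replies, max_turns=5, cap_cents=150_000)
# capped == 100_000 cents ($1,000): the second charge would cross the $1,500 cap
```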

Does this work with AutoGen Studio?

AutoGen Studio is a UI layer on top of the AutoGen SDK. The function tools you register in code map to tools in Studio. The same vault key pattern applies: set the vault key and proxy URL in the function closure, register the function in your agent config, and Studio calls through the proxy the same way programmatic AutoGen does. If Studio supports environment variables in its agent config, you can set KEYBRAKE_VAULT_KEY and KEYBRAKE_PROXY_URL there.

Add spend caps and audit logs to your AutoGen Stripe tools

Keybrake is a scoped API-key proxy for the non-LLM SaaS APIs your agents call. Issue per-conversation vault keys, set daily spend caps, get a per-call audit log that attributes each charge to the agent and run that created it. The proxy is live at proxy.keybrake.com — route your AutoGen Stripe tool there in two lines.