LlamaIndex Stripe Integration: Restricted API Keys, Spend Caps, and Agent Governance

By Keybrake · June 14, 2026 · 9 min read

LlamaIndex makes wiring a Stripe billing tool into a ReActAgent straightforward. What it doesn't make obvious is that the same agent will retry that tool call on any observation error — without idempotency keys — turning a single transient Stripe timeout into two charges. SubQuestion decomposition parallelizes the problem, and multi-agent pipelines amplify it further.

This post covers three failure modes that are specific to LlamaIndex's architecture, shows the minimal code fix for each, and then presents the governance pattern that scales across all of them: restricted Stripe API keys as a first-layer filter, and per-agent vault keys issued through a proxy as a second layer that enforces spend caps and gives you a kill switch.

The standard LlamaIndex Stripe pattern

LlamaIndex agents use FunctionTool to expose Python functions as callable tools. Adding Stripe takes four steps: define the charge function, wrap it in FunctionTool, pass it to a ReActAgent, and call chat().

import os
import stripe
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI

stripe.api_key = os.environ["STRIPE_SECRET_KEY"]  # sk_live_... — module-level global

def create_charge(customer_id: str, amount_cents: int, description: str) -> str:
    """Create a Stripe charge and return the charge ID."""
    charge = stripe.Charge.create(
        customer=customer_id,
        amount=amount_cents,
        currency="usd",
        description=description,
    )
    return charge.id

charge_tool = FunctionTool.from_defaults(fn=create_charge)

llm = OpenAI(model="gpt-4o")
agent = ReActAgent.from_tools(
    tools=[charge_tool],
    llm=llm,
    verbose=True,
)

response = agent.chat("Charge customer cus_ABC123 $29.99 for the Pro plan")
print(response)  # Thought: I need to call create_charge... → ch_3R4...

This works correctly in the happy path. The gaps appear when the agent encounters an error, when the query requires multiple Stripe operations, or when you add more agents to the pipeline.

Failure mode 1: ReActAgent retry without idempotency keys

Risk: ReActAgent retries the tool call when the observation step returns an error or when the LLM's action parsing fails. Each retry is a fresh stripe.Charge.create() call. Without an idempotency key, Stripe has no way to know it's a retry — and bills the customer twice.

LlamaIndex's ReAct loop is: Thought → Action → Observation → (repeat until Final Answer). If the Observation step receives an exception from the tool — a Stripe timeout, a network error, an unexpected response format — the agent logs the error as an observation and loops back to Thought. It then calls the same tool again with the same or corrected arguments.

# What the ReAct loop actually does when Stripe times out:
#
# Thought: I need to charge the customer.
# Action: create_charge(customer_id="cus_ABC123", amount_cents=2999, ...)
# Observation: Error — stripe.error.APIConnectionError: Request timed out
#
# Thought: The charge failed. I should try again.
# Action: create_charge(customer_id="cus_ABC123", amount_cents=2999, ...)
# Observation: ch_3R4newcharge   ← Stripe created a NEW charge
#
# Final Answer: The charge was created: ch_3R4newcharge
#
# Meanwhile, the first call may have succeeded server-side
# (timeout on response, not on request) — two charges billed.

The fix is to inject a stable idempotency key per agent run. The key must be generated once before the agent starts — not inside the tool function, where each retry call gets a different key:

import uuid
from functools import partial

def create_charge_with_key(
    customer_id: str,
    amount_cents: int,
    description: str,
    idempotency_key: str,
) -> str:
    """Create a Stripe charge with an idempotency key to prevent duplicates."""
    charge = stripe.Charge.create(
        customer=customer_id,
        amount=amount_cents,
        currency="usd",
        description=description,
        idempotency_key=idempotency_key,
    )
    return charge.id

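def make_charge_tool(run_id: str | None = None) -> FunctionTool:
    """Return a FunctionTool with a stable idempotency key for this run."""
    key = run_id or str(uuid.uuid4())

    # The closure captures the key at tool-creation time, not at call time.
    # Unlike functools.partial, it also keeps idempotency_key out of the
    # function signature, so a tool schema derived from that signature does
    # not expose the key to the LLM and the LLM cannot override it.
    def create_charge(customer_id: str, amount_cents: int, description: str) -> str:
        """Create a Stripe charge and return the charge ID."""
        return create_charge_with_key(
            customer_id, amount_cents, description, idempotency_key=key
        )

    return FunctionTool.from_defaults(fn=create_charge)

# Create a new tool instance per agent run; the same key survives all retries
run_id = str(uuid.uuid4())
agent = ReActAgent.from_tools(
    tools=[make_charge_tool(run_id)],
    llm=llm,
)
response = agent.chat("Charge customer cus_ABC123 $29.99 for the Pro plan")

Pattern: Generate the idempotency key once per run before building the agent, and bind it at tool-creation time. Every retry within that run reuses the same key, so Stripe deduplicates and returns the original charge object rather than creating a second one. Note that Stripe rejects a request that reuses a key with different parameters, so a retry with "corrected" arguments fails with an idempotency error instead of silently creating a second charge.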
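
To make the deduplication concrete, here is a minimal in-memory simulation. It is a sketch, not Stripe's actual implementation: FakeStripe, create_charge, drop_next_response, and charge_with_one_retry are hypothetical names invented for this example. The fake server records the charge and then loses the response, which mimics a timeout that happens after the charge was already created server-side.

```python
from __future__ import annotations

import uuid


class FakeStripe:
    """Hypothetical in-memory stand-in for a payment API (not the Stripe SDK)."""

    def __init__(self) -> None:
        self.charges: list[str] = []        # every charge actually created
        self._by_key: dict[str, str] = {}   # idempotency key -> charge ID
        self.drop_next_response = False     # simulate a lost response

    def create_charge(self, amount_cents: int, idempotency_key: str | None = None) -> str:
        if idempotency_key is not None and idempotency_key in self._by_key:
            # Known key: replay the original result instead of charging again.
            charge_id = self._by_key[idempotency_key]
        else:
            charge_id = f"ch_{len(self.charges) + 1}"
            self.charges.append(charge_id)
            if idempotency_key is not None:
                self._by_key[idempotency_key] = charge_id
        if self.drop_next_response:
            self.drop_next_response = False
            raise TimeoutError("response lost after the charge was recorded")
        return charge_id


def charge_with_one_retry(server: FakeStripe, amount_cents: int, key: str | None) -> str:
    """Call the server and retry once on timeout, roughly as a ReAct loop might."""
    try:
        return server.create_charge(amount_cents, idempotency_key=key)
    except TimeoutError:
        return server.create_charge(amount_cents, idempotency_key=key)


# Without a key, the retry creates a second charge.
no_key = FakeStripe()
no_key.drop_next_response = True
charge_with_one_retry(no_key, 2999, key=None)

# With one key generated before the run, the retry is deduplicated.
with_key = FakeStripe()
with_key.drop_next_response = True
charge_with_one_retry(with_key, 2999, key=str(uuid.uuid4()))

print(len(no_key.charges), len(with_key.charges))  # prints: 2 1
```

The only difference between the two runs is whether a key exists before the first attempt, which is why the key has to be created outside the tool function.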

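If your agent can create multiple charges in one run (billing multiple line items, for example), derive the key for each charge from the run ID and the call's arguments, for example f"{run_id}-charge-{digest}" where digest is a hash of the customer, amount, and description. A counter that increments on every call would hand each retry a new key and defeat deduplication; a key derived from the arguments stays the same across retries while distinct charges still get distinct keys.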
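
One possible way to build such a per-charge key is to hash the call's arguments. The sketch below is an illustration under stated assumptions: derive_idempotency_key and its parameter names are hypothetical, and the 16-character digest length is an arbitrary choice for readability.

```python
from __future__ import annotations

import hashlib
import json


def derive_idempotency_key(run_id: str, operation: str, **params: object) -> str:
    """Build a key that is stable across retries of the same logical call."""
    # Canonical JSON (sorted keys) so argument order does not change the digest.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"{run_id}-{operation}-{digest}"


first = derive_idempotency_key("run-123", "charge", customer="cus_ABC123", amount=2999)
retry = derive_idempotency_key("run-123", "charge", amount=2999, customer="cus_ABC123")
other = derive_idempotency_key("run-123", "charge", customer="cus_ABC123", amount=4999)

print(first == retry, first == other)  # prints: True False
```

The trade-off is that a retry with changed arguments produces a new key and is therefore treated as a new charge, so this scheme protects against identical retries only.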

Failure mode 2: SubQuestionQueryEngine parallel Stripe calls

Risk: SubQuestionQueryEngine decomposes a complex query into subquestions and runs them in parallel by default. If two subquestions both produce a Stripe tool call, you get two charges executing simultaneously — without any shared state to prevent the second from firing.

SubQuestion decomposition is one of LlamaIndex's more powerful features: you describe a question that involves multiple data sources or operations, the LLM breaks it into subquestions, each subquestion is routed to the relevant tool, and the answers are synthesized. The problem with billing tools is that "run in parallel" means "fire all the charges at the same moment."

from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool

# Billing tool exposed as a QueryEngineTool
billing_engine = ReActAgent.from_tools([charge_tool], llm=llm)
billing_tool = QueryEngineTool.from_defaults(
    query_engine=billing_engine,
    name="billing",
    description="Use this to charge customers via Stripe",
)

# SubQuestion engine with billing included
# (catalog_tool and crm_tool are assumed to be defined elsewhere)
sq_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[billing_tool, catalog_tool, crm_tool],
    llm=llm,
)

# A query that spans multiple tools — dangerous if billing triggers twice
response = sq_engine.query(
    "Upgrade customer cus_ABC123 to Pro plan and charge them, "
    "then log the upgrade in the CRM"
)

# SubQuestion decomposition might produce:
#   Q1 (billing): "Charge cus_ABC123 $29.99 for Pro plan"   → fires create_charge
#   Q2 (crm):     "Log upgrade for cus_ABC123 to Pro plan"  → fires crm_update
# If Q1 retries, you get two charges — and Q1/Q2 run concurrently.

There are two mitigations. The first is to use use_async=False on the SubQuestionQueryEngine, which serializes execution and at least prevents concurrent firing:

sq_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[billing_tool, catalog_tool, crm_tool],
    llm=llm,
    use_async=False,  # serialize — billing must complete before CRM update runs
)
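
The difference between concurrent and serial execution can be seen in a small asyncio analogy. This is not LlamaIndex's internal code: run_subquestion, run_concurrently, and run_serially are hypothetical names, and the sleep merely stands in for a tool call.

```python
import asyncio


async def run_subquestion(name: str, log: list[str]) -> None:
    """Hypothetical sub-question handler; the sleep stands in for a tool call."""
    log.append(f"start:{name}")
    await asyncio.sleep(0.01)
    log.append(f"end:{name}")


async def run_concurrently(names: list[str]) -> list[str]:
    """Start every sub-question before any of them has finished."""
    log: list[str] = []
    await asyncio.gather(*(run_subquestion(n, log) for n in names))
    return log


async def run_serially(names: list[str]) -> list[str]:
    """Finish each sub-question before starting the next one."""
    log: list[str] = []
    for n in names:
        await run_subquestion(n, log)
    return log


concurrent_log = asyncio.run(run_concurrently(["billing", "crm"]))
serial_log = asyncio.run(run_serially(["billing", "crm"]))

print(concurrent_log[:2])  # prints: ['start:billing', 'start:crm']
print(serial_log)  # prints: ['start:billing', 'end:billing', 'start:crm', 'end:crm']
```

In the concurrent run both calls are in flight before either returns; in the serial run the billing call has a final result before the CRM call begins.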

The second and more robust mitigation is to treat billing tools as non-subquestionable: do not expose create_charge to SubQuestion decomposition at all. Instead, have the top-level agent handle billing directly and use SubQuestion only for read-only operations (catalog lookups, CRM reads, inventory checks). Reserve the charge tool for the root agent where you can apply full idempotency-key governance.

# Safe pattern: billing at root agent level, SubQuestion only for reads
read_only_sq_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[catalog_tool, crm_tool],  # no billing tool here
    llm=llm,
)
read_tool = QueryEngineTool.from_defaults(
    query_engine=read_only_sq_engine,
    name="lookup",
    description="Look up catalog and CRM data. Does NOT charge customers.",
)

# Root agent handles billing with full idempotency governance
root_agent = ReActAgent.from_tools(
    tools=[make_charge_tool(run_id), read_tool],
    llm=llm,
)

Failure mode 3: AgentRunner key sharing across child agents

Risk: LlamaIndex multi-agent pipelines built with AgentRunner or agent-to-agent routing give every child agent access to the same tool set — including the same Stripe key. A refund agent and a billing agent deployed in the same pipeline share one Stripe key with full permissions. One prompt-injection on the refund agent can issue charges.

LlamaIndex supports multi-agent orchestration through several patterns: AgentRunner with a tool-calling orchestrator, worker agents registered as tools, or custom routing via RouterQueryEngine. In all of these, the natural pattern is to define tools once and share them across agents. For Stripe, sharing means sharing the key — and the key's permissions travel with it.

from llama_index.core.agent import AgentRunner, ReActAgentWorker

# One Stripe key — shared across all agents
stripe.api_key = os.environ["STRIPE_SECRET_KEY"]  # sk_live_...

def create_charge(customer_id: str, amount_cents: int, description: str) -> str:
    """Charge a customer via Stripe."""
    return stripe.Charge.create(customer=customer_id, amount=amount_cents,
                                currency="usd", description=description).id

def create_refund(charge_id: str, amount_cents: int) -> str:
    """Refund a Stripe charge."""
    return stripe.Refund.create(charge=charge_id, amount=amount_cents).id

# Both workers get both tools — no separation
charge_tool = FunctionTool.from_defaults(fn=create_charge)
refund_tool = FunctionTool.from_defaults(fn=create_refund)

billing_worker = ReActAgentWorker.from_tools([charge_tool, refund_tool], llm=llm)
support_worker = ReActAgentWorker.from_tools([refund_tool, charge_tool], llm=llm)

agent_runner = AgentRunner(agent_worker=billing_worker)

The risk is role confusion under adversarial input. A support agent that should only issue refunds can also issue charges because the charge tool is in its tool list. And both agents use the same unrestricted Stripe key, so a policy error on one affects both.

The fix is per-role key isolation using Stripe restricted API keys for a first-level filter, and per-agent vault keys for the second level. Each agent gets exactly the vault key it needs for its job:

import os

# Different vault keys per role — different policies enforced at proxy
BILLING_VAULT_KEY  = os.environ["KEYBRAKE_BILLING_KEY"]   # policy: charges only, $500/day cap
SUPPORT_VAULT_KEY  = os.environ["KEYBRAKE_SUPPORT_KEY"]   # policy: refunds only, $200/day cap

PROXY_BASE = "https://proxy.keybrake.com"

def make_stripe_tools(vault_key: str, allowed_ops: list[str]) -> list[FunctionTool]:
    """Return Stripe tools scoped to a specific vault key and operation set."""
    tools = []

    if "charge" in allowed_ops:
        def create_charge(customer_id: str, amount_cents: int, description: str) -> str:
            """Charge a customer. Requires billing-role vault key."""
            import httpx
            # Assumes the proxy accepts a JSON body; Stripe's own v1 API expects form-encoded parameters.
            resp = httpx.post(
                f"{PROXY_BASE}/stripe/v1/charges",
                headers={"Authorization": f"Bearer {vault_key}"},
                json={"customer": customer_id, "amount": amount_cents,
                      "currency": "usd", "description": description},
            )
            resp.raise_for_status()
            return resp.json()["id"]
        tools.append(FunctionTool.from_defaults(fn=create_charge))

    if "refund" in allowed_ops:
        def create_refund(charge_id: str, amount_cents: int) -> str:
            """Refund a charge. Requires support-role vault key."""
            import httpx
            resp = httpx.post(
                f"{PROXY_BASE}/stripe/v1/refunds",
                headers={"Authorization": f"Bearer {vault_key}"},
                json={"charge": charge_id, "amount": amount_cents},
            )
            resp.raise_for_status()
            return resp.json()["id"]
        tools.append(FunctionTool.from_defaults(fn=create_refund))

    return tools

# Billing agent: can charge, cannot refund — vault key policy enforces this at the proxy
billing_worker = ReActAgentWorker.from_tools(
    tools=make_stripe_tools(BILLING_VAULT_KEY, allowed_ops=["charge"]),
    llm=llm,
)

# Support agent: can refund, cannot charge — even if the LLM tries to charge, proxy returns 403
support_worker = ReActAgentWorker.from_tools(
    tools=make_stripe_tools(SUPPORT_VAULT_KEY, allowed_ops=["refund"]),
    llm=llm,
)

Pattern: One vault key per agent role. The proxy enforces the policy — it rejects a "charge" call from the support vault key with HTTP 403, even if the LLM generates the action correctly. No code change needed to enforce the boundary; it lives at the infrastructure layer.
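
To illustrate what "the proxy enforces the policy" could mean, here is a minimal sketch of a per-key policy check. It is an assumption about how such a proxy might work, not Keybrake's actual implementation: VaultPolicy, authorize, and the field names are hypothetical. The status codes (403 for a disallowed endpoint, 429 for an exceeded cap) and the $500 and $200 daily caps follow the policies described above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class VaultPolicy:
    """Hypothetical per-key policy: an endpoint allowlist plus a daily cap."""

    allowed: frozenset[tuple[str, str]]  # (HTTP method, path) pairs
    daily_cap_cents: int
    spent_today_cents: int = 0

    def authorize(self, method: str, path: str, amount_cents: int) -> int:
        """Return the HTTP status a proxy could answer with for this call."""
        if (method.upper(), path) not in self.allowed:
            return 403  # endpoint is not on this key's allowlist
        if self.spent_today_cents + amount_cents > self.daily_cap_cents:
            return 429  # call would exceed the daily spend cap
        self.spent_today_cents += amount_cents
        return 200


billing = VaultPolicy(frozenset({("POST", "/stripe/v1/charges")}), daily_cap_cents=50_000)
support = VaultPolicy(frozenset({("POST", "/stripe/v1/refunds")}), daily_cap_cents=20_000)

print(support.authorize("POST", "/stripe/v1/charges", 2_999))   # prints: 403
print(billing.authorize("POST", "/stripe/v1/charges", 2_999))   # prints: 200
print(billing.authorize("POST", "/stripe/v1/charges", 49_000))  # prints: 429
```

Because the check keys off the credential rather than the tool list, it holds even if an agent is handed a tool it should not have.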

The complete governance stack for LlamaIndex + Stripe

Putting all three fixes together — idempotency keys for retries, serialized or separated SubQuestion execution for parallel safety, and per-role vault keys for multi-agent isolation — the production pattern looks like this:

import os, uuid, httpx
from functools import partial
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI

PROXY_BASE = "https://proxy.keybrake.com"
VAULT_KEY  = os.environ["KEYBRAKE_VAULT_KEY"]
llm = OpenAI(model="gpt-4o")

def _charge(
    customer_id: str,
    amount_cents: int,
    description: str,
    idempotency_key: str,
) -> str:
    """Internal: charge via Keybrake proxy with idempotency key."""
    resp = httpx.post(
        f"{PROXY_BASE}/stripe/v1/charges",
        headers={
            "Authorization": f"Bearer {VAULT_KEY}",
            "Idempotency-Key": idempotency_key,
        },
        json={
            "customer":    customer_id,
            "amount":      amount_cents,
            "currency":    "usd",
            "description": description,
        },
        timeout=10,
    )
    if resp.status_code == 429:
        raise RuntimeError(f"Daily spend cap reached: {resp.json().get('error', '')}")
    resp.raise_for_status()
    return resp.json()["id"]

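def make_billing_agent(run_id: str | None = None) -> ReActAgent:
    """Build a billing agent with a stable per-run idempotency key."""
    key = run_id or str(uuid.uuid4())

    # A closure keeps idempotency_key out of the tool's signature, so the
    # LLM cannot see or override it (functools.partial would leave it visible).
    def create_charge(customer_id: str, amount_cents: int, description: str) -> str:
        """Charge a customer via Stripe. Returns the charge ID."""
        return _charge(customer_id, amount_cents, description, idempotency_key=key)

    return ReActAgent.from_tools(
        tools=[FunctionTool.from_defaults(fn=create_charge)],
        llm=llm,
        max_iterations=5,
    )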

# One agent per request — fresh idempotency key per run
agent = make_billing_agent()
response = agent.chat("Charge customer cus_ABC123 $29.99 for the Pro plan")

What this buys you: every retry within a run reuses the same idempotency key (Stripe deduplicates); the proxy enforces the daily cap and returns 429 when it's exceeded; the vault key is scoped to POST /stripe/v1/charges only, so even if the LLM generates a refund action it will be rejected with 403; and the audit log at keybrake.com/app shows every call, its parsed cost, and its outcome.

Comparison: raw key vs restricted key vs vault key

Feature	Raw `sk_live_`	Stripe restricted key	Keybrake vault key
Endpoint allowlist	All endpoints	Resource-level (e.g. Charges write-only)	URL-level (e.g. `POST /charges` only)
Daily spend cap	None	None	Enforced by proxy (returns 429)
Per-agent isolation	No — all agents share one key	Manual — need one restricted key per role	Yes — one vault key per agent/run, one Stripe key in vault
Retry dedup	Only with `idempotency_key` param	Only with `idempotency_key` param	Proxy forwards `Idempotency-Key` header to Stripe
SubQuestion safety	No — parallel calls fire independently	No — parallel calls fire independently	Cap enforced globally; concurrent calls still dedup with key
Audit trail	Stripe Dashboard only	Stripe Dashboard only	Proxy audit log: vault key, endpoint, parsed cost, timestamp
Kill switch	Rotate key in code + redeploy	Disable in Stripe Dashboard	One-click revoke in Keybrake dashboard; agent gets 401 immediately

Enforcing governance with pytest

Governance that isn't tested is governance you'll forget under pressure. These tests verify the key properties of the LlamaIndex + Keybrake integration:

import pytest, uuid, httpx
from unittest.mock import patch, MagicMock

def test_retry_uses_stable_idempotency_key():
    """Same idempotency key on all retries from one run."""
    keys_seen = []

    def mock_post(url, headers, json, **kwargs):
        keys_seen.append(headers.get("Idempotency-Key", ""))
        mock = MagicMock()
        mock.status_code = 500  # force retry
        mock.raise_for_status.side_effect = httpx.HTTPStatusError(
            "500", request=MagicMock(), response=mock
        )
        return mock

    run_id = str(uuid.uuid4())
    with patch("httpx.post", side_effect=mock_post):
        try:
            agent = make_billing_agent(run_id)
            agent.chat("Charge cus_test $100")
        except Exception:
            pass  # expected — Stripe returning 500

    # All calls share the same idempotency key
    assert len(set(keys_seen)) == 1, f"Expected 1 unique key, got {len(set(keys_seen))}: {keys_seen}"
    assert keys_seen[0] == run_id

def test_daily_cap_raises_runtime_error():
    """_charge raises RuntimeError on a 429; the agent sees it as an observation error."""
    mock = MagicMock()
    mock.status_code = 429
    mock.json.return_value = {"error": "Daily cap of $500 exceeded"}

    with patch("httpx.post", return_value=mock):
        # Call the tool function directly: inside an agent run the exception
        # is captured as an observation and may never reach the caller
        with pytest.raises(RuntimeError, match="Daily spend cap"):
            _charge("cus_test", 100_000, "test", "key-test")

def test_support_vault_key_cannot_charge():
    """Vault key scoped to refunds-only returns 403 on charge attempt."""
    def mock_post(url, headers, json, **kwargs):
        if "charges" in url:
            mock = MagicMock()
            mock.status_code = 403
            mock.raise_for_status.side_effect = httpx.HTTPStatusError(
                "403 Forbidden", request=MagicMock(), response=mock
            )
            return mock
        mock = MagicMock()
        mock.status_code = 200
        mock.json.return_value = {"id": "re_test"}
        return mock

    with patch("httpx.post", side_effect=mock_post):
        # The refund-only tool set must not expose a charge tool at all
        tools = make_stripe_tools(SUPPORT_VAULT_KEY, allowed_ops=["refund"])
        assert [t.metadata.name for t in tools] == ["create_refund"]
        # A direct charge attempt is rejected at the proxy layer; in practice
        # the agent receives the 403 as an observation error
        with pytest.raises(httpx.HTTPStatusError):
            _charge("cus_test", 100, "test", "key-test")

def test_no_live_key_in_proxy_headers():
    """Real Stripe key never appears in outbound headers to the proxy."""
    headers_sent = []

    def mock_post(url, headers, json, **kwargs):
        headers_sent.append(dict(headers))
        mock = MagicMock()
        mock.status_code = 200
        mock.json.return_value = {"id": "ch_test"}
        mock.raise_for_status.return_value = None
        return mock

    with patch("httpx.post", side_effect=mock_post):
        agent = make_billing_agent()
        agent.chat("Charge cus_ABC123 $29")

    for h in headers_sent:
        for v in h.values():
            assert not str(v).startswith("sk_live_"), f"Live key found in headers: {h}"

Gap analysis

LlamaIndex Workflow steps

LlamaIndex Workflows (introduced in v0.10) let you define multi-step pipelines as a Python class whose methods are marked with the @step decorator. If two steps both call Stripe and run concurrently (for example, a step declared with num_workers > 1 handling several events at once), the charges can fire at the same time. The idempotency-key pattern still applies: generate one key per workflow run in the StartEvent handler and thread it through via the event context. See the idempotency guide for the per-run key closure pattern.

Tool calling vs structured output

LlamaIndex supports both ReAct-style tool calling and structured prediction (dspy-inspired Predict modules via llama-index-experimental). For structured prediction that includes a Stripe output field, the agent does not call FunctionTool directly — it generates a Pydantic model. Be careful: the model might hallucinate a charge_id instead of calling the tool. Always validate charge IDs against the Stripe API before treating them as authoritative.

Streaming agent responses

LlamaIndex supports streaming via agent.stream_chat(). The tool call happens synchronously mid-stream — the charge fires before the streaming response completes. If the client disconnects mid-stream, the charge may have already succeeded. Always check the audit log before retrying a streaming billing operation.

Memory and persistence

LlamaIndex's ChatMemoryBuffer persists conversation history across turns. If your agent is conversational and a user says "charge me again" in turn 3, the agent has full context of the prior charge (charge ID, amount) in its memory. This is useful for references but dangerous if the agent decides to duplicate the charge rather than retrieve the existing one. Scope billing confirmations to single turns, not persistent sessions.

FAQ

Does LlamaIndex have a built-in idempotency mechanism?

No. LlamaIndex provides the tool-calling infrastructure but does not inject idempotency keys into tool calls. You need to implement them yourself — or route through a proxy that handles them at the infrastructure layer.

Should I use `FunctionTool` or `QueryEngineTool` for Stripe?

Use FunctionTool for write operations like charges and refunds. QueryEngineTool wraps a query engine, which is intended for read/retrieval operations. Mixing write tools into a query engine that SubQuestion decomposition can access is how you end up with parallel charge firing.

Can I use LlamaIndex's built-in retry logic with the vault key proxy?

Yes. The proxy forwards the Idempotency-Key header to Stripe, so LlamaIndex's agent-level retries (which reuse the same tool with the same key) are safe. The proxy also deduplicates at its own layer if you send the same vault key + idempotency key combination twice.

How do I correlate a LlamaIndex agent run with the Keybrake audit log?

Use the run_id as both the agent's trace ID and the idempotency key prefix. Pass it through as a custom header: "X-Agent-Run-Id": run_id. The proxy logs all headers with each call, so you can filter the audit log by run_id to see every Stripe call made by a specific agent run.
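
A minimal sketch of how those headers could be assembled follows. build_proxy_headers, call_suffix, and the placeholder vault key are hypothetical; only the header names come from the text above.

```python
from __future__ import annotations

import uuid


def build_proxy_headers(vault_key: str, run_id: str, call_suffix: str = "charge") -> dict[str, str]:
    """Headers that tie one proxied call to one agent run."""
    return {
        "Authorization": f"Bearer {vault_key}",
        # run_id is the prefix, so the idempotency key and the trace ID match up
        "Idempotency-Key": f"{run_id}-{call_suffix}",
        "X-Agent-Run-Id": run_id,
    }


run_id = str(uuid.uuid4())
# Placeholder value for illustration only, not a real vault key format
headers = build_proxy_headers("vk_example_not_a_real_key", run_id)

print(headers["X-Agent-Run-Id"] == run_id)            # prints: True
print(headers["Idempotency-Key"].startswith(run_id))  # prints: True
```

Keeping the run ID as the key prefix means a single search string finds both the run and every call it made.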

Does the vault key proxy add latency?

The proxy adds one network hop, typically 5–15 ms within the same region. For billing operations where a human is involved in the approval flow, this is imperceptible. For high-frequency automated pipelines, colocate the proxy in the same data center as your LlamaIndex workers.

What happens if the proxy itself goes down?

Requests fail with a connection error, which the ReActAgent treats as a tool observation error. The agent retries with the same idempotency key, and once the proxy is back up the retry is forwarded to Stripe and deduplicated. The idempotency key guarantees the charge is created at most once; whether it is created at all depends on the proxy recovering before the agent exhausts max_iterations.

Scoped Stripe keys for your LlamaIndex agents

Keybrake issues per-agent vault keys that enforce spend caps, endpoint allowlists, and audit logging on every proxied Stripe call — no changes to your LlamaIndex tool definitions beyond swapping the base URL.