DSPy Stripe Integration: Restricted API Keys, Spend Caps, and Agent Governance

By Keybrake · June 13, 2026 · 9 min read

DSPy makes it easy to wire a Stripe tool into a ReAct module and start charging customers from a language model. It also makes it easy to accidentally bill your production Stripe account 50 times while tuning prompts — because MIPRO runs real tool calls during optimization, and nobody told you that meant real Stripe charges.

This post covers three failure modes specific to DSPy's architecture — optimizer trial explosion, assertion-triggered retry without idempotency keys, and process-global stripe.api_key contamination — and shows the governance pattern that closes all three: restricted Stripe API keys as a first layer, per-module vault keys via a proxy as a second layer.

The standard DSPy Stripe pattern

DSPy agents typically use dspy.ReAct to give a language model access to external tools. Adding Stripe takes three steps: define the tool function, pass it to ReAct, and call the module (which runs forward()).

import os
import dspy
import stripe

stripe.api_key = os.environ["STRIPE_SECRET_KEY"]  # sk_live_... — module-level global

def stripe_charge_tool(customer_id: str, amount_cents: int, description: str) -> str:
    """Create a Stripe charge for a customer. Returns the charge ID."""
    charge = stripe.Charge.create(
        customer=customer_id,
        amount=amount_cents,
        currency="usd",
        description=description,
    )
    return charge.id

class BillingAgent(dspy.Module):
    def __init__(self):
        super().__init__()  # set up dspy.Module state before registering submodules
        self.react = dspy.ReAct(
            "customer_id, amount_cents, description -> charge_id",
            tools=[stripe_charge_tool],
        )

    def forward(self, customer_id, amount_cents, description):
        return self.react(
            customer_id=customer_id,
            amount_cents=amount_cents,
            description=description,
        )

dspy.configure(lm=dspy.LM("openai/gpt-4o"))
agent = BillingAgent()
result = agent("cus_ABC123", 2999, "Pro plan - June 2026")
print(result.charge_id)  # ch_3R4...

This works. But once you try to optimize it — or add a constraint, or deploy it alongside a refund agent — the gaps appear fast.

Failure mode 1: Optimizer trial explosion

Risk: MIPRO and BootstrapFewShot execute real Stripe tool calls during prompt optimization. 50 optimizer trials × 1 tool call per trial = 50 Stripe charges billed to your production account before your program is even deployed.

DSPy's teleprompters (optimizers) work by running your program multiple times against a training set to find the best prompts and few-shot demonstrations. Each trial is a genuine forward pass through your module, which means every tool your module uses gets called with real inputs.

from dspy.teleprompt import MIPRO, BootstrapFewShot

# This looks harmless: it's just tuning prompts.
# charge_success_metric is your own metric function (not shown here).
teleprompter = MIPRO(metric=charge_success_metric, num_threads=4)

training_examples = [
    dspy.Example(customer_id="cus_AAA", amount_cents=999, description="Basic plan").with_inputs("customer_id", "amount_cents", "description"),
    dspy.Example(customer_id="cus_BBB", amount_cents=2999, description="Pro plan").with_inputs("customer_id", "amount_cents", "description"),
    # ... 48 more examples
]

# MIPRO with num_candidates=10 runs ~50 trials across your training set.
# Each trial calls stripe_charge_tool with a real customer ID.
# Result: up to 50 live Stripe charges. In production.
optimized_agent = teleprompter.compile(BillingAgent(), trainset=training_examples)

BootstrapFewShot is equally dangerous: it bootstraps demonstrations by running each training example through the full forward pass. 20 training examples = 20 Stripe charges to collect the demonstrations. The default max_bootstrapped_demos=4 doesn't cap the number of forward passes — it caps how many demonstrations get kept.
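Before pointing any optimizer at a module, it can help to measure how many tool calls a run would make. The sketch below is a minimal illustration, assuming a hypothetical `counting_tool` helper (not a DSPy API) and a plain loop standing in for the optimizer's forward passes:

```python
import functools

def counting_tool(tool):
    """Wrap a tool so every invocation is counted (hypothetical helper, not a DSPy API)."""
    @functools.wraps(tool)  # preserves the name and docstring the LM is shown
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return tool(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

def fake_charge(customer_id: str, amount_cents: int, description: str) -> str:
    """Stand-in for the charge tool; returns a mock ID and never touches Stripe."""
    return f"ch_mock_{customer_id[:8]}_{amount_cents}"

counted = counting_tool(fake_charge)

# Stand-in for an optimizer: 20 training examples, one forward pass each.
for i in range(20):
    counted(f"cus_test_{i:03d}", 999, "Basic plan")

print(counted.calls)  # 20
```

Passing the wrapped mock to the module during a dry run gives a concrete number for how many charges the same run would have created against a live key.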

The fix is to make your module's tool configurable between live and mock modes, and always optimize against mocks:

def stripe_charge_mock(customer_id: str, amount_cents: int, description: str) -> str:
    """Mock tool that returns a fake charge ID without hitting Stripe."""
    return f"ch_mock_{customer_id[:8]}_{amount_cents}"

class BillingAgent(dspy.Module):
    def __init__(self, use_live_stripe: bool = False):
        super().__init__()
        charge_tool = stripe_charge_tool if use_live_stripe else stripe_charge_mock
        self.react = dspy.ReAct(
            "customer_id, amount_cents, description -> charge_id",
            tools=[charge_tool],
        )

    def forward(self, customer_id, amount_cents, description):
        return self.react(
            customer_id=customer_id,
            amount_cents=amount_cents,
            description=description,
        )

# Optimize with mocks — zero Stripe API calls
optimized_agent = teleprompter.compile(
    BillingAgent(use_live_stripe=False),
    trainset=training_examples,
)
optimized_agent.save("compiled/billing_v3.json")

# Load in production with live (or vault) keys
prod_agent = BillingAgent(use_live_stripe=True)
prod_agent.load("compiled/billing_v3.json")

Pattern: Optimize against mocks, deploy with live (or vault) keys. The compiled JSON stores prompts and demonstrations — not the tool implementation — so loading the compiled artifact into a live-tool instance works correctly.

The same rule applies to dspy.evaluate.Evaluate(): it calls your program once per example, so with a real Stripe key and 100 evaluation examples, every eval run creates 100 Stripe charges. Always evaluate with mocks or with a Stripe test-mode key (sk_test_...). A restricted key does not help here: it limits which resources the key can touch, not which customers get charged.
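A cheap extra safeguard is to make live tools fail closed unless the process has explicitly opted in. This is a sketch under stated assumptions: the `ALLOW_LIVE_STRIPE` environment variable and the `require_live_opt_in` helper are hypothetical names, not part of DSPy or Stripe.

```python
import functools
import os

class LiveStripeNotAllowed(RuntimeError):
    """Raised when a live tool is called without an explicit opt-in."""

def require_live_opt_in(tool, env_var: str = "ALLOW_LIVE_STRIPE"):
    """Refuse to run `tool` unless env_var is set to "1" (hypothetical convention)."""
    @functools.wraps(tool)
    def guarded(*args, **kwargs):
        if os.environ.get(env_var) != "1":
            raise LiveStripeNotAllowed(
                f"{tool.__name__} blocked: set {env_var}=1 to allow live calls"
            )
        return tool(*args, **kwargs)
    return guarded

def live_charge(customer_id: str, amount_cents: int, description: str) -> str:
    """Placeholder for the real Stripe call."""
    return "ch_live_placeholder"

guarded_charge = require_live_opt_in(live_charge)
```

With this in place, an optimizer or eval run started from a shell that never set the variable raises on the first tool call instead of charging a customer.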

Failure mode 2: Assertion retry without idempotency keys

Risk: dspy.Assert and dspy.Suggest trigger a full forward-pass retry when a constraint fails. If the constraint fires after the Stripe tool has already been called, the retry creates a duplicate charge. Without an idempotency key, Stripe has no way to know it's a retry.

DSPy's constrained generation system is genuinely useful: you can assert that the LLM's output matches a schema, starts with a valid prefix, or satisfies a business rule. When the constraint fails, DSPy reruns the forward pass with the failed attempt included as a hint in context. The problem is that reruns re-invoke every tool in the module.

class GovernedBillingAgent(dspy.Module):
    def __init__(self):
        super().__init__()
        self.react = dspy.ReAct(
            "customer_id, amount_cents, description -> charge_id",
            tools=[stripe_charge_tool],
        )

    def forward(self, customer_id, amount_cents, description):
        result = self.react(
            customer_id=customer_id,
            amount_cents=amount_cents,
            description=description,
        )
        # If the LLM returns "charge_123abc" instead of "ch_3R4xxx",
        # this assertion fails and reruns the entire forward pass —
        # including another call to stripe_charge_tool.
        dspy.Assert(
            result.charge_id.startswith("ch_"),
            "charge_id must be a valid Stripe charge ID starting with ch_",
        )
        return result

The failure scenario: the LLM runs stripe_charge_tool, which succeeds and returns ch_3R4xxxxx, but then the ReAct module includes extra reasoning text in the final output field. The assertion fires because the extracted charge_id field contains "ch_3R4xxxxx. The charge was successful." instead of the bare ID. DSPy reruns. The tool is called again. Now you have two charges.
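Some of these spurious failures can be avoided by normalizing the output before asserting on it. The sketch below assumes charge IDs look like "ch_" followed by letters, digits, or underscores; `extract_charge_id` is a hypothetical helper, and it reduces retries rather than replacing the idempotency key described next.

```python
import re
from typing import Optional

# Assumption: a charge ID is "ch_" followed by letters, digits, or underscores.
CHARGE_ID_RE = re.compile(r"\bch_[A-Za-z0-9_]+")

def extract_charge_id(raw: str) -> Optional[str]:
    """Pull the first charge-ID-shaped token out of free-form model output."""
    match = CHARGE_ID_RE.search(raw or "")
    return match.group(0) if match else None

print(extract_charge_id("ch_3R4xxxxx. The charge was successful."))  # ch_3R4xxxxx
print(extract_charge_id("charge_123abc"))                            # None
```

Asserting on the extracted value means trailing commentary from the model no longer triggers a rerun, while a genuinely missing ID still fails the check.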

The fix is to inject a stable, per-run idempotency key that survives retries. The key must be generated once before any forward pass begins — not inside the tool function, where each retry generates a new key:

import hashlib
import uuid

def make_charge_tool(run_idempotency_key: str):
    """Return a Stripe charge tool locked to a specific idempotency key."""

    def stripe_charge_tool(customer_id: str, amount_cents: int, description: str) -> str:
        """Create a Stripe charge for a customer. Returns the charge ID."""
        # Idempotency key: stable per run, namespaced by a digest of the arguments.
        # No call counter is used: a counter keeps incrementing across retries,
        # so a retried call would get a new key and Stripe would charge again.
        args_digest = hashlib.sha256(
            f"{customer_id}|{amount_cents}|{description}".encode()
        ).hexdigest()[:16]
        idem_key = f"{run_idempotency_key}-charge-{args_digest}"
        charge = stripe.Charge.create(
            customer=customer_id,
            amount=amount_cents,
            currency="usd",
            description=description,
            idempotency_key=idem_key,
        )
        return charge.id

    return stripe_charge_tool

class BillingAgent(dspy.Module):
    def __init__(self, vault_key: str):
        super().__init__()
        self._vault_key = vault_key
        self.react_template = dspy.ReAct(
            "customer_id, amount_cents, description -> charge_id",
            tools=[],  # placeholder; tools are injected per forward() call
        )

    def forward(self, customer_id, amount_cents, description, run_key=""):
        # The caller creates run_key once per business operation and passes it in,
        # so a rerun of forward() with the same arguments sees the same value.
        # The uuid4 fallback is NOT retry-stable; it only covers ad-hoc calls.
        run_key = run_key or str(uuid.uuid4())
        charge_tool = make_charge_tool(run_key)
        # Rebind tools for this specific run.
        # NOTE: some DSPy releases keep ReAct tools in a name-keyed dict rather
        # than a list; verify this rebinding against your installed version.
        self.react_template.tools = [charge_tool]
        result = self.react_template(
            customer_id=customer_id,
            amount_cents=amount_cents,
            description=description,
        )
        dspy.Assert(
            result.charge_id.startswith("ch_"),
            "charge_id must be a valid Stripe charge ID",
        )
        return result

With a stable idempotency key, Stripe deduplicates retries: if stripe_charge_tool is called twice in the same DSPy forward pass with the same idem_key, the second call returns the original charge object without creating a new one. For more on idempotency patterns in agentic contexts, see our post on Stripe idempotency keys for AI agents.
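The deduplication behavior is easy to reason about with a toy model. The class below is an in-memory stand-in written for illustration only; it is not the Stripe SDK and ignores details such as key expiry and parameter mismatch errors.

```python
import itertools

class FakeStripe:
    """Toy model of idempotent charge creation; not the Stripe SDK."""

    def __init__(self):
        self._by_key = {}
        self._ids = itertools.count(1)
        self.charges_created = 0

    def create_charge(self, amount: int, idempotency_key: str) -> str:
        if idempotency_key in self._by_key:
            # Replay: return the stored result, create nothing new.
            return self._by_key[idempotency_key]
        self.charges_created += 1
        charge_id = f"ch_fake_{next(self._ids)}"
        self._by_key[idempotency_key] = charge_id
        return charge_id

stripe_sim = FakeStripe()
first = stripe_sim.create_charge(2999, "run-42-charge-a")
retry = stripe_sim.create_charge(2999, "run-42-charge-a")  # same key: deduplicated
other = stripe_sim.create_charge(2999, "run-43-charge-a")  # new key: second charge
```

The retry returns the first charge ID, while the call with a different run key creates a second charge. That second case is exactly what happens when a retry regenerates its key.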

Failure mode 3: Process-global stripe.api_key contamination

Risk: stripe.api_key is a module-level global. Every DSPy module loaded in the same Python process shares it. A billing agent and a refund agent deployed in the same worker process share one Stripe key — meaning the refund agent's key has whatever permissions the billing agent needed, and vice versa.

When you deploy multiple DSPy agents as a web service, they typically run in the same Python worker process. You set stripe.api_key once at startup, and every Stripe call in every module uses it:

# worker.py: initialized once at startup
import os
import stripe
stripe.api_key = os.environ["STRIPE_LIVE_KEY"]  # module-level global

# Both agents share the same stripe.api_key.
# BillingAgent needs: Charges:Write, Customers:Read
# RefundAgent needs:  Refunds:Write, Charges:Read
# The shared key must have ALL of these — broadening both agents' scope.
billing_agent = BillingAgent()
billing_agent.load("compiled/billing_v3.json")

refund_agent = RefundAgent()
refund_agent.load("compiled/refund_v1.json")

@app.post("/charge")
async def charge(req: ChargeRequest):
    return billing_agent(req.customer_id, req.amount_cents, req.description)

@app.post("/refund")
async def refund(req: RefundRequest):
    return refund_agent(req.charge_id, req.amount_cents)

This means the billing agent — which should only be able to create charges — is running with a key that also has Refunds:Write permission, because the refund agent needs it. And the refund agent is running with Charges:Write permission it should never need. A prompt injection attack against either agent can cross into the other's domain.

The fix is to bind the key to each tool explicitly instead of relying on the module global. Separate restricted keys for billing and refunds narrow each agent's permissions, but restricted keys alone do not solve the routing problem: you still have to get the right key to the right tool call at runtime, and a restricted key carries no spend cap. A vault proxy handles both:

import httpx

def make_stripe_charge_tool(vault_key: str):
    """Return a charge tool that routes through Keybrake proxy with the given vault key."""
    def stripe_charge_tool(customer_id: str, amount_cents: int, description: str) -> str:
        """Create a Stripe charge for a customer. Returns the charge ID."""
        response = httpx.post(
            "https://proxy.keybrake.com/stripe/v1/charges",
            headers={"Authorization": f"Bearer {vault_key}"},
            data={
                "customer": customer_id,
                "amount": str(amount_cents),
                "currency": "usd",
                "description": description,
            },
            timeout=10.0,
        )
        if response.status_code == 429:
            return "ERROR: spend cap reached — halt all billing operations"
        response.raise_for_status()
        return response.json()["id"]
    return stripe_charge_tool

def make_stripe_refund_tool(vault_key: str):
    """Return a refund tool that routes through Keybrake proxy with the given vault key."""
    def stripe_refund_tool(charge_id: str, amount_cents: int) -> str:
        """Issue a Stripe refund for a charge. Returns the refund ID."""
        response = httpx.post(
            "https://proxy.keybrake.com/stripe/v1/refunds",
            headers={"Authorization": f"Bearer {vault_key}"},
            data={"charge": charge_id, "amount": str(amount_cents)},
            timeout=10.0,
        )
        if response.status_code == 429:
            return "ERROR: spend cap reached — halt all refund operations"
        response.raise_for_status()
        return response.json()["id"]
    return stripe_refund_tool

# worker.py — vault keys are per-module, not a shared global
BILLING_VAULT_KEY = os.environ["KEYBRAKE_BILLING_VAULT_KEY"]  # policy: Charges:Write, Customers:Read, daily_cap=$500
REFUND_VAULT_KEY  = os.environ["KEYBRAKE_REFUND_VAULT_KEY"]   # policy: Refunds:Write, Charges:Read, daily_cap=$200

billing_agent = BillingAgent(vault_key=BILLING_VAULT_KEY)
billing_agent.load("compiled/billing_v3.json")

refund_agent = RefundAgent(vault_key=REFUND_VAULT_KEY)
refund_agent.load("compiled/refund_v1.json")

Pattern: Issue one Keybrake vault key per DSPy module, each with a policy scoped to exactly the Stripe endpoints that module needs. The vault key is injected at module init time via a tool factory function — no shared global, no cross-agent key contamination, and a per-module spend cap that the proxy enforces before calls reach Stripe.
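To make the proxy's role concrete, here is a sketch of the kind of check such a proxy might run before forwarding a request. It is an illustration under stated assumptions, not Keybrake's implementation: `VaultPolicy` and `check_request` are hypothetical names, and the 403 and 429 status codes mirror the ones used in this post's examples.

```python
from dataclasses import dataclass

@dataclass
class VaultPolicy:
    """Hypothetical per-vault-key policy: endpoint allowlist plus a daily cap."""
    allowed_endpoints: frozenset  # {(method, path), ...}
    daily_cap_cents: int
    spent_today_cents: int = 0

def check_request(policy: VaultPolicy, method: str, path: str, amount_cents: int) -> int:
    """Return the HTTP status a proxy could answer with before forwarding."""
    if (method, path) not in policy.allowed_endpoints:
        return 403  # endpoint is outside this key's allowlist
    if policy.spent_today_cents + amount_cents > policy.daily_cap_cents:
        return 429  # forwarding would exceed the daily spend cap
    policy.spent_today_cents += amount_cents
    return 200

billing_policy = VaultPolicy(frozenset({("POST", "/v1/charges")}), daily_cap_cents=50_000)

print(check_request(billing_policy, "POST", "/v1/charges", 30_000))  # 200
print(check_request(billing_policy, "POST", "/v1/refunds", 1_000))   # 403
print(check_request(billing_policy, "POST", "/v1/charges", 30_000))  # 429
```

Because the check runs before the request reaches Stripe, a billing agent that is tricked into requesting a refund is rejected on the allowlist, and a runaway loop stops at the cap.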

Six-control comparison table

Control	Raw `sk_live_` key	Restricted key	Vault key (Keybrake)
Endpoint allowlist	All endpoints	Resource-level (e.g. Charges:Write)	Endpoint-level (e.g. `POST /v1/charges` only)
Daily spend cap	None	None	Per-key USD cap (proxy enforces, returns 429)
Per-module isolation	No — shared global	Possible, but requires key rotation per worker restart	Yes — separate vault key per module, injected at init
Optimizer trial exposure	Full access during every MIPRO trial	Reduced access, still live calls	Zero — optimize with mocks; vault key used only in production
Assert retry dedup	Duplicate charges without idempotency key	Duplicate charges without idempotency key	Proxy logs each call; idempotency key pattern prevents duplicates
Audit trail	Stripe dashboard only	Stripe dashboard only	Keybrake audit log: vault key, DSPy module, timestamp, amount parsed from response

Putting it together: the governed DSPy billing agent

Here's the full pattern combining all three fixes — mock-based optimization, idempotency-keyed tool closures, and per-module vault keys:

import hashlib
import os
import uuid
import httpx
import dspy

# --- Tool factory functions ---

def make_charge_tool(vault_key: str, optimize_mode: bool = False, run_key: str = ""):
    """Return a charge tool. In optimize_mode, returns mock IDs without hitting Stripe."""
    # A caller-supplied run_key stays the same across retries; the uuid4 fallback does not.
    run_key = run_key or str(uuid.uuid4())

    def stripe_charge_tool(customer_id: str, amount_cents: int, description: str) -> str:
        """Create a Stripe charge for a customer. Returns the charge ID."""
        if optimize_mode:
            return f"ch_mock_{customer_id[:6]}_{amount_cents}"

        # Stateless key: run key plus a digest of the arguments. A call counter would
        # keep incrementing across retries and give each retry a fresh key.
        args_digest = hashlib.sha256(
            f"{customer_id}|{amount_cents}|{description}".encode()
        ).hexdigest()[:16]
        idem_key = f"{run_key}-charge-{args_digest}"

        response = httpx.post(
            "https://proxy.keybrake.com/stripe/v1/charges",
            headers={"Authorization": f"Bearer {vault_key}"},
            data={
                "customer": customer_id,
                "amount": str(amount_cents),
                "currency": "usd",
                "description": description,
                # Assumes the proxy forwards this field as the idempotency key;
                # Stripe's own REST API reads it from the Idempotency-Key header.
                "idempotency_key": idem_key,
            },
            timeout=10.0,
        )
        if response.status_code == 429:
            return "ERROR: spend cap reached, halt billing"
        response.raise_for_status()
        return response.json()["id"]

    return stripe_charge_tool


# --- DSPy module ---

class BillingAgent(dspy.Module):
    def __init__(self, vault_key: str, optimize_mode: bool = False):
        super().__init__()
        self._vault_key = vault_key
        self._optimize_mode = optimize_mode
        self.react = dspy.ReAct(
            "customer_id, amount_cents, description -> charge_id",
            tools=[make_charge_tool(vault_key, optimize_mode)],
        )

    def forward(self, customer_id, amount_cents, description, run_key=""):
        # Rebind the tool per run. Pass run_key from the caller (one per business
        # operation) so a retried forward() reuses the same idempotency key.
        # NOTE: some DSPy releases keep ReAct tools in a name-keyed dict rather
        # than a list; verify this rebinding against your installed version.
        self.react.tools = [
            make_charge_tool(self._vault_key, self._optimize_mode, run_key)
        ]
        result = self.react(
            customer_id=customer_id,
            amount_cents=amount_cents,
            description=description,
        )
        dspy.Assert(
            # Mock IDs ("ch_mock_...") also start with "ch_", so one check covers both.
            isinstance(result.charge_id, str) and result.charge_id.startswith("ch_"),
            "charge_id must be a valid Stripe charge ID",
        )
        return result


# --- Optimize with mocks ---

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # cheaper model for optimization

from dspy.teleprompt import BootstrapFewShot

training_examples = [
    dspy.Example(
        customer_id="cus_test_001",
        amount_cents=999,
        description="Basic plan",
        charge_id="ch_mock_cus_te_999",  # matches the mock: customer_id[:6] is "cus_te"
    ).with_inputs("customer_id", "amount_cents", "description")
    # ... more examples
]

def charge_metric(example, pred, trace=None):
    return pred.charge_id.startswith("ch_")

teleprompter = BootstrapFewShot(metric=charge_metric, max_bootstrapped_demos=3)
optimized = teleprompter.compile(
    BillingAgent(vault_key="unused_in_mock_mode", optimize_mode=True),
    trainset=training_examples,
)
optimized.save("compiled/billing_v4.json")

# --- Production deployment ---

BILLING_VAULT_KEY = os.environ["KEYBRAKE_BILLING_VAULT_KEY"]
dspy.configure(lm=dspy.LM("openai/gpt-4o"))  # full model for production

prod_agent = BillingAgent(vault_key=BILLING_VAULT_KEY, optimize_mode=False)
prod_agent.load("compiled/billing_v4.json")

pytest enforcement suite

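import os

import dspy
import httpx
import pytest
from unittest.mock import MagicMock, patch

# BillingAgent and make_charge_tool are the definitions from the section above;
# import them from wherever they live in your project.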

MOCK_VAULT_KEY = "vault_key_test_xxx"
os.environ.setdefault("KEYBRAKE_BILLING_VAULT_KEY", MOCK_VAULT_KEY)

def test_optimization_never_calls_live_stripe():
    """MIPRO and BootstrapFewShot must never hit live Stripe during compilation."""
    with patch("httpx.post") as mock_post:
        agent = BillingAgent(vault_key=MOCK_VAULT_KEY, optimize_mode=True)
        result = agent("cus_test_001", 999, "Basic plan")
        mock_post.assert_not_called()
    assert result.charge_id.startswith("ch_mock_")

def test_assert_retry_uses_stable_idempotency_key():
    """DSPy assert retry must reuse the same idempotency key to prevent double-charging."""
    captured_keys = []

    def capture_post(url, **kwargs):
        idem_key = kwargs.get("data", {}).get("idempotency_key", "")
        captured_keys.append(idem_key)
        mock_resp = MagicMock()
        mock_resp.status_code = 200
        mock_resp.json.return_value = {"id": "ch_3R4test001"}
        return mock_resp

    with patch("httpx.post", side_effect=capture_post):
        agent = BillingAgent(vault_key=MOCK_VAULT_KEY, optimize_mode=False)
        agent("cus_test_001", 999, "Basic plan")

    # All Stripe calls in a single forward() should share the same run idempotency key
    if len(captured_keys) > 1:
        base_keys = [k.rsplit("-charge-", 1)[0] for k in captured_keys]
        assert len(set(base_keys)) == 1, "All retries must share the same run idempotency key prefix"

def test_billing_vault_key_returns_403_for_refunds():
    """Billing vault key must be rejected by the proxy for refund endpoints."""
    mock_resp = MagicMock()
    mock_resp.status_code = 403
    mock_resp.raise_for_status.side_effect = Exception("403 Forbidden")

    with patch("httpx.post", return_value=mock_resp):
        with pytest.raises(Exception, match="403"):
            httpx.post(
                "https://proxy.keybrake.com/stripe/v1/refunds",
                headers={"Authorization": f"Bearer {MOCK_VAULT_KEY}"},
                data={"charge": "ch_3R4test001", "amount": "999"},
            ).raise_for_status()

def test_spend_cap_halts_billing_agent():
    """Proxy 429 must cause the charge tool to return an error string, not raise."""
    mock_resp = MagicMock()
    mock_resp.status_code = 429

    with patch("httpx.post", return_value=mock_resp):
        tool = make_charge_tool(MOCK_VAULT_KEY, optimize_mode=False)
        result = tool("cus_test_001", 999, "Basic plan")
    assert "spend cap" in result.lower()

def test_no_sk_live_key_in_proxy_headers():
    """Vault key sent to proxy must not be a raw sk_live_ Stripe key."""
    captured_headers = []

    def capture_post(url, **kwargs):
        captured_headers.append(kwargs.get("headers", {}))
        mock_resp = MagicMock()
        mock_resp.status_code = 200
        mock_resp.json.return_value = {"id": "ch_3R4test001"}
        return mock_resp

    with patch("httpx.post", side_effect=capture_post):
        agent = BillingAgent(vault_key=MOCK_VAULT_KEY, optimize_mode=False)
        agent("cus_test_001", 999, "Basic plan")

    for headers in captured_headers:
        auth = headers.get("Authorization", "")
        assert "sk_live_" not in auth, "Live Stripe key must not appear in proxy request headers"

Gap analysis

Evaluation dataset contamination. dspy.evaluate.Evaluate() runs every example through the full forward pass. If your evaluation set contains real customer IDs and you're running against a live vault key, each evaluation creates a real charge. The fix: always evaluate with optimize_mode=True or against a Stripe test-mode key — never against a production vault key.

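Teleprompter parallelism and thread-local state. MIPRO(num_threads=N) runs N concurrent forward passes. Mutable state captured in a tool closure, such as a call_count counter, is not thread-safe: two concurrent trials can interleave their updates if they share one tool instance, and rebinding tools on a shared module instance has the same problem. Keep the key derivation stateless (for example, a hash of the run key and the call arguments), or isolate the state with threading.local() or a lock. Generating a fresh UUID on every tool invocation is not a fix, because a retry would then get a new key and Stripe could not deduplicate it.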
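A stateless derivation avoids the shared-counter problem entirely. The sketch below assumes a hypothetical `derive_idempotency_key` helper that hashes the call arguments; since it reads no shared state, concurrent callers with the same inputs always get the same key.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def derive_idempotency_key(run_key: str, customer_id: str,
                           amount_cents: int, description: str) -> str:
    """Derive a key from the run key and call arguments; no shared mutable state."""
    digest = hashlib.sha256(
        f"{customer_id}|{amount_cents}|{description}".encode()
    ).hexdigest()[:16]
    return f"{run_key}-charge-{digest}"

args = ("run-42", "cus_test_001", 999, "Basic plan")

# 100 concurrent derivations with identical inputs.
with ThreadPoolExecutor(max_workers=8) as pool:
    keys = list(pool.map(lambda _: derive_idempotency_key(*args), range(100)))

print(len(set(keys)))  # 1
```

The trade-off is that two intentional, identical charges within one run share a key and are deduplicated, so the run key should identify one business operation.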

Compiled program few-shot demonstrations. optimized.save("compiled/billing_v4.json") persists few-shot demonstrations as JSON. If your training examples contain real customer IDs or charge amounts (not synthetic test data), those IDs are now in a plaintext JSON file committed to your repository. Use fully synthetic training data — never real production objects as training examples.
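One way to enforce this is to generate training rows programmatically and reject anything that does not match a synthetic naming scheme. The `cus_test_NNN` convention and both helper names below are hypothetical; each row could then be passed to dspy.Example(**row).with_inputs(...) as in the examples above.

```python
import re

# Hypothetical convention: synthetic customers are named cus_test_001, cus_test_002, ...
SYNTHETIC_CUSTOMER_RE = re.compile(r"^cus_test_\d{3}$")

def make_synthetic_examples(n: int) -> list:
    """Build n training rows that contain only obviously fake customer IDs."""
    plans = [(999, "Basic plan"), (2999, "Pro plan")]
    return [
        {
            "customer_id": f"cus_test_{i:03d}",
            "amount_cents": plans[i % 2][0],
            "description": plans[i % 2][1],
        }
        for i in range(1, n + 1)
    ]

def assert_all_synthetic(examples: list) -> None:
    """Raise if any row carries a customer ID outside the synthetic scheme."""
    bad = [e["customer_id"] for e in examples
           if not SYNTHETIC_CUSTOMER_RE.match(e["customer_id"])]
    if bad:
        raise ValueError(f"non-synthetic customer IDs in training data: {bad}")

examples = make_synthetic_examples(5)
assert_all_synthetic(examples)  # passes silently
```

Running the check right before compile() keeps real customer IDs out of the saved demonstrations.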

dspy.settings.cache and tool result replay. DSPy caches LM completions by hash of (model, prompt). The cache does not cover tool calls — tool results are not cached and are always re-executed. This is the correct behavior, but it means disabling the LM cache (dspy.settings.configure(cache=False)) has no effect on tool call frequency. Optimizer trials always fire real tool calls regardless of cache state.

For a broader look at how these patterns apply across frameworks, the AutoGen Stripe governance post covers the analogous failure modes in multi-agent group chats, and the LangChain Stripe integration post walks the same progression from bare key to vault key for tool-calling chains.

FAQ

Does this pattern work with DSPy's async support?

DSPy 2.x has experimental async support via dspy.asyncify() and async module methods. The tool factory pattern above uses synchronous httpx.post() — replace it with await httpx.AsyncClient().post() inside an async tool function if you're running DSPy in an async context. The idempotency key logic is unchanged; uuid.uuid4() is thread-safe and safe to call from async code.

How do I mock Stripe tools during MIPRO optimization without breaking the tool signature?

The optimize_mode flag in the factory function is the cleanest approach — the mock returns a string matching the expected format (ch_mock_xxx) so the metric function and any dspy.Assert constraints still see a plausible value. Alternatively, use unittest.mock.patch around the teleprompter.compile() call to intercept all httpx.post calls at the network layer, which also catches any Stripe calls you might have missed.
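The network-layer approach can be sketched with the standard library alone. The example patches urllib.request.urlopen so it stays self-contained; with the httpx-based tools in this post the patch target would be "httpx.post" instead. `compile_with_network_tripwire` is a hypothetical helper, and `compile_fn` stands in for a call such as teleprompter.compile(...).

```python
import urllib.request
from unittest.mock import patch

def compile_with_network_tripwire(compile_fn):
    """Run compile_fn with outbound HTTP patched to fail loudly."""
    tripwire = AssertionError("network call attempted during optimization")
    with patch("urllib.request.urlopen", side_effect=tripwire) as blocked:
        result = compile_fn()
    return result, blocked.call_count

def mock_only_compile():
    """Stand-in for an optimization run that uses only mock tools."""
    return "compiled"

def leaky_compile():
    """Stand-in for a run where a live tool slipped through."""
    urllib.request.urlopen("https://api.stripe.com/v1/charges")

result, calls = compile_with_network_tripwire(mock_only_compile)
print(result, calls)  # compiled 0
```

Calling compile_with_network_tripwire(leaky_compile) raises the AssertionError before any request leaves the process, which is the point of the tripwire.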

What's the right granularity for vault keys — per DSPy module class or per forward() call?

Per module class is sufficient for most deployments: one vault key per dspy.Module type, each scoped to exactly the Stripe operations that module type needs. Per-forward() vault keys (rotating per request) are worth the complexity if you need to attribute each individual LLM invocation to a specific user or session in the audit log — Keybrake stores the vault key used in each proxied request, so per-invocation keys give you per-request attribution.

Can I inspect DSPy's tool call history for Keybrake audit correlation?

dspy.inspect_history(n=5) shows the last n LM completion calls, including the tool invocation messages. It does not show the tool results (the Stripe API responses). Cross-correlate using the idempotency key: log the run_key from your tool factory alongside the DSPy trace ID, then look up idempotency_key in Keybrake's audit log to see the corresponding Stripe response.
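A minimal sketch of that correlation log, assuming you already have some trace identifier for the request. The logger name, field names, and `log_charge_attempt` helper are hypothetical; the only requirement is that the idempotency key appears in both your log and the proxy's.

```python
import io
import json
import logging

def make_audit_logger(stream) -> logging.Logger:
    """Build a logger that writes one JSON object per line to `stream`."""
    logger = logging.getLogger("billing.audit")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    logger.addHandler(logging.StreamHandler(stream))
    return logger

def log_charge_attempt(logger, trace_id: str, run_key: str, idem_key: str) -> dict:
    """Record the identifiers needed to join app logs with the proxy audit log."""
    record = {
        "event": "stripe_charge_attempt",
        "trace_id": trace_id,
        "run_key": run_key,
        "idempotency_key": idem_key,
    }
    logger.info(json.dumps(record, sort_keys=True))
    return record

buffer = io.StringIO()
audit = make_audit_logger(buffer)
log_charge_attempt(audit, trace_id="trace-001", run_key="run-42",
                   idem_key="run-42-charge-1")
```

Calling the helper inside the tool function, just before the HTTP request, means every proxied call has a matching line on your side.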

Does optimized.save() embed my vault key in the compiled JSON?

No. DSPy's compiled JSON stores prompts, few-shot demonstrations, and optimizer metadata — not Python object state or environment variables. The vault_key you pass to BillingAgent.__init__ lives only in memory. Loading the compiled JSON into a new BillingAgent instance requires you to pass the vault key explicitly to the constructor — the JSON alone is inert.
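If you want to verify this rather than take it on trust, a small scan of the saved file works as a pre-commit or CI check. The sketch assumes Stripe-style key prefixes (sk_ for secret keys, rk_ for restricted keys); `find_secrets_in_file` is a hypothetical helper, and you would add your own vault key prefix to the pattern.

```python
import json
import os
import re
import tempfile

# Stripe-style secret and restricted key prefixes; extend with your vault key format.
SECRET_RE = re.compile(r"\b(?:sk|rk)_(?:live|test)_[A-Za-z0-9]+")

def find_secrets_in_file(path: str) -> list:
    """Return every key-shaped token found in the file at `path`."""
    with open(path, encoding="utf-8") as fh:
        return SECRET_RE.findall(fh.read())

# Demonstration on a throwaway file shaped like a saved program.
with tempfile.TemporaryDirectory() as tmp:
    clean_path = os.path.join(tmp, "clean.json")
    with open(clean_path, "w", encoding="utf-8") as fh:
        json.dump({"demos": [{"customer_id": "cus_test_001"}]}, fh)
    clean_hits = find_secrets_in_file(clean_path)

    leaky_path = os.path.join(tmp, "leaky.json")
    with open(leaky_path, "w", encoding="utf-8") as fh:
        json.dump({"demos": [{"note": "used sk_live_abc123 here"}]}, fh)
    leaky_hits = find_secrets_in_file(leaky_path)

print(clean_hits)  # []
print(leaky_hits)  # ['sk_live_abc123']
```

A non-empty result most likely means a key leaked into a demonstration or a description field, not that DSPy serialized the constructor argument.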

How does this work with DSPy 1.x (the old dspy.OpenAI API)?

DSPy 1.x used dspy.OpenAI, dspy.settings.configure(lm=...), and a different module API. The vault key pattern applies identically — the key is a tool-level implementation detail, not a DSPy API concern. The only difference is that older DSPy versions used dspy.Tool objects for tool registration rather than plain Python functions, so wrap your tool factory output accordingly: dspy.Tool(func=stripe_charge_tool, name="stripe_charge", desc="...").

Get notified when Keybrake ships

Keybrake is the proxy in the examples above — a scoped API-key vault for the non-LLM SaaS APIs your agent calls. Per-module vault keys, per-vendor spend caps, and a full audit log. The proxy is live.