BentoML Stripe Integration: Restricted API Keys, Spend Caps, and Agent Governance

BentoML is a framework for packaging and serving ML models as production APIs, with first-class support for async task queues, multi-worker horizontal scaling, and GPU-accelerated runner processes. Its operational model — retry on failure, scale by spawning workers, restart on crash — is excellent for inference workloads. It creates specific hazards when those same patterns interact with Stripe billing calls that should run exactly once.

This post covers three BentoML-specific Stripe billing failure modes, the Python code that exposes each one, and the two-layer governance pattern — content-hash idempotency keys and per-request vault keys via a spend-cap proxy — that eliminates them without restructuring your Service.

Failure mode 1: async task retry re-executes the handler from the top after stripe.Charge.create() already succeeded

BentoML's @bentoml.task decorator marks an endpoint as an async task that runs in a background queue. When a task handler raises an unhandled exception and the task is retried, the entire function is re-executed from the beginning. There is no mid-function checkpoint: if stripe.Charge.create() returned successfully and record_charge_in_db() then raised a database connection error, the retry starts again at the top of the handler and calls stripe.Charge.create() a second time with the same parameters and no idempotency key.

# billing_service.py - UNSAFE: task retry re-fires stripe.Charge.create() on database error
import bentoml
import stripe
import os

stripe.api_key = os.environ["STRIPE_SECRET_KEY"]  # unrestricted live key

@bentoml.service(
    traffic={"timeout": 60},
    workers=4,
)
class BillingService:

    @bentoml.task
    async def charge_customer(
        self,
        customer_id: str,
        amount_cents: int,
        billing_period: str,
    ) -> dict:
        # Task retry restarts here — no idempotency key, no guard
        charge = stripe.Charge.create(
            amount=amount_cents,
            currency="usd",
            customer=customer_id,
            description=f"Subscription {billing_period}",
        )
        # Database write fails intermittently under load
        record_charge_in_db(customer_id, charge.id, billing_period)
        return {"charge_id": charge.id, "status": charge.status}

On the first retry, Stripe has no record of a previous call because no idempotency key was sent, so it treats the request as new and creates a second charge, ch_B, for the same customer, amount, and billing period. With a retry budget of three retries with exponential backoff (a typical configuration), a single database outage can produce four charges per customer: the original call plus three retries. BentoML's task dashboard shows each attempt as expected retry behavior; the duplicate charges stay invisible until customers dispute them or a billing reconciliation runs days later.

The fix: derive a content-hash idempotency key from the stable inputs (customer_id, amount_cents, and billing_period) and pass it with every Stripe call. Because a retry of the same task invocation receives the same input arguments, the key is identical on every attempt, and Stripe's idempotency layer returns the original ch_A on subsequent calls without creating a new charge.

# billing_service.py — SAFE: content-hash idempotency key survives all task retries
import bentoml
import stripe
import hashlib
import os

stripe.api_key = os.environ["STRIPE_SECRET_KEY"]

def billing_idempotency_key(customer_id: str, amount_cents: int, billing_period: str) -> str:
    raw = f"{customer_id}:{amount_cents}:{billing_period}:bentoml-billing"
    return hashlib.sha256(raw.encode()).hexdigest()[:32]

@bentoml.service(
    traffic={"timeout": 60},
    workers=4,
)
class BillingService:

    @bentoml.task
    async def charge_customer(
        self,
        customer_id: str,
        amount_cents: int,
        billing_period: str,
    ) -> dict:
        idempotency_key = billing_idempotency_key(customer_id, amount_cents, billing_period)

        # Same key on every retry — Stripe returns ch_A without creating ch_B, ch_C, ch_D
        charge = stripe.Charge.create(
            amount=amount_cents,
            currency="usd",
            customer=customer_id,
            description=f"Subscription {billing_period}",
            idempotency_key=idempotency_key,
        )
        record_charge_in_db(customer_id, charge.id, billing_period)
        return {"charge_id": charge.id, "status": charge.status}

Two additional safeguards pair well with this. First, scope the Stripe key to POST /v1/charges only, so the task handler cannot access customer data, issue refunds, or read subscription details regardless of what the caller passes as input. Second, set a per-request daily cap equal to the expected charge amount plus a small buffer. A unit-confusion bug that multiplies an amount already expressed in cents by 100 is then blocked by the proxy on the first call, instead of creating a 100× charge that can be attempted up to four times.
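The same cap can also be mirrored locally as a cheap pre-flight check, so an obviously wrong amount fails before any network call. The following is a minimal sketch, not part of BentoML, Stripe, or Keybrake: the helper name check_amount_within_cap, the AmountExceedsCap exception, and the 10% default buffer are all assumptions made for illustration.

```python
class AmountExceedsCap(ValueError):
    """Raised when a charge amount is larger than the expected amount plus buffer."""


def check_amount_within_cap(amount_cents, expected_cents, buffer_pct=10):
    # Hypothetical helper: applies the same "expected amount plus buffer" rule
    # locally that the spend-cap proxy is configured to enforce remotely.
    if isinstance(amount_cents, bool) or not isinstance(amount_cents, int):
        raise TypeError("amount_cents must be an integer number of cents")
    if amount_cents <= 0:
        raise ValueError("amount_cents must be positive")
    # Integer arithmetic keeps the cap exact (no float rounding).
    cap_cents = expected_cents * (100 + buffer_pct) // 100
    if amount_cents > cap_cents:
        raise AmountExceedsCap(
            f"{amount_cents} cents exceeds cap of {cap_cents} cents"
        )
    return amount_cents
```

Calling check_amount_within_cap(amount_cents, 2999) at the top of the handler rejects a float dollar amount with a TypeError and a 100× amount with AmountExceedsCap. It complements the proxy cap rather than replacing it, since a bug in the handler could skip the local check entirely.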

Failure mode 2: multiple worker processes share one unrestricted STRIPE_SECRET_KEY with no per-request deduplication

BentoML Services scale horizontally by spawning multiple worker processes, controlled by the workers parameter in the @bentoml.service decorator. All workers are forked from the same parent process and inherit the same environment variables — including STRIPE_SECRET_KEY. There is no mechanism built into BentoML that detects when two concurrent task invocations carry identical billing inputs and deduplicates them before either reaches Stripe.

# batch_billing.py - UNSAFE: concurrent workers each call stripe.Charge.create() independently
import bentoml
import stripe
import os
from typing import List

stripe.api_key = os.environ["STRIPE_SECRET_KEY"]  # same key injected into all 4 workers

@bentoml.service(workers=4)
class BatchBillingService:

    @bentoml.api
    async def bill_cohort(self, customer_ids: List[str], billing_period: str) -> List[str]:
        results = []
        for customer_id in customer_ids:
            # No dedup, no idempotency key: if this endpoint is called twice concurrently
            # with the same payload (agent retry on HTTP timeout), both workers fire
            charge = stripe.Charge.create(
                amount=2999,
                currency="usd",
                customer=customer_id,
                description=f"Pro subscription {billing_period}",
            )
            results.append(charge.id)
        return results

This scenario is realistic: an AI agent calling this endpoint with a 30-second timeout may retry the HTTP request if no response arrives within that window. The first request is still in flight on Worker 1, iterating over customer IDs and creating charges. The retry is routed to Worker 2, which starts processing the same list from the beginning. Both workers call stripe.Charge.create() for the same customer IDs with no idempotency keys, so a second set of charges is created. Both workers eventually return a list of charge IDs, but there are now twice as many charges as customers, all with status: succeeded.

The fix is the same idempotency key function applied to each individual charge within the loop. An HTTP-level duplicate request carrying identical inputs generates an identical key for each customer, so Stripe returns the existing charges rather than creating new ones. For the multi-worker isolation problem specifically, an additional layer of protection is a vault key issued before the billing loop begins. Each vault key is scoped to POST /v1/charges with a spend cap equal to the sum of all expected charges in the cohort plus a 10% buffer. The cap bounds what any single key can spend. It blocks a duplicate request outright only when both requests draw on the same vault key, because the first request's charges will then have consumed the cap. A handler that issues a fresh key per request, as in the example here, limits each request to one cohort's worth of charges and relies on the idempotency keys for deduplication across requests.

# batch_billing.py — SAFE: idempotency keys + vault key cap per cohort request
import bentoml
import stripe
import hashlib
import os
import httpx
from typing import List

KEYBRAKE_ADMIN_KEY = os.environ["KEYBRAKE_ADMIN_KEY"]
KEYBRAKE_PROXY_URL = "https://proxy.keybrake.com"

def billing_idempotency_key(customer_id: str, amount_cents: int, billing_period: str) -> str:
    raw = f"{customer_id}:{amount_cents}:{billing_period}:bentoml-billing"
    return hashlib.sha256(raw.encode()).hexdigest()[:32]

def issue_vault_key(cohort_id: str, customer_count: int, max_amount_cents: int) -> str:
    total_cap = customer_count * max_amount_cents
    resp = httpx.post(
        f"{KEYBRAKE_PROXY_URL}/keys",
        headers={"Authorization": f"Bearer {KEYBRAKE_ADMIN_KEY}"},
        json={
            "label": f"bentoml-cohort-{cohort_id}",
            "vendor": "stripe",
            "daily_usd_cap": round(total_cap * 1.10 / 100, 2),
            "allowed_endpoints": ["POST /v1/charges"],
            "expires_in_seconds": 600,
        },
        timeout=10.0,
    )
    # Fail loudly if the proxy rejects the request, rather than hitting a KeyError below
    resp.raise_for_status()
    return resp.json()["vault_key"]

@bentoml.service(workers=4)
class BatchBillingService:

    @bentoml.api
    async def bill_cohort(
        self,
        cohort_id: str,
        customer_ids: List[str],
        billing_period: str,
    ) -> List[str]:
        # Issue vault key before the loop; cap = customers × amount × 1.10 buffer
        vault_key = issue_vault_key(
            cohort_id=cohort_id,
            customer_count=len(customer_ids),
            max_amount_cents=2999,
        )
        # Route Stripe traffic through the proxy (the same constant value on every request)
        stripe.api_base = f"{KEYBRAKE_PROXY_URL}/stripe"

        results = []
        for customer_id in customer_ids:
            idempotency_key = billing_idempotency_key(customer_id, 2999, billing_period)
            charge = stripe.Charge.create(
                amount=2999,
                currency="usd",
                customer=customer_id,
                description=f"Pro subscription {billing_period}",
                idempotency_key=idempotency_key,
                # Pass the vault key per call: assigning stripe.api_key globally would let
                # concurrent requests in the same worker overwrite each other's key
                api_key=vault_key,
            )
            results.append(charge.id)
        return results
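
The cap arithmetic in issue_vault_key uses floating-point multiplication, which can leave tiny representation errors before rounding. As a hedged alternative, the sketch below computes the same "cohort total plus buffer" figure with Decimal. The function name cohort_cap_usd and the choice to round up to the next cent are assumptions for illustration, not something the proxy requires.

```python
from decimal import Decimal, ROUND_UP


def cohort_cap_usd(customer_count, max_amount_cents, buffer_pct=10):
    # Hypothetical helper: total expected spend for the cohort, in cents.
    total_cents = customer_count * max_amount_cents
    # Apply the percentage buffer exactly, then convert cents to dollars.
    buffered_cents = Decimal(total_cents) * (100 + buffer_pct) / 100
    # Round up to a whole cent so the cap never lands below the buffered total.
    return (buffered_cents / 100).quantize(Decimal("0.01"), rounding=ROUND_UP)
```

For a cohort of 100 customers at 2999 cents each, this gives a cap of 3298.90 USD. A Decimal is not JSON-serializable by default, so it would need converting (for example with float() or str()) before being placed in the request body, depending on what the proxy accepts.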

Failure mode 3: Service restart re-routes an in-flight billing request to a fresh worker that charges again

BentoML Services restart during deployment updates, when a worker process is killed by an OOM event, or during auto-scaling on BentoML Cloud. When a restart occurs mid-request, the in-flight HTTP connection is dropped. From the caller's perspective (whether that caller is an AI agent, an orchestrator, or a BentoML client), the request returned a connection error rather than a response. The caller retries the request against the new worker, which has no state from the previous execution and calls stripe.Charge.create() as if it were the first time.

# subscription_service.py — UNSAFE: restart re-fires charge with no idempotency key
import bentoml
import stripe
import os

stripe.api_key = os.environ["STRIPE_SECRET_KEY"]

@bentoml.service(traffic={"timeout": 30})
class SubscriptionService:

    @bentoml.api
    async def activate_subscription(
        self,
        customer_id: str,
        plan: str,
        billing_period: str,
    ) -> dict:
        amount_cents = {"free": 0, "pro": 2999, "team": 9999}[plan]
        if amount_cents == 0:
            return {"status": "free_tier", "charge_id": None}

        # If the worker restarts after this call and before the response is sent,
        # the client retry fires stripe.Charge.create() on a fresh worker and ch_B is created
        charge = stripe.Charge.create(
            amount=amount_cents,
            currency="usd",
            customer=customer_id,
            description=f"{plan.title()} subscription {billing_period}",
        )
        # Store subscription record; if this raises, a client retry also creates ch_B
        activate_in_database(customer_id, plan, charge.id, billing_period)
        return {"status": "active", "charge_id": charge.id}

This failure mode is particularly difficult to observe because it requires two conditions to coincide: a worker restart or OOM kill and an in-flight billing request at that exact moment. In production, BentoML deployments happen regularly — every code push, model update, or scaling event triggers a rolling restart. During a high-traffic billing run (end-of-month subscription renewal, for example), the probability of at least one request straddling a restart is non-trivial. The resulting duplicate charge appears as a separate Stripe object with a different charge.id and a different request_id header — no obvious signal that it duplicates an earlier charge for the same customer and billing period.
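
Because the duplicate carries no obvious marker, reconciliation has to group charges by what they were for rather than by their IDs. The following is a rough sketch of that grouping over plain dictionaries. The field names mirror the id, customer, amount, and description attributes used in the examples here, and the function name find_duplicate_charges is hypothetical.

```python
from collections import defaultdict


def find_duplicate_charges(charges):
    # Group charge IDs by (customer, amount, description). The description
    # embeds the billing period in these examples, so it acts as the period key.
    groups = defaultdict(list)
    for charge in charges:
        key = (charge["customer"], charge["amount"], charge["description"])
        groups[key].append(charge["id"])
    # Any group holding more than one charge ID is a suspected duplicate.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```

A check like this only detects duplicates after the fact; it does not prevent them.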

Content-hash idempotency keys derived from customer_id, plan, and billing_period close this gap for retries that arrive while Stripe still retains the key (Stripe may prune idempotency keys once they are at least 24 hours old). The fresh worker receives the same request arguments, derives the same key, and Stripe returns the original ch_A from its idempotency store without creating a new charge. The caller receives a successful response on the retry, with the same charge ID that the original worker would have returned.

# subscription_service.py — SAFE: idempotency key survives worker restart and client retry
import bentoml
import stripe
import hashlib
import os

stripe.api_key = os.environ["STRIPE_SECRET_KEY"]

def subscription_idempotency_key(customer_id: str, plan: str, billing_period: str) -> str:
    raw = f"{customer_id}:{plan}:{billing_period}:bentoml-subscription"
    return hashlib.sha256(raw.encode()).hexdigest()[:32]

@bentoml.service(traffic={"timeout": 30})
class SubscriptionService:

    @bentoml.api
    async def activate_subscription(
        self,
        customer_id: str,
        plan: str,
        billing_period: str,
    ) -> dict:
        amount_cents = {"free": 0, "pro": 2999, "team": 9999}[plan]
        if amount_cents == 0:
            return {"status": "free_tier", "charge_id": None}

        idempotency_key = subscription_idempotency_key(customer_id, plan, billing_period)

        # Stable across worker restart + client retry — Stripe returns ch_A on all attempts
        charge = stripe.Charge.create(
            amount=amount_cents,
            currency="usd",
            customer=customer_id,
            description=f"{plan.title()} subscription {billing_period}",
            idempotency_key=idempotency_key,
        )
        activate_in_database(customer_id, plan, charge.id, billing_period)
        return {"status": "active", "charge_id": charge.id}

Approach comparison

| Approach | Task retry safe | Worker concurrency safe | Restart / re-route safe | Spend cap | Audit log | One-click revoke |
| --- | --- | --- | --- | --- | --- | --- |
| No idempotency key (default) | No | No | No | No | No | No |
| Content-hash idempotency key only | Yes | Yes | Yes | No | No | No |
| Stripe restricted key only | No | No | No | No | No | Rotate key (slow) |
| Idempotency key + vault key via Keybrake proxy | Yes | Yes | Yes | Yes (per request) | Yes (every call) | Yes (instant) |

Gap analysis: four more BentoML billing risks

1. Batch API endpoint partially processes a cohort before timeout

BentoML's @bentoml.api endpoints can accept lists of inputs for batch processing. If a batch endpoint processes 60 of 100 customer IDs and then hits the traffic.timeout limit, BentoML returns a timeout error to the caller. A retry re-sends the full list of 100 IDs. Without idempotency keys, the already-charged 60 customers receive a second charge. With content-hash keys derived from (customer_id, amount_cents, billing_period), Stripe returns the existing charges for all 60 and creates new charges only for the remaining 40 — no duplicates regardless of how many times the batch is retried.
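
The partial-batch replay can be illustrated with a small simulation. The sketch below uses a hypothetical in-memory FakeChargeStore that models only one behavior, "the same idempotency key returns the same charge". It is not Stripe's implementation, and the fail_after parameter exists purely to imitate a timeout partway through the cohort.

```python
import hashlib


def billing_idempotency_key(customer_id, amount_cents, billing_period):
    raw = f"{customer_id}:{amount_cents}:{billing_period}:bentoml-billing"
    return hashlib.sha256(raw.encode()).hexdigest()[:32]


class FakeChargeStore:
    """In-memory stand-in for an idempotent charge API (illustration only)."""

    def __init__(self):
        self.by_key = {}
        self.created = 0

    def create(self, idempotency_key):
        # Replay: a known key returns the stored charge without creating one.
        if idempotency_key in self.by_key:
            return self.by_key[idempotency_key]
        self.created += 1
        charge_id = f"ch_{self.created:04d}"
        self.by_key[idempotency_key] = charge_id
        return charge_id


def bill_cohort(store, customer_ids, billing_period, amount_cents=2999, fail_after=None):
    results = []
    for index, customer_id in enumerate(customer_ids):
        if fail_after is not None and index == fail_after:
            raise TimeoutError("simulated traffic.timeout")
        key = billing_idempotency_key(customer_id, amount_cents, billing_period)
        results.append(store.create(key))
    return results


def simulate():
    store = FakeChargeStore()
    customers = [f"cus_{i:03d}" for i in range(100)]
    try:
        bill_cohort(store, customers, "2026-07", fail_after=60)
    except TimeoutError:
        pass
    created_before_retry = store.created
    # The retry re-sends the full list of 100 customers.
    bill_cohort(store, customers, "2026-07")
    return created_before_retry, store.created
```

In this simulation, 60 charges exist when the timeout fires and exactly 100 exist after the full retry, one per customer.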

2. max_concurrency does not deduplicate identical in-flight requests

BentoML's traffic.max_concurrency setting limits how many concurrent requests a single worker handles, but it operates at the scheduling layer — not the deduplication layer. Two concurrent requests with identical billing inputs (same customer_id, amount_cents, billing_period) are both dispatched to workers as valid independent tasks. The idempotency key at the Stripe layer is the only deduplication mechanism; max_concurrency does not substitute for it.
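
If you also want to collapse identical requests before they reach Stripe, one option is an in-process "single-flight" guard keyed by the idempotency key. The sketch below is an assumption-laden illustration, not a BentoML feature: the SingleFlight class is hypothetical, and it only deduplicates coroutines running on the same event loop in the same worker process, so it complements the Stripe-level key rather than replacing it.

```python
import asyncio


class SingleFlight:
    """Share one in-flight result among concurrent callers using the same key."""

    def __init__(self):
        self._inflight = {}

    async def run(self, key, coro_factory):
        task = self._inflight.get(key)
        if task is None:
            # First caller for this key: start the work and remember the task.
            task = asyncio.ensure_future(coro_factory())
            self._inflight[key] = task
            # Forget the task once it finishes so later calls run fresh.
            task.add_done_callback(lambda _done: self._inflight.pop(key, None))
        # Every caller (first or not) awaits the same task and gets its result.
        return await task
```

Requests routed to different workers never see each other's in-flight table, which is why the idempotency key remains the actual guarantee.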

3. BentoML Runner retries on model inference error propagate through to billing calls

BentoML Runners (used for model inference) have their own retry logic, separate from Service-level task retries. Suppose your billing handler calls a Runner to compute a dynamic charge amount, for example a pricing model that calculates a usage-based fee. If a Runner failure propagates as an exception and the whole handler is re-executed from the top (this depends on how your error handling is structured), any Stripe call that had already completed earlier in the handler fires again. To stay stable across all retry paths, derive the idempotency key from the request inputs before any branching logic or Runner call, not from values the Runner computes.
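
One way to keep the key stable when the amount is computed dynamically is to build it only from request inputs and compute the amount afterwards. The sketch below assumes a hypothetical price_fn callable standing in for the model call; usage_billing_key and plan_usage_charge are illustrative names, not BentoML or Stripe APIs.

```python
import hashlib


def usage_billing_key(customer_id, billing_period):
    # Only request inputs go into the key; the computed amount is left out.
    raw = f"{customer_id}:{billing_period}:bentoml-usage-billing"
    return hashlib.sha256(raw.encode()).hexdigest()[:32]


def plan_usage_charge(customer_id, billing_period, price_fn):
    # Derive the key first, before any call that may fail or vary between retries.
    key = usage_billing_key(customer_id, billing_period)
    # price_fn is a stand-in for the pricing model invocation.
    amount_cents = price_fn(customer_id, billing_period)
    return {
        "amount": amount_cents,
        "currency": "usd",
        "customer": customer_id,
        "idempotency_key": key,
    }
```

The trade-off is that a retry whose model output differs reuses the key with a different amount. Stripe compares a replayed request's parameters with the original and returns an error when they differ, so that case fails loudly instead of producing a second charge.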

4. BentoML Cloud auto-scaling events interrupt billing runs during cohort processing

On BentoML Cloud, auto-scaling can add or remove worker instances in response to traffic changes. Scale-down events send a termination signal to workers. A worker that is midway through a billing cohort loop is not idle, yet it may still be killed once its graceful shutdown window expires. Requests that were in flight on the killed worker are re-queued and dispatched to surviving workers. Each re-queued request starts the billing loop from the first customer ID, not from the customer where the killed worker stopped. Content-hash idempotency keys ensure that already-charged customers in the cohort return their existing ch_A rather than generating ch_B, regardless of how many times a scale event causes the loop to restart.

Enforcement with pytest

# test_bentoml_billing.py
import hashlib
import pytest
from unittest.mock import patch, MagicMock

def billing_idempotency_key(customer_id, amount_cents, billing_period):
    raw = f"{customer_id}:{amount_cents}:{billing_period}:bentoml-billing"
    return hashlib.sha256(raw.encode()).hexdigest()[:32]

def test_idempotency_key_stable_across_task_retries():
    """Same inputs always produce the same key regardless of retry count."""
    key_attempt_1 = billing_idempotency_key("cus_abc", 2999, "2026-07")
    key_attempt_2 = billing_idempotency_key("cus_abc", 2999, "2026-07")
    key_attempt_3 = billing_idempotency_key("cus_abc", 2999, "2026-07")
    assert key_attempt_1 == key_attempt_2 == key_attempt_3

def test_idempotency_key_distinct_per_billing_period():
    """Different billing periods produce different keys — no cross-period dedup."""
    key_july = billing_idempotency_key("cus_abc", 2999, "2026-07")
    key_august = billing_idempotency_key("cus_abc", 2999, "2026-08")
    assert key_july != key_august

def test_idempotency_key_distinct_per_customer():
    """Different customers in the same cohort get independent keys."""
    key_a = billing_idempotency_key("cus_abc", 2999, "2026-07")
    key_b = billing_idempotency_key("cus_xyz", 2999, "2026-07")
    assert key_a != key_b

def test_vault_key_rejects_refund_endpoint():
    """Vault key scoped to POST /v1/charges cannot call POST /v1/refunds."""
    import stripe

    # The proxy is mocked here, so this only pins the error the handler should expect;
    # verifying real enforcement needs an integration test against the proxy itself.
    with patch("stripe.Refund.create") as mock_refund:
        mock_refund.side_effect = Exception(
            "403 Forbidden: vault key not authorized for POST /v1/refunds"
        )
        with pytest.raises(Exception, match="not authorized"):
            stripe.Refund.create(charge="ch_test_abc")

def test_preflight_audit_check_blocks_duplicate_cohort():
    """Pre-flight check short-circuits billing if charge already recorded for period."""
    mock_db = MagicMock()
    mock_db.get_charge.return_value = {"charge_id": "ch_A", "status": "succeeded"}

    def preflight_check(customer_id, billing_period):
        existing = mock_db.get_charge(customer_id, billing_period)
        if existing:
            return existing["charge_id"]
        return None

    result = preflight_check("cus_abc", "2026-07")
    assert result == "ch_A"
    mock_db.get_charge.assert_called_once_with("cus_abc", "2026-07")

FAQ

Is the BentoML task invocation ID safe to use as an idempotency key?

No. BentoML generates a new invocation ID for each task submission, including retries of the same logical task. An agent that retries a failed HTTP call to your BentoML Service will generate a new invocation ID on the retry, making it useless as a deduplication key at the Stripe layer. The content-hash derived from your billing inputs — customer_id, amount_cents, billing_period — is stable across all retry paths because it depends on the inputs, not on internal BentoML identifiers.
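
The difference is easy to see side by side. In the sketch below, uuid.uuid4() is only a stand-in for any identifier that is freshly generated per submission; it does not reproduce BentoML's actual ID format. The helper names are hypothetical.

```python
import hashlib
import uuid


def content_key(customer_id, amount_cents, billing_period):
    raw = f"{customer_id}:{amount_cents}:{billing_period}:bentoml-billing"
    return hashlib.sha256(raw.encode()).hexdigest()[:32]


def per_attempt_ids(attempts):
    # Stand-in for a per-submission identifier: new on every attempt.
    return [uuid.uuid4().hex for _ in range(attempts)]


def per_attempt_content_keys(attempts):
    # Content hash: identical on every attempt because the inputs are identical.
    return [content_key("cus_abc", 2999, "2026-07") for _ in range(attempts)]
```

Three attempts yield three distinct per-attempt identifiers but a single content key, and only the latter lets Stripe recognize the attempts as the same charge.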

How does the restart failure mode interact with BentoML's graceful shutdown?

BentoML sends SIGTERM to workers during a graceful shutdown and waits for in-flight requests to complete until the shutdown window expires. For short billing calls, graceful shutdown typically protects you. The risk is a forced kill (SIGKILL) when the graceful window expires before the billing handler finishes, which can happen during long cohort loops or slow database writes. Idempotency keys protect you against the forced-kill case; graceful shutdown protects you against the ordinary restart case. Use both together.

Can I issue vault keys inside the BentoML task handler itself, or must they be issued upstream?

You can issue vault keys inside the handler — that is the pattern shown in Failure mode 2. The key constraint is that the vault key must be issued before the first Stripe call and with a cap that reflects the expected total spend for that invocation. Issuing the key inside the handler is fine; issuing it after any Stripe calls have already been made defeats the cap because those calls bypassed the proxy.

What happens if the Keybrake proxy is unreachable when my BentoML task runs?

The Stripe call will fail with a connection error rather than silently bypassing the proxy. Your task handler should treat proxy connection errors the same as Stripe API errors — raise an exception and let BentoML's retry mechanism handle the backoff. Because the idempotency key is derived from stable inputs, the retry against the proxy (once it recovers) will produce the same key, and Stripe will return the existing charge if it was created on a prior attempt. No duplicate charge is created.

Does BentoML Cloud handle secrets rotation in a way that affects vault key distribution?

BentoML Cloud injects secrets as environment variables at deployment time. A secrets rotation event (updating KEYBRAKE_ADMIN_KEY in BentoML Cloud) requires a redeployment to take effect — the running workers still have the old key until the rolling restart completes. During the restart, some workers use the old admin key and some use the new one, both of which can issue valid vault keys as long as both are active on the Keybrake side. Keep the old admin key active for the duration of the rolling restart, then deactivate it once all workers are running with the new key.

Should I use @bentoml.task or @bentoml.api for billing endpoints?

For billing operations that involve a Stripe call, @bentoml.task is preferable because it provides explicit task-level retry semantics and a task queue that decouples the caller from the execution. The idempotency key pattern works identically for both decorators. The main advantage of @bentoml.task is that the caller receives a task ID immediately and can poll for the result, so the caller's HTTP timeout does not interrupt the billing execution mid-flight. That removes client-side timeout retries as a trigger for the restart failure mode, leaving only OOM and forced kills.

Scope vault keys before your BentoML Service starts billing

Keybrake issues short-lived vault keys your BentoML handlers can use as Stripe API keys. Each key is scoped to specific endpoints, capped at a daily USD limit, and auto-expires. Every proxied call is logged to a queryable audit table with the customer ID, amount, and timestamp. One-click revoke stops a runaway billing loop without touching your Stripe account settings.
