Agent Governance

Vertex AI Stripe Integration: Restricted API Keys, Spend Caps, and Agent Governance

Google's Vertex AI SDK and the google.generativeai package make it straightforward to wire Stripe into a Gemini-powered agent via function declarations. The risk surfaces in three specific places: API retry logic that re-executes FunctionCall responses, Gemini's native parallel function calling that can fire two Stripe charges in a single model response, and ChatSession conversation history that replays completed billing operations on ambiguous follow-up prompts.

This post covers all three failure modes specific to the Vertex AI / Gemini SDK architecture and shows the two-layer governance pattern that closes each one: a restricted Stripe API key as a first layer, and per-run vault keys via a spend-cap proxy as a second.

The standard Vertex AI Stripe function-calling pattern

The baseline pattern for a Stripe-capable agent with Gemini uses the vertexai or google.generativeai SDK to define a function declaration and handle the resulting FunctionCall parts:

import vertexai
from vertexai.generative_models import (
    FunctionDeclaration, GenerativeModel, Part, Tool
)
import stripe, os

vertexai.init(project="my-project", location="us-central1")
stripe.api_key = os.environ["STRIPE_SECRET_KEY"]  # sk_live_...

charge_fn = FunctionDeclaration(
    name="charge_stripe",
    description="Charge a Stripe customer for a billing period.",
    parameters={
        "type": "object",
        "properties": {
            "customer_id":    {"type": "string", "description": "Stripe customer ID"},
            "amount_cents":   {"type": "integer", "description": "Amount in cents"},
            "billing_period": {"type": "string", "description": "e.g. '2026-06'"},
        },
        "required": ["customer_id", "amount_cents", "billing_period"],
    },
)

model = GenerativeModel(
    "gemini-1.5-pro-002",
    tools=[Tool(function_declarations=[charge_fn])],
)

response = model.generate_content("Charge customer cus_Abc123 $29 for the June plan")

# Handle function call
if response.candidates[0].content.parts[0].function_call:
    fc = response.candidates[0].content.parts[0].function_call
    if fc.name == "charge_stripe":
        args = dict(fc.args)
        charge = stripe.Charge.create(
            amount=args["amount_cents"],
            currency="usd",
            customer=args["customer_id"],
            description=f"Subscription {args['billing_period']}",
        )
        print(f"Charged: {charge.id}")

This works correctly in the happy path. Three distinct failure modes appear when the API call needs to be retried, when Gemini emits parallel function calls, or when the agent uses ChatSession for multi-turn conversation.

Failure mode 1: generate_content() retry re-executes the FunctionCall

The Vertex AI API — like most Google Cloud APIs — uses exponential backoff retry on transient errors (503 Service Unavailable, 429 Resource Exhausted). Most production SDK configurations, including the Vertex AI Python SDK's built-in retry policy and wrappers like tenacity or google-api-core's retry, automatically re-issue the generate_content() request on failure. When the model returns a FunctionCall response and the client is what fails (or the network between client and model does), retrying the same prompt with the same conversation history can produce an identical FunctionCall response — and your handler executes it again:

import vertexai
from vertexai.generative_models import GenerativeModel, Tool, FunctionDeclaration
from google.api_core import retry
import stripe, os

stripe.api_key = os.environ["STRIPE_SECRET_KEY"]

# Retry on transient errors — standard production pattern
@retry.Retry(predicate=retry.if_transient_error)
def call_model(model, prompt):
    return model.generate_content(prompt)

model = GenerativeModel("gemini-1.5-pro-002", tools=[Tool(function_declarations=[charge_fn])])
response = call_model(model, "Charge customer cus_Abc123 $29 for June")

# If the first call timed out before the response arrived,
# call_model retried and returned a second FunctionCall.
# The handler below executes on whatever response came back.
# If both calls completed, this handler runs once per successful response —
# but the charge may have already fired before the timeout.
for part in response.candidates[0].content.parts:
    if part.function_call and part.function_call.name == "charge_stripe":
        args = dict(part.function_call.args)
        # No idempotency key — retry scenario creates second charge
        stripe.Charge.create(
            amount=args["amount_cents"],
            currency="usd",
            customer=args["customer_id"],
        )

What breaks: On a transient error — say, the Vertex AI API returned 503 after accepting the request but before the client received the response — the retry fires a new generate_content() request. The model produces an identical FunctionCall response. Your handler calls stripe.Charge.create() again with no idempotency key. Stripe treats it as a new request and creates a second charge. More critically: even if only one generate_content() call succeeded, the charge may have fired before the timeout — so the retry produces a duplicate.

The fix: bind a content-hash idempotency key to the Stripe call at handler time, derived from the function call arguments themselves. Because the arguments are identical across retries (same customer, same amount, same billing period), the idempotency key is identical — Stripe deduplicates and returns the original charge object:

import hashlib, stripe, os

PROXY_URL = "https://proxy.keybrake.com"
VAULT_KEY = os.environ["KEYBRAKE_VAULT_KEY_BILLING"]  # POST /v1/charges only

def handle_function_call(fc_name: str, args: dict) -> dict:
    if fc_name != "charge_stripe":
        return {"error": "unknown_function"}

    idempotency_key = hashlib.sha256(
        f"{args['customer_id']}:{args['amount_cents']}:{args['billing_period']}".encode()
    ).hexdigest()[:32]

    client = stripe.StripeClient(api_key=VAULT_KEY, base_url=PROXY_URL + "/stripe/")
    try:
        charge = client.charges.create(params={
            "amount":          args["amount_cents"],
            "currency":        "usd",
            "customer":        args["customer_id"],
            "description":     f"Subscription {args['billing_period']}",
            "idempotency_key": idempotency_key,
        })
        return {"charge_id": charge.id, "status": charge.status}
    except stripe.StripeError as e:
        return {"error": str(e), "idempotency_key": idempotency_key}

What this fixes: The idempotency key is computed from (customer_id, amount_cents, billing_period) — arguments the model emits deterministically for the same billing operation. Whether the handler runs once, twice on retry, or three times on a flaky connection, Stripe deduplicates all calls with the same key and returns the original charge object. The customer is charged exactly once. The vault key ensures the call also routes through the proxy for spend-cap enforcement and audit logging.

Failure mode 2: Parallel function calls charge simultaneously

Gemini 1.5 Pro and Gemini 2.0 Flash support parallel function calling: the model can emit multiple Part.function_call objects in a single generate_content() response. This is useful when the model determines that two independent operations can be executed concurrently — fetching data from two sources, looking up two customers at once. The risk is that the model may decide two billing operations are independent and emit them as parallel calls:

import vertexai
from vertexai.generative_models import GenerativeModel, Tool, FunctionDeclaration, Part
import stripe, os

stripe.api_key = os.environ["STRIPE_SECRET_KEY"]

model = GenerativeModel("gemini-1.5-pro-002", tools=[Tool(function_declarations=[charge_fn])])

# "Process this month's renewals" — model may batch multiple charges
response = model.generate_content(
    "Process renewals for cus_Abc123 ($29) and cus_Xyz789 ($29) for June"
)

# Standard pattern: iterate all parts
for part in response.candidates[0].content.parts:
    if part.function_call:
        fc = part.function_call
        args = dict(fc.args)
        # Both charge_stripe calls execute — correct here (different customers).
        # The bug: if the model emits the same customer twice due to ambiguity,
        # both calls go to Stripe with no idempotency key = two charges.
        stripe.Charge.create(
            amount=args["amount_cents"],
            currency="usd",
            customer=args["customer_id"],
            description=f"Subscription {args.get('billing_period', 'unknown')}",
        )

What breaks: Parallel function calls are correct when the customers are different. The failure mode emerges in two scenarios. First, if the model makes an inference error and emits charge_stripe twice with the same customer_id — perhaps because the prompt was ambiguous about whether "and Abc123 again" means a second charge — both calls fire simultaneously with no idempotency key, and Stripe creates two charges. Second, in a batch renewal workflow, the model may emit a charge call for a customer who was already charged in a prior session, because it doesn't have access to charge history in its context.

Content-hash idempotency keys resolve both scenarios: identical arguments produce identical keys, so even two simultaneous calls for the same customer in the same billing period collapse to one charge at Stripe. Per-role vault keys with a daily spend cap provide a secondary backstop:

import asyncio, hashlib, stripe, os

PROXY_URL = "https://proxy.keybrake.com"
VAULT_KEY = os.environ["KEYBRAKE_VAULT_KEY_BILLING"]

async def execute_charge(args: dict) -> dict:
    idempotency_key = hashlib.sha256(
        f"{args['customer_id']}:{args['amount_cents']}:{args['billing_period']}".encode()
    ).hexdigest()[:32]

    client = stripe.StripeClient(api_key=VAULT_KEY, base_url=PROXY_URL + "/stripe/")
    try:
        # asyncio.to_thread keeps the event loop free while Stripe responds
        charge = await asyncio.to_thread(
            client.charges.create,
            params={
                "amount":          args["amount_cents"],
                "currency":        "usd",
                "customer":        args["customer_id"],
                "description":     f"Subscription {args['billing_period']}",
                "idempotency_key": idempotency_key,
            },
        )
        return {"charge_id": charge.id, "status": charge.status, "key": idempotency_key}
    except stripe.StripeError as e:
        return {"error": str(e), "idempotency_key": idempotency_key}

async def handle_parallel_calls(response) -> list[dict]:
    tasks = []
    for part in response.candidates[0].content.parts:
        if part.function_call and part.function_call.name == "charge_stripe":
            tasks.append(execute_charge(dict(part.function_call.args)))
    return await asyncio.gather(*tasks)

What this fixes: Even when asyncio.gather fires all charge calls concurrently, identical (customer_id, amount_cents, billing_period) tuples produce the same idempotency key. Stripe's idempotency layer deduplicates concurrent requests with the same key — one charge is created, both async calls receive the same charge object. The vault key's daily spend cap provides a secondary guard: if 50 customers should be charged $29 but the model somehow emits 100 calls, the cap prevents the overage.

Failure mode 3: ChatSession history replays completed billing steps

Gemini's ChatSession (via GenerativeModel.start_chat()) maintains conversation history automatically. Every message, function call, and function response is accumulated in chat.history and resent with each subsequent send_message() call. This is what enables multi-turn agents to maintain context. The risk: completed billing operations — the original FunctionCall request and the FunctionResponse with the charge ID — appear in the history. On ambiguous follow-up messages, the model may interpret the history as an incomplete operation and call charge_stripe again:

import vertexai
from vertexai.generative_models import GenerativeModel, Tool, Part, Content
import stripe, os

stripe.api_key = os.environ["STRIPE_SECRET_KEY"]

model = GenerativeModel("gemini-1.5-flash-002", tools=[Tool(function_declarations=[charge_fn])])
chat = model.start_chat()

# Turn 1: charge the customer
response1 = chat.send_message("Charge cus_Abc123 $29 for June")
# ... handler fires charge_stripe, gets charge_ch_abc ...
# ... send FunctionResponse back to chat ...

# Turn 2 (hours later, in resumed session): ambiguous prompt
response2 = chat.send_message(
    "The customer says they haven't received a receipt — can you check?"
)
# Gemini sees: prior FunctionCall(charge_stripe, {...}) + FunctionResponse({charge_id: ch_abc})
# Interpretation: "maybe the charge didn't complete — let me try again"
# Model emits: FunctionCall(charge_stripe, same args as before)
# Handler fires again — second Stripe charge, no idempotency key

What breaks: The model has access to the full billing history in its context and can misinterpret a customer-service message as an instruction to retry billing. This is especially likely when the follow-up message is ambiguous ("can you check?" could mean "check the charge status" or "re-run the charge if it failed"). Because the history contains both the function call and its response, the model has all the arguments it needs to fire an exact duplicate — and without idempotency keys, Stripe creates a second charge for the same amount.

Three controls work together: idempotency keys make the Stripe call safe to retry; explicit function call semantics in the system prompt reduce misinterpretation; and a read-only audit vault key lets the agent look up existing charges without being able to create new ones:

import hashlib, stripe, os
from vertexai.generative_models import GenerativeModel, Tool, FunctionDeclaration, Part, Content

PROXY_URL = "https://proxy.keybrake.com"
BILLING_KEY = os.environ["KEYBRAKE_VAULT_KEY_BILLING"]  # POST /v1/charges only
AUDIT_KEY   = os.environ["KEYBRAKE_VAULT_KEY_AUDIT"]   # GET /v1/charges only

# Separate function for looking up existing charges — routes through audit vault key
lookup_fn = FunctionDeclaration(
    name="get_charge_status",
    description="Look up an existing Stripe charge by ID. Read-only.",
    parameters={
        "type": "object",
        "properties": {
            "charge_id": {"type": "string", "description": "Stripe charge ID (ch_...)"},
        },
        "required": ["charge_id"],
    },
)

# Model has both functions; router decides which vault key to use
model = GenerativeModel(
    "gemini-1.5-flash-002",
    system_instruction=(
        "Use charge_stripe only when explicitly instructed to bill a customer for a "
        "specific billing_period. Use get_charge_status to look up any charge mentioned "
        "in the conversation history. Never re-charge a customer based solely on context "
        "from a prior conversation turn — always require an explicit instruction with "
        "a confirmed billing_period."
    ),
    tools=[Tool(function_declarations=[charge_fn, lookup_fn])],
)

def handle_function_call(fc_name: str, args: dict) -> dict:
    if fc_name == "charge_stripe":
        idempotency_key = hashlib.sha256(
            f"{args['customer_id']}:{args['amount_cents']}:{args['billing_period']}".encode()
        ).hexdigest()[:32]
        client = stripe.StripeClient(api_key=BILLING_KEY, base_url=PROXY_URL + "/stripe/")
        try:
            charge = client.charges.create(params={
                "amount":          args["amount_cents"],
                "currency":        "usd",
                "customer":        args["customer_id"],
                "description":     f"Subscription {args['billing_period']}",
                "idempotency_key": idempotency_key,
            })
            return {"charge_id": charge.id, "status": charge.status}
        except stripe.StripeError as e:
            return {"error": str(e)}

    elif fc_name == "get_charge_status":
        # Read-only: routes through audit vault key — POST /v1/charges → 403
        audit_client = stripe.StripeClient(api_key=AUDIT_KEY, base_url=PROXY_URL + "/stripe/")
        try:
            charge = audit_client.charges.retrieve(args["charge_id"])
            return {"charge_id": charge.id, "status": charge.status, "amount": charge.amount}
        except stripe.StripeError as e:
            return {"error": str(e)}

    return {"error": f"unknown function: {fc_name}"}

What this fixes: When a customer asks about their charge, the model calls get_charge_status (read-only, routes through audit vault key) rather than re-firing charge_stripe. If the model mistakenly calls charge_stripe with the audit vault key, the proxy rejects the request with 403 before it reaches Stripe — no charge is created. When charge_stripe is correctly called with the billing vault key, the content-hash idempotency key ensures any duplicate calls collapse to one charge at Stripe. The system prompt instruction to require an explicit billing_period prevents the model from inferring the billing period from conversation history.

One-line proxy override

The Keybrake proxy is compatible with the standard Stripe Python SDK's StripeClient interface. Switching from a direct Stripe call to the proxy requires one line change:

# Before
stripe.api_key = os.environ["STRIPE_SECRET_KEY"]
charge = stripe.Charge.create(amount=2900, currency="usd", customer="cus_Abc123")

# After — routes through proxy, enforces spend cap, writes audit log
from stripe import StripeClient
client = StripeClient(
    api_key=os.environ["KEYBRAKE_VAULT_KEY"],
    base_url="https://proxy.keybrake.com/stripe/",
)
charge = client.charges.create(params={"amount": 2900, "currency": "usd", "customer": "cus_Abc123"})

All three failure modes above use StripeClient with base_url pointing at the proxy. No other Vertex AI or Gemini SDK code changes are needed — the function declaration, tool configuration, and response handling are unchanged.

Comparison: raw key vs restricted key vs vault key

Property Raw sk_live_ key Restricted Stripe key Vault key (Keybrake proxy)
Endpoint allowlist No — full API access Partial — Stripe-enforced resource set Yes — per-role allowlist, proxy-enforced
Daily spend cap No No Yes — configurable per vault key
Per-agent isolation No — all Gemini agents share one key No — still shared Yes — one vault key per agent or role
API retry guard No — retry duplicates charges No — application-level problem Yes — with idempotency key in handler
Parallel call dedup No — parallel calls each create a charge No — application-level problem Yes — content-hash idempotency key collapses duplicates
Audit trail Stripe Dashboard only Stripe Dashboard only Yes — per-call log at proxy layer
Kill switch Rotate key (affects all agents) Rotate key (affects all agents) Revoke vault key (scoped to one agent or role)

pytest enforcement suite

import pytest, os, hashlib, stripe

PROXY_URL = "https://proxy.keybrake.com"
BILLING_KEY = os.environ.get("KEYBRAKE_VAULT_KEY_BILLING", "")
AUDIT_KEY   = os.environ.get("KEYBRAKE_VAULT_KEY_AUDIT", "")

def test_vault_key_not_live():
    """Billing vault key must never be a raw Stripe live key."""
    assert not BILLING_KEY.startswith("sk_live_"), (
        "KEYBRAKE_VAULT_KEY_BILLING must be a vault key, not a raw sk_live_ key"
    )

def test_stripe_client_uses_proxy():
    """StripeClient must route through the Keybrake proxy, not api.stripe.com."""
    client = stripe.StripeClient(api_key=BILLING_KEY, base_url=PROXY_URL + "/stripe/")
    assert PROXY_URL in client.base_url, (
        "StripeClient base_url must point at proxy.keybrake.com"
    )

def test_idempotency_key_is_deterministic():
    """Same (customer, amount, period) must produce the same idempotency key."""
    def make_key(customer_id, amount_cents, billing_period):
        return hashlib.sha256(
            f"{customer_id}:{amount_cents}:{billing_period}".encode()
        ).hexdigest()[:32]

    key1 = make_key("cus_Abc123", 2900, "2026-06")
    key2 = make_key("cus_Abc123", 2900, "2026-06")
    assert key1 == key2, "Idempotency key must be deterministic for same inputs"

def test_different_customers_get_different_keys():
    """Different customers in a parallel batch must get different idempotency keys."""
    def make_key(customer_id, amount_cents, billing_period):
        return hashlib.sha256(
            f"{customer_id}:{amount_cents}:{billing_period}".encode()
        ).hexdigest()[:32]

    key_abc = make_key("cus_Abc123", 2900, "2026-06")
    key_xyz = make_key("cus_Xyz789", 2900, "2026-06")
    assert key_abc != key_xyz, "Different customers must produce different idempotency keys"

def test_audit_key_cannot_create_charges(monkeypatch):
    """Audit vault key must be rejected by proxy when used for POST /v1/charges."""
    # Proxy returns 403 for audit key on POST endpoints
    import httpx

    def mock_request(*args, **kwargs):
        return httpx.Response(403, json={"error": "vault_key_not_authorized"})

    monkeypatch.setattr(httpx, "request", mock_request)
    client = stripe.StripeClient(api_key=AUDIT_KEY, base_url=PROXY_URL + "/stripe/")
    with pytest.raises(stripe.PermissionError):
        client.charges.create(params={"amount": 2900, "currency": "usd", "customer": "cus_test"})

Gap analysis

Vertex AI API quota retry vs application retry. Google Cloud's google-api-core library applies automatic retry on 429 (quota exceeded) and 503 errors by default. If your code also wraps generate_content() in application-level retry logic (e.g., tenacity), you may get two layers of retry — SDK-level and application-level — each capable of re-executing the function call handler. Use one retry mechanism, not two, and ensure idempotency keys are computed before any retry boundary.

Gemini streaming and function call boundaries. The Vertex AI SDK supports streaming responses via generate_content(stream=True). With streaming, function call parts arrive progressively; if the stream is interrupted mid-response, retrying produces the full response including the function call. Ensure your handler only executes the function call after the stream is fully consumed (all parts received), not on partial receipt.

Parallel function calls and billing_period scoping. In a batch renewal that charges 50 customers, all with the same billing_period, the idempotency keys differ per customer because customer_id is part of the hash input. However, if a customer has two subscriptions at different price points and both appear in the batch, two keys are generated — both charges are intentional. Verify that your billing data source doesn't accidentally include the same subscription twice before constructing the function call arguments.

Vertex AI Agent Builder webhooks. If you're using Vertex AI Agent Builder (the managed platform) rather than the raw Gemini SDK, billing calls happen via webhook — an HTTPS endpoint your Cloud Function or Cloud Run service exposes. Vertex AI Agent Builder retries webhook calls on 5xx responses with up to 3 retries. Apply the same idempotency key pattern in the webhook handler: derive the key from the function call arguments passed by the platform, not from a request-time UUID. The platform passes the same argument values on each retry.

ChatSession history growth and context window. Long-running ChatSession conversations accumulate history. After dozens of turns, the history may include multiple prior billing operations across different billing periods. The model's context window can correctly distinguish them by billing_period, but a very long history increases the risk of context confusion. Consider periodically pruning old FunctionCall/FunctionResponse pairs from chat.history once the billing period they reference has closed.

Frequently asked questions

Does the Vertex AI SDK's built-in retry automatically deduplicate Stripe calls?

No. The Vertex AI SDK retries the generate_content() API call itself, not the function call execution. Deduplication of Stripe calls is entirely the responsibility of the handler code that processes FunctionCall responses. The SDK has no awareness of what happens after it returns a response — it doesn't know you're about to call Stripe. You must add idempotency keys in the handler.

How does Gemini parallel function calling differ from a normal batch?

In a normal batch, you call generate_content() once per customer and execute each response sequentially. In Gemini parallel function calling, a single generate_content() call returns multiple Part.function_call objects in one response — the model decides to parallelize. The difference matters because a normal batch gives you explicit control over sequencing and deduplication; parallel function calls happen as part of one model response and require handling all parts safely before sending any results back.

Should I use ChatSession or stateless generate_content() for billing agents?

Prefer stateless generate_content() for billing steps. Stateless calls don't accumulate billing history in an in-memory session — each call is self-contained, with only the context you explicitly provide. Use ChatSession for conversational interfaces where the user may refer back to earlier messages, but manage history explicitly: don't include prior FunctionCall/FunctionResponse pairs from completed billing periods in new sessions.

How do I scope vault keys when using Vertex AI Agent Builder webhooks?

Create one vault key per webhook endpoint and per agent role. Your "billing" webhook endpoint uses a vault key that allows only POST /v1/charges. Your "status" webhook endpoint uses a vault key that allows only GET /v1/charges. Pass the appropriate vault key as an environment variable to each Cloud Function or Cloud Run service. The proxy enforces the allowlist — even if someone extracts the key from the billing webhook's environment, it cannot be used to read customer data or access other Stripe endpoints.

Does the proxy work with Vertex AI Agent Builder's Cloud Run webhook integration?

Yes. The proxy is a standard HTTPS endpoint that the Stripe SDK routes through. Your Cloud Run webhook handler calls stripe.StripeClient(api_key=VAULT_KEY, base_url=PROXY_URL+"/stripe/") and otherwise uses the SDK normally. No changes to the Vertex AI Agent Builder configuration are needed — the platform calls your webhook, your webhook calls Stripe via the proxy, and the proxy enforces spend caps and logs the call.

What happens to the ChatSession history if I include a charge error in the conversation?

When a charge fails — card declined, network error, spend cap exceeded — and you include the error in the FunctionResponse you send back to the chat, the model sees it as an incomplete operation. On the next message, it may attempt to retry. This is usually correct behavior, but pair it with idempotency keys: if the original charge actually succeeded before the error was returned (e.g., Stripe processed it but the SDK timed out on the response), the retry with the same key returns the original charge rather than creating a second one.

Vault keys for Vertex AI Stripe workflows

Keybrake issues scoped vault keys for Stripe — per-agent endpoint allowlists, daily spend caps, and a per-call audit log. One line change from stripe.api_key to stripe.StripeClient(api_key=VAULT_KEY, base_url=PROXY_URL+"/stripe/"). Proxy is live now.