
AI agent API governance in Python: policies, enforcement, and audit logs

How to implement a production-grade governance layer for your AI agent's API calls in Python: policy definition with Pydantic, pre-call spend enforcement, per-call audit logging, and the three gaps that agent-side code alone can't close.

The happy path of an AI agent calling Stripe is three lines of Python. The failure mode — a stuck loop issuing refunds, a billing agent that ignores its own error handling, a tool that retries indefinitely on a transient network error — can cost thousands of dollars before any alert fires. Most observability tools will tell you what happened after the fact. Governance is what prevents it in the first place.

This post is a practical guide to building that governance layer in Python: how to define spend policies, how to enforce them at call time, what to log, and how to test your enforcement code. We'll also be explicit about where agent-side Python governance runs out of road and why a proxy layer is required for production-strength control.

For background on why Stripe restricted keys aren't the complete answer, see why your Stripe restricted key probably isn't restricted enough. For the complete reference on Stripe's permission toggles, see Stripe restricted API key permissions. This post assumes you already have restricted keys configured and are building the enforcement and audit layer on top.

What governance actually means for an API-calling agent

In a human-operated system, governance happens through approval workflows: a person reviews a refund before clicking Submit. In an autonomous agent system, that person is gone. Governance has to move into code — and it has to run before the API call, not after.

Three properties define useful agent API governance:

The code patterns below address all three — with clear notes on where each pattern satisfies the property and where it doesn't.

Defining a policy with Pydantic

Start with a clean policy model. Pydantic gives you validation for free and makes policies serializable to JSON (useful for storing per-run policy alongside audit logs).

from datetime import datetime, timezone
from typing import Optional
from pydantic import BaseModel, field_validator

class VendorPolicy(BaseModel):
    vendor: str                      # "stripe" | "twilio" | "resend"
    daily_usd_cap: float             # hard limit for the day in USD
    allowed_endpoints: list[str]     # e.g. ["/v1/refunds", "/v1/charges"]
    allowed_customer_ids: Optional[list[str]] = None  # None = all customers
    expires_at: Optional[datetime] = None             # None = never

    @field_validator("daily_usd_cap")
    @classmethod
    def cap_must_be_positive(cls, v: float) -> float:
        if v <= 0:
            raise ValueError("daily_usd_cap must be positive")
        return v

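    def is_expired(self) -> bool:
        if self.expires_at is None:
            return False
        expires_at = self.expires_at
        if expires_at.tzinfo is None:
            # Treat naive timestamps as UTC; comparing naive and aware
            # datetimes raises TypeError.
            expires_at = expires_at.replace(tzinfo=timezone.utc)
        return datetime.now(timezone.utc) > expires_at

    def allows_endpoint(self, path: str) -> bool:
        # Match the endpoint itself or a sub-path, so "/v1/refunds"
        # does not also allow "/v1/refunds_archive".
        return any(
            path == ep or path.startswith(ep.rstrip("/") + "/")
            for ep in self.allowed_endpoints
        )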

A refund agent running in production might have a policy like this:

refund_agent_policy = VendorPolicy(
    vendor="stripe",
    daily_usd_cap=500.00,         # max $500 in refunds per day
    allowed_endpoints=["/v1/refunds"],
    allowed_customer_ids=None,    # any customer (scope added later)
    expires_at=None,
)

Instantiate one policy per agent role. Store policies alongside your agent configuration — not hardcoded in the agent's tool implementation — so they can be updated without touching agent logic.

A pre-call spend validator

Once you have a policy, you need an enforcer: a class that tracks spend for the current day and raises before any call that would exceed the cap. Here's a minimal in-process validator:

import threading
from decimal import Decimal

class SpendEnforcer:
    """
    Thread-safe in-process spend tracker.
    Resets at midnight UTC (checked lazily on each call). Does NOT survive process restarts.
    """
    def __init__(self, policy: VendorPolicy):
        self._policy = policy
        self._spent_today: Decimal = Decimal("0.00")
        self._lock = threading.Lock()
        self._reset_date: str = self._today()

    @staticmethod
    def _today() -> str:
        return datetime.now(timezone.utc).strftime("%Y-%m-%d")

    def _check_date_rollover(self):
        today = self._today()
        if today != self._reset_date:
            self._spent_today = Decimal("0.00")
            self._reset_date = today

    def check_and_reserve(self, estimated_usd: float, endpoint: str) -> None:
        """
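        Call this before every API call.
        Raises PolicyViolation if the call would exceed the cap.

        Note: this checks the projected total but does not hold the
        estimate; spend is only counted once record_actual runs.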
        """
        if self._policy.is_expired():
            raise PolicyViolation("Agent policy has expired")

        if not self._policy.allows_endpoint(endpoint):
            raise PolicyViolation(
                f"Endpoint {endpoint!r} not in policy allowlist: "
                f"{self._policy.allowed_endpoints}"
            )

        with self._lock:
            self._check_date_rollover()
            projected = self._spent_today + Decimal(str(estimated_usd))
            cap = Decimal(str(self._policy.daily_usd_cap))
            if projected > cap:
                raise PolicyViolation(
                    f"Spend cap exceeded: projected ${projected:.2f} > "
                    f"cap ${cap:.2f} (spent today: ${self._spent_today:.2f})"
                )

    def record_actual(self, actual_usd: float) -> None:
        """Call this after the API call completes, with the parsed actual cost."""
        with self._lock:
            self._check_date_rollover()
            self._spent_today += Decimal(str(actual_usd))

class PolicyViolation(Exception):
    pass

Important limitation: This validator tracks spend in memory. It resets to zero if the process crashes and restarts. It does not aggregate across multiple agent instances. If you run ten agent workers in parallel, each has its own counter, giving you 10× your intended cap. This is the fundamental gap of agent-side enforcement, addressed in the proxy section below.

Wiring the enforcer into your agent's tool calls

The enforcer needs to wrap every Stripe call. With OpenAI Agents SDK, this looks like:

import stripe
from agents import function_tool

enforcer = SpendEnforcer(refund_agent_policy)

@function_tool
def issue_refund(charge_id: str, amount_cents: int) -> dict:
    amount_usd = amount_cents / 100

    # governance check — raises PolicyViolation before touching Stripe
    enforcer.check_and_reserve(
        estimated_usd=amount_usd,
        endpoint="/v1/refunds",
    )

    try:
        refund = stripe.Refund.create(charge=charge_id, amount=amount_cents)
        enforcer.record_actual(amount_usd)  # confirm the spend
        return {"id": refund.id, "status": refund.status}
    except stripe.error.StripeError as e:
        # call failed — do NOT record_actual (no money moved)
        raise

The same pattern works with LangChain tools (@tool) and plain Python tool wrappers. The key invariant: check_and_reserve runs before the network call, record_actual runs only on success.
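
For plain Python tool wrappers, the check-call-record invariant can be factored into a decorator so that individual tools cannot forget one of the steps. The sketch below is illustrative only: governed, StubEnforcer, and fake_refund are hypothetical names, and StubEnforcer merely stands in for an object with the same two methods as SpendEnforcer. Whether such a decorator composes cleanly with a given agent framework's own tool decorator depends on how that framework inspects function signatures.

```python
import functools


class PolicyViolation(Exception):
    pass


class StubEnforcer:
    """Stand-in exposing the same two methods as SpendEnforcer (illustration only)."""

    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def check_and_reserve(self, estimated_usd: float, endpoint: str) -> None:
        if self.spent_usd + estimated_usd > self.cap_usd:
            raise PolicyViolation("Spend cap exceeded")

    def record_actual(self, actual_usd: float) -> None:
        self.spent_usd += actual_usd


def governed(enforcer, endpoint, estimate_usd):
    """Run the policy check before the tool and record spend only on success.

    estimate_usd is a callable that derives the dollar amount from the
    tool's own arguments.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            amount = estimate_usd(*args, **kwargs)
            enforcer.check_and_reserve(amount, endpoint)  # raises before fn runs
            result = fn(*args, **kwargs)  # an exception here skips record_actual
            enforcer.record_actual(amount)
            return result
        return wrapper
    return decorator


enforcer = StubEnforcer(cap_usd=50.0)
calls = []


@governed(
    enforcer,
    "/v1/refunds",
    estimate_usd=lambda charge_id, amount_cents: amount_cents / 100,
)
def fake_refund(charge_id: str, amount_cents: int) -> dict:
    calls.append(charge_id)  # stands in for the real vendor call
    return {"charge": charge_id, "status": "succeeded"}


fake_refund("ch_1", 3000)      # $30, within the $50 cap
try:
    fake_refund("ch_2", 3000)  # would bring the total to $60
except PolicyViolation:
    pass
```

The second call is rejected before the wrapped function body runs, so the stand-in vendor call is never made.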

Audit logging per call

An audit log isn't a nice-to-have for agent systems — it's the only way to reconstruct what happened in a post-incident investigation. At minimum, every API call should produce a log row with these fields:

Field | Why it matters | Where it comes from
run_id | Ties all calls in one agent run together | Generated at agent startup (UUID4)
agent_id | Identifies the agent type (not instance) | Configuration / environment
timestamp_utc | Absolute time for ordering and incident windows | datetime.now(timezone.utc)
vendor | Which API was called | Policy / tool metadata
endpoint | Which API endpoint | Tool implementation
method | GET / POST / DELETE | Tool implementation
http_status | Success vs. error | Response object
vendor_request_id | Stripe's Request-Id for cross-referencing the Dashboard | response.headers["Request-Id"]
cost_usd | Actual cost parsed from response | Tool implementation or proxy
policy_id | Which policy was in effect | Policy model identifier

A simple SQLite-backed audit logger for a Python agent:

import sqlite3
import uuid
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from pathlib import Path

@dataclass
class AuditEntry:
    run_id: str
    agent_id: str
    timestamp_utc: str
    vendor: str
    endpoint: str
    method: str
    http_status: int
    vendor_request_id: str
    cost_usd: float
    policy_id: str
    error: str = ""

class AuditLogger:
    def __init__(self, db_path: str = "agent_audit.db"):
        self._conn = sqlite3.connect(db_path, check_same_thread=False)
        self._conn.execute("""
            CREATE TABLE IF NOT EXISTS audit_log (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                run_id TEXT NOT NULL,
                agent_id TEXT NOT NULL,
                timestamp_utc TEXT NOT NULL,
                vendor TEXT NOT NULL,
                endpoint TEXT NOT NULL,
                method TEXT NOT NULL,
                http_status INTEGER NOT NULL,
                vendor_request_id TEXT,
                cost_usd REAL NOT NULL DEFAULT 0.0,
                policy_id TEXT,
                error TEXT DEFAULT ''
            )
        """)
        self._conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_run ON audit_log(run_id)"
        )
        self._conn.commit()

    def log(self, entry: AuditEntry) -> None:
        d = asdict(entry)
        self._conn.execute(
            "INSERT INTO audit_log VALUES (NULL,"
            ":run_id,:agent_id,:timestamp_utc,:vendor,:endpoint,"
            ":method,:http_status,:vendor_request_id,:cost_usd,:policy_id,:error)",
            d,
        )
        self._conn.commit()

    def run_summary(self, run_id: str) -> dict:
        row = self._conn.execute(
            "SELECT COUNT(*) as calls, SUM(cost_usd) as total_usd "
            "FROM audit_log WHERE run_id = ?",
            (run_id,),
        ).fetchone()
        return {"calls": row[0], "total_usd": row[1] or 0.0}

Use one AuditLogger instance across the process. Generate run_id = str(uuid.uuid4()) once at agent startup and pass it to every tool.
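
Once rows are being written, a post-incident investigation mostly comes down to a few queries against the table. The sketch below uses an in-memory database with a reduced, hypothetical subset of the columns and made-up sample rows, purely to show the shape of two such queries: spend per endpoint for a run, and the failed calls in time order. The helper names spend_by_endpoint and failed_calls are assumptions, not part of the logger above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE audit_log (
        run_id TEXT NOT NULL,
        timestamp_utc TEXT NOT NULL,
        endpoint TEXT NOT NULL,
        http_status INTEGER NOT NULL,
        cost_usd REAL NOT NULL DEFAULT 0.0
    )
""")

# Illustrative sample data only.
rows = [
    ("run-a", "2025-01-01T10:00:00+00:00", "/v1/refunds", 200, 25.0),
    ("run-a", "2025-01-01T10:00:05+00:00", "/v1/refunds", 200, 25.0),
    ("run-a", "2025-01-01T10:00:09+00:00", "/v1/refunds", 429, 0.0),
    ("run-b", "2025-01-01T11:00:00+00:00", "/v1/refunds", 200, 10.0),
]
conn.executemany("INSERT INTO audit_log VALUES (?, ?, ?, ?, ?)", rows)


def spend_by_endpoint(run_id: str) -> list:
    """Call count and total cost per endpoint for one run."""
    return conn.execute(
        "SELECT endpoint, COUNT(*), SUM(cost_usd) FROM audit_log "
        "WHERE run_id = ? GROUP BY endpoint ORDER BY endpoint",
        (run_id,),
    ).fetchall()


def failed_calls(run_id: str) -> list:
    """Calls with an HTTP error status, oldest first."""
    return conn.execute(
        "SELECT timestamp_utc, endpoint, http_status FROM audit_log "
        "WHERE run_id = ? AND http_status >= 400 ORDER BY timestamp_utc",
        (run_id,),
    ).fetchall()
```

Because timestamp_utc is stored as an ISO 8601 string in a single timezone, ordering by the text column also orders by time.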

Testing governance enforcement with pytest

Governance code that isn't tested isn't governance. Three test categories cover the most common failure modes:

import pytest
from datetime import datetime, timezone, timedelta
from your_module import VendorPolicy, SpendEnforcer, PolicyViolation

@pytest.fixture
def policy():
    return VendorPolicy(
        vendor="stripe",
        daily_usd_cap=100.00,
        allowed_endpoints=["/v1/refunds"],
    )

@pytest.fixture
def enforcer(policy):
    return SpendEnforcer(policy)


class TestEndpointAllowlist:
    def test_allowed_endpoint_passes(self, enforcer):
        enforcer.check_and_reserve(10.0, "/v1/refunds")  # no exception

    def test_disallowed_endpoint_raises(self, enforcer):
        with pytest.raises(PolicyViolation, match="not in policy allowlist"):
            enforcer.check_and_reserve(10.0, "/v1/charges")

    def test_endpoint_prefix_match(self, enforcer):
        # /v1/refunds/{id} should match the /v1/refunds prefix
        enforcer.check_and_reserve(10.0, "/v1/refunds/re_abc123")


class TestSpendCap:
    def test_within_cap_passes(self, enforcer):
        enforcer.check_and_reserve(50.0, "/v1/refunds")  # 50 < 100

    def test_at_cap_passes(self, enforcer):
        enforcer.check_and_reserve(100.0, "/v1/refunds")  # exactly at cap

    def test_over_cap_raises(self, enforcer):
        with pytest.raises(PolicyViolation, match="Spend cap exceeded"):
            enforcer.check_and_reserve(100.01, "/v1/refunds")

    def test_cumulative_spend_enforced(self, enforcer):
        enforcer.check_and_reserve(60.0, "/v1/refunds")
        enforcer.record_actual(60.0)
        with pytest.raises(PolicyViolation, match="Spend cap exceeded"):
            enforcer.check_and_reserve(41.0, "/v1/refunds")  # 60 + 41 = 101


class TestPolicyExpiry:
    def test_expired_policy_raises(self, policy):
        policy.expires_at = datetime.now(timezone.utc) - timedelta(seconds=1)
        enforcer = SpendEnforcer(policy)
        with pytest.raises(PolicyViolation, match="expired"):
            enforcer.check_and_reserve(10.0, "/v1/refunds")

    def test_future_expiry_passes(self, policy):
        policy.expires_at = datetime.now(timezone.utc) + timedelta(hours=1)
        enforcer = SpendEnforcer(policy)
        enforcer.check_and_reserve(10.0, "/v1/refunds")

Run these tests in CI with the same policy objects you use in production. A policy change that accidentally widens the endpoint allowlist will fail the test_disallowed_endpoint_raises test immediately.

Where agent-side governance falls short

The in-process pattern above covers single-process, single-run agents well. It breaks in three production scenarios:

1. Multi-instance deployments

If you run five copies of your agent worker, each has its own SpendEnforcer with its own _spent_today counter. Each instance enforces the cap independently, so in the worst case your effective cap is 5 × daily_usd_cap before any single enforcer fires. Coordinating spend across processes requires a shared store (Redis, a database) with atomic increment operations — which is exactly what a proxy layer provides.

2. Process crashes and restarts

An agent that crashes mid-run loses its in-memory spend accumulation. If it restarts and the day hasn't rolled over, the new instance starts at zero — unaware of what the crashed instance already spent. For high-value operations like Stripe charges, this gap matters.

3. Revoke latency

If you detect a runaway agent and need to stop it, there's no clean way to revoke access from an in-process enforcer without killing the process — which may leave inflight calls mid-execution. A proxy layer that sits between your agent and Stripe can reject calls the moment a revoke flag is set, with sub-second latency, without touching the agent process at all.

The gap in summary: Agent-side governance works for a single process in a controlled environment. For multi-instance production deployments, the spend counter needs to live outside the agent — in a database or a proxy — so all instances share the same view of how much has been spent today. The revoke path has the same requirement: a shared control plane, not per-process state.
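
To make "a counter that lives outside the agent" concrete, here is a minimal sketch of a shared daily counter with an atomic check-and-increment. It uses a SQLite file only because that keeps the example dependency-free; the table name daily_spend and the functions open_counter and try_spend are hypothetical, and a real multi-host deployment would more likely put the same conditional update in Redis or PostgreSQL, or behind a proxy.

```python
import os
import sqlite3
import tempfile


def open_counter(db_path: str) -> sqlite3.Connection:
    # isolation_level=None puts the connection in autocommit mode, so each
    # statement below is its own transaction.
    conn = sqlite3.connect(db_path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_spend ("
        "day TEXT PRIMARY KEY, spent_cents INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def try_spend(conn, day: str, amount_cents: int, cap_cents: int) -> bool:
    """Add amount_cents to the day's total unless that would exceed the cap.

    The cap check and the increment are one UPDATE statement, so two
    workers cannot both pass the check against the same old total.
    """
    conn.execute(
        "INSERT OR IGNORE INTO daily_spend (day, spent_cents) VALUES (?, 0)",
        (day,),
    )
    cur = conn.execute(
        "UPDATE daily_spend SET spent_cents = spent_cents + ? "
        "WHERE day = ? AND spent_cents + ? <= ?",
        (amount_cents, day, amount_cents, cap_cents),
    )
    return cur.rowcount == 1  # no row updated means the cap blocked the spend


# Two connections to the same file stand in for two agent workers.
db_path = os.path.join(tempfile.mkdtemp(), "spend.db")
worker_a = open_counter(db_path)
worker_b = open_counter(db_path)

first = try_spend(worker_a, "2025-01-01", 6000, cap_cents=10000)
second = try_spend(worker_b, "2025-01-01", 6000, cap_cents=10000)
total = worker_b.execute(
    "SELECT spent_cents FROM daily_spend WHERE day = ?", ("2025-01-01",)
).fetchone()[0]
```

The second worker is refused because the counter it updates already includes the first worker's spend, and the total survives a restart of either worker because it lives on disk rather than in process memory.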

Combining agent-side and proxy-layer governance

The two layers complement each other. Agent-side enforcement catches policy violations before the request even leaves the process — including calls to endpoints that the proxy isn't watching. Proxy enforcement handles the multi-instance case, provides the authoritative spend counter, and gives you sub-second revoke.

The architecture looks like this in practice:

# Instead of calling Stripe directly:
#   stripe.api_key = "rk_live_..."

# Point your Python agent at the proxy, using a vault key:
import os
import stripe

stripe.api_key = os.environ["VAULT_KEY"]        # vault_key_xxx
stripe.api_base = "https://proxy.keybrake.com"  # proxy endpoint

# All stripe-python calls now route through the proxy.
# The proxy looks up the real Stripe key, enforces the policy,
# forwards the call, parses the cost from the response,
# and logs the call to a shared audit table.
# Your SpendEnforcer runs in addition — catching violations
# before they reach the proxy's network round-trip.

The SpendEnforcer in your Python code is a fast, in-process guard that blocks obviously bad calls before they generate network traffic. The proxy is the authoritative enforcer — its spend counter is the ground truth, and its revoke path is the kill switch.

For the full picture of why a proxy layer is required for multi-agent production setups, see why your AI agent fleet needs a vendor API gateway. For the four budget alert patterns and when each fires, see budget alerts for AI agents ranked by how late they fire.

Frequently asked questions

What's the right granularity for a policy — per agent type or per agent run?

Both. Define policies per agent type (refund agent, billing agent, analytics agent) as your baseline. Issue per-run vault keys with a tighter daily_usd_cap and a short expires_at if the run has a known scope, for example a billing reconciliation run that should touch at most ten customers. The type-level policy sets the outer bound for every run of that agent; the run-level policy narrows it for one specific execution.

Should the SpendEnforcer use a pessimistic or optimistic reservation model?

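Pessimistic. Reserve the estimated cost before the call; confirm the actual cost after. Never skip the pre-call reservation and rely only on post-call recording: if the agent issues two calls at nearly the same time (common with async or threaded code), both pass the check against the same counter value, and you overshoot the cap by up to the cost of the second call. The same race applies to a pre-call check that does not hold the estimate, which is how the minimal SpendEnforcer above behaves, so a concurrent agent needs the check and the hold to happen under one lock.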
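
A reservation model along those lines might look like the following sketch. It is a hypothetical variant, not the SpendEnforcer from earlier: the names ReservingEnforcer, CapExceeded, reserve, commit, and release are assumptions, amounts are passed as strings to keep Decimal exact, and date rollover and policy checks are left out for brevity.

```python
import threading
from decimal import Decimal


class CapExceeded(Exception):
    """Raised when a reservation would push spend past the cap."""


class ReservingEnforcer:
    """Holds each estimate until the call settles (illustrative sketch)."""

    def __init__(self, daily_usd_cap: str):
        self._cap = Decimal(daily_usd_cap)
        self._spent = Decimal("0")
        self._reserved = Decimal("0")
        self._lock = threading.Lock()

    def reserve(self, estimated_usd: str) -> Decimal:
        """Check the cap and hold the estimate in one critical section."""
        amount = Decimal(estimated_usd)
        with self._lock:
            if self._spent + self._reserved + amount > self._cap:
                raise CapExceeded(f"cap ${self._cap} would be exceeded")
            self._reserved += amount
        return amount

    def commit(self, reserved: Decimal, actual_usd: str) -> None:
        """Call succeeded: swap the hold for the actual cost."""
        with self._lock:
            self._reserved -= reserved
            self._spent += Decimal(actual_usd)

    def release(self, reserved: Decimal) -> None:
        """Call failed: drop the hold, since no money moved."""
        with self._lock:
            self._reserved -= reserved

    @property
    def committed(self) -> Decimal:
        with self._lock:
            return self._spent


enforcer = ReservingEnforcer("100.00")
hold_a = enforcer.reserve("60.00")   # first in-flight call holds $60
try:
    enforcer.reserve("60.00")        # a second concurrent call is rejected
    second_passed = True
except CapExceeded:
    second_passed = False
enforcer.release(hold_a)             # first call failed, so the hold is dropped
hold_b = enforcer.reserve("60.00")   # the same amount now fits again
enforcer.commit(hold_b, "59.50")     # settle with the actual cost
```

The trade-off is that every code path must end in either commit or release; a hold that is never settled keeps counting against the cap, which fails closed but can block legitimate calls.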

How do I estimate the cost of a Stripe API call before making it?

For refunds and charges, the cost is in the request parameters (amount in cents, divided by 100 for USD). For Stripe fee operations, you'll need to parse the response — Stripe doesn't expose cost before the call. In those cases, use a conservative estimate (the maximum expected cost for the operation) for the pre-call reservation and adjust with the actual cost from the response using record_actual.

Is SQLite sufficient for a production audit log?

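For single-process agents, yes. SQLite in WAL mode handles concurrent reads and serialized writes without corruption, and the on-disk format is durable. WAL is not the default journal mode, so enable it explicitly. For multi-process deployments, switch to PostgreSQL or a hosted SQLite solution with multi-writer support. The column layout in this post carries over to PostgreSQL, but the move is more than a connection string: you need a PostgreSQL driver in place of sqlite3, an identity or serial column in place of INTEGER PRIMARY KEY AUTOINCREMENT, and that driver's parameter placeholder style.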
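
Switching a SQLite database to WAL mode is a single PRAGMA issued on a connection. The sketch below uses a temporary file path as an assumption for the example, because an in-memory database cannot use WAL.

```python
import os
import sqlite3
import tempfile

# WAL needs a real file; the temporary directory is only for illustration.
db_path = os.path.join(tempfile.mkdtemp(), "agent_audit.db")
conn = sqlite3.connect(db_path)

# The PRAGMA returns the journal mode now in effect. If the filesystem
# cannot support WAL, it reports the previous mode instead.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
```

The setting is stored in the database file, so later connections to the same file open it in WAL mode without repeating the PRAGMA.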

Can I use the same SpendEnforcer for multiple vendors (Stripe, Twilio, Resend)?

Instantiate one SpendEnforcer per VendorPolicy. If your agent calls both Stripe and Twilio, you need two enforcers with separate policies — they track different currencies, different endpoints, and have different cost structures. Don't combine them into a single multi-vendor counter; that makes it impossible to set per-vendor caps.

What should the agent do when PolicyViolation is raised?

Stop the tool call immediately. Log the violation with the same fields as a normal audit entry (including cost_usd=0 since no call was made), then surface the exception to the agent's error handler. Do not catch PolicyViolation inside the tool itself — let it propagate so the agent's orchestration layer can decide whether to terminate the run or wait for the cap to reset. Swallowing the exception defeats the purpose of the governance layer.
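
As a rough sketch of that division of labor, the loop below lets the exception escape the tool and handles it one level up, where the run can be terminated. Everything here is hypothetical: run_agent, the step list, and the dictionary used as an audit row are stand-ins, and http_status 0 is an assumed sentinel meaning that no request was sent.

```python
class PolicyViolation(Exception):
    pass


def run_agent(steps, audit_rows):
    """Run tool calls in order and stop the run at the first policy violation."""
    completed = 0
    for endpoint, tool in steps:
        try:
            tool()
        except PolicyViolation as exc:
            # Record the blocked call with zero cost, then end the run.
            audit_rows.append({
                "endpoint": endpoint,
                "http_status": 0,  # assumed sentinel: no request was sent
                "cost_usd": 0.0,
                "error": str(exc),
            })
            return {"status": "terminated", "completed": completed, "reason": str(exc)}
        completed += 1
    return {"status": "finished", "completed": completed, "reason": ""}


def ok_tool():
    return "ok"


def blocked_tool():
    # Stands in for a tool whose enforcer check fails.
    raise PolicyViolation("Spend cap exceeded")


ran_after_violation = []


def never_reached():
    ran_after_violation.append("called")


audit_rows = []
result = run_agent(
    [
        ("/v1/refunds", ok_tool),
        ("/v1/refunds", blocked_tool),
        ("/v1/refunds", never_reached),
    ],
    audit_rows,
)
```

The tools themselves contain no try/except for PolicyViolation, so the decision to terminate, pause, or escalate stays in one place.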

Agent-side governance + a proxy enforcement layer

The Pydantic policy model and SpendEnforcer above are the right starting point for single-process agents. Keybrake provides the proxy layer that makes them production-grade: a shared spend counter across all agent instances, a sub-second kill switch, and a per-call audit log with parsed cost — without changing your Python agent code beyond two environment variables.