AI agent API governance in Python: policies, enforcement, and audit logs
How to implement a production-grade governance layer for your AI agent's API calls in Python — from policy definition with Pydantic to pre-call spend enforcement, per-call audit logging, and the two gaps that agent-side code alone can't close.
The happy path of an AI agent calling Stripe is three lines of Python. The failure mode — a stuck loop issuing refunds, a billing agent that ignores its own error handling, a tool that retries indefinitely on a transient network error — can cost thousands of dollars before any alert fires. Most observability tools will tell you what happened after the fact. Governance is what prevents it in the first place.
This post is a practical guide to building that governance layer in Python: how to define spend policies, how to enforce them at call time, what to log, and how to test your enforcement code. We'll also be explicit about where agent-side Python governance runs out of road and why a proxy layer is required for production-strength control.
For background on why Stripe restricted keys aren't the complete answer, see why your Stripe restricted key probably isn't restricted enough. For the complete reference on Stripe's permission toggles, see Stripe restricted API key permissions. This post assumes you already have restricted keys configured and are building the enforcement and audit layer on top.
What governance actually means for an API-calling agent
In a human-operated system, governance happens through approval workflows: a person reviews a refund before clicking Submit. In an autonomous agent system, that person is gone. Governance has to move into code — and it has to run before the API call, not after.
Three properties define useful agent API governance:
- Pre-call enforcement. The check happens before the request leaves your process. Not in a webhook, not in a billing alarm, not in a Slack notification that fires 20 minutes later.
- Persistence across restarts. A spend counter in a Python variable resets when the process crashes. Governance state needs to survive the failure modes that agents actually experience.
- Attribution. When something goes wrong, you need to know which agent run caused which calls. A shared API key gives you one audit stream with no per-run separation.
The code patterns below address all three — with clear notes on where each pattern satisfies the property and where it doesn't.
Defining a policy with Pydantic
Start with a clean policy model. Pydantic gives you validation for free and makes policies serializable to JSON (useful for storing per-run policy alongside audit logs).
```python
from datetime import datetime, timezone
from typing import Optional

from pydantic import BaseModel, field_validator


class VendorPolicy(BaseModel):
    vendor: str                                       # "stripe" | "twilio" | "resend"
    daily_usd_cap: float                              # hard limit for the day in USD
    allowed_endpoints: list[str]                      # e.g. ["/v1/refunds", "/v1/charges"]
    allowed_customer_ids: Optional[list[str]] = None  # None = all customers
    expires_at: Optional[datetime] = None             # None = never; must be timezone-aware

    @field_validator("daily_usd_cap")
    @classmethod
    def cap_must_be_positive(cls, v: float) -> float:
        if v <= 0:
            raise ValueError("daily_usd_cap must be positive")
        return v

    def is_expired(self) -> bool:
        if self.expires_at is None:
            return False
        # Comparing against a naive datetime would raise TypeError here.
        return datetime.now(timezone.utc) > self.expires_at

    def allows_endpoint(self, path: str) -> bool:
        # Plain prefix match: "/v1/refunds" also allows "/v1/refunds/re_abc123".
        return any(path.startswith(ep) for ep in self.allowed_endpoints)
```
A refund agent running in production might have a policy like this:
```python
refund_agent_policy = VendorPolicy(
    vendor="stripe",
    daily_usd_cap=500.00,        # max $500 in refunds per day
    allowed_endpoints=["/v1/refunds"],
    allowed_customer_ids=None,   # any customer (scope added later)
    expires_at=None,
)
```
Instantiate one policy per agent role. Store policies alongside your agent configuration — not hardcoded in the agent's tool implementation — so they can be updated without touching agent logic.
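One caveat on allows_endpoint: a plain startswith check also accepts any path that merely shares the prefix. The sketch below shows a stricter, segment-aware match. The function name endpoint_allowed and the path /v1/refunds_export are hypothetical, chosen only to illustrate the difference; this is one possible tightening, not the only one.

```python
def endpoint_allowed(path: str, allowed: list[str]) -> bool:
    """Return True if path equals an allowed endpoint or is nested under it."""
    # Drop any query string and trailing slash before comparing.
    clean = path.split("?", 1)[0].rstrip("/")
    for ep in allowed:
        base = ep.rstrip("/")
        # Match the endpoint itself, or a child path such as /v1/refunds/{id}.
        if clean == base or clean.startswith(base + "/"):
            return True
    return False


allowed = ["/v1/refunds"]
print(endpoint_allowed("/v1/refunds", allowed))            # True
print(endpoint_allowed("/v1/refunds/re_abc123", allowed))  # True
print(endpoint_allowed("/v1/refunds_export", allowed))     # False (hypothetical path)
```

Comparing on a segment boundary keeps the convenient /v1/refunds/{id} behaviour while rejecting look-alike paths.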
A pre-call spend validator
Once you have a policy, you need an enforcer: a class that tracks spend for the current day and raises before any call that would exceed the cap. Here's a minimal in-process validator:
```python
import threading
from datetime import datetime, timezone
from decimal import Decimal


class PolicyViolation(Exception):
    """Raised when a call would break the active policy."""


class SpendEnforcer:
    """
    Thread-safe in-process spend tracker.
    Resets at midnight UTC. Does NOT survive process restarts.
    """

    def __init__(self, policy: VendorPolicy):
        self._policy = policy
        self._spent_today: Decimal = Decimal("0.00")
        self._lock = threading.Lock()
        self._reset_date: str = self._today()

    @staticmethod
    def _today() -> str:
        return datetime.now(timezone.utc).strftime("%Y-%m-%d")

    def _check_date_rollover(self) -> None:
        # Caller must hold self._lock.
        today = self._today()
        if today != self._reset_date:
            self._spent_today = Decimal("0.00")
            self._reset_date = today

    def check_and_reserve(self, estimated_usd: float, endpoint: str) -> None:
        """
        Call this before every API call.
        Raises PolicyViolation if the call would exceed the cap.

        Despite the name, this minimal version only checks the projected
        total. Nothing is held between the check and record_actual, so two
        concurrent callers can both pass.
        """
        if self._policy.is_expired():
            raise PolicyViolation("Agent policy has expired")
        if not self._policy.allows_endpoint(endpoint):
            raise PolicyViolation(
                f"Endpoint {endpoint!r} not in policy allowlist: "
                f"{self._policy.allowed_endpoints}"
            )
        if estimated_usd < 0:
            # A negative estimate would lower the projected total.
            raise PolicyViolation("estimated_usd must not be negative")
        with self._lock:
            self._check_date_rollover()
            # Go through str() so the float's binary rounding noise is not carried over.
            projected = self._spent_today + Decimal(str(estimated_usd))
            cap = Decimal(str(self._policy.daily_usd_cap))
            if projected > cap:
                raise PolicyViolation(
                    f"Spend cap exceeded: projected ${projected:.2f} > "
                    f"cap ${cap:.2f} (spent today: ${self._spent_today:.2f})"
                )

    def record_actual(self, actual_usd: float) -> None:
        """Call this after the API call completes, with the parsed actual cost."""
        with self._lock:
            self._check_date_rollover()
            self._spent_today += Decimal(str(actual_usd))
```
Wiring the enforcer into your agent's tool calls
The enforcer needs to wrap every Stripe call. With the OpenAI Agents SDK, this looks like:
```python
import stripe
from agents import function_tool

enforcer = SpendEnforcer(refund_agent_policy)


@function_tool
def issue_refund(charge_id: str, amount_cents: int) -> dict:
    amount_usd = amount_cents / 100

    # Governance check: raises PolicyViolation before touching Stripe
    enforcer.check_and_reserve(
        estimated_usd=amount_usd,
        endpoint="/v1/refunds",
    )

    try:
        refund = stripe.Refund.create(charge=charge_id, amount=amount_cents)
    except stripe.error.StripeError:
        # Call failed: do NOT record_actual. Caveat: after a timeout or
        # connection error the refund may still have gone through.
        raise

    enforcer.record_actual(amount_usd)  # confirm the spend
    return {"id": refund.id, "status": refund.status}
```
The same pattern works with LangChain tools (@tool) and plain Python tool wrappers. The key invariant: check_and_reserve runs before the network call, record_actual runs only on success.
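To keep that invariant from being re-implemented by hand in every tool, it can be factored into a decorator. The following is a minimal plain-Python sketch: the names governed, cost_of, and FakeEnforcer are hypothetical, and FakeEnforcer only stands in for SpendEnforcer so the example runs without Stripe or Pydantic. How such a wrapper composes with a given agent framework's own tool decorator is not covered here.

```python
import functools
from typing import Any, Callable


def governed(enforcer: Any, endpoint: str, cost_of: Callable[..., float]):
    """Decorator factory: check before the call, record only on success.

    `enforcer` is anything with check_and_reserve() and record_actual().
    `cost_of` maps the tool's arguments to an estimated USD amount.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            usd = cost_of(*args, **kwargs)
            enforcer.check_and_reserve(estimated_usd=usd, endpoint=endpoint)
            result = fn(*args, **kwargs)  # if this raises, nothing is recorded
            enforcer.record_actual(usd)
            return result
        return wrapper
    return decorator


class FakeEnforcer:
    """Hypothetical stand-in for SpendEnforcer that just records what it saw."""

    def __init__(self):
        self.events = []

    def check_and_reserve(self, estimated_usd, endpoint):
        self.events.append(("check", estimated_usd, endpoint))

    def record_actual(self, actual_usd):
        self.events.append(("record", actual_usd))


fake = FakeEnforcer()


@governed(fake, "/v1/refunds", cost_of=lambda charge_id, amount_cents: amount_cents / 100)
def issue_refund(charge_id, amount_cents):
    # Simulated vendor call; a real tool would call the vendor SDK here.
    if charge_id == "ch_bad":
        raise RuntimeError("simulated vendor error")
    return {"id": "re_fake", "status": "succeeded"}


issue_refund("ch_ok", 1500)
print(fake.events)  # [('check', 15.0, '/v1/refunds'), ('record', 15.0)]
```

A failed call leaves only the check event behind, which matches the rule that record_actual runs only on success.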
Audit logging per call
An audit log isn't a nice-to-have for agent systems — it's the only way to reconstruct what happened in a post-incident investigation. At minimum, every API call should produce a log row with these fields:
| Field | Why it matters | Where it comes from |
|---|---|---|
| run_id | Ties all calls in one agent run together | Generated at agent startup (UUID4) |
| agent_id | Identifies the agent type (not instance) | Configuration / environment |
| timestamp_utc | Absolute time for ordering and incident windows | datetime.now(timezone.utc) |
| vendor | Which API was called | Policy / tool metadata |
| endpoint | Which API endpoint | Tool implementation |
| method | GET / POST / DELETE | Tool implementation |
| http_status | Success vs. error | Response object |
| vendor_request_id | Stripe's Request-Id for cross-referencing the Dashboard | response.headers["Request-Id"] |
| cost_usd | Actual cost parsed from response | Tool implementation or proxy |
| policy_id | Which policy was in effect | Policy model identifier |
A simple SQLite-backed audit logger for a Python agent:
```python
import sqlite3
import threading
import uuid                                # used by callers to generate run_id
from dataclasses import dataclass, asdict
from datetime import datetime, timezone   # used by callers to fill timestamp_utc


@dataclass
class AuditEntry:
    run_id: str
    agent_id: str
    timestamp_utc: str
    vendor: str
    endpoint: str
    method: str
    http_status: int
    vendor_request_id: str
    cost_usd: float
    policy_id: str
    error: str = ""


class AuditLogger:
    def __init__(self, db_path: str = "agent_audit.db"):
        # check_same_thread=False lets several threads share the connection,
        # so every access below is serialized with a lock.
        self._conn = sqlite3.connect(db_path, check_same_thread=False)
        self._lock = threading.Lock()
        self._conn.execute("PRAGMA journal_mode=WAL")
        self._conn.execute("""
            CREATE TABLE IF NOT EXISTS audit_log (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                run_id TEXT NOT NULL,
                agent_id TEXT NOT NULL,
                timestamp_utc TEXT NOT NULL,
                vendor TEXT NOT NULL,
                endpoint TEXT NOT NULL,
                method TEXT NOT NULL,
                http_status INTEGER NOT NULL,
                vendor_request_id TEXT,
                cost_usd REAL NOT NULL DEFAULT 0.0,
                policy_id TEXT,
                error TEXT DEFAULT ''
            )
        """)
        self._conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_run ON audit_log(run_id)"
        )
        self._conn.commit()

    def log(self, entry: AuditEntry) -> None:
        d = asdict(entry)
        with self._lock:
            # Name the columns so the insert survives later schema additions.
            self._conn.execute(
                "INSERT INTO audit_log ("
                "run_id, agent_id, timestamp_utc, vendor, endpoint, "
                "method, http_status, vendor_request_id, cost_usd, policy_id, error"
                ") VALUES ("
                ":run_id, :agent_id, :timestamp_utc, :vendor, :endpoint, "
                ":method, :http_status, :vendor_request_id, :cost_usd, :policy_id, :error)",
                d,
            )
            self._conn.commit()

    def run_summary(self, run_id: str) -> dict:
        with self._lock:
            row = self._conn.execute(
                "SELECT COUNT(*) AS calls, SUM(cost_usd) AS total_usd "
                "FROM audit_log WHERE run_id = ?",
                (run_id,),
            ).fetchone()
        # SUM() is NULL when the run has no rows yet.
        return {"calls": row[0], "total_usd": row[1] or 0.0}
```
Use one AuditLogger instance across the process. Generate run_id = str(uuid.uuid4()) once at agent startup and pass it to every tool.
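Passing run_id to every tool by hand is easy to forget. One possible alternative, sketched here with only the standard library, is to hold it in a context variable and read it wherever an audit row is built. The helper names start_run and base_audit_fields and the agent ID refund-agent are hypothetical. Context variables are inherited by asyncio tasks, but worker threads you start yourself may need the value passed explicitly.

```python
import contextvars
import uuid
from datetime import datetime, timezone

# Holds the run_id for the current execution context.
_current_run_id = contextvars.ContextVar("run_id")


def start_run() -> str:
    """Generate a fresh run_id once, at agent startup."""
    run_id = str(uuid.uuid4())
    _current_run_id.set(run_id)
    return run_id


def base_audit_fields(agent_id: str) -> dict:
    """Fields shared by every audit row for the current run."""
    return {
        # Raises LookupError if start_run() was never called, which is
        # preferable to silently logging rows without attribution.
        "run_id": _current_run_id.get(),
        "agent_id": agent_id,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    }


run_id = start_run()
fields = base_audit_fields("refund-agent")
print(fields["run_id"] == run_id)  # True
```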
Testing governance enforcement with pytest
Governance code that isn't tested isn't governance. Three test categories cover the most common failure modes:
```python
import pytest
from datetime import datetime, timezone, timedelta

from your_module import VendorPolicy, SpendEnforcer, PolicyViolation


@pytest.fixture
def policy():
    return VendorPolicy(
        vendor="stripe",
        daily_usd_cap=100.00,
        allowed_endpoints=["/v1/refunds"],
    )


@pytest.fixture
def enforcer(policy):
    return SpendEnforcer(policy)


class TestEndpointAllowlist:
    def test_allowed_endpoint_passes(self, enforcer):
        enforcer.check_and_reserve(10.0, "/v1/refunds")  # no exception

    def test_disallowed_endpoint_raises(self, enforcer):
        with pytest.raises(PolicyViolation, match="not in policy allowlist"):
            enforcer.check_and_reserve(10.0, "/v1/charges")

    def test_endpoint_prefix_match(self, enforcer):
        # /v1/refunds/{id} should match the /v1/refunds prefix
        enforcer.check_and_reserve(10.0, "/v1/refunds/re_abc123")


class TestSpendCap:
    def test_within_cap_passes(self, enforcer):
        enforcer.check_and_reserve(50.0, "/v1/refunds")  # 50 < 100

    def test_at_cap_passes(self, enforcer):
        enforcer.check_and_reserve(100.0, "/v1/refunds")  # exactly at cap

    def test_over_cap_raises(self, enforcer):
        with pytest.raises(PolicyViolation, match="Spend cap exceeded"):
            enforcer.check_and_reserve(100.01, "/v1/refunds")

    def test_cumulative_spend_enforced(self, enforcer):
        enforcer.check_and_reserve(60.0, "/v1/refunds")
        enforcer.record_actual(60.0)
        with pytest.raises(PolicyViolation, match="Spend cap exceeded"):
            enforcer.check_and_reserve(41.0, "/v1/refunds")  # 60 + 41 = 101


class TestPolicyExpiry:
    def test_expired_policy_raises(self, policy):
        policy.expires_at = datetime.now(timezone.utc) - timedelta(seconds=1)
        enforcer = SpendEnforcer(policy)
        with pytest.raises(PolicyViolation, match="expired"):
            enforcer.check_and_reserve(10.0, "/v1/refunds")

    def test_future_expiry_passes(self, policy):
        policy.expires_at = datetime.now(timezone.utc) + timedelta(hours=1)
        enforcer = SpendEnforcer(policy)
        enforcer.check_and_reserve(10.0, "/v1/refunds")
```
Run these tests in CI, and point the allowlist tests at the same policy objects you use in production rather than a copy defined in the fixture. A policy change that accidentally adds /v1/charges to the refund agent's allowlist will then fail test_disallowed_endpoint_raises immediately.
Where agent-side governance falls short
The in-process pattern above covers single-process, single-run agents well. It breaks in three production scenarios:
1. Multi-instance deployments
If you run five copies of your agent worker, each has its own SpendEnforcer with its own _spent_today counter. Each instance enforces the cap independently, so in the worst case the fleet spends 5 × daily_usd_cap before every enforcer has fired. Coordinating spend across processes requires a shared store (Redis, a database) with atomic increment operations, which is exactly what a proxy layer provides.
2. Process crashes and restarts
An agent that crashes mid-run loses its in-memory spend accumulation. If it restarts and the day hasn't rolled over, the new instance starts at zero — unaware of what the crashed instance already spent. For high-value operations like Stripe charges, this gap matters.
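For agents that all run on one host, a small step toward closing this gap is to keep the counter on disk. The sketch below stores the daily total in SQLite and wraps the read and the write in a single BEGIN IMMEDIATE transaction, so separate connections to the same file cannot both pass the check. The class name PersistentSpendCounter, the table daily_spend, and the use of integer cents are assumptions made for illustration. It does not coordinate across machines, and each instance should be used from one thread.

```python
import os
import sqlite3
import tempfile
from datetime import datetime, timezone


class PersistentSpendCounter:
    """Daily spend counter stored in SQLite, in integer cents."""

    def __init__(self, db_path: str, cap_cents: int):
        self._cap_cents = cap_cents
        # isolation_level=None: no implicit transactions; BEGIN/COMMIT are explicit.
        self._conn = sqlite3.connect(db_path, isolation_level=None, timeout=5.0)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS daily_spend ("
            "day TEXT PRIMARY KEY, spent_cents INTEGER NOT NULL)"
        )

    def try_spend(self, amount_cents: int) -> bool:
        """Atomically add amount_cents if the cap allows it; return False otherwise."""
        if amount_cents < 0:
            raise ValueError("amount_cents must not be negative")
        day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
        # BEGIN IMMEDIATE takes the write lock up front, so the read and the
        # write below cannot interleave with another connection's.
        self._conn.execute("BEGIN IMMEDIATE")
        try:
            row = self._conn.execute(
                "SELECT spent_cents FROM daily_spend WHERE day = ?", (day,)
            ).fetchone()
            spent = row[0] if row else 0
            allowed = spent + amount_cents <= self._cap_cents
            if allowed:
                self._conn.execute(
                    "INSERT OR REPLACE INTO daily_spend (day, spent_cents) VALUES (?, ?)",
                    (day, spent + amount_cents),
                )
            self._conn.execute("COMMIT")
            return allowed
        except Exception:
            if self._conn.in_transaction:
                self._conn.execute("ROLLBACK")
            raise


db_path = os.path.join(tempfile.mkdtemp(), "spend.db")
worker_a = PersistentSpendCounter(db_path, cap_cents=10_000)
worker_b = PersistentSpendCounter(db_path, cap_cents=10_000)
print(worker_a.try_spend(6_000))   # True
print(worker_b.try_spend(5_000))   # False: 6000 + 5000 exceeds the 10000 cap

# A "restarted" worker opens the same file and sees the earlier spend.
restarted = PersistentSpendCounter(db_path, cap_cents=10_000)
print(restarted.try_spend(4_000))  # True: exactly reaches the cap
```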
3. Revoke latency
If you detect a runaway agent and need to stop it, there's no clean way to revoke access from an in-process enforcer without killing the process — which may leave inflight calls mid-execution. A proxy layer that sits between your agent and Stripe can reject calls the moment a revoke flag is set, with sub-second latency, without touching the agent process at all.
Combining agent-side and proxy-layer governance
The two layers complement each other. Agent-side enforcement catches policy violations before the request even leaves the process — including calls to endpoints that the proxy isn't watching. Proxy enforcement handles the multi-instance case, provides the authoritative spend counter, and gives you sub-second revoke.
The architecture looks like this in practice:
```python
# Instead of calling Stripe directly:
# stripe.api_key = "rk_live_..."

# Point your Python agent at the proxy, using a vault key:
import os

import stripe

stripe.api_key = os.environ["VAULT_KEY"]        # vault_key_xxx
stripe.api_base = "https://proxy.keybrake.com"  # proxy endpoint

# All stripe-python calls now route through the proxy.
# The proxy looks up the real Stripe key, enforces the policy,
# forwards the call, parses the cost from the response,
# and logs the call to a shared audit table.
# Your SpendEnforcer runs in addition, catching violations
# before they reach the proxy's network round-trip.
```
The SpendEnforcer in your Python code is a fast, in-process guard that blocks obviously bad calls before they generate network traffic. The proxy is the authoritative enforcer — its spend counter is the ground truth, and its revoke path is the kill switch.
For the full picture of why a proxy layer is required for multi-agent production setups, see why your AI agent fleet needs a vendor API gateway. For the four budget alert patterns and when each fires, see budget alerts for AI agents ranked by how late they fire.
Frequently asked questions
What's the right granularity for a policy — per agent type or per agent run?
Both. Define policies per agent type (refund agent, billing agent, analytics agent) as your baseline. Issue per-run vault keys with a tighter daily_usd_cap and a short expires_at if the run has a known scope, for example a billing reconciliation run that should touch at most ten customers. The type-level policy is the outer bound for every run of that agent; the run-level policy is a tighter limit inside it for that specific execution.
Should the SpendEnforcer use a pessimistic or optimistic reservation model?
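Pessimistic. Reserve the estimated cost before the call, under the same lock as the cap check, and confirm the actual cost after. Never skip the pre-call reservation and rely only on post-call recording. If the agent issues two calls in the same millisecond (common with concurrent code), both will pass the pre-call check independently, and you'll overshoot the cap by the cost of the second call. Note that the minimal SpendEnforcer above only checks the projected total and holds nothing between the check and record_actual, so it has exactly this gap until you add a reserved-amount counter.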
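A pessimistic reservation can be sketched as follows. This is a standalone illustration with hypothetical names (ReservingCounter, CapExceeded, reserve, commit, release), not a drop-in replacement for the SpendEnforcer API: the estimate is added to a reserved total inside the same lock as the check, then either committed with the actual cost or released if the call fails.

```python
import threading
from decimal import Decimal


class CapExceeded(Exception):
    pass


class ReservingCounter:
    """Holds estimates as reservations so concurrent callers see each other."""

    def __init__(self, cap_usd: str):
        self._cap = Decimal(cap_usd)
        self._spent = Decimal("0")
        self._reserved = Decimal("0")
        self._lock = threading.Lock()

    def reserve(self, estimated_usd: str) -> Decimal:
        """Check the cap and hold the estimate in one critical section."""
        amount = Decimal(estimated_usd)
        with self._lock:
            if self._spent + self._reserved + amount > self._cap:
                raise CapExceeded(f"cap {self._cap} would be exceeded")
            self._reserved += amount
        return amount

    def commit(self, reserved: Decimal, actual_usd: str) -> None:
        """Call after a successful vendor call: swap the hold for the actual cost."""
        with self._lock:
            self._reserved -= reserved
            self._spent += Decimal(actual_usd)

    def release(self, reserved: Decimal) -> None:
        """Call after a failed vendor call: drop the hold, record nothing."""
        with self._lock:
            self._reserved -= reserved

    @property
    def spent(self) -> Decimal:
        with self._lock:
            return self._spent


counter = ReservingCounter("100.00")
hold = counter.reserve("60.00")      # first caller holds $60
try:
    counter.reserve("60.00")         # second caller is rejected while the hold exists
except CapExceeded:
    print("second call blocked")     # prints: second call blocked
counter.commit(hold, "60.00")
print(counter.spent)                 # 60.00
```

The cost of this model is that a caller which never commits or releases leaves the hold in place, so real code would pair reserve with a try/finally.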
How do I estimate the cost of a Stripe API call before making it?
For refunds and charges, the cost is in the request parameters (amount in cents, divided by 100 for USD). For Stripe fee operations, you'll need to parse the response — Stripe doesn't expose cost before the call. In those cases, use a conservative estimate (the maximum expected cost for the operation) for the pre-call reservation and adjust with the actual cost from the response using record_actual.
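For the simple case where the cost is the request amount, converting integer cents directly to Decimal avoids passing through a binary float. A minimal sketch, with the helper name cents_to_usd chosen for illustration:

```python
from decimal import Decimal


def cents_to_usd(amount_cents: int) -> Decimal:
    """Convert integer cents to an exact two-decimal USD amount."""
    return (Decimal(amount_cents) / 100).quantize(Decimal("0.01"))


print(cents_to_usd(1999))                             # 19.99
print(0.1 + 0.1 + 0.1 == 0.3)                         # False: float rounding
print(cents_to_usd(10) * 3 == Decimal("0.30"))        # True: exact arithmetic
```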
Is SQLite sufficient for a production audit log?
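For single-process agents, yes. SQLite in WAL mode handles concurrent reads and serialized writes without corruption, and the on-disk format is durable. For multi-process deployments, switch to PostgreSQL or a hosted SQLite solution with multi-writer support. The audit fields in this post carry over, but moving to PostgreSQL takes more than a new connection string: the sqlite3 connection becomes a PostgreSQL driver connection, INTEGER PRIMARY KEY AUTOINCREMENT becomes an identity column, and the query placeholder style may differ by driver.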
Can I use the same SpendEnforcer for multiple vendors (Stripe, Twilio, Resend)?
Instantiate one SpendEnforcer per VendorPolicy. If your agent calls both Stripe and Twilio, you need two enforcers with separate policies — they track different currencies, different endpoints, and have different cost structures. Don't combine them into a single multi-vendor counter; that makes it impossible to set per-vendor caps.
What should the agent do when PolicyViolation is raised?
Stop the tool call immediately. Log the violation with the same fields as a normal audit entry (including cost_usd=0 since no call was made), then surface the exception to the agent's error handler. Do not swallow PolicyViolation inside the tool itself. If the tool catches it in order to log it, re-raise it so the agent's orchestration layer can decide whether to terminate the run or wait for the cap to reset. Swallowing the exception defeats the purpose of the governance layer.
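One way to reconcile "log the violation" with "don't catch it in the tool" is to log in a thin wrapper and re-raise. The sketch below assumes a hypothetical run_tool helper in the orchestration layer and redefines PolicyViolation locally only so the example is self-contained; the in-memory violations list stands in for a real audit row.

```python
import logging

logger = logging.getLogger("agent.governance")


class PolicyViolation(Exception):
    """Local stand-in for the PolicyViolation defined earlier in this post."""


violations = []  # stand-in for writing an audit row with cost_usd=0


def run_tool(tool, *args, **kwargs):
    """Run a tool; log any policy violation, then let it propagate."""
    try:
        return tool(*args, **kwargs)
    except PolicyViolation as exc:
        name = getattr(tool, "__name__", repr(tool))
        violations.append({"tool": name, "error": str(exc), "cost_usd": 0.0})
        logger.warning("policy violation in %s: %s", name, exc)
        raise  # never swallow: the orchestrator decides what happens next


def blocked_refund():
    raise PolicyViolation("Spend cap exceeded")


try:
    run_tool(blocked_refund)
except PolicyViolation:
    print("run terminated")  # prints: run terminated
```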
Agent-side governance + a proxy enforcement layer
The Pydantic policy model and SpendEnforcer above are the right starting point for single-process agents. Keybrake provides the proxy layer that makes them production-grade: a shared spend counter across all agent instances, a sub-second kill switch, and a per-call audit log with parsed cost — without changing your Python agent code beyond two environment variables.