AI agent audit trail: what belongs in one, with the minimum schema

An AI agent audit trail isn't an API log with extra fields. It answers four questions an HTTP access log can't, and on a bad day those four questions are the only ones that matter. Here's the minimum viable shape, the full sixteen-column reference, where the cost data comes from per vendor, and the queries that decide whether the audit was worth keeping.

TL;DR

An audit trail for an AI agent has to record four things per call: an agent_run_id joining the row to the rest of the run, a policy_verdict recording what the proxy or wrapper decided, a cost_usd_parsed number lifted out of the vendor's response, and a customer_scope_id recording which customer or merchant the call touched. Those four columns are the minimum that lets you answer "did our governance work?" Add cap_usage_after_usd, vendor_request_id, the verdict reason string, and the rest of the enrichment columns, and you have the full sixteen-column reference shape; our long-form schema post gives the CREATE TABLE for it. An HTTP access log alone gets none of the four, which is why "we already log every request" is the most common reason a 2am incident has no answer.

What an AI agent audit trail actually is

When a credit-card company audits a transaction, they don't ask the network: they ask their own ledger. The wire says the message reached the merchant; the ledger says whether the charge was authorised, what scope it sat inside, who was billed, and what it cost. An AI agent audit trail is the same idea applied to the calls an autonomous agent makes against SaaS APIs — Stripe, Twilio, Resend, Shopify, Postmark, anything that moves money or sends a real-world artefact.

The job of the audit trail isn't to show the wire was healthy. It's to prove — months after the fact, on a regulator's clock or a postmortem call — that the agent stayed inside the policy you wrote, the calls cost what you expected, and the runs you can name in English have rows you can name in SQL. An access log fails all three.

The four questions an HTTP access log can't answer

If your "audit" is the JSON Caddy or NGINX writes for every request, here's what you're missing. The framing is from the schema post; the short version below is what you can copy-paste into a design review.

  1. Did our policy decide correctly on this call? The wire shows a 200 or a 403. The audit has to show whether the policy you wrote — daily $50 cap, customer allowlist, endpoint allowlist — was the reason. A cap-hit denial is a policy success that the HTTP record makes look like a 403 bug; you'll never close out the postmortem without a separate policy_verdict column.
  2. Did the call stay inside its declared scope? Your support agent was supposed to refund only customer cus_X. The call refunded cus_Y. The wire says 200. The audit has to record the scope the call was supposed to honour and the scope it actually touched, so you can write WHERE customer_scope_id != intended_customer_scope_id and find every violation in the last 90 days in one query.
  3. What did the call cost in dollars? Not bytes, not tokens: real money, in the currency your bank account is denominated in, parsed from the vendor response. Stripe puts it in the amount field of the charge object, as an integer in the smallest currency unit (cents for USD), so the parser has to divide by 100. Twilio puts it in price on the message resource. Resend doesn't expose it at all and you compute it from your tier. Without a cost_usd_parsed column the cost-by-vendor query is a manual export job; with it, it's a SUM(cost_usd_parsed) GROUP BY vendor.
  4. How does this call group with the rest of the run? One row never tells you about a stuck loop. The pattern only resolves when you group on the agent run and order by time. That requires an agent_run_id set by the agent — usually as a header named x-agent-run-id — and written into every layer of the stack so the join works across LLM proxy, SaaS proxy, and your own application logs.
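
To make the first question concrete, here is a minimal sketch of a policy check that returns a verdict string instead of a bare status code. The verdict names are the ones this article uses for policy_verdict; the Policy class, the decide function, and the check order (endpoint, then scope, then cap) are assumptions made for illustration, not Keybrake's engine. The queued_approval verdict is left out to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Hypothetical policy shape: a daily dollar cap plus two allowlists."""
    daily_cap_usd: float
    allowed_customers: set[str] = field(default_factory=set)
    allowed_endpoints: set[str] = field(default_factory=set)


def decide(policy: Policy, endpoint: str, customer_id: str,
           spent_today_usd: float, estimated_cost_usd: float) -> str:
    """Return the verdict to store in policy_verdict for one outbound call."""
    # Endpoint first: a call to a forbidden endpoint is denied regardless of cost.
    if endpoint not in policy.allowed_endpoints:
        return "deny_endpoint"
    # Scope next: the call must target a customer the agent is allowed to touch.
    if customer_id not in policy.allowed_customers:
        return "deny_scope"
    # Cap last: deny if this call would push today's spend over the cap.
    if spent_today_usd + estimated_cost_usd > policy.daily_cap_usd:
        return "deny_cap"
    return "allow"
```

With a verdict stored per row, a cap-hit shows up in the audit as deny_cap rather than as an unexplained 403.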

Minimum viable schema (four columns)

If you only ship four columns, ship these. Each one answers one of the four questions above. Everything else is enrichment.

| Column | Type | What it answers | Source |
| --- | --- | --- | --- |
| agent_run_id | TEXT | Which agent run was this call part of? | Header set by agent (x-agent-run-id) or generated by the proxy on first call |
| policy_verdict | TEXT | Did our governance allow, deny, or queue this? | Output of the proxy's policy engine: one of allow, deny_cap, deny_scope, deny_endpoint, queued_approval |
| cost_usd_parsed | REAL | What did this call cost in real money? | Parsed from the vendor response (Stripe amount, Twilio price) or computed from your tier (Resend) |
| customer_scope_id | TEXT | Which downstream customer did the call touch? | Lifted from the request body or vendor response: customer, to, recipient depending on vendor |

Add an autoincrement id and a ts timestamp and you have a working schema in five minutes. The first time someone asks "did the agent ever charge a customer outside the allowlist," you'll write the answer in one query.
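
As a rough sketch of that five-minute schema, the snippet below creates the four columns plus id and ts in SQLite and answers the allowlist question in one query. The table name agent_call_audit is the one this article uses for Keybrake's table; the constraints, the ts format, and the helper names (open_audit_db, record_call, calls_outside_allowlist) are assumptions for illustration.

```python
import sqlite3

# Four MVP columns plus id and ts. Constraints and defaults are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS agent_call_audit (
    id                INTEGER PRIMARY KEY AUTOINCREMENT,
    ts                TEXT NOT NULL DEFAULT (datetime('now')),  -- UTC, 'YYYY-MM-DD HH:MM:SS'
    agent_run_id      TEXT NOT NULL,
    policy_verdict    TEXT NOT NULL,
    cost_usd_parsed   REAL,  -- NULL when no cost could be parsed (e.g. a denied call)
    customer_scope_id TEXT
);
"""


def open_audit_db(path=":memory:"):
    """Open (or create) the audit database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_call(conn, agent_run_id, policy_verdict, cost_usd_parsed, customer_scope_id):
    """Write one audit row and return its id."""
    cur = conn.execute(
        "INSERT INTO agent_call_audit "
        "(agent_run_id, policy_verdict, cost_usd_parsed, customer_scope_id) "
        "VALUES (?, ?, ?, ?)",
        (agent_run_id, policy_verdict, cost_usd_parsed, customer_scope_id),
    )
    conn.commit()
    return cur.lastrowid


def calls_outside_allowlist(conn, allowlist):
    """Return allowed calls whose customer is not in the allowlist."""
    allowed = sorted(set(allowlist))
    if not allowed:
        raise ValueError("allowlist must contain at least one customer id")
    placeholders = ",".join("?" for _ in allowed)
    sql = (
        "SELECT id, ts, agent_run_id, cost_usd_parsed, customer_scope_id "
        "FROM agent_call_audit "
        "WHERE policy_verdict = 'allow' "
        f"AND customer_scope_id NOT IN ({placeholders}) "
        "ORDER BY ts, id"
    )
    return conn.execute(sql, allowed).fetchall()
```

Calling calls_outside_allowlist(conn, ["cus_X"]) returns every allowed call that touched someone other than cus_X, which is the one-query answer the paragraph above describes.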

The full reference schema (sixteen columns)

The minimum gets you out of the woods. The full reference earns its keep on the third bad day, when the cap-hit happened but you can't tell which call tripped it, or the parsed cost diverged from the invoice and you can't tell whether to trust the parser or the invoice. The expanded schema and its six indexes (two of them partial) live in the schema post. The additions named in this article, on top of the MVP four, are id, ts, vendor, vendor_endpoint, latency_ms, vault_key_id, vendor_request_id, cap_usage_after_usd, and the verdict reason string; the schema post lists the rest.

Sixteen columns sounds like a lot. Each one is the answer to a query you will run at least once. The schema post walks through five of those queries — top-10 spend spike, cap-hit in the last 24 hours, run reconstruction, slow-vendor p95, 90-day customer action history — with full SQL.

Where the cost data comes from, per vendor

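Parsed cost is the hardest column to ship. Each vendor exposes it differently, and one of them doesn't expose it at all: Stripe reports it in the amount field of the charge object, Twilio in price on the message resource, and Resend not at all, so you compute it from your tier.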
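
One way the per-vendor parsing could look is sketched below. The field names amount and currency (Stripe charge) and price and price_unit (Twilio message) are the vendors' own; the function names, the USD-only handling, and the Resend per-email rate parameter are assumptions for illustration. Twilio's price can be null until the message is billed and may be reported as a signed value, so this sketch returns None for a missing price and takes the absolute value.

```python
from decimal import Decimal
from typing import Optional


def stripe_cost_usd(charge: dict) -> Optional[float]:
    """Stripe reports `amount` as an integer in the smallest currency unit."""
    if str(charge.get("currency") or "").lower() != "usd":
        return None  # currency conversion is out of scope for this sketch
    return float(Decimal(charge["amount"]) / 100)  # cents -> dollars


def twilio_cost_usd(message: dict) -> Optional[float]:
    """Twilio reports `price` as a string alongside `price_unit`."""
    price = message.get("price")
    unit = str(message.get("price_unit") or "").upper()
    if price is None or unit != "USD":
        return None  # not billed yet, or not denominated in USD
    return float(abs(Decimal(price)))  # abs() in case the price is signed


def resend_cost_usd(emails_sent: int, usd_per_email: float) -> float:
    """Resend exposes no price, so compute it from your own tier.

    `usd_per_email` is a hypothetical rate you derive from your plan.
    """
    return emails_sent * usd_per_email
```

Returning None rather than 0.0 when the cost is unknown keeps SUM(cost_usd_parsed) honest: SQL aggregates skip NULLs, and a count of NULL rows tells you how much spend is still unparsed.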

The queries that justify the audit

An audit table no one queries is just storage. Here are five questions you should be able to answer in one SQL statement each. The schema post has the full text of each; below is the shape and the column they pivot on.

  1. Top-10 spend by run in the last 7 days. Pivots on agent_run_id, sums cost_usd_parsed. Identifies the runs you should care about before they go viral.
  2. Every cap-hit in the last 24 hours. Pivots on policy_verdict LIKE 'deny_cap%'. Tells you which agents tripped guards yesterday — a leading indicator of a stuck-loop incident.
  3. Reconstruct an agent run from a single trace. Pivots on agent_run_id = ?, ordered by ts. The 2am incident query.
  4. Slow-vendor p95 latency by endpoint. Pivots on vendor, vendor_endpoint, percentile latency_ms. Catches vendor-side regressions that look like agent regressions.
  5. Customer action history over 90 days. Pivots on customer_scope_id = ?, ordered by ts. The compliance query — a regulator asks what your AI did to a specific user; you have one row per call to point at.
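
Four of the five queries above need only the MVP columns, so they can be sketched against a small SQLite table. This is an illustration, not the SQL from the schema post: it assumes ts is stored as UTC text in SQLite's datetime() format, and make_demo_db with its sample rows is hypothetical. The p95 latency query is omitted because it depends on the vendor, vendor_endpoint, and latency_ms columns of the full schema.

```python
import sqlite3

TOP_SPEND_BY_RUN = """
SELECT agent_run_id, SUM(cost_usd_parsed) AS spend_usd
FROM agent_call_audit
WHERE ts >= datetime('now', '-7 days')
GROUP BY agent_run_id
ORDER BY spend_usd DESC
LIMIT 10
"""

CAP_HITS_LAST_DAY = """
SELECT id, ts, agent_run_id, policy_verdict
FROM agent_call_audit
WHERE policy_verdict LIKE 'deny_cap%'
  AND ts >= datetime('now', '-1 day')
ORDER BY ts, id
"""

RECONSTRUCT_RUN = """
SELECT id, ts, policy_verdict, cost_usd_parsed, customer_scope_id
FROM agent_call_audit
WHERE agent_run_id = ?
ORDER BY ts, id
"""

CUSTOMER_HISTORY_90D = """
SELECT id, ts, agent_run_id, policy_verdict, cost_usd_parsed
FROM agent_call_audit
WHERE customer_scope_id = ?
  AND ts >= datetime('now', '-90 days')
ORDER BY ts, id
"""


def make_demo_db():
    """Build an in-memory table with a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE agent_call_audit ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, ts TEXT NOT NULL, "
        "agent_run_id TEXT NOT NULL, policy_verdict TEXT NOT NULL, "
        "cost_usd_parsed REAL, customer_scope_id TEXT)"
    )
    rows = [
        ("-3 hours", "run_a", "allow", 20.0, "cus_X"),
        ("-2 hours", "run_a", "allow", 35.0, "cus_X"),
        ("-1 hours", "run_a", "deny_cap", None, "cus_X"),
        ("-2 days", "run_b", "allow", 5.0, "cus_Y"),
        ("-30 days", "run_c", "allow", 99.0, "cus_Y"),
    ]
    for offset, run_id, verdict, cost, customer in rows:
        # The offset is bound as a datetime() modifier relative to now.
        conn.execute(
            "INSERT INTO agent_call_audit "
            "(ts, agent_run_id, policy_verdict, cost_usd_parsed, customer_scope_id) "
            "VALUES (datetime('now', ?), ?, ?, ?, ?)",
            (offset, run_id, verdict, cost, customer),
        )
    conn.commit()
    return conn
```

For example, make_demo_db().execute(RECONSTRUCT_RUN, ("run_a",)).fetchall() returns the run's calls in time order, ending with the cap-hit denial.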

Three implementation paths

You can land an agent audit trail in three places, each with a different cost-of-ownership shape.

1. SDK wrapper. You wrap every Stripe, Twilio, and Resend SDK call in your application code with a logger that writes the audit row. It's the cheapest way to start and needs zero infrastructure, but it breaks the moment a third-party SDK or MCP server bypasses the wrapper. The cost-parsing logic lives in your application repository, where it'll bit-rot the first time a vendor changes a response field.

2. Sidecar / outbound proxy. You point your agent at a sidecar (Envoy, Caddy, custom Node) that intercepts outbound HTTP, parses the response, and writes the audit row. Survives third-party SDKs because it works at the network layer. Adds an operational component you have to keep alive. Cost-parsing lives in the proxy, which is the right place for it.

3. Hosted SaaS-tool governance proxy. What Keybrake is. You use a vault key against proxy.keybrake.com, we enforce the policy, parse the cost, write the audit row, and you query the audit through our dashboard or a SQL export. Same shape as the sidecar pattern, without the sidecar to operate. Best for teams that don't want their on-call rotation paged because the audit pipeline blocked a Stripe charge.

Three mistakes that ruin the audit

Logging every header verbatim. Vendor request headers include the API key. If you write that into the audit table, you've turned the audit into a credential-leak surface. Hash or strip the auth header at write time. The audit gets the vault_key_id identifier, not the raw secret.
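
A minimal sketch of stripping credentials at write time follows. The header list and the redact_headers helper are assumptions for illustration; extend the list to whatever auth headers your vendors use. The sketch replaces each secret with a short SHA-256 fingerprint so two rows written with the same key can still be correlated without storing the key.

```python
import hashlib

# Header names (lowercase) that may carry credentials. Assumed list, not exhaustive.
SENSITIVE_HEADERS = {"authorization", "proxy-authorization", "x-api-key", "cookie"}


def redact_headers(headers: dict) -> dict:
    """Return a copy of the headers that is safe to write to the audit table."""
    safe = {}
    for name, value in headers.items():
        if name.lower() in SENSITIVE_HEADERS:
            # Keep only a short fingerprint of the secret, never the secret itself.
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]
            safe[name] = f"[redacted sha256:{digest}]"
        else:
            safe[name] = value
    return safe
```

If you don't need the correlation, dropping the header entirely is simpler and leaves nothing to reason about.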

Not setting agent_run_id at the agent. If the proxy generates the run ID on the first call it sees, you can't join the audit to your own application telemetry — your code never had the same ID. The agent has to set x-agent-run-id on outbound calls and the proxy has to honour it. The MCP API key auth page covers how this looks for MCP servers specifically.
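
On the agent side this can be as small as the sketch below: mint one ID at the start of the run, attach it to every outbound call, and log the same value in your own telemetry. The header name x-agent-run-id comes from this article; the run_ prefix and the helper names are assumptions.

```python
import uuid

RUN_ID_HEADER = "x-agent-run-id"


def new_run_id() -> str:
    """Mint one ID per agent run. The 'run_' prefix is an arbitrary choice."""
    return f"run_{uuid.uuid4().hex}"


def outbound_headers(run_id: str, extra=None) -> dict:
    """Build headers for an outbound call, always carrying the run ID."""
    headers = dict(extra or {})
    headers[RUN_ID_HEADER] = run_id
    return headers
```

Create the ID once per run and pass it down, rather than calling new_run_id() per request; the join only works if every call in the run carries the same value.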

Treating the audit as append-only with no retention policy. Audit rows include customer identifiers. After a customer deletion request you have to delete or anonymise their rows. Plan for it on day one — pick a retention horizon (90 days for the Hobby plan, longer for Team and Scale) and a delete-by-customer query path so deletion is a SQL statement, not a database migration.
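
Both paths can be plain SQL statements, as in this sketch. It assumes the MVP table shape in SQLite with ts stored in datetime() format; the function names are hypothetical, and nulling customer_scope_id is only one way to anonymise (it keeps the cost rows for totals while dropping the identifier).

```python
import sqlite3

# Assumed MVP table shape, repeated here so the sketch is self-contained.
AUDIT_DDL = (
    "CREATE TABLE IF NOT EXISTS agent_call_audit ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, ts TEXT NOT NULL, "
    "agent_run_id TEXT NOT NULL, policy_verdict TEXT NOT NULL, "
    "cost_usd_parsed REAL, customer_scope_id TEXT)"
)


def purge_expired(conn, retention_days=90) -> int:
    """Delete rows older than the retention horizon; return how many were removed."""
    cur = conn.execute(
        "DELETE FROM agent_call_audit WHERE ts < datetime('now', ?)",
        (f"-{int(retention_days)} days",),
    )
    conn.commit()
    return cur.rowcount


def anonymise_customer(conn, customer_scope_id: str) -> int:
    """Blank the identifier on one customer's rows; return how many were changed."""
    cur = conn.execute(
        "UPDATE agent_call_audit SET customer_scope_id = NULL "
        "WHERE customer_scope_id = ?",
        (customer_scope_id,),
    )
    conn.commit()
    return cur.rowcount
```

If your obligations require full deletion rather than anonymisation, swap the UPDATE for a DELETE with the same WHERE clause.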

How Keybrake produces this audit trail

Keybrake is a governance proxy for the non-LLM SaaS APIs your agent hits. You issue a vault key, attach a policy, and the agent calls proxy.keybrake.com/<vendor> as if it were the real endpoint. Every call lands one row in our agent_call_audit table, which uses the sixteen-column reference schema and is indexed on agent_run_id, vault_key_id, and customer_scope_id, with a partial index on cap-hit verdicts so the "every cap-hit yesterday" query stays fast at scale. You can read the audit through the dashboard, the API, or a daily SQL export. We maintain the cost-parsing logic for Stripe, Twilio, and Resend, not your team: when Stripe renames a field, we update the parser and your audit keeps working.

If you want a kill switch on top of the audit, see our piece on the four kill-switch patterns — Keybrake combines pattern 2 (credential revoke) with pattern 3 (proxy-enforced policy flag) so the audit table has the row that tripped the guard and the next call is rejected sub-second.

Related questions

Is an audit trail the same as a kill switch?

No, and that's the most common confusion. A kill switch stops the agent — it's about the present moment. An audit trail records what the agent did — it's about reconstructing the past. You need both: the kill switch contains the bleeding, the audit tells you what the bleeding cost. The kill-switch patterns page covers the four real options for stopping a running agent; the audit trail is what you query after you've pulled the lever.

Can I use my LLM observability tool's logs as my audit trail?

Tools like Helicone and Langfuse log LLM traffic — model, prompt tokens, completion tokens, latency. They don't see the SaaS-tool calls your agent makes downstream of those LLM calls. If your agent issues a refund based on a model output, the LLM observability tool has the prompt and completion; it doesn't have the row that says the refund was for $43.50 to cus_R12. You need both layers, joined on agent_run_id. The agent governance stack post goes through the four-layer composition.

How long should I retain audit rows?

Two competing pressures. Operationally, 90 days is enough for almost every postmortem — incidents that took longer than 90 days to surface are rare, and SQL on a 90-day table stays fast. Compliance-wise, financial regulators frequently want longer retention (12 months, sometimes 7 years for certain transaction types). The pragmatic shape is: 90-day operational retention in the hot table, longer cold-storage retention as a daily Parquet export to S3 with a customer-scoped delete path for deletion requests.

Why parse cost from the vendor response instead of using token counts or request counts?

Token counts answer "how much LLM did we use," which is a different question. Request counts answer "how busy was the agent," which is also different. Neither tells you "how much real money moved." If your audit is the basis for charging customers (a usage-billed agent product), or for triggering a hard-stop (a cap-based policy), it has to be in dollars, and where the vendor reports a price, its own response is the only authoritative source. Vendor pricing changes and tier tables drift; only the response is dollar-accurate at the moment of the call. For a vendor like Resend that reports no price, the tier-computed figure is an estimate to reconcile against the invoice.

What's the smallest team that should care about this?

Anyone whose agent has touched production. The first cap-hit you can't explain is the moment the audit pays for itself, and that happens at companies of every size. The four-column MVP is small enough that a solo founder can ship it in a Sunday afternoon, and we strongly recommend doing so before pointing an agent at a money-moving API. If you don't want to ship it yourself, that's our pitch.

Further reading