Reference · 11 min read

The anatomy of an AI agent audit trail: an opinionated schema

An audit trail for an AI agent is not a request log. A request log answers "what happened on the wire." An audit trail answers "did our policy work, did the agent stay in scope, and what did each step cost?" Those are different questions, they need different columns, and the difference becomes operational the first time you have to reconstruct an incident on a Tuesday morning. This is the table we run, the indexes that earn their place, and five queries that get used more than you'd expect.

Why agent audit logs aren't API logs

The default move, when someone says "we should audit our agent's API calls," is to point an HTTP access log at the proxy and call it done. That gets you method, path, status, timestamp, and bytes. None of those tell you what you actually need on a bad day.

An API log answers "is the wire healthy?" An agent audit answers four different questions, and they are the questions that matter when there is real money on the line:

  1. Did our policy decide correctly on this call? Not "did the call go through" — every cap-hit denial is a policy success that an HTTP 403 row makes look like a bug. You need a column for the policy verdict that is independent of the network outcome.
  2. Did the call stay inside its declared scope? The agent was supposed to refund only customer cus_X. The call was a refund of cus_Y. An HTTP log shows a 200; the audit must show a scope violation that ran to completion before the policy caught up — that's the row you need to email the on-call engineer.
  3. What did this call actually cost? Not in tokens, not in bytes — in dollars, parsed from the vendor response, in the currency the bank account is denominated in. Stripe puts it in amount; Twilio puts it in price; Resend doesn't put it anywhere and you have to look up your tier. Each of these is a parser the log writer has to own.
  4. How did this call relate to all the other calls in the same agent run? A stuck refund loop is invisible in any single row. It only resolves into a pattern when you can group on the agent run and order by elapsed. That requires a join key set by the agent, written into every layer of the stack — see our piece on the 2026 agent governance stack for the full x-agent-run-id argument.

You can answer none of those four with the default access log. So you build a side table. The shape of that table is the rest of this post.
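To make the parser ownership in point three concrete, here is a minimal Python sketch. The function name and the Resend flat-rate constant are hypothetical; the Stripe amount-in-cents and Twilio signed-price-string conventions come from those vendors' public APIs.

```python
# Hypothetical per-vendor cost parser owned by the audit log writer.
# Stripe returns amounts in cents in `amount`; Twilio returns a signed
# price string in `price`; Resend returns no cost at all, so the rate
# comes from a config table (the constant below is a made-up stand-in).

RESEND_FLAT_RATE_USD = 0.0004  # assumption: your tier's per-email rate

def parse_cost_usd(vendor: str, body: dict) -> tuple[float, str]:
    """Return (cost_usd, cost_source) for one vendor response body."""
    if vendor == "stripe" and "amount" in body:
        return body["amount"] / 100.0, "amount"    # cents -> dollars
    if vendor == "twilio" and body.get("price") is not None:
        return abs(float(body["price"])), "price"  # Twilio prices are negative strings
    if vendor == "resend":
        return RESEND_FLAT_RATE_USD, "fixed_rate"  # not in the response at all
    return 0.0, "n/a"                              # unknown shape: record it, don't guess
```

Returning the source alongside the number is what makes a cost_source column possible, and that column is what turns a future parser bug into a targeted backfill instead of per-row archaeology.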

The five questions the table has to answer

Before columns, the queries. We design the schema backwards from the questions we know we want to ask, so every column and every index earns its place by answering one of them. The five we run weekly:

  1. "What's our top-10 spend spike this week, by agent run?" — billing reconciliation, anomaly detection.
  2. "Which calls hit a cap, when, and which agent run did the cap belong to?" — proves the policy is doing its job; produces the screenshot for the next compliance review.
  3. "Walk me through agent run r-xyz from first call to last." — incident reconstruction. Has to return rows in order with cost, status, and policy verdict per call.
  4. "Which vendor responses ran longer than 2 seconds in the last hour?" — vendor health monitoring; warns you before the vendor's status page does.
  5. "For customer cus_27b3ac, every action our agent took in the last 90 days." — the question a disputing customer or an auditor will eventually ask. If you can't answer this on demand, you don't have an audit trail; you have a blob of rows.

Each of those queries dictates either a column or an index that the schema below has. None of them are speculative — every team running money-moving agent traffic ends up running a variant of these on a regular cadence.

The sixteen columns that earn their keep

Here is the column list, with the question each one answers and a note on whether it carries an index. cost_usd_parsed is the column that distinguishes a real audit table from a fancy access log; policy_verdict is the column that distinguishes governance from observation.

| Column | Type | What it answers | Indexed? |
|---|---|---|---|
| id | uuid | Primary key for joins to richer downstream rows | PK |
| agent_run_id | uuid | Cross-layer join key (set by the agent, propagated as x-agent-run-id) | Yes |
| vault_key_id | text | Which scoped key the agent used; a single agent run may rotate keys | Yes |
| vendor | text | stripe / twilio / resend / shopify | Yes |
| endpoint_path | text | Normalized path: /v1/charges, not /v1/charges?expand=... | No |
| http_method | text | POST / GET / PUT / DELETE | No |
| policy_verdict | text | allow / cap_exceeded / scope_violation / endpoint_denied / key_expired | Yes |
| vendor_status | int | HTTP status returned by the vendor (or 0 if the call never went out) | No |
| cost_usd_parsed | numeric(12,4) | Cost in USD parsed from the response body or the rate sheet | No |
| cost_source | text | Which field the cost came from: amount, price, fixed_rate, n/a | No |
| cap_usage_after_usd | numeric(12,4) | Running total against the per-vendor daily cap after this call landed | No |
| vendor_request_id | text | Stripe Request-Id, Twilio Sid, Resend message ID — for vendor-side joins | Yes |
| customer_scope_id | text, nullable | Multi-tenant attribution: which end-customer this action was on behalf of | Yes |
| started_at | timestamptz | When the proxy received the request | Yes |
| elapsed_ms | int | Wall time including policy check + vendor call + response parse | No |
| error_class | text, nullable | null / network / vendor_4xx / vendor_5xx / parse_failed / cap_hit / scope | No |

Sixteen columns is more than you need for an MVP and fewer than you'll regret in production. Two columns deserve their own paragraph because they're often missing from first-cut schemas.

cost_source exists because the moment you parse from multiple places ("Stripe puts it in amount, Twilio in price, Resend gets a flat rate from a config table"), the parsing logic becomes the part most likely to break silently. A column that records where the number came from lets you backfill or correct an entire vendor's cost column when a parser bug ships, instead of guessing per-row. We've corrected ours twice in nine months; both times the audit was three hours of work, not three days.
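A toy SQLite sketch of that backfill, assuming a hypothetical parser bug that kept the sign on Twilio's negative price strings. Values are synthetic; the point is that every damaged row is addressable by (vendor, cost_source) in one statement.

```python
import sqlite3

# Toy backfill enabled by cost_source: a hypothetical bug stored Twilio
# prices with the vendor's sign intact, and the fix is a single UPDATE
# scoped by where the number came from, not per-row guesswork.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE agent_call_audit "
           "(vendor TEXT, cost_source TEXT, cost_usd_parsed REAL)")
db.executemany("INSERT INTO agent_call_audit VALUES (?, ?, ?)", [
    ("twilio", "price",  -0.0079),  # bug: sign carried over from the vendor string
    ("twilio", "price",  -0.0079),
    ("stripe", "amount", 12.50),    # different parser, different source: untouched
])
db.execute("""UPDATE agent_call_audit
              SET   cost_usd_parsed = ABS(cost_usd_parsed)
              WHERE vendor = 'twilio'
                AND cost_source = 'price'
                AND cost_usd_parsed < 0""")
total = db.execute("SELECT SUM(cost_usd_parsed) "
                   "FROM agent_call_audit").fetchone()[0]
```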

cap_usage_after_usd exists because reconstructing "what was the cap utilisation at moment T" by summing all rows up to T is slow at best, and wrong outright when writes land out of order. Snapshotting the running total at write time makes "show me the run that drove us over the cap" a single indexed lookup instead of a window function.
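A toy SQLite illustration of the trade, with a single-threaded writer maintaining the snapshot (a real writer would key the running total per vendor per day). Table and column names match the schema in this post; the rows are synthetic.

```python
import sqlite3

# Write-time snapshot: the single writer holds the running total and
# stamps it into every row it inserts, so readers never reconstruct it.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE agent_call_audit "
           "(agent_run_id TEXT, vendor TEXT, "
           " cost_usd_parsed REAL, cap_usage_after_usd REAL)")

DAILY_CAP_USD = 50.0
running = 0.0
for run_id, cost in [("r-1", 20.0), ("r-1", 25.0), ("r-2", 10.0)]:
    running += cost  # snapshot maintained at write time, not by readers
    db.execute("INSERT INTO agent_call_audit VALUES (?, ?, ?, ?)",
               (run_id, "stripe", cost, running))

# "Which run drove us over the cap?" becomes a plain indexed lookup:
# no window function, no SUM over the full history at read time.
over = db.execute("""SELECT agent_run_id FROM agent_call_audit
                     WHERE cap_usage_after_usd > ?
                     ORDER BY cap_usage_after_usd LIMIT 1""",
                  (DAILY_CAP_USD,)).fetchone()
```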

The schema, with indexes

This is the actual CREATE TABLE we run, lightly rephrased to be vendor-portable (the version we ship to customers is Postgres-flavoured; SQLite drops the NUMERIC precision and works otherwise unchanged). Six indexes, all of them earned by one of the queries below.

CREATE TABLE agent_call_audit (
    id                   UUID         PRIMARY KEY,
    agent_run_id         UUID         NOT NULL,
    vault_key_id         TEXT         NOT NULL,
    vendor               TEXT         NOT NULL,
    endpoint_path        TEXT         NOT NULL,
    http_method          TEXT         NOT NULL,
    policy_verdict       TEXT         NOT NULL,
    vendor_status        INTEGER      NOT NULL DEFAULT 0,
    cost_usd_parsed      NUMERIC(12,4) NOT NULL DEFAULT 0,
    cost_source          TEXT         NOT NULL DEFAULT 'n/a',
    cap_usage_after_usd  NUMERIC(12,4) NOT NULL DEFAULT 0,
    vendor_request_id    TEXT,
    customer_scope_id    TEXT,
    started_at           TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    elapsed_ms           INTEGER      NOT NULL DEFAULT 0,
    error_class          TEXT
);

CREATE INDEX idx_audit_run         ON agent_call_audit (agent_run_id, started_at);
CREATE INDEX idx_audit_vault       ON agent_call_audit (vault_key_id, started_at);
CREATE INDEX idx_audit_vendor_time ON agent_call_audit (vendor, started_at DESC);
CREATE INDEX idx_audit_verdict     ON agent_call_audit (policy_verdict, started_at) WHERE policy_verdict != 'allow';
CREATE INDEX idx_audit_customer    ON agent_call_audit (customer_scope_id, started_at) WHERE customer_scope_id IS NOT NULL;
CREATE INDEX idx_audit_request     ON agent_call_audit (vendor, vendor_request_id);

Two notes on the indexes. The verdict index is partial — most rows are allow, and we only ever query the index when looking for non-allow rows, so excluding the majority shrinks it by an order of magnitude in practice. The customer-scope index is partial for the same reason: a single-tenant deployment will leave that column null for every row, and a partial index costs nothing in that case.
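As a runnable sanity check of the SQLite port described above (NUMERIC(12,4) relaxed to NUMERIC, TIMESTAMPTZ to TEXT, the partial verdict index kept as-is, since SQLite supports partial indexes), with one synthetic cap-hit row showing the vendor_status = 0 convention:

```python
import sqlite3
import uuid

# The Postgres DDL ported to SQLite: looser types, same columns,
# same partial index on non-allow verdicts.
db = sqlite3.connect(":memory:")
db.execute("""
CREATE TABLE agent_call_audit (
    id                   TEXT PRIMARY KEY,
    agent_run_id         TEXT NOT NULL,
    vault_key_id         TEXT NOT NULL,
    vendor               TEXT NOT NULL,
    endpoint_path        TEXT NOT NULL,
    http_method          TEXT NOT NULL,
    policy_verdict       TEXT NOT NULL,
    vendor_status        INTEGER NOT NULL DEFAULT 0,
    cost_usd_parsed      NUMERIC NOT NULL DEFAULT 0,
    cost_source          TEXT NOT NULL DEFAULT 'n/a',
    cap_usage_after_usd  NUMERIC NOT NULL DEFAULT 0,
    vendor_request_id    TEXT,
    customer_scope_id    TEXT,
    started_at           TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    elapsed_ms           INTEGER NOT NULL DEFAULT 0,
    error_class          TEXT
)""")
db.execute("""CREATE INDEX idx_audit_verdict
              ON agent_call_audit (policy_verdict, started_at)
              WHERE policy_verdict != 'allow'""")

# A synthetic cap-hit denial: the call never left the proxy, so
# vendor_status stays at its default 0 and no cost is recorded.
db.execute("""INSERT INTO agent_call_audit
              (id, agent_run_id, vault_key_id, vendor, endpoint_path,
               http_method, policy_verdict, error_class, cap_usage_after_usd)
              VALUES (?, 'r-demo', 'vk-1', 'stripe', '/v1/refunds',
                      'POST', 'cap_exceeded', 'cap_hit', 50.0)""",
           (str(uuid.uuid4()),))

verdict, status = db.execute("""SELECT policy_verdict, vendor_status
                                FROM agent_call_audit
                                WHERE policy_verdict != 'allow'""").fetchone()
```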

Five queries you'll run more than you expect

Every query below has been used in real diagnostics. Each maps back to one of the five questions the schema is designed to answer.

1. Top-10 spend spikes this week

SELECT agent_run_id,
       vendor,
       SUM(cost_usd_parsed)        AS spent,
       COUNT(*)                    AS calls,
       MIN(started_at)             AS started,
       MAX(started_at)             AS finished
FROM   agent_call_audit
WHERE  started_at > NOW() - INTERVAL '7 days'
  AND  policy_verdict = 'allow'
GROUP  BY agent_run_id, vendor
ORDER  BY spent DESC
LIMIT  10;

Backed by idx_audit_vendor_time. Filters on policy_verdict = 'allow' deliberately — denied calls aren't spend, they're successful saves. We re-run this every Monday morning. Twice in nine months it has surfaced an agent run that quietly burned $500-1,200 over a long weekend.

2. Cap-hit calls in the last 24 hours

SELECT started_at,
       agent_run_id,
       vault_key_id,
       vendor,
       endpoint_path,
       cap_usage_after_usd
FROM   agent_call_audit
WHERE  policy_verdict = 'cap_exceeded'
  AND  started_at > NOW() - INTERVAL '1 day'
ORDER  BY started_at DESC;

Backed by idx_audit_verdict. The output is the screenshot you want to paste into a quarterly compliance review: every time the cap fired, who hit it, on what endpoint, with what running total. If your audit can't produce this on demand, the cap exists in code only — not in the record.

3. Reconstruct an agent run, in order

SELECT started_at,
       elapsed_ms,
       vendor,
       endpoint_path,
       policy_verdict,
       vendor_status,
       cost_usd_parsed,
       cap_usage_after_usd,
       error_class
FROM   agent_call_audit
WHERE  agent_run_id = 'r-xyz-uuid'
ORDER  BY started_at ASC;

Backed by idx_audit_run. This is the incident-response query. The shape of the output — call after call with cost and verdict — gives you the agent's full money-moving history in one view. Run the same UUID against your LLM proxy's table and your observability tool's table and you get a cross-layer trace.

4. Slow vendor responses in the last hour

SELECT vendor,
       endpoint_path,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY elapsed_ms)  AS p50,
       PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY elapsed_ms) AS p95,
       MAX(elapsed_ms)                                          AS p_max,
       COUNT(*)                                                 AS n
FROM   agent_call_audit
WHERE  started_at > NOW() - INTERVAL '1 hour'
  AND  vendor_status BETWEEN 200 AND 299
GROUP  BY vendor, endpoint_path
HAVING MAX(elapsed_ms) > 2000
ORDER  BY p95 DESC;

Backed by idx_audit_vendor_time. You'll see a vendor's status page light up about ten minutes after this query starts being interesting — the audit is your earliest signal. Filters on success status because the latency of an erroring call is a different question (it includes retries inside the vendor's edge).

5. Every action on behalf of one customer

SELECT started_at,
       agent_run_id,
       vendor,
       endpoint_path,
       policy_verdict,
       cost_usd_parsed,
       vendor_request_id
FROM   agent_call_audit
WHERE  customer_scope_id = 'cus_27b3ac'
  AND  started_at > NOW() - INTERVAL '90 days'
ORDER  BY started_at ASC;

Backed by idx_audit_customer. This is the query a disputing customer's lawyer or a SOC 2 auditor will eventually ask for. Its existence in production is the line between "we have an audit trail" and "we have logs." The vendor_request_id column lets the recipient cross-check each row against the vendor's own dashboard if they want to.

A synthetic incident, traced from log rows alone

To make the schema concrete: imagine a refund-issuing agent running on a Saturday. The agent had a Stripe vault key with refunds:write scope (see our Stripe Restricted Key permissions reference for the full scope table) and a $50/day USD cap. At 02:14 UTC it began processing a queue of refund requests submitted by support agents during the day.

The first eight calls succeeded. Each row had policy_verdict = 'allow', vendor_status = 200, cost_usd_parsed reflecting the refunded amount, cap_usage_after_usd climbing from 12.40 to 39.85.

Call nine was the one that mattered. The customer record had been updated mid-day with a stale charge ID. The agent attempted the refund — vendor_status = 404, error_class = 'vendor_4xx', cost_usd_parsed = 0. The retry policy in the agent's own code (not the proxy's) decided to re-queue. Call ten: same charge ID, same 404, requeue.

By call seventeen, the cap was unaffected (no real refunds were going through), the LLM proxy didn't see the loop because it wasn't an LLM call, and the agent's internal metrics showed "calls per minute up, success rate down." Without the audit table, the on-call engineer is staring at "Stripe seems flaky." With the audit table, this query nails it in seconds:

SELECT endpoint_path, error_class, COUNT(*)
FROM   agent_call_audit
WHERE  agent_run_id = 'r-sat-refund-23'
  AND  vendor_status = 404
GROUP  BY endpoint_path, error_class
ORDER  BY COUNT(*) DESC;

Output: one row: the normalized refund endpoint, error_class = 'vendor_4xx', count = 9. Nine calls hammering the same stale charge ID against a Stripe 404. The fix is the agent's retry policy, not Stripe. The diagnosis is one query, two minutes. For the related question of what to do in the next ten minutes — rotate the upstream key versus revoke the vault key — see our kill-switch latency analysis: a vault-key revoke is sub-second; a Stripe key rotation is median 45s, p95 3m12s.

What the schema deliberately omits

Three things are not in the table. Each omission is deliberate.

Request and response bodies. Storing the full body of every call is appealing — until the day legal asks why your audit table contains card PANs in a Stripe error response, or HIPAA-relevant text in a Twilio SMS body. We log vendor_request_id instead and refer back to the vendor's dashboard for the full body when needed. The vendor already has retention; you don't need to duplicate it. This is the same principle Helicone applies to LLM bodies versus headers, with the dial set the other way — see our Helicone-alternative comparison for the contrast.

The agent's prompt or tool-call decision. That belongs in the LLM proxy or the observability layer, where it's already structured. Duplicating it into the SaaS audit table doubles the storage and triples the redaction risk. The agent_run_id link is enough to join across layers when an incident demands it.
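A sketch of that cross-layer join in SQLite. The llm_proxy_calls table here is hypothetical, a stand-in for whatever your LLM proxy writes; the only contract is that both layers carry the same agent_run_id.

```python
import sqlite3

# Cross-layer trace via agent_run_id: two layers, one run UUID,
# one ordered timeline. llm_proxy_calls is a made-up stand-in table.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE agent_call_audit
    (agent_run_id TEXT, started_at TEXT, vendor TEXT, policy_verdict TEXT);
CREATE TABLE llm_proxy_calls
    (agent_run_id TEXT, started_at TEXT, tool_call TEXT);
INSERT INTO llm_proxy_calls  VALUES ('r-xyz', '2026-01-03T02:14:01Z', 'issue_refund');
INSERT INTO agent_call_audit VALUES ('r-xyz', '2026-01-03T02:14:02Z', 'stripe', 'allow');
""")

# Interleave both layers for a single run, ordered by wall clock.
trace = db.execute("""
    SELECT started_at, 'llm'  AS layer, tool_call      AS detail
    FROM   llm_proxy_calls  WHERE agent_run_id = 'r-xyz'
    UNION ALL
    SELECT started_at, 'saas' AS layer, policy_verdict AS detail
    FROM   agent_call_audit WHERE agent_run_id = 'r-xyz'
    ORDER  BY started_at""").fetchall()
```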

Free-form notes or operator comments. Annotations get added to a side-table, not to immutable audit rows. An audit table whose rows can be edited is a justification log, not an audit log; auditors will reject it on inspection.

What you forget without one

The stuck-refund loop above is one example of a fact that is unrecoverable without a properly shaped audit table. The quiet weekend spend spike from query one is another; "what was cap utilisation at moment T" is a third. In every case the raw access log existed and answered nothing.

You don't need all sixteen columns on day one. You do need the four that turn this into an audit instead of a log: agent_run_id, policy_verdict, cost_usd_parsed, and customer_scope_id. Everything else accretes from real questions over the first three months. We've watched two companies build the rest of the table by trial and error; this post is what we'd hand them on day one if we could go back.

If you're putting this in front of a Stripe-touching agent for the first time, the audit table is the first piece — without it, the controls upstream don't have a record to enforce against. Build the table, log every call, then add caps. That order keeps the audit honest about what your enforcement actually did, not what you intended.

Get Keybrake when v1 ships

Pre-launch waitlist for the SaaS-API governance layer. Every Keybrake call writes to an audit table shaped exactly like the one above — same columns, same indexes, same query catalogue. We'll email you a vault key when v1 is live, with a working code sample for Stripe, Twilio, and Resend.