
AI agent observability: what tracing tools miss about vendor API spend

Observability for AI agent systems has a structural blind spot: vendor API spend. Standard observability tools — OpenTelemetry, Datadog APM, AWS X-Ray, Jaeger — record span durations, error rates, and throughput for every HTTP call your agent makes to Stripe, Twilio, and Resend. They don't parse the dollar amount out of the Stripe PaymentIntent response, attribute that spend to the specific agent run that triggered the call, or fire an alert when a single orchestration execution exceeds a dollar threshold you set. Operations teams running autonomous billing agents, payment processors, or SMS automation can tell you their p99 Stripe API latency from APM. They cannot tell you what last night's agent run cost in Stripe charges without opening the Stripe dashboard and manually cross-referencing timestamps. This page covers the four observability questions that matter for agent systems, why standard tools don't answer them, and how a proxy-based audit log fills the gap.

TL;DR

Standard APM tools observe the HTTP channel (latency, status codes, span duration). Vendor spend observability requires parsing dollar amounts from vendor response bodies — Stripe's amount field, Twilio's price field, Resend's fixed per-email rate — and correlating them with the agent run ID that made the call. A proxy positioned between your agent and vendor APIs sees every call, can parse cost from the response, and stores a structured audit log that answers the four questions: per-run cost, fleet-wide spend, per-vendor daily trend, and spend anomalies. No custom span attributes, no SDK changes across every service, no post-hoc data pipeline.

What standard observability tools record for vendor API calls

When your agent calls https://api.stripe.com/v1/payment_intents, an OpenTelemetry auto-instrumentation span records:

Span: POST https://api.stripe.com/v1/payment_intents
  duration_ms: 342
  http.status_code: 200
  http.method: POST
  http.url: https://api.stripe.com/v1/payment_intents
  error: false
  trace_id: 4d8f1a2b3c9e0f7a
  parent_span_id: 7f3e2c1b9a4d8f0e

What it doesn't record: the amount field from the response body (the payment amount, in the currency's smallest unit), the Stripe PaymentIntent.id that would let you find this payment in the Stripe dashboard, which agent run or workflow execution triggered this call, or how this call's cost compares to the budget allocated for this run. The span tells you the call succeeded in 342ms. It doesn't tell you the payment was for $149.99.

The four observability questions agent operations teams need

Question	Why teams need it	Standard APM answer
What did this agent run cost?	After a billing run, you need to know if the run's total Stripe charges matched expectations. A $50,000 run that should have been $45,000 has a bug somewhere. You need per-run cost attribution, not just a count of successful Stripe API calls.	APM shows how many Stripe spans completed with status 200. It cannot sum the `amount` fields from the response bodies to give you total dollars charged.
Which agent or run is the biggest spender today?	In a multi-agent fleet — billing agent, refund processor, subscription management agent — you need a fleet-level spend view to see which agent is generating the most Stripe volume on a given day. If one agent is unexpectedly in the top position, something is wrong.	APM can show you which service has the most Stripe outbound spans. It cannot attribute spend to agent identity or sort by dollar amount.
What's the per-vendor daily trend?	You want to see Stripe spend, Twilio spend, and Resend spend trending over 30 days, per vendor, to catch gradual drift. An agent that was spending $200/day in Stripe and is now spending $800/day has a problem that didn't appear in any single run's alerting.	APM can show you call volume trends per vendor domain. Dollar cost trends require vendor-specific response parsing that APM tools don't perform.
Is there a spend anomaly right now?	A stuck loop or a prompt injection attack can cause an agent to make 10× its normal vendor API volume within an hour. You need an alert that fires during the anomaly, not a billing alert that fires the next day after the damage is done.	APM can alert on request rate spikes. Dollar-based anomaly detection (spend velocity > 3× 30-day hourly average) requires cost data that APM spans don't carry.

Why vendor dashboards don't fill the gap

The Stripe dashboard shows total charges, but it doesn't know which of your agents made each charge. Stripe metadata can carry agent identifiers if you instrument every Stripe API call to include metadata: { agent_run_id: "run_123" }, but this requires modifying every service that calls Stripe, maintaining the instrumentation across deploys, and building custom reports in Stripe Sigma to aggregate by metadata field. The Twilio dashboard shows SMS cost but has no per-agent breakdown. Resend has no per-call cost breakdown at all (flat-rate billing). And none of the vendor dashboards gives you a cross-vendor view: how much did the billing agent spend across Stripe, Twilio, and Resend combined for the 3am run?
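To make the instrumentation burden concrete, here is a minimal sketch of what tagging a single Stripe request with a run ID could look like. It assumes Stripe's form-encoded request convention, where nested keys are written as `metadata[key]=value`. The helper name `with_agent_metadata` is hypothetical, and the metadata key `agent_run_id` follows the example above.

```python
from urllib.parse import urlencode


def with_agent_metadata(params: dict[str, str], agent_run_id: str) -> str:
    """Return a form-encoded request body with the run ID attached as metadata.

    Hypothetical helper: every service that calls Stripe would have to apply
    something like this to each request, and keep doing so across deploys.
    """
    tagged = dict(params)  # copy so the caller's dict is not mutated
    tagged["metadata[agent_run_id]"] = agent_run_id
    return urlencode(tagged)


body = with_agent_metadata({"amount": "14999", "currency": "usd"}, "run_123")
print(body)
```

The code itself is small. The cost is in applying it consistently at every call site in every service, which is the maintenance problem described above.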

The minimum audit log schema for agent spend observability

The data you need is a structured event for each vendor API call with cost and agent attribution:

-- Minimum audit log schema
CREATE TABLE agent_spend_audit (
  id            TEXT PRIMARY KEY,
  vault_key_id  TEXT NOT NULL,        -- which scoped key made this call
  agent_run_label TEXT,               -- "billing-agent/run_abc123" from vault key
  vendor        TEXT NOT NULL,        -- 'stripe', 'twilio', 'resend'
  endpoint      TEXT NOT NULL,        -- 'POST /v1/payment_intents'
  cost_usd      DECIMAL(10,4),        -- parsed from vendor response
  vendor_txn_id TEXT,                 -- Stripe PaymentIntent.id, Twilio SID, Resend email_id
  policy_verdict TEXT,                -- 'allowed', 'cap_exhausted', 'endpoint_blocked'
  called_at     TIMESTAMPTZ NOT NULL,
  http_status   INTEGER
);

-- Index for per-run queries
CREATE INDEX idx_audit_run_label ON agent_spend_audit(agent_run_label, called_at);
-- Index for anomaly detection
CREATE INDEX idx_audit_called_at ON agent_spend_audit(called_at, vendor);
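
As an illustration of how application code might represent one row of this table before writing it, here is a sketch of a Python record type. The class name `AuditEvent` and all example values (`vk_example`, `pi_example`) are hypothetical. Only the field names come from the schema above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from decimal import Decimal
from typing import Optional
import uuid


@dataclass(frozen=True)
class AuditEvent:
    """One agent_spend_audit row; field names mirror the table columns."""

    vault_key_id: str
    vendor: str
    endpoint: str
    called_at: datetime
    agent_run_label: Optional[str] = None
    cost_usd: Optional[Decimal] = None  # None when no cost could be parsed
    vendor_txn_id: Optional[str] = None
    policy_verdict: Optional[str] = None
    http_status: Optional[int] = None
    id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def __post_init__(self) -> None:
        # TIMESTAMPTZ stores an absolute instant, so reject naive datetimes.
        if self.called_at.tzinfo is None:
            raise ValueError("called_at must be timezone-aware")


event = AuditEvent(
    vault_key_id="vk_example",  # illustrative values only
    agent_run_label="billing-agent/run_abc123",
    vendor="stripe",
    endpoint="POST /v1/payment_intents",
    cost_usd=Decimal("149.99"),
    vendor_txn_id="pi_example",
    policy_verdict="allowed",
    called_at=datetime.now(timezone.utc),
    http_status=200,
)
```

Using `Decimal` rather than `float` for `cost_usd` matches the `DECIMAL(10,4)` column and avoids binary rounding drift when many small amounts are summed.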

Four SQL queries that answer the observability questions

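-- 1. Per-run cost summary (last 24h)
-- Verdicts are filtered per aggregate rather than in WHERE; a WHERE filter on
-- 'allowed' would drop every cap_exhausted row and force cap_hits to zero.
SELECT
  agent_run_label,
  vendor,
  COALESCE(SUM(cost_usd) FILTER (WHERE policy_verdict = 'allowed'), 0) AS total_cost_usd,
  COUNT(*) FILTER (WHERE policy_verdict = 'allowed') AS call_count,
  COUNT(*) FILTER (WHERE policy_verdict = 'cap_exhausted') AS cap_hits
FROM agent_spend_audit
WHERE called_at > NOW() - INTERVAL '24 hours'
GROUP BY agent_run_label, vendor
ORDER BY total_cost_usd DESC;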

-- 2. Fleet-wide spend by agent today
SELECT
  split_part(agent_run_label, '/', 1) AS agent_name,
  SUM(cost_usd) AS total_cost_usd,
  COUNT(DISTINCT split_part(agent_run_label, '/', 2)) AS run_count
FROM agent_spend_audit
WHERE called_at >= CURRENT_DATE
GROUP BY 1
ORDER BY total_cost_usd DESC;

-- 3. Per-vendor daily trend (30d)
SELECT
  DATE_TRUNC('day', called_at) AS day,
  vendor,
  SUM(cost_usd) AS daily_spend_usd
FROM agent_spend_audit
WHERE called_at > NOW() - INTERVAL '30 days'
GROUP BY 1, 2
ORDER BY 1, 2;

-- 4. Spend anomaly detection (>3x 30-day hourly baseline)
WITH hourly AS (
  SELECT
    DATE_TRUNC('hour', called_at) AS hour,
    SUM(cost_usd) AS hourly_spend
  FROM agent_spend_audit
  WHERE called_at > NOW() - INTERVAL '30 days'
  GROUP BY 1
),
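baseline AS (
  -- Exclude the 3 hours being checked so a spike cannot inflate its own baseline.
  -- Hours with no calls have no row in hourly, so this averages active hours only.
  SELECT AVG(hourly_spend) AS avg_hourly
  FROM hourly
  WHERE hour < NOW() - INTERVAL '3 hours'
)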
SELECT
  h.hour,
  h.hourly_spend,
  b.avg_hourly,
  ROUND(h.hourly_spend / NULLIF(b.avg_hourly, 0), 1) AS multiplier
FROM hourly h, baseline b
WHERE h.hour >= NOW() - INTERVAL '3 hours'
  AND h.hourly_spend > b.avg_hourly * 3
ORDER BY h.hour DESC;
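
For teams wiring alerting outside the database, the same threshold check can be sketched in Python. This is an illustration under stated assumptions rather than a reference implementation. The function name `find_spend_anomalies` is hypothetical, spend values are plain floats for brevity, and the baseline deliberately excludes the recent hours being checked.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean


def find_spend_anomalies(hourly_spend, now, recent_hours=3, threshold=3.0):
    """Return (hour, spend, multiplier) for recent hours above threshold x baseline.

    hourly_spend maps hour-truncated, timezone-aware datetimes to USD amounts.
    Hours absent from the mapping are ignored, so the baseline averages
    active hours only.
    """
    cutoff = now - timedelta(hours=recent_hours)
    baseline_values = [v for h, v in hourly_spend.items() if h < cutoff]
    if not baseline_values:
        return []  # no history yet, so nothing to compare against
    baseline = mean(baseline_values)
    if baseline <= 0:
        return []  # avoid dividing by a zero baseline
    flagged = [
        (h, v, round(v / baseline, 1))
        for h, v in hourly_spend.items()
        if h >= cutoff and v > baseline * threshold
    ]
    return sorted(flagged, key=lambda row: row[0], reverse=True)


utc = timezone.utc
history = {datetime(2024, 1, 1, h, tzinfo=utc): 10.0 for h in range(9)}
history[datetime(2024, 1, 1, 10, tzinfo=utc)] = 12.0  # normal hour
history[datetime(2024, 1, 1, 11, tzinfo=utc)] = 45.0  # spike
anomalies = find_spend_anomalies(history, now=datetime(2024, 1, 1, 12, 30, tzinfo=utc))
```

A scheduler could run this every few minutes against the hourly aggregates and page someone when the returned list is non-empty.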

How cost is parsed from vendor responses

The proxy reads vendor API responses before forwarding them to the caller and extracts cost signals from response fields:

Vendor	Endpoint	Cost signal	Parsing method
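Stripe	`POST /v1/payment_intents`	Response body `amount` (in the currency's smallest unit, e.g. cents for USD) + `currency`	Divide `amount` by 100 for two-decimal currencies (zero-decimal currencies such as JPY are used as-is), then convert to USD at the current exchange rate
Stripe	`POST /v1/charges`	Response body `amount` (in the currency's smallest unit)	Same as above; `application_fee_amount` also parsed if present
Twilio	`POST /2010-04-01/Accounts/{}/Messages`	Response body `price` field (e.g. "-0.0075") + `price_unit`	Parse absolute value of `price`; the field is typically `null` in the create response and populated once the message is billed, so it is read at message status callback time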
Resend	`POST /emails`	No per-call price in response; billing is subscription-based	Increment fixed per-email rate counter based on plan tier
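
The parsing rules in the table can be sketched as small pure functions. This is a hedged illustration, not Keybrake's actual implementation. The function names, the `USD_RATES` table, and the rates in it are hypothetical placeholders. The `ZERO_DECIMAL` set is a partial list of currencies Stripe treats as zero-decimal, and three-decimal currencies are not handled.

```python
from decimal import Decimal
from typing import Optional

FOUR_PLACES = Decimal("0.0001")  # matches the DECIMAL(10,4) audit column

# Partial list of Stripe zero-decimal currencies, for illustration only.
ZERO_DECIMAL = {"jpy", "krw", "vnd"}

# Hypothetical static rates to USD; a real proxy would need a live rate source.
USD_RATES = {"usd": Decimal("1"), "eur": Decimal("1.08"), "jpy": Decimal("0.0067")}


def stripe_cost_usd(body: dict) -> Optional[Decimal]:
    """Convert a Stripe response's amount and currency into USD.

    On a PaymentIntent, amount is the intended amount; whether money actually
    moved depends on the object's status, which this sketch does not check.
    """
    amount, currency = body.get("amount"), body.get("currency")
    if amount is None or currency is None:
        return None
    currency = currency.lower()
    rate = USD_RATES.get(currency)
    if rate is None:
        return None  # unknown currency: record no cost rather than guess
    divisor = 1 if currency in ZERO_DECIMAL else 100
    return (Decimal(amount) / divisor * rate).quantize(FOUR_PLACES)


def twilio_cost_usd(body: dict) -> Optional[Decimal]:
    """Take the absolute value of Twilio's negative price string."""
    price = body.get("price")
    if price is None:
        return None  # price not populated yet
    if (body.get("price_unit") or "").lower() != "usd":
        return None  # currency conversion omitted in this sketch
    return abs(Decimal(price)).quantize(FOUR_PLACES)


def resend_cost_usd(per_email_rate: Decimal) -> Decimal:
    """Resend returns no price, so charge a caller-supplied flat rate."""
    return per_email_rate.quantize(FOUR_PLACES)
```

Returning `None` when a cost cannot be determined keeps unknown values out of the `SUM(cost_usd)` aggregates instead of silently counting them as zero-cost calls.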

How Keybrake fits

Keybrake is the proxy positioned between your agent services and vendor APIs. Every call that flows through the proxy is logged to the audit table with cost_usd parsed from the vendor response, agent_run_label from the vault key's metadata, and vendor_txn_id from the vendor's response (Stripe PaymentIntent.id, Twilio SID). No custom span attributes, no SDK changes in your agent code, no post-hoc data pipeline joining APM telemetry with vendor dashboards. The dashboard at keybrake.com/app surfaces today's spend per vendor, recent calls with cost breakdown, and cap-hit rate. The audit log is queryable SQL for teams that want to build custom reports or wire alerting on anomaly detection queries.
