AI agent observability: what tracing tools miss about vendor API spend
Observability for AI agent systems has a structural blind spot: vendor API spend. Standard observability tools — OpenTelemetry, Datadog APM, AWS X-Ray, Jaeger — record span durations, error rates, and throughput for every HTTP call your agent makes to Stripe, Twilio, and Resend. They don't parse the dollar amount out of the Stripe PaymentIntent response, attribute that spend to the specific agent run that triggered the call, or fire an alert when a single orchestration execution exceeds a dollar threshold you set. Operations teams running autonomous billing agents, payment processors, or SMS automation can tell you their p99 Stripe API latency from APM. They cannot tell you what last night's agent run cost in Stripe charges without opening the Stripe dashboard and manually cross-referencing timestamps. This page covers the four observability questions that matter for agent systems, why standard tools don't answer them, and how a proxy-based audit log fills the gap.
TL;DR
Standard APM tools observe the HTTP channel (latency, status codes, span duration). Vendor spend observability requires parsing dollar amounts from vendor response bodies — Stripe's amount field, Twilio's price field, Resend's fixed per-email rate — and correlating them with the agent run ID that made the call. A proxy positioned between your agent and vendor APIs sees every call, can parse cost from the response, and stores a structured audit log that answers the four questions: per-run cost, fleet-wide spend, per-vendor daily trend, and spend anomalies. No custom span attributes, no SDK changes across every service, no post-hoc data pipeline.
What standard observability tools record for vendor API calls
When your agent calls https://api.stripe.com/v1/payment_intents, an OpenTelemetry auto-instrumentation span records:
Span: POST https://api.stripe.com/v1/payment_intents
duration_ms: 342
http.status_code: 200
http.method: POST
http.url: https://api.stripe.com/v1/payment_intents
error: false
trace_id: 4d8f1a2b3c9e0f7a
parent_span_id: 7f3e2c1b9a4d8f0e
What it doesn't record: the amount field from the response body (which tells you how many dollars just moved), the Stripe PaymentIntent.id that would let you find this charge in the Stripe dashboard, which agent run or workflow execution triggered this call, or how this call's cost compares to the budget allocated for this run. The span tells you the call succeeded in 342ms. It doesn't tell you it charged $149.99.
The four observability questions agent operations teams need
| Question | Why teams need it | Standard APM answer |
|---|---|---|
| What did this agent run cost? | After a billing run, you need to know if the run's total Stripe charges matched expectations. A $50,000 run that should have been $45,000 has a bug somewhere. You need per-run cost attribution, not just a count of successful Stripe API calls. | APM shows how many Stripe spans completed with status 200. It cannot sum the amount fields from the response bodies to give you total dollars charged. |
| Which agent or run is the biggest spender today? | In a multi-agent fleet — billing agent, refund processor, subscription management agent — you need a fleet-level spend view to see which agent is generating the most Stripe volume on a given day. If one agent is unexpectedly in the top position, something is wrong. | APM can show you which service has the most Stripe outbound spans. It cannot attribute spend to agent identity or sort by dollar amount. |
| What's the per-vendor daily trend? | You want to see Stripe spend, Twilio spend, and Resend spend trending over 30 days, per vendor, to catch gradual drift. An agent that was spending $200/day in Stripe and is now spending $800/day has a problem that didn't appear in any single run's alerting. | APM can show you call volume trends per vendor domain. Dollar cost trends require vendor-specific response parsing that APM tools don't perform. |
| Is there a spend anomaly right now? | A stuck loop or a prompt injection attack can cause an agent to make 10× its normal vendor API volume within an hour. You need an alert that fires during the anomaly, not a CloudWatch billing alert that fires the next day after the damage is done. | APM can alert on request rate spikes. Dollar-based anomaly detection (spend velocity > 3× 30-day hourly average) requires cost data that APM spans don't carry. |
Why vendor dashboards don't fill the gap
The Stripe dashboard shows total charges, but it doesn't know which of your agents made each charge. Stripe metadata can carry agent identifiers if you instrument every Stripe API call to include metadata: { agent_run_id: "run_123" }, but this requires modifying every service that calls Stripe, maintaining the instrumentation across deploys, and building custom reports (for example in Stripe Sigma) to aggregate by metadata field. The Twilio dashboard shows SMS cost but has no per-agent breakdown. Resend has no per-call cost breakdown at all (flat-rate billing). And none of the vendor dashboards give you a cross-vendor view: how much did the billing agent spend across Stripe, Twilio, and Resend combined for the 3am run?
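To make the instrumentation burden concrete, here is a minimal Python sketch of what every Stripe call site would have to do. Stripe's REST API accepts form-encoded bodies with bracket notation for nested keys; the helper name `with_agent_metadata` and the metadata key `agent_run_id` are illustrative assumptions, not Stripe conventions.

```python
from urllib.parse import urlencode


def with_agent_metadata(params: dict, agent_run_id: str) -> dict:
    """Return a copy of Stripe request params tagged with the agent run.

    Hypothetical helper: the key name `agent_run_id` is an assumption.
    """
    tagged = dict(params)
    # Stripe's form encoding expresses nested keys with brackets.
    tagged["metadata[agent_run_id]"] = agent_run_id
    return tagged


params = with_agent_metadata(
    {"amount": 14999, "currency": "usd"}, agent_run_id="run_123"
)
# Form-encoded body for POST /v1/payment_intents
body = urlencode(params)
```

Each service that talks to Stripe would need to route its calls through a helper like this, and the run identifier would have to be threaded down to every call site, which is the maintenance cost described above.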
The minimum audit log schema for agent spend observability
The data you need is a structured event for each vendor API call with cost and agent attribution:
-- Minimum audit log schema
CREATE TABLE agent_spend_audit (
id TEXT PRIMARY KEY,
vault_key_id TEXT NOT NULL, -- which scoped key made this call
agent_run_label TEXT, -- "billing-agent/run_abc123" from vault key
vendor TEXT NOT NULL, -- 'stripe', 'twilio', 'resend'
endpoint TEXT NOT NULL, -- 'POST /v1/payment_intents'
cost_usd DECIMAL(10,4), -- parsed from vendor response
vendor_txn_id TEXT, -- Stripe PaymentIntent.id, Twilio SID, Resend email_id
policy_verdict TEXT, -- 'allowed', 'cap_exhausted', 'endpoint_blocked'
called_at TIMESTAMPTZ NOT NULL,
http_status INTEGER
);
-- Index for per-run queries
CREATE INDEX idx_audit_run_label ON agent_spend_audit(agent_run_label, called_at);
-- Index for anomaly detection
CREATE INDEX idx_audit_called_at ON agent_spend_audit(called_at, vendor);
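For teams that handle these rows in application code, the schema could be mirrored as a typed record. This is a sketch only: the class name `SpendAuditEvent` and all sample values are assumptions, and `Decimal` is used so that money is not stored as a binary float.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class SpendAuditEvent:
    """One row of agent_spend_audit; field names follow the SQL schema."""
    id: str
    vault_key_id: str
    vendor: str
    endpoint: str
    called_at: datetime
    agent_run_label: Optional[str] = None
    cost_usd: Optional[Decimal] = None  # None when the vendor exposes no price
    vendor_txn_id: Optional[str] = None
    policy_verdict: Optional[str] = None
    http_status: Optional[int] = None


# Illustrative values only.
event = SpendAuditEvent(
    id="evt_001",
    vault_key_id="vk_billing",
    vendor="stripe",
    endpoint="POST /v1/payment_intents",
    called_at=datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc),
    agent_run_label="billing-agent/run_abc123",
    cost_usd=Decimal("149.99"),
    vendor_txn_id="pi_example",
    policy_verdict="allowed",
    http_status=200,
)
```

The nullable columns map to `Optional` fields, so a blocked call or a vendor without per-call pricing can still be recorded with `cost_usd=None`.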
Four SQL queries that answer the observability questions
-- 1. Per-run cost summary (last 24h)
-- Verdicts are filtered per aggregate so blocked calls still count toward cap_hits
SELECT
agent_run_label,
vendor,
SUM(cost_usd) FILTER (WHERE policy_verdict = 'allowed') AS total_cost_usd,
COUNT(*) FILTER (WHERE policy_verdict = 'allowed') AS call_count,
COUNT(*) FILTER (WHERE policy_verdict = 'cap_exhausted') AS cap_hits
FROM agent_spend_audit
WHERE called_at > NOW() - INTERVAL '24 hours'
GROUP BY agent_run_label, vendor
ORDER BY total_cost_usd DESC NULLS LAST;
-- 2. Fleet-wide spend by agent today
SELECT
split_part(agent_run_label, '/', 1) AS agent_name,
SUM(cost_usd) AS total_cost_usd,
COUNT(DISTINCT split_part(agent_run_label, '/', 2)) AS run_count
FROM agent_spend_audit
WHERE called_at >= CURRENT_DATE
GROUP BY 1
ORDER BY total_cost_usd DESC;
-- 3. Per-vendor daily trend (30d)
SELECT
DATE_TRUNC('day', called_at) AS day,
vendor,
SUM(cost_usd) AS daily_spend_usd
FROM agent_spend_audit
WHERE called_at > NOW() - INTERVAL '30 days'
GROUP BY 1, 2
ORDER BY 1, 2;
-- 4. Spend anomaly detection (>3x 30-day hourly baseline)
WITH hourly AS (
SELECT
DATE_TRUNC('hour', called_at) AS hour,
SUM(cost_usd) AS hourly_spend
FROM agent_spend_audit
WHERE called_at > NOW() - INTERVAL '30 days'
GROUP BY 1
),
baseline AS (
SELECT AVG(hourly_spend) AS avg_hourly
FROM hourly
WHERE hour < NOW() - INTERVAL '3 hours' -- keep the tested window out of the baseline
)
SELECT
h.hour,
h.hourly_spend,
b.avg_hourly,
ROUND(h.hourly_spend / NULLIF(b.avg_hourly, 0), 1) AS multiplier
FROM hourly h, baseline b
WHERE h.hour >= NOW() - INTERVAL '3 hours'
AND h.hourly_spend > b.avg_hourly * 3
ORDER BY h.hour DESC;
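If the alert is wired up in application code rather than in SQL, the same threshold check could look like the Python sketch below. The function name, the input shape (a list of hourly totals, oldest first), and the default parameters are assumptions. The sketch keeps the tested hours out of the baseline, and it expects the caller to include hours with zero spend so the average is not inflated.

```python
from decimal import Decimal
from statistics import mean


def find_spend_anomalies(hourly_spend, recent_hours=3, threshold=3.0):
    """Flag recent hours whose spend exceeds `threshold` x the baseline mean.

    hourly_spend: list of (hour_label, Decimal spend), oldest first.
    The last `recent_hours` entries (must be >= 1) are tested; everything
    before them forms the baseline.
    """
    baseline_rows = hourly_spend[:-recent_hours]
    recent_rows = hourly_spend[-recent_hours:]
    if not baseline_rows:
        return []
    baseline = mean([spend for _, spend in baseline_rows])
    if baseline == 0:
        return []
    limit = baseline * Decimal(str(threshold))
    return [
        (hour, spend, round(float(spend / baseline), 1))
        for hour, spend in recent_rows
        if spend > limit
    ]


# Illustrative data: a flat $10/hour baseline, then one spike.
series = [
    ("h1", Decimal("10")), ("h2", Decimal("10")), ("h3", Decimal("10")),
    ("h4", Decimal("10")), ("h5", Decimal("45")), ("h6", Decimal("12")),
]
anomalies = find_spend_anomalies(series)
```

Each returned tuple carries the hour, its spend, and the multiplier over the baseline, matching the columns the SQL version selects.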
How cost is parsed from vendor responses
The proxy reads vendor API responses before forwarding them to the caller and extracts cost signals from response fields:
| Vendor | Endpoint | Cost signal | Parsing method |
|---|---|---|---|
| Stripe | POST /v1/payment_intents | Response body amount (in cents) + currency | Parse amount / 100 converted to USD at current exchange rate |
| Stripe | POST /v1/charges | Response body amount (in cents) | Same as above; application_fee_amount also parsed if present |
| Twilio | POST /2010-04-01/Accounts/{}/Messages | Response body price field (e.g. "-0.0075") + price_unit | Parse absolute value of price; available on message status callback |
| Resend | POST /emails | No per-call price in response; billing is subscription-based | Increment fixed per-email rate counter based on plan tier |
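The Stripe and Twilio parsing rules can be sketched in Python as follows. This is not Keybrake's implementation: the function names and the `usd_rates` lookup are hypothetical, and the handling of zero-decimal currencies (where Stripe's `amount` is already in whole units) goes slightly beyond the table. Resend is omitted because its cost is a fixed configured rate rather than something parsed from a response.

```python
from decimal import Decimal
from typing import Optional

# Currencies Stripe treats as zero-decimal (partial list, for illustration).
ZERO_DECIMAL = {"jpy", "krw", "vnd"}


def stripe_cost_usd(body: dict, usd_rates: dict) -> Optional[Decimal]:
    """Convert a PaymentIntent/Charge `amount` to USD.

    usd_rates maps lowercase currency code -> USD per unit. It is a
    hypothetical input; the exchange-rate source is not specified here.
    """
    currency = body["currency"].lower()
    divisor = Decimal(1) if currency in ZERO_DECIMAL else Decimal(100)
    rate = usd_rates.get(currency)
    if rate is None:
        return None  # unknown currency: record the call without a cost
    return (Decimal(body["amount"]) / divisor) * rate


def twilio_cost_usd(body: dict) -> Optional[Decimal]:
    """Twilio reports `price` as a negative string, or null before billing."""
    price = body.get("price")
    unit = (body.get("price_unit") or "").upper()
    if price is None or unit != "USD":
        return None
    return abs(Decimal(price))
```

Returning `None` instead of zero keeps "cost unknown" distinguishable from "free" when the row is written to the audit log.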
How Keybrake fits
Keybrake is the proxy positioned between your agent services and vendor APIs. Every call that flows through the proxy is logged to the audit table with cost_usd parsed from the vendor response, agent_run_label from the vault key's metadata, and vendor_txn_id from the vendor's response (Stripe PaymentIntent.id, Twilio SID). No custom span attributes, no SDK changes in your agent code, no post-hoc data pipeline joining APM telemetry with vendor dashboards. The dashboard at keybrake.com/app surfaces today's spend per vendor, recent calls with cost breakdown, and cap-hit rate. The audit log is queryable SQL for teams that want to build custom reports or wire alerting on anomaly detection queries.
Related questions
Can I use OpenTelemetry alongside Keybrake for observability?
Yes, they complement each other. OpenTelemetry and APM tools (Datadog, Honeycomb, Jaeger) answer latency and error-rate questions: is my Stripe API call slow, is it failing, what's the trace leading to a timeout. Keybrake answers spend questions: how much did that call cost, which agent run made it, is spend trending abnormally. Run both. The call through the Keybrake proxy is itself a short HTTP request that OpenTelemetry can trace, and the proxy's audit log gives you cost data that spans don't carry unless you add custom attributes yourself. Use OpenTelemetry for operational debugging and Keybrake for spend accountability.
How does Keybrake calculate cost_usd for vendors without explicit per-call pricing?
For Stripe payment_intents and charges, the amount field in the response body is the authoritative transaction amount — Keybrake parses it directly. For Twilio SMS, the price field appears in the message resource once the SMS is delivered (available via status callback or a subsequent GET on the message SID). Keybrake stores a provisional cost at call time based on the per-message rate for the destination country and updates it when the final price is confirmed. For Resend, there is no per-email price in the response — Keybrake records a fixed rate based on the plan tier configured in your account settings. Vendors with opaque pricing (Shopify Admin API, Segment) are counted as API calls with no cost_usd value, still captured in the audit log for volume tracking.
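The provisional-then-final flow for Twilio can be illustrated with a small Python sketch. The class `TwilioCostLedger`, its in-memory storage, and the sample rate are all assumptions made for illustration; they stand in for the audit log and do not describe Keybrake's internals.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Dict, Optional


@dataclass
class CostEntry:
    cost_usd: Decimal
    provisional: bool


class TwilioCostLedger:
    """Hypothetical in-memory stand-in for the audit log's cost column."""

    def __init__(self, rate_by_country: Dict[str, Decimal]):
        self.rate_by_country = rate_by_country  # assumed per-message rates
        self.entries: Dict[str, CostEntry] = {}

    def record_call(self, message_sid: str, country: str) -> None:
        # At call time no final price exists yet, so store an estimate.
        estimate = self.rate_by_country[country]
        self.entries[message_sid] = CostEntry(estimate, provisional=True)

    def confirm_price(self, message_sid: str, price: Optional[str]) -> None:
        # `price` is Twilio's string field, negative for a debit.
        if price is None:
            return  # not billed yet; keep the provisional estimate
        final = abs(Decimal(price))
        self.entries[message_sid] = CostEntry(final, provisional=False)


# Illustrative rate and identifiers.
ledger = TwilioCostLedger({"US": Decimal("0.0079")})
ledger.record_call("SM_example", "US")
ledger.confirm_price("SM_example", "-0.0075")
```

Keeping a `provisional` flag on each entry lets reports distinguish estimated spend from confirmed spend until the final price arrives.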
What's the difference between AI agent observability and LLM observability (LangSmith, Langfuse)?
LLM observability tools (LangSmith, Langfuse, Helicone) observe the LLM layer: prompt tokens, completion tokens, model latency, LLM cost per call. They don't observe what the agent does with the LLM's output: the downstream vendor API calls to Stripe, Twilio, or Resend that the agent's tool use triggers. When an agent makes one cheap LLM call and then triggers $5,000 in Stripe charges, LangSmith records the cheap LLM call and nothing of the $5,000. AI agent spend observability targets the vendor API layer, not the LLM layer. The two are complementary: use LangSmith for LLM cost and prompt debugging; use Keybrake for vendor API spend and policy enforcement.
Further reading
- AI agent spend reporting — the four SQL reporting queries that answer per-run, fleet-wide, trend, and anomaly questions from the proxy audit log.
- AI agent policy enforcement — how runtime spend caps complement observability: observability tells you what happened, caps prevent it from happening in the first place.
- AI agent audit trail — the compliance and forensic requirements that the proxy audit log satisfies beyond just spend visibility.
- AI agent cost management — the relationship between observability (seeing what happened) and cost management (controlling what can happen).