LangSmith Stripe Tracing: Close the Observability Gap for AI Agent Payments

LangSmith gives you full visibility into your LLM calls — tokens, latency, reasoning chains, tool invocations. But the moment your agent charges Stripe, LangSmith goes blind. It records that the tool was called, not whether the charge succeeded, what the charge ID was, or how much money moved. This post covers the gap and how to close it without forking your code.

What LangSmith actually traces

LangSmith is LangChain's hosted tracing and evaluation platform. When you set the LANGCHAIN_TRACING_V2=true environment variable and provide a LangSmith API key, every LLM call in your chain is traced automatically:

  • Model name, input messages, output tokens
  • Latency at each step of a chain or agent
  • Tool definitions and the arguments the model chose to pass
  • Errors and retries at the LLM layer
  • Total token cost (via model pricing tables LangSmith maintains)

This is genuinely useful. If your agent makes five LLM calls before deciding to trigger a Stripe charge, you can see each reasoning step, the token cost of the planning phase, and exactly which tool arguments the model produced.

What LangSmith does not trace is what happens after the tool arguments are handed to your tool function. From LangSmith's perspective, a tool call is a black box: arguments in, result string out.

The observability gap: tool execution is a black box

Consider a simple LangChain billing agent. The agent decides to charge a customer $49.00 and calls a charge_customer tool. In LangSmith, you see something like:

Tool: charge_customer
Input: {"customer_id": "cus_abc123", "amount_cents": 4900, "currency": "usd"}
Output: "Charge created: ch_xyz789"
Latency: 310ms

That looks complete. But here's what LangSmith didn't capture:

  • Which Stripe key was used — the restricted key, the full secret, or the wrong env var because the agent ran in the wrong environment?
  • The actual HTTP response — was the charge status succeeded or pending? Did Stripe return a card_error that your tool code swallowed?
  • The real cost — LangSmith tracks LLM token cost. It has no knowledge that your agent just moved $49 in Stripe.
  • Daily spend accumulation — if the agent runs 50 times today, LangSmith shows you 50 tool calls. It does not tell you the agent charged $2,450 to real cards.
  • Rate-limit headers — Stripe 429s that your retry logic handles silently appear as a slightly higher latency in LangSmith.
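
To make the swallowed-error case concrete, here is a minimal, self-contained sketch. Everything in it is a stand-in: FakeCardError, fake_stripe_charge, ToolSpan, and traced_tool_call are hypothetical names, not the Stripe SDK or LangSmith. It only illustrates that a tracer recording arguments and the returned string cannot tell a decline from a success when the tool catches the exception.

```python
from dataclasses import dataclass
from typing import Optional


class FakeCardError(Exception):
    """Stand-in for a payment decline; not the real Stripe exception class."""


def fake_stripe_charge(customer_id: str, amount_cents: int) -> dict:
    # Hypothetical vendor call: declines anything over $100 to simulate a card error.
    if amount_cents > 10_000:
        raise FakeCardError("card_declined")
    return {"id": "pi_demo", "status": "succeeded", "amount": amount_cents}


def charge_customer(customer_id: str, amount_cents: int) -> str:
    # A tool that swallows the vendor error and returns a friendly string.
    try:
        intent = fake_stripe_charge(customer_id, amount_cents)
        return f"Charge created: {intent['id']}"
    except FakeCardError:
        return "Charge could not be completed, will retry later"


@dataclass
class ToolSpan:
    """What an LLM-layer tracer records for a tool call: arguments in, string out."""
    name: str
    inputs: dict
    output: str
    error: Optional[str] = None


def traced_tool_call(customer_id: str, amount_cents: int) -> ToolSpan:
    inputs = {"customer_id": customer_id, "amount_cents": amount_cents}
    try:
        return ToolSpan("charge_customer", inputs, charge_customer(**inputs))
    except Exception as exc:  # only reached if the tool lets the error propagate
        return ToolSpan("charge_customer", inputs, "", error=repr(exc))


declined = traced_tool_call("cus_abc123", 25_000)
print(declined.error)   # None: the decline never reaches the trace as an error
print(declined.output)  # a plain string, with no status code or decline reason
```

If the tool re-raised instead of returning a string, the span would carry the error, but it still would not carry the HTTP status, the key used, or the amount.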

For a simple invoice-on-demand agent, this gap is acceptable. For an autonomous billing agent running unattended — recurring charges, dunning retries, refund decisions — the gap becomes a liability.

A minimal LangChain + Stripe example

Let's make the gap concrete. Here's a LangChain agent that creates Stripe charges, with LangSmith tracing enabled:

import os
import stripe
from langchain_anthropic import ChatAnthropic
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.tools import tool

# LangSmith tracing (traces LLM calls automatically)
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = os.environ["LANGSMITH_API_KEY"]
os.environ["LANGCHAIN_PROJECT"] = "billing-agent"

stripe.api_key = os.environ["STRIPE_SECRET_KEY"]  # full secret — risky

@tool
def charge_customer(customer_id: str, amount_cents: int, currency: str = "usd") -> str:
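    """Create and confirm a Stripe PaymentIntent for a customer."""
    # Note: Stripe needs a payment method to confirm an intent, so a real
    # tool would usually also pass payment_method="pm_..." for a saved card.
    intent = stripe.PaymentIntent.create(
        amount=amount_cents,
        currency=currency,
        customer=customer_id,
        confirm=True,
        automatic_payment_methods={"enabled": True, "allow_redirects": "never"},
    )
    return f"PaymentIntent {intent.id}: {intent.status}"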

llm = ChatAnthropic(model="claude-sonnet-4-6")
tools = [charge_customer]
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a billing agent. Use the charge_customer tool when asked."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=False)

result = executor.invoke({"input": "Charge customer cus_abc123 $49 for the Pro plan renewal."})
print(result["output"])

With LANGCHAIN_TRACING_V2=true, LangSmith captures the Claude call, the tool invocation arguments, and the string result. What it does not see is the Stripe POST /v1/payment_intents HTTP call, the response body, the charge.id in the Stripe audit trail, or the fact that $49 moved between accounts.

What LangSmith shows vs. what it misses

Here's the gap laid out as a table:

Signal | LangSmith | LangSmith + Keybrake
--- | --- | ---
LLM model and token count | ✅ Full trace | ✅ Full trace
Agent reasoning chain | ✅ Full trace | ✅ Full trace
Tool call arguments (what the model chose) | ✅ Logged | ✅ Logged
Stripe HTTP request path + method | ❌ Not captured | ✅ Every call logged
Stripe response status + charge ID | ❌ Not captured | ✅ Full response metadata
Which Stripe key was used | ❌ Not captured | ✅ Vault key ID logged per call
Real dollar amount moved per call | ❌ Not captured | ✅ Parsed from response, cumulative
Daily vendor spend cap enforcement | ❌ No enforcement | ✅ Hard stop at configured limit
Stripe rate-limit events (429s) | ❌ Appears as latency | ✅ Explicit 429 log entries
Per-agent or per-run isolation | ❌ Shared key across runs | ✅ Vault key per agent instance
Kill switch (revoke mid-run) | ❌ Not possible | ✅ Revoke vault key, all calls stop

Adding Keybrake: one-line proxy routing

Keybrake is a reverse proxy that sits between your agent and Stripe. You route Stripe API calls through proxy.keybrake.com/stripe/v1/ instead of api.stripe.com/v1/. The proxy looks up your real Stripe key (stored server-side), enforces your policy (spend cap, allowed endpoints, expiry), forwards the call to Stripe, and logs the result.
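
The policy step can be pictured as a small pure function. The sketch below is an assumption about how such a check might look, not Keybrake's implementation: VaultPolicy, authorize, and the 403 status for rejected requests are hypothetical, and the patterns follow the "METHOD /path" style used in the vault key request later in this post. Spend caps are left out here to keep the example focused on expiry and endpoint allowlisting.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from fnmatch import fnmatchcase
from typing import Tuple


@dataclass(frozen=True)
class VaultPolicy:
    """Hypothetical policy attached to a vault key."""
    allowed_endpoints: Tuple[str, ...]  # e.g. "GET /v1/customers/*"
    expires_at: datetime


def authorize(policy: VaultPolicy, method: str, path: str, now: datetime) -> Tuple[int, str]:
    """Return (status, reason); 200 means the proxy would forward the call."""
    if now >= policy.expires_at:
        return 403, "vault key expired"
    request = f"{method.upper()} {path}"
    # fnmatch's "*" also matches "/", so "GET /v1/customers/*" covers nested paths too.
    if not any(fnmatchcase(request, pattern) for pattern in policy.allowed_endpoints):
        return 403, f"endpoint not allowed: {request}"
    return 200, "forward to Stripe"


policy = VaultPolicy(
    allowed_endpoints=(
        "POST /v1/payment_intents",
        "GET /v1/payment_intents/*",
        "GET /v1/customers/*",
    ),
    expires_at=datetime(2026, 9, 1, tzinfo=timezone.utc),
)
now = datetime(2026, 1, 15, tzinfo=timezone.utc)

print(authorize(policy, "post", "/v1/payment_intents", now))
print(authorize(policy, "POST", "/v1/refunds", now))
```

Keeping the check free of I/O makes it easy to unit-test policies before attaching them to a key that can move money.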

For the Stripe Python SDK, the change touches two lines: the API key is swapped for a vault key, and one new line points the SDK at the proxy:

import os
import stripe

# Before: direct Stripe call, no observability
stripe.api_key = os.environ["STRIPE_SECRET_KEY"]

# After: routed through Keybrake proxy
stripe.api_key = os.environ["KEYBRAKE_VAULT_KEY"]   # vault_key_xxx
stripe.api_base = "https://proxy.keybrake.com/stripe"  # one new line

Every Stripe method call in your codebase — stripe.PaymentIntent.create(), stripe.Refund.create(), stripe.Customer.retrieve() — now flows through the proxy without any other changes. The Stripe SDK's request structure is preserved; the proxy simply intercepts, enforces policy, and forwards.
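
The routing works because each request URL is built from the configured base plus a resource path. The helper below is a simplified illustration of that composition, under the assumption that base and path are joined directly; compose_url is a hypothetical name, not an SDK function. It also shows why a trailing slash on the base is worth avoiding.

```python
def compose_url(api_base: str, resource_path: str) -> str:
    """Join a base URL and a resource path by plain concatenation."""
    return f"{api_base}{resource_path}"


direct = compose_url("https://api.stripe.com", "/v1/payment_intents")
proxied = compose_url("https://proxy.keybrake.com/stripe", "/v1/payment_intents")
sloppy = compose_url("https://proxy.keybrake.com/stripe/", "/v1/payment_intents")

print(direct)   # https://api.stripe.com/v1/payment_intents
print(proxied)  # https://proxy.keybrake.com/stripe/v1/payment_intents
print(sloppy)   # https://proxy.keybrake.com/stripe//v1/payment_intents
```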

Updated LangSmith + Keybrake example

import os
import stripe
from langchain_anthropic import ChatAnthropic
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.tools import tool
from pydantic import BaseModel, Field

# LangSmith: traces LLM layer
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = os.environ["LANGSMITH_API_KEY"]
os.environ["LANGCHAIN_PROJECT"] = "billing-agent"

# Keybrake: traces Stripe layer
stripe.api_key = os.environ["KEYBRAKE_VAULT_KEY"]    # vault_key_xxx
stripe.api_base = "https://proxy.keybrake.com/stripe"  # proxy routing

class ChargeInput(BaseModel):
    customer_id: str = Field(description="Stripe customer ID (cus_...)")
    amount_cents: int = Field(gt=0, le=100_000, description="Amount in cents, max $1,000")
    currency: str = Field(default="usd", pattern="^[a-z]{3}$")

@tool(args_schema=ChargeInput)
def charge_customer(customer_id: str, amount_cents: int, currency: str = "usd") -> str:
    """Create a Stripe PaymentIntent for a customer. Max $1,000 per call."""
    intent = stripe.PaymentIntent.create(
        amount=amount_cents,
        currency=currency,
        customer=customer_id,
        confirm=True,
        automatic_payment_methods={"enabled": True, "allow_redirects": "never"},
    )
    return f"PaymentIntent {intent.id} status={intent.status} amount={amount_cents/100:.2f} {currency.upper()}"

llm = ChatAnthropic(model="claude-sonnet-4-6")
tools = [charge_customer]
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a billing agent. Always confirm the amount before charging."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=False)

result = executor.invoke({"input": "Charge customer cus_abc123 $49 for the Pro plan."})
print(result["output"])

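The only change needed for proxy routing is the pair of stripe.* lines at the top. This version also adds a ChargeInput schema that bounds each charge at $1,000, a more informative return string, and a system prompt that asks the agent to confirm the amount; none of these are required by Keybrake. LangSmith tracing is unchanged.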

What you can now observe end-to-end

After adding the proxy, you have two complementary observability layers:

LangSmith shows

  • The full LLM reasoning chain that led to the charging decision
  • Token usage and latency for the Claude model call
  • The ChargeInput the model constructed (customer ID, amount, currency)
  • The string the tool returned (PaymentIntent pi_xxx status=succeeded)
  • Total LLM cost for the run

Keybrake audit log shows

  • The HTTP method and path: POST /stripe/v1/payment_intents
  • Which vault key was used: vault_key_prod_agent_billing
  • Stripe's response status (200, 402 card declined, 429 rate limited)
  • The charge amount parsed from the response body
  • Cumulative vendor spend for the day against your configured cap
  • Request duration and Stripe's Request-Id header for support lookup
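
As a rough picture of what one such log record could contain, here is a sketch of an entry serialized as a JSON line. The AuditLogEntry class and all of its field names are hypothetical, chosen to mirror the list above; the actual Keybrake log format may differ.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class AuditLogEntry:
    """Hypothetical shape of one proxy log record; field names are illustrative."""
    method: str
    path: str
    vault_key_id: str
    status: int
    amount_cents: Optional[int]       # None when the response carries no amount
    cumulative_cents_today: int
    daily_cap_cents: int
    duration_ms: int
    stripe_request_id: Optional[str]  # value of Stripe's Request-Id response header

    def to_json_line(self) -> str:
        # Sorted keys keep lines stable for diffing and grepping.
        return json.dumps(asdict(self), sort_keys=True)


entry = AuditLogEntry(
    method="POST",
    path="/stripe/v1/payment_intents",
    vault_key_id="vault_key_prod_agent_billing",
    status=200,
    amount_cents=4900,
    cumulative_cents_today=14700,
    daily_cap_cents=50000,
    duration_ms=312,
    stripe_request_id="req_example123",  # made-up value for illustration
)
print(entry.to_json_line())
```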

Together, LangSmith tells you why the agent decided to charge, and Keybrake tells you what actually happened at Stripe. Neither alone gives the full picture.

Production setup: per-agent vault keys and spend caps

In production, you issue a separate vault key per agent instance or per deployment environment. This matters for two reasons:

  1. Isolation. If a staging agent misbehaves, revoking its vault key does not affect the production agent. With a shared Stripe key, your only option is rotating the secret everywhere at once.
  2. Attribution. Keybrake's audit log ties each Stripe call to a vault key. When LangSmith shows you an anomalous run at 03:00 UTC, you can cross-reference the Keybrake log for the same timestamp to see exactly which Stripe calls that run made.

Creating a vault key with a daily cap and an endpoint allowlist is a single call to the admin API:

# Create a vault key with a $500/day cap on payment intents only
POST https://proxy.keybrake.com/admin/vault_keys
Authorization: Bearer {your_admin_key}
Content-Type: application/json

{
  "label": "billing-agent-prod",
  "vendor": "stripe",
  "daily_usd_cap": 500,
  "allowed_endpoints": [
    "POST /v1/payment_intents",
    "GET /v1/payment_intents/*",
    "GET /v1/customers/*"
  ],
  "expires_at": "2026-09-01T00:00:00Z"
}

# Response
{
  "vault_key": "vault_key_prod_abc123xyz",
  "vendor": "stripe",
  "daily_usd_cap": 500,
  "status": "active"
}

Set this vault key in your agent's environment and the spend cap is enforced server-side — no SDK changes, no try/except wrappers in your tool code. If the agent hits $500 in Stripe charges in one day, subsequent calls return HTTP 429 from the proxy (not from Stripe), and LangSmith will log the tool error for you to investigate.
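
The cap logic itself can be sketched as a small ledger. This is an illustration under stated assumptions, not Keybrake's implementation: DailySpendLedger and try_spend are hypothetical names, the ledger reserves the amount at request time (a real proxy might instead count amounts parsed from Stripe's response), and it rejects the call that would cross the cap rather than the one after it.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict


@dataclass
class DailySpendLedger:
    """Hypothetical per-vault-key ledger; amounts are tracked in cents."""
    daily_cap_cents: int
    spent_by_day: Dict[date, int] = field(default_factory=dict)

    def try_spend(self, amount_cents: int, day: date) -> int:
        """Return 200 and record the spend, or 429 if it would exceed the cap."""
        spent = self.spent_by_day.get(day, 0)
        if spent + amount_cents > self.daily_cap_cents:
            return 429
        self.spent_by_day[day] = spent + amount_cents
        return 200


ledger = DailySpendLedger(daily_cap_cents=50_000)  # $500/day
today = date(2026, 1, 15)

# Twelve $49 charges: ten fit under $500 (10 * 4900 = 49000), two are rejected.
statuses = [ledger.try_spend(4_900, today) for _ in range(12)]
print(statuses.count(200), statuses.count(429))  # 10 2

# The counter is keyed by day, so the next day starts from zero.
print(ledger.try_spend(4_900, date(2026, 1, 16)))  # 200
```

Tracking cents as integers avoids floating-point drift when many small charges accumulate against the cap.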

Setting vault keys per LangSmith project

LangSmith's project concept maps naturally to Keybrake's vault key concept. One LangSmith project, one vault key:

# .env.production
LANGCHAIN_PROJECT=billing-agent-prod
LANGSMITH_API_KEY=ls_prod_...
KEYBRAKE_VAULT_KEY=vault_key_prod_abc123xyz

# .env.staging
LANGCHAIN_PROJECT=billing-agent-staging
LANGSMITH_API_KEY=ls_staging_...
KEYBRAKE_VAULT_KEY=vault_key_staging_def456uvw

With this setup, LangSmith runs for billing-agent-staging and Keybrake logs for vault_key_staging_def456uvw can be correlated by timestamp to reconstruct any run end-to-end.
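
A timestamp join of this kind might look like the sketch below. The record shapes are assumptions: run is a plain dict with start_time and end_time, and each proxy log entry is a dict with a timestamp, which may not match the export format of either system. The slack parameter allows for small clock differences between the two services.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def calls_for_run(run: Dict, proxy_log: List[Dict], slack: timedelta = timedelta(seconds=1)) -> List[Dict]:
    """Return proxy log entries whose timestamp falls inside the run's time window."""
    start = run["start_time"] - slack
    end = run["end_time"] + slack
    return [entry for entry in proxy_log if start <= entry["timestamp"] <= end]


def utc(hour: int, minute: int, second: int) -> datetime:
    # Helper for the sample data below: a fixed day, varying time of day.
    return datetime(2026, 1, 15, hour, minute, second, tzinfo=timezone.utc)


run = {"id": "run-1", "start_time": utc(3, 0, 0), "end_time": utc(3, 0, 4)}
proxy_log = [
    {"timestamp": utc(2, 59, 30), "path": "/stripe/v1/customers/cus_abc123", "status": 200},
    {"timestamp": utc(3, 0, 2), "path": "/stripe/v1/payment_intents", "status": 200},
    {"timestamp": utc(3, 0, 30), "path": "/stripe/v1/payment_intents", "status": 429},
]

matched = calls_for_run(run, proxy_log)
print([(entry["path"], entry["status"]) for entry in matched])
```

Time-window matching becomes ambiguous when two runs on the same vault key overlap, which is the case the metadata approach in the next section addresses.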

Correlating LangSmith traces with Keybrake logs

For deeper correlation, add the LangSmith run ID to your tool's Stripe metadata so it appears in both systems:

from langchain_core.runnables import RunnableConfig

# Reuses stripe, tool, and ChargeInput from the example above.

@tool(args_schema=ChargeInput)
def charge_customer(
    customer_id: str,
    amount_cents: int,
    currency: str = "usd",
    config: RunnableConfig = None,
) -> str:
    """Create a Stripe PaymentIntent for a customer."""
    # Inside a tool, config["callbacks"] is normally a callback manager whose
    # parent_run_id is the ID of this tool run in the LangSmith trace.
    callbacks = config.get("callbacks") if config else None
    parent_run_id = getattr(callbacks, "parent_run_id", None)
    run_id = str(parent_run_id) if parent_run_id else None

    intent = stripe.PaymentIntent.create(
        amount=amount_cents,
        currency=currency,
        customer=customer_id,
        confirm=True,
        automatic_payment_methods={"enabled": True, "allow_redirects": "never"},
        metadata={
            "langsmith_run_id": run_id or "unknown",
            "agent": "billing-agent",
        },
    )
    return f"PaymentIntent {intent.id} status={intent.status}"

The langsmith_run_id is stored in the PaymentIntent's metadata at Stripe and appears in Keybrake's forwarded request body. Given a PaymentIntent or charge ID from a dispute or refund request, you can look up the LangSmith run that created it in seconds.
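
Building the metadata in one helper keeps the keys consistent across tools and guards against Stripe's metadata limits (up to 50 keys, key names up to 40 characters, values up to 500 characters). The sketch below is illustrative: build_charge_metadata and the agent_session_id key are hypothetical names, with the session ID intended for flows that span several runs.

```python
import uuid
from typing import Dict, Optional

MAX_KEYS = 50          # Stripe metadata: maximum number of keys
MAX_KEY_CHARS = 40     # maximum length of a key name
MAX_VALUE_CHARS = 500  # maximum length of a value


def build_charge_metadata(
    run_id: Optional[uuid.UUID],
    session_id: Optional[str] = None,
    agent: str = "billing-agent",
) -> Dict[str, str]:
    """Return a metadata dict for a Stripe call, validated against Stripe's limits."""
    metadata = {
        "langsmith_run_id": str(run_id) if run_id else "unknown",
        "agent": agent,
    }
    if session_id:
        # Hypothetical key: shared across runs that belong to one logical charge flow.
        metadata["agent_session_id"] = session_id
    if len(metadata) > MAX_KEYS:
        raise ValueError("too many metadata keys")
    for key, value in metadata.items():
        if len(key) > MAX_KEY_CHARS or len(value) > MAX_VALUE_CHARS:
            raise ValueError(f"metadata entry too long: {key}")
    return metadata


run_id = uuid.UUID("12345678-1234-5678-1234-567812345678")
print(build_charge_metadata(run_id, session_id="renewal-2026-01"))
print(build_charge_metadata(None))
```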

Gap analysis: what this setup still doesn't cover

LangSmith + Keybrake closes the major observability gap, but a few scenarios still fall through:

  • Stripe webhooks. When Stripe sends a payment_intent.succeeded webhook to your server, that event was not triggered by your agent's HTTP call — it's an inbound request. Neither LangSmith nor Keybrake captures inbound webhooks. You'll want a separate webhook handler with its own logging.
  • Stripe API calls outside the agent. If your server-side code (outside LangChain) also calls Stripe using the same key, Keybrake will capture those calls but LangSmith will not have a corresponding trace. Keep agent Stripe calls and server-side Stripe calls on separate vault keys.
  • LangSmith token cost vs. real cost. LangSmith estimates LLM cost using published model pricing. If you're on a custom contract or a batched pricing tier, LangSmith's cost estimate may differ from your invoice. This is a LangSmith limitation, unrelated to Keybrake.
  • Complex multi-step charges. If one agent run creates a PaymentIntent, a second run confirms it, and a third run captures it, Keybrake logs each call separately, but the LangSmith traces sit under three different run IDs. The metadata approach above links them only if you also write a shared session ID into the metadata of each call.

FAQ

Does adding Keybrake slow down Stripe calls?

The proxy adds 5–15ms of latency for the proxy-to-Stripe leg. Stripe API calls typically take 200–400ms, so the overhead ranges from about 1% (5ms on a 400ms call) to about 8% (15ms on a 200ms call). For billing agents that run offline or asynchronously, this is not perceptible. For latency-sensitive payment flows, benchmark your specific use case first.

Does LangSmith see Keybrake's 429 (spend cap exceeded) errors?

Yes. When the proxy rejects a call because the daily spend cap is hit, the Stripe SDK raises a stripe.error.RateLimitError (exposed as stripe.RateLimitError in recent SDK versions). If your tool lets that exception propagate, LangSmith logs it as a tool error with the full traceback. This is useful: you can set a LangSmith alert on tool errors matching RateLimitError to get notified as soon as an agent hits its spend cap. A genuine Stripe 429 raises the same exception, so check the Keybrake log to tell the two apart.

Can I use LangSmith's dataset and evaluation features alongside Keybrake?

Yes, but carefully. LangSmith lets you replay traced runs as evaluations. When replaying a run that contained a Stripe tool call, the evaluation will hit Keybrake again, and if the vault key has a daily cap, evaluation runs count toward it. Use a separate vault key for evaluation runs, ideally one backed by a Stripe test-mode key or given a very low cap (e.g., $1), so that replays cannot create meaningful real charges.

Does this work with LangGraph agents?

Yes. LangGraph builds on LangChain and is traced by LangSmith automatically. Tool nodes in a LangGraph graph work the same way as LangChain tools — adding the two stripe.* lines is sufficient. The graph's full state machine is visible in LangSmith's trace, and each Stripe call within any node is logged by Keybrake.

Do I need to set up Keybrake's real Stripe key server-side before this works?

Yes. The vault key your agent uses is a credential that maps server-side to your real Stripe key plus a policy. You create the vault key via Keybrake's admin API (or dashboard), specify the real Stripe key, set a daily cap and allowed endpoints, and Keybrake stores the real key encrypted. Your agent code never sees the real Stripe key — only the vault key. Start at proxy.keybrake.com to set up your first vault key.

I use LangSmith's prompt playground to test my agent. Will Stripe calls fire in the playground?

Yes, if your tool is wired to a live vault key. LangSmith's playground runs your chain with real tool calls. To prevent accidental charges during prompt iteration, either (a) use a vault key scoped to Stripe's test mode (with a test Stripe key), or (b) use a vault key with a $0 cap that blocks all calls. Both approaches let you iterate on your prompt logic without risk.

Summary

LangSmith and Keybrake cover complementary blind spots in AI agent observability:

  • LangSmith — LLM reasoning, token cost, chain latency, tool arguments
  • Keybrake — Stripe HTTP calls, charge IDs, real dollar amounts, spend cap enforcement, per-agent isolation

The integration requires two lines of code (stripe.api_key and stripe.api_base) and no changes to your LangSmith setup, agent logic, or tool definitions. Both systems are fully independent and can be added to an existing LangChain codebase incrementally.

If your agent makes Stripe calls today with a full secret key and no spend cap, the next incident is a matter of when, not if. A stuck retry loop, a prompt injection, or a logic bug can create real charges before LangSmith can alert you — because LangSmith doesn't see those charges. Adding Keybrake closes the loop.

Try the proxy — free up to 1,000 requests/month →