Architecture · 10 min read

The 2026 agent governance stack: which proxy goes where

"Agent governance" is not a product. It's a stack with four layers, and most production agents run at least two of them. The layers answer different questions, measure in different units, and fail in different ways. This post maps which proxy goes where, which players live at which layer, and the one header that lets you join an incident across all of them.

Why one proxy isn't enough

The first reaction most teams have, when asked to "govern an agent," is to buy or build a single gateway and route everything through it. This fails for a clean technical reason: the proxies that govern LLM traffic and the proxies that govern SaaS API traffic are shaped differently because the traffic itself is shaped differently. An LLM proxy expects OpenAI-compatible request and response schemas, counts tokens, and derives USD from a token-price table. A SaaS API proxy has to parse vendor-specific response bodies to learn that a call just moved $240, because Stripe doesn't put that number in a standard header.

So teams end up running two proxies, or one proxy and one observability SaaS, or — if they want a paper trail for a compliance reviewer — three services in series. The question stops being "which one do I pick" and becomes "how do they compose, and what falls through the cracks between them."

The four layers

Every production agent stack we've looked at in the last six months is some subset of these four layers:

| Layer | Prevents | Measures in | Example players |
|---|---|---|---|
| 1. LLM traffic | Token over-spend, provider lock-in, missing fallbacks | Tokens → USD | LiteLLM, Portkey, OpenRouter, Bifrost |
| 2. LLM observability | Regression, latency drift, unseen prompt changes | Traces, request-response pairs | Helicone, Langfuse, Traceloop, Arize Phoenix |
| 3. SaaS API governance | Stripe / Twilio / Resend over-spend, blast-radius breach, audit gaps | USD parsed from vendor response | Keybrake (cross-vendor), Stripe Restricted Keys (single-vendor, partial) |
| 4. Agent identity | Cross-tenant attribution, audit fidelity, per-customer scope | Tokens, assertions | Auth0 FGA for AI (emerging), WorkOS AuthKit per-agent, manual service accounts |

Layers 1 and 2 are crowded and mature — "pick one, integrate in 30 minutes." Layer 3 is thinly populated because the problem is vendor-specific: parsing Stripe response bodies is not the same skill as parsing Twilio webhooks. Layer 4 is unshipped in practice; most teams use a shared service-account token and accept the audit gap.

Layer 1 — LLM traffic

This is the layer every team names when you say "agent proxy," because it's the layer that was built first and has the most tooling. The job: intercept calls to api.openai.com, api.anthropic.com, and friends; route to a fallback provider when the primary is down; count tokens per request; convert to USD via a periodically-updated price table; and reject calls when a per-key, per-day, or per-model budget is exceeded.
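The core of that job can be sketched in a few lines. This is a minimal illustration, not any specific proxy's implementation: the price table values and the `KeyBudget` / `charge_for_call` names are hypothetical, standing in for the periodically-updated tables that tools like LiteLLM ship.

```python
from dataclasses import dataclass

# Hypothetical per-1K-token USD prices. Real proxies ship a
# periodically-updated table; these numbers are illustrative only.
PRICE_PER_1K = {
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "claude-sonnet": {"input": 0.003, "output": 0.015},
}

@dataclass
class KeyBudget:
    daily_cap_usd: float
    spent_usd: float = 0.0

def charge_for_call(budget: KeyBudget, model: str,
                    input_tokens: int, output_tokens: int) -> bool:
    """Convert tokens to USD via the price table; reject the call if it
    would exceed the per-key daily cap, otherwise record the spend."""
    prices = PRICE_PER_1K[model]
    cost = (input_tokens / 1000) * prices["input"] + \
           (output_tokens / 1000) * prices["output"]
    if budget.spent_usd + cost > budget.daily_cap_usd:
        return False  # a real proxy would answer 429 / budget-exceeded
    budget.spent_usd += cost
    return True
```

The per-day and per-model variants are the same check keyed differently; the unit is always tokens converted to USD before enforcement.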

LiteLLM is the open-source default. Portkey is the commercial play on the same surface. OpenRouter sells itself as the single aggregator rather than a self-hosted proxy. All three speak OpenAI-compatible endpoints, which is why they compose: your agent SDK doesn't know or care that it's talking to a proxy instead of the model provider.

Where this layer stops: the moment your agent issues a charge, sends an SMS, or dispatches a transactional email, the LLM proxy is blind. The traffic doesn't route through it. The traffic isn't OpenAI-compatible. The cost isn't in tokens. We wrote a longer piece on exactly this misfit at LiteLLM alternative — the short version is that 2026's agent stack is typically dual-proxy: an LLM gateway at Layer 1, and something else at Layer 3.

Layer 2 — LLM observability

Observability is the layer that exists to answer post-hoc questions: "why did p95 latency spike last Tuesday?", "which of our prompts has the worst cache hit rate?", "which agent run produced the output the user complained about?". Helicone, Langfuse, Traceloop, and Arize Phoenix all live here. Many teams run Layer 1 and Layer 2 in series, because they answer different questions: Layer 1 prevents the next bad call, Layer 2 explains the last bad call.

The distinction matters because some observability tools bill themselves as proxies — they sit on the path and log, rather than ingesting logs out-of-band. That's a deployment choice, not a category one: "dashboard-first" is still different from "policy-first." We drew out this split in detail at Helicone alternative: Helicone's home screen is a chart of what happened; Keybrake's home screen is a policy editor for what can't happen.

If your agent never touches money-moving APIs, Layer 1 plus Layer 2 is a complete stack. Most teams at that scope stop here and skip directly to shipping.

Layer 3 — SaaS API governance

This is the layer most teams don't realise they need until the incident has already happened. It sits between the agent and every non-LLM API the agent calls: Stripe, Twilio, Resend, Shopify Admin, Postmark, Segment, and the thirty others. Its job is the same shape as Layer 1 — enforce a cap, allowlist paths, block disallowed scope, log the call — but the mechanics are different.

Units are different. A Stripe charge response carries amount in cents and currency as a string. A Twilio SMS response has price as a pre-negotiated float. Resend has a flat per-email rate. There is no OpenAI-compatible wrapper for these; a governance proxy has to speak each vendor natively and parse each response body to learn what the call cost.
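Concretely, "speak each vendor natively" means per-vendor parsing logic. A sketch, with the caveat that the Stripe `amount` and Twilio `price` fields follow their public response shapes, while the Resend flat rate here is an assumed illustrative number:

```python
def parsed_cost_usd(vendor: str, response_body: dict) -> float:
    """Normalise a vendor-specific response body to a USD cost figure."""
    if vendor == "stripe":
        # Stripe charges carry an integer amount in the currency's
        # smallest unit (cents for USD).
        return response_body["amount"] / 100
    if vendor == "twilio":
        # Twilio message resources report price as a string,
        # negative from the account's perspective, e.g. "-0.0079".
        return abs(float(response_body["price"]))
    if vendor == "resend":
        # Flat per-email rate (assumed value, for illustration only).
        return 0.001
    raise ValueError(f"no cost parser for vendor {vendor!r}")
```

Three vendors, three parsers, one output column. That output column is what makes a cross-vendor USD cap enforceable at all.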

Scopes are different. "Which models can this agent call" maps to a list. "Which customers can this agent issue refunds to" maps to a query against a live customer table. Layer 1's allowlist vocabulary doesn't cover it. Neither do Stripe's own Restricted Keys — they toggle resources (Charges:Write), not records (only customers created by this agent). We laid out the full gap matrix on Stripe Agent Toolkit over MCP: of fourteen default tools, one (create_charge) is Critical blast-radius, and none of the native controls let you bound it per-customer.
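The difference between resource-level and record-level scope fits in a few lines. This is a deliberately toy sketch — the table and names (`AGENT_CREATED_CUSTOMERS`, `refund_allowed`) are hypothetical, and in practice the lookup is a query against a live customer table rather than an in-memory dict:

```python
# Record-level scope: the proxy tracks which customer records each
# agent created, and bounds refunds to exactly those records.
AGENT_CREATED_CUSTOMERS = {
    "agent-7": {"cus_A1", "cus_B2"},
}

def refund_allowed(agent_id: str, customer_id: str) -> bool:
    """A resource-level key (Charges:Write) would pass for any customer.
    This record-level check passes only for customers this agent created."""
    return customer_id in AGENT_CREATED_CUSTOMERS.get(agent_id, set())
```

A resource-level toggle can only answer "can this key write charges"; the record-level check answers "can this key write *this* charge", which is the question the incident report actually asks.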

Revocation latency is different. When an LLM agent misbehaves, you rotate an API key and the next request 401s. When a payments agent misbehaves, the "revoke" pathway depends on whether you're rotating the upstream Stripe key (median 45s, p95 3m12s per our kill-switch measurements) or flipping a flag in a proxy database (sub-second, effective on the next request). For anyone running a real incident playbook, that delta is the difference between "we lost $200" and "we lost $14,000."

Keybrake is a Layer 3 proxy. That's the whole company. The two-line elevator version: scoped vault keys with per-vendor USD caps, endpoint allowlists, customer scope, sub-second revoke, and a per-call audit table with parsed cost. Three vendors at v1 (Stripe, Twilio, Resend), one policy schema, one audit table, one base-URL change in your agent.

Layer 4 — Agent identity

This is the layer that almost nobody has in production, and the layer that almost everybody will have in 2027. It answers a different question from the other three: not "what is this agent allowed to do?" but "who is this agent acting for, and can it prove it?"

Today, the typical pattern is a shared service-account token — every agent run uses the same API key, which means your audit table has an agent_run_id but no linkable "customer-on-whose-behalf" column. That's fine for a single-tenant product. It breaks immediately when you're running an agent on behalf of N customers and one of them disputes a charge: "show me every action your agent took on my account" becomes a fuzzy query over request bodies.

The emerging players here are Auth0's FGA for AI, WorkOS's per-agent tokens, and the identity pieces folded into Anthropic's computer-use billing attribution. None of them are category-defining yet. Most serious teams skip Layer 4 at v1, accept the audit imprecision, and revisit when compliance starts asking hard questions. That's a defensible choice — but the imprecision should be on the record, not unexamined.

How the layers compose

Here is the shape of a production agent stack that uses all four layers:

               ┌──────────────────────────────────────────┐
               │         AGENT RUNTIME (x-agent-run-id)   │
               └──────────────────────────────────────────┘
                        │                          │
                        ▼                          ▼
        ┌───────────────────────┐   ┌───────────────────────────────┐
        │ LAYER 1: LLM proxy    │   │ LAYER 3: SaaS API proxy       │
        │  LiteLLM / Portkey    │   │  Keybrake                     │
        │  • token cap          │   │  • per-vendor USD cap         │
        │  • fallback routing   │   │  • endpoint / customer scope  │
        └──────────┬────────────┘   │  • sub-second revoke          │
                   │                └───────────────┬───────────────┘
                   ▼                                ▼
           OpenAI / Anthropic              Stripe / Twilio / Resend
                   │                                │
                   └─────────┐      ┌───────────────┘
                             ▼      ▼
                  ┌────────────────────────┐
                  │ LAYER 2: observability │
                  │  Helicone / Langfuse   │
                  │  (ingests traces       │
                  │   from L1 + L3)        │
                  └────────────────────────┘

The join key is x-agent-run-id: one UUID per agent run, set by the agent before it makes its first call, attached as a header to every request to every layer. Each layer logs that UUID with its own rows. When an incident happens, you query across all three databases on the same UUID and get a complete trace — LLM call at 12:04:01.204Z, tool call decision, SaaS API call at 12:04:01.612Z, policy verdict, cost.

This is the cheapest coordination mechanism available. It doesn't require shared schemas or a central orchestrator. It requires one header everyone agrees to set and log. Our audit columns at Layer 3 have agent_run_id as a first-class index; Helicone has the same; LiteLLM passes arbitrary headers through. It composes.
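Mechanically, the join is unglamorous. A sketch of both halves — minting the header at the start of a run, and the manual cross-layer query during an incident (the row shapes and `join_incident` name are illustrative, not any product's schema):

```python
import uuid

def new_run_headers() -> dict:
    """One UUID per agent run, attached to every request at every layer."""
    return {"x-agent-run-id": str(uuid.uuid4())}

def join_incident(run_id: str, *layer_logs: list) -> list:
    """The manual cross-layer join: filter each layer's rows on the
    shared run id and merge into one timeline, ordered by timestamp."""
    rows = [r for log in layer_logs for r in log
            if r.get("agent_run_id") == run_id]
    return sorted(rows, key=lambda r: r["ts"])
```

In production each `layer_logs` argument is a query against a different database rather than an in-memory list, but the join key and the shape of the result are the same.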

Which layers do you actually need?

The decision collapses to which traffic your agent touches:

- Model calls only → Layer 1 plus Layer 2 is a complete stack; ship.
- Money-moving SaaS APIs → Layer 3 first; the rest can wait a week.
- Both, which is where most teams end up → dual-proxy: an LLM gateway at Layer 1, a SaaS governance proxy at Layer 3, observability at Layer 2.
- Multi-tenant agents facing a compliance reviewer → consider Layer 4, or at least put the audit imprecision on the record.

If you're at the first step and wondering which Layer 1 option to pick, the open-source review we wrote at LiteLLM alternatives, open source maps five of them by the axes that actually matter (provider coverage, self-hosted footprint, policy DSL, throughput under load).

Where the gaps still are

Three gaps in the current stack, all of them unshipped and all of them real:

Cross-layer correlation tooling. When an incident crosses Layer 1 and Layer 3, you're writing a manual join query across two databases, on agent_run_id. No dashboard ingests both yet. Helicone could grow into this but today it's LLM-only. The composability works, but the UI for it doesn't exist yet.

A shared policy language. LiteLLM has its own policy DSL (YAML with model allowlists, budget units in tokens). Portkey has another (a config graph with guardrails). Keybrake has a third (vendor-aware policy JSON with customer scope and endpoint allowlists). None of them interchange. A team running both layers maintains two policy files for what is conceptually the same agent's allowed behaviour. This is a standards problem that nobody has the incentive to fix yet.

Pre-call cost prediction for SaaS. Layer 1 knows roughly what a call will cost before it makes it — OpenAI publishes token prices, and the proxy can estimate upward from prompt tokens. Layer 3 mostly can't. A Stripe charge response tells you what happened, not what's about to happen. Twilio and Resend have pricing tables you can join against request parameters; Stripe charge amounts are client-specified. So the "reject before the call" mode works for some Layer 3 vendors and becomes "reject if this would push us over" only after the vendor confirms.
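The asymmetry can be made concrete. In this sketch the rates are assumed illustrative numbers, not published prices, and the honest answer for a Stripe charge is `None` — per the gap above, enforcement there falls back to "reject if this would push us over" once the vendor confirms:

```python
from typing import Optional

SMS_RATE_USD = 0.0079    # per outbound SMS segment (assumed rate)
EMAIL_RATE_USD = 0.001   # flat per email (assumed rate)

def estimate_before_call(vendor: str, request: dict) -> Optional[float]:
    """Pre-call cost estimate where one is possible: Twilio and Resend
    costs follow from request parameters joined against a pricing table;
    a Stripe charge has no such table to join against, so the estimate
    is unknown until the vendor's response confirms the cost."""
    if vendor == "twilio":
        return request.get("segments", 1) * SMS_RATE_USD
    if vendor == "resend":
        return len(request.get("to", [])) * EMAIL_RATE_USD
    if vendor == "stripe":
        return None  # post-call enforcement only
    raise ValueError(f"unknown vendor {vendor!r}")
```

A `None` from the estimator is what forces the weaker enforcement mode; everything that returns a number can be rejected before any money moves.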

None of these gaps should stop you from shipping. They're the direction the category will evolve in over the next twelve months.

If you're assembling this stack now

Start with the layer that matches your blast radius. If your agent moves money, that's Layer 3 and the rest can wait a week. If it only calls models, that's Layer 1 plus Layer 2 and you can be live by Friday. If it does both — which is where most teams end up — pick up Keybrake for the money-moving surface and compose it with whatever LLM proxy your team already likes. Join the runs on x-agent-run-id. Revisit in three months with the real incident data in hand.

The stack is younger than it looks. Most of it will be standardised, simplified, and partially absorbed into the vendors themselves over the next eighteen months. Your job right now is to pick composable pieces, keep the join key clean, and not paint yourself into a corner where one layer owns data another layer needs to see.

Get Keybrake when v1 ships

Pre-launch waitlist for the SaaS-API governance layer. We'll email you a vault key when the proxy is live, with a working code sample for Stripe, Twilio, and Resend.