Agent governance · Tool list
AI agent governance tools: opinionated shortlist per layer for 2026
"AI agent governance tools" is a query that returns a vendor sea — five different categories of product, each lobbying to own the whole term. This page names them all, sorts them into the four layers governance actually contains, and gives an opinionated pick plus one runner-up per layer. Skip the vendor sea: read the four-tool minimum stack in the quick-reference table below and you have what a production agent needs.
TL;DR
Governance contains four layers, each with its own tool category. Layer 1 (identity): pick from your existing secret store — 1Password / AWS Secrets Manager / HashiCorp Vault / Doppler / Infisical. Layer 2 (runtime policy enforcement): split by axis — LiteLLM (or Portkey / Bifrost / Helicone) on the LLM axis, Keybrake on the SaaS-tool axis. Layer 3 (audit and cost): Langfuse for LLM traces, your Layer-2 proxy emits the SaaS-tool rows, join on agent_run_id. Layer 4 (post-hoc evaluation): Promptfoo if open-source, Lakera Guard or Lasso Security if commercial. Total: four named tools, joined on one ID. The vendors that pitch a unified "platform" cover one or two of these layers and skip Layer 2 — see the governance platform rebuttal for why.
The four layers, named in one paragraph
Skip this section if you already know the layer model from the platform rebuttal page. For the rest: Layer 1 is who the agent is and what credentials it carries. Layer 2 is what the agent can do at the moment of action — per-day USD caps, endpoint allowlists, customer scope, mid-run revoke. Layer 3 is what the agent actually did, joined on agent_run_id. Layer 4 is whether the output was safe / correct / on-brand, evaluated post-hoc. Different surfaces, different tools. The error this page exists to prevent is buying one tool and assuming "governance" is now solved — it never is, because no single tool covers more than two of the four layers.
Quick reference — the four-tool minimum stack
For the impatient: this is the table you want. One row per layer; the pick column is what we'd buy on day one for a 10-50 person team running production agents that touch money-moving SaaS APIs.
| Layer | Job | Pick | Runner-up | Open-source |
|---|---|---|---|---|
| L1 — Identity | Store and rotate per-agent credentials | 1Password Secrets Automation | Doppler | HashiCorp Vault, Infisical |
| L2 — Runtime policy (LLM axis) | Token cap, model fallback, prompt-level guard | LiteLLM Proxy | Portkey Gateway | LiteLLM, Portkey, Bifrost, LangGate |
| L2 — Runtime policy (SaaS-tool axis) | USD cap, endpoint allowlist, customer scope, revoke on Stripe / Twilio / Resend / Shopify | Keybrake | DIY SDK wrapper | agentgateway.dev (alpha) |
| L3 — Audit and cost | Per-call log joined on agent_run_id with parsed cost | Langfuse (rows emitted by the L2 proxies) | Helicone | Langfuse, Phoenix (Arize), OpenLIT, Laminar |
| L4 — Post-hoc evaluation | Output safety / correctness / regression scores | Promptfoo | Lakera Guard (commercial) | Promptfoo, Garak, DeepEval, Ragas |
Total monthly spend for a 10-50 person team running this exact stack: roughly $200-$500 on the proxies and gateways combined (free tiers cover most teams under 100k requests/month), plus whatever Layer-4 commitments your industry's compliance regime forces. Self-hosting all five is also viable — every "open-source" entry above is genuinely OSS-licensed, not source-available with a commercial gate. The cost of buying nothing is the worst-case figure: $3.24M / day for a stuck Stripe refund loop at 216,000 calls/day × $15/refund — the math is in our three-axis cost decomposition.
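The worst-case arithmetic is worth seeing worked through once. A sketch of the figure quoted above — the 150 calls/minute loop rate is an assumption chosen to be consistent with 216,000 calls/day, and the $500 cap is an illustrative Layer-2 setting, not a recommendation:

```python
# Worked version of the worst-case figure quoted above.
calls_per_minute = 150                        # assumed stuck-retry-loop rate
calls_per_day = calls_per_minute * 60 * 24    # = 216,000 calls/day
cost_per_refund_usd = 15.0

uncapped_daily_loss = calls_per_day * cost_per_refund_usd
print(f"uncapped: ${uncapped_daily_loss:,.0f}/day")   # uncapped: $3,240,000/day

# With a Layer-2 per-day USD cap, the loss is bounded by the cap,
# not by how long the loop runs before a human notices.
daily_cap_usd = 500.0
capped_daily_loss = min(uncapped_daily_loss, daily_cap_usd)
print(f"capped:   ${capped_daily_loss:,.0f}/day")
```

The point of the sketch: without a cap the loss scales with loop duration; with one, it is a constant you chose in advance.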
Layer 1 — Identity tools
Layer 1 is the boring layer, and that's a feature. The tools here are mature, well-understood, and almost universally already in place at any company past 20 engineers. The agent-governance angle on Layer 1 is narrow: can you issue a per-agent or per-run credential, and rotate it without code changes? If yes, your existing tool is sufficient. If no, you have a Layer-1 gap to fill before any other layer matters.
The named tools
- 1Password Secrets Automation — our pick. Reads from the same vault humans already use; service accounts handle agent identity. The "rotate" story is via Connect + a script in your deploy job. Best when humans already use 1Password (which is most product companies).
- HashiCorp Vault — the canonical OSS choice, sufficient for any scale, but you operate it. The dynamic-secrets engine (issue a database credential valid for 1h, then auto-revoke) is genuinely the best Layer-1 control on the market — most teams underuse it. If your platform team already runs Vault for non-agent workloads, the agent extension is the right call.
- AWS Secrets Manager — fine if you're an AWS shop. The rotation Lambda pattern works for vendors with rotation hooks (Stripe, GitHub apps, etc.) but is awkward for vendors without (Twilio, Resend). Pair with IAM role-per-agent for clean attribution.
- Doppler — strong runner-up for teams not already on 1Password or Vault. Per-environment secret sync, decent UI, decent CLI. The "service tokens" feature is the natural per-agent identity.
- Infisical — the up-and-coming OSS option. Self-host or hosted. Has a per-agent identity primitive and decent rotation story. Newer than Doppler/Vault, fewer integrations, but moving fast.
- WorkOS Agents (preview, 2026) — explicitly framed as agent identity infrastructure. SCIM-style provisioning for agents. Worth watching but too new to bet on for a current incident-prone production system.
- Auth0 Fine Grained Authorization — Auth0's policy engine, useful if your agent calls authenticated APIs you control. Doesn't help with third-party API keys (Stripe, Twilio) — that's Layer 2.
What Layer 1 alone does not solve
The most common Layer-1 illusion: "we issued a Stripe Restricted Key per agent, so we have governance covered." You don't. Restricted Keys are a great Layer-1 artefact and a partial Layer-2 enforcement (path-level scope), but they have no daily USD cap, no parameter-level allowlist on amount, no sub-second revoke without a human in the dashboard. The runaway-loop scenarios live a layer up from where Restricted Keys operate. The "does Stripe's native feature cover X" page walks through exactly which controls Layer 1 misses.
Layer 2 — Runtime policy enforcement tools
This is the layer that catches incidents and the layer where the vendor categories are most confused. There is no single Layer-2 product because Layer-2 traffic splits cleanly into two axes — LLM API traffic and SaaS-tool API traffic — and the technical work to govern each is different (different cost-data sources, different cap units, different fallback grammar). The right mental model is two proxies, in series, each governing one axis. The 2026 governance stack post has the diagram.
Layer 2 — LLM axis tools
These sit between the agent and OpenAI / Anthropic / Bedrock / your self-hosted vLLM / etc. They cap tokens, route between models, retry on 5xx, and produce model-call traces. The category has high vendor density and most options are good.
- LiteLLM Proxy — our pick for the LLM axis. Genuine OSS, OpenAI-compatible, multi-model, virtual keys, daily token caps, fallback chains, Postgres-backed audit. The standalone-server form (vs the SDK form) is the right Layer-2 shape. We have detailed posts on LiteLLM alternatives for teams that need different tradeoffs.
- Portkey Gateway — strong runner-up. Node/Hono-based, MIT licence, the most expressive routing config (YAML conditionals across model, latency, cost). Pick this over LiteLLM if your routing logic is complex.
- Bifrost — Go-based, Apache 2.0. ~10x throughput at high QPS vs Python proxies. Pick this if you're hitting Python GIL pain on LiteLLM at >1k req/s.
- LangGate — Python Kubernetes operator, Apache 2.0. CRD-driven config. Pick this if your stack is heavily K8s-native and you want declarative gateway config.
- Helicone — observability-first, with policy as a side feature. Pick this if your immediate need is "show me what the agent is doing" rather than "stop the agent doing it." Has a decent free tier.
- Envoy AI Gateway / Kong AI Gateway / Apache APISIX ai-proxy — these are LLM plugins for existing API gateways. Pick if your platform team already operates Envoy/Kong/APISIX and you want to consolidate. Don't pick if you don't already operate these — the operational tax of the host gateway dwarfs the agent governance work.
- OpenRouter — model marketplace with proxy semantics. Useful for cost arbitrage across models; weak on policy enforcement. Pair with one of the above, don't replace.
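The fallback-chain behaviour all of these gateways implement can be sketched generically. This is not any vendor's API — the model callables are placeholders, and a real gateway also distinguishes retryable errors (429/5xx) from fatal ones before moving down the chain:

```python
def call_with_fallback(prompt, chain):
    """Try each model in order; fall through when one fails.

    `chain` is a list of (name, callable) pairs. Real gateways run
    this logic inside the proxy, driven by declarative config.
    """
    errors = []
    for name, model in chain:
        try:
            return name, model(prompt)
        except Exception as exc:
            errors.append((name, exc))
    raise RuntimeError(f"all models in chain failed: {errors}")

def flaky_primary(prompt):
    raise TimeoutError("upstream 504")      # simulate a provider outage

def stable_fallback(prompt):
    return f"ok: {prompt}"

used, reply = call_with_fallback("summarise this", [
    ("primary", flaky_primary),
    ("fallback", stable_fallback),
])
print(used, reply)   # fallback ok: summarise this
```

The reason this belongs in the proxy rather than the agent: every agent in the fleet gets the same chain, and the audit row records which model actually answered.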
Layer 2 — SaaS-tool axis tools
These sit between the agent and Stripe / Twilio / Resend / Shopify / Postmark / Segment / etc. They enforce per-vendor USD cap, endpoint allowlist, customer scope, parameter-level allowlist, mid-run revoke. The vendor density here is much lower — this is a category in formation. You have three viable choices, ordered by maturity.
- Keybrake — our own pick because we built it for exactly this layer, and there is no honest competitor with the same scope. Sits in front of Stripe, Twilio, Resend (v1, with Shopify and Postmark in the next milestone). Per-vendor USD caps, endpoint allowlist, customer-scope on Stripe Customer, parameter allowlist on the `amount` field, sub-second mid-run revoke (propagation latency measured per vendor). Free tier covers 1,000 proxied requests / month, 1 vendor, 7-day audit retention.
- DIY SDK wrapper — runner-up because it actually works for one vendor. Wrap the Stripe SDK in a thin policy layer in your own service; check daily USD spent against a Redis counter; reject if over cap. Cheapest path for one vendor; gets expensive when you need to repeat it for Twilio, Resend, Shopify (each has a different cost-data shape — our Stripe-key implementation walkthrough sketches the SDK-wrapper version). The build-vs-buy decision is per-team; most teams underestimate the per-vendor parser work.
- agentgateway.dev — the closest open-source attempt at the SaaS-axis Layer 2 we've seen. Alpha-stage as of April 2026, narrow vendor coverage, MIT-licensed. Worth watching; not yet production-grade for incident-sensitive deployments.
- "Just use a feature flag" — what we hear most often as a "DIY" answer. Feature flags are a Layer-2 control only for your own first-party services. They don't enforce a USD cap on Stripe; they only stop your code from calling Stripe, which is a strictly weaker guarantee (the agent can still call Stripe through any code path that doesn't check the flag, including library code).
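The DIY wrapper from the list above can be sketched in a few lines. Everything here is illustrative — the class name, cap value, and `refund` helper are ours, not from any SDK; in production the counter lives in Redis (`INCRBYFLOAT` on a per-day key with a TTL) so every replica sees the same running total, but an in-memory dict stands in so the sketch runs standalone:

```python
import datetime

class DailyUsdCap:
    """Reject vendor calls once today's spend exceeds a USD cap."""

    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self._spent: dict[str, float] = {}   # stand-in for Redis

    def _key(self) -> str:
        return f"spend:{datetime.date.today().isoformat()}"

    def try_spend(self, amount_usd: float) -> bool:
        key = self._key()
        projected = self._spent.get(key, 0.0) + amount_usd
        if projected > self.cap_usd:
            return False      # over cap: caller must NOT hit the vendor API
        self._spent[key] = projected
        return True

cap = DailyUsdCap(cap_usd=50.0)

def refund(amount_usd: float) -> str:
    # Policy check BEFORE the vendor call (e.g. before stripe.Refund.create)
    if not cap.try_spend(amount_usd):
        return "rejected: daily USD cap reached"
    return "refunded"   # real code would call the Stripe SDK here

print(refund(30.0))   # refunded
print(refund(15.0))   # refunded
print(refund(15.0))   # rejected: daily USD cap reached
```

This is also where the "one vendor" caveat bites: the `amount_usd` input is trivial for Stripe refunds but has to be parsed from a different response shape for Twilio, Resend, and Shopify.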
What the SaaS-axis Layer 2 explicitly does not include: the LLM gateways listed above. LiteLLM is not a Stripe proxy — it speaks OpenAI's wire format and parses tokens, neither of which describes a Stripe charge. Pointing an LLM gateway at api.stripe.com fails on path schemas, response parsing, and auth envelope. Use both proxies, don't try to reuse one.
Layer 3 — Audit and cost tools
Layer 3 is where the rows live. The output is a single queryable table whose rows are individual API calls, joined to a per-run grouping key (agent_run_id) so you can ask "what did run run_2026_04_30_8a3f spend?" and get one number across all vendors. Most Layer-3 capability falls out of Layer 2 for free — the proxy that enforces the cap is the natural place to record the call. The remainder is the join: rows from the LLM gateway plus rows from the SaaS-tool proxy plus identity attribution from Layer 1, all on the same key.
The named tools
- Langfuse — our pick. OSS, self-hostable, multi-vendor SDK support (Python, JS, plus OpenTelemetry), the trace-and-span model is genuinely well-suited to agent runs (parent run = trace, individual API call = span). The dashboard shows per-trace cost; queryable via SQL on the underlying Postgres.
- Helicone — strong runner-up, especially if you're already using it for Layer 2. Tracks LLM calls natively; SaaS-tool calls require manual instrumentation. Good for teams that want a managed offering and only care about the LLM axis.
- Phoenix (Arize) — OSS, OpenInference-based, strong evaluation integration. Pick if you want Layer 3 + Layer 4 in one tool; weaker on the per-vendor cost parsing.
- OpenLIT — OpenTelemetry-native, OSS. Pick if your existing observability stack is already OTel and you want to add LLM traces without a separate backend.
- Laminar — OSS, Rust-based, OpenTelemetry-compatible. Newer than Langfuse but moving fast; per-call cost parsing for major LLM providers built in.
- Datadog LLM Observability / New Relic AI Monitoring — fine if you already pay these vendors. Expensive at agent-scale request volumes, no per-vendor SaaS-tool support yet.
- Your own Postgres / SQLite + the agent_run_id join — entirely viable for Layer 3 alone. Our four-column MVP schema is `agent_run_id` + `policy_verdict` + `cost_usd_parsed` + `customer_scope_id`. The full reference is the sixteen-column post. Many teams don't need a vendor product here — a table and a cron query is sufficient.
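The four-column MVP schema as runnable SQL, with the per-run rollup it exists to answer. Column names are from the text; the sample rows are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE audit (
        agent_run_id      TEXT NOT NULL,   -- join key, minted by the agent
        policy_verdict    TEXT NOT NULL,   -- 'allow' | 'deny'
        cost_usd_parsed   REAL NOT NULL,   -- parsed from the vendor response
        customer_scope_id TEXT             -- e.g. a Stripe Customer id
    )
""")
con.executemany(
    "INSERT INTO audit VALUES (?, ?, ?, ?)",
    [
        ("run_2026_04_30_8a3f", "allow", 15.00, "cus_123"),  # Stripe refund
        ("run_2026_04_30_8a3f", "allow",  0.04, None),       # LLM call
        ("run_2026_04_30_8a3f", "deny",   0.00, "cus_123"),  # cap hit, no spend
    ],
)
total, = con.execute(
    "SELECT SUM(cost_usd_parsed) FROM audit WHERE agent_run_id = ?",
    ("run_2026_04_30_8a3f",),
).fetchone()
print(f"run spend: ${total:.2f}")   # run spend: $15.04
```

Note the denied row still gets logged at zero cost — at incident time, "what did the cap reject?" is as useful a question as "what got through?".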
The agent_run_id join
The Layer-3 capability that actually matters at incident time is "what did one specific agent run spend, across all vendors?" That requires a join on a key emitted by the agent and forwarded by every proxy in the chain. The header convention we recommend (and which Keybrake propagates from incoming requests) is x-agent-run-id: run_2026_04_30_8a3f. Set it once at the top of your agent, propagate through your LLM gateway, propagate through your SaaS-tool proxy, log it everywhere. The result: SELECT vendor, SUM(cost_usd_parsed) FROM audit WHERE agent_run_id = ? gives you the per-run total. Without this, you have rows from each proxy that can't be correlated, and the runaway-loop postmortem takes hours instead of minutes.
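The propagation discipline described above, as a sketch: mint the id once, attach it to every outbound call, and have each proxy in the chain log a row keyed on it. The two stand-in "proxy" functions are placeholders for the real gateways:

```python
import datetime
import uuid

def new_run_id() -> str:
    # e.g. run_2026_04_30_8a3f — date plus a short random suffix
    today = datetime.date.today().strftime("%Y_%m_%d")
    return f"run_{today}_{uuid.uuid4().hex[:4]}"

RUN_ID = new_run_id()
HEADERS = {"x-agent-run-id": RUN_ID}   # attach to EVERY outbound call

# Each proxy forwards the header and writes an audit row keyed on it;
# here two stand-ins just record what they saw.
audit_rows = []

def llm_gateway(headers):       # stands in for the LLM-axis proxy
    audit_rows.append(("llm", headers["x-agent-run-id"]))

def saas_proxy(headers):        # stands in for the SaaS-tool-axis proxy
    audit_rows.append(("stripe", headers["x-agent-run-id"]))

llm_gateway(HEADERS)
saas_proxy(HEADERS)

# Every row carries the same key, so the per-run SUM query joins cleanly.
assert all(run_id == RUN_ID for _, run_id in audit_rows)
```

The failure mode to avoid is minting the id inside one proxy: then the other proxy's rows carry a different key and the join silently returns a partial total.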
Layer 4 — Post-hoc evaluation tools
This is the layer with the highest vendor density and the most "governance platform" marketing. Pick is more about your compliance regime than tool quality — the OSS options are excellent for teams without specific regulatory requirements, and the commercial vendors earn their licence fees by mapping their controls onto specific regulatory frameworks (EU AI Act, NIST AI RMF, ISO 42001).
Open-source picks
- Promptfoo — our pick if you can self-host. CLI-first, declarative test files, supports red-teaming and unit-test-style assertions. The "test your prompts in CI" workflow generalises naturally to agent runs.
- Garak — pick for adversarial / red-team evaluation specifically. Hundreds of probes for prompt injection, jailbreaks, data leaks. Pair with Promptfoo for end-to-end coverage.
- DeepEval — pytest-style framework for LLM evaluation. Pick if your team prefers a code-first / unit-test-aligned workflow over Promptfoo's YAML.
- Ragas — RAG-specific evaluation. Faithfulness, answer-relevance, context-precision. Pick if your agent is dominantly retrieval-augmented and you need RAG-specific scores.
- OpenLIT / Langfuse evals — both bundle evaluation alongside Layer 3. Worth using if you're already on the parent tool; not worth picking the parent tool just for the eval module.
Commercial picks
- Lakera Guard — pick for production-grade prompt-injection detection with low false-positive rate. The closest commercial option to a real-time Layer-4 control (it can sit inline and block bad prompts pre-model). Pricing is per-1k requests; gets expensive at agent-scale traffic, but the SLA is real.
- Lasso Security — runner-up. Strong on the dashboard and on integrations with secret-stores. Has a "shield" mode that's a partial Layer-2 control on prompt-level attacks specifically.
- CalypsoAI — pick if your buyer is an enterprise CISO and the deliverable is a regulator-friendly risk register. Strong audit-export story.
- Robust Intelligence (Cisco) — pick for enterprise-grade red-teaming and continuous validation. The model-eval scorecards are best-in-class. Heavyweight; not appropriate for a 10-person team.
- Credo AI / Holistic AI — pick if your governance epic is driven primarily by EU AI Act compliance. These are governance-frameworks-as-a-service rather than runtime controls.
The rule of thumb across Layer 4: if your immediate worry is incident prevention, pick OSS — Promptfoo + Garak gives you 80% of the value for $0. If your immediate worry is regulatory compliance, pick commercial — the evidence-export and audit-trail features are what you're paying for. Neither addresses Layer 2; the governance-platform rebuttal has the longer version of why these vendors don't extend into the runtime policy layer.
Cross-layer / orchestration tools
A handful of tools don't fit one layer cleanly because they touch the workflow that produces agent runs in the first place. They are not governance tools per se, but they shape what governance you can do.
- Inngest / Temporal — durable execution platforms for agent workflows. Useful because they emit a `run_id` for free that can become your `agent_run_id`. Don't enforce policy themselves; enable it by giving you a stable identifier.
- OpenTelemetry — the standards-based way to wire Layer 3 across all proxies and your own services. Pick this over a vendor-specific tracer if you want lock-in protection or already run an OTel collector.
- OpenLLMetry — Traceloop's OTel-on-LLM project. Auto-instruments most LLM SDKs to emit OTel spans with cost data. Pair with Langfuse, Phoenix, or your existing OTel backend.
- Anthropic Computer Use sandboxing / E2B / Daytona — these are containerisation primitives for agents that run shell commands or browser interactions. Layer-1.5 in a sense — they constrain the blast radius of code execution but don't help with API-call governance. Worth knowing about; orthogonal to the four-layer model.
What we run for Keybrake's own production stack
Eat-our-own-dogfood disclosure. Keybrake runs the dual-proxy + audit + eval stack on a single VPS:
- Layer 1 — 1Password Secrets Automation. Per-agent service tokens; rotated quarterly; deploy job pulls live values via Connect.
- Layer 2 (LLM axis) — LiteLLM Proxy, self-hosted, in front of the Anthropic and OpenAI calls our own product makes (e.g. when an internal agent posts a build-in-public update on X). Daily token cap of 200k tokens per agent service token.
- Layer 2 (SaaS-tool axis) — Keybrake itself, on the same VPS. The "self-hosted in production" deployment shape is the one our docs assume; the hosted version is what we're building for paying customers.
- Layer 3 — SQLite + the four-column MVP schema, plus a nightly export to a read-only Postgres for richer queries. We will graduate to Langfuse once we cross 1k requests/day; today the SQLite-and-cron setup is sufficient.
- Layer 4 — Promptfoo in CI for the prompts our internal build-in-public agent uses. No commercial Layer-4 tool today; we'll revisit when we have a customer asking for an SOC 2 Type II report.
This is more honest than aspirational — the production system we describe in the rest of our marketing is exactly this. The omission of a commercial Layer-4 tool is deliberate: pre-revenue and pre-compliance, the cost-benefit isn't there. A team in a regulated industry would pick differently and we'd recommend they do.
Get early access to Keybrake (Layer 2 — SaaS-tool axis)
What to skip
The "AI agent governance tools" search returns several categories of result that we don't think belong on a shortlist for a production agent. Brief disclosure of what we considered and why we left it off:
- Generic API gateways without LLM plugins (Tyk, Apigee, AWS API Gateway). Fine for human-traffic governance; do not parse cost from LLM responses; do not enforce per-call USD caps on Stripe. Use one only if you already operate it for non-agent workloads.
- Compliance-management SaaS (Vanta, Drata, Secureframe). These help you produce audit evidence; they do not enforce a Layer-2 cap. Different problem.
- Generic application firewalls (Cloudflare, Imperva). Path-level rules for human-shaped traffic. Cannot inspect a Stripe charge amount. Can block IPs at scale, which is occasionally useful as a panic button — not a governance tool.
- "Agent platforms" that bundle their own proxy (some autonomous-agent frameworks ship a built-in HTTP layer that claims governance). The bundled controls are usually rudimentary (binary on/off per tool), not the per-vendor caps and customer-scope you need at incident time. Use the agent framework for orchestration; pair with an actual Layer-2 proxy for governance.
- Tools currently in alpha (most of the WorkOS Agents / Skyvern Pay / x402 cohort). Worth tracking; not appropriate for production deployment until v1 with measured propagation latency. Don't bet the runaway-loop incident on alpha software.
Related questions
What's the absolute minimum I can ship with?
One Layer-2 proxy on whichever axis matters most for your agent. If your agent calls Stripe, that's the SaaS-axis proxy (Keybrake or DIY); if your agent only calls LLMs, that's an LLM gateway (LiteLLM or Portkey). Layer 3 falls out of Layer 2 for free if you log every call. Layer 1 is whatever your secret store already is. Layer 4 is deferrable until you have a specific compliance ask. The minimum minimum is one proxy and a SQLite table. Everything else compounds value but isn't strictly necessary on day one.
I already use Datadog / New Relic / Honeycomb. Do I need a separate Layer-3 tool?
Probably not, if you're willing to wire in OpenTelemetry and pay for the additional spans. The downsides are (1) cost — agent traffic generates spans at high volume, and your APM bill grows fast — and (2) the LLM-specific cost parsers in Langfuse / Phoenix / Helicone do work for you that generic APMs don't. If you already pay Datadog and have headroom, the OTel route is cleanest. If you're cost-sensitive, a dedicated tool will be cheaper at agent volume.
Where does an MCP server fit in this list?
MCP is a Layer-1 artefact (it provides a discoverable, scoped credential surface for tools the agent can call) plus a small Layer-2 component (the tool definition is implicitly an allowlist of what the agent can attempt). It does not enforce per-day USD caps or sub-second mid-run revoke — both are still Layer 2's job. The MCP-auth page covers the auth handshake; the Stripe Agent Toolkit page covers what Stripe's own MCP server gives you and where it stops. Net: an MCP-served tool is still upstream of Layer 2 — put a Layer-2 proxy between the MCP server and the underlying SaaS API.
Open-source-only stack — what's the answer?
Layer 1: HashiCorp Vault. Layer 2 LLM: LiteLLM. Layer 2 SaaS-tool: roll your own SDK wrapper using the patterns in our Stripe-key blog post (or use Keybrake — we're OSS-licensed for self-host, and that's also the cheapest path that scales beyond one vendor). Layer 3: Langfuse self-hosted (Postgres backend, decent UI). Layer 4: Promptfoo + Garak. Total monthly spend: $0 in software licences, plus whatever your ops team's time is worth running five OSS services. The real total is probably 1-3 days of platform-engineer time per month; whether that's cheaper than commercially licensed tooling depends on your salary structure.
Pre-revenue startup — should I bother with all four layers?
Layer 2 only, on the axis where your agent burns money. If your agent is calling LLMs to generate copy, Layer 2 LLM is the answer (LiteLLM with a $50/day cap); the rest can wait until you have customers. If your agent is touching Stripe at all, Layer 2 SaaS-tool is the answer regardless of revenue stage — a stuck refund loop pre-revenue is just as expensive as one post-revenue, and pre-revenue your runway is more precious. Layer 4 is almost never the right pre-revenue investment unless your industry's compliance regime makes it a launch blocker.
How do I evaluate "vendor X says they're a governance platform"?
Three-question test. (1) Do they sit inline on the request path between the agent and the SaaS API, or only on the model input/output? If only the latter, they're Layer 4 marketed as Layer 2. (2) Can they enforce a per-day USD cap on Stripe today, with parsed-from-response cost? If no, the most expensive scenarios are not in scope. (3) What's the median and p95 propagation latency for a mid-run revoke? If they can't tell you, they don't have one. Any vendor failing two of three questions is solving a different problem than the one in your head when you searched "agent governance".
Is "agent governance tools" the same search as "AI governance tools"?
Overlapping but distinct. "AI governance tools" historically returns Credo AI, Holistic AI, IBM Watson OpenScale, Datadog AI, ServiceNow Risk — model-risk, regulatory compliance, fairness audits at the org level. "Agent governance tools" (this page's intent) is more operational: runtime controls that constrain an autonomous agent's actions during execution. The first is about model outputs and policy posture; the second is about the agent's hand on the wheel. The vocabulary cross-pollinates because the vendors that built tools for the first category are now marketing into the second. Scrutinise a vendor's Layer-2 story when shopping; it's the easiest way to tell which problem a tool actually solves.
Further reading
- AI agent governance platform — why governance is not a single platform — sibling page; the rebuttal-shaped argument behind the four-layer model used here.
- The 2026 agent governance stack: which proxy goes where — long-form companion; the dual-proxy architecture diagram with measures-in / prevents framing per layer.
- AI agent cost management — three-axis decomposition — the cost math that motivates why Layer 2 is the most expensive layer to skip ($3.24M/day worst case).
- AI agent kill-switch — patterns and stop-latency — the four real Layer-2 enforcement patterns with measured propagation latencies.
- AI agent audit trail — what belongs in one — the four-column MVP schema for Layer 3.
- Anatomy of an AI agent audit trail (long form) — sixteen-column reference, six indexes, five operational queries.
- LiteLLM alternative for Stripe — why the LLM-gateway category does not extend into SaaS-tool Layer 2.
- LiteLLM alternatives — honest open-source review — the LLM-axis Layer-2 toolbox.
- LiteLLM Proxy alternatives — six gateways for the proxy-server shape — narrower Layer-2 LLM picks for proxy-shape deployments.
- AI agent payment gateway — 2026 category map — the three-category split for payment-axis tooling and where governance fits.
- Stripe Agent Toolkit over MCP — 14-tool blast-radius catalogue — Layer 1 + Layer 2 picture for Stripe's own MCP server.
- MCP server API key auth — 4 patterns — Layer 1 picture for MCP credential handling.
- How to give an AI agent a Stripe API key without losing $4,000 — practical Layer-2 implementation walkthrough.
- Rotate vs revoke: a 2am playbook for a stuck agent — Layer-2 incident response with two side-by-side timelines.
- Agent blowout calculator — interactive: pick a vendor and a calls-per-minute slider, see the 24-hour Layer-2 cost.
- Newsletter issue #01 — how long your kill switch actually takes to kill — per-vendor revoke latency measurements.