Agent governance · Tool list
AI agent governance tools: opinionated shortlist per layer for 2026
"AI agent governance tools" is a query that returns a vendor sea — five different categories of product, each lobbying to own the whole term. This page names them all, sorts them into the four layers governance actually contains, and gives an opinionated pick plus one runner-up per layer. Skip the vendor sea: read the four-tool minimum stack in the quick-reference table below and you have what a production agent needs.
TL;DR
Governance contains four layers, each with its own tool category. Layer 1 (identity): pick from your existing secret store — 1Password / AWS Secrets Manager / HashiCorp Vault / Doppler / Infisical. Layer 2 (runtime policy enforcement): split by axis — LiteLLM (or Portkey / Bifrost / Helicone) on the LLM axis, Keybrake on the SaaS-tool axis. Layer 3 (audit and cost): Langfuse for LLM traces, your Layer-2 proxy emits the SaaS-tool rows, join on agent_run_id. Layer 4 (post-hoc evaluation): Promptfoo if open-source, Lakera Guard or Lasso Security if commercial. Total: four named tools, joined on one ID. The vendors that pitch a unified "platform" cover one or two of these layers and skip Layer 2 — see the governance platform rebuttal for why.
The four layers, named in one paragraph
Skip this section if you already know the layer model from the platform rebuttal page. For the rest: Layer 1 is who the agent is and what credentials it carries. Layer 2 is what the agent can do at the moment of action — per-day USD caps, endpoint allowlists, customer scope, mid-run revoke. Layer 3 is what the agent actually did, joined on agent_run_id. Layer 4 is whether the output was safe / correct / on-brand, evaluated post-hoc. Different surfaces, different tools. The error this page exists to prevent is buying one tool and assuming "governance" is now solved — it never is, because no single tool covers more than two of the four layers.
Quick reference — the four-tool minimum stack
For the impatient: this is the table you want. One row per layer; the pick column is what we'd buy on day one for a 10-50 person team running production agents that touch money-moving SaaS APIs.
| Layer | Job | Pick | Runner-up | Open-source |
|---|---|---|---|---|
| L1 — Identity | Store and rotate per-agent credentials | 1Password Secrets Automation | Doppler | HashiCorp Vault, Infisical |
| L2 — Runtime policy (LLM axis) | Token cap, model fallback, prompt-level guard | LiteLLM Proxy | Portkey Gateway | LiteLLM, Portkey, Bifrost, LangGate |
| L2 — Runtime policy (SaaS-tool axis) | USD cap, endpoint allowlist, customer scope, revoke on Stripe / Twilio / Resend / Shopify | Keybrake | DIY SDK wrapper | agentgateway.dev (alpha) |
| L3 — Audit and cost | Per-call log joined on agent_run_id with parsed cost | Langfuse (rows emitted by the L2 proxies) | Helicone | Langfuse, Phoenix (Arize), OpenLIT, Laminar |
| L4 — Post-hoc evaluation | Output safety / correctness / regression scores | Promptfoo | Lakera Guard (commercial) | Promptfoo, Garak, DeepEval, Ragas |
Total monthly spend for a 10-50 person team running this exact stack: roughly $200-$500 on the proxies and gateways combined (free tiers cover most teams under 100k requests/month), plus whatever Layer-4 commitments your industry's compliance regime forces. Self-hosting all five is also viable — every "open-source" entry above is genuinely OSS-licensed, not source-available with a commercial gate. The cost of buying nothing is the worst-case figure: $3.24M / day for a stuck Stripe refund loop at 216,000 calls/day × $15/refund — the math is in our three-axis cost decomposition.
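The worst-case arithmetic is worth seeing worked through once. A sketch of the figure quoted above — the 150 calls/minute loop rate is an assumption chosen to be consistent with 216,000 calls/day, and the $500 cap is an illustrative Layer-2 setting, not a recommendation:

```python
# Worked version of the worst-case figure quoted above.
calls_per_minute = 150                        # assumed stuck-retry-loop rate
calls_per_day = calls_per_minute * 60 * 24    # = 216,000 calls/day
cost_per_refund_usd = 15.0

uncapped_daily_loss = calls_per_day * cost_per_refund_usd
print(f"uncapped: ${uncapped_daily_loss:,.0f}/day")   # uncapped: $3,240,000/day

# With a Layer-2 per-day USD cap, the loss is bounded by the cap,
# not by how long the loop runs before a human notices.
daily_cap_usd = 500.0
capped_daily_loss = min(uncapped_daily_loss, daily_cap_usd)
print(f"capped:   ${capped_daily_loss:,.0f}/day")
```

The point of the sketch: without a cap the loss scales with loop duration; with one, it is a constant you chose in advance.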
Layer 1 — Identity tools
Layer 1 is the boring layer, and that's a feature. The tools here are mature, well-understood, and almost universally already in place at any company past 20 engineers. The agent-governance angle on Layer 1 is narrow: can you issue a per-agent or per-run credential, and rotate it without code changes? If yes, your existing tool is sufficient. If no, you have a Layer-1 gap to fill before any other layer matters.
The named tools
- 1Password Secrets Automation — our pick. Reads from the same vault humans already use; service accounts handle agent identity. The "rotate" story is via Connect + a script in your deploy job. Best when humans already use 1Password (which is most product companies).
- HashiCorp Vault — the canonical OSS choice, sufficient for any scale, but you operate it. The dynamic-secrets engine (issue a database credential valid for 1h, then auto-revoke) is genuinely the best Layer-1 control on the market — most teams underuse it. If your platform team already runs Vault for non-agent workloads, the agent extension is the right call.
- AWS Secrets Manager — fine if you're an AWS shop. The rotation Lambda pattern works for vendors with rotation hooks (Stripe, GitHub apps, etc.) but is awkward for vendors without (Twilio, Resend). Pair with IAM role-per-agent for clean attribution.
- Doppler — strong runner-up for teams not already on 1Password or Vault. Per-environment secret sync, decent UI, decent CLI. The "service tokens" feature is the natural per-agent identity.
- Infisical — the up-and-coming OSS option. Self-host or hosted. Has a per-agent identity primitive and decent rotation story. Newer than Doppler/Vault, fewer integrations, but moving fast.
- WorkOS Agents (preview, 2026) — explicitly framed as agent identity infrastructure. SCIM-style provisioning for agents. Worth watching but too new to bet on for a current incident-prone production system.
- Auth0 Fine Grained Authorization — Auth0's policy engine, useful if your agent calls authenticated APIs you control. Doesn't help with third-party API keys (Stripe, Twilio) — that's Layer 2.
What Layer 1 alone does not solve
The most common Layer-1 illusion: "we issued a Stripe Restricted Key per agent, so we have governance covered." You don't. Restricted Keys are a great Layer-1 artefact and a partial Layer-2 enforcement (path-level scope), but they have no daily USD cap, no parameter-level allowlist on amount, no sub-second revoke without a human in the dashboard. The runaway-loop scenarios live a layer up from where Restricted Keys operate. The "does Stripe's native feature cover X" page walks through exactly which controls Layer 1 misses.
Layer 2 — Runtime policy enforcement tools
This is the layer that catches incidents and the layer where the vendor categories are most confused. There is no single Layer-2 product because Layer-2 traffic splits cleanly into two axes — LLM API traffic and SaaS-tool API traffic — and the technical work to govern each is different (different cost-data sources, different cap units, different fallback grammar). The right mental model is two proxies, in series, each governing one axis. The 2026 governance stack post has the diagram.
Layer 2 — LLM axis tools
These sit between the agent and OpenAI / Anthropic / Bedrock / your self-hosted vLLM / etc. They cap tokens, route between models, retry on 5xx, and produce model-call traces. The category has high vendor density and most options are good.
- LiteLLM Proxy — our pick for the LLM axis. Genuine OSS, OpenAI-compatible, multi-model, virtual keys, daily token caps, fallback chains, Postgres-backed audit. The standalone-server form (vs the SDK form) is the right Layer-2 shape. We have detailed posts on LiteLLM alternatives for teams that need different tradeoffs.
- Portkey Gateway — strong runner-up. Node/Hono-based, MIT licence, the most expressive routing config (YAML conditionals across model, latency, cost). Pick this over LiteLLM if your routing logic is complex.
- Bifrost — Go-based, Apache 2.0. ~10x throughput at high QPS vs Python proxies. Pick this if you're hitting Python GIL pain on LiteLLM at >1k req/s.
- LangGate — Python Kubernetes operator, Apache 2.0. CRD-driven config. Pick this if your stack is heavily K8s-native and you want declarative gateway config.
- Helicone — observability-first, with policy as a side feature. Pick this if your immediate need is "show me what the agent is doing" rather than "stop the agent doing it." Has a decent free tier.
- Envoy AI Gateway / Kong AI Gateway / Apache APISIX ai-proxy — these are LLM plugins for existing API gateways. Pick if your platform team already operates Envoy/Kong/APISIX and you want to consolidate. Don't pick if you don't already operate these — the operational tax of the host gateway dwarfs the agent governance work.
- OpenRouter — model marketplace with proxy semantics. Useful for cost arbitrage across models; weak on policy enforcement. Pair with one of the above, don't replace.
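The fallback-chain behaviour all of these gateways implement can be sketched generically. This is not any vendor's API — the model callables are placeholders, and a real gateway also distinguishes retryable errors (429/5xx) from fatal ones before moving down the chain:

```python
def call_with_fallback(prompt, chain):
    """Try each model in order; fall through when one fails.

    `chain` is a list of (name, callable) pairs. Real gateways run
    this logic inside the proxy, driven by declarative config.
    """
    errors = []
    for name, model in chain:
        try:
            return name, model(prompt)
        except Exception as exc:
            errors.append((name, exc))
    raise RuntimeError(f"all models in chain failed: {errors}")

def flaky_primary(prompt):
    raise TimeoutError("upstream 504")      # simulate a provider outage

def stable_fallback(prompt):
    return f"ok: {prompt}"

used, reply = call_with_fallback("summarise this", [
    ("primary", flaky_primary),
    ("fallback", stable_fallback),
])
print(used, reply)   # fallback ok: summarise this
```

The reason this belongs in the proxy rather than the agent: every agent in the fleet gets the same chain, and the audit row records which model actually answered.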
Layer 2 — SaaS-tool axis tools
These sit between the agent and Stripe / Twilio / Resend / Shopify / Postmark / Segment / etc. They enforce per-vendor USD cap, endpoint allowlist, customer scope, parameter-level allowlist, mid-run revoke. The vendor density here is much lower — this is a category in formation. You have three viable choices, ordered by maturity.
- Keybrake — our own pick because we built it for exactly this layer, and there is no honest competitor with the same scope. Sits in front of Stripe, Twilio, Resend (v1, with Shopify and Postmark in the next milestone). Per-vendor USD caps, endpoint allowlist, customer-scope on Stripe Customer, parameter allowlist on the `amount` field, sub-second mid-run revoke (propagation latency measured per vendor). Free tier covers 1,000 proxied requests / month, 1 vendor, 7-day audit retention.
- DIY SDK wrapper — runner-up because it actually works for one vendor. Wrap the Stripe SDK in a thin policy layer in your own service; check daily USD spent against a Redis counter; reject if over cap. Cheapest path for one vendor; gets expensive when you need to repeat it for Twilio, Resend, Shopify (each has a different cost-data shape — our Stripe-key implementation walkthrough sketches the SDK-wrapper version). The build-vs-buy decision is per-team; most teams underestimate the per-vendor parser work.
- agentgateway.dev — the closest open-source attempt at the SaaS-axis Layer 2 we've seen. Alpha-stage as of April 2026, narrow vendor coverage, MIT-licensed. Worth watching; not yet production-grade for incident-sensitive deployments.
- "Just use a feature flag" — what we hear most often as a "DIY" answer. Feature flags are a Layer-2 control only for your own first-party services. They don't enforce a USD cap on Stripe; they only stop your code from calling Stripe, which is a strictly weaker guarantee (the agent can still call Stripe through any code path that doesn't check the flag, including library code).
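The DIY wrapper from the list above can be sketched in a few lines. Everything here is illustrative — the class name, cap value, and `refund` helper are ours, not from any SDK; in production the counter lives in Redis (`INCRBYFLOAT` on a per-day key with a TTL) so every replica sees the same running total, but an in-memory dict stands in so the sketch runs standalone:

```python
import datetime

class DailyUsdCap:
    """Reject vendor calls once today's spend exceeds a USD cap."""

    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self._spent: dict[str, float] = {}   # stand-in for Redis

    def _key(self) -> str:
        return f"spend:{datetime.date.today().isoformat()}"

    def try_spend(self, amount_usd: float) -> bool:
        key = self._key()
        projected = self._spent.get(key, 0.0) + amount_usd
        if projected > self.cap_usd:
            return False      # over cap: caller must NOT hit the vendor API
        self._spent[key] = projected
        return True

cap = DailyUsdCap(cap_usd=50.0)

def refund(amount_usd: float) -> str:
    # Policy check BEFORE the vendor call (e.g. before stripe.Refund.create)
    if not cap.try_spend(amount_usd):
        return "rejected: daily USD cap reached"
    return "refunded"   # real code would call the Stripe SDK here

print(refund(30.0))   # refunded
print(refund(15.0))   # refunded
print(refund(15.0))   # rejected: daily USD cap reached
```

This is also where the "one vendor" caveat bites: the `amount_usd` input is trivial for Stripe refunds but has to be parsed from a different response shape for Twilio, Resend, and Shopify.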
What the SaaS-axis Layer 2 explicitly does not include: the LLM gateways listed above. LiteLLM is not a Stripe proxy — it speaks OpenAI's wire format and parses tokens, neither of which describes a Stripe charge. Pointing an LLM gateway at api.stripe.com fails on path schemas, response parsing, and auth envelope. Use both proxies, don't try to reuse one.
Layer 3 — Audit and cost tools
Layer 3 is where the rows live. The output is a single queryable table whose rows are individual API calls, joined to a per-run grouping key (agent_run_id) so you can ask "what did run run_2026_04_30_8a3f spend?" and get one number across all vendors. Most Layer-3 capability falls out of Layer 2 for free — the proxy that enforces the cap is the natural place to record the call. The remainder is the join: rows from the LLM gateway plus rows from the SaaS-tool proxy plus identity attribution from Layer 1, all on the same key.
The named tools
- Langfuse — our pick. OSS, self-hostable, multi-vendor SDK support (Python, JS, plus OpenTelemetry), the trace-and-span model is genuinely well-suited to agent runs (parent run = trace, individual API call = span). The dashboard shows per-trace cost; queryable via SQL on the underlying Postgres.
- Helicone — strong runner-up, especially if you're already using it for Layer 2. Tracks LLM calls natively; SaaS-tool calls require manual instrumentation. Good for teams that want a managed offering and only care about the LLM axis.
- Phoenix (Arize) — OSS, OpenInference-based, strong evaluation integration. Pick if you want Layer 3 + Layer 4 in one tool; weaker on the per-vendor cost parsing.
- OpenLIT — OpenTelemetry-native, OSS. Pick if your existing observability stack is already OTel and you want to add LLM traces without a separate backend.
- Laminar — OSS, Rust-based, OpenTelemetry-compatible. Newer than Langfuse but moving fast; per-call cost parsing for major LLM providers built in.
- Datadog LLM Observability / New Relic AI Monitoring — fine if you already pay these vendors. Expensive at agent-scale request volumes, no per-vendor SaaS-tool support yet.
- Your own Postgres / SQLite + the agent_run_id join — entirely viable for Layer 3 alone. Our four-column MVP schema is `agent_run_id` + `policy_verdict` + `cost_usd_parsed` + `customer_scope_id`. The full reference is the sixteen-column post. Many teams don't need a vendor product here — a table and a cron query is sufficient.
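The four-column MVP schema as runnable SQL, with the per-run rollup it exists to answer. Column names are from the text; the sample rows are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE audit (
        agent_run_id      TEXT NOT NULL,   -- join key, minted by the agent
        policy_verdict    TEXT NOT NULL,   -- 'allow' | 'deny'
        cost_usd_parsed   REAL NOT NULL,   -- parsed from the vendor response
        customer_scope_id TEXT             -- e.g. a Stripe Customer id
    )
""")
con.executemany(
    "INSERT INTO audit VALUES (?, ?, ?, ?)",
    [
        ("run_2026_04_30_8a3f", "allow", 15.00, "cus_123"),  # Stripe refund
        ("run_2026_04_30_8a3f", "allow",  0.04, None),       # LLM call
        ("run_2026_04_30_8a3f", "deny",   0.00, "cus_123"),  # cap hit, no spend
    ],
)
total, = con.execute(
    "SELECT SUM(cost_usd_parsed) FROM audit WHERE agent_run_id = ?",
    ("run_2026_04_30_8a3f",),
).fetchone()
print(f"run spend: ${total:.2f}")   # run spend: $15.04
```

Note the denied row still gets logged at zero cost — at incident time, "what did the cap reject?" is as useful a question as "what got through?".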
The agent_run_id join
The Layer-3 capability that actually matters at incident time is "what did one specific agent run spend, across all vendors?" That requires a join on a key emitted by the agent and forwarded by every proxy in the chain. The header convention we recommend (and which Keybrake propagates from incoming requests) is x-agent-run-id: run_2026_04_30_8a3f. Set it once at the top of your agent, propagate through your LLM gateway, propagate through your SaaS-tool proxy, log it everywhere. The result: SELECT vendor, SUM(cost_usd_parsed) FROM audit WHERE agent_run_id = ? gives you the per-run total. Without this, you have rows from each proxy that can't be correlated, and the runaway-loop postmortem takes hours instead of minutes.
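The propagation discipline described above, as a sketch: mint the id once, attach it to every outbound call, and have each proxy in the chain log a row keyed on it. The two stand-in "proxy" functions are placeholders for the real gateways:

```python
import datetime
import uuid

def new_run_id() -> str:
    # e.g. run_2026_04_30_8a3f — date plus a short random suffix
    today = datetime.date.today().strftime("%Y_%m_%d")
    return f"run_{today}_{uuid.uuid4().hex[:4]}"

RUN_ID = new_run_id()
HEADERS = {"x-agent-run-id": RUN_ID}   # attach to EVERY outbound call

# Each proxy forwards the header and writes an audit row keyed on it;
# here two stand-ins just record what they saw.
audit_rows = []

def llm_gateway(headers):       # stands in for the LLM-axis proxy
    audit_rows.append(("llm", headers["x-agent-run-id"]))

def saas_proxy(headers):        # stands in for the SaaS-tool-axis proxy
    audit_rows.append(("stripe", headers["x-agent-run-id"]))

llm_gateway(HEADERS)
saas_proxy(HEADERS)

# Every row carries the same key, so the per-run SUM query joins cleanly.
assert all(run_id == RUN_ID for _, run_id in audit_rows)
```

The failure mode to avoid is minting the id inside one proxy: then the other proxy's rows carry a different key and the join silently returns a partial total.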
Layer 4 — Post-hoc evaluation tools
This is the layer with the highest vendor density and the most "governance platform" marketing. Pick is more about your compliance regime than tool quality — the OSS options are excellent for teams without specific regulatory requirements, and the commercial vendors earn their licence fees by mapping their controls onto specific regulatory frameworks (EU AI Act, NIST AI RMF, ISO 42001).
Open-source picks
- Promptfoo — our pick if you can self-host. CLI-first, declarative test files, supports red-teaming and unit-test-style assertions. The "test your prompts in CI" workflow generalises naturally to agent runs.
- Garak — pick for adversarial / red-team evaluation specifically. Hundreds of probes for prompt injection, jailbreaks, data leaks. Pair with Promptfoo for end-to-end coverage.
- DeepEval — pytest-style framework for LLM evaluation. Pick if your team prefers a code-first / unit-test-aligned workflow over Promptfoo's YAML.
- Ragas — RAG-specific evaluation. Faithfulness, answer-relevance, context-precision. Pick if your agent is dominantly retrieval-augmented and you need RAG-specific scores.
- OpenLIT / Langfuse evals — both bundle evaluation alongside Layer 3. Worth using if you're already on the parent tool; not worth picking the parent tool just for the eval module.
Commercial picks
- Lakera Guard — pick for production-grade prompt-injection detection with low false-positive rate. The closest commercial option to a real-time Layer-4 control (it can sit inline and block bad prompts pre-model). Pricing is per-1k requests; gets expensive at agent-scale traffic, but the SLA is real.
- Lasso Security — runner-up. Strong on the dashboard and on integrations with secret-stores. Has a "shield" mode that's a partial Layer-2 control on prompt-level attacks specifically.
- CalypsoAI — pick if your buyer is an enterprise CISO and the deliverable is a regulator-friendly risk register. Strong audit-export story.
- Robust Intelligence (Cisco) — pick for enterprise-grade red-teaming and continuous validation. The model-eval scorecards are best-in-class. Heavyweight; not appropriate for a 10-person team.
- Credo AI / Holistic AI — pick if your governance epic is driven primarily by EU AI Act compliance. These are governance-frameworks-as-a-service rather than runtime controls.
The rule of thumb across Layer 4: if your immediate worry is incident prevention, pick OSS — Promptfoo + Garak gives you 80% of the value for $0. If your immediate worry is regulatory compliance, pick commercial — the evidence-export and audit-trail features are what you're paying for. Neither addresses Layer 2; the governance-platform rebuttal has the longer version of why these vendors don't extend into the runtime policy layer.
Cross-layer / orchestration tools
A handful of tools don't fit one layer cleanly because they touch the workflow that produces agent runs in the first place. They are not governance tools per se, but they shape what governance you can do.
- Inngest / Temporal — durable execution platforms for agent workflows. Useful because they emit a `run_id` for free that can become your `agent_run_id`. Don't enforce policy themselves; enable it by giving you a stable identifier.
- OpenTelemetry — the standards-based way to wire Layer 3 across all proxies and your own services. Pick this over a vendor-specific tracer if you want lock-in protection or already run an OTel collector.
- OpenLLMetry — Traceloop's OTel-on-LLM project. Auto-instruments most LLM SDKs to emit OTel spans with cost data. Pair with Langfuse, Phoenix, or your existing OTel backend.
- Anthropic Computer Use sandboxing / E2B / Daytona — these are containerisation primitives for agents that run shell commands or browser interactions. Layer-1.5 in a sense — they constrain the blast radius of code execution but don't help with API-call governance. Worth knowing about; orthogonal to the four-layer model.
What we run for Keybrake's own production stack
Eat-our-own-dogfood disclosure. Keybrake runs the dual-proxy + audit + eval stack on a single VPS:
- Layer 1 — 1Password Secrets Automation. Per-agent service tokens; rotated quarterly; deploy job pulls live values via Connect.
- Layer 2 (LLM axis) — LiteLLM Proxy, self-hosted, in front of the Anthropic and OpenAI calls our own product makes (e.g. when an internal agent posts a build-in-public update on X). Daily token cap of 200k tokens per agent service token.
- Layer 2 (SaaS-tool axis) — Keybrake itself, on the same VPS. The "self-hosted in production" deployment shape is the one our docs assume; the hosted version is what we're building for paying customers.
- Layer 3 — SQLite + the four-column MVP schema, plus a nightly export to a read-only Postgres for richer queries. We will graduate to Langfuse once we cross 1k requests/day; today the SQLite-and-cron setup is sufficient.
- Layer 4 — Promptfoo in CI for the prompts our internal build-in-public agent uses. No commercial Layer-4 tool today; we'll revisit when we have a customer asking for an SOC 2 Type II report.
This is more honest than aspirational — the production system we describe in the rest of our marketing is exactly this. The omission of a commercial Layer-4 tool is deliberate: pre-revenue and pre-compliance, the cost-benefit isn't there. A team in a regulated industry would pick differently and we'd recommend they do.
Get early access to Keybrake (Layer 2 — SaaS-tool axis)
What to skip
The "AI agent governance tools" search returns several categories of result that we don't think belong on a shortlist for a production agent. Brief disclosure of what we considered and why we left it off:
- Generic API gateways without LLM plugins (Tyk, Apigee, AWS API Gateway). Fine for human-traffic governance; do not parse cost from LLM responses; do not enforce per-call USD caps on Stripe. Use one only if you already operate it for non-agent workloads.
- Compliance-management SaaS (Vanta, Drata, Secureframe). These help you produce audit evidence; they do not enforce a Layer-2 cap. Different problem.
- Generic application firewalls (Cloudflare, Imperva). Path-level rules for human-shaped traffic. Cannot inspect a Stripe charge amount. Can block IPs at scale, which is occasionally useful as a panic button — not a governance tool.
- "Agent platforms" that bundle their own proxy (some autonomous-agent frameworks ship a built-in HTTP layer that claims governance). The bundled controls are usually rudimentary (binary on/off per tool), not the per-vendor caps and customer-scope you need at incident time. Use the agent framework for orchestration; pair with an actual Layer-2 proxy for governance.
- Tools currently in alpha (most of the WorkOS Agents / Skyvern Pay / x402 cohort). Worth tracking; not appropriate for production deployment until v1 with measured propagation latency. Don't bet the runaway-loop incident on alpha software.
Related questions
What's the absolute minimum I can ship with?
One Layer-2 proxy on whichever axis matters most for your agent. If your agent calls Stripe, that's the SaaS-axis proxy (Keybrake or DIY); if your agent only calls LLMs, that's an LLM gateway (LiteLLM or Portkey). Layer 3 falls out of Layer 2 for free if you log every call. Layer 1 is whatever your secret store already is. Layer 4 is deferrable until you have a specific compliance ask. The minimum minimum is one proxy and a SQLite table. Everything else compounds value but isn't strictly necessary on day one.
I already use Datadog / New Relic / Honeycomb. Do I need a separate Layer-3 tool?
Probably not, if you're willing to wire in OpenTelemetry and pay for the additional spans. The downsides are (1) cost — agent traffic generates spans at high volume, and your APM bill grows fast — and (2) the LLM-specific cost parsers in Langfuse / Phoenix / Helicone do work for you that generic APMs don't. If you already pay Datadog and have headroom, the OTel route is cleanest. If you're cost-sensitive, a dedicated tool will be cheaper at agent volume.
Where does an MCP server fit in this list?
MCP is a Layer-1 artefact (it provides a discoverable, scoped credential surface for tools the agent can call) plus a small Layer-2 component (the tool definition is implicitly an allowlist of what the agent can attempt). It does not enforce per-day USD caps or sub-second mid-run revoke — both are still Layer 2's job. The MCP-auth page covers the auth handshake; the Stripe Agent Toolkit page covers what Stripe's own MCP server gives you and where it stops. Net: an MCP-served tool is still upstream of Layer 2 — put a Layer-2 proxy between the MCP server and the underlying SaaS API.
Open-source-only stack — what's the answer?
Layer 1: HashiCorp Vault. Layer 2 LLM: LiteLLM. Layer 2 SaaS-tool: roll your own SDK wrapper using the patterns in our Stripe-key blog post (or use Keybrake — we're OSS-licensed for self-host, and that's also the cheapest path that scales beyond one vendor). Layer 3: Langfuse self-hosted (Postgres backend, decent UI). Layer 4: Promptfoo + Garak. Total monthly spend: $0 in software licences, plus whatever your ops team's time is worth running five OSS services. The real total is probably 1-3 days of platform-engineer time per month; whether that's cheaper than commercially licensed tooling depends on your salary structure.
Pre-revenue startup — should I bother with all four layers?
Layer 2 only, on the axis where your agent burns money. If your agent is calling LLMs to generate copy, Layer 2 LLM is the answer (LiteLLM with a $50/day cap); the rest can wait until you have customers. If your agent is touching Stripe at all, Layer 2 SaaS-tool is the answer regardless of revenue stage — a stuck refund loop pre-revenue is just as expensive as one post-revenue, and pre-revenue your runway is more precious. Layer 4 is almost never the right pre-revenue investment unless your industry's compliance regime makes it a launch blocker.
How do I evaluate "vendor X says they're a governance platform"?
Three-question test. (1) Do they sit inline on the request path between the agent and the SaaS API, or only on the model input/output? If only the latter, they're Layer 4 marketed as Layer 2. (2) Can they enforce a per-day USD cap on Stripe today, with parsed-from-response cost? If no, the most expensive scenarios are not in scope. (3) What's the median and p95 propagation latency for a mid-run revoke? If they can't tell you, they don't have one. Any vendor failing two of three questions is solving a different problem than the one in your head when you searched "agent governance".
Is "agent governance tools" the same search as "AI governance tools"?
Overlapping but distinct. "AI governance tools" historically returns Credo AI, Holistic AI, IBM Watson OpenScale, Datadog AI, ServiceNow Risk — model-risk, regulatory compliance, fairness audits at the org level. "Agent governance tools" (this page's intent) is more operational: runtime controls that constrain an autonomous agent's actions during execution. The first is about model outputs and policy posture; the second is about the agent's hand on the wheel. The vocabulary cross-pollinates because the vendors that built tools for the first category are now marketing into the second. Scrutinise a vendor's Layer-2 story when shopping; it's the easiest way to tell which problem a tool actually solves.
Further reading
- AI agent governance platform — why governance is not a single platform — sibling page; the rebuttal-shaped argument behind the four-layer model used here.
- The 2026 agent governance stack: which proxy goes where — long-form companion; the dual-proxy architecture diagram with measures-in / prevents framing per layer.
- AI agent cost management — three-axis decomposition — the cost math that motivates why Layer 2 is the most expensive layer to skip ($3.24M/day worst case).
- AI agent kill-switch — patterns and stop-latency — the four real Layer-2 enforcement patterns with measured propagation latencies.
- AI agent audit trail — what belongs in one — the four-column MVP schema for Layer 3.
- Anatomy of an AI agent audit trail (long form) — sixteen-column reference, six indexes, five operational queries.
- LiteLLM alternative for Stripe — why the LLM-gateway category does not extend into SaaS-tool Layer 2.
- LiteLLM alternatives — honest open-source review — the LLM-axis Layer-2 toolbox.
- LiteLLM Proxy alternatives — six gateways for the proxy-server shape — narrower Layer-2 LLM picks for proxy-shape deployments.
- AI agent payment gateway — 2026 category map — the three-category split for payment-axis tooling and where governance fits.
- Stripe Agent Toolkit over MCP — 14-tool blast-radius catalogue — Layer 1 + Layer 2 picture for Stripe's own MCP server.
- MCP server API key auth — 4 patterns — Layer 1 picture for MCP credential handling.
- How to give an AI agent a Stripe API key without losing $4,000 — practical Layer-2 implementation walkthrough.
- Rotate vs revoke: a 2am playbook for a stuck agent — Layer-2 incident response with two side-by-side timelines.
- Agent blowout calculator — interactive: pick a vendor and a calls-per-minute slider, see the 24-hour Layer-2 cost.
- Newsletter issue #01 — how long your kill switch actually takes to kill — per-vendor revoke latency measurements.