
AI agent governance tools: opinionated shortlist per layer for 2026

"AI agent governance tools" is a query that returns a vendor sea — five different categories of product, each lobbying for the whole word. This page names them all, sorts them into the four layers governance actually contains, and gives an opinionated pick plus one runner-up per layer. Skip the vendor sea: read the four-tool minimum stack at the bottom and you have what a production agent needs.

TL;DR

Governance contains four layers, each with its own tool category. Layer 1 (identity): pick from your existing secret store — 1Password / AWS Secrets Manager / HashiCorp Vault / Doppler / Infisical. Layer 2 (runtime policy enforcement): split by axis — LiteLLM (or Portkey / Bifrost / Helicone) on the LLM axis, Keybrake on the SaaS-tool axis. Layer 3 (audit and cost): Langfuse for LLM traces, your Layer-2 proxy emits the SaaS-tool rows, join on agent_run_id. Layer 4 (post-hoc evaluation): Promptfoo if open-source, Lakera Guard or Lasso Security if commercial. Total: four named tools, joined on one ID. The vendors that pitch a unified "platform" cover one or two of these layers and skip Layer 2 — see the governance platform rebuttal for why.

The four layers, named in one paragraph

Skip this section if you already know the layer model from the platform rebuttal page. For the rest: Layer 1 is who the agent is and what credentials it carries. Layer 2 is what the agent can do at the moment of action — per-day USD caps, endpoint allowlists, customer scope, mid-run revoke. Layer 3 is what the agent actually did, joined on agent_run_id. Layer 4 is whether the output was safe / correct / on-brand, evaluated post-hoc. Different surfaces, different tools. The error this page exists to prevent is buying one tool and assuming "governance" is now solved — it never is, because no single tool covers more than two of the four layers.

Quick reference — the four-tool minimum stack

For the impatient: this is the table you want. One row per layer; the pick column is what we'd buy on day one for a 10-50 person team running production agents that touch money-moving SaaS APIs.

| Layer | Job | Pick | Runner-up | Open-source |
|---|---|---|---|---|
| L1 — Identity | Store and rotate per-agent credentials | 1Password Secrets Automation | Doppler | HashiCorp Vault, Infisical |
| L2 — Runtime policy (LLM axis) | Token cap, model fallback, prompt-level guard | LiteLLM Proxy | Portkey Gateway | LiteLLM, Portkey, Bifrost, LangGate |
| L2 — Runtime policy (SaaS-tool axis) | USD cap, endpoint allowlist, customer scope, revoke on Stripe / Twilio / Resend / Shopify | Keybrake | DIY SDK wrapper | agentgateway.dev (alpha) |
| L3 — Audit and cost | Per-call log joined on agent_run_id with parsed cost | Langfuse (+ rows the L2 proxy emits) | Helicone | Langfuse, Phoenix (Arize), OpenLIT, Laminar |
| L4 — Post-hoc evaluation | Output safety / correctness / regression scores | Promptfoo | Lakera Guard (commercial) | Promptfoo, Garak, DeepEval, Ragas |

Total monthly spend for a 10-50 person team running this exact stack: roughly $200-$500 on the proxies and gateways combined (free tiers cover most teams under 100k requests/month), plus whatever Layer-4 commitments your industry's compliance regime forces. Self-hosting every row of the stack is also viable — each "open-source" entry above is genuinely OSS-licensed, not source-available with a commercial gate. The cost of buying nothing is the worst-case figure: $3.24M / day for a stuck Stripe refund loop at 216,000 calls/day × $15/refund — the math is in our three-axis cost decomposition.
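The arithmetic behind that worst-case figure, as a sanity check (the call rate and per-refund amount are the same assumptions the cost-decomposition post uses):

```python
# A stuck refund loop running flat-out for 24 hours with no
# Layer-2 USD cap: 216,000 calls/day is ~2.5 calls/second sustained.
CALLS_PER_DAY = 216_000
USD_PER_REFUND = 15

daily_exposure = CALLS_PER_DAY * USD_PER_REFUND
print(f"${daily_exposure:,} / day")  # $3,240,000 / day
```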

Layer 1 — Identity tools

Layer 1 is the boring layer, and that's a feature. The tools here are mature, well-understood, and almost universally already in place at any company past 20 engineers. The agent-governance angle on Layer 1 is narrow: can you issue a per-agent or per-run credential, and rotate it without code changes? If yes, your existing tool is sufficient. If no, you have a Layer-1 gap to fill before any other layer matters.
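A minimal sketch of the per-agent indirection the question above is really asking for. The naming convention (`AGENT_<ID>_API_KEY`) and the `credential_for` helper are illustrative, not any vendor's API; in production you'd swap the env lookup for your secret store's SDK call, and rotation becomes a store-side update rather than a code change:

```python
import os

def credential_for(agent_id: str) -> str:
    """Resolve a per-agent credential by naming convention.

    Because agents never see each other's keys, revoking one
    agent's credential does not touch any other agent.
    """
    env_key = f"AGENT_{agent_id.upper()}_API_KEY"
    value = os.environ.get(env_key)
    if value is None:
        raise KeyError(f"no credential provisioned for agent {agent_id!r} ({env_key})")
    return value

# Provisioned by the secret store's agent/sidecar, not by application code:
os.environ["AGENT_REFUND_BOT_API_KEY"] = "rk_live_example"
print(credential_for("refund_bot"))  # rk_live_example
```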

The named tools

What Layer 1 alone does not solve

The most common Layer-1 illusion: "we issued a Stripe Restricted Key per agent, so we have governance covered." You don't. Restricted Keys are a great Layer-1 artefact and a partial Layer-2 enforcement (path-level scope), but they have no daily USD cap, no parameter-level allowlist on amount, no sub-second revoke without a human in the dashboard. The runaway-loop scenarios live a layer up from where Restricted Keys operate. The "does Stripe's native feature cover X" page walks through exactly which controls Layer 1 misses.

Layer 2 — Runtime policy enforcement tools

This is the layer that catches incidents and the layer where the vendor categories are most confused. There is no single Layer-2 product because Layer-2 traffic splits cleanly into two axes — LLM API traffic and SaaS-tool API traffic — and the technical work to govern each is different (different cost-data sources, different cap units, different fallback grammar). The right mental model is two proxies, in series, each governing one axis. The 2026 governance stack post has the diagram.

Layer 2 — LLM axis tools

These sit between the agent and OpenAI / Anthropic / Bedrock / your self-hosted vLLM / etc. They cap tokens, route between models, retry on 5xx, and produce model-call traces. The category has high vendor density and most options are good.
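The behaviors listed above — budget cap, model fallback, retry — are what these gateways implement for you. A hedged pure-Python sketch of the control flow (this is not any gateway's actual API; `DailyTokenBudget` and `call_with_fallback` are our illustrative names):

```python
from typing import Callable, Optional

class DailyTokenBudget:
    """Refuse further model calls once a per-day token budget is spent."""
    def __init__(self, max_tokens_per_day: int):
        self.max_tokens = max_tokens_per_day
        self.used = 0

    def charge(self, tokens: int) -> None:
        if self.used + tokens > self.max_tokens:
            raise RuntimeError("daily token budget exhausted")
        self.used += tokens

def call_with_fallback(models: list,
                       call: Callable[[str], str],
                       budget: DailyTokenBudget,
                       est_tokens: int) -> str:
    """Charge the budget once, then try each model in order,
    falling through on provider errors."""
    budget.charge(est_tokens)
    last_err: Optional[Exception] = None
    for model in models:
        try:
            return call(model)
        except ConnectionError as err:  # stand-in for a provider 5xx
            last_err = err
    raise RuntimeError("all models failed") from last_err

budget = DailyTokenBudget(10_000)
print(call_with_fallback(["gpt-primary", "claude-fallback"],
                         lambda m: f"ok from {m}", budget, est_tokens=500))
# prints: ok from gpt-primary
```

The real gateways add streaming, per-key budgets, and provider-specific retry semantics on top; the point of the sketch is only that cap, route, and retry are one control-flow problem, which is why a single proxy owns all three.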

Layer 2 — SaaS-tool axis tools

These sit between the agent and Stripe / Twilio / Resend / Shopify / Postmark / Segment / etc. They enforce per-vendor USD cap, endpoint allowlist, customer scope, parameter-level allowlist, mid-run revoke. The vendor density here is much lower — this is a category in formation. You have three viable choices, ordered by maturity.

What the SaaS-axis Layer 2 explicitly does not include: the LLM gateways listed above. LiteLLM is not a Stripe proxy — it speaks OpenAI's wire format and parses tokens, neither of which describes a Stripe charge. Pointing an LLM gateway at api.stripe.com fails on path schemas, response parsing, and the auth envelope. Use both proxies; don't try to reuse one.
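The checks a SaaS-axis proxy runs before a request leaves — endpoint allowlist, per-call and per-day USD caps, mid-run revoke — can be sketched as a plain policy gate. This is illustrative only: the field names (`max_usd_per_day`, `revoked`, and so on) are ours, not Keybrake's API, and a real proxy would parse the amount out of the request body per vendor:

```python
from dataclasses import dataclass

@dataclass
class SaaSPolicy:
    allowed_endpoints: set   # "METHOD /path" strings the agent may call
    max_usd_per_call: float
    max_usd_per_day: float
    spent_today: float = 0.0
    revoked: bool = False    # flipped by the control plane for mid-run revoke

    def check(self, method: str, path: str, amount_usd: float) -> None:
        """Raise PermissionError before the request leaves the proxy."""
        if self.revoked:
            raise PermissionError("credential revoked mid-run")
        if f"{method} {path}" not in self.allowed_endpoints:
            raise PermissionError(f"endpoint not allowlisted: {method} {path}")
        if amount_usd > self.max_usd_per_call:
            raise PermissionError(f"per-call USD cap exceeded: ${amount_usd}")
        if self.spent_today + amount_usd > self.max_usd_per_day:
            raise PermissionError("daily USD cap exhausted")
        self.spent_today += amount_usd

policy = SaaSPolicy({"POST /v1/refunds"}, max_usd_per_call=50, max_usd_per_day=500)
policy.check("POST", "/v1/refunds", 15.0)   # allowed; spend recorded
# policy.check("POST", "/v1/charges", 15.0) # would raise: not allowlisted
```

Note what makes this different from the LLM axis: the cap unit is USD parsed from the vendor's request/response, not tokens, which is exactly why the two proxies don't share an implementation.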

Layer 3 — Audit and cost tools

Layer 3 is where the rows live. The output is a single queryable table whose rows are individual API calls, joined to a per-run grouping key (agent_run_id) so you can ask "what did run run_2026_04_30_8a3f spend?" and get one number across all vendors. Most Layer-3 capability falls out of Layer 2 for free — the proxy that enforces the cap is the natural place to record the call. The remainder is the join: rows from the LLM gateway plus rows from the SaaS-tool proxy plus identity attribution from Layer 1, all on the same key.

The named tools

The agent_run_id join

The Layer-3 capability that actually matters at incident time is "what did one specific agent run spend, across all vendors?" That requires a join on a key emitted by the agent and forwarded by every proxy in the chain. The header convention we recommend (and which Keybrake propagates from incoming requests) is x-agent-run-id: run_2026_04_30_8a3f. Set it once at the top of your agent, propagate through your LLM gateway, propagate through your SaaS-tool proxy, log it everywhere. The result: SELECT vendor, SUM(cost_usd_parsed) FROM audit WHERE agent_run_id = ? GROUP BY vendor gives you the run's spend broken down by vendor. Without this, you have rows from each proxy that can't be correlated, and the runaway-loop postmortem takes hours instead of minutes.
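The join in miniature, with SQLite standing in for whatever backs your audit table (the table and column names match the query in the paragraph above; the row values are invented for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE audit (
    agent_run_id TEXT, vendor TEXT, cost_usd_parsed REAL)""")

# Rows land from both proxies, each already tagged with the
# x-agent-run-id it received from the agent.
db.executemany(
    "INSERT INTO audit VALUES (?, ?, ?)",
    [("run_2026_04_30_8a3f", "openai", 0.42),   # from the LLM gateway
     ("run_2026_04_30_8a3f", "stripe", 15.00),  # from the SaaS-tool proxy
     ("run_2026_04_30_8a3f", "stripe", 15.00),
     ("run_other",           "openai", 0.10)])

rows = db.execute(
    "SELECT vendor, SUM(cost_usd_parsed) FROM audit "
    "WHERE agent_run_id = ? GROUP BY vendor ORDER BY vendor",
    ("run_2026_04_30_8a3f",)).fetchall()
print(rows)  # [('openai', 0.42), ('stripe', 30.0)]
```

One key, two proxies, one answerable question — the postmortem query is a single SELECT instead of a log-grepping session.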

Layer 4 — Post-hoc evaluation tools

This is the layer with the highest vendor density and the most "governance platform" marketing. The pick here is driven more by your compliance regime than by tool quality — the OSS options are excellent for teams without specific regulatory requirements, and the commercial vendors earn their licence fees by mapping their controls onto specific regulatory frameworks (EU AI Act, NIST AI RMF, ISO 42001).

Open-source picks

Commercial picks

The rule of thumb across Layer 4: if your immediate worry is incident prevention, pick OSS — Promptfoo + Garak gives you 80% of the value for $0. If your immediate worry is regulatory compliance, pick commercial — the evidence-export and audit-trail features are what you're paying for. Neither addresses Layer 2; the governance-platform rebuttal has the longer version of why these vendors don't extend into the runtime policy layer.

Cross-layer / orchestration tools

A handful of tools don't fit one layer cleanly because they touch the workflow that produces agent runs in the first place. They are not governance tools per se, but they shape what governance you can do.

What we run for Keybrake's own production stack

Eat-our-own-dogfood disclosure. Keybrake runs the dual-proxy + audit + eval stack on a single VPS:

This is disclosure, not aspiration — the production system we describe in the rest of our marketing is exactly this stack. The omission of a commercial Layer-4 tool is deliberate: pre-revenue and pre-compliance, the cost-benefit isn't there. A team in a regulated industry would pick differently, and we'd recommend they do.

Get early access to Keybrake (Layer 2 — SaaS-tool axis)

What to skip

The "AI agent governance tools" search returns several categories of result that we don't think belong on a shortlist for a production agent. Brief disclosure of what we considered and why we left it off:

Related questions

What's the absolute minimum I can ship with?

One Layer-2 proxy on whichever axis matters most for your agent. If your agent calls Stripe, that's the SaaS-axis proxy (Keybrake or DIY); if your agent only calls LLMs, that's an LLM gateway (LiteLLM or Portkey). Layer 3 falls out of Layer 2 for free if you log every call. Layer 1 is whatever your secret store already is. Layer 4 is deferrable until you have a specific compliance ask. The bare minimum is one proxy and a SQLite table. Everything else compounds value but isn't strictly necessary on day one.

I already use Datadog / New Relic / Honeycomb. Do I need a separate Layer-3 tool?

Probably not, if you're willing to wire in OpenTelemetry and pay for the additional spans. The downsides are (1) cost — agent traffic produces spans in volume, and per-span APM pricing makes the bill grow fast — and (2) the LLM-specific cost parsers in Langfuse / Phoenix / Helicone do work for you that generic APMs don't. If you already pay Datadog and have headroom, the OTel route is cleanest. If you're cost-sensitive, a dedicated tool will be cheaper at agent volume.

Where does an MCP server fit in this list?

MCP is a Layer-1 artefact (it provides a discoverable, scoped credential surface for tools the agent can call) plus a small Layer-2 component (the tool definition is implicitly an allowlist of what the agent can attempt). It does not enforce per-day USD caps or sub-second mid-run revoke — both are still Layer 2's job. The MCP-auth page covers the auth handshake; the Stripe Agent Toolkit page covers what Stripe's own MCP server gives you and where it stops. Net: an MCP-served tool is still upstream of Layer 2 — put a Layer-2 proxy between the MCP server and the underlying SaaS API.

Open-source-only stack — what's the answer?

Layer 1: HashiCorp Vault. Layer 2 LLM: LiteLLM. Layer 2 SaaS-tool: roll your own SDK wrapper using the patterns in our Stripe-key blog post (or use Keybrake — we're OSS-licensed for self-host, and that's also the cheapest path that scales beyond one vendor). Layer 3: Langfuse self-hosted (Postgres backend, decent UI). Layer 4: Promptfoo + Garak. Total monthly spend: $0 in software licences, plus whatever your ops team's time is worth running five OSS services. Real total is probably 1-3 days of platform-engineer time per month; whether that's cheaper than commercial-licensed tooling depends on your salary structure.

Pre-revenue startup — should I bother with all four layers?

Layer 2 only, on the axis where your agent burns money. If your agent is calling LLMs to generate copy, Layer 2 LLM is the answer (LiteLLM with a $50/day cap); the rest can wait until you have customers. If your agent is touching Stripe at all, Layer 2 SaaS-tool is the answer regardless of revenue stage — a stuck refund loop pre-revenue is just as expensive as one post-revenue, and pre-revenue your runway is more precious. Layer 4 is almost never the right pre-revenue investment unless your industry's compliance regime makes it a launch blocker.

How do I evaluate "vendor X says they're a governance platform"?

Three-question test. (1) Do they sit inline on the request path between the agent and the SaaS API, or only on the model input/output? If only the latter, they're Layer 4 marketed as Layer 2. (2) Can they enforce a per-day USD cap on Stripe today, with parsed-from-response cost? If no, the most expensive scenarios are not in scope. (3) What's the median and p95 propagation latency for a mid-run revoke? If they can't tell you, they don't have one. Any vendor failing two of three questions is solving a different problem than the one in your head when you searched "agent governance".

Is "agent governance tools" the same search as "AI governance tools"?

Overlapping but distinct. "AI governance tools" historically returns Credo AI, Holistic AI, IBM Watson OpenScale, Datadog AI, ServiceNow Risk — model-risk, regulatory compliance, fairness audits at the org level. "Agent governance tools" (this page's intent) is more operational: runtime controls that constrain an autonomous agent's actions during execution. The first is about model outputs and policy posture; the second is about the agent's hand on the wheel. The vocabulary cross-pollinates because the vendors that built tools for the first category are now marketing into the second. Read the Layer-2 rows of the table above carefully when shopping; they're the easiest way to tell which problem a tool actually solves.

Further reading