AWS Step Functions AI agent API key: scoping vendor calls in state machine workflows
AWS Step Functions is a fully managed serverless orchestration service that executes state machines built from Task, Choice, Wait, Map, and Parallel states, with built-in error handling, retry logic, and visual workflow monitoring. AI agent teams on AWS adopt Step Functions because it provides durable execution, an execution history that can be logged to CloudWatch, and parallel fan-out without managing queues or workers. When Map states dispatch Lambda functions that call Stripe, Twilio, or Resend, those reliability features become vendor spend amplifiers. MaxConcurrency: 0 places no limit on parallel iterations, so Step Functions runs as many at once as the Map mode allows, with no dollar cap. Retry rules re-execute failed states that may have already reached Stripe, which risks duplicate charges unless each call carries a stable idempotency key derived from the execution ARN. And the Step Functions runtime has no built-in per-execution spend limit. This page covers the vault-key pattern that bounds vendor spend per Step Functions execution.
TL;DR
Add an IssueVaultKey Lambda task as the first state in your Step Functions definition. It calls the Keybrake API to issue a scoped vault key and writes the result to the execution's state data under $.vault. Downstream states receive vault_key via InputPath or explicit field mapping — for Map state iterations, pass "vault_key.$": "$.vault.vault_key" in ItemSelector so every Lambda invocation receives the same vault key. All concurrent iterations share one cap that accumulates atomically. Revoking a runaway execution is a single DELETE /vault/keys/{key_id} call — no StopExecution API call, no Secrets Manager rotation, no redeployment.
How Step Functions AI agent workflows call vendor APIs
A typical agent billing workflow uses a Map state to invoke a charge Lambda in parallel for each customer:
{
  "Comment": "Agent billing workflow",
  "StartAt": "FetchCustomers",
  "States": {
    "FetchCustomers": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789:function:fetch-customers",
      "Next": "ChargeCustomers"
    },
    "ChargeCustomers": {
      "Type": "Map",
      "ItemsPath": "$.customers",
      "MaxConcurrency": 0,
      "Iterator": {
        "StartAt": "ChargeOne",
        "States": {
          "ChargeOne": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789:function:charge-customer",
            "Retry": [{
              "ErrorEquals": ["Lambda.ServiceException", "States.TaskFailed"],
              "IntervalSeconds": 30,
              "MaxAttempts": 3
            }],
            "End": true
          }
        }
      },
      "End": true
    }
  }
}
This pattern has two compounding risks. First, MaxConcurrency: 0 removes the limit on parallel iterations, so Step Functions runs as many ChargeOne invocations at once as the Map mode allows: about 40 for an inline Map state like this one, and thousands for a Distributed Map. With 3,000 customers, that is 3,000 Stripe calls issued as fast as the Map state can schedule them, each using the same STRIPE_SECRET_KEY from the function's environment variables. There is no dollar-spend stop condition at the Map level; the Map state keeps going until every item is processed or an iteration fails. Second, the Retry rule re-executes the ChargeOne task on failure. If Stripe returned a 500 after the charge had already been applied, the retry calls Stripe again without an idempotency key and can create a duplicate charge. The execution ARN is available in the Context Object as $$.Execution.Id, but Step Functions does not pass it to Lambda invocations automatically, so code that never wires it through has nothing stable to build an idempotency key from and is unsafe to retry.
Three gaps Step Functions' native tooling doesn't fill for vendor spend control
| Gap | What happens in practice | Step Functions' answer |
|---|---|---|
| No per-execution spend cap | Step Functions has no mechanism to halt a Map state when cumulative vendor API spend reaches a dollar threshold. MaxConcurrency limits concurrent invocations by count, not by cost. AWS Budgets alerts track AWS charges rather than Stripe or Twilio spend, and they fire hours after the spend has occurred, too late to stop a Map state that completes in minutes. A heartbeat timeout (HeartbeatSeconds) caps how long a single Task state can run between heartbeats, but that is wall-clock time on a single task, not cumulative vendor dollars across a Map state's iterations. | AWS Cost Anomaly Detection sends alerts after spend has been incurred. No pre-call, per-execution dollar cap in the Step Functions runtime. |
| No mid-execution vendor revoke without Secrets Manager rotation | The Stripe API key is typically stored in AWS Secrets Manager or as a Lambda environment variable. Rotating the Secrets Manager secret version prevents new Lambda cold starts from fetching the old key, but warm Lambda execution environments that already loaded the key into process memory keep using it until Lambda recycles them, on a schedule the function owner does not control. Stopping an execution via StopExecution halts further state transitions, but Lambda invocations that are already running cannot be interrupted and complete with the old key. | StopExecution marks the execution as aborted but cannot interrupt in-flight Lambda invocations. No per-execution API key scoping that revokes cleanly without function recycling. |
| No per-call audit with execution context | CloudWatch Logs and X-Ray capture Lambda invocation events, durations, and errors, but they don't parse dollar amounts from Stripe response bodies, correlate Stripe PaymentIntent.id values with the Step Functions execution ARN and Map iteration index in a structured cost table, or provide a queryable per-execution spend summary. Reconstructing what a runaway execution charged requires cross-referencing CloudWatch Logs, X-Ray traces, and the Stripe dashboard by manually matching timestamps, since no shared cost identifier is propagated by default. | Step Functions execution history logs state transitions and I/O payloads. No structured vendor cost tracking or execution-ARN-to-charge correlation. |
The Map state amplification risk
The Map state with MaxConcurrency: 0 is the primary fan-out mechanism and the primary spend amplifier. Each iteration runs as an independent Lambda invocation with its own connection to Stripe. A Map state iterating over 500 customer records issues 500 Stripe calls as quickly as its concurrency allows: roughly 40 at a time in inline mode, and far more in distributed mode, where the Lambda concurrency available to the function (1,000, say) becomes the effective ceiling on simultaneous Stripe calls from a single Map state. The execution completes when the last iteration finishes, but by then every iteration that succeeded has already applied its vendor charge, including those that a cap would have blocked.
The Retry amplification compounds this. When a Lambda invocation fails with an error that matches a Retry rule, Step Functions re-invokes the Lambda up to MaxAttempts more times. A Stripe 500 returned after the charge had already been applied is retried without checking whether the charge landed. Without a stable idempotency key derived from $$.Execution.Id and $$.Map.Item.Index, each retry can create another charge, so a single item can be charged up to 1 + MaxAttempts times.
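The combined effect of fan-out and retries can be estimated with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes the worst case in which every attempt reaches the vendor and no idempotency key deduplicates the retries.

```javascript
// Upper bound on vendor calls for one Map state. MaxAttempts counts retries,
// so each item can be attempted (1 + maxAttempts) times in total.
function worstCaseVendorCalls(itemCount, maxAttempts) {
  return itemCount * (1 + maxAttempts);
}

// Upper bound on dollars charged if every attempt for every item lands.
function worstCaseSpendUsd(amountsCents, maxAttempts) {
  const totalCents = amountsCents.reduce((sum, cents) => sum + cents, 0);
  return (totalCents * (1 + maxAttempts)) / 100;
}

console.log(worstCaseVendorCalls(500, 3)); // 2000
console.log(worstCaseSpendUsd([1999, 4900], 3)); // 275.96
```

With 500 customers and MaxAttempts: 3, the Map state can make up to 2,000 Stripe calls, which is the exposure a per-execution cap is meant to bound.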
Scoping vault keys per Step Functions execution
{
  "Comment": "Agent billing workflow with vault key",
  "StartAt": "IssueVaultKey",
  "States": {
    "IssueVaultKey": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789:function:issue-vault-key",
      "Parameters": {
        "vendor": "stripe",
        "daily_usd_cap.$": "$.budget_usd",
        "allowed_endpoints": ["POST /v1/payment_intents"],
        "expires_in": "2h",
        "agent_run_label.$": "$$.Execution.Id"
      },
      "ResultPath": "$.vault",
      "Next": "FetchCustomers"
    },
    "FetchCustomers": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789:function:fetch-customers",
      "Parameters": { "plan_id.$": "$.plan_id" },
      "ResultPath": "$.customers",
      "Next": "ChargeCustomers"
    },
    "ChargeCustomers": {
      "Type": "Map",
      "ItemsPath": "$.customers",
      "MaxConcurrency": 0,
      "ItemSelector": {
        "customer.$": "$$.Map.Item.Value",
        "item_index.$": "$$.Map.Item.Index",
        "vault_key.$": "$.vault.vault_key",
        "execution_id.$": "$$.Execution.Id"
      },
      "Iterator": {
        "StartAt": "ChargeOne",
        "States": {
          "ChargeOne": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789:function:charge-customer",
            "Catch": [{
              "ErrorEquals": ["CapExhausted"],
              "Next": "CapHitEnd"
            }],
            "End": true
          },
          "CapHitEnd": { "Type": "Succeed" }
        }
      },
      "End": true
    }
  }
}
// charge-customer Lambda function (Node.js 18+, which provides a global fetch)
const KEYBRAKE_BASE = 'https://proxy.keybrake.com';

exports.handler = async (event) => {
  const { customer, vault_key, execution_id, item_index } = event;
  // Idempotency key: stable across retries, unique per customer per execution
  const idempotencyKey = `${execution_id}-${item_index}`;
  const res = await fetch(`${KEYBRAKE_BASE}/stripe/v1/payment_intents`, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${vault_key}`,
      'Idempotency-Key': idempotencyKey,
      // Stripe's v1 API takes form-encoded bodies; this assumes the proxy
      // forwards the request body to Stripe unchanged
      'Content-Type': 'application/x-www-form-urlencoded'
    },
    body: new URLSearchParams({
      amount: String(customer.amount_cents),
      currency: 'usd',
      customer: customer.id
    })
  });
  if (res.status === 429) {
    const body = await res.json();
    if (body.code === 'cap_exhausted') {
      // Throw a named error so the Catch block in the state definition handles it.
      // Cap exhaustion is intentional, not transient, so it should not be retried.
      const err = new Error(body.message);
      err.name = 'CapExhausted';
      throw err;
    }
  }
  // Any other non-2xx response surfaces as a generic task failure
  if (!res.ok) throw new Error(`Stripe error: ${res.status}`);
  return res.json();
};
The IssueVaultKey state runs before the Map state and writes the vault key into $.vault.vault_key using ResultPath. The Map state's ItemSelector passes vault_key explicitly into each iteration's input alongside item_index (from $$.Map.Item.Index) and execution_id. All concurrent Lambda invocations receive the same vault key, so they share one cap that accumulates atomically. The idempotency key, built as execution_id-item_index, is stable across retries: if a Retry rule on ChargeOne causes Step Functions to re-run the task, the Lambda receives the same execution_id and item_index, constructs the same idempotency key, and Stripe deduplicates the call.
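The issue-vault-key function itself is not shown above. A minimal sketch follows, under loud assumptions: the POST /vault/keys endpoint, the api.keybrake.example base URL, the KEYBRAKE_ADMIN_TOKEN variable, and the response shape are all hypothetical, inferred only from the DELETE /vault/keys/{key_id} call and the $.vault.vault_key path used on this page. Check the actual Keybrake API before relying on any of them.

```javascript
// Hypothetical management endpoint; the page only documents
// DELETE /vault/keys/{key_id}, so POST /vault/keys is an assumption.
const KEYBRAKE_API = process.env.KEYBRAKE_API_URL || 'https://api.keybrake.example';

// fetchImpl is injectable so the function can be exercised without a network.
async function issueVaultKey(event, fetchImpl = fetch) {
  const res = await fetchImpl(`${KEYBRAKE_API}/vault/keys`, {
    method: 'POST',
    headers: {
      // Assumed credential name; stored on this one function only.
      'Authorization': `Bearer ${process.env.KEYBRAKE_ADMIN_TOKEN}`,
      'Content-Type': 'application/json'
    },
    // Fields mirror the Parameters block of the IssueVaultKey state.
    body: JSON.stringify({
      vendor: event.vendor,
      daily_usd_cap: event.daily_usd_cap,
      allowed_endpoints: event.allowed_endpoints,
      expires_in: event.expires_in,
      agent_run_label: event.agent_run_label
    })
  });
  if (!res.ok) throw new Error(`Vault key issuance failed: ${res.status}`);
  const { key_id, vault_key } = await res.json();
  // Return only what downstream states need; ResultPath places it under $.vault.
  return { key_id, vault_key };
}

// In a Lambda deployment: exports.handler = (event) => issueVaultKey(event);
```

Returning key_id alongside vault_key keeps the identifier needed for revocation in the execution's state data, where an operator can find it from the execution history.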
How Keybrake fits
Keybrake is the proxy layer between your charge Lambda functions and Stripe, Twilio, or Resend. The vault key issued in the IssueVaultKey state replaces the STRIPE_SECRET_KEY environment variable previously loaded into each Lambda function. The real Stripe secret stays in Keybrake — it is never present in Lambda environment variables, CloudWatch Logs, or Step Functions execution history. For Map state fan-out, the vault key is passed via ItemSelector — all iterations use the same key and the same cap accumulates across all concurrent Lambda invocations. Revoking a runaway execution is a single DELETE /vault/keys/{key_id} call — effective on the next proxied request, without a StopExecution call, without Secrets Manager rotation, and without affecting other running executions that hold different vault keys.
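Revocation can be scripted as a small helper. This is a sketch, not the documented client: the DELETE /vault/keys/{key_id} path comes from the text above, while the base URL, the bearer-token header, and the choice to treat a 404 as "already revoked" are assumptions made for illustration.

```javascript
// Revoke one execution's vault key. baseUrl and adminToken are supplied by the
// caller because the management host and auth scheme are assumptions here.
async function revokeVaultKey(keyId, { baseUrl, adminToken, fetchImpl = fetch }) {
  const url = `${baseUrl}/vault/keys/${encodeURIComponent(keyId)}`;
  const res = await fetchImpl(url, {
    method: 'DELETE',
    headers: { 'Authorization': `Bearer ${adminToken}` }
  });
  // Assumption: a 404 means the key is already gone, so repeating the call is safe.
  if (!res.ok && res.status !== 404) {
    throw new Error(`Revoke failed: ${res.status}`);
  }
  return { keyId, revoked: true };
}
```

Because the key is scoped to one execution, a call such as revokeVaultKey(vault.key_id, { baseUrl, adminToken }) affects only that execution's remaining proxied requests.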
Related questions
How do I pass the vault key to states inside a Map iterator?
Use ItemSelector (formerly Parameters inside Map state) to inject the vault key from the outer execution state into each iteration's input. The key line is "vault_key.$": "$.vault.vault_key" — this reads vault_key from the outer execution's state data (where IssueVaultKey wrote it to ResultPath: "$.vault") and passes it as a top-level field in each iteration's input object. Inside the iterator's Lambda invocation, the event object will contain event.vault_key. Do not issue a separate vault key per iteration — that creates N independent caps and defeats the purpose of bounding total spend for the execution.
How do I distinguish cap exhaustion from a transient Stripe error in Retry rules?
Throw a custom named error (err.name = 'CapExhausted') from the Lambda when the proxy returns a 429 with code: 'cap_exhausted' in the response body. In your Step Functions state definition, add a Catch block that matches CapExhausted and transitions to a terminal success or notification state. A Catch alone does not prevent retries: Step Functions applies Retry rules first and consults Catch only when no retrier matches or the retries are exhausted, and States.TaskFailed acts as a wildcard that also matches custom errors such as CapExhausted. To exclude cap exhaustion, list a retrier for CapExhausted with MaxAttempts set to 0 ahead of the retriers for transient errors (Lambda.ServiceException, States.TaskFailed, States.Timeout). Retriers are evaluated in order, so the zero-attempt entry matches first and the error falls straight through to the Catch.
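One way to make sure cap exhaustion is never retried, however broad the other retriers are, is a zero-attempt retrier placed first in the Retry list. The helper below is a hypothetical sketch (withCapGuard is not part of any SDK) that builds such a Task state as a plain object, which can then be serialized into the state machine definition.

```javascript
// Returns a copy of a Task state that never retries CapExhausted and routes it
// to `nextState` through Catch. The input state is not modified.
function withCapGuard(taskState, nextState) {
  return {
    ...taskState,
    Retry: [
      // Retriers are matched in order, so this zero-attempt entry must come
      // before broader ones such as States.TaskFailed or States.ALL.
      { ErrorEquals: ['CapExhausted'], MaxAttempts: 0 },
      ...(taskState.Retry || [])
    ],
    Catch: [
      { ErrorEquals: ['CapExhausted'], Next: nextState },
      ...(taskState.Catch || [])
    ]
  };
}

const chargeOne = withCapGuard({
  Type: 'Task',
  Resource: 'arn:aws:lambda:us-east-1:123456789012:function:charge-customer',
  Retry: [{
    ErrorEquals: ['Lambda.ServiceException', 'States.TaskFailed'],
    IntervalSeconds: 30,
    MaxAttempts: 3
  }],
  End: true
}, 'CapHitEnd');

console.log(JSON.stringify(chargeOne, null, 2));
```

Transient failures still get three retries, while a CapExhausted error skips retrying and moves directly to the CapHitEnd state.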
What vault key TTL should I use for Step Functions executions with Wait states?
Set expires_in to cover the execution's expected wall-clock duration including any Wait states. A standard billing workflow without Waits might run in under 10 minutes, in which case expires_in: "30m" leaves comfortable headroom. If your state machine includes Task states that use the .waitForTaskToken callback pattern (human approval flows, external event callbacks), the execution can pause for hours or days. In this case, issue the vault key after the callback resumes rather than at execution start: make the waiting state the predecessor of a second IssueVaultKey task. This avoids issuing a vault key that expires while the execution is paused. For executions with vendor calls on both sides of the pause, split into two vault key issuances: one for pre-wait vendor calls and one for post-wait vendor calls.
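The ordering described above can be sketched as a state machine fragment, written here as a JavaScript object so it can be serialized with JSON.stringify. The state names, the request-approval function, and the 30m TTL are illustrative assumptions; the Map state is replaced by a placeholder.

```javascript
const issueKeyFn = 'arn:aws:lambda:us-east-1:123456789012:function:issue-vault-key';

const definition = {
  Comment: 'Issue the vault key only after the callback resumes',
  StartAt: 'AwaitApproval',
  States: {
    AwaitApproval: {
      Type: 'Task',
      // Callback pattern: the execution pauses here until the task token is
      // returned through SendTaskSuccess or SendTaskFailure.
      Resource: 'arn:aws:states:::lambda:invoke.waitForTaskToken',
      Parameters: {
        FunctionName: 'request-approval', // hypothetical function
        Payload: {
          'task_token.$': '$$.Task.Token',
          'execution_id.$': '$$.Execution.Id'
        }
      },
      ResultPath: '$.approval',
      Next: 'IssueVaultKeyPostWait'
    },
    IssueVaultKeyPostWait: {
      Type: 'Task',
      Resource: issueKeyFn,
      Parameters: {
        vendor: 'stripe',
        'daily_usd_cap.$': '$.budget_usd',
        allowed_endpoints: ['POST /v1/payment_intents'],
        expires_in: '30m', // the TTL clock starts after the pause, not before
        'agent_run_label.$': '$$.Execution.Id'
      },
      ResultPath: '$.vault',
      Next: 'ChargeCustomers'
    },
    // Placeholder for the ChargeCustomers Map state shown earlier.
    ChargeCustomers: { Type: 'Succeed' }
  }
};

console.log(JSON.stringify(definition, null, 2));
```

Because the key is issued after AwaitApproval completes, a short TTL is enough no matter how long the approval takes.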
Further reading
- Temporal AI agent API key: similar per-workflow vault key pattern for durable execution; Temporal Activities map to Step Functions Task states and the same idempotency-key-from-execution-ID pattern applies.
- AWS Lambda AI agent API key: for Lambda functions invoked directly by SQS or EventBridge rather than through Step Functions, with SQS messageId as the natural idempotency key.
- AI agent idempotency: why execution-ID-based idempotency keys are essential when Step Functions retries failed Task states, and how to derive a stable key from $$.Execution.Id and $$.Map.Item.Index.
- AI agent spend reporting: the four reporting queries that give per-execution cost visibility that CloudWatch and X-Ray don't provide natively.