Kubernetes CronJob AI agent API key: scoping vendor calls in scheduled batch Pods

Kubernetes CronJobs are the standard scheduled batch mechanism on Kubernetes — periodic billing agents, report generation, and recurring data-processing pipelines all run as CronJob-created Jobs. When a Job's parallelism is greater than one, multiple Pods start simultaneously, each mounting the same Stripe API key from a Kubernetes Secret and calling the vendor API independently with no per-job dollar cap. The Job's backoffLimit retries failed Pods — Pods that may have already charged Stripe. This page covers the vault-key pattern that bounds vendor spend per CronJob run.

TL;DR

Add an init container to your Job's Pod template. The init container calls the Keybrake API to issue a vault key scoped to the job's budget and writes it to an emptyDir volume shared with the main container. The main container reads the vault key from that volume at startup and uses it as the Stripe credential via the Keybrake proxy. An emptyDir volume is local to one Pod, so for all Pods in the Job to share one vault key whose cap accumulates across concurrent Pods, the key has to be issued once per Job and distributed through a store that every Pod can reach. Revoking a runaway Job is a single DELETE /vault/keys/{key_id} call, with no Kubernetes Secret rotation and no Pod termination required.

How Kubernetes CronJobs call vendor APIs

A typical Kubernetes CronJob for billing fans out parallel Pods across a customer slice:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: billing-agent
spec:
  schedule: "0 2 * * *"   # nightly at 02:00 UTC
  jobTemplate:
    spec:
      completions: 100     # 100 Pods total (each handles a customer slice)
      parallelism: 20      # 20 Pods running simultaneously
      backoffLimit: 3      # retry failed Pods up to 3 times
      template:
        spec:
          containers:
          - name: billing
            image: myregistry/billing-agent:latest
            env:
            - name: STRIPE_SECRET_KEY
              valueFrom:
                secretKeyRef:
                  name: stripe-credentials
                  key: secret_key
            - name: JOB_INDEX
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']

This pattern has two compounding risks. First, parallelism: 20 means Kubernetes runs up to 20 Pods at once. Each Pod receives the same key from the stripe-credentials Secret and calls Stripe independently, so up to 20 Pods draw on the same uncapped key concurrently, with no dollar limit on the job run as a whole.

Second, backoffLimit: 3 retries failed Pods. A Pod that called Stripe, received a 500 after the charge had already been created, and exited with a non-zero code will be replaced. Without an idempotency key derived from the Pod's completion index, the retry creates a new Stripe charge. Kubernetes exposes the completion index in the annotation batch.kubernetes.io/job-completion-index, but only when the Job sets completionMode: Indexed, which the manifest above omits, and the value still has to be wired through the Pod spec and into the application code.

Three gaps Kubernetes' native tooling doesn't fill for vendor spend control

Gap: No per-job spend cap
What happens in practice: The Kubernetes Job controller tracks Pod completions and failures; it has no concept of dollar spend. activeDeadlineSeconds on the Job terminates all running Pods after a wall-clock duration, but that is a time limit, not a cost limit. A billing job that charges $200 in 30 seconds finishes before any reasonable deadline fires, and no Kubernetes-native cost limit exists to stop it. Cloud billing budget alerts (GCP, AWS, Azure) fire only after spend is ingested, typically 8 to 24 hours after the vendor charge, by which time the CronJob's Pods are long gone.
Kubernetes' answer: activeDeadlineSeconds limits wall-clock duration. No per-job vendor dollar cap exists in the Kubernetes Job controller.

Gap: No mid-job vendor revoke without Secret rotation
What happens in practice: The Stripe key is injected from a Kubernetes Secret at Pod startup. Rotating the Secret (updating its data field) does not affect already-running Pods that read the key from an environment variable: the value is fixed at container start and stays in use until the container exits. For volume-mounted Secrets, the kubelet does refresh the mounted file, but with a delay of up to its sync period (default 1 minute) plus cache propagation time, and the application has to re-read the file to notice. Deleting Pods (kubectl delete pods -l job-name=<job-name>, where the Job name is the CronJob name plus a generated suffix) terminates the containers, but in-flight Stripe API calls can complete during graceful shutdown before the process exits.
Kubernetes' answer: Kubernetes Secret rotation does not affect running Pods with env-sourced secrets. Pod deletion allows in-flight API calls to complete during graceful shutdown.

Gap: No per-call audit with job context
What happens in practice: Kubernetes events and Pod logs capture cluster-level activity. Stripe's dashboard captures API calls by key and timestamp. Correlating which CronJob execution, and which Pod completion index, triggered which Stripe charge requires adding structured logging to the application code with the job name, job UID, and completion index, then cross-referencing those logs with Stripe's dashboard by timestamp. Teams that don't instrument this upfront cannot reconstruct what a specific nightly billing run charged when a customer reports a duplicate.
Kubernetes' answer: Pod logs and Kubernetes events track Pod lifecycle. No structured vendor cost tracking or Pod-to-charge correlation is built into the Job controller.
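
The audit gap can be narrowed in application code. The following is a minimal sketch, assuming the Pod exposes JOB_UID and JOB_COMPLETION_INDEX as environment variables (as the manifests on this page do). The function name charge_log_line and the field names are illustrative choices, not part of any Kubernetes or Stripe API.

```python
import json
import os
from datetime import datetime, timezone


def charge_log_line(customer_id, payment_intent_id, amount_cents, env=None):
    """Build one JSON log line tying a vendor charge to its Job run and Pod index."""
    env = os.environ if env is None else env
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        # Assumed to be injected through the Pod spec; "unknown" flags missing wiring
        "job_uid": env.get("JOB_UID", "unknown"),
        "completion_index": env.get("JOB_COMPLETION_INDEX", "unknown"),
        "customer_id": customer_id,
        "payment_intent_id": payment_intent_id,
        "amount_cents": amount_cents,
    }
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    # Example values are placeholders
    print(charge_log_line("cus_123", "pi_456", 1999,
                          env={"JOB_UID": "abc", "JOB_COMPLETION_INDEX": "7"}))
```

Emitting one such line per charge lets a log query by job_uid reconstruct a run without matching timestamps against the Stripe dashboard.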

The backoffLimit retry amplification risk

Kubernetes Job's backoffLimit retries failed Pods by scheduling new Pod instances for the same completion index. If a Pod fails (non-zero exit code) after calling Stripe, Kubernetes schedules a replacement Pod for the same completion index. Without an idempotency key that maps the completion index to a stable Stripe idempotency key, the replacement Pod makes a new, independent Stripe charge for the same customer.

The completion index is available as JOB_COMPLETION_INDEX in the Pod's environment when the Job's completionMode is Indexed (the default is NonIndexed; set completionMode: Indexed explicitly). The Job's UID (metadata.uid of the Job object) is not a field of the Pod itself, but Kubernetes copies it onto each Pod as the controller-uid label, which the Downward API can expose as an environment variable. Composing {job_uid}-{completion_index} as the Stripe idempotency key makes Stripe deduplication stable across Kubernetes retries when each Pod makes a single charge; a Pod that charges several customers needs a per-charge component as well, such as {job_uid}-{customer_id}. Either way, both values have to be wired through the Pod spec and into the application code, which the default CronJob template does not do.
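
A small sketch of the key composition described above. The helper name stripe_idempotency_key is hypothetical; the 255-character check reflects the limit Stripe documents for idempotency keys.

```python
def stripe_idempotency_key(job_uid, completion_index, customer_id=None):
    """Compose a key that is identical for every retry of the same unit of work."""
    parts = [job_uid, str(completion_index)]
    if customer_id is not None:
        # Needed when one Pod charges several customers within its slice
        parts.append(customer_id)
    key = "-".join(parts)
    if len(key) > 255:
        raise ValueError("idempotency key exceeds 255 characters")
    return key


if __name__ == "__main__":
    # A retried Pod for index 4 of the same Job produces the same key again
    print(stripe_idempotency_key("job-uid-example", 4, "cus_123"))
```

Because the key depends only on values that survive a Pod restart, a replacement Pod for the same completion index repeats the same key and Stripe can deduplicate the request.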

Scoping vault keys per Kubernetes CronJob execution

apiVersion: batch/v1
kind: CronJob
metadata:
  name: billing-agent
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      completions: 100
      parallelism: 20
      completionMode: Indexed
      backoffLimit: 3
      template:
        spec:
          initContainers:
          - name: issue-vault-key
            image: curlimages/curl:latest
            env:
            - name: KEYBRAKE_ADMIN_KEY
              valueFrom:
                secretKeyRef:
                  name: keybrake-credentials
                  key: admin_key
            - name: JOB_UID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['controller-uid']
            - name: JOB_BUDGET_USD
              value: "500"
            command:
            - sh
            - -c
            - |
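              # Only completion index 0 issues the key; other indices wait for it.
              # An emptyDir is local to one Pod, so the wait branch succeeds only if
              # /vault is backed by storage that all Pods in the Job can read.
              if [ "$JOB_COMPLETION_INDEX" = "0" ]; then
                # Write to a temp file and rename so readers never see a partial key
                curl -sf -X POST https://proxy.keybrake.com/vault/keys \
                  -H "Authorization: Bearer $KEYBRAKE_ADMIN_KEY" \
                  -H "Content-Type: application/json" \
                  -d "{\"vendor\":\"stripe\",\"daily_usd_cap\":$JOB_BUDGET_USD,\"allowed_endpoints\":[\"POST /v1/payment_intents\"],\"expires_in\":\"2h\",\"label\":\"$JOB_UID\"}" \
                  -o /vault/key.json.tmp && mv /vault/key.json.tmp /vault/key.json
              else
                # Wait up to 30s for index 0 to write the key, then fail loudly
                for i in $(seq 1 30); do
                  [ -f /vault/key.json ] && break
                  sleep 1
                done
                [ -f /vault/key.json ] || { echo "vault key not found" >&2; exit 1; }
              fi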
            volumeMounts:
            - name: vault
              mountPath: /vault
          containers:
          - name: billing
            image: myregistry/billing-agent:latest
            env:
            - name: KEYBRAKE_BASE
              value: "https://proxy.keybrake.com"
            - name: JOB_UID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['controller-uid']
            volumeMounts:
            - name: vault
              mountPath: /vault
          volumes:
          - name: vault
            emptyDir: {}

# billing-agent container (Python)
import os, json, httpx

KEYBRAKE_BASE = os.environ["KEYBRAKE_BASE"]
JOB_UID = os.environ["JOB_UID"]
COMPLETION_INDEX = os.environ.get("JOB_COMPLETION_INDEX", "0")

with open("/vault/key.json") as f:
    vault_key = json.load(f)["vault_key"]

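# fetch_customer_slice is the application's own helper (not shown); it must return
# the same customers for a given index on every retry
customers = fetch_customer_slice(int(COMPLETION_INDEX), 100)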

for customer in customers:
    idempotency_key = f"{JOB_UID}-{customer['id']}"

    resp = httpx.post(
        f"{KEYBRAKE_BASE}/stripe/v1/payment_intents",
        headers={
            "Authorization": f"Bearer {vault_key}",
            "Idempotency-Key": idempotency_key,
        },
        # Stripe's v1 API expects form-encoded bodies, so send data= rather than json=
        data={
            "amount": customer["amount_cents"],
            "currency": "usd",
            "customer": customer["id"],
        },
    )

    if resp.status_code == 429 and resp.json().get("code") == "cap_exhausted":
        print(f"[{JOB_UID}/{COMPLETION_INDEX}] cap exhausted, exiting cleanly")
        break   # Exit 0 — do not trigger backoffLimit retry for cap exhaustion

    resp.raise_for_status()

The init container for completion index 0 issues the vault key and writes it to /vault/key.json; init containers for all other completion indices wait up to 30 seconds for the file to appear. The main billing container then reads the vault key from the same volume. Note that an emptyDir volume is created per Pod and is not visible to other Pods, so with the manifest exactly as written only the index 0 Pod ever sees the file. For all 20 parallel Pods to share the same vault key and the same accumulating spend cap, /vault has to be backed by storage that every Pod in the Job can reach, or the key has to be distributed through an external store. The idempotency key {job_uid}-{customer_id} is stable across Kubernetes Pod retries for the same completion index, preventing duplicate charges.
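
To make the shared-cap behaviour concrete, here is a toy in-process model of one accumulating cap checked atomically on each request. It is not Keybrake's implementation, which this page does not describe. SharedCap, try_spend, and simulate are hypothetical names, and threads stand in for the 20 parallel Pods.

```python
import threading


class SharedCap:
    """Toy model: one budget, many concurrent spenders, check-and-add under a lock."""

    def __init__(self, cap_cents):
        self.cap_cents = cap_cents
        self.spent_cents = 0
        self._lock = threading.Lock()

    def try_spend(self, amount_cents):
        """Record the spend and return True only if it fits under the cap."""
        with self._lock:
            if self.spent_cents + amount_cents > self.cap_cents:
                return False
            self.spent_cents += amount_cents
            return True


def simulate(workers=20, charges_per_worker=50, amount_cents=1_000, cap_cents=50_000):
    """Run `workers` threads that each attempt `charges_per_worker` charges."""
    cap = SharedCap(cap_cents)
    accepted = []

    def worker():
        for _ in range(charges_per_worker):
            if cap.try_spend(amount_cents):
                accepted.append(amount_cents)  # list.append is thread-safe in CPython

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cap.spent_cents, len(accepted)


if __name__ == "__main__":
    spent, count = simulate()
    print(f"accepted {count} charges, spent {spent} cents")
```

The check and the increment happen under the same lock, which is what keeps concurrent spenders from jointly overshooting the cap; a per-Pod counter could not give that guarantee.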

How Keybrake fits

Keybrake is the proxy layer between your Kubernetes batch Pods and Stripe, Twilio, or Resend. The vault key issued by the init container replaces the Kubernetes Secret that was previously mounted into each Pod's environment — the real Stripe secret stays in Keybrake and never appears in Pod environment dumps, kubectl describe pod output, or application logs. All parallel Pods share one accumulating cap that is enforced atomically on each proxied request; 20 simultaneous Pods cannot collectively exceed the per-job budget. Revoking a runaway CronJob is a single DELETE /vault/keys/{key_id} call — effective on the next proxied request, without kubectl delete pods, without Kubernetes Secret rotation, and without affecting next night's billing run (which will issue a new vault key).
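
A sketch of the revoke call, built with the standard library. The base URL and the DELETE /vault/keys/{key_id} path come from this page; that the endpoint accepts the same Bearer admin key as key issuance is an assumption, and build_revoke_request is a hypothetical helper name.

```python
import urllib.parse
import urllib.request

KEYBRAKE_BASE = "https://proxy.keybrake.com"


def build_revoke_request(key_id, admin_key, base=KEYBRAKE_BASE):
    """Build (but do not send) the DELETE request that revokes one vault key."""
    safe_id = urllib.parse.quote(key_id, safe="")
    return urllib.request.Request(
        url=f"{base}/vault/keys/{safe_id}",
        method="DELETE",
        # Assumption: revocation uses the same Bearer admin auth as issuance
        headers={"Authorization": f"Bearer {admin_key}"},
    )


if __name__ == "__main__":
    req = build_revoke_request("example-key-id", "example-admin-key")
    print(req.get_method(), req.full_url)
```

To send it, pass the request to urllib.request.urlopen(req, timeout=10) from an operator machine or runbook script that holds the admin key.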

Related questions

How do I ensure only one vault key is issued per CronJob execution?

Use the completion index 0 Pod as the key issuer. Indexed completion mode assigns each index to one Pod at a time, but a failed index 0 Pod is replaced under backoffLimit, so make issuance idempotent per Job UID rather than assuming index 0 starts exactly once. Within a Pod, write the issued key to an emptyDir volume that the init container and the main container both mount. An emptyDir is not shared between Pods, so the init-container-writes-to-emptyDir pattern above covers only the single-Pod case (one Pod, one main container). For multi-Pod fan-out, where different Pods hold different completion indices, you need a shared coordination mechanism such as GCS, Redis, or an etcd sidecar: replace the emptyDir with a GCS object write (strongly consistent) or a Redis SETNX operation.
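
The set-if-absent coordination can be sketched as follows. InMemoryStore is a stand-in for a real shared store with SETNX-like semantics, and get_or_issue_vault_key and issue_fn are hypothetical names; a production version would swap in a Redis or GCS client and handle an issuer that crashes after claiming.

```python
import threading
import time


class InMemoryStore:
    """Stand-in for a shared store offering set-if-absent (SETNX-like) semantics."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def set_if_absent(self, key, value):
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def get(self, key):
        with self._lock:
            return self._data.get(key)


def get_or_issue_vault_key(store, job_uid, issue_fn, wait_s=30.0, poll_s=0.5):
    """Return the Job's vault key; only the Pod that wins the claim calls issue_fn."""
    claim_slot = f"vault-key-claim:{job_uid}"
    key_slot = f"vault-key:{job_uid}"
    if store.set_if_absent(claim_slot, "claimed"):
        key = issue_fn()  # exactly one caller per job_uid reaches this line
        store.set_if_absent(key_slot, key)
        return key
    # Everyone else, including a retried issuer Pod, waits for the stored key
    deadline = time.monotonic() + wait_s
    while time.monotonic() < deadline:
        key = store.get(key_slot)
        if key is not None:
            return key
        time.sleep(poll_s)
    raise TimeoutError(f"no vault key published for job {job_uid}")
```

Keying the claim on the Job UID rather than on the completion index means a replacement Pod for index 0 reads the existing key instead of issuing a second one.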

How do I use the Job UID as the vault key label for per-run traceability?

Expose the Job's UID via the Downward API. In the Pod spec, add an environment variable sourced from metadata.labels['controller-uid'] — this is the label Kubernetes sets on all Pods created by a Job, and its value is the Job object's UID. This UID is stable across all Pods in the same Job execution and changes with each new CronJob-created Job. Use it as the vault key label and as the idempotency key prefix ({job_uid}-{customer_id}). Keybrake's audit log can then be queried by label = {job_uid} to reconstruct total spend for any specific CronJob run.

What happens if the CronJob's concurrencyPolicy allows overlapping runs?

Set concurrencyPolicy: Forbid unless you specifically need overlapping CronJob executions. With Allow (the default), two executions can run simultaneously, each with its own Job object and its own Pods. With Replace, the controller deletes the running Job before starting the new one, but the old Job's Pods can still be completing in-flight calls while they terminate. If you issue one vault key per Job (keyed by Job UID), two overlapping executions get two independent vault keys with two independent caps. This is the correct behavior: overlapping runs should not share a cap. But if your budget is per day rather than per run, either issue each key with a daily_usd_cap that accounts for the possibility of two concurrent runs, or use a Keybrake team-level cap that spans all keys issued for the same vendor on the same day.
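
One way to account for overlapping runs is to divide the daily budget by the number of runs that could be active at once. This is a sketch of that arithmetic only; per_run_cap_cents is a hypothetical helper, and an even split is one possible policy among several.

```python
def per_run_cap_cents(daily_budget_cents, max_overlapping_runs):
    """Split a daily budget evenly across the runs that could overlap."""
    if max_overlapping_runs < 1:
        raise ValueError("max_overlapping_runs must be at least 1")
    # Floor division: rounding down keeps the sum of caps within the daily budget
    return daily_budget_cents // max_overlapping_runs


if __name__ == "__main__":
    # A $500 daily budget with up to two overlapping runs
    print(per_run_cap_cents(50_000, 2))
```

With the $500 budget from the manifest and two possible overlapping runs, each key would be issued with a cap of $250.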

Further reading