Celery Stripe Integration: Restricted API Keys, Spend Caps, and Agent Governance
Celery's autoretry_for is the most convenient way to make a billing task reliable. It is also the mechanism most likely to silently fire duplicate Stripe charges when a downstream step fails after the charge already went through.
Celery is the most widely deployed distributed task queue in the Python ecosystem. It powers background jobs at companies running everything from Django e-commerce sites to LLM agent pipelines that fan billing operations across dozens of concurrent workers. When those agents involve Stripe — subscription renewals, usage-based billing, invoice generation — Celery's retry and reliability semantics introduce billing failure modes that are not obvious from reading the task decorator alone.
This post covers three failure modes specific to Celery's architecture: autoretry_for re-running a completed charge on downstream exception, a module-level stripe.api_key bleeding across all worker processes in the pool, and acks_late=True re-delivering a task to the broker after a worker crash that happened after the charge succeeded. Each section includes Python code and the governance pattern that closes it — content-hash idempotency keys at the Stripe layer and per-task vault keys via the Keybrake proxy at the key-management layer. A note on Celery as Airflow's executor backend closes the post for teams running both.
Failure mode 1: autoretry_for fires duplicate charges on downstream failure
Celery's autoretry_for parameter is the idiomatic way to add retry logic to a task without writing explicit retry code. You list exception types that should trigger an automatic retry, set max_retries, and Celery handles the rest. For a billing task, the intent is to retry on transient network errors — a stripe.error.APIConnectionError, a database timeout. The problem is that autoretry_for retries the entire task callable from line 1, with no awareness of what succeeded within that callable before the exception was raised.
# billing_tasks.py — UNSAFE: autoretry_for fires duplicate charges
import stripe
import os
from celery import Celery
app = Celery("billing", broker="redis://localhost:6379/0")
stripe.api_key = os.environ["STRIPE_SECRET_KEY"] # module-level — shared across all workers
@app.task(
autoretry_for=(Exception,),
max_retries=3,
retry_backoff=True,
)
def charge_customer(customer_id: str, amount_cents: int, billing_period: str) -> str:
# Charge succeeds — Stripe returns ch_A
charge = stripe.charges.create(
amount=amount_cents,
currency="usd",
customer=customer_id,
description=f"Subscription {billing_period}",
# No idempotency_key — every retry = new Stripe charge
)
# If this write raises DatabaseConnectionError, autoretry_for triggers.
# Retry 1: stripe.charges.create() fires again → ch_B
# Retry 2: → ch_C. Retry 3: → ch_D. Four charges total.
update_billing_record(customer_id, charge["id"], billing_period)
return charge["id"]
The failure sequence: stripe.charges.create() returns ch_A. update_billing_record() raises a DatabaseConnectionError. Celery catches it (because Exception is in autoretry_for), waits for the backoff interval, and re-queues the task. The worker picks it up and runs the callable from the top. stripe.charges.create() fires again — Stripe has no information linking this new request to the previous one, so it creates ch_B. With max_retries=3 and a persistent database issue, the customer is charged four times. The Celery task log shows three retries and a final failure. The duplicate charges appear only in the Stripe dashboard.
This pattern is especially common because the most reliable failure point is not Stripe — it is the database write that records the charge after the fact. Stripe has extremely high uptime; your application database or CRM does not. Any transient error in the recording step triggers the retry chain, and because the Stripe call happens before the recording, every retry starts by calling Stripe again.
The fix: content-hash idempotency key + vault key via proxy
The idempotency key must be derived from the billing parameters themselves — not generated inside the task callable with uuid4(), which produces a different value on every retry. A content-hash of (customer_id, amount_cents, billing_period) is stable across every retry of the same billing operation, so Stripe deduplicates them into the original charge regardless of how many times the task re-runs.
# billing_tasks.py — SAFE: content-hash idempotency key + vault key per task
import hashlib
import stripe
from celery import Celery
app = Celery("billing", broker="redis://localhost:6379/0")
def billing_idempotency_key(customer_id: str, amount_cents: int, billing_period: str) -> str:
raw = f"{customer_id}:{amount_cents}:{billing_period}:celery-billing"
return hashlib.sha256(raw.encode()).hexdigest()[:32]
@app.task(
autoretry_for=(Exception,),
max_retries=3,
retry_backoff=True,
)
def charge_customer(customer_id: str, amount_cents: int, billing_period: str,
vault_key: str) -> str:
# vault_key: POST /v1/charges only, daily cap = expected max charge, expires 24h
# Passed as task argument — not read from module-level env var
stripe_client = stripe.StripeClient(
vault_key,
base_url="https://proxy.keybrake.com/stripe",
)
idempotency_key = billing_idempotency_key(customer_id, amount_cents, billing_period)
# Same key on every retry — Stripe returns ch_A on all retries after the first
charge = stripe_client.charges.create(
params={
"amount": amount_cents,
"currency": "usd",
"customer": customer_id,
"description": f"Subscription {billing_period}",
"metadata": {"billing_period": billing_period},
},
options={"idempotency_key": idempotency_key},
)
update_billing_record(customer_id, charge.id, billing_period)
return charge.id
With this pattern, any retry of the same task — whether triggered by autoretry_for, a manual .retry() call, or a Celery beat re-schedule — produces the same idempotency key. Stripe sees the second and third requests, recognizes the key, and returns the original ch_A without creating a new charge. The vault_key is passed as a task argument rather than read from the module-level environment, which closes the second failure mode.
Failure mode 2: Module-level stripe.api_key bleeds across all workers in the pool
Celery uses a prefork execution model by default: each worker process is a separate OS process, and app.worker_main() forks a pool of concurrency worker processes (default: one per CPU core). When you set stripe.api_key = os.environ["STRIPE_SECRET_KEY"] at module level, every forked worker process inherits that assignment. This is not a bug — it is how Python forked processes work — but it has two billing consequences that compound each other.
First, there is no per-task or per-agent spend cap. All tasks running across all workers in the pool share one Stripe key. If a billing loop task enters an error state and starts retrying aggressively, it can exhaust the account's Stripe rate limit or, for metered billing accounts, the daily charge quota, for every other agent and task in the pool simultaneously. A single runaway charge_customer task consuming retries cannot be stopped at the key level without rotating the key for every other task in production.
# UNSAFE: module-level key shared across all worker processes
import stripe
import os
stripe.api_key = os.environ["STRIPE_SECRET_KEY"] # every worker fork inherits this
@app.task(autoretry_for=(stripe.error.StripeError,), max_retries=5)
def charge_one_customer(customer_id: str, amount_cents: int, billing_period: str):
# This key is also used by every other billing task running on every other worker.
# A runaway retry loop here hits Stripe rate limits for the entire pool.
# To stop it, you must rotate STRIPE_SECRET_KEY — which affects everything.
return stripe.charges.create(
amount=amount_cents, currency="usd", customer=customer_id
)["id"]
Second, Celery workers can be deployed on multiple machines. In a distributed setup, the shared module-level key flows to every machine's worker pool. Revoking the key requires a deployment to update every machine's environment variables, restart all worker processes, and wait for the queue to drain. There is no per-task or per-machine kill switch.
The fix is to pass the Stripe credentials as task arguments rather than reading them from module scope. Per-task vault keys — scoped to a specific customer, endpoint, and spend cap — are sized exactly for the task they serve and can be revoked instantly without touching any other task in the pool.
# SAFE: vault key passed as task argument — one key per billing run
@app.task(
autoretry_for=(Exception,),
max_retries=3,
retry_backoff=True,
)
def charge_one_customer(customer_id: str, amount_cents: int, billing_period: str,
vault_key: str) -> str:
# vault_key scoped to: POST /v1/charges only, $amount_cents+10% daily cap
# Revoke this key from the Keybrake dashboard — doesn't affect any other task
stripe_client = stripe.StripeClient(
vault_key,
base_url="https://proxy.keybrake.com/stripe",
)
idempotency_key = billing_idempotency_key(customer_id, amount_cents, billing_period)
charge = stripe_client.charges.create(
params={"amount": amount_cents, "currency": "usd", "customer": customer_id,
"metadata": {"billing_period": billing_period}},
options={"idempotency_key": idempotency_key},
)
update_billing_record(customer_id, charge.id, billing_period)
return charge.id
# Issue vault keys at dispatch time — before adding to the queue
def dispatch_billing_batch(customers: list[dict], billing_period: str):
for customer in customers:
vault_key = keybrake_client.issue_key(
vendor="stripe",
allowed_endpoints=["POST /v1/charges"],
daily_usd_cap=round(customer["amount_cents"] / 100 * 1.1, 2),
expires_in_hours=24,
metadata={"customer_id": customer["id"], "billing_period": billing_period},
)
charge_one_customer.delay(
customer_id=customer["id"],
amount_cents=customer["amount_cents"],
billing_period=billing_period,
vault_key=vault_key,
)
Vault keys issued at dispatch time carry the correct spend cap for each customer's charge amount. If a task's retry loop goes wrong — a logic bug that keeps retrying despite a 200 response, a test fixture that accidentally targets the production queue — the daily cap on the vault key is the hard ceiling on damage. The key can be revoked from the Keybrake dashboard without touching the worker pool or rotating the real Stripe key.
Failure mode 3: acks_late=True re-delivers tasks after a worker crash mid-execution
Celery's default acknowledgement mode is acks_late=False: the task is acknowledged (removed from the broker queue) as soon as a worker picks it up, before execution begins. This means a worker crash during execution loses the task — the charge may or may not have fired, and Celery has no record of it. Many teams switch to acks_late=True for billing pipelines precisely because a lost charge is considered worse than a duplicate: the task remains on the broker queue until the worker successfully acknowledges completion, so if the worker dies mid-execution, the broker re-delivers the task to another available worker.
# UNSAFE: acks_late=True without idempotency key — crash during billing = duplicate charge
@app.task(
acks_late=True, # task re-queued on worker crash — good for reliability
reject_on_worker_lost=True, # explicitly re-queue if worker process is killed
autoretry_for=(Exception,),
max_retries=3,
)
def charge_customer_reliable(customer_id: str, amount_cents: int, billing_period: str):
# Worker A picks up the task. stripe.charges.create() succeeds → ch_A
charge = stripe.charges.create(
amount=amount_cents, currency="usd", customer=customer_id
# No idempotency key
)
# Worker A crashes here (OOM kill, SIGKILL, host reboot).
# acks_late=True means the task was NOT acknowledged.
# Broker re-delivers the task to Worker B.
# Worker B runs charge_customer_reliable from line 1.
# stripe.charges.create() fires again → ch_B.
# Customer charged twice. No error in Celery logs.
update_billing_record(customer_id, charge["id"], billing_period)
ack() # never reached — Worker A is dead
The dangerous combination is acks_late=True plus reject_on_worker_lost=True (which explicitly re-queues tasks from killed workers rather than marking them as failed). This is a common production configuration for billing pipelines because it prevents lost charges from worker crashes — but without an idempotency key, it guarantees that any crash between a successful stripe.charges.create() and the task acknowledgement creates a duplicate charge on re-delivery.
The fix is the same content-hash idempotency key from failure mode 1, but it is especially critical here because the re-delivery window can extend for minutes — however long it takes the dead worker's tasks to be claimed by other workers in the pool. Stripe's 24-hour idempotency window comfortably covers this gap. A check_existing_charge pre-flight step using a read-only audit vault key provides a second layer for the rare case where the re-delivery happens after the idempotency window (e.g., a worker that was paused for 25+ hours by a host that did not fully die).
# SAFE: acks_late=True with idempotency key + optional pre-flight check
@app.task(
acks_late=True,
reject_on_worker_lost=True,
autoretry_for=(Exception,),
max_retries=3,
retry_backoff=True,
)
def charge_customer_reliable(customer_id: str, amount_cents: int, billing_period: str,
vault_key: str, audit_vault_key: str) -> str:
# Pre-flight: check if this billing period was already charged
# audit_vault_key: GET /v1/charges only, $0 daily cap — cannot create charges
audit_client = stripe.StripeClient(
audit_vault_key,
base_url="https://proxy.keybrake.com/stripe",
)
existing = audit_client.charges.list(params={"customer": customer_id, "limit": 10})
for ch in existing.data:
if (ch.metadata.get("billing_period") == billing_period
and ch.status == "succeeded"):
# Already charged — this is a re-delivery after worker crash
update_billing_record(customer_id, ch.id, billing_period)
return ch.id # idempotent: update_billing_record handles duplicate writes
stripe_client = stripe.StripeClient(
vault_key,
base_url="https://proxy.keybrake.com/stripe",
)
idempotency_key = billing_idempotency_key(customer_id, amount_cents, billing_period)
charge = stripe_client.charges.create(
params={
"amount": amount_cents,
"currency": "usd",
"customer": customer_id,
"description": f"Subscription {billing_period}",
"metadata": {"billing_period": billing_period},
},
options={"idempotency_key": idempotency_key},
)
update_billing_record(customer_id, charge.id, billing_period)
return charge.id
Both layers work together: the idempotency key catches re-deliveries within Stripe's 24-hour window (the common case — worker crashes are typically resolved within minutes), and the pre-flight query catches re-deliveries outside that window (extremely rare, but possible). The audit vault key has a zero-dollar spend cap, so the pre-flight itself cannot accidentally create a charge even if the audit logic has a bug.
Approach comparison
| Approach | Retry safe? | Worker crash safe? | Per-worker spend cap? | Key revoke? | Audit log? | Worker isolation? |
|---|---|---|---|---|---|---|
| Module-level stripe.api_key | No | No (with acks_late) | No | Rotate only (all workers) | No | No |
| UUID idempotency key | No — changes per retry | No | No | Rotate only | No | No |
| Content-hash idempotency key | Yes | Partial — only within 24h Stripe window | No | Rotate only | No | No |
| Stripe restricted key (direct) | No — still duplicates without idem key | No | No (endpoint only, not spend) | Rotate only | No | No |
| Pre-flight check + idem key | Yes | Yes — catches expired 24h window | No | Rotate only | No | No |
| Vault key via Keybrake proxy | Yes | Yes | Yes — per-task daily cap | Instant, no rotation needed | Yes | Yes — per-task key |
Gap analysis
Celery as Airflow's executor backend
Many data engineering teams run Apache Airflow with the CeleryExecutor — Airflow dispatches task instances to a Celery broker, and Celery workers execute them. In this setup, both Airflow's retry semantics and Celery's retry semantics apply simultaneously. An Airflow task with retries=3 creates three Airflow-level task instance retries; if the Celery task itself also has autoretry_for, each Airflow retry is itself retried at the Celery level. With the default configuration, a single billing pipeline failure can trigger up to (airflow_retries + 1) × (celery_max_retries + 1) calls to stripe.charges.create(). The content-hash idempotency key closes this at both layers — Stripe deduplicates regardless of which retry layer triggers the call — but it is important to understand that both layers are active when running Celery under Airflow.
task_always_eager=True in tests defeats idempotency key testing
A common Celery test configuration sets CELERY_TASK_ALWAYS_EAGER = True (or task_always_eager=True in the Celery 4.x config), which runs tasks synchronously in the calling process rather than sending them to the broker. This disables all of Celery's retry machinery: autoretry_for, max_retries, and acks_late are all bypassed. Tests that call charge_customer.delay() and check for idempotent behavior may pass because the task runs exactly once — the retry path is never exercised. Validate idempotency by calling charge_customer.apply(args=[...], retries=1) directly, or by testing the billing_idempotency_key() function in isolation with a unit test that asserts stability across multiple calls.
Chord callbacks fire even when header tasks partially fail
A common Celery billing pattern uses a chord: a group of charge_one_customer tasks (the header) feeding into a billing_summary callback. If one header task fails after its charge succeeded but before its callback can report success, the chord's error handler fires — but the charge is not rolled back. The chord callback may conclude the batch failed and re-dispatch the entire header, re-firing all charges including the one that already went through. If you use chords for billing fan-out, scope the vault key to the individual header task (not the chord), and have the callback's error handler call the audit vault key's read path before re-dispatching any tasks from the failed batch.
Celery beat cron intervals shorter than execution time create concurrent runs
When a celery beat periodic task has a cron interval shorter than the task's execution time — for example, a monthly billing run scheduled every 28 days but taking 29 days to complete — beat dispatches a second instance while the first is still running. Both instances execute concurrently against the same customer list. Without a content-hash idempotency key and a per-run spend cap, both workers simultaneously call stripe.charges.create() for the same customers. Add one_at_a_time=True via celery-redbeat or a distributed lock (Redis SET NX EX) to prevent overlapping periodic runs, and use the vault key's daily cap as the backstop: even if two runs overlap, the combined spend for a given customer cannot exceed the cap.
Pytest enforcement suite
import hashlib
import pytest
from unittest.mock import MagicMock, patch
def billing_idempotency_key(customer_id: str, amount_cents: int, billing_period: str) -> str:
raw = f"{customer_id}:{amount_cents}:{billing_period}:celery-billing"
return hashlib.sha256(raw.encode()).hexdigest()[:32]
def test_idempotency_key_stable_across_retries():
k1 = billing_idempotency_key("cus_abc", 2999, "2026-06")
k2 = billing_idempotency_key("cus_abc", 2999, "2026-06")
assert k1 == k2, "Key must be identical on every retry"
def test_idempotency_key_differs_by_billing_period():
k_june = billing_idempotency_key("cus_abc", 2999, "2026-06")
k_july = billing_idempotency_key("cus_abc", 2999, "2026-07")
assert k_june != k_july, "Different billing periods must produce different keys"
def test_idempotency_key_differs_by_customer():
k1 = billing_idempotency_key("cus_abc", 2999, "2026-06")
k2 = billing_idempotency_key("cus_def", 2999, "2026-06")
assert k1 != k2, "Different customers must produce different keys"
@patch("stripe.StripeClient")
def test_charge_uses_vault_key_not_env_var(mock_client_class):
mock_client = MagicMock()
mock_client_class.return_value = mock_client
mock_client.charges.create.return_value = MagicMock(id="ch_test")
vault_key = "vault_key_per_task_xyz"
from billing_tasks import charge_customer
charge_customer.apply(args=["cus_abc", 2999, "2026-06", vault_key])
mock_client_class.assert_called_once_with(
vault_key, base_url="https://proxy.keybrake.com/stripe"
)
@patch("stripe.StripeClient")
def test_pre_flight_returns_existing_charge_on_redelivery(mock_client_class):
existing_charge = MagicMock(id="ch_existing", status="succeeded")
existing_charge.metadata = {"billing_period": "2026-06"}
mock_client = MagicMock()
mock_client_class.return_value = mock_client
mock_client.charges.list.return_value = MagicMock(data=[existing_charge])
from billing_tasks import charge_customer_reliable
result = charge_customer_reliable.apply(
args=["cus_abc", 2999, "2026-06", "vault_key_abc", "audit_key_abc"]
).get()
assert result == "ch_existing"
mock_client.charges.create.assert_not_called()
FAQ
Can I use the Celery task ID as an idempotency key?
No. Celery assigns a unique UUID to each task when it is dispatched — self.request.id inside the task callable. On autoretry_for retry, the task ID is preserved: retries within the same task instance share the same ID. This makes self.request.id look like a safe idempotency key — but it is not, because a re-delivered task (from acks_late=True after a worker crash) may be assigned a new task ID by the broker. More critically, a manual .delay() re-dispatch — the pattern used when an engineer manually re-queues a failed billing batch — always creates a new task ID. Use a content hash of the billing parameters instead; it is stable across all re-dispatch patterns, not just autoretry_for retries.
Does Celery's retry method avoid the duplicate charge issue?
Celery's explicit self.retry(exc=e)) has the same behavior as autoretry_for for Stripe: it re-runs the task callable from the top. The only difference is that self.retry() gives you control over when and whether to retry, while autoretry_for retries on any matching exception type. In both cases, without an idempotency key, a retry after a successful charge and a failed downstream write fires a second charge. The fix is the same: content-hash idempotency key derived from billing parameters, applied before the stripe.charges.create() call.
What if I need to charge the same customer twice in one billing period?
Add a disambiguating suffix to the idempotency key: f"{customer_id}:{amount_cents}:{billing_period}:{charge_type}:celery-billing" where charge_type is "subscription", "overage", or "addon". Two charges for the same customer in the same period with different charge types produce different idempotency keys, so Stripe creates two distinct charge objects. Two identical calls with the same type — from a retry or re-delivery — still deduplicate correctly. The vault key's daily cap should be sized for the sum of all charge types a single customer can incur in one day.
How do vault keys interact with Celery's task serialization?
Celery serializes task arguments into the broker message — by default as JSON. A vault key passed as a string argument is serialized as a plain JSON string and stored in the broker (typically Redis or RabbitMQ) until a worker picks it up. If your broker connection is not encrypted or your broker ACLs are permissive, vault keys in task payloads are readable by anyone with broker access. Vault keys are short-lived and endpoint-scoped, so exposure is lower-risk than a bare Stripe secret key, but you should still encrypt your broker connection (rediss:// for TLS, or RabbitMQ TLS) and rotate vault keys on a per-billing-run basis rather than reusing them across runs.
Does the Keybrake proxy add meaningful latency to Stripe calls inside Celery tasks?
The proxy adds approximately 2-8ms of forwarding overhead. Celery billing tasks typically spend 50-300ms on the Stripe API call itself, plus 20-100ms on the downstream database write. Proxy overhead is not measurable in end-to-end task execution time. For high-throughput chord fan-outs — hundreds of concurrent billing tasks — the proxy's SQLite audit write is the only potential bottleneck; in practice, SQLite handles thousands of sequential writes per second, and Celery's concurrency is bounded by max_active_tasks and the worker pool size, not by the audit log throughput.
Can I use the same vault key across all tasks in a chord?
You can, but you lose per-customer spend isolation. If the chord header dispatches 1,000 tasks and all share one vault key with a $2,990 daily cap (1,000 × $2.99), a runaway loop in any one of the 1,000 tasks can exhaust the full $2,990 cap before the other tasks complete. Issue one vault key per customer at dispatch time, each capped at that customer's charge amount plus a small buffer (10%). This way a runaway loop on one customer's task is bounded to that customer's cap and cannot affect any other task in the chord.
Stop handing your Celery workers a bare Stripe key
Keybrake issues per-task vault keys with endpoint allowlists, daily spend caps, and a full audit log — so an autoretry_for storm, a worker crash re-delivery, or a chord fan-out duplicate can never charge a customer twice or exhaust your Stripe balance. One URL change in your StripeClient initializer.