Retries Are Easy; Idempotency Is the Whole Game
Distributed workers don't fail because they retry too much. They fail because their retries aren't safe to repeat.
Key Takeaways
- Every task that touches external state will run twice. Make the second run harmless.
- Idempotency keys are the default. Conditional writes are the fallback. Fencing tokens are the last resort.
- Don't paper over a duplicate-execution bug with more retries; fix the safety boundary.
Imagine your worker charges a customer's card and crashes 200 ms later, before it records the payment as committed. The supervisor restarts it. The new attempt charges the customer again. Same task, same code, two charges: that isn't a flaky network, that's a missing idempotency contract.
The fix is not "retry smarter." The fix is making the side effect repeatable.
flowchart LR
A[Task arrives] --> B{Already done?}
B -- Yes --> C[Return cached result]
B -- No --> D[Run side effect with idempotency key]
D --> E[Persist: task_id to result]
E --> F[Return result]
Three patterns, from simplest to most involved:
Idempotency keys. Before the side effect, the worker checks for a stored task_id → outcome record; once the side effect succeeds, it writes one. On retry, the record short-circuits the work. Stripe's idempotency layer (engineering blog, 2017) fingerprints each request and replays the prior response for 24 hours.
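The check-then-record flow can be sketched in a few lines. This is a minimal illustration, not a production design: `IdempotentRunner` is a hypothetical name, the dict stands in for a durable store, and the side effect is assumed to accept the task ID as its idempotency key.

```python
from typing import Callable, Dict


class IdempotentRunner:
    """Runs a side effect at most once per task_id (in-memory sketch)."""

    def __init__(self) -> None:
        # Stand-in for a durable store (a database table, for example).
        self._outcomes: Dict[str, str] = {}

    def run(self, task_id: str, side_effect: Callable[[str], str]) -> str:
        # Already done? Return the recorded outcome without redoing the work.
        if task_id in self._outcomes:
            return self._outcomes[task_id]

        # Run the side effect, passing task_id along as the idempotency key.
        outcome = side_effect(task_id)

        # Persist task_id -> outcome so a retry short-circuits.
        self._outcomes[task_id] = outcome
        return outcome
```

A crash between the side effect and the persist step still leaves a window in which a retry re-runs the work, which is why the sketch also hands the key to the side effect: the downstream system can then dedupe on it.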
Conditional writes. When you don't own the downstream, make the write itself conditional and let the receiving system dedupe: send an Idempotency-Key header, guard the update with a version column, or insert with ON CONFLICT DO NOTHING.
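As one possible illustration of the ON CONFLICT DO NOTHING variant, the sketch below uses Python's built-in sqlite3 module (the upsert clause needs SQLite 3.24 or newer). The `charges` table, its columns, and the helper names are assumptions made up for this example.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical charges table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE charges ("
        " idempotency_key TEXT PRIMARY KEY,"
        " amount_cents INTEGER NOT NULL)"
    )
    return conn


def record_charge(conn: sqlite3.Connection, key: str, amount_cents: int) -> bool:
    """Insert the charge once; return True only if this call created the row."""
    cur = conn.execute(
        "INSERT INTO charges (idempotency_key, amount_cents) VALUES (?, ?) "
        "ON CONFLICT(idempotency_key) DO NOTHING",
        (key, amount_cents),
    )
    conn.commit()
    # rowcount is 0 when the primary key already existed and nothing was written.
    return cur.rowcount == 1
```

The uniqueness constraint does the deduplication, so the worker needs no read-before-write: a retry with the same key simply becomes a no-op.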
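Fencing tokens. When partial failure spans services and stale workers can outlive their leases, attach a monotonically increasing token to every write. The resource remembers the highest token it has seen and rejects anything lower, so old workers get rejected and the new attempt wins.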
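A toy version of the fencing check might look like the following. `LockService`, `FencedResource`, and `StaleTokenError` are hypothetical names for this sketch; in a real deployment the token would come from the lock or lease service and the check would live inside the storage system.

```python
from typing import Optional


class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already seen."""


class LockService:
    """Hands out a strictly increasing token with each lease (sketch)."""

    def __init__(self) -> None:
        self._last_token = 0

    def acquire(self) -> int:
        self._last_token += 1
        return self._last_token


class FencedResource:
    """Accepts a write only if its token is not older than the newest seen."""

    def __init__(self) -> None:
        self._highest_token = 0
        self.value: Optional[str] = None

    def write(self, token: int, value: str) -> None:
        if token < self._highest_token:
            raise StaleTokenError(
                f"token {token} is older than {self._highest_token}"
            )
        # Equal tokens are allowed: the current lease holder may write repeatedly.
        self._highest_token = token
        self.value = value
```

If a worker holding token 1 stalls, a replacement acquires token 2 and writes, then the stalled worker wakes up and tries to write with token 1, the resource raises `StaleTokenError` instead of letting the stale write clobber the newer one.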
Pick the simplest pattern that covers your failure surface. Adding retries on top of an unsafe operation doesn't make it safer — it just makes the duplication faster.
---
References:
- Stripe: Designing robust and predictable APIs with idempotency — 2017 engineering write-up of the request-fingerprint approach.
- Martin Kleppmann: How to do distributed locking — why fencing tokens beat leases alone.