Retries Are Easy; Idempotency Is the Whole Game

Distributed workers don't fail because they retry too much. They fail because their retries aren't safe to repeat.

Key Takeaways

Every task that touches external state will run twice. Make the second run harmless.
Idempotency keys are the default. Conditional writes are the fallback. Fencing tokens are the last resort.
Don't paper over a duplicate-execution bug with more retries; fix the safety boundary.

Imagine your worker charges a customer's card, then crashes 200 ms before committing the payment record. The supervisor restarts it, and the new attempt charges the customer again. Same task, same code, two charges. That isn't a flaky network; it's a missing idempotency contract.

The fix is not "retry smarter." The fix is making the side effect repeatable.

flowchart LR
    A[Task arrives] --> B{Already done?}
    B -- Yes --> C[Return cached result]
    B -- No --> D[Run side effect with idempotency key]
    D --> E[Persist: task_id to result]
    E --> F[Return result]

Three patterns, in the order you should reach for them:

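Idempotency keys. Give each task a stable key (the task_id) and send it along with the side effect. Once the side effect succeeds, the worker persists task_id → outcome; on retry, that record short-circuits the work and the stored outcome is returned. Stripe's idempotency layer (engineering blog, 2017) fingerprints each request and replays the prior response when the same key is seen again within 24 hours.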
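A minimal sketch of the check, run, persist flow, assuming a local SQLite table as the result store. The names `run_once`, `task_results`, and `charge` are hypothetical and chosen for illustration; they are not taken from Stripe's implementation.

```python
import sqlite3


def run_once(db, task_id, side_effect):
    """Run side_effect at most once per task_id; replay the stored outcome on retry."""
    row = db.execute(
        "SELECT outcome FROM task_results WHERE task_id = ?", (task_id,)
    ).fetchone()
    if row is not None:
        return row[0]  # retry path: skip the work, return what we stored

    # Pass the key downstream so the receiver can also dedupe if we crash
    # after the side effect but before the INSERT below commits.
    outcome = side_effect(task_id)

    with db:  # commits on success, rolls back on exception
        db.execute(
            "INSERT INTO task_results (task_id, outcome) VALUES (?, ?)",
            (task_id, outcome),
        )
    return outcome


# Usage: the second call simulates a supervisor retry of the same task.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE task_results (task_id TEXT PRIMARY KEY, outcome TEXT)")

charges = []  # stands in for the external payment system


def charge(idempotency_key):
    charges.append(idempotency_key)
    return f"charged:{idempotency_key}"


first = run_once(db, "task-42", charge)
second = run_once(db, "task-42", charge)
```

The local record alone does not close the window between the side effect and the commit. That is why the sketch forwards the key: the downstream system is expected to reject or replay a repeated key.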

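Conditional writes. When the receiving system can enforce uniqueness, let it dedupe. For an API you don't own, send an Idempotency-Key header; for a database, guard the write with a version column or ON CONFLICT DO NOTHING. Either way, the write lands only if it hasn't already been applied.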
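The two database variants can be sketched with SQLite, assuming a build recent enough to support the `ON CONFLICT ... DO NOTHING` clause. The table and function names (`payments`, `accounts`, `record_payment`, `set_balance`) are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE payments (idempotency_key TEXT PRIMARY KEY, amount_cents INTEGER)"
)
db.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)"
)
db.execute("INSERT INTO accounts (id, balance, version) VALUES (1, 100, 0)")
db.commit()


def record_payment(key, amount_cents):
    """Insert-if-absent: True only for the attempt that actually wrote the row."""
    with db:
        cur = db.execute(
            "INSERT INTO payments (idempotency_key, amount_cents) VALUES (?, ?) "
            "ON CONFLICT(idempotency_key) DO NOTHING",
            (key, amount_cents),
        )
    return cur.rowcount == 1


def set_balance(account_id, new_balance, expected_version):
    """Compare-and-set: applies only if the version is still the one we read."""
    with db:
        cur = db.execute(
            "UPDATE accounts SET balance = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (new_balance, account_id, expected_version),
        )
    return cur.rowcount == 1


# Usage: each operation is attempted twice, as a retry would.
first_write = record_payment("task-42", 1999)
duplicate = record_payment("task-42", 1999)
applied = set_balance(1, 80, expected_version=0)
replayed = set_balance(1, 80, expected_version=0)
```

In both cases the database, not the worker, decides whether the write counts, so a retry that arrives late is a no-op rather than a second effect.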

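Fencing tokens. When partial failure spans services and stale workers can outlive their leases, have the lease service issue a monotonically increasing token with each lease and attach it to every write. The resource rejects any write whose token is lower than the highest it has seen, so old workers get rejected and the new attempt wins.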
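A single-process sketch of the token check, assuming the protected resource can remember the highest token it has accepted. `LeaseService`, `FencedStore`, and `StaleTokenError` are hypothetical names; a real store would need to make the compare-and-write step atomic.

```python
class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already accepted."""


class LeaseService:
    """Hands out a strictly increasing token with every lease."""

    def __init__(self):
        self._counter = 0

    def acquire(self):
        self._counter += 1
        return self._counter


class FencedStore:
    """Accepts a write only if its token is not older than the newest seen."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            raise StaleTokenError(f"token {token} < {self.highest_token}")
        self.highest_token = token
        self.data[key] = value


# Usage: the old worker pauses, its lease expires, a new worker takes over.
leases = LeaseService()
store = FencedStore()

old_token = leases.acquire()  # first worker's lease
new_token = leases.acquire()  # replacement worker's lease

store.write(new_token, "order-7", "shipped")

try:
    store.write(old_token, "order-7", "pending")  # stale worker wakes up
    rejected = False
except StaleTokenError:
    rejected = True
```

The check lives in the resource rather than in the worker because a stale worker cannot be trusted to know its lease has expired.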

Pick the simplest pattern that covers your failure surface. Adding retries on top of an unsafe operation doesn't make it safer — it just makes the duplication faster.

---

References:

Stripe: Designing robust and predictable APIs with idempotency — 2017 engineering write-up of the request-fingerprint approach.
Martin Kleppmann: How to do distributed locking — why fencing tokens beat leases alone.