Worker Stability Test / Chapter 1

AI Tech /

E00_Retries_Are_Easy_Idempotency_Is_the_Whole_Game

# Retries Are Easy; Idempotency Is the Whole Game > Distributed workers don't fail because they retry too much. They fail because their retries aren't safe to repeat. ## Key Takeaways - Every task that touches external state will run twice. Make the second run harmless. - Idempotency keys are the default. Conditional writes are the fallback. Fencing tokens are the last resort. - Don't paper over a duplicate-execution bug with more retries; fix the safety boundary. Imagine your worker crashes 200 ms before committing a payment. The supervisor restarts it. The new attempt charges the customer again. Same task, same code, two outcomes — that isn't a flaky network, that's a missing idempotency contract. The fix is not "retry smarter." The fix is making the side effect repeatable. ```mermaid flowchart LR A[Task arrives] --> B{Already done?} B -- Yes --> C[Return

Chapter 1 of 1 1m Article Audio Learning path

Retries Are Easy; Idempotency Is the Whole Game

Distributed workers don't fail because they retry too much. They fail because their retries aren't safe to repeat.

Key Takeaways

  • Every task that touches external state will run twice. Make the second run harmless.
  • Idempotency keys are the default. Conditional writes are the fallback. Fencing tokens are the last resort.
  • Don't paper over a duplicate-execution bug with more retries; fix the safety boundary.

Imagine your worker crashes 200 ms before committing a payment. The supervisor restarts it. The new attempt charges the customer again. Same task, same code, two outcomes — that isn't a flaky network, that's a missing idempotency contract.

The fix is not "retry smarter." The fix is making the side effect repeatable.

flowchart LR
    A[Task arrives] --> B{Already done?}
    B -- Yes --> C[Return cached result]
    B -- No --> D[Run side effect with idempotency key]
    D --> E[Persist: task_id to result]
    E --> F[Return result]

Three patterns, in order of robustness:

Idempotency keys. Before the side effect, the worker stores task_id → outcome. On retry, the marker short-circuits the work. Stripe's idempotency layer (engineering blog, 2017) fingerprints each request and replays the prior response for 24 hours.

Conditional writes. When you don't own the downstream, send a unique key (Idempotency-Key header, version column, or ON CONFLICT DO NOTHING) and let the receiving system dedupe.

Fencing tokens. When partial failure spans services and stale workers can outlive their leases, attach a monotonic token. Old workers get rejected; the new attempt wins.

Pick the simplest pattern that covers your failure surface. Adding retries on top of an unsafe operation doesn't make it safer — it just makes the duplication faster.

---

References: