1. Checkpoint_Based_Recovery
Checkpoint-Based Recovery
A checkpoint is a frozen moment a system can roll back to—so recovery time stops depending on what went wrong, and starts depending on how often you saved.
Key Takeaways
- Checkpoints snapshot state at intervals, bounding recovery to "replay from last save" rather than "rebuild from scratch."
- Trade-off is storage and write overhead versus recovery latency.
- They shine for long-running, stateful, failure-prone workloads.
Body
A distributed job at minute 47 of a 90-minute run dies when a node loses power. Without a checkpoint, restart means redoing 47 minutes of work—or worse, replaying a corrupted state. With a
1m / Article + audio + video