Crash Test 2

1. Checkpoint_Based_Recovery

Checkpoint-Based Recovery

A checkpoint is a frozen moment a system can roll back to—so recovery time stops depending on what went wrong, and starts depending on how often you saved.

Key Takeaways

Checkpoints snapshot state at intervals, bounding recovery to "replay from last save" rather than "rebuild from scratch."
Trade-off is storage and write overhead versus recovery latency.
They shine for long-running, stateful, failure-prone workloads.

Body

A distributed job at minute 47 of a 90-minute run dies when a node loses power. Without a checkpoint, restart means redoing 47 minutes of work—or worse, replaying a corrupted state. With a

1m / Article + audio + video

Start here

1. Checkpoint_Based_Recovery

Checkpoint-Based Recovery

Key Takeaways

Body