Zombie Recovery and the Reaper

A "zombie" job is one that was claimed by a worker that no longer exists. The system must detect these jobs, reclaim them, and either retry the work or surface the problem—without any human intervention and without ever losing work.

Key Takeaways

Lease expiration is the canonical signal for zombie detection at the edge.
The reaper pattern is a periodic scan that reclaims jobs whose leases have expired.
Idempotency is the prerequisite for safe zombie recovery; without it, reaping creates duplicate work.
Zombie recovery is not a special case of error handling—it is a normal state transition in the system.

A zombie job is what happens when the runtime reclaims an isolate mid-job. The isolate was processing the work, holding a lease, doing whatever the job required. The runtime said "thirty seconds is up, I am reclaiming you." The isolate is gone. The job is still in the database, marked as in_progress, with a lease_until in the past. No completion event was written. No error was reported. The job is in limbo—it is neither done nor pending, and no worker knows it is stuck. This is the zombie state, and it is the most subtle failure mode in edge-resident systems.

In a traditional cloud, this state is hard to reach. Long-lived workers maintain heartbeats. Visibility timeouts are short. Operations teams monitor for stuck processes. At the edge, the zombie state is normal. It happens whenever a job takes longer than the runtime cap. It happens whenever an isolate is reclaimed for memory pressure. It happens whenever a network call hangs. The system must be designed to expect zombies, detect them, and recover from them. The recovery is not a recovery from an exceptional state. It is recovery from the normal state of edge compute.

The first time I traced through a zombie scenario in the deepmox-worker, I expected to find a separate "reaper" service that monitored for stuck jobs. What I found instead was a query. The reaper is not a service. The reaper is a periodic request to the worker, just like the regular pull cycle, except its query is different. The regular pull query selects status = 'pending' AND (lease_until IS NULL OR lease_until < now()). The reaper query selects status = 'in_progress' AND lease_until < now(). The reaper query has the same shape, with two changes: it looks at in-progress jobs, and it drops the lease_until IS NULL branch, because a job that was claimed always carries a lease. The reaper finds jobs that were claimed and never finished.
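To make the difference between the two queries concrete, here is a minimal sketch that runs both against an in-memory SQLite database. The table layout, the integer timestamps, and the fixed NOW value are assumptions for illustration; the real deepmox-worker schema is not shown in this chapter.

```python
import sqlite3

# Hypothetical schema: only the columns the text mentions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    "id INTEGER PRIMARY KEY, status TEXT NOT NULL, lease_until INTEGER)"
)

NOW = 1_000  # fixed "current time" in seconds, so the example is deterministic
conn.executemany(
    "INSERT INTO jobs (id, status, lease_until) VALUES (?, ?, ?)",
    [
        (1, "pending", None),          # never claimed
        (2, "in_progress", NOW + 30),  # claimed, lease still valid
        (3, "in_progress", NOW - 5),   # claimed, lease expired: a zombie
        (4, "done", NOW - 100),        # finished long ago
    ],
)

# The current time is passed as a parameter instead of calling now().
PULL_QUERY = """
    SELECT id FROM jobs
    WHERE status = 'pending' AND (lease_until IS NULL OR lease_until < ?)
    ORDER BY id
"""
REAPER_QUERY = """
    SELECT id FROM jobs
    WHERE status = 'in_progress' AND lease_until < ?
    ORDER BY id
"""


def ids(query: str, now: int) -> list[int]:
    """Run one of the two queries and return the matching job ids."""
    return [row[0] for row in conn.execute(query, (now,))]


pullable = ids(PULL_QUERY, NOW)   # job 1 only
zombies = ids(REAPER_QUERY, NOW)  # job 3 only
print(pullable, zombies)
```

Job 2 appears in neither result: it is in progress and its lease is still valid, so neither the puller nor the reaper touches it.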

flowchart TD
    Start[Reaper Trigger] --> Q{Scan for zombies}
    Q -->|No zombies| Done1[Nothing to do]
    Q -->|Zombie found| R{Re-claimable?}
    R -->|Work idempotent| RC[Reset to pending, increment zombie_count]
    R -->|Work dangerous| AL[Alert, mark for review]
    RC --> SCH[Schedule retry with backoff]
    AL --> H[Hold for human]
    SCH --> Done2[Reaper cycle complete]
    H --> Done2

The diagram shows the reaper's logic. It is small, but it is the difference between a system that loses work and a system that never loses work. The reaper runs on a schedule—every minute, every five minutes, whatever frequency is appropriate for the workload. The reaper finds zombies, decides what to do with each, and commits the decision to the database. The decision is durable. The next pull cycle sees the reset jobs and processes them. The system has recovered, with no human intervention, with no special "recovery" code, with just a query that finds zombies and a few state transitions.
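The "reset to pending" transition can be sketched as a single UPDATE. This is an illustrative sketch under assumed column names (status, lease_until, zombie_count) and integer timestamps, not the deepmox-worker's actual code.

```python
import sqlite3


def reap(conn: sqlite3.Connection, now: int) -> int:
    """Reset expired in-progress jobs to pending; return how many were reclaimed."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            """
            UPDATE jobs
            SET status = 'pending',
                lease_until = NULL,
                zombie_count = zombie_count + 1
            WHERE status = 'in_progress' AND lease_until < ?
            """,
            (now,),
        )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    "id INTEGER PRIMARY KEY, status TEXT NOT NULL, "
    "lease_until INTEGER, zombie_count INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO jobs (id, status, lease_until) VALUES (?, ?, ?)",
    [(1, "in_progress", 900), (2, "in_progress", 1100)],
)

reclaimed = reap(conn, now=1000)  # job 1 has expired, job 2 has not
rows = conn.execute(
    "SELECT id, status, lease_until, zombie_count FROM jobs ORDER BY id"
).fetchall()
second_pass = reap(conn, now=1000)  # nothing left to reclaim
print(reclaimed, rows, second_pass)
```

Because the WHERE clause matches only expired in-progress rows, running the reaper twice in a row is harmless: the second pass finds nothing to change.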

The most important property of the reaper is that it is not a separate service. The reaper is implemented in the same worker code, in the same isolates, using the same D1 database. There is no reaper daemon, no separate infrastructure, no deployment to manage. The reaper is just a request to the worker that says "act as the reaper" instead of "act as the regular puller." The worker handles the request, runs the reaper query, and returns. The same scaling properties that apply to the regular pull cycle apply to the reaper. The reaper scales with the worker fleet. The reaper does not introduce a bottleneck of its own, because it is not a separate process.
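One way to picture "act as the reaper" versus "act as the regular puller" is a single handler that dispatches on a mode. The handle function, the mode names, and the schema below are hypothetical; the sketch only shows that both roles can share the same code path and the same database.

```python
import sqlite3


def pull(conn: sqlite3.Connection, now: int) -> list[int]:
    """Regular pull: ids of pending jobs that are free to claim."""
    rows = conn.execute(
        "SELECT id FROM jobs WHERE status = 'pending' "
        "AND (lease_until IS NULL OR lease_until < ?) ORDER BY id",
        (now,),
    )
    return [r[0] for r in rows]


def reap(conn: sqlite3.Connection, now: int) -> list[int]:
    """Reaper: ids of in-progress jobs whose lease has expired."""
    rows = conn.execute(
        "SELECT id FROM jobs WHERE status = 'in_progress' "
        "AND lease_until < ? ORDER BY id",
        (now,),
    )
    return [r[0] for r in rows]


# Hypothetical mode names; the same worker code serves both roles.
MODES = {"pull": pull, "reap": reap}


def handle(mode: str, conn: sqlite3.Connection, now: int) -> list[int]:
    """Entry point: the request decides which query this invocation runs."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return MODES[mode](conn, now)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    "id INTEGER PRIMARY KEY, status TEXT NOT NULL, lease_until INTEGER)"
)
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?)",
    [(1, "pending", None), (2, "in_progress", 900)],
)

pulled = handle("pull", conn, now=1000)
reaped = handle("reap", conn, now=1000)
print(pulled, reaped)
```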

Imagine you are running a large office building. Workers come and go. Some complete their tasks and leave. Some get reassigned mid-task. Some are called away and never come back. The building does not have a manager who tracks each worker. Instead, every task has a deadline. If the deadline passes and the task is not done, the next available worker picks it up. The task does not get lost. The building does not need a reaper service. The deadline is the reaper.

The reaper has to be careful about one thing: work that is dangerous to retry. Some operations are not idempotent. If a job charged a credit card, claimed a unique resource, sent a notification, or modified external state in a way that cannot be rolled back, re-claiming the job and re-running it would cause real harm. The deepmox-worker handles this through a reapable flag on each job. By default, jobs are reapable. Jobs that perform dangerous operations are marked non-reapable, and the reaper alerts when it encounters them. The alert goes to a human, who can decide whether to re-claim manually or mark the job as failed.

This is a critical pattern. The reaper is not omniscient. It cannot know, in general, whether re-running a job is safe. The job itself must declare whether it is reapable. The declaration is part of the job's metadata, set at submission time. If a job is reapable, the reaper can re-claim it. If a job is non-reapable, the reaper surfaces it for human review. The discipline of declaring reapability at submission is what makes the reaper safe to run automatically.
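The branch on reapability might look like the following sketch. The reapable column stored as 0 or 1, the needs_review status, and the returned lists standing in for an alert are all assumptions for illustration; the text only states that non-reapable jobs are surfaced to a human.

```python
import sqlite3


def reap_with_flag(conn: sqlite3.Connection, now: int) -> tuple[list[int], list[int]]:
    """Reclaim reapable zombies; hold non-reapable ones for human review."""
    zombies = conn.execute(
        "SELECT id, reapable FROM jobs "
        "WHERE status = 'in_progress' AND lease_until < ? ORDER BY id",
        (now,),
    ).fetchall()

    reclaimed: list[int] = []
    held: list[int] = []
    with conn:  # all decisions for this cycle commit together
        for job_id, reapable in zombies:
            if reapable:
                conn.execute(
                    "UPDATE jobs SET status = 'pending', lease_until = NULL, "
                    "zombie_count = zombie_count + 1 "
                    "WHERE id = ? AND status = 'in_progress'",
                    (job_id,),
                )
                reclaimed.append(job_id)
            else:
                # 'needs_review' is a hypothetical status used as the alert marker.
                conn.execute(
                    "UPDATE jobs SET status = 'needs_review' "
                    "WHERE id = ? AND status = 'in_progress'",
                    (job_id,),
                )
                held.append(job_id)
    return reclaimed, held


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    "id INTEGER PRIMARY KEY, status TEXT NOT NULL, lease_until INTEGER, "
    "reapable INTEGER NOT NULL DEFAULT 1, "  # reapable by default
    "zombie_count INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO jobs (id, status, lease_until, reapable) VALUES (?, ?, ?, ?)",
    [
        (1, "in_progress", 900, 1),   # expired, safe to retry
        (2, "in_progress", 900, 0),   # expired, dangerous to retry
        (3, "in_progress", 1100, 1),  # lease still valid
    ],
)

reclaimed, held = reap_with_flag(conn, now=1000)
statuses = conn.execute("SELECT id, status FROM jobs ORDER BY id").fetchall()
print(reclaimed, held, statuses)
```

The reaper never inspects the payload to guess at safety. It reads the flag that was set at submission time and acts on that alone.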

I started this analysis believing that the reaper was an emergency measure—a special-case handler for exceptional situations. After using the reaper in production, I now believe the opposite: the reaper is the normal mode of operation. The runtime cap is the most common reason jobs become zombies. The network is the second most common reason. Hardware failures are a distant third. If the reaper is treated as a normal part of the system, it is easy to design. If the reaper is treated as an emergency measure, it will be undertested and buggy, and it will fail when it is most needed.

The reaper frequency is a tunable parameter. If the reaper runs too often, it adds load to the database and finds few zombies. If the reaper runs too rarely, zombies sit in the queue for too long, increasing user-visible latency. The deepmox-worker runs the reaper every minute, with a query that scans for jobs whose leases expired more than sixty seconds ago. The sixty-second grace period gives jobs that have only just overrun their lease, and whose worker may still be running, a chance to complete on their own before the reaper intervenes. The reaper is a patient mechanism.
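The grace period amounts to moving the cutoff back by sixty seconds. A minimal sketch, assuming integer second timestamps and a hypothetical find_zombies helper:

```python
import sqlite3

GRACE_SECONDS = 60  # the sixty-second grace period from the text


def find_zombies(
    conn: sqlite3.Connection, now: int, grace: int = GRACE_SECONDS
) -> list[int]:
    """Return ids of in-progress jobs whose lease expired more than `grace` seconds ago."""
    cutoff = now - grace
    rows = conn.execute(
        "SELECT id FROM jobs "
        "WHERE status = 'in_progress' AND lease_until < ? ORDER BY id",
        (cutoff,),
    )
    return [r[0] for r in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    "id INTEGER PRIMARY KEY, status TEXT NOT NULL, lease_until INTEGER)"
)
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?)",
    [
        (1, "in_progress", 990),  # expired 10 s ago: still inside the grace period
        (2, "in_progress", 900),  # expired 100 s ago: reaped
        (3, "in_progress", 940),  # expired exactly 60 s ago: not yet "more than" 60
    ],
)

with_grace = find_zombies(conn, now=1000)
without_grace = find_zombies(conn, now=1000, grace=0)
print(with_grace, without_grace)
```

A larger grace value trades recovery latency for fewer premature reclaims. With grace set to zero the query reduces to a plain lease-expiry check.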

The most counterintuitive property of the reaper is that it is observable. Every reaper action is a row in the errors table, with classification = 'zombie'. Every re-claim is an update to the jobs table, with zombie_count incremented. Every alert is a row that can be queried. This means the system can answer questions like: how many zombies did we have yesterday, what kinds of jobs went zombie, did the reaper recover them, are there any jobs that have gone zombie multiple times? These questions are impossible to answer in systems where reaper state is in-process. The deepmox-worker answers them with a single query.
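Two of those questions can be sketched as queries against the errors table. The column names other than classification, the integer timestamps, and the 'timeout' classification are assumptions for illustration.

```python
import sqlite3

DAY = 86_400  # seconds in a day
NOW = 10 * DAY  # fixed "current time" so the example is deterministic

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE errors ("
    "id INTEGER PRIMARY KEY, job_id INTEGER NOT NULL, "
    "classification TEXT NOT NULL, created_at INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO errors (job_id, classification, created_at) VALUES (?, ?, ?)",
    [
        (1, "zombie", NOW - 3_600),    # job 1 reaped an hour ago
        (1, "zombie", NOW - 7_200),    # job 1 reaped two hours ago
        (2, "zombie", NOW - 3 * DAY),  # job 2 reaped three days ago
        (3, "timeout", NOW - 60),      # a different kind of error
    ],
)

# How many zombies were reaped in the last 24 hours?
zombies_last_day = conn.execute(
    "SELECT COUNT(*) FROM errors "
    "WHERE classification = 'zombie' AND created_at >= ?",
    (NOW - DAY,),
).fetchone()[0]

# Which jobs have gone zombie more than once, and how often?
repeat_offenders = conn.execute(
    "SELECT job_id, COUNT(*) FROM errors "
    "WHERE classification = 'zombie' "
    "GROUP BY job_id HAVING COUNT(*) > 1 ORDER BY job_id"
).fetchall()

print(zombies_last_day, repeat_offenders)
```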

The zombie_count column is particularly important. A job that has gone zombie once is normal—runtime cap exceeded, expected. A job that has gone zombie five times is suspicious. A job that has gone zombie ten times is broken. The zombie_count is the early warning system for systemic problems. If zombie_count is climbing across many jobs, there is likely a runtime issue, a network issue, or a workload that is too long for the runtime cap. The query that surfaces this is:

SELECT id, payload, zombie_count FROM jobs WHERE zombie_count > 3 ORDER BY zombie_count DESC

The query takes milliseconds to run. The answer is actionable. The system has detected a problem before any human would have noticed it.
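As an illustration, the query can be paired with a small triage helper that turns zombie_count into a label. The exact boundaries are an assumption: the text names one, five, and ten as reference points without saying where each band starts. The schema and sample rows are likewise hypothetical.

```python
import sqlite3


def triage(zombie_count: int) -> str:
    """Label a job by how often it has gone zombie (boundaries are assumed)."""
    if zombie_count >= 10:
        return "broken"
    if zombie_count >= 5:
        return "suspicious"
    return "normal"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    "id INTEGER PRIMARY KEY, payload TEXT NOT NULL, "
    "zombie_count INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO jobs (id, payload, zombie_count) VALUES (?, ?, ?)",
    [(1, "{}", 0), (2, "{}", 4), (3, "{}", 5), (4, "{}", 12)],
)

# The query from the text, with each row annotated by the triage label.
report = [
    (job_id, count, triage(count))
    for job_id, _payload, count in conn.execute(
        "SELECT id, payload, zombie_count FROM jobs "
        "WHERE zombie_count > 3 ORDER BY zombie_count DESC"
    )
]
print(report)
```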

The reaper is, ultimately, the most important pattern in the foundation layer, because it is the pattern that handles the failure mode the system is most likely to encounter. The runtime cap is hard. The network is unreliable. The reaper is the answer to both. By making the reaper a query, not a service, the system gains observability, testability, and simplicity. By making reapability a job-level declaration, the system is safe to run automatically. By tracking zombie_count, the system surfaces systemic problems early. The pattern generalizes: any system with leases can be reaped, and any reaper can be a query.

The next chapter will look at the runtime cap itself—the thirty-second wall—and how the system designs jobs to fit within that budget. The reaper is the safety net; the runtime cap is the constraint. Both are necessary, and both shape the architecture.