Error Classification as a First-Class Concern
Most retry logic in production systems is wrong, not because the backoff math is off, but because every error is treated as the same kind of failure. Error classification is the discipline that makes retry logic correct.
Key Takeaways
- Errors fall into distinct categories: transient, permanent, rate-limited, zombie, and unknown. Each requires a different response.
- Treating all errors the same leads to retry storms that amplify rather than relieve load.
- Error classification must happen at the source, with the classified error persisted alongside the job.
- The deepmox-worker's error taxonomy is the foundation of its self-healing behavior.
A retry storm hit us three weeks ago. The system was processing 200 jobs per minute, all AI inference, all against an external API. The API started returning 503s. Our retry logic saw "request failed, retry," and dutifully retried. Each retry hit the same API, which was already overloaded, which returned more 503s, which triggered more retries. Within four minutes, the worker was retrying the same 200 jobs four times each, sending 800 requests per minute to an API that could handle 200. The API's rate limiter kicked in, returning 429s, which our retry logic treated as transient errors and retried again. By the eighth minute, we were sending 3,200 requests per minute, all of them failing, all of them being retried.
The bug was not in the retry logic. The retry logic was doing exactly what it was told: if the request fails, retry it. The bug was in the absence of error classification. Every failure was treated as the same kind of failure: "the request did not succeed, try again." But "the request did not succeed" can mean at least four different things, and each requires a different response. Some failures are transient and will succeed on retry. Some are permanent and will never succeed no matter how many times you retry. Some are rate-limited and need backoff. Some indicate the worker died mid-job and the job is in zombie state. Treating all of these the same is not just wrong—it is dangerous.
The deepmox-worker implements error classification as a first-class concern. Every error is classified at the moment it occurs, and the classification is persisted in the database alongside the job. The classification is not inferred from a generic "request failed" message. It is explicit: the code that catches the error is responsible for determining what kind of error it is and reporting that classification to the rest of the system. This is the discipline that prevents retry storms.
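As a rough illustration of classifying at the catch site, here is a minimal Python sketch. It is not the deepmox-worker's actual code: `UpstreamHTTPError`, `ClassifiedError`, `classify`, and `run_job` are hypothetical names, and the status-code rules are a simplification of the categories described above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Classification(str, Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"
    RATE_LIMITED = "rate_limited"
    ZOMBIE = "zombie"  # a dead worker cannot catch anything; not produced by classify() here
    UNKNOWN = "unknown"


class UpstreamHTTPError(Exception):
    """Hypothetical exception that carries the upstream HTTP status code."""

    def __init__(self, status: int):
        super().__init__(f"upstream returned {status}")
        self.status = status


@dataclass
class ClassifiedError:
    job_id: int
    code: str
    classification: Classification
    message: str


def classify(exc: Exception) -> Classification:
    """Name the kind of failure at the point where it is caught."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return Classification.TRANSIENT
    if isinstance(exc, UpstreamHTTPError):
        if exc.status == 429:
            return Classification.RATE_LIMITED
        if 500 <= exc.status <= 599:
            return Classification.TRANSIENT
        if 400 <= exc.status <= 499:
            # Simplification (assumption): every other 4xx is treated as permanent.
            return Classification.PERMANENT
    return Classification.UNKNOWN


def run_job(job_id: int, call_api: Callable[[], object]) -> Optional[ClassifiedError]:
    """Run one job; on failure return an explicit, classified error record."""
    try:
        call_api()
        return None
    except Exception as exc:
        code = str(exc.status) if isinstance(exc, UpstreamHTTPError) else type(exc).__name__
        return ClassifiedError(job_id, code, classify(exc), str(exc))
```

The catch block never reports a bare "request failed." It always hands back a record that names the category, which is what the rest of the system can then act on.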
flowchart TD
E[Error Caught] --> Q{Classify}
Q -->|Network timeout, 5xx| T[Transient]
Q -->|4xx auth, schema| P[Permanent]
Q -->|429 rate limit| R[Rate Limited]
Q -->|Worker died| Z[Zombie]
Q -->|Unknown| U[Unknown]
T --> RT[Retry with backoff]
P --> FAIL[Mark failed, no retry]
R --> RATE[Exponential backoff]
Z --> RECLAIM[Re-claim with reset]
U --> INVESTIGATE[Alert and pause]
The taxonomy in the diagram is what every production system needs, even if the names differ. The point is that errors are not interchangeable. A retry storm is what happens when a system treats all errors the same. A well-classified system responds to each error type with the appropriate action: transient errors get retried with backoff, permanent errors get marked failed, rate-limited errors get exponential backoff, zombie jobs get reclaimed with their state reset, and unknown errors trigger an alert and a pause.
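The diagram's mapping from classification to response can be written down as a small lookup table. The sketch below is illustrative only: the `Action` enum, the `ACTIONS` table, and `next_action` are assumed names, and the action labels are copied from the diagram.

```python
from enum import Enum


class Action(Enum):
    RETRY_WITH_BACKOFF = "retry with backoff"
    MARK_FAILED = "mark failed, no retry"
    EXPONENTIAL_BACKOFF = "exponential backoff"
    RECLAIM_WITH_RESET = "re-claim with reset"
    ALERT_AND_PAUSE = "alert and pause"


# One entry per branch of the diagram.
ACTIONS = {
    "transient": Action.RETRY_WITH_BACKOFF,
    "permanent": Action.MARK_FAILED,
    "rate_limited": Action.EXPONENTIAL_BACKOFF,
    "zombie": Action.RECLAIM_WITH_RESET,
    "unknown": Action.ALERT_AND_PAUSE,
}


def next_action(classification: str) -> Action:
    """Look up the response for a classification.

    Anything outside the taxonomy is handled like "unknown":
    stop and alert rather than retry blindly.
    """
    return ACTIONS.get(classification, Action.ALERT_AND_PAUSE)
```

Note that the fallback is deliberately the most conservative action, so a missing table entry can never turn into a silent retry.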
The most important classification is "permanent." A 401 from an upstream API is not going to become a 200 by retrying. A 400 caused by a schema mismatch is not going to succeed by trying again. A 404 for a resource that does not exist will not start resolving because you waited. These errors will fail forever, and every retry is wasted work that consumes resources and amplifies load. A system that retries permanent errors is not just inefficient; it is broken.
The second most important classification is "rate-limited." A 429 is a request to slow down, not a request to retry. Treating a 429 as a transient error and retrying immediately is the opposite of what the server is asking for. The correct response is exponential backoff with jitter, ideally respecting the Retry-After header if the server provides one. A system that retries 429s without backoff is actively hostile to the upstream service.
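A minimal sketch of that delay calculation, under some stated assumptions: the function name `backoff_delay`, the 1-second base, and the 60-second cap are illustrative choices, not values from the deepmox-worker. Only the delta-seconds form of Retry-After is honored here; the HTTP-date form falls back to computed backoff.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 1.0,
    cap: float = 60.0,
    rng=random,
) -> float:
    """Seconds to wait before retry number `attempt` (1-based) after a 429."""
    # If the server says how long to wait (in whole seconds), do what it asks.
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after.strip())
    # Otherwise: exponential growth, capped, with "full jitter" so that many
    # workers backing off at once do not all retry at the same instant.
    ceiling = min(cap, base * 2 ** (attempt - 1))
    return rng.uniform(0, ceiling)
```

The jitter matters as much as the exponent. Without it, every worker that was throttled at the same moment wakes up at the same moment, and the upstream service sees the same spike again.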
I initially thought error classification was a nice-to-have optimization. I changed my mind during the retry storm. The system was not just slow; it was actively harming the upstream service. Every retry made the situation worse. The fix was not to add more retry capacity. The fix was to classify errors correctly so that we only retried the ones that would succeed. Once we deployed the fix, the system recovered in minutes. The error classification was the difference between a self-healing system and a self-destructing one.
Imagine you are the operator of a system that calls an external API. The API starts returning errors. You have two choices: retry everything (and risk a storm) or classify first, then respond. The first option feels safer—it seems like more retries means more chances of success. The reality is the opposite. Unclassified retries are one of the most common causes of cascading failures in distributed systems. Classification is not a defensive measure; it is the only way to respond to errors correctly.
The deepmox-worker classifies errors in the worker code, not in the database. This is intentional. The worker has the context to determine what kind of error occurred (network, schema, rate-limit, etc.), and the database does not. The classification is then persisted as part of the job's error record:
errors (id, job_id, code, classification, message, retry_count, created_at)
The classification column is the key. It is not free-text. It is an enum: transient, permanent, rate_limited, zombie, unknown. The retry logic queries this column to decide what to do next. A job with classification='permanent' is not retried. A job with classification='rate_limited' is retried with backoff. A job with classification='zombie' is reclaimed with state reset. The classification is the contract between error handling and retry logic.
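To make the contract concrete, here is a sketch using Python's built-in `sqlite3` module. The column names come from the schema above; everything else is an assumption for illustration: the column types, the CHECK constraint standing in for an enum, and the helper names `open_db`, `record_error`, `latest_classification`, and `should_retry`.

```python
import sqlite3
from typing import Optional

SCHEMA = """
CREATE TABLE errors (
    id             INTEGER PRIMARY KEY,
    job_id         INTEGER NOT NULL,
    code           TEXT    NOT NULL,
    classification TEXT    NOT NULL CHECK (classification IN
        ('transient', 'permanent', 'rate_limited', 'zombie', 'unknown')),
    message        TEXT,
    retry_count    INTEGER NOT NULL DEFAULT 0,
    created_at     TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
)
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the errors table (demo only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def record_error(conn, job_id: int, code: str, classification: str, message: str) -> None:
    """Persist a classified error; the CHECK constraint rejects free-text values."""
    conn.execute(
        "INSERT INTO errors (job_id, code, classification, message) VALUES (?, ?, ?, ?)",
        (job_id, code, classification, message),
    )


def latest_classification(conn, job_id: int) -> Optional[str]:
    """Return the classification of the job's most recent error, if any."""
    row = conn.execute(
        "SELECT classification FROM errors WHERE job_id = ? ORDER BY id DESC LIMIT 1",
        (job_id,),
    ).fetchone()
    return row[0] if row else None


def should_retry(conn, job_id: int) -> bool:
    """Retry only failures that can plausibly succeed on a later attempt."""
    return latest_classification(conn, job_id) in ("transient", "rate_limited")
```

In this sketch, zombie and unknown jobs are deliberately not "retried": they go through reclaim and review paths instead, which a fuller version would model separately.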
The discipline of persisting the classification alongside the job is what makes the system debuggable. When something goes wrong, you can query the errors table to see exactly what kinds of failures are occurring. If rate_limited errors are spiking, you know the upstream service is struggling. If permanent errors are spiking, you know there is a schema or auth issue. If zombie errors are spiking, you know the runtime is reclaiming workers mid-job. The classification is the diagnostic information that tells you what is actually happening.
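The kind of diagnostic query described here might look like the following sketch. The table is cut down to the columns the query needs, the rows are made-up sample data, and `classification_counts` is a hypothetical helper name.

```python
import sqlite3


def classification_counts(conn: sqlite3.Connection, since: str) -> dict:
    """Count errors per classification recorded at or after `since`."""
    rows = conn.execute(
        "SELECT classification, COUNT(*) FROM errors "
        "WHERE created_at >= ? "
        "GROUP BY classification ORDER BY COUNT(*) DESC",
        (since,),
    )
    return dict(rows.fetchall())


# Made-up sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE errors (job_id INTEGER, classification TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO errors VALUES (?, ?, ?)",
    [
        (1, "rate_limited", "2024-01-01 10:00:00"),
        (2, "rate_limited", "2024-01-01 10:01:00"),
        (3, "permanent", "2024-01-01 10:02:00"),
        (4, "transient", "2024-01-01 09:00:00"),
    ],
)

print(classification_counts(conn, "2024-01-01 10:00:00"))
# → {'rate_limited': 2, 'permanent': 1}
```

A spike in one bucket points at one kind of cause, which is exactly the information a generic "request failed" counter throws away.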
The most counterintuitive lesson is about unknown errors. Most systems treat "we don't know what this is" as "transient, retry." The deepmox-worker treats it as "pause and investigate." The reasoning is that unknown errors should be rare in production once the known cases are classified. If you see one, something has changed: an API version, a network condition, a deployment. Retrying unknown errors is dangerous because you might be amplifying a problem you do not understand. Pausing and investigating is safer, even if it means a temporary slowdown.
This is a hard discipline. There is real pressure to "just retry, we can investigate later." That pressure is what causes retry storms. The deepmox-worker resists it by making unknown errors explicit. Every unknown error is logged, alerted, and held for human review. The system does not silently absorb unknown errors and hope for the best. It surfaces them and waits for a human to classify them. This is slower in the short term and dramatically more reliable in the long term.
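One way to picture "logged, alerted, and held for human review" is a small holding queue. This is a sketch under assumptions, not the deepmox-worker's mechanism: `ReviewQueue`, `hold`, and `resolve` are hypothetical names, and the alert is represented by an error-level log line.

```python
import logging
from dataclasses import dataclass, field
from typing import Dict

log = logging.getLogger("worker.errors")

KNOWN = {"transient", "permanent", "rate_limited", "zombie"}


@dataclass
class ReviewQueue:
    """Holds jobs whose errors could not be classified until a human decides."""

    held: Dict[int, str] = field(default_factory=dict)  # job_id -> error message

    def hold(self, job_id: int, message: str) -> None:
        """Surface the unknown error and take the job out of the retry path."""
        log.error("unknown error on job %s, held for review: %s", job_id, message)
        self.held[job_id] = message

    def resolve(self, job_id: int, classification: str) -> str:
        """Record the reviewer's decision; the job leaves the queue."""
        if classification not in KNOWN:
            raise ValueError(f"reviewer must pick a known classification, got {classification!r}")
        del self.held[job_id]
        return classification
```

The important property is that nothing leaves the queue without an explicit decision. The job cannot drift back into the retry path on its own.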
The shift in my thinking came when I realized that error classification is not about technical precision. It is about institutional trust. If your system can be trusted to classify errors correctly, you can trust it to retry correctly. If it cannot, you cannot trust it to do anything autonomously. The deepmox-worker's error classification is the foundation of its autonomy. It is what allows the pull cycle to run for hours without human intervention.
The deeper lesson is about separation of concerns. Error classification is a separate concern from error handling. Error handling is "what do we do when an error occurs." Error classification is "what kind of error is this." These are different problems, with different solutions. The deepmox-worker separates them by putting classification in the catch block and response in the retry logic. The catch block reports what kind of error it caught. The retry logic queries the database to decide what to do with jobs of each classification. The two pieces communicate through the persistent error record.
This separation is what makes the system extensible. Adding a new error type (say, quota_exceeded) does not require rewriting the retry logic. It requires adding the new value to the classification enum, updating the catch block to recognize the new error, and adding one branch to the retry decision. The retry logic stays generic: "look at the classification, respond appropriately." This is the kind of design that scales: small additions, no rewrites, behavior changes localized to a few well-defined places.
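The extensibility argument can be sketched with a handler registry, where each classification gets one registered response and the decision function itself never changes. All names here (`RESPONSES`, `responds_to`, `decide`, and the handlers) are hypothetical, and the handlers return strings only to keep the example observable.

```python
from typing import Callable, Dict

# classification -> response handler
RESPONSES: Dict[str, Callable[[int], str]] = {}


def responds_to(classification: str):
    """Register a function as the response for one classification."""
    def register(fn: Callable[[int], str]) -> Callable[[int], str]:
        RESPONSES[classification] = fn
        return fn
    return register


@responds_to("permanent")
def mark_failed(job_id: int) -> str:
    return f"job {job_id}: marked failed"


@responds_to("rate_limited")
def back_off(job_id: int) -> str:
    return f"job {job_id}: retry after backoff"


# Adding a new error type is one new branch; decide() below is untouched.
@responds_to("quota_exceeded")
def wait_for_quota(job_id: int) -> str:
    return f"job {job_id}: parked until quota resets"


def decide(job_id: int, classification: str) -> str:
    """Generic retry decision: look at the classification, respond appropriately."""
    handler = RESPONSES.get(classification)
    if handler is None:
        return f"job {job_id}: paused for review"
    return handler(job_id)
```

For example, `decide(1, "quota_exceeded")` dispatches to the new handler without any edit to `decide` or to the existing handlers. The catch block still has to learn to recognize the new error, which is the other half of the change.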
Error classification is, ultimately, the discipline that makes self-healing possible. Without it, retry logic is a guess. With it, retry logic is a decision. The deepmox-worker's error classification is the difference between a system that recovers from failures and a system that makes failures worse. We will see this pattern recur in the next chapter, where retry logic uses the error classification to decide between immediate retry, exponential backoff, and abandonment.