The Polling Tax
A polling worker is a toll booth that opens once per second and never has a customer. A callback-driven worker is the same booth with the gate up. The visible saving is latency. The structural saving is something else: the booth is allowed to be silent for as long as it has nothing to report, and silence becomes a signal you can trust.
Key Takeaways
- A polling worker pays its tax in *empty* state — most polls return "no work." The system looks healthy while it pays.
- A callback-driven worker replaces the empty state with a kernel-level wait. The cost moves from "always polling" to "wake only when needed."
- The latency win is the headline. The real win is *observability*: a polling system that is silent is indistinguishable from a polling system that is broken. A callback system that is silent is unambiguously *waiting*, and the trace shows it.
- This chapter establishes the architectural case. The next chapter shows what the observability claim actually requires to be true.
03:14
The dashboard shows the worker's last heartbeat timestamp. It is 03:14 and the worker last spoke to the coordinator at 03:13:59. The system is healthy. The worker is, in the words of the runbook, "responsive."
The worker is, in fact, polling. Every second. It has been doing this for the last nine minutes. The job it picked up at 03:05:00 will take another six minutes to render. The worker, for those nine minutes, has asked the coordinator 540 times whether a new job is available and received 540 identical "no" responses. The system is healthy.
This is the polling architecture's signature property: it produces the appearance of responsiveness at a flat per-interval cost, and the cost is paid even when there is no work. The 540 empty polls are not a bug. They are the architecture.
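The arithmetic behind that scene is small enough to write down. The following is a minimal sketch, assuming a fixed poll interval and assuming every poll in the window comes back empty; the function name `empty_polls` is illustrative and not part of any system described here.

```python
def empty_polls(window_seconds: float, interval_seconds: float = 1.0) -> int:
    """Count the polls issued in a window, assuming one poll per interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return int(window_seconds // interval_seconds)


nine_minutes = 9 * 60
print(empty_polls(nine_minutes))       # 540 polls at a 1-second interval
print(empty_polls(nine_minutes, 5.0))  # 108 polls at a 5-second interval
```

Stretching the interval lowers the count but never brings it to zero, and it buys that reduction with added pickup latency.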
stateDiagram-v2
[*] --> Polling
Polling --> Polling: poll → "no work" (every 1s, forever)
Polling --> Running: poll → "yes" (rare)
Running --> Polling: job complete → resume polling
note right of Polling
480 empty polls per 8-min job
System looks responsive
System is, in fact, idle
end note
What the polling worker is actually doing
A polling worker is a tight loop with a network call in it. Each iteration is the same: open a connection (or reuse one), send a request, wait for a response, decide, repeat. The architecture assumes that the decision step is "is there work?" and that asking the question frequently is the only way to keep the system responsive.
This is the assumption the callback architecture breaks.
A callback-driven worker, in the canonical Node.js formulation, *blocks only in the poll phase of the event loop, and only when nothing is queued.* When the kernel reports that a file descriptor is readable (because a render completed, a download finished, or a remote service returned), the runtime adds the matching callback to the poll queue, the event loop wakes, and the callback runs. The worker does not ask. The system tells.
The architectural shift looks like this in state-machine terms:
stateDiagram-v2
[*] --> KernelWait
KernelWait --> KernelWait: epoll_wait — no events (free)
KernelWait --> Running: kernel signals fd ready (rare)
Running --> KernelWait: callback completes → resume wait
note right of KernelWait
1 wake per 8-min job
System is asleep, not idle
Wake is observable as a span
end note
Two diagrams, side by side. The first one shows a worker that is *busy doing nothing* 480 times per job. The second one shows a worker that is *asleep* and is woken exactly once per job. The latency math is obvious. The observability math is the article.
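The second diagram can be approximated in a few lines. This is a sketch rather than a worker implementation: it uses Python's standard `selectors` module (which picks epoll, kqueue, or select depending on the platform) in place of the Node.js event loop, and the producer thread, the `job-42` payload, and the 0.2-second delay are all made up for illustration.

```python
import selectors
import socket
import threading
import time


def wait_for_wake(timeout: float = 5.0) -> dict:
    """Park in a kernel-level wait until a peer writes, then report the wake."""
    sel = selectors.DefaultSelector()
    producer, worker = socket.socketpair()
    sel.register(worker, selectors.EVENT_READ)

    # Hypothetical producer: signals "job ready" exactly once, after a delay.
    def signal_later() -> None:
        time.sleep(0.2)
        producer.sendall(b"job-42")

    thread = threading.Thread(target=signal_later)
    thread.start()

    wakes = 0
    payload = b""
    started = time.monotonic()
    # No loop and no request here: the thread sleeps until the fd is readable.
    for key, _mask in sel.select(timeout=timeout):
        wakes += 1
        payload = key.fileobj.recv(64)
    waited = time.monotonic() - started

    thread.join()
    sel.unregister(worker)
    sel.close()
    producer.close()
    worker.close()
    return {"wakes": wakes, "payload": payload, "waited": waited}


result = wait_for_wake()
print(result["wakes"], result["payload"])
```

The worker issues no requests while it waits. The only event in the window is the single wake, which is the point at which a span can start.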
The hidden cost is structural, not economic
When I started this analysis I assumed the polling-vs-callback question was a latency optimization. I was wrong. The economic case — fewer round-trips, lower CPU, lower coordinator load — is the visible part of the iceberg. The structural case is below the waterline.
A polling system is structurally hard to observe. Three reasons.
First, a polling worker emits a constant stream of "no work" responses. These look identical to "I checked and there was no work" and to "I checked and the network broke." A failed poll and a successful empty poll produce the same log line. The system cannot distinguish them without
8m / Article + audio + video
Full-Link Timing in Push Systems
A trace in an event-driven system is a relay-race baton. The producer hands it off at the wake-up signal. The worker carries it through the run. The completion posts it to the storage tier. Drop it at any one of those handoffs, and the full-link claim is false — not partial, not degraded, *false*. The latency number you report will silently exclude the dropped link. This chapter shows the two specific places the baton is most often dropped, and the backstop that catches the drop when it happens.
Key Takeaways
- Full-link timing in a callback-driven worker is a *trace-continuity* problem, not a logging problem. The trace_id must cross the producer-consumer gap; spans without a shared trace_id are not a trace, they are unrelated logs.
- OpenTelemetry's producer/consumer span kind is the model that makes the gap explicit. The producer span ends *before* the consumer span starts. The only thing connecting them is trace_id and an optional link.
- The two failure modes are: (1) the trace_id is not propagated through the wake-up signal, and (2) the wake-up signal is lost in transit and the worker stays asleep. Both are silent. Both are detectable with the right spans.
- The backstop is a deadline span: a timer-based span that fires if the worker has not produced a completion span by the expected time. The deadline span turns "silent sleep" into an observable event.
- The OpenTelemetry model is not additive instrumentation. It is the *structure* of the system. Without it, the architectural promise from the previous chapter does not hold.
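The deadline-span backstop from the takeaways can be sketched as a pure check over recorded spans. Everything here is an assumption made for illustration: the `Span` dataclass, the `deadline.exceeded` span name, the choice of `worker.run` as the completion span, and the 10-minute budget. A real system would run this check from a timer and export the result through its tracing SDK.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class Span:
    trace_id: str
    name: str
    start_time: datetime
    end_time: Optional[datetime] = None


def deadline_span(spans: List[Span], trace_id: str, wake_sent_at: datetime,
                  budget: timedelta, now: datetime,
                  completion_name: str = "worker.run") -> Optional[Span]:
    """Return a synthetic span if the trace has no completion by the deadline."""
    for span in spans:
        if (span.trace_id == trace_id and span.name == completion_name
                and span.end_time is not None):
            return None  # the worker finished; nothing to report
    if now < wake_sent_at + budget:
        return None  # still inside the budget; keep waiting
    # The silence is now an event: same trace_id, so it joins the trace.
    return Span(trace_id, "deadline.exceeded", wake_sent_at, now)


trace = "5b8aa5a2d2c872e8321cf37308d69df2"
sent = datetime(2026, 7, 1, 3, 5, 0, 180000, tzinfo=timezone.utc)
late = datetime(2026, 7, 1, 3, 16, 0, tzinfo=timezone.utc)
print(deadline_span([], trace, sent, timedelta(minutes=10), late))
```

Because the synthetic span reuses the producer's trace_id, a lost wake-up shows up inside the same trace as the job that was never picked up.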
What a real full-link span looks like
Here is the kind of span structure a callback-driven worker should be emitting, modeled on the span examples in the OpenTelemetry traces documentation. Three spans, one shared trace_id, a clear parent chain.
{
  "trace_id": "5b8aa5a2d2c872e8321cf37308d69df2",
  "spans": [
    {
      "name": "produce.render_job",
      "span_id": "051581bf3cb55c13",
      "parent_id": null,
      "span_kind": "Producer",
      "start_time": "2026-07-01T03:05:00.000Z",
      "end_time": "2026-07-01T03:05:00.180Z"
    },
    {
      "name": "worker.ack_wake",
      "span_id": "5fb397be34d26b51",
      "parent_id": "051581bf3cb55c13",
      "span_kind": "Consumer",
      "start_time": "2026-07-01T03:05:00.310Z",
      "end_time": "2026-07-01T03:05:00.380Z"
    },
    {
      "name": "worker.run",
      "span_id": "93564f51e1abe1c2",
      "parent_id": "051581bf3cb55c13",
      "span_kind": "Internal",
      "start_time": "2026-07-01T03:05:00.380Z",
      "end_time": "2026-07-01T03:13:42.000Z"
    }
  ]
}
Three spans. One trace_id. The produce.render_job span ends at 03:05:00.180. The worker.ack_wake span starts at 03:05:00.310. There is a 130-millisecond gap. The worker.run span starts when the wake-up is acknowledged, and it runs for eight minutes and forty-one seconds. The full-link timing is recoverable: from the producer's end (03:05:00.180) to the worker's end (03:13:42.000), with a 130ms wake-up latency in the middle.
This is what the OpenTelemetry model gives you. The producer is finished long before the consumer starts. The two spans are causally related (the consumer exists because the producer created a job) but temporally separated. The trace_id is what stitches them.
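The timing can be recomputed from the span timestamps alone. The sketch below copies the start and end times from the example trace into a plain dictionary; the helper names (`parse`, `seconds_between`, `link_timing`) are illustrative, not part of any tracing library.

```python
from datetime import datetime

# Start and end times copied from the three spans in the example trace.
SPANS = {
    "produce.render_job": ("2026-07-01T03:05:00.000Z", "2026-07-01T03:05:00.180Z"),
    "worker.ack_wake": ("2026-07-01T03:05:00.310Z", "2026-07-01T03:05:00.380Z"),
    "worker.run": ("2026-07-01T03:05:00.380Z", "2026-07-01T03:13:42.000Z"),
}


def parse(ts: str) -> datetime:
    # Older Python versions reject a trailing "Z", so normalize it first.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def seconds_between(start: str, end: str) -> float:
    return (parse(end) - parse(start)).total_seconds()


def link_timing(spans: dict) -> dict:
    producer_start, producer_end = spans["produce.render_job"]
    ack_start, _ack_end = spans["worker.ack_wake"]
    run_start, run_end = spans["worker.run"]
    return {
        "wake_gap_s": seconds_between(producer_end, ack_start),
        "run_s": seconds_between(run_start, run_end),
        "full_link_s": seconds_between(producer_start, run_end),
    }


# Roughly 0.13 s of wake-up gap, 521.62 s of run, 522 s end to end.
print(link_timing(SPANS))
```

None of these numbers can be computed unless all three spans are known to belong to the same trace, which is why the shared trace_id matters more than any individual timer.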
"A Consumer span represents the processing of a job created by a producer and may start long after the producer span has already ended." — OpenTelemetry traces spec
This is the architectural shape of an event-driven worker. Producer first, consumer later, same trace_id across both, parent chain intact.
Failure mode one: the dropped trace_id
I started this chapter assuming timing instrumentation was additive. Add a timer, log the latency. The OpenTelemetry model and a specific debugging session pushed me toward a different view: timing is *structural*. It either flows through the wake-up, or it doesn't exist.
Here is what the failure looks like in the wild.
The producer service creates a job, attaches a trace_id to the job payload, and pushes the wake-up signal. The signal carries the job ID. The signal does *not* carry the trace_id — because the signal is a thin notification ("wake up, there is work") and the assumption was that the worker would fetch the job from the queue and pick up the trace_id from the payload.
The worker wakes, fetches the job, starts a new span. The new span has its own trace_id, generated locally, with no relationship to the producer's trace_id. The worker's span tree is well-formed. The producer's span tree is well-formed. They are unrelated.
The result, on the dashboard, is a producer that finished in 180ms and a worker that started 130ms later. The two are not connected. The 8-minute-and-41-second run time of the worker is reported, but it is *not* attached to the producer that caused it. From the dashboard's point of view, the worker run is an orphan span.
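The difference between the orphaned run and a connected one comes down to whether the worker has trace context in hand when it starts its span. The sketch below is a toy model, not OpenTelemetry API code: `WakeSignal`, `Span`, and `start_worker_span` are hypothetical names, and carrying the context on the signal itself is one assumed fix (reading it from the job payload before starting the span would be another).

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass
class WakeSignal:
    job_id: str
    trace_id: Optional[str] = None        # absent in the "thin notification" design
    parent_span_id: Optional[str] = None


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]


def start_worker_span(signal: WakeSignal) -> Span:
    """Continue the producer's trace if the signal carries one, else start fresh."""
    span_id = secrets.token_hex(8)
    if signal.trace_id is None:
        # Failure mode: a locally generated trace_id, unrelated to the producer.
        return Span("worker.run", secrets.token_hex(16), span_id, None)
    return Span("worker.run", signal.trace_id, span_id, signal.parent_span_id)


producer_trace = "5b8aa5a2d2c872e8321cf37308d69df2"
producer_span = "051581bf3cb55c13"

orphan = start_worker_span(WakeSignal(job_id="render-1"))
linked = start_worker_span(WakeSignal("render-1", producer_trace, producer_span))

print(orphan.trace_id == producer_trace)  # False: well-formed but unrelated
print(linked.trace_id == producer_trace)  # True: same trace, parent chain intact
```

Both worker spans are well-formed in isolation. Only the second one can be joined to the producer that caused it.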
This is the most common failure mode in callba