The Polling Tax

A polling worker is a toll booth that opens once per second and never has a customer. A callback-driven worker is the same booth with the gate up. The visible saving is latency. The structural saving is something else: the booth is allowed to be silent for as long as it has nothing to report, and silence becomes a signal you can trust.

Key Takeaways

A polling worker pays its tax in *empty* state — most polls return "no work." The system looks healthy while it pays.
A callback-driven worker replaces the empty state with a kernel-level wait. The cost moves from "always polling" to "wake only when needed."
The latency win is the headline. The real win is *observability*: a polling system that is silent is indistinguishable from a polling system that is broken. A callback system that is silent is unambiguously *waiting*, and the trace shows it.
This chapter establishes the architectural case. The next chapter shows what the observability claim actually requires to be true.

03:14

The dashboard shows the worker's last heartbeat timestamp. It is 03:14 and the worker last spoke to the coordinator at 03:13:59. The system is healthy. The worker is, in the words of the runbook, "responsive."

The worker is, in fact, polling. Every second. It has been doing this for the last nine minutes. The job it picked up at 03:05:00 will take another six minutes to render. The worker, for those nine minutes, has asked the coordinator 540 times whether a new job is available and received 540 identical "no" responses. The system is healthy.

This is the polling architecture's signature property: it produces the appearance of responsiveness at a flat per-interval cost, and the cost is paid even when there is no work. The 540 empty polls are not a bug. They are the architecture.

stateDiagram-v2
    [*] --> Polling
    Polling --> Polling: poll → "no work" (every 1s, forever)
    Polling --> Running: poll → "yes" (rare)
    Running --> Polling: job complete → resume polling
    note right of Polling
        540 empty polls in 9 min of one job
        System looks responsive
        System is, in fact, idle
    end note
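The arithmetic behind the 03:14 scenario is simple enough to check directly. A minimal sketch, where `empty_polls` is a hypothetical helper and the one-second interval is the figure used in this chapter:

```python
def empty_polls(busy_seconds: int, poll_interval_seconds: int = 1) -> int:
    """Number of polls issued during a window in which no new work is available."""
    if poll_interval_seconds <= 0:
        raise ValueError("poll interval must be positive")
    return busy_seconds // poll_interval_seconds


# Nine minutes at one poll per second.
print(empty_polls(9 * 60))        # 540
# Same window at a 60-second interval: cheaper, but slower to notice anything.
print(empty_polls(9 * 60, 60))    # 9
```

The second call shows the only knob the architecture offers: a longer interval lowers the count but does not remove the tax.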

What the polling worker is actually doing

A polling worker is a tight loop with a network call in it. Each iteration is the same: open a connection (or reuse one), send a request, wait for a response, decide, repeat. The architecture assumes that the decision step is "is there work?" and that asking the question frequently is the only way to keep the system responsive.
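The loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not any particular coordinator's client: `poll`, `run_job`, and the injected `sleep` are hypothetical stand-ins so the example runs without a network.

```python
import time
from typing import Callable, List, Optional


def polling_worker(
    poll: Callable[[], Optional[str]],
    run_job: Callable[[str], None],
    max_polls: int,
    interval_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll up to max_polls times; return how many polls came back empty."""
    empty = 0
    for _ in range(max_polls):
        job = poll()              # one round-trip to the coordinator
        if job is None:
            empty += 1            # "no work": the tax
        else:
            run_job(job)
        sleep(interval_seconds)   # wait, then ask again
    return empty


# Hypothetical coordinator: nothing for five polls, one job, then nothing again.
responses = iter([None] * 5 + ["render-42"] + [None] * 4)
done: List[str] = []
empty = polling_worker(
    lambda: next(responses), done.append, max_polls=10, sleep=lambda _s: None
)
print(empty, done)   # 9 ['render-42']
```

Nine of the ten iterations did nothing except confirm that there was nothing to do.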

This is the assumption the callback architecture breaks.

A callback-driven worker, in the canonical Node.js formulation, *blocks only in the poll phase of the event loop, and only when its callback queue is empty.* When the kernel reports that a file descriptor is readable (because a render completed, a download finished, a remote service returned), Node.js adds the matching callback to the poll queue, the event loop wakes, and the callback runs. The worker does not ask. The system tells.

The architectural shift looks like this in state-machine terms:

stateDiagram-v2
    [*] --> KernelWait
    KernelWait --> KernelWait: epoll_wait — no events (free)
    KernelWait --> Running: kernel signals fd ready (rare)
    Running --> KernelWait: callback completes → resume wait
    note right of KernelWait
        1 wake per job
        System is asleep, not idle
        Wake is observable as a span
    end note
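The kernel-level wait can be demonstrated outside Node.js as well. The sketch below uses Python's `selectors` module, which wraps epoll on Linux and kqueue on BSD and macOS. The socket pair and the `finish_render` thread are assumptions standing in for a real completion event.

```python
import selectors
import socket
import threading
import time

# A connected socket pair stands in for "a render completed".
notify_side, worker_side = socket.socketpair()

selector = selectors.DefaultSelector()
selector.register(worker_side, selectors.EVENT_READ)


def finish_render() -> None:
    time.sleep(0.2)                    # the job takes a while
    notify_side.send(b"done")          # makes worker_side readable


threading.Thread(target=finish_render).start()

wakes = 0
message = b""
started = time.monotonic()
events = selector.select(timeout=5.0)  # blocks in the kernel; no polling loop
for key, _mask in events:
    wakes += 1
    message = key.fileobj.recv(16)
waited = time.monotonic() - started

print(f"woken {wakes} time(s) after {waited:.2f}s with {message!r}")

selector.close()
notify_side.close()
worker_side.close()
```

The worker thread issues no requests while it waits. It is descheduled until the descriptor becomes readable, and the `timeout` argument is where a deadline would attach.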

Two diagrams, side by side. The first one shows a worker that is *busy doing nothing* once a second for the length of the job. The second one shows a worker that is *asleep* and is woken exactly once per job. The latency math is obvious. The observability math is the article.

The hidden cost is structural, not economic

When I started this analysis I assumed the polling-vs-callback question was a latency optimization. I was wrong. The economic case — fewer round-trips, lower CPU, lower coordinator load — is the visible part of the iceberg. The structural case is below the waterline.

A polling system is structurally hard to observe. Three reasons.

First, a polling worker emits a constant stream of "no work" results, and from the log alone "I checked and there was no work" is hard to tell apart from "I checked and the network broke." Unless the loop records the two outcomes separately, a failed poll and a successful empty poll produce the same log line. Telling them apart then takes a sidecar that measures poll latency, and most systems do not run one.
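Separating the two outcomes is the kind of extra code a polling loop has to carry. A minimal sketch, assuming the poll function raises an `OSError` subclass on network failure; `PollOutcome`, `classify`, and `broken_network` are hypothetical names:

```python
from enum import Enum
from typing import Callable, Optional


class PollOutcome(Enum):
    WORK = "work"
    EMPTY = "empty"        # coordinator answered: nothing to do
    FAILED = "failed"      # no answer at all: timeout, reset, DNS failure


def classify(poll: Callable[[], Optional[str]]) -> PollOutcome:
    """Run one poll and report which of the three cases happened."""
    try:
        job = poll()
    except OSError:        # ConnectionError and TimeoutError are subclasses
        return PollOutcome.FAILED
    return PollOutcome.WORK if job is not None else PollOutcome.EMPTY


def broken_network() -> Optional[str]:
    raise ConnectionResetError("coordinator unreachable")


print(classify(lambda: None).value)         # empty
print(classify(lambda: "render-42").value)  # work
print(classify(broken_network).value)       # failed
```

A loop that swallows the exception and logs "no work" collapses the last two cases into one, which is the situation described above.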

Second, the polling cadence sets the detection latency. If the poll interval is 1 second, the system can notice a stuck worker at the *next* poll. If the interval is 60 seconds, it can take a minute to notice. A failure cannot be detected before the next poll, and nothing forces a check sooner, so the failure-detection budget is *set by the polling cadence itself*, not by the system's actual response time.
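As a worked example of that bound, the sketch below computes how long a failure goes unseen. It assumes polls fire at exact multiples of the interval and that a failed poll is recognised immediately. Both are simplifications, and `detection_delay` is a hypothetical helper.

```python
import math


def detection_delay(failure_at: float, interval: float) -> float:
    """Seconds between a failure and the first poll that can observe it."""
    next_poll = math.ceil(failure_at / interval) * interval
    return next_poll - failure_at


print(detection_delay(0.25, 1.0))    # 0.75
print(detection_delay(61.0, 60.0))   # 59.0
```

A real system adds the poll timeout on top, so these figures are a best case.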

Third, a polling worker that is "slow" is indistinguishable from a polling worker that is "in flight." Both will eventually respond to the next poll — the slow one with a partial answer, the in-flight one with a complete one. From outside the worker, the only signal you have is "did the next poll return in time?" If yes, the worker is fine. If no, the worker is *either* slow *or* dead, and the next poll will tell you which. The information density is one bit per poll: alive, or not alive.

A callback-driven worker inverts all three.

A callback worker emits *no signal* while waiting. That silence is not a bug — it is the architecture. The worker is asleep in the kernel, the event loop is in poll, and the only way to be awoken is for the kernel to deliver a file-descriptor event. If the worker is not awoken, *something is wrong*, and the wrongness is detectable by the absence of expected output at the deadline.

A callback worker that fails to be woken is *detectably* silent. The deadline is a timer, set by the producer, scoped to the maximum expected job time. If the timer fires and the worker has not produced a result, the system knows the wake-up was lost — and the trace will show the wake-up span was started and never closed. The information density is one bit per deadline: result, or no result.
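The deadline mechanism can be sketched with Python's `asyncio`, used here as a stand-in for any event loop. The `wait_for_wakeup` helper and the two events are illustrative assumptions, and the deadlines are shortened to fractions of a second so the example finishes quickly.

```python
import asyncio
from typing import Tuple


async def wait_for_wakeup(wakeup: asyncio.Event, deadline_seconds: float) -> str:
    """Sleep until the wake-up arrives or the deadline passes."""
    try:
        await asyncio.wait_for(wakeup.wait(), timeout=deadline_seconds)
        return "result"
    except asyncio.TimeoutError:
        return "no result"          # the wake-up was lost or the job overran


async def main() -> Tuple[str, str]:
    delivered = asyncio.Event()
    lost = asyncio.Event()          # never set: simulates a lost wake-up
    asyncio.get_running_loop().call_later(0.05, delivered.set)
    first = await wait_for_wakeup(delivered, deadline_seconds=0.5)
    second = await wait_for_wakeup(lost, deadline_seconds=0.1)
    return first, second


outcomes = asyncio.run(main())
print(outcomes)   # ('result', 'no result')
```

Each wait yields exactly one bit at the deadline, and the worker emits nothing in between.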

This is what I mean by *observability-superior*. The polling system has to *invent* failure detection (a separate watchdog, a sidecar, a synthetic poll). The callback system has failure detection as a free consequence of its own design.

The hard rule the callback architecture imposes

There is a real cost to this. A callback-driven worker is only as good as its discipline. The Node.js documentation makes this point bluntly: a callback that never yields starves the event loop, the event loop never returns to the poll phase, and the process never gets to handle the events the kernel has ready for it. The system is not asleep; it is *wedged*. Without a deadline, the two states look identical from outside.

"This can create some bad situations because it allows you to 'starve' your I/O by making recursive process.nextTick() calls, which prevents the event loop from reaching the poll phase." — Node.js event loop docs

The polling architecture is not immune to this failure mode. A wedged polling worker is *also* silent at the deadline, but a polling worker that is functioning normally fills the same channel with empty polls, so the missing signal is buried in noise. The wedged-vs-idle distinction is *harder* in a polling system, not easier. So the callback architecture trades a new-looking failure mode (wedging) for a stronger observability story (idle-vs-wedged is detectable at the deadline), and that failure mode is *the same* one the polling system already had, just better instrumented.
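The starvation that the Node.js documentation warns about can be reproduced in miniature. The sketch below is a Python `asyncio` analogue, not Node's `process.nextTick()` itself: a blocking `time.sleep` plays the part of the callback that never yields, and a 10 ms timer plays the part of the starved I/O.

```python
import asyncio
import time
from typing import List


async def main() -> float:
    loop = asyncio.get_running_loop()
    fired_at: List[float] = []
    started = time.monotonic()

    # A timer that should fire after 10 ms.
    loop.call_later(0.01, lambda: fired_at.append(time.monotonic() - started))

    # A callback that does not yield: time.sleep blocks the whole loop thread.
    time.sleep(0.2)

    await asyncio.sleep(0.05)   # finally yield; the overdue timer runs now
    return fired_at[0]


delay = asyncio.run(main())
print(f"timer due at 0.01s fired at {delay:.2f}s")
```

From outside, the 200 ms during which the loop was wedged looks exactly like 200 ms of waiting. Only the missed deadline tells the two apart.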

This is the central claim of the migration. A well-instrumented callback system is observability-superior to a well-instrumented polling system. The polling system's failure detection is *additive* — extra code on top of the polling loop. The callback system's failure detection is *structural* — it falls out of the architecture.

The Reactive Manifesto put this first, in 2014. "Responsive" is the first property. "Message driven" is the fourth. They are not separable. A system that is not message-driven cannot be responsive at scale, because the cost of "is there work?" *is* the latency floor, and the latency floor is not reducible below the polling interval. Push it down and the cost explodes. Push it up and the system slows. There is no setting that wins.

Imagine it is 03:14

Imagine it is 03:14, and you are the on-call engineer. The dashboard shows that a worker last spoke to the coordinator at 03:13:59. In the polling architecture, the only question is: will the next poll come at 03:14:00? If yes, the worker is alive. If no, the worker is wedged, the network is broken, or the worker is in flight on a long job that happens to have paused its polls.

In the callback architecture, the dashboard would show you *what the worker is doing*. Not a heartbeat. A status. The worker is in KernelWait, blocked on a render completion event, with a deadline of 03:21:00. The deadline is a span, the span has a trace_id, and the trace_id is shared with the producer that created the job. If 03:21:00 passes and the span is still open, the trace will tell you so. The system does not need a watchdog. The trace is the watchdog.
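The status record described above can be sketched as a small data structure. This is not the OpenTelemetry API. `WaitSpan` and its fields are hypothetical, and the calendar date is a placeholder chosen only so the times are valid `datetime` values.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class WaitSpan:
    trace_id: str                  # shared with the producer that created the job
    state: str                     # e.g. "KernelWait"
    waiting_on: str                # the event the worker is blocked on
    deadline: datetime
    closed_at: Optional[datetime] = None

    def overdue(self, now: datetime) -> bool:
        """True when the span is still open after its deadline."""
        return self.closed_at is None and now > self.deadline


def overdue_spans(spans: List[WaitSpan], now: datetime) -> List[WaitSpan]:
    """What a dashboard would highlight: open spans past their deadline."""
    return [span for span in spans if span.overdue(now)]


span = WaitSpan(
    trace_id="trace-abc123",
    state="KernelWait",
    waiting_on="render completion",
    deadline=datetime(2024, 1, 1, 3, 21, 0),
)
print(span.overdue(datetime(2024, 1, 1, 3, 14, 0)))   # False
print(span.overdue(datetime(2024, 1, 1, 3, 21, 1)))   # True
```

At 03:14 the span is open and on schedule. One second past 03:21:00 the same record, unchanged, becomes the alert.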

This is what the migration buys. Not 1-second-vs-instant latency. *The trace is the watchdog.* That is the architectural claim. Everything that follows is the engineering bill.

What the next chapter owes you

The claim above is conditional. It is true *if* the trace crosses the producer-consumer boundary intact. It is true *if* the deadline is actually a span, not a comment in a config file. It is true *if* the wake-up signal carries the trace_id of the original request. The next chapter peels the claim open and shows the two specific failure modes that break it: the place where the trace_id is dropped, and the place where the wake-up is lost in transit. These are the points where the architectural promise meets the engineering reality, and they are exactly the points an end-to-end timing requirement asks you to verify.
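The third condition is the easiest to state as a check. A minimal sketch using plain dictionaries as messages; `create_job`, `wakeup_for`, `trace_intact`, and the field names are hypothetical and stand in for whatever the real producer and transport use.

```python
import uuid
from typing import Dict


def create_job(payload: str) -> Dict[str, str]:
    """Producer side: mint a trace_id once, when the job is created."""
    return {"job_id": uuid.uuid4().hex, "trace_id": uuid.uuid4().hex, "payload": payload}


def wakeup_for(job: Dict[str, str], carry_trace: bool = True) -> Dict[str, str]:
    """Build the wake-up message; carry_trace=False simulates a dropped trace_id."""
    message = {"job_id": job["job_id"], "event": "render_complete"}
    if carry_trace:
        message["trace_id"] = job["trace_id"]
    return message


def trace_intact(job: Dict[str, str], wakeup: Dict[str, str]) -> bool:
    """Consumer side: the wake-up must name the same trace as the job."""
    return wakeup.get("trace_id") == job["trace_id"]


job = create_job("frame-0001")
print(trace_intact(job, wakeup_for(job)))                      # True
print(trace_intact(job, wakeup_for(job, carry_trace=False)))   # False
```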

The polling worker in the 03:14 scenario makes 540 empty round-trips in nine minutes of a single job. The callback worker makes 1 wake-up per job. The latency difference is the obvious story. The observability difference, the fact that the callback worker's silence is a *signal*, not a failure, is the real one.

---

References:

The Reactive Manifesto — foundational definition of message-driven, responsive systems
The Node.js Event Loop — canonical reference implementation of callback-first scheduling
OpenTelemetry — Traces — the observability model the architectural claim depends on
Envoy xDS API — production-scale example of push-based control-plane delivery