Full-Link Timing in Push Systems
A trace in an event-driven system is a relay-race baton. The producer hands it off at the wake-up signal. The worker carries it through the run. The completion posts it to the storage tier. Drop it at any one of those handoffs, and the full-link claim is false — not partial, not degraded, *false*. The latency number you report will silently exclude the dropped link. This chapter shows the two specific places the baton is most often dropped, and the backstop that catches the drop when it happens.
Key Takeaways
- Full-link timing in a callback-driven worker is a *trace-continuity* problem, not a logging problem. The trace_id must cross the producer-consumer gap; spans without a shared trace_id are not a trace, they are unrelated logs.
- OpenTelemetry's producer/consumer span kind is the model that makes the gap explicit. The producer span ends *before* the consumer span starts. The only thing connecting them is trace_id and an optional link.
- The two failure modes are: (1) the trace_id is not propagated through the wake-up signal, and (2) the wake-up signal is lost in transit and the worker stays asleep. Both are silent. Both are detectable with the right spans.
- The backstop is a deadline span: a timer-based span that fires if the worker has not produced a completion span by the expected time. The deadline span turns "silent sleep" into an observable event.
- The OpenTelemetry model is not additive instrumentation. It is the *structure* of the system. Without it, the architectural promise from the previous chapter does not hold.
What a real full-link span looks like
Here is the kind of span structure a callback-driven worker should be emitting, modeled on the example trace in the OpenTelemetry traces documentation. Three spans, one shared trace_id, a clear parent chain.
{
  "trace_id": "5b8aa5a2d2c872e8321cf37308d69df2",
  "spans": [
    {
      "name": "produce.render_job",
      "span_id": "051581bf3cb55c13",
      "parent_id": null,
      "span_kind": "Producer",
      "start_time": "2026-07-01T03:05:00.000Z",
      "end_time": "2026-07-01T03:05:00.180Z"
    },
    {
      "name": "worker.ack_wake",
      "span_id": "5fb397be34d26b51",
      "parent_id": "051581bf3cb55c13",
      "span_kind": "Consumer",
      "start_time": "2026-07-01T03:05:00.310Z",
      "end_time": "2026-07-01T03:05:00.380Z"
    },
    {
      "name": "worker.run",
      "span_id": "93564f51e1abe1c2",
      "parent_id": "051581bf3cb55c13",
      "span_kind": "Internal",
      "start_time": "2026-07-01T03:05:00.380Z",
      "end_time": "2026-07-01T03:13:42.000Z"
    }
  ]
}
Three spans. One trace_id. The produce.render_job span ends at 03:05:00.180. The worker.ack_wake span starts at 03:05:00.310. There is a 130-millisecond gap. The worker.run span starts when the wake-up is acknowledged, and it runs for eight minutes and forty-one seconds (8:41.620 to the millisecond). The full-link timing is recoverable: measured from the producer's start at 03:05:00.000, the producer ends at +00:00.180 and the worker ends at +08:42.000, with a 130 ms wake-up latency in between.
This is what the OpenTelemetry model gives you. The producer is finished long before the consumer starts. The two spans are causally related (the consumer exists because the producer created a job) but temporally separated. The trace_id is what stitches them.
"A Consumer span represents the processing of a job created by a producer and may start long after the producer span has already ended." — OpenTelemetry traces spec
This is the architectural shape of an event-driven worker. Producer first, consumer later, same trace_id across both, parent chain intact.
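The timing arithmetic for the three-span trace above can be sketched in a few lines. This is only an illustration using the standard library: the `SPANS` table copies the timestamps from the example trace, and the function name `link_timing` and the output keys are hypothetical, not part of any OpenTelemetry API.

```python
from datetime import datetime

# Start and end times copied from the example trace above.
SPANS = {
    "produce.render_job": ("2026-07-01T03:05:00.000Z", "2026-07-01T03:05:00.180Z"),
    "worker.ack_wake": ("2026-07-01T03:05:00.310Z", "2026-07-01T03:05:00.380Z"),
    "worker.run": ("2026-07-01T03:05:00.380Z", "2026-07-01T03:13:42.000Z"),
}


def parse(ts: str) -> datetime:
    """Parse an ISO-8601 UTC timestamp with a trailing 'Z'."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f%z")


def to_ms(delta) -> int:
    """Convert a timedelta to whole milliseconds."""
    return round(delta.total_seconds() * 1000)


def link_timing(spans: dict) -> dict:
    """Return wake-up latency, run time, and full-link time in milliseconds."""
    times = {name: (parse(s), parse(e)) for name, (s, e) in spans.items()}
    producer_start, producer_end = times["produce.render_job"]
    wake_start, _ = times["worker.ack_wake"]
    run_start, run_end = times["worker.run"]
    return {
        "wake_latency_ms": to_ms(wake_start - producer_end),
        "run_ms": to_ms(run_end - run_start),
        "full_link_ms": to_ms(run_end - producer_start),
    }


timing = link_timing(SPANS)
print(timing)
```

The point of the sketch is that all three numbers come from one lookup keyed by span name inside a single trace. If the worker spans carried a different trace_id, the producer's timestamps would not be in the same record and the wake-up latency could not be computed at all.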
Failure mode one: the dropped trace_id
I started this chapter assuming timing instrumentation was additive. Add a timer, log the latency. The OpenTelemetry model and a specific debugging session pushed me toward a different view: timing is *structural*. It either flows through the wake-up, or it doesn't exist.
Here is what the failure looks like in the wild.
The producer service creates a job, attaches a trace_id to the job payload, and pushes the wake-up signal. The signal carries the job ID. The signal does *not* carry the trace_id — because the signal is a thin notification ("wake up, there is work") and the assumption was that the worker would fetch the job from the queue and pick up the trace_id from the payload.
The worker wakes, fetches the job, and starts a new span from its local context rather than from the trace context in the payload. The new span therefore gets its own locally generated trace_id, with no relationship to the producer's trace_id. The worker's span tree is well-formed. The producer's span tree is well-formed. They are unrelated.
The result, on the dashboard, is a producer that finished in 180ms and a worker that started 130ms later. The two are not connected. The 8-minute-and-41-second run time of the worker is reported, but it is *not* attached to the producer that caused it. From the dashboard's point of view, the worker run is an orphan span.
This is the most common failure mode in callback-driven systems. The fix is mechanical: the wake-up signal must carry the trace_id of the originating request. The trace_id is the relay-race baton, and the wake-up envelope is the handoff. The consumer's first span inherits the trace_id from the envelope, not from the local context. The OpenTelemetry spec supports this via *context propagation*: a standardized way to serialize the trace context into a carrier that survives transport.
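A minimal sketch of that handoff follows, assuming the envelope is a plain dictionary and the context is serialized in the W3C Trace Context `traceparent` layout (version, trace_id, parent span id, flags). It is hand-rolled with the standard library purely for illustration. The envelope field names and the helpers `make_wake_envelope` and `start_worker_span` are hypothetical; a real system would more likely call the OpenTelemetry propagation API (in Python, `opentelemetry.propagate.inject` and `extract`) than build the header by hand.

```python
import re
import secrets

# W3C Trace Context "traceparent" layout: version-trace_id-parent_id-flags.
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def make_wake_envelope(job_id: str, trace_id: str, producer_span_id: str) -> dict:
    """Producer side: put the trace context in the wake-up signal itself."""
    return {
        "job_id": job_id,
        "traceparent": f"00-{trace_id}-{producer_span_id}-01",  # 01 = sampled
    }


def start_worker_span(envelope: dict) -> dict:
    """Consumer side: inherit trace_id from the envelope when it is present."""
    match = TRACEPARENT.match(envelope.get("traceparent", ""))
    if match is None:
        # Thin notification with no context: the span starts a new, orphan trace.
        trace_id, parent_id, orphan = secrets.token_hex(16), None, True
    else:
        trace_id, parent_id, _flags = match.groups()
        orphan = False
    return {
        "name": "worker.ack_wake",
        "trace_id": trace_id,
        "parent_id": parent_id,
        "span_id": secrets.token_hex(8),
        "orphan": orphan,
    }


# IDs taken from the example trace earlier in the chapter.
envelope = make_wake_envelope(
    "job-42", "5b8aa5a2d2c872e8321cf37308d69df2", "051581bf3cb55c13"
)
stitched = start_worker_span(envelope)
thin = start_worker_span({"job_id": "job-42"})  # signal without trace context
print(stitched["trace_id"], stitched["parent_id"], thin["orphan"])
```

Flagging the orphan case explicitly, instead of silently generating a fresh trace_id, is a design choice in this sketch: it makes the dropped baton visible at the point where it is dropped.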
graph TD
P[produce.render_job<br/>trace_id=abc<br/>ends 03:05:00.180] -->|wake-up signal<br/>with trace_id=abc| W[worker.ack_wake<br/>trace_id=abc<br/>parent=P]
W --> R[worker.run<br/>trace_id=abc<br/>parent=P]
R -->|completion<br/>with trace_id=abc| C[worker.complete<br/>trace_id=abc<br/>parent=P]
style P fill:#cfc,stroke:#060
style W fill:#cfc,stroke:#060
style R fill:#cfc,stroke:#060
style C fill:#cfc,stroke:#060
The four spans form a single causal chain. The trace_id is the same on all four. The parent chain is intact. The full-link timing is recoverable.
Now here is the *same* architecture with the failure:
graph TD
P[produce.render_job<br/>trace_id=abc<br/>ends 03:05:00.180] -.->|trace_id dropped<br/>from signal| X[???]
P -.->|no shared id| W[worker.ack_wake<br/>trace_id=xyz<br/>orphan]
W --> R[worker.run<br/>trace_id=xyz<br/>orphan]
style P fill:#fcc,stroke:#600
style W fill:#fcc,stroke:#600
style R fill:#fcc,stroke:#600
The producer's span is closed. The worker's span is well-formed. There is no link between them. The 8-minute worker run is *unattributable* to the producer that caused it. The full-link claim is false.
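One way to surface this failure from exported span data is to group spans by trace_id and look for traces that are missing one side of the handoff. The sketch below assumes spans are available as dictionaries with `trace_id` and `span_kind` fields, as in the JSON example earlier; the function name `audit_traces` and the result keys are hypothetical.

```python
from collections import defaultdict


def audit_traces(spans: list) -> dict:
    """Find traces that contain only one side of the producer-consumer handoff."""
    kinds_by_trace = defaultdict(set)
    for span in spans:
        kinds_by_trace[span["trace_id"]].add(span["span_kind"])
    return {
        # Worker spans with no Producer span in the same trace.
        "orphan_worker_traces": sorted(
            t for t, kinds in kinds_by_trace.items() if "Producer" not in kinds
        ),
        # Producer spans that no Consumer span ever joined.
        "producer_only_traces": sorted(
            t
            for t, kinds in kinds_by_trace.items()
            if "Producer" in kinds and "Consumer" not in kinds
        ),
    }


# The failure diagram above, expressed as span records.
spans = [
    {"trace_id": "abc", "name": "produce.render_job", "span_kind": "Producer"},
    {"trace_id": "xyz", "name": "worker.ack_wake", "span_kind": "Consumer"},
    {"trace_id": "xyz", "name": "worker.run", "span_kind": "Internal"},
]
report = audit_traces(spans)
print(report)
```

A producer-only trace is ambiguous on its own: it can mean the trace_id was dropped, or that the worker has not woken yet. Paired with an orphan worker trace that started shortly afterwards, it points at a dropped trace_id.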
Failure mode two: the silent sleep
The second failure is worse. The trace_id is propagated correctly. The wake-up signal is sent. The signal is *lost in transit*: the network drops a packet, the message broker has a partition, or the destination queue is full and the producer's retries run out. The worker is asleep, blocked in its event wait. It will stay asleep for a long time.
In the polling architecture, this failure is detectable by the absence of the next poll. The polling worker *must* emit a heartbeat. The callback worker emits nothing.
This is the architectural failure mode the previous chapter flagged: the new architecture's failure is *silence*. The fix is the deadline span.
A deadline span is a timer-based span that opens when the producer publishes the job and closes when the worker emits a completion span. If the worker has not emitted a completion span by the deadline, the deadline span *itself* closes with an error status. The deadline span turns "the worker has been asleep for too long" into an observable, reportable event.
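The implementation is a few lines with the OpenTelemetry Python API:
import threading
import time
from opentelemetry import trace
from opentelemetry.trace import SpanKind, Status, StatusCode

tracer = trace.get_tracer(__name__)
MAX_JOB_DURATION = 15 * 60     # seconds; an assumed budget, tune per workload
completed = threading.Event()  # assumed: set when the completion span arrives
now = time.time_ns()           # start_time is nanoseconds since the epoch
deadline = tracer.start_span(
    "worker.deadline",
    kind=SpanKind.INTERNAL,
    start_time=now,
    attributes={"deadline_at": now + MAX_JOB_DURATION * 1_000_000_000},
)
# Wait for completion; only a timeout marks the deadline span as an error.
if not completed.wait(timeout=MAX_JOB_DURATION):
    deadline.set_status(Status(StatusCode.ERROR, "worker did not complete in time"))
deadline.end()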
This is not optional instrumentation. It is the *backstop* that converts the architectural failure mode (silence) into an observable failure mode (an error span). Without the deadline span, the system is observability-inferior to the polling system it replaced. With the deadline span, it is observability-superior.
Imagine your worker has been asleep for 22 minutes
Imagine your callback-driven worker has been asleep for 22 minutes. The producer pushed the wake-up signal at 03:05:00. The signal was lost. The worker is in a kernel wait, blocked on epoll_wait, with no events queued. It is not broken. It is not slow. It is not running. It is *waiting for an event that will never come*.
In the polling architecture, you would know this. The polling worker is *also* silent, but the polling worker is silent because the polling cycle is set to a 60-second interval. The silence is bounded. The system has a clock on it.
In the callback architecture, you need to *install* that clock yourself. The deadline span is that clock. Set it to 1.5× the maximum expected job time. When it fires, it is the system telling you, with full trace context, that the wake-up did not happen.
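The clock can be sketched as a small watchdog that tracks one deadline per published job. This is a standard-library illustration of the bookkeeping only, not a tracer integration. The class `DeadlineWatchdog` and its method names are hypothetical, and the 10-minute maximum job time is an assumed number chosen so the 1.5× rule yields a 15-minute deadline.

```python
from dataclasses import dataclass, field

DEADLINE_FACTOR = 1.5  # from the text: 1.5x the maximum expected job time


@dataclass
class DeadlineWatchdog:
    max_job_seconds: float
    pending: dict = field(default_factory=dict)  # trace_id -> deadline (seconds)

    def on_publish(self, trace_id: str, published_at: float) -> None:
        """Open a deadline when the producer publishes the job."""
        self.pending[trace_id] = published_at + DEADLINE_FACTOR * self.max_job_seconds

    def on_completion(self, trace_id: str) -> None:
        """Close the deadline cleanly when the completion span arrives."""
        self.pending.pop(trace_id, None)

    def sweep(self, now: float) -> list:
        """Return trace_ids whose deadline has passed and stop tracking them.

        In a real system each returned trace_id would close its deadline
        span with an error status.
        """
        expired = sorted(t for t, due in self.pending.items() if now >= due)
        for trace_id in expired:
            del self.pending[trace_id]
        return expired


# Assumed: jobs take at most 10 minutes, so the deadline is 15 minutes.
watchdog = DeadlineWatchdog(max_job_seconds=600)
watchdog.on_publish("abc", published_at=0)  # wake-up sent, then lost
watchdog.on_publish("def", published_at=0)  # wake-up delivered
watchdog.on_completion("def")               # this worker woke and finished
early = watchdog.sweep(now=14 * 60)         # 14 minutes in: nothing expired yet
late = watchdog.sweep(now=22 * 60)          # 22 minutes in: the lost job fires
print(early, late)
```

Passing `now` in explicitly, instead of reading the system clock inside `sweep`, keeps the sketch deterministic and easy to test. Because each deadline is keyed by trace_id, the alert that fires carries the context of the job that never woke its worker.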
The system has, in effect, become its own watchdog. The trace is the watchdog. The deadline is the bark.
The architectural promise, with the bill
The previous chapter made the claim: a callback-driven worker is observability-superior to a polling worker, because silence is a signal you can trust. That claim is true *if* the two conditions above are met.
If the trace_id is propagated through the wake-up signal, the producer and consumer are stitched into one trace. The full-link timing is recoverable.
If a deadline span guards the worker, silence is converted to an observable event. The full-link *failure* is recoverable.
Miss either, and the new architecture is observability-neutral or observability-inferior. The latency win remains, but the observability win evaporates. You have replaced a polling worker that emits constant noise with a callback worker that emits constant *silence* — and the new silence is harder to debug, because the absence of a signal is harder to investigate than the presence of a wrong signal.
This is the engineering bill the migration imposes. The architectural case for callback-first is strong. The implementation case requires the OpenTelemetry model to be applied at the *wake-up boundary*, not just inside the worker. If the wake-up is a thin notification with no trace context, the migration is a latency win and a debugging loss.
What the migration will surface next
The deadline span solves the second failure mode. It does not solve a third failure mode that the migration will eventually surface: *backpressure*. When a callback worker is fast and the queue behind it is deep, the worker is woken again the moment it finishes, once per queued job. The system is no longer polling, but it is *thrashing*: the worker is running, completing, being woken, running, completing. The OpenTelemetry model will show this as a forest of consumer spans with no idle time. The fix is backpressure: a slow-down signal from the consumer to the producer, expressed as a span attribute or a queue depth. That migration is the next one. It is not the one to do first.
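The "no idle time" symptom can be checked from span timestamps alone. The sketch below assumes one worker's consumer spans are available as non-overlapping (start, end) pairs in seconds; the function name `idle_fraction` is hypothetical, and what counts as too little idle time is left to the reader.

```python
def idle_fraction(spans: list) -> float:
    """Fraction of the observed window in which no consumer span was running.

    `spans` is a list of (start_seconds, end_seconds) tuples for one worker,
    assumed not to overlap.
    """
    if not spans:
        return 1.0
    ordered = sorted(spans)
    window = ordered[-1][1] - ordered[0][0]
    if window <= 0:
        return 0.0
    busy = sum(end - start for start, end in ordered)
    return max(0.0, (window - busy) / window)


thrashing = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]  # woken the instant it finishes
healthy = [(0.0, 1.0), (3.0, 4.0)]                # two seconds idle between runs
print(idle_fraction(thrashing), idle_fraction(healthy))
```

An idle fraction near zero over a long window is the span-level signature of the thrashing described above.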
The first migration is the one this chapter set out to validate: replace poll with callback, and make sure the trace follows the baton across every handoff. The first migration is observability. The second will be throughput.
The latency number is the headline of the migration. The trace continuity is the part of the migration you will be debugging at 03:14 if you get it wrong. The deadline span is the difference between a 03:14 page that ends in five minutes and a 03:14 page that ends in five hours.
---
References:
- OpenTelemetry, Traces: the trace, span, and span-kind model this chapter depends on
- OpenTelemetry, Context Propagation: the mechanism that makes trace_id survive transport
- The Node.js Event Loop: the kernel-level wait that makes the callback architecture work
- Envoy xDS API: production-scale example of a push-based, trace-stitched control plane