Tracing Is Air Traffic Control for Your Services
Distributed tracing is not "better logging." It is the difference between knowing that a plane landed and being able to replay its exact path across every airspace it crossed.
Key Takeaways
- Monitoring tells you *that* something broke; tracing tells you *where in the journey* it broke.
- A single user request in a modern system can cross 5–15 services; without tracing, the boundary between services is the boundary of your knowledge.
- The shift from metrics-first to traces-first is the same shift air traffic control made from "is the runway clear?" to "where is *this flight* right now?"
- Traces are only useful when three properties hold: context propagation across every hop, sampling that preserves the failures, and a backend that lets you slice by any field.
Imagine a flight from San Francisco to Tokyo. It files a flight plan, takes off, crosses Oakland Center, hands off to Seattle Center, then Anchorage, then Tokyo Control. About nine hours into the flight, the pilot reports a hydraulic warning. The passengers are fine. The plane will land safely. But someone, somewhere, will eventually need to answer a question: *which handoff was responsible?* Was it a maintenance issue that should have been caught in San Francisco? A sensor that failed mid-Pacific? A coordination error between Anchorage and Tokyo? You cannot answer this from a single radar ping at the destination. You need the entire flight's path, with timestamps at each handoff, replayable on demand.
That is what a distributed trace is. It is the flight data recorder stitched together with the air traffic control logs, so that any request — any "flight" through your system — can be replayed from takeoff to landing with the timing of every handoff intact.
The Blind Spot Monitoring Cannot Reach
Most teams today run their services the way airports ran in the 1950s: a clipboard at each tower, a stack of paper at the gate, a telephone call when something looked wrong. You had runway status ("is the runway clear?"), weather ("is it raining?"), and departure counts ("how many planes took off?"). If a passenger missed a connection, you could sometimes piece it together from the logs of two adjacent airports. If three airlines were involved, you usually could not.
That is what dashboards built from metrics and logs give you. They are excellent at *aggregate* health. "Error rate is up 0.4%." "p99 latency is 1.8 seconds." "Service B restarted twice in the last hour." Each of these is a runway-status reading. None of them tells you about *this user's request* — the one that failed at 14:32:07 because Service D timed out calling Service G, which was itself slow because Service K had a noisy neighbor on a shared node.
I started my career believing metrics would be enough. They are not. The reason is structural, not stylistic: in a system with N services, the number of possible paths a single request can take can grow exponentially with N. Aggregate metrics collapse those paths into a single number. The interesting information (which path, in what order, with what latency contribution) is destroyed by the act of aggregation itself. You cannot recover what you did not record.
This is the cognitive shift tracing forces on you. Stop asking *is the system healthy?* and start asking *what did this particular request actually do?* The first question is a property of the fleet. The second is a property of a single flight.
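The claim that aggregation destroys path information can be made concrete with a small sketch. The service names, paths, and latencies below are invented for illustration. The two scenarios produce identical fleet-level numbers, yet the slow path is a different one in each.

```python
from statistics import mean

# Hypothetical per-request records: (path through services, total latency in ms).
scenario_a = [
    (("A", "B", "D"), 100),
    (("A", "B", "D"), 100),
    (("A", "C", "D"), 300),
    (("A", "C", "D"), 300),
]
scenario_b = [
    (("A", "B", "D"), 300),
    (("A", "B", "D"), 300),
    (("A", "C", "D"), 100),
    (("A", "C", "D"), 100),
]


def aggregate(requests):
    """Collapse requests into fleet-level numbers, as a metrics dashboard would."""
    latencies = sorted(latency for _, latency in requests)
    return {
        "count": len(latencies),
        "mean_ms": mean(latencies),
        "max_ms": latencies[-1],
    }


def by_path(requests):
    """Mean latency per path; only possible because the path was recorded."""
    grouped = {}
    for path, latency in requests:
        grouped.setdefault(path, []).append(latency)
    return {path: mean(values) for path, values in grouped.items()}


# The dashboard view cannot tell the two scenarios apart...
print(aggregate(scenario_a) == aggregate(scenario_b))
# ...but the per-path view can, because it kept the route each request took.
print(by_path(scenario_a))
print(by_path(scenario_b))
```

Once the path is dropped at recording time, no amount of later analysis on the aggregates can tell you whether the B route or the C route was the slow one.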
The Trace as a Flight Recorder
A trace is a tree. The root is the entry point: the edge, the API gateway, the user click. Each node in the tree is a span: a unit of work with a name, a start time, an end time, and a set of attributes (the "what was this doing" metadata). When Service A calls Service B, Service B's span becomes a child of Service A's. When Service B calls Services C and D in parallel, those spans become siblings. The result, when visualized, looks less like a flow chart and more like a horizontal timeline: a Gantt-style chart usually called a waterfall view, and sometimes loosely a "flame graph."
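As a rough illustration of that structure, here is a minimal sketch of a span tree in plain Python. The `Span` class, the `render` helper, the service names, and the timings are all hypothetical; real tracing libraries use their own data model. The example only shows the parent, child, and sibling relationships described above, along with a crude text timeline.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """A unit of work: a name, start and end times, attributes, and child spans."""
    name: str
    start_ms: int
    end_ms: int
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms

    def child(self, name, start_ms, end_ms, **attributes):
        """Record a downstream call as a child span and return it."""
        span = Span(name, start_ms, end_ms, attributes)
        self.children.append(span)
        return span


def render(span, depth=0, scale=10):
    """Return one text row per span, with each bar offset by its start time."""
    bar = " " * (span.start_ms // scale) + "#" * max(1, span.duration_ms // scale)
    label = "  " * depth + span.name
    rows = [f"{label:<16}{bar}"]
    for sub in sorted(span.children, key=lambda s: s.start_ms):
        rows.extend(render(sub, depth + 1, scale))
    return rows


# Hypothetical request: gateway -> A -> B, then B calls C and D in parallel.
root = Span("gateway", 0, 200, {"route": "/checkout"})
a = root.child("service-a", 10, 190)
b = a.child("service-b", 20, 180)
c = b.child("service-c", 30, 90)   # sibling of d: same parent, overlapping time
d = b.child("service-d", 30, 170)

print("\n".join(render(root)))

# Because the tree keeps per-hop timing, "which call made B slow?" is a lookup.
slowest = max(b.children, key=lambda s: s.duration_ms)
print(slowest.name, slowest.duration_ms)
```

Keeping the children on the parent makes the timeline a simple depth-first walk, and a question such as "which downstream call dominated Service B's latency?" becomes a comparison among siblings rather than a correlation exercise across separate logs.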
The power of this structure is not the visualization. It is the *context propagation*. When Ser