A Stethoscope for Your Services: A Beginner's Guide to Distributed Tracing
Distributed tracing is the practice of following a single request as it bounces across the services that handle it — the most reliable way to turn a vague complaint like "the app is slow" into a specific sentence like "the recommendation service added 1.4 seconds."
Key Takeaways
- A trace is the full story of one request; a span is one chapter inside that story.
- Distributed tracing exists because modern applications are not one program — they are dozens of small programs talking to each other, and traditional logs cannot reconstruct what happened between them.
- The single most useful real-world analogy is a package with a tracking number: every scan writes a timestamp, every handler leaves a record, and you can replay the entire journey after the fact.
- Three practical habits matter more than any tool choice: name your spans consistently, capture the minimum useful context, and use sampling deliberately instead of capturing everything.
- You do not need to instrument every service on day one — start with the request path users actually feel, then expand outward.
Start with the symptom
You get paged at 9:14 PM. A user reports that checkout is taking eight seconds. Your dashboard says the API is healthy. The database CPU is calm. The cache hit rate is fine. Nothing on any single graph looks wrong. And yet the user is right — something, somewhere, is slow.
If you have ever stared at a wall of dashboards during an incident like this and felt that you were looking at the wrong layer of reality, you already understand why distributed tracing was invented. Dashboards summarize what is true on average across thousands of requests. When one specific request goes wrong, averages lie.
What you actually want, in that moment, is the diary of one request. You want to know which service it entered first, what it called next, how long each step took, and where the time went. That diary is a trace. The skill of reading and producing those diaries is distributed tracing.
Here is the blunt version of the conclusion before we get into the mechanism: distributed tracing is not a luxury observability tool for sophisticated teams. It is the one observability signal designed for systems built from dozens of small services rather than a single monolith. Logs are still essential. Metrics are still essential. But neither of them answers "what happened to this one user's request?" Tracing is the tool that fills that gap.
The one analogy you actually need
Imagine you ship a physical package across the country. The sender puts a tracking number on it. From that moment, every person who touches the package scans it: the warehouse that receives it, the truck that loads it, the sorting facility that routes it, the driver who delivers it. Each scan records three things: where, when, and (sometimes) what condition the package was in.
If the recipient calls and says "my package never arrived," you do not panic. You pull up the tracking number and read the chain of scans. You see that it arrived in the regional hub at 3:00 PM, then sat on a conveyor belt for 14 hours, then got rerouted, then was finally delivered two days late. You did not need to be there for any of it. The scans, taken together, are a faithful record of the journey.
That is exactly what a distributed trace is. The package is one user request, say "load the homepage." The tracking number is the trace ID, a unique identifier generated the moment the request enters your system. Every scan is a span: a record of one unit of work with a start time, an end time, a name, and an ID. And every span except the first (the root span) carries its parent's span ID, which is how you reconstruct the journey.
When the tracking number shows up at every handoff, the chain is complete. When one handler forgets to scan, you have a gap in the story. The discipline of distributed tracing is, fundamentally, the discipline of scanning the package at every handoff.
flowchart TD
U["User clicks 'Checkout'"] --> A[API Gateway<br/>span: 12ms]
A --> B[Auth Service<br/>span: 34ms]
A --> C[Cart Service<br/>span: 89ms]
C --> D[Inventory Service<br/>span: 140ms]
C --> E[Pricing Service<br/>span: 22ms]
A --> F[Payment Service<br/>span: 1,420ms]
F --> G[Bank API<br/>span: 1,310ms]
Each box is one span. The arrows are the parent–child relationships. The number in milliseconds is illustrative: roughly how long that piece of work took. With a trace like this in hand, you do not need a war room to find where a slow checkout spent its time. You point at the payment service and the bank API call beneath it, and you know where to look.
Spans, traces, and context: the vocabulary
The terms sound technical, but they map onto ordinary experience.
A trace is the full story of one request. It has a unique trace ID and contains every span associated with that request.
A span is one unit of work inside the trace. Every span has a name, a start time, an end time, and an ID. Spans nest: the parent span "checkout" might contain child spans for "validate cart," "charge card," and "send confirmation email."
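To make the vocabulary concrete, here is a minimal sketch of spans as plain data, using only the Python standard library. The `Span` class, the `render_tree` helper, the short span IDs, and the timings are all hypothetical and exist only for illustration; a real tracing SDK defines its own richer types.

```python
from __future__ import annotations

import secrets
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in one request
    span_id: str              # unique to this unit of work
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def render_tree(spans: list[Span]) -> list[str]:
    """Return one indented line per span, parents before children."""
    children: dict[Optional[str], list[Span]] = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in sorted(children[parent_id], key=lambda s: s.start_ms):
            lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)  # start from the root, whose parent_id is None
    return lines


trace_id = secrets.token_hex(16)  # 32 hex characters, made up for this example
spans = [
    Span(trace_id, "a1", None, "checkout", 0, 300),
    Span(trace_id, "b2", "a1", "validate cart", 5, 45),
    Span(trace_id, "c3", "a1", "charge card", 50, 250),
    Span(trace_id, "d4", "a1", "send confirmation email", 255, 295),
]
tree = render_tree(spans)
print("\n".join(tree))
```

The only thing linking the spans is the pair of IDs: the shared trace ID says "same request," and the parent ID says "called by." Everything a trace viewer draws is derived from those two fields plus the timestamps.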
Context is the small bag of facts that travels with the request across service boundaries. At minimum, it holds the trace ID and the ID of the calling span, so the next service can attach its own span to the right parent. Often it also carries things like the user ID, the request source, or feature flags. Context propagation is the technical term for "passing the tracking number to the next handler."
You do not need to memorize any of this to use tracing. You need to remember one rule: when a service receives a request, it should know which trace it belongs to, and when it makes a downstream call, it should pass that information along.
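That rule can be sketched in a few lines. The example below assumes the W3C Trace Context `traceparent` HTTP header, which has the form `version-traceid-parentid-flags` in lowercase hex. The `extract` and `inject` function names are my own, header keys are assumed to be lowercased already, and the validation is simplified (the specification also rejects all-zero IDs, for instance). In practice an instrumentation library does this for you.

```python
import re
import secrets

# version (2 hex) - trace ID (32 hex) - parent span ID (16 hex) - flags (2 hex)
TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def extract(headers):
    """Return (trace_id, parent_span_id); start a new trace if none arrived."""
    match = TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        # No usable context: this service is the entry point of a new trace.
        return secrets.token_hex(16), None
    return match.group(2), match.group(3)


def inject(trace_id, span_id, sampled=True):
    """Build the header to attach to a downstream call."""
    flags = "01" if sampled else "00"
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}


# A request arrives carrying context from the upstream service.
incoming = {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
}
trace_id, parent_span_id = extract(incoming)

# This service does its own work under a new span ID, then passes the
# same trace ID along with its own span ID as the new parent.
my_span_id = secrets.token_hex(8)
outgoing = inject(trace_id, my_span_id)
print(outgoing["traceparent"])
```

Note what changes and what does not: the trace ID is copied through untouched, while the parent field is replaced with the current service's span ID at every hop.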
Why logs and metrics alone are not enough
A common beginner question is "we already have logs and metrics — why do we need a third thing?" The honest answer is that logs, metrics, and traces answer different questions.
Metrics tell you what is true on average across many requests. They are excellent for dashboards and alerts. They are terrible for understanding one specific request, because the signal is averaged away.
Logs tell you what individual programs said at individual moments. They are excellent for debugging a single service in isolation. They are terrible for understanding a request that hops across services, because the log lines from different services have no inherent connection — they only share a timestamp and a user ID, which is not always enough to reconstruct the journey.
Traces are the only observability signal that explicitly stitches the journey together. They exist precisely because the others could not. If you have ever tried to debug a slow request by joining logs across services by timestamp and given up after twenty minutes, you already understand the gap.
The three habits that matter most
Backends like Jaeger and Zipkin are roughly equivalent in capability for a beginner, and the OpenTelemetry Collector can feed either of them. What separates a team that gets value from tracing from a team whose tracing setup collects dust is a small number of disciplined habits. Three matter most.
1. Name your spans consistently
A span named "handler" tells you almost nothing. A span named "GET /api/users/:id" tells you exactly what the code was doing. Adopt a convention early — usually the route, the function, or the external system being called — and apply it everywhere. The cost of bad names compounds: six months in, when you are looking at 50,000 spans, vague names become impossible to search.
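One way to enforce a naming convention is to derive the span name from the route template rather than the raw URL, so that `/api/users/42` and `/api/users/43` land under the same name. Most frameworks can hand you the route template directly; the sketch below is a fallback for when they cannot. The `span_name` helper and its rule (all-digit or UUID segments count as IDs) are assumptions made for illustration.

```python
import re

# Assumption: path segments that are all digits or UUIDs are identifiers.
# Adjust the pattern to match the ID formats your own routes use.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})$"
)


def span_name(method: str, path: str) -> str:
    """Build a low-cardinality span name such as 'GET /api/users/:id'."""
    path_only = path.split("?")[0]  # drop the query string
    segments = [
        ":id" if _ID_SEGMENT.match(segment) else segment
        for segment in path_only.split("/")
    ]
    return f"{method.upper()} {'/'.join(segments)}"


print(span_name("get", "/api/users/42"))
print(span_name("GET", "/api/orders/123e4567-e89b-12d3-a456-426614174000?expand=items"))
```

The actual user or order ID is still worth keeping, but as an attribute on the span rather than as part of its name.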
2. Capture the minimum useful context
There is a real temptation to dump everything you know into every span: full request bodies, full response payloads, every environment variable. Do not. A span with a hundred fields of context is slow to produce, expensive to store, and impossible to scan. Capture what you would need to debug: the user ID, the order ID, the cache key, the error message. Leave the rest to logs.
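A simple way to hold that line is an allowlist: decide once which fields are worth attaching to a span and drop everything else. The sketch below is illustrative only; the attribute names and the length cap are hypothetical choices, not a standard.

```python
from __future__ import annotations

# Hypothetical attribute names: the handful of fields worth debugging with.
ALLOWED_ATTRIBUTES = {"user.id", "order.id", "cache.key", "error.message"}
MAX_VALUE_LENGTH = 256  # arbitrary cap so one huge value cannot bloat a span


def span_attributes(candidates: dict[str, object]) -> dict[str, str]:
    """Keep only allowlisted, non-empty fields, truncated to a safe length."""
    kept: dict[str, str] = {}
    for key, value in candidates.items():
        if key in ALLOWED_ATTRIBUTES and value is not None:
            kept[key] = str(value)[:MAX_VALUE_LENGTH]
    return kept


everything_we_know = {
    "user.id": 8812,
    "order.id": "ord_551",
    "request.body": '{"card": "...", "items": ["..."]}',  # belongs in logs, if anywhere
    "env.PATH": "/usr/bin:/bin",
    "error.message": None,
}
attributes = span_attributes(everything_we_know)
print(attributes)
```

An allowlist also makes it harder to leak sensitive payloads into your tracing backend by accident, because new fields are excluded until someone adds them on purpose.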
3. Sample deliberately
Capturing every trace from every request is the default assumption, and it is wrong. In a busy system, that is too much data and too much cost. Most teams sample — keep 100% of error traces and 1% of successful ones, or use rate-limiting at the edge, or sample by user. The point is not to choose a clever sampling strategy; the point is to choose one, on purpose, and document it.
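As an illustration of "keep all errors, keep 1% of successes," the sketch below makes the decision from the trace ID itself, so every service that sees the same trace reaches the same answer without coordinating. It assumes trace IDs are random 32-character hex strings; the `keep_trace` function is hypothetical. Note that keeping every error trace means deciding after the request has finished, which in real deployments usually happens in a collector rather than in each service.

```python
def keep_trace(trace_id: str, had_error: bool, success_rate: float = 0.01) -> bool:
    """Decide whether to keep a finished trace."""
    if had_error:
        return True  # error traces are always kept
    # Treat the low 64 bits of the trace ID as a uniform number in [0, 1].
    bucket = int(trace_id[-16:], 16) / 2**64
    return bucket < success_rate


print(keep_trace("4bf92f3577b34da6a3ce929d0e0e4736", had_error=True))
print(keep_trace("0" * 32, had_error=False))
print(keep_trace("f" * 32, had_error=False))
```

Because the decision is a pure function of the trace ID and the rate, writing the rate down in one place is the whole of the documentation the strategy needs.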
A practical starting path
If you are new to distributed tracing, here is the path I would take, in order.
Start with one service that talks to others. A monolithic app with no outbound calls has nothing to trace. Pick a service that fans out — the one that calls the database, the cache, and one external API — and instrument it first. You will get useful traces within an hour.
Use OpenTelemetry, not a vendor SDK. OpenTelemetry is the open standard that nearly every major tracing backend supports. If you start with a vendor's proprietary library, you will have to rewrite your instrumentation the first time you change vendors. If you start with OpenTelemetry, the SDK and the backend are independent choices.
Instrument the framework, not the code. Most modern web frameworks (Express, Spring, Django, FastAPI, Rails) have OpenTelemetry instrumentation libraries that automatically create spans for every request and for outbound HTTP or database calls. Turn that instrumentation on before you start adding custom spans. You will get most of the value for a small fraction of the effort.
Add custom spans only at meaningful boundaries. A custom span should mark a unit of work that a human would want to look at: "build the recommendation list," "enrich with user history," "apply business rules." It should not mark "increment i by one." When in doubt, leave it to the framework.
Make the trace ID visible. Once you have traces flowing, make sure every log line in every service includes the trace ID. The day you connect "the slow request" to "the error log line" through that shared ID, you will never go back.
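With Python's standard `logging` module, one way to do this is a filter that stamps the current trace ID onto every record. The sketch below assumes the trace ID lives in a context variable named `current_trace_id` that your request-handling code sets; that name, the logger name, and the log format are illustrative choices. The handler writes to an in-memory buffer only so the result can be inspected here.

```python
import contextvars
import io
import logging

# Assumption: request-handling code sets this when a request arrives.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only annotate it


stream = io.StringIO()  # in real use this would be stderr or a file
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
)
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example's output in one place

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("payment declined")

output = stream.getvalue()
print(output, end="")
```

Once every service logs in this shape, searching your log store for one trace ID returns that request's log lines from every service it touched.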
What tracing will not do for you
A few honest limitations are worth stating up front, because they prevent disappointment later.
Tracing will not tell you whether your system is healthy in aggregate. That is metrics' job. Tracing will not replace logs for deep debugging inside a single function. That is logs' job. And tracing will not save you if you do not look at the traces — a tracing backend nobody queries is just an expensive log pipeline.
Tracing also has a sample-size problem. If you only sample 1% of successful requests, you will capture very few of the rare edge cases that affect 0.1% of requests: out of a million requests, about a thousand hit the edge case, and only about ten of those end up in your sample. The answer is not to sample more. The answer is to sample errors at 100% and to add targeted instrumentation when you know what edge case you are hunting for.
The mental shift
Here is the change in thinking that tracing produces, and it is the reason teams that adopt it rarely go back.
Before tracing, you debug by service: "the auth service is slow." After tracing, you debug by request: "this user's request was slow, and here is exactly where the time went." The first framing forces you to investigate services that may not even be involved. The second framing points you straight at the cause.
That is the entire pitch for distributed tracing. It does not give you magic visibility into your system. It gives you the right unit of observation — one request — and the discipline to record its journey honestly. Everything else is tooling.