
Distributed Tracing: What It Is and How to Use It to Debug Systems

How to use traces to reconstruct a flow across services without getting lost in tools, jargon, or pretty visuals with no value.

Andrews Ribeiro

Founder & Engineer

The problem

When the system is small, investigation is usually more direct.

There is one application, some logs, a few dependencies, and a relatively short path between request and response.

But when the flow goes through:

  • a gateway
  • an authentication service
  • a checkout service
  • a queue
  • a worker
  • an external provider

debugging changes its nature.

You stop having “one error in one place.”

You start having a story spread across many places.

And then harder questions appear:

  • where did this flow start degrading?
  • which stage consumed most of the time?
  • did this error start here or was it propagated from another service?
  • is the problem general or only in one branch of the flow?

With only scattered logs and aggregated metrics, those questions can become too slow to answer.

Mental model

Think about it like this:

distributed tracing is a way to follow the journey of the same operation across several parts of the system

That is the core idea.

You are not looking at one isolated server.

You are looking at a trajectory.

Usually that appears as:

  • one trace representing the whole operation
  • spans representing internal stages
  • duration and status for each stage
  • parent and child relationships between calls

Nothing too mysterious.

Tracing exists to reconstruct the path of a distributed execution.
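The structure above — one trace, spans for each stage, durations, and parent/child links — can be sketched as plain data. The field names here are illustrative, not any specific vendor's or standard's schema:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative model: a trace is a tree of spans sharing one trace_id.
@dataclass
class Span:
    name: str                 # the stage, e.g. "auth: verify token"
    trace_id: str             # same value for every stage of one operation
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    start_ms: float
    end_ms: float
    status: str = "ok"

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms

# One operation crossing services: gateway -> auth -> checkout -> db
trace = [
    Span("gateway: POST /checkout", "t1", "s1", None, 0.0, 480.0),
    Span("auth: verify token",      "t1", "s2", "s1", 5.0, 30.0),
    Span("checkout: create order",  "t1", "s3", "s1", 35.0, 470.0),
    Span("db: insert order",        "t1", "s4", "s3", 40.0, 420.0),
]

root = next(s for s in trace if s.parent_id is None)
print(root.name, root.duration_ms)  # gateway: POST /checkout 480.0
```

Real tracing SDKs add context propagation, sampling, and export, but the data model they carry is essentially this tree.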

Breaking it down

Logs, metrics, and traces do not do the same thing

This is a common mistake.

Many people learn observability as if it were one bundled thing.

It is not.

A simple way to think about it:

  • metrics show aggregate patterns
  • logs show events with context
  • traces show the path of one specific execution

Example:

  • a metric may show latency increasing
  • a log may record a timeout in one dependency
  • a trace may show where in the flow that time was spent and how that delay affected the rest

Once you understand that role, you stop expecting tracing to do what it never promised.

Tracing shines when the real question is “where in the flow?”

That is the classic use.

If the main problem is:

  • which stage is slow
  • which service introduced the error
  • where the chain started failing
  • why one specific request got much worse than normal

tracing helps a lot.

If the question is totally local and simple, logs or metrics may already be enough.

So:

it is not a tool for everything.

It is especially good for trajectory questions.

A span is just one stage with a start and an end

Many people get stuck on that word.

They do not need to.

Think about a span as:

  • one piece of work
  • with a start
  • with an end
  • with a duration
  • and with some context attached

Examples:

  • an HTTP call to another service
  • one database query
  • a publish to a queue
  • one important internal processing step

When you see many linked spans together, the whole operation becomes much clearer.
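A span in that sense can be sketched as a context manager wrapped around one piece of work. This is a toy recorder, not a real tracing SDK; libraries like OpenTelemetry do the equivalent (and propagate context across services) for you:

```python
import time
from contextlib import contextmanager

spans = []  # in a real system these would be exported to a tracing backend

@contextmanager
def span(name, **context):
    # One piece of work: a start, an end, a duration, attached context.
    start = time.perf_counter()
    status = "ok"
    try:
        yield
    except Exception:
        status = "error"
        raise
    finally:
        spans.append({
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
            "status": status,
            **context,
        })

with span("db: fetch order", order_id="o-42"):
    time.sleep(0.01)  # stand-in for one database query

print(spans[0]["name"], spans[0]["status"])  # db: fetch order ok
```

The `order_id` attribute is the "some context attached" part: it is what later lets you find the one trace that matches the one complaint.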

Tracing helps separate the real bottleneck from intuitive suspicion

That gain is huge.

In distributed systems, teams often have instincts like:

  • “I think it is the database”
  • “it looks like the queue”
  • “it must be the external gateway”

Sometimes they are right.

Sometimes they are not.

Tracing reduces that guesswork because it shows:

  • the time spent at each hop
  • which dependency failed
  • whether there were retries
  • whether the problem appeared early or late in the flow

It does not solve everything alone, but it shortens the search a lot.
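Once spans carry durations, "which hop ate the time" becomes a mechanical question. A minimal sketch over hypothetical span data (real traces nest, so you would normally look at self-time per span, but the idea is the same):

```python
# Hypothetical spans from one slow request; durations in milliseconds.
spans = [
    {"name": "gateway",          "duration_ms": 12,  "retries": 0},
    {"name": "auth",             "duration_ms": 25,  "retries": 0},
    {"name": "checkout",         "duration_ms": 40,  "retries": 0},
    {"name": "payment provider", "duration_ms": 910, "retries": 2},
]

bottleneck = max(spans, key=lambda s: s["duration_ms"])
total = sum(s["duration_ms"] for s in spans)

print(f"{bottleneck['name']}: {bottleneck['duration_ms']} ms "
      f"({bottleneck['duration_ms'] / total:.0%} of {total} ms), "
      f"retries={bottleneck['retries']}")
# → payment provider: 910 ms (92% of 987 ms), retries=2
```

The retry count matters here: two retries against an external dependency is a very different diagnosis than one slow call.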

Bad tracing also exists

This is worth saying.

Not every pretty trace helps.

Common problems:

  • too many spans with no value
  • confusing span names
  • no useful context
  • important boundaries without instrumentation
  • difficulty connecting the trace with logs and errors

Useful tracing is not the one that generates the prettiest graph.

It is the one that helps someone answer a real operational question.

One trace rarely closes the diagnosis by itself

This point matters too.

A trace can show:

  • where latency happened
  • where the operation failed
  • which path it followed

But many times you still need:

  • logs for more detailed context
  • metrics to understand the scale of the problem
  • code reading to validate the hypothesis

Tracing does not replace the rest.

It connects the rest.
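One common way tracing connects the rest is by stamping the trace id on every log line, so logs can later be joined to the trace that produced them. A minimal sketch with Python's standard-library logger; the `trace_id` field name is a convention, not a standard:

```python
import io
import logging

# Attach the current request's trace id to every log record via a filter.
class TraceContextFilter(logging.Filter):
    def __init__(self, trace_id):
        super().__init__()
        self.trace_id = trace_id

    def filter(self, record):
        record.trace_id = self.trace_id
        return True

stream = io.StringIO()  # stand-in for stdout / a log shipper
logger = logging.getLogger("checkout")
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(trace_id)s %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.addFilter(TraceContextFilter("t-7f3a"))
logger.setLevel(logging.INFO)

# Any log emitted during this request can now be matched to its trace.
logger.info("timeout calling payment provider")
print(stream.getvalue().strip())  # → t-7f3a INFO timeout calling payment provider
```

With that in place, "find the trace for this error log" and "find the logs for this slow trace" are both one search by `trace_id`.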

In interviews, the best answer is about when and why to use it

Instead of dumping the definition, it is usually better to answer like this:

  • when I would use tracing
  • which question I would try to answer
  • how it complements logs and metrics
  • which decision it unlocks

That sounds much more mature than repeating terminology.

Simple example

Imagine checkout got slower after a recent change.

You already know from metrics that latency increased.

But you still do not know where.

Weak answer:

“I would open tracing to see what happened.”

That is still too generic.

Better answer:

“Because the flow crosses several services and one external dependency, I would use tracing to locate where latency grew inside the request journey. The question is not only ‘is it slow?’, because the metric already answered that. The question is ‘in which stage is the time being consumed, and did that start in our service, in the database, or in an external call?’ From there I would cross-reference the trace with the logs and errors in the suspect stage.”

That answer works better because it shows:

  • why tracing enters the picture
  • which question it answers
  • how it connects to the rest of the investigation

Common mistakes

  • treating tracing as a replacement for logs and metrics
  • talking about spans and traces without saying which question they answer
  • instrumenting everything without criteria and creating too much noise
  • looking at one isolated trace and treating that as proof of root cause
  • turning an interview answer into a vendor catalog

How a senior thinks

Engineers who are more mature with distributed systems often think like this:

“When the problem crosses several boundaries, I need a way to see the whole trajectory, not only isolated pieces of it.”

That is a very good lens.

Because it explains why tracing exists without mysticism.

Seniority here is not using pretty observability words.

It is knowing which tool reduces the most uncertainty for the question you have right now.

What the interviewer wants to see

When this topic appears in an interview, the evaluator is usually trying to understand whether you:

  • can differentiate the roles of logs, metrics, and traces
  • know when tracing actually helps
  • can use tracing to locate a bottleneck or failure in a distributed flow
  • avoid overly abstract talk
  • think of investigation as a composition of signals, not as one magic tool

A strong answer usually shows:

  1. what kind of system or problem calls for tracing
  2. which question tracing helps answer
  3. how it works with logs and metrics
  4. which practical decision it accelerates in debugging

If those elements appear, the answer is already much stronger.

Tracing does not exist to make observability look more sophisticated. It exists to make a distributed flow readable.

When the problem lives between services, looking only at each service in isolation almost always delays understanding.
