
Distributed Tracing: What It Is and How to Use It to Debug Systems

How to use traces to reconstruct a flow across services without getting lost in tools, jargon, or pretty visuals with no value.

Andrews Ribeiro

Founder & Engineer

The problem

When the system is small, investigation is usually more direct.

There is one application, some logs, a few dependencies, and a relatively short path between request and response.

But when the flow goes through:

  • a gateway
  • an authentication service
  • a checkout service
  • a queue
  • a worker
  • an external provider

debugging changes its nature.

You stop having “one error in one place.”

You start having a story spread across many places.

And then harder questions appear:

  • where did this flow start degrading?
  • which stage consumed most of the time?
  • did this error start here or was it propagated from another service?
  • is the problem general or only in one branch of the flow?

With only scattered logs and aggregated metrics, those questions can become too slow to answer.

Mental model

Think about it like this:

distributed tracing is a way to follow the journey of the same operation across several parts of the system

That is the core idea.

You are not looking at one isolated server.

You are looking at a trajectory.

Usually that appears as:

  • one trace representing the whole operation
  • spans representing internal stages
  • duration and status for each stage
  • parent and child relationships between calls

Nothing too mysterious.

Tracing exists to reconstruct the path of a distributed execution.
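The structure above — one trace, spans for each stage, durations, and parent/child links — can be sketched as plain data. The field names here are illustrative, not any specific vendor's or standard's schema:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative model: a trace is a tree of spans sharing one trace_id.
@dataclass
class Span:
    name: str                 # the stage, e.g. "auth: verify token"
    trace_id: str             # same value for every stage of one operation
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    start_ms: float
    end_ms: float
    status: str = "ok"

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms

# One operation crossing services: gateway -> auth -> checkout -> db
trace = [
    Span("gateway: POST /checkout", "t1", "s1", None, 0.0, 480.0),
    Span("auth: verify token",      "t1", "s2", "s1", 5.0, 30.0),
    Span("checkout: create order",  "t1", "s3", "s1", 35.0, 470.0),
    Span("db: insert order",        "t1", "s4", "s3", 40.0, 420.0),
]

root = next(s for s in trace if s.parent_id is None)
print(root.name, root.duration_ms)  # gateway: POST /checkout 480.0
```

Real tracing SDKs add context propagation, sampling, and export, but the data model they carry is essentially this tree.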

Breaking it down

Logs, metrics, and traces do not do the same thing

This is a common mistake.

Many people learn observability as if it were one bundled thing.

It is not.

A simple way to think about it:

  • metrics show aggregate patterns
  • logs show events with context
  • traces show the path of one specific execution

Example:

  • a metric may show latency increasing
  • a log may record a timeout in one dependency
  • a trace may show where in the flow that time was spent and how that delay affected the rest

Once you understand that role, you stop expecting tracing to do what it never promised.

Tracing shines when the real question is “where in the flow?”

That is the classic use.

If the main problem is:

  • which stage is slow
  • which service introduced the error
  • where the chain started failing
  • why one specific request got much worse than normal

tracing helps a lot.

If the question is totally local and simple, logs or metrics may already be enough.

So:

it is not a tool for everything.

It is especially good for trajectory questions.

A span is just one stage with a start and an end

Many people get stuck on that word.

They do not need to.

Think about a span as:

  • one piece of work
  • with a start
  • with an end
  • with a duration
  • and with some context attached

Examples:

  • an HTTP call to another service
  • one database query
  • a publish to a queue
  • one important internal processing step

When you see many linked spans together, the whole operation becomes much clearer.
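A span in that sense can be sketched as a context manager wrapped around one piece of work. This is a toy recorder, not a real tracing SDK; libraries like OpenTelemetry do the equivalent (and propagate context across services) for you:

```python
import time
from contextlib import contextmanager

spans = []  # in a real system these would be exported to a tracing backend

@contextmanager
def span(name, **context):
    # One piece of work: a start, an end, a duration, attached context.
    start = time.perf_counter()
    status = "ok"
    try:
        yield
    except Exception:
        status = "error"
        raise
    finally:
        spans.append({
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
            "status": status,
            **context,
        })

with span("db: fetch order", order_id="o-42"):
    time.sleep(0.01)  # stand-in for one database query

print(spans[0]["name"], spans[0]["status"])  # db: fetch order ok
```

The `order_id` attribute is the "some context attached" part: it is what later lets you find the one trace that matches the one complaint.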

Tracing helps separate the real bottleneck from intuitive suspicion

That gain is huge.

In distributed systems, teams often have instincts like:

  • “I think it is the database”
  • “it looks like the queue”
  • “it must be the external gateway”

Sometimes they are right.

Sometimes they are not.

Tracing reduces that guesswork because it shows:

  • the time spent at each hop
  • which dependency failed
  • whether there were retries
  • whether the problem appeared early or late in the flow

It does not solve everything alone, but it shortens the search a lot.
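Once spans carry durations, "which hop ate the time" becomes a mechanical question. A minimal sketch over hypothetical span data (real traces nest, so you would normally look at self-time per span, but the idea is the same):

```python
# Hypothetical spans from one slow request; durations in milliseconds.
spans = [
    {"name": "gateway",          "duration_ms": 12,  "retries": 0},
    {"name": "auth",             "duration_ms": 25,  "retries": 0},
    {"name": "checkout",         "duration_ms": 40,  "retries": 0},
    {"name": "payment provider", "duration_ms": 910, "retries": 2},
]

bottleneck = max(spans, key=lambda s: s["duration_ms"])
total = sum(s["duration_ms"] for s in spans)

print(f"{bottleneck['name']}: {bottleneck['duration_ms']} ms "
      f"({bottleneck['duration_ms'] / total:.0%} of {total} ms), "
      f"retries={bottleneck['retries']}")
# → payment provider: 910 ms (92% of 987 ms), retries=2
```

The retry count matters here: two retries against an external dependency is a very different diagnosis than one slow call.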

Bad tracing also exists

This is worth saying.

Not every pretty trace helps.

Common problems:

  • too many spans with no value
  • confusing span names
  • no useful context
  • important boundaries without instrumentation
  • difficulty connecting the trace with logs and errors

Useful tracing is not the one that generates the prettiest graph.

It is the one that helps someone answer a real operational question.

One trace rarely closes the diagnosis by itself

This point matters too.

A trace can show:

  • where latency happened
  • where the operation failed
  • which path it followed

But many times you still need:

  • logs for more detailed context
  • metrics to understand the scale of the problem
  • code reading to validate the hypothesis

Tracing does not replace the rest.

It connects the rest.
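One common way tracing connects the rest is by stamping the trace id on every log line, so logs can later be joined to the trace that produced them. A minimal sketch with Python's standard-library logger; the `trace_id` field name is a convention, not a standard:

```python
import io
import logging

# Attach the current request's trace id to every log record via a filter.
class TraceContextFilter(logging.Filter):
    def __init__(self, trace_id):
        super().__init__()
        self.trace_id = trace_id

    def filter(self, record):
        record.trace_id = self.trace_id
        return True

stream = io.StringIO()  # stand-in for stdout / a log shipper
logger = logging.getLogger("checkout")
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(trace_id)s %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.addFilter(TraceContextFilter("t-7f3a"))
logger.setLevel(logging.INFO)

# Any log emitted during this request can now be matched to its trace.
logger.info("timeout calling payment provider")
print(stream.getvalue().strip())  # → t-7f3a INFO timeout calling payment provider
```

With that in place, "find the trace for this error log" and "find the logs for this slow trace" are both one search by `trace_id`.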

In interviews, the best answer is about when and why to use it

Instead of dumping the definition, it is usually better to answer like this:

  • when I would use tracing
  • which question I would try to answer
  • how it complements logs and metrics
  • which decision it unlocks

That sounds much more mature than repeating terminology.

Simple example

Imagine checkout got slower after a recent change.

You already know from metrics that latency increased.

But you still do not know where.

Weak answer:

“I would open tracing to see what happened.”

That is still too generic.

Better answer:

“Because the flow crosses several services and one external dependency, I would use tracing to locate where latency grew inside the request journey. The question is not only ‘is it slow?’, because the metric already answered that. The question is ‘in which stage is the time being consumed, and did that start in our service, in the database, or in an external call?’ From there I would cross-reference the trace with the logs and errors in the suspect stage.”

That answer works better because it shows:

  • why tracing enters the picture
  • which question it answers
  • how it connects to the rest of the investigation

Common mistakes

  • treating tracing as a replacement for logs and metrics
  • talking about spans and traces without saying which question they answer
  • instrumenting everything without criteria and creating too much noise
  • looking at one isolated trace and treating that as proof of root cause
  • turning an interview answer into a vendor catalog

How a senior thinks

Engineers who are more mature with distributed systems often think like this:

“When the problem crosses several boundaries, I need a way to see the whole trajectory, not only isolated pieces of it.”

That is a very good lens.

Because it explains why tracing exists without mysticism.

Seniority here is not using pretty observability words.

It is knowing which tool reduces the most uncertainty for the question you have right now.

What the interviewer wants to see

When this topic appears in an interview, the evaluator is usually trying to understand whether you:

  • can differentiate the roles of logs, metrics, and traces
  • know when tracing actually helps
  • can use tracing to locate a bottleneck or failure in a distributed flow
  • avoid overly abstract talk
  • think of investigation as a composition of signals, not as one magic tool

A strong answer usually shows:

  1. what kind of system or problem calls for tracing
  2. which question tracing helps answer
  3. how it works with logs and metrics
  4. which practical decision it accelerates in debugging

If those elements appear, the answer is already much stronger.

Tracing does not exist to make observability look more sophisticated. It exists to make a distributed flow readable.

When the problem lives between services, looking only at each service in isolation almost always delays understanding.
