March 21, 2025

Intermittent Bugs: Where to Start

How to investigate a failure that appears and disappears without treating irregular behavior like luck or superstition.

Andrews Ribeiro

Founder & Engineer

4 min Intermediate Systems

#debugging-production #debugging #troubleshooting #intermittent-bugs #production #investigation

The problem

Intermittent bugs mess with the team’s confidence.

Today it fails. Tomorrow it passes. It works on your machine. In production it comes back.

Because it does not follow a neat script, a lot of people switch into the wrong mode:

they call it flaky too early
they try to fix it before narrowing down where it happens
they change config in the dark
they hope it disappears

But an intermittent bug is almost never pure luck.

There is usually a condition that still has not been isolated.

Mental model

Think about it like this:

an intermittent bug is a bug whose pattern has not been discovered yet

That sentence helps because it removes the mysticism.

If there is behavior, there is some combination of factors behind it:

one specific instance
load
timing
data
browser
tenant
external dependency

Your job is to figure out which of those variables is involved.

Breaking it down

Start by looking for the difference between good and bad cases

One isolated failure does not say much by itself.

What really helps is comparing:

when it fails
when it passes
what changes between those two scenarios

Sometimes the clue is not inside the error itself. It is inside the contrast.
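
As a rough sketch of that contrast, assume each request was logged as a flat record with an `ok` flag and a few context fields. The field names (`tenant`, `browser`, `region`) and the sample records are hypothetical. The idea is only to list, per field, the values that show up among failures and never among successes.

```python
from collections import Counter, defaultdict

def contrast(records, outcome_key="ok"):
    """Count, per field, how often each value appears in passing vs failing records."""
    table = defaultdict(lambda: {"pass": Counter(), "fail": Counter()})
    for rec in records:
        bucket = "pass" if rec[outcome_key] else "fail"
        for field, value in rec.items():
            if field != outcome_key:
                table[field][bucket][value] += 1
    return table

def failure_only_values(table):
    """Values seen in failing records and never in passing ones."""
    suspects = {}
    for field, buckets in table.items():
        only = sorted(v for v in buckets["fail"] if v not in buckets["pass"])
        if only:
            suspects[field] = only
    return suspects

# Hypothetical log sample, not real data.
records = [
    {"ok": True,  "tenant": "t1", "browser": "firefox", "region": "eu"},
    {"ok": True,  "tenant": "t2", "browser": "chrome",  "region": "us"},
    {"ok": False, "tenant": "t3", "browser": "chrome",  "region": "eu"},
    {"ok": True,  "tenant": "t1", "browser": "chrome",  "region": "us"},
    {"ok": False, "tenant": "t3", "browser": "firefox", "region": "us"},
]

suspects = failure_only_values(contrast(records))
print(suspects)
```

A value that appears only in failures is a lead, not a verdict. With a small sample it can be coincidence, so it is worth confirming against more requests.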

Focus on hidden variables

When a bug is not constant, suspect things like:

one instance in the pool
one small subset of data
a timeout under higher latency
a different event order
stale cache
an oscillating external dependency

Intermittency is often the signature of a condition that is not present all the time.

Try to increase the chance of repetition

You will not always reproduce it perfectly.

But you can increase the probability:

run the same flow many times
pin the same input
test at a similar time of day
route to the same instance
simulate latency
reduce noise and vary one condition at a time

Partial reproduction already helps a lot.
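
A minimal sketch of "run the same flow many times with a pinned input" could look like this. The `flow` below is a hypothetical stand-in that fails on every 20th call; in practice it would be whatever call triggers the real behavior. `runs_needed` assumes attempts are independent, which real systems do not guarantee, so treat its answer as a rough floor.

```python
import math

def repeat(flow, payload, runs):
    """Run the same flow with the same pinned input and collect the failures."""
    failures = []
    for attempt in range(runs):
        try:
            flow(payload)
        except Exception as exc:  # broad on purpose: we are counting, not handling
            failures.append((attempt, repr(exc)))
    return failures

def runs_needed(failure_rate, confidence=0.95):
    """Smallest number of runs that shows at least one failure with the given
    probability, assuming independent attempts."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))

def make_stand_in_flow(every=20):
    """Hypothetical flow that fails on every Nth call, for demonstration only."""
    calls = {"n": 0}
    def flow(payload):
        calls["n"] += 1
        if calls["n"] % every == 0:
            raise TimeoutError(f"simulated timeout saving {payload['user_id']}")
    return flow

pinned_input = {"user_id": "user-42"}  # hypothetical pinned input
failures = repeat(make_stand_in_flow(), pinned_input, runs=100)
print(len(failures), "failures in 100 runs")
print("runs needed for a 5% failure rate:", runs_needed(0.05))
```

The second number matters when a run comes back clean. If the bug fails about 1 time in 20, a handful of green runs says very little.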

Treat each attempt like an experiment

If you change five things at once, the bug stays intermittent and your investigation becomes intermittent too.

Change one condition, observe, write it down.

That rhythm looks less heroic. But it usually works much better.
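
One way to keep that rhythm honest is a small experiment log that refuses an attempt that changes more than one condition against the baseline. This is only a sketch: the `Notebook` class, the condition names, and the numbers are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Experiment:
    conditions: dict
    runs: int
    failures: int
    note: str = ""

@dataclass
class Notebook:
    baseline: dict
    entries: list = field(default_factory=list)

    def record(self, conditions, runs, failures, note=""):
        """Store one attempt; reject it if it differs from the baseline in more than one condition."""
        keys = self.baseline.keys() | conditions.keys()
        changed = {k for k in keys if self.baseline.get(k) != conditions.get(k)}
        if len(changed) > 1:
            raise ValueError(f"changed {sorted(changed)} at once; vary one condition per attempt")
        self.entries.append(Experiment(dict(conditions), runs, failures, note))
        return changed

# Hypothetical conditions and results.
baseline = {"pod": "any", "latency_ms": 0, "input": "user-42"}
log = Notebook(baseline)
log.record(baseline, runs=100, failures=5, note="baseline")
log.record({**baseline, "latency_ms": 300}, runs=100, failures=5, note="added latency, no change")

try:
    log.record({**baseline, "latency_ms": 300, "pod": "pod-c"}, runs=100, failures=40)
except ValueError as err:
    rejected = str(err)
    print(rejected)
```

A plain text file does the same job. What matters is that every attempt has a written condition and a written result.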

Simple example

Imagine a profile-save endpoint that fails in about 1 out of every 20 requests.

Looking only at the latest failure may lead the team to:

change the ORM
increase the timeout
restart everything

But after comparing successful requests with failed ones, the team notices one important difference:

every failure was handled by the same pod

Now the bug stops looking random.

It becomes an objective question:

what exists in that pod that does not exist in the others?

Maybe one missing environment variable. Maybe corrupted cache. Maybe an old version still running.

That is the point: the comparison removed a large part of the fog.
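
The comparison in this example can be as small as a failure rate grouped by pod. The sketch below uses made-up records: four hypothetical pods, 80 requests, and failures placed only on `pod-c` so the overall rate comes out to 1 in 20.

```python
from collections import defaultdict

def failure_rate_by(records, key):
    """Failure rate per distinct value of `key`."""
    totals = defaultdict(int)
    failed = defaultdict(int)
    for rec in records:
        totals[rec[key]] += 1
        if not rec["ok"]:
            failed[rec[key]] += 1
    return {value: failed[value] / totals[value] for value in totals}

# Hypothetical traffic: requests are spread round-robin over four pods.
pods = ["pod-a", "pod-b", "pod-c", "pod-d"]
records = []
for i in range(80):
    pod = pods[i % 4]
    nth_on_pod = i // 4  # 0..19 for each pod
    ok = not (pod == "pod-c" and nth_on_pod % 5 == 0)
    records.append({"pod": pod, "ok": ok})

overall = sum(not rec["ok"] for rec in records) / len(records)
rates = failure_rate_by(records, "pod")
print("overall:", overall)
print("by pod:", rates)
```

Seen as one number, 5% looks like noise. Grouped by pod, it is 0% on three pods and 20% on one.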

Common mistakes

calling it unstable without trying to find a pattern
investigating only the failed request and ignoring the successful ones
trying to “solve” it with retries before understanding the cause
accepting “it disappeared after restart” as an explanation
confusing low frequency with lack of causality

How a senior thinks

More experienced engineers usually calm the team’s anxiety instead of feeding it.

The reasoning often sounds like this:

“If it appears and disappears, some variable is still escaping our observation. Before fixing it, I want to compare good and bad cases and discover what changes between them.”

That kind of thinking does not romanticize the bug.

It just treats intermittency as a lack of visibility.

What the interviewer wants to see

In interviews, this topic measures investigation maturity.

The evaluator wants to see whether you:

look for a pattern instead of complaining about randomness
compare cases that pass and fail
raise hypotheses about hidden variables
try to increase reproducibility with method

A strong answer often sounds like this:

“I would start by comparing successful and failed requests to discover what really changes. An intermittent bug usually points to a condition that only appears part of the time, so I would try to isolate environment, data, timing, or instance before proposing a fix.”

An intermittent bug is not a bug without a cause. It is a bug whose cause is still poorly observed.

When the pattern appears, intermittency stops looking like mystery and becomes engineering again.

Quick summary

What to keep in your head

An intermittent bug usually has a hidden pattern, not magic.
Comparing failing cases with successful ones often teaches more than staring only at the latest exception.
Intermittency usually points to something in the environment, concurrency, specific data, or a timing window.
The initial goal is not to prove a pretty theory. It is to increase your chance of repeating and explaining the behavior.

Practice checklist

Use this when you answer

Can I list which hidden variables might explain a bug that appears and disappears?
Do I know how to compare good and bad cases to find meaningful differences?
Can I propose ways to increase reproducibility without changing everything?
Can I explain in an interview why intermittent bugs require scoping, not superstition?



