March 21, 2025
Intermittent Bugs: Where to Start
How to investigate a failure that appears and disappears, without treating irregular behavior as luck or superstition.
Andrews Ribeiro
Founder & Engineer
4 min read · Intermediate · Systems
The problem
Intermittent bugs mess with the team’s confidence.
Today it fails. Tomorrow it passes. It works on your machine. In production it comes back.
Because it does not follow a neat script, a lot of people switch into the wrong mode:
- they call it flaky too early
- they try to fix it without a clear slice
- they change config in the dark
- they hope it disappears
But an intermittent bug is almost never pure luck.
There is usually a condition that has not yet been isolated.
Mental model
Think about it like this:
an intermittent bug is a bug whose pattern has not been discovered yet
That sentence helps because it removes the mysticism.
If there is behavior, there is some combination of factors behind it:
- one specific instance
- load
- timing
- data
- browser
- tenant
- external dependency
Your job is to figure out which variable is involved.
Breaking it down
Start by looking for the difference between good and bad cases
One isolated failure does not say much by itself.
What really helps is comparing:
- when it fails
- when it passes
- what changes between those two scenarios
Sometimes the clue is not inside the error itself. It is inside the contrast.
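One way to make that contrast concrete is to compare how suspect attributes are distributed across passing and failing requests. A minimal sketch, assuming your logs can be loaded as dicts; the `pod` and `tenant` fields here are hypothetical:

```python
from collections import Counter

def diff_cases(passed, failed, fields):
    """For each suspect field, count the values seen in passing vs failing requests."""
    report = {}
    for field in fields:
        report[field] = {
            "pass": Counter(r.get(field) for r in passed),
            "fail": Counter(r.get(field) for r in failed),
        }
    return report

# hypothetical request logs
passed = [{"pod": "a", "tenant": 1}, {"pod": "b", "tenant": 2}]
failed = [{"pod": "c", "tenant": 1}, {"pod": "c", "tenant": 2}]

report = diff_cases(passed, failed, ["pod", "tenant"])
# tenant looks the same on both sides; every failure shares pod "c"
```

A field whose distribution looks identical on both sides is probably innocent; a field whose values cluster only on the failing side is a lead.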
Focus on hidden variables
When a bug is not constant, suspect things like:
- one instance in the pool
- one small subset of data
- a timeout under higher latency
- a different event order
- stale cache
- an external dependency that fails intermittently
Intermittency is often the signature of a condition that is not present all the time.
Try to increase the chance of repetition
You will not always reproduce it perfectly.
But you can increase the probability:
- run the same flow many times
- pin the same input
- test at a similar time of day
- route to the same instance
- simulate latency
- reduce noise and vary one condition at a time
Partial reproduction already helps a lot.
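The list above can be sketched as a simple repetition harness: pin the input, run the same flow many times, and measure a failure rate instead of waiting for a one-off. `save_profile` is a hypothetical stand-in for the flow under test:

```python
import random

def save_profile(payload):
    # hypothetical stand-in for the flow under test; fails ~5% of the time here
    return random.random() > 0.05

def hammer(flow, payload, runs=200):
    """Run the same flow many times with a pinned input and count failures."""
    failures = sum(1 for _ in range(runs) if not flow(payload))
    return failures, runs

random.seed(7)  # keep this demo deterministic; a real flow will not be
failures, runs = hammer(save_profile, {"user_id": 123})
print(f"{failures}/{runs} failures")
```

Even if you never hit 100% reproduction, a stable failure rate gives you a baseline to compare against when you start varying one condition at a time.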
Treat each attempt like an experiment
If you change five things at once, the bug stays intermittent and your investigation becomes intermittent too.
Change one condition, observe, write it down.
That rhythm looks less heroic. But it usually works much better.
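That rhythm can be as simple as a log of single-variable experiments. A sketch with made-up readings, just to show the shape of the record:

```python
experiments = []

def record(change, observed):
    """Log one experiment: exactly one condition changed, plus the outcome."""
    experiments.append({"change": change, "observed": observed})

# hypothetical readings from three single-variable runs
record("baseline: same flow, same input, 100 runs", "4 failures")
record("only change: +200ms simulated latency", "19 failures")
record("latency removed again, nothing else touched", "3 failures")

for e in experiments:
    print(f"{e['change']} -> {e['observed']}")
```

When only one condition moves per run, the log itself becomes the argument for what the hidden variable is.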
Simple example
Imagine a profile-save endpoint that fails in about 1 out of every 20 requests.
Looking only at the latest failure may lead the team to:
- change the ORM
- increase the timeout
- restart everything
But after comparing successful requests with failed ones, the team notices one important difference:
- every failure was handled by the same pod
Now the bug stops looking random.
It becomes an objective question:
what is present in that pod that is missing from the others?
Maybe a missing environment variable. Maybe a corrupted cache. Maybe an old version still running.
That is the point: the comparison removed a large part of the fog.
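The comparison the team did can be sketched as a per-pod failure rate over the access log. The pod names and outcomes below are made up to mirror the example:

```python
from collections import Counter

# hypothetical access log: (pod that handled the request, whether it succeeded)
requests = [
    ("pod-a", True), ("pod-b", True), ("pod-a", True),
    ("pod-c", False), ("pod-b", True), ("pod-c", False),
    ("pod-c", True), ("pod-a", True), ("pod-c", False),
]

totals = Counter(pod for pod, _ in requests)
fails = Counter(pod for pod, ok in requests if not ok)

for pod in sorted(totals):
    print(f"{pod}: {fails[pod]}/{totals[pod]} failed")
# pod-a: 0/3, pod-b: 0/2, pod-c: 3/4 -- the failures cluster on one pod
```

Once the rates are side by side, "random" failures turn into a concrete question about one pod.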
Common mistakes
- calling it unstable without trying to find a pattern
- investigating only the failed request and ignoring the successful ones
- trying to “solve” it with retries before understanding the cause
- accepting “it disappeared after restart” as an explanation
- confusing low frequency with lack of causality
How a senior thinks
More experienced engineers usually calm the team's anxiety.
The reasoning often sounds like this:
“If it appears and disappears, some variable is still escaping our observation. Before fixing it, I want to compare good and bad cases and discover what changes between them.”
That kind of thinking does not romanticize the bug.
It simply treats intermittency as a lack of visibility.
What the interviewer wants to see
In interviews, this topic measures investigation maturity.
The evaluator wants to see whether you:
- look for a pattern instead of complaining about randomness
- compare cases that pass and fail
- raise hypotheses about hidden variables
- try to increase reproducibility with method
A strong answer often sounds like this:
“I would start by comparing successful and failed requests to discover what really changes. An intermittent bug usually points to a condition that only appears part of the time, so I would try to isolate environment, data, timing, or instance before proposing a fix.”
An intermittent bug is not a bug without a cause. It is a bug whose cause is still poorly observed.
When the pattern appears, intermittency stops looking like mystery and becomes engineering again.
Quick summary
What to keep in your head
- An intermittent bug usually has a hidden pattern, not magic.
- Comparing failing cases with successful ones often teaches more than staring only at the latest exception.
- Intermittency usually points to an environment variable, concurrency, specific data, or a timing window.
- The initial goal is not to prove a pretty theory. It is to increase your chance of repeating and explaining the behavior.
Practice checklist
Use this when you answer
- Can I list which hidden variables might explain a bug that appears and disappears?
- Do I know how to compare good and bad cases to find meaningful differences?
- Can I propose ways to increase reproducibility without changing everything?
- Can I explain in an interview why intermittent bugs require scoping, not superstition?