Intermittent Bugs: Where to Start

How to investigate a failure that appears and disappears without treating irregular behavior like luck or superstition.

Andrews Ribeiro

Founder & Engineer

The problem

Intermittent bugs mess with the team’s confidence.

Today it fails. Tomorrow it passes. It works on your machine. In production it comes back.

Because it does not follow a neat script, a lot of people switch into the wrong mode:

  • they call it flaky too early
  • they try to fix it before narrowing down where it happens
  • they change config in the dark
  • they hope it disappears

But an intermittent bug is almost never pure luck.

There is usually a condition that still has not been isolated.

Mental model

Think about it like this:

an intermittent bug is a bug whose pattern has not been discovered yet

That sentence helps because it removes the mysticism.

If there is behavior, there is some combination of factors behind it:

  • one specific instance
  • load
  • timing
  • data
  • browser
  • tenant
  • external dependency

Your job is to figure out which variables are involved.

Breaking it down

Start by looking for the difference between good and bad cases

One isolated failure does not say much by itself.

What really helps is comparing:

  • when it fails
  • when it passes
  • what changes between those two scenarios

Sometimes the clue is not inside the error itself. It is inside the contrast.
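One way to make that contrast concrete is to diff the recorded context of a passing case against a failing one. The sketch below assumes each request has already been captured as a flat dictionary. The field names and values are made up for illustration, and `diff_context` is a hypothetical helper, not part of any library.

```python
from typing import Any

ABSENT = "<absent>"  # marker for a field that one of the two records lacks


def diff_context(good: dict[str, Any], bad: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return only the fields whose values differ, as (good, bad) pairs."""
    differences = {}
    for key in sorted(good.keys() | bad.keys()):
        good_value = good.get(key, ABSENT)
        bad_value = bad.get(key, ABSENT)
        if good_value != bad_value:
            differences[key] = (good_value, bad_value)
    return differences


# Made-up records: one request that passed and one that failed.
passing = {"endpoint": "/profile", "instance": "app-1", "browser": "firefox", "cache_age_s": 12}
failing = {"endpoint": "/profile", "instance": "app-3", "browser": "firefox", "cache_age_s": 3600}

print(diff_context(passing, failing))
```

Everything the two cases share drops out, and what remains is the short list of candidates worth investigating first.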

Focus on hidden variables

When a bug is not constant, suspect things like:

  • one instance in the pool
  • one small subset of data
  • a timeout under higher latency
  • a different event order
  • stale cache
  • an oscillating external dependency

Intermittency is often the signature of a condition that is not present all the time.
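Hidden variables only stay hidden while nobody records them. A minimal sketch of capturing the conditions around each attempt is shown here. It assumes the operation can be wrapped in a plain callable, and `run_with_context` and its label names are hypothetical. In a real system the same fields would more likely go into the existing structured logs.

```python
import os
import socket
import time
from typing import Any, Callable


def run_with_context(operation: Callable[[], Any], **labels: Any) -> dict[str, Any]:
    """Run an operation and return a record of the conditions it ran under."""
    record: dict[str, Any] = {
        "host": socket.gethostname(),  # which machine or instance handled it
        "pid": os.getpid(),
        "started_at": time.time(),
        **labels,  # caller-supplied variables, e.g. tenant or browser
    }
    start = time.perf_counter()
    try:
        operation()
        record["outcome"] = "ok"
    except Exception as exc:
        record["outcome"] = "error"
        record["error_type"] = type(exc).__name__
    record["duration_ms"] = (time.perf_counter() - start) * 1000
    return record


# Stand-in operation that always fails, just to show the record shape.
print(run_with_context(lambda: 1 / 0, tenant="acme"))
```

With one record like this per attempt, the good and the bad cases become comparable instead of anecdotal.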

Try to increase the chance of repetition

You will not always reproduce it perfectly.

But you can increase the probability:

  • run the same flow many times
  • pin the same input
  • test at a similar time of day
  • route to the same instance
  • simulate latency
  • reduce noise and vary one condition at a time

Partial reproduction already helps a lot.
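"Run the same flow many times" raises a practical question: how many times is enough? If the runs are independent and each one fails with probability p, the chance of seeing at least one failure in n runs is 1 - (1 - p)^n. The sketch below turns that into a run count and pairs it with a tiny repeat loop. Both function names are hypothetical, and the independence assumption rarely holds exactly, so treat the number as a rough guide.

```python
import math
import random
from typing import Callable


def runs_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that 1 - (1 - failure_rate)**n >= confidence."""
    if not 0 < failure_rate < 1 or not 0 < confidence < 1:
        raise ValueError("failure_rate and confidence must be between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


def repeat(flow: Callable[[], bool], attempts: int) -> int:
    """Run the same flow `attempts` times and return how many runs failed."""
    return sum(1 for _ in range(attempts) if not flow())


# A bug that shows up about 5% of the time needs roughly this many runs
# to appear at least once with 95% probability.
attempts = runs_needed(0.05)
print(attempts)

# Stand-in for the real flow: a seeded random failure at about 5%.
rng = random.Random(42)
failures = repeat(lambda: rng.random() >= 0.05, attempts)
print(f"{failures} failures in {attempts} runs")
```

The same count works in reverse: a handful of clean runs after a change says very little about a bug this rare.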

Treat each attempt like an experiment

If you change five things at once, the bug stays intermittent and your investigation becomes intermittent too.

Change one condition, observe, write it down.

That rhythm looks less heroic. But it usually works much better.
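The "change one condition, observe, write it down" loop can be enforced by the tooling itself. This is a minimal sketch, assuming the conditions fit in a dictionary and the attempt can be expressed as a function that returns True on success. `ExperimentLog`, the condition names, and the stand-in attempt are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ExperimentLog:
    baseline: dict[str, Any]
    entries: list[dict[str, Any]] = field(default_factory=list)

    def vary(self, name: str, value: Any, attempt: Callable[[dict[str, Any]], bool]) -> bool:
        """Run one attempt with exactly one baseline condition changed, and record it."""
        if name not in self.baseline:
            raise KeyError(f"unknown condition: {name}")
        conditions = {**self.baseline, name: value}  # everything else stays at baseline
        passed = attempt(conditions)
        self.entries.append({"changed": name, "value": value, "passed": passed})
        return passed


# Stand-in for the real flow: it fails only when latency is high.
def attempt(conditions: dict[str, Any]) -> bool:
    return conditions["latency_ms"] < 500


log = ExperimentLog(baseline={"latency_ms": 0, "cache": "warm", "instance": "app-1"})
log.vary("cache", "cold", attempt)
log.vary("latency_ms", 800, attempt)

for entry in log.entries:
    print(entry)
```

Because every attempt starts from the same baseline, each entry in the log answers one question and nothing else.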

Simple example

Imagine a profile-save endpoint that fails in about 1 out of every 20 requests.

Looking only at the latest failure may lead the team to:

  • change the ORM
  • increase the timeout
  • restart everything

But after comparing successful requests with failed ones, the team notices one important difference:

  • every failure was handled by the same pod

Now the bug stops looking random.

It becomes an objective question:

what exists in that pod that does not exist in the others?

Maybe a missing environment variable. Maybe a corrupted cache. Maybe an old version still running.

That is the point: the comparison removed a large part of the fog.
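The comparison in this example can be as simple as grouping request outcomes by a candidate variable. The sketch below uses made-up data: 40 requests spread over four hypothetical pods, with both failures on one of them, so the overall rate is still 1 in 20. `failure_rate_by` is an illustrative helper, and the record shape is an assumption.

```python
from collections import Counter
from typing import Any


def failure_rate_by(records: list[dict[str, Any]], key: str) -> dict[Any, float]:
    """Failure rate per distinct value of `key`; each record needs an 'ok' flag."""
    totals: Counter = Counter()
    failures: Counter = Counter()
    for record in records:
        totals[record[key]] += 1
        if not record["ok"]:
            failures[record[key]] += 1
    return {value: failures[value] / totals[value] for value in totals}


# Made-up data: 10 requests per pod, all successful at first.
records = [
    {"pod": pod, "ok": True}
    for pod in ("pod-a", "pod-b", "pod-c", "pod-d")
    for _ in range(10)
]
# Mark two requests on pod-c as failed.
for record in [r for r in records if r["pod"] == "pod-c"][:2]:
    record["ok"] = False

overall = sum(not r["ok"] for r in records) / len(records)
print(f"overall failure rate: {overall}")
print(failure_rate_by(records, "pod"))
```

Seen as a whole, the endpoint fails 5% of the time. Grouped by pod, three pods never fail and one fails 20% of the time. The same grouping works for tenant, browser, or any other recorded field.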

Common mistakes

  • calling it unstable without trying to find a pattern
  • investigating only the failed request and ignoring the successful ones
  • trying to “solve” it with retries before understanding the cause
  • accepting “it disappeared after restart” as an explanation
  • confusing low frequency with lack of causality

How a senior thinks

More experienced engineers usually calm the team’s anxiety first.

The reasoning often sounds like this:

“If it appears and disappears, some variable is still escaping our observation. Before fixing it, I want to compare good and bad cases and discover what changes between them.”

That kind of thinking does not romanticize the bug.

It just treats intermittency as a lack of visibility.

What the interviewer wants to see

In interviews, this topic measures investigation maturity.

The evaluator wants to see whether you:

  • look for a pattern instead of complaining about randomness
  • compare cases that pass and fail
  • raise hypotheses about hidden variables
  • try to increase reproducibility methodically

A strong answer often sounds like this:

“I would start by comparing successful and failed requests to discover what really changes. An intermittent bug usually points to a condition that only appears part of the time, so I would try to isolate environment, data, timing, or instance before proposing a fix.”

An intermittent bug is not a bug without a cause. It is a bug whose cause has not been observed well enough yet.

When the pattern appears, intermittency stops looking like mystery and becomes engineering again.
