April 3, 2025

Async Bugs and Race Conditions

How to make timing failures easier to understand by making ordering, concurrency, and shared state visible.

Andrews Ribeiro

Founder & Engineer

3 min Intermediate Systems

#debugging-production #debugging #async #race-condition

The problem

Race conditions feel scary because they rarely fail the exact same way twice.

The code works locally and fails in production. The failure disappears when you add a console.log, and only shows up when two responses land in one specific order.

That makes a lot of people treat async bugs like bad luck, when the real problem is usually much simpler: the system has no solid rule for which result is still allowed to update shared state.

Mental model

When you are hunting an async bug, looking only at “what the code does” is not enough.

You also need to look at:

the timeline of events
which operation finished before the other
whether the state was still valid when the result arrived

Once the investigation shifts from “reading lines of code” to “drawing the sequence of events,” the bug usually stops feeling like a ghost.
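
One way to draw that sequence is to record it. The sketch below is a minimal, hypothetical helper (createTimeline and its labels are not from any library) that stamps each event with the order in which it actually happened. The two requests, A and B, are made up for illustration.

```typescript
// Hypothetical helper: records events in the order they actually happen.
type TimelineEvent = { seq: number; label: string };

function createTimeline() {
  const events: TimelineEvent[] = [];
  return {
    // Call this at every interesting point: request sent, response received, state written.
    mark(label: string): void {
      events.push({ seq: events.length + 1, label });
    },
    // Render the recorded order as one line per event.
    render(): string {
      return events.map((e) => `${e.seq}. ${e.label}`).join("\n");
    },
  };
}

const timeline = createTimeline();
timeline.mark("send A (query=re)");
timeline.mark("send B (query=react)");
timeline.mark("receive B");
timeline.mark("receive A");
const rendered = timeline.render();
```

Reading the rendered lines top to bottom shows that A was sent first but received last, which is exactly the kind of ordering that code review alone tends to miss.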

It also helps to replace a vague description with a precise one:

bad: “the app went weird”
better: “two operations finished in an order the UI was not prepared to handle”

Breaking it down

A practical way to investigate this kind of bug looks like this:

list the concurrent events involved
draw the order in which they can finish
find the point where two operations compete over the same state
identify the missing guarantee: cancellation, locking, request versioning, or final validation

That turns a “random bug” into a predictable collision.

This matters because concurrency does not mean total chaos. It means there are multiple valid timelines, and your code has to stay correct in every one of them.
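
For a small number of operations, the timelines can be listed by brute force. The sketch below assumes a naive "last write wins" handler and two hypothetical operations, A and B. It enumerates every completion order and reports the ones that leave the state wrong.

```typescript
type Op = { name: string; value: string };

// All orders in which the operations can finish.
function completionOrders<T>(items: T[]): T[][] {
  if (items.length <= 1) return [items.slice()];
  const result: T[][] = [];
  items.forEach((item, i) => {
    const rest = [...items.slice(0, i), ...items.slice(i + 1)];
    for (const tail of completionOrders(rest)) result.push([item, ...tail]);
  });
  return result;
}

// Naive "last write wins" handler: whichever operation finishes last owns the state.
function finalState(order: Op[]): string {
  let state = "";
  for (const op of order) state = op.value;
  return state;
}

const ops: Op[] = [
  { name: "A", value: "results for re" },
  { name: "B", value: "results for react" },
];
const expected = "results for react"; // B is the most recent request

const failingOrders = completionOrders(ops)
  .filter((order) => finalState(order) !== expected)
  .map((order) => order.map((op) => op.name).join(" -> "));
// failingOrders is ["B -> A"]
```

Two operations give only two timelines, and one of them is broken. That single failing order is the collision to design against.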

Simple example

Imagine an autocomplete input:

the user types “re”
request A is sent
the user keeps typing and reaches “react”
request B is sent
request B returns first and shows the correct results
request A returns later and overwrites the UI with stale data

The problem is not fetch.

The problem is that the frontend accepted an old response as if it were still the current truth.
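
The collision can be reproduced without a real network. The sketch below uses a hypothetical FakeSearchApi that holds responses until they are released by hand, so the failing order can be replayed on demand. The class, the callback shape, and the result strings are assumptions made for illustration.

```typescript
type Callback = (results: string[]) => void;

// Hypothetical stand-in for the network: holds responses until they are released.
class FakeSearchApi {
  private pending = new Map<string, Callback>();

  search(query: string, onResult: Callback): void {
    this.pending.set(query, onResult);
  }

  // Deliver the response for `query` now, regardless of when it was requested.
  respond(query: string): void {
    const cb = this.pending.get(query);
    if (!cb) throw new Error(`no pending request for "${query}"`);
    this.pending.delete(query);
    cb([`${query}-result`]);
  }
}

const api = new FakeSearchApi();
let shown: string[] = [];

// Naive handler: every response overwrites the UI state.
api.search("re", (results) => { shown = results; });    // request A
api.search("react", (results) => { shown = results; }); // request B

api.respond("react"); // B returns first: shown is ["react-result"]
api.respond("re");    // A returns later: shown is now ["re-result"], which is stale
```

Once the order is under your control, the bug fails the same way every time, which makes any fix easy to verify.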

Good fixes here are straightforward:

cancel the earlier request with AbortController
ignore responses with an outdated request ID
only update the UI if the response still matches the current input
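
The first fix can be sketched as follows. AbortController and AbortSignal are standard APIs, but CancellableSearchApi is a hypothetical, callback-based stand-in for the network so that the response order can be controlled. With the real fetch API, passing the signal makes the pending promise reject with an AbortError instead of silently dropping the result.

```typescript
type Callback = (results: string[]) => void;

// Hypothetical request layer that honours an AbortSignal.
class CancellableSearchApi {
  private pending = new Map<string, { signal: AbortSignal; onResult: Callback }>();

  search(query: string, signal: AbortSignal, onResult: Callback): void {
    this.pending.set(query, { signal, onResult });
  }

  // Simulate the response for `query` arriving now.
  respond(query: string): void {
    const request = this.pending.get(query);
    if (!request) return;
    this.pending.delete(query);
    if (request.signal.aborted) return; // cancelled requests never reach the UI
    request.onResult([`${query}-result`]);
  }
}

const api = new CancellableSearchApi();
let shown: string[] = [];
let current: AbortController | null = null;

function onInput(query: string): void {
  current?.abort(); // cancel whatever was still in flight
  current = new AbortController();
  api.search(query, current.signal, (results) => { shown = results; });
}

onInput("re");        // request A
onInput("react");     // request B, which aborts A
api.respond("react"); // shown is ["react-result"]
api.respond("re");    // ignored: A was aborted
```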

None of these fixes exist to make the request “faster.” They exist to stop old state from winning after the world has already changed.
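
The request ID and current-input checks can live in the same guard. This is a minimal sketch: AutocompleteState and its method names are hypothetical, and a real component would wire them into its own state management.

```typescript
type Results = string[];

// Hypothetical UI controller: names and shapes are illustrative only.
class AutocompleteState {
  shown: Results = [];
  private latestRequestId = 0;
  private currentInput = "";

  // Call when the user types; returns the ID to attach to the request.
  startRequest(input: string): number {
    this.currentInput = input;
    this.latestRequestId += 1;
    return this.latestRequestId;
  }

  // Call when a response arrives; returns true if it was applied.
  applyResponse(requestId: number, query: string, results: Results): boolean {
    if (requestId !== this.latestRequestId) return false; // outdated request ID
    if (query !== this.currentInput) return false;        // input has moved on
    this.shown = results;
    return true;
  }
}

const ui = new AutocompleteState();
const idA = ui.startRequest("re");
const idB = ui.startRequest("react");

const appliedB = ui.applyResponse(idB, "react", ["react-result"]); // true
const appliedA = ui.applyResponse(idA, "re", ["re-result"]);       // false, stale
```

Unlike cancellation, this approach lets the old request finish. It only refuses to let the old result write.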

Common mistakes

trying to reproduce the bug by random clicking without mapping the timeline first
putting a setTimeout on top of the problem and hoping it disappears
assuming “async” means random and impossible to fix
forgetting that two perfectly valid responses can still break the UI if they arrive in the wrong order

How a senior thinks

More experienced engineers do not call an async bug flaky by reflex.

They draw the timeline and ask:

What sequence of events puts this system into an invalid state?

That question pulls the discussion out of superstition and back into causality.

Another useful question usually follows:

What guarantee is missing that should stop old state from becoming valid again?

Sometimes the answer is cancellation. Sometimes it is idempotency. Sometimes it is a lock. Sometimes it is just checking whether the state is still current before applying the result.
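
The last option, checking whether the state is still current, can be sketched as a version check on write. The Versioned type, applyIfCurrent, and the cart values are hypothetical and exist only to show the shape of the guarantee.

```typescript
type Versioned<T> = { version: number; value: T };

// Applies the write only if nobody changed the state since it was read.
function applyIfCurrent<T>(state: Versioned<T>, readVersion: number, next: T): boolean {
  if (state.version !== readVersion) return false; // state moved on: reject the write
  state.value = next;
  state.version += 1;
  return true;
}

const cart: Versioned<number> = { version: 1, value: 10 };
const seenByA = cart.version; // operation A reads the state
const seenByB = cart.version; // operation B reads the same state

const bApplied = applyIfCurrent(cart, seenByB, 12); // true, version becomes 2
const aApplied = applyIfCurrent(cart, seenByA, 11); // false, A's view is outdated
```

The rejected operation can then re-read the state and retry, or give up, depending on what the feature needs.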

What the interviewer wants to see

In frontend or systems interviews, concurrency reveals depth very quickly. The interviewer is looking for three signals:

You understand that concurrency makes execution order less predictable.
You look for collision points over shared mutable state.
You talk about architectural guarantees, not just adding more await.

A strong answer often sounds like this:

I would draw the timeline and figure out which response or operation arrived too late but still managed to write into shared state. From there I would choose the right guarantee: cancellation, locking, versioning, request IDs, or a final validation check.

A race condition is not bad luck. It is a collision the architecture still does not know how to survive.

Quick summary

What to keep in your head

Async bugs usually get clearer when you draw the event order instead of staring at code in isolation.
A race condition happens when two valid operations compete over the same state without enough protection.
Cancellation, request IDs, final validation, and locks solve different versions of the same timing problem.
The faster you can name the collision, the less time you waste calling the bug random.

Practice checklist

Use this when you answer

Can I draw the timeline of two concurrent requests or events?
Can I tell whether the fix needs cancellation, a lock, a request ID, or a final validation check?
Can I explain why an old response should not overwrite newer state?
Can I talk about async bugs in interviews as a causality problem, not a luck problem?
