May 1 2025
Investigating Production Failures
How to investigate a real production problem with a clear process for evidence, containment, and communication.
Andrews Ribeiro
Founder & Engineer
5 min · Intermediate · Systems
Track: Senior Full Stack Interview Trail (Step 7 of 14)
The problem
Production failures create pressure very quickly.
That pressure often turns into movement before it turns into understanding.
The team restarts pods, increases timeouts, adds logs everywhere, rolls something back, and ships a speculative fix before anyone has clearly described what actually broke.
Sometimes the symptom goes away. The problem is that nobody learns why.
Mental model
During a production incident, your first job is not to move fast.
Your first job is to reduce damage and uncertainty in the right order.
A useful question is:
What do we know for sure, what is only a guess, what changed recently, and what action lowers risk without making diagnosis messier?
That question is valuable because it keeps the room from turning into an operational lottery.
Breaking it down
Containment and investigation are not the same thing
Sometimes the first move is mitigation:
- rollback
- turning off a feature flag
- routing traffic away
- shrinking the blast radius
Sometimes the first move is to observe a little better before changing anything.
The deciding questions are usually:
- is the damage active and meaningful?
- is the mitigation safe and reversible?
- does waiting increase real cost?
If the answers are yes, containment usually comes first.
But good containment is not random destruction. It reduces damage without wiping out your ability to understand what happened.
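A reversible mitigation can be as simple as flipping a kill switch. The sketch below uses an invented in-memory flag store (a real system would call your feature-flag service); the point is that the action is cheap to undo and leaves the evidence trail intact.

```python
# Minimal sketch of a reversible mitigation: an in-memory kill switch.
# FLAGS and the flag name are invented for illustration; a real system
# would talk to your feature-flag service instead.
FLAGS = {"new_payment_path": True}

def disable(flag):
    """Shrink the blast radius without destroying evidence: flip a flag,
    leave the deployed code, logs, and metrics untouched."""
    previous = FLAGS.get(flag)
    FLAGS[flag] = False
    return previous  # remember the old value so the action is reversible

was_on = disable("new_payment_path")
print(was_on, FLAGS["new_payment_path"])
```

Because the previous value is returned, re-enabling the flag later is a one-line operation, which is exactly what makes this containment rather than destruction.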
Define the symptom before chasing the cause
A lot of bad incident response starts by trying to explain something that has not even been scoped yet.
Before theories, lock down:
- what error is happening
- since when
- in which flow
- for whom
- with what severity
That alone removes a lot of noise.
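Scoping the symptom is mechanical enough to sketch in code. Assuming simplified, hypothetical error records pulled from a logging backend, a few lines answer "what, since when, and in which flow" before any theory is proposed:

```python
from collections import Counter
from datetime import datetime

# Hypothetical, simplified error records from your logging backend.
errors = [
    {"ts": "2025-05-01T10:02:00", "flow": "checkout", "status": 500},
    {"ts": "2025-05-01T10:03:10", "flow": "checkout", "status": 500},
    {"ts": "2025-05-01T10:04:30", "flow": "search",   "status": 500},
]

def scope_symptom(errors):
    """Answer 'since when' and 'in which flow' before chasing causes."""
    by_flow = Counter(e["flow"] for e in errors)
    first_seen = min(datetime.fromisoformat(e["ts"]) for e in errors)
    return {"first_seen": first_seen.isoformat(), "by_flow": dict(by_flow)}

result = scope_symptom(errors)
print(result)
```

The output already separates "checkout is the affected flow, starting at 10:02" from guesses about why, which is the whole point of scoping first.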
Recent change is still a strong clue
One of the first useful questions in a real incident is still:
- what changed near the start of the problem?
That change might be:
- a deploy
- a config change
- a feature flag
- a third-party dependency
- a traffic pattern shift
Not because recent change proves guilt, but because it narrows the search space.
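The "what changed recently" question can be made concrete. Assuming a hypothetical change log of deploys, flags, and config edits, a small filter surfaces only the changes that landed shortly before the symptom began:

```python
from datetime import datetime, timedelta

# Hypothetical change log; names and timestamps are invented for illustration.
changes = [
    {"kind": "deploy",       "name": "checkout-api v142", "ts": "2025-05-01T09:58:00"},
    {"kind": "feature_flag", "name": "new_payment_path",  "ts": "2025-05-01T09:20:00"},
    {"kind": "config",       "name": "timeout bump",      "ts": "2025-04-30T15:00:00"},
]

def changes_near(incident_start, changes, window_minutes=30):
    """Narrow the search space: what changed shortly before the symptom began?"""
    start = datetime.fromisoformat(incident_start)
    window = timedelta(minutes=window_minutes)
    return [c for c in changes
            if timedelta(0) <= start - datetime.fromisoformat(c["ts"]) <= window]

suspects = changes_near("2025-05-01T10:02:00", changes)
print(suspects)
```

Here only the deploy falls inside the window, so it becomes the first suspect to check for correlation, not an automatic verdict of guilt.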
Look at reliable signals before changing too much
This is where you check:
- metrics
- logs
- traces
- error rate by flow
- correlation with a deploy or a component
The logic matters more than the tool name:
- confirm impact
- narrow scope
- compare to normal
- identify the highest-probability suspect
If you jump straight into changing infra, config, and code before doing this, the response stops looking like investigation and starts looking like technical anxiety.
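The confirm-narrow-compare loop can be sketched with plain counting. Assuming hypothetical request samples drawn from metrics or traces, error rate by endpoint points at the highest-probability suspect before anything is changed:

```python
# Hypothetical request samples: (endpoint, succeeded) pairs from traces/metrics.
requests = [
    ("/checkout/pay",  False), ("/checkout/pay",  False),
    ("/checkout/pay",  True),  ("/checkout/cart", True),
    ("/checkout/cart", True),  ("/search",        True),
]

def error_rate_by_endpoint(requests):
    """Confirm impact and narrow scope: which endpoint carries the failures?"""
    totals, failures = {}, {}
    for endpoint, ok in requests:
        totals[endpoint] = totals.get(endpoint, 0) + 1
        if not ok:
            failures[endpoint] = failures.get(endpoint, 0) + 1
    return {e: failures.get(e, 0) / totals[e] for e in totals}

rates = error_rate_by_endpoint(requests)
suspect = max(rates, key=rates.get)
print(suspect, rates[suspect])
```

Comparing these rates against the normal baseline for each endpoint is what turns "checkout is broken" into "the pay path specifically is failing", which is a far better starting point than restarting pods.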
Changing many things at once ruins the experiment
This is where weak answers fall apart.
If the response is:
- restart the pods
- raise the timeout
- add more logs
- roll back
all in one breath, it sounds active but not disciplined.
If the system improves, nobody knows which move helped.
If it gets worse, the room just got noisier.
Strong answers show sequence, not just actions
A much better structure is:
- what I would confirm first
- what I would mitigate if impact is high
- what I would investigate next
- how I would communicate status and next decision
That sounds like real operational reasoning.
Simple example
Imagine this interview prompt:
After a deploy, checkout starts returning 500 for part of the traffic. What do you do first?
A weak answer sounds like:
I would open the logs, restart the pods, and roll back if needed.
That has actions, but not much order.
A stronger answer sounds more like this:
First I would confirm the real impact: whether the issue affects every checkout or only one payment path, when it started, and whether it lines up with the deploy. If the impact is high and the correlation is strong, I would consider rollback or disabling the change to stop the bleeding. In parallel, I would check the most reliable signals to narrow the failure: which endpoint is failing, which dependency is involved, whether latency changed, and what is different between affected and unaffected requests. I would avoid changing multiple things at once because that makes diagnosis worse.
That answer shows:
- containment with judgment
- investigation with sequence
- respect for evidence
- implicit communication of next steps
Common mistakes
- changing multiple things at once and destroying the diagnostic trail
- treating rollback as if it also explains the cause
- jumping into code before defining impact and scope
- naming tools instead of showing reasoning
- mixing temporary mitigation with real correction
How a senior thinks
Experienced engineers often think about incidents like this:
If the room is moving too fast, my job is to put structure back. If the room is too slow, my job is to pull the safest first action.
That is a strong lens because it shows two things at the same time:
- emotional control
- technical prioritization
Production failures rarely need genius. They need order under pressure.
What the interviewer wants to see
When this shows up in an interview, the interviewer is usually checking whether you:
- contain without destroying evidence
- investigate with method instead of guesswork
- separate symptom, hypothesis, and likely cause
- check signals before committing to a story
- communicate sequence even with incomplete context
A strong answer usually makes this clear:
- how you would scope the problem
- when you would mitigate
- which signals you would check first
- how you would avoid making the investigation dirtier
Production incidents do not need heroic guessing. They need operational clarity when the context is still incomplete.
Quick summary
What to keep in your head
- Production incidents get better when you separate mitigation, investigation, and explanation instead of doing everything at once.
- A strong response starts with scope, impact, and recent change, not with random infrastructure actions.
- Mitigation is not the same thing as diagnosis. Stopping the bleeding does not automatically explain the cause.
- The team that changes fewer things with more evidence usually learns faster and recovers more safely.
Practice checklist
Use this when you answer
- Can I explain the order between containing damage, narrowing scope, and debugging the cause?
- Do I know which signals I would check first during a live production issue?
- Can I separate symptom, hypothesis, mitigation, and likely root cause in my own explanation?
- Can I answer an interview scenario without hiding behind tool names?