May 1 2025
Investigating Production Failures
How to investigate a real production problem with a clear process for evidence, containment, and communication.
Andrews Ribeiro
Founder & Engineer
5 min · Intermediate · Systems
Track: Senior Full Stack Interview Trail (Step 7 of 14)
The problem
Production failures create pressure very quickly.
That pressure often turns into movement before it turns into understanding.
The team restarts pods, increases timeouts, adds logs everywhere, rolls something back, and ships a speculative fix before anyone has clearly described what actually broke.
Sometimes the symptom goes away. The problem is that nobody learns why.
Mental model
During a production incident, your first job is not to move fast.
Your first job is to reduce damage and uncertainty in the right order.
A useful question is:
What do we know for sure, what is only a guess, what changed recently, and what action lowers risk without making diagnosis messier?
That question is valuable because it keeps the room from turning into an operational lottery.
Breaking it down
Containment and investigation are not the same thing
Sometimes the first move is mitigation:
- rollback
- turning off a feature flag
- routing traffic away
- shrinking the blast radius
Sometimes the first move is to observe a little better before changing anything.
The deciding questions are usually:
- is the damage active and meaningful?
- is the mitigation safe and reversible?
- does waiting increase real cost?
If the answers are yes, containment usually comes first.
But good containment is not random destruction. It reduces damage without wiping out your ability to understand what happened.
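A reversible mitigation can be as simple as flipping a kill switch. The sketch below uses an invented in-memory flag store (a real system would call your feature-flag service); the point is that the action is cheap to undo and leaves the evidence trail intact.

```python
# Minimal sketch of a reversible mitigation: an in-memory kill switch.
# FLAGS and the flag name are invented for illustration; a real system
# would talk to your feature-flag service instead.
FLAGS = {"new_payment_path": True}

def disable(flag):
    """Shrink the blast radius without destroying evidence: flip a flag,
    leave the deployed code, logs, and metrics untouched."""
    previous = FLAGS.get(flag)
    FLAGS[flag] = False
    return previous  # remember the old value so the action is reversible

was_on = disable("new_payment_path")
print(was_on, FLAGS["new_payment_path"])
```

Because the previous value is returned, re-enabling the flag later is a one-line operation, which is exactly what makes this containment rather than destruction.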
Define the symptom before chasing the cause
A lot of bad incident response starts by trying to explain something that has not even been scoped yet.
Before theories, lock down:
- what error is happening
- since when
- in which flow
- for whom
- with what severity
That alone removes a lot of noise.
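Scoping the symptom is mechanical enough to sketch in code. Assuming simplified, hypothetical error records pulled from a logging backend, a few lines answer "what, since when, and in which flow" before any theory is proposed:

```python
from collections import Counter
from datetime import datetime

# Hypothetical, simplified error records from your logging backend.
errors = [
    {"ts": "2025-05-01T10:02:00", "flow": "checkout", "status": 500},
    {"ts": "2025-05-01T10:03:10", "flow": "checkout", "status": 500},
    {"ts": "2025-05-01T10:04:30", "flow": "search",   "status": 500},
]

def scope_symptom(errors):
    """Answer 'since when' and 'in which flow' before chasing causes."""
    by_flow = Counter(e["flow"] for e in errors)
    first_seen = min(datetime.fromisoformat(e["ts"]) for e in errors)
    return {"first_seen": first_seen.isoformat(), "by_flow": dict(by_flow)}

result = scope_symptom(errors)
print(result)
```

The output already separates "checkout is the affected flow, starting at 10:02" from guesses about why, which is the whole point of scoping first.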
Recent change is still a strong clue
One of the first useful questions in a real incident is still:
- what changed near the start of the problem?
That change might be:
- a deploy
- a config change
- a feature flag
- a third-party dependency
- a traffic pattern shift
Not because recent change proves guilt, but because it narrows the search space.
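The "what changed recently" question can be made concrete. Assuming a hypothetical change log of deploys, flags, and config edits, a small filter surfaces only the changes that landed shortly before the symptom began:

```python
from datetime import datetime, timedelta

# Hypothetical change log; names and timestamps are invented for illustration.
changes = [
    {"kind": "deploy",       "name": "checkout-api v142", "ts": "2025-05-01T09:58:00"},
    {"kind": "feature_flag", "name": "new_payment_path",  "ts": "2025-05-01T09:20:00"},
    {"kind": "config",       "name": "timeout bump",      "ts": "2025-04-30T15:00:00"},
]

def changes_near(incident_start, changes, window_minutes=30):
    """Narrow the search space: what changed shortly before the symptom began?"""
    start = datetime.fromisoformat(incident_start)
    window = timedelta(minutes=window_minutes)
    return [c for c in changes
            if timedelta(0) <= start - datetime.fromisoformat(c["ts"]) <= window]

suspects = changes_near("2025-05-01T10:02:00", changes)
print(suspects)
```

Here only the deploy falls inside the window, so it becomes the first suspect to check for correlation, not an automatic verdict of guilt.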
Look at reliable signals before changing too much
This is where you check:
- metrics
- logs
- traces
- error rate by flow
- correlation with a deploy or a component
The logic matters more than the tool name:
- confirm impact
- narrow scope
- compare to normal
- identify the highest-probability suspect
If you jump straight into changing infra, config, and code before doing this, the response stops looking like investigation and starts looking like technical anxiety.
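The confirm-narrow-compare loop can be sketched with plain counting. Assuming hypothetical request samples drawn from metrics or traces, error rate by endpoint points at the highest-probability suspect before anything is changed:

```python
# Hypothetical request samples: (endpoint, succeeded) pairs from traces/metrics.
requests = [
    ("/checkout/pay",  False), ("/checkout/pay",  False),
    ("/checkout/pay",  True),  ("/checkout/cart", True),
    ("/checkout/cart", True),  ("/search",        True),
]

def error_rate_by_endpoint(requests):
    """Confirm impact and narrow scope: which endpoint carries the failures?"""
    totals, failures = {}, {}
    for endpoint, ok in requests:
        totals[endpoint] = totals.get(endpoint, 0) + 1
        if not ok:
            failures[endpoint] = failures.get(endpoint, 0) + 1
    return {e: failures.get(e, 0) / totals[e] for e in totals}

rates = error_rate_by_endpoint(requests)
suspect = max(rates, key=rates.get)
print(suspect, rates[suspect])
```

Comparing these rates against the normal baseline for each endpoint is what turns "checkout is broken" into "the pay path specifically is failing", which is a far better starting point than restarting pods.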
Changing many things at once ruins the experiment
This is where weak answers fall apart.
If the response is:
- restart the pods
- raise the timeout
- add more logs
- roll back
all in one breath, it sounds active but not disciplined.
If the system improves, nobody knows which move helped.
If it gets worse, the room just got noisier.
Strong answers show sequence, not just actions
A much better structure is:
- what I would confirm first
- what I would mitigate if impact is high
- what I would investigate next
- how I would communicate status and next decision
That sounds like real operational reasoning.
Simple example
Imagine this interview prompt:
After a deploy, checkout starts returning 500 for part of the traffic. What do you do first?
A weak answer sounds like:
I would open the logs, restart the pods, and roll back if needed.
That has actions, but not much order.
A stronger answer sounds more like this:
First I would confirm the real impact: whether the issue affects every checkout or only one payment path, when it started, and whether it lines up with the deploy. If the impact is high and the correlation is strong, I would consider rollback or disabling the change to stop the bleeding. In parallel, I would check the most reliable signals to narrow the failure: which endpoint is failing, which dependency is involved, whether latency changed, and what is different between affected and unaffected requests. I would avoid changing multiple things at once because that makes diagnosis worse.
That answer shows:
- containment with judgment
- investigation with sequence
- respect for evidence
- implicit communication of next steps
Common mistakes
- changing multiple things at once and destroying the diagnostic trail
- treating rollback as if it also explains the cause
- jumping into code before defining impact and scope
- naming tools instead of showing reasoning
- mixing temporary mitigation with real correction
How a senior thinks
Experienced engineers often think about incidents like this:
If the room is moving too fast, my job is to put structure back. If the room is too slow, my job is to pull the safest first action.
That is a strong lens because it shows two things at the same time:
- emotional control
- technical prioritization
Production failures rarely need genius. They need order under pressure.
What the interviewer wants to see
When this shows up in an interview, the interviewer is usually checking whether you:
- contain without destroying evidence
- investigate with method instead of guesswork
- separate symptom, hypothesis, and likely cause
- check signals before committing to a story
- communicate sequence even with incomplete context
A strong answer usually makes this clear:
- how you would scope the problem
- when you would mitigate
- which signals you would check first
- how you would avoid making the investigation dirtier
Production incidents do not need heroic guessing. They need operational clarity when the context is still incomplete.
Quick summary
What to keep in your head
- Production incidents get better when you separate mitigation, investigation, and explanation instead of doing everything at once.
- A strong response starts with scope, impact, and recent change, not with random infrastructure actions.
- Mitigation is not the same thing as diagnosis. Stopping the bleeding does not automatically explain the cause.
- The team that changes fewer things with more evidence usually learns faster and recovers more safely.
Practice checklist
Use this when you answer
- Can I explain the order between containing damage, narrowing scope, and debugging the cause?
- Do I know which signals I would check first during a live production issue?
- Can I separate symptom, hypothesis, mitigation, and likely root cause in my own explanation?
- Can I answer an interview scenario without hiding behind tool names?