Investigating Production Failures

How to investigate a real production problem with a clear process for evidence, containment, and communication.

Andrews Ribeiro

Founder & Engineer

Track: Senior Full Stack Interview Trail (Step 7 / 14)

The problem

Production failures create pressure very quickly.

That pressure often turns into movement before it turns into understanding.

The team restarts pods, increases timeouts, adds logs everywhere, rolls something back, and ships a speculative fix before anyone has clearly described what actually broke.

Sometimes the symptom goes away. The problem is that nobody learns why.

Mental model

During a production incident, your first job is not to move fast.

Your first job is to reduce damage and uncertainty in the right order.

A useful question is:

What do we know for sure, what is only a guess, what changed recently, and what action lowers risk without making diagnosis messier?

That question is valuable because it keeps the room from turning into an operational lottery.

Breaking it down

Containment and investigation are not the same thing

Sometimes the first move is mitigation:

  • rollback
  • turning off a feature flag
  • routing traffic away
  • shrinking the blast radius

Sometimes the first move is to observe a little better before changing anything.

The deciding questions are usually:

  • is the damage active and meaningful?
  • is the mitigation safe and reversible?
  • does waiting increase real cost?

When the answers are yes, containment usually comes first.

But good containment is not random destruction. It reduces damage without wiping out your ability to understand what happened.
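To make that tradeoff concrete, here is a minimal Python sketch that encodes the three deciding questions. The type and field names are illustrative, not from any real incident tooling:

```python
from dataclasses import dataclass

@dataclass
class IncidentState:
    damage_is_active: bool          # users or money are being hurt right now
    mitigation_is_reversible: bool  # e.g. a flag flip or a rollback we can undo
    waiting_has_real_cost: bool     # every minute of delay makes things worse

def contain_first(state: IncidentState) -> bool:
    """Mitigate before investigating only when all three answers are yes."""
    return (
        state.damage_is_active
        and state.mitigation_is_reversible
        and state.waiting_has_real_cost
    )

# Active damage, safe reversible mitigation, waiting is expensive: contain first.
print(contain_first(IncidentState(True, True, True)))   # True
# Mitigation is risky or irreversible: observe a little better first.
print(contain_first(IncidentState(True, False, True)))  # False
```

The point of writing it this way is that containment is a decision with conditions, not a reflex.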

Define the symptom before chasing the cause

A lot of bad incident response starts by trying to explain something that has not even been scoped yet.

Before theories, lock down:

  • what error is happening
  • since when
  • in which flow
  • for whom
  • with what severity

That alone removes a lot of noise.
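One way to force that discipline is to treat the symptom as a record that must be filled in before anyone proposes a cause. A minimal sketch, with hypothetical field names and values loosely based on the checkout example later in this article:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class SymptomStatement:
    """One scoped symptom, written down before any theory."""
    what: str        # the observable error, not the suspected cause
    since: datetime  # when it started
    flow: str        # which flow is affected
    who: str         # which users or slice of traffic
    severity: str    # how bad it is right now

# Hypothetical values for illustration only.
symptom = SymptomStatement(
    what="HTTP 500 on POST /checkout",
    since=datetime(2024, 5, 10, 14, 32),
    flow="checkout",
    who="part of the traffic, one payment path",
    severity="high: purchases are failing",
)
print(symptom)
```

If a field cannot be filled in yet, that gap is itself the next investigation step.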

Recent change is still a strong clue

One of the first useful questions in a real incident is still: what changed near the start of the problem?

That change might be:

  • a deploy
  • a config change
  • a feature flag
  • a third-party dependency
  • a traffic pattern shift

Not because recent change proves guilt.

Because it narrows the search space.
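To show how cheap this narrowing is, here is a toy Python sketch. The change feed, timestamps, and window size are all assumptions; in practice the feed would come from your deploy tool, config system, or flag service:

```python
from datetime import datetime, timedelta

# Hypothetical change feed; entries are made up for illustration.
changes = [
    ("deploy",       "checkout-service v2.41",  datetime(2024, 5, 10, 14, 28)),
    ("feature_flag", "new_payment_path -> on",  datetime(2024, 5, 10, 14, 30)),
    ("config",       "db_pool_size 20 -> 50",   datetime(2024, 5, 10, 9, 5)),
]

incident_start = datetime(2024, 5, 10, 14, 32)
window = timedelta(minutes=30)

# Keep only changes that landed shortly before the symptom started.
suspects = [
    (kind, desc, when)
    for kind, desc, when in changes
    if timedelta(0) <= incident_start - when <= window
]
for kind, desc, when in suspects:
    print(f"{when:%H:%M}  {kind}: {desc}")
```

Two suspects instead of everything that shipped this week: that is the whole value of the question.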

Look at reliable signals before changing too much

This is where you check:

  • metrics
  • logs
  • traces
  • error rate by flow
  • correlation with a deploy or a component

The logic matters more than the tool name:

  • confirm impact
  • narrow scope
  • compare to normal
  • identify the highest-probability suspect

If you jump straight into changing infra, config, and code before doing this, the response stops looking like investigation and starts looking like technical anxiety.
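As a tiny illustration of "error rate by flow" plus "compare to normal", here is a Python sketch over made-up request records and an assumed baseline; a real version would query your metrics or log pipeline instead of an in-memory list:

```python
from collections import Counter

# Hypothetical request records: (flow, status code).
requests = [
    ("checkout", 500), ("checkout", 200), ("checkout", 500), ("checkout", 500),
    ("search", 200), ("search", 200), ("login", 200),
]

total = Counter(flow for flow, _ in requests)
errors = Counter(flow for flow, status in requests if status >= 500)

# Assumed baseline error rates, for the "compare to normal" step.
baseline = {"checkout": 0.01, "search": 0.02, "login": 0.01}

for flow in total:
    rate = errors[flow] / total[flow]
    suspect = " <-- highest-probability suspect" if rate > 5 * baseline[flow] else ""
    print(f"{flow}: {rate:.0%} errors (normal ~{baseline[flow]:.0%}){suspect}")
```

The output points at one flow, not at the whole system, which is exactly what the next decision needs.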

Changing many things at once ruins the experiment

This is where weak answers fall apart.

If the response is:

  • restart the pods
  • raise the timeout
  • add more logs
  • roll back

all in one breath, it sounds active but not disciplined.

If the system improves, nobody knows which move helped.

If it gets worse, the room just got noisier.
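One lightweight way to keep the experiment clean is to log one change at a time, together with the signal it is expected to move, and only act again after observing. A toy sketch; the function names and numbers are invented:

```python
from datetime import datetime, timezone

action_log = []  # one entry per change, so cause and effect stay linked

def apply_action(description: str, error_rate_before: float) -> None:
    """Record exactly one change and the signal it is expected to move."""
    action_log.append({
        "when": datetime.now(timezone.utc),
        "action": description,
        "error_rate_before": error_rate_before,
        "error_rate_after": None,  # filled in only after observing
    })

def observe(error_rate_after: float) -> None:
    """Close out the last action before anyone makes the next change."""
    action_log[-1]["error_rate_after"] = error_rate_after

# One change, then one observation: now we know the rollback is what helped.
apply_action("roll back checkout-service to v2.40", error_rate_before=0.75)
observe(error_rate_after=0.01)
print(action_log)
```

The log is not the point; the one-change-then-observe rhythm is.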

Strong answers show sequence, not just actions

A much better structure is:

  1. what I would confirm first
  2. what I would mitigate if impact is high
  3. what I would investigate next
  4. how I would communicate status and next decision

That sounds like real operational reasoning.

Simple example

Imagine this interview prompt:

After a deploy, checkout starts returning 500 for part of the traffic. What do you do first?

A weak answer sounds like:

I would open the logs, restart the pods, and roll back if needed.

That has actions, but not much order.

A stronger answer sounds more like this:

First I would confirm the real impact: whether the issue affects every checkout or only one payment path, when it started, and whether it lines up with the deploy. If the impact is high and the correlation is strong, I would consider rollback or disabling the change to stop the bleeding. In parallel, I would check the most reliable signals to narrow the failure: which endpoint is failing, which dependency is involved, whether latency changed, and what is different between affected and unaffected requests. I would avoid changing multiple things at once because that makes diagnosis worse.

That answer shows:

  • containment with judgment
  • investigation with sequence
  • respect for evidence
  • implicit communication of next steps

Common mistakes

  • changing multiple things at once and destroying the diagnostic trail
  • treating rollback as if it also explains the cause
  • jumping into code before defining impact and scope
  • naming tools instead of showing reasoning
  • mixing temporary mitigation with real correction

How a senior thinks

Experienced engineers often think about incidents like this:

If the room is moving too fast, my job is to put structure back. If the room is too slow, my job is to pull the safest first action.

That is a strong lens because it shows two things at the same time:

  • emotional control
  • technical prioritization

Production failures rarely need genius. They need order under pressure.

What the interviewer wants to see

When this shows up in an interview, the interviewer is usually checking whether you:

  • contain without destroying evidence
  • investigate with method instead of guesswork
  • separate symptom, hypothesis, and likely cause
  • check signals before committing to a story
  • communicate sequence even with incomplete context

A strong answer usually makes this clear:

  1. how you would scope the problem
  2. when you would mitigate
  3. which signals you would check first
  4. how you would avoid making the investigation dirtier

Production incidents do not need heroic guessing. They need operational clarity when the context is still incomplete.
