Rollback and Incident Mitigation

How to decide between rolling back, turning things off, degrading, or containing a bad release with operational clarity.

Andrews Ribeiro

Founder & Engineer

Track

Startup Engineer Interview Trail

Step 6 / 10

The problem

Sometimes a release goes wrong and the team reacts with one word:

rollback

Sometimes that works.

Sometimes it does not.

The mistake is to think rollback is the universal answer for any deploy-related incident.

In practice, there is a question that comes first:

what action contains damage the fastest right now?

Rolling back the version may be one such action.

But it is not the only one.

Mental model

Rollback is an attempt to return to a previous version or behavior.

Mitigation is any action that reduces the impact of the incident quickly.

That can include:

  • version rollback
  • turning off a flag
  • cutting traffic
  • disabling a secondary integration
  • serving a degraded response
  • opening a circuit breaker

In other words:

rollback is one mitigation tool, but good mitigation does not depend on rollback alone.
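To make one of the non-rollback tools concrete, here is a minimal sketch of a circuit breaker. The class name, thresholds, and the injectable `clock` parameter are assumptions made for illustration, not something the article prescribes. The idea is that after repeated failures the caller stops hitting the failing dependency and serves a fallback until a cooldown passes.

```python
import time


class CircuitBreaker:
    """Illustrative breaker: opens after repeated failures, retries after a cooldown."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so the behavior can be tested
        self.failures = 0
        self.opened_at = None       # None means the breaker is closed

    def call(self, func, fallback):
        # While open and still cooling down, skip the dependency entirely.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()
            # Cooldown elapsed: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the breaker
            return fallback()
        # A success closes the breaker and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A production breaker would usually also limit concurrent trial calls and emit metrics; this sketch only shows the containment idea.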

Breaking the problem down

When rollback helps a lot

Rollback tends to work well when:

  • the problem clearly came from the release
  • the previous version is still compatible
  • no irreversible state change happened
  • the deploy strategy makes going back easy

In those cases, rolling back reduces damage fast.
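The four conditions above can be written down as an explicit pre-check. This is only a sketch: the `ReleaseState` fields and the `rollback_blockers` function are hypothetical names, and in a real incident each answer comes from people and dashboards rather than from booleans.

```python
from dataclasses import dataclass


@dataclass
class ReleaseState:
    # Hypothetical fields mirroring the four conditions in the list above.
    regression_started_after_deploy: bool
    previous_version_compatible: bool
    irreversible_state_change: bool
    easy_rollback_path: bool


def rollback_blockers(state: ReleaseState) -> list[str]:
    """Return the reasons rollback alone may not be enough (empty list = good candidate)."""
    blockers = []
    if not state.regression_started_after_deploy:
        blockers.append("problem is not clearly tied to the release")
    if not state.previous_version_compatible:
        blockers.append("previous version is no longer compatible")
    if state.irreversible_state_change:
        blockers.append("state changed in a way code rollback cannot undo")
    if not state.easy_rollback_path:
        blockers.append("deploy strategy does not make going back easy")
    return blockers
```

An empty list suggests rollback is a reasonable first move; any entry is a prompt to look at other mitigation options as well.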

When rollback is not enough

There are situations where going back to the previous version does not solve the problem by itself:

  • a migration already changed schema or data
  • a bad event was already published
  • a queue already accumulated inconsistent work
  • an external provider changed behavior

Here, rolling the code back may help a little, but the damage has already spread to other parts of the system.

That is why mitigation needs to be broader than rollback.

Strong mitigation thinks in terms of containment

In a real incident, good mitigation usually asks:

  • how do we shrink the blast radius now?
  • what can we disable without taking everything down?
  • what needs to degrade so the main flow stays alive?

Examples:

  • turn off non-essential functionality with a flag
  • reduce the canary percentage
  • route traffic back to the previous environment
  • block risky writes temporarily
  • respond with a simpler fallback
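Two of these options, the flag and the simpler fallback, can be sketched together. The flag name `recommendations_enabled` and the `render_cart` function are invented for this example. The point is that a non-essential feature can be switched off or allowed to fail without taking the main flow down with it.

```python
def render_cart(cart_items, flags, fetch_recommendations):
    """Build the cart page; recommendations are optional and must never block it."""
    page = {"items": list(cart_items), "recommendations": [], "degraded": False}

    # Kill switch: when the flag is off, the secondary feature is skipped entirely.
    if not flags.get("recommendations_enabled", False):
        return page

    try:
        page["recommendations"] = fetch_recommendations(cart_items)
    except Exception:
        # Degrade instead of failing: the cart still renders without suggestions.
        page["degraded"] = True
    return page
```

With this shape, turning the flag off during an incident is a configuration change rather than a deploy, which is usually what makes it fast.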

Rollback without preparation is theater

If the team says it can roll back, but in practice:

  • it does not know which version was live
  • it does not know how to go back
  • it does not trust the previous version
  • it depends on opaque manual steps

then rollback is more wish than capability.
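One way to turn that wish into something closer to a capability is to keep an explicit record of what was deployed and which versions were confirmed healthy. The `ReleaseLog` class below is a hypothetical sketch of that record; real teams would usually get this from their deploy tooling.

```python
from dataclasses import dataclass, field


@dataclass
class ReleaseLog:
    deploys: list = field(default_factory=list)   # versions in deploy order
    healthy: set = field(default_factory=set)     # versions confirmed good in production

    def record_deploy(self, version):
        self.deploys.append(version)

    def mark_healthy(self, version):
        self.healthy.add(version)

    def live_version(self):
        """The most recently deployed version, or None if nothing was deployed."""
        return self.deploys[-1] if self.deploys else None

    def rollback_target(self):
        """Most recent earlier version that differs from live and was marked healthy."""
        live = self.live_version()
        for version in reversed(self.deploys[:-1]):
            if version != live and version in self.healthy:
                return version
        return None  # no trusted version to go back to
```

If `rollback_target()` returns `None`, the team has learned before the incident, not during it, that rollback is not a real option yet.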

Simple example

Imagine a new checkout release.

After deploy:

  • errors go up
  • latency gets worse
  • conversion drops

Possible responses:

  1. disable the new flow with a flag
  2. reduce traffic to the canary version
  3. move back to the previous blue-green environment
  4. if needed, block a secondary feature to protect the main purchase flow

Notice that the immediate goal is not to “make the architecture pretty.”

It is to lower damage now.

The calmer analysis comes later.
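The choice among those responses can be framed as "pick the available option that lowers damage soonest." The sketch below assumes each option carries a rough time-to-effect estimate; the `Mitigation` type and the numbers in the usage example are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mitigation:
    name: str
    available: bool      # is this option actually set up for this release?
    est_seconds: int     # rough, assumed time until impact starts to drop


def fastest_available(options):
    """Return the available mitigation with the lowest estimated time, or None."""
    candidates = [o for o in options if o.available]
    return min(candidates, key=lambda o: o.est_seconds, default=None)


# Illustrative estimates only; real values depend on the team's tooling.
checkout_options = [
    Mitigation("disable new flow with flag", available=True, est_seconds=30),
    Mitigation("reduce canary traffic", available=False, est_seconds=60),
    Mitigation("switch back to previous blue-green environment", available=True, est_seconds=180),
    Mitigation("block secondary feature", available=True, est_seconds=120),
]
first_move = fastest_available(checkout_options)
```

In this made-up scenario the flag wins because it is both available and fastest; if no flag existed, the same function would point to blocking the secondary feature.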

Common mistakes

  • Treating rollback as the only answer to everything.
  • Ignoring that data and schema are part of the equation too.
  • Not preparing a clear way to go back or turn things off.
  • Waiting too long to mitigate while searching for the perfect root cause.
  • Confusing version restoration with state restoration.

How a senior thinks

People with more experience usually organize the response like this:

  1. contain impact
  2. stabilize service
  3. understand the cause
  4. fix and prevent repeat incidents

That reasoning avoids a common crisis mistake:

trying to solve things elegantly before reducing damage.

What the interviewer wants to see

In an interview, the evaluator wants to see whether you think operationally under pressure.

You move up a level when you:

  • separate rollback from mitigation
  • mention flags, traffic, and degradation as options
  • recognize rollback limits when state has changed
  • show that damage containment comes first

A strong answer usually sounds like this:

“If a release made the system worse, my first question is which action reduces impact right now. It might be rollback, it might be turning off a flag, it might be moving traffic, or it might be degrading a feature. Once the impact is contained, I investigate the cause more calmly.”

In an incident, elegance comes later. First comes reducing the size of the damage.
