May 9 2025

Rollback and Incident Mitigation

How to decide between rolling back, turning things off, degrading, or containing a bad release with operational clarity.

Andrews Ribeiro

Founder & Engineer

4 min Intermediate Systems

#debugging-production #deploy #rollback #incidents #mitigation #infra

Track

Startup Engineer Interview Trail

Step 6 / 10


The problem

Sometimes a release goes wrong and the team reacts with one word:

rollback

Sometimes that works.

Sometimes it does not.

The mistake is to think rollback is the universal answer for any deploy-related incident.

In practice, there is a question that comes first:

what action contains damage the fastest right now?

Rolling back the version may be one such action.

But it is not the only one.

Mental model

Rollback is an attempt to return to a previous version or behavior.

Mitigation is any action that reduces the impact of the incident quickly.

That can include:

version rollback
turning off a flag
cutting traffic
disabling a secondary integration
serving a degraded response
opening a circuit breaker

In other words:

rollback is one mitigation tool, but good mitigation does not depend on rollback alone.
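
A minimal sketch of that idea: treat every mitigation as an option with a rough time to effect, and pick the fastest one that is actually usable. The names `Mitigation` and `fastest_available`, and the timings, are assumptions made up for illustration, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mitigation:
    name: str
    seconds_to_effect: int  # rough estimate of how long until impact drops
    available: bool         # is the mechanism wired up and tested today?


def fastest_available(options):
    """Return the usable mitigation with the shortest time to effect, or None."""
    candidates = [m for m in options if m.available]
    return min(candidates, key=lambda m: m.seconds_to_effect, default=None)


# Hypothetical numbers for one incident.
options = [
    Mitigation("version rollback", 600, True),
    Mitigation("turn off a flag", 30, True),
    Mitigation("cut traffic", 120, False),  # not set up, so not a real option
]

choice = fastest_available(options)
print(choice.name if choice else "no mitigation available")
```

Here rollback is on the list, but it is neither the only entry nor the fastest one.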

Breaking the problem down

When rollback helps a lot

Rollback tends to work well when:

the problem clearly came from the release
the previous version is still compatible
no irreversible state change happened
the deploy strategy makes going back easy

In those cases, rolling back reduces damage fast.
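
The four conditions above can be sketched as a quick pre-rollback check. This is an illustration only: `ReleaseContext` and `rollback_blockers` are hypothetical names, and in a real incident the answers come from people and dashboards, not from booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseContext:
    caused_by_release: bool
    previous_version_compatible: bool
    irreversible_state_change: bool
    easy_to_revert: bool


def rollback_blockers(ctx):
    """List the reasons a plain rollback may not be clean. Empty means go."""
    blockers = []
    if not ctx.caused_by_release:
        blockers.append("problem is not clearly tied to the release")
    if not ctx.previous_version_compatible:
        blockers.append("previous version is no longer compatible")
    if ctx.irreversible_state_change:
        blockers.append("irreversible state change already happened")
    if not ctx.easy_to_revert:
        blockers.append("deploy strategy makes going back hard")
    return blockers


clean = ReleaseContext(True, True, False, True)
print(rollback_blockers(clean))
```

An empty list means rollback is a good candidate. Anything else is a reason to look at other mitigations first.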

When rollback is not enough

There are situations where going back to the previous version does not solve the problem by itself:

a migration already changed schema or data
a bad event was already published
a queue already accumulated inconsistent work
an external provider changed behavior

Here, rolling the code back may help a little, but the damage has already spread to other parts of the system.

That is why mitigation needs to be broader than rollback.
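
A toy simulation of the migration case, assuming a hypothetical release that rewrites stored rows in place. `Service` and the field names are invented for illustration.

```python
class Service:
    def __init__(self):
        self.version = "v1"
        self.rows = [{"id": 1, "price_cents": 1999}]

    def deploy_v2_with_migration(self):
        self.version = "v2"
        # The migration rewrites stored data in place and drops the old field.
        for row in self.rows:
            row["price"] = row.pop("price_cents") / 100

    def rollback_code(self):
        # Only the code version goes back. Stored rows are untouched.
        self.version = "v1"

    def v1_can_read(self, row):
        # The old code only understands the old field.
        return "price_cents" in row


service = Service()
service.deploy_v2_with_migration()
service.rollback_code()

print(service.version)                       # the code is back on v1
print(service.v1_can_read(service.rows[0]))  # the data is still in v2 shape
```

The version was restored. The state was not. Undoing that needs a separate, deliberate step.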

Strong mitigation thinks in terms of containment

In a real incident, good mitigation usually asks:

how do we shrink the blast radius now?
what can we disable without taking everything down?
what needs to degrade so the main flow stays alive?

Examples:

turn off non-essential functionality with a flag
reduce the canary percentage
route traffic back to the previous environment
block risky writes temporarily
respond with a simpler fallback
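
Two of those examples, a flag and a simpler fallback, can be sketched together. Everything here is hypothetical: the flag name `recommendations_enabled`, the `checkout_page` function, and the failing `fetch_recommendations` stand in for whatever the real system uses.

```python
def fetch_recommendations(user_id):
    # Stand-in for a secondary integration that is currently failing.
    raise TimeoutError("recommendation service is not responding")


def checkout_page(user_id, flags, fetch=fetch_recommendations):
    """Build the checkout page, keeping the main flow alive no matter what."""
    page = {"cart": f"cart-for-{user_id}", "recommendations": []}

    # Flag off: skip the secondary feature without a new release.
    if not flags.get("recommendations_enabled", False):
        return page

    try:
        page["recommendations"] = fetch(user_id)
    except Exception:
        # Degrade: serve the page without recommendations instead of failing.
        pass
    return page


print(checkout_page(42, {"recommendations_enabled": True}))
print(checkout_page(42, {"recommendations_enabled": False}))
```

In both calls the cart still renders. The secondary feature is what gets sacrificed.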

Rollback without preparation is theater

If the team says it can roll back, but in practice:

it does not know which version was live
it does not know how to go back
it does not trust the previous version
it depends on opaque manual steps

then rollback is more wish than capability.
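
One way to make the first two points concrete is to keep a deploy history and be able to answer "which version do we go back to?" before the incident. A minimal sketch, assuming a hypothetical in-memory history where the last entry is the live version:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deploy:
    version: str
    healthy: bool  # did this version pass health checks while it was live?


def rollback_target(history):
    """Return the most recent healthy version before the live one, or None."""
    if len(history) < 2:
        return None  # nothing to go back to
    for deploy in reversed(history[:-1]):
        if deploy.healthy:
            return deploy.version
    return None


history = [
    Deploy("v1", healthy=True),
    Deploy("v2", healthy=False),  # was itself a bad release
    Deploy("v3", healthy=False),  # live right now and failing
]

print(rollback_target(history))
```

If this function returns `None`, or nobody can build the history in the first place, the team does not really have rollback.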

Simple example

Imagine a new checkout release.

After deploy:

errors go up
latency gets worse
conversion drops

Possible responses:

disable the new flow with a flag
reduce traffic to the canary version
move back to the previous blue-green environment
if needed, block a secondary feature to protect the main purchase flow

Notice that the immediate goal is not to “make the architecture pretty.”

It is to lower damage now.

The calmer analysis comes later.
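
The checkout responses above can be sketched as a ladder: apply the cheapest step, check whether the service recovered, and only then escalate. The error rates and the 2% threshold are invented for the simulation, and `mitigate` is a hypothetical helper.

```python
def mitigate(steps, is_healthy):
    """Apply steps in order and stop as soon as the service looks healthy."""
    applied = []
    for name, action in steps:
        action()
        applied.append(name)
        if is_healthy():
            break
    return applied


# Simulated checkout metrics. Real ones would come from monitoring.
state = {"error_rate": 0.20}


def disable_new_flow():
    state["error_rate"] = 0.08


def reduce_canary():
    state["error_rate"] = 0.01


def switch_to_previous_environment():
    state["error_rate"] = 0.0


steps = [
    ("disable the new flow with a flag", disable_new_flow),
    ("reduce traffic to the canary", reduce_canary),
    ("move back to the previous blue-green environment", switch_to_previous_environment),
]

applied = mitigate(steps, lambda: state["error_rate"] < 0.02)
print(applied)
```

In this simulation the third step is never needed, because the first two already bring errors under the threshold.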

Common mistakes

Treating rollback as the only answer to everything.
Ignoring that data and schema are part of the equation too.
Not preparing a clear way to go back or turn things off.
Waiting too long to mitigate while searching for the perfect root cause.
Confusing version restoration with state restoration.

How a senior thinks

People with more experience usually organize the response like this:

contain impact
stabilize service
understand the cause
fix and prevent repeat incidents

That reasoning avoids a common crisis mistake:

trying to solve things elegantly before reducing damage.

What the interviewer wants to see

In an interview, the evaluator wants to see whether you think operationally under pressure.

You move up a level when you:

separate rollback from mitigation
mention flags, traffic, and degradation as options
recognize rollback limits when state has changed
show that damage containment comes first

A strong answer usually sounds like this:

“If a release made the system worse, my first question is which action reduces impact right now. It might be rollback, it might be turning off a flag, it might be moving traffic, or it might be degrading a feature. After that, I investigate the cause more calmly.”

In an incident, elegance comes later. First comes reducing the size of the damage.

Quick summary

What to keep in your head

Rollback and mitigation address different moments of an incident, even if they often happen together.
Not every failure allows a clean return to the previous version, especially when state and migrations are involved.
Fast mitigation can mean turning off a flag, reducing traffic, degrading a feature, or opening a circuit breaker.
A strong response plan thinks first about how to contain damage, not how to look heroic during the crisis.

Practice checklist

Use this when you answer

Can I explain when rollback helps and when it does not solve the problem by itself?
Do I know examples of mitigation that do not depend on a new release?
Can I talk about state and migration as limits on simple rollback?
Can I answer an incident by thinking first about damage containment?