
Gradual Rollouts With Control

How to release changes to production in deliberate stages, with a clear stop condition and an expansion rule, instead of pushing to 100 percent and hoping.

Andrews Ribeiro

Founder & Engineer

Track: Startup Engineer Interview Trail (Step 8 of 10)

The problem

Some teams call this a gradual rollout:

  1. ship the version
  2. put it at 10 percent
  3. wait a little
  4. go to 100 percent because “it looks fine”

That is not control.

That is just urgency split into stages.

Gradual rollout only helps when it reduces the blast radius of failure and increases the chance of catching problems early.

Mental model

Think about it this way:

gradual rollout is a loop of exposure, observation, and decision

First you choose who gets the change.

Then you measure what matters.

Then you decide:

  • expand
  • hold
  • roll back

If that decision is not defined before you start, the rollout is still badly designed, no matter how gradual it looks.
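
A minimal sketch of that loop, assuming hypothetical `expose`, `observe`, and `decide` functions backed by whatever flag system, metrics pipeline, and stop rule you actually run:

```python
from enum import Enum

class Decision(Enum):
    EXPAND = "expand"
    HOLD = "hold"
    ROLL_BACK = "roll_back"

def run_rollout(stages, expose, observe, decide):
    """One pass of the exposure -> observation -> decision loop.

    `stages` is the ordered list of exposure levels; `expose`,
    `observe`, and `decide` are placeholders, not a real library.
    """
    for stage in stages:
        expose(stage)            # choose who gets the change
        signal = observe(stage)  # measure what matters
        decision = decide(signal)
        if decision is Decision.ROLL_BACK:
            expose(0)            # pull all traffic off the new path
            return decision
        if decision is Decision.HOLD:
            return decision      # stop expanding; decide again later
    return Decision.EXPAND
```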

Breaking the problem down

Choose which part of the world gets the change first

A rollout can be based on:

  • instance
  • user
  • tenant
  • region
  • traffic type
  • internal team before customers

The useful question is:

which slice reduces damage without hiding the behavior I need to observe?

Sometimes 1 percent of random traffic helps.

Sometimes a controlled internal tenant helps more.
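
As a sketch of both cases together, assuming a made-up internal tenant allowlist plus deterministic hash bucketing for the random slice:

```python
import hashlib

INTERNAL_TENANTS = {"acme-internal"}  # hypothetical allowlist

def in_rollout(tenant_id: str, user_id: str, percent: float) -> bool:
    """Deterministic exposure: internal tenants first, then a stable
    hash bucket so the same user stays in or out across requests."""
    if tenant_id in INTERNAL_TENANTS:
        return True
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return bucket < percent * 100  # percent=1 -> 100 of 10_000 buckets
```

The hash makes exposure sticky: a user who saw the new path keeps seeing it, which keeps the signal you observe clean.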

Define what success and failure mean

Before you release, you need to know what you are watching.

Examples:

  • error rate
  • latency
  • timeouts
  • queue growth
  • result mismatches
  • conversion or business impact

You also need a stop trigger.

Without that, the team always finds a way to say it is “still acceptable.”
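
One way to make the stop trigger concrete is to write the thresholds down before the release; the metric names and numbers below are illustrative:

```python
# Illustrative thresholds, agreed on before the rollout starts.
STOP_IF = {
    "error_rate": 0.02,      # more than 2% of requests failing
    "p99_latency_ms": 1200,  # p99 above 1.2 seconds
    "timeout_rate": 0.01,
    "queue_depth": 5000,
}

def crossed_limits(metrics: dict) -> list[str]:
    """Return the thresholds that were crossed; an empty list means
    keep going. `metrics` is assumed to come from your monitoring."""
    return [name for name, limit in STOP_IF.items()
            if metrics.get(name, 0) > limit]
```

The point is that the rule exists before the data arrives, so nobody can bend “still acceptable” after the fact.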

Have a clear way to stop or go back

A gradual rollout without a kill switch and fast rollback is only half the story.

You need to know:

  • how to stop new exposure
  • how to pull traffic back
  • what happens to already written state
  • whether anything irreversible is in the path

That last point matters a lot.

If the system has already written incompatible data, turning traffic off may not be enough.
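
A hedged sketch of the stop path; the `flags` client and its method names are assumptions, not a specific library:

```python
def emergency_stop(flags, flag_name: str) -> None:
    """Stop new exposure and pull traffic back in one tested step.

    `flags` stands in for whatever flag service you actually run;
    these method names are illustrative, not a real client API.
    """
    flags.set_percentage(flag_name, 0)  # no new exposure
    flags.disable(flag_name)            # existing traffic falls back

# Note: turning traffic off does not undo state the new code already
# wrote. If the new path persisted an incompatible format, you also
# need a data plan (backfill, dual-read, or forward-compatible
# writes); no kill switch covers that.
```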

Expand in stages with enough time to observe

Not every failure shows up in 30 seconds.

Some depend on:

  • volume
  • time of day
  • rare request shapes
  • a scheduled job running later
  • old data hitting the new path

So a good stage is not only about percentage.

It is percentage plus time plus correct signal reading.
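
A stage can then be modeled as percentage plus soak time plus the signals that must stay healthy; the plan below is a sketch with made-up numbers:

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class Stage:
    percent: float      # how much of the world is exposed
    soak: timedelta     # minimum observation time before deciding
    signals: list[str]  # metrics that must stay healthy at this stage

# Illustrative plan: each step buys observation time, not just reach.
PLAN = [
    Stage(1,   timedelta(hours=2),  ["error_rate", "p99_latency_ms"]),
    Stage(10,  timedelta(hours=12), ["error_rate", "queue_depth"]),
    Stage(50,  timedelta(days=1),   ["error_rate", "conversion"]),
    Stage(100, timedelta(days=2),   ["error_rate"]),
]
```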

Separate deploy, activation, and exposure

This distinction avoids a lot of confusion.

You can:

  • deploy the code
  • keep the feature off
  • activate it for a small group
  • expand gradually

When the team mixes those layers, it loses the ability to operate a release calmly.
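
A minimal sketch of the three layers in one request path; `flags.is_enabled` and the helper functions are assumed interfaces, not a real library:

```python
def legacy_checkout(request):   # placeholder: existing path
    ...

def new_checkout(request):      # placeholder: new path
    ...

def is_exposed(tenant_id, user_id) -> bool:
    ...                         # your cohort rule, e.g. hash bucketing

def checkout(request, flags):
    """Deploy put both paths in the binary; the flag decides
    activation and the cohort rule decides exposure."""
    if not flags.is_enabled("new_checkout"):    # activation layer
        return legacy_checkout(request)
    if not is_exposed(request.tenant_id, request.user_id):  # exposure
        return legacy_checkout(request)
    return new_checkout(request)
```

Deploy changes the binary, the flag changes activation, the cohort rule changes exposure; rolling any one of them back does not require touching the other two.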

Simple example

Imagine a new checkout flow.

Weak plan:

  • deploy on Friday
  • release 5 percent
  • if nobody complains in 10 minutes, go to 100 percent

Better plan:

  1. deploy the new code without activating it
  2. turn it on first for the internal team
  3. release it to a small tenant
  4. measure error rate, latency, abandonment, and payment integration failures
  5. expand to more tenants or a higher percentage
  6. keep a clear pause rule if regression appears

Notice that the point is not the number itself.

The point is that each step buys learning before it buys more risk.
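
One way to keep that plan honest is to write it down as data the whole team reviews before the deploy; the cohort names and metrics below are made up:

```python
# Illustrative version of the "better plan", as reviewable data.
CHECKOUT_ROLLOUT = [
    {"cohort": "internal-team",  "watch": ["errors", "latency"]},
    {"cohort": "tenant:pilot-1", "watch": ["errors", "abandonment",
                                           "payment_failures"]},
    {"cohort": "percent:10",     "watch": ["errors", "abandonment",
                                           "payment_failures"]},
    {"cohort": "percent:100",    "watch": ["errors", "latency"]},
]
# The pause rule is agreed on before step 1, not negotiated mid-rollout.
PAUSE_IF = "any watched metric regresses past its agreed threshold"
```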

Common mistakes

  • Confusing gradual rollout with gradual deploy.
  • Choosing a random cohort when the issue depends on a specific tenant.
  • Rolling out without a baseline to compare against.
  • Expanding based on calendar and anxiety instead of signal.
  • Ignoring that schemas or contracts can still break while versions coexist.

How a senior thinks

People with more experience do not only ask:

“What percentage should I start with?”

They ask something more like this:

“Which failure do I want to catch early, which group will show me that, and how do I stop without making the damage worse?”

That way of thinking is less about the tool and more about operational control.

What the interviewer wants to see

In an interview, the evaluator wants to see whether you can turn a release into an observable process, not a polished guess.

You move up a level when you:

  • separate deploy from activation
  • choose the cohort with intent
  • talk about metrics and stop conditions
  • consider rollback and state compatibility

A strong answer usually sounds like this:

“I would do a gradual rollout by defining the exposure unit first, then the success and failure metrics, and a clear way to pause or roll back. I would only expand when the signal was stable, not just because some time had passed.”

A good gradual rollout is not the one that moves slowly. It is the one that learns fast before exposing the rest.

If you do not know when to stop, you are not controlling anything.
