September 30 2025

Failure and recovery scenarios

How to think about a real system when some part breaks, without treating resilience like a slogan.

Andrews Ribeiro

Founder & Engineer

4 min Intermediate Systems

#system-design #debugging-production #systems #incidents #resilience #recovery

Track

System Design Interviews - From Basics to Advanced

Step 15 / 19


The problem

Many architecture discussions talk about availability as if the system were too elegant to fail.

When a dependency really does go down, the reaction becomes improvisation:

infinite retries
giant timeouts
restart and hope

That is not resilience.

That is panic with technical vocabulary.

Mental model

Failure is part of the system.

Thinking clearly about recovery means answering four questions:

what broke
who depends on it
what can degrade and what must stop
how the system returns to a coherent state

If you answer those early, the design stops sounding like a slogan and starts sounding like real operations.
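The second question, who depends on it, can be answered mechanically once dependencies are written down. Here is a minimal sketch, assuming a hand-maintained dependency map. Every name in it (checkout, orders-api, payment-gateway, and so on) is hypothetical and only illustrates the idea.

```python
from collections import deque

# Hypothetical dependency map: each flow or component lists what it calls directly.
DEPENDS_ON = {
    "checkout": ["orders-api"],
    "orders-api": ["payment-gateway", "orders-db"],
    "order-history": ["orders-db"],
    "product-search": ["search-index"],
}


def affected_by(failed: str) -> set[str]:
    """Return everything that depends, directly or transitively, on `failed`."""
    # Invert the map so we can walk from a dependency up to its callers.
    dependents: dict[str, set[str]] = {}
    for name, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(name)

    affected: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, ()):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


if __name__ == "__main__":
    print(sorted(affected_by("payment-gateway")))
```

In this made-up map, a gateway failure reaches checkout through the orders API, while product search is untouched. That is the kind of statement the rest of the analysis can build on.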

Breaking it down

Name the failure concretely

“The system is down” rarely helps.

It is much more useful to say something like:

the database is unavailable
the payment gateway is slow
the broker is delaying delivery
storage is rejecting writes

Once the failure is concrete, the rest of the analysis gets better.

Separate what stops from what degrades

Not every failure needs to become a full error for the user.

Sometimes a degraded mode is acceptable.

But it has to be explicit.

Examples:

reads continue, writes stop
the order is accepted as pending
the report takes longer, but does not disappear
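One way to make degraded mode explicit is to write it down as a policy table instead of leaving it to each handler. The sketch below is only an illustration: the dependency names, operation names, and the rule that unlisted pairs mean "not affected" are all assumptions.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    DEGRADED = "degraded"
    STOPPED = "stopped"


# Hypothetical policy: (failed dependency, operation) -> how the operation behaves.
# Pairs that are not listed are assumed to be unaffected by that failure.
POLICY = {
    ("database-primary", "read"): Mode.DEGRADED,        # reads continue from a replica
    ("database-primary", "write"): Mode.STOPPED,        # writes stop
    ("database-primary", "place_order"): Mode.STOPPED,
    ("payment-gateway", "place_order"): Mode.DEGRADED,  # order accepted as pending
    ("report-worker", "generate_report"): Mode.DEGRADED,  # slower, but still delivered
}


def mode_for(operation: str, failed: set[str]) -> Mode:
    """Pick the most restrictive mode across all currently failed dependencies."""
    modes = {POLICY.get((dep, operation), Mode.NORMAL) for dep in failed}
    if Mode.STOPPED in modes:
        return Mode.STOPPED
    if Mode.DEGRADED in modes:
        return Mode.DEGRADED
    return Mode.NORMAL


if __name__ == "__main__":
    print(mode_for("read", {"database-primary"}))   # Mode.DEGRADED
    print(mode_for("write", {"database-primary"}))  # Mode.STOPPED
```

The table is the point, not the function. If a failure and an operation have no entry, someone has to decide whether that is intentional.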

Treat retry as a dangerous tool

Retry helps with transient failure.

But retry without limits can:

grow the queue
saturate an already unhealthy dependency
duplicate side effects

So good retry usually comes with:

a limit
backoff
idempotency
a queue or quarantine when needed

It also needs a clear point where the system stops trying.

If it never knows when to give up, it turns a local failure into a wider incident.
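The ingredients above fit in a small wrapper. This is a sketch under stated assumptions, not a production client: `TransientError`, the `operation(key)` signature, the default limits, and the plain-list quarantine are all hypothetical, and it assumes the callee deduplicates on the idempotency key.

```python
import random
import time
import uuid


class TransientError(Exception):
    """Raised by the (hypothetical) operation when a retry might help."""


def call_with_retry(operation, *, max_attempts=4, base_delay=0.2, max_delay=5.0,
                    quarantine=None, sleep=time.sleep):
    """Call operation(idempotency_key) with a hard attempt limit and backoff."""
    # The same key is sent on every attempt so the callee can drop duplicates.
    key = str(uuid.uuid4())
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(key)
        except TransientError as exc:
            if attempt == max_attempts:
                # Clear stopping point: park the work for inspection and give up.
                if quarantine is not None:
                    quarantine.append((key, repr(exc)))
                raise
            # Exponential backoff with jitter, capped at max_delay seconds.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


if __name__ == "__main__":
    parked = []

    def always_down(key):
        raise TransientError("gateway timeout")

    try:
        call_with_retry(always_down, max_attempts=3, quarantine=parked,
                        sleep=lambda seconds: None)
    except TransientError:
        print("gave up after 3 attempts; quarantined items:", len(parked))
```

Only errors marked as transient are retried. Anything else propagates immediately, because retrying a permanent failure just adds load.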

Recovery means returning to a trustworthy state

This is where shallow answers often separate from mature ones.

Recovery is not only “the service responds again.”

It is “the service responds again” without leaving behind:

duplicate payments
orphaned orders
misleading statuses
confusing reprocessing
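A check like the following can tell you whether the state is trustworthy after an incident. It is a rough sketch: the status values, the shape of the inputs, and the category names in the report are assumptions chosen for illustration.

```python
from collections import Counter


def find_inconsistencies(orders: dict[str, str], charges: list[str]) -> dict[str, list[str]]:
    """Compare local order statuses with charges recorded by the payment provider.

    orders:  order_id -> status ("PAID", "PENDING", ...), from our own database.
    charges: one order_id per successful charge, as reported by the provider.
    """
    charge_counts = Counter(charges)
    report = {
        "duplicate_payments": [],    # charged more than once
        "paid_without_charge": [],   # we say PAID, provider has no charge
        "charged_but_not_paid": [],  # provider charged, we still say otherwise
        "charge_without_order": [],  # provider charged an order we do not know
    }
    for order_id, count in charge_counts.items():
        if count > 1:
            report["duplicate_payments"].append(order_id)
        if order_id not in orders:
            report["charge_without_order"].append(order_id)
    for order_id, status in orders.items():
        charged = charge_counts[order_id] > 0
        if status == "PAID" and not charged:
            report["paid_without_charge"].append(order_id)
        elif status != "PAID" and charged:
            report["charged_but_not_paid"].append(order_id)
    # Sort so the report is stable between runs.
    return {name: sorted(ids) for name, ids in report.items()}


if __name__ == "__main__":
    orders = {"o1": "PAID", "o2": "PAID", "o3": "PENDING"}
    charges = ["o1", "o1", "o3"]
    for name, ids in find_inconsistencies(orders, charges).items():
        print(name, ids)
```

An empty report is a reasonable working definition of "recovered" for this flow. A service that responds again with a non-empty report has only recovered availability.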

Simple example

Imagine an orders API that depends on a payment gateway.

If the gateway fails, you might choose to:

block new purchases
accept the order and mark payment as PENDING
accept within limits and retry payment later

Each choice has a cost.

Blocking everything protects consistency, but hurts conversion immediately.

Accepting pending orders keeps the flow alive, but creates operational debt and sets a user expectation that the system must later honor or explicitly cancel.

A mature answer could sound like this:

If the gateway fails, I do not want the system to pretend nothing happened. I might accept the order in a pending state, with a clear time limit, and retry payment through a queue. I also need to prevent unbounded retry and have a reconciliation path to clean up old pending states.
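That answer can be sketched as one pass of a queue worker over a pending order. Everything specific here is an assumption: the 30-minute limit, the attempt cap, the status names, and the `charge(order_id)` callable, which is assumed to be idempotent per order.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

PENDING_LIMIT = timedelta(minutes=30)  # assumed time limit for a pending order
MAX_PAYMENT_ATTEMPTS = 5               # assumed cap so retry is never unbounded


@dataclass
class Order:
    order_id: str
    created_at: datetime
    status: str = "PENDING"
    attempts: int = 0


def retry_pending(order: Order, charge, now: datetime) -> str:
    """Process one pending order: retry payment, or clean it up if it is too old.

    `charge(order_id) -> bool` is a hypothetical, idempotent call to the gateway.
    """
    if order.status != "PENDING":
        return order.status
    expired = now - order.created_at > PENDING_LIMIT
    if expired or order.attempts >= MAX_PAYMENT_ATTEMPTS:
        # Reconciliation path: old pending orders do not stay pending forever.
        order.status = "CANCELLED"
        return order.status
    order.attempts += 1
    if charge(order.order_id):
        order.status = "PAID"
    return order.status


if __name__ == "__main__":
    t0 = datetime(2025, 1, 1, 12, 0)
    order = Order("o1", created_at=t0)
    print(retry_pending(order, lambda _id: False, t0 + timedelta(minutes=1)))
    print(retry_pending(order, lambda _id: True, t0 + timedelta(minutes=5)))
```

A real system would likely confirm with the gateway that no charge went through before cancelling, since a timed-out call may still have succeeded. The sketch leaves that step out to keep the time limit and retry cap visible.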

Now the answer is not only talking about failure.

It is talking about:

behavior during failure
user experience
return to consistency
the cost of the recovery choice itself

Common mistakes

Talking about retry without limits.
Calling every fallback resilience.
Ignoring the user-facing state during failure.
Recovering availability and forgetting consistency.
Treating failure as an operational detail separate from the product.

How a senior thinks

Someone with more experience often replaces the question “how do we avoid failure?” with a more useful one:

When this breaks, what do I want the system to do explicitly?

That question is powerful because it forces you to define:

degraded mode
user visibility
retry limits
state cleanup

What the interviewer wants to see

In interviews, failure and recovery measure operational maturity.

The interviewer wants to see whether you:

treat failure as part of the flow
define acceptable degradation
think about retry with limits
care about returning to a consistent state

A mature architecture does not promise that nothing breaks. It decides in advance what to do when it does.

Real recovery is not restart. It is a controlled return to a state the system can trust again.

Quick summary

What to keep in your head

Failure is part of the system. The architecture needs to decide in advance what stops and what degrades.
Retry without limits is not resilience. It is a fast way to make a bad situation worse.
Recovery is not only about serving traffic again. It is about serving traffic again without corrupting state.
In interviews, a strong answer talks about behavior during failure and after failure.

Practice checklist

Use this when you answer

Can I name which component failed and which flows were affected?
Can I say what degrades and what must stop?
Can I explain how to avoid duplication, inconsistency, or retry storms?
Can I describe how the system returns to a coherent state?
