September 30, 2025
Failure and recovery scenarios
How to think about a real system when some part breaks, without treating resilience like a slogan.
Andrews Ribeiro
Founder & Engineer
Track: System Design Interviews - From Basics to Advanced (Step 15 / 19)
The problem
Many architecture discussions talk about availability as if the system were too elegant to fail.
When a dependency really does go down, the reaction becomes improvisation:
- infinite retries
- giant timeouts
- restart and hope
That is not resilience.
That is panic with technical vocabulary.
Mental model
Failure is part of the system.
Thinking clearly about recovery means answering four questions:
- what broke
- who depends on it
- what can degrade and what must stop
- how the system returns to a coherent state
If you answer those early, the design stops sounding like a slogan and starts sounding like real operations.
Breaking it down
Name the failure concretely
“The system is down” rarely helps.
It is much more useful to say something like:
- the database is unavailable
- the payment gateway is slow
- the broker is delaying delivery
- storage is rejecting writes
Once the failure is concrete, the rest of the analysis gets better.
Separate what stops from what degrades
Not every failure needs to become a full error for the user.
Sometimes a degraded mode is acceptable.
But it has to be explicit.
Examples:
- reads continue, writes stop
- the order is accepted as pending
- the report takes longer, but does not disappear
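A minimal sketch of what "explicit" can mean in code, using the "reads continue, writes stop" example. The names `Mode`, `OrderStore`, and `WritesUnavailable` are hypothetical, and the in-memory dictionary stands in for real storage. The point is only that the degraded mode is a named state with a named error, not a side effect of something timing out.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    READ_ONLY = "read_only"  # degraded mode: reads continue, writes stop


class WritesUnavailable(Exception):
    """Hypothetical error that lets callers show an explicit 'try later' state."""


class OrderStore:
    """In-memory stand-in for a real store (an assumption for illustration)."""

    def __init__(self):
        self.mode = Mode.NORMAL
        self._orders = {}

    def read(self, order_id):
        # Reads are served in both modes.
        return self._orders.get(order_id)

    def write(self, order_id, data):
        if self.mode is Mode.READ_ONLY:
            # The degradation is explicit: a named error, not a silent drop.
            raise WritesUnavailable("storage is rejecting writes")
        self._orders[order_id] = data
```

With this shape, the API layer can catch `WritesUnavailable` and tell the user exactly what is happening instead of returning a generic error.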
Treat retry as a dangerous tool
Retry helps with transient failure.
But retry without limits can:
- grow the queue
- saturate an already unhealthy dependency
- duplicate side effects
So good retry usually comes with:
- a limit
- backoff
- idempotency
- a queue or quarantine when needed
It also needs a clear point where the system stops trying.
If it never knows when to give up, it turns a local failure into a wider incident.
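The limit, the backoff, and the give-up point can be sketched in a few lines. This is an illustration under assumptions, not a production client: `TransientError`, `call_with_retry`, and the default values are hypothetical, and the `sleep` parameter exists only so the function can be exercised without real waiting.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retry(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                    sleep=time.sleep):
    """Run operation(), retrying transient failures a bounded number of times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                # The explicit give-up point: re-raise so the caller can
                # move the work to a queue or quarantine.
                raise
            # Exponential backoff, capped at max_delay, with random jitter
            # so many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

The sketch covers the limit and the backoff only. Idempotency still has to come from the operation itself, since a retried call may have already succeeded on the other side.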
Recovery means returning to a trustworthy state
This is where shallow answers and mature ones part ways.
Recovery is not only “the service responds again.”
It is “the service responds again without leaving behind:
- duplicate payments
- orphaned orders
- misleading statuses
- confusing reprocessing”
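One common way to keep reprocessing from producing duplicate payments is an idempotency key: the same logical request always maps to the same recorded result. The sketch below assumes a hypothetical `PaymentLedger` class and a `charge_fn` callable standing in for the real gateway call. It keeps results in memory, whereas a real system would need a durable store and an atomic check-and-record step.

```python
class PaymentLedger:
    """In-memory stand-in for a durable store keyed by idempotency key."""

    def __init__(self, charge_fn):
        self._charge_fn = charge_fn  # hypothetical gateway call
        self._results = {}

    def charge_once(self, idempotency_key, amount_cents):
        # A replayed request returns the recorded result instead of
        # charging the customer a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = self._charge_fn(amount_cents)
        self._results[idempotency_key] = result
        return result
```

Replaying `charge_once` with the same key after a crash or a retry then returns the first result, so recovery does not create a second charge.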
Simple example
Imagine an orders API that depends on a payment gateway.
If the gateway fails, you might choose to:
- block new purchases
- accept the order and mark payment as PENDING
- accept within limits and retry payment later
Each choice has a cost.
Blocking everything protects consistency, but hurts conversion immediately.
Accepting pending orders keeps the flow alive, but creates operational debt and sets a user expectation that the order will eventually go through.
A mature answer could sound like this:
If the gateway fails, I do not want the system to pretend nothing happened. I might accept the order in a pending state, with a clear time limit, and retry payment through a queue. I also need to prevent unbounded retry and have a reconciliation path to clean up old pending states.
Now the answer is not only talking about failure.
It is talking about:
- behavior during failure
- user experience
- return to consistency
- the cost of the recovery choice itself
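The pending state with a clear time limit and the reconciliation path from the answer above could look roughly like this. Everything here is an assumption for illustration: the `Order` fields, the 30-minute window, the `EXPIRED` status, and the `is_paid` callable that stands in for asking the gateway whether a payment went through.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

PENDING, PAID, EXPIRED = "PENDING", "PAID", "EXPIRED"


@dataclass
class Order:
    order_id: str
    status: str
    pending_until: datetime


def accept_order(order_id, now, pending_window=timedelta(minutes=30)):
    # Accept the order, but put an explicit time limit on the pending state.
    return Order(order_id, PENDING, now + pending_window)


def reconcile(orders, now, is_paid):
    """Move pending orders to PAID once payment is confirmed,
    or to EXPIRED once their time limit has passed."""
    for order in orders:
        if order.status != PENDING:
            continue
        if is_paid(order.order_id):
            order.status = PAID
        elif now >= order.pending_until:
            order.status = EXPIRED
    return orders
```

Running `reconcile` on a schedule gives old pending states a defined exit, so they do not pile up after the gateway comes back.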
Common mistakes
- Talking about retry without limits.
- Calling every fallback resilience.
- Ignoring the user-facing state during failure.
- Recovering availability and forgetting consistency.
- Treating failure as an operational detail separate from the product.
How a senior thinks
Someone with more experience often replaces the question “how do we avoid failure?” with a more useful one:
When this breaks, what do I want the system to do explicitly?
That question is powerful because it forces you to define:
- degraded mode
- user visibility
- retry limits
- state cleanup
What the interviewer wants to see
In interviews, failure and recovery questions measure operational maturity.
The interviewer wants to see whether you:
- treat failure as part of the flow
- define acceptable degradation
- think about retry with limits
- care about returning to a consistent state
A mature architecture does not promise that nothing breaks. It decides in advance what to do when it does.
Real recovery is not a restart. It is a controlled return to a state the system can trust again.
Quick summary
What to keep in your head
- Failure is part of the system. The architecture needs to decide in advance what stops and what degrades.
- Retry without limits is not resilience. It is a fast way to make a bad situation worse.
- Recovery is not only about serving traffic again. It is about serving traffic again without corrupting state.
- In interviews, a strong answer talks about behavior during failure and after failure.
Practice checklist
Use this when you answer
- Can I name which component failed and which flows were affected?
- Can I say what degrades and what must stop?
- Can I explain how to avoid duplication, inconsistency, or retry storms?
- Can I describe how the system returns to a coherent state?