Failure and recovery scenarios

How to think about a real system when some part breaks, without treating resilience like a slogan.

Andrews Ribeiro

Founder & Engineer

Track

System Design Interviews - From Basics to Advanced

Step 15 / 19

The problem

Many architecture discussions talk about availability as if the system were too elegant to fail.

When a dependency really does go down, the reaction becomes improvisation:

  • infinite retries
  • giant timeouts
  • restart and hope

That is not resilience.

That is panic with technical vocabulary.

Mental model

Failure is part of the system.

Thinking clearly about recovery means answering four questions:

  1. what broke
  2. who depends on it
  3. what can degrade and what must stop
  4. how the system returns to a coherent state

If you answer those early, the design stops sounding like a slogan and starts sounding like real operations.

Breaking it down

Name the failure concretely

“The system is down” rarely helps.

It is much more useful to say something like:

  • the database is unavailable
  • the payment gateway is slow
  • the broker is delaying delivery
  • storage is rejecting writes

Once the failure is concrete, the rest of the analysis gets better.

Separate what stops from what degrades

Not every failure needs to become a full error for the user.

Sometimes a degraded mode is acceptable.

But it has to be explicit.

Examples:

  • reads continue, writes stop
  • the order is accepted as pending
  • the report takes longer, but does not disappear
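As a minimal sketch of the first example (reads continue, writes stop), the degraded mode can be a named state in the code rather than an accident. The `OrderStore` class, the `Mode` enum, and the `WritesUnavailable` exception are hypothetical names invented for illustration, and the in-memory dictionary stands in for a real database.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    NORMAL = "normal"
    READ_ONLY = "read_only"  # degraded: reads continue, writes stop


class WritesUnavailable(Exception):
    """Raised so callers can show an explicit 'try again later' state."""


class OrderStore:
    """Hypothetical store; a dict stands in for the real database."""

    def __init__(self) -> None:
        self.mode = Mode.NORMAL
        self._orders = {}

    def read(self, order_id: str) -> Optional[dict]:
        # Reads are served in both modes.
        return self._orders.get(order_id)

    def write(self, order_id: str, data: dict) -> None:
        if self.mode is Mode.READ_ONLY:
            # Fail loudly instead of silently dropping the write.
            raise WritesUnavailable("writes are paused; reads still work")
        self._orders[order_id] = data
```

The point of the explicit exception is that the caller can turn it into a clear message for the user, instead of a generic error.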

Treat retry as a dangerous tool

Retry helps with transient failure.

But retry without limits can:

  • grow the queue
  • saturate an already unhealthy dependency
  • duplicate side effects

So good retry usually comes with:

  • a limit
  • backoff
  • idempotency
  • a queue or quarantine when needed

It also needs a clear point where the system stops trying.

If it never knows when to give up, it turns a local failure into a wider incident.
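The limit, the backoff, and the stopping point above could be sketched like this. It is one possible shape, not a reference implementation: `TransientError`, `call_with_retry`, and the default values are assumptions chosen for illustration, and the wrapped operation is assumed to be idempotent.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, overload)."""


def call_with_retry(operation, max_attempts=4, base_delay=0.2,
                    max_delay=5.0, sleep=time.sleep):
    """Run `operation` at most `max_attempts` times, backing off between tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                # The clear stopping point: surface the failure so the caller
                # can quarantine the work instead of retrying forever.
                raise
            # Exponential backoff capped at max_delay, with random jitter so
            # many clients do not hit the unhealthy dependency in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

Only errors marked as transient are retried here. Anything else propagates immediately, because retrying a permanent failure only adds load.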

Recovery means returning to a trustworthy state

This is where shallow answers often separate from mature ones.

Recovery is not only “the service responds again.”

It is “the service responds again” without leaving behind:

  • duplicate payments
  • orphaned orders
  • misleading statuses
  • confusing reprocessing
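One common way to avoid the first item, duplicate payments, is an idempotency key: a replayed request returns the recorded result instead of charging again. The sketch below assumes a hypothetical `PaymentLedger` class and a `charge_fn` callable standing in for the gateway. A real system would need the lookup and the record to be atomic and durable, which this in-memory version does not attempt.

```python
class PaymentLedger:
    """In-memory stand-in for a durable store keyed by idempotency key."""

    def __init__(self, charge_fn):
        self._charge_fn = charge_fn  # hypothetical gateway call
        self._results = {}           # idempotency key -> charge id

    def charge_once(self, idempotency_key: str, amount_cents: int) -> str:
        # A replayed request gets the recorded result, not a second charge.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        charge_id = self._charge_fn(amount_cents)
        self._results[idempotency_key] = charge_id
        return charge_id
```

With this in place, reprocessing the same order after an outage is safe to repeat, which is what makes recovery predictable.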

Simple example

Imagine an orders API that depends on a payment gateway.

If the gateway fails, you might choose to:

  • block new purchases
  • accept the order and mark payment as PENDING
  • accept within limits and retry payment later

Each choice has a cost.

Blocking everything protects consistency, but hurts conversion immediately.

Accepting pending orders keeps the flow alive, but creates operational debt and sets a user expectation that you now have to manage.

A mature answer could sound like this:

If the gateway fails, I do not want the system to pretend nothing happened. I might accept the order in a pending state, with a clear time limit, and retry payment through a queue. I also need to prevent unbounded retry and have a reconciliation path to clean up old pending states.

Now the answer is not only talking about failure.

It is talking about:

  • behavior during failure
  • user experience
  • return to consistency
  • the cost of the recovery choice itself
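The pending-order option from this example could look roughly like the sketch below: a reconciliation pass retries payment for pending orders and expires the ones past a time limit. `Order`, `Status`, `reconcile`, the `try_charge` callable, and the 15-minute default are all assumptions made for illustration. Only `PENDING` comes from the example itself.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "PENDING"
    PAID = "PAID"
    EXPIRED = "EXPIRED"


@dataclass
class Order:
    order_id: str
    created_at: float  # seconds on whatever clock the caller uses
    status: Status = Status.PENDING


def reconcile(orders, try_charge, now, pending_ttl=900.0):
    """Retry payment for pending orders; expire those past the time limit."""
    for order in orders:
        if order.status is not Status.PENDING:
            continue
        if now - order.created_at > pending_ttl:
            # Past the deadline: stop trying and give the user a final answer.
            order.status = Status.EXPIRED
        elif try_charge(order):
            order.status = Status.PAID
        # Otherwise the order stays PENDING until the next pass.
```

Run on a schedule, a pass like this bounds how long an order can stay pending, so old pending states get cleaned up instead of accumulating.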

Common mistakes

  • Talking about retry without limits.
  • Calling every fallback resilience.
  • Ignoring the user-facing state during failure.
  • Recovering availability and forgetting consistency.
  • Treating failure as an operational detail separate from the product.

How a senior thinks

Someone with more experience often replaces the question “how do we avoid failure?” with a more useful one:

When this breaks, what do I want the system to do explicitly?

That question is powerful because it forces you to define:

  • degraded mode
  • user visibility
  • retry limits
  • state cleanup

What the interviewer wants to see

In interviews, questions about failure and recovery measure operational maturity.

The interviewer wants to see whether you:

  • treat failure as part of the flow
  • define acceptable degradation
  • think about retry with limits
  • care about returning to a consistent state

A mature architecture does not promise that nothing breaks. It decides in advance what to do when it does.

Real recovery is not restart. It is a controlled return to a state the system can trust again.
