API scenarios at scale

How to think about an API under load without falling into generic distributed systems answers.

Andrews Ribeiro

Founder & Engineer

Track

System Design Interviews - From Basics to Advanced

Step 14 / 19

The problem

Many system design answers for APIs at scale turn into a list of famous technology names.

Redis, Kafka, load balancer, microservice, sharding.

They all show up before anyone answers:

  • which route actually matters
  • which dependency limits the flow
  • what the business is willing to give up under pressure

The result looks like architecture, but it is missing diagnosis.

Mental model

API at scale does not start with the number of components.

It starts with four questions:

  1. which operation matters most
  2. which operation suffers first when load rises
  3. which resource saturates first
  4. how the system degrades when it cannot serve everything

If you can answer those four, much of the architecture starts to reveal itself.

Breaking it down

Pick the critical flow

Not everything has the same weight.

In a real API, there is usually one path worth protecting first.

Examples:

  • checkout
  • login
  • redirect
  • report generation

If you do not choose that flow early, you end up designing everything with the same priority.

Do a quick read/write estimate

It does not need to become a thesis.

But it does need to answer whether the problem is dominated by:

  • reads
  • writes
  • heavy processing
  • an external dependency

It also helps to say whether the real pain is throughput, tail latency, or cost blow-up.

Pretty averages hide APIs that are bad at p95.

Without that, it is easy to build an elegant answer for a bottleneck that was never the main one.
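The averages-versus-tail point is easy to make concrete. This is a minimal sketch (not a benchmarking tool) showing how a handful of slow requests leaves the mean looking healthy while p95 tells the real story:

```python
import statistics

def p95(samples_ms):
    """Return the 95th-percentile latency from a list of samples (ms)."""
    ordered = sorted(samples_ms)
    index = min(len(ordered) - 1, int(len(ordered) * 0.95))
    return ordered[index]

# 90 fast requests and 10 slow ones: the mean looks acceptable,
# but 1 request in 10 waits almost a second.
samples = [20] * 90 + [900] * 10
print(statistics.mean(samples))  # 108.0 ms
print(p95(samples))              # 900 ms
```

The exact percentile math varies by library; what matters is reporting the tail, not only the mean.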

Name the first resource that saturates

This is where the answer starts becoming serious.

Because the bottleneck is rarely “scale” in the abstract.

It is usually something concrete, like:

  • CPU holding the request open
  • a database running out of connections
  • slow storage
  • a flaky third-party dependency
  • expensive fanout or aggregation

Make the smallest change that solves the right problem

Not every API under load needs microservices, queues, and several cache layers.

Sometimes the right move is much smaller:

  • remove heavy work from the request path
  • return 202 Accepted
  • add retry and rate limiting
  • add cache only on the hot path

The more proportional the change, the stronger the answer usually sounds.
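As an example of one of those small moves, here is a token-bucket rate limiter in a few lines. The rate and capacity values are placeholders; a real deployment would size them from measured traffic and add per-client buckets:

```python
import time

class TokenBucket:
    """Token-bucket limiter: refills `rate` tokens/second, bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        """Consume one token if available; otherwise reject the request."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1, capacity=2)  # illustrative numbers only
```

Callers that get `False` back should receive a 429 with a Retry-After hint rather than waiting in line.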

Explain how the system degrades

This step is often skipped, and that is a mistake.

A system at scale is not only one that works when everything is fine.

It is one that behaves predictably when it can no longer keep up.

That includes deciding what happens first:

  • reject early
  • return partial results
  • move work to async
  • or protect one critical path while another gets worse

If you do not decide that, the system decides for you in the worst possible way.
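"Reject early" can be as simple as a bounded queue with admission control at the front door. A sketch, assuming an HTTP-style (status, body) return and an illustrative capacity of 3:

```python
import queue

# Hypothetical capacity; in practice sized from measured worker throughput.
PENDING = queue.Queue(maxsize=3)

def submit(job):
    """Accept or reject immediately; never let work pile up unbounded."""
    try:
        PENDING.put_nowait(job)
        return 202, {"status": "accepted"}
    except queue.Full:
        # Reject early with a hint, instead of letting callers time out under load.
        return 503, {"status": "rejected", "retry_after_s": 30}
```

A fast, honest 503 is a decision you made; a 30-second timeout is a decision the system made for you.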

Simple example

Imagine an API that generates financial reports at the end of the month.

The main flow is:

  1. a user requests a report
  2. the API queries many tables
  3. it generates a heavy file
  4. it returns the result

If many users do this at the same time, a likely bottleneck is heavy computation inside the request.

A mature answer could sound like this:

The critical flow is asking for a report and getting status back quickly. I do not need to return the file in the same request. So I remove report generation from the synchronous path, return 202 Accepted, put the job in a queue, and let the client poll for status or receive a notification when the file is ready.

And then add:

I also need to limit how many heavy jobs each account can trigger at once, so one customer does not degrade everyone else.

Now the answer has:

  • a main flow
  • a named bottleneck
  • a proportional change
  • controlled degradation
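That whole answer fits in a small sketch. Everything here is illustrative: the function names, the in-memory queue and job table, and the per-account cap of 2 are assumptions standing in for real infrastructure (a job queue, a job store, a quota service):

```python
import queue
import threading
import uuid
from collections import defaultdict

MAX_JOBS_PER_ACCOUNT = 2        # assumed cap, tuned per workload in practice
jobs = {}                        # job_id -> "pending" | "done"
work = queue.Queue()
running = defaultdict(int)       # account_id -> heavy jobs in flight
lock = threading.Lock()

def request_report(account_id, params):
    """Synchronous path: cheap bookkeeping only, then 202 (or 429 at the cap)."""
    with lock:
        if running[account_id] >= MAX_JOBS_PER_ACCOUNT:
            return 429, {"error": "too many reports in flight"}
        running[account_id] += 1
    job_id = str(uuid.uuid4())
    jobs[job_id] = "pending"
    work.put((account_id, job_id, params))
    return 202, {"job_id": job_id}

def report_status(job_id):
    """Client polls this until the report is ready."""
    return 200, {"job_id": job_id, "status": jobs.get(job_id, "unknown")}

def worker():
    """Heavy generation runs here, outside the request path."""
    while True:
        account_id, job_id, params = work.get()
        # ... query the tables, generate the heavy file ...
        jobs[job_id] = "done"
        with lock:
            running[account_id] -= 1
        work.task_done()

threading.Thread(target=worker, daemon=True).start()
```

The synchronous path now does nothing expensive, and one account hitting its cap gets a 429 instead of degrading everyone else.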

Common mistakes

  • Starting with the tool instead of the flow.
  • Talking about scale without talking about a physical or operational resource.
  • Ignoring acceptable degradation.
  • Assuming every high-load API needs the same architecture.
  • Forgetting the operational cost of the component you just added.

How a senior thinks

Someone with more experience usually pulls the conversation toward real impact.

The thinking sounds like this:

What must keep working when demand rises? What can move to async? What must stay under a specific latency? What do I reject first when capacity runs out?

That is the difference between a pretty diagram and a defensible system.

What the interviewer wants to see

In interviews, this scenario measures whether you:

  • choose an important flow
  • locate the main bottleneck
  • change the architecture because of need, not fashion
  • define how the system degrades

API at scale is not about how many boxes you know. It is about knowing which flow deserves protection and which sacrifice the system can afford.

Once degradation is clear, your architecture starts sounding real instead of just popular.
