October 6 2025
API scenarios at scale
How to think about an API under load without falling into generic distributed systems answers.
Andrews Ribeiro
Founder & Engineer
4 min · Intermediate · Systems
Track: System Design Interviews - From Basics to Advanced
Step 14 / 19
The problem
Many system design answers for APIs at scale turn into a list of famous technology names.
Redis, Kafka, load balancer, microservice, sharding.
Everything shows up before anyone answers:
- which route actually matters
- which dependency limits the flow
- what the business is willing to give up under pressure
The result looks like architecture, but it is missing diagnosis.
Mental model
API at scale does not start with the number of components.
It starts with four questions:
- which operation matters most
- which operation suffers first when load rises
- which resource saturates first
- how the system degrades when it cannot serve everything
If you can answer those four, much of the architecture starts to reveal itself.
Breaking it down
Pick the critical flow
Not everything has the same weight.
In a real API, there is usually one path worth protecting first.
Examples:
- checkout
- login
- redirect
- report generation
If you do not choose that flow early, you end up designing everything with the same priority.
Do a quick read/write estimate
It does not need to become a thesis.
But it does need to answer whether the problem is dominated by:
- reads
- writes
- heavy processing
- an external dependency
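The estimate can stay small. Here is a minimal sketch of the arithmetic, with entirely made-up traffic numbers, that tells you which side dominates:

```python
# Back-of-envelope load estimate. All numbers are hypothetical.
daily_requests = 10_000_000      # assumed from metrics or access logs
read_fraction = 0.95             # assumed read/write split

seconds_per_day = 86_400
reads_per_sec = daily_requests * read_fraction / seconds_per_day
writes_per_sec = daily_requests * (1 - read_fraction) / seconds_per_day

print(f"reads/s ~ {reads_per_sec:.0f}, writes/s ~ {writes_per_sec:.0f}")
# A 95/5 split like this points toward caching reads, not sharding writes.
```

Thirty seconds of this kind of division is usually enough to rule out half of the famous components.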
It also helps to say whether the real pain is throughput, tail latency, or cost blow-up.
Pretty averages hide APIs that are bad at p95.
Without that, it is easy to build an elegant answer for a bottleneck that was never the main one.
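A toy illustration of the p95 point, using a fabricated latency sample, shows how a healthy-looking average can coexist with a bad tail:

```python
import math

# Made-up latency sample in milliseconds: mostly fast, with a 10% slow tail.
latencies = [20] * 90 + [900] * 10

mean = sum(latencies) / len(latencies)
rank = math.ceil(0.95 * len(latencies)) - 1   # nearest-rank p95 index
p95 = sorted(latencies)[rank]

print(f"mean={mean:.0f}ms p95={p95}ms")
# The average looks acceptable; the tail is what users actually feel.
```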
Name the first resource that saturates
This is where the answer starts becoming serious.
Because the bottleneck is rarely “scale” in the abstract.
It is usually something concrete, like:
- CPU holding the request open
- a database running out of connections
- slow storage
- a flaky third-party dependency
- expensive fanout or aggregation
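To make one of those concrete, here is a minimal sketch of the "database running out of connections" case. The pool is modeled as a semaphore with illustrative sizes, and requests fail fast instead of queueing forever:

```python
import threading

# Fixed-size connection pool modeled as a semaphore. Sizes are illustrative.
POOL_SIZE = 5
pool = threading.BoundedSemaphore(POOL_SIZE)

def handle_request(hold=False):
    """hold=True simulates a slow query that keeps its connection open."""
    if not pool.acquire(blocking=False):
        return "503 pool exhausted"   # reject early rather than pile up waiters
    if hold:
        return "200 ok (connection still held)"
    pool.release()
    return "200 ok"

# Five slow queries pin every connection; the sixth request is rejected.
for _ in range(POOL_SIZE):
    handle_request(hold=True)
print(handle_request())
```

The point of the sketch is the failure mode: the bottleneck is a countable resource, not "scale" in the abstract.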
Make the smallest change that solves the right problem
Not every API under load needs microservices, queues, and several cache layers.
Sometimes the right move is much smaller:
- remove heavy work from the request path
- return 202 Accepted
- add retry and rate limiting
- add cache only on the hot path
The more proportional the change, the stronger the answer usually sounds.
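As a sketch of the rate-limiting item, here is a per-client token bucket; the rate and burst values are illustrative, not a recommendation:

```python
import time

# Token-bucket rate limiter. Refill rate and burst size are made up.
class TokenBucket:
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller returns 429; client retries with backoff

bucket = TokenBucket(rate=2, burst=5)          # 2 req/s, bursts of up to 5
results = [bucket.allow() for _ in range(7)]
print(results)   # the first 5 pass; the burst is spent, the rest are throttled
```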
Explain how the system degrades
This step is often skipped, and that is a mistake.
A system at scale is not only one that works when everything is fine.
It is one that behaves predictably when it can no longer keep up.
That includes deciding what happens first:
- reject early
- return partial results
- move work to async
- or protect one critical path while another gets worse
If you do not decide that, the system decides for you in the worst possible way.
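The "reject early" option can be as small as a bounded queue in front of the worker. A minimal sketch, with an arbitrary limit:

```python
from collections import deque

# Load shedding with a bounded queue. MAX_QUEUE is illustrative.
MAX_QUEUE = 3
queue = deque()

def submit(job):
    if len(queue) >= MAX_QUEUE:
        return "503 shed load"     # predictable failure, returned fast
    queue.append(job)
    return "202 accepted"

responses = [submit(f"job-{i}") for i in range(5)]
print(responses)   # 3 accepted, 2 shed
```

Deciding the limit up front is the whole point: the rejection is a design choice, not an accident.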
Simple example
Imagine an API that generates financial reports at the end of the month.
The main flow is:
- a user requests a report
- the API queries many tables
- it generates a heavy file
- it returns the result
If many users do this at the same time, a likely bottleneck is heavy computation inside the request.
A mature answer could sound like this:
The critical flow is asking for a report and getting status back quickly. I do not need to return the file in the same request. So I remove report generation from the synchronous path, return 202 Accepted, put the job in a queue, and let the client poll for status or receive a notification when the file is ready.
And then add:
I also need to limit how many heavy jobs each account can trigger at once, so one customer does not degrade everyone else.
Now the answer has:
- a main flow
- a named bottleneck
- a proportional change
- controlled degradation
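The whole answer fits in a short sketch: 202 Accepted plus a queue plus polling, with a per-account cap so one customer cannot degrade everyone else. All names and limits here are illustrative:

```python
import uuid
from collections import defaultdict

# Async report pipeline sketch. The per-account cap is a made-up number.
MAX_JOBS_PER_ACCOUNT = 2
jobs = {}                                  # job_id -> status, polled by clients
active = defaultdict(int)                  # account -> jobs queued or running
queue = []

def request_report(account):
    if active[account] >= MAX_JOBS_PER_ACCOUNT:
        return 429, None                   # this customer is at its limit
    job_id = str(uuid.uuid4())
    jobs[job_id] = "queued"
    active[account] += 1
    queue.append((job_id, account))
    return 202, job_id                     # client polls for this job's status

def worker_step():
    job_id, account = queue.pop(0)
    jobs[job_id] = "done"                  # imagine generating the file here
    active[account] -= 1

status, jid = request_report("acme")
request_report("acme")
status_third, _ = request_report("acme")   # third concurrent job: rejected
worker_step()
print(status, status_third, jobs[jid])
```

The heavy work never runs inside the request, and the cap turns "one customer floods the queue" into a visible 429 instead of global slowness.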
Common mistakes
- Starting with the tool instead of the flow.
- Talking about scale without talking about a physical or operational resource.
- Ignoring acceptable degradation.
- Assuming every high-load API needs the same architecture.
- Forgetting the operational cost of the component you just added.
How a senior thinks
Someone with more experience usually pulls the conversation toward real impact.
The thinking sounds like this:
What must keep working when demand rises? What can move to async? What must stay under a specific latency? What do I reject first when capacity runs out?
That is the difference between a pretty diagram and a defensible system.
What the interviewer wants to see
In interviews, this scenario measures whether you:
- choose an important flow
- locate the main bottleneck
- change the architecture because of need, not fashion
- define how the system degrades
API at scale is not about how many boxes you know. It is about knowing which flow deserves protection and which sacrifice the system can afford.
Once degradation is clear, your architecture starts sounding real instead of just popular.
Quick summary
What to keep in your head
- API at scale does not start with more components. It starts with the critical flow.
- The first useful bottleneck is usually a concrete resource: CPU, database, connections, disk, or an external dependency.
- A mature system is not only the one that works when traffic is healthy. It is the one that degrades predictably when capacity runs out.
- In interviews, a strong answer adds a component only after naming the problem that component solves.
Practice checklist
Use this when you answer
- Can I choose the most important flow before opening the diagram?
- Can I say which resource saturates first and why?
- Can I propose the smallest change that reduces that bottleneck?
- Can I explain how the system fails or degrades when capacity is gone?