October 6 2025
API scenarios at scale
How to think about an API under load without falling into generic distributed systems answers.
Andrews Ribeiro
Founder & Engineer
4 min · Intermediate · Systems
Track: System Design Interviews - From Basics to Advanced
Step 14 / 19
The problem
Many system design answers for APIs at scale turn into a list of famous technology names.
Redis, Kafka, load balancer, microservice, sharding.
Everything shows up before anyone answers:
- which route actually matters
- which dependency limits the flow
- what the business is willing to give up under pressure
The result looks like architecture, but it is missing diagnosis.
Mental model
API at scale does not start with the number of components.
It starts with four questions:
- which operation matters most
- which operation suffers first when load rises
- which resource saturates first
- how the system degrades when it cannot serve everything
If you can answer those four, much of the architecture starts to reveal itself.
Breaking it down
Pick the critical flow
Not everything has the same weight.
In a real API, there is usually one path worth protecting first.
Examples:
- checkout
- login
- redirect
- report generation
If you do not choose that flow early, you end up designing everything with the same priority.
Do a quick read/write estimate
It does not need to become a thesis.
But it does need to answer whether the problem is dominated by:
- reads
- writes
- heavy processing
- an external dependency
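The estimate can stay small. Here is a minimal sketch of the arithmetic, with entirely made-up traffic numbers, that tells you which side dominates:

```python
# Back-of-envelope load estimate. All numbers are hypothetical.
daily_requests = 10_000_000      # assumed from metrics or access logs
read_fraction = 0.95             # assumed read/write split

seconds_per_day = 86_400
reads_per_sec = daily_requests * read_fraction / seconds_per_day
writes_per_sec = daily_requests * (1 - read_fraction) / seconds_per_day

print(f"reads/s ~ {reads_per_sec:.0f}, writes/s ~ {writes_per_sec:.0f}")
# A 95/5 split like this points toward caching reads, not sharding writes.
```

Thirty seconds of this kind of division is usually enough to rule out half of the famous components.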
It also helps to say whether the real pain is throughput, tail latency, or cost blow-up.
Pretty averages hide APIs that are bad at p95.
Without that, it is easy to build an elegant answer for a bottleneck that was never the main one.
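A toy illustration of the p95 point, using a fabricated latency sample, shows how a healthy-looking average can coexist with a bad tail:

```python
import math

# Made-up latency sample in milliseconds: mostly fast, with a 10% slow tail.
latencies = [20] * 90 + [900] * 10

mean = sum(latencies) / len(latencies)
rank = math.ceil(0.95 * len(latencies)) - 1   # nearest-rank p95 index
p95 = sorted(latencies)[rank]

print(f"mean={mean:.0f}ms p95={p95}ms")
# The average looks acceptable; the tail is what users actually feel.
```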
Name the first resource that saturates
This is where the answer starts becoming serious.
Because the bottleneck is rarely “scale” in the abstract.
It is usually something concrete, like:
- CPU holding the request open
- a database running out of connections
- slow storage
- a flaky third-party dependency
- expensive fanout or aggregation
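To make one of those concrete, here is a minimal sketch of the "database running out of connections" case. The pool is modeled as a semaphore with illustrative sizes, and requests fail fast instead of queueing forever:

```python
import threading

# Fixed-size connection pool modeled as a semaphore. Sizes are illustrative.
POOL_SIZE = 5
pool = threading.BoundedSemaphore(POOL_SIZE)

def handle_request(hold=False):
    """hold=True simulates a slow query that keeps its connection open."""
    if not pool.acquire(blocking=False):
        return "503 pool exhausted"   # reject early rather than pile up waiters
    if hold:
        return "200 ok (connection still held)"
    pool.release()
    return "200 ok"

# Five slow queries pin every connection; the sixth request is rejected.
for _ in range(POOL_SIZE):
    handle_request(hold=True)
print(handle_request())
```

The point of the sketch is the failure mode: the bottleneck is a countable resource, not "scale" in the abstract.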
Make the smallest change that solves the right problem
Not every API under load needs microservices, queues, and several cache layers.
Sometimes the right move is much smaller:
- remove heavy work from the request path
- return 202 Accepted
- add retry and rate limiting
- add cache only on the hot path
The more proportional the change, the stronger the answer usually sounds.
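As a sketch of the rate-limiting item, here is a per-client token bucket; the rate and burst values are illustrative, not a recommendation:

```python
import time

# Token-bucket rate limiter. Refill rate and burst size are made up.
class TokenBucket:
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller returns 429; client retries with backoff

bucket = TokenBucket(rate=2, burst=5)          # 2 req/s, bursts of up to 5
results = [bucket.allow() for _ in range(7)]
print(results)   # the first 5 pass; the burst is spent, the rest are throttled
```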
Explain how the system degrades
This step is often skipped, and that is a mistake.
A system at scale is not only one that works when everything is fine.
It is one that behaves predictably when it can no longer keep up.
That includes deciding what happens first:
- reject early
- return partial results
- move work to async
- or protect one critical path while another gets worse
If you do not decide that, the system decides for you in the worst possible way.
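The "reject early" option can be as small as a bounded queue in front of the worker. A minimal sketch, with an arbitrary limit:

```python
from collections import deque

# Load shedding with a bounded queue. MAX_QUEUE is illustrative.
MAX_QUEUE = 3
queue = deque()

def submit(job):
    if len(queue) >= MAX_QUEUE:
        return "503 shed load"     # predictable failure, returned fast
    queue.append(job)
    return "202 accepted"

responses = [submit(f"job-{i}") for i in range(5)]
print(responses)   # 3 accepted, 2 shed
```

Deciding the limit up front is the whole point: the rejection is a design choice, not an accident.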
Simple example
Imagine an API that generates financial reports at the end of the month.
The main flow is:
- a user requests a report
- the API queries many tables
- it generates a heavy file
- it returns the result
If many users do this at the same time, a likely bottleneck is heavy computation inside the request.
A mature answer could sound like this:
The critical flow is asking for a report and getting status back quickly. I do not need to return the file in the same request. So I remove report generation from the synchronous path, return 202 Accepted, put the job in a queue, and let the client poll for status or receive a notification when the file is ready.
And then add:
I also need to limit how many heavy jobs each account can trigger at once, so one customer does not degrade everyone else.
Now the answer has:
- a main flow
- a named bottleneck
- a proportional change
- controlled degradation
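The whole answer fits in a short sketch: 202 Accepted plus a queue plus polling, with a per-account cap so one customer cannot degrade everyone else. All names and limits here are illustrative:

```python
import uuid
from collections import defaultdict

# Async report pipeline sketch. The per-account cap is a made-up number.
MAX_JOBS_PER_ACCOUNT = 2
jobs = {}                                  # job_id -> status, polled by clients
active = defaultdict(int)                  # account -> jobs queued or running
queue = []

def request_report(account):
    if active[account] >= MAX_JOBS_PER_ACCOUNT:
        return 429, None                   # this customer is at its limit
    job_id = str(uuid.uuid4())
    jobs[job_id] = "queued"
    active[account] += 1
    queue.append((job_id, account))
    return 202, job_id                     # client polls for this job's status

def worker_step():
    job_id, account = queue.pop(0)
    jobs[job_id] = "done"                  # imagine generating the file here
    active[account] -= 1

status, jid = request_report("acme")
request_report("acme")
status_third, _ = request_report("acme")   # third concurrent job: rejected
worker_step()
print(status, status_third, jobs[jid])
```

The heavy work never runs inside the request, and the cap turns "one customer floods the queue" into a visible 429 instead of global slowness.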
Common mistakes
- Starting with the tool instead of the flow.
- Talking about scale without talking about a physical or operational resource.
- Ignoring acceptable degradation.
- Assuming every high-load API needs the same architecture.
- Forgetting the operational cost of the component you just added.
How a senior thinks
Someone with more experience usually pulls the conversation toward real impact.
The thinking sounds like this:
What must keep working when demand rises? What can move to async? What must stay under a specific latency? What do I reject first when capacity runs out?
That is the difference between a pretty diagram and a defensible system.
What the interviewer wants to see
In interviews, this scenario measures whether you:
- choose an important flow
- locate the main bottleneck
- change the architecture because of need, not fashion
- define how the system degrades
API at scale is not about how many boxes you know. It is about knowing which flow deserves protection and which sacrifice the system can afford.
Once degradation is clear, your architecture starts sounding real instead of just popular.
Quick summary
What to keep in your head
- API at scale does not start with more components. It starts with the critical flow.
- The first useful bottleneck is usually a concrete resource: CPU, database, connections, disk, or an external dependency.
- A mature system is not only the one that works when traffic is healthy. It is the one that degrades predictably when capacity runs out.
- In interviews, a strong answer adds a component only after naming the problem that component solves.
Practice checklist
Use this when you answer
- Can I choose the most important flow before opening the diagram?
- Can I say which resource saturates first and why?
- Can I propose the smallest change that reduces that bottleneck?
- Can I explain how the system fails or degrades when capacity is gone?