
Rate Limiting: When, How, and Why

How to think about rate limiting as shared capacity protection, which strategies exist, and what actually matters in practice.

Andrews Ribeiro

Founder & Engineer

Track

System Design Interviews - From Basics to Advanced

Step 10 / 19

The problem

Rate limiting often shows up in conversations as a quick detail.

Someone says:

  • “put a limiter at the edge”

and it sounds like the topic is done.

But the important part is not naming the limiter.

It is explaining:

  • what it is protecting
  • who the limit applies to
  • what behavior it creates when traffic gets tight

Without that, the system usually falls into one of two common failure modes:

  • one client consumes too much capacity and makes life worse for everyone else
  • the system degrades chaotically instead of predictably

Mental model

Think of it this way:

rate limiting is a capacity contract.

In plain English, you are saying:

  • above a certain pace, this client will have to wait, fail, or slow down

That can serve different goals:

  • protect a public API
  • reduce abuse
  • distribute a shared resource
  • absorb bursts
  • limit expensive actions like login, SMS sending, or report generation

So the useful question is not:

  • “do we need rate limiting?”

It is this one:

which capacity am I protecting, for whom, and what should happen when the limit is hit?

Breaking the problem down

Where rate limiting usually lives

The most common place is near the system entry point:

  • API gateway
  • load balancer with rules
  • application HTTP layer

The earlier you block, the less wasted work the system does.

But that does not mean every limit belongs only at the edge.

Some limits make more sense closer to the rule itself:

  • per user
  • per specific action
  • per expensive resource
  • per external integration

Example:

  • limiting requests by API key at the edge makes sense
  • limiting “at most 3 SMS messages per hour for the same user” is more product logic

The algorithm changes behavior

You do not need to memorize formulas.

You need to understand how each option behaves.

Fixed window:

  • simple
  • easy to explain
  • but creates odd behavior at the window boundary

The client can send a lot at the end of one minute and a lot again at the beginning of the next.
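That boundary behavior is easy to see in a few lines. Here is an illustrative Python sketch of a fixed-window counter (the class and names are my own, not from any particular library); simulated timestamps make the edge case visible:

```python
from collections import defaultdict

class FixedWindow:
    """At most `limit` requests per calendar window of `window` seconds."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)  # (key, window index) -> request count

    def allow(self, key: str, now: float) -> bool:
        bucket = (key, int(now // self.window))  # which calendar window `now` falls in
        if self.counts[bucket] >= self.limit:
            return False
        self.counts[bucket] += 1
        return True

fw = FixedWindow(limit=5, window=60)
end_of_window = [fw.allow("client", 59.0) for _ in range(5)]   # all allowed
start_of_next = [fw.allow("client", 60.0) for _ in range(5)]   # all allowed again
# 10 requests got through in about one second: double the intended rate.
```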

Sliding window:

  • smooths that edge
  • tends to be fairer
  • but is usually a bit more expensive to maintain
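One simple way to implement a sliding window is to keep a timestamp log per client. This is a minimal sketch under that assumption (a log is the easy-to-read variant; production systems often approximate it with weighted counters to save memory):

```python
from collections import deque

class SlidingWindowLog:
    """At most `limit` requests in any trailing `window` seconds,
    regardless of calendar boundaries."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.log = deque()  # timestamps of accepted requests

    def allow(self, now: float) -> bool:
        # Drop timestamps that have fallen out of the trailing window.
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) >= self.limit:
            return False
        self.log.append(now)
        return True

sw = SlidingWindowLog(limit=5, window=60)
burst = [sw.allow(59.0) for _ in range(5)]  # 5 allowed at t=59
denied = sw.allow(60.0)                     # 6th at t=60 is denied
```

Unlike the fixed window, the request at t=60 is rejected because the five requests at t=59 still fall inside the trailing 60 seconds.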

Token bucket:

  • fills a bucket with tokens over time
  • each request spends one token
  • allows controlled bursts

In interviews, token bucket is often a strong answer because it balances clarity with real behavior.
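A minimal token-bucket sketch, again with illustrative names of my own: the capacity is the burst size, and the refill rate is the sustained pace.

```python
import time

class TokenBucket:
    """Refills at `rate` tokens/sec up to `capacity`; each request spends one token."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate           # sustained tokens per second
        self.capacity = capacity   # maximum burst size
        self.tokens = capacity     # start full, so an initial burst is allowed
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Capacity 5 allows a burst of 5 back-to-back requests,
# then throttles to roughly 1 request per second.
bucket = TokenBucket(rate=1.0, capacity=5)
results = [bucket.allow() for _ in range(6)]
```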

Distributed rate limiting usually needs shared state

If you have multiple instances and each one counts locally, the client can dodge the limit by landing on different instances.

That is why, in distributed systems, the counter usually lives in shared state.

Redis shows up here often because:

  • it is fast
  • it handles counters and expiration well
  • it supports useful atomic operations for this case

It is not mandatory in every scenario.

But it is a common design that is easy to defend.
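The common Redis pattern is one counter per client per window, advanced with `INCR` and given a TTL with `EXPIRE`. The sketch below runs against a tiny in-memory stand-in so it is self-contained; with real Redis you would issue the same two commands, wrapped in a pipeline or Lua script to keep them atomic:

```python
class FakeRedis:
    """Tiny in-memory stand-in for the two Redis calls this pattern needs."""

    def __init__(self):
        self.store = {}  # key -> (value, expires_at)

    def incr(self, key, now):
        value, expires = self.store.get(key, (0, None))
        if expires is not None and now >= expires:
            value = 0  # key expired; start a fresh counter
        self.store[key] = (value + 1, expires)
        return value + 1

    def expire(self, key, ttl, now):
        value, _ = self.store[key]
        self.store[key] = (value, now + ttl)

def allow(r, client_id, limit, window, now):
    # One shared counter per client per window: every app instance
    # hits the same key, so clients cannot dodge the limit by
    # landing on different instances.
    key = f"rl:{client_id}:{int(now // window)}"
    count = r.incr(key, now)
    if count == 1:
        r.expire(key, window, now)  # first hit in the window sets the TTL
    return count <= limit

r = FakeRedis()
decisions = [allow(r, "user-42", limit=3, window=60, now=10.0) for _ in range(4)]
```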

The limit key changes the effect

You can rate limit by:

  • IP
  • user
  • API key
  • tenant
  • endpoint
  • action

The choice of key decides who shares a budget, and therefore who pays the price.

If you limit by IP only, you may punish many users behind the same NAT.

If you limit by user only, anonymous abuse becomes harder to control.

Good answers usually show that the key is part of the design, not a default afterthought.
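In code, the design decision is literally how you build the counter key. A small sketch (the scope names and request fields here are illustrative, not from any specific framework):

```python
def limit_key(scope: str, request: dict) -> str:
    """Build the rate-limit counter key; the scope decides who shares a budget."""
    if scope == "ip":
        # Everyone behind the same NAT shares this budget.
        return f"rl:ip:{request['ip']}"
    if scope == "user":
        # Fair per account, but weak against anonymous abuse.
        return f"rl:user:{request['user_id']}"
    if scope == "user_endpoint":
        # Separate budget per user per action, e.g. 3 SMS sends per hour.
        return f"rl:{request['user_id']}:{request['endpoint']}"
    raise ValueError(f"unknown scope: {scope}")

req = {"ip": "203.0.113.7", "user_id": "u1", "endpoint": "/sms"}
```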

What happens when the limit is exceeded

The design is not finished when you block the request.

You still need to define system behavior:

  • return 429 Too Many Requests
  • include retry hints
  • slow down instead of hard blocking
  • queue some requests
  • prioritize paid or internal traffic

This matters because the behavior becomes part of the product experience.
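The first two options can be sketched as a small response builder: a 429 status plus a `Retry-After` hint, so clients can back off deliberately instead of retrying blindly (the response shape here is illustrative, not any framework's API):

```python
def reject(retry_after_seconds: int) -> dict:
    """Limit-exceeded response: 429 plus an explicit retry hint."""
    return {
        "status": 429,  # Too Many Requests
        "headers": {"Retry-After": str(retry_after_seconds)},
        "body": {"error": "rate_limited", "retry_after": retry_after_seconds},
    }

resp = reject(30)
```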

Simple example

A good interview answer could sound like this:

“I would treat rate limiting as protection for shared capacity. First I would define what I am protecting and who the limit applies to. At the edge, I would likely use rate limiting by API key or user to avoid wasting work early. For distributed counting, I would use shared state, often Redis, because local counters break across multiple instances. For the algorithm, token bucket is a good default when I want controlled bursts without unlimited spikes. And I would be explicit about the response, usually 429 with retry guidance, so the client sees a predictable contract instead of random failure.”

That works because it:

  • explains the goal
  • picks a place for the limiter
  • shows awareness of distribution
  • treats limit behavior as part of the system

Common mistakes

  • Treating rate limiting as a generic abuse checkbox.
  • Naming an algorithm without explaining its behavior.
  • Counting locally in a distributed system and assuming it still works.
  • Ignoring what the client sees when the limit is hit.
  • Mixing product limits and infrastructure limits without saying so.

How a senior thinks about it

People with real production experience usually simplify the conversation into two questions:

Which capacity am I protecting?

What should this feel like for the client when traffic is too high?

That framing clears a lot of noise.

Instead of sounding theoretical, the answer starts sounding operational.

What the interviewer wants to see

In this scenario, the interviewer wants to see whether you:

  • explain what is being protected
  • understand why the algorithm changes behavior
  • recognize the distributed counting problem
  • make the client-visible behavior explicit
  • keep the answer grounded in trade-offs instead of buzzwords

Good rate limiting is not just about stopping traffic. It is about turning overload into something predictable.
