
Rate Limiting: When, How, and Why

How to think about rate limiting as shared capacity protection, which strategies exist, and what actually matters in practice.

Andrews Ribeiro

Founder & Engineer

Track

System Design Interviews - From Basics to Advanced

Step 10 / 19

The problem

Rate limiting often shows up in conversations as a quick detail.

Someone says:

  • “put a limiter at the edge”

and it sounds like the topic is done.

But the important part is not naming the limiter.

It is explaining:

  • what it is protecting
  • who the limit applies to
  • what behavior it creates when traffic gets tight

Without that, the system usually falls into one of two common failure modes:

  • one client consumes too much capacity and makes life worse for everyone else
  • the system degrades chaotically instead of predictably

Mental model

Think of it this way:

rate limiting is a capacity contract.

In plain English, you are saying:

  • above a certain pace, this client will have to wait, fail, or slow down

That can serve different goals:

  • protect a public API
  • reduce abuse
  • distribute a shared resource
  • absorb bursts
  • limit expensive actions like login, SMS sending, or report generation

So the useful question is not:

  • “do we need rate limiting?”

It is this one:

which capacity am I protecting, for whom, and what should happen when the limit is hit?

Breaking the problem down

Where rate limiting usually lives

The most common place is near the system entry point:

  • API gateway
  • load balancer with rules
  • application HTTP layer

The earlier you block, the less wasted work the system does.

But that does not mean every limit belongs only at the edge.

Some limits make more sense closer to the rule itself:

  • per user
  • per specific action
  • per expensive resource
  • per external integration

Example:

  • limiting requests by API key at the edge makes sense
  • limiting “at most 3 SMS messages per hour for the same user” is more product logic

The algorithm changes behavior

You do not need to memorize formulas.

You need to understand how each option behaves.

Fixed window:

  • simple
  • easy to explain
  • but creates odd behavior at the window boundary

The client can send a lot at the end of one minute and a lot again at the beginning of the next.
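That boundary behavior is easy to see in a few lines. Here is an illustrative Python sketch of a fixed-window counter (the class and names are my own, not from any particular library); simulated timestamps make the edge case visible:

```python
from collections import defaultdict

class FixedWindow:
    """At most `limit` requests per calendar window of `window` seconds."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)  # (key, window index) -> request count

    def allow(self, key: str, now: float) -> bool:
        bucket = (key, int(now // self.window))  # which calendar window `now` falls in
        if self.counts[bucket] >= self.limit:
            return False
        self.counts[bucket] += 1
        return True

fw = FixedWindow(limit=5, window=60)
end_of_window = [fw.allow("client", 59.0) for _ in range(5)]   # all allowed
start_of_next = [fw.allow("client", 60.0) for _ in range(5)]   # all allowed again
# 10 requests got through in about one second: double the intended rate.
```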

Sliding window:

  • smooths that edge
  • tends to be fairer
  • but is usually a bit more expensive to maintain
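One simple way to implement a sliding window is to keep a timestamp log per client. This is a minimal sketch under that assumption (a log is the easy-to-read variant; production systems often approximate it with weighted counters to save memory):

```python
from collections import deque

class SlidingWindowLog:
    """At most `limit` requests in any trailing `window` seconds,
    regardless of calendar boundaries."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.log = deque()  # timestamps of accepted requests

    def allow(self, now: float) -> bool:
        # Drop timestamps that have fallen out of the trailing window.
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) >= self.limit:
            return False
        self.log.append(now)
        return True

sw = SlidingWindowLog(limit=5, window=60)
burst = [sw.allow(59.0) for _ in range(5)]  # 5 allowed at t=59
denied = sw.allow(60.0)                     # 6th at t=60 is denied
```

Unlike the fixed window, the request at t=60 is rejected because the five requests at t=59 still fall inside the trailing 60 seconds.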

Token bucket:

  • fills a bucket with tokens over time
  • each request spends one token
  • allows controlled bursts

In interviews, token bucket is often a strong answer because it balances clarity with real behavior.
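A minimal token-bucket sketch, again with illustrative names of my own: the capacity is the burst size, and the refill rate is the sustained pace.

```python
import time

class TokenBucket:
    """Refills at `rate` tokens/sec up to `capacity`; each request spends one token."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate           # sustained tokens per second
        self.capacity = capacity   # maximum burst size
        self.tokens = capacity     # start full, so an initial burst is allowed
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Capacity 5 allows a burst of 5 back-to-back requests,
# then throttles to roughly 1 request per second.
bucket = TokenBucket(rate=1.0, capacity=5)
results = [bucket.allow() for _ in range(6)]
```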

Distributed rate limiting usually needs shared state

If you have multiple instances and each one counts locally, the client can dodge the limit by landing on different instances.

That is why, in distributed systems, the counter usually lives in shared state.

Redis shows up here often because:

  • it is fast
  • it handles counters and expiration well
  • it supports useful atomic operations for this case

It is not mandatory in every scenario.

But it is a common design that is easy to defend.
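The common Redis pattern is one counter per client per window, advanced with `INCR` and given a TTL with `EXPIRE`. The sketch below runs against a tiny in-memory stand-in so it is self-contained; with real Redis you would issue the same two commands, wrapped in a pipeline or Lua script to keep them atomic:

```python
class FakeRedis:
    """Tiny in-memory stand-in for the two Redis calls this pattern needs."""

    def __init__(self):
        self.store = {}  # key -> (value, expires_at)

    def incr(self, key, now):
        value, expires = self.store.get(key, (0, None))
        if expires is not None and now >= expires:
            value = 0  # key expired; start a fresh counter
        self.store[key] = (value + 1, expires)
        return value + 1

    def expire(self, key, ttl, now):
        value, _ = self.store[key]
        self.store[key] = (value, now + ttl)

def allow(r, client_id, limit, window, now):
    # One shared counter per client per window: every app instance
    # hits the same key, so clients cannot dodge the limit by
    # landing on different instances.
    key = f"rl:{client_id}:{int(now // window)}"
    count = r.incr(key, now)
    if count == 1:
        r.expire(key, window, now)  # first hit in the window sets the TTL
    return count <= limit

r = FakeRedis()
decisions = [allow(r, "user-42", limit=3, window=60, now=10.0) for _ in range(4)]
```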

The limit key changes the effect

You can rate limit by:

  • IP
  • user
  • API key
  • tenant
  • endpoint
  • action

The choice of key decides who shares a budget, and therefore who pays the price.

If you limit by IP only, you may punish many users behind the same NAT.

If you limit by user only, anonymous abuse becomes harder to control.

Good answers usually show that the key is part of the design, not a default afterthought.
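In code, the design decision is literally how you build the counter key. A small sketch (the scope names and request fields here are illustrative, not from any specific framework):

```python
def limit_key(scope: str, request: dict) -> str:
    """Build the rate-limit counter key; the scope decides who shares a budget."""
    if scope == "ip":
        # Everyone behind the same NAT shares this budget.
        return f"rl:ip:{request['ip']}"
    if scope == "user":
        # Fair per account, but weak against anonymous abuse.
        return f"rl:user:{request['user_id']}"
    if scope == "user_endpoint":
        # Separate budget per user per action, e.g. 3 SMS sends per hour.
        return f"rl:{request['user_id']}:{request['endpoint']}"
    raise ValueError(f"unknown scope: {scope}")

req = {"ip": "203.0.113.7", "user_id": "u1", "endpoint": "/sms"}
```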

What happens when the limit is exceeded

The design is not finished when you block the request.

You still need to define system behavior:

  • return 429 Too Many Requests
  • include retry hints
  • slow down instead of hard blocking
  • queue some requests
  • prioritize paid or internal traffic

This matters because the behavior becomes part of the product experience.
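The first two options can be sketched as a small response builder: a 429 status plus a `Retry-After` hint, so clients can back off deliberately instead of retrying blindly (the response shape here is illustrative, not any framework's API):

```python
def reject(retry_after_seconds: int) -> dict:
    """Limit-exceeded response: 429 plus an explicit retry hint."""
    return {
        "status": 429,  # Too Many Requests
        "headers": {"Retry-After": str(retry_after_seconds)},
        "body": {"error": "rate_limited", "retry_after": retry_after_seconds},
    }

resp = reject(30)
```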

Simple example

A good interview answer could sound like this:

“I would treat rate limiting as protection for shared capacity. First I would define what I am protecting and who the limit applies to. At the edge, I would likely use rate limiting by API key or user to avoid wasting work early. For distributed counting, I would use shared state, often Redis, because local counters break across multiple instances. For the algorithm, token bucket is a good default when I want controlled bursts without unlimited spikes. And I would be explicit about the response, usually 429 with retry guidance, so the client sees a predictable contract instead of random failure.”

That works because it:

  • explains the goal
  • picks a place for the limiter
  • shows awareness of distribution
  • treats limit behavior as part of the system

Common mistakes

  • Treating rate limiting as a generic abuse checkbox.
  • Naming an algorithm without explaining its behavior.
  • Counting locally in a distributed system and assuming it still works.
  • Ignoring what the client sees when the limit is hit.
  • Mixing product limits and infrastructure limits without saying so.

How a senior thinks about it

People with real production experience usually simplify the conversation into two questions:

Which capacity am I protecting?

What should this feel like for the client when traffic is too high?

That framing clears a lot of noise.

Instead of sounding theoretical, the answer starts sounding operational.

What the interviewer wants to see

In this scenario, the interviewer wants to see whether you:

  • explain what is being protected
  • understand why the algorithm changes behavior
  • recognize the distributed counting problem
  • make the client-visible behavior explicit
  • keep the answer grounded in trade-offs instead of buzzwords

Good rate limiting is not just about stopping traffic. It is about turning overload into something predictable.
