Design an API Rate Limiter

Problem Context

🚦 A Rate Limiter controls how many requests a client can make to an API within a given time window. It's the traffic cop for your API.

Every major API you've used has rate limiting: Twitter limits tweets per hour, Stripe limits API calls per second, GitHub throttles unauthenticated requests. Without it, a single bad actor (or a bug in someone's script) could bring down your entire service.

Functional Requirements

Core Functional Requirements

FR1: The system should limit requests based on a configurable rate (e.g., 100 requests per minute per user).
FR2: The system should return appropriate responses when rate limits are exceeded (HTTP 429).
FR3: The system should support rate limiting by different identifiers (user ID, API key, IP address).

Out of Scope:

Authentication and authorization (already established, like using JWT)
Billing integration
Request queuing
Per-endpoint rate limits

Mentioning out-of-scope items shows the interviewer you understand the broader picture.

Non-Functional Requirements

Core Non-Functional Requirements

NFR1: Rate limit checks should add < 5ms latency to requests.
NFR2: The system should be highly available (99.99%+).
NFR3: Rate limits should be accurate across distributed servers.
NFR4: System should scale to handle 1M+ requests per second globally.

Here's the current whiteboard:

The Set Up

Defining the Core Entities

Client: The entity making requests
- clientId: Unique identifier (user ID, API key, or IP address)
- tier: Service tier (free, pro, enterprise)
Rate Limit Rule: Configuration defining the limit
- ruleId: Unique identifier for the rule
- clientId: Which client this applies to (or * for default)
- limit: Max requests allowed (e.g., 100)
- windowSeconds: Time window duration (e.g., 60)
- burstLimit: Optional max burst (e.g., 150)
Request Counter: Tracks requests in the current window
- clientId: Who made the requests
- windowKey: Current time window identifier (e.g., 14:05)
- count: Number of requests made
- expiresAt: When this counter resets
Time Window: The period over which we count
- start: Window start timestamp
- durationSeconds: Window length (60, 3600, etc.)

The counter is the heart of rate limiting. Everything else is about making that counter accurate and fast.

API Interface

The rate limiter doesn't expose a public API to end users. Instead, it's an internal service that other components query. We need to define:

How servers check the limit
What responses come from the rate limiter
How limits can be set in the first place

You don't need to memorize the exact JSON format. Focus on understanding the inputs, outputs, and why each field exists.

Rate Limit Check (Internal): FR1, FR3

Called by API servers before processing each request.

1. Check Rate Limit

POST /ratelimit/check

Request:
{
  "clientId": "user_abc123",
  "resource": "api"
}

Response (allowed):
{
  "allowed": true,
  "remaining": 87,
  "limit": 100,
  "resetAt": 1704355200
}

Response (denied):
{
  "allowed": false,
  "remaining": 0,
  "limit": 100,
  "resetAt": 1704355200,
  "retryAfter": 23
}

Why POST and not GET? POST is used because this endpoint represents an explicit action (checking and consuming the quota)

Response Headers (User-Facing): FR2

Users never call the rate limiter directly, but they see standard headers in every API response:

Normal Response:

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 87
X-RateLimit-Reset: 1704355200

Rate Limited Response:

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1704355200
Retry-After: 23

{
  "error": "Rate limit exceeded",
  "message": "Too many requests. Please retry after 23 seconds."
}

The Retry-After header tells well-behaved clients when to try again. Bots will ignore it and keep hammering (they'll stay blocked).

Admin Configuration: FR1

Admins configure limits for different clients or tiers.

Set Rate Limit Rule

PUT /ratelimit/rules/{clientId}

Request:
{
  "requestsPerMinute": 100,
  "burstLimit": 150
}

Response:
{
  "clientId": "user_abc123",
  "requestsPerMinute": 100,
  "burstLimit": 150,
  "updatedAt": "2024-01-04T12:00:00Z"
}

What's burst limit? The sustained rate is 100/min, but we allow brief spikes to 150. This handles legitimate traffic bursts without being overly restrictive.

High-Level Design

Let's tackle our functional requirements:

FR1: Limit requests by configurable rate
FR2: Return 429 when limit exceeded
FR3: Support different identifiers

We'll build up from the most simple implementation and fix red flags as we go. In a real interview, you can start at Diagram 2.

1) The Simplest System: FR1, FR2

The simplest rate limiter is just a counter in memory.

Logic:

Request comes in with client identifier
Increment counter for that client
If counter is less than limit, allow request
If counter is greater than or equal to limit, reject with 429

What breaks?

Counter never resets (everyone eventually gets blocked forever!)
If server restarts, all counts are lost
A single server can't scale horizontally

First, we need time windows to reset counters.

2) Add Time Windows: FR1, FR2

The counter needs to reset periodically. Let's add time-based windows:

Now limits reset properly! A user blocked at 14:05 gets a fresh quota at 14:06.

What breaks?

Boundary problem: A user sends 99 requests at 14:00:59, then 99 more at 14:01:01. That's 198 requests in 2 seconds! (our limit is 100/minute). The window boundary creates a loophole.
In-memory storage doesn't survive restarts.
A Single server still can't scale.

3) Sliding Window: FR1, FR2

The fixed window boundary can be exploited. Sliding window looks at a rolling time period instead:

Why not store exact timestamps? For perfect accuracy, you'd store when each request happened: [2:00:01 PM, 2:00:03 PM, ...]. That's O(n) memory: 10,000 requests = 10,000 timestamps. The approximation uses O(1) memory (just 2 numbers no matter the volume).

What breaks?

Still in-memory on one server
Multiple API servers means multiple isolated counters (a user could hit Server A 100 times AND Server B 100 times!)

The counter needs to be centralized.

4) Centralized Counter (Redis): FR1, FR2, NFR3

Multiple servers need to share the same counter. Enter Redis:

Why Lua script? Without it, two servers could READ count=99, both calculate 99 < 100, then both INCR (101 requests allowed when limit is 100). Lua runs atomically inside Redis so the check and increment happen as one indivisible operation.

Why Redis specifically?

Sub-millisecond reads/writes
Lua scripts run atomically (no race conditions)
Built-in TTL (keys auto-expire)

Now both servers see the same counter. A user can't bypass limits by hitting different servers.

This completes our functional requirements:

FR1 ✅ Configurable rate limits (via Rules Config)
FR2 ✅ Proper 429 responses with headers
FR3 ✅ Client identification (clientId in Redis key)

But we still have problems to solve:

Redis becomes a single point of failure (NFR2)
If Redis is in us-east-1 and your servers are in Asia, that's 200ms latency per request (NFR1)
How do we handle 1M+ requests per second? (NFR4)

We address non-functional requirements and more in the deep dives:

NFR1 (Latency): How do we reduce Redis latency?
NFR2 (Availability): What happens when Redis fails?
NFR3 (Consistency): How do we handle distributed counting across regions?
NFR4 (Scale): How do we handle 1M+ RPS?

Potential Deep Dives

1) What algorithm should we use?

We glossed over algorithms earlier. There are four main approaches, each with trade-offs:

Token Bucket is the industry standard (used by AWS, Stripe, etc.), but understanding all four helps you discuss trade-offs intelligently.

1a) Fixed Window Counter

How it works: Divide time into fixed 1-minute windows. Count requests in the current window. When the clock hits the next minute, reset the counter to zero.

The problem: As previously demonstrated in the HLD, we can have a boundary exploit.

1b) Sliding Window Log

How it works: Store the exact timestamp of every request. To check the limit, count how many timestamps fall within the last 60 seconds from right now. (no boundary exploit)

1c) Sliding Window Counter (Approximation)

The main idea: We want to know "how many requests in the last 60 seconds?" but we don't want to store every timestamp. So we use two fixed-window counters and combine them with a weighted average for prediction.

How it works:

Windows are fixed 60-second periods based on the clock (2:04, 2:05, 2:06...)
We store a count for the whole previous window (complete) and current window (still in progress)
When checking the limit, we combine the current window count and a weighted average of the previous window (based on what percent of that window is included) to see if the limit is exceeded.

Example: At 2:05:36 PM (36 seconds into the current window, starting at 2:05:00 PM)

Previous window (2:04 PM): 80 requests
Current window (2:05 PM): 30 requests SO FAR
We're 60% through the current window, so 40% of the previous window overlaps with our 60-second lookback (look at diagram above)

  estimated = (prev × overlap%) + curr
            = (80 × 0.4) + 30
            = 32 + 30
            = 62 requests

The trade-off: We assume requests were evenly spread in the previous window. This isn't perfect, but it's good enough for rate limiting and we only store 2 numbers instead of thousands of timestamps.

1d) Token Bucket (Recommended)

How it works: Imagine a bucket that holds tokens. Each request must "spend" one token. The bucket refills at a steady rate (like 100 tokens per minute). If the bucket is empty, the request is rejected.

Step by step:

Refill: Calculate how many tokens to add based on time since last request
- tokens_to_add = time_elapsed × refill_rate
- Cap at bucket capacity (can't exceed 100)
Check: If there's at least 1 token, consume it and allow the request
Deny: If empty, reject with 429

Why it handles bursts: Unlike fixed windows, a user can "save up" tokens. If they haven't made requests for 30 seconds, they've accumulated ~50 tokens and can burst 50 requests instantly. This matches real-world usage patterns.

Example timeline:

2:00:00 PM: Bucket starts full: 100 tokens
2:00:01 PM: User sends 100 requests, 0 tokens left
2:00:02 PM: User tries 1 request, allowed (1.67 tokens refilled, use 1, 0.67 left)
2:00:02 PM: User tries another, denied (only 0.67 tokens, need 1)

Why it's the standard: AWS, Stripe, GitHub, and most major APIs use Token Bucket because:

It allows legitimate traffic bursts
It has O(1) memory (just 2 values: tokens and last_refill_time)
It provides smooth, predictable rate limiting
It's intuitive to explain to users ("you have X requests remaining")

Bottom line: Use Token Bucket for production. It handles bursts gracefully and is the industry standard.

2) How do we handle distributed rate limiting?: NFR1, NFR3 (Consistency)

With Redis deployed per region, a user could hit US Redis (100 requests), then EU Redis (100 more) with a VPN (getting 200 requests when their limit is 100).

Strict per-user accuracy requires either cross-region coordination (adds latency) or routing each user to one region.

Solution A: Sticky Sessions (Recommended for per-user limits)

Route each user to the same region always. Their count lives in one Redis.

How: Your load balancer (AWS ALB, Cloudflare) hashes the user ID and routes consistently.

Trade-off: A user in Asia routed to US Redis gets higher latency.

Solution B: Accept Some Inaccuracy

Give each region the full limit. Most users don't VPN between regions mid-session.
Trade-off: The rare user gaming the system gets 2-3x their limit. For fair usage limits, this is often acceptable because it would cost you much more in the long run to implement cross-region tracking.

Solution C: Cross-Region Sync (Expensive)

Sync each user's count across regions using a global Redis.
Trade-off: This adds latency to every request and is complex to implement. It's only worth it for very strict security limits.

For per-user limits: use sticky sessions.

3) What happens when Redis fails?: NFR2 (99.99% Availability)

Redis is our critical dependency. When it's unreachable, we have three choices:

Option	Behavior	Trade-off
Fail Open	Allow all requests	No rate limiting (risk abuse)
Fail Closed	Deny all requests	Blocks legitimate users
Fail to Local Cache	Use in-memory counters	Only works with sticky sessions

Local Cache Fallback

Each server falls back to its own in-memory counter. This only works if a user's requests consistently hit the same server (via load balancer hashing). Otherwise they could hit different servers and exceed their limit.

Without consistent routing, your options are fail-open or fail-closed.

Circuit Breaker Pattern

Why? If Redis is slow (not fully down), hammering it with retries makes things worse. Circuit breaker stops trying temporarily, then probes to recover.

4) How do we handle 1M+ requests per second?: NFR4 (Scale)

The simple answer: Shard Redis by user ID.

Each Redis handles ~100K requests/second. For 1M RPS, use 10+ Redis instances. This is the same pattern as database sharding.

Optional Optimizations (if sharding isn't enough)

Request batching: Instead of 1 Redis call per request, batch multiple requests and flush every 10ms. Reduces Redis calls by 5-10x.
Cache rejections locally: Once Redis says a user is over their limit, cache that locally in the server's memory. New requests from that user get denials (rate limited until 2:01) and don't hit Redis.

5) How do we support different rate limit tiers?: FR1, FR3

We store the tier and limit mappings in a configuration database:

Tier	Requests/min	Burst
Free	60	10
Pro	1,000	100
Enterprise	10,000	1,000

Flow:

Request arrives with API key
We look up the tier for that key
Apply that tier's limit

Caching tier lookups: Hitting the database on every request is too slow. Cache the mapping locally with a TTL. When a user upgrades, publish an event to invalidate caches.

What to Expect?

That covered a lot! Here's what matters at each level.

Mid-level

Breadth over Depth (80/20): Build the core system with centralized Redis (Diagrams 1-4). It's okay to treat scaling as a black box.
Expect Guided Questions: The interviewer won't expect you to know everything. "Why Redis?" or "What happens at window boundaries?" are fair game.
Collaborative Design: You lead the initial design, but expect the interviewer to guide you toward problems you missed.
The Bar: Complete the core flow with Redis as centralized storage. Know why in-memory alone fails. Understand the algorithms (Token Bucket is recommended).

Senior

Balanced Breadth & Depth (60/40): Go deeper on areas where you have hands-on experience. Don't just say "token bucket" but explain how it handles bursts and why it's the industry standard.
Proactive Problems: Identify failure modes before being asked. "What if Redis dies?" and "How do we handle users across regions?" should come from you.
Trade-off Discussions: Explain why you chose one approach over another. "We use sticky sessions to avoid distributed consistency issues" is higher level thinking.
The Bar: Complete Diagram 4 and proactively dive into 2-3 deep dives: algorithms, distributed rate limiting (sticky sessions), or Redis failure handling (circuit breaker).

Staff

Depth over Breadth (40/60): The interviewer assumes you know the basics. Get to Diagram 4 in ~15 minutes, then spend most time on deep dives.
Experience-Driven: Draw from real projects. "We hit this exact problem when..." is more valuable than textbook answers.
Drive the Conversation: You lead the discussion, identifying and solving problems before the interviewer prompts. They're observing your thought process, not teaching.
The Bar: Cover all deep dives unprompted: algorithm trade-offs (why Token Bucket), distributed consistency (sticky sessions vs cross-region sync), failure handling (circuit breaker), scaling (shard by user ID), and tiered limits.

Mock interview this question with AI & pass your real interview. Good luck! 🚦

Design API Rate Limiter

Design an API Rate Limiter

Problem Context

Functional Requirements

Core Functional Requirements

Out of Scope:

Non-Functional Requirements

Core Non-Functional Requirements

The Set Up

Defining the Core Entities

API Interface

Rate Limit Check (Internal): FR1, FR3

Response Headers (User-Facing): FR2

Admin Configuration: FR1

High-Level Design

1) The Simplest System: FR1, FR2

2) Add Time Windows: FR1, FR2

3) Sliding Window: FR1, FR2

4) Centralized Counter (Redis): FR1, FR2, NFR3

Potential Deep Dives

1) What algorithm should we use?

1a) Fixed Window Counter

1b) Sliding Window Log

1c) Sliding Window Counter (Approximation)

1d) Token Bucket (Recommended)

2) How do we handle distributed rate limiting?: NFR1, NFR3 (Consistency)

Solution A: Sticky Sessions (Recommended for per-user limits)

Solution B: Accept Some Inaccuracy

Solution C: Cross-Region Sync (Expensive)

3) What happens when Redis fails?: NFR2 (99.99% Availability)

Local Cache Fallback

Circuit Breaker Pattern

4) How do we handle 1M+ requests per second?: NFR4 (Scale)

Optional Optimizations (if sharding isn't enough)

5) How do we support different rate limit tiers?: FR1, FR3

What to Expect?

Mid-level

Senior

Staff