How to Design a Rate Limiter API | Learn System Design

Last Updated : 28 Mar, 2026

A Rate Limiter API controls how many requests a user or system can make within a specific time period. It helps protect servers from overload by restricting excessive traffic and ensuring fair usage. It is commonly used in APIs, login systems, and payment services to maintain stability and performance.

**Example:** A login API allows only 5 requests per minute per user. If a user exceeds this limit, further requests are blocked temporarily.

1. Use of rate limiting

2. System Requirements

The requirements of a rate limiter API can be classified into two categories: functional and non-functional.

Functional requirements

This section defines the core capabilities of the Rate Limiter API.

Non-functional requirements

This section defines system qualities like performance, scalability, and security.

3. High Level Design (HLD)

Placement of Rate Limiter in System Design

A rate limiter should generally be implemented on the server side rather than on the client side. This is because of the following points:

HLD of Rate Limiter API - rate limiter placed at server side

The basic structure of a rate limiter is relatively simple. We just need a counter associated with each user to track how many requests are submitted in a particular timeframe. A request is rejected if the counter has already reached the limit.

4. Memory Structure/Approximation

Next, consider which data structure to use. Since we need fast retrieval of the counter value associated with each user, a hash table is a natural choice. Each entry is a key-value pair: the key is the user ID (or its hash), and the value is a structure holding the counter and the start time of the current window, e.g.,
UserId -> {counter, startTime}
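The mapping above can be sketched as a small fixed-window counter. This is a minimal illustration, not a production design: the names `Window`, `FixedWindowLimiter`, and `allow`, and the injectable `clock` parameter, are assumptions introduced here so the behaviour can be tested.

```python
import time
from dataclasses import dataclass


@dataclass
class Window:
    counter: int       # requests seen in the current window
    start_time: float  # when the current window began, in seconds


class FixedWindowLimiter:
    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the sketch can be tested
        self.table = {}     # user_id -> Window

    def allow(self, user_id):
        now = self.clock()
        entry = self.table.get(user_id)
        # Start a fresh window on first contact or once the old one has elapsed.
        if entry is None or now - entry.start_time >= self.window_seconds:
            self.table[user_id] = Window(counter=1, start_time=now)
            return True
        if entry.counter >= self.limit:
            return False  # limit reached: reject the request
        entry.counter += 1
        return True
```

With `FixedWindowLimiter(limit=5, window_seconds=60)`, the sixth call to `allow("alice")` within a minute returns `False`, which matches the login example at the top of the article.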

Suppose each UserId takes 8 bytes (long long) and the counter takes 2 bytes (a 16-bit unsigned integer), which can count up to 65,535 and therefore covers a limit of 50k. If we store only the minutes and seconds of the start time, it also fits in 2 bytes. In total, we need 12 bytes to store each user's data.

Assuming an overhead of 10 bytes per record in the hash table, and at least 5 million users to track at any given time, the total memory needed would be:
(12+10)bytes*5 million = 110 MB
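The same estimate can be reproduced in a few lines, which makes it easy to try other assumptions about record size or user count. The constant and function names are illustrative only; the figures are the ones used in the text.

```python
USER_ID_BYTES = 8      # long long
COUNTER_BYTES = 2      # 16-bit counter
START_TIME_BYTES = 2   # minutes and seconds only
OVERHEAD_BYTES = 10    # assumed per-record hash-table overhead
USERS = 5_000_000


def estimate_memory_bytes(users=USERS):
    record = USER_ID_BYTES + COUNTER_BYTES + START_TIME_BYTES  # 12 bytes
    return (record + OVERHEAD_BYTES) * users


total = estimate_memory_bytes()
print(f"{total / 1_000_000:.0f} MB")  # prints "110 MB"
```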

5. Components

Best Place to Store Counters in a Rate Limiter

Database operations are slow, so a disk-backed database is a poor place to keep counters that are read and updated on every request. An in-memory cache such as Redis handles this better: it is fast and has built-in support for time-based key expiration.

We can rely on two commands being used with in-memory storage,

In this design, client requests pass through a rate limiter middleware, which checks against the configured rate limits. The rate limiter module stores and retrieves rate limit data from a backend storage system. If a client exceeds a rate limit, the rate limiter module returns an appropriate response to the client.
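The middleware flow described here can be sketched as follows. The two commands are not named above; the sketch assumes they are Redis's `INCR` and `EXPIRE`. To stay self-contained it does not talk to Redis at all: `InMemoryCache` is a hypothetical stand-in that only mimics those two commands, and the 200/429 status codes returned by `check_request` are likewise an assumption about what an "appropriate response" might be.

```python
import time


class InMemoryCache:
    """Hypothetical stand-in for a cache with INCR/EXPIRE-like semantics."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.values = {}     # key -> integer counter
        self.deadlines = {}  # key -> absolute expiry time

    def _evict_if_expired(self, key):
        deadline = self.deadlines.get(key)
        if deadline is not None and self.clock() >= deadline:
            self.values.pop(key, None)
            self.deadlines.pop(key, None)

    def incr(self, key):
        # Like INCR: a missing key is treated as 0 before incrementing.
        self._evict_if_expired(key)
        self.values[key] = self.values.get(key, 0) + 1
        return self.values[key]

    def expire(self, key, seconds):
        # Like EXPIRE: set a time-to-live on an existing key.
        if key in self.values:
            self.deadlines[key] = self.clock() + seconds


def check_request(cache, user_id, limit, window_seconds):
    """Return 200 if the request is allowed, 429 if it is throttled."""
    key = f"rate:{user_id}"
    count = cache.incr(key)
    if count == 1:
        # First request of a window: start the countdown.
        cache.expire(key, window_seconds)
    return 200 if count <= limit else 429
```

Against a real cache, the increment and the expiry are two separate round trips, so a real implementation would also need to consider what happens if the second one fails.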

6. Algorithms to Design a Rate Limiter API

Several algorithms are used for rate limiting, including

Let's discuss each algorithm in detail:

1. Token Bucket

The token bucket algorithm is a simple algorithm that uses a fixed-size token bucket to limit the rate of incoming requests. The token bucket is filled with tokens at a fixed rate, and each request requires a token to be processed. If the bucket is empty, the request is rejected.

The token bucket algorithm can be implemented using the following steps:

Thus, by allocating a bucket with a predetermined number of tokens for each user, we limit the number of requests per user per time unit. When a user's token count drops to 0, we know that the user has reached the maximum number of requests for that timeframe. The bucket is refilled automatically when a new timeframe starts.

Token bucket example with an initial token count of 3 per user per minute
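A minimal sketch of a token bucket is shown below. It refills continuously at a fixed rate, as in the opening description of the algorithm, rather than all at once when a new timeframe starts; the class name, parameters, and injectable `clock` are assumptions made for illustration. In a per-user setup, one such bucket would be kept per user ID in a hash table.

```python
import time


class TokenBucket:
    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)  # the bucket starts full
        self.last_refill = clock()

    def allow(self):
        now = self.clock()
        elapsed = now - self.last_refill
        # Add the tokens earned since the last call, never exceeding capacity.
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1  # each request consumes one token
            return True
        return False          # bucket empty: reject the request
```

The capacity controls how large a burst is tolerated, while the refill rate sets the sustained request rate.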

2. Leaky Bucket

The leaky bucket algorithm is based on the idea that if the average rate at which water is poured into a bucket exceeds the rate at which the bucket leaks, the bucket will overflow.

The leaky bucket algorithm is similar to the token bucket algorithm, but instead of using a fixed-size token bucket, it uses a leaky bucket that empties at a fixed rate. Each incoming request adds to the bucket's depth, and if the bucket overflows, the request is rejected.

One way to implement this is using a queue, which corresponds to the bucket that will contain the incoming requests. Whenever a new request is made, it is added to the queue's end. If the queue is full at any time, then the additional requests are discarded.

The leaky bucket algorithm can be separated into the following concepts:

Leaky bucket example with a queue size of 3, the number of requests allowed per user per minute.
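The queue-based variant described above can be sketched like this. It is a simplified illustration under stated assumptions: the names `LeakyBucket`, `submit`, and `leak_interval_seconds` are made up here, and requests that leak out of the queue are simply dropped from it, where a real system would hand them to a worker for processing.

```python
import time
from collections import deque


class LeakyBucket:
    def __init__(self, capacity, leak_interval_seconds, clock=time.monotonic):
        self.capacity = capacity                    # maximum queue size
        self.leak_interval = leak_interval_seconds  # one request leaks per interval
        self.clock = clock
        self.queue = deque()
        self.last_leak = clock()

    def _leak(self):
        """Drain the requests that have leaked out since the last call."""
        now = self.clock()
        leaked = int((now - self.last_leak) / self.leak_interval)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()  # oldest request leaves first
            self.last_leak += leaked * self.leak_interval

    def submit(self, request):
        self._leak()
        if len(self.queue) >= self.capacity:
            return False  # bucket is full: discard the request
        self.queue.append(request)
        return True
```

Unlike the token bucket, this shape smooths bursts into a steady outflow: requests leave the queue at a constant rate regardless of how they arrived.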

3. Sliding Window Logs

Another approach to rate limiting is to use sliding window logs. This data structure involves a "window" of fixed size that slides along a timeline of events, storing information about the events that fall within the window at any given time.

The window can be thought of as a buffer of limited size that holds the most recent events or changes that have occurred. As new events or changes occur, they are added to the buffer, and old events that fall outside of the window are removed. This ensures that the buffer stays within its fixed size, and only contains the most recent events.

This rate-limiting technique keeps track of each client's requests in a time-stamped log. These logs are normally stored in a hash set or table sorted by time.

The sliding window logs algorithm can be implemented using the following steps:

Sliding window logs in a timeframe of 1 minute
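A sliding window log can be sketched with one time-ordered queue of timestamps per user. This is a minimal illustration: the class and method names are assumptions, and this variant logs only accepted requests (some implementations also log rejected ones).

```python
import time
from collections import defaultdict, deque


class SlidingWindowLog:
    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self.logs = defaultdict(deque)  # user_id -> timestamps, oldest first

    def allow(self, user_id):
        now = self.clock()
        log = self.logs[user_id]
        # Drop timestamps that have slid out of the window.
        while log and log[0] <= now - self.window_seconds:
            log.popleft()
        if len(log) >= self.limit:
            return False  # too many requests in the last window
        log.append(now)
        return True
```

Every accepted request stores a timestamp, so memory grows with the limit and the number of active users.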

4. Sliding Window Counters

The sliding window counter algorithm is an optimization over sliding window logs. The previous approach uses a lot of memory: with many users or long window timeframes, every request timestamp must be kept for the full window. Removing the many timestamps that have fallen out of the window also adds to the time cost of each request.

To smooth out surges of traffic, this algorithm adds a weighted share of the previous window's request count, based on how much of that window still overlaps the sliding window. Alternatively, with a one-minute rate limit, we can keep a counter for each second and sum the counters for the previous minute whenever a new request arrives to decide whether to throttle it.

The sliding window counters can be separated into the following concepts:

Sliding window counters with a timeframe of 20 seconds
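The weighted variant can be sketched as below. It keeps only two counters (current and previous fixed window) and estimates the sliding-window total as `previous * (1 - elapsed_fraction) + current`. The names are illustrative assumptions, and the sketch tracks a single client; a per-user version would keep one such object per user ID.

```python
import time


class SlidingWindowCounter:
    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self.current_index = None  # index of the fixed window in progress
        self.current_count = 0
        self.previous_count = 0

    def allow(self):
        now = self.clock()
        index = int(now // self.window_seconds)
        if index != self.current_index:
            # Carry the old count over only if it belongs to the adjacent window.
            adjacent = (self.current_index is not None
                        and index == self.current_index + 1)
            self.previous_count = self.current_count if adjacent else 0
            self.current_count = 0
            self.current_index = index
        # Fraction of the current fixed window that has already elapsed.
        elapsed_fraction = (now % self.window_seconds) / self.window_seconds
        estimated = (self.previous_count * (1 - elapsed_fraction)
                     + self.current_count)
        if estimated >= self.limit:
            return False
        self.current_count += 1
        return True
```

The estimate assumes requests in the previous window were evenly spread, so it is an approximation that trades a little accuracy for constant memory per client.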

7. Examples of Rate Limiting APIs used worldwide

This section highlights commonly used platforms and tools that provide built-in rate limiting features.