Build a Distributed Rate Limiter Across 100+ API Gateway Nodes


distributed-systems api-design scalability

System Design Deep Dive

Distributed Rate Limiter

Enforcing global per-user limits across 100+ gateway nodes without a single coordinator

14 min readAdvancedRate-Limiting

Think of a highway toll system. A single toll booth at the entrance of a major city works perfectly when traffic is light, but the moment you scale to millions of vehicles per day, that single booth becomes the chokepoint that causes the very congestion it was meant to regulate. You build more lanes, distribute the toll collection, and now each booth processes its slice of traffic independently. But here is the twist: you still need to enforce a per-vehicle daily limit across all booths, not just at one. That is the core problem of distributed rate limiting at scale.

The naive implementation of a rate limiter places a single Redis instance or a database table as the sole arbiter of whether a request is allowed. Every API gateway node, before forwarding a request, asks the central coordinator: “has this user exceeded their limit?” At modest scale, this works. At 100+ gateway nodes each fielding thousands of requests per second, the coordinator becomes a serialization bottleneck, a single point of failure, and an added 5-15ms of latency on every single request path. A 10ms overhead multiplied across millions of requests per day is not an engineering inconvenience, it is a product-level SLA violation that triggers escalations and on-call pages.

The deeper challenge is the local vs global limit tradeoff. Each gateway node can maintain an in-process counter that is blindingly fast - a simple atomic increment in memory costs nanoseconds. But if node A has seen 900 requests from user X while node B has seen 900 requests from the same user, and the global limit is 1000 requests per minute, you are already at 1800 - double the limit - with no node aware that a violation is occurring. The opposite extreme, serializing every check through a single global store, sacrifices the latency guarantees that make an API gateway competitive. The winning design lives in the middle: local approximation with periodic global synchronization and a shared authoritative store for the final guard.

We need to solve for sub-millisecond enforcement latency under normal conditions, eventual global consistency within a bounded error window, graceful degradation when the shared store is unreachable, and operational simplicity that allows limit overrides without a service restart simultaneously.

Requirements and Constraints

Functional Requirements

  • Enforce per-user request rate limits (e.g., 1000 requests per minute per user)
  • Enforce per-tenant (organization) rate limits independently of per-user limits
  • Support multiple limit tiers (free, pro, enterprise) with different quotas
  • Provide a limit override API to temporarily raise or lower limits for specific users or tenants
  • Return 429 Too Many Requests with a Retry-After header when limits are exceeded
  • Track and expose current usage metrics per user and tenant

Non-Functional Requirements

  • Enforcement decision latency: under 1ms at the 99th percentile (p99)
  • Global consistency lag: at most 500ms between a limit being hit on one node and being enforced on all nodes
  • Throughput: support 500,000 requests per second across the gateway fleet
  • Availability: rate limiter must degrade gracefully (fail open) if the Redis cluster is unreachable
  • Accuracy: allow at most 5% overage above configured limits during normal operation
  • Durability: limit configurations survive gateway node restarts and Redis failovers

Constraints

  • No single point of failure in the enforcement path
  • Redis Cluster with 6 nodes (3 primary, 3 replica) - no standalone Redis
  • Gateway fleet may scale from 20 to 200+ nodes dynamically via autoscaling
  • Each gateway node has 512MB memory budget for rate limiting state
  • Limit configuration changes must propagate within 30 seconds

High-Level Architecture

The system is composed of four major layers working in concert. The API Gateway Fleet runs 100+ stateless nodes, each with a Local Counter Layer holding an in-process approximate count. The Redis Rate Store acts as the globally authoritative counter backend, sharded across a Redis Cluster. The Config Service stores rate limit configurations and serves them to gateway nodes via a push/pull mechanism. Finally, the Gossip Synchronizer runs as a lightweight sidecar on each gateway node, periodically broadcasting local counts to peers and reconciling the global view.

Distributed rate limiter architecture across 100+ gateway nodes

When a request arrives at any gateway node, the decision sequence is: check the local counter first (nanoseconds), then if the local count is within the pre-allocated local budget, allow immediately. If the local budget is exhausted, or if a random sampling roll triggers a sync, the node performs an atomic Redis operation to check and increment the global counter. Redis responds with the current global count, and the node makes the final allow-or-deny decision. The gossip layer runs asynchronously, sharing aggregate counts between nodes every 100ms to rebalance local budgets without touching Redis.

The key insight is that the local counter layer and the Redis layer serve different purposes. Local counters exist for speed - they answer the common case (user is well under their limit) in nanoseconds with zero network I/O. Redis exists for correctness - it serializes the critical section when a user approaches their limit. Gossip exists for efficiency - it prevents every node from hammering Redis for users who are nowhere near their limit.

Component Deep Dives

The Local Counter Layer

The Local Counter Layer’s job is to answer the majority of rate limit checks in under 100 nanoseconds without touching the network.

Each gateway node maintains an in-process hash map keyed by (user_id, window_start_epoch_second). The value is a 64-bit atomic counter. When a request arrives, the node increments the counter and compares it against the node’s local budget, which is a fraction of the global limit allocated to this specific node. The node budget is calculated as global_limit / num_active_nodes * local_budget_fraction, where local_budget_fraction defaults to 0.7 - meaning each node can pre-authorize 70% of its fair share before requiring a Redis round-trip.

The tricky design decision is what happens when the local budget is exhausted. A hard stop here would cause false negatives: a user who has consumed their node’s local budget but has ample global quota remaining would be incorrectly rejected. Instead, exhausting the local budget triggers a Redis sync, which returns the current global count and optionally re-allocates a fresh local budget from the remaining global headroom.

Burst allowance is handled at the local layer using a token bucket model layered on top of the sliding window counter. Each user gets a small burst reserve of burst_multiplier * base_rate / num_nodes tokens that refill at the base rate. The burst reserve allows a user who has been quiet for a few seconds to send a quick batch of requests without immediately hitting the Redis layer.

import threading
import time
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class LocalBucket:
    count: int = 0
    window_start: float = 0.0
    burst_tokens: float = 0.0
    last_refill: float = field(default_factory=time.monotonic)
    lock: threading.Lock = field(default_factory=threading.Lock)

class LocalCounterLayer:
    def __init__(self, global_limit: int, num_nodes: int, window_seconds: int = 60):
        self.node_budget = int(global_limit / num_nodes * 0.7)
        self.burst_reserve = int(global_limit / num_nodes * 0.3)
        self.window_seconds = window_seconds
        self.refill_rate = global_limit / num_nodes / window_seconds
        self._buckets: dict[str, LocalBucket] = defaultdict(LocalBucket)

    def check_and_increment(self, user_id: str) -> tuple[bool, int]:
        bucket = self._buckets[user_id]
        now = time.monotonic()
        window_start = now - (now % self.window_seconds)

        with bucket.lock:
            # Reset counter on new window
            if window_start > bucket.window_start:
                bucket.count = 0
                bucket.window_start = window_start

            # Refill burst tokens
            elapsed = now - bucket.last_refill
            bucket.burst_tokens = min(
                self.burst_reserve,
                bucket.burst_tokens + elapsed * self.refill_rate
            )
            bucket.last_refill = now

            # Fast path: within node budget
            if bucket.count < self.node_budget:
                bucket.count += 1
                return True, bucket.count

            # Check burst reserve
            if bucket.burst_tokens >= 1.0:
                bucket.burst_tokens -= 1.0
                bucket.count += 1
                return True, bucket.count

            # Budget exhausted - caller must check Redis
            return False, bucket.count

The analogy here is a local cash register with a float. The cashier (gateway node) keeps a limited amount of cash on hand (local budget) and handles most transactions without going to the vault (Redis). Only when the float runs low does the cashier make a trip to the vault to replenish. This minimizes vault trips while ensuring the total cash in circulation never exceeds the allowed amount.

Failure mode: if a gateway node crashes, its in-flight local counter state is lost. Any counts not yet flushed to Redis are gone. This means a node crash can allow a burst of requests through during the next window that technically exceeds the global limit. The 5% overage tolerance in the SLA accounts for this. To bound the loss, local counters are flushed to Redis every 10 seconds.

The Redis Rate Store

The Redis Rate Store’s job is to act as the globally authoritative counter that serializes limit enforcement when users approach their quotas.

The sliding window counter algorithm is implemented as a sorted set in Redis, where each element is a request identifier and the score is the request timestamp. This allows the Redis Lua script to atomically count requests within the current window, remove expired entries, increment for the new request, and return the current count - all in a single round-trip that cannot be interleaved with other operations.

Using a sorted set instead of a simple string counter solves the boundary problem of fixed windows. With fixed windows, a user can send 1000 requests at 11:59pm and another 1000 at 12:00am, effectively doubling the limit at the boundary. The sliding window maintains an accurate count of requests in the last N seconds regardless of window alignment.

The alternative algorithm is a token bucket stored as a hash with tokens and last_refill fields. Token buckets are better for burst-tolerant use cases where you care more about sustained rate than exact per-minute counts. The Redis Lua script for a token bucket atomically computes how many tokens to add since the last refill, caps at the bucket maximum, consumes one token, and stores the new state.

-- Sliding window counter using sorted set
-- KEYS[1]: rate limit key (e.g., "rl:user:42:requests")
-- ARGV[1]: current timestamp (milliseconds)
-- ARGV[2]: window duration (milliseconds)
-- ARGV[3]: limit
-- ARGV[4]: request ID (unique per request)
-- Returns: {current_count, allowed (1/0)}

local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
local req_id = ARGV[4]
local window_start = now - window

-- Remove entries outside the window
redis.call('ZREMRANGEBYSCORE', key, '-inf', window_start)

-- Count current entries
local count = redis.call('ZCARD', key)

if count < limit then
  -- Add current request
  redis.call('ZADD', key, now, req_id)
  -- Set TTL slightly longer than window to handle clock skew
  redis.call('PEXPIRE', key, window + 1000)
  return {count + 1, 1}
else
  return {count, 0}
end
-- Token bucket in Redis
-- KEYS[1]: rate limit key
-- ARGV[1]: current timestamp (seconds, float)
-- ARGV[2]: fill rate (tokens per second)
-- ARGV[3]: bucket capacity
-- Returns: {tokens_remaining, allowed (1/0)}

local key = KEYS[1]
local now = tonumber(ARGV[1])
local fill_rate = tonumber(ARGV[2])
local capacity = tonumber(ARGV[3])

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

local elapsed = now - last_refill
local new_tokens = math.min(capacity, tokens + elapsed * fill_rate)

if new_tokens >= 1.0 then
  new_tokens = new_tokens - 1.0
  redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
  redis.call('EXPIRE', key, math.ceil(capacity / fill_rate) + 10)
  return {math.floor(new_tokens), 1}
else
  redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
  return {0, 0}
end

Redis Lua atomicity is the cornerstone of correctness here. The EVAL command in Redis executes a Lua script atomically - no other Redis command can execute between the ZREMRANGEBYSCORE and ZADD calls. This prevents the race condition where two gateway nodes simultaneously check the count, both see it as below the limit, both add their requests, and the limit is exceeded by two.

Redis internals showing Lua script execution path and sliding window data structure

Failure mode: Redis Cluster failover takes 5-15 seconds during which the primary shard is unreachable. The gateway should fail open (allow requests) during this window rather than dropping legitimate traffic. A circuit breaker on the Redis client switches to local-only enforcement when error rates exceed a threshold, and the circuit half-opens every 5 seconds to test recovery.

Gossip-Based Rate Synchronization

The Gossip Synchronizer’s job is to share aggregate rate limit counts between gateway nodes so that each node has a reasonable approximation of global usage without hammering Redis on every request.

Each gateway node runs a background goroutine that every 100ms selects 3 random peers from the gateway fleet membership list and sends them a compact summary of its local counters. The receiver merges the incoming counts using a max-merge CRDT strategy: for each user, take the maximum count seen across all received gossip messages and the local state. This is not perfectly accurate but it is eventually consistent and monotonically increasing, which is the property we need for rate limiting.

The membership list is maintained using a simple heartbeat protocol: each node broadcasts a heartbeat every 5 seconds. Nodes that have not sent a heartbeat in 15 seconds are considered dead and removed from the gossip target pool. When new nodes join, they announce themselves and begin receiving gossip within one cycle.

package gossip

import (
    "encoding/json"
    "math/rand"
    "net/http"
    "sync"
    "time"
)

type CounterSnapshot struct {
    NodeID    string             `json:"node_id"`
    Timestamp int64              `json:"timestamp"`
    Counts    map[string]int64   `json:"counts"` // key: "user:ID:window", value: count
}

type GossipSynchronizer struct {
    nodeID      string
    peers       []string
    localCounts sync.Map
    client      *http.Client
    interval    time.Duration
}

func NewGossipSynchronizer(nodeID string, interval time.Duration) *GossipSynchronizer {
    return &GossipSynchronizer{
        nodeID:   nodeID,
        interval: interval,
        client:   &http.Client{Timeout: 50 * time.Millisecond},
    }
}

func (g *GossipSynchronizer) Run() {
    ticker := time.NewTicker(g.interval)
    defer ticker.Stop()
    for range ticker.C {
        g.gossipRound()
    }
}

func (g *GossipSynchronizer) gossipRound() {
    snapshot := g.buildSnapshot()
    targets := g.selectPeers(3)
    for _, peer := range targets {
        go g.sendSnapshot(peer, snapshot)
    }
}

func (g *GossipSynchronizer) MergeSnapshot(incoming CounterSnapshot) {
    for key, incomingCount := range incoming.Counts {
        // Max-merge: always keep the higher count
        for {
            existing, loaded := g.localCounts.Load(key)
            if !loaded {
                if _, swapped := g.localCounts.LoadOrStore(key, incomingCount); swapped {
                    break
                }
                continue
            }
            existingCount := existing.(int64)
            if incomingCount <= existingCount {
                break
            }
            if g.localCounts.CompareAndSwap(key, existingCount, incomingCount) {
                break
            }
        }
    }
}

func (g *GossipSynchronizer) selectPeers(n int) []string {
    if len(g.peers) <= n {
        return g.peers
    }
    indices := rand.Perm(len(g.peers))[:n]
    result := make([]string, n)
    for i, idx := range indices {
        result[i] = g.peers[idx]
    }
    return result
}

func (g *GossipSynchronizer) buildSnapshot() CounterSnapshot {
    counts := make(map[string]int64)
    g.localCounts.Range(func(k, v any) bool {
        counts[k.(string)] = v.(int64)
        return true
    })
    return CounterSnapshot{
        NodeID:    g.nodeID,
        Timestamp: time.Now().UnixMilli(),
        Counts:    counts,
    }
}

func (g *GossipSynchronizer) sendSnapshot(peer string, snap CounterSnapshot) {
    body, _ := json.Marshal(snap)
    // POST to peer's gossip endpoint - errors are silently ignored
    // gossip is best-effort; correctness comes from Redis
    _ = body
}

The analogy is a game of telephone played optimistically. Nodes whisper their current counts to a few neighbors, who pass them along. Within a few rounds (typically 3-5 hops for a 100-node cluster using the gossip fan-out of 3), every node has a rough picture of the global state. This picture is always a lower bound, never an upper bound, because nodes can only see counts that have been shared with them. The Redis layer is the ground truth that catches any final overage.

Failure mode: if gossip messages are delayed due to network congestion, nodes can diverge significantly in their view of global counts. The mitigation is that each node resyncs with Redis every 10 seconds as a floor, regardless of gossip health. Additionally, gossip messages older than 5 seconds are discarded on receipt rather than merged, preventing stale state from corrupting the current window.

The Limit Override API

The Limit Override API’s job is to allow operators to temporarily modify rate limits for specific users or tenants without restarting any service.

Overrides are stored in a PostgreSQL table with a TTL column. The Config Service watches this table using Postgres LISTEN/NOTIFY and pushes override events to all gateway nodes via a Redis pub/sub channel. Gateway nodes subscribe to this channel and update their local config cache within milliseconds of an override being created. This means an operator can raise a customer’s limit from 1000 to 5000 requests per minute in under 5 seconds end-to-end.

The override API supports both absolute overrides (set limit to exactly N) and multiplicative overrides (multiply current limit by factor F). Multiplicative overrides are safer for operational use because they maintain the correct ratio across tier changes.

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from datetime import datetime, timedelta
import asyncpg

app = FastAPI()

class LimitOverride(BaseModel):
    user_id: str | None = None
    tenant_id: str | None = None
    override_type: str  # "absolute" or "multiplicative"
    value: float        # limit value or multiplier
    duration_minutes: int
    reason: str

@app.post("/api/v1/rate-limits/overrides")
async def create_override(override: LimitOverride, db: asyncpg.Connection):
    if not override.user_id and not override.tenant_id:
        raise HTTPException(400, "Must specify user_id or tenant_id")

    expires_at = datetime.utcnow() + timedelta(minutes=override.duration_minutes)

    row = await db.fetchrow("""
        INSERT INTO rate_limit_overrides
            (user_id, tenant_id, override_type, value, expires_at, reason, created_at)
        VALUES ($1, $2, $3, $4, $5, $6, NOW())
        RETURNING id
    """, override.user_id, override.tenant_id,
        override.override_type, override.value,
        expires_at, override.reason)

    # Publish to Redis pub/sub for instant propagation
    await publish_override_event({
        "id": str(row["id"]),
        "user_id": override.user_id,
        "tenant_id": override.tenant_id,
        "override_type": override.override_type,
        "value": override.value,
        "expires_at": expires_at.isoformat(),
    })

    return {"id": str(row["id"]), "expires_at": expires_at.isoformat()}

@app.delete("/api/v1/rate-limits/overrides/{override_id}")
async def delete_override(override_id: str, db: asyncpg.Connection):
    result = await db.execute(
        "DELETE FROM rate_limit_overrides WHERE id = $1",
        override_id
    )
    if result == "DELETE 0":
        raise HTTPException(404, "Override not found")
    await publish_override_event({"id": override_id, "deleted": True})
    return {"status": "deleted"}

Failure mode: if the Redis pub/sub channel is unavailable when an override is created, gateway nodes will not receive the update in real time. The fallback is that each gateway node polls the Config Service every 30 seconds for config changes. This means the worst-case propagation delay for an override is 30 seconds even without pub/sub.

Data Model

The data model separates rate limit configuration from runtime state. Configuration lives in PostgreSQL for durability and auditability. Runtime state lives in Redis for speed. The two are joined by the gateway node’s in-memory cache.

-- Rate limit configurations per tier
CREATE TABLE rate_limit_configs (
    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tier          VARCHAR(50) NOT NULL,  -- 'free', 'pro', 'enterprise'
    resource      VARCHAR(100) NOT NULL, -- 'api_requests', 'data_transfer', etc.
    limit_value   INTEGER NOT NULL,
    window_seconds INTEGER NOT NULL,
    burst_multiplier FLOAT NOT NULL DEFAULT 1.2,
    algorithm     VARCHAR(20) NOT NULL DEFAULT 'sliding_window', -- or 'token_bucket'
    created_at    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE (tier, resource)
);

-- Per-user tier assignments
CREATE TABLE user_rate_tiers (
    user_id       UUID NOT NULL,
    tier          VARCHAR(50) NOT NULL DEFAULT 'free',
    assigned_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    assigned_by   UUID,
    PRIMARY KEY (user_id)
);

-- Temporary overrides
CREATE TABLE rate_limit_overrides (
    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id       UUID,
    tenant_id     UUID,
    override_type VARCHAR(20) NOT NULL, -- 'absolute' or 'multiplicative'
    value         FLOAT NOT NULL,
    reason        TEXT NOT NULL,
    expires_at    TIMESTAMPTZ NOT NULL,
    created_at    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    CHECK (user_id IS NOT NULL OR tenant_id IS NOT NULL)
);

-- Index for active overrides lookup (hot path during config refresh)
CREATE INDEX idx_overrides_user_active
    ON rate_limit_overrides (user_id, expires_at)
    WHERE expires_at > NOW() AND user_id IS NOT NULL;

CREATE INDEX idx_overrides_tenant_active
    ON rate_limit_overrides (tenant_id, expires_at)
    WHERE expires_at > NOW() AND tenant_id IS NOT NULL;

-- Audit log for limit decisions (sampled, not every request)
CREATE TABLE rate_limit_events (
    id            BIGSERIAL PRIMARY KEY,
    user_id       UUID NOT NULL,
    tenant_id     UUID,
    node_id       VARCHAR(100) NOT NULL,
    decision      VARCHAR(10) NOT NULL, -- 'allow' or 'deny'
    count_at_decision INTEGER NOT NULL,
    limit_applied INTEGER NOT NULL,
    recorded_at   TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Partition by month for retention management
CREATE INDEX idx_events_user_time
    ON rate_limit_events (user_id, recorded_at DESC);

The rate_limit_configs table uses a composite unique index on (tier, resource) so lookups are O(1). The overrides table uses partial indexes filtered on expires_at > NOW() - in PostgreSQL this creates an index that only includes non-expired rows, keeping the index small and fast. As overrides expire, they fall out of the partial index automatically.

Redis key structure follows the pattern rl:{user_id}:{resource}:{window_epoch} for sliding window counters and rl:tb:{user_id}:{resource} for token buckets. Keys are prefixed with the Redis Cluster hash slot tag {user_id} to ensure all keys for a given user land on the same shard, allowing cross-key Lua scripts to work correctly.

Data flow showing full request lifecycle from incoming request through local check to Redis and final allow/deny decision

Key Algorithms and Protocols

Sliding Window Counter

The sliding window counter tracks requests within a rolling time window rather than fixed intervals. The implementation using a Redis sorted set provides exact counts at the cost of memory proportional to the request rate. For a user making 1000 requests per minute, the sorted set holds at most 1000 members at any time.

import redis
import time
import uuid

def sliding_window_check(
    r: redis.Redis,
    user_id: str,
    limit: int,
    window_seconds: int
) -> tuple[bool, int]:
    key = f"rl:{user_id}:requests"
    now_ms = int(time.time() * 1000)
    window_ms = window_seconds * 1000
    window_start_ms = now_ms - window_ms

    lua_script = """
    local key = KEYS[1]
    local now = tonumber(ARGV[1])
    local window_start = tonumber(ARGV[2])
    local limit = tonumber(ARGV[3])
    local req_id = ARGV[4]

    redis.call('ZREMRANGEBYSCORE', key, '-inf', window_start)
    local count = redis.call('ZCARD', key)

    if count < limit then
        redis.call('ZADD', key, now, req_id)
        redis.call('PEXPIRE', key, tonumber(ARGV[5]))
        return {count + 1, 1}
    end
    return {count, 0}
    """

    result = r.eval(
        lua_script, 1, key,
        now_ms, window_start_ms, limit,
        str(uuid.uuid4()), window_ms + 5000
    )
    count, allowed = int(result[0]), bool(result[1])
    return allowed, count

The sliding window sorted set approach uses approximately 80 bytes per stored request entry in Redis (UUID + score + hash table overhead). For a limit of 1000 requests per minute, this is 80KB per user. With 10,000 active users, that is 800MB - which may be too large. For memory-constrained deployments, use the fixed window counter with two overlapping windows (current and previous) to approximate sliding behavior at 1/1000th the memory cost.

Token Bucket Algorithm

The token bucket maintains a pool of tokens that refills at a constant rate. Each request consumes one token. If the bucket is empty, the request is denied. This naturally handles burst traffic: a user who has been idle accumulates tokens up to the bucket capacity, then can spend them quickly.

package ratelimit

import (
    "context"
    "fmt"
    "time"
    "github.com/redis/go-redis/v9"
)

type TokenBucketConfig struct {
    Capacity  float64 // max tokens
    FillRate  float64 // tokens per second
}

func TokenBucketCheck(
    ctx context.Context,
    rdb *redis.Client,
    userID string,
    cfg TokenBucketConfig,
) (allowed bool, remaining float64, err error) {
    key := fmt.Sprintf("rl:tb:%s", userID)
    now := float64(time.Now().UnixNano()) / 1e9

    script := redis.NewScript(`
        local key = KEYS[1]
        local now = tonumber(ARGV[1])
        local fill_rate = tonumber(ARGV[2])
        local capacity = tonumber(ARGV[3])

        local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
        local tokens = tonumber(bucket[1]) or capacity
        local last_refill = tonumber(bucket[2]) or now

        local elapsed = math.max(0, now - last_refill)
        local new_tokens = math.min(capacity, tokens + elapsed * fill_rate)

        if new_tokens >= 1.0 then
            new_tokens = new_tokens - 1.0
            local ttl = math.ceil(capacity / fill_rate) + 10
            redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
            redis.call('EXPIRE', key, ttl)
            return {new_tokens, 1}
        end

        redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
        return {0, 0}
    `)

    result, err := script.Run(ctx, rdb, []string{key},
        now, cfg.FillRate, cfg.Capacity).Slice()
    if err != nil {
        return false, 0, fmt.Errorf("token bucket check: %w", err)
    }

    remaining = result[0].(float64)
    allowed = result[1].(int64) == 1
    return allowed, remaining, nil
}

Gossip Synchronization Protocol

The gossip protocol runs every 100ms and follows a fixed sequence: build a snapshot of all local counters modified since the last gossip round, select 3 random peers, serialize the snapshot as a compact binary message (not JSON in production), and send it asynchronously. Peers that receive the snapshot apply a max-merge against their own state.

The convergence time for gossip in a network of N nodes with fan-out F is approximately log(N) / log(F) rounds. For 100 nodes with fan-out 3, this is about 4 rounds, or 400ms. This aligns with our 500ms global consistency requirement.

import asyncio
import time
import random
from dataclasses import dataclass

@dataclass
class GossipMessage:
    node_id: str
    timestamp: float
    counts: dict[str, int]  # key: "user_id:window_epoch", value: count

async def gossip_round(
    local_counts: dict[str, int],
    peer_urls: list[str],
    node_id: str,
    session
) -> None:
    if not peer_urls:
        return

    msg = GossipMessage(
        node_id=node_id,
        timestamp=time.time(),
        counts=dict(local_counts)  # snapshot
    )

    targets = random.sample(peer_urls, min(3, len(peer_urls)))

    tasks = [
        send_gossip(session, peer, msg)
        for peer in targets
    ]
    # Fire and forget - gossip failures are non-fatal
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return results

def merge_gossip(
    local_counts: dict[str, int],
    incoming: GossipMessage,
    max_age_seconds: float = 5.0
) -> dict[str, int]:
    if time.time() - incoming.timestamp > max_age_seconds:
        return local_counts  # Discard stale gossip

    merged = dict(local_counts)
    for key, incoming_count in incoming.counts.items():
        existing = merged.get(key, 0)
        merged[key] = max(existing, incoming_count)  # Max-merge CRDT
    return merged

The gossip protocol makes rate limiting eventually consistent rather than strongly consistent. This means a user could briefly exceed their limit during the convergence window. This is acceptable for most APIs - a 5% overage for 400ms is far better than adding 10ms of synchronous overhead to every single request. If you need strict enforcement (financial APIs, abuse prevention), route 100% of checks through Redis and accept the latency.

Scaling and Performance

The system scales horizontally in two dimensions: gateway nodes and Redis shards. Adding gateway nodes increases the fleet’s aggregate request processing capacity. Adding Redis shards increases the rate store’s throughput for limit checks. The two scale independently, which is the key architectural property that avoids coordination bottlenecks.

Each gateway node processes approximately 5,000 requests per second with p99 enforcement latency under 1ms. Of these, roughly 85% are resolved by the local counter layer in under 200 nanoseconds. The remaining 15% require a Redis round-trip, which adds 1-3ms of latency. The gossip layer uses about 50KB/s of bandwidth per node for a fleet of 100 nodes with 10,000 active users.

Scaling diagram showing how gateway nodes and Redis shards scale independently
Capacity Estimation
--------------------
Fleet size:              100 gateway nodes
Requests per node:       5,000 req/s
Total fleet throughput:  500,000 req/s

Local check hit rate:    85%  -> 425,000 req/s resolved locally
Redis check rate:        15%  -> 75,000 req/s hitting Redis

Redis Cluster:           3 primary shards
Requests per shard:      25,000 req/s
Redis p99 latency:       1.5ms (within datacenter)

Memory per gateway node:
  Local counters (10k active users): ~40MB
  Gossip state:                       ~8MB
  Total:                             ~48MB (well under 512MB budget)

Redis memory per shard:
  Sorted set (sliding window):
    1,000 req/min limit * 80 bytes * 10,000 users / 3 shards = ~267MB
  Token bucket (hash):
    10,000 users * 64 bytes / 3 shards = ~0.2MB

At Stripe, their rate limiter architecture uses a similar two-tier approach: in-process counters for speed and Redis for global consistency. They published that their p99 rate limit check latency is under 500 microseconds, achieved by keeping the Redis check on the hot path only for users near their limit. The local counter layer handles the vast majority of traffic without any network I/O.

Failure Modes and Recovery

FailureDetectionImpactRecovery
Redis primary shard failureRedis Sentinel / health checks within 5sUsers on that shard temporarily use local-only enforcement; up to 5% overageRedis Cluster auto-failover promotes replica in 5-15s; gateway nodes reconnect automatically
Gateway node crashLoad balancer health check within 10sIn-flight local counter state lost; brief overage for affected usersTraffic rerouted to remaining nodes; no state recovery needed (Redis is authoritative)
Network partition (gateway to Redis)Redis client timeout after 100msAll affected nodes switch to fail-open mode; limits not enforcedCircuit breaker half-opens every 5s; full recovery when partition heals
Gossip storm (all nodes gossip simultaneously)CPU spike on gossip endpointGossip processing consumes excess CPU, slowing rate checksJitter added to gossip interval (100ms +/- 30ms); gossip message rate limiting per peer
Config Service unavailableHTTP health check within 15sNo new limit overrides propagated; existing config cachedGateway uses stale config (up to 30s old); Config Service restarts independently
Redis key expiry misconfigurationMonitoring alert on missing keysAll users appear to have zero count; all requests allowedAdd alerting on Redis keyspace event for unexpected key deletion; verify TTL on every write

The most dangerous failure mode is the “split brain” scenario: two Redis shards become unreachable simultaneously while gateway nodes are near the limit for a high-value user. In this case, both the Redis check and the gossip layer fail to communicate, and local-only enforcement allows up to num_nodes * local_budget requests through - potentially 100x the limit. Mitigate this by implementing a hard ceiling in the local counter: never allow more than global_limit * 2 / num_nodes requests locally before forcing a Redis check, even if Redis is unavailable.

Comparison of Approaches

ApproachConsistencyLatency OverheadMemory CostComplexity
Centralized Redis (no local cache)StrongHigh (5-15ms per request)Low (O(users))Low
Local-only countersWeak (per-node)None (in-process)Medium (per-node)Low
Local + Redis hybrid (this design)Eventual (within 500ms)Low (1ms for Redis path)MediumMedium
Gossip-only (no Redis)Eventual (within 400ms)Very lowLowHigh
Leaky bucket (single Redis list)StrongHigh (RPUSH/LTRIM)High (O(requests))Low
Fixed window counterApproximateLow (INCR + EXPIRE)Very lowVery low

The hybrid local + Redis approach is the right choice for this system because it meets the sub-millisecond p99 requirement for the common case (user well under limit) while still providing global enforcement that matters for users near their limit. Pure local counters fail the global consistency requirement. Centralized Redis fails the latency requirement at this scale. Gossip-only fails in adversarial conditions where a user specifically targets one node.

For smaller deployments (under 10 nodes), centralized Redis with a fixed window counter is dramatically simpler and perfectly adequate. Only reach for the hybrid approach when you have measured that Redis latency is appearing in your p99 request latency budget.

Key Takeaways

  • Local counter layer: Resolve 85%+ of rate limit checks in under 200 nanoseconds using in-process atomic counters with a pre-allocated budget per node.
  • Redis Lua atomicity: The Lua EVAL command executes check-and-increment atomically, eliminating the race condition that would allow limit overages under concurrent load.
  • Sliding window vs token bucket: Sliding window sorted sets give exact per-minute counts but cost 80 bytes per request in Redis memory; token buckets give burst-tolerant enforcement at constant memory (one hash per user).
  • Gossip-based sync: Gossip propagates approximate global counts across 100 nodes in under 400ms without any central coordinator, achieving eventual consistency at the cost of up to 5% overage during the convergence window.
  • Local vs global limit tradeoff: Allocate 70% of fair-share quota to local budgets (speed path) and reserve 30% for Redis enforcement (accuracy path); adjust this ratio based on your latency vs accuracy requirements.
  • Burst allowance: Layer a token bucket on top of the sliding window counter at the local level to allow short bursts without immediately escalating to Redis.
  • Limit override API: Store overrides in PostgreSQL, propagate via Redis pub/sub for fast delivery, and fall back to polling for reliability.
  • Fail open, not closed: When Redis is unreachable, allow requests through with local-only enforcement. A brief overage is almost always better than dropping legitimate traffic for your customers.

Frequently Asked Questions

Q: Why not use a single Redis instance instead of a cluster?

A: A single Redis instance becomes a bottleneck at 75,000 requests per second hitting it for rate checks (our 15% Redis check rate from 500K req/s total). A well-tuned Redis single instance can handle around 100K operations per second, so a single instance would be at 75% capacity with no headroom for spikes. Redis Cluster with 3 shards gives us 3x the headroom and eliminates the single point of failure. For smaller deployments under 50K req/s total, a single Redis instance with a replica for failover is sufficient.

Q: Why not use Nginx or an API gateway product that has built-in rate limiting?

A: Most built-in rate limiters (Nginx limit_req, Kong rate-limiting) use node-local counting with no cross-node coordination. They enforce per-node limits, not global limits. If you have 10 gateway nodes and a 1000 req/min limit, a user can make 10,000 requests per minute (1000 to each node) without triggering any limit. That is only acceptable if your use case genuinely needs per-node limiting rather than global limiting.

Q: What is the worst-case overage allowed by this design?

A: In normal operation, the overage is bounded by the local budget size: num_nodes * local_budget_per_node = 100 * (limit * 0.7 / 100) = 0.7 * limit. So a user with a 1000 req/min limit could theoretically consume up to 700 extra requests during the gossip convergence window if they perfectly distribute their requests across all 100 nodes. In practice, users do not distribute requests uniformly across nodes, and gossip converges quickly enough to limit actual overage to under 5%.

Q: Why use gossip instead of a centralized pub/sub like Redis pub/sub for count synchronization?

A: Redis pub/sub would require every gateway node to subscribe to every other gateway node’s count updates, creating O(N^2) message volume as the fleet grows. Gossip is O(N log N) because each node only talks to a fixed fan-out of peers. At 100 nodes, pub/sub would generate 100 * 100 = 10,000 messages per gossip round while gossip generates 100 * 3 = 300. Gossip also degrades gracefully when nodes leave or join, whereas pub/sub channel membership needs explicit management.

Q: How do you handle the case where a user’s limit needs to be enforced to the exact request, with zero tolerance for overage?

A: Drop the local counter layer entirely for that user and route all their requests directly through Redis. You can flag specific users or tiers as “strict enforcement” in the config, and the gateway checks that flag before deciding whether to use local caching. The tradeoff is that strict-enforcement users add a 1-3ms latency overhead per request. This is appropriate for financial API calls, abuse prevention, and high-value accounts where the business impact of overage exceeds the latency cost.

Q: How do you prevent the Redis rate store from being a write hotspot for viral users?

A: Redis Cluster distributes users across shards by the hash of their user ID. A viral user who generates 10x normal traffic will be hot on exactly one shard. Mitigate this by increasing the local budget fraction for users detected as high-traffic (dynamically, based on gossip signal), so fewer of their requests hit Redis. In extreme cases (>10x normal traffic), promote a dedicated Redis replica to serve reads for that user’s key using read-from-replica with a controlled staleness tolerance.

Interview Questions

Q: How would you modify this design to support per-IP rate limiting in addition to per-user limiting?

Expected depth: Discuss key namespace separation (rl:ip:{ip} vs rl:user:{uid}), the challenge that IP space is unbounded (IPv6), memory management using LRU eviction for IP counters, and why IP-based limiting is a weaker signal than user-based limiting (NAT, proxies). Mention that IP limiting is typically a first-pass coarse filter while user limiting is the fine-grained enforcement layer.

Q: How would you design the rollout of a new rate limiting algorithm (e.g., switching from sliding window to token bucket) without downtime?

Expected depth: Discuss feature flags in the config service, shadow mode (run new algorithm in parallel and log differences without enforcing), gradual rollout by user ID hash range, and the data migration challenge (existing sorted set keys need to be converted to hash keys or run in parallel). Mention that algorithm changes affecting limits should be communicated to API consumers via changelog.

Q: A customer reports they are hitting rate limits despite being well under their contracted limit. How would you debug this?

Expected depth: Walk through the debugging layers: check if the request is hitting multiple gateway nodes (inspect X-Gateway-Node header), check if there is an active override that is reducing their limit, query the rate_limit_events audit table for recent decisions, check if there is a tenant-level limit below the user-level limit, verify the window alignment (sliding window vs fixed window can cause confusion at minute boundaries). Mention the importance of the Retry-After header for client-side debugging.

Q: How would you implement rate limiting that is aware of request cost, not just request count (e.g., expensive queries consume more quota)?

Expected depth: Extend the Lua script to accept a cost parameter and consume cost tokens/quota units instead of 1. Discuss how to determine request cost (request size, response time, CPU cost) and whether to determine it pre-request (estimated) or post-request (actual). Pre-request is simpler but requires cost estimation; post-request is accurate but means you can only enforce limits with one request of delay. Most practical implementations use pre-request cost with a fixed cost table per endpoint.

Q: What happens to in-flight requests when a gateway node receives a config update that lowers a user’s rate limit?

Expected depth: Discuss the atomicity challenge: the config cache update and the local counter reset need to happen together. If the counter is not reset when the limit is lowered, users can continue using their old (higher) budget until the window expires. The safe approach is to always enforce min(current_count, new_limit) during a transition period, and to drain the local budget back to Redis before applying the new limit. Also discuss the window TTL: a lower limit takes effect at the next window boundary for simplicity, or immediately for strict enforcement.

Premium Content

Unlock the full article along with everything else in the archive — all in one place.

In-depth analysis Expert insights Full archive access
Unlock Full Article