Build a WebSocket-Based Presence System


distributed-systems performance reliability

System Design Deep Dive

WebSocket Presence System

10 million concurrent users, sub-200ms propagation, zero ghost presences - pick all three

⏱ 14 min read📐 Advanced🏗️ Distributed Systems

Imagine a city-wide bulletin board where anyone can post a note reading “I’m here.” The challenge isn’t writing the note - it’s ensuring that when 10 million people simultaneously post, update, and tear down their notes, every person who cares about a specific note sees the change within 200 milliseconds, and crucially, that every note gets automatically removed the moment its author walks away. That’s the essence of a WebSocket-based presence system: deceptively simple on the surface, genuinely hard at scale.

The naïve approach - store {user_id: status} in a single Redis hash and broadcast every change to all subscribers - collapses almost immediately. At 10 million concurrent users with even modest churn (users coming online, going offline, changing rooms), you’re looking at hundreds of thousands of status changes per second. Broadcasting each change to all subscribers turns into an O(N*M) fan-out problem where N is the event rate and M is the subscriber count. At Slack’s scale, a single channel can have 10,000 members. A single keystroke from any one of them becomes 10,000 delivery events. Multiply that by thousands of active channels and you’ve built a machine that melts under normal operation.

The real tension is between three competing forces. First, liveness accuracy: the system must know, within seconds, that a user has gone offline - even if their TCP connection didn’t close cleanly. Second, fan-out efficiency: status changes must propagate only to users who actually care about that status, not everyone connected. Third, connection stickiness: WebSocket connections are long-lived and stateful, meaning any server that holds a connection owns that user’s presence state - and server restarts become presence eviction events.

We need to solve for heartbeat-driven liveness detection, scoped pub-sub fan-out, and graceful connection draining simultaneously. Each of these is a solved problem in isolation; combining them at 10 million concurrent connections is where the real engineering lives.

Requirements and Constraints

Functional Requirements

  • Track online/offline/away status for registered users in real time
  • Propagate status changes to all relevant subscribers (users sharing a conversation, channel, or room) within 200ms
  • Support typing indicator events: user starts typing, user stops typing
  • Handle connection drops gracefully - detect offline state within 30 seconds of a silent disconnect
  • Allow clients to subscribe to presence updates for a specific set of user IDs
  • Support multi-device presence: a user is online if any of their devices is connected
  • Expose a REST API for initial presence state fetch on page load

Non-Functional Requirements

  • Scale: 10 million concurrent WebSocket connections
  • Latency: Status change propagation P99 under 200ms end-to-end
  • Availability: 99.99% uptime (under 1 hour downtime per year)
  • Throughput: Handle up to 500,000 presence events per second at peak
  • Detection latency: Declare a user offline within 30 seconds of connection loss
  • Recovery: Presence state fully consistent within 60 seconds of a server restart
  • Storage: Presence state is ephemeral - no long-term persistence required

Constraints and Assumptions

  • Presence data is soft-state: brief inconsistency during failover is acceptable
  • Typing indicators have a 3-second TTL; they do not need persistence across reconnects
  • We assume users are authenticated; auth is handled by a separate service
  • Each user can have at most 5 simultaneous device connections
  • A “relevant subscriber” means: users who share at least one active conversation or channel with the target user
  • The system does not need to store presence history - only current state matters

High-Level Architecture

The presence system decomposes into five major components. Clients connect via a WebSocket Gateway tier that handles connection management and heartbeats. The gateway writes presence events to a Presence Event Bus (a Kafka cluster) which decouples ingest from fan-out. A Presence Registry (Redis Cluster) stores the authoritative current state for all users. A Fan-out Service reads events from Kafka, looks up subscriber lists, and pushes updates back through the gateway. Finally, a Presence GC (garbage collector) runs periodic sweeps to evict stale presence records whose heartbeats have expired.

WebSocket presence system high-level architecture showing all major components and data flow

Data flows in two directions simultaneously. The write path starts when a client connects or sends a heartbeat: the gateway writes {user_id, status, server_id, ts} into Redis and emits an event to Kafka. The read path starts when the fan-out service consumes that Kafka event, looks up the user’s subscribers in Redis, and pushes a notification to each subscriber’s WebSocket gateway node. The gateway then pushes to the subscriber’s open connection.

The event bus is the critical decoupling point. Without it, the gateway would need to perform the fan-out inline - turning every presence change into a synchronous multi-step operation that blocks the connection handling loop. With Kafka in the middle, the gateway does a fast local Redis write and a non-blocking Kafka produce, then returns to handling the next event. Fan-out happens asynchronously in a separate process tier with its own scaling dimension.

Key Insight

The presence system is actually two separate problems: liveness detection (heartbeats, TTLs, GC) and change propagation (pub-sub, fan-out). Conflating them into a single code path is what causes most presence systems to either go stale or become bottlenecked.

The WebSocket Connection Registry

The WebSocket Connection Registry is the gateway’s in-memory and Redis-backed index of which user is connected to which server process, on which connection ID.

Think of it like a hotel front desk. When a guest checks in, the desk records their room number. When you want to send a message to that guest, you first call the front desk to find out which room they’re in - you don’t broadcast to every room on every floor. The connection registry is that front desk.

Each WebSocket gateway node maintains a local in-process map: connection_id -> {user_id, device_id, subscriptions}. This local map is O(1) for connection lookups and lives entirely in memory. It’s intentionally ephemeral - if the process dies, the map dies with it.

The durable layer is in Redis. When a user connects to gateway node gw-42, we write:

-- Register connection in Redis (pipelined)
HSET user:presence:{user_id} status online last_seen 1718000000 server gw-42 device_id mobile-abc
EXPIRE user:presence:{user_id} 45
SADD server:connections:gw-42 {user_id}

The EXPIRE 45 is load-bearing. It means that even if gw-42 crashes without sending an offline event, Redis will evict the key after 45 seconds - a built-in presence TTL. Every heartbeat from the client resets this TTL via EXPIRE user:presence:{user_id} 45. This is the heartbeat-based liveness mechanism in its simplest form.

For multi-device support, we track per-device presence separately:

-- Multi-device: each device gets its own key
HSET user:device:presence:{user_id}:{device_id} status online last_seen 1718000000 server gw-42
EXPIRE user:device:presence:{user_id}:{device_id} 45

-- User-level presence is derived: online if any device is online
-- This derivation happens in the GC sweep, not on every event
Watch Out

Storing per-user presence as a single Redis key breaks multi-device semantics. If a user has two devices and one disconnects, a naive HSET overwrites the online state with offline - making the user appear offline while their laptop is still connected. Always track presence at device granularity and derive user-level status by aggregating device states.

The fan-out service needs to know which gateway node to push updates to. This is the server routing problem: given a subscriber’s user_id, which gateway node holds their connection? We store this in the same Redis hash:

-- Lookup: which server holds user_id X's connection?
HGET user:presence:{user_id} server
-- Returns "gw-42"
-- Fan-out service then sends delivery request to gw-42 via internal gRPC
WebSocket Connection Registry internals showing local map, Redis backing store, and multi-device tracking
Real World

Discord’s presence system uses a similar layered approach - in-process connection maps for hot path lookups, backed by a distributed registry for cross-node routing. They’ve written about their “Elixir/Erlang” gateway tier that leverages per-process lightweight processes as the connection registry, with a separate Cassandra-backed store for persistent friend presence state.

Heartbeat-Based Liveness

Heartbeat-based liveness is the mechanism by which the system distinguishes a user who is quietly idle from a user whose connection has silently died.

TCP connections lie. A TCP connection can appear open on the server side for minutes after the client process has crashed, the phone has lost signal, or the NAT device has timed out the session. We cannot trust connection state alone. Instead, we require the client to periodically send a heartbeat ping, and we declare the connection dead if we don’t receive one within a deadline.

The heartbeat protocol is a simple ping-pong:

Client -> Server: {"type": "ping", "ts": 1718000000000}
Server -> Client: {"type": "pong", "ts": 1718000000000, "server_ts": 1718000000050}

The server-side heartbeat handler is deliberately minimal:

# Heartbeat handler - runs in the WebSocket message loop
async def handle_ping(ws_connection, message):
    user_id = ws_connection.user_id
    device_id = ws_connection.device_id

    # Reset Redis TTL - this is the liveness signal
    pipe = redis.pipeline(transaction=False)
    pipe.hset(
        f"user:device:presence:{user_id}:{device_id}",
        mapping={"last_seen": int(time.time()), "status": "online"}
    )
    pipe.expire(f"user:device:presence:{user_id}:{device_id}", 45)
    await pipe.execute()

    # Echo pong - required for client-side dead-connection detection
    await ws_connection.send_json({
        "type": "pong",
        "ts": message["ts"],
        "server_ts": int(time.time() * 1000)
    })

Client heartbeat interval is 15 seconds. Redis TTL is 45 seconds. This gives us a 3-miss grace period before a user is automatically evicted from Redis. The stale presence cleanup job (described later) runs every 10 seconds and marks users as offline when their Redis key has expired.

The math behind the timing constants matters. With 10 million concurrent connections sending heartbeats every 15 seconds, we have approximately 667,000 heartbeat operations per second. Each heartbeat is two Redis commands (HSET + EXPIRE) pipelined. At typical Redis throughput of 500,000-1,000,000 ops/second per node, we need roughly 2-3 Redis nodes dedicated to heartbeat processing alone.

Key Insight

The heartbeat interval and Redis TTL create a tunable tradeoff between ghost-presence duration and heartbeat load. A 30-second TTL with 10-second heartbeats detects offline users faster but triples the heartbeat load. A 120-second TTL with 60-second heartbeats halves load but leaves ghost presences visible for 2 minutes after disconnect.

The client-side heartbeat loop must also handle the gateway going silent - if the client sends a ping and receives no pong within 10 seconds, it treats the connection as dead and reconnects:

// Client-side heartbeat manager
class HeartbeatManager {
  constructor(socket) {
    this.socket = socket;
    this.interval = null;
    this.pendingPong = null;
    this.PING_INTERVAL_MS = 15000;
    this.PONG_TIMEOUT_MS = 10000;
  }

  start() {
    this.interval = setInterval(() => this.sendPing(), this.PING_INTERVAL_MS);
  }

  sendPing() {
    if (this.pendingPong) {
      // Previous ping never got a pong - connection is dead
      this.socket.reconnect();
      return;
    }
    const ts = Date.now();
    this.socket.send(JSON.stringify({ type: "ping", ts }));
    this.pendingPong = setTimeout(() => {
      // No pong received in time - treat as dead
      this.socket.reconnect();
    }, this.PONG_TIMEOUT_MS);
  }

  onPong() {
    if (this.pendingPong) {
      clearTimeout(this.pendingPong);
      this.pendingPong = null;
    }
  }

  stop() {
    clearInterval(this.interval);
    if (this.pendingPong) clearTimeout(this.pendingPong);
  }
}
Watch Out

Mobile clients behind aggressive NAT or on cellular often have their TCP connections silently killed by the network within 90 seconds of inactivity. A 60-second heartbeat interval will miss these drops. Always use heartbeat intervals under 30 seconds for mobile-targeted presence systems.

Presence Pub-Sub and Fan-Out

Presence pub-sub is the mechanism by which a status change for user A propagates to every user who is currently subscribed to A’s status.

The subscriber model is not “subscribe to all presence events” - that’s the naive approach that causes the fan-out explosion. The correct model is scoped subscriptions: a user subscribes to the presence of specific other users, typically the members of their current active conversations.

When user Alice opens a conversation with Bob and Carol, Alice’s client sends:

{"type": "subscribe_presence", "user_ids": ["bob-uuid", "carol-uuid"]}

The gateway records this subscription in Redis:

-- Track that Alice wants updates for Bob and Carol
SADD presence:subscribers:{bob_user_id} {alice_user_id}
SADD presence:subscribers:{carol_user_id} {alice_user_id}
EXPIRE presence:subscribers:{bob_user_id} 3600
EXPIRE presence:subscribers:{carol_user_id} 3600

When Bob’s status changes, the fan-out service:

  1. Reads SMEMBERS presence:subscribers:{bob_user_id} - returns [alice_user_id]
  2. For each subscriber, looks up their gateway server from user:presence:{alice_user_id}
  3. Sends a delivery instruction to the appropriate gateway via internal gRPC
Presence pub-sub fan-out flow showing Kafka consumption, subscriber lookup, and gateway routing

The fan-out service is a Kafka consumer group. Each partition is consumed by one fan-out worker, providing natural parallelism without double-delivery:

# Fan-out worker - one per Kafka partition
async def process_presence_event(event):
    user_id = event["user_id"]
    new_status = event["status"]
    event_ts = event["ts"]

    # Fetch all subscribers for this user
    subscriber_ids = await redis.smembers(f"presence:subscribers:{user_id}")
    if not subscriber_ids:
        return  # No one is watching - common case, fast exit

    # Group subscribers by their gateway server
    # Batching by server reduces gRPC calls from N to K (where K << N)
    server_to_subscribers = {}
    pipe = redis.pipeline(transaction=False)
    for sub_id in subscriber_ids:
        pipe.hgetall(f"user:presence:{sub_id}")
    presence_records = await pipe.execute()

    for sub_id, record in zip(subscriber_ids, presence_records):
        if not record or not record.get("server"):
            continue  # Subscriber is offline or stale entry
        server = record["server"]
        server_to_subscribers.setdefault(server, []).append(sub_id)

    # Deliver to each gateway server in parallel
    delivery_payload = {
        "event": "presence_update",
        "user_id": user_id,
        "status": new_status,
        "ts": event_ts
    }
    tasks = []
    for server, subscriber_batch in server_to_subscribers.items():
        tasks.append(
            gateway_grpc_client.deliver_to_users(server, subscriber_batch, delivery_payload)
        )
    await asyncio.gather(*tasks, return_exceptions=True)

The key optimization here is batching by gateway server. Instead of one gRPC call per subscriber, we group all subscribers on the same gateway and send a single batch delivery. If 1,000 subscribers of a popular user all happen to be connected to gw-07, that’s one gRPC call instead of 1,000.

Key Insight

Fan-out scope limiting is not just about performance - it’s about correctness. Without scoped subscriptions, users receive presence updates for people they don’t share context with, wasting bandwidth and leaking social graph information across organizational boundaries.

Real World

Slack’s presence system uses a similar scoped fan-out model. Rather than broadcasting to all workspace members (which could be tens of thousands), presence updates propagate only to users who have a shared DM or channel with the user in question. This is called “presence rings” internally - the innermost ring (DM partners) gets immediate updates, while outer rings (channel co-members) get batched updates on a longer cadence.

Typing Indicator Throttling

Typing indicator throttling is the mechanism that prevents a user’s keystrokes from flooding the presence system with hundreds of events per minute.

Without throttling, a fast typist generates a keydown event every 50-100ms. Over a 60-second typing session that’s 600-1200 raw events. Each event would trigger a fan-out to every subscriber. Multiply by millions of concurrent users and you have a system driven into the ground by users doing exactly what they’re supposed to do.

The solution is leading-edge throttling with trailing stop: when the user starts typing, emit one typing_start event immediately. While they continue typing, suppress further events. Emit typing_stop exactly 3 seconds after the last keystroke.

// Client-side typing throttle
class TypingThrottle {
  constructor(socket, conversationId) {
    this.socket = socket;
    this.conversationId = conversationId;
    this.isTyping = false;
    this.stopTimer = null;
    this.STOP_DELAY_MS = 3000;
  }

  onKeystroke() {
    if (!this.isTyping) {
      // Leading edge: emit start immediately
      this.isTyping = true;
      this.socket.send(JSON.stringify({
        type: "typing_start",
        conversation_id: this.conversationId,
        ts: Date.now()
      }));
    }

    // Reset the trailing stop timer on every keystroke
    if (this.stopTimer) clearTimeout(this.stopTimer);
    this.stopTimer = setTimeout(() => {
      this.isTyping = false;
      this.socket.send(JSON.stringify({
        type: "typing_stop",
        conversation_id: this.conversationId,
        ts: Date.now()
      }));
    }, this.STOP_DELAY_MS);
  }
}

Server-side, typing indicators have a short TTL in Redis and are stored separately from persistent presence:

-- Typing state: short TTL, scoped to conversation
SETEX typing:{conversation_id}:{user_id} 5 1

-- Check who is typing in a conversation
KEYS typing:{conversation_id}:*
-- Better approach: use a sorted set with score = expiry timestamp
ZADD typing:{conversation_id} {expiry_unix_ts} {user_id}
ZREMRANGEBYSCORE typing:{conversation_id} 0 {now}
ZRANGE typing:{conversation_id} 0 -1

Using a sorted set with score equal to expiry timestamp lets us atomically fetch all active typists while pruning expired ones. The ZREMRANGEBYSCORE + ZRANGE pair is a read-and-cleanup in two commands.

Watch Out

Server-side typing throttle is still necessary even with client-side throttle, because mobile apps get backgrounded, JavaScript timers get de-prioritized, and networks introduce reordering. Always apply a server-side rate limiter of at most 2 typing events per user per conversation per second as a safety valve.

Connection Draining on Failover

Connection draining is the process of gracefully migrating active WebSocket connections away from a gateway node before it shuts down, ensuring clients reconnect without experiencing a presence gap.

Without draining, a gateway restart creates a thundering herd: all N connections on that node simultaneously disconnect and attempt to reconnect to the remaining nodes. At 1 million connections per gateway node, this is a reconnection storm that can cascade into a cluster-wide outage.

Draining works in four phases. The load balancer is told to stop routing new connections to the draining node. The node sends a reconnect_hint message to all connected clients with a small randomized delay:

{"type": "reconnect_hint", "delay_ms": 1500, "reason": "server_drain"}

Clients receive this and schedule a reconnect after their assigned delay. The jitter spreads reconnections over several seconds, converting a spike into a manageable ramp. The draining node waits for its connection count to reach zero or a hard timeout (30 seconds), then completes shutdown.

# Gateway drain handler (called on SIGTERM)
async def drain_connections():
    logger.info(f"Starting drain: {len(active_connections)} connections")

    # Stop accepting new connections
    await lb_client.deregister(server_id=THIS_SERVER_ID)

    # Send reconnect hints with jitter
    tasks = []
    for i, conn in enumerate(active_connections.values()):
        jitter_ms = (i % 100) * 50  # Spread over 5 seconds, 100 slots
        tasks.append(send_reconnect_hint(conn, delay_ms=jitter_ms))
    await asyncio.gather(*tasks, return_exceptions=True)

    # Wait for graceful disconnect (30s hard deadline)
    deadline = time.time() + 30
    while active_connections and time.time() < deadline:
        await asyncio.sleep(0.5)

    # Force close remaining connections
    for conn in list(active_connections.values()):
        await conn.close(code=1001, reason="server_shutting_down")

    logger.info("Drain complete")

During drain, the node continues to process heartbeats and forward events for connections still open. The presence registry entry (server gw-42) is not deleted until the connection actually closes - at which point the close handler writes the offline event to Kafka, triggering proper presence cleanup for each user.

Key Insight

The reconnect hint delay distribution is as important as the drain mechanism itself. Uniform jitter over 5 seconds reduces reconnection load by 5x compared to simultaneous reconnection. Weighted jitter (lighter users get shorter delays, heavy-subscription users get longer delays) can reduce peak load by another 2-3x.

Stale Presence Cleanup

Stale presence cleanup is the background process that reaps presence records for users whose connections died without sending a proper offline event.

This is the “death by a thousand silent disconnects” problem. Mobile clients, browser tabs closed without beforeunload events, process kills, and network timeouts all produce connections that die without a clean close handshake. Without an active cleanup process, the presence registry accumulates ghost entries that show users as online when they’ve been gone for hours.

The GC runs as a separate service (not embedded in the gateway) on a 10-second cycle:

# Presence GC - runs every 10 seconds
async def presence_gc_sweep():
    now = int(time.time())

    # Scan all gateway server registration sets
    server_ids = await redis.smembers("known:servers")
    for server_id in server_ids:
        # Get users registered to this server
        user_ids = await redis.smembers(f"server:connections:{server_id}")

        stale_users = []
        pipe = redis.pipeline(transaction=False)
        for user_id in user_ids:
            pipe.hget(f"user:presence:{user_id}", "last_seen")
        last_seen_times = await pipe.execute()

        for user_id, last_seen in zip(user_ids, last_seen_times):
            if last_seen is None:
                # Key expired from Redis - user is definitely offline
                stale_users.append(user_id)
            elif now - int(last_seen) > 40:
                # Last heartbeat over 40s ago - treat as stale
                stale_users.append(user_id)

        if stale_users:
            await mark_users_offline(stale_users, server_id, reason="gc_sweep")

async def mark_users_offline(user_ids, server_id, reason):
    # Emit offline events to Kafka - fan-out handles subscriber notification
    events = [
        {"user_id": uid, "status": "offline", "ts": int(time.time()), "reason": reason}
        for uid in user_ids
    ]
    await kafka_producer.send_batch("presence-events", events)

    # Clean up Redis
    pipe = redis.pipeline(transaction=False)
    for uid in user_ids:
        pipe.hdel(f"user:presence:{uid}", "status", "server", "last_seen")
        pipe.hset(f"user:presence:{uid}", "status", "offline")
        pipe.srem(f"server:connections:{server_id}", uid)
    await pipe.execute()

The GC avoids scanning the entire keyspace by maintaining per-server user sets. When a user connects to gw-42, we add them to SADD server:connections:gw-42 {user_id}. This turns the GC scan from a global KEYS operation (O(N) across all Redis keys) into a targeted set membership check.

Real World

WhatsApp’s presence system uses a similar TTL-plus-sweep approach. Their “last seen” timestamp is stored as a Unix timestamp rather than a derived boolean, allowing the GC to compute “was online N seconds ago” as a simple arithmetic comparison rather than requiring a separate offline event. This also enables the “last seen 2 hours ago” display without storing any additional state.

Data Model

The presence system uses Redis as its primary store (ephemeral, fast) with PostgreSQL for audit logging and subscription persistence.

-- Subscription registry: persisted so clients don't re-send on reconnect
CREATE TABLE presence_subscriptions (
    subscriber_user_id  UUID        NOT NULL,
    target_user_id      UUID        NOT NULL,
    context_type        VARCHAR(32) NOT NULL,  -- 'conversation', 'channel', 'dm'
    context_id          UUID        NOT NULL,
    created_at          TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    expires_at          TIMESTAMPTZ,           -- NULL = persistent
    PRIMARY KEY (subscriber_user_id, target_user_id, context_id)
);

CREATE INDEX idx_ps_target ON presence_subscriptions(target_user_id)
    WHERE expires_at IS NULL OR expires_at > NOW();

CREATE INDEX idx_ps_context ON presence_subscriptions(context_type, context_id);

-- Presence audit log: for debugging and analytics only
CREATE TABLE presence_events_log (
    id              BIGSERIAL   PRIMARY KEY,
    user_id         UUID        NOT NULL,
    device_id       VARCHAR(64),
    event_type      VARCHAR(32) NOT NULL,  -- 'online', 'offline', 'away', 'typing_start'
    server_id       VARCHAR(64),
    client_ip       INET,
    occurred_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    reason          VARCHAR(64)            -- 'heartbeat_miss', 'clean_close', 'gc_sweep'
);

CREATE INDEX idx_pel_user_ts ON presence_events_log(user_id, occurred_at DESC);
-- Partition by month to keep the table manageable
-- ALTER TABLE presence_events_log PARTITION BY RANGE (occurred_at);

Redis key schema:

user:presence:{user_id}              HASH  {status, last_seen, server, device_id}  TTL=45s
user:device:presence:{user_id}:{device_id}  HASH  {status, last_seen, server}      TTL=45s
presence:subscribers:{user_id}       SET   {subscriber_user_id, ...}                TTL=3600s
server:connections:{server_id}       SET   {user_id, ...}                           no TTL
typing:{conversation_id}             ZSET  {user_id -> expiry_ts}                   TTL=30s
known:servers                        SET   {server_id, ...}                         no TTL

The user:presence hash uses a 45-second TTL that gets reset on every heartbeat. If the TTL expires naturally, the GC sweep picks it up in the next 10-second cycle. The combination of TTL-based expiry and active GC creates defense in depth against ghost presences.

Data model diagram showing Redis key schema, relationships, and TTL flows

Key Algorithms and Protocols

Exponential Backoff Reconnection

When a client disconnects - whether from a network blip, server drain, or crash - it must reconnect. The reconnection algorithm determines how much load the system absorbs during partial outages.

# Client reconnection with exponential backoff and jitter
import random
import asyncio

class ReconnectStrategy:
    def __init__(self):
        self.base_delay = 1.0    # seconds
        self.max_delay = 60.0    # seconds
        self.multiplier = 2.0
        self.jitter_factor = 0.3
        self.attempt = 0

    def next_delay(self):
        # Exponential backoff with full jitter
        capped = min(self.base_delay * (self.multiplier ** self.attempt), self.max_delay)
        jitter = capped * self.jitter_factor * random.random()
        delay = capped + jitter
        self.attempt += 1
        return delay

    def reset(self):
        self.attempt = 0

async def reconnect_loop(ws_client):
    strategy = ReconnectStrategy()
    while True:
        try:
            await ws_client.connect()
            strategy.reset()
            await ws_client.run()  # blocks until disconnect
        except Exception as e:
            delay = strategy.next_delay()
            await asyncio.sleep(delay)

The “full jitter” variant - where the actual delay is uniformly sampled between 0 and the capped value - is provably better than capped exponential backoff at reducing aggregate load during correlated reconnection events. AWS’s architecture blog (2015) showed that full jitter reduces request load by 3x compared to pure exponential during outage recovery.

Key Insight

Full jitter exponential backoff is the only reconnection algorithm that remains safe when many clients disconnect simultaneously. Pure exponential backoff with the same base-delay creates synchronized retry waves at 1s, 2s, 4s intervals - effectively a coordinated DDoS against your own gateway.

Fan-Out Scope Limiting via Subscription Intersection

The naive fan-out approach - “everyone in the same workspace gets the event” - is O(workspace_size). For a 10,000-person company, one presence change means 10,000 deliveries. The correct approach limits fan-out to the intersection of subscribers who are actually watching the changed user.

# Efficient subscriber lookup using Redis pipeline
async def get_active_subscribers(user_id: str, max_subscribers: int = 500) -> list[str]:
    # Get all subscribers watching this user
    # Use SRANDMEMBER with limit if set is massive to avoid full scan
    subscriber_count = await redis.scard(f"presence:subscribers:{user_id}")

    if subscriber_count <= max_subscribers:
        subscribers = await redis.smembers(f"presence:subscribers:{user_id}")
    else:
        # For celebrity users: sample max_subscribers from the set
        # This intentionally drops some subscribers - acceptable for massive fan-out
        subscribers = await redis.srandmember(
            f"presence:subscribers:{user_id}", max_subscribers
        )

    # Filter to only online subscribers (no point delivering to offline users)
    pipe = redis.pipeline(transaction=False)
    subscriber_list = list(subscribers)
    for sub_id in subscriber_list:
        pipe.hget(f"user:presence:{sub_id}", "status")
    statuses = await pipe.execute()

    return [
        sub_id for sub_id, status in zip(subscriber_list, statuses)
        if status == b"online"
    ]

The SRANDMEMBER with a limit handles the celebrity problem: if a public figure has 50,000 users subscribed to their presence (e.g., a support bot that’s in every channel), we cap deliveries at 500 per event rather than fanning out to 50,000. This is a deliberate tradeoff - some subscribers miss an update cycle - but the next subscription refresh or manual status fetch will self-correct.

Consistent Hashing for Gateway Routing

When a fan-out service decides which gateway node to contact for delivery, it needs stable routing. Consistent hashing maps user_id to a gateway node in a way that minimizes re-routing when nodes are added or removed:

import hashlib
from bisect import bisect_left, insort

class ConsistentHashRing:
    def __init__(self, virtual_nodes: int = 150):
        self.ring = []        # sorted list of (hash_value, server_id)
        self.virtual_nodes = virtual_nodes

    def add_server(self, server_id: str):
        for i in range(self.virtual_nodes):
            key = f"{server_id}:{i}"
            h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
            insort(self.ring, (h, server_id))

    def remove_server(self, server_id: str):
        self.ring = [(h, s) for h, s in self.ring if s != server_id]

    def get_server(self, user_id: str) -> str:
        if not self.ring:
            raise ValueError("No servers in ring")
        h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
        idx = bisect_left(self.ring, (h, ""))
        if idx >= len(self.ring):
            idx = 0
        return self.ring[idx][1]

However, since WebSocket connections are stateful and “sticky,” the fan-out service does not actually use consistent hashing for connection routing - it reads the authoritative server field from Redis. Consistent hashing is used for partitioning the presence GC work across multiple GC worker instances, so each worker owns a predictable slice of user IDs to scan.

Scaling and Performance

At 10 million concurrent connections with a 15-second heartbeat interval, the system must process approximately 667,000 heartbeat operations per second. At peak connectivity (morning commute, post-lunch for different timezones), this can spike to 1 million heartbeats per second.

Capacity Estimation:
  Concurrent connections: 10,000,000
  Heartbeat interval: 15 seconds
  Heartbeat ops/second: 10,000,000 / 15 = 667,000 ops/sec
  Redis ops per heartbeat: 2 (HSET + EXPIRE, pipelined)
  Total Redis heartbeat load: 1,333,000 ops/sec

  Presence changes (status events):
    Assume 1% of users change status per minute = 100,000 events/min = 1,667 events/sec
    Average subscribers per user: 20
    Fan-out events/sec: 1,667 * 20 = 33,340 deliveries/sec
    Peak fan-out (10x): 333,400 deliveries/sec

  Typing indicators:
    Assume 5% of users typing at any moment = 500,000 typists
    Each typist: 2 events (start, stop) per conversation per minute = max 16,667 events/sec
    Fan-out per typing event: 5 subscribers average
    Typing fan-out: 83,335 deliveries/sec

  Memory per connection (gateway):
    Connection state: ~2 KB (subscription list, metadata)
    At 1M connections per gateway: 2 GB RAM
    At 10 gateways: 20 GB total for connection state

  Redis memory:
    Per-user presence key: ~200 bytes
    10M users: 2 GB for presence hashes
    Subscriber sets: ~50 bytes per subscription, avg 20 subscriptions = 10 GB
    Total Redis: ~12 GB active data, use 3x replication = 36 GB provisioned

  Gateway nodes: 10 (1M connections each)
  Fan-out workers: 20 (16 Kafka partitions + overhead)
  Redis nodes: 6 (3 primaries + 3 replicas for HA)
  Kafka: 3 brokers, 16 partitions for presence-events topic
Scaling architecture showing gateway sharding, Kafka partitions, Redis cluster, and fan-out worker pool

The gateway tier scales horizontally. Connections are distributed across gateway nodes by the load balancer using consistent routing by user ID - this ensures that a user’s multi-device connections land on the same node, simplifying multi-device presence merging. Each gateway node handles up to 1 million concurrent connections, bounded by file descriptor limits and memory.

The Kafka presence-events topic uses 16 partitions keyed by user_id. This ensures all events for a given user are processed by the same fan-out worker instance, preventing out-of-order delivery to subscribers. At 1,667 presence events per second, 16 partitions is massively over-provisioned but provides headroom for 100x traffic growth.

Real World

Slack handles similar presence scale with a tiered architecture: gateway nodes (called “Switchboard”) manage WebSocket connections, while presence propagation uses a separate pub-sub mesh (built on their internal “Flannel” service). They published that their presence system handles peak loads of over 1 billion presence events per day across hundreds of thousands of workspaces.

Failure Modes and Recovery

FailureDetectionImpactRecovery
Gateway node crashLoad balancer health check fails within 5sAll connections on that node disconnectClients reconnect with exponential backoff; stale presence cleaned by GC within 45s
Redis node failureRedis Sentinel or Cluster failoverHeartbeat writes fail; presence updates stallRedis Cluster promotes replica; GC grace period extends presence TTL for active connections
Kafka broker failureKafka consumer lag alert + broker health checkFan-out stalls; presence events queue upKafka rebalances partitions to surviving brokers within 30s; fan-out workers resume from committed offset
Network partition (gateway from Redis)Redis command timeoutsGateway cannot record heartbeats; all connections appear stale to GCGateway queues heartbeat writes in-memory (ring buffer, 60s capacity); GC uses extended TTL during partition
Fan-out worker crashConsumer group rebalanceKafka partition goes unconsumed; presence updates delayedKafka reassigns partition to another worker within session.timeout.ms (default 45s)
Thundering herd on reconnectGateway connection rate spike alertCPU and connection table exhaustion on surviving gatewaysClient-side exponential backoff + gateway-side connection rate limiting per IP block
Watch Out

The most common operational mistake is setting Redis maxmemory-policy to allkeys-lru on the presence cluster. Under memory pressure, Redis will evict presence keys - silently marking active users as offline. Always use a dedicated Redis instance for presence with maxmemory-policy noeviction and trigger an alert when memory usage exceeds 80%.

Comparison of Approaches

ApproachLatencyComplexityFailure ModeBest Fit
Long polling1-5 secondsLowPolling storms during reconnect; high server CPUPrototype or low-traffic system under 10K users
Server-Sent Events (SSE)200-500msMediumOne-directional; client cannot send heartbeatsRead-heavy presence (dashboards) where client doesn’t need to send state
WebSocket with Redis pub-sub50-200msMediumRedis PUBSUB doesn’t scale past ~100K channels; single-broker fan-outSmall to medium scale under 1M concurrent users
WebSocket + Kafka fan-out (this design)50-200msHighKafka consumer lag during spike; more operational complexity1M+ concurrent users; need durable event log for debugging
XMPP/MQTT presence100-300msVery HighProtocol complexity; limited fan-out controlIoT device presence; existing XMPP infrastructure
gRPC streaming30-100msHighConnection management is application-levelInternal service-to-service presence; not client-facing

The WebSocket + Kafka fan-out design is the right choice for this scale. Redis pub-sub’s fan-out model is a single-node operation - the subscriber count is bounded by what one Redis node can push per second. Kafka’s consumer group model distributes fan-out processing across many workers and provides an audit log of every presence event for free. The operational complexity is real but manageable with mature Kafka tooling.

The one scenario where you’d downgrade to WebSocket + Redis pub-sub is during initial product development where speed to production matters more than theoretical ceiling. Redis pub-sub can easily handle 100,000 concurrent connections, which gives plenty of runway before the Kafka migration becomes necessary.

Key Takeaways

  • WebSocket connection registry: Store per-connection metadata in both local process memory (O(1) hot path) and Redis (cross-node routing), with the Redis entry serving as the source of truth for fan-out routing.
  • Heartbeat-based liveness: Redis key TTL is your primary liveness signal - a key that expires means the user is offline, regardless of whether a clean close event was sent.
  • Stale presence cleanup: A GC sweep running every 10 seconds handles the long tail of silent disconnects; defense in depth with both TTL expiry and active GC prevents ghost presences from accumulating.
  • Fan-out scope limiting: Scoped subscriptions (subscribe to specific users, not all users) are the mechanism that keeps fan-out O(subscribers) rather than O(total users).
  • Typing indicator throttling: Leading-edge send plus trailing-edge stop, with a 3-second debounce, reduces typing events by 200x compared to per-keystroke delivery; always apply client-side AND server-side rate limiting.
  • Connection draining: Randomized reconnect hints during server shutdown spread reconnection load across 5 seconds, converting a reconnection spike into a manageable ramp.
  • Presence pub-sub via Kafka: Decoupling presence event ingest from fan-out delivery allows each tier to scale independently and provides an audit log for debugging presence state bugs.
  • Multi-device presence: Track presence at device granularity, derive user-level status by aggregation; never overwrite device state with user state or vice versa.

The counter-intuitive lesson from building presence systems is that accuracy and freshness are not the same property. You can have highly accurate presence (no ghost presences, no false offline states) with relatively coarse freshness (30-second detection latency) by leaning on TTLs and GC sweeps rather than real-time event propagation. The 200ms propagation SLA applies to reacting to known events - it says nothing about how quickly we detect unknown events (silent disconnects). Separating these two latency SLAs lets you build a simpler, more reliable system than trying to make everything sub-second.

Frequently Asked Questions

Q: Why not use Redis pub-sub instead of Kafka for the event bus? A: Redis pub-sub is fire-and-forget with no persistence, no consumer groups, and no backpressure. If a fan-out worker is slow or crashes, events are lost. Kafka gives us durable event storage, consumer group rebalancing (workers can crash and recover without losing events), and an audit log. At 10 million users, the fan-out tier will need scaling and rolling restarts; losing events during maintenance is unacceptable.

Q: Why not use a dedicated presence service like PubNub or Ably instead of building from scratch? A: Managed services are an excellent choice for products under 500K concurrent users. Above that, the per-connection pricing ($0.0001-0.001 per connection-hour) becomes significant - 10M connections at $0.0005/hour is $5,000/hour or $3.6M/month. At that scale, building in-house pays for itself within months. The operational complexity is also more predictable than vendor API limits and rate throttling.

Q: How do you handle presence for a user who has multiple tabs open in the same browser? A: Each browser tab is a separate WebSocket connection with its own device_id (typically derived from a tab session ID). Multi-device presence aggregation treats each tab as a separate “device.” The user is online if any tab is connected. When all tabs close, the last device’s heartbeat expires and the GC marks the user offline. Tab-level presence also means that closing one tab doesn’t flash the user as offline before the next tab’s heartbeat fires.

Q: What happens to subscriber sets when a user goes offline and their subscriptions become irrelevant? A: Subscriber sets have a 1-hour TTL that gets refreshed whenever the subscriber is active. When a user goes offline and stops refreshing their subscriptions, the sets expire naturally within an hour. We also run a separate cleanup job that removes offline users from subscriber sets when we emit their offline event - this is a best-effort cleanup, not a correctness requirement, since fan-out delivery to offline subscribers is a no-op (we check subscriber status before delivering).

Q: Why use a sorted set for typing indicators rather than simple keys with TTL? A: Individual keys with TTL require one SETEX per active typist and one implicit expiry operation. Reading “who is currently typing” requires either a KEYS typing:{conv}:* scan (terrible at scale) or maintaining a separate index. The sorted set approach stores all typists in one key, lets us atomically expire stale entries with ZREMRANGEBYSCORE, and reads the full list with ZRANGE - three operations instead of N. The sorted set is also more cache-friendly since it’s a single key access pattern.

Q: How do you prevent presence state from becoming inconsistent during a Redis cluster failover? A: During Redis Sentinel or Cluster failover, there is a brief window (typically 15-30 seconds) where writes fail. Gateway nodes buffer heartbeat writes in a bounded in-memory ring buffer during this window. When Redis recovers, buffered writes are replayed. The GC extends its staleness threshold from 40 seconds to 90 seconds during detected Redis unavailability, preventing mass offline events during the failover window. Users may see presence temporarily stale but will not see false offline events.

Interview Questions

Q: How would you design the presence system to scale to 100 million concurrent users instead of 10 million? Expected depth: Discuss horizontal gateway scaling (100 nodes at 1M connections each), Kafka partition scaling (160 partitions), Redis Cluster sharding strategy for presence keys (hash slot distribution), fan-out tier scaling, and the celebrity problem - users with 100,000+ subscribers needing special-cased bounded fan-out. Address the regime change where Redis Cluster may need to be replaced with a purpose-built presence store (Apache Cassandra with TTL-based row expiry).

Q: How would you add “last seen X minutes ago” timestamps without storing full event history? Expected depth: Store last_seen Unix timestamp in the presence hash alongside status. Update it on every heartbeat and on explicit online/offline events. The display string is computed client-side from the timestamp. Discuss privacy implications (some users want to hide exact last-seen), how to implement “last seen hidden” as a null value in the hash, and the staleness window during the Redis TTL expiry period where last_seen may be slightly stale.

Q: A celebrity user has 2 million followers all subscribed to their presence. How do you handle fan-out without melting your delivery tier? Expected depth: Discuss fan-out scope limiting - not all followers are subscribed at any given moment; only users who have recently opened a conversation view involving the celebrity are active subscribers. For truly massive fan-out (verified celebrities), implement a tiered delivery model: immediate delivery to active subscribers (last active within 5 minutes), batched delivery at 30-second intervals to recently-active subscribers (last active within 1 hour), and no delivery to long-idle subscribers who will fetch on next session load.

Q: How do you ensure that the typing indicator “User is typing…” disappears if the typing_stop event is lost? Expected depth: Client-side: the trailing stop timer fires 3 seconds after last keystroke regardless of delivery - if the network drops between start and stop, the stop fires when the client reconnects. Server-side: typing indicator keys (sorted set entries) have a 5-second TTL. Subscribers re-fetch typing state when they receive a heartbeat acknowledgement after reconnecting. The combination of client-side debounce, server-side TTL, and reconnect-triggered re-fetch provides three independent mechanisms to clear a stuck “is typing” indicator.

Q: Walk me through what happens, step by step, when a gateway node crashes ungracefully. Expected depth: Load balancer health check detects the crash within 5 seconds and removes the node from rotation. All connections on that node are immediately disconnected - clients see a connection close event. Redis presence keys for users on that node continue to exist but will expire within 45 seconds (TTL). The GC sweep at T+10s sees expired keys and emits offline events to Kafka. Fan-out workers consume the offline events and deliver to subscribers. Meanwhile, all affected clients begin reconnecting to surviving gateway nodes with exponential backoff. New connections write fresh presence entries to Redis. The end state is correct within approximately 45 seconds of the crash.

Premium Content

Unlock the full article along with everything else in the archive — all in one place.

In-depth analysis Expert insights Full archive access
Unlock Full Article