Build a Push Notification Routing System

reliability distributed-systems performance

System Design Deep Dive

Push Notification Routing System

Route 100M messages/day across APNs, FCM, and Web Push with guaranteed delivery and zero duplicates

⏱ 14 min read📐 Advanced🏗️ Push-Notifications

Imagine a post office that must deliver every letter exactly once, to the right mailbox, across three different postal services that each use incompatible address formats - and the recipient may have moved since you last checked. That is the core engineering tension of a push notification routing system. At 1,000 notifications per second - which scales to roughly 100 million per day for a mid-sized consumer app - the naive approach of a single service calling APNs and FCM directly collapses under the weight of connection management, error handling, and fan-out complexity.

The problem is deceptively multi-dimensional. First, you have heterogeneous delivery channels: Apple Push Notification service (APNs) speaks HTTP/2 with p8 JWT credentials; Firebase Cloud Messaging (FCM) offers a gRPC batch API plus a legacy REST endpoint; Web Push uses RFC 8030 with VAPID authentication. A single user might have three active devices across all three channels simultaneously. Second, you have delivery guarantees in tension with cost: APNs and FCM are themselves unreliable - they queue messages for offline devices, silently drop them after TTL expiry, and return opaque errors for invalid tokens. Getting at-least-once delivery requires your system to track state and retry, but naive retries cause duplicates. Third, you have scale forcing architectural decisions that seem premature at small scale: at 10 req/s a single Python service calling both APNs and FCM works fine. At 10,000 req/s you need connection pools, batching, fan-out parallelism, and a separate retry pipeline.

The third dimension is token hygiene. Device tokens are not permanent. iOS generates a new APNs token when the user reinstalls the app or restores from backup. Android tokens refresh periodically with FCM. A stale token registry wastes network calls and, worse, you might deliver notifications to the wrong device if a token has been reassigned. This is the kind of silent bug that causes GDPR violations - sending a notification to the prior owner of a recycled phone number or device token.

We need to solve for channel routing correctness, delivery durability with deduplication, and graceful handling of partial failures simultaneously.

Requirements and Constraints

Functional Requirements

Accept notification payloads via REST or gRPC and route them to the correct channel (APNs, FCM, or Web Push) based on stored device tokens.
Support fan-out: one notification to all devices owned by a user (multi-device, multi-channel).
Guarantee at-least-once delivery: if the upstream gateway (APNs, FCM) returns a transient error, retry with exponential backoff.
Deduplicate: if the same notification is submitted twice within a configurable window (default 24 hours), drop the second submission without dispatching.
Track delivery receipts: record SENT, DELIVERED, FAILED, and EXPIRED states per notification per device.
Support both visible notifications (shown to user, appear in notification tray) and silent notifications (background data payloads that wake the app without UI).
Provide a token registration API: apps call it on launch to upsert their current device token, platform, and app version.

Non-Functional Requirements

Throughput: 100M notifications/day average, peak 10x (1.16M/hour burst).
Latency: 95th-percentile end-to-end dispatch latency under 500ms from API ingestion to gateway call.
Availability: 99.9% for the ingestion API; 99.5% for the dispatch layer (tolerate gateway outages via retry queue).
Durability: zero notification loss for retryable failures; dead-letter queue for terminal failures.
Deduplication window: 24-hour sliding window; configurable to 5 minutes for high-frequency alerts.
Storage: receipt logs retained 90 days; token registry indefinitely (cleaned by stale token eviction).

Constraints

We do not build our own device-level delivery guarantee beyond what APNs/FCM provide - we rely on them to handle offline queuing and re-delivery to devices.
Out of scope: in-app notification inbox (a separate read model problem), email and SMS channels, notification scheduling (run a cron on top of this system for that).
Assumption: one user_id maps to up to 10 concurrent active devices.

High-Level Architecture

The system has five major tiers. An API Gateway handles authentication, rate limiting, and routes inbound calls to the Deduplication Service, which checks a Redis sliding window before allowing a message to proceed. Approved messages flow to the Token Registry (Cassandra) to resolve user_id into device tokens, then to the Router Engine which selects the right channel per device and hands off to channel-specific senders. The Receipt Tracker writes delivery outcomes to PostgreSQL and feeds failures back into a Kafka retry queue.

Data flows like this: a service calls POST /v1/notify with a user_id, a title, a body, and an optional data payload. The API Gateway validates the JWT and applies per-caller rate limits. The Deduplication Service computes an idempotency key (sha256(caller_id + user_id + content_hash)) and checks Redis with SET key 1 NX EX 86400. If the key already exists, we return 200 OK immediately with a deduplicated: true flag - success from the caller’s perspective, no dispatch. If new, the message enters the routing pipeline. The Router looks up all active device tokens for the user, groups them by channel, and dispatches to each channel’s sender concurrently. Each sender maintains a connection pool to the upstream gateway, reports ACK or FAIL back to the Receipt Tracker, and enqueues failures into the Kafka retry topic.

The separation between the Router Engine and the per-channel Senders is the key structural decision. The Router handles business logic - which devices to target, priority rules, silent vs. visible type selection, and notification batching for bulk sends. The Senders handle the mechanical details of each protocol - HTTP/2 multiplexing for APNs, batch sizing for FCM’s v1 API, VAPID key signing for Web Push. This boundary means you can swap or upgrade an individual sender without touching routing logic.

Key Insight

The deduplication check must happen before the token registry lookup - not after. Checking after means you’ve already paid the Cassandra read cost and begun fan-out work before discovering the message is a duplicate. Redis SETNX is O(1) and sub-millisecond; make it the first gate.

The Device Token Registry

The Token Registry’s job is to map user_id to a list of (device_id, platform, token, token_updated_at) rows, where a single user may have up to 10 active rows.

The naive approach - a single Postgres table with a (user_id, device_id) primary key - works until you hit high fan-out read volume. Every notification triggers at least one token lookup, which at 10,000 notifications/second means 10,000 reads/second on the tokens table. Postgres handles this fine with good indexing, but this table becomes a hot spot when you add token refreshes, stale evictions, and multi-region replication. We choose Cassandra because its partition model turns user_id-keyed reads into single-node operations, RF=3 gives us fault tolerance without sacrificing read performance, and write throughput for token upserts is excellent.

Device token registry internals: registration flow, lookup with Redis cache, Cassandra partition layout, and stale token eviction

The lookup path adds Redis as a read-through cache with a 5-minute TTL. On cache miss, we read from Cassandra and populate the cache. The cache hit rate approaches 95% in practice because notifications for the same user often cluster in time (a broadcast event hitting many users simultaneously means the first few lookups populate the cache, and subsequent lookups for the same user hit Redis). The 5-minute TTL means a newly registered token is visible within 5 minutes of registration, which is acceptable for all use cases except very low-latency post-install campaigns.

Stale token eviction is the subtle piece. When APNs returns a 410 Unregistered response, it is telling you definitively that the token is dead. FCM returns UNREGISTERED in its error payload. The Sender does not delete the token synchronously - that would add latency to the critical path. Instead, it publishes a token_eviction_event to a dedicated Kafka topic. A TokenJanitor consumer processes these asynchronously, deletes the row from Cassandra, and invalidates the Redis cache entry.

Watch Out

Never delete a token synchronously in the Sender’s critical path. Token eviction is a low-urgency write to Cassandra - doing it in-line adds latency variance to every failed dispatch. Use an async eviction queue, and deduplicate eviction events so a burst of 410s for the same token doesn’t cause a thundering-herd of DELETE operations.

Channel Routing Logic

The Router Engine’s job is to take a (user_id, notification_payload) pair and produce a list of (device_token, platform, channel_sender) dispatch tasks.

The routing logic looks simpler than it is. Most engineers assume routing is just “look up the tokens, pick the channel by platform.” The reality has three complications: priority overrides, silent vs. visible type forcing, and channel fallback.

Priority: APNs and FCM both support notification priority (high vs. normal). high priority wakes the device immediately and consumes battery. normal priority is batched by the OS and delivered when convenient. The Router maps a caller-supplied urgency field (critical, high, normal, low) to platform-specific priority headers. A critical notification on iOS also requires an apns-push-type: alert header and a provisioned entitlement from Apple.

Silent vs. visible type: Silent notifications (content-available: 1 on APNs, data-only on FCM) wake the app in the background without showing UI. They are used for sync triggers, badge count updates, and pre-fetching. Visible notifications show a title/body in the notification tray. The Router must set the correct APNs apns-push-type header (background for silent, alert for visible) and the correct FCM message structure (a notification object for visible, data-only for silent). Mixing these up causes either silent messages to display as alerts, or alerts to fail validation with APNs.

Channel fallback: if a user has both an iOS device and a web push subscription, and the APNs send fails with a 5xx error, the system does not automatically fall back to Web Push. They are separate devices. Each device gets its own delivery attempt with its own retry queue. “Channel fallback” in this system means retrying on the same channel after a delay.

# Channel routing: maps devices to sender tasks with correct platform config
from dataclasses import dataclass
from enum import Enum
from typing import List

class Channel(Enum):
    APNS = "apns"
    FCM = "fcm"
    WEBPUSH = "webpush"

class NotifType(Enum):
    VISIBLE = "visible"
    SILENT = "silent"

class Urgency(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    NORMAL = "normal"
    LOW = "low"

@dataclass
class DeviceToken:
    device_id: str
    user_id: str
    platform: str  # "ios", "android", "web"
    token: str
    app_bundle: str

@dataclass
class DispatchTask:
    device_token: DeviceToken
    channel: Channel
    apns_priority: int          # 10 = immediate, 5 = normal
    apns_push_type: str         # "alert" | "background"
    fcm_priority: str           # "high" | "normal"
    include_notification: bool  # False = silent

PLATFORM_TO_CHANNEL = {
    "ios": Channel.APNS,
    "android": Channel.FCM,
    "web": Channel.WEBPUSH,
}

URGENCY_TO_APNS_PRIORITY = {
    Urgency.CRITICAL: 10,
    Urgency.HIGH: 10,
    Urgency.NORMAL: 5,
    Urgency.LOW: 5,
}

URGENCY_TO_FCM_PRIORITY = {
    Urgency.CRITICAL: "high",
    Urgency.HIGH: "high",
    Urgency.NORMAL: "normal",
    Urgency.LOW: "normal",
}

def build_dispatch_tasks(
    devices: List[DeviceToken],
    notif_type: NotifType,
    urgency: Urgency,
) -> List[DispatchTask]:
    tasks = []
    for device in devices:
        channel = PLATFORM_TO_CHANNEL.get(device.platform)
        if channel is None:
            continue  # unknown platform, skip
        is_visible = (notif_type == NotifType.VISIBLE)
        tasks.append(DispatchTask(
            device_token=device,
            channel=channel,
            apns_priority=URGENCY_TO_APNS_PRIORITY[urgency],
            apns_push_type="alert" if is_visible else "background",
            fcm_priority=URGENCY_TO_FCM_PRIORITY[urgency],
            include_notification=is_visible,
        ))
    return tasks

Real World

Meta’s notification platform handles both “visible” and “data” message types through their Iris push infrastructure. Silent data messages are used to trigger client-side ranking updates without showing a notification, while visible messages go through separate priority queues. The distinction between the two paths in their system is enforced at schema validation time, not as a runtime check.

Delivery Receipt Tracking

The Receipt Tracker’s job is to record the lifecycle of every notification from SENT to DELIVERED, FAILED, or EXPIRED, and to feed unresolved failures into the retry pipeline.

Think of receipts like certified mail tracking. You know when you handed it to the post office (SENT). The post office tells you when they delivered it to the mailbox (DELIVERED). If delivery fails, they tell you why - bad address (FAILED_TERMINAL), recipient unreachable today (FAILED_RETRYABLE). Critically, not all channels provide delivery confirmation: APNs and FCM confirm that they received your API call and will attempt delivery. They do not (by default) confirm when the device woke up and displayed the notification. That would require in-app SDK callbacks feeding a separate click_receipt event.

The receipt record captures this nuance with a delivery_status enum and a gateway_status field:

-- Delivery receipts table: tracks every notification's lifecycle per device
CREATE TABLE delivery_receipts (
    receipt_id      UUID        PRIMARY KEY DEFAULT gen_random_uuid(),
    notification_id UUID        NOT NULL,
    device_id       VARCHAR(64) NOT NULL,
    user_id         UUID        NOT NULL,
    channel         VARCHAR(16) NOT NULL CHECK (channel IN ('apns','fcm','webpush')),
    delivery_status VARCHAR(16) NOT NULL DEFAULT 'pending'
                    CHECK (delivery_status IN ('pending','sent','delivered','failed','expired','deduplicated')),
    gateway_status  VARCHAR(64),           -- raw status code from APNs/FCM/WebPush
    attempt_count   SMALLINT    NOT NULL DEFAULT 0,
    next_retry_at   TIMESTAMPTZ,
    last_attempted_at TIMESTAMPTZ,
    sent_at         TIMESTAMPTZ,
    resolved_at     TIMESTAMPTZ,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    payload_hash    CHAR(64),              -- sha256 of notification payload
    CONSTRAINT fk_notification
        FOREIGN KEY (notification_id) REFERENCES notifications(notification_id)
);

CREATE INDEX idx_receipts_notification ON delivery_receipts(notification_id);
CREATE INDEX idx_receipts_device_status ON delivery_receipts(device_id, delivery_status);
CREATE INDEX idx_receipts_retry ON delivery_receipts(next_retry_at)
    WHERE delivery_status = 'pending' AND next_retry_at IS NOT NULL;
CREATE INDEX idx_receipts_user_created ON delivery_receipts(user_id, created_at DESC);

The next_retry_at partial index is the key performance optimization - the retry scanner queries WHERE delivery_status = 'pending' AND next_retry_at <= NOW(), and that index covers only rows actively awaiting retry, not the entire table. At 100M notifications/day with a 90-day retention window, that table grows to 9B rows. Partitioning by created_at (monthly range partitions) is mandatory at that scale.

Key Insight

Write receipts asynchronously, not inline with the dispatch call. The Sender fires off the gateway request, gets back a response code, and publishes a receipt event to a local Kafka topic. A separate Receipt Consumer writes to PostgreSQL. This keeps Sender latency independent of PostgreSQL write latency and allows batch-inserting receipts at high throughput.

Retry with Backoff

The retry subsystem is the most subtle part of the design. It needs to handle four distinct error classes differently: transient errors that warrant immediate retry, rate-limit errors that warrant polite back-off, gateway unavailability that warrants longer back-off plus circuit breaking, and terminal errors that should never be retried.

Like a package delivery driver who tries the same address at progressively longer intervals - morning, then afternoon, then next day - the retry system increases the delay between attempts exponentially. The difference from naive exponential backoff is jitter: without it, a burst of 10,000 failed sends at the same moment will all retry at exactly the same time, causing a thundering herd that amplifies the original failure.

# Exponential backoff with full jitter for retry delay calculation
import random
import math

def compute_retry_delay_seconds(
    attempt_count: int,
    base_delay: float = 1.0,
    max_delay: float = 3600.0,
    jitter_factor: float = 0.3,
) -> float:
    """
    Returns delay in seconds for the Nth retry attempt.
    Formula: min(base * 2^N, max) with up to 30% jitter.
    attempt_count=0 means first retry after initial failure.
    """
    if attempt_count < 0:
        raise ValueError("attempt_count must be >= 0")
    exponential = base_delay * math.pow(2, attempt_count)
    capped = min(exponential, max_delay)
    # Full jitter: random in [0, cap] - prevents thundering herd
    jitter = random.uniform(0, capped * jitter_factor)
    return capped + jitter


TERMINAL_ERRORS = {
    # APNs terminal errors
    "BadDeviceToken", "Unregistered", "MissingDeviceToken",
    "BadMessageId", "DuplicateHeaders",
    # FCM terminal errors
    "UNREGISTERED", "INVALID_ARGUMENT", "SENDER_ID_MISMATCH",
    # Web Push
    "404",  # subscription gone
    "410",  # subscription permanently invalid
}

RETRYABLE_ERRORS = {
    # APNs transient
    "InternalServerError", "ServiceUnavailable", "Shutdown",
    # FCM transient
    "INTERNAL", "UNAVAILABLE", "QUOTA_EXCEEDED",
    # Web Push
    "429",  # rate limited
    "500", "503",
}

MAX_RETRY_ATTEMPTS = 5

def classify_gateway_error(error_code: str, channel: str) -> str:
    """Returns 'terminal', 'retryable', or 'unknown'."""
    if error_code in TERMINAL_ERRORS:
        return "terminal"
    if error_code in RETRYABLE_ERRORS:
        return "retryable"
    return "unknown"  # treat unknown as retryable up to MAX_RETRY_ATTEMPTS

The retry queue uses Kafka topics named notify.retry.0, notify.retry.1, through notify.retry.4. Each Sender publishes to the appropriate retry topic after computing next_retry_at. A separate RetryScheduler service runs as a Kafka consumer that wakes up every 30 seconds, scans for retry events whose next_retry_at has passed, and re-injects them into the main notify.dispatch topic. This approach avoids polling PostgreSQL for due retries - the retry schedule lives in Kafka message timestamps.

Watch Out

FCM’s QUOTA_EXCEEDED error is a rate-limit signal, not a terminal failure. A common mistake is treating it as terminal and discarding the message. The correct behavior is to back off with a longer initial delay (60 seconds minimum for FCM rate limits) and retry. Apple’s Shutdown error means the APNs server is restarting - always retry after 5-10 seconds.

Deduplication Window

The deduplication window prevents a notification from being sent twice if the caller retries their API call due to a network timeout. This is the same problem as payment idempotency - the caller cannot know if their first request succeeded, so they retry, and without dedup the user gets two notifications.

A bloom filter is like a bouncer who remembers every face he has seen - he might very rarely think he’s seen someone he hasn’t, but he will never let in someone he definitely has. For notification dedup, a small false-positive rate (0.1%) is acceptable: we’d rather occasionally drop a valid message than ever send a duplicate. But for most production systems, a Redis SET with a key expiry is more operationally predictable than a probabilistic structure.

# Deduplication service: Redis-backed idempotency window
import hashlib
import redis

class DeduplicationService:
    def __init__(self, redis_client: redis.Redis, window_seconds: int = 86400):
        self.redis = redis_client
        self.window = window_seconds

    def _build_key(self, caller_id: str, user_id: str, content_hash: str) -> str:
        raw = f"dedup:{caller_id}:{user_id}:{content_hash}"
        # SHA-256 the key to keep it fixed-length in Redis
        return "dedup:" + hashlib.sha256(raw.encode()).hexdigest()

    def compute_content_hash(self, title: str, body: str, data: dict) -> str:
        # Stable hash of notification content - order-independent for data dict
        normalized = f"{title}|{body}|{sorted(data.items())}"
        return hashlib.sha256(normalized.encode()).hexdigest()[:16]

    def check_and_set(
        self,
        caller_id: str,
        user_id: str,
        title: str,
        body: str,
        data: dict,
    ) -> bool:
        """
        Returns True if this notification is a duplicate (should be dropped).
        Returns False if it is new (should be dispatched).
        Uses SET NX EX for atomic check-and-set.
        """
        content_hash = self.compute_content_hash(title, body, data)
        key = self._build_key(caller_id, user_id, content_hash)
        # SET key 1 NX EX window_seconds
        result = self.redis.set(key, "1", nx=True, ex=self.window)
        # set() returns None if key already existed (NX condition failed)
        return result is None  # True = duplicate, False = new

    def force_expire(self, caller_id: str, user_id: str, title: str,
                     body: str, data: dict) -> None:
        """Manually expire a dedup entry (e.g., after confirmed delivery failure)."""
        content_hash = self.compute_content_hash(title, body, data)
        key = self._build_key(caller_id, user_id, content_hash)
        self.redis.delete(key)

The dedup key design matters more than it looks. Including caller_id prevents cross-caller collisions (two different services sending the same content to the same user are independent notifications). Including only content fields - not notification_id - means even if the caller generates a new UUID on retry, we still catch the duplicate.

Real World

Airbnb’s notification platform uses a two-layer dedup system: a fast Redis check with a 5-minute window catches retries from transient network failures, while a slower Postgres-backed check with a 24-hour window catches duplicates from upstream service restarts that replay event queues. The two windows serve different failure modes - the short window handles TCP timeouts, the long window handles Kafka consumer group rebalances.

Notification Batching

The Batch Aggregator’s job is to coalesce multiple individual notifications into a single upstream gateway call, dramatically improving throughput per TCP connection.

Think of batching as filling a bus before it departs - you collect passengers (messages) for 500 milliseconds, then send the bus (batch API call) regardless of how many seats are filled. A half-full bus is still more efficient than 50 individual cars.

FCM’s v1 API accepts up to 500 messages per batch call. APNs does not have a batch API - each notification is a separate HTTP/2 request, but HTTP/2 multiplexing means you can have hundreds of concurrent in-flight requests over a single TCP connection, which achieves the same effect. Web Push has no batching support at all.

// FCM batch sender: aggregates messages with a 500ms window, max 500 per batch
package sender

import (
    "context"
    "sync"
    "time"
)

const (
    FCMMaxBatchSize   = 500
    BatchWindowMs     = 500
)

type FCMMessage struct {
    Token   string
    Title   string
    Body    string
    Data    map[string]string
    TaskID  string
    IsHigh  bool
}

type FCMBatchSender struct {
    mu       sync.Mutex
    pending  []FCMMessage
    timer    *time.Timer
    flushFn  func([]FCMMessage) error
}

func NewFCMBatchSender(flushFn func([]FCMMessage) error) *FCMBatchSender {
    s := &FCMBatchSender{flushFn: flushFn}
    s.timer = time.AfterFunc(BatchWindowMs*time.Millisecond, s.flush)
    return s
}

func (s *FCMBatchSender) Add(msg FCMMessage) {
    s.mu.Lock()
    s.pending = append(s.pending, msg)
    shouldFlush := len(s.pending) >= FCMMaxBatchSize
    s.mu.Unlock()

    if shouldFlush {
        // Batch is full - flush immediately without waiting for timer
        s.flush()
    }
}

func (s *FCMBatchSender) flush() {
    s.mu.Lock()
    if len(s.pending) == 0 {
        s.mu.Unlock()
        // Reset timer for next window
        s.timer.Reset(BatchWindowMs * time.Millisecond)
        return
    }
    batch := s.pending
    s.pending = make([]FCMMessage, 0, FCMMaxBatchSize)
    s.mu.Unlock()

    // Send batch to FCM - errors are logged per-message in response
    if err := s.flushFn(batch); err != nil {
        // Re-enqueue failed batch messages individually for retry
        for _, msg := range batch {
            enqueueRetry(msg)
        }
    }
    s.timer.Reset(BatchWindowMs * time.Millisecond)
}

func enqueueRetry(msg FCMMessage) {
    // Publish to Kafka retry topic - implementation in retry module
}

Silent notifications cannot be batched with visible notifications in the same APNs connection. APNs requires apns-push-type: background for silent and apns-push-type: alert for visible. Some implementations use separate connection pools for each type to avoid the per-request overhead of checking the push type.

Key Insight

APNs’s HTTP/2 multiplexing gives you batching for free - 25 persistent connections each supporting 1000 concurrent streams means 25,000 in-flight APNs requests simultaneously. The correct mental model for APNs is not “batch API” but “saturate the connection pool concurrently.” FCM’s actual batch API reduces per-message overhead differently - it combines multiple messages in one HTTP body, reducing TLS handshake and TCP ack overhead.

Data Model

Core Schema

-- Notifications: top-level notification request, one per API call
CREATE TABLE notifications (
    notification_id UUID        PRIMARY KEY DEFAULT gen_random_uuid(),
    caller_id       VARCHAR(64) NOT NULL,
    user_id         UUID        NOT NULL,
    notif_type      VARCHAR(16) NOT NULL DEFAULT 'visible'
                    CHECK (notif_type IN ('visible', 'silent')),
    urgency         VARCHAR(16) NOT NULL DEFAULT 'normal'
                    CHECK (urgency IN ('critical','high','normal','low')),
    title           VARCHAR(256),
    body            TEXT,
    data            JSONB,
    ttl_seconds     INT         NOT NULL DEFAULT 86400,
    idempotency_key CHAR(64)    UNIQUE,  -- sha256 of dedup fields
    status          VARCHAR(16) NOT NULL DEFAULT 'queued'
                    CHECK (status IN ('queued','dispatched','completed','failed','deduplicated')),
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    expires_at      TIMESTAMPTZ NOT NULL GENERATED ALWAYS AS
                    (created_at + (ttl_seconds * INTERVAL '1 second')) STORED
);

CREATE INDEX idx_notifications_caller ON notifications(caller_id, created_at DESC);
CREATE INDEX idx_notifications_user ON notifications(user_id, created_at DESC);
CREATE INDEX idx_notifications_idempotency ON notifications(idempotency_key);

-- Device tokens: Cassandra schema expressed as SQL for documentation
-- Actual storage: Cassandra table with user_id as partition key
CREATE TABLE device_tokens (
    token_id        UUID        PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id         UUID        NOT NULL,
    device_id       VARCHAR(128) NOT NULL,
    platform        VARCHAR(16) NOT NULL CHECK (platform IN ('ios','android','web')),
    push_token      TEXT        NOT NULL,
    app_bundle      VARCHAR(256),
    app_version     VARCHAR(32),
    os_version      VARCHAR(32),
    is_active       BOOLEAN     NOT NULL DEFAULT true,
    registered_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    last_seen_at    TIMESTAMPTZ,
    token_updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE(user_id, device_id)
);

CREATE INDEX idx_tokens_user_active ON device_tokens(user_id) WHERE is_active = true;

Indexing and Partitioning Strategy

The delivery_receipts table is the write-heavy hot table. It receives one insert per device per notification - at 100M notifications/day with an average 2.5 devices per user, that’s 250M inserts per day. Monthly range partitioning by created_at keeps each partition manageable (7.5B rows over 30 days). The retry index (WHERE delivery_status = 'pending') stays small because most notifications resolve quickly.

For Cassandra token storage, the partition key is user_id (hashed). Clustering columns are (platform, device_id) so that reads for a single user return all devices sorted by platform, and per-device updates are single-row writes.

Watch Out

The Cassandra token table should use ALLOW FILTERING only in the admin tooling, never in production lookups. All production reads should be partition-key-first: SELECT * FROM device_tokens WHERE user_id = ?. An accidental full-table scan on a 1B-row Cassandra table will saturate the cluster’s CPU for minutes.

Key Algorithms and Protocols

Consistent Token Refresh Detection

When a device refreshes its token, the app SDK calls the registration endpoint with the new token. A naive upsert by device_id handles this correctly. The edge case is when the old token was already in-flight for a notification - the APNs 410 response with the old token is still valid and should still trigger eviction of the old token, even though a new token has been registered for the same device.

# Token upsert with race-condition safe conflict handling
import psycopg2
from datetime import datetime, timezone

def upsert_device_token(
    conn,
    user_id: str,
    device_id: str,
    platform: str,
    push_token: str,
    app_bundle: str,
    app_version: str,
) -> dict:
    """
    Upserts a device token. Returns the stored record.
    ON CONFLICT updates only if the new token differs or last_seen_at is stale.
    """
    now = datetime.now(timezone.utc)
    with conn.cursor() as cur:
        cur.execute("""
            INSERT INTO device_tokens
                (user_id, device_id, platform, push_token, app_bundle,
                 app_version, is_active, registered_at, last_seen_at, token_updated_at)
            VALUES (%s, %s, %s, %s, %s, %s, true, %s, %s, %s)
            ON CONFLICT (user_id, device_id) DO UPDATE SET
                push_token       = EXCLUDED.push_token,
                app_version      = EXCLUDED.app_version,
                is_active        = true,
                last_seen_at     = EXCLUDED.last_seen_at,
                token_updated_at = CASE
                    WHEN device_tokens.push_token != EXCLUDED.push_token
                    THEN EXCLUDED.token_updated_at
                    ELSE device_tokens.token_updated_at
                END
            RETURNING *
        """, (user_id, device_id, platform, push_token, app_bundle,
              app_version, now, now, now))
        row = cur.fetchone()
        conn.commit()
        return row

APNs HTTP/2 Connection Management

APNs requires persistent HTTP/2 connections with JWT authentication. The JWT token expires every 60 minutes and must be rotated. A naive implementation rotates the connection on JWT expiry, causing a thundering herd of TLS handshakes. The correct approach rotates only the JWT bearer token in the Authorization header, not the TCP connection.

// APNs JWT rotation without connection churn
package apns

import (
    "crypto/ecdsa"
    "crypto/x509"
    "encoding/pem"
    "fmt"
    "sync"
    "time"

    "github.com/golang-jwt/jwt/v4"
)

type APNsAuth struct {
    mu         sync.RWMutex
    privateKey *ecdsa.PrivateKey
    keyID      string
    teamID     string
    token      string
    issuedAt   time.Time
    // Rotate JWT every 55 minutes (Apple requires < 60 min)
    rotateEvery time.Duration
}

func NewAPNsAuth(pemKey []byte, keyID, teamID string) (*APNsAuth, error) {
    block, _ := pem.Decode(pemKey)
    if block == nil {
        return nil, fmt.Errorf("failed to decode PEM block")
    }
    key, err := x509.ParseECPrivateKey(block.Bytes)
    if err != nil {
        return nil, fmt.Errorf("failed to parse EC key: %w", err)
    }
    a := &APNsAuth{
        privateKey:  key,
        keyID:       keyID,
        teamID:      teamID,
        rotateEvery: 55 * time.Minute,
    }
    if err := a.rotate(); err != nil {
        return nil, err
    }
    return a, nil
}

func (a *APNsAuth) rotate() error {
    now := time.Now()
    claims := jwt.MapClaims{
        "iss": a.teamID,
        "iat": now.Unix(),
    }
    t := jwt.NewWithClaims(jwt.SigningMethodES256, claims)
    t.Header["kid"] = a.keyID
    signed, err := t.SignedString(a.privateKey)
    if err != nil {
        return err
    }
    a.mu.Lock()
    a.token = signed
    a.issuedAt = now
    a.mu.Unlock()
    return nil
}

func (a *APNsAuth) BearerToken() (string, error) {
    a.mu.RLock()
    stale := time.Since(a.issuedAt) >= a.rotateEvery
    a.mu.RUnlock()
    if stale {
        if err := a.rotate(); err != nil {
            return "", err
        }
    }
    a.mu.RLock()
    defer a.mu.RUnlock()
    return a.token, nil
}

Key Insight

APNs HTTP/2 multiplexing means you should never open more than 25-50 connections to APNs per sender instance - each connection supports 1000 concurrent streams. Opening 500 connections “for throughput” actually degrades performance because TLS handshake time dominates. Maximize concurrent streams per connection, not connection count.

Scaling and Performance

Horizontal Scaling Model

The Dispatcher layer scales horizontally with Kafka partition assignment ensuring each dispatcher replica owns a subset of partitions - critical for ordered retry delivery. The Token Registry scales by adding Cassandra nodes to the ring; the consistent hash partitioner redistributes data automatically. The API Gateway scales behind a load balancer with stateless request handling.

Retry state machine and horizontal scaling: exponential backoff with jitter, connection pool sizing, Kafka partition assignment

Capacity Estimation

Given:
  100M notifications/day
  Average 2.5 devices per user = 250M dispatch tasks/day
  Peak = 10x average = 25M dispatch tasks/hour = 6,944/sec
  Average payload = 512 bytes notification + 200 bytes metadata = ~712 bytes

Throughput:
  Average: 250M / 86400 = 2,893 dispatch tasks/sec
  Peak (10x): 28,930 dispatch tasks/sec

Storage (receipts):
  250M receipts/day * 300 bytes/row = 75 GB/day
  90-day retention = 6.75 TB total
  Monthly partitions: ~2.25 TB/month

Token Registry (Cassandra):
  1B active devices * 200 bytes/row = 200 GB
  RF=3: 600 GB total storage across cluster

Redis dedup window:
  100M keys/day * 32 bytes/key = 3.2 GB
  24-hour sliding window = ~3.2 GB steady state
  5-min window for high-frequency: ~11 MB (100M * 5/1440 * 32)

APNs connection math:
  28,930 peak dispatch/sec / 4 dispatcher replicas = 7,232/sec/replica
  25 HTTP/2 connections * 1000 streams = 25,000 concurrent capacity/replica
  Peak load = 7,232/25,000 = 29% connection utilization (healthy headroom)

FCM batch math:
  FCM sends 500 msgs/batch, so at 28,930/sec:
  28,930 / 500 = 58 batch API calls/sec to FCM
  Each batch takes ~100ms RTT to FCM datacenter
  10 gRPC channels * 10 concurrent batches = 100 in-flight batches = safe

Bottlenecks and Hot Spots

The Token Registry becomes a hot spot during broadcast campaigns (e.g., send to all 100M users simultaneously). A cache pre-warming step before campaign launch - proactively loading top-1M users’ token lists into Redis - reduces Cassandra read load by an order of magnitude during the spike.

The Receipt Writer is the second bottleneck. Batch-inserting receipts using PostgreSQL COPY or multi-row inserts instead of single-row INSERT statements improves throughput from ~5K/sec to ~50K/sec on the same hardware.

Real World

Pinterest’s notification infrastructure handles broadcast sends to 400M users using a fan-out service that pre-shards the recipient list into chunks of 10,000 and processes each chunk on a separate worker. This bounded-fan-out pattern prevents any single worker from becoming a bottleneck and allows the broadcast to scale linearly with worker count. The technique is described in their 2018 engineering blog post on notification infrastructure.

Failure Modes and Recovery

Full notification delivery data flow from API call to gateway ACK, including dedup, routing, retry, and DLQ paths

Failure	Detection	Impact	Recovery
APNs gateway timeout	HTTP/2 stream timeout after 10s	Notifications not dispatched to iOS devices	Retry with `InternalServerError` classification; exponential backoff starting at 5s
Redis dedup node failure	Redis RESP3 connection error	Dedup check fails; risk of duplicate sends	Fail open (allow dispatch) with flag logged; Redis Cluster failover within 15s
Cassandra token registry node down	CQL timeout or `NoHostAvailableException`	Cannot look up tokens for affected partitions	Read from replica (RF=3, CL=ONE); alert if two replicas down simultaneously
FCM `QUOTA_EXCEEDED`	FCM HTTP 429 with `Retry-After` header	FCM sends blocked for affected project	Honor `Retry-After` value; separate backoff per FCM sender credential
PostgreSQL receipt writer down	Connection refused	Receipts not recorded; delivery status unknown	Buffer receipts in Kafka; replay against PostgreSQL when recovered
Dispatcher replica crash	Kafka consumer group rebalance (heartbeat timeout 30s)	In-flight messages in that partition delayed 30s	Kafka rebalance reassigns partitions; at-least-once Kafka semantics ensure no message loss

Watch Out

The most common operational mistake is not monitoring APNs certificate or JWT key expiry. APNs p8 JWT keys do not expire, but p12 certificates do - every year. Missing certificate renewal causes a silent cliff-edge failure where iOS notifications stop at midnight on expiry day with zero code changes. Automate certificate expiry alerting with a 30-day warning threshold and test the renewal process in staging at least annually.

Comparison of Approaches

Approach	Latency	Throughput	Failure Handling	Best Fit
Direct synchronous dispatch (single service, no queue)	Low (sub-100ms P50)	Limited by single-node throughput (~5K/sec)	No retry; any gateway failure drops message	Prototypes, small apps under 10K users
Queue-backed async dispatch (Kafka + dispatcher workers)	Medium (100-500ms P50)	High (scale workers horizontally to any throughput)	Full retry with backoff; DLQ for terminal failures	Production systems, all scales above 10K users
Third-party managed service (OneSignal, Braze, Leanplum)	Low to medium	Vendor-dependent; usually very high	Vendor handles retry and dedup	Teams without infra bandwidth; cost-effective under 100M/month
Hybrid: vendor for small scale, own infra above threshold	Low for both tiers	Combined; in-house handles overflow	Vendor dedup may conflict with your own	Large orgs migrating from vendor to in-house gradually
Push-pull model (devices poll for notifications)	High (poll interval, typically 30s-5min)	Low server push load; high poll load	No server-side retry needed; device polls again	IoT or enterprise contexts where push gateway is unavailable

The queue-backed async approach is the right architecture for any production system above 10K active devices. The synchronous approach’s simplicity is a trap - the first time your APNs connection pool saturates under load, you start dropping notifications with no recovery path. The Kafka-backed model costs more to operate but gives you a durable message log, horizontal scalability, and a natural retry substrate. Use a managed service only if you genuinely cannot staff the infrastructure.

Key Takeaways

Device Token Registry: Store tokens in Cassandra partitioned by user_id for efficient per-user lookups; layer a Redis cache with a 5-minute TTL to absorb read amplification during broadcast sends.
Channel Routing Logic: The routing engine must correctly distinguish silent (background) from visible (alert) notification types - the wrong apns-push-type header causes Apple to reject or silently drop messages.
Delivery Receipt Tracking: Write receipts asynchronously via Kafka to decouple dispatch latency from PostgreSQL write latency; use a partial index on next_retry_at to keep retry scans fast.
Retry with Backoff: Always apply full jitter to exponential backoff delays - without jitter, a burst failure causes a synchronized retry storm that amplifies the original problem.
Deduplication Window: Compute the idempotency key from content fields, not from a caller-supplied UUID, so retries with new IDs are still caught; use Redis SET NX EX for atomic check-and-set in a single RTT.
Notification Batching: For FCM, batch up to 500 messages per API call with a 500ms coalesce window; for APNs, maximize concurrent HTTP/2 streams on a small number of persistent connections rather than opening many connections.
Silent vs. Visible Notifications: Track the notification type through the entire dispatch path - different priority rules, connection pools, and APNs headers apply to each type, and conflating them causes subtle delivery failures.

The counter-intuitive lesson from building this system is that the hardest part is not delivery - it is knowing when not to deliver. Deduplication, stale token eviction, and TTL expiry are all mechanisms for not sending. The system that sends reliably is also the system that knows when to stop, which turns out to require more design work than the delivery path itself.

Frequently Asked Questions

Q: Why not use a single Postgres table for the token registry instead of Cassandra?

A: Postgres handles this fine up to a few million devices, but above 100M active device tokens the write amplification of token refreshes (apps upsert their token on every launch) creates hot spots on the tokens table. Cassandra’s write path is append-only (no read-before-write on upsert with IF NOT EXISTS or LWT avoidance) and shards naturally by user_id. If you are already heavily invested in Postgres and under 50M devices, a partitioned Postgres table with a Redis cache is a valid simpler choice.

Q: Why not use APNs2’s native delivery feedback service to learn about failed tokens instead of parsing 410 responses inline?

A: The APNs feedback service was deprecated in the provider API v2. In the modern HTTP/2 provider API, stale token information is returned directly in the response to each send attempt as a 410 Unregistered status. The legacy feedback service polled a separate endpoint periodically. Modern systems should parse the per-request response status, not poll the legacy feedback endpoint.

Q: Why use Redis for deduplication instead of Postgres with a UNIQUE constraint on the idempotency key?

A: Postgres unique constraint checks require a read-before-write that takes exclusive locks on the index. Under high write concurrency, this creates lock contention. Redis SET NX EX is a single atomic operation with O(1) complexity and sub-millisecond latency, and the in-memory nature means the 24-hour sliding window stays warm. The trade-off is Redis is not durable by default - a Redis crash with no persistence could lose the dedup window, causing duplicate sends for the window’s worth of messages. Use Redis Cluster with AOF persistence to mitigate this.

Q: Should silent notifications go through the same dispatch pipeline as visible notifications?

A: They can share the same ingestion and routing path, but they should use separate Kafka topics and separate connection pools at the sender level. Silent notifications have lower urgency but higher frequency (sync triggers fire many times per hour per active user). Mixing them with visible notifications in the same priority queue risks low-urgency silent messages delaying high-priority alerts during bursts. A separate topic with lower partition count and different consumer group settings isolates the two traffic classes.

Q: What is the right TTL for the deduplication window?

A: It depends on the notification type. For marketing or informational notifications, 24 hours covers all reasonable retry scenarios. For transactional notifications (OTP codes, payment confirmations), use a 5-minute window - the caller needs to be able to resend if the first code expires. Never use a longer TTL than the content’s validity period. An OTP deduped for 24 hours means a user who requests a new code within 24 hours might not receive it.

Q: How do you handle a user who has 50 devices (enterprise MDM scenario)?

A: Set a hard cap per user (default 10, configurable up to 50 per tenant) at the token registration API. Beyond the cap, evict the least-recently-seen device. This bounds fan-out cost and prevents token registry abuse. For MDM scenarios, consider a separate broadcast_group_id field that lets the enterprise push to a device group via a group token rather than enumerating all individual device tokens.

Interview Questions

Q: Walk me through how you guarantee at-least-once delivery for a notification to an iOS device when APNs returns a 503 error.

Expected depth: Explain the Kafka-backed retry queue, the error classification between terminal (e.g., BadDeviceToken) and retryable (e.g., ServiceUnavailable) APNs errors, exponential backoff with jitter, the MAX_RETRY_ATTEMPTS limit (typically 5), and the dead-letter queue for terminal failures. Also discuss the ttl_seconds field - a notification that has been retrying for longer than its TTL should be marked EXPIRED rather than retried.

Q: How would you design the deduplication system to work across multiple API Gateway instances without a shared cache?

Expected depth: The candidate should recognize that a shared Redis cluster is the standard solution, not in-process caches. Discuss Redis Cluster sharding by the dedup key (consistent hashing), the trade-off between strong consistency (synchronous Redis write before any dispatch) vs. best-effort (async write with rare duplicate risk), and the failure mode where Redis is unavailable (fail open vs. fail closed). Strong candidates discuss Lua scripts for atomic check-and-set to avoid TOCTOU race conditions under high concurrency.

Q: How would you handle a burst of 10 million notifications to send in under 5 minutes (a major app release announcement)?

Expected depth: Discuss horizontal scaling of dispatchers with Kafka partition pre-assignment, token registry cache pre-warming (loading top-N users into Redis before send), FCM batch API utilization to maximize throughput per HTTP connection, APNs HTTP/2 connection pool sizing math, and the need to spread fan-out over time (rate-limit to smooth the burst rather than spike APNs/FCM). Strong candidates mention token registry read amplification and the Redis cache warming pattern.

Q: What changes to this design would you make to support 10 billion notifications per day (100x current scale)?

Expected depth: Sharding the entire dispatch pipeline by geographic region (APNs US datacenter vs. APNs EU datacenter), moving receipt storage from PostgreSQL to a distributed time-series store (e.g., Cassandra again, or ClickHouse for analytics), multi-region Kafka replication, and separating the notification pipeline per channel rather than a shared Kafka topic. Discuss the trade-off between operational complexity and scalability - at 10B/day, the single-region Kafka model with one consumer group per dispatcher becomes a bottleneck.

Q: How would you extend this system to support in-app notification badges (the unread count on the app icon)?

Expected depth: Badge counts are a separate read model from delivery receipts - they represent how many unread notifications exist, not how many were sent. The candidate should distinguish between: (1) setting the badge count as part of the APNs/FCM payload (server-controlled), (2) maintaining a server-side unread count per user in Redis (incr/decr on send/open), and (3) having the app SDK report its own count. Discuss the consistency challenges when a user reads notifications on one device - does the badge clear on other devices? This requires a separate notification_reads event stream and per-device badge state reconciliation.

Premium Content

Unlock the full article along with everything else in the archive — all in one place.

In-depth analysis Expert insights Full archive access

Unlock Full Article