Build YouTube's View Count System

System Design Deep Dive

YouTube’s View Count System

Counting billions of views accurately in near real-time without slowing video delivery or letting bots inflate the numbers

⏱ 14 min read📐 Advanced🏗️ View-Count

Imagine running a gas station with a broken pump display. You need to show customers exactly how much gas is flowing - but the sensor inside the pump takes a reading only once a minute and the display lags by another 30 seconds. Meanwhile, there’s a guy with a garden hose pretending to fill up without paying. That is precisely the problem with view counts at YouTube scale: you want an accurate number fast, and you need to filter out fraud, but the mechanism that serves the number must never slow down the thing people actually care about - watching the video.

YouTube ingests roughly 5 billion views per day, or about 58,000 view events per second at peak. A naive counter - increment a single row in Postgres every time someone watches - would melt under this load. At 58K writes per second a single row would be the hottest spot in any database ever deployed. Even Redis, which is fast, would saturate a single key at somewhere around 100K-200K writes per second before CPU becomes the bottleneck - and that is with no concurrency protection. More critically, a single counter has no isolation: if the counting service restarts, you lose uncounted views. If the bot detection step is synchronous, it blocks video delivery. If you query the raw counter directly from video pages, count reads add latency to the most traffic-heavy path in the system.

The real challenge is that four concerns pull in opposite directions. Accuracy requires counting every real view exactly once. Latency requires never touching the counting path from the video serve path. Freshness requires counts that update in seconds for trending content. Fraud resistance requires discarding bot traffic without allowing bots to cause legitimate views to be lost. Getting all four right simultaneously is the engineering problem.

We need to solve for write throughput through counter sharding, approximate deduplication via HyperLogLog, asynchronous bot filtering off the critical path, and eventual consistency between in-memory counters and the durable source of truth. Let’s build it.

Requirements and Constraints

Functional Requirements

Track a view when a user watches a video for at least 30 seconds (YouTube’s documented threshold)
Display the current view count on the video watch page, updated within 30 seconds for normal videos
For trending videos (top 0.1% by current velocity), update the displayed count within 5 seconds
Resist inflation from bots, page refreshes within short windows, and automated crawlers
Provide an admin API to query raw, unfiltered event counts for abuse investigation
Support a “unique viewers” metric distinct from total play count (one viewer watching 10 times = 10 plays, 1 unique)

Non-Functional Requirements

Write throughput: 58,000 view events per second sustained, 200,000 at peak (viral video moments)
Read latency: View count reads on video pages must complete in under 10ms (P99)
Count freshness: Less than 30 seconds lag for standard videos; less than 5 seconds for trending
Durability: No view event lost once acknowledged by the edge layer, even if internal services crash
Availability: 99.99% - view counting outages cause visible stale numbers, which erode creator trust
Storage: ~800 million active videos; storing sharded counters + reconciliation records = roughly 200GB in Redis, 50GB in Postgres

Constraints and Assumptions

The video serve path (CDN delivery) must be completely decoupled from counting; a counting outage must not affect playback
View events are submitted via fire-and-forget beacons from the CDN edge, not from the browser directly
We accept eventual consistency: the displayed count can be up to 30 seconds stale in the normal case
Bot filtering runs asynchronously - some bot views may be counted briefly before the filter catches up, but reconciliation corrects this within hours
“View” definition is a product decision fixed upstream; we design for that threshold, not for detecting it

High-Level Architecture

The system splits cleanly into a write path and a read path, with an async reconciliation job connecting them to the durable store.

YouTube view count system architecture overview

The write path starts at the CDN edge. When the player’s 30-second watch threshold triggers, the client fires a beacon to the nearest CDN edge node. The edge node enqueues the event to Kafka (the view_events topic, 30 partitions) and immediately returns a 200 - it never touches a database. Kafka gives us durability: the event is on disk before we ACK.

The Bot Filter Service consumes from the view_events topic, scores each event using IP velocity, user-agent signals, watch duration, and replay rate, and emits valid events onto a filtered clean_view_events topic. The View Aggregation Service consumes from clean_view_events, batches events in 500ms windows, and writes increments to Redis Counter Shards and HyperLogLog registers.

The read path is completely separate. The Count Read API serves GET /count/{video_id} requests from video pages. It reads from a Redis read cache with a 30-second TTL (5 seconds for trending videos). It never reads from the write path’s shards directly.

An hourly Reconciliation Job sums all shard counters for each video and writes a canonical count to Postgres, which becomes the source of truth for historical reporting, analytics, and the fallback when Redis is unavailable.

Key Insight

The video serve path and the counting path share zero infrastructure. A complete failure of the counting system shows stale numbers - it does not drop a single video frame. Decoupling them with Kafka as the handoff point is the architectural decision everything else depends on.

The Ingestion Pipeline

The ingestion pipeline’s job is to durably capture every view event off the critical path, at any volume, without acknowledgment delays.

The CDN edge node does exactly two things when a view event arrives: enqueue to Kafka and return 200. The Kafka producer is configured with acks=1 (leader acknowledgment only) and linger.ms=5 to batch events from concurrent watchers into fewer produce requests. The tradeoff: acks=1 risks losing an event if the Kafka leader crashes before replication. At YouTube scale this is acceptable - losing 0.001% of events in a node crash scenario is better than adding latency to 100% of video deliveries.

# View event beacon handler on the CDN edge (Python/FastAPI sketch)
# Demonstrates: fire-and-forget event submission, no database touch
from aiokafka import AIOKafkaProducer
import msgpack, time

producer = AIOKafkaProducer(
    bootstrap_servers="kafka-cluster:9092",
    acks=1,
    linger_ms=5,
    compression_type="lz4",
)

async def handle_view_event(video_id: str, viewer_id: str, ip: str, ua: str):
    event = {
        "video_id": video_id,
        "viewer_id": viewer_id,
        "ip": ip,
        "ua": ua,
        "ts": int(time.time() * 1000),
        "session_id": generate_session_id(viewer_id, video_id),
    }
    # Partition by video_id so events for the same video land on the same partition
    key = video_id.encode()
    await producer.send("view_events", value=msgpack.packb(event), key=key)
    return {"status": "ok"}  # never block on Kafka confirmation

Partitioning by video_id means all events for a given video land on the same Kafka partition, which preserves ordering and makes downstream deduplication simpler. With 30 partitions and 800 million videos, roughly 27 million videos per partition - all well within Kafka’s limits.

Real World

YouTube’s actual view counting uses a similar fire-and-forget beacon approach. Early versions counted views synchronously during video serve, which caused counting to slow video delivery under load. The decoupling to async beacons was one of the earliest architectural changes at scale.

The Bot Filter Service

The bot filter’s job is to discard fraudulent view events before they reach the aggregation layer, using a scoring pipeline that runs in microseconds per event.

Think of the bot filter as a customs agent at an airport. Most travelers wave through in seconds because their documents are obviously valid. A small fraction get pulled aside for a longer check. A tiny fraction get rejected entirely. The scoring is the same: most events score low and pass immediately; suspicious ones get inspected; clear bots get dropped.

Each event is evaluated on five signal categories:

IP velocity: how many view events have come from this IP in the last 60 seconds, tracked in Redis as a sliding window counter per (ip, minute_bucket). Datacenter CIDRs and known VPN ASNs get a flat penalty.
Watch duration ratio: the ratio of watch time to video length. A 10-minute video watched for exactly 31 seconds on 200 different requests from the same IP is a strong bot signal.
User-agent quality: presence of known headless browser fingerprints (HeadlessChrome, python-requests), missing Accept-Language headers, or UA strings that rotate suspiciously.
Replay rate: has this (viewer_id, video_id) pair been seen in the last 30 minutes? Legitimate users rarely watch the same video 10 times in 30 minutes.
TLS fingerprint: the JA3 hash of the TLS handshake identifies the client library. Automated tools have distinctive JA3 fingerprints even when they spoof the User-Agent.

# Bot scoring: weighted feature evaluation
# Demonstrates: signal extraction and score composition

import redis
import hashlib
import time

r = redis.Redis(host="redis-bot-signals", decode_responses=True)

WEIGHTS = {
    "ip_velocity": 0.35,
    "watch_ratio": 0.25,
    "ua_quality": 0.20,
    "replay_rate": 0.15,
    "tls_fp": 0.05,
}

def score_event(event: dict) -> float:
    scores = {}

    # IP velocity signal
    ip_key = f"ipvel:{event['ip']}:{int(time.time()) // 60}"
    count = r.incr(ip_key)
    r.expire(ip_key, 120)
    scores["ip_velocity"] = min(1.0, count / 200.0)  # 200 req/min = max score

    # Watch duration ratio
    watch_secs = event.get("watch_secs", 0)
    video_secs = event.get("video_duration_secs", 1)
    ratio = watch_secs / max(video_secs, 1)
    # Suspicious: always exactly at the threshold (31s) or less than 5%
    scores["watch_ratio"] = 1.0 if ratio < 0.05 else (0.6 if abs(ratio - 31 / video_secs) < 0.01 else 0.0)

    # UA quality
    ua = event.get("ua", "")
    headless_signals = ["HeadlessChrome", "python-requests", "curl/", "Wget/", "Go-http-client"]
    scores["ua_quality"] = 1.0 if any(h in ua for h in headless_signals) else 0.0

    # Replay rate
    replay_key = f"replay:{event['viewer_id']}:{event['video_id']}"
    seen = r.get(replay_key)
    r.setex(replay_key, 1800, "1")  # 30-minute window
    scores["replay_rate"] = 0.8 if seen else 0.0

    # TLS fingerprint (passed in by edge)
    known_bot_ja3 = {"a0e9f5d64349fb13191bc781f81f42e1", "b32309a26951912be7dba376398d89"}
    scores["tls_fp"] = 1.0 if event.get("ja3") in known_bot_ja3 else 0.0

    total = sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
    return total

def filter_event(event: dict) -> bool:
    score = score_event(event)
    if score > 0.7:
        log_rejected(event, score)
        return False
    return True

Events with a score above 0.7 are rejected and written to an audit log for retrospective analysis. Events between 0.4 and 0.7 are soft-flagged and sampled for ML model retraining. Everything below 0.4 passes without further inspection.

Watch Out

A synchronous bot filter on the write path is a correctness trap: if the filter is slow, you back-pressure Kafka consumers and views pile up unprocessed. The filter must run as an async consumer group, not inline with event ingestion. If it falls behind, you accept temporarily counting some bot views - reconciliation will correct them.

The Aggregation Service and Counter Sharding

The aggregation service’s job is to translate a stream of individual view events into atomic counter increments with minimal write amplification.

Counter sharding is the key mechanism here. A single Redis key for a viral video would receive tens of thousands of INCR operations per second during peak views. Redis is single-threaded per key - this bottlenecks on CPU, not memory bandwidth. The solution is to shard the counter: instead of one key view_count:{video_id}, we maintain 128 keys view_count:{video_id}:{shard}. Each write goes to exactly one shard, selected by hash(event_id) % 128. The true count is the sum of all 128 shards.

Counter sharding internals - 128 shards per video

The aggregation service runs 50 parallel workers, each consuming a slice of Kafka partitions. Each worker maintains a local 500ms batch buffer. At the end of each batch window, it emits a INCRBY per shard - not one INCR per event. This batching is critical: at 58K events/sec, batching 500ms turns 29,000 individual INCR calls into roughly 128 INCRBY calls (one per active shard), a 225x reduction in Redis round trips.

# Aggregation worker: batching + sharded INCR
# Demonstrates: batch window, shard selection, pipeline write
import asyncio
import hashlib
import redis.asyncio as aioredis

SHARD_COUNT = 128
BATCH_WINDOW_MS = 500

class AggregationWorker:
    def __init__(self, redis_client):
        self.r = redis_client
        self.buffer: dict[str, int] = {}  # key -> increment count

    def get_shard_key(self, video_id: str, event_id: str) -> str:
        shard = int(hashlib.md5(event_id.encode()).hexdigest(), 16) % SHARD_COUNT
        return f"vc:{video_id}:{shard}"

    def add_event(self, video_id: str, event_id: str):
        key = self.get_shard_key(video_id, event_id)
        self.buffer[key] = self.buffer.get(key, 0) + 1

    async def flush(self):
        if not self.buffer:
            return
        pipe = self.r.pipeline(transaction=False)
        for key, count in self.buffer.items():
            pipe.incrby(key, count)
            pipe.expire(key, 86400 * 7)  # 7-day TTL; reconciliation persists before expiry
        await pipe.execute()
        self.buffer.clear()

    async def run(self, kafka_consumer):
        while True:
            flush_task = asyncio.create_task(asyncio.sleep(BATCH_WINDOW_MS / 1000))
            events_in_window = []
            async for msg in kafka_consumer:
                events_in_window.append(msg)
                if flush_task.done():
                    break
            for msg in events_in_window:
                self.add_event(msg.value["video_id"], msg.value["session_id"])
            await self.flush()

Reading the total count requires summing all 128 shards. This is done via a Redis pipeline (MGET of all 128 keys) and takes roughly 2-3ms. The count is then cached in a separate read-layer key with a 30-second TTL, so the 128-key aggregation runs at most once every 30 seconds per video, not on every page load.

Key Insight

Sharding the counter trades read complexity (sum 128 keys instead of GET 1) for write scalability (128 independent hot keys instead of 1 bottleneck). The read cost is paid once every TTL window and cached - the write benefit is continuous and proportional to the number of shards.

Approximate Counting with HyperLogLog

Alongside the raw play count, the system tracks unique viewers - a fundamentally different problem. Counting unique viewers requires deduplication: if viewer A watches video X ten times, she contributes 1 to the unique count, not 10.

Exact deduplication at scale requires storing a set of all viewer IDs per video - at hundreds of millions of viewers per viral video, this is gigabytes of state per video, just for deduplication. HyperLogLog (HLL) solves this with a probabilistic data structure that estimates set cardinality using 12KB of memory regardless of the set size, with a standard error of 0.81%.

Redis implements HLL natively. Adding a viewer is PFADD unique_viewers:{video_id} {viewer_id} (O(1)). Querying the unique count is PFCOUNT unique_viewers:{video_id} (O(1)). The memory overhead is constant and tiny.

# HyperLogLog unique viewer tracking
# Demonstrates: PFADD for dedup, PFCOUNT for cardinality estimate

import redis

r = redis.Redis()

def record_view(video_id: str, viewer_id: str) -> bool:
    """Returns True if this is a new unique viewer for this video."""
    hll_key = f"unique_viewers:{video_id}"
    # PFADD returns 1 if the internal HLL changed (likely a new unique element)
    # Returns 0 if the element was probably already present
    is_new = r.pfadd(hll_key, viewer_id)
    r.expire(hll_key, 86400 * 30)  # 30-day rolling window
    return bool(is_new)

def get_unique_viewers(video_id: str) -> int:
    """Returns approximate unique viewer count with ~0.81% standard error."""
    return r.pfcount(f"unique_viewers:{video_id}")

def merge_unique_viewers(video_ids: list[str]) -> int:
    """Get unique viewers across a set of videos (e.g., for a channel total)."""
    keys = [f"unique_viewers:{vid}" for vid in video_ids]
    return r.pfcount(*keys)  # Redis merges HLLs on the fly for cross-key PFCOUNT

The 0.81% error means a video with 10 million unique viewers might be displayed as 10,081,000 or 9,919,000. This is perfectly acceptable for display purposes - YouTube already rounds large counts to two significant figures (“10M views” rather than “10,081,337 views”).

Real World

Redis’s PFADD and PFCOUNT are used by Twitter for tracking unique tweet impressions, by Cloudflare for unique IP counting per zone, and by many ad-tech systems for reach frequency capping. The 12KB-per-counter memory cost makes HLL viable even when tracking billions of unique elements.

The Read Path and Eventual Consistency Display

The count read path’s job is to serve sub-10ms count lookups to video pages without touching the write path’s counter shards on every request.

Async data flow from view event to displayed count

The read architecture uses a two-tier cache. The shard aggregation cache is a single Redis key cached_count:{video_id} that stores the sum of all 128 shards. A background process (the “count refresher”) updates this key on a TTL schedule: every 30 seconds for standard videos, every 5 seconds for trending videos.

Video pages read from cached_count:{video_id}. If the key is missing (cold start, or TTL expired and the refresher hasn’t run yet), they fall back to summing the shards directly. If the shards are also empty (very cold video or Redis cold restart), they fall back to Postgres.

# Count read API with tiered fallback
# Demonstrates: read path decoupling from write path, tiered fallback

import redis
import psycopg2
from functools import lru_cache

r = redis.Redis()
TRENDING_TTL = 5
STANDARD_TTL = 30
SHARD_COUNT = 128

def get_view_count(video_id: str) -> dict:
    # Tier 1: read-layer cache (fastest path)
    cached = r.get(f"cached_count:{video_id}")
    if cached:
        return {"count": int(cached), "source": "cache", "lag_seconds": STANDARD_TTL}

    # Tier 2: sum shards directly
    shard_keys = [f"vc:{video_id}:{s}" for s in range(SHARD_COUNT)]
    values = r.mget(shard_keys)
    shard_sum = sum(int(v) for v in values if v)
    if shard_sum > 0:
        # Backfill the read cache for next request
        is_trending = is_trending_video(video_id)
        ttl = TRENDING_TTL if is_trending else STANDARD_TTL
        r.setex(f"cached_count:{video_id}", ttl, shard_sum)
        return {"count": shard_sum, "source": "shards", "lag_seconds": 0}

    # Tier 3: Postgres (source of truth, for cold/old videos)
    count = fetch_count_from_postgres(video_id)
    return {"count": count, "source": "postgres", "lag_seconds": 3600}

def fetch_count_from_postgres(video_id: str) -> int:
    with psycopg2.connect(dsn=PG_DSN) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT view_count FROM video_view_counts WHERE video_id = %s",
                (video_id,),
            )
            row = cur.fetchone()
            return row[0] if row else 0

The eventual consistency display means users will sometimes see a count that is up to 30 seconds behind. This is a deliberate product decision. YouTube has trained users to understand that view counts are approximate and lag slightly. The alternative - a strongly consistent count that requires a distributed transaction on every view - would add hundreds of milliseconds to every video load.

Watch Out

If the read cache TTL and the shard update frequency are both 30 seconds but are not synchronized, a viewer could see a count that lags up to 60 seconds (cache expires at second 29, shards were last updated at second 1). Stagger the TTL slightly below the update interval to avoid compounding lag.

Periodic Reconciliation

The reconciliation job’s job is to periodically drain the in-memory shard counters into Postgres, resetting the source of truth and enabling shard key cleanup.

Every hour, the reconciliation job runs as a scheduled task. It reads all 128 shard counters for each video using a Redis pipeline, sums them, and adds the total to the Postgres video_view_counts.view_count column using an upsert. After a successful write, it resets the shard counters to zero (atomically, using a Lua script to avoid a TOCTOU race).

-- Schema: view counting source of truth
-- Demonstrates: upsert pattern, audit trail, partitioning strategy

CREATE TABLE video_view_counts (
    video_id        TEXT        NOT NULL,
    view_count      BIGINT      NOT NULL DEFAULT 0,
    unique_viewers  BIGINT      NOT NULL DEFAULT 0,
    last_reconciled TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    shard_epoch     BIGINT      NOT NULL DEFAULT 0,
    PRIMARY KEY (video_id)
);

-- Separate reconciliation audit table for debugging
CREATE TABLE reconciliation_log (
    id              BIGSERIAL   PRIMARY KEY,
    video_id        TEXT        NOT NULL,
    delta           BIGINT      NOT NULL,
    shard_snapshot  JSONB,      -- stores {shard_id: count} for debugging
    reconciled_at   TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX ON reconciliation_log (video_id, reconciled_at DESC);
CREATE INDEX ON video_view_counts (view_count DESC);  -- for "most viewed" queries

The shard reset uses a Lua script to ensure atomicity - we do not want to decrement the shard by an amount different from what we read, which would happen if new views arrive between the MGET and the DECRBY:

-- Lua script: atomic read-and-reset of all shards for a video
-- Demonstrates: Lua atomicity in Redis, TOCTOU prevention
local prefix = ARGV[1]
local shard_count = tonumber(ARGV[2])
local total = 0
for i = 0, shard_count - 1 do
    local key = prefix .. ":" .. i
    local val = redis.call("GETSET", key, "0")
    if val then
        total = total + tonumber(val)
    end
end
return total

The Lua script uses GETSET to atomically swap each shard to zero and return the old value. Any views that arrive between shard reads within the Lua script are counted in the next reconciliation cycle - they are not lost.

Key Insight

Reconciliation does not need to be transactionally consistent across all videos simultaneously - it just needs to eventually produce a correct total for each video independently. Running one video’s reconciliation while another video’s shards keep accumulating is perfectly safe because each video’s shards are independent.

Data Model

The data model separates concerns into three tiers: real-time write state (Redis shards), durable aggregate state (Postgres), and transient read state (Redis read cache).

-- Redis key patterns (documented as DDL equivalent)
-- vc:{video_id}:{shard}          String  INCR target, 7-day TTL
-- unique_viewers:{video_id}      HyperLogLog, 30-day TTL
-- cached_count:{video_id}        String  read cache, 5s or 30s TTL
-- ipvel:{ip}:{minute_bucket}     String  bot detection velocity, 2-min TTL
-- replay:{viewer_id}:{video_id}  String  replay detection, 30-min TTL

-- Postgres: durable view count store
CREATE TABLE video_view_counts (
    video_id        TEXT        PRIMARY KEY,
    view_count      BIGINT      NOT NULL DEFAULT 0,
    unique_viewers  BIGINT      NOT NULL DEFAULT 0,
    last_reconciled TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    shard_epoch     BIGINT      NOT NULL DEFAULT 0
);

-- Postgres: per-video metadata (written by content pipeline, read-only here)
CREATE TABLE video_metadata (
    video_id        TEXT        PRIMARY KEY,
    title           TEXT,
    channel_id      TEXT        NOT NULL,
    published_at    TIMESTAMPTZ,
    duration_secs   INT,
    is_trending     BOOLEAN     NOT NULL DEFAULT FALSE
);

-- Postgres: bot rejection audit
CREATE TABLE rejected_view_events (
    id              BIGSERIAL   PRIMARY KEY,
    video_id        TEXT        NOT NULL,
    ip              INET,
    viewer_id       TEXT,
    bot_score       FLOAT       NOT NULL,
    reject_reason   TEXT,
    event_ts        TIMESTAMPTZ NOT NULL,
    logged_at       TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX ON rejected_view_events (video_id, event_ts DESC);
CREATE INDEX ON rejected_view_events (ip, event_ts DESC);

Partitioning strategy: video_view_counts does not need partitioning - 800 million rows with a single integer per row is about 30GB, well within Postgres’s comfort zone. rejected_view_events is time-partitioned by logged_at monthly, since it grows without bound and old rejections are rarely queried.

Sharding key choice: Redis shard keys use hash(session_id) % 128 rather than hash(viewer_id) % 128. Using session_id means the same viewer watching from two devices in the same minute gets balanced across shards, rather than always hitting the same shard. This distributes write load more evenly.

Key Algorithms and Protocols

HyperLogLog Cardinality Estimation

HyperLogLog works by hashing each element to a binary string and observing the position of the leftmost 1-bit. Elements that hash to strings starting with many leading zeros are rare - they indicate a large set. By tracking the maximum leading-zero position across all elements seen, and averaging across multiple hash functions (called registers), HLL estimates cardinality with remarkable accuracy.

# HyperLogLog register update (conceptual implementation)
# Demonstrates: the hash-and-observe mechanism behind PFADD
import hashlib
import math

def hll_add(registers: list[int], element: str, num_registers: int = 16) -> bool:
    # Hash the element to a 64-bit integer
    h = int(hashlib.sha256(element.encode()).hexdigest(), 16) & 0xFFFFFFFFFFFFFFFF

    # Use first log2(num_registers) bits to select the register
    register_bits = int(math.log2(num_registers))
    register_idx = h >> (64 - register_bits)

    # Count leading zeros in the remaining bits
    remaining = h & ((1 << (64 - register_bits)) - 1)
    leading_zeros = 0
    for i in range(64 - register_bits - 1, -1, -1):
        if remaining & (1 << i):
            break
        leading_zeros += 1

    # Update register if we've seen more leading zeros
    if leading_zeros + 1 > registers[register_idx]:
        registers[register_idx] = leading_zeros + 1
        return True  # register changed = likely new unique element
    return False

def hll_count(registers: list[int]) -> int:
    m = len(registers)
    # Harmonic mean of 2^(-M[j]) across all registers
    z = sum(2.0 ** (-r) for r in registers)
    alpha = 0.7213 / (1 + 1.079 / m)
    estimate = alpha * m * m / z
    return int(estimate)

Time complexity: O(1) per add, O(m) per count where m is number of registers (16 for Redis’s implementation in sparse mode). Space: O(m) = 12KB for m=16384 registers.

Key Insight

The property that makes HLL correct at scale is that the maximum leading-zero count across millions of elements converges to a stable estimate of log2(n). Adding elements that are already in the set does not change the maximum - they are idempotent, which gives HLL its deduplication property for free.

Sliding Window Velocity Counter

Bot detection uses a sliding window counter per IP address to detect velocity spikes. Fixed-window counters have an edge case: a bot sending 999 requests at 11:59:59 PM and 999 requests at 12:00:01 AM evades a “1000/minute” limit because each falls in a separate window. A true sliding window avoids this.

# Sliding window velocity counter using sorted sets
# Demonstrates: correct sliding window without fixed-window boundary bypass
import time
import redis

r = redis.Redis()

def check_and_record_velocity(ip: str, video_id: str, window_secs: int = 60, limit: int = 100) -> bool:
    now_ms = int(time.time() * 1000)
    window_ms = window_secs * 1000
    key = f"vel:{ip}:{video_id}"

    pipe = r.pipeline()
    # Remove events outside the window
    pipe.zremrangebyscore(key, 0, now_ms - window_ms)
    # Count events in the window
    pipe.zcard(key)
    # Add this event
    pipe.zadd(key, {str(now_ms): now_ms})
    # Set TTL
    pipe.expire(key, window_secs * 2)
    results = pipe.execute()

    current_count = results[1]
    return current_count < limit  # True = not exceeding limit, False = rate limited

Space complexity: O(k) per key where k is the count of events in the window. Time: O(log k) for the sorted set operations. For most users this is negligible; for bots generating thousands of events we cap stored entries by the zremrangebyscore call.

Idempotent Session Deduplication

A user who refreshes a video page should not generate two view events. The client assigns a session_id that combines viewer_id, video_id, and a session start timestamp floored to the nearest 30 minutes. The aggregation service tracks seen session IDs in a Redis set with a 30-minute TTL, discarding duplicate events within the same session window.

# Session-based deduplication on the aggregation side
# Demonstrates: idempotency key pattern for view dedup
import redis, hashlib, time

r = redis.Redis()

def is_duplicate_view(viewer_id: str, video_id: str) -> bool:
    window = int(time.time()) // 1800  # 30-minute window
    session_id = hashlib.sha256(f"{viewer_id}:{video_id}:{window}".encode()).hexdigest()[:16]
    dedup_key = f"dedup:{session_id}"
    # SET NX returns True only the first time
    is_new = r.set(dedup_key, 1, nx=True, ex=3600)
    return not bool(is_new)  # True if duplicate (key already existed)

Scaling and Performance

The system’s scaling story is primarily about write throughput. Reads scale trivially with read replicas and CDN caching. Writes scale through sharding and batching.

Capacity Estimation (peak load):

Input:
  5B views/day = 58K views/sec average
  Viral spike: 3.4x = 200K views/sec peak
  Average event size: 250 bytes (video_id, viewer_id, ip, ua, ts)

Kafka:
  200K events/sec * 250 bytes = 50 MB/sec ingress
  30 partitions, lz4 compressed = ~10 MB/sec per partition
  7-day retention: 50 MB/sec * 86400 * 7 = ~30 TB

Redis (counter shards):
  800M videos * 128 shards/video * 8 bytes = 820 GB (if all shards populated)
  In practice: ~2% of videos have active views = ~16 GB live shard data
  Peak INCR ops: 200K events/sec / 128 shards = 1,562 ops/shard/sec
  Redis can sustain 200K+ ops/sec per node; shards need 1 Redis node

Bot filter:
  200K events/sec, ~1ms scoring time per event
  Parallelism needed: 200 concurrent goroutines or 20 workers with batching

Postgres:
  800M rows * ~200 bytes = ~160 GB (fits on single node with SSDs)
  Reconciliation: 1 write/video/hour = ~800M upserts/hour = ~220 upserts/sec
  Well within Postgres limits

Bottlenecks and mitigations:

The dominant write bottleneck is Redis. With 128 shards and 50 parallel aggregation workers, peak writes per shard are around 1,600 per second - well within a single Redis node’s capacity. If load doubles, doubling shard count doubles capacity linearly.

The Kafka consumer throughput becomes the bottleneck if bot filtering falls behind. The bot filter consumer group scales horizontally - add more consumer instances up to the partition count (30 partitions = 30 max parallel consumers).

Real World

Twitter’s “like count” system faced the same bottleneck. They moved from a single Redis key per tweet to a sharded counter design identical to this one, with the additional optimization of using Redis Cluster with keyslot tags to ensure all shards for a given tweet land on the same cluster node, allowing multi-key pipelines without cross-node hops.

Failure Modes and Recovery

Failure	Detection	Impact	Recovery
Kafka broker crash	Kafka JMX `UnderReplicatedPartitions > 0`	Events queue on producers; no data loss if `acks=1`	Kafka controller elects new leader; producers retry automatically
Bot filter consumer lag	Consumer group lag > 100K events	Bot views counted temporarily; corrected by reconciliation	Scale up consumer instances; backpressure to Kafka auto-manages
Redis shard node crash	Redis Sentinel failover alert	In-flight shard increments lost; ~30 seconds of uncounted views	Redis Sentinel promotes replica; reconciliation fills Postgres gap
Reconciliation job failure	Job runs as Kubernetes CronJob; alerts on missing `reconciled_at` update	Redis shards grow indefinitely; 7-day TTL clears them without Postgres update	Re-run job from last checkpoint; shard counters still valid if TTL not expired
Count Read API node crash	Load balancer health check	Count reads fall back to Postgres (slower but correct)	Pod restart via Kubernetes; stateless so instant recovery
Postgres primary failure	PgBouncer connection errors; monitoring on replication lag	Reconciliation writes pause; read fallback serves stale Postgres	Promote read replica; replay recent reconciliation logs

Watch Out

The most operationally dangerous failure is Redis key expiry during a reconciliation outage. If the reconciliation job fails for longer than the shard TTL (7 days), the shard data evaporates and the view count delta is lost permanently. Always alert when reconciliation job last-run time exceeds 2 hours.

Comparison of Approaches

Approach	Write Latency	Read Latency	Consistency	Fraud Resistance	Best Fit
Single counter row (Postgres)	5-50ms (locking)	1-5ms	Strong	Easy (single row to audit)	Very low traffic only
Redis single key	0.1-0.5ms	0.1ms	Eventual	Moderate	Medium scale, no viral spikes
Redis sharded counter (this design)	0.1-0.3ms	1-5ms (128 MGET)	Eventual (30s)	Strong (async filter)	YouTube-scale write throughput
Flink streaming aggregation	100-500ms (window)	5-20ms	Near-real-time	Strong (in-stream)	When analytics depth needed alongside count
Lambda architecture (batch + speed)	0ms (fire-forget)	10-50ms	Eventually consistent	Strong	When historical recomputation needed
CRDT counter (distributed)	1-5ms	5-10ms (merge)	Eventual (convergent)	Moderate	Multi-datacenter active-active write

The sharded Redis approach wins for this use case because the read side is heavily cached and the write side is the bottleneck. The engineering investment in summing 128 keys on reads is trivial compared to the operational simplicity of avoiding a streaming framework like Flink for what is fundamentally just a counter increment.

Flink would be the right choice if we needed to compute more complex aggregations alongside the count - retention curves, watch time histograms, or funnel analysis. For pure view counting, it is overengineered.

Key Takeaways

Counter sharding distributes write load by splitting a single counter into N independent keys, eliminating the hot-key bottleneck that would appear on any viral video.
Approximate counting with HyperLogLog enables unique viewer tracking with constant 12KB memory per video, accepting 0.81% error - a reasonable tradeoff for a display metric.
Eventual consistency display is a deliberate product choice, not a limitation: 30 seconds of lag is invisible to users and enables a fully decoupled write path.
Async count pipeline ensures that bot filtering, aggregation, and deduplication never block video delivery - Kafka is the durable handoff buffer between them.
Bot filtering signals layer multiple weak signals (IP velocity, watch duration, UA quality, replay rate, TLS fingerprint) into a single score, making it expensive for bots to evade all signals simultaneously.
Periodic reconciliation bridges the gap between ephemeral Redis counters and durable Postgres state, providing a source of truth that survives full Redis failures.
Read path decoupling keeps count reads on a separate cache tier so popular video pages do not fan out to 128 Redis keys per request - the refresh is done in the background, not on the read path.
Idempotency via session IDs prevents double-counting from refreshes without requiring a distributed lock per view event.

The counter-intuitive lesson in this system is that the displayed view count is not the system’s most important data. The reconciled Postgres count is. The displayed count is a best-effort approximation that serves UX needs. The Postgres count serves creator payments, content ranking, and abuse investigations. Designing for that distinction - rather than trying to make the displayed count exactly equal to the true count in real time - is what makes the architecture tractable at this scale.

Frequently Asked Questions

Q: Why not just use a database sequence or atomic counter directly?

A: At 58K writes per second, even Redis’s single-key INCR approaches its per-key CPU limit for viral videos. A database row with a locking UPDATE view_count = view_count + 1 saturates a Postgres connection pool within seconds of a viral spike. Sharding to 128 Redis keys distributes the hot-write load and brings each shard well below saturation limits. The read cost of summing 128 keys is paid once per TTL interval, not per page view.

Q: Why Kafka instead of writing directly to Redis from the CDN edge?

A: Kafka provides durability that Redis does not. A Redis write that lands on a node before it crashes is lost. A Kafka write is replicated before ACK. Since we fire-and-forget from the CDN edge, we need the intermediary to be durable. Kafka also decouples the CDN edge from the processing pipeline - the CDN does not need to know about bot filtering, shard counts, or reconciliation.

Q: Why not run bot detection before Kafka, at the CDN edge?

A: CDN edge nodes are designed for minimal-latency request handling - they should not run ML-weight scoring inline. Synchronous bot detection at the edge adds 5-20ms to every video view acknowledgment. Async detection via Kafka adds zero latency to the user experience. The tradeoff is that some bot views land in the counter briefly before the filter catches up, but reconciliation corrects this.

Q: How does the system handle a video going viral and spiking to 10x normal traffic in 60 seconds?

A: The architecture handles this gracefully because every component scales independently. Kafka absorbs the write burst (it is designed for exactly this). The bot filter consumer group can be scaled horizontally within minutes by adding pods. The Redis shards handle 200K writes/sec total (1,562/shard) comfortably. The only thing that needs adjustment is the read cache TTL - trending detection sets the TTL to 5 seconds when velocity exceeds a threshold, so the displayed count updates faster during the spike.

Q: Why not use a CRDT counter for multi-datacenter active-active writes?

A: CRDTs (like a G-Counter) are the right tool when you need active-active writes across geographic regions without a single coordination point. For YouTube’s view counting, the write path funnels through Kafka, which already provides ordering within a partition. CRDTs add merge complexity and require more network coordination on reads. The sharded Redis approach with regional Kafka clusters and cross-regional reconciliation is simpler to operate. If active-active multi-region writes with zero coordination were a hard requirement, CRDTs would be the right choice.

Q: What happens to the view count if the reconciliation job runs late and Redis TTLs expire?

A: This is the most dangerous failure mode. If Redis shard keys expire before reconciliation writes to Postgres, those views are permanently lost. The mitigation is defense in depth: (1) set shard TTLs to 7 days, much longer than the 1-hour reconciliation interval; (2) alert loudly when reconciliation last-run time exceeds 2 hours; (3) keep the reconciliation_log table which stores delta per reconciliation run, allowing reconstruction of the approximate total even after a partial data loss.

Interview Questions

Q: Design the counter architecture for a video that suddenly goes viral and receives 500,000 views per minute. How does your system behave?

Expected depth: Discuss how shard count determines write capacity (each shard handles ~1500 ops/sec at saturation, so 128 shards = ~192K/sec). Explain that 500K/min = 8,333/sec exceeds current capacity. Solutions: pre-shard to 512 or 1024, or dynamically add shards (requires key migration). Discuss Kafka partition saturation separately. Note that read cache TTL should drop to 5s for trending videos.

Q: How would you modify this system to guarantee exactly-once view counting - no view ever counted twice or missed?

Expected depth: Explain the tension between exactly-once and performance. Discuss idempotency keys per session as the primary dedup mechanism. Explain that the Kafka at-least-once delivery model means aggregation must handle duplicate events from consumer retries. Propose transactional outbox or Kafka exactly-once semantics (EOS) with enable.idempotence=true on the producer. Note that HyperLogLog for unique viewers is inherently approximate - exact unique counting requires a distributed set which is expensive.

Q: A creator claims their video has 1 million views but their analytics shows only 600,000 monetizable views. How would you debug this discrepancy?

Expected depth: Walk through the audit trail: compare video_view_counts.view_count (total including bots post-reconciliation) with rejected_view_events count for the same video and time range. Explain that ~40% rejection rate for a new creator is suspicious and might indicate a bot attack or an overly aggressive filter. Check the bot filter’s signal breakdown for that video’s rejected events. Discuss the shard_epoch field which allows correlating Postgres writes to specific reconciliation runs.

Q: How would you implement a “view milestone” notification - alerting a creator when their video crosses 1M, 10M, 100M views?

Expected depth: The naive approach (check count on every reconciliation) is quadratic in videos. Better: publish count updates to a separate Kafka topic after reconciliation. A milestone checker service consumes this topic and does threshold comparisons. Use Redis sorted sets to track which milestones have been triggered per video to avoid duplicate notifications. Discuss edge cases: count decreasing (bot cleanup), milestone detected during a bulk reconciliation backfill.

Q: The bot filter is incorrectly rejecting 15% of legitimate views from a new country where mobile networks share a small CIDR. How would you fix it without redeploying?

Expected depth: The IP velocity signal is the culprit - carrier NAT concentrates many users behind one IP. Solutions: (1) expose a feature flag API that adjusts per-signal weights in real time; (2) add a CIDR allowlist that bypasses the velocity check for known carrier NAT ranges; (3) reduce the weight of IP velocity for mobile UAs. Discuss that score adjustments should be A/B tested against an audit sample to verify the false positive rate drops without letting new bot traffic through.

Premium Content

Unlock the full article along with everything else in the archive — all in one place.

In-depth analysis Expert insights Full archive access

Unlock Full Article