Build YouTube's View Count System
YouTube’s View Count System
Counting billions of views accurately in near real-time without slowing video delivery or letting bots inflate the numbers
Imagine running a gas station with a broken pump display. You need to show customers exactly how much gas is flowing - but the sensor inside the pump takes a reading only once a minute and the display lags by another 30 seconds. Meanwhile, there’s a guy with a garden hose pretending to fill up without paying. That is precisely the problem with view counts at YouTube scale: you want an accurate number fast, and you need to filter out fraud, but the mechanism that serves the number must never slow down the thing people actually care about - watching the video.
YouTube ingests roughly 5 billion views per day, which averages about 58,000 view events per second, with viral spikes well above that. A naive counter - increment a single row in Postgres every time someone watches - would melt under this load: every update to a hot video contends for the same row lock, and at tens of thousands of writes per second that row becomes the hottest spot in the database. Even Redis, which is fast, tops out at somewhere around 100K-200K simple writes per second on a single instance before its command thread saturates one CPU core, and a single hot key cannot be spread across instances. More critically, a single counter has no isolation: if the counting service restarts, you lose uncounted views. If the bot detection step is synchronous, it blocks video delivery. If you query the raw counter directly from video pages, count reads add latency to the most traffic-heavy path in the system.
The real challenge is that four concerns pull in opposite directions. Accuracy requires counting every real view exactly once. Latency requires never touching the counting path from the video serve path. Freshness requires counts that update in seconds for trending content. Fraud resistance requires discarding bot traffic without allowing bots to cause legitimate views to be lost. Getting all four right simultaneously is the engineering problem.
We need to solve for write throughput through counter sharding, approximate deduplication via HyperLogLog, asynchronous bot filtering off the critical path, and eventual consistency between in-memory counters and the durable source of truth. Let’s build it.
Requirements and Constraints
Functional Requirements
- Track a view when a user watches a video for at least 30 seconds (the commonly cited threshold for a YouTube view)
- Display the current view count on the video watch page, updated within 30 seconds for normal videos
- For trending videos (top 0.1% by current velocity), update the displayed count within 5 seconds
- Resist inflation from bots, page refreshes within short windows, and automated crawlers
- Provide an admin API to query raw, unfiltered event counts for abuse investigation
- Support a “unique viewers” metric distinct from total play count (one viewer watching 10 times = 10 plays, 1 unique)
Non-Functional Requirements
- Write throughput: 58,000 view events per second sustained, 200,000 at peak (viral video moments)
- Read latency: View count reads on video pages must complete in under 10ms (P99)
- Count freshness: Less than 30 seconds lag for standard videos; less than 5 seconds for trending
- Durability: No view event lost once acknowledged by the edge layer, even if internal services crash
- Availability: 99.99% - view counting outages cause visible stale numbers, which erode creator trust
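- Storage: ~800 million active videos; sharded counters come to roughly 16GB of live data in Redis (about 820GB if every shard of every video were populated, before per-key overhead), and the durable counts take on the order of 160GB in Postgres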
Constraints and Assumptions
- The video serve path (CDN delivery) must be completely decoupled from counting; a counting outage must not affect playback
- View events enter the pipeline as fire-and-forget beacons relayed by the CDN edge; the browser never talks to the counting services directly
- We accept eventual consistency: the displayed count can be up to 30 seconds stale in the normal case
- Bot filtering runs asynchronously - some bot views may be counted briefly before the filter catches up, but reconciliation corrects this within hours
- “View” definition is a product decision fixed upstream; we design for that threshold, not for detecting it
High-Level Architecture
The system splits cleanly into a write path and a read path, with an async reconciliation job connecting them to the durable store.
The write path starts at the CDN edge. When the player’s 30-second watch threshold triggers, the client fires a beacon to the nearest CDN edge node. The edge node hands the event to its Kafka producer (the view_events topic, 30 partitions) and immediately returns a 200 - it never touches a database and does not wait for the broker. Kafka supplies the durability: once the partition leader acknowledges the write, the event sits in Kafka's log, from which downstream consumers can read and replay it.
The Bot Filter Service consumes from the view_events topic, scores each event using IP velocity, user-agent signals, watch duration, and replay rate, and emits valid events onto a filtered clean_view_events topic. The View Aggregation Service consumes from clean_view_events, batches events in 500ms windows, and writes increments to Redis Counter Shards and HyperLogLog registers.
The read path is completely separate. The Count Read API serves GET /count/{video_id} requests from video pages. It reads from a Redis read cache with a 30-second TTL (5 seconds for trending videos). On the hot path it never touches the write path’s shards; only a cache miss falls back to summing them.
An hourly Reconciliation Job drains the shard counters for each video and adds the total to a canonical count in Postgres, which becomes the source of truth for historical reporting, analytics, and the fallback when Redis is unavailable.
The video serve path and the counting path share zero infrastructure. A complete failure of the counting system shows stale numbers - it does not drop a single video frame. Decoupling them with Kafka as the handoff point is the architectural decision everything else depends on.
The Ingestion Pipeline
The ingestion pipeline’s job is to durably capture every view event off the critical path, at any volume, without acknowledgment delays.
The CDN edge node does exactly two things when a view event arrives: enqueue to Kafka and return 200. The Kafka producer is configured with acks=1 (leader acknowledgment only) and linger.ms=5 to batch events from concurrent watchers into fewer produce requests. The tradeoff: acks=1 risks losing an event if the Kafka leader crashes before replication, and because the handler does not wait for the acknowledgment, events still sitting in the producer's buffer are also lost if the edge process dies. This deliberately relaxes the strict durability requirement. At YouTube scale that is acceptable - losing 0.001% of events in a node crash scenario is better than adding latency to 100% of video deliveries.
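# View event beacon handler on the CDN edge (Python/FastAPI sketch)
# Demonstrates: fire-and-forget event submission, no database touch
import hashlib
import time

import msgpack
from aiokafka import AIOKafkaProducer

producer: AIOKafkaProducer | None = None


async def start_producer() -> None:
    """Create and start the producer once, from the app's async startup hook."""
    global producer
    producer = AIOKafkaProducer(
        bootstrap_servers="kafka-cluster:9092",
        acks=1,
        linger_ms=5,
        compression_type="lz4",
    )
    await producer.start()


def generate_session_id(viewer_id: str, video_id: str) -> str:
    # Same scheme as the deduplication section: viewer + video + 30-minute window
    window = int(time.time()) // 1800
    return hashlib.sha256(f"{viewer_id}:{video_id}:{window}".encode()).hexdigest()[:16]


async def handle_view_event(video_id: str, viewer_id: str, ip: str, ua: str):
    event = {
        "video_id": video_id,
        "viewer_id": viewer_id,
        "ip": ip,
        "ua": ua,
        "ts": int(time.time() * 1000),
        "session_id": generate_session_id(viewer_id, video_id),
    }
    # Partition by video_id so events for the same video land on the same partition
    key = video_id.encode()
    # send() returns once the record is queued in the producer's batch buffer;
    # the delivery future it returns is deliberately not awaited.
    await producer.send("view_events", value=msgpack.packb(event), key=key)
    return {"status": "ok"}  # never block on Kafka confirmation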
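Partitioning by video_id means all events for a given video land on the same Kafka partition, which preserves ordering and makes downstream deduplication simpler. With 30 partitions and 800 million videos, that is roughly 27 million videos per partition - well within Kafka’s limits. The cost is skew: a single viral video sends its entire burst to one partition, so that partition and its consumer must be sized for the hottest video rather than the average one.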
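To make the key-to-partition mapping concrete, here is a minimal sketch. It assumes nothing about Kafka's real partitioner: `toy_partition` is a hypothetical stand-in that uses MD5 purely for illustration, while actual Kafka clients apply their own hash function. It only shows the two properties discussed above: the same key always maps to the same partition, and a single hot key concentrates its load on one partition.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 30  # matches the view_events topic in this design


def toy_partition(video_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (illustrative only)."""
    digest = hashlib.md5(video_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def partition_load(video_ids: list[str]) -> Counter:
    """Count how many events each partition would receive."""
    return Counter(toy_partition(v) for v in video_ids)


# 1,000 events spread over 1,000 videos vs. 1,000 events for one viral video
spread = partition_load([f"video-{i}" for i in range(1000)])
viral = partition_load(["viral-video"] * 1000)
```

In the `viral` case every event lands on one partition, which is the skew that key-based partitioning accepts in exchange for per-video ordering.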
YouTube’s actual view counting uses a similar fire-and-forget beacon approach. Early versions counted views synchronously during video serve, which caused counting to slow video delivery under load. The decoupling to async beacons was one of the earliest architectural changes at scale.
The Bot Filter Service
The bot filter’s job is to discard fraudulent view events before they reach the aggregation layer, using a scoring pipeline that takes about a millisecond per event.
Think of the bot filter as a customs agent at an airport. Most travelers wave through in seconds because their documents are obviously valid. A small fraction get pulled aside for a longer check. A tiny fraction get rejected entirely. The scoring is the same: most events score low and pass immediately; suspicious ones get inspected; clear bots get dropped.
Each event is evaluated on five signal categories:
- IP velocity: how many view events have come from this IP in the last 60 seconds, tracked in Redis as a per-minute counter keyed by (ip, minute_bucket). Datacenter CIDRs and known VPN ASNs get a flat penalty.
- Watch duration ratio: the ratio of watch time to video length. A 10-minute video watched for exactly 31 seconds on 200 different requests from the same IP is a strong bot signal.
- User-agent quality: presence of known headless browser fingerprints (HeadlessChrome, python-requests), missing Accept-Language headers, or UA strings that rotate suspiciously.
- Replay rate: has this (viewer_id, video_id) pair been seen in the last 30 minutes? Legitimate users rarely watch the same video 10 times in 30 minutes.
- TLS fingerprint: the JA3 hash of the TLS handshake identifies the client library. Automated tools have distinctive JA3 fingerprints even when they spoof the User-Agent.
# Bot scoring: weighted feature evaluation
# Demonstrates: signal extraction and score composition
# Assumes the edge enriches each event with watch_secs, video_duration_secs and ja3.
import logging
import time

import redis

r = redis.Redis(host="redis-bot-signals", decode_responses=True)
log = logging.getLogger("bot_filter")

WEIGHTS = {
    "ip_velocity": 0.35,
    "watch_ratio": 0.25,
    "ua_quality": 0.20,
    "replay_rate": 0.15,
    "tls_fp": 0.05,
}
HEADLESS_SIGNALS = ["HeadlessChrome", "python-requests", "curl/", "Wget/", "Go-http-client"]
# Illustrative values only (the second is not a full 32-character JA3 hash);
# a real deployment would load fingerprints from a maintained list.
KNOWN_BOT_JA3 = {"a0e9f5d64349fb13191bc781f81f42e1", "b32309a26951912be7dba376398d89"}


def score_event(event: dict) -> float:
    scores = {}

    # IP velocity signal: fixed one-minute bucket per IP
    ip_key = f"ipvel:{event['ip']}:{int(time.time()) // 60}"
    count = r.incr(ip_key)
    r.expire(ip_key, 120)
    scores["ip_velocity"] = min(1.0, count / 200.0)  # 200 req/min = max score

    # Watch duration ratio (clamp the duration so a 0 or missing value cannot divide by zero)
    watch_secs = event.get("watch_secs", 0)
    video_secs = max(event.get("video_duration_secs", 1), 1)
    ratio = watch_secs / video_secs
    # Suspicious: less than 5% watched, or always exactly at the threshold (31s)
    if ratio < 0.05:
        scores["watch_ratio"] = 1.0
    elif abs(ratio - 31 / video_secs) < 0.01:
        scores["watch_ratio"] = 0.6
    else:
        scores["watch_ratio"] = 0.0

    # UA quality
    ua = event.get("ua", "")
    scores["ua_quality"] = 1.0 if any(h in ua for h in HEADLESS_SIGNALS) else 0.0

    # Replay rate
    replay_key = f"replay:{event['viewer_id']}:{event['video_id']}"
    seen = r.get(replay_key)
    r.setex(replay_key, 1800, "1")  # 30-minute window
    scores["replay_rate"] = 0.8 if seen else 0.0

    # TLS fingerprint (passed in by edge)
    scores["tls_fp"] = 1.0 if event.get("ja3") in KNOWN_BOT_JA3 else 0.0

    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)


def log_rejected(event: dict, score: float) -> None:
    # Stand-in for the audit sink (the rejected_view_events table in the data model)
    log.warning("rejected view event video=%s score=%.2f", event.get("video_id"), score)


def filter_event(event: dict) -> bool:
    score = score_event(event)
    if score > 0.7:
        log_rejected(event, score)
        return False
    return True
Events with a score above 0.7 are rejected and written to an audit log for retrospective analysis. Events between 0.4 and 0.7 are soft-flagged and sampled for ML model retraining. Everything below 0.4 passes without further inspection.
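The three score bands described above can be written as a small classifier. This is a sketch rather than the filter's actual code: the `Verdict` enum and `classify` function are hypothetical names, and treating scores of exactly 0.4 and exactly 0.7 as soft-flagged is an assumption, since the text only specifies "above 0.7" and "below 0.4".

```python
from enum import Enum

REJECT_THRESHOLD = 0.7     # scores above this are dropped and audited
SOFT_FLAG_THRESHOLD = 0.4  # scores from here up to 0.7 are sampled for retraining


class Verdict(Enum):
    PASS = "pass"
    SOFT_FLAG = "soft_flag"
    REJECT = "reject"


def classify(score: float) -> Verdict:
    """Map a bot score in [0, 1] to one of the three handling bands."""
    if score > REJECT_THRESHOLD:
        return Verdict.REJECT
    if score >= SOFT_FLAG_THRESHOLD:
        return Verdict.SOFT_FLAG  # still counted as a view, but flagged
    return Verdict.PASS
```

Soft-flagged events still pass the filter here, which matches `filter_event` rejecting only scores above 0.7.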
A synchronous bot filter on the write path is a correctness trap: if the filter is slow, it stalls ingestion and views pile up unprocessed. The filter must run as an async consumer group, not inline with event ingestion. If it falls behind, the displayed counts lag but ingestion is unaffected, and any bot views that slip past the scorer are corrected later by reconciliation.
The Aggregation Service and Counter Sharding
The aggregation service’s job is to translate a stream of individual view events into atomic counter increments with minimal write amplification.
Counter sharding is the key mechanism here. A single Redis key for a viral video would receive tens of thousands of INCR operations per second during peak views. Redis executes commands on a single thread per instance, and a key lives on exactly one instance, so a hot key is bound by one CPU core rather than by memory bandwidth. The solution is to shard the counter: instead of one key view_count:{video_id}, we maintain 128 keys view_count:{video_id}:{shard}, which can be spread across instances. Each write goes to exactly one shard, selected by hash(event_id) % 128. The live count is the sum of all 128 shards, plus whatever reconciliation has already drained to Postgres.
The aggregation service runs 50 parallel workers, each consuming a slice of Kafka partitions. Each worker maintains a local 500ms batch buffer. At the end of each batch window, it emits one INCRBY per touched shard key - not one INCR per event. This batching is critical: if a single hot video received the full 58K events/sec, a 500ms batch would turn 29,000 individual INCR calls into at most 128 INCRBY calls (one per shard), roughly a 227x reduction in Redis commands.
# Aggregation worker: batching + sharded INCR
# Demonstrates: batch window, shard selection, pipeline write
import asyncio
import hashlib

import redis.asyncio as aioredis

SHARD_COUNT = 128
BATCH_WINDOW_MS = 500


class AggregationWorker:
    def __init__(self, redis_client: aioredis.Redis):
        self.r = redis_client
        self.buffer: dict[str, int] = {}  # key -> increment count

    def get_shard_key(self, video_id: str, event_id: str) -> str:
        shard = int(hashlib.md5(event_id.encode()).hexdigest(), 16) % SHARD_COUNT
        return f"vc:{video_id}:{shard}"

    def add_event(self, video_id: str, event_id: str):
        key = self.get_shard_key(video_id, event_id)
        self.buffer[key] = self.buffer.get(key, 0) + 1

    async def flush(self):
        if not self.buffer:
            return
        pipe = self.r.pipeline(transaction=False)
        for key, count in self.buffer.items():
            pipe.incrby(key, count)
            pipe.expire(key, 86400 * 7)  # 7-day TTL; reconciliation persists before expiry
        await pipe.execute()
        self.buffer.clear()

    async def run(self, kafka_consumer):
        # Assumes an aiokafka AIOKafkaConsumer whose value_deserializer
        # decodes the msgpack payload into a dict.
        loop = asyncio.get_running_loop()
        while True:
            deadline = loop.time() + BATCH_WINDOW_MS / 1000
            # Poll with a timeout so the window closes even when no events arrive;
            # a bare `async for` would block on an idle topic and delay the flush.
            while (remaining := deadline - loop.time()) > 0:
                batches = await kafka_consumer.getmany(timeout_ms=int(remaining * 1000))
                for records in batches.values():
                    for msg in records:
                        self.add_event(msg.value["video_id"], msg.value["session_id"])
            await self.flush()
Reading the total count requires summing all 128 shards. This is done with a single MGET of all 128 keys (one round trip) and takes roughly 2-3ms. The count is then cached in a separate read-layer key with a 30-second TTL, so the 128-key aggregation runs at most once every 30 seconds per video, not on every page load.
Sharding the counter trades read complexity (sum 128 keys instead of GET 1) for write scalability (128 independent hot keys instead of 1 bottleneck). The read cost is paid once every TTL window and cached - the write benefit is continuous and proportional to the number of shards.
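The effect of batching on command volume can be checked without Redis. The following sketch reuses the shard-key scheme from the worker above and simply counts how many distinct shard keys a batch touches; `batch_increments` is a hypothetical helper, and the event IDs are synthetic.

```python
import hashlib
from collections import Counter

SHARD_COUNT = 128


def shard_key(video_id: str, event_id: str) -> str:
    """Same shard selection as the aggregation worker: md5(event_id) mod 128."""
    shard = int(hashlib.md5(event_id.encode()).hexdigest(), 16) % SHARD_COUNT
    return f"vc:{video_id}:{shard}"


def batch_increments(events: list[tuple[str, str]]) -> Counter:
    """Collapse (video_id, event_id) pairs into one increment per shard key."""
    buffer: Counter = Counter()
    for video_id, event_id in events:
        buffer[shard_key(video_id, event_id)] += 1
    return buffer


# 29,000 events for one hot video: about one 500ms window at 58K events/sec
events = [("hot-video", f"session-{i}") for i in range(29_000)]
increments = batch_increments(events)
redis_calls = len(increments)             # INCRBY commands after batching
total_counted = sum(increments.values())  # views represented by those commands
```

However many events arrive in the window, a single video can never produce more than 128 INCRBY commands per flush, while the sum of the increments still equals the number of events.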
Approximate Counting with HyperLogLog
Alongside the raw play count, the system tracks unique viewers - a fundamentally different problem. Counting unique viewers requires deduplication: if viewer A watches video X ten times, she contributes 1 to the unique count, not 10.
Exact deduplication at scale requires storing a set of all viewer IDs per video - at hundreds of millions of viewers per viral video, this is gigabytes of state per video, just for deduplication. HyperLogLog (HLL) solves this with a probabilistic data structure that estimates set cardinality using 12KB of memory regardless of the set size, with a standard error of 0.81%.
Redis implements HLL natively. Adding a viewer is PFADD unique_viewers:{video_id} {viewer_id} (O(1)). Querying the unique count is PFCOUNT unique_viewers:{video_id} (O(1)). The memory overhead is constant and tiny.
# HyperLogLog unique viewer tracking
# Demonstrates: PFADD for dedup, PFCOUNT for cardinality estimate
import redis

r = redis.Redis()


def record_view(video_id: str, viewer_id: str) -> bool:
    """Returns True if this is a new unique viewer for this video."""
    hll_key = f"unique_viewers:{video_id}"
    # PFADD returns 1 if the internal HLL changed (likely a new unique element)
    # Returns 0 if the element was probably already present
    is_new = r.pfadd(hll_key, viewer_id)
    r.expire(hll_key, 86400 * 30)  # 30-day rolling window
    return bool(is_new)


def get_unique_viewers(video_id: str) -> int:
    """Returns approximate unique viewer count with ~0.81% standard error."""
    return r.pfcount(f"unique_viewers:{video_id}")


def merge_unique_viewers(video_ids: list[str]) -> int:
    """Get unique viewers across a set of videos (e.g., for a channel total)."""
    keys = [f"unique_viewers:{vid}" for vid in video_ids]
    return r.pfcount(*keys)  # Redis merges HLLs on the fly for cross-key PFCOUNT
The 0.81% figure is a standard error, so a video with 10 million unique viewers will typically be reported within about ±81,000 of the truth (for example 10,081,000 or 9,919,000), and occasionally a little further off. This is perfectly acceptable for display purposes - YouTube already rounds large counts (“10M views” rather than “10,081,337 views”).
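The 0.81% figure follows from the standard HyperLogLog error formula, 1.04 / sqrt(m), with m = 16,384 registers. A short sketch of that arithmetic (the function names are illustrative):

```python
import math


def hll_standard_error(num_registers: int) -> float:
    """Relative standard error of a HyperLogLog with the given register count."""
    return 1.04 / math.sqrt(num_registers)


def one_sigma_band(true_count: int, num_registers: int = 16384) -> tuple[int, int]:
    """Range covering one standard error either side of the true cardinality."""
    delta = round(true_count * hll_standard_error(num_registers))
    return true_count - delta, true_count + delta


error = hll_standard_error(16384)        # 1.04 / 128 = 0.008125, i.e. ~0.81%
low, high = one_sigma_band(10_000_000)   # one-sigma band around 10M viewers
```

Because this is a one-sigma band, individual estimates will sometimes fall outside it; quadrupling the register count would halve the error at four times the memory.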
Redis’s PFADD and PFCOUNT are used by Twitter for tracking unique tweet impressions, by Cloudflare for unique IP counting per zone, and by many ad-tech systems for reach frequency capping. The 12KB-per-counter memory cost makes HLL viable even when tracking billions of unique elements.
The Read Path and Eventual Consistency Display
The count read path’s job is to serve sub-10ms count lookups to video pages without touching the write path’s counter shards on every request.
The read architecture uses a two-tier cache. The shard aggregation cache is a single Redis key cached_count:{video_id} that stores the sum of all 128 shards. A background process (the “count refresher”) updates this key on a TTL schedule: every 30 seconds for standard videos, every 5 seconds for trending videos.
Video pages read from cached_count:{video_id}. If the key is missing (cold start, or TTL expired and the refresher hasn’t run yet), they fall back to summing the shards directly and adding the reconciled base from Postgres, because the hourly reconciliation drains shard totals into Postgres. If the shards are empty (very cold video or Redis cold restart), the Postgres count alone is the answer.
# Count read API with tiered fallback
# Demonstrates: read path decoupling from write path, tiered fallback
import psycopg2
import redis

r = redis.Redis()
PG_DSN = "dbname=views"  # placeholder DSN; supply real connection settings
TRENDING_TTL = 5
STANDARD_TTL = 30
SHARD_COUNT = 128


def is_trending_video(video_id: str) -> bool:
    # Placeholder: a real check would consult video_metadata.is_trending
    return False


def get_view_count(video_id: str) -> dict:
    # Tier 1: read-layer cache (fastest path)
    cached = r.get(f"cached_count:{video_id}")
    if cached is not None:
        return {"count": int(cached), "source": "cache", "lag_seconds": STANDARD_TTL}

    # Tier 2: sum shards directly. Shards hold only the views since the last
    # reconciliation, so the reconciled base from Postgres is added on top.
    shard_keys = [f"vc:{video_id}:{s}" for s in range(SHARD_COUNT)]
    values = r.mget(shard_keys)
    shard_sum = sum(int(v) for v in values if v)
    if shard_sum > 0:
        total = fetch_count_from_postgres(video_id) + shard_sum
        # Backfill the read cache for next request
        ttl = TRENDING_TTL if is_trending_video(video_id) else STANDARD_TTL
        r.setex(f"cached_count:{video_id}", ttl, total)
        return {"count": total, "source": "shards", "lag_seconds": 0}

    # Tier 3: Postgres (source of truth, for cold/old videos)
    count = fetch_count_from_postgres(video_id)
    return {"count": count, "source": "postgres", "lag_seconds": 3600}


def fetch_count_from_postgres(video_id: str) -> int:
    # psycopg2's connection context manager ends the transaction but does not
    # close the connection, so close it explicitly (or use a connection pool).
    conn = psycopg2.connect(dsn=PG_DSN)
    try:
        with conn, conn.cursor() as cur:
            cur.execute(
                "SELECT view_count FROM video_view_counts WHERE video_id = %s",
                (video_id,),
            )
            row = cur.fetchone()
            return row[0] if row else 0
    finally:
        conn.close()
The eventual consistency display means users will sometimes see a count that is up to 30 seconds behind. This is a deliberate product decision. YouTube has trained users to understand that view counts are approximate and lag slightly. The alternative - a strongly consistent count that requires a distributed transaction on every view - would add hundreds of milliseconds to every video load.
If the read cache TTL and the shard update frequency are both 30 seconds but are not synchronized, a viewer could see a count that lags up to 60 seconds (cache expires at second 29, shards were last updated at second 1). Stagger the TTL slightly below the update interval to avoid compounding lag.
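The read path depends on knowing which videos are trending, since that decides between the 5-second and 30-second TTL. The requirements define trending as the top 0.1% of videos by current velocity. One way to sketch that selection is shown below; `pick_trending` and `cache_ttl` are hypothetical helpers, and the velocity map is assumed to come from some upstream rate tracker that this article does not specify.

```python
import heapq
import math


def pick_trending(velocity: dict[str, float], fraction: float = 0.001) -> set[str]:
    """Return the top `fraction` of videos by views-per-second velocity."""
    if not velocity:
        return set()
    k = max(1, math.ceil(len(velocity) * fraction))
    top = heapq.nlargest(k, velocity.items(), key=lambda kv: kv[1])
    return {video_id for video_id, _ in top}


def cache_ttl(video_id: str, trending: set[str],
              trending_ttl: int = 5, standard_ttl: int = 30) -> int:
    """Pick the read-cache TTL in seconds for a video."""
    return trending_ttl if video_id in trending else standard_ttl
```

`heapq.nlargest` avoids sorting the whole map, which matters when the candidate set is large and only a tiny top slice is needed.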
Periodic Reconciliation
The reconciliation job’s job is to periodically drain the in-memory shard counters into Postgres, resetting the source of truth and enabling shard key cleanup.
Every hour, the reconciliation job runs as a scheduled task. For each video it atomically reads and zeroes all 128 shard counters with a Lua script (avoiding a TOCTOU race), then adds the returned total to the Postgres video_view_counts.view_count column using an upsert. Because the shards are already zeroed at that point, a failed Postgres write must be retried with the same delta (or the delta added back to a shard) rather than dropped.
-- Schema: view counting source of truth
-- Demonstrates: upsert pattern, audit trail, partitioning strategy
CREATE TABLE video_view_counts (
video_id TEXT NOT NULL,
view_count BIGINT NOT NULL DEFAULT 0,
unique_viewers BIGINT NOT NULL DEFAULT 0,
last_reconciled TIMESTAMPTZ NOT NULL DEFAULT NOW(),
shard_epoch BIGINT NOT NULL DEFAULT 0,
PRIMARY KEY (video_id)
);
-- Separate reconciliation audit table for debugging
CREATE TABLE reconciliation_log (
id BIGSERIAL PRIMARY KEY,
video_id TEXT NOT NULL,
delta BIGINT NOT NULL,
shard_snapshot JSONB, -- stores {shard_id: count} for debugging
reconciled_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX ON reconciliation_log (video_id, reconciled_at DESC);
CREATE INDEX ON video_view_counts (view_count DESC); -- for "most viewed" queries
The shard reset uses a Lua script to ensure atomicity. A naive MGET followed by SET 0 would silently erase any views that arrive between the two commands, because the reset would overwrite increments the read never saw:
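-- Lua script: atomic read-and-reset of all shards for a video
-- Demonstrates: Lua atomicity in Redis, TOCTOU prevention
-- Note: key names are built from ARGV, which works on a single Redis instance.
-- Redis Cluster expects every key a script touches to be passed via KEYS
-- and to live in the same hash slot.
-- Note: GETSET also clears the key's TTL; the aggregation worker's next
-- EXPIRE restores it.
local prefix = ARGV[1]
local shard_count = tonumber(ARGV[2])
local total = 0
for i = 0, shard_count - 1 do
    local key = prefix .. ":" .. i
    local val = redis.call("GETSET", key, "0")
    if val then
        total = total + tonumber(val)
    end
end
return total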
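The Lua script uses GETSET to atomically swap each shard to zero and return the old value. Redis runs the whole script without interleaving other commands, so no increment can slip in between shard reads; views that arrive after the script finishes land in the freshly zeroed shards and are counted in the next reconciliation cycle - they are not lost.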
Reconciliation does not need to be transactionally consistent across all videos simultaneously - it just needs to eventually produce a correct total for each video independently. Running one video’s reconciliation while another video’s shards keep accumulating is perfectly safe because each video’s shards are independent.
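The invariant behind reconciliation is that the durable count plus the live shard totals always equals the number of views recorded. The following in-memory model illustrates it; it is a single-process sketch with no Redis or Postgres involved, and `CounterState` and its methods are hypothetical names standing in for the real components.

```python
from dataclasses import dataclass, field

SHARD_COUNT = 128


@dataclass
class CounterState:
    durable: int = 0  # stands in for video_view_counts.view_count
    shards: list[int] = field(default_factory=lambda: [0] * SHARD_COUNT)

    def record_views(self, shard: int, n: int) -> None:
        """Stands in for an INCRBY on one shard key."""
        self.shards[shard % SHARD_COUNT] += n

    def drain(self) -> int:
        """Read and zero every shard in one step (the Lua script's role)."""
        total = sum(self.shards)
        self.shards = [0] * SHARD_COUNT
        return total

    def reconcile(self) -> int:
        """Move the drained delta into the durable count."""
        delta = self.drain()
        self.durable += delta  # stands in for the Postgres upsert
        return delta

    def displayed(self) -> int:
        """What a reader should see: durable base plus unreconciled views."""
        return self.durable + sum(self.shards)
```

Views recorded after a reconcile simply accumulate in the zeroed shards, so `displayed()` never goes backwards across a reconciliation boundary.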
Data Model
The data model separates concerns into three tiers: real-time write state (Redis shards), durable aggregate state (Postgres), and transient read state (Redis read cache).
-- Redis key patterns (documented as DDL equivalent)
-- vc:{video_id}:{shard} String INCR target, 7-day TTL
-- unique_viewers:{video_id} HyperLogLog, 30-day TTL
-- cached_count:{video_id} String read cache, 5s or 30s TTL
-- ipvel:{ip}:{minute_bucket} String bot detection velocity, 2-min TTL
-- replay:{viewer_id}:{video_id} String replay detection, 30-min TTL
-- Postgres: durable view count store
CREATE TABLE video_view_counts (
video_id TEXT PRIMARY KEY,
view_count BIGINT NOT NULL DEFAULT 0,
unique_viewers BIGINT NOT NULL DEFAULT 0,
last_reconciled TIMESTAMPTZ NOT NULL DEFAULT NOW(),
shard_epoch BIGINT NOT NULL DEFAULT 0
);
-- Postgres: per-video metadata (written by content pipeline, read-only here)
CREATE TABLE video_metadata (
video_id TEXT PRIMARY KEY,
title TEXT,
channel_id TEXT NOT NULL,
published_at TIMESTAMPTZ,
duration_secs INT,
is_trending BOOLEAN NOT NULL DEFAULT FALSE
);
-- Postgres: bot rejection audit
CREATE TABLE rejected_view_events (
id BIGSERIAL PRIMARY KEY,
video_id TEXT NOT NULL,
ip INET,
viewer_id TEXT,
bot_score FLOAT NOT NULL,
reject_reason TEXT,
event_ts TIMESTAMPTZ NOT NULL,
logged_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX ON rejected_view_events (video_id, event_ts DESC);
CREATE INDEX ON rejected_view_events (ip, event_ts DESC);
Partitioning strategy: video_view_counts does not need partitioning - 800 million narrow rows come to roughly 160GB at ~200 bytes per row including overhead, which a single node with SSDs handles comfortably. rejected_view_events is time-partitioned by logged_at monthly, since it grows without bound and old rejections are rarely queried.
Sharding key choice: Redis shard keys use hash(session_id) % 128 rather than hash(viewer_id) % 128. Because session_id is derived from the viewer, the video, and the 30-minute window, the same viewer's views spread across shards as they watch different videos or return in later windows, rather than always hitting the same shard. This distributes write load more evenly.
Key Algorithms and Protocols
HyperLogLog Cardinality Estimation
HyperLogLog works by hashing each element to a binary string and observing the position of the leftmost 1-bit. Hashes that start with many leading zeros are rare, so seeing one suggests a large set. A single maximum is a noisy estimator, so HLL uses the first few bits of the hash to route each element to one of m buckets (called registers), tracks the maximum leading-zero position per register, and combines the registers with a harmonic mean to estimate cardinality with remarkable accuracy.
# HyperLogLog register update (conceptual implementation)
# Demonstrates: the hash-and-observe mechanism behind PFADD
import hashlib
import math


def hll_add(registers: list[int], element: str) -> bool:
    num_registers = len(registers)  # must be a power of two
    # Hash the element to a 64-bit integer
    h = int(hashlib.sha256(element.encode()).hexdigest(), 16) & 0xFFFFFFFFFFFFFFFF
    # Use first log2(num_registers) bits to select the register
    register_bits = int(math.log2(num_registers))
    register_idx = h >> (64 - register_bits)
    # Count leading zeros in the remaining bits
    remaining_bits = 64 - register_bits
    remaining = h & ((1 << remaining_bits) - 1)
    leading_zeros = remaining_bits - remaining.bit_length()
    # Update register if we've seen more leading zeros
    rank = leading_zeros + 1
    if rank > registers[register_idx]:
        registers[register_idx] = rank
        return True  # register changed = likely new unique element
    return False


def hll_count(registers: list[int]) -> int:
    m = len(registers)
    # Harmonic mean of 2^(-M[j]) across all registers
    z = sum(2.0 ** (-r) for r in registers)
    # Bias-correction constant; the closed form only holds for m >= 128
    alpha = {16: 0.673, 32: 0.697, 64: 0.709}.get(m, 0.7213 / (1 + 1.079 / m))
    estimate = alpha * m * m / z
    # Small-range correction (linear counting) while some registers are still empty
    zeros = registers.count(0)
    if estimate <= 2.5 * m and zeros:
        estimate = m * math.log(m / zeros)
    return int(estimate)
Time complexity: O(1) per add, O(m) per count where m is the number of registers. Redis always uses m = 16,384 registers of 6 bits each, which is where the 12KB figure comes from; its sparse encoding is only a compressed representation for small sets. Space: O(m).
The property that makes HLL correct at scale is that the maximum leading-zero count across millions of elements converges to a stable estimate of log2(n). Adding an element that is already in the set does not change any maximum - the operation is idempotent, which gives HLL its deduplication property for free.
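Idempotence and mergeability can both be demonstrated with a compact, self-contained HyperLogLog. This is an illustrative sketch with 1,024 registers rather than Redis's 16,384, so its expected error is about 3% instead of 0.81%; `add`, `estimate`, and `merge` are names chosen for this example, not Redis APIs.

```python
import hashlib
import math

P = 10      # bits used to pick a register
M = 1 << P  # 1,024 registers (Redis uses 16,384)


def _hash64(element: str) -> int:
    return int.from_bytes(hashlib.sha256(element.encode()).digest()[:8], "big")


def add(registers: list[int], element: str) -> None:
    h = _hash64(element)
    idx = h >> (64 - P)
    rest = h & ((1 << (64 - P)) - 1)
    rank = (64 - P) - rest.bit_length() + 1  # position of the leftmost 1-bit
    registers[idx] = max(registers[idx], rank)


def estimate(registers: list[int]) -> float:
    m = len(registers)
    alpha = 0.7213 / (1 + 1.079 / m)         # valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:             # small-range correction
        return m * math.log(m / zeros)
    return raw


def merge(a: list[int], b: list[int]) -> list[int]:
    """Union of two sketches: take the per-register maximum."""
    return [max(x, y) for x, y in zip(a, b)]


a, b = [0] * M, [0] * M
for i in range(10_000):
    add(a, f"viewer-{i}")
snapshot = list(a)
for i in range(10_000):                      # replay the same viewers
    add(a, f"viewer-{i}")
for i in range(5_000, 15_000):               # overlaps half of a's viewers
    add(b, f"viewer-{i}")
union = merge(a, b)                          # about 15,000 distinct viewers
```

Replaying the same viewers leaves the registers untouched, and merging two overlapping sketches estimates the size of the union rather than the sum, which is the behavior a cross-key PFCOUNT relies on.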
Sliding Window Velocity Counter
Bot detection uses a sliding window counter per IP address to detect velocity spikes. Fixed-window counters have an edge case: a bot sending 999 requests at 11:59:59 PM and 999 requests at 12:00:01 AM evades a “1000/minute” limit because each falls in a separate window. A true sliding window avoids this.
# Sliding window velocity counter using sorted sets
# Demonstrates: correct sliding window without fixed-window boundary bypass
import time
import uuid

import redis

r = redis.Redis()


def check_and_record_velocity(ip: str, video_id: str, window_secs: int = 60, limit: int = 100) -> bool:
    now_ms = int(time.time() * 1000)
    window_ms = window_secs * 1000
    key = f"vel:{ip}:{video_id}"
    # Members must be unique: two events in the same millisecond would otherwise
    # overwrite each other and be counted once.
    member = f"{now_ms}-{uuid.uuid4().hex}"
    pipe = r.pipeline()
    # Remove events outside the window
    pipe.zremrangebyscore(key, 0, now_ms - window_ms)
    # Count events in the window (before adding this one)
    pipe.zcard(key)
    # Add this event
    pipe.zadd(key, {member: now_ms})
    # Set TTL
    pipe.expire(key, window_secs * 2)
    results = pipe.execute()
    current_count = results[1]
    return current_count < limit  # True = not exceeding limit, False = rate limited
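Space complexity: O(k) per key where k is the count of events in the window. Time: O(log k) per ZADD, plus O(log k + r) for a ZREMRANGEBYSCORE that evicts r expired entries. For most users this is negligible. Note that ZREMRANGEBYSCORE only evicts entries older than the window; it does not cap how many entries a bot can add inside the window, so a hard memory cap needs an explicit trim (for example ZREMRANGEBYRANK) or a refusal to record events once the limit is hit.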
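The boundary bypass is easy to reproduce in memory. The sketch below compares a fixed-window check with a sliding-window check on the burst pattern described earlier (999 requests just before a minute boundary and 999 just after); both functions are illustrative and operate on plain timestamps in seconds rather than on Redis.

```python
from collections import deque


def fixed_window_allows(timestamps: list[float], limit: int, window: float = 60.0) -> bool:
    """True if no fixed window bucket holds more than `limit` events."""
    buckets: dict[int, int] = {}
    for ts in timestamps:
        bucket = int(ts // window)
        buckets[bucket] = buckets.get(bucket, 0) + 1
    return all(count <= limit for count in buckets.values())


def sliding_window_allows(timestamps: list[float], limit: int, window: float = 60.0) -> bool:
    """True if no trailing window of length `window` holds more than `limit` events."""
    recent: deque[float] = deque()
    for ts in sorted(timestamps):
        while recent and recent[0] <= ts - window:
            recent.popleft()  # drop events that fell out of the window
        recent.append(ts)
        if len(recent) > limit:
            return False
    return True


# 999 requests during second 59, then 999 more during second 61
burst = [59.0 + i / 1000 for i in range(999)] + [61.0 + i / 1000 for i in range(999)]
```

With a limit of 1000 per minute, the fixed-window check accepts this burst because each bucket sees only 999 events, while the sliding-window check rejects it because 1,998 events fall within a three-second span.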
Idempotent Session Deduplication
A user who refreshes a video page should not generate two view events. Each event carries a session_id that combines viewer_id, video_id, and the event time floored to a 30-minute boundary. The aggregation service records each session ID as its own Redis key using SET NX and discards any event whose key already exists; the key's TTL only needs to outlive the 30-minute window.
# Session-based deduplication on the aggregation side
# Demonstrates: idempotency key pattern for view dedup
import hashlib
import time

import redis

r = redis.Redis()


def is_duplicate_view(viewer_id: str, video_id: str) -> bool:
    window = int(time.time()) // 1800  # 30-minute window
    session_id = hashlib.sha256(f"{viewer_id}:{video_id}:{window}".encode()).hexdigest()[:16]
    dedup_key = f"dedup:{session_id}"
    # SET NX returns True only the first time
    is_new = r.set(dedup_key, 1, nx=True, ex=3600)
    return not bool(is_new)  # True if duplicate (key already existed)
Scaling and Performance
The system’s scaling story is primarily about write throughput. Reads scale trivially with read replicas and CDN caching. Writes scale through sharding and batching.
Capacity Estimation (peak load):
Input:
5B views/day = 58K views/sec average
Viral spike: 3.4x = 200K views/sec peak
Average event size: 250 bytes (video_id, viewer_id, ip, ua, ts)
Kafka:
200K events/sec * 250 bytes = 50 MB/sec ingress
30 partitions = ~1.7 MB/sec per partition before lz4 compression
7-day retention: 50 MB/sec * 86400 * 7 = ~30 TB
Redis (counter shards):
800M videos * 128 shards/video * 8 bytes = 820 GB (if all shards populated)
In practice: ~2% of videos have active views = ~16 GB live shard data
Peak INCR ops: 200K events/sec / 128 shards = 1,562 ops/shard/sec
Redis can sustain 200K+ ops/sec per node; shards need 1 Redis node
Bot filter:
200K events/sec, ~1ms scoring time per event
Parallelism needed: 200 concurrent goroutines or 20 workers with batching
Postgres:
800M rows * ~200 bytes = ~160 GB (fits on single node with SSDs)
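Reconciliation: 1 write/video/hour across all 800M videos = ~222K upserts/sec (too many for one node)
In practice only the ~2% of active videos change: ~16M upserts/hour = ~4,400 upserts/sec, feasible with batched upserts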
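The estimates above are simple enough to recompute in a few lines, which is a useful sanity check whenever an input changes. This sketch takes its inputs from the figures in this section; the variable names are illustrative, and the sizes ignore replication, compression, and per-key overhead.

```python
SECONDS_PER_DAY = 86_400

# Input
views_per_day = 5_000_000_000
avg_events_per_sec = views_per_day / SECONDS_PER_DAY             # ~57,870
peak_events_per_sec = 200_000
event_bytes = 250
partitions = 30

# Kafka
kafka_ingress_mb_s = peak_events_per_sec * event_bytes / 1e6     # 50 MB/sec
per_partition_mb_s = kafka_ingress_mb_s / partitions             # ~1.67 MB/sec
retention_tb = kafka_ingress_mb_s * SECONDS_PER_DAY * 7 / 1e6    # ~30.24 TB

# Redis counter shards
total_videos = 800_000_000
active_fraction = 0.02
shard_count = 128
all_shards_gb = total_videos * shard_count * 8 / 1e9             # 819.2 GB
live_shards_gb = all_shards_gb * active_fraction                 # ~16.4 GB

# Postgres reconciliation, one upsert per video per hour
upserts_per_sec_all = total_videos / 3600                        # ~222,222
upserts_per_sec_active = total_videos * active_fraction / 3600   # ~4,444

# Bot filter concurrency (arrival rate x service time)
scoring_secs = 0.001
concurrent_scorers = peak_events_per_sec * scoring_secs          # ~200
```

The bot filter figure is Little's law: 200K events per second at 1ms each means roughly 200 events in flight at any moment.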
Bottlenecks and mitigations:
The dominant write bottleneck is Redis. With 128 shards and 50 parallel aggregation workers, peak writes per shard are around 1,600 per second - well within a single Redis node’s capacity. If load doubles, doubling shard count doubles capacity linearly.
The Kafka consumer throughput becomes the bottleneck if bot filtering falls behind. The bot filter consumer group scales horizontally - add more consumer instances up to the partition count (30 partitions = 30 max parallel consumers).
Twitter’s “like count” system faced the same bottleneck. They moved from a single Redis key per tweet to a sharded counter design identical to this one, with the additional optimization of using Redis Cluster with keyslot tags to ensure all shards for a given tweet land on the same cluster node, allowing multi-key pipelines without cross-node hops.
Failure Modes and Recovery
| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| Kafka broker crash | Kafka JMX UnderReplicatedPartitions > 0 | Events queue on producers; with acks=1, events the failed leader had not yet replicated are lost (acks=all avoids this at higher latency) | Kafka controller elects new leader; producers retry automatically |
| Bot filter consumer lag | Consumer group lag > 100K events | Bot views counted temporarily; corrected by reconciliation | Scale up consumer instances; backpressure to Kafka auto-manages |
| Redis shard node crash | Redis Sentinel failover alert | In-flight shard increments lost; ~30 seconds of uncounted views | Redis Sentinel promotes replica; reconciliation fills Postgres gap |
| Reconciliation job failure | Job runs as Kubernetes CronJob; alerts on missing reconciled_at update | Redis shards grow indefinitely; 7-day TTL clears them without Postgres update | Re-run job from last checkpoint; shard counters still valid if TTL not expired |
| Count Read API node crash | Load balancer health check | Requests on that node fail until the load balancer routes around it; other nodes keep serving from cache | Pod restart via Kubernetes; stateless so instant recovery |
| Postgres primary failure | PgBouncer connection errors; monitoring on replication lag | Reconciliation writes pause; read fallback serves stale Postgres | Promote read replica; replay recent reconciliation logs |
The most operationally dangerous failure is Redis key expiry during a reconciliation outage. If the reconciliation job fails for longer than the shard TTL (7 days), the shard data evaporates and the view count delta is lost permanently. Always alert when reconciliation job last-run time exceeds 2 hours.
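The alerting rule above reduces to comparing the age of the last successful reconciliation against two thresholds. A minimal sketch follows, assuming the last-run timestamp is available from the job's own records; `reconciliation_status` and the status strings are hypothetical.

```python
from datetime import datetime, timedelta, timezone

ALERT_AFTER = timedelta(hours=2)  # alert threshold from the text
SHARD_TTL = timedelta(days=7)     # shard keys expire after this long


def reconciliation_status(last_run: datetime, now: datetime) -> str:
    """Classify how stale the last successful reconciliation is."""
    age = now - last_run
    if age >= SHARD_TTL:
        return "data_loss_risk"   # shard keys may already have expired
    if age > ALERT_AFTER:
        return "alert"
    return "ok"


now = datetime(2024, 1, 8, tzinfo=timezone.utc)
status = reconciliation_status(now - timedelta(hours=3), now)
```

Keeping the alert threshold far below the shard TTL leaves days of headroom to repair the job before any counts are at risk.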
Comparison of Approaches
| Approach | Write Latency | Read Latency | Consistency | Fraud Resistance | Best Fit |
|---|---|---|---|---|---|
| Single counter row (Postgres) | 5-50ms (locking) | 1-5ms | Strong | Easy (single row to audit) | Very low traffic only |
| Redis single key | 0.1-0.5ms | 0.1ms | Eventual | Moderate | Medium scale, no viral spikes |
| Redis sharded counter (this design) | 0.1-0.3ms | 1-5ms (MGET over 128 keys) | Eventual (30s) | Strong (async filter) | YouTube-scale write throughput |
| Flink streaming aggregation | 100-500ms (window) | 5-20ms | Near-real-time | Strong (in-stream) | When analytics depth needed alongside count |
| Lambda architecture (batch + speed) | ~0ms (fire-and-forget) | 10-50ms | Eventual | Strong | When historical recomputation needed |
| CRDT counter (distributed) | 1-5ms | 5-10ms (merge) | Eventual (convergent) | Moderate | Multi-datacenter active-active write |
The sharded Redis approach wins for this use case because the read side is heavily cached and the write side is the bottleneck. The cost of summing 128 keys on reads is small compared with the operational burden of running a streaming framework like Flink for what is fundamentally just a counter increment.
Flink would be the right choice if we needed to compute more complex aggregations alongside the count - retention curves, watch time histograms, or funnel analysis. For pure view counting, it is overengineered.
Key Takeaways
- Counter sharding distributes write load by splitting a single counter into N independent keys, eliminating the hot-key bottleneck that would appear on any viral video.
- Approximate counting with HyperLogLog enables unique viewer tracking with at most 12KB of memory per video, accepting a standard error of 0.81% - a reasonable tradeoff for a display metric.
- Eventual consistency display is a deliberate product choice, not a limitation: 30 seconds of lag is invisible to users and enables a fully decoupled write path.
- Async count pipeline ensures that bot filtering, aggregation, and deduplication never block video delivery - Kafka is the durable handoff buffer between them.
- Bot filtering signals layer multiple weak signals (IP velocity, watch duration, UA quality, replay rate, TLS fingerprint) into a single score, making it expensive for bots to evade all signals simultaneously.
- Periodic reconciliation bridges the gap between ephemeral Redis counters and durable Postgres state, providing a source of truth that survives full Redis failures.
- Read path decoupling keeps count reads on a separate cache tier so popular video pages do not fan out to 128 Redis keys per request - the refresh is done in the background, not on the read path.
- Idempotency via session IDs prevents double-counting from refreshes without requiring a distributed lock per view event.
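The HyperLogLog figures in the list above can be checked with a few lines of arithmetic. This assumes the Redis implementation's parameters (16,384 registers of 6 bits each) and the standard HyperLogLog error estimate of 1.04 divided by the square root of the register count.

```python
import math

REGISTERS = 2 ** 14      # 16,384 registers, as in Redis's HyperLogLog
BITS_PER_REGISTER = 6

standard_error = 1.04 / math.sqrt(REGISTERS)
memory_bytes = REGISTERS * BITS_PER_REGISTER // 8

print(f"{standard_error:.4%}")   # 0.8125%
print(memory_bytes)              # 12288 bytes, i.e. 12KB
```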
The counter-intuitive lesson in this system is that the displayed view count is not the system’s most important data. The reconciled Postgres count is. The displayed count is a best-effort approximation that serves UX needs. The Postgres count serves creator payments, content ranking, and abuse investigations. Designing for that distinction - rather than trying to make the displayed count exactly equal to the true count in real time - is what makes the architecture tractable at this scale.
Frequently Asked Questions
Q: Why not just use a database sequence or atomic counter directly?
A: At 58K writes per second, even Redis’s single-key INCR approaches its per-key CPU limit for viral videos. A database row updated with a locking UPDATE (SET view_count = view_count + 1) saturates a Postgres connection pool within seconds of a viral spike. Sharding to 128 Redis keys distributes the hot-write load and brings each shard well below saturation limits. The read cost of summing 128 keys is paid once per TTL interval, not per page view.
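A minimal sketch of the sharded counter follows. A plain dictionary stands in for Redis so the example runs anywhere; in a real deployment each increment would be an INCR on one key and the read would be a single MGET over all shard keys. The class name and key layout are assumptions for illustration.

```python
import random
from collections import defaultdict

NUM_SHARDS = 128


class ShardedCounter:
    """In-memory stand-in for N Redis keys per video (INCR on write, sum on read)."""

    def __init__(self, num_shards: int = NUM_SHARDS) -> None:
        self.num_shards = num_shards
        self.store = defaultdict(int)  # plays the role of Redis

    def _key(self, video_id: str, shard: int) -> str:
        return f"views:{video_id}:shard:{shard}"  # hypothetical key layout

    def increment(self, video_id: str) -> None:
        # Pick a random shard so concurrent writers rarely hit the same key.
        shard = random.randrange(self.num_shards)
        self.store[self._key(video_id, shard)] += 1

    def total(self, video_id: str) -> int:
        # The read fans out to every shard; in Redis this would be one MGET.
        return sum(self.store.get(self._key(video_id, s), 0)
                   for s in range(self.num_shards))


counter = ShardedCounter()
for _ in range(1000):
    counter.increment("abc123")
print(counter.total("abc123"))  # 1000
```

Random shard selection needs no coordination between writers, which is why the write path stays cheap; the fan-out cost moves entirely to the cached read.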
Q: Why Kafka instead of writing directly to Redis from the CDN edge?
A: Kafka provides durability that Redis does not. A Redis write that lands on a node just before it crashes can be lost if it has not yet been persisted or replicated. A Kafka write produced with acks=all is replicated to the in-sync replicas before it is acknowledged. Since we fire-and-forget from the CDN edge, we need the intermediary to be durable. Kafka also decouples the CDN edge from the processing pipeline - the CDN does not need to know about bot filtering, shard counts, or reconciliation.
Q: Why not run bot detection before Kafka, at the CDN edge?
A: CDN edge nodes are designed for minimal-latency request handling - they should not run ML-based scoring inline. Synchronous bot detection at the edge adds 5-20ms to every video view acknowledgment. Async detection via Kafka adds zero latency to the user experience. The tradeoff is that some bot views land in the counter briefly before the filter catches up, but reconciliation corrects this.
Q: How does the system handle a video going viral and spiking to 10x normal traffic in 60 seconds?
A: The architecture handles this gracefully because every component scales independently. Kafka absorbs the write burst (it is designed for exactly this). The bot filter consumer group can be scaled horizontally within minutes by adding pods. The Redis shards handle 200K writes/sec total (1,562/shard) comfortably. The only thing that needs adjustment is the read cache TTL - trending detection sets the TTL to 5 seconds when velocity exceeds a threshold, so the displayed count updates faster during the spike.
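The two numbers in this answer can be sketched in code: the average per-shard write rate and the velocity-based TTL switch. The 30-second default comes from the display lag quoted elsewhere in the article; the threshold value is left as a parameter because the article does not give one, and the function names are hypothetical.

```python
DEFAULT_TTL_SECONDS = 30   # normal display lag described in the article
TRENDING_TTL_SECONDS = 5   # TTL used while a video is trending


def read_cache_ttl(views_per_second: float, trending_threshold: float) -> int:
    """Pick the read-cache TTL from the current view velocity."""
    if views_per_second > trending_threshold:
        return TRENDING_TTL_SECONDS
    return DEFAULT_TTL_SECONDS


def per_shard_rate(total_writes_per_second: float, num_shards: int = 128) -> float:
    """Average write rate each shard sees if writes are spread uniformly."""
    return total_writes_per_second / num_shards


print(per_shard_rate(200_000))  # 1562.5
```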
Q: Why not use a CRDT counter for multi-datacenter active-active writes?
A: CRDTs (like a G-Counter) are the right tool when you need active-active writes across geographic regions without a single coordination point. For YouTube’s view counting, the write path funnels through Kafka, which already provides ordering within a partition. CRDTs add merge complexity and require more network coordination on reads. The sharded Redis approach with regional Kafka clusters and cross-regional reconciliation is simpler to operate. If active-active multi-region writes with zero coordination were a hard requirement, CRDTs would be the right choice.
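For reference, a G-Counter is small enough to sketch in full. Each replica increments only its own slot, and merging takes the per-replica maximum, so replicas converge no matter how often or in what order they exchange state. The replica names below are illustrative.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Max per replica makes merge commutative, associative, and idempotent.
        for replica, count in other.counts.items():
            self.counts[replica] = max(self.counts.get(replica, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


us, eu = GCounter("us-east"), GCounter("eu-west")
us.increment(3)
eu.increment(5)
us.merge(eu)
eu.merge(us)
print(us.value(), eu.value())  # 8 8
```

Because a G-Counter can only grow, removing bot views after the fact would need a second counter for decrements (a PN-Counter), which is part of the merge complexity mentioned above.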
Q: What happens to the view count if the reconciliation job runs late and Redis TTLs expire?
A: This is the most dangerous failure mode. If Redis shard keys expire before reconciliation writes to Postgres, those views are permanently lost. The mitigation is defense in depth: (1) set shard TTLs to 7 days, much longer than the 1-hour reconciliation interval; (2) alert loudly when reconciliation last-run time exceeds 2 hours; (3) keep the reconciliation_log table which stores delta per reconciliation run, allowing reconstruction of the approximate total even after a partial data loss.
Interview Questions
Q: Design the counter architecture for a video that suddenly goes viral and receives 500,000 views per minute. How does your system behave?
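Expected depth: Discuss how shard count determines write capacity (at a budget of ~1,500 ops/sec per shard, 128 shards = ~192K/sec). Explain that 500K/min = ~8,333/sec, or about 65 increments/sec per shard, which is well within that budget, so a single viral video does not exhaust shard capacity on its own. Capacity becomes a concern when many videos spike at once or when shard selection is uneven. If headroom runs out, the options are to pre-shard to 512 or 1024, or to dynamically add shards (requires key migration). Discuss Kafka partition saturation separately. Note that read cache TTL should drop to 5s for trending videos.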
Q: How would you modify this system to guarantee exactly-once view counting - no view ever counted twice or missed?
Expected depth: Explain the tension between exactly-once and performance. Discuss idempotency keys per session as the primary dedup mechanism. Explain that the Kafka at-least-once delivery model means aggregation must handle duplicate events from consumer retries. Propose a transactional outbox or Kafka exactly-once semantics (EOS). Note that enable.idempotence=true on the producer only removes duplicates caused by producer retries; end-to-end EOS also needs transactions (a transactional.id on the producer) and consumers reading with isolation.level=read_committed. Note that HyperLogLog for unique viewers is inherently approximate - exact unique counting requires a distributed set which is expensive.
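The session-based dedup step might look like the sketch below. An in-memory set stands in for a shared store with expiry (for example, a Redis SET with the NX option and a TTL); that substitution, along with the class and method names, is an assumption of this example.

```python
class ViewDeduplicator:
    """Count each (video_id, session_id) pair at most once."""

    def __init__(self) -> None:
        self.seen: set[tuple[str, str]] = set()
        self.counts: dict[str, int] = {}

    def record(self, video_id: str, session_id: str) -> bool:
        """Return True if the event was counted, False if it was a duplicate."""
        key = (video_id, session_id)
        if key in self.seen:
            return False
        self.seen.add(key)
        self.counts[video_id] = self.counts.get(video_id, 0) + 1
        return True


dedup = ViewDeduplicator()
dedup.record("abc123", "session-1")
dedup.record("abc123", "session-1")  # redelivered by an at-least-once consumer
dedup.record("abc123", "session-2")
print(dedup.counts["abc123"])  # 2
```

Making the consumer idempotent this way turns at-least-once delivery into effectively-once counting without a lock per view, at the cost of storing one key per recent session.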
Q: A creator claims their video has 1 million views but their analytics shows only 600,000 monetizable views. How would you debug this discrepancy?
Expected depth: Walk through the audit trail: compare video_view_counts.view_count (total including bots post-reconciliation) with rejected_view_events count for the same video and time range. Explain that ~40% rejection rate for a new creator is suspicious and might indicate a bot attack or an overly aggressive filter. Check the bot filter’s signal breakdown for that video’s rejected events. Discuss the shard_epoch field which allows correlating Postgres writes to specific reconciliation runs.
Q: How would you implement a “view milestone” notification - alerting a creator when their video crosses 1M, 10M, 100M views?
Expected depth: The naive approach (scan every video's count on every reconciliation run) does work proportional to the whole catalog on each run, even though almost no videos are near a threshold. Better: publish count updates to a separate Kafka topic after reconciliation. A milestone checker service consumes this topic and does threshold comparisons. Use Redis sorted sets to track which milestones have been triggered per video to avoid duplicate notifications. Discuss edge cases: count decreasing (bot cleanup), milestone detected during a bulk reconciliation backfill.
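The threshold comparison, including both edge cases, can be sketched as a pure function. A Python set stands in for the per-video record of triggered milestones; the function name and signature are hypothetical.

```python
MILESTONES = (1_000_000, 10_000_000, 100_000_000)


def newly_crossed(previous: int, current: int, already_notified: set[int]) -> list[int]:
    """Return milestones crossed between two reconciled counts, once each."""
    crossed = [m for m in MILESTONES
               if previous < m <= current and m not in already_notified]
    already_notified.update(crossed)
    return crossed


notified: set[int] = set()
print(newly_crossed(900_000, 12_000_000, notified))  # [1000000, 10000000]
# Bot cleanup drops the count below 10M, then it recovers: no repeat notification.
print(newly_crossed(9_900_000, 10_100_000, notified))  # []
```

A backfill that jumps across several thresholds in one update returns all of them at once, so the notifier can decide whether to send each or only the largest.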
Q: The bot filter is incorrectly rejecting 15% of legitimate views from a new country where mobile networks share a small CIDR. How would you fix it without redeploying?
Expected depth: The IP velocity signal is the culprit - carrier NAT concentrates many users behind one IP. Solutions: (1) expose a feature flag API that adjusts per-signal weights in real time; (2) add a CIDR allowlist that bypasses the velocity check for known carrier NAT ranges; (3) reduce the weight of IP velocity for mobile UAs. Discuss that score adjustments should be A/B tested against an audit sample to verify the false positive rate drops without letting new bot traffic through.
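Solutions (1) and (2) can be combined in one scoring function, sketched below. The signal names, the weights, and the idea of passing both in as plain data (so they can be changed at runtime) are assumptions for illustration; 100.64.0.0/10 is the shared address range reserved for carrier-grade NAT.

```python
import ipaddress

# Hypothetical weights; a real system would load these from a runtime config.
DEFAULT_WEIGHTS = {"ip_velocity": 0.4, "watch_duration": 0.3, "ua_quality": 0.3}


def bot_score(signals: dict[str, float], client_ip: str,
              weights: dict[str, float], nat_allowlist: list[str]) -> float:
    """Weighted sum of per-signal scores in [0, 1]; higher means more bot-like."""
    ip = ipaddress.ip_address(client_ip)
    behind_known_nat = any(ip in ipaddress.ip_network(cidr) for cidr in nat_allowlist)
    score = 0.0
    for name, weight in weights.items():
        if name == "ip_velocity" and behind_known_nat:
            continue  # skip the velocity signal for allowlisted carrier NAT ranges
        score += weight * signals.get(name, 0.0)
    return score


signals = {"ip_velocity": 1.0, "watch_duration": 0.2, "ua_quality": 0.1}
before = bot_score(signals, "100.64.3.7", DEFAULT_WEIGHTS, [])
after = bot_score(signals, "100.64.3.7", DEFAULT_WEIGHTS, ["100.64.0.0/10"])
print(before > after)  # True
```

Keeping weights and the allowlist as data rather than code is what makes the fix possible without a redeploy.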