Build Spotify's Song Radio Feature
data-engineering scalability performance
System Design Deep Dive
Spotify Song Radio
How do you generate an infinite, personalized, never-repeating playlist from a single seed song for 600 million users in real time?
Think of a jazz musician who has just heard one song and must immediately improvise an infinite set that feels coherent - same mood, compatible keys, similar energy arcs - while surprising the audience with each new tune and never repeating the same piece twice across the whole night. That improvisation problem, scaled to 600 million simultaneous listeners each holding a different seed song, is what Spotify’s Song Radio must solve in under 500 milliseconds per request.
The technical challenge compounds across several dimensions at once. The catalog contains 80 million tracks. Each listener has a unique taste profile built from years of streaming history. A song played in this session must not replay for five sessions back. And the system must produce a result that feels neither monotonous - all songs identical to the seed - nor jarring, where genre whiplash breaks the listener’s flow. This is not a simple similarity search problem; it is a multi-objective optimization problem running at internet scale.
Three tensions define the design space. Relevance versus diversity: a playlist of 50 songs acoustically identical to the seed is technically accurate but boring after five tracks. The system needs controlled variety. Personalization versus latency: computing a bespoke recommendation for each of 600 million users in real time requires either precomputed embeddings or extremely fast model inference. You cannot run a deep neural network per user per request. And cold-start for new releases: a track uploaded this morning has zero listening history, so collaborative filtering signals - which depend entirely on aggregated play counts - cannot apply. Audio features are the only signal available, and the system must degrade gracefully to pure acoustic similarity when history is absent.
Requirements and Constraints
Functional Requirements
- Given a seed
track_idanduser_id, return an ordered playlist of at least 50 recommended tracks in real time - The recommendation mix must blend acoustic similarity (audio features) with collaborative filtering signals (user listening patterns)
- Track deduplication must span at least 5 previous radio sessions for the same user - no track repeats within that window
- New tracks with zero play history must receive recommendations via acoustic similarity fallback
- The playlist must inject diversity to prevent genre or tempo monotony across the full session
Non-Functional Requirements
- Latency: sub-500ms end-to-end for the first batch of recommendations; P99 under 400ms
- Throughput: support 600 million active users; assume 10 million concurrent radio sessions at peak
- Catalog scale: 80 million tracks indexed for approximate nearest neighbor search
- Session memory: deduplication across 5 most recent radio sessions per user
- Availability: 99.99% - radio is a high-engagement feature; degraded mode serves pure audio similarity results when the CF model is unavailable
Constraints and Scope
- Track audio analysis (raw audio to feature vector) is an offline batch pipeline; we receive pre-computed feature vectors at query time
- Explicit user feedback (thumbs up/down) is ingested asynchronously and does not affect the current session in real time
- Ad insertion and podcast episodes are handled by separate systems outside this scope
- The Diversity Engine operates on the top-200 post-scored candidates, not the raw 80 million catalog
High-Level Architecture
The system has six layers. The API Gateway accepts the seed track_id and user_id, validates auth, and routes to the Radio Generation Service - the orchestrator that calls every downstream component. The Feature Store provides pre-computed audio embeddings for the seed track; these are fed into the ANN Index for approximate nearest neighbor retrieval of 500 candidate tracks. In parallel, the CF Model scores those candidates against the user’s taste profile. The Dedup Store and Diversity Engine then filter and reorder the scored candidates before the final playlist is returned to the client.
The ANN index and CF model run in parallel, not sequentially. The ANN index retrieves 500 acoustically similar candidates based on the seed’s audio embedding. Simultaneously, the CF model loads the user’s taste embedding. Only when both results arrive does the Radio Generation Service fuse the two scores. This fan-out-then-merge pattern keeps latency bounded by the slower of the two calls, not their sum.
Audio Feature Extraction
Every track in Spotify’s catalog passes through an audio analysis pipeline that converts raw audio into a 128-dimensional feature vector. The low-level features include tempo (beats per minute), key (musical key and mode), valence (musical positiveness, 0 to 1), energy (perceptual intensity), and danceability (how suitable a track is for dancing). These five scalar features are visible via Spotify’s public API, but the internal representation is a much richer learned embedding from a neural audio encoder.
# Audio feature extraction pipeline - converts raw audio to feature vector
import librosa
import numpy as np
from dataclasses import dataclass
from typing import Optional
@dataclass
class AudioFeatures:
track_id: str
tempo: float # BPM, e.g. 120.5
key: int # 0=C, 1=C#, ..., 11=B
mode: int # 0=minor, 1=major
valence: float # [0, 1] - musical positiveness
energy: float # [0, 1] - perceptual intensity
danceability: float # [0, 1] - rhythmic suitability
loudness: float # dBFS, typically -60 to 0
embedding: np.ndarray # 128-d learned embedding from audio encoder
def extract_scalar_features(audio_path: str, track_id: str) -> AudioFeatures:
"""
Extract scalar audio features from raw audio file.
In production this runs as a distributed Spark job over the full catalog.
"""
y, sr = librosa.load(audio_path, sr=22050)
# Tempo and beat detection
tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
# Chroma features for key estimation
chroma = librosa.feature.chroma_cqt(y=y, sr=sr)
key_idx = int(np.argmax(np.mean(chroma, axis=1)))
# Spectral features proxy for energy
rms = float(np.mean(librosa.feature.rms(y=y)))
energy = min(1.0, rms * 20.0) # normalize to [0, 1]
# Loudness in dBFS
loudness = float(librosa.amplitude_to_db(np.mean(np.abs(y))))
# Valence and danceability require a trained regressor in production;
# here we use spectral contrast as a rough proxy
spectral_contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
valence_proxy = float(np.clip(np.mean(spectral_contrast) / 30.0, 0.0, 1.0))
# Onset strength proxy for danceability
onset_env = librosa.onset.onset_strength(y=y, sr=sr)
danceability_proxy = float(np.clip(np.mean(onset_env) / 5.0, 0.0, 1.0))
# 128-d embedding would come from a pre-trained audio encoder (e.g. VGGish, MusicNN)
# Placeholder: in production, load embedding from the neural encoder output
embedding = np.zeros(128, dtype=np.float32)
return AudioFeatures(
track_id=track_id,
tempo=float(tempo),
key=key_idx,
mode=1,
valence=valence_proxy,
energy=energy,
danceability=danceability_proxy,
loudness=loudness,
embedding=embedding,
)
The scalar features are human-interpretable quality checks. The 128-dimensional embedding is what actually drives similarity search at scale. This embedding is produced by a convolutional neural encoder trained on mel spectrograms, using contrastive learning so that tracks that appear frequently together in playlists have embeddings that are geometrically close in the vector space.
Separating scalar features from the learned embedding is a deliberate engineering choice. Scalar features (tempo, key, valence) are used for diversity enforcement - you can write interpretable rules like “do not serve more than 3 consecutive tracks with tempo within 5 BPM of each other.” The 128-d embedding drives the ANN search because it captures holistic acoustic similarity that scalar features cannot, such as the difference between a fast jazz track and a fast punk track both at 180 BPM.
Collaborative Filtering Signal
Matrix factorization is the backbone of collaborative filtering for music. The intuition: if user A and user B have both saved tracks X and Y, and user A has also saved track Z, then user B probably wants to hear track Z even though they have never encountered it. This implicit feedback signal - what users play, skip, save, or add to playlists - is far richer than explicit ratings.
# Collaborative filtering via implicit ALS (Alternating Least Squares)
# Uses the implicit library which implements ALS for implicit feedback
import numpy as np
import scipy.sparse as sp
from typing import List, Tuple
class ImplicitALSModel:
"""
Thin wrapper around a pre-trained ALS model.
In production, this model is trained offline daily on the full interaction matrix
(600M users x 80M tracks) using distributed Spark ML or a custom GPU trainer.
The trained user and item factor matrices are served from a feature store.
"""
def __init__(self, user_factors: np.ndarray, item_factors: np.ndarray, track_ids: List[str]):
# user_factors: shape (num_users, embedding_dim)
# item_factors: shape (num_tracks, embedding_dim)
self.user_factors = user_factors
self.item_factors = item_factors
# Map from track_id string to row index in item_factors
self.track_id_to_idx = {tid: i for i, tid in enumerate(track_ids)}
def user_score(self, user_embedding: np.ndarray, candidate_track_ids: List[str]) -> np.ndarray:
"""
Compute CF scores for a set of candidate tracks given a user embedding.
Score = dot product of user embedding with each candidate's item embedding.
Returns array of scores, one per candidate, in the same order.
"""
indices = [self.track_id_to_idx.get(tid, -1) for tid in candidate_track_ids]
valid_mask = [i >= 0 for i in indices]
scores = np.zeros(len(candidate_track_ids), dtype=np.float32)
valid_indices = [i for i in indices if i >= 0]
if not valid_indices:
return scores
candidate_factors = self.item_factors[valid_indices] # shape (N_valid, dim)
dot_products = candidate_factors @ user_embedding # shape (N_valid,)
# Sigmoid to map to [0, 1] range
dot_products = 1.0 / (1.0 + np.exp(-dot_products))
valid_pos = 0
for i, is_valid in enumerate(valid_mask):
if is_valid:
scores[i] = dot_products[valid_pos]
valid_pos += 1
return scores
def get_user_embedding(self, user_idx: int) -> np.ndarray:
"""Retrieve user's learned latent factor vector."""
if user_idx < 0 or user_idx >= len(self.user_factors):
return np.zeros(self.user_factors.shape[1], dtype=np.float32)
return self.user_factors[user_idx]
The ALS model produces two factor matrices: one for users (600M x 128) and one for tracks (80M x 128). The user matrix is too large to hold in memory on a single serving node, so user embeddings are retrieved per-request from a Redis feature store. Track embeddings can be partially cached since the candidate pool is bounded to 500 tracks per request.
Seed-Based Candidate Retrieval
The ANN index stores all 80 million track embeddings and supports approximate nearest neighbor queries via HNSW (Hierarchical Navigable Small World graphs). Given the seed track’s 128-d embedding, an HNSW query returns the 500 most acoustically similar tracks in under 20ms.
# Seed-based candidate retrieval using HNSW
import hnswlib
import numpy as np
from typing import List, Tuple
class ANNTrackIndex:
"""
HNSW index over 80M track embeddings.
Built offline, loaded into memory at serving time.
At 80M tracks * 128 dims * 4 bytes = ~41GB per index.
Sharded by genre bucket in production to fit in per-node RAM.
"""
def __init__(self, dim: int = 128, max_elements: int = 80_000_000):
self.index = hnswlib.Index(space='cosine', dim=dim)
self.index.init_index(max_elements=max_elements, ef_construction=200, M=32)
self.track_ids: List[str] = []
def build(self, embeddings: np.ndarray, track_ids: List[str]) -> None:
"""Build the index from a numpy array of embeddings."""
self.track_ids = track_ids
int_ids = np.arange(len(track_ids))
self.index.add_items(embeddings, int_ids)
self.track_ids = track_ids
def load(self, index_path: str, track_ids_path: str) -> None:
"""Load pre-built index from disk (called at serving startup)."""
self.index.load_index(index_path)
with open(track_ids_path) as f:
self.track_ids = [line.strip() for line in f]
def query(self, seed_embedding: np.ndarray, k: int = 500) -> List[Tuple[str, float]]:
"""
Retrieve k nearest neighbors for a seed embedding.
Returns list of (track_id, cosine_distance) pairs, closest first.
"""
self.index.set_ef(max(k * 2, 200)) # ef >= k for good recall
labels, distances = self.index.knn_query(seed_embedding.reshape(1, -1), k=k)
results = []
for label, dist in zip(labels[0], distances[0]):
if label < len(self.track_ids):
results.append((self.track_ids[label], float(dist)))
return results
The index is sharded by genre bucket rather than by track ID hash. Genre-local sharding means that when a user seeds with a jazz track, the query only fans out to jazz-adjacent shards rather than all shards. This reduces both latency and unnecessary cross-genre candidates that would be filtered anyway. Cross-shard queries happen when diversity injection explicitly pulls from distant genre buckets.
Real-Time Feature Blending
Once the ANN index returns 500 candidates and the CF model scores them against the user embedding, the Radio Generation Service fuses the two signals into a single composite score. The blend weight is adaptive: users with deep listening history get higher CF weight; users with sparse history get higher audio similarity weight.
# Real-time feature blending: audio similarity + CF score
import numpy as np
from typing import List, Dict, Tuple
def compute_blend_weight(user_listen_count: int) -> float:
"""
Adaptive CF weight based on how much listening history a user has.
Users with fewer than 50 listens: CF signal is noisy, weight it low.
Users with 1000+ listens: CF signal is rich, weight it high.
"""
if user_listen_count < 50:
return 0.1
elif user_listen_count < 200:
return 0.3
elif user_listen_count < 1000:
return 0.5
else:
return 0.7
def blend_scores(
candidates: List[Tuple[str, float]], # (track_id, audio_distance) from ANN
cf_scores: Dict[str, float], # track_id -> CF score [0, 1]
user_listen_count: int,
seed_valence: float,
seed_tempo: float,
track_features: Dict[str, Dict], # track_id -> {valence, tempo, energy, ...}
) -> List[Tuple[str, float]]:
"""
Compute composite score for each candidate.
composite = (1 - cf_weight) * audio_score + cf_weight * cf_score + context_bonus
"""
cf_weight = compute_blend_weight(user_listen_count)
audio_weight = 1.0 - cf_weight
scored = []
for track_id, audio_distance in candidates:
# Convert cosine distance to similarity [0, 1]
audio_sim = max(0.0, 1.0 - audio_distance)
cf_score = cf_scores.get(track_id, 0.0)
# Context bonus: slight reward for tracks close in mood/tempo to the seed
context_bonus = 0.0
feat = track_features.get(track_id)
if feat:
valence_diff = abs(feat.get("valence", 0.5) - seed_valence)
tempo_diff = abs(feat.get("tempo", 120.0) - seed_tempo) / 200.0
context_bonus = 0.05 * (1.0 - valence_diff) + 0.05 * (1.0 - tempo_diff)
composite = audio_weight * audio_sim + cf_weight * cf_score + context_bonus
scored.append((track_id, composite))
scored.sort(key=lambda x: x[1], reverse=True)
return scored
Session-Level Deduplication
A user listening to Song Radio expects never to hear the same track twice within the current session, and ideally not across their last five radio sessions. The deduplication mechanism must be fast (checked per-candidate, not after ranking) and compact (600 million users cannot each hold a 5-session listening list in memory as a raw set).
Bloom filters are the right data structure here. A bloom filter for a single user’s 5-session window - assuming 50 tracks per session, so 250 tracks total - takes roughly 480 bytes at a 1% false positive rate. For 600 million users, that is about 288 GB total, which fits comfortably in a Redis cluster.
# Session-level deduplication using bloom filter per user
import mmh3 # MurmurHash3 - fast, good distribution
import math
from typing import List
class UserBloomFilter:
"""
Bloom filter for a single user's recently heard tracks.
Tracks 5 sessions * 50 tracks = 250 items at 1% FPR.
Size: ~480 bytes per user.
"""
def __init__(self, expected_items: int = 250, false_positive_rate: float = 0.01):
# Optimal bit array size and number of hash functions
m = self._optimal_m(expected_items, false_positive_rate)
k = self._optimal_k(m, expected_items)
self.m = m
self.k = k
self.bits = bytearray(math.ceil(m / 8))
@staticmethod
def _optimal_m(n: int, p: float) -> int:
return int(-n * math.log(p) / (math.log(2) ** 2))
@staticmethod
def _optimal_k(m: int, n: int) -> int:
return max(1, int((m / n) * math.log(2)))
def _hash_positions(self, item: str) -> List[int]:
positions = []
for seed in range(self.k):
h = mmh3.hash(item, seed, signed=False)
positions.append(h % self.m)
return positions
def add(self, track_id: str) -> None:
for pos in self._hash_positions(track_id):
byte_idx, bit_idx = divmod(pos, 8)
self.bits[byte_idx] |= (1 << bit_idx)
def contains(self, track_id: str) -> bool:
"""Returns True if track_id was probably added (may false-positive at 1%)."""
for pos in self._hash_positions(track_id):
byte_idx, bit_idx = divmod(pos, 8)
if not (self.bits[byte_idx] & (1 << bit_idx)):
return False
return True
def to_bytes(self) -> bytes:
return bytes(self.bits)
@classmethod
def from_bytes(cls, data: bytes, expected_items: int = 250, false_positive_rate: float = 0.01) -> "UserBloomFilter":
bf = cls(expected_items, false_positive_rate)
bf.bits = bytearray(data)
return bf
def filter_seen_tracks(
candidates: List[str],
bloom: UserBloomFilter,
) -> List[str]:
"""Remove candidates that the bloom filter believes were recently played."""
return [tid for tid in candidates if not bloom.contains(tid)]
Bloom filters have false positives but never false negatives. A 1% false positive rate means roughly 1 in 100 tracks that have never been heard will be incorrectly filtered out. This is acceptable - the catalog has 80 million tracks and 500 candidates, so losing ~5 candidates to false positives is invisible to the listener. However, never use a bloom filter for the inverse problem: you cannot reliably determine that a user has heard a track using a filter with any false positive rate, because false positives mean “might have heard” not “definitely heard.”
Diversity Injection
A playlist where every track has the same tempo, genre, and energy creates listener fatigue. The Diversity Engine post-processes the CF + audio scored candidates to enforce spread across multiple acoustic dimensions. The algorithm is a greedy maximum marginal relevance (MMR) selector: at each step, pick the next track that maximizes the score minus a penalty proportional to its similarity to already-selected tracks.
# Diversity injection via greedy Maximum Marginal Relevance (MMR)
import numpy as np
from typing import List, Dict, Tuple
def mmr_diversity_select(
candidates: List[Tuple[str, float]], # (track_id, composite_score)
embeddings: Dict[str, np.ndarray], # track_id -> 128-d embedding
n_select: int = 50,
lambda_diversity: float = 0.3, # 0=pure relevance, 1=pure diversity
) -> List[str]:
"""
Select n_select tracks from candidates using MMR.
Balances relevance (composite score) against diversity (distance from already selected).
lambda_diversity=0.3 means: 70% weight on relevance, 30% on diversity.
Tune via offline A/B test against session skip rate.
"""
if not candidates:
return []
selected: List[str] = []
selected_embeddings: List[np.ndarray] = []
remaining = list(candidates) # (track_id, score)
while len(selected) < n_select and remaining:
best_track = None
best_mmr = float('-inf')
for track_id, relevance_score in remaining:
emb = embeddings.get(track_id)
if emb is None:
mmr_score = (1.0 - lambda_diversity) * relevance_score
elif not selected_embeddings:
# No tracks selected yet - pure relevance
mmr_score = relevance_score
else:
# Max similarity to any already selected track
sims = [float(np.dot(emb, sel_emb) / (np.linalg.norm(emb) * np.linalg.norm(sel_emb) + 1e-9))
for sel_emb in selected_embeddings]
max_sim = max(sims)
mmr_score = (1.0 - lambda_diversity) * relevance_score - lambda_diversity * max_sim
if mmr_score > best_mmr:
best_mmr = mmr_score
best_track = (track_id, relevance_score, emb)
if best_track is None:
break
track_id, _, emb = best_track
selected.append(track_id)
if emb is not None:
selected_embeddings.append(emb)
remaining = [(tid, s) for tid, s in remaining if tid != track_id]
return selected
Cold Start for New Releases
A track uploaded today has zero listening events. The CF model’s item factor matrix has no row for this track. Collaborative filtering fails entirely. The fallback is pure audio similarity: compute the track’s 128-d embedding and find the 50 most acoustically similar established tracks in the catalog. New releases are then inserted into radio sessions as “wildcards” - occasional insertions into the primary CF + audio ranked list - rather than core results. This ensures exposure while preventing untested tracks from dominating the playlist.
# Cold-start insertion for new releases
from typing import List, Dict, Tuple, Optional
import numpy as np
NEW_RELEASE_WINDOW_DAYS = 30
def is_cold_start(track_metadata: dict) -> bool:
"""A track is cold-start if it was released within 30 days and has <100 plays."""
days_since_release = track_metadata.get("days_since_release", 999)
play_count = track_metadata.get("global_play_count", 0)
return days_since_release <= NEW_RELEASE_WINDOW_DAYS and play_count < 100
def insert_cold_start_wildcards(
ranked_playlist: List[str],
cold_start_pool: List[Tuple[str, float]], # (track_id, audio_sim_to_seed)
wildcard_rate: float = 0.08, # inject 1 wildcard per ~12 tracks
) -> List[str]:
"""
Insert cold-start tracks at pseudo-random positions in the ranked playlist.
Wildcards are inserted with probability wildcard_rate at each position.
Audio similarity to the seed is used to select which cold-start track to inject.
"""
if not cold_start_pool:
return ranked_playlist
# Sort cold-start pool by audio similarity descending (best match first)
cold_start_pool = sorted(cold_start_pool, key=lambda x: x[1], reverse=True)
cold_idx = 0
result = []
for i, track_id in enumerate(ranked_playlist):
result.append(track_id)
# Inject wildcard after every ~12 tracks if pool not exhausted
if cold_idx < len(cold_start_pool) and (i + 1) % int(1.0 / wildcard_rate) == 0:
wildcard_id, _ = cold_start_pool[cold_idx]
result.append(wildcard_id)
cold_idx += 1
return result
Spotify’s Discover Weekly and Radio features both use cold-start injection strategies for new releases. Apple Music uses a similar approach with their “New Music Mix” where tracks from recently signed artists are injected alongside established similar artists. YouTube Music addresses cold start differently - they leverage video view count as a proxy for early popularity, giving tracks with viral video presence a head start even before listening data accumulates.
Data Model
-- Core track catalog and feature tables
CREATE TABLE tracks (
track_id UUID PRIMARY KEY,
title TEXT NOT NULL,
artist_id UUID NOT NULL REFERENCES artists(artist_id),
album_id UUID REFERENCES albums(album_id),
duration_ms INT NOT NULL,
release_date DATE NOT NULL,
global_play_count BIGINT DEFAULT 0,
is_explicit BOOLEAN DEFAULT FALSE,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX tracks_artist_idx ON tracks (artist_id);
CREATE INDEX tracks_release_date_idx ON tracks (release_date DESC);
-- Pre-computed audio features and 128-d embedding per track
-- Stored separately because embeddings are large (512 bytes each)
CREATE TABLE audio_features (
track_id UUID PRIMARY KEY REFERENCES tracks(track_id),
tempo FLOAT NOT NULL,
key SMALLINT NOT NULL, -- 0=C, 1=C#, ..., 11=B
mode SMALLINT NOT NULL, -- 0=minor, 1=major
valence FLOAT NOT NULL,
energy FLOAT NOT NULL,
danceability FLOAT NOT NULL,
loudness FLOAT NOT NULL,
embedding BYTEA NOT NULL, -- 128 float32 values = 512 bytes
analyzed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- User listening history for CF training and dedup context
-- Partitioned by user_id hash for 600M users
CREATE TABLE user_listening_history (
user_id UUID NOT NULL,
track_id UUID NOT NULL REFERENCES tracks(track_id),
listened_at TIMESTAMPTZ NOT NULL,
play_pct FLOAT, -- 0.0-1.0: fraction of track played
source TEXT, -- 'radio', 'playlist', 'album', 'search'
PRIMARY KEY (user_id, listened_at, track_id)
) PARTITION BY HASH (user_id);
CREATE INDEX ulh_user_recent_idx ON user_listening_history (user_id, listened_at DESC);
-- Active radio sessions
CREATE TABLE radio_sessions (
session_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL,
seed_track_id UUID NOT NULL REFERENCES tracks(track_id),
started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
last_active_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
tracks_served INT DEFAULT 0
);
CREATE INDEX radio_sessions_user_idx ON radio_sessions (user_id, started_at DESC);
-- Per-session dedup log (also stored as bloom filter in Redis for speed)
CREATE TABLE play_dedup_log (
user_id UUID NOT NULL,
session_id UUID NOT NULL REFERENCES radio_sessions(session_id),
track_id UUID NOT NULL REFERENCES tracks(track_id),
served_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (user_id, session_id, track_id)
) PARTITION BY HASH (user_id);
The user_listening_history and play_dedup_log tables are partitioned by user_id hash across 128 partitions to handle 600 million users. The audio_features table stores embeddings as raw BYTEA rather than as a PostgreSQL vector type - in practice these embeddings are read by the ANN index at startup, not queried via SQL. The authoritative store for embeddings at serving time is the Redis Feature Store, not PostgreSQL.
Key Algorithms and Protocols
Seed-Based ANN Retrieval
# Full seed-based retrieval pipeline
import numpy as np
import redis
import struct
from typing import List, Tuple, Optional
EMBED_DIM = 128
def get_seed_embedding(track_id: str, redis_client: redis.Redis) -> Optional[np.ndarray]:
"""
Retrieve pre-computed audio embedding for seed track from Redis Feature Store.
Key format: embed:{track_id}
Value: 512 bytes (128 float32 little-endian)
"""
raw = redis_client.get(f"embed:{track_id}")
if raw is None or len(raw) < EMBED_DIM * 4:
return None
values = struct.unpack(f"<{EMBED_DIM}f", raw[:EMBED_DIM * 4])
embedding = np.array(values, dtype=np.float32)
# L2 normalize for cosine similarity search
norm = np.linalg.norm(embedding)
if norm > 0:
embedding /= norm
return embedding
def retrieve_candidates(
seed_track_id: str,
ann_index: "ANNTrackIndex",
redis_client: redis.Redis,
k: int = 500,
) -> List[Tuple[str, float]]:
"""
Main candidate retrieval function.
Returns list of (track_id, cosine_distance) sorted ascending (closest first).
"""
embedding = get_seed_embedding(seed_track_id, redis_client)
if embedding is None:
# Fallback: return empty list - Radio Gen Service will use popularity signal
return []
candidates = ann_index.query(embedding, k=k)
# Exclude the seed track itself from candidates
candidates = [(tid, dist) for tid, dist in candidates if tid != seed_track_id]
return candidates
Score Fusion: CF + Audio + Recency
# Score fusion combining CF score, audio similarity, and track recency
import numpy as np
import time
from typing import List, Dict, Tuple
def fuse_scores(
ann_candidates: List[Tuple[str, float]], # (track_id, cosine_distance)
cf_scores: Dict[str, float], # track_id -> CF score [0, 1]
track_meta: Dict[str, dict], # track_id -> {release_date, play_count, ...}
cf_weight: float,
recency_bonus_weight: float = 0.05,
) -> List[Tuple[str, float]]:
"""
Fuse three signals:
1. Audio similarity from ANN (converted from distance to similarity)
2. CF score from matrix factorization
3. Recency bonus for tracks released within last 30 days
Final score = (1 - cf_weight - recency_bonus_weight) * audio_sim
+ cf_weight * cf_score
+ recency_bonus_weight * recency_bonus
"""
audio_weight = 1.0 - cf_weight - recency_bonus_weight
scored = []
now_ts = time.time()
for track_id, cosine_dist in ann_candidates:
audio_sim = max(0.0, 1.0 - cosine_dist) # cosine dist [0,2] -> sim [0,1]
cf = cf_scores.get(track_id, 0.0)
meta = track_meta.get(track_id, {})
release_ts = meta.get("release_timestamp", 0)
days_old = (now_ts - release_ts) / 86400
# Recency bonus decays from 1.0 (just released) to 0.0 (30+ days old)
recency = max(0.0, 1.0 - days_old / 30.0)
score = audio_weight * audio_sim + cf_weight * cf + recency_bonus_weight * recency
scored.append((track_id, score))
scored.sort(key=lambda x: x[1], reverse=True)
return scored
Bloom Filter for Session Dedup
# Bloom filter management in Redis for session deduplication
import redis
import math
import mmh3
from typing import List
BLOOM_KEY_PREFIX = "bloom:user:"
BLOOM_TTL_SECONDS = 60 * 60 * 24 * 30 # 30 days
def get_user_bloom(user_id: str, redis_client: redis.Redis) -> "UserBloomFilter":
"""Load user's bloom filter from Redis or return a fresh one."""
raw = redis_client.get(f"{BLOOM_KEY_PREFIX}{user_id}")
if raw is None:
return UserBloomFilter()
return UserBloomFilter.from_bytes(raw)
def save_user_bloom(user_id: str, bloom: "UserBloomFilter", redis_client: redis.Redis) -> None:
"""Persist updated bloom filter back to Redis with TTL refresh."""
redis_client.set(
f"{BLOOM_KEY_PREFIX}{user_id}",
bloom.to_bytes(),
ex=BLOOM_TTL_SECONDS,
)
def mark_tracks_served(
user_id: str,
track_ids: List[str],
redis_client: redis.Redis,
) -> None:
"""
After serving a batch of tracks, add them to the user's bloom filter.
Uses Redis PIPELINE to batch the read-modify-write atomically.
"""
bloom = get_user_bloom(user_id, redis_client)
for track_id in track_ids:
bloom.add(track_id)
save_user_bloom(user_id, bloom, redis_client)
Scaling and Performance
Capacity Estimation
Catalog:
80M tracks * 128 dims * 4 bytes = 41GB raw embeddings
HNSW index overhead (M=32): ~20 bytes/node = 1.6GB metadata
Total ANN index: ~43GB
With genre sharding (10 shards): ~4.3GB per shard (fits in 8GB RAM nodes)
User Bloom Filters:
600M users * 480 bytes/bloom = ~288GB
Redis cluster at 32GB/node: ~9 Redis nodes just for bloom filters
With replication (x3): 27 Redis nodes
CF Model Factor Matrices:
User factors: 600M users * 128 dims * 4 bytes = 307GB
Track factors: 80M tracks * 128 dims * 4 bytes = 41GB
Too large for single node: shard user factors by user_id range (10 shards)
Each user-factor shard: ~30GB (fits in 64GB RAM node)
Radio Generation Service throughput:
10M concurrent sessions; each session fetches next batch every ~3 minutes
= 10M / 180s = ~55,000 RPS to the Radio Gen Service
Per request: 1 ANN query (~15ms) + 1 CF lookup (~5ms) + 1 bloom check (~2ms)
+ score fusion (~1ms) + MMR selection (~5ms) = ~28ms per request
Instances needed at 28ms avg, 200ms P99 budget:
55,000 RPS * 0.028s = ~1,540 concurrent requests in flight
With 32 threads per instance: ~48 Radio Gen Service instances
ANN Index shards:
10 genre shards, 3 replicas each = 30 ANN index server instances
Each receives: 55,000 RPS / 10 shards = 5,500 QPS per shard
HNSW query at ef=500: ~8ms, throughput ~125 QPS/thread
Threads needed per shard: 5,500 / 125 = ~44 threads
2 instances per shard with 24 cores each = 48 threads (sufficient with headroom)
Spotify’s internal infrastructure team has published details on their “Cosmos” data processing platform and their use of HNSW for track recommendation. Apple Music uses a similar two-tower embedding approach for their Radio feature. YouTube Music shards their embedding index differently - by upload recency rather than genre - because their catalog grows at a faster rate than Spotify’s music catalog does. The HNSW library used in most production systems (hnswlib or the nmslib predecessor) was originally developed at Microsoft Research and is now the de facto standard for billion-scale ANN search.
Failure Modes and Recovery
| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| ANN index stale (>24h since rebuild) | Index age monitor alert | Candidates exclude tracks released in past 24h; existing tracks still served | Alert on-call; trigger incremental index rebuild for new tracks; stale candidates are acceptable for 24h |
| CF model offline | gRPC health check fails; timeout >50ms | Radio generation falls back to pure audio similarity mode | Circuit breaker opens; Radio Gen Service uses audio-only scoring; degrade gracefully, no user error |
| Feature Store (Redis) down | MGET timeout >5ms | Seed embedding unavailable; cannot seed ANN query | Fall back to popularity-based candidates for that genre; cache last 100 embeddings in Radio Gen Service local memory |
| Bloom filter hash collision (false positive) | Cannot detect at runtime; estimated 1% FPR | Rare tracks incorrectly filtered from candidates | Acceptable by design; false positive rate is a tunable parameter; reduce FPR to 0.1% by doubling bloom filter size |
| Session store failure | Redis SET returns error | Cannot persist bloom filter updates; dedup state lost | Radio Gen Service continues serving without dedup for the session; next session starts fresh bloom filter |
Comparison of Approaches
| Approach | Latency | Personalization | Diversity Control | Cold Start | Best Fit |
|---|---|---|---|---|---|
| Pure Collaborative Filtering | Medium - CF requires user and item factor lookup; 50-100ms | High - directly models user taste | Weak - CF alone tends to surface popular tracks | Fails - no CF signal for new tracks | Mature catalog, dense interaction matrix |
| Pure Audio Features | Low - ANN search is fast; 15-30ms | None - same results for all users | Good - acoustic diversity enforced via MMR | Excellent - audio features available immediately | New platform, cold start heavy, any user |
| Hybrid CF + Audio (this design) | Low-medium - parallel calls; 25-50ms | High - CF weighted by user history depth | Good - MMR on audio embeddings + CF score | Handled - audio-only fallback for new tracks | Production at scale with mixed user tenure |
| Graph-Based (Listen Graph) | High - graph traversal; 200-500ms | Very high - captures social listening patterns | Depends on graph structure | Poor - new tracks have no edges | Social music platforms, last.fm-style scrobbling |
| Session-Aware Hybrid | Medium - same as hybrid + session context; 40-70ms | Highest - adapts within a single session | Best - tracks current session trajectory | Handled via audio fallback | Premium experience where within-session coherence is critical |
Key Takeaways
- Parallel fan-out is non-negotiable: the ANN index query and CF model lookup must run concurrently. Sequential execution would double latency and violate the 500ms SLA.
- Adaptive CF weighting beats a fixed blend: users with 10 listens and users with 10,000 listens should not receive the same CF-to-audio ratio. Adaptive blending produces measurably better playlist quality for both groups.
- Bloom filters are the right dedup data structure: 480 bytes per user versus kilobytes for a raw set. At 600M users, the difference between 288GB and multiple terabytes is the difference between a manageable Redis cluster and an unaffordable one.
- Genre-sharded ANN indexes beat a single global index: genre sharding reduces per-query scan space and concentrates cache warming on acoustically relevant tracks. A global index at 80M tracks would require 43GB per ANN node, pricing out most commodity hardware.
- Cold start via audio similarity is a feature, not a limitation: new tracks discovered through pure acoustic similarity often surprise users pleasantly because they surface hidden gems that have not yet accumulated social proof. Frame cold start as discovery, not degradation.
- MMR diversity selection is a tunable dial: lambda_diversity=0.3 is a starting point. A/B test it against session skip rate and session length. Lower values create tighter acoustic cohesion; higher values introduce more surprise at the cost of coherence.
- Session dedup state belongs in Redis, not in the recommendation model: trying to encode dedup state into the model (e.g., training on negative labels for recently played tracks) makes the model context-dependent and hard to evaluate offline. Keep session state in a fast store and apply it as a post-filter.
- Degraded mode must be defined before the pager fires: when the CF model goes down, pure audio similarity must already be wired as the fallback. A system that returns errors when the CF model is unavailable is unacceptable for a feature that 600 million users touch daily.
Frequently Asked Questions
Q: Why not use a transformer model for recommendation instead of matrix factorization? Transformer-based sequential recommendation models (like SASRec or BERT4Rec) are state of the art in offline metrics. The problem is inference latency. A transformer running attention over a user’s full listening history produces a context-aware embedding in 100-300ms per user at inference time. Matrix factorization produces a static user embedding that is read from a store in 2ms. At 55,000 RPS, the latency difference translates directly into a 10x infrastructure cost difference. Transformers are the right choice for the batch-computed “Discover Weekly” use case; static embeddings are the right choice for real-time radio.
Q: Why not build one giant FAISS index with all 80M tracks instead of sharding by genre? A single flat FAISS index at 80M tracks x 128 dims requires 41GB of RAM per replica. With 3 replicas, that is 123GB just for the index, requiring expensive high-memory instances. Genre sharding brings each shard to ~4GB, which runs comfortably on commodity 8GB nodes. More importantly, genre-local queries have higher precision - a jazz seed querying only jazz-adjacent shards returns candidates with a higher floor quality than querying all genres and discarding 90% of results.
Q: Why not store user listening history in the bloom filter instead of tracking last-5-sessions separately? A bloom filter is write-only - you cannot remove items from it. If we added every track a user has ever heard to the bloom filter, it would fill and produce excessive false positives within a few thousand tracks. The 5-session window is intentional: radio is meant to recycle tracks over longer time horizons. Listening to a great track in a session six months ago should not prevent it from appearing in today’s radio. The rolling 5-session window balances freshness against variety.
Q: Why 500 ANN candidates instead of 100 or 1000? Too few candidates (100) limits the diversity the MMR selector can inject - if 80 of the 100 are acoustically identical, the final 50-track playlist will be monotonous regardless of the lambda_diversity setting. Too many candidates (1000) adds latency in the CF scoring step - scoring 1000 candidates via dot product is 2x more compute than 500. The CF scoring step is the bottleneck, not the ANN query. 500 candidates gives enough raw material for diversity while keeping CF scoring under 5ms on a CPU.
Q: How do you handle a seed track that is not in the ANN index - for example, a track a user uploaded themselves? User-uploaded tracks are not processed through the audio analysis pipeline immediately. The Radio Generation Service detects that the seed embedding is absent from the Feature Store and falls back to a “genre radio” mode: the user’s listening history is used to infer a genre, and the 500 most popular tracks from that genre over the past 30 days are used as the candidate pool. This is a significantly weaker experience but gracefully avoids an error response.
Q: Why use cosine distance for the ANN index instead of L2 (Euclidean) distance? Cosine distance measures the angle between embeddings, not their magnitude. Two tracks with very different loudness profiles will have embeddings with different magnitudes (L2 norms) but potentially the same acoustic character. L2 distance would penalize magnitude differences that are acoustically meaningless. Cosine distance - or equivalently, dot product after L2 normalizing all embeddings - captures directional similarity in the embedding space, which correlates better with perceived musical similarity than raw L2 distance does.
Interview Questions
Q: Walk me through what happens from the moment a user taps “Start Radio” on a seed track to when the first song starts playing.
Expected depth: Client sends POST /api/radio/start with track_id and user_id to the API Gateway. Gateway validates auth token, forwards to Radio Generation Service via gRPC. Service issues two parallel calls: (1) Redis Feature Store lookup for the seed’s 128-d embedding, (2) Redis lookup for the user’s bloom filter. Once the embedding arrives, it fans out to the genre-appropriate ANN index shard for a k=500 HNSW query (~15ms). Simultaneously, the user’s CF embedding is fetched from the user factor store (~5ms). Score fusion runs in ~1ms. Bloom filter filters candidates (~2ms). MMR selects 50 diverse tracks (~5ms). First batch returned to client. Total: ~30ms median, ~120ms P99. Client buffers the first batch and starts playing track 1 while prefetching the next batch.
Q: The system is returning noticeably worse recommendations for users who have just signed up versus users with 5 years of listening history. How do you diagnose and fix this?
Expected depth: Root cause is almost certainly the CF weight for new users. New users have sparse listening history - maybe 10-50 tracks - and the CF model’s user embedding for them is noisy. If compute_blend_weight is returning 0.5 for new users (because they have 200 listens), the CF signal is dominating but the signal quality is poor. Diagnosis: segment recommendation quality metrics (skip rate, session length) by user tenure. Fix: lower the CF weight threshold for users under 12 months old; use listening hour count as a more reliable freshness signal than absolute play count. Also check whether new users are receiving the cold-start treatment correctly for recently released tracks they encounter.
Q: You need to add a “mood” filter to Song Radio - users can specify “energetic,” “calm,” or “happy” and the radio should bias toward that mood. How do you extend the architecture?
Expected depth: Mood maps directly to scalar audio features: energetic = high energy (above 0.7), calm = low energy (below 0.4) + low tempo, happy = high valence (above 0.6). The simplest extension adds a mood constraint as a post-filter on the ANN candidates before CF scoring. A more sophisticated approach adds a mood bias term to the score fusion: mood_bonus = 0.1 * mood_match(track_features, user_mood). The ANN index itself does not need to change - cosine similarity still returns acoustically similar candidates regardless of mood. The mood filter is applied in the Radio Generation Service before MMR selection. Cold start tracks that match the mood should receive a higher wildcard injection rate.
Q: Song Radio is consuming 40% of your total Redis cluster memory. How do you reduce it without degrading the user experience?
Expected depth: Three main consumers - user bloom filters (288GB), user CF embeddings (dynamic cache), and seed audio embeddings (80M * 512 bytes = 41GB). Bloom filter size can be reduced by lowering the 5-session window to 3 sessions (40% reduction in bit array size) or by increasing the false positive rate from 1% to 2% (halves the bit array size). For CF embeddings, ensure the TTL is set appropriately - users who have not started a radio session in 7 days should have their embedding evicted. For audio embeddings, consider storing only the most frequently seeded tracks (top 10M by play count) in Redis and falling back to PostgreSQL for the remaining 70M. A tiered caching strategy - Redis for hot embeddings, PostgreSQL for cold - can reduce Redis embedding memory by 80% with minimal latency impact for popular seeds.
Q: How would you add real-time feedback - if a user skips a track in the radio session, bias away from similar tracks immediately within the same session?
Expected depth: Real-time within-session feedback requires a session-scoped “dislike embedding.” When a user skips track T, retrieve T’s audio embedding and add it to a “session repulsion vector” - a running average of disliked track embeddings. In the next score fusion call, subtract the dot product of each candidate’s embedding against the repulsion vector from the composite score: adjusted_score = composite - repulsion_weight * dot(candidate_emb, repulsion_vec). This biases the MMR selection away from tracks acoustically similar to skipped tracks without requiring a model retrain. The repulsion vector is stored in session state (Redis key tied to session_id, TTL matching session expiry) and grows incrementally as the user provides feedback. This is pure real-time inference - no model update required.
Premium Content
Unlock the full article along with everything else in the archive — all in one place.