Build Spotify's Song Radio Feature


data-engineering scalability performance

System Design Deep Dive

Spotify Song Radio

How do you generate an infinite, personalized, never-repeating playlist from a single seed song for 600 million users in real time?

14 min readAdvancedMusic Recommendation

Think of a jazz musician who has just heard one song and must immediately improvise an infinite set that feels coherent - same mood, compatible keys, similar energy arcs - while surprising the audience with each new tune and never repeating the same piece twice across the whole night. That improvisation problem, scaled to 600 million simultaneous listeners each holding a different seed song, is what Spotify’s Song Radio must solve in under 500 milliseconds per request.

The technical challenge compounds across several dimensions at once. The catalog contains 80 million tracks. Each listener has a unique taste profile built from years of streaming history. A song played in this session must not replay for five sessions back. And the system must produce a result that feels neither monotonous - all songs identical to the seed - nor jarring, where genre whiplash breaks the listener’s flow. This is not a simple similarity search problem; it is a multi-objective optimization problem running at internet scale.

Three tensions define the design space. Relevance versus diversity: a playlist of 50 songs acoustically identical to the seed is technically accurate but boring after five tracks. The system needs controlled variety. Personalization versus latency: computing a bespoke recommendation for each of 600 million users in real time requires either precomputed embeddings or extremely fast model inference. You cannot run a deep neural network per user per request. And cold-start for new releases: a track uploaded this morning has zero listening history, so collaborative filtering signals - which depend entirely on aggregated play counts - cannot apply. Audio features are the only signal available, and the system must degrade gracefully to pure acoustic similarity when history is absent.

Requirements and Constraints

Functional Requirements

  • Given a seed track_id and user_id, return an ordered playlist of at least 50 recommended tracks in real time
  • The recommendation mix must blend acoustic similarity (audio features) with collaborative filtering signals (user listening patterns)
  • Track deduplication must span at least 5 previous radio sessions for the same user - no track repeats within that window
  • New tracks with zero play history must receive recommendations via acoustic similarity fallback
  • The playlist must inject diversity to prevent genre or tempo monotony across the full session

Non-Functional Requirements

  • Latency: sub-500ms end-to-end for the first batch of recommendations; P99 under 400ms
  • Throughput: support 600 million active users; assume 10 million concurrent radio sessions at peak
  • Catalog scale: 80 million tracks indexed for approximate nearest neighbor search
  • Session memory: deduplication across 5 most recent radio sessions per user
  • Availability: 99.99% - radio is a high-engagement feature; degraded mode serves pure audio similarity results when the CF model is unavailable

Constraints and Scope

  • Track audio analysis (raw audio to feature vector) is an offline batch pipeline; we receive pre-computed feature vectors at query time
  • Explicit user feedback (thumbs up/down) is ingested asynchronously and does not affect the current session in real time
  • Ad insertion and podcast episodes are handled by separate systems outside this scope
  • The Diversity Engine operates on the top-200 post-scored candidates, not the raw 80 million catalog

High-Level Architecture

The system has six layers. The API Gateway accepts the seed track_id and user_id, validates auth, and routes to the Radio Generation Service - the orchestrator that calls every downstream component. The Feature Store provides pre-computed audio embeddings for the seed track; these are fed into the ANN Index for approximate nearest neighbor retrieval of 500 candidate tracks. In parallel, the CF Model scores those candidates against the user’s taste profile. The Dedup Store and Diversity Engine then filter and reorder the scored candidates before the final playlist is returned to the client.

Spotify Song Radio high-level architecture
Key Insight

The ANN index and CF model run in parallel, not sequentially. The ANN index retrieves 500 acoustically similar candidates based on the seed’s audio embedding. Simultaneously, the CF model loads the user’s taste embedding. Only when both results arrive does the Radio Generation Service fuse the two scores. This fan-out-then-merge pattern keeps latency bounded by the slower of the two calls, not their sum.

Audio Feature Extraction

Every track in Spotify’s catalog passes through an audio analysis pipeline that converts raw audio into a 128-dimensional feature vector. The low-level features include tempo (beats per minute), key (musical key and mode), valence (musical positiveness, 0 to 1), energy (perceptual intensity), and danceability (how suitable a track is for dancing). These five scalar features are visible via Spotify’s public API, but the internal representation is a much richer learned embedding from a neural audio encoder.

# Audio feature extraction pipeline - converts raw audio to feature vector
import librosa
import numpy as np
from dataclasses import dataclass
from typing import Optional

@dataclass
class AudioFeatures:
    track_id: str
    tempo: float           # BPM, e.g. 120.5
    key: int               # 0=C, 1=C#, ..., 11=B
    mode: int              # 0=minor, 1=major
    valence: float         # [0, 1] - musical positiveness
    energy: float          # [0, 1] - perceptual intensity
    danceability: float    # [0, 1] - rhythmic suitability
    loudness: float        # dBFS, typically -60 to 0
    embedding: np.ndarray  # 128-d learned embedding from audio encoder

def extract_scalar_features(audio_path: str, track_id: str) -> AudioFeatures:
    """
    Extract scalar audio features from raw audio file.
    In production this runs as a distributed Spark job over the full catalog.
    """
    y, sr = librosa.load(audio_path, sr=22050)

    # Tempo and beat detection
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)

    # Chroma features for key estimation
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)
    key_idx = int(np.argmax(np.mean(chroma, axis=1)))

    # Spectral features proxy for energy
    rms = float(np.mean(librosa.feature.rms(y=y)))
    energy = min(1.0, rms * 20.0)  # normalize to [0, 1]

    # Loudness in dBFS
    loudness = float(librosa.amplitude_to_db(np.mean(np.abs(y))))

    # Valence and danceability require a trained regressor in production;
    # here we use spectral contrast as a rough proxy
    spectral_contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    valence_proxy = float(np.clip(np.mean(spectral_contrast) / 30.0, 0.0, 1.0))

    # Onset strength proxy for danceability
    onset_env = librosa.onset.onset_strength(y=y, sr=sr)
    danceability_proxy = float(np.clip(np.mean(onset_env) / 5.0, 0.0, 1.0))

    # 128-d embedding would come from a pre-trained audio encoder (e.g. VGGish, MusicNN)
    # Placeholder: in production, load embedding from the neural encoder output
    embedding = np.zeros(128, dtype=np.float32)

    return AudioFeatures(
        track_id=track_id,
        tempo=float(tempo),
        key=key_idx,
        mode=1,
        valence=valence_proxy,
        energy=energy,
        danceability=danceability_proxy,
        loudness=loudness,
        embedding=embedding,
    )

The scalar features are human-interpretable quality checks. The 128-dimensional embedding is what actually drives similarity search at scale. This embedding is produced by a convolutional neural encoder trained on mel spectrograms, using contrastive learning so that tracks that appear frequently together in playlists have embeddings that are geometrically close in the vector space.

Candidate retrieval and scoring pipeline
Key Insight

Separating scalar features from the learned embedding is a deliberate engineering choice. Scalar features (tempo, key, valence) are used for diversity enforcement - you can write interpretable rules like “do not serve more than 3 consecutive tracks with tempo within 5 BPM of each other.” The 128-d embedding drives the ANN search because it captures holistic acoustic similarity that scalar features cannot, such as the difference between a fast jazz track and a fast punk track both at 180 BPM.

Collaborative Filtering Signal

Matrix factorization is the backbone of collaborative filtering for music. The intuition: if user A and user B have both saved tracks X and Y, and user A has also saved track Z, then user B probably wants to hear track Z even though they have never encountered it. This implicit feedback signal - what users play, skip, save, or add to playlists - is far richer than explicit ratings.

# Collaborative filtering via implicit ALS (Alternating Least Squares)
# Uses the implicit library which implements ALS for implicit feedback
import numpy as np
import scipy.sparse as sp
from typing import List, Tuple

class ImplicitALSModel:
    """
    Thin wrapper around a pre-trained ALS model.
    In production, this model is trained offline daily on the full interaction matrix
    (600M users x 80M tracks) using distributed Spark ML or a custom GPU trainer.
    The trained user and item factor matrices are served from a feature store.
    """

    def __init__(self, user_factors: np.ndarray, item_factors: np.ndarray, track_ids: List[str]):
        # user_factors: shape (num_users, embedding_dim)
        # item_factors: shape (num_tracks, embedding_dim)
        self.user_factors = user_factors
        self.item_factors = item_factors
        # Map from track_id string to row index in item_factors
        self.track_id_to_idx = {tid: i for i, tid in enumerate(track_ids)}

    def user_score(self, user_embedding: np.ndarray, candidate_track_ids: List[str]) -> np.ndarray:
        """
        Compute CF scores for a set of candidate tracks given a user embedding.
        Score = dot product of user embedding with each candidate's item embedding.
        Returns array of scores, one per candidate, in the same order.
        """
        indices = [self.track_id_to_idx.get(tid, -1) for tid in candidate_track_ids]
        valid_mask = [i >= 0 for i in indices]
        scores = np.zeros(len(candidate_track_ids), dtype=np.float32)

        valid_indices = [i for i in indices if i >= 0]
        if not valid_indices:
            return scores

        candidate_factors = self.item_factors[valid_indices]  # shape (N_valid, dim)
        dot_products = candidate_factors @ user_embedding     # shape (N_valid,)

        # Sigmoid to map to [0, 1] range
        dot_products = 1.0 / (1.0 + np.exp(-dot_products))

        valid_pos = 0
        for i, is_valid in enumerate(valid_mask):
            if is_valid:
                scores[i] = dot_products[valid_pos]
                valid_pos += 1

        return scores

    def get_user_embedding(self, user_idx: int) -> np.ndarray:
        """Retrieve user's learned latent factor vector."""
        if user_idx < 0 or user_idx >= len(self.user_factors):
            return np.zeros(self.user_factors.shape[1], dtype=np.float32)
        return self.user_factors[user_idx]

The ALS model produces two factor matrices: one for users (600M x 128) and one for tracks (80M x 128). The user matrix is too large to hold in memory on a single serving node, so user embeddings are retrieved per-request from a Redis feature store. Track embeddings can be partially cached since the candidate pool is bounded to 500 tracks per request.

Seed-Based Candidate Retrieval

The ANN index stores all 80 million track embeddings and supports approximate nearest neighbor queries via HNSW (Hierarchical Navigable Small World graphs). Given the seed track’s 128-d embedding, an HNSW query returns the 500 most acoustically similar tracks in under 20ms.

# Seed-based candidate retrieval using HNSW
import hnswlib
import numpy as np
from typing import List, Tuple

class ANNTrackIndex:
    """
    HNSW index over 80M track embeddings.
    Built offline, loaded into memory at serving time.
    At 80M tracks * 128 dims * 4 bytes = ~41GB per index.
    Sharded by genre bucket in production to fit in per-node RAM.
    """

    def __init__(self, dim: int = 128, max_elements: int = 80_000_000):
        self.index = hnswlib.Index(space='cosine', dim=dim)
        self.index.init_index(max_elements=max_elements, ef_construction=200, M=32)
        self.track_ids: List[str] = []

    def build(self, embeddings: np.ndarray, track_ids: List[str]) -> None:
        """Build the index from a numpy array of embeddings."""
        self.track_ids = track_ids
        int_ids = np.arange(len(track_ids))
        self.index.add_items(embeddings, int_ids)
        self.track_ids = track_ids

    def load(self, index_path: str, track_ids_path: str) -> None:
        """Load pre-built index from disk (called at serving startup)."""
        self.index.load_index(index_path)
        with open(track_ids_path) as f:
            self.track_ids = [line.strip() for line in f]

    def query(self, seed_embedding: np.ndarray, k: int = 500) -> List[Tuple[str, float]]:
        """
        Retrieve k nearest neighbors for a seed embedding.
        Returns list of (track_id, cosine_distance) pairs, closest first.
        """
        self.index.set_ef(max(k * 2, 200))  # ef >= k for good recall
        labels, distances = self.index.knn_query(seed_embedding.reshape(1, -1), k=k)
        results = []
        for label, dist in zip(labels[0], distances[0]):
            if label < len(self.track_ids):
                results.append((self.track_ids[label], float(dist)))
        return results
Radio generation data flow

The index is sharded by genre bucket rather than by track ID hash. Genre-local sharding means that when a user seeds with a jazz track, the query only fans out to jazz-adjacent shards rather than all shards. This reduces both latency and unnecessary cross-genre candidates that would be filtered anyway. Cross-shard queries happen when diversity injection explicitly pulls from distant genre buckets.

Real-Time Feature Blending

Once the ANN index returns 500 candidates and the CF model scores them against the user embedding, the Radio Generation Service fuses the two signals into a single composite score. The blend weight is adaptive: users with deep listening history get higher CF weight; users with sparse history get higher audio similarity weight.

# Real-time feature blending: audio similarity + CF score
import numpy as np
from typing import List, Dict, Tuple

def compute_blend_weight(user_listen_count: int) -> float:
    """
    Adaptive CF weight based on how much listening history a user has.
    Users with fewer than 50 listens: CF signal is noisy, weight it low.
    Users with 1000+ listens: CF signal is rich, weight it high.
    """
    if user_listen_count < 50:
        return 0.1
    elif user_listen_count < 200:
        return 0.3
    elif user_listen_count < 1000:
        return 0.5
    else:
        return 0.7

def blend_scores(
    candidates: List[Tuple[str, float]],  # (track_id, audio_distance) from ANN
    cf_scores: Dict[str, float],          # track_id -> CF score [0, 1]
    user_listen_count: int,
    seed_valence: float,
    seed_tempo: float,
    track_features: Dict[str, Dict],      # track_id -> {valence, tempo, energy, ...}
) -> List[Tuple[str, float]]:
    """
    Compute composite score for each candidate.
    composite = (1 - cf_weight) * audio_score + cf_weight * cf_score + context_bonus
    """
    cf_weight = compute_blend_weight(user_listen_count)
    audio_weight = 1.0 - cf_weight
    scored = []

    for track_id, audio_distance in candidates:
        # Convert cosine distance to similarity [0, 1]
        audio_sim = max(0.0, 1.0 - audio_distance)

        cf_score = cf_scores.get(track_id, 0.0)

        # Context bonus: slight reward for tracks close in mood/tempo to the seed
        context_bonus = 0.0
        feat = track_features.get(track_id)
        if feat:
            valence_diff = abs(feat.get("valence", 0.5) - seed_valence)
            tempo_diff = abs(feat.get("tempo", 120.0) - seed_tempo) / 200.0
            context_bonus = 0.05 * (1.0 - valence_diff) + 0.05 * (1.0 - tempo_diff)

        composite = audio_weight * audio_sim + cf_weight * cf_score + context_bonus
        scored.append((track_id, composite))

    scored.sort(key=lambda x: x[1], reverse=True)
    return scored

Session-Level Deduplication

A user listening to Song Radio expects never to hear the same track twice within the current session, and ideally not across their last five radio sessions. The deduplication mechanism must be fast (checked per-candidate, not after ranking) and compact (600 million users cannot each hold a 5-session listening list in memory as a raw set).

Bloom filters are the right data structure here. A bloom filter for a single user’s 5-session window - assuming 50 tracks per session, so 250 tracks total - takes roughly 480 bytes at a 1% false positive rate. For 600 million users, that is about 288 GB total, which fits comfortably in a Redis cluster.

# Session-level deduplication using bloom filter per user
import mmh3   # MurmurHash3 - fast, good distribution
import math
from typing import List

class UserBloomFilter:
    """
    Bloom filter for a single user's recently heard tracks.
    Tracks 5 sessions * 50 tracks = 250 items at 1% FPR.
    Size: ~480 bytes per user.
    """

    def __init__(self, expected_items: int = 250, false_positive_rate: float = 0.01):
        # Optimal bit array size and number of hash functions
        m = self._optimal_m(expected_items, false_positive_rate)
        k = self._optimal_k(m, expected_items)
        self.m = m
        self.k = k
        self.bits = bytearray(math.ceil(m / 8))

    @staticmethod
    def _optimal_m(n: int, p: float) -> int:
        return int(-n * math.log(p) / (math.log(2) ** 2))

    @staticmethod
    def _optimal_k(m: int, n: int) -> int:
        return max(1, int((m / n) * math.log(2)))

    def _hash_positions(self, item: str) -> List[int]:
        positions = []
        for seed in range(self.k):
            h = mmh3.hash(item, seed, signed=False)
            positions.append(h % self.m)
        return positions

    def add(self, track_id: str) -> None:
        for pos in self._hash_positions(track_id):
            byte_idx, bit_idx = divmod(pos, 8)
            self.bits[byte_idx] |= (1 << bit_idx)

    def contains(self, track_id: str) -> bool:
        """Returns True if track_id was probably added (may false-positive at 1%)."""
        for pos in self._hash_positions(track_id):
            byte_idx, bit_idx = divmod(pos, 8)
            if not (self.bits[byte_idx] & (1 << bit_idx)):
                return False
        return True

    def to_bytes(self) -> bytes:
        return bytes(self.bits)

    @classmethod
    def from_bytes(cls, data: bytes, expected_items: int = 250, false_positive_rate: float = 0.01) -> "UserBloomFilter":
        bf = cls(expected_items, false_positive_rate)
        bf.bits = bytearray(data)
        return bf

def filter_seen_tracks(
    candidates: List[str],
    bloom: UserBloomFilter,
) -> List[str]:
    """Remove candidates that the bloom filter believes were recently played."""
    return [tid for tid in candidates if not bloom.contains(tid)]
Watch Out

Bloom filters have false positives but never false negatives. A 1% false positive rate means roughly 1 in 100 tracks that have never been heard will be incorrectly filtered out. This is acceptable - the catalog has 80 million tracks and 500 candidates, so losing ~5 candidates to false positives is invisible to the listener. However, never use a bloom filter for the inverse problem: you cannot reliably determine that a user has heard a track using a filter with any false positive rate, because false positives mean “might have heard” not “definitely heard.”

Diversity Injection

A playlist where every track has the same tempo, genre, and energy creates listener fatigue. The Diversity Engine post-processes the CF + audio scored candidates to enforce spread across multiple acoustic dimensions. The algorithm is a greedy maximum marginal relevance (MMR) selector: at each step, pick the next track that maximizes the score minus a penalty proportional to its similarity to already-selected tracks.

# Diversity injection via greedy Maximum Marginal Relevance (MMR)
import numpy as np
from typing import List, Dict, Tuple

def mmr_diversity_select(
    candidates: List[Tuple[str, float]],   # (track_id, composite_score)
    embeddings: Dict[str, np.ndarray],     # track_id -> 128-d embedding
    n_select: int = 50,
    lambda_diversity: float = 0.3,         # 0=pure relevance, 1=pure diversity
) -> List[str]:
    """
    Select n_select tracks from candidates using MMR.
    Balances relevance (composite score) against diversity (distance from already selected).

    lambda_diversity=0.3 means: 70% weight on relevance, 30% on diversity.
    Tune via offline A/B test against session skip rate.
    """
    if not candidates:
        return []

    selected: List[str] = []
    selected_embeddings: List[np.ndarray] = []
    remaining = list(candidates)  # (track_id, score)

    while len(selected) < n_select and remaining:
        best_track = None
        best_mmr = float('-inf')

        for track_id, relevance_score in remaining:
            emb = embeddings.get(track_id)
            if emb is None:
                mmr_score = (1.0 - lambda_diversity) * relevance_score
            elif not selected_embeddings:
                # No tracks selected yet - pure relevance
                mmr_score = relevance_score
            else:
                # Max similarity to any already selected track
                sims = [float(np.dot(emb, sel_emb) / (np.linalg.norm(emb) * np.linalg.norm(sel_emb) + 1e-9))
                        for sel_emb in selected_embeddings]
                max_sim = max(sims)
                mmr_score = (1.0 - lambda_diversity) * relevance_score - lambda_diversity * max_sim

            if mmr_score > best_mmr:
                best_mmr = mmr_score
                best_track = (track_id, relevance_score, emb)

        if best_track is None:
            break

        track_id, _, emb = best_track
        selected.append(track_id)
        if emb is not None:
            selected_embeddings.append(emb)
        remaining = [(tid, s) for tid, s in remaining if tid != track_id]

    return selected

Cold Start for New Releases

A track uploaded today has zero listening events. The CF model’s item factor matrix has no row for this track. Collaborative filtering fails entirely. The fallback is pure audio similarity: compute the track’s 128-d embedding and find the 50 most acoustically similar established tracks in the catalog. New releases are then inserted into radio sessions as “wildcards” - occasional insertions into the primary CF + audio ranked list - rather than core results. This ensures exposure while preventing untested tracks from dominating the playlist.

# Cold-start insertion for new releases
from typing import List, Dict, Tuple, Optional
import numpy as np

NEW_RELEASE_WINDOW_DAYS = 30

def is_cold_start(track_metadata: dict) -> bool:
    """A track is cold-start if it was released within 30 days and has <100 plays."""
    days_since_release = track_metadata.get("days_since_release", 999)
    play_count = track_metadata.get("global_play_count", 0)
    return days_since_release <= NEW_RELEASE_WINDOW_DAYS and play_count < 100

def insert_cold_start_wildcards(
    ranked_playlist: List[str],
    cold_start_pool: List[Tuple[str, float]],  # (track_id, audio_sim_to_seed)
    wildcard_rate: float = 0.08,               # inject 1 wildcard per ~12 tracks
) -> List[str]:
    """
    Insert cold-start tracks at pseudo-random positions in the ranked playlist.
    Wildcards are inserted with probability wildcard_rate at each position.
    Audio similarity to the seed is used to select which cold-start track to inject.
    """
    if not cold_start_pool:
        return ranked_playlist

    # Sort cold-start pool by audio similarity descending (best match first)
    cold_start_pool = sorted(cold_start_pool, key=lambda x: x[1], reverse=True)
    cold_idx = 0

    result = []
    for i, track_id in enumerate(ranked_playlist):
        result.append(track_id)
        # Inject wildcard after every ~12 tracks if pool not exhausted
        if cold_idx < len(cold_start_pool) and (i + 1) % int(1.0 / wildcard_rate) == 0:
            wildcard_id, _ = cold_start_pool[cold_idx]
            result.append(wildcard_id)
            cold_idx += 1

    return result
Real World

Spotify’s Discover Weekly and Radio features both use cold-start injection strategies for new releases. Apple Music uses a similar approach with their “New Music Mix” where tracks from recently signed artists are injected alongside established similar artists. YouTube Music addresses cold start differently - they leverage video view count as a proxy for early popularity, giving tracks with viral video presence a head start even before listening data accumulates.

Data Model

-- Core track catalog and feature tables

CREATE TABLE tracks (
track_id UUID PRIMARY KEY,
title TEXT NOT NULL,
artist_id UUID NOT NULL REFERENCES artists(artist_id),
album_id UUID REFERENCES albums(album_id),
duration_ms INT NOT NULL,
release_date DATE NOT NULL,
global_play_count BIGINT DEFAULT 0,
is_explicit BOOLEAN DEFAULT FALSE,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX tracks_artist_idx ON tracks (artist_id);
CREATE INDEX tracks_release_date_idx ON tracks (release_date DESC);

-- Pre-computed audio features and 128-d embedding per track
-- Stored separately because embeddings are large (512 bytes each)
CREATE TABLE audio_features (
track_id UUID PRIMARY KEY REFERENCES tracks(track_id),
tempo FLOAT NOT NULL,
key SMALLINT NOT NULL,           -- 0=C, 1=C#, ..., 11=B
mode SMALLINT NOT NULL,          -- 0=minor, 1=major
valence FLOAT NOT NULL,
energy FLOAT NOT NULL,
danceability FLOAT NOT NULL,
loudness FLOAT NOT NULL,
embedding BYTEA NOT NULL,        -- 128 float32 values = 512 bytes
analyzed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- User listening history for CF training and dedup context
-- Partitioned by user_id hash for 600M users
CREATE TABLE user_listening_history (
user_id UUID NOT NULL,
track_id UUID NOT NULL REFERENCES tracks(track_id),
listened_at TIMESTAMPTZ NOT NULL,
play_pct FLOAT,                  -- 0.0-1.0: fraction of track played
source TEXT,                     -- 'radio', 'playlist', 'album', 'search'
PRIMARY KEY (user_id, listened_at, track_id)
) PARTITION BY HASH (user_id);

CREATE INDEX ulh_user_recent_idx ON user_listening_history (user_id, listened_at DESC);

-- Active radio sessions
CREATE TABLE radio_sessions (
session_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL,
seed_track_id UUID NOT NULL REFERENCES tracks(track_id),
started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
last_active_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
tracks_served INT DEFAULT 0
);

CREATE INDEX radio_sessions_user_idx ON radio_sessions (user_id, started_at DESC);

-- Per-session dedup log (also stored as bloom filter in Redis for speed)
CREATE TABLE play_dedup_log (
user_id UUID NOT NULL,
session_id UUID NOT NULL REFERENCES radio_sessions(session_id),
track_id UUID NOT NULL REFERENCES tracks(track_id),
served_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (user_id, session_id, track_id)
) PARTITION BY HASH (user_id);

The user_listening_history and play_dedup_log tables are partitioned by user_id hash across 128 partitions to handle 600 million users. The audio_features table stores embeddings as raw BYTEA rather than as a PostgreSQL vector type - in practice these embeddings are read by the ANN index at startup, not queried via SQL. The authoritative store for embeddings at serving time is the Redis Feature Store, not PostgreSQL.

Key Algorithms and Protocols

Seed-Based ANN Retrieval

# Full seed-based retrieval pipeline
import numpy as np
import redis
import struct
from typing import List, Tuple, Optional

EMBED_DIM = 128

def get_seed_embedding(track_id: str, redis_client: redis.Redis) -> Optional[np.ndarray]:
    """
    Retrieve pre-computed audio embedding for seed track from Redis Feature Store.
    Key format: embed:{track_id}
    Value: 512 bytes (128 float32 little-endian)
    """
    raw = redis_client.get(f"embed:{track_id}")
    if raw is None or len(raw) < EMBED_DIM * 4:
        return None
    values = struct.unpack(f"<{EMBED_DIM}f", raw[:EMBED_DIM * 4])
    embedding = np.array(values, dtype=np.float32)
    # L2 normalize for cosine similarity search
    norm = np.linalg.norm(embedding)
    if norm > 0:
        embedding /= norm
    return embedding

def retrieve_candidates(
    seed_track_id: str,
    ann_index: "ANNTrackIndex",
    redis_client: redis.Redis,
    k: int = 500,
) -> List[Tuple[str, float]]:
    """
    Main candidate retrieval function.
    Returns list of (track_id, cosine_distance) sorted ascending (closest first).
    """
    embedding = get_seed_embedding(seed_track_id, redis_client)
    if embedding is None:
        # Fallback: return empty list - Radio Gen Service will use popularity signal
        return []

    candidates = ann_index.query(embedding, k=k)
    # Exclude the seed track itself from candidates
    candidates = [(tid, dist) for tid, dist in candidates if tid != seed_track_id]
    return candidates

Score Fusion: CF + Audio + Recency

# Score fusion combining CF score, audio similarity, and track recency
import numpy as np
import time
from typing import List, Dict, Tuple

def fuse_scores(
    ann_candidates: List[Tuple[str, float]],  # (track_id, cosine_distance)
    cf_scores: Dict[str, float],              # track_id -> CF score [0, 1]
    track_meta: Dict[str, dict],              # track_id -> {release_date, play_count, ...}
    cf_weight: float,
    recency_bonus_weight: float = 0.05,
) -> List[Tuple[str, float]]:
    """
    Fuse three signals:
    1. Audio similarity from ANN (converted from distance to similarity)
    2. CF score from matrix factorization
    3. Recency bonus for tracks released within last 30 days

    Final score = (1 - cf_weight - recency_bonus_weight) * audio_sim
                + cf_weight * cf_score
                + recency_bonus_weight * recency_bonus
    """
    audio_weight = 1.0 - cf_weight - recency_bonus_weight
    scored = []
    now_ts = time.time()

    for track_id, cosine_dist in ann_candidates:
        audio_sim = max(0.0, 1.0 - cosine_dist)  # cosine dist [0,2] -> sim [0,1]
        cf = cf_scores.get(track_id, 0.0)

        meta = track_meta.get(track_id, {})
        release_ts = meta.get("release_timestamp", 0)
        days_old = (now_ts - release_ts) / 86400
        # Recency bonus decays from 1.0 (just released) to 0.0 (30+ days old)
        recency = max(0.0, 1.0 - days_old / 30.0)

        score = audio_weight * audio_sim + cf_weight * cf + recency_bonus_weight * recency
        scored.append((track_id, score))

    scored.sort(key=lambda x: x[1], reverse=True)
    return scored

Bloom Filter for Session Dedup

# Bloom filter management in Redis for session deduplication
import redis
import math
import mmh3
from typing import List

BLOOM_KEY_PREFIX = "bloom:user:"
BLOOM_TTL_SECONDS = 60 * 60 * 24 * 30  # 30 days

def get_user_bloom(user_id: str, redis_client: redis.Redis) -> "UserBloomFilter":
    """Load user's bloom filter from Redis or return a fresh one."""
    raw = redis_client.get(f"{BLOOM_KEY_PREFIX}{user_id}")
    if raw is None:
        return UserBloomFilter()
    return UserBloomFilter.from_bytes(raw)

def save_user_bloom(user_id: str, bloom: "UserBloomFilter", redis_client: redis.Redis) -> None:
    """Persist updated bloom filter back to Redis with TTL refresh."""
    redis_client.set(
        f"{BLOOM_KEY_PREFIX}{user_id}",
        bloom.to_bytes(),
        ex=BLOOM_TTL_SECONDS,
    )

def mark_tracks_served(
    user_id: str,
    track_ids: List[str],
    redis_client: redis.Redis,
) -> None:
    """
    After serving a batch of tracks, add them to the user's bloom filter.
    Uses Redis PIPELINE to batch the read-modify-write atomically.
    """
    bloom = get_user_bloom(user_id, redis_client)
    for track_id in track_ids:
        bloom.add(track_id)
    save_user_bloom(user_id, bloom, redis_client)

Scaling and Performance

Scaling and sharding architecture

Capacity Estimation

Catalog:
  80M tracks * 128 dims * 4 bytes = 41GB raw embeddings
  HNSW index overhead (M=32): ~20 bytes/node = 1.6GB metadata
  Total ANN index: ~43GB
  With genre sharding (10 shards): ~4.3GB per shard (fits in 8GB RAM nodes)

User Bloom Filters:
  600M users * 480 bytes/bloom = ~288GB
  Redis cluster at 32GB/node: ~9 Redis nodes just for bloom filters
  With replication (x3): 27 Redis nodes

CF Model Factor Matrices:
  User factors: 600M users * 128 dims * 4 bytes = 307GB
  Track factors: 80M tracks * 128 dims * 4 bytes = 41GB
  Too large for single node: shard user factors by user_id range (10 shards)
  Each user-factor shard: ~30GB (fits in 64GB RAM node)

Radio Generation Service throughput:
  10M concurrent sessions; each session fetches next batch every ~3 minutes
  = 10M / 180s = ~55,000 RPS to the Radio Gen Service
  Per request: 1 ANN query (~15ms) + 1 CF lookup (~5ms) + 1 bloom check (~2ms)
    + score fusion (~1ms) + MMR selection (~5ms) = ~28ms per request
  Instances needed at 28ms avg, 200ms P99 budget:
    55,000 RPS * 0.028s = ~1,540 concurrent requests in flight
    With 32 threads per instance: ~48 Radio Gen Service instances

ANN Index shards:
  10 genre shards, 3 replicas each = 30 ANN index server instances
  Each receives: 55,000 RPS / 10 shards = 5,500 QPS per shard
  HNSW query at ef=500: ~8ms, throughput ~125 QPS/thread
  Threads needed per shard: 5,500 / 125 = ~44 threads
  2 instances per shard with 24 cores each = 48 threads (sufficient with headroom)
Real World

Spotify’s internal infrastructure team has published details on their “Cosmos” data processing platform and their use of HNSW for track recommendation. Apple Music uses a similar two-tower embedding approach for their Radio feature. YouTube Music shards their embedding index differently - by upload recency rather than genre - because their catalog grows at a faster rate than Spotify’s music catalog does. The HNSW library used in most production systems (hnswlib or the nmslib predecessor) was originally developed at Microsoft Research and is now the de facto standard for billion-scale ANN search.

Failure Modes and Recovery

FailureDetectionImpactRecovery
ANN index stale (>24h since rebuild)Index age monitor alertCandidates exclude tracks released in past 24h; existing tracks still servedAlert on-call; trigger incremental index rebuild for new tracks; stale candidates are acceptable for 24h
CF model offlinegRPC health check fails; timeout >50msRadio generation falls back to pure audio similarity modeCircuit breaker opens; Radio Gen Service uses audio-only scoring; degrade gracefully, no user error
Feature Store (Redis) downMGET timeout >5msSeed embedding unavailable; cannot seed ANN queryFall back to popularity-based candidates for that genre; cache last 100 embeddings in Radio Gen Service local memory
Bloom filter hash collision (false positive)Cannot detect at runtime; estimated 1% FPRRare tracks incorrectly filtered from candidatesAcceptable by design; false positive rate is a tunable parameter; reduce FPR to 0.1% by doubling bloom filter size
Session store failureRedis SET returns errorCannot persist bloom filter updates; dedup state lostRadio Gen Service continues serving without dedup for the session; next session starts fresh bloom filter

Comparison of Approaches

ApproachLatencyPersonalizationDiversity ControlCold StartBest Fit
Pure Collaborative FilteringMedium - CF requires user and item factor lookup; 50-100msHigh - directly models user tasteWeak - CF alone tends to surface popular tracksFails - no CF signal for new tracksMature catalog, dense interaction matrix
Pure Audio FeaturesLow - ANN search is fast; 15-30msNone - same results for all usersGood - acoustic diversity enforced via MMRExcellent - audio features available immediatelyNew platform, cold start heavy, any user
Hybrid CF + Audio (this design)Low-medium - parallel calls; 25-50msHigh - CF weighted by user history depthGood - MMR on audio embeddings + CF scoreHandled - audio-only fallback for new tracksProduction at scale with mixed user tenure
Graph-Based (Listen Graph)High - graph traversal; 200-500msVery high - captures social listening patternsDepends on graph structurePoor - new tracks have no edgesSocial music platforms, last.fm-style scrobbling
Session-Aware HybridMedium - same as hybrid + session context; 40-70msHighest - adapts within a single sessionBest - tracks current session trajectoryHandled via audio fallbackPremium experience where within-session coherence is critical

Key Takeaways

  • Parallel fan-out is non-negotiable: the ANN index query and CF model lookup must run concurrently. Sequential execution would double latency and violate the 500ms SLA.
  • Adaptive CF weighting beats a fixed blend: users with 10 listens and users with 10,000 listens should not receive the same CF-to-audio ratio. Adaptive blending produces measurably better playlist quality for both groups.
  • Bloom filters are the right dedup data structure: 480 bytes per user versus kilobytes for a raw set. At 600M users, the difference between 288GB and multiple terabytes is the difference between a manageable Redis cluster and an unaffordable one.
  • Genre-sharded ANN indexes beat a single global index: genre sharding reduces per-query scan space and concentrates cache warming on acoustically relevant tracks. A global index at 80M tracks would require 43GB per ANN node, pricing out most commodity hardware.
  • Cold start via audio similarity is a feature, not a limitation: new tracks discovered through pure acoustic similarity often surprise users pleasantly because they surface hidden gems that have not yet accumulated social proof. Frame cold start as discovery, not degradation.
  • MMR diversity selection is a tunable dial: lambda_diversity=0.3 is a starting point. A/B test it against session skip rate and session length. Lower values create tighter acoustic cohesion; higher values introduce more surprise at the cost of coherence.
  • Session dedup state belongs in Redis, not in the recommendation model: trying to encode dedup state into the model (e.g., training on negative labels for recently played tracks) makes the model context-dependent and hard to evaluate offline. Keep session state in a fast store and apply it as a post-filter.
  • Degraded mode must be defined before the pager fires: when the CF model goes down, pure audio similarity must already be wired as the fallback. A system that returns errors when the CF model is unavailable is unacceptable for a feature that 600 million users touch daily.

Frequently Asked Questions

Q: Why not use a transformer model for recommendation instead of matrix factorization? Transformer-based sequential recommendation models (like SASRec or BERT4Rec) are state of the art in offline metrics. The problem is inference latency. A transformer running attention over a user’s full listening history produces a context-aware embedding in 100-300ms per user at inference time. Matrix factorization produces a static user embedding that is read from a store in 2ms. At 55,000 RPS, the latency difference translates directly into a 10x infrastructure cost difference. Transformers are the right choice for the batch-computed “Discover Weekly” use case; static embeddings are the right choice for real-time radio.

Q: Why not build one giant FAISS index with all 80M tracks instead of sharding by genre? A single flat FAISS index at 80M tracks x 128 dims requires 41GB of RAM per replica. With 3 replicas, that is 123GB just for the index, requiring expensive high-memory instances. Genre sharding brings each shard to ~4GB, which runs comfortably on commodity 8GB nodes. More importantly, genre-local queries have higher precision - a jazz seed querying only jazz-adjacent shards returns candidates with a higher floor quality than querying all genres and discarding 90% of results.

Q: Why not store user listening history in the bloom filter instead of tracking last-5-sessions separately? A bloom filter is write-only - you cannot remove items from it. If we added every track a user has ever heard to the bloom filter, it would fill and produce excessive false positives within a few thousand tracks. The 5-session window is intentional: radio is meant to recycle tracks over longer time horizons. Listening to a great track in a session six months ago should not prevent it from appearing in today’s radio. The rolling 5-session window balances freshness against variety.

Q: Why 500 ANN candidates instead of 100 or 1000? Too few candidates (100) limits the diversity the MMR selector can inject - if 80 of the 100 are acoustically identical, the final 50-track playlist will be monotonous regardless of the lambda_diversity setting. Too many candidates (1000) adds latency in the CF scoring step - scoring 1000 candidates via dot product is 2x more compute than 500. The CF scoring step is the bottleneck, not the ANN query. 500 candidates gives enough raw material for diversity while keeping CF scoring under 5ms on a CPU.

Q: How do you handle a seed track that is not in the ANN index - for example, a track a user uploaded themselves? User-uploaded tracks are not processed through the audio analysis pipeline immediately. The Radio Generation Service detects that the seed embedding is absent from the Feature Store and falls back to a “genre radio” mode: the user’s listening history is used to infer a genre, and the 500 most popular tracks from that genre over the past 30 days are used as the candidate pool. This is a significantly weaker experience but gracefully avoids an error response.

Q: Why use cosine distance for the ANN index instead of L2 (Euclidean) distance? Cosine distance measures the angle between embeddings, not their magnitude. Two tracks with very different loudness profiles will have embeddings with different magnitudes (L2 norms) but potentially the same acoustic character. L2 distance would penalize magnitude differences that are acoustically meaningless. Cosine distance - or equivalently, dot product after L2 normalizing all embeddings - captures directional similarity in the embedding space, which correlates better with perceived musical similarity than raw L2 distance does.

Interview Questions

Q: Walk me through what happens from the moment a user taps “Start Radio” on a seed track to when the first song starts playing.

Expected depth: Client sends POST /api/radio/start with track_id and user_id to the API Gateway. Gateway validates auth token, forwards to Radio Generation Service via gRPC. Service issues two parallel calls: (1) Redis Feature Store lookup for the seed’s 128-d embedding, (2) Redis lookup for the user’s bloom filter. Once the embedding arrives, it fans out to the genre-appropriate ANN index shard for a k=500 HNSW query (~15ms). Simultaneously, the user’s CF embedding is fetched from the user factor store (~5ms). Score fusion runs in ~1ms. Bloom filter filters candidates (~2ms). MMR selects 50 diverse tracks (~5ms). First batch returned to client. Total: ~30ms median, ~120ms P99. Client buffers the first batch and starts playing track 1 while prefetching the next batch.

Q: The system is returning noticeably worse recommendations for users who have just signed up versus users with 5 years of listening history. How do you diagnose and fix this?

Expected depth: Root cause is almost certainly the CF weight for new users. New users have sparse listening history - maybe 10-50 tracks - and the CF model’s user embedding for them is noisy. If compute_blend_weight is returning 0.5 for new users (because they have 200 listens), the CF signal is dominating but the signal quality is poor. Diagnosis: segment recommendation quality metrics (skip rate, session length) by user tenure. Fix: lower the CF weight threshold for users under 12 months old; use listening hour count as a more reliable freshness signal than absolute play count. Also check whether new users are receiving the cold-start treatment correctly for recently released tracks they encounter.

Q: You need to add a “mood” filter to Song Radio - users can specify “energetic,” “calm,” or “happy” and the radio should bias toward that mood. How do you extend the architecture?

Expected depth: Mood maps directly to scalar audio features: energetic = high energy (above 0.7), calm = low energy (below 0.4) + low tempo, happy = high valence (above 0.6). The simplest extension adds a mood constraint as a post-filter on the ANN candidates before CF scoring. A more sophisticated approach adds a mood bias term to the score fusion: mood_bonus = 0.1 * mood_match(track_features, user_mood). The ANN index itself does not need to change - cosine similarity still returns acoustically similar candidates regardless of mood. The mood filter is applied in the Radio Generation Service before MMR selection. Cold start tracks that match the mood should receive a higher wildcard injection rate.

Q: Song Radio is consuming 40% of your total Redis cluster memory. How do you reduce it without degrading the user experience?

Expected depth: Three main consumers - user bloom filters (288GB), user CF embeddings (dynamic cache), and seed audio embeddings (80M * 512 bytes = 41GB). Bloom filter size can be reduced by lowering the 5-session window to 3 sessions (40% reduction in bit array size) or by increasing the false positive rate from 1% to 2% (halves the bit array size). For CF embeddings, ensure the TTL is set appropriately - users who have not started a radio session in 7 days should have their embedding evicted. For audio embeddings, consider storing only the most frequently seeded tracks (top 10M by play count) in Redis and falling back to PostgreSQL for the remaining 70M. A tiered caching strategy - Redis for hot embeddings, PostgreSQL for cold - can reduce Redis embedding memory by 80% with minimal latency impact for popular seeds.

Q: How would you add real-time feedback - if a user skips a track in the radio session, bias away from similar tracks immediately within the same session?

Expected depth: Real-time within-session feedback requires a session-scoped “dislike embedding.” When a user skips track T, retrieve T’s audio embedding and add it to a “session repulsion vector” - a running average of disliked track embeddings. In the next score fusion call, subtract the dot product of each candidate’s embedding against the repulsion vector from the composite score: adjusted_score = composite - repulsion_weight * dot(candidate_emb, repulsion_vec). This biases the MMR selection away from tracks acoustically similar to skipped tracks without requiring a model retrain. The repulsion vector is stored in session state (Redis key tied to session_id, TTL matching session expiry) and grows incrementally as the user provides feedback. This is pure real-time inference - no model update required.

Premium Content

Unlock the full article along with everything else in the archive — all in one place.

In-depth analysis Expert insights Full archive access
Unlock Full Article