Build a Personalized Typeahead Search Engine


performance databases scalability

System Design Deep Dive

Personalized Typeahead Search Engine

Same prefix, different results per user - blending global frequency, personal history, and real-time trending signals under 50ms end-to-end

⏱ 14 min read📐 Advanced🏗️ Typeahead

Picture a public library reference desk staffed by two librarians simultaneously. The first librarian has memorized every question ever asked in every branch across the country and can rank topics by global popularity. The second knows only you - every title you have ever requested, which you finished, which you abandoned, and how long ago. When you begin a question with “py…”, the first librarian is already pulling “Python”, “PyTorch”, and “Python Tutorial” from the national frequency list. The second is whispering “you searched python dataclasses three times last week, put that first.” The challenge is that both librarians must finish arguing and hand you a unified answer in 50 milliseconds - before you type the third character.

That is the personalized typeahead problem. It is harder than standard autocomplete because the same two-character prefix must produce different ranked results for a machine-learning engineer, a Python beginner, and someone who has never typed “py” before. The trie index that drives global suggestions is shared and cacheable. Personal signals are per-user and cannot be cached at the prefix layer without multiplying cache storage by the number of users. Trending queries spike without warning and must appear in suggestions within minutes. All three signals must converge into a single ranked list under a 50ms budget that includes network round-trip time.

Three forces pull in opposite directions. Caching wants to serve the same result for every “py” query regardless of who is asking. Personalization wants to serve different results for every user. Freshness wants to update those results whenever a query spikes in volume. A system that optimizes for any one of these forces at the expense of the others will fail the product requirements. We need to design around all three simultaneously.

Requirements and Constraints

Functional Requirements

  • Return the top 10 search suggestions for any typed prefix starting at 2 characters
  • The same prefix must return different ranked results for different users based on their search history
  • Globally trending queries from the last 5-15 minutes must appear in suggestions when relevant to the prefix
  • Support cold-start users who have no personal history - fall back to segment-level preferences
  • Support A/B testing of ranking algorithms per user cohort without affecting the core index
  • All suggestions must pass a content policy blocklist check before returning

Non-Functional Requirements

  • Latency: p99 under 50ms end-to-end including network; p50 under 15ms
  • Throughput: 150,000 suggestion requests per second at peak (3-5x the actual search submission rate)
  • Availability: 99.99% - typeahead is on the critical interaction path; failure causes perceived slowness even when search itself is up
  • Freshness: trending queries must appear in suggestions within 10 minutes of onset; index rebuilds every 24 hours for global scores
  • Storage: global index covers 5 billion distinct query strings; personalization state for up to 2 billion active users with up to 500 stored prefixes each

Constraints and Scope

  • We are not building spell-correction or semantic intent understanding - those are downstream of suggestions
  • Content moderation runs asynchronously; flags are pre-applied to the index and to a real-time Redis blocklist
  • User auth tokens arrive on the suggestion request and are used for personalization lookups
  • Geographic trending (regional hot queries) is a future extension; architecture must not preclude it

High-Level Architecture

The system has five distinct planes: a client debounce layer that converts keystrokes into throttled HTTP requests, an API gateway for auth and prefix-based routing, a suggestion blender that orchestrates the read path, a prefix cache per user segment that absorbs repeat traffic, and an async trending pipeline that continuously injects freshness signals.

Personalized typeahead search system architecture overview

On the read path, a keypress fires a debounced request to the API gateway after 120ms of idle time. The gateway authenticates the user, extracts the prefix, and routes to the correct trie shard based on hash(prefix[0:2]) % num_shards. The Suggestion Blender checks Redis for a cached result keyed by suggest:{prefix}:{user_segment} - a segment-level cache that provides partial personalization without per-user cache explosion. On a miss, the Blender fans out in parallel to the Trie Index Service (global top-50 candidates) and the Personalization Service (user history from Cassandra). The two result sets are merged using decay-weighted scoring, trending boosts are injected from a Redis Sorted Set, the blocklist is checked, and the top-10 are returned. Simultaneously, the query event is published to Kafka for the trending pipeline.

Each service has a single responsibility. The Trie Index Service holds the in-memory prefix tree and returns GetTopK(prefix, 50). The Personalization Service returns a map of {query_text: decay_score} for a given user. The Suggestion Blender does all the merging. The Trending Pipeline (Kafka + Flink + Trie Updater) closes the feedback loop and keeps the index fresh.

Key Insight

We cache at the segment level rather than the user level. Users are bucketed into ~1,000 segments based on broad behavioral signals (language, device category, prior topic affinity). The segment cache absorbs 60-70% of cache traffic while requiring only 1,000x the storage of a single-prefix cache - not 2 billion times. The remaining personalization is applied in the final re-ranking step at request time.

The Trie Index Service

The Trie Index Service is responsible for returning the top-50 globally scored completion candidates for any prefix in under 2ms. It is the foundation that both the cached path and the cache-miss path build on.

Think of it like a pre-printed phone book index with thumbtabs. When you flip to “Py”, you instantly see the 50 most-called numbers starting with “Py” - you do not need to scan the full “P” section. The trie achieves this by caching pre-computed top-K lists at every internal node, not just at leaf nodes. When GetTopK("py", 50) arrives, the service traverses 2 trie nodes and reads the cached list. O(prefix_length) time, no subtree traversal.

// Trie node with cached top-K completions for O(prefix_length) lookup
package trie

import "sync"

type Suggestion struct {
	Text        string
	GlobalScore float64
	TrendBoost  float64
}

type TrieNode struct {
	mu       sync.RWMutex
	children map[rune]*TrieNode
	isEnd    bool
	topK     []Suggestion // pre-sorted descending by blended score
}

type Trie struct {
	root *TrieNode
}

// GetTopK returns the cached top-K without any subtree traversal
func (t *Trie) GetTopK(prefix string, k int) []Suggestion {
	node := t.root
	for _, ch := range prefix {
		node.mu.RLock()
		child, ok := node.children[ch]
		node.mu.RUnlock()
		if !ok {
			return nil
		}
		node = child
	}
	node.mu.RLock()
	defer node.mu.RUnlock()
	if len(node.topK) <= k {
		result := make([]Suggestion, len(node.topK))
		copy(result, node.topK)
		return result
	}
	result := make([]Suggestion, k)
	copy(result, node.topK[:k])
	return result
}

// UpdateQueryScore updates a query's score and propagates upward
func (t *Trie) UpdateQueryScore(query string, newScore float64, trendBoost float64) {
	path := []*TrieNode{t.root}
	node := t.root
	for _, ch := range query {
		node.mu.RLock()
		child, ok := node.children[ch]
		node.mu.RUnlock()
		if !ok {
			return
		}
		node = child
		path = append(path, node)
	}
	// Update terminal node score
	terminal := path[len(path)-1]
	terminal.mu.Lock()
	for i := range terminal.topK {
		if terminal.topK[i].Text == query {
			terminal.topK[i].GlobalScore = newScore
			terminal.topK[i].TrendBoost = trendBoost
			break
		}
	}
	sortTopK(terminal.topK)
	terminal.mu.Unlock()
	// Bottom-up propagation: re-merge top-K at each ancestor
	for i := len(path) - 2; i >= 0; i-- {
		parent := path[i]
		parent.mu.Lock()
		parent.topK = mergeChildrenTopK(parent, 50)
		parent.mu.Unlock()
	}
}

The trie is partitioned across 32 shards using the first two characters of the prefix as the shard key. With 5 billion query strings averaging 25 bytes each and top-K lists at every node (roughly 200 million internal nodes each caching 50 suggestions at 60 bytes each), per-shard memory lands around 110 GB - requiring 128-256 GB RAM instances.

Watch Out

Never update the live trie under a write lock during high traffic. Trending score updates affect every ancestor node from the updated leaf to the root - for a 15-character query, that is 15 lock acquisitions on the write path, each blocking all concurrent readers on that node. Instead, build a delta list of updated queries, apply them in a batch during a low-traffic window using copy-on-write semantics, then atomically swap the updated node list.

Suggestion Blender internal architecture showing global, personal, and trending signal paths

The User Query History Store

The User Query History Store is responsible for returning a user’s prefix-relevant search history in under 5ms so the Personalization Service can compute boost multipliers.

The natural partition key is user_id. Each row stores a (user_id, query_text) pair with a count of how many times the user searched that query and the timestamp of the last occurrence. The Personalization Service queries for all rows where the query_text starts with the typed prefix - not a LIKE query against a relational DB, but a direct row lookup since we store history by query text, not by prefix.

-- Cassandra logical schema for user query history
-- Partition key: user_id ensures single-node reads
-- Clustering key: last_searched_ts DESC surfaces recency first

CREATE TABLE user_search_history (
    user_id          UUID         NOT NULL,
    query_text       TEXT         NOT NULL,
    search_count     INT          NOT NULL DEFAULT 1,
    last_searched_ts TIMESTAMPTZ  NOT NULL,
    clicked_result   BOOLEAN      NOT NULL DEFAULT FALSE,
    decay_score      DOUBLE PRECISION NOT NULL DEFAULT 0.0,
    PRIMARY KEY (user_id, query_text)
) WITH CLUSTERING ORDER BY (query_text ASC);

-- For prefix-based history lookup per user
CREATE INDEX ON user_search_history (user_id, last_searched_ts);

-- Global query frequency table (source of truth for offline index build)
CREATE TABLE query_frequencies (
    query_text       TEXT             NOT NULL PRIMARY KEY,
    raw_count        BIGINT           NOT NULL DEFAULT 0,
    click_count      BIGINT           NOT NULL DEFAULT 0,
    first_seen_date  DATE             NOT NULL,
    last_seen_ts     TIMESTAMPTZ      NOT NULL,
    global_score     DOUBLE PRECISION NOT NULL DEFAULT 0.0,
    is_blocked       BOOLEAN          NOT NULL DEFAULT FALSE
);

CREATE INDEX idx_qf_score ON query_frequencies (global_score DESC)
WHERE is_blocked = FALSE;

-- Trending scores emitted by Flink, consumed by Trie Updater
CREATE TABLE trending_scores (
    query_text       TEXT             NOT NULL,
    window_start_ts  TIMESTAMPTZ      NOT NULL,
    window_count     BIGINT           NOT NULL,
    prev_window_count BIGINT          NOT NULL DEFAULT 0,
    velocity         DOUBLE PRECISION NOT NULL,
    trend_boost      DOUBLE PRECISION NOT NULL DEFAULT 1.0,
    PRIMARY KEY (query_text, window_start_ts)
);

CREATE INDEX idx_ts_boost ON trending_scores (trend_boost DESC, window_start_ts DESC);

-- User segment assignments for segment-level caching
CREATE TABLE user_segments (
    user_id          UUID    NOT NULL PRIMARY KEY,
    segment_id       INT     NOT NULL,
    computed_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

The Personalization Service receives a (user_id, prefix) pair and fetches all rows from user_search_history where query_text >= prefix AND query_text < prefix_successor(prefix). For Cassandra, this is a range scan on the clustering key within the user’s partition - still a single-node operation. The result is a map from query_text to history metadata, which is passed to the decay scoring function.

Key Insight

We store history by full query_text, not by prefix. This means a user who searched “python dataclasses” has one row for that full query, not separate rows for “p”, “py”, “pyt” etc. At personalization time, the service does a prefix range scan within the user’s Cassandra partition, which is fast because the clustering key is already sorted alphabetically. No denormalized prefix writes needed.

Decay-Based Scoring

The decay-based scoring function is responsible for combining global query frequency with personal usage history and time-recency into a single blended score that determines final ranking.

A stock ticker analogy is useful here: global frequency is the stock’s 52-week volume - a measure of broad, stable popularity. Personal history is your own trading pattern with that stock. The decay function is the time-weighting that makes recent activity matter more than old activity. A query the user searched yesterday should outrank one they searched 6 months ago, even if the old query had 3x the historical count.

# Decay-based scoring: blends global frequency + personal usage + trend velocity
import math
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class PersonalHistory:
    count: int
    last_used_ts: int  # unix seconds
    clicked_result: bool

def compute_global_score(
    raw_count: int,
    days_active: int,
    click_through_rate: float,
    trend_boost: float = 1.0,
    is_blocked: bool = False,
) -> float:
    """Base global score before personalization."""
    if is_blocked:
        return 0.0
    # Log-normalize to compress the power-law distribution
    log_count = math.log(1 + raw_count)
    # Soft time decay: older queries lose authority slowly
    # half-life approximately 365 days
    time_decay = 1.0 / (1.0 + math.exp(-0.005 * (days_active - 180)))
    # CTR signal: queries people click on are more useful
    ctr_signal = 1.0 + (click_through_rate * 0.5)
    return log_count * time_decay * ctr_signal * trend_boost


def compute_decay_score(history: PersonalHistory) -> float:
    """
    Personal history score with exponential time decay.
    Half-life: 30 days. Recent click doubles the boost.
    """
    now = int(time.time())
    days_ago = (now - history.last_used_ts) / 86400.0
    # Exponential decay: score halves every 30 days
    decay_factor = math.exp(-0.693 * days_ago / 30.0)
    count_signal = math.log(1 + history.count)
    click_bonus = 1.5 if history.clicked_result else 1.0
    return count_signal * decay_factor * click_bonus


def blend_scores(
    global_score: float,
    personal_history: Optional[PersonalHistory],
    alpha: float = 0.6,
) -> float:
    """
    Blend global and personal scores.
    alpha=0.6 means global gets 60% weight for users with no history on a query.
    For users with strong history, personal_boost can exceed global entirely.
    """
    if personal_history is None or personal_history.count == 0:
        return global_score
    personal_score = compute_decay_score(personal_history)
    # Additive blend: personal boost is additive, not multiplicative
    # This prevents personal history from completely suppressing global authority
    return (alpha * global_score) + ((1 - alpha) * global_score) + personal_score


def compute_cold_start_score(
    global_score: float,
    segment_affinity: float,  # 0.0 to 1.0, from segment model
) -> float:
    """For users with no history, use segment-level affinity as a weak personal signal."""
    return global_score * (1.0 + 0.3 * segment_affinity)

Log-normalization of raw counts is critical. Query frequencies follow a power law - “the” has orders of magnitude more occurrences than “python dataclasses”. Without log-scaling, personal boost signals (which are small integers, not billions) are always dwarfed by the raw count difference. After log-normalization, a user who searched “python dataclasses” 5 times last week can actually see it promoted above “python” in their personal results.

Real World

Google’s Search-as-you-type system has used a combination of server-side personalization and client-side recent-search injection for over a decade. Queries you searched in the last hour are surfaced by the browser client before the server response even arrives - the client overlay has zero latency because it reads from localStorage. Server personalization picks up longer-term patterns spanning days and months.

The Trending Signal Pipeline is responsible for detecting queries that are spiking in volume right now and injecting a velocity-based boost multiplier into the suggestion system within 10 minutes of spike onset.

The pipeline is a three-stage stream: Kafka receives raw query events at 150,000 events per second; Flink computes 5-minute tumbling window counts and compares to the previous window to derive velocity; the Trie Updater applies boost multipliers to affected trie nodes and invalidates stale Redis cache keys.

# Flink-equivalent trending score computation (Python pseudocode matching Flink API)
from dataclasses import dataclass
from typing import Dict
import math

@dataclass
class WindowCount:
    query_text: str
    window_start: int  # unix seconds
    count: int

@dataclass
class TrendingScore:
    query_text: str
    trend_boost: float
    window_count: int

# In-process state store mapping query_text -> previous window count
prev_window_state: Dict[str, int] = {}

def compute_trending_score(window_result: WindowCount) -> TrendingScore:
    """
    Velocity = (current_count - prev_count) / prev_count
    trend_boost = 1.0 + log(velocity) if velocity > 2.0, else 1.0
    Capped at 5.0x to prevent a single viral event from dominating all suggestions.
    """
    prev_count = prev_window_state.get(window_result.query_text, 0)
    if prev_count == 0:
        velocity = 1.0
    else:
        velocity = window_result.count / max(1, prev_count)
    prev_window_state[window_result.query_text] = window_result.count
    if velocity > 2.0:
        boost = min(5.0, 1.0 + math.log(velocity))
    else:
        boost = 1.0
    return TrendingScore(
        query_text=window_result.query_text,
        trend_boost=boost,
        window_count=window_result.count,
    )


def inject_trending_into_redis(
    redis_client,
    trending_scores: list,
    trending_set_key: str = "trending:global",
    ttl_seconds: int = 900,  # 15 minutes
) -> None:
    """
    Write trending queries to a Redis Sorted Set keyed by trend_boost score.
    The Suggestion Blender reads the top-N from this set at request time.
    TTL ensures stale trends auto-expire.
    """
    pipe = redis_client.pipeline()
    pipe.delete(trending_set_key)
    for ts in trending_scores:
        if ts.trend_boost > 1.0:
            pipe.zadd(trending_set_key, {ts.query_text: ts.trend_boost})
    pipe.expire(trending_set_key, ttl_seconds)
    pipe.execute()

At request time, the Suggestion Blender does a ZREVRANGEBYSCORE trending:global +inf 1.5 LIMIT 0 20 to retrieve the top-20 currently trending queries. It then checks whether any of those trending queries start with the user’s typed prefix and, if so, includes them in the candidate pool with their trend_boost multiplier applied. This means trending injection is a Redis read at request time - not a lookup that blocks on the Flink pipeline.

Watch Out

Never inject trending queries unconditionally. A trending query like “earthquake los angeles” should only appear in suggestions if it starts with the typed prefix. Injecting it into suggestions for “e” is appropriate; injecting it for “py” is noise that hurts suggestion quality. Always filter trending candidates against the prefix before including them in the candidate pool.

Prefix Cache Per User Segment

The prefix cache layer is responsible for eliminating trie and personalization round-trips for the majority of requests, while still providing meaningful personalization without per-user cache entries.

The naive personalized cache key is suggest:{prefix}:{user_id}. At 150,000 QPS and 2 billion users, this creates a cache space too large for any Redis cluster to hold in memory - and a cold-cache cascade any time a user’s first session starts. The solution is segment-level caching: bucket users into ~1,000 behavioral segments and cache results at suggest:{prefix}:{segment_id}.

# Segment assignment and cache key construction
import hashlib
from typing import Optional
import redis

def assign_user_segment(
    user_id: str,
    primary_language: str,
    device_type: str,  # "mobile" | "desktop" | "tablet"
    top_category: str,  # most-searched topic category
    num_segments: int = 1000,
) -> int:
    """
    Deterministic segment assignment based on user behavioral signals.
    Users in the same segment share a prefix cache - close enough for
    85-90% of personalization value without per-user cache entries.
    """
    segment_key = f"{primary_language}:{device_type}:{top_category}"
    h = int(hashlib.md5(segment_key.encode()).hexdigest(), 16)
    return h % num_segments


def get_cached_suggestions(
    redis_client: redis.Redis,
    prefix: str,
    segment_id: int,
) -> Optional[list]:
    """Cache lookup with segment-level key."""
    cache_key = f"suggest:{prefix}:{segment_id}"
    cached = redis_client.get(cache_key)
    if cached:
        return deserialize(cached)
    return None


def set_cached_suggestions(
    redis_client: redis.Redis,
    prefix: str,
    segment_id: int,
    suggestions: list,
    ttl_seconds: int = 60,
) -> None:
    """
    TTL strategy:
    - Prefixes of length 1-2: TTL 30s (very hot, high churn from trending)
    - Prefixes of length 3-4: TTL 60s
    - Prefixes of length 5+: TTL 300s (less traffic, results more stable)
    """
    if len(prefix) <= 2:
        ttl_seconds = 30
    elif len(prefix) <= 4:
        ttl_seconds = 60
    else:
        ttl_seconds = 300
    cache_key = f"suggest:{prefix}:{segment_id}"
    redis_client.setex(cache_key, ttl_seconds, serialize(suggestions))

The final per-user personalization delta (the decay-score boost for queries the specific user has searched) is applied after the segment cache is fetched. This means a segment cache hit still gets full personal history applied - it just skips the trie and trending lookups, which together account for most of the latency. For users with strong personal signals, personal queries not in the segment cache result can surface from a secondary lookup to Cassandra.

Key Insight

The segment cache delivers 60-70% of personalization value while reducing cache storage by 2,000,000x compared to per-user caching. The remaining personalization gap (the delta between segment-average and individual user) is covered by the fast Cassandra personal history lookup, which runs in parallel with the cache check. If the cache hits, we skip the trie lookup but still apply personal history to the cached global candidates.

Cold Start for New Users

New users have no personal history, which means the Personalization Service has nothing to return and the Suggestion Blender has no personal signals to apply. The cold start resolver fills this gap.

# Cold start resolution: segment fallback + onboarding signal injection
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ColdStartContext:
    user_id: str
    segment_id: int
    onboarding_topics: List[str] = field(default_factory=list)
    # Topics declared during signup (e.g. "I am interested in: ML, DevOps")
    is_new_user: bool = True

def resolve_cold_start(
    global_top50: list,
    cold_start_ctx: ColdStartContext,
    segment_affinities: dict,  # segment_id -> {query: affinity_score}
) -> list:
    """
    For new users:
    1. Apply segment-level affinity scores as weak personal signals
    2. Boost any queries matching onboarding-declared topics
    3. Return the adjusted list - never surfaces less than global top-10
    """
    segment_affs = segment_affinities.get(cold_start_ctx.segment_id, {})
    adjusted = []
    for suggestion in global_top50:
        score = suggestion["score"]
        # Weak segment affinity boost (max 20% uplift)
        seg_aff = segment_affs.get(suggestion["text"], 0.0)
        score = score * (1.0 + 0.2 * seg_aff)
        # Stronger boost for declared onboarding topics
        for topic in cold_start_ctx.onboarding_topics:
            if topic.lower() in suggestion["text"].lower():
                score = score * 1.5
                break
        adjusted.append({**suggestion, "score": score, "source": "cold_start"})
    adjusted.sort(key=lambda x: x["score"], reverse=True)
    return adjusted[:10]

Cold start does not require a special code path in the Suggestion Blender. The Personalization Service returns an empty history map for new users, which the Blender treats as “no personal boost”. The Cold Start Resolver then runs as a post-processing step, applying segment affinity scores from a pre-computed segment affinity table. Within 3-5 searches, the user accumulates enough history that the standard decay-scoring path takes over completely.

Real World

YouTube’s recommendation system handles cold start by asking new users to select topics of interest during onboarding, then using those selections to seed the initial recommendation model. The same principle applies to typeahead: onboarding topic selection gives the system enough signal to provide 70-80% of a personalized experience from the first keystroke, before any actual search history exists.

A/B Testing Suggestion Quality

The A/B testing layer is responsible for routing each user to a specific ranking variant so we can measure whether a new blending formula, decay parameter, or trending boost cap improves click-through rate on suggestions.

# A/B variant selection and metric tracking for suggestion quality
import hashlib
from typing import Optional

EXPERIMENT_CONFIG = {
    "exp_decay_halflife": {
        "variants": {
            "control": {"halflife_days": 30},
            "short_halflife": {"halflife_days": 14},
            "long_halflife": {"halflife_days": 60},
        },
        "traffic_split": {"control": 0.5, "short_halflife": 0.25, "long_halflife": 0.25},
    }
}

def get_experiment_variant(
    user_id: str,
    experiment_name: str,
) -> Optional[str]:
    """
    Deterministic variant assignment via user_id hash.
    Same user always gets the same variant within an experiment.
    """
    config = EXPERIMENT_CONFIG.get(experiment_name)
    if not config:
        return "control"
    h = int(hashlib.sha256(f"{experiment_name}:{user_id}".encode()).hexdigest(), 16)
    bucket = (h % 1000) / 1000.0  # 0.0 to 0.999
    cumulative = 0.0
    for variant, fraction in config["traffic_split"].items():
        cumulative += fraction
        if bucket < cumulative:
            return variant
    return "control"


def log_suggestion_event(
    event_type: str,  # "shown" | "clicked" | "ignored"
    user_id: str,
    prefix: str,
    suggestion_text: str,
    rank: int,
    variant: str,
    experiment_name: str,
) -> None:
    """
    Emit a structured event to Kafka for offline A/B analysis.
    Click-through rate per variant per prefix length is the primary metric.
    """
    event = {
        "event_type": event_type,
        "user_id": user_id,
        "prefix_length": len(prefix),
        "suggestion_text": suggestion_text,
        "rank": rank,
        "variant": variant,
        "experiment_name": experiment_name,
        "ts": int(time.time() * 1000),
    }
    kafka_producer.send("suggestion-events", value=event)

The key metric for suggestion quality is not click-through rate on the suggestion itself, but suggestion acceptance rate - the fraction of times the user accepted the suggestion (stopped typing and submitted the suggested query) versus typed past it. A suggestion the user types past is a failure even if they eventually click a search result. We track both metrics per rank position and per prefix length.

Key Insight

A/B testing typeahead is harder than testing a search results page because the “impression” is ambiguous - was the user exposed to a suggestion if they typed faster than the debounce fires? Track only suggestions that appeared for at least 200ms to avoid logging impressions the user never actually saw. Use the first suggestion exposure time as the impression timestamp, not the keypress time.

Key Algorithms and Protocols

Request data flow from keypress to 50ms personalized response

Global vs Personal Suggestion Blend

The core merge algorithm takes 50 global candidates and a map of personal history and produces a final ranked top-10. The algorithm is additive rather than multiplicative for personal boosts, which prevents a user’s most-searched query from completely suppressing globally authoritative results.

# Final suggestion merge algorithm
from typing import List, Dict, Optional
from dataclasses import dataclass
import math
import time

@dataclass
class RankedSuggestion:
    text: str
    final_score: float
    source: str  # "global" | "personal_boost" | "trending" | "cold_start"

def merge_suggestions(
    global_candidates: List[dict],        # [{text, global_score, trend_boost}]
    personal_history: Dict[str, dict],    # {query_text: {count, last_used_ts, clicked}}
    trending_prefix_matches: List[dict],  # [{text, trend_boost}] matching current prefix
    max_results: int = 10,
    personal_weight: float = 0.4,
) -> List[RankedSuggestion]:
    now = int(time.time())
    scored = {}

    for c in global_candidates:
        text = c["text"]
        g_score = c["global_score"] * c.get("trend_boost", 1.0)
        personal_boost = 0.0
        source = "global"

        if text in personal_history:
            h = personal_history[text]
            days_ago = (now - h["last_used_ts"]) / 86400.0
            decay = math.exp(-0.693 * days_ago / 30.0)
            personal_boost = math.log(1 + h["count"]) * decay
            if h.get("clicked"):
                personal_boost *= 1.5
            source = "personal_boost"

        # Additive: personal_boost adds to global, never replaces it
        final = g_score + (personal_weight * personal_boost)
        scored[text] = RankedSuggestion(text=text, final_score=final, source=source)

    # Surface personal queries not in global top-50
    global_texts = set(scored.keys())
    for query, h in personal_history.items():
        if query not in global_texts:
            days_ago = (now - h["last_used_ts"]) / 86400.0
            if h["count"] >= 2 and days_ago < 90:
                decay = math.exp(-0.693 * days_ago / 30.0)
                p_score = math.log(1 + h["count"]) * decay
                scored[query] = RankedSuggestion(
                    text=query, final_score=p_score, source="personal"
                )

    # Inject trending matches not already in pool
    trending_texts = {t["text"] for t in trending_prefix_matches}
    for t in trending_prefix_matches:
        if t["text"] not in scored:
            scored[t["text"]] = RankedSuggestion(
                text=t["text"],
                final_score=t["trend_boost"] * 2.0,
                source="trending",
            )

    results = sorted(scored.values(), key=lambda s: s.final_score, reverse=True)
    return results[:max_results]

Prefix Shard Routing

The routing function maps any prefix to its trie shard. We use FNV hash on the first two characters rather than consistent hashing because prefix shard assignment is static - we never reassign prefixes between shards without a full rebuild.

// Deterministic prefix shard routing - O(1), no consistent hash ring needed
package router

import "hash/fnv"

type ShardRouter struct {
	numShards int
	addresses []string
}

func NewShardRouter(addresses []string) *ShardRouter {
	return &ShardRouter{numShards: len(addresses), addresses: addresses}
}

func (r *ShardRouter) GetAddress(prefix string) string {
	key := prefix
	if len(prefix) > 2 {
		key = prefix[:2]
	}
	if len(key) == 0 {
		return r.addresses[0]
	}
	h := fnv.New32a()
	h.Write([]byte(key))
	return r.addresses[int(h.Sum32())%r.numShards]
}

Scaling and Performance

Prefix sharding and horizontal scaling strategy for personalized typeahead

Capacity Estimation

Given:
  - 150,000 suggestion requests/second at peak
  - Average prefix length: 3-4 characters
  - Response payload: 10 suggestions x 60 bytes = 600 bytes
  - 2 billion active users, 500 personal queries per user each
  - 5 billion distinct global queries, avg 25 bytes each
  - 32 trie shards

Trie memory per shard:
  - Queries per shard: 5B / 32 = 156M
  - Trie nodes: ~200M (including internal nodes)
  - Each node: 50 suggestions x 60 bytes + 50 bytes struct = 3,050 bytes
  - Per shard RAM: 200M nodes x 3,050 bytes = ~600GB raw
  - With pointer compression and struct packing: ~110GB per shard
  - Total trie fleet: 32 x 110GB = 3.5TB RAM

Redis cache:
  - Top 100K prefixes x 1,000 segments x 600 bytes = 60GB (manageable)
  - Short-prefix cache (len 1-2): 1,296 keys x 1,000 segs = 1.3M entries
  - Long-prefix cache (len 3-5): ~50M entries, LRU eviction at 200GB limit
  - Redis cluster: 10 nodes x 128GB = 1.28TB capacity

User history storage (Cassandra):
  - 2B users x 500 queries x 100 bytes/row = 100TB raw
  - With Cassandra compression (~50%): ~50TB
  - Hot working set (5% of users active): 5TB - fits in warm tier

Trending pipeline:
  - 150K events/sec into Kafka = 9MB/s write throughput
  - Flink 5-minute window: 45M events/window
  - Trending score updates per window: ~200K queries
  - Redis ZADD operations per window: ~200K (trivial)

Read throughput:
  - 150K req/s at 95% cache hit rate: 7,500 req/s to trie shards
  - Per-shard: 7,500 / 32 = 234 trie lookups/sec (very comfortable)
  - At 5ms trie lookup: 234 x 5ms = 1.2 CPU-core-seconds/s per shard

The dominant bottleneck is trie RAM. At 110 GB per shard, each shard server needs a 128 GB+ high-memory instance. We cannot reduce this by adding more shards without increasing the fleet size proportionally - each shard still needs the full top-K structure for its prefix range. Vertical scaling wins here over horizontal scaling.

The secondary bottleneck is Redis cache invalidation during trending bursts. When “earthquake” starts trending, cache keys for “e”, “ea”, “ear”, “eart”, “earth”, “earthq”, “earthqu”, “earthqua”, “earthquak”, “earthquake” all need invalidation. With 200K trending queries per 5-minute window, that is 1.4 million DEL operations per window. Batching these into 500-key pipeline calls reduces Redis round trips to under 3,000 operations per window.

Real World

LinkedIn’s typeahead system for people and job search uses a two-level cache: a CDN edge cache for the anonymous (non-personalized) prefix responses and an origin Redis cache for authenticated user sessions. The edge cache handles 80%+ of traffic since many users are browsing without logging in or from shared devices. This architecture was described in their 2019 engineering blog post on their Galene search infrastructure.

Failure Modes and Recovery

FailureDetectionImpactRecovery
Trie shard crashHealth check failure, p99 latency spike on affected prefix rangeQueries for affected bigrams return empty or fall to global defaultReplica promotes in 30s; trie reload from snapshot takes 5-15 min; serve from replica while reloading
Redis cache cluster node lossCache hit rate drops below 80%; latency alert firesCache miss rate spikes; trie shard load increases 5-10xRedis Cluster auto-rebalances slots; warm affected cache keys from surviving nodes; jitter rebuild to avoid thundering herd
Cassandra personalization timeoutp99 Cassandra read latency exceeds 8msPersonalization step times out; circuit breaker opensDegrade gracefully: return segment-cached suggestions without personal boost; circuit breaker recovers after 30s
Flink job lag (trending backlog)Kafka consumer lag exceeds 5 million messagesTrending scores freeze; new trending queries get no boostRestart Flink from last checkpoint; replay Kafka events; trending injection auto-resumes when lag drains
Cold start segment table staleSegment affinities not refreshed in 48hNew users get generic global results instead of segment-personalizedTrigger offline segment recompute job; serve pure global results as fallback - still functional
Cache stampede on short prefixSudden p99 spike when “g” or “py” cache key expires simultaneouslyThousands of concurrent trie lookups; shard CPU saturatesJittered TTL (base + random 0-30s); refresh-ahead (recompute cache 10s before TTL expiry); serve stale while refreshing
Watch Out

The personalization service is the most common latency outlier. Cassandra tail latency at p99.9 can spike to 20ms+ during GC pauses or compaction. Always run the personalization fetch with a hard timeout of 10ms and a circuit breaker that opens after 5 consecutive timeouts. The suggestion response with global-only results is acceptable; a 50ms SLA breach because personalization hung is not.

Comparison of Approaches

ApproachLatencyPersonalizationUpdate FreshnessMemory CostBest Fit
Trie with top-K per node + segment cache (this design)p99 <50msSegment + per-user history blendTrending in <10minVery high (110GB/shard)High-volume personalized prod
Elasticsearch Completion Suggester10-30msPlugin-based, limitedNear-real-time via mapping updateMedium (FST compressed)Teams without custom infra
Redis Sorted Sets per prefixp99 <5msNone (global only)Immediate ZADDVery high per prefixSmall prefix universe, no personalization
Database full-text prefix (LIKE)100ms-2sEasy to join with user tableImmediateLow (no extra index)Prototypes, internal tools
Client-side local history only0msPerfect for recent queriesImmediateZero server costRecent-query recall only
Vector semantic completion20-50msVia user embeddingRebuild needed on model changeHigh (embedding store)Semantic/fuzzy completions

The trie with top-K caching and segment-level Redis is the only approach that achieves sub-50ms latency, partial personalization, and trending freshness simultaneously. Elasticsearch is the right choice for teams that cannot justify operating a custom trie service - it handles up to 50,000 QPS comfortably. Redis Sorted Sets per prefix work for small deployments but break at billions of distinct prefixes since each prefix requires its own sorted set and the total memory becomes unmanageable.

Key Takeaways

  • Segment-level caching beats user-level caching: caching at ~1,000 behavioral segments delivers 70% of personalization value while requiring 2 million times less cache storage than per-user cache entries
  • Personal boost is additive, not multiplicative: adding personal scores to global scores rather than multiplying by a boost factor prevents a user’s most-repeated query from completely suppressing globally authoritative results, which would make the suggestion list feel narrow and unhelpful
  • Decay scoring is not optional: without time-decay on personal history, a query the user searched 18 months ago competes equally with one from yesterday; half-life of 30 days hits the right balance between recency and retention
  • Trending injection is a read, not a write: pre-computed trending scores in a Redis Sorted Set let the Suggestion Blender inject trending at request time with a single ZREVRANGEBYSCORE command rather than blocking on the Flink pipeline
  • Cold start resolves in 3-5 searches: segment affinity scores plus onboarding topics provide enough signal to produce meaningful personalized suggestions from the first session; users should never see a completely impersonal global list
  • A/B test acceptance rate, not click-through rate: a suggestion that causes the user to stop typing and submit is a success; one the user types past is a failure regardless of whether they eventually click a result below; track both separately per rank position
  • Circuit-break the personalization path aggressively: the typeahead SLA is 50ms; personalization adds 3-8ms on average but can spike to 20ms at Cassandra tail latency; a hard 10ms timeout with global fallback is required, not optional
  • Trie shard memory is the scaling bottleneck: more shards do not reduce per-shard memory because each shard still needs the full top-K structure for its prefix range; vertical scaling (bigger RAM instances) wins over horizontal sharding for the trie layer

Frequently Asked Questions

Q: Why not build this on top of Elasticsearch? A: Elasticsearch’s Completion Suggester handles up to about 50,000 QPS reliably. Beyond that, JVM GC pauses introduce unpredictable latency spikes at p99, and the FST lookup overhead makes it hard to stay under 50ms at scale. More importantly, Elasticsearch does not natively support the blend of global scores, personal history, and real-time trending injection this design requires - you would need to rebuild that logic in a plugin or in application code anyway. For most product teams, Elasticsearch with a thin personalization overlay is the right starting point and can serve 5-10 million users before needing a custom trie.

Q: Why not cache at the user level instead of the segment level? A: A per-user cache for 2 billion active users, each with up to 500 active prefixes, would require 2B * 500 * 600 bytes = 600 TB of Redis capacity. That is an order of magnitude beyond what any production Redis cluster can hold cost-effectively. Even LRU eviction does not help much because every user’s “active prefixes” change session by session. Segment-level caching requires only 1,000 segments * 100K prefixes * 600 bytes = 60 GB - a realistic cache footprint.

Q: Why 5-minute trending windows rather than 1-minute? A: 1-minute windows produce too many false positive trending signals. A single viral tweet can cause a 20x spike in a query within 1 minute, making that query appear trending when it is really just a transient burst. 5-minute windows smooth out these single-spike events while still delivering trending signal within the 10-minute freshness SLA. Google reportedly uses both a 1-2 minute window (for breaking news detection) and a 15-minute window (for sustained trending), assigning different boost multipliers to each and taking the maximum.

Q: How do you handle offensive or harmful queries in suggestions? A: Two layers. First, the is_blocked flag on query_frequencies is applied at trie index build time - blocked queries never enter the trie. This handles known violations. Second, for real-time incidents (a new offensive query that goes viral), a Redis blocklist at block:{query_text} is checked by the Suggestion Blender before returning any result. Adding a query to the blocklist takes effect within seconds and does not require a trie rebuild.

Q: Why not serve suggestions from the CDN edge for personalized users? A: For anonymous users, CDN edge caching of prefix responses is absolutely correct and eliminates origin load entirely. For authenticated users, personalized suggestions vary per user, which means you cannot cache at the CDN without either: (a) using per-user cache keys at the CDN (which misses every other user and wastes CDN resources), or (b) stripping personalization at the edge (which defeats the purpose). The practical answer is: CDN caches anonymous prefix responses; authenticated users get routed to origin where the segment-level Redis cache handles most of the load.

Q: How does the system handle Unicode and non-Latin prefixes? A: The trie key is a sequence of Unicode code points, not ASCII bytes. The shard key uses the first two Unicode code points as the hash input. For CJK characters, a single code point can represent a complete word in logographic writing systems, so a 2-character CJK shard key covers more semantic territory than a 2-character Latin shard key - in practice, CJK deployments sometimes use 1-character shard keys to avoid over-concentration of queries in a single shard.

Interview Questions

Q: Walk me through how typing “py” produces personalized suggestions in under 50ms for a logged-in user.

Expected depth: Trace the full path: client debounce fires at 120ms idle; HTTPS request to API Gateway with JWT; gateway validates token (sub-5ms with local key cache), extracts prefix “py”, routes to shard via hash(“py”) % 32; Suggestion Blender checks Redis for suggest:py:{user_segment} (cache hit returns in under 5ms total); on cache miss, parallel fan-out: trie GetTopK(“py”, 50) in ~2ms and Cassandra user history range scan in ~5ms; personal history applied with decay scoring; trending Redis ZREVRANGEBYSCORE check; blocklist gate; top-10 sorted and returned; cache write-back at TTL=30s for short prefix. Allocate 50ms budget: 8ms network round-trip, 3ms gateway, 5ms cache lookup, 10ms trie+personalization parallel, 2ms merge, 5ms buffer.

Q: The “py” trie shard handles 300 million queries starting with “py”. How do you prevent it from being a latency bottleneck?

Expected depth: The trie shard is almost never the bottleneck - the Redis cache absorbs 95%+ of traffic. On the trie itself: the shard holds 156M queries total across all bigrams, not 300M for “py” alone; the GetTopK call on the “py” node is O(2) node traversals and a list copy - microseconds, not milliseconds. Replicas absorb read traffic with read-any routing. If the “py” node becomes a CPU bottleneck, split the “py” sub-shard (“pa-pl” vs “pm-py”) into a dedicated server. The bigger concern is the shadow of the “py” cache key expiring simultaneously for all users - jittered TTL and refresh-ahead prevent the stampede.

Q: How would you add geographic trending so that “earthquake” trends in California but not in Germany?

Expected depth: Add a region dimension to the Kafka query event stream. Partition the Flink aggregation by (region, query_text) pairs. Maintain per-region trending sets in Redis: trending:us-west vs trending:eu-central. Extend the suggestion request to include the user’s detected region (from IP geolocation at the gateway). Cache keys become suggest:{prefix}:{segment}:{region}. Memory impact: 50 regions x current cache size = 50x cache footprint, mitigated by only materializing caches for top-10 regions and defaulting others to global trending. Trie layer stays unchanged - trending is always injected at request time from Redis, never baked into the trie.

Q: Design the schema and update strategy for the user history store to handle 1 billion profile reads per second.

Expected depth: Cassandra with partition key user_id and clustering key query_text ASC for prefix range scans within a user’s partition. At 1B reads/sec across 2B users, hot users need read replicas - Cassandra RF=3 with nearest-replica routing handles hot users automatically since multiple replicas can serve reads simultaneously. Add a per-user Redis TTL-60s cache for users active in the last 10 minutes (the hot working set): history:{user_id}:{prefix} - covers ~5% of users but absorbs 40%+ of reads since active users generate disproportionate traffic. Write path: query submission event into Kafka -> consumer UPSERT every 5 seconds using Cassandra lightweight transactions (CAS on count increment) to avoid write amplification; alternatively use server-side counters.

Q: A competitor ships a feature where suggestions update within 1 second of a query going viral. How would you redesign the trending pipeline to match?

Expected depth: Current 5-minute Flink windows cannot support 1-second freshness. Switch to a per-query sliding counter with Redis INCR: each query submission fires INCR query:count:{query_text} and INCR query:count:{query_text}:1min. A 1-second Flink job or a Redis keyspace notification triggers when any query’s 1-minute count crosses a threshold (e.g., 5x its baseline). The trending Redis ZSET updates within 1-2 seconds of the spike onset. Cache invalidation becomes the critical path: at 1-second freshness, cache TTLs cannot be 30 seconds - you need either 1-second TTLs (destroying cache efficiency) or event-driven cache invalidation via Pub/Sub when trending scores change. The tradeoff: 1-second trending freshness requires either accepting very high trie load (short TTLs) or significant infrastructure for event-driven cache invalidation. For most products, 5-10 minutes is good enough - reserve 1-second freshness for breaking news or live event scenarios.

Premium Content

Unlock the full article along with everything else in the archive — all in one place.

In-depth analysis Expert insights Full archive access
Unlock Full Article