Build a Semantic Search Engine with Hybrid Ranking


databases performance data-engineering

System Design Deep Dive

Semantic Search with Hybrid Ranking

When users search for “show me articles about distributed consensus” and your corpus contains zero documents with those exact three words - hybrid ranking is what separates a useful system from an empty results page.

14 min readAdvancedSemantic Search

Picture a senior librarian who has read every book in a five-million-volume collection. When you ask “where is that thing about leaders in distributed systems?”, she knows you probably want Lamport’s Paxos paper, Raft consensus documentation, or Leslie Lamport’s biography - even though none of those contain your exact phrase. She does not pattern-match your words; she maps your intent to the information space she has memorized.

Keyword search engines are the opposite. They are extraordinarily fast filing clerks who will find every document containing the exact words “leaders in distributed systems” and return zero results if no document uses that exact phrasing. The filing clerk is never wrong - it just answers a different question than the one you asked.

Hybrid ranking combines both: a BM25 sparse retriever (the fast filing clerk) and a dense vector retriever (the semantic librarian) run in parallel, and their results are fused by a scoring function that promotes documents that score well in either dimension. A cross-encoder re-ranker optionally refines the top candidates by scoring each (query, document) pair directly - much more accurate than either retriever alone, but too slow to run over thousands of candidates.

The design challenge is achieving this in under 100ms end-to-end at 5,000 queries per second. That forces every component choice: we cannot afford a slow embedding call before retrieval, cannot afford serial retrieval, and must contain the cross-encoder to at most 20 pairs within a hard 30ms budget gate.

Requirements and Constraints

Functional Requirements

  • Accept free-text queries and return ranked documents relevant by both keyword and semantic meaning
  • Support 100 million indexed documents averaging 500 tokens each
  • Return top-K results (K = 1 to 20) with relevance scores and optional explanations
  • Handle zero-keyword-match queries: queries where no indexed document contains the exact query terms
  • Support document-level metadata filters combined with semantic search (e.g., “published after 2024, topic = ml”)
  • Near-real-time indexing: new documents searchable within 30 seconds of ingestion

Non-Functional Requirements

  • Query latency: p50 under 60ms, p99 under 100ms at 5,000 QPS
  • Index size: 100M documents; raw text storage separate from search index
  • Throughput: 5,000 search queries per second sustained
  • Recall@10 for semantic queries: above 85% vs human-labeled ground truth
  • Availability: 99.9% for reads; 99.5% for writes
  • Embedding freshness: document vectors generated within 30 seconds of document creation

Constraints

  • Embedding model is fixed at index creation; model upgrades require full re-embedding
  • Cross-encoder re-ranking is gated by a 30ms latency budget: skip if the upstream retrieval is already over 70ms
  • BM25 and vector indexes must be co-located per shard to avoid cross-shard latency overhead in the fusion step

High-Level Architecture

The system has five planes: the query processor that encodes the query into both a token stream and a dense embedding, the dual retrieval layer that runs BM25 and ANN search in parallel, the score fusion engine that merges the two ranked lists, the cross-encoder re-ranker that refines top candidates, and the indexing pipeline that keeps both indexes current.

Hybrid ranking system architecture showing dual retrieval paths, score fusion, and cross-encoder re-rank tier

At query time, the Query API Gateway accepts a search request and simultaneously fires two requests: the query text goes to the BM25 inverted index, and the query text goes to the embedding service which converts it to a 768-dimensional dense vector before hitting the HNSW vector index. Both retrievers return their top-100 candidates. The Score Fusion engine merges these 200 candidates using Reciprocal Rank Fusion (RRF), producing a combined ranked list. The top-20 entries from the fused list are passed to the Cross-Encoder service if the latency budget allows; otherwise the fused ranking is returned directly.

Query latency budget breakdown showing encode, parallel retrieve, score fusion, and cross-encoder re-rank steps with per-step timing
Key Insight

The latency budget is what makes or breaks hybrid search in production. BM25 is inherently fast - under 15ms for a well-tuned inverted index. ANN vector search is 10-20ms. But embedding the query itself takes 3-8ms on GPU, and cross-encoding the top-20 results takes another 20-40ms. The architecture must account for all three sequential steps: encode (5ms), parallel retrieve (15ms), fuse (2ms), re-rank (30ms). Total: ~52ms p50. Any component that spikes means the cross-encoder gets skipped via a latency gate - this is a deliberate design choice, not a fallback.

Sparse Retrieval: BM25 Deep Dive

BM25 (Best Match 25) is the standard first-stage retriever for keyword-based search. It scores documents based on term frequency (TF), inverse document frequency (IDF), and document length normalization. The key equation for document D and query Q is:

score(D, Q) = sum over terms t of:
  IDF(t) * TF_norm(t, D)

where:
  IDF(t) = log( (N - df(t) + 0.5) / (df(t) + 0.5) + 1 )
  TF_norm(t, D) = tf(t,D) * (k1+1) / (tf(t,D) + k1*(1 - b + b*|D|/avgdl))
  k1 = 1.2, b = 0.75 (tunable)

BM25’s weakness is vocabulary mismatch. A query for “car insurance” will not match a document about “automobile coverage” because the terms do not overlap. This is precisely the gap dense retrieval fills.

Real World

Elasticsearch and OpenSearch use BM25 as the default similarity function for text fields (replacing TF-IDF in Elasticsearch 5.0). Both expose the k1 and b parameters via index settings. Elastic’s kNN search feature, introduced in version 8.0, adds the dense retrieval path - but Elastic’s hybrid scoring still required manual score normalization until the Reciprocal Rank Fusion API landed in 8.14.

The BM25 service runs on Elasticsearch with document-sharded indexes. Each shard holds approximately 25 million documents. For a query with 3-5 terms, the shard executor:

  1. Looks up each term’s posting list (sorted list of docIDs + term frequencies)
  2. Intersects or unions posting lists using DAAT (Document-at-a-Time) traversal
  3. Scores matched documents with BM25 using cached global IDF statistics
  4. Returns the top-100 scored results to the query router

BM25 also handles query expansion automatically when configured with a synonyms filter at index time: a synonym file maps “automobile” to “car”, “vehicle”, “auto”. This is a lightweight alternative to the full query expansion approach and adds negligible latency.

Key Insight

BM25 and dense retrieval have complementary failure modes. BM25 fails on vocabulary mismatch (synonyms, paraphrases, conceptual queries). Dense retrieval fails on rare terms and proper nouns: a query for “CVE-2024-12345” produces meaningless embeddings because the model has never seen that identifier. Hybrid search is most powerful when both retrievers are plumbed in parallel - neither is optional.

Dense Retrieval: Bi-Encoders and HNSW

Sparse retrieval represents documents and queries as high-dimensional sparse vectors (one dimension per vocabulary term). Dense retrieval represents them as low-dimensional dense vectors learned by a neural encoder. A sentence like “the cat sat on the mat” becomes a 768-float vector where semantically similar sentences cluster nearby in embedding space.

The bi-encoder architecture processes queries and documents independently through the same transformer model, producing embeddings that can be compared by cosine similarity. This enables precomputing all document embeddings offline.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 768 dims, fast

def embed_query(query: str) -> np.ndarray:
    """Encode query to dense vector. Called at query time."""
    embedding = model.encode(query, normalize_embeddings=True)
    return embedding  # shape: (768,), L2-normalized for cosine via dot product

def embed_documents(docs: list[str], batch_size: int = 256) -> np.ndarray:
    """Batch encode documents at index time. Run offline."""
    embeddings = model.encode(
        docs,
        batch_size=batch_size,
        normalize_embeddings=True,
        show_progress_bar=True,
    )
    return embeddings  # shape: (N, 768)

These document embeddings are indexed into an HNSW (Hierarchical Navigable Small World) graph for approximate nearest neighbor search. HNSW achieves sub-millisecond search over millions of vectors by building a navigable multi-layer graph - starting from a sparse top layer for coarse navigation and drilling down to a dense bottom layer for precise neighbor lookup.

import hnswlib
import numpy as np

def build_hnsw_index(
    embeddings: np.ndarray,
    doc_ids: list[int],
    ef_construction: int = 200,
    M: int = 16,         # connections per node (16-64 typical)
    ef_search: int = 50, # search expansion factor
) -> hnswlib.Index:
    dim = embeddings.shape[1]
    index = hnswlib.Index(space="cosine", dim=dim)
    index.init_index(max_elements=len(embeddings), ef_construction=ef_construction, M=M)
    index.add_items(embeddings, doc_ids)
    index.set_ef(ef_search)
    return index

def ann_search(
    index: hnswlib.Index,
    query_vec: np.ndarray,
    k: int = 100,
) -> list[tuple[int, float]]:
    """Return top-k (doc_id, distance) pairs. Lower distance = more similar."""
    labels, distances = index.knn_query(query_vec.reshape(1, -1), k=k)
    return list(zip(labels[0].tolist(), distances[0].tolist()))
Real World

Vespa.ai (the open-source search engine used by Yahoo and Spotify) natively co-locates BM25 and HNSW indexes in the same shard - each document shard holds both the inverted index and the vector index for the same document set. This co-location is architecturally important: it means score fusion can happen in-process on the shard node rather than requiring a network merge step. Pinecone and Weaviate offer similar hybrid search capabilities in their managed offerings.

Score Fusion Strategies

After both retrievers return their top-100 candidates, the score fusion engine must combine them into a single ranked list. Three strategies are commonly used:

1. Linear score combination: hybrid_score = alpha * bm25_norm + (1-alpha) * vector_norm

This requires normalizing scores from both retrievers to the same range (typically 0-1) before combining. The problem is that BM25 scores have unbounded range (they depend on corpus statistics) while cosine distances are bounded [-1, 1]. Min-max normalization over the top-K results is fragile: a single outlier score can compress all other scores toward zero.

2. Reciprocal Rank Fusion (RRF): rrf_score(d) = sum over lists i of: 1 / (k + rank_i(d))

RRF uses document ranks, not raw scores, making it immune to score distribution differences between retrievers. The constant k=60 (from the original Cormack and Clarke 2009 paper) smooths the contribution of low-ranked items. A document ranked #1 by both retrievers gets 1/(60+1) + 1/(60+1) = 0.0328. A document ranked #1 by only one retriever gets 1/(60+1) = 0.0164.

3. Convex combination with learned weights: Same as linear combination but with weights learned from click-through data or relevance labels. Requires a labeled training set and periodic retraining.

RRF is the production default for most teams because it requires no normalization, no training data, and is robust to score outliers. The main limitation is that it discards absolute score information: if BM25 returns an exact match with score 50.0 and all other BM25 scores are under 5.0, RRF treats the exact match as “ranked #1” and gives it the same weight as if BM25 scores were all similar.

Score fusion internals showing RRF formula applied to BM25 and ANN ranked lists, then cross-encoder re-ranking
from collections import defaultdict

def reciprocal_rank_fusion(
    ranked_lists: list[list[int]],  # each list is ordered doc_ids from one retriever
    k: int = 60,
) -> list[tuple[float, int]]:
    """
    RRF fusion over N ranked lists.
    Returns list of (rrf_score, doc_id) sorted descending.
    """
    scores: dict[int, float] = defaultdict(float)
    for ranked_list in ranked_lists:
        for rank, doc_id in enumerate(ranked_list, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending RRF score
    fused = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [(score, doc_id) for doc_id, score in fused]

# Example: BM25 returns doc 4471 at rank 1, doc 2203 at rank 2.
# ANN returns doc 2203 at rank 1, doc 8872 at rank 2.
bm25_ranked = [4471, 2203, 9011, 3344]
ann_ranked  = [2203, 8872, 4471, 7701]
fused = reciprocal_rank_fusion([bm25_ranked, ann_ranked], k=60)
# doc 2203: 1/62 + 1/61 = 0.0161 + 0.0164 = 0.0325  (top)
# doc 4471: 1/61 + 1/63 = 0.0164 + 0.0159 = 0.0323  (second)
# doc 8872: 0    + 1/62 = 0.0161              (third)
print(fused[:3])
# [(0.0325, 2203), (0.0323, 4471), (0.0161, 8872)]
Key Insight

RRF’s k=60 is not magic - it is calibrated so that a document at rank 60 contributes exactly half the weight of a document at rank 1. Lower k values make the fusion more sensitive to top ranks (aggressive fusion). Higher k values flatten the contribution curve, making fusion more democratic. For most hybrid search setups k=60 is the right starting point, but if your BM25 is consistently better for navigational queries, a lower k biases the fusion toward whichever retriever ranks documents in the top 30.

Bi-Encoder vs Cross-Encoder

This distinction is the most important architectural tradeoff in semantic search:

Bi-encoders encode query and document independently. This means document embeddings can be precomputed and stored. At query time, only the query needs encoding (5ms on GPU), and retrieval is a fast ANN search. The tradeoff: the encoder must represent all meaning in a fixed-size vector without seeing the query, which loses fine-grained relevance signals.

Cross-encoders take a (query, document) pair as input and output a relevance score. Because the model sees both query and document together, it can apply full attention across all tokens - capturing precise relevance signals that a bi-encoder misses. The tradeoff: cross-encoders cannot precompute anything. Every query-document pair must be scored at query time, which is prohibitively slow over thousands of candidates.

from sentence_transformers import CrossEncoder

# Cross-encoder for re-ranking (NOT for first-stage retrieval)
# ms-marco-MiniLM-L-6-v2 is a popular, fast cross-encoder trained on MS MARCO
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_with_cross_encoder(
    query: str,
    doc_texts: list[str],    # document text (not embeddings)
    doc_ids: list[int],
    top_n: int = 10,
    latency_budget_ms: float = 30.0,
) -> list[tuple[float, int]]:
    """
    Score (query, doc) pairs with cross-encoder.
    Only call on top-20 candidates after RRF fusion.
    Latency: ~1.5ms per pair on A10G GPU; 20 pairs = ~30ms.
    """
    import time
    start = time.monotonic()

    pairs = [(query, doc_text) for doc_text in doc_texts]
    scores = cross_encoder.predict(pairs, batch_size=len(pairs))

    elapsed_ms = (time.monotonic() - start) * 1000
    if elapsed_ms > latency_budget_ms:
        # Budget exceeded: return RRF order (doc_ids already ranked)
        return [(0.0, doc_id) for doc_id in doc_ids[:top_n]]

    ranked = sorted(zip(scores, doc_ids), reverse=True)
    return [(float(score), doc_id) for score, doc_id in ranked[:top_n]]

The production pattern: use the bi-encoder at retrieval scale (100M candidates), then the cross-encoder at re-ranking scale (top 20-50 candidates). Never use the cross-encoder for first-stage retrieval.

Query Expansion

Query expansion augments the user’s query with additional terms to improve sparse retrieval recall. Two variants:

Lexical query expansion adds synonyms and related terms from a curated dictionary or WordNet: “fast” expands to “fast OR quick OR rapid OR speedy”. Simple, fast, and language-dependent.

Neural query expansion uses a language model to generate alternative phrasings. The most effective approach is Hypothetical Document Embeddings (HyDE): generate a synthetic document that would answer the query, embed it, and use that embedding as the query vector for ANN search instead of the query embedding itself.

from anthropic import Anthropic

client = Anthropic()

def hyde_expand(query: str, model: SentenceTransformer) -> np.ndarray:
    """
    HyDE: generate a hypothetical document, embed it, use as query vector.
    Improves recall for underspecified queries by 5-15% on standard benchmarks.
    """
    # Step 1: generate a hypothetical answer document
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": (
                f"Write a 2-sentence passage that directly answers: '{query}'. "
                "Be specific and factual. Do not start with 'I' or explain that you are an AI."
            ),
        }],
    )
    hypothetical_doc = response.content[0].text.strip()

    # Step 2: embed the hypothetical document (not the original query)
    hyde_embedding = model.encode(hypothetical_doc, normalize_embeddings=True)
    return hyde_embedding

def query_with_hyde(
    query: str,
    index: hnswlib.Index,
    model: SentenceTransformer,
    k: int = 100,
    use_hyde: bool = True,
) -> list[tuple[int, float]]:
    if use_hyde:
        query_vec = hyde_expand(query, model)
    else:
        query_vec = model.encode(query, normalize_embeddings=True)
    return ann_search(index, query_vec, k=k)
Watch Out

HyDE adds 50-150ms of latency due to the LLM call, which blows the 100ms budget. Gate it with a query classifier: run HyDE only for queries detected as “underspecified” (short queries, no named entities, low BM25 candidate overlap). For navigational queries (“Kubernetes pod spec syntax”), direct embedding is faster and more accurate. Never run HyDE unconditionally in a latency-sensitive production path.

Data Model

The hybrid search system has three storage layers per shard: the inverted index (BM25, managed by Elasticsearch), the vector index (HNSW, managed by the vector service), and the document store (text + metadata, for cross-encoder scoring).

-- Document registry: master record per indexed document
CREATE TABLE indexed_documents (
  doc_id          BIGINT          NOT NULL,
  shard_id        SMALLINT        NOT NULL,
  external_id     VARCHAR(512)    NOT NULL,
  title           TEXT            NOT NULL,
  body_snippet    TEXT            NOT NULL,    -- first 512 tokens for cross-encoder
  vector_version  INT             NOT NULL DEFAULT 1,  -- embedding model version
  indexed_at      TIMESTAMPTZ     NOT NULL DEFAULT NOW(),
  updated_at      TIMESTAMPTZ     NOT NULL DEFAULT NOW(),
  is_deleted      BOOLEAN         NOT NULL DEFAULT FALSE,
  metadata        JSONB           NOT NULL DEFAULT '{}',
  PRIMARY KEY (doc_id),
  CONSTRAINT uq_external UNIQUE (external_id)
) PARTITION BY HASH (doc_id);

CREATE INDEX idx_docs_shard ON indexed_documents (shard_id, is_deleted, indexed_at DESC);
CREATE INDEX idx_docs_version ON indexed_documents (vector_version, is_deleted) WHERE NOT is_deleted;

-- Embedding index tracking: which docs have vectors in which HNSW index version
CREATE TABLE vector_index_entries (
  doc_id          BIGINT          NOT NULL,
  shard_id        SMALLINT        NOT NULL,
  model_version   INT             NOT NULL,
  hnsw_label      BIGINT          NOT NULL,   -- internal hnswlib label
  indexed_at      TIMESTAMPTZ     NOT NULL DEFAULT NOW(),
  PRIMARY KEY (doc_id, model_version)
);

CREATE INDEX idx_vec_shard_model ON vector_index_entries (shard_id, model_version);

-- Query log: for offline eval and latency tracking
CREATE TABLE query_log (
  query_id        UUID            NOT NULL DEFAULT gen_random_uuid() PRIMARY KEY,
  query_text      TEXT            NOT NULL,
  bm25_latency_ms SMALLINT,
  ann_latency_ms  SMALLINT,
  fusion_latency_ms SMALLINT,
  rerank_latency_ms SMALLINT,
  total_latency_ms SMALLINT,
  bm25_top1_doc   BIGINT,
  ann_top1_doc    BIGINT,
  fused_top1_doc  BIGINT,
  reranked_top1_doc BIGINT,
  logged_at       TIMESTAMPTZ     NOT NULL DEFAULT NOW()
);
Key Insight

The query_log table is your offline evaluation goldmine. By logging which doc won at each stage (bm25_top1_doc, ann_top1_doc, fused_top1_doc, reranked_top1_doc), you can measure inter-stage agreement over production traffic. High disagreement between fused_top1_doc and reranked_top1_doc indicates the cross-encoder is catching cases where fusion got it wrong - a signal to increase re-ranking depth. Zero disagreement often indicates the cross-encoder is not being called (latency gate firing too aggressively).

Key Algorithms and Protocols

Reciprocal Rank Fusion - Production Implementation

The production RRF implementation adds two refinements beyond the basic formula: a metadata pre-filter that removes ineligible candidates before fusion, and a diversity re-ranking pass that penalizes near-duplicate documents.

from dataclasses import dataclass
from collections import defaultdict
from typing import Optional

@dataclass
class SearchCandidate:
    doc_id: int
    bm25_rank: Optional[int]   # None if not in BM25 results
    ann_rank: Optional[int]    # None if not in ANN results
    bm25_score: float = 0.0
    ann_distance: float = 1.0

def production_rrf_fusion(
    bm25_results: list[tuple[float, int]],    # (bm25_score, doc_id) sorted desc
    ann_results: list[tuple[float, int]],     # (distance, doc_id) sorted asc
    metadata_filter: Optional[set[int]],      # allowed doc_ids; None = no filter
    k: int = 60,
    top_n: int = 20,
) -> list[tuple[float, int]]:
    """
    RRF fusion with optional metadata pre-filter.
    Returns top-N (rrf_score, doc_id) sorted descending.
    """
    candidates: dict[int, SearchCandidate] = {}

    for rank, (score, doc_id) in enumerate(bm25_results, start=1):
        if metadata_filter and doc_id not in metadata_filter:
            continue
        candidates.setdefault(doc_id, SearchCandidate(doc_id=doc_id, bm25_rank=None, ann_rank=None))
        candidates[doc_id].bm25_rank = rank
        candidates[doc_id].bm25_score = score

    for rank, (distance, doc_id) in enumerate(ann_results, start=1):
        if metadata_filter and doc_id not in metadata_filter:
            continue
        candidates.setdefault(doc_id, SearchCandidate(doc_id=doc_id, bm25_rank=None, ann_rank=None))
        candidates[doc_id].ann_rank = rank
        candidates[doc_id].ann_distance = distance

    # Compute RRF scores
    scored: list[tuple[float, int]] = []
    for doc_id, cand in candidates.items():
        rrf = 0.0
        if cand.bm25_rank is not None:
            rrf += 1.0 / (k + cand.bm25_rank)
        if cand.ann_rank is not None:
            rrf += 1.0 / (k + cand.ann_rank)
        scored.append((rrf, doc_id))

    scored.sort(reverse=True)
    return scored[:top_n]

The right evaluation metric depends on whether you have explicit relevance labels (graded) or click-through data (implicit):

def ndcg_at_k(
    ranked_doc_ids: list[int],
    relevance_labels: dict[int, int],  # doc_id -> relevance grade (0, 1, 2, 3)
    k: int = 10,
) -> float:
    """
    Normalized Discounted Cumulative Gain at K.
    Standard metric for graded relevance in information retrieval.
    """
    import math

    def dcg(doc_ids: list[int], labels: dict[int, int], at_k: int) -> float:
        return sum(
            (2 ** labels.get(doc_id, 0) - 1) / math.log2(rank + 2)
            for rank, doc_id in enumerate(doc_ids[:at_k])
        )

    actual_dcg = dcg(ranked_doc_ids, relevance_labels, k)

    # Ideal DCG: sort by relevance descending
    ideal_order = sorted(relevance_labels.keys(), key=lambda d: relevance_labels[d], reverse=True)
    ideal_dcg = dcg(ideal_order, relevance_labels, k)

    if ideal_dcg == 0:
        return 0.0
    return actual_dcg / ideal_dcg

def recall_at_k(
    ranked_doc_ids: list[int],
    relevant_doc_ids: set[int],
    k: int = 10,
) -> float:
    """Fraction of relevant docs found in the top-K results."""
    if not relevant_doc_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant_doc_ids)
    return hits / len(relevant_doc_ids)

# A/B comparison of pure BM25 vs hybrid (typical results)
pure_bm25_ndcg10 = 0.42     # degrades on semantic queries
pure_ann_ndcg10  = 0.51     # better on paraphrase queries
hybrid_rrf_ndcg10 = 0.63   # beats both on mixed query sets
hybrid_rerank_ndcg10 = 0.71 # cross-encoder adds further +8pp
Key Insight

NDCG@10 is the standard metric for comparing retrieval systems, but it requires graded relevance labels (0-3 scale) which are expensive to obtain. For teams without labeled datasets, use BEIR (Benchmarking Information Retrieval) - an open benchmark with 18 diverse test sets covering news, biomedical, financial, and code search. Running your hybrid system against BEIR’s public test sets provides a calibrated baseline before you invest in domain-specific labeling.

Scaling and Performance

Index sharding diagram showing N document shards each co-locating BM25 and HNSW, with embedding service pool and cross-encoder GPU pool
Capacity Estimation (100M docs, 5000 QPS):

Vector index:
  100M docs x 768 dims x 4 bytes = 307 GB raw float32
  With int8 scalar quantization (4x reduction): ~80 GB
  With 4 shards: ~20 GB per shard (fits in 32 GB instance)
  HNSW graph overhead (M=16 connections): ~2 GB per shard
  Total per shard node: ~22 GB RAM for vector index

BM25 index (Elasticsearch):
  100M docs x 500 tokens avg = 5B token occurrences
  After delta+VByte compression: ~50 GB per shard (10 shards)
  Elasticsearch heap: 32 GB per node; OS page cache: 64 GB
  Total Elasticsearch nodes: 10 shards x 2 replicas = 20 nodes

Query compute at 5000 QPS:
  Embedding: 5000 QPS / 1200 embed/s per A10G GPU = 4.2 GPUs -> 5 GPU replicas
  BM25: 5000 QPS / 10 shards = 500 shard searches/s, ~5ms each = 0.25 vCPU per shard node
  ANN: 5000 QPS / 4 shards = 1250 shard searches/s, ~3ms each = 0.375 vCPU per shard node
  Cross-encoder: 5000 QPS x 20 pairs / 2000 pairs-per-s per A10G = 50 GPU replicas
    -> Cache hit rate 60% (same queries repeated): need 50 x 0.4 = 20 GPU replicas

Re-ranking cost justification:
  Without re-ranking: NDCG@10 = 0.63 (hybrid RRF)
  With re-ranking: NDCG@10 = 0.71 (+13%)
  Cost: 20 additional A10G GPU instances at $2/hr = $40/hr = $35k/month
  For a search product with 5000 QPS, this is typically worth it.
  Alternative: cache cross-encoder results in Redis with 60s TTL;
  target 60% cache hit rate to reduce to 20 GPUs.
Real World

Cohere’s Rerank API monetizes exactly the cross-encoder step: you do your own retrieval (BM25 + ANN), then call Cohere’s API with up to 1000 candidates per query to get cross-encoder scores. This offloads the GPU infrastructure cost to Cohere. At 5000 QPS with a 60% cache hit rate and 20 candidates per uncached query, you make 2000 API calls per second - Cohere’s pricing at $2 per 1000 searches makes this $5.76M/year, which is cheaper than running 50 A10Gs in-house but only if you can achieve the cache hit rate.

Failure Modes and Recovery

FailureDetectionImpactRecovery
Embedding service GPU OOMGPU memory alert, embed latency > 50msQueries fall back to BM25-only (no dense path)Circuit breaker disables dense path; auto-scale GPU pool; BM25-only mode degrades recall but maintains p99
HNSW index corruption on shard restartIndex load checksum failureThat shard’s dense retrieval returns empty setFall back to remaining shards; rebuild HNSW from stored float32 vectors (takes 20-30 min per shard)
BM25 shard unavailable (Elasticsearch node down)Elasticsearch health check redSparse path returns empty for affected shard; hybrid quality degradesRoute to replica; Elasticsearch auto-promotes replica to primary within 30s
Cross-encoder latency spikep99 cross-encoder time > 40msLatency gate fires; re-ranking skipped for most queriesAuto-scale cross-encoder replicas; increase Redis TTL to serve more cached results
Stale vector index after model upgradeEmbedding version mismatch alertQueries from new model version mismatch old document vectorsMaintain parallel indexes; route new-model queries to new index; deprecate old index after full re-embedding completes
RRF constant tuning driftOffline NDCG@10 regression below baselineHybrid quality degrades silently; users see worse resultsShadow-score production queries with alternative k values; A/B test before deploying k changes
Watch Out

The most operationally painful failure mode is the embedding model upgrade. When you upgrade from all-MiniLM-L6-v2 to a better model, every document in the index needs re-embedding. The new embeddings live in a different region of the vector space - mixing old and new embeddings in the same HNSW index produces incoherent results. The safe pattern is to build a new parallel HNSW index while keeping the old one serving, gradually route traffic to the new index as re-embedding completes, then deprecate the old index. Do not do partial re-embedding in place.

Comparison of Approaches

ApproachSemantic RecallKeyword PrecisionLatencyInfra CostBest Fit
BM25 onlyLow (vocabulary-bound)High (exact match)5-15msLow (just ES)Legal/regulatory search where exact terms matter
Dense (ANN) onlyHigh (semantic)Low (misses rare terms)10-20ms + embedMedium (GPU embed)Conversational query, FAQ, semantic similarity
Hybrid RRF (this design)HighHigh25-40msMedium-HighGeneral search, e-commerce, enterprise knowledge base
Hybrid + Cross-EncoderVery HighVery High55-100msHigh (GPU rerank)High-stakes search: medical, legal, code, customer support
ColBERT / late interactionHighMedium30-60msHigh (store token vecs)Dense but token-level matching; good for code search
Learning-to-Rank (LTR)Depends on featuresHigh20-50msMediumWhen behavioral signals (clicks) are available for training

The choice between Hybrid RRF and Hybrid + Cross-Encoder often comes down to query distribution. If more than 50% of your queries are navigational (users know the exact document they want), the cross-encoder adds minimal NDCG gain and you can skip it. If more than 50% are informational (open-ended questions, research queries), the cross-encoder’s +8pp NDCG improvement justifies the GPU cost.

Key Takeaways

  • BM25 and dense retrieval fail in opposite directions: BM25 fails on vocabulary mismatch; dense fails on rare terms and proper nouns. Running both in parallel and fusing results is not optional - it is the minimum viable semantic search architecture.
  • Reciprocal Rank Fusion beats score normalization: RRF requires no score normalization, no training data, and degrades gracefully when one retriever returns poor results. Start with k=60 and tune experimentally.
  • Bi-encoders retrieve, cross-encoders rank: Never use a cross-encoder for first-stage retrieval over large corpora. Its latency is quadratic in corpus size; restrict it to at most 50 candidates after RRF fusion.
  • The latency budget must be hardcoded: Implement a latency gate that skips the cross-encoder if upstream retrieval takes over 70ms. A hybrid system that regularly delivers 150ms p99 is worse than a BM25 system that consistently delivers 20ms p99.
  • Query expansion via HyDE is a tradeoff: Hypothetical document embeddings improve recall on underspecified queries by 5-15% but add 50-150ms of LLM latency. Gate it behind a query classifier that enables HyDE only for short, ambiguous queries.
  • Evaluation needs graded relevance: NDCG@10 is the right metric for comparing retrieval systems, but requires labeled data. Use the BEIR benchmark for zero-shot calibration before investing in domain-specific labeling.
  • Embedding model upgrades are the hardest operational problem: Build a parallel index migration capability from day one. Any architecture that requires downtime to upgrade the embedding model will either accumulate technical debt or degrade quality as better models become available.
  • Cross-encoder results cache well: Repeated queries to the same corpus with similar phrasing benefit enormously from a Redis cache with 60-120 second TTL. A 60% cache hit rate halves your GPU cost for the re-ranking tier.

Frequently Asked Questions

Q: Why not use a single model that does both retrieval and re-ranking? A: A model that does both tasks well does not exist at production latency. ColBERT is the closest attempt - it stores per-token embeddings for each document and uses MaxSim scoring at query time for token-level matching. ColBERT achieves better recall than bi-encoders and is faster than cross-encoders, but requires storing 128 token vectors per document (128x more storage than bi-encoder embeddings) and is still slower than ANN search. For most production systems at 100M documents, bi-encoder + cross-encoder on top-K candidates is a better cost-latency-quality tradeoff than ColBERT.

Q: Why not just use Elasticsearch’s built-in kNN search instead of a separate vector service? A: Elasticsearch 8.x does support HNSW (via the knn query), and for teams already running Elasticsearch, starting there is reasonable. The limitation appears at scale: Elasticsearch’s HNSW implementation is optimized for Lucene’s segment-based storage model, which means the HNSW graph is rebuilt per segment and merged periodically. For a write-heavy corpus, Elasticsearch HNSW suffers from segment merge overhead that a dedicated vector database (Qdrant, Weaviate, Milvus) handles more gracefully. If your corpus is stable (product catalog, knowledge base), Elasticsearch’s native kNN is an excellent choice that eliminates operational complexity.

Q: How do you handle multi-language search with this architecture? A: Replace the English-specific embedding model with a multilingual one (e.g., paraphrase-multilingual-MiniLM-L12-v2 from sentence-transformers, or Cohere’s embed-multilingual-v3.0). The BM25 layer needs language-specific analyzers (tokenizers, stemmers, stopword lists) per language, which Elasticsearch handles via language-specific built-in analyzers. RRF fusion is language-agnostic. The cross-encoder needs a multilingual variant (e.g., cross-encoder/msmarco-MiniLM-L6-en-de-v1 for English-German). The main operational complexity is that multilingual embeddings are in the same vector space, so you can cross-language search for free - but this may or may not be desirable for your use case.

Q: What should I monitor in production to catch quality regressions before users notice? A: Four metrics: (1) BM25-ANN overlap@10: percentage of top-10 results that appear in both BM25 and ANN ranked lists. A drop indicates one retriever is degrading. (2) RRF-rerank agreement@1: whether the cross-encoder top result matches the RRF top result. Sustained disagreement above 30% suggests the fusion k needs retuning. (3) Zero-result rate: queries where both retrievers return fewer than K candidates. A spike indicates an index problem. (4) Latency gate fire rate: percentage of queries where the cross-encoder was skipped due to the latency gate. If this exceeds 5%, the upstream latency is degrading and the re-ranking budget is being eaten.

Q: Why not use FAISS instead of hnswlib for the vector index? A: FAISS and hnswlib both implement HNSW, but their operational profiles differ. FAISS is a research library optimized for batch accuracy benchmarks, with excellent GPU acceleration for index builds but complex Python bindings and less focus on production serving. hnswlib is a production-first C++ library with a clean Python API designed for single-node serving. For distributed production systems, purpose-built vector databases (Qdrant, Weaviate) are typically better than either bare library: they add replication, live index updates, metadata filtering, and REST/gRPC APIs without custom engineering.

Q: Can the cross-encoder be replaced with a smaller, faster model? A: Yes, and this is a common optimization. The ms-marco-MiniLM-L-6-v2 cross-encoder (6-layer MiniLM) is already a distilled model - much smaller than BERT-base but within 2-3 pp NDCG of it on standard benchmarks. For even lower latency, ms-marco-TinyBERT-L-2-v2 (2-layer TinyBERT) scores within 5 pp of MiniLM-L6 at 3x lower latency. The right choice depends on your latency budget: if the 30ms gate is firm, TinyBERT; if you can relax to 50ms, MiniLM-L6 is meaningfully better.

Interview Questions

Q: Walk me through why BM25 and dense retrieval fail in complementary ways, and how that shapes the hybrid architecture.

Expected depth: BM25 scores documents based on term overlap - if the query “automobile fuel efficiency” does not appear verbatim in any document, BM25 returns zero results even if every document about “car miles per gallon” is highly relevant. Dense retrieval encodes semantic meaning: “automobile fuel efficiency” and “car mpg” map to nearby vectors. But dense retrieval fails on rare or novel terms: a query for “GPT-4 jailbreak CVE-2024-XXXXX” produces a meaningless embedding because the model was not trained on that identifier, while BM25 finds exact matches trivially. The candidate must explain that running both in parallel covers each other’s failure modes, and that the fusion step (RRF or score combination) must not require normalization to work when one retriever returns an empty result set.

Q: Design the embedding service such that it handles 5000 QPS with sub-10ms latency per query.

Expected depth: The key insight is that embedding latency is dominated by model inference, not the query itself. At batch size 1, a MiniLM-L6-v2 on A10G GPU takes ~4ms. To handle 5000 QPS at that latency, you need at least 5000 x 4ms / 1000ms = 20 queries simultaneously - achievable via batching. Dynamic micro-batching: accumulate queries for 1ms, then run a batch of 5-20 queries together. The batch processes in ~6ms (marginal increase from batching), but throughput is now 5-20 queries per 6ms = up to 3300 QPS per GPU. At 5000 QPS, need 2 GPUs with some headroom. Cover: gRPC service with asyncio request handling, batch accumulator with a timeout of 1ms and a max batch size of 32, response dispatch back to waiting requests. Also discuss caching: repeated queries get the same embedding; a Redis LRU cache with 1-second TTL handles high-repetition query workloads.

Q: How would you evaluate whether adding cross-encoder re-ranking improves real user experience, not just offline NDCG?

Expected depth: Offline metrics (NDCG@10) measure against a labeled test set which may not reflect actual user queries. To evaluate real user impact: (1) A/B test: route 50% of traffic to re-ranking, 50% to RRF-only; measure click-through rate (CTR) on top-3 results, mean reciprocal rank of first click, and zero-click rate (user abandoned search). (2) Interleaving test: for each query, create a merged result list that interleaves re-ranked and non-re-ranked results; measure which results users prefer by click position. (3) Track latency impact: re-ranking adds 20-40ms; does CTR improvement justify this? If users bounce 2% more due to latency than they gain from relevance, re-ranking is net negative. The candidate should know the typical result: re-ranking improves CTR by 5-15% for informational queries and has near-zero impact on navigational queries, with latency overhead typically acceptable for B2B/enterprise search but borderline for consumer search.

Q: The cross-encoder latency gate is firing 40% of the time - what do you investigate?

Expected depth: A 40% fire rate means upstream retrieval (embed + BM25 + ANN + fusion) is exceeding 70ms on 40% of queries. Diagnose by component: (1) Check embedding service p99 - if it is spiking to 20-30ms, the GPU pool is undersized or has hot-spot queries (very long queries run slower; cap query length at 256 tokens). (2) Check ANN search latency per shard - if one shard is consistently slow, it may be garbage collecting, or the HNSW index is partially evicted from memory. (3) Check BM25 shard latency - if Elasticsearch shard latency is spiking, look for segment merge storms or hot queries with very large posting lists. (4) Check RRF fusion latency - usually negligible (under 2ms), but a bug that loads documents at fusion time would spike it. The resolution often involves: increase embedding batch timeout to 0.5ms to improve GPU utilization, add one GPU replica, and tune Elasticsearch merge policy to reduce merge storm frequency.

Q: How do you maintain two synchronized indexes (inverted index and vector index) without inconsistency?

Expected depth: The canonical solution is to treat the document ID as the synchronization primitive. When a document is ingested: (1) assign a stable doc_id, (2) write to a queue (Kafka topic document-ingest), (3) a BM25 consumer indexes into Elasticsearch with that doc_id, (4) an embedding consumer generates the vector and inserts into HNSW with that doc_id as the label. Both consumers are idempotent (re-processing a document with the same doc_id is safe). Eventual consistency window: 5-30 seconds while both consumers process. During that window, a query might find the document in BM25 but not ANN (or vice versa). RRF handles this gracefully: a document appearing in only one retriever still gets an RRF score from that retriever and can appear in results. For deletion: use soft deletes - mark is_deleted=true in the document registry, propagate to Elasticsearch (delete document), and mark the HNSW entry as deleted (hnswlib supports soft delete via mark_deleted). Hard deletion from HNSW requires index rebuild, so defer it to periodic compaction.

Premium Content

Unlock the full article along with everything else in the archive — all in one place.

In-depth analysis Expert insights Full archive access
Unlock Full Article