Build a Vector Search Engine


databases performance data-engineering

System Design Deep Dive

Vector Search Engine

Finding the nearest neighbor in 100 million dimensions in under 20ms - accuracy vs speed is a false choice.

⏱ 14 min read📐 Advanced🏗️ Vector-Search

Think of a card catalog in a library from the 1970s. You want to find books “similar to” a book you already read. A keyword search looks for exact word matches - title, author, subject heading. But similarity is about meaning, not words. Vector search is the modern equivalent of a librarian who has read everything and can say “based on the themes in that novel, you’d also like these three, even though their titles share no words.”

The engineering challenge is that measuring similarity between 100 million vectors of 1,536 dimensions - typical for a modern embedding model - requires computing dot products or cosine distances. Exact nearest neighbor search requires comparing a query against all 100 million vectors: even at 1 nanosecond per distance calculation, that is 100 milliseconds of pure compute per query. At 10,000 queries per second that is 1,000 CPU-seconds per second - a compute bill that scales linearly with dataset size.

Approximate Nearest Neighbor (ANN) algorithms solve this by accepting a small accuracy penalty in exchange for orders-of-magnitude speedup. The HNSW (Hierarchical Navigable Small World) graph structure can search 100 million 1,536-dimensional vectors in under 2ms with 95%+ recall. But HNSW introduces new challenges: the graph is built in memory, making it expensive at scale; updates to the index require rebuilding affected graph connections; and running HNSW at production scale means sharding, replication, and live updates without search downtime.

Quantization further reduces memory - a 1,536-dimension float32 vector is 6KB; the same vector quantized to int8 is 1.5KB, and a binary quantized version is 192 bytes. But quantization degrades recall, so you re-rank the approximate candidates against the full-precision vectors to recover accuracy.

We need to solve for ANN index structure, live index updates, quantization with re-ranking, and horizontal sharding simultaneously.

Requirements and Constraints

Functional Requirements

  • Index 100 million high-dimensional embedding vectors (1,024 to 1,536 dimensions, float32)
  • Serve k-nearest-neighbor queries (k=1 to 100) with under 20ms p99 latency
  • Support live index updates: insert, update, and soft-delete vectors without taking the index offline
  • Metadata filtering: combine vector similarity with structured filters (e.g., “find the 10 most similar documents that are also tagged category=finance and date > 2024-01-01”)
  • Batch indexing: ingest up to 1 million new vectors per hour during bulk loads

Non-Functional Requirements

  • Query latency: p50 under 5ms, p99 under 20ms, p999 under 50ms
  • Recall at k=10: above 95% vs exact nearest neighbor
  • Throughput: 10,000 queries/second at target latency
  • Availability: 99.99% for reads; 99.9% for writes
  • Storage: 100 million x 1,536 x 4 bytes = 614 GB raw; target under 200 GB with quantization
  • Index rebuild time: under 4 hours for full 100M vector rebuild

Constraints

  • Vector dimensionality is fixed at index creation; schema changes require full rebuild
  • Approximate results are acceptable - exact KNN is out of scope for production queries
  • Cross-shard filtering requires merging results; we handle this at the query router layer

High-Level Architecture

The system has four planes: the ingestion pipeline that embeds and indexes new vectors, the index layer that maintains the HNSW graph per shard, the query router that fans out searches and merges results, and the re-ranking layer that refines approximate results.

Vector search engine system architecture overview

Raw documents arrive at the embedding service, which calls an embedding model (OpenAI, Cohere, or a self-hosted sentence-transformer) to produce dense vectors. Vectors are routed to the correct index shard based on a hash of the vector ID. Each shard maintains an in-memory HNSW graph and a persistent vector store on disk. At query time, the query router converts the query text to an embedding, fans out the search to all shards, collects top-k candidates from each, and merges them. The re-ranker optionally scores the merged candidates against the original query for precision-critical use cases.

Key Insight

The HNSW graph must fit in RAM for sub-10ms search. At 100M vectors x 1,536 dims x 4 bytes = 614 GB, a single-node in-memory index is impractical. Sharding reduces each shard to a manageable size, but merging k results from N shards requires N x k candidates and an extra merge step - this is the fundamental latency-recall tradeoff in distributed vector search.

The HNSW Index

HNSW’s job is to find the approximate k nearest neighbors in a large vector space without scanning every point.

HNSW builds a multi-layer graph where each node represents a vector. The top layer is sparse - only a few hundred nodes connected with long-range edges. Each lower layer is denser, with shorter-range edges. A search starts at the top layer, greedily moves toward the query, drops down a layer, and repeats until reaching the bottom layer. This is analogous to a navigation system that first routes you to the correct city (top layer), then to the neighborhood (mid layer), then to the street (bottom layer).

HNSW graph structure and search path
# HNSW search: greedy layer-by-layer descent
import heapq
import numpy as np
from dataclasses import dataclass, field
from typing import Any

@dataclass
class HNSWNode:
    id: int
    vector: np.ndarray
    connections: dict = field(default_factory=dict)  # layer -> list[int]

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def hnsw_search(
    nodes: dict[int, HNSWNode],
    query: np.ndarray,
    entry_point_id: int,
    num_layers: int,
    ef: int,  # exploration factor: higher = better recall, slower
    k: int
) -> list[tuple[float, int]]:
    current_id = entry_point_id
    current_dist = cosine_distance(query, nodes[current_id].vector)

    # Greedy search from top layer down to layer 1
    for layer in range(num_layers, 0, -1):
        changed = True
        while changed:
            changed = False
            for neighbor_id in nodes[current_id].connections.get(layer, []):
                d = cosine_distance(query, nodes[neighbor_id].vector)
                if d < current_dist:
                    current_dist = d
                    current_id = neighbor_id
                    changed = True

    # Beam search at layer 0 with ef candidates
    candidates = [(current_dist, current_id)]
    visited = {current_id}
    result_heap = [(-current_dist, current_id)]  # max-heap by distance

    while candidates:
        dist, node_id = heapq.heappop(candidates)
        if dist > -result_heap[0][0]:
            break

        for neighbor_id in nodes[node_id].connections.get(0, []):
            if neighbor_id not in visited:
                visited.add(neighbor_id)
                d = cosine_distance(query, nodes[neighbor_id].vector)
                if len(result_heap) < ef or d < -result_heap[0][0]:
                    heapq.heappush(candidates, (d, neighbor_id))
                    heapq.heappush(result_heap, (-d, neighbor_id))
                    if len(result_heap) > ef:
                        heapq.heappop(result_heap)

    results = sorted([(-d, nid) for d, nid in result_heap], key=lambda x: x[0])
    return results[:k]

The ef parameter controls the recall-speed tradeoff. At ef=10, the search is fast but may miss some true neighbors. At ef=200, recall approaches 99% but latency doubles. Production systems tune ef per use case.

Real World

Weaviate, Qdrant, and Milvus all implement HNSW as their primary index structure. Pinecone uses a proprietary variant of HNSW optimized for on-disk operation. The original HNSW paper by Malkov and Yashunin (2018) demonstrated that the algorithm achieves state-of-the-art recall-speed tradeoffs on billion-scale datasets.

Quantization and Re-Ranking

Memory is the dominant constraint for large-scale vector search. A 100M vector index at float32 takes 600 GB - requiring many machines or massive RAM. Quantization compresses vectors to reduce memory, accepting a recall penalty that re-ranking recovers.

The strategy is two-stage: search with compressed vectors for speed, re-rank with full-precision vectors for accuracy.

# Scalar quantization: float32 -> int8 (4x memory reduction)
import numpy as np

def quantize_f32_to_int8(vectors: np.ndarray) -> tuple[np.ndarray, float, float]:
    """Returns quantized vectors, scale, and zero point."""
    v_min = vectors.min()
    v_max = vectors.max()
    scale = (v_max - v_min) / 255.0
    zero_point = v_min

    quantized = np.clip(
        np.round((vectors - zero_point) / scale),
        0, 255
    ).astype(np.uint8)
    return quantized, scale, zero_point

def dequantize_int8_to_f32(
    quantized: np.ndarray, scale: float, zero_point: float
) -> np.ndarray:
    return quantized.astype(np.float32) * scale + zero_point

def two_stage_search(
    index_int8,       # HNSW built on int8 vectors
    full_vectors,     # original float32 vectors, indexed by id
    query: np.ndarray,
    k: int,
    oversample: int = 4  # fetch 4x candidates for re-ranking
) -> list[tuple[float, int]]:
    # Stage 1: fast ANN on quantized index, get k*oversample candidates
    candidates = index_int8.search(query, k=k * oversample)

    # Stage 2: re-rank candidates with exact float32 distance
    reranked = []
    for candidate_id, _ in candidates:
        full_vec = full_vectors[candidate_id]
        exact_dist = cosine_distance(query, full_vec)
        reranked.append((exact_dist, candidate_id))

    reranked.sort(key=lambda x: x[0])
    return reranked[:k]

The oversampling factor 4 means: fetch 40 candidates from the compressed index, re-rank all 40 with exact distances, return top 10. The exact distance computation for 40 vectors at 1,536 dims is about 40 x 1,536 x 3 operations = 184K floating point ops - negligible compared to searching 100M vectors.

Watch Out

Product quantization (PQ) reduces memory more aggressively than scalar quantization but requires training a codebook on your specific vector distribution. If you use a generic codebook or apply PQ to an out-of-distribution dataset, recall degrades severely. Always train and evaluate quantization on a representative sample of your actual production vectors.

Live Index Updates

HNSW was designed for static indices. Live updates are the hard part: inserting a new vector requires finding its neighbors and connecting it into the graph, which may require updating the connections of existing nodes. Deletes are even harder - removing a node from a graph without leaving dangling edges requires reconnecting its neighbors.

The practical solution is soft deletes with periodic compaction:

# Soft delete and compaction pattern
import time
from threading import Lock

class LiveHNSWIndex:
    def __init__(self, base_index, max_pending: int = 50_000):
        self.base_index = base_index  # immutable HNSW built from snapshot
        self.pending_inserts: dict[int, np.ndarray] = {}  # id -> vector
        self.deleted_ids: set[int] = set()
        self.lock = Lock()
        self.max_pending = max_pending

    def insert(self, vector_id: int, vector: np.ndarray):
        with self.lock:
            self.pending_inserts[vector_id] = vector
            if len(self.pending_inserts) > self.max_pending:
                self._trigger_compaction()

    def delete(self, vector_id: int):
        with self.lock:
            self.deleted_ids.add(vector_id)
            self.pending_inserts.pop(vector_id, None)

    def search(self, query: np.ndarray, k: int, ef: int = 50) -> list:
        with self.lock:
            # Search base index, filter deleted
            base_results = self.base_index.search(query, k=k * 2, ef=ef)
            base_filtered = [
                (dist, vid) for dist, vid in base_results
                if vid not in self.deleted_ids
            ]

            # Search pending inserts via brute force (small buffer)
            pending_results = []
            for vid, vec in self.pending_inserts.items():
                if vid not in self.deleted_ids:
                    d = cosine_distance(query, vec)
                    pending_results.append((d, vid))

            # Merge and return top k
            all_results = sorted(base_filtered + pending_results)
            return all_results[:k]

    def _trigger_compaction(self):
        # Rebuild HNSW incorporating pending inserts, excluding deleted
        # Runs in background; swaps base_index atomically when complete
        pass
Key Insight

The pending insert buffer is searched by brute force because it is small (under 50,000 vectors). Brute-force on 50K vectors at 1,536 dims takes under 1ms - fast enough to merge with the HNSW results without adding meaningful latency. This is the same trick LSM-trees use: a small in-memory write buffer plus a large on-disk index.

Data Model

The vector store has three storage layers: the HNSW graph (in-memory, rebuilt on startup), the raw vector store (on-disk, float32 for re-ranking), and the metadata store (relational, for filtered search).

-- Vector metadata: structured attributes for filtered search
CREATE TABLE vector_metadata (
  vector_id    BIGINT       NOT NULL PRIMARY KEY,
  namespace    VARCHAR(64)  NOT NULL,
  external_id  VARCHAR(256) NOT NULL,
  category     VARCHAR(128),
  created_at   TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
  updated_at   TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
  is_deleted   BOOLEAN      NOT NULL DEFAULT FALSE,
  payload      JSONB        NOT NULL DEFAULT '{}'
);

CREATE INDEX idx_vm_namespace ON vector_metadata (namespace, is_deleted);
CREATE INDEX idx_vm_category ON vector_metadata (namespace, category, is_deleted);
CREATE INDEX idx_vm_external ON vector_metadata (namespace, external_id) WHERE NOT is_deleted;

-- Shard assignment: which shard holds which vector
CREATE TABLE shard_assignments (
  vector_id    BIGINT    NOT NULL PRIMARY KEY,
  shard_id     INT       NOT NULL,
  indexed_at   TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_sa_shard ON shard_assignments (shard_id);

The raw vectors are stored in a flat binary file per shard, indexed by position:

# Raw vector store: flat binary file, O(1) random access by vector_id
import struct
import mmap
import os

VECTOR_DIM = 1536
VECTOR_BYTES = VECTOR_DIM * 4  # float32

class FlatVectorStore:
    def __init__(self, path: str, max_vectors: int):
        self.path = path
        self.max_vectors = max_vectors
        file_size = max_vectors * VECTOR_BYTES
        if not os.path.exists(path):
            with open(path, 'wb') as f:
                f.seek(file_size - 1)
                f.write(b'\x00')
        self.f = open(path, 'r+b')
        self.mm = mmap.mmap(self.f.fileno(), 0)

    def write(self, position: int, vector: np.ndarray):
        offset = position * VECTOR_BYTES
        self.mm[offset:offset + VECTOR_BYTES] = vector.astype(np.float32).tobytes()

    def read(self, position: int) -> np.ndarray:
        offset = position * VECTOR_BYTES
        raw = self.mm[offset:offset + VECTOR_BYTES]
        return np.frombuffer(raw, dtype=np.float32).copy()
Real World

Faiss (Facebook AI Similarity Search) stores vectors exactly this way - a flat binary index (IndexFlatL2) that enables O(1) random access for re-ranking. HNSW in Faiss is layered on top, with the flat store providing the exact distance computation needed for the two-stage search pipeline.

Key Algorithms and Protocols

Filtered vector search - “find the 10 most similar vectors where category = 'finance'” - is harder than it looks. Two naive approaches fail at different extremes.

Pre-filter: fetch all IDs matching the filter, build a sub-index, search it. Correct, but slow when the filter matches millions of vectors.

Post-filter: run the full ANN search for k x oversampling candidates, then filter by metadata. Fast, but if the filter is selective (1% of vectors match), you may get 0 results despite candidates existing.

The correct approach is ACORN (segment-aware post-filter) or segment-scoped search: pre-partition the index by high-cardinality filter attributes, and route queries to the relevant partition.

# Segment-scoped search: partition index by category
def filtered_search(
    segment_indices: dict[str, HNSWIndex],  # category -> HNSW shard
    full_vectors: FlatVectorStore,
    query: np.ndarray,
    category_filter: str | None,
    k: int,
    ef: int = 100
) -> list[tuple[float, int]]:
    if category_filter and category_filter in segment_indices:
        # Route to category-specific segment
        index = segment_indices[category_filter]
        candidates = index.search(query, k=k * 4, ef=ef)
    else:
        # Search all segments, merge results
        all_candidates = []
        for idx in segment_indices.values():
            all_candidates.extend(idx.search(query, k=k, ef=ef))
        candidates = sorted(all_candidates)[:k * 4]

    # Re-rank with full-precision vectors
    reranked = []
    for dist, vid in candidates:
        exact_dist = cosine_distance(query, full_vectors.read(vid))
        reranked.append((exact_dist, vid))

    return sorted(reranked)[:k]

Embedding Dimensionality Reduction via PCA

When ingesting vectors from multiple embedding models with different dimensions, normalize to a common dimension via PCA projection:

# PCA projection to reduce dimensionality while preserving variance
from sklearn.decomposition import PCA
import numpy as np

def fit_pca_projection(sample_vectors: np.ndarray, target_dim: int) -> PCA:
    """Fit PCA on a sample of vectors. Use at ingest time."""
    pca = PCA(n_components=target_dim, random_state=42)
    pca.fit(sample_vectors)
    explained = pca.explained_variance_ratio_.sum()
    print(f"PCA {sample_vectors.shape[1]}D -> {target_dim}D: {explained:.1%} variance retained")
    return pca

def project_and_normalize(
    vectors: np.ndarray, pca: PCA
) -> np.ndarray:
    projected = pca.transform(vectors)
    # L2 normalize for cosine similarity via dot product
    norms = np.linalg.norm(projected, axis=1, keepdims=True)
    return projected / np.maximum(norms, 1e-8)
Key Insight

Normalizing vectors to unit length before indexing converts cosine similarity search into a maximum inner product search (MIPS), which lets you use dot product distance instead of cosine - dot product is 20% faster because it avoids two norm computations per comparison.

Scaling and Performance

The index scales by horizontal sharding - splitting the 100M vector dataset across N shards, each holding 100M/N vectors. Each shard runs a full HNSW in memory. Queries fan out to all shards and merge results.

Horizontal sharding and query fan-out at scale
Capacity Estimation:
Given:
  - 100M vectors at 1,536 dimensions, float32
  - 10,000 queries/second
  - 20ms p99 latency target
  - k=10 results per query

Raw vector storage:
  100M x 1,536 x 4 bytes = 614 GB

With int8 quantization (4x reduction):
  HNSW graph memory: ~614 / 4 = 154 GB for quantized vectors
  HNSW graph overhead: ~20% = ~31 GB
  Total per full index: ~185 GB

With 10 shards of 10M vectors each:
  Per shard RAM: ~18.5 GB (fits in a 32 GB instance)
  Re-ranking vectors (float32, hot cache): ~10% of shard = 1.8 GB

Query compute:
  10,000 QPS x 10 shards = 100,000 shard searches/s
  Each shard search: 5ms -> 0.5 vCPU per shard
  10 shards x 0.5 vCPU = 5 vCPUs minimum for search
  With 3x headroom: 15-20 vCPUs for search layer

Embedding compute:
  10,000 QPS x 1536-dim embedding = ~40ms at batch size 1 (GPU)
  GPU throughput at batch 64: ~1600 embeddings/s per GPU
  Need: 10,000 / 1,600 = 6.25 -> 8 GPUs for embedding
Real World

Pinecone’s serverless architecture separates storage (S3) from compute (pods that load index segments on demand). This decoupling allows scaling query compute independently from storage, which is the right tradeoff when query patterns are bursty but the index is stable. Weaviate’s cloud offering uses a similar storage-compute separation.

Failure Modes and Recovery

FailureDetectionImpactRecovery
Shard node OOM (HNSW evicted)Memory pressure alertThat shard goes offline; queries degradeReload HNSW from persisted snapshot; warm with hot queries
Index build failure midwayBuild job heartbeat timeoutStale index; new vectors not searchableResume from last checkpoint; replay pending insert log
Embedding service latency spikep99 latency alertQuery latency exceeds SLAFallback to cached embeddings for recent queries; scale GPU fleet
Metadata store slow queryDB query latency alertFiltered search degradesTune index; for hot filters, pre-join with vector IDs at index time
Shard data corruptionChecksum mismatch on loadShard offlineRestore from object storage snapshot; rebuild from raw vectors
Split-brain after network partitionShard replica divergenceStale results from one replicaRead-repair on query; promote replica with fresher snapshot
Watch Out

HNSW graph construction is not idempotent - inserting the same vector twice creates duplicate nodes, degrading recall as duplicate results fill the top-k slots. Always check for existing IDs before inserting. Use an upsert API (delete-then-insert) rather than blind inserts for any update flow.

Comparison of Approaches

ApproachRecallLatencyMemoryLive UpdatesBest Fit
HNSW (this design)95-99%1-5msHigh (full index in RAM)Moderate (soft delete + compaction)Low-latency production search
FAISS IVF-Flat90-97%5-15msMedium (centroid list)Hard (full rebuild)Batch search, research workloads
FAISS IVF-PQ80-93%2-8msLow (heavy quantization)HardMemory-constrained environments
Exact KNN (brute force)100%100ms+Low (just vectors)EasySmall datasets under 1M vectors
ScaNN (Google)95-99%2-6msMediumModerateLarge-scale with TPU acceleration
DiskANN95-98%5-15msLow (SSD-backed)ModerateBillion-scale on commodity hardware

For a production system at 100M vectors requiring sub-20ms latency and 95%+ recall, HNSW is the right choice. If the dataset grows toward 1 billion vectors and RAM becomes the constraint, DiskANN’s SSD-backed index is the practical next step - it achieves comparable recall at 3-5x higher latency by streaming graph traversal from fast NVMe SSDs rather than keeping the full graph in memory.

Key Takeaways

  • ANN trades exactness for speed: HNSW achieves 95%+ recall with 100x+ speedup over brute-force; the recall gap is recovered by oversampling and re-ranking
  • Quantization is memory, not accuracy: int8 quantization reduces memory 4x with minimal recall loss when paired with float32 re-ranking
  • HNSW must fit in RAM: the graph traversal pattern defeats OS page cache; structure your shard sizes so the full HNSW graph resides in memory
  • Live updates use an LSM-style buffer: small write buffer searched by brute force, merged periodically into the main HNSW index via compaction
  • Filtered search requires segment-aware routing: post-filter fails on selective filters; pre-build per-segment indices for high-cardinality filter attributes
  • Normalize vectors to unit length: converts cosine search to inner product, enabling faster SIMD-optimized distance computation
  • Query fan-out multiplies latency budget: with 10 shards, your 20ms budget allows only 15ms per shard (5ms for merge and network); shard sizing drives your latency floor

The counter-intuitive lesson: exact nearest neighbor search is not a baseline to strive toward - it is a failure mode. At 100M vectors, even 99.9% recall from ANN is far more valuable than 100% recall from a system that is too slow to deploy. The question is never “is this approximate?” but “is the approximation error within the product’s tolerance?”

Frequently Asked Questions

Q: When should I use product quantization (PQ) vs scalar quantization vs binary quantization? A: Scalar quantization (float32 to int8) is the safest default - 4x memory reduction, under 1% recall loss, minimal implementation complexity. PQ achieves 16-32x compression but requires training a codebook and accepts 5-10% recall loss; use it when memory is severely constrained and you have representative training data. Binary quantization (1 bit per dimension) gives 32x compression but degrades recall significantly unless combined with aggressive re-ranking oversampling (64-128x); OpenAI recommends it only for their latest embeddings designed for binary compatibility.

Q: How do you handle multi-tenant isolation in a shared vector index? A: Three options: separate indices per tenant (strong isolation, high operational overhead), namespace partitioning within a shared index (each vector tagged with tenant ID, queries filtered by namespace), or dedicated shard routing (tenant A always queries shards 0-2, tenant B shards 3-5). The namespace approach works well up to thousands of tenants; above that, dedicated shards per large tenant with shared shards for small tenants is a common tiered approach. The key constraint is that HNSW graphs cannot be meaningfully merged across tenants once built.

Q: Why not just use a full-text search engine like Elasticsearch with vector fields? A: Elasticsearch added vector search (kNN) in version 8.x, but the implementation uses HNSW under the hood - so you get the same algorithm with additional JSON serialization overhead. For pure vector workloads, a purpose-built vector database (Qdrant, Weaviate, Milvus) is 2-5x faster due to SIMD-optimized distance kernels, memory layout optimization, and no document storage overhead. Elasticsearch makes sense when you need hybrid: exact keyword matching combined with semantic similarity in the same query.

Q: How do you keep embedding quality consistent as the embedding model updates? A: This is the most underappreciated operational challenge in vector search. When you upgrade the embedding model, old vectors and new vectors occupy different regions of the embedding space - querying the old index with a new-model query vector produces meaningless results. Solutions: maintain a version field on each vector, keep both old and new indices in parallel during migration, re-embed all vectors with the new model before switching query routing. Never do a partial migration. The re-embedding job for 100M vectors at a typical embedding API cost is non-trivial - this is an argument for self-hosting your embedding model.

Q: What is the right ef parameter for production HNSW? A: Start with ef = 50 and measure recall against an exact brute-force sample. Tune upward until you hit recall targets or latency budget. Most production systems land at ef = 50-200. The relationship is roughly linear: doubling ef roughly doubles latency and improves recall by 1-3 percentage points above 90%. For k = 10 queries at 95% recall target, ef = 100 is typical. For k = 100 queries, ef = 200-400 is common.

Q: How do you evaluate recall in production without ground truth? A: Sample 1% of production queries and run them against both the ANN index and a brute-force exact index on the same shard. Track “overlap@k” - the fraction of results that appear in both. Do this continuously in a shadow query path so recall degradation (from index staleness, quantization drift, or ef tuning changes) is caught before it affects user experience. A 5% recall drop is often invisible to users but indicates structural problems in the index.

Interview Questions

Q: Walk me through the memory layout and access pattern of HNSW search. Why does it require the full graph in RAM?

Expected depth: HNSW search traverses graph edges in a greedy fashion - each step reads a node, examines its neighbor list, jumps to the nearest neighbor. The access pattern is pointer-chasing with poor spatial locality: consecutive steps access memory addresses determined at runtime by the current node’s neighbors. This defeats hardware prefetching and the OS page cache. A single search might touch 200 random nodes across the graph; if those nodes are on disk, even NVMe SSD at 100 microseconds per random read makes one search take 20ms just in IO. HNSW on disk is only viable with specialized prefetching (DiskANN’s beam search with predictive prefetch).

Q: Two identical queries arrive 1ms apart. How do you avoid computing the embedding and running the ANN search twice?

Expected depth: Query result caching with an embedding-based cache key. Hash the query string for exact-match cache; use embedding similarity for semantic cache (cache hit if query embedding is within distance 0.02 of a cached query). The semantic cache requires a small secondary HNSW over cached query embeddings - an index-of-indices. Cover TTL strategy (cached results go stale as index updates arrive; use a version vector to invalidate), cold start (first query always computes, caches the result), and the semantic cache quality tradeoff (too tight threshold = low hit rate; too loose = returning results for a different question).

Q: Your HNSW recall drops from 96% to 88% over 3 months. Diagnose it.

Expected depth: Several causes: (1) Index fragmentation - too many soft-deleted vectors polluting the graph, degrading edge quality; trigger full compaction. (2) Distribution shift - new vectors from a different data source have different embedding distribution than training data; the existing graph has poor coverage in new regions. (3) ef drift - ef parameter inadvertently lowered by a config change. (4) Quantization codebook staleness - if using PQ with a trained codebook, new vectors fall outside the codebook’s learned distribution. Diagnosis approach: compare recall on old vs new vectors separately; if new vectors have lower recall, it is a distribution shift problem.

Q: Design the embedding pipeline to handle 1M new document inserts per hour without blocking query serving.

Expected depth: Batch embedding on GPUs asynchronously from the query path. Queue new documents into Kafka, consumer pulls batches of 256, sends to embedding API, writes vectors to a pending insert buffer (not the HNSW). The HNSW search includes the pending buffer via brute force. Compaction runs every 15 minutes, building a delta HNSW from pending vectors and merging it into the main index. The merge is a background atomic swap - no query downtime. Cover back-pressure: if the pending buffer grows above 500K vectors, brute-force latency degrades; trigger emergency compaction or temporarily throttle ingest.

Premium Content

Unlock the full article along with everything else in the archive — all in one place.

In-depth analysis Expert insights Full archive access
Unlock Full Article