Build a RAG Pipeline for an Enterprise Knowledge Base



System Design Deep Dive

Enterprise RAG Pipeline

When every answer must be grounded in your documents, not the model’s training data


A large enterprise has 10 years of internal documentation scattered across Confluence, Google Drive, Slack archives, and engineering wikis - roughly 2 million documents and growing. An engineer asks the new AI assistant: “What is our incident escalation policy for P0s?” The LLM does not know. It was trained on public data, not your internal runbooks. Without retrieval, it either halts or hallucinates an answer that sounds authoritative but cites procedures from a different company entirely.

Retrieval-Augmented Generation solves this by treating the LLM like a brilliant analyst and your document corpus like the analyst’s reference library. Before the LLM answers, the system fetches the most relevant passages from your knowledge base and hands them to the model as context. Think of it as giving a consultant the exact page of the manual before asking the question - rather than relying on what they remember from reading it three years ago.

The engineering challenge is not just “embed documents and do cosine search.” Enterprise RAG at scale means: documents arrive in 20 formats and need structural understanding before chunking; keyword queries still outperform embeddings for exact-match lookups; embedding models drift over time as internal terminology evolves; re-ranking a candidate pool of 50 documents on every query with a cross-encoder adds 80ms of GPU latency; and measuring whether answers are actually correct requires an evaluation harness that runs continuously.

We need to solve for retrieval quality, query latency, index freshness, and answer correctness simultaneously.

Requirements and Constraints

Functional Requirements

  • Ingest documents from Confluence, Google Drive, Slack, and PDF uploads
  • Chunk documents preserving semantic coherence
  • Embed chunks using a consistent embedding model
  • Support hybrid retrieval: dense vector similarity plus keyword BM25
  • Re-rank retrieved chunks with a cross-encoder before passing to LLM
  • Manage context window assembly to fit within LLM token limits
  • Return grounded answers with source citations
  • Provide an evaluation framework measuring retrieval and generation quality

Non-Functional Requirements

  • Query latency: p95 under 3 seconds including LLM generation (streaming preferred)
  • Index freshness: new documents searchable within 5 minutes of ingestion
  • Corpus scale: 2 million documents, average 10 pages each, ~10 billion tokens total
  • Concurrent queries: 500 QPS at peak
  • Embedding throughput: 50,000 chunks/hour during bulk ingestion
  • Retrieval accuracy: NDCG@5 above 0.82 on internal eval set
  • LLM answer faithfulness: above 0.90 per RAGAS evaluation

Constraints and Assumptions

  • Documents require access control - results must respect per-user permissions
  • Embedding model is fixed across ingest and query (no mid-flight model changes)
  • LLM context window is 128K tokens; we target 4K-8K for retrieved context
  • No real-time document editing sync required (5-minute freshness is acceptable)

High-Level Architecture

The pipeline splits cleanly into two paths: the ingestion path that processes and indexes documents, and the query path that retrieves and generates answers.

RAG pipeline full architecture showing ingestion and query paths

The ingestion path starts when a document arrives from a source connector. The chunker breaks it into semantically coherent pieces. The embedder converts each chunk to a 1536-dimensional vector. Two indexes receive the output in parallel: the vector store (pgvector or Pinecone) gets the dense embedding, and the BM25 index (Elasticsearch) gets the tokenized text.

The query path starts with a user question. The query encoder converts it to the same vector space as the indexed chunks - using the exact same embedding model ensures the distance calculations are meaningful. The hybrid retrieval layer runs ANN search against the vector store and keyword search against Elasticsearch simultaneously, then fuses the results using Reciprocal Rank Fusion. The top 50 candidates go through the cross-encoder re-ranker, which scores each (query, chunk) pair more precisely. The top 5 land in context assembly, which builds the final prompt and calls the LLM.

Key Insight

The embedding model used at query time must be byte-for-byte identical to the model used at ingest time - even minor version changes produce incompatible vector spaces that silently destroy retrieval quality.

Document Chunking Pipeline

The chunker’s job is to split a document into pieces that are small enough to fit in context but large enough to be semantically self-contained. Think of it like cutting newspaper articles to paste into a scrapbook: you want each clipping to make sense on its own, not end mid-sentence.

Chunking strategy comparison and decision tree

Recursive character chunking is the workhorse strategy. It tries to split on paragraph boundaries first (\n\n), then line breaks (\n), then sentences (.), then spaces. Target chunk size is 400 tokens with 80-token overlap between adjacent chunks. The overlap ensures that a concept split across a boundary still appears completely in at least one chunk.

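# Simplified recursive chunking with overlap, in the spirit of LangChain's
# RecursiveCharacterTextSplitter. Sizes are counted in whitespace-separated
# words as a rough proxy for tokens; swap in a real tokenizer for production.
from typing import List

def recursive_chunk(text: str, chunk_size: int = 400, overlap: int = 80) -> List[str]:
    if not text.strip():
        return []
    separators = ["\n\n", "\n", ". ", " "]
    for sep in separators:
        parts = [p for p in text.split(sep) if p.strip()]
        # use the coarsest separator whose pieces all fit in one chunk
        if max(len(p.split()) for p in parts) <= chunk_size:
            chunks = []
            current = ""
            for part in parts:
                # greedily pack pieces until the next one would overflow
                candidate = (current + sep + part).strip() if current else part.strip()
                if len(candidate.split()) <= chunk_size:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    current = part.strip()
            if current:
                chunks.append(current)
            return add_overlap(chunks, overlap)
    # fallback: hard split on words when no separator yields small enough pieces
    words = text.split()
    step = max(chunk_size - overlap, 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

def add_overlap(chunks: List[str], overlap_tokens: int) -> List[str]:
    # prepend the tail of the previous chunk; the result may exceed
    # chunk_size by up to overlap_tokens words
    if not chunks or overlap_tokens <= 0:
        return chunks
    result = [chunks[0]]
    for i in range(1, len(chunks)):
        prev_tail = " ".join(chunks[i-1].split()[-overlap_tokens:])
        result.append(prev_tail + " " + chunks[i])
    return result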

For code and markdown, recursive splitting breaks function definitions mid-signature. Use AST-aware splitting instead: parse the file as code and split at function or class boundaries, keeping each definition as one chunk with its docstring. For tables and structured data, row-level chunking creates one chunk per row plus the header row for column context.
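A minimal sketch of AST-aware splitting for Python source, using the standard-library ast module. The function name split_python_source is hypothetical, and the sketch only handles top-level functions and classes; other languages would need their own parser.

```python
import ast
from typing import List

def split_python_source(source: str) -> List[str]:
    """Return one chunk per top-level function or class, docstring included."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive (Python 3.8+)
            start = node.lineno - 1
            # decorators sit above node.lineno, so pull the start up to include them
            if node.decorator_list:
                start = min(d.lineno for d in node.decorator_list) - 1
            chunks.append("\n".join(lines[start:node.end_lineno]))
    return chunks

example = '''import os

def retry(n):
    """Retry n times."""
    return n

class Policy:
    level = "P0"
'''
chunks = split_python_source(example)
```

Each definition stays whole, so a signature is never separated from its body. Module-level statements such as imports are skipped here; a fuller splitter would collect them into their own chunk.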

Watch Out

Chunking with zero overlap means a concept that spans 10 tokens across a chunk boundary - like “maximum retry count is 5” split as “maximum retry count” / “is 5” - will never be retrievable in one piece. Always configure overlap at 15-20% of chunk size.

Each chunk gets stored with metadata: doc_id, chunk_index, source_url, created_at, acl_tags (for access control), and the original text. The acl_tags field is critical - retrieval filters on it so users cannot see chunks from documents they lack access to.

-- Core chunk storage table
CREATE TABLE chunks (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  doc_id      UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
  chunk_index INT  NOT NULL,
  text        TEXT NOT NULL,
  token_count INT  NOT NULL,
  embedding   vector(1536),
  acl_tags    TEXT[] NOT NULL DEFAULT '{}',
  source_url  TEXT,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  UNIQUE (doc_id, chunk_index)
);

CREATE INDEX ON chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
CREATE INDEX ON chunks USING GIN (acl_tags);
CREATE INDEX ON chunks (doc_id);
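
One way the ACL filter might look at query time is sketched below. The helper build_acl_search is hypothetical, and it assumes that sharing any one tag with a chunk grants access. It only builds the parameterized SQL and the parameter tuple; executing it with a driver such as psycopg is left out.

```python
from typing import List, Tuple

def build_acl_search(query_vec: List[float], user_tags: List[str], top_k: int = 50) -> Tuple[str, tuple]:
    """Build a parameterized pgvector query that only sees chunks the user may read."""
    if not user_tags:
        # fail closed: no tags means no visible chunks
        raise ValueError("user has no ACL tags; refusing to run an unfiltered search")
    sql = (
        "SELECT id, text, source_url "
        "FROM chunks "
        "WHERE acl_tags && %s::text[] "       # array overlap: any shared tag matches
        "ORDER BY embedding <=> %s::vector "  # pgvector cosine distance operator
        "LIMIT %s"
    )
    # pgvector accepts a '[x,y,z]' text literal cast to vector
    vec_literal = "[" + ",".join(f"{x:.6f}" for x in query_vec) + "]"
    return sql, (user_tags, vec_literal, top_k)

sql, params = build_acl_search([0.1, 0.2, 0.3], ["eng", "all-staff"], top_k=5)
```

With an approximate index, pgvector applies the WHERE clause after the index scan, so a very restrictive filter can return fewer than top_k rows. Over-fetching and trimming is one common workaround.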

Embedding Model Selection

The embedding model converts text to a dense vector representing semantic meaning. Distance between vectors approximates semantic similarity. Choosing the wrong model is costly and hard to fix: at the 50,000 chunks/hour ingestion target, re-embedding even 10 million chunks takes about 200 hours, and a change in vector dimension also requires a schema migration and an index rebuild.

Dimension vs quality tradeoff: text-embedding-3-large (OpenAI) produces 3072-d vectors with excellent retrieval but doubles storage versus 1536-d. Many deployments shorten to 1536-d with little quality loss (the API's dimensions parameter does this). text-embedding-3-small costs far less (about $0.02 versus $0.13 per million tokens at OpenAI's list prices at the time of writing) at roughly 5% lower NDCG - acceptable for low-sensitivity corpora.

import time
from typing import List

import openai

client = openai.OpenAI()

def embed_batch(texts: List[str], model: str = "text-embedding-3-large", dimensions: int = 1536) -> List[List[float]]:
    # the API accepts up to 2048 inputs per call
    response = client.embeddings.create(
        input=texts,
        model=model,
        dimensions=dimensions,
        encoding_format="float"
    )
    return [item.embedding for item in response.data]

def embed_chunks_with_retry(chunks: List[str], batch_size: int = 512, max_retries: int = 5) -> List[List[float]]:
    all_embeddings = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i : i + batch_size]
        for attempt in range(max_retries):
            try:
                all_embeddings.extend(embed_batch(batch))
                break
            except (openai.RateLimitError, openai.APITimeoutError, openai.APIConnectionError):
                if attempt == max_retries - 1:
                    raise  # fail ingestion loudly rather than silently skip the batch
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
    return all_embeddings

Real World

OpenAI’s text-embedding-3 models and Cohere’s Embed v4 both support Matryoshka-style representations - you can truncate embeddings to smaller dimensions (256, 512, 1536) while retaining most retrieval quality. This lets you store cheaper embeddings for cold-path queries and full-size embeddings for hot-path precision.
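
Client-side truncation can be sketched as below. This is only meaningful for models trained with Matryoshka Representation Learning, and the helper name truncate_embedding is hypothetical. The re-normalization step matters because a truncated prefix is no longer unit length, which would skew cosine or dot-product scores.

```python
import math
from typing import List

def truncate_embedding(vec: List[float], dims: int) -> List[float]:
    """Keep the first `dims` components and re-normalize to unit length."""
    if dims <= 0 or dims > len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero prefix: nothing to normalize
    return [x / norm for x in head]

full = [0.6, 0.0, 0.8, 0.0]          # toy 4-d unit vector
short = truncate_embedding(full, 2)  # 2-d prefix, rescaled to unit length
```

Query vectors must be truncated to the same dimension as the index they are searched against.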

For on-premises deployments where data cannot leave the network, nomic-embed-text runs on a single A10 GPU and achieves near-parity with OpenAI on English enterprise corpora. The key constraint: whatever model you pick, lock its version in your deployment manifest and never upgrade without re-embedding the entire corpus.
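
One way to enforce the version lock in code is to store an embedding fingerprint with the index and check it before every query. The EmbeddingSpec type and the values below are illustrative assumptions, not part of any library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingSpec:
    model: str
    version: str
    dimensions: int

def assert_same_space(index_spec: EmbeddingSpec, query_spec: EmbeddingSpec) -> None:
    """Refuse to query an index built with a different embedding configuration."""
    if index_spec != query_spec:
        raise RuntimeError(
            f"embedding mismatch: index built with {index_spec}, query uses {query_spec}"
        )

# illustrative values; in practice the index spec is read from index metadata
index_spec = EmbeddingSpec("nomic-embed-text", "v1.5", 768)
assert_same_space(index_spec, EmbeddingSpec("nomic-embed-text", "v1.5", 768))
```

Failing at query time is noisy, but it turns a silent quality regression into an error someone will see.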

Hybrid Retrieval Layer

Pure dense retrieval fails on exact-match queries. Ask “what is the JIRA ticket for the login bug” and the embedding of “JIRA ticket login bug” is semantically far from any chunk containing “PROJ-4821 login regression.” BM25 handles this because it scores on token overlap, not meaning.

The Reciprocal Rank Fusion algorithm merges ranked lists from multiple retrievers without needing calibrated scores:

from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Optional, Tuple

def reciprocal_rank_fusion(
    ranked_lists: List[List[str]],
    k: int = 60
) -> List[Tuple[str, float]]:
    """
    Fuse multiple ranked result lists. k=60 is the RRF constant from the original paper.
    Returns list of (doc_id, rrf_score) sorted descending.
    """
    scores: Dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

def hybrid_retrieve(
    query_text: str,
    query_vec: List[float],
    top_k: int = 50,
    acl_tags: Optional[List[str]] = None
) -> List[str]:
    # vector_store and es_client are placeholder wrappers around the vector
    # store and Elasticsearch clients; both must apply the same ACL filter
    if not acl_tags:
        raise ValueError("acl_tags is required; refusing to run an unfiltered search")
    acl_filter = {"acl_tags": acl_tags}

    # run both searches in parallel so latency is bounded by the slower one
    with ThreadPoolExecutor(max_workers=2) as pool:
        ann_future = pool.submit(vector_store.search, query_vec, top_k=top_k, filter=acl_filter)
        bm25_future = pool.submit(es_client.search, query_text, top_k=top_k, filter=acl_filter)
        ann_results = ann_future.result()
        bm25_results = bm25_future.result()

    ann_ids = [r.id for r in ann_results]
    bm25_ids = [r.id for r in bm25_results]

    fused = reciprocal_rank_fusion([ann_ids, bm25_ids])
    return [doc_id for doc_id, _ in fused[:top_k]]

Key Insight

RRF outperforms score normalization (e.g., min-max scaling BM25 and cosine scores to [0,1] then averaging) because it is robust to outlier scores in either retriever - a document that ranks #1 in BM25 but #100 in ANN still gets a meaningful combined score without boosting from a runaway BM25 score.
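
A small worked example, using the same formula and k=60, shows how RRF treats a split vote versus a consensus. The helper rrf_score is just the per-document sum from the fusion function, written out for illustration.

```python
def rrf_score(ranks, k=60):
    """RRF score for one document given its rank in each retriever."""
    return sum(1.0 / (k + r) for r in ranks)

split_vote = rrf_score([1, 100])  # 1/61 + 1/160: #1 in BM25, #100 in ANN
consensus = rrf_score([10, 10])   # 2/70: #10 in both retrievers
single = rrf_score([1])           # 1/61: #1 in one retriever, absent from the other
```

The split-vote document scores about 0.0226 and the consensus document about 0.0286. Agreement between retrievers outranks a single top hit, yet the split-vote document still beats one found by only a single retriever.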

The two searches run in parallel. Vector store ANN search (HNSW index) returns in ~15ms. Elasticsearch BM25 returns in ~25ms. Total retrieval time is bounded by the slower of the two, not their sum. RRF fusion is a dictionary update plus a sort over at most 100 candidates and adds negligible time.

Re-ranker Models

The re-ranker solves a different problem than retrieval. Retrieval uses a bi-encoder: the embedding model encodes the query and each document independently, and you compare the resulting vectors. Fast but approximate - the model never “sees” query and document together. A cross-encoder takes (query, chunk) as a single input and produces a relevance score. It’s 10-50x slower but far more accurate because it models query-document interaction.

from sentence_transformers import CrossEncoder
from typing import List, Tuple

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: List[Tuple[str, str]], top_n: int = 5) -> List[str]:
    """
    chunks: list of (chunk_id, chunk_text)
    Returns top_n chunk_ids sorted by cross-encoder relevance score.
    """
    pairs = [(query, text) for _, text in chunks]
    scores = reranker.predict(pairs, show_progress_bar=False)
    ranked = sorted(zip([cid for cid, _ in chunks], scores), key=lambda x: x[1], reverse=True)
    return [cid for cid, _ in ranked[:top_n]]

Cross-encoders are the bottleneck on the retrieval path. ms-marco-MiniLM-L-6-v2 runs in ~80ms for 50 pairs on a T4 GPU. ms-marco-electra-base is more accurate but costs roughly 3x the latency. For most enterprise use cases, the MiniLM variant is the right tradeoff: it catches the gross ranking errors (BM25 putting a policy document above an actual runbook) while staying within latency budget.

Real World

Cohere’s Rerank API and Jina AI’s reranker offer hosted cross-encoder APIs that return in 100-150ms for 50 documents with no GPU management overhead. Enterprises with strict data residency requirements run open-source models (bge-reranker-base) locally on shared GPU fleets.

Context Window Management

After re-ranking, we have the top 5 most relevant chunks. They need to fit inside the LLM’s context window alongside the system prompt and the user query. Context window management is about maximizing relevant signal density while staying under the token budget.

from typing import List

import tiktoken

def assemble_context(
    query: str,
    chunks: List[dict],
    system_prompt: str,
    max_context_tokens: int = 4096,
    model: str = "gpt-4o"
) -> str:
    """
    Build a prompt that fits within max_context_tokens.
    chunks: list of {"id": str, "text": str, "source_url": str}
    """
    enc = tiktoken.encoding_for_model(model)
    
    system_tokens = len(enc.encode(system_prompt))
    query_tokens = len(enc.encode(query))
    overhead = 50  # formatting tokens
    available = max_context_tokens - system_tokens - query_tokens - overhead

    context_parts = []
    used_tokens = 0

    for i, chunk in enumerate(chunks):
        chunk_header = f"[Source {i+1}: {chunk['source_url']}]\n"
        chunk_full = chunk_header + chunk["text"] + "\n\n"
        chunk_tokens = len(enc.encode(chunk_full))

        if used_tokens + chunk_tokens > available:
            # truncate last chunk to fit
            remaining = available - used_tokens
            if remaining > 100:
                truncated = enc.decode(enc.encode(chunk_full)[:remaining])
                context_parts.append(truncated)
            break

        context_parts.append(chunk_full)
        used_tokens += chunk_tokens

    context = "".join(context_parts)
    return f"{system_prompt}\n\nContext:\n{context}\n\nQuestion: {query}"

The system prompt instructs the model to answer only from the provided context and cite source numbers. This is the faithfulness constraint - the model should say “I don’t have information on this” rather than hallucinate when no relevant context was retrieved.

Watch Out

The “lost in the middle” problem: LLMs attend more strongly to the beginning and end of context than the middle. Put the highest-ranked chunk first and the second-highest last - the weakest chunks go in the middle - to maximize the chance the model uses the best evidence.
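
A minimal sketch of that ordering: alternate the ranked chunks between the front and the back of the context so the weakest ones end up in the middle. The function name order_for_context is hypothetical, and the input is assumed to be sorted best-first.

```python
from typing import List, TypeVar

T = TypeVar("T")

def order_for_context(ranked: List[T]) -> List[T]:
    """Place rank 1 first, rank 2 last, and push weaker items toward the middle."""
    front, back = [], []
    for i, item in enumerate(ranked):
        # even positions (ranks 1, 3, 5, ...) fill the front; odd ones fill the back
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]

ordered = order_for_context(["c1", "c2", "c3", "c4", "c5"])
# ordered == ["c1", "c3", "c5", "c4", "c2"]
```

Apply this to the re-ranked list before context assembly. Note that if the token budget forces truncation, it now cuts the end of the list, which holds the second-best chunk.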

Data Model

-- Document registry with source metadata
CREATE TABLE documents (
  id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  source       TEXT NOT NULL,  -- 'confluence', 'gdrive', 'slack', 'pdf'
  external_id  TEXT NOT NULL,
  title        TEXT NOT NULL,
  content_hash TEXT NOT NULL,  -- SHA-256 to detect updates
  acl_tags     TEXT[] NOT NULL DEFAULT '{}',
  indexed_at   TIMESTAMPTZ,
  created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
  UNIQUE (source, external_id)
);

-- Chunks with vector embeddings (pgvector extension)
CREATE TABLE chunks (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  doc_id      UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
  chunk_index INT NOT NULL,
  text        TEXT NOT NULL,
  token_count INT NOT NULL,
  embedding   vector(1536),
  acl_tags    TEXT[] NOT NULL,
  source_url  TEXT,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  UNIQUE (doc_id, chunk_index)
);

-- Vector similarity index (IVFFlat for 1M+ vectors)
CREATE INDEX chunks_embedding_idx ON chunks
  USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 200);

-- ACL filter index
CREATE INDEX chunks_acl_idx ON chunks USING GIN (acl_tags);

-- Eval dataset for continuous quality monitoring
CREATE TABLE eval_queries (
  id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  query           TEXT NOT NULL,
  expected_chunks UUID[] NOT NULL,
  created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE eval_runs (
  id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  run_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
  ndcg_at_5    FLOAT NOT NULL,
  faithfulness FLOAT NOT NULL,
  answer_rel   FLOAT NOT NULL
);

The content_hash on documents enables incremental updates: when a document changes, compute the new hash, compare to stored, and only re-chunk and re-embed if changed. This prevents the ingestion pipeline from doing unnecessary work on unchanged documents.
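
The hash check can be sketched as follows. The in-memory registry dict stands in for the documents table, and the function names are hypothetical.

```python
import hashlib
from typing import Dict, Optional, Tuple

def content_hash(text: str) -> str:
    """SHA-256 of the document body, hex-encoded."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_reindex(stored_hash: Optional[str], new_text: str) -> bool:
    """True for new documents (no stored hash) and for changed content."""
    return stored_hash != content_hash(new_text)

# in-memory stand-in for the documents table, keyed by (source, external_id)
registry: Dict[Tuple[str, str], str] = {}

def ingest(source: str, external_id: str, text: str) -> str:
    key = (source, external_id)
    if not needs_reindex(registry.get(key), text):
        return "skipped"
    # a real pipeline would delete old chunks, re-chunk, and re-embed here
    registry[key] = content_hash(text)
    return "indexed"

first = ingest("confluence", "PAGE-1", "Escalate P0s to the on-call lead.")
second = ingest("confluence", "PAGE-1", "Escalate P0s to the on-call lead.")
third = ingest("confluence", "PAGE-1", "Escalate P0s to the incident commander.")
```

Hash the normalized body text rather than the raw export, otherwise volatile fields such as view counters or render timestamps will trigger needless re-embedding.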

Key Algorithms and Protocols

BM25 Scoring

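BM25 scores a document for a query by summing, over the query terms that appear in the document, each term's IDF weight multiplied by a term-frequency factor that saturates as the count grows and is normalized by document length: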

import math
from typing import Dict, List

def bm25_score(
    query_terms: List[str],
    doc_terms: List[str],
    corpus_stats: Dict,  # {"N": total_docs, "avgdl": avg_doc_length, "idf": {term: float}}
    k1: float = 1.5,
    b: float = 0.75
) -> float:
    """
    k1=1.5: term frequency saturation (higher = less saturation)
    b=0.75: document length normalization (1.0 = full normalization)
    """
    N = corpus_stats["N"]
    avgdl = corpus_stats["avgdl"]
    dl = len(doc_terms)
    tf_counter: Dict[str, int] = {}
    for t in doc_terms:
        tf_counter[t] = tf_counter.get(t, 0) + 1

    score = 0.0
    for term in query_terms:
        if term not in tf_counter:
            continue
        tf = tf_counter[term]
        idf = corpus_stats["idf"].get(term, 0.0)
        numerator = tf * (k1 + 1)
        denominator = tf + k1 * (1 - b + b * dl / avgdl)
        score += idf * (numerator / denominator)
    return score

Key Insight

BM25’s term frequency saturation is what makes it better than TF-IDF for long documents: mentioning “authentication” 50 times in a document only scores slightly higher than mentioning it 10 times, preventing runaway scores from verbose documents.
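
The saturation can be checked numerically. The sketch below isolates the term-frequency factor from bm25_score for a document of average length (dl equal to avgdl), with the same k1 and b defaults; the helper name tf_component is ours.

```python
def tf_component(tf: int, k1: float = 1.5, b: float = 0.75,
                 dl: float = 1.0, avgdl: float = 1.0) -> float:
    """Term-frequency factor of BM25, without the IDF weight."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))

at_10 = tf_component(10)  # 25 / 11.5
at_50 = tf_component(50)  # 125 / 51.5
ceiling = 1.5 + 1         # the factor approaches k1 + 1 as tf grows
```

Going from 10 to 50 mentions raises the factor from about 2.17 to about 2.43, roughly a 12% gain for five times the repetition, and it can never exceed 2.5.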

HNSW Index

Hierarchical Navigable Small World graphs organize vectors in a multi-layer structure where each node connects to its closest neighbors. Search traverses from the top (sparse, long-range connections) to the bottom (dense, local connections), taking greedy hops toward the query vector at each layer.

-- pgvector HNSW index creation - tune for recall vs latency
-- m: number of connections per node (16-64, higher = better recall but more memory)
-- ef_construction: search width during index build (more = slower build, better quality)

CREATE INDEX chunks_hnsw_idx ON chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- At query time, tune ef_search for recall vs latency tradeoff
SET hnsw.ef_search = 40;  -- default 40, increase to 100+ for high-recall workloads

Eval Framework (RAGAS)

RAGAS measures RAG quality with four metrics computed by an LLM judge:

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

def run_eval(
    questions: list,
    answers: list,
    contexts: list,  # list of list of retrieved chunks
    ground_truths: list,
) -> dict:
    """
    faithfulness: fraction of answer claims supported by retrieved context
    answer_relevancy: how well the answer addresses the question
    context_precision: fraction of retrieved chunks that are relevant
    context_recall: fraction of ground-truth facts covered by retrieved chunks
    """
    data = {
        "question": questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truths,
    }
    dataset = Dataset.from_dict(data)
    results = evaluate(
        dataset=dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )
    return results
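
The NDCG@5 retrieval target needs no LLM judge and can be computed directly from the expected_chunks stored in eval_queries. A minimal sketch with binary relevance, assuming the retrieved IDs are unique:

```python
import math
from typing import List, Set

def ndcg_at_k(retrieved: List[str], expected: Set[str], k: int = 5) -> float:
    """NDCG@k with binary relevance: a chunk is relevant if it is in `expected`."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(retrieved[:k], start=1)
        if chunk_id in expected
    )
    # ideal ordering puts every expected chunk at the top of the list
    ideal_hits = min(len(expected), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

perfect = ndcg_at_k(["a", "b", "x", "y", "z"], {"a", "b"})  # both hits on top
late = ndcg_at_k(["x", "y", "z", "a", "b"], {"a", "b"})     # same hits at ranks 4-5
```

Average the per-query values over the golden set to get the ndcg_at_5 figure stored in eval_runs.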

Scaling and Performance

Horizontal scaling architecture for the RAG pipeline

The query path scales horizontally. Stateless query workers handle encode-plus-retrieve and reach 500 QPS with 10-20 instances. The vector store is the most sensitive component: pgvector with IVFFlat supports 1-2 million vectors per instance before query latency degrades, so the ~50 million chunks estimated below need either sharding across instances or a store such as Pinecone or Weaviate that handles horizontal sharding natively.

Given:
  - 2M documents, avg 10 pages, 500 words/page
  - Chunked at 400 tokens with 80-token overlap
  - Avg 25 chunks per document

Total chunks: 2M * 25 = 50M chunks
Vector storage (1536-d float32): 50M * 1536 * 4 bytes = 307 GB
BM25 inverted index: ~50-100 GB compressed

Ingestion throughput:
  50,000 chunks/hour / 3600s = ~14 chunks/s
  Embedding load: 14 * 400 tokens = ~5,600 tokens/s = ~20M tokens/hour
  At OpenAI pricing: ~$0.02/1M tokens -> ~$0.40/hour

Query cost per request:
  Query embedding: 100 tokens = negligible
  Re-ranker: 50 * 400 = 20,000 tokens GPU-compute
  LLM generation: ~2,000 tokens = ~$0.01 at GPT-4o pricing
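
The estimate is easy to re-derive in a few lines, which also makes it simple to rerun when an assumption changes. All inputs are the figures stated above. Note that 50,000 chunks/hour at 400 tokens each comes to 20 million tokens per hour, and that the initial backfill time falls out of the same numbers.

```python
docs = 2_000_000
chunks_per_doc = 25
dims = 1536
bytes_per_float = 4

total_chunks = docs * chunks_per_doc                   # 50,000,000
vector_bytes = total_chunks * dims * bytes_per_float   # 307,200,000,000 (~307 GB)

chunks_per_hour = 50_000
tokens_per_chunk = 400
tokens_per_hour = chunks_per_hour * tokens_per_chunk   # 20,000,000

price_per_million = 0.02  # USD per 1M tokens, the price assumed in the estimate
cost_per_hour = tokens_per_hour / 1_000_000 * price_per_million  # ~0.40 USD

backfill_hours = total_chunks / chunks_per_hour        # 1,000 hours at this rate
```

At the stated ingestion rate a full backfill takes about 1,000 hours (roughly six weeks), so either the initial load needs far more embedding throughput than the steady-state target, or it has to be staged.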

The re-ranker GPU is the primary cost and latency driver on the retrieval path. Use a GPU inference pool (NVIDIA Triton Inference Server) with dynamic batching: accumulate queries for 5ms and run the cross-encoder on a batch of 8-16 query/chunk pairs simultaneously. This improves GPU utilization from ~30% (one-at-a-time) to ~80%.

Real World

Notion’s Q&A feature and Atlassian Intelligence both use hybrid retrieval with re-ranking. Atlassian’s engineering blog describes using Cohere Rerank to drop irrelevant Confluence pages before passing to the LLM, reducing hallucinations by 40% in their internal evaluation.

Failure Modes and Recovery

| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| Embedding model API timeout | HTTP 504 on embed call | New documents not indexed; queries degrade silently | Retry with exponential backoff; fail ingestion loudly, not silently |
| Vector index divergence from BM25 | Eval NDCG drop >5% | Hybrid retrieval returns stale chunks for recent docs | Re-index documents with timestamp > last-known-good checkpoint |
| Cross-encoder GPU OOM | Model server returns 503 | All queries fall back to no re-ranking | Skip re-ranking gracefully; alert on fallback rate; add GPU capacity |
| ACL tag missing on chunk | Query returns forbidden documents | Security incident | Validate ACL presence at ingest; block write if tags are empty |
| Chunk-document orphan after delete | Doc deleted but chunks remain | Stale data returned in search | ON DELETE CASCADE + periodic orphan sweep job |
| Context window overflow | LLM returns error 400 | Query fails | Truncate last chunk to fit; log truncation rate as quality metric |

Watch Out

Silent retrieval degradation is the most dangerous failure mode: if the vector index drifts (corrupted segment, replication lag) but queries still return results, you get answers that seem reasonable but are grounded in outdated context. Run the eval framework on a held-out golden set every 15 minutes and alert when NDCG drops more than 5% from baseline.
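
The alert rule itself is a one-liner. The sketch below reads "5% from baseline" as a relative drop, which is an assumption; an absolute threshold would work the same way. The function name should_alert is hypothetical.

```python
def should_alert(baseline_ndcg: float, current_ndcg: float,
                 max_relative_drop: float = 0.05) -> bool:
    """True when NDCG has fallen more than max_relative_drop below baseline."""
    if baseline_ndcg <= 0:
        raise ValueError("baseline must be positive")
    drop = (baseline_ndcg - current_ndcg) / baseline_ndcg
    return drop > max_relative_drop

regression = should_alert(0.84, 0.72)  # ~14% drop
noise = should_alert(0.84, 0.82)       # ~2.4% drop
```

With a golden set of a few hundred queries, run-to-run variance is small enough that a 5% threshold rarely fires on noise, but it is worth measuring that variance before wiring the alert to a pager.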

Comparison of Approaches

| Approach | Latency | Retrieval Quality | Complexity | Best Fit |
|---|---|---|---|---|
| Dense-only (ANN) | Fast (15ms) | Good for semantic queries | Low | FAQ chatbots, short documents |
| Sparse-only (BM25) | Fast (25ms) | Good for exact-match | Low | Code search, ticket IDs, acronyms |
| Hybrid (BM25 + ANN + RRF) | Medium (40ms) | Best overall | Medium | Most enterprise use cases |
| Hybrid + cross-encoder rerank | Slow (120ms retrieval) | Excellent | High | High-stakes answers, legal, medical |
| ColBERT late interaction | Medium (50ms) | Excellent | Very high | When GPU budget allows; avoids re-ranker |

We’d build hybrid with cross-encoder re-ranking for an enterprise knowledge base. The quality improvement from re-ranking is worth 80ms given that LLM generation dominates total latency anyway. ColBERT is worth investigating once retrieval quality becomes a bottleneck, but it requires a specialized inference infrastructure that most teams are not ready to maintain.

Key Takeaways

  • Chunking strategy determines retrieval ceiling: No retrieval algorithm can recover a concept that was split across chunk boundaries without overlap.
  • Hybrid retrieval is table stakes: Dense-only retrieval fails on exact-match queries; sparse-only fails on paraphrase queries. Run both and fuse with RRF.
  • The embedding model is the contract: Upgrading the embedding model mid-deployment requires re-indexing the entire corpus - treat it like a database schema migration.
  • Cross-encoder re-ranking is high ROI: It adds 80ms but moves NDCG@5 from ~0.72 (bi-encoder only) to ~0.86, which is the difference between useful and unreliable answers.
  • Context window assembly is not trivial: Token budgeting, source ordering (best first and last), and the “lost in the middle” effect all affect answer quality.
  • Eval is a first-class pipeline component: Without continuous RAGAS metrics, retrieval degradation is invisible until users notice wrong answers.
  • ACL filtering at retrieval time is non-negotiable: A RAG system that leaks confidential HR documents to engineers is a security incident, not a retrieval bug.

The non-obvious lesson: the LLM is the least interesting component in a RAG system. The retrieval quality ceiling - set by your chunking strategy, embedding model choice, hybrid retrieval configuration, and re-ranker - determines whether users trust the system. An excellent retrieval pipeline paired with GPT-3.5 will often outperform poor retrieval paired with GPT-4o.

Frequently Asked Questions

Q: Why not just put the entire document corpus in the context window? A: Long-context LLMs (1M tokens) can theoretically hold large corpora, but retrieval is still better in practice. With 50 million chunks at 400 tokens each, the full corpus is 20 billion tokens - roughly 20,000 times larger than a 1M-token window. Even for smaller corpora, retrieval focuses the LLM on relevant sections rather than diluting attention across an entire knowledge base. Latency and cost also grow with context length.

Q: Why use pgvector instead of a dedicated vector database like Pinecone or Weaviate? A: For under 5 million vectors with co-located metadata, pgvector is simpler to operate, supports transactional consistency between chunks and documents, and eliminates a service dependency. Above 10 million vectors or with multi-tenant isolation requirements, purpose-built vector DBs handle sharding and per-tenant namespace isolation better.

Q: How do you handle documents that are updated frequently? A: Store a content_hash on each document. On re-ingest, compute the new hash. If changed, delete all existing chunks for that doc_id (cascade) and re-process. If unchanged, skip. Set a webhook or polling connector per source system to detect changes. Target 5-minute freshness by polling frequently-changing sources (Confluence, Slack) every 2-3 minutes.

Q: Why not use a ColBERT model instead of bi-encoder + cross-encoder? A: ColBERT (late interaction) computes per-token embeddings for documents and queries and scores them with a MaxSim operation. It avoids the bi-encoder approximation without the full cross-encoder latency. However, it stores ~100x more vectors per document than a bi-encoder, requires specialized indexing infrastructure (PLAID), and most teams are not yet running it in production. The bi-encoder + cross-encoder approach is well-understood and delivers similar quality for most enterprise corpora.

Q: How do you prevent the LLM from hallucinating when context is insufficient? A: Two mechanisms. First, a faithfulness instruction in the system prompt: “Answer only from the provided context. If the context does not contain the answer, say ‘I don’t have information on this in the knowledge base.’” Second, post-generation citation checking: parse the LLM’s answer for factual claims and verify each claim appears in the retrieved chunks. Flag answers with low citation density for human review.
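
A cheap first pass at citation checking is to measure how many sentences carry a valid source marker. The sketch below assumes the model cites as "[Source N]", and the function name citation_density is hypothetical. It checks only that citations are present and in range, not that the cited chunk supports the claim; that step needs an entailment model or an LLM judge.

```python
import re
from typing import List

def citation_density(answer: str, num_sources: int) -> float:
    """Fraction of sentences that cite at least one valid source number."""
    sentences: List[str] = [
        s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()
    ]
    if not sentences:
        return 0.0
    cited = 0
    for sentence in sentences:
        refs = [int(n) for n in re.findall(r"\[Source (\d+)\]", sentence)]
        # a citation only counts if it points at a chunk that was actually provided
        if any(1 <= n <= num_sources for n in refs):
            cited += 1
    return cited / len(sentences)

answer = (
    "P0 incidents page the on-call lead within 5 minutes [Source 1]. "
    "Escalation goes to the VP after 30 minutes [Source 7]. "
    "Postmortems are mandatory."
)
density = citation_density(answer, num_sources=5)  # only the first sentence counts
```

Here one sentence of three has a valid citation: the second cites a source number that was never provided, and the third cites nothing. An answer like this would be a candidate for human review.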

Q: What is a good NDCG@5 target for production? A: For general enterprise knowledge bases, 0.75-0.80 is functional and 0.85+ is excellent. Measure this on a held-out eval set of 200-500 question/expected-chunk pairs that cover your actual user query distribution. NDCG below 0.70 means users frequently get wrong answers and will abandon the system.

Interview Questions

Q: Walk me through how you would design the chunking pipeline for a corpus that includes both long-form PDFs and short Slack messages.

Expected depth: Discuss adaptive chunk size by document type - Slack messages may be a single chunk each while PDFs need recursive splitting. Explain overlap strategy. Address the metadata schema (source, doc_id, acl_tags). Mention how to handle thread context in Slack (include parent message in child reply chunk).

Q: The re-ranker is adding 100ms of latency. How would you reduce it without sacrificing quality?

Expected depth: Discuss reducing the candidate pool from 50 to 20 (lowers re-ranker input size). GPU batching with Triton dynamic batching. Using a smaller cross-encoder (MiniLM-L-6 vs. Electra-base). Caching re-rank results for repeated queries (embedding hash as cache key). ColBERT as an alternative that avoids the separate re-ranker step.

Q: How do you ensure that a query from user A never returns documents that user B cannot access?

Expected depth: Explain ACL tags stored per chunk, added at ingest time from the source system’s permission model. Filter applied at both vector store query time (metadata filter in pgvector/Pinecone) and BM25 search time (Elasticsearch filter clause). Discuss the risk of ACL tag drift when document permissions change after indexing - need a permission-change webhook to trigger re-tagging.

Q: Your eval shows NDCG@5 dropped from 0.84 to 0.72 overnight. How do you debug it?

Expected depth: Check if a new batch of documents was ingested (content quality issue). Check if the embedding model or vector index was updated (vector space drift). Run the eval on a pre-upgrade snapshot to isolate timeframe. Check re-ranker model version. Examine specific failing eval queries - are they concentrated in a particular document source or topic? Check the BM25 index refresh status.

Q: How would you design the eval framework to catch quality regressions before they reach production?

Expected depth: Describe a golden eval set of 300+ query/expected-chunk pairs. Run RAGAS metrics (faithfulness, answer_relevancy, context_precision, context_recall) on each deployment. Gate promotion on NDCG@5 not dropping more than 2% from baseline. Shadow-mode testing: route 5% of production traffic to the new pipeline and compare metrics. Discuss the challenge of annotating the golden set - needs domain experts to label expected chunks.
