Build a Shadow Traffic / Dark Launch System



System Design Deep Dive


Test your new service against real production load - without users ever knowing it exists


Imagine a dress rehearsal where the actors on stage are performing the real show, but there is a second, identical company performing in a dark theater next door with the exact same script. The audience sees only the main stage. The dark company makes every mistake, improvises freely, and sometimes forgets their lines - but no one notices because the audience never sees them. What you get is a complete picture of how the new production would go without any risk to the paying audience.

Shadow traffic testing is that dark theater. When you have a new service you want to validate against real production traffic, you do not A/B test it with 1% of users. You copy every inbound request, send the copy to the new service asynchronously, and compare what the new service would have returned against what the old service actually returned. Users see only the old service. The new service sees 100% of real traffic. You get a confidence signal that no staging environment, load test, or synthetic traffic generator can match.

The challenge is not the copying itself - that is straightforward. The hard problems are the side effects. The new service must not write to the production database, send emails, charge payment cards, or make third-party API calls when processing shadow requests. And the comparison must be smart: two responses that are semantically identical but structurally different (timestamps, ordering, UUIDs) must be recognized as equivalent, while two responses that differ on a single business-critical field must be flagged as divergent. Finally, the shadow path must add effectively zero latency to the production request path - even if the mirroring subsystem is slow or down, it must not slow down real users.

We need to solve for three constraints simultaneously: zero production latency impact, complete side-effect suppression in the shadow target, and intelligent divergence comparison that surfaces signal without noise.

Requirements and Constraints

Functional Requirements

  • Mirror a configurable percentage (1-100%) of inbound HTTP requests to a shadow service
  • Shadow requests must be fully isolated: no writes to shared databases, no external API calls, no emails
  • Compare shadow responses against production responses field-by-field
  • Generate a divergence report with a confidence score updated in near real-time
  • Support sampling rate control - adjustable without a service restart
  • Support request transformation - the ability to modify shadow request headers or body before forwarding

Non-Functional Requirements

  • Production request latency overhead: under 1ms added latency (shadow dispatch must be fully async)
  • Shadow request throughput: up to 50,000 requests/second mirrored at peak
  • Divergence report refresh: updated within 60 seconds of change in divergence rate
  • Confidence score: computed over rolling 5-minute windows, updated every 30 seconds
  • Data retention: raw comparison results retained for 72 hours; aggregated metrics retained for 30 days
  • Shadow failures must never surface as errors to production callers

Constraints and Assumptions

  • The shadow target service is owned by the same team; we trust it not to crash production DBs
  • HTTP/gRPC traffic; not applicable to raw TCP streams or binary protocols without framing
  • We accept eventual consistency in the divergence report - real-time comparison is not required
  • The comparison system ignores non-deterministic fields (timestamps, trace IDs) by configuration

High-Level Architecture

The system has three major planes: the mirror plane (request duplication at the edge), the shadow execution plane (isolated service + side-effect suppression), and the comparison plane (response diffing and confidence reporting).

[Figure: Shadow traffic system architecture overview]

The mirror plane sits at the API gateway or service mesh layer. When a request arrives, the gateway dispatches it to the production service synchronously (blocking the caller) and simultaneously enqueues a copy of the request to a Shadow Queue. The gateway returns the production response immediately - the shadow dispatch adds microseconds, not milliseconds.

The shadow execution plane reads from the Shadow Queue, applies request transformations if configured, and forwards the request to the Shadow Target with a special X-Shadow-Mode: true header. The Shadow Target uses an isolation layer to suppress all side effects - writes are intercepted and dropped, external calls are mocked, email sends are no-oped.
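The transformation step might look something like the sketch below. The rule shape (`set_headers`, `drop_headers`) and the names `TransformRule` and `prepare_shadow_request` are hypothetical, chosen only for illustration; the one detail taken from the design is the X-Shadow-Mode: true header, which is applied last so that no configured rule can remove it.

```python
from dataclasses import dataclass, field


@dataclass
class ShadowRequest:
    method: str
    url: str
    headers: dict[str, str]
    body: bytes = b""


@dataclass
class TransformRule:
    # Hypothetical rule shape, for illustration only.
    set_headers: dict[str, str] = field(default_factory=dict)
    drop_headers: list[str] = field(default_factory=list)


def prepare_shadow_request(req: ShadowRequest, rule: TransformRule) -> ShadowRequest:
    """Return a transformed copy; the captured request is left untouched."""
    dropped = {name.lower() for name in rule.drop_headers}
    headers = {k: v for k, v in req.headers.items() if k.lower() not in dropped}
    headers.update(rule.set_headers)
    # Remove any case variant of the marker, then set it last so that
    # configured rules cannot remove or override it.
    headers = {k: v for k, v in headers.items() if k.lower() != "x-shadow-mode"}
    headers["X-Shadow-Mode"] = "true"
    return ShadowRequest(req.method, req.url, headers, req.body)
```

Returning a copy rather than mutating in place keeps the captured request available for the comparison plane exactly as production saw it.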

The comparison plane receives both responses asynchronously, runs the divergence scoring algorithm, and writes results to the Comparison Store. A reporting service aggregates results into a live confidence dashboard.

Key Insight

The production path and shadow path must share zero synchronous dependencies after the fork point. Any coupling - even a shared mutex or a shared connection pool - will cause shadow slowness to leak into production latency. Treat the two paths as if they belong to entirely different processes.

The Mirror Plane

The Mirror Plane is the request duplication layer. It must be as thin as possible - a sampling check and a non-blocking send to a buffered channel, which dispatcher goroutines drain in the background - and it must never block the production request handler.

[Figure: Mirror plane request duplication internals]

The mirror is implemented as middleware in the API gateway, not in the service itself. This is critical: if the shadow logic lives inside the service, a bug in the shadow path can corrupt shared state in the service. The gateway-level implementation gives you clean isolation.

// Shadow mirror middleware - adds no blocking work to the production path.
// Requires imports: bytes, io, math/rand, net/http, sync, time.
// shadowDroppedCounter is assumed to be a metrics counter defined elsewhere.
type ShadowMirror struct {
    queue        chan ShadowRequest
    samplingRate float64
    mu           sync.RWMutex
}

type ShadowRequest struct {
    Method     string
    URL        string
    Headers    http.Header
    Body       []byte
    ReceivedAt time.Time
    RequestID  string
}

func (m *ShadowMirror) Middleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        m.mu.RLock()
        rate := m.samplingRate
        m.mu.RUnlock()

        // The package-level rand.Float64 is safe for concurrent use. A shared
        // *rand.Rand is not, and handlers run on many goroutines at once.
        if rand.Float64() < rate {
            // Buffer the body only for sampled requests so that both the
            // production handler and the shadow copy can read it.
            bodyBytes, err := io.ReadAll(r.Body)
            if err != nil {
                // The production handler could not have read this body either.
                http.Error(w, "failed to read request body", http.StatusBadRequest)
                return
            }
            r.Body = io.NopCloser(bytes.NewReader(bodyBytes))

            shadow := ShadowRequest{
                Method:     r.Method,
                URL:        r.URL.String(),
                Headers:    r.Header.Clone(),
                Body:       bodyBytes,
                ReceivedAt: time.Now(),
                RequestID:  r.Header.Get("X-Request-ID"),
            }
            select {
            case m.queue <- shadow:
                // enqueued without blocking
            default:
                // queue full - drop shadow, never block production
                shadowDroppedCounter.Inc()
            }
        }

        // Production handler runs normally
        next.ServeHTTP(w, r)
    })
}

// SetSamplingRate changes the mirrored fraction (0.0 to 1.0) at runtime.
func (m *ShadowMirror) SetSamplingRate(rate float64) {
    m.mu.Lock()
    m.samplingRate = rate
    m.mu.Unlock()
}

The select { case m.queue <- shadow: default: } pattern is the safety valve. If the shadow queue is full (e.g., the shadow target is slow), the mirror drops shadow copies without ever blocking the production caller. This is the critical design invariant: shadow failures are invisible to users.

Watch Out

Do not use a blocking channel send for the shadow queue. The temptation is to ensure every request is mirrored, but a blocking send couples production latency to shadow processing latency. Drop shadow copies freely - you are sampling for statistical confidence, not building a record system. A 95% capture rate is fine; a 1ms production latency spike is not.

Side-Effect Suppression

Side-effect suppression is the hardest part of building a dark launch system. The shadow target must process requests as if it were production, but it must not produce any real-world effects. Think of it like running a simulation where the physics engine is real but the actors are holograms - the simulation runs at full fidelity, but nothing in the physical world changes.

There are three categories of side effects to suppress:

Database writes: All INSERT, UPDATE, DELETE, and TRUNCATE operations must be blocked. The shadow target connects to a read-only replica of the production database - it can read any data it needs, but the database server rejects every write attempt with an error, so nothing is persisted.

# Shadow DB connection - read-only mode
import psycopg2
from psycopg2.extensions import ISOLATION_LEVEL_READ_COMMITTED, connection

def get_shadow_db_connection(dsn: str) -> connection:
    """Returns a read-only database connection for shadow execution."""
    conn = psycopg2.connect(dsn)
    # Mark the session read-only; the server then rejects any write statement
    conn.set_session(readonly=True, isolation_level=ISOLATION_LEVEL_READ_COMMITTED)
    return conn

# Alternatively, route shadow traffic to a read replica with a read-only user
SHADOW_DB_DSN = "postgresql://shadow_readonly:pass@replica.db.internal:5432/prod"

External API calls: HTTP calls to third-party services (payment processors, email providers, SMS gateways) must be intercepted and mocked. The shadow target uses an HTTP client wrapper that checks for the shadow header before making outbound calls.

# Shadow-aware HTTP client
import httpx

class ShadowAwareClient:
    def __init__(self, is_shadow: bool, mock_responses: dict):
        self.is_shadow = is_shadow
        self.mock_responses = mock_responses
        self._real_client = httpx.Client()

    def post(self, url: str, **kwargs) -> httpx.Response:
        if self.is_shadow:
            # Return a mocked 200 response instead of calling the real API
            mock = self.mock_responses.get(url, {"status_code": 200, "json": {}})
            return MockResponse(mock["status_code"], mock["json"])
        return self._real_client.post(url, **kwargs)

class MockResponse:
    def __init__(self, status_code: int, json_body: dict):
        self.status_code = status_code
        self._json = json_body

    def json(self) -> dict:
        return self._json

    def raise_for_status(self) -> None:
        if self.status_code >= 400:
            raise httpx.HTTPStatusError(f"Mock {self.status_code}", request=None, response=self)

Message queue publishes: Kafka produces, SQS sends, and webhook dispatches must be dropped. The shadow target uses a no-op message producer that records the intent (for divergence comparison on the “what would have been enqueued” signal) but does not actually send.
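A no-op producer of this kind could be sketched as follows. The `Producer` interface, `RecordingNoOpProducer`, and `make_producer` are hypothetical names, not part of any Kafka or SQS client library; a real service would adapt this to whatever producer interface it already uses.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


class Producer(Protocol):
    def send(self, topic: str, payload: dict[str, Any]) -> None: ...


@dataclass
class RecordingNoOpProducer:
    """Records what would have been published; never contacts a broker."""

    intents: list[tuple[str, dict[str, Any]]] = field(default_factory=list)

    def send(self, topic: str, payload: dict[str, Any]) -> None:
        # Shallow-copy the payload so the record is not affected if the
        # caller later adds or removes top-level keys.
        self.intents.append((topic, dict(payload)))


def make_producer(is_shadow: bool, real_producer: Producer) -> Producer:
    """Pick the producer once per request, based on shadow mode."""
    return RecordingNoOpProducer() if is_shadow else real_producer
```

The recorded `intents` list can be attached to the captured shadow response, so the comparison plane can diff "what would have been enqueued" alongside the response body.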

Real World

GitHub’s Scientist library (used at GitHub itself, ported to many languages) applies the same control-versus-candidate idea inside a single process. It runs both the existing (control) and new (candidate) code paths, swallows and records any exception raised by the candidate so it never reaches the caller, compares the two results, and always returns the control’s result. Scientist does not suppress side effects for you - candidate code is expected to be free of them - which is why the suppression layer described here has to be built explicitly.

Response Comparison and Divergence Scoring

Two identical business operations may produce structurally different responses: different timestamps, different UUID-based IDs in object references, different ordering of array elements when no sort is specified. A naive byte-level diff would report 100% divergence on any system that generates IDs or timestamps. The comparison engine must be field-aware.
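One way to handle the ordering case is to canonicalize both responses before diffing them. The sketch below sorts only the lists named in a per-endpoint set of paths; `canonicalize` and `unordered_paths` are hypothetical names, and the path syntax (dots for keys, `[]` for list elements) is an assumption made for this example.

```python
import json
from typing import Any


def canonicalize(value: Any, unordered_paths: set[str], path: str = "") -> Any:
    """Sort lists at the configured paths so element order is not reported as a diff."""
    if isinstance(value, dict):
        return {
            key: canonicalize(child, unordered_paths, f"{path}.{key}" if path else key)
            for key, child in value.items()
        }
    if isinstance(value, list):
        items = [canonicalize(child, unordered_paths, f"{path}[]") for child in value]
        if path in unordered_paths:
            # Sort by a stable JSON encoding so that dicts and mixed types
            # can be ordered without defining custom comparison rules.
            items.sort(key=lambda item: json.dumps(item, sort_keys=True))
        return items
    return value
```

Lists that are not listed keep their original order, so a genuinely ordered result (for example, ranked search hits) still diverges when the ranking changes.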

[Figure: Response comparison and divergence scoring internals]

The comparison engine operates on the deserialized response objects, not the raw bytes. For JSON responses, it performs a recursive structural diff with a field-level exclusion list configured per endpoint.

# Divergence scoring algorithm
from typing import Any
from dataclasses import dataclass

@dataclass
class ComparisonResult:
    request_id: str
    endpoint: str
    diverged: bool
    divergence_score: float  # 0.0 = identical, 1.0 = completely different
    diff_paths: list[str]    # JSON paths where divergence occurred
    production_status: int
    shadow_status: int

IGNORED_FIELDS_BY_ENDPOINT = {
    "/api/orders": ["created_at", "updated_at", "request_id", "trace_id"],
    "/api/users": ["last_login", "session_token", "updated_at"],
    "default": ["timestamp", "request_id", "trace_id", "x_request_id"],
}

def compare_responses(
    request_id: str,
    endpoint: str,
    prod_body: dict,
    shadow_body: dict,
    prod_status: int,
    shadow_status: int,
) -> ComparisonResult:
    ignored = set(IGNORED_FIELDS_BY_ENDPOINT.get(endpoint, IGNORED_FIELDS_BY_ENDPOINT["default"]))
    diff_paths = []
    _recursive_diff(prod_body, shadow_body, ignored, "", diff_paths)

    # Status code divergence is always a signal
    if prod_status != shadow_status:
        diff_paths.append(f"status_code: {prod_status} vs {shadow_status}")

    total_fields = _count_fields(prod_body)
    divergence_score = len(diff_paths) / max(total_fields, 1)

    return ComparisonResult(
        request_id=request_id,
        endpoint=endpoint,
        diverged=len(diff_paths) > 0,
        divergence_score=min(divergence_score, 1.0),
        diff_paths=diff_paths,
        production_status=prod_status,
        shadow_status=shadow_status,
    )

def _recursive_diff(
    a: Any, b: Any, ignored: set, path: str, diffs: list
) -> None:
    if isinstance(a, dict) and isinstance(b, dict):
        all_keys = set(a.keys()) | set(b.keys())
        for key in all_keys:
            if key in ignored:
                continue
            child_path = f"{path}.{key}" if path else key
            if key not in a:
                diffs.append(f"missing in prod: {child_path}")
            elif key not in b:
                diffs.append(f"missing in shadow: {child_path}")
            else:
                _recursive_diff(a[key], b[key], ignored, child_path, diffs)
    elif isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            diffs.append(f"list length: {path} ({len(a)} vs {len(b)})")
        else:
            for i, (ia, ib) in enumerate(zip(a, b)):
                _recursive_diff(ia, ib, ignored, f"{path}[{i}]", diffs)
    elif a != b:
        diffs.append(f"{path}: {repr(a)} != {repr(b)}")

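def _count_fields(obj: Any) -> int:
    # Count leaf values only, so a response whose every leaf differs scores 1.0.
    # Counting dict keys as well would double-count and roughly halve the score.
    if isinstance(obj, dict):
        return sum(_count_fields(v) for v in obj.values())
    elif isinstance(obj, list):
        return sum(_count_fields(item) for item in obj)
    return 1

Key Insight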

The divergence score is most useful as a rate over time, not as a per-request metric. A 0.5 divergence score on one request is meaningless - it could be a non-deterministic field you forgot to exclude. A divergence rate of 0.5% over 10,000 requests, stable for 10 minutes, means something is genuinely different between production and shadow behavior.
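Tracking divergence as a rate over a rolling window can be sketched as below. `RollingDivergenceRate` is a hypothetical helper, not part of the system described above; it assumes events are recorded in timestamp order and uses a 300-second window to match the 5-minute window in the requirements.

```python
from collections import deque
from typing import Optional


class RollingDivergenceRate:
    """Fraction of diverged comparisons within the last `window_seconds`."""

    def __init__(self, window_seconds: float = 300.0):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, diverged) in arrival order
        self._diverged = 0

    def record(self, timestamp: float, diverged: bool) -> None:
        self._events.append((timestamp, diverged))
        self._diverged += int(diverged)
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop events that have aged out of the window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, was_diverged = self._events.popleft()
            self._diverged -= int(was_diverged)

    def rate(self, now: float) -> Optional[float]:
        """Return the divergence rate, or None when the window is empty."""
        self._evict(now)
        if not self._events:
            return None
        return self._diverged / len(self._events)
```

Returning None for an empty window keeps "no data" distinct from "no divergence", which matters when the shadow queue has been dropping everything.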

Confidence Score and Reporting

The confidence score is the system’s primary output - the signal that tells you whether the shadow service is ready to replace production. It is computed over a rolling window using a weighted combination of divergence rate, error rate delta, and latency ratio.

# Confidence score computation (runs every 30 seconds)
from dataclasses import dataclass

@dataclass
class ConfidenceMetrics:
    window_start: float
    window_end: float
    total_requests: int
    diverged_requests: int
    prod_error_rate: float   # fraction of 5xx responses
    shadow_error_rate: float
    prod_p99_ms: float
    shadow_p99_ms: float

def compute_confidence_score(metrics: ConfidenceMetrics) -> float:
    """
    Returns 0.0 (no confidence) to 1.0 (full confidence).
    Score drops sharply on divergence or shadow errors exceeding production.
    """
    if metrics.total_requests < 100:
        return 0.0  # insufficient sample size

    divergence_rate = metrics.diverged_requests / metrics.total_requests
    error_rate_delta = max(0.0, metrics.shadow_error_rate - metrics.prod_error_rate)
    latency_ratio = metrics.shadow_p99_ms / max(metrics.prod_p99_ms, 1.0)

    # Component scores: each clamped to the range 0.0 to 1.0
    divergence_score = max(0.0, 1.0 - (divergence_rate * 20))  # 5% divergence = 0 score
    error_score = max(0.0, 1.0 - (error_rate_delta * 10))      # 10% extra errors = 0 score
    # Clamp the upper bound too: a shadow faster than production must not score above 1.0
    latency_score = min(1.0, max(0.0, 2.0 - latency_ratio))    # 2x slower = 0 score

    # Weighted combination
    return (divergence_score * 0.5) + (error_score * 0.3) + (latency_score * 0.2)

The confidence score gives engineers a single number to gate promotion: above 0.95 for 30 consecutive minutes across a representative traffic sample means the shadow service is ready. Below 0.70 triggers an automatic alert with the specific diff paths causing divergence.
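The promotion rule can be expressed as a small state machine. This is a sketch under stated assumptions: `PromotionGate` and its return values are hypothetical, and the thresholds are simply the 0.95, 0.70, and 30-minute figures quoted above.

```python
from dataclasses import dataclass
from typing import Optional

PROMOTE_THRESHOLD = 0.95
ALERT_THRESHOLD = 0.70
REQUIRED_STREAK_SECONDS = 30 * 60


@dataclass
class PromotionGate:
    streak_started_at: Optional[float] = None

    def observe(self, now: float, score: float) -> str:
        """Feed one confidence sample; returns "promote", "alert", or "wait"."""
        if score <= PROMOTE_THRESHOLD:
            # Any sample at or below the threshold resets the streak.
            self.streak_started_at = None
            return "alert" if score < ALERT_THRESHOLD else "wait"
        if self.streak_started_at is None:
            self.streak_started_at = now
        if now - self.streak_started_at >= REQUIRED_STREAK_SECONDS:
            return "promote"
        return "wait"
```

Resetting the streak on a single low sample is deliberately strict: "30 consecutive minutes" means one dip restarts the clock.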

Real World

Twitter (now X) open-sourced Diffy, a tool that runs exactly this comparison pipeline against staged and production instances. Diffy adds a third “noise” instance running the same production code as a baseline, so it can distinguish real divergence from non-determinism in the production code itself - a technique worth adopting when your production service has inherent non-determinism from things like random sampling or LIMIT without ORDER BY.
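The baseline idea reduces to a subtraction at the rate level. The function below is a simplified illustration of that arithmetic, not Diffy's actual algorithm; `noise_adjusted_divergence` is a hypothetical name.

```python
def noise_adjusted_divergence(prod_vs_shadow: float, prod_vs_noise: float) -> float:
    """Divergence attributable to the shadow service, net of the noise baseline.

    Both arguments are rates in the range 0.0 to 1.0:
      prod_vs_shadow - fraction of requests where production and shadow differ
      prod_vs_noise  - fraction where two production instances differ
    """
    for name, rate in (("prod_vs_shadow", prod_vs_shadow), ("prod_vs_noise", prod_vs_noise)):
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {rate}")
    # Never report negative divergence when the baseline is noisier than the shadow.
    return max(0.0, prod_vs_shadow - prod_vs_noise)
```

For example, a 9% production-versus-shadow rate against an 8% production-versus-noise baseline leaves roughly 1% of divergence that the shadow service itself has to account for.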

Data Model

-- Shadow comparison results storage
CREATE TABLE shadow_comparisons (
  id               BIGSERIAL PRIMARY KEY,
  request_id       TEXT NOT NULL,
  endpoint         TEXT NOT NULL,
  method           TEXT NOT NULL,
  diverged         BOOLEAN NOT NULL,
  divergence_score FLOAT NOT NULL,
  diff_paths       JSONB,
  prod_status      SMALLINT NOT NULL,
  shadow_status    SMALLINT NOT NULL,
  prod_latency_ms  INT NOT NULL,
  shadow_latency_ms INT NOT NULL,
  shadow_error_msg TEXT,
  sampled_at       TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_shadow_comp_endpoint ON shadow_comparisons(endpoint, sampled_at DESC);
CREATE INDEX idx_shadow_comp_diverged ON shadow_comparisons(diverged, sampled_at DESC) WHERE diverged = TRUE;
CREATE INDEX idx_shadow_comp_ts ON shadow_comparisons(sampled_at DESC);

-- Confidence metrics per window (pre-aggregated for dashboard)
CREATE TABLE confidence_windows (
  id                BIGSERIAL PRIMARY KEY,
  endpoint          TEXT NOT NULL,
  window_start      TIMESTAMPTZ NOT NULL,
  window_end        TIMESTAMPTZ NOT NULL,
  total_requests    INT NOT NULL,
  diverged_requests INT NOT NULL,
  prod_p50_ms       INT,
  prod_p99_ms       INT,
  shadow_p50_ms     INT,
  shadow_p99_ms     INT,
  prod_error_rate   FLOAT,
  shadow_error_rate FLOAT,
  confidence_score  FLOAT NOT NULL,
  created_at        TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_conf_windows_endpoint ON confidence_windows(endpoint, window_start DESC);

-- Sampling configuration (hot-loaded by mirror middleware)
CREATE TABLE shadow_config (
  id              BIGSERIAL PRIMARY KEY,
  endpoint_regex  TEXT NOT NULL,
  sampling_rate   FLOAT NOT NULL CHECK (sampling_rate BETWEEN 0.0 AND 1.0),
  enabled         BOOLEAN NOT NULL DEFAULT TRUE,
  transformations JSONB,
  updated_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

The shadow_comparisons table retains raw results for 72 hours; a periodic cleanup job purges older rows. The scoring job writes one row per endpoint to confidence_windows each time it runs (every 30 seconds), and those aggregates are kept for 30 days. The shadow_config table is polled every 10 seconds by the mirror middleware, allowing sampling rate changes without restart.
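Once the shadow_config rows are loaded, choosing a sampling rate for a request path might look like this. The first-match-wins ordering and the default of 0.0 for unmatched paths are assumptions of this sketch, since the schema itself does not define a precedence rule.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ShadowConfigRow:
    endpoint_regex: str
    sampling_rate: float
    enabled: bool = True


def sampling_rate_for(path: str, rows: list[ShadowConfigRow]) -> float:
    """Return the rate of the first enabled row whose regex matches the whole path."""
    for row in rows:
        if row.enabled and re.fullmatch(row.endpoint_regex, path):
            return row.sampling_rate
    # Paths with no matching row are not mirrored at all.
    return 0.0
```

In a hot path, the patterns would be compiled once per config refresh rather than on every request; `re.fullmatch` on raw strings is used here only to keep the sketch short.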

Key Algorithms and Protocols

Request pairing with timeout: The comparison engine pairs production and shadow responses by request_id. Production responses usually arrive first (they go straight to the caller); shadow responses usually arrive later (they are processed asynchronously), but the cache must accept either order. The pairing cache holds the first response for up to 30 seconds waiting for its partner. If the shadow response does not arrive within 30 seconds, the pair is recorded as a shadow timeout.

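// Response pairing cache: joins production and shadow responses by request ID.
// Requires imports: sync, time. CapturedResponse and shadowTimeoutCounter are
// assumed to be defined elsewhere.
type PairingCache struct {
    mu      sync.Mutex
    pending map[string]*PendingPair
}

// PendingPair holds whichever side arrived first; exactly one field is non-nil.
type PendingPair struct {
    Prod       *CapturedResponse
    Shadow     *CapturedResponse
    ReceivedAt time.Time
}

// StoreProd records a production response. If the shadow response is already
// waiting, it is removed from the cache and returned for immediate comparison.
func (c *PairingCache) StoreProd(reqID string, resp *CapturedResponse) *CapturedResponse {
    c.mu.Lock()
    defer c.mu.Unlock()
    if entry, ok := c.pending[reqID]; ok && entry.Shadow != nil {
        delete(c.pending, reqID)
        return entry.Shadow
    }
    c.pending[reqID] = &PendingPair{Prod: resp, ReceivedAt: time.Now()}
    return nil
}

// StoreShadow is the mirror image: it returns the waiting production
// response if there is one, otherwise it parks the shadow response.
func (c *PairingCache) StoreShadow(reqID string, resp *CapturedResponse) *CapturedResponse {
    c.mu.Lock()
    defer c.mu.Unlock()
    if entry, ok := c.pending[reqID]; ok && entry.Prod != nil {
        delete(c.pending, reqID)
        return entry.Prod
    }
    c.pending[reqID] = &PendingPair{Shadow: resp, ReceivedAt: time.Now()}
    return nil
}

// ReapExpired removes entries whose partner never arrived.
func (c *PairingCache) ReapExpired(maxAge time.Duration) {
    c.mu.Lock()
    defer c.mu.Unlock()
    now := time.Now()
    for id, pair := range c.pending {
        if now.Sub(pair.ReceivedAt) > maxAge {
            // Only a parked production response counts as a shadow timeout.
            if pair.Prod != nil {
                shadowTimeoutCounter.WithLabelValues(pair.Prod.Endpoint).Inc()
            }
            delete(c.pending, id)
        }
    }
}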

Sampling rate control without restart: The mirror reads sampling rate from a configuration store every 10 seconds via a background goroutine. The rate is stored in an atomic.Value for lock-free reads on the hot path, an alternative to the RWMutex-guarded field shown in the mirror middleware.

// Hot-reload sampling rate from config store
type SamplingConfig struct {
    Rate     float64
    Endpoint string
}

var currentConfig atomic.Value // stores *SamplingConfig

func refreshConfigLoop(configStore ConfigStore, interval time.Duration) {
    ticker := time.NewTicker(interval)
    for range ticker.C {
        cfg, err := configStore.GetShadowConfig()
        if err == nil {
            currentConfig.Store(cfg)
        }
    }
}

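// Hot path reads - no locking needed
func shouldMirror(endpoint string) bool {
    // Load returns nil until the first successful refresh; a bare type
    // assertion would panic, so treat "no config yet" as "do not mirror".
    cfg, ok := currentConfig.Load().(*SamplingConfig)
    if !ok || cfg == nil {
        return false
    }
    if cfg.Endpoint != "" && cfg.Endpoint != endpoint {
        return false
    }
    return rand.Float64() < cfg.Rate
}

Key Insight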

Sampling rate control is your blast radius lever. Start at 0.1% (1 in 1000 requests) to validate the plumbing without risking your shadow target being overwhelmed. Ramp to 1%, 5%, 10%, then 100% as you build confidence. The confidence score gates promotion; the sampling rate gates exposure.

Scaling and Performance

[Figure: Shadow traffic scaling and throughput model]

The mirror plane scales horizontally with the API gateway - each gateway instance has its own local shadow queue and shadow dispatcher goroutines. There is no central coordinator for the dispatch path; the coordination happens only in the comparison plane.

Back-of-envelope capacity:
  Production traffic: 100K req/s
  Sampling rate: 10% = 10K shadow req/s
  Average request size: 2KB (headers + body)
  Shadow dispatcher throughput: 10K req/s per gateway instance
  Shadow target must handle: 10K req/s (10% of production load)

  Shadow queue capacity:
    Buffer size: 10,000 slots
    At 10K req/s enqueue rate and 10K req/s drain rate: 0 backlog in steady state
    At a 20K req/s burst for 5 seconds: 50K excess requests; the queue absorbs 10K (full after 1 second) and the remaining ~40K are dropped
    Queue memory: 10K * 2KB = 20MB per gateway instance

  Comparison storage:
    At 10K comparisons/second: 864M raw results/day
    Average result size: 512 bytes (incl. diff_paths JSON)
    Storage: 864M * 512 bytes = ~442GB/day - too large for 72hr retention
    Store a 1% sample of the 10K comparisons/second: 100 comparisons/second
    Storage: 100 * 512 bytes * 86,400s = ~4.4GB/day, ~13.3GB for 3 days - manageable

The key scaling insight is that you do not need to store every comparison - you need to store a statistically sufficient sample per endpoint per time window. At 10,000 shadow requests/second, storing 1% of comparisons gives 100 samples/second, which is more than enough for statistical confidence. Aggregated window metrics are always stored (they are small).
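The arithmetic is easy to re-run when the inputs change. The script below uses the same inputs as the estimate above (100K req/s, 10% sampling, 2KB requests, 512-byte results, 1% of comparisons stored) and decimal units (1 GB = 10^9 bytes); the constant names are made up for this sketch.

```python
PROD_RPS = 100_000
SAMPLING_PERCENT = 10
REQUEST_BYTES = 2 * 1024
QUEUE_SLOTS = 10_000
RESULT_BYTES = 512
STORED_PERCENT = 1
SECONDS_PER_DAY = 86_400
RETENTION_DAYS = 3

# Integer arithmetic throughout, to avoid floating-point rounding in the totals.
shadow_rps = PROD_RPS * SAMPLING_PERCENT // 100
queue_bytes = QUEUE_SLOTS * REQUEST_BYTES
raw_bytes_per_day = shadow_rps * SECONDS_PER_DAY * RESULT_BYTES
sampled_bytes_per_day = raw_bytes_per_day * STORED_PERCENT // 100
retained_bytes = sampled_bytes_per_day * RETENTION_DAYS

print(f"shadow traffic:        {shadow_rps:,} req/s")
print(f"queue memory:          {queue_bytes / 1e6:.1f} MB per gateway instance")
print(f"raw results:           {raw_bytes_per_day / 1e9:.1f} GB/day")
print(f"1% sample:             {sampled_bytes_per_day / 1e9:.2f} GB/day")
print(f"retained over 3 days:  {retained_bytes / 1e9:.2f} GB")
```

Computed exactly, storing every comparison comes to about 442 GB/day, while the 1% sample stays under 14 GB across the full 72-hour retention window.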

The shadow target itself must be scaled to handle the shadow traffic rate. At 10% sampling of 100K req/s production traffic, the shadow target receives 10K req/s, so it needs roughly one tenth of production capacity. If you ramp to 100% sampling, the shadow target must match production scale.

Real World

Envoy Proxy’s traffic shadowing feature implements exactly the gateway-level mirror pattern described here. Envoy mirrors requests asynchronously using a fire-and-forget dispatch that never waits for the shadow response. Shadow responses are received but discarded by Envoy - the response comparison must be built separately, as we’ve done here with the comparison service.

Failure Modes and Recovery

Failure | Detection | Impact | Recovery
--- | --- | --- | ---
Shadow target crash | Shadow error rate spikes to 100% | All shadow requests return errors; comparison records shadow_error_msg | Auto-restart shadow target; shadow comparisons log timeouts; production unaffected
Shadow queue overflow | shadowDroppedCounter metric spikes | Shadow requests dropped without mirroring | Lower the sampling rate or scale the shadow target; never impacts production
Comparison service down | Raw results queue depth grows; confidence dashboard goes stale | No live confidence score for engineers; shadow requests still fire | Restart comparison service; it catches up from the queue; no data loss if queue is durable
Shadow DB replica lag | Shadow sees stale data; false positive divergence on recent writes | Inflated divergence score from stale reads | Monitor replica lag separately; ignore shadow comparisons when replica lag exceeds 5s
Response pairing timeout | Reaper increments shadow_timeout counter | Shadow request completed but not compared; slight undercounting | Tune shadow target performance; increase pairing cache TTL if shadow is consistently slow
Side-effect breach (shadow writes to prod DB) | DB audit log; unexpected write attempt from shadow_readonly user | Real data corruption if config is wrong | Read-only DB user prevents writes at auth level; this is the hard enforcement boundary

Watch Out

The most dangerous failure mode is a side-effect breach - the shadow service accidentally writing to production state. This can happen if the shadow-mode header is not propagated through internal service-to-service calls, causing downstream services to behave as normal. Always propagate X-Shadow-Mode: true through every hop in the request chain, and validate it in every service’s middleware before any mutation.
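Header propagation can be made hard to forget by tying it to request-scoped context. The sketch below uses Python's standard contextvars module; the function names (`enter_request`, `outbound_headers`, `mutation_allowed`) are hypothetical, and in a real service they would be wired into the framework's middleware and HTTP client hooks.

```python
import contextvars
from typing import Optional

SHADOW_HEADER = "X-Shadow-Mode"
_is_shadow = contextvars.ContextVar("is_shadow", default=False)


def enter_request(incoming_headers: dict[str, str]) -> contextvars.Token:
    """Call at the start of request handling; records shadow mode for this context."""
    # Header names are case-insensitive, so normalize before the lookup.
    lowered = {k.lower(): v for k, v in incoming_headers.items()}
    flag = lowered.get(SHADOW_HEADER.lower(), "").strip().lower() == "true"
    return _is_shadow.set(flag)


def exit_request(token: contextvars.Token) -> None:
    """Call when request handling ends; restores the previous context value."""
    _is_shadow.reset(token)


def outbound_headers(headers: Optional[dict[str, str]] = None) -> dict[str, str]:
    """Headers for a downstream call, with the shadow marker copied across when set."""
    out = dict(headers or {})
    if _is_shadow.get():
        out[SHADOW_HEADER] = "true"
    return out


def mutation_allowed() -> bool:
    """Check before any write: False means the mutation must be suppressed."""
    return not _is_shadow.get()
```

Routing every outbound call through `outbound_headers` means a developer has to go out of their way to drop the marker, rather than having to remember to add it.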

Comparison of Approaches

Approach | Isolation | Divergence Signal | Side-Effect Risk | Infrastructure Cost
--- | --- | --- | --- | ---
Shadow traffic (this design) | Full - users never see shadow | Highest - real production traffic | Medium - requires explicit suppression | Medium - shadow target + comparison service
A/B test (1% to new service) | None - real users see new service | High but with user exposure | None - real effects are intended | Low - just traffic routing
Replay-based testing | Full - uses recorded traffic not live | Medium - traffic may be stale | Low - recorded traffic, controlled replay | Low - no additional live infrastructure
Synthetic load test | Full - fake traffic | Low - synthetic traffic misses edge cases | None - synthetic requests are safe | Low - just a load generator
Canary deployment (feature flag) | Partial - % of users on new code | High for those users | Real effects for canary users | Low - just flag routing

Shadow traffic is the right choice when you need to validate correctness (not just performance) of a new service against real production behavior, and when side effects can be reliably suppressed. A/B testing is better when you need real user feedback and the risk of a bad experience is acceptable. Replay testing is a good complement to shadow traffic for regression detection but misses traffic patterns that emerge only in real time.

Key Takeaways

  • The shadow dispatch must be non-blocking: use a buffered channel with a non-blocking send; never couple production latency to shadow processing time.
  • Side-effect suppression is the hardest engineering challenge: database writes, external API calls, and message queue publishes must all be intercepted; missing any one of them risks real-world consequences.
  • Divergence score as a rate, not per-request: a single divergent response is noise; a 0.5% divergence rate sustained over 10,000 requests is signal.
  • Sampling rate is your blast radius control: start low (0.1%), ramp slowly, use the confidence score to decide when to increase.
  • The confidence score must weight error rate delta heavily: a shadow service that crashes more than production is not ready even if response bodies match.
  • Propagate the shadow header through every hop: downstream services must know they are in shadow mode, or they will produce real side effects.
  • Request pairing by ID is essential: production and shadow responses arrive at different times; you need a stable identifier to join them for comparison.
  • A third “noise baseline” instance eliminates false positives from non-determinism: if your production service itself has non-deterministic output, you cannot distinguish real divergence from inherent randomness without a baseline.

The counter-intuitive lesson from building shadow traffic systems is that the hardest work is not the mirroring or the comparison - it is the discipline of side-effect suppression. Every engineer’s first instinct is to focus on “will the responses match?” but the more dangerous question is “what did the shadow service do to the world while answering?” A shadow service that accidentally sends a confirmation email or charges a payment card once per thousand requests causes invisible production incidents that are nearly impossible to debug.

Frequently Asked Questions

Q: Why not just run the new service in a staging environment with production traffic replay? A: Replayed traffic is always stale by the time it arrives. Session context, cache state, and recently-modified records will differ between production and staging. Shadow traffic uses the exact production request at the exact production moment, with the exact production data state accessible via the read replica. Replay catches structural bugs; shadow traffic catches temporal and data-state bugs that replay misses.

Q: How do you handle requests that require authentication - don’t you need to forward auth tokens to the shadow target? A: Yes - the shadow request gets the same auth headers as the production request, including session tokens and API keys. The shadow target authenticates the request normally. The critical constraint is that the shadow target must not use authenticated identity to make real writes or real third-party calls. The auth token is forwarded to give the shadow service access to the right data scope, not to authorize real effects.

Q: What happens when the shadow target is slower than production - does it affect the confidence score? A: Shadow latency does not affect production latency - the dispatcher is fully async. However, high shadow latency affects the confidence score through the latency_ratio component. If the shadow service takes 3x longer than production, its confidence score is penalized even if responses match. This is intentional: a correct-but-slow service is not production-ready.

Q: How do you validate that side-effect suppression is actually working? A: Instrument every suppression point. The read-only DB connection will throw an exception on any write attempt - log these as shadow_write_attempt events. The mock HTTP client should log every call it intercepts as shadow_external_call_suppressed. Monitor these counters - if shadow_write_attempt is zero for an endpoint that normally writes, either the suppression is working or the shadow service is not executing the write path (check divergence).

Q: Can you run shadow traffic for gRPC services, not just HTTP? A: Yes, with the same architecture. Instead of HTTP headers, use gRPC metadata to carry the shadow mode flag (x-shadow-mode: true). Envoy’s request mirroring supports gRPC natively. The comparison engine needs to deserialize protobuf messages rather than JSON - use reflection to walk the message fields and apply the same exclusion list logic. The pairing cache and confidence scoring work identically.

Q: Why store raw comparison results at all - can’t you just aggregate in-stream? A: Aggregated metrics tell you the rate of divergence but not why. When confidence drops from 0.95 to 0.80, you need the specific diff_paths from individual requests to diagnose the cause. Storing 1% of raw results gives you enough samples to identify the divergent field pattern without overwhelming storage. Full raw storage is only necessary during initial bringup when you expect high divergence and need complete data to debug.

Interview Questions

Q: Walk me through how you would guarantee zero production latency impact even if the shadow target becomes completely unreachable.

Expected depth: The shadow dispatch uses a non-blocking channel send with a drop path. The shadow target’s availability does not affect the dispatch - the gateway dispatches to a local in-memory queue, not to the shadow target directly. The shadow dispatcher goroutines consume from the queue and make outbound calls to the shadow target; if the target is down, the goroutines receive errors and log them without affecting any other goroutine. Discuss queue depth monitoring, backpressure signals, and the metric that tells you when to pause the test.

Q: Design the side-effect suppression for a service that reads from a primary database and also calls a downstream inventory service. The inventory service has its own database.

Expected depth: Two suppression layers are needed: the shadow service’s own DB connection must be read-only (pointing to a replica). Calls to the inventory service must carry the X-Shadow-Mode header. The inventory service must have its own shadow mode middleware that intercepts writes and returns mock responses. This is the header propagation requirement - every downstream service in the call graph must understand and honor shadow mode. Discuss the risk of incomplete propagation and how you’d audit for missed suppression points.

Q: How would you modify the confidence score if your production service has inherent non-determinism - for example, it randomly samples 20 features from a pool of 100 to build a response?

Expected depth: Add a third “noise” instance running the same production code. Compare production vs noise-instance to measure the baseline divergence rate from non-determinism. Subtract this baseline from the shadow vs production divergence rate to get the signal-to-noise-adjusted divergence. If prod vs noise shows 8% divergence and prod vs shadow shows 9% divergence, the shadow service has only 1% additional divergence beyond what production itself is non-deterministic about. This is Twitter’s Diffy approach.

Q: The shadow queue fills up during a traffic spike. How do you decide what to drop and how do you report on it?

Expected depth: Drop at the tail: the non-blocking send (select { default: } in Go) discards whichever requests arrive while the queue is full, regardless of endpoint. Never drop selectively by endpoint - you want coverage to stay proportional to traffic across all endpoints. Track drops with a counter per endpoint. Report the effective sampling rate as (selected - dropped) / total_requests, where selected is the number of requests chosen for mirroring, so the confidence score knows its actual sample size. If the effective sampling rate falls below a minimum threshold (say 10% of the configured rate), pause the confidence score update and alert - the results may not be statistically meaningful.
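The reporting side of that answer fits in a few lines. This is a sketch with hypothetical function names; `selected` is the number of requests chosen for mirroring by the sampler and `dropped` is the number of those discarded because the queue was full.

```python
def effective_sampling_rate(total_requests: int, selected: int, dropped: int) -> float:
    """Fraction of production requests that actually reached the shadow queue."""
    if total_requests <= 0:
        return 0.0
    return (selected - dropped) / total_requests


def should_pause_scoring(effective: float, configured: float, min_fraction: float = 0.10) -> bool:
    """True when so many copies are being dropped that the score is not trustworthy."""
    return effective < configured * min_fraction
```

With 100,000 production requests, 10,000 selected, and 2,000 dropped, the effective rate is 8% against a configured 10%, which is well above the pause threshold of 1%.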
