Build a Multi-Region Config Management System



System Design Deep Dive


One wrong config key pushed to 20 regions simultaneously - and you need rollback in under 5 seconds


Think of your application’s configuration as sheet music handed to an orchestra of 20 regional ensembles. When you change the tempo marking, every ensemble needs to play the new tempo - and they need to do it within the same measure, not drift independently. One ensemble playing at the wrong tempo for even a few bars turns the symphony into chaos. Now imagine the conductor needs to be able to instantly revert to the previous tempo if the new one causes the violins to fall apart.

Configuration management at global scale carries exactly this tension. A feature_flag: true that works flawlessly in us-east-1 can trigger a latency spike in ap-southeast-1 if the dependent service has a version mismatch. A rate_limit: 1000 that protects a database in Europe may be too low for Asia-Pacific traffic patterns. The challenge is not just propagating values - it is propagating them fast enough that no region runs a different config version for any meaningful duration, and guaranteeing that a bad change can be retracted before it causes a customer-visible incident.

The naive approach - a centralized config database that all 20 regions poll every minute - fails on three fronts. First, 60-second propagation windows mean regions can diverge on a rolling-deploy timeline. Second, a polling architecture creates a thundering herd on the config DB during a high-churn deployment. Third, rollback becomes “push a new version and wait another minute,” which is too slow when you’re watching error rates climb in real time.

We need to solve for three constraints simultaneously: sub-500ms propagation to all 20 regions, prevention of config divergence between regions, and instant rollback within 5 seconds of triggering it.

Requirements and Constraints

Functional Requirements

  • Store and version configuration key-value pairs (strings, booleans, numbers, JSON blobs)
  • Push changes to all 20 global regions within 500ms of commit
  • Detect and alert on config divergence between regions
  • Support instant rollback to any previous version by key or entire config snapshot
  • Provide an audit trail of every change with author, timestamp, and diff
  • Support validation gates - schema checks and canary rollout - before global propagation

Non-Functional Requirements

  • Propagation latency: p99 under 500ms globally
  • Rollback latency: under 5 seconds from trigger to full propagation
  • Availability: 99.99% (config reads must succeed even during control-plane outages)
  • Write throughput: up to 1,000 config changes/hour during peak deployments
  • Config store size: up to 10 million key-value pairs per environment
  • Read throughput: up to 100,000 config reads/second per region (from local cache)

Constraints and Assumptions

  • Config reads are on the hot path; writes are infrequent compared to reads
  • Regions are assumed to have stable inter-region connectivity (dedicated point-to-point links, not the public internet)
  • A config change that passes validation can still be bad at runtime - rollback is the recovery path
  • We are not building feature flag evaluation logic; this is the distribution layer

High-Level Architecture

The system splits into two planes: a control plane that manages versioning, validation, and change propagation; and a data plane that serves config reads with sub-millisecond latency to local services.

Multi-region config management architecture overview

The control plane lives in a primary region and consists of a Config API, a versioned Config Store backed by a distributed KV database, a Validation Engine, and a Propagation Coordinator. When an operator commits a config change through the Config API, the Validation Engine runs schema checks and optional canary deployment. If validation passes, the Propagation Coordinator fans out the change to all 20 regional Config Agents via a push channel.

Each region runs a Config Agent that receives pushed updates, writes them to a local Config Cache (an in-memory store with disk persistence), and serves all local service reads. Services never call the control plane directly - they call the local agent, so config reads survive control-plane outages entirely.

The Config Agent also runs a background reconciliation loop that compares its local version digest against the control plane’s expected version every 30 seconds - catching any divergence from missed pushes.
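
One reconciliation pass can be sketched as follows. This is a minimal illustration, not the agent's actual code: state_digest, reconcile_once, and the fetch_snapshot callback are hypothetical names, and the digest is assumed to be a SHA-256 over sorted key/hash pairs. A real agent would run this on a 30-second timer.

```python
import hashlib


def state_digest(values: dict[str, str]) -> str:
    """SHA-256 over sorted `key=sha256(value)` lines (assumed digest shape)."""
    lines = sorted(
        f"{k}={hashlib.sha256(v.encode()).hexdigest()}" for k, v in values.items()
    )
    return hashlib.sha256("\n".join(lines).encode()).hexdigest()


def reconcile_once(local: dict[str, str], expected_digest: str, fetch_snapshot) -> bool:
    """Run one reconciliation pass. Returns True if a full sync was needed."""
    if state_digest(local) == expected_digest:
        return False  # local cache already matches the control plane
    fresh = fetch_snapshot()  # hypothetical pull of the full current snapshot
    local.clear()
    local.update(fresh)
    return True


# The agent missed a push and still holds an old rate limit.
control_plane = {"rate_limit_rps": "1000", "feature_new_checkout": "true"}
agent_cache = {"rate_limit_rps": "500", "feature_new_checkout": "true"}
synced = reconcile_once(
    agent_cache, state_digest(control_plane), lambda: dict(control_plane)
)
```

After the pass, agent_cache holds the control plane's values, and a second pass with the same expected digest is a no-op.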

Key Insight

The data plane (local agent) and control plane (central store) must be completely decoupled for reads. Reads should never cross region boundaries; only writes should. This is what gives you both low-latency reads and high availability during control-plane failures.

The Config Store

The Config Store is the single source of truth for all configuration. Its job is to maintain an immutable, append-only version history for every key, allowing point-in-time queries and enabling instant rollback.

Config store version history and snapshot model

Every write creates a new revision - a monotonically increasing integer. The Config Store does not update records in place; it appends a new version entry with the new value, the author, a timestamp, and a SHA-256 content hash. The current value for a key is always the highest revision.

A snapshot is a named, immutable point-in-time capture of all keys at a given revision. The rollback system works at snapshot granularity, not individual key granularity, ensuring the entire config set returns to a known-consistent state.
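
To make the revision and snapshot model concrete, here is a small in-memory sketch. The class and method names (ConfigStore, put, current, snapshot) are illustrative assumptions; the real store is the SQL schema that follows.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Version:
    key: str
    revision: int
    value: str
    author: str
    content_hash: str


@dataclass
class ConfigStore:
    versions: list[Version] = field(default_factory=list)  # append-only log
    snapshots: dict[str, dict[str, int]] = field(default_factory=dict)  # name -> {key: revision}
    revision: int = 0

    def put(self, key: str, value: str, author: str) -> int:
        """Append a new version; existing entries are never modified."""
        self.revision += 1
        digest = hashlib.sha256(value.encode()).hexdigest()
        self.versions.append(Version(key, self.revision, value, author, digest))
        return self.revision

    def current(self) -> dict[str, Version]:
        """Highest revision per key; the log is ordered, so later entries win."""
        latest: dict[str, Version] = {}
        for v in self.versions:
            latest[v.key] = v
        return latest

    def snapshot(self, name: str) -> None:
        """Capture which revision of each key is current. Snapshots are immutable."""
        if name in self.snapshots:
            raise ValueError(f"snapshot {name} already exists")
        self.snapshots[name] = {k: v.revision for k, v in self.current().items()}


store = ConfigStore()
store.put("rate_limit_rps", "1000", "alice")
store.snapshot("before-change")
store.put("rate_limit_rps", "5000", "bob")
```

The snapshot still points at revision 1 after the second write, which is what lets a later rollback find the old value.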

-- Config Store schema
CREATE TABLE config_versions (
  id          BIGSERIAL PRIMARY KEY,
  key         TEXT NOT NULL,
  environment TEXT NOT NULL,
  revision    BIGINT NOT NULL,
  value       TEXT NOT NULL,
  value_type  TEXT NOT NULL CHECK (value_type IN ('string','bool','int','float','json')),
  author      TEXT NOT NULL,
  commit_msg  TEXT,
  content_hash CHAR(64) NOT NULL,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  UNIQUE (key, environment, revision)
);

CREATE TABLE config_snapshots (
  id            BIGSERIAL PRIMARY KEY,
  snapshot_name TEXT NOT NULL,
  environment   TEXT NOT NULL,
  revision      BIGINT NOT NULL,
  key_count     INT NOT NULL,
  digest        CHAR(64) NOT NULL,
  created_at    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  UNIQUE (snapshot_name, environment)
);

CREATE TABLE snapshot_keys (
  snapshot_id BIGINT REFERENCES config_snapshots(id),
  key         TEXT NOT NULL,
  revision    BIGINT NOT NULL,
  value_hash  CHAR(64) NOT NULL,
  PRIMARY KEY (snapshot_id, key)
);

CREATE INDEX idx_config_versions_key_env ON config_versions(key, environment, revision DESC);
CREATE INDEX idx_config_versions_created ON config_versions(created_at DESC);

The content_hash field is critical: it lets the Propagation Coordinator verify that what a regional Config Agent received matches what was sent, without transmitting the full value again.

Real World

etcd (used by Kubernetes) uses a similar revision-based model where every write increments a global revision counter and history is queryable by revision until it is compacted. The Kubernetes API server builds on etcd’s watch API to push changes to controllers - the same push model we’re building here for regional propagation.

The Propagation Coordinator

The Propagation Coordinator fans out config changes to all regions as fast as the network allows. Its job is to ensure that within 500ms of a write committing, every region’s Config Agent has the new version in its local cache and has acknowledged receipt.

Propagation coordinator fan-out and acknowledgement flow

The coordinator maintains a persistent gRPC streaming connection to every regional Config Agent. When a change commits, the coordinator pushes a ConfigDelta message containing the changed keys, their new values, the new revision, and the content hash. It does not wait for an ACK before pushing to the next region - it fans out to all 20 regions simultaneously and then collects ACKs.

// Config propagation proto
syntax = "proto3";

service ConfigPropagation {
  rpc Subscribe(SubscribeRequest) returns (stream ConfigDelta);
  rpc Acknowledge(AckRequest) returns (AckResponse);
  rpc ForceSync(SyncRequest) returns (SyncResponse);
}

message ConfigDelta {
  string   environment   = 1;
  int64    revision      = 2;
  int64    prev_revision = 3;
  repeated KeyChange changes = 4;
  string   digest        = 5;
  int64    timestamp_ms  = 6;
}

message KeyChange {
  string key          = 1;
  string value        = 2;
  string value_type   = 3;
  string content_hash = 4;
  bool   deleted      = 5;
}

message AckRequest {
  string region      = 1;
  int64  revision    = 2;
  string digest      = 3;
}

The coordinator tracks ACK status per region per revision. If a region fails to ACK within 2 seconds, it retries the push. If a region fails three consecutive pushes, it marks that region as diverged and fires an alert - the reconciliation loop on the agent side will detect the missed revision and trigger a full sync.
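
The retry and divergence rule can be sketched as a small state tracker. This is a simplified illustration that tracks per region only (the text tracks per region per revision); AckTracker and the action strings are hypothetical names.

```python
from dataclasses import dataclass, field

MAX_CONSECUTIVE_FAILURES = 3  # from the text: three failed pushes marks a region diverged


@dataclass
class AckTracker:
    failures: dict[str, int] = field(default_factory=dict)  # region -> consecutive failures
    diverged: set[str] = field(default_factory=set)

    def record(self, region: str, acked: bool) -> str:
        """Record one push attempt and return the coordinator's next action."""
        if acked:
            self.failures[region] = 0
            self.diverged.discard(region)
            return "done"
        self.failures[region] = self.failures.get(region, 0) + 1
        if self.failures[region] >= MAX_CONSECUTIVE_FAILURES:
            self.diverged.add(region)
            return "alert_diverged"
        return "retry"


tracker = AckTracker()
actions = [tracker.record("ap-southeast-1", acked=False) for _ in range(3)]
```

Here the first two failures yield "retry" and the third yields "alert_diverged". A later successful ACK clears the diverged mark.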

Watch Out

Do not use the acknowledgement timeout as a health check for the region itself. A slow ACK might mean the agent is processing a large config delta, not that the region is down. Separate config divergence alerts from infrastructure health monitoring - conflating the two causes alert fatigue and masks real failures.

The Regional Config Agent

The Config Agent is the local representative of the config system in each region. It serves config reads to local services at sub-millisecond latency and is deliberately designed to operate entirely independently during control-plane outages.

The agent maintains two layers of storage:

  1. In-memory hot cache: a Go sync.Map (or Redis) holding all current config values, keyed by {env}:{key}. Serves reads in under 100 microseconds.
  2. Disk-backed cold store: a local embedded store (SQLite, RocksDB, or Badger as in the struct below) holding the full version history for the last 48 hours. Used for reconciliation and to rebuild the hot cache on restart.

// Config Agent read path
type ConfigAgent struct {
    hotCache  *sync.Map
    coldStore *badger.DB
    region    string
    env       string
    revision  int64
    mu        sync.RWMutex
}

func (a *ConfigAgent) Get(key string) (ConfigValue, bool) {
    // Hot path: sub-100 microsecond read from in-memory map
    if v, ok := a.hotCache.Load(a.cacheKey(key)); ok {
        return v.(ConfigValue), true
    }
    return ConfigValue{}, false
}

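func (a *ConfigAgent) ApplyDelta(delta *ConfigDelta) error {
    // Serialize writers so the revision checks and the cache update happen as one step.
    a.mu.Lock()
    defer a.mu.Unlock()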
    if delta.Revision <= atomic.LoadInt64(&a.revision) {
        return nil // already applied, idempotent
    }
    if delta.PrevRevision != atomic.LoadInt64(&a.revision) {
        return ErrRevisionGap // trigger full sync
    }
    for _, change := range delta.Changes {
        if change.Deleted {
            a.hotCache.Delete(a.cacheKey(change.Key))
        } else {
            a.hotCache.Store(a.cacheKey(change.Key), ConfigValue{
                Value:    change.Value,
                Type:     change.ValueType,
                Revision: delta.Revision,
            })
        }
        // Persist to cold store asynchronously
        go a.persistChange(change, delta.Revision)
    }
    atomic.StoreInt64(&a.revision, delta.Revision)
    return nil
}

func (a *ConfigAgent) cacheKey(key string) string {
    return a.env + ":" + key
}

The PrevRevision check is the gap detector. If an agent missed a push (network blip, restart), delta.PrevRevision will not match the agent’s current revision, and the agent immediately requests a full sync rather than applying a partial delta on top of a stale base.
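
A short walk-through of the gap case, ported to Python for illustration. The Agent class, RevisionGap exception, and full_sync method are hypothetical stand-ins for the Go agent above; the snapshot passed to full_sync is assumed to come from the control plane.

```python
class RevisionGap(Exception):
    """Raised when a delta does not build on the agent's current revision."""


class Agent:
    def __init__(self) -> None:
        self.revision = 0
        self.cache: dict[str, str] = {}

    def apply_delta(self, revision: int, prev_revision: int, changes: dict[str, str]) -> bool:
        """Apply a delta. Returns False for a duplicate, raises RevisionGap on a gap."""
        if revision <= self.revision:
            return False  # already applied, safe to ignore
        if prev_revision != self.revision:
            raise RevisionGap(f"have {self.revision}, delta builds on {prev_revision}")
        self.cache.update(changes)
        self.revision = revision
        return True

    def full_sync(self, revision: int, snapshot: dict[str, str]) -> None:
        """Replace local state wholesale with the control plane's snapshot."""
        self.cache = dict(snapshot)
        self.revision = revision


agent = Agent()
agent.apply_delta(1, 0, {"a": "1"})
# Revision 2 is lost in transit; revision 3 arrives next.
try:
    agent.apply_delta(3, 2, {"c": "3"})
except RevisionGap:
    agent.full_sync(3, {"a": "1", "b": "2", "c": "3"})
```

The agent never applies revision 3 on top of revision 1. It ends at revision 3 with all three keys, and a redelivered revision 3 is dropped as a duplicate.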

Key Insight

The two revision checks make delta application idempotent and gap-safe: the Revision comparison drops duplicates, and the PrevRevision comparison catches gaps. Without the gap check, a missed message causes silent config divergence - the agent applies future deltas on top of a stale base and ends up with a mixed config state that is worse than being one version behind.

Config Versioning and Rollback

Rollback is the system’s most operationally critical path. When a bad config causes elevated error rates, engineers need to trigger rollback with a single command and see it complete globally within 5 seconds.

The rollback flow works at snapshot granularity:

  1. Operator runs config rollback --env production --to snapshot-2026-06-11-14:30:00
  2. The Config API looks up the target snapshot’s revision and retrieves all snapshot_keys for that snapshot
  3. For each key in the snapshot, it creates a new config_versions entry with the old value and a new revision number - rollback is itself a write, preserving the audit trail
  4. The Propagation Coordinator pushes the rollback delta as if it were a normal change - the regional agents see a new revision with the rolled-back values and apply it identically

This means rollback uses the same propagation path as forward writes, inheriting the same 500ms SLA. There is no special rollback code path - only normal propagation.

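# Rollback execution logic
# Assumes `db` is a database handle returning dict-like rows and
# next_revision(env) is a helper that allocates the next revision number.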
def execute_rollback(env: str, target_snapshot_name: str, author: str) -> int:
    """Returns the new revision created by the rollback."""
    snapshot = db.query(
        "SELECT id, revision FROM config_snapshots WHERE environment=%s AND snapshot_name=%s",
        (env, target_snapshot_name)
    ).fetchone()
    if not snapshot:
        raise ValueError(f"Snapshot {target_snapshot_name} not found")

    snapshot_keys = db.query(
        """SELECT ck.key, cv.value, cv.value_type, cv.content_hash
           FROM snapshot_keys ck
           JOIN config_versions cv ON cv.key = ck.key AND cv.revision = ck.revision AND cv.environment = %s
           WHERE ck.snapshot_id = %s""",
        (env, snapshot["id"])
    ).fetchall()

    new_revision = next_revision(env)
    for sk in snapshot_keys:
        db.execute(
            """INSERT INTO config_versions
               (key, environment, revision, value, value_type, author, commit_msg, content_hash)
               VALUES (%s, %s, %s, %s, %s, %s, %s, %s)""",
            (sk["key"], env, new_revision, sk["value"], sk["value_type"],
             author, f"ROLLBACK to {target_snapshot_name}", sk["content_hash"])
        )
    db.commit()
    return new_revision

Real World

LaunchDarkly’s feature flag system uses this same “rollback is a write” pattern. Instead of reverting state in place, it creates a new flag version that restores old values. This means the audit log always shows a complete forward-only history, and “undo” is represented as an explicit action rather than a state mutation.
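
One detail worth sketching: restoring a snapshot also has to remove keys that were created after it, or they would survive the rollback. The function below is a hedged variation on the rollback logic above, not a drop-in replacement. It computes only the changes needed to turn the current state into the snapshot state, using a deleted flag like the one on KeyChange. build_rollback_changes is a hypothetical name.

```python
def build_rollback_changes(current: dict[str, str], snapshot: dict[str, str]) -> list[dict]:
    """Changes that turn `current` into `snapshot`, including deletions."""
    changes = []
    # Restore every key whose value differs from (or is missing in) the current state.
    for key in sorted(snapshot):
        if current.get(key) != snapshot[key]:
            changes.append({"key": key, "value": snapshot[key], "deleted": False})
    # Tombstone keys that did not exist when the snapshot was taken.
    for key in sorted(current.keys() - snapshot.keys()):
        changes.append({"key": key, "value": "", "deleted": True})
    return changes


current = {"rate_limit_rps": "5000", "feature_new_checkout": "true", "new_key": "x"}
snapshot = {"rate_limit_rps": "1000", "feature_new_checkout": "true"}
changes = build_rollback_changes(current, snapshot)
```

In this example the unchanged flag is skipped, the rate limit is restored, and new_key is tombstoned. Sending only the difference also keeps the rollback delta small.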

Validation Gates

Pushing a config change to 20 regions at once without any validation is operationally reckless. The Validation Engine runs two gates before global propagation:

Gate 1 - Schema validation: Every config key has a registered schema (type, allowed values, min/max for numerics, regex for strings). The validation engine rejects any write that violates the schema before it even reaches the Config Store.

# Config key schema definition
keys:
  rate_limit_rps:
    type: int
    min: 10
    max: 100000
    environments: [staging, production]
    requires_review_above: 10000
  feature_new_checkout:
    type: bool
    environments: [staging, production]
  db_connection_pool_size:
    type: int
    min: 5
    max: 500
    rollout_strategy: canary
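
A schema check over definitions like these might look as follows. This is a minimal sketch: the SCHEMAS dict mirrors the YAML above by hand, and validate and its error strings are illustrative assumptions, not the Validation Engine's real interface.

```python
TYPES = {"int": int, "bool": bool, "float": float, "string": str}

SCHEMAS = {
    "rate_limit_rps": {"type": "int", "min": 10, "max": 100000,
                       "environments": ["staging", "production"]},
    "feature_new_checkout": {"type": "bool", "environments": ["staging", "production"]},
    "db_connection_pool_size": {"type": "int", "min": 5, "max": 500},
}


def validate(key: str, value, env: str, schemas: dict = SCHEMAS) -> list[str]:
    """Return a list of violations; an empty list means the write is accepted."""
    schema = schemas.get(key)
    if schema is None:
        return [f"{key}: no registered schema"]
    envs = schema.get("environments")
    if envs is not None and env not in envs:
        return [f"{key}: not allowed in environment {env}"]
    expected = TYPES[schema["type"]]
    # bool is a subclass of int in Python, so reject True/False for non-bool keys.
    if not isinstance(value, expected) or (expected is not bool and isinstance(value, bool)):
        return [f"{key}: expected {schema['type']}, got {type(value).__name__}"]
    errors = []
    if "min" in schema and value < schema["min"]:
        errors.append(f"{key}: {value} is below min {schema['min']}")
    if "max" in schema and value > schema["max"]:
        errors.append(f"{key}: {value} is above max {schema['max']}")
    return errors


ok = validate("rate_limit_rps", 1000, "production")
too_low = validate("rate_limit_rps", 5, "production")
```

Returning a list of readable violations, rather than a bare boolean, gives the operator the clear error message that makes write-time rejection useful.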

Gate 2 - Canary rollout: For keys tagged with rollout_strategy: canary, the Propagation Coordinator pushes the change to a single canary region first and holds global propagation for a configurable wait window (default: 60 seconds). During the window, it monitors the control plane’s error rate feed for that region. If error rates stay within baseline, it proceeds to global propagation. If error rates spike, it triggers automatic rollback.
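
The promote-or-rollback decision at the end of the wait window can be sketched as a simple threshold check. The tolerance factor and the rule of comparing the worst sample against baseline are assumptions for illustration; the text only says error rates must stay "within baseline".

```python
def canary_decision(baseline_error_rate: float, samples: list[float],
                    tolerance: float = 1.5) -> str:
    """Return 'promote' or 'rollback' once the canary wait window has elapsed.

    samples: error rates observed in the canary region during the window.
    tolerance: hypothetical multiplier over baseline that still counts as healthy.
    """
    if not samples:
        return "rollback"  # no data is treated as unsafe in this sketch
    worst = max(samples)
    return "promote" if worst <= baseline_error_rate * tolerance else "rollback"


healthy = canary_decision(0.01, [0.009, 0.012, 0.011])
spiking = canary_decision(0.01, [0.010, 0.040])
```

With a 1% baseline, the first window stays under the 1.5% ceiling and promotes, while the second hits 4% and rolls back.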

Watch Out

Canary rollout works well for gradual risk reduction but adds latency to the propagation path. Do not apply canary gates to all keys by default - reserve them for keys that control traffic routing, resource limits, or third-party integrations. Keys controlling cosmetic features (copy, colors) should propagate immediately without canary gates.

Data Model

Config change lifecycle from write to regional propagation

The data model centers on three concepts: versions (individual key writes), snapshots (consistent captures of all keys), and propagation events (the log of what was pushed to which region when).

-- Propagation event tracking
CREATE TABLE propagation_events (
  id          BIGSERIAL PRIMARY KEY,
  revision    BIGINT NOT NULL,
  environment TEXT NOT NULL,
  region      TEXT NOT NULL,
  status      TEXT NOT NULL CHECK (status IN ('pending','acked','failed','diverged')),
  pushed_at   TIMESTAMPTZ,
  acked_at    TIMESTAMPTZ,
  retry_count INT NOT NULL DEFAULT 0,
  error_msg   TEXT
);

CREATE INDEX idx_prop_events_revision ON propagation_events(revision, environment);
CREATE INDEX idx_prop_events_region_status ON propagation_events(region, status);

-- Config schema registry
CREATE TABLE config_schemas (
  key         TEXT NOT NULL,
  environment TEXT NOT NULL,
  schema_json JSONB NOT NULL,
  updated_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  PRIMARY KEY (key, environment)
);

The propagation_events table is the divergence detection source of truth. A background job queries it every 30 seconds for any revisions where not all 20 regions have acked status within 2 minutes - these are the divergence candidates that trigger reconciliation.
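
The background job's query can be sketched over in-memory rows. This assumes each row carries a non-null pushed_at and uses three regions instead of 20 to keep the example short; divergence_candidates is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

ACK_DEADLINE = timedelta(minutes=2)  # from the text: all regions acked within 2 minutes


def divergence_candidates(events: list[dict], regions: list[str],
                          now: datetime) -> dict[int, list[str]]:
    """Map each overdue revision to the regions that have not acked it."""
    first_push: dict[int, datetime] = {}
    acked: dict[int, set[str]] = {}
    for e in events:
        rev = e["revision"]
        if rev not in first_push or e["pushed_at"] < first_push[rev]:
            first_push[rev] = e["pushed_at"]
        if e["status"] == "acked":
            acked.setdefault(rev, set()).add(e["region"])
    overdue: dict[int, list[str]] = {}
    for rev, pushed in first_push.items():
        missing = set(regions) - acked.get(rev, set())
        if missing and now - pushed > ACK_DEADLINE:
            overdue[rev] = sorted(missing)
    return overdue


REGIONS = ["us-east-1", "eu-west-1", "ap-southeast-1"]
t0 = datetime(2026, 6, 11, 14, 30, tzinfo=timezone.utc)
recent = t0 + timedelta(minutes=4, seconds=30)
events = [
    {"revision": 41, "region": "us-east-1", "status": "acked", "pushed_at": t0},
    {"revision": 41, "region": "eu-west-1", "status": "acked", "pushed_at": t0},
    {"revision": 41, "region": "ap-southeast-1", "status": "failed", "pushed_at": t0},
    {"revision": 42, "region": "us-east-1", "status": "acked", "pushed_at": recent},
    {"revision": 42, "region": "eu-west-1", "status": "pending", "pushed_at": recent},
]
candidates = divergence_candidates(events, REGIONS, now=t0 + timedelta(minutes=5))
```

Revision 41 is five minutes old with one region unacked, so it is flagged. Revision 42 is only 30 seconds old and is still within its deadline.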

Key Algorithms and Protocols

Content-addressed digest for divergence detection: Each snapshot has a digest computed as the SHA-256 of all {key}={value_hash} pairs sorted lexicographically. Regional agents compute the same digest over their local state and compare it against the control plane’s expected digest during reconciliation. A mismatch triggers a full sync, regardless of how many individual key hashes match.

import hashlib

def compute_config_digest(key_values: dict[str, str]) -> str:
    """Canonical digest over all config key-value hashes."""
    # Sort keys lexicographically for deterministic ordering
    pairs = sorted(
        f"{k}={hashlib.sha256(v.encode()).hexdigest()}"
        for k, v in key_values.items()
    )
    combined = "\n".join(pairs)
    return hashlib.sha256(combined.encode()).hexdigest()

Two-phase push with timeout-based fallback: The coordinator pushes a delta and waits for ACKs. Each region gets 2 seconds to ACK; any region still unacknowledged when the overall deadline expires (3 seconds in the code below) receives a ForceSync RPC, which causes the agent to pull the full current snapshot from the control plane’s read replica in that geography. This converts a push failure into a pull - the agent does not rely solely on receiving pushes.

// Coordinator push with fallback to full sync
func (c *Coordinator) PropagateRevision(revision int64, delta *ConfigDelta) {
    ackCh := make(chan RegionAck, len(c.regions))
    for _, region := range c.regions {
        go func(r Region) {
            err := r.PushDelta(delta)
            if err != nil {
                ackCh <- RegionAck{Region: r.Name, Success: false}
                return
            }
            ack, err := r.WaitAck(revision, 2*time.Second)
            ackCh <- RegionAck{Region: r.Name, Success: err == nil && ack.Digest == delta.Digest}
        }(region)
    }
    timer := time.After(3 * time.Second)
    pending := make(map[string]bool)
    for _, r := range c.regions {
        pending[r.Name] = true
    }
    for len(pending) > 0 {
        select {
        case ack := <-ackCh:
            if ack.Success {
                delete(pending, ack.Region)
            }
        case <-timer:
            for regionName := range pending {
                go c.TriggerForceSync(regionName, revision)
            }
            return
        }
    }
}

Key Insight

The fallback from push to pull is the reliability safety net. A purely push-based system has a single failure mode: missed pushes cause permanent divergence. Adding pull-based reconciliation means the system self-heals - the worst case is not divergence but a slightly longer convergence time.

Scaling and Performance

Scaling architecture with read replicas and regional caches

The control plane’s write path is intentionally not on the hot path for reads. The Config Store handles at most 1,000 writes/hour during deployments - this is trivially handled by a single primary PostgreSQL instance or a distributed KV like etcd. The bottleneck is not writes; it is the propagation fan-out.

With 20 regions and a gRPC streaming connection per region, the coordinator maintains 20 persistent connections. This is well within a single Go process’s capability. The bottleneck emerges when a config change contains thousands of changed keys (a bulk import or an environment reset). For large deltas, the coordinator compresses the payload with Zstd and switches from per-key deltas to a full snapshot transfer.
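
The switch between delta and snapshot encoding can be sketched like this. It is a simplified illustration: the 1,000-key threshold is a made-up value, JSON stands in for the protobuf wire format, and zlib from the standard library stands in for Zstd.

```python
import json
import zlib

BULK_KEY_THRESHOLD = 1000  # hypothetical cutoff; tune against real delta sizes


def encode_update(changes: dict[str, str], full_state: dict[str, str]) -> tuple[str, bytes]:
    """Small changes ship as a plain delta; bulk changes ship as a compressed snapshot."""
    if len(changes) < BULK_KEY_THRESHOLD:
        return "delta", json.dumps(changes, sort_keys=True).encode()
    raw = json.dumps(full_state, sort_keys=True).encode()
    return "snapshot", zlib.compress(raw)  # zlib used here in place of Zstd


def decode_update(kind: str, payload: bytes) -> dict[str, str]:
    raw = zlib.decompress(payload) if kind == "snapshot" else payload
    return json.loads(raw)


small_kind, _ = encode_update({"rate_limit_rps": "1000"}, {"rate_limit_rps": "1000"})
bulk = {f"key{i}": "v" * 256 for i in range(10_000)}
bulk_kind, bulk_payload = encode_update(bulk, bulk)
```

The kind tag tells the agent whether to apply the payload on top of its current state or to replace that state wholesale.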

Back-of-envelope capacity:
  Config changes: 1,000/hour = 0.28 writes/second
  Keys per change: avg 5, max 10,000 (bulk)
  Value size: avg 256 bytes, max 64KB
  Regions: 20

Normal propagation:
  Delta size: 5 keys * 256 bytes = 1.28KB per change
  Fan-out bandwidth: 1.28KB * 20 regions = 25.6KB per commit
  At 1,000 changes/hour: 25.6KB * 0.28/s = 7.2KB/s outbound

Bulk propagation (environment reset):
  Delta size: 10,000 keys * 256 bytes = 2.56MB
  Zstd compressed: ~512KB
  Fan-out to 20 regions: 512KB * 20 = 10.24MB
  At 10Gbps inter-region link: ~0.4 milliseconds per region; ~8 milliseconds if all 20 copies share one 10Gbps egress link (serialization only, excludes network RTT)

Read path per region:
  100,000 reads/second from in-memory hot cache
  Cache hit rate: ~100% (all keys fit in RAM: ~256MB per 1M keys at 256 bytes/key, ~2.56GB at the 10M-key ceiling)
  Cache miss (cold start): served from local disk store in <1ms
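
The same estimate as a few lines of arithmetic, so the inputs are easy to change. The 512KB compressed size is the article's assumed ratio (roughly 5x), and the transfer times cover serialization on a 10Gbps link only, not network round trips.

```python
REGIONS = 20
KB = 1000

# Steady state: 1,000 changes/hour, 5 keys of 256 bytes each.
changes_per_sec = 1000 / 3600
delta_bytes = 5 * 256
fanout_bytes = delta_bytes * REGIONS
steady_state_bytes_per_sec = fanout_bytes * changes_per_sec

# Bulk change: 10,000 keys of 256 bytes, assumed to compress to ~512KB.
bulk_bytes = 10_000 * 256
compressed_bytes = 512 * KB
link_bits_per_sec = 10e9
per_region_ms = compressed_bytes * 8 / link_bits_per_sec * 1000
all_regions_ms = per_region_ms * REGIONS  # if every copy leaves through one link

# Hot cache footprint at 256 bytes per key.
cache_bytes_1m = 1_000_000 * 256
cache_bytes_10m = 10_000_000 * 256

print(f"steady-state outbound: {steady_state_bytes_per_sec / KB:.1f} KB/s")
print(f"bulk transfer: {per_region_ms:.2f} ms per region, {all_regions_ms:.1f} ms for all copies")
print(f"cache: {cache_bytes_1m / 1e6:.0f} MB per 1M keys, {cache_bytes_10m / 1e9:.2f} GB for 10M keys")
```

Serialization time is negligible in both cases, so the 500ms budget is dominated by inter-region round trips rather than bandwidth.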

The control plane’s read replicas (one per geographic cluster) serve the reconciliation pull requests from regional agents. This prevents reconciliation traffic from hitting the primary write node.

Real World

etcd’s watch streaming API handles the same fan-out problem for Kubernetes controllers. A single etcd cluster can serve thousands of watch streams to controllers across a cluster. The key insight is that watches are push-based and a watcher can resume from the last revision it saw, so only the missed changes are delivered on reconnect - similar to what our coordinator does with PrevRevision tracking.

Failure Modes and Recovery

Failure | Detection | Impact | Recovery
Control plane primary failure | Health check; no writes accepted | New config writes blocked; reads unaffected | Promote read replica to primary; propagation resumes
Regional Config Agent restart | Agent re-subscribes; reconciliation runs | Brief window with stale cache from previous state | Agent rebuilds hot cache from local disk store on startup
Push stream disconnect | gRPC keepalive timeout (10s) | Region misses deltas during disconnect window | Agent requests full sync on reconnect; ForceSync from coordinator
Config divergence (missed delta) | 30s reconciliation digest mismatch | Region serves stale config values | Agent pulls full snapshot from nearest read replica
Bad config pushed (elevated errors) | Canary error rate monitor or manual alert | Elevated error rates in affected regions | Operator triggers rollback; 500ms propagation of rollback delta
Control plane network partition | Split-brain on primary election | Possible stale reads from isolated replica | Read replicas serve stale but consistent data; writes blocked until partition heals

Watch Out

The most common operational mistake is triggering rollback manually while an automated canary already detected the error and is mid-rollback. This causes two competing write streams to the Config Store, potentially interleaving revisions. Always check the canary status before manually triggering rollback - or disable the canary first.

Comparison of Approaches

Approach | Propagation Latency | Regional Availability | Rollback Speed | Operational Complexity
Centralized polling (1-min interval) | Up to 60s | Degrades if central DB is slow | Push new version, wait up to 60s | Low - single DB, simple polling
Push-only (no reconciliation) | ~100ms | Reads stay available, but stale after a missed push | Push rollback delta | Medium - streaming connections needed
Pull-based with short TTL (5s) | Up to 5s | High - pull survives push failures | Push rollback, wait up to 5s | Low-Medium - no persistent connections
Push + pull reconciliation (this design) | ~200ms p99 | Full - reads survive all central failures | Push rollback delta, ~200ms | High - streaming + reconciliation logic
Gossip-based propagation | 500ms-2s | Very high - no central coordinator | Requires epoch-based versioning | Very High - gossip protocol complexity

Push plus pull reconciliation is the right choice for this scale. Pure polling gives tolerable consistency at low complexity, but the 60-second propagation window is unacceptable when a bad config needs rollback in under 5 seconds. Gossip-based propagation adds substantial complexity without meaningful availability improvements since we already have regional agents that survive central failures.

Key Takeaways

  • Config versioning is append-only: every write creates a new revision; nothing is ever updated in place, making audit trails and rollback trivially correct.
  • Rollback is a write: expressing rollback as a new revision that restores old values means it uses the same propagation path as forward changes, inheriting the same SLA.
  • Local agent decouples read availability from control-plane health: services in a region should never block on cross-region calls to read config, even during control-plane outages.
  • PrevRevision gap detection prevents silent divergence: an agent that detects it missed a push immediately requests a full sync rather than applying future deltas on a stale base.
  • Push plus pull reconciliation is the right reliability model: push for latency, pull for correctness guarantees when pushes fail.
  • Content-addressed digests catch divergence that delta tracking misses: even if revision numbers match, a cryptographic digest of all values detects any bit-level divergence.
  • Canary gates on high-risk keys bound blast radius: not every config key needs a canary window, but keys controlling traffic routing or resource limits warrant one.
  • Schema validation at write time is cheaper than debugging at runtime: a rejected write with a clear error message is far less costly than a config change that silently accepts an out-of-range value and causes subtle failures at 3 AM.

The counter-intuitive lesson here is that the system’s reliability does not come from making the control plane more available - it comes from making the data plane (regional agents) completely independent of the control plane for reads. The control plane can be down for minutes and no service in any region notices. This inverts the usual instinct to build a highly available central store; instead, treat the central store as an eventually-consistent source of truth that the edge caches will eventually catch up to.

Frequently Asked Questions

Q: Why not use a managed config service like AWS AppConfig or Google Runtime Configurator? A: Managed services solve the storage and basic propagation problem but typically have propagation latencies of 10-60 seconds and limited rollback automation. For a system requiring sub-500ms propagation and 5-second rollback with custom canary logic, a purpose-built system provides better control. Managed services are appropriate for teams where operational simplicity outweighs propagation latency requirements.

Q: Why use push instead of pull with a very short TTL like 1 second? A: At 20 regions each polling every second, you have 20 requests/second to the control plane at rest - manageable. But during a rollout with 100 config changes in 5 minutes, every region polls on its own schedule, creating thundering herds and 1-second propagation gaps. Push-on-write is strictly faster and imposes zero load on the control plane during quiet periods.

Q: How do you handle a config key that exists in one region but not another after a partial push failure? A: The PrevRevision check detects the gap and the agent requests a full sync. The full sync transfers the entire snapshot, not just the missed delta, so the agent’s state converges to exactly the control plane’s current state regardless of what partial writes occurred before.

Q: What happens when you need to roll out different config values to different regions intentionally (A/B testing infrastructure)? A: Region-specific overrides require a separate abstraction - an override table keyed by (key, environment, region) - layered on top of the global config store. The regional agent merges the global value with any regional override at read time. This is a separate feature from the base propagation system described here.
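
A read-time merge of that kind could be as small as the sketch below. The two dicts stand in for the global store and the hypothetical override table keyed by (key, environment, region); resolve is an illustrative name.

```python
def resolve(key: str, env: str, region: str,
            global_values: dict[tuple[str, str], str],
            overrides: dict[tuple[str, str, str], str]) -> str:
    """Return the regional override if one exists, otherwise the global value."""
    override = overrides.get((key, env, region))
    if override is not None:
        return override
    return global_values[(key, env)]


global_values = {("rate_limit_rps", "production"): "1000"}
overrides = {("rate_limit_rps", "production", "ap-southeast-1"): "5000"}

apac = resolve("rate_limit_rps", "production", "ap-southeast-1", global_values, overrides)
europe = resolve("rate_limit_rps", "production", "eu-west-1", global_values, overrides)
```

Only the region with an override sees the different value. Every other region keeps reading the global one, so the base propagation path is unchanged.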

Q: Why not store config in Git and use GitOps-style propagation? A: Git gives you version history and rollback for free, but the delivery mechanism (CI/CD pipelines triggered by commits) has multi-minute latency and no native push-to-edge capability. Git also lacks the runtime validation (canary error rate gates) and the sub-second rollback path needed here. GitOps is a good pattern for infrastructure configs that change daily; it is too slow for application configs that may need rollback in seconds.

Q: How do you prevent a misconfigured validation rule from blocking all config writes? A: Validation rules are themselves config, stored in config_schemas. They are loaded at validation time with a 5-minute TTL in the validator’s cache. If validation schema retrieval fails, the validator fails open with a warning log - it applies type-checking only and skips range/regex constraints. This ensures a broken schema does not create a configuration outage. Schema changes follow the same write path but bypass the schema validator (to avoid bootstrapping circularity).

Interview Questions

Q: Walk me through how you’d design the rollback mechanism to guarantee completion within 5 seconds even if 3 of 20 regions are unreachable.

Expected depth: Rollback uses the same push path as forward writes. For unreachable regions, the coordinator fires ForceSync after a 2-second push timeout - the agent pulls the rollback snapshot from the nearest read replica when connectivity restores. The 5-second SLA applies to reachable regions; unreachable regions converge as soon as connectivity resumes. Discuss whether you accept this or require stronger guarantees and what that implies for consistency models.

Q: How would you detect and alert on config divergence across 20 regions without creating excessive noise?

Expected depth: The propagation_events table tracks ACK status per revision per region. A divergence is declared when a region has not ACKed a revision within propagation_timeout * 3. Alerting should fire on the first divergence, then suppress subsequent alerts for the same region until resolved. Discuss the difference between transient divergence (brief push failure, heals in 30s) vs persistent divergence (agent offline) and how alert severity should differ.

Q: Design the validation gate for a db_connection_pool_size config key. What tests would you run before global propagation?

Expected depth: Schema validation (type int, range 5-500) at write time. For canary rollout: push to one region, observe database connection metrics and error rates for 60 seconds, then proceed. Key edge cases: the canary region may not have representative load; the config change may not take effect immediately if services cache the value. Discuss how you’d force config reload in the canary region and what metrics constitute “safe” for promotion.

Q: The propagation coordinator maintains 20 persistent gRPC streams. How do you handle coordinator restarts without losing in-flight changes?

Expected depth: The coordinator is stateless between restarts - its source of truth is the propagation_events table. On startup, it queries for any revisions with non-acked propagation events, rebuilds the pending revision list, and resumes pushing. Regional agents reconnect and request a full sync if the coordinator was down long enough for the reconnect to span a revision gap. Discuss how you prevent duplicate delivery using the idempotent ApplyDelta check on the agent side.

Q: How would you extend this system to support environment promotion - taking a staging config snapshot and promoting it to production?

Expected depth: Promotion is a read from the staging snapshot store followed by a write to the production Config Store, with the standard validation gates applied to the production write. Key challenge: values that are safe in staging may be wrong for production scale (e.g., rate_limit: 100 for staging, rate_limit: 50000 for production). Discuss requiring explicit overrides for environment-specific keys during promotion, and how you’d build a diff view showing what would change in production.
