Build a Multi-Region Config Management System
One wrong config key pushed to 20 regions simultaneously - and you need rollback in under 5 seconds
Think of your application’s configuration as sheet music handed to an orchestra of 20 regional ensembles. When you change the tempo marking, every ensemble needs to play the new tempo - and they need to do it within the same measure, not drift independently. One ensemble playing at the wrong tempo for even a few bars turns the symphony into chaos. Now imagine the conductor needs to be able to instantly revert to the previous tempo if the new one causes the violins to fall apart.
Configuration management at global scale carries exactly this tension. A feature_flag: true that works flawlessly in us-east-1 can trigger a latency spike in ap-southeast-1 if the dependent service has a version mismatch. A rate_limit: 1000 that protects a database in Europe may be too low for Asia-Pacific traffic patterns. The challenge is not just propagating values - it is propagating them fast enough that no region runs a different config version for any meaningful duration, and guaranteeing that a bad change can be retracted before it causes a customer-visible incident.
The naive approach - a centralized config database that all 20 regions poll every minute - fails on three fronts. First, 60-second propagation windows mean regions can diverge on a rolling-deploy timeline. Second, a polling architecture creates a thundering herd on the config DB during a high-churn deployment. Third, rollback becomes “push a new version and wait another minute,” which is too slow when you’re watching error rates climb in real time.
We need to solve for three constraints simultaneously: sub-500ms propagation to all 20 regions, prevention of config divergence between regions, and instant rollback within 5 seconds of triggering it.
Requirements and Constraints
Functional Requirements
- Store and version configuration key-value pairs (strings, booleans, numbers, JSON blobs)
- Push changes to all 20 global regions within 500ms of commit
- Detect and alert on config divergence between regions
- Support instant rollback to any previous version by key or entire config snapshot
- Provide an audit trail of every change with author, timestamp, and diff
- Support validation gates - schema checks and canary rollout - before global propagation
Non-Functional Requirements
- Propagation latency: p99 under 500ms globally
- Rollback latency: under 5 seconds from trigger to full propagation
- Availability: 99.99% (config reads must succeed even during control-plane outages)
- Write throughput: up to 1,000 config changes/hour during peak deployments
- Config store size: up to 10 million key-value pairs per environment
- Read throughput: up to 100,000 config reads/second per region (from local cache)
Constraints and Assumptions
- Config reads are on the hot path; writes are infrequent compared to reads
- Regions are assumed to have stable inter-region connectivity (dedicated point-to-point links, not the public internet)
- A config change that passes validation can still be bad at runtime - rollback is the recovery path
- We are not building feature flag evaluation logic; this is the distribution layer
High-Level Architecture
The system splits into two planes: a control plane that manages versioning, validation, and change propagation; and a data plane that serves config reads with sub-millisecond latency to local services.
The control plane lives in a primary region and consists of a Config API, a versioned Config Store backed by a distributed KV database, a Validation Engine, and a Propagation Coordinator. When an operator commits a config change through the Config API, the Validation Engine runs schema checks and optional canary deployment. If validation passes, the Propagation Coordinator fans out the change to all 20 regional Config Agents via a push channel.
Each region runs a Config Agent that receives pushed updates, writes them to a local Config Cache (an in-memory store with disk persistence), and serves all local service reads. Services never call the control plane directly - they call the local agent, so config reads survive control-plane outages entirely.
The Config Agent also runs a background reconciliation loop that compares its local version digest against the control plane’s expected version every 30 seconds - catching any divergence from missed pushes.
The data plane (local agent) and control plane (central store) must be completely decoupled for reads. Reads should never cross region boundaries; only writes should. This is what gives you both low-latency reads and high availability during control-plane failures.
The Config Store
The Config Store is the single source of truth for all configuration. Its job is to maintain an immutable, append-only version history for every key, allowing point-in-time queries and enabling instant rollback.
Every write creates a new revision - a monotonically increasing integer. The Config Store does not update records in place; it appends a new version entry with the new value, the author, a timestamp, and a SHA-256 content hash. The current value for a key is always the highest revision.
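The append-only model can be illustrated with a minimal in-memory sketch. `AppendOnlyStore` and `Version` are hypothetical names standing in for the real Config Store, and a single global revision counter is assumed for simplicity.

```python
from dataclasses import dataclass, field
import hashlib
from typing import Optional


@dataclass(frozen=True)
class Version:
    key: str
    revision: int
    value: str
    author: str
    content_hash: str


@dataclass
class AppendOnlyStore:
    """Hypothetical in-memory stand-in for the versioned Config Store."""
    versions: list = field(default_factory=list)
    revision: int = 0

    def put(self, key: str, value: str, author: str) -> int:
        # Every write appends a new entry; nothing is updated in place.
        self.revision += 1
        digest = hashlib.sha256(value.encode()).hexdigest()
        self.versions.append(Version(key, self.revision, value, author, digest))
        return self.revision

    def get(self, key: str, at_revision: Optional[int] = None) -> Optional[str]:
        # The current value is the highest revision for the key; a
        # point-in-time read only caps the revision that is considered.
        cap = self.revision if at_revision is None else at_revision
        for v in reversed(self.versions):
            if v.key == key and v.revision <= cap:
                return v.value
        return None


store = AppendOnlyStore()
r1 = store.put("rate_limit_rps", "1000", "alice")
r2 = store.put("rate_limit_rps", "5000", "bob")
```

Reading with `at_revision=r1` returns the older value, which is the property rollback relies on: old values are never lost, only superseded.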
A snapshot is a named, immutable point-in-time capture of all keys at a given revision. The rollback system works at snapshot granularity, not individual key granularity, ensuring the entire config set returns to a known-consistent state.
-- Config Store schema
CREATE TABLE config_versions (
id BIGSERIAL PRIMARY KEY,
key TEXT NOT NULL,
environment TEXT NOT NULL,
revision BIGINT NOT NULL,
value TEXT NOT NULL,
value_type TEXT NOT NULL CHECK (value_type IN ('string','bool','int','float','json')),
author TEXT NOT NULL,
commit_msg TEXT,
content_hash CHAR(64) NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
UNIQUE (key, environment, revision)
);
CREATE TABLE config_snapshots (
id BIGSERIAL PRIMARY KEY,
snapshot_name TEXT NOT NULL,
environment TEXT NOT NULL,
revision BIGINT NOT NULL,
key_count INT NOT NULL,
digest CHAR(64) NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
UNIQUE (snapshot_name, environment)
);
CREATE TABLE snapshot_keys (
snapshot_id BIGINT REFERENCES config_snapshots(id),
key TEXT NOT NULL,
revision BIGINT NOT NULL,
value_hash CHAR(64) NOT NULL,
PRIMARY KEY (snapshot_id, key)
);
CREATE INDEX idx_config_versions_key_env ON config_versions(key, environment, revision DESC);
CREATE INDEX idx_config_versions_created ON config_versions(created_at DESC);
The content_hash field is critical: it lets the Propagation Coordinator verify that what a regional Config Agent received matches what was sent, without transmitting the full value again.
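As a rough illustration of that check, the sketch below recomputes the hash on the receiving side and compares it with the one carried in the delta. The helper names `content_hash` and `verify_received` are assumptions for this example, as is hashing the UTF-8 encoding of the value.

```python
import hashlib


def content_hash(value: str) -> str:
    """SHA-256 hex digest of a config value (64 hex chars, matching CHAR(64))."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def verify_received(value: str, expected_hash: str) -> bool:
    # The agent recomputes the hash locally and compares it with the hash
    # that travelled with the delta, so the value is never sent twice.
    return content_hash(value) == expected_hash


sent = "1000"
sent_hash = content_hash(sent)
```

A truncated or corrupted value such as `"100"` fails the comparison, so the agent can reject the delta instead of caching it.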
etcd (used by Kubernetes) uses a similar revision-based model where every write increments a global revision counter and history stays queryable until it is compacted. The Kubernetes API server relies on etcd’s watch API and exposes its own watch streams to controllers - the same push model we’re building here for regional propagation.
The Propagation Coordinator
The Propagation Coordinator fans out config changes to all regions as fast as the network allows. Its job is to ensure that within 500ms of a write committing, every region’s Config Agent has the new version in its local cache and has acknowledged receipt.
The coordinator maintains a persistent gRPC streaming connection to every regional Config Agent. When a change commits, the coordinator pushes a ConfigDelta message containing the changed keys, their new values, the new revision, and the content hash. It does not wait for an ACK before pushing to the next region - it fans out to all 20 regions simultaneously and then collects ACKs.
// Config propagation proto
syntax = "proto3";
service ConfigPropagation {
rpc Subscribe(SubscribeRequest) returns (stream ConfigDelta);
rpc Acknowledge(AckRequest) returns (AckResponse);
rpc ForceSync(SyncRequest) returns (SyncResponse);
}
message ConfigDelta {
string environment = 1;
int64 revision = 2;
int64 prev_revision = 3;
repeated KeyChange changes = 4;
string digest = 5;
int64 timestamp_ms = 6;
}
message KeyChange {
string key = 1;
string value = 2;
string value_type = 3;
string content_hash = 4;
bool deleted = 5;
}
message AckRequest {
string region = 1;
int64 revision = 2;
string digest = 3;
}
The coordinator tracks ACK status per region per revision. If a region fails to ACK within 2 seconds, it retries the push. If a region fails three consecutive pushes, it marks that region as diverged and fires an alert - the reconciliation loop on the agent side will detect the missed revision and trigger a full sync.
Do not use the acknowledgement timeout as a health check for the region itself. A slow ACK might mean the agent is processing a large config delta, not that the region is down. Separate config divergence alerts from infrastructure health monitoring - conflating the two causes alert fatigue and masks real failures.
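The per-region retry rule can be sketched as a small state object. `RegionPushState` is a hypothetical name, and the statuses mirror the ones used in this design; the real coordinator would persist this state rather than keep it in memory.

```python
from dataclasses import dataclass

MAX_CONSECUTIVE_FAILURES = 3  # three failed pushes in a row => diverged


@dataclass
class RegionPushState:
    region: str
    consecutive_failures: int = 0
    status: str = "pending"  # pending | acked | diverged

    def record_ack(self) -> None:
        # Any successful ACK resets the failure streak.
        self.consecutive_failures = 0
        self.status = "acked"

    def record_timeout(self) -> bool:
        """Record a missed ACK. Returns True if the push should be retried."""
        self.consecutive_failures += 1
        if self.consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
            # This marks config divergence only; it says nothing about
            # whether the region's infrastructure is healthy.
            self.status = "diverged"
            return False
        return True


state = RegionPushState("ap-southeast-1")
retries = [state.record_timeout() for _ in range(3)]
```

Keeping the divergence status separate from any health signal is what lets the two alert streams stay independent.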
The Regional Config Agent
The Config Agent is the local representative of the config system in each region. It serves config reads to local services at sub-millisecond latency and is deliberately designed to operate entirely independently during control-plane outages.
The agent maintains two layers of storage:
- In-memory hot cache: a Go sync.Map (or Redis) holding all current config values, keyed by {env}:{key}. Serves reads in under 100 microseconds.
- Disk-backed cold store: a local SQLite or RocksDB instance holding the full version history for the last 48 hours. Used for reconciliation and to rebuild the hot cache on restart.
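// Config Agent read path
// ConfigDelta is the generated proto type; persistChange is not shown.
import (
	"errors"
	"sync"
	"sync/atomic"

	badger "github.com/dgraph-io/badger/v4" // assumed import path
)

// ErrRevisionGap signals that a delta does not chain onto the local revision.
var ErrRevisionGap = errors.New("revision gap: full sync required")

// ConfigValue is the cached form of one key at a given revision.
type ConfigValue struct {
	Value    string
	Type     string
	Revision int64
}

type ConfigAgent struct {
	hotCache  *sync.Map
	coldStore *badger.DB
	region    string
	env       string
	revision  int64        // read atomically; written under mu
	mu        sync.RWMutex // serializes ApplyDelta calls
}

func (a *ConfigAgent) Get(key string) (ConfigValue, bool) {
	// Hot path: sub-100 microsecond read from in-memory map
	if v, ok := a.hotCache.Load(a.cacheKey(key)); ok {
		return v.(ConfigValue), true
	}
	return ConfigValue{}, false
}

func (a *ConfigAgent) ApplyDelta(delta *ConfigDelta) error {
	// Hold the lock so the revision checks and the final store act as one
	// step; otherwise two concurrent deltas could both pass the checks.
	a.mu.Lock()
	defer a.mu.Unlock()
	current := atomic.LoadInt64(&a.revision)
	if delta.Revision <= current {
		return nil // already applied, idempotent
	}
	if delta.PrevRevision != current {
		return ErrRevisionGap // trigger full sync
	}
	for _, change := range delta.Changes {
		if change.Deleted {
			a.hotCache.Delete(a.cacheKey(change.Key))
		} else {
			a.hotCache.Store(a.cacheKey(change.Key), ConfigValue{
				Value:    change.Value,
				Type:     change.ValueType,
				Revision: delta.Revision,
			})
		}
		// Persist to cold store asynchronously
		go a.persistChange(change, delta.Revision)
	}
	atomic.StoreInt64(&a.revision, delta.Revision)
	return nil
}

func (a *ConfigAgent) cacheKey(key string) string {
	return a.env + ":" + key
}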
The PrevRevision check is the gap detector. If an agent missed a push (network blip, restart), delta.PrevRevision will not match the agent’s current revision, and the agent immediately requests a full sync rather than applying a partial delta on top of a stale base.
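Together, the two checks make delta application idempotent and gap-safe: the revision comparison discards duplicates, and the PrevRevision comparison rejects deltas that do not chain onto the local state. Without the gap check, a missed message causes silent config divergence - the agent applies future deltas on top of a stale base and ends up with a mixed config state that is worse than being one version behind.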
Config Versioning and Rollback
Rollback is the system’s most operationally critical path. When a bad config causes elevated error rates, engineers need to trigger rollback with a single command and see it complete globally within 5 seconds.
The rollback flow works at snapshot granularity:
- Operator runs config rollback --env production --to snapshot-2026-06-11-14:30:00
- The Config API looks up the target snapshot’s revision and retrieves all snapshot_keys for that snapshot
- For each key in the snapshot, it creates a new config_versions entry with the old value and a new revision number - rollback is itself a write, preserving the audit trail
- The Propagation Coordinator pushes the rollback delta as if it were a normal change - the regional agents see a new revision with the rolled-back values and apply it identically
This means rollback uses the same propagation path as forward writes, inheriting the same 500ms SLA. There is no special rollback code path - only normal propagation.
# Rollback execution logic
# `db` and `next_revision` are the application's database handle and
# revision allocator (not shown here).
def execute_rollback(env: str, target_snapshot_name: str, author: str) -> int:
    """Returns the new revision created by the rollback."""
    snapshot = db.query(
        "SELECT id, revision FROM config_snapshots WHERE environment=%s AND snapshot_name=%s",
        (env, target_snapshot_name)
    ).fetchone()
    if not snapshot:
        raise ValueError(f"Snapshot {target_snapshot_name} not found")
    snapshot_keys = db.query(
        """SELECT ck.key, cv.value, cv.value_type, cv.content_hash
        FROM snapshot_keys ck
        JOIN config_versions cv ON cv.key = ck.key AND cv.revision = ck.revision AND cv.environment = %s
        WHERE ck.snapshot_id = %s""",
        (env, snapshot["id"])
    ).fetchall()
    # All restored keys share one new revision, so agents apply them as one delta.
    new_revision = next_revision(env)
    for sk in snapshot_keys:
        db.execute(
            """INSERT INTO config_versions
            (key, environment, revision, value, value_type, author, commit_msg, content_hash)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s)""",
            (sk["key"], env, new_revision, sk["value"], sk["value_type"],
             author, f"ROLLBACK to {target_snapshot_name}", sk["content_hash"])
        )
    # Note: keys created after the snapshot are not touched here; removing
    # them would need delete markers, which this schema does not model.
    db.commit()
    return new_revision
LaunchDarkly’s feature flag system uses this same “rollback is a write” pattern. Instead of reverting state in place, it creates a new flag version that restores old values. This means the audit log always shows a complete forward-only history, and “undo” is represented as an explicit action rather than a state mutation.
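One detail the rollback flow leaves open is what happens to keys that were created after the target snapshot. A possible approach, sketched below under the assumption that the store can record deletions and reusing the `deleted` flag from the `KeyChange` message, is to compute the rollback as a full diff between current state and snapshot. `compute_rollback_delta` is a hypothetical helper, not part of the design above.

```python
def compute_rollback_delta(current: dict[str, str], snapshot: dict[str, str]) -> list[dict]:
    """Changes that turn `current` into `snapshot`, as KeyChange-like dicts."""
    changes = []
    for key in sorted(snapshot):
        if current.get(key) != snapshot[key]:
            # Key was modified or removed since the snapshot: restore it.
            changes.append({"key": key, "value": snapshot[key], "deleted": False})
    for key in sorted(current.keys() - snapshot.keys()):
        # Key was created after the snapshot: emit a tombstone.
        changes.append({"key": key, "value": "", "deleted": True})
    return changes


current = {"rate_limit_rps": "5000", "feature_new_checkout": "true", "new_key": "x"}
snapshot = {"rate_limit_rps": "1000", "feature_new_checkout": "true"}
delta = compute_rollback_delta(current, snapshot)
```

Unchanged keys are skipped, so the rollback delta stays small even when the snapshot itself is large.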
Validation Gates
Pushing a config change to 20 regions at once without any validation is operationally reckless. The Validation Engine runs two gates before global propagation:
Gate 1 - Schema validation: Every config key has a registered schema (type, allowed values, min/max for numerics, regex for strings). The validation engine rejects any write that violates the schema before it even reaches the Config Store.
# Config key schema definition
keys:
  rate_limit_rps:
    type: int
    min: 10
    max: 100000
    environments: [staging, production]
    requires_review_above: 10000
  feature_new_checkout:
    type: bool
    environments: [staging, production]
  db_connection_pool_size:
    type: int
    min: 5
    max: 500
    rollout_strategy: canary
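A minimal sketch of how Gate 1 might evaluate a write against such a schema follows. The `SCHEMAS` dict mirrors a subset of the definition above, and `validate` is a hypothetical function; a real validator would also handle regexes, allowed values, and environment scoping.

```python
SCHEMAS = {
    "rate_limit_rps": {"type": int, "min": 10, "max": 100000},
    "feature_new_checkout": {"type": bool},
    "db_connection_pool_size": {"type": int, "min": 5, "max": 500},
}


def validate(key: str, value) -> list[str]:
    """Return a list of violations; an empty list means the write may proceed."""
    schema = SCHEMAS.get(key)
    if schema is None:
        return [f"{key}: no registered schema"]
    expected = schema["type"]
    # Compare exact types: bool is a subclass of int in Python, so an
    # isinstance check would let True pass as an int.
    if type(value) is not expected:
        return [f"{key}: expected {expected.__name__}, got {type(value).__name__}"]
    errors = []
    if "min" in schema and value < schema["min"]:
        errors.append(f"{key}: {value} is below min {schema['min']}")
    if "max" in schema and value > schema["max"]:
        errors.append(f"{key}: {value} is above max {schema['max']}")
    return errors
```

Returning every violation at once, rather than raising on the first, gives the operator a complete error message for the rejected write.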
Gate 2 - Canary rollout: For keys tagged with rollout_strategy: canary, the Propagation Coordinator pushes the change to a single canary region first and holds global propagation for a configurable wait window (default: 60 seconds). During the window, it monitors the control plane’s error rate feed for that region. If error rates stay within baseline, it proceeds to global propagation. If error rates spike, it triggers automatic rollback.
Canary rollout works well for gradual risk reduction but adds latency to the propagation path. Do not apply canary gates to all keys by default - reserve them for keys that control traffic routing, resource limits, or third-party integrations. Keys controlling cosmetic features (copy, colors) should propagate immediately without canary gates.
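The promote-or-rollback decision at the end of a canary window could look roughly like this. The 1.5x ratio, the minimum sample count, and the absolute floor are illustrative assumptions, not values from this design, and `canary_decision` is a hypothetical name.

```python
def canary_decision(baseline_error_rate: float,
                    canary_samples: list[float],
                    max_ratio: float = 1.5,
                    min_samples: int = 3) -> str:
    """Return 'promote', 'rollback', or 'wait' for a canary window."""
    if len(canary_samples) < min_samples:
        return "wait"  # not enough data to judge yet
    observed = sum(canary_samples) / len(canary_samples)
    # A small absolute floor avoids rolling back on noise when the
    # baseline error rate is close to zero.
    threshold = max(baseline_error_rate * max_ratio, 0.001)
    return "rollback" if observed > threshold else "promote"
```

For example, with a 1% baseline, samples averaging 5% trigger a rollback, while samples hovering around 1% allow promotion to global propagation.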
Data Model
The data model centers on three concepts: versions (individual key writes), snapshots (consistent captures of all keys), and propagation events (the log of what was pushed to which region when).
-- Propagation event tracking
CREATE TABLE propagation_events (
id BIGSERIAL PRIMARY KEY,
revision BIGINT NOT NULL,
environment TEXT NOT NULL,
region TEXT NOT NULL,
status TEXT NOT NULL CHECK (status IN ('pending','acked','failed','diverged')),
pushed_at TIMESTAMPTZ,
acked_at TIMESTAMPTZ,
retry_count INT NOT NULL DEFAULT 0,
error_msg TEXT
);
CREATE INDEX idx_prop_events_revision ON propagation_events(revision, environment);
CREATE INDEX idx_prop_events_region_status ON propagation_events(region, status);
-- Config schema registry
CREATE TABLE config_schemas (
key TEXT NOT NULL,
environment TEXT NOT NULL,
schema_json JSONB NOT NULL,
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (key, environment)
);
The propagation_events table is the divergence detection source of truth. A background job queries it every 30 seconds for any revisions where not all 20 regions have acked status within 2 minutes - these are the divergence candidates that trigger reconciliation.
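The background job's selection logic can be sketched over in-memory rows shaped like `propagation_events`. The function name and the sample data are hypothetical; in practice this would be a SQL query against the table.

```python
from datetime import datetime, timedelta, timezone

ACK_DEADLINE = timedelta(minutes=2)  # the 2-minute window described above


def divergence_candidates(events: list[dict], now: datetime) -> list[tuple[int, str]]:
    """Return (revision, region) pairs that were pushed but not acked in time."""
    stale = []
    for e in events:
        if e["status"] == "acked" or e["pushed_at"] is None:
            continue
        if now - e["pushed_at"] > ACK_DEADLINE:
            stale.append((e["revision"], e["region"]))
    return sorted(stale)


now = datetime(2026, 6, 11, 14, 30, tzinfo=timezone.utc)
events = [
    {"revision": 41, "region": "us-east-1", "status": "acked",
     "pushed_at": now - timedelta(minutes=5)},
    {"revision": 41, "region": "ap-southeast-1", "status": "pending",
     "pushed_at": now - timedelta(minutes=5)},
    {"revision": 42, "region": "eu-west-1", "status": "pending",
     "pushed_at": now - timedelta(seconds=10)},
]
candidates = divergence_candidates(events, now)
```

Only the push that has been outstanding for longer than the deadline is reported; the 10-second-old push is still within its window.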
Key Algorithms and Protocols
Content-addressed digest for divergence detection: Each snapshot has a digest computed as the SHA-256 of all {key}={value_hash} pairs sorted lexicographically. Regional agents compute the same digest over their local state and compare it against the control plane’s expected digest during reconciliation. A mismatch triggers a full sync, regardless of how many individual key hashes match.
import hashlib

def compute_config_digest(key_values: dict[str, str]) -> str:
    """Canonical digest over all config key-value hashes."""
    # Sort keys lexicographically for deterministic ordering
    pairs = sorted(
        f"{k}={hashlib.sha256(v.encode()).hexdigest()}"
        for k, v in key_values.items()
    )
    combined = "\n".join(pairs)
    return hashlib.sha256(combined.encode()).hexdigest()
Two-phase push with timeout-based fallback: The coordinator pushes a delta and waits for ACKs. If any region fails to ACK within 2 seconds, the coordinator sends a ForceSync RPC to that region, which causes the agent to pull the full current snapshot from the control plane’s read replica in that geography. This converts a push failure into a pull - the agent does not rely solely on receiving pushes.
// Coordinator push with fallback to full sync (requires the "time" package)
func (c *Coordinator) PropagateRevision(revision int64, delta *ConfigDelta) {
	// Buffered so late goroutines never block after this function returns
	ackCh := make(chan RegionAck, len(c.regions))
	for _, region := range c.regions {
		go func(r Region) {
			err := r.PushDelta(delta)
			if err != nil {
				ackCh <- RegionAck{Region: r.Name, Success: false}
				return
			}
			ack, err := r.WaitAck(revision, 2*time.Second)
			ackCh <- RegionAck{Region: r.Name, Success: err == nil && ack.Digest == delta.Digest}
		}(region)
	}
	timer := time.After(3 * time.Second)
	pending := make(map[string]bool)
	for _, r := range c.regions {
		pending[r.Name] = true
	}
	for len(pending) > 0 {
		select {
		case ack := <-ackCh:
			if !ack.Success {
				// Push failed, ACK timed out, or digest mismatched:
				// fall back to pull now instead of waiting for the timer
				go c.TriggerForceSync(ack.Region, revision)
			}
			delete(pending, ack.Region)
		case <-timer:
			// Backstop for regions that never reported at all
			for regionName := range pending {
				go c.TriggerForceSync(regionName, revision)
			}
			return
		}
	}
}
The fallback from push to pull is the reliability safety net. A purely push-based system has a single failure mode: missed pushes cause permanent divergence. Adding pull-based reconciliation means the system self-heals - the worst case is not divergence but a slightly longer convergence time.
Scaling and Performance
The control plane’s write path is intentionally not on the hot path for reads. The Config Store handles at most 1,000 writes/hour during deployments - this is trivially handled by a single primary PostgreSQL instance or a distributed KV like etcd. The bottleneck is not writes; it is the propagation fan-out.
With 20 regions and a gRPC streaming connection per region, the coordinator maintains 20 persistent connections. This is well within a single Go process’s capability. The bottleneck emerges when a config change contains thousands of changed keys (a bulk import or an environment reset). For large deltas, the coordinator compresses the payload with Zstd and switches from per-key deltas to a full snapshot transfer.
Back-of-envelope capacity:
Config changes: 1,000/hour = 0.28 writes/second
Keys per change: avg 5, max 10,000 (bulk)
Value size: avg 256 bytes, max 64KB
Regions: 20
Normal propagation:
Delta size: 5 keys * 256 bytes = 1.28KB per change
Fan-out bandwidth: 1.28KB * 20 regions = 25.6KB per commit
At 1,000 changes/hour: 25.6KB * 0.28/s = 7.2KB/s outbound
Bulk propagation (environment reset):
Delta size: 10,000 keys * 256 bytes = 2.56MB
Zstd compressed: ~512KB
Fan-out to 20 regions: 512KB * 20 = 10.24MB
At 10Gbps coordinator egress: ~8 milliseconds to send the full fan-out (~0.4 milliseconds per region), excluding network round-trip time
Read path per region:
100,000 reads/second from in-memory hot cache
Cache hit rate: ~100% (all keys fit in ~256MB RAM for 1M keys at 256 bytes/key; ~2.56GB at the 10M-key ceiling, before map overhead)
Cache miss (cold start): served from local disk store in <1ms
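The arithmetic above can be reproduced with a short script. The 512KB compressed size is the same assumed ~5x ratio used in the estimate, and decimal units (1KB = 1,000 bytes) are assumed throughout.

```python
changes_per_hour = 1_000
regions = 20
avg_keys, avg_value_bytes = 5, 256

writes_per_sec = changes_per_hour / 3600
delta_bytes = avg_keys * avg_value_bytes            # 1,280 bytes per change
fanout_bytes = delta_bytes * regions                # 25,600 bytes per commit
outbound_bytes_per_sec = fanout_bytes * writes_per_sec

bulk_bytes = 10_000 * avg_value_bytes               # 2.56 MB uncompressed
compressed_bytes = 512_000                          # assumed ~5x compression
bulk_fanout_bytes = compressed_bytes * regions      # 10.24 MB across all regions
link_bits_per_sec = 10e9
fanout_ms = bulk_fanout_bytes * 8 / link_bits_per_sec * 1000
per_region_ms = compressed_bytes * 8 / link_bits_per_sec * 1000

hot_cache_bytes_1m = 1_000_000 * avg_value_bytes    # 256 MB for 1M keys
hot_cache_bytes_10m = 10_000_000 * avg_value_bytes  # 2.56 GB for 10M keys
```

Note that the roughly 8 ms figure is the time to push all 20 copies through one 10Gbps link; a single region's copy takes about 0.4 ms, so wire time is negligible next to inter-region round trips.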
The control plane’s read replicas (one per geographic cluster) serve the reconciliation pull requests from regional agents. This prevents reconciliation traffic from hitting the primary write node.
etcd’s watch streaming API handles the same fan-out problem for Kubernetes. A single etcd cluster can serve thousands of watch streams. The key insight is that watches are push-based and resumable: a watcher that reconnects passes the last revision it saw and receives only the events after it, provided they have not been compacted - the same role PrevRevision tracking plays in our coordinator.
Failure Modes and Recovery
| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| Control plane primary failure | Health check; no writes accepted | New config writes blocked; reads unaffected | Promote read replica to primary; propagation resumes |
| Regional Config Agent restart | Agent re-subscribes; reconciliation runs | Brief window with stale cache from previous state | Agent rebuilds hot cache from local disk store on startup |
| Push stream disconnect | gRPC keepalive timeout (10s) | Region misses deltas during disconnect window | Agent requests full sync on reconnect; ForceSync from coordinator |
| Config divergence (missed delta) | 30s reconciliation digest mismatch | Region serves stale config values | Agent pulls full snapshot from nearest read replica |
| Bad config pushed (elevated errors) | Canary error rate monitor or manual alert | Elevated error rates in affected regions | Operator triggers rollback; 500ms propagation of rollback delta |
| Control plane network partition | Split-brain on primary election | Possible stale reads from isolated replica | Read replicas serve stale but consistent data; writes blocked until partition heals |
The most common operational mistake is triggering rollback manually while an automated canary already detected the error and is mid-rollback. This causes two competing write streams to the Config Store, potentially interleaving revisions. Always check the canary status before manually triggering rollback - or disable the canary first.
Comparison of Approaches
| Approach | Propagation Latency | Regional Availability | Rollback Speed | Operational Complexity |
|---|---|---|---|---|
| Centralized polling (1-min interval) | ~30s average, 60s worst case | Degrades if central DB is slow | Push new version, wait up to 60s | Low - single DB, simple polling |
| Push-only (no reconciliation) | ~100ms | Reads stay up, but a missed push leaves a region stale | Push rollback delta | Medium - streaming connections needed |
| Pull-based with short TTL (5s) | ~2.5s average, 5s worst case | High - pull survives push failures | Push rollback, wait up to 5s | Low-Medium - no persistent connections |
| Push + pull reconciliation (this design) | ~200ms p99 | Full - reads survive all central failures | Push rollback delta, ~200ms | High - streaming + reconciliation logic |
| Gossip-based propagation | 500ms-2s | Very high - no central coordinator | Requires epoch-based versioning | Very High - gossip protocol complexity |
Push plus pull reconciliation is the right choice for this scale. Pure polling gives tolerable consistency at low complexity, but the 60-second propagation window is unacceptable when a bad config needs rollback in under 5 seconds. Gossip-based propagation adds substantial complexity without meaningful availability improvements since we already have regional agents that survive central failures.
Key Takeaways
- Config versioning is append-only: every write creates a new revision; nothing is ever updated in place, making audit trails and rollback trivially correct.
- Rollback is a write: expressing rollback as a new revision that restores old values means it uses the same propagation path as forward changes, inheriting the same SLA.
- Local agent decouples read availability from control-plane health: services in a region should never block on cross-region calls to read config, even during control-plane outages.
- PrevRevision gap detection prevents silent divergence: an agent that detects it missed a push immediately requests a full sync rather than applying future deltas on a stale base.
- Push plus pull reconciliation is the right reliability model: push for latency, pull for correctness guarantees when pushes fail.
- Content-addressed digests catch divergence that delta tracking misses: even if revision numbers match, a cryptographic digest of all values detects any bit-level divergence.
- Canary gates on high-risk keys bound blast radius: not every config key needs a canary window, but keys controlling traffic routing or resource limits warrant one.
- Schema validation at write time is cheaper than debugging at runtime: a rejected write with a clear error message is far less costly than a config change that silently accepts an out-of-range value and causes subtle failures at 3 AM.
The counter-intuitive lesson here is that the system’s reliability does not come from making the control plane more available - it comes from making the data plane (regional agents) completely independent of the control plane for reads. The control plane can be down for minutes and no service in any region notices. This inverts the usual instinct to build a highly available central store; instead, treat the central store as the source of truth for writes and the regional caches as eventually consistent replicas that keep serving reads until they catch up.
Frequently Asked Questions
Q: Why not use a managed config service like AWS AppConfig or Google Runtime Configurator? A: Managed services solve the storage and basic propagation problem but typically have propagation latencies of 10-60 seconds and limited rollback automation. For a system requiring sub-500ms propagation and 5-second rollback with custom canary logic, a purpose-built system provides better control. Managed services are appropriate for teams where operational simplicity outweighs propagation latency requirements.
Q: Why use push instead of pull with a very short TTL like 1 second? A: At 20 regions each polling every second, you have 20 requests/second to the control plane at rest - manageable. But during a rollout with 100 config changes in 5 minutes, every region polls on its own schedule, creating thundering herds and 1-second propagation gaps. Push-on-write is strictly faster and imposes zero load on the control plane during quiet periods.
Q: How do you handle a config key that exists in one region but not another after a partial push failure?
A: The PrevRevision check detects the gap and the agent requests a full sync. The full sync transfers the entire snapshot, not just the missed delta, so the agent’s state converges to exactly the control plane’s current state regardless of what partial writes occurred before.
Q: What happens when you need to roll out different config values to different regions intentionally (A/B testing infrastructure)?
A: Region-specific overrides require a separate abstraction - an override table keyed by (key, environment, region) - layered on top of the global config store. The regional agent merges the global value with any regional override at read time. This is a separate feature from the base propagation system described here.
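A minimal sketch of that read-time merge, assuming an in-memory override table; `RegionalConfig` is a hypothetical name and not part of the agent shown earlier.

```python
from typing import Optional


class RegionalConfig:
    """Hypothetical read-time merge of global values and per-region overrides."""

    def __init__(self, region: str, env: str):
        self.region = region
        self.env = env
        self.global_values: dict[str, str] = {}
        # Keyed by (key, environment, region), as described in the answer.
        self.overrides: dict[tuple[str, str, str], str] = {}

    def get(self, key: str) -> Optional[str]:
        # The most specific value wins: regional override, then global value.
        override = self.overrides.get((key, self.env, self.region))
        if override is not None:
            return override
        return self.global_values.get(key)


cfg = RegionalConfig(region="ap-southeast-1", env="production")
cfg.global_values["rate_limit_rps"] = "1000"
cfg.overrides[("rate_limit_rps", "production", "ap-southeast-1")] = "3000"
cfg.overrides[("rate_limit_rps", "production", "eu-west-1")] = "500"
```

Here `cfg.get("rate_limit_rps")` resolves to the ap-southeast-1 override, while an agent in a region without an override falls through to the global value.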
Q: Why not store config in Git and use GitOps-style propagation? A: Git gives you version history and rollback for free, but the delivery mechanism (CI/CD pipelines triggered by commits) has multi-minute latency and no native push-to-edge capability. Git also lacks the runtime validation (canary error rate gates) and the sub-second rollback path needed here. GitOps is a good pattern for infrastructure configs that change daily; it is too slow for application configs that may need rollback in seconds.
Q: How do you prevent a misconfigured validation rule from blocking all config writes?
A: Validation rules are themselves config, stored in config_schemas. They are loaded at validation time with a 5-minute TTL in the validator’s cache. If validation schema retrieval fails, the validator fails open with a warning log - it applies type-checking only and skips range/regex constraints. This ensures a broken schema does not create a configuration outage. Schema changes follow the same write path but bypass the schema validator (to avoid bootstrapping circularity).
Interview Questions
Q: Walk me through how you’d design the rollback mechanism to guarantee completion within 5 seconds even if 3 of 20 regions are unreachable.
Expected depth: Rollback uses the same push path as forward writes. For unreachable regions, the coordinator fires ForceSync after a 2-second push timeout - the agent pulls the rollback snapshot from the nearest read replica when connectivity restores. The 5-second SLA applies to reachable regions; unreachable regions converge as soon as connectivity resumes. Discuss whether you accept this or require stronger guarantees and what that implies for consistency models.
Q: How would you detect and alert on config divergence across 20 regions without creating excessive noise?
Expected depth: The propagation_events table tracks ACK status per revision per region. A divergence is declared when a region has not ACKed a revision within propagation_timeout * 3. Alerting should fire on the first divergence, then suppress subsequent alerts for the same region until resolved. Discuss the difference between transient divergence (brief push failure, heals in 30s) vs persistent divergence (agent offline) and how alert severity should differ.
Q: Design the validation gate for a db_connection_pool_size config key. What tests would you run before global propagation?
Expected depth: Schema validation (type int, range 5-500) at write time. For canary rollout: push to one region, observe database connection metrics and error rates for 60 seconds, then proceed. Key edge cases: the canary region may not have representative load; the config change may not take effect immediately if services cache the value. Discuss how you’d force config reload in the canary region and what metrics constitute “safe” for promotion.
Q: The propagation coordinator maintains 20 persistent gRPC streams. How do you handle coordinator restarts without losing in-flight changes?
Expected depth: The coordinator is stateless between restarts - its source of truth is the propagation_events table. On startup, it queries for any revisions with non-acked propagation events, rebuilds the pending revision list, and resumes pushing. Regional agents reconnect and request a full sync if the coordinator was down long enough for the reconnect to span a revision gap. Discuss how you prevent duplicate delivery using the idempotent ApplyDelta check on the agent side.
Q: How would you extend this system to support environment promotion - taking a staging config snapshot and promoting it to production?
Expected depth: Promotion is a read from the staging snapshot store followed by a write to the production Config Store, with the standard validation gates applied to the production write. Key challenge: values that are safe in staging may be wrong for production scale (e.g., rate_limit: 100 for staging, rate_limit: 50000 for production). Discuss requiring explicit overrides for environment-specific keys during promotion, and how you’d build a diff view showing what would change in production.
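The diff view mentioned in that answer might be built along these lines. `promotion_diff` and the `env_specific` set are hypothetical; the idea is that keys flagged as environment-specific are never copied and instead require an explicit production value.

```python
def promotion_diff(staging: dict[str, str],
                   production: dict[str, str],
                   env_specific: set[str]) -> dict[str, list]:
    """Classify what promoting `staging` would change in `production`."""
    diff = {"added": [], "changed": [], "needs_override": [], "unchanged": []}
    for key in sorted(staging):
        if key in env_specific:
            # Never copy environment-specific values across environments.
            diff["needs_override"].append(key)
        elif key not in production:
            diff["added"].append(key)
        elif production[key] != staging[key]:
            diff["changed"].append((key, production[key], staging[key]))
        else:
            diff["unchanged"].append(key)
    return diff


staging = {"rate_limit": "100", "feature_new_checkout": "true", "timeout_ms": "250"}
production = {"rate_limit": "50000", "feature_new_checkout": "false"}
diff = promotion_diff(staging, production, env_specific={"rate_limit"})
```

In this example `rate_limit` is held back for an explicit override, `timeout_ms` shows up as an addition, and `feature_new_checkout` is reported as a change from `false` to `true`.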