Build a Chaos Engineering Platform

reliability observability distributed-systems

System Design Deep Dive

Chaos Engineering Platform

Safely injecting latency, packet loss, and resource exhaustion into production services - with blast radius control and automatic SLO-driven rollback

14 min read · Advanced · Infrastructure

Your system handles 80,000 requests per second across 50 microservices. Your oncall runbooks say “if Payment Service goes down, Order Service degrades gracefully within 3 seconds.” But you have never actually verified that. A vendor outage at 2 AM becomes the first real test, and you discover that Order Service does not degrade gracefully - it hammers Payment Service with retries until its own connection pool exhausts, then it locks up entirely, cascading to six downstream services that were never in the blast radius of the original failure. The incident postmortem recommends chaos engineering. Six months later you have zero infrastructure for running it safely.

The problem with chaos engineering is not convincing engineers it is a good idea. It is the operational safety problem: how do you break things on purpose without accidentally triggering a real incident? The moment you introduce a tool that deliberately injects failures, every stakeholder in the room has the same objection - “what if it goes wrong?” The answer has to be architectural. Blast radius control, automatic halts, and ironclad rollback cannot be afterthoughts bolted on after the first accident. Think of it like controlled burns in forestry: the whole value proposition of a controlled burn depends on the firebreaks being in place before you light the match, not after you see the fire spreading.

The engineering challenge here is that we need to build a system that is simultaneously powerful enough to simulate realistic failure conditions and conservative enough to earn the trust of every team whose services are targets. That trust is built through three concrete mechanisms: a dependency graph that knows which services are safe to target and which are excluded, a steady-state hypothesis framework that verifies conditions are normal before and during every experiment, and an automatic rollback trigger that halts injection and reverts configuration within 30 seconds of any SLO degradation.

Requirements and Constraints

Functional Requirements

Run controlled experiments that inject one or more fault types into target services: latency addition (P50/P99 configurable), packet loss percentage, CPU spike to configurable percentage, memory pressure, disk I/O throttling, and pod/instance kill
Enforce blast radius limits: maximum N% of instances per service, maximum M services per experiment, explicit exclusion list for critical services
Verify steady-state hypothesis before experiment start and continuously during injection
Automatically halt and revert faults when any monitored SLO breaches its threshold
Track a service dependency graph and use it to classify targets, control groups, and excluded services
Store full experiment history with per-second metric snapshots for post-experiment analysis
Support scheduled experiments (cron-style) and on-demand execution
Emit events to PagerDuty, Slack, and audit logs for all experiment lifecycle transitions

Non-Functional Requirements

Auto-halt latency: SLO breach detected and faults reverted within 30 seconds end-to-end
Blast radius: never affect more than 5% of total service instances globally in any single experiment
Concurrent experiment limit: max 20 experiments running simultaneously across all regions, enforced globally
Availability of the chaos platform itself: 99.9% for the control plane, 99.99% for the rollback path (it must halt even if the rest of the platform is down)
Audit trail: every experiment start, halt, and completion stored with the triggering SLO metric value
Agent overhead on target hosts: under 0.5% CPU, under 50MB resident memory

Constraints and Assumptions

We inject at the OS/kernel layer (tc netem for network faults, cgroups for resource faults) for accuracy, not at the application layer
Kubernetes is the primary deployment target; we also support VM-based services via SSH-based agents
Metrics are available via Prometheus/Grafana; SLO definitions are pre-existing PromQL expressions
Production experiments require explicit approval from a service owner before the scheduler can queue them

High-Level Architecture

The platform has six major components working in two planes: a control plane that plans and approves experiments, and an execution plane that injects and monitors them.

Figure: Chaos engineering platform architecture showing control plane, execution layer, target services, and observability integration

The Experiment Planner takes a YAML experiment spec from an engineer or the scheduler, validates it against the policy engine, and hands it to the Dependency Graph Engine. The Dependency Graph Engine maintains a live picture of service relationships scraped from service mesh telemetry and service registries - it classifies every service in the target list as a viable target, a mandatory exclusion, or a suggested control group. The Blast Radius Calculator takes that classification and applies percentage limits: if you ask to kill 50% of Payment Service pods but the global limit is 5% of instances, it rejects the experiment before any fault fires.
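The percentage check in that last step can be sketched as follows. This is a minimal illustration under stated assumptions, not the platform's actual calculator: the function name `check_instance_budget`, its parameters, the round-up rule, and the fleet sizes in the usage lines are all hypothetical.

```python
import math
from typing import Dict, Tuple


def check_instance_budget(
    requested_pct: Dict[str, float],
    instance_counts: Dict[str, int],
    global_limit_pct: float = 5.0,
) -> Tuple[bool, float]:
    """Return (approved, affected_pct_of_fleet) for a proposed experiment.

    requested_pct maps each target service to the share of its own
    instances the experiment wants to fault. instance_counts covers
    every service in the fleet, not only the targets.
    """
    total = sum(instance_counts.values())
    if total == 0:
        return False, 0.0  # assumption: an empty inventory means "reject"

    affected = 0
    for service, pct in requested_pct.items():
        count = instance_counts.get(service, 0)
        # Round up: faulting a fraction of a pod still faults a whole pod.
        affected += math.ceil(count * pct / 100)

    affected_pct = affected * 100 / total
    return affected_pct <= global_limit_pct, affected_pct


# Hypothetical fleet of 1,000 instances, 120 of them Payment Service pods.
fleet = {"payment": 120, "order": 200, "everything-else": 680}
approved, pct = check_instance_budget({"payment": 50.0}, fleet)
print(approved, pct)  # False 6.0 -> rejected before any fault fires
```

Here 50% of 120 pods is 60 instances, or 6% of the fleet, so the request exceeds the 5% global limit even though it targets a single service.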

The SLO Watchdog is the most important component in the platform. It runs continuously, independent of whether any experiment is active, polling Prometheus for the SLO metrics of every service that has ever been a target. When an experiment is live, it tightens its polling interval from 60 seconds to 15 seconds and evaluates SLO expressions against tighter thresholds. The Fault Injection Engine executes the actual fault primitives - it is intentionally stateless; it receives a signed experiment token, applies the fault, and does nothing else. If it crashes mid-experiment, the Rollback Controller detects the missing heartbeat and reverts. The Rollback Controller maintains a rollback manifest for every active experiment: a deterministic set of operations that restore pre-experiment state, independent of the Fault Engine’s liveness.
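The missing-heartbeat detection mentioned above might look like the sketch below. The class name `HeartbeatMonitor`, the 10-second timeout, and the injectable clock are assumptions made for illustration; the source does not specify the heartbeat interval or how the Rollback Controller stores it.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat per experiment and reports stale ones."""

    def __init__(
        self,
        timeout_s: float = 10.0,  # hypothetical value, tune to the halt budget
        clock: Callable[[], float] = time.monotonic,
    ):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen: Dict[str, float] = {}

    def beat(self, experiment_id: str) -> None:
        """Record a heartbeat from the Fault Engine for this experiment."""
        self.last_seen[experiment_id] = self.clock()

    def stale(self) -> List[str]:
        """Experiments whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            exp for exp, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


# Usage with a fake clock so the behaviour is deterministic.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_s=10.0, clock=lambda: fake_now[0])
monitor.beat("exp-1")
monitor.beat("exp-2")
fake_now[0] = 8.0
monitor.beat("exp-2")      # exp-2 keeps reporting, exp-1 has gone silent
fake_now[0] = 12.0
print(monitor.stale())     # ['exp-1'] -> candidates for manifest-driven revert
```

Using a monotonic clock rather than wall-clock time avoids false positives when the host clock is adjusted mid-experiment.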

Key Insight

The Rollback Controller is architecturally separated from the Fault Injection Engine for the same reason circuit breakers are separate from the services they protect. If the component that caused the problem is also the component responsible for cleanup, a crash mid-experiment leaves faults in place permanently. The Rollback Controller must be able to revert state even if the Fault Engine, the Control Plane, and the network between them are all degraded.

The Experiment Scheduler

Chaos experiments are not ad-hoc - mature implementations run them on a regular cadence, the way you run load tests or security scans. Netflix Chaos Monkey became famous precisely because it ran automatically on a schedule, not because engineers manually clicked “inject failure” when they felt like it.

The Scheduler solves two problems that interact in a non-obvious way: preventing experiment collisions and respecting maintenance windows. Two experiments running simultaneously on overlapping services invalidate each other’s baselines - if Payment Service is already experiencing injected latency, starting a second experiment on Order Service produces results that reflect the compound effect, not the isolated one.

Figure: Experiment lifecycle data flow from plan through inject through SLO watch to halt or complete

The Scheduler uses distributed locks in Redis: the SET NX + TTL pattern, wrapped in a WATCH/MULTI/EXEC transaction so that the availability checks and the writes cannot interleave with another scheduler instance. Before queuing any experiment, it acquires a global concurrency token (capped at 20) and a per-service lock for every target service in the experiment. If any lock cannot be acquired, the experiment is queued and retried with exponential backoff. Maintenance windows are stored as time-range entries in the Experiment DB and checked at scheduling time - if any target service has an active maintenance window, the experiment is deferred.

import redis
import time
import uuid
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ExperimentSpec:
    experiment_id: str
    target_services: List[str]
    fault_type: str
    duration_seconds: int
    blast_radius_pct: float

class ExperimentScheduler:
    def __init__(self, redis_client: redis.Redis, max_concurrent: int = 20):
        self.r = redis_client
        self.max_concurrent = max_concurrent
        self.concurrency_key = "chaos:global:concurrency"
        self.lock_ttl = 3600  # 1 hour max experiment duration

    def try_schedule(self, spec: ExperimentSpec) -> Optional[str]:
        """
        Acquire global concurrency token + per-service locks atomically.
        Returns a schedule token on success, None if locks cannot be acquired.
        """
        schedule_token = str(uuid.uuid4())
        service_lock_keys = [
            f"chaos:lock:service:{svc}" for svc in spec.target_services
        ]

        # WATCH the keys, read them, then write inside MULTI/EXEC. If any
        # watched key changes in between, EXEC raises WatchError. The context
        # manager resets the pipeline (and its UNWATCH) on every exit path.
        with self.r.pipeline(transaction=True) as pipe:
            try:
                pipe.watch(self.concurrency_key, *service_lock_keys)
                # After WATCH the pipeline runs commands immediately, so the
                # reads below go through the watching connection.
                current_count = int(pipe.get(self.concurrency_key) or 0)
                if current_count >= self.max_concurrent:
                    return None  # Global limit hit - queue for later

                # Check all service locks are free
                for key in service_lock_keys:
                    if pipe.exists(key):
                        return None  # Service collision - queue for later

                # Acquire atomically
                pipe.multi()
                pipe.incr(self.concurrency_key)
                pipe.expire(self.concurrency_key, self.lock_ttl)
                for key in service_lock_keys:
                    pipe.set(key, schedule_token, ex=self.lock_ttl, nx=True)
                # Store the token -> experiment mapping for rollback lookup
                pipe.set(
                    f"chaos:token:{schedule_token}",
                    spec.experiment_id,
                    ex=self.lock_ttl
                )
                pipe.execute()
                return schedule_token

            except redis.WatchError:
                return None  # Concurrent modification - retry with backoff

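    def release(self, spec: ExperimentSpec, schedule_token: str) -> None:
        """Release all locks after experiment completes or halts."""
        # DEL returns the number of keys removed. Zero means the token already
        # expired or was released, so decrementing again would corrupt the count.
        if self.r.delete(f"chaos:token:{schedule_token}") == 0:
            return
        self.r.decr(self.concurrency_key)
        for svc in spec.target_services:
            key = f"chaos:lock:service:{svc}"
            owner = self.r.get(key)
            # Delete only locks this experiment still owns; after a TTL expiry
            # the key may belong to another experiment. (GET then DEL is not
            # atomic; a Lua script would close the remaining window.)
            if owner in (schedule_token, schedule_token.encode()):
                self.r.delete(key)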

The maintenance window check happens before try_schedule is called. If a target service has an active window, the experiment is placed in a DEFERRED state with a reason code. The scheduler re-evaluates deferred experiments every 5 minutes.
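A minimal sketch of that pre-scheduling check is shown below. The `MaintenanceWindow` shape, the function name `deferral_reason`, and the `MAINTENANCE_WINDOW:<service>` reason-code format are assumptions for illustration; the source only says that windows are time-range entries in the Experiment DB.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class MaintenanceWindow:
    service: str
    start: datetime  # inclusive
    end: datetime    # exclusive


def deferral_reason(
    target_services: List[str],
    windows: List[MaintenanceWindow],
    now: datetime,
) -> Optional[str]:
    """Return a reason code if any target is inside a window, else None."""
    for window in windows:
        if window.service in target_services and window.start <= now < window.end:
            return f"MAINTENANCE_WINDOW:{window.service}"
    return None


# Usage: Payment Service is under maintenance from 02:00 to 04:00 UTC.
window = MaintenanceWindow(
    service="payment",
    start=datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc),
    end=datetime(2024, 1, 1, 4, 0, tzinfo=timezone.utc),
)
at_three = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
print(deferral_reason(["payment", "order"], [window], at_three))
```

The scheduler would call something like this before `try_schedule`, and again on each 5-minute re-evaluation of deferred experiments. Keeping timestamps timezone-aware avoids windows shifting when schedulers run in different regions.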

Watch Out

The most common scheduling mistake is holding the service lock for the entire experiment duration using a fixed TTL, then having an experiment run longer than expected due to a hung rollback. Set the lock TTL to 2x the maximum experiment duration, and have the scheduler send periodic heartbeat refreshes to the lock every 30 seconds. If the heartbeat stops - because the scheduler crashed - the TTL expiry becomes the safety net that frees the lock.
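The heartbeat-plus-TTL behaviour can be illustrated with an in-memory stand-in for the Redis lock. Everything here (`TtlLockStore`, its method names, the fake clock) is hypothetical and exists only to show the ownership rule: a refresh extends the TTL only while the caller's token still owns the lock. Against real Redis the compare-and-extend step would need to be atomic, for example via a server-side script.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlLockStore:
    """In-memory model of SET NX EX plus a token-checked TTL refresh."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self.clock = clock
        self._locks: Dict[str, Tuple[str, float]] = {}  # key -> (token, expires_at)

    def _owner(self, key: str) -> Optional[str]:
        entry = self._locks.get(key)
        if entry is None or entry[1] <= self.clock():
            self._locks.pop(key, None)  # expired: the TTL is the safety net
            return None
        return entry[0]

    def acquire(self, key: str, token: str, ttl_s: float) -> bool:
        if self._owner(key) is not None:
            return False
        self._locks[key] = (token, self.clock() + ttl_s)
        return True

    def refresh(self, key: str, token: str, ttl_s: float) -> bool:
        """Heartbeat: extend the TTL only if `token` still owns the lock."""
        if self._owner(key) != token:
            return False
        self._locks[key] = (token, self.clock() + ttl_s)
        return True


# Usage with a fake clock: the heartbeat stops, the lock frees itself.
now = [0.0]
store = TtlLockStore(clock=lambda: now[0])
key = "chaos:lock:service:payment"
store.acquire(key, "token-a", ttl_s=60)
now[0] = 30.0
store.refresh(key, "token-a", ttl_s=60)        # heartbeat, expiry moves to t=90
now[0] = 91.0                                  # scheduler crashed, no more beats
print(store.acquire(key, "token-b", ttl_s=60)) # True: TTL expiry freed the lock
print(store.refresh(key, "token-a", ttl_s=60)) # False: stale owner cannot extend
```

The final line is the important one: a scheduler that wakes up after its lock expired must not be able to extend, or later delete, a lock that now belongs to another experiment.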

The Fault Injection Engine

The Fault Injection Engine is the component that actually breaks things. It runs as a privileged agent on every host that can be a target - either as a Kubernetes DaemonSet or as a systemd service on VMs. It receives signed experiment tokens and applies OS-level faults that are both realistic and reversible.

Think of the Fault Engine like a keyhole surgeon: it reaches into the operating system through narrow, well-understood interfaces and makes precise modifications. It does not modify application code, restart processes arbitrarily, or make changes that cannot be deterministically undone. Every fault primitive has a corresponding undo operation that is stored in the Rollback Controller before the fault is applied.

Network faults use Linux Traffic Control (tc) with the netem qdisc. Adding 100ms latency with 10ms jitter to a specific network interface is three commands. Removing it is one command. Packet loss is the same mechanism with a different netem parameter.

# Apply: add 100ms latency + 10ms jitter to outbound traffic on eth0
# (the u32 filter steers traffic destined for 10.0.0.0/8 into band 1:3,
# which is where the netem qdisc is attached)
tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 100ms 10ms distribution normal
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 \
  match ip dst 10.0.0.0/8 flowid 1:3

# Revert: remove the qdisc (also removes child qdiscs and filters)
tc qdisc del dev eth0 root
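One way an agent might generate these commands, together with the matching undo command for the rollback manifest, is sketched below. The function `build_netem_commands` and its parameters are hypothetical; it only builds argument lists that mirror the shell commands above and does not execute anything. The `loss` keyword is the netem parameter for packet loss.

```python
from typing import List, Optional, Tuple

Command = List[str]


def build_netem_commands(
    interface: str,
    dst_cidr: str,
    delay_ms: Optional[int] = None,
    jitter_ms: Optional[int] = None,
    loss_pct: Optional[float] = None,
) -> Tuple[List[Command], Command]:
    """Return (apply_commands, revert_command) for one netem fault."""
    netem: Command = []
    if delay_ms is not None:
        netem += ["delay", f"{delay_ms}ms"]
        if jitter_ms is not None:
            netem += [f"{jitter_ms}ms", "distribution", "normal"]
    if loss_pct is not None:
        netem += ["loss", f"{loss_pct}%"]
    if not netem:
        raise ValueError("specify delay_ms and/or loss_pct")

    apply_cmds = [
        # 1. Root prio qdisc with three bands (1:1, 1:2, 1:3).
        ["tc", "qdisc", "add", "dev", interface, "root", "handle", "1:", "prio"],
        # 2. Attach netem to the third band.
        ["tc", "qdisc", "add", "dev", interface, "parent", "1:3",
         "handle", "30:", "netem"] + netem,
        # 3. Steer traffic for the destination range into that band.
        ["tc", "filter", "add", "dev", interface, "protocol", "ip",
         "parent", "1:0", "prio", "3", "u32",
         "match", "ip", "dst", dst_cidr, "flowid", "1:3"],
    ]
    # Deleting the root qdisc also removes its children and filters.
    revert_cmd = ["tc", "qdisc", "del", "dev", interface, "root"]
    return apply_cmds, revert_cmd


apply_cmds, revert_cmd = build_netem_commands(
    "eth0", "10.0.0.0/8", delay_ms=100, jitter_ms=10
)
for cmd in apply_cmds:
    print(" ".join(cmd))
print(" ".join(revert_cmd))
```

Returning the revert command alongside the apply commands lets the caller persist the undo step before running the first `tc` invocation, which matches the ordering the Rollback Controller relies on.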

CPU stress uses a stress-ng process running inside the target service’s cgroup hierarchy. The Fault Engine creates a child cgroup under the target’s cgroup, caps it with its own cpu.max, and moves the stress-ng process into it. Because cgroup limits are hierarchical, the stress load counts against the target service’s own CPU quota - it competes with the real service without consuming resources from unrelated services.

import subprocess
import os
import signal
from pathlib import Path

class FaultAgent:
    def __init__(self, cgroup_base: str = "/sys/fs/cgroup"):
        self.cgroup_base = Path(cgroup_base)
        self.active_faults: dict[str, dict] = {}

    def inject_cpu_pressure(
        self,
        experiment_id: str,
        target_cgroup: str,
        cpu_percent: int,
        duration_seconds: int
    ) -> dict:
        """
        Inject CPU pressure into a cgroup by spawning stress-ng workers
        constrained to a child cgroup of the target service.
        Returns rollback manifest.
        """
        chaos_cgroup = f"{target_cgroup}/chaos-{experiment_id}"
        cgroup_path = self.cgroup_base / chaos_cgroup

        # Create child cgroup for chaos process isolation
        cgroup_path.mkdir(parents=True, exist_ok=True)

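        # Read current CPU limit of target service,
        # e.g. "200000 100000" = 200% CPU quota / 100ms period,
        # or "max 100000" when the service has no quota
        parent_quota_path = self.cgroup_base / target_cgroup / "cpu.max"
        _parent_quota, parent_period = parent_quota_path.read_text().split()

        # Reuse the parent's period; cpu_percent is a percentage of one core
        period_us = int(parent_period)
        quota_us = int(period_us * cpu_percent / 100)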

        # Set CPU limit on chaos child cgroup
        (cgroup_path / "cpu.max").write_text(f"{quota_us} {period_us}")

        # Launch stress-ng in the chaos cgroup
        proc = subprocess.Popen(
            ["stress-ng", "--cpu", "0", "--cpu-load", str(cpu_percent),
             "--timeout", str(duration_seconds)],
            preexec_fn=lambda: self._move_to_cgroup(cgroup_path)
        )

        rollback = {
            "type": "cpu_pressure",
            "experiment_id": experiment_id,
            "pid": proc.pid,
            "cgroup_path": str(cgroup_path),
        }
        self.active_faults[experiment_id] = rollback
        return rollback

    def _move_to_cgroup(self, cgroup_path: Path) -> None:
        pid = os.getpid()
        (cgroup_path / "cgroup.procs").write_text(str(pid))

    def revert(self, experiment_id: str) -> None:
        """Deterministically undo a fault using the rollback manifest."""
        fault = self.active_faults.pop(experiment_id, None)
        if not fault:
            return

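        if fault["type"] == "cpu_pressure":
            try:
                os.kill(fault["pid"], signal.SIGTERM)
                # Reap the child so the cgroup is empty before removal;
                # rmdir on a populated cgroup fails with EBUSY. A production
                # agent would bound this wait and escalate if it hangs.
                os.waitpid(fault["pid"], 0)
            except (ProcessLookupError, ChildProcessError):
                pass  # Already dead or already reaped, still clean up cgroup
            # Remove chaos cgroup
            subprocess.run(
                ["rmdir", fault["cgroup_path"]], check=False
            )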

        elif fault["type"] == "network_latency":
            subprocess.run(
                ["tc", "qdisc", "del", "dev", fault["interface"], "root"],
                check=False
            )

Memory pressure uses memfd_create to allocate anonymous memory and writes to every page so that it is actually resident. On a host without swap, the kernel cannot reclaim those pages until the experiment releases them. Pod kill simply calls the Kubernetes API to delete the target pod, relying on the Deployment’s ReplicaSet controller to create a replacement. The Fault Engine does not manage pod lifecycle - Kubernetes does.

Real World

AWS Fault Injection Service (FIS) uses the same OS-layer approach for its host-level actions, exposing a managed API over tc netem and stress-ng primitives. Gremlin’s agent runs as a privileged container and similarly uses tc for network faults. Chaos Mesh on Kubernetes takes the same route: a privileged chaos-daemon DaemonSet enters the target pod’s network namespace and applies tc netem, iptables, and ipset rules there. The injection primitives have been stable for a decade - the hard engineering is in the control plane around them.

The SLO Watchdog

The Watchdog is the component that separates chaos engineering from recklessness. It is always running, but during an experiment it shifts into high-alert mode with a tighter polling interval and lower halt thresholds.

Think of the Watchdog like the anaesthesiologist in the operating room. The surgeon (Fault Engine) makes deliberate interventions, but the anaesthesiologist monitors the patient’s vital signs continuously and has the authority to call a halt - overriding the surgeon - the moment the situation becomes dangerous. The Watchdog’s halt signal takes precedence over everything else in the system.

Figure: Blast radius control using service dependency graph showing target, excluded, control group, and external service classifications

Each experiment spec includes an slo_assertions block: a list of PromQL expressions, comparison operators, and threshold values. The Watchdog evaluates these against live Prometheus data. If any assertion fails for two consecutive polling intervals (to avoid single-data-point noise), it fires a halt signal to the Rollback Controller.
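To make the shape of that block concrete, here is a hypothetical spec fragment, written as the dict a YAML parser might produce, plus a small validator. The field names follow the assertion fields used in this article (name, promql, operator, threshold); the example query, its `service` label, the threshold value, and the `validate_assertions` helper are assumptions. The operator states the healthy condition, so `lt` means the value must stay below the threshold.

```python
from typing import Any, Dict, List

VALID_OPERATORS = {"lt", "gt", "lte", "gte"}
REQUIRED_FIELDS = ("name", "promql", "operator", "threshold")

# Hypothetical spec fragment: fewer than 5 server errors per second.
example_spec: Dict[str, Any] = {
    "slo_assertions": [
        {
            "name": "order-5xx-per-second",
            "promql": 'sum(rate(http_requests_total{service="order",status=~"5.."}[1m]))',
            "operator": "lt",
            "threshold": 5.0,
        },
    ]
}


def validate_assertions(spec: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the block is valid."""
    problems: List[str] = []
    assertions = spec.get("slo_assertions") or []
    if not assertions:
        problems.append("spec has no slo_assertions")
    for i, assertion in enumerate(assertions):
        missing = [f for f in REQUIRED_FIELDS if f not in assertion]
        if missing:
            problems.append(f"assertion {i}: missing {missing}")
            continue
        if assertion["operator"] not in VALID_OPERATORS:
            problems.append(
                f"assertion {i}: unknown operator {assertion['operator']!r}"
            )
        threshold = assertion["threshold"]
        if isinstance(threshold, bool) or not isinstance(threshold, (int, float)):
            problems.append(f"assertion {i}: threshold must be a number")
    return problems


print(validate_assertions(example_spec))  # [] -> nothing to report
```

Rejecting a spec with no assertions at plan time matters more than it looks: an experiment without assertions gives the Watchdog nothing to halt on.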

import asyncio
import aiohttp
from dataclasses import dataclass
from typing import List
import logging

logger = logging.getLogger(__name__)

@dataclass
class SLOAssertion:
    name: str
    promql: str  # e.g., 'rate(http_requests_total{status=~"5.."}[1m])'
    operator: str  # "lt", "gt", "lte", "gte"
    threshold: float
    halt_on_breach: bool = True

class SLOWatchdog:
    def __init__(
        self,
        prometheus_url: str,
        rollback_controller,
        normal_interval_s: int = 60,
        active_interval_s: int = 15,
    ):
        self.prometheus_url = prometheus_url
        self.rollback = rollback_controller
        self.normal_interval = normal_interval_s
        self.active_interval = active_interval_s
        self.consecutive_failures: dict[str, int] = {}

    async def watch_experiment(
        self,
        experiment_id: str,
        assertions: List[SLOAssertion]
    ) -> None:
        """Poll SLO assertions every active_interval until halt or completion."""
        async with aiohttp.ClientSession() as session:
            while True:
                for assertion in assertions:
                    breached = await self._evaluate(session, assertion)
                    key = f"{experiment_id}:{assertion.name}"

                    if breached:
                        self.consecutive_failures[key] = (
                            self.consecutive_failures.get(key, 0) + 1
                        )
                        logger.warning(
                            "SLO assertion '%s' breached (consecutive=%d)",
                            assertion.name,
                            self.consecutive_failures[key]
                        )
                        if self.consecutive_failures[key] >= 2:
                            if assertion.halt_on_breach:
                                await self.rollback.halt(
                                    experiment_id,
                                    reason=f"SLO '{assertion.name}' breached "
                                           f"{self.consecutive_failures[key]}x consecutively"
                                )
                                return
                    else:
                        self.consecutive_failures.pop(key, None)

                await asyncio.sleep(self.active_interval)

    async def _evaluate(
        self, session: aiohttp.ClientSession, assertion: SLOAssertion
    ) -> bool:
        """Return True if the SLO assertion is breached."""
        url = f"{self.prometheus_url}/api/v1/query"
        async with session.get(url, params={"query": assertion.promql}) as resp:
            data = await resp.json()
            results = data.get("data", {}).get("result", [])
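            if not results:
                # No data is treated as "not breached". Note that this fails
                # open: a target that stops reporting will not trigger a halt.
                return False
            # Only the first series is checked; aggregate in PromQL if needed
            value = float(results[0]["value"][1])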

        ops = {
            "lt": lambda v, t: v >= t,   # breached if NOT less than
            "gt": lambda v, t: v <= t,   # breached if NOT greater than
            "lte": lambda v, t: v > t,
            "gte": lambda v, t: v < t,
        }
        return ops[assertion.operator](value, assertion.threshold)

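The two-consecutive-failure requirement is critical. A single anomalous scrape from Prometheus (network blip, scrape timeout) would otherwise cause spurious halts that erode trust in the platform. Two consecutive failures at 15-second intervals means the halt signal fires 15 to 30 seconds after the metric first shows a breach, depending on where in the polling cycle the breach begins - fast enough to prevent cascade, but not so hair-trigger that noise kills valid experiments. The worst case consumes the entire 30-second auto-halt budget before any fault is reverted, so a platform that must meet that budget end-to-end needs a shorter active polling interval or a near-instant revert path.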
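The timing arithmetic behind that trade-off is short enough to write down. This sketch assumes queries return instantly and that the metric reflects the breach immediately; real latency adds Prometheus scrape lag and the range window in the PromQL expression. The function name is hypothetical.

```python
from typing import Tuple


def detection_latency_bounds(
    interval_s: float, consecutive: int
) -> Tuple[float, float]:
    """Best and worst case seconds from breach onset to the halt signal."""
    # Best case: the breach begins just before a poll, so the first failing
    # sample is immediate and the remaining ones each cost one interval.
    best = (consecutive - 1) * interval_s
    # Worst case: the breach begins just after a poll, so even the first
    # failing sample costs a full interval.
    worst = consecutive * interval_s
    return best, worst


print(detection_latency_bounds(15, 2))  # (15, 30): the configuration above
print(detection_latency_bounds(5, 2))   # (5, 10): leaves time for the revert
```

Shortening the active interval is cheaper than dropping the two-sample rule, because it keeps the noise protection while shrinking both bounds.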

Key Insight

Define halt thresholds tighter than production SLO burn rate alerts. If your SLO alert fires when error rate exceeds 1% for 5 minutes, the chaos halt should fire at 0.5% for 30 seconds. You want the chaos platform to detect trouble and stop before your production alerting system even notices. This keeps chaos experiments below the threshold of customer-visible impact.

The Steady-State Hypothesis

Before the Fault Engine injects anything, the platform must verify that the system is actually in a normal operating state. This is the steady-state hypothesis from the Chaos Engineering principles first articulated by Netflix and later codified at principlesofchaos.org.

Imagine trying to test whether your car’s antilock brakes work by slamming on the brakes in an icy parking lot. If your tires are already flat, the test tells you nothing about the ABS - it just tells you that flat tires perform badly. The steady-state check is how you verify the tires are inflated before you drive onto the ice.

from dataclasses import dataclass
from typing import List, Tuple
import asyncio

import aiohttp
# SLOAssertion and SLOWatchdog are the classes from the watchdog listing above

@dataclass
class HypothesisResult:
    passed: bool
    checks: List[Tuple[str, bool, int]]  # (name, passed, failure_count)
    reason: str

class SteadyStateVerifier:
    def __init__(self, watchdog: SLOWatchdog):
        self.watchdog = watchdog

    async def verify(
        self,
        assertions: List[SLOAssertion],
        min_stable_window_s: int = 120
    ) -> HypothesisResult:
        """
        Verify steady state over a time window, not just a single point.
        Returns HypothesisResult indicating whether experiment may proceed.
        """
        # Sample over a stable window: 8 checks, 15s apart = 2 minutes
        samples_needed = min_stable_window_s // 15
        failures_per_assertion: dict[str, int] = {a.name: 0 for a in assertions}

        # One HTTP session is reused for every sample in the window
        async with aiohttp.ClientSession() as session:
            for _ in range(samples_needed):
                for assertion in assertions:
                    breached = await self.watchdog._evaluate(session, assertion)
                    if breached:
                        failures_per_assertion[assertion.name] += 1
                await asyncio.sleep(15)

        all_pass = True
        result_checks = []
        for assertion in assertions:
            failure_count = failures_per_assertion[assertion.name]
            # Allow at most 1 transient failure out of N samples
            passed = failure_count <= 1
            if not passed:
                all_pass = False
            result_checks.append((assertion.name, passed, failure_count))

        return HypothesisResult(
            passed=all_pass,
            checks=result_checks,
            reason="steady" if all_pass else (
                f"Unstable pre-experiment: {[c for c in result_checks if not c[1]]}"
            )
        )

The verifier checks the same PromQL assertions used for halt detection, but over a 2-minute window before injection starts. An experiment spec that passes the blast radius check and policy gate still gets blocked if the steady-state check shows the target is already degraded. This prevents experiments from running during ongoing incidents - a common footgun when operators forget to check current service health.

Data Model

All experiment state lives in PostgreSQL with TimescaleDB for time-series metric snapshots. The relational core is normalized: experiments link to services through the experiment_services join table, dependency edges and metric snapshots reference their parents by foreign key, and only the variable-shape fields (fault specs and SLO definitions) are stored as JSONB.

-- Core experiment record
CREATE TABLE experiments (
  id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  slug            TEXT NOT NULL UNIQUE,
  hypothesis      TEXT NOT NULL,
  status          TEXT NOT NULL DEFAULT 'pending',
  -- pending | deferred | running | completed | halted | aborted
  fault_spec      JSONB NOT NULL,
  -- { type, params, target_services, blast_radius_pct }
  slo_assertions  JSONB NOT NULL,
  created_by      TEXT NOT NULL,
  approved_by     TEXT,
  scheduled_at    TIMESTAMPTZ,
  started_at      TIMESTAMPTZ,
  halted_at       TIMESTAMPTZ,
  completed_at    TIMESTAMPTZ,
  halt_reason     TEXT,
  created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_experiments_status ON experiments(status);
CREATE INDEX idx_experiments_scheduled ON experiments(scheduled_at)
  WHERE status = 'pending';

-- Services known to the platform (synced from service registry)
CREATE TABLE services (
  id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  name            TEXT NOT NULL UNIQUE,
  team            TEXT NOT NULL,
  tier            INTEGER NOT NULL DEFAULT 2,
  -- tier 0 = excluded always, tier 1 = requires approval, tier 2 = open
  slo_definitions JSONB,
  -- default SLO assertions for this service
  updated_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Dependency graph edges (updated by mesh scraper)
CREATE TABLE service_dependencies (
  caller_id       UUID NOT NULL REFERENCES services(id),
  callee_id       UUID NOT NULL REFERENCES services(id),
  call_rate_rps   FLOAT,
  p99_latency_ms  FLOAT,
  updated_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
  PRIMARY KEY (caller_id, callee_id)
);

-- Experiment-to-service mapping (many-to-many with role)
CREATE TABLE experiment_services (
  experiment_id   UUID NOT NULL REFERENCES experiments(id),
  service_id      UUID NOT NULL REFERENCES services(id),
  role            TEXT NOT NULL,
  -- target | control_group | excluded | external
  instance_count  INTEGER,
  affected_count  INTEGER,
  PRIMARY KEY (experiment_id, service_id)
);

-- Time-series metric snapshots (one row per assertion per poll interval)
CREATE TABLE experiment_metrics (
  experiment_id   UUID NOT NULL REFERENCES experiments(id),
  assertion_name  TEXT NOT NULL,
  sampled_at      TIMESTAMPTZ NOT NULL,
  value           FLOAT NOT NULL,
  threshold       FLOAT NOT NULL,
  breached        BOOLEAN NOT NULL DEFAULT false,
  PRIMARY KEY (experiment_id, assertion_name, sampled_at)
);

-- TimescaleDB hypertable for efficient time-range queries
SELECT create_hypertable('experiment_metrics', 'sampled_at');
CREATE INDEX idx_metrics_experiment ON experiment_metrics(experiment_id, sampled_at DESC);

-- Audit log for compliance
CREATE TABLE audit_events (
  id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  experiment_id   UUID REFERENCES experiments(id),
  event_type      TEXT NOT NULL,
  actor           TEXT NOT NULL,
  payload         JSONB,
  created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_audit_experiment ON audit_events(experiment_id, created_at DESC);

The fault_spec and slo_assertions columns are stored as JSONB because they have variable structure across fault types. The schema uses experiment_services.role to record the blast radius classification that was computed at plan time - this is what powers post-experiment analysis: “how many instances were actually affected vs the plan?”

The experiment_metrics hypertable with TimescaleDB automatically partitions by time (weekly chunks by default), making it efficient to query “all metric samples for experiment X” without scanning the full table.

Key Insight

Store the SLO threshold values in the metric snapshot rows, not just the metric values. Thresholds change over time as services mature - you want the post-experiment report to show “at the time of the experiment, the threshold was X, and the value hit Y.” Reconstructing thresholds from a separate table after the fact is fragile and makes historical analysis unreliable.

Key Algorithms and Protocols

Blast Radius Calculation

The blast radius algorithm traverses the service dependency graph using BFS from each target service, following edges in both directions (callers and callees). It separates two sets: the direct blast radius (the target services themselves, whose instances count toward the percentage cap) and the transitive blast radius (every other service reachable within N hops). The experiment is rejected if a target or any service in the transitive blast radius is tier-0 (always-excluded), or if the targets’ instance count exceeds the global percentage cap.

from collections import deque
from typing import Set, Dict
from dataclasses import dataclass

@dataclass
class BlastRadiusReport:
    approved: bool
    direct_services: Set[str]
    transitive_services: Set[str]
    excluded_hits: Set[str]
    total_instance_pct: float
    rejection_reason: str = ""

class BlastRadiusCalculator:
    def __init__(self, graph: Dict[str, list], tier_map: Dict[str, int],
                 instance_counts: Dict[str, int], global_limit_pct: float = 5.0):
        self.graph = graph          # {service: [dependencies]}
        self.reverse_graph = self._build_reverse(graph)
        self.tier_map = tier_map    # {service: 0|1|2}
        self.instance_counts = instance_counts
        self.global_limit_pct = global_limit_pct
        self.total_instances = sum(instance_counts.values())

    def _build_reverse(self, graph):
        reverse = {}
        for caller, callees in graph.items():
            for callee in callees:
                reverse.setdefault(callee, []).append(caller)
        return reverse

    def compute(
        self,
        target_services: list[str],
        max_hops: int = 2
    ) -> BlastRadiusReport:
        visited: Set[str] = set(target_services)
        excluded_hits: Set[str] = set()
        queue = deque((svc, 0) for svc in target_services)

        while queue:
            svc, depth = queue.popleft()
            if self.tier_map.get(svc, 2) == 0:
                excluded_hits.add(svc)

            if depth < max_hops:
                # Traverse both directions: callers AND callees are in blast radius
                neighbors = (
                    self.graph.get(svc, []) +
                    self.reverse_graph.get(svc, [])
                )
                for neighbor in neighbors:
                    if neighbor not in visited:
                        visited.add(neighbor)
                        queue.append((neighbor, depth + 1))

        direct = set(target_services)
        transitive = visited - direct
        affected_instances = sum(
            self.instance_counts.get(s, 0) for s in direct
        )
        pct = (affected_instances / self.total_instances * 100
               if self.total_instances > 0 else 0)

        if excluded_hits:
            return BlastRadiusReport(
                approved=False,
                direct_services=direct,
                transitive_services=transitive,
                excluded_hits=excluded_hits,
                total_instance_pct=pct,
                rejection_reason=f"Transitive blast radius hits excluded services: {excluded_hits}"
            )
        if pct > self.global_limit_pct:
            return BlastRadiusReport(
                approved=False,
                direct_services=direct,
                transitive_services=transitive,
                excluded_hits=set(),
                total_instance_pct=pct,
                rejection_reason=f"Instance pct {pct:.1f}% exceeds limit {self.global_limit_pct}%"
            )

        return BlastRadiusReport(
            approved=True,
            direct_services=direct,
            transitive_services=transitive,
            excluded_hits=set(),
            total_instance_pct=pct
        )
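To see why the traversal walks both callers and callees, here is a minimal standalone sketch of the same bounded breadth-first search. The function name `blast_radius` and the five-service topology are hypothetical, chosen only for illustration; it is not the article's calculator class.

```python
from collections import deque

def blast_radius(graph, targets, max_hops=2):
    """Return {service: hop distance} for everything within max_hops of the targets."""
    # Build a caller lookup so edges can be walked in both directions
    reverse = {}
    for caller, callees in graph.items():
        for callee in callees:
            reverse.setdefault(callee, []).append(caller)

    depth = {svc: 0 for svc in targets}
    queue = deque(targets)
    while queue:
        svc = queue.popleft()
        if depth[svc] >= max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(svc, []) + reverse.get(svc, []):
            if neighbor not in depth:
                depth[neighbor] = depth[svc] + 1
                queue.append(neighbor)
    return depth

# Hypothetical topology: {service: [services it calls]}
graph = {
    "checkout": ["orders"],
    "orders": ["payments", "inventory"],
    "notifications": ["orders"],
    "payments": ["ledger"],
}
radius = blast_radius(graph, ["orders"], max_hops=1)
print(sorted(radius))
# ['checkout', 'inventory', 'notifications', 'orders', 'payments']
```

With a callee-only walk, `checkout` and `notifications` would be missed even though they are the services whose retries and timeouts react first to a fault in `orders`. Raising `max_hops` to 2 also pulls in `ledger`.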

Automatic Rollback Protocol

The rollback protocol is a two-phase, write-ahead handshake between the Rollback Controller and the Fault Agents (not a full distributed two-phase commit). In phase one, the Rollback Controller writes a rollback manifest to a durable store (PostgreSQL). Only then, in phase two, does the Fault Engine apply any fault. Because the manifest is persisted first, it exists even if the Fault Engine crashes immediately after applying the fault.

package rollback

import (
    "context"
    "database/sql"
    "encoding/json"
    "fmt"
    "time"
)

type RollbackManifest struct {
    ExperimentID string          `json:"experiment_id"`
    Entries      []RollbackEntry `json:"entries"`
    CreatedAt    time.Time       `json:"created_at"`
}

type RollbackEntry struct {
    AgentHost    string          `json:"agent_host"`
    FaultType    string          `json:"fault_type"`
    UndoPayload  json.RawMessage `json:"undo_payload"`
    Applied      bool            `json:"applied"`
    Reverted     bool            `json:"reverted"`
}

type RollbackController struct {
    db      *sql.DB
    agents  AgentRegistry
}

// CommitManifest persists rollback instructions BEFORE fault injection starts.
// This is phase 1 of the two-phase protocol.
func (r *RollbackController) CommitManifest(
    ctx context.Context, manifest RollbackManifest,
) error {
    data, err := json.Marshal(manifest)
    if err != nil {
        return fmt.Errorf("marshal manifest: %w", err)
    }
    _, err = r.db.ExecContext(ctx,
        `INSERT INTO rollback_manifests (experiment_id, manifest, created_at)
         VALUES ($1, $2, $3)
         ON CONFLICT (experiment_id) DO UPDATE SET manifest = $2`,
        manifest.ExperimentID, data, manifest.CreatedAt,
    )
    return err
}

// Halt reverts all faults for an experiment. Safe to call multiple times.
func (r *RollbackController) Halt(ctx context.Context, experimentID, reason string) error {
    // Load manifest from DB (works even if the original caller is gone)
    var data []byte
    err := r.db.QueryRowContext(ctx,
        `SELECT manifest FROM rollback_manifests WHERE experiment_id = $1`,
        experimentID,
    ).Scan(&data)
    if err == sql.ErrNoRows {
        return nil  // Nothing to revert
    }
    if err != nil {
        return fmt.Errorf("load manifest: %w", err)
    }

    var manifest RollbackManifest
    if err := json.Unmarshal(data, &manifest); err != nil {
        return fmt.Errorf("unmarshal manifest: %w", err)
    }

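    failed := 0
    for i, entry := range manifest.Entries {
        if entry.Reverted {
            continue  // Idempotent: skip already-reverted entries
        }
        agent, err := r.agents.Get(entry.AgentHost)
        if err != nil {
            // Agent unreachable - count it and continue to other agents
            failed++
            continue
        }
        if err := agent.Revert(ctx, entry); err != nil {
            failed++
            continue
        }
        manifest.Entries[i].Reverted = true
    }

    // Persist updated revert status so a retry skips completed entries
    data, err = json.Marshal(manifest)
    if err != nil {
        return fmt.Errorf("marshal manifest: %w", err)
    }
    if _, err = r.db.ExecContext(ctx,
        `UPDATE rollback_manifests SET manifest = $1 WHERE experiment_id = $2`,
        data, experimentID,
    ); err != nil {
        return fmt.Errorf("persist revert status: %w", err)
    }

    // Update experiment status; the status guard keeps repeated halts from
    // overwriting the original halted_at and halt_reason
    if _, err = r.db.ExecContext(ctx,
        `UPDATE experiments SET status = 'halted', halted_at = $1, halt_reason = $2
         WHERE id = $3 AND status <> 'halted'`,
        time.Now(), reason, experimentID,
    ); err != nil {
        return fmt.Errorf("update experiment status: %w", err)
    }
    if failed > 0 {
        // Surface partial reverts so the caller can retry
        return fmt.Errorf("halt %s: %d entries not yet reverted", experimentID, failed)
    }
    return nil
}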

The idempotency of Halt is non-negotiable. The SLO Watchdog, the scheduler heartbeat monitor, and an operator-triggered manual halt can all call Halt simultaneously. Each must be a no-op if the experiment is already halted.
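The concurrent-caller scenario can be sketched in a few lines. This is an in-memory stand-in, not the PostgreSQL-backed controller above: the class name `HaltController` and its `revert_fn` callback are assumptions made for the example, and a lock plays the role that a transactional status check would play in the database.

```python
import threading

class HaltController:
    """In-memory stand-in for the halt path (hypothetical, for illustration)."""

    def __init__(self, revert_fn):
        self._revert_fn = revert_fn
        self._lock = threading.Lock()
        self._halted = {}  # experiment_id -> reason given by the first caller

    def halt(self, experiment_id, reason):
        """Return True if this call performed the halt, False if it was a no-op."""
        with self._lock:
            if experiment_id in self._halted:
                return False  # already halted: later callers do nothing
            self._revert_fn(experiment_id)
            self._halted[experiment_id] = reason
            return True

    def halt_reason(self, experiment_id):
        return self._halted.get(experiment_id)

# Three independent callers race to halt the same experiment
reverts = []
results = []
controller = HaltController(revert_fn=reverts.append)
callers = ["slo_watchdog", "heartbeat_monitor", "operator"]
threads = [
    threading.Thread(target=lambda r=r: results.append(controller.halt("exp-42", r)))
    for r in callers
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(reverts)          # ['exp-42']
print(sorted(results))  # [False, False, True]
```

Exactly one caller performs the revert and the other two return without side effects, whichever thread wins the race. If `revert_fn` raises, the experiment is not recorded as halted, so a retry will attempt the revert again.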

Scaling and Performance

Multi-region chaos platform deployment with capacity estimates per region

The chaos platform has unusual scaling characteristics. The control plane sees very low write throughput (tens of experiments per hour at most), while the SLO Watchdog must sustain continuous polling against Prometheus for every active experiment. The Rollback Controller must be available at all times, including during partial platform outages.

The Fault Agents are the only stateful component on target hosts. They are designed to be autonomous: if the control plane goes dark mid-experiment, the agent’s local rollback manifest (written to disk at fault application time) allows it to revert the fault when its experiment TTL expires - even without a halt signal from the Rollback Controller.
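A minimal sketch of that self-revert path, assuming a JSON manifest file and a periodic local check. The names `LocalManifest` and `revert_if_expired`, the file layout, and the example undo command are all hypothetical. The TTL of twice the experiment duration follows the figure given later in this article.

```python
import json
import os
import tempfile
import time

class LocalManifest:
    """On-disk undo record written before the fault is applied (assumed format)."""

    def __init__(self, path):
        self.path = path

    def write(self, experiment_id, undo_commands, ttl_seconds, now=None):
        now = time.time() if now is None else now
        record = {
            "experiment_id": experiment_id,
            "undo": undo_commands,
            "expires_at": now + ttl_seconds,
        }
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(record, f)
            f.flush()
            os.fsync(f.fileno())        # make sure the bytes reach the disk
        os.replace(tmp_path, self.path)  # atomic rename: never a half-written file

    def load(self):
        if not os.path.exists(self.path):
            return None
        with open(self.path) as f:
            return json.load(f)

    def clear(self):
        if os.path.exists(self.path):
            os.remove(self.path)

def revert_if_expired(manifest, run_undo, now=None):
    """Run the undo commands once the TTL has passed. Returns True if reverted."""
    record = manifest.load()
    if record is None:
        return False  # nothing applied, or already reverted
    now = time.time() if now is None else now
    if now < record["expires_at"]:
        return False
    for command in record["undo"]:
        run_undo(command)
    manifest.clear()
    return True

with tempfile.TemporaryDirectory() as workdir:
    manifest = LocalManifest(os.path.join(workdir, "exp-42.manifest.json"))
    executed = []
    duration_s = 30
    manifest.write("exp-42", ["tc qdisc del dev eth0 root"],
                   ttl_seconds=2 * duration_s, now=1000.0)
    print(revert_if_expired(manifest, executed.append, now=1030.0))  # False
    print(revert_if_expired(manifest, executed.append, now=1061.0))  # True
    print(executed)  # ['tc qdisc del dev eth0 root']
```

The check needs no network access at all: the expiry time and the undo commands both live in the local file, which is what lets the agent clean up during a partition. A second call after the revert finds no manifest and does nothing.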

# Capacity estimation script
# Run this to size your deployment based on target fleet size
import math

def estimate_capacity(
    target_services: int,
    avg_instances_per_service: int,
    experiments_per_day: int,
    avg_experiment_duration_minutes: int,
    slo_assertions_per_experiment: int,
) -> dict:
    # Concurrent experiments (Little's Law): arrival rate x average duration
    arrival_rate = experiments_per_day / (24 * 60)  # per minute
    avg_duration = avg_experiment_duration_minutes
    concurrent = arrival_rate * avg_duration
    # Allow 3x headroom for burst + failed experiments that run to TTL
    concurrent_with_headroom = concurrent * 3

    # Watchdog Prometheus query rate
    poll_interval_s = 15  # active experiment polling
    queries_per_second = (
        concurrent_with_headroom * slo_assertions_per_experiment / poll_interval_s
    )

    # Fault agents active at once (each agent handles one fault injection per host)
    # Worst case: every concurrent experiment uses the full 5% fleet-wide cap
    total_instances = target_services * avg_instances_per_service
    max_concurrent_agents = round(
        concurrent_with_headroom * total_instances * 0.05  # 5% blast radius
    )

    # Result store writes: one metric row per assertion per poll
    metric_rows_per_second = (
        concurrent_with_headroom * slo_assertions_per_experiment / poll_interval_s
    )
    # Assumes ~50k rows/s per TimescaleDB node; benchmark on your own hardware
    result_store_nodes = max(1, math.ceil(metric_rows_per_second / 50_000))

    return {
        "max_concurrent_experiments": concurrent_with_headroom,
        "watchdog_prom_qps": queries_per_second,
        "max_concurrent_agents": max_concurrent_agents,
        "result_store_write_rps": metric_rows_per_second,
        "result_store_nodes_needed": result_store_nodes,
    }

# Example: 200 services, 20 instances each, 48 experiments/day, 30min avg
print(estimate_capacity(
    target_services=200,
    avg_instances_per_service=20,
    experiments_per_day=48,
    avg_experiment_duration_minutes=30,
    slo_assertions_per_experiment=5,
))
# Output (approx.): max_concurrent_experiments ~3, watchdog_prom_qps ~1,
#   max_concurrent_agents ~600, result_store_write_rps ~1, result_store_nodes_needed 1
# The query and write rates are small - the chaos platform is not a high-throughput system.
# Its challenge is correctness and availability, not scale.

Real World

Shopify’s chaos engineering team ran 500+ experiments in 2022 across their Kubernetes fleet. Their key scaling decision was making the Fault Agent a thin stateless binary (< 5MB) that downloads its rollback manifest from S3 at startup. This made the agent trivially deployable as a Kubernetes DaemonSet across 10,000+ nodes without any per-node state management. The control plane itself ran on 3 small instances - the platform is genuinely not a compute-intensive system.

Failure Modes and Recovery

Failure	Detection	Impact	Recovery
Fault Agent crashes mid-experiment	Agent heartbeat timeout (30s)	Faults remain active on target hosts	Rollback Controller reads local manifest on host, reverts via SSH fallback
SLO Watchdog loses Prometheus connectivity	HTTP timeout on scrape (10s)	Cannot detect SLO breach during outage	Default to halt: treat watchdog connectivity loss as breach, halt experiment immediately
Control plane DB unreachable	Health check failure	Cannot start new experiments or record results	In-flight experiments continue; rollback manifests on agents serve as truth
Rollback Controller crashes during halt	Unacknowledged halt left in retry queue	Partial revert: some agents reverted, some not	Restart Controller; re-read manifest, skip already-reverted entries (halt is idempotent)
Redis (scheduler lock) unavailable	Redis connection error	Cannot acquire scheduling locks	Block new experiments until Redis recovers; never interpret the outage as a release of existing locks
Experiment TTL expires before completion	TTL-based expiry in scheduler	Experiment runs longer than planned	Agent local TTL triggers self-revert; Rollback Controller syncs status to DB

Watch Out

The most dangerous failure mode is the “phantom experiment” - an experiment that the control plane believes has completed or halted, but whose faults are still active on one or more agents due to a partial revert. Detect this with a reconciliation loop that runs every 5 minutes: it queries all Fault Agents for their active fault list and compares it against experiments in a terminal state in the DB. Any mismatch triggers an immediate revert and an alert. Without this loop, phantom experiments can silently degrade services for hours.
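The comparison at the heart of that reconciliation loop is small enough to sketch directly. The function name `find_phantoms`, the status values, and the input shapes below are assumptions for illustration. Treating an experiment ID the database has never heard of as a phantom is also an assumption, on the grounds that an untracked fault is at least as dangerous as a stale one.

```python
TERMINAL_STATES = {"completed", "halted", "failed"}  # assumed status values

def find_phantoms(experiment_status, agent_active_faults):
    """Return (agent_host, experiment_id) pairs whose faults should not be active.

    experiment_status:   {experiment_id: status} as recorded in the control plane DB
    agent_active_faults: {agent_host: [experiment_id, ...]} as reported by agents
    """
    phantoms = []
    for host, active in agent_active_faults.items():
        for experiment_id in active:
            status = experiment_status.get(experiment_id)
            # Terminal in the DB, or unknown to the DB entirely: fault must go
            if status is None or status in TERMINAL_STATES:
                phantoms.append((host, experiment_id))
    return sorted(phantoms)

experiment_status = {
    "exp-1": "running",
    "exp-2": "halted",
    "exp-3": "completed",
}
agent_active_faults = {
    "host-a": ["exp-1"],           # legitimate: experiment still running
    "host-b": ["exp-2"],           # phantom: halt never reached this agent
    "host-c": ["exp-1", "exp-9"],  # exp-9 is unknown to the control plane
}
print(find_phantoms(experiment_status, agent_active_faults))
# [('host-b', 'exp-2'), ('host-c', 'exp-9')]
```

Each returned pair would then be passed to the revert path and raised as an alert.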

Comparison of Approaches

Approach	Blast Radius Control	Rollback Speed	Operational Complexity	Best Fit
Chaos Monkey (original Netflix)	Service-level only (no instance %)	Manual only	Low - just kills instances	Random pod kill at service level, mature Kubernetes environments
Gremlin (managed SaaS)	Percentage-based, team-scoped, approval flows	Automatic + manual	Low - managed control plane	Teams that want zero infrastructure ownership
AWS Fault Injection Service (FIS)	AWS resource-scoped, IAM-gated	Automatic stop conditions	Medium - AWS-only, IAM setup	AWS-native workloads, compliance-heavy environments
Chaos Mesh (open source, K8s-native)	Namespace-scoped, CRD-based	Automatic via CRD deletion	High - requires cluster-admin to install; some fault types need eBPF kernel support	Full K8s control, custom fault types needed
Custom platform (this design)	Graph-aware, global %, tiered service classification	Automatic via manifest + agent TTL	Very High - full ownership	Multi-cloud, mixed VM+K8s, bespoke compliance requirements

Key Takeaways

Steady-state hypothesis: always verify that the system is normal for at least 2 minutes before injecting any fault - running chaos on a degraded system produces noise, not signal.
Blast radius control: enforce limits at three levels simultaneously (service tier exclusion, instance percentage, global concurrent cap) because a single-level check can always be bypassed by a misconfigured experiment.
Rollback manifest durability: write the rollback manifest to durable storage before applying any fault, not after - the manifest is only useful if it survives a crash of the component that created it.
Watchdog independence: the SLO Watchdog must be architecturally independent of the Fault Engine; if they share a process, a crash takes both the injector and the safety system offline simultaneously.
Idempotent halt: every component in the halt path must be safe to call multiple times with the same experiment ID; concurrent halt signals from watchdog, operator, and TTL expiry are the normal case, not an edge case.
Dependency graph as a first-class citizen: keeping the service dependency graph current (sync from service mesh every 5 minutes) is what distinguishes safe blast radius calculation from guesswork.
Agent autonomy: Fault Agents must be able to self-revert when they lose contact with the control plane - a chaos platform that leaves faults in place during a network partition is actively dangerous.
Observability integration: the most valuable output of a chaos experiment is the metric snapshot timeline showing exactly how each SLO moved during injection - invest in storing that data, not just the pass/fail result.
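The steady-state gate from the first takeaway can be sketched as a pure function over recent metric samples. The function name, the 15-second sample spacing (borrowed from the watchdog polling interval used elsewhere in this article), and the choice to treat missing samples as "not steady" are assumptions for illustration.

```python
def steady_state_ok(samples, max_value, now, window_s=120, poll_interval_s=15):
    """True only if the metric stayed within max_value for the whole window.

    samples: list of (unix_timestamp, value) pairs, in any order.
    Gaps in the data count as failure: an unobserved system is not known-good.
    """
    recent = [value for ts, value in samples if now - window_s <= ts <= now]
    expected = window_s // poll_interval_s
    if len(recent) < expected:
        return False  # not enough evidence to call the system normal
    return all(value <= max_value for value in recent)

# Error rate (%) sampled every 15 s over the last two minutes
healthy = [(t, 0.2) for t in range(0, 121, 15)]
spiky = [(t, 3.5 if t == 60 else 0.2) for t in range(0, 121, 15)]
sparse = healthy[-3:]  # only the last 30 s of data is available

print(steady_state_ok(healthy, max_value=1.0, now=120))  # True
print(steady_state_ok(spiky, max_value=1.0, now=120))    # False
print(steady_state_ok(sparse, max_value=1.0, now=120))   # False
```

A single out-of-range sample inside the window is enough to block the experiment, which is the conservative choice for a pre-flight check.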

Building a chaos engineering platform is ultimately an exercise in earning trust through engineering rigor. The blast radius controls, the steady-state verification, the automatic rollback - none of these are features that make chaos experiments more powerful. They are features that make engineers willing to run them. A platform that everyone trusts is the one that gets used, and a chaos platform that gets used is one that finds real weaknesses before production does.

Frequently Asked Questions

Why not just use Chaos Monkey and be done with it?

Netflix open-sourced Chaos Monkey as a random instance terminator, which is powerful but narrow. It covers exactly one fault type (instance kill) and has no blast radius controls beyond “only kill one instance at a time.” For organizations with complex service topologies, compliance requirements around which services can be targeted, or a need for fault types beyond instance kill (latency, CPU, network partition), Chaos Monkey is a starting point, not a complete solution. Netflix’s own chaos engineering program extends well beyond Chaos Monkey, with additional tools such as FIT (Failure Injection Testing) and ChAP (Chaos Automation Platform).

Why not use a service mesh (Istio/Envoy) for fault injection instead of OS-layer agents?

Service mesh fault injection (Istio’s VirtualService fault injection) is simpler to operate but only covers HTTP/gRPC traffic and only injects at the proxy layer - it cannot simulate CPU starvation, memory pressure, disk I/O throttling, or kernel-level network issues. A mesh-injected 100ms latency also behaves differently from a tc netem 100ms latency because the mesh proxy adds its own buffering. For testing resilience patterns (timeouts, circuit breakers, retries), mesh injection is excellent and should be preferred for its simplicity. For testing infrastructure-level failures, OS-layer injection is necessary.

Why a dependency graph? Can I just let engineers specify what to target?

You can, and most platforms start that way. The dependency graph becomes critical when you have more than 20 services and engineers stop knowing who calls whom. Without the graph, an experiment targeting “Notification Service” might inadvertently affect “Order Service” through a shared dependency that nobody remembered to exclude. The graph makes implicit relationships explicit and automatable - it shifts blast radius analysis from tribal knowledge to an algorithm.

How do you handle experiments that target stateful services like databases?

With extreme caution and additional gates. Database-targeting experiments (simulating disk I/O saturation, connection pool exhaustion) should be classified as tier-1, which requires explicit approval, and should never run against primary instances during business hours. The blast radius for database faults is inherently wider because every service that reads from that DB is affected - the dependency graph must model database fanout explicitly, including read replicas and connection poolers.

Why does the Watchdog default to halt when it loses Prometheus connectivity?

The fail-safe principle: uncertainty about safety is itself unsafe. If the Watchdog cannot verify that SLOs are being met, it cannot continue to vouch for the safety of the ongoing experiment. Halting on connectivity loss means you trade some false positives (experiments halted when SLOs were actually fine) for zero false negatives (faults left active when SLOs were degraded). In practice, Prometheus connectivity loss during an experiment is itself a signal that something is wrong with the infrastructure.
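The decision rule can be sketched as a small state machine. The class name `WatchdogDecision` and the string return values are hypothetical. The two-consecutive-breach requirement comes from the interview section of this article, and `None` stands for "the Prometheus query failed or timed out".

```python
from typing import Optional

class WatchdogDecision:
    """Decides halt/continue for one SLO assertion on each poll (illustrative)."""

    def __init__(self, breaches_to_halt: int = 2):
        self.breaches_to_halt = breaches_to_halt
        self.consecutive_breaches = 0

    def observe(self, slo_ok: Optional[bool]) -> str:
        if slo_ok is None:
            # Fail-safe: no visibility means no basis for continuing
            return "halt"
        if slo_ok:
            self.consecutive_breaches = 0  # a healthy poll resets the streak
            return "continue"
        self.consecutive_breaches += 1
        if self.consecutive_breaches >= self.breaches_to_halt:
            return "halt"
        return "continue"

watchdog = WatchdogDecision()
polls = [True, False, True, False, False]
print([watchdog.observe(p) for p in polls])
# ['continue', 'continue', 'continue', 'continue', 'halt']

print(WatchdogDecision().observe(None))  # halt
```

The asymmetry is deliberate: a single noisy breach is tolerated, but a single failed query is not, because a breach is a measurement and a failed query is the absence of one.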

Can you run chaos experiments in staging instead of production?

You should, and you should do it first. Staging experiments build confidence that the fault injection mechanics work correctly before touching production. But staging-only chaos has a fundamental limitation: staging traffic patterns, data volumes, and third-party integrations differ from production in ways that matter for resilience testing. Netflix’s famous insight was that the only environment that accurately simulates production is production. Staging experiments find implementation bugs in the chaos platform itself; production experiments find architectural weaknesses in the services.

Interview Questions

Design the auto-halt mechanism for a chaos engineering platform. What are the failure modes?

Expected depth: Describe the polling architecture (Prometheus queries at 15s intervals), the two-consecutive-failure requirement to avoid noise-induced halts, the idempotent halt API, and the durable rollback manifest. Discuss failure modes: watchdog crash (separate process, supervisor restarts it), Prometheus connectivity loss (default to halt), concurrent halt signals (idempotent), and partial revert (reconciliation loop). Strong answers discuss the watchdog’s fail-safe default and explain why the rollback manifest must be written before fault application.

How do you prevent a chaos experiment from accidentally causing a production incident?

Expected depth: Cover all three blast radius control layers: tier-based service exclusion (tier-0 services never targeted), instance percentage cap (max 5% globally), and global concurrency limit (max 20 concurrent). Describe the steady-state hypothesis check (2-minute pre-experiment verification). Explain the approval gate for tier-1 services and maintenance window checks. Strong answers also mention the reconciliation loop that detects phantom experiments and the agent self-revert TTL as a last-resort safety net.

Walk me through the data model for storing chaos experiment results. What indexing strategy would you use?

Expected depth: Describe the experiments table with JSONB fault_spec and slo_assertions, the experiment_services junction table with role classification, and the experiment_metrics hypertable (TimescaleDB). Explain why metric snapshot rows store threshold values alongside metric values (thresholds change over time). Discuss the TimescaleDB hypertable for time-range queries and the index on (experiment_id, sampled_at DESC) for per-experiment metric retrieval. Strong answers mention partitioning strategy (weekly chunks) and retention policy (hot for 90 days, archive after).
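As a rough illustration of the metrics table and its index, the sketch below uses Python's built-in `sqlite3` as a stand-in for TimescaleDB, so hypertables, chunking, and retention are out of scope. The table and index key come from the answer above. The remaining column names and all sample data are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE experiment_metrics (
    experiment_id   TEXT    NOT NULL,
    sampled_at      INTEGER NOT NULL,  -- unix seconds
    metric_name     TEXT    NOT NULL,
    metric_value    REAL    NOT NULL,
    threshold_value REAL    NOT NULL   -- threshold in force when sampled
);
CREATE INDEX idx_metrics_experiment_time
    ON experiment_metrics (experiment_id, sampled_at DESC);
""")

rows = [
    ("exp-1", 100, "p99_latency_ms", 180.0, 250.0),
    ("exp-1", 115, "p99_latency_ms", 240.0, 250.0),
    ("exp-1", 130, "p99_latency_ms", 230.0, 200.0),  # threshold tightened to 200
    ("exp-2", 130, "p99_latency_ms", 900.0, 250.0),
]
conn.executemany("INSERT INTO experiment_metrics VALUES (?, ?, ?, ?, ?)", rows)

def breached_samples(conn, experiment_id):
    """Samples that exceeded the threshold that applied at sampling time."""
    return conn.execute(
        """SELECT sampled_at, metric_value, threshold_value
           FROM experiment_metrics
           WHERE experiment_id = ? AND metric_value > threshold_value
           ORDER BY sampled_at DESC""",
        (experiment_id,),
    ).fetchall()

print(breached_samples(conn, "exp-1"))  # [(130, 230.0, 200.0)]
```

The sample at t=115 (240 ms) is correctly treated as healthy because the threshold was 250 ms at the time. Re-evaluating it against today's 200 ms threshold would rewrite history, which is why the threshold is stored on every row.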

How would you build the service dependency graph and keep it current?

Expected depth: Describe scraping options: service mesh telemetry (Istio/Linkerd service graphs, most accurate), distributed tracing span relationships (good for async calls), and static service registry declarations (easy but stale). Explain why the graph must traverse both directions (callers and callees are both in blast radius). Discuss staleness: mesh telemetry refreshes every 5 minutes, sufficient for experiment planning but potentially missing newly deployed services. Strong answers mention the graph edge metadata (call rate, P99 latency) used to weight blast radius impact and discuss handling external dependencies that should always be excluded.
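One of those sources, distributed tracing, can be sketched concretely: a caller-to-callee edge exists wherever a span's parent belongs to a different service. The span dictionary shape and the function name `graph_from_spans` are assumptions for this example, not any tracing system's real export format.

```python
def graph_from_spans(spans):
    """Derive {service: [dependencies]} from parent/child span relationships."""
    by_id = {span["span_id"]: span for span in spans}
    edges = {}
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        if parent is None:
            continue  # root span, or parent not captured in this batch
        if parent["service"] == span["service"]:
            continue  # in-process child span, not a cross-service call
        edges.setdefault(parent["service"], set()).add(span["service"])
    return {caller: sorted(callees) for caller, callees in edges.items()}

# One hypothetical trace: checkout -> orders -> (payments, inventory)
spans = [
    {"span_id": "a", "parent_id": None, "service": "checkout"},
    {"span_id": "b", "parent_id": "a", "service": "orders"},
    {"span_id": "c", "parent_id": "b", "service": "orders"},  # internal work
    {"span_id": "d", "parent_id": "b", "service": "payments"},
    {"span_id": "e", "parent_id": "c", "service": "inventory"},
]
print(graph_from_spans(spans))
# {'checkout': ['orders'], 'orders': ['inventory', 'payments']}
```

The output uses the same `{service: [dependencies]}` shape that the blast radius calculation consumes, so graphs from several sources could be merged by taking the union of their edges.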

A chaos experiment is running and both the control plane and the Fault Agent lose network connectivity simultaneously. What happens?

Expected depth: The agent’s local rollback manifest (written to disk at fault application time) contains the complete undo operations. The agent’s experiment TTL (set to 2x experiment duration) triggers self-revert when it expires. The control plane, when it recovers, runs the reconciliation loop to detect any agents still in a fault state. The SLO Watchdog (if running in a separate availability zone) may still be polling Prometheus and may have already fired a halt signal - the Rollback Controller retries halts until agents acknowledge. Strong answers identify the race condition: if the agent reverts before the control plane recovers and the control plane then tries to halt, the idempotent halt must handle “already reverted” gracefully.
