Build a Distributed Lock Manager and Leader Election Service


distributed-systems reliability scalability

System Design Deep Dive

Distributed Lock Manager and Leader Election

Mutual exclusion across a cluster, lease-based locks that survive node failure, and safe leader election for singleton workloads

14 min readAdvancedDistributed-Systems

Imagine 50 workers at a construction site sharing one excavator. Without a checkout board, three workers grab the controls simultaneously, each convinced they have exclusive access. The machine fights itself. At best, they waste time undoing each other’s work. At worst, someone gets hurt. A lock manager is that checkout board for distributed systems - a shared authority that ensures only one process controls a shared resource at any instant.

The checkout-board analogy breaks down at scale. A physical board is a single point of failure: if the site supervisor goes home, nobody can check out equipment. A distributed lock manager must itself be highly available. It needs to survive the crash of any individual node while maintaining the invariant that at most one holder can possess each lock. That tension - availability versus mutual exclusion - is what makes building this service hard. Zookeeper solved it with ZAB consensus in 2010, etcd with Raft in 2013, and Google’s Chubby with Paxos internally. Each of these systems exists because the naive alternatives all fail in production.

Three failures lurk in naive implementations. First, a lock holder that crashes while holding the lock blocks every other waiter indefinitely - unless the lock has a time-to-live that expires it automatically. Second, a lock holder that takes too long due to garbage collection or network partition can have its lease expired by the lock service while it still believes it holds the lock, then both the original holder and the new acquirer are writing to the shared resource simultaneously - the classic split-brain scenario. Third, when the lock service itself fails over to a new leader, waiters that rush to acquire newly freed locks all arrive at once - the thundering herd that can overwhelm the recovering service.

Solving these cleanly requires three core architectural decisions: Raft consensus to make the lock service itself highly available without split-brain, lease-based locking with fencing tokens to protect resources from stale lock holders, and a watch API so clients react to leadership changes in near-real-time without polling.

Requirements and Constraints

Functional Requirements

  • Provide mutual exclusion: at most one client holds a named lock at any instant
  • Support lease-based locks with configurable TTL (1s to 600s); expired automatically on holder failure
  • Support leader election: exactly one candidate wins the “leader” lock per named election
  • Provide a watch API: clients subscribe to lock/leader change events with at-most-once delivery
  • Issue monotonically increasing fencing tokens with each lock grant to protect downstream resources
  • Support lock hierarchy: namespaced keys (e.g., /locks/billing/batch-job)
  • Allow lock renewal before expiry (heartbeat-based keepalive)

Non-Functional Requirements

  • Lock service availability: 99.99% (four nines) with automatic failover under 500ms
  • Lock acquisition latency: p50 under 5ms, p99 under 20ms within a datacenter
  • Support 50,000 concurrent lock holders per cluster
  • Support 10,000 lock acquire/release operations per second per shard
  • Watch event delivery latency: under 100ms after state change
  • Fencing token counter must never decrease across leader elections

Constraints and Assumptions

  • Minimum cluster size is 3 nodes for a quorum majority of 2; production uses 5 nodes for tolerance of 2 simultaneous failures
  • Network round-trip time between nodes is under 5ms; design does not address WAN consensus
  • Clients run their own heartbeat loops; the server does not push keep-alive reminders
  • Lock service is a coordination primitive, not a general-purpose key-value store - maximum value size is 4KB

High-Level Architecture

Six components make the system work: a gRPC API layer that clients talk to for lock acquire/release/watch, a Raft consensus cluster that replicates all state changes across nodes, a Lease Store that holds the in-memory state of every active lock with its TTL and fencing token, a Fencing Token Counter (a monotonically increasing integer per lock key, replicated through Raft), a Watch Engine that maintains client subscriptions and pushes state-change events, and a Lease Reaper that continuously expires TTL-exceeded leases.

Distributed lock manager architecture showing Raft cluster, lease store, watch engine, and fencing token counter

Data flows like this: a client sends AcquireLock(key="billing/job", ttl=30s) to any cluster node. Followers proxy write requests to the current Raft leader. The leader proposes the lock grant as a Raft log entry - {op: LOCK, key: "billing/job", holder: "client-A", token: 42, expires_at: T+30s}. Once a majority of nodes append this entry to their log and commit it, the leader applies the entry to its in-memory Lease Store, increments the fencing token counter, and responds to the client with {granted: true, token: 42, expires_at: T+30s}. The Watch Engine simultaneously notifies all subscribers watching billing/job that it has been acquired.

When the client’s TTL is about to expire, it sends RenewLock(key, token=42) to extend the lease. If the client crashes before renewing, the Lease Reaper running every 100ms detects the expiry, Raft-commits a {op: DELETE, key: "billing/job"} entry, and the Watch Engine notifies all waiters that the lock is free. The next acquirer gets token=43 - and any write the crashed client attempts using token=42 is rejected by the downstream resource.

Key Insight

The fencing token is not just an identifier - it is a causality proof. Every lock grant increments the counter atomically through Raft. A resource that tracks the highest token it has seen can reject any write from a token lower than its stored maximum, making split-brain writes safe to detect without asking the lock service anything.

The Lock Service

The Lock Service is the front door for all client operations. It wraps a Raft state machine that ensures every lock grant, renewal, and release is replicated to a majority of cluster nodes before acknowledging the client. Any node can serve read-only requests (checking whether a key is locked); only the Raft leader can process writes.

Think of the Raft cluster like a parliamentary vote: a motion is not carried until a majority of members raise their hands. The leader proposes the motion (lock grant), collects acknowledgments from followers, and only then announces the result. If the leader’s connection to the parliament drops mid-vote, the parliament elects a new speaker and starts fresh - but no motion is ever announced without a recorded majority.

The critical property is that the fencing token counter must survive leader elections with strictly increasing values. We achieve this by storing the counter in the Raft log itself. When a new leader is elected, it replays the log from the last committed entry, reconstructing the exact counter value where the previous leader left off. No token is ever reused.

package lockservice

import (
    "context"
    "fmt"
    "time"

    "github.com/hashicorp/raft"
)

// LockCommand is the entry written to the Raft log for every lock operation.
type LockCommand struct {
    Op        string    // "ACQUIRE", "RENEW", "RELEASE"
    Key       string    // lock key, e.g. "/locks/billing/job"
    HolderID  string    // client identifier
    TTL       int64     // lease duration in seconds
    Token     uint64    // fencing token assigned on ACQUIRE
    ExpiresAt time.Time // wall-clock expiry for Lease Reaper
}

// LockResponse is returned to the client after a committed Raft entry.
type LockResponse struct {
    Granted   bool
    Token     uint64
    ExpiresAt time.Time
    Reason    string // if !Granted, explains why
}

// AcquireLock proposes a lock grant through Raft consensus.
// Returns only after a majority of nodes have acknowledged the log entry.
func (s *LockServer) AcquireLock(ctx context.Context, key, holderID string, ttlSecs int64) (*LockResponse, error) {
    // Only the Raft leader can commit writes.
    if s.raft.State() != raft.Leader {
        leader := string(s.raft.Leader())
        return nil, fmt.Errorf("not leader; redirect to %s", leader)
    }

    s.mu.RLock()
    existing, held := s.leaseStore[key]
    s.mu.RUnlock()

    if held && time.Now().Before(existing.ExpiresAt) {
        return &LockResponse{
            Granted: false,
            Reason:  fmt.Sprintf("held by %s until %s", existing.HolderID, existing.ExpiresAt),
        }, nil
    }

    // Increment the per-key fencing token counter atomically via Raft.
    newToken := s.nextToken(key)
    cmd := LockCommand{
        Op:        "ACQUIRE",
        Key:       key,
        HolderID:  holderID,
        TTL:       ttlSecs,
        Token:     newToken,
        ExpiresAt: time.Now().Add(time.Duration(ttlSecs) * time.Second),
    }

    future := s.raft.Apply(encodeCommand(cmd), 5*time.Second)
    if err := future.Error(); err != nil {
        return nil, fmt.Errorf("raft apply: %w", err)
    }

    return &LockResponse{
        Granted:   true,
        Token:     newToken,
        ExpiresAt: cmd.ExpiresAt,
    }, nil
}
Real World

etcd, which backs Kubernetes’ entire control plane, implements almost exactly this design. Every etcd write goes through Raft via etcd/server/etcdserver/raft.go. The fencing token equivalent is the Revision field - a globally monotone integer that increments with every write. Resources that need to detect stale writes compare the Revision on the incoming request with the highest they have seen. The lease.go package implements TTL-based expiry: a background goroutine ticks every 500ms and batch-expires all leases past their checkpoint.

The Lease Manager

Lease-based locking is the answer to an uncomfortable truth: a lock that never expires is a deadlock waiting to happen. The moment a lock holder crashes, gets GC-paused, or loses network connectivity, its lock becomes permanent without a TTL. Every other waiter is blocked forever.

The analogy is a library book loan. A borrower takes a book for two weeks. If they vanish, the library doesn’t wait forever - after two weeks the book is simply available again. The fencing token is the date-stamp: a book returned after the due date can’t retroactively block later borrowers from reading it.

The Lease Manager maintains an in-memory map of key -> {holder, token, expires_at} that is rebuilt by replaying the Raft log on startup. The Lease Reaper goroutine scans this map every 100ms and expires any entry where now > expires_at.

// LeaseEntry holds the live state for a single held lock.
type LeaseEntry struct {
    HolderID  string
    Token     uint64
    ExpiresAt time.Time
    Key       string
}

// LeaseReaper runs continuously, expiring TTL-exceeded locks via Raft.
func (s *LockServer) LeaseReaper(ctx context.Context) {
    ticker := time.NewTicker(100 * time.Millisecond)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            if s.raft.State() != raft.Leader {
                // Only the leader expires leases to avoid duplicate commits.
                continue
            }
            s.mu.RLock()
            now := time.Now()
            var expired []LeaseEntry
            for _, entry := range s.leaseStore {
                if now.After(entry.ExpiresAt) {
                    expired = append(expired, entry)
                }
            }
            s.mu.RUnlock()

            for _, entry := range expired {
                cmd := LockCommand{Op: "RELEASE", Key: entry.Key, HolderID: "reaper"}
                // Fire-and-forget; Raft will apply and notify Watch Engine.
                _ = s.raft.Apply(encodeCommand(cmd), 2*time.Second)
            }
        }
    }
}

// RenewLock extends a held lease's TTL if the token matches the current holder.
func (s *LockServer) RenewLock(ctx context.Context, key, holderID string, token uint64, newTTL int64) error {
    s.mu.RLock()
    entry, ok := s.leaseStore[key]
    s.mu.RUnlock()

    if !ok || entry.Token != token || entry.HolderID != holderID {
        return fmt.Errorf("renewal rejected: token mismatch or lock not held by caller")
    }

    cmd := LockCommand{
        Op:        "RENEW",
        Key:       key,
        HolderID:  holderID,
        Token:     token,
        ExpiresAt: time.Now().Add(time.Duration(newTTL) * time.Second),
    }
    return s.raft.Apply(encodeCommand(cmd), 5*time.Second).Error()
}
Watch Out

The most dangerous failure mode is a process that is alive but paused. A Java application experiencing a 45-second GC stop-the-world, or a Python process waiting for an OS page fault, will have its lease expire even though it is not crashed. When it resumes, it still believes it holds the lock. This is why resources must validate the fencing token on every write - the lock service alone cannot protect against a paused-then-resumed holder that tries to act after its lease expired.

Leader Election Protocol

Leader election is just a special case of distributed locking: multiple candidates compete to acquire a lock named for the singleton role (e.g., /election/scheduler-leader). The first candidate that successfully acquires the lock via Raft is the leader. Others watch for the lock’s deletion and retry when it is released.

The watch mechanism is what makes this efficient. Without a watch API, every candidate would poll the lock service repeatedly to check if the leader lock is free - creating a thundering herd every time the leader changes. With a watch, each candidate registers a one-time subscription: “notify me when /election/scheduler-leader is deleted.” When the leader crashes and the Lease Reaper expires its lock, the Watch Engine pushes a single event to every registered watcher. All candidates then attempt to acquire simultaneously, and Raft serializes their competing writes into a single winner.

import grpc
import time
import threading
from lockservice_pb2 import AcquireRequest, WatchRequest
from lockservice_pb2_grpc import LockServiceStub

class LeaderElector:
    def __init__(self, candidate_id: str, election_key: str, ttl_seconds: int = 15):
        self.candidate_id = candidate_id
        self.election_key = election_key
        self.ttl = ttl_seconds
        self.is_leader = False
        self.current_token = None
        self._stop = threading.Event()

    def run(self, stub: LockServiceStub, on_leader: callable, on_demotion: callable):
        """Main election loop. Blocks until elected, then runs on_leader callback."""
        while not self._stop.is_set():
            resp = stub.Acquire(AcquireRequest(
                key=self.election_key,
                holder_id=self.candidate_id,
                ttl_seconds=self.ttl,
            ))

            if resp.granted:
                self.is_leader = True
                self.current_token = resp.token
                # Start heartbeat renewal in background thread.
                hb = threading.Thread(target=self._heartbeat, args=(stub,), daemon=True)
                hb.start()
                try:
                    on_leader(resp.token)  # Caller does its leader work here.
                finally:
                    self.is_leader = False
                    on_demotion()
            else:
                # Not elected - watch for the current holder to release.
                self._wait_for_vacancy(stub)

    def _heartbeat(self, stub: LockServiceStub):
        """Renew the lease every TTL/3 seconds to survive transient hiccups."""
        interval = max(1, self.ttl // 3)
        while self.is_leader and not self._stop.is_set():
            try:
                stub.Renew(RenewRequest(
                    key=self.election_key,
                    holder_id=self.candidate_id,
                    token=self.current_token,
                    new_ttl_seconds=self.ttl,
                ))
            except grpc.RpcError:
                # Network blip - try again next interval; lease still valid until TTL.
                pass
            time.sleep(interval)

    def _wait_for_vacancy(self, stub: LockServiceStub):
        """Block on a watch stream until the election lock is released."""
        for event in stub.Watch(WatchRequest(key=self.election_key)):
            if event.type == "DELETE":
                return  # Lock released - retry acquisition immediately.
Key Insight

Renewing at TTL/3 intervals gives three chances to renew before the lease expires under normal conditions. If one renewal fails due to a transient network blip, the next attempt (TTL/3 seconds later) still has TTL * 2/3 seconds of buffer. This is the same renewal cadence etcd recommends for its leases and is also how Kubernetes’ leader-election library operates.

Data Model

The Lease Store state machine is a map with two backing structures: an in-memory hash map for O(1) lookup by key, and a min-heap sorted by expires_at for O(log n) TTL scanning. Both are rebuilt by replaying the Raft log from the last snapshot on startup.

-- For persistence across full cluster restarts, the Raft log is checkpointed
-- to a snapshot store (RocksDB) every 10,000 entries.

-- Snapshot table written by the leader during compaction:
CREATE TABLE lock_snapshots (
  id            BIGSERIAL PRIMARY KEY,
  snapshot_data BYTEA NOT NULL,        -- protobuf-encoded lease map
  raft_index    BIGINT NOT NULL,        -- log index this snapshot covers
  raft_term     BIGINT NOT NULL,        -- Raft term at snapshot time
  created_at    TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Fencing token registry per key - for audit and debugging:
CREATE TABLE fencing_token_log (
  key           TEXT NOT NULL,
  token         BIGINT NOT NULL,
  holder_id     TEXT NOT NULL,
  granted_at    TIMESTAMPTZ NOT NULL,
  released_at   TIMESTAMPTZ,           -- NULL if still held at snapshot time
  ttl_seconds   INT NOT NULL,
  release_cause TEXT,                  -- 'explicit', 'ttl_expired', 'reaper'
  PRIMARY KEY (key, token)
);

-- Watch subscription registry (in-memory only, not persisted):
-- Map[key -> []ClientStream]
-- On leader election, watch subscribers must re-register.

The key partitioning strategy for scaling across multiple Raft shards uses consistent hashing: shard = consistent_hash(key) % num_shards. Each shard is an independent 3-node Raft group. The coordinator layer routes client requests to the correct shard based on the key prefix. Shard boundaries are defined by key ranges (/locks/a-h, /locks/i-q, /locks/r-z) rather than hashes to allow range-based listing.

Lock acquire and release data flow showing fencing token issuance and stale writer rejection

The fencing_token_log table is the audit trail that makes post-mortems possible. When a split-brain incident occurs and a resource reports seeing a rejected write, you can query which holder had the stale token, when their lease expired, and what the winning token value was.

Key Algorithms and Protocols

Fencing Tokens

Fencing tokens solve the problem that the lock service and the protected resource are separate systems with no shared heartbeat. Consider a storage system that accepts writes: when Client A holds the lock (token=42) and writes data, the storage system is fine. If Client A’s lease expires (token=43 is issued to Client B) but Client A’s write is delayed in the network and arrives after Client B’s, the storage system would incorrectly let A’s stale write overwrite B’s data.

The fix is for the storage system to track the highest fencing token it has seen and reject any write with a lower token:

class FencedStorage:
    """Storage resource that enforces fencing tokens to prevent stale writes."""

    def __init__(self):
        self._data = {}
        self._max_token_seen = {}  # key -> max token seen for that key

    def write(self, key: str, value: bytes, fencing_token: int) -> bool:
        """
        Accept write only if fencing_token >= max seen for this key.
        This rejects writes from stale lock holders whose lease has expired.
        """
        max_seen = self._max_token_seen.get(key, 0)
        if fencing_token < max_seen:
            # Stale write: reject without modifying state.
            raise ValueError(
                f"Fencing token {fencing_token} is below max seen {max_seen}. "
                f"Write rejected as stale. Re-acquire the lock."
            )
        # Accept: update max token and persist value.
        self._max_token_seen[key] = max(fencing_token, max_seen)
        self._data[key] = value
        return True

Note that the storage system does not need to contact the lock service to validate the token. It only needs to maintain a counter - making token validation a pure local operation with no network round-trips.

Raft Consensus for Lock State

Raft ensures that every lock grant is committed on a majority of nodes before the client is told “granted.” The diagram below shows the three phases of Raft leader election: normal operation with heartbeats, crash detection and candidate promotion, and the new leader asserting control.

Raft leader election showing normal heartbeat phase, crash detection, and new leader election across three phases

The Raft log for a lock service looks like this:

index=1  term=1  ACQUIRE key=/locks/billing/job  holder=node-a  token=1  ttl=30s
index=2  term=1  ACQUIRE key=/locks/reports/daily holder=node-b  token=1  ttl=60s
index=3  term=1  RELEASE key=/locks/billing/job  holder=node-a  token=1
index=4  term=1  ACQUIRE key=/locks/billing/job  holder=node-c  token=2  ttl=30s
index=5  term=2  <no-op: new leader asserts control, re-broadcasts term=2>
index=6  term=2  RENEW   key=/locks/billing/job  holder=node-c  token=2  ttl=30s
index=7  term=2  RELEASE key=/locks/reports/daily holder=reaper  [TTL expired]

The no-op entry at index 5 is critical: when a new Raft leader is elected, it immediately commits a no-op to its own term. This ensures no stale entries from the previous term can be committed retroactively - a subtlety of Raft’s log commitment rule that etcd’s implementation explicitly handles.

// FSM is the Raft finite state machine that applies committed log entries.
// This is the single source of truth for the in-memory Lease Store.
type FSM struct {
    mu         sync.RWMutex
    leaseStore map[string]*LeaseEntry
    tokenMap   map[string]uint64   // per-key fencing token counter
    watchCh    chan WatchEvent      // feeds Watch Engine
}

func (f *FSM) Apply(log *raft.Log) interface{} {
    cmd := decodeCommand(log.Data)
    f.mu.Lock()
    defer f.mu.Unlock()

    switch cmd.Op {
    case "ACQUIRE":
        f.leaseStore[cmd.Key] = &LeaseEntry{
            HolderID:  cmd.HolderID,
            Token:     cmd.Token,
            ExpiresAt: cmd.ExpiresAt,
            Key:       cmd.Key,
        }
        f.tokenMap[cmd.Key] = cmd.Token
        f.watchCh <- WatchEvent{Type: "PUT", Key: cmd.Key, Token: cmd.Token}

    case "RELEASE":
        delete(f.leaseStore, cmd.Key)
        f.watchCh <- WatchEvent{Type: "DELETE", Key: cmd.Key}

    case "RENEW":
        if entry, ok := f.leaseStore[cmd.Key]; ok && entry.Token == cmd.Token {
            entry.ExpiresAt = cmd.ExpiresAt
        }
    }
    return nil
}

func (f *FSM) nextToken(key string) uint64 {
    f.tokenMap[key]++
    return f.tokenMap[key]
}

Split-Brain Prevention

Split-brain occurs when network partitioning causes two nodes to both believe they are the Raft leader. Raft’s quorum requirement prevents this structurally: a node can only commit entries if it receives acknowledgment from a majority of nodes. In a 3-node cluster, a network partition creates groups of 1 and 2 or 2 and 1. The group of 2 can form a quorum; the group of 1 cannot. So only one partition can commit writes - the one with the majority.

The danger is not in correct Raft operation but in a leader that is temporarily isolated. It stops receiving ACKs and will step down after a timeout (election timeout, typically 300ms to 500ms). During this window, it stops serving writes because raft.Apply() blocks waiting for ACKs that never arrive. Clients see a 500ms timeout, then the newly elected leader takes over. Total downtime: under one second in a LAN-connected cluster.

// The lock server checks leadership before every write operation.
// This is a defense-in-depth check - Raft itself enforces this,
// but checking early gives a faster, clearer error to clients.
func (s *LockServer) checkLeader() error {
    if s.raft.State() != raft.Leader {
        addr, id := s.raft.LeaderWithID()
        return fmt.Errorf("not leader: redirect to node %s at %s", id, addr)
    }
    return nil
}
Watch Out

A common mistake is writing a custom split-brain prevention mechanism on top of Raft rather than trusting Raft’s quorum requirement. One anti-pattern: checking leadership via a separate “ping the leader” gossip message before processing writes. This creates a window where the gossip says “you are leader” but Raft has already stepped you down. Always use the authoritative Raft state (raft.State() == Leader) and let Raft’s own acknowledgment mechanism prove majority consensus before committing.

Watch API for Leadership Changes

The Watch API turns polling into event-driven notification. Instead of 1,000 candidates each polling the lock service every second (1,000 RPS of overhead), each candidate opens one persistent gRPC streaming connection and blocks until a relevant event arrives.

// WatchStream implements the server-streaming Watch RPC.
// The client receives events for all state changes on the watched key.
func (s *LockServer) Watch(req *pb.WatchRequest, stream pb.LockService_WatchServer) error {
    sub := s.watchEngine.Subscribe(req.Key)
    defer s.watchEngine.Unsubscribe(req.Key, sub)

    for {
        select {
        case <-stream.Context().Done():
            return nil
        case event := <-sub:
            if err := stream.Send(&pb.WatchEvent{
                Type:  event.Type,
                Key:   event.Key,
                Token: event.Token,
            }); err != nil {
                return err
            }
        }
    }
}

// WatchEngine manages subscriber maps for each watched key.
// After every Raft Apply(), the FSM pushes a WatchEvent to this engine.
type WatchEngine struct {
    mu   sync.RWMutex
    subs map[string][]chan WatchEvent
}

func (w *WatchEngine) Subscribe(key string) chan WatchEvent {
    ch := make(chan WatchEvent, 8) // buffered to avoid blocking FSM Apply
    w.mu.Lock()
    w.subs[key] = append(w.subs[key], ch)
    w.mu.Unlock()
    return ch
}

func (w *WatchEngine) Dispatch(event WatchEvent) {
    w.mu.RLock()
    subs := w.subs[event.Key]
    w.mu.RUnlock()
    for _, ch := range subs {
        select {
        case ch <- event:
        default:
            // Slow subscriber: drop event. Client reconnects on stream error.
        }
    }
}

Thundering Herd on Failover

When a leader crashes, all watchers receive a DELETE event simultaneously and all attempt to acquire the newly freed lock at the same instant. With 1,000 candidates, this is 1,000 simultaneous Raft proposals. Only one wins; the other 999 proposals fail and must be retried - but all 999 clients immediately retry, creating another wave of 999 proposals. This oscillates until the cluster is overwhelmed.

The fix is jittered backoff before retry. Candidates that fail to acquire the lock should wait a random interval before retrying, not immediately:

import random
import time

def acquire_with_jitter(stub, key: str, holder_id: str, ttl: int, max_attempts: int = 10):
    """
    Acquire a lock with exponential backoff and jitter.
    Prevents thundering herd when many candidates compete for the same lock.
    """
    base_delay = 0.05  # 50ms base
    for attempt in range(max_attempts):
        resp = stub.Acquire(AcquireRequest(key=key, holder_id=holder_id, ttl_seconds=ttl))
        if resp.granted:
            return resp

        if attempt < max_attempts - 1:
            # Exponential backoff with full jitter: sleep uniformly within [0, base * 2^attempt]
            cap = min(base_delay * (2 ** attempt), 2.0)  # max 2 second wait
            delay = random.uniform(0, cap)
            time.sleep(delay)

    raise RuntimeError(f"Failed to acquire lock {key!r} after {max_attempts} attempts")

This approach, pioneered by AWS in their 2015 “Exponential Backoff and Jitter” post, reduces thundering herd from 1,000 simultaneous retries to a Poisson-distributed arrival rate where the lock service can keep up.

Scaling and Performance

The lock service scales across three dimensions: more Raft nodes for fault tolerance (not throughput), more shards for horizontal lock capacity, and read-replicas for the watch/list workloads.

Scaling architecture with key range sharding, coordinator, and capacity estimation

Raft cluster size trades throughput for fault tolerance. A 3-node cluster tolerates 1 failure; a 5-node cluster tolerates 2 failures. Adding beyond 5 nodes hurts throughput because every write must wait for more nodes to acknowledge before committing. For production, 5 nodes is the sweet spot. Each additional shard gets its own independent 5-node Raft group.

Key sharding linearly scales lock capacity. Each shard handles approximately 10,000 acquire/release operations per second. With 3 shards covering different key prefixes, the cluster handles 30,000 operations per second. Shards are independent - a leader election on Shard B does not affect lock acquisition on Shard A.

Read scaling - watch subscriptions and lock-status queries can be served by Raft followers. A follower’s in-memory Lease Store lags the leader by at most one Raft round-trip (2-5ms on LAN). For most use cases, serving watch subscriptions from followers reduces leader load by 40-60%.

Capacity estimation for a 3-shard, 5-node-per-shard cluster:

  Lock throughput: 3 shards * 10,000 ops/sec = 30,000 ops/sec
  Concurrent locks: 3 shards * 50,000 leases = 150,000 active leases
  Watch subscribers: 3 shards * 5,000 streams = 15,000 active watch connections

  Memory per node:
    Lease entry: ~256 bytes (key + holder + token + TTL)
    50,000 leases * 256 bytes = ~12.8 MB per shard
    5 shards (if single cluster): ~64 MB total lease state

  Raft log storage (RocksDB):
    At 30,000 ops/sec and 100 bytes/entry: 3 MB/sec
    With 10,000-entry snapshot interval: max log size = ~1 GB before compaction
    Steady-state: recent 10,000 entries = ~1 MB in memory

  Heartbeat traffic (5-node cluster, 150ms interval):
    4 followers * (1 HB / 0.15s) = ~27 RPCs/sec from leader per shard
    Negligible compared to client write traffic
Real World

Kubernetes runs etcd as a 5-node cluster handling all lock and leader-election operations for the control plane. At large clusters (1,000+ nodes), etcd handles approximately 5,000 to 10,000 writes per second for Endpoint, ConfigMap, and Lease updates. The Kubernetes leader-election library uses the same TTL-based lease pattern described here, stored as a coordination.k8s.io/Lease object, with a default lease duration of 15 seconds and a renewal interval of 10 seconds - exactly the TTL/3 renewal pattern we implement in the LeaderElector above.

Failure Modes and Recovery

FailureDetectionImpactRecovery
Lock holder crashes mid-operationTTL expires; Lease Reaper detects after up to 100msResource locked for up to remaining TTLReaper commits DELETE via Raft; Watch notifies waiters; new acquirer gets incremented token
Raft leader crashElection timeout fires on followers (150ms-300ms random)No writes accepted during election window (~500ms)Followers elect new leader; clients redirect; fencing token counter preserved via log replay
Network partition (minority side)raft.Apply() times out after 5 secondsMinority partition stops serving writes; clients get redirect errorMajority partition elects new leader; minority rejoins and replays log diff
Split-brain (incorrect Raft implementation bug)Dual-master detection: two leaders issue same tokenTwo clients both told “granted” for same keyFencing token validation on resources catches stale writer; both holders notified on reconnect
Watch subscriber slow consumerEvent channel fill; drop flag setSubscriber misses events; stale view of stateClient detects gap by reconnecting and re-listing; Watch Engine clears stale subscriptions
Thundering herd on leader failoverSpike in acquire failures; latency spikeLock service temporarily overloaded; client timeoutsJittered exponential backoff smooths arrival; Raft processes at its own pace
Watch Out

The most common production mistake with distributed locks is setting TTL based on expected operation time rather than maximum safe delay. If a billing job typically takes 5 seconds, operators set TTL=10s for “double the buffer.” Then during a slow month-end with 8-second processing times, the lease expires while the job is still running. The next acquirer starts a second instance of the billing job, creating duplicate charges. Set TTL to the p99.9 of your operation duration multiplied by 3, then add circuit breaker logic that stops renewing if the operation is taking longer than expected - signaling the resource to voluntarily release.

Comparison of Approaches

ApproachConsistencyThroughputOperational CostBest Fit
Raft-based custom service (this design)Strong (linearizable)10K writes/sec/shardHigh (build + ops)Production coordination service; multiple teams
etcd (production-ready Raft)Strong (linearizable)10K-30K writes/secMedium (ops only)Kubernetes, existing etcd infrastructure
Zookeeper (ZAB consensus)Strong (sequential consistency)20K reads, 5K writes/secHigh (JVM, config)Legacy Hadoop/Kafka ecosystems
Redis SETNX + TTL (single node)Weak (no quorum)100K+ writes/secLowDev/test; low-stakes coordination
Redis Redlock (multi-node)Approximately strong (debated)50K writes/secMediumDistributed Redis; non-Raft deployments
PostgreSQL advisory locksStrong (MVCC)5K-10K writes/secLow (reuse existing DB)Single-region; already have Postgres
Consul sessionsStrong (Raft via Consul)5K writes/secLow (if using Consul)Service mesh teams already running Consul

Raft-based custom services are expensive to build but provide the strongest guarantees. For most engineering teams, etcd or Consul is the right answer: you get Raft consensus with battle-tested implementation, watch APIs, and fencing-token-equivalent revision numbers without writing a consensus implementation from scratch. Build a custom lock service only when you need custom semantics (hierarchical locks, lock cohorting) that off-shelf systems do not support.

Redis SETNX’s popularity is a cautionary tale: it is fast and simple until a Redis primary fails. At that moment, the sentinel promotes a replica that may not have replicated the lock key, and two holders believe they are the single owner. For any workload where double-execution causes data corruption, Redis single-node locking is the wrong choice.

Key Takeaways

  • Fencing tokens make lock-based coordination safe even when the lock service and protected resource disagree about who holds the lock - resources reject writes with tokens below their stored maximum without contacting the lock service.
  • Lease-based locking prevents indefinite blocking on holder failure; every lock must have a TTL, and the Lease Reaper must be the only authority that expires locks to avoid races.
  • Raft consensus gives the lock service itself the consistency it needs to be a source of truth - without Raft, the lock service would need its own lock service, recursing to infinity.
  • Split-brain prevention comes from Raft’s quorum requirement, not from application-level checks; a write that does not reach a majority is never committed, so two leaders cannot both issue tokens for the same key.
  • Watch API turns polling into push notification, reducing coordination overhead from O(candidates * poll-rate) to O(1) per state change.
  • Thundering herd on failover is prevented by jittered exponential backoff in the client retry logic, spreading thousands of simultaneous acquisition attempts across a 1-2 second window.
  • Lock granularity matters: coarse locks (one lock per service) serialize all coordination across a service; fine-grained locks (one lock per resource) maximize concurrency but increase lock service load - choose the most specific key that is still broader than the protected operation.
  • TTL sizing is the most consequential production decision: too short causes false expiry under GC pauses; too long blocks every waiter on holder failure for the remaining lease duration.

Distributed locking looks simple from the outside - mutual exclusion is a first-year CS concept. The depth comes from the failure modes at the intersection of clocks, network partitions, and process pauses. Every production incident involving a distributed lock can be traced back to one of three root causes: the lock had no TTL, the TTL was shorter than the real p99 operation time, or fencing tokens were not validated by the protected resource. Get these three right and the rest of the design is implementation detail.

Frequently Asked Questions

Q: Why not use a single database row with SELECT FOR UPDATE instead of building a full lock service?

A: SELECT FOR UPDATE works perfectly for single-region, single-database setups and you should use it if that is your scenario. It breaks in three ways at scale: first, the database itself becomes a single point of failure for all distributed coordination; second, SELECT FOR UPDATE holds an exclusive row lock for the entire transaction duration - if the client crashes mid-transaction, the lock is held until the connection times out (typically 60-120 seconds), which is far longer than a 5-30 second Raft-based lease; third, it does not provide fencing tokens, so you cannot protect resources outside the same database. Use a dedicated lock service when you need multi-region coordination, sub-second failure detection, or locking across heterogeneous resources.

Q: Why not use Redis SETNX for distributed locking in production?

A: Redis SETNX with TTL is fine for low-stakes coordination (rate limiting, deduplication hints) where a stale lock being held by a crashed client is not catastrophic. It fails for strong coordination because Redis replication is asynchronous: a primary can acknowledge a SETNX write, then crash before replicating to any replica. Sentinel promotes the replica, which has no record of the lock. A second client successfully acquires the “lock” while the crashed holder - if it restarts - still believes it holds it. The Redlock algorithm attempts to fix this with quorum writes across multiple Redis instances, but Martin Kleppmann’s 2016 analysis demonstrated it still fails under certain timing conditions. For storage mutations, payment operations, or leader election where double-execution has real consequences, use a Raft-based system.

Q: How do you handle a fencing token counter reset after a full cluster restart?

A: This is prevented by persisting the Raft log to durable storage (RocksDB or similar) on every node. On restart, each node replays its log from disk and reconstructs the exact token counter value. If all nodes lose their data simultaneously (catastrophic multi-node failure), the token counter resets to zero. To protect against this, replicate Raft snapshots to object storage (S3 or GCS) at regular intervals. On a full cluster cold start from snapshot, the first token issued starts at snapshot_token_max + 1. Protected resources that track max_seen_token across restarts (by persisting it to their own durable storage) can detect a counter rollback and refuse all writes until an operator confirms it is a fresh epoch.

Q: What happens to lock holders when a Raft leader election occurs?

A: Existing lock holders are not affected. Their leases remain valid with their original TTLs as recorded in the Raft log. The new leader replays the log, reconstructs the Lease Store including all active leases, and resumes serving requests. Holders continue sending heartbeat renewals; the new leader accepts them normally because the token and holder fields match the committed log entries. The only visible effect to clients is a brief period (under 500ms) where write requests time out while the election completes - clients should retry after a short backoff.

Q: How does the Watch API handle the case where a client disconnects and reconnects, potentially missing events?

A: The Watch Engine does not buffer missed events. When a client reconnects, it must re-list the current state via ListLocks() and then subscribe to future changes. This is the same “list-then-watch” pattern etcd uses for Kubernetes informers. The reconnecting client issues a watch request with the revision of the list response, and the Watch Engine replays any events committed after that revision. If the watch history for a given revision has been compacted (cleared by the snapshot process), the client must list-then-watch again from the current revision.

Q: How do you prevent a slow Lease Reaper from holding up lock expiry during a write-heavy workload?

A: The Lease Reaper runs independently of the lock acquisition path. A slow reaper means leases expire later than their TTL, not earlier - which causes resource starvation but not double-execution. To bound expiry latency, run the reaper loop every 100ms and process at most 1,000 expired leases per tick. If more than 1,000 leases expire in a 100ms window, you have a client-side failure (many clients crashed simultaneously) and the reaper will drain the backlog over subsequent ticks. For the typical case where only a handful of leases expire per second, the 100ms reaper provides expiry within TTL + 100ms of the configured deadline.

Interview Questions

Q: A client holds a lock with token=42. Its process pauses for 45 seconds due to a JVM GC. The lock’s TTL is 30 seconds. Walk me through exactly what happens to the lock, the resource, and the next acquirer.

Expected depth: Describe the timeline step by step. At T+30s, the Lease Reaper detects now > expires_at for the lock key. The Raft leader commits a DELETE entry. The Watch Engine notifies all waiters. Client B acquires the lock and receives token=43. Client B writes to the protected resource with token=43; the resource updates max_seen_token=43. At T+45s, Client A’s GC pause ends. It believes it still holds the lock with token=42. It attempts a write to the protected resource with fencing_token=42. The resource sees 42 < max_seen_token(43) and rejects the write with an error. Client A should detect this error, log the incident, and not retry - because retrying without re-acquiring the lock would still use the stale token. Discuss why Client A cannot simply re-acquire: by the time it detects the rejection, it has already corrupted its local state (it executed business logic assuming exclusive access). The correct response is to fail the operation and let the caller retry from scratch with a new lock acquisition.

Q: Design the fencing token counter such that it survives both individual node failures and complete cluster restarts without rolling back.

Expected depth: Explain that the counter is stored in the Raft log itself - every ACQUIRE entry includes the new token value, so replaying the log from disk recovers the exact counter state. For complete cluster restarts, Raft snapshots checkpointed to durable object storage (S3) preserve the counter at the snapshot point. After a full cold restart from the latest snapshot, the leader initializes the counter to snapshot_max_token + 1. Protected resources must persist their max_seen_token to durable storage so they can detect a rollback even across their own restarts. Discuss the edge case: if the snapshot was 10 minutes old and the cluster replays 10 minutes of operations from the WAL, there is no rollback. Only a total loss of all cluster data (all nodes lost with no object-storage backup) causes a rollback - and that should trigger a manual review before resuming, not automatic recovery.

Q: With 1,000 candidates watching a leader-election lock and the current leader crashes, how do you prevent all 1,000 from overwhelming the lock service in the moment of failover?

Expected depth: Describe jittered exponential backoff in the client. When candidates receive the DELETE watch event, they do not immediately retry - they sleep for a random interval in [0, jitter_max] where jitter_max = base_delay * 2^attempt. This spreads 1,000 simultaneous retries into a Poisson-distributed arrival rate that the Raft leader can process at its own pace. Mention the full jitter strategy (AWS recommendation) versus equal jitter - full jitter performs better under high concurrency. Also discuss Zookeeper’s approach of using ephemeral sequential nodes to turn the thundering herd into a queue: each candidate creates /election/candidate-XXXX and watches only the immediately preceding node in sequence order, so only one candidate is notified on each leader release. Trade off: simpler retry logic in the client (jitter) versus more complex lock service semantics (sequential ephemeral nodes).

Q: A downstream storage system needs to validate fencing tokens but is also distributed. How do you ensure the max_seen_token is consistent across storage replicas?

Expected depth: The storage system must replicate max_seen_token through its own consensus mechanism (its own Raft log, or MVCC compare-and-swap). On each write request with a fencing token, the storage leader compares the incoming token to its local max_seen_token, rejects if lower, and if accepting, includes SET max_seen_token = incoming_token as part of the same atomic committed write. This update propagates to storage followers through storage’s own replication. The key invariant is that max_seen_token is written in the same transaction as the data write - it is not a separate out-of-band update. Discuss the failure mode where a storage follower serves a read for max_seen_token before replication catches up: it may incorrectly accept a stale write. Mitigate by always serving max_seen_token reads from the storage leader, or by using read-your-writes consistency (quorum reads).

Q: How would you extend this design to support hierarchical locks where acquiring /locks/billing implicitly blocks acquisition of /locks/billing/job-A and /locks/billing/job-B?

Expected depth: Describe prefix-based conflict detection in the Raft FSM. Before committing an ACQUIRE entry for key K, the FSM checks all active leases to see if any key is a prefix of K (would block a fine-grained lock) or K is a prefix of any active key (a coarse lock blocking existing fine-grained ones). This check must happen inside the Raft FSM Apply() function - if done before proposing the entry, a race exists between the check and the commit. The cost is O(active leases) per acquire in the naive implementation; use a trie data structure for O(key length) prefix lookup. Discuss the Chubby design: Chubby supports read/write locks on directories and files, where locking a directory implicitly locks all children. The etcd equivalent is range locks on key prefixes using txn with range conditions. Trade off: hierarchical locks add complexity and reduce concurrency; most systems avoid them by flattening the lock namespace.

Premium Content

Unlock the full article along with everything else in the archive — all in one place.

In-depth analysis Expert insights Full archive access
Unlock Full Article