Build a Distributed Write-Ahead Log
databases distributed-systems reliability
System Design Deep Dive
Distributed Write-Ahead Log
Durability is a promise you make before you crash - designing a WAL that survives node failures and rewinds time on demand.
Think of the write-ahead log as the receipts you keep before updating your bank ledger. Before you erase anything in the ledger, you write a receipt - what the old value was, what the new value will be, and a sequence number. If the power cuts out mid-update and the ledger is corrupt, you pull out the receipts and replay from the last consistent point. The database survives because the log is always written first.
Distributing this concept across multiple nodes transforms a local durability mechanism into a replication protocol. The promise now is not just “survive my own crash” but “survive the crash of any single node in the cluster.” To make that promise, you need consensus: a write isn’t durable until a quorum of nodes have confirmed they’ve appended the entry to their local WAL. Return the acknowledgment before that, and a crash takes your unconfirmed write to the grave.
The design tensions are immediate. Waiting for a quorum before acknowledging writes adds latency - network round trips that stack up under write load. Async replication (acknowledge first, replicate later) gives you low latency but breaks the durability promise: a crash between the ack and replication loses committed data. And recovery requires replaying the log from a checkpoint, which means checkpoints must be consistent, log segments must be retained until they’re no longer needed, and truncation must be coordinated across all nodes.
We need to solve for durability across node failures, tunable replication latency, and point-in-time recovery from any checkpoint simultaneously. This is the engineering challenge that PostgreSQL’s WAL, etcd’s Raft log, and Kafka’s partition log all solve, each with different tradeoffs.
Requirements and Constraints
Functional Requirements
- Append log entries atomically with a monotonically increasing log sequence number (LSN)
- Replicate each entry to at least N/2+1 nodes before acknowledging the write
- Support synchronous and asynchronous replication modes per consumer
- Create checkpoints that allow truncating log segments already applied to state machines
- Recover to any point in time by replaying log from the nearest checkpoint
- Detect and handle split-brain (leader isolation) safely
Non-Functional Requirements
- Write throughput: 100,000 log entries per second per cluster
- Write latency: P99 under 5ms for synchronous mode (LAN replication)
- Log entry size: 128 bytes to 4MB per entry
- Durability: zero committed data loss on any single-node failure
- Recovery time: restore to consistent state within 30 seconds after leader crash
- Replication lag: followers within 1 second of leader under normal load
Constraints and Assumptions
- 3-node and 5-node cluster sizes (odd for quorum clarity)
- Leader-based replication (not multi-master)
- Entries are opaque bytes - the WAL doesn’t interpret log content
- fsync is available and must be called before acknowledging writes
- Network is reliable within a datacenter (LAN latency under 1ms)
High-Level Architecture
The WAL cluster has five key components: the log writer on the leader node that appends entries and calls fsync, the replication engine that streams log entries to followers, the follower log applier on each follower that persists entries and sends acknowledgments, the checkpoint coordinator that manages consistent state snapshots, and the recovery manager that handles leader election and log replay after failures.
A write request flows through the system like this: the client sends an entry to the leader. The leader appends it to its local WAL with the next LSN and calls fsync to guarantee durability on the leader’s disk. Simultaneously, the replication engine streams the entry to all followers. Each follower appends the entry to its local WAL, calls fsync, and sends an acknowledgment back to the leader. The leader waits for N/2 acknowledgments (quorum minus the leader itself), then returns success to the client.
On crash recovery, the new leader - elected from the follower with the highest LSN - replays unconfirmed entries from its log and either commits or aborts them based on whether a quorum holds each entry.
The leader calling fsync before sending to followers is not redundant - it ensures that if the leader crashes during replication, a follower with the highest LSN that becomes the new leader has a consistent log to work from, not a partially-written entry at the tail.
WAL Structure and Log Sequence Numbers
The WAL is the most critical data structure in the system. Its physical layout determines recovery speed, replication efficiency, and disk space utilization.
Think of a WAL as a tape recorder that only plays forward. Entries are written sequentially at the head. The tape never rewinds. When you need to recover, you fast-forward from a known-good position. When a segment of tape is too old to matter, you cut it off.
A WAL segment is a fixed-size file (typically 16MB or 64MB). When a segment fills, the log writer opens a new one. Each segment is named by its starting LSN. The LSN itself is a monotonically increasing 64-bit integer - it encodes both the segment number and the offset within that segment, making it trivially comparable across nodes.
// WAL entry format: header + payload
// Every entry is written atomically to the WAL file
type WALEntry struct {
LSN uint64 // 8 bytes: log sequence number
Length uint32 // 4 bytes: payload length in bytes
CRC32 uint32 // 4 bytes: checksum of payload
EntryType uint8 // 1 byte: DATA, CHECKPOINT, ABORT, COMMIT
Padding [3]byte // 3 bytes: alignment to 8-byte boundary
Payload []byte // variable: opaque data from state machine
}
// WAL writer: append-only, fsync on each write
type WALWriter struct {
mu sync.Mutex
currentFile *os.File
currentLSN uint64
segmentSize int64
dir string
}
func (w *WALWriter) Append(entryType uint8, payload []byte) (uint64, error) {
w.mu.Lock()
defer w.mu.Unlock()
lsn := atomic.AddUint64(&w.currentLSN, 1)
entry := WALEntry{
LSN: lsn,
Length: uint32(len(payload)),
CRC32: crc32.ChecksumIEEE(payload),
EntryType: entryType,
Payload: payload,
}
if err := binary.Write(w.currentFile, binary.LittleEndian, entry.header()); err != nil {
return 0, fmt.Errorf("write header: %w", err)
}
if _, err := w.currentFile.Write(payload); err != nil {
return 0, fmt.Errorf("write payload: %w", err)
}
// fsync BEFORE returning LSN - this is the durability contract
if err := w.currentFile.Sync(); err != nil {
return 0, fmt.Errorf("fsync: %w", err)
}
if w.currentFileSize() >= w.segmentSize {
w.rotateSegment()
}
return lsn, nil
}
On Linux, file.Sync() calls fsync(2), which flushes the kernel page cache to the disk write cache. It does NOT guarantee the data is on the physical platter if the disk has a volatile write cache. For true durability, either use O_DIRECT + O_SYNC on the WAL file, or verify that the drive’s write cache is battery-backed or disabled.
Replication Engine
The replication engine streams log entries from the leader to all followers. It is the most latency-sensitive component in the system because its speed determines how long the leader blocks waiting for quorum acknowledgment.
The critical design decision is batch vs. per-entry replication. Replicating every entry individually minimizes latency but creates network overhead proportional to entry count. Batching amortizes network overhead but increases the latency of individual entries when write rate is low.
The solution is adaptive batching: wait up to 500 microseconds or until the batch buffer reaches 64KB before sending, whichever comes first. At high write rates, the buffer fills before the timer fires - you get batching. At low write rates, the timer fires after 500us - you get bounded latency.
// Adaptive batcher for WAL replication
type ReplicationBatcher struct {
maxWait time.Duration // 500us
maxBytes int // 64KB
entries []WALEntry
totalSize int
timer *time.Timer
flushCh chan []WALEntry
}
func (b *ReplicationBatcher) Add(entry WALEntry) {
b.entries = append(b.entries, entry)
b.totalSize += len(entry.Payload) + 16
if b.totalSize >= b.maxBytes {
b.flush()
return
}
if b.timer == nil {
b.timer = time.AfterFunc(b.maxWait, b.flush)
}
}
func (b *ReplicationBatcher) flush() {
if len(b.entries) == 0 {
return
}
batch := b.entries
b.entries = nil
b.totalSize = 0
if b.timer != nil {
b.timer.Stop()
b.timer = nil
}
b.flushCh <- batch
}
// Leader quorum tracker: wait for N/2 acks before confirming write
type QuorumTracker struct {
mu sync.Mutex
pending map[uint64]*pendingWrite // LSN -> ack state
quorum int
}
type pendingWrite struct {
ch chan error
acks int
total int
}
func (q *QuorumTracker) WaitForQuorum(lsn uint64) error {
q.mu.Lock()
pw := q.pending[lsn]
q.mu.Unlock()
return <-pw.ch
}
func (q *QuorumTracker) RecordAck(followerID string, lsn uint64) {
q.mu.Lock()
pw := q.pending[lsn]
pw.acks++
if pw.acks >= q.quorum {
pw.ch <- nil
delete(q.pending, lsn)
}
q.mu.Unlock()
}
etcd’s Raft implementation uses exactly this adaptive batching approach. The Raft leader collects log entries into proposals and sends them to followers in batches, but bounds the batch flush delay to stay within latency SLAs. Under heavy write load, batching is nearly free; under light load, the timer ensures bounded latency.
Checkpoint Creation
Checkpoints solve the log growth problem. Without checkpoints, the WAL grows forever - every historical entry must be retained in case recovery needs to replay from the beginning of time. Checkpoints create a known-consistent snapshot of the state machine, after which all older log segments can be safely deleted.
A checkpoint is a WAL entry that says: “at LSN X, the state machine is in state S.” After a checkpoint at LSN X is written and acknowledged by a quorum, log segments whose highest LSN is less than X can be truncated.
// Checkpoint entry payload
type CheckpointEntry struct {
CheckpointLSN uint64 // the LSN of this checkpoint entry
StateMachineState []byte // serialized state snapshot
CompletedLSN uint64 // all entries up to this LSN are included in the snapshot
NodeID string // which node created this checkpoint
Timestamp int64 // unix nano
}
// Checkpoint coordinator: takes a snapshot and writes it to the WAL
type CheckpointCoordinator struct {
wal *WALWriter
stateMachine StateMachine
minInterval time.Duration // minimum time between checkpoints
lastCheckpointLSN uint64
}
func (c *CheckpointCoordinator) TakeCheckpoint(ctx context.Context) (uint64, error) {
// Pause state machine writes during snapshot (brief)
snapshot, snapshotLSN, err := c.stateMachine.Snapshot()
if err != nil {
return 0, err
}
entry := CheckpointEntry{
StateMachineState: snapshot,
CompletedLSN: snapshotLSN,
NodeID: c.nodeID,
Timestamp: time.Now().UnixNano(),
}
data, _ := proto.Marshal(&entry)
checkpointLSN, err := c.wal.Append(EntryTypeCheckpoint, data)
if err != nil {
return 0, err
}
// Safe to truncate segments with max_LSN < snapshotLSN
c.wal.TruncateBeforeLSN(snapshotLSN)
c.lastCheckpointLSN = checkpointLSN
return checkpointLSN, nil
}
Checkpoint frequency is a tradeoff between recovery time and write amplification. A checkpoint every 5 minutes means a crash recovery replays at most 5 minutes of log, but taking the snapshot itself adds I/O. Most systems checkpoint every 1,000 WAL entries or every 5 minutes, whichever comes first.
The checkpoint’s CompletedLSN and the checkpoint entry’s own LSN are different values. CompletedLSN is the highest log entry applied to the state machine snapshot; the checkpoint LSN is where this checkpoint record lives in the WAL. Recovery replays from CompletedLSN, not from the checkpoint LSN.
Log Truncation
Log truncation reclaims disk space by deleting WAL segments that are no longer needed for recovery or replication. Getting truncation wrong in either direction has severe consequences: truncate too aggressively and a recovering follower has no log to catch up from; truncate too conservatively and the disk fills up.
The safe truncation LSN is the minimum of:
- The LSN of the most recent durable checkpoint
- The lowest confirmed LSN across all live followers (a follower can’t catch up past its current position if segments are deleted)
// Safe truncation: respect both checkpoint and follower lag
type LogTruncator struct {
wal *WALWriter
followers map[string]*FollowerState
checkpoints []CheckpointEntry
}
func (t *LogTruncator) SafeTruncationLSN() uint64 {
// Lowest LSN any follower has confirmed
minFollowerLSN := uint64(math.MaxUint64)
for _, f := range t.followers {
if f.ConfirmedLSN < minFollowerLSN {
minFollowerLSN = f.ConfirmedLSN
}
}
// Latest checkpoint's CompletedLSN
var latestCheckpointLSN uint64
for _, cp := range t.checkpoints {
if cp.CompletedLSN > latestCheckpointLSN {
latestCheckpointLSN = cp.CompletedLSN
}
}
// Take the minimum - the log must be retained for the slowest consumer
safeLSN := minFollowerLSN
if latestCheckpointLSN < safeLSN {
safeLSN = latestCheckpointLSN
}
return safeLSN
}
func (t *LogTruncator) Truncate() error {
safeLSN := t.SafeTruncationLSN()
return t.wal.DeleteSegmentsBeforeLSN(safeLSN)
}
A follower that falls significantly behind (network partition, slow disk) can prevent log truncation indefinitely, causing the leader’s disk to fill up. Implement a maximum follower lag policy: if a follower hasn’t confirmed within X GB or Y hours, declare it dead, remove it from the quorum calculation, and allow truncation to proceed. The stale follower will need a full snapshot when it reconnects.
REDO vs UNDO Logging
The choice between REDO logging and UNDO logging determines how recovery works and what the WAL must record.
REDO logging records what the new state should be after an operation. Recovery replays REDO entries to bring the state machine forward to a consistent point. This is what PostgreSQL and most databases use for crash recovery. Recovery is fast because you play forward from a checkpoint.
UNDO logging records what the old state was before an operation. It allows rollback of in-progress transactions by replaying UNDO records backward. This is used in combination with REDO in systems that need to both recover and rollback (ARIES protocol).
For a distributed WAL focused on replication and recovery (not transaction management), pure REDO logging is the right choice. Entries record the final state, not the delta from a previous state, making replay straightforward.
// REDO log entry for a key-value state machine
type RedoEntry struct {
Operation string // "PUT", "DELETE", "SET_METADATA"
Key string
Value []byte // final value after operation (empty for DELETE)
Version uint64 // CAS version if applicable
}
// Replaying REDO entries to rebuild state
func (sm *KVStateMachine) Replay(entries []WALEntry, fromLSN uint64) error {
for _, entry := range entries {
if entry.LSN < fromLSN {
continue
}
if entry.EntryType != EntryTypeData {
continue
}
var redo RedoEntry
if err := proto.Unmarshal(entry.Payload, &redo); err != nil {
return fmt.Errorf("corrupt WAL entry at LSN %d: %w", entry.LSN, err)
}
switch redo.Operation {
case "PUT":
sm.data[redo.Key] = redo.Value
case "DELETE":
delete(sm.data, redo.Key)
}
}
return nil
}
PostgreSQL uses REDO-only WAL for crash recovery but implements UNDO behavior through multi-version concurrency control (MVCC): instead of undoing writes, old row versions are retained in the heap until VACUUM reclaims them. This avoids writing UNDO records entirely, trading undo overhead for VACUUM overhead.
Crash Recovery Protocol
Crash recovery is the sequence of steps that brings the cluster back to a consistent state after a node failure or leader crash. The protocol has four phases: leader election, log reconciliation, replay, and client reconnection.
Phase 1: Leader election. The follower with the highest confirmed LSN wins the election (Raft uses term numbers plus log index; Paxos uses ballot numbers). This guarantees the new leader has the most complete log - it won’t be missing entries that a quorum confirmed.
Phase 2: Log reconciliation. The new leader compares its log with the last known quorum state. Entries at the tail of the old leader’s log that were not quorum-committed must be determined: did a quorum confirm them or not? Entries not confirmed by a quorum are rolled back. Entries confirmed by a quorum are re-committed.
Phase 3: Replay. If the crash lost in-memory state (not the WAL itself), the new leader loads the most recent checkpoint and replays WAL entries from the checkpoint’s CompletedLSN forward. For a key-value store with checkpoints every 5 minutes and 100,000 entries/second, this means replaying at most 30,000,000 entries - which at replay speeds of 500,000 entries/second takes about 60 seconds. If that’s too slow, increase checkpoint frequency.
Phase 4: Client reconnection. The cluster announces the new leader’s address. Clients retry their in-flight writes (which the new leader may or may not have in its log, depending on whether quorum was reached). Clients must be idempotent for this to be safe.
// Recovery manager: coordinate log reconciliation after leader crash
type RecoveryManager struct {
log *WALReader
peers []PeerClient
localNodeID string
}
func (r *RecoveryManager) ReconcileLog(ctx context.Context) error {
// Find the highest LSN each peer has
peerLSNs := make(map[string]uint64)
for _, peer := range r.peers {
lsn, err := peer.GetHighestLSN(ctx)
if err != nil {
continue
}
peerLSNs[peer.NodeID()] = lsn
}
// Determine quorum-confirmed LSN (the LSN that N/2+1 nodes have)
quorumLSN := r.quorumLSN(peerLSNs)
// Truncate entries above quorumLSN (they were not committed)
localHighestLSN := r.log.HighestLSN()
if localHighestLSN > quorumLSN {
if err := r.log.TruncateFromLSN(quorumLSN + 1); err != nil {
return fmt.Errorf("log truncation: %w", err)
}
}
// Load the latest checkpoint and replay from there
checkpoint, err := r.log.LatestCheckpoint()
if err != nil {
return fmt.Errorf("no checkpoint found: %w", err)
}
entries, err := r.log.ReadFrom(checkpoint.CompletedLSN)
if err != nil {
return fmt.Errorf("log read: %w", err)
}
return r.stateMachine.Replay(entries, checkpoint.CompletedLSN)
}
func (r *RecoveryManager) quorumLSN(peerLSNs map[string]uint64) uint64 {
lsns := make([]uint64, 0, len(peerLSNs))
for _, lsn := range peerLSNs {
lsns = append(lsns, lsn)
}
sort.Slice(lsns, func(i, j int) bool { return lsns[i] > lsns[j] })
quorum := (len(r.peers)+1)/2 + 1
if quorum > len(lsns) {
return 0
}
return lsns[quorum-1]
}
Data Model
The WAL’s physical data model has two layers: the log segment files on disk and the segment index in memory (rebuilt on startup from the file headers).
-- Metadata database for cluster coordination (separate from WAL files)
CREATE TABLE wal_nodes (
node_id VARCHAR(64) PRIMARY KEY,
address VARCHAR(255) NOT NULL,
role VARCHAR(20) NOT NULL DEFAULT 'follower', -- 'leader' | 'follower' | 'candidate'
current_term BIGINT NOT NULL DEFAULT 0,
voted_for VARCHAR(64),
confirmed_lsn BIGINT NOT NULL DEFAULT 0,
last_heartbeat TIMESTAMPTZ NOT NULL DEFAULT now(),
is_alive BOOLEAN NOT NULL DEFAULT true
);
CREATE TABLE wal_segments (
segment_id BIGINT PRIMARY KEY, -- segment number (start_LSN >> 24)
start_lsn BIGINT NOT NULL,
end_lsn BIGINT, -- null if segment is not yet closed
file_path TEXT NOT NULL,
file_size_bytes BIGINT NOT NULL DEFAULT 0,
is_truncatable BOOLEAN NOT NULL DEFAULT false,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
truncated_at TIMESTAMPTZ
);
CREATE INDEX ON wal_segments(start_lsn);
CREATE INDEX ON wal_segments(is_truncatable) WHERE is_truncatable = false;
CREATE TABLE wal_checkpoints (
checkpoint_lsn BIGINT PRIMARY KEY,
completed_lsn BIGINT NOT NULL,
snapshot_path TEXT NOT NULL,
snapshot_size_bytes BIGINT NOT NULL,
node_id VARCHAR(64) NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
is_valid BOOLEAN NOT NULL DEFAULT true
);
WAL segment file layout (binary format):
Segment header (64 bytes):
magic_number: 8 bytes (0xWAL000000000001)
segment_id: 8 bytes
start_lsn: 8 bytes
node_id: 36 bytes (UUID)
padding: 4 bytes
Entry (variable length, aligned to 8 bytes):
lsn: 8 bytes
length: 4 bytes (payload length)
crc32: 4 bytes
entry_type: 1 byte
padding: 3 bytes
payload: `length` bytes
alignment: 0-7 bytes to reach 8-byte boundary
Key Algorithms and Protocols
Synchronous vs Asynchronous Replication
The replication mode determines the durability-latency tradeoff per write.
In synchronous mode, the leader blocks until a quorum of followers have written and fsync’d the entry. This guarantees zero data loss on any single-node failure. The cost is the round-trip latency to followers (typically 1-3ms on LAN).
In asynchronous mode, the leader returns success after its own fsync and replicates in the background. Write latency drops to disk flush time (0.1-1ms), but a leader crash before replication completes loses committed entries.
Most real systems implement both modes and let the caller choose per-write or per-consumer. Writes that need durability (financial transactions, metadata changes) use sync mode. High-throughput analytics writes can use async mode where occasional loss is acceptable.
// Write path with configurable replication mode
type WALCluster struct {
leader *WALWriter
batcher *ReplicationBatcher
quorum *QuorumTracker
}
type WriteOptions struct {
Mode ReplicationMode // Sync | Async | LeaderOnly
Timeout time.Duration
}
func (c *WALCluster) Write(ctx context.Context, payload []byte, opts WriteOptions) (uint64, error) {
// Always fsync locally first
lsn, err := c.leader.Append(EntryTypeData, payload)
if err != nil {
return 0, err
}
switch opts.Mode {
case ReplicationModeSync:
// Block until quorum acknowledges
ctx, cancel := context.WithTimeout(ctx, opts.Timeout)
defer cancel()
if err := c.quorum.WaitForQuorum(lsn); err != nil {
return 0, fmt.Errorf("quorum wait: %w", err)
}
case ReplicationModeAsync:
// Fire and forget - add to batcher, return immediately
c.batcher.Add(c.leader.ReadEntry(lsn))
case ReplicationModeLeaderOnly:
// Local fsync only - no replication (for non-critical log data)
}
return lsn, nil
}
The replication mode is a property of the write, not the cluster - different data classes within the same WAL can use different modes. Write the cluster membership change synchronously; write the application metrics events asynchronously. One WAL cluster serves both workloads correctly.
Scaling and Performance
The distributed WAL scales along three dimensions: throughput (entries per second), entry size (bytes per entry), and consumer count (followers or downstream readers).
Capacity estimation for 100,000 entries/second:
Given:
- 100,000 entries/second
- Average entry size: 512 bytes (headers + payload)
- 3-node cluster (1 leader + 2 followers)
- Synchronous replication: 2 followers must ack before leader returns
Write throughput:
- Leader disk: 100,000 * 512 bytes = 51.2 MB/s sequential writes
- Per-follower bandwidth: 51.2 MB/s incoming
- Leader outbound: 51.2 MB/s * 2 followers = 102.4 MB/s
- Total network: 153.6 MB/s (well within 10GbE = 1,250 MB/s)
WAL retention (for 48h recovery window):
- 100,000 entries/s * 512 bytes * 48 * 3600s = 8.8 TB per node
Checkpoint interval (30s recovery target):
- 100,000 entries/s * 30s = 3,000,000 entries to replay
- At 500,000 entries/s replay speed: 6 seconds to replay
- Take checkpoint every 30 seconds to meet 30s RTO
fsync latency budget:
- P99 fsync on NVMe SSD: 0.5ms
- Network RTT (LAN): 0.3ms
- Quorum latency: max(0.5ms local, 0.5ms follower_1 + 0.3ms, 0.5ms follower_2 + 0.3ms) = ~1.3ms
- P99 write latency target: 5ms - comfortably met
The primary throughput bottleneck is fsync latency. Batching amortizes fsync calls across multiple entries: if 100 entries arrive before the current fsync completes, they can all be included in a single fsync operation. This is group commit - the same technique PostgreSQL uses to achieve high write throughput.
Kafka’s partition log achieves 1 million writes/second on a single broker by combining sequential writes (WAL-like), OS page cache buffering, and group commit. fsync is called at configurable intervals (not per-message), trading durability guarantees for throughput. For finance systems, Kafka is configured with acks=all and min.insync.replicas=2 - the distributed equivalent of synchronous quorum replication.
Failure Modes and Recovery
| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| Leader crash | Heartbeat timeout (2-5s) | Write unavailability during election; reads from followers continue | Follower with highest LSN wins election; replays unconfirmed tail; resumes writes |
| Follower crash | Heartbeat timeout | Reduced quorum size (2/3 vs 3/3); writes continue if remaining quorum intact | Follower restarts, loads latest checkpoint, replays WAL from checkpoint LSN |
| Network partition (leader isolated) | Leader can’t get quorum acks; followers can’t reach leader | Split-brain risk; old leader stops accepting writes (no quorum); followers elect new leader | Old leader’s term is lower than new leader’s term; reconnecting old leader defers immediately |
| Disk full on leader | OS write error on WAL append | All writes fail immediately | Alert + pause writes; truncate old segments if safe; migrate leader to node with more space |
| Corrupt WAL segment (CRC failure) | CRC check on read during recovery | Recovery cannot proceed past the corruption | Restore from the previous checkpoint + uncorrupted segments; entries after corruption are lost |
| Clock skew between nodes | LSN comparison reveals future timestamps | Incorrect election decisions if using wall clock in LSN | Use logical LSN (monotonic counter), not wall clock; treat clock skew as operational alarm only |
The most dangerous failure mode is an operator mistake during truncation. Manually deleting WAL segments or running VACUUM on the metadata table without verifying that all followers have confirmed past those segments results in followers that cannot catch up - they need a full snapshot and replay from scratch. Never truncate log segments manually; always go through the truncation coordinator.
Comparison of Approaches
| Approach | Durability | Write Latency | Complexity | Failure Tolerance | Best Fit |
|---|---|---|---|---|---|
| Single-node WAL (no replication) | Node-level (disk survives) | 0.5-1ms (1 fsync) | Low | None (node crash = downtime) | Single-node databases, embedded systems |
| Async replication WAL | Node-level + replication lag | 0.5-1ms (no quorum wait) | Medium | Low (crash loses in-flight writes) | Analytics pipelines, metrics collection |
| Synchronous quorum WAL | Quorum-level (survives minority failure) | 1-5ms (quorum wait) | High | Any minority of nodes | OLTP databases, financial ledgers |
| Raft log | Strong (majority quorum + leader election) | 2-10ms (Raft overhead) | Very High | Any minority of nodes | etcd, distributed consensus systems |
| Multi-Paxos | Strong (majority quorum, no single leader) | 5-20ms (two-phase) | Extremely High | Any minority of nodes + leader crash | Google Chubby, CockroachDB |
Synchronous quorum WAL is the right choice for a distributed database or ledger where the write durability contract must be absolute. Raft simplifies the leader election and log consistency protocol at the cost of additional message rounds. For systems where you control the entire stack, implementing synchronous quorum WAL without the full Raft state machine is simpler and more transparent.
Key Takeaways
- Write the log before you change the state: the WAL’s invariant is that the log entry is durable before the state machine advances; violating this makes recovery impossible.
- Quorum acknowledgment is the durability boundary: returning success to a client before a quorum confirms the entry means a crash can lose committed data - this is a correctness bug, not a performance tradeoff.
- Log truncation requires coordination: truncating segments before all followers have confirmed past them leaves followers unable to catch up without a full snapshot.
- Group commit is the throughput multiplier: batching multiple entries into a single
fsynccall amortizes the dominant latency component and is the difference between 1,000 and 100,000 writes per second. - REDO logging with periodic checkpoints is the correct recovery model: replay from checkpoint + REDO entries is deterministic, fast, and doesn’t require UNDO records for committed transactions.
- Synchronous and asynchronous replication are both correct, for different data: don’t force all writes into sync mode - let the caller choose per-write based on the durability contract that data requires.
- The LSN is a logical clock: it orders events across the cluster without requiring synchronized wall clocks; treat it as the source of truth for ordering, not timestamps.
The counter-intuitive lesson here is that the WAL’s power comes from simplicity: it is an append-only, sequential, fsync’d file. Every optimization (batching, async replication, async checkpointing) is an optimization on top of this simple primitive. When things go wrong, you debug by reading the log sequentially. The WAL is both the most critical component in a database system and the most auditable - which is exactly what you want when data integrity is on the line.
Frequently Asked Questions
Q: Why call fsync on the leader before replicating to followers?
A: If the leader appends to its local WAL without fsync and then crashes before replication, the in-memory entry is lost. When a follower with the highest confirmed LSN becomes the new leader, it may not have the entry that the old leader was about to send. By fsync’ing locally first, we guarantee the entry survives on at least one node (the leader’s disk) before replication begins, giving the new leader a complete log to reconcile from.
Q: Why not just use a distributed consensus system like etcd or Zookeeper for this? A: You can, and many systems do. But etcd and Zookeeper are general-purpose consensus stores optimized for small values and low write rates. A WAL needs to handle large payloads (up to 4MB per entry), high throughput (100,000 entries/second), and sequential disk access patterns that don’t fit well into a B-tree store. Building a purpose-specific WAL lets you optimize for sequential writes, batch replication, and fast replay - capabilities general consensus stores sacrifice for generality.
Q: How do you handle a follower that has been partitioned for several hours and missed thousands of log segments? A: The follower’s confirmed LSN is far behind the safe truncation LSN. When it reconnects, the leader checks if the follower’s last confirmed LSN still exists in the WAL. If the segments are gone, the follower cannot catch up incrementally - it needs a full snapshot transfer. The leader sends the latest checkpoint snapshot, and the follower rebuilds its state from there, then continues receiving new entries. This is called “snapshot transfer” in Raft terminology.
Q: What’s the difference between the checkpoint LSN and the completed LSN inside the checkpoint entry?
A: The CompletedLSN is the highest LSN whose effect is captured in the state snapshot. It’s the recovery start point - replay begins from this LSN. The checkpoint entry’s own LSN is simply where in the WAL this checkpoint record was written. These are almost always different because writing the checkpoint record itself advances the LSN counter. Recovery reads the checkpoint entry, extracts CompletedLSN, then replays all entries from CompletedLSN onward.
Q: How do you prevent write-ahead log entries from being corrupted during a crash mid-write?
A: Each entry has a CRC32 checksum covering its payload. During recovery, the log reader validates the CRC of every entry. If the CRC fails, the entry is considered corrupt and recovery stops at that point (treating it as if the crash happened at that entry). Entries before the corrupt one are valid; the corrupt entry and everything after it are discarded. This is why fsync before acknowledging is critical - without it, an acknowledged entry could have a partially-written CRC on disk after a crash.
Interview Questions
Q: Walk me through what happens when the WAL leader crashes mid-replication. Which writes are lost and why?
Expected depth: Candidate should trace the write path: leader fsync’s, sends to followers, collects quorum acks, returns to client. A crash after fsync but before quorum means the entry exists on the leader’s disk but followers may not have it. The new leader (highest LSN follower) may not have this entry. If fewer than quorum-minus-1 followers have it, the entry is not quorum-committed and will be rolled back. Client should retry. If the client got a success acknowledgment before the crash, the system has lost a committed write - this only happens in async mode.
Q: How do you design the checkpoint protocol so that taking a checkpoint doesn’t block ongoing writes?
Expected depth: The state machine snapshot must be consistent (a point-in-time copy) but can’t hold a write lock during the entire snapshot duration (which could be seconds for a large state machine). The solution is copy-on-write: snapshot the in-memory state with a read lock (milliseconds), then serialize the snapshot to disk in the background while writes continue. The checkpoint’s CompletedLSN reflects the state at the moment of the read lock, not when the serialization completes.
Q: A follower’s disk is 10x slower than the leader’s disk. How does this affect the cluster under write load?
Expected depth: In synchronous quorum mode, the leader waits for all required followers to fsync. A slow follower increases P99 write latency dramatically - the quorum wait is gated on the slowest required follower. Solutions: reduce quorum requirements (from all-followers to N/2+1 + leader), designate the slow follower as an async replica excluded from quorum calculations, or fix the disk. The slow follower also falls behind on replication lag, which can block log truncation on the leader.
Q: How would you implement point-in-time recovery to a specific timestamp rather than a specific LSN?
Expected depth: WAL entries include wall-clock timestamps. To recover to time T, replay entries whose timestamp is less than or equal to T and stop. The challenge is that timestamps can be skewed if nodes don’t have synchronized clocks. Use the leader’s timestamp at the time it wrote each entry (not the client’s timestamp). Point-in-time recovery from a backup: restore the latest checkpoint before time T, then replay WAL entries from the checkpoint’s CompletedLSN until you reach the target timestamp.
Premium Content
Unlock the full article along with everything else in the archive — all in one place.