Build an Event Sourcing and CQRS Audit Trail


databases microservices reliability

System Design Deep Dive

Event Sourcing and CQRS Audit Trail

How do you replay 3 years of history when your database only stores current state?

⏱ 14 min read📐 Advanced🏗️ Architecture

Think of a traditional database like a whiteboard. When someone updates a record, they erase what was there and write the new value. The history is gone - you only have today’s truth. Now imagine instead you kept a notebook where every change was written as a new line, and the current state was just the sum of all those lines. That is event sourcing. It is not a clever pattern or an optimization trick - it is a fundamentally different model for persisting application state.

The challenge is scale. A financial trading platform generates millions of state changes per day. An e-commerce system creates order events, payment events, fulfillment events, and return events for every transaction. Over 3 years, that is billions of events. Replaying them to reconstruct any past state cannot be slow - a compliance audit that takes 6 hours to reconstruct last quarter’s state is not useful. And on the query side, the same event stream needs to power dashboards that ask completely different questions: “What is the current balance?” and “What were all the state changes for account X on March 15th, 2024?” are structurally incompatible queries if served from the same model.

The CQRS (Command Query Responsibility Segregation) pattern addresses this by separating writes from reads. Commands mutate state by appending events. Queries read from projections - denormalized views built specifically for each query shape. This sounds simple until you realize projections can fall behind, can fail mid-build, and must be rebuildable from scratch on demand. Add in snapshot optimization to avoid replaying 3 years of history on every read, and you have a system with six interacting layers that each fail in interesting ways.

We need to solve for durable ordered event storage, efficient aggregate reconstruction, projection consistency, snapshot management, and schema evolution simultaneously.

Requirements and Constraints

Functional Requirements

  • Append events to named aggregates (e.g., order:123, account:456) with optimistic concurrency control
  • Reconstruct any aggregate’s state at the current point in time by replaying its events
  • Reconstruct any aggregate’s state at any arbitrary past point in time (point-in-time queries)
  • Build and maintain read-optimized projections from the event stream
  • Support replaying the entire event stream from scratch to rebuild projections
  • Detect and handle conflicting concurrent writes (two processes updating the same aggregate)

Non-Functional Requirements

  • Event append latency: under 10ms at p99
  • Aggregate load latency: under 50ms for aggregates with up to 10,000 events
  • Replay throughput: at least 500,000 events/second for projection rebuild
  • Storage: retain all events for 3 years minimum (approximately 1 billion events at 1KB average = 1TB)
  • Availability: 99.99% for writes; 99.9% for projection queries
  • Event ordering: strict ordering within a single aggregate; best-effort global ordering across aggregates

Constraints

  • Events are immutable once written - no updates or deletes to the event log
  • Schema evolution must be handled in the application layer, not the store
  • Cross-aggregate transactions are out of scope - the system provides per-aggregate consistency only
  • Projections are eventually consistent with the event stream - not immediate

High-Level Architecture

The system splits into three planes: the write plane that accepts commands and appends events, the read plane that serves queries from projections, and the projection engine that transforms the event stream into queryable views.

Event sourcing CQRS system architecture overview

When a command arrives, the command handler loads the current aggregate state, validates the command against that state, and if valid, appends one or more events to the event store. The event store is the single source of truth - an append-only log partitioned by aggregate ID. After appending, the event store publishes events to an event bus (Kafka or similar), which feeds multiple projection workers. Each projection worker maintains one read model - an order history projection, a balance projection, an audit trail projection - and writes to a read store optimized for that query shape. Queries hit the read store directly, never the event store.

The snapshot service periodically computes aggregate state at a known event sequence number and stores it as a checkpoint. On aggregate load, the system reads the latest snapshot and replays only the events after it, bounding replay time regardless of aggregate age.

Key Insight

The event store is the only component that needs to be strongly consistent. Everything else - projections, snapshots, read models - can be eventually consistent, which is what makes this architecture horizontally scalable.

The Event Store

The event store’s job is deceptively simple: append events in order, with no gaps, under concurrent writers.

The hard part is optimistic concurrency control. Two processes can load the same aggregate, both see version 5, both decide to write version 6, and both try to append. Only one can win. The event store must reject the second write with a conflict error, forcing it to reload and retry. This is identical to the version column in optimistic locking - it is just applied at append time rather than update time.

Event append and projection fan-out data flow

The core table looks like this:

-- Event store: append-only log partitioned by aggregate
CREATE TABLE events (
  aggregate_id   UUID        NOT NULL,
  aggregate_type VARCHAR(64) NOT NULL,
  sequence_num   BIGINT      NOT NULL,
  event_type     VARCHAR(128) NOT NULL,
  event_data     JSONB       NOT NULL,
  metadata       JSONB       NOT NULL DEFAULT '{}',
  occurred_at    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  global_seq     BIGINT GENERATED ALWAYS AS IDENTITY,
  CONSTRAINT pk_events PRIMARY KEY (aggregate_id, sequence_num),
  CONSTRAINT uq_global_seq UNIQUE (global_seq)
);

CREATE INDEX idx_events_aggregate_type ON events (aggregate_type, global_seq);
CREATE INDEX idx_events_occurred_at ON events (occurred_at);

The (aggregate_id, sequence_num) primary key enforces ordering and uniqueness within an aggregate. The global_seq column provides a monotonically increasing global ordering needed by projection workers to track their position in the stream.

The append operation uses a single SQL statement with a conflict check:

-- Append event with optimistic concurrency check
INSERT INTO events (aggregate_id, aggregate_type, sequence_num, event_type, event_data, metadata)
SELECT
  $1, $2,
  COALESCE((SELECT MAX(sequence_num) FROM events WHERE aggregate_id = $1), -1) + 1,
  $3, $4, $5
WHERE NOT EXISTS (
  SELECT 1 FROM events
  WHERE aggregate_id = $1
  AND sequence_num = (SELECT MAX(sequence_num) FROM events WHERE aggregate_id = $1)
  AND sequence_num >= $6
);

In practice, most production event stores use a simpler approach: pass the expected version as a parameter and let a CHECK constraint or trigger reject mismatches.

# Append with explicit version check - raises on conflict
def append_event(db, aggregate_id: str, expected_version: int, event: dict) -> int:
    result = db.execute("""
        INSERT INTO events (aggregate_id, aggregate_type, sequence_num, event_type, event_data)
        SELECT $1, $2, $3 + 1, $4, $5
        WHERE (
          SELECT COALESCE(MAX(sequence_num), -1) FROM events WHERE aggregate_id = $1
        ) = $3
        RETURNING sequence_num
    """, aggregate_id, event["aggregate_type"], expected_version,
         event["type"], json.dumps(event["data"]))

    if result.rowcount == 0:
        raise OptimisticConcurrencyError(
            f"Expected version {expected_version} for {aggregate_id}, got conflict"
        )
    return result.fetchone()["sequence_num"]
Real World

EventStoreDB (formerly EventStore) uses this exact model - streams partitioned by aggregate ID with per-stream expected version checks. Greg Young, who popularized event sourcing, built it specifically because existing databases made append-only semantics awkward to implement reliably.

Aggregate State Reconstruction

Loading an aggregate’s current state means fetching all events for that aggregate ID and applying them in sequence to a zero-value state object. An aggregate is just a state machine - each event transitions it from one state to the next.

# Aggregate reconstruction from event stream
from dataclasses import dataclass, field
from typing import Any
import json

@dataclass
class OrderAggregate:
    order_id: str = ""
    status: str = "pending"
    items: list = field(default_factory=list)
    total_cents: int = 0
    version: int = -1

    def apply(self, event: dict) -> "OrderAggregate":
        handlers = {
            "OrderCreated":   self._on_created,
            "ItemAdded":      self._on_item_added,
            "PaymentReceived": self._on_payment_received,
            "OrderCancelled": self._on_cancelled,
            "OrderFulfilled": self._on_fulfilled,
        }
        handler = handlers.get(event["event_type"])
        if handler:
            handler(event["event_data"])
        self.version = event["sequence_num"]
        return self

    def _on_created(self, data: dict):
        self.order_id = data["order_id"]
        self.status = "created"

    def _on_item_added(self, data: dict):
        self.items.append(data)
        self.total_cents += data["price_cents"] * data["quantity"]

    def _on_payment_received(self, data: dict):
        self.status = "paid"

    def _on_cancelled(self, data: dict):
        self.status = "cancelled"

    def _on_fulfilled(self, data: dict):
        self.status = "fulfilled"


def load_aggregate(db, aggregate_id: str) -> OrderAggregate:
    events = db.execute("""
        SELECT event_type, event_data, sequence_num
        FROM events
        WHERE aggregate_id = $1
        ORDER BY sequence_num ASC
    """, aggregate_id).fetchall()

    state = OrderAggregate()
    for event in events:
        state.apply({
            "event_type": event["event_type"],
            "event_data": json.loads(event["event_data"]),
            "sequence_num": event["sequence_num"],
        })
    return state

The version field on the aggregate becomes the expected_version for the next write. This closes the concurrency loop.

Watch Out

Never put side effects - API calls, emails, database writes - inside an apply method. Apply methods reconstruct state; they must be pure. The side effects happen in the command handler before the event is appended, or in projection workers after it is published.

Snapshot Optimization

An aggregate with 50,000 events takes 50ms to reconstruct on a warm database. That is acceptable. An aggregate with 5 million events - a high-volume account that has been active for 3 years - takes 5 seconds. That is not.

Snapshots solve this by periodically recording the full aggregate state at a known sequence number. On load, instead of starting from event 0, the system starts from the latest snapshot and replays only the events after it. The snapshot acts like a save point in a video game - you can always go back further, but you start from the nearest checkpoint.

Snapshot optimization showing checkpoint and delta replay
-- Snapshot table: one row per aggregate, updated periodically
CREATE TABLE snapshots (
  aggregate_id   UUID        NOT NULL PRIMARY KEY,
  aggregate_type VARCHAR(64) NOT NULL,
  sequence_num   BIGINT      NOT NULL,
  state_data     JSONB       NOT NULL,
  created_at     TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
# Load aggregate using snapshot if available
def load_aggregate_with_snapshot(db, aggregate_id: str) -> OrderAggregate:
    snapshot = db.execute("""
        SELECT sequence_num, state_data
        FROM snapshots
        WHERE aggregate_id = $1
    """, aggregate_id).fetchone()

    if snapshot:
        state = OrderAggregate(**json.loads(snapshot["state_data"]))
        from_seq = snapshot["sequence_num"] + 1
    else:
        state = OrderAggregate()
        from_seq = 0

    events = db.execute("""
        SELECT event_type, event_data, sequence_num
        FROM events
        WHERE aggregate_id = $1 AND sequence_num >= $2
        ORDER BY sequence_num ASC
    """, aggregate_id, from_seq).fetchall()

    for event in events:
        state.apply({
            "event_type": event["event_type"],
            "event_data": json.loads(event["event_data"]),
            "sequence_num": event["sequence_num"],
        })

    # Write snapshot if we replayed more than 100 events
    if len(events) > 100:
        save_snapshot(db, aggregate_id, state)

    return state

The snapshot threshold (100 events here) is tunable. Systems with fast apply functions can use a higher threshold; those with slow aggregates use a lower one.

Key Insight

Snapshots are a performance optimization, never a source of truth. The system must produce identical results whether it uses a snapshot or replays from event 0. If a snapshot and a full replay produce different results, the snapshot is wrong - discard it and recompute.

Projection Building

A projection is a read model built by consuming the event stream and transforming events into a shape optimized for queries. The projection for “order history by customer” looks completely different from the projection for “current order status” - both are built from the same events.

Projection workers use a persistent cursor (their position in the global event stream) to track progress. On restart, they resume from where they left off. On a full rebuild, the cursor is reset to 0 and the target table is truncated.

# Projection worker consuming global event stream
import time

def run_order_history_projection(db, read_db):
    cursor_key = "order_history_projection_cursor"
    cursor = get_cursor(read_db, cursor_key) or 0

    while True:
        events = db.execute("""
            SELECT global_seq, aggregate_id, event_type, event_data, occurred_at
            FROM events
            WHERE global_seq > $1
            ORDER BY global_seq ASC
            LIMIT 500
        """, cursor).fetchall()

        if not events:
            time.sleep(0.1)
            continue

        with read_db.transaction():
            for event in events:
                apply_to_order_history(read_db, event)
            cursor = events[-1]["global_seq"]
            set_cursor(read_db, cursor_key, cursor)

def apply_to_order_history(read_db, event):
    data = json.loads(event["event_data"])
    if event["event_type"] == "OrderCreated":
        read_db.execute("""
            INSERT INTO order_history (order_id, customer_id, created_at, status)
            VALUES ($1, $2, $3, 'created')
            ON CONFLICT (order_id) DO NOTHING
        """, data["order_id"], data["customer_id"], event["occurred_at"])
    elif event["event_type"] == "OrderFulfilled":
        read_db.execute("""
            UPDATE order_history SET status = 'fulfilled', fulfilled_at = $2
            WHERE order_id = $1
        """, event["aggregate_id"], event["occurred_at"])
Real World

Axon Framework (Java) handles projection building with an @EventHandler annotation model where each method maps to an event type. The framework manages cursor tracking and replay orchestration, so teams only write the transformation logic. The underlying mechanics are identical to the code above.

Data Model

The data model has three layers: the event store (source of truth), the snapshot store (performance cache), and read stores (one per projection, optimized for queries).

-- Event store schema (write side)
CREATE TABLE events (
  aggregate_id    UUID         NOT NULL,
  aggregate_type  VARCHAR(64)  NOT NULL,
  sequence_num    BIGINT       NOT NULL,
  event_type      VARCHAR(128) NOT NULL,
  event_data      JSONB        NOT NULL,
  metadata        JSONB        NOT NULL DEFAULT '{}',
  occurred_at     TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
  global_seq      BIGINT       GENERATED ALWAYS AS IDENTITY,
  CONSTRAINT pk_events PRIMARY KEY (aggregate_id, sequence_num),
  CONSTRAINT uq_global_seq UNIQUE (global_seq),
  CONSTRAINT chk_seq_positive CHECK (sequence_num >= 0)
);

CREATE INDEX idx_events_type_global ON events (aggregate_type, global_seq);
CREATE INDEX idx_events_occurred ON events (occurred_at DESC);

-- Snapshot store
CREATE TABLE snapshots (
  aggregate_id    UUID         NOT NULL PRIMARY KEY,
  aggregate_type  VARCHAR(64)  NOT NULL,
  sequence_num    BIGINT       NOT NULL,
  state_data      JSONB        NOT NULL,
  created_at      TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);

-- Projection cursor tracking
CREATE TABLE projection_cursors (
  projection_name VARCHAR(128) NOT NULL PRIMARY KEY,
  last_global_seq BIGINT       NOT NULL DEFAULT 0,
  updated_at      TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);

-- Example read model: order summary projection
CREATE TABLE order_summary (
  order_id        UUID         NOT NULL PRIMARY KEY,
  customer_id     UUID         NOT NULL,
  status          VARCHAR(32)  NOT NULL,
  total_cents     BIGINT       NOT NULL DEFAULT 0,
  item_count      INT          NOT NULL DEFAULT 0,
  created_at      TIMESTAMPTZ  NOT NULL,
  last_updated_at TIMESTAMPTZ  NOT NULL
);

CREATE INDEX idx_order_summary_customer ON order_summary (customer_id, created_at DESC);
CREATE INDEX idx_order_summary_status ON order_summary (status, created_at DESC);

The global_seq column on the events table is the backbone of the entire system. It provides a total ordering across all aggregates and is the cursor value projection workers track.

Key Insight

Partitioning the event store by aggregate_id is natural for writes, but projection workers need to scan by global_seq across all aggregates. These access patterns are in direct tension - a table optimized for one is slow for the other. The idx_events_type_global index resolves this for type-scoped projections.

Key Algorithms and Protocols

Schema Versioning via Upcasters

Events are immutable, but the application’s understanding of them evolves. An OrderCreated event from 2023 might not have a currency field that all 2024 events include. Upcasters transform old event formats into the current format at read time.

# Upcaster chain for schema evolution
from typing import Callable

UpcasterFn = Callable[[dict], dict]

UPCASTERS: dict[str, list[tuple[int, UpcasterFn]]] = {
    "OrderCreated": [
        # v1 -> v2: add default currency field
        (2, lambda e: {**e, "currency": e.get("currency", "USD")}),
        # v2 -> v3: normalize amount to cents
        (3, lambda e: {**e, "total_cents": int(e.pop("total_dollars", 0) * 100)}),
    ]
}

def upcast_event(event_type: str, event_data: dict, schema_version: int) -> dict:
    upcasters = UPCASTERS.get(event_type, [])
    result = event_data.copy()
    for target_version, upcast_fn in upcasters:
        if schema_version < target_version:
            result = upcast_fn(result)
    return result

Replay Performance: Parallel Aggregate Batching

Replaying 1 billion events sequentially at 10,000 events/second takes 28 hours. The trick is that events for different aggregates are fully independent - you can replay aggregate A in parallel with aggregate B. Partition the event stream by aggregate ID and run N parallel workers.

# Parallel replay using aggregate ID partitioning
import hashlib
from concurrent.futures import ThreadPoolExecutor

def parallel_replay(db, projection_fn, num_workers: int = 16):
    def replay_partition(partition_id: int):
        cursor = 0
        while True:
            events = db.execute("""
                SELECT global_seq, aggregate_id, event_type, event_data
                FROM events
                WHERE global_seq > $1
                  AND abs(hashtext(aggregate_id::text)) % $2 = $3
                ORDER BY global_seq ASC
                LIMIT 1000
            """, cursor, num_workers, partition_id).fetchall()

            if not events:
                break

            for event in events:
                projection_fn(event)
            cursor = events[-1]["global_seq"]

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(replay_partition, i) for i in range(num_workers)]
        for f in futures:
            f.result()

This scales replay throughput linearly with worker count - 16 workers yield roughly 500,000 events/second on a modern database.

Key Insight

Parallel replay only works because projections for a single aggregate must be applied in order (sequence_num ASC), but projections across aggregates are independent. The hash partitioning ensures each aggregate always goes to the same worker, preserving per-aggregate ordering while enabling full parallelism.

Scaling and Performance

The event store’s append path has three scaling limits: write throughput, global_seq contention, and projection lag.

Scaling architecture showing partition strategy and projection lag management

Write throughput scales with read replicas for snapshots and batch inserts for high-frequency aggregates. The single primary database handles writes; reads (for aggregate load) fan out to replicas.

Global_seq contention is the subtler problem. A GENERATED ALWAYS AS IDENTITY column is a sequence, and Postgres sequences contend under high insert rates. At 10,000 inserts/second this is fine; at 100,000/second the sequence becomes a bottleneck. Solutions: batch inserts (insert 100 events, grab 100 sequence numbers), or partition the event store across multiple databases with a time-based or aggregate-type-based router.

Projection lag - the delay between an event being written and appearing in the read model - depends on projection worker throughput. For most systems, lag is under 100ms. For bulk replay, lag is irrelevant since consistency is per-rebuild.

Capacity Estimation:
Given:
  - 10,000 events/second peak write rate
  - 1KB average event payload
  - 3 years retention
  - 8 projections each reading the full stream

Storage (event store):
  10,000 * 1KB * 86400s * 365 * 3 = ~945 TB
  With compression (~4x): ~236 TB

Bandwidth (projection fan-out):
  10,000 events/s * 8 projections = 80,000 reads/s from event stream
  80,000 * 1KB = 80 MB/s sustained read from event store

Index overhead:
  global_seq + aggregate_id indexes: ~20% of raw data = ~47 TB

Total storage with read models: ~300 TB across 3 years
Real World

Shopify’s Order Management system uses an event-sourced core for order state. At their scale (millions of orders/day), they partition the event stream by shop ID rather than order ID, which colocates all events for a merchant’s orders on the same partition and makes per-merchant projection rebuilds cheap without a full stream scan.

Failure Modes and Recovery

FailureDetectionImpactRecovery
Event store primary failureHealth check + replication lag alertWrite path down; reads continue from replicasPromote replica; replay any WAL gaps
Optimistic concurrency conflictINSERT returns 0 rowsSingle command rejected; no data lossReload aggregate, re-validate, retry command
Projection worker crash mid-batchWorker heartbeat timeoutProjection lag grows; read model staleRestart worker; replay from last committed cursor
Snapshot corruptionState mismatch on spot-check replayWrong aggregate state servedDelete snapshot; full replay from event 0
Global_seq gap (sequence skip)Monotonicity check on worker readProjection worker may stall or skip eventsInvestigate: gaps in identity sequences are normal on rollback; workers should handle non-contiguous global_seq values
Schema incompatibility (missing upcaster)Event deserialization error in workerProjection rebuild fails for affected event typesAdd upcaster; rerun affected projection
Watch Out

Never delete events to fix a bug. If a bug caused wrong events to be appended, append correction events instead - a PaymentReversed after an erroneous PaymentReceived. Deleting events breaks replay correctness for any projection that consumed those events before the delete.

Comparison of Approaches

ApproachWrite LatencyQuery FlexibilityComplexityBest Fit
Traditional CRUD (current state only)5msFull SQL flexibilityLowCRUD apps without audit needs
Event sourcing + CQRS (this design)10msUnlimited projectionsHighFinancial systems, audit-heavy domains
Event sourcing + single read model10msOne query shapeMediumSimple audit trails, no complex queries
Change Data Capture (CDC)5msSQL on snapshots + delta logMediumRetrofitting audit onto existing system
Append-only tables with current-state column6msSQL on current + historyMediumModerate audit needs, no replay requirement

For a system where audit trail and time-travel queries are first-class requirements, event sourcing with CQRS is the right choice. The complexity cost is front-loaded - the read model infrastructure is real work - but it pays off when a compliance team asks “what was every state change for customer X in Q3 2023?” at 10pm.

Key Takeaways

  • Event store is immutable: events are never updated or deleted; the append-only constraint is what makes replay deterministic
  • Optimistic concurrency: the expected_version check on append prevents lost updates without database-level locking
  • Snapshots are optional but necessary at scale: without them, aggregates with millions of events become unusable
  • Projections are disposable: any read model can be deleted and rebuilt from scratch by replaying the event stream
  • Schema evolution is handled via upcasters: transform old event formats at read time, never mutate stored events
  • Global ordering is the projection worker’s clock: global_seq provides the cursor; each worker advances it independently
  • CQRS separates command validation from query execution: commands run against aggregate state; queries hit denormalized read models

The counter-intuitive lesson: event sourcing makes your write path simpler. A traditional CRUD system requires complex update logic to manage all the possible state transitions. An event-sourced system just appends - the aggregate’s apply methods handle transitions, and they are pure functions with no database side effects. The complexity moves from the write path into the projection layer, where it is more manageable and independently scalable.

Frequently Asked Questions

Q: How do you handle commands that need to check state across multiple aggregates? A: You don’t - not within a single transaction. Cross-aggregate consistency is handled via sagas or process managers: one aggregate emits an event, a saga listens for it and issues commands to other aggregates. The saga itself is an aggregate that tracks the workflow state. This is fundamentally different from a multi-table transaction in RDBMS and requires designing aggregates to be independent consistency boundaries.

Q: Why not just use a CDC pipeline on an existing database instead of event sourcing? A: CDC gives you a change log, but not a domain event log. The difference matters: a CDC event says “row 123 updated, balance changed from 100 to 150.” A domain event says “PaymentReceived for order 456, amount 50, at 14:32 UTC.” CDC events have no semantic meaning and require reverse-engineering intent. Domain events carry intent explicitly. If your query is “why did this state change happen?”, CDC cannot answer it without additional annotation.

Q: Won’t projections always be slightly stale? A: Yes, and that is acceptable for most use cases. The write confirmation tells the user their command succeeded; the read model catches up within milliseconds. For use cases that require reading your own write immediately, some systems return the aggregate state directly from the command response and cache it client-side for a short window rather than querying the projection.

Q: How do you handle projection rebuilds without downtime? A: Blue-green projections. Build the new projection into a shadow table while the current one serves traffic. Once the rebuild reaches the live cursor position (lag under 1 second), atomically swap the table name. The brief swap window uses the old model; no queries fail. This is identical to blue-green deployment but for read models.

Q: Why use Postgres for the event store rather than Kafka? A: Kafka is excellent for the event bus (projection fan-out) but poor as the event store (source of truth). Kafka’s retention is time-based and size-based; events expire. Postgres is durable, queryable with SQL, and has per-aggregate ordering via the primary key. Most production systems use Postgres or a purpose-built event store (EventStoreDB) as the store, and Kafka for fan-out.

Q: How do you prevent the global_seq from becoming a bottleneck? A: Three techniques: batch inserts to amortize sequence generation overhead, partitioning the event store by aggregate type so each type has its own sequence, and using a timestamp-plus-shard monotonic key instead of a database sequence. The last option sacrifices strict global ordering for scalability - acceptable if projections tolerate occasional reordering within a small time window.

Interview Questions

Q: Walk me through how you’d handle an aggregate with 10 million events and a need for sub-100ms load time.

Expected depth: Discuss snapshot strategy - periodic serialization of aggregate state at known sequence numbers. Cover snapshot invalidation (snapshots are never wrong, they just become stale). Explain the threshold for triggering a new snapshot (N events since last snapshot, or M seconds elapsed). Address the bootstrap problem: first load always does full replay; snapshot is written after the first load to seed future loads.

Q: Two workers both issue commands against the same aggregate at the same time. Walk me through what happens.

Expected depth: Both workers load aggregate at version 5. Both generate events to append as version 6. Worker A’s INSERT succeeds. Worker B’s INSERT checks (aggregate_id = X AND sequence_num = 5) - no longer true, so 0 rows affected. Worker B gets OptimisticConcurrencyError, reloads aggregate (now at version 6), re-evaluates its command against the new state (may still be valid or may now be rejected), and retries. Cover the case where the command is idempotent vs non-idempotent after retry.

Q: A compliance team asks you to delete all events for a specific customer (GDPR right to erasure). How do you handle this?

Expected depth: You cannot delete events - that breaks replay. Solutions: encrypt event payloads with a per-customer key, then delete the key (crypto-shredding); or replace PII fields with tombstone markers in a separate index while keeping the event structure. Cover the projection rebuild problem: projections built before the erasure may still contain PII - they need to be rebuilt after erasure.

Q: Your projection is 30 minutes behind the event stream. How do you investigate and fix it?

Expected depth: Check projection cursor position vs current global_seq - gap = lag. Check worker throughput: events/second processed. Check for a specific event type causing slow apply (a projection that does a web API call in the handler). Check database write contention on the read model table. Fix path: identify the slow event type, optimize the handler, or scale to multiple projection workers with hash partitioning on aggregate ID.

Q: How would you implement point-in-time queries - “what was the state of order X at 2pm on March 3rd”?

Expected depth: Two approaches: filter events by occurred_at <= target_time and replay only those; or maintain a time-indexed projection that stores state snapshots at regular intervals. The first approach is correct for any aggregate but scales poorly for large aggregates. The second is fast but requires pre-computing the time points. Cover how daylight saving time and clock skew affect occurred_at accuracy - discuss using monotonic global_seq as a proxy when strict time accuracy is needed.

Premium Content

Unlock the full article along with everything else in the archive — all in one place.

In-depth analysis Expert insights Full archive access
Unlock Full Article