Build an Event Sourcing and CQRS Audit Trail
databases microservices reliability
System Design Deep Dive
Event Sourcing and CQRS Audit Trail
How do you replay 3 years of history when your database only stores current state?
Think of a traditional database like a whiteboard. When someone updates a record, they erase what was there and write the new value. The history is gone - you only have today’s truth. Now imagine instead you kept a notebook where every change was written as a new line, and the current state was just the sum of all those lines. That is event sourcing. It is not a clever pattern or an optimization trick - it is a fundamentally different model for persisting application state.
The challenge is scale. A financial trading platform generates millions of state changes per day. An e-commerce system creates order events, payment events, fulfillment events, and return events for every transaction. Over 3 years, that is billions of events. Replaying them to reconstruct any past state cannot be slow - a compliance audit that takes 6 hours to reconstruct last quarter’s state is not useful. And on the query side, the same event stream needs to power dashboards that ask completely different questions: “What is the current balance?” and “What were all the state changes for account X on March 15th, 2024?” are structurally incompatible queries if served from the same model.
The CQRS (Command Query Responsibility Segregation) pattern addresses this by separating writes from reads. Commands mutate state by appending events. Queries read from projections - denormalized views built specifically for each query shape. This sounds simple until you realize projections can fall behind, can fail mid-build, and must be rebuildable from scratch on demand. Add in snapshot optimization to avoid replaying 3 years of history on every read, and you have a system with six interacting layers that each fail in interesting ways.
We need to solve for durable ordered event storage, efficient aggregate reconstruction, projection consistency, snapshot management, and schema evolution simultaneously.
Requirements and Constraints
Functional Requirements
- Append events to named aggregates (e.g.,
order:123,account:456) with optimistic concurrency control - Reconstruct any aggregate’s state at the current point in time by replaying its events
- Reconstruct any aggregate’s state at any arbitrary past point in time (point-in-time queries)
- Build and maintain read-optimized projections from the event stream
- Support replaying the entire event stream from scratch to rebuild projections
- Detect and handle conflicting concurrent writes (two processes updating the same aggregate)
Non-Functional Requirements
- Event append latency: under 10ms at p99
- Aggregate load latency: under 50ms for aggregates with up to 10,000 events
- Replay throughput: at least 500,000 events/second for projection rebuild
- Storage: retain all events for 3 years minimum (approximately 1 billion events at 1KB average = 1TB)
- Availability: 99.99% for writes; 99.9% for projection queries
- Event ordering: strict ordering within a single aggregate; best-effort global ordering across aggregates
Constraints
- Events are immutable once written - no updates or deletes to the event log
- Schema evolution must be handled in the application layer, not the store
- Cross-aggregate transactions are out of scope - the system provides per-aggregate consistency only
- Projections are eventually consistent with the event stream - not immediate
High-Level Architecture
The system splits into three planes: the write plane that accepts commands and appends events, the read plane that serves queries from projections, and the projection engine that transforms the event stream into queryable views.
When a command arrives, the command handler loads the current aggregate state, validates the command against that state, and if valid, appends one or more events to the event store. The event store is the single source of truth - an append-only log partitioned by aggregate ID. After appending, the event store publishes events to an event bus (Kafka or similar), which feeds multiple projection workers. Each projection worker maintains one read model - an order history projection, a balance projection, an audit trail projection - and writes to a read store optimized for that query shape. Queries hit the read store directly, never the event store.
The snapshot service periodically computes aggregate state at a known event sequence number and stores it as a checkpoint. On aggregate load, the system reads the latest snapshot and replays only the events after it, bounding replay time regardless of aggregate age.
The event store is the only component that needs to be strongly consistent. Everything else - projections, snapshots, read models - can be eventually consistent, which is what makes this architecture horizontally scalable.
The Event Store
The event store’s job is deceptively simple: append events in order, with no gaps, under concurrent writers.
The hard part is optimistic concurrency control. Two processes can load the same aggregate, both see version 5, both decide to write version 6, and both try to append. Only one can win. The event store must reject the second write with a conflict error, forcing it to reload and retry. This is identical to the version column in optimistic locking - it is just applied at append time rather than update time.
The core table looks like this:
-- Event store: append-only log partitioned by aggregate
CREATE TABLE events (
aggregate_id UUID NOT NULL,
aggregate_type VARCHAR(64) NOT NULL,
sequence_num BIGINT NOT NULL,
event_type VARCHAR(128) NOT NULL,
event_data JSONB NOT NULL,
metadata JSONB NOT NULL DEFAULT '{}',
occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
global_seq BIGINT GENERATED ALWAYS AS IDENTITY,
CONSTRAINT pk_events PRIMARY KEY (aggregate_id, sequence_num),
CONSTRAINT uq_global_seq UNIQUE (global_seq)
);
CREATE INDEX idx_events_aggregate_type ON events (aggregate_type, global_seq);
CREATE INDEX idx_events_occurred_at ON events (occurred_at);
The (aggregate_id, sequence_num) primary key enforces ordering and uniqueness within an aggregate. The global_seq column provides a monotonically increasing global ordering needed by projection workers to track their position in the stream.
The append operation uses a single SQL statement with a conflict check:
-- Append event with optimistic concurrency check
INSERT INTO events (aggregate_id, aggregate_type, sequence_num, event_type, event_data, metadata)
SELECT
$1, $2,
COALESCE((SELECT MAX(sequence_num) FROM events WHERE aggregate_id = $1), -1) + 1,
$3, $4, $5
WHERE NOT EXISTS (
SELECT 1 FROM events
WHERE aggregate_id = $1
AND sequence_num = (SELECT MAX(sequence_num) FROM events WHERE aggregate_id = $1)
AND sequence_num >= $6
);
In practice, most production event stores use a simpler approach: pass the expected version as a parameter and let a CHECK constraint or trigger reject mismatches.
# Append with explicit version check - raises on conflict
def append_event(db, aggregate_id: str, expected_version: int, event: dict) -> int:
result = db.execute("""
INSERT INTO events (aggregate_id, aggregate_type, sequence_num, event_type, event_data)
SELECT $1, $2, $3 + 1, $4, $5
WHERE (
SELECT COALESCE(MAX(sequence_num), -1) FROM events WHERE aggregate_id = $1
) = $3
RETURNING sequence_num
""", aggregate_id, event["aggregate_type"], expected_version,
event["type"], json.dumps(event["data"]))
if result.rowcount == 0:
raise OptimisticConcurrencyError(
f"Expected version {expected_version} for {aggregate_id}, got conflict"
)
return result.fetchone()["sequence_num"]
EventStoreDB (formerly EventStore) uses this exact model - streams partitioned by aggregate ID with per-stream expected version checks. Greg Young, who popularized event sourcing, built it specifically because existing databases made append-only semantics awkward to implement reliably.
Aggregate State Reconstruction
Loading an aggregate’s current state means fetching all events for that aggregate ID and applying them in sequence to a zero-value state object. An aggregate is just a state machine - each event transitions it from one state to the next.
# Aggregate reconstruction from event stream
from dataclasses import dataclass, field
from typing import Any
import json
@dataclass
class OrderAggregate:
order_id: str = ""
status: str = "pending"
items: list = field(default_factory=list)
total_cents: int = 0
version: int = -1
def apply(self, event: dict) -> "OrderAggregate":
handlers = {
"OrderCreated": self._on_created,
"ItemAdded": self._on_item_added,
"PaymentReceived": self._on_payment_received,
"OrderCancelled": self._on_cancelled,
"OrderFulfilled": self._on_fulfilled,
}
handler = handlers.get(event["event_type"])
if handler:
handler(event["event_data"])
self.version = event["sequence_num"]
return self
def _on_created(self, data: dict):
self.order_id = data["order_id"]
self.status = "created"
def _on_item_added(self, data: dict):
self.items.append(data)
self.total_cents += data["price_cents"] * data["quantity"]
def _on_payment_received(self, data: dict):
self.status = "paid"
def _on_cancelled(self, data: dict):
self.status = "cancelled"
def _on_fulfilled(self, data: dict):
self.status = "fulfilled"
def load_aggregate(db, aggregate_id: str) -> OrderAggregate:
events = db.execute("""
SELECT event_type, event_data, sequence_num
FROM events
WHERE aggregate_id = $1
ORDER BY sequence_num ASC
""", aggregate_id).fetchall()
state = OrderAggregate()
for event in events:
state.apply({
"event_type": event["event_type"],
"event_data": json.loads(event["event_data"]),
"sequence_num": event["sequence_num"],
})
return state
The version field on the aggregate becomes the expected_version for the next write. This closes the concurrency loop.
Never put side effects - API calls, emails, database writes - inside an apply method. Apply methods reconstruct state; they must be pure. The side effects happen in the command handler before the event is appended, or in projection workers after it is published.
Snapshot Optimization
An aggregate with 50,000 events takes 50ms to reconstruct on a warm database. That is acceptable. An aggregate with 5 million events - a high-volume account that has been active for 3 years - takes 5 seconds. That is not.
Snapshots solve this by periodically recording the full aggregate state at a known sequence number. On load, instead of starting from event 0, the system starts from the latest snapshot and replays only the events after it. The snapshot acts like a save point in a video game - you can always go back further, but you start from the nearest checkpoint.
-- Snapshot table: one row per aggregate, updated periodically
CREATE TABLE snapshots (
aggregate_id UUID NOT NULL PRIMARY KEY,
aggregate_type VARCHAR(64) NOT NULL,
sequence_num BIGINT NOT NULL,
state_data JSONB NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
# Load aggregate using snapshot if available
def load_aggregate_with_snapshot(db, aggregate_id: str) -> OrderAggregate:
snapshot = db.execute("""
SELECT sequence_num, state_data
FROM snapshots
WHERE aggregate_id = $1
""", aggregate_id).fetchone()
if snapshot:
state = OrderAggregate(**json.loads(snapshot["state_data"]))
from_seq = snapshot["sequence_num"] + 1
else:
state = OrderAggregate()
from_seq = 0
events = db.execute("""
SELECT event_type, event_data, sequence_num
FROM events
WHERE aggregate_id = $1 AND sequence_num >= $2
ORDER BY sequence_num ASC
""", aggregate_id, from_seq).fetchall()
for event in events:
state.apply({
"event_type": event["event_type"],
"event_data": json.loads(event["event_data"]),
"sequence_num": event["sequence_num"],
})
# Write snapshot if we replayed more than 100 events
if len(events) > 100:
save_snapshot(db, aggregate_id, state)
return state
The snapshot threshold (100 events here) is tunable. Systems with fast apply functions can use a higher threshold; those with slow aggregates use a lower one.
Snapshots are a performance optimization, never a source of truth. The system must produce identical results whether it uses a snapshot or replays from event 0. If a snapshot and a full replay produce different results, the snapshot is wrong - discard it and recompute.
Projection Building
A projection is a read model built by consuming the event stream and transforming events into a shape optimized for queries. The projection for “order history by customer” looks completely different from the projection for “current order status” - both are built from the same events.
Projection workers use a persistent cursor (their position in the global event stream) to track progress. On restart, they resume from where they left off. On a full rebuild, the cursor is reset to 0 and the target table is truncated.
# Projection worker consuming global event stream
import time
def run_order_history_projection(db, read_db):
cursor_key = "order_history_projection_cursor"
cursor = get_cursor(read_db, cursor_key) or 0
while True:
events = db.execute("""
SELECT global_seq, aggregate_id, event_type, event_data, occurred_at
FROM events
WHERE global_seq > $1
ORDER BY global_seq ASC
LIMIT 500
""", cursor).fetchall()
if not events:
time.sleep(0.1)
continue
with read_db.transaction():
for event in events:
apply_to_order_history(read_db, event)
cursor = events[-1]["global_seq"]
set_cursor(read_db, cursor_key, cursor)
def apply_to_order_history(read_db, event):
data = json.loads(event["event_data"])
if event["event_type"] == "OrderCreated":
read_db.execute("""
INSERT INTO order_history (order_id, customer_id, created_at, status)
VALUES ($1, $2, $3, 'created')
ON CONFLICT (order_id) DO NOTHING
""", data["order_id"], data["customer_id"], event["occurred_at"])
elif event["event_type"] == "OrderFulfilled":
read_db.execute("""
UPDATE order_history SET status = 'fulfilled', fulfilled_at = $2
WHERE order_id = $1
""", event["aggregate_id"], event["occurred_at"])
Axon Framework (Java) handles projection building with an @EventHandler annotation model where each method maps to an event type. The framework manages cursor tracking and replay orchestration, so teams only write the transformation logic. The underlying mechanics are identical to the code above.
Data Model
The data model has three layers: the event store (source of truth), the snapshot store (performance cache), and read stores (one per projection, optimized for queries).
-- Event store schema (write side)
CREATE TABLE events (
aggregate_id UUID NOT NULL,
aggregate_type VARCHAR(64) NOT NULL,
sequence_num BIGINT NOT NULL,
event_type VARCHAR(128) NOT NULL,
event_data JSONB NOT NULL,
metadata JSONB NOT NULL DEFAULT '{}',
occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
global_seq BIGINT GENERATED ALWAYS AS IDENTITY,
CONSTRAINT pk_events PRIMARY KEY (aggregate_id, sequence_num),
CONSTRAINT uq_global_seq UNIQUE (global_seq),
CONSTRAINT chk_seq_positive CHECK (sequence_num >= 0)
);
CREATE INDEX idx_events_type_global ON events (aggregate_type, global_seq);
CREATE INDEX idx_events_occurred ON events (occurred_at DESC);
-- Snapshot store
CREATE TABLE snapshots (
aggregate_id UUID NOT NULL PRIMARY KEY,
aggregate_type VARCHAR(64) NOT NULL,
sequence_num BIGINT NOT NULL,
state_data JSONB NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Projection cursor tracking
CREATE TABLE projection_cursors (
projection_name VARCHAR(128) NOT NULL PRIMARY KEY,
last_global_seq BIGINT NOT NULL DEFAULT 0,
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Example read model: order summary projection
CREATE TABLE order_summary (
order_id UUID NOT NULL PRIMARY KEY,
customer_id UUID NOT NULL,
status VARCHAR(32) NOT NULL,
total_cents BIGINT NOT NULL DEFAULT 0,
item_count INT NOT NULL DEFAULT 0,
created_at TIMESTAMPTZ NOT NULL,
last_updated_at TIMESTAMPTZ NOT NULL
);
CREATE INDEX idx_order_summary_customer ON order_summary (customer_id, created_at DESC);
CREATE INDEX idx_order_summary_status ON order_summary (status, created_at DESC);
The global_seq column on the events table is the backbone of the entire system. It provides a total ordering across all aggregates and is the cursor value projection workers track.
Partitioning the event store by aggregate_id is natural for writes, but projection workers need to scan by global_seq across all aggregates. These access patterns are in direct tension - a table optimized for one is slow for the other. The idx_events_type_global index resolves this for type-scoped projections.
Key Algorithms and Protocols
Schema Versioning via Upcasters
Events are immutable, but the application’s understanding of them evolves. An OrderCreated event from 2023 might not have a currency field that all 2024 events include. Upcasters transform old event formats into the current format at read time.
# Upcaster chain for schema evolution
from typing import Callable
UpcasterFn = Callable[[dict], dict]
UPCASTERS: dict[str, list[tuple[int, UpcasterFn]]] = {
"OrderCreated": [
# v1 -> v2: add default currency field
(2, lambda e: {**e, "currency": e.get("currency", "USD")}),
# v2 -> v3: normalize amount to cents
(3, lambda e: {**e, "total_cents": int(e.pop("total_dollars", 0) * 100)}),
]
}
def upcast_event(event_type: str, event_data: dict, schema_version: int) -> dict:
upcasters = UPCASTERS.get(event_type, [])
result = event_data.copy()
for target_version, upcast_fn in upcasters:
if schema_version < target_version:
result = upcast_fn(result)
return result
Replay Performance: Parallel Aggregate Batching
Replaying 1 billion events sequentially at 10,000 events/second takes 28 hours. The trick is that events for different aggregates are fully independent - you can replay aggregate A in parallel with aggregate B. Partition the event stream by aggregate ID and run N parallel workers.
# Parallel replay using aggregate ID partitioning
import hashlib
from concurrent.futures import ThreadPoolExecutor
def parallel_replay(db, projection_fn, num_workers: int = 16):
def replay_partition(partition_id: int):
cursor = 0
while True:
events = db.execute("""
SELECT global_seq, aggregate_id, event_type, event_data
FROM events
WHERE global_seq > $1
AND abs(hashtext(aggregate_id::text)) % $2 = $3
ORDER BY global_seq ASC
LIMIT 1000
""", cursor, num_workers, partition_id).fetchall()
if not events:
break
for event in events:
projection_fn(event)
cursor = events[-1]["global_seq"]
with ThreadPoolExecutor(max_workers=num_workers) as pool:
futures = [pool.submit(replay_partition, i) for i in range(num_workers)]
for f in futures:
f.result()
This scales replay throughput linearly with worker count - 16 workers yield roughly 500,000 events/second on a modern database.
Parallel replay only works because projections for a single aggregate must be applied in order (sequence_num ASC), but projections across aggregates are independent. The hash partitioning ensures each aggregate always goes to the same worker, preserving per-aggregate ordering while enabling full parallelism.
Scaling and Performance
The event store’s append path has three scaling limits: write throughput, global_seq contention, and projection lag.
Write throughput scales with read replicas for snapshots and batch inserts for high-frequency aggregates. The single primary database handles writes; reads (for aggregate load) fan out to replicas.
Global_seq contention is the subtler problem. A GENERATED ALWAYS AS IDENTITY column is a sequence, and Postgres sequences contend under high insert rates. At 10,000 inserts/second this is fine; at 100,000/second the sequence becomes a bottleneck. Solutions: batch inserts (insert 100 events, grab 100 sequence numbers), or partition the event store across multiple databases with a time-based or aggregate-type-based router.
Projection lag - the delay between an event being written and appearing in the read model - depends on projection worker throughput. For most systems, lag is under 100ms. For bulk replay, lag is irrelevant since consistency is per-rebuild.
Capacity Estimation:
Given:
- 10,000 events/second peak write rate
- 1KB average event payload
- 3 years retention
- 8 projections each reading the full stream
Storage (event store):
10,000 * 1KB * 86400s * 365 * 3 = ~945 TB
With compression (~4x): ~236 TB
Bandwidth (projection fan-out):
10,000 events/s * 8 projections = 80,000 reads/s from event stream
80,000 * 1KB = 80 MB/s sustained read from event store
Index overhead:
global_seq + aggregate_id indexes: ~20% of raw data = ~47 TB
Total storage with read models: ~300 TB across 3 years
Shopify’s Order Management system uses an event-sourced core for order state. At their scale (millions of orders/day), they partition the event stream by shop ID rather than order ID, which colocates all events for a merchant’s orders on the same partition and makes per-merchant projection rebuilds cheap without a full stream scan.
Failure Modes and Recovery
| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| Event store primary failure | Health check + replication lag alert | Write path down; reads continue from replicas | Promote replica; replay any WAL gaps |
| Optimistic concurrency conflict | INSERT returns 0 rows | Single command rejected; no data loss | Reload aggregate, re-validate, retry command |
| Projection worker crash mid-batch | Worker heartbeat timeout | Projection lag grows; read model stale | Restart worker; replay from last committed cursor |
| Snapshot corruption | State mismatch on spot-check replay | Wrong aggregate state served | Delete snapshot; full replay from event 0 |
| Global_seq gap (sequence skip) | Monotonicity check on worker read | Projection worker may stall or skip events | Investigate: gaps in identity sequences are normal on rollback; workers should handle non-contiguous global_seq values |
| Schema incompatibility (missing upcaster) | Event deserialization error in worker | Projection rebuild fails for affected event types | Add upcaster; rerun affected projection |
Never delete events to fix a bug. If a bug caused wrong events to be appended, append correction events instead - a PaymentReversed after an erroneous PaymentReceived. Deleting events breaks replay correctness for any projection that consumed those events before the delete.
Comparison of Approaches
| Approach | Write Latency | Query Flexibility | Complexity | Best Fit |
|---|---|---|---|---|
| Traditional CRUD (current state only) | 5ms | Full SQL flexibility | Low | CRUD apps without audit needs |
| Event sourcing + CQRS (this design) | 10ms | Unlimited projections | High | Financial systems, audit-heavy domains |
| Event sourcing + single read model | 10ms | One query shape | Medium | Simple audit trails, no complex queries |
| Change Data Capture (CDC) | 5ms | SQL on snapshots + delta log | Medium | Retrofitting audit onto existing system |
| Append-only tables with current-state column | 6ms | SQL on current + history | Medium | Moderate audit needs, no replay requirement |
For a system where audit trail and time-travel queries are first-class requirements, event sourcing with CQRS is the right choice. The complexity cost is front-loaded - the read model infrastructure is real work - but it pays off when a compliance team asks “what was every state change for customer X in Q3 2023?” at 10pm.
Key Takeaways
- Event store is immutable: events are never updated or deleted; the append-only constraint is what makes replay deterministic
- Optimistic concurrency: the
expected_versioncheck on append prevents lost updates without database-level locking - Snapshots are optional but necessary at scale: without them, aggregates with millions of events become unusable
- Projections are disposable: any read model can be deleted and rebuilt from scratch by replaying the event stream
- Schema evolution is handled via upcasters: transform old event formats at read time, never mutate stored events
- Global ordering is the projection worker’s clock:
global_seqprovides the cursor; each worker advances it independently - CQRS separates command validation from query execution: commands run against aggregate state; queries hit denormalized read models
The counter-intuitive lesson: event sourcing makes your write path simpler. A traditional CRUD system requires complex update logic to manage all the possible state transitions. An event-sourced system just appends - the aggregate’s apply methods handle transitions, and they are pure functions with no database side effects. The complexity moves from the write path into the projection layer, where it is more manageable and independently scalable.
Frequently Asked Questions
Q: How do you handle commands that need to check state across multiple aggregates? A: You don’t - not within a single transaction. Cross-aggregate consistency is handled via sagas or process managers: one aggregate emits an event, a saga listens for it and issues commands to other aggregates. The saga itself is an aggregate that tracks the workflow state. This is fundamentally different from a multi-table transaction in RDBMS and requires designing aggregates to be independent consistency boundaries.
Q: Why not just use a CDC pipeline on an existing database instead of event sourcing? A: CDC gives you a change log, but not a domain event log. The difference matters: a CDC event says “row 123 updated, balance changed from 100 to 150.” A domain event says “PaymentReceived for order 456, amount 50, at 14:32 UTC.” CDC events have no semantic meaning and require reverse-engineering intent. Domain events carry intent explicitly. If your query is “why did this state change happen?”, CDC cannot answer it without additional annotation.
Q: Won’t projections always be slightly stale? A: Yes, and that is acceptable for most use cases. The write confirmation tells the user their command succeeded; the read model catches up within milliseconds. For use cases that require reading your own write immediately, some systems return the aggregate state directly from the command response and cache it client-side for a short window rather than querying the projection.
Q: How do you handle projection rebuilds without downtime? A: Blue-green projections. Build the new projection into a shadow table while the current one serves traffic. Once the rebuild reaches the live cursor position (lag under 1 second), atomically swap the table name. The brief swap window uses the old model; no queries fail. This is identical to blue-green deployment but for read models.
Q: Why use Postgres for the event store rather than Kafka? A: Kafka is excellent for the event bus (projection fan-out) but poor as the event store (source of truth). Kafka’s retention is time-based and size-based; events expire. Postgres is durable, queryable with SQL, and has per-aggregate ordering via the primary key. Most production systems use Postgres or a purpose-built event store (EventStoreDB) as the store, and Kafka for fan-out.
Q: How do you prevent the global_seq from becoming a bottleneck? A: Three techniques: batch inserts to amortize sequence generation overhead, partitioning the event store by aggregate type so each type has its own sequence, and using a timestamp-plus-shard monotonic key instead of a database sequence. The last option sacrifices strict global ordering for scalability - acceptable if projections tolerate occasional reordering within a small time window.
Interview Questions
Q: Walk me through how you’d handle an aggregate with 10 million events and a need for sub-100ms load time.
Expected depth: Discuss snapshot strategy - periodic serialization of aggregate state at known sequence numbers. Cover snapshot invalidation (snapshots are never wrong, they just become stale). Explain the threshold for triggering a new snapshot (N events since last snapshot, or M seconds elapsed). Address the bootstrap problem: first load always does full replay; snapshot is written after the first load to seed future loads.
Q: Two workers both issue commands against the same aggregate at the same time. Walk me through what happens.
Expected depth: Both workers load aggregate at version 5. Both generate events to append as version 6. Worker A’s INSERT succeeds. Worker B’s INSERT checks (aggregate_id = X AND sequence_num = 5) - no longer true, so 0 rows affected. Worker B gets OptimisticConcurrencyError, reloads aggregate (now at version 6), re-evaluates its command against the new state (may still be valid or may now be rejected), and retries. Cover the case where the command is idempotent vs non-idempotent after retry.
Q: A compliance team asks you to delete all events for a specific customer (GDPR right to erasure). How do you handle this?
Expected depth: You cannot delete events - that breaks replay. Solutions: encrypt event payloads with a per-customer key, then delete the key (crypto-shredding); or replace PII fields with tombstone markers in a separate index while keeping the event structure. Cover the projection rebuild problem: projections built before the erasure may still contain PII - they need to be rebuilt after erasure.
Q: Your projection is 30 minutes behind the event stream. How do you investigate and fix it?
Expected depth: Check projection cursor position vs current global_seq - gap = lag. Check worker throughput: events/second processed. Check for a specific event type causing slow apply (a projection that does a web API call in the handler). Check database write contention on the read model table. Fix path: identify the slow event type, optimize the handler, or scale to multiple projection workers with hash partitioning on aggregate ID.
Q: How would you implement point-in-time queries - “what was the state of order X at 2pm on March 3rd”?
Expected depth: Two approaches: filter events by occurred_at <= target_time and replay only those; or maintain a time-indexed projection that stores state snapshots at regular intervals. The first approach is correct for any aggregate but scales poorly for large aggregates. The second is fast but requires pre-computing the time points. Cover how daylight saving time and clock skew affect occurred_at accuracy - discuss using monotonic global_seq as a proxy when strict time accuracy is needed.
Premium Content
Unlock the full article along with everything else in the archive — all in one place.