Build a Change Data Capture Pipeline
databases data-engineering distributed-systems
System Design Deep Dive
Change Data Capture Pipeline
Turning a database transaction log into a real-time event stream - without touching application code.
Imagine your database as a shared notebook that hundreds of services need to read simultaneously. The naive solution is to ask every service to poll the notebook every few seconds - but with 50 million writes per day, that polling tax compounds into billions of redundant queries, and you still miss events that happened between polls. The better solution is to stop reading the notebook and start reading the author’s hand as they write: tap into the transaction log directly.
Change Data Capture (CDC) is exactly that. Rather than having application code explicitly publish events alongside every database write - the fragile dual-write pattern that creates consistency windows - CDC reads the database’s own internal record of every committed change. Every row insertion, update, and deletion is already logged atomically with the transaction. CDC simply subscribes to that log and converts it into a stream of structured events.
The challenge is not conceptually hard. The engineering is. Transaction logs were designed for crash recovery, not for external consumption. They use database-internal formats, rotate on a schedule, and contain both committed and rolled-back transactions interleaved. A production-grade CDC pipeline must parse these logs reliably, filter only committed data, preserve transactional boundaries across multi-row changes, handle schema changes mid-stream, survive connector crashes without losing or duplicating events, and deliver to downstream consumers that may be temporarily offline. At the scale of a major e-commerce platform - say 200k transactions per minute across 40 tables - each of these problems becomes a hard distributed systems challenge.
This post walks through designing exactly such a pipeline. We will cover the binlog and WAL internals that make it possible, how Debezium sits on top of those logs as a managed connector, the ordering and exactly-once semantics that determine correctness, and how the whole system scales to multiple source databases and thousands of downstream consumers.
Requirements and Constraints
Functional requirements:
- Capture every committed INSERT, UPDATE, and DELETE from MySQL, Postgres, and MongoDB source databases
- Deliver events to downstream consumers in the order they occurred within each table (per-entity ordering)
- Preserve transactional boundaries: all rows changed in one transaction appear as a coherent group
- Support at least 15 downstream consumer types (search indexes, data warehouses, cache invalidators, audit logs)
- Handle schema changes (column additions, renames) without losing events or requiring consumer downtime
- Provide at-least-once delivery with idempotency support so consumers can safely deduplicate
Non-functional requirements:
- End-to-end latency under 500ms from database commit to consumer delivery (p99)
- Zero event loss even if any single component (connector, Kafka broker, consumer) crashes and restarts
- Support backpressure: slow consumers must not cause the pipeline to drop events or block the source database
- Connector throughput sufficient for 50k events/second per source database
- Observable: lag, throughput, and schema drift are all monitorable via standard metrics
Scale assumptions:
- 5 source databases (3 MySQL shards, 1 Postgres cluster, 1 MongoDB replica set)
- Peak write rate: 80k transactions/minute across all sources
- Average event size: 2KB (before + after row image)
- 15 downstream consumers across 4 consumer groups
- Data retention in Kafka: 7 days (enables consumer replay)
- Schema changes: approximately 2 per week per database
Out of scope:
- Global ordering across different source databases (we guarantee per-partition ordering only)
- Exactly-once delivery at the consumer side (we provide at-least-once; consumers handle idempotency)
- Real-time DDL execution on downstream systems
High-Level Architecture
The pipeline splits into three layers. The Source Layer contains the origin databases whose transaction logs we tap. The CDC Layer contains Debezium connectors running as Kafka Connect source connectors, which read those logs and publish structured events to Kafka. The Sink Layer contains downstream consumers - Elasticsearch for search, a data warehouse for analytics, cache invalidators for read performance - each reading from Kafka at their own pace.
Kafka sits at the center as the durable, ordered buffer between producers (Debezium connectors) and consumers. This decoupling is what solves the offline consumer problem: if the Elasticsearch indexer is redeploying for 10 minutes, events accumulate in Kafka and are replayed in order when it comes back. The source database is never asked to hold events on behalf of slow consumers.
The topic naming convention mirrors the source: {server-name}.{database}.{table}. A change to customers.accounts on the mydb MySQL instance lands in the mydb.customers.accounts topic. Consumers subscribe to one or more topics. Each topic has multiple partitions, and events are routed to partitions by primary key hash, guaranteeing that all events for a given row go to the same partition and thus arrive at consumers in order.
The Binlog Reader
The foundation of the entire pipeline is reading the database’s transaction log. Each database engine exposes this differently.
MySQL binary log (binlog) is a sequence of binary event files (e.g., mysql-bin.000042) containing a record of every change that modifies data or schema. There are three binlog formats:
STATEMENT: logs the SQL statement itself. Cheap to store but non-deterministic -NOW()orUUID()calls produce different results on replay.ROW: logs the before and after image of every affected row. Larger but completely deterministic and required for CDC.MIXED: uses STATEMENT for safe queries, ROW for non-deterministic ones. CDC tools treat this as unreliable.
We require binlog_format = ROW and binlog_row_image = FULL (captures both before and after images even for partial updates). Debezium registers itself as a MySQL replica, receiving the same binary log stream that a read replica would receive, using the replication protocol.
Postgres Write-Ahead Log (WAL) is fundamentally different. Postgres writes every change to the WAL before applying it to the heap files - this is how it achieves durability. For external CDC, Postgres exposes logical replication slots, which decode the physical WAL entries into logical row-level changes using a decoder plugin. The pgoutput plugin (built into Postgres 10+) and wal2json are the most common choices.
-- Enable logical replication in postgresql.conf
ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_replication_slots = 10;
ALTER SYSTEM SET max_wal_senders = 10;
-- Create a replication slot for Debezium
SELECT pg_create_logical_replication_slot('debezium_slot', 'pgoutput');
-- Grant replication privilege to the CDC user
ALTER ROLE debezium_user REPLICATION LOGIN;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO debezium_user;
A replication slot is a cursor into the WAL that Postgres holds open on behalf of a consumer. Crucially, Postgres will not discard WAL segments until all slots have consumed them. This is both a safety feature (events cannot be lost) and a danger: an inactive slot causes WAL to accumulate indefinitely, eventually filling disk.
MongoDB oplog is a special capped collection (local.oplog.rs) that records every write operation on a replica set. Debezium tails this collection using a tailable cursor, reading documents that contain fields like op (operation type: i, u, d), ns (namespace: db.collection), o (the document or change delta), and ts (BSON timestamp). MongoDB 4.0+ also supports change streams, a higher-level API built on the oplog that abstracts away the cursor management.
Log Sequence Numbers (LSN) are the universal position marker across all these log formats. In Postgres, an LSN is a 64-bit integer representing a byte offset in the WAL stream (e.g., 0/157842F0). In MySQL, the equivalent is a combination of binlog filename and byte position, or a GTID (Global Transaction Identifier) in GTID mode. In MongoDB it is the Timestamp type from the oplog. The CDC connector persists its current LSN as a checkpoint, enabling it to resume exactly where it left off after a crash.
Key Insight
The replication slot in Postgres is both the CDC connector’s best friend and its most dangerous dependency. It guarantees no event is lost - but an idle slot that falls behind will cause Postgres WAL to accumulate until disk fills. Always monitor pg_replication_slots and set max_slot_wal_keep_size as a safety valve.
The Debezium Connector Layer
Debezium is an open-source CDC platform that runs as a set of Kafka Connect source connectors. Kafka Connect is a distributed framework for scalably moving data into and out of Kafka, with built-in offset tracking, worker management, and REST-based configuration. Debezium implements the connector interface for MySQL, Postgres, MongoDB, SQL Server, Oracle, and others.
Each Debezium connector instance is a single-threaded process that reads from one source database. It operates in two distinct phases:
Snapshot mode is the initial one-time phase when the connector first starts against a database it has not seen before. It executes a consistent read of the entire configured table set - essentially SELECT * FROM table with a snapshot isolation level - and publishes every existing row as a synthetic READ event. For a large database with 500M rows, this snapshot can take hours. During this time, any new writes are buffered in the binlog/WAL and will be processed in streaming mode afterward.
Streaming mode is the ongoing phase. Once the snapshot completes, the connector records the LSN at which it transitioned and begins tailing the transaction log from that point. It converts raw log entries into structured JSON or Avro events and publishes them to Kafka topics. From this point on, the connector stays near-real-time, typically within 100ms of the source database.
Here is a production Debezium MySQL connector configuration:
name: mysql-customers-connector
config:
connector.class: io.debezium.connector.mysql.MySqlConnector
tasks.max: "1"
database.hostname: mysql-primary.internal
database.port: "3306"
database.user: debezium
database.password: "${file:/opt/kafka/config/debezium.properties:mysql.password}"
database.server.id: "184054"
database.server.name: "mydb"
database.include.list: customers,orders
table.include.list: customers.accounts,customers.profiles,orders.line_items
database.history.kafka.bootstrap.servers: kafka:9092
database.history.kafka.topic: schema-changes.mydb
include.schema.changes: "true"
snapshot.mode: initial
snapshot.isolation.mode: repeatable_read
transforms: unwrap
transforms.unwrap.type: io.debezium.transforms.ExtractNewRecordState
transforms.unwrap.add.fields: op,table,lsn,source.ts_ms
The database.history.kafka.topic is critical: Debezium stores the full DDL history of the schema in a separate internal Kafka topic. When the connector restarts, it replays this schema history to reconstruct the schema at the point it stopped, so it can correctly decode binary log entries that reference column positions by index rather than name.
Offset storage works through Kafka Connect’s built-in offset storage, which persists the connector’s current LSN/position to the kafka-connect-offsets internal topic. On restart, the connector reads its last committed offset and resumes from there. This means the connector itself is stateless from a durability perspective - all state lives in Kafka.
Real World
Debezium, now a Red Hat project and the dominant open-source CDC tool, powers CDC pipelines at thousands of companies. LinkedIn’s Databus (2012) was one of the earliest large-scale CDC systems, using a similar log-tailing approach for Oracle and MySQL. Airbnb’s SpinalTap (open-sourced 2019) extends Debezium’s approach to handle Airbnb’s specific sharding and schema management requirements at global scale. All three share the same core insight: trust the transaction log, not the application layer.
Event Ordering and Transactional Boundaries
Ordering in a distributed CDC pipeline is nuanced. We cannot guarantee global ordering across all topics simultaneously - that would require a single sequencer, which is a bottleneck. Instead, we guarantee two properties:
Per-partition ordering: Within a single Kafka partition, events are strictly ordered by their LSN from the source. This means all updates to a given primary key (which map to the same partition via key-hash routing) are processed in the exact order they occurred in the database.
Transactional boundary preservation: All row changes from a single database transaction are published atomically to Kafka, tagged with the same transaction ID. Downstream consumers can reconstruct the original transaction if they need to apply changes as a unit.
Here is what a Debezium CDC event looks like in full:
{
"schema": {},
"payload": {
"before": null,
"after": {
"id": 42,
"email": "alice@example.com",
"status": "active",
"updated_at": 1718150400000000
},
"source": {
"version": "2.5.0.Final",
"connector": "mysql",
"name": "mydb",
"ts_ms": 1718150400123,
"db": "customers",
"table": "accounts",
"gtid": "d4a72258-c8f7-11ee-a506-0242ac120002:1-1024",
"file": "mysql-bin.000042",
"pos": 157842,
"row": 0,
"server_id": 184054
},
"op": "u",
"ts_ms": 1718150400456,
"transaction": {
"id": "file=mysql-bin.000042,pos=157800",
"total_order": 3,
"data_collection_order": 1
}
}
}
The transaction block is emitted by Debezium’s transaction metadata feature (enabled with provide.transaction.metadata=true). The total_order field indicates which event within the transaction this is (3rd event), and data_collection_order indicates which event for this specific table (1st event in this table within the transaction). A consumer building a transactional view can buffer events until it sees total_order == total_count, then apply them together.
The op field encodes the operation: c for create (INSERT), u for update (UPDATE), d for delete (DELETE), and r for read (snapshot rows). The before field is null for inserts; for updates and deletes it contains the pre-change row state (requires binlog_row_image=FULL in MySQL).
Key Insight
Kafka partitions are the unit of ordering. Two events in different partitions have no guaranteed order relative to each other, even if they came from the same database transaction. If your consumer joins data across multiple tables, and those tables map to different partitions, you need application-level transaction tracking using the transaction.id field - not Kafka offset ordering.
Schema Evolution
Schema changes are the silent killer of CDC pipelines. When a developer runs ALTER TABLE accounts ADD COLUMN phone VARCHAR(20), the binlog entries before the migration use one column layout and entries after use another. A naive decoder applying the wrong schema produces corrupt events.
Debezium’s approach is to embed schema information in every event (the schema block in the event envelope) and maintain a full DDL history in the schema-changes Kafka topic. When it encounters a DDL event in the binlog, it pauses event emission, updates its internal schema representation, writes the DDL change to the history topic, then resumes. Downstream consumers that cache schemas should watch for schema changes.
For production systems with many consumers, we use Apache Avro with a Schema Registry (Confluent Schema Registry or AWS Glue Schema Registry). Instead of embedding the full schema in every message (expensive), each message carries a 5-byte header: 1 magic byte (0x00) followed by a 4-byte schema ID. The consumer looks up the schema by ID from the registry (cached after first fetch).
{
"type": "record",
"name": "accounts_value",
"namespace": "mydb.customers",
"fields": [
{"name": "id", "type": "long"},
{"name": "email", "type": "string"},
{"name": "status", "type": "string"},
{"name": "updated_at", "type": {"type": "long", "logicalType": "timestamp-micros"}},
{"name": "phone", "type": ["null", "string"], "default": null}
]
}
The phone field uses a union type ["null", "string"] with a null default - this is the standard Avro pattern for backward compatibility. Consumers running the old schema (without phone) can still deserialize messages written with the new schema, because they simply ignore unknown fields. This is backward compatibility: new writers, old readers work.
Forward compatibility - old writers, new readers - is harder and requires that the new schema can read messages written by the old schema. This means new required fields must have defaults, and fields can only be removed if they had defaults.
The Schema Registry enforces a compatibility mode per subject (topic). We configure BACKWARD compatibility for all CDC topics, requiring that every new schema version can be read by consumers using the previous version.
Warning
Renaming a column is a breaking schema change. To consumers using the old schema, the renamed column appears as a new field (added) and the old field appears deleted. This breaks backward compatibility. The safe approach is: add the new column name as an alias, keep both column names for one release cycle, then drop the old name after all consumers have migrated to the new name.
Data Model
We need persistent state for connector checkpoints, a dead-letter queue for failed events, and schema history. Here are the key tables and naming conventions.
Connector offset tracking (this is managed by Kafka Connect internally, but shown here for understanding):
-- Internal Kafka Connect offset storage (in Kafka topic, not a DB table)
-- Key: {"connector": "mysql-customers-connector", "server": "mydb"}
-- Value: {"file": "mysql-bin.000042", "pos": 157842, "gtid": "d4a72258:1-1024"}
-- To inspect current connector positions:
SELECT
connector_name,
json_extract(offset_value, '$.file') AS binlog_file,
json_extract(offset_value, '$.pos') AS binlog_pos,
json_extract(offset_value, '$.gtid') AS gtid
FROM cdc_connector_offsets;
Dead letter queue for events that fail to be processed by a consumer (e.g., schema mismatch, downstream system unavailable):
CREATE TABLE cdc_dead_letter_queue (
id BIGSERIAL PRIMARY KEY,
topic VARCHAR(255) NOT NULL,
partition_num INTEGER NOT NULL,
offset_value BIGINT NOT NULL,
consumer_group VARCHAR(255) NOT NULL,
error_class VARCHAR(500),
error_message TEXT,
raw_key BYTEA,
raw_value BYTEA,
failed_at TIMESTAMPTZ DEFAULT NOW(),
retry_count INTEGER DEFAULT 0,
last_retry_at TIMESTAMPTZ,
resolved_at TIMESTAMPTZ,
UNIQUE(topic, partition_num, offset_value, consumer_group)
);
CREATE INDEX idx_dlq_unresolved ON cdc_dead_letter_queue(consumer_group, resolved_at)
WHERE resolved_at IS NULL;
Kafka topic naming convention:
{server-name}.{database}.{table}
Examples:
mydb.customers.accounts -- MySQL customers.accounts table
mydb.customers.profiles -- MySQL customers.profiles table
mydb.orders.line_items -- MySQL orders.line_items table
pgdb.inventory.products -- Postgres inventory.products table
mongodb.catalog.items -- MongoDB catalog.items collection
Internal topics (Debezium managed):
schema-changes.mydb -- DDL history for mydb connector
debezium.offsets -- Connector position checkpoints
debezium.configs -- Connector configurations
debezium.status -- Connector task status
Consumer lag tracking view (built on top of Kafka consumer group metadata):
CREATE VIEW cdc_consumer_lag AS
SELECT
consumer_group,
topic,
partition_num,
current_offset,
log_end_offset,
(log_end_offset - current_offset) AS lag,
CASE
WHEN (log_end_offset - current_offset) > 100000 THEN 'CRITICAL'
WHEN (log_end_offset - current_offset) > 10000 THEN 'WARNING'
ELSE 'OK'
END AS lag_status
FROM cdc_consumer_offsets
ORDER BY lag DESC;
Key Algorithms and Protocols
Exactly-Once Delivery
True exactly-once across Kafka and a downstream database is theoretically possible with distributed transactions, but practically avoided because it requires both systems to participate in a two-phase commit. The overhead is severe: throughput drops 5-10x and latency spikes.
Instead we use the pragmatic approach: at-least-once delivery with idempotent consumers. Debezium guarantees at-least-once (every committed event will eventually be delivered; restarts may re-deliver the last few). Consumers are designed to handle duplicates safely by using upsert semantics keyed on the event’s primary key and using the source.ts_ms or LSN as a version guard.
from confluent_kafka import Consumer, KafkaError, KafkaException
import json
consumer = Consumer({
'bootstrap.servers': 'kafka:9092',
'group.id': 'search-indexer',
'auto.offset.reset': 'earliest',
'enable.auto.commit': False, # Manual commit for at-least-once
'max.poll.interval.ms': 300000,
'session.timeout.ms': 30000,
'partition.assignment.strategy': 'cooperative-sticky',
})
consumer.subscribe(['mydb.customers.accounts'])
while True:
msg = consumer.poll(timeout=1.0)
if msg is None:
continue
if msg.error():
if msg.error().code() == KafkaError._PARTITION_EOF:
continue
raise KafkaException(msg.error())
event = json.loads(msg.value())
payload = event['payload']
# Idempotent upsert - safe to retry because we key on id
# and only update if the incoming lsn is newer than stored
upsert_to_elasticsearch(
index='accounts',
doc_id=payload['after']['id'],
doc=payload['after'],
version=payload['source']['pos'], # LSN as version
version_type='external_gte', # Only update if version >= stored
)
# Only commit offset after successful processing
# If this crashes before committing, the event is re-delivered - that's fine
consumer.commit(asynchronous=False)
The version_type='external_gte' in Elasticsearch is the idempotency guard: if we re-deliver an event with the same or lower LSN (e.g., after a connector restart), Elasticsearch simply ignores the update because it already has an equal or newer version.
Consumer Offset Tracking
Kafka tracks consumer positions using the consumer group protocol. Each consumer in a group is assigned a subset of partitions. When a consumer calls commit(), the offset is written to the __consumer_offsets internal topic. On restart or rebalance, the consumer group coordinator assigns partitions and each consumer resumes from its last committed offset.
The two commit strategies and their tradeoffs:
- Auto-commit (
enable.auto.commit=True): Kafka commits offsets periodically (default every 5 seconds). Simple but risks skipping events if the consumer crashes between commit and processing. Effectively at-most-once. - Manual commit (
enable.auto.commit=False): The consumer explicitly commits after successful processing. Ensures at-least-once. Our choice.
Consumer group lag is the key operational metric. We monitor it with:
kafka-consumer-groups.sh \
--bootstrap-server kafka:9092 \
--describe \
--group search-indexer
# Output includes: PARTITION, CURRENT-OFFSET, LOG-END-OFFSET, LAG
Backpressure Handling
When a consumer falls behind, Kafka partition lag accumulates. The consumer is the right place to apply backpressure - it simply slows its poll() rate. Kafka does not push to consumers; consumers pull. This is the fundamental design that prevents slow consumers from affecting producers.
However, there is a subtlety on the producer (Debezium) side. Debezium itself can produce events faster than Kafka can accept them if Kafka is under heavy load. Debezium’s Kafka producer uses standard Kafka producer backpressure: the producer has a bounded buffer.memory (default 32MB) and if the buffer fills because Kafka brokers are not acknowledging, producer.send() blocks until space is available or the max.block.ms timeout triggers. We tune this:
# Debezium producer tuning for throughput
producer.batch.size=65536
producer.linger.ms=5
producer.compression.type=lz4
producer.buffer.memory=67108864
producer.max.block.ms=60000
producer.acks=all
producer.retries=2147483647
producer.max.in.flight.requests.per.connection=5
producer.enable.idempotence=true
The enable.idempotence=true on the producer ensures that if a batch is retried (due to network failure), Kafka deduplicates it. This is Kafka’s built-in producer idempotence, independent of consumer-side idempotence.
Scaling and Performance
One connector per source database. Debezium connectors are single-threaded readers of a single log stream. You cannot parallelize reading from a single MySQL binlog because the log is an ordered sequence. Scale by having one connector per source database (or per shard in a sharded setup).
Partition count drives consumer parallelism. Each Kafka partition can be consumed by at most one consumer per consumer group. If we have 12 partitions for mydb.customers.accounts, we can have up to 12 search-indexer consumer instances processing that topic in parallel. We set partition count at topic creation time based on the expected peak throughput per consumer. A safe starting point is partition_count = ceil(peak_events_per_sec / consumer_throughput_per_sec).
Partition routing by primary key. Debezium uses the Kafka Connect DefaultPartitioner which hashes the message key (the row’s primary key) to assign partitions. This guarantees that all events for account_id=42 always go to the same partition, preserving per-entity ordering for consumers.
Consumer group scaling. Multiple consumer groups can independently consume the same Kafka topics. Adding a new downstream consumer type (e.g., a new analytics system) means creating a new consumer group with its own offset tracking. It does not affect existing consumers and can start from any offset (beginning, end, or a specific timestamp).
Typical throughput ceiling. A single Debezium connector can sustain approximately 30k-50k events/second in streaming mode, limited by the Kafka producer batch throughput and the source database’s replication bandwidth. At 80k transactions/minute (1,333 TPS) with average 3 rows per transaction (~4k events/second), we are well within the ceiling for a single connector. For 10x the load, we would need sharded source databases with one connector each.
Real World
LinkedIn’s Databus - a precursor to modern CDC tools - was open-sourced in 2012 and handled billions of events per day across Oracle and MySQL databases. Airbnb’s SpinalTap (open-sourced 2019) extended similar concepts to handle Airbnb’s multi-region MySQL shards, adding features like per-shard connector management, schema versioning backed by a central registry, and automated failover to replica binlog sources. Both systems validate the connector-per-shard scaling model in practice.
Failure Modes and Recovery
Connector crash mid-transaction. If the Debezium connector process crashes while reading a large multi-row transaction, it will restart from the last committed Kafka Connect offset. The partially-published events from the in-progress transaction may be re-delivered. Consumers see a repeat of some events with the same transaction.id. Idempotent consumers handle this correctly. The before/after fields ensure idempotent upsert can detect and discard duplicates.
Kafka broker failure. With replication.factor=3 and min.insync.replicas=2, a single broker failure is transparent. The partition leader fails over to a replica within seconds. Debezium’s producer with acks=all and retries=INT_MAX will retry until the new leader acknowledges. No events are lost; there is a brief producer stall during leader election (typically under 30 seconds).
Schema registry unavailability. Consumers cache schemas locally. A brief registry outage (under the cache TTL, typically 5 minutes) is transparent. For longer outages, consumers stall on cache misses. We use a circuit breaker pattern: on registry error, fall back to the cached schema for known schema IDs, emit a warning metric, and continue. For unknown schema IDs (new schema versions deployed during outage), the consumer parks those messages in the dead-letter queue.
Replication slot lag (Postgres). Monitor pg_replication_slots for the confirmed_flushed_lsn of the Debezium slot versus the current pg_current_wal_lsn(). The difference is the amount of WAL that Postgres is holding for the connector. Alert if this exceeds 10GB. If the connector falls too far behind (e.g., connector is paused for days), WAL can fill disk. Set max_slot_wal_keep_size = 50GB in postgresql.conf as a safety valve - Postgres will drop the slot rather than fill disk, forcing a re-snapshot from Debezium.
-- Monitor Postgres replication slot health
SELECT
slot_name,
active,
confirmed_flushed_lsn,
pg_current_wal_lsn() AS current_wal_lsn,
pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flushed_lsn) AS wal_lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical';
Consumer group rebalance storms. When consumers join or leave a group, Kafka triggers a rebalance, pausing all consumption temporarily. In a high-throughput group with many partitions, eager rebalancing (the default pre-Kafka 2.4) stops all consumers for the duration. We use partition.assignment.strategy=cooperative-sticky which enables incremental rebalancing: only the partitions that need to move are reassigned, and consumers continue processing their stable partitions during the rebalance. Rebalance time drops from O(total partitions) to O(moved partitions).
Comparison of Approaches
| Approach | Latency | Complexity | Ordering | Schema Coupling | Reliability |
|---|---|---|---|---|---|
| Query polling | High (seconds) | Low | None | Tight | At-most-once |
| Debezium + Kafka | Low (milliseconds) | Medium | Per-partition | Loose (Schema Registry) | At-least-once |
| Transactional Outbox | Low (milliseconds) | Medium-High | Per-entity | Medium | At-least-once |
| Dual Write | Very low | Low | None | Tight | Inconsistent |
Query polling is the starting point most teams use: run a cron job that selects rows with updated_at > last_run. It fails at scale because it requires an updated_at column (not all tables have it), misses deletions, and produces high polling load on the source.
Debezium + Kafka is the gold standard for teams that can afford the operational overhead of Kafka and a Schema Registry. It requires no application code changes, captures deletions, and scales to arbitrary throughput.
Transactional Outbox is the alternative for teams that cannot tap the binlog directly (e.g., managed database services with restricted replication access). The application writes events to an outbox table in the same transaction as the data change, then a separate relay process reads from the outbox and publishes to Kafka. It requires application code changes but avoids the dual-write consistency problem.
Dual Write - having the application write to both the database and Kafka - is tempting but wrong. If the application crashes after the database write but before the Kafka write, the event is permanently lost. Even with retry logic, you have a consistency window. Never use this for reliable event streaming.
Key Takeaways
- CDC is log-based, not query-based. Reading the
binlog,WAL, oroplogcaptures every committed change including deletes, without modifying application code or adding load to the source database through polling. Debeziumhandles the hard parts: snapshot mode for initial population, streaming mode for ongoing capture, schema history tracking, and offset management via Kafka Connect.- Kafka is the durable buffer that decouples producers from consumers. A slow or offline consumer causes lag in Kafka partitions, not pressure on the source database.
- Per-partition ordering (by primary key hash) is the correct ordering guarantee for CDC. Global ordering across all tables is neither achievable nor necessary for most use cases.
- Schema evolution requires planning. Use Avro with a Schema Registry and configure backward compatibility. Column renames are breaking changes that require a multi-step migration.
- At-least-once delivery with idempotent consumers is the practical exactly-once solution. True distributed exactly-once (2PC across Kafka and the downstream system) costs too much throughput to be worth it in most cases.
- Monitor
WALlag on Postgres replication slots, consumer group lag in Kafka, and dead-letter queue depth. These three metrics tell you the health of the entire pipeline.
Frequently Asked Questions
What is the difference between CDC and event sourcing?
Event sourcing is an application architecture pattern where the application explicitly stores state as a sequence of immutable events - the application code produces events as the primary data model, and the current state is derived by replaying them. CDC is an infrastructure pattern where the database’s internal transaction log is treated as an event stream after the fact. In event sourcing, events are first-class; in CDC, events are derived from database writes. They solve different problems but can be combined: an event-sourced system might use CDC to propagate its event store to downstream systems.
How does Debezium handle TOASTed values in Postgres?
Postgres uses TOAST (The Oversized Attribute Storage Technique) to store large column values (over ~2KB) out-of-line. When a row is updated but a large column is unchanged, the WAL entry for that row does not include the TOASTed column value - it uses a placeholder. Debezium cannot retrieve the full value from the WAL alone. The workaround is to set REPLICA IDENTITY FULL on the table (ALTER TABLE t REPLICA IDENTITY FULL), which forces Postgres to include all column values in WAL entries regardless of TOAST status. This increases WAL volume but ensures complete before/after images in CDC events.
How does CDC work with horizontally sharded databases?
Each database shard is an independent source with its own binlog or WAL. We run one Debezium connector per shard, each publishing to the same set of Kafka topics (but different partitions, based on the shard’s key space). Consumers read from all partitions and see a merged stream. The ordering guarantee is per-partition within a shard; cross-shard ordering is not guaranteed, which is acceptable because cross-shard transactions are typically avoided in sharded architectures anyway.
How do we handle the initial snapshot for a large database (100GB+)?
Large snapshots are the biggest operational challenge. Options: (1) Use snapshot.mode=schema_only initially, accepting that historical data is not captured - then backfill via a separate batch job. (2) Use snapshot.mode=initial with snapshot.fetch.size tuned to avoid overwhelming the DB. (3) For Postgres, use snapshot.mode=exported which takes the snapshot at the WAL LSN at connector start, avoiding a long-running open transaction. (4) Restore a database dump into a separate instance and snapshot from that, then switch to streaming from the primary. Option 1 or 4 is usually best for 100GB+ databases.
How do we monitor the health of a CDC pipeline end-to-end?
Three key metrics: (1) Debezium connector lag - the difference between the current LSN and the connector’s last-read LSN, exposed via JMX as debezium.mysql.connector.metrics:name=streaming_max_queue_size_in_bytes. Alert if this grows persistently. (2) Kafka consumer group lag - use kafka-consumer-groups.sh --describe or Kafka JMX metrics. Alert if any consumer group lag exceeds the threshold for its SLA (e.g., >10k events for a real-time consumer). (3) Dead letter queue depth - track the count of unresolved rows in the DLQ table. Any entry is an event that failed delivery and needs investigation. Together these three metrics give you a complete view from source to consumer.
Interview Questions
“How would you handle exactly-once delivery in a CDC pipeline?”
The interviewer is looking for: understanding that true exactly-once requires 2PC (expensive), the at-least-once + idempotent consumer pattern as the practical alternative, specific mechanisms like Elasticsearch’s external_gte version type or database upsert-on-primary-key, and awareness of Kafka’s producer idempotence (deduplication at the broker level). Bonus: mention Kafka Streams exactly-once semantics for stream-to-stream processing within Kafka itself.
“A Postgres replication slot is lagging by 50GB of WAL. What happened and what do you do?”
The interviewer is looking for: understanding that the slot holds WAL until consumed, diagnosing root cause (connector crash? slow consumer? schema change causing connector pause?), short-term action (restart connector and verify it is consuming), medium-term mitigation (max_slot_wal_keep_size to prevent disk-fill, alerting on lag), and long-term solution (improve connector reliability, investigate why it fell behind).
“How would you handle a CDC pipeline for a table that has no primary key?”
The interviewer is looking for: awareness that Kafka partition routing and idempotent upserts both depend on a stable row identifier. Solutions: (1) add a primary key to the table (ALTER TABLE). (2) Use REPLICA IDENTITY FULL in Postgres so the full before-image is available for deduplication. (3) Use ROW_FORMAT and binlog_row_image=FULL in MySQL. (4) Accept that without a primary key, deduplication must be done on a hash of all columns - possible but expensive. The interviewer wants to see that you recognize the dependency on identity, not just that you know the workaround.
“How would you design CDC for a multi-region active-active setup where the same row can be written in two regions simultaneously?”
The interviewer is looking for: understanding of conflict resolution (last-write-wins by timestamp, CRDT-based approaches), how two CDC streams from two regions merge in Kafka (they interleave by Kafka’s arrival order, not database commit order), the need for a global ordering mechanism like a distributed clock (Hybrid Logical Clocks), and the trade-off between consistency and availability. This is intentionally hard; the right answer includes acknowledging that multi-region active-active CDC is an unsolved problem in general and requires application-specific conflict resolution rules.
“A consumer group’s lag has been growing for 2 hours. Walk me through your diagnosis.”
The interviewer is looking for a systematic approach: (1) Check if the consumer processes are running (Kubernetes pod status, process health). (2) If running, check consumer processing rate vs. production rate - is the consumer CPU/IO bound? (3) Check for consumer group rebalances - repeated rebalances cause processing gaps. (4) Check dead-letter queue depth - if many events are failing, processing is slow because each failure has retry overhead. (5) Check if a new message format arrived that the consumer cannot parse (schema compatibility alert). (6) Check downstream system health (Elasticsearch might be slow, causing the consumer to block on each upsert). Each of these has a different fix.
Premium Content
Unlock the full article along with everything else in the archive — all in one place.