Understanding Change Data Capture (CDC) in MySQL and PostgreSQL: BinLog vs. WAL + Logical Decoding
How CDC tools use MySQL Binlog and PostgreSQL WAL with logical decoding for real-time data streaming
Photo by Matoo.Studio on Unsplash
CDC (Change Data Capture) is a term that has been gaining significant attention over the past few years. You might already be familiar with it (if not, don’t worry — there’s a quick introduction below). One question that puzzled me, though, was how tools like the Debezium CDC connectors can read changes from MySQL and PostgreSQL databases. Sure, transactional logs might be the answer — but why? Is the process of pulling logs from MySQL and PostgreSQL the same? If not, what are the key differences?
These are the questions I’ll explore and attempt to answer in this article. For those new to CDC, I’ll also provide a brief overview to get you up to speed. Enjoy the read, and don’t forget to follow!
Change Data Capture (CDC) System — Example Diagram (Created using Lucidchart)
What is Change Data Capture?
Change Data Capture (CDC) is a powerful and efficient tool for transmitting data changes from relational databases such as MySQL and PostgreSQL. By recording changes as they occur, CDC enables real-time data replication and transfer, minimizing the impact on source systems and ensuring timely consistency across downstream data stores and processing systems that depend on this data.
Instead of relying on infrequent, large batch jobs that may run only once a day or every few hours, CDC allows incremental data updates to be loaded in micro batches—such as every minute—providing a faster and more responsive approach to data synchronization.
There are a couple of ways that we can track the changes in a database:
Query-based CDC: This method involves using SQL queries to retrieve new or updated data from the database. Typically, it relies on a timestamp column to identify changes. For example:SELECT * FROM table_A WHERE ts_col > previous_ts; –This query fetches rows where the timestamp column (ts_col) is greater than the previously recorded timestamp.Log-based CDC: This method utilizes the database’s transaction log to capture every change made. As we’ll explore further, the specific implementation of transaction logs varies between databases; however, the core principle remains consistent: all changes to the database are recorded in a transaction log (commonly known as a redo log, binlog, WAL, etc.). This log serves as a detailed and reliable record of modifications, making it a key component of Change Data Capture.
In this article, we will focus on the transaction logs of MySQL and PostgreSQL databases, which serve as the backbone for CDC tools like Debezium CDC Connectors and Flink CDC.
Mysql — Binlog
MySQL uses a binary log to record changes to the database. Every operation in a transaction — whether it’s a data INSERT, UPDATE, or DELETE — is logged in sequence (Log Sequence Number — LSN). The binlog contains events that describe database changes and can operate in three formats:
Row-based: RBR logs the actual data changes at the row level. Instead of writing the SQL statements, it records each modified row’s old and new values. For example: If a row in the users table is updated, the binlog will contain both the old and new values:Old Value: (id: 1, name: ‘Peter’, email: ‘peter@gmail.com’)
New Value: (id: 1, name: ‘Peter’, email: ‘peter@hotmail.com’)
/*By default, mysqlbinlog displays row events encoded as
base-64 strings using BINLOG statements */Statement-based: MySQL logs the actual SQL statements executed to make changes. A simple INSERT statement might be logged as:INSERT INTO users (id, name, email) VALUES (1, ‘Peter’, ‘peter@gmail.com’);Mixed: Combines row-based and statement-based logging. It uses statement-based replication for simple, deterministic queries and row-based replication.
PostgreSQL — WAL & Logical Decoding
Unlike MySQL, which uses binary logging for replication and recovery, PostgreSQL relies on a Write-Ahead Log (WAL). MySQL replication is based on logical replication, where SQL statements are recorded in the binlog, whereas PostgreSQL uses a physical streaming replication model.
The key difference lies in how changes are captured and replicated:
MySQL (Logical Replication): Records SQL statements (e.g., INSERT, UPDATE, DELETE) in the binlog. These changes are then replicated to the replica databases at the SQL statement level. Logical replication is more flexible and captures the exact SQL commands executed on the master.PostgreSQL (Physical Replication): Uses Write-Ahead Logs (WAL), which record low-level changes to the database at a disk block level. In physical replication, changes are transmitted as raw byte-level data, specifying exactly what blocks of disk pages have been modified. For example, it could record something like: “At offset 14 of disk page 18 in relation 12311, wrote tuple with hex value 0x2342beef1222…”. This form of replication is more efficient in terms of storage but less flexible.
To address the need for more flexible replication and change capture, PostgreSQL introduced logical decoding in version 9.4. Logical decoding extracts a detailed stream of database changes (inserts, updates, and deletes) from a database in a more flexible and manageable way compared to physical replication. Under the covers, a logical replication captures changes in the Postgres Write-Ahead Log (WAL) and streams them in a human-readable format to the client.
Similarly to what we saw in MySQL, take the below INSERT statement as an example:
— Insert a new record
INSERT INTO users (id, name, email) VALUES (1, ‘Peter’, ‘peter@gmail.com’);
Once the changes are made, pg_recvlogical (a tool for controlling PostgreSQL logical decoding streams) should output the following changes:
BEGIN
table customer: INSERT: id[integer]:1,name[text]:Peter,email[text]:peter@gmail.com
It is through PostgreSQL’s logical decoding capability that CDC tools can stream real-time data changes from PostgreSQL to downstream systems, such as streaming applications, message queues, data lakes, and other external data platforms.
Conclusion
By understanding how transaction logs work in MySQL and PostgreSQL, we gain valuable insights into how CDC tools leverage these logs to perform incremental replication to downstream systems such as streaming applications, data lakes, and analytics platforms. We explored the differences between MySQL’s Binlog and PostgreSQL’s WAL, highlighting how PostgreSQL’s introduction of logical decoding enabled seamless integration with CDC tools.
This is the first post in our Change Data Capture and Streaming Applications series. Stay tuned for more insights, and don’t forget to follow, share, and leave a like!
References
MySQL Binlog Files: A Deep DiveLogical Decoding Output Plug-in Installation for PostgreSQLMySQL :: MySQL 8.4 Reference Manual :: 7.4.4.1 Binary Logging Formats
Understanding Change Data Capture (CDC) in MySQL and PostgreSQL: BinLog vs. WAL + Logical Decoding was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.