Supporting Very Large CDC Syncs with WASS (WAL Acquisition Synchronization System) (original) (raw)

Change Data Capture (CDC) is a pivotal concept in database management, enabling real-time tracking and replication of data changes. CDC is leveraged by data practitioners to track all changes applied to the database in near real-time, including inserts, deletes, and updates. Airbyte offers incremental replication via CDC and is one of the most popular replication methods to support incremental syncing of large databases.

As part of our efforts towards Airbyte 1.0, we are proud to announce improvements to our initial CDC snapshotting capabilities that will ensure Airbyte can sync large databases successfully without losing our position in the transaction log. This can cause adverse side effects such as needing to resync existing data and excessive transaction log buildup. In this blog post, we will explore the improvements that were introduced for database sources and the methodology behind them.

How Airbyte used to perform initial CDC syncs

Before going into the details of the enhancements, let’s take a moment to understand the components of a CDC sync. CDC typically reads off of a database’s transaction log (WAL in postgres, binlog in MySQL, oplog in MongoDB, Transaction Logs in SQL Server). These changelogs have a retention period associated with them (e.g. last day, last week) and do not contain the full history of changes in a database. Consequently, during an initial sync Airbyte is required to perform a snapshot of the full underlying database table before it can start streaming incremental changes off of the transaction log.

Previously, Airbyte used to perform the initial snapshot for all configured tables prior to reading off of the WAL. For some large syncs, the time to perform this snapshot could be longer than the transaction log retention time. As a result, any changes made to the source database during this time would be lost as the transaction log would be rotated out. Airbyte would prompt the user to reset their data and resync as there was a possibility of data loss - missing out on changes to the database while the initial snapshot was running. At best, this involves another sync (involving several more days of downtime). At worst, this involved not being able to support a sync since the initial load could never be completed.

The solution to this was manual. Our solutions engineer John Wasserman (aka Wass) used to add tables one by one to make sure we were making progress reading the transaction log.

How Airbyte supports large incremental CDC syncs

Our improvements to the connector simulate this manual progress. In homage to him, we’ve dubbed this algorithm the WASS (WAL Acquisition Synchronization System).

Airbyte’s initial snapshotting mechanism periodically switches between performing an initial snapshot and reading the transation log. As a result, we can ensure that there is no long-term buildup in the transaction log. This combined with our databases checkpointing capabilities ensures that Airbyte can now sync databases of any size. Concretely, this involves the following steps:

  1. Start the initial snapshot for configured tables. This involves issuing a series of SQL queries to the underlying database to sync data in small chunks in a way that is efficient, resumable, and unobtrusive.
  2. After a pre-configured amount of time, switch to read the transaction log for completed and started syncs if the initial snapshot is still running.
  3. Trigger a new attempt and resume. Steps 1 & 2 are repeated until the initial snapshot is complete.

This method ensures that the transaction log cannot build up for more than a fixed period of time. If you are curious about the implementation, here is the Pull Request that implemented the WASS algorithm for the MySQL source.

Ensuring data correctness

While implementing any changes to our core algorithms, we wanted to make sure that there were no data loss scenarios added. Airbyte’s at-least once delivery allows for some duplicate records to be emitted but guarantees that there will be no data loss.

We run CDC for streams that have started or completed snapshotting. The case of streams for which an initial snapshot is in progress but not completed is particularly interesting. If an initial snapshot is in progress we have 2 different scenarios:

  1. A change on the transaction log corresponds to a record that was already processed by the initial snapshot. This is a change that could not yet be reflected in the initial snapshot. This change must be emitted to ensure that there is no data loss.
  2. A record on the transaction log corresponds to a record not yet processed by the initial snapshot. That record will be emitted twice - once from the transaction log and another time from when the initial snapshot is resumed.

Future enhancements

With these enhancements, Airbyte now succeeds large initial snapshots for CDC syncs. To further enhance the snapshotting phase, Airbyte plans to:

  1. Perform the snapshotting phase in parallel for all tables.
  2. Only process records in the CDC catchup phase in a stream that correspond to records already processed by the initial snapshot.

These enhancements are available on

Update today and start syncing huge tables!

What other great readings

This article is part of a series of articles we wrote around all the reliability features we released for Airbyte to reach the 1.0 status. Here are some other articles about those reliability features:

Don't hesitate to check on what was included in Airbyte 1.0's release!

Limitless data movement with free Alpha and Beta connectors

Introducing: our Free Connector Program ->