Introducing Refreshes: Reimport Historical Data with Zero Downtime

While data practitioners typically prefer incremental data movement, there are times when re-importing historical data becomes necessary. Whenever possible, Airbyte tracks progress using cursors. Re-importing involves “rewinding” the cursor to read data from the beginning, an operation previously known as a Reset.

There are many scenarios where a Reset was useful. Perhaps the source has expanded the availability of historical data, or it contained incorrect data that needs to be overwritten. Sometimes, sources return inconsistent data, such as when an API without strict cursor-based pagination is read while the underlying data is being updated. Databases configured with loose transaction levels can exhibit similar issues. Additionally, if something goes wrong with your sync, you might need to start from scratch. A reliable data movement tool must seamlessly handle these situations.

While effective, Resets had their limitations, particularly data downtime. As part of Airbyte 1.0, we are excited to introduce significant improvements to the Reset functionality. Notably, three new operations that enable re-imports of historical data without data downtime: Clear, Refresh and Remove, and Refresh and Retain. This article explores the new Refresh functionality, its technical details, and some new data quality features.

Reset Limitations

A “Reset” was how Airbyte supported re-imports of historical data. Resets kicked off a job that:

  1. Cleared a connection’s cursor, effectively ‘rewinding’ it.
  2. Prepared for the new data by removing the existing data in the destination.
  3. Kicked off a new sync job moving the data using the reset cursor. The Reset is complete after this job succeeds.

Although Resets served the purpose of re-importing source data, their implementation contained several limitations.

Chiefly, the two separate jobs - the first to clear existing data, the second to re-import it - meant data downtime for users. Users kicking off Reset jobs saw empty destinations until the new sync job succeeded. Data is quickly becoming a critical business function, and any downtime, no matter how small, is unacceptable!

Secondly, clearing all data could inadvertently lead to missing data when combined with sources with limited lookback windows. For example, both the Zendesk and Stripe APIs only have 30-day lookback windows. Users running Resets with these sources would suddenly find data older than 30 days gone! While this is outside Airbyte’s control (Airbyte can only view what the API returns), we knew we could, and wanted to, serve our users better.

Lastly, combining clearing the destination’s data with re-importing data meant Airbyte did not deal well with inconsistent sources. An inconsistent source is one that might not return the same set of data on every read. An API source implementing offset pagination commonly runs into this issue and is the go-to example. The simplest offset pagination implementation is a LIMIT .. OFFSET query with the OFFSET exposed as a cursor. The lack of an ORDER BY clause means unpredictable results, and concurrent data modifications can easily reorder rows, leading to missed rows or the same row appearing on different pages. With such an unreliable cursor, users resetting their Airbyte connection sometimes found themselves confusingly looking at a different data set.
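The failure mode above is easy to reproduce. The following sketch (illustrative only, not Airbyte code) simulates a LIMIT .. OFFSET reader racing against a concurrent delete, showing how a row can silently vanish from the combined result:

```python
# Illustrative sketch: why OFFSET-based pagination is unreliable
# when the underlying data changes between page reads.

def fetch_page(rows, offset, limit):
    """Simulates `SELECT ... LIMIT limit OFFSET offset` (no ORDER BY guarantee)."""
    return rows[offset:offset + limit]

rows = ["a", "b", "c", "d", "e", "f"]

# Page 1 is read...
page1 = fetch_page(rows, offset=0, limit=3)   # ["a", "b", "c"]

# ...then a concurrent writer deletes "b", shifting every later row down.
rows.remove("b")

# Page 2 resumes at offset 3 and silently skips "d".
page2 = fetch_page(rows, offset=3, limit=3)   # ["e", "f"]

assert "d" not in page1 + page2  # "d" was never returned to the reader
```

Swap the delete for an insert and the same mechanics produce duplicate rows instead of missing ones.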

Reliability is crucial for any data movement tool, so we knew we had to improve on all these when we started redesigning Resets a few months ago.

Improving Reset with Refreshes

With that in mind, we are happy to announce Refreshes as an improvement over Resets.

With Refreshes, Airbyte overwrites and moves data in a single job. Further, existing data is now only cleared after the new set of data has been successfully read. Users no longer see missing data during a Refresh job or after a Refresh failure, a big data reliability win!

We’ve also split the various use cases Reset previously served into different Refresh options for clarity and flexibility. All the following operations are available per-stream on the Connection page, or on the entire connection in the Settings page.

Reset is renamed as Clear

An important use case for Reset is clearing the final destination. Data hygiene, correctly tracking and removing data, is crucial in our privacy-first world. While the need to completely erase data is rare, when it does arise it often requires immediate action.

We are keeping this functionality and renaming Reset to Clear to highlight the destructive action Airbyte performs. Unlike Reset, Clear no longer triggers a follow-up sync job, i.e., only the data is cleared.

A confirmation modal safeguards against accidental data deletion.

Introducing Refresh and Remove

The existing Reset use case of clearing the destination data and reimporting all data has been renamed to Refresh and Remove.

This is the simplest type of Refresh Sync and only displays data in the destination’s final tables that was newly synced since the refresh was initiated. We recommend using this only if reimporting a connection with a consistent data source supporting full historical reads e.g. databases or general file systems.

Introducing Refresh and Retain

As not all sources keep history forever, care needs to be taken when reimporting a connection with a buggy or lossy source. Refresh and Remove is insufficient for these cases if keeping a full history is required.

Airbyte is introducing a new mode to tackle this issue with the Refresh functionality: Refresh and Retain.

With Refresh and Retain, Airbyte keeps any previously synced records alongside newly captured records. If deduplication is enabled, Airbyte also merges previous records with any newly captured records from the last Full Refresh job. This is especially useful for ensuring data retention on sources with short historical windows, most commonly seen in API sources.

Note Refresh and Retain, by its nature, will lead to increased storage costs in your destination due to the potential duplication of old and new data copies. This may also lead to increased compute costs and slower syncs, depending on your destination.

The Resumability Challenge with Refreshes

One of the main challenges in designing Refreshes is ensuring Resumability. A crucial aspect of achieving a seamless experience is effectively managing inevitable transient errors. Airbyte addresses this through its Jobs and Attempts framework, where each job consists of various attempts, with each attempt resuming from where the previous one left off.

To achieve zero data downtime, Airbyte must swap the final tables at precisely the right moment across multiple attempts. Since a connection can involve multiple streams, this swapping process must also be stream-aware. The complexity increases further due to the nuances of different destinations. For instance, warehouse and database destinations typically upload incoming data to a temporary table. Airbyte must determine whether the current attempt’s data should be written to an existing temporary table or a new one. The possible scenarios are mind-boggling and exhausting! We know because we tried writing them all!

As the saying goes, “All problems in computer science can be solved by another level of indirection.” We realised we were too deep in the weeds and needed a crucial middle abstraction: Data Generations.

Data Generations

A Generation represents the number of times each record has been reimported. Each time a Refresh (of any type) occurs, the generation number increments; this number is also known as the current generation ID. A record from a connection without any refreshes is tagged with generation 0. After a Refresh job, the same record would be assigned generation 1. Data synced before the introduction of this feature will have a generation of null.

Additionally, there is now a minimum generation ID. Both the current generation ID and the minimum generation ID are tracked per stream and are available for each job. Airbyte uses these IDs to determine which data to remove or retain.

With Retain, the minimum generation ID is set to 0, prompting Airbyte to retain all generations. With Remove, the minimum generation ID is set to the current generation ID, resulting in only the current generation being retained. Although Airbyte currently supports only these two modes, this approach paves the way for retaining n generations for cost efficiency.
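The two modes then reduce to a single filtering rule over the minimum generation ID. The sketch below is a hypothetical illustration of that rule (the function name and the “keep records at or above the minimum” semantics are our assumptions, not Airbyte’s actual implementation):

```python
# Hypothetical sketch of how generation IDs could drive retention.
# The ">= minimum generation" rule is an assumption for illustration.

def records_to_keep(records, minimum_generation_id):
    """Keep every record whose generation is at or above the minimum.

    Records synced before the feature existed (generation None) are
    treated as generation 0 here for simplicity.
    """
    return [r for r in records if (r["generation_id"] or 0) >= minimum_generation_id]

records = [
    {"id": 1, "generation_id": 0},  # from the original sync
    {"id": 2, "generation_id": 1},  # re-read by the refresh
    {"id": 3, "generation_id": 1},
]

# Refresh and Retain: a minimum of 0 keeps every generation.
assert len(records_to_keep(records, 0)) == 3

# Refresh and Remove: raising the minimum to the current generation (1)
# drops the older generation-0 copy.
assert len(records_to_keep(records, 1)) == 2
```

Retaining n generations would simply mean setting the minimum to the current generation minus n.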

Data Generations also provides advanced data observability features that we’ll discuss later.

Signalling Stream Completion

The final piece of the puzzle is timing the swap of the final tables perfectly. The solution became evident once we incorporated Data Generations: use the existing per-stream Stream Status Complete message as a trigger. This approach seamlessly integrates into existing abstractions and is straightforward. Win-win!
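In outline, the trigger works something like the following sketch. The names (`on_stream_status`, `swap_final_table`) and the swap mechanism are illustrative assumptions, not Airbyte’s internal API:

```python
# Hedged sketch: only swap a stream's temporary table into the final
# table once a Stream Status COMPLETE message is seen for that stream.

completed_streams = set()

def swap_final_table(stream_name):
    # In a warehouse destination this would be an atomic per-stream
    # rename/swap, e.g. an ALTER TABLE ... RENAME statement.
    print(f"swapping final table for {stream_name}")

def on_stream_status(stream_name, status):
    if status == "COMPLETE":
        completed_streams.add(stream_name)
        swap_final_table(stream_name)

on_stream_status("users", "RUNNING")   # no swap yet
on_stream_status("users", "COMPLETE")  # swap happens exactly here, per stream
```

Because the message is emitted per stream, each stream’s final table swaps independently as soon as that stream finishes, regardless of how many attempts the job took.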

Debugging Data Quality Issues

Each record’s generation is also present in the _airbyte_generation_id column in the final table. This additional metadata allows advanced data observability features.

Understanding and Recovering from a Flaky Source

Consider the following example: you are extracting data into your data warehouse using an append sync mode, and you notice that data for March 2024 is missing.

Since you know the source is sometimes unreliable, you decide to perform a Refresh and Retain History Sync. This time, the source returns the correct data set, which Airbyte merges, giving you the missing row.

It is now possible to compute sum(total_sales) for each generation and, if the totals differ, look for records that exist in one generation but not the other. If you aren’t using an append sync mode, the data for both the previous and current generations is retained in your destination’s raw tables but not displayed in the final tables; a similar analysis can be performed there to identify records which exist in the current generation but not the previous one.
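The per-generation comparison can be run directly in the warehouse. Here is a runnable sketch using an in-memory SQLite table with made-up sales rows; the column name mirrors Airbyte’s `_airbyte_generation_id` metadata column, while the table and values are purely illustrative:

```python
# Runnable sketch of the per-generation sum(total_sales) comparison.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        order_id INTEGER,
        total_sales REAL,
        _airbyte_generation_id INTEGER
    )
""")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, 100.0, 0),   # generation 0: the flaky first sync...
        (2, 250.0, 0),
        (1, 100.0, 1),   # generation 1: the Refresh and Retain re-read,
        (2, 250.0, 1),
        (3,  75.0, 1),   # ...which recovered the missing March row
    ],
)

# Compare sum(total_sales) across generations; a difference flags missing rows.
totals = dict(conn.execute(
    "SELECT _airbyte_generation_id, SUM(total_sales) "
    "FROM sales GROUP BY _airbyte_generation_id"
))
print(totals)  # {0: 350.0, 1: 425.0} -> generation 1 found an extra 75.0
```

Once the totals differ, an anti-join between the two generations pinpoints the exact records that were missed.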

Discovering Deletes from a Source that Hides Them

Another example of using generation IDs: extracting data into your data warehouse from a source that doesn’t track deletes. You are using an append+dedupe sync mode, and you want to detect when a record has been deleted from the source.

Time passes, and you opt to perform a Refresh and Retain to see if any users were removed from the source. Your final table now holds one deduplicated entry per user, each tagged with the generation in which it was last seen.

If a user’s latest entry does not belong to the current (i.e., max(_airbyte_generation_id)) generation, the source no longer includes a record for that primary key, meaning it has been deleted. In your downstream tables or analysis, you can choose to exclude this record.
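The delete-detection query is a one-liner against the final table. This runnable sketch again uses an in-memory SQLite table with made-up user rows (one deduped row per primary key, as an append+dedupe mode would produce):

```python
# Illustrative sketch of detecting source-side deletes via generation IDs.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        name TEXT,
        _airbyte_generation_id INTEGER
    )
""")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "alice", 1),  # re-read in the latest refresh
        (2, "bob",   0),  # latest entry is from an older generation...
        (3, "carol", 1),
    ],
)

# Rows whose generation is below the current maximum were not returned
# by the latest refresh, so the source has most likely deleted them.
deleted = [row[0] for row in conn.execute("""
    SELECT user_id FROM users
    WHERE _airbyte_generation_id <
          (SELECT MAX(_airbyte_generation_id) FROM users)
""")]
print(deleted)  # [2] -> bob no longer exists in the source
```

Downstream models can then filter these rows out, or materialize them as soft deletes.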

Closing Thoughts

With the introduction of Refreshes, Airbyte significantly enhances its ability to handle re-imports of historical data without data downtime. By consolidating the operations of data clearing and syncing into a single job, Refreshes ensure seamless data overwrites and eliminate the periods of missing data that previously occurred during Resets.

The new modes (Clear, Refresh and Remove, and Refresh and Retain) provide greater flexibility and clarity in managing data re-imports. The integration of Data Generations introduces advanced data observability features, enabling users to track and manage data changes more effectively. These improvements not only increase the reliability and robustness of Airbyte as a data movement tool but also align it with the growing need for uninterrupted and consistent data availability in modern business environments.

Other Great Reads

This article is part of a series we wrote about the reliability features we released for Airbyte to reach its 1.0 status.

Don't hesitate to check on what was included in Airbyte 1.0's release!