How Airbyte 1.0 Monitors Sync Progress and Solves OOM Failures
At Airbyte, we're committed to offering you reliability and transparency into your syncs. As a part of our Airbyte 1.0 launch, we dove into a common user problem that impacted our perceived reliability: stuck syncs.
From a survey across our Support, Sales, Product and Community teams, we found there were four reasons why users would perceive syncs to be stuck.
- The sync failed because we stopped seeing responses from the source or destination
- The sync isn’t making progress fast enough
- The sync failed because Airbyte required more memory than was allocated
- The sync failed because of Temporal payload limits, which users cannot resolve themselves
These four reasons in combination spurred an initiative to resolve these issues to improve our sync success rates and provide more transparency to users during a sync.
1. The sync failed because we stopped seeing responses from the source or destination
This was the most prevalent issue for our Engineering on-call team at the onset of our investigation. We had previously implemented a heartbeat mechanism that automatically failed a sync if the source had not sent any records to Airbyte for 60 minutes. This was a standard ceiling across all sources, meant to ensure syncs did not hang for long periods without any progress.
While extracting data from sources, a source API may throttle or limit the number of requests it will handle within a certain time frame. Sources immediately stop emitting records once a rate limit is exceeded, and those rate limits often last a full hour. It quickly became clear that a smarter catch-all ceiling would be 90 minutes. A handful of sources have much larger rate limit windows or require special handling, like LinkedIn (24-hour window) or Facebook (kicking off an async job). To solve this, we implemented CDK features and new Connector Builder functionality to handle custom rate limit windows.
A source’s rate limit window can now be defined specifically for that source. When Airbyte is actively being rate limited by the source, our platform not only identifies the issue but also provides an estimation indicating when the API is expected to be available again. This information is then displayed in the Airbyte UI, giving you real-time insights into the status of your connections. If the source indicates when the rate limit will end, you'll also see a countdown, empowering you to manage expectations and plan accordingly.
Airbyte pauses sync attempts until the rate limit is lifted, then automatically resumes. The Active Streams section will indicate which specific streams have been rate limited.
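The detection and countdown described above can be sketched with a small helper that inspects an HTTP response for a rate limit and estimates when syncing may resume. This is a simplified illustration, not Airbyte's actual CDK code; the 60-second fallback wait is an assumption for the sketch.

```python
import email.utils

RATE_LIMIT_STATUS = 429  # HTTP "Too Many Requests"


def resume_time(status_code, headers, now):
    """Return the epoch second when syncing may resume, or None if not rate limited.

    Handles both forms of the Retry-After header: delta-seconds ("30")
    and an HTTP-date ("Wed, 21 Oct 2015 07:28:00 GMT").
    """
    if status_code != RATE_LIMIT_STATUS:
        return None
    retry_after = headers.get("Retry-After")
    if retry_after is None:
        # The API gave no hint, so fall back to a default wait (assumed value).
        return now + 60.0
    try:
        return now + float(retry_after)  # delta-seconds form
    except ValueError:
        # HTTP-date form: resume at that absolute time.
        return email.utils.parsedate_to_datetime(retry_after).timestamp()
```

When the source supplies `Retry-After`, the difference between the returned timestamp and the current time is exactly the countdown a UI could display.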
This new level of transparency is designed to give you a greater understanding of your sync processes. No more guessing or waiting in uncertainty.
2. The sync isn’t making progress fast enough
We heard feedback from users that it was hard to know what progress a sync was making. This was particularly difficult when syncing larger data sets or historical data. To work around this, users were going directly into the sync logs, which are verbose and cumbersome.
To improve the user experience of tracking live syncs, we added 10-second polling to pull sync progress. In the UI, each stream shows how much data has been extracted and loaded, and how long the stream has been actively syncing for. We’ll also show how long it’s been since data was last loaded.
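A minimal sketch of what each 10-second poll could render per stream, assuming a progress snapshot with illustrative field names (these are not Airbyte's actual API fields):

```python
POLL_INTERVAL_SECONDS = 10  # matches the UI's 10-second refresh


def format_stream_progress(stream):
    """Render one stream's live progress line, similar to what the UI shows.

    `stream` is a snapshot dict with illustrative keys: name,
    records_extracted, records_loaded, seconds_active,
    and optionally seconds_since_last_load.
    """
    line = (
        f"{stream['name']}: {stream['records_extracted']:,} records extracted, "
        f"{stream['records_loaded']:,} loaded, "
        f"active for {stream['seconds_active']}s"
    )
    if stream.get("seconds_since_last_load") is not None:
        line += f" (last load {stream['seconds_since_last_load']}s ago)"
    return line
```

A UI or CLI would call this for every active stream on each poll tick, so stalled streams stand out by a growing "last load" gap.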
3. The sync failed because Airbyte required more memory than was allocated
Syncs would fail with an “Out of Memory” (OOM) error for both API and database sources. On Airbyte Cloud, we set a non-configurable ceiling on the resources a connector can use; if a connector requested too much data at once, the sync would fail and not recover.
To combat these errors, our Connectors teams worked to improve our CDK handling of large requests to ensure that connectors would only pass a defined set of data that fit within that ceiling.
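The idea of keeping requests within a fixed memory ceiling can be sketched as a batching helper that never holds more than a configured number of bytes at once. This is a simplified illustration of the technique, not Airbyte's CDK implementation:

```python
def bounded_batches(records, max_batch_bytes, size_of):
    """Yield lists of records whose combined size stays within max_batch_bytes.

    records:         an iterable of records (streamed, never fully materialized)
    max_batch_bytes: the memory ceiling for a single batch
    size_of:         a callable estimating one record's size in bytes
    """
    batch, batch_bytes = [], 0
    for record in records:
        record_bytes = size_of(record)
        if batch and batch_bytes + record_bytes > max_batch_bytes:
            yield batch  # flush before the ceiling would be exceeded
            batch, batch_bytes = [], 0
        batch.append(record)
        batch_bytes += record_bytes
    if batch:
        yield batch  # flush the final partial batch
```

Because the input is consumed lazily and each batch is flushed before it crosses the ceiling, peak memory is bounded regardless of the total sync size.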
4. The sync failed because of Temporal payload limits, which users cannot resolve themselves
We saw sync errors among our Cloud customers caused by Temporal’s 2MB payload limit: we were passing large payloads directly instead of passing a pointer to them. When serialized by our system, those payloads exceeded the Temporal limit and failed the sync. This had a larger impact on users with large schemas and bigger data sets.
To resolve the issue, we modified our syncs to store intermediate payloads in storage and then passed pointers to that storage.
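The fix follows the claim-check pattern: payloads under the limit pass inline, larger ones are written to storage and replaced by a pointer. In this sketch the dict-backed store, the key scheme, and the helper names are stand-ins for a real object store and Temporal's serialization, not Airbyte's actual code:

```python
import json
import uuid

PAYLOAD_LIMIT_BYTES = 2 * 1024 * 1024  # Temporal's per-payload ceiling


def pack_payload(payload, store):
    """Pass small payloads inline; spill large ones to storage and return a pointer."""
    raw = json.dumps(payload).encode("utf-8")
    if len(raw) <= PAYLOAD_LIMIT_BYTES:
        return {"inline": payload}
    key = f"payloads/{uuid.uuid4()}"
    store[key] = raw  # stand-in for an object-store upload
    return {"pointer": key}


def unpack_payload(packed, store):
    """Resolve a packed payload, fetching from storage when given a pointer."""
    if "inline" in packed:
        return packed["inline"]
    return json.loads(store[packed["pointer"]])
```

Only the small `{"pointer": ...}` envelope ever crosses the workflow boundary, so serialization stays well under the 2MB limit no matter how large the underlying payload grows.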
Other great reads
This article is part of a series of articles we wrote around all the reliability features we released for Airbyte to reach the 1.0 status. Here are some other articles about those reliability features:
- Handling large records
- Checkpointing to ensure uninterrupted data syncs
- Load balancing Airbyte workloads across multiple Kubernetes clusters
- Resumable full refresh, as a way to build resilient systems for syncing data
- Refreshes to reimport historical data with zero downtime
- Notifications and Webhooks to enable set & forget
- Supporting very large CDC syncs
- Automatic detection of dropped records
Don't hesitate to check on what was included in Airbyte 1.0's release!