Maintaining Hundreds of API Connectors with the Low-Code CDK and Connector Builder
A successful data movement platform must access data from a wide array of sources, which requires supporting a vast collection of connectors. In a previous blog post, we discussed the challenges of maintaining connectors in-house and the need for an open-source framework.
We’ve gained significant insights on connector maintenance since writing that post in 2021. While the core sentiment remains valid, our experience building hundreds of connectors has taught us that most API source connectors follow a standardized structure. This realization allowed us to simplify connector development even further, moving beyond our initial Python CDK (Connector Development Kit). In this article, I’ll explain how we leverage the low-code CDK and the Connector Builder to build and maintain hundreds of API connectors efficiently.
Introducing the low-code CDK: Writing connectors in YAML
Despite their diverse implementations, API connectors typically address a common set of challenges:
- Submitting a request to one or multiple endpoints
- Authenticating using a well-defined mechanism such as OAuth or an API key
- Paginating through multiple pages of records
- Gracefully handling rate limits and errors
- Describing the schema of the data returned by the API
- Decoding the data returned by the API
- Supporting incremental data exports by keeping track of what data was already synced
To simplify connector development, we broke down these challenges into modular building blocks, each with a clear interface. This approach allows us to build an extensible system capable of supporting any API connector. As long as these blocks adhere to their interfaces, new features can be integrated without disrupting existing functionality.
Most APIs follow a few common patterns to solve these challenges. For instance, pagination is typically handled using one of three methods: limit-offset, page count, or cursor-based. By providing implementations of each of these mechanisms within the low-code CDK, developers can easily incorporate them into connectors without having to reinvent the wheel.
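To illustrate, here is a minimal sketch of how interchangeable pagination strategies can be modeled behind one interface. This is a simplified illustration, not the CDK's actual classes; the `Protocol` and the method name `next_page_token` are assumptions here, even though the CDK has similarly named concepts:

```python
from typing import List, Optional, Protocol


class PaginationStrategy(Protocol):
    """Simplified interface for a pagination building block."""

    def next_page_token(self, records: List) -> Optional[int]:
        ...


class OffsetIncrement:
    """Limit-offset strategy: advance the offset by the page size
    until the API returns a short page."""

    def __init__(self, page_size: int) -> None:
        self.page_size = page_size
        self._offset = 0

    def next_page_token(self, records: List) -> Optional[int]:
        # A page smaller than page_size means we reached the last page
        if len(records) < self.page_size:
            return None
        self._offset += self.page_size
        return self._offset


strategy = OffsetIncrement(page_size=5)
print(strategy.next_page_token([1, 2, 3, 4, 5]))  # full page -> next offset is 5
print(strategy.next_page_token([1, 2]))           # short page -> None, stop paginating
```

Because any object satisfying the interface can be swapped in, a page-count or cursor-based strategy can replace `OffsetIncrement` without touching the rest of the connector.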
The low-code CDK therefore allows developers to create API connectors using a combination of configurable, predefined building blocks. These blocks can be mixed and matched to meet the specific requirements of any given API, streamlining the development process and reducing the need for extensive custom coding.
Example connector using the low-code CDK
Using this mental model, an Airbyte stream can be thought of as configuring:
- How to submit a request to an API
- How to extract the records from the API response
- How to paginate
Here is an example connector definition describing how to read the list of Pokémon from the PokéAPI:
streams:
  Pokemons:
    type: DeclarativeStream
    name: Pokemons
    retriever:
      type: SimpleRetriever
      requester: # Describes how to submit a request
        type: HttpRequester
        url_base: https://pokeapi.co/api/v2/
        path: /pokemon
        http_method: GET
      record_selector: # Describes how to select records from the response
        type: RecordSelector
        extractor:
          type: DpathExtractor
          field_path:
            - results
      paginator: # Describes how to obtain the URL of the next page of records
        type: DefaultPaginator
        page_token_option:
          type: RequestOption
          inject_into: request_parameter
          field_name: offset
        page_size_option:
          type: RequestOption
          inject_into: request_parameter
          field_name: limit
        pagination_strategy:
          type: OffsetIncrement
          page_size: 5
This is still quite a large YAML file. Let’s deconstruct it a bit.
The following block describes how to submit requests to fetch the data:
requester:
  type: HttpRequester
  url_base: https://pokeapi.co/api/v2/
  path: /pokemon
  http_method: GET
You can read this as: submit a GET request to the base URL https://pokeapi.co/api/v2/ using the path “/pokemon”, which resolves to the analogous curl command: curl https://pokeapi.co/api/v2/pokemon
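As a sketch, the way url_base and path combine into the final request URL can be reproduced with Python's standard library (`build_url` is a hypothetical helper for illustration, not part of the CDK):

```python
from urllib.parse import urljoin


def build_url(url_base: str, path: str) -> str:
    # urljoin resolves the path against the base; stripping the leading "/"
    # keeps the "/api/v2/" prefix of the base URL intact
    return urljoin(url_base, path.lstrip("/"))


print(build_url("https://pokeapi.co/api/v2/", "/pokemon"))
# https://pokeapi.co/api/v2/pokemon
```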
Then this block describes how to extract the records from the response:
record_selector:
  type: RecordSelector
  extractor:
    type: DpathExtractor
    field_path:
      - results
This extractor assumes the response can be decoded as a JSON object, and is simply defined as the path to the records in the decoded response. Here is a sample response for reference:
{
  "count": 1302,
  "next": "https://pokeapi.co/api/v2/pokemon?offset=20&limit=20",
  "previous": null,
  "results": [
    {
      "name": "bulbasaur",
      "url": "https://pokeapi.co/api/v2/pokemon/1/"
    },
    {
      "name": "ivysaur",
      "url": "https://pokeapi.co/api/v2/pokemon/2/"
    },
    …
  ]
}
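To make the field_path idea concrete, here is a minimal Python sketch that walks the decoded response down to the list of records, using the sample response above (`extract_records` is a hypothetical helper, not the CDK's DpathExtractor):

```python
from typing import Any, List, Mapping


def extract_records(response: Mapping[str, Any], field_path: List[str]) -> list:
    """Follow field_path key by key through the decoded JSON to reach the records."""
    node: Any = response
    for key in field_path:
        node = node[key]
    return node


response = {
    "count": 1302,
    "next": "https://pokeapi.co/api/v2/pokemon?offset=20&limit=20",
    "previous": None,
    "results": [
        {"name": "bulbasaur", "url": "https://pokeapi.co/api/v2/pokemon/1/"},
        {"name": "ivysaur", "url": "https://pokeapi.co/api/v2/pokemon/2/"},
    ],
}

print([r["name"] for r in extract_records(response, ["results"])])
# ['bulbasaur', 'ivysaur']
```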
Finally, this block describes how to paginate to fetch all records:
paginator:
  type: DefaultPaginator
  page_token_option:
    type: RequestOption
    inject_into: request_parameter
    field_name: offset
  page_size_option:
    type: RequestOption
    inject_into: request_parameter
    field_name: limit
  pagination_strategy:
    type: OffsetIncrement
    page_size: 5
This describes a limit-offset paginator, which adds two request parameters to the outgoing requests sent to the API:
- limit: the number of records to include in the response
- offset: the offset to start from
Using this pagination strategy, the connector will automatically submit requests with an increasing offset until all records are read.
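The loop the connector runs can be sketched as follows. To keep the example self-contained and offline, `fetch_page` stands in for the HTTP call to the API (a hypothetical stand-in, not CDK code):

```python
from typing import List

# Pretend server-side dataset, in place of the real PokéAPI
DATASET = [f"pokemon-{i}" for i in range(12)]


def fetch_page(limit: int, offset: int) -> List[str]:
    # Equivalent to: GET /pokemon?limit={limit}&offset={offset}
    return DATASET[offset:offset + limit]


def read_all(page_size: int = 5) -> List[str]:
    records: List[str] = []
    offset = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        records.extend(page)
        if len(page) < page_size:  # short page: no more data to fetch
            return records
        offset += page_size


print(len(read_all()))  # 12: three requests at offsets 0, 5, and 10
```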
Of course, this is only a subset of the configurable features available in the low-code CDK. I recommend going through the no-code and low-code documentation and tutorial for more details on the full feature set.
Extending the framework
One common drawback of no-code frameworks is their inflexibility, often requiring developers to navigate complex workarounds for unsupported use cases. The low-code framework mitigates this issue by allowing developers to fully customize each connector component using Python code.
This customization is achieved by implementing your own custom class. The only requirement is that the class must adhere to the interface of the building block it is replacing. For example, to define a cursor object with custom state management, a developer would simply create a new class:
from typing import Any, Iterable, Mapping, MutableMapping, Optional, Union

# Cursor, Record, StreamSlice, and StreamState are imported from the airbyte_cdk package


class CustomCursor(Cursor):
    def stream_slices(self) -> Iterable[StreamSlice]:
        ...

    def set_initial_state(self, stream_state: StreamState) -> None:
        ...

    def close_slice(self, stream_slice: StreamSlice, most_recent_record: Optional[Record]) -> None:
        ...

    def should_be_synced(self, record: Record) -> bool:
        ...

    def is_greater_than_or_equal(self, first: Record, second: Record) -> bool:
        ...

    def get_stream_state(self) -> StreamState:
        ...

    def get_request_params(
        self,
        *,
        stream_state: Optional[StreamState] = None,
        stream_slice: Optional[StreamSlice] = None,
        next_page_token: Optional[Mapping[str, Any]] = None,
    ) -> MutableMapping[str, Any]:
        ...

    def get_request_headers(self, *args, **kwargs) -> Mapping[str, Any]:
        ...

    def get_request_body_data(self, *args, **kwargs) -> Optional[Union[Mapping, str]]:
        ...

    def get_request_body_json(self, *args, **kwargs) -> Optional[Mapping]:
        ...
The custom code can then be referenced from the connector manifest:
incremental_sync:
  type: CustomIncrementalSync
  class_name: source_my_source.components.CustomCursor
Introducing the Connector Builder UI: The IDE for connector development
Connectors that can be described by a YAML configuration file can be developed using the Connector Builder.
The Connector Builder simplifies the development process by allowing developers to focus exclusively on the specifics of the API, without getting bogged down by the intricacies of the connector definition syntax.
The Connector Builder has become much more than an interface for generating YAML files. While the left-hand side of the stream configuration pane lets developers configure a new connector without writing code, the right-hand side immediately shows the outcome of each modification. This tightens the feedback loop and makes the Connector Builder the best IDE for connector development. Connectors built this way can be published directly to the Airbyte platform, making the process of iterating on a connector smoother and more efficient.
We encourage you to try it out on your Airbyte instance - developing new connectors for API sources has never been easier. If you’re interested in learning more about how to use the Builder, feel free to follow this tutorial.
How we migrated 38 API connectors to the low-code framework
Enabling connector developers to build new connectors using the low-code framework was only part of the solution. To truly ease our maintenance burden, we needed to migrate our existing connectors to the low-code framework as well.
This migration was both urgent and challenging. Despite being costly, risky, and offering little immediate benefit to users, it was necessary. Each connector built on older interfaces added to our maintenance debt, making them difficult to maintain and prone to bugs. The migration process allowed us to revisit these connectors, reassessing their implementations and addressing their limitations.
Migration process
To mitigate risks, we adopted an incremental approach. From January to July 2024, we focused each week on migrating a subset of connectors, continuously refining our process. Below are some of the most impactful improvements we made during the migration cycle.
Testing the migrated connectors
Initially, our existing test suite was inadequate for the task, leading to regressions during the early migrations. We took this issue seriously and invested in a comprehensive regression testing framework. This framework allowed us to verify the output of connectors using arbitrary configurations. We coupled the migration with extensive regression testing, using both sandboxes we seeded ourselves, and datasets provided by design partners. This approach enabled us to apply large-scale changes to connectors with minimal risks.
Avoiding breaking changes
While standardizing connectors is necessary for long-term maintainability, it can be disruptive if it requires breaking changes and user interventions. To minimize disruption, we provide developers with tools to define configuration and state migration functions. These functions can automatically be applied to sync inputs, aligning formats with the new standardized expectations without introducing breaking changes to the connector.
We identified several common patterns in legacy connector implementations. To streamline the migrations, we implemented the necessary transformations directly in the CDK, reducing the need for custom implementations in each connector. For connectors with more significant deviations, developers can implement their own custom transformations, ensuring flexibility while maintaining standardization.
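As a hedged sketch of what such a migration function can look like, suppose a legacy connector accepted a top-level "start_date" while the standardized version expects it nested under a "replication" object (both key names are hypothetical, chosen only for illustration):

```python
from typing import Any, Dict


def migrate_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Move the hypothetical legacy top-level start_date under "replication".

    The function is idempotent: already-migrated configs pass through unchanged,
    so it can safely run on every sync input."""
    migrated = dict(config)
    if "start_date" in migrated and "replication" not in migrated:
        migrated["replication"] = {"start_date": migrated.pop("start_date")}
    return migrated


legacy = {"api_key": "***", "start_date": "2021-01-01"}
print(migrate_config(legacy))
# {'api_key': '***', 'replication': {'start_date': '2021-01-01'}}
```

Because the transformation is applied automatically to sync inputs, users never see the format change, and the connector avoids a breaking change.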
Using the Connector Builder’s latest features
During the migration process, we encountered challenges that required custom components, making migrations more difficult. To overcome this, we enhanced the Connector Builder to allow connectors with custom components to be executed locally. This feature enables developers to define individual components in YAML within the Connector Builder, while also mounting the connector module during local execution.
This approach allowed us to leverage CDK features directly from the Connector Builder, even before they were fully integrated into the UI. This empowered advanced users to benefit from the feedback loop and streamlined testing functionalities of the Connector Builder even for more complex connectors.
Benefits of using the low-code CDK
The introduction of the low-code CDK has significantly impacted our ability to maintain connectors, offering several key benefits:
Removing code duplication
By creating new abstractions that are closer to the domain - such as Cursors instead of Stream Slices - we’ve embedded common logic directly into the CDK. This logic is now battle-tested and reusable across multiple connectors. As a result, the process of implementing a new connector is greatly simplified. Often, connectors don’t require any custom logic, but even when they do, developers only need to focus on customizing the specific components while reusing the existing, proven logic.
Implementing features once
The strong contract between connectors and the CDK allows us to implement features like Resumable Full Refresh just once, rather than repeatedly for each connector. This approach provides tremendous leverage: we spend less time rewriting the same logic, there’s less code to maintain, and any bugs that arise need to be fixed only once. This significantly reduces both development and maintenance overhead.
Exposing metadata with a YAML configuration file
Describing connectors as simple YAML configuration files enables us to extract valuable metadata about them. For example, there’s no need to inspect the code to determine whether a connector uses a limit-offset pagination strategy; it’s explicitly stated in the connector’s manifest. This metadata offers insights into the most-used features, levels of customization, and more, helping us better understand and manage our connector ecosystem.
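Such a metadata query over a parsed manifest might look like this sketch. A real implementation would first load the YAML file (e.g. with a YAML parser); here the manifest is inlined as a Python dict to keep the example self-contained, and `pagination_strategy` is a hypothetical helper:

```python
from typing import Any, Dict, Optional


def pagination_strategy(stream: Dict[str, Any]) -> Optional[str]:
    """Read the pagination strategy type straight out of a parsed stream manifest."""
    return (
        stream.get("retriever", {})
        .get("paginator", {})
        .get("pagination_strategy", {})
        .get("type")
    )


manifest_stream = {
    "type": "DeclarativeStream",
    "name": "Pokemons",
    "retriever": {
        "type": "SimpleRetriever",
        "paginator": {
            "type": "DefaultPaginator",
            "pagination_strategy": {"type": "OffsetIncrement", "page_size": 5},
        },
    },
}

print(pagination_strategy(manifest_stream))  # OffsetIncrement
```

Running this over every manifest in the catalog is enough to answer questions like "how many connectors use limit-offset pagination?" without reading any connector code.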
Conclusion
The transition to a low-code framework, supported by the Connector Builder, has revolutionized how we develop, maintain, and scale our catalog of API connectors. By abstracting common logic and leveraging reusable components, we’ve drastically reduced the time and effort required to build and maintain connectors.
This evolution in our approach has not only improved the quality and reliability of our connectors as part of our work towards releasing Airbyte 1.0, but has also empowered our community to contribute more effectively. We’re now doubling down on streamlining community contributions, enabling everyone to contribute to Airbyte’s public catalog of connectors from the Connector Builder.
As we continue to refine and expand these tools, we’re excited to see what innovations and integrations the community will build next. Whether you’re a developer, a data engineer, or someone looking to integrate new APIs, the low-code CDK and the Connector Builder offer the tools you need to succeed.
Other great reads
This article is part of a series covering the reliability features we released for Airbyte to reach 1.0 status. Here are some other articles about those features:
- Handling large records
- Checkpointing to ensure uninterrupted data syncs
- Load balancing Airbyte workloads across multiple Kubernetes clusters
- Resumable full refresh, as a way to build resilient systems for syncing data
- Refreshes to reimport historical data with zero downtime
- Notifications and Webhooks to enable set & forget
- Supporting very large CDC syncs
- Monitoring sync progress and solving OOM failures
- Automatic detection of dropped records
Don't hesitate to check out what was included in Airbyte 1.0's release!