Extend DBAPI with Apache Arrow support

Hi everyone,

Apache Arrow has become the de-facto standard in-memory format for analytical data. It allows moving data across language and process boundaries without actually copying any data in memory. There are many successful projects built on top of Arrow, such as Polars and DuckDB, and it has become the default array backend for Pandas since version 2.

In modern Python data pipelines, the data never has to touch Python objects directly. It is usually loaded straight into Arrow, and Python is used only to pass the Arrow arrays between compiled computation engines and finally dump them into a compatible format such as Apache Parquet. When pipelines are written in this style, there is almost zero performance penalty for using Python.

The story is different for relational databases, however. Python defines the standard DBAPI v2, and most database drivers conform to it. This API is row-based and works with Python values.
The stereotypical Python data pipeline for a relational database looks like this:

  1. Fetch data from database (in a compiled database driver module)
  2. Convert the query result into python lists of python tuples of python values
  3. Convert the lists into Arrow
  4. Do computation in Arrow
  5. Convert Arrow back into python lists of python tuples of python values
  6. Pass it to database driver that will insert it into the database

It turns out that steps 3 and 5 are completely unnecessary and are actually very costly.
They become an obvious bottleneck in cloud computing settings, where the database interactions themselves are fast.

Moreover, detailed type information is lost when converting to Python types, so step 3 usually requires some fragile type inference.
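To make the cost concrete, here is a minimal sketch of that flow using the stdlib sqlite3 driver and pyarrow; the computation in step 4 is a trivial stand-in for a real engine, and steps 3 and 5 are the conversions in question:

```python
import sqlite3
import pyarrow as pa
import pyarrow.compute as pc

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 10.0), (2, 2.5)])

# 1. + 2. Fetch the result set as Python tuples of Python values (DBAPI v2)
cur.execute("SELECT id, amount FROM sales")
rows = cur.fetchall()

# 3. Convert the Python rows into Arrow (costly, relies on type inference)
ids, amounts = zip(*rows)
table = pa.table({"id": list(ids), "amount": list(amounts)})

# 4. Do the computation on Arrow data (trivial stand-in for a compute engine)
table = table.set_column(1, "amount", pc.multiply(table["amount"], 2))

# 5. Convert Arrow back into Python tuples of Python values (costly again)
out_rows = [tuple(row.values()) for row in table.to_pylist()]

# 6. Hand the Python rows back to the driver for insertion
cur.executemany("INSERT INTO sales VALUES (?, ?)", out_rows)
conn.commit()
```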

There have been numerous projects that aim to solve this problem.

These implementations usually report manyfold speedups over plain DBAPI.
But each of them comes up with a bespoke API for doing so, and some of them don’t even provide the rest of the DBAPI semantics (connection/cursor/etc.).

I propose that the DBAPI specification be extended (backwards-compatibly) with optional methods that allow drivers to work with Arrow data instead, something along the lines of the sketch below.
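A rough shape such optional extensions could take, with purely illustrative method names (not a worked-out specification):

```python
# Purely illustrative names -- not a worked-out specification.
class Cursor:
    def fetch_arrow_table(self):
        """Return the remaining rows of the result set as one Arrow-compatible
        table, i.e. an object exposing __arrow_c_stream__."""

    def fetch_arrow_batches(self, max_chunksize=None):
        """Iterate over the result set as a stream of Arrow record batches."""

    def execute_arrow(self, operation, data):
        """Execute `operation` with `data` bound as parameters, where `data`
        is any object exposing the Arrow PyCapsule interface
        (e.g. a pyarrow.Table, a polars.DataFrame, ...)."""
```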

The driver libraries would not need to depend on pyarrow. Arrow defines a simple “Arrow PyCapsule Interface” that any library can implement on its own; I believe that this is the way oracledb did it.
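On the consumer side this is already convenient: recent pyarrow versions (14+) can import any object that implements the protocol, so, assuming a driver cursor with the hypothetical fetch_arrow_table() sketched above:

```python
import pyarrow as pa

result = cur.fetch_arrow_table()   # any object implementing __arrow_c_stream__
table = pa.table(result)           # imported via the Arrow PyCapsule interface
```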

Implementing this would make way for, e.g., SQLAlchemy (the best database toolkit in any language) to provide efficient analytical data access on top of these drivers.

From what I’ve seen in oracledb and turbodbc, I believe that conforming to a new DBAPI extension like this could be a minor change, mostly entailing renaming some methods.

What do you think about it?

pitrou (Antoine Pitrou) June 4, 2025, 6:57am 2

@jonashaag had already drafted something like this in pep-249-arrow/pep-249-arrow.rst at master · jonashaag/pep-249-arrow · GitHub (see discussion at RFC: Standardizing Python database connector PyArrow interfaces · Issue #2 · jonashaag/pep-249-arrow · GitHub ).

I’m obviously biased [1], and I think it would be a good idea. I also suspect that some DBAPI driver implementations would not be comfortable handling Arrow data as they might not have good knowledge of the Arrow format, so this would certainly be an optional extension.

Also, since you mention SQLAlchemy, you might want to bring its maintainers into the discussion.


  1. I’m a maintainer of Apache Arrow ↩︎

Can someone link to where it says that pyarrow is the default array backend for pandas? I cannot find this.

I don’t want to change the topic, but to my understanding numpy is staying as the default backing array storage, except for strings, which will be pyarrow-backed starting in version 3.

vladidobro (Vladislav Wohlrath) June 5, 2025, 11:27pm 4

I apologize, I was mistaken, Arrow is not the default backend of pandas.

Otherwise, I think the discussion can be closed if such a proposal already exists. Thanks for linking it.

pitrou (Antoine Pitrou) June 6, 2025, 8:02am 5

Well, the discussion in RFC: Standardizing Python database connector PyArrow interfaces · Issue #2 · jonashaag/pep-249-arrow · GitHub hasn’t seen any update for 2 years, perhaps the idea will get more visibility here than on a little-known GitHub repo.

malemburg (Marc-André Lemburg) June 6, 2025, 8:47am 6

FYI: The Python DB-API is being developed and discussed on the Python DB-SIG mailing list, not on this Discourse instance. Please take discussions there, if you want something to be added as e.g. a standard extension of the DB-API.

As editor of the DB-API v2 (PEP 249), I can say that adding support for columnar access to database data is certainly something we are interested in, but making this specific to Apache Arrow will most certainly not be what we want.

What we typically do is define APIs in a way which provides freedom to the implementors, while making sure that functionality stays consistent across implementations.

E.g. for columnar data, it may be better to use an array approach. Arrow could then be used to provide such an array data type, but standard Python arrays, numpy arrays, etc. would work just as well.
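A rough illustration of that kind of neutrality, with an invented method name: the spec would only require that each column behaves like a sequence, and each driver picks the concrete type.

```python
def fetchall_columns(cur):
    """Invented extension: return the result set as named columns.
    A stdlib-only driver might back each column with a plain list or
    array.array, while an Arrow-based driver could return Arrow arrays;
    the caller only relies on sequence behaviour."""
    names = [desc[0] for desc in cur.description]
    columns = list(zip(*cur.fetchall())) or [()] * len(names)
    return dict(zip(names, columns))
```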

h-vetinari (H. Vetinari) June 6, 2025, 9:47am 7

That’s not the case, because the ones you list would need to fully copy out the data, whereas arrow is designed such that the same bit of memory can be read by multiple users, even across multiple languages. Not sharing memory has big implications in terms of performance and memory footprint.

I’m not affiliated with arrow, but it really is a generation beyond classical DB technologies, which is also why (in my perception) it is eating the world.

Of course, there is value in genericity (and admittedly it would be a very substantial dependency if CPython wanted to depend on arrow), but there is nothing about this development that requires implementor freedom. Arrow is being adopted broadly precisely because it is polyglot (as well as fully open-source with a permissive license to boot), so people wisely do not want to rewrite its functionality from scratch, which has turned it into a de facto standard.

malemburg (Marc-André Lemburg) June 6, 2025, 10:11am 8

I know, but when designing APIs in the style of the DB-API, it’s important to let implementers choose what they think is best. That’s why the DB-API has seen such wide adoption.

E.g. there may well be criteria which rule out use of external libraries such as pyarrow and focus only on stdlib data types.

pitrou (Antoine Pitrou) June 6, 2025, 10:25am 9

The discussion will probably have to dissociate the Apache Arrow memory format, and PyArrow which is the main Python implementation but is not required for handling Arrow data. The Arrow PyCapsule interface ensures that other libraries can produce and consume data in the Arrow format without any dependency on PyArrow.
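In practice that dissociation is just duck typing; a minimal sketch of how a consumer could detect Arrow-capable objects without importing pyarrow:

```python
def speaks_arrow(obj):
    # The Arrow PyCapsule interface is defined purely in terms of dunder
    # methods returning C-level capsules, so producer and consumer need no
    # common Python dependency -- not even pyarrow.
    return hasattr(obj, "__arrow_c_stream__") or hasattr(obj, "__arrow_c_array__")
```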

For example, Polars is another widely-known Python library that uses the Arrow format natively.

malemburg (Marc-André Lemburg) June 6, 2025, 11:31am 10

All true, and yes, I’m aware of Polars and DuckDB, both of which use Arrow as their native columnar storage format, as well as of ADBC and Arrow Flight SQL 🙂

The key point, though, is not to mandate a specific array implementation in the DB API, just that it needs to be a sequence.

h-vetinari (H. Vetinari) June 6, 2025, 12:39pm 11

Having arrow-backed memory, and not just a sequence, is the key point of this thread (it appears you don’t agree that it’s desirable for the DB API; but that is just what’s being put up for discussion).

malemburg (Marc-André Lemburg) June 6, 2025, 1:24pm 12

We seem to be talking past each other 🙂

Let’s take a step back…

The DB API builds on the concept of row-wise access to result sets and takes rows as input to the execute methods. This is also the reason why the DB API v2 focuses on using cursors to run queries and work with result sets, the cursor being a pointer into the result set (a list of rows) created by the database.

Now, OLAP databases these days typically work on columnar data, so if the DB API wants to provide a good interface for this, it would need to provide ways of accessing result sets as a list of columns. The cursor concept is not an ideal fit for this, so we’ll probably have to look for a new concept. Getting this right is the key challenge.

Now, let’s assume we can come up with a new concept. We’d then use the same approach for defining these new APIs as we’ve done for the row based approach: focus on how data needs to be accessed and not on the specific data types.

Database interfaces which use Arrow for such columnar data would be DB API compatible, but interfaces using Python lists or arrays would be just as well.

If you are saying: the DB API should only allow Arrow implementations, then I have to disagree, since this would not be in line with how the DB API is designed, nor would it be future proof (Arrow is a great technology, but who knows what people will come up with in 10 years from now).

h-vetinari (H. Vetinari) June 6, 2025, 1:37pm 13

Thanks for the elaboration; I get where you’re coming from.

Not by policy. Just that 1) right now arrow is the only de facto standard in this space and 2) the abstraction would need to be designed in such a way that a backing implementation can implement data-shared-in-memory (rather than copying it into the runtime).

But I’m not the best to advocate for this either way (I just see big potential in arrow :)), so I’ll leave this to others now.

pitrou (Antoine Pitrou) June 6, 2025, 4:03pm 14

I think ultimately it’s a matter of what granularity of typing you want the consumer of the DB-API to be able to assume. If you merely say “the API returns a sequence object”, you are indeed more future-proof wrt. potential evolutions in the landscape of data formats, but you also require performance-minded users to have different code paths (“if it’s Arrow-compatible, then do this, otherwise if it implements the Array API, then do that, otherwise if it implements the DLPack spec, do that, otherwise fallback on this slow compatibility loop that iterates over values as individual Python objects”).
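Roughly the kind of consumer-side dispatch that a plain “sequence object” guarantee implies, sketched with hypothetical path labels:

```python
def pick_fast_path(result):
    """Decide how to ingest a driver's result object (hypothetical helper)."""
    if hasattr(result, "__arrow_c_stream__"):      # Arrow PyCapsule interface
        return "arrow"                             # zero-copy columnar path
    if hasattr(result, "__array_namespace__"):     # Array API standard
        return "array-api"
    if hasattr(result, "__dlpack__"):              # DLPack spec
        return "dlpack"
    return "python-objects"                        # slow per-value fallback loop
```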

malemburg (Marc-André Lemburg) June 6, 2025, 5:55pm 15

Sure, but that’s the job of higher-level abstraction layers such as SQLAlchemy to deal with, if you want to support multiple different database backends.

The strategy of the DB API is to provide just enough specification to allow easily working with the compatible interfaces, but no more. This enables optimizations, but does not require them, which in return creates the basis for a wealth of options to choose from.