DISC: Consider not requiring PyArrow in 3.0 · Issue #57073 · pandas-dev/pandas

TL;DR: Don't make PyArrow required - instead, set minimum NumPy version to 2.0 and use NumPy's StringDType.

Background

In PDEP-10, it was proposed that PyArrow become a required dependency. Several reasons were given, but the most significant reason was to adopt a proper string data type, as opposed to object.
This was voted on and agreed upon, but there have been some important developments since then, so I think it's warranted to reconsider.

StringDType in NumPy

There's a proposal in NumPy to add a StringDType to NumPy itself. This was brought up in the PDEP-10 discussion, but at the time was not considered significant enough to delay the PyArrow requirement because:

1. NumPy itself might not accept its StringDType proposal.
2. NumPy's StringDType might not come with the algorithms pandas needs.
3. PyArrow's strings might still be significantly faster.
4. Because pandas typically supports older NumPy versions (in addition to the latest release), it would be 2+ years until pandas could use NumPy's strings.

Let's tackle these in turn:

On the first concern: I caught up with Nathan Goldbaum (author of the StringDType proposal) today, and he said that NEP55 will be accepted. It is technically still in draft status, but it has several supporters and no objectors, so realistically it is going to change to "accepted" very soon.
The second concern was the algorithms. Here's an excerpt of the NEP I'd like to draw attention to:

"In addition, we will add implementations for the comparison operators as well as an add loop that accepts two string arrays, multiply loops that accept string and integer arrays, an isnan loop, and implementations for the str_len, isalpha, isdecimal, isdigit, isnumeric, isspace, find, rfind, count, strip, lstrip, rstrip, and replace string ufuncs [universal functions] that will be newly available in NumPy 2.0."

So, NEP55 not only provides a NumPy StringDType, but also efficient string algorithms.
There's a pandas fork implementing this, which Nathan has been keeping up-to-date. Once the NumPy StringDType is merged into NumPy main (likely next week), it'll be much easier for pandas devs to test it out. Note: some parts of the fork don't yet use the ufuncs, but they will soon; it's just a matter of updating things.
For any ufunc that's missing, Nathan's said that now that the string ufuncs framework exists in NumPy, it's relatively straightforward to add new ones (e.g. for .str.partition). There is real funding behind this work, so it's likely to keep moving quite fast.
On the third concern (speed relative to PyArrow's strings): Nathan's said he doesn't have timings to hand for this comparison, and is about to go on holiday 🌴 He'll be able to provide timings in 1-2 weeks' time though.
On the fourth concern (waiting on older NumPy versions): personally, I'd be fine with requiring NumPy 2.0 as the minimum NumPy version for pandas, if it means efficient string handling by default without the need for PyArrow. Nathan's fork already implements this for pandas, so there's no need to wait 2 years; it should just be a matter of months.
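
For anyone who wants to poke at the string ufuncs named in the NEP excerpt, here's a minimal sketch. It assumes NumPy 2.0 or later for `np.dtypes.StringDType` and the `np.strings` namespace, and falls back to the older `np.char` functions on fixed-width strings otherwise. The helper name `string_ops` and the sample values are made up for illustration; this is not how the pandas fork is wired up.

```python
import numpy as np

# StringDType and the np.strings namespace arrived in NumPy 2.0.
# On older NumPy, fall back to np.char and fixed-width unicode strings.
if hasattr(np, "strings") and hasattr(np.dtypes, "StringDType"):
    ns, dtype = np.strings, np.dtypes.StringDType()
else:
    ns, dtype = np.char, str


def string_ops(values):
    """Apply a few of the operations listed in the NEP excerpt (hypothetical helper)."""
    arr = np.array(values, dtype=dtype)
    stripped = ns.strip(arr)  # remove leading/trailing whitespace
    return {
        "stripped": stripped.tolist(),
        "lengths": ns.str_len(stripped).tolist(),
        "no_dash": ns.replace(stripped, "-", "").tolist(),
        "find_py": ns.find(stripped, "py").tolist(),  # -1 where "py" is absent
    }


result = string_ops(["  pandas ", "numpy", "py-arrow"])
print(result)
```

Each call is vectorised over the whole array, which is the property pandas would rely on for its `.str` accessor.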

Feedback

The feedback issue makes for an interesting read: #54466.
Complaints seem to come mostly (as far as I can tell) from other package maintainers who are considering moving away from pandas (e.g. fairlearn).

This one surprised me; I don't think anyone had considered it before. One could argue that it's VirusTotal's issue, but I still wanted to bring visibility to it.

Tradeoffs

In the PDEP-10 PR it was mentioned that PyArrow could help reduce some maintenance work (which, despite some funding, still seems to be mostly volunteer-driven). Has this been investigated further? Is it still likely to be the case?

Furthermore, not requiring PyArrow would mean not being able to infer list and struct dtypes by default (at least, not without significant further work).
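
To make the list/struct point concrete, here's a rough sketch of what I mean, assuming pandas is installed and treating PyArrow as optional. The sample data is made up. Without an explicit Arrow-backed dtype, nested lists fall back to `object`; a list dtype is only available by opting in through `pd.ArrowDtype`.

```python
import pandas as pd

try:
    import pyarrow as pa
except ImportError:  # PyArrow is treated as optional in this sketch
    pa = None

nested = [[1, 2], [3], []]

# Default inference: each element is stored as a Python list in an object column.
inferred = pd.Series(nested)
print(inferred.dtype)

if pa is not None:
    # Opting in to an Arrow-backed list dtype requires PyArrow.
    typed = pd.Series(nested, dtype=pd.ArrowDtype(pa.list_(pa.int64())))
    print(typed.dtype)
```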

"No is temporary, yes is forever"

I'm not saying "never require PyArrow". I'm just saying, at this point in time, I don't think the requirement is justified. Of the proposed benefits, the most salient one is strings, and now there's a realistic alternative which doesn't require taking on an extra massive dependency.

I acknowledge that lately I've been more focused on other projects, and so don't want to come across as "I'm telling pandas what to do because I know best!" (I certainly don't).

Circumstances have changed since the PDEP-10 PR and vote, and personally I regret voting the way I did. Does anyone else feel the same?