PDEP-10: Add pyarrow as a required dependency by mroeschke · Pull Request #52711 · pandas-dev/pandas (original) (raw)

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

closes Make pyarrow a required dependency #52509 (Replace xxxx with the GitHub issue number)

I've tried to summarize the motivations and drawbacks from #52509 in this PDEP. Please let me know if there are any reasons on either side that I am missing.

Feel free to continue the discussion in #52509, but as a reminder the voting will take place here.

cc @pandas-dev/pandas-core

mroeschke changed the title~~PDEP: Add pyarrow as a required dependency~~ PDEP-10: Add pyarrow as a required dependency

Apr 17, 2023

by PyArrow including:

- Avoiding runtime checking if PyArrow is available to perform PyArrow object inference during constructor or indexing operations
- Avoiding NumPy object data types more by default for analogous types that have native PyArrow support such as decimal, binary, and nested types

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be too optimistic, but having pyarrow as a required dependency has the potential to make the c/cython-code for read_csv and read_json obsolete (if they are on par and similarly fast).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that would be a compile time dependency which we are not contemplating at the current time; possibly could propose in the future

Comment on lines 16 to 17

- The minimum version of PyArrow will be bumped every major pandas release to the highest
PyArrow version that has been released for at least 2 years.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using the major.minor.patch terminology, major could be 2-3 years (ignoring for now the proposal by some to make this more frequent) and minor is 6-9 months.

It is not clear here, is the minimum supported version kept for all minor releases in this proposal?

Near the tail end of the major release cycle, the minimum supported version of pyarrow could be 5 years old?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear here, is the minimum supported version kept for all minor releases in this proposal?

Correct

Will there also be a version cap? NumPy is extremely conservative with breaking changes. I can't recall a case where a cap would have helped avoid issues with NumPy changes, especially deprecations and removals. Is pyarrow similarly stable? Does pyarrow have an implicit or explicit deprecation policy? If not, would there need to be a cap on each release too?

Recently in a number of projects I've been downstream of SciPy which has been doing a lot of long-needed but painful clean up. This has resulted in cases where not too old releases of downstream projects are breaking against the most recently released SciPy.

Additionally, requiring PyArrow would simplify the related development within pandas and potentially improve NumPy functionality that would be better suited
by PyArrow including:

- Avoiding runtime checking if PyArrow is available to perform PyArrow object inference during constructor or indexing operations

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any small code samples we can add to drive this point home? I think still we would make a runtime determination whether to return a pyarrow or numpy-backed object even if both are installed, no?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure this comment by Will has been addressed (unless I missed it?)

to make it easier to find: the link is here, and says:

Are there any small code samples we can add to drive this point home? I think still we would make a runtime determination whether to return a pyarrow or numpy-backed object even if both are installed, no?

This PDEP proposes that:

- PyArrow becomes a runtime dependency starting pandas 2.1
- The minimum version of PyArrow supported starting pandas 2.1 is version 7.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this version be consistent across the entire pandas API?

e.g. If I wanted to bump the pyarrow version for just the CSV parser to something higher, would I be able to do it?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The minimum version would be consistent across the library, but IMO that shouldn't stop development of features that exist in newer versions of pyarrow (we already do this with version checking or try/except)

In one of the linked issues there is the comment about pandas being backend agnostic. Is it possible that this PDEP can broach that and consider how making pyarrow a dependency defines that objective as being unacheivable or how it would fit in to such a concept. This is not clear to me, or whether that is indeed a goal.

Primarily I have used numpy for linear algebra and pandas as an extension for better indexing, data wrangling and output. the added bloat when incorporating into web apps with limited build space is concerning.

This isn't a really useful discussion to have right now (and IMO shouldn't be in scope here). Even if we get to a point where NumPy is optional (this is years away), NumPy is still a required dependency of PyArrow

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made some comments, but I'd really like to understand the burden on development if we left things as they are in terms of it being an optional dependency. How much simpler does the code base become if PyArrow is required?

I'm also concerned about the lack of support on Alpine Linux for PyArrow, and maybe we should wait for that before accepting this PDEP.

Is it a good comparison to say that trying to develop arrow-backed arrays without pyarrow as a dependency is like trying to develop NumPy-backed arrays/blocks without NumPy as a dependency?

Co-authored-by: Irv Lustig irv@princeton.com

date and times are 2 additional dtypes that we can infer properly. Added them to the list as well.

numpy.void is not a drop-in replacement for PyArrow structs, even if we would implement it.

I understand, but doing it this way, we don't know how many people are neutral or happy with it. Even if there's a large volume of criticism, there could be an even larger volume that are neutral or happy with better perf that are probably not going to go out of their way to make their voices heard. For example, if a million people voice criticism, it would seem like a lot, but maybe not if there were 5 million people happy with it or neutral towards it (there's no one size fits all here), but we have no idea what this number would be. And that's the point I am trying to make, this issue on the warning wouldn't gather any data on the flip side of the issue (which I think is equally important), which imo would skew the results, or make them seem worse than they really might be.

IMHO, it's better that we get some feedback, rather than none. The wording as proposed doesn't commit us to saying we will not require pyarrow if we get negative feedback - it just says that we will get the feedback, which gives us the possibility of delaying the requirement based on that feedback.

improve structure, list user benefits more clearly, add faq

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small comment about the word "horrendous"

Reworded the horrendous thing. Let's start the voting now

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, let's do this. After recent discussions and clarifications, I'm sold: mainly based on superseding object dtype in string, list, and struct dtypes, which would be a real and immediate benefit to users

To emphasise: if accepted, this would not change the default for dtypes which are currently non-object numpy dtypes (that would be a separate discussion). It would only change the default for dtypes which are currently object (which I don't think I'm alone in considering an embarrassment)

Reworded the horrendous thing. Let's start the voting now

Agreed. I'd like to propose that we use the newly proposed voting procedure for PDEP-1 on this. I've created an issue at #54106 for the core team to place their votes.

This was accepted in our vote. So updated status and will merge

Labels

Arrow

pyarrow functionality

Dependencies

Required and optional dependencies

PDEP

pandas enhancement proposal