Revert "DEPR: make_block (#56422)" by jorisvandenbossche · Pull Request #56481 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation27 Commits1 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
I added an expanded comment on this issue (#56422 (comment)). And to clear, this is not a critical blocker for the RC (just for the final release). Nothing will break for people testing the RC if this is not in, but if they are running tests where they also use pyarrow, they will see a flood of additional warnings (that we can easily avoid by merging this), that will distract of actual relevant deprecation warnings we want them to check for the RC.
cc @pandas-dev/pandas-core for more thoughts on this as this is the last remaining blocker for the release.
I added an expanded comment on this issue (#56422 (comment)). And to clear, this is not a critical blocker for the RC (just for the final release). Nothing will break for people testing the RC if this is not in, but if they are running tests where they also use pyarrow, they will see a flood of additional warnings (that we can easily avoid by merging this), that will distract of actual relevant deprecation warnings we want them to check for the RC.
Is there a way to not issue the warning if coming from pyarrow
, but issue the warning if another library uses make_block()
?
I understand that @jbrockmendel would like to "force" pyarrow
to not use make_block()
, but also am sympathetic to what @jorisvandenbossche says with respect to the performance implications for pyarrow
and that there is not a reasonable alternative that doesn't affect pyarrow
performance. So I don't think we can deprecate without that alternative.
and that there is not a reasonable alternative that doesn't affect pyarrow performance
That is incorrect, several reasonable alternatives have been offered in #56422.
and that there is not a reasonable alternative that doesn't affect pyarrow performance
That is incorrect, several reasonable alternatives have been offered in #56422.
My read of that discussion is that there is debate about whether the 6X performance hit is "reasonable".
"6x" is less accurate than "100us"
Deprecating doesn't impose the performance penalty - it only imposes the message on users. Once can always improve performance of the blessed paths if this is a real bottle neck, but I do suspect that 35us vs 135 us is going to be noise in almost all cases.
It does seem that the incentive for pyarrow to change their usage is nil until this makes it into a release.
and that there is not a reasonable alternative that doesn't affect pyarrow performance
That is incorrect, several reasonable alternatives have been offered in #56422.
@jbrockmendel I would recommend that you re-read that discussion then. Yes, there are alternatives, and I also acknowledged that, but only ones (that are zero-copy) that can be used in the future, not right now (note that pyarrow cannot enable optional features like CoW, that's something for the user to do)
To be explicit: the alternative that is 400µs slower (I agree that a slowdown like this will mostly be noise, although have to check the impact of more dtypes/columns) is not possible right now for pyarrow to implement (with public APIs).
I think the current deprecation is OK to leave in for 2.2, but this deprecation should not be enforced (or upgraded to a FutureWarning
) without an agreed upon alternative (might be 4.0 timeline at the earliest). There may be other libraries besides pyarrow using this API that either can use an existing alternative or can help comment on what a suitable alternative looks like that satisfies pyarrow + any other library.
Given that there is a lot of ongoing discussion about this (what's the exact alternative, what should be the timeline, etc), can we simply revert this for now so we can have this discussion before committing to certain changes?
This warning will appear in CI for every project that uses pandas and pyarrow (e.g. we already see that in geopandas). So this is annoying a lot of other projects, just because we are committing the deprecation a bit early.
Reverting it now means delaying enforcing it from 2025 to 2026. That is bad.
@jorisvandenbossche wrote:
This warning will appear in CI for every project that uses pandas and pyarrow (e.g. we already see that in geopandas). So this is annoying a lot of other projects, just because we are committing the deprecation a bit early.
@jbrockmendel wrote:
Reverting it now means delaying enforcing it from 2024 to 2025. That is bad.
To me, the timing seems to be the critical difference in opinion. I think there is general agreement that make_block()
should be removed at some point, which would require a deprecation warning, followed by removal. The question is when this should be done.
@jorisvandenbossche What timing do you propose for this deprecation and removal?
would using a PendingDeprecationWarning address some of the CI issues?
According to our 2-year support window of pyarrow, the earliest moment we can enforce this deprecation would be spring 2026. Even if we are laxer (as we currently are) and only support for 1 year, the earliest moment would be in spring 2025 (because the earliest version of pyarrow that can support this will be in spring 2024).
So the timeline of enforcing this deprecation is >1 year anyway. Given that, personally I wouldn't worry about delaying the initial deprecation a few months.
If we only change Deprecation->FutureWarning in 4.0, what does it matter that the DeprecationWarning is added in 2.2 or 3.0 or 3.1?
Reverting it now means delaying enforcing it from 2024 to 2025. That is bad.
And can you explain why this is "bad"? What we are deprecating here is a small helper function (only for external usage, not used internally) that has very little maintenance overhead (assumption based on the fact it was hardly touched the last three years). Keeping it a year longer doesn't seem much of an issue.
would using a PendingDeprecationWarning address some of the CI issues?
Pytest shows both DeprecationWarning and PendingDeprecationWarning by default, so that would be no difference AFAIK
I think there is general agreement that
make_block()
should be removed at some point,
I have no problem with deprecating make_block
, if there is a clear alternative for the valid use cases we are aware of. But that's the part where the details are still being discussed (the current main alternative is concat
+reindex
, but one could also imagine a new API to construct a DataFrame from arrays), something that ideally happens before actually doing the deprecation.
For the uninformed what huge benefit is there to a downstream library using make_block versus say constructing a dataframe with a dict of arrays?
Zero-copy construction of 2D blocks. We currently don't have a public API that accepts 2D arrays (except for the case of a single dtype for your full DataFrame).
Reverting it now means delaying enforcing it from 2024 to 2025. That is bad.
And can you explain why this is "bad"?
That was a typo on my part. Should read "from 2025 to 2026". It is bad because 2026 is a long way away and enforcing the deprecation blocks weaning pyarrow off of constructing Managers which blocks any potential changes in Managers (as demonstrated by ArrayManager never being workable in the pyarrow-touching IO tests)
For the uninformed what huge benefit is there to a downstream library using make_block versus say constructing a dataframe with a dict of arrays?
Zero-copy construction of 2D blocks. We currently don't have a public API that accepts 2D arrays (except for the case of a single dtype for your full DataFrame).
Kind of. In many (all?) cases pyarrow already does a copy when converting its 1D arrays to 2D arrays. The reasoning that was given for this years ago was that this prevented post-construction performance hits bc pandas did silent consolidation. But silent consolidation was removed in 2.0, and there has been a lot of very frustrating goalpost-shifting.
I have no problem with deprecating make_block, if there is a clear alternative for the valid use cases we are aware of. But that's the part where the details are still being discussed (the current main alternative is concat+reindex, but one could also imagine a new API to construct a DataFrame from arrays), something that ideally happens before actually doing the deprecation.
So let's revert the deprecation for 2.2 and then reintroduce for 3.0 so that we can enforce this either in 4.0 (less likely) or 5.0 (more likely because of the support window) at the latest. We have to run some tests to see if concat + reindex is sufficient or if we need a more efficient path that takes a list of arrays directly, but that should be relatively easy I guess?
My concern with that plan is that Joris will infinitely re-litigate "clear alternative for the valid use cases" a few months from now. As bashtage said, pyarrow has no incentive to actually change until this is in a release. pyarrow is not being a good citizen by refusing to use public APIs.
My concern with that plan is that Joris will infinitely re-litigate "clear alternative for the valid use cases" a few months from now. As bashtage said, pyarrow has no incentive to actually change until this is in a release. pyarrow is not being a good citizen by refusing to use public APIs.
FWIW, it seems that the people coding that part of pyarrow are @jorisvandenbossche and @wesm so they are representing both communities on this topic.....
Signing off for the day. I'll consent to reverting for 2.2 if there is firm commitment to deprecating in 3.0 and enforcing in 4.0. pyarrow shouldn't get to abuse our internals indefinitely over a few hundred microseconds.
merging as we agreed on reverting the deprecation for 2.2 and deprecating again for 3.0
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request
WillAyd deleted the revert-56422-depr-make_block branch
Looks like I made CI red with this (should've rebased before merging).
Will take a look at the failures.
lithomas1 pushed a commit that referenced this pull request
…") (#56814)
Backport PR #56481: Revert "DEPR: make_block (#56422)"
Co-authored-by: Joris Van den Bossche jorisvandenbossche@gmail.com
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request
Labels
Blocking issue or pull request for an upcoming release
pandas objects compatability with Numpy or Python functions
Related to non-user accessible pandas implementation