ENH: Improve ref-tracking for group keys by phofl · Pull Request #51442 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation15 Commits7 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
Member
phofl commented
•
edited by jorisvandenbossche
Loading
- closes #xxxx (Replace xxxx with the GitHub issue number)
- Tests added and passed if fixing a bug or adding a new feature
- All code checks passed.
- Added type annotations to new arguments/methods/functions.
- Added an entry in the latest
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.
Follow-up on #49450
The equals check causes failures in Dask. I think we are better off moving to reference checking anyway. If you are ok with the general logic here, I'd refactor this check down to the Manager level.
phofl mentioned this pull request
This would make df.groupy(df['a'])
behave differently from df.groupby(df['a'].copy())
, is that right? In general I'd like to move away from "in-axis" vs "out-axis" groupers behaving differently (ref: #49519), so I agree this is the right direction, but it might cause some odd edge cases to behave differently until we resolve those issues.
Still - I'm +0 here; checking equality gives some odd behaviors as well.
Yes those 2 behave different, but this pr establishes the same behavior as in the non-cow mode for this case. So that's actually a plus imo
except (AttributeError, KeyError, IndexError, InvalidIndexError): |
---|
return False |
if isinstance(gpr, Series) and isinstance(obj_gpr_column, Series): |
ref = weakref.ref(obj_gpr_column._mgr.blocks[0]) |
return ref in gpr._mgr.blocks[0].refs.referenced_blocks |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if you create multiple times a weakref to the same object, this is always the same (i.e. identical) weakref object?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think this is a good solution for now (essentially the same as what was suggested in #50730). And it indeed aligns with the behaviour we have for non-CoW.
Longer term, I agree with @rhshadrach that the current semantics are not really great. We should probably at some point move to not doing such check at all, at which point this discussion about identity vs equality also just doesn't matter anymore?
Yep this sounds sensible, but I think this needs a deprecation cycle?
I’ll move the check to the BlockManager then
Implemented it on the manager level
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just two small comments to clean-up
@@ -947,9 +947,19 @@ def is_in_obj(gpr) -> bool: |
---|
# For the CoW case, we need an equality check as the identity check |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update this comment?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -947,9 +947,19 @@ def is_in_obj(gpr) -> bool: |
---|
# For the CoW case, we need an equality check as the identity check |
# no longer works (each Series from column access is a new object) |
try: |
return gpr.equals(obj[gpr.name]) |
obj_gpr_column = obj[gpr.name] |
except (AttributeError, KeyError, IndexError, InvalidIndexError): |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can remove the AttributeError now, I think (.name
is already checked above)
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, thx
…cking
Conflicts:
pandas/core/internals/managers.py
phofl deleted the cow_groupby_ref_tracking branch