ENH: Improve ref-tracking for group keys by phofl · Pull Request #51442 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation15 Commits7 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

phofl

Member

@phofl phofl commented

Feb 16, 2023

edited by jorisvandenbossche

Loading

cc @jorisvandenbossche

Follow-up on #49450

The equals check causes failures in Dask. I think we are better off moving to reference checking anyway. If you are ok with the general logic here, I'd refactor this check down to the Manager level.

@phofl

@phofl

@phofl phofl mentioned this pull request

Feb 16, 2023

rhshadrach

@rhshadrach

This would make df.groupy(df['a']) behave differently from df.groupby(df['a'].copy()), is that right? In general I'd like to move away from "in-axis" vs "out-axis" groupers behaving differently (ref: #49519), so I agree this is the right direction, but it might cause some odd edge cases to behave differently until we resolve those issues.

Still - I'm +0 here; checking equality gives some odd behaviors as well.

@phofl

Yes those 2 behave different, but this pr establishes the same behavior as in the non-cow mode for this case. So that's actually a plus imo

@phofl

jorisvandenbossche

except (AttributeError, KeyError, IndexError, InvalidIndexError):
return False
if isinstance(gpr, Series) and isinstance(obj_gpr_column, Series):
ref = weakref.ref(obj_gpr_column._mgr.blocks[0])
return ref in gpr._mgr.blocks[0].refs.referenced_blocks

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if you create multiple times a weakref to the same object, this is always the same (i.e. identical) weakref object?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche

Yes, I think this is a good solution for now (essentially the same as what was suggested in #50730). And it indeed aligns with the behaviour we have for non-CoW.

Longer term, I agree with @rhshadrach that the current semantics are not really great. We should probably at some point move to not doing such check at all, at which point this discussion about identity vs equality also just doesn't matter anymore?

@phofl

Yep this sounds sensible, but I think this needs a deprecation cycle?

I’ll move the check to the BlockManager then

@phofl

@phofl

Implemented it on the manager level

jorisvandenbossche

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two small comments to clean-up

@@ -947,9 +947,19 @@ def is_in_obj(gpr) -> bool:
# For the CoW case, we need an equality check as the identity check

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update this comment?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -947,9 +947,19 @@ def is_in_obj(gpr) -> bool:
# For the CoW case, we need an equality check as the identity check
# no longer works (each Series from column access is a new object)
try:
return gpr.equals(obj[gpr.name])
obj_gpr_column = obj[gpr.name]
except (AttributeError, KeyError, IndexError, InvalidIndexError):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove the AttributeError now, I think (.name is already checked above)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, thx

@phofl

@phofl

@phofl

…cking

Conflicts:

pandas/core/internals/managers.py

@jorisvandenbossche

@phofl phofl deleted the cow_groupby_ref_tracking branch

February 22, 2023 17:12