PERF (CoW): optimize equality check in groupby is_in_obj by jorisvandenbossche · Pull Request #50730 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation11 Commits1 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

jorisvandenbossche

See discussion in #49450

TODO:

@jorisvandenbossche

@jorisvandenbossche

cc @jbrockmendel are you aware of something similar that already exists in the code base?

@jbrockmendel

are you aware of something similar that already exists in the code base?

nope, will be nice to have

jbrockmendel

if isinstance(arr1, np.ndarray):
# slice first element -> if first element shares memory and they are
# equal length -> share exactly memory for the full length (if not
# checking first element, two arrays might overlap partially)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are they necessarily contiguous?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question (and we should cover that in tests). No, not necessarily I suppose. But the question is if that matters?

In theory they could have a first same memory point, but if not contiguous and if they have different strides, they might still have different values.
But can you get that in practice with a Series from a DataFrame?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can of course easily check that the strides are equal. But is that enough guarantee?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can get there, probably also in the groupby case:

df = DataFrame({"a": list(range(100)), "b": 1})

ser = df.loc[slice(0, 50), "a"]
ser2 = df.loc[slice(0, 100, 2), "a"]
np.shares_memory(ser._values, ser2._values)

Edit: checking the strides fixes this of course

@jbrockmendel

Can of course easily check that the strides are equal. But is that enough guarantee?

IIRC I discussed something similar with @seberg a while back. I don't remember what he said, other than suggesting an approach that didn't involve checking memory.

@seberg

I remember a brief discussion at some point, but not really what I suggested or the details...

If all you want is to check identity, then you can check the starting pointer and the stride(s). np.shares_memory(arr1[:1], arr2[:1]) is hard to confuse (possible in NumPy, but you have to construct it with will).

Maybe I noted that overlap checking is difficult (not sure how hard it is exactly in 1-D), but since you are interested in identity checking that doesn't matter.
I guess I always imagined that CoW implementations would have a view.base and weakref's back to keep track of what is shared (so the parent knows all its views), but I don't actually know anything about CoW implementations, so :)...

@jorisvandenbossche

I guess I always imagined that CoW implementations would have a view.base and weakref's back to keep track of what is shared (so the parent knows all its views),

That's exactly what we do in our CoW implementation in the internals, but, here it's some special case in a groupby code path where we previously relied on actual (python object level) identity, which no longer works with CoW enabled (getting a column twice no longer gives python-identical objects). So I was trying to find some way to see if two objects had exactly the same memory as alternative check (although not exactly equivalent, as an actual view would share the same memory but not be identical).

But now you say this, I actually didn't think of that we could use this base/refs information ... Thanks for the insight! :)

@github-actions

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@jorisvandenbossche

Closing in favor of #51442, which essentially does the last suggestion here (using the ref information to know if they are views of each other)