PERF: Add contains to CategoricalIndex by topper-123 · Pull Request #21369 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation19 Commits2 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

progress towards PERF: df.loc is 100x slower for CategoricalIndex than for normal Index #20395
xref PERF: __contains__ method for Categorical #21022
tests added / passed
benchmark added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Currently, membership checks in CategoricalIndex is very slow as explained in #21022. This PR fixes the issue for CategoricalIndex, while #21022 contains the fix for Categorical. The difference between the two cases is the use of _engine for CategoricalIndex, which makes this even faster than the Catagorical solution in #21022.

Tests exist already and can be found in tests/indexes/test_category.py::TestCategoricalIndex::test_contains.

ASV:

      before           after         ratio
     [0c65c57a]       [986779ab]
-      2.49±0.2ms       3.26±0.2μs     0.00  categoricals.Contains.time_contains

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@@ -324,20 +324,19 @@ def _reverse_indexer(self):
@Appender(_index_shared_docs['__contains__'] % _index_doc_kwargs)
def __contains__(self, key):
hash(key)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a really silly question, but what does this line do?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just ensures that mutables are not passes into the function....

I think it should probably be removed, but that line is in various other places as well, so maybe a seperate PR, that goes through all similar cases? Or I can just remove it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, you removed it in another part of the diff, hence why I'm asking. That being said, I like your suggestion. Let's investigate for another time then, in which case I would put back the other one that you deleted.

if self.categories._defer_to_indexing:
return key in self.categories

return key in self.values

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remind me: why do we NOT need to check membership in self.values anymore?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For indices, their indexing engine (i.e. ._engine) has a __contains__ method which does the same thing but is faster (does caching etc. probably, haven't looked into the details of the code).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks for clarifying!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return False
if is_scalar(loc):
return loc in self._engine
else: # if self.categories is IntervalIndex, loc is an array

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put a blank line between things, e.g.

if isna(...):
     ....

try:
   ....
except:
  ....

if is_scalar(...):
   ...

# no else needed here
return ...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also can you put comments before each case (not everythin needs a comment), but i find this hard to grok in its current form.

I've updated the PR.

I've set it to be part of 0.23.2, if that's alright.

This is changing the implementation quite substantially, so let's move this to 0.24.0.txt?

This was referenced

Jun 15, 2018

Any comments on my comment above about keeping this for 0.24.0 ?

Quility-wise this is ok to go into 23.2 IMO, the PRs are really not that complex, IMO, it's much faster and it doesn't change any APIs.

Also, my main motivation for writing this was speeding up slicing dataframes with a CategoricalIndex (see #20395), which previously was very slow (still is slow, but better than before, and now faster than fancy indexing, at least). I think a lot of people will appreciate this speedup.

we tagged for 0.23.2 (and note is there)
it would be slightly tricky to change as there is another related change - both in 0.23.2 or 0.24
either way is fine

The main reason that I am raising this is that __contains__ checking has quite some implications (which is of course also the reason you speeded it up!), and I think it is rather easy to miss a small exotic corner case where the new implementation might differ.
To be clear, @topper-123 , I am not questioning the quality of this PR! I just know from experience, also for us, that it is easy to miss an unintended API change (which we might not even decide to fix if it is debateble behaviour, but then that is still better to be left as 0.24.0). Since this is only about performance improvement (and not a regression), I would play on safe and give it more time in 0.24.0.

it would be slightly tricky to change as there is another related change - both in 0.23.2 or 0.24

Yep, it would be both in 0.23.2 or both in 0.24.0 (the other I actually tagged first as 0.24.0, but that was changed before merging)

Ok, your call, I won't object.

OK, left it for 0.24 (moved the whatsnew already to v0.24.0.txt)

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request

Labels

Categorical Data Type

Related to indexing on series/frames, not to indexes themselves

Memory or execution speed performance