PERF: Add contains to CategoricalIndex by topper-123 · Pull Request #21369 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation19 Commits2 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
- progress towards PERF: df.loc is 100x slower for CategoricalIndex than for normal Index #20395
- xref PERF: __contains__ method for Categorical #21022
- tests added / passed
- benchmark added / passed
- passes
git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry
Currently, membership checks in CategoricalIndex
is very slow as explained in #21022. This PR fixes the issue for CategoricalIndex
, while #21022 contains the fix for Categorical
. The difference between the two cases is the use of _engine
for CategoricalIndex
, which makes this even faster than the Catagorical
solution in #21022.
Tests exist already and can be found in tests/indexes/test_category.py::TestCategoricalIndex::test_contains
.
ASV:
before after ratio
[0c65c57a] [986779ab]
- 2.49±0.2ms 3.26±0.2μs 0.00 categoricals.Contains.time_contains
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
@@ -324,20 +324,19 @@ def _reverse_indexer(self): |
---|
@Appender(_index_shared_docs['__contains__'] % _index_doc_kwargs) |
def __contains__(self, key): |
hash(key) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be a really silly question, but what does this line do?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It just ensures that mutables are not passes into the function....
I think it should probably be removed, but that line is in various other places as well, so maybe a seperate PR, that goes through all similar cases? Or I can just remove it.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, you removed it in another part of the diff, hence why I'm asking. That being said, I like your suggestion. Let's investigate for another time then, in which case I would put back the other one that you deleted.
if self.categories._defer_to_indexing: |
---|
return key in self.categories |
return key in self.values |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remind me: why do we NOT need to check membership in self.values
anymore?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For indices, their indexing engine (i.e. ._engine
) has a __contains__
method which does the same thing but is faster (does caching etc. probably, haven't looked into the details of the code).
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks for clarifying!
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return False |
---|
if is_scalar(loc): |
return loc in self._engine |
else: # if self.categories is IntervalIndex, loc is an array |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put a blank line between things, e.g.
if isna(...):
....
try:
....
except:
....
if is_scalar(...):
...
# no else needed here
return ...
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also can you put comments before each case (not everythin needs a comment), but i find this hard to grok in its current form.
I've updated the PR.
I've set it to be part of 0.23.2, if that's alright.
This is changing the implementation quite substantially, so let's move this to 0.24.0.txt?
This was referenced
Jun 15, 2018
Any comments on my comment above about keeping this for 0.24.0 ?
Quility-wise this is ok to go into 23.2 IMO, the PRs are really not that complex, IMO, it's much faster and it doesn't change any APIs.
Also, my main motivation for writing this was speeding up slicing dataframes with a CategoricalIndex (see #20395), which previously was very slow (still is slow, but better than before, and now faster than fancy indexing, at least). I think a lot of people will appreciate this speedup.
we tagged for 0.23.2 (and note is there)
it would be slightly tricky to change as there is another related change - both in 0.23.2 or 0.24
either way is fine
The main reason that I am raising this is that __contains__
checking has quite some implications (which is of course also the reason you speeded it up!), and I think it is rather easy to miss a small exotic corner case where the new implementation might differ.
To be clear, @topper-123 , I am not questioning the quality of this PR! I just know from experience, also for us, that it is easy to miss an unintended API change (which we might not even decide to fix if it is debateble behaviour, but then that is still better to be left as 0.24.0). Since this is only about performance improvement (and not a regression), I would play on safe and give it more time in 0.24.0.
it would be slightly tricky to change as there is another related change - both in 0.23.2 or 0.24
Yep, it would be both in 0.23.2 or both in 0.24.0 (the other I actually tagged first as 0.24.0, but that was changed before merging)
Ok, your call, I won't object.
OK, left it for 0.24 (moved the whatsnew already to v0.24.0.txt)
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request
Labels
Categorical Data Type
Related to indexing on series/frames, not to indexes themselves
Memory or execution speed performance