PERF: Faster CategoricalIndex from categorical by TomAugspurger · Pull Request #17513 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation9 Commits1 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
Master:
In [1]: import pandas as pd; import numpy as np
In [2]: arr = ['s%04d' % i for i in np.random.randint(0, 500000 // 10, size=500000)]; s = pd.Series(arr).astype('category')
In [3]: %timeit pd.CategoricalIndex(s)
69.4 ms ± 946 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
HEAD
In [1]: import pandas as pd; import numpy as np
In [2]: arr = ['s%04d' % i for i in np.random.randint(0, 500000 // 10, size=500000)]; s = pd.Series(arr).astype('category')
In [3]: %timeit pd.CategoricalIndex(s) 9.06 µs ± 182 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Is that a 7000x speedup, or did I mess up the math?
@@ -130,6 +130,9 @@ def _create_categorical(self, data, categories=None, ordered=None): |
---|
------- |
Categorical |
""" |
if isinstance(data, ABCSeries) and is_categorical_dtype(data): |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The isinstance(data, ABCSeries)
is maybe a bit too restrictive. This rules out, e.g. CategoricalIndex. But at moment the CategoricalIndex constructor takes a different (faster) path, so it wouldn't benefit from this change anyway.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, does it still pass if you add ABCCategoricalIndex
here as well? it is more consistent
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about isinstance(data, (ABCSeries, type(self)))
. That'll catch Series(dtype='category')
, and CategoricalIndex
, but not Categorical
.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about (is_categorical_dtype and not ABCCategorical)?
maybe add a helper method
_ensure_categorical
?
which passes thru Categorical
does .values on CI and Series
and raises else
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add a helper method
I'd prefer to wait till we need that. I haven't come across any other cases where I need to treat CategoricalIndex and Series(dtype=category) differently than Categorical.
I am able to replicate this result (159 ms -> 9.88 µs).
alanbato pushed a commit to alanbato/pandas that referenced this pull request
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request