PERF: Faster CategoricalIndex from categorical by TomAugspurger · Pull Request #17513 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation9 Commits1 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

TomAugspurger

Master:

In [1]: import pandas as pd; import numpy as np

In [2]: arr = ['s%04d' % i for i in np.random.randint(0, 500000 // 10, size=500000)]; s = pd.Series(arr).astype('category')

In [3]: %timeit pd.CategoricalIndex(s)
69.4 ms ± 946 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

HEAD

In [1]: import pandas as pd; import numpy as np

In [2]: arr = ['s%04d' % i for i in np.random.randint(0, 500000 // 10, size=500000)]; s = pd.Series(arr).astype('category')

In [3]: %timeit pd.CategoricalIndex(s) 9.06 µs ± 182 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Is that a 7000x speedup, or did I mess up the math?

@codecov

TomAugspurger

@@ -130,6 +130,9 @@ def _create_categorical(self, data, categories=None, ordered=None):
-------
Categorical
"""
if isinstance(data, ABCSeries) and is_categorical_dtype(data):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The isinstance(data, ABCSeries) is maybe a bit too restrictive. This rules out, e.g. CategoricalIndex. But at moment the CategoricalIndex constructor takes a different (faster) path, so it wouldn't benefit from this change anyway.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, does it still pass if you add ABCCategoricalIndex here as well? it is more consistent

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about isinstance(data, (ABCSeries, type(self))). That'll catch Series(dtype='category'), and CategoricalIndex, but not Categorical.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about (is_categorical_dtype and not ABCCategorical)?

maybe add a helper method

_ensure_categorical ?

which passes thru Categorical
does .values on CI and Series
and raises else

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a helper method

I'd prefer to wait till we need that. I haven't come across any other cases where I need to treat CategoricalIndex and Series(dtype=category) differently than Categorical.

@topper-123

I am able to replicate this result (159 ms -> 9.88 µs).

@TomAugspurger

jorisvandenbossche

@jreback

@TomAugspurger

alanbato pushed a commit to alanbato/pandas that referenced this pull request

Nov 10, 2017

@TomAugspurger @alanbato

No-Stream pushed a commit to No-Stream/pandas that referenced this pull request

Nov 28, 2017

@TomAugspurger @No-Stream