PERF: df.loc is 100x slower for CategoricalIndex than for normal Index · Issue #20395 · pandas-dev/pandas

ORIGINAL: 13.8 ms
EDIT: After #21369 was merged the result of %timeit df2.loc['b'] has improved to 3.8 ms.
EDIT: After #21618 was merged the result of %timeit df2.loc['b'] has improved to 3.3 ms.
EDIT: After #21659 was merged the result of %timeit df2.loc['b'] has improved to 1.6 ms.
EDIT: After #23235 was merged the result of %timeit df2.loc['b'] has improved to 159 µs. Issue resolved.

Code Sample

>>> import pandas as pd
>>> n = 100_000
>>> df1 = pd.DataFrame(dict(A=range(n*3)), index=list('a'*n + 'b'*n + 'c'*n))
>>> df1.index.is_monotonic_increasing
True
>>> df2 = df1.copy()
>>> df2.index = pd.CategoricalIndex(df2.index)
>>> df2.index.is_monotonic_increasing
True
>>> %timeit df1.loc['b']
125 µs ± 2.95 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit df2.loc['b']
13.8 ms ± 193 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Problem description

Selecting on a CategoricalIndex is 100x slower than selecting on a normal Index.

I've tested this on master (a few days old) and on v0.22, with the same result for both versions. The lookup is even slower than a full column scan:

>>> df3 = df2.reset_index()
>>> %timeit df3[df3['index'] == 'b']
6.58 ms ± 25.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
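For anyone who wants to reproduce the three timings outside IPython, here is a minimal sketch using the standard `timeit` module. The helper names `build_frames` and `best_ms` are made up for this example, and the numbers it prints will depend on the machine and the pandas version, so no expected timings are shown.

```python
import timeit

import pandas as pd


def build_frames(n=100_000):
    """Build the three frames from the report (hypothetical helper)."""
    labels = list('a' * n + 'b' * n + 'c' * n)
    df1 = pd.DataFrame({'A': range(n * 3)}, index=labels)  # plain Index
    df2 = df1.copy()
    df2.index = pd.CategoricalIndex(df2.index)             # CategoricalIndex
    df3 = df2.reset_index()                                # labels in column 'index'
    return df1, df2, df3


def best_ms(func, repeat=3, number=5):
    """Best per-call time in milliseconds over a few repeats (hypothetical helper)."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number * 1e3


df1, df2, df3 = build_frames()

timings = {
    "Index .loc['b']": best_ms(lambda: df1.loc['b']),
    "CategoricalIndex .loc['b']": best_ms(lambda: df2.loc['b']),
    "column scan == 'b'": best_ms(lambda: df3[df3['index'] == 'b']),
}

for label, ms in timings.items():
    print(f"{label:<30}{ms:10.3f} ms")
```

All three selections return the same 100,000 rows, so any difference in the printed times comes from the lookup path alone.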

My guess is that the binary search is bypassed and a full index scan is performed instead, plus some additional overhead, which would explain why it is even slower than a normal full column scan.
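To illustrate what a binary search on a sorted CategoricalIndex could look like, here is a rough sketch that searches the integer codes with NumPy's `searchsorted`. This is not how pandas implements `.loc` internally; `sorted_label_slice` is a hypothetical helper written only for this example, and it assumes the codes are sorted in non-decreasing order.

```python
import numpy as np
import pandas as pd


def sorted_label_slice(index: pd.CategoricalIndex, label) -> slice:
    """Return the positional slice for `label` via binary search on the codes.

    Hypothetical helper for illustration only, not a pandas API.
    """
    codes = np.asarray(index.codes)
    # Binary search is only valid when the codes are sorted.
    if codes.size > 1 and not np.all(codes[:-1] <= codes[1:]):
        raise ValueError("index codes must be sorted in non-decreasing order")
    # Map the label to its integer code; raises KeyError for unknown labels.
    code = index.categories.get_loc(label)
    start = int(codes.searchsorted(code, side='left'))
    stop = int(codes.searchsorted(code, side='right'))
    return slice(start, stop)


n = 1000
idx = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))
df = pd.DataFrame({'A': range(n * 3)}, index=idx)

sl = sorted_label_slice(df.index, 'b')
rows = df.iloc[sl]  # same rows as df.loc['b'], found in O(log n) comparisons
```

Two `searchsorted` calls locate the start and end of the run of matching codes, so the cost grows with the logarithm of the index length rather than linearly as in a full scan.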

Expected Output

The output is as expected, but the speed is very slow for CategoricalIndex.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: a7a7f8c
python: 3.6.3.final.0
python-bits: 32
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0.dev0+870.ga7a7f8c
pytest: 3.3.1
pip: 9.0.1
setuptools: 38.2.5
Cython: 0.26.1
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: 0.10.0
IPython: 6.2.1
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: None
bs4: None
html5lib: 1.0b10
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None