PERF: Regression with indexing with ExtensionEngine · Issue #45652 · pandas-dev/pandas

Pandas version checks

Reproducible Example

import pandas as pd
from itertools import permutations, chain
import string

string_values = chain(
    string.ascii_uppercase,
    permutations(string.ascii_uppercase, 2),
    permutations(string.ascii_uppercase, 3),
)

string_index = pd.Index(map(''.join, string_values)).astype('string')

df = pd.DataFrame({'ints': range(len(string_index))}, index=string_index)

subset_index = string_index[string_index.str.startswith('A')]

%time slow_result = df.loc[subset_index]

CPU times: user 1.93 s, sys: 0 ns, total: 1.93 s

Wall time: 1.93 s

slow_result = df.loc[subset_index]

%time fast_result = df.loc[subset_index.values]

CPU times: user 2.85 ms, sys: 0 ns, total: 2.85 ms

Wall time: 2.82 ms

fast_result = df.loc[subset_index.values]

The results are the same.

pd.testing.assert_frame_equal(slow_result, fast_result)

Old object indexes don't have this issue.

object_index_df = df.copy()
object_index_df.index = object_index_df.index.astype(object)

%time obj_result = object_index_df.loc[subset_index]

CPU times: user 945 µs, sys: 19 µs, total: 964 µs

Wall time: 939 µs

obj_result = object_index_df.loc[subset_index]

This only happens when both the index and the indexer have dtype='string'. Note that df.index and string_index are both dtype='string'. Oddly, just accessing .values or converting the indexer to a list makes indexing fast again. Old object indexes don't have the same issue.
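For anyone who wants to compare the three ways of passing the same keys side by side, here is a minimal sketch. It assumes a smaller index than the one in the report (single letters plus two-letter permutations) so that it finishes quickly even on an affected build. The helper `timed_loc` and the dictionary labels are hypothetical names used only for illustration. Timings will vary by machine and pandas version, so none are shown.

```python
import string
import time
from itertools import chain, permutations

import pandas as pd

# Smaller index than in the report so the sketch runs quickly everywhere.
labels = chain(string.ascii_uppercase, permutations(string.ascii_uppercase, 2))
string_index = pd.Index(["".join(label) for label in labels]).astype("string")
df = pd.DataFrame({"ints": range(len(string_index))}, index=string_index)
subset_index = string_index[string_index.str.startswith("A")]


def timed_loc(frame, indexer):
    """Return (result, elapsed seconds) for frame.loc[indexer]."""
    start = time.perf_counter()
    result = frame.loc[indexer]
    return result, time.perf_counter() - start


# Hypothetical labels for the three ways of passing the same keys.
indexers = {
    "string Index": subset_index,
    ".values": subset_index.values,
    "list": subset_index.tolist(),
}
results = {}
for name, indexer in indexers.items():
    results[name], elapsed = timed_loc(df, indexer)
    print(f"{name:>12}: {elapsed * 1e3:.3f} ms")

# Same check as in the report: Index and .values lookups give equal frames.
pd.testing.assert_frame_equal(results["string Index"], results[".values"])
# For the list indexer, compare only the selected values.
assert results["list"]["ints"].tolist() == results["string Index"]["ints"].tolist()
```

On a build with the regression, the "string Index" line would be expected to stand out from the other two. With the full three-letter index from the report the gap should be much larger.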

Installed Versions

INSTALLED VERSIONS

commit : c5ff649
python : 3.10.0.final.0
python-bits : 64
OS : Linux
OS-release : 5.16.0-arch1-1
Version : #1 SMP PREEMPT Mon, 10 Jan 2022 20:11:47 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.0.dev0+193.gc5ff649b11
numpy : 1.23.0.dev0+512.g6077afd65
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.5.3
Cython : 0.29.26
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.7.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 7.29.0
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : 0.55.0dev0+1077.g0994f97c3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 7.0.0.dev587+g458271315
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 2.0.0b1
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
zstandard : None

Prior Performance

No response