PERF: Index._shallow_copy doesn't copy ._engine · Issue #28584 · pandas-dev/pandas
import numpy as np
import pandas as pd

idx = pd.Index(np.arange(100_000))
%timeit idx.get_loc(99_999)
# 774 ns ± 26 ns per loop
%timeit idx._shallow_copy().get_loc(99_999)
# 3.57 ms ± 56.8 µs per loop
The same performance issue can be seen on other index types, e.g. CategoricalIndex and MultiIndex.
Problem description
The reason for the above difference is that _shallow_copy does not copy the ._engine attribute over to the new index, and the _engine is expensive to recreate.
Indexes are immutable and, to my understanding, so is the ._engine attribute of an index. The ._engine is quite expensive to create, so if it has already been created on the original index, I think it should be possible to reuse it on the new index, saving the time needed to create a new and identical ._engine.
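To make the idea concrete, here is a minimal sketch in plain Python. ToyIndex, its builds counter and the reuse_engine flag are all hypothetical names invented for illustration; this is a toy model of a lazily cached lookup table under the assumption that values and engine never mutate, not how pandas actually implements Index or _shallow_copy.

```python
class ToyIndex:
    """Hypothetical immutable index with a lazily built lookup engine."""

    builds = 0  # counts how many times any engine was built

    def __init__(self, values):
        self._values = tuple(values)  # immutable data
        self._cache = {}              # holds the engine once it exists

    @property
    def _engine(self):
        # Build the hash table on first use only; this is the expensive step.
        if "engine" not in self._cache:
            ToyIndex.builds += 1
            self._cache["engine"] = {v: i for i, v in enumerate(self._values)}
        return self._cache["engine"]

    def get_loc(self, key):
        return self._engine[key]

    def shallow_copy(self, reuse_engine=False):
        new = ToyIndex.__new__(ToyIndex)
        new._values = self._values
        # Sharing is safe here only because neither values nor engine mutate.
        new._cache = dict(self._cache) if reuse_engine else {}
        return new


idx = ToyIndex(range(100_000))
idx.get_loc(99_999)                                   # builds the engine
idx.shallow_copy().get_loc(99_999)                    # builds it again
idx.shallow_copy(reuse_engine=True).get_loc(99_999)   # reuses the first one
print(ToyIndex.builds)  # 2
```

In this toy, the copy made without reuse_engine pays the full build cost on its first lookup, which mirrors the slowdown in the timings above; the copy that carries the cache over does not.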
_shallow_copy is used in a few places internally in pandas, so there seems to be potential for speeding up several pandas methods by copying the _engine over to newly copied indexes.
Possibly I'm missing some finer details here. For example, I don't know what ._engine.clear_mappings is for, and its name suggests it is destructive, but overall it seems possible to change _shallow_copy so that the engine is copied over.
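The concern about a destructive method can be illustrated with another toy. ToyEngine and its clear method are hypothetical and make no claim about what the real engine method does; the sketch only assumes an engine whose mapping can be dropped and is rebuilt lazily on the next lookup. Under that assumption, sharing one engine between two indexes stays correct, but clearing it through one index costs the other a rebuild.

```python
class ToyEngine:
    """Hypothetical lookup engine; a toy, not the pandas engine."""

    def __init__(self, values):
        self._values = values
        self._mapping = None  # built lazily
        self.builds = 0

    def _ensure_mapping(self):
        if self._mapping is None:
            self.builds += 1
            self._mapping = {v: i for i, v in enumerate(self._values)}

    def get_loc(self, key):
        self._ensure_mapping()
        return self._mapping[key]

    def clear(self):
        # Destructive: drops the hash table; the next lookup rebuilds it.
        self._mapping = None


shared = ToyEngine(tuple(range(1000)))
original_engine = shared  # engine as seen from the original index
copy_engine = shared      # same object, as seen from a shallow copy

original_engine.get_loc(999)  # first build
copy_engine.clear()           # cleared through the copy
original_engine.get_loc(999)  # original has to rebuild
print(shared.builds)  # 2
```

Whether this matters in practice depends on when, if ever, the real engine is cleared, which is exactly the detail I'm unsure about.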
Output of pd.show_versions()
INSTALLED VERSIONS
commit : 79663fb
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 0.25.0.dev0+1363.g79663fb66
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 41.0.1
Cython : 0.29.13
pytest : 5.0.1
hypothesis : 4.28.2
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.6.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : 2.6.9
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None