PERF: remove_unused_levels is very slow · Issue #16556 · pandas-dev/pandas

Code Sample

import numpy as np
import pandas as pd

series = pd.DataFrame(dict(
    A=np.random.randint(0, 10000, 100000),
    B=np.random.randint(0, 10000, 100000),
    V=np.random.rand(100000))).groupby(['A', 'B']).V.sum()
filtered_series = series[series < 0.1]

%time x = filtered_series.index.remove_unused_levels()
%time y = filtered_series.reset_index().set_index(['A', 'B']).index

Problem description

On my laptop, computing x takes 20 to 40 times as long as computing y, even though y does the extra work of sorting the second level and reindexing the series along the way. Apart from the ordering of the second level, the two resulting indexes are identical. Why is remove_unused_levels so slow?
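For reference, here is a rough sketch of what I would expect the operation to cost: one np.unique pass per level over the integer labels. This is not the actual pandas implementation, just a manual equivalent I wrote for comparison (the function name is mine, and it uses the 0.20-era MultiIndex.labels attribute, which later versions call .codes):

import numpy as np
import pandas as pd

def drop_unused_levels(idx):
    # Manual sketch of remove_unused_levels: for each level, find the
    # label values actually used and remap the labels onto the reduced level.
    new_levels, new_labels = [], []
    for lev, lab in zip(idx.levels, idx.labels):
        used, inv = np.unique(lab, return_inverse=True)
        new_levels.append(lev.take(used))
        new_labels.append(inv)
    return pd.MultiIndex(levels=new_levels, labels=new_labels, names=idx.names)

On the filtered index above, this runs in roughly the same time as the reset_index/set_index hack, which makes the gap to remove_unused_levels all the more surprising.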

Expected Output

remove_unused_levels should be at least as fast on large indexes as the reset_index/set_index workaround.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.11.2-1-ARCH
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.utf8
LOCALE: en_US.UTF-8

pandas: 0.20.1
pytest: None
pip: 9.0.1
setuptools: 35.0.2
Cython: None
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None