Improve performance of equality comparison between a simple Index and a MultiIndex by tlaytongoogle · Pull Request #29134 · pandas-dev/pandas
Previously, Index.equals(MultiIndex) and MultiIndex.equals(Index) both converted the MultiIndex into an ndarray. Because this conversion must resolve the MultiIndex's by-reference structure (its codes and levels) into an ndarray's by-value structure, it can be computationally expensive.
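For illustration, the two representations can be compared on a small MultiIndex. This is only a sketch using the public levels, codes, and values attributes; the variable names are made up for the example and it does not show the internal conversion path that equals() took.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([["a", "b"], [1, 2, 3]])

# By-reference form: each level stores its distinct labels once, and
# codes are integer positions pointing into those levels.
level_sizes = [len(level) for level in mi.levels]
codes = [c.tolist() for c in mi.codes]

# By-value form: an object ndarray holding one Python tuple per row,
# which has to be materialised from the codes and levels.
values = mi.values
```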
However, it is only possible for a simple Index to equal a MultiIndex in two cases:
- the MultiIndex has only 1 level
- the MultiIndex has d levels, and the Index is an object index of size-d sequences (e.g. d-tuples)
Thus, if the Index is not object-typed and its nlevels differs from that of the MultiIndex (that is, the MultiIndex has more than one level), the two are determined to be unequal without any ndarray conversion.
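A minimal sketch of that short-circuit, written as a standalone function rather than as the actual change inside pandas. The helper name may_equal_multiindex is hypothetical; it returns False only when equality can be ruled out cheaply, and True when a full comparison would still be needed.

```python
import pandas as pd
from pandas.api.types import is_object_dtype


def may_equal_multiindex(index, multiindex):
    """Hypothetical helper: False means the two cannot be equal."""
    # Case 1: same number of levels (a 1-level MultiIndex vs. a simple Index).
    if index.nlevels == multiindex.nlevels:
        return True
    # Case 2: an object-typed Index might hold d-tuples matching d levels.
    if is_object_dtype(index.dtype):
        return True
    # Otherwise equality is impossible; no ndarray conversion is required.
    return False


mi = pd.MultiIndex.from_product([range(3), ["x", "y"]])
int_index = pd.RangeIndex(6)
# tupleize_cols=False keeps the tuples in a plain object Index.
tuple_index = pd.Index(list(mi), tupleize_cols=False)
single_level = pd.MultiIndex.from_arrays([[1, 2, 3]])

ruled_out = may_equal_multiindex(int_index, mi)
needs_full_check = may_equal_multiindex(tuple_index, mi)
same_nlevels = may_equal_multiindex(pd.Index([1, 2, 3]), single_level)
```

Only the first comparison is rejected up front; the other two still fall through to the ordinary element-wise comparison.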
MWE:
import pandas as pd

long_cheap_index = pd.RangeIndex(1000000)
short_expensive_index = pd.IntervalIndex(
    [pd.Interval(pd.Timestamp(2018, 10, 1), pd.Timestamp(2018, 10, 2))])
large_expensive_multiindex = pd.MultiIndex.from_product(
    [long_cheap_index, short_expensive_index])
trivial_simple_index = pd.Int64Index([])

# These operations no longer convert large_expensive_multiindex into an ndarray
# Previously, each took ~10 s; now, ~0.02 ms
large_expensive_multiindex.equals(trivial_simple_index)
trivial_simple_index.equals(large_expensive_multiindex)
- tests added / passed
- passes black pandas
- passes git diff upstream/master -u -- "*.py" | flake8 --diff