Improve performance of equality comparison between a simple Index and a MultiIndex by tlaytongoogle · Pull Request #29134 · pandas-dev/pandas

Previously, Index.equals(MultiIndex) and MultiIndex.equals(Index) both converted the MultiIndex into an ndarray. Because this conversion must resolve the MultiIndex's by-reference structure (i.e. its codes + levels) into an ndarray's by-value structure, it can be computationally expensive.
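To illustrate the by-reference versus by-value distinction, here is a small sketch using a toy MultiIndex (the values are made up for illustration and are not taken from this PR). It only inspects the public `levels`, `codes`, and `values` attributes:

```python
import pandas as pd

# Toy example: 2 x 3 = 6 entries built from two small levels.
mi = pd.MultiIndex.from_product([[10, 20], ["a", "b", "c"]])

# By-reference form: each level stores its unique values once,
# and each entry refers to them through integer codes.
level_sizes = [len(level) for level in mi.levels]
code_lengths = [len(codes) for codes in mi.codes]

# By-value form: .values materializes one tuple per entry in an
# object ndarray, which is the conversion described above.
materialized = mi.values

print(level_sizes, code_lengths, materialized.dtype, len(materialized))
```

For a large MultiIndex, the materialized form needs one Python tuple per row, which is why avoiding the conversion can matter.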

However, it is only possible for a simple Index to equal a MultiIndex in two cases:

the MultiIndex has only 1 level
the MultiIndex has d levels, and the Index is an object index of size-d sequences (e.g. d-tuples)

Thus, if the Index is not object-typed and its nlevels differs from that of the MultiIndex, the two are determined to be unequal without any ndarray conversion.
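The reasoning above can be sketched as a cheap pre-check. The helper name `can_possibly_equal` is hypothetical and is not the code added by this PR; it only restates the two cases using public pandas attributes:

```python
import pandas as pd
from pandas.api.types import is_object_dtype


def can_possibly_equal(index: pd.Index, multi: pd.MultiIndex) -> bool:
    """Hypothetical helper: return False only when equality is impossible.

    Assumes `index` is a simple (non-Multi) Index.
    """
    # An object-typed Index may hold d-tuples, so a full comparison
    # would still be needed to decide equality.
    if is_object_dtype(index.dtype):
        return True
    # Otherwise the two can only match if the level counts agree,
    # i.e. the MultiIndex has exactly one level.
    return index.nlevels == multi.nlevels


two_level = pd.MultiIndex.from_product([[1, 2], ["a", "b"]])
one_level = pd.MultiIndex.from_arrays([[1, 2, 3]])

print(can_possibly_equal(pd.Index([1, 2, 3]), two_level))
print(can_possibly_equal(pd.Index([1, 2, 3]), one_level))
```

A True result here does not mean the indexes are equal, only that the short-circuit does not apply and a full comparison is still required.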

MWE:

import pandas as pd

long_cheap_index = pd.RangeIndex(1000000)
short_expensive_index = pd.IntervalIndex(
    [pd.Interval(pd.Timestamp(2018, 10, 1), pd.Timestamp(2018, 10, 2))])

large_expensive_multiindex = pd.MultiIndex.from_product(
    [long_cheap_index, short_expensive_index])
trivial_simple_index = pd.Int64Index([])

# These operations no longer convert large_expensive_multiindex into an ndarray
# Previously, each took ~10 s; now, ~0.02 ms
large_expensive_multiindex.equals(trivial_simple_index)
trivial_simple_index.equals(large_expensive_multiindex)

tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff