PERF: potential room for optimizing concatenation of MultiIndex (MultiIndex.append) · Issue #53574 · pandas-dev/pandas (original) (raw)

While discussing #53515, I noticed that specifically the concatenation of the index is showing up in an example implementation of stack(), in this case concatting MultiIndexes. That made me wonder why this operation is relatively slow, and if there would be ways to speed this up. What follows here is a short report of this exploration.

Test case: combining two MultiIndex objects:

idx1 = pd.MultiIndex.from_product([range(1000), range(100), ['a']]) idx2 = pd.MultiIndex.from_product([range(1000), range(100), ['b']])

result = idx1.append(idx2)

This operation essentially "densifies" the levels of the MultiIndex (get_level_values), appends them, and then re-encodes those dense values into codes/levels with MultiIndex.from_arrays (code).
Below is a small toy implementation of concatting two MultiIndex objects (with same number of levels) with never fully densifying the level values, but only "recoding" the existing codes to new labels:

from pandas.core.arrays.categorical import recode_for_categories

def concat_multi_index(idx1, idx2): codes = [] levels = [] for level in range(idx1.nlevels): level_values = idx1.levels[level].union(idx2.levels[level]) codes1 = recode_for_categories(idx1.codes[level], idx1.levels[level], level_values, copy=False) codes2 = recode_for_categories(idx2.codes[level], idx2.levels[level], level_values, copy=False) codes.append(np.concatenate([codes1, codes2])) levels.append(level_values) return pd.MultiIndex(codes=codes, levels=levels, names=idx1.names, verify_integrity=False)

This gives the same result, and a significant (20x) speed-up for this case:

In [51]: res1 = idx1.append(idx2)

In [52]: res2 = concat_multi_index(idx1, idx2)

In [53]: res1.equals(res2)
Out[53]: True

In [54]: %timeit idx1.append(idx2)
10.7 ms ± 218 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [55]: %timeit concat_multi_index(idx1, idx2)
466 µs ± 53.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

(run with current main branch)

A bunch of caveats:

I currently don't have time to further dig in and make this into a PR myself, so if someone else is interested, feel free to take it up!