PERF: potential room for optimizing concatenation of MultiIndex (MultiIndex.append) · Issue #53574 · pandas-dev/pandas (original) (raw)

While discussing #53515, I noticed that specifically the concatenation of the index is showing up in an example implementation of stack(), in this case concatting MultiIndexes. That made me wonder why this operation is relatively slow, and if there would be ways to speed this up. What follows here is a short report of this exploration.

Test case: combining two MultiIndex objects:

idx1 = pd.MultiIndex.from_product([range(1000), range(100), ['a']]) idx2 = pd.MultiIndex.from_product([range(1000), range(100), ['b']])

result = idx1.append(idx2)

This operation essentially "densifies" the levels of the MultiIndex (get_level_values), appends them, and then re-encodes those dense values into codes/levels with MultiIndex.from_arrays (code).
Below is a small toy implementation of concatting two MultiIndex objects (with same number of levels) with never fully densifying the level values, but only "recoding" the existing codes to new labels:

from pandas.core.arrays.categorical import recode_for_categories

def concat_multi_index(idx1, idx2): codes = [] levels = [] for level in range(idx1.nlevels): level_values = idx1.levels[level].union(idx2.levels[level]) codes1 = recode_for_categories(idx1.codes[level], idx1.levels[level], level_values, copy=False) codes2 = recode_for_categories(idx2.codes[level], idx2.levels[level], level_values, copy=False) codes.append(np.concatenate([codes1, codes2])) levels.append(level_values) return pd.MultiIndex(codes=codes, levels=levels, names=idx1.names, verify_integrity=False)

This gives the same result, and a significant (20x) speed-up for this case:

In [51]: res1 = idx1.append(idx2)

In [52]: res2 = concat_multi_index(idx1, idx2)

In [53]: res1.equals(res2)
Out[53]: True

In [54]: %timeit idx1.append(idx2)
10.7 ms ± 218 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [55]: %timeit concat_multi_index(idx1, idx2)
466 µs ± 53.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

(run with current main branch)

A bunch of caveats:

This is not a generic implementation (currently assumes you have matching levels, only for two inputs (while append can take a list), etc), and thus would need some more (which probably also adds some extra overhead) to generalize it. I also didn't test it, so there might be various corner cases that we have to handle that I am overlooking here (a good next step would be to run our test suite with this ..).
I only benchmarked it for this very specific case. It would be good to test it with some other MultiIndexes with different characteristics (number of unique level values, how much overlap in level values between the different indices, etc), to verify if it can be faster more in general, or if it was specific to the data I was testing with.

I currently don't have time to further dig in and make this into a PR myself, so if someone else is interested, feel free to take it up!