PERF: optimize MultiIndex.from_product by immerrr · Pull Request #7627 · pandas-dev/pandas

This PR speeds up MultiIndex.from_product by exploiting the fact that operating on categorical codes is faster than operating on the values themselves.
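The idea can be illustrated with a minimal NumPy-only sketch. This is not the code from the PR: the helper name `product_codes` is hypothetical, `np.unique` stands in for pandas' factorization, and every input is assumed to be non-empty and one-dimensional. Each input is factorized once, at its original size, and the cartesian product is then built from the small integer codes rather than from the values.

```python
import numpy as np


def product_codes(inputs):
    """Hypothetical helper: factorize each input, then expand only the codes.

    Returns (levels, codes), where levels[i][codes[i]] reproduces the i-th
    column of the cartesian product, with the first input varying slowest.
    Assumes every input is non-empty and one-dimensional.
    """
    levels, small_codes = [], []
    for values in inputs:
        # np.unique returns the sorted unique values and, for each element,
        # the integer position of that element among the uniques.
        uniques, inverse = np.unique(np.asarray(values), return_inverse=True)
        levels.append(uniques)
        small_codes.append(inverse.ravel())

    sizes = [len(c) for c in small_codes]
    total = int(np.prod(sizes))

    codes = []
    repeat = total
    for c, n in zip(small_codes, sizes):
        repeat //= n                  # consecutive repeats of each element
        tile = total // (n * repeat)  # how many times the block is tiled
        codes.append(np.tile(np.repeat(c, repeat), tile))
    return levels, codes


levels, codes = product_codes([["b", "a"], [1, 2, 3]])
print(levels[0][codes[0]], levels[1][codes[1]])
```

The expensive step (finding unique values) runs on arrays of the input sizes, while the expansion to full product length only touches integer arrays.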

This yields roughly a 2x improvement in the following benchmark:

In [1]: import pandas.util.testing as tm

In [2]: data = [tm.makeStringIndex(10000), tm.makeFloatIndex(20)]

In [3]: %timeit pd.MultiIndex.from_product(data)
100 loops, best of 3: 10.6 ms per loop

In [4]: %timeit pd.MultiIndex.from_arrays(pd.tools.util.cartesian_product(data))
10 loops, best of 3: 23.4 ms per loop

It is only marginally slower for small inputs:

In [1]: data = [np.arange(20).astype(object), np.arange(20)]

In [2]: %timeit pd.MultiIndex.from_product(data)
1000 loops, best of 3: 317 µs per loop

In [3]: %timeit pd.MultiIndex.from_arrays(pd.tools.util.cartesian_product(data))
1000 loops, best of 3: 308 µs per loop

In [4]: data_int = [np.arange(20), np.arange(20)]

In [5]: %timeit pd.MultiIndex.from_product(data_int)
1000 loops, best of 3: 285 µs per loop

In [6]: %timeit pd.MultiIndex.from_arrays(pd.tools.util.cartesian_product(data_int))
1000 loops, best of 3: 269 µs per loop

The next case came as a surprise: the cartesian product itself is very fast in both the old and the new version, but profiling showed that factorization is a lot faster when done on a smaller array:

In [7]: data_large = [np.arange(10000), np.arange(20)]

In [8]: %timeit pd.MultiIndex.from_arrays(pd.tools.util.cartesian_product(data_large))
100 loops, best of 3: 9.88 ms per loop

In [9]: %timeit pd.MultiIndex.from_product(data_large)
100 loops, best of 3: 2.74 ms per loop
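A rough back-of-the-envelope sketch of why the input size matters so much for factorization follows. The helper `elements_factorized` is hypothetical, and the sketch assumes that the old path factorizes every expanded array at full product length while the new path factorizes each input at its original length. It only counts elements and does not model actual timings.

```python
def elements_factorized(sizes):
    """Hypothetical helper: count elements passed through factorization.

    Returns (on_product, on_inputs):
      on_product: every level is factorized at full product length
      on_inputs:  every level is factorized at its own input length
    """
    total = 1
    for n in sizes:
        total *= n
    on_product = total * len(sizes)
    on_inputs = sum(sizes)
    return on_product, on_inputs


# Sizes taken from data_large above: 10000 and 20 elements.
print(elements_factorized([10000, 20]))  # (400000, 10020)
```

Under these assumptions, the `data_large` case factorizes about 40 times fewer elements, which is consistent with the direction of the measured speedup even though the cartesian product itself is cheap.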