REF: codes-based MultiIndex engine by toobaz · Pull Request #19074 · pandas-dev/pandas

closes #18519
closes #18818
closes #18520
closes #18485
closes #15994
closes #19086

This PR provides a cleaner and more robust MultiIndex engine.

asv benchmarks

       before           after         ratio
     [93033151]       [6e1ecec1]
+     4.26±0.03ms      20.8±0.08ms     4.87  multiindex_object.GetLoc.time_med_get_loc_warm
+     4.28±0.02μs       19.6±0.1μs     4.58  multiindex_object.GetLoc.time_string_get_loc
+     4.12±0.02ms      18.1±0.09ms     4.40  multiindex_object.GetLoc.time_small_get_loc_warm
+      4.55±0.1μs       18.8±0.3μs     4.13  multiindex_object.GetLoc.time_med_get_loc
-         178±4ms        148±0.7ms     0.83  multiindex_object.GetLoc.time_large_get_loc
-         163±1ms        120±0.5ms     0.73  multiindex_object.Integer.time_get_indexer
-       336±0.6ms          167±1ms     0.50  multiindex_object.GetLoc.time_large_get_loc_warm

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

... clearly show that the engine improves performance for large indexes and significantly degrades get_loc performance on small and medium indexes. For reference, on the small test case (10 elements), 16 μs are lost on each get_loc call, while on the large test case (1 million elements), 146 μs are gained on each get_loc call. I think the tradeoff is acceptable, also considering that with this approach any improvement to flat indexes automatically transfers to MultiIndex. Most importantly, I think we have no other option for fixing several annoying open issues (and more bugs, I'm sure, which were never reported) caused by inconsistencies between flat indexes and MultiIndex.
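For anyone who wants a rough feel for the per-call cost outside asv, here is a minimal timing sketch. It is not the asv benchmark itself: the index shapes, the keys, and the helper name `per_call_get_loc_seconds` are assumptions chosen for illustration, and the absolute numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def per_call_get_loc_seconds(mi, key, number=1000):
    """Average wall-clock seconds for one warm `mi.get_loc(key)` call."""
    mi.get_loc(key)  # first call builds the engine; keep it out of the timing
    return timeit.timeit(lambda: mi.get_loc(key), number=number) / number


# Hypothetical shapes, not the ones used by the asv suite.
small = pd.MultiIndex.from_product([np.arange(10), list("A")])
large = pd.MultiIndex.from_product(
    [np.arange(1000), np.arange(20), list("ABCDEFGHIJ")]
)

small_cost = per_call_get_loc_seconds(small, (5, "A"))
large_cost = per_call_get_loc_seconds(large, (500, 10, "E"))
print(f"small: {small_cost * 1e6:.1f} us/call, large: {large_cost * 1e6:.1f} us/call")
```

Running this on the commit before and after the change gives a per-call difference that can be compared with the figures quoted above.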

Note that moving the engine into index.pyx is problematic because Cython extension types do not support multiple inheritance. In any case, cythonizing a couple of methods brought no gain at all, since the overhead is mostly due to the multiple get_loc(val) calls on the different levels.
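To make the source of that overhead concrete, here is a simplified sketch of a codes-based lookup. It is not the code in this PR: the function name `codes_based_get_loc` is hypothetical, it assumes a unique MultiIndex with no missing values and a full-length key, and it uses the `codes` attribute name from recent pandas versions (called `labels` at the time of this PR).

```python
import numpy as np
import pandas as pd


def codes_based_get_loc(mi, key):
    """Illustrative only: find a full-length `key` in a unique MultiIndex."""
    # One get_loc(val) call per level translates each label into its integer
    # code. These per-level calls are the overhead discussed above.
    key_codes = [level.get_loc(val) for level, val in zip(mi.levels, key)]

    # Pack the per-level codes into a single integer (mixed-radix encoding),
    # both for the key and for every row of the index.
    packed_key = 0
    packed_rows = np.zeros(len(mi), dtype=np.int64)
    for level, code, codes in zip(mi.levels, key_codes, mi.codes):
        size = len(level)
        packed_key = packed_key * size + code
        packed_rows = packed_rows * size + np.asarray(codes, dtype=np.int64)

    # The final lookup is delegated to a flat integer index, so improvements
    # to flat indexes carry over.
    return pd.Index(packed_rows).get_loc(packed_key)


mi = pd.MultiIndex.from_product([["a", "b"], [1, 2, 3]])
position = codes_based_get_loc(mi, ("b", 2))
```

A real engine would build the packed representation once and cache it rather than recompute it on every call; the sketch recomputes it only to stay short.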