REF: codes-based MultiIndex engine by toobaz · Pull Request #19074 · pandas-dev/pandas
closes #18519
closes #18818
closes #18520
closes #18485
closes #15994
closes #19086
- tests added / passed
- passes git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry
This PR provides a cleaner and more robust MultiIndex engine.
asv benchmarks
      before          after        ratio
    [93033151]      [6e1ecec1]
+   4.26±0.03ms     20.8±0.08ms     4.87  multiindex_object.GetLoc.time_med_get_loc_warm
+   4.28±0.02μs     19.6±0.1μs      4.58  multiindex_object.GetLoc.time_string_get_loc
+   4.12±0.02ms     18.1±0.09ms     4.40  multiindex_object.GetLoc.time_small_get_loc_warm
+   4.55±0.1μs      18.8±0.3μs      4.13  multiindex_object.GetLoc.time_med_get_loc
-   178±4ms         148±0.7ms       0.83  multiindex_object.GetLoc.time_large_get_loc
-   163±1ms         120±0.5ms       0.73  multiindex_object.Integer.time_get_indexer
-   336±0.6ms       167±1ms         0.50  multiindex_object.GetLoc.time_large_get_loc_warm

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
The benchmarks above clearly show that the engine increases performance for large indexes and significantly decreases the performance of get_loc on small and medium indexes. For reference, on the small test case (10 elements), 16 μs are lost on each get_loc call, while on the large test case (1 million elements), 146 μs are gained on each get_loc call. I think the tradeoff is acceptable, also considering that with this approach any improvement to flat indexes automatically transfers to MultiIndex. Most importantly, I think we have no other option for fixing several annoying open issues (and more bugs, I'm sure, that were never reported) caused by inconsistencies between flat indexes and MultiIndex.
Notice that moving the engine into index.pyx is problematic because Cython does not support multiple inheritance. In any case, trying to cythonize a couple of methods brought no gain at all, as the overhead is mostly due to the multiple get_loc(val) calls on the different levels.
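
To illustrate where that per-level overhead comes from, here is a rough pure-Python sketch of a codes-based lookup. It is not the code in this PR: the class name CodesEngineSketch, its attributes, and the bit-packing scheme are assumptions made for illustration only, and it assumes a unique index with no missing values. Each key is translated into one integer code per level (one lookup per level, standing in for get_loc(val) on that level), the codes are packed into a single integer, and that integer is looked up in a flat mapping.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


class CodesEngineSketch:
    """Toy codes-based lookup for tuple keys (hypothetical, not pandas code)."""

    def __init__(self, levels: Sequence[Sequence[Hashable]],
                 codes: Sequence[Sequence[int]]) -> None:
        # One mapping per level: label -> integer code.
        # This stands in for calling get_loc(val) on each level.
        self.level_maps: List[Dict[Hashable, int]] = [
            {label: i for i, label in enumerate(level)} for level in levels
        ]
        # Bits needed per level so that shifted codes never overlap.
        widths = [max(1, (len(level) - 1).bit_length()) for level in levels]
        # Shift for level i is the total width of the levels to its right.
        self.shifts: List[int] = [
            sum(widths[i + 1:]) for i in range(len(levels))
        ]
        # Flat mapping: packed integer -> row position (assumes unique rows).
        self.flat: Dict[int, int] = {
            self._combine(row): pos for pos, row in enumerate(zip(*codes))
        }

    def _combine(self, row_codes: Tuple[int, ...]) -> int:
        # Pack one code per level into a single integer.
        packed = 0
        for code, shift in zip(row_codes, self.shifts):
            packed |= code << shift
        return packed

    def get_loc(self, key: Tuple[Hashable, ...]) -> int:
        # One dictionary lookup per level, then one flat lookup.
        # Raises KeyError if a label or the combination is missing.
        row_codes = tuple(m[label] for m, label in zip(self.level_maps, key))
        return self.flat[self._combine(row_codes)]


# Rows: ("a", 1), ("a", 3), ("b", 2), ("b", 3)
engine = CodesEngineSketch(
    levels=[["a", "b"], [1, 2, 3]],
    codes=[[0, 0, 1, 1], [0, 2, 1, 2]],
)
print(engine.get_loc(("b", 2)))  # position of the row ("b", 2)
```

In this sketch the cost of get_loc grows with the number of levels, because each level needs its own lookup before the flat lookup can happen. That is the kind of fixed per-call cost that dominates on small indexes, and it is paid in the lookups themselves rather than in the surrounding glue code.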