REF/PERF: MultiIndex.get_locs to use boolean arrays internally by lukemanley · Pull Request #46330 · pandas-dev/pandas

lukemanley

Use boolean arrays internally within MultiIndex.get_locs rather than int64 indexes. Logical operations show performance improvements over intersecting int64 indexes. The output remains an integer positional indexer.

       before           after         ratio
     [17dda440]       [94121581]
     <main>           <multiindex-get-locs-bool-arrays>
-        563±10μs          519±8μs     0.92  indexing.MultiIndexing.time_loc_all_scalars(True)
-      33.9±0.6ms       30.4±0.4ms     0.89  indexing.MultiIndexing.time_loc_all_null_slices(True)
-        38.7±1ms         34.5±1ms     0.89  indexing.MultiIndexing.time_loc_all_null_slices(False)
-     1.62±0.02ms      1.43±0.01ms     0.88  indexing.MultiIndexing.time_loc_all_slices(True)
-     6.40±0.06ms       5.53±0.1ms     0.86  indexing.MultiIndexing.time_loc_all_bool_indexers(True)
-         107±1ms       41.6±0.6ms     0.39  indexing.MultiIndexing.time_loc_all_lists(True)
-      34.6±0.8ms       8.24±0.4ms     0.24  indexing.MultiIndexing.time_loc_all_slices(False)
-         236±4ms       23.6±0.2ms     0.10  indexing.MultiIndexing.time_loc_all_lists(False)
-      97.3±0.7ms       9.16±0.4ms     0.09  indexing.MultiIndexing.time_loc_null_slice_plus_slice(False)
-      36.5±0.5ms      1.36±0.03ms     0.04  indexing.MultiIndexing.time_loc_null_slice_plus_slice(True)
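The idea in the description can be illustrated with a small NumPy sketch. This is not the pandas implementation: the code arrays and both helper functions are hypothetical, and they only contrast intersecting integer positions with AND-ing boolean masks before converting to positions once at the end.

```python
import numpy as np

# Hypothetical integer codes for a two-level index with 8 rows.
level0_codes = np.array([0, 0, 0, 0, 1, 1, 1, 1])
level1_codes = np.array([0, 1, 2, 3, 0, 1, 2, 3])


def locs_via_int_intersection(key0, keys1):
    """Build integer positions per level, then intersect them."""
    pos0 = np.flatnonzero(level0_codes == key0)
    pos1 = np.flatnonzero(np.isin(level1_codes, keys1))
    return np.intersect1d(pos0, pos1)


def locs_via_bool_masks(key0, keys1):
    """Build one boolean mask per level and combine with a logical AND."""
    mask = level0_codes == key0
    mask &= np.isin(level1_codes, keys1)
    # Convert to an integer positional indexer only once, at the end.
    return np.flatnonzero(mask)


int_result = locs_via_int_intersection(1, [1, 3])
bool_result = locs_via_bool_masks(1, [1, 3])
```

Both helpers return the same integer positions; the second avoids the per-level set intersection, which is the kind of work the benchmarks above suggest is comparatively slow.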

jreback

# if we have a provided indexer, then this need not consider
# the entire labels set
if step is not None and step < 0:
    # Switch elements for negative step size
    start, stop = stop - 1, start - 1
r = np.arange(start, stop, step)

can you add an explanation (similar to the below) say around L3160, e.g. for a future reader to understand what this algorithm is doing.
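For a future reader, the start/stop swap in the snippet above can be tried in isolation. The function name `codes_for_slice` is hypothetical and the sketch only reproduces the quoted lines, assuming `start` and `stop` delimit a half-open range of level codes.

```python
import numpy as np


def codes_for_slice(start, stop, step=None):
    """Return the level codes in [start, stop) visited by the given step."""
    if step is not None and step < 0:
        # Switch elements for negative step size: walk from the last
        # included code (stop - 1) down to, but excluding, start - 1.
        start, stop = stop - 1, start - 1
    return np.arange(start, stop, step)


forward = codes_for_slice(2, 7)
backward = codes_for_slice(2, 7, -1)
backward_by_two = codes_for_slice(2, 7, -2)
```

With a negative step the result matches reversing the forward range, so `backward` visits the same codes as `forward` in the opposite order.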

yep needs a rebase :->

rebased this one

jbrockmendel

@@ -310,7 +310,7 @@ Performance improvements
 - Performance improvement in :meth:`.GroupBy.diff` (:issue:`16706`)
 - Performance improvement in :meth:`.GroupBy.transform` when broadcasting values for user-defined functions (:issue:`45708`)
 - Performance improvement in :meth:`.GroupBy.transform` for user-defined functions when only a single group exists (:issue:`44977`)
-- Performance improvement in :meth:`MultiIndex.get_locs` (:issue:`45681`, :issue:`46040`)
+- Performance improvement in :meth:`MultiIndex.get_locs` (:issue:`45681`, :issue:`46040`, :issue:`46330`)

most users don't use get_locs directly; is there a more user-facing description?

How about this:

Performance improvement in :meth:`DataFrame.loc` and :meth:`Series.loc` for tuple-based indexing of a :class:`MultiIndex`
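As an illustration of what "tuple-based indexing" means from the user's side, here is a minimal sketch. The frame, its level names, and the `value` column are made up for the example; only `DataFrame.loc` and `MultiIndex.get_locs` come from the discussion.

```python
import numpy as np
import pandas as pd

# Hypothetical two-level index: 2 outer labels x 3 inner labels.
mi = pd.MultiIndex.from_product([["a", "b"], [1, 2, 3]], names=["outer", "inner"])
df = pd.DataFrame({"value": np.arange(6)}, index=mi)

# Tuple-based indexing: every "outer" label, "inner" labels 1 and 3.
subset = df.loc[(slice(None), [1, 3]), :]

# The same key passed to get_locs yields integer positions into the index.
positions = mi.get_locs((slice(None), [1, 3]))
```

Here `positions` indexes the same rows that `subset` contains, which is why a speed-up in `get_locs` shows up as a speed-up in `.loc`.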

updated, thanks

@jreback

can you merge master once again

@lukemanley

@jreback - merged main and greenish. I don't think the error is related as I see it showing up in other PRs as well

jreback

)
indexer &= lvl_indexer
if not np.any(indexer) and np.any(lvl_indexer):
    raise KeyError(seq)

this hit by tests?

Yes, covered by test_loc.py > test_missing_key_combination
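The check discussed above can be sketched outside pandas. The helper `combine_level_masks` and the sample arrays are hypothetical; the sketch assumes each level has already produced a boolean mask and only mirrors the quoted condition: each label exists on its own level, but the combination selects no rows.

```python
import numpy as np

# Hypothetical level values for a 4-row, two-level index.
one = np.array(["a", "a", "b", "b"])
two = np.array(["1", "2", "2", "3"])


def combine_level_masks(level_masks, key):
    """AND per-level boolean masks; raise KeyError for a missing combination."""
    indexer = np.ones(len(level_masks[0]), dtype=bool)
    for lvl_indexer in level_masks:
        indexer &= lvl_indexer
        # The level matched something, yet nothing is left after combining.
        if not np.any(indexer) and np.any(lvl_indexer):
            raise KeyError(key)
    return np.flatnonzero(indexer)


found = combine_level_masks([one == "a", two == "1"], ("a", "1"))

try:
    combine_level_masks([one == "b", two == "1"], ("b", "1"))
    missing_raised = False
except KeyError:
    # "b" and "1" each exist, but never on the same row.
    missing_raised = True
```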

lukemanley deleted the multiindex-get-locs-bool-arrays branch

March 20, 2022 23:18

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request

Jul 13, 2022
