PERF: IndexEngine.get_indexer_non_unique by lukemanley · Pull Request #55816 · pandas-dev/pandas


Conversation


lukemanley

maybe(?) closes #15364

When resizing the result array, grow it by a factor of 2 rather than by a fixed amount.

import pandas as pd
import numpy as np

idx = pd.Index(np.ones(1_000_000))
target = pd.Index([1])

%timeit idx.get_indexer_for(target)

# 968 ms ± 35.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)     -> main
# 73.1 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  -> PR
import pandas as pd
import numpy as np

idx = pd.Index(np.arange(1_000_000).tolist() + [0])
target = idx[:-1]

%timeit idx.get_indexer_for(target)

# 2.17 s ± 22.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  -> main
# 1.18 s ± 27.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  -> PR
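
For intuition, here is a minimal pure-Python sketch of the two growth strategies (not the actual Cython code in pandas/_libs/index.pyx; fill_result and its arguments are illustrative). It counts how many reallocations, and how many elements are copied, each strategy needs to collect n_results matches starting from a 10,000-element buffer:

import numpy as np

def fill_result(n_results, grow):
    # Append n_results items into a buffer, resizing with grow(current_size)
    # whenever it fills up; return (number of reallocations, elements copied).
    buf = np.empty(10_000, dtype=np.intp)
    n_realloc = copied = count = 0
    for i in range(n_results):
        if count == len(buf):
            new_buf = np.empty(grow(len(buf)), dtype=np.intp)
            new_buf[:count] = buf  # the copy that makes each resize costly
            buf = new_buf
            n_realloc += 1
            copied += count
        buf[count] = i
        count += 1
    return n_realloc, copied

n = 1_000_000
print("fixed +10_000:", fill_result(n, lambda size: size + 10_000))  # main
print("doubling:     ", fill_result(n, lambda size: size * 2))       # this PR

With a fixed step, the total number of copied elements grows quadratically in the number of results, while doubling keeps it within a constant factor of the final size, which is consistent with the speedups shown above.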


@mroeschke

How would performance be if scaling followed 10000 + 2^n where 2^n was about 10000 on the first iteration?

@lukemanley

> How would performance be if scaling followed 10000 + 2^n where 2^n was about 10000 on the first iteration?

I tried 10000 + 2^n starting with n=13 (8192), which produced the following timings for the two examples above:

79.7 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.34 s ± 71.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Pretty close to the timings in this PR. Let me know if you prefer to use this.

@mroeschke

If the resizing is the expensive operation here, I think this would be generically more performant for larger cases too, so yes.
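
As a rough back-of-the-envelope sketch (assuming the 10,000-element starting buffer; the helper names are illustrative), the number of resizes needed before the buffer can hold n results under each scheme:

import math

def n_resizes_fixed(n, start=10_000, step=10_000):
    # grow by a fixed step each time (main)
    return max(0, math.ceil((n - start) / step))

def n_resizes_doubling(n, start=10_000):
    # grow by a factor of 2 each time (this PR)
    k = 0
    while start * 2**k < n:
        k += 1
    return k

for n in (10**5, 10**6, 10**7):
    print(f"{n:>10,}  fixed: {n_resizes_fixed(n):>4}  doubling: {n_resizes_doubling(n):>2}")

The fixed-step count grows linearly with n while the doubling count grows only logarithmically, so the gap widens for larger results.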

@lukemanley

If I understand correctly, I think your suggestion is very close to the current PR, as both end up growing by a factor of 2:

In [1]: import pandas as pd

In [2]: pd.DataFrame(
   ...:     {
   ...:         "main": [(n+1) * 10_000 for n in range(10)],
   ...:         "PR": [10_000 * 2**n for n in range(10)],
   ...:         "proposed": [10_000] + [10_000 + 2**(13+n) for n in range(9)]
   ...:     }
   ...: )
Out[2]: 
     main       PR  proposed
0   10000    10000     10000
1   20000    20000     18192
2   30000    40000     26384
3   40000    80000     42768
4   50000   160000     75536
5   60000   320000    141072
6   70000   640000    272144
7   80000  1280000    534288
8   90000  2560000   1058576
9  100000  5120000   2107152

Am I understanding your suggestion correctly?


mroeschke


Ah, I see. Yeah, what you have is sufficient in this case (and less complex than what I proposed).


Labels

Index: Related to the Index class or subclasses
Performance: Memory or execution speed performance