PERF: IndexEngine.get_indexer_non_unique by lukemanley · Pull Request #55816 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation7 Commits2 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
- closes #xxxx (Replace xxxx with the GitHub issue number)
- Tests added and passed if fixing a bug or adding a new feature
- All code checks passed.
- Added type annotations to new arguments/methods/functions.
- Added an entry in the latest
doc/source/whatsnew/v2.2.0.rst
file if fixing a bug or adding a new feature.
maybe(?) closes #15364
When resizing result array, grow by factor of 2 rather than fixed amount.
import pandas as pd
import numpy as np
idx = pd.Index(np.ones(1_000_000))
target = pd.Index([1])
%timeit idx.get_indexer_for(target)
# 968 ms ± 35.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) -> main
# 73.1 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) -> PR
import pandas as pd
import numpy as np
idx = pd.Index(np.arange(1_000_000).tolist() + [0])
target = idx[:-1]
%timeit idx.get_indexer_for(target)
2.17 s ± 22.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) -> main
1.18 s ± 27.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) -> PR
How would performance be if scaling followed 10000 + 2^n where 2^n was about 10000 on the first iteration?
How would performance be if scaling followed 10000 + 2^n where 2^n was about 10000 on the first iteration?
I tried 10000 + 2^n starting with n=13 (8192) which produced the following timings for the two examples above:
79.7 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.34 s ± 71.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Pretty close to the timings in this PR. Let me know if you prefer to use this.
If the resizing is the expensive operation here, I think this would be generically more performant for larger cases too so yes
If I understand correctly, I think your suggestion is very close to the current PR as they both end up growing by a factor of 2:
In [1]: import pandas as pd
In [2]: pd.DataFrame(
...: {
...: "main": [(n+1) * 10_000 for n in range(10)],
...: "PR": [10_000 * 2**n for n in range(10)],
...: "proposed": [10_000] + [10_000 + 2**(13+n) for n in range(9)]
...: }
...: )
Out[2]:
main PR proposed
0 10000 10000 10000
1 20000 20000 18192
2 30000 40000 26384
3 40000 80000 42768
4 50000 160000 75536
5 60000 320000 141072
6 70000 640000 272144
7 80000 1280000 534288
8 90000 2560000 1058576
9 100000 5120000 2107152
Am I understanding your suggestion correctly?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see. Yeah what you have is sufficient in this case (and less complex than what I proposed)
Labels
Related to the Index class or subclasses
Memory or execution speed performance