improves groupby.get_group_index when shape is a long sequence by behzadnouri · Pull Request #11180 · pandas-dev/pandas
closes #10161
xref #10161 (comment)
In [5]: df = DataFrame(np.random.randn(5000, 100).astype(str))
In [6]: %timeit df.duplicated()
1 loops, best of 3: 151 ms per loop

In [7]: %timeit df.T.duplicated()
1 loops, best of 3: 1.39 s per loop
part of this is because of taking the transpose itself (maybe cache locality), i.e. the frame below performs better even though its shape is the same as df.T above:
In [8]: df = DataFrame(np.random.randn(100, 5000).astype(str))
In [9]: %timeit df.duplicated()
1 loops, best of 3: 965 ms per loop
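For anyone who wants to rerun the comparison above outside IPython, here is a minimal sketch. The helper names `make_frames` and `best_time` are made up for illustration and are not part of pandas or of this PR; the seed and the use of `time.perf_counter` are also assumptions. Absolute timings will differ by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd


def make_frames(nrows=5000, ncols=100, seed=0):
    """Build the three frames compared in this thread (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    # tall frame: nrows x ncols of strings, as in In [5]
    tall = pd.DataFrame(rng.randn(nrows, ncols).astype(str))
    # wide frame created directly with the transposed shape, as in In [8]
    wide = pd.DataFrame(rng.randn(ncols, nrows).astype(str))
    # tall.T has the same shape as `wide` but is produced by a transpose
    return tall, tall.T, wide


def best_time(frame, repeat=3):
    """Return the best wall-clock time of frame.duplicated() over `repeat` runs."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        frame.duplicated()
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    tall, transposed, wide = make_frames()
    for label, frame in [("df", tall), ("df.T", transposed), ("wide df", wide)]:
        print("%-8s shape=%s best of 3: %.3f s" % (label, frame.shape, best_time(frame)))
```

Comparing the "df.T" and "wide df" rows separates the cost of the long shape from whatever the transpose adds.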
are there asv benches for this?
can you add a doc-note in the performance section as well. thxs.
I tested this fix on the same dataframe and it looks like it solves the problem
In [8]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 0 to 4999
Data columns (total 35 columns):
...
dtypes: float64(7), int64(12), object(16)
memory usage: 1.4+ MB

In [9]: %timeit -n 3 df.T.duplicated()
3 loops, best of 3: 549 ms per loop
There is still a slight regression from 0.12.0 but it is minimal. Thanks for fixing this.
@behzadnouri maybe add to that benchmark a case with the transposed frame? (to catch this case with many columns)
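A rough sketch of what such a benchmark could look like, following the usual asv convention of a class with a `setup` method and `time_*` methods. The class name, method names, frame size, and seed are all assumptions for illustration; this is not the benchmark that exists in the pandas repo.

```python
import numpy as np
import pandas as pd


class DuplicatedManyColumns:
    # Hypothetical asv-style benchmark class; asv calls setup() before
    # timing each time_* method.

    def setup(self):
        rng = np.random.RandomState(1234)
        # 1000 rows x 100 columns of strings (size chosen arbitrarily)
        self.df = pd.DataFrame(rng.randn(1000, 100).astype(str))
        # transposed frame: few rows, many columns, i.e. a long `shape`
        self.df_t = self.df.T

    def time_frame_duplicated(self):
        self.df.duplicated()

    def time_frame_duplicated_wide(self):
        self.df_t.duplicated()
```

The second method is the one that would exercise the many-columns path discussed here.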