improves groupby.get_group_index when shape is a long sequence by behzadnouri · Pull Request #11180 · pandas-dev/pandas

closes #10161

xref #10161 (comment)

In [5]: df = DataFrame(np.random.randn(5000, 100).astype(str))

In [6]: %timeit df.duplicated()
1 loops, best of 3: 151 ms per loop

In [7]: %timeit df.T.duplicated()
1 loops, best of 3: 1.39 s per loop

Part of this is due to taking the transpose itself (possibly cache locality). For example, the frame below performs better even though it has the same shape as df.T above:

In [8]: df = DataFrame(np.random.randn(100, 5000).astype(str))

In [9]: %timeit df.duplicated()
1 loops, best of 3: 965 ms per loop
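For anyone who wants to reproduce the comparison outside IPython, here is a minimal sketch using the standard-library `timeit` module. The helper name `best_time`, the variable names, and the reduced frame sizes are assumptions chosen so the script runs quickly; absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def best_time(func, repeat=3):
    """Return the best wall-clock time (seconds) over `repeat` single runs."""
    return min(timeit.repeat(func, number=1, repeat=repeat))


rng = np.random.default_rng(0)

# Tall frame of strings: many rows, few columns (scaled down from 5000 x 100).
tall = pd.DataFrame(rng.standard_normal((500, 20)).astype(str))

# Same data transposed: few rows, many columns.
transposed = tall.T

# Freshly built frame with the same shape as the transposed one.
fresh_wide = pd.DataFrame(rng.standard_normal((20, 500)).astype(str))

timings = {
    "tall": best_time(tall.duplicated),
    "transposed": best_time(transposed.duplicated),
    "fresh_wide": best_time(fresh_wide.duplicated),
}

for name, seconds in timings.items():
    print(f"{name:>10}: {seconds * 1e3:.2f} ms")
```

Comparing `transposed` against `fresh_wide` separates the cost of the transpose from the cost of having many columns, which is the distinction drawn above.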

Are there asv benchmarks for this?

Can you add a doc-note in the performance section as well? Thanks.

I tested this fix on the same dataframe and it looks like it solves the problem:

In [8]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 0 to 4999
Data columns (total 35 columns):
...
dtypes: float64(7), int64(12), object(16)
memory usage: 1.4+ MB

In [9]: %timeit -n 3 df.T.duplicated()
3 loops, best of 3: 549 ms per loop

There is still a slight regression from 0.12.0 but it is minimal. Thanks for fixing this.

@behzadnouri maybe add a case with the transposed frame to that benchmark? (to catch this case with many columns)
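A rough sketch of what such a benchmark could look like, following the asv convention that `setup` runs before each benchmark and methods prefixed with `time_` are timed. The class name, method names, and frame sizes here are hypothetical and are not taken from the pandas benchmark suite.

```python
import numpy as np
import pandas as pd


class FrameDuplicatedWide:
    """Hypothetical asv-style benchmark covering tall and wide frames."""

    def setup(self):
        rng = np.random.default_rng(0)
        # Tall frame of strings: many rows, few columns.
        self.df = pd.DataFrame(rng.standard_normal((1000, 50)).astype(str))
        # Transposed frame: few rows, many columns (the slow case in this PR).
        self.df_wide = self.df.T

    def time_duplicated_tall(self):
        self.df.duplicated()

    def time_duplicated_wide(self):
        self.df_wide.duplicated()
```

Keeping both cases in one class makes a regression that only affects frames with many columns show up next to the baseline.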