CLN: ASV long and broken benchmarks by mroeschke · Pull Request #19113 · pandas-dev/pandas
Conversation
Shortened the size of the data in GroupByMethods, since some benchmarks took over 30 seconds on my machine, and additionally fixed some broken benchmarks.
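For context, here is a minimal sketch of a parameterized asv benchmark in the style of GroupByMethods. The class and method names mirror the real benchmark in asv_bench/benchmarks/groupby.py, but the bodies below are illustrative, not the PR's code:

import numpy as np
import pandas as pd


class GroupByMethods:
    # asv runs setup() once per (dtype, method) combination,
    # then times time_method() for each.
    param_names = ['dtype', 'method']
    params = [['int', 'float'], ['all', 'describe', 'mad', 'sum']]

    def setup(self, dtype, method):
        ngroups = 1000
        size = ngroups * 2  # kept small so slow methods finish quickly
        values = np.random.randint(0, 1000, size=size)
        if dtype == 'float':
            values = values.astype(float)
        key = np.random.randint(0, ngroups, size=size)
        self.df = pd.DataFrame({'key': key, 'values': values})

    def time_method(self, dtype, method):
        # note: mad() reflects the 2018-era API; it was removed in pandas 2.0
        getattr(self.df.groupby('key')['values'], method)()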
$ asv dev -b ^groupby.GroupByMethods
· Discovering benchmarks
· Running 1 total benchmarks (1 commits * 1 environments * 1 benchmarks)
[ 0.00%] ·· Building for existing-py_home_matt_anaconda_envs_pandas_dev_bin_python
[ 0.00%] ·· Benchmarking existing-py_home_matt_anaconda_envs_pandas_dev_bin_python
[100.00%] ··· Running groupby.GroupByMethods.time_method ok
[100.00%] ····
======= ============== ========
dtype method
------- -------------- --------
int all 256ms
int any 255ms
int count 925μs
int cumcount 1.15ms
int cummax 1.13ms
int cummin 1.15ms
int cumprod 1.58ms
int cumsum 1.16ms
int describe 3.25s
int first 1.12ms
int head 1.37ms
int last 1.12ms
int mad 1.42s
int max 1.12ms
int min 1.16ms
int median 1.53ms
int mean 1.43ms
int nunique 1.40ms
int pct_change 1.56s
int prod 1.53ms
int rank 380ms
int sem 414ms
int shift 974μs
int size 858μs
int skew 414ms
int std 1.46ms
int sum 1.50ms
int tail 1.45ms
int unique 289ms
int value_counts 2.35ms
int var 1.34ms
float all 402ms
float any 406ms
float count 1.18ms
float cumcount 1.33ms
float cummax 1.40ms
float cummin 1.40ms
float cumprod 1.75ms
float cumsum 1.40ms
float describe 5.02s
float first 1.37ms
float head 1.58ms
float last 1.36ms
float mad 2.01s
float max 1.38ms
float min 1.37ms
float median 1.80ms
float mean 1.79ms
float nunique 1.60ms
float pct_change 2.17s
float prod 1.75ms
float rank 623ms
float sem 416ms
float shift 1.18ms
float size 1.09ms
float skew 646ms
float std 1.51ms
float sum 1.77ms
float tail 1.63ms
float unique 457ms
float value_counts 2.63ms
float var 1.43ms
======= ============== ========
$ asv dev -b ^groupby.GroupStrings
· Discovering benchmarks
· Running 1 total benchmarks (1 commits * 1 environments * 1 benchmarks)
[ 0.00%] ·· Building for existing-py_home_matt_anaconda_envs_pandas_dev_bin_python
[ 0.00%] ·· Benchmarking existing-py_home_matt_anaconda_envs_pandas_dev_bin_python
[100.00%] ··· Running groupby.GroupStrings.time_multi_columns 781ms
(pandas_dev)matt@matt-Inspiron-1545:~/Projects/pandas-mroeschke/asv_bench$ asv dev -b ^frame_methods.ToHTML
· Discovering benchmarks
· Running 1 total benchmarks (1 commits * 1 environments * 1 benchmarks)
[ 0.00%] ·· Building for existing-py_home_matt_anaconda_envs_pandas_dev_bin_python
[ 0.00%] ·· Benchmarking existing-py_home_matt_anaconda_envs_pandas_dev_bin_python
[100.00%] ··· Running frame_methods.ToHTML.time_to_html_mixed 565ms
Codecov Report
Merging #19113 into master will not change coverage.
The diff coverage is n/a.
@@           Coverage Diff            @@
##           master   #19113   +/-   ##
=======================================
  Coverage   91.51%   91.51%
  Files         148      148
  Lines       48753    48753
  Hits        44616    44616
  Misses       4137     4137
Flag        Coverage Δ
#multiple   89.88% <ø> (ø) ⬆️
#single     41.6% <ø> (ø) ⬆️
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 36a71eb...5898212.
Can you reduce the size of pct_change, mad, and describe to be in line with the others? Or are they just exponentially slower?
I believe mad, pct_change, and describe are just slower groupby methods. mad and pct_change are in the _common_apply_whitelist, while describe is implemented via apply().
In [6]: df.head() #2000 rows x 2 columns
Out[6]:
key values
0 1271 534
1 654 195
2 758 336
3 428 298
4 254 531
In [7]: %timeit df.groupby('key')['values'].describe()
1 loop, best of 3: 3.12 s per loop
In [8]: %timeit df.groupby('key')['values'].pct_change()
1 loop, best of 3: 1.09 s per loop
In [9]: %timeit df.groupby('key')['values'].mad()
1 loop, best of 3: 991 ms per loop
In [10]: %timeit df.groupby('key')['values'].sem() # specifically defined
100 loops, best of 3: 2.21 ms per loop
In [11]: %timeit df.groupby('key')['values'].prod() # specifically defined
100 loops, best of 3: 2.25 ms per loop
I believe it's better to keep the data size the same for all the groupby methods, so it's easier to identify which methods (like mad and pct_change) might need a performance boost.
@mroeschke ok great, thanks for the comment. Yes, some of these are non-performant.
If you want to go through the non-performant ones and see if we have issues for them, that would be great (I know we have some, but ideally we could just create a single master issue). A rough sketch of how one might survey them follows.
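A rough sketch of such a survey (the method list and data shape here are assumptions, not from the PR): time each groupby method on the same small frame and flag the apply-based ones that stand out.

import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': np.random.randint(0, 2000, size=2000),
                   'values': np.random.randint(0, 1000, size=2000)})
grouped = df.groupby('key')['values']

# Compare a few cythonized methods against apply-based ones to pick
# candidates for a master performance issue. mad() predates pandas 2.0.
for name in ['sum', 'prod', 'sem', 'mad', 'pct_change', 'describe']:
    seconds = timeit.timeit(getattr(grouped, name), number=3) / 3
    print('{:>12}: {:8.2f} ms'.format(name, seconds * 1e3))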
maximveksler pushed a commit to maximveksler/pandas that referenced this pull request