CLN: ASV long and broken benchmarks by mroeschke · Pull Request #19113 · pandas-dev/pandas
Conversation
Shortened the size of the data in GroupByMethods, since some benchmarks took over 30 seconds on my machine, and additionally fixed some broken benchmarks.
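For context, here is a minimal sketch of a parameterized asv benchmark in the style of GroupByMethods. The class and method names mirror the real benchmark in asv_bench/benchmarks/groupby.py, but the bodies below are illustrative, not the PR's code:

import numpy as np
import pandas as pd


class GroupByMethods:
    # asv runs setup() once per (dtype, method) combination,
    # then times time_method() for each.
    param_names = ['dtype', 'method']
    params = [['int', 'float'], ['all', 'describe', 'mad', 'sum']]

    def setup(self, dtype, method):
        ngroups = 1000
        size = ngroups * 2  # kept small so slow methods finish quickly
        values = np.random.randint(0, 1000, size=size)
        if dtype == 'float':
            values = values.astype(float)
        key = np.random.randint(0, ngroups, size=size)
        self.df = pd.DataFrame({'key': key, 'values': values})

    def time_method(self, dtype, method):
        # note: mad() reflects the 2018-era API; it was removed in pandas 2.0
        getattr(self.df.groupby('key')['values'], method)()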
$ asv dev -b ^groupby.GroupByMethods
· Discovering benchmarks
· Running 1 total benchmarks (1 commits * 1 environments * 1 benchmarks)
[ 0.00%] ·· Building for existing-py_home_matt_anaconda_envs_pandas_dev_bin_python
[ 0.00%] ·· Benchmarking existing-py_home_matt_anaconda_envs_pandas_dev_bin_python
[100.00%] ··· Running groupby.GroupByMethods.time_method ok
[100.00%] ····
======= ============== ========
dtype method
------- -------------- --------
int all 256ms
int any 255ms
int count 925μs
int cumcount 1.15ms
int cummax 1.13ms
int cummin 1.15ms
int cumprod 1.58ms
int cumsum 1.16ms
int describe 3.25s
int first 1.12ms
int head 1.37ms
int last 1.12ms
int mad 1.42s
int max 1.12ms
int min 1.16ms
int median 1.53ms
int mean 1.43ms
int nunique 1.40ms
int pct_change 1.56s
int prod 1.53ms
int rank 380ms
int sem 414ms
int shift 974μs
int size 858μs
int skew 414ms
int std 1.46ms
int sum 1.50ms
int tail 1.45ms
int unique 289ms
int value_counts 2.35ms
int var 1.34ms
float all 402ms
float any 406ms
float count 1.18ms
float cumcount 1.33ms
float cummax 1.40ms
float cummin 1.40ms
float cumprod 1.75ms
float cumsum 1.40ms
float describe 5.02s
float first 1.37ms
float head 1.58ms
float last 1.36ms
float mad 2.01s
float max 1.38ms
float min 1.37ms
float median 1.80ms
float mean 1.79ms
float nunique 1.60ms
float pct_change 2.17s
float prod 1.75ms
float rank 623ms
float sem 416ms
float shift 1.18ms
float size 1.09ms
float skew 646ms
float std 1.51ms
float sum 1.77ms
float tail 1.63ms
float unique 457ms
float value_counts 2.63ms
float var 1.43ms
======= ============== ========
$ asv dev -b ^groupby.GroupStrings
· Discovering benchmarks
· Running 1 total benchmarks (1 commits * 1 environments * 1 benchmarks)
[ 0.00%] ·· Building for existing-py_home_matt_anaconda_envs_pandas_dev_bin_python
[ 0.00%] ·· Benchmarking existing-py_home_matt_anaconda_envs_pandas_dev_bin_python
[100.00%] ··· Running groupby.GroupStrings.time_multi_columns 781ms
(pandas_dev)matt@matt-Inspiron-1545:~/Projects/pandas-mroeschke/asv_bench$ asv dev -b ^frame_methods.ToHTML
· Discovering benchmarks
· Running 1 total benchmarks (1 commits * 1 environments * 1 benchmarks)
[ 0.00%] ·· Building for existing-py_home_matt_anaconda_envs_pandas_dev_bin_python
[ 0.00%] ·· Benchmarking existing-py_home_matt_anaconda_envs_pandas_dev_bin_python
[100.00%] ··· Running frame_methods.ToHTML.time_to_html_mixed 565ms
Codecov Report
Merging #19113 into master will not change coverage.
The diff coverage is n/a.
@@           Coverage Diff            @@
##           master   #19113   +/-   ##
=======================================
  Coverage   91.51%   91.51%
  Files         148      148
  Lines       48753    48753
  Hits        44616    44616
  Misses       4137     4137
Flag        Coverage Δ
#multiple   89.88% <ø> (ø) ⬆️
#single     41.6% <ø> (ø) ⬆️
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 36a71eb...5898212.
Can you reduce the size of pct_change, mad, and describe to be in line with the others? Or are they just exponentially slower?
I believe mad, pct_change, and describe are just slower groupby methods. mad and pct_change are in the _common_apply_whitelist, while describe is implemented via apply().
In [6]: df.head() #2000 rows x 2 columns
Out[6]:
key values
0 1271 534
1 654 195
2 758 336
3 428 298
4 254 531
In [7]: %timeit df.groupby('key')['values'].describe()
1 loop, best of 3: 3.12 s per loop
In [8]: %timeit df.groupby('key')['values'].pct_change()
1 loop, best of 3: 1.09 s per loop
In [9]: %timeit df.groupby('key')['values'].mad()
1 loop, best of 3: 991 ms per loop
In [10]: %timeit df.groupby('key')['values'].sem() # specifically defined
100 loops, best of 3: 2.21 ms per loop
In [11]: %timeit df.groupby('key')['values'].prod() # specifically defined
100 loops, best of 3: 2.25 ms per loop
I believe it's better to keep the data size the same for all the groupby methods, so it's easier to identify which methods (like mad and pct_change) might need a performance boost.
@mroeschke ok great, thanks for the comment. Yes, some of these are non-performant.
If you want to go through the non-performant ones and see if we have issues for them, that would be great (I know we have some, but ideally we could just create a single master issue). A rough sketch of how one might survey them follows.
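A rough sketch of such a survey (the method list and data shape here are assumptions, not from the PR): time each groupby method on the same small frame and flag the apply-based ones that stand out.

import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': np.random.randint(0, 2000, size=2000),
                   'values': np.random.randint(0, 1000, size=2000)})
grouped = df.groupby('key')['values']

# Compare a few cythonized methods against apply-based ones to pick
# candidates for a master performance issue. mad() predates pandas 2.0.
for name in ['sum', 'prod', 'sem', 'mad', 'pct_change', 'describe']:
    seconds = timeit.timeit(getattr(grouped, name), number=3) / 3
    print('{:>12}: {:8.2f} ms'.format(name, seconds * 1e3))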
maximveksler pushed a commit to maximveksler/pandas that referenced this pull request