PERF: Speed up Spearman calculation by dsaxton · Pull Request #28151 · pandas-dev/pandas

Added a few benchmarks: one on wide data (1000 by 500), one on the wide data with random NaNs, and a memory benchmark on the wide data (a sketch of what these look like follows the summary). Here's the summary:

       before           after         ratio
     [498f3008]       [420b1d6a]
     <master>         <corr-perf>
-        84.1±2ms       7.86±0.5ms     0.09  stat_ops.Correlation.time_corr('spearman', False)
-      83.1±0.5ms       7.57±0.4ms     0.09  stat_ops.Correlation.time_corr('spearman', True)
-       23.3±0.1s       1.86±0.05s     0.08  stat_ops.Correlation.time_corr_wide('spearman', True)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
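
For reference, here is a minimal sketch of what these benchmarks might look like in asv_bench/benchmarks/stat_ops.py. The class and method names match the output below, but the setup details (the non-wide frame's shape, the NaN density, the bottleneck toggle) are assumptions, not the PR's actual code:

```python
import numpy as np
import pandas as pd


class Correlation:
    # Parameter grid matching the (method, use_bottleneck) axes in the output.
    params = [["spearman", "kendall", "pearson"], [True, False]]
    param_names = ["method", "use_bottleneck"]

    def setup(self, method, use_bottleneck):
        pd.set_option("compute.use_bottleneck", use_bottleneck)
        self.df = pd.DataFrame(np.random.randn(1000, 30))  # assumed shape
        self.df_wide = pd.DataFrame(np.random.randn(1000, 500))
        # Knock out ~10% of values at random; the density is an assumption.
        mask = np.random.random(self.df_wide.shape) < 0.9
        self.df_wide_nans = self.df_wide.where(mask)

    def time_corr(self, method, use_bottleneck):
        self.df.corr(method=method)

    def time_corr_wide(self, method, use_bottleneck):
        self.df_wide.corr(method=method)

    def time_corr_wide_nans(self, method, use_bottleneck):
        self.df_wide_nans.corr(method=method)

    def peakmem_corr_wide(self, method, use_bottleneck):
        self.df_wide.corr(method=method)
```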

A couple of things that didn't make it into the summary but are interesting:

master

(Not directly relevant to this pull request, but Kendall's tau fails on the wide data. It's not shown here, but it fails without NaNs as well.)

[ 75.00%] · For pandas commit 498f3008 <master> (round 2/2):
[ 75.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....
[ 75.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 78.57%] ··· stat_ops.Correlation.peakmem_corr_wide                                                                                                                     ok
[ 78.57%] ··· ========== ======= =======
              --          use_bottleneck
              ---------- ---------------
                method     True   False 
              ========== ======= =======
               spearman   82.2M   82.2M 
               kendall    98.7M   98.7M 
               pearson    81.8M   81.8M 
              ========== ======= =======

[ 92.86%] ··· stat_ops.Correlation.time_corr_wide_nans                                                                                                           2/6 failed
[ 92.86%] ··· ========== ============ ============
              --               use_bottleneck     
              ---------- -------------------------
                method       True        False    
              ========== ============ ============
               spearman    20.1±0s     20.0±0.02s 
               kendall      failed       failed   
               pearson    1.43±0.02s   1.55±0.06s 
              ========== ============ ============

corr-perf

Surprisingly, the peak memory usage is not much worse than on master. If we add random null values to the data, the time performance is the same as on master (the sketch after the output below shows why).

[ 50.00%] · For pandas commit 420b1d6a <corr-perf> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 53.57%] ··· stat_ops.Correlation.peakmem_corr_wide                                                                                                                     ok
[ 53.57%] ··· ========== ======= =======
              --          use_bottleneck
              ---------- ---------------
                method     True   False 
              ========== ======= =======
               spearman    86M     86M  
               kendall    98.7M   98.7M 
               pearson    81.7M   81.6M 
              ========== ======= =======

[ 67.86%] ··· stat_ops.Correlation.time_corr_wide_nans                                                                                                           2/6 failed
[ 67.86%] ··· ========== ============ ============
              --               use_bottleneck     
              ---------- -------------------------
                method       True        False    
              ========== ============ ============
               spearman   20.1±0.1s     20.0±0s   
               kendall      failed       failed   
               pearson    1.42±0.02s   1.41±0.03s 
              ========== ============ ============
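
To make the NaN observation concrete: when there are no missing values, Spearman is just Pearson applied to ranks, so each column can be ranked once and the ranks reused across all pairs; with missing values, every pair needs its own mask and re-rank, so that case keeps master's cost profile. Here is a rough NumPy/SciPy sketch of the two paths. This is my illustration of the idea, not the PR's actual Cython implementation:

```python
import numpy as np
from scipy.stats import rankdata


def spearman_fast(mat):
    # No-NaN fast path: rank every column once, then do a single Pearson
    # pass over the rank matrix (O(K * N log N) ranking instead of per-pair).
    ranks = np.apply_along_axis(rankdata, 0, mat)
    return np.corrcoef(ranks, rowvar=False)


def spearman_pairwise(mat):
    # NaN path: each pair gets its own mask and re-rank, so the ranking work
    # is O(K^2 * N log N) again -- the same cost profile as master.
    n, k = mat.shape
    out = np.empty((k, k))
    for i in range(k):
        for j in range(i, k):
            valid = ~(np.isnan(mat[:, i]) | np.isnan(mat[:, j]))
            ri, rj = rankdata(mat[valid, i]), rankdata(mat[valid, j])
            out[i, j] = out[j, i] = np.corrcoef(ri, rj)[0, 1]
    return out
```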