PERF: Speed up Spearman calculation by dsaxton · Pull Request #28151 · pandas-dev/pandas
Added a few benchmarks: one on wide data (1000 by 500), one on wide data with random NaNs, and a peak-memory benchmark on the wide data. Here's the summary:
       before           after         ratio
     [498f3008]       [420b1d6a]
     <master>         <corr-perf>
-        84.1±2ms       7.86±0.5ms     0.09  stat_ops.Correlation.time_corr('spearman', False)
-      83.1±0.5ms       7.57±0.4ms     0.09  stat_ops.Correlation.time_corr('spearman', True)
-       23.3±0.1s       1.86±0.05s     0.08  stat_ops.Correlation.time_corr_wide('spearman', True)
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
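For readers unfamiliar with asv, the benchmarks described above could be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: the class layout follows the usual asv conventions (`setup`, `time_*`, `peakmem_*`), while the `shape` attribute, the 90% keep fraction for non-null values, and the random seed are assumptions made for the example. The `use_bottleneck` parameter that appears in the results is omitted here for brevity.

```python
import numpy as np
import pandas as pd


class Correlation:
    # asv runs each time_*/peakmem_* method once per parameter value.
    params = [["spearman", "kendall", "pearson"]]
    param_names = ["method"]

    # Shape taken from the description above (1000 by 500); kept as a
    # class attribute so it is easy to shrink for a quick local run.
    shape = (1000, 500)

    def setup(self, method):
        rng = np.random.default_rng(0)  # assumed seed, for repeatability
        self.df_wide = pd.DataFrame(rng.standard_normal(self.shape))
        # Assumption: keep roughly 90% of the values, set the rest to NaN.
        keep = rng.random(self.shape) < 0.9
        self.df_wide_nans = self.df_wide.where(keep)

    def time_corr_wide(self, method):
        self.df_wide.corr(method=method)

    def time_corr_wide_nans(self, method):
        self.df_wide_nans.corr(method=method)

    def peakmem_corr_wide(self, method):
        self.df_wide.corr(method=method)
```

asv would normally discover and run this class itself; calling `setup` and a `time_*` method by hand is only useful as a smoke test.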
A couple of things that didn't make it into the summary but are interesting:
master
(Not directly relevant to this pull request, but Kendall's tau fails on the wide data. It's not shown here, but it fails without nans as well.)
[ 75.00%] · For pandas commit 498f3008 <master> (round 2/2):
[ 75.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....
[ 75.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 78.57%] ··· stat_ops.Correlation.peakmem_corr_wide                 ok
[ 78.57%] ··· ========== ======= =======
              --          use_bottleneck
              ---------- ---------------
                method     True   False
              ========== ======= =======
               spearman   82.2M   82.2M
               kendall    98.7M   98.7M
               pearson    81.8M   81.8M
              ========== ======= =======
[ 92.86%] ··· stat_ops.Correlation.time_corr_wide_nans       2/6 failed
[ 92.86%] ··· ========== ============ ============
              --               use_bottleneck
              ---------- -------------------------
                method       True        False
              ========== ============ ============
               spearman    20.1±0s     20.0±0.02s
               kendall      failed       failed
               pearson    1.43±0.02s   1.55±0.06s
              ========== ============ ============
corr-perf
Surprisingly, peak memory usage for Spearman is not much worse than on master (86M versus 82.2M). With random null values added to the data, the timing is the same as on master (about 20 s for Spearman on both branches).
[ 50.00%] · For pandas commit 420b1d6a <corr-perf> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 53.57%] ··· stat_ops.Correlation.peakmem_corr_wide                 ok
[ 53.57%] ··· ========== ======= =======
              --          use_bottleneck
              ---------- ---------------
                method     True   False
              ========== ======= =======
               spearman    86M     86M
               kendall    98.7M   98.7M
               pearson    81.7M   81.6M
              ========== ======= =======
[ 67.86%] ··· stat_ops.Correlation.time_corr_wide_nans       2/6 failed
[ 67.86%] ··· ========== ============ ============
              --               use_bottleneck
              ---------- -------------------------
                method       True        False
              ========== ============ ============
               spearman   20.1±0.1s     20.0±0s
               kendall      failed       failed
               pearson    1.42±0.02s   1.41±0.03s
              ========== ============ ============
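
One plausible reading of why the NaN case sees no speedup: Spearman correlation is Pearson correlation computed on ranks, so with complete data each column can be ranked once up front. With missing values, `corr` works on pairwise-complete rows, so the ranks of a column depend on which column it is paired with and cannot be shared. The sketch below only illustrates that identity with public pandas calls; it is not the Cython code changed in this pull request, and the data shape, seed, and NaN fraction are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumed seed
df = pd.DataFrame(rng.standard_normal((200, 6)))

# Complete data: rank every column once, then take Pearson on the ranks.
rank_once = df.rank().corr(method="pearson")
reference = df.corr(method="spearman")
matches_without_nans = np.allclose(rank_once, reference)

# Data with NaNs: ranking each column once uses all of its non-null
# values, whereas Spearman on a pair of columns ranks only the rows
# where both are present, so the shortcut is no longer guaranteed
# to agree with the reference.
df_nans = df.where(rng.random(df.shape) < 0.9)
shortcut = df_nans.rank().corr(method="pearson")
reference_nans = df_nans.corr(method="spearman")
max_abs_diff = (shortcut - reference_nans).abs().to_numpy().max()
```

Here `matches_without_nans` checks the rank-once identity on complete data, and `max_abs_diff` shows how far the shortcut drifts once values are missing.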