DataFrame.corr(method="spearman") much slower than DataFrame.rank().corr(method="pearson") · Issue #28139 · pandas-dev/pandas
import numpy as np
import pandas as pd
np.random.seed(1234)
df = pd.DataFrame(np.random.random((5000, 100)))
%timeit df.corr(method="spearman")
4.54 s ± 113 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.rank().corr(method="pearson")
105 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
pd.testing.assert_frame_equal(df.corr(method="spearman"), df.rank().corr(method="pearson"))
No error
pd.__version__
'0.25.0'
Problem description
DataFrame.corr(method="spearman") is about 40 times slower than DataFrame.rank().corr(method="pearson") in the timings above (4.54 s versus 105 ms), even though the two ultimately perform the same calculation. While probably not the cleanest option, I would think that corr(method="spearman") could at least be made an alias for rank().corr(method="pearson"), or maybe a change could be made in algos.pyx.
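As a stopgap on the user side, the rank-then-Pearson route can be wrapped in a small helper. This is only a sketch, not a proposed patch: the name spearman_via_rank is hypothetical, and it assumes the frame has no missing values. With NaNs present the two routes may differ, because corr(method="spearman") ranks each pair of columns over their jointly non-null rows, while rank() ranks each column on its own.

```python
import numpy as np
import pandas as pd


def spearman_via_rank(frame: pd.DataFrame) -> pd.DataFrame:
    """Spearman correlation computed as Pearson correlation of column ranks.

    Hypothetical helper for illustration. It refuses frames with missing
    values, since the equivalence with corr(method="spearman") is only
    assumed to hold for complete data.
    """
    if frame.isna().any().any():
        raise ValueError("spearman_via_rank assumes a frame without NaNs")
    # rank() uses average ranks for ties by default, which is the
    # convention Spearman's coefficient expects.
    return frame.rank().corr(method="pearson")


# Small check on random data (smaller than the 5000 x 100 frame above).
rng = np.random.default_rng(1234)
demo = pd.DataFrame(rng.random((500, 10)))

fast = spearman_via_rank(demo)
slow = demo.corr(method="spearman")

# Raises if the two results differ beyond floating-point tolerance.
pd.testing.assert_frame_equal(fast, slow)
```

Raising on NaNs instead of silently returning a different answer is a deliberate choice here: it keeps the helper a drop-in replacement only where the two results are expected to agree.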
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 0.25.0
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 41.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.5.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.0
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None