DataFrame.corr(method="spearman") much slower than DataFrame.rank().corr(method="pearson") · Issue #28139 · pandas-dev/pandas

import numpy as np
import pandas as pd

np.random.seed(1234)

df = pd.DataFrame(np.random.random((5000, 100)))

%timeit df.corr(method="spearman")

4.54 s ± 113 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df.rank().corr(method="pearson")

105 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

pd.testing.assert_frame_equal(df.corr(method="spearman"), df.rank().corr(method="pearson"))

No error

pd.__version__

'0.25.0'

Problem description

DataFrame.corr(method="spearman") is roughly 40 times slower than DataFrame.rank().corr(method="pearson") in the benchmark above (4.54 s vs. 105 ms), even though the two are ultimately doing the same calculation, as the assert_frame_equal check confirms for this data. While probably not the cleanest option, I would think that corr(method="spearman") could at least be made an alias for rank().corr(method="pearson"), or maybe a change could be made in algos.pyx.
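In the meantime, the workaround could be wrapped up along these lines. This is only a sketch: `spearman_via_rank` is a hypothetical helper name, not a pandas function. It also assumes that ranking whole columns matches the built-in result only when the frame has no missing values, so it falls back to the built-in method when NaNs are present.

```python
import numpy as np
import pandas as pd


def spearman_via_rank(df: pd.DataFrame) -> pd.DataFrame:
    """Spearman correlation computed as the Pearson correlation of column ranks.

    Hypothetical helper, not part of pandas. Assumption: with missing values,
    ranking each full column is not equivalent to ranking pairwise-complete
    observations, so the built-in method is used in that case.
    """
    if df.isna().any().any():
        # Slow path: defer to pandas' own Spearman implementation.
        return df.corr(method="spearman")
    # Fast path: rank every column once (ties get average ranks by default),
    # then compute an ordinary Pearson correlation on the ranks.
    return df.rank().corr(method="pearson")


# Small usage example on random data without missing values.
rng = np.random.RandomState(1234)
small = pd.DataFrame(rng.random_sample((200, 5)))
result = spearman_via_rank(small)
```

For frames without NaNs this should give the same matrix as corr(method="spearman"), which is what the assert_frame_equal check in the report shows for the 5000 x 100 example.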

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.0
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 41.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.5.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.0
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None