ENH: parallelize DataFrame.corr · Issue #40956 · pandas-dev/pandas

DataFrame.corr(method="spearman") is extremely slow.
method="pearson" is quite slow too.
My machine's resource monitor shows that the implementation is single-threaded. Is that a design choice? If so, there should at least be an optional argument to parallelize it (at C++ level, of course).
I have not checked the code that implements this method.

Describe the solution you'd like

scipy.stats.spearmanr performs this computation on a numpy array in 1/20 of the time on my 6-core machine.
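For context, Spearman correlation is by definition the Pearson correlation of the ranked data, so the whole matrix can be computed as one ranking pass followed by a single vectorized call. The sketch below illustrates that idea only; the helper name `spearman_via_ranks` is hypothetical, and it assumes the frame contains no missing values (DataFrame.corr handles NaN pairwise, which this sketch does not).

```python
import numpy as np
import pandas as pd


def spearman_via_ranks(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: Spearman matrix as Pearson correlation of ranks.

    Assumes df has no missing values.
    """
    # Rank each column independently; ties receive the average rank.
    ranks = df.rank(method="average").to_numpy()
    # Pearson correlation between columns of the rank matrix.
    rho = np.corrcoef(ranks, rowvar=False)
    return pd.DataFrame(rho, index=df.columns, columns=df.columns)
```

Whether this runs on several cores depends on the linear algebra backend that numpy was built against, so any speed-up should be measured rather than assumed.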

API breaking implications

None.

Describe alternatives you've considered

Add an optional argument (e.g. parallelize=True or parallelize=False) so that the user can opt in.
The method would then either have to be reimplemented from scratch at C++ level, or call the existing scipy.stats function
on DataFrame.values and wrap the returned array in a new DataFrame.
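A minimal sketch of the second alternative, wrapping the scipy result in a DataFrame, might look like this. The name `spearman_corr_via_scipy` is hypothetical and not part of pandas. It assumes all columns are numeric with no missing values, and it special-cases two columns because spearmanr returns a scalar rather than a matrix in that case.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr


def spearman_corr_via_scipy(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: Spearman matrix computed by scipy.stats.spearmanr.

    Assumes numeric columns without NaN and at least two columns.
    """
    values = df.to_numpy(dtype=float)
    if values.shape[1] < 2:
        raise ValueError("need at least two columns")
    # Columns are treated as variables (axis=0 is the default).
    rho, _ = spearmanr(values)
    if np.ndim(rho) == 0:
        # With exactly two columns scipy returns a scalar; expand it to 2x2.
        rho = np.array([[1.0, rho], [rho, 1.0]])
    return pd.DataFrame(rho, index=df.columns, columns=df.columns)
```

Usage would be `spearman_corr_via_scipy(df)` in place of `df.corr(method="spearman")`; the result keeps the original column labels on both axes.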

Additional context

IMPORTANT: DataFrame.corr and spearmanr give slightly different results (a small rounding difference on the order of 1e-15)

import numpy as np
import pandas as pd
from scipy.stats import spearmanr

df = pd.DataFrame(np.random.rand(1000, 2000))
pd_corr = df.corr(method='spearman')  # a few seconds
scipy_corr, p_value = spearmanr(df.values)  # <1 sec

# Exact equality fails because of floating-point rounding
np.array_equal(pd_corr.values, scipy_corr)  # False
# Number of entries that differ by more than 1e-15
np.sum(np.abs(pd_corr.values - scipy_corr) > 1e-15)  # 0
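Since the two results differ only by rounding, a tolerance-based comparison is probably the more useful check. The following is a small sketch on a reduced problem size; the seed, the shape, and the 1e-12 tolerance are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# Smaller frame than in the report so the check runs quickly.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((200, 50)))

pd_corr = df.corr(method="spearman").to_numpy()
scipy_corr, _ = spearmanr(df.to_numpy())

# Largest element-wise discrepancy between the two matrices.
max_abs_diff = np.max(np.abs(pd_corr - scipy_corr))
# Absolute-tolerance comparison instead of exact equality.
matches = np.allclose(pd_corr, scipy_corr, rtol=0.0, atol=1e-12)
```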