BUG: Cythonized kendall correlation implementation is incorrect · Issue #43401 · pandas-dev/pandas (original) (raw)


Problem description

The Cythonized implementation of Kendall correlation implemented here has been completely broken in the case of tied values since its original commit.

Scipy (and most other stats pacakges) implement the b variant of Kendall's tau, not the original version. Rank based correlation algorithms absolutely need to handle ties correctly as ordinal data often includes repeated values in some columns.

Even if we wanted to implement the "a variant" of Kendall's tau (which would be a breaking change from the implementation prior to the Cythonization diff), the correct code would disregard ties entirely, whereas the current implementation is including them in its calculations.

It's non-trivial to implement a Cythonized version of the correct code with tie-handling here, and I think that the Cythonized version should be reverted.

Code Sample, a copy-pastable example

df = pd.DataFrame([[1, 2], [2, 1], [1, 3]])
df[0].corr(df[1],method='kendall') == df.corr(method='kendall')[0][1]
# output
False

Expected Output

True

Change to test case that catches this bug

zrait@94a0eea is a simple change to the correlation test case that ensures that one of the columns used has repeated values and fails with the current implementation but not if the Cythonization is reverted.

More generally, I think the float test data should be updated to include duplicates, as a danger of randomly generated float data is the fact that it won't naturally include any duplicates.