BUG: Series.corr/cov raising with masked dtype by lukemanley · Pull Request #51422 · pandas-dev/pandas

what happens here if there are NAs?

Both left and right are filtered for notna here:

valid = notna(a) & notna(b)

so this works:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: ser1 = pd.Series(np.random.randn(100), dtype="Float64")

In [4]: ser2 = pd.Series(np.random.randn(100), dtype="Float64")

In [5]: ser1[1] = pd.NA

In [6]: ser2[5:7] = pd.NA

In [7]: ser1.corr(ser2)
Out[7]: 0.09774253881093414

Cool. What about if there is a NaN?

Do you mean a NaN within a masked array? For example:

In [8]: ser1[1] = 0.0

In [9]: ser1 /= (ser1 != 0)

In [10]: ser1
Out[10]: 
0     0.148073
1          NaN
2     0.556972
3    -0.554886
4     1.216938
        ...   
95     0.33919
96    0.528683
97    1.590215
98     0.84015
99    0.333666
Length: 100, dtype: Float64

In [11]: ser1.corr(ser2)
Out[11]: 0.09774253881093414

It works either way, since notna is applied to the ndarray, which captures both np.nan and pd.NA.
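A small sketch of that point, assuming the masked values are converted with `to_numpy(dtype="float64", na_value=np.nan)` before the check (the thread only states that notna runs on the ndarray; the exact conversion call here is an assumption):

```python
import numpy as np
import pandas as pd

ser = pd.Series([0.0, 1.0, pd.NA, 3.0], dtype="Float64")

# Same trick as above: 0.0 / False produces a missing entry at position 0,
# while position 2 stays pd.NA.
ser = ser / (ser != 0)

# Assumed conversion: every missing entry ends up as np.nan in the ndarray.
arr = ser.to_numpy(dtype="float64", na_value=np.nan)

# notna on the ndarray therefore drops both kinds of missing value.
valid = pd.notna(arr)
```

Here `valid` is `False` at positions 0 and 2 and `True` elsewhere, so both entries are excluded from the correlation.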