BUG: Inconsistent correlation between constant series (varies with number of rows)


Code Sample, a copy-pastable example

import pandas as pd

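# Build a frame of identical rows (two constant columns) and print its correlation matrix
for length in [2, 3, 5, 10, 20]:
    df = pd.DataFrame(length * [[0.42, 0.1]], columns=["A", "B"])
    print(df.corr())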

gives

    A   B
A NaN NaN
B NaN NaN
    A    B
A NaN  NaN
B NaN  1.0
     A   B
A  1.0 NaN
B  NaN NaN
     A    B
A  1.0 -1.0
B -1.0  1.0
     A    B
A  1.0  1.0
B  1.0  1.0

Problem description

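The output is inconsistent: it changes when the number of rows varies slightly. I would expect the correlation between two series to be NaN whenever at least one of them is constant, since a constant series has zero variance and the Pearson coefficient then reduces to 0/0.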
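To illustrate why NaN is the expected result, here is a minimal textbook Pearson sketch in plain Python. It is not pandas' actual implementation; `naive_pearson` is a hypothetical helper written for this report. It also hints at one plausible (unverified) source of the spurious 1.0 / -1.0 values: when the constant is not exactly representable in binary floating point, the computed mean can differ from the constant by a rounding error, so the "zero" variance may come out as a tiny non-zero number.

```python
import math

def naive_pearson(x, y):
    """Textbook Pearson r; returns NaN when either input has exactly zero variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    sxx = sum(d * d for d in dx)  # proportional to the variance of x
    syy = sum(d * d for d in dy)  # proportional to the variance of y
    sxy = sum(a * b for a, b in zip(dx, dy))
    if sxx == 0.0 or syy == 0.0:
        return math.nan  # 0/0: the correlation is undefined
    return sxy / math.sqrt(sxx * syy)

# 0.5 is exactly representable, so the deviations are exactly zero.
print(naive_pearson([0.5] * 4, [1.0, 2.0, 3.0, 4.0]))

# 0.1 and 0.42 are not exactly representable; rounding in the mean may
# leave tiny non-zero deviations, so the result is not guaranteed to be NaN.
print(naive_pearson([0.1] * 10, [0.42] * 10))
```

The first call hits the zero-variance branch and returns NaN. The second call depends on accumulated rounding error, which is the kind of length-dependent behaviour reported above.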

This makes code that relies on, for example, calling dropna() on the result of corr() difficult to write and error prone, because constant columns are not reliably reported as NaN.
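Until the behaviour is consistent, a possible workaround is to mask constant columns explicitly after calling corr(). The sketch below is only an assumption about what downstream code might do; `corr_constant_as_nan` is a hypothetical helper, not part of pandas.

```python
import numpy as np
import pandas as pd

def corr_constant_as_nan(df):
    """Return df.corr() with every entry involving a constant column set to NaN."""
    corr = df.corr()
    # A column with at most one distinct non-null value has zero variance.
    constant = [c for c in corr.columns if df[c].nunique(dropna=True) <= 1]
    if constant:
        corr.loc[constant, :] = np.nan
        corr.loc[:, constant] = np.nan
    return corr

for length in [2, 3, 5, 10, 20]:
    df = pd.DataFrame(length * [[0.42, 0.1]], columns=["A", "B"])
    print(corr_constant_as_nan(df))
```

With this wrapper, a subsequent dropna() sees NaN for constant columns regardless of the number of rows.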

Expected Output

Either consistent NaN output when calculating the correlation with constant data, or a warning in the pandas.DataFrame.corr documentation stating that the correlation returned for constant series can be any of 1.0, -1.0, or NaN.