Correlation inconsistencies between Series and DataFrame
Sample Code
import pandas as pd
import numpy as np

df = pd.DataFrame(data={'a': [-0.04096, -0.04096, -0.04096, -0.04096, -0.04096],
                        'b': [1., 2., 3., 4., 5.],
                        'c': [0.053646, 0.053646, 0.053646, 0.053646, 0.053646]},
                  dtype=np.float64)
corr_df = df.corr()

s_a = pd.Series(data=[-0.04096, -0.04096, -0.04096, -0.04096, -0.04096],
                dtype=np.float64, name='a')
s_b = pd.Series(data=[1., 2., 3., 4., 5.], index=[1, 2, 3, 4, 5],
                dtype=np.float64, name='b')
s_c = pd.Series(data=[0.053646, 0.053646, 0.053646, 0.053646, 0.053646],
                dtype=np.float64, name='c')

# Trying to rebuild the correlation matrix from above with the pandas.Series version.
# np.nan is used because correlation with the same Series does not work.
corr_series_new = pd.DataFrame(
    {'a': [np.nan, s_a.corr(s_b), s_a.corr(s_c)],
     'b': [s_b.corr(s_a), np.nan, s_b.corr(s_c)],
     'c': [s_c.corr(s_a), s_c.corr(s_b), np.nan]}
)

corr_series_old = pd.DataFrame(
    {'a': [np.nan, df['a'].corr(df['b']), df['a'].corr(df['c'])],
     'b': [df['b'].corr(df['a']), np.nan, df['b'].corr(df['c'])],
     'c': [df['c'].corr(df['a']), df['c'].corr(df['b']), np.nan]}
)
Problem description
1
For some reason pandas.DataFrame.corr() and pandas.Series.corr(other) behave differently. In general, the correlation between two Series is undefined when one of them is constant, as s_a and s_c are, because the denominator of the correlation coefficient evaluates to zero and the computation becomes a division by zero. Nevertheless, the correlation function defined on DataFrame returns numeric values for some of these pairs, as the following result shows:
corr_df
     a    b    c
a  NaN  NaN  NaN
b  NaN  1.0  0.0
c  NaN  0.0  1.0
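To make the zero denominator concrete, here is a minimal sketch of the Pearson coefficient written out term by term. The helper `pearson` is a hypothetical name used for illustration only; it is not how pandas implements `corr`. The constant input uses the value 2.0, which is chosen (as an assumption of this sketch) so that the deviations from the mean are exactly zero.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation spelled out (illustrative, not pandas' implementation)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    dx = x - x.mean()                      # deviations from the mean
    dy = y - y.mean()
    numerator = np.sum(dx * dy)
    denominator = np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))
    # A constant input makes the denominator zero; silence the warning and
    # let the floating-point division produce nan.
    with np.errstate(divide="ignore", invalid="ignore"):
        return numerator / denominator

varying = [1.0, 2.0, 3.0, 4.0, 5.0]
constant = [2.0, 2.0, 2.0, 2.0, 2.0]

r_const = pearson(constant, varying)       # 0 / 0, i.e. undefined
r_self = pearson(varying, varying)         # perfectly correlated with itself
```

With a constant such as 0.053646, the computed mean may differ from the value by a rounding error, in which case the denominator is tiny but not zero. That could be one reason a finite number shows up in the matrix above, though this report does not confirm it.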
2
These results also do not match the ones obtained from the individual Series, although I would expect them to(?). Note that I have explicitly put NaNs on the diagonal, since e.g. s_b.corr(s_b) raises an error.
corr_series_new
     a    b    c
0  NaN  NaN  NaN
1  NaN  NaN  NaN
2  NaN  NaN  NaN
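One detail in the sample code may matter here: s_b is created with index=[1, 2, 3, 4, 5], while s_a and s_c get the default index 0..4. As far as I understand it, Series.corr aligns both operands on their index before computing anything, so only the shared labels are paired. The sketch below only demonstrates the alignment step; the variable names `shared_labels`, `n_pairs` and `n_rows_df` are illustrative.

```python
import numpy as np
import pandas as pd

s_a = pd.Series([-0.04096] * 5, dtype=np.float64, name='a')      # index 0..4
s_b = pd.Series([1., 2., 3., 4., 5.], index=[1, 2, 3, 4, 5],
                dtype=np.float64, name='b')                      # index 1..5

# Inner alignment keeps only the labels present in both Series.
a_aligned, b_aligned = s_a.align(s_b, join='inner')
shared_labels = list(a_aligned.index)
n_pairs = len(a_aligned)

# Inside a DataFrame every column shares one index, so all rows are paired.
df = pd.DataFrame({'a': s_a.values, 'b': s_b.values})
n_rows_df = len(df)
```

So the newly created Series are correlated over four observations, whereas the DataFrame columns are correlated over five. The two computations are therefore not working on the same data.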
3
Another problem is that using the columns of the existing DataFrame instead of the newly created Series gives different results.
corr_series_old
     a    b    c
0  NaN  NaN  NaN
1  NaN  NaN  0.0
2  NaN  0.0  NaN
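Until the behavior is consistent, a possible workaround is to check for constant inputs explicitly and return NaN for them, instead of relying on the division to produce it. This is only a sketch: `is_constant` and `safe_corr` are hypothetical helpers, not part of pandas, and the constancy check looks at each Series separately rather than pairwise.

```python
import numpy as np
import pandas as pd

def is_constant(s):
    """True if the Series has no varying values (NaNs ignored)."""
    values = s.dropna().values
    # max - min is exactly zero when all values are identical
    return values.size == 0 or np.ptp(values) == 0.0

def safe_corr(x, y):
    """Series.corr, but NaN whenever one operand is constant."""
    x, y = x.align(y, join='inner')        # pair values on shared labels
    if is_constant(x) or is_constant(y):
        return np.nan
    return x.corr(y)

df = pd.DataFrame({'a': [-0.04096] * 5,
                   'b': [1., 2., 3., 4., 5.],
                   'c': [0.053646] * 5}, dtype=np.float64)

r_bc = safe_corr(df['b'], df['c'])         # 'c' is constant
r_ab = safe_corr(df['a'], df['b'])         # 'a' is constant
r_bb = safe_corr(df['b'], df['b'])         # varying column with itself
```

Comparing the maximum and minimum avoids the mean entirely, so the result does not depend on whether the mean of a constant column is rounded.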
I hope I did not miss anything.
Expected Output
Both methods in Series and DataFrame should produce the same output.
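The expectation can be phrased as a small check, sketched below. `pairwise_series_corr` is a hypothetical helper that rebuilds the matrix from Series.corr calls and labels the rows with the column names, so the result is directly comparable to DataFrame.corr(). The data in `check` is made up for illustration and deliberately contains no constant column.

```python
import numpy as np
import pandas as pd

def pairwise_series_corr(frame):
    """Correlation matrix built from pairwise Series.corr calls."""
    cols = list(frame.columns)
    out = pd.DataFrame(np.nan, index=cols, columns=cols)
    for i in cols:
        for j in cols:
            # The diagonal is computed too; skip it if that fails on a given version.
            out.loc[i, j] = frame[i].corr(frame[j])
    return out

check = pd.DataFrame({'x': [1., 2., 3., 4., 5.],
                      'y': [2., 1., 4., 3., 5.],
                      'z': [5., 3., 4., 1., 2.]})

from_series = pairwise_series_corr(check)
from_frame = check.corr()

# NaNs compare as equal so that undefined entries count as agreement.
same = np.allclose(from_series.values, from_frame.values, equal_nan=True)
```

Running the same comparison on the df from the sample code is where the two approaches are reported to disagree.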
Output of pd.show_versions()
pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-39-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.4.0
Cython: None
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None