Unexpected behaviour of Series.rolling.corr when using a Timedelta based window.
Code Sample, a copy-pastable example if possible
import numpy as np
import pandas as pd


def f(x, y, window):
    # Built-in rolling correlation.
    return x.rolling(window).corr(y)


def g(x, y, window):
    # Same correlation, computed window by window through apply.
    return x.rolling(window).apply(lambda x: x.corr(y), raw=False)


if __name__ == "__main__":
    N = 20
    n = 10

    # Case 1: integer index, integer window.
    x = pd.Series(np.random.randn(N))
    y = 1.0 * x
    x[0:n] = 0.
    window = 4

    print(f(x, y, window)[n + window - 1 :])
    # 13    1.0
    # 14    1.0
    # 15    1.0
    # 16    1.0
    # 17    1.0
    # 18    1.0
    # 19    1.0
    # dtype: float64

    print(g(x, y, window)[n + window - 1 :])
    # 13    1.0
    # 14    1.0
    # 15    1.0
    # 16    1.0
    # 17    1.0
    # 18    1.0
    # 19    1.0

    # Case 2: daily DatetimeIndex, integer window.
    index = pd.date_range("2001-01-01", freq="D", periods=N)
    x = pd.Series(np.random.randn(N), index=index)
    y = 2.0 * x
    x[0:n] = 0.

    print(f(x, y, window)[n + window - 1 :])
    # 2001-01-14    1.0
    # 2001-01-15    1.0
    # 2001-01-16    1.0
    # 2001-01-17    1.0
    # 2001-01-18    1.0
    # 2001-01-19    1.0
    # 2001-01-20    1.0
    # Freq: D, dtype: float64

    print(g(x, y, window)[n + window - 1 :])
    # 2001-01-14    1.0
    # 2001-01-15    1.0
    # 2001-01-16    1.0
    # 2001-01-17    1.0
    # 2001-01-18    1.0
    # 2001-01-19    1.0
    # 2001-01-20    1.0
    # Freq: D, dtype: float64

    # Case 3: daily DatetimeIndex, Timedelta window. Here f and g disagree.
    dt_window = pd.to_timedelta("4D")

    print(f(x, y, dt_window)[n + window - 1 :])
    # 2001-01-14    0.354308
    # 2001-01-15    0.373106
    # 2001-01-16    0.372752
    # 2001-01-17    0.380531
    # 2001-01-18    0.380298
    # 2001-01-19    0.386142
    # 2001-01-20    0.410147
    # Freq: D, dtype: float64

    print(g(x, y, dt_window)[n + window - 1 :])
    # 2001-01-14    1.0
    # 2001-01-15    1.0
    # 2001-01-16    1.0
    # 2001-01-17    1.0
    # 2001-01-18    1.0
    # 2001-01-19    1.0
    # 2001-01-20    1.0
    # Freq: D, dtype: float64
Problem description
Both functions f and g should return the same values for entries 13 to 19 of the resulting series.
Currently, when window = Timedelta(days=4), f does not return the correlation between the values of x and y, which should be 1.0 for entries 13 to 19 of the result.
Values computed on a DataFrame are affected as well, e.g.
df = pd.DataFrame({"x": x, "y": y})
df.rolling(dt_window).corr()
also returns unexpected values for the cross-correlation.
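To inspect the DataFrame case, the cross-correlation has to be pulled out of the pairwise result, which is indexed by (timestamp, column name). The sketch below is one way to do that. The seeded generator and the names `pairwise`, `cross` and `diag` are assumptions made for illustration and are not part of the original report.

```python
import numpy as np
import pandas as pd

# Seeded data so the run is reproducible (the seed is an arbitrary choice).
rng = np.random.default_rng(0)
N, n, window = 20, 10, 4
index = pd.date_range("2001-01-01", freq="D", periods=N)
x = pd.Series(rng.standard_normal(N), index=index)
y = 2.0 * x
x.iloc[:n] = 0.0

df = pd.DataFrame({"x": x, "y": y})
# Pairwise rolling correlation; the result has a (timestamp, column) MultiIndex.
pairwise = df.rolling(pd.to_timedelta("4D")).corr()

# Rows labelled "y", column "x": the rolling cross-correlation corr(y, x).
cross = pairwise.xs("y", level=1)["x"]
# Rows labelled "x", column "x": the diagonal, i.e. corr(x, x).
diag = pairwise.xs("x", level=1)["x"]

print(cross.iloc[n + window - 1 :])
```

On an affected version, the printed cross-correlation would be expected to deviate from 1.0 for the last seven entries, in the same way as the Series example above.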
If .corr is replaced with .cov in f and g, both functions return identical results. The discrepancy is therefore likely caused by a difference in the normalisation applied in the correlation computation between f and g.
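One way to probe the normalisation hypothesis is to rebuild the correlation from pieces that are all computed over the same window: rolling covariance divided by the product of the two rolling standard deviations. The helper `corr_from_cov` below is a hypothetical name introduced here. It is a reference computation under that assumption, not a description of how pandas implements `.corr` internally.

```python
import numpy as np
import pandas as pd


def corr_from_cov(x, y, window):
    """Rolling correlation rebuilt as cov / (std_x * std_y), all over the same window."""
    cov = x.rolling(window).cov(y)
    std_x = x.rolling(window).std()
    std_y = y.rolling(window).std()
    # Windows with zero variance give 0 / 0 and therefore NaN.
    return cov / (std_x * std_y)


# Seeded data so the run is reproducible (the seed is an arbitrary choice).
rng = np.random.default_rng(0)
N, n, window = 20, 10, 4
index = pd.date_range("2001-01-01", freq="D", periods=N)
x = pd.Series(rng.standard_normal(N), index=index)
y = 2.0 * x
x.iloc[:n] = 0.0

dt_window = pd.to_timedelta("4D")
reference = corr_from_cov(x, y, dt_window).iloc[n + window - 1 :]
print(reference)
```

If this reference agrees with g but not with f, that would support the idea that the built-in `.corr` normalises with quantities taken over a different window than the covariance.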
Expected Output
Output of pd.show_versions()
pd.show_versions()
INSTALLED VERSIONS
commit : None
pandas : 0.25.3
numpy : 1.15.4
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 41.6.0.post20191030
Cython : 0.29.6
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : 4.8.1
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 2.2.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.1.0
sqlalchemy : None
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None