offset-based rolling window, multiple issues with closed='left' · Issue #26005 · pandas-dev/pandas
Code Sample
Case 1: single row
import pandas as pd

df1 = pd.DataFrame({'B': [0]}, index=[pd.Timestamp('20130101 09:00:00')])
df1.rolling('1s', closed='left').median()  # <- raises 'MemoryError: skiplist_init failed'
Case 2: multiple rows, but entries separated by a larger time than the specified window
df2 = pd.DataFrame({'B': [0, 1]},
                   index=[pd.Timestamp('20130101 09:00:00'),
                          pd.Timestamp('20130101 09:00:02')])
df2.rolling('1s', closed='left').median()  # <- raises 'MemoryError: skiplist_init failed'
df2.rolling('1s', closed='left').max()     # <- no error, but second entry seems incorrect
Case 3: as long as at least one row has other entries in its window, it runs without an exception but the values are suspect
df3 = pd.DataFrame({'B': [1, 2, 3]},
                   index=[pd.Timestamp('20130101 09:00:00'),
                          pd.Timestamp('20130101 09:00:02'),
                          pd.Timestamp('20130101 09:00:03')])
df3.rolling('1s', closed='left').median()  # <- no exception, but the values seem incorrect
df3.rolling('1s', closed='left').max()     # ditto
df3.rolling('2s', closed='left').median()  # ditto (note longer window)
Problem description
The cases that raise an exception are the most serious problem and should be addressed. The other cases laid out here give unexpected results that are inconsistent with other aggregations (such as mean and sum), which do seem to operate correctly. Note that using closed='right' or closed='both' gives results consistent with my expectations, while closed='neither' yields problems similar to those with closed='left'. (So the common factor appears to be whether or not each input row is included in its own rolling window.)
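To make the "included in its own window" observation concrete, here is a minimal pure-Python sketch of which rows fall inside each row's window for the four `closed` options, using the Case 3 timestamps. It assumes the documented interval meaning of `closed` (for a row at time t and window width w: 'right' is (t - w, t], 'both' is [t - w, t], 'left' is [t - w, t), 'neither' is (t - w, t)). The helper name `window_rows` is hypothetical and this is not how pandas computes window bounds internally.

```python
from datetime import datetime, timedelta


def window_rows(times, i, width, closed="right"):
    """Return the positions of the rows inside the window ending at times[i].

    Hypothetical helper for illustration only; it is not part of pandas.
    """
    end = times[i]
    start = end - width
    include_start = closed in ("left", "both")
    include_end = closed in ("right", "both")
    rows = []
    for j, t in enumerate(times):
        after_start = t >= start if include_start else t > start
        before_end = t <= end if include_end else t < end
        if after_start and before_end:
            rows.append(j)
    return rows


# Timestamps from Case 3 (df3).
times = [
    datetime(2013, 1, 1, 9, 0, 0),
    datetime(2013, 1, 1, 9, 0, 2),
    datetime(2013, 1, 1, 9, 0, 3),
]

for closed in ("right", "both", "left", "neither"):
    members = [
        window_rows(times, i, timedelta(seconds=1), closed)
        for i in range(len(times))
    ]
    print(closed, members)
# right   [[0], [1], [2]]
# both    [[0], [1], [1, 2]]
# left    [[], [], [1]]
# neither [[], [], []]
```

With 'right' and 'both' every window contains at least its own row, so no window is ever empty. With 'left' and 'neither' the row itself is excluded, which is where empty windows appear.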
Expected Output
Case 1: since there are no other entries in the input row's window, I would expect the median aggregation to return NaN. (This would be consistent with mean, max, etc. for this case.)
B
2013-01-01 09:00:00 NaN
Case 2: since neither input row should have any other entries in its window, I would expect the median and max results to all be NaN. (This would be consistent with what the mean aggregation returns for this case.)
B
2013-01-01 09:00:00 NaN
2013-01-01 09:00:02 NaN
Case 3a and 3b (1s window): since neither of the first two input rows should have any other entries in its window, I would expect their median and max results to all be NaN. Since the last row does have an entry in its window (the second row), I would expect both the median and max to be 2.0. (This would be consistent with what the mean aggregation returns for this case.)
B
2013-01-01 09:00:00 NaN
2013-01-01 09:00:02 NaN
2013-01-01 09:00:03 2.0
Case 3c (2s window): since the first row should have no entries in its window, I would expect the first output row to be NaN. The second row will have the first entry in its window, so I would expect its output to be 1.0. Similarly, the last row will have the second entry in its window, so I would expect its output to be 2.0.
B
2013-01-01 09:00:00 NaN
2013-01-01 09:00:02 NaN
2013-01-01 09:00:03 2.0
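The expected values above can be reproduced by hand with a small pure-Python reference, sketched here under the assumption that closed='left' means the half-open window [t - w, t) and that an empty window should aggregate to NaN. The function name `rolling_left` is hypothetical; it only mirrors the reasoning in this report and does not call pandas.

```python
import math
from datetime import datetime, timedelta
from statistics import median


def rolling_left(times, values, width, agg):
    """Aggregate the values in [t - width, t) for each row (hypothetical helper).

    Returns NaN for a row whose window contains no entries.
    """
    out = []
    for t in times:
        window = [v for s, v in zip(times, values) if t - width <= s < t]
        out.append(float(agg(window)) if window else math.nan)
    return out


t0 = datetime(2013, 1, 1, 9, 0, 0)
one_s, two_s = timedelta(seconds=1), timedelta(seconds=2)

# Case 1: single row, so its window is empty.
case1 = rolling_left([t0], [0], one_s, median)

# Case 2: rows two seconds apart with a one-second window.
case2_times = [t0, t0 + two_s]
case2_median = rolling_left(case2_times, [0, 1], one_s, median)
case2_max = rolling_left(case2_times, [0, 1], one_s, max)

# Case 3: rows at 09:00:00, 09:00:02 and 09:00:03.
case3_times = [t0, t0 + two_s, t0 + timedelta(seconds=3)]
case3a = rolling_left(case3_times, [1, 2, 3], one_s, median)
case3b = rolling_left(case3_times, [1, 2, 3], one_s, max)
case3c = rolling_left(case3_times, [1, 2, 3], two_s, median)

print(case1)   # [nan]
print(case2_median, case2_max)  # [nan, nan] [nan, nan]
print(case3a)  # [nan, nan, 2.0]
print(case3b)  # [nan, nan, 2.0]
print(case3c)  # [nan, 1.0, 2.0]
```

This scans every row for every window, so it is only suitable as a check on tiny inputs like the ones in this report.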
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-17134-Microsoft
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.24.2
pytest: None
pip: 19.0.1
setuptools: 40.6.3
Cython: None
numpy: 1.16.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.3.0
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None