BUG: Incorrect behavior of window aggregation functions on disjoint windows skipping overflowing elements · Issue #45647 · pandas-dev/pandas (original) (raw)

Pandas version checks

Reproducible Example

import numpy as np import pandas as pd from pandas.core.indexers.objects import BaseIndexer f64max = 1.7976931348623158E+308 values = np.array([1, f64max, f64max, 3], dtype=np.float64) start = np.array([0, 0, 3, 3], dtype=np.int64) end = start + 1 class WinInd(BaseIndexer): def get_window_bounds(self, num_values: 'int' = 0, min_periods: 'int | None' = None, center: 'bool | None' = None, closed: 'str | None' = None) -> 'tuple[np.ndarray, np.ndarray, np.ndarray]': return start, end

pd.Series(values).rolling(WinInd()).sum() pd.Series(values).rolling(WinInd()).mean()

Issue Description

The example outputs

0    1.0
1    1.0
2    NaN
3    NaN
dtype: float64

for both sum and mean. The correct behavior would output 4.0 instead of NaN in the last two rows. This shows that at least these rolling window aggregations produce incorrect output on disjoint windows that skip elements whose sum overflows. This incorrect behavior originates in the window aggregation functions in pandas._libs.window.aggregations that process, rather than skip over, elements outside the disjoint windows; many of these functions have this problem. Here is code that shows this for sum and mean:

import numpy as np import pandas as pd import pandas._libs.window.aggregations as wa values = np.array([1, np.inf, 3], dtype=np.float64) start = np.array([0, 2], dtype=np.int64) end = start + 1 wa.roll_sum(values, start, end, 0) array([ 1., nan, nan]) wa.roll_mean(values, start, end, 0) array([ 1., nan, nan])

Because these window aggregation functions are not exposed to the user and are wrapped by defensive code within Series.rolling that handles np.inf values, the above example is more involved and induces an overflow to expose the behavior.

One issue where the need for handling disjoint windows occurs is in GH-15354 when the step size is larger than the window size, which is the use case described there. The current issue is a precursor for handling GH-15354.

Expected Behavior

The expected output is for both sum and mean is:

0    1.0
1    1.0
2    4.0
3    4.0
dtype: float64

Installed Versions

This issue is confirmed on the main branch.