BUG: Segmentation fault when doing pandas.core.window.rolling.RollingGroupBy.apply · Issue #36727 · pandas-dev/pandas



Code Sample, a copy-pastable example

import pandas as pd

df = pd.DataFrame(
    [
        ["A", "group_1", pd.Timestamp(2019, 1, 1, 9)],
        ["B", "group_1", pd.Timestamp(2019, 1, 2, 9)],
        ["C", "group_2", pd.Timestamp(2019, 1, 3, 9)],
        ["D", "group_1", pd.Timestamp(2019, 1, 6, 9)],
        ["E", "group_1", pd.Timestamp(2019, 1, 7, 9)],
        ["F", "group_1", pd.Timestamp(2019, 1, 10, 9)],
        ["G", "group_2", pd.Timestamp(2019, 1, 20, 9)],
        ["H", "group_1", pd.Timestamp(2019, 4, 8, 9)],
    ],
    columns=["index", "group", "eventTime"],
).set_index("index")

groups = df.groupby("group")
df["count_to_date"] = groups.cumcount()
rolling_groups = groups.rolling("10d", on="eventTime")
group_size = rolling_groups.apply(lambda df: df.shape[0])
print(group_size)

Problem description

The above code causes a segmentation fault inside pandas in versions after 1.0.5. Since I need this code for a project, I am restricted to pandas 1.0.5 until this is resolved. I am not sure what causes the segmentation fault, but all of the above circumstances are necessary to reproduce the bug (i.e. a DataFrame with a special index, a column added to the DataFrame after grouping, a rolling window on a group, etc.).

I have reproduced this bug on a variety of machines and operating systems.
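Because a segmentation fault kills the interpreter outright, it can help to run the reproducer in a child process and inspect its exit status. The following is only a sketch, not part of the original report: `run_isolated` and `describe` are hypothetical helper names, and the signal-based reading of the return code assumes a POSIX system (Linux or macOS).

```python
import signal
import subprocess
import sys
import textwrap


def run_isolated(source: str, timeout: float = 60.0) -> subprocess.CompletedProcess:
    """Run `source` in a fresh Python interpreter and return the finished process."""
    return subprocess.run(
        [sys.executable, "-c", textwrap.dedent(source)],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def describe(result: subprocess.CompletedProcess) -> str:
    """Summarize how the child process ended."""
    # On POSIX, a return code of -N means the child was killed by signal N.
    if result.returncode < 0:
        return f"killed by {signal.Signals(-result.returncode).name}"
    return f"exited with code {result.returncode}"
```

Passing the code sample above as `source` and printing `describe(...)` would be expected to report `killed by SIGSEGV` on an affected pandas version and `exited with code 0` on 1.0.5, without taking down the calling session.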

Expected Output

                        eventTime  count_to_date
group   index                                   
group_1 A     2019-01-01 09:00:00            1.0
        B     2019-01-02 09:00:00            2.0
        D     2019-01-06 09:00:00            3.0
        E     2019-01-07 09:00:00            4.0
        F     2019-01-10 09:00:00            5.0
        H     2019-04-08 09:00:00            1.0
group_2 C     2019-01-03 09:00:00            1.0
        G     2019-01-20 09:00:00            1.0

Note: This is indeed the output of versions 1.0.5 and prior.
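As a cross-check that does not depend on pandas, the expected counts can be recomputed with the standard library alone. This sketch assumes the default behavior for a time-based window, which is closed on the right: each row counts the events of its own group with timestamps in (t - 10 days, t]. The names `events` and `rolling_counts` are illustrative only.

```python
from datetime import datetime, timedelta

# Same rows as the code sample: (index label, group, eventTime).
events = [
    ("A", "group_1", datetime(2019, 1, 1, 9)),
    ("B", "group_1", datetime(2019, 1, 2, 9)),
    ("C", "group_2", datetime(2019, 1, 3, 9)),
    ("D", "group_1", datetime(2019, 1, 6, 9)),
    ("E", "group_1", datetime(2019, 1, 7, 9)),
    ("F", "group_1", datetime(2019, 1, 10, 9)),
    ("G", "group_2", datetime(2019, 1, 20, 9)),
    ("H", "group_1", datetime(2019, 4, 8, 9)),
]


def rolling_counts(rows, window=timedelta(days=10)):
    """Count, for each row, the same-group events in the window (t - window, t]."""
    counts = {}
    for label, group, t in rows:
        counts[label] = sum(
            1
            for _, other_group, other_t in rows
            if other_group == group and t - window < other_t <= t
        )
    return counts


counts = rolling_counts(events)
for label, group, t in sorted(events, key=lambda row: (row[1], row[2])):
    print(group, label, t, float(counts[label]))
```

The printed counts follow the same group-then-time order as the table above, so the last column can be compared row by row.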

Output of pd.show_versions()

This is just one configuration, but the bug has been reproduced on three different machines (both Linux and macOS), all exhibiting the same behavior.

INSTALLED VERSIONS

commit : 2a7d332
python : 3.7.3.final.0
python-bits : 64
OS : Darwin
OS-release : 17.7.0
Version : Darwin Kernel Version 17.7.0: Thu Jun 18 21:21:34 PDT 2020; root:xnu-4570.71.82.5~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.2
numpy : 1.18.2
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 40.8.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None