Groupby and shift causing coredump · Issue #13813 · pandas-dev/pandas
Hi,
I have reduced this to the smallest example that exhibits the issue, so it is not a useful example in itself. The problem is that the operation causes Python to core dump.
In the original case where I discovered this, the core dump occurred only intermittently (and when I put the code in a loop, it occurred on different iterations). This reduced example, however, has core dumped on the third iteration every time I have run it.
Code example:
import os
import pandas as pd
df = pd.read_csv(os.path.join(os.getcwd(), 'error_report.txt'), sep='\t')
for i in range(0, 1000):
    print "Pre shift {}".format(i)
    df['shift_F'] = df.groupby(['B', 'C'])['F'].shift(-1)
    print "Post shift {}".format(i)
The attached data file (tab-separated) is below; the code assumes it is in the current working directory:
error_report.txt
This is the output I get:
python pandas_test.py
Pre shift 0
Post shift 0
Pre shift 1
Post shift 1
Pre shift 2
Segmentation fault (core dumped)
If I modify the code to use apply and add the shifted column inside the applied function, there is no error. Similarly, if I use .shift(0), I do not get the error.
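For readers who want to try the apply-based variant, a minimal sketch might look like the following. The small frame is made-up stand-in data (the attached error_report.txt is not reproduced here), and the helper name add_shift is chosen purely for illustration. Selecting [['F']] and passing group_keys=False are assumptions made to keep the result aligned with the original index; the exact workaround used in the report may have differed.

```python
import pandas as pd

# Hypothetical stand-in data; only the column names B, C and F come from the report.
df = pd.DataFrame({
    "B": ["x", "x", "x", "y", "y"],
    "C": [1, 1, 1, 2, 2],
    "F": [10.0, 20.0, 30.0, 40.0, 50.0],
})


def add_shift(group):
    # Copy first so the frame handed in by groupby is not mutated.
    group = group.copy()
    # Shift within this single group: each row gets the next row's F value.
    group["shift_F"] = group["F"].shift(-1)
    return group


# group_keys=False keeps the original row index instead of prepending group keys.
shifted = df.groupby(["B", "C"], group_keys=False)[["F"]].apply(add_shift)

# Assignment aligns on the index, so row order inside `shifted` does not matter.
df["shift_F"] = shifted["shift_F"]
print(df)
```

The last row of each (B, C) group has no following row, so its shift_F is NaN, which is the same result the direct groupby(...)['F'].shift(-1) call is expected to give.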
Version info:
>>> pandas.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-24-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
pandas: 0.18.1
nose: 1.3.1
pip: 8.1.2
setuptools: 20.2.2
Cython: None
numpy: 1.11.1
scipy: 0.13.3
statsmodels: 0.5.0
xarray: None
IPython: 1.2.1
sphinx: None
patsy: 0.2.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.6.1
matplotlib: 1.3.1
openpyxl: 1.7.0
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: 4.2.1
html5lib: 0.999
httplib2: 0.8
apiclient: None
sqlalchemy: None
pymysql: 0.7.2.None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None
Regards
Stephen