Unexpected segmentation fault in pd.read_csv C-engine · Issue #13703 · pandas-dev/pandas
Dear developers,
I am using pandas in an application that processes large CSV files (around 1 GB each) with approximately 800k records and 400+ columns of mixed type, which is why I decided to use the iterator functionality of pd.read_csv(). When experimenting with chunksize, my application seems to crash somewhere inside a TextReader__string_convert call.
Here is an archive with a sample CSV data file that seems to cause the crash (it also includes crash dump reports, a copy of the example, and a snapshot of the versions of the installed Python packages).
read_csv_crash.tar.gz
Code Sample
To run this example, you first need to extract dataset.csv from the supplied archive.
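If it helps, the file can also be extracted programmatically with the standard tarfile module; note that the member name and its location at the archive root are assumptions here:

import tarfile

# Assumes dataset.csv sits at the root of the archive under exactly that name.
with tarfile.open("read_csv_crash.tar.gz", "r:gz") as archive:
    archive.extract("dataset.csv")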
import pandas as pd

for n_lines in range(82, 87):
    filelike = open("dataset.csv", "r")
    # Read the file with the C engine in chunks of n_lines rows.
    iterator_ = pd.read_csv(filelike, header=None, engine="c", dtype=object,
                            chunksize=n_lines)
    for chunk_ in iterator_:
        print n_lines, chunk_.iloc[0, 0], chunk_.iloc[-1, 0]
    filelike.close()
Please note that the crash does not seem to occur when the file is smaller than 260 KiB. Also note that playing with the low_memory setting did not alleviate the problem; a sketch of that experiment follows.
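The low_memory experiment looked roughly like the following; this is a sketch reconstructed from the description above, not the exact code that was run, and low_memory is just the standard pd.read_csv option for the C engine:

import pandas as pd

# Sketch only: toggling low_memory did not prevent the crash in either case.
for low_memory in (True, False):
    for n_lines in range(82, 87):
        filelike = open("dataset.csv", "r")
        iterator_ = pd.read_csv(filelike, header=None, engine="c", dtype=object,
                                chunksize=n_lines, low_memory=low_memory)
        for chunk_ in iterator_:
            print low_memory, n_lines, chunk_.iloc[0, 0], chunk_.iloc[-1, 0]
        filelike.close()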
Expected Output
The code sample actually produces the following output:
82 9999-9 9999-9
82 9999-9 9999-9
82 9999-9 9999-9
83 9999-9 9999-9
83 9999-9 9999-9
83 9999-9 9999-9
84 9999-9 9999-9
84 9999-9 9999-9
Segmentation fault: 11
output of pd.show_versions()
The output of this call is attached to this issue.
pd_show_versions.txt
Python interpreter banner
Python 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015, 09:33:12)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
OS X version
OS Version: Mac OS X 10.11.5 (15F34)
Model: Macmini6,2, BootROM MM61.0106.B0A, 4 processors, Intel Core i7, 2,6 GHz, 16 GB, SMC 2.8f