Unexpected segmentation fault in pd.read_csv C-engine · Issue #13703 · pandas-dev/pandas
Dear developers,
I am using pandas in an application that processes large CSV files (around 1 GB each) with approximately 800k records and 400+ columns of mixed type, which is why I decided to use the iterator functionality of pd.read_csv(). When experimenting with chunksize, my application seems to crash somewhere inside a TextReader__string_convert call.
Here is an archive with a sample CSV data file that seems to cause the crash (it also includes crash dump reports, a copy of the example, and a snapshot of the versions of the installed Python packages).
read_csv_crash.tar.gz
Code Sample
To run this example, you first need to extract dataset.csv from the supplied archive.
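If it helps, the file can also be extracted programmatically with the standard tarfile module; note that the member name and its location at the archive root are assumptions here:

import tarfile

# Assumes dataset.csv sits at the root of the archive under exactly that name.
with tarfile.open("read_csv_crash.tar.gz", "r:gz") as archive:
    archive.extract("dataset.csv")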
import pandas as pd

for n_lines in range(82, 87):
    filelike = open("dataset.csv", "r")
    # Read the file with the C engine in chunks of n_lines rows.
    iterator_ = pd.read_csv(filelike, header=None, engine="c", dtype=object,
                            chunksize=n_lines)
    for chunk_ in iterator_:
        print n_lines, chunk_.iloc[0, 0], chunk_.iloc[-1, 0]
    filelike.close()
Please note that the crash does not seem to occur when the file is smaller than 260 KiB. Also note that playing with the low_memory setting did not alleviate the problem; a sketch of that experiment follows.
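The low_memory experiment looked roughly like the following; this is a sketch reconstructed from the description above, not the exact code that was run, and low_memory is just the standard pd.read_csv option for the C engine:

import pandas as pd

# Sketch only: toggling low_memory did not prevent the crash in either case.
for low_memory in (True, False):
    for n_lines in range(82, 87):
        filelike = open("dataset.csv", "r")
        iterator_ = pd.read_csv(filelike, header=None, engine="c", dtype=object,
                                chunksize=n_lines, low_memory=low_memory)
        for chunk_ in iterator_:
            print low_memory, n_lines, chunk_.iloc[0, 0], chunk_.iloc[-1, 0]
        filelike.close()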
Expected Output
The code sample actually produces the following output:
82 9999-9 9999-9
82 9999-9 9999-9
82 9999-9 9999-9
83 9999-9 9999-9
83 9999-9 9999-9
83 9999-9 9999-9
84 9999-9 9999-9
84 9999-9 9999-9
Segmentation fault: 11
output of pd.show_versions()
The output of this call is attached to this issue.
pd_show_versions.txt
Python interpreter banner
Python 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015, 09:33:12)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
OS X version
OS Version: Mac OS X 10.11.5 (15F34)
Model: Macmini6,2, BootROM MM61.0106.B0A, 4 processors, Intel Core i7, 2,6 GHz, 16 GB, SMC 2.8f