BUG: Unexpected behaviour when reading large text files with mixed datatypes · Issue #3866 · pandas-dev/pandas
read_csv gives unexpected behaviour with large files if a column contains both strings and integers. For example:
>>> df=DataFrame({'colA':range(500000-1)+['apple', 'pear']+range(500000-1)})
>>> len(set(df.colA))
500001
>>> df.to_csv('testpandas2.txt')
>>> df2=read_csv('testpandas2.txt')
>>> len(set(df2.colA))
762143
>>> pandas.__version__
'0.11.0'
It seems some of the integers are parsed as integers and others as strings.
>>> list(set(df2.colA))[-10:]
['282248', '282249', '282240', '282241', '282242', '15679', '282244', '282245', '282246', '282247']
>>> list(set(df2.colA))[:10]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
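A possible workaround, sketched here for current pandas releases and not checked against 0.11.0: the read_csv documentation notes that the default low-memory mode processes the file in chunks, which can produce mixed type inference within one column. Declaring the column's dtype, or passing low_memory=False, should avoid that. The sketch writes to an in-memory buffer and uses index=False, which are choices made for this illustration and not part of the original report.

```python
import io

import pandas as pd

# Same shape of data as in the report: integers, two strings, integers again.
n = 500000 - 1
values = list(range(n)) + ['apple', 'pear'] + list(range(n))
df = pd.DataFrame({'colA': values})

# Write the CSV to an in-memory buffer (assumption: used here instead of a file on disk).
buf = io.StringIO()
df.to_csv(buf, index=False)

# Option 1: declare the column as text, so no type inference is needed.
buf.seek(0)
as_text = pd.read_csv(buf, dtype={'colA': str})

# Option 2: disable chunked low-memory parsing, so the type of the
# column is inferred once over the whole file.
buf.seek(0)
one_pass = pd.read_csv(buf, low_memory=False)

unique_text = len(set(as_text.colA))
unique_one_pass = len(set(one_pass.colA))
print(unique_text, unique_one_pass)
```

In both cases every value in colA comes back as a string, so the integers have to be converted explicitly afterwards if numeric values are needed.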