DOC: document the perils of reading mixed dtypes / how to handle · Issue #13746 · pandas-dev/pandas

Code Sample, a copy-pastable example if possible

See for example http://stackoverflow.com/questions/18471859/pandas-read-csv-dtype-inference-issue
(the behavior hasn't changed in the 3 years since, except that a warning is now issued)

import pandas as pd
df = pd.DataFrame({'a': ['1']*100000 + ['X']*100000 + ['1']*100000, 'b': ['b']*300000})
df.to_csv('test', sep='\t', index=False, na_rep='NA')
df2 = pd.read_csv('test', sep='\t')  # DtypeWarning issued here
print(df2['a'].unique())  # ['1' 'X' 1]
for a in df2['a'][262140:262150]:
    print(repr(a))  # switching between '1' and 1
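
For reference, a minimal sketch of the two workarounds usually suggested for the sample above: passing `dtype` explicitly, or setting `low_memory=False` so the file is parsed in one pass. The data is built in memory with `io.StringIO` rather than written to `test`, and the variable names are illustrative only.

```python
import io
import pandas as pd

# Same data as the report, kept in memory instead of written to disk.
rows = ['1'] * 100000 + ['X'] * 100000 + ['1'] * 100000
csv_text = 'a\tb\n' + ''.join(v + '\tb\n' for v in rows)

# Option 1: state the dtype explicitly, so column 'a' is never inferred.
explicit = pd.read_csv(io.StringIO(csv_text), sep='\t', dtype={'a': str})

# Option 2: parse the whole file in one pass, so inference sees every value.
single_pass = pd.read_csv(io.StringIO(csv_text), sep='\t', low_memory=False)

# Collect the Python types actually stored in column 'a' for each approach.
value_types_explicit = {type(v) for v in explicit['a']}
value_types_single = {type(v) for v in single_pass['a']}
print(value_types_explicit, value_types_single)
```

Either way, column `a` should hold only strings instead of a mix of `'1'` and `1`. The explicit `dtype` is the more predictable of the two, since it does not depend on how the parser chunks the input.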

Similar examples: the column that contains '1' and 1 is inferred as string; one that contains 1 and '1' is inferred as numeric.

import io
df = pd.read_csv(io.StringIO('1, "1"\n"1",1'), header=None) # no warning
df.dtypes  # column 0 is int64, column 1 is object (strings)

Furthermore, pd.to_numeric(df[1]) won't actually work in the above case.
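
A hedged sketch of why `pd.to_numeric` fails here and what can be done about it. It assumes (this is not stated in the report) that the space after the comma in the first row is what prevents the parser from treating `"1"` as a quoted field, so the stored value is the literal string ` "1"`. The names `raw`, `coerced` and `clean` are illustrative.

```python
import io
import pandas as pd

data = '1, "1"\n"1",1'

# Default parsing, as in the report: the second field on the first line keeps
# its leading space and quotes, so column 1 stays a string column.
raw = pd.read_csv(io.StringIO(data), header=None)
first_cell = raw[1][0]
print(repr(first_cell))

# errors='coerce' converts what it can and turns everything else into NaN.
coerced = pd.to_numeric(raw[1], errors='coerce')

# skipinitialspace=True drops the space after the delimiter, which lets the
# parser recognise the quotes; both columns can then be parsed as integers.
clean = pd.read_csv(io.StringIO(data), header=None, skipinitialspace=True)
print(clean.dtypes)
```

Note that `errors='coerce'` silently replaces the unparsable value with NaN, so it hides the problem rather than fixing it; cleaning the input at read time is usually the safer choice.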

While it's an understandable behavior, it is unexpected to many users and isn't even documented (beyond the general warnings that type inference isn't perfect). Given the number of people who use pandas for reading csv files, without knowing the history of the library or the mechanics of type inference, this results in a lot of wasted time identifying and fixing a problem that could have been prevented.

I suggest at least adding clear documentation on this, but preferably changing the default behavior to fail with an error stating that type inference failed and that dtypes should be explicitly provided. (I don't think it's possible to use two passes, since the input data may be a generator that is exhausted after the first read.)
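
Until the default changes, the "fail with an error" behavior can be approximated in user code by promoting the warning to an exception. This is only a sketch: `read_csv_strict` is a hypothetical helper, not a pandas API, and it assumes the installed pandas still emits `pandas.errors.DtypeWarning` for this input.

```python
import io
import warnings
import pandas as pd


def read_csv_strict(source, **kwargs):
    """Hypothetical helper: like read_csv, but raise on mixed-dtype columns."""
    with warnings.catch_warnings():
        # Turn DtypeWarning into an exception for the duration of this call.
        warnings.simplefilter('error', pd.errors.DtypeWarning)
        return pd.read_csv(source, **kwargs)


rows = ['1'] * 100000 + ['X'] * 100000 + ['1'] * 100000
mixed_csv = 'a\tb\n' + ''.join(v + '\tb\n' for v in rows)

# Relying on inference: expected to raise, assuming the warning is emitted.
try:
    read_csv_strict(io.StringIO(mixed_csv), sep='\t')
    strict_failed = False
except pd.errors.DtypeWarning:
    strict_failed = True

# With an explicit dtype there is nothing to infer for column 'a'.
typed = read_csv_strict(io.StringIO(mixed_csv), sep='\t', dtype={'a': str})
```

The filter is scoped with `warnings.catch_warnings()` so that other code in the same process keeps its normal warning behavior.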

Expected Output

the column

output of pd.show_versions()