read_csv issues with dict for na_values · Issue #19227 · pandas-dev/pandas

Basically, I can't get a dictionary of na_values to work properly for me, no matter what I try. Pandas version is 0.22.0.
hack.csv contains:

113125,"blah","/blaha",kjsdkj,412.166,225.874,214.008
729639,"qwer","",asdfkj,466.681,,252.373

Here are two variants of my code - the one with the list does what I expect, but the dict version doesn't:

import pandas as pd
df = pd.read_csv("hack.csv", header=None, keep_default_na=False, na_values=[214.008, '', "blah"])
df.head()

output from list version

0	1	2	3	4	5	6
113125	NaN	/blaha	kjsdkj	412.166	225.874	NaN
729639	qwer	NaN	asdfkj	466.681	NaN	252.373

looks correct, but the dict version

df = pd.read_csv("hack.csv", header=None, keep_default_na=False, na_values={2: "", 6: "214.008", 1: "blah", 0: '113125'})
df.head()

although clearly paying attention to the columns I specify, is simply refusing to create any NaNs in those columns:

0	1	2	3	4	5	6
113125	blah	/blaha	kjsdkj	412.166	225.874	214.008
729639	qwer		asdfkj	466.681	NaN	252.373

So... I'm stuck. Any suggestions? I really want to have column-specific NaN handling so I need the dict.

Additionally the dict version does create NaNs in columns I didn't specify in the dict, which also totally goes against my expectations for the combination of keep_default_na=False and an explicit value for na_values. Maybe I'm misreading the docs on that point.
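For anyone who wants to try this without creating hack.csv, here is a minimal sketch that feeds the same two rows to read_csv from an in-memory string and runs both variants side by side. The `read_with` helper and the `CSV_TEXT` name are just illustrative, not part of my original code, and what the dict variant prints will presumably depend on the pandas version installed.

```python
import io

import pandas as pd

# The same two rows as hack.csv, held in memory so no file is needed.
CSV_TEXT = (
    '113125,"blah","/blaha",kjsdkj,412.166,225.874,214.008\n'
    '729639,"qwer","",asdfkj,466.681,,252.373\n'
)


def read_with(na_values):
    """Hypothetical helper: parse CSV_TEXT with the options used in this report."""
    return pd.read_csv(
        io.StringIO(CSV_TEXT),
        header=None,
        keep_default_na=False,
        na_values=na_values,
    )


# List variant: the values apply to every column.
df_list = read_with([214.008, "", "blah"])

# Dict variant: the values are keyed by column.
df_dict = read_with({2: "", 6: "214.008", 1: "blah", 0: "113125"})

print("pandas", pd.__version__)
print(df_list)
print(df_dict)
```

Comparing the two printed frames column by column shows quickly whether a given pandas version creates the NaNs requested by the dict.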

Finally, you may notice that I used "214.008" (and other quoted numeric values) in the dict above. I did this because I get a "not iterable" error when I provide unquoted numbers, even though that problem was flagged as an issue and fixed a while back. This feels like another buglet to me.
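One possible way to sidestep the quoting by hand is to normalise the dict before passing it in, wrapping each scalar in a list and converting numbers to strings. The sketch below is plain Python; `normalize_na_values` is a hypothetical helper name, and whether the normalised dict actually avoids the "not iterable" error on 0.22.0 is an assumption I haven't verified.

```python
def normalize_na_values(mapping):
    """Hypothetical helper: turn {column: scalar or iterable} into {column: [str, ...]}."""
    normalized = {}
    for column, values in mapping.items():
        # Strings are iterable, but here they count as a single NA marker.
        if isinstance(values, (str, bytes)) or not hasattr(values, "__iter__"):
            values = [values]
        normalized[column] = [str(value) for value in values]
    return normalized


na_map = normalize_na_values({2: "", 6: 214.008, 1: "blah", 0: 113125})
print(na_map)
```

The result can then be passed as `na_values=na_map` to read_csv.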

Btw: to be picky, another doc-related quibble: I think the docs for keep_default_na are a bit misleading, in that they imply that keep_default_na=True should have no effect unless na_values is supplied (but in fact there is an effect even when na_values isn't supplied). It might be over-pedantic of me to care, but I feel that primary documentation really ought to be unambiguous. If anybody agrees with this pedantry I would be happy to propose a tweak ;-)
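To illustrate the keep_default_na point, here is a small sketch with made-up data (the two-column CSV string is purely illustrative). No na_values is supplied in either call, yet the result differs depending on keep_default_na.

```python
import io

import pandas as pd

# Made-up data: "NA" is one of pandas' default NA markers.
data = "a,b\n1,NA\n2,x\n"

# keep_default_na defaults to True, so "NA" is parsed as NaN.
with_defaults = pd.read_csv(io.StringIO(data))

# With keep_default_na=False and no na_values, "NA" stays a plain string.
without_defaults = pd.read_csv(io.StringIO(data), keep_default_na=False)

print(with_defaults["b"].isna().tolist())   # [True, False]
print(without_defaults["b"].tolist())       # ['NA', 'x']
```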

This issue was raised after I recently commented on two other issues with the problems described above, and @gfyoung suggested I ought to raise a new issue (comment links below).
#1657 (comment)
#12224 (comment)

Output of `pd.show_versions()`

NB: I quite likely have some "old" modules in the pile below, but I believe that I've updated pandas itself correctly, so any dependencies ought to have been updated too. If my errors aren't reproducible by others, it might indicate that there's a hidden version dependency(?) but I'm too much of a pandas noob to know how likely that is.

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: 2.9.2
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.4
feather: None
matplotlib: 2.0.2
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
