read_csv issues with dict for na_values · Issue #19227 · pandas-dev/pandas
Basically, I can't get a dictionary of na_values to work properly for me, no matter what I try. Pandas version is 0.22.0.
hack.csv contains:
113125,"blah","/blaha",kjsdkj,412.166,225.874,214.008
729639,"qwer","",asdfkj,466.681,,252.373
Here are two variants of my code - the one with the list does what I expect, but the dict version doesn't:
df = pd.read_csv("hack.csv", header=None, keep_default_na=False, na_values=[214.008,'',"blah"])
df.head()
output from list version
0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
113125 | NaN | /blaha | kjsdkj | 412.166 | 225.874 | NaN |
729639 | qwer | NaN | asdfkj | 466.681 | NaN | 252.373 |
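For anyone who wants to reproduce the list variant without creating hack.csv, here is a minimal sketch. It assumes the two rows shown above are the whole file and feeds them through io.StringIO, which is a stand-in I've introduced rather than part of the original code.

```python
import io

import pandas as pd

# Assumption: these two rows are the full contents of hack.csv.
CSV_TEXT = (
    '113125,"blah","/blaha",kjsdkj,412.166,225.874,214.008\n'
    '729639,"qwer","",asdfkj,466.681,,252.373\n'
)

# Same arguments as the list variant: every marker applies to every column.
df_list = pd.read_csv(
    io.StringIO(CSV_TEXT),
    header=None,
    keep_default_na=False,
    na_values=[214.008, "", "blah"],
)

# Boolean mask showing which cells were parsed as NaN.
print(df_list.isna())
```

The mask should line up with the NaN cells in the table above: columns 1 and 6 in the first row, columns 2 and 5 in the second.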
looks correct, but the dict version
df = pd.read_csv("hack.csv", header=None, keep_default_na=False, na_values={2:"",6:"214.008",1:"blah",0:'113125'})
df.head()
although clearly paying attention to the columns I specify, is simply refusing to create any NaNs in those columns:
0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
113125 | blah | /blaha | kjsdkj | 412.166 | 225.874 | 214.008 |
729639 | qwer |  | asdfkj | 466.681 | NaN | 252.373 |
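A self-contained version of the dict variant may make it easier to compare behaviour across pandas versions. This is only a sketch: the in-memory CSV text is my stand-in for hack.csv, and the script prints the installed version next to the NaN mask rather than assuming what the result will be.

```python
import io

import pandas as pd

# Assumption: these two rows are the full contents of hack.csv.
CSV_TEXT = (
    '113125,"blah","/blaha",kjsdkj,412.166,225.874,214.008\n'
    '729639,"qwer","",asdfkj,466.681,,252.373\n'
)

# Same per-column dict as in the report, keyed by integer column position.
df_dict = pd.read_csv(
    io.StringIO(CSV_TEXT),
    header=None,
    keep_default_na=False,
    na_values={2: "", 6: "214.008", 1: "blah", 0: "113125"},
)

# Report the version together with the mask so results can be compared.
print(pd.__version__)
print(df_dict.isna())
```

If the dict is honoured, the mask should be True only for the cells matching each column's own marker.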
So... I'm stuck. Any suggestions? I really want to have column-specific NaN handling so I need the dict.
Additionally, the dict version does create NaNs in columns I didn't specify in the dict, which also totally goes against my expectations for the combination of keep_default_na=False and an explicit value for na_values. Maybe I'm misreading the docs on that point.
Finally, you may notice that I used "214.008" (and other quoted numeric values) in the dict above. This is because I get a "not iterable" error when I provide unquoted numbers. This is despite that having been flagged as an issue and fixed a while back. This feels like another buglet to me.
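One possible workaround for the "not iterable" error is to normalise the dict before handing it to read_csv, so that every value is a list of strings. The helper below, normalize_na_values, is hypothetical (it is not a pandas function) and is only a sketch of the quoting I did by hand above.

```python
def normalize_na_values(na_values):
    """Hypothetical helper: make every dict value a list of string markers."""
    normalized = {}
    for column, markers in na_values.items():
        # Treat strings and non-iterable scalars as a single marker.
        if isinstance(markers, (str, bytes)) or not hasattr(markers, "__iter__"):
            markers = [markers]
        normalized[column] = [str(marker) for marker in markers]
    return normalized


# Unquoted numbers are accepted here and converted to their string form.
per_column = normalize_na_values({2: "", 6: 214.008, 1: "blah", 0: 113125})
print(per_column)
```

The resulting dict can then be passed as na_values. Whether read_csv honours it is a separate question, which is the main point of this issue.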
Btw: to be picky, another doc-related quibble: I think the docs for keep_default_na are a bit misleading, in that they imply that keep_default_na=True should have no effect unless na_values is supplied (but in fact there is an effect even when na_values isn't supplied). It might be over-pedantic of me to care, but I feel that primary documentation really ought to be unambiguous. If anybody agrees with this pedantry I would be happy to propose a tweak ;-)
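To illustrate the point about keep_default_na, here is a small sketch with na_values left out entirely. The sample data is made up for the illustration and is not taken from hack.csv.

```python
import io

import pandas as pd

# Made-up sample: column "a" contains the literal text NA.
SAMPLE = "a,b\nNA,1\nx,2\n"

# Default behaviour (keep_default_na=True), no na_values supplied.
with_defaults = pd.read_csv(io.StringIO(SAMPLE))

# Default markers switched off, still no na_values supplied.
without_defaults = pd.read_csv(io.StringIO(SAMPLE), keep_default_na=False)

print(with_defaults["a"].isna().tolist())     # [True, False]
print(without_defaults["a"].isna().tolist())  # [False, False]
```

So keep_default_na changes the result on its own: with the defaults on, the literal NA is parsed as NaN, and with them off it stays as the string "NA".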
This issue was raised after I recently commented on two other issues with the problems described above, and @gfyoung suggested I ought to raise a new issue (comment links below).
#1657 (comment)
#12224 (comment)
Output of pd.show_versions()
NB: I quite likely have some "old" modules in the pile below, but I believe that I've updated pandas itself correctly, so any dependencies ought to have been updated too. If my errors aren't reproducible by others, it might indicate that there's a hidden version dependency(?) but I'm too much of a pandas noob to know how likely that is.
INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.22.0
pytest: 2.9.2
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.4
feather: None
matplotlib: 2.0.2
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None