BUG: na_values dict form not working on index column · Issue #57547 · pandas-dev/pandas

Pandas version checks

Reproducible Example

from io import StringIO

from pandas._libs.parsers import STR_NA_VALUES
import pandas as pd

file_contents = """,x,y
MA,1,2
NA,2,1
OA,,3
"""

default_nan_values = STR_NA_VALUES | {"squid"}
names = [None, "x", "y"]
nan_mapping = {name: default_nan_values for name in names}
dtype = {0: "object", "x": "float32", "y": "float32"}

pd.read_csv(
    StringIO(file_contents),
    index_col=0,
    header=0,
    engine="c",
    dtype=dtype,
    names=names,
    na_values=nan_mapping,
    keep_default_na=False,
)

Issue Description

I'm trying to find a way to read in an index column as exact strings, while reading the rest of the columns as NaN-able numbers or strings. The dict form of na_values seems to be the only way the documentation implies this can be done. However, when I try it, read_csv fails with:

Traceback (most recent call last):
  File ".../test.py", line 17, in <module>
    pd.read_csv(
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1024, in read_csv
    return _read(filepath_or_buffer, kwds)
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 624, in _read
    return parser.read(nrows)
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1921, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 333, in read
    index, column_names = self._make_index(date_data, alldata, names)
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 372, in _make_index
    index = self._agg_index(simple_index)
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 504, in _agg_index
    arr, _ = self._infer_types(
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 744, in _infer_types
    na_count = parsers.sanitize_objects(values, na_values)
TypeError: Argument 'na_values' has incorrect type (expected set, got dict)

This is unhelpful, as the docs imply this should work, and I can't find any other way to turn off NaN detection in the index column without also disabling it in the rest of the table (which is a hard requirement).
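A possible workaround, sketched under assumptions rather than confirmed as the intended usage: read the first column as an ordinary column, give NA markers only to the data columns in the na_values dict, and promote that column to the index afterwards. This sidesteps the index code path shown in the traceback instead of fixing it. The column name "idx" is a hypothetical placeholder, and the small data_na_values set is an illustrative stand-in for STR_NA_VALUES | {"squid"}.

```python
from io import StringIO

import pandas as pd

file_contents = ",x,y\nMA,1,2\nNA,2,1\nOA,,3\n"

# Illustrative stand-in for the issue's STR_NA_VALUES | {"squid"} set.
data_na_values = {"", "NA", "NaN", "squid"}

# "idx" is a hypothetical placeholder name for the unnamed first column.
names = ["idx", "x", "y"]

df = pd.read_csv(
    StringIO(file_contents),
    header=0,
    names=names,
    engine="c",
    dtype={"idx": "object", "x": "float32", "y": "float32"},
    # Only the data columns get NA markers; "idx" is deliberately left out,
    # so with keep_default_na=False its strings should be kept verbatim.
    na_values={"x": data_na_values, "y": data_na_values},
    keep_default_na=False,
)

# Promote the string column to the index and drop the placeholder name.
df = df.set_index("idx").rename_axis(None)
print(df)
```

Because index_col is not passed to read_csv here, the index is built by set_index after parsing, so the dict never reaches the index type-inference step that raises the TypeError.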

Expected Behavior

The CSV should be read without error, producing a DataFrame like the following:

       x    y
MA   1.0  2.0
NA   2.0  1.0
OA   NaN  3.0
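To make the expected result checkable, the table above can be written out as an explicit DataFrame. This is a minimal sketch that assumes the dtypes requested in the reproducer (float32 data columns, object index); the variable name expected is illustrative.

```python
import numpy as np
import pandas as pd

# Expected frame: string index with "NA" kept as a literal label,
# float32 data columns, and a single missing value in column "x".
expected = pd.DataFrame(
    {
        "x": np.array([1.0, 2.0, np.nan], dtype="float32"),
        "y": np.array([2.0, 1.0, 3.0], dtype="float32"),
    },
    index=pd.Index(["MA", "NA", "OA"], dtype="object"),
)
print(expected)
```

A result from read_csv could then be compared against it with pd.testing.assert_frame_equal.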

Installed Versions

This has been tested on three versions of pandas: v1.5.2, v2.0.2, and v2.2.0, all with similar results.

INSTALLED VERSIONS
------------------
commit : fd3f571
python : 3.10.11.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.0-18-generic
Version : #18~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Feb 7 11:40:03 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.2.0
numpy : 1.26.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 69.0.3
pip : 23.2.1
Cython : None
pytest : 7.4.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.3
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : 0.58.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None