QST: Is this expected behavior when pd.read_csv() with na_values arguments? · Issue #59303 · pandas-dev/pandas

Research

https://stackoverflow.com/questions/46397526/how-to-use-na-values-option-in-the-pd-read-csv-function

Question about pandas

I have a simple csv file that looks like this:

x,y,z
a,-99,100
b,-99,200
c,-99.0,300
d,-99.0,400

When I tried a few different na_values arguments, I got different results for column y:

import pandas as pd

df1 = pd.read_csv('test.csv', na_values={"y": -99})
print("df1 = \n", df1)

df2 = pd.read_csv('test.csv', na_values={"y": -99.0})
print("\ndf2 = \n", df2)

df3 = pd.read_csv('test.csv', na_values={"y": [-99.0, -99]})
print("\ndf3 = \n", df3)

df4 = pd.read_csv('test.csv', na_values={"y": [-99, -99.0]})
print("\ndf4 = \n", df4)

Results:

df1 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df2 = 
    x     y    z
0  a -99.0  100
1  b -99.0  200
2  c   NaN  300
3  d   NaN  400

df3 = 
    x     y    z
0  a -99.0  100
1  b -99.0  200
2  c   NaN  300
3  d   NaN  400

df4 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

I'm not sure whether this is a bug or by design, so I'm raising it as a general question. Thank you!

The pandas version is 2.2.1, in case it matters.
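
For what it's worth, one workaround that does not depend on how na_values is matched is to parse the column first and then blank out the sentinel numerically. This is only a minimal sketch: `read_with_sentinel` and `CSV_TEXT` are hypothetical names used for illustration, not pandas API, and it says nothing about whether the behavior above is intended.

```python
import io

import pandas as pd

# Same data as test.csv above, held in memory so the sketch is self-contained.
CSV_TEXT = "x,y,z\na,-99,100\nb,-99,200\nc,-99.0,300\nd,-99.0,400\n"


def read_with_sentinel(source, column, sentinel):
    """Hypothetical helper: parse first, then replace the sentinel with NaN."""
    df = pd.read_csv(source)
    # Compare as numbers, so the tokens "-99" and "-99.0" are treated alike.
    df[column] = df[column].mask(df[column] == sentinel)
    return df


df = read_with_sentinel(io.StringIO(CSV_TEXT), "y", -99)
print(df)
```

Because the comparison happens after parsing, it does not matter whether the sentinel is passed as -99 or -99.0, or how the value was written in the file.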