BUG: incorrect reading of CSV containing large integers
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
from io import StringIO

scsv = """#SYMBOL,SYSTEM,TYPE,MOMENT,ID,ACTION,PRICE,VOLUME,ID_DEAL,PRICE_DEAL
RIH3,F,S,20230301181139587,1925036343869802844,0,96690.00000,2,,
USDRUBF,F,S,20230301181139587,2023552585717888193,0,75.10000,14,,
USDRUBF,F,S,20230301181139587,2023552585717889863,1,75.00000,14,,
USDRUBF,F,S,20230301181139587,2023552585717889863,2,75.00000,1,2023552585717263358,75.00000
USDRUBF,F,B,20230301181139587,2023552585717882895,2,75.00000,1,2023552585717263358,75.00000
USDRUBF,F,S,20230301181139587,2023552585717889863,2,75.00000,1,2023552585717263359,75.00000
USDRUBF,F,B,20230301181139587,2023552585717888161,2,75.00000,1,2023552585717263359,75.00000
USDRUBF,F,S,20230301181139587,2023552585717889863,2,75.00000,10,2023552585717263360,75.00000
USDRUBF,F,B,20230301181139587,2023552585717889759,2,75.00000,10,2023552585717263360,75.00000
USDRUBF,F,S,20230301181139587,2023552585717889863,2,75.00000,2,2023552585717263361,75.00000
USDRUBF,F,B,20230301181139587,2023552585717889827,2,75.00000,2,2023552585717263361,75.00000
"""

for engine in "c pyarrow python".split():
    test = StringIO(scsv)
    orders = pd.read_csv(test, engine=engine)
    print(engine, len(orders.query("ID_DEAL==2023552585717263360")))
Issue Description
This piece of financial data is read badly wrong. I get the following output:
c 8
pyarrow 8
python 0
None of the engines parses the data correctly, I suppose because the integers are pretty big. But pandas produces no warning, it just silently butchers the data. The issue is not new in 2.0; it was present in 1.5 as well.
It seems to be connected to float64 not having enough precision to hold all the digits. int64 would probably be fine, since the previous large-integer column (ID) works with int64, but the ID_DEAL column has missing values, so presumably it cannot be read as plain int64.
Using the new nullable integer type kind of works, but not for pyarrow:
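The float64 precision loss can be checked without pandas. Below is a minimal sketch in plain Python using two ID_DEAL values from the example; the variable names are only illustrative.

```python
# Two distinct ID_DEAL values taken from the example CSV.
a = 2023552585717263358
b = 2023552585717263360

# float64 has a 53-bit significand, so above 2**53 not every integer is representable.
assert a > 2**53 and b > 2**53

# Both values lie between 2**60 and 2**61, where adjacent float64 values
# are 2**(60 - 52) = 256 apart.
assert 2**60 <= a < b < 2**61

fa, fb = float(a), float(b)
print(fa == fb)      # True: the two distinct IDs collapse to the same float
print(int(fa) == b)  # True: a is rounded to the nearest multiple of 256, which is b
print(a == b)        # False: exact integer comparison still tells them apart
```

This would be consistent with the c and pyarrow engines matching all 8 non-null rows: the four distinct deal IDs in the sample differ by at most 3, far less than the 256 spacing.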
...
orders = pd.read_csv(test, engine=engine,dtype={"ID_DEAL": "Int64"})
...
c 2
pyarrow 8
python 2
But the loss of precision certainly needs to be detected automatically, and either an error should be raised or the Int64 dtype suggested. Otherwise, this bug can lead to catastrophic consequences in decision support systems powered by pandas.
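Until something like that exists, a pre-check along these lines might help. It is a rough sketch using only the standard library; `columns_at_risk` and `FLOAT64_EXACT_LIMIT` are hypothetical names, not part of pandas, and the sample is a shortened three-column version of the data above.

```python
import csv
import re
from io import StringIO

# Every integer up to this magnitude is exactly representable in float64.
FLOAT64_EXACT_LIMIT = 2**53
INT_PATTERN = re.compile(r"[+-]?[0-9]+")


def columns_at_risk(text):
    """Return names of columns that mix empty cells with integers above 2**53.

    Hypothetical helper, not a pandas API.
    """
    reader = csv.reader(StringIO(text))
    header = next(reader)
    has_blank = [False] * len(header)
    has_big_int = [False] * len(header)
    for row in reader:
        for i, cell in enumerate(row[:len(header)]):
            cell = cell.strip()
            if cell == "":
                has_blank[i] = True
            elif INT_PATTERN.fullmatch(cell) and abs(int(cell)) > FLOAT64_EXACT_LIMIT:
                has_big_int[i] = True
    return [
        name
        for name, blank, big in zip(header, has_blank, has_big_int)
        if blank and big
    ]


sample = (
    "ID,ID_DEAL,PRICE_DEAL\n"
    "2023552585717889863,,\n"
    "2023552585717889863,2023552585717263358,75.00000\n"
    "2023552585717882895,2023552585717263358,75.00000\n"
)
print(columns_at_risk(sample))  # ['ID_DEAL']
```

The flagged columns could then be passed explicitly, e.g. dtype={"ID_DEAL": "Int64"}, as in the workaround above. ID is not flagged because it has no missing values, and PRICE_DEAL is not flagged because its values are not integers.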
Expected Behavior
c 2
pyarrow 2
python 2
Installed Versions
INSTALLED VERSIONS
------------------
commit : 478d340
python : 3.8.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.17763
machine : AMD64
processor : Intel64 Family 6 Model 45 Stepping 5, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Russian_Russia.1251