Bug: On Python 3 to_csv() encoding defaults to ascii if the dataframe contains special characters. · Issue #17097 · pandas-dev/pandas
Code Sample, a copy-pastable example if possible
import pandas as pd
L1 = ["AAAAA","BBBBB","TTTTT","77777"]
df1 = pd.DataFrame({"L1":L1})
df1.to_csv("test1.csv")
L2 = ["AAAAA","ÄÄÄÄÄ","ßßßßß","聞聞聞聞聞"]
df2 = pd.DataFrame({"L2":L2})
df2.to_csv("test2.csv")
df2.to_csv("test3.csv",encoding='utf8')
Problem description
The to_csv() doc says about encoding:
A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.
Therefore, being on Python 3, I expect test1.csv and test2.csv to be utf8.
However, while test1.csv is encoded in utf8, test2.csv is encoded in ascii. To get the correct encoding I have to explicitly pass the encoding, which produces the correct result in test3.csv.
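One way to check how each file was actually encoded is to read its raw bytes and try to decode them with a few candidate codecs. The following is a minimal sketch, not part of the original report: the helper names `decodes_as` and `report` are hypothetical, and the candidate list (ascii, utf-8, cp1252) is an assumption about which encodings are worth testing on a Windows machine.

```python
from pathlib import Path


def decodes_as(path, encoding):
    """Return True if the raw bytes of `path` decode cleanly with `encoding`."""
    try:
        Path(path).read_bytes().decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def report(path, candidates=("ascii", "utf-8", "cp1252")):
    """Map each candidate encoding (an assumed list) to whether it decodes the file."""
    return {enc: decodes_as(path, enc) for enc in candidates}


if __name__ == "__main__":
    # File names taken from the code sample; skipped if they do not exist.
    for name in ("test1.csv", "test2.csv", "test3.csv"):
        if Path(name).exists():
            print(name, report(name))
```

A file that contains only ASCII characters decodes successfully under all three candidates, so a check like this only distinguishes the encodings once non-ASCII characters are present.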
Correspondingly, reading test2.csv back with read_csv() leads to
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte
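The byte 0xc4 in the error message can be reproduced without pandas. This sketch assumes the file was written in a single-byte Windows code page such as cp1252, which the report does not confirm; it only shows that "Ä" becomes the single byte 0xc4 in that code page, and that such bytes are not valid UTF-8.

```python
text = "ÄÄÄÄÄ"

# Assumption: a single-byte Windows code page (cp1252) was used for writing.
legacy_bytes = text.encode("cp1252")   # one byte per character: 0xc4
utf8_bytes = text.encode("utf-8")      # two bytes per character: 0xc3 0x84

try:
    legacy_bytes.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

print("cp1252 bytes:", legacy_bytes)
print("utf-8 bytes: ", utf8_bytes)
print("decoding the cp1252 bytes as utf-8:", decode_error)
```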
EDIT: Additional info from first comment:
The output of the to_csv() calls looks correct:
df1.to_csv()
Out[9]: ',L1\n0,AAAAA\n1,BBBBB\n2,TTTTT\n3,77777\n'
df2.to_csv()
Out[10]: ',L2\n0,AAAAA\n1,ÄÄÄÄÄ\n2,ßßßßß\n3,聞聞聞聞聞\n'
Regarding the read_csv() part, it's like this:
I can read test1.csv and test3.csv fine, regardless of specifying encoding='utf8' or not.
Likewise, I cannot read test2.csv at all, regardless of specifying encoding='utf8' or not. The error message is returned in both cases.
The problem is only solved by explicitly specifying encoding='utf8' in to_csv().
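The workaround described above can be sketched as a round trip that names the encoding on both the write and the read side, so neither call depends on a platform default. The helper `round_trip` is a hypothetical name used for illustration, and the temporary directory is an assumption to keep the example from touching the working directory.

```python
import os
import tempfile

import pandas as pd


def round_trip(values, encoding="utf-8"):
    """Write `values` to a CSV file and read them back, using `encoding` for both steps."""
    df = pd.DataFrame({"L2": values})
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test3.csv")
        df.to_csv(path, encoding=encoding)                   # explicit on write
        df_in = pd.read_csv(path, encoding=encoding, index_col=0)  # explicit on read
    return df_in["L2"].tolist()


if __name__ == "__main__":
    L2 = ["AAAAA", "ÄÄÄÄÄ", "ßßßßß", "聞聞聞聞聞"]
    print(round_trip(L2) == L2)
```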
EDIT 2:
I can only read test2.csv when I explicitly state encoding='ansi', so read_csv() definitely expects utf-8.
Output of pd.show_versions()
python: 3.6.2.final.0 python-bits: 64 OS: Windows OS-release: 10 machine: AMD64
pandas: 0.20.3