Bug: On Python 3 to_csv() encoding defaults to ascii if the dataframe contains special characters. · Issue #17097 · pandas-dev/pandas

Code Sample, a copy-pastable example if possible

import pandas as pd

# ASCII-only data: written without problems
L1 = ["AAAAA", "BBBBB", "TTTTT", "77777"]
df1 = pd.DataFrame({"L1": L1})
df1.to_csv("test1.csv")

# Data with non-ASCII characters, relying on the documented utf-8 default
L2 = ["AAAAA", "ÄÄÄÄÄ", "ßßßßß", "聞聞聞聞聞"]
df2 = pd.DataFrame({"L2": L2})
df2.to_csv("test2.csv")

# Same data, but with the encoding given explicitly
df2.to_csv("test3.csv", encoding='utf8')

Problem description

The to_csv() docs say about encoding:

A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.

Therefore, being on Python 3, I expect test1.csv and test2.csv to be written as UTF-8.

However, while test1.csv is encoded in UTF-8, test2.csv is not; if I want the correct encoding I have to specify it explicitly, which produces the correct result as test3.csv.
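One way to confirm which encoding actually ended up on disk is to try decoding the raw bytes directly; a minimal sketch, assuming the files written by the sample above are in the working directory:

with open("test2.csv", "rb") as f:
    raw = f.read()
try:
    raw.decode("utf-8")
    print("test2.csv is valid UTF-8")
except UnicodeDecodeError:
    print("test2.csv is not UTF-8 (written with the locale default, e.g. cp1252 on Windows)")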

Correspondingly, doing

pd.read_csv("test2.csv")

leads to

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte

EDIT: Additional info from first comment:

The output of the to_csv() calls looks correct:

df1.to_csv()
Out[9]: ',L1\n0,AAAAA\n1,BBBBB\n2,TTTTT\n3,77777\n'

df2.to_csv()
Out[10]: ',L2\n0,AAAAA\n1,ÄÄÄÄÄ\n2,ßßßßß\n3,聞聞聞聞聞\n'
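Since the in-memory string is correct, the encoding only comes into play when pandas writes the file itself; writing that same string out manually with an explicit encoding also produces a valid UTF-8 file. A minimal sketch, assuming df2 from the sample above and a hypothetical file name test2_manual.csv:

csv_text = df2.to_csv()  # plain Python str; no encoding has been applied yet
with open("test2_manual.csv", "w", encoding="utf-8") as f:  # hypothetical file name
    f.write(csv_text)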

Regarding the read_csv() part, it behaves like this:

I can read test1.csv and test3.csv fine, whether or not I specify encoding='utf8'.

Conversely, I cannot read test2.csv at all, whether or not I specify encoding='utf8'; the error above is raised in both cases.

The problem is only solved by explicitly specifying encoding='utf8' in to_csv().
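For completeness, a minimal sketch of that write-side workaround; once the encoding is passed explicitly, the default read side (which expects UTF-8) round-trips the data cleanly:

import pandas as pd

L2 = ["AAAAA", "ÄÄÄÄÄ", "ßßßßß", "聞聞聞聞聞"]
df2 = pd.DataFrame({"L2": L2})

df2.to_csv("test3.csv", encoding="utf-8")               # explicit encoding on the write side
df2_roundtrip = pd.read_csv("test3.csv", index_col=0)   # default read now decodes fine
assert list(df2_roundtrip["L2"]) == L2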

EDIT 2:

I can only read test2.csv when I explicitly specify encoding='ansi', so the file was written in the ANSI code page while read_csv() definitely expects UTF-8.
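On Windows, Python 3.6 added 'ansi' as an alias for the 'mbcs' codec, i.e. the active ANSI code page (typically cp1252 on a Western-European system), which is consistent with to_csv() having written the file in the locale default. Equivalent spellings of that read-side workaround, where cp1252 is an assumption about the local code page:

df2_back = pd.read_csv("test2.csv", index_col=0, encoding="mbcs")    # Windows-only alias for the ANSI code page
df2_back = pd.read_csv("test2.csv", index_col=0, encoding="cp1252")  # assumes a Western-European Windows locale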

Output of pd.show_versions()

python: 3.6.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64

pandas: 0.20.3