BUG: read_stata always uses 'utf8' · Issue #21244 · pandas-dev/pandas
Code Sample, a copy-pastable example if possible
import pandas

data = pandas.read_stata(file_with_latin1_encoding, chunksize=1048576)
for chunk in data:
    pass  # do something with chunk (never reached)
This raises UnicodeDecodeError: 'utf8' codec can't decode byte 0x?? in position ?: invalid start byte.
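The same class of error can be reproduced without a Stata file. This is a minimal sketch in Python 3 syntax, assuming the string fields in the file hold Latin-1 bytes; the sample value is made up for illustration and the exact error message depends on the bytes involved.

```python
# Hypothetical field value; any non-ASCII Latin-1 text behaves similarly.
raw = "café".encode("latin-1")  # b'caf\xe9'

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # The lone byte 0xE9 does not form a valid UTF-8 sequence.
    utf8_ok = False

# Latin-1 maps every byte to a code point, so this decode cannot fail.
latin1_text = raw.decode("latin-1")
```

Decoding as UTF-8 fails here, while decoding as Latin-1 recovers the original text.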
OK, so the file isn't UTF-8 (and the StataReader doesn't advertise any Unicode support anyway). I then tried opening it with a latin-1 encoding:
import pandas

data = pandas.read_stata(file_with_latin1_encoding, chunksize=1048576, encoding='latin-1')
for chunk in data:
    pass  # do something with chunk (never reached)
This raises the same exception at exactly the same place (still utf-8).
Problem description
This is a problem because it appears that read_stata doesn't honour the encoding argument.
I think this line introduced a bug. The StataReader doesn't handle any encoding other than ascii or latin-1.
Changing line 1338 of the pandas.io.stata module to:
return s.decode('latin-1')
seemed to make everything work, and I could read the data from the given file.
Even better, changing it to the following:
return s.decode(self._encoding or self._default_encoding)
also seems to have made it work.
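The fallback idea can be sketched in isolation. The class and method names below are hypothetical stand-ins, not the actual StataReader code, and the sketch assumes that string fields are padded with NUL bytes and that latin-1 is the default encoding.

```python
class FieldDecoder:
    """Hypothetical stand-in for the reader's string handling."""

    _default_encoding = "latin-1"  # assumed default

    def __init__(self, encoding=None):
        # None means "use the default encoding"
        self._encoding = encoding

    def decode_field(self, s):
        # Cut the field at the first NUL byte, if there is one
        idx = s.find(b"\0")
        if idx != -1:
            s = s[:idx]
        # Honour the user's encoding, falling back to the default
        return s.decode(self._encoding or self._default_encoding)


default_reader = FieldDecoder()
utf8_reader = FieldDecoder(encoding="utf-8")
```

With this shape, a caller who passes no encoding gets latin-1 decoding, and a caller who passes one gets exactly what was requested.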
I believe, though, that if you want to make this work with Unicode too, you'd have to add the following encodings to VALID_ENCODINGS: utf-8, utf8, iso10646.
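Listing every spelling by hand is one option. Another, shown here only as a sketch and not as what pandas does, is to compare Python's canonical codec names so that aliases such as utf8 and UTF-8 are treated alike. The ALLOWED set and the is_allowed helper are hypothetical names introduced for illustration.

```python
import codecs

# Hypothetical whitelist, stored under Python's canonical codec names.
ALLOWED = {codecs.lookup(name).name for name in ("ascii", "latin-1", "utf-8")}


def is_allowed(encoding):
    """Return True if the encoding resolves to a whitelisted codec."""
    try:
        return codecs.lookup(encoding).name in ALLOWED
    except LookupError:
        # Python does not know this encoding name at all
        return False
```

The check accepts any alias Python recognises for a whitelisted codec and rejects both unlisted codecs and unknown names.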
Expected Output
The file should be read and parsed correctly.
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-37-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: ro_RO.UTF-8
LANG: ro_RO.UTF-8
LOCALE: None.None
pandas: 0.24.0.dev0+41.gb2eec25
pytest: 3.2.3
pip: 9.0.3
setuptools: 36.6.0
Cython: 0.28.2
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.6.3
patsy: None
dateutil: 2.7.3
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.4.9
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 3.8.0
bs4: None
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None