BUG: `read_stata` always uses 'utf8' · Issue #21244 · pandas-dev/pandas

Code Sample, a copy-pastable example if possible

    import pandas

    data = pandas.read_stata(file_with_latin1_encoding, chunksize=1048576)
    for chunk in data:
        pass  # do something with chunk (never reached)

This raises UnicodeDecodeError: 'utf8' codec can't decode byte 0x?? in position ?: invalid start byte.
OK, so the file isn't UTF-8. That is odd to run into at all, since the StataReader doesn't claim any Unicode support. I then try to open it with the latin-1 encoding:

    import pandas

    data = pandas.read_stata(file_with_latin1_encoding, chunksize=1048576, encoding='latin-1')
    for chunk in data:
        pass  # do something with chunk (never reached)

This raises the same exception at exactly the same place (still utf-8).
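For reference, the same kind of error can be reproduced without pandas or a Stata file. This is only a sketch with a made-up byte string (not taken from the file in question): any latin-1 byte in the 0x80-0xBF range is an invalid start byte when decoded as UTF-8.

```python
# -*- coding: utf-8 -*-
# Made-up sample data: the pound sign is byte 0xA3 in latin-1.
raw = u"\u00a3100".encode("latin-1")

error = None
try:
    raw.decode("utf-8")  # what a hardcoded utf-8 decode would attempt
except UnicodeDecodeError as exc:
    error = exc  # keep a reference; `exc` is cleared after the block on Python 3

# Decoding with the encoding the bytes were actually written in succeeds.
decoded = raw.decode("latin-1")
```

Every byte value is valid latin-1, so the second decode cannot fail, which is consistent with the latin-1 change described in this report making the file readable.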

Problem description

This is a problem because it appears that read_stata doesn't honour the encoding argument.
I think this line introduced a bug. The StataReader doesn't handle any encoding other than ascii or latin-1.

Changing line 1338 of the pandas.io.stata module to:

    return s.decode('latin-1')

seemed to make everything work, and I could read the data from the given file.
Even better, changing it to the following:

    return s.decode(self._encoding or self._default_encoding)

also seems to have made it work.
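A minimal sketch of how that fallback behaves, using a hypothetical stand-in class rather than the real StataReader. Only the attribute names `_encoding` and `_default_encoding` come from the line above; the class, the method name, and the default value of `'latin-1'` are assumptions made for illustration.

```python
# -*- coding: utf-8 -*-
class DecoderSketch(object):
    """Hypothetical stand-in for the reader; this is not pandas code."""

    # Assumed default, chosen to match the latin-1 fix described above.
    _default_encoding = "latin-1"

    def __init__(self, encoding=None):
        self._encoding = encoding

    def decode_field(self, s):
        # Use the caller's encoding when given, else fall back to the default.
        return s.decode(self._encoding or self._default_encoding)


# No encoding passed: the default is used.
default_result = DecoderSketch().decode_field(b"\xa3100")
# Explicit encoding passed: it takes precedence over the default.
explicit_result = DecoderSketch(encoding="utf-8").decode_field(b"\xc2\xa3100")
```

With this shape, the `encoding` argument passed by the caller is what actually reaches `decode`, which is the behaviour the report expects from `read_stata`.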

I believe, though, that to make this work with Unicode too, you'd have to add the following encodings to VALID_ENCODINGS: utf-8, utf8, iso10646.
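As an aside, listing every spelling of an encoding by hand (utf-8, utf8, ...) can be avoided by normalizing names through the standard `codecs` module. The sketch below is not how pandas checks encodings; `ALLOWED` and `is_allowed_encoding` are hypothetical names, and the allow-list contents are just an example.

```python
import codecs

# Hypothetical allow-list, stored as canonical codec names.
ALLOWED = {codecs.lookup(name).name for name in ("ascii", "latin-1", "utf-8")}


def is_allowed_encoding(name):
    """Return True if `name` is an alias of one of the allowed codecs."""
    try:
        return codecs.lookup(name).name in ALLOWED
    except LookupError:
        # Python does not know this encoding name at all.
        return False
```

This only recognizes aliases that Python's codec registry knows, so a name outside that registry would still need explicit handling.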

Expected Output

The file should be read and parsed correctly.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-37-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: ro_RO.UTF-8
LANG: ro_RO.UTF-8
LOCALE: None.None

pandas: 0.24.0.dev0+41.gb2eec25
pytest: 3.2.3
pip: 9.0.3
setuptools: 36.6.0
Cython: 0.28.2
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.6.3
patsy: None
dateutil: 2.7.3
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.4.9
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 3.8.0
bs4: None
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
