UnicodeDecodeError for Stata file · Issue #25960 · pandas-dev/pandas

Code Sample, a copy-pastable example if possible

import pandas as pd
pd.read_stata('mwe.dta')

mwe.dta is available here: mwe.zip
This file is a derivative of The Supreme Court Database.

Problem description

The command raises

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa4 in position 20: invalid start byte
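
The same message can be reproduced without the file. The sketch below assumes a made-up byte string (20 ASCII characters followed by 0xa4), not the actual label from mwe.dta; it only shows that 0xa4 cannot start a UTF-8 sequence, while latin-1 accepts it.

```python
# Hypothetical label bytes: 20 ASCII characters followed by the byte 0xa4.
# These are NOT the real contents of mwe.dta, just a stand-in.
raw = b"x" * 20 + b"\xa4"

error_message = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0xa4 is a UTF-8 continuation byte, so it is invalid as a start byte.
    error_message = str(exc)
    print(error_message)

# latin-1 maps every byte value to a code point, so the same bytes decode.
decoded = raw.decode("latin-1")
print(repr(decoded[-1]))
```

Whether latin-1 is the encoding the file was actually written in is a separate question; it merely never raises.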

I traced the error to a value label containing that byte.
This is a follow-up for #21244 and #23736.
Changing line 1334 of pandas.io.stata so that it decodes as latin-1 rather than utf-8, i.e. to

return s.decode('latin-1')

allows me to read in the file.
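
A less invasive variant of that change might try utf-8 first and only fall back to latin-1 when decoding fails. This is a minimal sketch of the idea, not the pandas code: `decode_label` is a hypothetical helper name, and the sample byte strings are made up.

```python
def decode_label(raw: bytes) -> str:
    """Decode label bytes as UTF-8, falling back to latin-1.

    Hypothetical helper for illustration; not part of pandas.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 never fails, but it may not be the encoding
        # the file was originally written in.
        return raw.decode("latin-1")


# Valid UTF-8 input is returned unchanged.
print(decode_label("café".encode("utf-8")))
# Input containing 0xa4 takes the latin-1 fallback.
print(decode_label(b"price in \xa4"))
```

Trying utf-8 first keeps files that really are UTF-8 intact, at the cost of possibly mis-rendering bytes from files written in some other single-byte encoding.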

Expected Output

The file should be correctly read and parsed.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.2
pytest: None
pip: 19.0.3
setuptools: 40.6.2
Cython: None
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: 7.4.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None