UnicodeDecodeError with Latin-1 characters in Stata files · Issue #23736 · pandas-dev/pandas

Steps to reproduce

df = pd.read_stata('buggy_file.dta')

Expected behaviour

Pandas reads the Stata file without raising an error.

Actual behaviour

Pandas raises a UnicodeDecodeError, traceable back to this line:

Diagnosis

The error is caused by the “smart quote” character “, which is encoded in Latin-1 in the Stata .dta file but is an invalid byte sequence when decoded as UTF-8.
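The failure can be reproduced without a .dta file. The sketch below assumes the offending byte is 0x93, which is how Windows-1252 (often loosely called Latin-1) stores the left smart quote; the issue does not state the actual byte value, and the field contents and the try_decode helper are hypothetical.

```python
# Hypothetical null-padded string field, as it might sit in a .dta file.
# Assumption: the smart quotes are stored as the single bytes 0x93 / 0x94.
raw = b"say \x93hello\x94\x00\x00"


def try_decode(data, encoding):
    """Decode data, returning a short error description instead of raising."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "UnicodeDecodeError: " + exc.reason


# Same null-stripping step as StataReader._decode.
field = raw.partition(b"\0")[0]

utf8_result = try_decode(field, "utf-8")      # fails: 0x93 cannot start a UTF-8 sequence
latin1_result = try_decode(field, "latin-1")  # succeeds: every byte maps to a code point
cp1252_result = try_decode(field, "cp1252")   # succeeds and yields the curly quotes

print(utf8_result)
print(repr(latin1_result))
print(cp1252_result)
```

Strict ISO-8859-1 decoding maps 0x93 to a control character rather than a curly quote, so whether the text displays correctly depends on which single-byte encoding the file really uses. Either way, decoding with a single-byte encoding does not raise.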

The error originates in the StataReader class in io/stata.py:

def _decode(self, s):
    s = s.partition(b"\0")[0]
    return s.decode('utf-8')

Instead of hard-coding 'utf-8', Pandas should use self._encoding or self._default_encoding, as other parts of the code do when reading from the input buffer/file. Making this change on my machine makes the issue go away.
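A minimal sketch of the proposed change, written as a standalone class so it runs without pandas. ReaderSketch is a hypothetical stand-in for StataReader, and the 'latin-1' default is an assumption for illustration; only the attribute names _encoding and _default_encoding and the body of _decode come from the code quoted above.

```python
class ReaderSketch:
    """Hypothetical stand-in for StataReader; not the actual pandas class."""

    # Assumption: a single-byte default, used when no encoding is supplied.
    _default_encoding = "latin-1"

    def __init__(self, encoding=None):
        # Fall back to the class default when the caller gives no encoding.
        self._encoding = encoding or self._default_encoding

    def _decode(self, s):
        # Drop everything from the first null byte onward (field padding).
        s = s.partition(b"\0")[0]
        # Decode with the reader's encoding instead of a hard-coded 'utf-8'.
        return s.decode(self._encoding)


reader = ReaderSketch()
label = reader._decode(b"caf\xe9\x00\x00")  # 0xE9 is e-acute in Latin-1
print(label)
```

A caller who knows the file is UTF-8 can still get the old behaviour by passing the encoding explicitly, for example ReaderSketch("utf-8").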

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None