UnicodeDecodeError with Latin-1 characters in Stata files · Issue #23736 · pandas-dev/pandas

Steps to reproduce

import pandas as pd

df = pd.read_stata('buggy_file.dta')

Expected behaviour

Pandas reads the Stata file without error.

Actual behaviour

Pandas raises a UnicodeDecodeError, traceable back to this line:

Diagnosis

The error is caused by the “smart quote” character “, which is encoded in Latin-1 in the Stata .dta file but is an invalid byte sequence when decoded as UTF-8.
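
The failure mode can be reproduced without a .dta file. The sketch below is only an illustration: the byte value 0x93 is an assumption (it is how Windows-1252 encodes the left smart quote; the actual bytes in the affected file are not shown in this report), and `raw` is a made-up, null-padded field.

```python
# Hypothetical null-padded string field containing single-byte "smart quote"
# bytes (0x93 / 0x94 are assumed values, not taken from the real file).
raw = b"\x93quoted\x94\0\0\0"

# Same null-stripping step that StataReader._decode performs.
field = raw.partition(b"\0")[0]

try:
    field.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0x93 on its own is a UTF-8 continuation byte, so it cannot start a character.
    utf8_ok = False

# A single-byte encoding such as Latin-1 maps every byte to a code point,
# so decoding with it never raises.
latin1_text = field.decode("latin-1")

print(utf8_ok, len(latin1_text))
```

Decoding with a single-byte encoding always succeeds, which is why switching away from the hard-coded 'utf-8' avoids the exception.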

The error originates in the StataReader class in io/stata.py:

def _decode(self, s):
    s = s.partition(b"\0")[0]
    return s.decode('utf-8')

Instead of hard-coding 'utf-8', Pandas should use self._encoding or self._default_encoding, as other parts of the code do when reading from the input buffer/file. Making that change on my machine makes the issue go away.
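
A minimal sketch of the suggested change, written as a standalone class so it can be run on its own. `FieldDecoder` is a hypothetical stand-in for the reader, not pandas code, and the 'latin-1' default is an assumption made for illustration; only the `_decode` body and the `_encoding` attribute name come from the snippet and suggestion above.

```python
class FieldDecoder:
    """Hypothetical stand-in for StataReader, used only to show the idea."""

    def __init__(self, encoding="latin-1"):
        # Assumed default; the real reader would determine this itself.
        self._encoding = encoding

    def _decode(self, s):
        # Drop the null padding, then decode with the configured encoding
        # instead of a hard-coded 'utf-8'.
        s = s.partition(b"\0")[0]
        return s.decode(self._encoding)


# A Latin-1 encoded field (0xE9 is "é") decodes cleanly.
latin1_value = FieldDecoder()._decode(b"caf\xe9\0\0\0")

# UTF-8 data still works when the encoding is set accordingly.
utf8_value = FieldDecoder("utf-8")._decode("café".encode("utf-8") + b"\0")

print(latin1_value, utf8_value)
```

Keeping the encoding on the instance means the string fields are decoded the same way as the rest of the data read from the file.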

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8