UnicodeDecodeError with Latin-1 characters in Stata files · Issue #23736 · pandas-dev/pandas
Steps to reproduce
import pandas as pd
df = pd.read_stata('buggy_file.dta')
Expected behaviour
Pandas reads the Stata file without error.
Actual behaviour
Pandas raises a UnicodeDecodeError, traceable back to this line:
Diagnosis
The error is caused by the “smart quote” character “, which is encoded in Latin-1 in the Stata .dta file but is an invalid byte sequence when decoded as UTF-8.
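The mismatch can be reproduced without a .dta file. The following is a minimal sketch, assuming the offending byte is 0x93 (a common single-byte encoding of the left smart quote); the actual bytes in buggy_file.dta are not shown in this report, so the sample field below is hypothetical.

```python
# Hypothetical null-padded string field, as it might appear in a .dta file.
raw = b"\x93quoted\x94\0\0\0"

# Strip the null padding, as the reader does before decoding.
field = raw.partition(b"\0")[0]

# Decoding as UTF-8 fails: 0x93 is a continuation byte and cannot start a sequence.
try:
    field.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding as Latin-1 cannot fail: every byte value maps to a code point.
latin1_text = field.decode("latin-1")
```

Here `utf8_ok` ends up `False`, while the Latin-1 decode returns a string of the same length as the stripped field.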
The error originates in the StataReader class in io/stata.py:
    def _decode(self, s):
        s = s.partition(b"\0")[0]
        return s.decode('utf-8')
Instead of 'utf-8', Pandas should use self._encoding or self._default_encoding, just like other parts of the code use when reading from the input buffer/file. Making the relevant change on my machine makes the issue go away.
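A rough sketch of the proposed change, written as a stand-alone function so it can be run outside pandas. The function name and its parameters are hypothetical stand-ins for self._encoding and self._default_encoding, and the "latin-1" default is an assumption about the reader's fallback; this is not the actual pandas patch.

```python
def decode_stata_string(s, encoding=None, default_encoding="latin-1"):
    """Hypothetical stand-alone version of the suggested _decode fix.

    `encoding` and `default_encoding` stand in for the reader's
    self._encoding and self._default_encoding attributes.
    """
    # Drop everything from the first null byte onward (field padding).
    s = s.partition(b"\0")[0]
    # Use the file's encoding if known, otherwise the fallback,
    # rather than hard-coding 'utf-8'.
    return s.decode(encoding or default_encoding)


# Latin-1 bytes decode with the fallback encoding.
latin1_result = decode_stata_string(b"caf\xe9\0\0")
# UTF-8 bytes still decode when the encoding is given explicitly.
utf8_result = decode_stata_string(b"caf\xc3\xa9\0", encoding="utf-8")
```

Both calls return the same text, which is the point of routing the decode through the reader's configured encoding.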
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.3
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None