issue with StataReader for stata files versions 108 and older · Issue #12232 · pandas-dev/pandas

I am having an issue with the StataReader class, which is found in stata.py ("pandas/io/stata.py").
I am using pandas 0.17.1.

The following is the Python code I am trying to run:

import sys
reload(sys).setdefaultencoding('utf-8')  
import pandas as pd
from pandas.io import stata

sr=stata.StataReader(fileName)

where fileName is a stata file.

The following code is part of the _read_old_header method (which starts on line 1184) of the StataReader class in stata.py. It is called during the initialization of a StataReader object:

if self.format_version > 108:
    typlist = [ord(self.path_or_buf.read(1))
        for i in range(self.nvar)]
else:
    typlist = [
        self.OLD_TYPE_MAPPING[
            self._decode_bytes(self.path_or_buf.read(1))
        ] for i in range(self.nvar)
    ]

I get no errors when my stata files are newer than version 108, but with files that are version 105 there seems to be a bug in _decode_bytes. The code above calls _decode_bytes with self and only one additional argument, the string returned by path_or_buf.read(1).

Here is the method _decode_bytes (line 896):

def _decode_bytes(self, str, errors=None):
    if compat.PY3 or self._encoding is not None:
        return str.decode(self._encoding, errors)
    else:
        return str

When no third argument is passed in (as is the case when it is called by _read_old_header), the argument "errors" is set to None. Here is where the error is thrown. The error is:

TypeError: decode() argument 2 must be string, not None

That is the issue: the decode method of the string class expects its second argument to be a string, but _decode_bytes passes errors=None by default whenever the caller does not supply it.
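To illustrate, here is a minimal sketch of the failure and of one possible fix. It is not the actual pandas code: decode_bytes is a hypothetical standalone stand-in for StataReader._decode_bytes, and it simplifies the original condition (it checks only whether an encoding is set, ignoring compat.PY3). The idea is to pass errors to decode() only when the caller actually provided it.

```python
def decode_bytes(raw, encoding, errors=None):
    """Hypothetical stand-in for StataReader._decode_bytes (simplified)."""
    if encoding is None:
        # No encoding configured: return the raw bytes unchanged.
        return raw
    if errors is None:
        # Omit the argument so decode() uses its own default ('strict').
        return raw.decode(encoding)
    return raw.decode(encoding, errors)


# Reproduce the reported failure: an explicit None for errors raises TypeError.
try:
    b"\x66".decode("latin-1", None)
    raised = False
except TypeError:
    raised = True

# The guarded helper decodes the same single byte without error.
type_code = decode_bytes(b"\x66", "latin-1")
```

An alternative with the same effect would be to change the default to errors='strict', which is the default that decode() itself uses.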