to_stata + read_stata results in NaNs (close to double precision limit) · Issue #14618 · pandas-dev/pandas

Explanation

Writing a DataFrame with to_stata and reading it back with read_stata turns many values into NaN when the values are close to the double precision limit.

I think the code and output are fairly self-explanatory; if not, please ask.

I've not been able to test this on other systems yet.

If this is somehow expected behaviour, maybe a bigger warning would be in order.
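As a rough illustration of what such a warning could look like on the user side, here is a minimal sketch of a check to run before exporting. It assumes that the largest double stored as a regular value in a .dta file is about 8.988e+307, which is half of the IEEE 754 double maximum. That limit is an assumption here, not something confirmed in this report. The helper names `columns_above_limit` and `warn_before_export` are hypothetical and are not part of pandas.

```python
import sys
import warnings

# Assumption: largest double treated as a regular (non-missing) value
# in a .dta file, about 8.988e+307 (half of the IEEE 754 double maximum).
ASSUMED_STATA_DOUBLE_MAX = sys.float_info.max / 2


def columns_above_limit(columns, limit=ASSUMED_STATA_DOUBLE_MAX):
    """Return sorted names of columns with at least one value above `limit`.

    `columns` is a plain mapping of column name -> iterable of floats.
    """
    return sorted(name for name, values in columns.items()
                  if any(v > limit for v in values))


def warn_before_export(columns):
    """Emit a UserWarning naming columns that may lose values on export."""
    flagged = columns_above_limit(columns)
    if flagged:
        warnings.warn(
            "values above {0:.4e} in columns {1} may be read back as NaN"
            .format(ASSUMED_STATA_DOUBLE_MAX, ", ".join(flagged)),
            UserWarning,
        )
    return flagged


# Two short columns using values taken from the outputs in this report.
sample = {"c000": [1.502566e+308, -3.418435e+307],
          "c002": [-1.169342e+308, 2.544741e+306]}
flagged = warn_before_export(sample)
```

With a DataFrame, the same idea could be applied by passing something like `frame.to_dict('list')`, at the cost of copying the data.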

A small, complete example of the issue

from numpy.random.mtrand import RandomState
from pandas import DataFrame, read_stata

pth = '/tmp/demo.dta'
rs = RandomState(seed=123456789)
# Uniform values in [-1, 1), scaled by the largest finite float64
data = (2 * rs.rand(1000, 400).astype('float64') - 1) * 1.7976931348623157e+308

colnames = tuple('c{0:03d}'.format(k) for k in range(data.shape[1]))
frame = DataFrame(data=data, columns=colnames)
# .dta is a binary format, so open the file in binary mode
with open(pth, 'wb') as fh:
    frame.to_stata(fh)

with open(pth, 'rb') as fh:
    frame2 = read_stata(fh)

# Show the last five rows of the round-tripped frame
print(frame2.tail())

Expected Output

     index          c000           c001           c002           c003  \
995    995  1.502566e+308  1.019238e+308 -1.169342e+308  6.845363e+307
996    996 -3.418435e+307 -8.113486e+307  2.544741e+306  5.771775e+307
997    997  1.507324e+308  4.610183e+307 -1.016633e+308 -1.632862e+308
998    998 -8.138620e+307  6.312126e+307 -6.557370e+307  6.342690e+307
999    999 -1.179032e+308  1.554709e+308 -1.175680e+308  1.921731e+307

              c004           c005           c006           c007  \
995  1.611898e+308 -5.171776e+307 -8.918000e+307 -5.322720e+307
996  3.693405e+307 -1.480267e+308  1.586053e+308  7.489689e+306
997  1.060605e+308 -6.826590e+307  1.644990e+308 -1.379562e+308
998  1.379642e+308  1.005632e+307 -1.206948e+308 -1.198931e+308
999 -5.965607e+307  8.844623e+307  2.727894e+307 -5.433995e+307

              c008           c009      ...                 c390  \
995 -6.580851e+306  1.284482e+308      ...       -1.770789e+308
996 -9.312612e+307 -1.778315e+308      ...        7.410784e+307
997 -9.415141e+307  9.058828e+307      ...       -5.451829e+305
998  1.651712e+308  4.435415e+307      ...        5.220773e+307
999 -1.747738e+308 -1.603248e+308      ...        1.415798e+307

              c391           c392           c393           c394  \
995  7.360232e+307 -3.850417e+307  1.453624e+308  5.690363e+307
996 -6.943490e+307  1.047268e+308  4.026712e+307  9.161669e+305
997  4.406343e+306  1.617739e+308  4.218585e+307  1.573892e+307
998 -2.390131e+307 -6.649416e+307  6.548489e+307  1.000078e+307
999 -1.239203e+308 -5.038284e+307 -1.340608e+307 -1.193758e+308

              c395           c396           c397           c398           c399
995  8.371989e+307  3.491895e+307  7.344525e+307 -9.260950e+307  1.032120e+308
996  9.200510e+307 -1.729595e+308  4.021503e+307  2.274318e+307  5.856302e+307
997 -7.624901e+307 -1.206386e+308 -6.164537e+306 -7.634148e+307 -1.462809e+308
998 -9.399560e+307  9.697224e+307 -6.963726e+307 -1.655656e+308  1.513218e+308
999 -1.476121e+308  1.187603e+308  1.402195e+308 -1.584051e+308 -1.232190e+308

[5 rows x 401 columns]

Actual Output

     index           c000           c001           c002           c003  \
995    995            NaN            NaN -1.169342e+308  6.845363e+307
996    996 -3.418435e+307 -8.113486e+307  2.544741e+306  5.771775e+307
997    997            NaN  4.610183e+307 -1.016633e+308 -1.632862e+308
998    998 -8.138620e+307  6.312126e+307 -6.557370e+307  6.342690e+307
999    999 -1.179032e+308            NaN -1.175680e+308  1.921731e+307

              c004           c005           c006           c007  \
995            NaN -5.171776e+307 -8.918000e+307 -5.322720e+307
996  3.693405e+307 -1.480267e+308            NaN  7.489689e+306
997            NaN -6.826590e+307            NaN -1.379562e+308
998            NaN  1.005632e+307 -1.206948e+308 -1.198931e+308
999 -5.965607e+307  8.844623e+307  2.727894e+307 -5.433995e+307

              c008      ...                 c390           c391  \
995 -6.580851e+306      ...       -1.770789e+308  7.360232e+307
996 -9.312612e+307      ...        7.410784e+307 -6.943490e+307
997 -9.415141e+307      ...       -5.451829e+305  4.406343e+306
998            NaN      ...        5.220773e+307 -2.390131e+307
999 -1.747738e+308      ...        1.415798e+307 -1.239203e+308

              c392           c393           c394           c395  \
995 -3.850417e+307            NaN  5.690363e+307  8.371989e+307
996            NaN  4.026712e+307  9.161669e+305            NaN
997            NaN  4.218585e+307  1.573892e+307 -7.624901e+307
998 -6.649416e+307  6.548489e+307  1.000078e+307 -9.399560e+307
999 -5.038284e+307 -1.340608e+307 -1.193758e+308 -1.476121e+308

              c396           c397           c398           c399
995  3.491895e+307  7.344525e+307 -9.260950e+307            NaN
996 -1.729595e+308  4.021503e+307  2.274318e+307  5.856302e+307
997 -1.206386e+308 -6.164537e+306 -7.634148e+307 -1.462809e+308
998            NaN -6.963726e+307 -1.655656e+308            NaN
999            NaN            NaN -1.584051e+308 -1.232190e+308

[5 rows x 401 columns]
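Comparing the two outputs, every NaN shown above replaces a positive value larger than roughly 8.988e+307, while negative values of similar magnitude survive. The sketch below checks that observation for the first six columns of row 995. It assumes that the bit pattern 0x7fdfffffffffffff marks the largest regular double in a .dta file and that larger values are read back as missing. That threshold is an assumption and is not stated in this report.

```python
import struct

# Assumption: 0x7fdfffffffffffff is the largest regular (non-missing) double.
STATA_DOUBLE_MAX = struct.unpack(">d", bytes.fromhex("7fdfffffffffffff"))[0]

# Row 995: (value from the expected output, is it NaN in the actual output?)
row_995 = {
    "c000": (1.502566e+308, True),
    "c001": (1.019238e+308, True),
    "c002": (-1.169342e+308, False),
    "c003": (6.845363e+307, False),
    "c004": (1.611898e+308, True),
    "c005": (-5.171776e+307, False),
}

# Predict a NaN wherever the expected value exceeds the assumed limit.
predicted = {name: value > STATA_DOUBLE_MAX
             for name, (value, _) in row_995.items()}
observed = {name: is_nan for name, (_, is_nan) in row_995.items()}

print(predicted == observed)
```

If the prediction matches for the remaining rows as well, the NaNs would come from a range limit of the file format rather than from random corruption.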

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-45-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8

pandas: 0.18.1
nose: None
pip: 9.0.1
setuptools: 26.1.0
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None