
Pandas version checks

Reproducible Example

    # The problem occurs when reading an external file, so unfortunately this
    # requires a data file linked to in the issue description
    import pandas as pd

    df = pd.read_stata("bdendo.dta")

Issue Description

A previously closed issue (#26667) is related to this, but at that time it appears that documentation or examples of this format were not available.

Support for Stata 104-format data files was added in commit 703834c (as long as the data only contained values stored as int, float or byte). I have confirmed that this worked correctly at least up to the Pandas version 0.15.2-1 included with Fedora 21.

This support was broken in commit d194844, when _read_header() was split into _read_new_header() and _read_old_header(). During that split, the condition if self.format_version > 104: was removed from before the line self.time_stamp = self._null_terminate(self.path_or_buf.read(18)), so the time stamp read is now attempted for format 104 as well.
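The effect of that guard can be illustrated with a minimal sketch. This is not the pandas source: read_time_stamp is a hypothetical helper, and the sketch assumes (based on the removed condition) that a format 104 header carries no 18-byte time stamp field, so the read has to be skipped for it.

```python
import io


def read_time_stamp(buf, format_version):
    """Return the header time stamp, or None for formats that lack one.

    Hypothetical helper mirroring the removed `format_version > 104` guard.
    """
    if format_version > 104:
        raw = buf.read(18)
        # The field is null-terminated; keep only the text before the first NUL.
        return raw.split(b"\x00", 1)[0].decode("latin-1")
    # Assumption: a 104 header has no time stamp, so nothing is consumed.
    return None


# A made-up 18-byte time stamp field followed by the next header bytes.
field = b"11 Oct 1999 20:42\x00"
newer = io.BytesIO(field + b"NEXT")
older = io.BytesIO(b"NEXT")

stamp_105 = read_time_stamp(newer, 105)
stamp_104 = read_time_stamp(older, 104)
```

Without the guard, the 104 case would consume 18 bytes that belong to the following header fields, which is consistent with the failure reported in this issue.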

Some years later, pull request #34116 changed the _version_error string to remove 104 from the list of supported format versions. However, it did not change the actual check in _read_old_header() that would trigger the message:

    if self._format_version not in [104, 105, 108, 111, 113, 114, 115]:
        raise ValueError(_version_error.format(version=self._format_version))

As a result, attempting to read a file in this format raises a bare ValueError from _get_time_stamp() instead of a helpful error message:

    >>> pandas.read_stata("bdendo.dta")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 2109, in read_stata
        return reader.read()
               ^^^^^^^^^^^^^
      File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1683, in read
        self._ensure_open()
      File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1175, in _ensure_open
        self._open_file()
      File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1205, in _open_file
        self._read_header()
      File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1287, in _read_header
        self._read_old_header(first_char)
      File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1464, in _read_old_header
        self._time_stamp = self._get_time_stamp()
                           ^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1436, in _get_time_stamp
        raise ValueError()
    ValueError
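One way to keep a version check and its error message from drifting apart is to build both from the same data. The sketch below is an illustration only: SUPPORTED_OLD_VERSIONS, version_error and check_old_version are hypothetical names rather than pandas identifiers, the message wording is shortened, and the version list is the pre-117 part of the list quoted in the message after #34116 (that is, without 104).

```python
# Hypothetical single source of truth for the older (pre-117) format versions.
SUPPORTED_OLD_VERSIONS = (105, 108, 111, 113, 114, 115)


def version_error(version):
    """Build the error text from the same tuple that the check uses."""
    supported = ", ".join(str(v) for v in SUPPORTED_OLD_VERSIONS)
    return (
        f"Version of given Stata file is {version}. "
        f"Supported old-format versions: {supported}."
    )


def check_old_version(version):
    """Raise a descriptive ValueError for any version not in the tuple."""
    if version not in SUPPORTED_OLD_VERSIONS:
        raise ValueError(version_error(version))


try:
    check_old_version(104)
except ValueError as exc:
    message = str(exc)
```

With this arrangement, removing a version from the tuple changes the check and the message together, so a file in an unsupported format is rejected before any header fields are read.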

The test file used above is \ssa8\dta\bdendo.dta contained in https://web.archive.org/web/19991011204243/http://lib.stat.cmu.edu/stata/STB/stb27v4.zip
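To confirm which format version a file declares before handing it to pandas, its first bytes can be inspected. This is a sketch under two assumptions not stated in this issue: that older files store the format version as their first byte, and that newer files start with the text <stata_dta><header><release> followed by a three-digit version. peek_dta_version is a hypothetical helper, and the byte strings below are synthetic stand-ins, not real data files.

```python
import io


def peek_dta_version(buf):
    """Return the format version declared at the start of a .dta stream."""
    first = buf.read(1)
    if not first:
        raise ValueError("empty file")
    if first == b"<":
        # Assumed newer layout: "<stata_dta><header><release>NNN</release>..."
        buf.read(27)  # skip the rest of "<stata_dta><header><release>"
        return int(buf.read(3))
    # Assumed older layout: the first byte is the version number itself.
    return first[0]


# Synthetic header prefixes for illustration only.
old_style = io.BytesIO(bytes([104, 2, 1, 0]))
new_style = io.BytesIO(b"<stata_dta><header><release>118</release>")
```

For the file from this report, the call would look like peek_dta_version(open("bdendo.dta", "rb")), which should return 104 if the file is in the format described here.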

Expected Behavior

I would expect either an error message indicating that the format is not supported, i.e.

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 2109, in read_stata
        return reader.read()
               ^^^^^^^^^^^^^
      File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1683, in read
        self._ensure_open()
      File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1175, in _ensure_open
        self._open_file()
      File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1205, in _open_file
        self._read_header()
      File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1287, in _read_header
        self._read_old_header(first_char)
      File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1453, in _read_old_header
        raise ValueError(_version_error.format(version=self._format_version))
    ValueError: Version of given Stata file is 104. pandas supports importing versions 105, 108, 111 (Stata 7SE), 113 (Stata 8/9), 114 (Stata 10/11), 115 (Stata 12), 117 (Stata 13), 118 (Stata 14/15/16),and 119 (Stata 15/16, over 32,767 variables).

or, if support were restored, the data to be loaded:

    >>> df
         set  fail  gall  hyp   ob  est  dos  dur  non  duration  age  cest  agegrp
    0      1     1     0    0  1.0    1    3  4.0    1        96   74   3.0     4.0
    1      1     0     0    0  NaN    0    0  0.0    0         0   75   0.0     4.0
    2      1     0     0    0  NaN    0    0  0.0    0         0   74   0.0     4.0
    3      1     0     0    0  NaN    0    0  0.0    0         0   74   0.0     4.0
    4      1     0     0    0  1.0    1    1  3.0    1        48   75   1.0     4.0
    ..   ...   ...   ...  ...  ...  ...  ...  ...  ...       ...  ...   ...     ...
    310   63     1     1    0  1.0    1    3  2.0    1        18   68   3.0     3.0
    311   63     0     0    0  1.0    1    2  4.0    1        96   69   2.0     3.0
    312   63     0     0    0  NaN    0    0  0.0    0         0   70   0.0     3.0
    313   63     0     0    1  1.0    1    2  3.0    1        92   69   2.0     3.0
    314   63     0     0    1  0.0    1    3  3.0    1        59   69   3.0     3.0

    [315 rows x 13 columns]

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.12.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.5.1
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.2.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.4
numba : None
numexpr : 2.10.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None