pandas-dev/pandas
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
The problem occurs when reading an external file, so unfortunately this
requires the data file linked in the issue description.
import pandas as pd
df = pd.read_stata("bdendo.dta")
Issue Description
A previously closed issue (#26667) is related to this, but at that time it appears that documentation or examples of this format were not available.
Support for Stata 104-format data files was added in commit 703834c (as long as the data only contained values stored as int, float or byte). I have confirmed that this worked correctly at least up to the Pandas version 0.15.2-1 included with Fedora 21.
This support was broken in commit d194844, which split _read_header() into _read_new_header() and _read_old_header(). During that split, the condition
if self.format_version > 104:
was removed from before the line
self.time_stamp = self._null_terminate(self.path_or_buf.read(18))
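To illustrate what the removed guard did, here is a minimal standalone sketch. It is not the pandas implementation: the function name read_time_stamp, the latin-1 decoding, and the sample bytes are assumptions made for illustration. Only the version check and the 18-byte read come from the lines quoted above.

```python
import io


def read_time_stamp(buf, format_version):
    """Return the header time stamp, or None when the guard skips the read.

    Hypothetical helper that mirrors the guard quoted in the issue.
    """
    # Only versions above 104 take the 18-byte time stamp read.
    if format_version > 104:
        raw = buf.read(18)
        # Keep the bytes before the first NUL (a null-terminated string).
        return raw.split(b"\x00", 1)[0].decode("latin-1")
    # For version 104 nothing is consumed from the buffer.
    return None


# Made-up header bytes: a 17-character time stamp plus a NUL terminator.
sample = io.BytesIO(b"25 Apr 1995 10:30\x00rest")
print(read_time_stamp(sample, 105))
print(read_time_stamp(io.BytesIO(b"rest"), 104))
```

With the guard in place, a version 104 header leaves the buffer position untouched. Without it, the same 18 bytes are read regardless of the format version.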
Some years later, pull request #34116 changed the _version_error string to remove 104 from the list of supported format versions. However, it did not change the actual check in _read_old_header() that triggers the message, so 104 still passes it:
if self._format_version not in [104, 105, 108, 111, 113, 114, 115]:
    raise ValueError(_version_error.format(version=self._format_version))
As a result, attempting to read a file in this format raises a bare ValueError with no message instead of a helpful error:
>>> pandas.read_stata("bdendo.dta")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 2109, in read_stata
    return reader.read()
           ^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1683, in read
    self._ensure_open()
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1175, in _ensure_open
    self._open_file()
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1205, in _open_file
    self._read_header()
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1287, in _read_header
    self._read_old_header(first_char)
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1464, in _read_old_header
    self._time_stamp = self._get_time_stamp()
                       ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1436, in _get_time_stamp
    raise ValueError()
ValueError
The test file used above is \ssa8\dta\bdendo.dta contained in https://web.archive.org/web/19991011204243/http://lib.stat.cmu.edu/stata/STB/stb27v4.zip
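Until this is resolved, a caller could inspect the file before handing it to read_stata. The sketch below is a best-effort check, not part of pandas. The function names are hypothetical. It assumes that older formats store the version number in the first byte and that newer formats begin with the literal text <stata_dta><header><release> followed by a three-digit version.

```python
def version_from_header(head):
    """Guess the Stata format version from the first bytes of a .dta file."""
    if not head:
        raise ValueError("empty header")
    if head[:1] == b"<":
        # Assumed layout for newer formats: <stata_dta><header><release>NNN
        prefix = b"<stata_dta><header><release>"
        digits = head[len(prefix):len(prefix) + 3]
        if head.startswith(prefix) and len(digits) == 3 and digits.isdigit():
            return int(digits)
        raise ValueError("unrecognised header")
    # Assumed layout for older formats: the first byte is the version number.
    return head[0]


def stata_format_version(path):
    """Read just enough of the file at `path` to guess its format version."""
    with open(path, "rb") as fh:
        return version_from_header(fh.read(31))
```

The result can then be compared against the versions a given pandas release is known to read, so that an unsupported file is reported clearly rather than failing part-way through the header.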
Expected Behavior
I would expect either an error message indicating that the format is not supported, i.e.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 2109, in read_stata
    return reader.read()
           ^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1683, in read
    self._ensure_open()
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1175, in _ensure_open
    self._open_file()
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1205, in _open_file
    self._read_header()
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1287, in _read_header
    self._read_old_header(first_char)
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1453, in _read_old_header
    raise ValueError(_version_error.format(version=self._format_version))
ValueError: Version of given Stata file is 104. pandas supports importing versions 105, 108, 111 (Stata 7SE), 113 (Stata 8/9), 114 (Stata 10/11), 115 (Stata 12), 117 (Stata 13), 118 (Stata 14/15/16),and 119 (Stata 15/16, over 32,767 variables).
or, if support were restored, for the data to be loaded:
>>> df
     set  fail  gall  hyp   ob  est  dos  dur  non  duration  age  cest  agegrp
0      1     1     0    0  1.0    1    3  4.0    1        96   74   3.0     4.0
1      1     0     0    0  NaN    0    0  0.0    0         0   75   0.0     4.0
2      1     0     0    0  NaN    0    0  0.0    0         0   74   0.0     4.0
3      1     0     0    0  NaN    0    0  0.0    0         0   74   0.0     4.0
4      1     0     0    0  1.0    1    1  3.0    1        48   75   1.0     4.0
..   ...   ...   ...  ...  ...  ...  ...  ...  ...       ...  ...   ...     ...
310   63     1     1    0  1.0    1    3  2.0    1        18   68   3.0     3.0
311   63     0     0    0  1.0    1    2  4.0    1        96   69   2.0     3.0
312   63     0     0    0  NaN    0    0  0.0    0         0   70   0.0     3.0
313   63     0     0    1  1.0    1    2  3.0    1        92   69   2.0     3.0
314   63     0     0    1  0.0    1    3  3.0    1        59   69   3.0     3.0

[315 rows x 13 columns]
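The first of the two outcomes above amounts to making the version check agree with the error message. A minimal sketch of that idea, written outside pandas, might look like the following. READABLE_OLD_VERSIONS, VERSION_ERROR, and check_old_format_version are hypothetical stand-ins, and the message text is shortened, so this is not the actual pandas code.

```python
# Hypothetical stand-ins; the real list and message live in pandas/io/stata.py.
READABLE_OLD_VERSIONS = [105, 108, 111, 113, 114, 115]
VERSION_ERROR = "Version of given Stata file is {version}, which is not supported."


def check_old_format_version(version):
    """Raise a descriptive ValueError for versions the reader cannot handle."""
    # With 104 absent from the list, the check and the message stay consistent.
    if version not in READABLE_OLD_VERSIONS:
        raise ValueError(VERSION_ERROR.format(version=version))


check_old_format_version(105)  # accepted, returns None

try:
    check_old_format_version(104)
except ValueError as exc:
    print(exc)
```

The alternative outcome, restoring support, would instead keep 104 in the accepted list and skip the header fields that the 104 format does not contain.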
Installed Versions
INSTALLED VERSIONS
commit : d9cdd2e
python : 3.12.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252
pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.5.1
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.2.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.4
numba : None
numexpr : 2.10.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None