BUG: Index containing NA behaves absolutely unpredictably when length exceeds 128 · Issue #58924 · pandas-dev/pandas
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
OK:
n, val = 127, pd.NA
idx = pd.Index(range(n), dtype="Int64").union(pd.Index([val], dtype="Int64"))
s = pd.Series(index=idx, data=range(n+1), dtype="Int64")
s.drop(0)
Still OK:
n, val = 128, 128
idx = pd.Index(range(n), dtype="Int64").union(pd.Index([val], dtype="Int64"))
s = pd.Series(index=idx, data=range(n+1), dtype="Int64")
s.drop(0)
But this FAILS:
n, val = 128, pd.NA
idx = pd.Index(range(n), dtype="Int64").union(pd.Index([val], dtype="Int64"))
s = pd.Series(index=idx, data=range(n+1), dtype="Int64")
s.drop(0)  # ValueError: 'indices' contains values less than allowed (-128 < -1)
Expected no error
WORKAROUND: to filter out elements, use boolean mask indexing instead of s.drop().
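The workaround snippet itself is not shown here, so the following is only a minimal sketch of what mask-based filtering could look like. The helper name `drop_label_with_mask` is hypothetical, and building the mask with `Index.isin` is one assumed way to do it, not necessarily the approach the reporter had in mind.

```python
import pandas as pd


def drop_label_with_mask(s: pd.Series, label) -> pd.Series:
    """Return `s` without rows labelled `label`, using a boolean mask.

    Hypothetical helper: it selects rows by mask instead of calling s.drop().
    """
    # The NA label is not a member of [label], so its row should be kept.
    keep = ~s.index.isin([label])
    return s[keep]


# Same setup as the failing example: 128 integer labels plus one NA label.
n = 128
idx = pd.Index(range(n), dtype="Int64").union(pd.Index([pd.NA], dtype="Int64"))
s = pd.Series(index=idx, data=range(n + 1), dtype="Int64")

filtered = drop_label_with_mask(s, 0)
print(len(s), len(filtered))
```

Boolean selection picks rows by position, so it does not need to look labels up the way s.drop() does.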
Issue Description
When NA is present in an Index and the length of the Index exceeds 128, label-based operations such as drop() and get_indexer() either raise or return wrong (negative) positions.
This bug can be narrowed down to IndexEngine.get_indexer() or MaskedIndexEngine.get_indexer(), as these examples suggest:
axis = pd.Index(range(250), dtype='Int64').union(pd.Index([pd.NA], dtype='Int64'))
new_axis = axis.drop(0)
axis.get_indexer(new_axis)[-5:]  # array([246, 247, 248, 249, -6])
Expected array([246, 247, 248, 249, 250])
axis = pd.Index(range(254), dtype='Int64').union(pd.Index([pd.NA], dtype='Int64'))
new_axis = axis.drop(0)
axis.get_indexer(new_axis)[-5:]  # array([250, 251, 252, 253, -2])
Expected array([250, 251, 252, 253, 254])
These examples further suggest that the root cause of the bug is in how NA is represented in, and interacts with, the hash tables that Index uses for its _engine.
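One observation that may help narrow this down: the wrong values reported above (-128, -6, -2) are exactly what the true NA positions (128, 250, 254) become when truncated to a signed 8-bit integer. The sketch below only checks that arithmetic. It is an inference from the reported numbers, not a confirmed diagnosis, and `wrap_to_int8` is a hypothetical helper that is not part of pandas.

```python
import ctypes


def wrap_to_int8(position: int) -> int:
    """Reinterpret the low 8 bits of `position` as a signed 8-bit integer."""
    # ctypes integer types do no overflow checking, so this wraps around.
    return ctypes.c_int8(position).value


# NA position in each index from the report -> value after 8-bit wraparound
for na_position in (127, 128, 250, 254):
    print(na_position, "->", wrap_to_int8(na_position))
# 127 -> 127
# 128 -> -128
# 250 -> -6
# 254 -> -2
```

If that reading is right, it would also explain why the n = 127 case works: position 127 is the largest value a signed 8-bit integer can hold.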
Expected Behavior
See above
Installed Versions
commit : 76c7274
python : 3.11.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.216-204.855.amzn2.x86_64
Version : #1 SMP Sat May 4 16:53:27 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 3.0.0.dev0+1067.g76c7274985
numpy : 1.26.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 23.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pyarrow : 15.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.4
qtpy : None
pyqt5 : None