BUG: Index containing NA behaves absolutely unpredictably when length exceeds 128 · Issue #58924 · pandas-dev/pandas

Pandas version checks

Reproducible Example

import pandas as pd

OK:

n, val = 127, pd.NA
idx = pd.Index(range(n), dtype="Int64").union(pd.Index([val], dtype="Int64"))
s = pd.Series(index=idx, data=range(n+1), dtype="Int64")
s.drop(0)

Still OK:

n, val = 128, 128
idx = pd.Index(range(n), dtype="Int64").union(pd.Index([val], dtype="Int64"))
s = pd.Series(index=idx, data=range(n+1), dtype="Int64")
s.drop(0)

But this FAILS:

n, val = 128, pd.NA
idx = pd.Index(range(n), dtype="Int64").union(pd.Index([val], dtype="Int64"))
s = pd.Series(index=idx, data=range(n+1), dtype="Int64")
s.drop(0)
# ValueError: 'indices' contains values less than allowed (-128 < -1)

Expected no error

Workaround: to filter out elements, use boolean-mask indexing instead of s.drop().
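A minimal sketch of that workaround, under a few assumptions: the index is built directly from a list rather than with union() as in the report, and the names mask and filtered are purely illustrative. Comparing a nullable index against a scalar can produce NA at the NA label, so the sketch fills that position with True to keep the NA row, as s.drop(0) would be expected to do.

```python
import pandas as pd

n = 128
# Same shape as the failing case: labels 0..127 followed by a single NA.
# (Assumption: built from a list here instead of via union().)
idx = pd.Index(list(range(n)) + [pd.NA], dtype="Int64")
s = pd.Series(index=idx, data=range(n + 1), dtype="Int64")

# The comparison may yield NA at the NA label; treat that position as
# "keep" so that only label 0 is removed.
mask = pd.Series(s.index != 0).fillna(True).to_numpy(dtype=bool)

# Boolean selection is positional, so no label lookup is involved.
filtered = s[mask]
```

Because the selection is done by position, it should not depend on the label lookup that s.drop() goes through.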

Issue Description

When an Index contains NA and its length exceeds 128, label-based operations on it either raise or silently return wrong positions.

This bug can be narrowed down to IndexEngine.get_indexer() or MaskedIndexEngine.get_indexer(), as these examples suggest:

axis = pd.Index(range(250), dtype='Int64').union(pd.Index([pd.NA], dtype='Int64'))
new_axis = axis.drop(0)
axis.get_indexer(new_axis)[-5:] # array([246, 247, 248, 249, -6])

Expected array([246, 247, 248, 249, 250])

axis = pd.Index(range(254), dtype='Int64').union(pd.Index([pd.NA], dtype='Int64'))
new_axis = axis.drop(0)
axis.get_indexer(new_axis)[-5:] # array([250, 251, 252, 253, -2])

Expected array([250, 251, 252, 253, 254])
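The two checks above can be folded into a small helper that compares get_indexer() against the positions one would expect. This is only a sketch: get_indexer_roundtrip_ok is a hypothetical name, the index is built from a list instead of union(), and label 0 is removed by slicing so that the helper itself does not depend on drop().

```python
import numpy as np
import pandas as pd


def get_indexer_roundtrip_ok(n: int) -> bool:
    """Return True if get_indexer maps every label of axis[1:] to its position."""
    # Labels 0..n-1 followed by a single NA, so the NA sits at position n.
    axis = pd.Index(list(range(n)) + [pd.NA], dtype="Int64")
    new_axis = axis[1:]  # remove label 0 by position, not by label
    expected = np.arange(1, n + 1)
    return bool(np.array_equal(axis.get_indexer(new_axis), expected))


# Probe a few sizes around the threshold mentioned in the report.
for size in (10, 127, 128, 250):
    print(size, get_indexer_roundtrip_ok(size))
```

On an affected build the helper would be expected to return False once n reaches 128; on a fixed build it should return True for every size.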

These examples further suggest that the root cause of the bug lies in how NA is represented in, and interacts with, the hash tables that Index uses for its _engine.
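One observation that may help narrow this down: the wrong values in the examples above (-128 in the error message, then -6 and -2) are exactly what the expected NA positions (128, 250, 254) become when reinterpreted as signed 8-bit integers. The snippet below only demonstrates that arithmetic with NumPy. It is not a confirmed diagnosis of the pandas internals.

```python
import numpy as np

# Expected positions of the NA label in the three failing examples.
expected_positions = np.array([128, 250, 254], dtype=np.int64)

# Casting to int8 wraps around modulo 256, as a C-style cast would.
as_int8 = expected_positions.astype(np.int8)

print(as_int8.tolist())  # [-128, -6, -2]
```

If the NA position does pass through an 8-bit signed value somewhere, that would also explain why sizes up to 127 are unaffected.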

Expected Behavior

See above

Installed Versions

commit : 76c7274
python : 3.11.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.216-204.855.amzn2.x86_64
Version : #1 SMP Sat May 4 16:53:27 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 3.0.0.dev0+1067.g76c7274985
numpy : 1.26.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 23.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pyarrow : 15.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.4
qtpy : None
pyqt5 : None