BUG: merging DataFrames on a column containing just NaN values triggers address violation in safe_sort · Issue #59421 · pandas-dev/pandas

Reproducible Example

import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {'x': [1, 2, 3], 'y': [np.nan, np.nan, np.nan], 'z': [4, 5, 6]}
)
df2 = pd.DataFrame(
    {'x': [1, 2, 3], 'y': [np.nan, np.nan, np.nan], 'zz': [4, 5, 6]}
)
df1.merge(df2, on=['x', 'y'], how='outer')

Issue Description

Related to #55984

Merging DataFrames on a column containing all NaN values results in an address violation in safe_sort.
This was not present in 2.1.4, and I think it was introduced in #55984 (which fixed other address violations).

Found using ASan; it can also be seen by enabling bounds checking on take_1d_* in algos_take_helper.pxi.in.

My understanding of the cause is:

  1. uniques is an empty array in _factorize_keys - https://github.com/pandas-dev/pandas/blob/main/pandas/core/reshape/merge.py#L2706
  2. The mask set in safe_sort assumes that the array being sorted is at least size 1 - https://github.com/pandas-dev/pandas/blob/main/pandas/core/algorithms.py#L1531
  3. The masked indices are set to 0.
  4. take_nd assumes the indexer contains no out-of-bounds indices, but an index of 0 is out of bounds in this case.
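
For illustration, the four steps above can be mimicked with plain NumPy. This is only a sketch under assumptions: the arrays `values`, `codes` and `order` are hypothetical stand-ins for the state inside safe_sort (empty uniques, every code equal to the NaN sentinel -1), and `np.take` is used in place of take_nd because it checks bounds and raises instead of reading past the end of the buffer.

```python
import numpy as np

# Assumed state for an all-NaN key column: no unique values,
# and every row coded with the missing-value sentinel -1.
values = np.array([], dtype=np.float64)
codes = np.array([-1, -1, -1], dtype=np.intp)

# Step 2: the mask as described in the issue. With len(values) == 0 this is
# (codes < 0) | (codes >= 0), which is True for every element.
mask = (codes < -len(values)) | (codes >= len(values))

# Step 3: masked indices are set to 0.
masked_codes = codes.copy()
masked_codes[mask] = 0

# Step 4: index 0 does not exist in an empty array. np.take checks bounds,
# so it raises here; an unchecked take would read out of bounds instead.
order = np.arange(len(values))  # stand-in for the (empty) array being indexed
try:
    np.take(order, masked_codes)
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

print(mask, masked_codes, out_of_bounds)
```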

I am not familiar with pandas internals, but changing the mask at https://github.com/pandas-dev/pandas/blob/main/pandas/core/algorithms.py#L1531 to

mask = (codes < min(-len(values), -1)) | (codes >= len(values))

avoids this out-of-bounds access. Is this a suitable fix? If so, I can prepare a pull request.
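
As a rough check of the proposed expression, the sketch below compares the two masks with NumPy only. The helper names `current_mask` and `proposed_mask` are hypothetical and exist just for this comparison; they are not pandas functions.

```python
import numpy as np

def current_mask(codes, n_values):
    # Mask as it is described in the issue.
    return (codes < -n_values) | (codes >= n_values)

def proposed_mask(codes, n_values):
    # Proposed change: never treat the -1 sentinel as out of range.
    return (codes < min(-n_values, -1)) | (codes >= n_values)

# Empty uniques, all codes are the NaN sentinel.
nan_codes = np.array([-1, -1, -1], dtype=np.intp)
empty_current = current_mask(nan_codes, 0)    # every element is masked
empty_proposed = proposed_mask(nan_codes, 0)  # no element is masked

# For non-empty uniques, min(-n, -1) == -n, so both masks agree.
mixed_codes = np.array([-4, -1, 0, 2, 3], dtype=np.intp)
same_when_non_empty = all(
    np.array_equal(current_mask(mixed_codes, n), proposed_mask(mixed_codes, n))
    for n in (1, 2, 3)
)

print(empty_current, empty_proposed, same_when_non_empty)
```

With the proposed mask the -1 codes are left untouched in the empty case, so nothing is rewritten to index 0, and behavior is unchanged whenever there is at least one unique value.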

Expected Behavior

No out-of-bounds array access; the merge should produce:

   x   y  z  zz
0  1 NaN  4   4
1  2 NaN  5   5
2  3 NaN  6   6

Installed Versions

INSTALLED VERSIONS

commit : 642d244
python : 3.11.9
python-bits : 64
OS : Linux
OS-release : 6.6.15-2rodete2-amd64
Version : #1 SMP PREEMPT_DYNAMIC Debian 6.6.15-2rodete2 (2024-03-19)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 3.0.0.dev0+1287.g642d244606.dirty
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
pip : 24.0
Cython : 3.0.11
sphinx : 8.0.2
IPython : 8.26.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : 1.4.0
fastparquet : 2024.5.0
fsspec : 2024.6.1
html5lib : 1.1
hypothesis : 6.108.8
gcsfs : 2024.6.1
jinja2 : 3.1.4
lxml.etree : 5.2.2
matplotlib : 3.9.0
numba : 0.60.0
numexpr : 2.10.1
odfpy : None
openpyxl : 3.1.5
psycopg2 : 2.9.9
pymysql : 1.4.6
pyarrow : 17.0.0
pyreadstat : 1.2.7
pytest : 8.3.2
python-calamine : None
pyxlsb : 1.0.10
s3fs : 2024.6.1
scipy : 1.14.0
sqlalchemy : 2.0.31
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.7.0
xlrd : 2.0.1
xlsxwriter : 3.2.0
zstandard : 0.23.0
tzdata : 2024.1
qtpy : None
pyqt5 : None