BUG: Index.get_indexer_non_unique misbehaves with multiple nan by alexhlim · Pull Request #35498 · pandas-dev/pandas
- closes BUG: Index.get_indexer_non_unique misbehaves when index contains multiple nan #35392
- tests added / passed
- passes `black pandas`
- passes `git diff upstream/master -u -- "*.py" | flake8 --diff`
- whatsnew entry
Looking at the implementation in `index.pyx`, I noticed that when `[np.nan]` is passed to `get_indexer_non_unique`, the code was not able to get past the `__contains__` check for `stargets` (line 314).
Further testing (python=3.7.7, numpy=1.18.5):
```python
import numpy as np

# The two `targets` definitions below are alternatives: comment one out to
# reproduce the corresponding case.

# Case 1: does not work -> prints nothing
# nan dtype: np.float64, ndarray dtype: float64
targets = np.array([np.nan])

# Case 2: works -> prints 0, 1, 2
# nan dtype: U3, ndarray dtype: <U32
targets = np.array([np.nan, 'var1'])

values = np.array([np.nan, 'var1', np.nan])
stargets = set(targets)
for i, v in enumerate(values):
    if v in stargets:
        print(i)
```
Cases 1 and 2 produce different results because of the dtype of nan (float64 vs U3). Upon further research, I found that `np.nan != np.nan` per IEEE 754 (when it is a float), and that creating a set from a `np.array` can lead to some bizarre results (numpy/numpy#9358). Also, since a dictionary is the main data structure this method uses to keep track of the targets' indices, I don't think it is ideal to use nans as keys (https://stackoverflow.com/questions/6441857/nans-as-key-in-dictionaries).
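The comparison and hashing quirks mentioned above are easy to reproduce in plain Python; a small illustration (not part of the PR itself):

```python
import numpy as np

# np.nan is a single Python float object; comparisons follow IEEE 754.
print(np.nan == np.nan)          # False

# Set/dict lookups check identity before equality, so the very same nan
# *object* is found...
print(np.nan in {np.nan})        # True

# ...but a distinct nan object is not (the surprise behind numpy/numpy#9358
# and the reason nan dict keys are unreliable).
a = np.array([np.nan, np.nan])
print(a[0] in {a[1]})            # False: each indexing creates a new scalar

d = {np.nan: "first"}
print(d.get(float("nan")))       # None: a different nan object misses the key
```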
I thought it would be appropriate to replace nans (with 0s) in the targets and values arrays in order to avoid the problems stated above. When considering where to replace the nans, I thought of two places where it could potentially happen:
1. In `get_indexer_non_unique` (`pandas/core/indexes/base.py`)
2. In `get_indexer_non_unique` (`pandas/_libs/index.pyx`)
Including the changes in 1 would mean overwriting the `Index` object's properties, so I decided to include the changes in 2.
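To show the idea in isolation: replacing NaNs with a common sentinel before building the set/dict of targets makes the membership checks behave. The sketch below is a minimal pure-Python illustration of that idea, not the actual Cython code in `index.pyx`; `_replace_nans` and the sentinel choice are hypothetical.

```python
import numpy as np

def _replace_nans(arr: np.ndarray, sentinel=0):
    """Hypothetical helper: return a copy of ``arr`` with NaNs replaced by
    ``sentinel`` so that all missing values compare (and hash) identically.
    Real code must ensure the sentinel cannot collide with legitimate
    values; that is glossed over here for illustration."""
    if arr.dtype.kind == "f":
        return np.where(np.isnan(arr), sentinel, arr)
    return arr

values = np.array([np.nan, 1.0, np.nan])
targets = np.array([np.nan])

values_ = _replace_nans(values)
targets_ = _replace_nans(targets)

stargets = set(targets_)
print([i for i, v in enumerate(values_) if v in stargets])  # [0, 2]
```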
FYI -- I wasn't sure if the test I included was in the correct file. Please let me know if you would like this test to be in another file.