Missing keys in a selector list are matched to None-labeled entries of MultiIndex · Issue #46173 · pandas-dev/pandas
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd

df = pd.DataFrame.from_dict({
    ("foo",): [1, 2, 3],
    ("bar",): [5, 6, 7],
    (None,): [8, 9, 0],
})
df[[("missingKey",)]]  # returns the NaN-labeled column [8, 9, 0] instead of raising a KeyError
Issue Description
Given a DataFrame whose MultiIndex contains NaN values in its keys, a key whose label is absent from the index, on the same level as a NaN value, retrieves the NaN-labeled column when the keys are passed in a list.
If multiple missing labels are passed, each of them retrieves the None column:
pd.DataFrame.from_dict({(None,): [8, 9, 0]}).loc[:, [("foo",), ("bar",)]] # returns a DF with two copies of the NaN-labeled [8, 9, 0] column
This behaviour occurs only when selection is done via a list of keys.
pd.DataFrame.from_dict({(None,): [8, 9, 0]}).loc[:, ("foo",)]    # single key - raises KeyError
pd.DataFrame.from_dict({(None,): [8, 9, 0]}).loc[:, [("foo",)]]  # key in a list - returns [8, 9, 0]
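To compare the two forms side by side on a given pandas build, a small probe along the following lines may help. It is only a sketch: `probe` is a hypothetical helper name, not part of pandas, and the result for the list form depends on whether the installed version still shows the behaviour reported here.

```python
import pandas as pd


def probe(selector):
    """Report whether selecting `selector` from a NaN-labeled frame raises KeyError.

    `probe` is a hypothetical helper written for this report, not a pandas API.
    """
    df = pd.DataFrame.from_dict({(None,): [8, 9, 0]})
    try:
        df.loc[:, selector]
    except KeyError:
        return "KeyError"
    return "returned"


# The selector is the same missing key, once bare and once wrapped in a list.
print("single key  :", probe(("foo",)))
print("key in list :", probe([("foo",)]))
```

If both lines print the same result, the installed version treats the two forms consistently.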
The same issue occurs for multi-level MultiIndex, as long as missing labels in the selector occur only on the same level as NaN values.
pd.DataFrame.from_dict({(None, None): [8, 9, 0]}).loc[:, [("a", "a")]]  # returns [8, 9, 0]
pd.DataFrame.from_dict({(None, None): [8, 9, 0]}).loc[:, [("a", "b")]]  # returns [8, 9, 0]
pd.DataFrame.from_dict({(None, "b"): [8, 9, 0]}).loc[:, [("a", "b")]]   # returns [8, 9, 0]
pd.DataFrame.from_dict({(None, "a"): [8, 9, 0]}).loc[:, [("a", "b")]]   # KeyError, because 2nd level didn't match
None can also be replaced with any of pd.NA, np.nan, or pd.NaT for the same effect.
An additional, more complex example, with multiple None values in the index:
df = pd.DataFrame.from_dict({
    (None, None): [1, 2, 3],
    (None, "a"): [5, 6, 7],
    ("b", None): [8, 9, 0],
})
df[[("foo", "bar")]]  # returns [1, 2, 3]
df[[("foo", "a")]]    # returns [5, 6, 7]
df[[("b", "foo")]]    # returns [8, 9, 0]
df[[("b", "a")]]      # raises KeyError
Interestingly enough, if [("a", "b")] is used to select from this DF, a KeyError will be raised, instead of matching it to (None, None).
Expected Behavior
A KeyError should be raised, just as it is when the key with the missing label is not wrapped in a list.
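As a stopgap on affected versions, a caller could validate the keys before selecting. The sketch below is one possible workaround, not a proposed fix for pandas itself: `strict_select` and `_normalize` are hypothetical names, and treating None, NaN, pd.NA and pd.NaT as one interchangeable null label is an assumption made here for simplicity.

```python
import pandas as pd


def _normalize(key):
    # Assumption: every null-like label (None, NaN, pd.NA, pd.NaT) is mapped
    # to None so that null labels compare equal to each other.
    return tuple(None if pd.isna(label) else label for label in key)


def strict_select(df, keys):
    """Select columns by a list of tuple keys, raising KeyError for absent keys.

    Hypothetical helper: it compares the keys against the column tuples
    directly, so a missing label can never be matched to a NaN-labeled column.
    """
    present = {_normalize(col) for col in df.columns.tolist()}
    missing = [key for key in keys if _normalize(key) not in present]
    if missing:
        raise KeyError(f"keys not found in columns: {missing}")
    return df.loc[:, list(keys)]


df = pd.DataFrame.from_dict({
    ("foo",): [1, 2, 3],
    ("bar",): [5, 6, 7],
    (None,): [8, 9, 0],
})

try:
    strict_select(df, [("missingKey",)])
except KeyError as exc:
    # The pre-check raises before pandas performs the list-based lookup.
    print("KeyError:", exc)
```

The pre-check only guards exact, full-length tuple keys; partial keys that select a whole sub-level would need a different check.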
Installed Versions
INSTALLED VERSIONS
commit : 06d2301
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.13.0-27-generic
Version : #29~20.04.1-Ubuntu SMP Fri Jan 14 00:32:30 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 1.4.1
numpy : 1.22.2
pytz : 2021.3
dateutil : 2.8.2
pip : 20.0.2
setuptools : 44.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 8.0.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.4.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None