nlargest gives a zero-row dataframe when ordering columns are all NaN · Issue #28984 · pandas-dev/pandas
Code Sample, a copy-pastable example if possible
import pandas as pd
from numpy import nan
df = pd.DataFrame({'grp': [1, 1, 2, 2], 'y': [1, 0, 2, 5], 'z': [1, 2, nan, nan]})
df.groupby('grp').apply(lambda grp_df: grp_df.nlargest(1, 'z'))
       grp  y    z
grp
1   1    1  0  2.0
(Group 2 is gone!)
Problem description
When the values of the ordering variables are all missing, the nlargest and nsmallest methods return a zero-row dataframe. This behavior is particularly unexpected when applying over groups, since it silently omits groups with all-NaN values. I think it would be better to return the requested number of rows, with NaN as appropriate.
Put differently, this is a case where nlargest differs from the corresponding sort_values(...).head(...) code.
df.groupby('grp').apply(lambda x: x.sort_values('z', ascending=False).head(1))
       grp  y    z
grp
1   1    1  0  2.0
2   2    2  2  NaN
(I'm aware that a better way to write that sort_values line would be to skip the apply and write df.sort_values('z', ascending=False).groupby('grp').head(1), which gives a similar result, but better index.)
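As a minimal sketch of that apply-free form, using the same data as the code sample above (the variable name top_per_group is only illustrative): sort_values places NaN last by default, so a group whose ordering column is entirely NaN still contributes a row.

```python
import numpy as np
import pandas as pd

# Same data as in the report.
df = pd.DataFrame({'grp': [1, 1, 2, 2],
                   'y': [1, 0, 2, 5],
                   'z': [1, 2, np.nan, np.nan]})

# Sort once, then keep the first row of each group.
# sort_values puts NaN last by default (na_position='last'),
# so the all-NaN group 2 is still represented in the result.
top_per_group = df.sort_values('z', ascending=False).groupby('grp').head(1)
print(top_per_group)
```

The result keeps the original flat row index rather than the two-level index that apply produces.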
I've talked about grouped dataframes because that seems like a more pernicious problem, but the behavior is the same with ungrouped dataframes.
The problem is also the same for nsmallest and nlargest, and it doesn't matter if there are multiple ordering columns, as long as they're all NaN.
Related, but not identical issues:
#23993 (requesting nlargest and nsmallest methods for grouped dataframes)
#21426 (bug with unsigned integers)
#12694 (NaN in Series.argsort)
Expected Output
df.groupby('grp').apply(lambda grp_df: grp_df.nlargest(1, 'z'))
       grp  y    z
grp
1   1    1  0  2.0
2   2    2  2  NaN
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.0.0-31-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 0.25.1
numpy : 1.17.2
pytz : 2019.3
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.4.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 2.2.4
numexpr : 2.7.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.0
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None