Unexpected output for nlargest function with multiple columns · Issue #22752 · pandas-dev/pandas

Code Sample

import pandas as pd

aas = [2, 2, 2, 1, 1, 1]
bbs = [1, 2, 3, 3, 2, 1]
n = 4

df = pd.DataFrame({'a': aas, 'b': bbs})

print('-- First --')
nlargest = df.nlargest(n, columns=['a', 'b']).sort_values(['a', 'b'], ascending=False)
print(nlargest)

print('-- Second --')
pseudo_nlargest = df.sort_values(['a', 'b'], ascending=False).head(n)
print(pseudo_nlargest)

Actual Output

-- First --
   a  b
2  2  3
1  2  2
3  1  3
4  1  2
-- Second --
   a  b
2  2  3 <same>
1  2  2 <same>
0  2  1 <different!>
3  1  3 <different!>

Text within angle brackets was added to call attention to rows with unexpected output.

Problem description

According to the documentation for nlargest, df.nlargest(n, columns) should be equivalent to df.sort_values(columns, ascending=False).head(n), but more performant. Presumably it is faster because it does not need to sort the entire DataFrame.

I am observing different behavior. In the example above, I expect the first and second DataFrames to be identical in both index and values. (Note that I've sorted the output of nlargest to rule out sort order as a difference.)
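To make the discrepancy easier to see than by comparing the printed frames by eye, here is a minimal sketch that lists which index labels differ between the two results. The variable names (expected, actual, missing, unexpected) are my own and purely illustrative. Going by the actual output above, on pandas 0.23.4 this should report index 0 as missing and index 4 as unexpected; on a version where the two calls agree, both lists would be empty.

```python
import pandas as pd

df = pd.DataFrame({"a": [2, 2, 2, 1, 1, 1], "b": [1, 2, 3, 3, 2, 1]})
n = 4
cols = ["a", "b"]

# Reference result: sort the whole frame, then keep the first n rows.
expected = df.sort_values(cols, ascending=False).head(n)

# Result under test, sorted the same way so row order is not a difference.
actual = df.nlargest(n, columns=cols).sort_values(cols, ascending=False)

# Index labels present in one result but not in the other.
missing = expected.index.difference(actual.index)
unexpected = actual.index.difference(expected.index)

print("missing from nlargest:", list(missing))
print("unexpected in nlargest:", list(unexpected))
print("identical:", expected.equals(actual))
```

The example data has no duplicate (a, b) pairs, so the reference result is unambiguous and any difference cannot be attributed to tie-breaking.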

Similar issues, but different enough that I opened a new one

#21426 - Deals with unsigned ints; this issue uses signed int64s.
#19563 - Differs by sort order only; in this issue the returned rows themselves are a different, non-unique subset of the original rows.

Expected Output

-- First --
   a  b
2  2  3
1  2  2
0  2  1
3  1  3
-- Second --
   a  b
2  2  3 <same>
1  2  2 <same>
0  2  1 <same>
3  1  3 <same>

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-34-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: None
pip: 9.0.3
setuptools: 40.4.1
Cython: None
numpy: 1.15.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: 2.5.7
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.11
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None