Unexpected output for nlargest function with multiple columns · Issue #22752 · pandas-dev/pandas
Code Sample
import pandas as pd

aas = [2, 2, 2, 1, 1, 1]
bbs = [1, 2, 3, 3, 2, 1]
n = 4

df = pd.DataFrame({'a': aas, 'b': bbs})

print('-- First --')
nlargest = df.nlargest(n, columns=['a', 'b']).sort_values(['a', 'b'], ascending=False)
print(nlargest)

print('-- Second --')
pseudo_nlargest = df.sort_values(['a', 'b'], ascending=False).head(n)
print(pseudo_nlargest)
Actual Output
-- First --
   a  b
2  2  3
1  2  2
3  1  3
4  1  2
-- Second --
   a  b
2  2  3 <same>
1  2  2 <same>
0  2  1 <different!>
3  1  3 <different!>
Text within angle brackets added to call attention to rows with unexpected output.
Problem description
According to the documentation for nlargest, the nlargest function should function identically to df.sort_values(columns, ascending=False).head(n) but be more performant. Presumably this is more performant due to not needing to sort the entire dataframe.
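To illustrate why a top-n selection can be cheaper than a full sort, here is a minimal sketch using the standard library's heapq.nlargest on the same (a, b) pairs. This is not how pandas implements nlargest; it is only an assumption-free illustration of the general idea, and the names rows, by_sort and by_heap are made up for this example.

```python
import heapq

# (a, b) pairs taken from the code sample in this issue.
rows = list(zip([2, 2, 2, 1, 1, 1], [1, 2, 3, 3, 2, 1]))
n = 4

# Full sort, then truncate: sorts all N rows before discarding most of them.
by_sort = sorted(rows, reverse=True)[:n]

# Heap-based selection: only tracks the n best candidates while scanning.
by_heap = heapq.nlargest(n, rows)

print(by_sort)
print(by_heap)
```

Both approaches are expected to select the same rows; the only intended difference is the amount of work done.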
I am observing different behavior. In the example above, I expect the first and second dataframes to be the same in both indices and values. (Note that I've sorted the output of the nlargest function to remove sort order as a difference.)
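The expectation above can be turned into a small check. The sketch below is one possible way to do it, assuming the same data as the code sample; the variable names reference, candidate and same_rows are hypothetical. Whether the check passes depends on the installed pandas version, so no result is claimed for the nlargest side.

```python
import pandas as pd

df = pd.DataFrame({"a": [2, 2, 2, 1, 1, 1], "b": [1, 2, 3, 3, 2, 1]})
n = 4
columns = ["a", "b"]

# Reference result: sort the whole frame, then keep the first n rows.
reference = df.sort_values(columns, ascending=False).head(n)

# Result under test, re-sorted so that only row membership is compared.
candidate = df.nlargest(n, columns=columns).sort_values(columns, ascending=False)

# True only when values and index labels agree in both frames.
same_rows = reference.equals(candidate)

print(reference.index.tolist())
print(candidate.index.tolist())
print("match:", same_rows)
```

On the version reported in this issue, the two index lists would differ (the first contains 4 where the second contains 0), so the final line would report a mismatch.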
Similar issues, but different enough that I opened a new one
#21426 - Deals with unsigned ints, this issue uses signed int64s.
#19563 - Different by sort order only, this issue is different in that the rows themselves are a different, non-unique subset of the original rows.
Expected Output
-- First --
   a  b
2  2  3
1  2  2
0  2  1
3  1  3
-- Second --
   a  b
2  2  3 <same>
1  2  2 <same>
0  2  1 <same>
3  1  3 <same>
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-34-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.23.4
pytest: None
pip: 9.0.3
setuptools: 40.4.1
Cython: None
numpy: 1.15.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: 2.5.7
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.11
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None