Groupby('colname').nth(0) results in index with duplicate entries if 'colname' has a np.NaN value · Issue #26011 · pandas-dev/pandas

Code Sample

import pandas as pd
import numpy as np

test = pd.DataFrame(data = [[np.NaN, 1, np.NaN], [2, 3, 4]], index = [0, 1], columns = ['one','two','three'])

test
   one  two  three
0  NaN    1    NaN
1  2.0    3    4.0

test.groupby('one').nth(0)
     two  three
one
2.0    1    NaN
2.0    3    4.0

Problem description

A df.groupby('colname').nth(0) operation should return a new DataFrame whose index contains no duplicates: each unique value in colname should appear exactly once in the resulting index.

However, if the column labeled colname contains a np.NaN value, the corresponding row is neither given its own entry in the index nor simply ignored. Instead, it shows up under the label of another group, producing a duplicate index value as seen above. This is extremely unexpected behavior (probably a bug).
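The symptom can be detected programmatically by checking the result's index for uniqueness. The following is only a sketch under stated assumptions: the frame is built by hand to mimic the duplicated output shown above (it is not produced by calling nth), the helper name duplicated_labels is hypothetical, and np.nan is used in place of np.NaN because that alias was removed in NumPy 2.0.

```python
import numpy as np
import pandas as pd


def duplicated_labels(frame):
    """Return the index labels that occur more than once (hypothetical helper)."""
    idx = frame.index
    return idx[idx.duplicated()].unique().tolist()


# Hand-built frame that mimics the nth(0) output reported in this issue.
reported = pd.DataFrame(
    {"two": [1, 3], "three": [np.nan, 4.0]},
    index=pd.Index([2.0, 2.0], name="one"),
)

print(reported.index.is_unique)     # False
print(duplicated_labels(reported))  # [2.0]
```

The same check can be applied to the actual result of test.groupby('one').nth(0); whether it reports duplicates depends on the installed pandas version.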

Note that only nth is affected; first and last never produce duplicate indices. However, first and last have performance problems on categorical data: #25397

Expected Output

first actually "just works"

test.groupby('one').first()
     two  three
one
2.0    3    4.0

so this is probably the correct behavior for nth(0) as well

test.groupby('one').nth(0)
     two  three
one
2.0    3    4.0
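Until nth behaves this way, a possible workaround is to avoid it altogether. The sketch below is an assumption, not something proposed in the original report: it sidesteps groupby entirely by dropping rows whose key is NaN, keeping the first row per key with drop_duplicates, and moving the key into the index. The variable name workaround is illustrative, and np.nan is used in place of np.NaN for compatibility with NumPy 2.0.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame(
    data=[[np.nan, 1, np.nan], [2, 3, 4]],
    index=[0, 1],
    columns=["one", "two", "three"],
)

# Drop rows whose grouping key is NaN, keep the first row for each
# remaining key, then use the key column as the index.
workaround = (
    test.dropna(subset=["one"])
    .drop_duplicates(subset="one", keep="first")
    .set_index("one")
)

print(workaround)
print(workaround.index.is_unique)  # True
```

Unlike first, which picks the first non-null value per column, this keeps whole rows, which is closer to what nth(0) is meant to return.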

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.2
pytest: None
pip: 19.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.15.3
scipy: 1.1.0
pyarrow: 0.12.0
xarray: 0.11.0
IPython: 7.3.0
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.6
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.8
feather: 0.4.0
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: None
sqlalchemy: 1.2.17
pymysql: None
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None