Inconsistent handling of index after groupby operation

snippet 1


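import pandas as pd

df = pd.DataFrame(dict(A=[0, 1, 2, 3]))

# returns index and values identical to df.A (no group level in the index)
print(df.groupby(df.A // 2).A.nsmallest(2))

# returns values in descending order within each group, with the group key added as an index level
print(df.groupby(df.A // 2).A.nlargest(2))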

0    0
1    1
2    2
3    3
Name: A, dtype: int64
A   
0  1    1
   0    0
1  3    3
   2    2
Name: A, dtype: int64

snippet 2


import pandas as pd

df = pd.DataFrame(dict(A=[0, 1, 2, 3]))

print(df.groupby(df.A // 2).A.apply(pd.Series.sample, n=2))
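A rough way to check how the index shape of snippet 2 varies is to repeat the call with fixed seeds and record how many index levels each result has. This is only a sketch: `key` and `observed` are names introduced here for illustration, and `random_state` is the standard seeding parameter of `Series.sample`. What ends up in `observed` depends on the pandas version.

```python
import pandas as pd

df = pd.DataFrame(dict(A=[0, 1, 2, 3]))
key = df.A // 2  # grouping key: rows 0-1 -> group 0, rows 2-3 -> group 1

# Collect the number of index levels seen across several seeded runs.
observed = set()
for seed in range(20):
    out = df.groupby(key).A.apply(pd.Series.sample, n=2, random_state=seed)
    observed.add(out.index.nlevels)

# More than one value here means the index shape depends on the random draw.
print(observed)
```

If the set contains both 1 and 2, the same call sometimes returns a flat index and sometimes a MultiIndex.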

Problem description

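When a groupby operation returns, for each group, the same rows that were in the group in the first place, the result keeps the index of the object being grouped and the group key is not added as an index level. This doesn't sound so horrible until you realize that it is inconsistent with very comparable operations, which do add the group key. This is observed in snippet 1: `nsmallest(2)` returns a flat index, while `nlargest(2)` returns a MultiIndex. Snippet 2 puts a finer point on it: the same code sample produces randomly different results, with the index shape apparently depending on whether the sample happens to preserve the original order within each group.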
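Until the behavior is consistent, one possible workaround is to normalize the result so the group key is always the outer index level. The sketch below assumes the grouping key is a Series aligned with the original index. `ensure_group_level` is a hypothetical helper written for this example, not part of pandas.

```python
import pandas as pd


def ensure_group_level(result, key):
    """Return `result` with the group key as the outermost index level.

    Hypothetical helper: `key` must be a Series aligned with the index of
    the object that was grouped. Results that already carry a MultiIndex
    are returned unchanged.
    """
    if isinstance(result.index, pd.MultiIndex):
        return result
    # Look up the group label for every row label left in the flat index.
    group_labels = key.reindex(result.index)
    new_index = pd.MultiIndex.from_arrays(
        [group_labels.to_numpy(), result.index],
        names=[key.name, result.index.name],
    )
    return pd.Series(result.to_numpy(), index=new_index, name=result.name)


df = pd.DataFrame(dict(A=[0, 1, 2, 3]))
key = df.A // 2

smallest = ensure_group_level(df.groupby(key).A.nsmallest(2), key)
largest = ensure_group_level(df.groupby(key).A.nlargest(2), key)
sampled = ensure_group_level(
    df.groupby(key).A.apply(pd.Series.sample, n=2), key
)

# All three results now have two index levels, whatever pandas returned.
print(smallest.index.nlevels, largest.index.nlevels, sampled.index.nlevels)
```

This only papers over the inconsistency on the caller's side; it does not change what the groupby methods themselves return.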

Expected Output

A   
0  0    0
   1    1
1  2    2
   3    3
Name: A, dtype: int64
A   
0  1    1
   0    0
1  3    3
   2    2
Name: A, dtype: int64

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: 0.2.1