
Hello everyone,

I just stumbled upon inconsistent behaviour in the groupby function, and it is causing me a lot of trouble. When grouping on bins, I expect empty bins to be kept as NA values, so that the result keeps the same dimensions when one wants to aggregate and compare data.

This is indeed the case when grouping on a single key, but the empty bins are dropped as soon as one adds a second key to the groupby call.

import pandas as pd
import numpy as np

d = {'Col 1': [3, 3, 4, 5], 'Col 2': [1, 2, 3, 4], 'Col 3': [10, 100, 200, 34]}

test = pd.DataFrame(d)

values = pd.cut(test['Col 1'], [1, 2, 3, 6])

# Grouping on a single column

groups_single_key = test.groupby(values)

# Grouping on two columns

groups_double_key = test.groupby([values,'Col 2'])

# The empty group is kept as NA, which is the behaviour I was expecting

groups_single_key.describe()

# The empty groups are dropped

groups_double_key.describe()

# This is not just an artifact of the describe() method: the empty group really
# does exist and is taken into account when performing aggregation

print(groups_single_key.agg('mean'))
print(groups_double_key.agg('mean'))
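
In the meantime, one possible workaround is to reindex the two-key result against the full product of bins and second-key values, so that the missing combinations come back as NA. The following is only a sketch under assumptions: the string `labels`, the names `means`, `full_index` and `padded`, and the choice to aggregate only 'Col 3' are mine for illustration. The bins are converted to plain strings so that the grouping keys are ordinary values rather than a categorical.

```python
import pandas as pd

test = pd.DataFrame({'Col 1': [3, 3, 4, 5],
                     'Col 2': [1, 2, 3, 4],
                     'Col 3': [10, 100, 200, 34]})

# Hypothetical string labels for the three bins (an assumption for this sketch)
labels = ['(1, 2]', '(2, 3]', '(3, 6]']

# Bin 'Col 1' and convert the result to plain strings, so the grouping
# key is an ordinary column of labels rather than a categorical
bins = pd.cut(test['Col 1'], [1, 2, 3, 6], labels=labels).astype(str)

# Two-key aggregation: only the combinations present in the data appear
means = test.groupby([bins, 'Col 2'])['Col 3'].mean()

# Build every (bin, 'Col 2') combination, including the empty ones
full_index = pd.MultiIndex.from_product(
    [labels, sorted(test['Col 2'].unique())],
    names=['Col 1', 'Col 2'],
)

# Reindexing fills the combinations that had no rows with NaN
padded = means.reindex(full_index)
print(padded)
```

Here `padded` has one row per (bin, 'Col 2') pair, so results from different groupings can be lined up against each other. This only pads the output after the fact; it does not change what groupby itself returns.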

pandas: 0.14.1
nose: 1.3.1
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 2.2.0
sphinx: 1.2.2
patsy: 0.3.0
scikits.timeseries: None
dateutil: 2.2
pytz: 2013.9
bottleneck: None
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.4.0
openpyxl: None
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.5.7
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None