Categorical Column GroupBy agg with as_index=False produces NaN rows 7.5X Slower with unexpected extra Cardinality · Issue #15217 · pandas-dev/pandas

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, 100, size=(200, 6)),
                  columns=['C' + str(i) for i in range(6)])
df['C0'] = ['A', 'B', 'C', 'D'] * 50
df['C1'] = ['E', 'F'] * 100
df['C2'] = ['H', 'I', 'J', 'K', 'L'] * 40

for col in df.columns[:3]:
    df[col] = df[col].astype('category')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
C0    200 non-null category
C1    200 non-null category
C2    200 non-null category
C3    200 non-null int32
C4    200 non-null int32
C5    200 non-null int32
dtypes: category(3), int32(3)
memory usage: 3.1 KB

%time ix_true = df.groupby(df.columns.tolist()[:3], as_index=True)['C5'].max()
Wall time: 2 ms

ix_true.shape
(20,)

%time ix_false = df.groupby(df.columns.tolist()[:3], as_index=False)['C5'].max()
Wall time: 15 ms

ix_false.shape
(40, 4)

ix_true

C0  C1  C2
A   E   H     93
        I     99
        J     88
        K     91
        L     94
B   F   H     98
        I     89
        J     94
        K     92
        L     96
C   E   H     96
        I     96
        J     85
        K     88
        L     98
D   F   H     96
        I     84
        J     71
        K     96
        L     94
Name: C5, dtype: int32

ix_false

   C0 C1 C2    C5
0   A  E  H  93.0
1   A  E  I  99.0
2   A  E  J  88.0
3   A  E  K  91.0
4   A  E  L  94.0
5   A  F  H   NaN
6   A  F  I   NaN
7   A  F  J   NaN
8   A  F  K   NaN
9   A  F  L   NaN
10  B  E  H   NaN
11  B  E  I   NaN
12  B  E  J   NaN
13  B  E  K   NaN
14  B  E  L   NaN
15  B  F  H  98.0
16  B  F  I  89.0
17  B  F  J  94.0
18  B  F  K  92.0
19  B  F  L  96.0
20  C  E  H  96.0
21  C  E  I  96.0
22  C  E  J  85.0
23  C  E  K  88.0
24  C  E  L  98.0
25  C  F  H   NaN
26  C  F  I   NaN
27  C  F  J   NaN
28  C  F  K   NaN
29  C  F  L   NaN
30  D  E  H   NaN
31  D  E  I   NaN
32  D  E  J   NaN
33  D  E  K   NaN
34  D  E  L   NaN
35  D  F  H  96.0
36  D  F  I  84.0
37  D  F  J  71.0
38  D  F  K  96.0
39  D  F  L  94.0

Problem description

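Using as_index=False in df.groupby(df.columns.tolist()[:3], as_index=False)['C5'].max() with categorical columns produces NaN output rows for category combinations that never occur in the data: the result has 40 rows instead of the 20 returned with as_index=True. The call is also about 7.5 times slower (15 ms versus 2 ms wall time).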
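To make the row counts concrete, here is a minimal sketch (not part of the original report) that compares the number of key combinations actually present in the data with the size of the full Cartesian product of the three columns' categories. The names `keys`, `n_observed`, and `n_product` are illustrative only, and the frame is rebuilt without the numeric columns because only the keys matter for the count.

```python
import numpy as np
import pandas as pd

# Same key columns as in the report, built directly as categoricals.
df = pd.DataFrame({
    'C0': pd.Categorical(['A', 'B', 'C', 'D'] * 50),
    'C1': pd.Categorical(['E', 'F'] * 100),
    'C2': pd.Categorical(['H', 'I', 'J', 'K', 'L'] * 40),
})
keys = ['C0', 'C1', 'C2']

# Combinations that really occur in the 200 rows.
n_observed = len(df[keys].drop_duplicates())

# Size of the Cartesian product of all categories: 4 * 2 * 5.
n_product = int(np.prod([len(df[k].cat.categories) for k in keys]))

print(n_observed, n_product)
```

The two counts match the shapes shown above: 20 rows for as_index=True and 40 rows for as_index=False, which suggests the extra NaN rows are the unobserved category combinations.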

Expected Output

I expect the output to contain no extra rows from a cardinality explosion and to have the same number of rows as the as_index=True case.
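As a hedged illustration of the expected result, the sketch below assumes a newer pandas release than the 0.19.2 reported here, one in which groupby accepts the `observed` keyword (as far as I know, 0.23 and later). This is not a fix for 0.19.2 itself. The seeded random generator and the `keys` variable are additions for reproducibility, not part of the original sample.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is reproducible (an addition, not in the report).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(10, 100, size=(200, 6)),
                  columns=['C' + str(i) for i in range(6)])
df['C0'] = ['A', 'B', 'C', 'D'] * 50
df['C1'] = ['E', 'F'] * 100
df['C2'] = ['H', 'I', 'J', 'K', 'L'] * 40

keys = ['C0', 'C1', 'C2']
for col in keys:
    df[col] = df[col].astype('category')

# observed=True restricts the result to category combinations present in the data.
# ASSUMPTION: the installed pandas supports the `observed` keyword.
ix_false = df.groupby(keys, as_index=False, observed=True)['C5'].max()

# Equivalent route: group with the default index, then move the keys back to columns.
ix_true = df.groupby(keys, observed=True)['C5'].max().reset_index()

print(ix_false.shape, ix_true.shape)
```

Under that assumption both calls should return one row per observed combination, with no NaN rows in C5.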

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 8.1.1
setuptools: 20.10.1
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.7.1
IPython: 4.2.0
sphinx: 1.3.6
patsy: 0.4.0
dateutil: 2.5.0
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.2.5
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: None
lxml: 3.6.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None