groupby on multiple columns does not preserve (categorical) dtype

When doing a groupby on more than one column, the resulting MultiIndex does not seem to preserve the original column dtypes. I noticed it when working with Categorical columns, expecting CategoricalIndex when grouping on them, but this is only the case when grouping on just one column.

I did see that the behaviour was discussed in a PR, but it ultimately was not addressed.

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({
   ...:     'a': pd.Series(list('xyxxyz')).astype('category', categories=list('xyz')),
   ...:     'b': pd.Series(list('yzzyxz')).astype('category', categories=list('xyz')),
   ...:     'c': [1,2,3,4,5,6]
   ...: })

In [3]: df.groupby('a').sum().reset_index().dtypes
Out[3]:
a    category
c       int64
dtype: object

In [4]: df.groupby(['a', 'b']).sum().reset_index().dtypes
Out[4]:
a     object
b     object
c    float64
dtype: object

Expected Output

In [4]: df.groupby(['a', 'b']).sum().reset_index().dtypes
Out[4]:
a    category
b    category
c       int64
dtype: object
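Until the key dtypes come back as expected, one possible stopgap is to cast the key columns back to the dtypes of the original frame after `reset_index()`. The following is a minimal sketch under some assumptions: `groupby_sum_keep_dtypes` is a hypothetical helper, not part of pandas; it relies on `astype` accepting the original column's categorical dtype object and on the `observed` keyword, so it needs a newer pandas than the one reported here. It only touches the key columns and does nothing about the dtype of `c`.

```python
import pandas as pd


def groupby_sum_keep_dtypes(frame, keys):
    """Hypothetical helper: groupby + sum, then restore the key columns' dtypes."""
    # observed=False is an assumption (pandas >= 0.23); it keeps all
    # category combinations, matching the behaviour discussed above.
    result = frame.groupby(keys, observed=False).sum().reset_index()
    for key in keys:
        # Cast each key column back to the dtype it had in the input frame,
        # which for categoricals also restores the original category list.
        result[key] = result[key].astype(frame[key].dtype)
    return result


cats = list("xyz")
df = pd.DataFrame({
    "a": pd.Categorical(list("xyxxyz"), categories=cats),
    "b": pd.Categorical(list("yzzyxz"), categories=cats),
    "c": [1, 2, 3, 4, 5, 6],
})

out = groupby_sum_keep_dtypes(df, ["a", "b"])
print(out.dtypes)
```

The cast is a no-op on versions where the grouped keys are already categorical, so the helper should be safe to leave in place across upgrades.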

output of `pd.show_versions()`

In [5]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.13
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.18.1+240.gbb6b5e5
nose: None
pip: 8.1.2
setuptools: 19.4
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 5.0.0
sphinx: None
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.9.3
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.14
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None
pandas_datareader: None
