groupby on multiple columns does not preserve (categorical) dtype
When doing a groupby on more than one column, the resulting MultiIndex does not seem to preserve the original column dtypes. I noticed this when working with Categorical columns: I expected a CategoricalIndex when grouping on them, but that is only the case when grouping on a single column.
I did see that the behaviour was discussed in a PR, but it ultimately was not addressed.
Code Sample, a copy-pastable example if possible
In [1]: import pandas as pd

In [2]: df = pd.DataFrame({
   ...:     'a': pd.Series(list('xyxxyz')).astype('category', categories=list('xyz')),
   ...:     'b': pd.Series(list('yzzyxz')).astype('category', categories=list('xyz')),
   ...:     'c': [1,2,3,4,5,6]
   ...: })

In [3]: df.groupby('a').sum().reset_index().dtypes
Out[3]:
a    category
c       int64
dtype: object

In [4]: df.groupby(['a', 'b']).sum().reset_index().dtypes
Out[4]:
a     object
b     object
c    float64
dtype: object
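To see where the dtype is lost, it may help to look at the levels of the grouped index directly rather than at the frame after `reset_index()`. The following is only a diagnostic sketch: it builds the same frame with `pd.Categorical(...)` (an assumption on my part, chosen because that constructor form should work across pandas versions) and collects the dtype of each MultiIndex level. The name `level_dtypes` is purely illustrative, and what it prints depends on the pandas version in use.

```python
import pandas as pd

# Same data as in the report, built with pd.Categorical so the snippet
# does not depend on the astype('category', categories=...) signature.
df = pd.DataFrame({
    'a': pd.Categorical(list('xyxxyz'), categories=list('xyz')),
    'b': pd.Categorical(list('yzzyxz'), categories=list('xyz')),
    'c': [1, 2, 3, 4, 5, 6],
})

grouped = df.groupby(['a', 'b']).sum()

# Map each index level name to the dtype of that level of the MultiIndex.
level_dtypes = {
    name: level.dtype
    for name, level in zip(grouped.index.names, grouped.index.levels)
}
print(level_dtypes)
```

If the levels report `object` here, the categorical information is already gone at the MultiIndex stage, before `reset_index()` is called.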
Expected Output
In [4]: df.groupby(['a', 'b']).sum().reset_index().dtypes
Out[4]:
a    category
b    category
c       int64
dtype: object
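Until the dtype is preserved, one possible workaround is to cast the key columns back to categorical after `reset_index()`, reusing the categories of the original columns. This is a minimal sketch and not a fix for the underlying behaviour; the names `keys` and `result` are illustrative, and the frame is again built with `pd.Categorical(...)` as an assumption for portability across versions.

```python
import pandas as pd

df = pd.DataFrame({
    'a': pd.Categorical(list('xyxxyz'), categories=list('xyz')),
    'b': pd.Categorical(list('yzzyxz'), categories=list('xyz')),
    'c': [1, 2, 3, 4, 5, 6],
})

keys = ['a', 'b']
result = df.groupby(keys).sum().reset_index()

# Re-apply the original categories to each grouping column, so the
# key columns are categorical again regardless of what groupby returned.
for key in keys:
    result[key] = pd.Categorical(result[key],
                                 categories=df[key].cat.categories)

print(result.dtypes)
```

This only restores the dtype of the key columns. It does not address the aggregated column `c` coming back as float64 instead of int64 in the output above.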
output of pd.show_versions()
In [5]: pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.13
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.18.1+240.gbb6b5e5
nose: None
pip: 8.1.2
setuptools: 19.4
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 5.0.0
sphinx: None
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.9.3
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.14
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None
pandas_datareader: None