Categorical dtype doesn't survive groupby first, max, min, value_counts, etc.: unwanted coercion to object

Code Sample, a copy-pastable example if possible

In [1]: df=pd.DataFrame(dict(payload=[-1,-2,-1,-2], col=pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True)));df
Out[1]: 
   col  payload
0  foo       -1
1  bar       -2
2  bar       -1
3  qux       -2

In [2]: df.groupby("payload").first().col.dtype
Out[2]: dtype('O')

Problem description

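Grouping shouldn't coerce a categorical column to object. The categorical dtype should be preserved wherever possible, both for efficiency and for correctness: once the column is coerced to object, the category list and its ordering are no longer carried by the dtype.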
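To see which reductions are affected on a given pandas version, a small check along these lines may help. It is only a sketch: the choice of methods (`first`, `last`, `min`, `max`) and the `dtypes` dict are my own illustration, and the result will differ between pandas versions, so no expected output is shown.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "payload": [-1, -2, -1, -2],
        "col": pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True),
    }
)

grouped = df.groupby("payload")["col"]

# Record the result dtype of each reduction so they can be compared
# against the original categorical dtype.
dtypes = {}
for name in ("first", "last", "min", "max"):
    dtypes[name] = getattr(grouped, name)().dtype

for name, dtype in dtypes.items():
    preserved = dtype == df["col"].dtype
    print(f"{name}: {dtype!r} (categorical preserved: {preserved})")
```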

Expected Output

The result dtype should be CategoricalDtype(categories=['bar', 'foo', 'qux'], ordered=True), just like it is here:

In [6]: df.groupby("payload").head().col.dtype
Out[6]: CategoricalDtype(categories=['bar', 'foo', 'qux'], ordered=True)
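Until the dtype is preserved by the reduction itself, one possible workaround is to cast the column back to the original dtype after grouping. This is a minimal sketch, not a fix for the underlying issue; the variable name `result` is illustrative, and the cast should be harmless on versions where the dtype already survives.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "payload": [-1, -2, -1, -2],
        "col": pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True),
    }
)

result = df.groupby("payload").first()

# Restore the original categorical dtype (categories and ordering)
# in case the reduction coerced the column to object.
result["col"] = result["col"].astype(df["col"].dtype)

print(result["col"].dtype)
```

Reusing `df["col"].dtype` rather than calling `astype("category")` keeps the full category list and the `ordered=True` flag, which would otherwise be re-inferred from only the values present in the result.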

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-98-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: 3.2.5
pip: 9.0.1
setuptools: 36.5.0
Cython: 0.20.1post0
numpy: 1.13.3
scipy: 0.13.3
pyarrow: None
xarray: 0.9.6
IPython: 6.2.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.6.4
feather: None
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.2.1
html5lib: 0.999
sqlalchemy: 0.8.4
pymysql: None
psycopg2: None
jinja2: 2.7.2
s3fs: 0.1.2
fastparquet: None
pandas_gbq: None
pandas_datareader: None