groupby on 2 categorical columns, when one categorical is based on datetimes, incorrectly returns all NaN dataframe · Issue #21390 · pandas-dev/pandas (original) (raw)

Code Sample, a copy-pastable example if possible

import pandas as pd import numpy as np

df = pd.DataFrame({ 'label1': list('abcbabcba'), 'label2': list('xyxyxyxyx'), 'minute': list(pd.date_range('2018-06-01 00', freq='1T', periods=3)) * 3, 'n1': np.arange(9, dtype='float'), 'n2': np.arange(9, dtype='float') ** 2 })

this is correct

df.groupby(['label1', 'minute'])[['n1', 'n2']].mean()

convert to categoricals

df['label1'] = df['label1'].astype('category') df['label2'] = df['label2'].astype('category') df['minute'] = df['minute'].astype('category')

this is wrong, returns all NaNs

df.groupby(['label1', 'minute'])[['n1', 'n2']].mean()

Problem description

When grouping by [str, datetime] columns, results are as expected:

>>> df.groupby(['label1', 'minute'])[['n1', 'n2']].mean()
                             n1    n2
label1 minute                        
a      2018-06-01 00:00:00  0.0   0.0
       2018-06-01 00:01:00  4.0  16.0
       2018-06-01 00:02:00  8.0  64.0
b      2018-06-01 00:00:00  3.0   9.0
       2018-06-01 00:01:00  4.0  25.0
       2018-06-01 00:02:00  5.0  25.0
c      2018-06-01 00:00:00  6.0  36.0
       2018-06-01 00:02:00  2.0   4.0

After converting label1, label2, and minute to categoricals, that same groupby returns all NaNs:

>>> df.groupby(['label1', 'minute'])[['n1', 'n2']].mean()
                            n1  n2
label1 minute                     
a      2018-06-01 00:00:00 NaN NaN
       2018-06-01 00:01:00 NaN NaN
       2018-06-01 00:02:00 NaN NaN
b      2018-06-01 00:00:00 NaN NaN
       2018-06-01 00:01:00 NaN NaN
       2018-06-01 00:02:00 NaN NaN
c      2018-06-01 00:00:00 NaN NaN
       2018-06-01 00:01:00 NaN NaN
       2018-06-01 00:02:00 NaN NaN

I only got this bug when grouping on 2 categoricals with one of them being datetime based (order is irrelevant). Grouping by ['label1', 'label2'] and 'minute' by itself works as expected.

Output of `pd.show_versions()`

[paste the output of `pd.show_versions()` here below this line]

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-22-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.14.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 1.0.5
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None