groupby on 2 categorical columns, when one categorical is based on datetimes, incorrectly returns all NaN dataframe · Issue #21390 · pandas-dev/pandas
Code Sample, a copy-pastable example if possible
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'label1': list('abcbabcba'),
    'label2': list('xyxyxyxyx'),
    'minute': list(pd.date_range('2018-06-01 00', freq='1T', periods=3)) * 3,
    'n1': np.arange(9, dtype='float'),
    'n2': np.arange(9, dtype='float') ** 2
})

# this is correct
df.groupby(['label1', 'minute'])[['n1', 'n2']].mean()

# convert to categoricals
df['label1'] = df['label1'].astype('category')
df['label2'] = df['label2'].astype('category')
df['minute'] = df['minute'].astype('category')

# this is wrong, returns all NaNs
df.groupby(['label1', 'minute'])[['n1', 'n2']].mean()
Problem description
When grouping by [str, datetime] columns, results are as expected:
>>> df.groupby(['label1', 'minute'])[['n1', 'n2']].mean()
                             n1    n2
label1 minute
a      2018-06-01 00:00:00  0.0   0.0
       2018-06-01 00:01:00  4.0  16.0
       2018-06-01 00:02:00  8.0  64.0
b      2018-06-01 00:00:00  3.0   9.0
       2018-06-01 00:01:00  4.0  25.0
       2018-06-01 00:02:00  5.0  25.0
c      2018-06-01 00:00:00  6.0  36.0
       2018-06-01 00:02:00  2.0   4.0
After converting label1, label2, and minute to categoricals, that same groupby returns all NaNs:
>>> df.groupby(['label1', 'minute'])[['n1', 'n2']].mean()
                            n1  n2
label1 minute
a      2018-06-01 00:00:00 NaN NaN
       2018-06-01 00:01:00 NaN NaN
       2018-06-01 00:02:00 NaN NaN
b      2018-06-01 00:00:00 NaN NaN
       2018-06-01 00:01:00 NaN NaN
       2018-06-01 00:02:00 NaN NaN
c      2018-06-01 00:00:00 NaN NaN
       2018-06-01 00:01:00 NaN NaN
       2018-06-01 00:02:00 NaN NaN
I only hit this bug when grouping on 2 categoricals where one of them is datetime based (the order of the keys is irrelevant). Grouping by ['label1', 'label2'], or by 'minute' on its own, works as expected.
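To check the different key combinations quickly, something like the sketch below might help. It is not part of the original report: `build_frame` and `categorical_matches_plain` are hypothetical helper names, the frame is rebuilt with `freq='min'` in place of `'1T'`, and `observed=True` is passed on the assumption that only key combinations present in the data should be compared. A return value of `False` would indicate the mismatch described above.

```python
import numpy as np
import pandas as pd


def build_frame():
    """Rebuild the frame from the report ('min' is used instead of '1T')."""
    return pd.DataFrame({
        'label1': list('abcbabcba'),
        'label2': list('xyxyxyxyx'),
        'minute': list(pd.date_range('2018-06-01 00', freq='min', periods=3)) * 3,
        'n1': np.arange(9, dtype='float'),
        'n2': np.arange(9, dtype='float') ** 2,
    })


def categorical_matches_plain(keys):
    """Hypothetical helper: compare group means before and after the category cast."""
    plain = build_frame()
    cat = plain.copy()
    for col in ('label1', 'label2', 'minute'):
        cat[col] = cat[col].astype('category')

    expected = plain.groupby(keys)[['n1', 'n2']].mean()
    # observed=True keeps only key combinations that occur in the data,
    # so the row count lines up with the non-categorical result.
    actual = cat.groupby(keys, observed=True)[['n1', 'n2']].mean()

    # np.allclose treats NaN as unequal by default, so an all-NaN result fails.
    return (expected.shape == actual.shape
            and bool(np.allclose(expected.to_numpy(), actual.to_numpy())))


if __name__ == '__main__':
    for keys in (['label1', 'minute'], ['label1', 'label2'], 'minute'):
        print(keys, categorical_matches_plain(keys))
```

The comparison uses raw values rather than index equality because the categorical result carries categorical index levels, which would differ from the plain index even when the means agree.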
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-22-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.14.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 1.0.5
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None