value_counts() crashes if groupby object contains empty groups · Issue #28479 · pandas-dev/pandas
When you group statistical counts by day, some days may have no counts at all. This results in empty groups in the groupby object. Calling value_counts() on such a groupby object causes a crash.
The following example illustrates the problem:
import pandas as pd
from IPython.display import display  # implicit in Jupyter; needed elsewhere

df = pd.DataFrame({'Timestamp': [1565083561, 1565083561+86400, 1565083561+86500, 1565083561+86400*2, 1565083561+86400*3, 1565083561+86500*3, 1565083561+86400*4],
                   'Food': ['apple', 'apple', 'banana', 'banana', 'orange', 'orange', 'pear']})
# Interpret the integer timestamps as seconds since the epoch.
df['Datetime'] = pd.to_datetime(df['Timestamp'], unit='s')
display(df)

# Case 1: every day has at least one row.
dfg = df.groupby(pd.Grouper(freq='1D', key='Datetime'))
for g in dfg:
    print(g)
display(dfg['Food'].value_counts())

# Case 2: drop the only row on 2019-08-08 so that day becomes an empty group.
df = df.drop([3])
display(df)
dfg = df.groupby(pd.Grouper(freq='1D', key='Datetime'))
for g in dfg:
    print(g)
display(dfg['Food'].value_counts())  # crashes with ValueError
This table does not contain any days with empty data, so value_counts() does not crash:
Timestamp | Food | Datetime |
---|---|---|
1565083561 | apple | 2019-08-06 09:26:01 |
1565169961 | apple | 2019-08-07 09:26:01 |
1565170061 | banana | 2019-08-07 09:27:41 |
1565256361 | banana | 2019-08-08 09:26:01 |
1565342761 | orange | 2019-08-09 09:26:01 |
1565343061 | orange | 2019-08-09 09:31:01 |
1565429161 | pear | 2019-08-10 09:26:01 |
After grouping by day:
(Timestamp('2019-08-06 00:00:00', freq='D'), Timestamp Food Datetime
0 1565083561 apple 2019-08-06 09:26:01)
(Timestamp('2019-08-07 00:00:00', freq='D'), Timestamp Food Datetime
1 1565169961 apple 2019-08-07 09:26:01
2 1565170061 banana 2019-08-07 09:27:41)
(Timestamp('2019-08-08 00:00:00', freq='D'), Timestamp Food Datetime
3 1565256361 banana 2019-08-08 09:26:01)
(Timestamp('2019-08-09 00:00:00', freq='D'), Timestamp Food Datetime
4 1565342761 orange 2019-08-09 09:26:01
5 1565343061 orange 2019-08-09 09:31:01)
(Timestamp('2019-08-10 00:00:00', freq='D'), Timestamp Food Datetime
6 1565429161 pear 2019-08-10 09:26:01)
Result of value_counts():
Datetime    Food
2019-08-06  apple     1
2019-08-07  apple     1
            banana    1
2019-08-08  banana    1
2019-08-09  orange    2
2019-08-10  pear      1
Name: Food, dtype: int64
This table contains a day with no data (2019-08-08), so value_counts() crashes:
Timestamp | Food | Datetime |
---|---|---|
1565083561 | apple | 2019-08-06 09:26:01 |
1565169961 | apple | 2019-08-07 09:26:01 |
1565170061 | banana | 2019-08-07 09:27:41 |
1565342761 | orange | 2019-08-09 09:26:01 |
1565343061 | orange | 2019-08-09 09:31:01 |
1565429161 | pear | 2019-08-10 09:26:01 |
After grouping by day (note the empty group on 2019-08-08):
(Timestamp('2019-08-06 00:00:00', freq='D'), Timestamp Food Datetime
0 1565083561 apple 2019-08-06 09:26:01)
(Timestamp('2019-08-07 00:00:00', freq='D'), Timestamp Food Datetime
1 1565169961 apple 2019-08-07 09:26:01
2 1565170061 banana 2019-08-07 09:27:41)
(Timestamp('2019-08-08 00:00:00', freq='D'), Empty DataFrame
Columns: [Timestamp, Food, Datetime]
Index: [])
(Timestamp('2019-08-09 00:00:00', freq='D'), Timestamp Food Datetime
4 1565342761 orange 2019-08-09 09:26:01
5 1565343061 orange 2019-08-09 09:31:01)
(Timestamp('2019-08-10 00:00:00', freq='D'), Timestamp Food Datetime
6 1565429161 pear 2019-08-10 09:26:01)
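The empty group can also be confirmed without printing every group. The following is a minimal sketch that rebuilds the second table directly (the variable names `timestamps`, `sizes`, and `empty_days` are mine, not from the report) and checks the per-day group sizes:

```python
import pandas as pd

# Same rows as the second table above (the 2019-08-08 row is already removed).
timestamps = [1565083561, 1565083561 + 86400, 1565083561 + 86500,
              1565083561 + 86400 * 3, 1565083561 + 86500 * 3,
              1565083561 + 86400 * 4]
df = pd.DataFrame({
    "Datetime": pd.to_datetime(timestamps, unit="s"),
    "Food": ["apple", "apple", "banana", "orange", "orange", "pear"],
})

# size() reports one entry per daily bin, including bins with no rows.
sizes = df.groupby(pd.Grouper(freq="1D", key="Datetime")).size()

# Days whose bin is empty.
empty_days = sizes[sizes == 0].index

print(sizes)
print(empty_days)
```

A size of zero for 2019-08-08 marks the bin that has no rows.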
value_counts() crashes:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-543-5efc1c882109> in <module>
14 [print(g) for g in dfg]
15 print('This will cause crash:')
---> 16 display(dfg['Food'].value_counts())
~/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py in value_counts(self, normalize, sort, ascending, bins, dropna)
1137
1138 # multi-index components
-> 1139 labels = list(map(rep, self.grouper.recons_labels)) + [llab(lab, inc)]
1140 levels = [ping.group_index for ping in self.grouper.groupings] + [lev]
1141 names = self.grouper.names + [self._selection_name]
~/anaconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py in repeat(a, repeats, axis)
469
470 """
--> 471 return _wrapfunc(a, 'repeat', repeats, axis=axis)
472
473
~/anaconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
54 def _wrapfunc(obj, method, *args, **kwds):
55 try:
---> 56 return getattr(obj, method)(*args, **kwds)
57
58 # An AttributeError occurs if the object does not have
ValueError: operands could not be broadcast together with shape (5,) (4,)
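The final error comes from numpy.repeat being given an array and a per-element repeat count of different lengths. The sketch below reproduces only that mismatch in isolation. The arrays are illustrative stand-ins, and reading the 5 as "all daily groups" and the 4 as "non-empty groups" is my assumption based on the shapes in the traceback, not something confirmed from the pandas source:

```python
import numpy as np

# Assumption: five group codes, one per daily bin, including the empty one.
group_codes = np.arange(5)

# Assumption: repeat counts exist only for the four non-empty bins.
repeats = np.array([1, 2, 1, 1])

try:
    np.repeat(group_codes, repeats)
    error_message = None
except ValueError as exc:
    # numpy refuses because the two arrays have different lengths.
    error_message = str(exc)

print(error_message)
```

If that reading is right, the crash would be a bookkeeping mismatch between the group labels and the per-group counts rather than a problem with the data itself.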
It turns out that this might result from a design flaw in DataFrame construction: empty rows are skipped. For example, the following call:
pd.DataFrame.from_dict(data={'row1':{'a':1, 'b':2}, 'row2': {'a':3, 'b':4, 'c':5}, 'row3':{}}, orient='index').fillna(0)
returns:
| | a | b | c |
|---|---|---|---|
| row1 | 1 | 2 | 0 |
| row2 | 3 | 4 | 5.0 |
Note that row3 is not constructed at all. The expected output is:
| | a | b | c |
|---|---|---|---|
| row1 | 1 | 2 | 0.0 |
| row2 | 3 | 4 | 5.0 |
| row3 | 0 | 0 | 0.0 |
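As a possible way to get a row for row3 in the meantime, the result can be reindexed against the original dictionary keys before filling. This is a sketch of a workaround, not a proposed fix, and the names `data` and `result` are mine:

```python
import pandas as pd

data = {"row1": {"a": 1, "b": 2}, "row2": {"a": 3, "b": 4, "c": 5}, "row3": {}}

result = (
    pd.DataFrame.from_dict(data, orient="index")
    .reindex(list(data))   # restore any keys dropped during construction
    .fillna(0)             # empty cells, including all of row3, become 0
)

print(result)
```

Reindexing introduces NaN before the fill, so the integer columns may come back as floats.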