Unexpected behaviour when grouping datetime column containing null-values, SeriesGroupby · Issue #10979 · pandas-dev/pandas
I found some unexpected behaviour when looking for the group minima of a datetime column containing null values. It appears that when the min method is called on a SeriesGroupBy of dtype datetime64 with null values, the values are cast to floats before the minima are computed. Consider the following:
import pandas as pd

df = pd.DataFrame({'datetime': pd.date_range('20150903', periods=4),
                   'groups': ['a', 'b'] * 2})
df.loc[0, 'datetime'] = pd.NaT  # introduce a null value into group 'a'
In [357]: df.groupby('groups').datetime.min()
Out[357]:
groups
a             NaN
b    1.441325e+18
Name: datetime, dtype: float64
The integer representation of pd.NaT is -2^63, so it is determined to be the minimum of any group that contains it. The expected behaviour would be for null values to be ignored and the minima of the non-null values returned as datetime64 objects. Interestingly, the max method seems to work as expected:
In [367]: df.groupby('groups').datetime.max()
Out[367]:
groups
a   2015-09-05
b   2015-09-06
Name: datetime, dtype: datetime64[ns]
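The asymmetry between min and max is consistent with how NaT is stored. The following is a minimal NumPy-only sketch (it does not call any pandas internals, and the variable names are illustrative) showing that the int64 view of NaT is the smallest int64 value, so it wins any integer min but never an integer max:

```python
import numpy as np

# Group 'a' from the example above: one NaT and one real timestamp.
group_a = np.array(['NaT', '2015-09-05'], dtype='datetime64[ns]')

# Reinterpret the datetimes as their underlying int64 nanosecond counts.
as_int = group_a.view('int64')

int64_min = np.iinfo(np.int64).min  # -2**63, the NaT sentinel

# NaT is the smallest possible int64, so an integer min always selects it ...
nat_is_min = as_int.min() == int64_min

# ... while an integer max is unaffected and still maps back to a real date.
max_as_datetime = np.datetime64(int(as_int.max()), 'ns')

print(nat_is_min, max_as_datetime)
```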
The min method of the DataFrameGroupBy object is somewhere in between: it fails to ignore the null values and gives pd.NaT as the min of any group that contains it, but it does return the correct data type:
In [369]: df.groupby('groups').min()
Out[369]:
         datetime
groups
a             NaT
b      2015-09-04
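For comparison, the expected result described earlier (nulls ignored, datetime64 dtype preserved) can be produced by aggregating each group with an explicit function. This is only a workaround sketch, relying on Series.min skipping null values by default; it is not a fix for the underlying groupby code path:

```python
import pandas as pd

df = pd.DataFrame({'datetime': pd.date_range('20150903', periods=4),
                   'groups': ['a', 'b'] * 2})
df.loc[0, 'datetime'] = pd.NaT

# Apply Series.min per group; dropna() makes the null handling explicit,
# although Series.min already skips NaT by default (skipna=True).
expected = df.groupby('groups')['datetime'].agg(lambda s: s.dropna().min())

print(expected)
```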
I tried to trace the source of the error and I got as far as the call to
self.grouper.aggregate(obj.value, how='min')
where 'obj' is a (the only) set of values in self._iterate_slices. Within self.grouper.aggregate the lines
if is_datetime_or_timedelta_dtype(values.dtype):
    values = values.view('int64')
and
if com.is_integer_dtype(result):
    if len(result[result == tslib.iNaT]) > 0:
        result = result.astype('float64')
        result[result == tslib.iNaT] = np.nan
seem relevant. It might be worth noting that self.aggregate(lambda x: np.min(x, axis=self.axis)) has the desired output while self.aggregate(np.min) does not. Also, changing the definition of the min method to
min = _groupby_function('min', 'min', np.min, numeric_only=True)
fixes this particular problem.
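To illustrate how the two quoted snippets could combine to produce the float64 output shown at the top, here is a simplified imitation in plain NumPy. It is not the pandas source: the grouping loop, the iNaT constant, and the variable names are stand-ins assumed for illustration.

```python
import numpy as np

# Same data as the example: row 0 is NaT, groups alternate a, b, a, b.
values = np.array(['NaT', '2015-09-04', '2015-09-05', '2015-09-06'],
                  dtype='datetime64[ns]')
groups = np.array(['a', 'b', 'a', 'b'])
iNaT = np.iinfo(np.int64).min  # stand-in for tslib.iNaT

# Step 1: datetimes are viewed as int64 before aggregation.
ints = values.view('int64')

# Step 2: a plain integer min per group; NaT (-2**63) wins in group 'a'.
result = np.array([ints[groups == g].min() for g in ('a', 'b')])

# Step 3: because the integer result contains iNaT, it is cast to float64
# and the sentinel is replaced by NaN, losing the datetime dtype.
mask = result == iNaT
if mask.any():
    result = result.astype('float64')
    result[mask] = np.nan

print(result, result.dtype)
```

Under these assumptions the result is a float64 array with NaN for group a and the nanosecond count of 2015-09-04 for group b, which matches the shape of the reported output.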