Unexpected behaviour when grouping datetime column containing null-values, SeriesGroupby · Issue #10979 · pandas-dev/pandas

I found some unexpected behaviour when looking for the group minima of a datetime column containing null values. It appears that when the min method is called on a SeriesGroupBy of dtype datetime64 with null values, the values are cast to floats before the minima are computed. Consider the following:

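import pandas as pd

# Four consecutive days split across two groups; the first value is then nulled
df = pd.DataFrame({'datetime': pd.date_range('20150903', periods=4),
                   'groups': ['a', 'b']*2})
df.loc[0, 'datetime'] = pd.NaT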

In [357]: df.groupby('groups').datetime.min()
Out[357]:
groups
a             NaN
b    1.441325e+18
Name: datetime, dtype: float64

pd.NaT is stored internally as the int64 value -2^63, so it is determined to be the minimum of any group which contains it. The expected behaviour would be for null values to be ignored and the minima of the non-null values returned as datetime64 objects. Interestingly, the max method seems to work as expected:

In [367]: df.groupby('groups').datetime.max()
Out[367]:
groups
a   2015-09-05
b   2015-09-06
Name: datetime, dtype: datetime64[ns]
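The claim about the integer value of pd.NaT can be checked directly on the underlying NumPy array. This is a minimal sketch using NumPy only; the variable names are illustrative and not taken from the pandas source. It also suggests why max is unaffected: the smallest possible integer can win a minimum but never a maximum.

```python
import numpy as np

# A datetime64[ns] array holding one null value and one real date
values = np.array(['NaT', '2015-09-05'], dtype='datetime64[ns]')

# Reinterpret the same 8 bytes per element as int64 (no copy, no conversion)
as_int = values.view('int64')

nat_as_int = int(as_int[0])
int64_min = int(np.iinfo(np.int64).min)

# NaT occupies the smallest representable int64, so an integer min picks it
print(nat_as_int == int64_min)
print(int(as_int.min()) == int64_min)
# An integer max, by contrast, returns the real date
print(int(as_int.max()) == int(as_int[1]))
```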

The min method of the DataFrameGroupBy object is somewhere in between: it fails to ignore the null values and gives pd.NaT as the min of any group which contains one, but it does return the correct data type:

In [369]: df.groupby('groups').min()
Out[369]:
         datetime
groups
a             NaT
b      2015-09-04
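For comparison, the expected result (nulls ignored, datetime values returned) can be produced by dropping the nulls inside each group before taking the minimum. This is only a possible workaround sketch, not a fix for the underlying code path; the name `expected` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'datetime': pd.date_range('20150903', periods=4),
                   'groups': ['a', 'b'] * 2})
df.loc[0, 'datetime'] = pd.NaT

# Drop nulls within each group, then take the minimum of what remains.
# This goes through a plain Python function rather than the built-in
# groupby min, so it is slower but does not depend on that code path.
expected = df.groupby('groups')['datetime'].apply(lambda s: s.dropna().min())

print(expected)
```

Group 'a' holds NaT and 2015-09-05, and group 'b' holds 2015-09-04 and 2015-09-06, so the minima should be 2015-09-05 and 2015-09-04 respectively.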

I tried to trace the source of the error and I got as far as the call to

self.grouper.aggregate(obj.values, how='min')

where 'obj' is a (the only) set of values in self._iterate_slices. Within self.grouper.aggregate the lines

    if is_datetime_or_timedelta_dtype(values.dtype):
        values = values.view('int64')

and

    if com.is_integer_dtype(result):
        if len(result[result == tslib.iNaT]) > 0:
            result = result.astype('float64')
            result[result == tslib.iNaT] = np.nan

seem relevant. It might be worth noting that self.aggregate(lambda x: np.min(x, axis=self.axis)) has the desired output while self.aggregate(np.min) does not. Also, changing the definition of the min method to

min = _groupby_function('min', 'min', np.min, numeric_only=True)

fixes this particular problem.
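To illustrate how the two quoted snippets could combine to give the float output, here is a rough NumPy-only imitation. It is a sketch under assumptions, not the pandas implementation: the per-group loop stands in for the real grouped min routine, and `labels` and `iNaT` are hypothetical stand-ins for the grouper's internals and tslib.iNaT.

```python
import numpy as np

iNaT = np.iinfo(np.int64).min  # assumed to equal tslib.iNaT, i.e. -2**63

values = np.array(['NaT', '2015-09-04', '2015-09-05', '2015-09-06'],
                  dtype='datetime64[ns]')
labels = np.array([0, 1, 0, 1])  # group 'a' -> 0, group 'b' -> 1

# First snippet: datetime values are viewed as int64 before aggregating
ints = values.view('int64')

# Stand-in for the grouped min: a plain integer min per group,
# so the NaT sentinel wins in any group that contains it
result = np.array([ints[labels == g].min() for g in (0, 1)])

# Second snippet: if any group result is the sentinel, the WHOLE result
# is cast to float64 and the sentinel is replaced with NaN
if np.issubdtype(result.dtype, np.integer):
    if len(result[result == iNaT]) > 0:
        result = result.astype('float64')
        result[result == iNaT] = np.nan

print(result.dtype, result)
```

Under these assumptions group 'a' comes out as NaN and group 'b' as a float number of nanoseconds since the epoch, which matches the shape of the SeriesGroupBy output reported above.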