
In my attempt to enable DataFrame reductions for ExtensionArrays (and, at the same time, clean up DataFrame._reduce a bit) in #32867, I am running into the following inconsistency.

Consider a Timedelta array and how it behaves for the any and all logical functions (on current master, but similar for released pandas):

td = pd.timedelta_range(0, periods=3)

Index

td.any()
... TypeError: cannot perform any with this index type: TimedeltaIndex

Series

pd.Series(td).any()
... TypeError: cannot perform any with type timedelta64[ns]

DataFrame

pd.DataFrame({'a': td}).any()
a    True
dtype: bool

(and exactly the same pattern for datetime64[ns])

For DataFrame, it works because there we call pandas.core.nanops.nanany(), and our internal nan-aware versions of any/all work fine with timedelta64/datetime64 dtypes.
It fails for Index and Series because we declare up front that this operation is not possible with this dtype, so we never get to call nanany() in those cases. In the DataFrame reduction, by contrast, we simply try it on df.values, and it works.

This is clearly an inconsistency that we should resolve. It is especially a problem for my PR #32867, where I want to rely on the array (column-wise) ops for DataFrame reductions: those ops give a different result, which leads to failing tests.

The question is thus: do we think that the any() and all() operations make sense for datetime-like data?
(and if yes, we can simply enable them for the array/index/series case)

For timedelta, I think you can say there is a concept of "zero" and "non-zero" (which is what any/all are testing in the end, for non-boolean, numeric data).
But for datetimes this might make less sense? The fact that 1970-01-01 00:00:00 would be "False" is just an implementation detail of the numbering starting at that point in time.
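To make the "implementation detail" point concrete, here is a small NumPy-only sketch (the variable names are mine, purely for illustration). It shows that both dtypes are stored as 64-bit integers, so a zero timedelta and the epoch datetime share the same underlying value 0, even though only the former is a meaningful "zero":

```python
import numpy as np

td = np.array([0, 1], dtype="timedelta64[ns]")
dt = np.array([0, 1], dtype="datetime64[ns]")

# Both dtypes are backed by int64; view() reinterprets the same memory.
td_ints = td.view("i8")
dt_ints = dt.view("i8")

# Integer 0 as a timedelta is a genuine "no duration" value ...
zero_duration = td[0] == np.timedelta64(0, "ns")

# ... while integer 0 as a datetime is just the epoch, an arbitrary origin.
epoch_is_zero = dt[0] == np.datetime64("1970-01-01T00:00:00", "ns")

print(td_ints, dt_ints, zero_duration, epoch_is_zero)
```

So any truthiness rule based on the stored integer would treat the epoch as "False" only because of where the numbering happens to start.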

As reference, this operation fails in numpy for both timedelta and datetime:

np.array([0, 1], dtype="datetime64[ns]").any()
... TypeError: No loop matching the specified signature and casting was found for ufunc logical_or

np.array([0, 1], dtype="timedelta64[ns]").any()
... TypeError: No loop matching the specified signature and casting was found for ufunc logical_or

But for timedeltas, that might be a case of "never got implemented" rather than "deliberately not implemented"? (There was a similar conversation about "mean" for timedelta.) cc @shoyer
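For illustration, the zero/non-zero meaning for timedeltas can be spelled out in plain NumPy by comparing against a zero duration explicitly. This is only a sketch of the semantics under discussion: td_any and td_all are hypothetical helper names, not existing pandas or NumPy functions, and NaT handling is deliberately left out.

```python
import numpy as np

ZERO = np.timedelta64(0, "ns")


def td_any(values):
    """Hypothetical helper: True if at least one duration is non-zero."""
    return bool((values != ZERO).any())


def td_all(values):
    """Hypothetical helper: True if every duration is non-zero."""
    return bool((values != ZERO).all())


td = np.array([0, 1, 2], dtype="timedelta64[D]")

print(td_any(td))      # one or more non-zero durations present
print(td_all(td))      # the first element is a zero duration
print(td_all(td[1:]))  # only non-zero durations remain
```

No comparable helper is sketched for datetimes, since there is no natural "zero" to compare against other than the epoch.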