API: sum of Series of all NaN should return 0 or NaN?

Summary

The question is what the sum of a Series of all NaNs should return (which is equivalent to an empty Series after skipping the NaNs): NaN or 0?

In [1]: import numpy as np; from pandas import Series

In [2]: s = Series([np.nan])

In [3]: s.sum(skipna=True)  # skipping NaNs is the default
Out[3]: nan or 0     <---- DISCUSSION POINT

In [4]: s.sum(skipna=False)
Out[4]: nan

This is a discussion point because pandas' internal nansum implementation returns NaN, but when bottleneck is installed, pandas uses bottleneck's implementation of nansum, which returns 0 (for versions >= 1.0).
Bottleneck changed the behaviour from returning NaN to returning 0 to model it after numpy's nansum function.

This has the very annoying consequence that depending on whether bottleneck is installed or not (which is only an optional dependency), you get a different behaviour.
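
To make the two conventions concrete, here is a minimal pure-Python sketch. The function names are hypothetical and this is not the actual pandas or bottleneck code; it only mimics the two results described above for all-NaN or empty input.

```python
import math


def nansum_returning_nan(values):
    """Sum the non-NaN values; return NaN when nothing is left to sum.

    Mimics the result described above for pandas' internal nansum.
    """
    kept = [v for v in values if not math.isnan(v)]
    return float(sum(kept)) if kept else float("nan")


def nansum_returning_zero(values):
    """Sum the non-NaN values; an empty sum is 0.

    Mimics the result described above for bottleneck >= 1.0.
    """
    return float(sum(v for v in values if not math.isnan(v)))


all_nan = [float("nan")]
mixed = [1.0, float("nan"), 2.0]

# The two conventions only disagree when no non-NaN values remain.
print(nansum_returning_nan(all_nan), nansum_returning_zero(all_nan))
print(nansum_returning_nan(mixed), nansum_returning_zero(mixed))
```

The two functions agree on any input with at least one non-NaN value, so the inconsistency only shows up for all-NaN or empty data.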

So the decision we need to make is to either:

- adapt pandas' internal implementation to return 0, so that 0 is returned for all-NaN/empty Series in all cases
- work around bottleneck's behaviour, or not use it for nansum, in order to consistently return NaN instead of 0
- choose one of the two above as the default, but have an option to switch behaviour
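
As a rough illustration of the second and third options, the sketch below uses plain Python. All names here (`force_nan_on_all_nan`, `nansum`, `all_nan_result`, `zero_backend`) are hypothetical and not part of pandas or bottleneck; `zero_backend` merely stands in for a backend that returns 0 on all-NaN input.

```python
import math


def zero_backend(values):
    """Stand-in for a backend nansum that returns 0 for all-NaN/empty input."""
    return float(sum(v for v in values if not math.isnan(v)))


def force_nan_on_all_nan(backend_sum, values):
    """Work around a 0-returning backend (second option).

    If no non-NaN value exists (including empty input), return NaN;
    otherwise delegate to the backend.
    """
    if all(math.isnan(v) for v in values):
        return float("nan")
    return backend_sum(values)


def nansum(values, all_nan_result=float("nan")):
    """Switchable behaviour (third option).

    `all_nan_result` is a hypothetical option giving the value returned
    when no non-NaN values remain (NaN by default here, 0.0 to opt in).
    """
    kept = [v for v in values if not math.isnan(v)]
    if not kept:
        return all_nan_result
    return float(sum(kept))


print(force_nan_on_all_nan(zero_backend, [float("nan")]))
print(nansum([float("nan")], all_nan_result=0.0))
```

The wrapper costs an extra pass over the data for the all-NaN check, which is one trade-off to weigh against simply not using the backend for nansum.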

Original title: nansum in bottleneck 1.0 will return 0 for all NaN arrays instead of NaN

xref pydata/bottleneck#96
xref #9421

Tests are turned off for bottleneck >1.0 (xref TST: test_nanops turns off bottneck for all tests after #10986)

Bottleneck's change matches a change in numpy's nansum from numpy 1.8 -> 1.9.

We should address this for pandas 0.16.

Should we work around the new behavior (probably the simplest choice) or change nansum in pandas?