ENH: Added a min_count keyword to stat funcs by TomAugspurger · Pull Request #18876 · pandas-dev/pandas
Conversation
The current default is 1, reproducing the behavior of pandas 0.21. The current
test suite should pass. I'll add additional commits here changing the default to 0.

Currently, only nansum and nanprod actually do anything with min_count. It
will not be hard to adjust the other nan* methods to use it if we want. This
was just simplest for now.

Additional tests for the new behavior have been added.
closes #18678
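For reference, a quick sketch of the behavior being added (the values follow the examples later in this thread; min_count=1 reproduces 0.21, min_count=0 is the planned future default):

    import numpy as np
    import pandas as pd

    s = pd.Series([])

    # Backwards-compatible default: fewer than min_count=1 valid values -> NA.
    s.sum()              # nan
    s.sum(min_count=1)   # nan, same as the default

    # Opt in to the planned new default: the empty sum is the additive identity.
    s.sum(min_count=0)   # 0.0

    # prod behaves the same way with its identity, 1.0.
    pd.Series([np.nan]).prod(min_count=0)   # 1.0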
    if len(self.kwargs) > 0:
        for k, v in compat.iteritems(self.kwargs):
            if k not in kwds:
                kwds[k] = v
    try:
-       if values.size == 0:
+       if values.size < min_count:
            # we either return np.nan or pd.NaT
Whoops, I moved this logic to `_na_for_min_count` but forgot to call it here.
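For context, a rough sketch of what such a helper could look like (the name `_na_for_min_count` is from this PR; the body below is my illustration, not the actual implementation):

    import numpy as np

    def _na_for_min_count_sketch(values, axis):
        """Return the NA result for a reduction that saw fewer than
        min_count valid values (NaN here; datetimes would need pd.NaT,
        per the "we either return np.nan or pd.NaT" comment above)."""
        if values.ndim == 1 or axis is None:
            return np.nan
        # 2-D case: an all-NaN array shaped like the reduced output.
        result_shape = values.shape[:axis] + values.shape[axis + 1:]
        result = np.empty(result_shape, dtype=np.float64)
        result.fill(np.nan)
        return result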
5630d9d passed all the CI tests, so that's good.
I just pushed another commit that fixes a few things, but is still backwards compatible with 0.21.x.
I'll push my breaking changes once I get things passing locally.
I think we should for now only add `min_count` to `sum` and `prod`. Or does that complicate the implementation?

> Or does that complicate the implementation?
Not much. I mainly did it for all of them since it was easier, and would possibly be useful for other reductions (haven't thought it through). I'll push a commit that does it just for sum and prod.
For other reductions like min, max, std, ... we already return NaN for empty or all-NA series, so I'm not sure what `min_count` would be for there.
Pushed another backwards-compatible commit, including docstrings, written from the point of view of a hypothetical 0.21.2 release. I haven't thought through whether it'd be worth backporting the non-breaking changes (adding `min_count=1`) to 0.21.x, but in case we do, it'll just be the current set of commits:
Signature: pd.Series.sum(self, axis=None, skipna=None, level=None, numeric_only=None, min_count=1, **kwargs)
Docstring:
Return the sum of the values for the requested axis
Parameters
axis : {index (0)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA or empty, the result
will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a
particular level, collapsing into a scalar
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use
everything, then use only numeric data. Not implemented for Series.
min_count : int, default 1
The required number of valid values to perform the operation. If fewer than
``min_count`` non-NA values are present the result will be NA.
.. versionadded:: 0.21.2
Added with the default being 1. This means the sum or product
of an all-NA or empty series is ``NaN``.
Returns
sum : scalar or Series (if level specified)
Examples
By default, the sum of an empty series is NaN.

>>> pd.Series([]).sum()  # min_count=1 is the default
nan

This can be controlled with the ``min_count`` parameter. For example, if
you'd like the sum of an empty series to be 0, pass ``min_count=0``.

>>> pd.Series([]).sum(min_count=0)
0.0
Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
empty series identically.

>>> pd.Series([np.nan]).sum()
nan

>>> pd.Series([np.nan]).sum(min_count=0)
0.0
For reviewing, do people have a preference for:

a.) 2 PRs: the first adding a backwards-compatible min_count=1, the second changing to min_count=0 and updating all the tests
b.) A single PR (maybe with me rebasing / squashing to keep the two changes separate)
    result = np.empty(result_shape, dtype=values.dtype)
    result.fill(fill_value)
    return result

if values.size < min_count:
`values` could be 2-D here, so in theory we should handle this on a per-column basis (but not 100% sure if it is ever practically 2-D).
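A small numpy sketch of what per-column handling could look like for 2-D `values`, given a boolean `mask` of missing entries (illustrative only, not the PR's code):

    import numpy as np

    values = np.array([[1.0, np.nan],
                       [np.nan, np.nan]])
    mask = np.isnan(values)
    min_count = 1

    # Count valid entries per column instead of checking values.size once,
    # then null out any column that falls below min_count.
    valid_per_col = mask.shape[0] - mask.sum(axis=0)
    result = np.nansum(values, axis=0)
    result[valid_per_col < min_count] = np.nan
    print(result)  # [ 1. nan]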
@@ -304,15 +308,18 @@ def nanall(values, axis=None, skipna=True):
@disallow('M8')
@bottleneck_switch()
-def nansum(values, axis=None, skipna=True):
+def nansum(values, axis=None, skipna=True, min_count=1):

    if len(values) < min_count:
can have ndim > 1 here
@@ -1759,6 +1759,48 @@ def test_value_counts_categorical_not_ordered(self):
    tm.assert_series_equal(idx.value_counts(normalize=True), exp)

class TestMinCount():
    @pytest.mark.parametrize("use_bottleneck", [True, False])
    @pytest.mark.parametrize("method", [("sum", 0), ("prod", 1)])
So for sum we never use bottleneck; I think for prod we DO use bottleneck. So maybe we should just fix that here as well.
In `_bn_ok_dtype`:
# GH 9422
# further we also want to preserve NaN when all elements
# are NaN, unlike bottleneck/numpy which consider this
# to be 0
if name in ['nansum', 'nanprod']:
return False
so we don't use bottleneck. And apparently bottleneck doesn't have a nanprod, at least in my version installed locally.
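The difference GH 9422 guards against, shown with numpy (bottleneck's `nansum` behaves like numpy's here, which is why it is disabled for these two reductions):

    import numpy as np
    import pandas as pd

    # numpy (and bottleneck) treat an all-NaN input as an empty sum:
    np.nansum(np.array([np.nan]))   # 0.0

    # pandas preserves NaN for all-NA input, so nansum/nanprod must
    # bypass bottleneck via the _bn_ok_dtype check quoted above:
    pd.Series([np.nan]).sum()       # nan (with this PR's min_count=1 default)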
@pytest.mark.parametrize("method", [("sum", 0), ("prod", 1)]) |
---|
def test_min_count_empty(self, method, use_bottleneck): |
method, unit = method |
s = pd.Series() |
All of these tests should be at the very top, right after (or maybe replacing):
class TestSeriesAnalytics(TestData):
@pytest.mark.parametrize("use_bottleneck", [True, False])
@pytest.mark.parametrize("method", ["sum", "prod"])
def test_empty(self, method, use_bottleneck):
with pd.option_context("use_bottleneck", use_bottleneck):
# GH 9422
# treat all missing as NaN
those tests need to be changed, no?
> For reviewing, do people have a preference for:
> a.) 2 PRs: the first adding a backwards-compatible min_count=1, the second changing to min_count=0 and updating all the tests
> b.) A single PR (maybe with me rebasing / squashing to keep the two changes separate)
a) Maybe that is better; then we can see both passing.
I think this is ready for review (cc @shoyer). I haven't written a whatsnew since I'll be doing that in the follow-up PR.
Just to verify, the current plan is to release 0.22.0 as 0.21.1 + this PR and my follow-up PR with the breaking changes?
@@ -32,36 +32,64 @@ class TestSeriesAnalytics(TestData):
@pytest.mark.parametrize("use_bottleneck", [True, False])
@pytest.mark.parametrize("method", ["sum", "prod"])
def test_empty(self, method, use_bottleneck):
    if method == "sum":
You should just pass in `unit` with `method`:

@pytest.mark.parametrize("method, unit", [("sum", 0.0), ("prod", 1.0)])
with pd.option_context("use_bottleneck", use_bottleneck):
    # GH 9422
    # treat all missing as NaN
    s = Series([])
    result = getattr(s, method)()
    assert isna(result)
    result = getattr(s, method)(min_count=0)
I would explicitly pass `min_count=1` in tests as well, and then add another entry without passing `min_count` (i.e. the default).
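i.e., something along these lines (a sketch of the suggested parametrization; the test name is hypothetical):

    import pandas as pd
    import pytest

    @pytest.mark.parametrize("kwargs", [{}, {"min_count": 1}])
    def test_empty_sum_is_na(kwargs):
        # The default and an explicit min_count=1 should agree: the sum
        # of an empty Series is NA under the 0.21-compatible behavior.
        assert pd.isna(pd.Series([]).sum(**kwargs))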
FYI, I have a `min_count` keyword working for groupby on my next branch:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({"A": pd.Categorical(['a', 'a'], categories=['a', 'b']), "B": [1, 1]})
In [3]: df.groupby("A").B.sum()
Out[3]:
A
a    2
b    0
Name: B, dtype: int64

In [4]: df.groupby("A").B.sum(min_count=1)
Out[4]:
A
a    2.0
b    NaN
Name: B, dtype: float64
I think I'll include that in this PR as well (with the default being 1).
@@ -88,7 +89,7 @@ def group_add_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
    for i in range(ncounts):
        for j in range(K):
-           if nobs[i, j] == 0:
+           if nobs[i, j] < min_count:
Does this work for `min_count == 0`?
Yeah, `sumx` starts out as zeros, so we just have to avoid setting it to NaN. Same for `prod`, but with ones.
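A plain-Python sketch of the templated loop's logic, to show why `min_count == 0` works (my illustration, not the generated Cython):

    import numpy as np

    def group_add_sketch(values, labels, ngroups, min_count=0):
        """Illustrative group sum: sumx starts at the identity (zeros),
        so with min_count=0 an empty group simply keeps its 0."""
        sumx = np.zeros(ngroups)                   # identity for sum
        nobs = np.zeros(ngroups, dtype=np.int64)   # valid counts per group
        for val, lab in zip(values, labels):
            if not np.isnan(val):
                nobs[lab] += 1
                sumx[lab] += val
        out = sumx.copy()
        # Only groups with fewer than min_count observations become NaN;
        # with min_count == 0, "nobs < 0" is never true.
        out[nobs < min_count] = np.nan
        return out

    group_add_sketch([1.0, 1.0], [0, 0], ngroups=2, min_count=1)
    # -> array([ 2., nan]), matching the groupby example above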
@@ -2322,9 +2326,16 @@ def _aggregate(self, result, counts, values, comp_ids, agg_func,
    for i, chunk in enumerate(values.transpose(2, 0, 1)):
        chunk = chunk.squeeze()
-       agg_func(result[:, :, i], counts, chunk, comp_ids)
+       if min_count is None:
I would always pass it to Cython as an int; it can be -1 if it's somehow None.
I mainly did this to avoid having to add an unused `min_count` keyword to all the `group_*` methods. If I added `min_count` to each of those, then I'd also need to raise if a non-default value is passed. I'll add a comment here that it's only expected for `count` and `prod`, though.
I would just add it and assert `== -1`. These are all internal routines anyhow; an `if` is very ugly and error-prone here. Better to make them have the same uniform signature.
Ah, ok that sounds easy enough.
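The resulting pattern would look roughly like this (Python pseudocode of the Cython template; the signature details are a sketch):

    def group_last_sketch(out, counts, values, labels, min_count=-1):
        # Uniform signature: every group_* routine accepts min_count with
        # a default of -1. Routines that don't implement it (like
        # group_last) only accept the "not passed" sentinel.
        assert min_count == -1
        ...  # existing group_last logic unchanged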
-   agg_func(result, counts, values, comp_ids)
+   if min_count is None:
+       agg_func(result, counts, values, comp_ids)
+   else:
same
@@ -332,7 +343,8 @@ def get_dispatch(dtypes):
def group_last_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
                        ndarray[int64_t] counts,
                        ndarray[{{c_type}}, ndim=2] values,
-                       ndarray[int64_t] labels):
+                       ndarray[int64_t] labels,
+                       Py_ssize_t min_count):
You can set `=-1` as the default here (and actually in all routines), I think.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, will do (had to for OHLC). I couldn't remember if Cython took a perf hit for checking default values.
@@ -822,7 +822,14 @@ def test_cython_agg_empty_buckets(self):
    grps = range(0, 55, 5)

    for op, targop in ops:
-       result = df.groupby(pd.cut(df[0], grps))._cython_agg_general(op)
+       # calling _cython_agg_general directly, instead of via the user API
This bit was unfortunate, but I think necessary since we're calling `_cython_agg_general` directly. I'll need to change `min_count` to 0 in the follow-up PR.
Added `min_count` to resample in the last push.

In [4]: s.resample('30T').sum()  # default min_count=1 for now, will change to 0
Out[4]:
2017-01-01 00:00:00    1.0
2017-01-01 00:30:00    NaN
2017-01-01 01:00:00    1.0
Freq: 30T, dtype: float64

In [5]: s.resample('30T').sum(min_count=0)
Out[5]:
2017-01-01 00:00:00    1
2017-01-01 00:30:00    0
2017-01-01 01:00:00    1
Freq: 30T, dtype: int64
I'm going to see how difficult rolling is now. I think for symmetry it'd be nice to have, but how does it interact with `min_periods`?

I believe that `min_periods` specifies the required length of the window. If the window is shorter than that, the output is NA regardless of `min_count`.

Given a window of at least `min_periods` length, `min_count` specifies the required number of non-NA observations for a non-NA output.
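To make that interpretation concrete, a hypothetical sketch of how the two keywords could compose for a rolling sum (`min_count` on rolling was never implemented; this only illustrates the proposed semantics):

    import numpy as np
    import pandas as pd

    def rolling_sum_sketch(s, window, min_periods=1, min_count=1):
        out = []
        for end in range(1, len(s) + 1):
            win = s.iloc[max(0, end - window):end]
            if len(win) < min_periods:            # window too short: NA
                out.append(np.nan)
            elif win.notna().sum() < min_count:   # too few non-NA values: NA
                out.append(np.nan)
            else:
                out.append(win.sum())             # nansum over the window
        return pd.Series(out, index=s.index)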
@Appender(_local_template)
def f(self, **kwargs):
    return f_(self, **kwargs)
else:
? Not sure you need this. Shouldn't the template just work for this?
It's not required, but with it I get the `min_count` in the signature.

In [5]: gr.sum?
Signature: gr.sum(min_count=1, **kwargs)
Docstring: Compute sum of group values

Without the if / else, things work, but then the sig is just `**kwargs`. I have a slight preference for including it in the signature, but will revert it if you want.
I would revert it for now; we don't do this for any other parameters now.
IIRC there is an issue about this; it needs a comprehensive solution.
Yeah, I forgot about numeric_only, etc. OK.
@@ -822,7 +822,14 @@ def test_cython_agg_empty_buckets(self):
    grps = range(0, 55, 5)

    for op, targop in ops:
-       result = df.groupby(pd.cut(df[0], grps))._cython_agg_general(op)
Should parametrize this.
> I'm going to see how difficult rolling is now. I think for symmetry it'd be nice to have, but how does it interact with min_periods?
>
> I believe that min_periods specifies the required length of the window. If the window is shorter than that, the output is NA regardless of min_count.
>
> Given a window of at least min_periods length, min_count specifies the required number of non-NA observations for a non-NA output.
`min_count` is de facto the same as `min_periods`, or at least it performs the same function. I don't think it is necessary to add, and it would add much complexity and user confusion. We could just rename `min_periods` to `min_count`, I suppose (that would need a deprecation cycle).
Ah, is `min_periods` the minimum number of observations, or the minimum number of non-NA observations? E.g.
In [2]: s = pd.Series(1, index=pd.date_range("2017", periods=6, freq="D"))
   ...: t = s.copy()
   ...: t.iloc[[3, 4]] = np.nan

In [3]: s
Out[3]:
2017-01-01    1
2017-01-02    1
2017-01-03    1
2017-01-04    1
2017-01-05    1
2017-01-06    1
Freq: D, dtype: int64

In [4]: t
Out[4]:
2017-01-01    1.0
2017-01-02    1.0
2017-01-03    1.0
2017-01-04    NaN
2017-01-05    NaN
2017-01-06    1.0
Freq: D, dtype: float64

In [5]: s.rolling(2).sum()
Out[5]:
2017-01-01    NaN
2017-01-02    2.0
2017-01-03    2.0
2017-01-04    2.0
2017-01-05    2.0
2017-01-06    2.0
Freq: D, dtype: float64

In [6]: t.rolling(2).sum()
Out[6]:
2017-01-01    NaN    # (1,)
2017-01-02    2.0    # (1, 1)
2017-01-03    2.0    # (1, 1)
2017-01-04    NaN    # (1, nan)
2017-01-05    NaN    # (nan, nan)
2017-01-06    NaN    # (nan, 1)
Freq: D, dtype: float64
So will `Out[6]` be changing at all?
Great work!

Do we need to add some tests with higher values for `min_count`? This 'works', so we should either test it or disallow it.
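A sketch of what such a test could look like (the test name and values are mine, not from the PR):

    import pandas as pd
    import pytest

    @pytest.mark.parametrize("method", ["sum", "prod"])
    def test_min_count_exceeds_length(method):
        # min_count larger than the number of valid values: result is NA.
        s = pd.Series([1.0, 2.0])
        result = getattr(s, method)(min_count=3)
        assert pd.isna(result)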
As also commented inline somewhere, `sum(level=)` does not seem to be implemented yet:
In [15]: s = pd.Series([1, np.nan, np.nan, np.nan], pd.MultiIndex.from_product([[1, 2], ['a', 'b']]))
In [16]: s
Out[16]:
1 a 1.0
b NaN
2 a NaN
b NaN
dtype: float64
In [17]: s.sum(level=0)
Out[17]:
1 1.0
2 0.0
dtype: float64
In [18]: s.sum(level=0, min_count=0)
Out[18]:
1 1.0
2 0.0
dtype: float64
In [19]: s.sum(level=0, min_count=1)
Out[19]:
1 1.0
2 0.0
dtype: float64
(and it also changed compared to 0.21.0, where it gave `1.0, NaN` in this case).
Groupby seems to have the wrong default (for now, to be backwards compatible), but I can't directly spot the error in the code:
In [15]: s = pd.Series([1, np.nan, np.nan, np.nan], pd.MultiIndex.from_product([[1, 2], ['a', 'b']]))
In [21]: s.groupby(level=0).sum()
Out[21]:
1 1.0
2 0.0
dtype: float64
In [22]: s.groupby(level=0).sum(min_count=0)
Out[22]:
1 1.0
2 0.0
dtype: float64
In [23]: s.groupby(level=0).sum(min_count=1)
Out[23]:
1 1.0
2 NaN
dtype: float64
(I get the same when grouping by array of values or grouping by column in dataframe)
The required number of valid values to perform the operation. If fewer than
``min_count`` non-NA values are present the result will be NA.
.. versionadded:: 0.21.2
Probably it will become 0.22? (but we can change that later)
axis = self._stat_axis_number
if level is not None:
    return self._agg_by_level(name, axis=axis, level=level,
                              skipna=skipna)
Does this one also need to handle `min_count`?
if axis is not None and getattr(result, 'ndim', False):
-   null_mask = (mask.shape[axis] - mask.sum(axis)) == 0
+   null_mask = (mask.shape[axis] - mask.sum(axis) - min_count) < 0
Can you just subtract something here, if `null_mask` is later on used for indexing into `result`?
I'm not sure what you mean here. `null_mask` will be a boolean array of which values in the output should be null because they have too few valid values.
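Concretely, a worked numpy example of the mask arithmetic in the diff above:

    import numpy as np

    values = np.array([[1.0, np.nan],
                       [np.nan, np.nan]])
    mask = np.isnan(values)
    axis, min_count = 0, 1

    # Valid count per output position minus min_count: negative means
    # "too few valid values", so that output position should be null.
    null_mask = (mask.shape[axis] - mask.sum(axis) - min_count) < 0
    print(null_mask)  # [False  True]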
> Groupby seems to have the wrong default (for now, to be backwards compatible), but I can't directly spot the error in the code:

I just pushed a commit fixing that. `min_count` wasn't being passed through.

In [4]: s.groupby(level=0).sum()
Out[4]:
1    1.0
2    NaN
dtype: float64
> As also commented inline somewhere, the sum(level=) does not seem to be implemented yet:

This seems to be fixed (presumably level= is using groupby and was hit by the same issue):

In [10]: s.sum(level=0)
Out[10]:
1    1.0
2    NaN
dtype: float64
Groupby is indeed fixed!

> This seems to be fixed (presumably level= is using groupby and was hit by the same issue)

The default is indeed fixed, but min_count doesn't have any effect yet, so it still needs to be passed through as well in some way, I think.
OK, we should be good on series/frame, groupby, and resample. Does anyone have thoughts on expanding / window? Does min_periods mean

1. the minimum number of observations in the window, or
2. the minimum number of non-NA observations in the window?

If it's 1., then the current behavior (on master) is buggy:
In [6]: s = pd.Series(1, index=pd.date_range("2017", periods=6, freq="D"))
   ...: t = s.copy()
   ...: t.iloc[[3, 4]] = np.nan

In [7]: t.rolling(2, min_periods=2).apply(lambda x: np.nansum(x))
Out[7]:
2017-01-01    NaN    # (1,)
2017-01-02    2.0    # (1, 1)
2017-01-03    2.0    # (1, 1)
2017-01-04    NaN    # (1, nan)
2017-01-05    NaN    # (nan, nan)
2017-01-06    NaN    # (nan, 1)
Freq: D, dtype: float64
Why would you think 1.? It has always meant non-NA.
I agree it currently means non-NA rows (although that is not explicit in the docstring).

I am thinking about whether there are possible reasons we would want to change this, and to have both `min_periods` in `rolling` and `min_count` in `sum`. In principle, that would give the ability to distinguish between actual NaNs in the data and no data in the period ("introduced" NaNs).

(And for all other functions apart from sum and prod, it wouldn't change anything, as NaN values always return NaN anyway; so whether that is because of a NaN in the data or because of `min_periods` not including those NaN data does not matter.)
I personally wouldn't rename `min_periods` to `min_count`, but would just leave the name as is (also if we don't add a `min_count` to the rolling `sum`). It is in a different function (`rolling`, not `sum`) and is also relevant for other aggregations that don't have such a `min_count` in their normal dataframe counterpart.
Agreed with @jorisvandenbossche.
There may be a slight issue with the current implementation of `rolling.sum()` when `min_periods=0`.
On my other branch:

In [9]: x = pd.Series([np.nan])

In [10]: x.rolling(1, min_periods=0).sum()
Out[10]:
0   NaN
dtype: float64

In [11]: x.rolling(1, min_periods=0).apply(lambda v: pd.Series(v).sum())
Out[11]:
0    0.0
dtype: float64

I assume we want those to be consistent?
Here's master:

In [1]: import pandas as pd; import numpy as np

In [2]: x = pd.Series([np.nan])

In [3]: x.rolling(1, min_periods=0).sum()
Out[3]:
0   NaN
dtype: float64

In [4]: x.rolling(1, min_periods=1).sum()
Out[4]:
0   NaN
dtype: float64

Here's my branch:

In [1]: import pandas as pd; import numpy as np

In [2]: x = pd.Series([np.nan])

In [3]: x.rolling(1, min_periods=0).sum()
Out[3]:
0    0.0
dtype: float64

In [4]: x.rolling(1, min_periods=1).sum()
Out[4]:
0   NaN
dtype: float64

In [5]: x.rolling(1).sum()
Out[5]:
0   NaN
dtype: float64
The fix in `window.pyx` is small. I don't think it's worth including here, since the behavior on master is correct for the 0.21 semantics of "sum of all-missing is missing".
> The fix in window.pyx is small. I don't think it's worth including here, since the behavior on master is correct for the 0.21 semantics of "sum of all-missing is missing".

To be clear: so for now no change, i.e. keep min_periods as it was, and for now no min_count added to rolling sum? (Which is fine with me.)
If we interpret min_periods as the minimum number of non-NA periods, then yes. There won't be a need for rolling.sum to take min_count.
@TomAugspurger why don't you rebase on master? We can merge this on green (as I think you are taking care of any remaining comments in the other PR), and then backport (maybe create the new branch as well)?
The current default is 1, reproducing the behavior of pandas 0.21. The current test suite should pass.

Currently, only nansum and nanprod actually do anything with min_count. It
will not be hard to adjust the other nan* methods to use it if we want. This
was just simplest for now.

Additional tests for the new behavior have been added.
Yeah, sounds good, this one should be OK.
I'll create a 0.22.x branch that starts at 0.21.x and backport this + the other PR once that's finished.
@TomAugspurger I also think we need to fix the version number? (Or I suppose once you tag 0.22, we can fix that?)
Yeah, how should we do that? I suppose we could tag 0.23.dev0 on master any time, right? Then it stays ahead of 0.22.x.
Yeah, I think we can change the tag on master. Merge away when ready.
TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request
(cherry picked from commit dbec3c9)
TomAugspurger added a commit that referenced this pull request
(cherry picked from commit dbec3c9)
yarikoptic added a commit to neurodebian/pandas that referenced this pull request
yarikoptic added a commit to neurodebian/pandas that referenced this pull request