ENH: Added a min_count keyword to stat funcs by TomAugspurger · Pull Request #18876 · pandas-dev/pandas
Conversation
The current default is 1, reproducing the behavior of pandas 0.21. The current
test suite should pass. I'll add additional commits here changing the default to 0.

Currently, only nansum and nanprod actually do anything with min_count. It
will not be hard to adjust the other nan* methods to use it if we want. This
was just simplest for now.

Additional tests for the new behavior have been added.
closes #18678
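For reference, a quick sketch of the behavior being added (the values follow the examples later in this thread; min_count=1 reproduces 0.21, min_count=0 is the planned future default):

    import numpy as np
    import pandas as pd

    s = pd.Series([])

    # Backwards-compatible default: fewer than min_count=1 valid values -> NA.
    s.sum()              # nan
    s.sum(min_count=1)   # nan, same as the default

    # Opt in to the planned new default: the empty sum is the additive identity.
    s.sum(min_count=0)   # 0.0

    # prod behaves the same way with its identity, 1.0.
    pd.Series([np.nan]).prod(min_count=0)   # 1.0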
    if len(self.kwargs) > 0:
        for k, v in compat.iteritems(self.kwargs):
            if k not in kwds:
                kwds[k] = v
    try:
-       if values.size == 0:
+       if values.size < min_count:
            # we either return np.nan or pd.NaT
Whoops, I moved this logic to `_na_for_min_count` but forgot to call it here.
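For context, a rough sketch of what such a helper could look like (the name `_na_for_min_count` is from this PR; the body below is my illustration, not the actual implementation):

    import numpy as np

    def _na_for_min_count_sketch(values, axis):
        """Return the NA result for a reduction that saw fewer than
        min_count valid values (NaN here; datetimes would need pd.NaT,
        per the "we either return np.nan or pd.NaT" comment above)."""
        if values.ndim == 1 or axis is None:
            return np.nan
        # 2-D case: an all-NaN array shaped like the reduced output.
        result_shape = values.shape[:axis] + values.shape[axis + 1:]
        result = np.empty(result_shape, dtype=np.float64)
        result.fill(np.nan)
        return result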
5630d9d passed all the CI tests, so that's good.
I just pushed another commit that fixes a few things, but is still backwards compatible with 0.21.x.
I'll push my breaking changes once I get things passing locally.
I think we should for now only add `min_count` to `sum` and `prod`. Or does that complicate the implementation?

> Or does that complicate the implementation?
Not much. I mainly did it for all of them since it was easier, and would possibly be useful for other reductions (haven't thought it through). I'll push a commit that does it just for sum and prod.
For other reductions like min, max, std, ... we already return NaN for empty or all-NA series, so I'm not sure what `min_count` would be for there.
Pushed another backwards-compatible commit, including docstrings, written from the point of view of a hypothetical 0.21.2 release. I haven't thought through whether it'd be worth backporting the non-breaking changes (adding `min_count=1`) to 0.21.x, but in case we do, it'll just be the current set of commits:
Signature: pd.Series.sum(self, axis=None, skipna=None, level=None, numeric_only=None, min_count=1, **kwargs)
Docstring:
Return the sum of the values for the requested axis
Parameters
axis : {index (0)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA or empty, the result
will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a
particular level, collapsing into a scalar
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use
everything, then use only numeric data. Not implemented for Series.
min_count : int, default 1
The required number of valid values to perform the operation. If fewer than
``min_count`` non-NA values are present the result will be NA.
.. versionadded:: 0.21.2
Added with the default being 1. This means the sum or product
of an all-NA or empty series is ``NaN``.
Returns
sum : scalar or Series (if level specified)
Examples
By default, the sum of an empty series is NaN.

>>> pd.Series([]).sum()  # min_count=1 is the default
nan

This can be controlled with the ``min_count`` parameter. For example, if
you'd like the sum of an empty series to be 0, pass ``min_count=0``.

>>> pd.Series([]).sum(min_count=0)
0.0
Thanks to the ``skipna`` parameter, ``min_count`` handles all-NA and
empty series identically.

>>> pd.Series([np.nan]).sum()
nan

>>> pd.Series([np.nan]).sum(min_count=0)
0.0
For reviewing, do people have a preference for:

a.) 2 PRs: the first adding a backwards-compatible min_count=1, the second changing to min_count=0 and updating all the tests
b.) A single PR (maybe with me rebasing / squashing to keep the two changes separate)
    result = np.empty(result_shape, dtype=values.dtype)
    result.fill(fill_value)
    return result

if values.size < min_count:
`values` could be 2-D here, so in theory we should handle this on a per-column basis (but not 100% sure if it is ever practically 2-D).
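A small numpy sketch of what per-column handling could look like for 2-D `values`, given a boolean `mask` of missing entries (illustrative only, not the PR's code):

    import numpy as np

    values = np.array([[1.0, np.nan],
                       [np.nan, np.nan]])
    mask = np.isnan(values)
    min_count = 1

    # Count valid entries per column instead of checking values.size once,
    # then null out any column that falls below min_count.
    valid_per_col = mask.shape[0] - mask.sum(axis=0)
    result = np.nansum(values, axis=0)
    result[valid_per_col < min_count] = np.nan
    print(result)  # [ 1. nan]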
@@ -304,15 +308,18 @@ def nanall(values, axis=None, skipna=True):
@disallow('M8')
@bottleneck_switch()
-def nansum(values, axis=None, skipna=True):
+def nansum(values, axis=None, skipna=True, min_count=1):

    if len(values) < min_count:
can have ndim > 1 here
@@ -1759,6 +1759,48 @@ def test_value_counts_categorical_not_ordered(self):
    tm.assert_series_equal(idx.value_counts(normalize=True), exp)

class TestMinCount():
    @pytest.mark.parametrize("use_bottleneck", [True, False])
    @pytest.mark.parametrize("method", [("sum", 0), ("prod", 1)])
So for sum we never use bottleneck; I think for prod we DO use bottleneck. So maybe we should just fix that here as well.
In `_bn_ok_dtype`:
# GH 9422
# further we also want to preserve NaN when all elements
# are NaN, unlike bottleneck/numpy which consider this
# to be 0
if name in ['nansum', 'nanprod']:
return False
so we don't use bottleneck. And apparently bottleneck doesn't have a nanprod, at least in my version installed locally.
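The difference GH 9422 guards against, shown with numpy (bottleneck's `nansum` behaves like numpy's here, which is why it is disabled for these two reductions):

    import numpy as np
    import pandas as pd

    # numpy (and bottleneck) treat an all-NaN input as an empty sum:
    np.nansum(np.array([np.nan]))   # 0.0

    # pandas preserves NaN for all-NA input, so nansum/nanprod must
    # bypass bottleneck via the _bn_ok_dtype check quoted above:
    pd.Series([np.nan]).sum()       # nan (with this PR's min_count=1 default)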
@pytest.mark.parametrize("method", [("sum", 0), ("prod", 1)]) |
---|
def test_min_count_empty(self, method, use_bottleneck): |
method, unit = method |
s = pd.Series() |
All of these tests should be at the very top, right after (or maybe replacing):
class TestSeriesAnalytics(TestData):
@pytest.mark.parametrize("use_bottleneck", [True, False])
@pytest.mark.parametrize("method", ["sum", "prod"])
def test_empty(self, method, use_bottleneck):
with pd.option_context("use_bottleneck", use_bottleneck):
# GH 9422
# treat all missing as NaN
those tests need to be changed, no?
> For reviewing, do people have a preference for:
> a.) 2 PRs: the first adding a backwards-compatible min_count=1, the second changing to min_count=0 and updating all the tests
> b.) A single PR (maybe with me rebasing / squashing to keep the two changes separate)
a) Maybe that is better; then we can see both passing.
I think this is ready for review (cc @shoyer). I haven't written a whatsnew since I'll be doing that in the follow-up PR.
Just to verify, the current plan is to release 0.22.0 as 0.21.1 + this PR and my follow-up PR with the breaking changes?
@@ -32,36 +32,64 @@ class TestSeriesAnalytics(TestData):
@pytest.mark.parametrize("use_bottleneck", [True, False])
@pytest.mark.parametrize("method", ["sum", "prod"])
def test_empty(self, method, use_bottleneck):
    if method == "sum":
You should just pass in `unit` with `method`:

@pytest.mark.parametrize("method, unit", [("sum", 0.0), ("prod", 1.0)])
with pd.option_context("use_bottleneck", use_bottleneck):
    # GH 9422
    # treat all missing as NaN
    s = Series([])
    result = getattr(s, method)()
    assert isna(result)
    result = getattr(s, method)(min_count=0)
I would explicitly pass `min_count=1` in tests as well, and then add another entry without passing `min_count` (i.e. the default).
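i.e., something along these lines (a sketch of the suggested parametrization; the test name is hypothetical):

    import pandas as pd
    import pytest

    @pytest.mark.parametrize("kwargs", [{}, {"min_count": 1}])
    def test_empty_sum_is_na(kwargs):
        # The default and an explicit min_count=1 should agree: the sum
        # of an empty Series is NA under the 0.21-compatible behavior.
        assert pd.isna(pd.Series([]).sum(**kwargs))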
FYI, I have a `min_count` keyword working for groupby on my next branch:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({"A": pd.Categorical(['a', 'a'], categories=['a', 'b']), "B": [1, 1]})
In [3]: df.groupby("A").B.sum()
Out[3]:
A
a    2
b    0
Name: B, dtype: int64

In [4]: df.groupby("A").B.sum(min_count=1)
Out[4]:
A
a    2.0
b    NaN
Name: B, dtype: float64
I think I'll include that in this PR as well (with the default being 1).
@@ -88,7 +89,7 @@ def group_add_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
    for i in range(ncounts):
        for j in range(K):
-           if nobs[i, j] == 0:
+           if nobs[i, j] < min_count:
Does this work for `min_count == 0`?
Yeah, `sumx` starts out as zeros, so we just have to avoid setting it to NaN. Same for `prod`, but with ones.
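A plain-Python sketch of the templated loop's logic, to show why `min_count == 0` works (my illustration, not the generated Cython):

    import numpy as np

    def group_add_sketch(values, labels, ngroups, min_count=0):
        """Illustrative group sum: sumx starts at the identity (zeros),
        so with min_count=0 an empty group simply keeps its 0."""
        sumx = np.zeros(ngroups)                   # identity for sum
        nobs = np.zeros(ngroups, dtype=np.int64)   # valid counts per group
        for val, lab in zip(values, labels):
            if not np.isnan(val):
                nobs[lab] += 1
                sumx[lab] += val
        out = sumx.copy()
        # Only groups with fewer than min_count observations become NaN;
        # with min_count == 0, "nobs < 0" is never true.
        out[nobs < min_count] = np.nan
        return out

    group_add_sketch([1.0, 1.0], [0, 0], ngroups=2, min_count=1)
    # -> array([ 2., nan]), matching the groupby example above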
@@ -2322,9 +2326,16 @@ def _aggregate(self, result, counts, values, comp_ids, agg_func,
    for i, chunk in enumerate(values.transpose(2, 0, 1)):
        chunk = chunk.squeeze()
-       agg_func(result[:, :, i], counts, chunk, comp_ids)
+       if min_count is None:
I would always pass it to Cython as an int; it can be -1 if it's somehow None.
I mainly did this to avoid having to add an unused `min_count` keyword to all the `group_*` methods. If I added `min_count` to each of those, then I'd also need to raise if a non-default value is passed. I'll add a comment here that it's only expected for `count` and `prod`, though.
I would just add it and assert `== -1`. These are all internal routines anyhow; an `if` is very ugly and error-prone here. Better to make them have the same uniform signature.
Ah, ok that sounds easy enough.
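The resulting pattern would look roughly like this (Python pseudocode of the Cython template; the signature details are a sketch):

    def group_last_sketch(out, counts, values, labels, min_count=-1):
        # Uniform signature: every group_* routine accepts min_count with
        # a default of -1. Routines that don't implement it (like
        # group_last) only accept the "not passed" sentinel.
        assert min_count == -1
        ...  # existing group_last logic unchanged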
-   agg_func(result, counts, values, comp_ids)
+   if min_count is None:
+       agg_func(result, counts, values, comp_ids)
+   else:
same
@@ -332,7 +343,8 @@ def get_dispatch(dtypes):
def group_last_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
                        ndarray[int64_t] counts,
                        ndarray[{{c_type}}, ndim=2] values,
-                       ndarray[int64_t] labels):
+                       ndarray[int64_t] labels,
+                       Py_ssize_t min_count):
You can set `=-1` as the default here (and actually in all routines), I think.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, will do (had to for OHLC). I couldn't remember if Cython took a perf hit for checking default values.
@@ -822,7 +822,14 @@ def test_cython_agg_empty_buckets(self):
    grps = range(0, 55, 5)

    for op, targop in ops:
-       result = df.groupby(pd.cut(df[0], grps))._cython_agg_general(op)
+       # calling _cython_agg_general directly, instead of via the user API
This bit was unfortunate, but I think necessary since we're calling `_cython_agg_general` directly. I'll need to change `min_count` to 0 in the follow-up PR.
Added `min_count` to resample in the last push.

In [4]: s.resample('30T').sum()  # default min_count=1 for now, will change to 0
Out[4]:
2017-01-01 00:00:00    1.0
2017-01-01 00:30:00    NaN
2017-01-01 01:00:00    1.0
Freq: 30T, dtype: float64

In [5]: s.resample('30T').sum(min_count=0)
Out[5]:
2017-01-01 00:00:00    1
2017-01-01 00:30:00    0
2017-01-01 01:00:00    1
Freq: 30T, dtype: int64
I'm going to see how difficult rolling is now. I think for symmetry it'd be nice to have, but how does it interact with `min_periods`?

I believe that `min_periods` specifies the required length of the window. If the window is shorter than that, the output is NA regardless of `min_count`.

Given a window of at least `min_periods` length, `min_count` specifies the required number of non-NA observations for a non-NA output.
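To make that interpretation concrete, a hypothetical sketch of how the two keywords could compose for a rolling sum (`min_count` on rolling was never implemented; this only illustrates the proposed semantics):

    import numpy as np
    import pandas as pd

    def rolling_sum_sketch(s, window, min_periods=1, min_count=1):
        out = []
        for end in range(1, len(s) + 1):
            win = s.iloc[max(0, end - window):end]
            if len(win) < min_periods:            # window too short: NA
                out.append(np.nan)
            elif win.notna().sum() < min_count:   # too few non-NA values: NA
                out.append(np.nan)
            else:
                out.append(win.sum())             # nansum over the window
        return pd.Series(out, index=s.index)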
@Appender(_local_template)
def f(self, **kwargs):
    return f_(self, **kwargs)
else:
? Not sure you need this. Shouldn't the template just work for this?
It's not required, but with it I get the `min_count` in the signature.

In [5]: gr.sum?
Signature: gr.sum(min_count=1, **kwargs)
Docstring: Compute sum of group values

Without the if / else, things work, but then the sig is just `**kwargs`. I have a slight preference for including it in the signature, but will revert it if you want.
I would revert it for now; we don't do this for any other parameters now.
IIRC there is an issue about this; it needs a comprehensive solution.
Yeah, I forgot about numeric_only, etc. OK.
@@ -822,7 +822,14 @@ def test_cython_agg_empty_buckets(self):
    grps = range(0, 55, 5)

    for op, targop in ops:
-       result = df.groupby(pd.cut(df[0], grps))._cython_agg_general(op)
Should parametrize this.
> I'm going to see how difficult rolling is now. I think for symmetry it'd be nice to have, but how does it interact with min_periods?
>
> I believe that min_periods specifies the required length of the window. If the window is shorter than that, the output is NA regardless of min_count.
>
> Given a window of at least min_periods length, min_count specifies the required number of non-NA observations for a non-NA output.
`min_count` is de facto the same as `min_periods`, or at least it performs the same function. I don't think it is necessary to add, and it would add much complexity and user confusion. We could just rename `min_periods` to `min_count`, I suppose (that would need a deprecation cycle).
Ah, is `min_periods` the minimum number of observations, or the minimum number of non-NA observations? E.g.
In [2]: s = pd.Series(1, index=pd.date_range("2017", periods=6, freq="D"))
   ...: t = s.copy()
   ...: t.iloc[[3, 4]] = np.nan

In [3]: s
Out[3]:
2017-01-01    1
2017-01-02    1
2017-01-03    1
2017-01-04    1
2017-01-05    1
2017-01-06    1
Freq: D, dtype: int64

In [4]: t
Out[4]:
2017-01-01    1.0
2017-01-02    1.0
2017-01-03    1.0
2017-01-04    NaN
2017-01-05    NaN
2017-01-06    1.0
Freq: D, dtype: float64

In [5]: s.rolling(2).sum()
Out[5]:
2017-01-01    NaN
2017-01-02    2.0
2017-01-03    2.0
2017-01-04    2.0
2017-01-05    2.0
2017-01-06    2.0
Freq: D, dtype: float64

In [6]: t.rolling(2).sum()
Out[6]:
2017-01-01    NaN    # (1,)
2017-01-02    2.0    # (1, 1)
2017-01-03    2.0    # (1, 1)
2017-01-04    NaN    # (1, nan)
2017-01-05    NaN    # (nan, nan)
2017-01-06    NaN    # (nan, 1)
Freq: D, dtype: float64
So will `Out[6]` be changing at all?
Great work!

Do we need to add some tests with higher values for `min_count`? This 'works', so we should either test it or disallow it.
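A sketch of what such a test could look like (the test name and values are mine, not from the PR):

    import pandas as pd
    import pytest

    @pytest.mark.parametrize("method", ["sum", "prod"])
    def test_min_count_exceeds_length(method):
        # min_count larger than the number of valid values: result is NA.
        s = pd.Series([1.0, 2.0])
        result = getattr(s, method)(min_count=3)
        assert pd.isna(result)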
As also commented inline somewhere, `sum(level=)` does not seem to be implemented yet:
In [15]: s = pd.Series([1, np.nan, np.nan, np.nan], pd.MultiIndex.from_product([[1, 2], ['a', 'b']]))
In [16]: s
Out[16]:
1 a 1.0
b NaN
2 a NaN
b NaN
dtype: float64
In [17]: s.sum(level=0)
Out[17]:
1 1.0
2 0.0
dtype: float64
In [18]: s.sum(level=0, min_count=0)
Out[18]:
1 1.0
2 0.0
dtype: float64
In [19]: s.sum(level=0, min_count=1)
Out[19]:
1 1.0
2 0.0
dtype: float64
(and it also changed compared to 0.21.0, where it gave `1.0, NaN` in this case).
Groupby seems to have the wrong default (for now, to be backwards compatible), but I can't directly spot the error in the code:
In [15]: s = pd.Series([1, np.nan, np.nan, np.nan], pd.MultiIndex.from_product([[1, 2], ['a', 'b']]))
In [21]: s.groupby(level=0).sum()
Out[21]:
1 1.0
2 0.0
dtype: float64
In [22]: s.groupby(level=0).sum(min_count=0)
Out[22]:
1 1.0
2 0.0
dtype: float64
In [23]: s.groupby(level=0).sum(min_count=1)
Out[23]:
1 1.0
2 NaN
dtype: float64
(I get the same when grouping by array of values or grouping by column in dataframe)
The required number of valid values to perform the operation. If fewer than
``min_count`` non-NA values are present the result will be NA.
.. versionadded:: 0.21.2
Probably it will become 0.22? (but we can change that later)
axis = self._stat_axis_number
if level is not None:
    return self._agg_by_level(name, axis=axis, level=level,
                              skipna=skipna)
Does this one also need to handle `min_count`?
if axis is not None and getattr(result, 'ndim', False):
-   null_mask = (mask.shape[axis] - mask.sum(axis)) == 0
+   null_mask = (mask.shape[axis] - mask.sum(axis) - min_count) < 0
Can you just subtract something here, if `null_mask` is later on used for indexing into `result`?
I'm not sure what you mean here. `null_mask` will be a boolean array of which values in the output should be null because they have too few valid values.
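Concretely, a worked numpy example of the mask arithmetic in the diff above:

    import numpy as np

    values = np.array([[1.0, np.nan],
                       [np.nan, np.nan]])
    mask = np.isnan(values)
    axis, min_count = 0, 1

    # Valid count per output position minus min_count: negative means
    # "too few valid values", so that output position should be null.
    null_mask = (mask.shape[axis] - mask.sum(axis) - min_count) < 0
    print(null_mask)  # [False  True]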
> Groupby seems to have the wrong default (for now, to be backwards compatible), but I can't directly spot the error in the code:

I just pushed a commit fixing that. `min_count` wasn't being passed through.

In [4]: s.groupby(level=0).sum()
Out[4]:
1    1.0
2    NaN
dtype: float64
> As also commented inline somewhere, the sum(level=) does not seem to be implemented yet:

This seems to be fixed (presumably level= is using groupby and was hit by the same issue):

In [10]: s.sum(level=0)
Out[10]:
1    1.0
2    NaN
dtype: float64
Groupby is indeed fixed!

> This seems to be fixed (presumably level= is using groupby and was hit by the same issue)

The default is indeed fixed, but min_count doesn't have any effect yet, so it still needs to be passed through as well in some way, I think.
OK, we should be good on series/frame, groupby, and resample. Does anyone have thoughts on expanding / window? Does min_periods mean

1. the minimum number of observations in the window, or
2. the minimum number of non-NA observations in the window?

If it's 1., then the current behavior (on master) is buggy:
In [6]: s = pd.Series(1, index=pd.date_range("2017", periods=6, freq="D"))
   ...: t = s.copy()
   ...: t.iloc[[3, 4]] = np.nan

In [7]: t.rolling(2, min_periods=2).apply(lambda x: np.nansum(x))
Out[7]:
2017-01-01    NaN    # (1,)
2017-01-02    2.0    # (1, 1)
2017-01-03    2.0    # (1, 1)
2017-01-04    NaN    # (1, nan)
2017-01-05    NaN    # (nan, nan)
2017-01-06    NaN    # (nan, 1)
Freq: D, dtype: float64
Why would you think 1.? It has always meant non-NA.
I agree it currently means non-NA rows (although that is not explicit in the docstring).

I am thinking about whether there are possible reasons we would want to change this, and to have both `min_periods` in `rolling` and `min_count` in `sum`. In principle, that would give the ability to distinguish between actual NaNs in the data and no data in the period ("introduced" NaNs).

(And for all other functions apart from sum and prod, it wouldn't change anything, as NaN values always return NaN anyway; so whether that is because of a NaN in the data or because of `min_periods` not including those NaN data does not matter.)
I personally wouldn't rename `min_periods` to `min_count`, but would just leave the name as is (also if we don't add a `min_count` to the rolling `sum`). It is in a different function (`rolling`, not `sum`) and is also relevant for other aggregations that don't have such a `min_count` in their normal dataframe counterpart.
Agreed with @jorisvandenbossche.
There may be a slight issue with the current implementation of `rolling.sum()` when `min_periods=0`.
On my other branch:

In [9]: x = pd.Series([np.nan])

In [10]: x.rolling(1, min_periods=0).sum()
Out[10]:
0   NaN
dtype: float64

In [11]: x.rolling(1, min_periods=0).apply(lambda v: pd.Series(v).sum())
Out[11]:
0    0.0
dtype: float64

I assume we want those to be consistent?
Here's master:

In [1]: import pandas as pd; import numpy as np

In [2]: x = pd.Series([np.nan])

In [3]: x.rolling(1, min_periods=0).sum()
Out[3]:
0   NaN
dtype: float64

In [4]: x.rolling(1, min_periods=1).sum()
Out[4]:
0   NaN
dtype: float64

Here's my branch:

In [1]: import pandas as pd; import numpy as np

In [2]: x = pd.Series([np.nan])

In [3]: x.rolling(1, min_periods=0).sum()
Out[3]:
0    0.0
dtype: float64

In [4]: x.rolling(1, min_periods=1).sum()
Out[4]:
0   NaN
dtype: float64

In [5]: x.rolling(1).sum()
Out[5]:
0   NaN
dtype: float64
The fix in `window.pyx` is small. I don't think it's worth including here, since the behavior on master is correct for the 0.21 semantics of "sum of all-missing is missing".
> The fix in window.pyx is small. I don't think it's worth including here, since the behavior on master is correct for the 0.21 semantics of "sum of all-missing is missing".

To be clear: so for now no change, i.e. keep min_periods as it was, and for now no min_count added to rolling sum? (Which is fine with me.)
If we interpret min_periods as the minimum number of non-NA periods, then yes. There won't be a need for rolling.sum to take min_count.
@TomAugspurger why don't you rebase on master? We can merge this on green (as I think you are taking care of any remaining comments in the other PR), and then backport (maybe create the new branch as well)?
The current default is 1, reproducing the behavior of pandas 0.21. The current test suite should pass.

Currently, only nansum and nanprod actually do anything with min_count. It
will not be hard to adjust the other nan* methods to use it if we want. This
was just simplest for now.

Additional tests for the new behavior have been added.
Yeah, sounds good, this one should be OK.
I'll create a 0.22.x branch that starts at 0.21.x and backport this + the other PR once that's finished.
@TomAugspurger I also think we need to fix the version number? (Or I suppose once you tag 0.22, we can fix that?)
Yeah, how should we do that? I suppose we could tag 0.23.dev0 on master any time, right? Then it stays ahead of 0.22.x.
Yeah, I think we can change the tag on master. Merge away when ready.
TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request
(cherry picked from commit dbec3c9)
TomAugspurger added a commit that referenced this pull request
(cherry picked from commit dbec3c9)
yarikoptic added a commit to neurodebian/pandas that referenced this pull request
yarikoptic added a commit to neurodebian/pandas that referenced this pull request