BUG: Remove null values before sorting during groupby nunique calculation by MarcoGorelli · Pull Request #27951 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation24 Commits3 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
- Closes groupby nunique() with dates vs datetimes in presence of NaTs #27904
- tests added / passed
- passes
black pandas
- passes
git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry
Does this completely close #27904? Or should it remain open for the case of dropna=False
?
In the case of dropna=False
, what should nunique
return? As in, do the NaN
s all count as distinct values? I wouldn't expect them to, although as np.nan == np.nan
returns False
, so I'm not sure
Just one NA value I think.
And to be clear, we can leave handling dropna=False
to a followup PR. Just want to make sure that we have an open issue for it.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good for the dropna=True case. Will leave the issue open for dropna=False, which seems broken still.
Actually... I'm going to push this off the 0.25.1 milestone (sorry @MarcoGorelli)
Can you check a few things:
- How's the performance of this? I assume
np.lexsort
has to do some kinds of NA masking and filtering internally. Are we much slower than that? - It looks like there's a now unnecessary
if dropna
starting around line 1171 now. IIUC,val
should now have all the na values removed, so masking again shouldn't be necessary.
Sorry for the delay - yes, I agree with your decision, Tom :) I'll pick this up again next week
Hello @MarcoGorelli! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻
Comment last updated at 2019-09-06 15:50:47 UTC
As noted as a comment in #27904, this works fine with numpy's own nat
value...is it considered cheating just replace pandas.NaT
with np.datetime64("NaT")
? I tried that, and both the existing and the new tests (which include dropna=False
) pass.
Haven't looked deeply yet but does this also impact Series and DataFrame methods? The location of the change would imply so hence my reason for asking
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above comment. Can you also add a whatsnew for v1.0.0
Thanks for your review!
Sure, whatsnew entry added.
If I've understood your question, then the pandas.DataFrame.nunique
and pandas.Series.nunique
methods didn't have this issue to begin with, and are unaffected by this change.
Ah misread the file location but makes sense so thanks for confirming. Should be able to review more deeply over the next few days
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry for delay in review. Looks pretty good some general comments
@@ -1143,6 +1143,9 @@ def nunique(self, dropna=True): |
---|
val = self.obj._internal_get_values() |
# GH 27951 |
val[isna(val)] = np.datetime64("NaT") |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the actual root of the issue is a bug in NumPy as described by @TomAugspurger where NaT values are not sorted as you'd expected
So I think this works for now but maybe add a comment about NumPy bug 12629 for reference
@@ -1015,6 +1025,81 @@ def test_nunique_with_timegrouper(): |
---|
tm.assert_series_equal(result, expected) |
@pytest.mark.parametrize( |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parametrization here is pretty repetitive, though I realize that you have three items at a time being sent through to keep the expectation different across each.
Is there a way to more succinctly parametrize though? It's rather difficult to read this and find what's expected
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed - have modified it so the DataFrame is constructed within the test
MarcoGorelli changed the title
Remove null values before sorting during groupby nunique calculation BUG: Remove null values before sorting during groupby nunique calculation
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @MarcoGorelli!
Can you confirm that this fixed all of #27904, or are there outstanding tasks?
That's right - thanks for having taken the time to review my work!
proost pushed a commit to proost/pandas that referenced this pull request
proost pushed a commit to proost/pandas that referenced this pull request