BUG: Remove null values before sorting during groupby nunique calculation by MarcoGorelli · Pull Request #27951 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation24 Commits3 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

MarcoGorelli

jschendel

jschendel

@TomAugspurger

Does this completely close #27904? Or should it remain open for the case of dropna=False?

@MarcoGorelli

In the case of dropna=False, what should nunique return? As in, do the NaNs all count as distinct values? I wouldn't expect them to, although as np.nan == np.nan returns False, so I'm not sure

@TomAugspurger

Just one NA value I think.

@TomAugspurger

And to be clear, we can leave handling dropna=False to a followup PR. Just want to make sure that we have an open issue for it.

TomAugspurger

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good for the dropna=True case. Will leave the issue open for dropna=False, which seems broken still.

@TomAugspurger

Actually... I'm going to push this off the 0.25.1 milestone (sorry @MarcoGorelli)

Can you check a few things:

  1. How's the performance of this? I assume np.lexsort has to do some kinds of NA masking and filtering internally. Are we much slower than that?
  2. It looks like there's a now unnecessary if dropna starting around line 1171 now. IIUC, val should now have all the na values removed, so masking again shouldn't be necessary.

@MarcoGorelli

Sorry for the delay - yes, I agree with your decision, Tom :) I'll pick this up again next week

@pep8speaks

Hello @MarcoGorelli! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-09-06 15:50:47 UTC

@MarcoGorelli

As noted as a comment in #27904, this works fine with numpy's own nat value...is it considered cheating just replace pandas.NaT with np.datetime64("NaT")? I tried that, and both the existing and the new tests (which include dropna=False) pass.

@WillAyd

Haven't looked deeply yet but does this also impact Series and DataFrame methods? The location of the change would imply so hence my reason for asking

WillAyd

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above comment. Can you also add a whatsnew for v1.0.0

@MarcoGorelli

Thanks for your review!

Sure, whatsnew entry added.

If I've understood your question, then the pandas.DataFrame.nunique and pandas.Series.nunique methods didn't have this issue to begin with, and are unaffected by this change.

@WillAyd

Ah misread the file location but makes sense so thanks for confirming. Should be able to review more deeply over the next few days

WillAyd

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for delay in review. Looks pretty good some general comments

@@ -1143,6 +1143,9 @@ def nunique(self, dropna=True):
val = self.obj._internal_get_values()
# GH 27951
val[isna(val)] = np.datetime64("NaT")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the actual root of the issue is a bug in NumPy as described by @TomAugspurger where NaT values are not sorted as you'd expected

numpy/numpy#12629

So I think this works for now but maybe add a comment about NumPy bug 12629 for reference

@@ -1015,6 +1025,81 @@ def test_nunique_with_timegrouper():
tm.assert_series_equal(result, expected)
@pytest.mark.parametrize(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parametrization here is pretty repetitive, though I realize that you have three items at a time being sent through to keep the expectation different across each.

Is there a way to more succinctly parametrize though? It's rather difficult to read this and find what's expected

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed - have modified it so the DataFrame is constructed within the test

@MarcoGorelli

@MarcoGorelli

@MarcoGorelli MarcoGorelli changed the titleRemove null values before sorting during groupby nunique calculation BUG: Remove null values before sorting during groupby nunique calculation

Sep 6, 2019

WillAyd

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger

Thanks @MarcoGorelli!

Can you confirm that this fixed all of #27904, or are there outstanding tasks?

@MarcoGorelli

That's right - thanks for having taken the time to review my work!

proost pushed a commit to proost/pandas that referenced this pull request

Dec 19, 2019

@MarcoGorelli @proost

proost pushed a commit to proost/pandas that referenced this pull request

Dec 19, 2019

@MarcoGorelli @proost