BUG: Remove null values before sorting during groupby nunique calculation by MarcoGorelli · Pull Request #27951 · pandas-dev/pandas (original) (raw)

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

Closes groupby nunique() with dates vs datetimes in presence of NaTs #27904
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Does this completely close #27904? Or should it remain open for the case of dropna=False?

In the case of dropna=False, what should nunique return? As in, do the NaNs all count as distinct values? I wouldn't expect them to, although as np.nan == np.nan returns False, so I'm not sure

Just one NA value I think.

And to be clear, we can leave handling dropna=False to a followup PR. Just want to make sure that we have an open issue for it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good for the dropna=True case. Will leave the issue open for dropna=False, which seems broken still.

Actually... I'm going to push this off the 0.25.1 milestone (sorry @MarcoGorelli)

Can you check a few things:

How's the performance of this? I assume np.lexsort has to do some kinds of NA masking and filtering internally. Are we much slower than that?
It looks like there's a now unnecessary if dropna starting around line 1171 now. IIUC, val should now have all the na values removed, so masking again shouldn't be necessary.

Sorry for the delay - yes, I agree with your decision, Tom :) I'll pick this up again next week

Hello @MarcoGorelli! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-09-06 15:50:47 UTC

As noted as a comment in #27904, this works fine with numpy's own nat value...is it considered cheating just replace pandas.NaT with np.datetime64("NaT")? I tried that, and both the existing and the new tests (which include dropna=False) pass.

Haven't looked deeply yet but does this also impact Series and DataFrame methods? The location of the change would imply so hence my reason for asking

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above comment. Can you also add a whatsnew for v1.0.0

Thanks for your review!

Sure, whatsnew entry added.

If I've understood your question, then the pandas.DataFrame.nunique and pandas.Series.nunique methods didn't have this issue to begin with, and are unaffected by this change.

Ah misread the file location but makes sense so thanks for confirming. Should be able to review more deeply over the next few days

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for delay in review. Looks pretty good some general comments

val = self.obj._internal_get_values()

# GH 27951
val[isna(val)] = np.datetime64("NaT")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the actual root of the issue is a bug in NumPy as described by @TomAugspurger where NaT values are not sorted as you'd expected

numpy/numpy#12629

So I think this works for now but maybe add a comment about NumPy bug 12629 for reference

tm.assert_series_equal(result, expected)


@pytest.mark.parametrize(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parametrization here is pretty repetitive, though I realize that you have three items at a time being sent through to keep the expectation different across each.

Is there a way to more succinctly parametrize though? It's rather difficult to read this and find what's expected

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed - have modified it so the DataFrame is constructed within the test

MarcoGorelli changed the title~~Remove null values before sorting during groupby nunique calculation~~ BUG: Remove null values before sorting during groupby nunique calculation