REF: Refactor GroupBy a tiny bit by NickCrews · Pull Request #46258 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation18 Commits4 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

[NA, small] closes #xxxx (Replace xxxx with the Github issue number)
[waiting for review] Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
[waiting for review] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Grouping._codes_and_uniques and BaseGrouper._get_compressed_codes were both semantically doing the same thing as core.algorithms.factorize(), so rename them so that it's just a bit easier to follow.

Per a review comment in pandas-dev#46207

The return value is never used. Don't pretend that it is.

and move dropna up out of the "generated" attribute section into the "source" attribute section

NickCrews changed the title~~Simple refactor groupby~~ REF: Simple GroupBy refactors

NickCrews changed the title~~REF: Simple GroupBy refactors~~ REF: Refactor GroupBy a tiny bit

@@ -413,7 +413,6 @@ def _set_grouper(self, obj: NDFrame, sort: bool = False):
# "NDFrameT", variable has type "None")
self.obj = obj # type: ignore[assignment]
self._gpr_index = ax

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you can find a way to avoid setting these attributes instead, that'd make this class easier to reason about

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, if I understand what you're saying. Do you mean make this whole class less stateful?

I actually started doing this because I thought the same thing. I wanted to make this Grouper be immutable. Make this class, Grouper, not know anything about obj, and then a separate class BoundGrouper, that is the pairing of a Grouper and a specific obj. Does that sound like a reasonable way to organize things?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean make this whole class less stateful?

exactly. IIRC i put some effort into this last year and found it difficult.

Make this class, Grouper, not know anything about obj, and then a separate class BoundGrouper, that is the pairing of a Grouper and a specific obj. Does that sound like a reasonable way to organize things?

Not sure off the top of my head. Depends in large part on if it would be user-facing, in which case it is a harder sell.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was worried about user facing changes. What exactly do we consider user-facing? In the docs Grouper is pretty much a namedtuple, and I would argue this should be the public API (just the attributes given in the constructor).

What gets in the way of this are the ax and groups properties, which look like they are supposed to be public. But I'm pretty sure it is a mistake to expose those two, since they depend on the Grouper being bound to an obj, and there isn't a user-facing way to bind a Grouper to an obj. So I feel like any user who is relying on those two properties is already reaching too far into the Grouper and therefore that should be considered undefined.

Thoughts?

self,
) -> tuple[npt.NDArray[np.signedinteger], npt.NDArray[np.intp]]:
# The first returned ndarray may have any signed integer dtype
"""Analogous to core.algorithms.factorize"""

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure this is accurate; the second item returned is ndarray[intp] instead of ArrayLike

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it still is the same pattern where uniques.take[codes] is the same as the original data, just with a slightly different type. Or is there some bigger difference I'm missing? Should I just change the docstring to be "Similar idea as core.algorithms.factorize", or you don't think algorithms.factorize should be mentioned at all?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or you don't think algorithms.factorize should be mentioned at all?

yah lets leave this case out

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW if you're interested in a hopefully-hours-sized goal:

In all extant test cases, the second array returned by this method has length(self._get_compressed_codes()[1]) == self.ngroups. Can we confirm that this will always hold? Doing so would allow simplifying group_info.

(It is easy-ish to confirm that this holds for the len(self.groupings) == 1 case, but I haven't confirmed it for the other case)

@@ -263,7 +263,7 @@ class Grouper:
_gpr_index: Index \| None
_grouper: Index \| None

_attributes: tuple[str, ...] = ("key", "level", "freq", "axis", "sort")
_attributes: tuple[str, ...] = ("key", "level", "freq", "axis", "sort", "dropna")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is _attributes used for anything? if so, this should change some behavior right

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure after a bit of poking around it's just used in repr, but I'm not positive (TimeGrouper does some funky stuff with getattr that might somehow be getting to this). I think the only failed tests are ones that test repr()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This being the case, I think I might even rename this to _repr_attributes, what do you think? Because it is quite hard to tell what this is otherwise used for.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we use the name _attributes elsewhere so wouldn't change it.

@jbrockmendel how about I drop the commit where I add dropna, address the other tweaks you suggest, and then we can merge this as a start? And then maybe do those offshoot changes separately?

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

NickCrews added a commit to NickCrews/pandas that referenced this pull request

The return value is never used. Don't pretend that it is.

Ideally we don't even hold this state inside the Grouper, it would make it easier to reason about. But just do this tiny change for now, still an improvement. See pandas-dev#46258 for more details.

NickCrews added a commit to NickCrews/pandas that referenced this pull request

All this changes I think is that now dropna is included in the __repr__ output. See pandas-dev#46258 for more discussion.

And move dropna up out of the "generated" attribute section into the "source" attribute section in the init

Looks like this is a bit more than I want to take on, but I pulled the less-controversial changes into #46754, where hopefully they can still be merged.

NickCrews added a commit to NickCrews/pandas that referenced this pull request

All this changes I think is that now dropna is included in the __repr__ output. See pandas-dev#46258 for more discussion.

And move dropna up out of the "generated" attribute section into the "source" attribute section in the init

NickCrews added a commit to NickCrews/pandas that referenced this pull request

The return value is never used. Don't pretend that it is.

Ideally we don't even hold this state inside the Grouper, it would make it easier to reason about. But just do this tiny change for now, still an improvement. See pandas-dev#46258 for more details.

NickCrews added a commit to NickCrews/pandas that referenced this pull request

All this changes I think is that now dropna is included in the __repr__ output. See pandas-dev#46258 for more discussion.

And move dropna up out of the "generated" attribute section into the "source" attribute section in the init

2 participants