Cythonized GroupBy Fill by WillAyd · Pull Request #19673 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation85 Commits25 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

closes PERF: groupby-fillna perf, implement in cython #11296
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

I am not a fan of how I've implemented this in groupby.py but I think this change highlights even more the need for some refactoring of that module, specifically in how Cython transformations are getting dispatched. This still works in the meantime but open to any feedback on how the methods are getting wired back to the Cython layer.

Below are ASVs for the change

   before           after         ratio
 [d9551c8e]       [5e007f86]

 81.6±0.3μs         96.7±2μs     1.19  groupby.GroupByMethods.time_method('float', 'count')

    455±3μs         516±20μs     1.13  groupby.GroupByMethods.time_method('float', 'cummin')

  306±0.6μs          340±5μs     1.11  groupby.GroupByMethods.time_method('float', 'prod')

  142±0.4ms          280±6μs     0.00  groupby.GroupByMethods.time_method('int', 'bfill')

   230±10ms          327±7μs     0.00  groupby.GroupByMethods.time_method('float', 'bfill')

  135±0.8ms        189±0.3μs     0.00  groupby.GroupByMethods.time_method('int', 'ffill')

    227±2ms        184±0.6μs     0.00  groupby.GroupByMethods.time_method('float', 'ffill')


sorted_labels = np.argsort(labels)
if method == 'bfill':
sorted_labels[::-1].sort()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are the labels reversed before sorting?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They only get reversed in the case of bfill - it's the cheapest way to fill backwards that I could think of

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but doesn't the .sort() undo the reversing?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, thanks! Perhaps too clever, how about this? still avoids the copy

sorted_labels.sort()
sorted_labels = sorted_labels[::-1]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think you are on to something - the extra sort doesn't feel right and I think it could cause a bug as it will really just return an array ranging from N-1..0. Going to work up some test cases to validate and add fix to next commit

out[idx, 0] == {{nan_val}}
filled_vals += 1
else: # reset items when not missing
filled_vals = 0

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't filled_values need to be tracked by group? Consider this case (haven't tested on your PR):

df = pd.DataFrame({ 'k': [1, 1, 2, 2, 1, 1], 'v': [1.1, np.nan, 1, np.nan, np.nan, np.nan]})

df.groupby('k').ffill(limit=2) Out[25]: k v 0 1 1.1 1 1 1.1 2 2 1.0 3 2 1.0 4 1 1.1 5 1 NaN

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needs to be tracked by both group and value. One thing to keep in mind is that the main loop in the Cython function doesn't go sequentially over the values in the series / frame, but rather iterates over the argsorteded labels, which allows you to keep track of groups easily. Just ran your example on my PR and it gave the result you have above.

That said, I don't have a problem adding a test case that mixes the groups up to prevent this from breaking in the future. Can bundle that in with the next commit

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test case for this is now a little more complicated, handling both sequential groups (ex: ['a', 'a', 'a', 'b', 'b', 'b']) and "interwoven" (ex: ['a', 'b', 'a', 'b', 'a', 'b']). Open to splitting it up into separate tests if you think that would make it more readable.

Otherwise this instance should be covered. I fixed a bug in the Cython code to go along with it - thanks for the callout!

@@ -1472,7 +1479,7 @@ def pad(self, limit=None):
Series.fillna
DataFrame.fillna
"""
return self.apply(lambda x: x.ffill(limit=limit))
return self.apply('ffill', limit=limit)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than calling apply, pattern so far has been to define the method directly Groupby.cumsum, etc

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tied to the comment below, I added _cython_apply as its own function to handle dispatching back to the Cython layer, which this apply tries first before falling back to the current method of iterating over the groups and applying the method directly from there. Items like Groupby.cumsum call _cython_transform in a similar fashion, which I was trying to emulate here but I was hesitant to send these methods down the transform route because of the complexity that would cause there

@@ -2032,6 +2039,38 @@ def _get_group_keys(self):
self.levels,
self.labels)

def _cython_apply(self, ftype, data, axis, **kwargs):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all this should pass down to _cython_transform?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was a little torn over this. The call signature doesn't play all that nicely with the other methods utilizing _cython_transform. Specifically, the is_numeric and is_datetimelike arguments aren't applicable and there are some conditionals that need to be added to prevent accidental casting of data during the fill. On top of that, we'd have to wrap the transformation to be sure to include the grouped columns as none of the other transformations do that.

Touched on this with @jreback in the comments of #19481 - there for sure needs to be a pretty comprehensive refactoring on groupby.py to more effectively dispatch Cython operations. For the time being I thought this approach would be the "best of the worst" option, but can take another look at wiring into transform if we feel that is a show stopper

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't disagree the whole thing needs some refactoring, xref also to #19354 - I'll think about it a bit more too

WillAyd changed the title~~Grp fill perf~~ Cythonized GroupBy Fill

Feb 13, 2018

One possibility to clean this up would be to replicate GroupBy.shift

def shift(self, periods=1, freq=None, axis=0):

Rather than type specific functions, there the cython routine computes a single indexer, then the normal take functions are used to make the actual output.

def group_shift_indexer(int64_t[:] out, int64_t[:] labels,

Probably a bit slower than what you do here, but would be surprised if it's bad.

Thanks for the idea. I think I could make something like that work here and that would definitely simplify things

Just simplified the code to mirror what we do for shift. The groupby.py changes are MUCH cleaner now and there doesn't appear to be any significant change to benchmarks either. Updated results below:

   before           after         ratio
 [07137a5a]       [0939aeae]

      983ms            1.23s     1.25  groupby.GroupByMethods.time_method('float', 'pct_change')

    356±4μs         443±20μs     1.24  groupby.GroupByMethods.time_method('float', 'prod')

 84.6±0.2μs          101±3μs     1.19  groupby.GroupByMethods.time_method('int', 'count')

    185±1μs          218±3μs     1.18  groupby.GroupByMethods.time_method('int', 'cumcount')

    161±3μs          185±3μs     1.15  groupby.GroupByMethods.time_method('float', 'cumcount')

    334±9μs         382±20μs     1.14  groupby.GroupByMethods.time_method('float', 'median')

    461±3μs         523±10μs     1.13  groupby.GroupByMethods.time_method('int', 'cummax')

  344±0.8μs         387±10μs     1.12  groupby.GroupByMethods.time_method('float', 'nunique')

 64.5±0.2μs       72.1±0.9μs     1.12  groupby.GroupByMethods.time_method('int', 'size')

    456±2μs         504±20μs     1.11  groupby.GroupByMethods.time_method('float', 'cummin')

    133±5ms          146±3ms     1.10  groupby.GroupByMethods.time_method('int', 'any')

   769±20μs         684±30μs     0.89  groupby.GroupByMethods.time_method('int', 'cumprod')

  893±100μs         598±30μs     0.67  groupby.GroupByMethods.time_method('int', 'sem')

    145±2ms          275±7μs     0.00  groupby.GroupByMethods.time_method('int', 'bfill')

    154±3ms          245±1μs     0.00  groupby.GroupByMethods.time_method('int', 'ffill')

    228±5ms         270±10μs     0.00  groupby.GroupByMethods.time_method('float', 'bfill')

    237±5ms        232±0.5μs     0.00  groupby.GroupByMethods.time_method('float', 'ffill')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good


N = len(out)

sorted_labels = np.argsort(labels).view(dtype=np.int64)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if limit is None:
limit = -1
output = {}
if type(self) is DataFrameGroupBy:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this if branch for?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fill is unique in that unlike say transformations the column(s) used in the grouping are also returned as part of the output of a DataFrame object. _iterate_slices does not go over those items and I didn't see an existing method to handle that, so this conditional makes sure the grouped columns still make their way into the frame that gets returned

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I was thinking to myself if its worth adding an instance method to the GroupBy object to handle this (_iterate_groups or something to the effect) but couldn't think of any other function that requires this feature

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than doing the if type(...), make a method in SeriesGroupBy and DataFrameGroupBy

['all', 'any', 'bfill', 'count', 'cumcount', 'cummax', 'cummin',
'cumprod', 'cumsum', 'describe', 'ffill', 'first', 'head',
'last', 'mad', 'max', 'min', 'median', 'mean', 'nunique',
'pct_change', 'prod', 'rank', 'sem', 'shift', 'size', 'skew',

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we bench for the compat methods for datetimelikes? (timestamp / timedelta), most of these would work (except prod, mean, mad, pct_change and a few more), though these likely work for timedeltas. ok with adding an issue to do this as well (IOW doesn't need to be in this PR).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -646,6 +646,7 @@ Performance Improvements
- Improved performance of pairwise ``.rolling()`` and ``.expanding()`` with ``.cov()`` and ``.corr()`` operations (:issue:`17917`)
- Improved performance of :func:`DataFrameGroupBy.rank` (:issue:`15779`)
- Improved performance of variable ``.rolling()`` on ``.min()`` and ``.max()`` (:issue:`19521`)
- Improved performance of :func:`GroupBy.ffill` and :func:`GroupBy.bfill` (:issue:`11296`)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this will render. you can just do .groupby().ffill() and so on

with nogil:
for i in range(N):
idx = sorted_labels[i]
if mask[idx] == 1: # is missing

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curr_fill_idx seems potentially not defined (if it is missing but the limit is not hit), not sure if that is possible. maybe could initialize curr_fill_idx = -1?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The initialization occurs at declaration. Can move that into the body of the function if you feel that is more readable

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh see that ok then

int64_t curr_fill_idx=-1
int64_t idx, filled_vals=0

N = len(out)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe want an assert that N == len(labels) == len(mask)

if limit is None:
limit = -1
output = {}
if type(self) is DataFrameGroupBy:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than doing the if type(...), make a method in SeriesGroupBy and DataFrameGroupBy

for name, obj in self._iterate_slices():
indexer = np.zeros_like(labels)
mask = isnull(obj.values).view(np.uint8)
libgroupby.group_fillna_indexer(indexer, mask, labels, how,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe make a routine that shares code with how the shift_indexer is called (I am talking about in python space here)

with nogil:
for i in range(N):
idx = sorted_labels[i]
if mask[idx] == 1: # is missing

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh see that ok then

curr_fill_idx = idx

out[idx] = curr_fill_idx
# If we move to the next group, reset

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line here

Latest commit includes a shared function between fill and shift, which can theoretically house any / all implementations with a little clean up as well. As touched on in previous conversations, if we expand this it should probably be in its own module

Open to review and plan to change the location of the func within the module / update docs, but I wanted to push because this commit will cause tests to fail at the following location:

tm.assert_frame_equal(expected, getattr(gb, op)(*args))

The reason this is failing is because I replaced output = {} in the original shift code with output = collections.OrderedDict() in the newly shared function. I'm not entirely clear on why that would impact the test case, but it seems like a more explicit code path regardless.

I believe the test case is wrong anyway and the failing line should either have a .sort_index call or be placed before the expected = expected.sort_index(axis=1) call a few lines ahead of it, but wanted to get your input and make sure I am not misreading that

The reason this is failing is because I replaced output = {} in the original shift code with output = collections.OrderedDict() in the newly shared function. I'm not entirely clear on why that would impact the test case, but it seems like a more explicit code path regardless.

its prob happenstance that this worked before, IOW it happened to be sorted already. go ahead and add what is needed to make it pass

(and FYI make this an expected = ), rather than code in assert_frame_equal block

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason you want this to be a separate module at this point? it looks ok here

-------
GroupBy object populated with appropriate result(s)
"""
exp_kwds = collections.OrderedDict([

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these should be passed in directly (from the calling function), no? rather than hard coded inside this function.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, maybe the caller passes in a partial the kwargs baked rather than the string version of the funtion?

_get_cythonized_result(
    partial(pd._libs.groupby.group_shift_indexer, nperiods=2),
    ...
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ll think more about that. The one complicating factor is that masked values cannot be passed in directly as they are calculated on a per slice basis. To avoid a convoluted set of calls I figured it would make sense to have all keywords and positional arguments resolved within one function, but there’s certainly other ways to go about this

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chris-b1 comment is about the exp_kwds, I am still not clear why the caller cannot pass these

if needs_ngroups:
func = partial(func, ngroups)

# Convert any keywords into positional arguments

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can pass kwargs to cython functions, so not sure this is necessary

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I explicitly converted kwargs into positional arguments from the first comment back in #19481. If we are OK with kwargs in the Cython layer I can rewrite this, as it could simplify a few things

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, you can pass kwargs

base_func = getattr(libgroupby, how)

for name, obj in self._iterate_slices():
indexer = np.zeros_like(labels)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to specify int64? (or platform int)?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, int64

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a copy/paste of existing code, so I didn't give it too much thought. That said, labels is already wrapped with a _ensure_int64 call as part of group_info within the BaseGrouper, so unless there's something in particular you know of I figure it would be easiest to inherit that type from labels and not override

def _fill(self, direction, limit=None):
"""Overriden method to concat grouped columns in output"""
res = super()._fill(direction, limit=limit)
output = collections.OrderedDict()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be slightly simpler as a list-comprehension

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

though maybe this should be an arg to _fill? (IOW the option to do it should be done in fill)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean as an arg to _get_cythonized_result (fill would always use, unless we wanted to implement a new feature)? I was thinking about that but given fill on a DataFrameGroupBy is the only operation I know of at the moment that includes the grouping in the body of the returned object I figured it made the most sense to just perform that action there

base_func = getattr(libgroupby, how)

for name, obj in self._iterate_slices():
indexer = np.zeros_like(labels)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, int64

-------
GroupBy object populated with appropriate result(s)
"""
exp_kwds = collections.OrderedDict([

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, maybe the caller passes in a partial the kwargs baked rather than the string version of the funtion?

_get_cythonized_result(
    partial(pd._libs.groupby.group_shift_indexer, nperiods=2),
    ...
)

indexer = np.zeros_like(labels)
func = partial(base_func, indexer, labels)
if needs_mask:
mask = isnull(obj.values).astype(np.uint8, copy=False)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contrary to the int64 case, you do want to use view here - numpy doesn't see bool and uint8 as the 'same' type, so will always copy, which view elides.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the heads up - still trying to wrap my head around some of the nuances there. Out of curiosity, how do you know this? Just from experience or is there a good reference?

I do see that astype documentation mentions dtype, order and subok requirements need to be satisfied to prevent copy, but I'm not clear on what those requirements are and if they are documented

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately all I can really point to is trial and error. Some "general" rules, at least in a pandas context

copy=False will only elide a copy if the dtype is identical (may be contradictions to this, but at least not in general). Useful for conditional casts.
arr.view(dtype) never copies. Viewing across types should only be used if if the itemsize of the two types are identical. Primarily useful for casting a type with some metadata down to a a more primitive type (bool to uint8, datetime64 to int64).

To clarify my point about having a separate module for _get_cythonized_result I don't think that method as a stand-alone needs it's own module, but thinking long term and going back to the discussion in #19481 a class to more effectively dispatch all of the functions (agg and transform included) may make sense as a sub-module of groupby.py and _get_cythonized_result could potentially evolve into that

@WillAyd ahh perfect - yes absolutely in favor of s class dispatch mechanism
and should be in its own cython/python modules as needed
ok with merging this (small fix ups) then attacking that (any/all can be in current or new framework)

Hello @WillAyd! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 24, 2018 at 16:41 Hours UTC

@@ -1457,6 +1457,15 @@ def expanding(self, args, *kwargs):
from pandas.core.window import ExpandingGroupby
return ExpandingGroupby(self, args, *kwargs)

def _fill(self, direction, limit=None):
# Need int value for Cython

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a doc-string

-------
GroupBy object populated with appropriate result(s)
"""
exp_kwds = collections.OrderedDict([

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chris-b1 comment is about the exp_kwds, I am still not clear why the caller cannot pass these

if needs_ngroups:
func = partial(func, ngroups)

# Convert any keywords into positional arguments

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, you can pass kwargs

int64_t lab, idxer, idxer_slot
int64_t[:] label_seen = np.zeros(ngroups, dtype=np.int64)
int64_t[:, :] label_indexer

periods = kwargs['periods']

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, if you are not doing a .get() here, is there a reason you are not simply haveing periods=None in the signature? (if its optional)?

ndarray[int64_t] sorted_labels
int64_t limit, idx, curr_fill_idx=-1, filled_vals=0

direction = kwargs['direction']

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same again why not simply list them in the signature? (you can have **kwargs if you want to ignore other ones)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't want to define a default value for something like periods in both the Python and Cython layers, leaving the former to set that default and be responsible for passing to Cython.

It's a pretty trivial change so will re-push with your changes reflected soon

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think I've been looking at this change too long...planning on making these required in the Cython signature and using the existing defaults in the Python layer to populate those via **kwargs. If we are not on the same page let me know

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no that sounds good

Travis failure on latest commit looks unrelated (something with pyarrow?)

harisbal pushed a commit to harisbal/pandas that referenced this pull request

Feb 28, 2018