Groupby-like API for resampling by darothen · Pull Request #1272 · pydata/xarray

Conversation

@darothen

This is a work-in-progress to resolve #1269.

Openly welcome feedback/critiques on how I approached this. Subclassing Data{Array/set}GroupBy may not be the best way, but it would be easy enough to re-write the necessary helper functions (just apply(), I think) so that we do not need to inherit from them directly. Additional issues I'm working to resolve:

shoyer

if isinstance(dim, basestring):
dim = self[dim]
group = DataArray(dim, [(RESAMPLE_DIM, dim)], name=RESAMPLE_DIM)

Note: I think this should probably be group = DataArray(dim, [(dim.dims, dim)], name=RESAMPLE_DIM) but for some reason this works.

"""Implement the original version of .resample() which immediately
executes the desired resampling operation. """
from .dataarray import DataArray
RESAMPLE_DIM = '__resample_dim__'

There might be a better way to handle this, but currently using RESAMPLE_DIM is actually pretty critical here, and removing it is probably the source of some of the bugs you are seeing with your new implementation. The problem is that we need to keep track of the resampled time variable in a dataset along with the original time variable. If we give the resampled group variable the same name, then xarray gets them mixed up.

@darothen

Smoothed out most of the problems and missing details from earlier. Still not sure if it's wise to refactor most of the resampling logic into a new resample.py, like what was done with rolling, but it still makes some sense to keep things in groupby.py because we're just subclassing existing machinery from there.

The only issue now is the signature for __init__() in Data{set,Array}Resample, where we have to add in two keyword arguments. Python 2.x doesn't like named arguments after *args. There are a few options here, mostly just playing with **kwargs as in this StackOverflow thread.

@shoyer

The only issue now is the signature for __init__() in Data{set,Array}Resample, where we have to add in two keyword arguments. Python 2.x doesn't like named arguments after *args. There are a few options here, mostly just playing with **kwargs as in this StackOverflow thread.

Yes, use pop, e.g., dim = kwargs.pop('dim', None). pop removes the arguments from kwargs, so you can pass on the remaining ones unchanged to the super class method.
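
For illustration, a minimal, self-contained sketch of that pattern; _GroupByStandIn and the attribute names here are placeholders, not the PR's actual classes:

class _GroupByStandIn(object):
    # Stand-in for xarray's DataArrayGroupBy, just so the sketch runs on its own.
    def __init__(self, obj, group=None):
        self.obj = obj
        self.group = group

class DataArrayResample(_GroupByStandIn):
    def __init__(self, *args, **kwargs):
        # Python 2 rejects keyword-only arguments after *args, so pop the
        # resample-specific keywords out of **kwargs and pass the rest through
        # to the parent constructor unchanged.
        self._dim = kwargs.pop('dim', None)
        self._resample_dim = kwargs.pop('resample_dim', None)
        super(DataArrayResample, self).__init__(*args, **kwargs)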

shoyer

It would be good to think about valid uses (and add some test coverage) for groupby-like resampling that doesn't use aggregation. For example, what happens if you want to use resample with arithmetic, .apply or to iterate over groups? This means that the ugly '__resample_dim__' name will leak out into the external API.

We still need a way to deal with redundant dimension/coordinate names, but maybe we can give a slightly friendlier name for the coordinate, e.g., resampled_time if time is the name of the resampled coordinate.
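
A sketch of the kinds of non-aggregating uses this comment has in mind, written against the keyword-style API proposed in this PR and assuming iteration and .apply carry over from the GroupBy machinery:

import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range('2000-01-01', freq='6H', periods=12)
array = xr.DataArray(np.arange(12.0), [('time', times)])
resampler = array.resample(time='1D')
# Groupby-style iteration over the resampling bins
for label, chunk in resampler:
    print(label, chunk.sizes)
# A non-reducing use: per-bin arithmetic against the bin mean
anomalies = resampler.apply(lambda x: x - x.mean())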

**kwargs)
result = self.apply(reduce_array, shortcut=shortcut)
return result.rename({self._resample_dim: self._dim})

Maybe put this .rename({self._resample_dim: self._dim}) business at the end of .apply instead? That would give a slightly better result for non-reduce uses of the new resample (e.g., for arithmetic).

with self.assertRaisesRegexp(ValueError, 'index must be monotonic'):
array[[2, 0, 1]].resample(time='1D')
def test_resample_old_api(self):

It would probably be better to remove some of these old API tests in preference to tests using the new API. We don't really need all of them to be sure the old API still works.

times = pd.date_range('2000-01-01', freq='6H', periods=10)
array = DataArray(np.ones(10), [('time', times)])
# Simple mean

Use pytest.warns around each use of the old API to verify that the right warning is raised (and also to ensure the warnings don't get issued when we run the test suite).
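
For example, a minimal sketch of what that looks like for one old-API call; the test name and data are illustrative:

import numpy as np
import pandas as pd
import pytest
import xarray as xr

def test_old_resample_api_warns():
    times = pd.date_range('2000-01-01', freq='6H', periods=10)
    array = xr.DataArray(np.ones(10), [('time', times)])
    # pytest.warns both asserts that the DeprecationWarning is raised and
    # keeps it out of the test suite's output.
    with pytest.warns(DeprecationWarning):
        array.resample('1D', dim='time', how='mean')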

ds.resample(time='6H').reduce(np.mean)
Resampling does not yet work for upsampling.

I think we need to fix this before merging this PR, since it suggests the existing functionality would only exist in deprecated form. Pandas does this with a method called .asfreq, though this is basically pure sugar since in practice I think it works exactly the same as .first (or .mean if only doing pure upsampling).
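
For reference, a small pandas sketch of the behaviour being described (pure upsampling, where .asfreq() and .mean() agree on the non-NaN values):

import numpy as np
import pandas as pd

series = pd.Series(np.arange(3.0),
                   index=pd.date_range('2000-01-01', periods=3, freq='D'))
# .asfreq() just places the existing values onto the finer index, filling with NaN
upsampled = series.resample('12H').asfreq()
# With pure upsampling each fine bin holds at most one original value, so
# .mean() produces the same non-NaN values.
also_upsampled = series.resample('12H').mean()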

@darothen

Thanks for the feedback, @shoyer! Will circle back around to continue working on this in a few days when I have some free time. - Daniel

@darothen

Should .apply() really work on non-aggregation functions? Based on the pandas documentation it seems like "resample" is truly just a synonym for a transformation of the time dimension. I can't really find many examples of people using this as a substitute for time group-bys... it seems that's what the pd.TimeGrouper is for, in conjunction with a normal .groupby().

As written, non-aggregation ("transformation"?) doesn't work because the call in _combine() to _maybe_reorder() messes things up (it drops all of the data along the resampled dimension). It shouldn't be too hard to fix this, although I'm leaning more and more to making stand-alone Data{Array,set}Resample classes in a separate file which only loosely inherit from their Data{Array,set}GroupBy cousins, since they need to re-write some really critical parts of the underlying machinery.

@shoyer

I can't really find many examples of people using this as a substitute for time group-bys... it seems that's what the pd.TimeGrouper is for, in conjunction with a normal .groupby().

I think this is mostly because TimeGrouper has been around far longer than non-aggregating resample.
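
For context, a sketch of the two pandas spellings being compared here:

import numpy as np
import pandas as pd

s = pd.Series(np.arange(10.0),
              index=pd.date_range('2000-01-01', freq='6H', periods=10))
# The older groupby-with-a-time-grouper spelling (pd.Grouper is the modern
# name for pd.TimeGrouper)...
by_day = s.groupby(pd.Grouper(freq='1D')).mean()
# ...and the resample spelling, which gives the same aggregated result.
also_by_day = s.resample('1D').mean()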

@jhamman

@darothen - can you give a summary of what's left to do here?

@darothen

I think a pull against the new releases is critical to see what breaks. Beyond that, just code clean up and testing. I can try to bump this higher on my priority list.

darothen added 11 commits

July 19, 2017 09:26

…setResample" subclasses of their equivalent "GroupBy" implementation

…actual resampling dimension name

@darothen

I did my best to rebase everything to master... plan on spending an hour or so figuring out what's broken and at least restoring the status quo.

Un-do _combine temporary debugging output

@darothen

TODO

I think we need to fix this before merging this PR, since it suggests the existing functionality would only exist in deprecated form. Pandas does this with a method called .asfreq, though this is basically pure sugar since in practice I think it works exactly the same as .first (or .mean if only doing pure upsampling).


Alright @jhamman, here's the complete list of work left here. I'll tackle some of it during my commutes this week.

max-sixty

for how in ['mean', 'sum', 'first', 'last', ]:
method = getattr(actual, how)
result = method()
self.assertDatasetEqual(expected, result)

No need to change, but in future you can use pytest parameters for these sorts of loops, and assert_equal rather than the self... methods throughout
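
A sketch of how that loop might look parametrized; the test name and data are illustrative, and assert_equal is assumed to be the testing helper the comment refers to:

import numpy as np
import pandas as pd
import pytest
import xarray as xr
from xarray.testing import assert_equal

@pytest.mark.parametrize('how', ['mean', 'sum', 'first', 'last'])
def test_resample_new_vs_old_api(how):
    times = pd.date_range('2000-01-01', freq='6H', periods=10)
    ds = xr.Dataset({'foo': ('time', np.arange(10.0)), 'time': times})
    # Each method becomes its own test case, so a failure reports which `how` broke.
    new_api = getattr(ds.resample(time='1D'), how)()
    with pytest.warns(DeprecationWarning):
        old_api = ds.resample('1D', dim='time', how=how)
    assert_equal(new_api, old_api)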

darothen added 2 commits

September 13, 2017 12:11

darothen added 2 commits

September 13, 2017 12:37

@darothen

shoyer

I think this is good to go! I'll merge this in a day or two once others have had the chance to look it over.

"variable '{}', but it is a dask array; dask arrays not "
"yet supprted in resample.interpolate(). Load into "
"memory with Dataset.load() before resampling."
.format(name)

love it -- thanks!

----------
kind : str {'linear', 'nearest', 'zero', 'slinear',
'quadratic', 'cubic'}
Interpolation scheme to use

add a period

@shoyer changed the title WIP: Groupby-like API for resampling → Groupby-like API for resampling

Sep 13, 2017

jhamman

I gave this another pass through. I found a number of pep8 violations (run git diff upstream/master | flake8 --diff) that we should fix. One comment of substance: we probably need better test coverage on the new errors.

label=label, base=base)
resampler = self._resample_cls(self, group=group, dim=dim_name,
grouper=time_grouper,
resample_dim=resample_dim)

fix indentation

), DeprecationWarning, stacklevel=3)
if isinstance(dim, basestring):
dim_name = dim

dim_name is never used

result = result.rename({RESAMPLE_DIM: dim.name})
return result

PEP8: remove extra blank line

if (coords is not None and not utils.is_dict_like(coords) and
len(coords) != len(shape)):
len(coords) != len(shape)):

fix indentation

"yet supprted in resample.interpolate(). Load into "
"memory with Dataset.load() before resampling."
.format(name)
)

Can we add a unit test that covers this TypeError? Right now, I think you'll actually get a NameError because name is not defined.
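
A sketch of the kind of test being asked for, assuming the chunked (dask-backed) path raises the TypeError shown above; names and sizes are illustrative, and dask needs to be installed:

import numpy as np
import pandas as pd
import pytest
import xarray as xr

def test_upsample_interpolate_dask_raises():
    times = pd.date_range('2000-01-01', freq='6H', periods=5)
    ds = xr.Dataset({'foo': ('time', np.arange(5.0)), 'time': times}).chunk({'time': 1})
    # Should raise the TypeError above, not a NameError from the unbound `name`.
    with pytest.raises(TypeError):
        ds.resample(time='1H').interpolate('linear')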

return DataArray(f(new_x), coords, dims, name=dummy.name,
attrs=dummy.attrs)
ops.inject_reduce_methods(DataArrayResample)

PEP8: add a second empty line after class definition

# convention from old api
new_api = getattr(resampler, method)(keep_attrs=False)
with pytest.warns(DeprecationWarning):
old_api = array.resample('1D', dim='time', how=method)

indentation

def test_resample_mean_keep_attrs(self):
array = DataArray(np.arange(10), [('__resample_dim__', times)])
actual = array.resample('1D', dim='__resample_dim__', how='first')
self.assertRaisesRegexp(ValueError, 'Proxy resampling dimension')

I think this needs to be:

with self.assertRaisesRegexp(ValueError, 'Proxy resampling dimension'):
    actual = array.resample('1D', dim='__resample_dim__', how='first')

I'm not sure why this is passing actually.

This was butchered in more ways than one; I fixed the test, which revealed some underlying flaws that are now fixed as well.

ycoord = DataArray(yy.T, {'x': xs, 'y': ys}, ('x', 'y'))
tcoord = DataArray(tt, {'time': times}, ('time', ))
ds = Dataset({'data': array, 'xc': xcoord,
'yc': ycoord, 'tc': tcoord})

indentation

ycoord = DataArray(yy.T, {'x': xs, 'y': ys}, ('x', 'y'))
tcoord = DataArray(tt, {'time': times}, ('time', ))
ds = Dataset({'data': array, 'xc': xcoord,
'yc': ycoord, 'tc': tcoord})

indentation

@darothen

@jhamman Gotcha, I'll clean everything up by the end of the week. If that's going to block 0.10.0, let me know and I'll shuffle some things around to prioritize this.

@jhamman

@darothen - we have a few other PRs to wrap up for 0.10 so end of the week is okay.

darothen added 2 commits

September 20, 2017 08:16

@darothen

@jhamman Think we're good. I deferred 4 small pep8 issues because they're in parts of the codebase which I don't think I ever touched, and I'm worried they're going to screw up the merge.

@jhamman

Thanks @darothen - can you resolve the merge conflicts?

@darothen

@jhamman done - caught me right while I was compiling GEOS-Chem, and the merge conflicts were very simple.

jhamman

Two more changes need to be made... these were caught by Travis.

('x', 'y', 'time'))
self.assertDataArrayIdentical(expected, actual)
def test_upsample_interpolate(self):

This needs a @requires_scipy decorator, or put

interp1d = pytest.importorskip('scipy.interpolate.interp1d')

in the first line of the test (instead of the import)

def _interpolate(self, kind='linear'):
"""Apply scipy.interpolate.interp1d along resampling dimension."""
from .dataarray import DataArray

Let's import interp1d here, instead of at the module level. scipy is still an optional dependency.
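
A minimal sketch of the deferred-import pattern being asked for; the function name and signature are illustrative, not the PR's exact code:

def _interpolate_1d(y, x, new_x, kind='linear'):
    # Importing inside the function keeps scipy optional: plain `import xarray`
    # works without scipy, and only this code path requires it.
    from scipy.interpolate import interp1d
    return interp1d(x, y, kind=kind)(new_x)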

Gotcha - it needed to be imported under DatasetResample, too.

@jhamman

@darothen - almost there. Two or three more dependency conflicts in the tests.

@darothen

@jhamman Ohhh, I totally misunderstood the last readout from travis-ci. Dealing with the scipy dependency is easy enough. However, another test fails because it uses np.flip(), which wasn't added to numpy until v1.12.0. Do we want to bump the numpy version in the dependencies? Or is there another approach to take here?

Never mind, the easy solution is just to use other axis-reversal methods :)

@shoyer

We just bumped numpy to 1.11, but 1.12 would be too new.

Let's just add a np.flip backport to core/npcompat.py. The whole function is only a couple of lines.
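
For reference, a sketch of the kind of backport that could go in core/npcompat.py, closely following NumPy's own implementation of flip; the version that actually lands may differ:

import numpy as np

def flip(m, axis):
    # Backport of numpy.flip (new in NumPy 1.12): reverse along `axis` with a
    # slice-based indexer, which works on NumPy >= 1.11 arrays.
    if not hasattr(m, 'ndim'):
        m = np.asarray(m)
    indexer = [slice(None)] * m.ndim
    try:
        indexer[axis] = slice(None, None, -1)
    except IndexError:
        raise ValueError('axis=%i is invalid for the %i-dimensional input array'
                         % (axis, m.ndim))
    return m[tuple(indexer)]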

darothen added 2 commits

September 21, 2017 17:11
