Groupby-like API for resampling by darothen · Pull Request #1272 · pydata/xarray
Conversation
This is a work-in-progress to resolve #1269.
- Basic functionality
- Cleanly deprecate old API
- New test cases
- Documentation / examples
- "What's new"
Openly welcome feedback/critiques on how I approached this. Subclassing Data{Array/set}GroupBy may not be the best way, but it would be easy enough to re-write the necessary helper functions (just apply(), I think) so that we do not need to inherit from them directly. Additional issues I'm working to resolve:
- I tried to make sure that calls using the old API won't break by refactoring the old logic to _resample_immediately(). This may not be the best approach!
- Similarly, I copied all the original test cases and added the suffix ..._old_api; these could trivially be folded into their related test cases for the new API.
- BUG: keep_attrs is ignored when you call it on methods chained to Dataset.resample(). Oddly enough, if I hard-code keep_attrs=True inside reduce_array() in DatasetResample::reduce it works just fine. I haven't figured out where the kwarg is getting lost.
- BUG: Some of the test cases (for instance, test_resample_old_vs_new_api) fail because resampling by calling self.groupby_cls ends up not working - it crashes because the group sizes that get computed are not what it expects. Occurs with both the new and old API.
    if isinstance(dim, basestring):
        dim = self[dim]
    group = DataArray(dim, [(RESAMPLE_DIM, dim)], name=RESAMPLE_DIM)
Note: I think this should probably be group = DataArray(dim, [(dim.dims, dim)], name=RESAMPLE_DIM) but for some reason this works.
| """Implement the original version of .resample() which immediately |
|---|
| executes the desired resampling operation. """ |
| from .dataarray import DataArray |
| RESAMPLE_DIM = '__resample_dim__' |
There might be a better way to handle this, but currently using RESAMPLE_DIM is actually pretty critical here, and removing it is probably the source of some of the bugs you are seeing with your new implementation. The problem is we need to keep track of the resampled time variable in a dataset along with the original time variable. If we give the resampled group variable the same name, then xarray gets them mixed up.
Smoothed out most of the problems from earlier and missing details. Still not sure if it's wise to refactor most of the resampling logic into a new resample.py, like what was done with rolling, but it still makes some sense to keep things in groupby.py because we're just subclassing existing machinery from there.
The only issue now is the signature for __init__() in Data{set,Array}Resample, where we have to add two keyword arguments. Python 2.x doesn't like named arguments after *args. There are a few options here, mostly just playing with **kwargs as in this StackOverflow thread.
> The only issue now is the signature for __init__() in Data{set,Array}Resample, where we have to add two keyword arguments. Python 2.x doesn't like named arguments after *args.
Yes, use pop, e.g., dim = kwargs.pop('dim', None). pop removes the arguments from kwargs, so you can pass on the remaining ones unchanged to the super class method.
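A minimal sketch of the suggested `kwargs.pop` pattern (class and attribute names here are illustrative, not the actual xarray classes), which keeps the signature Python 2 compatible:

```python
class GroupBy(object):
    def __init__(self, obj, group=None):
        self.obj = obj
        self.group = group


class Resample(GroupBy):
    # Python 2 rejects `def __init__(self, *args, dim=None, **kwargs)`,
    # so pop the extra keyword arguments out of **kwargs instead.
    def __init__(self, *args, **kwargs):
        self.dim = kwargs.pop('dim', None)
        self.resample_dim = kwargs.pop('resample_dim', None)
        # pop() removed our extras, so the remaining kwargs pass
        # through to the parent class unchanged.
        super(Resample, self).__init__(*args, **kwargs)


r = Resample('data', group='g', dim='time', resample_dim='__resample_dim__')
```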
It would be good to think about valid uses (and add some test coverage) for groupby-like resampling that doesn't use aggregation. For example, what happens if you want to use resample with arithmetic, .apply or to iterate over groups? This means that the ugly '__resample_dim__' name will leak out into the external API.
We still need a way to deal with redundant dimension/coordinate names, but maybe we can give a slightly friendlier name for the coordinate, e.g., resampled_time if time is the name of the resampled coordinate.
                      **kwargs)
    result = self.apply(reduce_array, shortcut=shortcut)
    return result.rename({self._resample_dim: self._dim})
Maybe put this .rename({self._resample_dim: self._dim}) business at the end of .apply instead? That would give a slightly better result for non-reduce uses of the new resample (e.g., for arithmetic).
    with self.assertRaisesRegexp(ValueError, 'index must be monotonic'):
        array[[2, 0, 1]].resample(time='1D')

    def test_resample_old_api(self):
It would probably be better to remove some of these old API tests in preference to tests using the new API. We don't really need all of them to be sure the old API still works.
    times = pd.date_range('2000-01-01', freq='6H', periods=10)
    array = DataArray(np.ones(10), [('time', times)])
    # Simple mean
Use pytest.warns around each use of the old API to verify that the right warning is raised (and also to ensure the warnings don't get issued when we run the test suite).
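`pytest.warns(DeprecationWarning)` asserts that the warning fires inside the `with` block. The same check can be sketched with the stdlib `warnings` module; the `old_resample` helper below is a hypothetical stand-in for the deprecated call style, not xarray's actual function:

```python
import warnings


def old_resample(freq, dim=None, how=None):
    # Stand-in for the deprecated signature resample('6H', dim='time', how='mean').
    warnings.warn(
        "passing how= to resample is deprecated; use "
        ".resample(time=freq).mean() instead",
        DeprecationWarning, stacklevel=2)
    return how


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_resample('6H', dim='time', how='mean')

# Exactly one DeprecationWarning should have been recorded.
assert any(issubclass(w.category, DeprecationWarning) for w in caught)
```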
    ds.resample(time='6H').reduce(np.mean)

> Resampling does not yet work for upsampling.
I think we need to fix this before merging this PR, since it suggests the existing functionality would only exist in deprecated form. Pandas does this with a method called .asfreq, though this is basically pure sugar since in practice I think it works exactly the same as .first (or .mean if only doing pure upsampling).
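For reference, the pandas behaviour being described: `.asfreq()` upsamples purely by reindexing to the finer frequency, leaving the new slots as NaN (the series below is illustrative data):

```python
import pandas as pd

# Weekly series upsampled to daily: .asfreq() just inserts the new
# daily labels, filling the six days between each week with NaN.
weekly = pd.Series([1.0, 2.0],
                   index=pd.date_range('2000-01-02', periods=2, freq='W'))
daily = weekly.resample('D').asfreq()
```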
Thanks for the feedback, @shoyer! Will circle back around to continue working on this in a few days when I have some free time. - Daniel
Should .apply() really work on non-aggregation functions? Based on the pandas documentation it seems like "resample" is truly just a synonym for a transformation of the time dimension. I can't really find many examples of people using this as a substitute for time group-bys... it seems that's what the pd.TimeGrouper is for, in conjunction with a normal .groupby().
As written, non-aggregation ("transformation"?) doesn't work because the call in _combine() to _maybe_reorder() messes things up (it drops all of the data along the resampled dimension). It shouldn't be too hard to fix this, although I'm leaning more and more to making stand-alone Data{Array,set}Resample classes in a separate file which only loosely inherit from their Data{Array,set}GroupBy cousins, since they need to re-write some really critical parts of the underlying machinery.
> I can't really find many examples of people using this as a substitute for time group-bys... it seems that's what the pd.TimeGrouper is for, in conjunction with a normal .groupby().
I think this is mostly because TimeGrouper has been around far longer than non-aggregating resample.
@darothen - can you give a summary of what's left to do here?
I think a pull against the new releases is critical to see what breaks. Beyond that, just code clean up and testing. I can try to bump this higher on my priority list.
darothen added 11 commits
…setResample" subclasses of their equivalent "GroupBy" implementation
…actual resampling dimension name
I did my best to re-base everything to master... plan on spending an hour or so figuring out what's broken and at least restoring the status quo.
Un-do _combine temporary debugging output
TODO
- ensure that count() works on Data{Array,set}Resample objects
- refactor Data{Array,set}Resample objects into a stand-alone file core/resample.py alongside core/groupby.py
- wrap pytest.warns around tests targeting old API
- move old API tests into stand-alone
- Crude up-sampling. Copy/pasting Stephan's earlier comment from Feb 20:

> I think we need to fix this before merging this PR, since it suggests the existing functionality would only exist in deprecated form. Pandas does this with a method called .asfreq, though this is basically pure sugar since in practice I think it works exactly the same as .first (or .mean if only doing pure upsampling).
Alright @jhamman, here's the complete list of work left here. I'll tackle some of it during my commutes this week.
    for how in ['mean', 'sum', 'first', 'last', ]:
        method = getattr(actual, how)
        result = method()
        self.assertDatasetEqual(expected, result)
No need to change, but in future you can use pytest parameters for these sorts of loops, and assert_equal rather than the self... methods throughout
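A sketch of the suggested refactor, turning the `how` loop into pytest parameters (the test name and toy data are hypothetical; the real tests exercise the resample reductions):

```python
import pytest


@pytest.mark.parametrize('how', ['mean', 'sum', 'first', 'last'])
def test_resample_reduction(how):
    # Each reduction runs as its own test case, so a failure in one
    # `how` does not mask failures in the others.
    data = [3, 1, 2]
    expected = {'mean': 2.0, 'sum': 6, 'first': 3, 'last': 2}[how]
    result = {'mean': sum(data) / len(data),
              'sum': sum(data),
              'first': data[0],
              'last': data[-1]}[how]
    assert result == expected
```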
darothen added 2 commits
darothen added 2 commits
I think this is good to go! I'll merge this in a day or two once others have had a chance to look it over.
| "variable '{}', but it is a dask array; dask arrays not " |
|---|
| "yet supprted in resample.interpolate(). Load into " |
| "memory with Dataset.load() before resampling." |
| .format(name) |
love it -- thanks!
    ----------
    kind : str {'linear', 'nearest', 'zero', 'slinear',
                'quadratic', 'cubic'}
        Interpolation scheme to use
add a period
shoyer changed the title from "WIP: Groupby-like API for resampling" to "Groupby-like API for resampling"
I gave this another pass through. I found a number of pep8 violations (run git diff upstream/master | flake8 --diff) that we should fix. One comment of substance: we probably need better test coverage on the new errors.
- Passes git diff upstream/master | flake8 --diff
                                label=label, base=base)
    resampler = self._resample_cls(self, group=group, dim=dim_name,
                                grouper=time_grouper,
                                resample_dim=resample_dim)
fix indentation
    ), DeprecationWarning, stacklevel=3)
    if isinstance(dim, basestring):
        dim_name = dim
dim_name is never used
    result = result.rename({RESAMPLE_DIM: dim.name})

    return result
PEP8: remove extra blank line
    if (coords is not None and not utils.is_dict_like(coords) and
        len(coords) != len(shape)):
fix indentation
| "yet supprted in resample.interpolate(). Load into " |
|---|
| "memory with Dataset.load() before resampling." |
| .format(name) |
| ) |
Can we add a unit test that covers this TypeError? Right now, I think you'll actually get a NameError because name is not defined.
    return DataArray(f(new_x), coords, dims, name=dummy.name,
                     attrs=dummy.attrs)

    ops.inject_reduce_methods(DataArrayResample)
PEP8: add a second empty line after class definition
    # convention from old api
    new_api = getattr(resampler, method)(keep_attrs=False)
    with pytest.warns(DeprecationWarning):
        old_api = array.resample('1D', dim='time', how=method)
indentation
    def test_resample_mean_keep_attrs(self):
        array = DataArray(np.arange(10), [('__resample_dim__', times)])
        actual = array.resample('1D', dim='__resample_dim__', how='first')
        self.assertRaisesRegexp(ValueError, 'Proxy resampling dimension')
I think this needs to be:

    with self.assertRaisesRegexp(ValueError, 'Proxy resampling dimension'):
        actual = array.resample('1D', dim='__resample_dim__', how='first')
I'm not sure why this is passing actually.
This was butchered in more ways than one; I fixed the test which revealed some underlying flaws, too, which are also fixed.
    ycoord = DataArray(yy.T, {'x': xs, 'y': ys}, ('x', 'y'))
    tcoord = DataArray(tt, {'time': times}, ('time', ))
    ds = Dataset({'data': array, 'xc': xcoord,
                 'yc': ycoord, 'tc': tcoord})
indentation
@jhamman Gotcha, I'll clean everything up by the end of the week. If that's going to block 0.10.0, let me know and I'll shuffle some things around to prioritize this.
@darothen - we have a few other PRs to wrap up for 0.10 so end of the week is okay.
darothen added 2 commits
@jhamman Think we're good. I deferred 4 small pep8 issues because they're in parts of the codebase which I don't think I ever touched, and I'm worried they're going to screw up the merge.
Thanks @darothen - can you resolve the merge conflicts?
@jhamman done - caught me right while I was compiling GEOS-Chem, and the merge conflicts were very simple.
Two more changes need to be made... these were caught by Travis.
                   ('x', 'y', 'time'))
    self.assertDataArrayIdentical(expected, actual)

    def test_upsample_interpolate(self):
this needs a @requires_scipy decorator or put
interp1d = pytest.importorskip('scipy.interpolate.interp1d')
in the first line of the test (instead of the import)
    def _interpolate(self, kind='linear'):
        """Apply scipy.interpolate.interp1d along resampling dimension."""
        from .dataarray import DataArray
let's import interp1d here, instead of at the module level. scipy is still a optional dependency.
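A sketch of the function-level import pattern being suggested (a simplified standalone function rather than the actual `_interpolate` method):

```python
def interpolate_1d(x, y, new_x, kind='linear'):
    # Importing scipy inside the function keeps it an optional
    # dependency: modules that never call this path import fine
    # without scipy; only callers of this function need it.
    from scipy.interpolate import interp1d
    f = interp1d(x, y, kind=kind)
    return f(new_x)
```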
Gotcha - it needed to be imported under DatasetResample, too.
@darothen - almost there. Two or three more dependency conflicts in the tests.
@jhamman Ohhh, I totally misunderstood the last readout from travis-ci. Dealing with the scipy dependency is easy enough. However, another test fails because it uses np.flip(), which wasn't added to numpy until v1.12.0. Do we want to bump the numpy version in the dependencies? Or is there another approach to take here?
Nevermind, easy solution is just to use other axis-reversal methods :)
We just bumped numpy to 1.11, but 1.12 would be too new.
Let's just add a np.flip backport to core/npcompat.py. The whole function is only a couple of lines.
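The backport is indeed tiny; a sketch along these lines (mirroring what np.flip does by indexing with a reversed slice, so it returns a view rather than a copy):

```python
import numpy as np


def flip(m, axis):
    """Minimal backport of np.flip (added in numpy 1.12): reverse the
    order of elements along `axis` via a [::-1] slice view."""
    m = np.asanyarray(m)
    indexer = [slice(None)] * m.ndim
    indexer[axis] = slice(None, None, -1)
    return m[tuple(indexer)]
```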
darothen added 2 commits