API/ENH: add method='nearest' to Index.get_indexer/reindex and method to get_loc by shoyer · Pull Request #9258 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation72 Commits1 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

shoyer

Fixes #8845

Currently only an index method; I'm open to ideas for how to expose it more
directly -- maybe df.loc_nearest?.

In particular, I think this is usually a more useful way to do indexing for
floats than pandas's reindex based .loc, because floats are inherently
imprecise.

CC @jreback @jorisvandenbossche @immerrr @hugadams

@shoyer

Hmm. I think this has all the functionality we need, but it would also be good to reconcile the API design here with asof. I'll dig into that, but I'm also open to any proposals.

@jorisvandenbossche

@jreback

I agree with @jorisvandenbossche this should just replace asof entirely
asod was essentially a simple version of this

@jreback

I think you should extract the indexer into another method maybe asof_indexer (and asof just calls that)
then we can support 'asof' joins! (I think there is an outstanding issue for this)

@shoyer

Should it return an array or an Index when given multiple values?

I'm returning an array, based on how get_indexer works.

I'm currently modeling this function as a sort of combination of get_loc and get_indexer. Perhaps it would indeed make more sense to separate this into two interfaces -- get_loc_nearest (could handle duplicate indexes) and get_indexer_nearest.

@jreback

@shoyer yes this should be 2 methods get_indexer_nearest which is called by get_loc_nearest (whose impl is trivial), but this exposes the indexer which is important.

my 2c, is that we already have asof so this should be just an impl, e.g. get_loc_asof/get_indexer_asof. No reason to have both

@shoyer shoyer changed the titleAPI/ENH: add Index.get_nearest API/ENH: add get_loc_nearest and get_indexer_nearest to Index

Jan 19, 2015

@shoyer

OK, latest commits include the split to 2 methods and handle missing values properly (that was not being done right before). The differences are:

  1. get_loc_nearest will coerce input to the appropriate Timedelta/Timestamp/Period scalar, like get_loc
  2. get_indexer_nearest uses -1 to mark missing values that cannot be found whereas get_loc_nearest raises KeyError

I was able to make Index.asof be a thin wrapper around get_loc_nearest. So it should be able to handle non-datetimes now (for whatever that's worth).

The other use of asof is in the method Series.asof and the undocumented method Index.asof_locs (which appears to only be used or tested by Series.asof). This method is a little different and somewhat strange IMO -- it skips missing values in the Series. For now I've left it alone because the skipping of missing values makes it somewhat different, but is could definitely also use a cleanup.

@hughesadam87

Where do you find the time man!

@shoyer

One more design question: How should we handle partial string matching for DatetimeIndex.get_loc_nearest?

The current design casts the input to a Timestamp and finds the nearest label. This seems reasonable, but it's not consistent with get_loc, which will return all matching labels (even a slice, if possible). Arguably, get_loc_nearest should be equivalent to calling get_loc and then falling back to nearest lookup if the key is not found.

Here's an concrete example:

idx = pd.to_datetime(['2000-01-31', '2000-02-28']) print idx.get_loc_nearest('2000-02', side='left') print idx.get_loc_nearest('2000-02', side='nearest')

With the current implementation, each of these would print 0. This is consistent with the current behavior of asof (as verified by test_asof_datetime_partial in test_index.py), but not get_loc, which would print 1 for each of these.

@shoyer

OK, it looks like I need to make my own API decision :).

I've settled on using the result of get_loc for the result of DatetimeIndex.get_loc_nearest, if possible. This means that the "nearest" value to a partial string index like '2000-01' will be any values in January 2001, not the nearest value to 2000-01-01T00:00:00.

This is also a small breaking API change for DatetimeIndex.asof. I've added documentation to that effect to what's new in the latest commit (along with more functionality/tests).

Any other comments? I think this is possibly ready to merge (once rebased).

@jorisvandenbossche

Would it be much trouble to not have this API change in asof? (possibly just leaving it as it is if it causes too much problems)

jorisvandenbossche

return label
See also: get_loc_nearest

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make this a See also section?

See also
--------
get_loc_nearest : small explanation what this is

then you get automatic linking in the html pages

@jorisvandenbossche

A question: when you have a partial string, what is the 'nearest' value, so what will DatetimeIndex.get_loc_nearest return? (say if you have all days of the month in your index, what day is 'nearest' to the month)?

@shoyer

@jorisvandenbossche We definitely could include a work around to avoid the change for asof by overwriting the subclass DatetimeIndex.asof method.... but I'm not even entirely sure this should really be considered an API change rather than a bug fix (even though there was a test for it).

Note the asof docstring: "For a sorted index, return the most recent label up to and including the passed label." A string like '2000-02' arguably does include the timestamp '2000-02-28'.

Looking into the history a little more: this test appeared in PR #8246 by @TomAugspurger. There, it appears that the issue was something else entirely -- asof was returning time stamps not in the index at all (see #8245). It's not clear to me that anyone carefully considered exactly how partial datetime strings should be interpreted in this change.

@shoyer

@jorisvandenbossche The "nearest" value from idx.get_loc_nearest('2000-01') will be a slice object indexing all values in January 2001 if idx.get_loc('2000-01') is valid; otherwise, it will return the location of the nearest value to 2000-01-01T00:00:00. I suppose it would be a good idea to document this more directly!

@jreback

I haven't really looked at this in depth - but

the string slicing should be analogous to partial string indexing with the same semantics

at first glance it appears that they are the same?

@shoyer

@jreback Yes, the nearest partial string indexing of a DatetimeIndex is exactly equivalent to partial partial string indexing if there is a match; otherwise, it casts the string label to a Timestamp and finds the nearest value.

jorisvandenbossche

@@ -1256,6 +1256,19 @@ def get_loc(self, key):
except (KeyError, ValueError):
raise KeyError(key)
def get_loc_nearest(self, key, side='nearest'):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should also get a docstring (there are normally some decorators for adding a docstring)

@jorisvandenbossche

@shoyer on asof, OK then, given this is rather new behaviour as you pointed out, so lets regard that as a bug fix (maybe also specify it like that in the whatsnew docs)

Some other things:

I know you modelled the functions to the existing get_loc and get_indexer, but supposing these wouldn't exist, I would have the following remarks:

@jreback

@jorisvandenbossche

I think this impl is fine. The signal of -1 is typical in the way the indexers work. These are by definition int64 and -1 is a not-found value. It is up to the user to process these.

By definition these are always indexers, by position, always, full-stop. These only have a single semantic for the output results. loc here is not location based, but just location (its an older nomenclature and I can see it being somewhat confusing, but its how all of the Index classes work.)

@shoyer I would think this API simpler if you do this;

def get_indexer(self, target, method=None, limit=None):   
        """
        Compute indexer and mask for new index given the current index. The
        indexer should be then used as an input to ndarray.take to align the
        current data to the new index. The mask determines whether labels are
        found or not in the current index

        Parameters
        ----------
        target : Index
        method : {'pad', 'ffill', 'backfill', 'bfill'}
            pad / ffill: propagate LAST valid observation forward to next valid
            backfill / bfill: use NEXT valid observation to fill gap

and allow method='nearest' (and allow the side parameter to be passed as well).
(similarly for get_loc).

I think its just think its adding API noise otherwise. These conceptually do exactly the same thing. (you certainly could have an internal _get_indexer_nearest if that makes the impl clearer though).

@shoyer

@jreback This is a good point about incorporating this into the get_indexer method. In fact, I realize now that get_indexer_nearest(side='left') is actually just a slower version of get_indexer(method='pad'). So it looks like adding method='nearest' is probably the way to go here.

@shoyer

It appears that the existing get_indexer method only works for monotonic ascending indexes and keys. So I'll probably keep around some of this code as a fallback option for pad/backfill with decreasing indexes and/or unordered keys.

@shoyer

@jorisvandenbossche I agree, the current index API is a little strange -- some refactoring to provide a unified API and to untangle core.indexers would definitely be in order. See also: #7651.

I'm implementing only the low level methods now, but eventually we should figure out a higher level API for nearest neighbor lookups in pandas.

jreback

new_index, indexer = axis_values.reindex(labels, method, level,
limit=limit)
return self._reindex_with_indexers(
{axis: [new_index, indexer]}, method=method, fill_value=fill_value,
limit=limit, copy=copy)
{axis: [new_index, indexer]}, fill_value=fill_value, copy=copy)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, right, method is extraneous here (above too?)

@jreback

minor comments, lgtm otherwise.

jorisvandenbossche

pad / ffill: find the PREVIOUS index value if no exact match.
backfill / bfill: use NEXT index value if no exact match
nearest: use the NEAREST index value if no exact match. Tied
distances are broken by preferring the larger index value.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the html docs, you have to make bullet points of these (use - or * before each line) to have this render properly.
The same for the other docstrings.

@jorisvandenbossche

docstring of reindex should still be updated

@shoyer

Yep, the docstrings need a bit of work. I'll do that next week -- going camping and will be offline for the next three days.

@shoyer shoyer changed the titleAPI/ENH: add method='nearest' to Index.get_indexer and method to get_loc API/ENH: add method='nearest' to Index.get_indexer/reindex and method to get_loc

Feb 17, 2015

@shoyer

OK, I updated docstrings for reindex, reindex_like, get_indexer and get_loc, and they appear to generate well-formatted API docs. Merge ready, I think?

jreback

s = Series(np.arange(10))
target = [0.1, 0.9, 1.5, 2.0]
actual = s.reindex(target, method='nearest')
expected = Series(np.around(target).astype(int), target)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

astype these to int64 otherwiset hey will fail on windows.

@jreback

lgtm

@shoyer

@shoyer

@jreback Fixed the dtypes.

@jorisvandenbossche @jreback I was planning on updating the basics.rst docs later (once "nearest" works for fillna), but it was simple enough to do it now. Let me know what you think. In particular, I removed some musing from Wes about the possibility of more sophisticated interpolation logic.

@jreback

This was referenced

Feb 18, 2015

@shoyer

@shoyer

OK, I'm going to merge this. Doc issues can be dealt with in a followup PR if necessary.

shoyer added a commit that referenced this pull request

Feb 23, 2015

@shoyer

API/ENH: add method='nearest' to Index.get_indexer/reindex and method to get_loc

@jreback

jreback added a commit that referenced this pull request

Feb 28, 2015

@jreback

@mjroth

Is the intent for this to return the nearest valid observation?
i.e. for the example given in the whats new 0.16,
http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html
if we throw in a NaN, it does not search for the next valid observation.

In [1]: import pandas as pd In [2]: import numpy as np In [3]: df = pd.DataFrame({'x': range(5)}) In [4]: df.x[0] = np.nan In [5]: df.reindex([0.2, 1.8, 3.5], method='nearest') Out[5]: x 0.2 NaN 1.8 2 3.5 4

I'd expect x=1 at 0.2, given the Docs.

@shoyer

@mjroth the intent is for it to return values at the nearest index label, similarly to the existing behavior for forward and backward filling reindexing methods. If you don't want to include NA values, you'll need to do a dropna first.