API/ENH: add method='nearest' to Index.get_indexer/reindex and method to get_loc by shoyer · Pull Request #9258 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation72 Commits1 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
Fixes #8845
Currently only an index method; I'm open to ideas for how to expose it more
directly -- maybe df.loc_nearest
?.
In particular, I think this is usually a more useful way to do indexing for
floats than pandas's reindex based .loc
, because floats are inherently
imprecise.
CC @jreback @jorisvandenbossche @immerrr @hugadams
Hmm. I think this has all the functionality we need, but it would also be good to reconcile the API design here with asof
. I'll dig into that, but I'm also open to any proposals.
- Should it return an array or an Index when given multiple values?
- For
asof
, I think this can completely replace the usage ofasof
, and is also compatible in signature. Only differences are:asof
would have a default ofleft
in terms ofget_nearest
- it returns NaN if it cannot find anything.
I think we can leaveasof
as is, but add above remarks to the docs. Referring in the docstring ofasof
toget_nearest
and indicating how you can replace it.
I agree with @jorisvandenbossche this should just replace asof entirely
asod was essentially a simple version of this
I think you should extract the indexer into another method maybe asof_indexer (and asof just calls that)
then we can support 'asof' joins! (I think there is an outstanding issue for this)
Should it return an array or an Index when given multiple values?
I'm returning an array, based on how get_indexer
works.
I'm currently modeling this function as a sort of combination of get_loc
and get_indexer
. Perhaps it would indeed make more sense to separate this into two interfaces -- get_loc_nearest
(could handle duplicate indexes) and get_indexer_nearest
.
@shoyer yes this should be 2 methods get_indexer_nearest
which is called by get_loc_nearest
(whose impl is trivial), but this exposes the indexer which is important.
my 2c, is that we already have asof
so this should be just an impl, e.g. get_loc_asof/get_indexer_asof
. No reason to have both
shoyer changed the title
API/ENH: add Index.get_nearest API/ENH: add get_loc_nearest and get_indexer_nearest to Index
OK, latest commits include the split to 2 methods and handle missing values properly (that was not being done right before). The differences are:
get_loc_nearest
will coerce input to the appropriate Timedelta/Timestamp/Period scalar, likeget_loc
get_indexer_nearest
uses -1 to mark missing values that cannot be found whereasget_loc_nearest
raisesKeyError
I was able to make Index.asof
be a thin wrapper around get_loc_nearest
. So it should be able to handle non-datetimes now (for whatever that's worth).
The other use of asof is in the method Series.asof
and the undocumented method Index.asof_locs
(which appears to only be used or tested by Series.asof
). This method is a little different and somewhat strange IMO -- it skips missing values in the Series. For now I've left it alone because the skipping of missing values makes it somewhat different, but is could definitely also use a cleanup.
Where do you find the time man!
One more design question: How should we handle partial string matching for DatetimeIndex.get_loc_nearest
?
The current design casts the input to a Timestamp and finds the nearest label. This seems reasonable, but it's not consistent with get_loc
, which will return all matching labels (even a slice, if possible). Arguably, get_loc_nearest
should be equivalent to calling get_loc
and then falling back to nearest lookup if the key is not found.
Here's an concrete example:
idx = pd.to_datetime(['2000-01-31', '2000-02-28']) print idx.get_loc_nearest('2000-02', side='left') print idx.get_loc_nearest('2000-02', side='nearest')
With the current implementation, each of these would print 0. This is consistent with the current behavior of asof
(as verified by test_asof_datetime_partial
in test_index.py
), but not get_loc
, which would print 1 for each of these.
OK, it looks like I need to make my own API decision :).
I've settled on using the result of get_loc
for the result of DatetimeIndex.get_loc_nearest
, if possible. This means that the "nearest" value to a partial string index like '2000-01'
will be any values in January 2001, not the nearest value to 2000-01-01T00:00:00
.
This is also a small breaking API change for DatetimeIndex.asof
. I've added documentation to that effect to what's new in the latest commit (along with more functionality/tests).
Any other comments? I think this is possibly ready to merge (once rebased).
Would it be much trouble to not have this API change in asof
? (possibly just leaving it as it is if it causes too much problems)
return label |
See also: get_loc_nearest |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make this a See also
section?
See also
--------
get_loc_nearest : small explanation what this is
then you get automatic linking in the html pages
A question: when you have a partial string, what is the 'nearest' value, so what will DatetimeIndex.get_loc_nearest
return? (say if you have all days of the month in your index, what day is 'nearest' to the month)?
@jorisvandenbossche We definitely could include a work around to avoid the change for asof
by overwriting the subclass DatetimeIndex.asof
method.... but I'm not even entirely sure this should really be considered an API change rather than a bug fix (even though there was a test for it).
Note the asof
docstring: "For a sorted index, return the most recent label up to and including the passed label." A string like '2000-02' arguably does include the timestamp '2000-02-28'.
Looking into the history a little more: this test appeared in PR #8246 by @TomAugspurger. There, it appears that the issue was something else entirely -- asof
was returning time stamps not in the index at all (see #8245). It's not clear to me that anyone carefully considered exactly how partial datetime strings should be interpreted in this change.
@jorisvandenbossche The "nearest" value from idx.get_loc_nearest('2000-01')
will be a slice object indexing all values in January 2001 if idx.get_loc('2000-01')
is valid; otherwise, it will return the location of the nearest value to 2000-01-01T00:00:00
. I suppose it would be a good idea to document this more directly!
I haven't really looked at this in depth - but
the string slicing should be analogous to partial string indexing with the same semantics
at first glance it appears that they are the same?
@jreback Yes, the nearest partial string indexing of a DatetimeIndex is exactly equivalent to partial partial string indexing if there is a match; otherwise, it casts the string label to a Timestamp and finds the nearest value.
@@ -1256,6 +1256,19 @@ def get_loc(self, key): |
---|
except (KeyError, ValueError): |
raise KeyError(key) |
def get_loc_nearest(self, key, side='nearest'): |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should also get a docstring (there are normally some decorators for adding a docstring)
@shoyer on asof
, OK then, given this is rather new behaviour as you pointed out, so lets regard that as a bug fix (maybe also specify it like that in the whatsnew docs)
Some other things:
- On the partial string indexing. The docstring of
get_loc_nearest
says it returns an int -> so this is not always correct then. - For a DatetimeIndex, you only implemented
get_loc_nearest
and notget_indexer_nearest
. Is this on purpose? If you now try usingget_indexer_nearest
with a string, you get rather unhelpful error messages.
I know you modelled the functions to the existing get_loc
and get_indexer
, but supposing these wouldn't exist, I would have the following remarks:
- Why have both? Why not just one function that accepts label or array of labels? (and returns int or array of ints)
- I find the name a bit confusing.
get_loc_nearest
returns the 'integer location by position'. But in the 'indexing terminology',loc
is for the 'label-location' based indexer. - The return of
-1
if a key is not found, seems a bit strange for me. As you would typically use the result to index your frame/series. But -1 has in that case a specific meaning, namely the last value. And that is not always what you want I think. (I know this is not that simple, as you cannot have NaN in integer, and you cannot have non-integer in the indexer, ..)
I think this impl is fine. The signal of -1 is typical in the way the indexers work. These are by definition int64
and -1 is a not-found value. It is up to the user to process these.
By definition these are always indexers, by position, always, full-stop. These only have a single semantic for the output results. loc
here is not location based, but just location (its an older nomenclature and I can see it being somewhat confusing, but its how all of the Index classes work.)
@shoyer I would think this API simpler if you do this;
def get_indexer(self, target, method=None, limit=None):
"""
Compute indexer and mask for new index given the current index. The
indexer should be then used as an input to ndarray.take to align the
current data to the new index. The mask determines whether labels are
found or not in the current index
Parameters
----------
target : Index
method : {'pad', 'ffill', 'backfill', 'bfill'}
pad / ffill: propagate LAST valid observation forward to next valid
backfill / bfill: use NEXT valid observation to fill gap
and allow method='nearest'
(and allow the side
parameter to be passed as well).
(similarly for get_loc
).
I think its just think its adding API noise otherwise. These conceptually do exactly the same thing. (you certainly could have an internal _get_indexer_nearest
if that makes the impl clearer though).
@jreback This is a good point about incorporating this into the get_indexer
method. In fact, I realize now that get_indexer_nearest(side='left')
is actually just a slower version of get_indexer(method='pad')
. So it looks like adding method='nearest'
is probably the way to go here.
It appears that the existing get_indexer
method only works for monotonic ascending indexes and keys. So I'll probably keep around some of this code as a fallback option for pad
/backfill
with decreasing indexes and/or unordered keys.
@jorisvandenbossche I agree, the current index API is a little strange -- some refactoring to provide a unified API and to untangle core.indexers
would definitely be in order. See also: #7651.
I'm implementing only the low level methods now, but eventually we should figure out a higher level API for nearest neighbor lookups in pandas.
new_index, indexer = axis_values.reindex(labels, method, level, |
---|
limit=limit) |
return self._reindex_with_indexers( |
{axis: [new_index, indexer]}, method=method, fill_value=fill_value, |
limit=limit, copy=copy) |
{axis: [new_index, indexer]}, fill_value=fill_value, copy=copy) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, right, method
is extraneous here (above too?)
minor comments, lgtm otherwise.
pad / ffill: find the PREVIOUS index value if no exact match. |
---|
backfill / bfill: use NEXT index value if no exact match |
nearest: use the NEAREST index value if no exact match. Tied |
distances are broken by preferring the larger index value. |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the html docs, you have to make bullet points of these (use -
or *
before each line) to have this render properly.
The same for the other docstrings.
docstring of reindex should still be updated
Yep, the docstrings need a bit of work. I'll do that next week -- going camping and will be offline for the next three days.
shoyer changed the title
API/ENH: add method='nearest' to Index.get_indexer and method to get_loc API/ENH: add method='nearest' to Index.get_indexer/reindex and method to get_loc
OK, I updated docstrings for reindex, reindex_like, get_indexer and get_loc, and they appear to generate well-formatted API docs. Merge ready, I think?
s = Series(np.arange(10)) |
---|
target = [0.1, 0.9, 1.5, 2.0] |
actual = s.reindex(target, method='nearest') |
expected = Series(np.around(target).astype(int), target) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
astype these to int64 otherwiset hey will fail on windows.
lgtm
- minor dtype chaneges needed in some tests
- haven't checked, but make basics.rst docs need updating (where reindex is, to maybe have an example for nearest)
@jreback Fixed the dtypes.
@jorisvandenbossche @jreback I was planning on updating the basics.rst docs later (once "nearest" works for fillna), but it was simple enough to do it now. Let me know what you think. In particular, I removed some musing from Wes about the possibility of more sophisticated interpolation logic.
This was referenced
Feb 18, 2015
OK, I'm going to merge this. Doc issues can be dealt with in a followup PR if necessary.
shoyer added a commit that referenced this pull request
API/ENH: add method='nearest' to Index.get_indexer/reindex and method to get_loc
jreback added a commit that referenced this pull request
Is the intent for this to return the nearest valid observation?
i.e. for the example given in the whats new 0.16,
http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html
if we throw in a NaN, it does not search for the next valid observation.
In [1]: import pandas as pd In [2]: import numpy as np In [3]: df = pd.DataFrame({'x': range(5)}) In [4]: df.x[0] = np.nan In [5]: df.reindex([0.2, 1.8, 3.5], method='nearest') Out[5]: x 0.2 NaN 1.8 2 3.5 4
I'd expect x=1 at 0.2, given the Docs.
@mjroth the intent is for it to return values at the nearest index label, similarly to the existing behavior for forward and backward filling reindexing methods. If you don't want to include NA values, you'll need to do a dropna first.