ENH: tolerance now takes list-like argument for reindex and get_indexer. by buntwo · Pull Request #17367 · pandas-dev/pandas


Conversation (55) · Commits (6) · Checks (0) · Files changed

Conversation


buntwo

Enable use of list-like values for the tolerance argument in DataFrame.reindex(), Series.reindex(), and Index.get_indexer().

@pep8speaks

Hello @buntwo! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on October 14, 2017 at 19:10 Hours UTC

@jreback

Does this have an associated issue?

Can you show an example use case?

@buntwo

Use case: suppose you have a time series of values for a stock, with some days of missing data in between, and you want to select a date range that includes the missing data. You want some fill-backward (or forward, or nearest) logic of up to N days when doing so, but you want N to depend on some other property of the stock that varies with time. Currently pandas only supports a constant N. This feature enables that use case by letting the user construct a list-like object of the same size that contains the tolerance for each point and pass it in.
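A minimal sketch of that use case, assuming the list-like tolerance support from this PR (dates, prices, and per-date windows are made up for illustration):

import pandas as pd

# Sparse daily prices with gaps in between.
prices = pd.Series(
    [100.0, 101.5, 102.0],
    index=pd.to_datetime(['2017-01-02', '2017-01-04', '2017-01-09']),
)

# Reindex onto a full calendar with a per-date fill window instead of a
# single scalar tolerance; one Timedelta per target label.
target = pd.date_range('2017-01-01', '2017-01-10')
per_date_tolerance = [pd.Timedelta(days=d)
                      for d in (1, 1, 1, 1, 2, 2, 2, 3, 3, 3)]

filled = prices.reindex(target, method='nearest', tolerance=per_date_tolerance)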

@gfyoung

@buntwo : It seems like you want the tolerance to be time-varying. I suppose that could be useful, but I haven't really seen a use case like this.

I would suggest you open an issue for this and point your PR to it so we can discuss.

In the future, for enhancements like this, it would be preferable to open an issue first so that we can discuss it there rather than in the PR.


jreback


Comments are in-line. You are doing quite a bit of extra checking; I think you can pare this down substantially. You may need to adjust your tests slightly as well.

@@ -130,8 +130,7 @@ Other Enhancements
- `read_*` methods can now infer compression from non-string paths, such as ``pathlib.Path`` objects (:issue:`17206`).
- :func:`pd.read_sas()` now recognizes much more of the most frequently used date (datetime) formats in SAS7BDAT files (:issue:`15871`).
- :func:`DataFrame.items` and :func:`Series.items` is now present in both Python 2 and 3 and is lazy in all cases (:issue:`13918`, :issue:`17213`)
- :func:`Series.reindex`, :func:`DataFrame.reindex`, :func:`Index.get_indexer` now support list-like argument for `tolerance`.


double back-ticks on tolerance

@@ -2658,8 +2659,14 @@ def sort_index(self, axis=0, level=None, ascending=True, inplace=False,
Maximum distance between original and new labels for inexact
matches. The values of the index at the matching locations most
satisfy the equation ``abs(index[indexer] - target) <= tolerance``.
Tolerance may be a scalar value, which applies the same tolerance


add a blank line where you have new text (I think it will render ok)

@@ -2881,8 +2888,14 @@ def _reindex_multi(self, axes, copy, fill_value):
Maximum distance between original and new labels for inexact
matches. The values of the index at the matching locations most
satisfy the equation ``abs(index[indexer] - target) <= tolerance``.
Tolerance may be a scalar value, which applies the same tolerance


same (bonus points if we can use a _shared_docs here to avoid code duplication)

@@ -2444,9 +2443,14 @@ def _get_unique_index(self, dropna=False):
tolerance : optional
Maximum distance from index value for inexact matches. The value of
the index at the matching location most satisfy the equation
``abs(index[loc] - key) <= tolerance``.
``abs(index[loc] - key) <= tolerance``. Tolerance may be a scalar


same

@@ -2566,8 +2570,14 @@ def _get_level_values(self, level):
Maximum distance between original and new labels for inexact
matches. The values of the index at the matching locations most
satisfy the equation ``abs(index[indexer] - target) <= tolerance``.
Tolerance may be a scalar value, which applies the same tolerance


Same. Actually, let's fix the _shared_docs things first in another PR (or here is OK too); there is too much doc-string duplication.


Makes sense. How would I do that, though? This (and the three others above) is already in _shared_docs, so it would require another layer of abstraction above _shared_docs. There is quite a bit of duplication with the existing tolerance docstrings as well.

except KeyError:
raise KeyError(key)
except ValueError as e:
# ndarray tolerance size must match target index size


what is this for?


A list-like tolerance size mismatch raises a ValueError, but so does something else, so I differentiate here... there is definitely a better way to do this.

except ValueError:
raise ValueError('tolerance argument for %s must be numeric: %r' %
(type(self).__name__, tolerance))
tolerance = _list_to_ndarray(tolerance)


see above, just use np.asarray(...), no need for the rest of this

offset = frequencies.to_offset(self.freq.rule_code)
if isinstance(offset, offsets.Tick):
nanos = tslib._delta_to_nanoseconds(other)
offset_nanos = tslib._delta_to_nanoseconds(offset)
if nanos % offset_nanos == 0:
if isinstance(other, np.ndarray):


I think you can just do

check = np.all(nanos % offset_nanos == 0)

which will work for scalars as well.
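For reference, a quick illustration (values arbitrary) of why the single np.all expression covers both the scalar and the array case:

import numpy as np

offset_nanos = 3

# Scalar input: behaves like a plain modulo check.
np.all(6 % offset_nanos == 0)      # True

# Array input: the same expression validates every element at once.
nanos = np.array([3, 6, 7])
np.all(nanos % offset_nanos == 0)  # False, since 7 is not a multiple of 3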

@@ -775,6 +780,10 @@ def get_indexer(self, target, method=None, limit=None, tolerance=None):
if tolerance is not None:
tolerance = self._convert_tolerance(tolerance)
if isinstance(tolerance, np.ndarray) and \


The check should already be in _convert_tolerance (I am not sure if I mentioned this above).

@@ -902,6 +911,10 @@ def _get_string_slice(self, key):
def _convert_tolerance(self, tolerance):
tolerance = DatetimeIndexOpsMixin._convert_tolerance(self, tolerance)
if isinstance(tolerance, np.ndarray) \


same

jreback added a commit to jreback/pandas that referenced this pull request

Sep 23, 2017

@jreback

@buntwo

I think the failed test on AppVeyor has nothing to do with my change; it didn't fail on the last commit.

jreback

@@ -234,6 +234,7 @@ Other Enhancements
- :meth:`DataFrame.assign` will preserve the original order of ``**kwargs`` for Python 3.6+ users instead of sorting the column names. (:issue:`14207`)
- Improved the import time of pandas by about 2.25x. (:issue:`16764`)
- :func:`read_json` and :func:`to_json` now accept a ``compression`` argument which allows them to transparently handle compressed files. (:issue:`17798`)
- :func:`Series.reindex`, :func:`DataFrame.reindex`, :func:`Index.get_indexer` now support list-like argument for ``tolerance``


add the issue number (or PR number if no issue)

elif is_list_like(arg) and getattr(arg, 'ndim', 1) <= 1:
if getattr(arg, 'ndim', 1) == 0:
# extract array scalar and process below
arg = arg[()]


use .item()
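For context, both spellings extract the value from a 0-d array; .item() is the more idiomatic one and returns a plain Python scalar:

import numpy as np

zero_d = np.asarray(5)   # a 0-dimensional array

zero_d[()]               # 5, as a NumPy scalar (indexing with an empty tuple)
zero_d.item()            # 5, as a Python int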

actual = idx.get_indexer([0.2, 1.8, 8.5], method=method,
tolerance=[0.5, 0.2, 0.2])
tm.assert_numpy_array_equal(actual, np.array(expected,
dtype=np.intp))


Can you use parametrize here to make the tests more readable?


can be a separate test


I'm a little unsure of what to parametrize here: in each for loop, I just copy the last test (actual = idx.(..., tm.assert_numpy_array_equal(...)) that uses a scalar tolerance but change the tolerance to an array. I could add some comments to make it clearer?


parametrize on method,expected
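A rough sketch of the suggested parametrization; the index, targets, and expected values below are worked out for a plain float index of 0 through 9 and are only illustrative, not the PR's actual tests:

import numpy as np
import pandas as pd
import pytest

idx = pd.Index(np.arange(10, dtype=float))

@pytest.mark.parametrize('method, expected', [
    ('pad', [0, -1, -1]),
    ('backfill', [-1, 2, -1]),
    ('nearest', [0, 2, -1]),
])
def test_get_indexer_listlike_tolerance(method, expected):
    actual = idx.get_indexer([0.2, 1.8, 8.5], method=method,
                             tolerance=[0.5, 0.2, 0.2])
    np.testing.assert_array_equal(actual, np.array(expected, dtype=np.intp))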

@@ -783,6 +788,9 @@ def get_indexer(self, target, method=None, limit=None, tolerance=None):
if tolerance is not None:
tolerance = self._convert_tolerance(tolerance)
if target.size != tolerance.size and tolerance.size > 1:


I think I would also pass target to _convert_tolerance so you can move the error checks there.


You would change all the _convert_tolerance signatures to pass target as well.
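A minimal sketch of what moving the size check into _convert_tolerance could look like once target is passed in (standalone function, names and error message assumed, not the exact pandas code):

import numpy as np

def convert_tolerance(tolerance, target):
    # Coerce to an ndarray and validate the size up front, so callers of
    # get_indexer no longer need their own isinstance/size checks.
    tolerance = np.asarray(tolerance)
    if tolerance.size > 1 and tolerance.size != target.size:
        raise ValueError('list-like tolerance size must match '
                         'target index size')
    return tolerance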

jreback

offset = frequencies.to_offset(self.freq.rule_code)
if isinstance(offset, offsets.Tick):
nanos = tslib._delta_to_nanoseconds(other)
if isinstance(other, np.ndarray):
nanos = np.vectorize(tslib._delta_to_nanoseconds)(other)


The array version of this function is almost trivial; if you can, add it alongside the other one and call it here (you just need to type the input as ndarray, I think).


Actually, ignore the above; this is OK here.

jreback

[[0, 2, -1], [0, -1, -1],
[-1, 2, 9]]):
# list
actual = idx.get_indexer([0.2, 1.8, 8.5], method='nearest',


could also parametrize on the type of list-like
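A hypothetical sketch of that idea, reusing the same illustrative index and expected result as above while feeding the tolerance through several list-like containers:

import numpy as np
import pandas as pd
import pytest

idx = pd.Index(np.arange(10, dtype=float))

@pytest.mark.parametrize('listtype', [list, tuple, np.array, pd.Series, pd.Index])
def test_tolerance_listlike_types(listtype):
    actual = idx.get_indexer([0.2, 1.8, 8.5], method='nearest',
                             tolerance=listtype([0.5, 0.2, 0.2]))
    np.testing.assert_array_equal(actual, np.array([0, 2, -1], dtype=np.intp))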

jreback

@jreback

@buntwo ok parametrized the tests, ping on green.

@jreback

This is going to fail from some flake8 errors.

git diff master -u "*.py" | flake8 --diff

will show them to you.

@jreback

also rebase on master (you don't need to squash, though it's easier to do so)

@jjhelmus

@jschendel

@buntwo

@jreback

Your rebasing is removing things. Just leave it and restore the code (in a new commit) that was removed, making sure all tests pass.

@buntwo

Apologies for the mistake in rebasing. I used git diff to identify the unintended changes and git apply -R to revert them.

jreback

@@ -83,8 +83,12 @@ def to_timedelta(arg, unit='ns', box=True, errors='raise'):
elif isinstance(arg, ABCIndexClass):
return _convert_listlike(arg, unit=unit, box=box,
errors=errors, name=arg.name)
elif is_list_like(arg) and getattr(arg, 'ndim', 1) == 1:
return _convert_listlike(arg, unit=unit, box=box, errors=errors)
elif is_list_like(arg) and getattr(arg, 'ndim', 1) <= 1:


Can you separate these into two conditions on the main if/else? IOW:

elif is_list_like(arg) and getattr(arg, 'ndim', 1) == 0:
     arg = arg.item()

# this would then be unchanged
elif is_list_like(arg) and getattr(arg, 'ndim', 1) == 1:
     return _convert.....

@buntwo

@jreback

@buntwo

@buntwo buntwo deleted the ndarray_tolerance branch

October 14, 2017 21:59

ghost pushed a commit to reef-technologies/pandas that referenced this pull request

Oct 16, 2017

@buntwo

ghost pushed a commit to reef-technologies/pandas that referenced this pull request

Oct 16, 2017

alanbato pushed a commit to alanbato/pandas that referenced this pull request

Nov 10, 2017

@buntwo @alanbato

No-Stream pushed a commit to No-Stream/pandas that referenced this pull request

Nov 28, 2017

@buntwo @No-Stream