ENH: tolerance now takes list-like argument for reindex and get_indexer. by buntwo · Pull Request #17367 · pandas-dev/pandas
Conversation
Enable use of list-like values for the tolerance argument in DataFrame.reindex(), Series.reindex(), and Index.get_indexer().
- tests added / passed
- passes git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry
Hello @buntwo! Thanks for updating the PR.
Cheers! There are no PEP8 issues in this Pull Request. 🍻
Comment last updated on October 14, 2017 at 19:10 UTC
does this have an associated issue?
can u show an example use case
Use case: suppose you have a time series of values for a stock, with some days of missing data in between, and you want to select a date range that includes the missing data. You want some backward (forward, nearest) fill-within-N-days logic when doing so, but you want N to depend on some other property of the stock that varies over time. Currently pandas only supports a constant N. This feature enables that use case by letting the user construct a list-like object of the same size that contains the tolerance per point and pass that in.
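A minimal sketch of what this enables (the data, dates, and per-label tolerances below are made up for illustration):

```python
import pandas as pd

# Hypothetical daily prices with a gap on Jan 4-5.
prices = pd.Series(
    [100.0, 101.0, 103.0],
    index=pd.to_datetime(["2017-01-02", "2017-01-03", "2017-01-06"]),
)
target = pd.date_range("2017-01-02", "2017-01-06")  # 5 calendar days

# A scalar tolerance applies the same limit at every new label ...
prices.reindex(target, method="nearest", tolerance=pd.Timedelta("1D"))

# ... while a list-like tolerance (same length as the target index) lets the
# allowed distance vary per label, e.g. a wider window around the gap.
tol = [pd.Timedelta(d) for d in ("1D", "1D", "2D", "2D", "1D")]
prices.reindex(target, method="nearest", tolerance=tol)
```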
@buntwo : It seems like you want the tolerance to be time-varying. I suppose that could be useful, but I haven't really seen a use case like this.
I would suggest you open an issue for this and point your PR to it so we can discuss.
In the future, for enhancements like this, it would be preferable to open an issue first so that we can discuss there rather than in the PR.
comments in-line. you are doing quite a bit of extra checking, I think you can pare this down substantially. you may need to adjust your tests slightly as well.
@@ -130,8 +130,7 @@ Other Enhancements
- `read_*` methods can now infer compression from non-string paths, such as ``pathlib.Path`` objects (:issue:`17206`).
- :func:`pd.read_sas()` now recognizes much more of the most frequently used date (datetime) formats in SAS7BDAT files (:issue:`15871`).
- :func:`DataFrame.items` and :func:`Series.items` is now present in both Python 2 and 3 and is lazy in all cases (:issue:`13918`, :issue:`17213`)
- :func:`Series.reindex`, :func:`DataFrame.reindex`, :func:`Index.get_indexer` now support list-like argument for `tolerance`.
double back-ticks on tolerance
@@ -2658,8 +2659,14 @@ def sort_index(self, axis=0, level=None, ascending=True, inplace=False,
Maximum distance between original and new labels for inexact
matches. The values of the index at the matching locations most
satisfy the equation ``abs(index[indexer] - target) <= tolerance``.
Tolerance may be a scalar value, which applies the same tolerance
add a blank line where you have new text (I think it will render ok)
@@ -2881,8 +2888,14 @@ def _reindex_multi(self, axes, copy, fill_value):
Maximum distance between original and new labels for inexact
matches. The values of the index at the matching locations most
satisfy the equation ``abs(index[indexer] - target) <= tolerance``.
Tolerance may be a scalar value, which applies the same tolerance
same (bonus points if we can use a _shared_docs here to avoid code duplication)
@@ -2444,9 +2443,14 @@ def _get_unique_index(self, dropna=False):
tolerance : optional
Maximum distance from index value for inexact matches. The value of
the index at the matching location most satisfy the equation
``abs(index[loc] - key) <= tolerance``.
``abs(index[loc] - key) <= tolerance``. Tolerance may be a scalar
same
@@ -2566,8 +2570,14 @@ def _get_level_values(self, level):
Maximum distance between original and new labels for inexact
matches. The values of the index at the matching locations most
satisfy the equation ``abs(index[indexer] - target) <= tolerance``.
Tolerance may be a scalar value, which applies the same tolerance
same. actually let's fix the shared_docs things first in another PR (or here is ok too). too much doc-string duplication.
Makes sense. How would I do that though? This (and the 3 others above) is already in a _shared_docs, so this would require another layer of abstraction above _shared_docs. Quite a bit of duplication with the existing tolerance docstrings as well.
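One possible shape for this (just a sketch: the _tolerance_doc name and the template text are illustrative, not the actual pandas _shared_docs entries) — keep the tolerance paragraph in one string and interpolate it into each shared docstring template:

```python
# Illustrative only: share the tolerance paragraph via string interpolation
# rather than repeating it in every docstring template.
_tolerance_doc = """\
tolerance : optional
    Maximum distance between original and new labels for inexact matches.
    Tolerance may be a scalar value, which applies the same tolerance to all
    values, or list-like, which applies variable tolerance per element."""

_shared_docs = {}
_shared_docs['get_indexer'] = """
Compute indexer and mask for new index given the current index.

Parameters
----------
%(tolerance)s
""" % dict(tolerance=_tolerance_doc)
```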
except KeyError:
raise KeyError(key)
except ValueError as e:
# ndarray tolerance size must match target index size
what is this for?
A list-like tolerance size mismatch raises a ValueError, but so does something else, so I differentiate here... there is definitely a better way to do this.
except ValueError:
raise ValueError('tolerance argument for %s must be numeric: %r' %
(type(self).__name__, tolerance))
tolerance = _list_to_ndarray(tolerance)
see above, just use np.asarray(...), no need for the rest of this
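For reference, np.asarray already covers what a separate helper would do (examples are illustrative):

```python
import numpy as np

np.asarray([1, 2, 3])          # list -> ndarray
np.asarray((0.5, 0.2))         # tuple -> ndarray
np.asarray(np.array([1, 2]))   # an existing ndarray passes through without copying
```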
offset = frequencies.to_offset(self.freq.rule_code)
if isinstance(offset, offsets.Tick):
nanos = tslib._delta_to_nanoseconds(other)
offset_nanos = tslib._delta_to_nanoseconds(offset)
if nanos % offset_nanos == 0:
if isinstance(other, np.ndarray):
I think you can just do
check = np.all(nanos % offset_nanos == 0)
which will work for scalars as well
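A quick illustration of that suggestion (values are made up):

```python
import numpy as np

offset_nanos = 1000000000                     # e.g. a 1-second Tick, in nanoseconds
nanos_scalar = 3000000000
nanos_array = np.array([2000000000, 2500000000])

np.all(nanos_scalar % offset_nanos == 0)      # True  -- works on a plain scalar
np.all(nanos_array % offset_nanos == 0)       # False -- 2.5s is not a whole multiple
```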
@@ -775,6 +780,10 @@ def get_indexer(self, target, method=None, limit=None, tolerance=None):
if tolerance is not None:
tolerance = self._convert_tolerance(tolerance)
if isinstance(tolerance, np.ndarray) and \
check should already be in _convert_tolerance (I am not sure I mentioned above)
@@ -902,6 +911,10 @@ def _get_string_slice(self, key):
def _convert_tolerance(self, tolerance):
tolerance = DatetimeIndexOpsMixin._convert_tolerance(self, tolerance)
if isinstance(tolerance, np.ndarray) \
same
The failed test on AppVeyor I think has nothing to do with my change; it didn't fail on the last commit.
@@ -234,6 +234,7 @@ Other Enhancements
- :meth:`DataFrame.assign` will preserve the original order of ``**kwargs`` for Python 3.6+ users instead of sorting the column names. (:issue:`14207`)
- Improved the import time of pandas by about 2.25x. (:issue:`16764`)
- :func:`read_json` and :func:`to_json` now accept a ``compression`` argument which allows them to transparently handle compressed files. (:issue:`17798`)
- :func:`Series.reindex`, :func:`DataFrame.reindex`, :func:`Index.get_indexer` now support list-like argument for ``tolerance``
add the issue number (or PR number if no issue)
elif is_list_like(arg) and getattr(arg, 'ndim', 1) <= 1:
if getattr(arg, 'ndim', 1) == 0:
# extract array scalar and process below
arg = arg[()]
use .item()
actual = idx.get_indexer([0.2, 1.8, 8.5], method=method,
tolerance=[0.5, 0.2, 0.2])
tm.assert_numpy_array_equal(actual, np.array(expected,
dtype=np.intp))
can u use parametrize here to make the tests more readable
can be a separate test
I'm a little unsure of what to parametrize here: in each for loop, I actually just copy the last test (the actual = idx.(... and tm.assert_numpy_array_equal(... lines) using scalar tolerance, but change the tolerance to an array. I could add some comments to make it clearer?
parametrize on method, expected
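Something along these lines, perhaps (a sketch only: the index, targets, and expected values below are illustrative, not copied from the actual test file):

```python
import numpy as np
import pandas as pd
import pandas.util.testing as tm
import pytest


@pytest.mark.parametrize('method, expected', [
    ('nearest', [0, 2, -1]),
    ('pad', [0, -1, -1]),
    ('backfill', [-1, 2, -1]),
])
@pytest.mark.parametrize('listtype', [list, np.array])
def test_get_indexer_listlike_tolerance(method, expected, listtype):
    # index 0.0 .. 9.0; per-target tolerances of 0.5, 0.2, 0.2
    idx = pd.Index(np.arange(10, dtype=np.float64))
    actual = idx.get_indexer([0.2, 1.8, 8.5], method=method,
                             tolerance=listtype([0.5, 0.2, 0.2]))
    tm.assert_numpy_array_equal(actual, np.array(expected, dtype=np.intp))
```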
@@ -783,6 +788,9 @@ def get_indexer(self, target, method=None, limit=None, tolerance=None):
if tolerance is not None:
tolerance = self._convert_tolerance(tolerance)
if target.size != tolerance.size and tolerance.size > 1:
i think you would also pass target to _convert_tolerance so you can move the error checks there
you would change all _convert_tolerance calls to pass target as well
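For example, roughly (a simplified sketch of the suggested refactor, not the exact pandas implementation):

```python
import numpy as np

def _convert_tolerance(self, tolerance, target):
    # Convert once, and do the length check in one shared place so each
    # get_indexer implementation no longer has to repeat it.
    tolerance = np.asarray(tolerance)
    if target.size != tolerance.size and tolerance.size > 1:
        raise ValueError('list-like tolerance size must match '
                         'target index size')
    return tolerance
```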
offset = frequencies.to_offset(self.freq.rule_code)
if isinstance(offset, offsets.Tick):
nanos = tslib._delta_to_nanoseconds(other)
if isinstance(other, np.ndarray):
nanos = np.vectorize(tslib._delta_to_nanoseconds)(other)
the array version of this function is almost trivial
if u can add it alongside the other and call here
(u just need to type the input as ndarray i think)
actually, ignore the above; this is ok here
[[0, 2, -1], [0, -1, -1],
[-1, 2, 9]]):
# list
actual = idx.get_indexer([0.2, 1.8, 8.5], method='nearest',
could also parametrize on the type of list-like
@buntwo ok parametrized the tests, ping on green.
going to fail from some flake errors
git diff master -u "*.py" | flake8 --diff
will show you them
also rebase on master (you don't need to squash, though its easier to do so)
your rebasing is removing things. just leave it and restore the code (in a new commit) that was removed, making sure all tests pass.
Apologies for the mistake in rebasing. I used git diff to cherry pick the unintended changes and git apply -R to revert.
@@ -83,8 +83,12 @@ def to_timedelta(arg, unit='ns', box=True, errors='raise'):
elif isinstance(arg, ABCIndexClass):
return _convert_listlike(arg, unit=unit, box=box,
errors=errors, name=arg.name)
elif is_list_like(arg) and getattr(arg, 'ndim', 1) == 1:
return _convert_listlike(arg, unit=unit, box=box, errors=errors)
elif is_list_like(arg) and getattr(arg, 'ndim', 1) <= 1:
can you separate these into 2 conditions on the main if/else, IOW
elif is_list_like(arg) and getattr(arg, 'ndim', 1) == 0:
    arg = arg.item()
# this would then be unchanged
elif is_list_like(arg) and getattr(arg, 'ndim', 1) == 1:
    return _convert.....
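For reference, a tiny illustration of the 0-d array case being discussed (values made up):

```python
import numpy as np

arg = np.array(5)    # a 0-d array: ndim == 0, and len(arg) would raise TypeError
arg[()]              # extracts the element, but as a NumPy scalar
arg.item()           # extracts it as a plain Python scalar, as suggested above
```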
buntwo deleted the ndarray_tolerance branch