BUG: RangeIndex.get_indexer is incorrect for some decreasing RangeIndex · Issue #28678 · pandas-dev/pandas (original) (raw)
For some decreasing RangeIndex
, the get_indexer
method will indicate that all of it's own values are missing, and find matches for values not included in the index:
In [2]: ri = pd.RangeIndex(10, 0, -3)
In [3]: ri.get_indexer(ri) Out[3]: array([-1, -1, -1, -1])
In [4]: ri.get_indexer(ri - 1) Out[4]: array([ 1, 2, 3, -1])
This will in turn result in methods like Series.reindex
not working properly:
In [5]: s = pd.Series(list('abcd'), index=ri)
In [6]: s Out[6]: 10 a 7 b 4 c 1 d dtype: object
In [7]: s.reindex([10, 9, 7]) Out[7]: 10 NaN 9 b 7 NaN dtype: object
The issue appears to occur specifically for decreasing RangeIndex
that are not in their canonical form. By canonical form, I mean when stop
is the next valid value in the range that's not included, e.g. when you think of a more standard range like range(1, 7, 1)
, 7 is the next valid value that's not present, but when the step is larger than 1 you lose uniqueness of representation with stop
(i.e. range(1, 6, 2) == range(1, 7, 2)
).
Note that the code above works properly for the equivalent RangeIndex
in it's canonical form:
In [8]: ri2 = pd.RangeIndex(start=10, stop=-2, step=-3)
In [9]: ri2.equals(ri) Out[9]: True
In [10]: ri2.get_indexer(ri2) Out[10]: array([0, 1, 2, 3])
In [11]: ri2.get_indexer(ri2 - 1) Out[11]: array([-1, -1, -1, -1])
In [12]: s2 = pd.Series(list('abcd'), index=ri2)
In [13]: s2 Out[13]: 10 a 7 b 4 c 1 d dtype: object
In [14]: s2.reindex([10, 9, 7]) Out[14]: 10 a 9 NaN 7 b dtype: object
The cause of the issue appears to be that the code to determine start
, stop
, step
when dealing with decreasing RangeIndex
in get_indexer
assumes self.stop
is the canonical form:
if self.step > 0: |
---|
start, stop, step = self.start, self.stop, self.step |
else: |
# Work on reversed range for simplicity: |
start, stop, step = (self.stop - self.step, self.start + 1, -self.step) |
Instead of directly computing the reversed values ourselves, I think we should simply take the values from the reversed underlying range
object:
diff --git a/pandas/core/indexes/range.py b/pandas/core/indexes/range.py index 8783351cc..4c5904e5a 100644 --- a/pandas/core/indexes/range.py +++ b/pandas/core/indexes/range.py @@ -387,7 +387,8 @@ class RangeIndex(Int64Index): start, stop, step = self.start, self.stop, self.step else: # Work on reversed range for simplicity:
start, stop, step = (self.stop - self.step, self.start + 1, -self.step)
reverse = self._range[::-1]
start, stop, step = reverse.start, reverse.stop, reverse.step target_array = np.asarray(target) if not (is_integer_dtype(target_array) and target_array.ndim == 1):