API: change IntervalIndex.contains to work elementwise by jorisvandenbossche · Pull Request #17753 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation33 Commits10 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
xref #16316
This is more a proof-of-concept, to see if there is agreement on the behaviour change (I suppose the implementation itself can be easily vectorized based on the left/right attributes).
cc @zfrenchee @shoyer @buyology @jreback
Any comments on this one?
(I still need to fix implementation and tests, but would like to check first whether there is some agreement on this?)
return True |
---|
except KeyError: |
return False |
return np.array([key in interval for interval in self], dtype='bool') |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can just doreturn self.get_indexer(key) != -1
non-withstanding my comment on the impl. I actually don't think this is a good idea. The reason for .contains
in the first place is to allow us to control what __contains__
does (and for II have it mean something differnt), so the default of scalar in index
is preserved.
But you are now returning a boolean array, which is contrary to a single boolean result. I agree we need a method to return a boolean for each element, but .contains
shouldn't be how its spelled.
Related to indexing on series/frames, not to indexes themselves
Interval data type
labels
If .contains() is different from __contains__, why does it need to return a scalar in the same way? What's the use case for returning a scalar? We could have different method name of course (I think we considered .within() earlier), but I think we need something like this.
On Tue, Oct 10, 2017 at 3:44 AM Jeff Reback ***@***.***> wrote: non-withstanding my comment on the impl. I actually don't think this is a good idea. The reason for .contains in the first place is to allow us to control what __contains__ does (and for II have it mean something differnt), so the default of scalar in index is preserved. But you are now returning a boolean array, which is contrary to a single boolean result. I agree we need a method to return a boolean *for each element*, but .contains shouldn't be it. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#17753 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABKS1kBcmKw9cJ_zOjp5xpeblqScaD4tks5squgGgaJpZM4PrQMj> .
Also, if you want the return value of the current contains, you can always do idx.contains(key).any()
with the proposed new behaviour to get the same (although then it might not be as efficient as before).
We can indeed discuss what would be the best name for this method (the one that returns boolean array), and it might not necessarily be 'contains'. But if we decide for 'contains', the fact that it is different from __contains__
should not hold us back IMO (the user interface is quite different: key in index
vs index.contains(key)
, and that can be easily documented).
But if we decide for 'contains', the fact that it is different from
__contains__
should not hold us back IMO (the user interface is quite different:key in index
vsindex.contains(key)
, and that can be easily documented).
Agreed with @jorisvandenbossche here.
It seems like this would be inconsistent with every other type of index, as index.contains(key)
is not elementwise for them (albeit for most elementwise wouldn't make sense). Even PeriodIndex
, which seems the most similar to IntervalIndex
, doesn't do elementwise.
I recall there being plans to add an interval accessor. Perhaps that would be a more consistent place for an elementwise contains
, something like index.iv.contains(key)
?
It seems like this would be inconsistent with every other type of index, as index.contains(key) is not elementwise for them
I am not aware of a general existing contains
method for Index ? (so I don't see how this can be inconsistent)
It is true that DatetimeIndex/PeriodIndex have one, but this was added together with the introduction of IntervalIndex, and is thus rather coupled. Indeed, when changing it for IntervalIndex, we need to change it for DatetimeIndex/PeriodIndex as well (or for DatetimeIndex, we could also remove it, as this has no notion of intervals).
I recall there being plans to add an interval accessor
We have no accessors for any other index, so that would also be inconsistent. Such an accessor would be to access the special methods when the data are in a Series (but I may misremember the previous discussion, do you know where this took place?)
Perhaps that would be a more consistent place for an elementwise contains, something like index.iv.contains(key)?
IMO it would be unfortunate that index.iv.contains
and index.contains
did something completely different.
In general: note that the proposed behaviour is consistent with how 'contains' works on an individual Interval
. So in that sense the fact that a IntervalIndex.contains
method works element-wise and does what it means for the individual elements, feels very natural and is similar to the many datetime-related methods/attributes both on Timestamp as on DatetimeIndex.
I am not aware of a general existing contains method for Index ?
There is a contains method for Index
:
In [2]: idx = pd.Index(['foo', 'bar', 'baz'])
In [3]: idx.contains('foo') Out[3]: True
In [4]: idx.contains('a') Out[4]: False
IMO it would be unfortunate that index.iv.contains and index.contains did something completely different.
Note that this already happens with the string accessor:
In [5]: idx.str.contains('a') Out[5]: array([False, True, True], dtype=bool)
I do see your point though, and am not trying to imply that this would be fine simply because there is precedence for it. Just pointing out that it's not an unbroken rule.
We have no accessors for any other index, so that would also be inconsistent. Such an accessor would be to access the special methods when the data are in a Series (but I may misremember the previous discussion, do you know where this took place?)
I was thinking of #16401. Seems like the proposal was just for columns, so I guess I was mistaken in thinking that it would also work on an IntervalIndex
, similar to how the string accessor works on both columns and indexes?
I probably wasn't clear in my original post, but I'm not against this change (nor do I feel like I've been around long enough to cast +1/-1 votes on stuff like this anyways). I was actually quite surprised when I first used IntervalIndex.contains
and it didn't return elementwise. Just wanted to make sure there wasn't a more appropriate place, and that the consistency ramifications were fully understood.
Contributor
jreback commented
•
edited by jorisvandenbossche
Loading
The entire point of introducing .contains
was to avoid conflating a very well-defined behavior for __contains__
(generally); this is only defined diffferently for IntervalIndex
. So not really against changing the behavior of it. It by-definition should be able to act differently on II, that's the point.
But I think the intent was still a scalar return value, and not an array. The idea was to allow value in II to work.
maybe if we had/have .contain
and .contains
this could be more clear (e.g. former is a scalar, latter is an array). Not sure this is the best spelling though.
There is a contains method for Index:
Whoops it seemed I really looked bad, I thought I was sure to have checked it :-) @jschendel Thanks for pointing this out. That is a bit unfortunate for the changes I am proposing here.
similar to how the string accessor works on both columns and indexes
Ah, yes, string accessor is indeed available on Index. I was thinking about the dt
and cat
accessors. But you are right, neither precedent necessarily needs to determine what we think best for IntervalIndex.
The entire point of introducing .contains was to avoid conflating a very well-defined behavior for
__contains__
(generally);
I don't fully understand this point, as Index.contains is (more or less) an exact alias for __contains__
for most index types. Apart for IntervalIndex, where it does something slightly different (not a pure __contains__
), and this difference in behaviour of .contains
between all indices and IntervalIndex is also confusing IMO.
I somehow like the fact that you can also do a 'contains' operation with a method instead of key in index
, but it adds a duplicate method which is not available for something else (as the behaviour proposed here)
I don't agree with closing this :-), but it is true the discussion died.
Let me summarize my reasoning for this PR:
- Currenlty
Index.contains(key)
does the same asIndex.__contains__(key)
/key in index
. This is therefore duplicating functionality that is IMO not needed (in other PRs we are actively removing methods). - For all index types
.contains()
currenlty does an "exact search" of the label,IntervalIndex
does not (also checks if passed values is contained in any of the individual values of the index), making this method inconsistent between index types. - The proposed
IntervalIndex.contains
is an element-wise version of the 'contains' operation on an individualInterval
object (curentlyval in Interval()
, but we could also add aInterval.contains()
method). - I don't know of other interval-like implementations that have something like this that we could use as reference, but in spatial topology this predicate is called "contains" (https://en.wikipedia.org/wiki/DE-9IM, although the rules for the edges are a bit unintuitive, and you could also say the equivalent is "covers").
So my proposal is:
- Remove (deprecate)
.contains()
for all other Index types - Only keep it for
IntervalIndex
and change it here to work element-wise
And also if we do not agree on the above, I think we need the functionality of this PR, so then we need to come up with another name than .contains()
(and then "covers" might be reasonable one, although I find "contains" more logical).
ok with implementing this. pls rebase / update.
@jorisvandenbossche : Any chance this gets implemented relatively soon? I'm looking for an IntervalIndex
method to test with #19502, but there aren't any existing methods that fit the bill, and this looks like it should work for that purpose. I'd be willing to take over the implementation if you have other priorities that take precedence over this.
@jschendel can you review and rebase if you think good?
@jreback : Sure, but will wait until IntervalArray
is done so as to not add additional conflicts there.
closing as stale. if you'd like to continue pls ping.
IntervalArray is done in the meantime, so we can rebase this now
Hello @jorisvandenbossche! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻
Comment last updated at 2019-06-30 16:37:27 UTC
@jschendel If you want to take over, feel free. Otherwise I will take this up next week.
closing as stale. if someone would like to revive, pls ping.
I think it would be nice to include this along the other IntervalIndexing changes in #27100
(still need to add whatsnew docs + tests for deprecated methods)
@@ -615,6 +615,7 @@ Other deprecations |
---|
- :attr:`Series.imag` and :attr:`Series.real` are deprecated. (:issue:`18262`) |
- :meth:`Series.put` is deprecated. (:issue:`18262`) |
- :meth:`Index.item` and :meth:`Series.item` is deprecated. (:issue:`18262`) |
- :meth:`Index.contains` is deprecated. Use ``key in index`` (``__contains__``) instead (:issue:`17753`). |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this is fine, but I would add a sub-section on the usage for IntervalIndex itself
Will add one once the other interval PR is merged Op ma 1 jul. 2019 12:28 schreef Jeff Reback notifications@github.com:
***@***.**** requested changes on this pull request. ------------------------------ In doc/source/whatsnew/v0.25.0.rst <#17753 (comment)>: > @@ -615,6 +615,7 @@ Other deprecations - :attr:`Series.imag` and :attr:`Series.real` are deprecated. (:issue:`18262`) - :meth:`Series.put` is deprecated. (:issue:`18262`) - :meth:`Index.item` and :meth:`Series.item` is deprecated. (:issue:`18262`) +- :meth:`Index.contains` is deprecated. Use ``key in index`` (``__contains__``) instead (:issue:`17753`). so this is fine, but I would add a sub-section on the usage for IntervalIndex itself — You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub <#17753?email_source=notifications&email_token=AAHZEUGJNGZRKLLGKIU5ALLP5I5DBA5CNFSM4D5NAMR2YY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOB5EWQRI#pullrequestreview-256469061>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAHZEUFHONUV22VC7GZVUDDP5I5DBANCNFSM4D5NAMRQ> .
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, thanks!
def contains(self, other): |
---|
if isinstance(other, Interval): |
raise NotImplementedError( |
'contains not implemented for two intervals' |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should be able to implement this in the next day or two if we want to get it in for 0.25.0; it should be relatively straight-forward and we have good testing infrastructure from overlaps
that could be partially reused.
Labels
Functionality to remove in pandas
Related to indexing on series/frames, not to indexes themselves
Interval data type