API: change IntervalIndex.contains to work elementwise by jorisvandenbossche · Pull Request #17753 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation33 Commits10 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

xref #16316

This is more a proof-of-concept, to see if there is agreement on the behaviour change (I suppose the implementation itself can be easily vectorized based on the left/right attributes).

cc @zfrenchee @shoyer @buyology @jreback

Any comments on this one?
(I still need to fix implementation and tests, but would like to check first whether there is some agreement on this?)

return True
except KeyError:
return False
return np.array([key in interval for interval in self], dtype='bool')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just do
return self.get_indexer(key) != -1

non-withstanding my comment on the impl. I actually don't think this is a good idea. The reason for .contains in the first place is to allow us to control what __contains__ does (and for II have it mean something differnt), so the default of scalar in index is preserved.

But you are now returning a boolean array, which is contrary to a single boolean result. I agree we need a method to return a boolean for each element, but .contains shouldn't be how its spelled.

jreback added Indexing

Related to indexing on series/frames, not to indexes themselves

Interval

Interval data type

labels

Oct 10, 2017

If .contains() is different from __contains__, why does it need to return a scalar in the same way? What's the use case for returning a scalar? We could have different method name of course (I think we considered .within() earlier), but I think we need something like this.

…

On Tue, Oct 10, 2017 at 3:44 AM Jeff Reback ***@***.***> wrote: non-withstanding my comment on the impl. I actually don't think this is a good idea. The reason for .contains in the first place is to allow us to control what __contains__ does (and for II have it mean something differnt), so the default of scalar in index is preserved. But you are now returning a boolean array, which is contrary to a single boolean result. I agree we need a method to return a boolean *for each element*, but .contains shouldn't be it. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#17753 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABKS1kBcmKw9cJ_zOjp5xpeblqScaD4tks5squgGgaJpZM4PrQMj> .

Also, if you want the return value of the current contains, you can always do idx.contains(key).any() with the proposed new behaviour to get the same (although then it might not be as efficient as before).

We can indeed discuss what would be the best name for this method (the one that returns boolean array), and it might not necessarily be 'contains'. But if we decide for 'contains', the fact that it is different from __contains__ should not hold us back IMO (the user interface is quite different: key in index vs index.contains(key), and that can be easily documented).

But if we decide for 'contains', the fact that it is different from __contains__ should not hold us back IMO (the user interface is quite different: key in index vs index.contains(key), and that can be easily documented).

Agreed with @jorisvandenbossche here.

It seems like this would be inconsistent with every other type of index, as index.contains(key) is not elementwise for them (albeit for most elementwise wouldn't make sense). Even PeriodIndex, which seems the most similar to IntervalIndex, doesn't do elementwise.

I recall there being plans to add an interval accessor. Perhaps that would be a more consistent place for an elementwise contains, something like index.iv.contains(key)?

It seems like this would be inconsistent with every other type of index, as index.contains(key) is not elementwise for them

I am not aware of a general existing contains method for Index ? (so I don't see how this can be inconsistent)

It is true that DatetimeIndex/PeriodIndex have one, but this was added together with the introduction of IntervalIndex, and is thus rather coupled. Indeed, when changing it for IntervalIndex, we need to change it for DatetimeIndex/PeriodIndex as well (or for DatetimeIndex, we could also remove it, as this has no notion of intervals).

I recall there being plans to add an interval accessor

We have no accessors for any other index, so that would also be inconsistent. Such an accessor would be to access the special methods when the data are in a Series (but I may misremember the previous discussion, do you know where this took place?)

Perhaps that would be a more consistent place for an elementwise contains, something like index.iv.contains(key)?

IMO it would be unfortunate that index.iv.contains and index.contains did something completely different.

In general: note that the proposed behaviour is consistent with how 'contains' works on an individual Interval. So in that sense the fact that a IntervalIndex.contains method works element-wise and does what it means for the individual elements, feels very natural and is similar to the many datetime-related methods/attributes both on Timestamp as on DatetimeIndex.

I am not aware of a general existing contains method for Index ?

There is a contains method for Index:

In [2]: idx = pd.Index(['foo', 'bar', 'baz'])

In [3]: idx.contains('foo') Out[3]: True

In [4]: idx.contains('a') Out[4]: False

IMO it would be unfortunate that index.iv.contains and index.contains did something completely different.

Note that this already happens with the string accessor:

In [5]: idx.str.contains('a') Out[5]: array([False, True, True], dtype=bool)

I do see your point though, and am not trying to imply that this would be fine simply because there is precedence for it. Just pointing out that it's not an unbroken rule.

We have no accessors for any other index, so that would also be inconsistent. Such an accessor would be to access the special methods when the data are in a Series (but I may misremember the previous discussion, do you know where this took place?)

I was thinking of #16401. Seems like the proposal was just for columns, so I guess I was mistaken in thinking that it would also work on an IntervalIndex, similar to how the string accessor works on both columns and indexes?

I probably wasn't clear in my original post, but I'm not against this change (nor do I feel like I've been around long enough to cast +1/-1 votes on stuff like this anyways). I was actually quite surprised when I first used IntervalIndex.contains and it didn't return elementwise. Just wanted to make sure there wasn't a more appropriate place, and that the consistency ramifications were fully understood.

Contributor

jreback commented

Oct 14, 2017

•

edited by jorisvandenbossche

The entire point of introducing .contains was to avoid conflating a very well-defined behavior for __contains__ (generally); this is only defined diffferently for IntervalIndex. So not really against changing the behavior of it. It by-definition should be able to act differently on II, that's the point.

But I think the intent was still a scalar return value, and not an array. The idea was to allow value in II to work.

maybe if we had/have .contain and .contains this could be more clear (e.g. former is a scalar, latter is an array). Not sure this is the best spelling though.

There is a contains method for Index:

Whoops it seemed I really looked bad, I thought I was sure to have checked it :-) @jschendel Thanks for pointing this out. That is a bit unfortunate for the changes I am proposing here.

similar to how the string accessor works on both columns and indexes

Ah, yes, string accessor is indeed available on Index. I was thinking about the dt and cat accessors. But you are right, neither precedent necessarily needs to determine what we think best for IntervalIndex.

The entire point of introducing .contains was to avoid conflating a very well-defined behavior for __contains__ (generally);

I don't fully understand this point, as Index.contains is (more or less) an exact alias for __contains__ for most index types. Apart for IntervalIndex, where it does something slightly different (not a pure __contains__), and this difference in behaviour of .contains between all indices and IntervalIndex is also confusing IMO.
I somehow like the fact that you can also do a 'contains' operation with a method instead of key in index, but it adds a duplicate method which is not available for something else (as the behaviour proposed here)

I don't agree with closing this :-), but it is true the discussion died.

Let me summarize my reasoning for this PR:

Currenlty Index.contains(key) does the same as Index.__contains__(key) / key in index. This is therefore duplicating functionality that is IMO not needed (in other PRs we are actively removing methods).
For all index types .contains() currenlty does an "exact search" of the label, IntervalIndex does not (also checks if passed values is contained in any of the individual values of the index), making this method inconsistent between index types.
The proposed IntervalIndex.contains is an element-wise version of the 'contains' operation on an individual Interval object (curently val in Interval(), but we could also add a Interval.contains() method).
I don't know of other interval-like implementations that have something like this that we could use as reference, but in spatial topology this predicate is called "contains" (https://en.wikipedia.org/wiki/DE-9IM, although the rules for the edges are a bit unintuitive, and you could also say the equivalent is "covers").

So my proposal is:

Remove (deprecate) .contains() for all other Index types
Only keep it for IntervalIndex and change it here to work element-wise

And also if we do not agree on the above, I think we need the functionality of this PR, so then we need to come up with another name than .contains() (and then "covers" might be reasonable one, although I find "contains" more logical).

ok with implementing this. pls rebase / update.

@jorisvandenbossche : Any chance this gets implemented relatively soon? I'm looking for an IntervalIndex method to test with #19502, but there aren't any existing methods that fit the bill, and this looks like it should work for that purpose. I'd be willing to take over the implementation if you have other priorities that take precedence over this.

@jschendel can you review and rebase if you think good?

@jreback : Sure, but will wait until IntervalArray is done so as to not add additional conflicts there.

closing as stale. if you'd like to continue pls ping.

IntervalArray is done in the meantime, so we can rebase this now

Hello @jorisvandenbossche! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-30 16:37:27 UTC

@jschendel If you want to take over, feel free. Otherwise I will take this up next week.

closing as stale. if someone would like to revive, pls ping.

I think it would be nice to include this along the other IntervalIndexing changes in #27100

(still need to add whatsnew docs + tests for deprecated methods)

@@ -615,6 +615,7 @@ Other deprecations
- :attr:`Series.imag` and :attr:`Series.real` are deprecated. (:issue:`18262`)
- :meth:`Series.put` is deprecated. (:issue:`18262`)
- :meth:`Index.item` and :meth:`Series.item` is deprecated. (:issue:`18262`)
- :meth:`Index.contains` is deprecated. Use ``key in index`` (``__contains__``) instead (:issue:`17753`).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is fine, but I would add a sub-section on the usage for IntervalIndex itself

Will add one once the other interval PR is merged Op ma 1 jul. 2019 12:28 schreef Jeff Reback notifications@github.com:

…

***@***.**** requested changes on this pull request. ------------------------------ In doc/source/whatsnew/v0.25.0.rst <#17753 (comment)>: > @@ -615,6 +615,7 @@ Other deprecations - :attr:`Series.imag` and :attr:`Series.real` are deprecated. (:issue:`18262`) - :meth:`Series.put` is deprecated. (:issue:`18262`) - :meth:`Index.item` and :meth:`Series.item` is deprecated. (:issue:`18262`) +- :meth:`Index.contains` is deprecated. Use ``key in index`` (``__contains__``) instead (:issue:`17753`). so this is fine, but I would add a sub-section on the usage for IntervalIndex itself — You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub <#17753?email_source=notifications&email_token=AAHZEUGJNGZRKLLGKIU5ALLP5I5DBA5CNFSM4D5NAMR2YY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOB5EWQRI#pullrequestreview-256469061>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAHZEUFHONUV22VC7GZVUDDP5I5DBANCNFSM4D5NAMRQ> .

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks!

def contains(self, other):
if isinstance(other, Interval):
raise NotImplementedError(
'contains not implemented for two intervals'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should be able to implement this in the next day or two if we want to get it in for 0.25.0; it should be relatively straight-forward and we have good testing infrastructure from overlaps that could be partially reused.

Labels

Deprecate

Functionality to remove in pandas

Indexing

Related to indexing on series/frames, not to indexes themselves

Interval

Interval data type