API: change IntervalIndex.contains to work elementwise by jorisvandenbossche · Pull Request #17753 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation33 Commits10 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

jorisvandenbossche

xref #16316

This is more a proof-of-concept, to see if there is agreement on the behaviour change (I suppose the implementation itself can be easily vectorized based on the left/right attributes).

cc @zfrenchee @shoyer @buyology @jreback

@jorisvandenbossche

@jorisvandenbossche

Any comments on this one?
(I still need to fix implementation and tests, but would like to check first whether there is some agreement on this?)

jreback

return True
except KeyError:
return False
return np.array([key in interval for interval in self], dtype='bool')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just do
return self.get_indexer(key) != -1

@jreback

non-withstanding my comment on the impl. I actually don't think this is a good idea. The reason for .contains in the first place is to allow us to control what __contains__ does (and for II have it mean something differnt), so the default of scalar in index is preserved.

But you are now returning a boolean array, which is contrary to a single boolean result. I agree we need a method to return a boolean for each element, but .contains shouldn't be how its spelled.

@jreback jreback added Indexing

Related to indexing on series/frames, not to indexes themselves

Interval

Interval data type

labels

Oct 10, 2017

@shoyer

If .contains() is different from __contains__, why does it need to return a scalar in the same way? What's the use case for returning a scalar? We could have different method name of course (I think we considered .within() earlier), but I think we need something like this.

On Tue, Oct 10, 2017 at 3:44 AM Jeff Reback ***@***.***> wrote: non-withstanding my comment on the impl. I actually don't think this is a good idea. The reason for .contains in the first place is to allow us to control what __contains__ does (and for II have it mean something differnt), so the default of scalar in index is preserved. But you are now returning a boolean array, which is contrary to a single boolean result. I agree we need a method to return a boolean *for each element*, but .contains shouldn't be it. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#17753 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABKS1kBcmKw9cJ_zOjp5xpeblqScaD4tks5squgGgaJpZM4PrQMj> .

@jorisvandenbossche

Also, if you want the return value of the current contains, you can always do idx.contains(key).any() with the proposed new behaviour to get the same (although then it might not be as efficient as before).

We can indeed discuss what would be the best name for this method (the one that returns boolean array), and it might not necessarily be 'contains'. But if we decide for 'contains', the fact that it is different from __contains__ should not hold us back IMO (the user interface is quite different: key in index vs index.contains(key), and that can be easily documented).

@TomAugspurger

But if we decide for 'contains', the fact that it is different from __contains__ should not hold us back IMO (the user interface is quite different: key in index vs index.contains(key), and that can be easily documented).

Agreed with @jorisvandenbossche here.

@jschendel

It seems like this would be inconsistent with every other type of index, as index.contains(key) is not elementwise for them (albeit for most elementwise wouldn't make sense). Even PeriodIndex, which seems the most similar to IntervalIndex, doesn't do elementwise.

I recall there being plans to add an interval accessor. Perhaps that would be a more consistent place for an elementwise contains, something like index.iv.contains(key)?

@jorisvandenbossche

It seems like this would be inconsistent with every other type of index, as index.contains(key) is not elementwise for them

I am not aware of a general existing contains method for Index ? (so I don't see how this can be inconsistent)

It is true that DatetimeIndex/PeriodIndex have one, but this was added together with the introduction of IntervalIndex, and is thus rather coupled. Indeed, when changing it for IntervalIndex, we need to change it for DatetimeIndex/PeriodIndex as well (or for DatetimeIndex, we could also remove it, as this has no notion of intervals).

I recall there being plans to add an interval accessor

We have no accessors for any other index, so that would also be inconsistent. Such an accessor would be to access the special methods when the data are in a Series (but I may misremember the previous discussion, do you know where this took place?)

Perhaps that would be a more consistent place for an elementwise contains, something like index.iv.contains(key)?

IMO it would be unfortunate that index.iv.contains and index.contains did something completely different.

@jorisvandenbossche

In general: note that the proposed behaviour is consistent with how 'contains' works on an individual Interval. So in that sense the fact that a IntervalIndex.contains method works element-wise and does what it means for the individual elements, feels very natural and is similar to the many datetime-related methods/attributes both on Timestamp as on DatetimeIndex.

@jschendel

I am not aware of a general existing contains method for Index ?

There is a contains method for Index:

In [2]: idx = pd.Index(['foo', 'bar', 'baz'])

In [3]: idx.contains('foo') Out[3]: True

In [4]: idx.contains('a') Out[4]: False

IMO it would be unfortunate that index.iv.contains and index.contains did something completely different.

Note that this already happens with the string accessor:

In [5]: idx.str.contains('a') Out[5]: array([False, True, True], dtype=bool)

I do see your point though, and am not trying to imply that this would be fine simply because there is precedence for it. Just pointing out that it's not an unbroken rule.

We have no accessors for any other index, so that would also be inconsistent. Such an accessor would be to access the special methods when the data are in a Series (but I may misremember the previous discussion, do you know where this took place?)

I was thinking of #16401. Seems like the proposal was just for columns, so I guess I was mistaken in thinking that it would also work on an IntervalIndex, similar to how the string accessor works on both columns and indexes?

I probably wasn't clear in my original post, but I'm not against this change (nor do I feel like I've been around long enough to cast +1/-1 votes on stuff like this anyways). I was actually quite surprised when I first used IntervalIndex.contains and it didn't return elementwise. Just wanted to make sure there wasn't a more appropriate place, and that the consistency ramifications were fully understood.

@jreback

Contributor

jreback commented

Oct 14, 2017

edited by jorisvandenbossche

Loading

The entire point of introducing .contains was to avoid conflating a very well-defined behavior for __contains__ (generally); this is only defined diffferently for IntervalIndex. So not really against changing the behavior of it. It by-definition should be able to act differently on II, that's the point.

But I think the intent was still a scalar return value, and not an array. The idea was to allow value in II to work.

maybe if we had/have .contain and .contains this could be more clear (e.g. former is a scalar, latter is an array). Not sure this is the best spelling though.

@jorisvandenbossche

There is a contains method for Index:

Whoops it seemed I really looked bad, I thought I was sure to have checked it :-) @jschendel Thanks for pointing this out. That is a bit unfortunate for the changes I am proposing here.

similar to how the string accessor works on both columns and indexes

Ah, yes, string accessor is indeed available on Index. I was thinking about the dt and cat accessors. But you are right, neither precedent necessarily needs to determine what we think best for IntervalIndex.

The entire point of introducing .contains was to avoid conflating a very well-defined behavior for __contains__ (generally);

I don't fully understand this point, as Index.contains is (more or less) an exact alias for __contains__ for most index types. Apart for IntervalIndex, where it does something slightly different (not a pure __contains__), and this difference in behaviour of .contains between all indices and IntervalIndex is also confusing IMO.
I somehow like the fact that you can also do a 'contains' operation with a method instead of key in index, but it adds a duplicate method which is not available for something else (as the behaviour proposed here)

@jreback

@jorisvandenbossche

I don't agree with closing this :-), but it is true the discussion died.

Let me summarize my reasoning for this PR:

So my proposal is:

And also if we do not agree on the above, I think we need the functionality of this PR, so then we need to come up with another name than .contains() (and then "covers" might be reasonable one, although I find "contains" more logical).

@jreback

ok with implementing this. pls rebase / update.

@jschendel

@jorisvandenbossche : Any chance this gets implemented relatively soon? I'm looking for an IntervalIndex method to test with #19502, but there aren't any existing methods that fit the bill, and this looks like it should work for that purpose. I'd be willing to take over the implementation if you have other priorities that take precedence over this.

@jreback

@jschendel can you review and rebase if you think good?

@jschendel

@jreback : Sure, but will wait until IntervalArray is done so as to not add additional conflicts there.

@jreback

closing as stale. if you'd like to continue pls ping.

@jorisvandenbossche

IntervalArray is done in the meantime, so we can rebase this now

@pep8speaks

Hello @jorisvandenbossche! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-30 16:37:27 UTC

@jorisvandenbossche

@jschendel If you want to take over, feel free. Otherwise I will take this up next week.

@jreback

closing as stale. if someone would like to revive, pls ping.

@jorisvandenbossche

I think it would be nice to include this along the other IntervalIndexing changes in #27100

(still need to add whatsnew docs + tests for deprecated methods)

jschendel

jschendel

@jorisvandenbossche

jreback

@@ -615,6 +615,7 @@ Other deprecations
- :attr:`Series.imag` and :attr:`Series.real` are deprecated. (:issue:`18262`)
- :meth:`Series.put` is deprecated. (:issue:`18262`)
- :meth:`Index.item` and :meth:`Series.item` is deprecated. (:issue:`18262`)
- :meth:`Index.contains` is deprecated. Use ``key in index`` (``__contains__``) instead (:issue:`17753`).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is fine, but I would add a sub-section on the usage for IntervalIndex itself

@jorisvandenbossche

Will add one once the other interval PR is merged Op ma 1 jul. 2019 12:28 schreef Jeff Reback notifications@github.com:

***@***.**** requested changes on this pull request. ------------------------------ In doc/source/whatsnew/v0.25.0.rst <#17753 (comment)>: > @@ -615,6 +615,7 @@ Other deprecations - :attr:`Series.imag` and :attr:`Series.real` are deprecated. (:issue:`18262`) - :meth:`Series.put` is deprecated. (:issue:`18262`) - :meth:`Index.item` and :meth:`Series.item` is deprecated. (:issue:`18262`) +- :meth:`Index.contains` is deprecated. Use ``key in index`` (``__contains__``) instead (:issue:`17753`). so this is fine, but I would add a sub-section on the usage for IntervalIndex itself — You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub <#17753?email_source=notifications&email_token=AAHZEUGJNGZRKLLGKIU5ALLP5I5DBA5CNFSM4D5NAMR2YY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOB5EWQRI#pullrequestreview-256469061>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAHZEUFHONUV22VC7GZVUDDP5I5DBANCNFSM4D5NAMRQ> .

jschendel

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks!

def contains(self, other):
if isinstance(other, Interval):
raise NotImplementedError(
'contains not implemented for two intervals'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should be able to implement this in the next day or two if we want to get it in for 0.25.0; it should be relatively straight-forward and we have good testing infrastructure from overlaps that could be partially reused.

jreback

@jreback

Labels

Deprecate

Functionality to remove in pandas

Indexing

Related to indexing on series/frames, not to indexes themselves

Interval

Interval data type