Fix IntervalDtype Bugs and Inconsistencies by jschendel · Pull Request #18997 · pandas-dev/pandas

is there utility in having an IntervalIndex that works on objects / categorical types generally?

I've been wondering the same thing, and have been going back and forth. It seems like we'd need to put in a fair number of guardrails for dtypes that aren't continuous (or continuous enough), where things like mid and length aren't guaranteed to be defined. I've been asked in other PRs to handle such edge cases, e.g. #18805 (comment). I'm not sure I'd classify that as demand so much as reviewers just trying to ensure that all corners are covered.

As a source of comparison, postgres's built-in range types only cover numeric and date/timestamp subtypes by default. However, postgres also makes it fairly easy to extend this to other types, e.g.

CREATE TYPE inetrange AS RANGE ( SUBTYPE = inet );

would provide support for intervals of IP addresses. I don't think there'd be such easy extensions in our case if we only provided numeric support?

In terms of use cases, everything that I'd use intervals for would be covered by numeric. I can imagine some scenarios where non-numeric could be useful though. For example, intervals with string endpoints essentially allow for prefix-level searches:

In [2]: iv = pd.Interval('ab', 'bas')

In [3]: 'abracadabra' in iv
Out[3]: True

In [4]: 'bar' in iv
Out[4]: True

In [5]: 'bass' in iv
Out[5]: False

Some quick searching turns up a postgres extension that appears to do this. The real-world use case mentioned there is telephony applications. I could maybe see similar prefix stuff being done in a biology/bioinformatics context, but I really only have passing knowledge of that area, so I could be wrong.
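To make the prefix idea a bit more concrete, here's a rough sketch in plain Python (not using pd.Interval at all). It assumes strings are compared by code point, which is what Python does by default, and that the last character of the prefix isn't the maximum code point. The helper names `prefix_bounds` and `with_prefix` are made up for illustration. The point is just that "starts with prefix" is the same as membership in a left-closed string interval:

```python
from bisect import bisect_left


def prefix_bounds(prefix):
    """Return (left, right) such that s.startswith(prefix) iff left <= s < right.

    Assumes a non-empty prefix whose last character is not the maximum
    code point, and plain code-point ordering of strings.
    """
    if not prefix:
        raise ValueError("prefix must be non-empty")
    # Bumping the last character gives the smallest string that sorts
    # after every string starting with `prefix`.
    right = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, right


def with_prefix(sorted_words, prefix):
    """Slice a sorted list down to the words falling in [left, right)."""
    left, right = prefix_bounds(prefix)
    lo = bisect_left(sorted_words, left)
    hi = bisect_left(sorted_words, right)
    return sorted_words[lo:hi]


# Hypothetical data, sorted so that bisect can be used.
words = sorted(["abacus", "abracadabra", "bar", "bass", "bat", "cab"])
matches = with_prefix(words, "ba")
print(matches)
```

With locale-aware or case-insensitive collation the bounds would need to be computed differently, so this is only meant to show the shape of the idea.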

I could see applications for ordered categoricals with categories along the lines of low/medium/high (but more categories), where you'd have logical intervals to partition by. Many problems could probably be solved via alternative approaches though, e.g. groupby logic. Not sure if there'd be cases where Interval support would be far and away a superior solution. You could probably even solve such problems without explicit Categorical support by mapping label <--> integer, creating integer intervals, and mapping back as appropriate, though that kind of seems like reinventing the Categorical wheel.
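For what it's worth, the label <--> integer workaround could look something like the sketch below. The labels, the data, and the helper `interval_to_labels` are all hypothetical; the only pandas pieces it leans on are integer intervals from `pd.IntervalIndex.from_breaks` and scalar membership via `in`:

```python
import pandas as pd

# Hypothetical ordered labels; position in the list defines the order.
levels = ["very low", "low", "medium", "high", "very high"]
label_to_code = {label: code for code, label in enumerate(levels)}

# Hypothetical observations, mapped label -> integer code.
observations = pd.Series(["low", "very high", "medium", "very low", "high", "low"])
codes = observations.map(label_to_code)

# Integer intervals (-1, 1], (1, 3], (3, 4] partition the codes into
# {very low, low}, {medium, high}, {very high}.
bins = pd.IntervalIndex.from_breaks([-1, 1, 3, 4])


def interval_to_labels(interval):
    """Map an integer interval back to the labels whose codes it contains."""
    return tuple(label for label, code in label_to_code.items() if code in interval)


# Partition the observations by interval, keyed by the labels each one covers.
groups = {}
for interval in bins:
    mask = codes.map(lambda code: code in interval)
    groups[interval_to_labels(interval)] = observations[mask].tolist()

print(groups)
```

It works, but all of the bookkeeping in `label_to_code` and `interval_to_labels` is exactly what an ordered Categorical already tracks, which is the wheel-reinventing part.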

All that being said, I don't know how prevalent these use cases would actually be. Could very well not be worth the effort to support such operations.