[PERF] Get rid of MultiIndex conversion in IntervalIndex.intersection by makbigc · Pull Request #26225 · pandas-dev/pandas (original) (raw)

A few things after poking around at this a bit.

It looks like this isn't behaving as expected for nested intervals:

In [1]: import pandas as pd; pd.version Out[1]: '0.25.0.dev0+472.ge9ddf02b5'

In [2]: ii1 = pd.IntervalIndex.from_tuples([(1, 2), (1, 3), (1, 4)])

In [3]: ii2 = pd.IntervalIndex.from_tuples([(1, 2), (1, 3)])

In [4]: ii1.intersection(ii2) Out[4]: IntervalIndex([], closed='right', dtype='interval[int64]')

Haven't looked into this enough to know exactly where it's failing.

It also looks like this fixes an issue that was present in the previous implementation of intersection (probably a bug with MultiIndex.intersection), specifically in relation to how duplicates are handled:

In [5]: ii1 = pd.IntervalIndex.from_tuples([(1, 2), (1, 2), (2, 3), (3, 4)])

In [6]: ii2 = pd.IntervalIndex.from_tuples([(1, 2), (2, 3)])

In [7]: ii1.intersection(ii2) Out[7]: IntervalIndex([(1, 2], (1, 2], (2, 3]], closed='right', dtype='interval[int64]')

On master this only returns one instance of (1, 2] but returning both dupes is consistent with other index types, so I think the new behavior is correct:

In [8]: idx1 = pd.Index(list('aabc'))

In [9]: idx2 = pd.Index(list('ab'))

In [10]: idx1.intersection(idx2) Out[10]: Index(['a', 'a', 'b'], dtype='object')

It'd be nice to include both the failing and newly passing examples above as test cases. I wouldn't be opposed to creating a test_setops.py file and moving all the setops tests there, as a similar file exists for other index types.