BUG: Joining MultiIndex with IntervalIndex level fails when IntervalIndex level is overlapping

I have checked that this issue has not already been reported. (might be related to issues with overlapping multi index intervals #27456)
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd

idx_1_working = pd.MultiIndex.from_tuples(
    [
        (1, pd.Interval(0.0, 1.0)),
        (1, pd.Interval(1.0, 2.0)),
        (1, pd.Interval(2.0, 5.0)),
        (2, pd.Interval(0.0, 1.0)),
        (2, pd.Interval(1.0, 2.0)),
        (2, pd.Interval(2.0, 5.0)),
    ],
    names=['num', 'interval'],
)

# Same index, but with the row order slightly changed
idx_2_working = pd.MultiIndex.from_tuples(
    [
        (1, pd.Interval(2.0, 5.0)),  # in idx_1_working this is in 3rd row
        (1, pd.Interval(0.0, 1.0)),
        (1, pd.Interval(1.0, 2.0)),
        (2, pd.Interval(2.0, 5.0)),  # in idx_1_working this is in 6th row
        (2, pd.Interval(0.0, 1.0)),
        (2, pd.Interval(1.0, 2.0)),
    ],
    names=['num', 'interval'],
)

print(idx_1_working.join(idx_2_working, how='outer'))

idx_1_broken = pd.MultiIndex.from_tuples(
    [
        (1, pd.Interval(0.0, 1.0)),
        (1, pd.Interval(1.0, 2.0)),
        (1, pd.Interval(2.0, 5.0)),
        (2, pd.Interval(0.0, 1.0)),
        (2, pd.Interval(1.0, 3.0)),  # interval limit is here at 3.0, not at 2.0
        (2, pd.Interval(3.0, 5.0)),
    ],
    names=['num', 'interval'],
)

idx_2_broken = pd.MultiIndex.from_tuples(
    [
        (1, pd.Interval(2.0, 5.0)),
        (1, pd.Interval(0.0, 1.0)),
        (1, pd.Interval(1.0, 2.0)),
        (2, pd.Interval(3.0, 5.0)),
        (2, pd.Interval(0.0, 1.0)),
        (2, pd.Interval(1.0, 3.0)),
    ],
    names=['num', 'interval'],
)

print(idx_1_broken.join(idx_2_broken, how='outer'))

Issue Description

Joining idx_1_broken with idx_2_broken fails for no obvious reason. The difference between the _working and the _broken indices is that in _working the intervals are the same for both num values, whereas in _broken one interval limit differs between them (see the comment in the code).

Output of the example script:

MultiIndex([(1, (0.0, 1.0]),
            (1, (1.0, 2.0]),
            (1, (2.0, 5.0]),
            (2, (0.0, 1.0]),
            (2, (1.0, 2.0]),
            (2, (2.0, 5.0])],
           names=['num', 'interval'])
Traceback (most recent call last):
  File "/home/jmu3si/tmp/intervalindex_test.py", line 47, in <module>
    print(idx_1_broken.join(idx_2_broken, how='outer'))
  File "/home/jmu3si/Devel/pandas/pandas/core/indexes/base.py", line 214, in join
    join_index, lidx, ridx = meth(self, other, how=how, level=level, sort=sort)
  File "/home/jmu3si/Devel/pandas/pandas/core/indexes/base.py", line 4302, in join
    return self._join_via_get_indexer(other, how, sort)
  File "/home/jmu3si/Devel/pandas/pandas/core/indexes/base.py", line 4336, in _join_via_get_indexer
    rindexer = other.get_indexer(join_index)
  File "/home/jmu3si/Devel/pandas/pandas/core/indexes/base.py", line 3665, in get_indexer
    return self._get_indexer(target, method, limit, tolerance)
  File "/home/jmu3si/Devel/pandas/pandas/core/indexes/base.py", line 3684, in _get_indexer
    tgt_values = self._engine._extract_level_codes(target)
  File "pandas/_libs/index.pyx", line 652, in pandas._libs.index.BaseMultiIndexCodesEngine._extract_level_codes
    level_codes = [lev.get_indexer(codes) + 1 for lev, codes
  File "/home/jmu3si/Devel/pandas/pandas/core/indexes/base.py", line 3602, in get_indexer
    raise InvalidIndexError(self._requires_unique_msg)
pandas.errors.InvalidIndexError: cannot handle overlapping indices; use IntervalIndex.get_indexer_non_unique
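
A minimal sketch of what seems to trigger the error, assuming the interval level is stored as one IntervalIndex shared by all num values (which is what the traceback suggests). `make_index` is a hypothetical helper used only to keep the example short, and wrapping the level in `pd.IntervalIndex(...)` is just a precaution. In the _broken case the shared level contains both (1.0, 2.0] and (1.0, 3.0], so the level itself is overlapping even though the intervals within each num value are not:

```python
import pandas as pd
from pandas.errors import InvalidIndexError


def make_index(bounds_by_num):
    # Hypothetical helper: build a ('num', 'interval') MultiIndex
    # from a dict that maps num -> list of (left, right) bounds.
    tuples = [
        (num, pd.Interval(left, right))
        for num, bounds in bounds_by_num.items()
        for left, right in bounds
    ]
    return pd.MultiIndex.from_tuples(tuples, names=["num", "interval"])


idx_working = make_index({
    1: [(0.0, 1.0), (1.0, 2.0), (2.0, 5.0)],
    2: [(0.0, 1.0), (1.0, 2.0), (2.0, 5.0)],
})
idx_broken = make_index({
    1: [(0.0, 1.0), (1.0, 2.0), (2.0, 5.0)],
    2: [(0.0, 1.0), (1.0, 3.0), (3.0, 5.0)],
})

# The second level holds the unique intervals across *all* num values.
working_level = pd.IntervalIndex(idx_working.levels[1])
broken_level = pd.IntervalIndex(idx_broken.levels[1])

print(working_level.is_overlapping)
print(broken_level.is_overlapping)

# get_indexer on an overlapping IntervalIndex raises, which matches
# the last frames of the traceback above.
target = pd.IntervalIndex.from_tuples([(0.0, 1.0)])
try:
    broken_level.get_indexer(target)
    raised = False
except InvalidIndexError:
    raised = True
print(raised)
```

If that reading is right, the join does not fail because of the row order, but because the per-level lookup treats the combined interval level as a single overlapping IntervalIndex.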

Expected Behavior

I would expect that the join operation works the same way as for idx_1_working and idx_2_working.

Expected output of the script:

MultiIndex([(1, (0.0, 1.0]),
            (1, (1.0, 2.0]),
            (1, (2.0, 5.0]),
            (2, (0.0, 1.0]),
            (2, (1.0, 2.0]),
            (2, (2.0, 5.0])],
           names=['num', 'interval'])
MultiIndex([(1, (0.0, 1.0]),
            (1, (1.0, 2.0]),
            (1, (2.0, 5.0]),
            (2, (0.0, 1.0]),
            (2, (1.0, 3.0]),
            (2, (3.0, 5.0])],
           names=['num', 'interval'])

Installed versions

INSTALLED VERSIONS

commit : 3a6d4cd
python : 3.9.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-88-lowlatency
Version : #99-Ubuntu SMP PREEMPT Thu Sep 23 18:30:52 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : de_DE.UTF-8
LOCALE : de_DE.UTF-8

pandas : 1.4.0.dev0+933.g3a6d4cd01d
numpy : 1.21.2
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : None
pytest : 6.2.5
hypothesis : 6.23.3
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None