BUG: Joining MultiIndex with IntervalIndex level fails when IntervalIndex level is overlapping · Issue #44096 · pandas-dev/pandas
- I have checked that this issue has not already been reported. (might be related to issues with overlapping multi index intervals #27456)
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the master branch of pandas.
Reproducible Example
import pandas as pd

idx_1_working = pd.MultiIndex.from_tuples([
    (1, pd.Interval(0.0, 1.0)),
    (1, pd.Interval(1.0, 2.0)),
    (1, pd.Interval(2.0, 5.0)),
    (2, pd.Interval(0.0, 1.0)),
    (2, pd.Interval(1.0, 2.0)),
    (2, pd.Interval(2.0, 5.0)),
], names=['num', 'interval'])

# Same index but sequence slightly changed
idx_2_working = pd.MultiIndex.from_tuples([
    (1, pd.Interval(2.0, 5.0)),  # in idx_1_working this is in 3rd row
    (1, pd.Interval(0.0, 1.0)),
    (1, pd.Interval(1.0, 2.0)),
    (2, pd.Interval(2.0, 5.0)),  # in idx_1_working this is in 6th row
    (2, pd.Interval(0.0, 1.0)),
    (2, pd.Interval(1.0, 2.0)),
], names=['num', 'interval'])

print(idx_1_working.join(idx_2_working, how='outer'))

idx_1_broken = pd.MultiIndex.from_tuples([
    (1, pd.Interval(0.0, 1.0)),
    (1, pd.Interval(1.0, 2.0)),
    (1, pd.Interval(2.0, 5.0)),
    (2, pd.Interval(0.0, 1.0)),
    (2, pd.Interval(1.0, 3.0)),  # interval limit is here at 3.0, not at 2.0
    (2, pd.Interval(3.0, 5.0)),
], names=['num', 'interval'])

idx_2_broken = pd.MultiIndex.from_tuples([
    (1, pd.Interval(2.0, 5.0)),
    (1, pd.Interval(0.0, 1.0)),
    (1, pd.Interval(1.0, 2.0)),
    (2, pd.Interval(3.0, 5.0)),
    (2, pd.Interval(0.0, 1.0)),
    (2, pd.Interval(1.0, 3.0)),
], names=['num', 'interval'])

print(idx_1_broken.join(idx_2_broken, how='outer'))
Issue Description
Joining idx_1_broken with idx_2_broken fails for no obvious reason. The difference between the _working and the _broken indices is that in _working the interval level has the same entries for both num level values, whereas in _broken one interval limit differs (see the comment in the code).
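The difference can be made visible with a minimal sketch. It assumes that the interval level of each MultiIndex holds the unique intervals as an IntervalIndex (the error message in the traceback suggests this); the names working_level and broken_level are hypothetical and built by hand here rather than taken from the indexes above.

```python
import pandas as pd

# Unique intervals that occur in the `interval` level of the _working indexes.
working_level = pd.IntervalIndex.from_tuples(
    [(0.0, 1.0), (1.0, 2.0), (2.0, 5.0)]
)

# Unique intervals that occur in the `interval` level of the _broken indexes.
# (1.0, 2.0] and (1.0, 3.0] share points, as do (2.0, 5.0] and (3.0, 5.0].
broken_level = pd.IntervalIndex.from_tuples(
    [(0.0, 1.0), (1.0, 2.0), (1.0, 3.0), (2.0, 5.0), (3.0, 5.0)]
)

print(working_level.is_overlapping)  # False
print(broken_level.is_overlapping)   # True
```

So within each num group the intervals do not overlap, but the level as a whole does once the two groups use different limits.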
Output of the example script:
MultiIndex([(1, (0.0, 1.0]),
            (1, (1.0, 2.0]),
            (1, (2.0, 5.0]),
            (2, (0.0, 1.0]),
            (2, (1.0, 2.0]),
            (2, (2.0, 5.0])],
           names=['num', 'interval'])
Traceback (most recent call last):
  File "/home/jmu3si/tmp/intervalindex_test.py", line 47, in <module>
    print(idx_1_broken.join(idx_2_broken, how='outer'))
  File "/home/jmu3si/Devel/pandas/pandas/core/indexes/base.py", line 214, in join
    join_index, lidx, ridx = meth(self, other, how=how, level=level, sort=sort)
  File "/home/jmu3si/Devel/pandas/pandas/core/indexes/base.py", line 4302, in join
    return self._join_via_get_indexer(other, how, sort)
  File "/home/jmu3si/Devel/pandas/pandas/core/indexes/base.py", line 4336, in _join_via_get_indexer
    rindexer = other.get_indexer(join_index)
  File "/home/jmu3si/Devel/pandas/pandas/core/indexes/base.py", line 3665, in get_indexer
    return self._get_indexer(target, method, limit, tolerance)
  File "/home/jmu3si/Devel/pandas/pandas/core/indexes/base.py", line 3684, in _get_indexer
    tgt_values = self._engine._extract_level_codes(target)
  File "pandas/_libs/index.pyx", line 652, in pandas._libs.index.BaseMultiIndexCodesEngine._extract_level_codes
    level_codes = [lev.get_indexer(codes) + 1 for lev, codes
  File "/home/jmu3si/Devel/pandas/pandas/core/indexes/base.py", line 3602, in get_indexer
    raise InvalidIndexError(self._requires_unique_msg)
pandas.errors.InvalidIndexError: cannot handle overlapping indices; use IntervalIndex.get_indexer_non_unique
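The last frames of the traceback can be reproduced in isolation. The following is a minimal sketch, not the code path of MultiIndex.join itself: it calls get_indexer directly on a hand-built overlapping IntervalIndex (the names overlapping and target are hypothetical) and then tries the get_indexer_non_unique method that the error message points to.

```python
import pandas as pd
from pandas.errors import InvalidIndexError

# Hand-built stand-in for the overlapping `interval` level.
overlapping = pd.IntervalIndex.from_tuples(
    [(0.0, 1.0), (1.0, 2.0), (1.0, 3.0), (2.0, 5.0), (3.0, 5.0)]
)
# Intervals to look up; both are present in `overlapping`.
target = pd.IntervalIndex.from_tuples([(1.0, 3.0), (3.0, 5.0)])

# get_indexer refuses to work on an overlapping IntervalIndex.
try:
    overlapping.get_indexer(target)
    error_message = None
except InvalidIndexError as err:
    error_message = str(err)
print(error_message)

# get_indexer_non_unique returns a pair: positions and missing targets.
indexer, missing = overlapping.get_indexer_non_unique(target)
print(indexer, missing)
```

This only shows that the level-wise get_indexer call is where the join gives up; whether the join should use a different lookup here is for the maintainers to decide.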
Expected Behavior
I would expect that the join operation works the same way as for idx_1_working and idx_2_working.
Expected output of the script:
MultiIndex([(1, (0.0, 1.0]),
            (1, (1.0, 2.0]),
            (1, (2.0, 5.0]),
            (2, (0.0, 1.0]),
            (2, (1.0, 2.0]),
            (2, (2.0, 5.0])],
           names=['num', 'interval'])
MultiIndex([(1, (0.0, 1.0]),
            (1, (1.0, 2.0]),
            (1, (2.0, 5.0]),
            (2, (0.0, 1.0]),
            (2, (1.0, 3.0]),
            (2, (3.0, 5.0])],
           names=['num', 'interval'])
Installed versions
INSTALLED VERSIONS
commit : 3a6d4cd
python : 3.9.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-88-lowlatency
Version : #99-Ubuntu SMP PREEMPT Thu Sep 23 18:30:52 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : de_DE.UTF-8
LOCALE : de_DE.UTF-8
pandas : 1.4.0.dev0+933.g3a6d4cd01d
numpy : 1.21.2
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : None
pytest : 6.2.5
hypothesis : 6.23.3
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None