BUG: inconsistent behaviors for Index.union() and Index.intersection() with duplicates · Issue #31326 · pandas-dev/pandas
While working on #31312, I noticed that the behavior of Index.union() and Index.intersection() is inconsistent when one of the indexes contains duplicates.
import pandas as pd
import traceback

a = pd.Index([1, 2, 2, 3])
b = pd.Index([3, 3, 4])

def test_setops(left, right):
    for op in ["intersection", "union"]:
        for sort in [None, False]:
            result = getattr(left, op)(right, sort=sort)
            print(f"sort = {sort}, {op}: {result} -> has duplicates: {result.has_duplicates}")

test_setops(a, b)
#> sort = None, intersection: Int64Index([3, 3], dtype='int64') -> has duplicates: True
#> sort = False, intersection: Int64Index([3, 3], dtype='int64') -> has duplicates: True
#> sort = None, union: Int64Index([1, 2, 2, 3, 3, 4], dtype='int64') -> has duplicates: True
#> sort = False, union: Int64Index([1, 2, 2, 3, 4], dtype='int64') -> has duplicates: True

arrays = [['a', 'b', 'b', 'c'], ['1', '2', '2', '1']]
a_mi = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
arrays = [['c', 'c', 'd'], ['1', '1', '2']]
b_mi = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])

test_setops(a_mi, b_mi)
#> sort = None, intersection: MultiIndex([('c', '1')],
#>            names=['first', 'second']) -> has duplicates: False
#> sort = False, intersection: MultiIndex([('c', '1')],
#>            names=['first', 'second']) -> has duplicates: False
#> sort = None, union: MultiIndex([('a', '1'),
#>             ('b', '2'),
#>             ('c', '1'),
#>             ('d', '2')],
#>            names=['first', 'second']) -> has duplicates: False
#> sort = False, union: MultiIndex([('a', '1'),
#>             ('b', '2'),
#>             ('c', '1'),
#>             ('d', '2')],
#>            names=['first', 'second']) -> has duplicates: False
Created on 2020-01-26 by the reprexpy package
Problem description
1. The behavior of intersection() and union() when duplicates are present is not consistent between Index and MultiIndex. Those operations return duplicates with Index but not with MultiIndex. The documentation doesn't clearly state what to expect.
2. When duplicates are present, the size of the result of Index.union() depends on whether sort is None or False.
3. If duplicates are present on only one side, Index.intersection() always returns duplicates.
Here are more succinct examples for items 2 and 3.
import pandas as pd

a = pd.Index([1, 2, 2, 3])
b = pd.Index([3, 3, 4])

# expected [1, 2, 2, 3, 3, 3, 4]
a.union(b, sort=None)
#> Int64Index([1, 2, 2, 3, 3, 4], dtype='int64')
a.union(b, sort=False)
#> Int64Index([1, 2, 2, 3, 4], dtype='int64')

# expected [3]
a.intersection(b, sort=None)
#> Int64Index([3, 3], dtype='int64')
a.intersection(b, sort=False)
#> Int64Index([3, 3], dtype='int64')

# expected [3, 3]
b.intersection(a, sort=None)
#> Int64Index([3, 3], dtype='int64')
b.intersection(a, sort=False)
#> Int64Index([3, 3], dtype='int64')
Created on 2020-01-26 by the reprexpy package
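The "expected" comments in the example above are consistent with one possible duplicate-preserving reading: a union that keeps every occurrence from both sides, and an intersection that keeps every element of the left side that also occurs on the right. The sketch below spells out that reading with plain Python lists. It is only an interpretation of the comments, not how pandas is implemented, and the helper names `concat_union` and `keep_left_intersection` are hypothetical.

```python
def keep_left_intersection(left, right):
    """Keep every element of `left` that also occurs in `right`, duplicates included."""
    right_values = set(right)
    return [x for x in left if x in right_values]


def concat_union(left, right):
    """Concatenate both inputs and sort, so that no occurrence is dropped."""
    return sorted(list(left) + list(right))


a = [1, 2, 2, 3]
b = [3, 3, 4]

print(concat_union(a, b))            # [1, 2, 2, 3, 3, 3, 4]
print(keep_left_intersection(a, b))  # [3]
print(keep_left_intersection(b, a))  # [3, 3]
```

Under this reading the intersection is not symmetric: the number of duplicates in the result follows the left operand only.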
Expected Output
For consistency and clarity, I think it would be better to enforce uniqueness in the index returned by set operations. Index.union() and Index.intersection() are the only ones that allow duplicates.
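Until the behavior is settled, a caller who wants unique results regardless of the inputs could deduplicate explicitly. The following is a minimal sketch of that workaround, assuming only the existing Index.unique() method; `unique_setop` is a hypothetical helper, not part of pandas.

```python
import pandas as pd


def unique_setop(left, right, op, sort=None):
    """Apply a set operation, then drop any duplicates from the result.

    Hypothetical helper for illustration; not a pandas API.
    """
    if op not in ("union", "intersection"):
        raise ValueError(f"unsupported operation: {op}")
    result = getattr(left, op)(right, sort=sort)
    # Index.unique() keeps the first occurrence of each value.
    return result.unique()


a = pd.Index([1, 2, 2, 3])
b = pd.Index([3, 3, 4])

for op in ["intersection", "union"]:
    for sort in [None, False]:
        result = unique_setop(a, b, op, sort=sort)
        print(f"sort = {sort}, {op}: {list(result)} -> has duplicates: {result.has_duplicates}")
```

The same helper should also accept MultiIndex operands, since MultiIndex provides unique() as well.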
Output of pd.show_versions()
INSTALLED VERSIONS
commit : ca3bfcc
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.2.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.0.0rc0+212.gca3bfcc54
numpy : 1.17.5
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.1
setuptools : 45.1.0.post20200119
Cython : 0.29.14
pytest : 5.3.4
hypothesis : 5.3.0
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : 0.3.2
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.1
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.3.4
pyxlsb : None
s3fs : 0.4.0
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : 3.6.1
tabulate : 0.8.6
xarray : 0.14.1
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.47.0