BUG: inconsistent behaviors for Index.union() and Index.intersection() with duplicates · Issue #31326 · pandas-dev/pandas

While working on #31312, I noticed that Index.union() and Index.intersection() behave inconsistently when one of the indexes contains duplicates.

import pandas as pd
import traceback

a = pd.Index([1, 2, 2, 3])
b = pd.Index([3, 3, 4])

def test_setops(left, right):
    for op in ["intersection", "union"]:
        for sort in [None, False]:
            result = getattr(left, op)(right, sort=sort)
            print(f"sort = {sort}, {op}: {result} -> has duplicates: {result.has_duplicates}")

test_setops(a, b)
#> sort = None, intersection: Int64Index([3, 3], dtype='int64') -> has duplicates: True
#> sort = False, intersection: Int64Index([3, 3], dtype='int64') -> has duplicates: True
#> sort = None, union: Int64Index([1, 2, 2, 3, 3, 4], dtype='int64') -> has duplicates: True
#> sort = False, union: Int64Index([1, 2, 2, 3, 4], dtype='int64') -> has duplicates: True

arrays = [['a', 'b', 'b', 'c'], ['1', '2', '2', '1']]
a_mi = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
arrays = [['c', 'c', 'd'], ['1', '1', '2']]
b_mi = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])

test_setops(a_mi, b_mi)
#> sort = None, intersection: MultiIndex([('c', '1')],
#>            names=['first', 'second']) -> has duplicates: False
#> sort = False, intersection: MultiIndex([('c', '1')],
#>            names=['first', 'second']) -> has duplicates: False
#> sort = None, union: MultiIndex([('a', '1'),
#>             ('b', '2'),
#>             ('c', '1'),
#>             ('d', '2')],
#>            names=['first', 'second']) -> has duplicates: False
#> sort = False, union: MultiIndex([('a', '1'),
#>             ('b', '2'),
#>             ('c', '1'),
#>             ('d', '2')],
#>            names=['first', 'second']) -> has duplicates: False

Created on 2020-01-26 by the reprexpy package

Problem description

  1. The behavior of intersection() and union() when duplicates are present is not consistent between Index and MultiIndex. Those operations return duplicates with Index but not with MultiIndex. The documentation doesn't clearly state what to expect.
  2. When duplicates are present, the size of the result of Index.union() depends on whether sort is None or False.
  3. If duplicates are present on only one side, Index.intersection() always returns duplicates.

Here are more succinct examples for 2. and 3.

import pandas as pd

a = pd.Index([1, 2, 2, 3])
b = pd.Index([3, 3, 4])

# expected [1, 2, 2, 3, 3, 3, 4]
a.union(b, sort=None)
#> Int64Index([1, 2, 2, 3, 3, 4], dtype='int64')
a.union(b, sort=False)
#> Int64Index([1, 2, 2, 3, 4], dtype='int64')

# expected [3]
a.intersection(b, sort=None)
#> Int64Index([3, 3], dtype='int64')
a.intersection(b, sort=False)
#> Int64Index([3, 3], dtype='int64')

# expected [3, 3]
b.intersection(a, sort=None)
#> Int64Index([3, 3], dtype='int64')
b.intersection(a, sort=False)
#> Int64Index([3, 3], dtype='int64')

Created on 2020-01-26 by the reprexpy package
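
To make the union results above easier to compare, here is a small pure-Python sketch of four possible meanings of "union with duplicates". The function names are hypothetical and this is only one way to read the reported outputs for this particular input; it is not a description of how pandas implements Index.union().

```python
from collections import Counter

a = [1, 2, 2, 3]
b = [3, 3, 4]


def union_concat(left, right):
    """Keep every occurrence from both sides (multiset sum)."""
    return sorted((Counter(left) + Counter(right)).elements())


def union_max_count(left, right):
    """Keep each value as often as it occurs on the side where it is most frequent."""
    return sorted((Counter(left) | Counter(right)).elements())


def union_left_then_missing(left, right):
    """Keep left unchanged, then append the values of right that left lacks."""
    seen = set(left)
    return list(left) + [x for x in right if x not in seen]


def union_set(left, right):
    """Plain set union: every value appears exactly once."""
    return sorted(set(left) | set(right))


for fn in (union_concat, union_max_count, union_left_then_missing, union_set):
    print(f"{fn.__name__}: {fn(a, b)}")
```

For these inputs, union_concat gives the "expected" list [1, 2, 2, 3, 3, 3, 4], union_max_count matches the sort=None output [1, 2, 2, 3, 3, 4], union_left_then_missing matches the sort=False output [1, 2, 2, 3, 4], and union_set gives the duplicate-free [1, 2, 3, 4].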

Expected Output

For consistency and clarity, I think it would be better to enforce uniqueness in the index returned by set operations. Index.union() and Index.intersection() are the only ones allowing duplicates.
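
Until the behavior is settled, a caller who wants duplicate-free results can deduplicate explicitly. Below is a minimal sketch, assuming a hypothetical helper called unique_setop (it is not part of pandas); it relies only on Index.unique() and the existing sort argument of union() and intersection().

```python
import pandas as pd


def unique_setop(left, right, op, sort=None):
    """Hypothetical helper (not a pandas API): run a set operation and
    return a duplicate-free result."""
    if op not in ("union", "intersection"):
        raise ValueError(f"unsupported operation: {op!r}")
    # Index.unique() keeps the first occurrence of each label.
    result = getattr(left.unique(), op)(right.unique(), sort=sort)
    # Deduplicate the output too, so the guarantee does not rely on
    # how a given pandas version treats duplicates.
    return result.unique()


a = pd.Index([1, 2, 2, 3])
b = pd.Index([3, 3, 4])

for op in ["intersection", "union"]:
    for sort in [None, False]:
        result = unique_setop(a, b, op, sort=sort)
        print(f"sort = {sort}, {op}: {list(result)} -> has duplicates: {result.has_duplicates}")
```

Deduplicating the inputs first means the result size no longer depends on sort. The same helper should also work for MultiIndex, since MultiIndex.unique() exists as well.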

Output of pd.show_versions()

INSTALLED VERSIONS

commit : ca3bfcc
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.2.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.0rc0+212.gca3bfcc54
numpy : 1.17.5
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.1
setuptools : 45.1.0.post20200119
Cython : 0.29.14
pytest : 5.3.4
hypothesis : 5.3.0
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : 0.3.2
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.1
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.3.4
pyxlsb : None
s3fs : 0.4.0
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : 3.6.1
tabulate : 0.8.6
xarray : 0.14.1
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.47.0