API: Set ops for CategoricalIndex · Issue #10186 · pandas-dev/pandas

Derived from #10157. Would like to clarify what these results should be. Basically, I think:

The following are the current results.

intersection

# for reference
pd.Index([1, 2, 3, 1, 2, 3]).intersection(pd.Index([2, 3, 4, 2, 3, 4]))
# Int64Index([2, 2, 3, 3], dtype='int64')

pd.CategoricalIndex([1, 2, 3, 1, 2, 3]).intersection(pd.CategoricalIndex([2, 3, 4, 2, 3, 4]))
# CategoricalIndex([2, 2, 3, 3], categories=[1, 2, 3], ordered=False, dtype='category')
# -> Is this OK, or should it have categories=[2, 3]?
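
If `categories=[2, 3]` turns out to be the preferred result, the unused category can already be dropped by hand. A minimal sketch, assuming the result has the shape shown above (it is built directly here rather than through `intersection`, so it does not depend on what `intersection` returns):

```python
import pandas as pd

# Same shape as the result above: values [2, 2, 3, 3] with the
# left operand's categories [1, 2, 3] still attached.
result = pd.CategoricalIndex([2, 2, 3, 3], categories=[1, 2, 3])

# remove_unused_categories() drops categories that no value refers to.
trimmed = result.remove_unused_categories()
print(trimmed.categories.tolist())  # [2, 3]
```

Whether `intersection` should do this automatically is the open question; the sketch only shows that both outcomes are reachable today.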

union

The docstring says "Form the union of two Index objects and sorts if possible". I'm not sure whether the last part means "raise an error if sorting is impossible" or "do not sort if sorting is impossible".
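
To make the second reading concrete, here is a plain-Python sketch of "do not sort if sorting is impossible". It is only an illustration of that interpretation, not how pandas implements `union`; `union_sort_if_possible` is a hypothetical helper name, and it assumes duplicates are dropped.

```python
def union_sort_if_possible(left, right):
    """Union of two sequences; sorted only when the values are comparable."""
    # Drop duplicates while keeping first-seen order.
    merged = list(dict.fromkeys([*left, *right]))
    try:
        return sorted(merged)
    except TypeError:
        # Non-comparable values (e.g. int and str): return the unsorted union.
        return merged


print(union_sort_if_possible([1, 2, 4], [2, 3, 4]))  # [1, 2, 3, 4]
print(union_sort_if_possible([1, "a"], ["a", 2]))    # [1, 'a', 2]
```

Under the first reading, the `except` branch would re-raise instead of falling back.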

pd.Index([1, 2, 4]).union(pd.Index([2, 3, 4]))
# Int64Index([1, 2, 3, 4], dtype='int64')

pd.CategoricalIndex([1, 2, 4]).union(pd.CategoricalIndex([2, 3, 4]))
# CategoricalIndex([1, 2, 4, 3], categories=[1, 2, 3, 4], ordered=False, dtype='category')
# -> Should be sorted?
pd.Index([1, 2, 3, 1, 2, 3]).union(pd.Index([2, 3, 4, 2, 3, 4]))
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects
# -> Should this result in Index([1, 2, 3, 1, 2, 3, 4, 4])?

pd.CategoricalIndex([1, 2, 3, 1, 2, 3]).union(pd.CategoricalIndex([2, 3, 4, 2, 3, 4]))
# TypeError: type() takes 1 or 3 arguments
# -> Should this raise an understandable error, or should Int64Index not raise either (and return an unsorted result)?

difference

pd.CategoricalIndex([1, 2, 4, 5]).difference(pd.CategoricalIndex([2, 3, 4]))
# Int64Index([1, 5], dtype='int64')
# -> should be CategoricalIndex?
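
Until the return type is settled, one possible workaround is to compute the difference on the plain values and re-wrap the result explicitly. A sketch only; note that the categories are then inferred from the remaining values, which may or may not be what the API should do:

```python
import pandas as pd

left = pd.CategoricalIndex([1, 2, 4, 5])
right = pd.CategoricalIndex([2, 3, 4])

# Take the difference on the plain values, then re-wrap explicitly so the
# result type does not depend on what CategoricalIndex.difference returns.
diff = pd.Index(left.tolist()).difference(pd.Index(right.tolist()))
as_categorical = pd.CategoricalIndex(diff)
print(as_categorical.tolist())  # [1, 5]
```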

sym_diff

pd.CategoricalIndex([1, 2, 4, 5]).sym_diff(pd.CategoricalIndex([2, 4]))
# Int64Index([1, 5], dtype='int64')
# -> should be CategoricalIndex?