Warn on duplicate names in MI? · Issue #19029 · pandas-dev/pandas (original) (raw)

Opening a new issue so this isn't lost.

In #18882 banned duplicate names in a MultiIndex. I think this is a good change since allowing duplicates hit a lot of edge cases when you went to actually do something. I want to make sure we understand all the cases that actually produce duplicate names in the MI though, specifically groupby.apply.

In [1]: import dask.dataframe as dd

In [2]: import pandas as pd

In [3]: pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9], ...: 'b': [4, 5, 6, 3, 2, 1, 0, 0, 0]}, ...: index=[0, 1, 3, 5, 6, 8, 9, 9, 9]).set_index("a") ...: ...:

In [4]: pdf.groupby(pdf.index).apply(lambda x: x.b)

Another, more realistic example: groupwise drop_duplicates:

In [18]: df = pd.DataFrame({"B": [0, 0, 0, 1, 1, 1, 2, 2, 2]}, index=pd.Index([0, 1, 1, 2, 2, 2, 0, 0, 1], name='a'))

In [19]: df Out[19]: B a 0 0 1 0 1 0 2 1 2 1 2 1 0 2 0 2 1 2

In [20]: df.groupby('a').apply(pd.DataFrame.drop_duplicates) Out[20]: B a a 0 0 0 0 2 1 1 0 1 2 2 2 1

Is it possible to throw a warning on this for now, in case duplicate names are more common than we thought?