BUG: New feature allowing merging on combination of columns and index levels drops levels of index · Issue #20452 · pandas-dev/pandas (original) (raw)

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd

In [2]: pd.version Out[2]: '0.23.0.dev0+657.g01882ba5b'

In [3]: df1 = pd.DataFrame({'v1' : range(12)}, index=pd.MultiIndex.from_product([list('abc'),list('xy'),[1,2]], names=['abc','xy','num'])) ...: df1 ...: Out[3]: v1 abc xy num a x 1 0 2 1 y 1 2 2 3 b x 1 4 2 5 y 1 6 2 7 c x 1 8 2 9 y 1 10 2 11

In [4]: df2 = pd.DataFrame({'v2': [100*i for i in range(1,7)]}, index=pd.MultiIndex.from_product([list('abc'), list('xy')],names=['abc','xy']))

In [5]: df2 Out[5]: v2 abc xy a x 100 y 200 b x 300 y 400 c x 500 y 600

In [6]: df1.merge(df2, on=['abc','xy']) # 'num' disappears Out[6]: v1 v2 abc xy a x 0 100 x 1 100 y 2 200 y 3 200 b x 4 300 x 5 300 y 6 400 y 7 400 c x 8 500 x 9 500 y 10 600 y 11 600

In [7]: df1.reset_index().merge(df2, on=['abc','xy']) # This preserves 'num' Out[7]: abc xy num v1 v2 0 a x 1 0 100 1 a x 2 1 100 2 a y 1 2 200 3 a y 2 3 200 4 b x 1 4 300 5 b x 2 5 300 6 b y 1 6 400 7 b y 2 7 400 8 c x 1 8 500 9 c x 2 9 500 10 c y 1 10 600 11 c y 2 11 600

In [8]: df1.merge(df2, on='xy') # 'abc' and 'num' disappear Out[8]: v1 v2 xy x 0 100 x 0 300 x 0 500 x 1 100 x 1 300 x 1 500 x 4 100 x 4 300 x 4 500 x 5 100 x 5 300 x 5 500 x 8 100 x 8 300 x 8 500 x 9 100 x 9 300 x 9 500 y 2 200 y 2 400 y 2 600 y 3 200 y 3 400 y 3 600 y 6 200 y 6 400 y 6 600 y 7 200 y 7 400 y 7 600 y 10 200 y 10 400 y 10 600 y 11 200 y 11 400 y 11 600

Problem description

It seems that the new feature implemented in #17484 that allows merging on a combination of columns and index levels can drop index levels, which is really non-intuitive. In the first example, the index level named "num" gets dropped, while in the last example, both "abc" and "xy" are dropped.

If this is the desired behavior, then it needs to be carefully documented.

N.B. There is also an error in the docs of merging.rst that says this feature was introduced in v.0.22, but it will be introduced in v0.23

I'm guessing @jmmease will need to look at this.

Expected Output

In [6]: df1.merge(df2, on=['abc','xy']) Out[6]: v1 v2 abc xy num a x 1 0 100 2 1 100 y 1 2 200 2 3 200 b x 1 4 300 2 5 300 y 1 6 400 2 7 400 c x 1 8 500 2 9 500 y 1 10 600 2 11 600

In [8]: df1.merge(df2, on='xy') Out[8]: abc_x num v1 abc_y v2 xy x a 1 0 a 100 x a 1 0 b 300 x a 1 0 c 500 x a 2 1 a 100 x a 2 1 b 300 x a 2 1 c 500 x b 1 4 a 100 x b 1 4 b 300 x b 1 4 c 500 x b 2 5 a 100 x b 2 5 b 300 x b 2 5 c 500 x c 1 8 a 100 x c 1 8 b 300 x c 1 8 c 500 x c 2 9 a 100 x c 2 9 b 300 x c 2 9 c 500 y a 1 2 a 200 y a 1 2 b 400 y a 1 2 c 600 y a 2 3 a 200 y a 2 3 b 400 y a 2 3 c 600 y b 1 6 a 200 y b 1 6 b 400 y b 1 6 c 600 y b 2 7 a 200 y b 2 7 b 400 y b 2 7 c 600 y c 1 10 a 200 y c 1 10 b 400 y c 1 10 c 600 y c 2 11 a 200 y c 2 11 b 400 y c 2 11 c 600

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.0.dev0+657.g01882ba5b
pytest: 3.4.0
pip: 9.0.1
setuptools: 38.5.1
Cython: 0.25.1
numpy: 1.14.1
scipy: 1.0.0
pyarrow: 0.8.0
xarray: None
IPython: 6.2.1
sphinx: 1.7.1
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: 1.5.1
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.0
openpyxl: 2.5.0
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: 0.1.3
fastparquet: None
pandas_gbq: None
pandas_datareader: None