BUG: New feature allowing merging on combination of columns and index levels drops levels of index · Issue #20452 · pandas-dev/pandas (original) (raw)
Code Sample, a copy-pastable example if possible
In [1]: import pandas as pd
In [2]: pd.version Out[2]: '0.23.0.dev0+657.g01882ba5b'
In [3]: df1 = pd.DataFrame({'v1' : range(12)}, index=pd.MultiIndex.from_product([list('abc'),list('xy'),[1,2]], names=['abc','xy','num'])) ...: df1 ...: Out[3]: v1 abc xy num a x 1 0 2 1 y 1 2 2 3 b x 1 4 2 5 y 1 6 2 7 c x 1 8 2 9 y 1 10 2 11
In [4]: df2 = pd.DataFrame({'v2': [100*i for i in range(1,7)]}, index=pd.MultiIndex.from_product([list('abc'), list('xy')],names=['abc','xy']))
In [5]: df2 Out[5]: v2 abc xy a x 100 y 200 b x 300 y 400 c x 500 y 600
In [6]: df1.merge(df2, on=['abc','xy']) # 'num' disappears Out[6]: v1 v2 abc xy a x 0 100 x 1 100 y 2 200 y 3 200 b x 4 300 x 5 300 y 6 400 y 7 400 c x 8 500 x 9 500 y 10 600 y 11 600
In [7]: df1.reset_index().merge(df2, on=['abc','xy']) # This preserves 'num' Out[7]: abc xy num v1 v2 0 a x 1 0 100 1 a x 2 1 100 2 a y 1 2 200 3 a y 2 3 200 4 b x 1 4 300 5 b x 2 5 300 6 b y 1 6 400 7 b y 2 7 400 8 c x 1 8 500 9 c x 2 9 500 10 c y 1 10 600 11 c y 2 11 600
In [8]: df1.merge(df2, on='xy') # 'abc' and 'num' disappear Out[8]: v1 v2 xy x 0 100 x 0 300 x 0 500 x 1 100 x 1 300 x 1 500 x 4 100 x 4 300 x 4 500 x 5 100 x 5 300 x 5 500 x 8 100 x 8 300 x 8 500 x 9 100 x 9 300 x 9 500 y 2 200 y 2 400 y 2 600 y 3 200 y 3 400 y 3 600 y 6 200 y 6 400 y 6 600 y 7 200 y 7 400 y 7 600 y 10 200 y 10 400 y 10 600 y 11 200 y 11 400 y 11 600
Problem description
It seems that the new feature implemented in #17484 that allows merging on a combination of columns and index levels can drop index levels, which is really non-intuitive. In the first example, the index level named "num" gets dropped, while in the last example, both "abc" and "xy" are dropped.
If this is the desired behavior, then it needs to be carefully documented.
N.B. There is also an error in the docs of merging.rst that says this feature was introduced in v.0.22, but it will be introduced in v0.23
I'm guessing @jmmease will need to look at this.
Expected Output
In [6]: df1.merge(df2, on=['abc','xy']) Out[6]: v1 v2 abc xy num a x 1 0 100 2 1 100 y 1 2 200 2 3 200 b x 1 4 300 2 5 300 y 1 6 400 2 7 400 c x 1 8 500 2 9 500 y 1 10 600 2 11 600
In [8]: df1.merge(df2, on='xy') Out[8]: abc_x num v1 abc_y v2 xy x a 1 0 a 100 x a 1 0 b 300 x a 1 0 c 500 x a 2 1 a 100 x a 2 1 b 300 x a 2 1 c 500 x b 1 4 a 100 x b 1 4 b 300 x b 1 4 c 500 x b 2 5 a 100 x b 2 5 b 300 x b 2 5 c 500 x c 1 8 a 100 x c 1 8 b 300 x c 1 8 c 500 x c 2 9 a 100 x c 2 9 b 300 x c 2 9 c 500 y a 1 2 a 200 y a 1 2 b 400 y a 1 2 c 600 y a 2 3 a 200 y a 2 3 b 400 y a 2 3 c 600 y b 1 6 a 200 y b 1 6 b 400 y b 1 6 c 600 y b 2 7 a 200 y b 2 7 b 400 y b 2 7 c 600 y c 1 10 a 200 y c 1 10 b 400 y c 1 10 c 600 y c 2 11 a 200 y c 2 11 b 400 y c 2 11 c 600
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.23.0.dev0+657.g01882ba5b
pytest: 3.4.0
pip: 9.0.1
setuptools: 38.5.1
Cython: 0.25.1
numpy: 1.14.1
scipy: 1.0.0
pyarrow: 0.8.0
xarray: None
IPython: 6.2.1
sphinx: 1.7.1
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: 1.5.1
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.0
openpyxl: 2.5.0
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: 0.1.3
fastparquet: None
pandas_gbq: None
pandas_datareader: None