BUG: merge left and merge inner produce different index-order · Issue #33554 · pandas-dev/pandas (original) (raw)

According to the df.merge docstring and documentation, concerning the how parameter:

left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

(emphasis mine.) I understand this to mean, that the order of the left index will be preserved when using how='left' and how='inner' (or omitted). Of course, some keys might not be present in the second case, but that's beside the point here.

If my understanding is incorrect, my apologies and feel free to discard this report.

If my understanding is correct, however, I noticed today that this is not always true. Here is an example, where the order is not preserved.

Code Sample

import pandas as pd import numpy as np

data = pd.DataFrame({'value':np.random.randint(-10,100,12)}, index=pd.date_range('2020-01-01', periods=12, freq='M')) data['q'] = data.index.map(lambda ts: ts.quarter) data['even'] = data.index.map(lambda ts: ts.month % 2 ==0) cols = ['even', 'q'] av = data.groupby(cols).apply(lambda gr: gr[['value']].mean()) df1 = data.merge(av, how='inner', left_on=cols, suffixes=['', '_av'], right_index=True) df2 = data.merge(av, how='left', left_on=cols, suffixes=['', '_av'], right_index=True)

The dataframes:

data:
            value  q   even
2020-01-31     74  1  False
2020-02-29     87  1   True
2020-03-31     79  1  False
2020-04-30     74  2   True
2020-05-31     71  2  False
2020-06-30     80  2   True
2020-07-31     94  3  False
2020-08-31     19  3   True
2020-09-30     58  3  False
2020-10-31     97  4   True
2020-11-30      5  4  False
2020-12-31     16  4   True

av:
         value
even  q       
False 1   76.5
      2   71.0
      3   76.0
      4    5.0
True  1   87.0
      2   77.0
      3   19.0
      4   56.5

Output

df2: #as expected and wanted
            value  q   even  value_av
2020-01-31     74  1  False      76.5
2020-02-29     87  1   True      87.0
2020-03-31     79  1  False      76.5
2020-04-30     74  2   True      77.0
2020-05-31     71  2  False      71.0
2020-06-30     80  2   True      77.0
2020-07-31     94  3  False      76.0
2020-08-31     19  3   True      19.0
2020-09-30     58  3  False      76.0
2020-10-31     97  4   True      56.5
2020-11-30      5  4  False       5.0
2020-12-31     16  4   True      56.5

df1: #not as expected
            value  q   even  value_av
2020-01-31     74  1  False      76.5
2020-03-31     79  1  False      76.5
2020-02-29     87  1   True      87.0
2020-04-30     74  2   True      77.0
2020-06-30     80  2   True      77.0
2020-05-31     71  2  False      71.0
2020-07-31     94  3  False      76.0
2020-09-30     58  3  False      76.0
2020-08-31     19  3   True      19.0
2020-10-31     97  4   True      56.5
2020-12-31     16  4   True      56.5
2020-11-30      5  4  False       5.0

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : None.None

pandas : 1.0.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 42.0.2.post20191203
Cython : 0.29.15
pytest : 5.4.1
hypothesis : 5.8.3
sphinx : 2.4.4
blosc : None
feather : None
xlsxwriter : 1.2.8
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : None
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.15
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.8
numba : 0.48.0