merge does not preserve order of data frame when left key has duplicate values · Issue #7331 · pandas-dev/pandas (original) (raw)
import pandas as pd
dates = [ 20140101, 20140102, 20140103]
states = [ "CA", "NY", "CA"]
x = pd.DataFrame({ 'dates' : dates, 'states' : states })
#y = pandas.DataFrame({ 'state' : [ 'CA', 'OR' ], 'value' : [ 1, 2]})
y = pd.DataFrame({ 'states' : [ "CA", "NY" ], 'stateid' : [ 1, 2]})
z = pd.merge(x, y, how='left', on='states',)
x=
dates states
0 20140101 CA
1 20140102 NY
2 20140103 CA
y=
stateid states
0 1 CA
1 2 NY
z= dates states stateid
0 20140101 CA 1
1 20140103 CA 1
2 20140102 NY 2
Note z is always sorted by "states" column whether argument sort=True or False.
This only happens when the x's states column is not unique. If x.states is unique(such as NY, CA, CT), sort=True and False behaves as expected.
This causes inconvenience when x is a time series and merge does not
preserve the time sequence.