Feature Request: row-level Merge Status Variable · Issue #8790 · pandas-dev/pandas (original) (raw)

Hello All!

Moving into Pandas from R/Matlab/Stata. One feature I'm finding I really miss: a field generated during a merge that reports, for each row, whether that row came from the left dataset, the right dataset, or was successfully merged.

I do a lot of work in social science where our data is VERY dirty. For example, I'm often merging transliterated names from Pakistan, and so I just want to see after a merge how many records successfully merged, and then easily pull out the records of each time to compare. I'm including a kludge I'm using below, but an in-line option would be so nice, and I think others in my area would also appreciate it.

Thanks!

df1['left'] = 1
df2['right'] = 1

mergedDF = pd.merge(df1,df2, how='outer', left_on='key1', right_on='key2')

def mergeVar(x):
    if x['left'] == 1 and x['right'] == 1 :
        return 'both'
    elif x['left'] == 1 and x['right'] != 1:
        return 'leftOnly'
    else: return 'rightOnly'

mergedDF['mergeReport'] = mergedDF.apply(mergeVar, axis=1)
mergedDF.drop(['left', 'right'], axis = 1)

(I'd also be happy to help make it, but I'm relatively new to python, so would need some hand-holding to learn how to integrate a new option into an existing function...)