Feature Request: row-level Merge Status Variable · Issue #8790 · pandas-dev/pandas (original) (raw)
Hello All!
Moving into Pandas from R/Matlab/Stata. One feature I'm finding I really miss: a field generated during a merge that reports, for each row, whether that row came from the left dataset, the right dataset, or was successfully merged.
I do a lot of work in social science where our data is VERY dirty. For example, I'm often merging transliterated names from Pakistan, and so I just want to see after a merge how many records successfully merged, and then easily pull out the records of each time to compare. I'm including a kludge I'm using below, but an in-line option would be so nice, and I think others in my area would also appreciate it.
Thanks!
df1['left'] = 1
df2['right'] = 1
mergedDF = pd.merge(df1,df2, how='outer', left_on='key1', right_on='key2')
def mergeVar(x):
if x['left'] == 1 and x['right'] == 1 :
return 'both'
elif x['left'] == 1 and x['right'] != 1:
return 'leftOnly'
else: return 'rightOnly'
mergedDF['mergeReport'] = mergedDF.apply(mergeVar, axis=1)
mergedDF.drop(['left', 'right'], axis = 1)
(I'd also be happy to help make it, but I'm relatively new to python, so would need some hand-holding to learn how to integrate a new option into an existing function...)