Pandas DataFrame merge() Method – Be on the Right Side of Change (original) (raw)

Preparation

Before any data manipulation can occur, two (2) new libraries will require installation.

The Pandas library enables access to/from a DataFrame.
The NumPy library supports multi-dimensional arrays and matrices in addition to a collection of mathematical functions.

To install these libraries, navigate to an IDE terminal. At the command prompt ($), execute the code below. For the terminal used in this example, the command prompt is a dollar sign ($). Your terminal prompt may be different.

$ pip install pandas

Hit the <Enter> key on the keyboard to start the installation process.

$ pip install numpy

Hit the <Enter> key on the keyboard to start the installation process.

If the installations were successful, a message displays in the terminal indicating the same.

FeFeel free to view the PyCharm installation guide for the required libraries.

Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

import pandas as pd import numpy

The merge() method joins the DataFrame(s) on columns/indexes.

Points to note:

If joining columns, the DataFrame indexes ignore.
If joining indexes, the index passes on.
If cross-merge, no column specs to merge done.

💡 Note: If the key columns contain rows where the key is NULL (empty), the rows match against each other.

This method does not act like a SQL join.

The syntax for this method is as follows:

DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

Description	Parameter
right	This parameter is the DataFrame/Series to merge.
how	This parameter can be one of the following options: – left: uses calling index (or column on, if used). – right: uses the index specified in the other parameter. – outer: creates union on calling index (or column, if on) with other index and sort. – inner: creates intersection on calling index (or column, if on) and preserves the order. – cross: creates the cartesian product from both and preserves the order of the left keys.
on	This parameter is the column/index name(s) to join and must exist in DataFrames/Series. The format is a label/list.
left_on	The column/index names to join on the left DataFrame.
right_on	The column/index name(s) to join on the right DataFrame.
left_index	The index from the left DataFrame uses as join key(s).
right_index	The index from the right DataFrame uses as join key(s).
sort	This parameter sorts the resultant DataFrame.
suffixes	This parameter must be two (2) characters. The suffix adds to the end of each over-lapping column on the left/right.
copy	If True, applies a merge on the original DataFrame.
indicator	If True, a column adds to the resultant DataFrame called _merge. This column provides details on the source of each row.
validate	These options indicate validation type (strings): – one_to_one/1:1: are merge keys unique in both DataFrames. – one_to_many/1:m: are merge keys unique in the left DataFrame. – many_to_one/m:1: are merge keys unique in right DataFrame. – many_to_many/m:m: allowed but does not check.

For these examples, merge using various options appear below (see above for details).

The inner option returns rows df1 and df2 that share a common key.

Code – Inner Join

df1 = pd.DataFrame({'key': ['Germany', 'France', 'Spain'], 'value': [83783942, 67081000, 47431256]}) df2 = pd.DataFrame({'key': ['DE', 'France', 'Spain'], 'value': [49, 33, 34]})

result = df1.merge(df2, on='key', how='inner') print(result)

Line [1] creates a DataFrame from a Dictionary of Lists and saves it to df1.
Line [2] creates a DataFrame from a Dictionary of Lists and saves it to df2.
Line [3] merges the two (2) DataFrames on the key using the inner option.
Line [4] outputs the result to the terminal.

Output

| | key | value_x | value_y | | | ----- | -------- | -------- | -- | | 0 | France | 67081000 | 33 | | 1 | Spain | 47431256 | 34 |

The details for Germany (in df1) and DE (in df2) do not display.

Note: To rename the column names, use the rename() method.

The outer option uses the keys from the DataFrames (df1 and df2). A NaN value adds for missing data in both DataFrames.

Code – Outer Join

df1 = pd.DataFrame({'key': ['Germany', 'France', 'Spain'], 'value': [83783942, 67081000, 47431256]}) df2 = pd.DataFrame({'key': ['DE', 'France', 'Spain'], 'value': [49, 33, 34]})

result = df1.merge(df2, on='key', how='outer') print(result)

Line [1] creates a DataFrame from a Dictionary of Lists and saves it to df1.
Line [2] creates a DataFrame from a Dictionary of Lists and saves it to df2.
Line [3] merges the two (2) DataFrames on the key using the outer option.
Line [4] outputs the result to the terminal.

Output

| | key | value_x | value_y | | | ----- | -------- | ---------- | ---- | | 0 | Germany | 83783942.0 | NaN | | 1 | France | 67081000.0 | 33.0 | | 2 | Spain | 47431256.0 | 34.0 | | 3 | DE | NaN | 49.0 |

Details for Germany (value_y field) and DE (value_x field) display NaN values.

The left option uses the keys from df1. A NaN value adds for missing data in both DataFrames.

Code – Left Join

df1 = pd.DataFrame({'key': ['Germany', 'France', 'Spain'], 'value': [83783942, 67081000, 47431256]}) df2 = pd.DataFrame({'key': ['DE', 'France', 'Spain'], 'value': [49, 33, 34]})

result = df1.merge(df2, on='key', how='left') print(result)

Line [1] creates a DataFrame from a Dictionary of Lists and saves it to df1.
Line [2] creates a DataFrame from a Dictionary of Lists and saves it to df2.
Line [3] merges the two (2) DataFrames on the key using the left option.
Line [4] outputs the result to the terminal.

Output:

| | key | value_x | value_y | | | ----- | -------- | ---------- | ---- | | 0 | Germany | 83783942.0 | NaN | | 1 | France | 67081000.0 | 33.0 | | 2 | Spain | 47431256.0 | 34.0 |

The row DE does not display. The Germany value_y field displays NaN.

The right option takes keys from the df2. Missing data from df1 changes to NaN value(s).

Code – Right Join

df1 = pd.DataFrame({'key': ['Germany', 'France', 'Spain'], 'value': [83783942, 67081000, 47431256]}) df2 = pd.DataFrame({'key': ['DE', 'France', 'Spain'], 'value': [49, 33, 34]})

result = df1.merge(df2, on='key', how='right') print(result)

Line [1] creates a DataFrame from a Dictionary of Lists and saves it to df1.
Line [2] creates a DataFrame from a Dictionary of Lists and saves it to df2.
Line [3] merges the two (2) DataFrames on the key using the right option.
Line [4] outputs the result to the terminal.

Output

| | key | value_x | value_y | | | ----- | -------- | ---------- | -- | | 0 | DE | NaN | 49 | | 1 | France | 67081000.0 | 33 | | 2 | Spain | 47431256.0 | 34 |

The row for Germany does not display. The value_x of DE displays NaN.

More Pandas DataFrame Methods

Feel free to learn more about the previous and next pandas DataFrame methods (alphabetically) here:

Also, check out the full cheat sheet overview of all Pandas DataFrame methods.