Pandas dataframe.drop_duplicates()

Last Updated : 25 Nov, 2024

The Pandas **drop_duplicates()** method removes duplicate rows from a DataFrame, based either on all columns or on specific ones.

By default, drop_duplicates() compares entire rows and removes every subsequent occurrence of a duplicate, retaining only the first instance. Let’s see a quick example:

```python
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
    "City": ["NY", "LA", "NY", "Chicago"]
}
df = pd.DataFrame(data)
print(df)

# Removing duplicates
unique_df = df.drop_duplicates()
print(unique_df)
```

**Output:**

[Figure: Pandas dataframe.drop_duplicates()]

This example shows pandas.DataFrame.drop_duplicates() removing the repeated row (index 2) while retaining its first occurrence (index 0).

**dataframe.drop_duplicates() Syntax in Python:**

**Syntax:** DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

**Parameters:**

**subset:** A column label or a list of column labels (optional). The default is None, which means all columns are compared. When columns are passed, only those columns are used to identify duplicates.

**keep:** Controls which occurrence of a duplicate is kept. It accepts three values, and the default is 'first'.

If **'first'**, the first occurrence is kept and the later occurrences are dropped.

If **'last'**, the last occurrence is kept and the earlier occurrences are dropped.

If **False**, every occurrence is treated as a duplicate and all of them are dropped.

**inplace:** Boolean, default False. If True, the duplicate rows are removed from the original DataFrame and the method returns None.

**ignore_index:** Boolean, default False. If True, the remaining rows are relabeled 0, 1, ..., n - 1 (available in pandas 1.0 and later).

**Return type:** A DataFrame with duplicate rows removed according to the arguments passed, or None if inplace=True.
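As a quick illustration of the keep parameter, the sketch below runs drop_duplicates() once per accepted value and records which index labels survive. The small DataFrame and the `results` dictionary are made up for this illustration.

```python
import pandas as pd

# Illustrative data: rows 0 and 2 are identical.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
})

# Collect the surviving index labels for each keep option.
results = {}
for keep in ("first", "last", False):
    results[keep] = df.drop_duplicates(keep=keep).index.tolist()

print(results["first"])  # [0, 1, 3]
print(results["last"])   # [1, 2, 3]
print(results[False])    # [1, 3]
```

The original index labels are preserved in each case, which makes it easy to see which occurrence was kept.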

Python dataframe.drop_duplicates(): Examples

Duplicate rows can arise from merging datasets, incorrect data entry, or other causes. The drop_duplicates() method **works by identifying duplicates** based on all columns (the default) or on specified columns, and removing them as required. The examples below show the drop_duplicates() method in use:
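To make the "merging datasets" case concrete, here is a minimal sketch in which two overlapping batches are combined with pd.concat and then deduplicated. The names `batch_a` and `batch_b` and their contents are hypothetical.

```python
import pandas as pd

# Two hypothetical batches that both contain the row for Bob.
batch_a = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})
batch_b = pd.DataFrame({"Name": ["Bob", "David"], "Age": [30, 40]})

# Stacking the batches produces a repeated row.
combined = pd.concat([batch_a, batch_b], ignore_index=True)
print(len(combined))  # 4

# drop_duplicates() keeps the first copy of the repeated row.
deduped = combined.drop_duplicates()
print(deduped["Name"].tolist())  # ['Alice', 'Bob', 'David']
```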

**1. Dropping Duplicates Based on Specific Columns**

You can target duplicates in specific columns using the subset parameter. This helps when certain fields are more relevant for identifying duplicates.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'SF', 'Chicago']
})

# Drop duplicates based on the 'Name' column
result = df.drop_duplicates(subset=['Name'])
print(result)
```

Output

```
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago
```

Here, duplicates are removed based **solely on the Name column**, ignoring the other fields, so the second Alice row is dropped even though its City differs. This is helpful when specific columns uniquely identify rows.
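The subset parameter also accepts several columns at once, in which case rows count as duplicates only when they match on every listed column. The following sketch, using made-up data, contrasts a one-column subset with a two-column subset.

```python
import pandas as pd

# Illustrative data: rows 0 and 3 share both Name and City.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Alice"],
    "City": ["NY", "LA", "SF", "NY"],
    "Age": [25, 30, 25, 26],
})

# Matching on Name alone drops every later Alice row.
by_name = df.drop_duplicates(subset="Name")
print(by_name.index.tolist())  # [0, 1]

# Matching on Name and City drops only the repeated (Alice, NY) row.
by_name_city = df.drop_duplicates(subset=["Name", "City"])
print(by_name_city.index.tolist())  # [0, 1, 2]
```

Columns outside the subset (Age here) are ignored for the comparison but are still returned in the result.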

**2. Keeping the Last Occurrence**

By default, drop_duplicates() retains the first occurrence of each duplicate. To retain the last occurrence instead, pass keep='last'.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'NY', 'Chicago']
})

# Keep the last occurrence of duplicates
result = df.drop_duplicates(keep='last')
print(result)
```

Output

```
    Name  Age     City
1    Bob   30       LA
2  Alice   25       NY
3  David   40  Chicago
```

The keep='last' argument ensures the last occurrence of each duplicate is retained instead of the first.
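Because "last" refers to row order, keep='last' is often paired with a sort. The sketch below assumes a hypothetical `Updated` timestamp column and keeps the most recent record per Name. This is one possible pattern, not the only way to do it.

```python
import pandas as pd

# Hypothetical records; the newest Alice row happens to come first.
records = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice"],
    "City": ["NY", "LA", "SF"],
    "Updated": pd.to_datetime(["2024-03-10", "2024-02-01", "2024-01-05"]),
})

# Sort oldest to newest, then keep the last (newest) row for each Name.
latest = (
    records.sort_values("Updated")
    .drop_duplicates(subset="Name", keep="last")
)

print(latest["City"].tolist())  # ['LA', 'NY']
```

Without the sort, keep='last' would retain the Alice row with City 'SF', which is the oldest record in this data.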

**3. Dropping All Duplicates**

To remove every row that has a duplicate, including the first occurrence, use keep=False. Only rows that appear exactly once remain.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'NY', 'Chicago']
})

# Drop all duplicates
result = df.drop_duplicates(keep=False)
print(result)
```

Output

```
    Name  Age     City
1    Bob   30       LA
3  David   40  Chicago
```

With keep=False, all occurrences of duplicate rows are removed, leaving only rows that are entirely unique across all columns.
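Before discarding rows with keep=False, it can help to look at what would be removed. The sketch below uses the related DataFrame.duplicated() method, which returns a boolean mask, to split the data into repeated and unique rows. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
    "City": ["NY", "LA", "NY", "Chicago"],
})

# Mark every row that has at least one identical copy elsewhere.
is_repeated = df.duplicated(keep=False)

repeated_rows = df[is_repeated]  # rows that keep=False would drop
unique_rows = df[~is_repeated]   # rows that keep=False would keep

print(repeated_rows.index.tolist())  # [0, 2]
print(unique_rows.equals(df.drop_duplicates(keep=False)))  # True
```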

**4. Modifying the Original DataFrame Directly**

To modify the original DataFrame directly without creating a new one, use inplace=True.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'NY', 'Chicago']
})

# Modify the DataFrame in place
df.drop_duplicates(inplace=True)
print(df)
```

Output

```
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago
```

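Using inplace=True modifies the original DataFrame directly and returns None, so the result should not be assigned to a variable. It avoids the need for a new name, but it is not guaranteed to save memory, since pandas may still build the deduplicated data internally before updating the original.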
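A common slip is writing `df = df.drop_duplicates(inplace=True)`, which replaces the DataFrame with None. The sketch below shows the return value, and also uses the `ignore_index` parameter (assuming pandas 1.0 or later) to renumber the remaining rows. The variable names are illustrative.

```python
import pandas as pd

original = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
})

# inplace=True changes the object itself and returns None.
in_place = original.copy()
returned = in_place.drop_duplicates(inplace=True)
print(returned)                 # None
print(in_place.index.tolist())  # [0, 1, 3]

# ignore_index=True relabels the surviving rows from 0 upward.
renumbered = original.drop_duplicates(ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```

Assigning the result of the default call, as in `df = df.drop_duplicates()`, is an equally valid way to get the same rows.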