Working with Missing Data in Pandas (original) (raw)

In Pandas, missing values often arise from uncollected data or incomplete entries. This article explores how to detect, handle and fill missing values in a DataFrame, ensuring clean and accurate data for analysis.

In Pandas, missing values are represented as:

Checking for Missing Values in Pandas DataFrame

For finding the missing values and handling them, Pandas gives us two convenient functions: isnull() and notnull(). They assist us in detecting if a value is NaN or not, which facilitates data cleaning and preprocessing in a DataFrame or Series.

1. Checking for Missing Values Using isnull()

**isnull() returns a DataFrame of Boolean values, where **True represents missing data (**NaN). This is handy if you want to find and fill missing data in a dataset.

**Example 1: Finding Missing Values in a DataFrame

Python `

Importing pandas and numpy

import pandas as pd import numpy as np

Sample DataFrame with missing values

d = {'First Score': [100, 90, np.nan, 95], 'Second Score': [30, 45, 56, np.nan], 'Third Score': [np.nan, 40, 80, 98]} df = pd.DataFrame(d)

Checking for missing values using isnull()

mv = df.isnull()

print(mv)

`

**Output

**Example 2: Filtering Data Based on Missing Values

Here, isnull() method is used over the “Gender” column in order to filter and print out rows containing missing gender data.

Python `

import pandas as pd d = pd.read_csv("employees.csv")

bool_series = pd.isnull(d["Gender"]) missing_gender_data = d[bool_series] print(missing_gender_data)

`

**Output

2. Checking for Non-Missing Values Using notnull()

**notnull() returns a DataFrame containing Boolean values with True representing non-missing data. The function comes in handy when you need to pay attention only to the rows that have valid, non-missing data.

**Example 3: Identifying Non-Missing Values in a DataFrame

Python `

import pandas as pd import numpy as np

Sample DataFrame with missing values

d = {'First Score': [100, 90, np.nan, 95], 'Second Score': [30, 45, 56, np.nan], 'Third Score': [np.nan, 40, 80, 98]} df = pd.DataFrame(d)

Checking for non-missing values using notnull()

nmv = df.notnull()

print(nmv)

`

**Output

**Example 4: Filtering Data with Non-Missing Values

This code snippet employs the notnull() function to filter rows for which the “Gender” column does not contain missing values.

Python `

import pandas as pd d = pd.read_csv("employees.csv")

Identifying non-missing values in the 'Gender' column

nmg = pd.notnull(d["Gender"])

Filtering rows where 'Gender' is not missing

nmgd= d[nmg]

display(nmgd)

`

**Output

Filling Missing Values in Pandas Using fillna(), replace() and interpolate()

When dealing with missing data in Pandas, the **fillna(), **replace(), and interpolate() functions are frequently employed to fill NaN values. These functions enable you to replace missing values with a certain value or apply interpolation methods.

1. Filling Missing Values with a Specific Value Using fillna()

**fillna() is a function to replace missing values ****(NaN)** with a given value. For instance, you can replace missing values with 0

**Example: Fill Missing Values with Zero

Python `

import pandas as pd import numpy as np

d = {'First Score': [100, 90, np.nan, 95], 'Second Score': [30, 45, 56, np.nan], 'Third Score': [np.nan, 40, 80, 98]} df = pd.DataFrame(d)

Filling missing values with 0

df.fillna(0)

`

**Output

**2. Filling Missing Values with the Prev/Next Value Using fillna

You can use the pad method to fill missing values with the previous value or **bfill to fill with the next value. We will be using the above dataset for the demonstration.

**Example: Fill with Previous Value (Forward Fill)

Python `

df.fillna(method='pad') # Forward fill

`

**Output

**Example: Fill with Next Value (Backward Fill)

Python `

df.fillna(method='bfill') # Backward fill

`

**Output

**Example: Fill NaN Values with ‘No Gender’ using fillna()

Download the csv file from here.

Python `

import pandas as pd import numpy as np d = pd.read_csv("employees.csv")

Print records from 10th row to 24th row

d[10:25]

`

**Output

Now we are going to fill all the null values in Gender column with “No Gender”

Python `

filling a null values using fillna()

d["Gender"].fillna('No Gender', inplace = True) d[10:25]

`

**Output

3. Replacing Missing Values Using replace()

Use replace() to replace **NaN values with a specific value like **-99.

**Example: Replace NaN with -99

Python `

import pandas as pd import numpy as np

data = pd.read_csv("employees.csv") data[10:25]

`

**Output

Now, we are going to replace the all Nan value in the data frame with -99 value.

Python `

data.replace(to_replace=np.nan, value=-99)

`

**Output

4. Filling Missing Values Using interpolate()

The interpolate() function fills missing values using interpolation techniques, such as the linear method.

**Example: Linear Interpolation

Python `

import pandas as pd

Creating the dataframe

df = pd.DataFrame({"A": [12, 4, 5, None, 1], "B": [None, 2, 54, 3, None], "C": [20, 16, None, 3, 8], "D": [14, 3, None, None, 6]})

Print the dataframe

print(df)

`

**Output

Let’s interpolate the missing values using Linear method. Note that Linear method ignore the index and treat the values as equally spaced.

Python `

to interpolate the missing values

df.interpolate(method ='linear', limit_direction ='forward')

`

**Output

This method fills missing values by treating the data as equally spaced.

Dropping Missing Values in Pandas Using dropna()

The **dropna()function in Pandas removes rows or columns with NaN values. It can be used to drop data based on different conditions.

1. Dropping Rows with At Least One Null Value

Use **dropna() to remove rows that contain at least one missing value.

**Example: Drop Rows with At Least One NaN

Python `

import pandas as pd import numpy as np

dict = {'First Score': [100, 90, np.nan, 95], 'Second Score': [30, np.nan, 45, 56], 'Third Score': [52, 40, 80, 98], 'Fourth Score': [np.nan, np.nan, np.nan, 65]} df = pd.DataFrame(dict)

Drop rows with at least one missing value

df.dropna()

`

**Output

**2. Dropping Rows with All Null Values

You can drop rows where all values are missing using **dropna(how=’all’).

**Example: Drop Rows with All NaN Values

Python `

dict = {'First Score': [100, np.nan, np.nan, 95], 'Second Score': [30, np.nan, 45, 56], 'Third Score': [52, np.nan, 80, 98], 'Fourth Score': [np.nan, np.nan, np.nan, 65]} df = pd.DataFrame(dict)

Drop rows where all values are missing

df.dropna(how='all')

`

**Output

3. Dropping Columns with At Least One Null Value

To remove columns that contain at least one missing value, use **dropna(axis=1).

**Example: Drop Columns with At Least One NaN

Python `

dict = {'First Score': [100, np.nan, np.nan, 95], 'Second Score': [30, np.nan, 45, 56], 'Third Score': [52, np.nan, 80, 98], 'Fourth Score': [60, 67, 68, 65]} df = pd.DataFrame(dict)

Drop columns with at least one missing value

df.dropna(axis=1)

`

**Output

4. Dropping Rows with Missing Values in CSV Files

When working with data from CSV files, you can drop rows with missing values using **dropna().

**Example: Drop Rows with NaN in a CSV File

Python `

import pandas as pd d = pd.read_csv("employees.csv")

Drop rows with any missing value

nd = d.dropna(axis=0, how='any')

Compare lengths of original and new dataframes

print("Old data frame length:", len(d)) print("New data frame length:", len(nd)) print("Rows with at least one missing value:", (len(d) - len(nd)))

`

**Output

Old data frame length: 1000
New data frame length: 764
Rows with at least one missing value: 236

Since the difference is 236, there were 236 rows which had at least 1 Null value in any column.

WorkingwithMissingData-min

Working with Missing Data