Working with Missing Data in Pandas (original) (raw)
In Pandas, missing values often arise from uncollected data or incomplete entries. This article explores how to detect, handle and fill missing values in a DataFrame, ensuring clean and accurate data for analysis.
In Pandas, missing values are represented as:
None
: A Python object commonly used to represent missing values in object-type arrays.NaN
: A special floating-point value from NumPy, which is recognized by all systems that use IEEE floating-point standards.
Checking for Missing Values in Pandas DataFrame
For finding the missing values and handling them, Pandas gives us two convenient functions: isnull() and notnull(). They assist us in detecting if a value is NaN or not, which facilitates data cleaning and preprocessing in a DataFrame or Series.
1. Checking for Missing Values Using isnull()
**isnull() returns a DataFrame of Boolean values, where **True represents missing data (**NaN). This is handy if you want to find and fill missing data in a dataset.
**Example 1: Finding Missing Values in a DataFrame
Python `
Importing pandas and numpy
import pandas as pd import numpy as np
Sample DataFrame with missing values
d = {'First Score': [100, 90, np.nan, 95], 'Second Score': [30, 45, 56, np.nan], 'Third Score': [np.nan, 40, 80, 98]} df = pd.DataFrame(d)
Checking for missing values using isnull()
mv = df.isnull()
print(mv)
`
**Output
**Example 2: Filtering Data Based on Missing Values
Here, isnull() method is used over the “Gender” column in order to filter and print out rows containing missing gender data.
Python `
import pandas as pd d = pd.read_csv("employees.csv")
bool_series = pd.isnull(d["Gender"]) missing_gender_data = d[bool_series] print(missing_gender_data)
`
**Output
2. Checking for Non-Missing Values Using notnull()
**notnull() returns a DataFrame containing Boolean values with True representing non-missing data. The function comes in handy when you need to pay attention only to the rows that have valid, non-missing data.
**Example 3: Identifying Non-Missing Values in a DataFrame
Python `
import pandas as pd import numpy as np
Sample DataFrame with missing values
d = {'First Score': [100, 90, np.nan, 95], 'Second Score': [30, 45, 56, np.nan], 'Third Score': [np.nan, 40, 80, 98]} df = pd.DataFrame(d)
Checking for non-missing values using notnull()
nmv = df.notnull()
print(nmv)
`
**Output
**Example 4: Filtering Data with Non-Missing Values
This code snippet employs the notnull() function to filter rows for which the “Gender” column does not contain missing values.
Python `
import pandas as pd d = pd.read_csv("employees.csv")
Identifying non-missing values in the 'Gender' column
nmg = pd.notnull(d["Gender"])
Filtering rows where 'Gender' is not missing
nmgd= d[nmg]
display(nmgd)
`
**Output
Filling Missing Values in Pandas Using fillna(), replace() and interpolate()
When dealing with missing data in Pandas, the **fillna(), **replace(), and interpolate() functions are frequently employed to fill NaN values. These functions enable you to replace missing values with a certain value or apply interpolation methods.
1. Filling Missing Values with a Specific Value Using fillna()
**fillna() is a function to replace missing values ****(NaN)** with a given value. For instance, you can replace missing values with 0
**Example: Fill Missing Values with Zero
Python `
import pandas as pd import numpy as np
d = {'First Score': [100, 90, np.nan, 95], 'Second Score': [30, 45, 56, np.nan], 'Third Score': [np.nan, 40, 80, 98]} df = pd.DataFrame(d)
Filling missing values with 0
df.fillna(0)
`
**Output
**2. Filling Missing Values with the Prev/Next Value Using fillna
You can use the pad method to fill missing values with the previous value or **bfill to fill with the next value. We will be using the above dataset for the demonstration.
**Example: Fill with Previous Value (Forward Fill)
Python `
df.fillna(method='pad') # Forward fill
`
**Output
**Example: Fill with Next Value (Backward Fill)
Python `
df.fillna(method='bfill') # Backward fill
`
**Output
**Example: Fill NaN Values with ‘No Gender’ using fillna()
Download the csv file from here.
Python `
import pandas as pd import numpy as np d = pd.read_csv("employees.csv")
Print records from 10th row to 24th row
d[10:25]
`
**Output
Now we are going to fill all the null values in Gender column with “No Gender”
Python `
filling a null values using fillna()
d["Gender"].fillna('No Gender', inplace = True) d[10:25]
`
**Output
3. Replacing Missing Values Using replace()
Use replace() to replace **NaN values with a specific value like **-99.
**Example: Replace NaN with -99
Python `
import pandas as pd import numpy as np
data = pd.read_csv("employees.csv") data[10:25]
`
**Output
Now, we are going to replace the all Nan value in the data frame with -99 value.
Python `
data.replace(to_replace=np.nan, value=-99)
`
**Output
4. Filling Missing Values Using interpolate()
The interpolate() function fills missing values using interpolation techniques, such as the linear method.
**Example: Linear Interpolation
Python `
import pandas as pd
Creating the dataframe
df = pd.DataFrame({"A": [12, 4, 5, None, 1], "B": [None, 2, 54, 3, None], "C": [20, 16, None, 3, 8], "D": [14, 3, None, None, 6]})
Print the dataframe
print(df)
`
**Output
Let’s interpolate the missing values using Linear method. Note that Linear method ignore the index and treat the values as equally spaced.
Python `
to interpolate the missing values
df.interpolate(method ='linear', limit_direction ='forward')
`
**Output
This method fills missing values by treating the data as equally spaced.
Dropping Missing Values in Pandas Using dropna()
The **dropna()function in Pandas removes rows or columns with NaN values. It can be used to drop data based on different conditions.
1. Dropping Rows with At Least One Null Value
Use **dropna() to remove rows that contain at least one missing value.
**Example: Drop Rows with At Least One NaN
Python `
import pandas as pd import numpy as np
dict = {'First Score': [100, 90, np.nan, 95], 'Second Score': [30, np.nan, 45, 56], 'Third Score': [52, 40, 80, 98], 'Fourth Score': [np.nan, np.nan, np.nan, 65]} df = pd.DataFrame(dict)
Drop rows with at least one missing value
df.dropna()
`
**Output
**2. Dropping Rows with All Null Values
You can drop rows where all values are missing using **dropna(how=’all’).
**Example: Drop Rows with All NaN Values
Python `
dict = {'First Score': [100, np.nan, np.nan, 95], 'Second Score': [30, np.nan, 45, 56], 'Third Score': [52, np.nan, 80, 98], 'Fourth Score': [np.nan, np.nan, np.nan, 65]} df = pd.DataFrame(dict)
Drop rows where all values are missing
df.dropna(how='all')
`
**Output
3. Dropping Columns with At Least One Null Value
To remove columns that contain at least one missing value, use **dropna(axis=1).
**Example: Drop Columns with At Least One NaN
Python `
dict = {'First Score': [100, np.nan, np.nan, 95], 'Second Score': [30, np.nan, 45, 56], 'Third Score': [52, np.nan, 80, 98], 'Fourth Score': [60, 67, 68, 65]} df = pd.DataFrame(dict)
Drop columns with at least one missing value
df.dropna(axis=1)
`
**Output
4. Dropping Rows with Missing Values in CSV Files
When working with data from CSV files, you can drop rows with missing values using **dropna().
**Example: Drop Rows with NaN in a CSV File
Python `
import pandas as pd d = pd.read_csv("employees.csv")
Drop rows with any missing value
nd = d.dropna(axis=0, how='any')
Compare lengths of original and new dataframes
print("Old data frame length:", len(d)) print("New data frame length:", len(nd)) print("Rows with at least one missing value:", (len(d) - len(nd)))
`
**Output
Old data frame length: 1000
New data frame length: 764
Rows with at least one missing value: 236
Since the difference is 236, there were 236 rows which had at least 1 Null value in any column.
Working with Missing Data