Data Cleaning

Last Updated : 28 Apr, 2026

Data cleaning is the process of preparing raw data by detecting and correcting errors so it can be effectively used for analysis. It is a foundational step in data preprocessing that ensures datasets are suitable for analytical, statistical and machine learning tasks.

Common Data Anomalies

Data quality issues can arise from human errors, system failures or problems during data collection and integration. Common examples, each addressed in the process below, include duplicate records, structural inconsistencies, missing values and outliers.

Data Cleaning Process

1. Assess Data Quality

The first step in data cleaning is to assess the quality of your data. This involves checking for issues such as duplicate records, inconsistent formats, missing values and outliers.

Figure: Imperfect DataFrame

After assessing data quality, several issues can be identified in the dataset, including duplicate rows and a missing value in the 'Name' column.

2. Remove Irrelevant Data

Removing irrelevant or duplicate data ensures the dataset is clean, accurate and meaningful, preventing skewed analysis and improving overall quality.

Figure: Remove Irrelevant Data

In the deduplicated DataFrame, the duplicate rows 1 and 6 have been removed so that each record is unique.
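A minimal sketch of deduplication with pandas. The DataFrame below is hypothetical (its column names and values are invented for illustration and are not the data shown in the figure):

```python
import pandas as pd

# Hypothetical records; the third row is an exact duplicate of the first
records = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Carol'],
    'Age': [30, 25, 30, 41],
})

# duplicated() flags every row that repeats an earlier row
duplicate_mask = records.duplicated()

# drop_duplicates() keeps the first occurrence of each row by default
deduplicated = records.drop_duplicates().reset_index(drop=True)

print('Duplicate rows found:', duplicate_mask.sum())
print(deduplicated)
```

Passing `subset=[...]` to either method restricts the comparison to selected columns, which is useful when only some fields define a record's identity.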

3. Fix Structural Errors

Structural errors occur when data formats, naming conventions or variable types are inconsistent, which can affect analysis accuracy. Correcting these issues ensures uniform and reliable data representation.

Figure: Fix Structural Errors (DataFrame with standardized date format)
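As an illustration, the sketch below fixes two typical structural problems: inconsistent text labels and dates stored as strings. The `City` and `Joined` columns are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical data with inconsistent casing, stray whitespace and string-typed dates
raw = pd.DataFrame({
    'City': [' new york', 'New York ', 'NEW YORK'],
    'Joined': ['2023-01-15', '2023-02-01', '2023-03-20'],
})

cleaned = raw.copy()

# Trim whitespace and unify capitalization so equal labels compare equal
cleaned['City'] = cleaned['City'].str.strip().str.title()

# Parse the date strings into a proper datetime column
cleaned['Joined'] = pd.to_datetime(cleaned['Joined'], format='%Y-%m-%d')

print(cleaned)
print(cleaned.dtypes)
```

After these two steps the three spellings collapse into a single label, and the date column supports date arithmetic and sorting.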

4. Handle Missing Data

Missing data can introduce bias and reduce the reliability of analysis. Properly addressing missing values helps maintain the integrity of your dataset.

Figure: Handle Missing Data

The missing value in the 'Name' column (row 7) has been filled with 'Unknown' to indicate unavailable data, ensuring the dataset remains complete and consistent.
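The same idea can be sketched in pandas. The toy DataFrame is hypothetical, and filling the numeric column with its median is one possible choice among several, not a rule taken from the text:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing name and one missing age
people = pd.DataFrame({
    'Name': ['Alice', None, 'Carol'],
    'Age': [30.0, 25.0, np.nan],
})

filled = people.copy()

# Text column: mark unavailable values explicitly
filled['Name'] = filled['Name'].fillna('Unknown')

# Numeric column: impute with the median of the observed values
filled['Age'] = filled['Age'].fillna(filled['Age'].median())

print(filled)
print(filled.isna().sum())
```

Whether to fill, drop or flag missing values depends on how much data is missing and why, so the imputation rule should be chosen per column.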

5. Normalize Data

Data normalization organizes the dataset to reduce redundancy and ensure consistency, making it easier to manage and analyze.

Figure: Normalize Data (normalized scores)
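The passage does not name a specific technique. One common interpretation for numeric columns is rescaling values to a shared range, and the sketch below assumes that reading, using a hypothetical `Score` column:

```python
import pandas as pd

# Hypothetical scores on an arbitrary scale
scores = pd.DataFrame({'Score': [50, 75, 100]})

score_min = scores['Score'].min()
score_max = scores['Score'].max()

# Min-max normalization: map the smallest value to 0 and the largest to 1
scores['Score_normalized'] = (scores['Score'] - score_min) / (score_max - score_min)

print(scores)
```

The formula divides by the range, so it is undefined for a column whose values are all identical; such columns need separate handling.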

6. Identify and Manage Outliers

Outliers are data points that deviate significantly from the rest of the dataset and can affect analysis accuracy. Properly handling them ensures more reliable insights.

Figure: Manage Outliers
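One widely used way to flag outliers is the interquartile range (IQR) rule, shown here as an alternative to the mean ± 2 standard deviations rule applied in the implementation section. The values are hypothetical, and the 1.5 multiplier is the conventional default rather than something specified in the text:

```python
import pandas as pd

# Hypothetical measurements with one extreme value
values = pd.Series([10, 12, 11, 13, 12, 95], name='value')

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points more than 1.5 * IQR beyond the quartiles are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
without_outliers = values[~is_outlier]

print('Bounds:', lower, upper)
print(without_outliers.tolist())
```

Because quartiles are not pulled toward extreme values the way the mean and standard deviation are, the IQR bounds tend to be more stable on skewed data.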

Implementation for Data Cleaning

Let's walk through each step of data cleaning using the Titanic dataset.

You can download the dataset from here.

Step 1: Import Libraries and Load Dataset

We import the necessary libraries, pandas and NumPy, and load the dataset.

```python
import pandas as pd
import numpy as np

# Load the Titanic dataset and inspect its structure
df = pd.read_csv('Titanic-Dataset.csv')
df.info()
df.head()
```


Step 2: Check for Duplicate Rows

```python
# Returns a boolean Series: True for each row that repeats an earlier row
df.duplicated()
```

**Output:**

Figure: Duplicated Data

Step 3: Identify Column Data Types

```python
# Columns with dtype 'object' are treated as categorical, all others as numerical
cat_col = [col for col in df.columns if df[col].dtype == 'object']
num_col = [col for col in df.columns if df[col].dtype != 'object']

print('Categorical columns:', cat_col)
print('Numerical columns:', num_col)
```

**Output:**

Figure: Column Data Types

Step 4: Count Unique Values in the Categorical Columns

```python
# Number of distinct values in each categorical column
df[cat_col].nunique()
```

**Output:**

Figure: Unique Values

Step 5: Calculate Missing Values as Percentage

```python
# Share of missing values per column, as a percentage of all rows
round((df.isnull().sum() / df.shape[0]) * 100, 2)
```

**Output:**

Figure: Missing Value Percentage
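To make the calculation concrete, here is a small sketch on hypothetical data (the values are invented and are not the Titanic figures). It uses `isna().mean()`, which gives the same fraction as dividing the missing count by the number of rows:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 2 of 4 ages and 1 of 4 fares are missing
sample = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, np.nan],
    'Fare': [7.25, 71.28, np.nan, 8.05],
})

# The mean of a boolean mask is the fraction of True values
missing_pct = (sample.isna().mean() * 100).round(2)

print(missing_pct)
```

Columns with a very high percentage are candidates for dropping, while columns with only a few gaps are usually better imputed.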

Step 6: Drop Irrelevant Columns and Columns with Many Missing Values

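```python
# Drop identifier-like columns and the mostly empty 'Cabin' column
df1 = df.drop(columns=['Name', 'Ticket', 'Cabin'])

# Remove the few rows with no 'Embarked' value
df1.dropna(subset=['Embarked'], inplace=True)

# Fill missing ages with the mean age
df1['Age'] = df1['Age'].fillna(df1['Age'].mean())
```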

Step 7: Detect Outliers with Box Plot

```python
import matplotlib.pyplot as plt

# Horizontal box plot of 'Age'; points beyond the whiskers are potential outliers
plt.boxplot(df1['Age'], vert=False)
plt.ylabel('Variable')
plt.xlabel('Age')
plt.title('Box Plot')
plt.show()
```

**Output:**

Figure: Boxplot

Step 8: Calculate Outlier Boundaries and Remove Them

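```python
mean = df1['Age'].mean()
std = df1['Age'].std()

# Treat values more than 2 standard deviations from the mean as outliers
lower_bound = mean - 2 * std
upper_bound = mean + 2 * std

# Keep only the rows whose age lies within the bounds
df2 = df1[(df1['Age'] >= lower_bound) & (df1['Age'] <= upper_bound)]
```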
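The same bound calculation is repeated in Step 10, so it can be wrapped in a small helper. A minimal sketch follows, where `remove_outliers_std` is a hypothetical name introduced for illustration and the toy ages stand in for the Titanic data:

```python
import pandas as pd


def remove_outliers_std(frame, column, n_std=2):
    """Return the rows whose `column` value lies within n_std standard deviations of the mean."""
    mean = frame[column].mean()
    std = frame[column].std()
    lower_bound = mean - n_std * std
    upper_bound = mean + n_std * std
    return frame[(frame[column] >= lower_bound) & (frame[column] <= upper_bound)]


# Hypothetical ages with one implausible value
ages = pd.DataFrame({'Age': [20, 22, 24, 26, 28, 30, 32, 34, 36, 200]})
filtered = remove_outliers_std(ages, 'Age')

print(len(ages), '->', len(filtered))
```

Note that an extreme value inflates both the mean and the standard deviation, which widens the bounds. This is one reason the bounds are worth recomputing after a first pass.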

Step 9: Impute Missing Data Again if Any

`fillna()` is applied again to the filtered data to handle any remaining missing values.

```python
# Fill only the 'Age' column, so other columns are not overwritten with the mean age
df3 = df2.fillna({'Age': df2['Age'].mean()})
df3.isnull().sum()
```

**Output:**

Figure: Missing Value

Step 10: Recalculate Outlier Bounds and Remove Outliers from the Updated Data

```python
mean = df3['Age'].mean()
std = df3['Age'].std()

# Recompute the 2-standard-deviation bounds on the updated data
lower_bound = mean - 2 * std
upper_bound = mean + 2 * std

print('Lower Bound :', lower_bound)
print('Upper Bound :', upper_bound)

df4 = df3[(df3['Age'] >= lower_bound) & (df3['Age'] <= upper_bound)]
```

**Output:**

Figure: Outlier Check

Step 11: Data Validation and Verification

Data validation and verification involve ensuring that the data is accurate and consistent by comparing it with external sources or expert knowledge.

```python
# Separate the feature columns (X) from the target column (Y)
X = df3[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
Y = df3['Survived']
```
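Validation rules can also be expressed directly in code. The sketch below checks a few columns against plausible ranges. The thresholds (passenger class in 1 to 3, age between 0 and 120, non-negative fare) and the toy rows are assumptions chosen for illustration, not rules taken from the dataset's documentation:

```python
import pandas as pd

# Hypothetical rows; row 1 has a negative age and row 3 an unknown class
passengers = pd.DataFrame({
    'Pclass': [1, 3, 2, 5],
    'Age': [29.0, -4.0, 41.0, 35.0],
    'Fare': [71.28, 7.25, 13.0, 8.05],
})

# One boolean Series per rule: True where the value is acceptable
rules = pd.DataFrame({
    'Pclass': passengers['Pclass'].isin([1, 2, 3]),
    'Age': passengers['Age'].between(0, 120),
    'Fare': passengers['Fare'] >= 0,
})

# A row is invalid if it breaks at least one rule
invalid_rows = passengers[~rules.all(axis=1)]

print(invalid_rows)
```

Rows that fail a rule can then be reviewed against an external source or corrected with domain knowledge instead of being silently dropped.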

Step 12: Data Formatting

Data formatting involves converting the data into a standard format or structure that can be easily processed by the algorithms or models used for analysis. Here we will discuss commonly used data formatting techniques i.e. Scaling and Normalization.

**1. Min-Max Scaling:** Scaling transforms the values of features to a specific range. Min-Max scaling rescales the values to a specified range, typically between 0 and 1. It preserves the shape of the original distribution while mapping the minimum value to 0 and the maximum value to 1.

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))

# Scale only the numeric columns; work on a copy so X itself is left unchanged
num_col_ = [col for col in X.columns if X[col].dtype != 'object']
x1 = X.copy()
x1[num_col_] = scaler.fit_transform(x1[num_col_])
x1.head()
```

**Output:**

Figure: Min-Max Scaling

**2. Standardization (Z-score scaling):** Standardization transforms the values to have a mean of 0 and a standard deviation of 1. It centers the data around the mean and scales it based on the standard deviation. Standardization makes the data more suitable for algorithms that assume a Gaussian distribution or require features to have zero mean and unit variance.

Z = (X - μ) / σ

Where X is the original value, μ is the mean of the feature and σ is its standard deviation.
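The formula can be applied directly with pandas. This is a minimal sketch on a hypothetical `Fare` column. It uses the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses:

```python
import pandas as pd

# Hypothetical fares
fares = pd.DataFrame({'Fare': [10.0, 20.0, 30.0]})

mu = fares['Fare'].mean()
sigma = fares['Fare'].std(ddof=0)  # population standard deviation

# Z = (X - mu) / sigma
fares['Fare_z'] = (fares['Fare'] - mu) / sigma

print(fares)
print('Mean after scaling:', fares['Fare_z'].mean())
print('Std after scaling:', fares['Fare_z'].std(ddof=0))
```

Unlike Min-Max scaling, the result is not bounded to a fixed range, so strong outliers remain visible as large positive or negative Z values.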

You can download the source code from here.

Data Cleaning Strategies

Advantages

Disadvantages