How to Detect Outliers in Machine Learning

Last Updated : 13 Sep, 2025

In machine learning, outliers are data points that deviate significantly from the general distribution of the dataset. They may occur due to errors in data collection, natural variation or rare events. Sometimes they carry useful information, as in fraud detection, but in many cases they reduce model accuracy and skew results, which makes outlier detection a crucial preprocessing step.

Figure: Outliers

Types of Outliers

Outliers can be categorized as:

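1. Global Outliers (Point Anomalies): Individual data points that lie far from the rest of the dataset, such as a single transaction far larger than any other.

2. Contextual Outliers: Data points that are unusual only in a specific context, such as a temperature that is normal in summer but anomalous in winter.

3. Collective Outliers: A group of related data points that is anomalous as a whole, even if each point looks normal on its own.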

Outlier Detection Methods

We will use the Wine Quality dataset to illustrate the different techniques. The code below expects the red wine file, winequality-red.csv, in the working directory.

**Step 1: Import Libraries and Load Dataset**

Here we import NumPy, pandas, Matplotlib, seaborn, scikit-learn and SciPy.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from scipy import stats

# Load the red wine data and take a first look
df = pd.read_csv("winequality-red.csv")
print(df.shape)
print(df.head())

# Keep only the feature columns; "quality" is the target
data = df.drop("quality", axis=1)
```

**Output:**

Figure: Dataset (shape and first five rows)

**Step 2: Visualize**

```python
# One box per feature; points beyond the whiskers are potential outliers
plt.figure(figsize=(12, 6))
sns.boxplot(data=data)
plt.xticks(rotation=45)
plt.title("Boxplots of Wine Features")
plt.show()
```

**Output:**

Figure: Boxplot

In the boxplots, the black dots represent outliers in our dataset. We will now detect them using the following techniques:

1. Z-Score Method

The Z-Score method is a statistical technique that detects outliers based on how far a data point is from the mean, measured in standard deviations. It assumes the data follows a normal distribution. A point with a very high or low Z-score (typically |Z| > 3) is flagged as an outlier because it lies in the extreme tails of the distribution.

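**Formula:**

$$Z = \frac{x - \mu}{\sigma}$$

Where,

- $x$ is the data point
- $\mu$ is the mean of the feature
- $\sigma$ is the standard deviation of the feature

**How it works:** Compares the distance of a point from the mean in units of standard deviation.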

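```python
# Absolute Z-score of every value, computed per column
z_scores = np.abs(stats.zscore(data))

# (row, column) positions where |Z| exceeds 3
outliers_z = np.where(z_scores > 3)

print("Outlier positions (row, col):")
print(list(zip(outliers_z[0][:10], outliers_z[1][:10])))
```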

**Output:**

Outlier positions (row, col):
[(np.int64(13), np.int64(9)), (np.int64(14), np.int64(5)), (np.int64(15), np.int64(5)), . . ., (np.int64(42), np.int64(4))]
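To make the formula concrete, the sketch below computes Z-scores by hand with NumPy on a made-up sample (nineteen readings of 10 and one reading of 100). The function name `zscore_outliers` and the sample values are illustrative assumptions, not part of the wine example. It uses the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return (z_scores, boolean mask) using the population standard deviation."""
    values = np.asarray(values, dtype=float)
    mu = values.mean()
    sigma = values.std()  # ddof=0, the same default scipy.stats.zscore uses
    z = (values - mu) / sigma
    return z, np.abs(z) > threshold

# Made-up sample: nineteen readings of 10 and one reading of 100.
sample = np.array([10.0] * 19 + [100.0])
z, mask = zscore_outliers(sample)
flagged = np.flatnonzero(mask)

print(flagged)          # positions whose |Z| exceeds the threshold
print(round(z[-1], 2))  # Z-score of the extreme reading
```

Note that the extreme value inflates both the mean and the standard deviation it is measured against, so in very small samples a single outlier may never reach |Z| > 3.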

2. IQR Method (Interquartile Range)

The IQR method is a robust statistical approach that identifies outliers by examining the spread of the middle 50% of the data. It calculates the Interquartile Range (IQR), which is the difference between the 75th percentile (Q3) and 25th percentile (Q1). Any value that falls below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR is considered an outlier.

**Formula:**

$$IQR = Q3 - Q1$$

**Outlier Thresholds:**

- Lower bound: $Q1 - 1.5 \times IQR$
- Upper bound: $Q3 + 1.5 \times IQR$

**Intuition:** Values too far below or above the “box” in a boxplot are flagged.

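```python
# Quartiles and IQR, computed per column
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1

# Boolean mask: True where a value falls outside the IQR fences
outliers_iqr = (data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))
print("Number of outliers per column:")
print(outliers_iqr.sum())
```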

**Output:**

Figure: IQR Method (number of outliers per column)
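The IQR rule produces one flag per cell, so a common follow-up (one option among several, not something the article prescribes) is to collapse the mask to rows and drop any row with at least one flagged value. The sketch below shows this on a small made-up DataFrame; the names `toy`, `cell_mask`, `row_mask` and `toy_clean` are hypothetical.

```python
import pandas as pd

# Made-up two-column frame; column "a" contains one extreme value.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# True where a cell falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for its column.
cell_mask = (toy < (q1 - 1.5 * iqr)) | (toy > (q3 + 1.5 * iqr))

# A row is flagged if any of its cells is flagged.
row_mask = cell_mask.any(axis=1)
toy_clean = toy[~row_mask]

print(cell_mask.sum())  # flagged cells per column
print(toy_clean.shape)  # rows remaining after dropping flagged rows
```

Dropping whole rows is aggressive when there are many features, since a row is removed for a single extreme cell; capping values at the bounds is a gentler alternative.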

3. Isolation Forest

Isolation Forest is a model-based anomaly detection algorithm that isolates outliers instead of profiling normal data. It builds multiple random decision trees by repeatedly splitting the data. Since outliers are few and different, they are easier to isolate and require fewer splits.

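**How it works:**

- Randomly select a feature and a split value between that feature's minimum and maximum.
- Repeat the split recursively until each point is isolated or a depth limit is reached.
- Average each point's path length over all trees; points with short average paths are flagged as outliers.

**Pros:** Works well in high dimensions, efficient.

**Cons:** Requires choosing contamination (expected outlier fraction).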
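To make the "fewer splits" idea concrete, here is a toy one-dimensional sketch in plain Python. It is not scikit-learn's implementation: the helpers `isolation_depth` and `average_depth` and the data are made up for illustration. Each trial repeatedly picks a random split point and keeps only the side containing the target, counting how many splits it takes until the target is alone.

```python
import random

def isolation_depth(values, target, rng):
    """Count random splits needed until `target` is alone in its partition."""
    depth = 0
    current = list(values)
    while len(current) > 1:
        split = rng.uniform(min(current), max(current))
        # Keep only the side of the split that contains the target.
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth

def average_depth(values, target, trials=500, seed=0):
    """Average the isolation depth of `target` over many random trials."""
    rng = random.Random(seed)
    return sum(isolation_depth(values, target, rng) for _ in range(trials)) / trials

# Made-up 1-D data: 50 evenly spaced points in [0, 1) plus one point at 10.
cluster = [i / 50 for i in range(50)]
data_1d = cluster + [10.0]

outlier_depth = average_depth(data_1d, 10.0)
inlier_depth = average_depth(data_1d, cluster[25])
print(outlier_depth, inlier_depth)
```

The point at 10 is usually cut off by the very first split, while a point in the middle of the cluster needs several more, which is the signal Isolation Forest averages over its trees.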

```python
# Fit the forest; contamination is the expected fraction of outliers
iso = IsolationForest(contamination=0.05, random_state=42)
y_pred_iso = iso.fit_predict(data)  # 1 = inlier, -1 = outlier

df["IsoForest_Outlier"] = y_pred_iso
print(df["IsoForest_Outlier"].value_counts())

# Plot two features, colored by the predicted label
plt.figure(figsize=(7, 5))
sns.scatterplot(x="alcohol", y="residual sugar", data=df,
                hue="IsoForest_Outlier", palette="coolwarm")
plt.title("Isolation Forest Outlier Detection")
plt.show()
```

**Output:**

IsoForest_Outlier
1 1519
-1 80

Figure: Isolation Forest
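The 80 flagged rows reported above are about 5% of the 1,599 rows, which is what `contamination=0.05` requests. The sketch below illustrates this on synthetic Gaussian data with no planted outliers (the data, seed and the `counts` dictionary are assumptions made for the example): the number of flagged points follows the contamination value rather than any property of the data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))  # synthetic data with no planted outliers

counts = {}
for contamination in (0.01, 0.05, 0.10):
    model = IsolationForest(contamination=contamination, random_state=42)
    labels = model.fit_predict(X)  # 1 = inlier, -1 = outlier
    counts[contamination] = int((labels == -1).sum())

# Flagged points per contamination setting
print(counts)
```

Because the threshold is set from the requested fraction, the count alone says little about how many genuine anomalies exist; inspecting the scores from `score_samples` can help when choosing the value.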

4. Local Outlier Factor (LOF)

The Local Outlier Factor (LOF) method is a density-based anomaly detection technique that compares the local density of a data point to that of its neighbors. If a point has significantly lower density than its neighbors, it is flagged as an outlier.

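**How it works:**

- Find the k nearest neighbors of each point.
- Estimate each point's local density from the distances to those neighbors.
- Compare that density with the densities of the neighbors; a LOF score close to 1 means similar density, while a score well above 1 indicates an outlier.

**Pros:** Works well with clusters of varying density.

**Cons:** Sensitive to choice of k (neighbors).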
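The density comparison can be sketched from scratch in a few lines of NumPy. This is a simplified, brute-force illustration, not scikit-learn's implementation: the function `lof_scores` is hypothetical, it takes exactly k neighbors per point (ignoring ties) and it assumes no duplicate points. The data is a made-up 5×5 grid plus one distant point.

```python
import numpy as np

def lof_scores(X, k):
    """Simplified LOF: scores near 1 are typical, larger scores are more outlying."""
    n = len(X)
    # Pairwise Euclidean distances
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)  # a point is not its own neighbor
    nn_idx = np.argsort(D, axis=1)[:, :k]     # indices of the k nearest neighbors
    k_dist = D[np.arange(n), nn_idx[:, -1]]   # distance to the k-th neighbor
    # Reachability distance: reach(p, o) = max(k_dist(o), d(p, o))
    reach = np.maximum(k_dist[nn_idx], D[np.arange(n)[:, None], nn_idx])
    lrd = 1.0 / reach.mean(axis=1)            # local reachability density
    # LOF: average density of the neighbors divided by the point's own density
    return lrd[nn_idx].mean(axis=1) / lrd

# Made-up data: a 5x5 unit grid (indices 0-24) plus one far point (index 25).
grid = np.array([(i, j) for i in range(5) for j in range(5)], dtype=float)
X = np.vstack([grid, [[10.0, 10.0]]])

scores = lof_scores(X, k=3)
print(int(np.argmax(scores)))  # index of the most outlying point
print(scores.round(2))
```

Changing `k` changes which neighbors define "local", which is why the method is sensitive to that choice.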

```python
# Fit LOF with 20 neighbors; contamination is the expected outlier fraction
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
y_pred_lof = lof.fit_predict(data)  # 1 = inlier, -1 = outlier

df["LOF_Outlier"] = y_pred_lof
print(df["LOF_Outlier"].value_counts())

# Plot two features, colored by the predicted label
plt.figure(figsize=(7, 5))
sns.scatterplot(x="alcohol", y="volatile acidity", data=df,
                hue="LOF_Outlier", palette="Set1")
plt.title("LOF Outlier Detection")
plt.show()
```

**Output:**

LOF_Outlier
1 1519
-1 80

Figure: LOF

Comparison of Outlier Detection Techniques

| Technique | Type | Key Idea | Works Well For | Pros | Cons |
|---|---|---|---|---|---|
| **Z-Score** | Statistical | Flags points far from mean (in SD units) | Normally distributed continuous data | Simple, fast and easy to implement | Not reliable for skewed or non-normal data |
| **IQR** | Statistical | Flags points outside 1.5×IQR from Q1/Q3 | Univariate data, boxplot-based analysis | Robust to extreme values and non-parametric | Doesn’t adapt well to very skewed distributions |
| **Isolation Forest** | Model-based | Isolates outliers via random tree splits | High-dimensional datasets | Handles large datasets, efficient and works with many features | Requires setting the contamination parameter, and results vary with it |
| **Local Outlier Factor (LOF)** | Density-based | Compares local density to neighbors | Data with clusters or varying densities | Detects local outliers well | Sensitive to number of neighbors (k), computationally costlier |