Anomaly detection using Isolation Forest

Last Updated : 23 Jul, 2025

Anomaly detection is vital across industries, revealing outliers in data that signal problems or unique insights. Isolation Forests offer a powerful solution, isolating anomalies from normal data. In this tutorial, we will explore the Isolation Forest algorithm's implementation for anomaly detection using the Iris flower dataset, showcasing its effectiveness in identifying outliers amidst multidimensional data.

**What is Anomaly Detection?**

Anomalies, also known as outliers, are data points that deviate significantly from the expected behavior or norm within a dataset. They are crucial to identify because they can signal potential problems, fraudulent activities, or interesting discoveries. Anomaly detection plays a vital role in various fields, including data analysis, machine learning, and network security.

**Types of Anomalies**

There are essentially three types of anomalies: point anomalies, contextual anomalies, and collective anomalies.

Isolation Forests for Anomaly Detection

Isolation Forest is an unsupervised anomaly detection algorithm that is particularly effective for high-dimensional data. It operates on the principle that anomalies are rare and distinct, which makes them easier to isolate from the rest of the data. Unlike methods that first build a profile of normal data, Isolation Forests isolate anomalies directly.

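Isolation Forests work as follows: each tree in the forest recursively partitions the data by choosing a random feature and a random split value between that feature's minimum and maximum. Anomalies tend to be separated from the rest after only a few splits, so a short average path length across the trees marks a point as anomalous.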
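The isolation idea can be illustrated with a minimal sketch. This is a simplified toy, not scikit-learn's implementation: it omits subsampling and the tree height limit, and the names `isolation_path_length`, `mean_path`, and the synthetic data are assumptions made up for this illustration. It counts how many random splits are needed before a point is alone in its partition.

```python
import numpy as np


def isolation_path_length(X, point, rng, max_depth=50):
    """Count random splits until `point` is alone in its partition."""
    depth = 0
    while len(X) > 1 and depth < max_depth:
        feature = rng.integers(X.shape[1])           # pick a random feature
        lo, hi = X[:, feature].min(), X[:, feature].max()
        if lo == hi:                                  # no spread left on this feature
            break
        split = rng.uniform(lo, hi)                   # pick a random split value
        # keep only the side of the split that contains `point`
        if point[feature] < split:
            X = X[X[:, feature] < split]
        else:
            X = X[X[:, feature] >= split]
        depth += 1
    return depth


rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # dense "normal" data
outlier = np.array([8.0, 8.0])                            # far from the cluster
data = np.vstack([cluster, outlier])


def mean_path(point, n_trees=200):
    """Average path length over many independent random partitionings."""
    return np.mean([isolation_path_length(data, point, rng) for _ in range(n_trees)])


outlier_path = mean_path(outlier)
inlier_path = mean_path(cluster[0])
print(f"outlier: {outlier_path:.1f} splits, inlier: {inlier_path:.1f} splits")
```

The outlier should need noticeably fewer splits on average than a point inside the cluster, which is the signal an Isolation Forest turns into an anomaly score.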


**Anomaly detection using Isolation Forest: Implementation**

Let's implement the Isolation Forest algorithm for anomaly detection using the Iris flower dataset from scikit-learn. Every sample in the Iris dataset belongs to one of three known species (Iris Setosa, Iris Versicolor, and Iris Virginica), so the outliers flagged here are samples whose measurements are unusual relative to the rest of the data. The steps are as follows:

Step 1: Import necessary libraries

```python
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
```

Step 2: Loading and Splitting the Dataset

```python
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```
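As a quick sanity check on the split, the sketch below prints the shapes of the two subsets. It reloads the data so that it runs on its own; `return_X_y=True` is used here only for brevity and is not part of the original steps. Note that the labels `y` are split as well, but Isolation Forest is unsupervised and does not use them.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 150 samples with 4 features each, split 70/30
print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
```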

Step 3: Fitting the model

The code below creates an Isolation Forest model using the IsolationForest class. The contamination parameter specifies the expected proportion of anomalies in the data; here it is set to 0.1 (10%). Because no random_state is passed, the exact points flagged can vary from run to run.

```python
# initialize and fit the model
clf = IsolationForest(contamination=0.1)
clf.fit(X_train)
```
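To see what contamination does in practice, the following sketch fits the model with a few different values and reports the share of training points flagged. The specific values tried, the `flagged_share` dictionary, and `random_state=0` are choices made for this illustration, not part of the original tutorial; the full dataset is used instead of the train split to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import IsolationForest

X, _ = load_iris(return_X_y=True)

flagged_share = {}
for contamination in (0.05, 0.1, 0.2):
    # random_state is fixed only to make this sketch repeatable
    model = IsolationForest(contamination=contamination, random_state=0).fit(X)
    flagged_share[contamination] = float(np.mean(model.predict(X) == -1))
    print(f"contamination={contamination}: "
          f"{flagged_share[contamination]:.1%} of training points flagged")
```

The flagged share on the training data should track the contamination value closely, since the parameter sets the score threshold rather than changing how the trees are built.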

Step 4: Predictions

The predict method returns a label for each data point: **1** if the model classifies it as normal and **-1** if it classifies it as anomalous.

```python
# predict the anomalies in the data
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
print(y_pred_train)
print(y_pred_test)
```

**Output:**

[ 1 1 1 1 -1 1 -1 1 1 -1 1 1 1 1 -1 1 1 1 1 1 1 1 -1 1
1 1 -1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 -1 1 1 -1 1 1 1 1 1 1 1 1 1 -1 1 1 1 1 1 1 1 1 1
1 1 1 1 -1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 -1 1 1]
[ 1 -1 -1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
-1 1 1 1 1 1 1 -1 1 1 1 1 1 1 1 -1 1 1 1 -1 1]
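Raw label arrays are hard to read, so it can help to count the flagged points and look at the underlying scores. The sketch below is one way to do this, assuming the same split as above and a fixed `random_state=42` on the model (an addition for repeatability, so the flagged points may differ from the output shown). It uses `decision_function`, where negative values correspond to predicted anomalies.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)

# random_state on the model is an assumption added for repeatability
clf = IsolationForest(contamination=0.1, random_state=42).fit(X_train)
y_pred_test = clf.predict(X_test)

# count how many test points were labelled -1
n_anomalies = int(np.sum(y_pred_test == -1))
print(f"{n_anomalies} of {len(y_pred_test)} test points flagged as anomalies")

# lower (more negative) scores mean more anomalous points
scores = clf.decision_function(X_test)
most_anomalous = np.argsort(scores)[:3]
print("Three most anomalous test samples:")
print(X_test[most_anomalous])
```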

Step 5: Visualization

```python
def create_scatter_plots(X1, y1, title1, X2, y2, title2):
    fig, axes = plt.subplots(1, 2, figsize=(12, 6))

    # Scatter plot for the first set of data (first two features only)
    axes[0].scatter(X1[y1 == 1, 0], X1[y1 == 1, 1], color='green', label='Normal')
    axes[0].scatter(X1[y1 == -1, 0], X1[y1 == -1, 1], color='red', label='Anomaly')
    axes[0].set_title(title1)
    axes[0].legend()

    # Scatter plot for the second set of data (first two features only)
    axes[1].scatter(X2[y2 == 1, 0], X2[y2 == 1, 1], color='green', label='Normal')
    axes[1].scatter(X2[y2 == -1, 0], X2[y2 == -1, 1], color='red', label='Anomaly')
    axes[1].set_title(title2)
    axes[1].legend()

    plt.tight_layout()
    plt.show()


# scatter plots
create_scatter_plots(X_train, y_pred_train, 'Training Data',
                     X_test, y_pred_test, 'Test Data')
```

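**Output:**

[Figure: side-by-side scatter plots of the first two features for the training data and the test data, with normal points in green and anomalies in red]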

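The distribution of the anomalies in the training data is different from the distribution of the anomalies in the test data. In the training data, the anomalies tend to be located on the edges of the plot. In the test data, the anomalies are more scattered throughout the plot. Keep in mind that the plots show only the first two of the four features (sepal length and sepal width), so a point flagged because of its petal measurements can appear in the middle of the cloud.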

**Advantages of Isolation Forests**

The Isolation Forest algorithm offers an efficient solution for identifying anomalies, especially in datasets with multiple dimensions. It stands out by isolating outliers rather than profiling normal cases, making it more adept at uncovering rare instances that differ from the usual pattern.