Clustering in R Programming

Last Updated : 14 Apr, 2026

Clustering is an unsupervised learning technique that organizes data points into groups based on their similarity. It reveals underlying patterns, structures and relationships within a dataset without the need for labeled data.

A clustering algorithm takes unstructured data as input and automatically groups similar items into meaningful clusters.

**Clustering Algorithms in R Programming**

In R, there are different clustering techniques that work with various types of data and address specific clustering challenges. Each method has its own strengths and can handle aspects like the number of clusters, their shapes and the presence of noise in the data.


1. K-Means Clustering

K-Means clustering is a widely used unsupervised learning algorithm that groups data points into a specified number of clusters based on their features. It works by iteratively assigning points to the nearest cluster center and updating the cluster centroids until convergence. K-Means is simple, fast and effective for many real-world datasets.
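The assign-and-update loop described above can be sketched in a few lines of base R. This is only an illustration under simple assumptions (random rows as starting centroids, squared Euclidean distance); `simple_kmeans` is a hypothetical helper written for this example, and the built-in `kmeans()` function should be preferred in practice.

```r
# Minimal sketch of the K-Means loop in base R.
# `simple_kmeans` is a hypothetical helper, not part of any package.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Start from k distinct rows chosen at random as initial centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (i in seq_len(max_iter)) {
    # Assignment step: squared distance from every point to every centroid
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")  # index of nearest centroid
    # Converged when no point changes cluster
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Update step: move each centroid to the mean of its assigned points
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers, iterations = i)
}

set.seed(123)
res <- simple_kmeans(scale(mtcars), k = 4)
table(res$cluster)
```

Because the result depends on the random starting centroids, the built-in `kmeans()` offers the `nstart` argument to try several starts and keep the best one.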

**Implementation:** Here we apply K-Means clustering to the mtcars dataset and visualize the resulting clusters using the factoextra package.

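```r
install.packages("factoextra")  # one-time install
library(factoextra)

df <- mtcars
df <- na.omit(df)        # drop rows with missing values
df_scaled <- scale(df)   # standardize so every variable contributes equally

set.seed(123)            # reproducible random starts

# 4 clusters, best of 25 random initializations
km4 <- kmeans(df_scaled, centers = 4, nstart = 25)

fviz_cluster(km4, data = df_scaled, geom = "point", ellipse.type = "convex",
             main = "K-Means Clustering (k=4)")
```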

**Output:**

Figure: K-Means Clustering

2. Hierarchical Clustering

Hierarchical Clustering is an unsupervised learning method that groups data points into a tree-like structure based on their similarity. Instead of specifying the number of clusters in advance, it builds a hierarchy of clusters that can be visualized using a dendrogram. This method is useful when you want to understand the relationships between data points at different levels of similarity.
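To make the idea of a hierarchy concrete, the sketch below builds one tree and cuts it at several levels. The built-in `USArrests` dataset and Ward linkage are choices made only for this illustration.

```r
# Sketch: one hclust tree, cut at several levels.
# USArrests is a built-in dataset used here only for illustration.
x <- scale(USArrests)
hc <- hclust(dist(x), method = "ward.D2")

# cutree() accepts a vector of k values and returns one column per cut
cuts <- cutree(hc, k = 2:4)
head(cuts)

# Cross-tabulating two cuts shows the nesting: every 4-cluster group
# falls entirely inside a single 2-cluster group
table(k4 = cuts[, "4"], k2 = cuts[, "2"])
```

Each row of the cross-table has a single non-zero cell, which is what it means for the finer clusters to be nested inside the coarser ones.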

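**Implementation:** Here we apply Hierarchical Clustering to the Wholesale Customers dataset and visualize the resulting clusters.

The dataset must be downloaded before running the code; it is published in the UCI Machine Learning Repository under the name "Wholesale customers".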

```r
install.packages("factoextra")  # one-time install
library(factoextra)

# Adjust the path to wherever the CSV file is saved
data <- read.csv("/content/Wholesale customers data.csv")

# Keep the six product-category columns
numeric_data <- data[, c("Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen")]
scaled_data <- scale(numeric_data)

# Ward linkage on Euclidean distances
dist_matrix <- dist(scaled_data, method = "euclidean")
hc <- hclust(dist_matrix, method = "ward.D2")

plot(hc, labels = FALSE, hang = -1, main = "Dendrogram - Hierarchical Clustering")

# Cut the tree into 3 clusters and attach the labels to the data
clusters <- cutree(hc, k = 3)
data$Cluster <- as.factor(clusters)

fviz_cluster(list(data = scaled_data, cluster = clusters), geom = "point",
             ellipse.type = "convex", main = "Hierarchical Clustering (k = 3)")
```

**Output:**

Figure: Hierarchical Clustering

3. DBSCAN Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised clustering algorithm that groups data points based on regions of high density. Unlike K-Means, it does not require the number of clusters to be specified in advance. DBSCAN can detect clusters of arbitrary shapes and effectively identify noise or outliers in the dataset.
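The density test behind DBSCAN can be sketched in base R without any extra package. The values of `eps` and `min_pts` below are illustrative assumptions, and the sketch only labels points as core, border or noise; it does not link core points into clusters as a full DBSCAN implementation does.

```r
# Sketch of the density test at the heart of DBSCAN, in base R.
# eps and min_pts are illustrative values, not tuned recommendations.
x <- scale(iris[, 1:4])
eps <- 0.8
min_pts <- 5

d <- as.matrix(dist(x))             # pairwise Euclidean distances
n_neighbors <- rowSums(d <= eps)    # points within eps, counting the point itself
is_core <- n_neighbors >= min_pts   # core points sit in dense regions

# A non-core point within eps of some core point is a border point;
# everything else is noise
near_core <- rowSums(d[, is_core, drop = FALSE] <= eps) > 0
point_type <- ifelse(is_core, "core", ifelse(near_core, "border", "noise"))
table(point_type)
```

Raising `eps` or lowering `min_pts` turns more points into core points, which is why these two parameters control how many clusters and how much noise DBSCAN reports.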

**Implementation:** Here we apply DBSCAN clustering to the Iris dataset.

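```r
install.packages("dbscan")      # one-time installs
install.packages("factoextra")

library(dbscan)
library(factoextra)

data <- iris[, 1:4]             # the four numeric measurements
scaled_data <- scale(data)

set.seed(123)

# eps = neighborhood radius, minPts = minimum points for a dense region
db <- dbscan(scaled_data, eps = 0.8, minPts = 5)

print(db)  # cluster sizes; cluster 0 holds the noise points

fviz_cluster(db, data = scaled_data, geom = "point",
             main = "DBSCAN Clustering on Iris Dataset")
```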

**Output:**

Figure: DBSCAN Clustering

4. Fuzzy Clustering

Fuzzy clustering, most commonly implemented as Fuzzy C-Means (FCM), is an unsupervised learning technique in which each data point can belong to multiple clusters with different degrees of membership. Unlike hard clustering methods such as K-Means, which assign each point to exactly one cluster, fuzzy clustering gives every point a membership score between 0 and 1 for each cluster, and in FCM a point's scores sum to 1. This approach is especially useful when cluster boundaries overlap or are not clearly defined.
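The membership scores can be computed by hand for a fixed set of centers using the standard FCM membership formula. In the sketch below the centers come from `kmeans()` purely as a stand-in (an assumption for illustration); a real FCM run re-estimates the centers on every iteration.

```r
# Sketch: fuzzy membership degrees for fixed centers, in base R.
# The centers from kmeans() are a stand-in, not an FCM solution.
x <- scale(mtcars)
set.seed(123)
centers <- kmeans(x, centers = 3, nstart = 25)$centers
m <- 2  # fuzzifier: larger values give softer memberships

# Euclidean distance from every point to every center (n x k matrix)
d <- sapply(seq_len(nrow(centers)),
            function(j) sqrt(rowSums(sweep(x, 2, centers[j, ])^2)))

# u[i, j] = 1 / sum_k (d[i, j] / d[i, k])^(2 / (m - 1))
w <- d^(-2 / (m - 1))
u <- w / rowSums(w)

round(head(u), 3)
range(rowSums(u))   # memberships of each point add up to 1
```

A point that sits close to one center gets a membership near 1 for that cluster, while a point midway between centers gets its membership split more evenly.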

**Implementation:** Here we apply Fuzzy C-Means clustering to the mtcars dataset.

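```r
install.packages("e1071")       # one-time installs
install.packages("factoextra")

library(e1071)
library(factoextra)

data <- mtcars
scaled_data <- scale(data)
set.seed(123)

# m is the fuzzifier that controls how soft the memberships are
fcm <- cmeans(scaled_data, centers = 3, m = 2, iter.max = 100)

print(fcm$membership)  # one row per car, one membership column per cluster

# Hard assignment: the cluster with the highest membership for each point
clusters <- as.factor(fcm$cluster)

fviz_cluster(list(data = scaled_data, cluster = clusters), geom = "point",
             ellipse.type = "convex", main = "Fuzzy C-Means Clustering on mtcars Dataset")
```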

**Output:**

5. Spectral Clustering

Spectral Clustering is an advanced clustering technique that groups data points based on similarity by using concepts from graph theory and linear algebra. Instead of directly clustering the original data, it constructs a similarity graph and analyzes its structure using eigenvalues and eigenvectors. This approach is especially effective for detecting complex, non-convex and non-linearly separable clusters.
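The graph and eigenvector steps can be sketched in base R as follows. This is one common variant (a Gaussian similarity graph with a symmetrically normalized affinity matrix), and the kernel width `sigma` is a hand-picked assumption; library implementations may choose these details differently.

```r
# Sketch of the spectral clustering steps in base R.
# sigma is a hand-picked kernel width, an assumption for this illustration.
x <- scale(iris[, 1:4])
k <- 3
sigma <- 1

# 1. Similarity graph: Gaussian kernel on squared pairwise distances
d2 <- as.matrix(dist(x))^2
W <- exp(-d2 / (2 * sigma^2))
diag(W) <- 0

# 2. Symmetrically normalized affinity D^(-1/2) W D^(-1/2)
deg <- rowSums(W)
A_norm <- W / sqrt(outer(deg, deg))

# 3. The k leading eigenvectors give a low-dimensional embedding of the graph
eig <- eigen(A_norm, symmetric = TRUE)
U <- eig$vectors[, 1:k]
U <- U / sqrt(rowSums(U^2))   # scale each row to unit length

# 4. Ordinary K-Means on the embedded points
set.seed(123)
spec_cluster <- kmeans(U, centers = k, nstart = 25)$cluster
table(spec_cluster, iris$Species)
```

The clustering happens in the eigenvector space rather than in the original feature space, which is what lets the method separate groups that are not convex in the raw data.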

**Implementation:** Here we apply Spectral Clustering to the Iris dataset.

```r
install.packages("kernlab")     # one-time installs
install.packages("ggplot2")
install.packages("ggrepel")

library(kernlab)
library(ggplot2)
library(ggrepel)

data <- iris[, 1:4]
scaled_data <- scale(data)

set.seed(123)

# Spectral clustering into 3 groups
sc <- specc(scaled_data, centers = 3)

iris$Cluster <- as.factor(sc)   # attach cluster labels for plotting

# Plot two of the original features, colored by cluster and labeled by species
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Cluster)) +
  geom_point(size = 3, alpha = 0.8) +
  geom_text_repel(aes(label = Species), size = 3.5, box.padding = 0.4,
                  point.padding = 0.3, max.overlaps = Inf) +
  stat_ellipse(aes(group = Cluster), linetype = 2, linewidth = 0.8) +
  labs(title = "Spectral Clustering of Iris Dataset", x = "Sepal Length",
       y = "Sepal Width", color = "Cluster") +
  theme_minimal(base_size = 14) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        legend.position = "right")
```

**Output:**

Figure: Spectral Clustering

**Ensemble Clustering**

Ensemble Clustering is an advanced clustering approach that combines the results of multiple clustering algorithms or multiple runs of the same algorithm to produce a more stable and reliable final clustering solution. Instead of relying on a single method, it aggregates different clustering outcomes to reduce variability and improve robustness. This technique is especially useful when the true cluster structure is uncertain or when individual algorithms produce inconsistent results.
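One widely used way to aggregate several clusterings is a co-association matrix, which records how often each pair of points ends up in the same cluster. Since cluster labels are arbitrary (cluster 1 from one algorithm need not correspond to cluster 1 from another), comparing pairs within each partition avoids having to match labels across methods. The sketch below uses mtcars and three base partitions chosen only for illustration.

```r
# Sketch: consensus clustering through a co-association matrix, in base R.
# The dataset and the three base partitions are illustrative choices.
x <- scale(mtcars)

set.seed(123)
partitions <- list(
  kmeans(x, centers = 3, nstart = 25)$cluster,
  cutree(hclust(dist(x), method = "ward.D2"), k = 3),
  cutree(hclust(dist(x), method = "average"), k = 3)
)

# co[i, j] = share of partitions that put points i and j in the same cluster
same_cluster <- lapply(partitions, function(p) outer(p, p, "==") * 1)
co <- Reduce(`+`, same_cluster) / length(partitions)

# Final clustering: hierarchical clustering on the disagreement 1 - co
consensus <- cutree(hclust(as.dist(1 - co), method = "average"), k = 3)
table(consensus)
```

Pairs of points that every base method groups together have a co-association of 1 and are merged first, so the consensus reflects where the methods agree.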

**Implementation:** Here we apply Ensemble Clustering to the Wholesale Customers dataset.

```r
install.packages("dbscan")      # one-time installs
install.packages("kernlab")
install.packages("factoextra")

library(dbscan)
library(kernlab)
library(factoextra)

# Adjust the path to wherever the CSV file is saved
data <- read.csv("Wholesale_Customers_Data.csv")
numeric_data <- data[, c("Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen")]
scaled_data <- scale(numeric_data)

set.seed(123)

# Base clustering 1: K-Means
km <- kmeans(scaled_data, centers = 3, nstart = 25)
cluster_km <- km$cluster

# Base clustering 2: hierarchical (Ward linkage)
dist_matrix <- dist(scaled_data)
hc <- hclust(dist_matrix, method = "ward.D2")
cluster_hc <- cutree(hc, k = 3)

# Base clustering 3: DBSCAN (noise is labeled 0, so move it to its own cluster)
db <- dbscan(scaled_data, eps = 1.2, minPts = 5)
cluster_db <- db$cluster
cluster_db[cluster_db == 0] <- max(cluster_db) + 1

clusters_matrix <- cbind(cluster_km, cluster_hc, cluster_db)

# Majority vote per row. Labels are not aligned across algorithms,
# so treat this vote as a rough heuristic.
ensemble_cluster <- apply(clusters_matrix, 1, function(x) {
  as.numeric(names(sort(table(x), decreasing = TRUE)[1]))
})

data$Ensemble_Cluster <- as.factor(ensemble_cluster)

fviz_cluster(list(data = scaled_data, cluster = ensemble_cluster), geom = "point",
             ellipse.type = "convex", main = "Ensemble Clustering - Wholesale Dataset")
```

**Output:**

Figure: Ensemble Clustering


Evaluation Metrics in Clustering

Clustering evaluation metrics help assess the quality and effectiveness of clustering results. These metrics measure how well data points are grouped within clusters and how distinct the clusters are from each other.
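As a minimal sketch of two such checks, the code below computes the average silhouette width and the share of variance explained for a K-Means result. It assumes the `cluster` package is available (it ships with standard R distributions); mtcars and k = 4 are illustrative choices.

```r
# Sketch: two common internal quality checks for a K-Means result.
# Assumes the 'cluster' package is installed.
library(cluster)

x <- scale(mtcars)
set.seed(123)
km <- kmeans(x, centers = 4, nstart = 25)

# Silhouette width per point: close to 1 = well matched to its own cluster,
# near 0 = on a boundary, negative = probably closer to another cluster
sil <- silhouette(km$cluster, dist(x))
avg_sil <- mean(sil[, "sil_width"])
avg_sil

# Share of total variance explained by the clustering
explained <- km$betweenss / km$totss
explained
```

The silhouette measures both cohesion within clusters and separation between them, while the variance ratio always increases with k, so it is best compared across several values of k rather than read in isolation.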

Applications

Advantages

Limitations