Fuzzy Clustering in R

Last Updated : 28 Jun, 2025

Clustering is an unsupervised machine learning technique that identifies patterns in data by grouping similar points based on their features. Fuzzy clustering is a clustering technique where each data point can belong to multiple clusters with varying degrees of membership. In normal (hard) clustering, each data point goes into exactly one group, while in fuzzy clustering a point can be shared between groups to different degrees.

Understanding Fuzzy Clustering

In real-world scenarios, data points may naturally belong to more than one cluster. Fuzzy clustering addresses this by allowing a data point to belong to several clusters at the same time, each with its own degree of membership.

Difference between Normal and Fuzzy Clustering

| Factor | Normal Clustering | Fuzzy Clustering |
| --- | --- | --- |
| Partitioning | Hard partitioning: each data point belongs to only one cluster. | Soft partitioning: each data point can belong to multiple clusters. |
| Membership | A data point either belongs to a cluster or it does not. | A data point can belong to multiple clusters simultaneously, with a degree of membership in each. |
| Representation | Clusters are represented by centroids. | Clusters are represented by centroids together with degrees of membership. |
| Suitable dataset | Datasets with distinct boundaries. | Datasets with overlapping observations. |
| Algorithm used | K-means, hierarchical clustering. | Fuzzy C-means, Gustafson-Kessel algorithm. |
| Implementation | Easier to implement, since each point has a single label. | Harder to implement, since memberships must be handled for overlapping observations. |
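To make "degrees of membership" concrete, here is a minimal base R sketch of the standard fuzzy c-means membership formula for a single point. The function name `fuzzy_membership` and the example distances are hypothetical, chosen only for illustration.

```r
# Fuzzy c-means membership of one point, given its distances to each center:
#   u_j = 1 / sum_k (d_j / d_k)^(2 / (m - 1)),  where m > 1 is the fuzzifier.
# fuzzy_membership() is an illustrative helper, not a function from any package.
fuzzy_membership <- function(distances, m = 2) {
  stopifnot(m > 1, all(distances > 0))
  exponent <- 2 / (m - 1)
  sapply(distances, function(d_j) 1 / sum((d_j / distances)^exponent))
}

# Hypothetical point that is 1 unit from center A and 3 units from center B
u <- fuzzy_membership(c(A = 1, B = 3), m = 2)
print(u)
```

The memberships always sum to 1, and the closer center receives the larger share. A hard clustering would simply assign this point to A.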

Fuzzy Clustering in R using Customer Segmentation Dataset

In this example, we apply fuzzy clustering to a mall customer segmentation dataset (Mall_Customers.csv).

You can download the dataset from here.

1. Loading Required Libraries

We are installing and loading the libraries that provide necessary functions for clustering and visualization.

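# Install the packages once (skip if they are already installed)
install.packages("e1071")
install.packages("cluster")
install.packages("factoextra")

# Load the packages
library(e1071)       # cmeans() for fuzzy c-means
library(cluster)     # clusGap() and clusplot()
library(factoextra)  # fviz_gap_stat(); also attaches ggplot2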

2. Loading the Dataset

We are reading the CSV file containing the data and inspecting the first few rows.

data <- read.csv("/content/Mall_Customers.csv")

head(data)

Output:

[Image: first rows of the dataset]

3. Data Preprocessing

We count the missing values in each column and drop any rows that contain them, so that the clustering runs on complete records only.

# Count missing values in each column
x <- colSums(is.na(data))
x <- as.data.frame(x)
x

# Drop rows that contain missing values
data <- na.omit(data)

Output:

[Image: missing-value counts per column]
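As a small illustration of this check, the sketch below applies the same two calls to a made-up data frame. The toy values are hypothetical and not taken from the mall dataset.

```r
# Hypothetical toy data with two missing entries
toy <- data.frame(
  Age    = c(23, NA, 41),
  Income = c(40, 55, NA),
  Score  = c(81, 60, 35)
)

na_counts <- colSums(is.na(toy))  # missing values per column
print(na_counts)

clean <- na.omit(toy)             # keeps only rows with no missing values
print(nrow(clean))
```

Note that `na.omit()` drops a whole row as soon as one of its values is missing, so it is worth checking how many rows remain before clustering.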

4. Selecting Data for Clustering

We are choosing relevant features that influence clustering: Age, Income and Spending Score.

data_for_clustering <- data[, c("Age", "Annual.Income..k..", "Spending.Score..1.100.")]

5. Applying Fuzzy C-means Clustering

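We perform fuzzy clustering using the cmeans() function from e1071, with 5 clusters and a fuzzifier m = 2. The fuzzifier must be greater than 1, and larger values produce softer (more evenly shared) memberships.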

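set.seed(123)   # reproducible initialization
n_cluster <- 5  # number of clusters
m <- 2          # fuzzifier; must be greater than 1
result <- cmeans(data_for_clustering, centers = n_cluster, m = m)

# Membership degrees: one row per customer, one column per cluster
fuzzy_membership_matrix <- result$membership

# result$centers holds the final cluster centers (one row per cluster)
final_centers <- result$centers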
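To show what happens inside a fuzzy c-means fit, here is a minimal base R sketch of the usual alternating updates (centers, then memberships). It is an illustration under simplifying assumptions (random initial memberships, a fixed number of iterations, no convergence check) and is not the e1071 implementation. The name `fcm_sketch` and the synthetic data are hypothetical.

```r
# Illustrative fuzzy c-means loop; not the e1071 implementation
fcm_sketch <- function(x, k, m = 2, iters = 50) {
  x <- as.matrix(x)
  n <- nrow(x)
  # Start from random memberships whose rows sum to 1
  u <- matrix(runif(n * k), n, k)
  u <- u / rowSums(u)
  for (i in seq_len(iters)) {
    w <- u^m
    # Centers: membership-weighted means of the points (k x p matrix)
    centers <- t(w) %*% x / colSums(w)
    # Squared Euclidean distance from every point to every center (n x k)
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    d2 <- pmax(d2, 1e-12)  # guard against division by zero
    # Membership update: u_ij proportional to d_ij^(-2 / (m - 1))
    inv <- d2^(-1 / (m - 1))
    u <- inv / rowSums(inv)
  }
  list(centers = centers, membership = u)
}

# Synthetic data: two 2D blobs around 0 and 5 (not the mall dataset)
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))
fit <- fcm_sketch(toy, k = 2)
print(round(fit$centers, 2))
print(round(head(fit$membership, 3), 3))
```

Each row of the membership matrix sums to 1, which is the same structure that `result$membership` has in the code above.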

6. Interpreting the Clustering Results

We are combining the fuzzy membership matrix with the original dataset for analysis.

# Append the membership degrees (one column per cluster) to the original data
cluster_membership <- as.data.frame(result$membership)
data_with_clusters <- cbind(data, cluster_membership)
head(data_with_clusters)

Output:

[Image: dataset with the fuzzy membership columns appended]
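One way to read such a membership table is to look at each row's strongest membership and flag rows where no cluster clearly dominates. The sketch below does this on a made-up membership matrix. The 0.6 cutoff is an arbitrary illustrative choice, and the partition coefficient at the end is a common summary of overall fuzziness that the original analysis does not compute.

```r
# Hypothetical membership matrix: 4 customers x 3 clusters (rows sum to 1)
membership <- matrix(
  c(0.90, 0.05, 0.05,
    0.45, 0.40, 0.15,
    0.10, 0.80, 0.10,
    0.34, 0.33, 0.33),
  ncol = 3, byrow = TRUE
)

hard_label   <- apply(membership, 1, which.max)  # cluster with the highest membership
top_degree   <- apply(membership, 1, max)        # strength of that membership
is_ambiguous <- top_degree < 0.6                 # 0.6 is an arbitrary cutoff

summary_table <- data.frame(hard_label, top_degree, is_ambiguous)
print(summary_table)

# Partition coefficient: mean of the row-wise sums of squared memberships.
# It ranges from 1 / (number of clusters) for fully fuzzy to 1 for crisp results.
partition_coefficient <- mean(rowSums(membership^2))
print(partition_coefficient)
```

Rows flagged as ambiguous are the customers that a hard clustering would force into a single segment even though they sit between two or more.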

7. Evaluating the Clustering Quality

Cluster quality refers to how cohesive the data points within each cluster are and how well separated the clusters are from one another. A high-quality clustering has tightly grouped points within each cluster and clear separation between clusters. We evaluate the clusters using the approaches below, each of which the code computes on a hard k-means partition of the same features rather than on the fuzzy memberships:

7.1. Using Gap Statistic

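We compute the gap statistic to estimate the optimal number of clusters. For each candidate number of clusters (here 1 to 10), it compares the within-cluster dispersion of a k-means fit with the dispersion expected under a uniform reference distribution, estimated from B = 50 simulated datasets.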

library(cluster)

gap_stat <- clusGap(data_for_clustering, FUN = kmeans, nstart = 25, K.max = 10, B = 50)

print(gap_stat)

fviz_gap_stat(gap_stat)

Output:

[Image: printed gap statistic table]

[Image: gap statistic plot]
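The suggested number of clusters can also be extracted programmatically rather than read off the plot. The sketch below runs `clusGap()` on synthetic data and passes the gap values and their standard errors to `maxSE()` from the cluster package. The three-blob data and the smaller `B` and `K.max` values are assumptions made to keep the example quick.

```r
library(cluster)  # clusGap() and maxSE()

set.seed(42)
# Synthetic data: three 2D blobs (not the mall dataset)
toy <- rbind(
  matrix(rnorm(60, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2)
)

gap_toy <- clusGap(toy, FUN = kmeans, nstart = 10, K.max = 6, B = 20,
                   verbose = FALSE)

# Choose k with the "firstSEmax" rule, using the gap values and their
# simulation standard errors stored in the Tab matrix
best_k <- maxSE(f = gap_toy$Tab[, "gap"],
                SE.f = gap_toy$Tab[, "SE.sim"],
                method = "firstSEmax")
print(best_k)
```

Other rules, such as "globalmax" or "Tibs2001SEmax", can give a different answer on the same table, so it is worth checking more than one before fixing the number of clusters.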

7.2. Using Davies-Bouldin Index

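We calculate the Davies-Bouldin Index, which relates the scatter within clusters to the separation between cluster centroids. Lower values indicate more compact, better-separated clusters. The index is computed here on a k-means partition with 5 clusters.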

install.packages("clusterSim")
library(clusterSim)

# Hard k-means partition used as input to the index
set.seed(123)
km_res <- kmeans(data_for_clustering, centers = 5, nstart = 25)

db_index <- index.DB(data_for_clustering, km_res$cluster, centrotypes = "centroids")
print(db_index$DB)

Output:

[1] 0.884653
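For intuition about what this number measures, here is a base R sketch of one common formulation of the Davies-Bouldin index: root-mean-square scatter within each cluster and Euclidean distance between centroids. The helper `davies_bouldin` is hypothetical, and `index.DB()` may differ in its details depending on its arguments, so the two are not guaranteed to agree exactly.

```r
# Illustrative Davies-Bouldin index; one common formulation, not clusterSim's code
davies_bouldin <- function(x, cl) {
  x <- as.matrix(x)
  ids <- sort(unique(cl))
  # One centroid per cluster (rows = clusters)
  centers <- t(sapply(ids, function(g) colMeans(x[cl == g, , drop = FALSE])))
  # Scatter: root-mean-square distance of a cluster's points to its centroid
  scatter <- sapply(seq_along(ids), function(i) {
    pts <- x[cl == ids[i], , drop = FALSE]
    sqrt(mean(rowSums(sweep(pts, 2, centers[i, ])^2)))
  })
  sep <- as.matrix(dist(centers))            # distances between centroids
  ratios <- outer(scatter, scatter, "+") / sep
  diag(ratios) <- -Inf                       # ignore a cluster paired with itself
  mean(apply(ratios, 1, max))                # worst-case ratio, averaged over clusters
}

# Hypothetical toy data: two tight pairs of points, 10 units apart
toy <- rbind(c(0, 0), c(0, 2), c(10, 0), c(10, 2))
db_toy <- davies_bouldin(toy, cl = c(1, 1, 2, 2))
print(db_toy)
```

In this toy case each cluster has scatter 1 and the centroids are 10 apart, so every ratio is (1 + 1) / 10. Moving the clusters closer together or spreading their points out raises the index.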

7.3. Using Calinski-Harabasz Index

We compute the Calinski-Harabasz (CH) Index, the ratio of between-cluster dispersion to within-cluster dispersion, each adjusted for its degrees of freedom. Higher values indicate better-defined clusters.

ch_index <- index.G1(data_for_clustering, km_res$cluster)

print(ch_index)

Output:

[1] 151.0439
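The ratio behind this index is short enough to compute by hand. The following base R sketch implements the textbook Calinski-Harabasz formula on a tiny made-up dataset. The helper `calinski_harabasz` is hypothetical and is shown only to make the definition concrete.

```r
# Illustrative Calinski-Harabasz index:
#   CH = (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k))
calinski_harabasz <- function(x, cl) {
  x <- as.matrix(x)
  n <- nrow(x)
  ids <- unique(cl)
  k <- length(ids)
  overall <- colMeans(x)
  bss <- 0
  wss <- 0
  for (g in ids) {
    pts <- x[cl == g, , drop = FALSE]
    ctr <- colMeans(pts)
    bss <- bss + nrow(pts) * sum((ctr - overall)^2)  # between-cluster part
    wss <- wss + sum(sweep(pts, 2, ctr)^2)           # within-cluster part
  }
  (bss / (k - 1)) / (wss / (n - k))
}

# Hypothetical toy data: two tight pairs of points, 10 units apart
toy <- rbind(c(0, 0), c(0, 2), c(10, 0), c(10, 2))
ch_toy <- calinski_harabasz(toy, cl = c(1, 1, 2, 2))
print(ch_toy)
```

Here the between-cluster sum of squares is 100 and the within-cluster sum of squares is 4, giving (100 / 1) / (4 / 2). Unlike the Davies-Bouldin index, larger values are better.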

8. Visualizing the Clustering Results

We will visualize the clustering results to better understand the distribution and separation of the formed clusters.

8.1. 2D Scatter Plot

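We assign each customer to the cluster with the highest membership and plot annual income against spending score, colored by that cluster.

# Hard assignment: the cluster with the highest membership for each customer
data_with_clusters$Cluster <- apply(result$membership, 1, which.max)

# ggplot() comes from ggplot2, which is attached when factoextra is loaded
ggplot(data_with_clusters,
       aes(x = Annual.Income..k.., y = Spending.Score..1.100.,
           color = as.factor(Cluster))) +
  geom_point(size = 2) +
  labs(title = "Fuzzy C-means Clustering",
       x = "Annual Income (k$)", y = "Spending Score", color = "Cluster")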

Output:

[Image: scatter plot of annual income vs. spending score, colored by cluster]
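Coloring by the strongest cluster hides how fuzzy each assignment is. One possible variation, sketched here with base graphics and made-up data, is to set each point's opacity to its highest membership degree so that borderline points appear fainter. The coordinates, memberships and colors below are all hypothetical.

```r
# Hypothetical 2D points with made-up membership degrees for 2 clusters
x <- c(1.0, 1.2, 2.5, 4.0, 4.2)
y <- c(1.1, 0.9, 2.4, 4.1, 3.9)
membership <- cbind(c(0.95, 0.90, 0.55, 0.08, 0.05),
                    c(0.05, 0.10, 0.45, 0.92, 0.95))

label      <- apply(membership, 1, which.max)  # hard assignment
top_degree <- apply(membership, 1, max)        # strength of that assignment

# Base color per cluster, with opacity set to the top membership degree
base_rgb   <- col2rgb(c("steelblue", "firebrick"))[, label] / 255
point_cols <- rgb(base_rgb[1, ], base_rgb[2, ], base_rgb[3, ], alpha = top_degree)

plot(x, y, pch = 16, cex = 2, col = point_cols,
     main = "Opacity reflects membership strength")
```

The middle point, with memberships of 0.55 and 0.45, is drawn noticeably fainter than the others, which signals that it sits between the two clusters.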

8.2. Pairwise Variable Relationship Plot

We are creating pairwise scatter plots to study the relationship between clustering variables.

pairs(data_for_clustering, pch = 16, col = as.numeric(result$cluster))

Output:

[Image: pairwise scatter plots of the clustering variables, colored by cluster]

8.3. Clusplot for 2D Cluster Projection

We are visualizing clusters in 2D space using dimensionality reduction.

clusplot(data_for_clustering, result$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)

Output:

[Image: clusplot of the clusters in two dimensions]

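This plot projects the data onto its first two principal components, which together explain about 77.57% of the point variability, and shows how the data points group into clusters in that 2D view. The clusters drawn are the hard assignments (highest membership) taken from the fuzzy result.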
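The percentage reported by such a plot is the share of variance carried by the first two principal components. The sketch below reproduces that calculation with `prcomp()` on synthetic data. It assumes the projection is a principal component analysis on standardized variables, which may not match clusplot's internals exactly, and the data is made up.

```r
set.seed(7)
# Synthetic 3-variable data with one strongly correlated pair (not the mall dataset)
a <- rnorm(100)
toy <- data.frame(a = a, b = a + rnorm(100, sd = 0.3), c = rnorm(100))

# PCA on standardized variables (an assumption about how the projection is made)
pca <- prcomp(toy, scale. = TRUE)

# Share of total variance carried by each component
explained <- pca$sdev^2 / sum(pca$sdev^2)
two_d_share <- sum(explained[1:2])
print(round(100 * two_d_share, 2))
```

A share well below 100% means the 2D picture leaves out part of the structure, so clusters that overlap in the plot may still be separated along the remaining component.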