A Fast Incremental Clustering Algorithm

A Comparative Review of Incremental Clustering Methods for Large Dataset

International Journal of Advanced Trends in Computer Science and Engineering, 2021

Several algorithms have been developed for analyzing large incremental datasets. Incremental algorithms are relatively efficient at finding small clusters in large datasets in dynamically evolving environments. Many algorithms have been devised for limiting the search space and for building and updating arbitrarily shaped clusters in large, incrementally growing datasets. In real-time visualization of data that is in motion and growing dynamically, new data points arrive and cluster labels must be generated for them instantly. This paper presents a comparative review of incremental clustering methods for large datasets.

Computational analysis of incremental clustering approaches for Large Data

2021

Clustering is a data mining approach that helps to find the hidden structure underlying a dataset. K-means is a clustering method that uses distance functions to measure the similarity or dissimilarity between instances. DBSCAN is a clustering algorithm that discovers clusters of arbitrary shape and size in huge volumes of data using a spatial density method. These two classical methods cluster efficiently but underperform when the data in the database is updated frequently, so incremental (gradual) clustering approaches are preferred in such environments. In this paper, an incremental approach to clustering based on K-means and DBSCAN is introduced to handle new data that is added to the database dynamically at intervals.
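The abstract does not spell out its update rule, so the following is only a minimal sketch of one common way to make K-means incremental: assign each arriving point to its nearest centroid and shift that centroid with a running mean. The names `nearest` and `add_point` and the per-cluster `counts` list are assumptions made for illustration, not the paper's method.

```python
import math

def nearest(centroids, point):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(centroids[i], point))

def add_point(centroids, counts, point):
    """Assign one new point and update only the affected centroid.

    Uses the running-mean update c <- c + (x - c) / n, so points seen
    earlier never have to be revisited. (Hypothetical helper.)
    """
    i = nearest(centroids, point)
    counts[i] += 1
    centroids[i] = [c + (x - c) / counts[i] for c, x in zip(centroids[i], point)]
    return i

# Two existing clusters, each currently holding one point.
centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]
label = add_point(centroids, counts, [2.0, 0.0])
print(label, centroids[0])  # 0 [1.0, 0.0]
```

Only the centroid that receives the point changes, which is why a batch of new records can be absorbed without re-running the clustering from scratch.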

Distance based Incremental Clustering for Mining Clusters of Arbitrary Shapes

Lecture Notes in Computer Science, 2013

Clustering has been recognized as one of the important tasks in data mining. One important class of clustering is the distance-based method. To reduce the computational and storage burden of the classical clustering methods, many distance-based hybrid clustering methods have been proposed. However, these methods are not suitable for cluster analysis in dynamic environments, where the underlying data distribution, and subsequently the clustering structure, changes over time. In this paper, we propose a distance-based incremental clustering method that can find arbitrarily shaped clusters in fast-changing dynamic scenarios. Our proposed method is based on the recently proposed al-SL method, which can be applied successfully to large static datasets. In the incremental version of al-SL (termed IncrementalSL), we exploit important characteristics of the al-SL method to handle frequent updates of patterns in the given dataset. The IncrementalSL method produces exactly the same clustering results as the al-SL method. To show the effectiveness of IncrementalSL on a dynamically changing database, we experimented with one synthetic and one real-world dataset.

Efficient incremental density-based algorithm for clustering large datasets

Alexandria Engineering Journal, 2015

In dynamic information environments such as the web, the amount of information is rapidly increasing. Thus, the need to organize such information in an efficient manner is more important than ever. Given this dynamic nature, incremental clustering algorithms are preferred over traditional static algorithms. In this paper, an enhanced version of the incremental DBSCAN algorithm is introduced for incrementally building and updating arbitrarily shaped clusters in large datasets. The proposed algorithm enhances the incremental clustering process by limiting the search space to partitions rather than the whole dataset, which results in significant performance improvements over relevant incremental clustering algorithms. Experimental results with datasets of different sizes and dimensions show that the proposed algorithm speeds up the incremental clustering process by a factor of up to 3.2 compared to existing incremental algorithms.
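The idea of limiting the search space to partitions can be illustrated with a small sketch. It assumes a uniform grid whose cell side equals the DBSCAN radius `eps`; the paper's actual partitioning scheme is not described in the abstract, and the `GridIndex` class is hypothetical.

```python
import math
from collections import defaultdict

class GridIndex:
    """Bucket 2-D points into square cells of side eps (assumed partitioning)."""

    def __init__(self, eps):
        self.eps = eps
        self.cells = defaultdict(list)

    def _cell(self, p):
        return (math.floor(p[0] / self.eps), math.floor(p[1] / self.eps))

    def insert(self, p):
        self.cells[self._cell(p)].append(p)

    def neighbors(self, p):
        """Points within eps of p, scanning only the 3x3 block of cells around p."""
        cx, cy = self._cell(p)
        found = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for q in self.cells.get((cx + dx, cy + dy), []):
                    if math.dist(p, q) <= self.eps:
                        found.append(q)
        return found

index = GridIndex(eps=1.0)
for point in [(0.2, 0.2), (0.9, 0.4), (5.0, 5.0)]:
    index.insert(point)
print(index.neighbors((0.5, 0.5)))  # the distant point (5.0, 5.0) is never examined
```

Because the cell side equals `eps`, any point within `eps` of the query must lie in one of the nine scanned cells, so the region query for a newly inserted point touches only a small part of the dataset.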

Survey on Clustering Algorithm and Similarity Measure for Categorical Data

ICTACT Journal on Soft Computing, 2014

Learning is the process of generating useful information from a huge volume of data. Learning can be either supervised (e.g. classification) or unsupervised (e.g. clustering). Clustering is the process of grouping a set of physical objects into classes of similar objects. Objects in the real world consist of both numerical and categorical data. Categorical data cannot be analyzed like numerical data because they lack an inherent ordering. This paper describes ten different clustering algorithms, their methodology, and the factors influencing their performance. Each algorithm is evaluated using real-world datasets, and its pros and cons are specified. The various similarity and dissimilarity measures applied to categorical data, and their performance, are also discussed. Time complexity is the amount of time an algorithm takes, counted in elementary operations. The time complexity of the various algorithms is discussed, and their performance on real-world datasets such as mushroom, zoo, soybean, cancer, vote, car and iris is measured. In this survey, cluster accuracy and error rate are evaluated for four different clustering algorithms (K-modes, fuzzy K-modes, ROCK and Squeezer), two different similarity measures (DISC and Overlap), and DILCA applied to hierarchical and partitional algorithms.
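Two of the building blocks named in this abstract are simple enough to sketch: the Overlap similarity between categorical records and the per-attribute mode that K-modes uses in place of a mean. The function names and the toy records below are assumptions for illustration, not taken from the survey.

```python
from collections import Counter

def overlap_similarity(a, b):
    """Fraction of attributes on which two categorical records agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def mode_of(records):
    """Most frequent value of each attribute, the cluster 'center' in K-modes."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

records = [("a", "x"), ("a", "y"), ("b", "x")]
print(overlap_similarity(records[0], records[1]))  # 0.5
print(mode_of(records))                            # ('a', 'x')
```

No ordering of the category values is needed at any point, which is what makes these measures usable where Euclidean distance is not.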

A Density Based Dynamic Data Clustering Algorithm based on Incremental Dataset

Journal of Computer Science, 2012

Problem statement: Clustering and visualizing high-dimensional dynamic data is a challenging problem. Most existing clustering algorithms are based on static statistical relationships among the data. Dynamic clustering is a mechanism for adapting and discovering clusters in real-time environments. Many applications, such as incremental data mining in data warehouses and sensor networks, rely on dynamic data clustering algorithms. Approach: In this work, we present a density-based dynamic data clustering algorithm for clustering an incremental dataset and compare its performance with full runs of standard DBSCAN and Chameleon on the dynamic dataset. Most clustering algorithms perform well when performance is measured by clustering accuracy, which is calculated from the original class labels and the computed class labels. However, measuring performance with a cluster validation metric can give a different result. Results: This study addresses the problem of clustering a dynamic dataset that grows in size over time as more and more data are added. To evaluate the performance of the algorithms, we used the Generalized Dunn Index (GDI) and the Davies-Bouldin index (DB) as cluster validation metrics, as well as the time taken for clustering. Conclusion: In this study, we have successfully implemented and evaluated the proposed density-based dynamic clustering algorithm. Its performance was compared with that of the Chameleon and DBSCAN clustering algorithms. The proposed algorithm performed significantly well in terms of both clustering accuracy and speed.
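For readers unfamiliar with the validation metrics mentioned above, here is a minimal sketch of the Davies-Bouldin index in its standard form: for each cluster, take the worst ratio of summed within-cluster scatter to centroid separation, then average over clusters. Lower values indicate more compact, better separated clusters. The helper names are assumptions, and the sketch expects at least two clusters with distinct centroids.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def scatter(points, center):
    """Average Euclidean distance from the points to their cluster center."""
    return sum(math.dist(p, center) for p in points) / len(points)

def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points)."""
    centers = [centroid(c) for c in clusters]
    spreads = [scatter(c, m) for c, m in zip(clusters, centers)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max(
            (spreads[i] + spreads[j]) / math.dist(centers[i], centers[j])
            for j in range(k) if j != i
        )
    return total / k

tight = [[(0, 0), (2, 0)], [(10, 0), (12, 0)]]
loose = [[(0, 0), (2, 0)], [(4, 0), (6, 0)]]
print(davies_bouldin(tight) < davies_bouldin(loose))  # True
```

Unlike clustering accuracy, this metric needs no ground-truth class labels, which is why it can rank algorithms differently.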

Analytical Comparison of Some Traditional Partitioning based and Incremental Partitioning based Clustering Methods

International Journal of Computer Applications, 2012

Data clustering is a highly valuable field of computational statistics and data mining. It can be considered the most important unsupervised learning technique, as it deals with finding structure in a collection of unlabeled data. Clustering is the division of data into groups of similar objects. A major difficulty in the design of data clustering algorithms is that, in the majority of applications, new data are dynamically appended to an existing database, and it is not feasible to perform clustering from scratch every time new instances are added. Clustering algorithms that handle the incremental updating of data points are known as incremental clustering algorithms. In this paper, the authors review partition-based clustering methods, mainly K-means and DBSCAN, and provide a detailed comparison of the traditional and incremental clustering methods for both.

Clustering Categorical Data Using the K-Means Algorithm and the Attribute's Relative Frequency

Zenodo (CERN European Organization for Nuclear Research), 2017

Clustering is a well-known data mining technique used in pattern recognition and information retrieval. The initial dataset to be clustered can contain either categorical or numeric data, and each type of data has its own specific clustering algorithm. In this context, two algorithms have been proposed: k-means for clustering numeric datasets and k-modes for categorical datasets. The main problem encountered in data mining applications is clustering the categorical data that are so common in these datasets. One way to cluster categorical values is to transform the categorical attributes into numeric measures and apply the k-means algorithm directly instead of k-modes. This paper experiments with such an approach, transforming the categorical values into numeric ones using the relative frequency of each modality in the attributes. The proposed approach is compared with a previous method based on transforming categorical datasets into binary values. The scalability and accuracy of the two methods are tested experimentally. The results show that the proposed method outperforms the binary method in all cases.
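One plausible reading of the relative-frequency transformation is sketched below: each categorical value is replaced by the share of records that carry that value in the same attribute, after which any numeric algorithm such as k-means can be applied. The abstract does not give the exact formula, so `relative_frequency_encode` and the toy rows are assumptions.

```python
from collections import Counter

def relative_frequency_encode(rows):
    """Replace each categorical value by its relative frequency within its column."""
    n = len(rows)
    freq = [Counter(col) for col in zip(*rows)]
    return [[freq[j][v] / n for j, v in enumerate(row)] for row in rows]

rows = [
    ("red", "small"),
    ("red", "large"),
    ("blue", "small"),
    ("red", "small"),
]
for encoded_row in relative_frequency_encode(rows):
    print(encoded_row)
# [0.75, 0.75]
# [0.75, 0.25]
# [0.25, 0.75]
# [0.75, 0.75]
```

Compared with binary (one-hot) encoding, this keeps one numeric column per attribute instead of one per category, at the cost of mapping equally frequent categories to the same number.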

Review of Clustering Algorithm for Categorical Data

2013

Clustering partitions data into groups of similar data points; each such group is called a cluster. Clustering is a form of unsupervised learning, with no predefined class labels for the data points, and is considered an important tool for data mining. It has many applications, such as pattern recognition, image processing, market analysis, the World Wide Web and many others. Categorical data consist of categories, with each value representing some category. The problem of clustering categorical data can be addressed with the cluster ensemble approach, which combines multiple data partitions from different clustering algorithms into a single clustering solution to improve the robustness, accuracy and quality of the result. However, this technique generates the final data partition from imperfect information: the ensemble-information matrix, that is, the binary cluster-association matrix, records only cluster-to-data-point relations and leaves many entries unknown, which degrades the quality of the final partition. To avoid this degradation, a new link-based approach is presented that uses a refined cluster-association matrix. It maintains cluster-to-cluster relations and improves the quality of the final data partition by determining the unknown entries through measuring the similarity between clusters in the ensemble.
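To make the binary cluster-association matrix concrete, the sketch below builds one from an ensemble of base clusterings: one row per data point, one column per cluster of each base clustering, with 1 where the point belongs to that cluster. The function name and the toy ensemble are assumptions. Treating the 0 entries as the "unknown" entries is one reading of the abstract, and the link-based refinement itself is not implemented here.

```python
def binary_association_matrix(partitions):
    """Build the point-by-cluster membership matrix for an ensemble.

    partitions: list of label lists, one per base clustering, all of equal length.
    Returns (columns, matrix), where columns are (partition index, label) pairs.
    """
    columns = [(m, label)
               for m, labels in enumerate(partitions)
               for label in sorted(set(labels))]
    n = len(partitions[0])
    matrix = [[1 if partitions[m][i] == label else 0 for m, label in columns]
              for i in range(n)]
    return columns, matrix

partitions = [
    [0, 0, 1, 1],  # labels from a first base clustering
    [0, 1, 1, 1],  # labels from a second base clustering
]
columns, matrix = binary_association_matrix(partitions)
print(columns)  # [(0, 0), (0, 1), (1, 0), (1, 1)]
for row in matrix:
    print(row)
```

Each row contains exactly one 1 per base clustering; the refinement described above would replace the remaining zeros with similarity values derived from cluster-to-cluster links.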