Performance Comparison of Incremental Kmeans and Incremental DBSCAN Algorithms (original) (raw)

Computational analysis of incremental clustering approaches for Large Data

2021

Clustering is an approach of data mining, which helps us to find the underlying hidden structure in the dataset. K-means is a clustering method which usages distance functions to find the similarities or dissimilarities between the instances. DBSCAN is a clustering algorithm, which discovers the arbitrary shapes & sizes of clusters from huge volume of using spatial density method. These two approaches of clustering are the classical methods for efficient clustering but underperform when the data is updated frequently in the databases so, the incremental or gradual clustering approaches are always preferred in this environment. In this paper, an incremental approach for clustering is introduced using K-means and DBSCAN to handle the new datasets dynamically updated in the database in an interval.

Analytical Comparison of Some Traditional Partitioning based and Incremental Partitioning based Clustering Methods

International Journal of Computer Applications, 2012

Data clustering is a highly valuable field of computational statistics and data mining. Data clustering can be considered as the most important unsupervised learning technique as it deals with finding a structure in a collection of unlabeled data. A Clustering is division of data into similar objects. A major difficulty in the design of data clustering algorithms is that, in majority of applications, new data are dynamically appended into an existing database and it is not feasible to perform data clustering from scratch every time new data instances get added up in the database. The development of clustering algorithms which handle the incremental updating of data points is known as an Incremental clustering. In this paper authors have reviewed Partition based clustering methods mainly, K-means & DBSCAN and provided a detailed comparison of Traditional clustering and Incremental clustering method for both.

An Incremental K-means algorithm

Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, 2004

Data clustering is an important data exploration technique with many applications in engineering, including parts family formation in group technology and segmentation in image processing. One of the most popular data clustering methods is K-means clustering because of its simplicity and computational efficiency. The main problem with this clustering method is its tendency to converge at a local minimum. In this paper, the cause of this problem is explained and an existing solution involving a cluster centre jumping operation is examined. The jumping technique alleviates the problem with local minima by enabling cluster centres to move in such a radical way as to reduce the overall cluster distortion. However, the method is very sensitive to errors in estimating distortion. A clustering scheme that is also based on distortion reduction through cluster centre movement but is not so sensitive to inaccuracies in distortion estimation is proposed in this paper. The scheme, which is an incremental version of the K-means algorithm, involves adding cluster centres one by one as clusters are being formed. The paper presents test results to demonstrate the efficacy of the proposed algorithm.

A Density Based Dynamic Data Clustering Algorithm based on Incremental Dataset

Journal of Computer Science, 2012

Problem statement: Clustering and visualizing high-dimensional dynamic data is a challenging problem. Most of the existing clustering algorithms are based on the static statistical relationship among data. Dynamic clustering is a mechanism to adopt and discover clusters in real time environments. There are many applications such as incremental data mining in data warehousing applications, sensor network, which relies on dynamic data clustering algorithms. Approach: In this work, we present a density based dynamic data clustering algorithm for clustering incremental dataset and compare its performance with full run of normal DBSCAN, Chameleon on the dynamic dataset. Most of the clustering algorithms perform well and will give ideal performance with good accuracy measured with clustering accuracy, which is calculated using the original class labels and the calculated class labels. However, if we measure the performance with a cluster validation metric, then it will give another kind of result. Results: This study addresses the problems of clustering a dynamic dataset in which the data set is increasing in size over time by adding more and more data. So to evaluate the performance of the algorithms, we used Generalized Dunn Index (GDI), Davies-Bouldin index (DB) as the cluster validation metric and as well as time taken for clustering. Conclusion: In this study, we have successfully implemented and evaluated the proposed density based dynamic clustering algorithm. The performance of the algorithm was compared with Chameleon and DBSCAN clustering algorithms. The proposed algorithm performed significantly well in terms of clustering accuracy as well as speed.

DBSCAN: Past, present and future

The Fifth International Conference on the Applications of Digital Information and Web Technologies (ICADIWT 2014), 2014

Data Mining is all about data analysis techniques. It is useful for extracting hidden and interesting patterns from large datasets. Clustering techniques are important when it comes to extracting knowledge from large amount of spatial data collected from various applications including GIS, satellite images, X-ray crystallography, remote sensing and environmental assessment and planning etc. To extract useful pattern from these complex data sources several popular spatial data clustering techniques have been proposed. DBSCAN (Density Based Spatial Clustering of Applications with Noise) is a pioneer density based algorithm. It can discover clusters of any arbitrary shape and size in databases containing even noise and outliers. DBSCAN however are known to have a number of problems such as: (a) it requires user's input to specify parameter values for executing the algorithm; (b) it is prone to dilemma in deciding meaningful clusters from datasets with varying densities; (c) and it incurs certain computational complexity. Many researchers attempted to enhance the basic DBSCAN algorithm, in order to overcome these drawbacks, such as VDBSCAN, FDBSCAN, DD_DBSCAN, and IDBSCAN. In this study, we survey over different variations of DBSCAN algorithms that were proposed so far. These variations are critically evaluated and their limitations are also listed.

A Comparative Review of Incremental Clustering Methods for Large Dataset

International Journal of Advanced Trends in Computer Science and Engineering, 2021

Several algorithms have developed for analyzing large incremental datasets. Incremental algorithms are relatively efficient in dynamic evolving environment to seek out small clusters in large datasets. Many algorithms have devised for limiting the search space, building, and updating arbitrary shaped clusters in large incremented datasets. Within the real time visualization of real time data, when data in motion and growing dynamically, new data points arrive that generates instant cluster labels. In this paper, the comparative review of Incremental clustering methods for large dataset has done.

Efficient incremental density-based algorithm for clustering large datasets

Alexandria Engineering Journal, 2015

In dynamic information environments such as the web, the amount of information is rapidly increasing. Thus, the need to organize such information in an efficient manner is more important than ever. With such dynamic nature, incremental clustering algorithms are always preferred compared to traditional static algorithms. In this paper, an enhanced version of the incremental DBSCAN algorithm is introduced for incrementally building and updating arbitrary shaped clusters in large datasets. The proposed algorithm enhances the incremental clustering process by limiting the search space to partitions rather than the whole dataset which results in significant improvements in the performance compared to relevant incremental clustering algorithms. Experimental results with datasets of different sizes and dimensions show that the proposed algorithm speeds up the incremental clustering process by factor up to 3.2 compared to existing incremental algorithms.

Dynamic Incremental K-means Clustering

2014 International Conference on Computational Science and Computational Intelligence, 2014

K-means clustering is one of the most commonly used methods for classification and data-mining. When the amount of data to be clustered is "huge," and/or when data becomes available in increments, one has to devise incremental K-means procedures. Current research on incremental clustering does not address several of the specific problems of incremental K-means including the seeding problem, sensitivity of the algorithm to the order of the data, and the number of clusters. In this paper we present static and dynamic single-pass incremental K-means procedures that overcome these limitations.

Comparative Analysis of K-Means and Enhanced K-Means Algorithms for Clustering

NUTA Journal

Clustering in data mining is a way of organizing a set of objects in such a way that the objects in same bunch are more comparable and relevant to each other than to those objects in other bunches. In the modern information retrieval system, clustering algorithms are better if they result high quality clusters in efficient time. This study includes analysis of clustering algorithms k-means and enhanced k-means algorithm over the wholesale customers and wine data sets respectively. In this research, the enhanced k-means algorithm is found to be 5% faster for wholesale customers dataset for 4 clusters and 49%, 38% faster when the clusters size is increased to 8 and 13 respectively. The wholesale customers dataset when classified with 18 clusters the speedup was seen to be 29%. Similarly, in the case of wine dataset, the speed up is seen to be 10%, 30%, 49%, and 41% for 3, 8, 13 and 18 clusters respectively. Both of the algorithms are found very similar in terms of the clustering accur...

Dynamic Clustering of Data with Modified K-Means Algorithm

K-means is a widely used partitional clustering method. While there are considerable research efforts to characterize the key features of K-means clustering, further investigation is needed to reveal whether the optimal number of clusters can be found on the run based on the cluster quality measure. This paper presents a modified K-means algorithm with the intension of improving cluster quality and to fix the optimal number of cluster. The K-means algorithm takes number of clusters (K) as input from the user. But in the practical scenario, it is very difficult to fix the number of clusters in advance. The proposed method works for both the cases i.e. for known number of clusters in advance as well as unknown number of clusters. The user has the flexibility either to fix the number of clusters or input the minimum number of clusters required. In the former case it works same as K-means algorithm. In the latter case the algorithm computes the new cluster centers by incrementing the cl...