A Density Based Dynamic Data Clustering Algorithm based on Incremental Dataset

Efficient incremental density-based algorithm for clustering large datasets

Alexandria Engineering Journal, 2015

In dynamic information environments such as the web, the amount of information is increasing rapidly, so the need to organize it efficiently is more important than ever. Given this dynamic nature, incremental clustering algorithms are preferred over traditional static algorithms. In this paper, an enhanced version of the incremental DBSCAN algorithm is introduced for incrementally building and updating arbitrarily shaped clusters in large datasets. The proposed algorithm enhances the incremental clustering process by limiting the search space to partitions rather than the whole dataset, which significantly improves performance compared to relevant incremental clustering algorithms. Experimental results with datasets of different sizes and dimensions show that the proposed algorithm speeds up the incremental clustering process by a factor of up to 3.2 compared to existing incremental algorithms.
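The abstract does not say how the dataset is partitioned. As a minimal sketch of the general idea of restricting a neighborhood search to nearby partitions, the example below assumes a uniform 2-D grid whose cell side equals Eps; the class `GridIndex` and its methods are hypothetical names chosen for illustration, not the paper's method.

```python
from collections import defaultdict
from math import dist, floor


class GridIndex:
    """Uniform 2-D grid with cell side eps (an assumed partitioning scheme)."""

    def __init__(self, eps):
        self.eps = eps
        self.cells = defaultdict(list)  # (cell_x, cell_y) -> points in that cell

    def _cell(self, p):
        # Map a point to the integer coordinates of the cell containing it.
        return (floor(p[0] / self.eps), floor(p[1] / self.eps))

    def insert(self, p):
        # Incremental insertion only touches the cell that owns the point.
        self.cells[self._cell(p)].append(p)

    def neighbors(self, p):
        """Points within eps of p, checking only the 3x3 block of cells around p."""
        cx, cy = self._cell(p)
        found = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                # .get avoids creating empty cells in the defaultdict.
                for q in self.cells.get((cx + dx, cy + dy), ()):
                    if dist(p, q) <= self.eps:
                        found.append(q)
        return found


index = GridIndex(eps=1.0)
for point in [(0.2, 0.2), (0.9, 0.4), (5.0, 5.0)]:
    index.insert(point)
near_origin = sorted(index.neighbors((0.5, 0.5)))
```

Because the cell side equals Eps, any point within Eps of the query lies in the query's cell or one of its eight adjacent cells, so the rest of the dataset never has to be scanned.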

An Incremental Density-Based Clustering Technique for Large Datasets

2000

Data mining, also known as knowledge discovery in databases, is a statistical analysis technique used to find hidden patterns and identify untapped value in large datasets. Clustering is a principal data discovery technique in data mining that segregates a dataset into subsets, or clusters, so that data values in the same cluster share some common characteristics or attributes. Many researchers have proposed clustering techniques that can identify arbitrarily shaped clusters, where a cluster is defined as a dense region separated from others by low-density regions; among them, DBSCAN is a prime density-based clustering algorithm. DBSCAN is capable of discovering clusters of arbitrary shape and size in databases that include noise and outliers. Many researchers have attempted to overcome certain deficiencies in the original DBSCAN, such as its difficulty identifying patterns within datasets of varied densities and its high computational complexity; hence a number of augmented forms of the DBSCAN algorithm are available. We present an incremental density-based clustering technique, based on the fundamental DBSCAN clustering algorithm, that improves its computational complexity. Our proposed algorithm can be used in different knowledge domains such as image processing, classification of patterns in GIS maps, x-ray crystallography and information security.
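For readers unfamiliar with the baseline, here is a minimal sketch of the classic (non-incremental) DBSCAN procedure, using a brute-force neighborhood scan. It is not the incremental technique proposed in the abstract; the function names and the sample points are illustrative assumptions.

```python
from math import dist

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels


sample = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(sample, eps=1.5, min_pts=3)
```

The brute-force `region_query` makes each run quadratic in the number of points, which is the cost that incremental and indexed variants try to avoid.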

Incremental Shared Nearest Neighbor Density-Based Clustering Algorithms for Dynamic Datasets

2016

Dynamic datasets undergo frequent changes in which a small number of data points are added and deleted. Such dynamic datasets are frequently encountered in many real-world applications such as search engines and recommender systems. Incremental data mining algorithms process these updates efficiently to avoid redundant computation. Shared nearest neighbor density-based clustering (SNN-DBSCAN) is a widely used clustering algorithm, mainly for its robustness. The existing incremental extension to SNN-DBSCAN cannot handle deletions from the dataset and handles insertions only point by point. We overcome both of these bottlenecks by efficiently identifying the affected parts of clusters while processing updates to the dataset in batch mode. We present three incremental algorithms that differ in how efficiently they eliminate redundant computation. We show the effectiveness of our algorithms by performing experiments on large synthetic as well as real-world datasets. Our algorithms are up to 2 Orders...

Batch Incremental Shared Nearest Neighbor Density Based Clustering Algorithm for Dynamic Datasets

Lecture Notes in Computer Science, 2017

Incremental data mining algorithms process frequent updates to dynamic datasets efficiently by avoiding redundant computation. The existing incremental extension to the shared nearest neighbor density-based clustering (SNND) algorithm cannot handle deletions from the dataset and handles insertions only one point at a time. We present an incremental algorithm that overcomes both of these bottlenecks by efficiently identifying the affected parts of clusters while processing updates to the dataset in batch mode. We show the effectiveness of our algorithm by performing experiments on large synthetic as well as real-world datasets. Our algorithm is up to four orders of magnitude faster than SNND and requires up to 60% more memory than SNND while providing output identical to SNND.
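The similarity measure underlying shared nearest neighbor clustering can be sketched as follows. This assumes the commonly used definition: two points are similar only if each appears in the other's k-nearest-neighbor list, and their similarity is the number of neighbors the two lists share. The abstract does not spell out this definition, and the function names are illustrative; the batch-incremental update logic is not shown.

```python
from math import dist


def knn_lists(points, k):
    """For each point, the set of indices of its k nearest neighbors (excluding itself)."""
    result = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: dist(p, points[j]))
        result.append(set(others[:k]))
    return result


def snn_similarity(knn, i, j):
    """Shared-neighbor count, nonzero only if i and j are in each other's lists."""
    if i in knn[j] and j in knn[i]:
        return len(knn[i] & knn[j])
    return 0


sample = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10), (11, 10), (10, 11)]
knn = knn_lists(sample, k=3)
same_group = snn_similarity(knn, 0, 1)   # two corners of the same square
far_apart = snn_similarity(knn, 0, 4)    # points from different groups
```

Under this definition, inserting or deleting a point can only change similarities for points whose neighbor lists change, which is what makes it possible to recompute only the affected parts of the clustering.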

Enhanced Density Based Algorithm for Clustering Large Datasets

Clustering is one of the data mining techniques that extract knowledge from spatial datasets. The DBSCAN algorithm is considered a well-founded algorithm because it discovers clusters of different shapes and handles noise effectively. Several algorithms improve on DBSCAN, such as the fast hybrid density algorithm (L-DBSCAN) and the fast density-based clustering algorithm. In this paper, an enhanced algorithm is proposed that improves the fast density-based clustering algorithm in its ability to discover clusters with different densities and to cluster large datasets.

Dynamic Clustering of Data with Modified K-Means Algorithm

K-means is a widely used partitional clustering method. While there have been considerable research efforts to characterize the key features of K-means clustering, further investigation is needed to reveal whether the optimal number of clusters can be found on the run based on a cluster quality measure. This paper presents a modified K-means algorithm with the intention of improving cluster quality and fixing the optimal number of clusters. The K-means algorithm takes the number of clusters (K) as input from the user, but in practice it is very difficult to fix the number of clusters in advance. The proposed method works in both cases, i.e. when the number of clusters is known in advance as well as when it is unknown. The user has the flexibility either to fix the number of clusters or to input the minimum number of clusters required. In the former case it works the same as the K-means algorithm. In the latter case the algorithm computes the new cluster centers by incrementing the cl...

Computational analysis of incremental clustering approaches for Large Data

2021

Clustering is a data mining approach that helps us find the underlying hidden structure in a dataset. K-means is a clustering method that uses distance functions to find the similarities or dissimilarities between instances. DBSCAN is a clustering algorithm that discovers clusters of arbitrary shapes and sizes in huge volumes of data using a spatial density method. These two approaches are the classical methods for efficient clustering, but they underperform when the data in the database is updated frequently, so incremental or gradual clustering approaches are preferred in this environment. In this paper, an incremental approach for clustering is introduced that uses K-means and DBSCAN to handle new datasets that are dynamically added to the database at intervals.

Multi-Density based Incremental Clustering

International journal of computer applications, 2015

Clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. It is a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. A major difficulty in the design of modern clustering algorithms is that new datasets are dynamically added to an existing large database, and it is not efficient to cluster the entire database every time a new dataset is added. The new data added dynamically to the existing database is called incremental data. DBSCAN is a widely used density-based clustering algorithm. However, it is known that DBSCAN fails to identify clusters of different densities. This paper presents a simple and efficient algorithm that identifies clusters of different densities and arbitrary shapes with automatic Eps estimation. Eps is estimated using the distance curve and the difference of slopes, and DBSCAN is applied to the data for each estimated Eps, resulting in multi-density clusters. The clusters formed in this way are then used to cluster the incrementally updated data.
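The abstract does not give the exact estimation rule. One plausible reading of "distance curve and difference of slopes" is sketched below: sort each point's distance to its k-th nearest neighbor, compute the slope between consecutive values, and take the distance just before the largest jump in slope as a candidate Eps. The function names and the choice of a single knee are assumptions; the paper estimates several Eps values, one per density level.

```python
from math import dist


def k_distances(points, k):
    """Ascending list of each point's distance to its k-th nearest neighbor."""
    out = []
    for i, p in enumerate(points):
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(d[k - 1])
    return sorted(out)


def estimate_eps(points, k):
    """Candidate Eps: the k-distance just before the sharpest increase in slope.

    Assumes at least 3 points and 1 <= k <= len(points) - 1.
    """
    curve = k_distances(points, k)
    # Slope between consecutive points of the sorted curve (unit x-spacing).
    slopes = [b - a for a, b in zip(curve, curve[1:])]
    # Difference of adjacent slopes; the largest value marks the knee.
    changes = [s2 - s1 for s1, s2 in zip(slopes, slopes[1:])]
    knee = max(range(len(changes)), key=changes.__getitem__)
    return curve[knee + 1]


sample = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
eps = estimate_eps(sample, k=1)
```

In this sample the four square corners each have a nearest neighbor at distance 1, while the isolated point's nearest neighbor is far away, so the knee falls at distance 1.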

Certain Investigation on Dynamic Clustering in Dynamic Datamining

Clustering is the process of grouping a set of objects into classes of similar objects. Dynamic clustering is a new research area concerned with datasets that have dynamic aspects. It requires updating the clusters whenever new data records are added to the dataset, which may change the clustering over time. When updates are continuous and the volume of dynamic data is huge, rescanning the database is not feasible in static data mining, but it is feasible in a dynamic data mining process. Dynamic data mining applies when the derived information is present for the purpose of analysis and the environment is dynamic, i.e. many updates occur. Since most researchers now accept this, research is moving toward solving the problems of applying data mining to dynamic databases. This paper investigates existing work related to dynamic clustering and incremental data clustering.

Techniques to Enhance the Performance of DBSCAN Clustering Algorithm in Data Mining

International Journal for Research in Applied Science & Engineering Technology (IJRASET), 2022

Clustering is a form of learning by observation. It is an unsupervised learning method and does not require a training dataset to generate a model. Clustering can lead to the discovery of previously unknown groups within the data. It is a common data mining method in which similar data are grouped together and dissimilar data are placed in different clusters for better analysis. In this paper, the DBSCAN algorithm is applied to compute the EPS value and the Euclidean distance on the basis of the similarity or dissimilarity of the input data. A back propagation algorithm is also applied to calculate the Euclidean distance dynamically, and a simulation study shows improved accuracy and reduced execution time.