Outlier Removal Approach as a Continuous Process in Basic K-Means Clustering Algorithm

An adaptive outlier removal aided k-means clustering algorithm

Journal of King Saud University - Computer and Information Sciences, 2021


An Analysis of Outlier Detection through clustering method

This paper deals with outliers, that is, observations whose behavior is unusual relative to the other members of the same data set. Outlier detection can be used both for anomaly detection and for identifying abnormal observations. The deviation of an outlier can be measured through attributes such as range, size, and activity. Detecting outliers makes it easier to reject the undesirable records in a given field. For instance, in healthcare, a person's condition can be assessed from their latest health report or their regular activity; if the person is found to be inactive, they may be sick. Two approaches to outlier detection are used in this paper: 1) a centroid-based approach built on the K-Means and hierarchical clustering algorithms, and 2) a clustering-based approach. Clustering helps detect outliers by placing similar elements in the same group. The paper is based on these two approaches.

A Review Paper on Comparison of Clustering Algorithms based on Outliers

Data mining, in general, deals with the discovery of non-trivial, hidden, and interesting knowledge from different types of data. With the development of information technologies, the number of databases, as well as their dimensionality and complexity, grows rapidly. Automated analysis of large amounts of information is therefore necessary. The results of the analysis are then used by a human or a program to make decisions. One of the basic problems of data mining is outlier detection, which in some cases is similar to the classification problem. For example, the main concern of clustering-based outlier detection algorithms is to find clusters and outliers, where outliers are often regarded as noise that should be removed to make the clustering more reliable. In this thesis, the ability to detect outliers is improved by combining the perspectives of outlier detection and cluster identification. The proposed work compares four methods: K-Means, k-Medoids, iterative K-Means, and a density-based method. Unlike traditional clustering-based methods, the proposed algorithm provides more efficient outlier detection and data clustering in the presence of outliers. The purpose of the method is not only to cluster the data but, at the same time, to find outliers in the resulting clusters. The goal is to model an unknown nonlinear function based on observed input-output pairs. All simulations for the proposed work were carried out in MATLAB.
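One reason a comparison between K-Means and k-Medoids is informative in the presence of outliers can be illustrated with a small sketch. The helper names `mean_center` and `medoid` and the toy data are assumptions made for illustration, not code from the thesis (which reports using MATLAB): a mean-based center is pulled toward an extreme point, whereas a medoid must be an actual data point and tends to stay inside the dense region.

```python
import math


def mean_center(points):
    """Cluster representative used by K-Means: the coordinate-wise mean."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def medoid(points):
    """Cluster representative used by k-Medoids: the member point with the
    smallest total distance to all other members."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))


# Hypothetical cluster: four nearby points plus one extreme value.
cluster = [(1, 1), (1, 2), (2, 1), (2, 2), (100, 100)]
center = mean_center(cluster)   # dragged far away from the four nearby points
representative = medoid(cluster)  # remains one of the four nearby points
```

On this toy cluster the mean lands far from every nearby point, while the medoid is one of them, which is the usual argument for medoid-based methods being less sensitive to outliers.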

A REVIEW PAPER ON IMPROVED K-MEANS TECHNIQUE FOR OUTLIER DETECTION IN HIGH DIMENSIONAL DATASET

Outlier detection is an important task in many data mining application domains. It can be regarded as a binary, asymmetric (unbalanced) pattern classification problem in which one class has a much higher cardinality than the other. Finding outliers is especially challenging in high-dimensional datasets, where the data contain a large amount of noise that reduces effectiveness. Outliers are useful because they help diagnose data characteristics that deviate significantly from the average. This paper presents an improved K-Means technique for outlier detection in high-dimensional datasets. Various subspace-based methods have been proposed for finding abnormally sparse density units in subspaces. This paper proposes a CLIQUE density-based clustering algorithm that handles subspaces forming dense regions when projected onto lower-dimensional subspaces of a high-dimensional dataset, and then applies the improved K-Means algorithm to the generated subspaces to identify outliers effectively and efficiently, yielding more meaningful and interpretable results.

Outlier Detection and Removal Algorithm in K-Means and Hierarchical Clustering

World Journal of Computer Application and Technology

An outlier is a pattern that is dissimilar to the rest of the patterns in a dataset. Outlier detection is an important issue in data mining, where it is used to detect and remove anomalous objects from data. Outliers occur due to mechanical faults, changes in system behavior, fraudulent behavior, and human error. This paper describes a methodology for detecting and removing outliers in K-Means and hierarchical clustering. First, the K-Means and hierarchical clustering algorithms are applied to a data set; then outliers are identified in each resulting clustering. In K-Means clustering, outliers are found using a distance-based approach and a cluster-based approach. In hierarchical clustering, outliers are found using the dendrogram. The goal of the project is to detect and remove outliers to make the clustering more reliable.
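The workflow described in this abstract (cluster first, then look for outliers inside each resulting cluster, then remove them) might be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the function names `kmeans` and `flag_outliers`, the rule "farther than `factor` times the cluster's mean distance to its centroid", the value `factor=2.0`, the fixed starting centroids, and the toy data are all hypothetical choices.

```python
import math


def kmeans(points, init_centroids, iters=100):
    """Plain Lloyd-style K-Means starting from caller-supplied centroids."""
    centroids = list(init_centroids)
    k = len(centroids)

    def assign(cs):
        # Each point gets the index of its nearest centroid.
        return [min(range(k), key=lambda j: math.dist(p, cs[j])) for p in points]

    for _ in range(iters):
        labels = assign(centroids)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[j])  # leave an empty cluster where it was
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(centroids)


def flag_outliers(points, centroids, labels, factor=2.0):
    """Distance-based rule (an assumption): flag a point if it lies farther from
    its centroid than `factor` times the mean distance within its cluster."""
    dists = [math.dist(p, centroids[lab]) for p, lab in zip(points, labels)]
    mean_dist = {}
    for lab in set(labels):
        cluster = [d for d, l in zip(dists, labels) if l == lab]
        mean_dist[lab] = sum(cluster) / len(cluster)
    return [d > factor * mean_dist[lab] for d, lab in zip(dists, labels)]


# Hypothetical data: two tight groups and one stray point.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
        (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5),
        (5, 20)]
start = [(0, 0), (10, 10)]
centroids, labels = kmeans(data, start)
flags = flag_outliers(data, centroids, labels)
clean = [p for p, is_outlier in zip(data, flags) if not is_outlier]
clean_centroids, _ = kmeans(clean, start)  # re-cluster after removal
```

Re-running K-Means on the cleaned data moves the second centroid back to the middle of its group, which is the sense in which removal makes the clustering "more reliable". The starting centroids are fixed here only to keep the example deterministic.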

A Novel Approach for Data Clustering using Improved K-means Algorithm

In statistics and data mining, k-means is well known for its efficiency in clustering large data sets. The aim is to group data points into clusters such that similar items fall in the same cluster. K-means is one of the most commonly used algorithms for cluster analysis: it partitions n observations into k clusters so that each observation belongs to the cluster with the nearest mean. However, the classical K-means algorithm has some flaws. It becomes inefficient when working on large data, and improving it remains an open problem. It is also sensitive to the selection of the initial centroids, on which the quality of the resulting clusters heavily depends. The proposed project aims to cluster data efficiently by reducing the time needed to generate clusters, improving the performance of the existing algorithm through normalization and initial centroid selection techniques. The experimental results show that the proposed algorithm can overcome these shortcomings of the K-means algorithm.
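The abstract does not say which normalization or centroid selection technique is used, so the following is only a sketch of one plausible combination: min-max normalization followed by a farthest-first choice of starting centroids. Both techniques, the function names `min_max_normalize` and `farthest_first_centroids`, and the toy data are assumptions made for illustration.

```python
import math


def min_max_normalize(points):
    """Rescale every feature to [0, 1] so no feature dominates the distance.
    A constant feature (zero span) is mapped to 0.0."""
    cols = list(zip(*points))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [tuple((v - lo) / sp if sp else 0.0 for v, lo, sp in zip(p, lows, spans))
            for p in points]


def farthest_first_centroids(points, k):
    """Deterministic seeding: start from the first point, then repeatedly add
    the point whose distance to its nearest chosen centroid is largest."""
    centroids = [points[0]]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids


# Hypothetical data whose second feature is 100 times larger than the first.
raw = [(0, 0), (10, 1000), (5, 500), (1, 100)]
scaled = min_max_normalize(raw)
seeds = farthest_first_centroids(scaled, 3)
```

Well-spread starting centroids generally mean fewer iterations before K-means converges. Note that farthest-first seeding tends to pick extreme points, so it can select outliers as seeds when they are present.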

Clustering of Patient Disease Data by Using K-Means Clustering

Clustering is a method of grouping records in a database based on certain criteria. One clustering method is K-Means Clustering, which divides data into multiple groups and can accept input data without class labels. This research applies the K-Means Clustering method to patient disease data at Haji Adam Malik Hospital in Medan. The results illustrate the disease tendencies of patients at Haji Adam Malik Hospital. This research is expected to serve as a reference for anticipating priority services for patients, especially users of Social Security and Healthcare Security.

Outlier Reduction using Hybrid Approach in Data Mining

International Journal of Modern Education and Computer Science, 2015

Outlier detection is a very active area of research in data mining, where an outlier is a data point that does not match the other data in a dataset. In existing approaches, outlier detection is performed only on numeric datasets. Clustering-based methods mainly treat elements lying outside the clusters as outliers, but some anomalous elements may, for various reasons, become part of a cluster, so these must be considered as well. The proposed method uses a hybrid approach to reduce the number of outliers, which can only be reduced by improving the cluster formation method. It uses two data mining techniques for cluster formation: weighted k-means and a neural network. Weighted k-means is a clustering technique that can be applied to text and date datasets as well as numeric datasets; it assigns a weight to each element in the dataset. The output of weighted k-means becomes the input to the neural network, which serves as a classification and clustering technique. The neural network is trained and then tests the clusters formed by weighted k-means to ensure that they are grouped correctly. Many outlier detection methods exist in data mining. The proposed method uses Integrating Semantic Knowledge (SOF) for outlier detection. This method detects semantic outliers, where a semantic outlier is a data point that behaves differently from the other data points in the same class or cluster. The main motive of this work is to reduce the number of outliers by improving the cluster formation method, so that the outlier rate falls, the mean square error decreases, and accuracy improves. The simulation results show that the proposed method works well, as it significantly reduces the number of outliers.

Comparative Analysis with Implementation of Cluster Based, Distance Based and Density Based Outlier Detection Techniques Using Different Healthcare Datasets

Outliers are viewed as erroneous data, and they have become an important problem that has been investigated in various research areas and application domains. Some outlier detection methods have been developed for specific application domains, whereas others are more generic. Some application areas, such as research on crime and terrorist activity, are investigated under strict confidentiality. With advances in information technology, the number of databases, as well as their dimensionality and complexity, grows rapidly, which creates the need for automated analysis of large amounts of heterogeneously structured data. Different data mining systems are used for this purpose. The objective of such systems is to detect hidden dependencies in the data. Outlier detection in data mining is the detection of objects or observations that do not conform to an expected pattern in a data set. The technique is beneficial in several areas, such as healthcare, crime detection, fraud detection, and public safety. In this paper we study different outlier detection algorithms: cluster-based, distance-based, and density-based outlier detection. Experiments are carried out on four different datasets to identify the outliers. The comparative results show that cluster-based methods are efficient for computing clusters, while the density-based outlier detection algorithm offers better accuracy and faster execution in identifying outliers than the other two algorithms.
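Of the three families compared in this abstract, the distance-based one is the simplest to illustrate. The sketch below scores each point by the distance to its k-th nearest neighbor and reports the highest-scoring points as outliers. This particular scoring rule, the function names `knn_distance_scores` and `top_outliers`, the parameter values, and the toy data are assumptions; the abstract does not specify which distance-based variant was evaluated.

```python
import math


def knn_distance_scores(points, k=2):
    """Outlier score of a point = distance to its k-th nearest neighbor.
    Isolated points have large scores; points in dense areas have small ones."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_outliers(points, k=2, n=1):
    """Indices of the n points with the largest k-th-neighbor distance."""
    scores = knn_distance_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Hypothetical data: a unit square of four points and one distant point.
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
suspects = top_outliers(sample, k=2, n=1)
```

This brute-force version compares every pair of points, so its cost grows quadratically with the number of records, which is one reason execution time features in comparisons such as the one reported above.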