On Detection Of Outliers And Their Effect In Supervised Classification

A meta analysis study of outlier detection methods in classification

An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism (Hawkins, 1980). Outlier detection has many applications, such as data cleaning, fraud detection and network intrusion. The existence of outliers can indicate individuals or groups whose behavior is very different from that of most individuals in the dataset. Outliers are frequently removed to improve the accuracy of estimators, but sometimes the presence of an outlier has a meaning whose explanation is lost if the outlier is deleted. In this work we compare outlier detection techniques based on statistical measures, clustering methods and data mining methods. In particular we compare detection of outliers using robust estimators of the center and the covariance matrix in the Mahalanobis distance, detection of outliers using partitioning around medoids (PAM), and two data mining techniques to detect outliers: the Bay'...

Outlier Detection: Applications And Techniques

2012

Outliers, once regarded as mere noisy data in statistics, have turned out to be an important problem that is being researched in diverse fields and application domains. Many outlier detection techniques have been developed for specific application domains, while some techniques are more generic. Some application domains, such as research on crime and terrorist activities, are being studied in strict confidentiality, so the techniques and their results are not readily forthcoming. A number of surveys, research and review articles, and books cover outlier detection techniques in the machine learning and statistical domains individually in great detail. In this paper we attempt to bring together various outlier detection techniques in a structured and generic description. With this exercise, we hope to attain a better understanding of the different directions of research on outlier analysis, both for ourselves and for beginners in this research field, who could then follow the links to different areas of application in detail.

Relative Study of Outlier Detection Procedures

International Journal of Engineering Sciences and Research Technology, 2016

Data mining refers to the extraction of interesting patterns from massive data sets. Outlier detection is one of the important parts of data mining; it discovers the observations that deviate from the normal expected behavior. Outlier detection and analysis is sometimes known as outlier mining. In this paper, we have attempted to give a broad and comprehensive literature survey of outliers and outlier detection procedures in one place, to clarify the richness and complexity associated with each outlier detection technique. We have also given a broad comparison of the different strategies for the various outlier techniques. Outliers are the points that are different from or inconsistent with the rest of the data. They can be novel, new, abnormal, unusual or noisy data. Outliers are sometimes more interesting than the majority of the data. The principle di...

Outlier Detection for Different Applications: Review

2013

Outlier detection is a data mining application. Outliers include noisy data, which is researched in various domains. Various techniques, some of them more generic, are already being researched. We survey various techniques and applications of outlier detection, offering an approach that is more useful for beginners. The proposed approach helps to clean data at university level in less time with great accuracy. This survey covers the existing outlier techniques and the applications where noisy data exists. Our paper gives a critical review of the techniques used in different applications of outlier detection that need further research and that yield a particular type of knowledge-based data that is more useful in research activities. Wherever anomalies are present, they will be detected through outlier detection techniques and monitored accordingly.

A Performance Analysis of the Innovative Methods Employed for Outlier Detection using Data Mining Algorithms with Three Different Applications

2016

Data mining refers to the mining of interesting patterns from massive data sets. Outlier detection is one of the important tasks of data mining: it finds objects that are considerably dissimilar, incomparable or inconsistent with respect to the remaining data. Outlier detection has wide applications, which include data analysis, network intrusion detection, financial fraud detection, and clinical diagnosis of diseases. This paper proposes three outlier detection models, OFWDT (Outlier Finding with Decision Tree), OFWNB (Outlier Finding with Naïve Bayes) and OFWQR (Outlier Finding With Quartile Range), with three different applications. The OFWDT model has three steps. The first step groups the data into a number of clusters using the Farthest First clustering algorithm. Because this reduces the size of the dataset, the computation time is greatly reduced. In the second step, outliers are detected from wisconsin breast cancer datas...

RODHA: Robust Outlier Detection using Hybrid Approach

2012

The task of outlier detection is to find the small groups of data objects that are exceptional with respect to the inherent behavior of the rest of the data. Detection of such outliers is fundamental to a variety of database and analytic tasks such as fraud detection and customer migration. There are several approaches [10] to outlier detection employed in many study areas, among which distance-based and density-based outlier detection techniques have gathered the most attention from researchers. In information theory, entropy is a core concept that measures uncertainty about a stochastic event; that is, entropy describes the distribution of an event. Because of its ability to describe the distribution of data, entropy has been applied in clustering applications in data mining. In this paper, we have developed a robust supervised outlier detection algorithm using a hybrid approach (RODHA), which incorporates the concepts of both distance and density along with an entropy measure while determining an outlier. We have provided an empirical study of different existing outlier detection algorithms and established the effectiveness of the proposed RODHA in comparison to other outlier detection algorithms.

Outlier Recognition in Clustering

2014

Outlier detection is a fundamental issue in data mining; specifically, it has been used to detect and remove anomalous objects from data. Outliers arise due to mechanical faults, changes in system behaviour, fraudulent behaviour, network intrusions or human errors. Firstly, this thesis presents a theoretical overview of outlier detection approaches. A novel outlier detection method, called the Clustering Outlier Removal (COR) algorithm, is then proposed and analyzed. It provides efficient outlier detection and data clustering capabilities in the presence of outliers, and is based on filtering of the data after the clustering process. The algorithm is divided into two stages. The first stage runs the k-means process. The main objective of the second stage is an iterative removal of objects that are far away from their cluster centroids. The removal occurs according to a chosen threshold. Finally, we provide experimental results from the application of our algori...
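
The two-stage idea described in this abstract (cluster with k-means, then repeatedly drop objects far from their centroid) can be sketched as follows. This is only an illustration under stated assumptions, not the COR algorithm itself: the abstract does not say how the threshold is defined or how centroids are initialized, so the absolute distance threshold, the caller-supplied initial centroids, and all function names here are hypothetical.

```python
import math

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]

def kmeans(points, centroids, iters=20):
    """Plain Lloyd iterations from the given starting centroids."""
    for _ in range(iters):
        labels = assign(points, centroids)
        updated = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # An empty cluster keeps its previous centroid.
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else centroids[c])
        centroids = updated
    return centroids

def cluster_then_filter(points, init_centroids, threshold):
    """Stage 1: k-means. Stage 2: iteratively remove far-away points.

    `threshold` is an assumed absolute distance to the assigned centroid.
    """
    kept, removed = list(points), []
    centroids = kmeans(kept, list(init_centroids))
    while True:
        labels = assign(kept, centroids)
        far = [p for p, lab in zip(kept, labels)
               if math.dist(p, centroids[lab]) > threshold]
        if not far:
            return kept, removed, centroids
        removed.extend(far)
        kept = [p for p in kept if p not in far]
        centroids = kmeans(kept, centroids)  # re-cluster without the removed points

# Toy data: two tight groups plus one stray point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (4, 20)]
kept, removed, centroids = cluster_then_filter(points, [(0, 0), (10, 10)], threshold=4.0)
print(removed, centroids)
```

On this toy data the stray point initially pulls the second centroid toward itself; once it is removed, re-clustering moves that centroid back to the middle of its group, which is the motivation for filtering after clustering.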

A Comparative Study on Outlier Detection Techniques

International Journal of Computer Applications, 2013

Outlier detection is an extremely important problem with direct application in a wide variety of domains. A key challenge with outlier detection is that it is not a well-formulated problem like clustering. This paper discusses different techniques and then compares them by analyzing their different aspects, essentially time complexity. Every unique problem formulation entails a different approach, resulting in a huge literature on outlier detection techniques. Several techniques have been proposed to target a particular application domain. The classification of outlier detection techniques based on the applied knowledge discipline provides an idea of the research done by different communities and also highlights the unexplored research avenues for the outlier detection problem. The paper also discusses the behavior of the different techniques with respect to their nature. The feasibility of a technique in a particular problem setting also depends on other constraints. For example, statistical techniques assume knowledge about the underlying distribution characteristics of the data. Distance-based techniques are typically expensive and hence are not applied in scenarios where computational complexity is an important issue.

Identification of Outliers: A Simulation Study

2015

This paper compares two approaches to identifying outliers in multivariate datasets: Mahalanobis distance (MD) and robust distance (RD). MD is known to suffer from masking and swamping effects, and RD is an approach that was developed to overcome the problems that arise in MD. The paper has two purposes: first, to identify outliers using MD and RD, and second, to show that RD performs better than MD in identifying outliers. An observation is classified as an outlier if its MD or RD is larger than a cut-off value. An outlier-generating model is used to generate a set of data, and MD and RD are computed from this set of data. The results showed that RD can identify outliers better than MD. However, in data without outliers the performance of both approaches is similar. The results also showed that RD can identify multivariate outliers much better when the number of dimensions is large.
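
The cut-off rule described in this abstract can be illustrated with a minimal sketch of the classical Mahalanobis distance for two-dimensional data. Several things here are assumptions rather than the paper's setup: the cut-off is taken as the square root of the 0.975 chi-square quantile with 2 degrees of freedom (a common convention, which has the closed form -2 ln(1 - p) for 2 degrees of freedom), the toy data is invented, and the function names are hypothetical. The robust distance is not implemented; it would use the same formula with robust estimates of the center and covariance in place of the sample mean and covariance.

```python
import math

def mahalanobis_2d(points):
    """Classical Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Sample covariance matrix entries (divisor n - 1).
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    dists = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) S^{-1} (dx, dy)^T with the 2x2 inverse written out explicitly.
        d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        dists.append(math.sqrt(d2))
    return dists

def chi2_cutoff_2d(p=0.975):
    """Square root of the chi-square(2 df) quantile at probability p."""
    return math.sqrt(-2.0 * math.log(1.0 - p))

# Toy data: a 5 x 5 grid of regular points plus one distant point (index 25).
points = [(i, j) for i in range(5) for j in range(5)] + [(20, 20)]
cutoff = chi2_cutoff_2d()
flagged = [i for i, d in enumerate(mahalanobis_2d(points)) if d > cutoff]
print(cutoff, flagged)
```

Note that the distant point inflates the very mean and covariance it is measured against, which is the mechanism behind the masking effect mentioned in the abstract.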

A Novel Approach for Univariate Outlier Detection

2014

In many applications outlier detection is an important task. In the process of Knowledge Discovery in Databases, isolation of outlying data is important. This isolation process improves the quality of data and reduces the impact of outlying data on the existing values. Numerous methods are available for detecting outliers in univariate data sets. Most of these methods handle one outlier at a time. In this paper, Grubbs' statistic, the sigma rule and fence rules are used to deal with more than one outlier at a time. In general, when multiple outliers are present, their presence prevents us from detecting other outliers. Hence, removing outliers as soon as they are found is an important task. Multiple outliers are evaluated on different data sets, and the results are shown to be effective. Separate procedures are used for detecting outliers in continuous and discrete data. Experimental results show that our method works well for different data.
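
Two of the univariate rules named in this abstract can be sketched in a few lines. This is a generic illustration, not the paper's procedure: the multipliers (3 standard deviations, 1.5 times the interquartile range) are conventional defaults assumed here, the quartiles use the default method of Python's `statistics.quantiles`, the data is invented, and the function names are hypothetical. Grubbs' test is omitted because it needs critical values from the t distribution.

```python
import statistics

def sigma_rule_outliers(values, k=3.0):
    """Values more than k sample standard deviations away from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * sd]

def fence_outliers(values, k=1.5):
    """Values outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Eight readings near 10 and two large values.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 25.0, 30.0]
print(sigma_rule_outliers(data))  # the two large values inflate the standard deviation
print(fence_outliers(data))
```

On this toy data the sigma rule flags nothing, because the two large values inflate the mean and standard deviation, while the quartile-based fences flag both. This is a small instance of multiple outliers hiding one another, as the abstract describes.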