Robust cluster-based multivariate outlier diagnostics and parameter estimation in regression analysis

Communications in Statistics – Simulation and Computation

Cluster-based multivariate outlier identification and re-weighted regression in linear models

A cluster methodology, motivated by a robust similarity matrix, is proposed for identifying likely multivariate outlier structure and for estimating weighted least-squares (WLS) regression parameters in linear models. The proposed method is an agglomeration of procedures that runs from clustering the n observations, through a test of the 'no-outlier hypothesis' (TONH), to a weighted least-squares regression estimation. The cluster phase partitions the n observations into a main cluster of size h and a minor cluster of size n − h. A robust distance emerges from the main cluster, upon which the test of the no-outlier hypothesis is conducted. An initial WLS regression estimate is computed from the robust distance obtained from the main cluster. Until convergence, a re-weighted least-squares (RLS) regression estimate is updated with weights based on the normalized residuals. The proposed procedure blends an agglomerative hierarchical cluster analysis with complete linkage, through the TONH, into the re-weighted regression estimation phase; hence we propose to call it cluster-based re-weighted regression (CBRR). The CBRR is compared with three existing procedures using two data sets known to exhibit masking and swamping. Its performance is further examined through a simulation experiment. The results of the data illustrations and the Monte Carlo study show that the CBRR is effective in detecting multivariate outliers where other methods fail. The CBRR does not require enormous computation and is substantially resistant to masking and swamping.
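The re-weighting phase described above can be sketched in isolation. The following is a minimal illustration, not the CBRR itself: the clustering and TONH phases are omitted, and the Huber-type weight function and MAD scale used here are assumptions, since the abstract only states that weights come from normalized residuals.

```python
import numpy as np

def reweighted_ls(X, y, tol=1e-8, max_iter=50, c=2.5):
    """Iteratively re-weighted least squares: observations whose
    normalized residual exceeds c are down-weighted (a simplified
    stand-in for a re-weighting phase; weight function is illustrative)."""
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])          # add intercept column
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0]   # OLS start
    w = np.ones(n)
    for _ in range(max_iter):
        r = y - Xd @ beta
        s = np.median(np.abs(r - np.median(r))) / 0.6745  # MAD scale
        u = np.abs(r) / max(s, 1e-12)              # normalized residuals
        w = np.where(u <= c, 1.0, c / u)           # Huber-type weights
        sw = np.sqrt(w)
        beta_new = np.linalg.lstsq(Xd * sw[:, None], y * sw, rcond=None)[0]
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta, w
```

On a simple contaminated line, the gross outliers receive weights near zero and the fit tracks the clean majority.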

A performance evaluation of multivariate outlier identification methods

Ciência e Natura

Methodologies for identifying multivariate outliers are extremely important in statistical analysis. Outliers may reveal relevant information about the variables under investigation. Statistical applications without prior identification of possible extreme values may yield controversial results and induce mistaken decision making. In many contexts, outliers are points of great practical interest. Given this, this paper discusses methodologies for the detection of multivariate outliers through a fair and adequate comparative simulation procedure. The comparison considers detection techniques based on the Mahalanobis distance, as well as a methodology based on cluster analysis. Sensitivity, specificity, and accuracy metrics are used to measure method quality, and the computational time required to perform the procedures is also evaluated. The technique based on cluster analysis revealed a noticeable superiority over the others in detection quality and a...
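The sensitivity, specificity, and accuracy metrics mentioned above are standard confusion-matrix summaries; under the usual definitions they can be computed from simulation labels as follows (a small helper sketch, not code from the paper):

```python
def detection_metrics(is_outlier, flagged):
    """Confusion-matrix summary of an outlier-detection run.
    is_outlier -- ground-truth booleans (simulation labels)
    flagged    -- booleans returned by the detector"""
    tp = sum(t and f for t, f in zip(is_outlier, flagged))
    tn = sum((not t) and (not f) for t, f in zip(is_outlier, flagged))
    fp = sum((not t) and f for t, f in zip(is_outlier, flagged))
    fn = sum(t and (not f) for t, f in zip(is_outlier, flagged))
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    accuracy = (tp + tn) / len(is_outlier)
    return sensitivity, specificity, accuracy
```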

Outlier Detection by Regression Diagnostics Based on Robust Parameter Estimates

Hacettepe Journal of Mathematics and Statistics, 2012

In this article, robust versions of some frequently used diagnostics are considered for identifying outliers, instead of diagnostics based on the least-squares method. These diagnostics are Cook's distance, the Welsch–Kuh distance, and the Hadi measure. A simulation study is performed to compare the performance of the classical diagnostics with the proposed diagnostics based on robust M-estimation in identifying outliers.
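For reference, the classical (least-squares) version of the first of these diagnostics can be sketched as below. This computes ordinary Cook's distance only; the robust variants studied in the article replace the coefficient and scale estimates by M-estimates, which is not attempted here.

```python
import numpy as np

def cooks_distance(X, y):
    """Classical Cook's distance from an OLS fit with intercept:
    D_i = (r_i^2 / (p * s^2)) * h_ii / (1 - h_ii)^2."""
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])
    p = Xd.shape[1]
    H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T   # hat matrix
    r = y - H @ y                              # OLS residuals
    s2 = (r @ r) / (n - p)                     # residual variance
    h = np.diag(H)                             # leverages
    return (r**2 / (p * s2)) * h / (1 - h)**2
```

A single shifted response value produces the largest Cook's distance, which is exactly the pattern these diagnostics screen for.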

Outlier detection and robust estimation in linear regression models with fixed group effects

Journal of Statistical Computation and Simulation

This work studies outlier detection and robust estimation with data that are naturally distributed into groups and that follow approximately a linear regression model with fixed group effects. Several methods are considered. First, the robust fitting method of Peña and Yohai [A fast procedure for outlier diagnostics in large regression problems. J Am Stat Assoc. 1999;94:434–445], called the principal sensitivity components (PSC) method, is adapted to the grouped data structure and the model mentioned above. The robust methods RDL1 of Hubert and Rousseeuw [Robust regression with both continuous and binary regressors. J Stat Plan Inference. 1997;57:153–163] and M-S of Maronna and Yohai [Robust regression with both continuous and categorical predictors. J Stat Plan Inference. 2000;89:197–214] are also considered. These three methods are compared in terms of their effectiveness in outlier detection and their robustness through simulations, considering several contamination scenarios and increasing contamination levels. Results indicate that the adapted PSC procedure is able to detect a high percentage of true outliers and a small number of false outliers. It is appropriate when the contamination is in the error term or in the covariates, also detecting possibly masked high-leverage points. Moreover, in the simulations the final robust regression estimator preserved good efficiency under normality while keeping good robustness properties.

A further study comparing forward search multivariate outlier methods including ATLA with an application to clustering

Statistical Papers

This paper compares automated procedures for robust multivariate outlier detection through discussion and simulation. In particular, automated procedures that use the forward search along with Mahalanobis distances to identify and classify multivariate outliers subject to predefined criteria are examined. Procedures utilizing a parametric model criterion based on a χ²-distribution are among these, whereas the multivariate Adaptive Trimmed Likelihood Algorithm (ATLA) identifies outliers based on an objective function derived from the asymptotics of the location estimator assuming a multivariate normal distribution. Several criteria, including size (false positive rate), sensitivity, and relative efficiency, are canvassed. To illustrate relative efficiency in a multivariate setting in a new way, measures of variability of the multivariate location parameter when the underlying distribution is chosen from a multivariate generalization of the Tukey–Huber...
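The forward-search idea referenced above can be illustrated in a stripped-down form: grow a subset of observations one point at a time by smallest Mahalanobis distance from the current subset, so that gross outliers enter last. The crude median-based start and the simple append rule below are assumptions; published forward-search procedures use more careful starts and allow interchanges.

```python
import numpy as np

def forward_search_order(X, ridge=1e-8):
    """Return the order in which observations enter a simplified
    forward-search subset.  Start: the half of the data closest (L1)
    to the coordinate-wise median; growth: append the point with the
    smallest squared Mahalanobis distance from the subset."""
    n, p = X.shape
    d0 = np.abs(X - np.median(X, axis=0)).sum(axis=1)
    m = max(p + 2, n // 2)
    order = list(np.argsort(d0)[:m])
    while len(order) < n:
        S = X[order]
        mu = S.mean(axis=0)
        inv = np.linalg.inv(np.cov(S.T) + ridge * np.eye(p))
        dev = X - mu
        md2 = np.einsum("ij,jk,ik->i", dev, inv, dev)
        md2[order] = np.inf               # exclude points already inside
        order.append(int(np.argmin(md2)))
    return order
```

Monitoring the distances along this ordering (not shown) is what the automated procedures use to declare where the outliers begin.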

Comparison of Robust Estimators’ Performance for Detecting Outliers in Multivariate Data

Journal of Statistical Modelling and Analytics, 2021

In multivariate data, outliers are difficult to detect, especially when the dimension of the data increases. The Mahalanobis distance (MD) has been one of the classical methods for detecting outliers in multivariate data. However, the classical mean and covariance matrix in the MD suffer from masking and swamping effects if the data contain outliers. Because of this problem, many studies use a robust estimator instead of the classical estimators of the mean and covariance matrix. In this study, the performance of five robust estimators, namely Fast Minimum Covariance Determinant (FMCD), Minimum Vector Variance (MVV), Covariance Matrix Equality (CME), Index Set Equality (ISE), and Test on Covariance (TOC), is investigated and compared. FMCD has been widely used and is known as among the best robust estimators; however, there are certain conditions under which FMCD still falls short. MVV, CME, ISE, and TOC are refinements of FMCD: these four robust estimators improve the last step of the FMCD algorithm. Hence, the objec...
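The classical baseline these estimators improve upon is easy to state: compute squared Mahalanobis distances from the sample mean and covariance and compare them with a χ² cutoff. A minimal sketch for the bivariate case follows (the closed-form quantile −2·log(α) holds only for p = 2); the robust estimators discussed in the study would replace the sample mean and covariance with FMCD-type estimates.

```python
import numpy as np

def mahalanobis_flags(X, alpha=0.025):
    """Flag points whose squared Mahalanobis distance from the sample
    mean/covariance exceeds the chi-square(2) upper-alpha quantile."""
    n, p = X.shape
    assert p == 2, "closed-form cutoff below is valid for p = 2 only"
    mu = X.mean(axis=0)
    inv = np.linalg.inv(np.cov(X.T))
    dev = X - mu
    md2 = np.einsum("ij,jk,ik->i", dev, inv, dev)
    cutoff = -2.0 * np.log(alpha)   # chi2(2) quantile: F(x) = 1 - exp(-x/2)
    return md2, md2 > cutoff
```

With a single gross outlier this classical version still works; masking appears once a cluster of outliers inflates the sample covariance, which is precisely what motivates the robust replacements.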

The Performance of Clustering Approach with Robust MM–Estimator for Multiple Outlier Detection in Linear Regression

Jurnal Teknologi, 2006

Identifying outliers is a fundamental step in the regression model building process. Outlying observations should be identified because of their potential effect on the fitted model. As a result of the need to identify outliers, numerous outlying measures, such as residuals and the hat matrix diagonal, have been built. However, these outlying measures work well only when a regression data set contains a single outlying point, and it is well established that real regression data sets may have multiple outlying observations that individually are not easy to identify by the same measures. In this paper, an alternative approach is proposed: a clustering technique incorporated with a robust estimator for multiple outlier identification. The robust estimator proposed is the MM-estimator. The performance of the clustering approach with the proposed estimator is compared with the classical estimator, namely Least Squares (LS), and another robust estimator, Least Trimmed Squares (LTS). The evaluation of estimator performance is carried out through analyses of classical multiple-outlier data sets found in the literature and of simulated multiple-outlier data sets. Additionally, analyses of the Root Mean Square Error (RMSE) and of the coverage probabilities of Bootstrap Bias-Corrected and Accelerated (BCa) confidence intervals are conducted to identify the best estimator for the identification of multiple outliers. The analysis reveals that the MM-estimator performed excellently on the classical multiple-outlier data sets and on a wide variety of simulated data sets with any percentage of outliers, any number of regressor variables, and any sample size, followed by LTS and LS. The analysis also shows that the RMSE of the proposed estimator is always smaller than that of the other two estimators.
Furthermore, the coverage probabilities of the BCa confidence intervals indicate that the MM-estimator is the best estimator, since its intervals have good coverage probabilities, good equitailedness, and the shortest average confidence interval length, followed by LTS and LS.
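Of the estimators compared above, LTS is the easiest to sketch. The following is a simplified FAST-LTS-style fit (random elemental starts plus concentration steps, without the nesting and partitioning refinements of the real algorithm); it is illustrative only and is not the MM-estimator the paper proposes.

```python
import numpy as np

def lts_fit(X, y, h=None, n_starts=100, n_csteps=10, seed=0):
    """Least Trimmed Squares: minimize the sum of the h smallest
    squared residuals.  Random p-point starts, then concentration
    steps (refit on the h best-fitting observations)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])
    p = Xd.shape[1]
    h = h or (n + p + 1) // 2
    best_obj, best_beta = np.inf, None
    for _ in range(n_starts):
        idx = rng.choice(n, size=p, replace=False)       # elemental start
        beta = np.linalg.lstsq(Xd[idx], y[idx], rcond=None)[0]
        for _ in range(n_csteps):                        # concentration steps
            r2 = (y - Xd @ beta) ** 2
            keep = np.argsort(r2)[:h]                    # h smallest residuals
            beta = np.linalg.lstsq(Xd[keep], y[keep], rcond=None)[0]
        obj = np.sort((y - Xd @ beta) ** 2)[:h].sum()
        if obj < best_obj:
            best_obj, best_beta = obj, beta
    return best_beta
```

Because only the h best-fitting observations enter the objective, a 20% block of contaminated responses leaves the fitted line essentially unchanged.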

Multivariate Outlier Detection and Robust Covariance Matrix Estimation

Technometrics, 2001

In this article, we present a simple multivariate outlier-detection procedure and a robust estimator for the covariance matrix, based on the use of information obtained from projections onto the directions that maximize and minimize the kurtosis coefficient of the projected data. The properties of this estimator (computational cost, bias) are analyzed and compared with those of other robust estimators described in the literature through simulation studies. The performance of the outlier-detection procedure is analyzed by applying it to a set of well-known examples.
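The intuition behind kurtosis projections can be illustrated crudely: a small cluster of outliers makes the projected distribution heavy-tailed along the direction pointing at it. The sketch below scans random unit directions rather than optimizing the kurtosis coefficient exactly as the article does, and the MAD-based 3.5 cutoff is an assumption.

```python
import numpy as np

def kurtosis_direction_flags(X, n_dirs=500, cut=3.5, seed=0):
    """Approximate kurtosis-projection screening: among many random
    unit directions, take the ones with maximal and minimal projected
    kurtosis and flag points far (in MAD units) from the median there."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    dirs = rng.normal(size=(n_dirs, p))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    proj = X @ dirs.T                                  # (n, n_dirs)
    z = proj - proj.mean(axis=0)
    kurt = (z**4).mean(axis=0) / (z**2).mean(axis=0) ** 2
    flags = np.zeros(n, dtype=bool)
    for j in (int(np.argmax(kurt)), int(np.argmin(kurt))):
        v = proj[:, j]
        mad = np.median(np.abs(v - np.median(v))) / 0.6745
        flags |= np.abs(v - np.median(v)) / max(mad, 1e-12) > cut
    return flags
```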

Multivariate outlier detection based on a robust Mahalanobis distance with shrinkage estimators

RePEc: Research Papers in Economics, 2017

A collection of robust Mahalanobis distances for multivariate outlier detection is proposed, based on the notion of shrinkage. Robust intensity and scaling factors are optimally estimated to define the shrinkage. Some properties are investigated, such as affine equivariance and breakdown value. The performance of the proposal is illustrated through comparison with other techniques from the literature, in a simulation study and on a real dataset. Its behavior when the underlying distribution is heavy-tailed or skewed shows the appropriateness of the method when we deviate from the common assumption of normality. The resulting high correct-detection rates and low false-detection rates in the vast majority of cases, as well as the significantly smaller computation time, show the advantages of our proposal.
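The general shape of a shrinkage-based Mahalanobis distance can be sketched as follows. Here the shrinkage target is a scaled identity, the intensity lam is fixed, and the coordinate-wise median stands in for a robust location; all three are assumptions, whereas the paper estimates the intensity and scaling factors optimally and robustly.

```python
import numpy as np

def shrinkage_md2(X, lam=0.2):
    """Squared Mahalanobis distances with the sample covariance shrunk
    toward a scaled-identity target:
    S* = (1 - lam) * S + lam * (tr(S)/p) * I."""
    n, p = X.shape
    mu = np.median(X, axis=0)                 # robust location stand-in
    S = np.cov(X.T)
    target = (np.trace(S) / p) * np.eye(p)    # scaled-identity target
    S_shrunk = (1 - lam) * S + lam * target
    dev = X - mu
    return np.einsum("ij,jk,ik->i", dev, np.linalg.inv(S_shrunk), dev)
```

Shrinking toward a well-conditioned target keeps the inverse stable in higher dimensions, which is one source of the computational advantage claimed for this family of methods.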