fsdaSAS: A Package for Robust Regression for Very Large Datasets Including the Batch Forward Search

A robust procedure based on forward search to detect outliers

It is now widely recognized that the presence of outliers or errors in the data collection process can affect the results of any statistical analysis. The effect is likely to be even more severe for complex surveys such as a census. In the context of the VI Italian agriculture census, ISTAT used a robust procedure based on the Forward Search to detect cases in which the information collected by the census did not agree with that coming from the General Agency for Agricultural Subsidies (AGEA). The checks concerned total agricultural area (SAT), utilized agricultural area (SAU), and land for vineyards and olive groves. The outliers were referred for further investigation by subject-matter experts of the regions. This process significantly improved the quality of the data both in the agriculture census and in the AGEA records. This paper summarizes how ISTAT tackled the problems of data correction and control, discusses the methodological problems found d...

A Fast Procedure for Outlier Diagnostics in Large Regression Problems

Journal of The American Statistical Association, 1999

We propose a procedure for computing a fast approximation to regression estimates based on the minimization of a robust scale. The procedure can be applied with a large number of independent variables, where the usual algorithms require infeasible or extremely costly computation time. It can also be incorporated into any high-breakdown estimation method and can improve it at little additional computational cost. The procedure minimizes the robust scale over a set of tentative parameter vectors estimated by least squares after eliminating a set of possible outliers, which are obtained as follows. We represent each observation by the vector of changes in the least squares forecasts of that observation when each of the data points is deleted. Then we obtain the sets of possible outliers as the extreme points in the principal components of these vectors, or as the set of points with large residuals. The good performance of the procedure allows identification of multiple outliers, avoiding masking effects. We investigate the procedure's efficiency for robust estimation and its power as an outlier detection tool in a large real dataset and in a simulation study.
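The deletion-based representation at the core of this procedure can be sketched numerically. By the standard least-squares deletion identity, the change in the forecast of observation i when observation j is removed is -H[i,j]·e[j]/(1 - H[j,j]), where H is the hat matrix and e the residual vector, so the whole matrix of forecast changes needs only one full fit. A minimal NumPy sketch (the function names and the PCA screening rule are illustrative, not the authors' implementation; the paper also uses a complementary large-residual rule):

```python
import numpy as np

def deletion_forecast_changes(X, y):
    """Matrix D with D[i, j] = change in the least-squares forecast of
    observation i when observation j is deleted, via the deletion identity
    yhat_i^(j) - yhat_i = -H[i, j] * e[j] / (1 - H[j, j]),
    where H is the hat matrix and e the OLS residual vector."""
    X1 = np.column_stack([np.ones(len(y)), X])   # add intercept column
    H = X1 @ np.linalg.solve(X1.T @ X1, X1.T)    # hat matrix
    e = y - H @ y                                # OLS residuals
    return -H * (e / (1.0 - np.diag(H)))[None, :]

def candidate_outliers(X, y, k=2, n_flag=5):
    """One of the two screening rules, sketched: flag the most extreme
    rows of D in the space of its first k principal components."""
    D = deletion_forecast_changes(X, y)
    Dc = D - D.mean(axis=0)                      # center the rows
    _, _, Vt = np.linalg.svd(Dc, full_matrices=False)
    scores = Dc @ Vt[:k].T                       # PC scores of each row
    return np.argsort((scores ** 2).sum(axis=1))[-n_flag:]
```

Computing D this way costs one matrix product rather than n separate refits, which is what makes the screening step cheap in large problems.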

Paper 265-27 Robust Regression and Outlier Detection with the ROBUSTREG Procedure

2002

Robust regression is an important tool for analyzing data that are contaminated with outliers. It can be used to detect outliers and to provide resistant (stable) results in the presence of outliers. This paper introduces the ROBUSTREG procedure, which is experimental in SAS/STAT Version 9. The ROBUSTREG procedure implements the most commonly used robust regression techniques. These include M estimation (Huber, 1973), LTS estimation (Rousseeuw, 1984), S estimation (Rousseeuw and Yohai, 1984), and MM estimation (Yohai, 1987). The paper will provide an overview of robust regression methods, describe the syntax of PROC ROBUSTREG, and illustrate the use of the procedure to fit regression models and display outliers and leverage points. This paper will also discuss scalability of the ROBUSTREG procedure for applications in data cleansing and data mining. Introduction The main purpose of robust regression is to provide resistant (stable) results in the presence of outliers. In order to a...
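As a concrete illustration of the M estimation the procedure implements, the Huber estimator can be computed by iteratively reweighted least squares (IRLS). The sketch below is a minimal NumPy version, not PROC ROBUSTREG's implementation; the tuning constant c = 1.345 gives roughly 95% efficiency at the normal model:

```python
import numpy as np

def m_estimate(X, y, c=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimation via iteratively reweighted least squares (IRLS)."""
    X1 = np.column_stack([np.ones(len(y)), X])    # add intercept
    beta = np.linalg.lstsq(X1, y, rcond=None)[0]  # OLS starting values
    for _ in range(max_iter):
        r = y - X1 @ beta
        s = np.median(np.abs(r - np.median(r))) / 0.6745   # MAD scale
        u = np.abs(r / s)
        w = np.where(u <= c, 1.0, c / np.maximum(u, 1e-12))  # Huber weights
        sw = np.sqrt(w)
        beta_new = np.linalg.lstsq(sw[:, None] * X1, sw * y, rcond=None)[0]
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# one gross outlier barely moves the Huber fit away from y = 2 + 3x
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2 + 3 * x + rng.normal(0, 0.5, 50)
y[0] += 100.0
beta = m_estimate(x[:, None], y)
```

The outlier's residual of roughly 100 is downweighted to near zero by the Huber weight c/|r/s|, so the fitted slope and intercept stay close to the clean-data values.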

Outlier Lies: An Illustrative Example of Identifying Outliers and Applying Robust Models

2000

The presence of outliers can seriously distort the findings of statistical models. In this study, we illustrate how a minor typographical error in the data can make a standard OLS model "lie" in its estimates and model fit. We propose robust techniques that are insensitive to extreme, outlying cases and provide better predictions. With implementation examples, we demonstrate
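The "typo" scenario is easy to reproduce. The toy example below is my own construction, not the paper's data: a single misplaced decimal point drags the OLS slope far from the truth, while the Theil–Sen estimator, a simple robust alternative based on the median of pairwise slopes, is untouched:

```python
import numpy as np
from itertools import combinations

def theil_sen(x, y):
    """Theil-Sen estimator: slope = median of all pairwise slopes,
    intercept = median of y - slope*x. Resists up to ~29% outliers."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2) if x[i] != x[j]]
    m = float(np.median(slopes))
    return float(np.median(y - m * x)), m

x = np.arange(10, dtype=float)
y = 2 + 3 * x                 # exact line y = 2 + 3x
y[9] = 290.0                  # typo: 29 recorded as 290
X1 = np.column_stack([np.ones(10), x])
ols_b, ols_m = np.linalg.lstsq(X1, y, rcond=None)[0]
ts_b, ts_m = theil_sen(x, y)
# the OLS slope is pulled to roughly 17; Theil-Sen still returns slope 3
```

Only 9 of the 45 pairwise slopes involve the corrupted point, so the median slope is unaffected, whereas every OLS normal equation is contaminated.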

A Methodology for Automatised Outlier Detection in High-Dimensional Datasets: An Application to Euro Area Banks’ Supervisory Data

European Economics: Macroeconomics & Monetary Economics eJournal, 2018

Outlier detection in high-dimensional datasets poses new challenges that have not been investigated in the literature. In this paper, we present an integrated methodology for the identification of outliers which is suitable for datasets with a larger number of variables than observations. Our method aims to utilize the entire relevant information present in a dataset to detect outliers in an automated way, a feature that renders the method suitable for application to high-dimensional datasets. Our proposed five-step procedure for regression outlier detection entails a robust selection stage of the most explanatory variables, the estimation of a robust regression model based on the selected variables, and a criterion to identify outliers based on robust measures of the residuals' dispersion. The proposed procedure also deals with data redundancy and missing observations, which may inhibit the statistical processing of the data due to ill-conditioning of the covariance matrix....

Benchmark testing of algorithms for very robust regression: FS, LMS and LTS

Computational Statistics & Data Analysis, 2012

The methods of very robust regression resist up to 50% of outliers. The algorithms for very robust regression rely on selecting numerous subsamples of the data. New algorithms for LMS and LTS estimators that have increased computational efficiency due to improved combinatorial sampling are proposed. These and other publicly available algorithms are compared for outlier detection. Timings and estimator quality are also considered. An algorithm using the forward search (FS) has the best properties for both size and power of the outlier tests.
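The subsampling idea behind these algorithms can be sketched compactly. The toy FAST-LTS-style routine below is an illustrative sketch, not one of the benchmarked implementations: it fits elemental subsets, applies a couple of concentration (C-) steps, and keeps the fit minimizing the sum of the h smallest squared residuals:

```python
import numpy as np

def lts_fit(X, y, n_subsets=500, h=None, seed=0):
    """Toy FAST-LTS: elemental subsets plus concentration steps, keeping
    the fit that minimizes the sum of the h smallest squared residuals."""
    rng = np.random.default_rng(seed)
    X1 = np.column_stack([np.ones(len(y)), X])
    n, p = X1.shape
    h = h or (n + p + 1) // 2            # coverage just above half the data
    best_beta, best_obj = None, np.inf
    for _ in range(n_subsets):
        idx = rng.choice(n, p, replace=False)      # elemental subset
        try:
            beta = np.linalg.solve(X1[idx], y[idx])
        except np.linalg.LinAlgError:
            continue
        for _ in range(2):                         # concentration (C-) steps
            keep = np.argsort((y - X1 @ beta) ** 2)[:h]
            beta = np.linalg.lstsq(X1[keep], y[keep], rcond=None)[0]
        obj = np.sort((y - X1 @ beta) ** 2)[:h].sum()
        if obj < best_obj:
            best_obj, best_beta = obj, beta
    return best_beta

# 30% of responses shifted by a gross error; LTS still recovers y = 1 + 2x
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 100)
y = 1 + 2 * x + rng.normal(0, 0.3, 100)
y[:30] += 40.0
beta = lts_fit(x[:, None], y)
```

The combinatorial refinements the paper proposes concern exactly this subsampling loop: drawing subsets more cleverly reduces the number of iterations needed to hit an outlier-free elemental set.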

Comparison of M Estimation, S Estimation, with MM Estimation to Get the Best Estimation of Robust Regression in Criminal Cases in Indonesia

Jurnal Matematika, Statistika dan Komputasi, 2022

Survey-based criminal data from the National Socio-Economic Survey and the Village Potential Data Collection, produced by the Central Statistics Agency, recorded 269,324 crime incidents in Indonesia in 2019. The high crime rate is caused by several factors, including poverty and population density. The most influential factors in criminal acts in Indonesia can be determined with regression analysis. The most commonly used method of regression analysis is least squares, but it is valid only if its underlying assumptions are met, and the presence of outliers violates them. The outlier problem can be overcome by using a robust estimation method. This study aims to determine the best estimation method among Maximum Likelihood Type (M) estimation, Scale (S) estimation, and Method of Moment (MM) estimation for Robust Regression. The best estimate of Robust Regression is the smallest Resi...

Regression Estimation in the Presence of Outliers: A Comparative Study

2016

In linear models, the ordinary least squares (OLS) estimators of the parameters are the best linear unbiased estimators. However, if the data contain outliers, the least-squares estimates may be badly affected. An alternative approach, the so-called robust regression methods, is therefore needed to obtain a better fit of the model or more precise estimates of the parameters. In this article, various robust regression methods are reviewed. The focus is on the presence of outliers in the y-direction (response direction). The properties of these methods are compared through a simulation study, with efficiency and breakdown point as the comparison criteria. The methods are also applied to a real data set.

The Forward Search for Very Large Datasets

Journal of Statistical Software, 2015

The identification of atypical observations and the immunization of data analysis against both outliers and failures of modeling are important aspects of modern statistics. The forward search is a graphics-rich approach that leads to the formal detection of outliers and to the detection of model inadequacy combined with suggestions for model enhancement. The key idea is to monitor quantities of interest, such as parameter estimates and test statistics, as the model is fitted to data subsets of increasing size. In this paper we propose some computational improvements of the forward search algorithm and provide a recursive implementation of the procedure which exploits the information from the previous step. The output is a set of efficient routines for fast updating of the model parameter estimates, which do not require any data sorting, and for fast computation of likelihood contributions, which do not require matrix inversion or QR decomposition. It is shown that the new algorithms reduce the computation time by more than 80%. Furthermore, the running time now increases almost linearly with the sample size. All the routines described in this paper are included in the FSDA toolbox for MATLAB, which is freely downloadable from the internet.
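The monitoring idea can be illustrated in a few lines. The sketch below is a deliberately simplified forward search, not the FSDA routines, and it omits the paper's efficient recursive updating: it starts from an LMS-like elemental fit, grows the subset one unit per step, and records the minimum absolute scaled residual among units outside the subset; in well-behaved cases a sharp peak in this trace signals the entry of the first outlier:

```python
import numpy as np

def forward_search(X, y, m0=None, n_starts=200, seed=0):
    """Toy forward search for regression: start from a small subset chosen
    by an LMS-like criterion, grow it one unit per step, and record the
    minimum absolute scaled residual among units outside the subset."""
    X1 = np.column_stack([np.ones(len(y)), X])
    n, p = X1.shape
    m0 = m0 or p + 2
    # robust start: best of n_starts elemental fits by median squared residual
    rng = np.random.default_rng(seed)
    best_b, best_med = None, np.inf
    for _ in range(n_starts):
        idx = rng.choice(n, p, replace=False)
        try:
            b = np.linalg.solve(X1[idx], y[idx])
        except np.linalg.LinAlgError:
            continue
        med = np.median((y - X1 @ b) ** 2)
        if med < best_med:
            best_med, best_b = med, b
    subset = np.argsort((y - X1 @ best_b) ** 2)[:m0]
    trace = []
    for m in range(m0, n):
        b = np.linalg.lstsq(X1[subset], y[subset], rcond=None)[0]
        r = y - X1 @ b
        s = max(np.sqrt(np.sum(r[subset] ** 2) / (m - p)), 1e-12)  # scale floor
        outside = np.setdiff1d(np.arange(n), subset)
        trace.append(np.min(np.abs(r[outside])) / s)
        subset = np.argsort(r ** 2)[: m + 1]   # refresh subset for next step
    return np.array(trace)

# five gross outliers: the trace peaks just before they enter the subset
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 60)
y = 1 + 2 * x + rng.normal(0, 0.5, 60)
y[:5] += 50.0
trace = forward_search(x[:, None], y)
```

Note the per-step full refit: the routine is quadratic in the sample size, which is exactly the cost the paper's recursive updating schemes remove.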