Model Selection in Analyzing Spatial Groups in Regression Analysis

Regression analysis of spatial data

Ecology Letters, 2010

Many of the most interesting questions ecologists ask lead to analyses of spatial data. Yet, perhaps confused by the large number of statistical models and fitting methods available, many ecologists seem to believe this is best left to specialists. Here, we describe the issues that need consideration when analysing spatial data and illustrate these using simulation studies. Our comparative analysis uses generalized least squares, spatial filters, wavelet revised models, conditional autoregressive models and generalized additive mixed models to estimate regression coefficients from synthetic but realistic data sets, including some that violate standard regression assumptions. We assess the performance of each method using two measures and using statistical error rates for model selection. Methods that performed well included the generalized least squares family of models and a Bayesian implementation of the conditional autoregressive model. Ordinary least squares also performed adequately in the absence of model selection, but had poorly controlled Type I error rates and so did not show the improvements in performance under model selection seen with the above methods. Removing large-scale spatial trends in the response led to poor performance. These are empirical results; hence extrapolation of these findings to other situations should be performed cautiously. Nevertheless, our simulation-based approach provides much stronger evidence for comparative analysis than assessments based on single or small numbers of data sets, and should be considered a necessary foundation for statements of this type in future.
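The contrast between ordinary and generalized least squares described in this abstract can be sketched in a few lines of NumPy. This is only an illustration, not the authors' simulation design: the exponential covariance, the regular grid, the range value and the function names (`exponential_covariance`, `ols_fit`, `gls_fit`) are all assumptions made for the example, and the covariance is treated as known rather than estimated.

```python
import numpy as np


def exponential_covariance(coords, range_param):
    """Exponential covariance exp(-d / range) between all pairs of sites (assumed form)."""
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return np.exp(-dists / range_param)


def ols_fit(X, y):
    """Ordinary least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def gls_fit(X, y, sigma):
    """GLS coefficients for a known error covariance sigma.

    Whitening with the Cholesky factor L (sigma = L L^T) turns GLS into
    OLS on the transformed data L^{-1} X and L^{-1} y.
    """
    chol = np.linalg.cholesky(sigma)
    X_white = np.linalg.solve(chol, X)
    y_white = np.linalg.solve(chol, y)
    return np.linalg.lstsq(X_white, y_white, rcond=None)[0]


# Synthetic example: sites on a 10 x 10 grid with spatially correlated errors.
rng = np.random.default_rng(0)
gx, gy = np.meshgrid(np.arange(10.0), np.arange(10.0))
coords = np.column_stack([gx.ravel(), gy.ravel()])
sigma = exponential_covariance(coords, range_param=3.0)
chol = np.linalg.cholesky(sigma)

n = coords.shape[0]
x = chol @ rng.standard_normal(n)        # spatially structured predictor
errors = chol @ rng.standard_normal(n)   # spatially autocorrelated errors
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + errors

print("OLS:", ols_fit(X, y))
print("GLS:", gls_fit(X, y, sigma))
```

In practice the covariance parameters would have to be estimated from the data, which is where the methods compared in the paper differ; repeating the simulation over many seeds would be needed to say anything about bias or error rates.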

Specification and estimation of spatial linear regression models

Regional Science and Urban Economics, 1992

Spatially correlated residuals lead to various serious problems in applied spatial research. In this paper several conventional specification and estimation procedures for models with spatially dependent residuals are compared with alternative procedures. The essence of the latter is a search procedure for spatially lagged variables. Incorporating the omitted spatially lagged variables into the model may remedy spatially dependent residuals, in particular if the spatial dependence is substantive. The efficacy of the conventional and alternative procedures in small samples is investigated by means of Monte Carlo techniques for an irregular lattice structure.
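As a rough illustration of the two ingredients this abstract mentions, the sketch below builds a spatially lagged variable and measures spatial dependence with Moran's I. It is not the paper's procedure: the regular grid with rook contiguity (the paper uses an irregular lattice), the row-standardized weights, and the helper names `rook_weights`, `spatial_lag` and `morans_i` are assumptions chosen for simplicity.

```python
import numpy as np


def rook_weights(n_rows, n_cols):
    """Row-standardized rook-contiguity weights for a regular grid (assumed layout)."""
    n = n_rows * n_cols
    W = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    W[i, rr * n_cols + cc] = 1.0
    # Each row sums to one, so W @ x is the neighbour average of x.
    return W / W.sum(axis=1, keepdims=True)


def spatial_lag(W, values):
    """Spatially lagged variable W x: the weighted average over each site's neighbours."""
    return W @ np.asarray(values, dtype=float)


def morans_i(values, W):
    """Moran's I = (n / S0) * (z' W z) / (z' z), with z the centred values."""
    z = np.asarray(values, dtype=float)
    z = z - z.mean()
    n = z.size
    return (n / W.sum()) * (z @ W @ z) / (z @ z)


# Example: a smooth west-to-east gradient shows positive autocorrelation.
W = rook_weights(6, 6)
gradient = np.tile(np.arange(6.0), 6)
print("Moran's I of gradient:", morans_i(gradient, W))
print("First lagged values:", spatial_lag(W, gradient)[:3])
```

Applied to regression residuals, a clearly positive Moran's I would suggest trying `spatial_lag(W, x)` for candidate predictors `x` as additional regressors, which is the general idea of the search for omitted spatially lagged variables.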

An empirical evaluation of spatial regression models

Computers & Geosciences, 2006

Conventional statistical methods are often ineffective for evaluating spatial regression models. One reason is that spatial regression models usually have more parameters or smaller sample sizes than a simple model, so their degrees of freedom are reduced. Thus, it is often impractical to evaluate them with traditional tests. Another reason, rooted in the theory of statistical methods, is that statistical criteria depend crucially on assumptions such as normality, independence, and homogeneity. This may create problems because the assumptions themselves are open to testing. In view of these problems, this paper proposes an alternative empirical evaluation method. To illustrate the idea, a few hedonic regression models for a house and land price data set are evaluated, including a simple, ordinary linear regression model and three spatial models. Their performance is examined in terms of how well the price of the house and land can be predicted. With a cross-validation technique, the price at each sample point is predicted with a model estimated from the samples excluding the one concerned. Then, empirical criteria are established whereby the predicted prices are compared with the observed prices. The proposed method provides objective guidance for the selection of a suitable model specification for a data set. Moreover, the method is seen as an alternative way to test the significance of the spatial relationships in spatial regression models.
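The leave-one-out scheme described here can be sketched as follows for the simplest case, an ordinary linear model. The data are synthetic stand-ins for the house and land prices, root mean squared error is an assumed choice of empirical criterion, and the names `loo_predictions` and `rmse` are hypothetical; a spatial model would replace the `lstsq` step with its own estimator.

```python
import numpy as np


def loo_predictions(X, y):
    """Predict each observation from a model fitted to all the other observations."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i  # drop the point being predicted
        beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        preds[i] = X[i] @ beta
    return preds


def rmse(predicted, observed):
    """Root mean squared prediction error (one possible empirical criterion)."""
    diff = np.asarray(predicted) - np.asarray(observed)
    return float(np.sqrt(np.mean(diff ** 2)))


# Synthetic stand-in for a hedonic price data set: intercept plus two attributes.
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y = X @ np.array([10.0, 1.5, -0.7]) + 0.3 * rng.standard_normal(n)

print("Leave-one-out RMSE:", rmse(loo_predictions(X, y), y))
```

Computing the same criterion for each candidate specification on identical held-out points gives a comparison that does not rely on distributional assumptions about the errors.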

Variable selection in multivariate regression model for spatially dependent data

arXiv (Cornell University), 2023

This paper deals with variable selection in the multivariate linear regression model when the data are observations on a spatial domain that is a grid of sites in Z^d with d ≥ 2. We use a criterion that characterizes the subset of relevant variables in terms of two parameters, and we propose estimators for these parameters based on spatially dependent observations. We prove the consistency, under specified assumptions, of the proposed method. A simulation study assessing the finite-sample behaviour of the proposed method in comparison with existing ones is presented.

Improved inferences for spatial regression models

Regional Science and Urban Economics, 2015

The quasi-maximum likelihood (QML) method is popular in the estimation and inference for spatial regression models. However, the QML estimators (QMLEs) of the spatial parameters can be quite biased and hence the standard inferences for the regression coefficients (based on t-ratios) can be seriously affected. This issue, however, has not been addressed. The QMLEs of the spatial parameters can be bias-corrected based on the general method of Yang (2015b, J. of Econometrics 186, 178-200). In this paper, we demonstrate that by simply replacing the QMLEs of the spatial parameters with their bias-corrected versions, the usual t-ratios for the regression coefficients can be greatly improved. We propose further corrections to the standard errors of the QMLEs of the regression coefficients, and the resulting t-ratios perform superbly, leading to much more reliable inferences.

Pitfalls in Higher Order Model Extensions of Basic Spatial Regression Methodology

Review of Regional Studies

Spatial regression methodology has been around for most of the 50 years (1961-2011) that the Southern Regional Science Association has been in existence. Cliff and Ord (1969) devised a parsimonious specification for the structure of spatial dependence among observations that could be used to empirically model spatial interdependence. Later work (Cliff and Ord, 1973, 1981; Ord, 1975) further developed these ideas into basic spatial regression models, which were popularized and augmented by Anselin (1988). We discuss several issues that have arisen in recent work that attempts to extend basic models of spatial interdependence to include more types of spatial and non-spatial interdependencies. Understanding these issues should help future work avoid several pitfalls that plague current and past attempts at extensions along these lines.

Spatial Regression Models for Field Trials: A Comparative Study and New Ideas

Frontiers in Plant Science, 2022

Naturally occurring variability within a study region harbors valuable information on relationships between biological variables. Yet, spatial patterns within these study areas, e.g., in field trials, violate the assumption of independence of observations, setting particular challenges in terms of hypothesis testing, parameter estimation, feature selection, and model evaluation. We evaluate a number of spatial regression methods in a simulation study, including more realistic spatial effects than employed so far. Based on our results, we recommend generalized least squares (GLS) estimation for experimental as well as for observational setups and demonstrate how it can be incorporated into popular regression models for high-dimensional data such as regularized least squares. This new method is available in the BioConductor R-package pengls. Inclusion of a spatial error structure improves parameter estimation and predictive model performance in low-dimensional settings and also improves feature selection in high-dimensional settings by reducing "red-shift": the preferential selection of features with spatial structure. In addition, we argue that the absence of spatial autocorrelation (SAC) in the model residuals should not be taken as a sign of a good fit, since it may result from overfitting the spatial trend. Finally, we confirm our findings in a case study on the prediction of winter wheat yield based on multispectral measurements.

Geographically weighted regression: a natural evolution of the expansion method for spatial data analysis

Environment and Planning A, 1998

Geographically weighted regression and the expansion method are two statistical techniques which can be used to examine the spatial variability of regression results across a region and so inform on the presence of spatial nonstationarity. Rather than accept one set of 'global' regression results, both techniques allow the possibility of producing 'local' regression results from any point within the region so that the output from the analysis is a set of mappable statistics which denote local relationships. Within the paper, the application of each technique to a set of health data from northeast England is compared. Geographically weighted regression is shown to produce more informative results regarding parameter variation over space.
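The idea of producing "local" regression results at each point can be sketched as a distance-weighted least squares fit repeated at every site. This is a minimal illustration rather than the paper's implementation: the Gaussian kernel, the fixed bandwidth, the synthetic data and the function name `gwr_coefficients` are assumptions.

```python
import numpy as np


def gwr_coefficients(coords, X, y, bandwidth):
    """Local coefficients at every site from Gaussian-kernel weighted least squares."""
    n, p = X.shape
    betas = np.empty((n, p))
    for i in range(n):
        dists = np.linalg.norm(coords - coords[i], axis=1)
        weights = np.exp(-0.5 * (dists / bandwidth) ** 2)  # assumed Gaussian kernel
        root_w = np.sqrt(weights)
        # Weighted least squares via OLS on rows scaled by sqrt(weight).
        betas[i] = np.linalg.lstsq(X * root_w[:, None], y * root_w, rcond=None)[0]
    return betas


# Synthetic example: the slope drifts from west to east across an 8 x 8 grid.
rng = np.random.default_rng(0)
gx, gy = np.meshgrid(np.arange(8.0), np.arange(8.0))
coords = np.column_stack([gx.ravel(), gy.ravel()])
n = coords.shape[0]
x = rng.standard_normal(n)
local_slope = 1.0 + 0.2 * coords[:, 0]
X = np.column_stack([np.ones(n), x])
y = 2.0 + local_slope * x + 0.1 * rng.standard_normal(n)

betas = gwr_coefficients(coords, X, y, bandwidth=2.0)
print("Slope estimates along the first row:", betas[:8, 1])
```

Each row of `betas` can be mapped back to its site, giving the set of mappable local statistics the abstract refers to. The bandwidth controls how local the fits are and would normally be chosen by a criterion such as cross-validation.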

Effective sample size for spatial regression models

Electronic Journal of Statistics, 2018

We propose a new definition of effective sample size. Although the recent works of Griffith (2005, 2008) and Vallejos and Osorio (2014) provide a theoretical framework to address the reduction of information in a spatial sample due to spatial autocorrelation, the asymptotic properties of the estimators have not been studied in those works or in previous ones. In addition, the concept of effective sample size has been developed primarily for spatial regression processes with a constant mean. This paper introduces a new definition of effective sample size for general spatial regression models that is coherent with previous definitions. The asymptotic normality of the maximum likelihood estimator is obtained under an increasing domain framework. In particular, the conditions under which the limiting distribution holds are established for the Matérn covariance family. Illustrative examples accompany the discussion of the limiting results, including some cases where the asymptotic variance has a closed form. The asymptotic normality leads to an approximate hypothesis test that establishes whether there is redundant information in the sample. Simulation results support the theoretical findings and provide information about the behavior of the power of the suggested test. A real dataset in which a transect sampling scheme has been used is analyzed to estimate the effective sample size when a spatial linear regression model is assumed.
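To give a feel for how spatial autocorrelation reduces the information in a sample, the sketch below computes an effective sample size for the constant-mean case as ESS = 1' R^{-1} 1, where R is the correlation matrix of the observations. Taking this form as the earlier definition the abstract builds on is an assumption, as are the exponential correlation on an equally spaced transect and the function names; the paper's new definition for general regression models is not reproduced here.

```python
import numpy as np


def exponential_correlation(positions, range_param):
    """Exponential correlation exp(-|s_i - s_j| / range) along a transect (assumed model)."""
    positions = np.asarray(positions, dtype=float)
    lags = np.abs(positions[:, None] - positions[None, :])
    return np.exp(-lags / range_param)


def effective_sample_size(R):
    """Constant-mean effective sample size 1' R^{-1} 1 for correlation matrix R."""
    ones = np.ones(R.shape[0])
    return float(ones @ np.linalg.solve(R, ones))


# Twenty equally spaced sites: stronger correlation leaves fewer "effective" observations.
transect = np.arange(20)
for range_param in (0.5, 2.0, 8.0):
    R = exponential_correlation(transect, range_param)
    print(f"range={range_param}: ESS={effective_sample_size(R):.2f}")
```

With uncorrelated observations R is the identity and the value equals the nominal sample size; as the correlation range grows, the value shrinks towards one.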