Dealing with Spatial Autocorrelation when Learning Predictive Clustering Trees
Related papers
2016
Spatial autocorrelation is the correlation among data values which is strictly due to the relative spatial proximity of the objects that the data refer to. Inappropriate treatment of data with spatial dependencies, where spatial autocorrelation is ignored, can obfuscate important insights. In this paper, we propose a data mining method that explicitly considers spatial autocorrelation in the values of the response (target) variable when learning predictive clustering models. The method is based on the concept of predictive clustering trees (PCTs), according to which hierarchies of clusters of similar data are identified and a predictive model is associated with each cluster. In particular, our approach is able to learn predictive models for both a continuous response (regression task) and a discrete response (classification task). We evaluate our approach on several real-world problems of spatial regression and spatial classification. Considering autocorrelation in the models improves predictions, which are consistently clustered in space, and the resulting clusters tend to preserve the spatial arrangement of the data, while at the same time providing a multi-level insight into the spatial autocorrelation phenomenon. The evaluation of SCLUS in several ecological domains (e.g., predicting outcrossing rates within a conventional field due to the surrounding genetically modified fields, as well as predicting pollen dispersal rates from two lines of plants) confirms its capability of building spatially aware models that capture the spatial distribution of the target variable. In general, the maps obtained by using SCLUS do not require further post-smoothing of the results if we want to use them in practice.
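As an illustration of what "spatial autocorrelation in the values of the response variable" means in practice, the following minimal sketch computes global Moran's I, one commonly used autocorrelation statistic. It is not taken from the paper and does not claim to be the measure used inside SCLUS; the function names `morans_i` and `chain_weights`, and the toy one-dimensional layout of sites, are assumptions made only for this example.

```python
from typing import List, Sequence


def chain_weights(n: int) -> List[List[float]]:
    """Binary adjacency weights for n sites on a line (hypothetical layout).

    Site i is a neighbour of sites i-1 and i+1; a site is not its own neighbour.
    """
    w = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = 1.0
        w[i + 1][i] = 1.0
    return w


def morans_i(values: Sequence[float], weights: Sequence[Sequence[float]]) -> float:
    """Global Moran's I: (n / S0) * sum_ij w_ij d_i d_j / sum_i d_i^2,

    where d_i is the deviation of value i from the mean and S0 is the
    sum of all weights. Assumes the values are not all identical.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    total = sum(d * d for d in dev)
    return (n / s0) * (cross / total)


if __name__ == "__main__":
    w = chain_weights(6)
    # Similar values sit next to each other: positive autocorrelation.
    print(morans_i([1, 1, 1, 5, 5, 5], w))
    # Neighbouring values always differ: negative autocorrelation.
    print(morans_i([1, 5, 1, 5, 1, 5], w))
```

Positive values indicate that nearby sites tend to carry similar target values, which is the situation the abstract says should not be ignored when learning a model.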
Extending data mining for spatial applications: a Case Study in Predicting Nest Locations
2000
Spatial data mining is a process to discover interesting and potentially useful spatial patterns embedded in spatial databases. Efficient tools for extracting information from spatial data sets can be of importance to organizations which own, generate and manage large geo-spatial data sets. The current approach towards solving spatial data mining problems is to use classical data mining tools after "materializing" spatial relationships and assuming independence between different data points. However, classical data mining methods often perform poorly on spatial data sets which have high spatial autocorrelation. In this paper we will review spatial statistical techniques which can effectively model the notion of spatial autocorrelation and apply it to the problem of predicting bird nest locations in a wetland.
Precision agriculture aims at sustainably optimizing the management of cultivated fields by addressing the spatial variability found in crops and their environment. Spatial variability can be evaluated using spatial cluster analysis, which partitions data into homogeneous groups, considering the geographical location of features and their spatial relationships. Spatial clustering methods evaluate the degree of spatial autocorrelation between features and quantify the statistical significance of identified clusters. Clustering of orchard data calls for an approach which is based on modeling point data, i.e., individual trees, which can be related to site-specific measurements. We present and evaluate a spatial clustering method that applies the Getis-Ord Gi* statistic to the analysis of tree-based data in an experimental orchard. We examine the robustness of this method for the analysis of "hot-spots" (clusters of high data values) and "cold-spots" (clusters of low data values) in orchards and compare it to the k-means clustering algorithm, a widely used aspatial method. We then present a novel approach which accounts for the spatial structure of data in a multivariate cluster analysis by combining the spatial Getis-Ord Gi* statistic with k-means multivariate clustering. The combined method improved results both by discriminating among feature values and by representing their spatial structure, and therefore represents a superior technique for identifying homogeneous spatial clusters in orchards. This approach can be used as a tool for precision management of orchards by partitioning trees into management zones.
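To make the hot-spot and cold-spot idea concrete, the sketch below computes the Getis-Ord Gi* z-score for each point using the standard textbook formula with a fixed distance band. It is only an illustration under stated assumptions: the helper names `distance_band_weights` and `getis_ord_gi_star`, the binary weights, and the toy row of six trees are hypothetical, and the abstract does not specify how the authors combine Gi* with k-means, so that step is not shown.

```python
import math
from typing import List, Sequence, Tuple


def distance_band_weights(
    coords: Sequence[Tuple[float, float]], band: float
) -> List[List[float]]:
    """Binary weights: w_ij = 1 if point j lies within `band` of point i.

    Each point is within distance 0 of itself, so w_ii = 1, which is
    what distinguishes Gi* from the plain Gi statistic.
    """
    n = len(coords)
    return [
        [
            1.0
            if math.hypot(coords[i][0] - coords[j][0], coords[i][1] - coords[j][1])
            <= band
            else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


def getis_ord_gi_star(
    values: Sequence[float], weights: Sequence[Sequence[float]]
) -> List[float]:
    """Gi* z-score per point. Assumes the values are not all identical."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    scores = []
    for i in range(n):
        w = weights[i]
        sum_w = sum(w)
        sum_w2 = sum(x * x for x in w)
        # Local weighted sum minus its expectation under no clustering.
        numerator = sum(w[j] * values[j] for j in range(n)) - mean * sum_w
        denominator = s * math.sqrt((n * sum_w2 - sum_w * sum_w) / (n - 1))
        scores.append(numerator / denominator)
    return scores


if __name__ == "__main__":
    # Six hypothetical trees in one row, one unit apart.
    coords = [(float(k), 0.0) for k in range(6)]
    yields = [10.0, 10.0, 10.0, 1.0, 1.0, 1.0]
    z = getis_ord_gi_star(yields, distance_band_weights(coords, band=1.0))
    for k, score in enumerate(z):
        print(k, round(score, 3))
```

Large positive scores mark trees surrounded by high values (candidate hot-spots) and large negative scores mark candidate cold-spots. One plausible way to use such scores, stated here only as an assumption, is to pass them to an aspatial clusterer as an additional feature.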
Machine learning is a computational technology widely used in regression and classification tasks. One of the drawbacks of its use in the analysis of spatial variables is that machine learning algorithms are, in general, not designed to deal with spatially autocorrelated data. This often causes the residuals to exhibit clustering, in clear violation of the condition of independent and identically distributed random variables. In this work we analyze the performance of some well-established machine learning algorithms and one spatial algorithm in regression tasks for situations where the data present varying degrees of clustering. We defined "performance" as the goodness of fit achieved by an algorithm in conjunction with the degree of spatial association of the residuals. We generated a set of synthetic datasets with varying degrees of clustering and built regression models with synthetic autocorrelated explanatory variables and regression coefficients. We then solved these regression models with the algorithms chosen. We identified significant differences between the machine learning algorithms in their sensitivity to spatial autocorrelation and the achieved goodness of fit. We also exposed the superiority of machine learning algorithms over generalized least squares in both goodness of fit and residual spatial autocorrelation. Our findings can be useful in choosing the best regression algorithm for the analysis of spatial variables.
A spatial clustering perspective on autocorrelation and regionalization
Environmental and Ecological Statistics, 2008
We revisit one of the classical problems in geography and cartography where multiple observations on a lattice (N) need to be grouped into many fewer regions (G), especially when this number of desired regions is unknown a priori. Since an optimization through all possible aggregations is not feasible, a hierarchical classification scheme is proposed with an objective function sensitive to spatial pattern. The objective function to be minimized during the assignment of observations to regions (classification) consists of two terms: the first characterizes accuracy and the second, model complexity. For the latter, we introduce a spatial measure that characterizes the number of homogeneous patches rather than the usual number of classes. A simulation study shows that such a classification procedure is less sensitive to random and spatially correlated error (noise) than non-spatial classification. We also show that for conditional autoregressive error (noise) fields the optimal partitioning is the one that has the highest within-units generalized Moran coefficient. The classifier is implemented in ArcView and demonstrated on both a socio-economic and an environmental application.
Spatial contextual classification and prediction models for mining geospatial data
IEEE Transactions on Multimedia, 2002
Modeling spatial context (e.g., autocorrelation) is a key challenge in classification problems that arise in geospatial domains. The Markov random field (MRF) is a popular model for incorporating spatial context into image segmentation and land-use classification problems. The spatial autoregression (SAR) model, which is an extension of the classical regression model for incorporating spatial dependence, is popular for prediction and classification of spatial data in regional economics, natural resources, and ecological studies. There is little literature comparing these alternative approaches to facilitate the exchange of ideas (e.g., solution procedures). We argue that the SAR model makes more restrictive assumptions about the distribution of feature values and class boundaries than MRF. The relationship between SAR and MRF is analogous to the relationship between regression and Bayesian classifiers. This paper provides comparisons between the two models using a probabilistic and an experimental framework.
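For readers unfamiliar with how a spatial autoregression extends classical regression, the sketch below uses the commonly cited form y = rho * W y + beta * x + eps, where W is a row-standardized neighbourhood matrix and rho controls the strength of spatial dependence. This is a generic illustration, not the formulation or solution procedure of the paper: the single covariate, the chain-shaped neighbourhood, the fixed-point solver, and the names `row_standardized_chain` and `simulate_sar` are all assumptions made for the example.

```python
import random
from typing import List, Sequence


def row_standardized_chain(n: int) -> List[List[float]]:
    """Neighbourhood matrix for n sites on a line; each row sums to 1."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < n]
        for j in neighbours:
            w[i][j] = 1.0 / len(neighbours)
    return w


def simulate_sar(
    x: Sequence[float],
    beta: float,
    rho: float,
    w: Sequence[Sequence[float]],
    noise_sd: float = 0.0,
    seed: int = 0,
    iterations: int = 200,
) -> List[float]:
    """Solve y = rho * W y + beta * x + eps by fixed-point iteration.

    The iteration converges when |rho| < 1 and W is row-standardized.
    With rho = 0 the model reduces to classical regression.
    """
    if abs(rho) >= 1.0:
        raise ValueError("this sketch requires |rho| < 1")
    rng = random.Random(seed)
    n = len(x)
    # Non-spatial part of the model: regression term plus noise.
    base = [beta * x[i] + rng.gauss(0.0, noise_sd) for i in range(n)]
    y = list(base)
    for _ in range(iterations):
        y = [
            rho * sum(w[i][j] * y[j] for j in range(n)) + base[i]
            for i in range(n)
        ]
    return y


if __name__ == "__main__":
    w = row_standardized_chain(5)
    x = [0.0, 0.0, 10.0, 0.0, 0.0]
    print(simulate_sar(x, beta=1.0, rho=0.0, w=w))  # no spatial spillover
    print(simulate_sar(x, beta=1.0, rho=0.5, w=w))  # effect leaks to neighbours
```

With rho = 0 only the middle site responds to its covariate; with rho > 0 the response spreads to neighbouring sites, which is the spatial dependence the classical model cannot express.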
Mining Spatial Data & Enhancing Classification Using Bio-Inspired Approaches
2014
Data mining (DM) has become one of the most valuable tools for extracting and manipulating data and for establishing patterns in order to produce useful information for decision-making. It is a generic term for finding hidden patterns in data (tabular, spatial, temporal, spatio-temporal, etc.). Spatial data mining is the process of discovering interesting and previously unknown, but potentially useful, patterns from spatial databases. Extracting interesting and useful patterns from spatial datasets is more difficult than extracting the corresponding patterns from traditional numeric and categorical data due to the complexity of spatial data types, spatial relationships, and spatial autocorrelation. Spatial data are the data related to objects that occupy space. A spatial database stores spatial objects represented by spatial data types and the spatial relationships among such objects. Clustering is the process of partitioning a set of data objects into subsets such that the data elemen...
Machine Learning Approaches in Spatial Data Mining
International Journal of Innovative Research in Computer Science and Technology, 2024
This review paper surveys the integration of machine learning techniques in spatial data mining, a crucial intersection of geographic information systems and data mining. It examines the application of various machine learning algorithms such as classification, regression, clustering, and deep learning in spatial data analysis. The paper discusses challenges like data preprocessing, feature selection, and model interpretability, alongside recent advancements including spatial-temporal analysis and heterogeneous data integration. Through critical analysis of existing literature, it identifies trends, methodologies, and future research directions. Practical implications and applications across domains like urban planning, environmental monitoring, and epidemiology are explored. As a comprehensive resource, this review facilitates understanding and utilization of machine learning approaches for extracting insights from spatial data, benefiting researchers, practitioners, and policymakers alike.