Geographical Random Forests: A Spatial Extension of the Random Forest Algorithm to Address Spatial Heterogeneity in Remote Sensing and Population Modelling
Related papers
ISPRS International Journal of Geo-Information
The aim of this paper is to present developments of an advanced geospatial analytics algorithm that improves the prediction power of a random forest regression model while addressing the issue of spatial dependence commonly found in geographical data. We applied the methodology to a simple model of mean household income in the European Union regions to allow easy understanding and reproducibility of the analysis. The results are encouraging and suggest an improvement in the prediction power compared to previous techniques. The algorithm has been implemented in R and is available in the updated version of the SpatialML package in the CRAN repository.
2019
The increasing access to spatio-temporal datasets, data-driven modelling methods and computational power have transformed the way we do science. Yet most geodata-driven approaches currently disregard the spatial and temporal aspects of the data they are based on. Here we present and evaluate a hybrid machine learning approach that combines statistical mixed effects theory with the power of random forests. This approach, namely mixed-effects random forests or MERF, is used to model monthly crimes in New York City (USA). Our results show that MERF leads to lower prediction errors and to lower spatial autocorrelation in the residuals than a standard random forest model. This shows that there are approaches to mitigate the non-geocomputational nature of machine learning methods.
PLOS ONE, 2015
High resolution, contemporary data on human population distributions are vital for measuring impacts of population growth, monitoring human-environment interactions and for planning and policy development. Many methods are used to disaggregate census data and predict population densities for finer scale, gridded population data sets. We present a new semi-automated dasymetric modeling approach that incorporates detailed census and ancillary data in a flexible, "Random Forest" estimation technique. We outline the combination of widely available, remotely-sensed and geospatial data that contribute to the modeled dasymetric weights and then use the Random Forest model to generate a gridded prediction of population density at ~100 m spatial resolution. This prediction layer is then used as the weighting surface to perform dasymetric redistribution of the census counts at a country level. As a case study we compare the new algorithm and its products for three countries (Vietnam, Cambodia, and Kenya) with other common gridded population data production methodologies. We discuss the advantages of the new method and its gains in accuracy and flexibility over those previous approaches. Finally, we outline how this algorithm will be extended to provide freely-available gridded population data sets for Africa, Asia and
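The dasymetric redistribution step described in this abstract can be illustrated with a minimal sketch. It assumes that each census unit's count is split across its grid cells in proportion to the predicted weights; the function name, the even-split fallback, and the example numbers are hypothetical and are not taken from the paper.

```python
def dasymetric_redistribute(census_count, weights):
    """Split one census unit's count across its grid cells,
    proportionally to a per-cell weight (e.g. a predicted density)."""
    total = sum(weights)
    if total <= 0:
        # Assumption: with no usable weights, spread the count evenly.
        return [census_count / len(weights)] * len(weights)
    return [census_count * w / total for w in weights]


# Hypothetical predicted densities for four grid cells in one census unit.
cell_weights = [0.5, 1.5, 3.0, 5.0]
cell_population = dasymetric_redistribute(1000, cell_weights)
print(cell_population)
```

Because the weights are normalised within each census unit, the cell values always sum back to the unit's census count, so the weighting surface only controls how the count is distributed.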
GWmodel: an R Package for Exploring Spatial Heterogeneity using Geographically Weighted Models
2013
Spatial statistics is a growing discipline providing important analytical techniques in a wide range of disciplines in the natural and social sciences. In the R package GWmodel, we introduce techniques from a particular branch of spatial statistics, termed geographically weighted (GW) models. GW models suit situations when data are not described well by some global model, but where there are spatial regions where a suitably localised calibration provides a better description. The approach uses a moving window weighting technique, where localised models are found at target locations. Outputs are mapped to provide a useful exploratory tool into the nature of the data spatial heterogeneity. GWmodel includes: GW summary statistics, GW principal components analysis, GW regression, GW regression with a local ridge compensation, and GW regression for prediction; some of which are provided in basic and robust forms.
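The moving-window weighting idea behind GW summary statistics can be sketched as a geographically weighted mean. This is an illustration only: it assumes a Gaussian distance-decay kernel with a fixed bandwidth, and the function name and sample data are hypothetical. It does not reproduce the GWmodel API.

```python
import math


def gw_mean(target, points, values, bandwidth):
    """Geographically weighted mean at `target`, using a Gaussian
    kernel so that nearby observations receive larger weights."""
    weights = []
    for x, y in points:
        distance = math.hypot(x - target[0], y - target[1])
        weights.append(math.exp(-0.5 * (distance / bandwidth) ** 2))
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical observations: two close together, one far away.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
values = [1.0, 2.0, 100.0]

local_at_origin = gw_mean((0.0, 0.0), points, values, bandwidth=1.0)
local_at_far = gw_mean((10.0, 0.0), points, values, bandwidth=1.0)
print(local_at_origin, local_at_far)
```

With a small bandwidth the statistic is dominated by nearby observations; as the bandwidth grows, the result approaches the ordinary global mean.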
A Robust Test of Spatial Predictive Models: Geographic Cross-Validation
Journal of Environmental Informatics, 2011
Predictive modeling is an important tool for identifying areas for conservation prioritization. But the reliability of any model depends on how well its predictions can be generalized beyond the area surveyed. Recent work points to the potential for enhancing predictive power by incorporating such spatial processes as autocorrelation or the influence of location, so this study addressed two questions: (1) what effect do model complexity, spatial autocorrelation and spatial location have on model accuracy? (2) how generalizable are different methods when applied to new geographic test regions? On average, predictive power declined 22.7% ± 2.7% SE when models were used to predict occurrences in "unsampled" geographic test regions. Overall variability in performance depended on the method used. AUTO and GAM models tended to be amongst the least variable, but results depended upon species. Our results suggest that models with complex functional relationships between the response and predictor variables (such as GAMs fit with up to 5 knots) tended to either improve accuracy, or perform more consistently across species, but not both at the same time. In general, it is very difficult to accurately extrapolate model predictions into unsampled geographic areas. However, we found that habitat specialists such as the Sedge Wren were consistently well predicted, regardless of method, and that autocorrelated regression (using a Gibbs sampler and simulation of presence/absence) could be more reliably generalized for species showing strong social structure (e.g., patchiness). GWR was especially sensitive to the plots used to train the model.
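The idea of testing a model on "unsampled" geographic regions can be sketched as a block-based split, in which all samples in one spatial block are held out together. This is a sketch under assumptions: the square grid blocks, the function name, and the coordinates are hypothetical, and the study may have defined its test regions differently.

```python
def spatial_block_folds(points, block_size):
    """Return (train_indices, test_indices) pairs, holding out one
    square grid block of samples at a time."""
    blocks = {}
    for index, (x, y) in enumerate(points):
        key = (int(x // block_size), int(y // block_size))
        blocks.setdefault(key, []).append(index)

    folds = []
    for key in sorted(blocks):
        test_idx = blocks[key]
        held_out = set(test_idx)
        train_idx = [i for i in range(len(points)) if i not in held_out]
        folds.append((train_idx, test_idx))
    return folds


# Hypothetical sample coordinates falling into four 5 x 5 blocks.
samples = [(0.5, 0.5), (1.2, 0.3), (5.1, 0.2), (5.9, 1.0), (0.1, 5.5), (6.0, 6.0)]
folds = spatial_block_folds(samples, block_size=5)
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Unlike a random split, neighbouring samples never end up on both sides of a fold, so spatial autocorrelation cannot inflate the accuracy estimate in the same way.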
Geospatial Health
As found in the health studies literature, the levels of climate association between epidemiological diseases have been found to vary across regions. Therefore, it seems reasonable to allow for the possibility that relationships might vary spatially within regions. We implemented the geographically weighted random forest (GWRF) machine learning method to analyze ecological disease patterns caused by spatially non-stationary processes using a malaria incidence dataset for Rwanda. We first compared the geographically weighted regression (GWR), the global random forest (GRF), and the geographically weighted random forest (GWRF) to examine the spatial non-stationarity in the non-linear relationships between malaria incidence and its risk factors. We used the Gaussian areal kriging model to disaggregate the malaria incidence at the local administrative cell level to understand the relationships at a fine scale since the model goodness of fit was not satisfactory to explain malaria inci...
A Tale of Two “Forests”: Random Forest Machine Learning Aids Tropical Forest Carbon Mapping
PLoS ONE, 2014
Accurate and spatially-explicit maps of tropical forest carbon stocks are needed to implement carbon offset mechanisms such as REDD+ (Reduced Deforestation and Degradation Plus). The Random Forest machine learning algorithm may aid carbon mapping applications using remotely-sensed data. However, Random Forest has never been compared to traditional and potentially more reliable techniques such as regionally stratified sampling and upscaling, and it has rarely been employed with spatial data. Here, we evaluated the performance of Random Forest in upscaling airborne LiDAR (Light Detection and Ranging)-based carbon estimates compared to the stratification approach over a 16-million hectare focal area of the Western Amazon. We considered two runs of Random Forest, both with and without spatial contextual modeling by including, in the latter case, x and y position directly in the model. In each case, we set aside 8 million hectares (i.e., half of the focal area) for validation; this rigorous test of Random Forest went above and beyond the internal validation normally compiled by the algorithm (i.e., called "out-of-bag"), which proved insufficient for this spatial application. In this heterogeneous region of Northern Peru, the model with spatial context was the best performing run of Random Forest, and explained 59% of LiDAR-based carbon estimates within the validation area, compared to 37% for stratification or 43% by Random Forest without spatial context. With the 60% improvement in explained variation, RMSE against validation LiDAR samples improved from 33 to 26 Mg C ha⁻¹ when using Random Forest with spatial context. Our results suggest that spatial context should be considered when using Random Forest, and that doing so may result in substantially improved carbon stock modeling for purposes of climate change mitigation.
International Journal of Remote Sensing, 2018
Rapid urban growth in developing countries is causing a great number of urban planning problems. To control and analyse this growth, new and better methods for urban land use mapping are needed. This article proposes a new method for urban land-use mapping, which integrates spatial metrics and texture analysis in an object-based image analysis classification. A high-resolution satellite image was used to generate spatial and texture metrics from the machine learning algorithm of Random Forests land-cover classification. The most meaningful spatial indices were selected by visual inspection and then combined with the image and texture values to generate the classification. The proposed method for land-use mapping was tested using a 10-fold cross-validation scheme, achieving an overall accuracy of 92.3% and a kappa coefficient of 0.896. These steps produced an accurate model of urban land use, without the use of any census or ancillary data, and suggest that the combined use of spatial metrics and texture is promising for urban land-use mapping in developing countries. The maps produced can provide the land-use data needed by urban planners for effective planning in developing countries.
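For readers unfamiliar with the two accuracy measures reported in this abstract, the sketch below shows how overall accuracy and Cohen's kappa are conventionally computed from a confusion matrix. The matrix is a made-up two-class example, not data from the article, and the function name is hypothetical.

```python
def accuracy_and_kappa(confusion):
    """Overall accuracy and Cohen's kappa from a square confusion
    matrix (rows = reference classes, columns = predicted classes)."""
    size = len(confusion)
    n = sum(sum(row) for row in confusion)
    # Observed agreement: share of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(size)) / n
    # Agreement expected by chance, from row and column totals.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(size)
    ) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Made-up two-class confusion matrix with 100 samples.
example_matrix = [[45, 5], [10, 40]]
overall_accuracy, kappa = accuracy_and_kappa(example_matrix)
print(overall_accuracy, kappa)
```

Kappa discounts the agreement that would be expected by chance given the class totals, which is why it is lower than the overall accuracy in this example.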
Mathematical Geosciences, 2010
Increasingly, the geographically weighted regression (GWR) model is being used for spatial prediction rather than for inference. Our study compares GWR as a predictor to (a) its global counterpart of multiple linear regression (MLR); (b) traditional geostatistical models such as ordinary kriging (OK) and universal kriging (UK), with MLR as a mean component; and (c) hybrids, where kriging models are specified with GWR as a mean component. For this purpose, we test the performance of each model on data simulated with differing levels of spatial heterogeneity (with respect to data relationships in the mean process) and spatial autocorrelation (in the residual process). Our results demonstrate that kriging (in a UK form) should be the preferred predictor, reflecting its optimal statistical properties. However, the GWR-kriging hybrids perform with merit and, as such, a predictor of this form may provide a worthy alternative to UK for particular (non-stationary relationship) situations when UK models cannot be reliably calibrated. GWR predictors tend to perform more poorly than their more complex GWR-kriging counterparts, but both GWR-based models are useful in that they provide extra information on the spatial processes generating the data that are being predicted.