Christopher Wikle - Profile on Academia.edu (original) (raw)

Papers by Christopher Wikle

arXiv (Cornell University), Dec 9, 2018

We introduce a Bayesian approach for analyzing high-dimensional multinomial data that are referen... more We introduce a Bayesian approach for analyzing high-dimensional multinomial data that are referenced over space and time. In particular, the proportions associated with multinomial data are assumed to have a logit link to a latent spatio-temporal mixed effects model. This strategy allows for covariances that are nonstationarity in both space and time, asymmetric, and parsimonious. We also introduce the use of the conditional multivariate logit-beta distribution into the dependent multinomial data setting, which leads to conjugate full-conditional distributions for use in a collapsed Gibbs sampler. We refer to this model as the multinomial spatio-temporal mixed effects model (MN-STM). Additionally, we provide methodological developments including: the derivation of the associated full-conditional distributions, a relationship with a latent Gaussian process model, and the stability of the non-stationary vector autoregressive model. We illustrate the MN-STM through simulations and through a demonstration with public-use Quarterly Workforce Indicators (QWI) data from the Longitudinal Employer Household Dynamics (LEHD) program of the U.S. Census Bureau.

arXiv (Cornell University), Jan 25, 2017

We introduce a Bayesian approach for analyzing (possibly) high-dimensional dependent data that ar... more We introduce a Bayesian approach for analyzing (possibly) high-dimensional dependent data that are distributed according to a member from the natural exponential family of distributions. This problem requires extensive methodological advancements, as jointly modeling high-dimensional dependent data leads to the so-called "big n problem." The computational complexity of the "big n problem" is further exacerbated when allowing for non-Gaussian data models, as is the case here. Thus, we develop new computationally efficient distribution theory for this setting. In particular, we introduce the "conjugate multivariate distribution," which is motivated by the univariate distribution introduced in Diaconis and Ylvisaker (1979). Furthermore, we provide substantial theoretical and methodological development including: results regarding conditional distributions, an asymptotic relationship with the multivariate normal distribution, conjugate prior distributions, and full-conditional distributions for a Gibbs sampler. To demonstrate the wide-applicability of the proposed methodology, we provide two simulation studies and three applications based on an epidemiology dataset, a federal statistics dataset, and an environmental dataset, respectively.

Statistical Deep Learning for Spatial and Spatiotemporal Data

Annual review of statistics and its application, Mar 10, 2023

Deep neural network models have become ubiquitous in recent years and have been applied to nearly... more Deep neural network models have become ubiquitous in recent years and have been applied to nearly all areas of science, engineering, and industry. These models are particularly useful for data that have strong dependencies in space (e.g., images) and time (e.g., sequences). Indeed, deep models have also been extensively used by the statistical community to model spatial and spatiotemporal data through, for example, the use of multilevel Bayesian hierarchical models and deep Gaussian processes. In this review, we first present an overview of traditional statistical and machine learning perspectives for modeling spatial and spatiotemporal data, and then focus on a variety of hybrid models that have recently been developed for latent process, data, and parameter specifications. These hybrid models integrate statistical modeling ideas with deep neural network models in order to take advantage of the strengths of each modeling paradigm. We conclude by giving an overview of computational technologies that have proven useful for these hybrid models, and with a brief discussion on future research directions.

Introduction to Spatio-Temporal Statistics

Statistica Sinica, 2019

Prediction of a spatial process using a "big dataset" has become a topical area of research over ... more Prediction of a spatial process using a "big dataset" has become a topical area of research over the last decade. The available solutions often involve placing strong assumptions on the error process associated with the data. Specifically, it has typically been assumed that the data are equal to the spatial process of principal interest plus a mutually independent error process. This is done to avoid modeling confounded cross-covariances between the signal and noise within an additive model. In this article, we consider an alternative latent process modeling schematic where it is assumed that the error process is spatially correlated and correlated with the latent process of interest. We show that such error process dependencies allow one to obtain precise predictions, and avoids confounded error covariances within the expression of the marginal distribution of the data. We refer to these covariances as "non-confounded discrepancy error covariances." Additionally, a "process augmentation" technique is developed to aid in computation. Demonstrations are provided through simulated examples and through an application using a large dataset consisting of the U.S. Census Bureau's American Community Survey 5-year period estimates of median household income on census tracts.

Journal of data science, 2022

The article presents a methodology for supervised regionalization of data on a spatial domain. De... more The article presents a methodology for supervised regionalization of data on a spatial domain. Defining a spatial process at multiple scales leads to the famous ecological fallacy problem. Here, we use the ecological fallacy as the basis for a minimization criterion to obtain the intended regions. The Karhunen-Loève Expansion of the spatial process maintains the relationship between the realizations from multiple resolutions. Specifically, we use the Karhunen-Loève Expansion to define the regionalization error so that the ecological fallacy is minimized. The contiguous regionalization is done using the minimum spanning tree formed from the spatial locations and the data. Then, regionalization becomes similar to pruning edges from the minimum spanning tree. The methodology is demonstrated using simulated and real data examples.

An Overview of Univariate and Multivariate Karhunen Loève Expansions in Statistics

Journal of the Indian Society for Probability and Statistics, Jun 9, 2022

Stat, 2017

Particle swarm optimization (PSO) algorithms are a class of heuristic optimization algorithms tha... more Particle swarm optimization (PSO) algorithms are a class of heuristic optimization algorithms that are attractive for complex optimization problems. We propose using PSO to solve spatial design problems, e.g. choosing new locations to add to an existing monitoring network. Additionally, we introduce two new classes of PSO algorithms that perform well in a wide variety of circumstances, called adaptively tuned PSO and adaptively tuned bare bones PSO. To illustrate these algorithms, we apply them to a common spatial design problem: choosing new locations to add to an existing monitoring network. Specifically, we consider a network in the Houston, TX, area for monitoring ambient ozone levels, which have been linked to out-of-hospital cardiac arrest rates.

Multivariate spatio-temporal survey fusion with application to the American Community Survey and Local Area Unemployment Statistics

Stat, 2016

There are often multiple surveys available that estimate and report related demographic variables... more There are often multiple surveys available that estimate and report related demographic variables of interest that are referenced over space and/or time. Not all surveys produce the same information, and thus, combining these surveys typically leads to higher quality estimates. That is, not every survey has the same level of precision nor do they always provide estimates of the same variables. In addition, various surveys often produce estimates with incomplete spatio‐temporal coverage. By combining surveys using a Bayesian approach, we can account for different margins of error and leverage dependencies to produce estimates of every variable considered at every spatial location and every time point. Specifically, our strategy is to use a hierarchical modelling approach, where the first stage of the model incorporates the margin of error associated with each survey. Then, in a lower stage of the hierarchical model, the multivariate spatio‐temporal mixed effects model is used to incorporate multivariate spatio‐temporal dependencies of the processes of interest. We adopt a fully Bayesian approach for combining surveys; that is, given all of the available surveys, the conditional distributions of the latent processes of interest are used for statistical inference. To demonstrate our proposed methodology, we jointly analyze period estimates from the US Census Bureau's American Community Survey, and estimates obtained from the Bureau of Labor Statistics Local Area Unemployment Statistics program. Copyright © 2016 John Wiley & Sons, Ltd.

A dimension-reduced approach to space-time Kalman filtering

Biometrika, Dec 1, 1999

Biometrika (1999), 86, 4, pp. 815829 © 1999 Biometrika Trust Printed in Great Britain A dimensio... more Biometrika (1999), 86, 4, pp. 815829 © 1999 Biometrika Trust Printed in Great Britain A dimension-reduced approach to space-time Kalman filtering ... Space does not have a natural ordering and hence the dynamic updating that is so important to Kalman filtering is missing. ...

Applied Stochastic Models in Business and Industry, Jul 1, 2010

Bayesian source detection and parameter estimation of a plume model based on sensor network measu... more Bayesian source detection and parameter estimation of a plume model based on sensor network measurements' by C. Huang et al.: Discussion 2 The problem of source detection and parameter estimation for plume models based on sensor network measurements is timely and important. The authors are to be congratulated for going beyond the product-form model to a model motivated by an advection-diffusion PDE model. From a historical perspective, plumes have often been modeled using analytic solutions to various diffusion PDEs, leading to such formulations as the so-called Gaussian plume model (see Ermak [1], Lushi and Stokie [2], and the references therein). Relatively few efforts (by comparison) have been made to 'fit' these models from a rigorous statistical perspective and to perform statistical inference. Therefore, we believe that the authors' contribution is also rooted in raising the awareness of the statistics community to the problem of rigorously modeling plumes in the presence of uncertainty. In spite of the authors' careful and detailed coverage of the problem, there are a few places we believe that more extensive treatment/exposition might be illuminating. First, a more comprehensive simulation study would be informative. In this direction, there would be several avenues for extending the simulation (e.g. varying the spatial covariance, the design, and examining model mis-specification, among others). In the absence of a real data example, the simulations become critically important. In this case, it is unclear how effective the model will perform when applied to real sensor network data. In future research (by the authors and/or other researchers) we look forward to seeing the authors' model applied in practice. The authors suggest using an approximate (quasi-) likelihood, instead of the true likelihood, that may prove to be useful in extremely high dimensions. Although the simulation example presented does not necessitate such an approximation, real-world applications might require such approximations for real-time implementation. The approach, taken by the authors, assumes independent measurement errors and, as a result, produces an MCMC algorithm in which samples do not come from the 'target' distribution. Instead, samples are taken from a, potentially biased, approximation to the target distribution. Although this technique may be preferable from a purely computational efficiency perspective, it is philosophically appealing (and potentially more accurate) to use the exact likelihood. One possible alternative, that can be viewed as a compromise between the authors' approach and using the exact likelihood, would be to use a Whittle formulation . In this context one would also avoid determinant calculations and matrix inversions. While the authors demonstrate the 'robustness' for their approximate approach, the simulation study is limited. The one MCMC simulation with 50 samples examines the independent measurement error case. Thus, in the future, it will be of interest to further investigate the accuracy of the approximation under a wide range of spatial and/or temporal dependence specifications.

RePEc: Research Papers in Economics, 2017

arXiv (Cornell University), Feb 7, 2018

Statistical agencies often publish multiple data products from the same survey. First, they produ... more Statistical agencies often publish multiple data products from the same survey. First, they produce aggregate estimates of various features of the distributions of several socio-demographic quantities of interest. Often these area-level estimates are tabulated at small geographies. Second, statistical agencies frequently produce weighted public-use microdata samples (PUMS) that provide detailed information of the entire distribution for the same socio-demographic variables. However, the public-use micro areas usually constitute relatively large geographies in order to protect against the identification of households or individuals included in the sample. These two data products represent a trade-off in official statistics: publicly available data products can either provide detailed spatial information or detailed distributional information, but not both. We propose a model-based method to combine these two data products to produce estimates of detailed features of a given variable at a high degree of spatial resolution. Our motivating example uses the disseminated tabulations and PUMS from the American Community Survey to estimate U.S. Census tract-level income distributions and statistics associated with these distributions.

ACS 5-year period estimates of median household income for 2013 over selected states in the NE US... more ACS 5-year period estimates of median household income for 2013 over selected states in the NE US. Our interest: Quantifying aggregation error and using that to find "optimal" regionalizations that minimize the e↵ects of the ecological fallacy and MAUP.

arXiv (Cornell University), Nov 8, 2022

There has been a great deal of recent interest in the development of spatial prediction algorithm... more There has been a great deal of recent interest in the development of spatial prediction algorithms for very large datasets and/or prediction domains. These methods have primarily been developed in the spatial statistics community, but there has been growing interest in the machine learning community for such methods, primarily driven by the success of deep Gaussian process regression approaches and deep convolutional neural networks. These methods are often computationally expensive to train and implement and consequently, there has been a resurgence of interest in random projections and deep learning models based on random weights -so called reservoir computing methods. Here, we combine several of these ideas to develop the Random Ensemble Deep Spatial (REDS) approach to predict spatial data. The procedure uses random Fourier features as inputs to an extreme learning machine (a deep neural model with random weights), and with calibrated ensembles of outputs from this model based on different random weights, it provides a simple uncertainty quantification. The REDS method is demonstrated on simulated data and on a classic large satellite data set.

Trends in Ecology & Evolution, 2020

Understanding ecological processes and predicting long-term dynamics are ongoing challenges in ec... more Understanding ecological processes and predicting long-term dynamics are ongoing challenges in ecology. To address these challenges, we suggest an approach combining mathematical analyses and Bayesian hierarchical statistical modeling with diverse data sources. Novel mathematical analysis of ecological dynamics permits a process-based understanding of conditions under which systems approach equilibrium, experience large oscillations, or persist in transient states. This understanding is improved by combining ecological models with empirical observations from a variety of sources. Bayesian hierarchical models explicitly couple process-based models and data, yielding probabilistic quantification of model parameters, system characteristics, and associated uncertainties. We outline relevant tools from dynamical analysis and hierarchical modeling and argue for their integration, demonstrating the value of this synthetic approach through a simple predator-prey example.

Journal of Agricultural Biological and Environmental Statistics, Jun 15, 2020

The use of accelerometers in wildlife tracking provides a fine-scale data source for understandin... more The use of accelerometers in wildlife tracking provides a fine-scale data source for understanding animal behavior and decision-making. Current methods in movement ecology focus on behavior as a driver of movement mechanisms. Our Markov model is a flexible and efficient method for inference related to effects on behavior that considers dependence between current and past behaviors. We applied this model to behavior data from six greater white-fronted geese (Anser albifrons frontalis) during spring migration in mid-continent North America and considered likely drivers of behavior, including habitat, weather and time of day effects. We modeled the transitions between flying, feeding, stationary and walking behavior states using a first-order Bayesian Markov model. We introduced Pólya-Gamma latent variables for automatic sampling of the covariate coefficients from the posterior distribution and we calculated the odds ratios from the posterior samples. Our model provides a unifying framework for including both acceleration and Global Positioning System data. We found significant differences in behavioral transition rates among habitat types, diurnal behavior and behavioral changes due to weather. Our model provides straightforward inference of behavioral time allocation across used habitats, which is not amenable in activity budget or resource selection frameworks.

Models for Ecological Models: Ocean Primary Productivity

Chance, Apr 2, 2016

23 The ocean accounts for more than 70% of planet Earth’s surface, and its processes are critical... more 23 The ocean accounts for more than 70% of planet Earth’s surface, and its processes are critically important to marine and terrestrial life. Ocean ecosystems are strongly dependent on the physical state of the ocean (e.g., transports, mixing, upwelling, runoff, and ice dynamics). As an example, consider the Coastal Gulf of Alaska (CGOA) region (Figure 1). The CGOA is an important area for primary production in the lower trophic-level ecosystem, with large spring phytoplankton blooms (and smaller fall blooms) that are a critical component of the food web Models for Ecological Models: Ocean Primary Productivity