Adrian Dobra | University of Washington
Papers by Adrian Dobra
Overview: SSS is a software suite implementing "shotgun stochastic search" for "large p" regression variable uncertainty and selection. The SSS theory and methodology for regression models are described and exemplified in Hans, [1]. The general framework is that of regression with uncertainty about which predictors are in the model; model uncertainty is represented in terms of a prior variable inclusion probability that penalises model dimension. SSS explores the space of potentially very many models defined by subsets of predictor variables, guided by the (unnormalised) posterior model probabilities, and ranks and summarises sets of "top models" for assessment and prediction. The parallel implementation also provides some basic support for leave-one-out cross-validation analysis.
We describe a new stochastic search algorithm for linear regression models called the bounded mode stochastic search (BMSS). We make use of BMSS to perform variable selection and classification as well as to construct sparse dependency networks. Furthermore, we show how to determine genetic networks from genome-wide data that involve any combination of continuous and discrete variables. We illustrate our methodology with several simulated and real-world datasets.
Statistical Methodology, 2009
Model uncertainty has become a central focus of policy discussion surrounding the determinants of economic growth. Over 140 regressors have been employed in growth empirics due to the proliferation of several new growth theories in the past two decades. Recently, Bayesian model averaging (BMA) has been employed to address model uncertainty and to provide clear policy implications by identifying robust growth determinants. The BMA approaches were, however, limited to linear regression models that abstract from possible dependencies embedded in the covariance structures of growth determinants. The recent empirical growth literature has developed jointness measures to highlight such dependencies. We address model uncertainty and covariate dependencies in a comprehensive Bayesian framework that allows for structural learning in linear regressions and Gaussian graphical models. A common prior specification across the entire framework provides consistency. Gaussian graphical models allow for a principled analysis of dependency structures, which allows us to generate a much more parsimonious set of fundamental growth determinants. Our empirics are based on a prominent growth dataset with 41 potential economic factors that has been utilized in numerous previous analyses to account for model uncertainty as well as jointness.
Biostatistics, 2009
We describe a new stochastic search algorithm for linear regression models called the bounded mode stochastic search (BMSS). We make use of BMSS to perform variable selection and classification as well as to construct sparse dependency networks. Furthermore, we show how to determine genetic networks from genome-wide data that involve any combination of continuous and discrete variables. We illustrate our methodology with several real-world datasets.
Electronic Journal of Statistics, 2011
Standard Gaussian graphical models (GGMs) implicitly assume that the conditional independence among variables is common to all observations in the sample. However, in practice, observations are usually collected from heterogeneous populations where such an assumption is not satisfied, leading in turn to nonlinear relationships among variables. To tackle these problems we explore mixtures of GGMs; in particular, we consider both infinite mixture models of GGMs and infinite hidden Markov models with GGM emission distributions. Such models allow us to divide a heterogeneous population into homogeneous groups, with each cluster having its own conditional independence structure. The main advantage of considering infinite mixtures is that they allow us to easily estimate the number of subpopulations in the sample. As an illustration, we study the trends in exchange rate fluctuations in the pre-Euro era. This example demonstrates that the models are very flexible while providing interesting insights into real-life applications.
The paper considers general multiplicative models for complete and incomplete contingency tables that generalize log-linear and several other models and are entirely coordinate free. Sufficient conditions for the existence of maximum likelihood estimates under these models are given, and it is shown that the usual equivalence between multinomial and Poisson likelihoods holds if and only if an overall effect is present in the model. If such an effect is not assumed, the model becomes a curved exponential family, and a related mixed parameterization is given that relies on non-homogeneous odds ratios. Several examples are presented to illustrate the properties and use of such models.
One major strain of the statistical literature on disclosure limitation for contingency table data has focused on the risk-utility tradeoff, where utility has been measured either formally or informally in terms of information contained in marginal tables linked to a log-linear model analysis, and risk has focused on the disclosure potential of small cell counts, especially those equal to 1 or 2. Utility of margins for log-linear model analysis depends on estimability, e.g., existence of maximum likelihood estimates, and the ability to assess goodness-of-fit of models. One simple way to assess risk is to compute bounds for cell entries given a set of released marginals. Both of these methodologies become non-trivial to implement for large sparse tables. This paper revisits the problem of computing bounds for cell entries and picks up on a theme, first suggested in Fienberg, that there is an intimate link between the ideas on bounds and the existence of maximum likelihood estimates, and shows how these ideas can be made rigorous through the underlying mathematics of the same geometric/algebraic framework. We illustrate the linkages through a series of examples.
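To make "bounds for cell entries given a set of released marginals" concrete, here is a minimal sketch for the simplest case only: a two-way table where the row and column totals are released. It uses the classical Fréchet bounds, max(0, r_i + c_j - n) <= n_ij <= min(r_i, c_j). The function name `frechet_bounds` and the example totals are illustrative assumptions, not taken from the paper, which deals with far more general sets of margins.

```python
def frechet_bounds(row_totals, col_totals):
    """Return per-cell (lower, upper) Frechet bounds for a two-way table of counts."""
    n = sum(row_totals)
    if n != sum(col_totals):
        raise ValueError("row and column totals must share the same grand total")
    return [
        # A cell cannot exceed either of its margins, and it must absorb
        # whatever the remaining cells of its row and column cannot hold.
        [(max(0, r + c - n), min(r, c)) for c in col_totals]
        for r in row_totals
    ]


# Hypothetical released margins for a 2 x 2 table with grand total 10.
bounds = frechet_bounds([3, 7], [4, 6])
for row in bounds:
    print(row)
```

A narrow interval such as the one for the last cell signals higher disclosure risk, because the released margins nearly pin down the count.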
Bioinformatics/Computer Applications in the Biosciences, 2005
We describe a database and information discovery system termed DIG (Duke Integrated Genomics) designed to facilitate the process of gene annotation and discovery of functional context. The DIG system collects and organizes gene annotation and functional information, and includes tools that support an understanding of genes in a functional context by providing a framework for integrating and visualizing gene expression, protein interaction, and literature-based interaction networks.
Model search in regression with very large numbers of candidate predictors raises challenges for both model specification and computation, and standard approaches such as Markov chain Monte Carlo (MCMC) and step-wise methods are often infeasible or ineffective. We describe a novel shotgun stochastic search (SSS) approach that explores "interesting" regions of the resulting, very high-dimensional model spaces to quickly identify regions of high posterior probability over models. We describe algorithmic and modeling aspects, priors over the model space that induce sparsity and parsimony over and above the traditional dimension penalization implicit in Bayesian and likelihood analyses, and parallel computation using cluster computers. We discuss an example from gene expression cancer genomics, comparisons with MCMC and other methods, and theoretical and simulation-based aspects of performance characteristics in large-scale regression model search. We also provide software implementing the methods.
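The search idea can be sketched roughly as follows. This is a simplified illustration, not the authors' software: it assumes a neighbourhood made of add, delete and swap moves, scores every neighbour, records all scored models, and moves by sampling in proportion to the unnormalised score. The names `neighbours`, `sample_by_score`, `shotgun_search` and the toy `log_score` are hypothetical; a real analysis would plug in a log posterior model probability and evaluate neighbours in parallel.

```python
import math
import random


def neighbours(model, p):
    """Split the one-step neighbourhood of `model` into add / delete / swap sets."""
    inside = sorted(model)
    outside = [j for j in range(p) if j not in model]
    add = [model | {j} for j in outside]
    delete = [model - {j} for j in inside]
    swap = [(model - {i}) | {j} for i in inside for j in outside]
    return [add, delete, swap]


def sample_by_score(models, scores, rng):
    """Draw one model with probability proportional to exp(log score)."""
    top = max(scores[m] for m in models)
    weights = [math.exp(scores[m] - top) for m in models]  # subtract max for stability
    return rng.choices(models, weights=weights, k=1)[0]


def shotgun_search(p, log_score, n_iter=50, keep=5, seed=0):
    """Return the `keep` best-scoring models seen while searching subsets of range(p)."""
    rng = random.Random(seed)
    current = frozenset()
    visited = {current: log_score(current)}
    for _ in range(n_iter):
        candidates = []
        for group in neighbours(current, p):
            if not group:
                continue
            for m in group:
                if m not in visited:
                    visited[m] = log_score(m)  # every scored neighbour is recorded
            candidates.append(sample_by_score(group, visited, rng))
        current = sample_by_score(candidates, visited, rng)
    return sorted(visited.items(), key=lambda kv: kv[1], reverse=True)[:keep]


# Toy stand-in for a log posterior: penalise each wrongly included or excluded predictor.
TRUE_MODEL = frozenset({1, 4})


def log_score(model):
    return -3 * len(model ^ TRUE_MODEL)


top_models = shotgun_search(p=6, log_score=log_score)
for model, score in top_models:
    print(sorted(model), score)
```

Because every neighbour is scored and stored, the output is a ranked list of "top models" rather than a single chain of visited states, which is the main practical difference from a standard MCMC model search.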
We propose a new stochastic search algorithm for Gaussian graphical models called the mode oriented stochastic search. Our algorithm relies on the existence of a method to accurately and efficiently approximate the marginal likelihood associated with a graphical model when it cannot be computed in closed form. To this end, we develop a new Laplace approximation method to the normalizing constant of a G-Wishart distribution. We show that combining the mode oriented stochastic search with our marginal likelihood estimation method leads to excellent results with respect to other techniques discussed in the literature. We also describe how to perform inference through Bayesian model averaging based on the reduced set of graphical models identified. Finally, we give a novel stochastic search technique for multivariate regression models.
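As a reminder of what a Laplace approximation to a normalizing constant does, here is a one-dimensional sketch. It is not the G-Wishart method of the paper, which works over constrained precision matrices; it only shows the generic idea of replacing the log-integrand h by its second-order expansion at the mode, giving the approximation exp(h(mode)) * sqrt(2*pi / -h''(mode)). The test integrand is the Gamma function integral, chosen because the exact value is available for comparison; the function name `laplace_log_integral` is an assumption made for this example.

```python
import math


def laplace_log_integral(h, h2, mode):
    """Laplace approximation to the log of the integral of exp(h(x)) in one dimension.

    h is the log-integrand, h2 its second derivative, mode the maximiser of h.
    """
    curvature = -h2(mode)
    if curvature <= 0:
        raise ValueError("h must be strictly concave at the mode")
    return h(mode) + 0.5 * math.log(2 * math.pi / curvature)


# Gamma(a) is the integral over x > 0 of exp((a - 1) * log(x) - x).
a = 11.0


def h(x):
    return (a - 1) * math.log(x) - x


def h2(x):
    return -(a - 1) / x**2


approx = math.exp(laplace_log_integral(h, h2, mode=a - 1))  # mode of h is a - 1
exact = math.gamma(a)
print(approx, exact, approx / exact)
```

In this example the approximation reproduces Stirling's formula and comes within about one percent of the exact value, which illustrates why the method is attractive when no closed form exists.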
Annals of Statistics, 2007
In Bayesian analysis of multi-way contingency tables, the selection of a prior distribution for either the log-linear parameters or the cell probability parameters is a major challenge. In this paper, we define a flexible family of conjugate priors for the wide class of discrete hierarchical log-linear models, which includes the class of graphical models. These priors are defined as the Diaconis-Ylvisaker conjugate priors on the log-linear parameters subject to "baseline constraints" under multinomial sampling. We also derive the induced prior on the cell probabilities and show that the induced prior is a generalization of the hyper Dirichlet prior. We show that this prior has several desirable properties and illustrate its usefulness by identifying the most probable decomposable, graphical and hierarchical log-linear models for a six-way contingency table.
Bounds for the cell counts in multi-way contingency tables given a set of marginal totals arise in a variety of different statistical contexts including disclosure limitation. We describe the Generalized Shuttle Algorithm for computing integer bounds of multi-way contingency tables induced by arbitrary linear constraints on cell counts. We study the convergence properties of our method by exploiting the theory of discrete graphical models and demonstrate the sharpness of the bounds for some specific settings. We give a procedure for adjusting these bounds to the sharp bounds that can also be employed to enumerate all tables consistent with the given constraints. Our algorithm for computing sharp bounds and enumerating multi-way contingency tables is the first approach that relies exclusively on the unique structure of the categorical data and does not employ any other optimization techniques such as linear or integer programming. We illustrate how our algorithm can be used to compute exact p-values of goodness-of-fit tests in exact conditional inference.
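For very small problems, the notions of "sharp bounds" and "all tables consistent with the given constraints" can be checked by brute force. The sketch below is not the Generalized Shuttle Algorithm; it simply enumerates every non-negative integer two-way table with prescribed row and column totals and reads the sharp bounds off the enumeration. The function names and the example margins are assumptions for illustration, and the approach does not scale beyond toy tables.

```python
def tables_with_margins(row_totals, col_totals):
    """Yield every non-negative integer table with the given row and column totals.

    Assumes at least one row and one column.
    """
    if sum(row_totals) != sum(col_totals):
        return
    n_cols = len(col_totals)

    def fill_row(total, remaining, j=0):
        # All ways to split `total` over columns j, j+1, ... within `remaining`.
        if j == n_cols - 1:
            if total <= remaining[j]:
                yield (total,)
            return
        for x in range(min(total, remaining[j]) + 1):
            for rest in fill_row(total - x, remaining, j + 1):
                yield (x,) + rest

    def build(i, remaining):
        # Fill rows i, i+1, ... so that the column totals end up fully used.
        if i == len(row_totals):
            if all(c == 0 for c in remaining):
                yield ()
            return
        for row in fill_row(row_totals[i], remaining):
            left = tuple(c - x for c, x in zip(remaining, row))
            for rest in build(i + 1, left):
                yield (row,) + rest

    yield from build(0, tuple(col_totals))


def sharp_bounds(row_totals, col_totals):
    """Return per-cell (min, max) over all tables consistent with the margins."""
    tables = list(tables_with_margins(row_totals, col_totals))
    if not tables:
        raise ValueError("no table is consistent with these margins")
    return [
        [(min(t[i][j] for t in tables), max(t[i][j] for t in tables))
         for j in range(len(col_totals))]
        for i in range(len(row_totals))
    ]


tables = list(tables_with_margins([3, 7], [4, 6]))
print(len(tables))
print(sharp_bounds([3, 7], [4, 6]))
```

The same enumeration is what an exact conditional test needs: the reference set of tables sharing the observed margins, over which a p-value is summed.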
We describe and illustrate approaches to Bayesian inference in multi-way contingency tables for which partial information, in the form of subsets of marginal totals, is available. In such problems, interest lies in questions of inference about the parameters of models underlying the table together with imputation for the individual cell entries. We discuss questions of structure related to the implications for inference on cell counts arising from assumptions about log-linear model forms, and a class of simple and useful prior distributions on the parameters of log-linear models. We then discuss "local move" and "global move" Metropolis-Hastings simulation methods for exploring the posterior distributions for parameters and cell counts, focusing particularly on higher-dimensional problems. As a byproduct, we note potential uses of the "global move" approach for inference about numbers of tables consistent with a prescribed subset of marginal counts. Illustrations and comparisons of MCMC approaches are given, and we conclude with a discussion of areas for further development and current open issues.
We describe a novel gene expression analysis method for the creation of overlapping gene clusters and associated metagene signatures that aim to characterize the dominant common expression patterns within each cluster. The analysis is based on the use of statistical graphical models to identify and estimate patterns of association among gene subsets from gene expression data, and then clustering is based on formal estimates of very sparse covariance matrices arising from these models. Metagene summaries, which are of interest as reduced dimensional summaries for phenotyping studies, are simply the resulting model-based estimates of dominant singular factors (principal components) of population variance matrices within resulting overlapping clusters. We describe connections between graph-theoretic approaches to exploring gene expression graphical models and exploration in biological contexts of gene subsets represented by identified metagenes, illustrating some aspects of the utility of this framework for summary representation of observational gene expression data. Availability: The software implementing our method is called MetageneCreator and is available for download at
Reconstruction of Contingency Tables With Missing Data. Adrian Dobra (National Institute of Statistical Sciences), Claudia Tebaldi (National Center for Atmospheric Research), Mike West (ISDS, Duke University). Incomplete Tables ...
Abstract: The underlying connection between disclosure avoidance techniques for categorical data and sampling from the exact conditional distribution associated with a loglinear model is the data swaps necessary to link all the contingency tables having a set of fixed marginal ...