Sylvia Richardson - Academia.edu

Papers by Sylvia Richardson

Two-pronged Strategy for Using DIC to Compare Selection Models with Non-Ignorable Missing Responses

Markov chain Monte Carlo methods

JAM: A Scalable Bayesian Framework for Joint Analysis of Marginal SNP Effects

Genetic Epidemiology, 2016

Recently, large-scale genome-wide association study (GWAS) meta-analyses have boosted the number of known signals for some traits into the tens and hundreds. Typically, however, variants are only analysed one at a time. This complicates the ability of fine-mapping to identify a small set of SNPs for further functional follow-up. We describe a new and scalable algorithm, joint analysis of marginal summary statistics (JAM), for the re-analysis of published marginal summary statistics under joint multi-SNP models. The correlation between SNPs is accounted for according to estimates from a reference dataset, and models and SNPs that best explain the complete joint pattern of marginal effects are highlighted via an integrated Bayesian penalized regression framework. We provide both enumerated and Reversible Jump MCMC implementations of JAM and present some comparisons of performance. In a series of realistic simulation studies, JAM demonstrated identical performance to various alternatives designed f...
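The idea of re-analysing marginal effects under a joint model can be illustrated with a small worked example. The sketch below is not the JAM algorithm or its software interface; it only shows the underlying linear algebra for two SNPs, assuming standardised genotypes and a correlation r taken from a reference panel. The function name joint_from_marginal and the numbers are hypothetical.

```python
def joint_from_marginal(beta_marginal, r):
    """Recover joint effects of two SNPs from their marginal effects.

    Assumes standardised genotypes (unit variance) with correlation r,
    so that beta_marginal = R @ beta_joint with R = [[1, r], [r, 1]].
    Hypothetical helper for illustration, not the JAM API.
    """
    if abs(r) >= 1.0:
        raise ValueError("SNPs are perfectly correlated; R is singular")
    b1, b2 = beta_marginal
    det = 1.0 - r * r
    # Apply the closed-form inverse of the 2x2 correlation matrix.
    return ((b1 - r * b2) / det, (b2 - r * b1) / det)


# SNP 1 is causal (joint effect 0.5); SNP 2 has no effect of its own but
# is correlated with SNP 1 at r = 0.8, so it still shows a marginal signal.
marginal = (0.5, 0.4)
joint = joint_from_marginal(marginal, r=0.8)
```

In this toy case the marginal signal at the second SNP is explained entirely by its correlation with the first, which is the kind of pattern a joint multi-SNP model is meant to disentangle.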

A Bayesian Model of NMR Spectra for the Deconvolution and Quantification of Metabolites in Complex Biological Mixtures

Journal of the American Statistical Association, 2012

Astle's present position is supported by a Team Grant from the Fonds de recherche du Québec-Nature et technologies. The authors thank Jake Bundy for providing the yeast dataset and for help interpreting deconvolutions of the NMR spectra, Jie Hao for programming a C++ implementation and David Balding and Ernest Turro for comments on a draft manuscript.

BayesSUR: An R Package for High-Dimensional Multivariate Bayesian Variable and Covariance Selection in Linear Regression

Journal of Statistical Software

In molecular biology, advances in high-throughput technologies have made it possible to study complex multivariate phenotypes and their simultaneous associations with high-dimensional genomic and other omics data, a problem that can be studied with high-dimensional multi-response regression, where the response variables are potentially highly correlated. To this end, we recently introduced several multivariate Bayesian variable and covariance selection models, e.g., Bayesian estimation methods for sparse seemingly unrelated regression for variable and covariance selection. Several variable selection priors have been implemented in this context, in particular the hotspot detection prior for latent variable inclusion indicators, which results in sparse variable selection for associations between predictors and multiple phenotypes. We also propose an alternative, which uses a Markov random field (MRF) prior for incorporating prior knowledge about the dependence structure of the inclusion indicators. Inference of Bayesian seemingly unrelated regression (SUR) by Markov chain Monte Carlo methods is made computationally feasible by factorisation of the covariance matrix amongst the response variables. In this paper we present BayesSUR, an R package, which allows the user to easily specify and run a range of different Bayesian SUR models, which have been implemented in C++ for computational efficiency. The R package allows the specification of the models in a modular way, where the user chooses the priors for variable selection and for covariance selection separately. We demonstrate the performance of sparse SUR models with the hotspot prior and spike-and-slab MRF prior on synthetic and real data sets representing eQTL or mQTL studies and in vitro anti-cancer drug screening studies as examples for typical applications.

Noname manuscript No

We consider the question of Markov chain Monte Carlo sampling from a general stick-breaking Dirichlet process mixture model, with concentration parameter α. This paper introduces a Gibbs sampling algorithm that combines the slice sampling approach of Walker (2007) and the retrospective sampling approach of Papaspiliopoulos and Roberts (2008). Our general algorithm is implemented as efficient open source C++ software, available as an R package, and is based on a blocking strategy similar to that suggested by Papaspiliopoulos (2008) and implemented by Yau et al (2011). We discuss the difficulties of achieving good mixing in MCMC samplers of this nature in large data sets and investigate sensitivity to initialisation. We additionally consider the challenges when an additional layer of hierarchy is added such that joint inference is to be made on α. We introduce a new label-switching move and compute the marginal partition posterior to help to surmount these difficulties. Our work is illustrated using a profile regression (Molitor et al, 2010) application, where we demonstrate good mixing behaviour for both synthetic and real examples.
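For readers unfamiliar with the stick-breaking representation mentioned above, the following minimal sketch draws mixture weights for a Dirichlet process with concentration parameter α. It illustrates only the prior construction with a fixed truncation level, which is an assumption made here for simplicity; the slice and retrospective samplers described in the abstract exist precisely to avoid fixing such a truncation. The function name and arguments are hypothetical.

```python
import random


def stick_breaking_weights(alpha, n_components, rng):
    """Draw truncated stick-breaking weights for a DP with concentration alpha.

    Returns the first n_components weights and the unallocated remainder.
    """
    weights = []
    remaining = 1.0  # length of the stick not yet allocated
    for _ in range(n_components):
        v = rng.betavariate(1.0, alpha)  # V_k ~ Beta(1, alpha)
        weights.append(v * remaining)    # w_k = V_k * prod_{j<k} (1 - V_j)
        remaining *= 1.0 - v
    return weights, remaining


rng = random.Random(0)
weights, leftover = stick_breaking_weights(alpha=2.0, n_components=20, rng=rng)
```

Larger values of alpha break off smaller pieces on average, so more components carry non-negligible weight; this is one reason joint inference on α interacts with mixing.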

A global-local approach for detecting hotspots in multiple-response regression

The Annals of Applied Statistics, 2020

We tackle modelling and inference for variable selection in regression problems with many predictors and many responses. We focus on detecting hotspots, that is, predictors associated with several responses. Such a task is critical in statistical genetics, as hotspot genetic variants shape the architecture of the genome by controlling the expression of many genes and may initiate decisive functional mechanisms underlying disease endpoints. Existing hierarchical regression approaches designed to model hotspots suffer from two limitations: their discrimination of hotspots is sensitive to the choice of top-level scale parameters for the propensity of predictors to be hotspots, and they do not scale to large predictor and response vectors, for example, of dimensions 10^3-10^5 in genetic applications. We address these shortcomings by introducing a flexible hierarchical regression framework that is tailored to the detection of hotspots and scalable to the above dimensions. Our proposal implements a fully Bayesian model for hotspots based on the horseshoe shrinkage prior. Its global-local formulation shrinks noise globally and, hence, accommodates the highly sparse nature of genetic analyses while being robust to individual signals, thus leaving the effects of hotspots unshrunk. Inference is carried out using a fast variational algorithm coupled with a novel simulated annealing procedure that allows efficient exploration of multimodal distributions.
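The global-local behaviour of the horseshoe prior can be seen by simulating from it. The sketch below is a generic textbook horseshoe on regression effects, not the hotspot model or the variational algorithm of the paper; it assumes unit noise variance, and the names half_cauchy, horseshoe_draws and tau are chosen here for illustration.

```python
import math
import random


def half_cauchy(rng):
    """Draw from a standard half-Cauchy via the Cauchy inverse CDF."""
    return abs(math.tan(math.pi * (rng.random() - 0.5)))


def horseshoe_draws(n, tau, rng):
    """Sample n effects from a horseshoe prior with global scale tau.

    Returns (beta, kappa) pairs, where kappa = 1 / (1 + tau^2 * lambda^2)
    is the shrinkage weight under unit noise variance: kappa near 1 means
    the effect is shrunk towards zero, kappa near 0 means it is left unshrunk.
    """
    draws = []
    for _ in range(n):
        lam = half_cauchy(rng)              # local scale, heavy-tailed
        beta = rng.gauss(0.0, tau * lam)    # beta | lam ~ N(0, tau^2 lam^2)
        kappa = 1.0 / (1.0 + (tau * lam) ** 2)
        draws.append((beta, kappa))
    return draws


draws = horseshoe_draws(n=1000, tau=0.1, rng=random.Random(1))
share_shrunk = sum(1 for _, k in draws if k > 0.5) / len(draws)
```

With a small global scale most draws are strongly shrunk, while the heavy half-Cauchy tail still lets a few local scales escape, which matches the abstract's description of shrinking noise globally while leaving individual signals unshrunk.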

Exploring dependence between categorical variables: Benefits and limitations of using variable selection within Bayesian clustering in relation to log-linear modelling with interaction terms

Journal of Statistical Planning and Inference, 2016

This manuscript is concerned with relating two approaches that can be used to explore complex dependence structures between categorical variables, namely Bayesian partitioning of the covariate space incorporating a variable selection procedure that highlights the covariates that drive the clustering, and log-linear modelling with interaction terms. We derive theoretical results on this relation and discuss whether they can be employed to assist log-linear model determination, demonstrating advantages and limitations with simulated and real data sets. The main advantage concerns sparse contingency tables. Inferences from clustering can potentially reduce the number of covariates considered and, subsequently, the number of competing log-linear models, making the exploration of the model space feasible. Variable selection within clustering can inform on marginal independence in general, thus allowing for a more efficient exploration of the log-linear model space. However, we show that the clustering structure is not informative on the existence of interactions in a consistent manner. This work is of interest to those who utilize log-linear models, as well as practitioners such as epidemiologists who use clustering models to reduce the dimensionality in the data and to reveal interesting patterns on how covariates combine.

PReMiuM: An R Package for Profile Regression Mixture Models Using Dirichlet Processes

Journal of Statistical Software, 2015

PReMiuM is a recently developed R package for Bayesian clustering using a Dirichlet process mixture model. This model is an alternative to regression models, nonparametrically linking a response vector to covariate data through cluster membership (Molitor, Papathomas, Jerrett, and Richardson 2010). The package allows binary, categorical, count and continuous response, as well as continuous and discrete covariates. Additionally, predictions may be made for the response, and missing values for the covariates are handled. Several samplers and label switching moves are implemented along with diagnostic tools to assess convergence. A number of R functions for post-processing of the output are also provided. In addition to fitting mixtures, it may also be of interest to determine which covariates actively drive the mixture components. This is implemented in the package as variable selection.

Bayesian Models for Sparse Regression Analysis of High Dimensional Data

Bayesian Statistics 9, 2011

This paper considers the task of building efficient regression models for sparse multivariate analysis of high dimensional data sets; in particular, it focuses on cases where the numbers q of responses Y = (y_k, 1 ≤ k ≤ q) and p of predictors X = (x_j, 1 ≤ j ≤ p) to analyse jointly are both large with respect to the sample size n, a challenging bi-directional task. The analysis of such data sets arises commonly in genetical genomics, with X linked to the DNA characteristics and Y corresponding to measurements of fundamental biological processes such as transcription, protein or metabolite production. Building on the Bayesian variable selection setup for the linear model and associated efficient MCMC algorithms developed for single responses, we discuss the generic framework of hierarchical related sparse regressions, where parallel regressions of y_k on the set of covariates X are linked in a hierarchical fashion, in particular through the prior model of the variable selection indicators γ_kj, which indicate among the covariates x_j those which are associated to the response y_k in each multivariate regression. Structures for the joint model of the γ_kj, which correspond to different compromises between the aims of controlling sparsity and that of enhancing the detection of predictors that are associated with many responses ('hot spots'), will be discussed and a new multiplicative model for the probability structure of the γ_kj will be presented. To perform inference for these models in high dimensional setups, novel adaptive MCMC algorithms are needed. As sparsity is paramount and most of the associations are expected to be zero, new algorithms that progressively focus on part of the space where the most interesting associations occur are of great interest. We shall discuss their formulation and theoretical properties, and demonstrate their use on simulated and real data from genomics.

Bayesian Methods for Microarray Data

We review the use of Bayesian methods for analyzing gene expression data. We focus on methods which select groups of genes on the basis of their expression in RNA samples derived under different experimental conditions. We first describe Bayesian methods for estimating gene expression level from the intensity measurements obtained from analysis of microarray images. We next discuss the issues involved in assessing differential gene expression between two conditions at a time, including models for classifying the genes as differentially expressed or not. In the last two sections, we present models for grouping gene expression profiles over different experimental conditions, in order to find co-expressed genes, and multivariate models for finding gene signatures, i.e. for selecting a parsimonious group of genes that discriminate between entities such as subtypes of disease.

Sampling from Dirichlet process mixture models with unknown concentration parameter: mixing issues in large data implementations

Statistics and Computing, 2014

We consider the question of Markov chain Monte Carlo sampling from a general stick-breaking Dirichlet process mixture model, with concentration parameter α. This paper introduces a Gibbs sampling algorithm that combines the slice sampling approach of Walker (Communications in Statistics-Simulation and Computation 36:45-54, 2007) and the retrospective sampling approach of Papaspiliopoulos and Roberts (Biometrika 95(1):169-186, 2008). Our general algorithm is implemented as efficient open source C++ software, available as an R package, and is based on a blocking strategy similar to that suggested by Papaspiliopoulos (A note on posterior sampling from Dirichlet mixture models, 2008) and implemented by Yau et al. (Journal of the Royal Statistical Society, Series B (Statistical Methodology) 73:37-57, 2011). We discuss the difficulties of achieving good mixing in MCMC samplers of this nature in large data sets and investigate sensitivity to initialisation. We additionally consider the challenges when an additional layer of hierarchy is added such that joint inference is to be made on α. We introduce a new label-switching move and compute the marginal partition posterior to help to surmount these difficulties. Our work is illustrated using a profile regression (Molitor et al. Biostatistics 11(3):484-498, 2010) application, where we...

Modelling Heterogeneity With and Without the Dirichlet Process

Scandinavian Journal of Statistics, 2001

We investigate the relationships between Dirichlet process (DP) based models and allocation models for a variable number of components, based on exchangeable distributions. It is shown that the DP partition distribution is a limiting case of a Dirichlet-multinomial allocation model. Comparisons of posterior performance of DP and allocation models are made in the Bayesian paradigm and illustrated in the context of univariate mixture models. It is shown in particular that the unbalancedness of the allocation distribution, present in the prior DP model, persists a posteriori. Exploiting the model connections, a new MCMC sampler for general DP based models is introduced, which uses split/merge moves in a reversible jump framework. Performance of this new sampler relative to that of some traditional samplers for DP processes is then explored.

Mixture Models in Measurement Error Problems, with Reference to Epidemiological Studies

Journal of the Royal Statistical Society Series A: Statistics in Society, 2002

Summary. The paper focuses on a Bayesian treatment of measurement error problems and on the question of the specification of the prior distribution of the unknown covariates. It presents a flexible semiparametric model for this distribution based on a mixture of normal distributions with an unknown number of components. Implementation of this prior model as part of a full Bayesian analysis of measurement error problems is described in classical set-ups that are encountered in epidemiological studies: logistic regression between unknown covariates and outcome, with a normal or log-normal error model and a validation group. The feasibility of this combined model is tested and its performance is demonstrated in a simulation study that includes an assessment of the influence of misspecification of the prior distribution of the unknown covariates and a comparison with the semiparametric maximum likelihood method of Roeder, Carroll and Lindsay. Finally, the methodology is illustrated on a d...

Hidden Markov Models and Disease Mapping

Journal of the American Statistical Association, 2002

We present new methodology to extend hidden Markov models to the spatial domain, and use this class of models to analyze spatial heterogeneity of count data on a rare phenomenon. This situation occurs commonly in many domains of application, particularly in disease mapping. We assume that the counts follow a Poisson model at the lowest level of the hierarchy, and introduce a finite-mixture model for the Poisson rates at the next level. The novelty lies in the model for allocation to the mixture components, which follows a spatially correlated process, the Potts model, and in treating the number of components of the spatial mixture as unknown. Inference is performed in a Bayesian framework using reversible jump Markov chain Monte Carlo. The model introduced can be viewed as a Bayesian semiparametric approach to specifying flexible spatial distribution in hierarchical models. Performance of the model and comparison with an alternative well-known Markov random field specification for the Poisson rates are demonstrated on synthetic datasets. We show that our allocation model avoids the problem of oversmoothing in cases where the underlying rates exhibit discontinuities, while giving equally good results in cases of smooth gradient-like or highly autocorrelated rates. The methodology is illustrated on an epidemiologic application to data on a rare cancer in France.
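The two levels of the hierarchy described above can be written down in a few lines. This is a minimal sketch with a fixed number of components, not the reversible jump sampler of the paper; the expected counts E_i, the interaction strength psi and the function names are assumptions introduced here for illustration.

```python
import math


def potts_log_weight(labels, edges, psi):
    """Unnormalised log prior of a labelling under a Potts model.

    labels: component label of each area; edges: pairs of neighbouring
    areas; psi: interaction strength (psi = 0 gives independent labels).
    """
    same = sum(1 for i, j in edges if labels[i] == labels[j])
    return psi * same


def poisson_log_lik(counts, expected, rates, labels):
    """Log-likelihood of counts y_i ~ Poisson(rates[label_i] * E_i)."""
    total = 0.0
    for y, e, z in zip(counts, expected, labels):
        mu = rates[z] * e
        total += y * math.log(mu) - mu - math.lgamma(y + 1)
    return total


# Four areas on a line with a sharp change in risk between areas 1 and 2.
edges = [(0, 1), (1, 2), (2, 3)]
counts, expected, rates = [1, 2, 8, 9], [2.0, 2.0, 2.0, 2.0], [1.0, 4.0]
blocky, alternating = [0, 0, 1, 1], [0, 1, 0, 1]
prior_gap = potts_log_weight(blocky, edges, 1.0) - potts_log_weight(alternating, edges, 1.0)
lik_gap = (poisson_log_lik(counts, expected, rates, blocky)
           - poisson_log_lik(counts, expected, rates, alternating))
```

Because the Potts prior acts on the allocations rather than on the rates themselves, neighbouring areas are encouraged to share a component while the rates of different components can still differ sharply, which is consistent with the abstract's point about avoiding oversmoothing at discontinuities.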

Bayesian Detection of Expression Quantitative Trait Loci Hot Spots

Genetics, 2011

High-throughput genomics allows genome-wide quantification of gene expression levels in tissues and cell types and, when combined with sequence variation data, permits the identification of genetic control points of expression (expression QTL or eQTL). Clusters of eQTL influenced by single genetic polymorphisms can inform on hotspots of regulation of pathways and networks, although very few hotspots have been robustly detected, replicated, or experimentally verified. Here we present a novel modeling strategy to estimate the propensity of a genetic marker to influence several expression traits at the same time, based on a hierarchical formulation of related regressions. We implement this hierarchical regression model in a Bayesian framework using a stochastic search algorithm, HESS, that efficiently probes sparse subsets of genetic markers in a high-dimensional data matrix to identify hotspots and to pinpoint the individual genetic effects (eQTL). Simulating complex regulatory scenar...

Examining the Joint Effect of Multiple Risk Factors Using Exposure Risk Profiles: Lung Cancer in Nonsmokers

Environmental Health Perspectives, 2010

A semi-parametric approach to estimate risk functions associated with multi-dimensional exposure profiles: application to smoking and lung cancer

BMC Medical Research Methodology, 2013

Background: A common characteristic of environmental epidemiology is the multi-dimensional aspect of exposure patterns, frequently reduced to a cumulative exposure for simplicity of analysis. By adopting a flexible Bayesian clustering approach, we explore the risk function linking exposure history to disease. This approach is applied here to study the relationship between different smoking characteristics and lung cancer in the framework of a population based case control study. Methods: Our study includes 4658 males (1995 cases, 2663 controls) with full smoking history (intensity, duration, time since cessation, pack-years) from the ICARE multi-centre study conducted from 2001-2007. We extend Bayesian clustering techniques to explore predictive risk surfaces for covariate profiles of interest. Results: We were able to partition the population into 12 clusters with different smoking profiles and lung cancer risk. Our results confirm that when compared to intensity, duration is the pred...

Bayesian profile regression with an application to the National survey of children's health

Biostatistics, 2010

Standard regression analyses are often plagued with problems encountered when one tries to make inference going beyond main effects using data sets that contain dozens of variables that are potentially correlated. This situation arises, for example, in epidemiology where surveys or study questionnaires consisting of a large number of questions yield a potentially unwieldy set of interrelated data from which teasing out the effect of multiple covariates is difficult. We propose a method that addresses these problems for categorical covariates by using, as its basic unit of inference, a profile formed from a sequence of covariate values. These covariate profiles are clustered into groups and associated via a regression model to a relevant outcome. The Bayesian clustering aspect of the proposed modeling framework has a number of advantages over traditional clustering approaches in that it allows the number of groups to vary, uncovers subgroups and examines their association with an out...

Bayesian Modeling of Differential Gene Expression

Biometrics, 2005

We present a Bayesian hierarchical model for detecting differentially expressing genes that includes simultaneous estimation of array effects, and show how to use the output for choosing lists of genes for further investigation. We give empirical evidence that expression-level dependent array effects are needed, and explore different non-linear functions as part of our model-based approach to normalization. The model includes gene-specific variances but imposes some necessary shrinkage through a hierarchical structure. Model criticism via posterior predictive checks is discussed. Modelling the array effects (normalization) simultaneously with differential expression gives fewer false positive results. To choose a list of genes, we propose to combine various criteria (for instance, fold change and overall expression) into a single indicator variable for each gene. The posterior distribution of these variables is used to pick the list of genes, thereby taking into account uncertainty in parameter estimates. In an application to

Research paper thumbnail of Two-pronged Strategy for Using DIC to Compare Selection Models with Non-Ignorable Missing Responses

Research paper thumbnail of Markov chain Monte Carlo methods

Research paper thumbnail of JAM: A Scalable Bayesian Framework for Joint Analysis of Marginal SNP Effects

Genetic epidemiology, 2016

Recently, large scale genome-wide association study (GWAS) meta-analyses have boosted the number ... more Recently, large scale genome-wide association study (GWAS) meta-analyses have boosted the number of known signals for some traits into the tens and hundreds. Typically, however, variants are only analysed one-at-a-time. This complicates the ability of fine-mapping to identify a small set of SNPs for further functional follow-up. We describe a new and scalable algorithm, joint analysis of marginal summary statistics (JAM), for the re-analysis of published marginal summary statistics under joint multi-SNP models. The correlation is accounted for according to estimates from a reference dataset, and models and SNPs that best explain the complete joint pattern of marginal effects are highlighted via an integrated Bayesian penalized regression framework. We provide both enumerated and Reversible Jump MCMC implementations of JAM and present some comparisons of performance. In a series of realistic simulation studies, JAM demonstrated identical performance to various alternatives designed f...

Research paper thumbnail of A Bayesian Model of NMR Spectra for the Deconvolution and Quantification of Metabolites in Complex Biological Mixtures

Journal of the American Statistical Association, 2012

Astle's present position is supported by a Team Grant from the Fonds de recherche du Québec-Natur... more Astle's present position is supported by a Team Grant from the Fonds de recherche du Québec-Nature et technologies. The authors thank Jake Bundy for providing the yeast dataset and for help interpreting deconvolutions of the NMR spectra, Jie Hao for programming a C++ implementation and David Balding and Ernest Turro for comments on a draft manuscript.

Research paper thumbnail of BayesSUR: An R Package for High-Dimensional Multivariate Bayesian Variable and Covariance Selection in Linear Regression

Journal of Statistical Software

In molecular biology, advances in high-throughput technologies have made it possible to study com... more In molecular biology, advances in high-throughput technologies have made it possible to study complex multivariate phenotypes and their simultaneous associations with highdimensional genomic and other omics data, a problem that can be studied with highdimensional multi-response regression, where the response variables are potentially highly correlated. To this purpose, we recently introduced several multivariate Bayesian variable and covariance selection models, e.g., Bayesian estimation methods for sparse seemingly unrelated regression for variable and covariance selection. Several variable selection priors have been implemented in this context, in particular the hotspot detection prior for latent variable inclusion indicators, which results in sparse variable selection for associations between predictors and multiple phenotypes. We also propose an alternative, which uses a Markov random field (MRF) prior for incorporating prior knowledge about the dependence structure of the inclusion indicators. Inference of Bayesian seemingly unrelated regression (SUR) by Markov chain Monte Carlo methods is made computationally feasible by factorisation of the covariance matrix amongst the response variables. In this paper we present BayesSUR, an R package, which allows the user to easily specify and run a range of different Bayesian SUR models, which have been implemented in C++ for computational efficiency. The R package allows the specification of the models in a modular way, where the user chooses the priors for variable selection and for covariance selection separately. We demonstrate the performance of sparse SUR models with the hotspot prior and spike-and-slab MRF prior on synthetic and real data sets representing eQTL or mQTL studies and in vitro anti-cancer drug screening studies as examples for typical applications.

Research paper thumbnail of Noname manuscript No

We consider the question of Markov chain Monte Carlo sampling from a general stick-breaking Diric... more We consider the question of Markov chain Monte Carlo sampling from a general stick-breaking Dirichlet process mixture model, with concentration parameter α. This paper introduces a Gibbs sampling algorithm that combines the slice sampling approach of Walker (2007) and the retrospective sampling approach of Papaspiliopoulos and Roberts (2008). Our general algorithm is implemented as efficient open source C++ software, available as an R package, and is based on a blocking strategy similar to that suggested by Papaspiliopoulos (2008) and implemented by Yau et al (2011). We discuss the difficulties of achieving good mixing in MCMC samplers of this nature in large data sets and investigate sensitivity to initialisation. We additionally consider the challenges when an additional layer of hierarchy is added such that joint inference is to be made on α. We introduce a new label-switching move and compute the marginal partition posterior to help to surmount these difficulties. Our work is illustrated using a profile regression (Molitor et al, 2010) application, where we demonstrate good mixing behaviour for both synthetic and real examples.

Research paper thumbnail of A global-local approach for detecting hotspots in multiple-response regression

The Annals of Applied Statistics, 2020

We tackle modelling and inference for variable selection in regression problems with many predict... more We tackle modelling and inference for variable selection in regression problems with many predictors and many responses. We focus on detecting hotspots, that is, predictors associated with several responses. Such a task is critical in statistical genetics, as hotspot genetic variants shape the architecture of the genome by controlling the expression of many genes and may initiate decisive functional mechanisms underlying disease endpoints. Existing hierarchical regression approaches designed to model hotspots suffer from two limitations: their discrimination of hotspots is sensitive to the choice of top-level scale parameters for the propensity of predictors to be hotspots, and they do not scale to large predictor and response vectors, for example, of dimensions 10 3-10 5 in genetic applications. We address these shortcomings by introducing a flexible hierarchical regression framework that is tailored to the detection of hotspots and scalable to the above dimensions. Our proposal implements a fully Bayesian model for hotspots based on the horseshoe shrinkage prior. Its global-local formulation shrinks noise globally and, hence, accommodates the highly sparse nature of genetic analyses while being robust to individual signals, thus leaving the effects of hotspots unshrunk. Inference is carried out using a fast variational algorithm coupled with a novel simulated annealing procedure that allows efficient exploration of multimodal distributions.

Research paper thumbnail of Exploring dependence between categorical variables: Benefits and limitations of using variable selection within Bayesian clustering in relation to log-linear modelling with interaction terms

Journal of Statistical Planning and Inference, 2016

This manuscript is concerned with relating two approaches that can be used to explore complex dep... more This manuscript is concerned with relating two approaches that can be used to explore complex dependence structures between categorical variables, namely Bayesian partitioning of the covariate space incorporating a variable selection procedure that highlights the covariates that drive the clustering, and log-linear modelling with interaction terms. We derive theoretical results on this relation and discuss if they can be employed to assist log-linear model determination, demonstrating advantages and limitations with simulated and real data sets. The main advantage concerns sparse contingency tables. Inferences from clustering can potentially reduce the number of covariates considered and, subsequently, the number of competing log-linear models, making the exploration of the model space feasible. Variable selection within clustering can inform on marginal independence in general, thus allowing for a more efficient exploration of the log-linear model space. However, we show that the clustering structure is not informative on the existence of interactions in a consistent manner. This work is of interest to those who utilize log-linear models, as well as practitioners such as epidemiologists that use clustering models to reduce the dimensionality in the data and to reveal interesting patterns on how covariates combine.

PReMiuM: An R Package for Profile Regression Mixture Models Using Dirichlet Processes

Journal of Statistical Software, 2015

PReMiuM is a recently developed R package for Bayesian clustering using a Dirichlet process mixture model. This model is an alternative to regression models, nonparametrically linking a response vector to covariate data through cluster membership (Molitor, Papathomas, Jerrett, and Richardson 2010). The package allows binary, categorical, count and continuous response, as well as continuous and discrete covariates. Additionally, predictions may be made for the response, and missing values for the covariates are handled. Several samplers and label switching moves are implemented along with diagnostic tools to assess convergence. A number of R functions for post-processing of the output are also provided. Beyond fitting mixtures, it may also be of interest to determine which covariates actively drive the mixture components. This is implemented in the package as variable selection.

Bayesian Models for Sparse Regression Analysis of High Dimensional Data*

Bayesian Statistics 9, 2011

This paper considers the task of building efficient regression models for sparse multivariate analysis of high dimensional data sets. In particular, it focuses on cases where the numbers q of responses Y = (y_k, 1 ≤ k ≤ q) and p of predictors X = (x_j, 1 ≤ j ≤ p) to analyse jointly are both large with respect to the sample size n, a challenging bi-directional task. The analysis of such data sets arises commonly in genetical genomics, with X linked to the DNA characteristics and Y corresponding to measurements of fundamental biological processes such as transcription, protein or metabolite production. Building on the Bayesian variable selection setup for the linear model and associated efficient MCMC algorithms developed for single responses, we discuss the generic framework of hierarchical related sparse regressions, where parallel regressions of y_k on the set of covariates X are linked in a hierarchical fashion, in particular through the prior model of the variable selection indicators γ_kj, which indicate among the covariates x_j those which are associated to the response y_k in each multivariate regression. Structures for the joint model of the γ_kj, which correspond to different compromises between the aims of controlling sparsity and that of enhancing the detection of predictors that are associated with many responses ('hot spots'), will be discussed and a new multiplicative model for the probability structure of the γ_kj will be presented. To perform inference for these models in high dimensional setups, novel adaptive MCMC algorithms are needed. As sparsity is paramount and most of the associations are expected to be zero, new algorithms that progressively focus on part of the space where the most interesting associations occur are of great interest. We shall discuss their formulation and theoretical properties, and demonstrate their use on simulated and real data from genomics.

Bayesian Methods for Microarray Data

We review the use of Bayesian methods for analyzing gene expression data. We focus on methods which select groups of genes on the basis of their expression in RNA samples derived under different experimental conditions. We first describe Bayesian methods for estimating gene expression level from the intensity measurements obtained from analysis of microarray images. We next discuss the issues involved in assessing differential gene expression between two conditions at a time, including models for classifying the genes as differentially expressed or not. In the last two sections, we present models for grouping gene expression profiles over different experimental conditions, in order to find co-expressed genes, and multivariate models for finding gene signatures, i.e. for selecting a parsimonious group of genes that discriminate between entities such as subtypes of disease.

Sampling from Dirichlet process mixture models with unknown concentration parameter: mixing issues in large data implementations

Statistics and Computing, 2014

We consider the question of Markov chain Monte Carlo sampling from a general stick-breaking Dirichlet process mixture model, with concentration parameter α. This paper introduces a Gibbs sampling algorithm that combines the slice sampling approach of Walker (Communications in Statistics-Simulation and Computation 36:45-54, 2007) and the retrospective sampling approach of Papaspiliopoulos and Roberts (Biometrika 95(1):169-186, 2008). Our general algorithm is implemented as efficient open source C++ software, available as an R package, and is based on a blocking strategy similar to that suggested by Papaspiliopoulos (A note on posterior sampling from Dirichlet mixture models, 2008) and implemented by Yau et al. (Journal of the Royal Statistical Society, Series B (Statistical Methodology) 73:37-57, 2011). We discuss the difficulties of achieving good mixing in MCMC samplers of this nature in large data sets and investigate sensitivity to initialisation. We additionally consider the challenges when an additional layer of hierarchy is added such that joint inference is to be made on α. We introduce a new label-switching move and compute the marginal partition posterior to help to surmount these difficulties. Our work is illustrated using a profile regression (Molitor et al. Biostatistics 11(3):484-498, 2010) application, where we ...
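As a rough illustration of the stick-breaking representation mentioned in this abstract, the sketch below draws mixture weights for the standard Dirichlet process case, where each break fraction is Beta(1, α). It is not the sampler described in the paper: the function name `stick_breaking_weights`, the fixed truncation level `num_sticks`, and the seed are assumptions made only for this example.

```python
import random


def stick_breaking_weights(alpha, num_sticks, seed=0):
    """Draw the first `num_sticks` weights of a stick-breaking construction.

    Assumes the standard Dirichlet process case: each break fraction is
    Beta(1, alpha). The truncation at `num_sticks` is for illustration only.
    """
    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # length of the stick not yet allocated
    for _ in range(num_sticks):
        v = rng.betavariate(1.0, alpha)  # fraction broken off the remainder
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights, remaining


weights, leftover = stick_breaking_weights(alpha=2.0, num_sticks=50)
```

The drawn weights plus the unallocated remainder always sum to one. A larger α gives smaller break fractions on average, so mass is spread over more components.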

Modelling Heterogeneity With and Without the Dirichlet Process

Scandinavian Journal of Statistics, 2001

We investigate the relationships between Dirichlet process (DP) based models and allocation models for a variable number of components, based on exchangeable distributions. It is shown that the DP partition distribution is a limiting case of a Dirichlet-multinomial allocation model. Comparisons of posterior performance of DP and allocation models are made in the Bayesian paradigm and illustrated in the context of univariate mixture models. It is shown in particular that the unbalancedness of the allocation distribution, present in the prior DP model, persists a posteriori. Exploiting the model connections, a new MCMC sampler for general DP based models is introduced, which uses split/merge moves in a reversible jump framework. Performance of this new sampler relative to that of some traditional samplers for DP processes is then explored.
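The limiting relationship stated in this abstract can be sketched with the standard textbook calculation below. The notation is an assumption for illustration and is not taken from the paper: n items, k components with symmetric Dirichlet(δ, ..., δ) weights, and a partition g into d non-empty groups of sizes n_1, ..., n_d.

```latex
% Probability of a partition g under Dirichlet-multinomial allocation
% (assumed notation: k components, symmetric Dirichlet parameter \delta)
p_k(g) = \frac{k!}{(k-d)!}\,
         \frac{\Gamma(k\delta)}{\Gamma(k\delta + n)}
         \prod_{j=1}^{d} \frac{\Gamma(\delta + n_j)}{\Gamma(\delta)}

% Set \delta = \alpha / k and let k \to \infty. Then
%   k!/(k-d)! \sim k^d  and  \Gamma(\delta + n_j)/\Gamma(\delta) \sim \delta\,(n_j - 1)!,
% so k^d \delta^d = \alpha^d and the limit is the DP partition distribution:
\lim_{k \to \infty} p_k(g)
  = \alpha^{d}\, \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)}
    \prod_{j=1}^{d} (n_j - 1)!
```

The factor (n_j - 1)! favours a few large groups over many groups of similar size, which is one way to read the unbalancedness of the DP allocation distribution noted in the abstract.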

Mixture Models in Measurement Error Problems, with Reference to Epidemiological Studies

Journal of the Royal Statistical Society Series A: Statistics in Society, 2002

Summary: The paper focuses on a Bayesian treatment of measurement error problems and on the question of the specification of the prior distribution of the unknown covariates. It presents a flexible semiparametric model for this distribution based on a mixture of normal distributions with an unknown number of components. Implementation of this prior model as part of a full Bayesian analysis of measurement error problems is described in classical set-ups that are encountered in epidemiological studies: logistic regression between unknown covariates and outcome, with a normal or log-normal error model and a validation group. The feasibility of this combined model is tested and its performance is demonstrated in a simulation study that includes an assessment of the influence of misspecification of the prior distribution of the unknown covariates and a comparison with the semiparametric maximum likelihood method of Roeder, Carroll and Lindsay. Finally, the methodology is illustrated on a d...

Hidden Markov Models and Disease Mapping

Journal of the American Statistical Association, 2002

We present new methodology to extend hidden Markov models to the spatial domain, and use this class of models to analyze spatial heterogeneity of count data on a rare phenomenon. This situation occurs commonly in many domains of application, particularly in disease mapping. We assume that the counts follow a Poisson model at the lowest level of the hierarchy, and introduce a finite-mixture model for the Poisson rates at the next level. The novelty lies in the model for allocation to the mixture components, which follows a spatially correlated process, the Potts model, and in treating the number of components of the spatial mixture as unknown. Inference is performed in a Bayesian framework using reversible jump Markov chain Monte Carlo. The model introduced can be viewed as a Bayesian semiparametric approach to specifying flexible spatial distribution in hierarchical models. Performance of the model and comparison with an alternative well-known Markov random field specification for the Poisson rates are demonstrated on synthetic datasets. We show that our allocation model avoids the problem of oversmoothing in cases where the underlying rates exhibit discontinuities, while giving equally good results in cases of smooth gradient-like or highly autocorrelated rates. The methodology is illustrated on an epidemiologic application to data on a rare cancer in France.

Bayesian Detection of Expression Quantitative Trait Loci Hot Spots

Genetics, 2011

High-throughput genomics allows genome-wide quantification of gene expression levels in tissues and cell types and, when combined with sequence variation data, permits the identification of genetic control points of expression (expression QTL or eQTL). Clusters of eQTL influenced by single genetic polymorphisms can inform on hotspots of regulation of pathways and networks, although very few hotspots have been robustly detected, replicated, or experimentally verified. Here we present a novel modeling strategy to estimate the propensity of a genetic marker to influence several expression traits at the same time, based on a hierarchical formulation of related regressions. We implement this hierarchical regression model in a Bayesian framework using a stochastic search algorithm, HESS, that efficiently probes sparse subsets of genetic markers in a high-dimensional data matrix to identify hotspots and to pinpoint the individual genetic effects (eQTL). Simulating complex regulatory scenar...

Examining the Joint Effect of Multiple Risk Factors Using Exposure Risk Profiles: Lung Cancer in Nonsmokers

Environmental Health Perspectives, 2010

A semi-parametric approach to estimate risk functions associated with multi-dimensional exposure profiles: application to smoking and lung cancer

BMC Medical Research Methodology, 2013

Background: A common characteristic of environmental epidemiology is the multi-dimensional aspect of exposure patterns, frequently reduced to a cumulative exposure for simplicity of analysis. By adopting a flexible Bayesian clustering approach, we explore the risk function linking exposure history to disease. This approach is applied here to study the relationship between different smoking characteristics and lung cancer in the framework of a population-based case-control study. Methods: Our study includes 4658 males (1995 cases, 2663 controls) with full smoking history (intensity, duration, time since cessation, pack-years) from the ICARE multi-centre study conducted from 2001-2007. We extend Bayesian clustering techniques to explore predictive risk surfaces for covariate profiles of interest. Results: We were able to partition the population into 12 clusters with different smoking profiles and lung cancer risk. Our results confirm that when compared to intensity, duration is the pred...

Bayesian profile regression with an application to the National survey of children's health

Biostatistics, 2010

Standard regression analyses are often plagued with problems encountered when one tries to make inference going beyond main effects using data sets that contain dozens of variables that are potentially correlated. This situation arises, for example, in epidemiology where surveys or study questionnaires consisting of a large number of questions yield a potentially unwieldy set of interrelated data from which teasing out the effect of multiple covariates is difficult. We propose a method that addresses these problems for categorical covariates by using, as its basic unit of inference, a profile formed from a sequence of covariate values. These covariate profiles are clustered into groups and associated via a regression model to a relevant outcome. The Bayesian clustering aspect of the proposed modeling framework has a number of advantages over traditional clustering approaches in that it allows the number of groups to vary, uncovers subgroups and examines their association with an out...

Bayesian Modeling of Differential Gene Expression

Biometrics, 2005

We present a Bayesian hierarchical model for detecting differentially expressing genes that includes simultaneous estimation of array effects, and show how to use the output for choosing lists of genes for further investigation. We give empirical evidence that expression-level dependent array effects are needed, and explore different non-linear functions as part of our model-based approach to normalization. The model includes gene-specific variances but imposes some necessary shrinkage through a hierarchical structure. Model criticism via posterior predictive checks is discussed. Modelling the array effects (normalization) simultaneously with differential expression gives fewer false positive results. To choose a list of genes, we propose to combine various criteria (for instance, fold change and overall expression) into a single indicator variable for each gene. The posterior distribution of these variables is used to pick the list of genes, thereby taking into account uncertainty in parameter estimates. In an application to