Sandipan Roy | University of Bath

Papers by Sandipan Roy

Application of machine learning approaches in predicting clinical outcomes in older adults – a systematic review and meta-analysis

Background: Machine learning-based prediction models could have a considerable positive impact on geriatric care. Design: Systematic review and meta-analyses. Participants: Older adults (≥ 65 years) in any setting. Intervention: Machine learning models for predicting clinical outcomes in older adults were evaluated. A meta-analysis was conducted in which the predictive models were compared based on their performance in predicting mortality. Outcome measures: Studies were grouped by clinical outcome, and the models were compared based on the area under the receiver operating characteristic curve metric. Results: 29 studies that satisfied the systematic review criteria were appraised, and six studies predicting a mortality outcome were included in the meta-analyses. We could only pool studies by mortality, as there were inconsistent definitions and sparse data to pool studies for other clinical outcomes. The area under the receiver operating characteristic curve from six ...
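The pooling step described in this abstract is a standard random-effects meta-analysis. A minimal sketch, assuming DerSimonian-Laird estimation of the between-study variance; the six AUC values and standard errors below are invented for illustration, not taken from the review:

```python
# Random-effects pooling of per-study AUC estimates (DerSimonian-Laird).
# All numbers are hypothetical stand-ins, not the review's actual data.
aucs = [0.78, 0.82, 0.74, 0.80, 0.76, 0.85]
ses = [0.03, 0.04, 0.05, 0.03, 0.04, 0.05]

w = [1 / s ** 2 for s in ses]                          # fixed-effect weights
fe = sum(wi * a for wi, a in zip(w, aucs)) / sum(w)    # fixed-effect pooled AUC
q = sum(wi * (a - fe) ** 2 for wi, a in zip(w, aucs))  # Cochran's Q (heterogeneity)
k = len(aucs)
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (q - (k - 1)) / c)                     # between-study variance

w_re = [1 / (s ** 2 + tau2) for s in ses]              # random-effects weights
pooled = sum(wi * a for wi, a in zip(w_re, aucs)) / sum(w_re)
print(round(pooled, 3))
```

The random-effects weights shrink toward equality as the between-study variance grows, which is why heterogeneous studies are downweighted less aggressively than under a fixed-effect model.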

Risk of antimicrobial-associated organ injury among the older adults: a systematic review and meta-analysis

BMC Geriatrics

Background: Older adults (aged 65 years and above) constitute the fastest growing population cohort in the western world. There is increasing evidence that the burden of infections disproportionately affects older adults, and hence this vulnerable population is frequently exposed to antimicrobials. There is currently no systematic review summarising the evidence for organ injury risk among older adults following antimicrobial exposure. This systematic review and meta-analysis examined the relationship between antimicrobial exposure and organ injury in older adults. Methodology: We searched for original research articles in the PubMed, Embase.com, Web of Science core collection, Web of Science BIOSIS citation index, Scopus, Cochrane Central Register of Controlled Trials, ProQuest, and PsycINFO databases, using key words in titles and abstracts, and using MeSH terms. We searched for all available articles up to 31 May 2021. After removing duplicates, articles were screened for inclusion int...

Interpretable brain age prediction using linear latent variable models of functional connectivity

PLOS ONE, 2020

Neuroimaging-driven prediction of brain age, defined as the predicted biological age of a subject using only brain imaging data, is an exciting avenue of research. In this work we seek to build models of brain age based on functional connectivity while prioritizing model interpretability and understanding. This way, the models both provide accurate estimates of brain age and allow us to investigate the changes in functional connectivity that occur during the ageing process. The proposed methods consist of a two-step procedure: first, linear latent variable models, such as PCA and its extensions, are employed to learn reproducible functional connectivity networks present across a cohort of subjects. The activity within each network is subsequently employed as a feature in a linear regression model to predict brain age. The proposed framework is applied to data from the CamCAN repository, and the inferred brain age models are further demonstrated to generalize using data from two open-access repositories: the Human Connectome Project and the ATR Wide-Age-Range.
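The two-step procedure (learn a latent component, then regress age on the component score) can be sketched in miniature. This is only an illustrative toy, using power iteration for the leading principal component and synthetic "connectivity" features that are not real CamCAN data:

```python
import random

random.seed(0)

# Toy cohort: 50 subjects, 8 synthetic "connectivity" features driven by age.
n, p = 50, 8
ages = [20 + 60 * random.random() for _ in range(n)]
X = [[a * 0.01 * (j + 1) + random.gauss(0, 0.5) for j in range(p)] for a in ages]

# Step 1: leading principal component of the centred data via power iteration.
means = [sum(row[j] for row in X) / n for j in range(p)]
Xc = [[row[j] - means[j] for j in range(p)] for row in X]
v = [1.0] * p
for _ in range(100):
    s = [sum(row[j] * v[j] for j in range(p)) for row in Xc]   # X v
    w = [sum(s[i] * Xc[i][j] for i in range(n)) for j in range(p)]  # X^T (X v)
    norm = sum(x * x for x in w) ** 0.5
    v = [x / norm for x in w]

# Step 2: project subjects onto the component and regress age on the score.
scores = [sum(row[j] * v[j] for j in range(p)) for row in Xc]
sbar, abar = sum(scores) / n, sum(ages) / n
beta = (sum((s - sbar) * (a - abar) for s, a in zip(scores, ages))
        / sum((s - sbar) ** 2 for s in scores))
alpha = abar - beta * sbar
pred = [alpha + beta * s for s in scores]
mae = sum(abs(p_ - a) for p_, a in zip(pred, ages)) / n
print(round(mae, 2))
```

Because the latent component is a fixed linear combination of connectivity features, its loadings `v` can be inspected directly, which is the interpretability argument the abstract makes.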

Antimicrobial-associated Organ Injury among the Older Adults: A Systematic Review and Meta-Analysis

Background: Older adults (aged 65 years and above) constitute the fastest growing population cohort in the western world. There is increasing evidence that the burden of infections disproportionately affects older adults, and hence this vulnerable population is frequently exposed to antimicrobials. There is currently no systematic review summarising the evidence for organ injury risk among older adults following antimicrobial exposure. This systematic review and meta-analysis examined the relationship between antimicrobial exposure and organ injury in older adults. Methodology: We searched the PsycINFO, PubMed, and EMBASE databases for relevant articles using MeSH terms where applicable. After removing duplicates, articles were screened for inclusion into or exclusion from the study by two reviewers. The Newcastle-Ottawa scale was used to assess the risk of bias for cohort and case-control studies. The Cochrane collaboration's tool for risk of bias (version 2) was used to asse...

Hedging parameter selection for basis pursuit

arXiv: Computation, 2018

In compressed sensing and high-dimensional estimation, signal recovery often relies on sparsity assumptions, and estimation is performed via l1-penalized least-squares optimization, a.k.a. the LASSO. The l1 penalisation is usually controlled by a weight, also called the "relaxation parameter", denoted by λ. It is commonly thought that the practical efficiency of the LASSO for prediction crucially relies on accurate selection of λ. In this short note, we consider the hyper-parameter selection problem from a new perspective which combines the Hedge online learning method of Freund and Schapire with the stochastic Frank-Wolfe method for the LASSO. Using the Hedge algorithm, we show that our simple selection rule can achieve prediction results comparable to cross-validation at a potentially much lower computational cost.
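The Hedge-based selection rule can be illustrated schematically: treat each candidate λ as an expert and decay its weight exponentially in its per-round prediction loss. The loss function below is a synthetic stand-in for held-out LASSO prediction error (it is not the paper's actual scheme), chosen so that λ = 0.1 is best by construction:

```python
import math
import random

random.seed(1)

# Hedge (multiplicative weights) over a grid of candidate lambda values.
lambdas = [0.01, 0.1, 1.0, 10.0]
weights = [1.0] * len(lambdas)
eta = 0.5  # Hedge learning rate

def loss(lam, t):
    # Hypothetical stand-in for the round-t prediction loss of the LASSO
    # fitted with penalty lam; minimised at lam = 0.1 by construction.
    return (math.log10(lam) + 1) ** 2 * 0.1 + random.random() * 0.05

for t in range(200):
    losses = [loss(lam, t) for lam in lambdas]
    weights = [w * math.exp(-eta, ) if False else w * math.exp(-eta * l)
               for w, l in zip(weights, losses)]
    total = sum(weights)
    weights = [w / total for w in weights]  # renormalise each round

best = lambdas[max(range(len(lambdas)), key=lambda i: weights[i])]
print(best)  # the weight mass concentrates on the best-performing lambda
```

After a few hundred rounds the weight vector is nearly a point mass on the lowest-loss λ, at the cost of one loss evaluation per candidate per round rather than a full cross-validation refit.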

Consistent multiple changepoint estimation with fused Gaussian graphical models

Annals of the Institute of Statistical Mathematics, 2020

We consider the consistency properties of a regularised estimator for the simultaneous identification of both changepoints and graphical dependency structure in multivariate time series. Traditionally, estimation of Gaussian graphical models (GGMs) is performed in an i.i.d. setting. More recently, such models have been extended to allow for changes in the distribution, but primarily where changepoints are known a priori. In this work, we study the Group-Fused Graphical Lasso (GFGL), which penalises partial correlations with an L1 penalty while simultaneously inducing block-wise smoothness over time to detect multiple changepoints. We present a proof of consistency for the estimator, both in terms of changepoints and the structure of the graphical models in each segment. We contrast our results, which are based on a global (i.e., graph-wide) likelihood, with those previously obtained for performing dynamic graph estimation at a node-wise (or neighbourhood) level.

Likelihood Inference for Large Scale Stochastic Blockmodels With Covariates Based on a Divide-and-Conquer Parallelizable Algorithm With Communication

Journal of Computational and Graphical Statistics, 2018

We consider a stochastic blockmodel equipped with node covariate information, which is helpful in analyzing social network data. The key objective is to obtain maximum likelihood estimates of the model parameters. For this task, we devise a fast, scalable Monte Carlo EM-type algorithm based on a case-control approximation of the log-likelihood coupled with a subsampling approach. A key feature of the proposed algorithm is its parallelizability: portions of the data are processed on several cores, while key statistics are communicated across the cores during each iteration of the algorithm. The performance of the algorithm is evaluated on synthetic data sets and compared with competing methods for blockmodel parameter estimation. We also illustrate the model on data from a Facebook-derived social network enhanced with node covariate information.
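The case-control idea behind the likelihood approximation can be sketched as follows: keep every observed edge (the rare "cases") but only a reweighted subsample of the far more numerous non-edges (the "controls"). This toy uses a single constant edge probability in place of the blockmodel, so it illustrates only the weighting, not the authors' full algorithm:

```python
import math
import random

random.seed(2)

# Synthetic sparse network on 200 nodes.
n = 200
p_edge = 0.05
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
A = {pair: 1 if random.random() < p_edge else 0 for pair in pairs}

def ll_term(y, p):
    # Bernoulli log-likelihood contribution of one node pair.
    return y * math.log(p) + (1 - y) * math.log(1 - p)

p_hat = 0.05  # model-implied edge probability, held fixed for illustration

# Exact log-likelihood: every one of the ~n^2/2 pairs contributes a term.
ll_full = sum(ll_term(A[pr], p_hat) for pr in pairs)

# Case-control version: all cases + 10% of controls, upweighted by 1/rate.
cases = [pr for pr in pairs if A[pr] == 1]
controls = [pr for pr in pairs if A[pr] == 0]
rate = 0.1
sub = random.sample(controls, int(rate * len(controls)))
ll_cc = (sum(ll_term(1, p_hat) for _ in cases)
         + (1 / rate) * sum(ll_term(0, p_hat) for _ in sub))

print(abs(ll_cc - ll_full) / abs(ll_full) < 0.05)  # close, with far fewer terms
```

The approximation touches roughly `len(cases) + rate * len(controls)` terms instead of all pairs, which is where the speed-up comes from in sparse networks.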

Change Point Estimation in High Dimensional Markov Random-Field Models

Journal of the Royal Statistical Society Series B: Statistical Methodology, 2016

Summary: The paper investigates a change point estimation problem in the context of high dimensional Markov random-field models. Change points represent a key feature in many dynamically evolving network structures. The change point estimate is obtained by maximizing a profile penalized pseudolikelihood function under a sparsity assumption. We also derive a tight bound for the estimate, up to a logarithmic factor, even in settings where the number of possible edges in the network far exceeds the sample size. The performance of the proposed estimator is evaluated on synthetic data sets, and the estimator is also used to explore voting patterns in the US Senate in the 1979–2012 period.
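The profile-maximization idea can be illustrated with a much simpler model: scan candidate change points, fit each segment separately, and take the best-scoring split. A one-dimensional mean-shift toy stands in here for the penalized pseudolikelihood of the Markov random-field setting:

```python
import random

random.seed(3)

# Synthetic series with a mean shift at index 60.
x = [random.gauss(0.0, 1.0) for _ in range(60)] + \
    [random.gauss(2.0, 1.0) for _ in range(40)]

def rss(seg):
    # Residual sum of squares around the segment's own mean
    # (the profiled-out nuisance parameter in this toy model).
    m = sum(seg) / len(seg)
    return sum((v - m) ** 2 for v in seg)

# Profile score of a split at tau: negative total RSS of the two segments.
scores = {tau: -(rss(x[:tau]) + rss(x[tau:])) for tau in range(5, 95)}
tau_hat = max(scores, key=scores.get)
print(tau_hat)
```

The real estimator replaces the segment means with sparse graphical-model fits and adds a penalty, but the outer structure — profile out the within-segment parameters, then maximize over the split point — is the same.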

Bayesian inference in nonparametric dynamic state-space models

Statistical Methodology, 2014

We introduce state-space models where the functionals of the observational and the evolutionary equations are unknown, and treated as random functions evolving with time. Thus, our model is nonparametric and generalizes the traditional parametric state-space models. This random function approach also frees us from the restrictive assumption that the functional forms, although time-dependent, are of fixed forms. The traditional approach of assuming known, parametric functional forms is questionable, particularly in state-space models, since validation of the assumptions requires data on both the observed time series and the latent states; however, data on the latter are not available in state-space models. We specify Gaussian processes as priors of the random functions and exploit the "lookup table approach" of Bhattacharya (2007) to efficiently handle the dynamic structure of the model. We consider both univariate and multivariate situations, using the Markov chain Monte Carlo (MCMC) approach for studying the posterior distributions of interest. We illustrate our methods with simulated data sets, in both univariate and multivariate situations. Moreover, using our Gaussian process approach we analyse a real data set, which has also been analysed by Shumway & Stoffer (1982) and Carlin, Polson & Stoffer (1992) using the linearity assumption. Interestingly, our analyses indicate that towards the end of the time series, the linearity assumption is perhaps questionable.

Statistical Inference and Computational Methods for Large High-Dimensional Data with Network Structure

Statistical Inference and Computational Methods for Large High-Dimensional Data with Network Structure, by Sandipan Roy. Chairs: Yves Atchadé and George Michailidis. New technological advancements have allowed the collection of datasets of large volume and different levels of complexity. Many of these datasets have an underlying network structure. Networks are capable of capturing dependence relationships among a group of entities, and hence analyzing these datasets unearths the underlying structural dependence among the individuals. Examples include gene regulatory networks, understanding stock markets, protein-protein interactions within the cell, online social networks, etc. The thesis addresses two important aspects of large high-dimensional data with network structure. The first focuses on high-dimensional data with a network structure that evolves over time. Examples of such data sets include time-course gene expression data, voting records of legislative bodies, etc. The main task is t...

ISIS and NISIS: New bilingual dual-channel speech corpora for robust speaker recognition

It is standard practice to use benchmark datasets for meaningfully comparing the performance of a number of competing speaker identification systems. Generally, such datasets consist of speech recordings from different speakers made at a single point in time, typically in the same language. That is, the training and test sets both consist of speech recorded at the same point in time, in the same language, over the same recording channel. This is generally not the case in real-life applications. In this paper, we introduce a new database consisting of speech recordings of 105 speakers, made over four sessions, in two languages and simultaneously over two channels. This database provides scope for experimentation regarding loss in efficiency due to possible mismatch in language, channel and recording session. Results of experiments with MFCC-based GMM speaker models are presented to highlight the need for such benchmark datasets for identifying robust speaker identification systems.
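The scoring side of an MFCC-GMM system can be sketched as follows: each enrolled speaker is represented by a Gaussian mixture over acoustic feature vectors, and a test utterance is assigned to the speaker whose model gives the highest total log-likelihood. The 2-D "frames" and mixture parameters below are invented stand-ins for real MFCC features:

```python
import math

def log_gauss(x, mean, var):
    # Log-density of a diagonal-covariance Gaussian at x.
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_loglik(frames, gmm):
    # Total log-likelihood of an utterance under one speaker's mixture,
    # assuming frame independence; components are (weight, mean, var).
    total = 0.0
    for x in frames:
        comp = [math.log(w) + log_gauss(x, m, v) for w, m, v in gmm]
        mx = max(comp)  # log-sum-exp for numerical stability
        total += mx + math.log(sum(math.exp(c - mx) for c in comp))
    return total

# Hypothetical two-component models for two enrolled speakers.
speakers = {
    "A": [(0.5, (0.0, 0.0), (1.0, 1.0)), (0.5, (2.0, 2.0), (1.0, 1.0))],
    "B": [(0.5, (5.0, 5.0), (1.0, 1.0)), (0.5, (7.0, 7.0), (1.0, 1.0))],
}
test_frames = [(0.1, -0.2), (1.9, 2.1), (0.4, 0.3)]  # near speaker A's modes

best = max(speakers, key=lambda s: gmm_loglik(test_frames, speakers[s]))
print(best)  # "A"
```

Mismatched language, channel, or session shifts the distribution of the test frames relative to the enrolled mixtures, which is exactly the degradation the corpora are designed to measure.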

Inference of tissue relative proportions of the breast epithelial cell types luminal progenitor, basal, and luminal mature

Single-cell analysis has revolutionised genomic science in recent years. However, due to cost and other practical considerations, single-cell analyses are impossible for studies based on medium or large patient cohorts. For example, a single-cell analysis usually costs thousands of euros for one tissue sample from one volunteer, meaning that typical studies using single-cell analyses are based on very few individuals. While single-cell genomic data can be used to examine the phenotype of individual cells, cell-type deconvolution methods are required to track the quantities of these cells in bulk-tissue genomic data. Hormone receptor negative breast cancers are highly aggressive, and are thought to originate from a subtype of epithelial cells called the luminal progenitor. In this paper, we show how to quantify the number of luminal progenitor cells as well as other epithelial subtypes in breast tissue samples using DNA- and RNA-based measurements. We find elevated levels of cells whi...
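The deconvolution step can be sketched as least squares over the simplex of cell-type proportions: bulk expression is modelled as a weighted mixture of reference profiles. Everything below (the marker-gene profiles and the bulk sample) is invented for illustration, and a coarse grid search stands in for the constrained optimisation a real method would use:

```python
# Hypothetical marker-gene expression profiles per epithelial cell type.
profiles = {
    "luminal_progenitor": [9.0, 1.0, 2.0, 0.5],
    "basal":              [1.0, 8.0, 1.0, 0.5],
    "luminal_mature":     [2.0, 1.0, 7.0, 3.0],
}
names = list(profiles)

# Simulate a bulk sample as a known mixture of the three profiles.
true_props = [0.5, 0.2, 0.3]
bulk = [sum(tp * profiles[n][g] for tp, n in zip(true_props, names))
        for g in range(4)]

def sse(props):
    # Squared error between the mixed profiles and the observed bulk.
    fit = [sum(p * profiles[n][g] for p, n in zip(props, names))
           for g in range(4)]
    return sum((f - b) ** 2 for f, b in zip(fit, bulk))

# Search proportions on a 0.05 grid, constrained to sum to one.
step = 0.05
grid = [round(i * step, 2) for i in range(int(1 / step) + 1)]
best = min(((a, b, round(1 - a - b, 2)) for a in grid for b in grid
            if a + b <= 1), key=sse)
print(best)  # recovers (0.5, 0.2, 0.3)
```

Because the three reference profiles are linearly independent, the mixture that generated the bulk sample is the unique zero-error solution, so the grid search recovers the true proportions exactly.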

Change-point Estimation in High Dimensional Markov Random Field

This paper investigates a change-point estimation problem in the context of high-dimensional Markov random field models. Change-points represent a key feature in many dynamically evolving network structures. The change-point estimate is obtained by maximizing a profile penalized pseudo-likelihood function under a sparsity assumption. We also derive a tight bound for the estimate, up to a logarithmic factor, even in settings where the number of possible edges in the network far exceeds the sample size. The performance of the proposed estimator is evaluated on synthetic data sets and is also used to explore voting patterns in the US Senate in the 1979–2012 period.

Bayesian Inference in Nonparametric Dynamic State-Space Models

We introduce state-space models where the functionals of the observational and the evolutionary equations are unknown, and treated as random functions evolving with time. Thus, our model is nonparametric and generalizes parametric state-space models, such as the extended Kalman filter. This random function approach also frees us from the restrictive assumption that the functional forms, although time-dependent, are of fixed forms. We specify Gaussian processes as priors of the random functions and exploit the "look-up table approach" of Bhattacharya (2007) to efficiently handle the dynamic structure of the model. We consider both univariate and multivariate situations, using the Markov chain Monte Carlo (MCMC) approach for studying the posterior distributions of interest. In the case of challenging multivariate situations we demonstrate that the newly developed Transformation-based MCMC (TMCMC) provides interesting and efficient alternatives to the usual proposal distributions. We illustrate our methods with simulated data sets, obtaining very encouraging results in both univariate and multivariate situations. Moreover, using our Gaussian process approach we analysed a real data set, which has also been analysed by Shumway & Stoffer (1982) and Carlin, Polson & Stoffer (1992) using the linearity assumption. Our analyses show that towards the end of the time series, the linearity assumption of the previous authors breaks down.

Drafts by Sandipan Roy

Likelihood Inference for Large Scale Stochastic Blockmodels with Covariates based on a Divide-and-Conquer Parallelizable Algorithm with Communication

We consider a stochastic blockmodel equipped with node covariate information, which is useful in analyzing social network data. The objective is to obtain maximum likelihood estimates of the model parameters. For this task, we devise a fast, scalable Monte Carlo EM-type algorithm based on a case-control approximation of the log-likelihood coupled with a subsampling approach. A key feature of the proposed algorithm is its parallelizability: chunks of the data are processed on several cores, while key statistics are communicated across the cores during every iteration. The performance of the algorithm is evaluated on synthetic data sets and compared with competing methods for blockmodel parameter estimation. We also illustrate the model on data from a Facebook social network enhanced with node covariate information.

Research paper thumbnail of Application of machine learning approaches in predicting clinical outcomes in older adults – a systematic review and meta-analysis

Background Machine learning-based prediction models have the potential to have a considerable pos... more Background Machine learning-based prediction models have the potential to have a considerable positive impact on geriatric care. Design: Systematic review and meta-analyses. Participants: Older adults (≥ 65 years) in any setting. Intervention: Machine learning models for predicting clinical outcomes in older adults were evaluated. A meta-analysis was conducted where the predictive models were compared based on their performance in predicting mortality. Outcome measures: Studies were grouped by the clinical outcome, and the models were compared based on the area under the receiver operating characteristic curve metric. Results 29 studies that satisfied the systematic review criteria were appraised and six studies predicting a mortality outcome were included in the meta-analyses. We could only pool studies by mortality as there were inconsistent definitions and sparse data to pool studies for other clinical outcomes. The area under the receiver operating characteristic curve from six ...

Research paper thumbnail of Risk of antimicrobial-associated organ injury among the older adults: a systematic review and meta-analysis

BMC Geriatrics

Background Older adults (aged 65 years and above) constitute the fastest growing population cohor... more Background Older adults (aged 65 years and above) constitute the fastest growing population cohort in the western world. There is increasing evidence that the burden of infections disproportionately affects older adults, and hence this vulnerable population is frequently exposed to antimicrobials. There is currently no systematic review summarising the evidence for organ injury risk among older adults following antimicrobial exposure. This systematic review and meta-analysis examined the relationship between antimicrobial exposure and organ injury in older adults. Methodology We searched for original research articles in PubMed, Embase.com, Web of Science core collection, Web of Science BIOSIS citation index, Scopus, Cochrane Central Register of Controlled Trials, ProQuest, and PsycINFO databases, using key words in titles and abstracts, and using MeSH terms. We searched for all available articles up to 31 May 2021. After removing duplicates, articles were screened for inclusion int...

Research paper thumbnail of Interpretable brain age prediction using linear latent variable models of functional connectivity

PLOS ONE, 2020

Neuroimaging-driven prediction of brain age, defined as the predicted biological age of a subject... more Neuroimaging-driven prediction of brain age, defined as the predicted biological age of a subject using only brain imaging data, is an exciting avenue of research. In this work we seek to build models of brain age based on functional connectivity while prioritizing model interpretability and understanding. This way, the models serve to both provide accurate estimates of brain age as well as allow us to investigate changes in functional connectivity which occur during the ageing process. The methods proposed in this work consist of a two-step procedure: first, linear latent variable models, such as PCA and its extensions, are employed to learn reproducible functional connectivity networks present across a cohort of subjects. The activity within each network is subsequently employed as a feature in a linear regression model to predict brain age. The proposed framework is employed on the data from the CamCAN repository and the inferred brain age models are further demonstrated to generalize using data from two open-access repositories: the Human Connectome Project and the ATR Wide-Age-Range.

Research paper thumbnail of Antimicrobial-associated Organ Injury among the Older Adults: A Systematic Review and Meta-Analysis

Background: Older adults (aged 65 years and above) constitute the fastest growing population coho... more Background: Older adults (aged 65 years and above) constitute the fastest growing population cohort in the western world. There is increasing evidence that the burden of infections disproportionately affects older adults, and hence this vulnerable population is frequently exposed to antimicrobials. There is currently no systematic review summarising the evidence for organ injury risk among older adults following antimicrobial exposure. This systematic review and meta-analysis examined the relationship between antimicrobial exposure and organ injury in older adults. Methodology: We searched for Psych INFO, PubMed, and EMBASE databases for relevant articles using MeSH terms where applicable. After removing duplicates, articles were screened for inclusion into or exclusion from the study by two reviewers. The Newcastle-Ottawa scale was used to assess the risk of bias for cohort and case-control studies. The Cochrane collaboration's tool for risk of bias (version 2) was used to asse...

Research paper thumbnail of Hedging parameter selection for basis pursuit

arXiv: Computation, 2018

In Compressed Sensing and high dimensional estimation, signal recovery often relies on sparsity a... more In Compressed Sensing and high dimensional estimation, signal recovery often relies on sparsity assumptions and estimation is performed via l1-penalized least-squares optimization, a.k.a. LASSO. The l1 penalisation is usually controlled by a weight, also called "relaxation parameter", denoted by λ. It is commonly thought that the practical efficiency of the LASSO for prediction crucially relies on accurate selection of λ. In this short note, we propose to consider the hyper-parameter selection problem from a new perspective which combines the Hedge online learning method by Freund and Shapire, with the stochastic Frank-Wolfe method for the LASSO. Using the Hedge algorithm, we show that a our simple selection rule can achieve prediction results comparable to Cross Validation at a potentially much lower computational cost.

Research paper thumbnail of Consistent multiple changepoint estimation with fused Gaussian graphical models

Annals of the Institute of Statistical Mathematics, 2020

We consider the consistency properties of a regularised estimator for the simultaneous identifica... more We consider the consistency properties of a regularised estimator for the simultaneous identification of both changepoints and graphical dependency structure in multivariate time-series. Traditionally, estimation of Gaussian graphical models (GGM) is performed in an i.i.d setting. More recently, such models have been extended to allow for changes in the distribution, but primarily where changepoints are known a priori. In this work, we study the Group-Fused Graphical Lasso (GFGL) which penalises partial correlations with an L1 penalty while simultaneously inducing block-wise smoothness over time to detect multiple changepoints. We present a proof of consistency for the estimator, both in terms of changepoints, and the structure of the graphical models in each segment. We contrast our results, which are based on a global, i.e. graph-wide likelihood, with those previously obtained for performing dynamic graph estimation at a node-wise (or neighbourhood) level.

Research paper thumbnail of Likelihood Inference for Large Scale Stochastic Blockmodels With Covariates Based on a Divide-and-Conquer Parallelizable Algorithm With Communication

Journal of Computational and Graphical Statistics, 2018

We consider a stochastic blockmodel equipped with node covariate information, that is helpful in ... more We consider a stochastic blockmodel equipped with node covariate information, that is helpful in analyzing social network data. The key objective is to obtain maximum likelihood estimates of the model parameters. For this task, we devise a fast, scalable Monte Carlo EM type algorithm based on case-control approximation of the log-likelihood coupled with a subsampling approach. A key feature of the proposed algorithm is its parallelizability, by processing portions of the data on several cores, while leveraging communication of key statistics across the cores during each iteration of the algorithm. The performance of the algorithm is evaluated on synthetic data sets and compared with competing methods for blockmodel parameter estimation. We also illustrate the model on data from a Facebook derived social network enhanced with node covariate information.

Research paper thumbnail of Change Point Estimation in High Dimensional Markov Random-Field Models

Journal of the Royal Statistical Society Series B: Statistical Methodology, 2016

Summary The paper investigates a change point estimation problem in the context of high dimension... more Summary The paper investigates a change point estimation problem in the context of high dimensional Markov random-field models. Change points represent a key feature in many dynamically evolving network structures. The change point estimate is obtained by maximizing a profile penalized pseudolikelihood function under a sparsity assumption. We also derive a tight bound for the estimate, up to a logarithmic factor, even in settings where the number of possible edges in the network far exceeds the sample size. The performance of the estimator proposed is evaluated on synthetic data sets and is also used to explore voting patterns in the US Senate in the 1979–2012 period.

Research paper thumbnail of Bayesian inference in nonparametric dynamic state-space models

Statistical Methodology, 2014

We introduce state-space models where the functionals of the observational and the evolutionary e... more We introduce state-space models where the functionals of the observational and the evolutionary equations are unknown, and treated as random functions evolving with time. Thus, our model is nonparametric and generalizes the traditional parametric state-space models. This random function approach also frees us from the restrictive assumption that the functional forms, although time-dependent, are of fixed forms. The traditional approach of assuming known, parametric functional forms is questionable, particularly in state-space models, since the validation of the assumptions require data on both the observed time series and the latent states; however, data on the latter are not available in state-space models. We specify Gaussian processes as priors of the random functions and exploit the "lookup table approach" of Bhattacharya (2007) to efficiently handle the dynamic structure of the model. We consider both univariate and multivariate situations, using the Markov chain Monte Carlo (MCMC) approach for studying the posterior distributions of interest. We illustrate our methods with simulated data sets, in both univariate and multivariate situations. Moreover, using our Gaussian process approach we analyse a real data set, which has also been analysed by Shumway & Stoffer (1982) and Carlin, Polson & Stoffer (1992) using the linearity assumption. Interestingly, our analyses indicate that towards the end of the time series, the linearity assumption is perhaps questionable.

Research paper thumbnail of Statistical Inference and Computational Methods for Large High-Dimensional Data with Network Structure

Statistical Inference and Computational Methods for Large High-Dimensional Data with Network Stru... more Statistical Inference and Computational Methods for Large High-Dimensional Data with Network Structure by Sandipan Roy Chair: Yves Atchadé and George Michailidis New technological advancements have allowed collection of datasets of large volume and different levels of complexity. Many of these datasets have an underlying network structure. Networks are capable of capturing dependence relationship among a group of entities and hence analyzing these datasets unearth the underlying structural dependence among the individuals. Examples include gene regulatory networks, understanding stock markets, protein-protein interaction within the cell, online social networks etc. The thesis addresses two important aspects of large high-dimensional data with network structure. The first one focuses on a high-dimensional data with network structure that evolves over time. Examples of such data sets include time course gene expression data, voting records of legislative bodies etc. The main task is t...

Research paper thumbnail of ISIS and NISIS: New bilingual dual-channel speech corpora for robust speaker recognition

It is standard practice to use benchmark datasets to meaningfully compare the performance of competing speaker identification systems. Generally, such datasets consist of speech recordings from different speakers made at a single point in time, typically in the same language. That is, the training and test sets both consist of speech recorded at the same point in time, in the same language, over the same recording channel. This is generally not the case in real-life applications. In this paper, we introduce a new database consisting of speech recordings of 105 speakers, made over four sessions, in two languages and simultaneously over two channels. This database provides scope for experimentation on the loss of efficiency due to possible mismatch in language, channel and recording session. Results of experiments with MFCC-based GMM speaker models are presented to highlight the need for such benchmark datasets in identifying robust speaker identification systems.
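To make the evaluation pipeline concrete, here is a minimal sketch of MFCC-based GMM speaker identification of the kind the paper's experiments use. The MFCC front end is omitted: random feature clouds stand in for per-speaker MFCC frames, and scikit-learn's `GaussianMixture` is assumed available. This is an illustrative toy, not the paper's experimental setup.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-ins for MFCC frames: one cloud of 13-dim feature
# vectors per speaker (a real system would extract MFCCs from audio).
speakers = {
    "spk_a": rng.normal(loc=0.0, scale=1.0, size=(500, 13)),
    "spk_b": rng.normal(loc=3.0, scale=1.0, size=(500, 13)),
}

# Train one GMM per speaker on that speaker's training frames.
models = {
    name: GaussianMixture(n_components=4, covariance_type="diag",
                          random_state=0).fit(frames)
    for name, frames in speakers.items()
}

def identify(test_frames):
    """Closed-set identification: pick the speaker model with the
    highest average per-frame log-likelihood."""
    scores = {name: m.score(test_frames) for name, m in models.items()}
    return max(scores, key=scores.get)

# A fresh utterance drawn from speaker A's distribution.
test_utt = rng.normal(loc=0.0, scale=1.0, size=(200, 13))
print(identify(test_utt))
```

Mismatched language, channel or session would, in this picture, shift the test-frame distribution away from the training clouds, which is exactly the degradation the ISIS/NISIS corpora are designed to measure.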

Research paper thumbnail of Inference of tissue relative proportions of the breast epithelial cell types luminal progenitor, basal, and luminal mature

Single-cell analysis has revolutionised genomic science in recent years. However, due to cost and other practical considerations, single-cell analyses are infeasible for studies based on medium or large patient cohorts. For example, a single-cell analysis usually costs thousands of euros for one tissue sample from one volunteer, meaning that typical studies using single-cell analyses are based on very few individuals. While single-cell genomic data can be used to examine the phenotype of individual cells, cell-type deconvolution methods are required to track the quantities of these cells in bulk-tissue genomic data. Hormone-receptor-negative breast cancers are highly aggressive and are thought to originate from a subtype of epithelial cells called the luminal progenitor. In this paper, we show how to quantify the number of luminal progenitor cells, as well as other epithelial subtypes, in breast tissue samples using DNA- and RNA-based measurements. We find elevated levels of cells whi...
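A common way to implement cell-type deconvolution of bulk tissue, sketched here as an illustrative stand-in rather than the paper's exact method, is non-negative least squares against a marker-gene signature matrix, followed by renormalisation of the weights into proportions. All numbers below are synthetic, and the three columns hypothetically represent luminal progenitor, basal and luminal mature signatures.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)

# Hypothetical signature matrix: mean expression of 50 marker genes
# for three epithelial subtypes (columns).
signatures = rng.gamma(shape=2.0, scale=1.0, size=(50, 3))

# Simulate one bulk sample as a known mixture of the subtypes plus noise.
true_props = np.array([0.5, 0.3, 0.2])
bulk = signatures @ true_props + rng.normal(scale=0.01, size=50)

# Non-negative least squares gives raw weights; renormalise to proportions.
weights, _ = nnls(signatures, bulk)
est_props = weights / weights.sum()
print(np.round(est_props, 2))
```

The non-negativity constraint is what makes the estimates interpretable as cell-type proportions; an unconstrained least-squares fit could return negative quantities.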

Research paper thumbnail of Change-point Estimation in High Dimensional Markov Random Field

This paper investigates a change-point estimation problem in the context of high-dimensional Markov random field models. Change-points represent a key feature in many dynamically evolving network structures. The change-point estimate is obtained by maximizing a profile penalized pseudo-likelihood function under a sparsity assumption. We also derive a tight bound for the estimate, up to a logarithmic factor, even in settings where the number of possible edges in the network far exceeds the sample size. The performance of the proposed estimator is evaluated on synthetic data sets, and the method is also used to explore voting patterns in the US Senate over the period 1979–2012.
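The profiling idea can be illustrated on a deliberately simplified stand-in: instead of a full Markov random field pseudo-likelihood, each network snapshot is reduced to independent Bernoulli edges, and the change-point is profiled out by plugging segment-wise maximum-likelihood estimates into the objective. This hedged numpy sketch uses synthetic data and omits the penalization and high-dimensional structure of the actual paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# T snapshots of a 50-edge network whose edge probability jumps at an
# unknown time tau0 (a toy surrogate for the MRF setting).
T, n_edges, tau0 = 200, 50, 120
probs = np.where(np.arange(T) < tau0, 0.2, 0.6)
edges = rng.binomial(1, probs[:, None], size=(T, n_edges))

def seg_loglik(x):
    """Profiled Bernoulli log-likelihood of a segment (MLE plugged in)."""
    p = np.clip(x.mean(), 1e-6, 1 - 1e-6)
    k = x.sum()
    return k * np.log(p) + (x.size - k) * np.log(1 - p)

# Profile over candidate change-points: each candidate splits the data
# into two segments, each fitted at its own MLE.
candidates = list(range(10, T - 10))
scores = [seg_loglik(edges[:t]) + seg_loglik(edges[t:]) for t in candidates]
tau_hat = candidates[int(np.argmax(scores))]
print(tau_hat)
```

In the paper's setting the per-segment fit is a penalized pseudo-likelihood over a sparse graphical model rather than a Bernoulli MLE, but the outer maximization over candidate split points has the same shape.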

Research paper thumbnail of Bayesian Inference in Nonparametric Dynamic State-Space Models

We introduce state-space models where the functionals of the observational and the evolutionary equations are unknown and treated as random functions evolving with time. Our model is therefore nonparametric and generalizes parametric state-space models such as the extended Kalman filter. This random-function approach also frees us from the restrictive assumption that the functional forms, although time-dependent, are fixed. We specify Gaussian processes as priors on the random functions and exploit the "look-up table approach" of Bhattacharya (2007) to handle the dynamic structure of the model efficiently. We consider both univariate and multivariate situations, using the Markov chain Monte Carlo (MCMC) approach to study the posterior distributions of interest. In challenging multivariate situations we demonstrate that the newly developed Transformation-based MCMC (TMCMC) provides interesting and efficient alternatives to the usual proposal distributions. We illustrate our methods with simulated data sets, obtaining very encouraging results in both univariate and multivariate situations. Moreover, using our Gaussian process approach we analysed a real data set that has also been analysed by Shumway & Stoffer (1982) and Carlin, Polson & Stoffer (1992) under the linearity assumption. Our analyses show that towards the end of the time series, the linearity assumption of the previous authors breaks down.
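The modelling idea can be sketched generatively: draw the evolution and observation functions as Gaussian-process sample paths on a fixed grid (in the spirit of the look-up table approach) and simulate the resulting state-space model. This is a minimal numpy illustration with arbitrary synthetic settings, not the paper's inference procedure.

```python
import numpy as np

rng = np.random.default_rng(3)

def gp_sample(grid, length_scale=1.0, amp=1.0):
    """One draw from a zero-mean GP with a squared-exponential kernel,
    evaluated on a fixed grid of inputs."""
    d = grid[:, None] - grid[None, :]
    K = amp**2 * np.exp(-0.5 * (d / length_scale) ** 2)
    return rng.multivariate_normal(np.zeros(len(grid)), K + 1e-8 * np.eye(len(grid)))

grid = np.linspace(-4, 4, 200)
f_vals = gp_sample(grid)            # evolution function: x_t = f(x_{t-1}) + eta_t
g_vals = gp_sample(grid)            # observation function: y_t = g(x_t) + eps_t
f = lambda v: np.interp(v, grid, f_vals)
g = lambda v: np.interp(v, grid, g_vals)

T = 100
x = np.zeros(T)                     # latent states
y = np.zeros(T)                     # observations
for t in range(1, T):
    x[t] = f(x[t - 1]) + 0.1 * rng.normal()
    y[t] = g(x[t]) + 0.1 * rng.normal()
print(y[:5])
```

Because f and g are random draws rather than fixed parametric forms, no linearity (or any other functional) assumption is imposed; inference in the paper places the GP prior on these functions and explores the posterior by MCMC.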

Research paper thumbnail of Likelihood Inference for Large Scale Stochastic Blockmodels with Covariates based on a Divide-and-Conquer Parallelizable Algorithm with Communication

We consider a stochastic blockmodel equipped with node covariate information, which is useful in analyzing social network data. The objective is to obtain maximum likelihood estimates of the model parameters. For this task, we devise a fast, scalable Monte Carlo EM-type algorithm based on a case-control approximation of the log-likelihood coupled with a subsampling approach. A key feature of the proposed algorithm is its parallelizability: chunks of the data are processed on several cores, while key statistics are communicated across the cores at every iteration. The performance of the algorithm is evaluated on synthetic data sets and compared with competing methods for blockmodel parameter estimation. We also illustrate the model on data from a Facebook social network enhanced with node covariate information.
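The case-control idea can be sketched in isolation: every observed edge (case) enters the log-likelihood, while the typically far more numerous non-edges (controls) are subsampled and reweighted so the approximation stays unbiased. A minimal Python sketch on a synthetic two-block network, with block labels treated as known; the EM updates, covariates and parallel communication of the actual algorithm are omitted.

```python
import numpy as np

rng = np.random.default_rng(4)

# Small two-block stochastic blockmodel.
n = 30
z = rng.integers(0, 2, size=n)                # block labels (assumed known here)
B = np.array([[0.6, 0.1], [0.1, 0.5]])        # block connection probabilities
P = B[z[:, None], z[None, :]]
A = (rng.random((n, n)) < P).astype(float)
iu = np.triu_indices(n, k=1)                  # undirected pairs, no self-loops

def full_loglik(A, P):
    a, p = A[iu], P[iu]
    return np.sum(a * np.log(p) + (1 - a) * np.log(1 - p))

def case_control_loglik(A, P, n_controls, rng):
    """Keep every edge (case); subsample non-edges (controls) and
    reweight their contribution by (#controls / n_controls)."""
    a, p = A[iu], P[iu]
    cases = a == 1
    controls = np.flatnonzero(~cases)
    pick = rng.choice(controls, size=n_controls, replace=False)
    w = controls.size / n_controls
    return np.sum(np.log(p[cases])) + w * np.sum(np.log(1 - p[pick]))

exact = full_loglik(A, P)
approx = case_control_loglik(A, P, n_controls=100, rng=rng)
print(exact, approx)
```

On sparse networks the number of non-edges grows like n^2 while the edges stay near-linear in n, so evaluating only a fixed number of reweighted controls per EM iteration is what makes the method scale.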