Effective Bayesian inference for sparse factor analysis models

Inference algorithms and learning theory for Bayesian sparse factor analysis

Journal of Physics: Conference Series, 2009

Bayesian sparse factor analysis has many applications; for example, it has been applied to the problem of inferring a sparse regulatory network from gene expression data. We describe a number of inference algorithms for Bayesian sparse factor analysis using a slab and spike mixture prior. These include well-established Markov chain Monte Carlo (MCMC) and variational Bayes (VB) algorithms as well as a novel hybrid of VB and Expectation Propagation (EP). For the case of a single latent factor we derive a theory of learning performance using the replica method. We compare results from the MCMC and VB/EP algorithms on simulated data with the theoretical predictions. The MCMC results agree closely with the theory, as expected. Results for VB/EP are slightly sub-optimal but show that the new algorithm is effective for sparse inference. In large-scale problems MCMC is computationally infeasible, and the VB/EP algorithm then provides a very useful, computationally efficient alternative.
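For concreteness, the slab and spike mixture prior referred to above typically places, on each factor loading, a mixture of a point mass at zero and a Gaussian slab; a generic form of the resulting model (notation ours, not necessarily the paper's) is

\[
x = W z + \varepsilon, \qquad z \sim \mathcal{N}(0, I), \qquad
p(w_{ij} \mid \pi) = (1 - \pi)\, \delta_0(w_{ij}) + \pi\, \mathcal{N}(w_{ij} \mid 0, \sigma_w^2),
\]

where \(\pi\) is the prior inclusion probability controlling the sparsity of the loading matrix \(W\).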

Bayesian sparse factor analysis with kernelized observations

Neurocomputing, 2022

Latent variable models for multi-view learning attempt to find low-dimensional projections that faithfully capture the correlations among the multiple views that characterise each datum. High-dimensional views in medium-sized datasets and non-linear problems are traditionally handled by kernel methods, which induce a (non)linear mapping between the latent projection and the data itself. However, these methods usually come with scalability issues and susceptibility to overfitting. To overcome these limitations, instead of imposing a kernel function we propose an alternative approach: we combine probabilistic factor analysis with what we refer to as kernelized observations, in which the model reconstructs not the data itself but its correlation with other data points as measured by a kernel function. This model can combine several types of views (kernelized or not), can handle heterogeneous data, and works in semi-supervised settings. Additionally, by including adequate priors, it can provide compact solutions for the kernelized observations (based on an automatic selection of Bayesian support vectors) and can include feature selection capabilities. Using several public databases, we demonstrate the potential of our approach (and its extensions) relative to common multi-view learning models such as kernel canonical correlation analysis and manifold relevance determination Gaussian process latent variable models.
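As a rough illustration of the central idea, the sketch below replaces each datum by its row of kernel similarities and fits an ordinary linear factor-analysis model to those kernelized observations. The RBF kernel, its bandwidth, and the use of scikit-learn are our assumptions for the sketch, not the authors' implementation (which additionally uses sparsity-inducing priors and Bayesian support-vector selection).

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.decomposition import FactorAnalysis

# Toy data: n points in a d-dimensional view.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

# Kernelized observations: each datum is represented by its row of
# similarities to all other points, K[i, :] = k(x_i, x_1..n).
K = rbf_kernel(X, gamma=1.0 / X.shape[1])

# Fit a linear factor-analysis model to the kernel rows instead of X itself,
# so the latent projection explains correlations measured by the kernel.
fa = FactorAnalysis(n_components=5, random_state=0)
Z = fa.fit_transform(K)   # low-dimensional latent projection, one row per datum
print(Z.shape)            # (200, 5)
```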

Dense Message Passing for Sparse Principal Component Analysis

International Conference on Artificial Intelligence and Statistics, 2010

We describe a novel inference algorithm for sparse Bayesian PCA with a zero-norm prior on the model parameters. Bayesian inference is very challenging in probabilistic models of this type. MCMC procedures are too slow to be practical in a very high-dimensional setting and standard mean-field variational Bayes algorithms are ineffective. We adopt a dense message passing algorithm similar to algorithms developed in the statistical physics community and previously applied to inference problems in coding and sparse classification. The algorithm achieves near-optimal performance on synthetic data for which a statistical mechanics theory of optimal learning can be derived. We also study two gene expression datasets used in previous studies of sparse PCA. We find that our method performs better than one published algorithm and comparably to a second.

Posterior contraction in sparse Bayesian factor models for massive covariance matrices

The Annals of Statistics, 2014

Sparse Bayesian factor models are routinely implemented for parsimonious dependence modeling and dimensionality reduction in high-dimensional applications. We provide theoretical understanding of such Bayesian procedures in terms of posterior convergence rates in inferring high-dimensional covariance matrices where the dimension can be potentially larger than the sample size. Under relevant sparsity assumptions on the true covariance matrix, we show that commonly used point mass mixture priors on the factor loadings lead to consistent estimation in the operator norm even when p ≫ n. One of our major contributions is to develop a new class of continuous shrinkage priors and provide insights into their concentration around sparse vectors. Using such priors for the factor loadings, we obtain the same rate as obtained with point mass mixture priors. To obtain the convergence rates, we construct test functions to separate points in the space of high-dimensional covariance matrices using insights from random matrix theory; the tools developed may be of independent interest.

Sparse Bayesian infinite factor models

Biometrika, 2011

We focus on sparse modelling of high-dimensional covariance matrices using Bayesian latent factor models. We propose a multiplicative gamma process shrinkage prior on the factor loadings which allows introduction of infinitely many factors, with the loadings increasingly shrunk towards zero as the column index increases. We use our prior on a parameter-expanded loading matrix to avoid the order dependence typical in factor analysis models and develop an efficient Gibbs sampler that scales well as data dimensionality increases. The gain in efficiency is achieved by the joint conjugacy property of the proposed prior, which allows block updating of the loadings matrix. We propose an adaptive Gibbs sampler for automatically truncating the infinite loading matrix through selection of the number of important factors. Theoretical results are provided on the support of the prior and truncation approximation bounds. A fast algorithm is proposed to produce approximate Bayes estimates. Latent factor regression methods are developed for prediction and variable selection in applications with high-dimensional correlated predictors. Operating characteristics are assessed through simulation studies, and the approach is applied to predict survival times from gene expression data.
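For reference, the multiplicative gamma process shrinkage prior on the loadings \(\lambda_{jh}\) has, up to notation, the following general form, in which the column-specific precisions \(\tau_h\) increase stochastically with the column index \(h\) so that later columns are shrunk more strongly towards zero:

\[
\lambda_{jh} \mid \phi_{jh}, \tau_h \sim \mathcal{N}\!\left(0, \phi_{jh}^{-1}\tau_h^{-1}\right), \qquad
\phi_{jh} \sim \mathrm{Ga}\!\left(\tfrac{\nu}{2}, \tfrac{\nu}{2}\right), \qquad
\tau_h = \prod_{l=1}^{h} \delta_l, \qquad
\delta_1 \sim \mathrm{Ga}(a_1, 1), \quad \delta_l \sim \mathrm{Ga}(a_2, 1) \ \ (l \ge 2).
\]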

Sparse Bayesian Methods for Low-Rank Matrix Estimation

IEEE Transactions on Signal Processing, 2012

Recovery of low-rank matrices has recently seen significant activity in many areas of science and engineering, motivated by recent theoretical results for exact reconstruction guarantees and interesting practical applications. In this paper, we present novel recovery algorithms for estimating low-rank matrices in matrix completion and robust principal component analysis based on sparse Bayesian learning (SBL) principles. Starting from a matrix factorization formulation and enforcing the low-rank constraint in the estimates as a sparsity constraint, we develop an approach that is very effective in determining the correct rank while providing high recovery performance. We provide connections with existing methods in other similar problems and empirical results and comparisons with current state-of-the-art methods that illustrate the effectiveness of this approach.
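One common way to express the low-rank constraint as a sparsity constraint in this setting (our summary of the general construction, not a verbatim statement of the paper's model) is to factorize the matrix and tie the columns of both factors to shared precision hyperparameters:

\[
Y = A B^{\top} + E, \qquad
p(A \mid \boldsymbol{\gamma}) = \prod_{k=1}^{K} \mathcal{N}\!\left(a_k \mid 0, \gamma_k^{-1} I\right), \qquad
p(B \mid \boldsymbol{\gamma}) = \prod_{k=1}^{K} \mathcal{N}\!\left(b_k \mid 0, \gamma_k^{-1} I\right),
\]

so that evidence maximization over the \(\gamma_k\) drives unnecessary column pairs towards zero, and the surviving columns determine the estimated rank.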

Bayesian Sparse Factor Regression Trees

In this thesis, we focus on sparse principal component analysis (PCA) and nonlinear regression problems. We investigate several sparse PCA models and nonlinear regression techniques, and we explore the advantages of applying them sequentially and of training them as an integral unit.

First, we experiment with three sparse PCA models: optimal sparse PCA algorithms (OSPCA), Generalized Power algorithms (GP), and the doubly sparse PCA algorithm (DSPCA). All of the algorithms are compared using information loss and explained variance metrics, and we investigate their performance on both artificial and real data sets. OSPCA has the best control over sparsity. GP and DSPCA both perform well on the synthetic and real data sets. The sparse factors identified by DSPCA for the real data sets are the most interpretable.

Second, we report the results of experiments designed to test the performance of several nonlinear regression models (Bayesian additive regression trees (BART), random forests, neural networks, and Extreme Gradient Boosting) in different scenarios with artificial and real data sets. When the number of predictors is smaller than the number of data examples, no model consistently outperforms the others. However, when the data dimension increases, and especially when the number of predictors exceeds the number of data examples, the ensemble tree models, BART and random forests, are still able to handle the regression problem, whereas neural networks no longer provide a reasonable fit to the data because of the rapid increase in the number of model parameters and the lack of data.

Finally, we investigate whether the prediction task can benefit from first applying sparse PCA to the data to identify underlying sparse factor patterns and then applying the regression algorithms to the sparse representation of the data. We observe performance improvements for synthetic data. We also modified the inference algorithms of Bayesian DSPCA and BART so that the two models can be trained as an integral unit, allowing prediction performance to inform the sparse PCA algorithm and guide it towards better representations of the data.
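A minimal sketch of the sequential "sparse PCA, then nonlinear regression" pipeline investigated in the thesis, using scikit-learn's SparsePCA and a random forest as stand-ins for the thesis's DSPCA and BART models (which are not reproduced here); the data, dimensions, and hyperparameters are arbitrary:

```python
import numpy as np
from sklearn.decomposition import SparsePCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))
y = X[:, :3].sum(axis=1) + 0.1 * rng.normal(size=300)  # depends on a few features

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Step 1: learn sparse factor patterns (stand-in for OSPCA/GP/DSPCA).
spca = SparsePCA(n_components=5, alpha=1.0, random_state=0)
Z_tr = spca.fit_transform(X_tr)
Z_te = spca.transform(X_te)

# Step 2: nonlinear regression on the sparse representation
# (random forest here as a stand-in for BART).
reg = RandomForestRegressor(n_estimators=200, random_state=0)
reg.fit(Z_tr, y_tr)
print("held-out R^2:", reg.score(Z_te, y_te))
```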

Bayesian Inference in Sparse Gaussian Graphical Models

One of the fundamental tasks of science is to find explainable relationships between observed phenomena. One approach to this task that has received attention in recent years is based on probabilistic graphical modelling with sparsity constraints on model structures. In this paper, we describe two new approaches to Bayesian inference of sparse structures of Gaussian graphical models (GGMs). One is based on a simple modification of the cutting-edge block Gibbs sampler for sparse GGMs, which results in significant computational gains in high dimensions. The other method is based on a specific construction of the Hamiltonian Monte Carlo sampler, which results in further significant improvements. We compare our fully Bayesian approaches with the popular regularisation-based graphical LASSO, and demonstrate significant advantages of the Bayesian treatment under the same computing costs. We apply the methods to a broad range of simulated data sets, and a real-life financial data set.
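The regularisation-based baseline referred to above, the graphical LASSO, is widely available; a minimal example of fitting it to obtain a sparse precision-matrix estimate is sketched below. This illustrates only the comparison method, not the paper's block Gibbs or Hamiltonian Monte Carlo samplers, and the data and penalty value are arbitrary.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
# Simulated data from a Gaussian (identity covariance here, for brevity).
X = rng.normal(size=(500, 20))

gl = GraphicalLasso(alpha=0.05)   # alpha controls the sparsity of the estimate
gl.fit(X)
precision = gl.precision_         # estimated sparse inverse covariance matrix
print("nonzero off-diagonal entries:",
      int((np.abs(precision) > 1e-8).sum() - precision.shape[0]))
```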

Fast Bayesian Factor Analysis via Automatic Rotations to Sparsity

Journal of the American Statistical Association, 2016

We here provide the full development of the vanilla EM algorithm outlined in Section 3.1. We remind the reader that $B$ now denotes the truncated approximation $B_K$ for some pre-specified $K$, that $\theta = (\theta_{(1)}, \dots, \theta_{(K)})$, and that $\lambda_{0k} = \lambda_0$ for $k = 1, \dots, K$. Letting $\Delta = (B, \Sigma, \theta)$, the goal of the proposed algorithm is to find the parameter values $\widehat{\Delta}$ which are most likely (a posteriori) to have generated the data, i.e. $\widehat{\Delta} = \arg\max_{\Delta} \log \pi(\Delta \mid Y)$. This task would be trivial if we knew the hidden factors $\Omega = [\omega_1, \dots, \omega_n]$ and the latent allocation matrix $\Gamma$: in that case the estimates would be obtained as a unique solution to a series of penalized linear regressions. On the other hand, if $\Delta$ were known, then $\Gamma$ and $\Omega$ could be easily inferred. This "chicken-and-egg" problem can be resolved iteratively by alternating between two steps. Given $\Delta^{(m)}$ at the $m$-th iteration, the E-step computes expected sufficient statistics of the hidden/missing data $(\Gamma, \Omega)$. The M-step then finds the a-posteriori most likely $\Delta^{(m+1)}$, given the expected sufficient statistics. These two steps form the basis of a vanilla EM algorithm with guaranteed monotone convergence to at least a local posterior mode. More formally, the EM algorithm locates modes of $\pi(\Delta \mid Y)$ iteratively by maximizing the expected logarithm of the augmented posterior. Given an initialization $\Delta^{(0)}$, the $(m+1)$-st step of the algorithm outputs $\Delta^{(m+1)} = \arg\max_{\Delta} Q(\Delta)$, where
\[
Q(\Delta) = \mathbb{E}_{\Gamma, \Omega \mid Y, \Delta^{(m)}}\big[\log \pi(\Delta, \Gamma, \Omega \mid Y)\big], \tag{A.1}
\]
with $\mathbb{E}_{\Gamma, \Omega \mid Y, \Delta^{(m)}}(\cdot)$ denoting the conditional expectation given the observed data and the current parameter estimates at the $m$-th iteration. Note that we have parametrized our posterior in terms of the ordered inclusion probabilities $\theta$ rather than the breaking fractions $\nu$; these can be recovered using the stick-breaking relationship $\nu_k = \theta_{(k)}/\theta_{(k-1)}$. This parametrization yields a feasible M-step.
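The E-step/M-step alternation described above can be summarised by the generic skeleton below; e_step, m_step, and log_posterior are hypothetical placeholders for the model-specific computations (the expected sufficient statistics of $(\Gamma, \Omega)$ and the penalized-regression updates of $\Delta$), not the paper's code.

```python
def run_em(Y, delta_init, e_step, m_step, log_posterior,
           max_iter=500, tol=1e-6):
    """Generic EM skeleton: alternate expected sufficient statistics of the
    hidden variables (E-step) with a maximization over Delta (M-step)."""
    delta = delta_init
    prev = -float("inf")
    for _ in range(max_iter):
        stats = e_step(Y, delta)       # E[Gamma, Omega | Y, delta^(m)]
        delta = m_step(Y, stats)       # argmax_Delta Q(Delta)
        cur = log_posterior(Y, delta)  # log pi(delta | Y), monotone in m
        if cur - prev < tol:
            break
        prev = cur
    return delta
```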

Performance Evaluation of Latent Variable Models with Sparse Priors

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2007

A variety of Bayesian methods have recently been introduced for finding sparse representations from overcomplete dictionaries of candidate features. These methods often capitalize on latent structure inherent in sparse distributions to perform standard MAP estimation, variational Bayes, approximation using convex duality, or evidence maximization. Despite their reliance on sparsity-inducing priors, however, these approaches may or may not actually lead to sparse representations in practice, and so it is a challenging task to determine which algorithm and sparse prior are appropriate. Rather than justifying prior selections and modelling assumptions based on the credibility of the full Bayesian model, as is commonly done, this paper bases evaluations on the actual cost functions that emerge from each method. Two minimal conditions are postulated that ideally any sparse learning objective should satisfy. Out of all possible cost functions that can be obtained from the methods described above using (virtually) any sparse prior, a unique function is derived that satisfies these conditions; both sparse Bayesian learning (SBL) and basis pursuit (BP) are special cases. All methods are then shown to be performing MAP estimation using potentially non-factorable implicit priors, which suggests new sparse learning cost functions.

Index Terms: sparse representations, sparse priors, latent variable models, underdetermined inverse problems, Bayesian learning.
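As background for the cost-function view taken in the paper, MAP estimation in the standard sparse linear model leads to objectives of the following generic form (notation ours):

\[
y = \Phi x + \varepsilon, \qquad
\hat{x}_{\mathrm{MAP}} = \arg\min_{x} \; \|y - \Phi x\|_2^2 + \lambda \sum_{i} g(|x_i|),
\]

where the penalty \(g\) is induced by the sparse prior; \(g(|x_i|) = |x_i|\) recovers basis pursuit (the \(\ell_1\) penalty), while sparse Bayesian learning corresponds to a non-factorable implicit penalty obtained after marginalizing the hyperparameters.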