Sparse Bayesian learning and the relevance multi-layer perceptron network

Sparse Bayesian modeling with adaptive kernel learning

IEEE Transactions on Neural Networks, 2009

Sparse kernel methods are very efficient in solving regression and classification problems. The sparsity and performance of these methods depend on selecting an appropriate kernel function, which is typically achieved using a cross-validation procedure. In this paper, we propose an incremental method for supervised learning, which is similar to the relevance vector machine (RVM) but also learns the parameters of the kernels during model training. Specifically, we learn different parameter values for each kernel, resulting in a very flexible model. In order to avoid overfitting, we use a sparsity enforcing prior that controls the effective number of parameters of the model. We present experimental results on artificial data to demonstrate the advantages of the proposed method and we provide a comparison with the typical RVM on several commonly used regression and classification data sets.

Sparse Kernel Learning and the Relevance Units Machine

Lecture Notes in Computer Science, 2009

The relevance vector machine (RVM) is a state-of-the-art technique for constructing sparse kernel regression models. It not only generates a much sparser model but also provides better generalization performance than the standard support vector machine (SVM). In the RVM and the SVM, relevance vectors (RVs) and support vectors (SVs) are both selected from the input vector set. This may limit model flexibility. In this paper we propose a new sparse kernel model called the Relevance Units Machine (RUM). RUM follows the idea of RVM under the Bayesian framework but releases the constraint that RVs have to be selected from the input vectors. RUM treats relevance units as part of the parameters of the model. As a result, a RUM maintains all the advantages of RVM and offers superior sparsity. The new algorithm is demonstrated to possess considerable computational advantages over well-known state-of-the-art algorithms.

Analysis of Sparse Bayesian Learning

Neural Information Processing Systems, 2001

The recent introduction of the "relevance vector machine" has effectively demonstrated how sparsity may be obtained in generalised linear models within a Bayesian framework. Using a particular form of Gaussian parameter prior, "learning" is the maximisation, with respect to hyperparameters, of the marginal likelihood of the data. This paper studies the properties of that objective function, and demonstrates that, conditioned on an individual hyperparameter, the marginal likelihood has a unique maximum which is computable in closed form.
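For orientation, the closed-form characterisation that this kind of analysis yields can be stated with the usual "sparsity" and "quality" factors of the RVM literature. The notation below (with C_{-i} the marginal-likelihood covariance with basis function phi_i removed and t the target vector) is the standard one, assumed here rather than copied from the paper:

```latex
s_i = \boldsymbol{\phi}_i^{\top} \mathbf{C}_{-i}^{-1} \boldsymbol{\phi}_i,
\qquad
q_i = \boldsymbol{\phi}_i^{\top} \mathbf{C}_{-i}^{-1} \mathbf{t},
\qquad
\alpha_i =
\begin{cases}
\dfrac{s_i^{2}}{q_i^{2} - s_i}, & q_i^{2} > s_i,\\[1.2ex]
\infty \ (\text{basis function pruned}), & \text{otherwise.}
\end{cases}
```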

Adaptive sparseness for supervised learning

2003

The goal of supervised learning is to infer a functional mapping based on a set of training examples. To achieve good generalization, it is necessary to control the "complexity" of the learned function. In Bayesian approaches, this is done by adopting a prior for the parameters of the function being learned. We propose a Bayesian approach to supervised learning, which leads to sparse solutions; that is, in which irrelevant parameters are automatically set exactly to zero. Other ways to obtain sparse classifiers (such as Laplacian priors, support vector machines) involve (hyper)parameters which control the degree of sparseness of the resulting classifiers; these parameters have to be somehow adjusted/estimated from the training data. In contrast, our approach does not involve any (hyper)parameters to be adjusted or estimated. This is achieved by a hierarchical-Bayes interpretation of the Laplacian prior, which is then modified by the adoption of a Jeffreys' noninformative hyperprior. Implementation is carried out by an expectation-maximization (EM) algorithm. Experiments with several benchmark data sets show that the proposed approach yields state-of-the-art performance. In particular, our method outperforms SVMs and performs competitively with the best alternative techniques, although it involves no tuning or adjustment of sparseness-controlling hyperparameters.
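As a rough illustration of the kind of EM reweighting such a Jeffreys hyperprior induces, here is a minimal sketch for the linear regression case, assuming a known noise variance. The function name, initialisation, and stopping rule are mine, not the paper's code; the key point is that the M-step ridge solve is rewritten with V = diag(|w_i|) so that weights can shrink exactly to zero.

```python
import numpy as np

def jeffreys_sparse_regression(X, y, sigma2=1.0, n_iter=100, tol=1e-8):
    """EM-style iterative reweighting induced by a Jeffreys hyperprior (sketch).

    X: (n, d) design matrix, y: (n,) targets, sigma2: assumed-known noise variance.
    The E-step yields prior precisions 1/w_i^2; the M-step is the equivalent,
    numerically safe form w <- V (sigma2 I + V X'X V)^{-1} V X'y, V = diag(|w_i|).
    """
    n, d = X.shape
    w = np.linalg.lstsq(X, y, rcond=None)[0]   # initialise with least squares
    for _ in range(n_iter):
        V = np.diag(np.abs(w))
        w_new = V @ np.linalg.solve(sigma2 * np.eye(d) + V @ X.T @ X @ V,
                                    V @ X.T @ y)
        if np.max(np.abs(w_new - w)) < tol:
            w = w_new
            break
        w = w_new
    return w
```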

Bayesian Learning of Sparse Classifiers

2001

Bayesian approaches to supervised learning use priors on the classifier parameters. However, few priors aim at achieving "sparse" classifiers, where irrelevant/redundant parameters are automatically set to zero. Two well-known ways of obtaining sparse classifiers are: using a zero-mean Laplacian prior on the parameters, and the "support vector machine" (SVM). Whether one uses a Laplacian prior or an SVM, one still needs to specify/estimate the parameters that control the degree of sparseness of the resulting classifiers.

Multi-Class Sparse Bayesian Regression for Neuroimaging Data Analysis

Lecture Notes in Computer Science, 2010

The use of machine learning tools is gaining popularity in neuroimaging, as it provides a sensitive assessment of the information conveyed by brain images. In particular, finding regions of the brain whose functional signal reliably predicts some behavioral information makes it possible to better understand how this information is encoded or processed in the brain. However, such a prediction is performed through regression or classification algorithms that suffer from the curse of dimensionality, because a huge number of features (i.e. voxels) are available to fit some target, with very few samples (i.e. scans) to learn the informative regions. A commonly used solution is to regularize the weights of the parametric prediction function. However, model specification needs a careful design to balance adaptiveness and sparsity. In this paper, we introduce a novel method, Multi-Class Sparse Bayesian Regression (MCBR), that generalizes classical approaches such as Ridge regression and Automatic Relevance Determination. Our approach is based on a grouping of the features into several classes, where each class is regularized with specific parameters. We apply our algorithm to the prediction of a behavioral variable from brain activation images. The method presented here achieves prediction accuracies similar to those of reference methods, and yields more interpretable feature loadings.
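To make the idea of class-wise regularization concrete, the sketch below fits a Bayesian linear model in which every feature group shares one prior precision, updated with generic EM/ARD-style rules. This is an assumption-laden approximation of the setting described, not the authors' MCBR algorithm; the function and variable names are mine.

```python
import numpy as np

def grouped_ard_regression(X, y, groups, n_iter=100, tol=1e-6):
    """Bayesian linear regression with one shared prior precision per feature group.

    X: (n, d) design matrix, y: (n,) targets,
    groups: length-d integer array assigning each feature to a class.
    """
    n, d = X.shape
    group_ids = np.unique(groups)
    lam = np.ones(len(group_ids))     # per-group prior precision
    beta = 1.0                        # noise precision
    for _ in range(n_iter):
        A = lam[np.searchsorted(group_ids, groups)]           # per-feature precision
        Sigma = np.linalg.inv(beta * X.T @ X + np.diag(A))    # posterior covariance
        mu = beta * Sigma @ X.T @ y                           # posterior mean
        lam_old = lam.copy()
        for k, g in enumerate(group_ids):
            idx = groups == g
            lam[k] = idx.sum() / (mu[idx] @ mu[idx] + np.trace(Sigma[np.ix_(idx, idx)]))
        resid = y - X @ mu
        beta = n / (resid @ resid + np.trace(X @ Sigma @ X.T))
        if np.max(np.abs(lam - lam_old)) < tol:
            break
    return mu, lam, beta
```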

A sparse Bayesian approach for joint feature selection and classifier learning

2008

In this paper we present a new method for Joint Feature Selection and Classifier Learning (JFSCL) using a sparse Bayesian approach. These tasks are performed by optimizing a global loss function that includes a term associated with the empirical loss and another one representing a feature selection and regularization constraint on the parameters. To minimize this function we use a recently proposed technique, the Boosted Lasso algorithm, that follows the regularization path of the empirical risk associated with our loss function. We develop the algorithm for a well-known nonparametric classification method, the Relevance Vector Machine (RVM), and perform experiments using a synthetic data set and three databases from the UCI Machine Learning Repository. The results show that our method is able to select the relevant features, increasing in some cases the classification accuracy when feature selection is performed.

A prior for consistent estimation for the relevance vector machine

2004

The Relevance Vector Machine (RVM) provides an empirical Bayes treatment of function approximation by kernel basis expansion. In its original form, the RVM achieves a sparse representation of the approximating function by structuring a Gaussian prior distribution in a way that implicitly puts a sparsity pressure on the coefficients appearing in the expansion. RVM aims at retaining the tractability of the Gaussian prior while simultaneously achieving the assumed (and desired) sparse representation. This is achieved by specifying independent Gaussian priors for each of the coefficients. In the introductory paper, it is shown that for such a prior structure, the use of independent Gamma hyperpriors yields a product of independent Student-t marginal priors for the coefficients, thereby achieving the desired sparsity. However, such a prior structure gives complete freedom to the coefficients, making it impossible to isolate a unique solution to the function estimation task.
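The Gaussian-Gamma marginalisation referred to here is, written out explicitly (a standard result, using generic Gamma hyperparameters a and b rather than any values from this paper):

```latex
p(w_i) = \int_{0}^{\infty} \mathcal{N}\!\left(w_i \mid 0, \alpha_i^{-1}\right)
         \operatorname{Gamma}(\alpha_i \mid a, b)\, d\alpha_i
       = \frac{b^{a}\,\Gamma\!\left(a + \tfrac{1}{2}\right)}
              {(2\pi)^{1/2}\,\Gamma(a)}
         \left(b + \frac{w_i^{2}}{2}\right)^{-(a + 1/2)},
```

i.e. a (non-standardised) Student-t density, whose sharp peak at zero and heavy tails provide the sparsity pressure described above.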

Adaptive multi-class Bayesian sparse regression-An application to brain activity classification

2009

In this article we describe a novel method for regularized regression and apply it to the prediction of a behavioural variable from brain activation images. In the context of neuroimaging, regression or classification techniques are often plagued with the curse of dimensionality, due to the extremely high number of voxels and the limited number of activation maps. A commonly-used solution is the regularization of the weights used in the parametric prediction function. It entails the difficult issue of introducing an adapted amount of regularization in the model; this question can be addressed in a Bayesian framework, but model specification needs a careful design to balance adaptiveness and sparsity. Thus, we introduce an adaptive multi-class regularization to deal with this cluster-based structure of the data. Based on a hierarchical model and estimated in a Variational Bayes framework, our algorithm is robust to overfitting and more adaptive than other regularization methods. Results on simulated data and preliminary results on real data show the accuracy of the method in the context of brain activation images.

Relevance vector machines for sparse learning of biophysical parameters

Image and Signal Processing for Remote Sensing XI, 2005

In this communication, we evaluate the performance of the relevance vector machine (RVM) (Ref. 1) for the estimation of biophysical parameters from remote sensing images. For illustration purposes, we focus on the estimation of chlorophyll concentrations from multispectral imagery, whose measurements are subject to high levels of uncertainty, both regarding the difficulties in ground-truth data acquisition, and when comparing in situ measurements against satellite-derived data. Moreover, acquired data are commonly affected by noise in the acquisition phase, and time mismatch between the acquired image and the recorded measurements, which is critical for instance for coastal water monitoring.

The Bayesian Backfitting Relevance Vector Machine

2004

Traditional non-parametric statistical learning techniques are often computationally attractive, but lack the same generalization and model selection abilities as state-of-the-art Bayesian algorithms which, however, are usually computationally prohibitive. This paper makes several important contributions that allow Bayesian learning to scale to more complex, real-world learning scenarios.

A tutorial on relevance vector machines for regression and classification with applications

University of Ioannina, …, 2006

Relevance vector machines (RVM) have recently attracted much interest in the research community because they provide a number of advantages. They are based on a Bayesian formulation of a linear model with an appropriate prior that results in a sparse representation. As a consequence, they can generalize well and provide inferences at low computational cost. In this tutorial we first present the basic theory of RVM for regression and classification, followed by two examples illustrating the application of RVM for object detection and classification. The first example is target detection in images and RVM is used in a regression context. The second example is detection and classification of microcalcifications from mammograms and RVM is used in a classification framework. Both examples illustrate the application of the RVM methodology and demonstrate its advantages.
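As a concrete reference point for the regression case covered in such tutorials, the following is a minimal sketch of the standard RVM training loop (type-II maximum-likelihood re-estimation of the per-weight precisions and the noise precision). The function name, pruning threshold, and numerical guards are mine; this is not code from the tutorial.

```python
import numpy as np

def rvm_regression(Phi, t, n_iter=500, prune_at=1e9):
    """Bare-bones RVM regression via type-II ML re-estimation (sketch).

    Phi: (N, M) design matrix of basis/kernel responses, t: (N,) targets.
    """
    N, M = Phi.shape
    alpha = np.ones(M)                  # per-weight prior precisions
    beta = 1.0 / (np.var(t) + 1e-12)    # noise precision
    keep = np.arange(M)                 # surviving (relevance) basis indices
    for _ in range(n_iter):
        P = Phi[:, keep]
        Sigma = np.linalg.inv(beta * P.T @ P + np.diag(alpha[keep]))
        mu = beta * Sigma @ P.T @ t
        gamma = 1.0 - alpha[keep] * np.diag(Sigma)     # well-determinedness factors
        alpha[keep] = gamma / (mu ** 2 + 1e-12)        # re-estimate precisions
        resid = t - P @ mu
        beta = max(N - gamma.sum(), 1e-3) / (resid @ resid + 1e-12)
        keep = keep[alpha[keep] < prune_at]            # prune large-alpha bases
    P = Phi[:, keep]
    Sigma = np.linalg.inv(beta * P.T @ P + np.diag(alpha[keep]))
    mu = beta * Sigma @ P.T @ t
    return keep, mu, Sigma, beta
```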

Sparse Logistic Regression: Comparison of Regularization and Bayesian Implementations

Algorithms, 2020

In knowledge-based systems, besides obtaining good output prediction accuracy, it is crucial to understand the subset of input variables that have most influence on the output, with the goal of gaining deeper insight into the underlying process. These requirements call for logistic model estimation techniques that provide a sparse solution, i.e., where coefficients associated with non-important variables are set to zero. In this work we compare the performance of two methods: the first one is based on the well known Least Absolute Shrinkage and Selection Operator (LASSO) which involves regularization with an l1 norm; the second one is the Relevance Vector Machine (RVM) which is based on a Bayesian implementation of the linear logistic model. The two methods are extensively compared in this paper, on real and simulated datasets. Results show that, in general, the two approaches are comparable in terms of prediction performance. RVM outperforms the LASSO both in terms of structure rec...
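For the LASSO side of such a comparison, an l1-penalised logistic model can be fitted directly with scikit-learn, as in the small illustration below (the synthetic dataset and the choice of C are made up for the example, not taken from the paper):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 50 features, only 5 of which are informative.
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

# l1-penalised (LASSO-style) logistic regression.
lasso_logit = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso_logit.fit(X, y)

selected = np.flatnonzero(lasso_logit.coef_.ravel())   # non-zero coefficients
print(f"{selected.size} of {X.shape[1]} features retained:", selected)
```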

Sparse kernel learning with LASSO and Bayesian inference algorithm

Neural Networks, 2010

Kernelized LASSO (Least Absolute Shrinkage and Selection Operator) has been investigated in two separate recent papers. This paper is concerned with learning kernels under the LASSO formulation by adopting a generative Bayesian learning and inference approach. A new robust learning algorithm is proposed which produces a sparse kernel model with the capability of learning regularized parameters and kernel hyperparameters. A comparison with state-of-the-art methods for constructing sparse regression models such as the relevance vector machine (RVM) and the local regularization assisted orthogonal least squares regression (LROLS) is given. The new algorithm is also demonstrated to possess considerable computational advantages.

Combined modeling of sparse and dense noise for improvement of Relevance Vector Machine

Using a Bayesian approach, we consider the problem of recovering sparse signals under additive sparse and dense noise. Typically, sparse noise models outliers, impulse bursts or data loss. To handle sparse noise, existing methods simultaneously estimate the sparse signal of interest and the sparse noise of no interest. For estimating the sparse signal, without the need of estimating the sparse noise, we construct a robust Relevance Vector Machine (RVM). In the RVM, sparse noise and ever-present dense noise are treated through a combined noise model. The precision of combined noise is modeled by a diagonal matrix. We show that the new RVM update equations correspond to a non-symmetric sparsity inducing cost function. Further, the combined modeling is found to be computationally more efficient. We also extend the method to block-sparse signals and noise with known and unknown block structures. Through simulations, we show the performance and computational efficiency of the new RVM in several applications: recovery of sparse and block sparse signals, housing price prediction and image denoising.

Sparse regression mixture modeling with the multi-kernel relevance vector machine

Knowledge and Information Systems, 2013

A regression mixture model is proposed where each mixture component is a multi-kernel version of the Relevance Vector Machine (RVM). This mixture model exploits the enhanced modeling capability of RVMs, due to their embedded sparsity enforcing properties. In order to deal with the selection problem of kernel parameters, a weighted multi-kernel scheme is employed, where the weights are estimated during training. The mixture model is trained using the maximum a posteriori (MAP) approach, where the Expectation Maximization (EM) algorithm is applied offering closed form update equations for the model parameters. Moreover, an incremental learning methodology is also presented that tackles the parameter initialization problem of the EM algorithm along with a BIC-based model selection methodology to estimate the proper number of mixture components. We provide comparative experimental results using various artificial and real benchmark datasets that empirically illustrate the efficiency of the proposed mixture model.
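The weighted multi-kernel representation underlying each mixture component can be sketched as a design matrix built from a weighted sum of RBF kernels with different widths. In the paper the kernel weights are estimated during training; in this illustrative snippet they are simply inputs, and all names are mine:

```python
import numpy as np

def weighted_multikernel_design(X, centers, widths, weights):
    """Composite design matrix Phi = sum_m weights[m] * K_m(X, centers).

    Each K_m is an RBF kernel with its own width; this sketches only the
    representation, not the MAP/EM training described in the paper.
    """
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    Phi = np.zeros((X.shape[0], centers.shape[0]))
    for w_m, h_m in zip(weights, widths):
        Phi += w_m * np.exp(-sq_dists / (2.0 * h_m ** 2))
    return Phi
```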

Smooth relevance vector machine: a smoothness prior extension of the RVM

Machine Learning, 2007

Enforcing sparsity constraints has been shown to be an effective and efficient way to obtain state-of-the-art results in regression and classification tasks. Unlike the support vector machine (SVM), the relevance vector machine (RVM) explicitly encodes the criterion of model sparsity as a prior over the model weights. However the lack of an explicit prior structure over the weight variances means that the degree of sparsity is to a large extent controlled by the choice of kernel (and kernel parameters). This can lead to severe overfitting or oversmoothing, possibly even both at the same time (e.g. for the multiscale Doppler data). We detail an efficient scheme to control sparsity in Bayesian regression by incorporating a flexible noise-dependent smoothness prior into the RVM. We present an empirical evaluation of the effects of choice of prior structure on a selection of popular data sets and elucidate the link between Bayesian wavelet shrinkage and RVM regression. Our model encompasses the original RVM as a special case, but our empirical results show that we can surpass RVM performance in terms of goodness of fit and achieved sparsity as well as computational performance in many cases. The code is freely available.

Sparse multinomial logistic regression via Bayesian L1 regularisation

Advances in Neural Information Processing Systems, 2007

Multinomial logistic regression provides the standard penalised maximum-likelihood solution to multi-class pattern recognition problems. More recently, the development of sparse multinomial logistic regression models has found application in text processing and microarray classification, where explicit identification of the most informative features is of value. In this paper, we propose a sparse multinomial logistic regression method, in which the sparsity arises from the use of a Laplace prior, but where the usual regularisation parameter is integrated out analytically. Evaluation over a range of benchmark datasets reveals this approach results in similar generalisation performance to that obtained using cross-validation, but at greatly reduced computational expense.

Learning linear Bayes networks with sparse Bayesian models

2009

E. coli transcription factor network: gene expression levels from 100 genes taken at 5, 15, 30 and 60 min, and every hour until 6 hours after the transition from glucose to acetate (100 x 10). The objective is to find the underlying transcription-factor driving signal, with or without ground-truth regulatory networks (RegulonDB).