Generalised kernel machines (original) (raw)

Kernel learning at the first level of inference

Neural Networks, 2014

Kernel learning methods, whether Bayesian or frequentist, typically involve multiple levels of inference, with the coefficients of the kernel expansion being determined at the first level and the kernel and regularisation parameters carefully tuned at the second level, a process known as model selection. Model selection for kernel machines is commonly performed via optimisation of a suitable model selection criterion, often based on cross-validation or theoretical performance bounds. However, if there are a large number of kernel parameters, as for instance in the case of automatic relevance determination (ARD), there is a substantial risk of over-fitting the model selection criterion, resulting in poor generalisation performance. In this paper we investigate the possibility of learning the kernel, for the Least-Squares Support Vector Machine (LS-SVM) classifier, at the first level of inference, i.e. parameter optimisation. The kernel parameters and the coefficients of the kernel expansion are jointly optimised at the first level of inference, minimising a training criterion with an additional regularisation term acting on the kernel parameters. The key advantage of this approach is that the values of only two regularisation parameters need be determined in model selection, substantially alleviating the problem of over-fitting the model selection criterion. The benefits of this approach are demonstrated using a suite of synthetic and real-world binary classification benchmark problems, where kernel learning at the first level of inference is shown to be statistically superior to the conventional approach, improves on our previous work (Cawley and Talbot, 2007) and is competitive with Multiple Kernel Learning approaches, but with reduced computational expense.

Efficient approximate leave-one-out cross-validation for kernel logistic regression

Machine Learning, 2008

Kernel logistic regression (KLR) is the kernel learning method best suited to binary pattern recognition problems where estimates of a-posteriori probability of class membership are required. Such problems occur frequently in practical applications, for instance because the operational prior class probabilities or equivalently the relative misclassification costs are variable or unknown at the time of training the model. The model parameters are given by the solution of a convex optimization problem, which may be found via an efficient iteratively re-weighted least squares (IRWLS) procedure. The generalization properties of a kernel logistic regression machine are however governed by a small number of hyper-parameters, the values of which must be determined during the process of model selection. In this paper, we propose a novel model selection strategy for KLR, based on a computationally efficient closed-form approximation of the leave-one-out cross-validation procedure. Results obtained on a variety of synthetic and real-world benchmark datasets are given, demonstrating that the proposed model selection procedure is competitive with a more conventional k-fold cross-validation based approach and also with Gaussian process (GP) classifiers implemented using the Laplace approximation and via the Expectation Propagation (EP) algorithm.

Efficient model selection for kernel logistic regression

2004

Abstract Kernel logistic regression models, like their linear counterparts, can be trained using the efficient iteratively reweighted least-squares (IRWLS) algorithm. This approach suggests an approximate leave-one-out cross-validation estimator based on an existing method for exact leave-one-out cross-validation of least-squares models.

Generalized Kernel Classification and Regression

We discuss kernel-based classification and regression in a general context, emphasizing the role of convex duality in the problem formulation. We give conditions for the existence of the dual problem, and derive general globally convergent classification and regression algo- rithms for solving the true (i.e. hard-margin or rigorous) dual problem without resorting to approximations. Kernel methods perform perform optimization in Hilbert space by means of a finite dimensional dual problem. The conditions for the formulation of the dual problem essentially determine what we can "do in feature space", i.e.which opti- mization problems can be solved involving vectors in Hilbert space. Thus convex analysis plays a major role in the theory of kernel methods. The primary pur- pose of this paper is to derive general algorithms for kernel-based classification and regression by considering the problem from the viewpoint of convex analysis as represented by (8). We give a summary of t...

A practical use of regularization for supervised learning with kernel methods

Pattern Recognition Letters, 2013

In several supervised learning applications, it happens that reconstruction methods have to be applied repeatedly before being able to achieve the final solution. In these situations, the availability of learning algorithms able to provide effective predictors in a very short time may lead to remarkable improvements in the overall computational requirement. In this paper we consider the kernel ridge regression problem and we look for solutions given by a linear combination of kernel functions plus a constant term. In particular, we show that the unknown coefficients of the linear combination and the constant term can be obtained very fastly by applying specific regularization algorithms directly to the linear system arising from the Empirical Risk Minimization problem. From the numerical experiments carried out on benchmark datasets, we observed that in some cases the same results achieved after hours of calculations can be obtained in few seconds, thus showing that these strategies are very well-suited for time-consuming applications.

Adaptive metric kernel regression

Neural Networks for Signal Processing VIII. Proceedings of the 1998 IEEE Signal Processing Society Workshop (Cat. No.98TH8378), 1998

Kernel smoothing is a widely used non-parametric pattern recognition technique. By nature, it su ers from the curse of dimensionality and is usually di cult to apply to high input dimensions. In this contribution, we propose an algorithm that adapts the input metric used in multivariate regression by minimising a cross-validation estimate of the generalisation error. This allows to automatically adjust the importance of di erent dimensions. The improvement in terms of modelling performance is illustrated on a variable selection task where the adaptive metric kernel clearly outperforms the standard approach. Finally, we benchmark the method using the DELVE environment.

The Feature Selection Path in Kernel Methods

Journal of Machine Learning Research, 2010

The problem of automatic feature selection/weighting in kernel methods is examined. We work on a formulation that optimizes both the weights of features and the parameters of the kernel model simultaneously, using L 1 regularization for feature selection. Under quite general choices of kernels, we prove that there exists a unique regularization path for this problem, that runs from 0 to a stationary point of the non-regularized problem. We propose an ODE-based homotopy method to follow this trajectory. By following the path, our algorithm is able to automatically discard irrelevant features and to automatically go back and forth to avoid local optima. Experiments on synthetic and real datasets show that the method achieves low prediction error and is efficient in separating relevant from irrelevant features.

Learning to Predict the Leave-One-Out Error of Kernel Based Classifiers

We propose an algorithm to predict the leave-one-out (LOO) error for kernel based classifiers. To achieve this goal with computational efficiency, we cast the LOO error approximation task into a classification problem. This means that we need to learn a classification of whether or not a given training sample - if left out of the data set - would be misclassified. For this learning task, simple data dependent features are proposed, inspired by geometrical intuition. Our approach allows to reliably select a good model as demonstrated in simulations on Support Vector and Linear Programming Machines. Comparisons to existing learning theoretical bounds, e.g. the span bound, are given for various model selection scenarios.

Inductive regularized learning of kernel functions

2010

In this paper we consider the problem of semi-supervised kernel function learning. We first propose a general regularized framework for learning a kernel matrix, and then demonstrate an equivalence between our proposed kernel matrix learning framework and a general linear transformation learning problem. Our result shows that the learned kernel matrices parameterize a linear transformation kernel function and can be applied inductively to new data points. Furthermore, our result gives a constructive method for kernelizing most existing Mahalanobis metric learning formulations. To make our results practical for large-scale data, we modify our framework to limit the number of parameters in the optimization process. We also consider the problem of kernelized inductive dimensionality reduction in the semi-supervised setting. To this end, we introduce a novel method for this problem by considering a special case of our general kernel learning framework where we select the trace norm function as the regularizer. We empirically demonstrate that our framework learns useful kernel functions, improving the k-NN classification accuracy significantly in a variety of domains. Furthermore, our kernelized dimensionality reduction technique significantly reduces the dimensionality of the feature space while achieving competitive classification accuracies.

A Mathematical Programming Approach to the Kernel Fisher Algorithm

2000

We investigate a new kernel-based classifier: the Kernel Fisher Discriminant (KFD). A mathematical programming formulation based on the observation that KFD maximizes the average margin permits an interesting modification of the original KFD algorithm yielding the sparse KFD. We find that both, KFD and the proposed sparse KFD, can be understood in an unifying probabilistic context. Furthermore, we show connections to Support Vector Machines and Relevance Vector Machines. From this understanding, we are able to outline an interesting kernel-regression technique based upon the KFD algorithm. Simulations support the usefulness of our approach.