On the robustness of regularized pairwise learning methods based on kernels

Consistency and robustness of kernel-based regression in convex risk minimization

Bernoulli, 2007

We investigate statistical properties for a broad class of modern kernel-based regression (KBR) methods. These kernel methods were developed during the last decade and are inspired by convex risk minimization in infinite-dimensional Hilbert spaces. One leading example is support vector regression. We first describe the relationship between the loss function L of the KBR method and the tail of the response variable. We then establish the L-risk consistency for KBR, which gives the mathematical justification for the statement that these methods are able to "learn". Next, we consider robustness properties of such kernel methods. In particular, our results allow us to choose the loss function and the kernel to obtain computationally tractable and consistent KBR methods that have bounded influence functions. Furthermore, bounds for the bias and for the sensitivity curve, which is a finite sample version of the influence function, are developed, and the relationship between KBR and classical M-estimators is discussed.
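
For orientation, the KBR estimators studied here are minimizers of a regularized L-risk; in schematic notation (ours, not necessarily the paper's),

    f_{P,\lambda} = argmin_{f \in H}  E_{(X,Y) \sim P}[ L(Y, f(X)) ] + \lambda ||f||_H^2 ,

where H is the reproducing kernel Hilbert space of the kernel, L is the loss, and \lambda > 0 is the regularization parameter. The robustness results then concern the map P -> f_{P,\lambda}, e.g. the boundedness of its influence function.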

Total stability of kernel methods

Neurocomputing

Regularized empirical risk minimization using kernels and their corresponding reproducing kernel Hilbert spaces (RKHSs) plays an important role in machine learning. However, the kernel actually used often depends on one or a few hyperparameters, or is even data dependent in a much more complicated manner. Examples are Gaussian RBF kernels, kernel learning, and hierarchical Gaussian kernels, which were recently proposed for deep learning. Therefore, the kernel actually used is often computed by a grid search or in an iterative manner and can only be considered an approximation to the "ideal" or "optimal" kernel. The paper gives conditions under which classical kernel-based methods, based on a convex Lipschitz loss function and on a bounded and smooth kernel, are stable if the probability measure P, the regularization parameter λ, and the kernel k change slightly and simultaneously. Similar results are also given for pairwise learning. The topic of this paper is therefore somewhat more general than in classical robust statistics, where usually only the influence of small perturbations of the probability measure P on the estimated function is considered.
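
As a rough sketch of what total stability means here (simplified, our notation rather than the paper's exact statement), the results bound the change of the estimator under simultaneous perturbations of the triple (P, \lambda, k):

    || f_{P_1,\lambda_1,k_1} - f_{P_2,\lambda_2,k_2} ||_\infty  <=  c_1 d(P_1, P_2) + c_2 |\lambda_1 - \lambda_2| + c_3 || k_1 - k_2 ||_\infty ,

where d is a suitable metric on probability measures and the constants depend on the Lipschitz constant of the loss, the regularization parameters, and bounds on the kernels.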

Multi-kernel regularized classifiers

Journal of Complexity, 2007

A family of classification algorithms generated from Tikhonov regularization schemes is considered. They involve multi-kernel spaces and general convex loss functions. Our main purpose is to provide satisfactory estimates for the excess misclassification error of these multi-kernel regularized classifiers. The error analysis consists of two parts: regularization error and sample error. Allowing multi-kernels in the algorithm improves the regularization error and the approximation error, which is one advantage of the multi-kernel setting. For a general loss function, we show how to bound the regularization error by the approximation in some weighted L^q spaces. For the sample error, we use a projection operator. The projection, in connection with the decay of the regularization error, enables us to improve convergence rates in the literature even for one-kernel schemes and special loss functions: the least squares loss and the hinge loss for support vector machine soft margin classifiers. Existence of a solution to the optimization problem for the regularization scheme associated with multi-kernels is verified when the kernel functions are continuous with respect to the index set. Gaussian kernels with flexible variances and probability distributions satisfying some noise conditions are used to illustrate the general theory.
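
The two-part error analysis mentioned above follows the usual decomposition of the excess risk; schematically (our notation),

    E(f_z) - E(f_c)  <=  D(\lambda) + S(z, \lambda),    D(\lambda) = inf_{f \in H_K} { E(f) - E(f_c) + \lambda ||f||_K^2 },

where E is the risk of the chosen convex loss, f_c its minimizer, f_z the empirical regularized solution, D(\lambda) the regularization error, and S(z, \lambda) the sample error that is controlled with the help of the projection operator.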

On Convergence of Kernel Learning Estimators

SIAM Journal on Optimization, 2010

The paper studies convex stochastic optimization problems in a reproducing kernel Hilbert space (RKHS). The objective (risk) functional depends on functions from this RKHS and takes the form of a mathematical expectation (integral) of a nonnegative integrand (loss function) over a probability measure. The problem is generally ill-posed, a difficulty that in statistical learning is addressed through Tikhonov regularization combined with Monte Carlo approximation of the integrals, which also makes it possible to solve the problem by finite-dimensional (convex) quadratic optimization. The approximate solutions are referred to as kernel learning estimators and are expressed as a linear combination of kernels evaluated at the sample points. They are functional random variables that depend on the full sample. The paper studies probabilistic convergence of these approximate solutions under a gradual elimination of the regularization parameter as the number of observations grows. Its contribution is to derive novel nonasymptotic bounds on the minimization error and exponential bounds on the tail distribution of errors, and to establish novel sufficient conditions for uniform convergence of the kernel estimators to the true (normal) solution with probability one, together with a rule for the downward adjustment of the regularization factor with increasing sample size. Applications to least squares, median, and quantile regression estimation, as well as to binary classification, are discussed.
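
As a concrete, hedged illustration of the finite-dimensional quadratic problem mentioned above, the least squares case reduces to kernel ridge regression: by the representer theorem the estimator is a linear combination of kernels at the sample points, and the coefficients solve a linear system. The sketch below (function names and the Gaussian kernel choice are ours, not the paper's) uses NumPy only:

    import numpy as np

    def gaussian_kernel(X1, X2, sigma=1.0):
        # Gram matrix of the Gaussian RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)).
        d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    def kernel_ridge_fit(X, y, lam, sigma=1.0):
        # Least squares instance of the regularized problem:
        #   min_f (1/m) sum_i (y_i - f(x_i))^2 + lam * ||f||_H^2  over f in the RKHS.
        # By the representer theorem f(x) = sum_i alpha_i k(x_i, x), so alpha solves the
        # finite-dimensional linear system (K + m * lam * I) alpha = y.
        m = len(y)
        K = gaussian_kernel(X, X, sigma)
        return np.linalg.solve(K + m * lam * np.eye(m), y)

    def kernel_ridge_predict(X_train, alpha, X_new, sigma=1.0):
        # Evaluate the kernel learning estimator f(x) = sum_i alpha_i k(x_i, x) at new points.
        return gaussian_kernel(X_new, X_train, sigma) @ alpha

    # Toy usage with a decaying regularization lam ~ 1/sqrt(m) (an illustrative choice only).
    rng = np.random.default_rng(0)
    X = rng.uniform(-3.0, 3.0, size=(200, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
    alpha = kernel_ridge_fit(X, y, lam=1.0 / np.sqrt(len(y)), sigma=0.5)
    y_hat = kernel_ridge_predict(X, alpha, X, sigma=0.5)

The schedule lam = 1/sqrt(m) merely stands in for the paper's rule of decreasing the regularization factor with the sample size; the admissible rates are derived in the paper.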

On robustness properties of convex risk minimization methods for pattern recognition

Journal of Machine Learning Research, 2004

The paper brings together methods from two disciplines: machine learning theory and robust statistics. We argue that robustness is an important aspect, and we show that many existing machine learning methods based on the convex risk minimization principle have, besides other good properties, the advantage of being robust. Robustness properties of machine learning methods based on convex risk minimization are investigated for the problem of pattern recognition. Assumptions are given for the existence of the influence function of the classifiers and for bounds on the influence function. Kernel logistic regression, support vector machines, least squares, and the AdaBoost loss function are treated as special cases. Some results on the robustness of such methods are also obtained for the sensitivity curve and the maxbias, which are two other robustness criteria. A sensitivity analysis of the support vector machine is given.
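
For reference, the influence function investigated here is Hampel's, applied to the map T that sends a distribution P to the regularized risk minimizer:

    IF(z; T, P) = lim_{\varepsilon \downarrow 0} [ T((1 - \varepsilon) P + \varepsilon \delta_z) - T(P) ] / \varepsilon ,

where \delta_z is the Dirac measure at a point z = (x, y). A bounded influence function means that a single contaminating observation can change the resulting classifier only by a limited amount.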

Additive Regularization Trade-Off: Fusion of Training and Validation Levels in Kernel Methods

Machine Learning, 2006

This paper presents a convex optimization perspective on the task of tuning the regularization trade-off with validation and cross-validation criteria in the context of kernel machines. We focus on the problem of tuning the regularization trade-off in the context of Least Squares Support Vector Machines (LS-SVMs) for function approximation and classification. By adopting an additive regularization trade-off scheme, the task of tuning the regularization trade-off with respect to a validation or cross-validation criterion can be written as a convex optimization problem. The solution of this problem then contains both the optimal regularization constants with respect to the model selection criterion at hand and the corresponding training solution. We refer to such formulations as the fusion of training with model selection. The major tool to accomplish this task is found in the primal-dual derivations occurring in convex optimization theory. The paper advances the discussion by relating the additive regularization trade-off scheme to the classical Tikhonov scheme, and motivates the usefulness of the former. Furthermore, it is illustrated how to restrict the additive trade-off scheme to the solution path corresponding to a Tikhonov scheme while retaining convexity of the overall problem of fusion of model selection and training. We relate such a scheme to an ensemble learning problem and to the stability of learning machines. The approach is illustrated on a number of artificial and benchmark datasets, comparing the proposed method with the classical practice of tuning the Tikhonov scheme with a cross-validation measure.
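
For context, the classical Tikhonov-regularized LS-SVM referred to above solves, in the standard primal form,

    min_{w, b, e}  (1/2) w^T w + (\gamma/2) \sum_{i=1}^{N} e_i^2    subject to   y_i = w^T \varphi(x_i) + b + e_i,  i = 1, ..., N,

so the regularization trade-off enters through the single constant \gamma. In the additive scheme, described here only schematically, the trade-off is instead encoded in additive terms acting on the residuals, which makes the training solution depend affinely on those terms and thereby allows the fusion with validation-based model selection to remain convex.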

Convexity, Classification, and Risk Bounds

Journal of the American Statistical Association, 2006

Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. The convexity makes these algorithms computationally efficient. The use of a surrogate, however, has statistical consequences that must be balanced against the computational virtues of convexity. To study these issues, we provide a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial upper bounds on excess risk under the weakest possible condition on the loss function: that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a simple variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise, and we show that, in this case, strictly convex loss functions lead to faster rates of convergence of the risk than would be implied by standard uniform convergence arguments. Finally, we present applications of our results to the estimation of convergence rates in function classes that are scaled convex hulls of a finite-dimensional base class, with a variety of commonly used loss functions.
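
The quantitative relationship referred to above can be stated, schematically and in the paper's spirit, as

    \psi( R(f) - R^* )  <=  R_\phi(f) - R_\phi^* ,

where R is the 0-1 risk, R_\phi the risk under the surrogate loss \phi, and \psi a convex transform of \phi obtained by a pointwise variational computation; \psi is nontrivial (invertible near zero) exactly when \phi is classification-calibrated. For the hinge loss, for instance, \psi(\theta) = |\theta|, so the excess 0-1 risk is bounded by the excess hinge risk.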

Asymptotic efficiency of kernel support vector machines (SVM)

Cybernetics and Systems Analysis, 2009

The paper analyzes the asymptotic properties of Vapnik's SVM-estimates of a regression function as the size of the training sample tends to infinity. The estimation problem is considered as infinite-dimensional minimization of a regularized empirical risk functional in a reproducing kernel Hilbert space. The rate of convergence of the risk functional on SVM-estimates to its minimum value is established. Sufficient conditions for the uniform convergence of SVM-estimates to the true regression function with unit probability are given.

The present paper analyzes the asymptotic properties of SVM-estimators of an unknown dependence under an unlimited increase in the number of observations (training sample), as is done in mathematical statistics. The literature on statistical learning analyzes the convergence of estimates mainly with respect to a functional [1]. In the case of quadratic functionals, this also yields convergence in probability to a root-mean-square regression function in one norm or another. Note that the SVM usually employs nonquadratic and even nonsmooth quality functionals. The paper estimates the rate of mean convergence (proportional to 1/m^{1/4}, where m is the number of observations) of the values of an arbitrary convex quality functional of SVM-estimates to its theoretical minimum; such an estimate of the rate of convergence for the confidence bound of a quadratic risk functional is presented in Sec. 4. In the case of binary classification problems, the results assess the rate of convergence of the Bayesian risk (the probability of erroneous classification) to its theoretical minimum. Note that under the strong assumption that the components of the input random vector are independent, an unimprovable estimate of the rate of convergence, proportional to 1/m, has been obtained for the Bayesian classification method. An analysis of convergence with respect to a functional is justified for classification problems; however, it is insufficient for regression problems such as median and quantile regression. Therefore, the present paper provides sufficient conditions for the uniform convergence of SVM-estimators of regression to an unknown function with unit probability, namely, it establishes a rule for changing the regularization parameter in the SVM as the number of observations increases. Measures of the capacity of a class of models (such as the VC-dimension [1, 4, 5], which can be infinite in the case considered here) are not used; instead, the robustness of the SVM with respect to individual observations is taken into account, and theorems on the exponential concentration of the distribution of averaged random variables around their expectation are applied.

For iterative learning algorithms, convergence results are available in the literature. Those studies give an alternative approach to proving convergence, with unit probability, of the estimates obtained by empirical risk minimization, provided that the feasible domain is compact and the solution is unique. They also consider the cases of periodic, randomly distributed, and dependent observations. Note that a solution to stochastic programming and classification problems is usually not unique; the solution uniqueness assumption therefore means that a fixed (not vanishing with increasing number of observations) regularization of the problem is applied. Such a regularization results in asymptotically biased estimates. Moreover, it provides only weak compactness of the level sets of the objective functional. In contrast to those studies, we consider here the case of multiple solutions and use a gradually degenerating regularization. The results of this study have been partially presented in earlier work.

The paper is structured as follows. The first section briefly reviews results on reproducing kernel Hilbert spaces. The second section relates a quantile regression problem and binary classification to the minimization of convex risk functionals. The third section analyzes the convergence of Tikhonov's regularization method for minimizing integral risk functionals in reproducing kernel Hilbert spaces. The fourth section presents a computational scheme, and the fifth analyzes the convergence of the SVM. The conclusions sum up the main results.
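
Schematically (our notation, not the paper's exact conditions), the estimator and the regularization rule discussed above take the form

    f_m = argmin_{f \in H}  (1/m) \sum_{i=1}^{m} L(y_i, f(x_i)) + \lambda_m ||f||_H^2 ,    \lambda_m -> 0 as m -> \infty,

with \lambda_m decreasing slowly enough that the estimates still concentrate around the regularized population solution; the paper makes this trade-off explicit and derives from it the uniform convergence of f_m to the true regression function with probability one.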

Efficiency of classification methods based on empirical risk minimization

Cybernetics and Systems Analysis, 2009

A binary classification problem is reduced to the minimization of convex regularized empirical risk functionals in a reproducing kernel Hilbert space. The solution is sought in the form of a finite linear combination of kernel support functions (Vapnik's support vector machines). Estimates of the misclassification risk as a function of the training sample size and other model parameters are obtained.
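
The finite linear combination mentioned above is the representer-theorem form of the solution; in our notation,

    f_m(x) = \sum_{i=1}^{m} \alpha_i k(x_i, x) + b ,    classification by  sign(f_m(x)),

so the infinite-dimensional problem over the RKHS reduces to a finite-dimensional convex problem in the coefficients \alpha_1, ..., \alpha_m.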

Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods

Journal of Machine Learning Research, 2009

We introduce a new technique for the analysis of kernel-based regression problems. The basic tools are sampling inequalities which apply to all machine learning problems involving penalty terms induced by kernels related to Sobolev spaces. They lead to explicit deterministic results concerning the worst case behaviour of ε- and ν-SVRs. Using these, we show how to adjust regularization parameters to