Kernels on Sample Sets via Nonparametric Divergence Estimates

Nonparametric Divergence Estimation with Applications to Machine Learning on Distributions

www-cgi.cs.cmu.edu

Low-dimensional embedding, manifold learning, clustering, classification, and anomaly detection are among the most important problems in machine learning. Existing methods usually consider the case where each instance has a fixed, finite-dimensional feature representation. Here we consider a different setting. We assume that each instance corresponds to a continuous probability distribution. These distributions are unknown, but we are given some i.i.d. samples from each distribution. Our goal is to estimate the distances between these distributions and use these distances to perform low-dimensional embedding, clustering/classification, or anomaly detection for the distributions. We present estimation algorithms, describe how to apply them to machine learning tasks on distributions, and show empirical results on synthetic data, real-world images, and astronomical data sets.
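To make the setting concrete, the following is a minimal sketch, not the authors' implementation, of one nonparametric divergence estimate of the kind discussed here: a k-nearest-neighbour estimate of the Kullback-Leibler divergence between two sample sets. The function name, the default k=3, and the toy data are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_kl_divergence(x, y, k=3):
    """Estimate D(p || q) from samples x ~ p (n x d) and y ~ q (m x d).

    Assumes continuous data with no duplicate points, so all distances are > 0.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n, d = x.shape
    m = y.shape[0]
    # k-th neighbour distance within x, excluding the point itself (hence k + 1),
    # and k-th neighbour distance from each x_i to the sample y.
    rho = cKDTree(x).query(x, k=[k + 1])[0][:, 0]
    nu = cKDTree(y).query(x, k=[k])[0][:, 0]
    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1))

# Usage: two Gaussian samples with shifted means.
rng = np.random.default_rng(0)
p_sample = rng.normal(0.0, 1.0, size=(500, 2))
q_sample = rng.normal(1.0, 1.0, size=(600, 2))
print(knn_kl_divergence(p_sample, q_sample))
```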

Nonparametric kernel estimators for image classification

2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012

We introduce a new discriminative learning method for image classification. We assume that each image is represented by an unordered, finite set of multi-dimensional feature vectors, and that these sets may differ in cardinality. This allows us to use consistent nonparametric divergence estimators to define new kernels over these sets, and then apply them in kernel classifiers. Our numerical results demonstrate that in many cases this approach can outperform state-of-the-art competitors on both simulated and challenging real-world datasets.
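A hedged sketch of the pipeline this abstract describes: estimate pairwise divergences between the feature-vector sets, turn them into a kernel matrix, and train a standard kernel classifier on it. The exponential map exp(-gamma * D), the stand-in divergence, and the function names are illustrative choices, not necessarily the paper's.

```python
import numpy as np
from sklearn.svm import SVC

def toy_divergence(a, b):
    # Stand-in for a consistent nonparametric divergence estimator
    # (e.g. a kNN-based one); here just the squared distance between sample means.
    return float(np.sum((a.mean(axis=0) - b.mean(axis=0)) ** 2))

def divergence_kernel(sets, divergence, gamma=1.0):
    """K[i, j] = exp(-gamma * symmetrised divergence between set i and set j)."""
    n = len(sets)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = 0.5 * (divergence(sets[i], sets[j])
                                       + divergence(sets[j], sets[i]))
    # In practice the resulting matrix may need a projection onto the PSD cone
    # before use in a kernel machine (see the Support Distribution Machines entry).
    return np.exp(-gamma * d)

# Usage: four "images", each an 80 x 5 set of feature vectors, two per class.
rng = np.random.default_rng(1)
sets = [rng.normal(loc=c, size=(80, 5)) for c in (0.0, 0.0, 1.0, 1.0)]
labels = [0, 0, 1, 1]
K = divergence_kernel(sets, toy_divergence)
clf = SVC(kernel="precomputed").fit(K, labels)
print(clf.predict(K))
```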

Nonparametric Divergence Estimation and its Applications to Machine Learning

2011

Low-dimensional embedding, manifold learning, clustering, classification, and anomaly detection are among the most important problems in machine learning. Here we consider the setting where each input instance corresponds to a continuous probability distribution. These distributions are unknown to us, but we are given some i.i.d. samples from each of them. While most existing machine learning methods operate on points, i.e. finite-dimensional feature vectors, in our setting we study algorithms that operate on groups, i.e. sets of feature vectors. For this purpose, we propose new nonparametric, consistent estimators for a large family of divergences and describe how to apply them to machine learning problems. As important special cases, the estimators can be used to estimate the Rényi, Tsallis, and Kullback-Leibler divergences, the Hellinger and Bhattacharyya distances, the L2 divergence, and mutual information. We present empirical results on synthetic data, real-world images, and astronomical data sets.
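As a brief illustration of the "machine learning on distributions" side of this work, the sketch below embeds a collection of sample sets in two dimensions from their pairwise symmetrised divergences. The placeholder divergence and the use of metric MDS are assumptions for the example, not the thesis's algorithms.

```python
import numpy as np
from sklearn.manifold import MDS

def symmetric_divergence(a, b):
    # Placeholder for a consistent estimator of any of the divergences above,
    # symmetrised so it can serve as a dissimilarity.
    return float(np.sum((a.mean(axis=0) - b.mean(axis=0)) ** 2))

# Each instance is a sample set drawn from its own (unknown) distribution.
rng = np.random.default_rng(2)
sample_sets = [rng.normal(loc=mu, scale=1.0, size=(100, 3))
               for mu in np.linspace(0.0, 4.0, 8)]

n = len(sample_sets)
dis = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dis[i, j] = dis[j, i] = symmetric_divergence(sample_sets[i], sample_sets[j])

# Low-dimensional embedding of the distributions from the pairwise divergences.
embedding = MDS(n_components=2, dissimilarity="precomputed",
                random_state=0).fit_transform(dis)
print(embedding.shape)  # (8, 2)
```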

Kernel Mean Embedding of Distributions: A Review and Beyond

Foundations and Trends® in Machine Learning, 2017

A Hilbert space embedding of a distribution (in short, a kernel mean embedding) has recently emerged as a powerful tool for machine learning and inference. The basic idea behind this framework is to map distributions into a reproducing kernel Hilbert space (RKHS) in which the whole arsenal of kernel methods can be extended to probability measures. It can be viewed as a generalization of the original "feature map" common to support vector machines (SVMs) and other kernel methods. While initially closely associated with the latter, it has meanwhile found application in fields ranging from kernel machines and probabilistic modeling to statistical inference, causal discovery, and deep learning. The goal of this survey is to give a comprehensive review of existing work and recent advances in this research area, and to discuss some of the most challenging issues and open problems that could potentially lead to new research directions. The survey begins with a brief introduction to the RKHS and positive definite kernels, which form the backbone of this survey, followed by a thorough discussion of the Hilbert space embedding of marginal distributions, theoretical guarantees, and a review of its applications. The embedding of distributions enables us to apply RKHS methods to probability measures, which prompts a wide range of applications such as kernel two-sample testing, independence testing, group anomaly detection, and learning on distributional data. Next, we discuss the Hilbert space embedding for conditional distributions, give theoretical insights, and review some applications. The conditional mean embedding enables us to perform the sum, product, and Bayes' rules (which are ubiquitous in graphical models, probabilistic inference, and reinforcement learning) in a non-parametric way using the new representation of distributions in the RKHS. We then discuss relationships between this framework and other related areas. Lastly, we give some suggestions on future research directions. The targeted audience includes graduate students and researchers in machine learning and statistics who are interested in the theory and applications of kernel mean embeddings.

For a function g in an RKHS G over some input space Y, we have E_{Y|x}[g(Y) | X = x] = ⟨g, U_{Y|x}⟩_G for all g ∈ G, where U_{Y|x} denotes the embedding of the conditional distribution P(Y | X = x). That is, we can compute the conditional expected value of any function g ∈ G with respect to P(Y | X = x) by taking an inner product in G between g and the embedding of P(Y | X = x). As a result of the aforementioned advantages, the kernel mean embedding has made widespread contributions in various directions. Firstly, most tasks in machine learning and statistics involve estimation of the data-generating process, whose success depends critically on the accuracy and reliability of this estimation. It is known that estimating the kernel mean embedding is easier than estimating the distribution itself, which helps improve many statistical inference methods. These include, for example, two-sample testing (Gretton et al. 2012a) and independence and conditional independence tests.
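For concreteness, here is a minimal sketch, assuming an RBF kernel, of the empirical (marginal) mean embedding and the resulting maximum mean discrepancy (MMD) two-sample statistic that the survey discusses; the bandwidth and the biased V-statistic form are illustrative choices.

```python
import numpy as np

def rbf_gram(a, b, bandwidth=1.0):
    """k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * bandwidth**2))

def mmd2(x, y, bandwidth=1.0):
    """Biased estimate of ||mu_P - mu_Q||^2 in the RKHS, from samples x and y."""
    return (rbf_gram(x, x, bandwidth).mean()
            + rbf_gram(y, y, bandwidth).mean()
            - 2 * rbf_gram(x, y, bandwidth).mean())

# Usage: the statistic is larger for samples from different distributions.
rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=(400, 2))
y = rng.normal(0.5, 1.0, size=(400, 2))
print(mmd2(x, y), mmd2(x, rng.normal(0.0, 1.0, size=(400, 2))))
```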

Kernel Machines for Non-vectorial Data

Lecture Notes in Computer Science, 2007

This work presents a short introduction to the main ideas behind the design of specific kernel functions for use in machine learning algorithms, for example support vector machines, when the patterns involved are described by non-vectorial information. In particular, the interval data case will be analysed as an illustrative example: explicit kernels based on the centre-radius diagram will be formulated for closed bounded intervals on the real line.
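As an illustrative instance of the idea, and not necessarily the kernel formulated in the paper, the sketch below maps each closed bounded interval [a, b] to its centre-radius point and applies a Gaussian kernel on that two-dimensional representation.

```python
import numpy as np

def centre_radius(intervals):
    """Map intervals given as rows [a, b] to points (centre, radius)."""
    a, b = intervals[:, 0], intervals[:, 1]
    return np.column_stack([(a + b) / 2.0, (b - a) / 2.0])

def interval_kernel(intervals_1, intervals_2, bandwidth=1.0):
    """Gaussian kernel on the centre-radius representation (illustrative choice)."""
    p = centre_radius(np.asarray(intervals_1, dtype=float))
    q = centre_radius(np.asarray(intervals_2, dtype=float))
    sq = np.sum(p**2, axis=1)[:, None] + np.sum(q**2, axis=1)[None, :] - 2 * p @ q.T
    return np.exp(-sq / (2 * bandwidth**2))

# Usage: kernel values between two intervals and a reference interval.
print(interval_kernel(np.array([[0.0, 2.0], [1.0, 3.0]]),
                      np.array([[0.0, 2.0]])))
```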

Nonlinear kernel-based statistical pattern analysis

IEEE Transactions on Neural Networks, 2001

The eigenstructure of the second-order statistics of a multivariate random population can be inferred from the matrix of pairwise combinations of inner products of the samples. Therefore, it can also be efficiently obtained in the implicit, high-dimensional feature spaces defined by kernel functions. We elaborate on this property to obtain general expressions for immediate derivation of nonlinear counterparts of a number of standard pattern analysis algorithms, including principal component analysis, data compression and denoising, and Fisher's discriminant. The connection between kernel methods and nonparametric density estimation is also illustrated. Using these results, we introduce the kernel version of the Mahalanobis distance, which gives rise to nonparametric models with unexpected and interesting properties, and also propose a kernel version of the minimum squared error (MSE) linear discriminant function. This learning machine is particularly simple and includes a number of generalized linear models such as the potential functions method or the radial basis function (RBF) network. Our results shed some light on the relative merit of feature spaces and inductive bias in the remarkable generalization properties of the support vector machine (SVM). Although in most situations the SVM obtains the lowest error rates, exhaustive experiments with synthetic and natural data show that simple kernel machines based on pseudoinversion are competitive in problems with appreciable class overlap.
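A compact sketch of the core construction behind these methods: the eigenstructure of feature-space second-order statistics recovered from the double-centred Gram matrix, i.e. kernel PCA. The RBF kernel, bandwidth, and toy data are illustrative assumptions.

```python
import numpy as np

def kernel_pca(x, n_components=2, bandwidth=1.0):
    """Project the training points onto the leading kernel principal components."""
    sq = np.sum(x**2, axis=1)[:, None] + np.sum(x**2, axis=1)[None, :] - 2 * x @ x.T
    k = np.exp(-sq / (2 * bandwidth**2))          # Gram matrix in feature space
    n = k.shape[0]
    one = np.full((n, n), 1.0 / n)
    kc = k - one @ k - k @ one + one @ k @ one    # double-centred Gram matrix
    w, v = np.linalg.eigh(kc)
    idx = np.argsort(w)[::-1][:n_components]
    # Projection of point i onto component j is sqrt(lambda_j) * v[i, j].
    return v[:, idx] * np.sqrt(np.clip(w[idx], 0.0, None))

rng = np.random.default_rng(4)
x = rng.normal(size=(200, 5))
print(kernel_pca(x).shape)  # (200, 2)
```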

Support Distribution Machines

2012

Most machine learning algorithms, such as classification or regression, treat the individual data point as the object of interest. Here we consider extending machine learning algorithms to operate on groups of data points. We suggest treating a group of data points as a set of i.i.d. samples from an underlying feature distribution for that group. Our approach is to generalize kernel machines from vectorial inputs to i.i.d. sample sets of vectors. For this purpose, we use a nonparametric estimator that can consistently estimate the inner product and certain kernel functions of two distributions. Projecting the estimated Gram matrix onto the cone of positive semi-definite matrices enables us to employ the kernel trick, and hence use kernel machines for classification, regression, anomaly detection, and low-dimensional embedding in the space of distributions. We present several numerical experiments on both real and simulated datasets to demonstrate the advantages of our new approach.
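The projection step mentioned above can be sketched as follows; the clipping of negative eigenvalues (the Frobenius-nearest PSD matrix) is a standard choice and is only assumed to match the paper's procedure.

```python
import numpy as np

def project_to_psd(gram):
    """Frobenius-nearest positive semi-definite matrix: symmetrise, clip eigenvalues."""
    sym = (gram + gram.T) / 2.0
    w, v = np.linalg.eigh(sym)
    return (v * np.clip(w, 0.0, None)) @ v.T

# A small Gram matrix assembled from noisy pairwise estimates; it has a
# negative eigenvalue before the projection and none afterwards.
noisy = np.array([[1.0, 0.9, 0.2],
                  [0.9, 1.0, 1.1],
                  [0.2, 1.1, 1.0]])
print(np.linalg.eigvalsh(noisy).min())                    # negative
print(np.linalg.eigvalsh(project_to_psd(noisy)).min())    # >= 0 (up to round-off)
```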

A family of probabilistic kernels based on information divergence

2004

Probabilistic kernels offer a way to combine generative models with discriminative classifiers. We establish connections between probabilistic kernels and feature space kernels through a geometric interpretation of the previously proposed probability product kernel. A family of probabilistic kernels, based on information divergence measures, is then introduced and its connections to various existing probabilistic kernels are analyzed.
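To connect the ideas, here is a small sketch of a probability product kernel between two fitted generative models, with rho = 1/2 (the Bhattacharyya kernel) and univariate Gaussians as the model class; the closed form below is standard for Gaussians, while the model choice is an illustrative assumption.

```python
import numpy as np

def bhattacharyya_kernel_gauss(mu1, var1, mu2, var2):
    """K(p, q) = integral of sqrt(p(x) * q(x)) dx for two univariate Gaussians."""
    s = var1 + var2
    return (np.sqrt(2.0 * np.sqrt(var1 * var2) / s)
            * np.exp(-((mu1 - mu2) ** 2) / (4.0 * s)))

# Fit a Gaussian to each sample, then evaluate the kernel between the two models.
rng = np.random.default_rng(5)
a = rng.normal(0.0, 1.0, size=300)
b = rng.normal(1.0, 1.5, size=300)
print(bhattacharyya_kernel_gauss(a.mean(), a.var(), b.mean(), b.var()))
```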

Generalised kernel machines

International Joint Conference on Neural Networks (IJCNN 2007), 2007

The generalised linear model (GLM) is the standard approach in classical statistics for regression tasks where it is appropriate to measure the data misfit using a likelihood drawn from the exponential family of distributions. In this paper, we apply the kernel trick to give a non-linear variant of the GLM, the generalised kernel machine (GKM), in which a regularised GLM is constructed in a fixed feature space implicitly defined by a Mercer kernel. The MATLAB symbolic maths toolbox is used to automatically create a suite of generalised kernel machines, including methods for automated model selection based on approximate leave-one-out cross-validation. In doing so, we provide a common framework encompassing a wide range of existing and novel kernel learning methods, and highlight their connections with earlier techniques from classical statistics. Examples including kernel ridge regression, kernel logistic regression and kernel Poisson regression are given to demonstrate the flexibility and utility of the generalised kernel machine.
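As a minimal sketch of the simplest member of this family, the code below fits kernel ridge regression, i.e. a regularised GLM with Gaussian likelihood and identity link in an RBF-kernel feature space. The bandwidth and regularisation values are illustrative, and no approximate leave-one-out model selection is included.

```python
import numpy as np

def rbf_gram(a, b, bandwidth=1.0):
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * bandwidth**2))

def fit_kernel_ridge(x, y, lam=1e-2, bandwidth=1.0):
    """Dual coefficients alpha = (K + lam * I)^{-1} y."""
    k = rbf_gram(x, x, bandwidth)
    return np.linalg.solve(k + lam * np.eye(len(x)), y)

def predict(x_train, alpha, x_new, bandwidth=1.0):
    return rbf_gram(x_new, x_train, bandwidth) @ alpha

# Usage: noisy sine curve.
rng = np.random.default_rng(6)
x = rng.uniform(-3.0, 3.0, size=(100, 1))
y = np.sin(x[:, 0]) + 0.1 * rng.normal(size=100)
alpha = fit_kernel_ridge(x, y)
print(predict(x, alpha, np.array([[0.0], [1.5]])))
```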