Nonparametric Estimation of Conditional Information and Divergences
Related papers
Direct estimation of information divergence using nearest neighbor ratios
2017 IEEE International Symposium on Information Theory (ISIT), 2017
We propose a direct estimation method for Rényi and f-divergence measures based on a new graph-theoretical interpretation. Suppose that we are given two sample sets X and Y, with N and M samples respectively, where η := M/N is a constant. Considering the k-nearest-neighbor (k-NN) graph of Y in the joint data set (X, Y), we show that the average powered ratio of the number of X points to the number of Y points among all k-NN points is proportional to the Rényi divergence of the X and Y densities. A similar method can also be used to estimate f-divergence measures. We derive bias and variance rates and show that, for the class of γ-Hölder smooth functions, the estimator achieves the MSE rate of O(N^{-2γ/(γ+d)}). Furthermore, by using a weighted ensemble estimation technique, for density functions with continuous and bounded derivatives up to order d and some extra conditions at the boundary of the support, we derive an ensemble estimator that achieves the parametric MSE rate of O(1/N). Our estimator requires no boundary correction, and remarkably, boundary issues do not arise. Our approach is also more computationally tractable than competing estimators, which makes it appealing in many practical applications.
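As a rough illustration of the construction described above, the sketch below builds the k-NN graph of the Y points inside the joint sample and averages the powered ratio of X- to Y-neighbors. The function name, the choice of k, and the omission of the paper's bias-correction constants are all assumptions made for illustration, not the authors' exact estimator.

```python
# Minimal sketch of the k-NN ratio idea: powered ratios of X- to Y-neighbors
# around each Y point approximate E_{f_Y}[(f_X/f_Y)^alpha].
import numpy as np
from scipy.spatial import cKDTree

def renyi_divergence_knn_ratio(X, Y, alpha=0.8, k=10):
    """Rough estimate of the Renyi-alpha divergence D_alpha(f_X || f_Y)."""
    N, M = len(X), len(Y)
    eta = M / N
    Z = np.vstack([X, Y])                    # joint data set
    labels = np.array([0] * N + [1] * M)     # 0 = X point, 1 = Y point
    tree = cKDTree(Z)
    # k+1 neighbours because the query point itself is returned first.
    _, idx = tree.query(Y, k=k + 1)
    ratios = []
    for neigh in idx:
        neigh = neigh[1:]                    # drop the point itself
        n_x = np.sum(labels[neigh] == 0)     # X points among the k neighbours
        n_y = k - n_x                        # Y points among the k neighbours
        ratios.append(eta * n_x / max(n_y, 1))
    powered = np.mean(np.array(ratios) ** alpha)   # ~ E_{f_Y}[(f_X/f_Y)^alpha]
    return np.log(max(powered, 1e-12)) / (alpha - 1)

# Example: two Gaussians with shifted means.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(2000, 2))
Y = rng.normal(0.5, 1.0, size=(2000, 2))
print(renyi_divergence_knn_ratio(X, Y))
```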
A Bayesian Nonparametric Estimation of Mutual Information
arXiv (Cornell University), 2021
Mutual information is a widely used information-theoretic measure that quantifies the amount of association between variables. It is used extensively in many applications such as image registration, diagnosis of failures in electrical machines, pattern recognition, data mining, and tests of independence. The main goal of this paper is to provide an efficient estimator of the mutual information based on the approach of Al Labadi et al. (2021). The estimator is explored through various examples and is compared to its frequentist counterpart due to Berrett et al. (2019). The results show the good performance of the procedure, which attains a smaller mean squared error.
2014
We propose and analyze estimators for statistical functionals of one or more distributions under nonparametric assumptions. Our estimators are based on the theory of influence functions, which appear in the semiparametric statistics literature. We show that estimators based either on data-splitting or a leave-one-out technique enjoy fast rates of convergence and other favorable theoretical properties. We apply this framework to derive estimators for several popular information theoretic quantities, and via empirical evaluation, show the advantage of this approach over existing estimators.
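As a hedged illustration of the data-splitting idea: for Shannon differential entropy H(p) = -E[log p(X)], the influence function is IF(x; p) = -log p(x) - H(p), so the first-order corrected plug-in estimate reduces to the held-out average of -log p̂. The sketch below assumes a kernel density estimate and is only one instance of the general framework, not the estimators derived in the paper.

```python
# Data-splitting sketch: fit a density on one half of the sample, apply the
# first-order influence-function correction using the other half. For Shannon
# entropy the corrected plug-in collapses to the held-out average of -log p_hat.
import numpy as np
from scipy.stats import gaussian_kde

def entropy_data_split(x, seed=0):
    """One-split influence-function estimate of differential entropy (nats)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x))
    half = len(x) // 2
    fit, evaluate = x[perm[:half]], x[perm[half:]]
    kde = gaussian_kde(fit.T)                 # density estimate on split 1
    return -np.mean(np.log(kde(evaluate.T)))  # corrected plug-in on split 2

rng = np.random.default_rng(1)
sample = rng.normal(size=(4000, 1))
print(entropy_data_split(sample))             # true value: 0.5*log(2*pi*e) ~ 1.419
```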
Nonparametric Divergence Estimation and its Applications to Machine Learning
2011
Low-dimensional embedding, manifold learning, clustering, classification, and anomaly detection are among the most important problems in machine learning. Here we consider the setting where each instance of the inputs corresponds to a continuous probability distribution. These distributions are unknown to us, but we are given some i.i.d. samples from each of them. While most existing machine learning methods operate on points, i.e. finite-dimensional feature vectors, in our setting we study algorithms that operate on groups, i.e. sets of feature vectors. For this purpose, we propose new nonparametric, consistent estimators for a large family of divergences and describe how to apply them to machine learning problems. As important special cases, the estimators can be used to estimate the Rényi, Tsallis, Kullback-Leibler, Hellinger, Bhattacharyya, and L2 divergences, as well as mutual information. We present empirical results on synthetic data, real-world images, and astronomical data sets.
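To make the "learning on groups" setting concrete, the sketch below estimates a divergence between every pair of sample sets with a standard k-NN Kullback-Leibler estimator and hands the resulting distance matrix to an off-the-shelf clustering routine. The particular estimator, the symmetrization, and the clipping to nonnegative values are illustrative choices, not the paper's.

```python
# Pairwise divergence estimates between sample sets, fed to hierarchical clustering.
import numpy as np
from scipy.spatial import cKDTree
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def kl_knn(X, Y, k=5):
    """k-NN estimate of KL(P || Q) from samples X ~ P and Y ~ Q."""
    n, d = X.shape
    m = len(Y)
    rho = cKDTree(X).query(X, k=k + 1)[0][:, k]   # k-th NN within X (skip self)
    nu = cKDTree(Y).query(X, k=k)[0][:, k - 1]    # k-th NN of X points in Y
    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1))

# Toy data: six "instances", each of which is itself a sample set.
rng = np.random.default_rng(0)
groups = [rng.normal(0, 1, size=(500, 2)) for _ in range(3)] + \
         [rng.normal(3, 1, size=(500, 2)) for _ in range(3)]

n_g = len(groups)
D = np.zeros((n_g, n_g))
for i in range(n_g):
    for j in range(i + 1, n_g):
        # Symmetrise and clip: the raw estimates can be slightly negative.
        D[i, j] = D[j, i] = max(kl_knn(groups[i], groups[j]) +
                                kl_knn(groups[j], groups[i]), 0.0)

labels = fcluster(linkage(squareform(D), method="average"), t=2, criterion="maxclust")
print(labels)   # the two populations of groups separate into two clusters
```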
On the Estimation of α-Divergences
2011
We propose new nonparametric, consistent Rényi-α and Tsallis-α divergence estimators for continuous distributions. Given two independent and identically distributed samples, a "naïve" approach would be to simply estimate the underlying densities and plug the estimated densities into the corresponding formulas. Our proposed estimators, in contrast, avoid density estimation completely, estimating the divergences directly using only simple k-nearest-neighbor statistics. We are nonetheless able to prove that the estimators are consistent under certain conditions. We also describe how to apply these estimators to mutual information and demonstrate their efficiency via numerical experiments.
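A minimal sketch of the direct k-NN construction, assuming the usual Gamma-ratio bias correction for k-NN density ratios; treat it as an illustration of the approach rather than a faithful reimplementation of the proposed estimators.

```python
# Plug k-NN distance ratios into E_p[(p/q)^(alpha-1)] with a Gamma-ratio correction.
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gammaln

def renyi_alpha_knn(X, Y, alpha=0.5, k=5):
    """Estimate D_alpha(p || q) from X ~ p and Y ~ q using k-NN distances."""
    n, d = X.shape
    m = len(Y)
    rho = cKDTree(X).query(X, k=k + 1)[0][:, k]    # k-th NN distance within X
    nu = cKDTree(Y).query(X, k=k)[0][:, k - 1]     # k-th NN distance in Y
    # ((n-1) rho^d / (m nu^d))^(1-alpha) averaged over X ~ p
    terms = ((n - 1) * rho**d / (m * nu**d)) ** (1.0 - alpha)
    # Gamma-ratio correction for the bias of the k-NN density ratio
    log_B = 2 * gammaln(k) - gammaln(k - alpha + 1) - gammaln(k + alpha - 1)
    integral = np.mean(terms) * np.exp(log_B)      # ~ int p^alpha q^(1-alpha)
    return np.log(integral) / (alpha - 1)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(3000, 1))
Y = rng.normal(1.0, 1.0, size=(3000, 1))
# Closed form for two unit-variance Gaussians: alpha * (mu1 - mu2)^2 / 2
print(renyi_alpha_knn(X, Y, alpha=0.5), 0.5 * 1.0 / 2)
```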
Normalized information-based divergences
2007
This paper is devoted to the mathematical study of some divergences based on mutual information that are well suited to categorical random vectors. These divergences are generalizations of the "entropy distance" and the "information distance". Their main characteristic is that they combine a complexity term and the mutual information. We then introduce the notion of (normalized) information-based divergence, propose several examples, and discuss their mathematical properties, in particular within a prediction framework.
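For a concrete discrete example, assuming the standard definitions: the entropy distance is H(X|Y) + H(Y|X) = H(X,Y) - I(X;Y), and a normalized information-based divergence rescales it by a complexity term such as the joint entropy, giving 1 - I(X;Y)/H(X,Y).

```python
# Worked example: entropy distance and its normalized version for a joint pmf.
import numpy as np

def entropies(p_xy):
    """Return H(X), H(Y), H(X,Y) and I(X;Y) in bits for a joint pmf array."""
    p_xy = p_xy / p_xy.sum()
    h = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))
    h_xy = h(p_xy)
    h_x, h_y = h(p_xy.sum(axis=1)), h(p_xy.sum(axis=0))
    return h_x, h_y, h_xy, h_x + h_y - h_xy

p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.55]])          # a 2x2 joint distribution
h_x, h_y, h_xy, mi = entropies(p_xy)
entropy_distance = h_xy - mi             # H(X|Y) + H(Y|X)
normalized = entropy_distance / h_xy     # equals 1 - I(X;Y)/H(X,Y), in [0, 1]
print(entropy_distance, normalized)
```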
Estimating the Mutual Information between Two Discrete, Asymmetric Variables with Limited Samples
Entropy
Determining the strength of nonlinear, statistical dependencies between two variables is a crucial matter in many research fields. The established measure for quantifying such relations is the mutual information. However, estimating mutual information from limited samples is a challenging task. Since the mutual information is the difference of two entropies, existing Bayesian estimators of entropy may be used to estimate it. This procedure, however, is still biased in the severely undersampled regime. Here, we propose an alternative estimator that is applicable to those cases in which the marginal distribution of one of the two variables (the one with minimal entropy) is well sampled. The other variable, as well as the joint and conditional distributions, can be severely undersampled. We obtain a consistent estimator that exhibits very low bias, outperforming previous methods even when the sampled data contain few coincidences. As with other Bayesian estimators, our prop...
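The decomposition the estimator builds on can be sketched as follows, with X the well-sampled variable: I(X;Y) = H(X) - Σ_y p(y) H(X|Y=y). The plug-in entropies below only illustrate this decomposition; the paper replaces them with Bayesian entropy estimates to control the bias of the undersampled conditionals.

```python
# Illustration of I(X;Y) = H(X) - H(X|Y) with plug-in entropies.
import numpy as np
from collections import Counter

def plugin_entropy(counts):
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log2(p))

def mi_via_decomposition(xs, ys):
    """I(X;Y) in bits from paired samples, via H(X) - sum_y p(y) H(X|Y=y)."""
    n = len(xs)
    h_x = plugin_entropy(Counter(xs))
    h_x_given_y = 0.0
    for y, n_y in Counter(ys).items():
        cond = Counter(x for x, yy in zip(xs, ys) if yy == y)
        h_x_given_y += (n_y / n) * plugin_entropy(cond)
    return h_x - h_x_given_y

rng = np.random.default_rng(0)
ys = rng.integers(0, 2, size=200)                    # the coarser variable
xs = [y * 4 + rng.integers(0, 4) for y in ys]        # X carries 1 bit about Y
print(mi_via_decomposition(xs, ys))                  # close to 1 bit here
```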
Demystifying Fixed k-Nearest Neighbor Information Estimators
IEEE Transactions on Information Theory, 2018
Estimating mutual information from i.i.d. samples drawn from an unknown joint density function is a basic statistical problem of broad interest with multitudinous applications. The most popular estimator is the one proposed by Kraskov, Stögbauer, and Grassberger (KSG) in 2004; it is nonparametric and based on the distances of each sample to its k-th nearest neighboring sample, where k is a fixed small integer. Despite its widespread use (it is part of standard scientific software packages), the theoretical properties of this estimator have been largely unexplored. In this paper we demonstrate that the estimator is consistent and also identify an upper bound on the rate of convergence of the ℓ2 error as a function of the number of samples. We argue that the performance benefits of the KSG estimator stem from a curious "correlation boosting" effect, and we build on this intuition to modify the KSG estimator in novel ways to construct a superior estimator. As a byproduct of our investigations, we obtain nearly tight rates of convergence of the ℓ2 error of the well-known fixed k-nearest-neighbor estimator of differential entropy by Kozachenko and Leonenko.
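For reference, a compact sketch of the KSG construction: for each sample take the max-norm distance to its k-th nearest neighbor in the joint space, count the marginal neighbors strictly within that distance, and combine the counts with digamma terms. The implementation details below (library choice, the brute-force marginal counting loop) are convenience assumptions.

```python
# KSG (Kraskov-Stoegbauer-Grassberger) estimator, algorithm-1 style sketch.
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mutual_information(x, y, k=4):
    """KSG estimate of I(X;Y) in nats from paired samples."""
    x, y = np.atleast_2d(x.T).T, np.atleast_2d(y.T).T   # ensure 2-D columns
    n = len(x)
    joint = np.hstack([x, y])
    # Max-norm distance to the k-th neighbour in the joint space (skip self).
    eps = cKDTree(joint).query(joint, k=k + 1, p=np.inf)[0][:, k]
    nx = np.empty(n, dtype=int)
    ny = np.empty(n, dtype=int)
    for i in range(n):
        nx[i] = np.sum(np.max(np.abs(x - x[i]), axis=1) < eps[i]) - 1
        ny[i] = np.sum(np.max(np.abs(y - y[i]), axis=1) < eps[i]) - 1
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))

# Correlated Gaussian example; true value is -0.5 * log(1 - rho^2).
rng = np.random.default_rng(0)
rho = 0.6
x = rng.normal(size=2000)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=2000)
print(ksg_mutual_information(x, y), -0.5 * np.log(1 - rho**2))
```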
Estimation of Rényi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
CoRR, 2010
We present simple and computationally efficient nonparametric estimators of Rényi entropy and mutual information based on an i.i.d. sample drawn from an unknown, absolutely continuous distribution over R^d. The estimators are calculated as the sum of p-th powers of the Euclidean lengths of the edges of the "generalized nearest-neighbor" graph of the sample and of the empirical copula of the sample, respectively. For the first time, we prove the almost sure consistency of these estimators and upper bounds on their rates of convergence, the latter under the assumption that the density underlying the sample is Lipschitz continuous. Experiments demonstrate their usefulness in independent subspace analysis.
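A hedged sketch of the entropy part of the construction: sum the p-th powers of nearest-neighbor edge lengths with p = d(1-α), rescale by n^α, and divide out the unknown asymptotic constant. Here that constant is calibrated by Monte Carlo on uniform samples (whose Rényi entropy is zero), which is an assumption of this sketch rather than the paper's prescription.

```python
# Nearest-neighbour-graph Renyi entropy sketch with Monte Carlo calibration.
import numpy as np
from scipy.spatial import cKDTree

def nn_edge_sum(sample, p):
    """Sum of p-th powers of nearest-neighbour edge lengths."""
    dist = cKDTree(sample).query(sample, k=2)[0][:, 1]   # 1-NN, skipping self
    return np.sum(dist ** p)

def renyi_entropy_nn(sample, alpha=0.7, n_calib=20, seed=0):
    n, d = sample.shape
    p = d * (1.0 - alpha)                                # so that 1 - p/d = alpha
    # Calibrate the constant on the uniform distribution over [0,1]^d (H_alpha = 0).
    rng = np.random.default_rng(seed)
    gamma = np.mean([nn_edge_sum(rng.random((n, d)), p) / n**alpha
                     for _ in range(n_calib)])
    return np.log(nn_edge_sum(sample, p) / (gamma * n**alpha)) / (1.0 - alpha)

rng = np.random.default_rng(1)
sample = rng.random((4000, 2)) * 2.0        # uniform on [0,2]^2, H_alpha = log 4
print(renyi_entropy_nn(sample), np.log(4.0))
```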