Accurate Estimation of the Intrinsic Dimension Using Graph Distances: Unraveling the Geometric Complexity of Datasets

Geodesic distances in the maximum likelihood estimator of intrinsic dimensionality

Nonlinear Analysis: Modelling and Control

When analyzing multidimensional data, we often have to reduce their dimensionality while preserving as much information about the analyzed data set as possible. To this end, it is reasonable to determine the intrinsic dimensionality of the data. In this paper, two techniques for estimating the intrinsic dimensionality are analyzed and compared: the maximum likelihood estimator (MLE) and the ISOMAP method. We also propose a way to obtain good estimates of the intrinsic dimensionality with the MLE method.
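The MLE referred to in this abstract is presumably the Levina-Bickel nearest-neighbor estimator. The sketch below illustrates that formula on plain Euclidean distances; it does not reproduce the geodesic-distance variant the paper studies, and the function name and sanity-check data are illustrative choices:

```python
import numpy as np

def mle_intrinsic_dim(X, k=10):
    """Levina-Bickel MLE of intrinsic dimension, averaged over all points.

    With T_j(x) the distance from x to its j-th nearest neighbor, the
    local estimate is m(x) = [ (1/(k-1)) * sum_{j<k} log(T_k(x)/T_j(x)) ]^{-1}.
    """
    X = np.asarray(X, dtype=float)
    sq = (X ** 2).sum(axis=1)
    # squared pairwise distances via the Gram-matrix identity, clipped at 0
    d = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
    d.sort(axis=1)                      # column 0 is the zero self-distance
    T = d[:, 1:k + 1]                   # distances to neighbors 1..k
    local = 1.0 / np.log(T[:, -1:] / T[:, :-1]).mean(axis=1)
    return local.mean()

# sanity check: a 2-D linear sheet embedded in 5-D should give roughly 2
rng = np.random.default_rng(0)
sheet = rng.standard_normal((500, 2)) @ rng.standard_normal((2, 5))
print(mle_intrinsic_dim(sheet, k=10))
```

Averaging the inverse local estimates, as above, is one of the aggregation choices discussed in the literature; averaging the inverses before inverting is another.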

Estimating Local Intrinsic Dimension with k-Nearest Neighbor Graphs

Many high-dimensional data sets of practical interest exhibit a varying complexity in different parts of the data space. This is the case, for example, of databases of images containing many samples of a few textures of different complexity. Such phenomena can be modeled by assuming that the data lies on a collection of manifolds with different intrinsic dimensionalities. In this extended abstract, we introduce a method to estimate the local dimensionality associated with each point in a data set, without any prior information about the manifolds, their quantity and their sampling distributions. The proposed method uses a global dimensionality estimator based on k-nearest neighbor (k-NN) graphs, together with an algorithm for computing neighborhoods in the data with similar topological properties.

On Local Intrinsic Dimension Estimation and Its Applications

In this paper, we present multiple novel applications for local intrinsic dimension estimation. There has been much work done on estimating the global dimension of a data set, typically for the purposes of dimensionality reduction. We show that by estimating dimension locally, we are able to extend the uses of dimension estimation to many applications, which are not possible with global dimension estimation. Additionally, we show that local dimension estimation can be used to obtain a better global dimension estimate, alleviating the negative bias that is common to all known dimension estimation algorithms. We illustrate local dimension estimation's uses towards additional applications, such as learning on statistical manifolds, network anomaly detection, clustering, and image segmentation.

Geodesic entropic graphs for dimension and entropy estimation in manifold learning

In the manifold learning problem one seeks to discover a smooth low dimensional surface, i.e., a manifold embedded in a higher dimensional linear vector space, based on a set of measured sample points on the surface. In this paper we consider the closely related problem of estimating the manifold's intrinsic dimension and the intrinsic entropy of the sample points. Specifically, we view the sample points as realizations of an unknown multivariate density supported on an unknown smooth manifold. We introduce a novel geometric approach based on entropic graph methods. Although the theory presented applies to this general class of graphs, we focus on the geodesic minimal spanning tree (GMST) to obtain asymptotically consistent estimates of the manifold dimension and the Rényi α-entropy of the sample density on the manifold. The GMST approach is striking in its simplicity and does not require reconstructing the manifold or estimating the multivariate density of the samples. The GMST method simply constructs a minimal spanning tree (MST) sequence using a geodesic edge matrix and uses the overall lengths of the MSTs to simultaneously estimate manifold dimension and entropy. We illustrate the GMST approach on standard synthetic manifolds as well as on real data sets consisting of images of faces.
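The dimension half of this idea rests on the Beardwood-Halton-Hammersley scaling: the total MST length over n points on an m-dimensional manifold grows like n^((m-1)/m), so the slope of log-length versus log-n determines m. A minimal sketch of that step, using Euclidean rather than geodesic edge lengths (adequate only for nearly flat manifolds; function names and subsample sizes are illustrative choices, and the entropy estimate is omitted):

```python
import numpy as np

def mst_length(X):
    """Total edge length of the Euclidean MST (Prim's algorithm, dense O(n^2))."""
    sq = (X ** 2).sum(axis=1)
    d = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
    n = len(X)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = d[0].copy()              # cheapest edge from the tree to each vertex
    total = 0.0
    for _ in range(n - 1):
        best[in_tree] = np.inf
        j = int(np.argmin(best))    # nearest vertex not yet in the tree
        total += best[j]
        in_tree[j] = True
        best = np.minimum(best, d[j])
    return total

def gmst_dimension(X, sizes=(100, 200, 400, 800), trials=5, seed=0):
    """Fit the BHH scaling L_n ~ n^((m-1)/m) and solve the slope for m."""
    rng = np.random.default_rng(seed)
    log_len = []
    for n in sizes:
        runs = [mst_length(X[rng.choice(len(X), size=n, replace=False)])
                for _ in range(trials)]
        log_len.append(np.log(np.mean(runs)))
    slope = np.polyfit(np.log(sizes), log_len, 1)[0]
    return 1.0 / (1.0 - slope)

rng = np.random.default_rng(1)
X = rng.random((1000, 2)) @ rng.standard_normal((2, 6))   # 2-D sheet in 6-D
print(gmst_dimension(X))
```

For curved manifolds the paper's geodesic edge matrix (shortest-path distances through a neighborhood graph) replaces the Euclidean distances used here.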

Intrinsic dimensionality estimation with optimally topology preserving maps

IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998

A new method for analyzing the intrinsic dimensionality (ID) of low-dimensional manifolds in high-dimensional feature spaces is presented. The basic idea is to first extract a low-dimensional representation that captures the intrinsic topological structure of the input data and then to analyze this representation, i.e., estimate the intrinsic dimensionality. More specifically, the representation we extract is an optimally topology preserving feature map (OTPM), which is an undirected parametrized graph with a pointer in the input space associated with each node. Estimation of the intrinsic dimensionality is based on local PCA of the pointers of the nodes in the OTPM and their direct neighbors. The method has a number of important advantages compared with previous approaches: First, it can be shown to have only linear time complexity w.r.t. the dimensionality of the input space, in contrast to conventional PCA-based approaches, which have cubic complexity and hence become computationally impracticable for high-dimensional input spaces. Second, it is less sensitive to noise than former approaches, and, finally, the extracted representation can be directly used for further data processing tasks, including auto-association and classification. Experiments include ID estimation of synthetic data for illustration as well as ID estimation of a sequence of full-scale images.
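The paper runs local PCA over OTPM nodes and their graph neighbors; as a simplified stand-in, the same idea can be sketched with plain k-NN neighborhoods on the raw data. The 95% explained-variance threshold, the neighborhood size, and all names below are illustrative choices, not the paper's:

```python
import numpy as np

def local_pca_dim(X, k=20, var_threshold=0.95):
    """Per-point ID: smallest number of principal components explaining
    var_threshold of the local neighborhood's variance; returns the median."""
    X = np.asarray(X, dtype=float)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    nbrs = np.argsort(d, axis=1)[:, :k + 1]    # each point plus its k neighbors
    dims = []
    for idx in nbrs:
        P = X[idx] - X[idx].mean(axis=0)       # centered local neighborhood
        ev = np.linalg.eigvalsh(P.T @ P)[::-1] # scatter eigenvalues, descending
        frac = np.cumsum(ev) / ev.sum()        # cumulative explained variance
        dims.append(int(np.searchsorted(frac, var_threshold) + 1))
    return float(np.median(dims))

# sanity check: a smooth 1-D curve in 3-D should give 1
rng = np.random.default_rng(2)
t = np.sort(rng.uniform(0.0, 3.0, 400))
helix = np.c_[np.cos(t), np.sin(t), t]
print(local_pca_dim(helix))
```

Unlike this sketch, the OTPM approach applies PCA to the map's node pointers rather than to every raw sample, which is what yields its linear complexity in the input dimension.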

Intrinsic Dimensionality Estimation within Tight Localities

Proceedings of the 2019 SIAM International Conference on Data Mining, 2019

Accurate estimation of Intrinsic Dimensionality (ID) is of crucial importance in many data mining and machine learning tasks, including dimensionality reduction, outlier detection, similarity search and subspace clustering. However, since their convergence generally requires sample sizes (that is, neighborhood sizes) on the order of hundreds of points, existing ID estimation methods may have only limited usefulness for applications in which the data consists of many natural groups of small size. In this paper, we propose a local ID estimation strategy stable even for 'tight' localities consisting of as few as 20 sample points. The estimator applies MLE techniques over all available pairwise distances among the members of the sample, based on a recent extreme-value-theoretic model of intrinsic dimensionality, the Local Intrinsic Dimension (LID). Our experimental results show that our proposed estimation technique can achieve notably smaller variance, while maintaining comparable levels of bias, at much smaller sample sizes than state-of-the-art estimators.

Intrinsic Dimension Estimation: Relevant Techniques and a Benchmark Framework

Mathematical Problems in Engineering, 2015

When dealing with datasets comprising high-dimensional points, it is usually advantageous to discover some data structure. A fundamental piece of information needed to this aim is the minimum number of parameters required to describe the data while minimizing the information loss. This number, usually called the intrinsic dimension, can be interpreted as the dimension of the manifold from which the input data are supposed to be drawn. Due to its usefulness in many theoretical and practical problems, in the last decades the concept of intrinsic dimension has gained considerable attention in the scientific community, motivating the large number of intrinsic dimensionality estimators proposed in the literature. However, the problem is still open, since most techniques cannot efficiently deal with datasets drawn from manifolds of high intrinsic dimension that are nonlinearly embedded in higher-dimensional spaces. This paper surveys some of the most interesting, widely used, and advanced state-of-the-art...

Distributional Results for Model-Based Intrinsic Dimension Estimators

2021

Modern datasets are characterized by a large number of features that describe complex dependency structures. To deal with this type of data, dimensionality reduction techniques are essential. Numerous dimensionality reduction methods rely on the concept of intrinsic dimension, a measure of the complexity of the dataset. In this article, we first review the TWO-NN model, a likelihood-based intrinsic dimension estimator recently introduced by Facco et al. [2017]. Specifically, the TWO-NN estimator is based on the statistical properties of the ratio of the distances between a point and its first two nearest neighbors. We extend the TWO-NN theoretical framework by providing novel distributional results for consecutive and generic ratios of distances. These distributional results are then employed to derive intrinsic dimension estimators, called Cride and Gride. These novel estimators are more robust to noisy measurements than the TWO-NN and allow the study of the evolution of the intrinsic...

Estimating Local Intrinsic Dimensionality

Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015

This paper is concerned with the estimation of a local measure of intrinsic dimensionality (ID) recently proposed by Houle. The local model can be regarded as an extension of Karger and Ruhl's expansion dimension to a statistical setting in which the distribution of distances to a query point is modeled in terms of a continuous random variable. This form of intrinsic dimensionality can be particularly useful in search, classification, outlier detection, and other contexts in machine learning, databases, and data mining, as it has been shown to be equivalent to a measure of the discriminative power of similarity functions. Several estimators of local ID are proposed and analyzed based on extreme value theory, using maximum likelihood estimation (MLE), the method of moments (MoM), probability weighted moments (PWM), and regularly varying functions (RV). An experimental evaluation is also provided, using both real and artificial data.
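Two of the estimator families named here, MLE and the method of moments, have short closed forms under the model in which neighbor distances r within radius w follow F(r) = (r/w)^ID. A sketch under that assumption (names and the synthetic check are illustrative, not the paper's code):

```python
import numpy as np

def lid_mle(r):
    """Hill-type MLE of local ID from a query's k nearest-neighbor distances:
    ID = -[ (1/k) * sum_i log(r_i / r_k) ]^{-1}, with r_k the largest."""
    r = np.sort(np.asarray(r, dtype=float))
    return -1.0 / np.mean(np.log(r / r[-1]))   # the r_k term contributes log 1 = 0

def lid_mom(r):
    """Method-of-moments estimate for the same model: since E[r] = w*ID/(ID+1),
    ID = m1 / (w - m1) with m1 the sample mean and w the largest distance."""
    r = np.asarray(r, dtype=float)
    m1, w = r.mean(), r.max()
    return m1 / (w - m1)

# synthetic check: distances drawn exactly from F(r) = r^d with true ID d = 4
rng = np.random.default_rng(4)
r = rng.random(1000) ** (1.0 / 4.0)
print(lid_mle(r), lid_mom(r))
```

Both estimates should recover the true shape parameter on such ideal data; on real data they differ in bias-variance trade-offs, which is the subject of the paper's evaluation.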

Manifold Hypothesis in Data Analysis: Double Geometrically-Probabilistic Approach to Manifold Dimension Estimation

2021

The manifold hypothesis states that data points in high-dimensional space actually lie in close vicinity of a manifold of much lower dimension. In many cases this hypothesis has been empirically verified and used to enhance unsupervised and semi-supervised learning. Here we present a new approach to checking the manifold hypothesis and estimating the underlying manifold dimension. To do so, we use two very different methods simultaneously, one geometric, the other probabilistic, and check whether they give the same result. Our geometric method is a modification, for sparse data, of the well-known box-counting algorithm for Minkowski dimension calculation. The probabilistic method is new. Although it exploits the standard nearest-neighbor distance, it differs from the methods previously used in such situations. This method is robust and fast, and includes a special preliminary data transformation. Experiments on real datasets show that the suggested approach based on the combination of the two methods...
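The box-counting side of this pairing estimates the Minkowski dimension as the slope of log N(eps) versus log(1/eps), where N(eps) counts occupied boxes of side eps. A plain version without the paper's sparse-data modification (scales, names, and the sanity-check curve are illustrative choices):

```python
import numpy as np

def box_counting_dim(X, scales=(4, 8, 16, 32)):
    """Box-counting (Minkowski) dimension: slope of log N(eps) vs log(1/eps)."""
    X = np.asarray(X, dtype=float)
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # unit cube
    counts = []
    for s in scales:           # grid of s boxes per axis, eps = 1/s
        cells = np.floor(X * s * 0.9999).astype(int)  # keep max inside box s-1
        counts.append(len(np.unique(cells, axis=0)))
    # N(eps) ~ eps^(-dim), so dim is the slope of log N against log s
    return np.polyfit(np.log(scales), np.log(counts), 1)[0]

# sanity check: a densely sampled smooth curve in the plane has dimension 1
rng = np.random.default_rng(5)
t = rng.random(20000)
curve = np.c_[t, np.sin(6 * t)]
print(box_counting_dim(curve))
```

With sparse samples the naive count saturates (every point eventually gets its own box), which is precisely the regime the paper's modification addresses.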