Estimating Local Intrinsic Dimensionality
Related papers
Extreme-value-theoretic estimation of local intrinsic dimensionality
Data Mining and Knowledge Discovery, 2018
This paper is concerned with the estimation of a local measure of intrinsic dimensionality (ID) recently proposed by Houle. The local model can be regarded as an extension of Karger and Ruhl's expansion dimension to a statistical setting in which the distribution of distances to a query point is modeled in terms of a continuous random variable. This form of intrinsic dimensionality can be particularly useful in search, classification, outlier detection, and other contexts in machine learning, databases, and data mining, as it has been shown to be equivalent to a measure of the discriminative power of similarity functions. Several estimators of local ID are proposed and analyzed based on extreme value theory, using maximum likelihood estimation, the method of moments, probability weighted moments, and regularly varying functions. An experimental evaluation is also provided, using both real and artificial data.
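For a fixed neighbourhood of size k, the extreme-value-theoretic MLE of local ID described above reduces to a Hill-type estimator over the sorted neighbour distances. A minimal sketch, with synthetic distances standing in for a real query's neighbour distances (variable names are illustrative):

```python
import numpy as np

def lid_mle(distances):
    """Hill-type MLE of local intrinsic dimensionality from the
    distances of a query point to its k nearest neighbours:
    LID = -( (1/k) * sum_i ln(r_i / r_k) )^(-1), r_k the largest."""
    r = np.sort(np.asarray(distances, dtype=float))
    return -1.0 / np.mean(np.log(r / r[-1]))

# Synthetic check: for distances distributed as F(r) = r^d near the
# query (here d = 5), the estimator should approximately recover d.
rng = np.random.default_rng(0)
r = rng.uniform(size=2000) ** (1.0 / 5)
est = lid_mle(r)
```

The same formula underlies several of the estimators compared in the paper; the method-of-moments and probability-weighted-moments variants differ only in how the tail parameter is fit.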
Intrinsic Dimensionality Estimation within Tight Localities: A Theoretical and Experimental Analysis
Cornell University - arXiv, 2022
Accurate estimation of Intrinsic Dimensionality (ID) is of crucial importance in many data mining and machine learning tasks, including dimensionality reduction, outlier detection, similarity search and subspace clustering. However, since their convergence generally requires sample sizes (that is, neighborhood sizes) on the order of hundreds of points, existing ID estimation methods may have only limited usefulness for applications in which the data consists of many natural groups of small size. In this paper, we propose a local ID estimation strategy stable even for 'tight' localities consisting of as few as 20 sample points. The estimator applies MLE techniques over all available pairwise distances among the members of the sample, based on a recent extreme-value-theoretic model of intrinsic dimensionality, the Local Intrinsic Dimension (LID). Our experimental results show that our proposed estimation technique can achieve notably smaller variance, while maintaining comparable levels of bias, at much smaller sample sizes than state-of-the-art estimators.
Intrinsic Dimensionality Estimation within Tight Localities
Proceedings of the 2019 SIAM International Conference on Data Mining, 2019
Accurate estimation of Intrinsic Dimensionality (ID) is of crucial importance in many data mining and machine learning tasks, including dimensionality reduction, outlier detection, similarity search and subspace clustering. However, since their convergence generally requires sample sizes (that is, neighborhood sizes) on the order of hundreds of points, existing ID estimation methods may have only limited usefulness for applications in which the data consists of many natural groups of small size. In this paper, we propose a local ID estimation strategy stable even for 'tight' localities consisting of as few as 20 sample points. The estimator applies MLE techniques over all available pairwise distances among the members of the sample, based on a recent extreme-value-theoretic model of intrinsic dimensionality, the Local Intrinsic Dimension (LID). Our experimental results show that our proposed estimation technique can achieve notably smaller variance, while maintaining comparable levels of bias, at much smaller sample sizes than state-of-the-art estimators.
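One plausible reading of the tight-locality strategy in this abstract is to pool all pairwise distances within the small sample and apply the same Hill-type MLE to the pooled set; the published estimator includes further corrections that are omitted here, so this is only an illustrative sketch:

```python
import numpy as np
from itertools import combinations

def lid_pairwise_mle(points):
    """Illustrative sketch: pool ALL pairwise distances within a tiny
    sample (20 points -> 190 distances) and apply the Hill-type MLE
    to the pooled set. The published estimator includes corrections
    that are omitted here."""
    pts = np.asarray(points, dtype=float)
    d = np.array([np.linalg.norm(a - b) for a, b in combinations(pts, 2)])
    return -1.0 / np.mean(np.log(d / d.max()))

rng = np.random.default_rng(1)
sample = rng.standard_normal((20, 3))  # a 'tight' locality of 20 points
est = lid_pairwise_mle(sample)         # noisy, but computable at n = 20
```

The point of pooling is sample efficiency: a neighbourhood of n points yields n(n-1)/2 distances rather than n, which is what allows the variance reduction the abstract reports at small sample sizes.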
Maximum Likelihood Estimation of Intrinsic Dimension
2004
We propose a new method for estimating intrinsic dimension of a dataset derived by applying the principle of maximum likelihood to the distances between close neighbors. We derive the estimator by a Poisson process approximation, assess its bias and variance theoretically and by simulations, and apply it to a number of simulated and real datasets. We also show it has the best overall performance compared with two other intrinsic dimension estimators.
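The Levina-Bickel estimator has a closed form: with T_j(x) the distance from x to its j-th nearest neighbour, m_k(x) = [ (1/(k-1)) * sum_{j<k} ln(T_k(x)/T_j(x)) ]^(-1), averaged over the data. A compact sketch (brute-force distances, for illustration only):

```python
import numpy as np

def levina_bickel_id(X, k=10):
    """Levina-Bickel (2004) MLE of intrinsic dimension, averaged over
    the per-point estimates m_k(x)."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    local = []
    for row in D:
        T = np.sort(row)[1:k + 1]              # drop the self-distance
        local.append(1.0 / np.mean(np.log(T[-1] / T[:-1])))
    return float(np.mean(local))

rng = np.random.default_rng(2)
est = levina_bickel_id(rng.uniform(size=(500, 4)), k=10)  # data of ID 4
```

In practice a k-d tree or similar index replaces the O(n^2) distance matrix, and the estimate is often averaged over a range of k to reduce its sensitivity to that choice.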
Local intrinsic dimensionality estimators based on concentration of measure
2020
Intrinsic dimensionality (ID) is one of the most fundamental characteristics of multi-dimensional data point clouds. Knowing ID is crucial to choose the appropriate machine learning approach as well as to understand its behavior and validate it. ID can be computed globally for the whole data point distribution, or computed locally in different regions of the data space. In this paper, we introduce new local estimators of ID based on linear separability of multi-dimensional data point clouds, which is one of the manifestations of concentration of measure. We empirically study the properties of these estimators and compare them with other recently introduced ID estimators exploiting various effects of measure concentration. Observed differences between estimators can be used to anticipate their behaviour in practical applications.

An empirical evaluation of intrinsic dimension estimators
Information Systems, 2017
In this work, we study the behavior of different algorithms that attempt to estimate the intrinsic dimension (ID) in metric spaces. Some of these algorithms were developed specifically for evaluating the complexity of the search on metric spaces, based on different theories related to the distribution of distances between objects on such spaces. Others were designed originally only for vector spaces and they have been adapted so that they can be applied to metric spaces. To determine the goodness of the ID estimation obtained with each algorithm-or at least determine which one fits the best to the actual difficulty of the search process on the tested metric spaces-we make comparisons using two indices, one based on pivots and the other on compact partitions. This allows us to verify if the considered ID estimators reflect the actual hardness of searching over the considered spaces.
Data classification based on the local intrinsic dimension
2020
One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and classify the points accordingly. Our approach is computationally efficient, and can be proficiently used even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded vs unfolded configurations in a protein molecular dynamics trajectory, active vs non-active regions in brain imaging data, and firms with different financial ...
Journal of Emerging Technologies in Web Intelligence, 2013
The analysis of high-dimensional data is usually challenging since many standard modelling approaches tend to break down due to the so-called "curse of dimensionality". Dimension reduction techniques, which reduce the data set (explicitly or implicitly) to a smaller number of variables, make the data analysis more efficient and are furthermore useful for visualization purposes. However, most dimension reduction techniques require fixing the intrinsic dimension of the low-dimensional subspace in advance. The intrinsic dimension can be estimated by fractal dimension estimation methods, which exploit the intrinsic geometry of a data set. The most popular concept from this family of methods is the correlation dimension, which requires estimation of the correlation integral for a ball of radius tending to 0. In this paper we propose approaches to approximate the correlation integral in this limit. Experimental results on real world and simulated data are used to demonstrate the algorithms and compare to other methodology. A simulation study which verifies the effectiveness of the proposed methods is also provided.
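The correlation dimension discussed here is typically estimated Grassberger-Procaccia style: compute the correlation integral C(r), the fraction of point pairs within distance r, and read the dimension off the slope of log C(r) versus log r over a range of small radii. A sketch, assuming Euclidean data (the paper's contribution concerns the r -> 0 limit, which this naive version does not handle):

```python
import numpy as np

def correlation_dimension(X, radii):
    """Slope of log C(r) vs log r, where the correlation integral C(r)
    is the fraction of point pairs closer than r."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d = D[np.triu_indices(len(X), k=1)]        # upper-triangle distances
    C = np.array([np.mean(d < r) for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(C), 1)
    return slope

# A circle in R^3 is a 1-dimensional manifold; the slope should be ~1.
rng = np.random.default_rng(3)
t = rng.uniform(0.0, 2.0 * np.pi, 1000)
X = np.stack([np.cos(t), np.sin(t), np.zeros_like(t)], axis=1)
d2 = correlation_dimension(X, np.linspace(0.05, 0.3, 8))
```

The practical difficulty the paper addresses is visible in this sketch: as r shrinks, C(r) is estimated from ever fewer pairs, so the limit r -> 0 must be approximated rather than taken directly.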
The generalized ratios intrinsic dimension estimator
Scientific Reports
Modern datasets are characterized by numerous features related by complex dependency structures. To deal with these data, dimensionality reduction techniques are essential. Many of these techniques rely on the concept of intrinsic dimension (ID), a measure of the complexity of the dataset. However, the estimation of this quantity is not trivial: often, the ID depends rather dramatically on the scale of the distances among data points. At short distances, the ID can be grossly overestimated due to the presence of noise, becoming smaller and approximately scale-independent only at large distances. An immediate approach to examining the scale dependence consists in decimating the dataset, which unavoidably induces non-negligible statistical errors at large scale. This article introduces a novel statistical method that allows estimating the ID as an explicit function of the scale without performing any decimation. Our approach is based on rigorous distributional results that enable the quantifi...
Distributional Results for Model-Based Intrinsic Dimension Estimators
2021
Modern datasets are characterized by a large number of features that describe complex dependency structures. To deal with this type of data, dimensionality reduction techniques are essential. Numerous dimensionality reduction methods rely on the concept of intrinsic dimension, a measure of the complexity of the dataset. In this article, we first review the TWO-NN model, a likelihood-based intrinsic dimension estimator recently introduced by Facco et al. [2017]. Specifically, the TWO-NN estimator is based on the statistical properties of the ratio of the distances between a point and its first two nearest neighbors. We extend the TWO-NN theoretical framework by providing novel distributional results of consecutive and generic ratios of distances. These distributional results are then employed to derive intrinsic dimension estimators, called Cride and Gride. These novel estimators are more robust to noisy measurements than the TWO-NN and allow the study of the evolution of the intrins...
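The TWO-NN model that Cride and Gride extend admits a particularly short implementation in its MLE form: with mu_i = r_2/r_1 the ratio of each point's second- to first-nearest-neighbour distances, mu follows a Pareto law with exponent d, giving d_hat = N / sum_i ln(mu_i). A sketch (brute-force distances; the original paper also describes a regression-based fit that discards the largest ratios):

```python
import numpy as np

def two_nn_id(X):
    """TWO-NN (Facco et al., 2017), MLE form: with mu = r2/r1 the ratio
    of second- to first-nearest-neighbour distances, mu is Pareto with
    exponent d, so d_hat = N / sum_i ln(mu_i)."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    D.sort(axis=1)                   # row i: 0, r1, r2, ...
    mu = D[:, 2] / D[:, 1]
    return len(X) / np.sum(np.log(mu))

rng = np.random.default_rng(4)
est = two_nn_id(rng.uniform(size=(1000, 2)))   # data of ID 2
```

Using only the two nearest neighbours is what makes TWO-NN insensitive to density variations; the Cride and Gride estimators generalize the ratio to more distant neighbour pairs to trade a little locality for robustness to noise.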