Manifold Hypothesis in Data Analysis: Double Geometrically-Probabilistic Approach to Manifold Dimension Estimation
Related papers
Dimension estimation of image manifolds by minimal cover approximation
Neurocomputing, 2013
Estimating the intrinsic dimension of data is an important problem in feature extraction and feature selection, as it indicates the number of features worth retaining. Principal Component Analysis (PCA) is a powerful tool for discovering the dimension of data sets with a linear structure; however, it becomes ineffective when the data have a nonlinear structure. In this paper, we propose a new PCA-based method to estimate the embedding dimension of data with nonlinear structures. Our method works by first finding a minimal cover of the data set, then performing PCA locally on each subset in the cover to obtain local intrinsic dimension estimates, and finally reporting the average of the local estimates. There are two main innovations in our method: (1) a novel noise-filtering procedure applied within the local PCA step, and (2) a minimal cover constructed over the whole data set. Thanks to these two innovations, the method is fast, robust to noise and outliers, converges to a stable estimate over a wide range of sub-region sizes (a sub-region being the local approximation of the underlying manifold), and can be used incrementally. Experiments on synthetic and image data sets show the effectiveness of the proposed method.
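To make the pipeline concrete, here is a minimal sketch in Python of the cover-then-local-PCA idea. The greedy ball cover, the fixed radius, and the 95% cumulative-variance threshold are illustrative assumptions; in particular, this sketch omits the paper's noise-filtering procedure.

```python
import numpy as np

def local_pca_dimension(points, var_threshold=0.95):
    """Estimate the intrinsic dimension of a local patch by PCA: the
    smallest number of principal components whose cumulative explained
    variance exceeds var_threshold."""
    centered = points - points.mean(axis=0)
    # Eigenvalues of the local covariance matrix, largest first.
    eigvals = np.linalg.eigvalsh(np.cov(centered, rowvar=False))[::-1]
    ratios = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(ratios, var_threshold) + 1)

def greedy_cover_dimension(X, radius, var_threshold=0.95):
    """Greedy cover variant (an assumption, not the paper's exact cover
    construction): pick uncovered points as ball centers until every
    point is covered, estimate dimension locally on each ball, and
    return the average of the local estimates."""
    uncovered = np.ones(len(X), dtype=bool)
    local_dims = []
    while uncovered.any():
        center = X[np.argmax(uncovered)]           # first uncovered point
        in_ball = np.linalg.norm(X - center, axis=1) <= radius
        if in_ball.sum() > 1:                      # PCA needs >1 point
            local_dims.append(local_pca_dimension(X[in_ball], var_threshold))
        uncovered &= ~in_ball
    return float(np.mean(local_dims))

# Example: points near a 2-D plane embedded in R^5 with mild noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2)) @ rng.normal(size=(2, 5))
X += 0.01 * rng.normal(size=X.shape)
print(greedy_cover_dimension(X, radius=1.0))       # expect a value near 2
```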
2005
We address dimensionality estimation and nonlinear manifold inference, starting from point inputs in high-dimensional spaces, using tensor voting. The proposed method operates locally in neighborhoods and does not involve any global computations. It is based on information propagation among neighboring points, implemented as a voting process. Unlike other local approaches to manifold learning, the quantity propagated from one point to another is not a scalar but a tensor, which provides considerably richer information. The accumulation of votes at each point provides a reliable estimate of local dimensionality, as well as of the orientation of a potential manifold passing through the point. Reliable dimensionality estimation at the point level is a major advantage over competing methods. Moreover, the absence of global operations allows us to process significantly larger datasets. We demonstrate the effectiveness of our method on a variety of challenging datasets.
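As a rough illustration of the voting idea (heavily simplified: no distance-decay weighting and no stick/plate tensor decomposition, just accumulated rank-1 votes from neighbors), one can read the local dimension off the eigenvalue spectrum of the accumulated tensor. All names and parameters below are assumptions, not the authors' implementation.

```python
import numpy as np

def voting_dimension(X, i, k=20):
    """Tensor-voting-style local dimension estimate at X[i]: each of the
    k nearest neighbors casts a rank-1 vote (the outer product of the
    unit direction toward it); the largest gap in the eigenvalue
    spectrum of the accumulated tensor separates tangent directions
    from normal directions."""
    dists = np.linalg.norm(X - X[i], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]     # skip the point itself
    T = np.zeros((X.shape[1], X.shape[1]))
    for j in neighbors:
        v = (X[j] - X[i]) / dists[j]
        T += np.outer(v, v)                    # rank-1 vote
    eigvals = np.linalg.eigvalsh(T)[::-1]      # descending
    gaps = eigvals[:-1] - eigvals[1:]
    # Number of eigenvalues before the largest gap = tangent dimension
    # (assumes the manifold dimension is below the ambient dimension).
    return int(np.argmax(gaps) + 1)
```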
On Local Intrinsic Dimension Estimation and Its Applications
In this paper, we present multiple novel applications of local intrinsic dimension estimation. Much work has been done on estimating the global dimension of a data set, typically for the purposes of dimensionality reduction. We show that by estimating dimension locally, we are able to extend the uses of dimension estimation to many applications that are not possible with global dimension estimation. Additionally, we show that local dimension estimation can be used to obtain a better global dimension estimate, alleviating the negative bias that is common to all known dimension estimation algorithms. We illustrate the use of local dimension estimation in further applications, such as learning on statistical manifolds, network anomaly detection, clustering, and image segmentation.
De-biasing local dimension estimation
Many algorithms have been proposed for estimating the intrinsic dimension of high-dimensional data. A phenomenon common to all of them is a negative bias, perceived to be the result of undersampling. We propose improved methods for estimating intrinsic dimension that take manifold boundaries into consideration. By estimating dimension locally, we are able to analyze and reduce the effect that sample data depth has on the negative bias. Additionally, we offer improvements to an existing dimension estimation algorithm based on k-nearest neighbor graphs, and we offer an algorithm for adapting any dimension estimation algorithm to operate locally. Finally, we illustrate the uses of local dimension estimation on data sets consisting of multiple manifolds, including applications such as diagnosing anomalies in router networks and image segmentation.
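The "adapt any estimator to operate locally" idea can be sketched generically: run a chosen global estimator on each point's k-nearest-neighbor neighborhood. The wrapper below and the toy PCA-based global estimator are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def pca_dim(points, var_threshold=0.95):
    """Toy global estimator: number of principal components needed to
    reach 95% cumulative explained variance."""
    eigvals = np.linalg.eigvalsh(np.cov(points, rowvar=False))[::-1]
    return int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(),
                               var_threshold) + 1)

def localize(global_estimator, X, k=50):
    """Adapt any global dimension estimator to operate locally: apply
    it to the k-NN neighborhood of every point and return one local
    estimate per point (useful when the data lie on several manifolds
    of different dimension)."""
    local = np.empty(len(X))
    for i in range(len(X)):
        nbrs = np.argsort(np.linalg.norm(X - X[i], axis=1))[:k]
        local[i] = global_estimator(X[nbrs])
    return local
```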
Estimating Local Intrinsic Dimension with k-Nearest Neighbor Graphs
Many high-dimensional data sets of practical interest exhibit varying complexity in different parts of the data space. This is the case, for example, for databases of images containing many samples of a few textures of different complexity. Such phenomena can be modeled by assuming that the data lie on a collection of manifolds with different intrinsic dimensionalities. In this extended abstract, we introduce a method to estimate the local dimensionality associated with each point in a data set, without any prior information about the manifolds, their number, or their sampling distributions. The proposed method uses a global dimensionality estimator based on k-nearest neighbor (k-NN) graphs, together with an algorithm for computing neighborhoods in the data with similar topological properties.
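The k-NN-graph estimator itself is involved; as a stand-in, here is a related k-NN-distance method, the Levina-Bickel maximum-likelihood estimator, which likewise produces a per-point local dimension from nearest-neighbor distances. This is a different but related technique, shown purely for illustration.

```python
import numpy as np

def mle_dimension(X, k=10):
    """Levina-Bickel MLE of intrinsic dimension from k-NN distances:
    for each point, the inverse of the mean of log(T_k / T_j) over the
    first k-1 neighbor distances T_j (assumes no duplicate points)."""
    n = len(X)
    est = np.empty(n)
    for i in range(n):
        d = np.sort(np.linalg.norm(X - X[i], axis=1))[1:k + 1]  # skip self
        est[i] = 1.0 / np.mean(np.log(d[-1] / d[:-1]))
    return est  # one local estimate per point; average for a global value
```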
Recent Advances in Nonlinear Dimensionality Reduction, Manifold and Topological Learning
The ever-growing amount of data stored in digital databases raises the question of how to organize it and extract useful knowledge. This paper outlines some current developments in the domains of dimensionality reduction, manifold learning, and topological learning. Several aspects are dealt with, ranging from novel algorithmic approaches to their real-world applications. The issue of quality assessment is also considered, and progress in quantitative as well as visual criteria is reported.
2023
Dimensionality reduction techniques are crucial for modern data analysis applications due to the large complexity of the datasets found in several domains of science. In order to find relevant patterns in this vast amount of data, a feature extraction step is often required. Linear approaches were the first class of methods applied to reduce the dimensionality of data. However, they assume that the extracted features lie in a Euclidean space, which is not a reasonable assumption in many real problems. In 2000, with advances in kernel methods and computational resources, the research field of manifold learning started to emerge with the pioneering ISOMAP algorithm, providing a much more realistic model of dimensionality reduction for metric learning. This paper presents a review of unsupervised metric learning techniques ranging from Principal Component Analysis (PCA) to modern manifold learning algorithms such as t-SNE, Local Tangent Space Alignment, and Diffusion Maps, among others, with special attention to the mathematical background while also trying to elucidate the intuition behind each method. This year we celebrate the 20th anniversary of this remarkable field, which has had a profound impact on the study of unsupervised metric learning methods, in the sense that these methods can learn a distance function that is geometrically better suited to represent a similarity measure between a pair of objects in a given dataset. Our goal with this review is to provide a complete guide for researchers interested in entering the field.
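Since ISOMAP is singled out as the field's pioneering algorithm, a compact sketch of its three steps (k-NN graph, shortest-path geodesics, classical MDS) may help. The parameter choices below are illustrative, and the sketch assumes the k-NN graph is connected.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_components=2, k=10):
    """Minimal ISOMAP sketch: geodesic distances approximated by
    shortest paths over a k-NN graph, then embedded with classical
    MDS (double-centering plus top eigenvectors)."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    # Keep only each point's k nearest neighbors as graph edges.
    G = np.full((n, n), np.inf)
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]
        G[i, nbrs] = D[i, nbrs]
    geo = shortest_path(G, method='D', directed=False)  # Dijkstra
    # Classical MDS on squared geodesic distances.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (geo ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, idx] * np.sqrt(eigvals[idx])
```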
Flexible manifold embedding: A framework for semi-supervised and unsupervised dimension reduction
2010
We propose a unified manifold learning framework for semi-supervised and unsupervised dimension reduction by employing a simple but effective linear regression function to map the new data points. For semi-supervised dimension reduction, we aim to find the optimal prediction labels F for all the training samples X, the linear regression function h(X), and the regression residue F_0 = F − h(X) simultaneously. Our new objective function integrates two terms related to label fitness and manifold smoothness, as well as a flexible penalty term defined on the residue F_0. Our semi-supervised learning framework, referred to as flexible manifold embedding (FME), can effectively utilize label information from labeled data as well as the manifold structure of both labeled and unlabeled data. By modeling the mismatch between h(X) and F, we show that FME relaxes the hard linear constraint F = h(X) used in manifold regularization (MR), making it better able to cope with data sampled from a nonlinear manifold. In addition, we propose a simplified version (referred to as FME/U) for unsupervised dimension reduction. We also show that our proposed framework provides a unified view for explaining and understanding many semi-supervised, supervised, and unsupervised dimension reduction techniques. Comprehensive experiments on several benchmark databases demonstrate a significant improvement over existing dimension reduction algorithms.
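Schematically, an FME-style objective combines the three ingredients named above. The following form is a plausible rendering under assumed notation (Y the given labels, U a diagonal weight selecting labeled points, L a graph Laplacian, and h(X) = X^T W + 1 b^T the linear regression), not necessarily the paper's exact formulation:

```latex
\min_{F,\,W,\,b}\;
\underbrace{\operatorname{tr}\!\big((F-Y)^{\top} U (F-Y)\big)}_{\text{label fitness}}
+\underbrace{\operatorname{tr}\!\big(F^{\top} L F\big)}_{\text{manifold smoothness}}
+\mu\Big(\lVert W\rVert_F^{2}
+\gamma\,\underbrace{\lVert X^{\top} W + \mathbf{1}b^{\top} - F\rVert_F^{2}}_{\text{penalty on the residue } F_0}\Big)
```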
Intrinsic Dimension Estimation: Relevant Techniques and a Benchmark Framework
Mathematical Problems in Engineering, 2015
When dealing with datasets comprising high-dimensional points, it is usually advantageous to discover some structure in the data. A fundamental piece of information needed for this aim is the minimum number of parameters required to describe the data while minimizing the information loss. This number, usually called the intrinsic dimension, can be interpreted as the dimension of the manifold from which the input data are supposed to be drawn. Due to its usefulness in many theoretical and practical problems, in recent decades the concept of intrinsic dimension has gained considerable attention in the scientific community, motivating the large number of intrinsic dimensionality estimators proposed in the literature. However, the problem is still open, since most techniques cannot efficiently deal with datasets drawn from manifolds of high intrinsic dimension that are nonlinearly embedded in higher-dimensional spaces. This paper surveys some of the most interesting, widely used, and advanced state-of-the-art estimators.
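As one concrete example of a classic global estimator from this literature, here is a two-radius approximation of the Grassberger-Procaccia correlation dimension. The radii and the two-point slope are simplifying assumptions; in practice one fits the slope of log C(r) versus log r over a range of radii.

```python
import numpy as np

def correlation_dimension(X, r1, r2):
    """Correlation dimension estimate: C(r) is the fraction of point
    pairs closer than r, and the dimension is the slope of log C(r)
    versus log r, approximated here with two radii r1 < r2."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    pair = D[np.triu_indices(n, k=1)]          # distinct pairs only
    C = lambda r: np.mean(pair < r)
    return (np.log(C(r2)) - np.log(C(r1))) / (np.log(r2) - np.log(r1))

# Example: points on a unit circle, a 1-D manifold in R^2.
t = np.random.default_rng(1).uniform(0, 2 * np.pi, 1000)
circle = np.column_stack([np.cos(t), np.sin(t)])
print(correlation_dimension(circle, r1=0.1, r2=0.5))  # expect a value near 1
```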