Neural networks for estimating intrinsic dimension

On the Effects of Dimensionality on Data Analysis with Neural Networks

2003

Modern data analysis often faces high-dimensional data. Nevertheless, most neural network data analysis tools are not adapted to high-dimensional spaces, because they rely on conventional concepts (such as the Euclidean distance) that scale poorly with dimension. This paper shows some limitations of such concepts and suggests research directions such as the use of alternative distance definitions and of non-linear dimension reduction.
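As a quick illustration of the distance-concentration effect the abstract alludes to, the following numpy sketch (not from the paper; sample size and dimensions are arbitrary choices) shows how the relative contrast between the nearest and farthest neighbor collapses as dimension grows:

```python
# Sketch: concentration of Euclidean distances as dimension grows.
# Illustrative only; n and the dimension list are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(n, d))                   # n points in the unit hypercube
    dists = np.linalg.norm(X - X[0], axis=1)[1:]   # distances from one point (drop self)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative contrast={contrast:.3f}")
```

As d increases the contrast tends toward zero, so nearest-neighbor queries based on the Euclidean distance become increasingly uninformative.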

An approach to dimensionality reduction in time series

Information Sciences, 2014

Many methods for the dimensionality reduction of data series (time series) have been introduced over the past decades. Some of them rely on a symbolic representation of the original data; however, in this case the obtained dimensionality reduction is not substantial. In this paper, we introduce a new approach, referred to as Symbolic Essential Attributes Approximation (SEAA), to reduce the dimensionality of multidimensional time series, forming a new nominal representation of the original data series. The approach is based on the concept of data series envelopes and essential attributes generated by a multilayer neural network. The real-valued attributes are discretized, and in this way a symbolic data series representation is formed. SEAA generates a vector of nominal values of new attributes which forms the compressed representation of the original data series. The nominal attributes are synthetic and, while not directly interpretable, they still retain important features of the original data series. The usefulness of the proposed dimensionality reduction is validated on classification and clustering tasks. The experiments have shown that even for a significant reduction of dimensionality, the new representation retains enough information about the data series for classification and clustering of the time series.
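A minimal sketch of the discretization step only: real-valued attributes (such as the outputs of a neural network) are mapped to nominal symbols via quantile binning. The bin count and the data here are illustrative assumptions; SEAA's envelope construction and network are not reproduced.

```python
# Sketch: discretizing real-valued attributes into nominal symbols.
# `n_bins` is an arbitrary choice, not SEAA's actual setting.
import numpy as np

def discretize(attributes: np.ndarray, n_bins: int = 4) -> np.ndarray:
    """Map each real-valued attribute column to symbols 0..n_bins-1."""
    symbols = np.empty_like(attributes, dtype=int)
    for j in range(attributes.shape[1]):
        edges = np.quantile(attributes[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
        symbols[:, j] = np.digitize(attributes[:, j], edges)
    return symbols

# Example: 100 series compressed to 3 real-valued attributes, then symbolized.
rng = np.random.default_rng(1)
z = rng.normal(size=(100, 3))
print(discretize(z)[:5])
```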

An Additive Autoencoder for Dimension Estimation

arXiv (Cornell University), 2022

An additive autoencoder for dimension reduction, composed of a serially performed bias estimation, linear trend estimation, and nonlinear residual estimation, is proposed and analyzed. Computational experiments confirm that an autoencoder of this form, with only a shallow network to encapsulate the nonlinear behavior, is able to identify the intrinsic dimension of a dataset with a low autoencoding error. This observation leads to an investigation comparing shallow and deep network structures and how they are trained. We conclude that deeper network structures obtain lower autoencoding errors while identifying the intrinsic dimension, but that the detected dimension does not change compared to a shallow network.
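A sketch of the additive idea, covering only the first two stages (bias and linear trend, via numpy SVD); the paper's nonlinear residual stage, handled by a shallow network, is omitted, and the dataset is a synthetic placeholder:

```python
# Sketch: bias estimation + rank-k linear trend; the residual error as a
# function of k flags the (linear) intrinsic dimension. Illustrative only.
import numpy as np

rng = np.random.default_rng(2)
# Data on a noisy 3-dimensional linear subspace embedded in 20 dimensions.
Z = rng.normal(size=(500, 3))
A = rng.normal(size=(3, 20))
X = 5.0 + Z @ A + 0.01 * rng.normal(size=(500, 20))

bias = X.mean(axis=0)                      # stage 1: bias estimation
Xc = X - bias
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
for k in range(1, 8):                      # stage 2: rank-k linear trend
    trend = (U[:, :k] * s[:k]) @ Vt[:k]
    err = np.mean((Xc - trend) ** 2)       # residual left for a nonlinear stage
    print(f"k={k}  residual MSE={err:.5f}")
# The error drops sharply at k = 3, the intrinsic dimension of this toy data.
```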

A journey into low-dimensional spaces with autoassociative neural networks

Talanta, 2003

Data compression and visualization have always been a subject of great interest. Since multidimensional data sets are difficult to interpret and visualize, much attention is devoted to compressing them efficiently. Usually, dimensionality reduction is considered the first step of exploratory data analysis. Here, we focus our attention on autoassociative neural networks (ANNs), which in a very elegant manner provide data compression and visualization. ANNs can deal with linear and nonlinear correlation among variables, which makes them a very powerful tool in exploratory data analysis. In the literature, ANNs are often referred to as nonlinear principal component analysis (PCA), and due to their specific structure they are also known as bottleneck neural networks. In this paper, ANNs are discussed in detail. Different training modes are described and illustrated on real examples. The usefulness of ANNs for nonlinear data compression and visualization is demonstrated with the aid of chemical data sets. A comparison of ANNs with the well-known PCA is also presented.
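A minimal bottleneck ("autoassociative") network sketch in PyTorch; the layer sizes, activation, and training settings below are illustrative assumptions, not the paper's configuration:

```python
# Sketch: an autoassociative (bottleneck) network trained to reproduce its
# input; the bottleneck activations give nonlinear low-dimensional scores.
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, n_in: int, n_hidden: int, n_bottleneck: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.Tanh(),
            nn.Linear(n_hidden, n_bottleneck),   # nonlinear "scores"
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_bottleneck, n_hidden), nn.Tanh(),
            nn.Linear(n_hidden, n_in),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

X = torch.randn(256, 10)                     # placeholder data
model = Bottleneck(n_in=10, n_hidden=16, n_bottleneck=2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):                         # train to reproduce the input
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), X)
    loss.backward()
    opt.step()
scores = model.encoder(X).detach()           # 2-D coordinates for visualization
```

A 2-unit bottleneck plays the role of the first two (nonlinear) principal components, which is why such networks are described as nonlinear PCA.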

Intrinsic dimensionality estimation with optimally topology preserving maps

IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998

A new method for analyzing the intrinsic dimensionality (ID) of low-dimensional manifolds in high-dimensional feature spaces is presented. The basic idea is to first extract a low-dimensional representation that captures the intrinsic topological structure of the input data and then to analyze this representation, i.e. estimate the intrinsic dimensionality. More specifically, the representation we extract is an optimally topology preserving feature map (OTPM), which is an undirected parametrized graph with a pointer in the input space associated with each node. Estimation of the intrinsic dimensionality is based on local PCA of the pointers of the nodes in the OTPM and their direct neighbors. The method has a number of important advantages compared with previous approaches. First, it can be shown to have only linear time complexity w.r.t. the dimensionality of the input space, in contrast to conventional PCA-based approaches, which have cubic complexity and hence become computationally impracticable for high-dimensional input spaces. Second, it is less sensitive to noise than former approaches, and, finally, the extracted representation can be directly used for further data processing tasks, including auto-association and classification. Experiments include ID estimation of synthetic data for illustration as well as ID estimation of a sequence of full-scale images.
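A sketch of the local-PCA step: the ID at each point is read off from the eigenvalues of its neighborhood covariance. Plain k-nearest neighbors stand in here for the OTPM graph, and `k` and the eigenvalue threshold `alpha` are illustrative choices:

```python
# Sketch: local PCA intrinsic dimension estimation on k-NN neighborhoods.
import numpy as np

def local_pca_id(X: np.ndarray, k: int = 10, alpha: float = 0.05) -> float:
    dims = []
    for i in range(X.shape[0]):
        d = np.linalg.norm(X - X[i], axis=1)
        nbrs = X[np.argsort(d)[1:k + 1]]              # k nearest neighbors of X[i]
        cov = np.cov((nbrs - nbrs.mean(axis=0)).T)
        ev = np.sort(np.linalg.eigvalsh(cov))[::-1]
        dims.append(int(np.sum(ev > alpha * ev[0])))  # count significant eigenvalues
    return float(np.mean(dims))

# Points on the unit sphere in 3-D: a 2-dimensional manifold.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
print(local_pca_id(X))    # close to 2
```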

Novel high intrinsic dimensionality estimators

Machine Learning, 2012

Recently, a great deal of research work has been devoted to the development of algorithms that estimate the intrinsic dimensionality (id) of a given dataset, that is, the minimum number of parameters needed to represent the data without information loss. id estimation is important for the following reasons: the capacity and the generalization capability of discriminant methods depend on it; id is necessary information for any dimensionality reduction technique; in neural network design, the number of hidden units in the encoding middle layer should be chosen according to the id of the data; and the id value is strongly related to the model order of a time series, which is crucial for obtaining reliable time series predictions. Although many estimation techniques have been proposed in the literature, most of them fail on noisy data or compute underestimated values when the id is sufficiently high. In this paper, after reviewing some of the most important id estimators related to our work, we provide a theoretical motivation for the bias that causes the underestimation effect, and we present two id estimators based on the statistical properties of manifold neighborhoods, developed to reduce this effect. We exhaustively evaluate the proposed techniques on synthetic and real datasets, employing an objective evaluation measure to compare their performance with that achieved by state-of-the-art algorithms; the results show that the proposed methods are promising and produce reliable estimates even in the difficult case of datasets drawn from non-linearly embedded manifolds characterized by high id.
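The underestimation effect the paper targets is easy to reproduce with a baseline estimator. The sketch below applies the classical Levina-Bickel k-NN maximum likelihood estimator (not the paper's estimators) to uniform samples of growing dimension; `k` and the sample size are arbitrary choices:

```python
# Sketch: the k-NN MLE of intrinsic dimension underestimates at high id.
import numpy as np

def mle_id(X: np.ndarray, k: int = 10) -> float:
    ids = []
    for i in range(X.shape[0]):
        d = np.sort(np.linalg.norm(X - X[i], axis=1))[1:k + 1]   # T_1..T_k
        ids.append((k - 1) / np.sum(np.log(d[-1] / d[:-1])))
    return float(np.mean(ids))

rng = np.random.default_rng(4)
for d in (2, 5, 10, 20, 50):
    X = rng.uniform(size=(2000, d))
    print(f"true id={d:3d}  estimated={mle_id(X):6.2f}")   # gap widens with d
```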

IDEA: Intrinsic Dimension Estimation Algorithm

Lecture Notes in Computer Science, 2011

The high dimensionality of some real-life signals makes the most common signal processing and pattern recognition methods unfeasible. For this reason, a great deal of research work in the literature has been devoted to the development of algorithms performing dimensionality reduction. To this aim, useful guidance can be provided by an estimate of the intrinsic dimensionality of a given dataset, that is, the minimum number of parameters needed to capture and describe all the information carried by the data. Although many techniques have been proposed, most of them fail in the case of noisy data or when the intrinsic dimensionality is too high. In this paper we propose a local intrinsic dimension estimator exploiting the statistical properties of data neighborhoods. The evaluation of the algorithm on both synthetic and real datasets, and the comparison with state-of-the-art algorithms, shows that the proposed technique is promising.

Estimating the Embedding Dimension Distribution of Time Series with SOMOS

Lecture Notes in Computer Science, 2009

The paper proposes a new method to estimate the distribution of the embedding dimension associated with a time series, using the Self Organizing Map decision taken in Output Space (SOMOS) dimensionality reduction neural network. It is shown that SOMOS, besides estimating the embedding dimension, also provides an approximation of the overall distribution of that dimension over the set where the time series evolves. Such an estimate can be employed to select a proper window size in different predictor schemes; it can also provide a measure of future predictability at a given instant of time. The results are illustrated via the analysis of time series generated from both the chaotic Hénon map and the Lorenz system.
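For context, a sketch of the delay-embedding construction whose dimension is being estimated; SOMOS itself is not reproduced here. The Hénon parameters are the standard chaotic values, and `m` and `tau` are illustrative choices:

```python
# Sketch: build delay vectors from a Hénon-map time series.
import numpy as np

def henon(n: int, a: float = 1.4, b: float = 0.3) -> np.ndarray:
    x, y = 0.1, 0.0
    out = np.empty(n)
    for i in range(n):
        x, y = 1 - a * x * x + y, b * x
        out[i] = x
    return out

def delay_embed(series: np.ndarray, m: int, tau: int = 1) -> np.ndarray:
    """Rows are m-dimensional delay vectors [s_t, s_{t+tau}, ..., s_{t+(m-1)tau}]."""
    n = len(series) - (m - 1) * tau
    return np.column_stack([series[i * tau: i * tau + n] for i in range(m)])

s = henon(2000)
E = delay_embed(s, m=3)    # candidate embedding; SOMOS would estimate m
print(E.shape)             # (1998, 3)
```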

Optimization of the Maximum Likelihood Estimator for Determining the Intrinsic Dimensionality of High-Dimensional Data

International Journal of Applied Mathematics and Computer Science, 2015

One of the problems in the analysis of a set of images of a moving object is to evaluate the degrees of freedom of the motion and the angle of rotation. Here the intrinsic dimensionality of the multidimensional data characterizing the set of images can be used. Usually, an image may be represented by a high-dimensional point whose dimensionality depends on the number of pixels in the image. Knowledge of the intrinsic dimensionality of a data set is very useful information in exploratory data analysis, because it makes it possible to reduce the dimensionality of the data without losing much information. In this paper, the maximum likelihood estimator (MLE) of the intrinsic dimensionality is explored experimentally. In contrast to previous works, the radius of a hypersphere which covers neighbours of the analysed points is fixed, instead of the number of nearest neighbours in the MLE. A way of choosing the radius in this method is proposed. We explore which metric, Euclidean or geodesic, should be used in the MLE algorithm.
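A sketch of the fixed-radius form of the MLE: with N(R, x) sample points inside a hypersphere of radius R around x, at distances T_1 <= ... <= T_N, the local estimate is m(x) = [(1/N) * sum_j log(R / T_j)]^(-1), averaged over all x. The radius rule below (a small quantile of observed distances) is an illustrative assumption, not the paper's proposed choice:

```python
# Sketch: fixed-radius maximum likelihood estimate of intrinsic dimension.
import numpy as np

def mle_id_radius(X: np.ndarray, R: float) -> float:
    ids = []
    for i in range(X.shape[0]):
        d = np.linalg.norm(X - X[i], axis=1)
        T = d[(d > 0) & (d < R)]              # neighbours inside the hypersphere
        if T.size:                            # skip points with empty spheres
            ids.append(T.size / np.sum(np.log(R / T)))
    return float(np.mean(ids))

rng = np.random.default_rng(5)
X = rng.uniform(size=(2000, 5))               # true intrinsic dimension: 5
# Crude radius choice: put roughly 1% of the sample inside each sphere.
R = np.quantile(np.linalg.norm(X - X[0], axis=1)[1:], 0.01)
print(mle_id_radius(X, R))    # roughly recovers 5 (boundary effects bias it low)
```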