direpack: A Python 3 package for state-of-the-art statistical dimensionality reduction methods

Scikit-Dimension: A Python Package for Intrinsic Dimension Estimation

Entropy

Dealing with uncertainty in applications of machine learning to real-life data critically depends on the knowledge of intrinsic dimensionality (ID). A number of methods have been suggested for the purpose of estimating ID, but no standard package to easily apply them one by one or all at once has been implemented in Python. This technical note introduces scikit-dimension, an open-source Python package for intrinsic dimension estimation. The scikit-dimension package provides a uniform implementation of most of the known ID estimators based on the scikit-learn application programming interface to evaluate the global and local intrinsic dimension, as well as generators of synthetic toy and benchmark datasets widespread in the literature. The package is developed with tools assessing the code quality, coverage, unit testing and continuous integration. We briefly describe the package and demonstrate its use in a large-scale (more than 500 datasets) benchmarking of methods for ID estimati...
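To make the idea of a global intrinsic dimension estimate concrete, the following is a minimal NumPy-only sketch of a PCA-based heuristic: it counts how many principal components are needed to explain a chosen share of the variance. It does not use the scikit-dimension API, and the function name `pca_intrinsic_dimension` and the 95% threshold are assumptions made for illustration.

```python
import numpy as np

def pca_intrinsic_dimension(X, variance_threshold=0.95):
    """Hypothetical helper: number of principal components needed to
    reach `variance_threshold` of the total variance."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    # Singular values of the centered data give the covariance eigenvalues.
    singular_values = np.linalg.svd(Xc, compute_uv=False)
    eigenvalues = singular_values**2 / (X.shape[0] - 1)
    explained = np.cumsum(eigenvalues) / eigenvalues.sum()
    # Index of the first component at which the threshold is reached.
    return int(np.searchsorted(explained, variance_threshold) + 1)

# Toy data: a 3-dimensional Gaussian embedded isometrically in 10 dimensions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
basis, _ = np.linalg.qr(rng.normal(size=(10, 3)))  # orthonormal columns
X = latent @ basis.T + 1e-3 * rng.normal(size=(500, 10))

estimated_dim = pca_intrinsic_dimension(X)
print(estimated_dim)
```

A variance threshold is only one of many possible decision rules, and it works here because the toy data lie on a linear subspace; curved manifolds generally call for local or nonlinear estimators.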

LDR : A Package for Likelihood-Based Sufficient Dimension Reduction

Journal of Statistical Software, 2011

We introduce a new MATLAB software package that implements several recently proposed likelihood-based methods for sufficient dimension reduction. Current capabilities include estimation of reduced subspaces with a fixed dimension d, as well as estimation of d by use of likelihood-ratio testing, permutation testing and information criteria. The methods are suitable for preprocessing data for both regression and classification. Implementations of related estimators are also available. Although the software is more oriented to command-line operation, a graphical user interface is also provided for prototype computations.

Dimension estimation in sufficient dimension reduction: A unifying approach

Journal of Multivariate Analysis, 2011

Sufficient Dimension Reduction (SDR) in regression comprises the estimation of the dimension of the smallest (central) dimension reduction subspace and its basis elements. For SDR methods based on a kernel matrix, such as SIR and SAVE, the dimension estimation is equivalent to the estimation of the rank of a random matrix which is the sample-based estimate of the kernel. A test for the rank of a random matrix amounts to testing how many of its eigenvalues or singular values are equal to zero. We propose two tests based on the smallest eigenvalues or singular values of the estimated matrix: an asymptotic weighted chi-square test and a Wald-type asymptotic chi-square test. We also provide an asymptotic chi-square test for assessing whether elements of the left singular vectors of the random matrix are zero. These methods together constitute a unified approach for all SDR methods based on a kernel matrix that covers estimation of the central subspace and its dimension, as well as assessment of variable contribution to the lower-dimensional predictor projections, with variable selection as a special case. A small power simulation study shows that the proposed and existing tests, specific to each SDR method, perform similarly with respect to power and achievement of the nominal level. Also, the importance of the choice of the number of slices as a tuning parameter is further exhibited.
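The link between dimension estimation and the rank of a kernel matrix can be illustrated with a small NumPy sketch of the sample SIR kernel. This is an illustration under stated assumptions, not the paper's procedure: the function name `sir_kernel_eigenvalues` is hypothetical, the slicing scheme (equal-sized slices of the sorted response) is one common choice, and none of the proposed chi-square tests is implemented. The sketch only computes the eigenvalues whose closeness to zero such tests would assess.

```python
import numpy as np

def sir_kernel_eigenvalues(X, y, n_slices=10):
    """Hypothetical helper: eigenvalues (descending) of the sample SIR
    kernel, i.e. the weighted covariance of slice means of standardized X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # Standardize the predictors: Z = (X - mean) @ Sigma^{-1/2}.
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = Xc @ inv_sqrt
    # Slice the observations by the ordered response.
    order = np.argsort(y)
    kernel = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        slice_mean = Z[idx].mean(axis=0)
        kernel += (len(idx) / n) * np.outer(slice_mean, slice_mean)
    return np.linalg.eigvalsh(kernel)[::-1]

# Toy regression in which the response depends on one direction only.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = X[:, 0] + 0.1 * rng.normal(size=2000)

eigenvalues = sir_kernel_eigenvalues(X, y)
print(np.round(eigenvalues, 3))
```

In this toy setting one eigenvalue dominates and the rest are close to zero, which is consistent with a central subspace of dimension one. The number of slices is a tuning parameter, as the abstract notes.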

Least Squares Regression Principal Component Analysis

2020

Dimension reduction is an important technique in surrogate modeling and machine learning. In this thesis, we present three existing dimension reduction methods in detail and then propose a novel supervised dimension reduction method, "Least Squares Regression Principal Component Analysis" (LSR-PCA), applicable to both classification and regression dimension reduction tasks. To show the efficacy of this method, we present different examples in visualization, classification and regression problems, comparing it to state-of-the-art dimension reduction methods. Furthermore, we present the kernel version of LSR-PCA for problems where the inputs are correlated non-linearly. The examples demonstrate that LSR-PCA can be a competitive dimension reduction method. I would like to express my gratitude to my thesis supervisor, Professor Xin Yee. I would like to thank her for giving me this wonderful opportunity and for her guidance and support throughout this semester, putting herself at my disposal during the difficult times of the COVID-19 situation. Without her, this thesis would not have been possible. I would like to extend my thanks to Mr. Pere Balsells, for allowing students like me to conduct their thesis abroad, as well as to the Balsells Foundation for its help and support throughout the whole stay. In addition, I would like to express my thanks to the second supervisor of this thesis, Professor Joan Torras, for helping me in the final stretch of the project, being as helpful as he was attentive. Finally, I wish to express my most sincere appreciation to my parents and my sister, Alicia, and to my friends, for their support and encouragement during the whole stay.

Principal Fitted Components for Dimension Reduction in Regression

2009

We provide a remedy for two concerns that have dogged the use of principal components in regression: (i) principal components are computed from the predictors alone and do not make apparent use of the response, and (ii) principal components are not invariant or equivariant under full rank linear transformation of the predictors. The development begins with principal fitted components [Cook,

Toward a Quantitative Survey of Dimension Reduction Techniques

IEEE Transactions on Visualization and Computer Graphics, 2019

Dimensionality reduction methods, also known as projections, are frequently used in multidimensional data exploration in machine learning, data science, and information visualization. Dozens of such techniques have been proposed, aiming to address a wide set of requirements, such as the ability to show the high-dimensional data structure, distance or neighborhood preservation, computational scalability, stability to data noise and/or outliers, and practical ease of use. However, it is far from clear to practitioners how to choose the best technique for a given use context. We present a survey of a wide body of projection techniques that helps answer this question. For this, we characterize the input data space, projection techniques, and the quality of projections by several quantitative metrics. We sample these three spaces according to these metrics, aiming at good coverage with bounded effort. We describe our measurements and outline observed dependencies of the measured variables. Based on these results, we draw several conclusions that help compare projection techniques, explain their results for different types of data, and ultimately help practitioners when choosing a projection for a given context. Our methodology, datasets, projection implementations, metrics, visualizations, and results are publicly open, so interested stakeholders can examine and/or extend this benchmark.

Some statistical methods for dimension reduction

2013

Chapter 2. A comparative study for robust canonical correlation methods. The purpose of this chapter is to obtain robust canonical correlation analysis (RCCA) methods. For the correlation matrix, we present an approach that replaces the Pearson correlation with the percentage bend correlation and the winsorised correlation in order to obtain robust correlation matrices. Moreover, the fast consistent high breakdown (FCH), reweighted fast consistent high breakdown (RFCH) and reweighted multivariate normal (RMVN) estimators are employed to obtain robust covariance matrices in the canonical correlation analysis (CCA). Simulation studies are conducted and real data are employed in order to compare the performance of the proposed approaches with the existing methods. The breakdown plots and independent tests are employed as criteria of the robustness and performance of the estimators. Based on the computational studies and the real data example, suggestions on the practical implications of the results are proposed.
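As an illustration of why replacing the Pearson correlation can help, here is a minimal NumPy sketch of a winsorised correlation: each variable is winsorised separately and the Pearson correlation is then computed on the winsorised values. The helper names and the 10% winsorising proportion are assumptions for this example and are not taken from the chapter.

```python
import numpy as np

def winsorize(values, proportion=0.1):
    """Hypothetical helper: pull the g smallest and g largest observations
    in to the nearest remaining order statistics, g = floor(proportion * n)."""
    values = np.asarray(values, dtype=float)
    g = int(np.floor(proportion * values.size))
    ordered = np.sort(values)
    lower, upper = ordered[g], ordered[values.size - g - 1]
    return np.clip(values, lower, upper)

def winsorized_correlation(x, y, proportion=0.1):
    """Pearson correlation of the separately winsorised variables."""
    xw = winsorize(x, proportion)
    yw = winsorize(y, proportion)
    return float(np.corrcoef(xw, yw)[0, 1])

# Strongly linear toy data with a single gross outlier in y.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)
y = x + 0.01 * rng.normal(size=100)
y[-1] = -100.0

pearson_r = float(np.corrcoef(x, y)[0, 1])
winsorized_r = winsorized_correlation(x, y)
print(pearson_r, winsorized_r)
```

On this toy data the single outlier is enough to distort the Pearson coefficient, while the winsorised version stays close to the correlation of the uncontaminated points.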

A survey of dimensionality reduction techniques

Experimental life sciences like biology or chemistry have seen in recent decades an explosion of the data available from experiments. Laboratory instruments have become more and more complex and report hundreds or thousands of measurements for a single experiment, so statistical methods face challenging tasks when dealing with such high-dimensional data. However, much of the data is highly redundant and can be efficiently brought down to a much smaller number of variables without a significant loss of information. The mathematical procedures that make this reduction possible are called dimensionality reduction techniques; they have been widely developed in fields like Statistics or Machine Learning, and are currently a hot research topic. In this review we categorize the plethora of dimension reduction techniques available and give the mathematical insight behind them.

Linear Dimensionality Reduction: Survey, Insights, and Generalizations

Linear dimensionality reduction methods are a cornerstone of analyzing high dimensional data, due to their simple geometric interpretations and typically attractive computational properties. These methods capture many data features of interest, such as covariance, dynamical structure, correlation between data sets, input-output relationships, and margin between data classes. Methods have been developed with a variety of names and motivations in many fields, and perhaps as a result the connections between all these methods have not been highlighted. Here we survey methods from this disparate literature as optimization programs over matrix manifolds. We discuss principal component analysis, factor analysis, linear multidimensional scaling, Fisher's linear discriminant analysis, canonical correlations analysis, maximum autocorrelation factors, slow feature analysis, sufficient dimensionality reduction, undercomplete independent component analysis, linear regression, distance metric learning, and more. This optimization framework gives insight to some rarely discussed shortcomings of well-known methods, such as the suboptimality of certain eigenvector solutions. Modern techniques for optimization over matrix manifolds enable a generic linear dimensionality reduction solver, which accepts as input data and an objective to be optimized, and returns, as output, an optimal low-dimensional projection of the data. This simple optimization framework further allows straightforward generalizations and novel variants of classical methods, which we demonstrate here by creating an orthogonal-projection canonical correlations analysis. More broadly, this survey and generic solver suggest that linear dimensionality reduction can move toward becoming a blackbox, objective-agnostic numerical technology.
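The view of linear dimensionality reduction as an optimization program over a matrix manifold can be sketched for the simplest case, PCA: maximize tr(MᵀSM) over matrices M with orthonormal columns, where S is the sample covariance. The code below is a simplified illustration, not the generic solver described in the abstract: it takes a step along the Euclidean gradient and maps the result back to the set of orthonormal matrices with a QR factorization. The function name, step size and iteration count are assumptions.

```python
import numpy as np

def pca_by_manifold_ascent(S, n_components, step=1.0, n_iter=200, seed=0):
    """Hypothetical helper: maximize trace(M.T @ S @ M) subject to
    M.T @ M = I using gradient steps followed by a QR retraction."""
    rng = np.random.default_rng(seed)
    p = S.shape[0]
    # Random starting point with orthonormal columns.
    M, _ = np.linalg.qr(rng.normal(size=(p, n_components)))
    for _ in range(n_iter):
        gradient = 2.0 * S @ M               # Euclidean gradient of the objective
        M, _ = np.linalg.qr(M + step * gradient)  # retract to orthonormal columns
    return M

# Toy data with clearly separated variances along the coordinate axes.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5)) * np.sqrt([9.0, 4.0, 1.0, 0.25, 0.1])
S = np.cov(X, rowvar=False)

M = pca_by_manifold_ascent(S, n_components=2)
captured_variance = float(np.trace(M.T @ S @ M))
top_two_eigenvalues = float(np.sort(np.linalg.eigvalsh(S))[-2:].sum())
print(captured_variance, top_two_eigenvalues)
```

For PCA the optimum coincides with the leading eigenvectors, so the captured variance can be checked against the sum of the two largest eigenvalues. The appeal of the optimization view is that the same loop structure applies to objectives without a closed-form eigenvector solution, provided the gradient is swapped accordingly.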

X-SDR: An Extensible Experimentation Suite for Dimensionality Reduction

Lecture Notes in Computer Science, 2010

Due to the vast amount and pace of high-dimensional data production, dimensionality reduction emerges as an important requirement in many application areas. In this paper, we introduce X-SDR, a prototype designed specifically for the deployment and assessment of dimensionality reduction techniques. X-SDR is an integrated environment for dimensionality reduction and knowledge discovery that can be effectively used in the data mining process. In the current version, it supports communication with different database management systems and integrates a wealth of dimensionality reduction algorithms, both distributed and centralized. Additionally, it interacts with Weka, thus enabling the exploitation of the data mining algorithms therein. Finally, X-SDR provides an API that enables the integration and evaluation of any dimensionality reduction algorithm.