A comparison of optimization methods and software for large-scale l1-regularized linear classification (original) (raw)

Increasing feature selection accuracy for l1 regularized linear models in large datasets

2010

Abstract L1 (also referred to as the 1-norm or Lasso) penalty based formulations have been shown to be effective in problem domains when noisy features are present. However, the L1 penalty does not give favorable asymptotic properties with respect to feature selection, and has been shown to be inconsistent as a feature selection estimator; eg when noisy features are correlated with the relevant features.

A Fast Hybrid Algorithm for Large-Scale l1-Regularized Logistic Regression

Journal of Machine Learning Research, 2010

ℓ 1 -regularized logistic regression, also known as sparse logistic regression, is widely used in machine learning, computer vision, data mining, bioinformatics and neural signal processing. The use of ℓ 1 regularization attributes attractive properties to the classifier, such as feature selection, robustness to noise, and as a result, classifier generality in the context of supervised learning. When a sparse logistic regression problem has large-scale data in high dimensions, it is computationally expensive to minimize the non-differentiable ℓ 1 -norm in the objective function. Motivated by recent work , we propose a novel hybrid algorithm based on combining two types of optimization iterations: one being very fast and memory friendly while the other being slower but more accurate. Called hybrid iterative shrinkage (HIS), the resulting algorithm is comprised of a fixed point continuation phase and an interior point phase. The first phase is based completely on memory efficient operations such as matrix-vector multiplications, while the second phase is based on a truncated Newton's method. Furthermore, we show that various optimization techniques, including line search and continuation, can significantly accelerate convergence. The algorithm has global convergence at a geometric rate (a Q-linear rate in optimization terminology). We present a numerical comparison with several existing algorithms, including an analysis using benchmark data from the UCI machine learning repository, and show our algorithm is the most computationally efficient without loss of accuracy.

Fast Implementation of ℓ1 Regularized Learning Algorithms Using Gradient Descent Methods

With the advent of high-throughput technologies, ℓ 1 regularized learning algorithms have attracted much at-tention recently. Dozens of algorithms have been pro-posed for fast implementation, using various advanced optimization techniques. In this paper, we demon-strate that ℓ 1 regularized learning problems can be eas-ily solved by using gradient-descent techniques. The ba-sic idea is to transform a convex optimization problem with a non-differentiable objective function into an un-constrained non-convex problem, upon which, via gra-dient descent, reaching a globally optimum solution is guaranteed. We present detailed implementation of the algorithm using ℓ 1 regularized logistic regression as a particular application. We conduct large-scale experi-ments to compare the new approach with other state-of-the-art algorithms on eight medium and large-scale problems. We demonstrate that our algorithm, though simple, performs similarly or even better than other ad-vanced algorithms in ter...

Linear Time Feature Selection for Regularized Least-Squares

2010

We propose a novel algorithm for greedy forward feature selection for regularized least-squares (RLS) regression and classification, also known as the leastsquares support vector machine or ridge regression. The algorithm, which we call greedy RLS, starts from the empty feature set, and on each iteration adds the feature whose addition provides the best leaveone-out cross-validation performance. Our method is considerably faster than the previously proposed ones, since its time complexity is linear in the number of training examples, the number of features in the original data set, and the desired size of the set of selected features. Therefore, as a side effect we obtain a new training algorithm for learning sparse linear RLS predictors which can be used for large scale learning. This speed is possible due to matrix calculus based short-cuts for leave-one-out and feature addition. We experimentally demonstrate the scalability of our algorithm and its ability to find good quality feature sets.

An Interior-Point Method for Large-Scale ℓ1-Regularized Least Squares

Logistic regression with 1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale 1 -regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method, that uses a preconditioned conjugate gradient method to compute the search step, can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes, on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently.

Speeding up greedy forward selection for regularized least-squares

2010

We propose a novel algorithm for greedy forward feature selection for regularized least-squares (RLS) regression and classification, also known as the least-squares support vector machine or ridge regression. The algorithm, which we call greedy RLS, starts from the empty feature set, and on each iteration adds the feature whose addition provides the best leave-one-out cross-validation performance. Our method is considerably faster than the previously proposed ones, since its time complexity is linear in the number of training examples, the number of features in the original data set, and the desired size of the set of selected features. Therefore, as a side effect we obtain a new training algorithm for learning sparse linear RLS predictors which can be used for large scale learning. This speed is possible due to matrix calculus based short-cuts for leave-one-out and feature addition. We experimentally demonstrate the scalability of our algorithm compared to previously proposed implementations.

Recent advances of large-scale linear classification

2012

Abstract Linear classification is a useful tool in machine learning and data mining. For some data in a rich dimensional space, the performance (ie, testing accuracy) of linear classifiers has shown to be close to that of nonlinear classifiers such as kernel methods, but training and testing speed is much faster. Recently, many research works have developed efficient optimization methods to construct linear classifiers and applied them to some large-scale applications.

GURLS: a Toolbox for Regularized Least Squares Learning

We present GURLS, a toolbox for supervised learning based on the regularized least squares algorithm. The toolbox takes advantage of all the favorable properties of least squares and is tailored to deal in particular with multi-category/multi-label problems. One of the main advantages of GURLS is that it allows training and tuning a multi-category classifier at essentially the same cost of one single binary classifier.

Parallel feature selection for regularized least-squares

We propose a novel algorithm for greedy forward feature selection for regularized least-squares (RLS) regression and classification, also known as the leastsquares support vector machine or ridge regression. The algorithm, which we call greedy RLS, starts from the empty feature set, and on each iteration adds the feature whose addition provides the best leaveone-out cross-validation performance. Our method is considerably faster than the previously proposed ones, since its time complexity is linear in the number of training examples, the number of features in the original data set, and the desired size of the set of selected features. Therefore, as a side effect we obtain a new training algorithm for learning sparse linear RLS predictors which can be used for large scale learning. This speed is possible due to matrix calculus based short-cuts for leave-one-out and feature addition. We experimentally demonstrate the scalability of our algorithm and its ability to find good quality feature sets.