A Dirty Model for Multiple Sparse Regression

Sparse Regression

2006

Yuan and Lin (2004) proposed the grouped LASSO, which achieves shrinkage and selection simultaneously, as the LASSO does, but works on blocks of covariates. That is, the grouped LASSO yields a model in which some blocks of regression coefficients are exactly zero. The grouped LASSO is useful when there are meaningful blocks of covariates, such as the terms of a polynomial regression or the dummy variables derived from categorical variables. In this paper, we propose an extension of the grouped LASSO, called ‘Blockwise Sparse Regression’ (BSR). The BSR achieves shrinkage and selection simultaneously on blocks of covariates, similarly to the grouped LASSO, but it works for general loss functions, including generalized linear models. An efficient computational algorithm is developed and a blockwise standardization method is proposed. Simulation results show that the BSR offers a compromise between ridge and the LASSO for logistic regression. The proposed method is illustrated with two datasets.
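The abstract does not spell out how a whole block of coefficients ends up exactly zero. One common mechanism, shown here only as an illustrative sketch and not as the BSR algorithm itself, is block soft-thresholding: the block is shrunk in Euclidean norm and set to zero when its norm falls below the threshold. The function name and the threshold parameter `lam` are assumptions introduced for this example.

```python
import math

def block_soft_threshold(block, lam):
    """Shrink a block of coefficients toward zero by lam in Euclidean norm.

    Hypothetical helper for illustration: returns an all-zero block when
    the block's norm is at most lam, otherwise rescales the whole block.
    """
    norm = math.sqrt(sum(b * b for b in block))
    if norm <= lam:
        return [0.0] * len(block)
    scale = 1.0 - lam / norm
    return [scale * b for b in block]

# A weak block (norm 0.5) is removed entirely; a strong block (norm 5.0)
# is kept but shrunk, with its direction unchanged.
weak = block_soft_threshold([0.3, -0.4], lam=1.0)
strong = block_soft_threshold([3.0, 4.0], lam=1.0)
```

Because the decision depends on the norm of the whole block, either all coefficients in a block survive or none do, which is the blockwise selection behaviour the abstract describes.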

Blockwise sparse regression

Statistica Sinica

Yuan and Lin (2004) proposed the grouped LASSO, which achieves shrinkage and selection simultaneously, as the LASSO does, but works on blocks of covariates. That is, the grouped LASSO yields a model in which some blocks of regression coefficients are exactly zero. The grouped LASSO is useful when there are meaningful blocks of covariates, such as the terms of a polynomial regression or the dummy variables derived from categorical variables. In this paper, we propose an extension of the grouped LASSO, called 'Blockwise Sparse Regression' (BSR). The BSR achieves shrinkage and selection simultaneously on blocks of covariates, similarly to the grouped LASSO, but it works for general loss functions, including generalized linear models. An efficient computational algorithm is developed and a blockwise standardization method is proposed. Simulation results show that the BSR offers a compromise between ridge and the LASSO for logistic regression. The proposed method is illustrated with two datasets.

Sparse Regression: Scalable Algorithms and Empirical Performance

Statistical Science, 2020

In this paper, we review state-of-the-art methods for feature selection in statistics with an application-oriented eye. Indeed, sparsity is a valuable property, and the profusion of research on the topic might have provided little guidance to practitioners. We demonstrate empirically how noise and correlation impact both the accuracy (the number of correct features selected) and the false detection (the number of incorrect features selected) for five methods: the cardinality-constrained formulation, its Boolean relaxation, ℓ1 regularization, and two methods with non-convex penalties. A cogent feature selection method is expected to exhibit a twofold convergence, namely the accuracy and false detection rate should converge to 1 and 0 respectively as the sample size increases. As a result, a proper method should recover all and nothing but the true features. Empirically, the integer optimization formulation and its Boolean relaxation come closest to exhibiting these two properties consistently in various regimes of noise and correlation. In addition, apart from the discrete optimization approach, which requires a substantial yet often affordable computational time, all methods terminate in times comparable with the glmnet package for the Lasso. We released code for methods that were not publicly implemented. Jointly considered, accuracy, false detection and computational time provide a comprehensive assessment of each feature selection method and shed light on alternatives to Lasso regularization that are not yet as popular in practice.
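The two criteria in this abstract can be made concrete with a small sketch. The abstract defines them as counts of correct and incorrect selected features and then speaks of rates converging to 1 and 0; the normalizations used below (dividing correct selections by the size of the true support, and incorrect selections by the number of selected features) are assumptions for illustration, as is the function name.

```python
def selection_metrics(selected, true_support):
    """Return (accuracy, false_detection) for a set of selected features.

    Assumed normalization: accuracy is the fraction of true features that
    were selected; false detection is the fraction of selected features
    that are not true features.
    """
    selected, true_support = set(selected), set(true_support)
    correct = selected & true_support
    incorrect = selected - true_support
    accuracy = len(correct) / len(true_support) if true_support else 1.0
    false_detection = len(incorrect) / len(selected) if selected else 0.0
    return accuracy, false_detection

# Features 0 and 1 are correctly selected, feature 5 is a false detection,
# and true features 2 and 3 are missed.
acc, fd = selection_metrics(selected=[0, 1, 5], true_support=[0, 1, 2, 3])
```

Under these assumed definitions, a method that recovers "all and nothing but" the true features scores an accuracy of 1 and a false detection rate of 0.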

A tutorial on the Lasso approach to sparse modeling

Chemometrics and Intelligent Laboratory Systems, 2012

In applied research, data are often collected from sources with a high-dimensional multivariate output. Analysis of such data consists of, e.g., extraction and characterization of underlying patterns, often with the aim of finding a small subset of significant variables or features. Variable and feature selection is well established in the area of regression, whereas for other types of models it seems more difficult. Penalization of the L1 norm provides an interesting avenue for such a problem, as it produces a sparse solution and hence embeds variable selection. In this paper, a brief introduction to the mathematical properties of using the L1 norm as a penalty is given. Examples of models extended with L1 norm penalties/constraints are presented. The examples include PCA modeling with sparse loadings, which enhance the interpretability of single components. Sparse inverse covariance matrix estimation is used to unravel which variables affect each other, and a modified PCA for modeling data with (piecewise) constant responses, e.g. in process monitoring, is shown. All examples are demonstrated on real or synthetic data. The results indicate that sparse solutions, when appropriate, can enhance model interpretability.

Numerical characterization of support recovery in sparse regression with correlated design

Communications in Statistics - Simulation and Computation

Sparse regression is frequently employed in diverse scientific settings as a feature selection method. A pervasive aspect of scientific data that hampers both feature selection and estimation is the presence of strong correlations between predictive features. These fundamental issues are often not appreciated by practitioners, and jeopardize conclusions drawn from estimated models. On the other hand, theoretical results on sparsity-inducing regularized regression such as the Lasso have largely addressed conditions for selection consistency via asymptotics, and disregard the problem of model selection, whereby regularization parameters are chosen. In this numerical study, we address these issues through exhaustive characterization of the performance of several regression estimators, coupled with a range of model selection strategies. These estimators and selection criteria were examined across correlated regression problems with varying degrees of signal-to-noise ratio, distribution of the non-zero model coefficients, and model sparsity. Our results reveal a fundamental tradeoff between false positive and false negative control in all regression estimators and model selection criteria examined. Additionally, we are able to numerically explore a transition point, modulated by the signal-to-noise ratio and spectral properties of the design covariance matrix, at which the selection accuracy of all considered algorithms degrades. Overall, we find that SCAD coupled with BIC or empirical Bayes model selection performs the best feature selection across the regression problems considered.

A note on the group lasso and a sparse group lasso

2010

We consider the group lasso penalty for the linear model. We note that the standard algorithm for solving the problem assumes that the model matrices in each group are orthonormal. Here we consider a more general penalty that blends the lasso (L1) with the group lasso ("two-norm"). This penalty yields solutions that are sparse at both the group and individual feature levels. We derive an efficient algorithm for the resulting convex problem based on coordinate descent. This algorithm can also be used to solve the general form of the group lasso, with non-orthonormal model matrices.
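For orientation, a blended penalty of the kind described in this abstract can be written as follows. The notation is an assumption chosen for illustration: the response $y$, group-wise design matrices $X_l$ and coefficient blocks $\beta_l$ for $l = 1, \dots, L$, and tuning parameters $\lambda_1, \lambda_2 \ge 0$ do not appear in the abstract itself.

```latex
\min_{\beta}\;
  \frac{1}{2}\Bigl\| y - \sum_{l=1}^{L} X_l \beta_l \Bigr\|_2^2
  \;+\; \lambda_1 \sum_{l=1}^{L} \|\beta_l\|_2
  \;+\; \lambda_2 \|\beta\|_1
```

In this form, setting $\lambda_2 = 0$ leaves a pure group lasso penalty and setting $\lambda_1 = 0$ leaves the ordinary lasso; with both positive, the two-norm term can remove whole groups while the L1 term can zero individual coefficients inside the groups that remain.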

The LASSO risk: asymptotic results and real world examples

2010

We consider the problem of learning a coefficient vector $x_0 \in \mathbb{R}^N$ from noisy linear observations $y = A x_0 + w \in \mathbb{R}^n$. In many contexts (ranging from model selection to image processing) it is desirable to construct a sparse estimator $\hat{x}$. In this case, a popular approach consists in solving an ℓ1-penalized least squares problem known as the LASSO or Basis Pursuit DeNoising (BPDN).
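As a rough illustration of the ℓ1-penalized least squares problem mentioned here, the sketch below evaluates an objective of the form 0.5·‖y − Ax‖² + λ‖x‖₁ and uses the special case where A is the identity, in which the minimizer is given by elementwise soft-thresholding. The scaling of the objective, the parameter name `lam`, and both function names are assumptions for this example and are not taken from the paper.

```python
def lasso_objective(A, y, x, lam):
    """Evaluate 0.5 * ||y - A x||_2^2 + lam * ||x||_1 (assumed scaling)."""
    residuals = [yi - sum(a * xj for a, xj in zip(row, x))
                 for row, yi in zip(A, y)]
    return 0.5 * sum(r * r for r in residuals) + lam * sum(abs(xj) for xj in x)

def soft_threshold(v, lam):
    """Shrink each entry toward zero by lam, zeroing entries within [-lam, lam]."""
    out = []
    for t in v:
        if t > lam:
            out.append(t - lam)
        elif t < -lam:
            out.append(t + lam)
        else:
            out.append(0.0)
    return out

# Identity design: the estimate is the soft-thresholded observation vector.
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [3.0, -0.5, 1.5]
x_hat = soft_threshold(y, lam=1.0)
obj_hat = lasso_objective(A, y, x_hat, lam=1.0)
```

The small observation is set exactly to zero while the larger ones are shrunk, which is the sparsity-inducing effect that makes this estimator attractive in the settings the abstract lists. For a general matrix A there is no such closed form and an iterative solver is needed.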

Convex Block-sparse Linear Regression with Expanders - Provably

ArXiv, 2016

Sparse matrices are favorable objects in machine learning and optimization. When such matrices are used in place of dense ones, the overall complexity requirements in optimization can be significantly reduced in practice, both in terms of space and run-time. Prompted by this observation, we study a convex optimization scheme for block-sparse recovery from linear measurements. To obtain linear sketches, we use expander matrices, i.e., sparse matrices containing only a few non-zeros per column. Hitherto, to the best of our knowledge, such algorithmic solutions have only been studied from a non-convex perspective. Our aim here is to theoretically characterize the performance of convex approaches under such a setting. Our key novelty is the expression of the recovery error in terms of the model-based norm, while assuring that the solution lives in the model. To achieve this, we show that sparse model-based matrices satisfy a group version of the null-space property. Our experimental findings o...

The Generalized LASSO

IEEE Transactions on Neural Networks, 2004

In the last few years, the support vector machine (SVM) method has motivated new interest in kernel regression techniques. Although the SVM has been shown to exhibit excellent generalization properties in many experiments, it suffers from several drawbacks, both of a theoretical and a technical nature: the absence of probabilistic outputs, the restriction to Mercer kernels, and the steep growth of the number of support vectors with increasing size of the training set. In this paper, we present a different class of kernel regressors that effectively overcome the above problems. We call this approach generalized LASSO regression. It has a clear probabilistic interpretation, can handle learning sets that are corrupted by outliers, produces extremely sparse solutions, and is capable of dealing with large-scale problems. For regression functionals which can be modeled as iteratively reweighted least-squares (IRLS) problems, we present a highly efficient algorithm with guaranteed global convergence. This defines a unique framework for sparse regression models in the very rich class of IRLS models, including various types of robust regression models and logistic regression. Performance studies for many standard benchmark datasets effectively demonstrate the advantages of this model over related approaches.