Peter Bühlmann - Profile on Academia.edu

Papers by Peter Bühlmann

Pattern alternating maximization algorithm for missing data in high-dimensional problems

Journal of Machine Learning Research, 2014

We propose a novel and efficient algorithm for maximizing the observed log-likelihood of a multivariate normal data matrix with missing values. We show that our procedure, based on iteratively regr...
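
The abstract is cut off above, but the observed-likelihood idea it refers to can be illustrated with the classical EM algorithm for a multivariate normal with missing entries, where the E-step is exactly a regression of the missing coordinates on the observed ones. This is a minimal sketch of that standard algorithm, not the authors' pattern-alternating procedure; the function name is illustrative.

```python
import numpy as np

def mvn_em(X, n_iter=50):
    """Classical EM for the mean/covariance of a multivariate normal
    with NaN entries: the E-step regresses missing coordinates on the
    observed ones under the current (mu, Sigma)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)
    mu = np.nanmean(X, axis=0)              # crude start
    Xf = np.where(miss, mu, X)
    Sigma = np.cov(Xf, rowvar=False)
    for _ in range(n_iter):
        C = np.zeros((p, p))                # accumulates conditional covariances
        for i in range(n):
            m, o = miss[i], ~miss[i]
            if not m.any():
                continue
            Soo = Sigma[np.ix_(o, o)]
            Smo = Sigma[np.ix_(m, o)]
            B = Smo @ np.linalg.pinv(Soo)   # regression coefficients
            Xf[i, m] = mu[m] + B @ (Xf[i, o] - mu[o])
            C[np.ix_(m, m)] += Sigma[np.ix_(m, m)] - B @ Smo.T
        mu = Xf.mean(axis=0)                # M-step
        Sigma = np.cov(Xf, rowvar=False, bias=True) + C / n
    return mu, Sigma, Xf
```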

One Modern Culture of Statistics: Comments on Statistical Modeling: The Two Cultures (Breiman, 2001b)

Observational Studies

A Look at Robustness and Stability of ℓ1- versus ℓ0-Regularization: Discussion of Papers by Bertsimas et al. and Hastie et al.

Statistical Science

Invariant Causal Prediction for Sequential Data

Journal of the American Statistical Association

We investigate the problem of inferring the causal predictors of a response Y from a set of d explanatory variables (X_1, ..., X_d). Classical ordinary least squares regression includes all predictors that reduce the variance of Y. Using only the causal predictors instead leads to models that have the advantage of remaining invariant under interventions; loosely speaking, they lead to invariance across different "environments" or "heterogeneity patterns". More precisely, the conditional distribution of Y given its causal predictors remains invariant for all observations. Recent work exploits such stability to infer causal relations from data with different but known environments. We show that even without knowledge of the environments or heterogeneity pattern, inferring causal relations is possible for time-ordered (or any other type of sequentially ordered) data. In particular, this allows detecting instantaneous causal relations in multivariate linear time series, which is usually not possible with Granger causality. Besides novel methodology, we provide statistical confidence bounds and asymptotic detection results for inferring causal predictors, and present an application to monetary policy in macroeconomics.
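
For intuition, the invariance principle with *known* environments can be sketched in a few lines: accept every predictor set whose pooled regression residuals look identically distributed across environments, and intersect the accepted sets. This is a rough sketch of the basic invariant-causal-prediction idea, not the sequential-data method of the paper; the invariance check (ANOVA plus Levene) and the exhaustive subset search are simplifications suitable only for small d.

```python
from itertools import chain, combinations
import numpy as np
from scipy import stats

def icp_known_envs(X, y, env, alpha=0.05):
    """Accept a set S if pooled-OLS residuals appear invariant across
    environments; return the intersection of accepted sets."""
    n, d = X.shape
    accepted = []
    all_sets = chain.from_iterable(combinations(range(d), k)
                                   for k in range(d + 1))
    for S in all_sets:
        Z = np.hstack([np.ones((n, 1)), X[:, S]]) if S else np.ones((n, 1))
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        r = y - Z @ beta
        groups = [r[env == e] for e in np.unique(env)]
        p_mean = stats.f_oneway(*groups).pvalue   # equal residual means?
        p_var = stats.levene(*groups).pvalue      # equal residual variances?
        if min(p_mean, p_var) > alpha / 2:        # crude Bonferroni over 2 tests
            accepted.append(set(S))
    return set.intersection(*accepted) if accepted else set()
```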

Kernel-based tests for joint independence

Journal of the Royal Statistical Society: Series B (Statistical Methodology)

We investigate the problem of testing whether d possibly multivariate random variables, which may or may not be continuous, are jointly (or mutually) independent. Our method builds on ideas of the two-variable Hilbert-Schmidt independence criterion but allows for an arbitrary number of variables. We embed the joint distribution and the product of the marginals in a reproducing kernel Hilbert space and define the d-variable Hilbert-Schmidt independence criterion dHSIC as the squared distance between the embeddings. In the population case, the value of dHSIC is 0 if and only if the d variables are jointly independent, as long as the kernel is characteristic. On the basis of an empirical estimate of dHSIC, we investigate three nonparametric hypothesis tests: a permutation test, a bootstrap analogue and a procedure based on a gamma approximation. We apply non-parametric independence testing to a problem in causal discovery and illustrate the new methods on simulated and real data sets.
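
The dHSIC statistic itself is short to compute from the d kernel matrices. Below is a sketch of the V-statistic estimator with Gaussian kernels (median-heuristic bandwidth) together with the permutation test, written from the definition given in the abstract; the helper names are ours.

```python
import numpy as np

def gauss_kernel(x):
    """Gaussian kernel matrix with the median-heuristic bandwidth."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    med = np.median(d2[d2 > 0]) if (d2 > 0).any() else 1.0
    return np.exp(-d2 / med)

def dhsic(vars_):
    """V-statistic estimate of the d-variable HSIC."""
    Ks = [gauss_kernel(v) for v in vars_]
    term1 = np.mean(np.prod(Ks, axis=0))
    term2 = np.prod([K.mean() for K in Ks])
    term3 = 2.0 * np.mean(np.prod([K.mean(axis=1) for K in Ks], axis=0))
    return term1 + term2 - term3

def dhsic_perm_test(vars_, B=200, seed=0):
    """Permutation p-value: permuting each variable independently
    emulates the joint-independence null."""
    rng = np.random.default_rng(seed)
    n = len(vars_[0])
    stat = dhsic(vars_)
    null = [dhsic([v[rng.permutation(n)] for v in vars_]) for _ in range(B)]
    return stat, (1 + sum(s >= stat for s in null)) / (B + 1)
```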

Assessing statistical significance in multivariable genome wide association analysis

Bioinformatics, 2016

Motivation: Although Genome Wide Association Studies (GWAS) genotype a very large number of single nucleotide polymorphisms (SNPs), the data are often analyzed one SNP at a time. The low predictive power of single SNPs, coupled with the high significance threshold needed to correct for multiple testing, greatly decreases the power of GWAS. Results: We propose a procedure in which all the SNPs are analyzed in a multiple generalized linear model, and we show its use for extremely high-dimensional datasets. Our method yields p-values for assessing the significance of single SNPs or groups of SNPs while controlling for all other SNPs and the familywise error rate (FWER). Thus, our method tests whether or not a SNP carries any additional information about the phenotype beyond that available from all the other SNPs. This rules out spurious correlations between phenotypes and SNPs that can arise from marginal methods because the "spuriously correlated" SNP merely happens to be correlated with the "truly causal" SNP. In addition, the method offers a data-driven approach to identifying and refining groups of SNPs that jointly contain informative signals about the phenotype. We demonstrate the value of our method by applying it to the seven diseases analyzed by the WTCCC (The Wellcome Trust Case Control Consortium, 2007). We show, in particular, that our method is also capable of finding significant SNPs that were not identified in the original WTCCC study but were replicated in other independent studies.
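
The significance machinery alluded to here builds on sample splitting: screen with the lasso on one half of the data, compute classical p-values on the other half, and aggregate over many random splits. The sketch below shows that multi sample-splitting idea (Meinshausen, Meier and Bühlmann, 2009) for a Gaussian linear model; it is a simplification of, not the paper's exact hierarchical GWAS procedure.

```python
import numpy as np
import scipy.stats as st
from sklearn.linear_model import LassoCV

def multi_split_pvalues(X, y, B=50, seed=0):
    """Multi sample-splitting: lasso screening on one half, t-tests on
    the other, Bonferroni within the selected set, then quantile
    aggregation (twice the median) over splits."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    P = np.ones((B, p))
    for b in range(B):
        idx = rng.permutation(n)
        i1, i2 = idx[: n // 2], idx[n // 2:]
        sel = np.flatnonzero(LassoCV(cv=5).fit(X[i1], y[i1]).coef_)
        if sel.size == 0 or sel.size >= len(i2) - 1:
            continue                      # skip degenerate splits
        Z = np.hstack([np.ones((len(i2), 1)), X[np.ix_(i2, sel)]])
        beta, *_ = np.linalg.lstsq(Z, y[i2], rcond=None)
        dof = len(i2) - Z.shape[1]
        sigma2 = np.sum((y[i2] - Z @ beta) ** 2) / dof
        se = np.sqrt(sigma2 * np.diag(np.linalg.pinv(Z.T @ Z)))
        t = beta[1:] / se[1:]
        P[b, sel] = np.minimum(2 * st.t.sf(np.abs(t), dof) * sel.size, 1.0)
    return np.minimum(2 * np.median(P, axis=0), 1.0)   # FWER-style p-values
```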

Marginal integration for nonparametric causal inference

Electronic Journal of Statistics, 2015

We consider the problem of inferring the total causal effect of a single continuous variable intervention on a (response) variable of interest. We propose a certain marginal integration regression technique for a very general class of potentially nonlinear structural equation models (SEMs) with known structure, or at least a known superset of the adjustment variables: we call the procedure S-mint regression. We derive that it achieves the same convergence rate as for nonparametric regression: for example, single variable intervention effects can be estimated with convergence rate n^{-2/5} assuming smoothness corresponding to twice-differentiable functions. Our result can also be seen as a major robustness property with respect to model misspecification which goes much beyond the notion of double robustness. Furthermore, when the structure of the SEM is not known, we can estimate (the equivalence class of) the directed acyclic graph corresponding to the SEM, and then proceed by using S-mint based on these estimates. We empirically compare the S-mint regression method with more classical approaches and argue that the former is indeed more robust, more reliable and substantially simpler.
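
The marginal-integration step has a compact form: fit a flexible regression of the response on the intervention variable plus the adjustment set, then average the fitted surface over the empirical distribution of the adjustment variables (the backdoor/g-formula computation). A minimal sketch, with a random forest standing in for the paper's local smoother:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def smint_effect(x_grid, x, S, y):
    """E[Y | do(X = x)] by marginal integration: regress y on (x, S),
    then integrate the adjustment set S out against its empirical law."""
    model = RandomForestRegressor(n_estimators=300, random_state=0)
    model.fit(np.column_stack([x, S]), y)
    effects = []
    for x0 in x_grid:
        Z = np.column_stack([np.full(len(y), x0), S])  # set X = x0 everywhere
        effects.append(model.predict(Z).mean())        # average over S_i
    return np.array(effects)
```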

Boosting

Confidence Intervals for Maximin Effects in Inhomogeneous Large-Scale Data

One challenge of large-scale data analysis is that the assumption of an identical distribution for all samples is often not realistic. An optimal linear regression might, for example, be markedly different for distinct groups of the data. Maximin effects have been proposed as a computationally attractive way to estimate effects that are common across all data without fitting a mixture distribution explicitly. So far, only point estimators of the common maximin effects have been proposed, in Meinshausen and Bühlmann (2014). Here we propose asymptotically valid confidence regions for these effects.

Maximin effects in inhomogeneous large-scale data

The Annals of Statistics, 2015

Large-scale data are often characterised by some degree of inhomogeneity as data are either recorded in different time regimes or taken from multiple sources. We look at regression models and the effect of randomly changing coefficients, where the change occurs either smoothly in time or along some other dimension, or even without any such structure. Fitting varying-coefficient models or mixture models can be an appropriate solution but is computationally very demanding and often tries to return more information than necessary. If we just ask for a model estimator that shows good predictive properties for all regimes of the data, then we are aiming for a simple linear model that is reliable for all possible subsets of the data. We propose a maximin effects estimator and look at its prediction accuracy from a theoretical point of view in a mixture model with known or unknown group structure. Under certain circumstances the estimator can be computed orders of magnitude faster than standard penalised regression estimators, making computations on large-scale data feasible. Empirical examples complement the novel methodology and theory.
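
With known groups the maximin effect has a concrete characterization: it is the point in the convex hull of the per-group regression coefficients that minimizes the explained variance beta' Sigma beta, i.e. the vector with the best worst-case guarantee across groups. A sketch under that known-group setting (centered data, pooled Gram matrix as Sigma); the quadratic program over simplex weights is solved generically here rather than by the fast algorithms the abstract alludes to.

```python
import numpy as np
from scipy.optimize import minimize

def maximin_known_groups(X, y, groups):
    """Maximin effects with known group structure: minimize
    beta' Sigma beta over the convex hull of per-group OLS fits."""
    labels = np.unique(groups)
    B = np.array([np.linalg.lstsq(X[groups == g], y[groups == g],
                                  rcond=None)[0] for g in labels])
    Sigma = X.T @ X / len(y)            # pooled Gram matrix
    Q = B @ Sigma @ B.T                 # quadratic form in hull weights
    G = len(labels)
    res = minimize(lambda w: w @ Q @ w, np.full(G, 1.0 / G),
                   jac=lambda w: 2 * Q @ w,
                   bounds=[(0, 1)] * G,
                   constraints={"type": "eq",
                                "fun": lambda w: w.sum() - 1})
    return res.x @ B                    # the maximin coefficient vector
```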

Robust Statistics

Selected Works in Probability and Statistics, 2012

A sequential rejection testing method for high-dimensional regression with correlated variables

We propose a general, modular method for significance testing of groups (or clusters) of variables in a high-dimensional linear model. In the presence of high correlations among the covariables, due to serious problems of identifiability, it is indispensable to focus on detecting groups of variables rather than singletons. We propose an inference method which allows one to build in hierarchical structures. It relies on repeated sample splitting and sequential rejection, and we prove that it asymptotically controls the familywise error rate. It can be implemented on any collection of clusters and leads to improved power in comparison to more standard non-sequential rejection methods. We complement the theoretical analysis with empirical results for simulated and real data.
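
The sequential logic is simple to write down: walk the cluster tree top-down and test a cluster only if its parent was rejected, at a level that shrinks with cluster size. A schematic skeleton (the size-weighted level is one simple inheritance rule; `pval` would come from the paper's split-and-aggregate machinery and is left abstract here):

```python
def hierarchical_test(tree, pval, p, alpha=0.05):
    """Top-down testing along a cluster hierarchy.
    tree: maps a cluster (tuple of variable indices) to child clusters.
    pval: callable returning a p-value for H0 'all coefficients in the
    cluster are zero'.  p: total number of variables."""
    root = tuple(range(p))
    rejected, stack = [], [root]
    while stack:
        cl = stack.pop()
        if pval(cl) <= alpha * len(cl) / p:   # size-weighted level
            rejected.append(cl)
            stack.extend(tree.get(cl, []))    # refine only if rejected
    return rejected
```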

Hierarchical testing in the high-dimensional setting with correlated variables

We propose a method for testing whether hierarchically ordered groups of potentially correlated variables are significant for explaining a response in a high-dimensional linear model. In the presence of highly correlated variables, as is very common in high-dimensional data, it seems indispensable to go beyond an approach of inferring individual regression coefficients. Thanks to the hierarchy among the groups of variables, a powerful multiple testing adjustment is possible, which leads to a data-driven choice of the resolution level for the groups. Our procedure, based on repeated sample splitting, is shown to asymptotically control the familywise error rate, and we provide empirical results for simulated and real data which complement the theoretical analysis.

On asymptotically optimal confidence regions and tests for high-dimensional models

The Annals of Statistics, 2014

We propose a general method for constructing confidence intervals and statistical tests for single or low-dimensional components of a large parameter vector in a high-dimensional model. It can be easily adjusted for multiplicity, taking dependence among tests into account. For linear models, our method is essentially the same as in Zhang and Zhang (2014): we analyze its asymptotic properties and establish its asymptotic optimality in terms of semiparametric efficiency. Our method naturally extends to generalized linear models with convex loss functions. We develop the corresponding theory, which includes a careful analysis for Gaussian, sub-Gaussian and bounded correlated designs. Y.R. gratefully acknowledges financial support from the Forschungsinstitut für Mathematik (FIM) at ETH Zürich and from the Israel Science Foundation (ISF).
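
For the linear case, the construction behind such intervals is the de-sparsified (de-biased) lasso: take a lasso fit, undo its bias with a one-step correction built from nodewise-lasso approximations to the inverse Gram matrix, and read off Gaussian intervals. A compact sketch (cross-validated penalties throughout, which is a practical shortcut rather than the theory's choice):

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LassoCV

def desparsified_lasso_ci(X, y, level=0.95):
    """De-sparsified lasso: b = beta_lasso + Theta X'(y - X beta_lasso)/n,
    with Theta from nodewise lasso regressions; Gaussian intervals."""
    n, p = X.shape
    fit = LassoCV(cv=5).fit(X, y)
    beta = fit.coef_
    resid = y - fit.predict(X)
    Theta = np.zeros((p, p))
    for j in range(p):
        Xj = np.delete(X, j, axis=1)
        nw = LassoCV(cv=5).fit(Xj, X[:, j])     # nodewise regression
        r = X[:, j] - nw.predict(Xj)
        tau2 = r @ X[:, j] / n
        Theta[j] = np.insert(-nw.coef_, j, 1.0) / tau2
    b = beta + Theta @ X.T @ resid / n          # one-step bias correction
    dof = max(n - np.count_nonzero(beta) - 1, 1)
    sigma = np.sqrt(resid @ resid / dof)        # noise level estimate
    Sig = X.T @ X / n
    se = sigma * np.sqrt(np.einsum('jk,kl,jl->j', Theta, Sig, Theta) / n)
    z = stats.norm.ppf(0.5 + level / 2)
    return b, np.column_stack([b - z * se, b + z * se])
```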

Stable solutions

Springer Series in Statistics, 2011

Estimation of discrete structure such as in variable selection or graphical modeling is notoriously difficult, especially for high-dimensional data. Subsampling and bootstrapping have the potential to substantially increase the stability of high-dimensional selection algorithms and to quantify their uncertainties. Stability via subsampling or bootstrapping was introduced by Breiman (1996) in the context of prediction. Here, the focus is different: the resampling scheme can provide finite sample control for certain error rates of false discoveries and hence a transparent principle to choose a proper amount of regularization for structure estimation. We discuss methodology and theory for very general settings which include variable selection in linear or generalized linear models or graphical modeling from Chapter 13. For the special case of variable selection in linear models, the theoretical properties (developed here) for consistent selection using stable solutions based on subsampling or bootstrapping require slightly stronger assumptions and are less refined than, say, for the adaptive or thresholded Lasso.
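
The resampling principle reduces to a short loop: refit the selector on many random half-samples and keep only variables selected in a large fraction of them. A minimal stability-selection sketch with the lasso as base selector (fixed penalty `lam` for simplicity):

```python
import numpy as np
from sklearn.linear_model import Lasso

def stability_selection(X, y, lam, B=100, threshold=0.7, seed=0):
    """Selection frequencies over B half-samples; keep variables
    whose frequency exceeds the threshold."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    freq = np.zeros(p)
    for _ in range(B):
        half = rng.choice(n, size=n // 2, replace=False)
        freq += Lasso(alpha=lam).fit(X[half], y[half]).coef_ != 0
    freq /= B
    return np.flatnonzero(freq >= threshold), freq
```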

Rejoinder: ℓ1-penalization for mixture regression models

TEST, 2010

We are very grateful to all discussants for their many insightful and inspiring comments. We would also like to thank the co-editors Ricardo Cao and Domingo Morales for having arranged this discussion.

ℓ1-Penalization for Mixture Regression Models

TEST, 2010

We consider a finite mixture of regressions (FMR) model for high-dimensional inhomogeneous data where the number of covariates may be much larger than the sample size. We propose an ℓ1-penalized maximum likelihood estimator in an appropriate parameterization. This kind of estimation belongs to a class of problems where optimization and theory for non-convex functions are needed. This distinguishes itself very clearly from high-dimensional estimation with convex loss or objective functions, as for example with the Lasso in linear or generalized linear models. Mixture models represent a prime and important example where non-convexity arises. For FMR models, we develop an efficient EM algorithm for numerical optimization with provable convergence properties. Our penalized estimator is numerically better posed (e.g., boundedness of the criterion function) than unpenalized maximum likelihood estimation, and it allows for effective statistical regularization including variable selection. We also present some asymptotic theory and oracle inequalities: due to non-convexity of the negative log-likelihood function, different mathematical arguments are needed than for problems with convex losses. Finally, we apply the new method to both simulated and real data.
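
The EM algorithm for such a model alternates posterior component probabilities with per-component weighted lasso fits. A toy sketch for Gaussian components (fixed penalty, no safeguards against degenerate components, so purely illustrative of the alternation, not the paper's provably convergent algorithm):

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import Lasso

def fmr_lasso_em(X, y, K=2, lam=0.1, n_iter=100, seed=0):
    """EM for a K-component mixture of regressions with an
    l1-penalized (weighted lasso) M-step."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    resp = rng.dirichlet(np.ones(K), size=n)    # random soft assignments
    betas, b0 = np.zeros((K, p)), np.zeros(K)
    sig, pi = np.ones(K), np.full(K, 1.0 / K)
    for _ in range(n_iter):
        for k in range(K):                      # M-step per component
            w = resp[:, k]
            fit = Lasso(alpha=lam).fit(X, y, sample_weight=w)
            betas[k], b0[k] = fit.coef_, fit.intercept_
            r = y - X @ betas[k] - b0[k]
            sig[k] = max(np.sqrt((w * r**2).sum() / w.sum()), 1e-6)
            pi[k] = w.mean()
        dens = np.column_stack(                 # E-step: responsibilities
            [pi[k] * norm.pdf(y, X @ betas[k] + b0[k], sig[k])
             for k in range(K)])
        resp = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
    return betas, b0, sig, pi
```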

Discussion of Big Bayes Stories and BayesBag

Statistical Science, 2014

Bootstraps for Time Series

Statistical Science, 2002

We compare and review block, sieve and local bootstraps for time series and thereby illuminate theoretical facts as well as performance on finite-sample data. Our (re-)view is selective, with the intention of getting a new and fair picture of some particular aspects of bootstrapping time series. The generality of the block bootstrap is contrasted with sieve bootstraps. We discuss implementational advantages and disadvantages, and argue that two types of sieves outperform the block method, each in its own important niche, namely linear and categorical processes, respectively. Local bootstraps, designed for nonparametric smoothing problems, are easy to use and implement but in some cases exhibit low performance.
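
The two contenders for a linear process are easy to line up side by side: the moving-block bootstrap resamples overlapping blocks of the raw series, while the AR-sieve bootstrap fits an autoregression and regenerates the series from resampled residuals. Minimal sketches of both (least-squares AR fit; function names are ours):

```python
import numpy as np

def block_bootstrap(x, block_len, seed=0):
    """Moving-block bootstrap: concatenate random overlapping blocks."""
    rng = np.random.default_rng(seed)
    n = len(x)
    k = int(np.ceil(n / block_len))
    starts = rng.integers(0, n - block_len + 1, size=k)
    return np.concatenate([x[s:s + block_len] for s in starts])[:n]

def ar_sieve_bootstrap(x, order, seed=0):
    """AR(p) sieve bootstrap: fit an AR model by least squares,
    resample centred residuals i.i.d., regenerate the series."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    Z = np.column_stack([np.ones(n - order)] +
                        [x[order - k - 1:n - k - 1] for k in range(order)])
    phi, *_ = np.linalg.lstsq(Z, x[order:], rcond=None)
    resid = x[order:] - Z @ phi
    resid -= resid.mean()
    out = list(x[:order])                       # start from observed values
    for _ in range(order, n):
        lags = out[-1:-order - 1:-1]            # most recent lag first
        out.append(phi[0] + np.dot(phi[1:], lags) + rng.choice(resid))
    return np.array(out)
```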
