Ed George - Academia.edu
Papers by Ed George
arXiv (Cornell University), Dec 10, 2016
We develop a model-free theory of general types of parametric regression for iid observations. The theory replaces the parameters of parametric models with statistical functionals, to be called "regression functionals", defined on large non-parametric classes of joint x-y distributions, without assuming a correct model. Parametric models are reduced to heuristics to suggest plausible objective functions. An example of a regression functional is the vector of slopes of linear equations fitted by OLS to largely arbitrary x-y distributions, without assuming a linear model (see Part I). More generally, regression functionals can be defined by minimizing objective functions or solving estimating equations at joint x-y distributions. In this framework it is possible to achieve the following: (1) define a notion of well-specification for regression functionals that replaces the notion of correct specification of models, (2) propose a well-specification diagnostic for regression functionals based on reweighting distributions and data, (3) decompose sampling variability of regression functionals into two sources, one due to the conditional response distribution and another due to the regressor distribution interacting with misspecification, both of order N^{-1/2}, (4) exhibit plug-in/sandwich estimators of standard error as limit cases of x-y bootstrap estimators, and (5) provide theoretical heuristics to indicate that x-y bootstrap standard errors may generally be more stable than sandwich estimators.
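A minimal Python sketch (not from the paper; the data-generating choices and names are illustrative) of point (4): compute the OLS slope functional on data from a deliberately nonlinear joint distribution, then compare the plug-in sandwich standard error with the x-y (pairs) bootstrap.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a misspecified setting: the conditional mean is nonlinear,
# but we still compute the OLS slope functional of the joint (x, y) law.
n = 500
x = rng.uniform(-2, 2, n)
y = x + 0.5 * x**2 + rng.normal(0, 1, n)    # a linear model is wrong here
X = np.column_stack([np.ones(n), x])

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

beta = ols(X, y)

# Plug-in sandwich standard errors: (X'X)^{-1} X' diag(r^2) X (X'X)^{-1}
r = y - X @ beta
bread = np.linalg.inv(X.T @ X)
meat = X.T @ (X * r[:, None] ** 2)
sandwich_se = np.sqrt(np.diag(bread @ meat @ bread))

# x-y (pairs) bootstrap: resample whole (x, y) cases and refit OLS
B = 2000
boot = np.empty((B, 2))
for b in range(B):
    idx = rng.integers(0, n, n)
    boot[b] = ols(X[idx], y[idx])
boot_se = boot.std(axis=0, ddof=1)

print("sandwich SE:", sandwich_se)
print("bootstrap SE:", boot_se)   # the two roughly agree for large n
```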
arXiv (Cornell University), Dec 10, 2016
We discuss a model-robust theory for general types of regression in the simplest case of iid observations. The theory replaces the parameters of parametric models with statistical functionals, to be called "regression functionals" and defined on large non-parametric classes of joint x-y distributions without assuming a working model. Examples of regression functionals are the slopes of OLS linear equations at largely arbitrary x-y distributions (see Part I). More generally, regression functionals can be defined by minimizing objective functions or solving estimating equations at joint x-y distributions. The role of parametric models is reduced to heuristics for generating objective functions and estimating equations without assuming them as correct. In this framework it is possible to achieve the following: (1) explicate the meaning of mis/well-specification for regression functionals, (2) propose a misspecification diagnostic for regression functionals, (3) decompose sampling variability into two components, one due to the conditional response distributions and another due to the regressor distribution interacting (conspiring) with misspecification, (4) exhibit plug-in (and hence sandwich) estimators of standard error as limiting cases of x-y bootstrap estimators.
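The reweighting diagnostic of item (2) can be sketched as follows: weights that depend only on x leave a well-specified slope functional unchanged, so drift under reweighting signals misspecification. The exponential tilt family below is an illustrative choice, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(-2, 2, n)
y = x + 0.5 * x**2 + rng.normal(0, 1, n)   # misspecified for a linear fit
X = np.column_stack([np.ones(n), x])

def wols(X, y, w):
    """Weighted OLS: the slope functional at the reweighted distribution."""
    XW = X * w[:, None]
    return np.linalg.solve(XW.T @ X, XW.T @ y)

# Tilt the regressor distribution toward larger x.  Under well-specification
# the slope functional would be invariant to x-only reweighting; here it
# drifts, flagging misspecification.
for tilt in [0.0, 0.5, 1.0, 2.0]:
    w = np.exp(tilt * x)
    w /= w.mean()
    print(f"tilt={tilt}: slope functional = {wols(X, y, w)[1]:.3f}")
```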
Journal of The Royal Statistical Society Series B-statistical Methodology, Jul 1, 2022
Journal of the American Statistical Association, 1997
arXiv (Cornell University), Jan 9, 2018
Consider the problem of high dimensional variable selection for the Gaussian linear model when the unknown error variance is also of interest. In this paper, we show that the use of conjugate shrinkage priors for Bayesian variable selection can have detrimental consequences for such variance estimation. Such priors are often motivated by the invariance argument of Jeffreys (1961). Revisiting this work, however, we highlight a caveat that Jeffreys himself noticed; namely, that biased estimators can result from inducing dependence between parameters a priori. In a similar way, we show that conjugate priors for linear regression, which induce prior dependence, can lead to underestimation of the error variance in the Bayesian high-dimensional regression setting. Following Jeffreys, we recommend as a remedy to treat regression coefficients and the error variance as independent a priori. Using such an independence prior framework, we extend the Spike-and-Slab Lasso of Ročková and George (2018) to the unknown variance case. This extended procedure outperforms both the fixed variance approach and alternative penalized likelihood methods on simulated data. On the protein activity dataset of Clyde and Parmigiani (1998), the Spike-and-Slab Lasso with unknown variance achieves lower cross-validation error than alternative penalized likelihood methods, demonstrating the gains in predictive accuracy afforded by simultaneous error variance estimation.
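This is not the Spike-and-Slab Lasso itself, but a rough stand-in for the independence-prior idea: alternate a lasso fit whose penalty scales with the current sigma estimate with a residual-based variance update, in the spirit of a scaled lasso. The function name and the penalty scaling lam0 are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_with_variance(X, y, lam0=1.0, n_iter=20):
    """Alternate a lasso fit (penalty scaled by the current sigma) with a
    residual-based variance update; a rough stand-in for joint
    (beta, sigma^2) estimation under an independence prior."""
    n, p = X.shape
    sigma = np.std(y)                                    # crude initial value
    for _ in range(n_iter):
        alpha = lam0 * sigma * np.sqrt(np.log(p) / n)    # illustrative scaling
        fit = Lasso(alpha=alpha, fit_intercept=False).fit(X, y)
        r = y - X @ fit.coef_
        df = np.count_nonzero(fit.coef_)
        sigma = np.sqrt(r @ r / max(n - df, 1))          # variance update
    return fit.coef_, sigma**2

rng = np.random.default_rng(2)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:3] = [2.0, -1.5, 1.0]          # sparse truth
y = X @ beta + rng.normal(size=n)                        # true sigma^2 = 1
coef, s2 = lasso_with_variance(X, y)
print("nonzero coefficients:", np.flatnonzero(coef))
print("sigma^2 estimate:", round(s2, 3))                 # lam0 would be tuned
```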
Machine Learning - ML, 2002
When simple parametric models such as linear regression fail to adequately approximate a relationship across an entire set of data, an alternative may be to consider a partition of the data, and then use a separate simple model within each subset of the partition. Such an alternative is provided by a treed model which uses a binary tree to identify such a partition. However, treed models go further than conventional trees (e.g. CART, C4.5) by fitting models rather than a simple mean or proportion within each subset. In this paper, we propose a Bayesian approach for finding and fitting parametric treed models, in particular focusing on Bayesian treed regression. The potential of this approach is illustrated by a cross-validation comparison of predictive performance with neural nets, MARS, and conventional trees on simulated and real data sets.
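The paper's tree search is Bayesian (a stochastic search over the posterior on trees); the sketch below conveys only the treed-model structure, using an off-the-shelf CART split followed by a separate OLS fit in each leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 600
X = rng.uniform(-3, 3, size=(n, 1))
# Piecewise-linear truth: a different slope on each side of 0
y = np.where(X[:, 0] < 0, -2 * X[:, 0], 3 * X[:, 0]) + rng.normal(0, 0.5, n)

# Step 1: a shallow tree proposes the partition of the predictor space.
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=50).fit(X, y)
leaf = tree.apply(X)

# Step 2: fit a separate linear model within each leaf (the "treed" part).
models = {l: LinearRegression().fit(X[leaf == l], y[leaf == l])
          for l in np.unique(leaf)}

def predict(models, tree, Xnew):
    leaves = tree.apply(Xnew)
    return np.array([models[l].predict(row[None, :])[0]
                     for l, row in zip(leaves, Xnew)])

print("in-sample RMSE:", np.sqrt(np.mean((predict(models, tree, X) - y) ** 2)))
```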
For the standard regression setup, conventional tree models partition the predictor space into regions where the variable of interest, Y, can be approximated by a constant. A treed model extends this idea by allowing a functional relationship between Y and the predictors within each region. As opposed to using a single model to describe the global variation of the response, treed models allow for local modeling across the predictor space. In this paper, we consider treed versions of generalized linear models (GLMs) and propose a Bayesian approach to finding and fitting such models. The potential of this approach is illustrated with a treed Poisson regression.
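The same pattern with a GLM in each leaf, here Poisson; again the split is proposed by a greedy tree rather than the paper's Bayesian search, and sklearn's PoissonRegressor is an assumed stand-in for the per-leaf GLM fit.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(4)
n = 1000
X = rng.uniform(0, 1, size=(n, 2))
# Poisson rate depends on x0, with a regime change at x1 = 0.5
rate = np.exp(np.where(X[:, 1] < 0.5, 0.5 + 2 * X[:, 0], 2.0 - X[:, 0]))
y = rng.poisson(rate)

tree = DecisionTreeRegressor(max_depth=1).fit(X, y)     # propose the split
leaf = tree.apply(X)
glms = {l: PoissonRegressor().fit(X[leaf == l], y[leaf == l])
        for l in np.unique(leaf)}
for l, g in glms.items():
    print(f"leaf {l}: Poisson GLM coefficients {np.round(g.coef_, 2)}")
```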
Institute of Mathematical Statistics Collections, 2010
For the general Bayesian model uncertainty framework, the focus of this paper is on the development of model space priors which can compensate for redundancy between model classes, the so-called dilution priors proposed in George (1999). Several distinct approaches for dilution prior construction are suggested. One is based on tessellation determined neighborhoods, another on collinearity adjustments, and a third on pairwise distances between models.
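A sketch of the collinearity-adjustment route, assuming the determinant form often associated with dilution priors: weight each model by the determinant of the correlation submatrix of its included predictors, which equals 1 for orthogonal predictors and tends to 0 as they become redundant.

```python
import numpy as np
from itertools import combinations

def dilution_prior(R, base=0.5):
    """Collinearity-adjusted model-space prior: multiply an independent
    Bernoulli(base) inclusion prior by |R_gamma|, the determinant of the
    correlation submatrix of the included predictors.  (Empty model
    omitted for simplicity.)"""
    p = R.shape[0]
    weights = {}
    for k in range(1, p + 1):
        for gamma in combinations(range(p), k):
            sub = R[np.ix_(gamma, gamma)]
            weights[gamma] = base**k * (1 - base)**(p - k) * np.linalg.det(sub)
    Z = sum(weights.values())
    return {g: w / Z for g, w in weights.items()}

# Three predictors, the first two nearly redundant
R = np.array([[1.0, 0.95, 0.0],
              [0.95, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
prior = dilution_prior(R)
print(prior[(0, 1)], "<", prior[(0, 2)])  # the redundant pair gets diluted mass
```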
The Annals of Applied Statistics, 2014
We consider the task of discovering gene regulatory networks, which are defined as sets of genes and the corresponding transcription factors which regulate their expression levels. This can be viewed as a variable selection problem, potentially with high dimensionality. Variable selection is especially challenging in high-dimensional settings, where it is difficult to detect subtle individual effects and interactions between predictors. Bayesian Additive Regression Trees [BART, Ann. Appl. Stat. 4 (2010) 266-298] provides a novel nonparametric alternative to parametric regression approaches, such as the lasso or stepwise regression, especially when the number of relevant predictors is sparse relative to the total number of available predictors and the fundamental relationships are nonlinear. We develop a principled permutation-based inferential approach for determining when the effect of a selected predictor is likely to be real. Going further, we adapt the BART procedure to incorporate informed prior information about variable importance. We present simulations demonstrating that our method compares favorably to existing parametric and nonparametric procedures in a variety of data settings. To demonstrate the potential of our approach in a biological context, we apply it to the task of inferring the gene regulatory network in yeast (Saccharomyces cerevisiae). We find that our BART-based procedure is best able to recover the subset of covariates with the largest signal compared to other variable selection methods. The methods developed in this work are readily available in the R package bartMachine.
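BART is not in scikit-learn, so this sketch applies the same permutation-null idea to random-forest importances as an assumed stand-in: permute y to break all x-y association, refit, and keep predictors whose observed importance exceeds a null quantile.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
n, p = 300, 20
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] - X[:, 1] ** 2 + rng.normal(size=n)   # only x0, x1 matter

def importances(X, y):
    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    return rf.fit(X, y).feature_importances_

obs = importances(X, y)

# Null distribution: permuting y makes any apparent importance pure noise.
B = 50
null = np.array([importances(X, rng.permutation(y)) for _ in range(B)])
threshold = np.quantile(null.max(axis=1), 0.95)       # global null cutoff

print("selected predictors:", np.flatnonzero(obs > threshold))  # expect [0, 1]
```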
Handbooks of Sociology and Social Research, 2013
It is common for social science researchers to provide estimates of causal effects from regression models imposed on observational data. The many problems with such work are well documented and widely known. The usual response is to claim, with little real evidence, that the causal model is close enough to the “truth” that sufficiently accurate causal effects can be estimated. In this chapter, a more circumspect approach is taken. We assume that the causal model is a substantial distance from the truth and then consider what can be learned nevertheless. To that end, we distinguish between how nature generated the data, a “true” model representing how this was accomplished, and a working model that is imposed on the data. The working model will typically be “wrong.” Nevertheless, unbiased or asymptotically unbiased estimates from parametric, semiparametric, and nonparametric working models can often be obtained in concert with appropriate statistical tests and confidence intervals. However, the estimates are not of the regression parameters typically assumed. Estimates of causal effects are not provided. Correlation is not causation. Nor is partial correlation, even when dressed up as regression coefficients. However, we argue that insights about causal effects do not require estimates of causal effects. We also discuss what can be learned when our alternative approach is not persuasive.
Health Care Management Science, 2014
A commonly used method for evaluating a hospital's performance on an outcome is to compare the hospital's observed outcome rate to the hospital's expected outcome rate given its patient (case) mix and service. The process of calculating the hospital's expected outcome rate given its patient mix and service is called risk adjustment. Risk adjustment is critical for accurately evaluating and comparing hospitals' performances since we would not want to unfairly penalize a hospital just because it treats sicker patients. The key to risk adjustment is accurately estimating the probability of an outcome given patient characteristics. For cases with binary outcomes, the method that is commonly used in risk adjustment is logistic regression. In this paper, we consider ensemble-of-trees methods as alternatives for risk adjustment, including random forests and Bayesian additive regression trees (BART). Both random forests and BART are modern machine learning methods that have recently been shown to have excellent performance for prediction of outcomes in many settings. We apply these methods to carry out risk adjustment for the performance of neonatal intensive care units (NICUs). We show that these ensemble-of-trees methods outperform logistic regression in predicting mortality among babies treated in NICUs, and thus provide a superior method of risk adjustment.
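A toy version of the observed-to-expected comparison (simulated data; real risk adjustment would use richer case-mix variables and out-of-sample predictions): fit either model to predict the outcome from patient characteristics, then form each hospital's O/E ratio.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(6)
n = 5000
severity = rng.normal(size=n)                       # patient risk factor
hospital = rng.integers(0, 10, n)                   # 10 NICUs
p_true = 1 / (1 + np.exp(-(-2 + 1.5 * severity)))   # outcome driven by case mix
died = rng.binomial(1, p_true)

X = severity[:, None]
for name, model in [("logistic", LogisticRegression()),
                    ("forest", RandomForestClassifier(n_estimators=200,
                                                      random_state=0))]:
    # In-sample predictions for brevity; practice uses held-out estimates.
    expected = model.fit(X, died).predict_proba(X)[:, 1]
    for h in range(3):                              # print a few hospitals
        mask = hospital == h
        oe = died[mask].sum() / expected[mask].sum()
        print(f"{name}: hospital {h} observed/expected = {oe:.2f}")
```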
Evaluation Review, 2013
Background: It has become common practice to analyze randomized experiments using linear regression with covariates. Improved precision of treatment effect estimates is the usual motivation. In a series of important articles, David Freedman showed that this approach can be badly flawed. Recent work by Winston Lin offers partial remedies, but important problems remain. Results: In this article, we address those problems through a reformulation of the Neyman causal model. We provide a practical estimator and valid standard errors for the average treatment effect. Proper generalizations to well-defined populations can follow. Conclusion: In most applications, the use of covariates to improve precision is not worth the trouble.
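For reference, the baseline these articles start from is the Neyman difference-in-means estimator with its conservative variance; the paper's reformulated estimator may differ in details, so treat this as the standard version only.

```python
import numpy as np

def neyman_ate(y, t):
    """Difference-in-means ATE with Neyman's conservative variance estimate."""
    y1, y0 = y[t == 1], y[t == 0]
    ate = y1.mean() - y0.mean()
    var = y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0)
    return ate, np.sqrt(var)

rng = np.random.default_rng(7)
n = 400
t = rng.integers(0, 2, n)                 # random assignment
y = 1.0 * t + rng.normal(size=n)          # true ATE = 1
ate, se = neyman_ate(y, t)
print(f"ATE = {ate:.2f} +/- {1.96 * se:.2f}")
```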
The Annals of Statistics, 1993
Biometrika, 2000
For the problem of variable selection for the normal linear model, selection criteria such as AIC, C_p, BIC and RIC have fixed dimensionality penalties. Such criteria are shown to correspond to selection of maximum posterior models under implicit hyperparameter choices for a particular hierarchical Bayes formulation. Based on this calibration, we propose empirical Bayes selection criteria that use hyperparameter estimates instead of fixed choices. For obtaining these estimates, both marginal and conditional maximum likelihood methods are considered. As opposed to traditional fixed penalty criteria, these empirical Bayes criteria have dimensionality penalties that depend on the data. Their performance is seen to approximate adaptively the performance of the best fixed penalty criterion across a variety of orthogonal and nonorthogonal setups, including wavelet regression. Empirical Bayes shrinkage estimators of the selected coefficients are also proposed.
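In the orthogonal, known-variance case, selection reduces to thresholding squared z-statistics, with AIC, BIC and RIC corresponding to per-parameter penalties 2, log n and 2 log p. The empirical-Bayes penalty below, estimated from a crude plug-in signal proportion, is a deliberately simplified sketch of the data-dependent idea, not the paper's marginal/conditional ML criteria.

```python
import numpy as np

rng = np.random.default_rng(8)
n_obs, p = 100, 40
sigma = 1.0
beta = np.zeros(p); beta[:5] = 3.0                       # 5 real signals
z = np.sqrt(n_obs) * beta / sigma + rng.normal(size=p)   # orthogonal z-stats

penalties = {"AIC": 2.0, "BIC": np.log(n_obs), "RIC": 2 * np.log(p)}

# Crude empirical-Bayes penalty: estimate the fraction w of nonzero
# coefficients from the data, then penalize rare signals more heavily.
w_hat = np.mean(z**2 > 2 * np.log(p))                    # rough signal fraction
penalties["EB (sketch)"] = 2 * np.log((1 - w_hat) / max(w_hat, 1 / p))

for name, pen in penalties.items():
    selected = np.flatnonzero(z**2 > pen)
    print(f"{name:12s} penalty={pen:5.2f} -> selects {len(selected)} coefficients")
```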
For the problem of estimating the mean of a univariate normal distribution with known variance, the maximum likelihood estimator (MLE) is best invariant, minimax, and admissible under squared-error loss. It is shown that if the variance is the realized value of an ancillary statistic with known distribution, the MLE can be inadmissible with respect to the unconditional risk averaged over this ancillary distribution.
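Stated symbolically (a restatement of the abstract's setup, not an excerpt from the paper): with $X \mid V = v \sim N(\theta, v)$ and $V$ an ancillary statistic with known distribution $G$, the relevant risk averages over the ancillary law,

$$R(\theta, \delta) = E_{V \sim G}\, E_{X \mid V}\!\left[ \big(\delta(X, V) - \theta\big)^2 \right],$$

and the claim is that the MLE $\hat\theta = X$, although admissible conditionally on each fixed $v$, can be dominated under this unconditional risk $R$.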
The richest form of a prediction is a predictive density over the space of all possible outcomes, a density which is obtained naturally by the Bayesian approach. In this chapter, we describe a variety of recent results that use a decision theoretic framework based on expected Kullback-Leibler loss to evaluate the long run performance of Bayesian predictive estimators. In particular, we focus on high dimensional prediction for the multivariate normal distribution and extensions to the normal linear regression model. General conditions for minimaxity and admissibility, as well as a complete class theorem, are described.
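A self-contained check of the flavor of result involved (a standard normal-mean calculation, not from the chapter): for X_1, ..., X_n iid N(theta, 1), the flat-prior Bayes predictive density N(xbar, 1 + 1/n) beats the plug-in density N(xbar, 1) in expected Kullback-Leibler loss.

```python
import numpy as np

def kl_normal(mu0, v0, mu1, v1):
    """KL( N(mu0, v0) || N(mu1, v1) )."""
    return 0.5 * (np.log(v1 / v0) + (v0 + (mu0 - mu1) ** 2) / v1 - 1)

rng = np.random.default_rng(9)
theta, n, reps = 0.0, 5, 200_000
xbar = theta + rng.normal(size=reps) / np.sqrt(n)   # sampling dist. of xbar

plugin = kl_normal(theta, 1.0, xbar, 1.0).mean()          # plug-in N(xbar, 1)
bayes = kl_normal(theta, 1.0, xbar, 1.0 + 1 / n).mean()   # Bayes N(xbar, 1+1/n)

# Closed forms: plug-in risk = 1/(2n); Bayes risk = 0.5*log(1 + 1/n) < 1/(2n).
print(f"plug-in KL risk: {plugin:.4f}  (theory: {1 / (2 * n):.4f})")
print(f"Bayes   KL risk: {bayes:.4f}  (theory: {0.5 * np.log(1 + 1 / n):.4f})")
```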