Key Concepts in Model Selection: Performance and Generalizability

Model selection criteria: An investigation of relative accuracy, posterior probabilities, and combinations of criteria

1995

We investigate the performance of empirical criteria for comparing and selecting quantitative models from among a candidate set. A simulation based on empirically observed parameter values is used to determine which criterion is the most accurate at identifying the correct model specification. The simulation is composed of both nested and nonnested linear regression models. We then derive posterior probability estimates of the superiority of the alternative models from each of the criteria and evaluate the relative accuracy, bias, and information content of these probabilities. To investigate whether additional accuracy can be derived from combining criteria, a method for obtaining a joint prediction from combinations of the criteria is proposed and the incremental improvement in selection accuracy considered. Based on the simulation, we conclude that most leading criteria perform well in selecting the best model, and several criteria also produce accurate probabilities of model superiority. Computationally intensive criteria failed to perform better than criteria that were computationally simpler. Also, the use of several criteria in combination failed to appreciably outperform the use of one model. The Schwarz criterion performed best overall in terms of selection accuracy, accuracy of posterior probabilities, and ease of use. Thus, we suggest that general model comparison, model selection, and model probability estimation be performed using the Schwarz criterion, which can be implemented (given the model log likelihoods) using only a hand calculator.
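The abstract notes that the Schwarz criterion needs only the model log likelihoods. A minimal sketch of that calculation follows, using the standard BIC formula and the common approximation that turns BIC differences into posterior model probabilities under equal prior model probabilities. The abstract gives neither formula, so both are assumptions here, and the candidate names, log likelihoods, parameter counts, and sample size are hypothetical.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Schwarz criterion (BIC): -2 log L + k log n. Lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

def bic_posterior_probs(bic_values):
    """Approximate posterior model probabilities from BIC values.

    Assumes equal prior probabilities for all models (an assumption of
    this sketch, not something stated in the abstract).
    """
    best = min(bic_values)
    # Subtract the best BIC before exponentiating for numerical stability.
    weights = [math.exp(-0.5 * (b - best)) for b in bic_values]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical candidates: name -> (maximized log likelihood, parameter count)
candidates = {"M1": (-520.3, 3), "M2": (-512.0, 5), "M3": (-514.8, 8)}
n_obs = 200  # hypothetical sample size

scores = {name: bic(ll, k, n_obs) for name, (ll, k) in candidates.items()}
probs = dict(zip(scores, bic_posterior_probs(list(scores.values()))))
best_model = min(scores, key=scores.get)

for name in scores:
    print(f"{name}: BIC={scores[name]:.2f}, approx. posterior={probs[name]:.3f}")
print("selected:", best_model)
```

With these made-up numbers the five-parameter model is selected: its gain in log likelihood over the three-parameter model outweighs the penalty, while the eight-parameter model's extra parameters do not pay for themselves.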

Model selection: Beyond the Bayesian/frequentist divide

Journal of Machine Learning Research, 2010

The principle of parsimony, also known as "Ockham's razor", has inspired many theories of model selection. Yet such theories, all making arguments in favor of parsimony, are based on very different premises and have developed distinct methodologies to derive algorithms. We have organized challenges and edited a special issue of JMLR and several conference proceedings around the theme of model selection. In this editorial, we revisit the problem of avoiding overfitting in light of the latest results. We note the remarkable convergence of theories as different as Bayesian theory, Minimum Description Length, bias/variance tradeoff, Structural Risk Minimization, and regularization, in some approaches. We also present new and interesting examples of the complementarity of theories leading to hybrid algorithms, neither frequentist, nor Bayesian, or perhaps both frequentist and Bayesian!

A comparative study of information criteria for model selection

2006

To build good models, we need to know the appropriate model size. To handle this problem, a variety of information criteria have already been proposed, each with a different background. In this paper, we consider the problem of model selection and investigate the performance of a number of proposed information criteria. We also examine whether the assumption used to derive their formulae, namely that fitting errors are normally distributed, holds under various conditions (different numbers of data points and noise levels). The results show that although applying information criteria prevents over-fitting and under-fitting in most cases, there are cases where these cannot be avoided even in ideal situations with many data points and low noise levels. The results also show that the fitting errors are not always normally distributed, even though the observational noise is Gaussian, which contradicts an assumption of the information criteria.

Objective Bayesian Methods for Model Selection: Introduction and Comparison

Institute of Mathematical Statistics Lecture Notes - Monograph Series, 2001

The basics of the Bayesian approach to model selection are first presented, as well as the motivations for the Bayesian approach. We then review four methods of developing default Bayesian procedures that have undergone considerable recent development, the Conventional Prior approach, the Bayes Information Criterion, the Intrinsic Bayes Factor, and the Fractional Bayes Factor. As part of the review, these methods are illustrated on examples involving the normal linear model. The later part of the chapter focuses on comparison of the four approaches, and includes an extensive discussion of criteria for judging model selection procedures.

A comparison of a large number of model selection criteria

2008

This paper uses Monte Carlo analysis to compare the performance of a large number of model selection criteria (MSC), including information criteria, General-to-Specific modelling, Bayesian Model Averaging, and portfolio models. We use Mean Squared Error (MSE) as our measure of MSC performance. The decomposition of MSE into Bias and Variance provides a useful framework for understanding MSC performance.
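The decomposition mentioned above is the identity MSE = Bias² + Variance. The sketch below checks it numerically in a small Monte Carlo experiment. The shrinkage estimator, the true mean, and the simulation settings are hypothetical illustrations, not the paper's setup.

```python
import random
import statistics

def mse_decomposition(estimates, true_value):
    """Return (mse, bias, variance) for repeated estimates of true_value."""
    mean_est = statistics.fmean(estimates)
    bias = mean_est - true_value
    # Population variance, so the identity holds exactly (up to rounding).
    variance = statistics.pvariance(estimates)
    mse = statistics.fmean([(e - true_value) ** 2 for e in estimates])
    return mse, bias, variance

rng = random.Random(0)
true_mean = 2.0    # hypothetical true parameter
n_obs = 10         # observations per simulated data set
shrink = 0.8       # hypothetical shrinkage factor; introduces bias

# Each replication: draw a sample, estimate the mean, shrink it toward zero.
estimates = [
    shrink * statistics.fmean([rng.gauss(true_mean, 1.0) for _ in range(n_obs)])
    for _ in range(5000)
]

mse, bias, variance = mse_decomposition(estimates, true_mean)
print(f"MSE={mse:.4f}  bias^2={bias ** 2:.4f}  variance={variance:.4f}")
```

The shrinkage estimator trades a negative bias for a smaller variance than the plain sample mean, which is the kind of trade-off the decomposition makes visible.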

On over-fitting in model selection and subsequent selection bias in performance evaluation

Journal of Machine Learning Research, 2010

Model selection strategies for machine learning algorithms typically involve the numerical optimisation of an appropriate model selection criterion, often based on an estimator of generalisation performance, such as k-fold cross-validation. The error of such an estimator can be broken down into bias and variance components. While unbiasedness is often cited as a beneficial quality of a model selection criterion, we demonstrate that a low variance is at least as important, as a non-negligible variance introduces the potential for over-fitting in model selection as well as in training the model. While this observation is in hindsight perhaps rather obvious, the degradation in performance due to over-fitting the model selection criterion can be surprisingly large, an observation that appears to have received little attention in the machine learning literature to date. In this paper, we show that the effects of this form of over-fitting are often of comparable magnitude to differences in performance between learning algorithms, and thus cannot be ignored in empirical evaluation. Furthermore, we show that some common performance evaluation practices are susceptible to a form of selection bias as a result of this form of over-fitting and hence are unreliable. We discuss methods to avoid over-fitting in model selection and subsequent selection bias in performance evaluation, which we hope will be incorporated into best practice. While this study concentrates on cross-validation based model selection, the findings are quite general and apply to any model selection practice involving the optimisation of a model selection criterion evaluated over a finite sample of data, including maximisation of the Bayesian evidence and optimisation of performance bounds.
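The selection bias described above can be illustrated with a toy simulation: when a noisy criterion is minimised over many candidates, the winning score is optimistically biased even if no candidate is truly better. This is a sketch under strong assumptions, not the paper's experiment: all candidates share the same true error, the criterion noise is Gaussian, and every number is hypothetical.

```python
import random
import statistics

def simulate_selection_bias(n_candidates=20, true_error=0.30,
                            noise_sd=0.05, n_trials=2000, seed=0):
    """Average criterion value of the selected (minimum-score) candidate.

    Every candidate has the same true error, so any gap between the
    returned value and true_error is pure selection optimism.
    """
    rng = random.Random(seed)
    selected_scores = []
    for _ in range(n_trials):
        # Noisy estimate of each candidate's error (e.g. a CV estimate).
        noisy = [true_error + rng.gauss(0.0, noise_sd)
                 for _ in range(n_candidates)]
        # Model selection: keep the candidate with the best-looking score.
        selected_scores.append(min(noisy))
    return statistics.fmean(selected_scores)

apparent = simulate_selection_bias()
optimism = 0.30 - apparent
print(f"true error=0.300  apparent error of winner={apparent:.3f}  "
      f"optimism={optimism:.3f}")
```

Reporting the winner's own criterion value as its performance therefore overstates it. Evaluating the selected model on data that played no part in the selection, for example in an outer cross-validation loop, removes this optimism.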

A Comparison of Scientific and Engineering Criteria for Model Selection

2000

Given a set of possible models for variables X and a set of possible parameters for each model, the Bayesian "estimate" of the probability distribution for X given observed data is obtained by averaging over the possible models and their parameters. An often-used approximation for this estimate is obtained by selecting a single model and averaging over its parameters. The approximation is useful because it is computationally efficient, and because it provides a model that facilitates understanding of the domain. A common criterion for model selection is the posterior probability of the model. Another criterion for model selection, proposed by San Martini and Spezzaferri (1984), is the predictive performance of a model for the next observation to be seen. From the standpoint of domain understanding, both criteria are useful, because one identifies the model that is most likely, whereas the other identifies the model that is the best predictor of the next observation. To highlight the difference, we refer to the posterior-probability and alternative criteria as the scientific criterion (SC) and engineering criterion (EC), respectively. When we are interested in predicting the next observation, the model-averaged estimate is at least as good as that produced by EC, which itself is at least as good as the estimate produced by SC. We show experimentally that, for Bayesian-network models containing discrete variables only, the predictive performance of the model average can be significantly better than those of single models selected by either criterion, and that differences between models selected by the two criteria can be substantial.
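The difference between averaging over models and selecting the most probable one can be sketched with a deliberately simplified example. Assumptions of this sketch: each "model" is a Bernoulli distribution with a fixed parameter, so there is no averaging over parameters; priors are uniform; and predictions are scored by expected log score under the model-averaged predictive. The candidate values and data counts are hypothetical, and the paper's Bayesian-network experiments are not reproduced.

```python
import math

def bernoulli_likelihood(theta, successes, failures):
    """Likelihood of the observed counts under a fixed Bernoulli parameter."""
    return theta ** successes * (1.0 - theta) ** failures

def posterior_model_probs(priors, likelihoods):
    """Posterior model probabilities via Bayes' rule."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

def expected_log_score(p_true, p_pred):
    """Expected log score of predicting p_pred when the outcome ~ Bernoulli(p_true)."""
    return p_true * math.log(p_pred) + (1.0 - p_true) * math.log(1.0 - p_pred)

thetas = [0.3, 0.5, 0.7]        # hypothetical candidate models
priors = [1.0 / 3.0] * 3        # uniform prior over models (assumption)
successes, failures = 6, 4      # hypothetical observed data

likelihoods = [bernoulli_likelihood(t, successes, failures) for t in thetas]
post = posterior_model_probs(priors, likelihoods)

# Model-averaged predictive probability that the next observation is a success.
p_avg = sum(w * t for w, t in zip(post, thetas))

# Scientific criterion (SC): pick the model with the highest posterior.
sc_index = max(range(len(post)), key=post.__getitem__)
p_sc = thetas[sc_index]

score_avg = expected_log_score(p_avg, p_avg)
score_sc = expected_log_score(p_avg, p_sc)
print(f"posterior={[round(w, 3) for w in post]}")
print(f"averaged prediction={p_avg:.3f}, SC prediction={p_sc:.3f}")
print(f"expected log score: average={score_avg:.4f}, SC={score_sc:.4f}")
```

Under the posterior's own predictive distribution, the averaged prediction can never score worse than any single model's prediction, which mirrors the ordering stated in the abstract.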

A Teaching Note for Model Selection and Validation

The model selection problem is crucial for decision making in statistical research and management. Choosing the best among many competing models is even more crucial for researchers. This short article is prepared as a teaching note on choosing an appropriate model for a real-life data set. We briefly describe some of the existing methods of model selection. The better of two competing models is chosen by comparing the limited expected value function (LEVF) or the loss elimination ratio (LER). A data set is analyzed using MINITAB software.
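For readers unfamiliar with the two quantities, the sketch below computes their empirical versions using the standard actuarial definitions: the limited expected value at limit d is E[min(X, d)], and the loss elimination ratio is that value divided by E[X]. The note itself works in MINITAB and its abstract does not give formulas, so these definitions, the function names, and the loss amounts are assumptions of this sketch.

```python
def limited_expected_value(losses, limit):
    """Empirical limited expected value: mean of min(loss, limit)."""
    return sum(min(x, limit) for x in losses) / len(losses)

def loss_elimination_ratio(losses, deductible):
    """Empirical LER: share of expected loss removed by a deductible."""
    mean_loss = sum(losses) / len(losses)
    return limited_expected_value(losses, deductible) / mean_loss

# Hypothetical loss amounts
losses = [100.0, 250.0, 400.0, 1200.0, 3050.0]
deductible = 500.0

lev = limited_expected_value(losses, deductible)
ler = loss_elimination_ratio(losses, deductible)
print(f"LEV at {deductible:.0f}: {lev:.2f}")
print(f"LER at {deductible:.0f}: {ler:.3f}")
```

One way to compare two fitted models with these quantities is to evaluate each model's theoretical LEVF or LER at several limits and see which tracks the empirical values more closely.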