Stepwise selection in small data sets: a simulation study of bias in logistic regression analysis
E W Steyerberg et al. J Clin Epidemiol. 1999 Oct.
Abstract
Stepwise selection methods are widely applied to identify covariables for inclusion in regression models. One problem with stepwise selection is biased estimation of the regression coefficients. We illustrate this "selection bias" with logistic regression in the GUSTO-I trial (40,830 patients with an acute myocardial infarction). Random samples were drawn that included 3, 5, 10, 20, or 40 events per variable (EPV). Backward stepwise selection was applied in models containing 8 or 16 pre-specified predictors of 30-day mortality. We found considerable overestimation of the regression coefficients of selected covariables. The selection bias decreased with increasing EPV. With a conventional selection criterion (alpha = 0.05), the bias exceeded 25% for 7, 3, and 1 of the covariables in the 8-predictor model at EPV 3, 10, and 40, respectively. At these EPV values, the bias was less than 20% for all covariables when no selection was applied. We conclude that stepwise selection may result in substantial bias of estimated regression coefficients.
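The mechanism behind this selection bias can be illustrated with a small simulation. The sketch below is a deliberately simplified, single-predictor analogue and not the authors' procedure: it does not use GUSTO-I data, does not run backward stepwise selection over 8 or 16 predictors, and does not reproduce the EPV settings of the study. It simulates one binary covariable with a known log odds ratio, estimates that log odds ratio from the 2x2 table, and "selects" the covariable only when a Wald test is significant at alpha = 0.05. The function name, all parameter values, the 0.5 continuity correction, and the definition of relative bias against the simulated truth are assumptions made for illustration.

```python
import math
import random


def simulate_selection_bias(true_beta=0.5, n=200, p_x=0.5, base_risk=0.2,
                            z_crit=1.96, n_rep=2000, seed=1):
    """Compare the mean estimated log odds ratio over all replicates with
    the mean over replicates in which the covariable was 'selected'.

    All defaults are hypothetical illustration values, not from the paper.
    """
    rng = random.Random(seed)
    intercept = math.log(base_risk / (1.0 - base_risk))
    risk0 = base_risk                                  # event risk when x = 0
    risk1 = 1.0 / (1.0 + math.exp(-(intercept + true_beta)))  # risk when x = 1

    all_est, selected_est = [], []
    for _ in range(n_rep):
        counts = [[0, 0], [0, 0]]                      # counts[x][y]
        for _ in range(n):
            x = 1 if rng.random() < p_x else 0
            y = 1 if rng.random() < (risk1 if x else risk0) else 0
            counts[x][y] += 1

        # Add 0.5 to every cell (an assumption of this sketch) so the
        # log odds ratio stays finite when a cell happens to be empty.
        a = counts[1][1] + 0.5   # exposed, event
        b = counts[1][0] + 0.5   # exposed, no event
        c = counts[0][1] + 0.5   # unexposed, event
        d = counts[0][0] + 0.5   # unexposed, no event

        beta_hat = math.log((a * d) / (b * c))
        se = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)

        all_est.append(beta_hat)
        # Selection rule: keep the covariable only if the two-sided
        # Wald test is significant (|z| > 1.96 for alpha = 0.05).
        if abs(beta_hat / se) > z_crit:
            selected_est.append(beta_hat)

    mean_all = sum(all_est) / len(all_est)
    mean_sel = (sum(selected_est) / len(selected_est)
                if selected_est else float("nan"))
    return {
        "true_beta": true_beta,
        "mean_all": mean_all,
        "mean_selected": mean_sel,
        "fraction_selected": len(selected_est) / n_rep,
        # Relative bias of the selected estimates against the known truth.
        "relative_bias_selected_pct": 100.0 * (mean_sel - true_beta) / true_beta,
    }


if __name__ == "__main__":
    for key, value in simulate_selection_bias().items():
        print(f"{key}: {value:.3f}")
```

Because a weak covariable only passes the significance test in replicates where its estimate happens to be large, the mean over selected replicates sits above the mean over all replicates. This is the same conditioning effect that the abstract describes for coefficients retained by stepwise selection.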