Strategies for variable selection in large-scale healthcare database studies with missing covariate and outcome data
Related papers
BMC Medical Research Methodology
Background: Prior work has shown that combining bootstrap imputation with tree-based machine learning variable selection methods can achieve performance comparable to that achievable on fully observed data when covariate and outcome data are missing at random (MAR). This approach, however, is computationally expensive, especially on large-scale datasets. Methods: We propose an inference-based method, called RR-BART, which leverages the likelihood-based Bayesian machine learning technique, Bayesian additive regression trees, and uses Rubin's rule to combine the estimates and variances of the variable importance measures on multiply imputed datasets for variable selection in the presence of MAR data. We conduct a representative simulation study to investigate the practical operating characteristics of RR-BART, and compare it with the bootstrap imputation based methods. We further demonstrate the methods via a case study of risk factors for 3-year incidence of metabolic syndrome among middle-aged women...
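The pooling step named in the abstract above can be illustrated with the standard form of Rubin's rules. This is a minimal sketch, not the RR-BART implementation: the function name `pool_rubin` and the example numbers are hypothetical, and the abstract does not say how the importance measures are computed or how the pooled result is turned into a selection decision.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their variances with Rubin's rules.

    estimates: one point estimate per imputed dataset
    variances: the matching within-imputation variances
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    pooled = mean(estimates)            # average of the point estimates
    within = mean(variances)            # average within-imputation variance
    between = variance(estimates)       # sample variance across imputations (m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical importance estimates for one covariate across 3 imputed datasets.
pooled, total_var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what separates this from simply averaging the variances: it inflates the total variance when the imputed datasets disagree.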
Dealing with missing data in observational health care outcome analyses
Journal of Clinical Epidemiology, 2000
Observational outcome analyses appear frequently in the health research literature. For such analyses, clinical registries are preferred to administrative databases. Missing data are a common problem in any clinical registry, and pose a threat to the validity of observational outcomes analyses. Faced with missing data in a new clinical registry, we compared three possible responses: exclude cases with missing data; assume that the missing data indicated absence of risk; or merge the clinical database with an existing administrative database. The predictive model derived using the merged data showed a higher C statistic (C = 0.770), better model goodness-of-fit as measured in a decile-of-risk analysis, the largest gradient of risk across deciles (46.3), and the largest decrease in deviance (−2 log likelihood = 406.2). The superior performance of the enhanced data model supports the use of this "enhancement" methodology and bears consideration when researchers are faced with nonrandom missing data.
European Journal of Epidemiology, 2019
A review of epidemiological papers conducted in 2009 concluded that several studies employed variable selection methods prone to introducing bias and yielding inadequate inferences. Many new confounder selection methods have been developed since then. The goal of the study was to provide an updated descriptive portrait of which variable selection methods are used by epidemiologists for analyzing observational data. Studies published in four major epidemiological journals in 2015 were reviewed. Only articles concerned with a predictive or explicative objective and reporting on the analysis of individual data were included. Method(s) employed for selecting variables were extracted from retained articles. A total of 975 articles were retrieved and 299 met eligibility criteria, 292 of which pursued an explicative objective. Among those, 146 studies (50%) reported using prior knowledge or causal graphs for selecting variables, 34 (12%) used change in effect estimate methods, 26 (9%) used stepwise approaches, 16 (5%) employed univariate analyses, 5 (2%) used various other methods and 107 (37%) did not provide sufficient details to allow classification (more than one method could be employed in a single article). Despite being less frequent than in the previous review, stepwise and univariable analyses, which are prone to introducing bias and producing inadequate inferences, were still prevalent. Moreover, 37% of studies did not provide sufficient details to assess how variables were selected. We thus believe there is still room for improvement in variable selection methods used by epidemiologists and in their reporting.
Variable selection for covariate-adjusted semiparametric inference in randomized clinical trials
Statistics in Medicine, 2012
Extensive baseline covariate information is routinely collected on participants in randomized clinical trials, and it is well-recognized that a proper covariate-adjusted analysis can improve the efficiency of inference on the treatment effect. However, such covariate adjustment has engendered considerable controversy, as post hoc selection of covariates may involve subjectivity and lead to biased inference, while prior specification of the adjustment may exclude important variables from consideration. Accordingly, how to select covariates objectively to gain maximal efficiency is of broad interest. We propose and study the use of modern variable selection methods for this purpose in the context of a semiparametric framework, under which variable selection in modeling the relationship between outcome and covariates is separated from estimation of the treatment effect, circumventing the potential for selection bias associated with standard analysis of covariance methods. We demonstrate that such objective variable selection techniques combined with this framework can identify key variables and lead to unbiased and efficient inference on the treatment effect. A critical issue in finite samples is validity of estimators of uncertainty, such as standard errors and confidence intervals for the treatment effect. We propose an approach to estimation of sampling variation of estimated treatment effect and show its superior performance relative to that of existing methods.
Variable selection - A review and recommendations for the practicing statistician
Biometrical Journal
Statistical models support medical research by facilitating individualized outcome prognostication conditional on independent variables or by estimating effects of risk factors adjusted for covariates. Theory of statistical models is well-established if the set of independent variables to consider is fixed and small. Hence, we can assume that effect estimates are unbiased and the usual methods for confidence interval estimation are valid. In routine work, however, it is not known a priori which covariates should be included in a model, and often we are confronted with the number of candidate variables in the range 10-30. This number is often too large to be considered in a statistical model. We provide an overview of various available variable selection methods that are based on significance or information criteria, penalized likelihood, the change-in-estimate criterion, background knowledge, or combinations thereof. These methods were usually developed in the context of a linear regression model and then transferred to more generalized linear models or models for censored survival data. Variable selection, in particular if used in explanatory modeling where effect estimates are of central interest, can compromise stability of a final model, unbiasedness of regression coefficients, and validity of p-values or confidence intervals. Therefore, we give pragmatic recommendations for the practicing statistician on application of variable selection methods in general (low-dimensional) modeling problems and on performing stability investigations and inference. We also propose some quantities based on resampling the entire variable selection process to be routinely reported by software packages offering automated variable selection algorithms. 
Keywords: change-in-estimate criterion, penalized likelihood, resampling, statistical model, stepwise selection
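One resampling-based quantity of the kind the abstract above alludes to is the bootstrap inclusion frequency: the share of bootstrap resamples in which a variable is selected when the whole selection procedure is rerun. The sketch below is only an illustration under assumptions. The names `bootstrap_inclusion_frequencies` and `toy_select`, the toy data, and the selection rule are all hypothetical, and the abstract does not list which quantities the authors actually propose.

```python
import random
from collections import Counter


def bootstrap_inclusion_frequencies(rows, select, n_boot=200, seed=0):
    """Rerun a selection procedure on bootstrap resamples of `rows`.

    select: callable taking a list of rows and returning the selected names.
    Returns {name: fraction of resamples in which it was selected};
    variables that are never selected do not appear in the result.
    """
    rng = random.Random(seed)
    counts = Counter()
    n = len(rows)
    for _ in range(n_boot):
        sample = [rows[rng.randrange(n)] for _ in range(n)]  # draw with replacement
        counts.update(set(select(sample)))
    return {name: c / n_boot for name, c in counts.items()}


def toy_select(sample):
    # Hypothetical rule: keep a covariate if its mean differs by more than 1.0
    # between the two outcome groups. Stands in for a real selection algorithm.
    selected = []
    for name in ("x1", "x2"):
        g1 = [r[name] for r in sample if r["y"] == 1]
        g0 = [r[name] for r in sample if r["y"] == 0]
        if g1 and g0 and abs(sum(g1) / len(g1) - sum(g0) / len(g0)) > 1.0:
            selected.append(name)
    return selected


# Toy data: x1 tracks the outcome, x2 does not.
rows = [{"y": i % 2, "x1": 5.0 * (i % 2) + (i % 3), "x2": float(i % 4)} for i in range(12)]
frequencies = bootstrap_inclusion_frequencies(rows, toy_select)
```

A frequency near 1 suggests a variable is selected almost regardless of sampling variation, while intermediate values flag an unstable selection.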
Predictive model in the presence of missing data: the centroid criterion for variable selection
2018
Introduction: In many studies, covariates are not always fully observed because of a missing-data process. Usually, subjects with missing data are excluded from the analysis, but the number of covariates can be greater than the size of the sample when the number of removed subjects is high. Subjective selection or imputation procedures are used, but this leads to biased or powerless models. The aim of our study was to develop a method based on the selection of the nearest covariate to the centroid of a homogeneous cluster of covariates. We applied this method to a forensic medicine data set to estimate the age of aborted fetuses. Analysis methods: We measured 46 biometric covariates on 50 aborted fetuses, but the covariates were complete for only 18 fetuses. First, to obtain homogeneous clusters of covariates we used a hierarchical cluster analysis. Second, for each obtained cluster we selected the nearest covariate to the centroid of the cluster, maximizing the sum of correlations (the centroid...
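The second step described above, picking one representative covariate per cluster, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the clusters are taken as given (the hierarchical clustering step is not shown), correlations are Pearson correlations on pairwise complete observations, the score is the plain sum of correlations with the other cluster members, and all names and data are hypothetical. The abstract is truncated, so the exact criterion may differ.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation using only positions observed in both x and y."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / sqrt(sxx * syy)


def centroid_representatives(data, clusters):
    """For each cluster, return the covariate with the largest sum of
    correlations with the other members of that cluster."""
    chosen = {}
    for label, members in clusters.items():
        def score(name, members=members):
            return sum(pearson(data[name], data[other])
                       for other in members if other != name)
        chosen[label] = max(members, key=score)
    return chosen


# Hypothetical measurements; None marks a missing value.
data = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 1, 3, 4, 6, 5],
    "c": [1, 3, 2, 5, 4, None],
    "d": [4, 1, 3, 2, 6, 5],
}
clusters = {"size": ["a", "b", "c"], "other": ["d"]}
representatives = centroid_representatives(data, clusters)
```

Because each correlation uses only the subjects observed on both covariates, no subject has to be dropped entirely, which is the motivation the abstract gives for avoiding complete-case analysis.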
To reduce bias by residual confounding in nonrandomized database studies, the high-dimensional propensity score (hd-PS) algorithm selects and adjusts for previously unmeasured confounders. The authors evaluated whether hd-PS maintains its capabilities in small cohorts that have few exposed patients or few outcome events. In 4 North American pharmacoepidemiologic cohort studies between 1995 and 2005, the authors repeatedly sampled the data to yield increasingly smaller cohorts. They identified potential confounders in each sample and estimated both an hd-PS that included 0–500 covariates and treatment effects adjusted by decile of hd-PS. For sensitivity analyses, they altered the variable selection process to use zero-cell correction and, separately, to use only the variables' exposure association. With >50 exposed patients with an outcome event, hd-PS-adjusted point estimates in the small cohorts were similar to the full-cohort values. With 25–50 exposed events, both sensitivity analyses yielded estimates closer to those obtained in the full data set. Point estimates generally did not change as compared with the full data set when selecting >300 covariates for the hd-PS. In these data, using zero-cell correction or exposure-based covariate selection allowed hd-PS to function robustly with few events. hd-PS is a flexible analytical tool for nonrandomized research across a range of study sizes and event frequencies.