Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study (original) (raw)

Journal Article

,

*Correspondence to Dr. Anoop D. Shah, Clinical Epidemiology Group, Department of Epidemiology and Public Health, School of Life and Medical Sciences, University College London, Wolfson House, 2-10 Stephenson Way, London NW1 2HE, United Kingdom (e-mail: anoop@doctors.org.uk).

Search for other works by this author on:

,

Search for other works by this author on:

,

Search for other works by this author on:

,

Search for other works by this author on:

Search for other works by this author on:

Abbreviations: CALIBER, Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; MAR, missing at random; MICE, multivariate imputation by chained equations.

Author Notes

Accepted:

20 November 2013

Published:

12 January 2014

Cite

Anoop D. Shah, Jonathan W. Bartlett, James Carpenter, Owen Nicholas, Harry Hemingway, Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study, American Journal of Epidemiology, Volume 179, Issue 6, 15 March 2014, Pages 764–774, https://doi.org/10.1093/aje/kwt312
Close

Navbar Search Filter Mobile Enter search term Search

Abstract

Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The “true” imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001–2010) with complete data on all covariates. Variables were artificially made “missing at random,” and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data.

Missing data are a pervasive problem in epidemiologic studies, particularly for research using routinely collected clinical health records (1). Incomplete data sets are frequently analyzed using multiple imputation, which involves creating multiple complete versions of the data with missing values imputed through random draws from distributions inferred from observed data (2). Multiple imputation typically assumes that data are missing at random (MAR)—that is, that missingness is not associated with the missing value, conditional on the observed data (3). Multivariate imputation by chained equations (MICE), also called full conditional specification, is a common method of generating imputed values by drawing from estimated conditional distributions of each variable given all the others (4). Imputation models must be appropriately specified for analyses based on imputed data to yield unbiased parameter estimates and associated standard errors. The default setting in implementations of MICE is for imputation models to include continuous variables as linear terms only with no interactions, but omission of important nonlinear terms may lead to biased results (5). Other potential problems with parametric regression models are that 1) they cannot include more predictor variables than the number of observations without recourse to prior information (6) and 2) inclusion of highly correlated variables may cause problems due to collinearity. In this paper, we propose a new imputation method which aims to overcome these problems using random forest.

Random forest is an extension of classification and regression trees (7), predictive models that recursively subdivide the data based on values of the predictor variables. They do not rely on distributional assumptions and can accommodate nonlinear relations and interactions. On simulated data sets with interactions between variables, imputation of missing data using MICE with regression trees resulted in less biased parameter estimates than MICE with linear regression (7). However, regression trees may “overfit,” following the pattern of noise too closely and producing a complex model with poor predictive power in new data sets.

Random forest uses bootstrap aggregation of multiple regression trees to reduce the risk of overfitting, and it combines the predictions from many trees to produce more accurate predictions (8, 9). Random forest is widely used in genetic epidemiology (10) and has also been used for modeling survival (11, 12) and predicting response to cancer chemotherapy (13). We propose that random forest may be useful in multiple imputation of epidemiologic data sets, particularly if there are large numbers of clinical variables per participant, as may increasingly be the case (e.g., genomic or proteomic studies).

Stekhoven et al. (14) developed a random forest-based algorithm for missing data imputation called missForest. This algorithm aims to predict individual missing values accurately rather than take random draws from a distribution, so the imputed values may lead to biased parameter estimates in statistical models. Apart from a comparison between random forest and polytomous regression for imputing tumor stage using MICE (15), we are not aware of other published evaluations of multiple imputation using random forest.

In this paper, we compare a standard implementation of MICE with imputation using missForest, and we propose a new version of MICE which imputes each variable using random forest. We compare these methods in a realistically complex survival analysis based on patients with stable angina in the CALIBER (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records) database (16) and in a simulation study with interactions.

METHODS

Imputation of missing data using MICE, where each variable is imputed using random forest

Within the MICE framework, missing values of continuous variables are conventionally imputed by fitting a linear regression model for the observed values, predicting the conditional mean for each missing value, and randomly imputing a value from a normal distribution centered on this conditional mean. Our new method is derived from the “mice.impute.norm.boot” function in the “mice” package in R (4), in which linear regression is applied to a bootstrap sample of records with observed values of the variable to be imputed. The purpose of the bootstrap is to accommodate sampling variation in estimating population regression parameters, which is part of ensuring that imputations are “proper” (3). The random forest algorithm itself involves another level of bootstrap sampling. Records with missing values in the dependent variable are imputed by random draws from independent normal distributions centered on conditional means predicted using random forest. We used the “out-of-bag” mean square error as the estimator of residual variance (which we assumed to be normally distributed). Random forest fits each tree to a different bootstrap sample of the data and aggregates the results; the out-of-bag error is the mean of squared differences between each observed value and the prediction based on trees for which that observation is not included in the bootstrap sample.

For binary or unordered categorical variables, we used random forest to fit individual regression trees to a bootstrap sample of the data and imputed each missing value as the prediction of a randomly chosen tree. This is equivalent to choosing between 0 and 1 with probability according to the mean random forest prediction. Our random forest imputation functions are available from the Comprehensive R Archive Network (17).

Simulation study based on CALIBER data

CALIBER is a database of linked routinely collected electronic health records from England (16), comprising data from primary care (Clinical Practice Research Datalink) (18), hospital admissions (19), the national registry of acute coronary syndromes (20), and the national death registry.

The cohort consisted of patients who received a diagnosis of stable angina while registered at a general practice contributing to the Clinical Practice Research Datalink. Blood pressure, smoking status, and measurements of blood biomarkers were taken from routine clinical records before the diagnosis of stable angina, and patients were followed up for the composite endpoint of death or nonfatal myocardial infarction (see Supplementary Data, available at http://aje.oxfordjournals.org/). The CALIBER record-linkage study has received ethical approval, and this study was approved by the Clinical Practice Research Datalink Independent Scientific Advisory Committee.

Analysis of interest

We investigated missing data in the context of a hypothetical analysis of associations, suggested in previous studies (21–23), between 3 commonly measured hematological parameters (hemoglobin concentration, lymphocyte count, and neutrophil count) and prognosis among patients with stable angina in CALIBER. The substantive analysis was a multivariable Cox model with the composite endpoint of death or nonfatal myocardial infarction, and with predictor variables specified a priori_._ The fully observed predictor variables were: age, age squared, sex, previous myocardial infarction, diabetes mellitus, previous stroke, peripheral arterial disease, and heart failure. Smoking status was a partially observed 3-category variable (never, former, or current smoker), and we included the following partially observed continuous variables: systolic blood pressure (mm Hg), log neutrophil count (109 cells/L), log lymphocyte count (109 cells/L), high-density lipoprotein cholesterol level (mmol/L), hemoglobin concentration (g/dL), and log serum creatinine concentration (µmol/L). For each of these variables, we took the mean of any observed values in the 2 years prior to the start of follow-up. We used the Efron approximation (24) for ties. We did not investigate alternative models and ignored clustering by general practice.

Generation of sample data sets for simulation study

For the simulation study, we created data sets with missing data for which we knew the “true” values, with a missingness pattern similar to that observed in the actual data set but which was missing at random, such that the MAR assumption underlying most multiple imputation approaches was satisfied (Figure 1). We denoted the entire cohort of 52,576 stable angina patients data set “A.” Patients with no missing values for any of the variables in the survival model were denoted data set “B” (13,308 patients).

Generation of data sets with artificial missingness from a population of patients with stable angina in the CALIBER database, 2001–2010. Data sets D1, D2, … , D1,000 are samples of 2,000 patients with replacement from data set C. CALIBER, Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; MAR, missing at random; MCAR, missing completely at random; MICE, multivariate imputation by chained equations.

Figure 1.

Generation of data sets with artificial missingness from a population of patients with stable angina in the CALIBER database, 2001–2010. Data sets D1, D2, … , D1,000 are samples of 2,000 patients with replacement from data set C. CALIBER, Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; MAR, missing at random; MCAR, missing completely at random; MICE, multivariate imputation by chained equations.

We used logistic regression based on completely observed variables to investigate factors associated with a patient's having a complete record (i.e., whether a patient was in data set B). We included the first value after cohort entry of partially observed continuous variables as auxiliary variables to help predict missing covariates in imputation models. The subset of patients with complete recording of all analysis and auxiliary variables was denoted data set “C” (10,128 patients), and they formed the basis for the resampling study. We artificially made some values of predictor variables missing in one thousand 2,000-patient random samples (with replacement) from data set C. We carried out simulations with one of 2 missingness mechanisms: 1) MAR in a pattern similar to that of data set A or 2) an artificial pattern of missingness completely at random in which only categorical variables were missing (see Supplementary Data for more details).

Multiple imputation of test data sets

Imputation models included all of the variables in the substantive Cox model, event status, marginal Nelson-Aalen cumulative hazard (25), and the following auxiliary variables: type of endpoint, whether the practice was receiving electronic laboratory results, and the earliest recorded value after the index date for blood pressure and the 5 blood biomarkers. We imputed continuous variables using MICE with normal-based linear regression, predictive mean matching with 3 nearest neighbors, and our new random forest method. We imputed categorical variables using either MICE with logistic or polytomous regression or MICE with random forest (choice of 10 or 100 trees). We also investigated random forest with a single tree to determine whether bootstrap aggregation had an advantage over a single regression tree.

We generated 10 MICE imputations, each drawn from a separate chain with a different random seed, with 10 cycles of imputation before drawing the imputed data set. We assessed chain mixing by reviewing plots of chain mean values and standard deviations. For two of the methods (random forest with 10 trees and parametric MICE), we also calculated results using 100 imputations.

In addition to MICE, we also evaluated missForest (14), which uses random forest in an iterative way to complete a data set with missing values, where imputed values are equal to the random forest predictions rather than being randomly sampled from a conditional distribution. We generated multiple imputed data sets by running missForest using different random seeds, which leads to different random forest models being generated.

Regardless of the imputation method, all data sets were analyzed using the same multivariable semiparametric Cox model as described above. For each set of imputed data sets, the log hazard ratios from the Cox model were combined using Rubin's rules (26), which assumes that imputed values were drawn from the appropriate Bayesian posterior. We carried out analyses using R 2.12.1 (27), with the software packages mice 2.12 (4), missForest 1.3 (28), survival 2.36-2 (29), and randomForest 4.6-6 (30). Random numbers were generated using the Mersenne Twister (31).

Comparison of results obtained by different methods

We considered the Cox proportional hazards model fitted to the entire data set (data set C) the “true” result for the assessment of bias and confidence interval coverage of hazard ratios. We compared the widths of 95% confidence intervals between 2 methods using paired-sample t tests. We compared coverage of 95% confidence intervals for each coefficient separately using McNemar's test, defining discordant pairs as data sets in which the 95% confidence interval included the “true” value for one method but not the other. We compared the efficiency of the estimators by calculating their empirical standard deviations. We calculated the between-imputation variance of the estimated log hazard ratios, defined as the mean (across simulations) of the variance of the log hazard ratio estimates from the 10 imputations per data set.

Simulation study with interactions

We also created simulated data sets to compare the performance of methods when there were nonlinearities in the association between predictor variables. We generated 2 independent random normal variables with mean 0 and variance 1, _x_1 and _x_2, and a third variable _x_3 equal to 0.5(_x_1 + _x_2 − _x_1_x_2) + e, where e was distributed normally with mean 0 and variance 1. Survival times were generated according to an exponential distribution with log hazard 0.5(_x_1 + _x_2 + _x_3). This meant that there were no interactions in the substantive model, but the default parametric imputation model for _x_3 (which would not include any interactions) would be incorrect. Observation times were generated according to a uniform distribution in the range from 0 to the 50th percentile of survival times. If the observation time was less than the survival time, the patient was considered censored (event indicator 0, and the patient's follow-up ended on his or her censoring date); otherwise the event indicator was 1, with follow-up ending on the date of the event.

Variable _x_3 was made 20% MAR according to a logistic model based on _x_1 and _x_2, the marginal Nelson-Aalen cumulative hazard and the event indicator.

We analyzed 1,000 simulated data sets with 2,000 patients each, imputing missing data using random forest and parametric MICE (without interactions), comparing the results as above (Supplementary Data and Supplementary Data).

RESULTS

Simulation study based on CALIBER data

The prevalence of missingness among partially observed variables ranged from 1.5% for smoking to 56.7% for lymphocyte counts and 56.8% for neutrophil counts (Supplementary Data). Patients with missing data were more likely to experience the primary endpoint of death or nonfatal myocardial infarction (age- and sex-adjusted hazard ratio = 1.19, 95% confidence interval: 1.13, 1.25) (Figure 2). Patients with missing data also tended to have longer follow-up because they entered the cohort earlier (a median index date of March 1, 2002, vs. December 6, 2005; P < 0.0001 by Wilcoxon rank-sum test). The logistic regression model showed that patients with diabetes, peripheral arterial disease, and previous stroke were more likely to have complete records (Table 1). Supplementary Data shows that coefficients from Cox models for patients with complete data (data set C) were similar to the average results from full data analysis of the subsamples (as we would expect in the absence of small-sample bias), so we compared imputation estimates with those from data set C.

Table 1.

Factors Associated With Having a Complete Record in a Study of Patients Diagnosed With Stable Angina (Logistic Regression Model), CALIBER Database, 2001–2010

Variable Odds Ratio 95% Confidence Interval P Value
Age, per 10 years 4.81 4.02, 5.75 <0.001
Age squared, per 10 years squared 0.89 0.87, 0.90 <0.001
Female sex 1.08 1.03, 1.12 0.002
Diabetes mellitus 1.74 1.64, 1.84 <0.001
Peripheral arterial disease 1.24 1.15, 1.35 <0.001
Previous stroke 1.26 1.17, 1.36 <0.001
Heart failure 0.96 0.89, 1.04 0.333
Previous myocardial infarction 1.03 0.98, 1.09 0.228
Electronic laboratory resultsa 4.74 4.48, 5.01 <0.001
Endpoint of fatal coronary heart disease 0.40 0.36, 0.45 <0.001
Endpoint of nonfatal myocardial infarction 0.29 0.26, 0.34 <0.001
Endpoint of noncoronary death 0.42 0.39, 0.45 <0.001
Cumulative hazard 0.03 0.02, 0.03 <0.001
Variable Odds Ratio 95% Confidence Interval P Value
Age, per 10 years 4.81 4.02, 5.75 <0.001
Age squared, per 10 years squared 0.89 0.87, 0.90 <0.001
Female sex 1.08 1.03, 1.12 0.002
Diabetes mellitus 1.74 1.64, 1.84 <0.001
Peripheral arterial disease 1.24 1.15, 1.35 <0.001
Previous stroke 1.26 1.17, 1.36 <0.001
Heart failure 0.96 0.89, 1.04 0.333
Previous myocardial infarction 1.03 0.98, 1.09 0.228
Electronic laboratory resultsa 4.74 4.48, 5.01 <0.001
Endpoint of fatal coronary heart disease 0.40 0.36, 0.45 <0.001
Endpoint of nonfatal myocardial infarction 0.29 0.26, 0.34 <0.001
Endpoint of noncoronary death 0.42 0.39, 0.45 <0.001
Cumulative hazard 0.03 0.02, 0.03 <0.001

Abbreviation: CALIBER, Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records.

a Whether a medical practice was receiving electronic laboratory results.

Table 1.

Factors Associated With Having a Complete Record in a Study of Patients Diagnosed With Stable Angina (Logistic Regression Model), CALIBER Database, 2001–2010

Variable Odds Ratio 95% Confidence Interval P Value
Age, per 10 years 4.81 4.02, 5.75 <0.001
Age squared, per 10 years squared 0.89 0.87, 0.90 <0.001
Female sex 1.08 1.03, 1.12 0.002
Diabetes mellitus 1.74 1.64, 1.84 <0.001
Peripheral arterial disease 1.24 1.15, 1.35 <0.001
Previous stroke 1.26 1.17, 1.36 <0.001
Heart failure 0.96 0.89, 1.04 0.333
Previous myocardial infarction 1.03 0.98, 1.09 0.228
Electronic laboratory resultsa 4.74 4.48, 5.01 <0.001
Endpoint of fatal coronary heart disease 0.40 0.36, 0.45 <0.001
Endpoint of nonfatal myocardial infarction 0.29 0.26, 0.34 <0.001
Endpoint of noncoronary death 0.42 0.39, 0.45 <0.001
Cumulative hazard 0.03 0.02, 0.03 <0.001
Variable Odds Ratio 95% Confidence Interval P Value
Age, per 10 years 4.81 4.02, 5.75 <0.001
Age squared, per 10 years squared 0.89 0.87, 0.90 <0.001
Female sex 1.08 1.03, 1.12 0.002
Diabetes mellitus 1.74 1.64, 1.84 <0.001
Peripheral arterial disease 1.24 1.15, 1.35 <0.001
Previous stroke 1.26 1.17, 1.36 <0.001
Heart failure 0.96 0.89, 1.04 0.333
Previous myocardial infarction 1.03 0.98, 1.09 0.228
Electronic laboratory resultsa 4.74 4.48, 5.01 <0.001
Endpoint of fatal coronary heart disease 0.40 0.36, 0.45 <0.001
Endpoint of nonfatal myocardial infarction 0.29 0.26, 0.34 <0.001
Endpoint of noncoronary death 0.42 0.39, 0.45 <0.001
Cumulative hazard 0.03 0.02, 0.03 <0.001

Abbreviation: CALIBER, Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records.

a Whether a medical practice was receiving electronic laboratory results.

Cumulative incidence of myocardial infarction or death (Kaplan-Meier failure curve) for patients with stable angina in the CALIBER database, by complete record status, 2001–2010. The solid line represents patients in data set A but not data set B (those with missing data; n = 39,268 at the start, dropping to 17,588 in year 6), and the dashed line represents patients in data set B (those with complete records; n = 13,308 at the start, dropping to 2,594 in year 6). CALIBER, Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records.

Figure 2.

Cumulative incidence of myocardial infarction or death (Kaplan-Meier failure curve) for patients with stable angina in the CALIBER database, by complete record status, 2001–2010. The solid line represents patients in data set A but not data set B (those with missing data; n = 39,268 at the start, dropping to 17,588 in year 6), and the dashed line represents patients in data set B (those with complete records; n = 13,308 at the start, dropping to 2,594 in year 6). CALIBER, Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records.

We found very little difference between results obtained using 10 MICE imputations and those obtained using 100 MICE imputations (Supplementary Data and 4), so all results are based on 10 imputations unless stated otherwise.

Bias

Estimates from complete-record analysis were biased for some parameters under MAR (missingness mechanism 1); this may be expected because we had introduced missingness dependent on the outcome (Table 2, Supplementary Data, Supplementary Data). For example, the geometric mean hazard ratio per doubling of lymphocyte count was 0.738 from complete-record analysis but 0.799 from full-data analysis. There was no material bias with parametric MICE (mean hazard ratio = 0.806) or our random forest MICE method with 10 trees (“MICE RF 10”; mean hazard ratio = 0.807). The random forest MICE estimate for smoking (categorical) was biased towards the null (Table 3, Supplementary Data), but there was no material bias in other parameters estimated by random forest or parametric MICE (Table 2, Supplementary Data and Supplementary Data, Supplementary Data). However, imputation using single-tree random forest MICE (“MICE Tree”) or missForest produced materially biased estimates for all continuous variables missing at random (Figure 3, Table 2, Supplementary Data, Supplementary Data).

Table 2.

Comparisons Between Methods of Handling Missing Data in 1,000 Samples With Continuous Variables Missing at Random in a Pattern Similar to That of the Original Data Set (Missingness Mechanism 1), CALIBER Database, 2001–2010

Variable and Method Biasa of Log HR z Score for Biasb SD of Estimated Log HR Mean Length of 95% CI Coverage of 95% CI, % Between- Imputation Variance
Neutrophils (109 cells/L), per doubling
Full data 0.002 0.43 0.158 0.564 92.2
Complete recordc −0.045 −2.67 0.533 1.677 90.1
MICE normal −0.038 −5.15 0.232 0.883 93.4 0.0243
MICE PMM −0.042 −5.68 0.230 0.889 93.4 0.0245
missForest −0.266 27.72 0.303 0.781 63.2 0.0014
MICE RF 10 trees −0.024 −4.55 0.165 0.798 97.9 0.0143
Lymphocytes (109 cells/L), per doubling
Full data −0.007 −1.23 0.155 0.526 91.6
Complete recordc −0.087 −5.87 0.464 1.544 89.8
MICE normal 0.001 0.13 0.202 0.759 93.2 0.0157
MICE PMM 0.006 0.99 0.205 0.768 92.4 0.0162
missForest −0.190 −22.21 0.270 0.724 72.5 0.0011
MICE RF 10 trees 0.003 0.56 0.156 0.727 97.8 0.0109
Hemoglobin, per g/dL
Full data −0.004 −1.99 0.057 0.202 91.6
Complete recordc −0.022 −3.91 0.180 0.593 90.8
MICE normal −0.007 −2.73 0.076 0.279 92.6 0.0019
MICE PMM −0.004 −1.47 0.077 0.279 92.7 0.0019
missForest −0.056 −19.96 0.089 0.255 77.3 0.0001
MICE RF 10 trees −0.010 −5.61 0.059 0.261 97.2 0.0012
Variable and Method Biasa of Log HR z Score for Biasb SD of Estimated Log HR Mean Length of 95% CI Coverage of 95% CI, % Between- Imputation Variance
Neutrophils (109 cells/L), per doubling
Full data 0.002 0.43 0.158 0.564 92.2
Complete recordc −0.045 −2.67 0.533 1.677 90.1
MICE normal −0.038 −5.15 0.232 0.883 93.4 0.0243
MICE PMM −0.042 −5.68 0.230 0.889 93.4 0.0245
missForest −0.266 27.72 0.303 0.781 63.2 0.0014
MICE RF 10 trees −0.024 −4.55 0.165 0.798 97.9 0.0143
Lymphocytes (109 cells/L), per doubling
Full data −0.007 −1.23 0.155 0.526 91.6
Complete recordc −0.087 −5.87 0.464 1.544 89.8
MICE normal 0.001 0.13 0.202 0.759 93.2 0.0157
MICE PMM 0.006 0.99 0.205 0.768 92.4 0.0162
missForest −0.190 −22.21 0.270 0.724 72.5 0.0011
MICE RF 10 trees 0.003 0.56 0.156 0.727 97.8 0.0109
Hemoglobin, per g/dL
Full data −0.004 −1.99 0.057 0.202 91.6
Complete recordc −0.022 −3.91 0.180 0.593 90.8
MICE normal −0.007 −2.73 0.076 0.279 92.6 0.0019
MICE PMM −0.004 −1.47 0.077 0.279 92.7 0.0019
missForest −0.056 −19.96 0.089 0.255 77.3 0.0001
MICE RF 10 trees −0.010 −5.61 0.059 0.261 97.2 0.0012

Abbreviations: CALIBER, Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; CI, confidence interval; HR, hazard ratio; MICE, multivariate imputation by chained equations; PMM, predictive mean matching; RF 10 trees, random forest with 10 trees; SD, standard deviation.

a Bias was measured relative to estimates from analysis of the full data set (data set C) (Supplementary Data).

b The z score is defined as the mean bias of the estimate divided by the empirical standard error from simulations, and it should lie approximately within the interval (−2, +2).

c Results for complete records were based on the 986 samples for which it was possible to estimate hazard ratios for all parameters.

Table 2.

Comparisons Between Methods of Handling Missing Data in 1,000 Samples With Continuous Variables Missing at Random in a Pattern Similar to That of the Original Data Set (Missingness Mechanism 1), CALIBER Database, 2001–2010

Variable and Method Biasa of Log HR z Score for Biasb SD of Estimated Log HR Mean Length of 95% CI Coverage of 95% CI, % Between- Imputation Variance
Neutrophils (109 cells/L), per doubling
Full data 0.002 0.43 0.158 0.564 92.2
Complete recordc −0.045 −2.67 0.533 1.677 90.1
MICE normal −0.038 −5.15 0.232 0.883 93.4 0.0243
MICE PMM −0.042 −5.68 0.230 0.889 93.4 0.0245
missForest −0.266 27.72 0.303 0.781 63.2 0.0014
MICE RF 10 trees −0.024 −4.55 0.165 0.798 97.9 0.0143
Lymphocytes (109 cells/L), per doubling
Full data −0.007 −1.23 0.155 0.526 91.6
Complete recordc −0.087 −5.87 0.464 1.544 89.8
MICE normal 0.001 0.13 0.202 0.759 93.2 0.0157
MICE PMM 0.006 0.99 0.205 0.768 92.4 0.0162
missForest −0.190 −22.21 0.270 0.724 72.5 0.0011
MICE RF 10 trees 0.003 0.56 0.156 0.727 97.8 0.0109
Hemoglobin, per g/dL
Full data −0.004 −1.99 0.057 0.202 91.6
Complete recordc −0.022 −3.91 0.180 0.593 90.8
MICE normal −0.007 −2.73 0.076 0.279 92.6 0.0019
MICE PMM −0.004 −1.47 0.077 0.279 92.7 0.0019
missForest −0.056 −19.96 0.089 0.255 77.3 0.0001
MICE RF 10 trees −0.010 −5.61 0.059 0.261 97.2 0.0012
Variable and Method Biasa of Log HR z Score for Biasb SD of Estimated Log HR Mean Length of 95% CI Coverage of 95% CI, % Between- Imputation Variance
Neutrophils (109 cells/L), per doubling
Full data 0.002 0.43 0.158 0.564 92.2
Complete recordc −0.045 −2.67 0.533 1.677 90.1
MICE normal −0.038 −5.15 0.232 0.883 93.4 0.0243
MICE PMM −0.042 −5.68 0.230 0.889 93.4 0.0245
missForest −0.266 27.72 0.303 0.781 63.2 0.0014
MICE RF 10 trees −0.024 −4.55 0.165 0.798 97.9 0.0143
Lymphocytes (109 cells/L), per doubling
Full data −0.007 −1.23 0.155 0.526 91.6
Complete recordc −0.087 −5.87 0.464 1.544 89.8
MICE normal 0.001 0.13 0.202 0.759 93.2 0.0157
MICE PMM 0.006 0.99 0.205 0.768 92.4 0.0162
missForest −0.190 −22.21 0.270 0.724 72.5 0.0011
MICE RF 10 trees 0.003 0.56 0.156 0.727 97.8 0.0109
Hemoglobin, per g/dL
Full data −0.004 −1.99 0.057 0.202 91.6
Complete recordc −0.022 −3.91 0.180 0.593 90.8
MICE normal −0.007 −2.73 0.076 0.279 92.6 0.0019
MICE PMM −0.004 −1.47 0.077 0.279 92.7 0.0019
missForest −0.056 −19.96 0.089 0.255 77.3 0.0001
MICE RF 10 trees −0.010 −5.61 0.059 0.261 97.2 0.0012

Abbreviations: CALIBER, Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; CI, confidence interval; HR, hazard ratio; MICE, multivariate imputation by chained equations; PMM, predictive mean matching; RF 10 trees, random forest with 10 trees; SD, standard deviation.

a Bias was measured relative to estimates from analysis of the full data set (data set C) (Supplementary Data).

b The z score is defined as the mean bias of the estimate divided by the empirical standard error from simulations, and it should lie approximately within the interval (−2, +2).

c Results for complete records were based on the 986 samples for which it was possible to estimate hazard ratios for all parameters.

Table 3.

Comparisons Between Methods of Handling Missing Data in 1,000 Samples With Categorical Variables Missing Completely at Random (Missingness Mechanism 2), CALIBER Database, 2001–2010

Variable and Method Biasa of Log HR z Score for Biasb SD of Estimated Log HR Mean Length of 95% CI Coverage of 95% CI, % % Falsely Classifiedc
Previous myocardial infarction
Full data 0.006 1.22 0.154 0.587 94.2 0
MICE logistic −0.013 −2.46 0.168 0.682 95.5 29.6
missForest 0.002 0.27 0.179 0.625 91.8 17.3
MICE RF 10 trees −0.020 −4.21 0.149 0.662 97.3 28.5
Diabetes mellitus
Full data 0.010 2.30 0.156 0.592 93.7 0
MICE logistic 0.016 3.21 0.171 0.685 95.7 32.0
missForest 0.014 2.73 0.182 0.627 90.8 19.7
MICE RF 10 trees −0.021 −4.25 0.149 0.668 97.5 30.7
Previous stroke
Full data 0.005 0.86 0.198 0.707 94.0 0
MICE logistic −0.005 −0.58 0.207 0.828 95.5 17.9
missForest 0.004 0.65 0.211 0.763 92.9 8.4
MICE RF 10 trees −0.011 −1.79 0.183 0.808 97.9 16.7
Peripheral arterial disease
Full data 0.016 2.59 0.199 0.730 93.6 0
MICE logistic −0.002 −0.21 0.218 0.858 94.8 15.5
missForest 0.028 4.18 0.223 0.788 91.9 7.0
MICE RF 10 trees 0.005 0.94 0.192 0.834 97.1 14.5
Heart failure
Full data 0.015 2.47 0.191 0.653 91.7 0
MICE logistic 0.015 2.22 0.207 0.759 93.8 14.6
missForest 0.001 0.08 0.216 0.696 89.4 7.2
MICE RF 10 trees −0.034 −5.78 0.190 0.746 95.5 13.7
Smoking status: current vs. never
Full data 0.019 2.62 0.264 0.969 93.9 0
MICE logistic 0.023 2.65 0.292 1.092 94.0 52.4
missForest −0.036 −3.56 0.308 1.062 91.5 35.0
MICE RF 10 trees −0.098 −12.92 0.237 1.072 95.5 50.0
Smoking status: former vs. never
Full data 0.011 1.66 0.247 0.908 93.6 0
MICE logistic −0.008 −0.82 0.266 1.022 94.1 52.4
missForest 0.045 5.34 0.270 0.980 93.2 35.0
MICE RF 10 trees −0.060 −8.81 0.212 1.000 97.1 50.0
Variable and Method Biasa of Log HR z Score for Biasb SD of Estimated Log HR Mean Length of 95% CI Coverage of 95% CI, % % Falsely Classifiedc
Previous myocardial infarction
Full data 0.006 1.22 0.154 0.587 94.2 0
MICE logistic −0.013 −2.46 0.168 0.682 95.5 29.6
missForest 0.002 0.27 0.179 0.625 91.8 17.3
MICE RF 10 trees −0.020 −4.21 0.149 0.662 97.3 28.5
Diabetes mellitus
Full data 0.010 2.30 0.156 0.592 93.7 0
MICE logistic 0.016 3.21 0.171 0.685 95.7 32.0
missForest 0.014 2.73 0.182 0.627 90.8 19.7
MICE RF 10 trees −0.021 −4.25 0.149 0.668 97.5 30.7
Previous stroke
Full data 0.005 0.86 0.198 0.707 94.0 0
MICE logistic −0.005 −0.58 0.207 0.828 95.5 17.9
missForest 0.004 0.65 0.211 0.763 92.9 8.4
MICE RF 10 trees −0.011 −1.79 0.183 0.808 97.9 16.7
Peripheral arterial disease
Full data 0.016 2.59 0.199 0.730 93.6 0
MICE logistic −0.002 −0.21 0.218 0.858 94.8 15.5
missForest 0.028 4.18 0.223 0.788 91.9 7.0
MICE RF 10 trees 0.005 0.94 0.192 0.834 97.1 14.5
Heart failure
Full data 0.015 2.47 0.191 0.653 91.7 0
MICE logistic 0.015 2.22 0.207 0.759 93.8 14.6
missForest 0.001 0.08 0.216 0.696 89.4 7.2
MICE RF 10 trees −0.034 −5.78 0.190 0.746 95.5 13.7
Smoking status: current vs. never
Full data 0.019 2.62 0.264 0.969 93.9 0
MICE logistic 0.023 2.65 0.292 1.092 94.0 52.4
missForest −0.036 −3.56 0.308 1.062 91.5 35.0
MICE RF 10 trees −0.098 −12.92 0.237 1.072 95.5 50.0
Smoking status: former vs. never
Full data 0.011 1.66 0.247 0.908 93.6 0
MICE logistic −0.008 −0.82 0.266 1.022 94.1 52.4
missForest 0.045 5.34 0.270 0.980 93.2 35.0
MICE RF 10 trees −0.060 −8.81 0.212 1.000 97.1 50.0

Abbreviations: CALIBER, Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; CI, confidence interval; HR, hazard ratio; MICE, multivariate imputation by chained equations; RF 10 trees, random forest with 10 trees; SD, standard deviation.

a Bias was measured relative to estimates from analysis of the full data set (data set C) (Supplementary Data).

b The z score is defined as the mean bias of the estimate divided by the empirical standard error from simulations, and it should lie approximately within the interval (−2, +2).

c Percentage of imputed values that were different from the “true” (observed) missing value.

Table 3.

Comparisons Between Methods of Handling Missing Data in 1,000 Samples With Categorical Variables Missing Completely at Random (Missingness Mechanism 2), CALIBER Database, 2001–2010

Variable and Method Biasa of Log HR z Score for Biasb SD of Estimated Log HR Mean Length of 95% CI Coverage of 95% CI, % % Falsely Classifiedc
Previous myocardial infarction
Full data 0.006 1.22 0.154 0.587 94.2 0
MICE logistic −0.013 −2.46 0.168 0.682 95.5 29.6
missForest 0.002 0.27 0.179 0.625 91.8 17.3
MICE RF 10 trees −0.020 −4.21 0.149 0.662 97.3 28.5
Diabetes mellitus
Full data 0.010 2.30 0.156 0.592 93.7 0
MICE logistic 0.016 3.21 0.171 0.685 95.7 32.0
missForest 0.014 2.73 0.182 0.627 90.8 19.7
MICE RF 10 trees −0.021 −4.25 0.149 0.668 97.5 30.7
Previous stroke
Full data 0.005 0.86 0.198 0.707 94.0 0
MICE logistic −0.005 −0.58 0.207 0.828 95.5 17.9
missForest 0.004 0.65 0.211 0.763 92.9 8.4
MICE RF 10 trees −0.011 −1.79 0.183 0.808 97.9 16.7
Peripheral arterial disease
Full data 0.016 2.59 0.199 0.730 93.6 0
MICE logistic −0.002 −0.21 0.218 0.858 94.8 15.5
missForest 0.028 4.18 0.223 0.788 91.9 7.0
MICE RF 10 trees 0.005 0.94 0.192 0.834 97.1 14.5
Heart failure
Full data 0.015 2.47 0.191 0.653 91.7 0
MICE logistic 0.015 2.22 0.207 0.759 93.8 14.6
missForest 0.001 0.08 0.216 0.696 89.4 7.2
MICE RF 10 trees −0.034 −5.78 0.190 0.746 95.5 13.7
Smoking status: current vs. never
Full data 0.019 2.62 0.264 0.969 93.9 0
MICE logistic 0.023 2.65 0.292 1.092 94.0 52.4
missForest −0.036 −3.56 0.308 1.062 91.5 35.0
MICE RF 10 trees −0.098 −12.92 0.237 1.072 95.5 50.0
Smoking status: former vs. never
Full data 0.011 1.66 0.247 0.908 93.6 0
MICE logistic −0.008 −0.82 0.266 1.022 94.1 52.4
missForest 0.045 5.34 0.270 0.980 93.2 35.0
MICE RF 10 trees −0.060 −8.81 0.212 1.000 97.1 50.0
Variable and Method Biasa of Log HR z Score for Biasb SD of Estimated Log HR Mean Length of 95% CI Coverage of 95% CI, % % Falsely Classifiedc
Previous myocardial infarction
Full data 0.006 1.22 0.154 0.587 94.2 0
MICE logistic −0.013 −2.46 0.168 0.682 95.5 29.6
missForest 0.002 0.27 0.179 0.625 91.8 17.3
MICE RF 10 trees −0.020 −4.21 0.149 0.662 97.3 28.5
Diabetes mellitus
Full data 0.010 2.30 0.156 0.592 93.7 0
MICE logistic 0.016 3.21 0.171 0.685 95.7 32.0
missForest 0.014 2.73 0.182 0.627 90.8 19.7
MICE RF 10 trees −0.021 −4.25 0.149 0.668 97.5 30.7
Previous stroke
Full data 0.005 0.86 0.198 0.707 94.0 0
MICE logistic −0.005 −0.58 0.207 0.828 95.5 17.9
missForest 0.004 0.65 0.211 0.763 92.9 8.4
MICE RF 10 trees −0.011 −1.79 0.183 0.808 97.9 16.7
Peripheral arterial disease
Full data 0.016 2.59 0.199 0.730 93.6 0
MICE logistic −0.002 −0.21 0.218 0.858 94.8 15.5
missForest 0.028 4.18 0.223 0.788 91.9 7.0
MICE RF 10 trees 0.005 0.94 0.192 0.834 97.1 14.5
Heart failure
Full data 0.015 2.47 0.191 0.653 91.7 0
MICE logistic 0.015 2.22 0.207 0.759 93.8 14.6
missForest 0.001 0.08 0.216 0.696 89.4 7.2
MICE RF 10 trees −0.034 −5.78 0.190 0.746 95.5 13.7
Smoking status: current vs. never
Full data 0.019 2.62 0.264 0.969 93.9 0
MICE logistic 0.023 2.65 0.292 1.092 94.0 52.4
missForest −0.036 −3.56 0.308 1.062 91.5 35.0
MICE RF 10 trees −0.098 −12.92 0.237 1.072 95.5 50.0
Smoking status: former vs. never
Full data 0.011 1.66 0.247 0.908 93.6 0
MICE logistic −0.008 −0.82 0.266 1.022 94.1 52.4
missForest 0.045 5.34 0.270 0.980 93.2 35.0
MICE RF 10 trees −0.060 −8.81 0.212 1.000 97.1 50.0

Abbreviations: CALIBER, Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; CI, confidence interval; HR, hazard ratio; MICE, multivariate imputation by chained equations; RF 10 trees, random forest with 10 trees; SD, standard deviation.

a Bias was measured relative to estimates from analysis of the full data set (data set C) (Supplementary Data).

b The z score is defined as the mean bias of the estimate divided by the empirical standard error from simulations, and it should lie approximately within the interval (−2, +2).

c Percentage of imputed values that were different from the “true” (observed) missing value.

Bias in estimates of log hazard ratios for partially observed variables with data missing at random (missingness mechanism 1) in 1,000 samples of patients with stable angina in the CALIBER database, 2001–2010. A) log neutrophil count (109 cells/L); B) log lymphocyte count (109 cells/L); C) hemoglobin concentration (g/dL). The solid horizontal line is the “true” log hazard ratio from the full data set (data set C); the dashed lines show ±1 empirical standard error. The boxes span the interquartile range (25th–75th percentiles), and the whiskers extend to the most extreme data point, which is no more than 1.5 times the interquartile range from the box. Circles represent outliers. The light gray boxes show results from simulations with 50% complete records, and the dark gray boxes show results from simulations with 25% complete records. CALIBER, Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; MICE, multivariate imputation by chained equations; PMM, predictive mean matching; RF, random forest.

Figure 3.

Bias in estimates of log hazard ratios for partially observed variables with data missing at random (missingness mechanism 1) in 1,000 samples of patients with stable angina in the CALIBER database, 2001–2010. A) log neutrophil count (109 cells/L); B) log lymphocyte count (109 cells/L); C) hemoglobin concentration (g/dL). The solid horizontal line is the “true” log hazard ratio from the full data set (data set C); the dashed lines show ±1 empirical standard error. The boxes span the interquartile range (25th–75th percentiles), and the whiskers extend to the most extreme data point, which is no more than 1.5 times the interquartile range from the box. Circles represent outliers. The light gray boxes show results from simulations with 50% complete records, and the dark gray boxes show results from simulations with 25% complete records. CALIBER, Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; MICE, multivariate imputation by chained equations; PMM, predictive mean matching; RF, random forest.

Efficiency

All of the imputation methods tested produced more efficient parameter estimates than complete-record analysis. MICE with random forest produced slightly more efficient estimates than parametric MICE, and the average between-imputation variance was also lower (Tables 2 and 3, Supplementary DataSupplementary Data).

Confidence intervals

Parametric MICE yielded confidence intervals with approximately 93%–95% coverage. The mean widths of confidence intervals were lower using random forest MICE than using parametric MICE (P < 0.001 for each comparison), but coverage was either equal or greater using random forest MICE (Tables 2 and 3, Supplementary DataSupplementary Data).

For categorical variables, missForest produced imputed values which were more likely to be equal to the “true” (observed) value than the MICE methods, but confidence intervals were too small with below nominal coverage, and between-imputation variance was very small. There was no difference in bias, precision, or coverage between normal-based MICE and predictive mean matching (Supplementary Data). Random forest MICE with 100 trees for continuous variables produced estimates with slightly narrower confidence intervals than random forest MICE with 10 trees (Supplementary Data), but with greater bias, worse coverage of 95% confidence intervals, and 10 times the computational cost. For categorical variables, random forest MICE with 10 trees and random forest MICE with 100 trees produced almost identical results.

Simulation study with interactions

The coefficient estimate for the partially observed variable (_x_3) was 10% biased using parametric MICE, 2.6% biased using random forest with 100 trees, and only 1.0% biased using random forest with 10 trees (P < 0.001 for 2-way comparisons). The bias in the _x_3 coefficient varied with the number of trees, with 10 or 20 trees giving minimal bias (Table 4). Random forest MICE produced narrower 95% confidence intervals for the _x_3 coefficient than parametric MICE (P < 0.001), and coverage was only 80% using parametric MICE as compared with 95% using random forest MICE with 5–100 trees. Further details are given in Supplementary Data.

Table 4.

Comparisons Between Methods of Handling Missing Data in a Survival Analysis of 1,000 Simulated Data Sets With a Predictor Variable Missing at Random That Is Associated With Fully Observed Predictors in a Nonlinear Way

Method Bias of Log HRa SE of Bias z Score for Bias SD of Estimate Mean Length of 95% CI Coverage of 95% CI, %
Full data −0.0002 0.001 −0.1 0.037 0.148 96.1
rfImpute 0.119 0.001 86.5 0.044 0.154 17.4
missForest 0.079 0.002 53.7 0.046 0.158 50.9
MICE RF with 5 trees −0.021 0.001 −17.1 0.038 0.172 95.3
MICE RF with 10 trees −0.005 0.001 −3.9 0.039 0.170 97.3
MICE RF with 20 trees 0.006 0.001 4.5 0.039 0.168 96.3
MICE RF with 50 trees 0.011 0.001 9.0 0.040 0.167 95.8
MICE RF with 100 trees 0.013 0.001 10.5 0.040 0.167 94.7
Parametric MICE −0.055 0.001 −44.8 0.039 0.178 79.8
Method Bias of Log HRa SE of Bias z Score for Bias SD of Estimate Mean Length of 95% CI Coverage of 95% CI, %
Full data −0.0002 0.001 −0.1 0.037 0.148 96.1
rfImpute 0.119 0.001 86.5 0.044 0.154 17.4
missForest 0.079 0.002 53.7 0.046 0.158 50.9
MICE RF with 5 trees −0.021 0.001 −17.1 0.038 0.172 95.3
MICE RF with 10 trees −0.005 0.001 −3.9 0.039 0.170 97.3
MICE RF with 20 trees 0.006 0.001 4.5 0.039 0.168 96.3
MICE RF with 50 trees 0.011 0.001 9.0 0.040 0.167 95.8
MICE RF with 100 trees 0.013 0.001 10.5 0.040 0.167 94.7
Parametric MICE −0.055 0.001 −44.8 0.039 0.178 79.8

Abbreviations: CI, confidence interval; HR, hazard ratio; MICE, multivariate imputation by chained equations; RF, random forest; SD, standard deviation; SE, standard error.

a The true log hazard ratio was set at 0.5.

Table 4.

Comparisons Between Methods of Handling Missing Data in a Survival Analysis of 1,000 Simulated Data Sets With a Predictor Variable Missing at Random That Is Associated With Fully Observed Predictors in a Nonlinear Way

Method Bias of Log HRa SE of Bias z Score for Bias SD of Estimate Mean Length of 95% CI Coverage of 95% CI, %
Full data −0.0002 0.001 −0.1 0.037 0.148 96.1
rfImpute 0.119 0.001 86.5 0.044 0.154 17.4
missForest 0.079 0.002 53.7 0.046 0.158 50.9
MICE RF with 5 trees −0.021 0.001 −17.1 0.038 0.172 95.3
MICE RF with 10 trees −0.005 0.001 −3.9 0.039 0.170 97.3
MICE RF with 20 trees 0.006 0.001 4.5 0.039 0.168 96.3
MICE RF with 50 trees 0.011 0.001 9.0 0.040 0.167 95.8
MICE RF with 100 trees 0.013 0.001 10.5 0.040 0.167 94.7
Parametric MICE −0.055 0.001 −44.8 0.039 0.178 79.8
Method Bias of Log HRa SE of Bias z Score for Bias SD of Estimate Mean Length of 95% CI Coverage of 95% CI, %
Full data −0.0002 0.001 −0.1 0.037 0.148 96.1
rfImpute 0.119 0.001 86.5 0.044 0.154 17.4
missForest 0.079 0.002 53.7 0.046 0.158 50.9
MICE RF with 5 trees −0.021 0.001 −17.1 0.038 0.172 95.3
MICE RF with 10 trees −0.005 0.001 −3.9 0.039 0.170 97.3
MICE RF with 20 trees 0.006 0.001 4.5 0.039 0.168 96.3
MICE RF with 50 trees 0.011 0.001 9.0 0.040 0.167 95.8
MICE RF with 100 trees 0.013 0.001 10.5 0.040 0.167 94.7
Parametric MICE −0.055 0.001 −44.8 0.039 0.178 79.8

Abbreviations: CI, confidence interval; HR, hazard ratio; MICE, multivariate imputation by chained equations; RF, random forest; SD, standard deviation; SE, standard error.

a The true log hazard ratio was set at 0.5.

DISCUSSION

Summary of main findings

In this resampling study of methods for handling missing data, parametric and random forest MICE produced estimates with no material bias for a Cox model on data with artificially introduced MAR missingness. Random forest-based MICE produced more efficient estimates and narrower confidence intervals than parametric MICE, yet in some cases coverage probability was greater than 95%, suggesting that some confidence intervals may be conservative. A possible explanation for the efficiency gain with random forest MICE is that it was able to make better use of the available information by accommodating nonlinearities among the predictors. In simulations with an interaction among the predictor variables but not in the substantive model, random forest MICE was less biased than parametric MICE, which omitted the interaction. Using missForest for multiple imputation resulted in very biased estimates and poor coverage of confidence intervals. Overall, our results suggest that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data.

Imputation methods for MICE

It is important that imputation models be correctly specified for analyses to yield unbiased estimates, and random forest may help avoid the bias that can occur with parametric MICE if the latter's imputation models are misspecified. In our main study, standard parametric MICE performed well, suggesting that the true imputation models did not contain significant nonlinearities or interactions, and hence random forest did not confer an advantage from the perspective of bias. However, the simulated data sets had interactions which were not included in the parametric MICE imputation models, and in this setting random forest MICE outperformed parametric MICE (Supplementary Data). The default settings for MICE do not include interactions between the variables, and it is routine practice to include only those interactions that are in the substantive model, rather than actively search for all possible interactions and nonlinearities. This shows the importance of checking to be sure that the imputation models are reasonably well specified. Random forest reduces the need to investigate associations between predictor variables, because it should automatically accommodate nonlinearities and interactions. Imputation models should also be compatible with the substantive model (32), and random forest obviates the need to specify how the outcome should be conditioned on in the imputation models for covariates.

When using random forest for prediction, a larger number of trees is preferred in order to obtain precise predictions (30). However, when imputing continuous variables using random forest MICE, bias seemed to tend towards a nonzero limit as the number of trees increased, with 10 or 20 trees giving minimal bias (Table 4). It is possible that the relationship between the number of trees and bias may have the same functional form but with a different direction of bias, asymptotic limit, and optimal number of trees, depending on the data. This phenomenon warrants further investigation.

A disadvantage of random forest is that the “models” are complex and not easily interpretable, although arguably this is not a shortcoming for the purpose of imputation. Another disadvantage is that random forest can be biased in some situations, due to random forest predictions of continuous variables at the extremes of their range being biased towards less extreme values (33). This is because a random forest prediction effectively consists of a weighted average of observed values of the variable being predicted; unlike model-based prediction, it is unable to extrapolate beyond observed values. In a simulation study, we found that random forest imputation led to bias when the distribution of missing values was very different from that for observed values (17), although in such situations any kind of imputation may produce poor results. However, we did not find such a bias in our CALIBER study because missing and observed values had similar distributions. Another limitation of our random forest MICE method is the assumption that the residuals from the random forest regression are normally distributed with constant variance.

On these 2,000-patient data sets, computation time was 3 times as long for random forest MICE with 10 trees as for parametric MICE (137 seconds per data set vs. 48 seconds per data set on a computer with an Intel Xeon 3.47-GHz processor (Intel Corporation, Santa Clara, California)), but on a 10,000-patient data set, random forest took 6.5 times as long. However, random forest may yield a saving in analyst time because there is theoretically less need for transformation of fully observed variables or investigation of nonlinearities and interactions. It is also possible to include a large number of related predictor variables in random forest models without encountering problems due to collinearity.

We included missForest and rfImpute in our study as examples of algorithms for completing single data sets (14). They replace missing values with predicted values rather than draw from a distribution, such that the imputed values do not have the correct joint distribution, leading to biased parameter estimates. Better predictions do not mean better coverage of confidence intervals; it is important that imputation methods incorporate the correct amount of variation in order to produce unbiased estimates with correct coverage of confidence intervals (34).

There was no difference in the results between linear regression and predictive mean matching. This was probably because the partially observed continuous variables in our data were approximately normally distributed; predictive mean matching may be preferred for variables that are not (conditionally) normally distributed (35).

Limitations

Although this study had strengths (it was based on real data, and the analysis was realistically complex), it also had important limitations. The most important limitation in producing general recommendations is that it was based on a single analysis of a single study, so results should be generalized to other data sets with caution.

A limitation of our resampling methodology was that in order to avoid excessive computing time we used only 10 imputations for most of the comparisons, leading to noisy estimates of between-imputation variability. To save time, we also restricted the number of cycles of MICE to 10, and although we evaluated plots of chain means and standard deviations between cycles for a few runs, this is a crude way of assessing chain convergence; it is possible that the chains may not have converged by the end of every run.

We ignored practice-level clustering at the imputation and analysis stages, for simplicity. If patients from the same practice are more similar than patients from different practices, the variance of parameter estimates might be underestimated, and parameter estimates may also be biased. This could be properly accounted for by using hierarchical models for analysis and imputation (36).

Recommendations for further development

We consider random forest multiple imputation to be promising, but it should be tested on a larger range of data sets and in simulations to explore whether it gives unbiased estimates where there are nontrivial nonlinearities or interactions in imputation models, such that a standard parametric MICE imputation which ignores them gives biased results. Random forest tuning parameters (such as the number of trees and number of nodes) should be further investigated.

Conclusions

MICE is one of the recommended methods for multiple imputation in electronic health-record data, and we have shown that standard parametric MICE and our new random forest MICE method work reasonably well under artificially introduced missingness at random in a realistically complex data set. Random forest imputation should be further investigated in situations where MICE with default parametric imputation models produces biased results.

ACKNOWLEDGMENTS

Author affiliations: Clinical Epidemiology Group, Department of Epidemiology and Public Health, School of Life and Medical Sciences, University College London, London, United Kingdom (Anoop D. Shah, Harry Hemingway); Department of Medical Statistics, London School of Hygiene and Tropical Medicine, London, United Kingdom (Jonathan W. Bartlett, James Carpenter); and National Institute for Cardiovascular Outcomes Research, School of Life and Medical Sciences, University College London, London, United Kingdom (Owen Nicholas).

This work was supported by grants from the United Kingdom National Institute for Health Research (grant RP-PG-0407-10314); the Wellcome Trust (grants 086091/Z/08/Z and 0938/30/Z/10/Z to A.D.S.); the Medical Research Council (grants MR/K006584/1, G0902393, and G0900724 to J.W.B.); the United Kingdom Biobank; and the Farr Institute of Health Informatics Research (Health eResearch Centre Network), funded by the Medical Research Council in partnership with Arthritis Research UK, the British Heart Foundation, Cancer Research UK, the Economic and Social Research Council, the Engineering and Physical Sciences Research Council, the National Institute of Health Research, the National Institute for Social Care and Health Research (Welsh Assembly Government), the Chief Scientist Office (Scottish Government Health Directorates), and the Wellcome Trust.

The views and opinions expressed herein are those of the authors and do not necessarily reflect those of the National Institute for Health Research or the United Kingdom Department of Health. All authors reviewed and approved the final manuscript.

Conflict of interest: none declared.

REFERENCES

1

et al.

Issues in multiple imputation of missing data for large general practice clinical databases

,

Pharmacoepidemiol Drug Saf

,

2010

, vol.

19

6

(pg.

618

-

626

)

2

,

Analysis of Incomplete Multivariate Data

,

1997

London, United Kingdom

Chapman & Hall Ltd

3

,

Statistical Analysis With Missing Data

,

2002

2nd ed

Hoboken, NJ

John Wiley & Sons, Inc

4

mice: Multivariate Imputation by Chained Equations in R

,

J Stat Softw

,

2011

, vol.

45

3

(pg.

1

-

67

)

5

Multiple imputation of missing covariates with non-linear effects and interactions: an evaluation of statistical methods

,

BMC Med Res Methodol

,

2012

, vol.

12

1

pg.

46

6

Auxiliary variables in multiple imputation in regression with missing X: a warning against including too many in small sample research

,

BMC Med Res Methodol

,

2012

, vol.

12

1

pg.

184

7

Multiple imputation for missing data via sequential regression trees

,

Am J Epidemiol

,

2010

, vol.

172

9

(pg.

1070

-

1076

)

8

Random forests

,

Mach Learn

,

2001

, vol.

45

1

(pg.

5

-

32

)

9

,

Manual on Setting Up, Using, and Understanding Random Forests V3.1

,

2002

Berkeley, CA

University of California, Berkeley

(Accessed November 11, 2013)

10

et al.

Brief review of regression-based and machine learning methods in genetic epidemiology: the Genetic Analysis Workshop 17 experience

,

Genet Epidemiol

,

2011

, vol.

35

S1

(pg.

S5

-

S11

)

11

et al.

Relative risk forests for exercise heart rate recovery as a predictor of mortality

,

J Am Stat Assoc

,

2004

, vol.

99

467

(pg.

591

-

600

)

12

et al.

Random survival forests

,

Ann Appl Stat

,

2008

, vol.

2

3

(pg.

841

-

860

)

13

et al.

Potential responders to FOLFOX therapy for colorectal cancer by random forests analysis

,

Br J Cancer

,

2012

, vol.

106

1

(pg.

126

-

132

)

14

MissForest—non-parametric missing value imputation for mixed-type data

,

Bioinformatics

,

2012

, vol.

28

1

(pg.

112

-

118

)

15

Imputation of missing values of tumour stage in population-based cancer registration

,

BMC Med Res Methodol

,

2011

, vol.

11

1

pg.

129

16

et al.

Data resource profile: CArdiovascular disease research using LInked BEspoke studies and electronic Records (CALIBER)

,

Int J Epidemiol

,

2012

, vol.

41

6

(pg.

1625

-

1638

)

17

,

CALIBERrfimpute: Imputation in MICE using Random Forest

,

2013

Vienna, Austria

Comprehensive R Archive Network

R package, version 0.1-2)

(Accessed November 12, 2013)

18

et al.

Validation and validity of diagnoses in the General Practice Research Database: a systematic review

,

Br J Clin Pharmacol

,

2010

, vol.

69

1

(pg.

4

-

14

)

19

Health and Social Care Information Centre

,

Hospital Episode Statistics

,

2013

Leeds, United Kingdom

Health and Social Care Information Centre

(Accessed November 11, 2013)

20

et al.

The Myocardial Ischaemia National Audit Project (MINAP)

,

Heart

,

2010

, vol.

96

16

(pg.

1264

-

1267

)

21

et al.

Threshold haemoglobin levels and the prognosis of stable coronary disease: two new cohorts and a systematic review and meta-analysis

,

PLoS Med

,

2011

, vol.

8

5

pg.

e1000439

22

et al.

Neutrophils and clinical outcomes in patients with acute coronary syndromes and/or cardiac revascularization: a systematic review on more than 34,000 subjects

,

Thromb Haemost

,

2011

, vol.

106

4

(pg.

591

-

599

)

23

et al.

Low lymphocyte count and cardiovascular diseases

,

Curr Med Chem

,

2011

, vol.

18

21

(pg.

3226

-

3233

)

24

Validity and efficiency of approximation methods for tied survival times in Cox regression

,

Biometrics

,

1997

, vol.

53

3

(pg.

1151

-

)

25

Imputing missing covariate values for the Cox model

,

Stat Med

,

2009

, vol.

28

15

(pg.

1982

-

1998

)

26

Small-sample degrees of freedom with multiple imputation

,

Biometrika

,

1999

, vol.

86

4

(pg.

948

-

955

)

27

R Development Core Team

,

R: A Language and Environment for Statistical Computing

,

2010

Vienna, Austria

R Foundation for Statistical Computing

(Accessed November 11, 2013)

28

,

missForest: Nonparametric Missing Value Imputation using Random Forest

,

2012

Vienna, Austria

Comprehensive R Archive Network

R package, version 1.3)

(Accessed November 11, 2013)

29

,

survival: Survival Analysis, Including Penalised Likelihood

,

2010

Vienna, Austria

Comprehensive R Archive Network

R package, version 2.36-2)

(Accessed November 23, 2010)

30

Classification and regression by randomForest

,

R News

,

2002

, vol.

2

3

(pg.

18

-

22

)

31

Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator

,

ACM Trans Model Comput Simul

,

1998

, vol.

8

1

(pg.

3

-

30

)

32

et al. ,

Multiple Imputation of Covariates by Fully Conditional Specification: Accommodating the Substantive Model

,

2012

Ithaca, NY

Cornell University Library

33

Estimating residual variance in random forest regression

,

Comput Stat Data Anal

,

2011

, vol.

55

11

(pg.

2937

-

2950

)

34

Multiple imputation after 18+ years

,

J Am Stat Assoc

,

1996

, vol.

91

434

(pg.

473

-

489

)

35

Comparison of imputation methods for handling missing covariate data when fitting a Cox proportional hazards model: a resampling study

,

BMC Med Res Methodol

,

2010

, vol.

10

pg.

112

36

REALCOM-IMPUTE software for multilevel multiple imputation with mixed response types

,

J Stat Softw

,

2011

, vol.

45

5

(pg.

1

-

14

)

Author notes

Abbreviations: CALIBER, Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; MAR, missing at random; MICE, multivariate imputation by chained equations.

© The Author 2014. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Supplementary data

Citations

Views

Altmetric

Metrics

Total Views 18,429

13,787 Pageviews

4,642 PDF Downloads

Since 12/1/2016

Month: Total Views:
December 2016 4
January 2017 21
February 2017 56
March 2017 37
April 2017 38
May 2017 34
June 2017 43
July 2017 27
August 2017 40
September 2017 34
October 2017 44
November 2017 58
December 2017 79
January 2018 117
February 2018 118
March 2018 93
April 2018 145
May 2018 146
June 2018 141
July 2018 116
August 2018 143
September 2018 101
October 2018 99
November 2018 182
December 2018 142
January 2019 179
February 2019 189
March 2019 230
April 2019 289
May 2019 238
June 2019 225
July 2019 207
August 2019 204
September 2019 165
October 2019 280
November 2019 273
December 2019 125
January 2020 147
February 2020 166
March 2020 177
April 2020 112
May 2020 134
June 2020 219
July 2020 134
August 2020 172
September 2020 173
October 2020 186
November 2020 194
December 2020 185
January 2021 175
February 2021 160
March 2021 248
April 2021 251
May 2021 237
June 2021 199
July 2021 182
August 2021 209
September 2021 205
October 2021 227
November 2021 210
December 2021 169
January 2022 237
February 2022 233
March 2022 311
April 2022 338
May 2022 332
June 2022 250
July 2022 235
August 2022 234
September 2022 241
October 2022 376
November 2022 232
December 2022 264
January 2023 267
February 2023 252
March 2023 345
April 2023 328
May 2023 299
June 2023 287
July 2023 243
August 2023 288
September 2023 233
October 2023 228
November 2023 217
December 2023 256
January 2024 330
February 2024 277
March 2024 298
April 2024 352
May 2024 305
June 2024 267
July 2024 248
August 2024 275
September 2024 288
October 2024 130

Citations

402 Web of Science

×

Email alerts

Citing articles via

More from Oxford Academic