Using Super Learner Prediction Modeling to Improve... : Epidemiology

Methods

Using Super Learner Prediction Modeling to Improve High-dimensional Propensity Score Estimation

Wyss, Richard (a); Schneeweiss, Sebastian (a); van der Laan, Mark (b); Lendle, Samuel D. (b); Ju, Cheng (b); Franklin, Jessica M. (a)

From the (a) Division of Pharmacoepidemiology and Pharmacoeconomics, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA; and (b) Department of Biostatistics, University of California, Berkeley, CA.

Submitted September 19, 2016; accepted September 27, 2017.

Code availability: Software for the methods discussed in the manuscript is available at https://github.com/lendle/hdps and https://github.com/lendle/TargetedLearning.jl. R code for producing plasmode simulations is available upon request.

Supported by PCORI contract ME-1303-5638.

The authors report no conflicts of interest.

Supplemental digital content is available through direct URL citations in the HTML and PDF versions of this article (www.epidem.com).

Correspondence: Richard Wyss, Division of Pharmacoepidemiology and Pharmacoeconomics, Brigham and Women’s Hospital and Harvard Medical School, 1620 Tremont St, Suite 3030, Boston, MA 02120.

Abstract

The high-dimensional propensity score is a semiautomated variable selection algorithm that can supplement expert knowledge to improve confounding control in nonexperimental medical studies utilizing electronic healthcare databases. Although the algorithm can be used to generate hundreds of patient-level variables and rank them by their potential confounding impact, it remains unclear how to select the optimal number of variables for adjustment. We used plasmode simulations based on empirical data to discuss and evaluate data-adaptive approaches for variable selection and prediction modeling that can be combined with the high-dimensional propensity score to improve confounding control in large healthcare databases. We considered approaches that combine the high-dimensional propensity score with Super Learner prediction modeling, a scalable version of collaborative targeted maximum-likelihood estimation, and penalized regression. We evaluated performance using bias and mean squared error (MSE) in effect estimates. Results showed that the high-dimensional propensity score can be sensitive to the number of variables included for adjustment and that severe overfitting of the propensity score model can negatively impact the properties of effect estimates. Combining the high-dimensional propensity score with Super Learner was the most consistent strategy, in terms of reducing bias and MSE in the effect estimates, and may be promising for semiautomated data-adaptive propensity score estimation in high-dimensional covariate datasets.
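To make the variable-selection question concrete, the following is a minimal sketch of cross-validated selection among propensity score models that adjust for different numbers of ranked covariates. It is not the authors' software (their code is in the repositories listed above) and it simplifies Super Learner in one important way: a full Super Learner forms a weighted combination of candidate learners, whereas this sketch implements only the "discrete" variant, which keeps the single candidate with the lowest cross-validated loss. The simulated data, the treatment model coefficients, the candidate list, and all function names (`fit_logistic`, `cv_risk`, `discrete_super_learner`, `simulate`) are assumptions made for illustration, and the covariates are taken to be already ranked, as the high-dimensional propensity score would provide.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, t, n_iter=100, lr=1.0):
    """Fit an unpenalized logistic model by batch gradient descent.

    Returns [intercept, coef_1, ..., coef_p]. Rows of X may be empty,
    in which case an intercept-only model is fitted.
    """
    n = len(X)
    p = len(X[0]) if X else 0
    beta = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for row, ti in zip(X, t):
            err = predict(beta, row) - ti
            grad[0] += err
            for j, x in enumerate(row):
                grad[j + 1] += err * x
        beta = [b - lr * g / n for b, g in zip(beta, grad)]
    return beta


def predict(beta, row):
    # Predicted probability of treatment (the propensity score).
    return sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], row)))


def log_loss(p, ti, eps=1e-12):
    # Negative Bernoulli log-likelihood, clipped to avoid log(0).
    p = min(max(p, eps), 1.0 - eps)
    return -(ti * math.log(p) + (1 - ti) * math.log(1.0 - p))


def cv_risk(X, t, k, n_folds=5):
    """Cross-validated log-loss of the model using the first k ranked covariates."""
    n = len(X)
    total = 0.0
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        valid = [i for i in range(n) if i % n_folds == fold]
        beta = fit_logistic([X[i][:k] for i in train], [t[i] for i in train])
        total += sum(log_loss(predict(beta, X[i][:k]), t[i]) for i in valid)
    return total / n


def discrete_super_learner(X, t, candidate_ks):
    """Return the k with the lowest cross-validated risk, plus all risks."""
    risks = {k: cv_risk(X, t, k) for k in candidate_ks}
    best_k = min(risks, key=risks.get)
    return best_k, risks


def simulate(n=200, p=6, seed=0):
    # Hypothetical data: only the two top-ranked covariates drive treatment.
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    t = [1 if rng.random() < sigmoid(1.5 * r[0] - 1.5 * r[1]) else 0 for r in X]
    return X, t


X, t = simulate()
candidate_ks = [0, 2, 4, 6]
best_k, risks = discrete_super_learner(X, t, candidate_ks)
print("selected number of covariates:", best_k)
for k in candidate_ks:
    print(f"k={k}: cross-validated log-loss = {risks[k]:.3f}")
```

The selection criterion here is predictive loss for treatment assignment, which is what Super Learner optimizes. As the abstract notes, a propensity score model that predicts treatment well does not by itself guarantee low bias or MSE in the effect estimate, so a sketch like this would be only the first stage of an analysis.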