Parallelized prediction error estimation for evaluation of high-dimensional models
Abstract
Summary: A multitude of new techniques promises to extract predictive information in bioinformatics applications. It has been recognized that a first step in validating the resulting model fits should rely on proper use of resampling techniques. However, this advice is frequently not followed, potential reasons being the difficulty of correct implementation and the computational demand. This is addressed by the R package peperr, which is designed for reliable prediction error estimation through resampling, potentially accelerated by parallel execution on a compute cluster. Its interface allows easy connection to newly developed model fitting routines. Performance evaluation of the latter is furthermore guided by diagnostic plots, which help to detect specific problems due to high-dimensional data structures.
Availability: http://cran.r-project.org, http://www.imbi.uni-freiburg.de/parallel
Contact: cp@fdm.uni-freiburg.de
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
When predictive models are to be built in bioinformatics applications, overfitting is a severe problem, especially in high-dimensional data settings. For example, in predictive survival models built from microarray data, where there are many covariates and few observations, the prediction performance of the resulting signatures will often be overestimated when the data used for fitting are also employed for evaluation. As validation data are only rarely available, prediction error estimation based on resampling procedures, such as the bootstrap or cross-validation, has been proposed.
One reason why such approaches are not generally used in practice may be the computational burden (e.g. often a separate test set is still set aside instead, losing observations for fitting). Parallel execution on a multiprocessor computer (nowadays available even in desktop computers with multicore processors) or a dedicated compute cluster promises to alleviate this problem. Resampling methods are ideally suited for parallel execution, as the computation for each split into training and test set is independent of the others. Although some computational overhead for the parallelization has to be taken into account, a noticeable speed-up can be reached even with a standard dual-core processor. A convenient interface to such parallel execution techniques might increase the use of resampling methods and thereby generally improve the standard of model evaluation.
When using resampling for prediction error estimation, it is essential to carry out all preprocessing steps, as well as the selection of complexity parameters, in each split separately, as information from the test set must not be re-used by the model fitting procedure in the training set (Simon et al., 2003). Otherwise, estimates of prediction performance may be severely biased. Although this strategy is well known, many studies can still be found that use no, or no adequate, resampling procedure (Dupuy and Simon, 2007). One reason may be that the implementation of resampling prediction error estimators such as the .632+ estimate (Efron and Tibshirani, 1997) is not trivial. For example, it is easy to lose track of where complexity selection has to be repeated and where it does not. Specifically, in bootstrap samples complexity selection is essential to avoid bias, while the no-information error is estimated with fixed complexity. Also, in a time-to-event setting, an overall set of evaluation time points has to be determined to allow for aggregation across the bootstrap samples.
We developed a parallelized module that provides model evaluation based on resampling techniques in a simple and very flexible form, relieving the end-user of the implementation details of resampling estimates and of almost all issues concerning the parallelization itself. In addition to various integrated procedures for resampling, selection of model complexity, model fitting and measurement of prediction error, users can easily employ their own routines. In particular, in high-dimensional settings with low sample size, model fitting in combination with resampling procedures may fail to adequately reflect the situation in new data. For example, selection of model complexity may be biased when resampling with replacement is used (Binder and Schumacher, 2008a). The package tries to avoid such problems by using sensible defaults. In addition, diagnostic plots are offered for checking whether the resampling procedure works adequately. For judging estimates of the prediction error, the null model, i.e. a statistical model including no information about the covariates, is fitted according to the response type. This may seem trivial, but it is nevertheless sometimes forgotten and leads to problematic interpretations of model fits (Binder and Schumacher, 2008c).
2 OVERVIEW
For resampling-based prediction error estimation, training and test datasets are repeatedly generated from the original data. The two most prominent approaches are the bootstrap and cross-validation.
For a realization of training and test data, the following steps have to be performed: (i) selection of model complexity (if desired), using the training dataset; (ii) fitting the model with the selected or a given complexity on the training dataset; and (iii) measurement of prediction error on the corresponding test set. Step (i) can be omitted if the prediction model does not require a specified complexity. Otherwise, the complexity can be stated explicitly, i.e. one value or a set of complexity values can be passed in a single call of the function, which is useful if the prediction error estimates themselves shall guide the selection of a model complexity parameter. In the latter case, a prediction model is fitted for each complexity value, using the current training dataset. Another possibility is to specify a function that determines the best complexity out of a set of given values. This generic loop is sketched below.
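The following is a conceptual sketch of one such resampling scheme in plain R; it illustrates the three steps only and is not peperr's internal code. The names select.complexity, fit.model and prediction.error are hypothetical placeholders for the routines just described:

    ## conceptual sketch of resampling-based prediction error estimation;
    ## select.complexity, fit.model and prediction.error are hypothetical
    ## placeholders, not peperr functions
    B   <- 100                  # number of resampling splits
    n   <- nrow(x)
    err <- numeric(B)
    for (b in seq_len(B)) {
        train <- sample(n, replace = TRUE)      # bootstrap sample
        test  <- setdiff(seq_len(n), train)     # corresponding test set
        ## (i) select complexity on the training data only
        cpar  <- select.complexity(response[train], x[train, ])
        ## (ii) fit the model with that complexity on the training data
        model <- fit.model(response[train], x[train, ], cpar)
        ## (iii) measure prediction error on the test data
        err[b] <- prediction.error(model, response[test], x[test, ])
    }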
Finally, the obtained prediction error measurements for each split are aggregated depending on the response. We provide .632+ estimates of prediction error curves for survival models, as outlined in Schumacher et al. (2007) (see also Efron and Tibshirani, 1997; Gerds and Schumacher, 2007), and of the Brier score (Brier, 1950) and misclassification rate for binary responses. Out-of-bag prediction error estimates are available as well.
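For a binary response with observed outcomes $y_i \in \{0,1\}$ and predicted probabilities $\hat{\pi}_i$, the Brier score is simply the mean squared prediction error,

$$ \mathrm{BS} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{\pi}_i \right)^2 , $$

and for survival data it is evaluated as a function of time, with appropriate weighting to account for censoring (Gerds and Schumacher, 2007).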
Sensible defaults for all of the above steps are provided. For example, bootstrap without replacement is used if the number of observations is smaller than the number of covariates; otherwise, bootstrap with replacement is employed (following Binder and Schumacher, 2008a). Users can, however, also employ their own routines, as sketched below, then automatically benefiting from parallelization.
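As an illustration of what such a user-supplied routine may look like, the sketch below wraps an unpenalized Cox model. The (response, x, cpar, ...) signature is our assumption based on the package documentation and should be checked against the current peperr manual; fit.cox.simple is a hypothetical name:

    ## hypothetical user-defined fit routine: an unpenalized Cox model;
    ## the (response, x, cpar, ...) interface is assumed, not verified
    library(survival)
    fit.cox.simple <- function(response, x, cpar, ...) {
        ## cpar is ignored here, as no complexity parameter is needed
        coxph(response ~ x, ...)
    }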
Parallelization is performed via the R package snowfall, which is a top-level wrapper around the R package snow (Rossini et al., 2007). snowfall has the advantage that it can be used with sfCluster (Knaus et al., 2009), a Unix tool for flexible and comfortable management of parallel R processes, but it can also be used with all cluster solutions supported by snow. Typically, all libraries, functions and variables required for the computation have to be exported manually to the cluster (see the sketch below). peperr automates this by analyzing user-defined functions and loading the required components to the cluster. The user does not have to deal with parallelization at all and benefits from the extended error handling of snowfall and an adequate .632+ estimate implementation. Parallel random number generation is automatically enabled in such a way that results can be reproduced.
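For comparison, a minimal sketch of the manual snowfall setup that peperr automates (the cluster size, loaded package and exported objects are illustrative):

    ## manual snowfall setup that peperr performs automatically
    library(snowfall)
    sfInit(parallel = TRUE, cpus = 2)   # start a local cluster
    sfLibrary(survival)                 # load required packages on all nodes
    sfExport("response", "x")           # export required data to all nodes
    res <- sfLapply(1:100, function(b) {
        ## ... one resampling split per call ...
        b
    })
    sfStop()                            # shut down the cluster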
3 APPLICATION
For an illustration, we apply our package to high-dimensional survival data (van de Vijver et al., 2002) and the CoxBoost algorithm (Binder and Schumacher, 2008b). The dataset contains a 70-gene profile of 295 patients with primary breast carcinomas. A Cox proportional hazards model is fitted by likelihood-based boosting, where the number of boosting steps is determined by a cross-validation procedure. These methods are part of the package, namely fit.CoxBoost and complexity.mincv.CoxBoost. With survival response response and matrix of covariates x, the call then reduces to a single line, sketched below.
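A minimal version of this call, with argument names as we understand them from the package documentation (check ?peperr for the exact interface and defaults), is:

    ## one-call resampling-based prediction error estimation; the
    ## argument names below follow our reading of the peperr manual
    library(peperr)
    pe <- peperr(response = response, x = x,
                 fit.fun = fit.CoxBoost,
                 complexity = complexity.mincv.CoxBoost)

Parallelization options (e.g. the number of CPUs or the cluster type) can be passed via additional arguments; see the package documentation.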
A simple overview of the results is given by diagnostic plots. For example, the distribution of the number of boosting steps selected in each bootstrap sample can highlight a potential complexity selection bias (Binder and Schumacher, 2008a); see Supplementary Data Figure S1, left panel.
The predictive partial log-likelihood (PLL) measures the prediction performance of each model, fitted in a bootstrap sample, on the data not contained in that sample (Verweij and van Houwelingen, 1993). Multiplying by -2 yields a deviance-like measure, for which small values indicate good prediction performance. Plots of the predictive PLL of each bootstrap sample typically show a slight convex shape, as both a very small and a very large level of complexity decrease prediction performance, because of under- and overfitting, respectively (Supplementary Data Fig. S1, right panel).
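In symbols (our notation, matching this description): writing $\hat{\beta}_b(c)$ for the model fitted in bootstrap sample $b$ at complexity $c$, and $l(\cdot)$ for the partial log-likelihood evaluated on the out-of-bag data of sample $b$,

$$ \mathrm{PLL}_b(c) = l\bigl(\hat{\beta}_b(c)\bigr), \qquad D_b(c) = -2\,\mathrm{PLL}_b(c), $$

so that small values of the deviance-like $D_b(c)$ indicate good prediction performance.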
Prediction error curves show the prediction error as a function of time. For the apparent error, the same data are used for training, i.e. fitting the model, and for testing. This leads to an overly optimistic estimate of the prediction error. A more reasonable estimate is provided by the .632+ bootstrap prediction error estimate (Gerds and Schumacher, 2007). It is a weighted linear combination of the apparent error and the mean of the estimated prediction errors in the bootstrap samples, as shown below. Both curves (Fig. 1) show instability at large evaluation time points, due to the few observations still at risk at those time points. Moreover, the graphic shows that the fit from CoxBoost performs better overall than the null model, which indicates a reasonable model fit.
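For reference, this is the .632+ estimator of Efron and Tibshirani (1997), applied pointwise over the evaluation time points for prediction error curves. Writing $\overline{\mathrm{err}}$ for the apparent error, $\widehat{\mathrm{Err}}^{(1)}$ for the mean out-of-bag error across bootstrap samples and $\hat{\gamma}$ for the no-information error,

$$ \widehat{\mathrm{Err}}^{(.632+)} = (1 - \hat{w})\,\overline{\mathrm{err}} + \hat{w}\,\widehat{\mathrm{Err}}^{(1)}, \qquad \hat{w} = \frac{0.632}{1 - 0.368\,\hat{R}}, \qquad \hat{R} = \frac{\widehat{\mathrm{Err}}^{(1)} - \overline{\mathrm{err}}}{\hat{\gamma} - \overline{\mathrm{err}}}, $$

where $\hat{R}$ is the relative overfitting rate (truncated to $[0,1]$).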
Fig. 1.
Prediction error curve of the null model, the .632+ bootstrap prediction error estimate, the apparent error in the full dataset and the out-of-bag prediction error estimates of the bootstrap samples.
Using parallelization on a standard dual-core laptop with the snow/LAM-MPI libraries installed, we reach a speed-up of about 1.8 compared with sequential mode in this application (while no other processes were running). Additionally, we ran the procedure sequentially on a compute server with 16 GB of memory and in parallel using the 24 CPUs of three identical compute servers, where a speed-up of approximately factor 19 was obtained. Such speed gains, together with the ability to easily incorporate newly developed routines, make prediction error estimation feasible even for large datasets. We expect that resampling methods will be applied more often, and properly, as the package peperr provides a convenient one-call solution.
ACKNOWLEDGEMENTS
We gratefully acknowledge the support from Deutsche Forschungsgemeinschaft (DFG Forschergruppe FOR 534). We would like to thank the four anonymous referees for their comments that helped to considerably improve the manuscript as well as the peperr package and also Jochen Knaus for providing the snowfall package and adapting it to our requirements.
REFERENCES
Binder,H. and Schumacher,M. (2008a) Adapting prediction error estimates for biased complexity selection in high-dimensional bootstrap samples. Stat. Appl. Genet. Mol. Biol., 7, Article 12.
Binder,H. and Schumacher,M. (2008b) Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models. BMC Bioinformatics, 9, 14.
Binder,H. and Schumacher,M. (2008c) Comment on 'Network-constrained regularization and variable selection for analysis of genomic data'. Bioinformatics, 24, 2566-2568.
Brier,G.W. (1950) Verification of forecasts expressed in terms of probability. Mon. Weather Rev., 78, 1-3.
Dupuy,A. and Simon,R.M. (2007) Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. J. Natl Cancer Inst., 99, 147-157.
Efron,B. and Tibshirani,R. (1997) Improvements on cross-validation: the .632+ bootstrap method. J. Am. Stat. Assoc., 92, 548-560.
Gerds,T.A. and Schumacher,M. (2007) Efron-type measures of prediction error for survival analysis. Biometrics, 63, 1283-1287.
Knaus,J. et al. (2009) Easier parallel computing in R with snowfall and sfCluster. R News (accepted).
Rossini,A.J. et al. (2007) Simple parallel statistical computing in R. J. Comput. Graph. Stat., 16, 399-420.
Schumacher,M. et al. (2007) Assessment of survival prediction models based on microarray data. Bioinformatics, 23, 1768-1774.
Simon,R. et al. (2003) Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J. Natl Cancer Inst., 95, 14-18.
van de Vijver,M.J. et al. (2002) A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med., 347, 1999-2009.
Verweij,P.J.M. and van Houwelingen,H.C. (1993) Cross-validation in survival analysis. Stat. Med., 12, 2305-2314.
Author notes
Associate Editor: David Rocke
© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org