The practical effect of batch on genomic prediction

Hilary S Parker et al. Stat Appl Genet Mol Biol. 2012.

Abstract

Measurements from microarrays and other high-throughput technologies are susceptible to non-biological artifacts like batch effects. It is known that batch effects can alter or obscure the set of significant results and biological conclusions in high-throughput studies. Here we examine the impact of batch effects on predictors built from genomic technologies. To investigate batch effects, we collected publicly available gene expression measurements with known outcomes, and estimated batches using date. Using these data we show (1) the impact of batch effects on prediction depends on the correlation between outcome and batch in the training data, and (2) removing expression measurements most affected by batch before building predictors may improve the accuracy of those predictors. These results suggest that (1) training sets should be designed to minimize correlation between batches and outcome, and (2) methods for identifying batch-affected probes should be developed to improve prediction results for studies with high correlation between batches and outcome.

Figures

Figure 1. Assignment of batch by date of microarray studies

We assigned batches as indicated, based on the histogram of array dates for (a) Wang et al. (2005) and (b) Minn et al. (2005) data.
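As a rough illustration of date-based batch assignment, the sketch below starts a new batch whenever the gap between consecutive scan dates exceeds a cutoff. The gap rule, the `max_gap_days` value, and the function name are assumptions made for illustration; the caption only states that batches were read off the histogram of array dates.

```python
from datetime import date

def assign_batches_by_date(scan_dates, max_gap_days=30):
    """Label each array with a batch index. A new batch starts whenever the
    gap to the previous scan date exceeds max_gap_days (hypothetical cutoff)."""
    # Indices of the arrays in chronological order.
    order = sorted(range(len(scan_dates)), key=lambda i: scan_dates[i])
    labels = [0] * len(scan_dates)
    batch = 0
    for prev, cur in zip(order, order[1:]):
        if (scan_dates[cur] - scan_dates[prev]).days > max_gap_days:
            batch += 1
        labels[cur] = batch
    return labels

# Toy scan dates (invented): two clusters several months apart.
dates = [date(2003, 1, 10), date(2003, 6, 2), date(2003, 1, 12), date(2003, 6, 5)]
batches = assign_batches_by_date(dates)
```

In practice the cutoff would be chosen after inspecting the date histogram rather than fixed in advance.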

Figure 2. Design of cross-validation for prediction accuracy, allowing for (a) within, (b) between and (c) pooled batches

From each batch we randomly selected two mutually exclusive subsets of arrays, one for training and one for testing; all four subsets had the same number of samples, with a proportional mix of each outcome. These subsets were preprocessed separately. We built a predictive model on the training set and then tested it on either (a) the testing set of the same batch or (b) the testing set of the other batch. We iterated this process 100 times to obtain robust accuracy rates for the models. (c) In addition, we created an internal control that pooled the batches together: we randomly selected two mutually exclusive, equal-sized training and testing subsets of the arrays, with a mix of batch and outcome proportional to the entire dataset. We again built a predictive model on the training set, tested it on the testing set, and iterated 100 times.
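The sampling step of this design can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the `(sample_id, outcome)` data layout, and the toy batches are all hypothetical, and the sketch assumes each set is at most half the size of its batch so that training and testing draws never overlap.

```python
import random

def stratified_split(samples, n_per_set, rng):
    """Draw disjoint training and testing subsets of about n_per_set samples
    each, keeping the outcome mix roughly proportional to the input.
    `samples` is a list of (sample_id, outcome) pairs."""
    by_outcome = {}
    for sample_id, outcome in samples:
        by_outcome.setdefault(outcome, []).append(sample_id)
    train, test = [], []
    for ids in by_outcome.values():
        # Share of each set given to this outcome, rounded down.
        k = n_per_set * len(ids) // len(samples)
        picked = rng.sample(ids, 2 * k)  # sampling without replacement
        train += picked[:k]
        test += picked[k:]
    return train, test

rng = random.Random(0)
# Toy batches (invented): 60% ER+ in each.
batch_a = [(f"A{i}", "ER+" if i < 12 else "ER-") for i in range(20)]
batch_b = [(f"B{i}", "ER+" if i < 6 else "ER-") for i in range(10)]
train_a, test_a = stratified_split(batch_a, 5, rng)
train_b, test_b = stratified_split(batch_b, 5, rng)
# Within-batch accuracy: fit on train_a, score on test_a.
# Between-batch accuracy: fit on train_a, score on test_b.
```

Repeating the draw with fresh random subsets (100 times in the design described) gives a distribution of accuracy rates rather than a single estimate.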

Figure 3. Simulated design allows for predictive model to be built on subset of data with batch and outcome perfectly confounded

We built the model on a subset of the data, using ER− samples only from batch A, and ER+ samples only from batch B. We then tested the accuracy of the model on a subset of the data from each batch and outcome combination and report the accuracy in Figure 4.
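A minimal sketch of how such a perfectly confounded split could be constructed is shown below. The function name, the `(sample_id, batch, outcome)` layout, the per-cell training size, and the toy samples are assumptions for illustration only.

```python
import random

def confounded_design(samples, n_train_per_cell, rng):
    """samples: list of (sample_id, batch, outcome) triples.
    Training draws only from (A, ER-) and (B, ER+), so batch and outcome are
    perfectly confounded. Everything not drawn for training is held out,
    grouped by (batch, outcome) cell."""
    train_cells = {("A", "ER-"), ("B", "ER+")}
    cells = {}
    for sample_id, batch, outcome in samples:
        cells.setdefault((batch, outcome), []).append(sample_id)
    train, test = [], {}
    for cell, ids in cells.items():
        shuffled = rng.sample(ids, len(ids))  # random order, no replacement
        n_train = n_train_per_cell if cell in train_cells else 0
        train += shuffled[:n_train]
        test[cell] = shuffled[n_train:]
    return train, test

# Toy data (invented): six samples in each batch/outcome cell.
samples = [(f"{b}{o}{i}", b, o)
           for b in "AB" for o in ("ER-", "ER+") for i in range(6)]
train, test = confounded_design(samples, 4, random.Random(1))
```

Scoring a fitted model separately on each entry of `test` yields one accuracy per batch/outcome combination, including the two combinations the model never saw in training.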

Figure 4. Prediction accuracy rates for perfect confounding simulated design show that batch and outcome information is conflated

The study design is presented above (Figure 3), and prediction accuracy rates are shown as boxplots for the accuracy measurements taken from the 100 iterations. Results are shown only for fRMA-preprocessing. RMA-preprocessing results are shown in the Supplemental Materials, and are very similar to the fRMA-preprocessing results. The results show that batch/outcome combinations used in the training dataset (Figure 3) perform much better than batch/outcome combinations not used in the training dataset. This suggests that batch information is heavily used by the predictive algorithm when there is high confounding between batch and outcome.

Figure 5

Density plots of _β_₁ and _β_₂ estimates from fitted model 1 over 100 iterations of the simulated design (Figure 3), using (a) the Wang et al. (2005) data and (b) the Minn et al. (2005) data. We used both RMA and fRMA normalization for each study.

Figure 6. Prediction accuracy improves as batch-affected probes are removed

Batch-affected probes in the Wang et al. (2005) dataset were determined by fitting model 1 and selecting probes with the most significant _β_₁ estimates.
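Model 1 is not written out in this excerpt. The sketch below assumes a per-probe linear model, expression = β₀ + β₁·batch + β₂·outcome + error, with batch and outcome coded 0/1 and β₁ as the batch coefficient. Under that assumption, probes can be ranked by the absolute t-statistic of β₁. All names and the toy data are hypothetical, and the fit fails if batch and outcome are perfectly confounded (singular design) or if a probe is fitted exactly (zero residual variance).

```python
import math

def batch_t_statistic(y, batch, outcome):
    """Fit y = b0 + b1*batch + b2*outcome by least squares and return the
    t-statistic of b1. batch and outcome are 0/1 lists (assumed coding)."""
    n = len(y)
    X = [[1.0, float(b), float(o)] for b, o in zip(batch, outcome)]
    # Augmented system [X'X | X'y | I]; Gauss-Jordan elimination turns it
    # into [I | beta | (X'X)^-1].
    A = [[sum(r[i] * r[j] for r in X) for j in range(3)]
         + [sum(r[i] * v for r, v in zip(X, y))]
         + [1.0 if i == k else 0.0 for k in range(3)]
         for i in range(3)]
    for c in range(3):
        p = max(range(c, 3), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        piv = A[c][c]
        A[c] = [v / piv for v in A[c]]
        for r in range(3):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    beta = [A[i][3] for i in range(3)]
    rss = sum((v - sum(b * x for b, x in zip(beta, r))) ** 2
              for r, v in zip(X, y))
    # Standard error of b1: residual variance times the (1, 1) entry of (X'X)^-1.
    se_b1 = math.sqrt(rss / (n - 3) * A[1][5])
    return beta[1] / se_b1

def rank_probes_by_batch(expr, batch, outcome):
    """expr maps probe id -> expression values. Most batch-affected first."""
    t = {p: abs(batch_t_statistic(y, batch, outcome)) for p, y in expr.items()}
    return sorted(t, key=t.get, reverse=True)

# Toy data (invented): one probe shifts with batch, the other with outcome.
batch = [0, 0, 0, 0, 1, 1, 1, 1]
outcome = [0, 1, 0, 1, 0, 1, 0, 1]
expr = {
    "probe_outcome": [1.0, 3.0, 1.1, 3.1, 0.9, 2.9, 1.2, 3.2],
    "probe_batch": [1.0, 1.1, 0.9, 1.2, 3.0, 3.1, 2.9, 3.2],
}
ranking = rank_probes_by_batch(expr, batch, outcome)
```

Dropping the top-ranked probes before training is then a filtering step applied to the training data, with the number of probes removed left as a tuning choice.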

Figure 7. Histogram of _R_² values from model 3 for background-adjusted probe-level data from 262 microarray studies, as well as the Wang et al. (2005) and Minn et al. (2005) datasets

_R_² values show that no more than 15% of the variation in the measure of batch-affectedness in the probes is due to the probe sequence.

Figure 8. Examples of parameter estimates from model 3 for the Wang et al. (2005) and Minn et al. (2005) datasets, as well as 262 others, do not show a consistent pattern

Model 3 was fit on GCRMA background-corrected probe-level data with the outcome coefficient excluded. The vast majority of the models had coefficients close to zero, as in the Minn et al. (2005) dataset. Some studies did show coefficient patterns, such as the Wang et al. (2005) dataset. However, these patterns were not consistent from study to study.
