Predictor correlation impacts machine learning algorithms: implications for genomic studies (original) (raw)

Journal Article

1 Department of Statistical Genetics, Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford OX3 7BN, 2 Department of Clinical Pharmacology, University of Oxford, Old Road Campus Research Building, Roosevelt Road, Oxford, OX3 7DQ, UK, 3 Genes, Cognition and Psychosis Program, Intramural Research Program, National Institute of Mental Health, Room 4S-235, 10 Center Drive and 4 Mathematical and Statistical Computing Laboratory, Division of Computational Bioscience, Center for Information Technology, National Institutes of Health, Bethesda, MD 20892, USA

* To whom correspondence should be addressed.

Search for other works by this author on:

Revision received:

14 May 2009

Navbar Search Filter Mobile Enter search term Search

Abstract

Motivation: The advent of high-throughput genomics has produced studies with large numbers of predictors (e.g. genome-wide association, microarray studies). Machine learning algorithms (MLAs) are a computationally efficient way to identify phenotype-associated variables in high-dimensional data. There are important results from mathematical theory and numerous practical results documenting their value. One attractive feature of MLAs is that many operate in a fully multivariate environment, allowing for small-importance variables to be included when they act cooperatively. However, certain properties of MLAs under conditions common in genomic-related data have not been well-studied—in particular, correlations among predictors pose a problem.

Results: Using extensive simulation, we showed considering correlation within predictors is crucial in making valid inferences using variable importance measures (VIMs) from three MLAs: random forest (RF), conditional inference forest (CIF) and Monte Carlo logic regression (MCLR). Using a case–control illustration, we showed that the RF VIMs—even permutation-based—were less able to detect association than other algorithms at effect sizes encountered in complex disease studies. This reduction occurred when ‘causal’ predictors were correlated with other predictors, and was sharpest when RF tree building used the Gini index. Indeed, RF Gini VIMs are biased under correlation, dependent on predictor correlation strength/number and over-trained to random fluctuations in data when tree terminal node size was small. Permutation-based VIM distributions were less variable for correlated predictors and are unbiased, thus may be preferred when predictors are correlated. MLAs are a powerful tool for high-dimensional data analysis, but well-considered use of algorithms is necessary to draw valid conclusions.

Contact: kristin.nicodemus@well.ox.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

High-throughput genomic technology has led to an increasing interest in data-mining approaches such as machine learning algorithms (MLAs) (Biau et al., 2008; Breiman, 2001; Devroye et al., 1996; Hothorn et al., 2006; Kooperberg and Ruczinski, 2005; Liaw and Weiner, 2002) to efficiently locate predictors related to outcome in high-dimensional data (Bureau et al., 2003; Dettling, 2004; Díaz-Uriarte and Alvarez de Andrés, 2006; Eller et al., 2007; Enot et al., 2006; Jiang et al., 2007; Kooperberg and Ruczinski, 2005; Li et al., 2004; Lunetta et al., 2004; Pang et al., 2006; Schwender et al., 2004; Wang et al., 2006). Random forest (RF) (Biau et al., 2008; Breiman, 2001; Devroye et al., 1996) has been commonly used in statistical genetics (Bureau et al., 2003, 2005; Lunetta et al., 2004; Pang et al., 2006; Schwender et al., 2004; Wang et al., 2006), gene expression profiling, metabolomic and proteomic research (Dettling, 2004; Díaz-Uriarte and Alvarez de Andrés, 2006; Eller et al., 2007; Enot et al., 2006; Jiang et al., 2007; Li et al., 2004). RF is an example of an ensemble-based MLA, which combines information across classification trees and provides a natural framework for assessing association and interaction/epistasis simultaneously using variable importance measures (VIMs) (Nicodemus et al., 2007). Two recent studies examined properties of MLA VIMs for genomic data: the first showed biased variable selection in RF when using bootstrap sampling and when scales of predictor variables differ (Strobl et al., 2007). The second showed that permutation VIMs are unstable when effect sizes are modest (Nicodemus et al., 2007). Genomic data often exhibit complex underlying correlation structures between predictors, for example, among genes or single nucleotide polymorphisms (SNPs) in linkage disequilibrium (LD) in association studies, or co-regulation in gene expression studies. However, the impact of such correlation on VIMs has been unclear, and hence was the central object of this study.

To illustrate the behavior of VIMs from three popular MLAs under controlled levels of within-predictor correlation, we used a genetic case–control association study design. We simulated SNP data from five genes (for a total of 199 SNPs) based on LD patterns observed in HapMap CEU founder data (Fig. 1), and varied the odds ratio (OR) from 1.0 (_H_0) to 5.0. The disease-associated or ‘causal’ SNP was located in a block of strong LD (_r_2 range between causal SNP and LD block SNPs: 0.82–1.0; 26 SNPs), and we assessed the difference in associated LD block SNPs/unassociated SNPs VIM distributions in 500 replicates using Bonferroni-corrected Wilcoxon rank-sum tests.

LD (r2) plots for genes included in simulated case–control study. Large LD plot: gene (CITRON) simulated to show association with outcome; region of association/LD indicated by box. Small LD plots: four additional genes simulated to show no association with outcome: LIS1, FEZ1, NDE1 and NDEl1. The r2 value indicated for black 1.0, for white 0.

Fig. 1.

LD (_r_2) plots for genes included in simulated case–control study. Large LD plot: gene (CITRON) simulated to show association with outcome; region of association/LD indicated by box. Small LD plots: four additional genes simulated to show no association with outcome: LIS1, FEZ1, NDE1 and NDEl1. The _r_2 value indicated for black 1.0, for white 0.

We also assessed predictive ability of RF/conditional inference forest (CIF) using the top 25 ranked VIMs on the ‘out-of-bag’ data. Algorithms and VIMs are discussed in more detail in Section 2 and Supplementary Material.

We observed a reduction in the ability of RF to detect association of the causal SNP or SNPs in its LD block with outcome when compared with the other MLAs at effect sizes likely to be found in studies of complex diseases. To further evaluate this reduction, we (i) assessed the impact of within-predictor correlation strength on the VIMs, (ii) assessed VIM sensitivity to the size of the terminal nodes in the trees, (iii) examined the influence of within-predictor correlation on categorical and continuous predictors and (iv) determined the impact of within-predictor correlation concurrently with the potential for bias due to predictors having different scales of measurement (binary versus continuous) (Strobl et al., 2007). It is shown that RF VIMs based on the Gini index are biased under even weak within-predictor correlation and that the distributions of RF permutation VIMs are sensitive to correlation structure as well, although appear to be unbiased. CIF and Monte Carlo logic regression (MCLR) VIMs are observed to be both unbiased and less influenced by within-predictor correlation, and thus may be preferred in -omic applications where correlated predictors are assumed to be present.

2 METHODS

2.1 Study design and data simulation

All data analyses and simulations, with the exception of haplotype imputation, were performed using R v.2.6.0 (R Development Core Team, 2007). Genetic data simulation was performed using the R package SH2IPS (Simulation of Haplotype Heterogeneity, Interaction and Population Stratification, available from the corresponding author on request), which retains observed LD patterns from input data. Mimicking a pathway-based analysis of susceptibility genes for psychosis, five genes interacting with DISC1 (Disrupted In Schizophrenia 1) were selected [CITRON (104 SNPs), LIS1 (39), FEZ1 (27), NDE1 (17) and NDEL1 (12)]; genotypes from HapMap (www.hapmap.org) Phase I data with a minor allele frequency >0.05 were obtained from Western European population (CEU) founders. Genotype data were assigned haplotype phase for each gene using PHASE v.2.1 (Stephens and Donnelly, 2003; Stephens et al., 2001) and gene-based haplotype frequencies were used as weights in sampling 1 000 000 haplotypes to create populations of 500 000 individuals. Assuming Hardy Weinberg equilibrium, individuals were assigned case status based on the population prevalence of disease (K set to 10%), the desired OR (e.g. 1.75), the genetic model (recessive) and frequency of the risk genotype (q2) [e.g. generating model: K ∝ 1.0×(p2)+1.0×(2pq)+1.75×(q2)]. CITRON was the causal variant-containing gene. After case assignment, 1000 cases were randomly sampled from the case pool and 1000 individuals were randomly selected from the entire base population as controls. All genes were on different chromosomes with the exception of LIS1 and NDEL1 (LIS1 location=17p13.3, NDEL1 location=17p13.1). Pair-wise _r_2 between SNPs in LIS1 and NDEL1 ranged from 0.0 to 0.05, indicating no substantial LD between them. Because MCLR utilizes binary predictors, each genotype was recoded as two binary variables: one minor allele recessive/dominant (predictor N_=2 × 199=398). In all conditions, outcomes were binary with 1000 ‘cases’ and 1000 ‘controls’; only the genetic simulations considered simulations under the alternative hypothesis (H_A). For the purposes of this study, SNPs within the strong LD block (Fig. 1) were considered ‘correlated’ with disease status, even though some weaker correlation exists with these SNPs and other SNPs in the same gene. The default minimum terminal node size (RF=1, CIF=20) was used for case–control simulations. The number of variables selected per iteration for RF/CIF was estimated using the tuneRF function in randomForest (Liaw and Weiner, 2002) and were approximately equal to (N predictors)1/2; 5000 trees were generated per forest. Sub-sampling of 63.2% (approximately that proportion taken under a bootstrap sample with replacement) of the data was performed for RF/CIF (Strobl et al., 2007). Burn-in was set to 10 000 and the number of iterations was set to 1 000 000 for MCLR, with maximum model size of three trees and eight leaves.

In the synthetic (non-genetic) data simulations, correlated multivariate binary variables were simulated using the R package bindata (Leisch et al., 2008) and independent binary variables were randomly generated from bin(0.5). Correlated multivariate standard normal data were generated using the R package mvtnorm and independent continuous variables were randomly generated from a standard normal N(0,1) distribution. For these synthetic (non-genetic) data simulations, we generated 500 replicates of 100 binary (0.5) variables under the null hypothesis, varying: number of correlated variables (40, 10) with remaining variables independent of one another and the correlated set, mean correlation between correlated variables (0.9, 0.5 and 0.1) and minimum terminal node size (1, 10 and 100), thus creating 18 conditions using binary predictors. In addition, we repeated this simulation replacing binary with standard normal N(0,1) variables. To assess the impact of within-predictor correlation simultaneously with the problem of predictors having different scales of measurement, we simulated 500 datasets with 20 binary/correlated (0.9), 30 independent binary, 20 N(0,1)/correlated (0.9) and 30 independent N(0,1). As in the case–control conditions, the number of variables selected per iteration for RF/CIF was estimated using the tuneRF function in randomForest and were approximately (N predictors)1/2 except in prediction error estimation, where tuneRF suggested three variables/iteration was sufficient; 5000 trees were generated per forest. Sub-sampling of 63.2% of the data was performed for RF/CIF (Strobl et al., 2007). Burn-in was set to 10 000 and number of iterations was set to 1 000 000 for MCLR, with maximum model size of three trees and eight leaves.

2.2 MLAs

Some important MLAs are ensemble methods that deploy multiple copies of a base learner, such as a binary classification tree [RF (Breiman, 2001) and CIF (Hothorn et al., 2006)]. Logic trees [used in MCLR (Kooperberg and Ruczinski, 2005)] are structured similarly to RF and CIF classification trees. MLAs used provide a measure of association with outcome via a ranked listing of the most important predictors; these are the VIMs. For more detailed description of the MLAs, see the referenced literature and Supplementary Material 1.

RF/CIF algorithms begin by randomly selecting a subset of predictors plus a subset (62.3%) of observations and recursively partition the data based on the most strongly associated predictor; RF can use the Gini index as the splitting rule whereas CIF uses the Pearson χ2 test _P_-value corrected for multiplicity, which is unbiased when predictors have different numbers of categories (Hothorn et al., 2006). This is repeated to create a ‘forest’ of trees. The remaining observations (the ‘out-of-bag’ sample) were used for calculation of permutation VIMs and prediction error. RF/CIF also allow for bootstrapping or sub-sampling to obtain training sets; here sub-sampling was used (Strobl et al., 2007). RF Gini VIMs are based on the decrease in impurity measured by the Gini index on the training set, averaged across trees; RF/CIF permutation VIMs are based on the difference between the predictive accuracy of the tree using the ‘out-of-bag’ observations and that obtained after permuting predictor labels.

The MCLR algorithm, based on reversible jump MCMC (rjMCMC), is performed on all observations and predictors; starting at an arbitrary point (state S), it proposes a move to state S′—a contiguous logic tree within the set of all possible moves (e.g. adding a split, changing the Boolean operator, deleting a split) and accepts the move from S to S′ with probability min{1, r} where r is the product of the prior ratio, the transition ratio and the likelihood ratio for models S and S′. VIMs for MCLR were defined as the number of times each predictor was included in the model.

All MLA computations were performed using R packages randomForest, party and LogicReg. Additional information on each algorithm is available in the Supplementary Note online.

2.3 Data analysis

Differences between VIM distributions of correlated/uncorrelated predictors were tested using the Wilcoxon rank-sum test; VIM medians were calculated using the Hodges–Lehmann estimator (Hodges and Lehmann, 1963) and distribution-free confidence intervals were calculated using the method of Moses et al. (1993). In the case–control study, since each SNP was coded into two variables, VIM distributions from the recessive-coded predictors (the generating model for associated conditions was recessive) in the associated LD block were compared with all predictors not contained in the LD block. Calculation of predictive error rates for RF and CIF was performed using out-of-bag observations on the top 25 most important variables per replicate by VIM. Predictive ability was not obtained for MCLR because the algorithm used all observations during the run of the rjMCMC (thus, no ‘out-of-bag’ sample is created) and because it is not possible to perform prediction using the Monte Carlo option of logic regression, although prediction for single logic regression models or when using a greedy search algorithm was possible, they were not used in the present study.

3 RESULTS

RF Gini-based VIMs for the 199 SNPs were highly variable under _H_0 (Fig. 2) for uncorrelated SNPs. Permutation VIMs (under _H_0) were expected (and observed) to be centered about zero. However, the empirical distributions were not uniform across predictors presumably because of LD. Moreover, correlated variables were observed to be selected into trees/models less frequently than uncorrelated (Fig. 2; discussed below). The values of Gini VIMs for variables within the LD block were significantly lower than for all other predictors in 100% of the 500 replicates (Table 1, Fig. 2).

Fig. 2.

Distribution of MLA variable importance measures under _H_0 for all five genes with varying LD. Figures showing box plots of VIMs from 398 variables in five genes in the genetic simulation study, under _H_0. (A) RF Gini VIMs; (B) RF permutation-based VIMs; (C) MCLR count VIMs; (D) CIF permutation-based VIMs. These represent the percent of models during the length of the Markov chain that included each variable. Vertical lines separate genes, which are plotted in the following order: NDEL1, NDE1, FEZ1, CITRON and LIS1. Black horizontal lines at VIM=0 for RF Gini and MCLR (note these are both below the distribution of most VIMs); white vertical lines at VIM=0 for RF permutation-based and CIF VIMs.

Table 1.

Genetic association illustration: results and descriptive statistics

Condition	Significant replicatesa (%)	Medianb	95% CIc	Lower CI range	Upper CI range	Median length CI
Null: RF Gini	100	−0.69	−0.89,−0.52	−1.23,−0.88	−0.62,−0.29	0.63
OR 1.75: RF Gini	100	−0.71	−0.90,−0.37	−1.28,−0.09	−0.62,−0.090	0.60
OR 5.00: RF Gini	100	1.32	0.63, 2.40	0.16, 1.91	0.88, 2.75	0.73
Null: RF Permutation	10.6	0.01	−0.13, 0.16	−0.16, 0.13	−0.10, 0.19	0.070
OR 1.75: RF Permutation	70.6	0.13	0.09, 0.35	−0.13, 0.32	−0.040, 0.39	0.080
OR 5.00: RF Permutation	100	0.39	0.25, 0.54	0.19, 0.47	0.29, 0.59	0.11
Null: CIF Permutation	4.2	−4.46E−06	−4.62E−05, 1.95E−04	−6.91E−05, 1.45E−04	2.25E−05, 2.40E−04	4.69E−05
OR 1.75: CIF Permutation	100	6.40E−04	4.04E−05, 1.50E−03	7.62E−09, 1.35E−03	8.11E−05, 1.75E−03	1.60E−04
OR 5.00: CIF Permutation	100	3.50E−03	2.40E−03, 4.90E−03	2.20E−03, 4.20E−03	2.70E−03, 5.20E−03	6.70E−04
Null: MCLR Count	41	−447	−3359, 19 205	−4581, 17 504	−2568, 20 452	1675
OR 1.75: MCLR Count	100	29 553	1799, 44 407	212, 41 636	2712, 47 144	3325
OR 5.00: MCLR Count	100	43 175	2981, 68 222	2311, 63 889	3556, 71 872	3973

Condition	Significant replicatesa (%)	Medianb	95% CIc	Lower CI range	Upper CI range	Median length CI
Null: RF Gini	100	−0.69	−0.89,−0.52	−1.23,−0.88	−0.62,−0.29	0.63
OR 1.75: RF Gini	100	−0.71	−0.90,−0.37	−1.28,−0.09	−0.62,−0.090	0.60
OR 5.00: RF Gini	100	1.32	0.63, 2.40	0.16, 1.91	0.88, 2.75	0.73
Null: RF Permutation	10.6	0.01	−0.13, 0.16	−0.16, 0.13	−0.10, 0.19	0.070
OR 1.75: RF Permutation	70.6	0.13	0.09, 0.35	−0.13, 0.32	−0.040, 0.39	0.080
OR 5.00: RF Permutation	100	0.39	0.25, 0.54	0.19, 0.47	0.29, 0.59	0.11
Null: CIF Permutation	4.2	−4.46E−06	−4.62E−05, 1.95E−04	−6.91E−05, 1.45E−04	2.25E−05, 2.40E−04	4.69E−05
OR 1.75: CIF Permutation	100	6.40E−04	4.04E−05, 1.50E−03	7.62E−09, 1.35E−03	8.11E−05, 1.75E−03	1.60E−04
OR 5.00: CIF Permutation	100	3.50E−03	2.40E−03, 4.90E−03	2.20E−03, 4.20E−03	2.70E−03, 5.20E−03	6.70E−04
Null: MCLR Count	41	−447	−3359, 19 205	−4581, 17 504	−2568, 20 452	1675
OR 1.75: MCLR Count	100	29 553	1799, 44 407	212, 41 636	2712, 47 144	3325
OR 5.00: MCLR Count	100	43 175	2981, 68 222	2311, 63 889	3556, 71 872	3973

aPercent of replicates with a significant difference determined by Wilcoxon rank-sum test between VIM distributions for SNPs in the associated LD block versus all other SNPs.

bMedian of Hodges–Lehmann median differences.

cMedian of the 95% Moses confidence interval for median differences. For MCLR, values shown are counts, whereas proportions are shown in MCLR figures.

Table 1.

Genetic association illustration: results and descriptive statistics

Condition	Significant replicatesa (%)	Medianb	95% CIc	Lower CI range	Upper CI range	Median length CI
Null: RF Gini	100	−0.69	−0.89,−0.52	−1.23,−0.88	−0.62,−0.29	0.63
OR 1.75: RF Gini	100	−0.71	−0.90,−0.37	−1.28,−0.09	−0.62,−0.090	0.60
OR 5.00: RF Gini	100	1.32	0.63, 2.40	0.16, 1.91	0.88, 2.75	0.73
Null: RF Permutation	10.6	0.01	−0.13, 0.16	−0.16, 0.13	−0.10, 0.19	0.070
OR 1.75: RF Permutation	70.6	0.13	0.09, 0.35	−0.13, 0.32	−0.040, 0.39	0.080
OR 5.00: RF Permutation	100	0.39	0.25, 0.54	0.19, 0.47	0.29, 0.59	0.11
Null: CIF Permutation	4.2	−4.46E−06	−4.62E−05, 1.95E−04	−6.91E−05, 1.45E−04	2.25E−05, 2.40E−04	4.69E−05
OR 1.75: CIF Permutation	100	6.40E−04	4.04E−05, 1.50E−03	7.62E−09, 1.35E−03	8.11E−05, 1.75E−03	1.60E−04
OR 5.00: CIF Permutation	100	3.50E−03	2.40E−03, 4.90E−03	2.20E−03, 4.20E−03	2.70E−03, 5.20E−03	6.70E−04
Null: MCLR Count	41	−447	−3359, 19 205	−4581, 17 504	−2568, 20 452	1675
OR 1.75: MCLR Count	100	29 553	1799, 44 407	212, 41 636	2712, 47 144	3325
OR 5.00: MCLR Count	100	43 175	2981, 68 222	2311, 63 889	3556, 71 872	3973

Condition	Significant replicatesa (%)	Medianb	95% CIc	Lower CI range	Upper CI range	Median length CI
Null: RF Gini	100	−0.69	−0.89,−0.52	−1.23,−0.88	−0.62,−0.29	0.63
OR 1.75: RF Gini	100	−0.71	−0.90,−0.37	−1.28,−0.09	−0.62,−0.090	0.60
OR 5.00: RF Gini	100	1.32	0.63, 2.40	0.16, 1.91	0.88, 2.75	0.73
Null: RF Permutation	10.6	0.01	−0.13, 0.16	−0.16, 0.13	−0.10, 0.19	0.070
OR 1.75: RF Permutation	70.6	0.13	0.09, 0.35	−0.13, 0.32	−0.040, 0.39	0.080
OR 5.00: RF Permutation	100	0.39	0.25, 0.54	0.19, 0.47	0.29, 0.59	0.11
Null: CIF Permutation	4.2	−4.46E−06	−4.62E−05, 1.95E−04	−6.91E−05, 1.45E−04	2.25E−05, 2.40E−04	4.69E−05
OR 1.75: CIF Permutation	100	6.40E−04	4.04E−05, 1.50E−03	7.62E−09, 1.35E−03	8.11E−05, 1.75E−03	1.60E−04
OR 5.00: CIF Permutation	100	3.50E−03	2.40E−03, 4.90E−03	2.20E−03, 4.20E−03	2.70E−03, 5.20E−03	6.70E−04
Null: MCLR Count	41	−447	−3359, 19 205	−4581, 17 504	−2568, 20 452	1675
OR 1.75: MCLR Count	100	29 553	1799, 44 407	212, 41 636	2712, 47 144	3325
OR 5.00: MCLR Count	100	43 175	2981, 68 222	2311, 63 889	3556, 71 872	3973

aPercent of replicates with a significant difference determined by Wilcoxon rank-sum test between VIM distributions for SNPs in the associated LD block versus all other SNPs.

bMedian of Hodges–Lehmann median differences.

cMedian of the 95% Moses confidence interval for median differences. For MCLR, values shown are counts, whereas proportions are shown in MCLR figures.

Corresponding VIM median differences, across the 500 replicates, between correlated predictors and uncorrelated predictors for RF permutation and CIF permutation were 10.2% and 4.2%, respectively, clearly showing that the use of the Gini index as the splitting rule in RF induced an increase in spurious differences between correlated and uncorrelated predictors even for the permutation-based VIM. MCLR also showed an excess difference between VIM distributions for correlated and uncorrelated predictors possibly related to the observed relatively less stable behavior of these VIMs. At OR=1.75 (Table 1; Fig. 3), RF Gini VIMs were significantly smaller for the associated ‘causal’ versus unassociated ‘non-causal’ variables in all replicates. This pattern was reversed only at very large effect sizes such as OR ≥5.0. This pattern was reversed only at very large effect sizes such as OR ≥5.0. The RF permutation VIM was able to detect association with outcome for predictors within the associated LD block in 70.6% of the 500 replicates when the OR was set to 1.75 (Table 1; Supplementary Material 4–7). CIF and MCLR detected significant differences in predictor distributions between SNPs in the LD block that were correlated with the causal SNP and uncorrelated SNPs in 100% of replicates at effect sizes of ≥1.75.

Box plots of distribution of RF Gini VIMs under HA (OR=1.75) for all SNPs in the associated gene (CITRON). Black vertical line at location of causal SNP.

Fig. 3.

Box plots of distribution of RF Gini VIMs under HA (OR=1.75) for all SNPs in the associated gene (CITRON). Black vertical line at location of causal SNP.

To further characterize the relative bias of RF VIMs, especially the Gini-based VIM, we simulated 500 replicates of 100 binary variables (40 correlated, 60 independent) under _H_0 and varied both the strength of correlation between the correlated variables and the smallest allowed terminal node size in the constructed trees. The largest difference between correlated and uncorrelated RF Gini VIMs occurred in the strongest correlation condition (0.9) using a terminal node size of 1 (Fig. 4; Supplementary Material 4–7): the median RF Gini-based VIM value for correlated predictors was 2.0, whereas for uncorrelated predictors it was 9.14 (Fig. 4).

Fig. 4.

Distribution of RF Gini-based VIMs varying correlation strength and terminal node size. Box plots showing RF Gini VIM distributions under _H_0 for 40 correlated (plotted on the left hand side) predictors and followed by plots for the 60 uncorrelated simulated predictors. Top row: correlation between correlated predictors=0.9; bottom row: correlation between correlated predictors=0.1. Left column: terminal node size=1, right column: terminal node size=100.

In this same correlation condition, a commonly used terminal node size of 10 resulted in a median Gini VIM value of 0.98 for correlated and 2.28 for uncorrelated predictors (data not shown). Using a minimal terminal node size of 100 reduced Gini VIM values to nearly expected values under _H_0: correlated median=0.12, uncorrelated median=0.20. This trend was observed across all conditions, and is a central result of this study: as correlation strength decreased, the inflation of correlated Gini VIMs increased while decreasing inflation of the uncorrelated set (Fig. 4). Moreover, values of the VIMs for terminal node size of 1 or 10 were always larger (more biased) than using a terminal node size of 100, which consistently showed interquartile ranges <0.5. In addition, we simulated another 500 sets under _H_0 with a smaller fraction of correlated variables (10 correlated, 90 independent). The pattern of results for these additional simulations was similar although quantitatively less striking than for the replicates with a larger fraction of correlated variables (Supplementary Material 4–7). The same pattern of inflated Gini VIMs was observed using standard normal N(0,1) random variables acting as continuous predictors (Supplementary Material 3).

Permutation VIMs were observed to be unbiased under correlation, although in the most extreme condition (40 predictors correlated at 0.9) VIM distributions for correlated predictors were less variable versus uncorrelated predictors (Supplementary Material 4–7). The difference in distributions was therefore seen empirically, with highly correlated variables having lower selection frequencies than independent ones. CIF and MCLR showed uniform distributions of VIMs across correlated and uncorrelated predictors when correlation was weaker (Supplementary Material 4–7).

Furthermore, RF Gini VIMs were inflated for independent predictors versus correlated ones, regardless of type (binary or continuous), but binary predictor VIM values were smaller than VIMs for continuous predictors due to biased variable selection (Strobl et al., 2007) that lead to potentially inappropriate preferential selection of continuous variables under _H_0 (Fig. 5).

Fig. 5.

Distribution of correlated and independent binary and standard normal VIMs using RF and CIF. (A) Box plots of RF Gini VIMs; (B) box plots of RF permutation-based VIMs; (C) box plots of CIF permutation-based VIMs. Predictor VIMs are plotted in the following order from the left hand side and separated by vertical lines: (1) binary, correlated (0.9); (2) binary, independent; (3) N(0,1) correlated (0.9); (4) N(0,1), independent. A terminal node size of 10 was held constant across methods.

Permutation VIMs from RF/CIF were observed to be unbiased although distributions of permutation-based RF VIMs were less uniform than CIF (Fig. 5). MCLR was excluded since it does not accept continuous predictors, and while converting a continuous predictor to a binary outcome could be considered, such coarsening is well-known to introduce bias.

The inflation of RF Gini VIMs and differing distributions for RF permutation VIMs when terminal node size is small was observed in conjunction with biased variable selection frequencies under correlation. That is, when the terminal node size is small, RF over-trained to accommodate chance fluctuations in the data (as would any partitioning MLA scheme), but also selected correlated variables less frequently than uncorrelated. This can be shown by examining the correlation (Pearson r) between Gini VIM values and the frequency of variable selection. As expected, the correlation was very strong (>0.98 across all conditions) because the Gini index was used as the node-splitting criterion and because node-splitting and Gini-based VIMs were both based on the training data. However, the correlation between RF permutation VIMs and selection frequency varied by terminal node size, strength and number of correlated predictors. For example, the median correlation between permutation VIMs and selection frequencies in the 40 predictors correlated at the highest level of correlation (0.9) set with a terminal node size of 1 was 0.012 (range: from −0.37 to 0.40); increasing the terminal node size to 100 in this condition gave a median correlation of 0.69 (0.34–0.84). Interestingly, the commonly used minimal terminal node size of 10 only marginally increased correspondence between permutation VIMs and selection frequencies (median r: 0.13, range: from −0.29 to 0.54) in this condition. In contrast, in the 10 correlated predictors at the lowest level of correlation (0.10) condition, the median correlation between selection frequencies and permutation VIM values was 0.044 (range: from −0.23 to 0.35) with terminal node size of 1. A terminal node size of 10 greatly increased the correlation (median r: 0.77, range: from 0.50 to 0.89), and increasing the terminal node size to 100 increased the correlation further (median: 0.89, range: from 0.68 to 0.96). To summarize: since RF variable selection is based on the Gini index, the correlation between variable selection and the VIM based on the Gini index is always nearly perfect. However, because the selection based on the Gini index is biased when there is correlation between predictors, there is very low correlation between the permutation-based VIM and selection frequency when there is correlation, precisely because the Gini index is performing biased variable selection. However, when there is a low correlation between predictors and especially when the terminal node size is larger than 1, there will be a much stronger correlation between the variable selection frequency and the permutation-based VIM exactly because the Gini index is not performing biased variable selection in this case, as we show.

Correlation also impacted the predictive ability of RF under H_A_. Median error rates under _H_0 were equivalent to unbiased coin-tossing (RF Gini=50.2, RF permutation=47.5 and CIF=49.6), as expected. Using the top 25 ranked Gini VIMs as predictors did not show reductions in error rates for any condition with associated or ‘causal’ predictors (range: from 49.5 to 50.0) except OR=5.0 (36.3). The RF permutation and CIF median error rates decreased as the OR increased, ranging from 47.0 in the OR=1.25 condition to 42.0 in the OR=2.25 condition; median error rate was 33.2 in the OR=5.0 condition for both.

Thus, variables selected using permutation-based VIMs showed increased predictive ability versus those selected using RF Gini-based VIMs and may be a better choice for selection of predictors for follow-up, such as after a genome-wide association study. MCLR does not allow prediction from a model using the rjMCMC algorithm, as it uses all observations and does not produce an ‘out-of-bag’ error estimate, so was excluded.

4 CONCLUSIONS

The use of MLAs with permutation-based VIMs is demonstrably powerful method for high-throughput, large datasets. However, we showed that RF Gini VIMs were biased in the presence of correlation—these VIMs too often identified non-predictive features that were independent from each other, versus those features that were correlated with one another, even when the correlated feature set contained the ‘causal’ predictor. Moreover, the distributions of RF permutation VIMs were sensitive to correlation structure even when they appeared to be unbiased. CIF and MCLR VIMs were observed to be both unbiased and less influenced by correlation. Thus, these algorithms may be preferred to RF using the Gini index as the splitting rule under the restricted goal of identifying variable importances, apart from the goal of low error prediction.

Two recent studies have examined the effect of LD on RF. One study tested a revised RF that only considered SNPs that were in linkage equilibrium within each tree-building stage combined with a permutation-based VIM that accounted for the fact that this strategy would produce different selection probabilities in sub-sampling predictors versus random sub-sampling (Meng et al., 2009). They found their method inflated their VIMs as the number of SNPs in LD with the causal variant increased; they suggested using their VIM combined with the original RF sub-sampling procedure for most stable performance. The second assessed a conditional VIM related to continuous outcomes (Strobl et al., 2008).

The central concern is that a block of predictors having strong within-predictor correlation (such as high LD) may contain a single or small set of truly causal features, in which case we would prefer the MLA to identify this signal more frequently than a non-causal signal and to not be influenced by the within-predictor correlation structure. Therefore, in applications of tree-based MLAs to -omics data with correlated predictors, we recommend: (i) grow a large number of trees (e.g. 5000 or more), (ii) if RF/CIF is used, ensure that the terminal node size is large enough (Biau et al., 2008) to avoid over-training and for reducing the effect of predictor–predictor correlation for RF (Gini or permutation) VIMs; and (iii) under MCLR, use a reasonable burn-in (e.g. 10 000) and a large number of iterations (e.g. 1 000 000). In addition, because variable importance values obtained using permutation VIMs may be variable (Nicodemus et al., 2007), we (iv) suggest running algorithms repeatedly to obtain a measure of central tendency and variability for VIMs instead of the common practice of simply reporting the VIM from a single run of any algorithm. This can be understood as a smoothing process that increases the stability of the VIM estimates without bias, and also better estimates the sampling variability of the VIM estimates. In conclusion, we show bias in RF Gini-based VIMs when predictors are correlated and that RF permutation VIMs are impacted by the underlying correlation structure, although they appear to be unbiased. However, CIF and MCLR VIMs are observed to be unbiased and not as strongly impacted by within-predictor correlation, and thus may be preferred when predictor variables show correlation.

ACKNOWLEDGEMENTS

We thank Carolin Strobl, PhD, MS, for insightful discussions. This study utilized the high performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, MD, USA (http://biowulf.nih.gov).

Conflict of Interest: none declared.

REFERENCES

et al.

Consistency of random forests and other averaging classifiers

J. Mach. Learn. Res.

2008

, vol.

(pg.

2015

2033

)

Random forests

Mach. Learn.

2001

, vol.

(pg.

)

et al.

Mapping complex traits using Random Forests

BMC Genet.

2003

, vol.

Suppl. 1

pg.

S64

et al.

Identifying SNPs predictive of phenotype using random forests

Genet. Epidemiol.

2005

, vol.

(pg.

171

182

)

BagBoosting for tumor classification with gene expression data

Bioinformatics

2004

, vol.

(pg.

3583

3593

)

et al. ,

A Probabilistic Theory of Pattern Recognition

1996

New York

Springer-Verlag

Gene selection and classification of microarray data using random forest

BMC Bioinformatics

2006

, vol.

pg.

et al.

Repetitive sequence environment distinguishes housekeeping genes

Gene

2007

, vol.

390

(pg.

153

165

)

et al.

Predicting interpretability of metabolome models based on behavior, putative identity, and biological relevance of explanatory signals

Proc. Natl Acad. Sci. USA

2006

, vol.

103

(pg.

14865

14870

)

Estimates of location based on rank tests

Ann. Math. Stat.

1963

, vol.

(pg.

598

611

)

et al.

Unbiased recursive partitioning: a conditional inference framework

J. Comput. Graph. Stat.

2006

, vol.

(pg.

651

674

)

et al.

MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features

Nucleic Acids Res.

2007

, vol.

(pg.

W339

W344

)

Identifying interacting SNPs using Monte Carlo logic regression

Genet. Epidemiol.

2005

, vol.

(pg.

157

170

)

et al.

On the generation of correlated artificial binary data. Adaptive information systems and modeling in economics and management science

Working paper series

2008

Vienna University of Economics

et al.

A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression

Bioinformatics

2004

, vol.

(pg.

2429

2437

)

Classification and regression by randomForest

R News

2002

, vol.

(pg.

)

et al.

Screening large-scale association study data: exploiting interactions using random forests

BMC Genet.

2004

, vol.

pg.

et al.

Performance of random forest when SNPs are in linkage disequilibrium

BMC Bioinformatics

2009

, vol.

pg.

et al.

Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations

Stat. Med.

1993

, vol.

(pg.

1293

1316

)

et al.

Stability of variable importance scores and rankings using statistical learning tools on single nucleotide polymorphisms (SNPs) and risk factors involved in gene-gene and gene-environment interaction

BMC Proceedings

2007

, vol.

Suppl. 1

pg.

S58

et al.

Pathway analysis using random forests classification and regression

Bioinformatics

2006

, vol.

(pg.

2028

2036

)

R Development Core Team

R: A Language and Environment for Statistical Computing

R Foundation for statistical computing

2007

Vienna, Austria

et al.

A pilot study on the application of statistical classification procedures to molecular epidemiological data

Toxicol. Lett.

2004

, vol.

151

(pg.

291

299

)

et al.

A new statistical method for haplotype reconstruction from population data

Am. J. Hum. Genet.

2001

, vol.

(pg.

978

989

)

A comparison of Bayesian methods for haplotype reconstruction from population genotype data

Am. J. Hum. Genet.

2003

, vol.

(pg.

1162

1169

)

et al.

Bias in random forest variable importance measures: illustrations, sources and a solution

BMC Bioinformatics

2007

, vol.

pg.

et al.

Conditional variable importance for random forests

BMC Bioinformatics

2008

, vol.

pg.

307

et al.

Exploiting interactions among polymorphisms contributing to complex disease traits with boosted generative modeling

J. Comput. Biol.

2006

, vol.

(pg.

1673

1684

)

Author notes

Associate Editor: Martin Bishop

Citations

Views

Altmetric

Metrics

Total Views 5,503

4,307 Pageviews

1,196 PDF Downloads

Since 11/1/2016

Month:	Total Views:
November 2016	6
December 2016	2
January 2017	8
February 2017	16
March 2017	16
April 2017	12
May 2017	48
June 2017	12
July 2017	21
August 2017	16
September 2017	27
October 2017	22
November 2017	33
December 2017	63
January 2018	45
February 2018	34
March 2018	66
April 2018	103
May 2018	57
June 2018	59
July 2018	61
August 2018	97
September 2018	85
October 2018	84
November 2018	49
December 2018	36
January 2019	35
February 2019	41
March 2019	73
April 2019	46
May 2019	54
June 2019	73
July 2019	73
August 2019	48
September 2019	50
October 2019	57
November 2019	52
December 2019	39
January 2020	51
February 2020	70
March 2020	42
April 2020	50
May 2020	53
June 2020	91
July 2020	41
August 2020	54
September 2020	63
October 2020	62
November 2020	45
December 2020	56
January 2021	40
February 2021	90
March 2021	84
April 2021	65
May 2021	73
June 2021	48
July 2021	76
August 2021	89
September 2021	64
October 2021	62
November 2021	55
December 2021	66
January 2022	47
February 2022	73
March 2022	72
April 2022	76
May 2022	64
June 2022	69
July 2022	59
August 2022	51
September 2022	43
October 2022	72
November 2022	91
December 2022	61
January 2023	65
February 2023	113
March 2023	74
April 2023	46
May 2023	75
June 2023	45
July 2023	62
August 2023	75
September 2023	69
October 2023	44
November 2023	62
December 2023	51
January 2024	63
February 2024	73
March 2024	72
April 2024	75
May 2024	94
June 2024	96
July 2024	71
August 2024	93
September 2024	61
October 2024	37

Citations

120 Web of Science

Predictor correlation impacts machine learning algorithms: implications for genomic studies (original) (raw)

Abstract

1 INTRODUCTION

2 METHODS

2.1 Study design and data simulation

2.2 MLAs

2.3 Data analysis

3 RESULTS

4 CONCLUSIONS

ACKNOWLEDGEMENTS

REFERENCES

Author notes

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Cited

Predictor correlation impacts machine learning algorithms: implications for genomic studies (original) (raw)

Abstract

1 INTRODUCTION

2 METHODS

2.1 Study design and data simulation

2.2 MLAs

2.3 Data analysis

3 RESULTS

4 CONCLUSIONS

ACKNOWLEDGEMENTS

REFERENCES

Author notes

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited