SNP interaction detection with Random Forests in high-dimensional genetic data

Stacey J Winham et al. BMC Bioinformatics. 2012.

Abstract

Background: Identifying variants associated with complex human traits in high-dimensional data is a central goal of genome-wide association studies. However, complicated etiologies such as gene-gene interactions are ignored by the univariate analysis usually applied in these studies. Random Forests (RF) are a popular data-mining technique that can accommodate a large number of predictor variables and allow for complex models with interactions. RF analysis produces measures of variable importance that can be used to rank the predictor variables. Thus, single nucleotide polymorphism (SNP) analysis using RFs is gaining popularity as a potential filter approach that considers interactions in high-dimensional data. However, the impact of data dimensionality on the power of RF to identify interactions has not been thoroughly explored. We investigate the ability of rankings from variable importance measures to detect gene-gene interaction effects and their potential effectiveness as filters compared to p-values from univariate logistic regression, particularly as the data becomes increasingly high-dimensional.

Results: RF effectively identifies interactions in low-dimensional data. As the total number of predictor variables increases, the probability of detection declines more rapidly for interacting SNPs than for non-interacting SNPs, indicating that in high-dimensional data the RF variable importance measures capture marginal effects rather than the effects of interactions.

Conclusions: While RF remains a promising data-mining technique that extends univariate methods to condition on multiple variables simultaneously, RF variable importance measures fail to detect interaction effects in high-dimensional data in the absence of a strong marginal component, and therefore may not be useful as a filter technique that allows for interaction effects in genome-wide data.
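
To make the distinction between a marginal component and an interaction component concrete, the following is a minimal sketch and not the authors' simulation code. It assumes Hardy-Weinberg genotype frequencies, two independent loci, and a hypothetical 3x3 penetrance table (`EPISTATIC`) chosen for illustration; the function names are likewise invented here. It averages the two-locus penetrances over one locus to obtain the marginal penetrances of the other, then compares the variance in penetrance across joint genotypes with the variance across single-locus genotypes.

```python
def genotype_freqs(maf):
    """Hardy-Weinberg frequencies for 0, 1 and 2 copies of the minor allele."""
    p, q = maf, 1.0 - maf
    return [q * q, 2.0 * p * q, p * p]


def marginal_penetrances(table, maf_other):
    """P(disease | genotype at locus 1), averaging over locus 2.

    table[i][j] is the penetrance for genotype i at locus 1 and j at locus 2.
    """
    f = genotype_freqs(maf_other)
    return [sum(row[j] * f[j] for j in range(3)) for row in table]


def penetrance_variances(table, maf1, maf2):
    """Return (prevalence, joint variance, locus-1 marginal variance)."""
    f1, f2 = genotype_freqs(maf1), genotype_freqs(maf2)
    cells = [(i, j) for i in range(3) for j in range(3)]
    prevalence = sum(f1[i] * f2[j] * table[i][j] for i, j in cells)
    joint = sum(f1[i] * f2[j] * (table[i][j] - prevalence) ** 2 for i, j in cells)
    marg = marginal_penetrances(table, maf2)
    marginal = sum(f1[i] * (marg[i] - prevalence) ** 2 for i in range(3))
    return prevalence, joint, marginal


# Hypothetical, illustrative penetrance table (an assumption, not a model
# from the paper): risk depends only on the combination of the two genotypes.
EPISTATIC = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]

prevalence, joint_var, marginal_var = penetrance_variances(EPISTATIC, 0.5, 0.5)
print("marginal penetrances:", marginal_penetrances(EPISTATIC, 0.5))
print("prevalence:", prevalence)
print("joint variance:", joint_var, "marginal variance:", marginal_var)
```

With both minor allele frequencies set to 0.5, every single-locus genotype in this table carries the same disease risk, so a method that ranks SNPs one at a time has no signal to find even though the joint genotypes differ in risk. Transposing the table gives the corresponding quantities for the second locus.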

Figures

Figure 1

Simulation 2 penetrance functions. Penetrance functions for the two-locus interactions in the three models used in Simulation 2, with the corresponding total, marginal, and interaction heritabilities.

Figure 2

Simulation 1 results. Probability of detection for ‘main’, ‘interacting’, and ‘null’ SNPs plotted against the number of total SNPs for select RF VIMs and logistic regression (LR). Top row shows results for the “main effects greater” Model 2; bottom row shows results for “interaction effects greater” Model 4. Results are plotted separately across MAF. Average PE estimates range between 0.430 and 0.476 (Additional file 2, Table B3).

Figure 3

Simulation 2 results. Probability of detection for SNP1 and SNP2 plotted against total number of SNPs by VIM for models with interactions and two main effects (Model 6, left), one main effect (Model 7, center), and no main effects (Model 8, right). Average PE estimates range between 0.465 and 0.508 (Additional file 2, Table B4).
