SNP interaction detection with Random Forests in high-dimensional genetic data
Stacey J Winham et al. BMC Bioinformatics. 2012.
Abstract
Background: Identifying variants associated with complex human traits in high-dimensional data is a central goal of genome-wide association studies. However, complicated etiologies such as gene-gene interactions are ignored by the univariate analysis usually applied in these studies. Random Forests (RF) are a popular data-mining technique that can accommodate a large number of predictor variables and allow for complex models with interactions. RF analysis produces measures of variable importance that can be used to rank the predictor variables. Thus, single nucleotide polymorphism (SNP) analysis using RFs is gaining popularity as a potential filter approach that considers interactions in high-dimensional data. However, the impact of data dimensionality on the power of RF to identify interactions has not been thoroughly explored. We investigate the ability of rankings from variable importance measures to detect gene-gene interaction effects and their potential effectiveness as filters compared to p-values from univariate logistic regression, particularly as the data becomes increasingly high-dimensional.
Results: RF effectively identifies interactions in low-dimensional data. As the total number of predictor variables increases, the probability of detection declines more rapidly for interacting SNPs than for non-interacting SNPs, indicating that in high-dimensional data the RF variable importance measures capture marginal effects rather than interaction effects.
Conclusions: While RF remains a promising data-mining technique that extends univariate methods to condition on multiple variables simultaneously, RF variable importance measures fail to detect interaction effects in high-dimensional data in the absence of a strong marginal component, and therefore may not be useful as a filter technique that allows for interaction effects in genome-wide data.
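To make the phrase "in the absence of a strong marginal component" concrete, the following is a minimal sketch of how a two-locus penetrance table can be split into marginal and interaction parts. It is not the authors' simulation code. It assumes Hardy-Weinberg genotype frequencies, independent loci, and a variance-based heritability on the observed scale (variance of penetrance divided by K(1-K), where K is the prevalence); the paper's exact definitions may differ. The function names and the example penetrance table are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: names, the example table, and the heritability
# definition are assumptions, not taken from the paper.

def genotype_freqs(maf):
    """Hardy-Weinberg frequencies for 0, 1, 2 copies of the minor allele."""
    q = 1.0 - maf
    return [q * q, 2.0 * maf * q, maf * maf]


def decompose(pen, maf1, maf2):
    """Split the penetrance variance of a 3x3 two-locus model into
    marginal and interaction components (loci assumed independent)."""
    f1, f2 = genotype_freqs(maf1), genotype_freqs(maf2)
    cells = [(i, j) for i in range(3) for j in range(3)]

    # Prevalence K: penetrance averaged over all genotype combinations.
    k = sum(f1[i] * f2[j] * pen[i][j] for i, j in cells)

    # Marginal penetrance of each locus, averaging over the other locus.
    m1 = [sum(f2[j] * pen[i][j] for j in range(3)) for i in range(3)]
    m2 = [sum(f1[i] * pen[i][j] for i in range(3)) for j in range(3)]

    v_total = sum(f1[i] * f2[j] * (pen[i][j] - k) ** 2 for i, j in cells)
    v1 = sum(f1[i] * (m1[i] - k) ** 2 for i in range(3))
    v2 = sum(f2[j] * (m2[j] - k) ** 2 for j in range(3))

    denom = k * (1.0 - k)  # variance of a binary trait with prevalence K
    return {
        "prevalence": k,
        "total": v_total / denom,
        "marginal": (v1 + v2) / denom,
        "interaction": (v_total - v1 - v2) / denom,
    }


# Hypothetical purely epistatic model: risk is raised only when exactly
# one of the two loci is heterozygous.
PURE_EPISTASIS = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]

result = decompose(PURE_EPISTASIS, maf1=0.5, maf2=0.5)
print(result)
```

With both minor allele frequencies at 0.5, every marginal penetrance in this table equals the prevalence, so the marginal component is zero and all of the heritability sits in the interaction term. A model of this kind gives a single-SNP test nothing to detect, which is the situation the conclusions describe.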
Figures
Figure 1
Simulation 2 penetrance functions. Penetrance functions for the two locus interactions in the three models used in Simulation 2, with corresponding total, marginal, and interaction heritabilities.
Figure 2
Simulation 1 results. Probability of detection for ‘main’, ‘interacting’, and ‘null’ SNPs plotted against the number of total SNPs for select RF VIMs and logistic regression (LR). Top row shows results for the “main effects greater” Model 2; bottom row shows results for “interaction effects greater” Model 4. Results are plotted separately across MAF. Average PE estimates range between 0.430 and 0.476 (Additional file 2, Table B3).
Figure 3
Simulation 2 results. Probability of detection for SNP1 and SNP2 plotted against total number of SNPs by VIM for models with interactions and two main effects (Model 6 - left), one main effect (Model 7 - center), and no main effects (Model 8 - right). Average PE estimates range between 0.465 and 0.508 (Additional file 2, Table B4).
Similar articles
- Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests.
Nguyen TT, Huang J, Wu Q, Nguyen T, Li M. BMC Genomics. 2015;16 Suppl 2(Suppl 2):S5. doi: 10.1186/1471-2164-16-S2-S5. Epub 2015 Jan 21. PMID: 25708662. Free PMC article.
- Screening large-scale association study data: exploiting interactions using random forests.
Lunetta KL, Hayward LB, Segal J, Van Eerdewegh P. BMC Genet. 2004 Dec 10;5:32. doi: 10.1186/1471-2156-5-32. PMID: 15588316. Free PMC article.
- Do little interactions get lost in dark random forests?
Wright MN, Ziegler A, König IR. BMC Bioinformatics. 2016 Mar 31;17:145. doi: 10.1186/s12859-016-0995-8. PMID: 27029549. Free PMC article.
- A Review on Methods for Detecting SNP Interactions in High-Dimensional Genomic Data.
Uppu S, Krishna A, Gopalan RP. IEEE/ACM Trans Comput Biol Bioinform. 2018 Mar-Apr;15(2):599-612. doi: 10.1109/TCBB.2016.2635125. Epub 2016 Dec 2. PMID: 28060710. Review.
- Multigenic modeling of complex disease by random forests.
Sun YV. Adv Genet. 2010;72:73-99. doi: 10.1016/B978-0-12-380862-2.00004-7. PMID: 21029849. Review.
Cited by
- Investigating factors affecting the interval between a burn and the start of treatment using data mining methods and logistic regression.
Ahmadi-Jouybari T, Najafi-Ghobadi S, Karami-Matin R, Najafian-Ghobadi S, Najafi-Ghobadi K. BMC Med Res Methodol. 2021 Apr 14;21(1):71. doi: 10.1186/s12874-021-01270-5. PMID: 33853547. Free PMC article.
- A comparison of internal model validation methods for multifactor dimensionality reduction in the case of genetic heterogeneity.
Gory JJ, Sweeney HC, Reif DM, Motsinger-Reif AA. BMC Res Notes. 2012 Nov 5;5:623. doi: 10.1186/1756-0500-5-623. PMID: 23126544. Free PMC article.
- SNP-SNP Interaction Analysis on Soybean Oil Content under Multi-Environments.
Chen Q, Mao X, Zhang Z, Zhu R, Yin Z, Leng Y, Yu H, Jia H, Jiang S, Ni Z, Jiang H, Han X, Liu C, Hu Z, Wu X, Hu G, Xin D, Qi Z. PLoS One. 2016 Sep 26;11(9):e0163692. doi: 10.1371/journal.pone.0163692. eCollection 2016. PMID: 27668866. Free PMC article.
- Evaluation of tree-based statistical learning methods for constructing genetic risk scores.
Lau M, Wigmann C, Kress S, Schikowski T, Schwender H. BMC Bioinformatics. 2022 Mar 21;23(1):97. doi: 10.1186/s12859-022-04634-w. PMID: 35313824. Free PMC article.
- Transition-transversion encoding and genetic relationship metric in ReliefF feature selection improves pathway enrichment in GWAS.
Arabnejad M, Dawkins BA, Bush WS, White BC, Harkness AR, McKinney BA. BioData Min. 2018 Nov 3;11:23. doi: 10.1186/s13040-018-0186-4. eCollection 2018. PMID: 30410580. Free PMC article.