On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data
Daniel F Schwarz et al. Bioinformatics. 2010.
Erratum in
- Bioinformatics. 2011 Feb 1;27(3):439
Abstract
Motivation: Genome-wide association (GWA) studies have proven to be a successful approach to help unravel the genetic basis of complex genetic diseases. However, the identified associations are not well suited for disease prediction, and only a modest portion of the heritability can be explained for most diseases, such as Type 2 diabetes or Crohn's disease. This may partly be due to the low power of standard statistical approaches to detect gene-gene and gene-environment interactions when only small marginal effects are present. Random Forests are a promising alternative and have already been applied successfully in candidate gene analyses, where important single nucleotide polymorphisms are detected by permutation importance measures. Until now, however, applying Random Forests to GWA data with existing implementations has been highly cumbersome because of the high computational burden.
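To make the idea of a permutation importance measure concrete, the following minimal sketch shuffles one SNP column at a time and records the resulting drop in prediction accuracy. It is a simplified, model-agnostic illustration, not the RJ implementation: Random Forests typically compute this measure per tree on out-of-bag samples, whereas here a hand-written rule (`toy_predict`) stands in for a fitted forest. The data, the function names and the parameters `n_repeats` and `seed` are all assumptions made for this example.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each column in turn.

    Simplified sketch: a Random Forest would usually evaluate this on the
    out-of-bag samples of every tree instead of on one fixed dataset.
    """
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between SNP j and the outcome
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical toy data: 200 samples, 4 SNPs coded as 0/1/2 minor-allele counts.
data_rng = random.Random(1)
X = [[data_rng.randint(0, 2) for _ in range(4)] for _ in range(200)]
y = [1 if row[0] >= 1 else 0 for row in X]  # status depends on SNP 0 only


def toy_predict(row):
    """Stand-in for a fitted forest: uses SNP 0 and ignores the rest."""
    return 1 if row[0] >= 1 else 0


scores = permutation_importance(toy_predict, X, y)
for j, score in enumerate(scores):
    print(f"SNP {j}: importance {score:.3f}")
```

In this toy setting only SNP 0 receives a positive score, because shuffling a column the predictor never uses cannot change its predictions.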
Results: We present Random Jungle (RJ), a new, freely available software package that facilitates the rapid analysis of GWA data. The program yields valid results and computes up to 159 times faster than the fastest alternative implementation, while retaining all options offered by other programs. Specifically, it offers the available permutation importance measures and adds new options such as backward elimination. We illustrate the application of RJ to a GWA study of Crohn's disease. The most important single nucleotide polymorphisms (SNPs) validate recent findings in the literature and reveal potential interactions.
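The abstract does not spell out how backward elimination proceeds. One common form, sketched below under that caveat, repeatedly ranks the remaining variables by importance and discards the least important fraction until a target number is left. The function `backward_elimination`, its parameters `drop_fraction` and `min_features`, and the fixed `TOY_SCORES` table are hypothetical; in a real analysis `importance_fn` would refit a forest on the remaining SNPs and return fresh importance scores at each round.

```python
def backward_elimination(features, importance_fn, drop_fraction=0.5, min_features=2):
    """Iteratively discard the least important features.

    importance_fn(features) must return a dict mapping each feature to a
    score. It stands in for refitting a forest on the remaining features.
    """
    current = list(features)
    history = [list(current)]
    while len(current) > min_features:
        scores = importance_fn(current)
        ranked = sorted(current, key=lambda f: scores[f], reverse=True)
        n_drop = max(1, int(len(ranked) * drop_fraction))  # always make progress
        n_keep = max(min_features, len(ranked) - n_drop)
        current = ranked[:n_keep]
        history.append(list(current))
    return current, history


# Hypothetical importance scores, used here in place of a refitted forest.
TOY_SCORES = {
    "snp1": 0.30, "snp2": 0.02, "snp3": 0.15, "snp4": 0.00,
    "snp5": 0.08, "snp6": 0.01, "snp7": 0.05, "snp8": 0.03,
}


def toy_importance(features):
    """Return the fixed toy score for each remaining feature."""
    return {f: TOY_SCORES[f] for f in features}


selected, history = backward_elimination(list(TOY_SCORES), toy_importance)
print(selected)                   # ['snp1', 'snp3']
print([len(h) for h in history])  # [8, 4, 2]
```

Halving the variable set at each round keeps the number of refits logarithmic in the number of SNPs, which matters when each refit is expensive.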
Availability: The RJ software package is freely available at http://www.randomjungle.org
Figures
Fig. 1.
Comparison of computing time and memory usage of several Random Forests (RF) implementations. Each program analyzed a simulated dataset comprising 1006 samples genotyped at 275 153 SNPs. A shorter bar indicates a faster RF implementation relative to the other programs. (a) Computing time of five implementations. For example, the RJ calculations took 0.53 h to analyze the data. (b) Memory usage of five programs. Willows and RJ used memory sparingly compared with randomForest and RF in Fortran. (An asterisk indicates memory usage per computer node.)
Fig. 2.
CVI scores of SNPs and their chromosomal positions. The CVI axis is log-transformed. Small and negative CVI values were omitted.
Fig. 3.
Six of the 10 most important genes can be combined into a potential pathway. TNFSF10 potentially interacts with IL23R, NOD2 and PRKG1. NOD2 conceivably interacts with CDKAL1.
Similar articles
- SNPHarvester: a filtering-based approach for detecting epistatic interactions in genome-wide association studies.
  Yang C, He Z, Wan X, Yang Q, Xue H, Yu W. Bioinformatics. 2009 Feb 15;25(4):504-11. doi: 10.1093/bioinformatics/btn652. Epub 2008 Dec 19. PMID: 19098029
- Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests.
  Nguyen TT, Huang J, Wu Q, Nguyen T, Li M. BMC Genomics. 2015;16 Suppl 2(Suppl 2):S5. doi: 10.1186/1471-2164-16-S2-S5. Epub 2015 Jan 21. PMID: 25708662 Free PMC article.
- Maximal conditional chi-square importance in random forests.
  Wang M, Chen X, Zhang H. Bioinformatics. 2010 Mar 15;26(6):831-7. doi: 10.1093/bioinformatics/btq038. Epub 2010 Feb 3. PMID: 20130032 Free PMC article.
- Classification of genetic profiles of Crohn's disease: a focus on the ATG16L1 gene.
  Grant SF, Baldassano RN, Hakonarson H. Expert Rev Mol Diagn. 2008 Mar;8(2):199-207. doi: 10.1586/14737159.8.2.199. PMID: 18366306 Review.
- Genome simulation approaches for synthesizing in silico datasets for human genomics.
  Ritchie MD, Bush WS. Adv Genet. 2010;72:1-24. doi: 10.1016/B978-0-12-380862-2.00001-1. PMID: 21029846 Review.
Cited by
- Improved branch and bound algorithm for detecting SNP-SNP interactions in breast cancer.
  Chuang LY, Chang HW, Lin MC, Yang CH. J Clin Bioinforma. 2013 Feb 14;3(1):4. doi: 10.1186/2043-9113-3-4. PMID: 23410245 Free PMC article.
- Brief review of regression-based and machine learning methods in genetic epidemiology: the Genetic Analysis Workshop 17 experience.
  Dasgupta A, Sun YV, König IR, Bailey-Wilson JE, Malley JD. Genet Epidemiol. 2011;35 Suppl 1(Suppl 1):S5-11. doi: 10.1002/gepi.20642. PMID: 22128059 Free PMC article. Review.
- A gene-based information gain method for detecting gene-gene interactions in case-control studies.
  Li J, Huang D, Guo M, Liu X, Wang C, Teng Z, Zhang R, Jiang Y, Lv H, Wang L. Eur J Hum Genet. 2015 Nov;23(11):1566-72. doi: 10.1038/ejhg.2015.16. Epub 2015 Mar 11. PMID: 25758991 Free PMC article.
- Development and multi-site external validation of a generalizable risk prediction model for bipolar disorder.
  Walsh CG, Ripperger MA, Hu Y, Sheu YH, Lee H, Wilimitis D, Zheutlin AB, Rocha D, Choi KW, Castro VM, Kirchner HL, Chabris CF, Davis LK, Smoller JW. Transl Psychiatry. 2024 Jan 25;14(1):58. doi: 10.1038/s41398-023-02720-y. PMID: 38272862 Free PMC article.
- Genetic Dissection of Epistatic Interactions Contributing Yield-Related Agronomic Traits in Rice Using the Compressed Mixed Model.
  Li L, Wu X, Chen J, Wang S, Wan Y, Ji H, Wen Y, Zhang J. Plants (Basel). 2022 Sep 26;11(19):2504. doi: 10.3390/plants11192504. PMID: 36235370 Free PMC article.
Grants and funding
- R01-GM031575/GM/NIGMS NIH HHS/United States
- R01 AG021917/AG/NIA NIH HHS/United States
- 5R01-HL049609-14/HL/NHLBI NIH HHS/United States
- R01 GM031575/GM/NIGMS NIH HHS/United States
- 1R01-AG021917-01A1/AG/NIA NIH HHS/United States
- R01 HL049609/HL/NHLBI NIH HHS/United States
- R01 GM031575-28/GM/NIGMS NIH HHS/United States