Identifying SNPs predictive of phenotype using random forests
Alexandre Bureau et al. Genet Epidemiol. 2005 Feb.
Abstract
There has been great interest, and a few successes, in the identification of complex disease susceptibility genes in recent years. Association studies, in which a large number of single-nucleotide polymorphisms (SNPs) are typed in a sample of cases and controls to determine which genes are associated with a specific disease, provide a powerful approach for complex disease gene mapping. Genes of interest in those studies may contain large numbers of SNPs that classical statistical methods cannot handle simultaneously without requiring prohibitively large sample sizes. By contrast, high-dimensional nonparametric methods thrive on large numbers of predictors. This work explores the application of one such method, random forests, to the problem of identifying SNPs predictive of the phenotype in the case-control study design. A random forest is a collection of classification trees grown on bootstrap samples of observations, using a random subset of predictors to define the best split at each node. The observations left out of each bootstrap sample are used to estimate prediction error. The importance of a predictor is quantified by the increase in misclassification that occurs when the values of the predictor are randomly permuted. We extend the concept of importance to pairs of predictors, to capture joint effects, and we explore the behavior of importance measures over a range of two-locus disease models in the presence of a varying number of SNPs unassociated with the phenotype. We illustrate the application of random forests with a data set of asthma cases and unaffected controls genotyped at 42 SNPs in ADAM33, a previously identified asthma susceptibility gene. SNPs and SNP pairs highly associated with asthma tend to have the highest importance index values, but predictive importance and association do not always coincide.
© 2004 Wiley-Liss, Inc.
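The procedure summarized in the abstract (trees grown on bootstrap samples, a random subset of predictors tried at each node, and importance measured as the rise in out-of-bag misclassification after permuting one predictor) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation. The function names (`grow_tree`, `predict`, `snp_importance`), the 0/1/2 genotype coding, the Gini split criterion, the `min_size` stopping rule, and the default of roughly sqrt(p) predictors per node are all assumptions made for the example. The paper's pairwise importance extension is not shown.

```python
import random

def gini(labels):
    # Gini impurity for binary labels (1 = case, 0 = control).
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def grow_tree(X, y, rows, mtry, rng, min_size=5):
    # Returns a leaf (int class label) or a node (snp, threshold, left, right).
    labels = [y[i] for i in rows]
    majority = int(2 * sum(labels) >= len(labels))
    if len(rows) < min_size or len(set(labels)) == 1:
        return majority
    best = None
    # Only a random subset of mtry predictors is considered at this node.
    for j in rng.sample(range(len(X[0])), mtry):
        for t in (0, 1):  # assumed genotype coding 0/1/2; split on genotype <= t
            left = [i for i in rows if X[i][j] <= t]
            right = [i for i in rows if X[i][j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:
        return majority
    _, j, t, left, right = best
    return (j, t,
            grow_tree(X, y, left, mtry, rng, min_size),
            grow_tree(X, y, right, mtry, rng, min_size))

def predict(tree, x):
    # Walk down the tree until a leaf (an int) is reached.
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if x[j] <= t else right
    return tree

def snp_importance(X, y, n_trees=100, mtry=None, seed=0):
    # Mean increase in out-of-bag misclassification rate after permuting each SNP.
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    mtry = min(p, mtry or max(1, int(p ** 0.5)))
    importance = [0.0] * p
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]      # bootstrap sample
        in_bag = set(boot)
        oob = [i for i in range(n) if i not in in_bag]   # left-out observations
        tree = grow_tree(X, y, boot, mtry, rng)
        base_err = sum(predict(tree, X[i]) != y[i] for i in oob)
        for j in range(p):
            perm = [X[i][j] for i in oob]
            rng.shuffle(perm)                            # permute SNP j among OOB rows
            err = 0
            for i, v in zip(oob, perm):
                x = list(X[i])
                x[j] = v
                err += predict(tree, x) != y[i]
            importance[j] += (err - base_err) / max(1, len(oob))
    return [v / n_trees for v in importance]
```

Here `X` is a list of rows, one per subject, each holding genotype codes for every SNP, and `y` holds the case/control labels. A SNP that the trees never split on has importance exactly zero in that tree, and SNPs unrelated to the phenotype should average near zero, which is the contrast the importance measure relies on.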