Identifying novel constrained elements by exploiting biased substitution patterns
Manuel Garber et al. Bioinformatics. 2009.
Abstract
Motivation: Comparing the genomes from closely related species provides a powerful tool to identify functional elements in a reference genome. Many methods have been developed to identify conserved sequences across species; however, existing methods only model conservation as a decrease in the rate of mutation and have ignored selection acting on the pattern of mutations.
Results: We present a new approach that takes advantage of deeply sequenced clades to identify evolutionary selection by uncovering not only signatures of rate-based conservation but also substitution patterns characteristic of sequence undergoing natural selection. We describe a new statistical method for modeling biased nucleotide substitutions, a learning algorithm for inferring site-specific substitution biases directly from sequence alignments and a hidden Markov model for detecting constrained elements characterized by biased substitutions. We show that the new approach can identify significantly more degenerate constrained sequences than rate-based methods. Applying it to the ENCODE regions, we find that as much as 10.2% of these regions is under selection.
Availability: The algorithms are implemented in a Java software package, called SiPhy, freely available at http://www.broadinstitute.org/science/software/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Figures
Fig. 1.
Estimating the power of SiPhy for evolutionary constraint detection. The dependence of SiPhy's power on three factors (the number of available species M, the IC of each site, and the k-mer size, 1 or 12) is evaluated through a simulation study using a star phylogeny. An element is said to be 50% identifiable by SiPhy if >50% of its instances can be identified by SiPhy with P<0.001, and similarly for 25% and 75% identifiable. (A) The lowest IC of a single site that is 25% (blue), 50% (green) or 75% (red) identifiable is shown as a function of M. (B) The 25th, 50th and 75th percentile P-values are shown as a function of the IC of a single site, with the number of species fixed at M=9 (top) and M=50 (bottom). (C) and (D) show similar plots for 12mers. The plots show the number of bases recovered at P<0.001 for a given IC. The blue line corresponds to 25% of bases recovered, green to 50% of bases recovered and red to 75% of bases recovered.
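The caption uses IC without defining it here. Assuming it denotes per-site information content in the usual sequence-logo sense (the relative entropy of a site's base distribution against a background distribution), a minimal sketch of the calculation follows. The function name and the uniform default background are illustrative assumptions and are not taken from the SiPhy package.

```python
import math

def information_content(pi, background=None):
    """Relative entropy (in bits) of a base distribution `pi` over A, C, G, T
    against `background`. Hypothetical helper, not part of SiPhy."""
    if background is None:
        # Assumption: uniform background; a real analysis would likely
        # use a neutral base composition estimated from the data.
        background = [0.25, 0.25, 0.25, 0.25]
    # Terms with p == 0 contribute nothing (limit of p * log p as p -> 0).
    return sum(p * math.log2(p / q) for p, q in zip(pi, background) if p > 0)

# An unconstrained site, a site tolerating two bases, and a fully fixed site.
print(information_content([0.25, 0.25, 0.25, 0.25]))  # 0.0
print(information_content([0.5, 0.5, 0.0, 0.0]))      # 1.0
print(information_content([1.0, 0.0, 0.0, 0.0]))      # 2.0
```

Under this assumed definition, a site that tolerates two equally likely bases carries 1 bit and a fully fixed site carries 2 bits, which is the range over which the caption reports identifiability.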
Fig. 2.
SiPhy shows greater power for detecting degenerate sites. The figure shows the distributions of LO scores for separating constrained 12mers from neutral ones within two datasets: the 2D and the 4D degenerate sites in the third codon positions. The LO scores are calculated using either a π-based method (A) or a rate-based method (B). The shaded regions represent the portion of the sites that show excess constraint in the 2D dataset if the 4D dataset is used as a control.
Fig. 3.
Estimating constraint in the ENCODE regions. (A) Comparison of constraints in neutral sequences versus genomic (ENCODE) regions. The 12mer LO scores are computed by SiPhy for ancestral repeats (ARs), bootstrapped ancestral repeats (AR boot) and genomic regions (black). Excess constraint in the genomic regions is highlighted by the shaded region in light blue when compared to AR, and in both light blue and dark blue when compared to AR boot. (B) Overlap between SiPhy elements and three other types of elements: coding exons (green), GERP (blue) and PhastCons (red). Each curve shows the percentage of the elements overlapped by SiPhy 12mers for a given LO score cutoff. (C) A Venn diagram of SiPhy, PhastCons and GERP elements in the ENCODE regions.
Fig. 4.
SiPhy-HMM state diagram. A schematic representation of the HMM used to identify SiPhy constrained elements. N(π0) represents the neutral state. The constrained state is represented by a mixture of 10 constraint vectors, of which four are non-degenerate (π1–π4) and six are 2D (π5–π10).
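The caption describes a neutral state and a constrained state whose emission is a mixture over constraint vectors. The sketch below illustrates that general structure with a two-state Viterbi decoder in log space. It is not the SiPhy implementation: the per-site log-likelihoods are assumed to be computed elsewhere, and the function names, the symmetric `p_stay` transition probability and the uniform start distribution are all hypothetical choices made for illustration.

```python
import math

def mixture_loglik(component_logliks, weights):
    """Log-likelihood of one site under a mixture of components
    (log-sum-exp for numerical stability). Hypothetical helper."""
    m = max(component_logliks)
    return m + math.log(sum(w * math.exp(l - m)
                            for w, l in zip(weights, component_logliks)))

def viterbi_two_state(loglik_neutral, loglik_constrained, p_stay=0.9):
    """Most likely state path (0 = neutral, 1 = constrained) given per-site
    log-likelihoods. `p_stay` is an assumed self-transition probability."""
    n = len(loglik_neutral)
    if n == 0:
        return []
    emit = (loglik_neutral, loglik_constrained)
    log_stay, log_switch = math.log(p_stay), math.log(1.0 - p_stay)
    # Assumption: both states are equally likely at the first site.
    score = [math.log(0.5) + emit[s][0] for s in (0, 1)]
    back = []
    for t in range(1, n):
        new_score, ptr = [], []
        for s in (0, 1):
            cands = [score[r] + (log_stay if r == s else log_switch)
                     for r in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1
            new_score.append(cands[best] + emit[s][t])
            ptr.append(best)
        score = new_score
        back.append(ptr)
    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Toy input: three middle sites fit the constrained state much better.
neutral = [-1.0, -1.0, -5.0, -5.0, -5.0, -1.0, -1.0]
constrained = [-5.0, -5.0, -1.0, -1.0, -1.0, -5.0, -5.0]
print(viterbi_two_state(neutral, constrained))  # [0, 0, 1, 1, 1, 0, 0]
```

In a fuller version, each entry of `loglik_constrained` would come from `mixture_loglik` applied to the site's log-likelihoods under the individual constraint vectors, with mixture weights that are themselves parameters to be chosen or learned.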
Similar articles
- Identifying a high fraction of the human genome to be under selective constraint using GERP++.
Davydov EV, Goode DL, Sirota M, Cooper GM, Sidow A, Batzoglou S. PLoS Comput Biol. 2010 Dec 2;6(12):e1001025. doi: 10.1371/journal.pcbi.1001025. PMID: 21152010 Free PMC article.
- Detection of nonneutral substitution rates on mammalian phylogenies.
Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Genome Res. 2010 Jan;20(1):110-21. doi: 10.1101/gr.097857.109. Epub 2009 Oct 26. PMID: 19858363 Free PMC article.
- ESPERR: learning strong and weak signals in genomic sequence alignments to identify functional elements.
Taylor J, Tyekucheva S, King DC, Hardison RC, Miller W, Chiaromonte F. Genome Res. 2006 Dec;16(12):1596-604. doi: 10.1101/gr.4537706. Epub 2006 Oct 19. PMID: 17053093 Free PMC article.
- Mulan: multiple-sequence local alignment and visualization for studying function and evolution.
Ovcharenko I, Loots GG, Giardine BM, Hou M, Ma J, Hardison RC, Stubbs L, Miller W. Genome Res. 2005 Jan;15(1):184-94. doi: 10.1101/gr.3007205. Epub 2004 Dec 8. PMID: 15590941 Free PMC article.
- Genome-wide functional element detection using pairwise statistical alignment outperforms multiple genome footprinting techniques.
Satija R, Hein J, Lunter GA. Bioinformatics. 2010 Sep 1;26(17):2116-20. doi: 10.1093/bioinformatics/btq360. Epub 2010 Jul 7. PMID: 20610610
Cited by
- Enhancing Missense Variant Pathogenicity Prediction with MissenseNet: Integrating Structural Insights and ShuffleNet-Based Deep Learning Techniques.
Liu J, Chen Y, Huang K, Guan X. Biomolecules. 2024 Sep 2;14(9):1105. doi: 10.3390/biom14091105. PMID: 39334871 Free PMC article.
- Variant Impact Predictor database (VIPdb), version 2: trends from three decades of genetic variant impact predictors.
Lin YJ, Menon AS, Hu Z, Brenner SE. Hum Genomics. 2024 Aug 28;18(1):90. doi: 10.1186/s40246-024-00663-z. PMID: 39198917 Free PMC article.
- Variant Impact Predictor database (VIPdb), version 2: Trends from 25 years of genetic variant impact predictors.
Lin YJ, Menon AS, Hu Z, Brenner SE. bioRxiv [Preprint]. 2024 Jun 28:2024.06.25.600283. doi: 10.1101/2024.06.25.600283. PMID: 38979289 Free PMC article.
- Where do obesity and male infertility collide?
Jahangir M, Nazari M, Babakhanzadeh E, Manshadi SD. BMC Med Genomics. 2024 May 10;17(1):128. doi: 10.1186/s12920-024-01897-5. PMID: 38730451 Free PMC article.
- Exome sequencing implicates ancestry-related Mendelian variation at SYNE1 in childhood-onset essential hypertension.
Copeland I, Wonkam-Tingang E, Gupta-Malhotra M, Hashmi SS, Han Y, Jajoo A, Hall NJ, Hernandez PP, Lie N, Liu D, Xu J, Rosenfeld J, Haldipur A, Desire Z, Coban-Akdemir ZH, Scott DA, Li Q, Chao HT, Zaske AM, Lupski JR, Milewicz DM, Shete S, Posey JE, Hanchard NA. JCI Insight. 2024 May 8;9(9):e172152. doi: 10.1172/jci.insight.172152. PMID: 38716726 Free PMC article.