Identifying novel constrained elements by exploiting biased substitution patterns

Manuel Garber et al. Bioinformatics. 2009.

Abstract

Motivation: Comparing the genomes of closely related species provides a powerful tool to identify functional elements in a reference genome. Many methods have been developed to identify conserved sequences across species; however, existing methods model conservation only as a decrease in the rate of mutation and ignore selection acting on the pattern of mutations.

Results: We present a new approach that takes advantage of deeply sequenced clades to identify evolutionary selection by uncovering not only signatures of rate-based conservation but also substitution patterns characteristic of sequence undergoing natural selection. We describe a new statistical method for modeling biased nucleotide substitutions, a learning algorithm for inferring site-specific substitution biases directly from sequence alignments and a hidden Markov model for detecting constrained elements characterized by biased substitutions. We show that the new approach can identify significantly more degenerate constrained sequences than rate-based methods. Applying it to the ENCODE regions, we estimate that as much as 10.2% of these regions are under selection.

Availability: The algorithms are implemented in a Java software package, called SiPhy, freely available at http://www.broadinstitute.org/science/software/.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.

Estimating the power of SiPhy for evolutionary constraint detection. The dependence of SiPhy's power on three factors (the number of available species M, the IC of each site, and the size of the k-mers, 1 or 12) is evaluated through a simulation study using a star phylogeny. An element is said to be 50% identifiable by SiPhy if >50% of its instances can be identified by SiPhy with P<0.001, and similarly for 25% and 75% identifiable. (A) The lowest IC of a single site that is 25% (blue), 50% (green) or 75% (red) identifiable is shown as a function of M. (B) The 25th, 50th and 75th percentile P-values are shown as a function of the IC of a single site, with the number of species fixed at M=9 (top) and M=50 (bottom). (C) and (D) show similar plots for 12mers: the number of bases recovered at P<0.001 is plotted for a given IC. The blue line corresponds to 25% of bases recovered, green to 50% of bases recovered and red to 75% of bases recovered.
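The caption above uses the IC of a site without defining it on this page. As a rough illustration only, the sketch below assumes that IC means information content, computed as the relative entropy (in bits) of a site's nucleotide distribution against a uniform background. The function name, the uniform default and the base-2 logarithm are assumptions made for this example, not details taken from the paper.

```python
import math

def information_content(pi, background=None):
    """Relative entropy, in bits, of the distribution pi against a background.

    Assumption for illustration: pi holds probabilities for A, C, G, T and
    the background defaults to uniform (0.25 each).
    """
    if background is None:
        background = [0.25] * 4
    # Terms with p == 0 contribute nothing to the sum.
    return sum(p * math.log2(p / q) for p, q in zip(pi, background) if p > 0)

# A site with no preference carries no information; a site fixed on one
# base carries the maximum; a site split evenly between two bases sits
# in between.
neutral = information_content([0.25, 0.25, 0.25, 0.25])
fixed = information_content([1.0, 0.0, 0.0, 0.0])
two_fold = information_content([0.5, 0.5, 0.0, 0.0])
```

Under this definition a fully degenerate site scores 0 bits and a fully fixed site scores 2 bits, so the figure's IC axis would run between those extremes if the paper uses the same convention.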

Fig. 2.

SiPhy shows greater power for detecting degenerate sites. The figure shows the distributions of LO scores for separating constrained 12mers from neutral ones within two datasets: the 2D and the 4D degenerate sites in the third codon positions. The LO scores are calculated using either a π-based method (A) or a rate-based method (B). The shaded regions represent the portion of the sites that show excess constraint in the 2D dataset when the 4D dataset is used as a control.

Fig. 3.

Estimating constraint in the ENCODE regions. (A) Comparison of constraints in neutral sequences versus genomic (ENCODE) regions. The 12mer LO scores are computed by SiPhy for ancestral repeats (AR), bootstrapped ancestral repeats (AR boot) and genomic regions (black). Excess constraint in the genomic regions is highlighted by the shaded region in light blue when compared to AR, and in both light blue and dark blue when compared to AR boot. (B) Overlap between SiPhy elements and three other types of elements: coding exons (green), GERP (blue) and PhastCons (red). Each curve shows the percentage of the elements overlapped by SiPhy 12mers for a given LO score cutoff. (C) A Venn diagram of SiPhy, PhastCons and GERP elements in the ENCODE regions.

Fig. 4.

SiPhy-HMM state diagram. A schematic representation of the HMM used to identify SiPhy constrained elements. N(π0) represents the neutral state. The constrained state is represented by a mixture of 10 constraint vectors, of which four are non-degenerate (π1–π4) and six are 2D (π5–π10).
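To make the shape of the constrained state concrete, here is a minimal sketch of how ten such constraint vectors could be laid out. It assumes that "2D" means two-fold degenerate, that each non-degenerate vector favours one nucleotide, and that the six 2D vectors correspond to the six unordered nucleotide pairs. The small residual mass `eps`, the uniform mixture weights and both function names are hypothetical choices for this example and are not taken from SiPhy.

```python
from itertools import combinations

NUCLEOTIDES = "ACGT"

def constraint_vectors(eps=0.01):
    """Build 4 non-degenerate and 6 two-fold degenerate vectors over A, C, G, T.

    eps is a hypothetical small probability left on the disfavoured bases.
    """
    vectors = []
    # Non-degenerate: almost all mass on a single nucleotide.
    for i in range(4):
        v = [eps] * 4
        v[i] = 1.0 - 3 * eps
        vectors.append(v)
    # Two-fold degenerate: mass shared equally by one pair of nucleotides.
    for i, j in combinations(range(4), 2):
        v = [eps] * 4
        v[i] = v[j] = (1.0 - 2 * eps) / 2
        vectors.append(v)
    return vectors

def mixture_emission(component_likelihoods, weights=None):
    """Combine per-vector likelihoods of one alignment column into a single
    emission value for the constrained state (uniform weights by default)."""
    n = len(component_likelihoods)
    if weights is None:
        weights = [1.0 / n] * n
    return sum(w * l for w, l in zip(weights, component_likelihoods))

vectors = constraint_vectors()
```

In this sketch the per-vector likelihoods passed to `mixture_emission` would come from a phylogenetic model evaluated on each alignment column, which is outside the scope of the example.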
