Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome

Gregory M Cooper, George Asimenos, Daryl J Thomas, Colin N Dewey, Adam Siepel, Ewan Birney, Damian Keefe, Ariel S Schwartz, Minmei Hou, James Taylor, Sergey Nikolaev, Juan I Montoya-Burgos, Ari Löytynoja, Simon Whelan, Fabio Pardi, Tim Massingham, James B Brown, Peter Bickel, Ian Holmes, James C Mullikin, Abel Ureta-Vidal, Benedict Paten, Eric A Stone, Kate R Rosenbloom, W James Kent, Gerard G Bouffard, Xiaobin Guan, Nancy F Hansen, Jacquelyn R Idol, Valerie V B Maduro, Baishali Maskeri, Jennifer C McDowell, Morgan Park, Pamela J Thomas, Alice C Young, Robert W Blakesley, Donna M Muzny, Erica Sodergren, David A Wheeler, Kim C Worley, Huaiyang Jiang, George M Weinstock, Richard A Gibbs, Tina Graves, Robert Fulton, Elaine R Mardis, Richard K Wilson, Michele Clamp, James Cuff, Sante Gnerre, David B Jaffe, Jean L Chang, Kerstin Lindblad-Toh, Eric S Lander, Angie Hinrichs, Heather Trumbower, Hiram Clawson, Ann Zweig, Robert M Kuhn, Galt Barber, Rachel Harte, Donna Karolchik, Matthew A Field, Richard A Moore, Carrie A Matthewson, Jacqueline E Schein, Marco A Marra, Stylianos E Antonarakis, Serafim Batzoglou, Nick Goldman, Ross Hardison, David Haussler, Webb Miller, Lior Pachter, Eric D Green, Arend Sidow

Elliott H Margulies et al. Genome Res. 2007 Jun.

Abstract

A key component of the ongoing ENCODE project involves rigorous comparative sequence analyses for the initially targeted 1% of the human genome. Here, we present orthologous sequence generation, alignment, and evolutionary constraint analyses of 23 mammalian species for all ENCODE targets. Alignments were generated using four different methods; comparisons of these methods reveal large-scale consistency but substantial differences in terms of small genomic rearrangements, sensitivity (sequence coverage), and specificity (alignment accuracy). We describe the quantitative and qualitative trade-offs concomitant with alignment method choice and the levels of technical error that need to be accounted for in applications that require multisequence alignments. Using the generated alignments, we identified constrained regions using three different methods. While the different constraint-detecting methods are in general agreement, there are important discrepancies relating to both the underlying alignments and the specific algorithms. However, by integrating the results across the alignments and constraint-detecting methods, we produced constraint annotations that were found to be robust based on multiple independent measures. Analyses of these annotations illustrate that most classes of experimentally annotated functional elements are enriched for constrained sequences; however, large portions of each class (with the exception of protein-coding sequences) do not overlap constrained regions. The latter elements might not be under primary sequence constraint, might not be constrained across all mammals, or might have expendable molecular functions. Conversely, 40% of the constrained sequences do not overlap any of the functional elements that have been experimentally identified. Together, these findings demonstrate and quantify how many genomic functional elements await basic molecular characterization.

Figures

Figure 1.

Phylogenetic tree relating the set of analyzed species. The depicted topology and branch lengths illustrate the relationships among the analyzed species’ sequences. Analytical support for the represented tree is provided elsewhere (Nikolaev et al. 2007). The numbers next to each species name indicate the amount of sequence (in Mb) examined in this study (some species have >30 Mb of sequence either as a result of lineage-specific expansions of these regions or the resolution with which orthologous sequences can be identified before alignment) (see Supplemental Material for additional details); (red numbers) BAC-derived sequence sequenced to “comparative grade” (see Methods); (blue numbers) sequence obtained from whole-genome sequencing efforts; and (black numbers) finished human sequence. Blue and green branches distinguish mammalian from non-mammalian sequences, respectively.

Figure 2.

“Human-centric” approach for constructing multisequence alignments. The human sequence (middle) is aligned to two other species’ sequences (top and bottom). In the final alignment (right), nucleotides from the other species need not have retained their original order and orientation; they may, for example, have been subjected to inversions (top blue) or duplications (bottom green). Non-human duplications need to be resolved (top magenta), so that each position in the human sequence is aligned to at most one position in any other species’ sequence.

Figure 3.

Rearrangements and duplications inferred by the alignments. (A) Number of rearrangement breakpoints in the ENCODE regions as a function of minimum block size, determined by three alignment methods. For each species, the average number of breakpoints over all regions (_Y_-axis) was calculated for all minimum block sizes (in base pairs; _X_-axis). The species shown are chimp (dark blue), baboon (brown), mouse (green), dog (orange), and cow (light blue). For each minimum block size, the number of breakpoints in a given region was determined after removing blocks in order of increasing size and joining consistent blocks until no block had size less than the minimum (see Methods). (B) Duplicated human nucleotide positions in the ENCODE regions. The fraction of ENCODE positions that are inparalogous to one another relative to a given species is plotted for each species, as determined by TBA (yellow) and MLAGAN (green). Colobus Monkey, Dusky Titi, Mouse Lemur, and Owl Monkey are not shown because sequence from these species was only obtained for one region (ENm001).

Figure 4.

Alignment coverage of coding exons and ancestral repeats. For a representative group of mammalian species (_X_-axis), the fractions of human coding exons covered by at least 1 base (top panel) or completely covered (i.e., no gaps; middle panel) are shown for the MAVID (blue), TBA (yellow), MLAGAN (green), and PECAN (red) alignments. For the same set of species, we also show the percentage of all human “ancestral repeat” bases (out of a total of ∼5.8 million) that are aligned to a nucleotide within a mobile element of the same class and family. Note that absolute coverage levels should be interpreted cautiously, as they reflect both phylogenetic signal (i.e., insertions and deletions of DNA between human and the query species) and sequence completeness.
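The two exon-coverage measures in this caption can be illustrated with a minimal sketch. It assumes that exons and aligned segments are given as half-open `(start, end)` intervals in human coordinates, where an aligned segment means human bases paired with a query nucleotide rather than a gap. The function names and the interval representation are hypothetical and are not taken from the study's pipeline.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in human coordinates (assumed)


def covered_bases(exon: Interval, aligned: List[Interval]) -> int:
    """Count exon bases that fall inside at least one aligned interval."""
    start, end = exon
    covered = 0
    cursor = start  # bases before the cursor have already been counted
    for a_start, a_end in sorted(aligned):
        lo = max(a_start, cursor)
        hi = min(a_end, end)
        if hi > lo:
            covered += hi - lo
            cursor = hi
    return covered


def coverage_summary(exons: List[Interval], aligned: List[Interval]) -> Tuple[float, float]:
    """Return (fraction of exons with >= 1 aligned base, fraction fully aligned)."""
    if not exons:
        raise ValueError("at least one exon is required")
    any_cov = 0
    full_cov = 0
    for exon in exons:
        n = covered_bases(exon, aligned)
        if n >= 1:
            any_cov += 1
        if n == exon[1] - exon[0]:
            full_cov += 1
    return any_cov / len(exons), full_cov / len(exons)


exons = [(0, 10), (20, 30), (40, 50)]
aligned = [(0, 10), (25, 28)]
any_fraction, full_fraction = coverage_summary(exons, aligned)
```

In this toy input, two of the three exons have at least one aligned base and one is aligned without gaps, which mirrors the distinction between the top and middle panels.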

Figure 5.

Alignment “correctness” as measured by _Alu_ exclusion and periodicity of substitutions in coding exons. For a group of non-primate mammals, the fraction of human _Alu_ bases (out of a total of ∼3.8 million) that are not aligned (i.e., gapped) is shown (top panel). A score of 1 would correspond to complete exclusion of all _Alu_s, as would be the case in alignments with no false-positive orthology predictions. We also show the fraction of human coding exons that show a triplet periodicity in substitutions in the pairwise alignment between human and each query species (see Methods). Note that this is purely a relative measure, since we exclude exons that are completely gapped in at least one alignment, or fail to show periodicity in at least one alignment.
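As a rough illustration of the two measures, the sketch below computes an Alu exclusion fraction and tallies substitutions by codon phase from a pairwise alignment. It assumes that the alignment is given as two equal-length gapped strings, that a per-column Boolean mask marks human Alu bases, and that the exon begins at codon position 1. The statistical test for periodicity is described in the paper's Methods and is not reproduced here; the phase tally is only the raw input such a test might use. All names are hypothetical.

```python
from typing import List, Sequence


def alu_exclusion(human_row: str, query_row: str, is_alu: Sequence[bool]) -> float:
    """Fraction of human Alu bases that are aligned to a gap in the query."""
    alu_columns = [
        i for i, (base, flag) in enumerate(zip(human_row, is_alu))
        if flag and base != "-"
    ]
    if not alu_columns:
        raise ValueError("no human Alu bases in this alignment")
    gapped = sum(1 for i in alu_columns if query_row[i] == "-")
    return gapped / len(alu_columns)


def substitutions_by_codon_phase(human_exon: str, query_exon: str) -> List[int]:
    """Count mismatches at codon positions 1, 2, and 3 of the human exon.

    Assumes the first human base is codon position 1 (an assumption of
    this sketch). Columns with a gap in either row are not counted as
    substitutions.
    """
    counts = [0, 0, 0]
    human_pos = 0  # index within the ungapped human exon
    for h, q in zip(human_exon, query_exon):
        if h == "-":
            continue
        if q != "-" and h.upper() != q.upper():
            counts[human_pos % 3] += 1
        human_pos += 1
    return counts


exclusion = alu_exclusion(
    "ACGTACGT",
    "AC--ACGT",
    [False, False, True, True, True, True, False, False],
)
phase_counts = substitutions_by_codon_phase("ATGAAACCC", "ATGAAGCCT")
```

An excess of substitutions at the third codon position, as in the toy exon above, is the kind of triplet pattern that a correctly aligned coding exon would be expected to show.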

Figure 6.

Constrained bases in each ENCODE region. For each ENCODE region (_Y_-axis), the percentage of nucleotides found to be under evolutionary constraint in the strict (red), moderate (blue), and loose sets (yellow) is shown (_X_-axis). The 44 regions are ranked from top to bottom by the fraction of bases in the moderate (green) annotations. For all the manually picked regions, their biological significance is noted in parentheses.

Figure 7.

Annotated versus unannotated constrained sequences. For each block of constrained sequence, a score based on the log-likelihood of observing such a sequence under a model of constrained versus neutral evolution was computed using the phastOdds program (Siepel et al. 2005). These values were divided by the length of each block to compute a normalized per-base log-likelihood that reflects constraint intensity (_X_-axis). These values were plotted as a frequency histogram (_Y_-axis) for the blocks of constrained sequences that do (yellow) or do not (blue) overlap an experimental annotation. The distributions largely overlap (green), even at the extreme positive end in which highly constrained sequences reside. For comparison, the distribution for ancestral repeat sequences is shown as a representation of largely neutral DNA.
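The length normalization and frequency histogram described here can be sketched as follows. The sketch assumes that each constrained block is already available as a tuple of length, block score, and an overlap flag. The scores themselves would come from phastOdds, which is not invoked here; the input values, the bin width, and the function names are illustrative assumptions.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Block = Tuple[int, float, bool]  # (length in bases, block score, overlaps an annotation)


def normalized_scores(blocks: Iterable[Block], annotated: bool) -> List[float]:
    """Per-base score (block score divided by block length) for one class of blocks."""
    return [score / length for length, score, flag in blocks if flag == annotated]


def frequency_histogram(values: List[float], bin_width: float) -> Dict[float, float]:
    """Map each bin's lower edge to the fraction of values falling in that bin."""
    if not values:
        return {}
    bins = Counter(math.floor(v / bin_width) for v in values)
    return {
        round(index * bin_width, 10): count / len(values)
        for index, count in sorted(bins.items())
    }


# Hypothetical blocks, not data from the study.
blocks = [(100, 50.0, True), (200, 20.0, False), (50, 30.0, True)]
annotated_hist = frequency_histogram(normalized_scores(blocks, True), 0.25)
unannotated_hist = frequency_histogram(normalized_scores(blocks, False), 0.25)
```

Dividing by length keeps long blocks from dominating the comparison, so the two histograms reflect constraint intensity per base rather than block size.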

Figure 8.

Significance of constrained sequence overlapping various experimental annotations. We quantified the ratio of “observed” to “randomized” overlaps between constrained sequences and experimental annotations (see Supplemental Box S1), after adding and subtracting a given number of bases to the ends of each experimentally identified annotation. Randomized data sets were generated by randomizing the start positions of features within each ENCODE target, preserving the length distribution of each feature set and any target-specific regional effects. (A) This analysis is illustrated for a hypothetical set of annotations. (Orange bars) The positions of constrained sequences; trimmed (blue bars), observed (green bars), and expanded (red) experimental annotations. (Vertical gray bars) Regions of overlap between constrained sequences and experimental annotations. A table summarizing the overlaps among the different scenarios is provided below the diagram. For this hypothetical example, note how the ratio of overlap between the observed and randomized data sets increases as the experimental annotations are trimmed, indicating an enrichment of constrained sequence in the trimmed annotations. (B) This analysis for several experimentally identified elements is plotted, where the _X_-axis indicates the amount of trimmed (negative) or expanded (positive) sequence on each element, and the _Y_-axis indicates the ratio of observed-to-randomized overlap (scale varies between plots). Note that CDSs exhibit a slight enrichment after deletion of a small number of bases at either end, but are very similar to what is expected given the theoretically optimal self–self overlap (“Constrained Sequence”), where we know that trimming should not increase specificity. For many annotations (e.g., “TUFs” and “5′-UTRs”) (see Supplemental Box S1), such enrichment quickly drops off as the annotations are expanded or trimmed. However, some annotations, such as “FAIRE Sites” and “Sequence-Specific Factors,” exhibit a clear improvement in overlap after trimming substantial amounts of sequence from either end (250 and 500 bases for “FAIRE Sites” and “Sequence-Specific Factors,” respectively). Similar plots for all experimental annotations are available as Supplemental Figure S4.
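The observed-to-randomized overlap ratio can be sketched for a single target as below. This is a minimal illustration under stated assumptions: features are half-open `(start, end)` intervals, overlap is counted in bases, the randomization places the resized features uniformly within a target of known length while keeping each length, and every feature is shorter than the target. The exact procedure used in the study is given in Supplemental Box S1, and the function names and parameters here are hypothetical.

```python
import random
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def resize(features: List[Interval], delta: int) -> List[Interval]:
    """Expand (delta > 0) or trim (delta < 0) both ends of each feature.

    Features trimmed down to zero length or less are dropped.
    """
    resized = [(start - delta, end + delta) for start, end in features]
    return [(start, end) for start, end in resized if end > start]


def overlap_bases(features: List[Interval], constrained: List[Interval]) -> int:
    """Number of positions covered by both interval sets."""
    feature_positions = set()
    for start, end in features:
        feature_positions.update(range(start, end))
    constrained_positions = set()
    for start, end in constrained:
        constrained_positions.update(range(start, end))
    return len(feature_positions & constrained_positions)


def randomize(features: List[Interval], target_len: int, rng: random.Random) -> List[Interval]:
    """Give each feature a random start within [0, target_len), keeping its length."""
    placed = []
    for start, end in features:
        length = end - start
        new_start = rng.randrange(0, target_len - length + 1)
        placed.append((new_start, new_start + length))
    return placed


def enrichment(features: List[Interval], constrained: List[Interval],
               target_len: int, delta: int, n_rand: int = 200, seed: int = 0) -> float:
    """Ratio of observed overlap to the mean overlap of randomized placements."""
    rng = random.Random(seed)
    resized = resize(features, delta)
    observed = overlap_bases(resized, constrained)
    total = sum(
        overlap_bases(randomize(resized, target_len, rng), constrained)
        for _ in range(n_rand)
    )
    expected = total / n_rand
    return observed / expected if expected > 0 else float("inf")


# Hypothetical target: one constrained block inside a slightly wider annotation.
constrained = [(100, 150)]
features = [(90, 160)]
ratio_observed = enrichment(features, constrained, target_len=1000, delta=0)
ratio_trimmed = enrichment(features, constrained, target_len=1000, delta=-10)
```

Calling `enrichment` over a range of `delta` values gives one curve of the kind plotted in panel B. Position sets are used for clarity; for real targets an interval sweep would be more memory-efficient.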

References

    1. Aparicio S., Chapman J., Stupka E., Putnam N., Chia J.-M., Dehal P., Christoffels A., Rash S., Hoon S., Smit A., et al. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science. 2002;297:1301–1310.
    2. Blakesley R.W., Hansen N.F., Mullikin J.C., Thomas P.J., McDowell J.C., Maskeri B., Young A.C., Benjamin B., Brooks S.Y., Coleman B.I., et al. An intermediate grade of finished genomic sequence suitable for comparative analyses. Genome Res. 2004;14:2235–2244.
    3. Blanchette M., Kent W.J., Riemer C., Elnitski L., Smit A.F.A., Roskin K.M., Baertsch R., Rosenbloom K., Clawson H., Green E.D., et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004;14:708–715.
    4. Boffelli D., McAuliffe J., Ovcharenko D., Lewis K.D., Ovcharenko I., Pachter L., Rubin E.M. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science. 2003;299:1391–1394.
    5. Bray N., Pachter L. MAVID: Constrained ancestral alignment of multiple sequences. Genome Res. 2004;14:693–699.
