A computational analysis of sequence features involved in recognition of short introns

L P Lim et al. Proc Natl Acad Sci U S A. 2001.

Abstract

Splicing of short introns by the nuclear pre-mRNA splicing machinery is thought to proceed via an "intron definition" mechanism, in which the 5' and 3' splice sites (5'ss, 3'ss, respectively) are initially recognized and paired across the intron. Here, we describe a computational analysis of sequence features involved in recognition of short introns by using available transcript data from five eukaryotes with complete or nearly complete genomic sequences. The information content of five different transcript features was measured by using methods from information theory, and Monte Carlo simulations were used to determine the amount of information required for accurate recognition of short introns in each organism. We conclude: (i) that short introns in Drosophila melanogaster and Caenorhabditis elegans contain essentially all of the information for their recognition by the splicing machinery, and computer programs that simulate splicing specificity can predict the exact boundaries of approximately 95% of short introns in both organisms; (ii) that in yeast, the 5'ss, branch signal, and 3'ss can accurately identify intron locations but do not precisely determine the location of 3' cleavage in every intron; and (iii) that the 5'ss, branch signal, and 3'ss are not sufficient to accurately identify short introns in plant and human transcripts, but that specific subsets of candidate intronic enhancer motifs can be identified in both human and Arabidopsis that contribute dramatically to the accuracy of splicing simulators.

Figures

Figure 1

Intron length distributions. Histograms of the lengths of introns from each organism are plotted, using a log scale for the abscissa. Each histogram was fitted as a mixture of two lognormal distributions by using the R statistical package (curved lines). The position of the point of intersection of these distributions is indicated. S. ce., S. cerevisiae; C. el., C. elegans; D. me., D. melanogaster; A. th., A. thaliana; H. sa., Homo sapiens.
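The fit described in this caption can be illustrated with a minimal sketch. It is not the authors' R procedure: it is a plain expectation-maximization fit of a two-component normal mixture to log10(length), which is equivalent to a two-lognormal mixture on the lengths themselves. The function names, the half-split initialization, the iteration count, and the synthetic lengths are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def fit_two_lognormals(lengths, iterations=100):
    """EM fit of a two-component normal mixture on log10(length).

    Returns [[weight, mean, sd], [weight, mean, sd]] sorted by mean.
    Initialization (split the sorted data in half) is an assumption.
    """
    x = sorted(math.log10(v) for v in lengths)
    n = len(x)
    half = n // 2
    comps = []
    for part in (x[:half], x[half:]):
        mu = sum(part) / len(part)
        var = sum((v - mu) ** 2 for v in part) / len(part)
        comps.append([len(part) / n, mu, max(math.sqrt(var), 1e-3)])
    for _ in range(iterations):
        # E-step: responsibility of each component for each point
        resp = []
        for v in x:
            w = [c[0] * normal_pdf(v, c[1], c[2]) for c in comps]
            total = sum(w) or 1e-300
            resp.append([wi / total for wi in w])
        # M-step: re-estimate weight, mean, and sd of each component
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            mu = sum(r[k] * v for r, v in zip(resp, x)) / nk
            var = sum(r[k] * (v - mu) ** 2 for r, v in zip(resp, x)) / nk
            comps[k] = [nk / n, mu, max(math.sqrt(var), 1e-3)]
    return sorted(comps, key=lambda c: c[1])


def intersection(comps, steps=100):
    """Length at which the two weighted densities cross, found by bisection
    between the component means. Returns None if no crossing is bracketed."""
    (w1, m1, s1), (w2, m2, s2) = comps

    def diff(v):
        return w1 * normal_pdf(v, m1, s1) - w2 * normal_pdf(v, m2, s2)

    lo, hi = m1, m2
    if diff(lo) <= 0 or diff(hi) >= 0:
        return None
    for _ in range(steps):
        mid = (lo + hi) / 2
        if diff(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 10 ** ((lo + hi) / 2)


# Synthetic intron lengths (made up): a narrow short class and a broad long class.
random.seed(0)
short = [10 ** random.gauss(1.8, 0.08) for _ in range(600)]
long_ = [10 ** random.gauss(3.0, 0.5) for _ in range(400)]
comps = fit_two_lognormals(short + long_)
cut = intersection(comps)
print("components (weight, mean log10 length, sd):", comps)
print("intersection length (nt):", cut)
```

Both component densities pick up the same Jacobian factor when converted from log10(length) back to length, so the crossing point is the same on either scale.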

Figure 2

Splice signal motifs. Sequence motifs for the 5′ss (A), branch site (B), and 3′ss (C) are displayed by using the pictogram program (http://genes.mit.edu/pictogram.html). The height of each letter is proportional to the frequency of the corresponding base at the given position, and bases are listed in descending order of frequency from top to bottom. The RelEnt (in bits) of the motif model used in our analyses (I1M or WMM) relative to the background transcript base composition is also shown. The splice junctions and branch point are marked by inverted triangles.
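As a rough illustration of the RelEnt quantity in this caption, the sketch below computes the relative entropy (in bits) of a position-independent weight matrix against a background base composition, i.e. the sum over positions and bases of p log2(p/q). It covers only the weight-matrix case, not a first-order Markov model; the aligned sites, the uniform background, the pseudocount, and the function names are assumptions made for illustration.

```python
import math

BASES = "ACGT"


def position_frequencies(sites, pseudocount=0.5):
    """Per-position base frequencies for equal-length aligned sites.

    The pseudocount is an assumption; it keeps unseen bases at nonzero frequency.
    """
    length = len(sites[0])
    freqs = []
    for i in range(length):
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        freqs.append({b: counts[b] / total for b in BASES})
    return freqs


def relative_entropy_bits(freqs, background):
    """Sum over positions and bases of p * log2(p / q), skipping p == 0."""
    return sum(
        p[b] * math.log2(p[b] / background[b])
        for p in freqs
        for b in BASES
        if p[b] > 0
    )


# Made-up aligned 5' splice-site-like sequences and a uniform background.
sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTATGT"]
background = {b: 0.25 for b in BASES}
wmm = position_frequencies(sites)
print("relative entropy (bits):", relative_entropy_bits(wmm, background))
```

A position that is fully conserved contributes 2 bits against a uniform background, and a position matching the background contributes 0, so the total grows with how strongly the motif departs from background composition.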

Figure 3

Monte Carlo estimation of information required for short intron recognition. EAc of prediction of short introns by pairscan in randomized transcripts is plotted versus the sum of the RelEnts of the splice signal motifs used. Dotted gray line indicates 98% EAc. Each curve is the best-fit from 130 simulations. Brackets indicate 1 SD above and below the best-fit curve for three chosen RelEnt values. Solid circles represent EAc for intronscan in real transcripts versus the sum of the RelEnts of the transcript features used.

Figure 4

Relative contributions of five transcript features to intron detection. The area of each wedge represents the relative contribution to intron detection accuracy of the corresponding transcript feature, calculated as described in Methods. The sizes of the wedges are scaled so that the complete circle represents the RelEnt per intron required to achieve 98% detection accuracy in each organism, derived from Fig. 3.

Figure 5

Contribution of subsets of pentamers to intron prediction. Exact prediction accuracies are shown for intronscan by using the 5′ss and 3′ss signals and specialized intron composition models that score particular subsets of pentamers (see the supporting information) as a function of the number of pentamers used. Circles represent accuracy calculated by using 0, 10, 20, 40, 60, and 100 pentamers, with pentamers chosen in order from high values of f log(f/g) to low, where f and g are the pentamer frequencies in introns and exons, respectively, using a protocol that avoids choosing overlapping pentamers (see the supporting information). (A) Drosophila, (B) Arabidopsis, (C) human. (D) The first ten intron-biased pentamers chosen from each organism. The dashed black line represents average accuracy for 25 random orderings of pentamers. The solid gray line represents accuracy by using all 1,024 pentamers; dashed gray lines are described in the text.
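The f log(f/g) ordering in this caption can be sketched as follows. This is only an illustration of the ranking score: it does not reproduce the overlap-avoidance protocol from the supporting information, and the toy sequences, the function names, the use of the natural logarithm, and the floor value applied to pentamers never seen in exons are all assumptions.

```python
import math
from collections import Counter


def pentamer_frequencies(seqs, k=5):
    """Relative frequency of every overlapping k-mer across the sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def rank_intron_biased(introns, exons, k=5, floor=1e-6):
    """Order k-mers from high to low f * log(f / g).

    f is the frequency in introns and g the frequency in exons. The floor
    used when a k-mer is absent from exons is an assumption.
    """
    f = pentamer_frequencies(introns, k)
    g = pentamer_frequencies(exons, k)
    scores = {
        kmer: fk * math.log(fk / g.get(kmer, floor)) for kmer, fk in f.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Made-up sequences: a T-rich "intron" and a GAA-repeat "exon" sharing ACGTA.
introns = ["TTTTTTTTTTACGTA"]
exons = ["GAAGAAGAAGACGTA"]
ranked = rank_intron_biased(introns, exons)
print("most intron-biased pentamers:", ranked[:3])
```

Weighting the log ratio by f favors pentamers that are both enriched in introns and common enough to occur in most introns, rather than rare pentamers with extreme ratios.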
