Analysis of canonical and non-canonical splice sites in mammalian genomes

M Burset et al. Nucleic Acids Res. 2000.

Abstract

A set of 43 337 splice junction pairs was extracted from mammalian GenBank annotated genes. Expressed sequence tag (EST) sequences support 22 489 of them. Of these, 98.71% contain the canonical dinucleotides GT and AG at the donor and acceptor sites, respectively; 0.56% hold non-canonical GC-AG splice site pairs; and the remaining 0.73% occur in many small groups (each with a maximum size of 0.05%). Studying these groups, we observe that many of them contain splicing dinucleotides shifted from the annotated splice junction by one position. After close examination of such cases, we present a new classification consisting of only eight observed types of splice site pairs (out of 256 a priori possible combinations). EST alignments allow us to verify the exonic part of the splice sites, but many non-canonical cases may be due to intron sequencing errors. This idea gains substantial support when we compare the sequences of human genes having non-canonical splice sites with the sequences deposited in GenBank by high throughput genome sequencing projects (HTG). A high proportion (156 out of 171) of the human non-canonical and EST-supported splice site sequences had a clear match in the human HTG. After correction, they can be classified as: 79 GC-AG pairs (of which one was an error that corrected to GC-AG); 61 errors that were corrected to canonical GT-AG pairs; six AT-AC pairs (of which two were errors that corrected to AT-AC); one case produced from a non-existent intron; seven cases found in HTG that were deposited to GenBank; and, finally, only two remaining cases of supported non-canonical splice sites. If we assume that approximately the same situation holds for the whole set of annotated mammalian non-canonical splice sites, then 99.24% of splice site pairs should be GT-AG, 0.69% GC-AG, 0.05% AT-AC and only 0.02% could consist of other types of non-canonical splice sites.
We analyze several characteristics of EST-verified splice sites and build weight matrices for the major groups, which can be incorporated into gene prediction programs. We also present a set of EST-verified canonical splice sites larger by two orders of magnitude than the current one (22 199 entries versus approximately 600) and finally, a set of 290 EST-supported non-canonical splice sites. Both sets should be significant for future investigations of the splicing mechanism.
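The extrapolated percentages quoted in the abstract can be approximately reproduced with a short calculation. The sketch below is an assumption about how the figures combine, not the authors' stated procedure: it takes the 290 EST-supported non-canonical pairs (22 489 minus 22 199) and redistributes them in proportion to the 148 HTG-matched cases that received a definite class (79 + 61 + 6 + 2), leaving out the non-existent intron and the seven remaining HTG cases. The function name `extrapolate` is hypothetical.

```python
# Counts quoted in the abstract.
EST_SUPPORTED = 22489
CANONICAL = 22199
NON_CANONICAL = EST_SUPPORTED - CANONICAL  # 290 pairs

# HTG-corrected classes of the human non-canonical cases (from the abstract).
CORRECTED = {"GT-AG": 61, "GC-AG": 79, "AT-AC": 6, "other": 2}


def extrapolate():
    """Assumed model: spread the non-canonical share over the corrected classes."""
    classified = sum(CORRECTED.values())  # 148 cases with a definite class
    non_canonical_pct = 100.0 * NON_CANONICAL / EST_SUPPORTED
    shares = {k: non_canonical_pct * n / classified for k, n in CORRECTED.items()}
    # Pairs annotated as canonical stay GT-AG.
    shares["GT-AG"] += 100.0 * CANONICAL / EST_SUPPORTED
    return {k: round(v, 2) for k, v in shares.items()}


print(extrapolate())
```

Under these assumptions the result rounds to 99.24% GT-AG, 0.69% GC-AG, 0.05% AT-AC and 0.02% other, in line with the figures stated in the abstract.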

Figures

Figure 1

Structure and classification of spliced constructs. (a) Structure of spliced constructs. The two sequence regions of a splice pair (marked as Donor and Acceptor), each with its splice site dinucleotide flanked by 40 bp of gene sequence on each side. Joining the exon part of the donor (ExonL) with the exon part of the acceptor (ExonR) produces the splice construct sequence to be verified by ESTs. (b) EST alignment classification. After the ESTs were aligned to the splice constructs, every match was classified as D-end (EST covers only the donor part), A-end (EST covers only the acceptor part), B-ends (EST covers the splice junction without mismatches) or Error (EST covers the junction with mismatches).
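The construct and the four alignment classes described in this caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the region layout (40 bp exon, dinucleotide, 40 bp intron for the donor, and the mirror image for the acceptor), the half-open alignment coordinates, and the rule that any mismatch in a junction-spanning alignment counts as an Error are all assumptions. The names `build_construct` and `classify_alignment` are hypothetical.

```python
FLANK = 40  # bp of gene sequence on each side of the dinucleotide (from the caption)


def build_construct(donor_region, acceptor_region, flank=FLANK):
    """Join ExonL (exon part of the donor) with ExonR (exon part of the acceptor)."""
    exon_l = donor_region[:flank]       # assumed: exon precedes the donor dinucleotide
    exon_r = acceptor_region[-flank:]   # assumed: exon follows the acceptor dinucleotide
    return exon_l + exon_r


def classify_alignment(start, end, mismatches, junction=FLANK):
    """Classify an EST match on the construct.

    start, end: 0-based half-open alignment interval on the construct.
    mismatches: number of mismatched positions in the alignment (assumed criterion).
    """
    covers_donor = start < junction
    covers_acceptor = end > junction
    if covers_donor and covers_acceptor:
        return "B-ends" if mismatches == 0 else "Error"
    return "D-end" if covers_donor else "A-end"


# Toy regions (not real gene sequence): exon bases A/T, intron bases C.
donor = "A" * 40 + "GT" + "C" * 40
acceptor = "C" * 40 + "AG" + "T" * 40
construct = build_construct(donor, acceptor)
print(len(construct), classify_alignment(20, 60, 0))
```

With 40 bp flanks the junction sits at position 40 of an 80 bp construct, so an alignment is junction-spanning only when it starts before and ends after that position.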

Figure 2

Examples of possible ambiguities in EST-supported splice pairs. (a) Homo sapiens Telethonin gene, intron 1 (AJ010063). An example of an annotated non-canonical junction supported by an EST. The same EST can also support a canonical splice junction. The annotated non-canonical junction and the putative canonical one produce the same spliced sequence. (b) Homo sapiens FUS gene, intron 14 (X99001). An example of an annotated non-canonical junction supported by an EST. Another EST supports a closely located canonical splice junction. In this case the EST-supported putative spliced sequence differs by 2 nucleotides (gg) from the annotated one.
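The ambiguity in panel (a) arises because moving both cut points by the same offset can leave the spliced sequence unchanged. The sketch below illustrates this on a toy sequence that is purely hypothetical (it is not taken from the Telethonin or FUS genes); the function name `equivalent_junctions` and the coordinate convention are assumptions made for illustration.

```python
def equivalent_junctions(seq, donor_cut, acceptor_cut, max_shift=2):
    """List cut-point shifts that yield the same spliced sequence.

    donor_cut: index of the first intron base.
    acceptor_cut: index of the first base of the downstream exon.
    Returns (shift, donor dinucleotide, acceptor dinucleotide) tuples.
    """
    annotated = seq[:donor_cut] + seq[acceptor_cut:]
    hits = []
    for shift in range(-max_shift, max_shift + 1):
        d, a = donor_cut + shift, acceptor_cut + shift
        if seq[:d] + seq[a:] == annotated:
            hits.append((shift, seq[d:d + 2], seq[a - 2:a]))
    return hits


# Hypothetical toy gene: exon, intron annotated as GG...CA, exon.
toy = "ATGCC" + "GGTAAGTCTCTTCCA" + "GATTC"
print(equivalent_junctions(toy, donor_cut=5, acceptor_cut=20))
```

In this toy case the annotated junction reads GG-CA, while shifting both cut points one base downstream gives a canonical GT-AG pair with an identical spliced product, so an EST alone cannot distinguish the two.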

Figure 3

Shifted splice sites. Examples of GG-AG verified splice pairs (11 cases). In the donor sites, a GG pair is always found immediately after the cut point. To decide which type of splice pair these non-canonical examples should be assigned to, we checked all closely located standard dinucleotides. They are found shifted by 1 nucleotide downstream. We reclassify the presented splice pairs as nine canonical GT-AG, one GC-AG and one GA-AG site.

Figure 4

Analysis of EST-supported non-canonical splice site groups. (a) Classification. This classification was produced by analyzing all EST-verified non-canonical splice pairs and taking into account cases with a shifted canonical consensus. Practically all splice pairs have only one non-canonical splice dinucleotide. (b) Table of possible splice pairs. After generalization we obtained only seven non-canonical splice pair groups, for a total of eight groups if we include the canonical splice pairs. The first (top) part of the right figures shows the canonical donor site combined with all observed variations of the acceptor site (GT-AG, GT-CG and GT-TG). The second (middle) part shows the AT-AC group and hybrid pairs (GT-AC, AT-AC and AT-AG). The third (bottom) part shows the canonical acceptor site combined with all observed variations of the donor site (GA-AG, GC-AG and GT-AG).
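The counts in this caption can be checked by enumeration: four bases give 16 dinucleotides, hence 16 × 16 = 256 possible donor-acceptor pairs, of which eight groups are listed. The short sketch below only restates the groups named in the caption; the helper `non_canonical_sites` is a hypothetical name, and it counts groups, not individual splice pairs.

```python
from itertools import product

BASES = "ACGT"
DINUCLEOTIDES = ["".join(p) for p in product(BASES, repeat=2)]          # 16
ALL_PAIRS = [f"{d}-{a}" for d, a in product(DINUCLEOTIDES, repeat=2)]   # 256

# The eight groups named in the caption (GT-AG appears in two panels).
OBSERVED = ["GT-AG", "GT-CG", "GT-TG", "GT-AC", "AT-AC", "AT-AG", "GA-AG", "GC-AG"]


def non_canonical_sites(pair):
    """Count how many of the two sites differ from the canonical GT / AG."""
    donor, acceptor = pair.split("-")
    return (donor != "GT") + (acceptor != "AG")


single_site = [p for p in OBSERVED if non_canonical_sites(p) == 1]
print(len(ALL_PAIRS), len(OBSERVED), single_site)
```

Of the seven non-canonical groups, only AT-AC differs from the canonical pair at both sites, which is consistent with the caption's remark that nearly all pairs carry a single non-canonical dinucleotide.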

Figure 5

Comparison of human GenBank sequences and available HTGs.

Figure 6

Small annotated and EST-supported non-canonical splice pair groups (without shifted dinucleotides).
