The Alternative Splicing Gallery (ASG): bridging the gap between genome and transcriptome
Journal Article
* To whom correspondence should be addressed. Tel: +1 919 513 2726; Fax: +1 919 513 7315; Email: sheber@ncsu.edu
Published:
01 January 2004
Jeremy Leipzig, Pavel Pevzner, Steffen Heber, The Alternative Splicing Gallery (ASG): bridging the gap between genome and transcriptome, Nucleic Acids Research, Volume 32, Issue 13, 1 July 2004, Pages 3977–3983, https://doi.org/10.1093/nar/gkh731
Abstract
Alternative splicing essentially increases the diversity of the transcriptome and has important implications for physiology, development and the genesis of diseases. Conventionally, alternative splicing is investigated in a case-by-case fashion, but this becomes cumbersome and error prone if genes show a huge abundance of different splice variants. We use a different approach and integrate all transcripts derived from a gene into a single splicing graph. Each transcript corresponds to a path in the graph, and alternative splicing is displayed by bifurcations. This representation preserves the relationships between different splicing variants and allows us to investigate systematically all possible putative transcripts. We built a database of splicing graphs for human genes, using transcript information from various major sources (Ensembl, RefSeq, STACK, TIGR and UniGene). A Web interface allows users to display the splicing graphs, to interactively assemble transcripts and to access their sequences as well as neighboring genomic regions. We also provide for each gene an exhaustive pre-computed catalog of putative transcripts—in total more than 1.2 million sequences. We found that ∼65% of the investigated genes show evidence for alternative splicing, and in 5% of the cases, a single gene might produce over 100 transcripts.
Received January 5, 2004; Revised March 14, 2004; Accepted July 12, 2004
INTRODUCTION
Alternative splicing is a major link between the estimated 30 000 genes and the myriad of proteins that are believed to be necessary for complex organisms like humans. Previous studies ( 1 – 3 ) reported that over half of all known human genes might be alternatively spliced, and some genes create a vast assortment of different transcripts. Unfortunately, existing ab initio gene prediction programs only infer information about one or a small number of most likely transcripts. Our goal here is to complement these programs by providing information about all putative transcripts.
Since expressed sequence tags (ESTs) and cDNAs provide direct evidence for all sampled transcripts, they are currently the most important resources to infer gene structure and alternative splicing.
Typically, these sequences are collected in gene indices such as UniGene ( 4 ), the TIGR Gene Index ( 5 ), GeneNest ( 6 ) and STACK ( 7 ). Owing to the fragmentary nature of EST sequences and their sometimes low quality, biologists often assemble them into consensus sequences before using them for further analyses ( 6 , 8 , 9 ). Several spliced alignment programs, such as sim4 ( 10 ), gap2 ( 11 ), spidey ( 12 ) and BLAT ( 13 ), are available for aligning transcripts to genomic sequence, and subsequent programs ( 14 – 16 ) have been developed to infer gene structure and predict alternative splicing. All these programs represent splicing variants as a list, so that a gene with n splicing variants corresponds to a list with n entries. This is hardly efficient since n is often very large: in our study, 1.7% (380) of the investigated genes have more than 500 assemblies, and 0.4% (89) have more than 5000. Even more troublesome, such a representation conceals the relationships between different transcripts.
To overcome these problems, we have developed the Alternative Splicing Gallery (ASG), a web-based tool ( http://statgen.ncsu.edu/asg/ ) that integrates transcript information from Ensembl ( 17 ), RefSeq ( 18 ), STACK ( 7 ), TIGR ( 5 ) and UniGene ( 4 ) into splicing graphs ( 19 ) in order to explore and visualize gene structure and alternative splicing, as well as to compile an exhaustive transcript catalog.
Conceptually, splicing graphs are built by ‘projecting’ transcribed sequences onto their genomic templates and ‘overlaying’ these projections (see below for a formal definition). They combine shared segments of different transcripts into single paths and display alternative splicing by bifurcations. Our approach integrates the information of all (even divergent) transcripts of a gene into a single, unambiguously defined data structure, rather than handling them separately. This distinguishes ASG from other alternative splicing databases that partition ESTs with respect to splice variants, e.g. SpliceNest ( 15 ), where UniGene clusters are partitioned by assembling them into consensus sequences, or STACK ( 7 ), where isoforms are partitioned within ‘loose’ EST clusters based on a multiple sequence alignment. Such a partition potentially might result in incomplete or even lost transcripts [see ( 19 ) and Figure 3 ]. Although the number of possible transcripts of a gene might be very large, splicing graphs display them all simultaneously. They allow us to investigate systematically all possible assemblies consistent with the input data as well as to recover the corresponding splice variants and their relationships. This complements other alternative splicing databases, which usually try to recover a minimal or most probable set of splice variants. A detailed comparison of ASG with other databases is shown in Table 1 .
Table 1.
Comparison of ASG with other databases
Database | Methodology | Statistics (human) | Organisms |
---|---|---|---|
ASAP ( 16 ) | Input: EST/mRNAs (UniGene); Map to genome: BLAST, dynamic programming; Analyze genomic-EST-mRNA multiple alignments; Tissue-specific results | 68 032 EST clusters mapped to genome; 44% show alternative splicing | Human; Mouse |
ASD ( 20 ) consists of AltExtron, AltSplice (R1), AEDB | Input: genes (Ensembl), EST/mRNAs (GenBank); AltExtron, AltSplice: computer-generated, map to genome: BLAST; AEDB: manually created, literature based | AltSplice: 16 215 genes; 77% of genes with >1 transcripts; 61 880 transcripts | Human; Mouse; AltExtron: Model organisms |
Ensembl ( 17 ) (V22.34d.1) | Input: ESTs (dbEST); Map to genome: Exonerate, BLAST, Est_Genome; Redundancy reduction and splice site adjustment; Transcript annotation: genome wise | 38 581 EST genes; 43% show alternative splicing; 122 247 transcripts | Human; Other metazoan species |
PALSdb ( 21 ) (R6) | Input: EST/mRNAs (UniGene); Compare ESTs with longest mRNA in cluster; No genomic reference | 33 111 clusters; 43% show alternative splicing | Human; Mouse |
ProSplicer ( 22 ) (R3.0) | Input: genes (Ensembl), mRNAs (UniGene), ESTs (dbEST), proteins (Swiss-Prot, TrEMBL); Map to genome: proteins (BLAST), EST/mRNAs (sim4) | 21 786 genes | Human |
GeneNest ( 6 ) & SpliceNest ( 15 ) | Input: ESTs (UniGene); Assemble UniGene clusters into contigs; Map to genome: Reputer, sim4 | 426 178 contigs; 31 185 singletons; 33 431 clusters mapped to genes; 45% show alternative splicing | Human; Mouse; Fruitfly; Zebrafish; Arabidopsis |
STACKdb ( 7 ) (v3.1) | Input: EST/mRNAs (GenBank); Cluster ESTs and assemble clusters; Post-cluster assemblies; Tissue and disease-specific categories; No genomic reference | 270 515 clusters; 850 835 singletons | Human |
TAP ( 14 ) | Input: ESTs (dbEST); Map to genome: WuBLAST, sim4; Predict gene structure and poly(A) sites | 1007 multi-exon RefSeq genes; 55% show alternative splicing | Human; Mouse |
ASG | Input: EST/mRNA data (UniGene, TIGR, STACK, RefSeq, Ensembl), genes (Ensembl); Map to genome: BLAST, sim4; Build splicing graphs | 22 127 genes; 65% show alternative splicing; >1.2 million transcripts | Human |
We analyzed and annotated ASG for alternative splicing events and constructed for each gene (except for 89 genes with more than 5000 assemblies each) an exhaustive set of transcripts. In good concordance with other studies ( 1 , 23 ), we found that ∼65% of the genes showed evidence for alternative splicing. Surprisingly, our transcript catalog contains more than 1.2 million sequences in total, a number that might very well explain the complexity found in humans.
We display splicing graphs with respect to transcripts and the corresponding genomic sequence. A sequence builder allows users to interactively ‘assemble’ transcripts. As an example, we show in Figure 1 the splicing graph of the human CBFB gene (Ensembl gene identifier: ENSG00000067955), which is involved in human leukemogenesis ( 24 ) and encodes the β-subunit of the heterodimeric transcription factor core-binding factor (CBF) involved in the regulation of genes important in hematopoiesis ( 25 ). The CBFB gene contains six exons and spans ∼70 kb.
Figure 1.
Visualization of the splicing graph (gray) of the human CBFB gene with Ensembl gene identifier: ENSG00000067955 together with the corresponding aligned input transcripts (green) and representative transcript reconstructions (purple). Not drawn to scale! Splice sites are marked by vertical bars. Color-labeled vertices mark annotated alternative splicing events. The highlighted boxes in the sequence builder depict a transcript that skips exon 3 and uses an alternative 5′ splice site in exon 5. Transcripts are displayed with respect to their alignment with the genomic sequence as rows of boxes (aligned regions) connected by dotted lines (putative introns). Only alignments that meet our quality constraints (alignment boundaries correspond to splice sites, sequence identity >95%) are incorporated in the splicing graph.
It was previously shown that the last 31 nt of exon 5 can be alternatively spliced ( 26 ). The splicing graph confirms this finding (node 5, marked orange) and points to an additional (and so far unreported) alternative splicing event: skipping of exon 3 (node 3, marked red). This observation is supported by the cDNAs BM462417 (GenBank) and BM477780 (GenBank), both derived from leiomyosarcoma tissue libraries. The splicing graph allows us to generate an exhaustive list of all possible putative transcripts by generating all paths in the graph:
- t_1: n_1 → n_2 → n_3 → n_4 → n_5 → n_6 → n_7
- t_2: n_1 → n_2 → n_3 → n_4 → n_5 → n_7
- t_3: n_1 → n_2 → n_4 → n_5 → n_6 → n_7
- t_4: n_1 → n_2 → n_4 → n_5 → n_7
Such a list could be an invaluable starting point for subsequent research. It immediately raises questions about possible dependences between the alternative splicing events and about which of the transcripts has a biological function. Although at the moment there are insufficient data to answer these questions in a high-throughput setting, we consider this one of the biggest challenges for future work.
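The four paths listed for the CBFB graph can be reproduced with a short depth-first traversal. The following is a minimal sketch, not the ASG implementation: the edge list is inferred from the four listed transcripts, and the function name `enumerate_transcripts` and the dictionary layout are assumptions made for illustration.

```python
def enumerate_transcripts(succ):
    """Return every source-to-sink path of a DAG given as {node: [successors]}."""
    # A source is a node that never appears as a successor.
    has_pred = {w for targets in succ.values() for w in targets}
    sources = [v for v in succ if v not in has_pred]
    paths = []

    def walk(path):
        nxt = succ.get(path[-1], [])
        if not nxt:  # sink: no outgoing edge, so the path is complete
            paths.append(path)
        for w in nxt:
            walk(path + [w])

    for s in sources:
        walk([s])
    return paths


# Hypothetical encoding of the CBFB splicing graph (edges inferred from t_1..t_4).
cbfb = {
    "n1": ["n2"],
    "n2": ["n3", "n4"],  # bifurcation: exon 3 included or skipped
    "n3": ["n4"],
    "n4": ["n5"],
    "n5": ["n6", "n7"],  # bifurcation: alternative 5' splice site in exon 5
    "n6": ["n7"],
    "n7": [],
}

for i, path in enumerate(enumerate_transcripts(cbfb), start=1):
    print(f"t{i}: " + " -> ".join(path))
```

Two independent bifurcations give 2 × 2 = 4 paths, which is why the number of putative transcripts can grow quickly with the number of alternative splicing events.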
Some splicing graphs are considerably more complicated. As an example, we show in Figure 2 the splicing graph of the human collagen, type IV, alpha 6 gene COL4A6 (Ensembl gene identifier: ENSG00000133124). Type IV collagen is the major structural component of glomerular basement membranes, which compartmentalize tissues and provide important signals for the differentiation of the cells they support. The COL4A6 gene maps to chromosome Xq22.3 and was found to contain two alternative promoters. The gene seems to be connected with Alport syndrome accompanied by diffuse leiomyomatosis ( 27 , 28 ). The gene belongs to our list of the 89 most complex genes with more than 5000 assemblies. Our annotation shows a total of eight simple alternative splicing events, but the splicing graph reveals a large amount of additional unannotated alternative splicing [alternative promoters, alternative poly(A) sites, and complex and nested events], which might be overlooked by an automated annotation procedure. It is hard to imagine how a conventional approach could display this complex situation adequately.
Figure 2.
Visualization of the splicing graph (gray) of the human COL4A6 gene with Ensembl gene identifier ENSG00000133124 together with the corresponding aligned input transcripts (green). Not drawn to scale!
MATERIALS AND METHODS
Data sources and preparation
We downloaded UniGene Build #160 ( ftp://ftp.ncbi.nih.gov/repository/UniGene/ ). After pre-processing [vector trimming, poly(A) trimming and the elimination of short sequences], we assembled the UniGene clusters using CAP3 ( 29 ) with default parameters. The resulting assemblies were merged with the TIGR Human Gene Index, Version 13.0, Release October 14, 2003 ( ftp://ftp.tigr.org/pub/data/tgi/Homo_sapiens/ ); STACKdb v3.1 of the South African National Bioinformatics Institute (SANBI) ( http://www.sanbi.ac.za/Dbases.html ); and the set of mRNAs of RefSeq Release 2, October 21, 2003 ( http://www.ncbi.nlm.nih.gov/RefSeq/ ).
Transcript mapping to Ensembl genes
We mapped the above transcripts and EST contigs onto the known Ensembl genes of Ensembl Human release v18.34.1 ( http://www.ensembl.org/ ) using a two-step approach. After masking repeats by RepeatMasker (Smit,A.F.A. and Green,P., http://ftp.genome.washington.edu/RM/RepeatMasker.html ), we first identified the candidate Ensembl gene by matching each sequence with the set of Ensembl transcripts using BLASTN ( 30 ) with default parameters and E-value threshold \(E < 10^{-50}\). To establish a match, we require an alignment with an overall identity rate of 95% over more than 100 nt; in case of multiple hits fulfilling these requirements we only use the best match. In the second step, we align the matched sequences combined with the Ensembl transcripts to the corresponding genomic region (derived from Ensembl) plus 10 kb on either end using the spliced alignment program sim4 ( 10 ) with reduced word size W = 8. Sequences that resulted in low-quality alignments (alignment length smaller than 100 positions, identity score <95%) or inconsistent orientation, like overlapping transcripts mapping to the opposite strands of the genome, were discarded.
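The final filtering step can be illustrated with a small predicate. This is a sketch under stated assumptions: the `Alignment` record and its field names are hypothetical, only the two numeric thresholds come from the text, and the orientation check is simplified to a comparison with the strand of the gene.

```python
from dataclasses import dataclass

MIN_LENGTH = 100       # alignments shorter than 100 positions are discarded
MIN_IDENTITY = 95.0    # alignments with identity below 95% are discarded


@dataclass
class Alignment:
    """Hypothetical summary of one spliced alignment (not sim4's output format)."""
    seq_id: str
    length: int        # number of aligned positions
    identity: float    # percent identity over the alignment
    strand: str        # "+" or "-"


def keep_alignment(aln, gene_strand):
    """Return True if the alignment passes the quality and orientation checks."""
    return (
        aln.length >= MIN_LENGTH
        and aln.identity >= MIN_IDENTITY
        and aln.strand == gene_strand  # simplified orientation check (assumption)
    )


alignments = [
    Alignment("est_a", 450, 98.7, "+"),
    Alignment("est_b", 80, 99.0, "+"),    # too short
    Alignment("est_c", 300, 91.2, "+"),   # identity too low
    Alignment("est_d", 500, 99.5, "-"),   # opposite strand
]
kept = [a.seq_id for a in alignments if keep_alignment(a, "+")]
print(kept)
```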
Splicing graph construction
Splicing graphs are constructed as follows: let \(\{s_1, \ldots, s_n\}\) be the set of transcripts for a given gene. Each transcript \(s_i\) corresponds via a spliced alignment to a set of genomic positions \(V_i\) with \(V_i \neq V_j\) for \(i \neq j\). Define the set of all transcribed positions \(V = \bigcup_{i=1}^{n} V_i\) as the union of all sets \(V_i\). The splicing graph \(G\) is the directed graph on the set of transcribed positions \(V\) that contains an edge \((v, w)\) if and only if \(v\) and \(w\) are consecutive positions in one of the transcripts \(s_i\). The resulting graph is post-processed to eliminate splices that do not comply with the canonical (GT/AG) or the non-canonical (GC/AG; AT/AC) splice sites and to prune unspliced intron parts. To obtain a more compact representation, we collapse vertices that correspond to consecutive genomic positions ( Figure 1 ).
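The definition above translates fairly directly into code. The sketch below is illustrative only: transcripts are assumed to be given as ordered lists of inclusive `(start, end)` exon intervals on the forward strand, the function names are hypothetical, and the splice-site filtering and intron pruning steps are omitted.

```python
from collections import defaultdict


def build_splicing_graph(transcripts):
    """Build the position-level graph: an edge (v, w) exists if and only if
    v and w are consecutive positions in at least one transcript."""
    succ, pred, positions = defaultdict(set), defaultdict(set), set()
    for exons in transcripts:
        # Ordered list of genomic positions covered by this transcript (V_i).
        path = [p for start, end in exons for p in range(start, end + 1)]
        positions.update(path)
        for v, w in zip(path, path[1:]):
            succ[v].add(w)
            pred[w].add(v)
    return positions, succ, pred


def collapse(positions, succ, pred):
    """Merge runs of consecutive genomic positions into (start, end) nodes."""
    nodes, start = [], None
    for p in sorted(positions):
        if start is None:
            start = p
        # The run continues only if p + 1 is the sole successor of p
        # and p is the sole predecessor of p + 1.
        if not (succ.get(p, set()) == {p + 1} and pred.get(p + 1, set()) == {p}):
            nodes.append((start, p))
            start = None
    node_at = {s: (s, e) for s, e in nodes}
    edges = sorted(((s, e), node_at[w]) for s, e in nodes for w in succ.get(e, ()))
    return nodes, edges


# Two toy transcripts: the second one skips the middle exon.
toy = [
    [(1, 3), (10, 12), (20, 22)],
    [(1, 3), (20, 22)],
]
nodes, edges = collapse(*build_splicing_graph(toy))
print(nodes)
print(edges)
```

Because a node boundary is introduced wherever a position has more than one successor or predecessor, an alternative splice site inside an exon splits that exon into fragments, which matches the exon fragments shown in the graph display.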
Transcript generation
In a splicing graph, a transcript is defined as a path from a source to a sink vertex. This definition corresponds to a maximal list of consistent exons and does not capture truncated transcripts, which could result from alternative transcription initiation or termination, but such sequences could be included easily. To create an exhaustive transcript catalog, we traverse all paths from a source to a sink and report the corresponding sequences.
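Before enumerating paths it can be useful to know how many there are, for example to set aside genes whose graphs yield more than 5000 assemblies. The following sketch counts source-to-sink paths by memoized recursion without listing them. The function names and the toy graph builder are assumptions for illustration, and the recursion is only suitable for graphs of modest depth.

```python
from functools import lru_cache

MAX_ASSEMBLIES = 5000  # genes above this limit were excluded from the catalog


def count_transcripts(succ):
    """Count source-to-sink paths in a DAG given as {node: [successors]}."""

    @lru_cache(maxsize=None)
    def paths_from(v):
        nxt = succ.get(v, ())
        if not nxt:  # a sink contributes exactly one path
            return 1
        return sum(paths_from(w) for w in nxt)

    has_pred = {w for targets in succ.values() for w in targets}
    return sum(paths_from(v) for v in succ if v not in has_pred)


def cassette_chain(k):
    """Toy graph with k independent cassette exons b1..bk between exons a0..ak."""
    succ = {f"a{k}": []}
    for i in range(1, k + 1):
        succ[f"a{i - 1}"] = [f"b{i}", f"a{i}"]  # include or skip cassette exon i
        succ[f"b{i}"] = [f"a{i}"]
    return succ


n = count_transcripts(cassette_chain(13))
print(n, "exceeds limit" if n > MAX_ASSEMBLIES else "within limit")
```

In this toy graph each cassette exon doubles the number of paths, so 13 independent events already give 2^13 = 8192 putative transcripts.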
In contrast to conventional EST assembly approaches, splicing graphs will recover all potential putative exon combinations, regardless of how often they are represented in the input data or in which order the data are processed ( Figure 3 ). This eliminates many of the ambiguities in current EST assembly algorithms [for an overview see ( 19 )]. On the other hand, in case of dependences between alternative splicing events, e.g. events that always coincide or are mutually exclusive, this approach might combine splice variants that do not co-occur in nature, and yield overpredictions. If such spurious transcripts can be identified, they could be removed easily from our catalog. Unfortunately, to the best of our knowledge, all current methods to determine precisely which splicing events occur in individual isoforms, especially if they affect distant transcript regions, are experimental in nature and do not lend themselves to high-throughput applications [for a more complete overview see ( 31 )].
Figure 3.
In the presence of alternative splicing, conventional EST-based transcript reconstruction is often incomplete. For example, given the set of displayed ESTs, there are two different ways of assembling (partitioning) all input ESTs into consensus sequences. Both reconstructions are equally computable from the data and explain all ESTs, but each one consists of only two sequences. Dependent on the order of the processed ESTs, a conventional approach might result in either reconstruction and miss the other. In contrast, a splicing graph-based approach does not partition the data but reports exhaustively all four different putative transcripts. However, in the presence of dependences between alternative splicing events, this approach runs the risk of overpredictions by grouping together splicing events that might not co-occur in nature.
Our method does not require that a given alternative splice form is detected in multiple transcripts, but we complement our predictions by a quality value [similar to the approach described in ( 32 )], which tries to assess the degree to which a prediction corresponds to a potential real transcript. Our quality value (range: 0–1, where 0 is bad and 1 is good) penalizes the occurrence of non-consensus splice sites and transcript regions with poor EST support, and it rewards a high overall EST coverage. The precise combination of these parameters into a single score is heuristically determined based on the inspection of individual transcripts.
RESULTS
In total, approximately 500 000 EST consensus sequences and mRNAs were used to build 22 127 splicing graphs. ASG shows splicing graphs with respect to their corresponding genomic sequence and the input sequences ( Figures 1 and 2 ). Exons or exon fragments are depicted as rectangular nodes, and splices as circular edges between nodes that correspond to non-consecutive genomic positions. Splice sites are marked by vertical bars and non-canonical splice sites are highlighted by an asterisk. Alternative splicing is indicated by positions of in-degree or out-degree larger than 1. The splicing graphs are automatically analyzed and four main simple types of alternative splicing [single and multiple cassette exons, retained introns, competing 5′ and 3′ splice sites; see Figure 4 and ( 31 )] are highlighted by colors. We perform this analysis by identifying graph patterns similar to those in Figure 4 . In a splicing graph, exons correspond to adjacent vertices that map to consecutive genomic positions bordered by splice sites. For example, to determine an exon-skipping event, we look for two bifurcation vertices, s and t , which correspond to a 5′ and a 3′ splice site on the border of different exons, and which are connected by a single edge as well as by a path traversing one or more exons. Similar searches are performed for retained introns, competing splice sites, multiple promoters and poly(A) sites (the latter two are not displayed in the graph), and a detailed description is given on our Web page. Currently, our automated annotation does not identify mutually exclusive exons, complex or nested splicing events, or transcript truncations. The results of our alternative splicing analysis of human genes are summarized in Tables 2 and 3 .
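The exon-skipping search described above can be sketched on a collapsed graph whose nodes are `(start, end)` intervals. This is a simplified illustration, not the ASG annotation procedure: the function name, the forward-strand coordinates and the adjacency test used to tell cassette exons apart from competing splice sites are all assumptions, and the example coordinates are invented.

```python
def find_exon_skipping(succ):
    """Find edges s -> t that bypass one or more exons.

    succ maps each node (start, end) to a list of successor nodes.
    Returns sorted tuples (s, t, skipped_nodes).
    """
    events = []
    for s in succ:
        for t in succ[s]:
            # Explore alternative routes from s that avoid the direct edge s -> t.
            stack = [[n] for n in succ[s] if n != t]
            while stack:
                path = stack.pop()
                for nxt in succ.get(path[-1], ()):
                    if nxt == t:
                        # Skipped exons must be flanked by introns on both sides;
                        # otherwise the pattern is a competing splice site.
                        if path[0][0] > s[1] + 1 and path[-1][1] + 1 < t[0]:
                            events.append((s, t, tuple(path)))
                    elif nxt[0] < t[0]:  # stay upstream of t
                        stack.append(path + [nxt])
    return sorted(events)


# Hypothetical coordinates for a graph shaped like the CBFB example.
n1, n2, n3, n4 = (1, 100), (201, 300), (401, 500), (601, 700)
n5, n6, n7 = (801, 900), (901, 931), (1101, 1200)
graph = {n1: [n2], n2: [n3, n4], n3: [n4], n4: [n5], n5: [n6, n7], n6: [n7], n7: []}
print(find_exon_skipping(graph))
```

In this toy graph the edge from `n2` to `n4` is reported as skipping `n3`, whereas the bifurcation at `n5` is not, because `n6` directly abuts `n5` and therefore represents an alternative 5′ splice site rather than a cassette exon.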
Figure 4.
Types of alternative splicing annotated in the splicing graph gallery. Boxes represent exons or exon fragments. Retained introns are often caused by incompletely spliced ESTs and should be interpreted very carefully.
Table 2.
Tabulation of simple alternative splicing events and number of genes where they occurred in the ASG consisting of 22 127 Ensembl genes
Event type | Total number | Percentage of genes | Number of genes |
---|---|---|---|
Cassette exons | 10 940 | 30.3 | 6701 |
Competing 5′ splice sites | 4808 | 17.1 | 3783 |
Competing 3′ splice sites | 5211 | 17.8 | 3935 |
Retained introns | 12 777 | 31.0 | 6856 |
In addition to these numbers, we found over 10 000 more complex or nested alternative splicing events, which did not fall in the above classification. Furthermore, 5879 genes showed evidence for multiple promoters or multiple poly(A) sites. Only ∼35% of the genes did not show evidence for alternative splicing.
Table 3.
Distribution of the number of transcript reconstructions per gene in the Alternative Splicing Gallery consisting of 22 127 Ensembl genes
Transcripts per gene | Percentage of genes | Number of genes |
---|---|---|
1 | 34.9 | 7722 |
2 | 15.7 | 3471 |
3–4 | 14.8 | 3282 |
5–10 | 11.5 | 2547 |
11–20 | 8.1 | 1781 |
21–50 | 6.9 | 1518 |
51–100 | 3.1 | 694 |
101–200 | 1.8 | 400 |
201–500 | 1.5 | 332 |
501–5000 | 1.3 | 291 |
>5000 | 0.4 | 89 |
A sequence builder allows users to construct and retrieve interactively any transcript supported by the splicing graph—as well as neighboring upstream/downstream regions and introns—by simply selecting the corresponding elements in the splicing graph or the sequence builder. In addition, we provide for each gene (except for 89 genes that produced more than 5000 different assemblies each) a pre-computed exhaustive set of putative transcript reconstructions (i.e. paths in the graph)—in total more than 1.2 million sequences. We also provide a usually much smaller set of representative assemblies that ‘cover’ the splicing graph. The representative assemblies were chosen by selecting for each splice (i.e. each graph edge) an assembly of maximal length. For each splice site, we generated a probe by concatenating the splice site flanking 30mers of the splicing graph. We used MegaBLAST ( 33 ) to search dbEST ( 34 ) with these probes. The probe set and the GenBank identifiers of the found ESTs, which support the splice sites, can be downloaded from our Web page. A list of the genes, which produced more than 5000 different assemblies, is provided as Supplementary Material.
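The probe construction for splice support can be illustrated as follows. This is a sketch under assumptions: the probe is taken to be the last 30 exonic nucleotides upstream of the splice joined to the first 30 exonic nucleotides downstream, coordinates are 0-based on the forward strand, and the function name is hypothetical.

```python
def splice_probe(genome, donor_end, acceptor_start, flank=30):
    """Concatenate the exonic 30mers flanking a splice.

    genome         : genomic sequence as a string
    donor_end      : 0-based index of the last exonic base before the intron
    acceptor_start : 0-based index of the first exonic base after the intron
    """
    left = genome[max(0, donor_end - flank + 1): donor_end + 1]
    right = genome[acceptor_start: acceptor_start + flank]
    return left + right


# Toy sequence: 40 nt exon, intron GT...AG, 40 nt exon.
toy_genome = "A" * 40 + "GT" + "C" * 20 + "AG" + "T" * 40
probe = splice_probe(toy_genome, donor_end=39, acceptor_start=64)
print(len(probe), probe)
```

A probe built this way spans the exon-exon junction, so an EST that matches it over its full length supports the splice rather than the unspliced genomic sequence.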
We display for each gene basic information such as genomic position, gene description, Gene Ontology annotation (http://www.geneontology.org/), OMIM annotation (4) and known PFAM domains (35), and provide links to other alternative splicing databases [ASD (20), ASAP (16), HASDB (23), PALSdb (21), ProSplicer (22) and SpliceNest (15)]. ASG can be queried by using source database identifiers or by a BLAST (30) search.
DISCUSSION
ASG is a compact genome-based representation of the huge quantity of EST and cDNA data, designed as a starting point for the systematic investigation of gene structure and the transcriptome. We integrated transcript data from RefSeq, Ensembl, UniGene, STACK and TIGR with respect to the set of Ensembl genes into splicing graphs. Combining these various data sources has several advantages. We get a more complete overview and reduce potential bias introduced by different EST clustering strategies. This is an important point, since the large number of missed real splice forms is a major disadvantage of any method that maps EST data to genomic sequence (2). In addition, by merging EST data with full-length mRNAs and model sequences we overcome the problem of coverage gaps in gene structure and transcript prediction. Since splicing graphs combine recurring transcript segments into single paths and display alternative splicing as bifurcations, they yield a compact and biologically meaningful visualization, which highlights potential splice variants. We automatically annotate the main simple types of alternative splicing and, in contrast to most other alternative splicing databases, we also display other more complex events. The essential advantage of splicing graphs over conventional representations is that they preserve the relationships between splice variants and therefore allow us to systematically generate and analyze all putative transcripts represented by the input data. This is an important prerequisite for the analysis and quantification of complex splicing patterns, the investigation of mRNA splicing regulation and for cataloging the transcriptome.
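As a rough illustration of how shared transcript segments collapse into single paths while alternative splicing shows up as bifurcations, the following Python sketch treats each distinct exon, given as hypothetical (start, end) coordinates, as a node and connects exons that are consecutive in at least one transcript. This is a simplification for illustration only, not the ASG construction itself, and all names and coordinates are assumptions.

```python
from collections import defaultdict


def build_splicing_graph(transcripts):
    """Merge transcripts (lists of (start, end) exons) into one graph.

    Nodes are distinct exons; an edge joins two exons that are
    consecutive in at least one transcript.
    """
    successors = defaultdict(set)
    for exons in transcripts:
        for exon in exons:
            successors[exon]               # register the node, even a sink
        for upstream, downstream in zip(exons, exons[1:]):
            successors[upstream].add(downstream)
    return {node: sorted(nexts) for node, nexts in successors.items()}


def bifurcations(graph):
    """Return the nodes that have more than one outgoing edge."""
    return sorted(node for node, nexts in graph.items() if len(nexts) > 1)


# Two invented transcripts that differ by one skipped exon.
transcripts = [
    [(100, 200), (300, 400), (500, 600)],  # includes the middle exon
    [(100, 200), (500, 600)],              # skips the middle exon
]
graph = build_splicing_graph(transcripts)
branch_points = bifurcations(graph)
```

The two transcripts share their first and last exons, so the merged graph has three nodes rather than five, and the single bifurcation at the first exon marks the exon-skipping event.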
We derived for each gene a small set of representative putative transcripts as well as an exhaustive catalog. Since our approach explores all possible compatible splice variations, it might overpredict the number of transcripts if the alternative splicing events of a gene depend on each other. If identified, the spurious transcripts could be removed easily from our catalog. Unfortunately, current high-throughput techniques in general cannot determine such dependencies with much certainty (31). Since our goal was to complement existing techniques by an algorithm that provides an exhaustive transcript catalog, we refrained from applying additional filtering steps at this stage to avoid omitting a real variant. We did, however, complement our transcript reconstructions by a quality value, which ranks transcripts with respect to the occurrence of non-standard splice sites and regions of poor EST support. This allows users to further filter and prioritize our predictions. In addition to the above transcript catalogs, ASG offers a sequence builder that allows users to interactively assemble exons, and to retrieve upstream and downstream regions or introns by simply ‘clicking’ on the corresponding elements. This feature is especially helpful for investigating gene structure or for finding regulatory sequences. Following the suggestion of a very helpful anonymous referee, we interconnected our database with other alternative splicing databases. This allows users to compare and combine our results with other approaches, as well as to complement ASG with additional information.
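To see why exhaustive reconstruction can grow quickly when alternative events are treated as independent, the sketch below counts the paths of an acyclic graph by memoized recursion instead of listing them. The `cassette_chain` helper builds a purely hypothetical gene with k independently skippable cassette exons; it is an illustration under that assumption and does not describe how ASG counts or stores transcripts.

```python
from functools import lru_cache


def count_paths(graph, start):
    """Count start-to-sink paths in an acyclic graph without listing them."""
    @lru_cache(maxsize=None)
    def count_from(node):
        nexts = graph.get(node, ())
        if not nexts:                      # a sink ends exactly one path
            return 1
        return sum(count_from(nxt) for nxt in nexts)
    return count_from(start)


def cassette_chain(k):
    """Hypothetical gene: k cassette exons, each independently skippable."""
    graph = {}
    for i in range(k):
        constitutive, cassette, following = f"c{i}", f"x{i}", f"c{i + 1}"
        graph[constitutive] = [cassette, following]  # include or skip
        graph[cassette] = [following]
    graph[f"c{k}"] = []
    return graph


counts = {k: count_paths(cassette_chain(k), "c0") for k in (3, 12, 13)}
```

Each independent cassette exon doubles the number of paths, so k such events give 2^k reconstructions: 12 events stay below the 5000-assembly cutoff (4096 paths), while 13 exceed it (8192 paths). If some of these events were in fact coupled, a share of those paths would be spurious, which is the overprediction discussed above.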
Quantifying the number of different transcripts that originate from a single gene under certain conditions is a fascinating and sparsely addressed dimension of the hidden transcriptome, which goes beyond simply cataloging alternative splicing events. Our database is only a first step in this direction. Although we expect neither that each of our in silico reconstructions corresponds to a biologically functional transcript nor that we reconstructed all such transcripts, our database highlights a set of genes that potentially produce hundreds of different proteins.
To illustrate the biological importance of such genes, we investigated in a preliminary study (data not shown) their frequency among the genes involved in inherited human disease, which are stored in OMIM (4). We found a highly significant (P-value = 2.2 × 10⁻¹⁶) overrepresentation of genes with multiple transcripts; on average, OMIM genes produced over 50% more transcripts than others. Although this has to be interpreted very carefully (one could, for example, argue that databases are biased owing to the high interest in disease genes, or that disease genes might have higher transcription levels, which could result in more biologically non-functional, erroneous transcripts), we hypothesize that genes with multiple transcripts are of fundamental biological importance and therefore more likely to be involved in disease. Screening these genes in different tissues and under different conditions, as well as investigating them with respect to their function, evolution and involvement in diseases, are interesting challenges for future research.
Future work
The current version of ASG does not display annotations of coding sequences, promoters, polyadenylation sites, the strength of splice sites or transcript truncations. We plan to include these features together with an accompanying protein section in a future edition of the gallery.
SUPPLEMENTARY MATERIAL
Supplementary Material is available at NAR Online.
We thank Chris Smith and Dr Christopher Basten for computer support and very valuable discussions.
REFERENCES
1. Mironov,A., Fickett,J. and Gelfand,M. (1999) Frequent alternative splicing of human genes. Genome Res., 9, 1288–1293.
2. Modrek,B. and Lee,C. (2001) A genomic view of alternative splicing. Nature Genet., 30, 13–19.
3. Graveley,B. (2001) Alternative splicing: increasing diversity in the proteomic world. Trends Genet., 17, 100–107.
4. Wheeler,D., Church,D., Federhen,S., Lash,A., Madden,T., Pontius,J., Schuler,G., Schriml,L., Sequeira,E., Tatusova,T. and Wagner,L. (2003) Database resources of the National Center for Biotechnology. Nucleic Acids Res., 31, 28–33.
5. Quackenbush,J., Cho,J., Lee,D., Liang,F., Holt,I., Karamycheva,S., Parvizi,B., Pertea,G., Sultana,R. and White,J. (2001) The TIGR Gene Indices: analysis of gene transcript sequences in highly sampled eukaryotic species. Nucleic Acids Res., 29, 159–164.
6. Haas,S., Beissbarth,T., Rivals,E., Krause,A. and Vingron,M. (2000) GeneNest: automated generation and visualization of gene indices. Trends Genet., 16, 521–523.
7. Christoffels,A., vanGelder,A., Greyling,G., Miler,R., Hide,T. and Hide,W. (2001) STACK: Sequence Tag Alignment and Consensus Knowledgebase. Nucleic Acids Res., 29, 234–238.
8. Burke,J., Wang,H., Hide,W. and Davison,D. (1998) Alternative gene form discovery and candidate gene selection from gene indexing projects. Genome Res., 8, 276–290.
9. Zhuo,D., Zhao,W., Wright,F., Yang,H., Wang,J., Sears,R., Baer,T., Kwon,D., Gordon,D., Gibbs,S., Dai,D., Yang,Q., Spitzner,J., Krahe,R., Stredney,D., Stutz,A. and Yuan,B. (2001) Assembly, annotation, and integration of UNIGENE clusters into the human genome draft. Genome Res., 11, 904–918.
10. Florea,L., Hartzell,G., Zhang,Z., Rubin,G. and Miller,W. (1998) A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res., 8, 967–974.
11. Huang,X., Adams,M., Zhou,H. and Kerlavage,A. (1997) A tool for analyzing and annotating genomic sequences. Genomics, 46, 37–45.
12. Wheelan,S.J., Church,D.M. and Ostell,J.M. (2000) Spidey: a tool for mRNA-to-genomic alignments. Genome Res., 11, 1952–1957.
13. Kent,W.J. (2002) BLAT—the BLAST-like alignment tool. Genome Res., 12, 656–664.
14. Kan,Z., Rouchka,E.C., Gish,W.R. and States,D.J. (2001) Gene structure prediction and alternative splicing analysis using genomically aligned ESTs. Genome Res., 11, 889–900.
15. Coward,E., Haas,S. and Vingron,M. (2002) SpliceNest: visualizing gene structure and alternative splicing based on EST clusters. Trends Genet., 18, 53–55.
16. Lee,C., Atanelov,L., Modrek,B. and Xing,Y. (2003) ASAP: the Alternative Splicing Annotation Project. Nucleic Acids Res., 31, 101–105.
17. Hubbard,T., Barker,D., Birney,E., Cameron,G., Chen,Y., Clark,L., Cox,T., Cuff,J., Curwen,V., Down,T. et al. (2002) The Ensembl genome database project. Nucleic Acids Res., 30, 38–41.
18. Pruitt,K. and Maglott,D. (2001) RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res., 29, 137–140.
19. Heber,S., Alekseyev,M., Sze,S.H., Tang,H. and Pevzner,P.A. (2002) Splicing graphs and EST assembly problem. Bioinformatics, 18 (Suppl. 1), 181–188.
20. Thanaraj,T.A., Stamm,S., Clark,F., Riethoven,J.J., Le Texier,V. and Muilu,J. (2004) ASD: the Alternative Splicing Database. Nucleic Acids Res., 32, 64–69.
21. Huang,Y.H., Chen,Y.T., Lai,J.J., Yang,S.T. and Yang,U.C. (2002) PALS db: putative alternative splicing database. Nucleic Acids Res., 30, 186–190.
22. Huang,H.D., Horng,J.T., Lee,C.C. and Liu,B.J. (2003) ProSplicer: a database of putative alternative splicing information derived from protein, mRNA and expressed sequence tag sequence data. Genome Biol., 4, R29.
23. Modrek,B., Resch,A., Grasso,C. and Lee,C. (2001) Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucleic Acids Res., 29, 2850–2859.
24. Liu,P., Tarle,S., Hajra,A., Claxton,D., Marlton,P., Freedman,M., Siciliano,M. and Collins,F. (1993) Fusion between transcription factor CBF beta/PEBP2 beta and a myosin heavy chain in acute myeloid leukemia. Science, 261, 1041–1044.
25. Wang,S., Wang,Q., Crute,B.E., Melnikova,I.N., Keller,S.R. and Speck,N.A. (1993) Cloning and characterization of subunits of the T-cell receptor and murine leukemia virus enhancer core-binding factor. Mol. Cell. Biol., 13, 3324–3339.
26. van der Reijden,B.A., Lombardo,M., Dauwerse,H.G., Giles,R.H., Muhlematter,D., Bellomo,M.J., Wessels,H.W., Beverstock,G.C., van Ommen,G.J., Hagemeijer,A. et al. (1995) RT–PCR diagnosis of patients with acute nonlymphocytic leukemia and inv(16)(p13q22) and identification of new alternative splicing in CBFB-MYH11 transcripts. Blood, 86, 277–282.
27. Zhang,X., Zhou,J., Reeders,S.T. and Tryggvason,K. (1996) Structure of the human type IV collagen COL4A6 gene, which is mutated in Alport syndrome-associated leiomyomatosis. Genomics, 33, 473–479.
28. Zhou,J., Mochizuki,T., Smeets,H., Antignac,C., Laurila,P., de Paepe,A., Tryggvason,K. and Reeders,S.T. (1993) Deletion of the paired alpha 5(IV) and alpha 6(IV) collagen genes in inherited smooth muscle tumors. Science, 261, 1167–1169.
29. Huang,X. and Madan,A. (1999) CAP3: a DNA sequence assembly program. Genome Res., 9, 868–877.
30. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
31. Roberts,G.C. and Smith,C.W.J. (2002) Alternative splicing: combinatorial output from the genome. Curr. Opin. Chem. Biol., 6, 375–383.
32. Gupta,S., Zink,D., Korn,B., Vingron,M. and Haas,S.A. (2004) Genome wide identification and classification of alternative splicing based on EST data. Bioinformatics, April 29 (Epub ahead of print).
33. Zhang,Z., Schwartz,S., Wagner,L. and Miller,W. (2000) A greedy algorithm for aligning DNA sequences. J. Comput. Biol., 7, 203–214.
34. Boguski,M.S., Lowe,T.M. and Tolstoshev,C.M. (1993) dbEST-database for “expressed sequence tags”. Nature Genet., 4, 332–333.
35. Bateman,A., Birney,E., Cerruti,L., Durbin,R., Etwiller,L., Eddy,S.R., Griffiths-Jones,S., Howe,K.L., Marshall,M. and Sonnhammer,E.L. (2002) The Pfam protein families database. Nucleic Acids Res., 30, 276–280.
Author notes
Department of Computer Science, College of Engineering, North Carolina State University, Raleigh, NC 27695-7566, USA and ¹Department of Computer Science & Engineering, APM 4802, University of California, San Diego, La Jolla, CA 92093-0114, USA
Nucleic Acids Research, Vol. 32 No. 13 © Oxford University Press 2004; all rights reserved