Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies

Nucleic Acids Res. 2003 Oct 1;31(19):5654-66.

doi: 10.1093/nar/gkg770.

Brian J Haas, Arthur L Delcher, Stephen M Mount, Jennifer R Wortman, Roger K Smith Jr, Linda I Hannick, Rama Maiti, Catherine M Ronning, Douglas B Rusch, Christopher D Town, Steven L Salzberg, Owen White

Brian J Haas et al. Nucleic Acids Res. 2003.

Abstract

The spliced alignment of expressed sequence data to genomic sequence has proven a key tool in the comprehensive annotation of genes in eukaryotic genomes. A novel algorithm was developed to assemble clusters of overlapping transcript alignments (ESTs and full-length cDNAs) into maximal alignment assemblies, thereby comprehensively incorporating all available transcript data and capturing subtle splicing variations. Complete and partial gene structures identified by this method were used to improve The Institute for Genomic Research Arabidopsis genome annotation (TIGR release v.4.0). The alignment assemblies permitted the automated modeling of several novel genes and >1000 alternative splicing variations as well as updates (including UTR annotations) to nearly half of the approximately 27 000 annotated protein coding genes. The algorithm of the Program to Assemble Spliced Alignments (PASA) tool is described, as well as the results of automated updates to Arabidopsis gene annotations.

Figures

Figure 1

Pictorial representation of the PASA algorithm. Overlapping Arabidopsis transcript sequence alignments are ordered by their beginning position and assigned indices 0–8 as shown in (a). The matrix shown in (b) provides the ordered alignments along the rows and columns according to their index positions. The predetermined containments (chain links) and incompatibilities (bricks) present obstacles within the matrix, disallowing any direct comparisons between two alignments during $L_a$ or $R_a$ calculations. To compute $L_a$, each alignment a is compared with all compatible preceding alignments b, generating the value $\max[C_a,\ L_b + C_{a \setminus b}]$, which is stored in the upper left of the matrix cell at [row b][column a], with an arrow drawn to the $L_b$ value that yielded the maximum. $L_a$ is then the maximum upper left value in column a and is shown circled in yellow. After all $L_a$ values are computed, the maximum one is found at [row 7][column 8], and the arrows traced back from it (indicated in red) identify the alignments comprising the maximal assembly (alignments 8, 7, 4 and 0) with their contained alignments (5, 6 and 2). The result is the red assembly 1 shown in (a). An examination of assembly 1 indicates that it lacks alignments 1 and 3. The trace back from the forward scan at $L_a$ where a = 3 provides the portion of the maximal assembly containing a that originates to the left of a (trace back drawn in blue), but does not identify the alignments to the right of a that belong to that maximal assembly. To find the maximal alignment assembly containing alignment 3, the reverse scan computations were performed, calculating $\max[C_a,\ R_b + C_{a \setminus b}]$, where b > a, and storing the score in the lower right of cell [row b][column a]. $R_a$ is the maximum lower right value in column a and is shown circled in yellow. The trace forward from $R_a$, where a = 3, is shown with blue arrows.
Combined with the trace back from $L_a$ where a = 3, this yields the maximal assembly containing alignment 3, shown as assembly 2, namely alignments 0–3 and 6–8. This assembly is also the maximal assembly containing alignment 1. In general, maximal assemblies for missing alignments are found in order of decreasing $L_a + R_a - C_a$ value until all missing alignments are accounted for within maximal alignment assemblies; this was not done here for brevity. Note that alignment 6 is regarded as contained in alignment 1, even though it extends a few bases into intron 4. PASA has a parameter, called ‘fuzz distance’, which specifies the length of mismatches to discount at transcript ends, where sequence alignment quality is often poor.
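
The forward scan, reverse scan and trace steps described in this caption can be illustrated with a small Python sketch. It is not the PASA implementation: the `Aln` record, the `compatible` and `contains` tests and the toy coordinates are hypothetical, the fuzz distance is ignored, and it assumes that C_a counts the alignments contained in a (including a itself) and that C_{a\b} counts those contained in a but not in b.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Aln:
    """Hypothetical spliced alignment on the genome, half-open [start, end)."""
    start: int
    end: int
    introns: frozenset = frozenset()  # set of (intron_start, intron_end)


def introns_in(a, lo, hi):
    return {i for i in a.introns if i[0] < hi and i[1] > lo}


def compatible(a, b):
    # Simplified rule (assumption): the alignments overlap and show the
    # same introns inside the overlapping region.
    lo, hi = max(a.start, b.start), min(a.end, b.end)
    return lo < hi and introns_in(a, lo, hi) == introns_in(b, lo, hi)


def contains(a, b):
    return a.start <= b.start and b.end <= a.end and compatible(a, b)


def scan(alns, inside, reverse=False):
    """Forward scan gives L_a, reverse scan gives R_a, plus trace links."""
    n = len(alns)
    score = [len(s) for s in inside]  # every score starts at C_a
    link = [None] * n
    for a in (range(n - 1, -1, -1) if reverse else range(n)):
        for b in (range(a + 1, n) if reverse else range(a)):
            if a in inside[b] or b in inside[a]:
                continue  # containment: no direct comparison
            if not compatible(alns[a], alns[b]):
                continue  # incompatibility: no direct comparison
            cand = score[b] + len(inside[a] - inside[b])  # L_b + C_{a\b}
            if cand > score[a]:
                score[a], link[a] = cand, b
    return score, link


def trace(a, link, inside):
    members = set()
    while a is not None:
        members |= inside[a]
        a = link[a]
    return members


def maximal_assemblies(alns):
    """Return assemblies as sets of indices into the start-sorted list."""
    alns = sorted(alns, key=lambda x: (x.start, -x.end))
    n = len(alns)
    inside = [{j for j in range(n) if contains(alns[i], alns[j])}
              for i in range(n)]
    L, back = scan(alns, inside)
    R, fwd = scan(alns, inside, reverse=True)
    best = max(range(n), key=lambda a: L[a])
    assemblies = [trace(best, back, inside)]
    # Cover missing alignments in order of decreasing L_a + R_a - C_a.
    order = sorted(range(n), key=lambda a: L[a] + R[a] - len(inside[a]),
                   reverse=True)
    for a in order:
        if not any(a in asm for asm in assemblies):
            assemblies.append(trace(a, back, inside) | trace(a, fwd, inside))
    return assemblies


ALNS = [
    Aln(0, 100, frozenset({(30, 50)})),
    Aln(20, 80, frozenset({(30, 50)})),     # contained in alignment 0
    Aln(60, 180),                           # unspliced across (120, 150)
    Aln(70, 200, frozenset({(120, 150)})),  # spliced at (120, 150)
    Aln(150, 200),                          # contained in alignment 3
]
print(maximal_assemblies(ALNS))
```

On this toy input the first assembly follows the spliced alignment 3, and the second is built around the unspliced alignment 2, which the first assembly cannot include. This mirrors how assembly 2 in the figure is recovered for alignments missing from assembly 1.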

Figure 2

Distribution of ESTs and FL-cDNAs within alignment assemblies. The ESTs and FL-cDNAs are non-randomly distributed within alignment assemblies, with relatively small numbers of assemblies containing large numbers of transcript alignments.

Figure 3

Examples of annotation updates using alignment assemblies. The types of gene structure updates provided by alignment assemblies are classified into several distinct categories. FL-cDNA-containing assemblies are presumed to encode the full-length gene product including UTRs. Existing annotated gene structures can therefore be replaced by gene structures inferred from the FL-cDNA-containing alignment assemblies. Assemblies lacking FL-cDNAs are presumed to encode only partial gene structures and are stitched into existing annotated gene structures, either altering the gene structure substantially or simply adding or extending UTR annotations. The protein coding segments of gene structures are shown in red and UTRs are shown in black. Alignment assemblies lacking a FL-cDNA are shown in black, whereas those containing FL-cDNAs are shown in gray. Boundaries consistent with the original gene structure annotation are highlighted in blue.

Figure 4

Five splicing isoforms supported by transcript sequence alignments. The cDNA alignments supporting the five splicing variations identified for the WD-40 repeat gene (At2g32700) are illustrated. For the purpose of comparison, FL-cDNA gi|13605814 is presumed to provide the representative gene structure. EST gi|5842113 contains an unspliced intron within the upstream UTR. EST gi|8688866 provides an alternative AG acceptor splice site within the upstream UTR which extends the spliced transcript length by 3 bp. EST gi|8689273 provides an alternative AG acceptor splice site corresponding to a different upstream UTR exon which removes 3 bp from the spliced transcript length. EST gi|9787494 provides an alternative AG acceptor splice site at a protein coding exon, deleting 6 bp corresponding to two codons of the translated sequence. Only one of the five isoforms encodes a variant protein sequence, while the remainder encode variations restricted to the upstream UTR region.

Figure 5

Impact of unspliced introns on protein products. Unspliced introns have variable effects on translation products. (a) The lack of splicing of the second intron yields a protein of similar length, albeit with a different C-terminus. (b) Two different overlapping introns, varying at the donor splice junction, encode an integer number of codons, and splicing removes internal segments from the protein. (c) Intron splicing alters the reading frame, providing a different and shorter C-terminus. (d) The lack of splicing truncates the protein sequence due to a stop codon encountered within the unspliced intron sequence.
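
The kinds of effects listed in this caption come down to reading-frame arithmetic, which a short Python sketch can make concrete. The function name, its return labels and the example sequences are hypothetical and are not taken from the paper. The sketch only inspects complete codons read through the retained intron, so it ignores stops that would form across the downstream exon boundary or further downstream after a frameshift.

```python
STOPS = {"TAA", "TAG", "TGA"}


def retained_intron_effect(upstream_cds, intron):
    """Classify the effect of leaving `intron` unspliced.

    `upstream_cds` is the coding sequence from the start codon up to the
    intron; its length modulo 3 gives the phase at which the intron is read.
    """
    phase = len(upstream_cds) % 3
    # Bases of the incomplete codon before the intron, then the intron itself.
    seq = upstream_cds[len(upstream_cds) - phase:] + intron
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            return "premature stop"      # translation ends inside the intron
    if len(intron) % 3 == 0:
        return "in-frame insertion"      # whole codons added, frame preserved
    return "frameshift"                  # downstream residues change


# The same intron read in two different phases.
print(retained_intron_effect("ATGGCT", "GTAAGTTTTCAG"))
print(retained_intron_effect("ATGGCTTA", "GTAAGTTTTCAG"))
```

The two calls use the same intron sequence but differ in phase: read in phase 0 it adds four codons without a stop, whereas in phase 2 the leftover bases `TA` combine with the intron's first `G` to form a stop codon.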
