Identification of gene 3' ends by automated EST cluster analysis

Enrique M Muro et al. Proc Natl Acad Sci U S A. 2008.

Abstract

The properties and biology of mRNA transcripts can be affected profoundly by the choice of alternative polyadenylation sites, making definition of the 3' ends of transcripts essential for understanding their regulation. Here we show that 22-52% of sequences in commonly used human and murine "full-length" transcript databases may not currently end at bona fide polyadenylation sites. To identify probable transcript termini over the entire murine and human genomes, we analyzed the EST databases for positional clustering of EST ends. The analysis yielded 58,282 murine and 86,410 human candidate polyadenylation sites, of which 75% mapped to 23,091 known murine transcripts and 22,891 known human transcripts. The murine dataset correctly predicted 97% of the 3' ends in a manually curated and experimentally supported benchmark transcript set. Of currently known genes, 15% had no associated prediction and 25% had only a single predicted termination site. The remaining genes had an average of 3 (murine) or 4 (human) alternative polyadenylation sites predicted per transcript. The results are made available in the form of tables and an interactive web site that can be mined for rapid assessment of the validity of 3' ends in existing collections, enumeration of potential alternative 3' polyadenylation sites of known transcripts, direct retrieval of terminal sequences for design of probes, and detection of polyadenylation sites not currently mapped to known genes.
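
The idea of positional clustering of EST ends can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `cluster_est_ends`, the 10-nt merging window, and the minimum of 2 supporting ESTs are assumptions chosen for illustration, and the input is taken to be a plain list of genomic coordinates at which aligned ESTs end on one strand of one chromosome.

```python
from collections import Counter


def cluster_est_ends(end_positions, window=10, min_ests=2):
    """Group EST end coordinates that lie within `window` nt of their neighbour.

    `window` and `min_ests` are illustrative defaults, not published values.
    Returns one record per cluster that has at least `min_ests` members.
    """
    clusters = []
    current = []
    for pos in sorted(end_positions):
        # Start a new cluster when the gap to the previous end is too large.
        if current and pos - current[-1] > window:
            clusters.append(current)
            current = []
        current.append(pos)
    if current:
        clusters.append(current)

    sites = []
    for members in clusters:
        if len(members) < min_ests:
            continue  # a single EST end is treated as too weak to report
        # Report the most frequent end coordinate as the candidate site.
        site, _ = Counter(members).most_common(1)[0]
        sites.append({
            "site": site,
            "n_ests": len(members),
            "span": (members[0], members[-1]),
        })
    return sites


# Toy coordinates: two dense groups of ends and one isolated end.
ends = [1000, 1002, 1002, 1003, 1500, 2200, 2201, 2201, 2201]
candidate_sites = cluster_est_ends(ends)
for record in candidate_sites:
    print(record)
```

In this toy input the isolated end at 1500 is dropped, while the two dense groups are each reduced to a single candidate site with its EST support count.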

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.

EST evidence for alternative 3′ ends for murine Pde7a transcripts. (A) The diagram, obtained from the UCSC Genome Browser (Mouse mm8, February 2006 Assembly) (22), illustrates a region of mouse chromosome 3 spanning 8 kb, including the 3′UTR end of the Pde7a gene, transcribed from right to left. The ends of gene transcript predictions from RefSeq and Ensembl are represented as blue and brown bars, respectively. Black boxes represent the matches to mouse ESTs and mRNAs. Accumulations of EST ends at 3 particular positions are visible (indicated by a red diamond, a yellow oval, and a violet diamond). The position indicated by the red diamond suggests the existence of an alternative termination of the Pde7a gene considered neither by RefSeq nor by Ensembl. (B) Interpretation of EST and PAS information around the predicted 3′UTR of murine Pde7a. The curve, in blue, indicates the number of EST matches at each position. Many of these ESTs end abruptly at the left side of the principal peak, whereas the right side of the peak has a softer slope, which indicates that the ESTs derive from transcripts running from right to left, in agreement with the known direction of transcription of the Pde7a gene. The vertical red lines are maxima of the convolution of the EST-match histogram (see Materials and Methods), which indicate potential terminations. The red lines below the baseline represent potential terminations in the sense of the Pde7a transcription. Further evidence using PAS and clusters of EST ends is then used to confirm transcript ends. The 2 violet vertical bars (under both diamonds) represent clusters of EST ends located near rough ends and composed of at least 2 ESTs ending in the same position with a valid local polyadenylation signal. 
As explained in A, the rightmost end (violet diamond) is absent from the Ensembl collection but is found in the RefSeq database; the central end (yellow oval) is represented in Ensembl; and the leftmost (red diamond) is not represented in those collections. Of note, the end marked with the yellow oval had many EST ends (see A). However, there was no corresponding PAS, and a tract of 16 consecutive A's coincided with the peak of EST ends. This end, reported by Ensembl, appears to reflect internal transcript priming during cDNA generation rather than a site of transcript termination.
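
The reasoning in this caption (an end is trusted when at least 2 ESTs end there and a polyadenylation signal is nearby, and is suspected of internal priming when no signal is found but a genomic A tract coincides with the end) can be sketched as a simple classifier. This is a hedged illustration, not the published method: the hexamer list (`AATAAA` and the common variant `ATTAAA`), the 40-nt upstream search window, the 20-nt flank, and the A-run threshold of 8 are all assumptions, as are the function names. The sequence is assumed to be given in the sense of transcription, with `end` the index just past the last transcribed base.

```python
# Assumed signal set: the canonical hexamer and its most common variant.
PAS_HEXAMERS = ("AATAAA", "ATTAAA")


def has_pas(seq, end, upstream=40):
    """True if a PAS hexamer lies within `upstream` nt before index `end`."""
    region = seq[max(0, end - upstream):end].upper()
    return any(hexamer in region for hexamer in PAS_HEXAMERS)


def longest_a_run(seq, end, flank=20):
    """Length of the longest run of A's within `flank` nt either side of `end`."""
    region = seq[max(0, end - flank):end + flank].upper()
    best = run = 0
    for base in region:
        run = run + 1 if base == "A" else 0
        best = max(best, run)
    return best


def classify_end(seq, end, n_ests, min_ests=2, min_a_run=8):
    """Label a candidate end; all thresholds are illustrative assumptions."""
    if n_ests < min_ests:
        return "insufficient EST support"
    if has_pas(seq, end):
        return "candidate polyadenylation site"
    if longest_a_run(seq, end) >= min_a_run:
        return "possible internal priming"
    return "unsupported"


# Toy sequence with a PAS hexamer 17 nt upstream of the end.
genuine_upstream = "GCGTCGCGTC" + "AATAAA" + "GCTGCTGCTGCTGCTGC"
genuine = genuine_upstream + "TGTGTTGGCC"
genuine_end = len(genuine_upstream)

# Toy sequence with no PAS but a 16-nt genomic A tract at the end position.
primed_upstream = "GCGTCGCGTCGGCTGCTGCC"
primed = primed_upstream + "A" * 16 + "GCGTGC"
primed_end = len(primed_upstream)

print(classify_end(genuine, genuine_end, n_ests=5))
print(classify_end(primed, primed_end, n_ests=12))
```

The second toy case mirrors the yellow-oval end: strong EST support, no signal, and an A tract at the end position, so it is flagged rather than accepted.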

References

    1. Tian B, Hu J, Zhang H, Lutz C. A large-scale analysis of mRNA polyadenylation of human and mouse genes. Nucleic Acids Res. 2005;33:201–212.
    2. Zhang H, Lee JY, Tian B. Biased alternative polyadenylation in human tissues. Genome Biol. 2005;6:R100.
    3. Carninci P, Kasukawa T, Katayama S, Gough J, Frith M, et al. The transcriptional landscape of the mammalian genome. Science. 2005;309:1559–1563.
    4. Kan Z, States D, Gish W. Selecting for functional alternative splices in ESTs. Genome Res. 2002;12:1837–1845.
    5. Kan Z, Rouchka EC, Gish WR, States DJ. Gene structure prediction and alternative splicing analysis using genomically aligned ESTs. Genome Res. 2001;11:889–900.
