Genome-wide analysis of mRNA lengths in Saccharomyces cerevisiae

Evan H Hurowitz et al. Genome Biol. 2003.

Abstract

Background: Although the protein-coding sequences in the Saccharomyces cerevisiae genome have been studied and annotated extensively, much less is known about the extent and characteristics of the untranslated regions of yeast mRNAs.

Results: We developed a 'Virtual Northern' method that uses DNA microarrays for genome-wide, systematic analysis of mRNA lengths. We used this method to measure mRNAs corresponding to 84% of the annotated open reading frames (ORFs) in the S. cerevisiae genome, with high precision and accuracy (measurement errors of ±6-7%). We found a close linear relationship between mRNA lengths and the lengths of known or predicted translated sequences; mRNAs were typically around 300 nucleotides longer than the translated sequences. Analysis of genes deviating from that relationship identified ORFs with annotation errors, ORFs that appear not to be bona fide genes, and potentially novel genes. Interestingly, we found that systematic differences in the total length of the untranslated sequences in mRNAs were related to the functions of the encoded proteins.

Conclusions: The Virtual Northern method provides a practical and efficient method for genome-scale analysis of transcript lengths. Approximately 12-15% of the yeast genome is represented in untranslated sequences of mRNAs. A systematic relationship between the lengths of the untranslated regions in yeast mRNAs and the functions of the proteins they encode may point to an important regulatory role for these sequences.

Figures

Figure 1

Virtual Northern scheme.

Figure 2

Examples of length profiles. Each length profile is a plot of the normalized ratio from all 30 microarrays. The x-axis is the distance of the midpoint of each gel slice from the origin. The black line indicates the threshold fluorescence ratio for peak recognition of 2.0, and the closed circles represent the three points used to calculate the midpoint of each peak.
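The peak-recognition step described in this caption can be sketched roughly as follows. This is only an illustration: the caption gives the threshold ratio of 2.0 and says three points are used for the midpoint, but it does not say how those points are combined. The ratio-weighted average below, the function name find_peak_midpoint, and the choice of the highest point plus its two neighbours are all assumptions.

```python
def find_peak_midpoint(distances, ratios, threshold=2.0):
    """Return an estimated peak midpoint for one length profile, or None.

    distances: midpoint of each gel slice, measured from the origin.
    ratios:    normalized fluorescence ratio for each slice (assumed positive).
    threshold: ratio required for a peak to be recognized (2.0 in the caption).

    Assumption: the midpoint is the ratio-weighted mean distance of the
    highest point and its two neighbours. The source does not state the
    exact formula.
    """
    if len(distances) != len(ratios) or len(ratios) < 3:
        raise ValueError("need at least three matching (distance, ratio) points")

    best = max(range(len(ratios)), key=lambda i: ratios[i])
    if ratios[best] < threshold:
        return None  # no point reaches the recognition threshold

    # Take a three-point window centred on the maximum, shifted inward
    # when the maximum sits at either end of the profile.
    start = max(0, min(best - 1, len(ratios) - 3))
    window = range(start, start + 3)

    total = sum(ratios[i] for i in window)
    return sum(distances[i] * ratios[i] for i in window) / total
```

A profile with no slice at or above the threshold returns None, which mirrors the idea that no peak is recognized for that gene.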

Figure 3

Relationship between ORF length and transcript length. Measured transcript length in nucleotides (nts) is plotted against ORF length in base pairs (bps). Green squares and red triangles indicate ORFs whose length is greater than their transcript length and transcripts greater than their theoretical maximum length, respectively. Parentheses indicate how many spots of each type are plotted. The black line is the linear least-squares fit to the blue 'Good' ORF circles. It has the parameters y = 1.0031x + 311.88 (R² = 0.93).
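As a worked illustration of the fitted line in this caption, the sketch below computes the transcript length expected for a given ORF length and the residual for a measured transcript. The slope and intercept are taken from the caption; the function names are hypothetical.

```python
# Fit parameters reported in the Figure 3 caption: y = 1.0031x + 311.88
SLOPE = 1.0031
INTERCEPT_NT = 311.88


def expected_transcript_length(orf_length_bp):
    """Transcript length (nt) predicted by the fitted line for an ORF length (bp)."""
    return SLOPE * orf_length_bp + INTERCEPT_NT


def residual(observed_nt, orf_length_bp):
    """Observed minus expected transcript length, in nucleotides."""
    return observed_nt - expected_transcript_length(orf_length_bp)
```

For a 1,000 bp ORF the fit predicts a transcript of about 1,315 nt, consistent with the abstract's statement that mRNAs are typically around 300 nucleotides longer than the translated sequence.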

Figure 4

Relationship between the interval between flanking ORFs ('maximum' length) and transcript length. Measured transcript length in nucleotides (nts) for all nonoverlapping genes is plotted against the inter-ORF interval for each gene. Red squares indicate those transcripts whose length exceeds their theoretical maximum within the precision of the measurement. The identity line is shown in black for clarity. Parentheses indicate how many spots of each type are plotted.
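One way to express "exceeds the theoretical maximum within the precision of the measurement" is sketched below. The exact criterion used in the paper is not given in this excerpt; the sketch assumes a transcript is flagged only if it is still longer than the inter-ORF interval after subtracting a relative error of 7%, the upper end of the ±6-7% measurement error quoted in the abstract. The function name and the default error value are assumptions.

```python
def exceeds_maximum(measured_nt, inter_orf_interval_nt, rel_error=0.07):
    """Return True if a transcript is longer than its theoretical maximum.

    Assumption: the measured length is reduced by rel_error before the
    comparison, so that transcripts within measurement error of the
    interval are not flagged.
    """
    lower_bound = measured_nt * (1.0 - rel_error)
    return lower_bound > inter_orf_interval_nt
```

With these defaults, a 2,000 nt transcript in a 1,500 nt interval is flagged, while a 1,550 nt transcript in the same interval is not.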

Figure 5

Distribution of residuals between observed and expected transcript lengths. The distributions of the residuals are plotted for all genes, for 209 genes annotated as ribosomal subunits and for 245 genes annotated as having transcription factor activity. The number of genes in each length bin is plotted as a percent of the total number of genes in that distribution.
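The normalization described in this caption, where each bin is reported as a percent of the genes in its own distribution, can be sketched as below. The bin width of 100 nt and the function name are assumptions; the caption does not state the bin size that was used.

```python
import math
from collections import Counter


def percent_by_bin(residuals, bin_width=100):
    """Map each bin's lower edge to the percent of residuals falling in it.

    Bins are half-open intervals [edge, edge + bin_width). Normalizing by
    the size of each gene set lets sets of different sizes (for example
    all genes versus 209 ribosomal subunit genes) be compared directly.
    """
    counts = Counter(math.floor(r / bin_width) * bin_width for r in residuals)
    total = len(residuals)
    return {edge: 100.0 * n / total for edge, n in sorted(counts.items())}
```

Calling the function once per gene set, with the same bin width each time, gives distributions that can be compared on a common percent scale.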

Figure 6

Adjacent ORF anomalies. Eight different classes of adjacent ORF anomalies are pictured schematically. Open boxes represent ORFs, and their arrows represent their orientation from translational start to stop. Dashed open boxes represent non-conserved ORFs. The solid lines indicate the transcripts detected by Virtual Northern and RACE analysis. The dashed portions of those lines represent variation in the extent of 3'-UTR overlap between different cases. The number following each title lists the number of loci in that class.

Figure 7

Calibrating the relationship between gel mobility and transcript length. The precise gel mobilities of 94 transcripts are plotted against the base 10 logs of their exact lengths based on their 5'- and 3'-ends as determined by RACE. The exponential fits to the top, middle and bottom 2 cm of gel are shown by three black lines with the best fit parameters y = -0.0167x + 4.3456 (R² = 0.94), y = -0.0176x + 4.4088 (R² = 0.92) and y = -0.0228x + 4.8286 (R² = 0.91) respectively.
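The three calibration lines can be turned into a mobility-to-length conversion along the following lines. This sketch assumes that x is the gel mobility and y is the base 10 log of the transcript length in nucleotides, which is how the caption reads. The units of mobility and the exact boundaries between the three gel segments are not given in this excerpt, so the caller names the segment explicitly. The dictionary and function names are hypothetical.

```python
# (slope, intercept) pairs from the Figure 7 caption, keyed by gel segment.
# Assumed form: log10(length_nt) = slope * mobility + intercept.
CALIBRATION = {
    "top": (-0.0167, 4.3456),
    "middle": (-0.0176, 4.4088),
    "bottom": (-0.0228, 4.8286),
}


def length_from_mobility(mobility, segment):
    """Estimate transcript length (nt) from gel mobility for a given segment.

    mobility: distance migrated, in the same (unstated) units as the fit.
    segment:  "top", "middle" or "bottom", chosen by the caller because the
              segment boundaries are not specified in the caption.
    """
    slope, intercept = CALIBRATION[segment]
    return 10 ** (slope * mobility + intercept)
```

Because every slope is negative, the estimated length falls as mobility increases within a segment, as expected for gel electrophoresis, where shorter RNAs migrate further.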
