RNAcode: robust discrimination of coding and noncoding regions in comparative sequence data - PubMed (original) (raw)

RNAcode: robust discrimination of coding and noncoding regions in comparative sequence data

Stefan Washietl et al. RNA. 2011 Apr.

Abstract

With the availability of genome-wide transcription data and massive comparative sequencing, the discrimination of coding from noncoding RNAs and the assessment of coding potential in evolutionarily conserved regions arose as a core analysis task. Here we present RNAcode, a program to detect coding regions in multiple sequence alignments that is optimized for emerging applications not covered by current protein gene-finding software. Our algorithm combines information from nucleotide substitution and gap patterns in a unified framework and also deals with real-life issues such as alignment and sequencing errors. It uses an explicit statistical model with no machine learning component and can therefore be applied "out of the box," without any training, to data from all domains of life. We describe the RNAcode method and apply it in combination with mass spectrometry experiments to predict and confirm seven novel short peptides in Escherichia coli and to analyze the coding potential of RNAs previously annotated as "noncoding." RNAcode is open source software and available for all major platforms at http://wash.github.com/rnacode.

PubMed Disclaimer

Figures

FIGURE 1.

FIGURE 1.

Overview of the RNAcode algorithm. First, a phylogenetic tree is estimated from the input alignment including a reference sequence (darker line) under a noncoding (neutral) nucleotide model. From this background model and a protein similarity matrix, a normalized substitution score is derived to evaluate observed mutations for evidence of negative selection. This substitution score and a gap scoring scheme are the basis for a dynamic programming (DP) algorithm to find local high-scoring coding segments. To estimate the statistical significance of these segments, a background score distribution is estimated from randomized alignments that are simulated along the same phylogenetic tree. The parameters of the extreme value distributed random scores are estimated and used to assign _P_-values to the observed segments in the native alignment.

FIGURE 2.

FIGURE 2.

Examples of typical gap patterns and scoring paths in a pairwise alignment assumed to be coding. Nucleotides are shown as blocks, codons as three consecutive blocks of the same shading. (A) A gap of length three does not change the reading frame and in-frame-aligned codons are scored with the normalized substitution score σ. (B) A single gap destroys the reading frame but gets corrected downstream by another gap. The triplets that are out-of-phase because of this obvious alignment error are penalized by the two frameshift penalties Ω and ω. (C) A single gap that, in principle, destroys the reading frame is interpreted as a sequence error. Penalized by a high negative score Δ, this frameshift is ignored, and downstream codons are considered to be in-phase.

FIGURE 3.

FIGURE 3.

RNAcode results on comparative test sets from various species. (A) Score distributions of annotated coding regions and randomly chosen noncoding regions in the Drosophila test set. (B) ROC curves for all six test sets. The full curve for all ranges of sensitivity/specificity from 0 to 1 is shown in the main diagrams. (Insets) The high specificity rate with false positive rates from 0 to 0.1. (C) Score distribution of noncoding alignments. The same distribution of the Drosophila test set as shown in A is shown in more detail. The fitted Gumbel distribution is shown as dotted line. (Upper right diagram) Comparison of the calculated _P_-values (via simulation and fitting of the Gumbel distribution) to the empirical _P_-values, i.e., the actual observed frequencies in the test set.

FIGURE 4.

FIGURE 4.

Comparison of the RNAcode substitution score with other comparative metrics. The ROC curves show the classification performance of the dN/dS ratio, substitution rate variation, and the average substitution score σ used by RNAcode. Results are shown for alignments of length 30 from vertebrates, archaebacteria, yeasts, and drosophilid species grouped by the number of sequences in the alignment (N) and the mean pairwise sequence identity (MPI). The area under the ROC curve (AUC) as a measure for classification performance is shown for all methods and sets.

FIGURE 5.

FIGURE 5.

Examples of novel short proteins in Escherichia coli. Sequence, genomic context, the high-scoring RNAcode segment, and fragment ion mass spectra are shown. Genome browser screenshots were made at

http://archaea.ucsc.edu

(Schneider et al. 2006). Arrows within annotated elements indicate their reading direction. The shading of mutational patterns was directly produced by the RNAcode program. The full species names for the abbreviations can be found in Supplemental Table 1. The mass spectra are shown for two selected proteolytic peptides, which were scored with 80% probability and used in combination with the detection of additional peptides to confirm the expression of the candidates (for details, see Supplemental Table 3). The proteins shown in A and B correspond to candidates 28 and 19, respectively, listed in Supplemental Tables 2 and 3.

FIGURE 6.

FIGURE 6.

Examples of ambiguities between the coding and noncoding nature of three RNAs. (A) The RNA C0343 from E. coli is listed as a noncoding RNA in Rfam. However, it overlaps with an RNAcode-predicted coding segment. While there is no evidence for a RNA secondary structure according to the RNAz classification value, the highly significant RNAcode prediction and MS experiments suggest that C0343 is an mRNA and not an ncRNA. (B) RNAIII of Staphylococcus aureus (Rfam RF00503) contains a short ORF of a hemolysin gene. RNAcode predicts the open reading frame at the correct position, while RNAz clearly detects a structural signal. These results are consistent with the well-established dual nature of this molecule. (C) The Bacillus subtilis RNA SR1 is known to have function on the RNA level by targeting an mRNA. RNAcode detects a short ORF that was shown by Gimpel et al. (2010) to produce a small peptide and is thus another example of a dual-function RNA.

FIGURE 7.

FIGURE 7.

Finite state automaton representing the scoring of pairwise alignments. The three states correspond to the relative phases of the sequences. Insertions and deletions with z ≠ 0 lead to local changes in-phase that are penalized by Ω. Extension in each of the two out-of-frame states S+ and S− is penalized by ω. In/dels interpreted as sequencing errors or true frameshifts are penalized by Δ.

Similar articles

Cited by

References

    1. Aebersold R, Mann M 2003. Mass spectrometry-based proteomics. Nature 422: 198–207 - PubMed
    1. Badger JH, Olsen GJ 1999. CRITICA: Coding region identification tool invoking comparative analysis. Mol Biol Evol 16: 512–524 - PubMed
    1. Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al. 2004. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 14: 708–715 - PMC - PubMed
    1. Bofkin L, Goldman N 2007. Variation in evolutionary processes at different codon positions. Mol Biol Evol 24: 513–521 - PubMed
    1. Boisset S, Geissmann T, Huntzinger E, Fechter P, Bendridi N, Possedko M, Chevalier C, Helfer AC, Benito Y, Jacquier A, et al. 2007. Staphylococcus aureus RNAIII coordinately represses the synthesis of virulence factors and the transcription regulator Rot by an antisense mechanism. Genes Dev 21: 1353–1366 - PMC - PubMed

MeSH terms

Substances

LinkOut - more resources