Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change

Andrew V Uzilov et al. BMC Bioinformatics. 2006.

Abstract

Background: Non-coding RNAs (ncRNAs) have a multitude of roles in the cell, many of which remain to be discovered. However, it is difficult to detect novel ncRNAs in biochemical screens. To advance biological knowledge, computational methods that can accurately detect ncRNAs in sequenced genomes are therefore desirable. The increasing number of genomic sequences provides a rich dataset for computational comparative sequence analysis and detection of novel ncRNAs.

Results: Here, Dynalign, a program that predicts the secondary structure common to two RNA sequences by minimizing folding free energy change, is used as a computational ncRNA detection tool. The Dynalign-computed optimal total free energy change, which scores the structural alignment and the free energy change of folding into a common structure for two RNA sequences, is shown to be an effective measure for distinguishing ncRNA from randomized sequences. To classify an input sequence pair as an ncRNA, its total free energy change can either be compared with the total free energy changes of a set of control sequence pairs, or be used together with sequence length and nucleotide frequencies as input to a classification support vector machine. The latter method is much faster, but slightly less sensitive at a given specificity. Additionally, the classification support vector machine method is shown to be sensitive and specific on genomic ncRNA screens of two different Escherichia coli and Salmonella typhi genome alignments, in which many ncRNAs are known. The Dynalign computational experiments are also compared with two other ncRNA detection programs, RNAz and QRNA.

Conclusion: The Dynalign-based support vector machine method is more sensitive for known ncRNAs in the test genomic screens than RNAz and QRNA. Additionally, both Dynalign-based methods are more sensitive than RNAz and QRNA at low sequence pair identities. Dynalign is comparable to, or more accurate than, RNAz and QRNA in genomic screens, especially for low-identity regions. Dynalign provides a method for discovering ncRNAs in sequenced genomes that other methods may not identify. Significant improvements in Dynalign runtime have also been achieved.

Figures

Figure 1

Distribution of single sequence z scores for 5S rRNA, tRNA, and negative sequences. Distributions of RNAstructure-predicted z scores, computed by folding single sequences, for 5S rRNA and negatives generated from them (left panel) and for tRNA and negatives generated from them (right panel). Real ncRNAs are shown in white, negatives in black. Controls were generated by the Altschul-Erickson dinucleotide shuffle of the original sequence, with 100 controls for each test set sequence. The test set comprised 309 5S rRNA sequences and 482 tRNA sequences, plus one negative sequence generated from each real sequence by the Altschul-Erickson shuffle.
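The z score used in these figures can be illustrated with a minimal sketch. It assumes the usual definition: the folding free energy change of the input, minus the mean over its shuffled controls, divided by the standard deviation of the controls. The function name, the use of the sample standard deviation, and the numbers are illustrative assumptions, not taken from the paper or from RNAstructure.

```python
from statistics import mean, stdev

def z_score(native_dg, control_dgs):
    """Number of control standard deviations separating native_dg from the control mean."""
    if len(control_dgs) < 2:
        raise ValueError("need at least two control values")
    mu = mean(control_dgs)
    sigma = stdev(control_dgs)  # sample standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("controls have zero variance")
    return (native_dg - mu) / sigma

# Hypothetical free energy changes in kcal/mol; illustrative numbers only.
controls = [-20.0, -22.0, -18.0, -21.0, -19.0]
z = z_score(-30.0, controls)
print(round(z, 3))
```

Under this convention a strongly negative z score means the input folds more stably than its shuffled controls, which is the signal the histograms separate.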

Figure 2

Quality of classification using the z score method for single sequences. ROC curves showing quality of classification based on single sequences, using RNAstructure-predicted z scores for folding free energy change. The ncRNA sequences and controls are the same as in Figure 1. Red and green show results for 5S rRNA and tRNA, respectively, when tested separately; blue shows results when both are combined into a single test set.
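How an ROC curve follows from a set of z scores can be sketched as below. This is a generic construction, not the authors' code: a sequence is called ncRNA when its z score is at or below a threshold, and each threshold gives one (false positive rate, sensitivity) point, where false positive rate is 1 minus specificity. The function name and the data are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false positive rate, sensitivity) points, one per threshold.

    scores: z scores; more negative is taken to mean more ncRNA-like.
    labels: True for real ncRNA, False for negatives.
    """
    n_pos = sum(1 for is_real in labels if is_real)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one real and one negative example")
    points = [(0.0, 0.0)]  # threshold below every score: nothing is called ncRNA
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, is_real in zip(scores, labels) if is_real and s <= threshold)
        fp = sum(1 for s, is_real in zip(scores, labels) if not is_real and s <= threshold)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical z scores: three real ncRNAs and three shuffled negatives.
scores = [-6.0, -4.5, -1.0, 0.2, -3.0, 1.1]
labels = [True, True, False, False, True, False]
for fpr, sens in roc_points(scores, labels):
    print(f"FPR={fpr:.2f} sensitivity={sens:.2f}")
```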

Figure 3

Comparison of three methods for generating 20 controls from each input sequence pair. ROC curves comparing three methods for generating a set of 20 controls from an input sequence pair to determine the z score for ncRNA classification using the Dynalign-computed ΔG°total. The test set contains 755 5S rRNA and 896 tRNA sequence pairs, plus one negative sequence pair generated from each real sequence pair, yielding 3,302 trial pairs total. All tests are run with the parameter M = 8. "dinuc controls" (green): controls are generated by sampling from a first-order Markov chain, approximately preserving dinucleotide frequencies of each original sequence. "AE controls" (orange): controls are generated by the Altschul-Erickson dinucleotide shuffle, exactly preserving dinucleotide frequencies of each original sequence. "column controls" (blue): controls are generated by a columnwise shuffle of a global sequence alignment, without regard for gap placement or local conservation.
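Two of the three control generation methods named in this caption can be sketched briefly. The code below is a rough illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the Markov chain keeps the original first nucleotide, and it falls back to single-nucleotide sampling when a nucleotide has no observed successor. The Altschul-Erickson shuffle is more involved and is deliberately left out.

```python
import random
from collections import defaultdict

def markov_control(seq, rng):
    """Sample a same-length sequence from a first-order Markov chain fitted to seq.

    Dinucleotide frequencies are preserved only approximately.
    """
    if not seq:
        raise ValueError("empty sequence")
    successors = defaultdict(list)
    for current, following in zip(seq, seq[1:]):
        successors[current].append(following)
    out = [seq[0]]  # assumption: start from the original first nucleotide
    while len(out) < len(seq):
        options = successors.get(out[-1])
        # Fall back to single-nucleotide frequencies if the current
        # nucleotide never had a successor in the original sequence.
        out.append(rng.choice(options if options else seq))
    return "".join(out)

def column_shuffle_control(aligned_a, aligned_b, rng):
    """Shuffle whole alignment columns, ignoring gap placement and local conservation."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("alignment rows must have equal length")
    columns = list(zip(aligned_a, aligned_b))
    rng.shuffle(columns)
    new_a = "".join(a for a, _ in columns)
    new_b = "".join(b for _, b in columns)
    return new_a, new_b

rng = random.Random(0)  # fixed seed so the sketch is repeatable
print(markov_control("GGCAUCGAUCGGAUUACG", rng))
print(column_shuffle_control("ACG-UGCA", "A-GCUGCA", rng))
```

Because the column shuffle moves both rows together, each control pair keeps the percent identity and per-sequence composition of the input pair. Removing the "-" characters from each returned row gives the unaligned control sequences.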

Figure 4

Distribution of sequence pair z scores for 5S rRNA, tRNA, and negative sequences. Distribution of z scores computed using the Dynalign ΔG°total and the columnwise shuffle control method (M = 8) for 5S rRNA sequence pairs and negatives generated from them (left panel) and tRNA sequence pairs and negatives generated from them (right panel). Real ncRNAs are shown in white, negatives in black. The test set is the same as for Figure 3.

Figure 5

Higher M parameter improves quality of classification when using the z score method. ROC curves comparing effectiveness of the best control generation method for sequence pairs (i.e. columnwise shuffle of a global sequence alignment) at parameters M = 6 (dark blue) and M = 8 (light blue). The test set is the same as for Figure 3. For all other control generation methods, increasing the M parameter value likewise increases the quality of classification (see Additional File 1 in "Additional Files" for supporting figure).

Figure 6

Quality of classification using the z score method, broken down by ncRNA family. ROC curves showing effectiveness of the best control generation method for sequence pairs (i.e. columnwise shuffle of global alignment at parameter M = 8) for 5S rRNA by itself (red), tRNA by itself (green), and both combined into one test set (blue). The 5S rRNA or tRNA sequences in the test set are the same as those used for the test set in Figures 3, 4 and 5.

Figure 7

Comparison of the z score classification method using single sequences versus using sequence pairs. ROC curves comparing quality of classification based on single sequences versus based on sequence pairs, using the same free energy parameters for both. The single sequence curve (blue) is the same as in Figure 2. Black shows the best results for the sequence pair approach from Figure 3 (i.e. control generation by columnwise shuffle of global sequence alignment at parameter M = 8), to illustrate the difference in prediction quality.

Figure 8

Comparison of the Dynalign z score method with RNAz for sequence pairs of all identities. ROC curves for the Dynalign z score classification method (running 20 controls for each input sequence pair to determine z score, M = 8; blue) and RNAz (red), both tested on the same test set of sequence pairs as in Figure 3.

Figure 9

Comparison of the Dynalign z score method with RNAz for sequence pairs below 50% identity. ROC curves for the Dynalign z score classification method (running 20 controls for each input sequence pair, M = 8; blue) and RNAz (red), both tested only on those sequence pairs from the Figure 3 test set that have less than 50% sequence pair identity. Dynalign becomes more sensitive than RNAz at low sequence pair identities for all specificities.

Figure 10

Comparison of the Dynalign/LIBSVM classifier with RNAz for sequence pairs of all identities. ROC curves for the Dynalign/LIBSVM classifier (blue) and RNAz (red), both based on a test set of 38,069 5S rRNA sequence pairs, 52,470 tRNA sequence pairs, plus two negative sequence pairs generated from each real sequence pair – one by a columnwise shuffle of a global alignment, one by an Altschul-Erickson dinucleotide shuffle of each sequence in the pair separately, yielding 90,539 real trial sequence pairs and 181,078 negative trial sequence pairs.

Figure 11

Comparison of the Dynalign/LIBSVM classifier with RNAz for sequence pairs below 50% identity. ROC curves for the Dynalign/LIBSVM classifier method (blue) and RNAz (red), both tested only on those sequence pairs from the Figure 10 test set that have less than 50% sequence pair identity.

Figure 12

Comparison of the Dynalign z score method with the Dynalign/LIBSVM classifier. ROC curves for the Dynalign z score method, M = 8 (blue, column shuffle controls of the global alignments; orange, Altschul-Erickson dinucleotide shuffle controls; green, first-order Markov chain sampling controls) versus the Dynalign/LIBSVM classifier (pink). The z score ROC curves are from Figure 3; the Dynalign/LIBSVM ROC curve is from Figure 10.

Figure 13

ncRNA probabilities (P values) of scanning windows iterating through a 16S rRNA. Probabilities of ncRNA computed by the Dynalign/LIBSVM classifier for 30 150-nucleotide-long scanning windows iterating through a global alignment of Borrelia burgdorferi and Bacillus subtilis 16S rRNA in steps of 75.
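The scanning scheme (150-column windows advanced in steps of 75) can be sketched as follows. The function name and the handling of the alignment end are assumptions: this version emits only full-size windows, so trailing columns that do not fill a window are not covered, and the 600-column alignment length is hypothetical.

```python
def scanning_windows(n_columns, size=150, step=75):
    """Return half-open (start, end) column ranges of overlapping scanning windows."""
    if n_columns <= 0 or size <= 0 or step <= 0:
        raise ValueError("n_columns, size and step must be positive")
    # An alignment shorter than one window yields a single truncated window.
    last_start = max(n_columns - size, 0)
    return [(start, min(start + size, n_columns))
            for start in range(0, last_start + 1, step)]

# Hypothetical 600-column alignment.
windows = scanning_windows(600)
print(len(windows), windows[0], windows[-1])
```

With a step of half the window size, every interior column falls in two windows, so a structured region that straddles one window boundary is still seen whole by the neighbouring window.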

Figure 14

ncRNA probabilities (P values) of scanning windows iterating through a 23S rRNA. Probabilities of ncRNA computed by the Dynalign/LIBSVM classifier for 57 150-nucleotide-long scanning windows iterating through a global alignment of Bacillus subtilis and Bos taurus 23S rRNA in steps of 75.

Figure 15

Distribution of scanning window percent identities in the MUMmer whole genome ncRNA screen. Histogram showing the distribution of percent identities of 15,214 genomic windows (size 150 alignment columns, scanning step size 75 alignment columns), generated from the MUMmer whole genome alignment of E. coli and S. typhi.

Figure 16

Distribution of scanning window percent identities in the WuBLASTn whole genome ncRNA screen. Histogram showing the distribution of percent identities of 90,404 genomic windows (size 150 alignment columns, scanning step size 75 alignment columns), generated from the WuBLASTn whole genome alignment of E. coli and S. typhi.

Figure 17

Distribution of percent identities of 50-nucleotide windows in the human-mouse genome alignment. The BLASTZ pairwise alignment of the human and mouse genomes [73] is broken down into 50-nucleotide-long non-overlapping windows and the percent identity for each is calculated, then plotted in this histogram. There are 22,456,315 windows total.
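The per-window percent identity behind a histogram like this one can be sketched as below. The definition used here is an assumption, since the caption does not state one: identical, non-gap columns divided by the number of alignment columns in the window. The function names are hypothetical, and a window size of 4 is used only to keep the example readable.

```python
def percent_identity(row_a, row_b):
    """Percent of alignment columns in which both rows hold the same nucleotide."""
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("rows must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return 100.0 * matches / len(row_a)

def window_identities(row_a, row_b, size=50):
    """Percent identity of each full, non-overlapping window of the alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must be of equal length")
    return [percent_identity(row_a[i:i + size], row_b[i:i + size])
            for i in range(0, len(row_a) - size + 1, size)]

# Toy alignment with a window size of 4 instead of 50.
print(window_identities("ACGUACGU", "ACGUAC-A", size=4))
```

Other conventions, such as dividing by the length of the shorter sequence or skipping gapped columns, give different values for gapped windows, so the denominator should be checked against the original methods before comparing numbers.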

References

    1. Nissen P, Hansen J, Ban N, Moore PB, Steitz TA. The structural basis of ribosomal activity in peptide bond synthesis. Science. 2000;289:920–930. doi: 10.1126/science.289.5481.920.
    2. Hansen JL, Schmeing TM, Moore PB, Steitz TA. Structural insights into peptide bond formation. Proc Natl Acad Sci U S A. 2002;99:11670–11675. doi: 10.1073/pnas.172404099.
    3. Walter P, Blobel G. Signal recognition particle contains a 7S RNA essential for protein translocation across the endoplasmic reticulum. Nature. 1982;299:691–698. doi: 10.1038/299691a0.
    4. Cullen BR. RNA interference: antiviral defense and genetic tool. Nat Immunol. 2002;3:597–599. doi: 10.1038/ni0702-597.
    5. Doudna JA, Cech TR. The chemical repertoire of natural ribozymes. Nature. 2002;418:222–228. doi: 10.1038/418222a.