Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change
Andrew V Uzilov et al. BMC Bioinformatics. 2006.
Abstract
Background: Non-coding RNAs (ncRNAs) have a multitude of roles in the cell, many of which remain to be discovered. However, it is difficult to detect novel ncRNAs in biochemical screens. To advance biological knowledge, computational methods that can accurately detect ncRNAs in sequenced genomes are therefore desirable. The increasing number of genomic sequences provides a rich dataset for computational comparative sequence analysis and detection of novel ncRNAs.
Results: Here, Dynalign, a program for predicting secondary structures common to two RNA sequences on the basis of minimizing folding free energy change, is used as a computational ncRNA detection tool. The Dynalign-computed optimal total free energy change, which scores the structural alignment and the free energy change of folding into a common structure for two RNA sequences, is shown to be an effective measure for distinguishing ncRNA from randomized sequences. To classify an input sequence pair as ncRNA, its total free energy change can either be compared with the total free energy changes of a set of control sequence pairs, or be used together with sequence length and nucleotide frequencies as input to a classification support vector machine. The latter method is much faster, but slightly less sensitive at a given specificity. Additionally, the classification support vector machine method is shown to be sensitive and specific on genomic ncRNA screens of two different Escherichia coli and Salmonella typhi genome alignments, in which many ncRNAs are known. The Dynalign computational experiments are also compared with two other ncRNA detection programs, RNAz and QRNA.
Conclusion: The Dynalign-based support vector machine method is more sensitive for known ncRNAs in the test genomic screens than RNAz and QRNA. Additionally, both Dynalign-based methods are more sensitive than RNAz and QRNA at low sequence pair identities. Dynalign can be used as a tool that is comparable to, or more accurate than, RNAz or QRNA in genomic screens, especially for low-identity regions. Dynalign provides a method for discovering ncRNAs in sequenced genomes that other methods may not identify. Significant improvements in Dynalign runtime have also been achieved.
Figures
Figure 1
Distribution of single sequence z scores for 5S rRNA, tRNA, and negative sequences. Distributions of RNAstructure-predicted z scores computed on the basis of folding single sequences for 5S rRNA and negatives generated from them (left figure) and tRNA and negatives generated from them (right figure). Real ncRNA are white, negatives are black. Controls were generated by the Altschul-Erickson dinucleotide shuffle of the original sequence, with 100 controls for each test set sequence. The test set comprised 309 5S rRNA sequences and 482 tRNA sequences, plus one negative sequence generated from each real sequence by the Altschul-Erickson shuffle.
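The z score in the caption above compares one folding free energy change with the distribution obtained from shuffled controls. A minimal sketch follows, assuming the usual definition (observed value minus control mean, divided by control standard deviation). The function name, the choice of the sample standard deviation, and the example energies are illustrative assumptions, not values from the paper.

```python
from statistics import mean, stdev


def z_score(observed_dg, control_dgs):
    """Return the z score of an observed free energy change against controls.

    observed_dg: predicted folding free energy change of the test sequence.
    control_dgs: predicted free energy changes of its shuffled controls.
    A strongly negative z score means the test sequence folds more stably
    than its controls.
    """
    if len(control_dgs) < 2:
        raise ValueError("need at least two control values")
    mu = mean(control_dgs)
    sigma = stdev(control_dgs)  # sample standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("controls have zero variance")
    return (observed_dg - mu) / sigma


# Hypothetical energies in kcal/mol, for illustration only.
controls = [-20.0, -22.0, -18.0, -21.0, -19.0]
z = z_score(-30.0, controls)
```

With 100 controls per sequence, as in Figure 1, the same function applies unchanged; only the length of `control_dgs` differs.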
Figure 2
Quality of classification using the z score method for single sequences. ROC curves showing quality of classification based on single sequences, using RNAstructure-predicted z scores for folding free energy change. The ncRNA sequences and controls are the same as in Figure 1. Red and green show results for 5S rRNA and tRNA, respectively, when tested separately; blue shows results when both are combined into a single test set.
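An ROC curve of this kind can be traced by sweeping a z score threshold and recording sensitivity against the false positive rate (1 minus specificity) at each step. The sketch below assumes that lower z scores indicate ncRNA and that a sequence is called positive when its score is at or below the threshold; the function name and the example scores are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, sensitivity) pairs, one per threshold.

    scores: z scores, where lower (more negative) means more ncRNA-like.
    labels: True for real ncRNA, False for negatives.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative")
    points = [(0.0, 0.0)]  # threshold below every score: nothing is called
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y and s <= threshold)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s <= threshold)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical scores: two real ncRNAs and two shuffled negatives.
curve = roc_points([-6.0, -4.0, -1.0, 0.5], [True, True, False, False])
```

This quadratic-time loop is meant for clarity; a sorted single pass would be preferable for test sets as large as those in the later figures.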
Figure 3
Comparison of three methods for generating 20 controls from each input sequence pair. ROC curves comparing three methods for generating a set of 20 controls from an input sequence pair to determine the z score for ncRNA classification using the Dynalign-computed ΔG°total. The test set contains 755 5S rRNA and 896 tRNA sequence pairs, plus one negative sequence pair generated from each real sequence pair, yielding 3,302 trial pairs total. All tests are run with the parameter M = 8. "dinuc controls" (green): controls are generated by sampling from a first-order Markov chain, approximately preserving dinucleotide frequencies of each original sequence. "AE controls" (orange): controls are generated by the Altschul-Erickson dinucleotide shuffle, exactly preserving dinucleotide frequencies of each original sequence. "column controls" (blue): controls are generated by a columnwise shuffle of a global sequence alignment, without regard for gap placement or local conservation.
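Two of the three control generators described above are short enough to sketch. The code below is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the Markov chain starts from the original first nucleotide, and a dead end (a symbol with no observed successor) falls back to sampling from the overall composition. The Altschul-Erickson shuffle is omitted because it requires a more involved graph-based procedure.

```python
import random


def column_shuffle(aligned_a, aligned_b, rng):
    """Shuffle alignment columns, keeping the two characters of a column together.

    Gaps ('-') are moved along with their columns, without regard for gap
    placement or local conservation.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = list(zip(aligned_a, aligned_b))
    rng.shuffle(columns)
    shuffled_a = "".join(col[0] for col in columns)
    shuffled_b = "".join(col[1] for col in columns)
    return shuffled_a, shuffled_b


def markov_sample(seq, rng):
    """Sample a same-length sequence from a first-order Markov chain fit to seq.

    Dinucleotide frequencies are preserved only approximately.
    """
    if not seq:
        return ""
    transitions = {}
    for cur, nxt in zip(seq, seq[1:]):
        transitions.setdefault(cur, []).append(nxt)
    out = [seq[0]]  # assumption: start from the original first nucleotide
    while len(out) < len(seq):
        # Fall back to overall composition if the last symbol has no successor.
        choices = transitions.get(out[-1]) or seq
        out.append(rng.choice(choices))
    return "".join(out)


rng = random.Random(0)  # fixed seed so the illustration is repeatable
ctrl_a, ctrl_b = column_shuffle("GGA-CU", "GCAUCU", rng)
ctrl_single = markov_sample("GGGAAACCCUUU", rng)
```

Before a column-shuffled control is folded, the gap characters would be removed from each sequence, for example with `ctrl_a.replace("-", "")`.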
Figure 4
Distribution of sequence pair z scores for 5S rRNA, tRNA, and negative sequences. Distribution of z scores computed using the Dynalign ΔG°total and the columnwise shuffle control method (M = 8) for 5S rRNA sequence pairs and negatives generated from them (left figure) and tRNA sequence pairs and negatives generated from them (right figure). Real ncRNA are white, negatives are black. Test set is the same as for Figure 3.
Figure 5
Higher M parameter improves quality of classification when using the z score method. ROC curves comparing effectiveness of the best control generation method for sequence pairs (i.e. columnwise shuffle of a global sequence alignment) at parameters M = 6 (dark blue) and M = 8 (light blue). Test set is same as for Figure 3. For all other control generation methods, increasing the M parameter value likewise increases the quality of classification (see Additional File 1 in "Additional Files" for supporting figure).
Figure 6
Quality of classification using the z score method, broken down by ncRNA family. ROC curves showing effectiveness of the best control generation method for sequence pairs (i.e. columnwise shuffle of global alignment at parameter M = 8) for 5S rRNA by itself (red), tRNA by itself (green), and both combined into one test set (blue). The 5S rRNA or tRNA sequences in the test set are the same as those used for the test set in Figures 3, 4 and 5.
Figure 7
Comparison of the z score classification method using single sequences versus using sequence pairs. ROC curves comparing quality of classification based on single sequences versus based on sequence pairs, using the same free energy parameters for both. The single sequence curve (blue) is the same as in Figure 2. Black shows the best results for the sequence pair approach from Figure 3 (i.e. control generation by columnwise shuffle of global sequence alignment at parameter M = 8), to illustrate the difference in prediction quality.
Figure 8
Comparison of the Dynalign z score method with RNAz for sequence pairs of all identities. ROC curves for the Dynalign z score classification method (running 20 controls for each input sequence pair to determine z score, M = 8; blue) and RNAz (red), both tested on the same test set of sequence pairs as in Figure 3.
Figure 9
Comparison of the Dynalign z score method with RNAz for sequence pairs below 50% identity. ROC curves for the Dynalign z score classification method (running 20 controls for each input sequence pair, M = 8; blue) and RNAz (red), both tested only on those sequence pairs from the Figure 3 test set that have less than 50% sequence pair identity. Dynalign becomes more sensitive than RNAz at low sequence pair identities for all specificities.
Figure 10
Comparison of the Dynalign/LIBSVM classifier with RNAz for sequence pairs of all identities. ROC curves for the Dynalign/LIBSVM classifier (blue) and RNAz (red), both based on a test set of 38,069 5S rRNA sequence pairs, 52,470 tRNA sequence pairs, plus two negative sequence pairs generated from each real sequence pair – one by a columnwise shuffle of a global alignment, one by an Altschul-Erickson dinucleotide shuffle of each sequence in the pair separately, yielding 90,539 real trial sequence pairs and 181,078 negative trial sequence pairs.
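The abstract states that the support vector machine takes the total free energy change together with sequence length and nucleotide frequencies as input. One way such an input vector might be assembled is sketched below. The exact feature layout (both lengths, four frequencies per sequence) and the function names are assumptions for illustration, and the energy value is made up.

```python
def nucleotide_frequencies(seq, alphabet="ACGU"):
    """Return the fraction of each nucleotide in seq, in alphabet order."""
    seq = seq.upper().replace("T", "U")
    total = sum(seq.count(n) for n in alphabet)
    if total == 0:
        raise ValueError("sequence contains no recognized nucleotides")
    return [seq.count(n) / total for n in alphabet]


def feature_vector(dg_total, seq_a, seq_b):
    """Assemble one classifier input for a sequence pair.

    Assumed layout: [dG_total, len(a), len(b), freqs of a, freqs of b].
    """
    return (
        [float(dg_total), float(len(seq_a)), float(len(seq_b))]
        + nucleotide_frequencies(seq_a)
        + nucleotide_frequencies(seq_b)
    )


# Hypothetical pair and energy (kcal/mol), for illustration only.
features = feature_vector(-45.2, "GGCAU", "GCAU")
```

A vector like this would then be passed to whatever SVM library is in use; that step is not shown because it depends on the library.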
Figure 11
Comparison of the Dynalign/LIBSVM classifier with RNAz for sequence pairs below 50% identity. ROC curves for the Dynalign/LIBSVM classifier method (blue) and RNAz (red), both tested only on those sequence pairs from the Figure 10 test set that have less than 50% sequence pair identity.
Figure 12
Comparison of the Dynalign z score method with the Dynalign/LIBSVM classifier. ROC curves for the Dynalign z score method, M = 8 (blue, column shuffle controls of the global alignments; orange, Altschul-Erickson dinucleotide shuffle controls; green, first-order Markov chain sampling controls) versus the Dynalign/LIBSVM classifier (pink). The z score ROC curves are from Figure 3; the Dynalign/LIBSVM ROC curve is from Figure 10.
Figure 13
ncRNA probabilities (P values) of scanning windows iterating through a 16S rRNA. Probabilities of ncRNA computed by the Dynalign/LIBSVM classifier for 30 150-nucleotide-long scanning windows iterating through a global alignment of Borrelia burgdorferi and Bacillus subtilis 16S rRNA in steps of 75.
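The scanning scheme described above (150-column windows advanced in steps of 75) can be sketched as a generator of column ranges. How the end of the alignment is handled is not stated in the caption; this sketch assumes that a trailing partial window is dropped, and the function name is hypothetical.

```python
def scanning_windows(alignment_length, size=150, step=75):
    """Yield (start, end) half-open column ranges of overlapping windows.

    Assumption: only full-size windows are produced, so a trailing stretch
    shorter than `size` is not scanned on its own.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    start = 0
    while start + size <= alignment_length:
        yield start, start + size
        start += step


# A 300-column alignment gives three windows that overlap by half.
windows = list(scanning_windows(300))
```

Each range can be applied to both rows of the global alignment, for example `aligned_a[start:end]` and `aligned_b[start:end]`, to obtain the sequence pair for one window.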
Figure 14
ncRNA probabilities (P values) of scanning windows iterating through a 23S rRNA. Probabilities of ncRNA computed by the Dynalign/LIBSVM classifier for 57 150-nucleotide-long scanning windows iterating through a global alignment of Bacillus subtilis and Bos taurus 23S rRNA in steps of 75.
Figure 15
Distribution of scanning window percent identities in the MUMmer whole genome ncRNA screen. Histogram showing the distribution of percent identities of 15,214 genomic windows (size 150 alignment columns, scanning step size 75 alignment columns), generated from the MUMmer whole genome alignment of E. coli and S. typhi.
Figure 16
Distribution of scanning window percent identities in the WuBLASTn whole genome ncRNA screen. Histogram showing the distribution of percent identities of 90,404 genomic windows (size 150 alignment columns, scanning step size 75 alignment columns), generated from the WuBLASTn whole genome alignment of E. coli and S. typhi.
Figure 17
Distribution of percent identities of 50-nucleotide windows in the human-mouse genome alignment. The BLASTZ pairwise alignment of the human and mouse genomes [73] is broken down into 50-nucleotide-long non-overlapping windows and the percent identity for each is calculated, then plotted in this histogram. There are 22,456,315 windows total.
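The per-window percent identity behind a histogram like this can be sketched as follows. The caption does not say how gaps are treated, so the sketch assumes that identity is the number of columns with the same nucleotide in both sequences divided by the window length, that gap columns never count as matches, and that windows are measured in alignment columns. Function names are hypothetical.

```python
def percent_identity(window_a, window_b):
    """Percent of columns in which both sequences carry the same nucleotide."""
    if len(window_a) != len(window_b):
        raise ValueError("windows must have equal length")
    if not window_a:
        raise ValueError("window is empty")
    matches = sum(
        1 for x, y in zip(window_a, window_b) if x == y and x != "-"
    )
    return 100.0 * matches / len(window_a)


def window_identities(aligned_a, aligned_b, size=50):
    """Percent identity of each non-overlapping, full-size window."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    return [
        percent_identity(aligned_a[i:i + size], aligned_b[i:i + size])
        for i in range(0, len(aligned_a) - size + 1, size)
    ]


# Toy alignment with a window size of 4 instead of 50, for illustration.
identities = window_identities("ACGUACGU", "ACGAACGU", size=4)
```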
Similar articles
- Discovery of Novel ncRNA Sequences in Multiple Genome Alignments on the Basis of Conserved and Stable Secondary Structures.
Fu Y, Xu ZZ, Lu ZJ, Zhao S, Mathews DH. PLoS One. 2015 Jun 15;10(6):e0130200. doi: 10.1371/journal.pone.0130200. eCollection 2015. PMID: 26075601 Free PMC article.
- Considerations in the identification of functional RNA structural elements in genomic alignments.
Babak T, Blencowe BJ, Hughes TR. BMC Bioinformatics. 2007 Jan 30;8:33. doi: 10.1186/1471-2105-8-33. PMID: 17263882 Free PMC article.
- Predicting a set of minimal free energy RNA secondary structures common to two sequences.
Mathews DH. Bioinformatics. 2005 May 15;21(10):2246-53. doi: 10.1093/bioinformatics/bti349. Epub 2005 Feb 24. PMID: 15731207
- From structure prediction to genomic screens for novel non-coding RNAs.
Gorodkin J, Hofacker IL. PLoS Comput Biol. 2011 Aug;7(8):e1002100. doi: 10.1371/journal.pcbi.1002100. Epub 2011 Aug 4. PMID: 21829340 Free PMC article. Review.
- Advances in Computational Methodologies for Classification and Sub-Cellular Locality Prediction of Non-Coding RNAs.
Asim MN, Ibrahim MA, Imran Malik M, Dengel A, Ahmed S. Int J Mol Sci. 2021 Aug 13;22(16):8719. doi: 10.3390/ijms22168719. PMID: 34445436 Free PMC article. Review.
Cited by
- Rational design of ligands targeting triplet repeating transcripts that cause RNA dominant disease: application to myotonic muscular dystrophy type 1 and spinocerebellar ataxia type 3.
Pushechnikov A, Lee MM, Childs-Disney JL, Sobczak K, French JM, Thornton CA, Disney MD. J Am Chem Soc. 2009 Jul 22;131(28):9767-79. doi: 10.1021/ja9020149. PMID: 19552411 Free PMC article.
- Dinucleotide controlled null models for comparative RNA gene prediction.
Gesell T, Washietl S. BMC Bioinformatics. 2008 May 27;9:248. doi: 10.1186/1471-2105-9-248. PMID: 18505553 Free PMC article.
- Alignment-free comparative genomic screen for structured RNAs using coarse-grained secondary structure dot plots.
Kato Y, Gorodkin J, Havgaard JH. BMC Genomics. 2017 Dec 2;18(1):935. doi: 10.1186/s12864-017-4309-y. PMID: 29197323 Free PMC article.
- Discovery of Novel ncRNA Sequences in Multiple Genome Alignments on the Basis of Conserved and Stable Secondary Structures.
Fu Y, Xu ZZ, Lu ZJ, Zhao S, Mathews DH. PLoS One. 2015 Jun 15;10(6):e0130200. doi: 10.1371/journal.pone.0130200. eCollection 2015. PMID: 26075601 Free PMC article.
- Considerations in the identification of functional RNA structural elements in genomic alignments.
Babak T, Blencowe BJ, Hughes TR. BMC Bioinformatics. 2007 Jan 30;8:33. doi: 10.1186/1471-2105-8-33. PMID: 17263882 Free PMC article.