Accelerated probabilistic inference of RNA structure evolution

Ian Holmes. BMC Bioinformatics. 2005.

Abstract

Background: Pairwise stochastic context-free grammars (Pair SCFGs) are powerful tools for evolutionary analysis of RNA, including simultaneous RNA sequence alignment and secondary structure prediction, but the associated algorithms are intensive in both CPU and memory usage. The same problem is faced by other RNA alignment-and-folding algorithms based on Sankoff's 1985 algorithm. It is therefore desirable to constrain such algorithms, by pre-processing the sequences and using this first pass to limit the range of structures and/or alignments that can be considered.

Results: We demonstrate how flexible classes of constraint can be imposed, greatly reducing the computational costs while maintaining a high quality of structural homology prediction. Any score-attributed context-free grammar (e.g. energy-based scoring schemes, or conditionally normalized Pair SCFGs) is amenable to this treatment. It is now possible to combine independent structural and alignment constraints of unprecedented general flexibility in Pair SCFG alignment algorithms. We outline several applications to the bioinformatics of RNA sequence and structure, including Waterman-Eggert N-best alignments and progressive multiple alignment. We evaluate the performance of the algorithm on test examples from the RFAM database.

Conclusion: A program, Stemloc, that implements these algorithms for efficient RNA sequence alignment and structure prediction is available under the GNU General Public License.

Figures

Figure 1

A parse tree for the grammar of Table 1. Each internal node is labeled with a nonterminal (Stem or Loop); additionally, the subsequences (X ij, Y kl) generated by each internal node are shown. The parse tree determines both the structure and alignment of the two sequences. The cutpoints of the alignment are the sequence co-ordinates at which the alignment can be split, i.e. {(0, 0), (1, 1), (2, 2) ... (15, 12), (16, 13), (17, 14)}.
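
The notion of a cutpoint can be illustrated with a minimal Python sketch. It assumes the alignment is given as two equal-length gapped strings; the function name `alignment_cutpoints` and the gap character `-` are illustrative choices, not part of stemloc.

```python
def alignment_cutpoints(row_x, row_y, gap="-"):
    """List the cutpoints (i, k) of a pairwise alignment.

    Assumption: the alignment is given as two equal-length gapped rows.
    A cutpoint (i, k) means the alignment can be split after i residues
    of X and k residues of Y.
    """
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have equal length")
    i = k = 0
    cuts = [(0, 0)]  # the empty prefix is always a cutpoint
    for cx, cy in zip(row_x, row_y):
        if cx != gap:
            i += 1  # this column consumes one residue of X
        if cy != gap:
            k += 1  # this column consumes one residue of Y
        cuts.append((i, k))
    return cuts


# Toy alignment, unrelated to the sequences in the figure.
print(alignment_cutpoints("AC-G", "A-UG"))
# → [(0, 0), (1, 1), (2, 1), (2, 2), (3, 3)]
```

Each alignment column advances one or both co-ordinates, so an alignment with C columns has C + 1 cutpoints.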

Figure 2

Parsing a pair of sequences (X, Y) using the Inside algorithm involves iterating over subsequence-pairs (X ij, Y kl) specified by four indices (i, j, k, l). In the constrained Inside algorithm, these indices are only valid if the fold envelopes (triangular grids) include the respective subsequences (i, j) and (k, l) (shown as black circles) and the alignment envelope (rectangular grid) includes both cutpoints (i, k) and (j, l) (shown as short diagonal lines). The filled cells in the rectangular grid show the aligned nucleotides. Note that the co-ordinates (i, j, k, l) lie on the grid-lines between the nucleotides.
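
The validity test described in this caption can be sketched as follows. This is a hedged illustration, not stemloc's implementation: it assumes each fold envelope is a Python set of `(start, end)` pairs and the alignment envelope is a set of `(x_coord, y_coord)` cutpoints, and the names `cell_allowed` and `allowed_cells` are hypothetical.

```python
from itertools import product


def cell_allowed(i, j, k, l, fold_x, fold_y, align_env):
    """True if the subsequence-pair (X ij, Y kl) passes all constraints."""
    return (
        (i, j) in fold_x            # X-fold envelope contains (i, j)
        and (k, l) in fold_y        # Y-fold envelope contains (k, l)
        and (i, k) in align_env     # left cutpoint is permitted
        and (j, l) in align_env     # right cutpoint is permitted
    )


def allowed_cells(fold_x, fold_y, align_env):
    """Yield every (i, j, k, l) that a constrained DP would visit.

    Only pairs drawn from the two fold envelopes are enumerated, so the
    cost scales with the envelope sizes rather than with all four indices.
    """
    for (i, j), (k, l) in product(sorted(fold_x), sorted(fold_y)):
        if (i, k) in align_env and (j, l) in align_env:
            yield (i, j, k, l)


# Toy envelopes (made up for illustration).
fold_x = {(0, 0), (0, 2), (1, 2)}
fold_y = {(0, 0), (0, 1)}
align_env = {(0, 0), (2, 1)}
print(list(allowed_cells(fold_x, fold_y, align_env)))
```

A real Inside implementation would also need to visit cells in an inside-out order (shorter subsequences first); the sketch only shows which cells survive the constraints.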

Figure 3

Bifurcation rules allow a subsequence-pair (X ij, Y kl) to be composed from two adjoining subsequence-pairs (X im, Y kn) and (X mj, Y nl). For this to be permitted by the constraints, the _X_-fold envelope (upper triangular grid) must contain subsequences (i, m), (m, j) and (i, j) (black dots), the _Y_-fold envelope (rightmost triangular grid) must contain subsequences (k, n), (n, l) and (k, l) (black dots), and the alignment envelope (rectangular grid) must contain cutpoints (i, k), (m, n) and (j, l) (short diagonal lines). The filled cells in the rectangular grid show the nucleotide homologies highlighted in the alignment. Note that all co-ordinates (i, j, k, l, m, n) lie on the grid-lines between nucleotides.
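
The bifurcation constraints can be sketched in Python as a filter over candidate split points (m, n). As an assumption for illustration, the envelopes are plain sets of co-ordinate pairs, and the function name `bifurcation_splits` is hypothetical rather than taken from stemloc.

```python
def bifurcation_splits(i, j, k, l, fold_x, fold_y, align_env):
    """Return all split points (m, n) permitted for the cell (i, j, k, l).

    The left child covers (X im, Y kn) and the right child (X mj, Y nl).
    """
    # The parent cell itself must satisfy the envelopes.
    if (i, j) not in fold_x or (k, l) not in fold_y:
        return []
    if (i, k) not in align_env or (j, l) not in align_env:
        return []
    splits = []
    for m in range(i, j + 1):
        # Both X-subsequences must lie inside the X-fold envelope.
        if (i, m) not in fold_x or (m, j) not in fold_x:
            continue
        for n in range(k, l + 1):
            # Both Y-subsequences and the shared cutpoint must be allowed.
            if (k, n) in fold_y and (n, l) in fold_y and (m, n) in align_env:
                splits.append((m, n))
    return splits


# Toy example: unconstrained fold envelopes, diagonal-only alignment envelope.
full_fold = {(a, b) for a in range(3) for b in range(a, 3)}
diagonal = {(a, a) for a in range(3)}
print(bifurcation_splits(0, 2, 0, 2, full_fold, full_fold, diagonal))
# → [(0, 0), (1, 1), (2, 2)]
```

With an unconstrained alignment envelope the same cell would admit all nine (m, n) combinations, which shows how the cutpoint constraint prunes the inner loop of the bifurcation step.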

Figure 4

These fold envelopes (triangular grids) limit the maximum length of subsequences (black dots), while the alignment envelope (rectangular grid) limits the maximum deviation of cutpoints (short diagonal lines) from the main diagonal.
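
Envelopes of this kind are simple to construct. The sketch below is one possible reading, under two labeled assumptions: subsequence length is measured as `j - i`, and "deviation from the main diagonal" is taken to be `abs(i - k)`, which is only the obvious choice when the two sequences have similar lengths. The function names are hypothetical.

```python
def banded_fold_envelope(seq_len, max_sub_len):
    """All subsequences (i, j) of a length-seq_len sequence with j - i <= max_sub_len."""
    return {
        (i, j)
        for i in range(seq_len + 1)
        for j in range(i, min(seq_len, i + max_sub_len) + 1)
    }


def banded_alignment_envelope(len_x, len_y, max_dev):
    """All cutpoints (i, k) within max_dev of the diagonal i == k (an assumption)."""
    return {
        (i, k)
        for i in range(len_x + 1)
        for k in range(len_y + 1)
        if abs(i - k) <= max_dev
    }


fold = banded_fold_envelope(seq_len=3, max_sub_len=1)
band = banded_alignment_envelope(len_x=2, len_y=2, max_dev=1)
print(len(fold), len(band))
# → 7 7
```

For a fixed band width, both envelopes grow linearly with sequence length instead of quadratically, which is the source of the savings when they are used to constrain the four-index iteration.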

Figure 5

These fold envelopes (triangular grids) and alignment envelope (rectangular grid) limit the subsequences (black dots) and cutpoints (short diagonal lines) to those consistent with a given alignment and consensus secondary structure (shown). The alignment path is also shown on the alignment envelope as a solid black line, broken by cutpoints.

Figure 6

Fold envelope size is highly correlated with N in the _N_-best fold test, although the variance is large due to the diversity of alignments in the test.

Figure 7

Alignment envelope size is highly correlated with N in the _N_-best alignment test, although the variance is large due to the diversity of alignments in the test.

Figure 8

Alignment sensitivity as a function of envelope size parameter N for three different test regimes.

Figure 9

Alignment specificity as a function of envelope size parameter N for three different test regimes.

Figure 10

Fold sensitivity as a function of envelope size parameter N for three different test regimes.

Figure 11

Fold specificity as a function of envelope size parameter N for three different test regimes.

Figure 12

Total running time of stemloc (including envelope generation phases) as a function of envelope size parameter N for three different test regimes.

Figure 13

Peak memory usage of stemloc (i.e. the size of the principal CYK matrix) as a function of envelope size parameter N for three different test regimes.
