Automated alignment of RNA sequences to pseudoknotted structures

A memory efficient algorithm for structural alignment of RNAs with embedded simple pseudoknots

2008

In this paper, we consider the problem of structural alignment of a target RNA sequence of length n and a query RNA sequence of length m with known secondary structure that may contain embedded simple pseudoknots. The best known algorithm for solving this problem (Dost et al. [13]) runs in O(mn^4) time with space complexity of O(mn^3), which requires too much memory to be feasible for comparing non-coding RNAs (ncRNAs) that are several hundred nucleotides long or longer. We propose a memory-efficient algorithm to solve the same problem. We reduce the space complexity to O(mn^2 + n^3) while maintaining the same time complexity as Dost et al.'s algorithm. Experimental results show that our algorithm is feasible for comparing ncRNAs of length more than 500. Availability: The source code of our program is available upon request.
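To make the gap between the two space bounds concrete, the following back-of-the-envelope sketch compares them for m = n = 500. It assumes a constant factor of 1 and 4 bytes per table cell; neither figure comes from the paper, so the results only indicate the order of magnitude. The function names are illustrative.

```python
# Rough memory estimate for the two space bounds quoted in the abstract.
# Assumptions (not from the paper): constant factor 1, 4 bytes per table cell.
BYTES_PER_CELL = 4


def cells_dost(m: int, n: int) -> int:
    """Cell count proportional to the O(m n^3) space bound."""
    return m * n**3


def cells_memory_efficient(m: int, n: int) -> int:
    """Cell count proportional to the O(m n^2 + n^3) space bound."""
    return m * n**2 + n**3


def gigabytes(cells: int) -> float:
    """Convert a cell count to gigabytes under the assumed cell size."""
    return cells * BYTES_PER_CELL / 1e9


if __name__ == "__main__":
    m = n = 500
    print(f"O(mn^3):       {gigabytes(cells_dost(m, n)):.1f} GB")
    print(f"O(mn^2 + n^3): {gigabytes(cells_memory_efficient(m, n)):.1f} GB")
```

Under these assumptions the first bound corresponds to roughly 250 GB and the second to roughly 1 GB, which is consistent with the abstract's claim that only the reduced bound is practical at this length.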

Structural alignment of RNA with complex pseudoknot structure

2011

The secondary structure of an ncRNA molecule is known to play an important role in its biological functions. Aligning a known ncRNA to a target candidate to determine the sequence and structural similarity helps in identifying de novo ncRNA molecules that are in the same family as the known ncRNA. However, existing algorithms cannot handle complex pseudoknot structures which are found in nature. In this article, we propose algorithms to handle two types of complex pseudoknots: simple non-standard pseudoknots and recursive pseudoknots. Although our methods are not designed for general pseudoknots, they already cover all known ncRNAs in both the Rfam and PseudoBase databases. An evaluation of our algorithms shows that they are useful for identifying ncRNA molecules in other species which are in the same family as a known ncRNA.

Alignments of RNA Structures

IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2000

We describe a theoretical unifying framework to express comparison of RNA structures, which we call alignment hierarchy. This framework relies on the definition of common supersequences for arc-annotated sequences, and encompasses the main existing models for RNA structure comparison based on trees and arc-annotated sequences with a variety of edit operations. It also gives rise to edit models that have not been studied yet. We provide a thorough analysis of the alignment hierarchy, including a new polynomial time algorithm and an NP-completeness proof. The polynomial time algorithm involves biologically relevant evolutionary operations, such as pairing or unpairing nucleotides. It has been implemented in a software tool called gardenia, which is available at the web server http://bioinfo.lifl.fr/RNA/gardenia.

RNAlign2D – a novel RNA structural alignment tool based on pseudo-amino acid substitution matrix

2020

Motivation: The function of RNA molecules is mainly determined by their secondary structure. Addressing that issue requires creation of appropriate bioinformatic tools that enable alignment of multiple RNA molecules to determine functional domains and/or classify RNA families. The existing tools for RNA multiple alignment that use structural information are relatively slow. Therefore, providing a rapid tool for multiple structural alignment may improve classification of the known RNAs and reveal the function of the newly discovered ones. Results: Here, we developed an extremely fast Python-based RNAlign2D tool. It converts RNA sequence and structure to a pseudo-amino acid sequence and uses a customizable pseudo-amino acid substitution matrix to align RNA secondary structures and sequences using MUSCLE. It is suitable for RNAs containing modified nucleosides and/or pseudoknots. Our approach is compatible with virtually all protein aligners. Availability and implementation: RNAlign2D is available...

Alignment of RNA structures

2008

We describe a theoretical unifying framework to express comparison of RNA structures, which we call alignment hierarchy. This framework relies on the definition of common supersequences for arc-annotated sequences, and encompasses the main existing models for RNA structure comparison based on trees and arc-annotated sequences with a variety of edit operations. It also gives rise to edit models that have not been studied yet. We provide a thorough analysis of the alignment hierarchy, including a new polynomial time algorithm and an NP-completeness proof. The polynomial time algorithm involves biologically relevant evolutionary operations, such as pairing or unpairing nucleotides. It has been implemented in a software tool called gardenia, which is available at the web server http://bioinfo.lifl.fr/RNA/gardenia.

Accurate multiple sequence-structure alignment of RNA sequences using combinatorial optimization

BMC Bioinformatics, 2007

Background: The discovery of functional non-coding RNA sequences has led to an increasing interest in algorithms related to RNA analysis. Traditional sequence alignment algorithms, however, fail at computing reliable alignments of low-homology RNA sequences. The spatial conformation of RNA sequences largely determines their function, and therefore RNA alignment algorithms have to take structural information into account. Results: We present a graph-based representation for sequence-structure alignments, which we model as an integer linear program (ILP). We sketch how we compute an optimal or near-optimal solution to the ILP using methods from combinatorial optimization, and present results on a recently published benchmark set for RNA alignments. Conclusion: The implementation of our algorithm yields better alignments in terms of two published scores than the other programs that we tested: This is especially the case with an increasing number of input sequences. Our program LARA is freely available for academic purposes from http://www.planet-lisa.net.

Fast and Accurate Structural RNA Alignment by Progressive Lagrangian Optimization

Lecture Notes in Computer Science, 2005

During the last few years new functionalities of RNA have been discovered, renewing the need for computational tools for their analysis. In this respect, multiple sequence alignment is an essential step in finding structurally conserved regions in related RNA sequences. In contrast to proteins, many classes of functionally related RNA molecules show a rather weak sequence conservation but instead a fairly well conserved secondary structure. Hence, any method that relates RNA sequences in the form of multiple alignments should take structural features into account, which has been verified in recent studies. Progress has been made in developing new structural alignment algorithms; however, current methods are computationally costly or do not have the desired accuracy to make them an everyday tool. In this paper we present a fast, practical, and accurate method for computing multiple structural RNA alignments. The method is based on combining a new pairwise structural alignment method with the popular program T-Coffee. Our pairwise method is based on an integer linear programming (ILP) formulation resulting from a graph-theoretic reformulation of the structural alignment problem. We find provably optimal or near-optimal solutions of the ILP with a Lagrangian approach. Tests on a recently published benchmark set show that our Lagrangian approach outperforms current programs in quality and in the length of the sequences it can align.

A computational model for RNA multiple structural alignment

2006

This paper addresses the problem of aligning multiple sequences of noncoding RNA (ncRNA) genes. We approach this problem with the biologically motivated paradigm that scoring of ncRNA alignments should be based primarily on secondary structure rather than nucleotide conservation.

An Efficient Alignment Algorithm for Searching Simple Pseudoknots over Long Genomic Sequence

IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2012

Structural alignment has been shown to be an effective computational method to identify structural noncoding RNA (ncRNA) candidates, as ncRNAs are known to be conserved in secondary structures. However, the complexity of the structural alignment algorithms becomes higher when the structure has pseudoknots. Even for the simplest type of pseudoknots (simple pseudoknots), the fastest algorithm runs in O(mn^3) time, where m and n are the length of the query ncRNA (with known structure) and the length of the target sequence (with unknown structure), respectively. In practice, we are usually given a long DNA sequence and we try to locate regions in the sequence for possible candidates of a particular ncRNA. Thus, we need to run the structural alignment algorithm on every possible region in the long sequence. For example, finding candidates for a known ncRNA of length 100 on a sequence of length 50,000 takes more than one day. In this paper, we provide an efficient algorithm to solve the problem for simple pseudoknots and it is shown to be 10 times faster. The speedup stems from an effective pruning strategy consisting of the computation of a lower bound score for the optimal alignment and an estimation of the maximum score that a candidate can achieve to decide whether to prune the current candidate or not.
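The general shape of such a bound-based pruning scan can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's algorithm: `toy_score` (positional matches) stands in for the expensive structural alignment, and `upper_bound` (shared character counts) stands in for the paper's estimate of the maximum achievable score. Both functions and all names are hypothetical.

```python
from collections import Counter


def toy_score(query: str, window: str) -> int:
    """Stand-in for the expensive structural alignment: positional matches."""
    return sum(q == w for q, w in zip(query, window))


def upper_bound(query_counts: Counter, window: str) -> int:
    """Cheap bound: positional matches cannot exceed shared character counts."""
    window_counts = Counter(window)
    return sum(min(query_counts[c], window_counts[c]) for c in query_counts)


def best_window(query: str, target: str):
    """Return (best_score, start, windows_scored) using bound-based pruning."""
    m = len(query)
    query_counts = Counter(query)
    best, best_start, scored = -1, -1, 0
    for start in range(len(target) - m + 1):
        window = target[start:start + m]
        # `best` is a lower bound on the optimum; a window whose upper bound
        # does not exceed it cannot improve the result, so skip it.
        if upper_bound(query_counts, window) <= best:
            continue
        scored += 1
        score = toy_score(query, window)
        if score > best:
            best, best_start = score, start
    return best, best_start, scored


if __name__ == "__main__":
    print(best_window("GGCAU", "AAAAAGGCAUAAAAACCCCC"))
```

Since the bound never underestimates the toy score, the pruned scan returns the same best window as an exhaustive scan while invoking the expensive scoring function on fewer candidates.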

Fast Pairwise Structural RNA Alignments by Pruning of the Dynamical Programming Matrix

PLoS Computational Biology, 2007

It has become clear that noncoding RNAs (ncRNA) play important roles in cells, and emerging studies indicate that there might be a large number of unknown ncRNAs in mammalian genomes. There exist computational methods that can be used to search for ncRNAs by comparing sequences from different genomes. One main problem with these methods is their computational complexity, and heuristics are therefore employed. Two heuristics are currently very popular: pre-folding and pre-aligning. However, these heuristics are not ideal, as pre-aligning is dependent on sequence similarity that may not be present and pre-folding ignores the comparative information. Here, pruning of the dynamical programming matrix is presented as an alternative novel heuristic constraint. All subalignments that do not exceed a length-dependent minimum score are discarded as the matrix is filled out, thus giving the advantage of providing the constraints dynamically. This has been included in a new implementation of the FOLDALIGN algorithm for pairwise local or global structural alignment of RNA sequences. It is shown that time and memory requirements are dramatically lowered while overall performance is maintained. Furthermore, a new divide and conquer method is introduced to limit the memory requirement during global alignment and backtrack of local alignment. All branch points in the computed RNA structure are found and used to divide the structure into smaller unbranched segments. Each segment is then realigned and backtracked in a normal fashion. Finally, the FOLDALIGN algorithm has also been updated with a better memory implementation and an improved energy model. With these improvements in the algorithm, the FOLDALIGN software package provides the molecular biologist with an efficient and user-friendly tool for searching for new ncRNAs. The software package is available for download at http://foldalign.ku.dk.
Citation: Havgaard JH, Torarinsson E, Gorodkin J (2007) Fast pairwise structural RNA alignments by pruning of the dynamical programming matrix. PLoS Comput Biol 3(10): e193.
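The idea of discarding subalignments that fall below a length-dependent minimum score while the matrix is filled can be sketched on a much simpler problem. The example below is a toy, ungapped, sequence-only alignment, not FOLDALIGN's structural recursion; the threshold function `min_score`, the scoring values, and all names are assumptions made for illustration.

```python
def min_score(length: int) -> int:
    """Hypothetical length-dependent threshold; FOLDALIGN's real one differs."""
    return length // 2


def pruned_ungapped_dp(a: str, b: str, match: int = 2, mismatch: int = -3):
    """Fill a sparse DP table, keeping only subalignments above the threshold.

    Returns (best_score, number_of_cells_kept).
    """
    kept = {}  # (i, k) -> (score, length) of the subalignment ending there
    best = 0
    for i in range(len(a)):
        for k in range(len(b)):
            # Extend the surviving diagonal predecessor, or start afresh.
            prev_score, prev_len = kept.get((i - 1, k - 1), (0, 0))
            score = prev_score + (match if a[i] == b[k] else mismatch)
            length = prev_len + 1
            # Pruning: cells that do not exceed the threshold are never stored,
            # so they cost no memory and cannot be extended later.
            if score > min_score(length):
                kept[(i, k)] = (score, length)
                best = max(best, score)
    return best, len(kept)


if __name__ == "__main__":
    best, cells = pruned_ungapped_dp("GGGAAACCC", "GGGAAACCC")
    print(best, cells, "of", 9 * 9, "cells kept")
```

Because pruned cells are never stored, the table stays sparse, which is the source of the memory and time savings in this toy setting. As with any such heuristic, an alignment that dips below the threshold partway through is lost even if it would have recovered later.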