Finding flexible patterns in a text: an application to three-dimensional molecular matching

Finding Flexible Patterns in a Text - an Application to 3D Molecular Matching

2005

Finding certain regularities in a text is an important problem in many areas, for instance in the analysis of biological molecules such as nucleic acids or proteins. In the latter case, the text may be sequences of amino acids or a linear coding of 3D structures, and the regularities then correspond to lexical or structural motifs common to two or more proteins. We first recall an earlier algorithm that finds these regularities in a flexible way. Then we introduce a generalized version of this algorithm designed for the particular case of protein 3D structures, since these structures present a few peculiarities that make them computationally harder to process. Finally, we give some applications of our new algorithm on concrete examples.

Keywords: cliques, multiple alignment, protein structural matching.

Introduction

The main motivation for the new algorithm presented in this paper is that of finding the patterns common to a set of protein structures. This algorithm is an ex...

Finding flexible patterns in a text - An application to 3D matching

Computer applications in the biosciences: CABIOS

Finding certain regularities in a text is an important problem in many areas, e.g. in the analysis of biological molecules such as nucleic acids or proteins. In the latter case, the text may be sequences of amino acids or a linear coding of three-dimensional structures, and the regularities then correspond to lexical or structural motifs common to two, or more, proteins. We first recall an earlier algorithm that found these regularities in a flexible way. Then we introduce a generalized version of this algorithm designed for the particular case of protein three-dimensional structures, since these structures present a few peculiarities that make them computationally harder to process. Finally, we give some applications of our new algorithm on concrete examples.

Three-Dimensional Searching for Recurrent Structural Motifs in Data Bases of Protein Structures

Journal of Computational Biology, 1994

The problem of searching a data base of coordinates of proteins for substructures similar to a probe structure or motif is an important problem in computational molecular biology. It is the three-dimensional analog of the one-dimensional case of pattern matching in strings, procedures for which are widely used in molecular biology to search data bases of gene sequences. Typical applications of substructure searching are: (i) Determining whether structural features observed in one protein structure are unique or recurrent, and (ii) in predictions of protein structures, to bridge gaps in an incomplete structural model, by searching the data base for peptides that link the given starting and ending points. We describe our analysis of the problem and our experience in developing software.
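As a rough illustration of what substructure searching can look like, the sketch below slides a window along a chain of 3D points and reports the windows whose internal pairwise distances agree with those of a probe. This is only one simple way to compare fragments, not the software described in the abstract; the function names, the toy coordinates, and the `tolerance` value are all assumptions made for the example. Comparing distances alone cannot tell a fragment from its mirror image.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def distance_profile(points):
    """Return all pairwise distances of a fragment, in a fixed order."""
    return [dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]


def find_windows(chain, probe, tolerance=0.5):
    """Return start indices of contiguous windows of `chain` whose
    internal distances all lie within `tolerance` of the probe's.

    `tolerance` is a hypothetical threshold chosen for illustration.
    """
    size = len(probe)
    target = distance_profile(probe)
    hits = []
    for start in range(len(chain) - size + 1):
        window = distance_profile(chain[start:start + size])
        if all(abs(a - b) <= tolerance for a, b in zip(window, target)):
            hits.append(start)
    return hits


# Toy data: two L-shaped fragments separated by a long jump.
chain = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (5, 5, 5), (6, 5, 5), (6, 6, 5)]
probe = [(0, 0, 0), (0, 1, 0), (1, 1, 0)]
hits = find_windows(chain, probe)
```

Because only internal distances are compared, a match does not depend on how the probe is rotated or translated relative to the chain, which is why the probe above matches both fragments even though it is oriented differently.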

Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs

Biocomputing 2005 - Proceedings of the Pacific Symposium, 2004

The comparison of structural subsites in proteins is increasingly relevant to the prediction of their biological function. To address this problem, we present the Match Augmentation algorithm (MA). Given a structural motif of interest, such as a functional site, MA searches a target protein structure for a match: the set of atoms with the greatest geometric and chemical similarity. MA is extremely efficient because it exploits the fact that the amino acids in a structural motif are not equally important to function. Using motif residues ranked on functional significance via the Evolutionary Trace (ET), MA prioritizes its search by initially forming matches with functionally significant residues. Then, guided by ET, it augments this partial match stepwise until the whole motif is found. With this hierarchical strategy, MA runs considerably faster than other methods, and almost always identifies matches in homologs known to have cognate functional sites. In addition, to interpret matches, we introduce a statistical method using nonparametric density estimation of the frequency distribution of structural matches. Our results show that the hierarchy of functional importance within structural motifs speeds up the search within targets, and points to a new method to score their statistical significance.

A rapid protein structure alignment algorithm based on a text modeling technique

Bioinformation, 2011

Structural alignment of proteins is widely used in various fields of structural biology. In order to further improve the quality of alignment, we describe an algorithm for structural alignment based on text modelling techniques. The technique first superimposes the secondary structure elements of two proteins and then encodes the 3D structure of each protein as a sequence of letters. These sequences are used by a step-by-step sequence alignment procedure to align the two protein structures. A benchmark test was run on a set of 200 non-homologous proteins to evaluate the program and compare it to state-of-the-art programs, e.g. CE, SAL, TM-align and 3D-BLAST. On average, in all-against-all structure comparison the program achieves accuracy competitive with CE and TM-align, while running at a speed comparable to 3D-BLAST.

A Relational Extension of the Notion of Motifs: Application to the Common 3D Protein Substructures Searching Problem

Journal of Computational Biology, 2009

The geometric configurations of atoms in protein structures can be viewed as approximate relations among them. Then, finding similar common substructures within a set of protein structures belongs to a new class of problems that generalizes that of finding repeated motifs. The novelty lies in the addition of constraints on the motifs in terms of relations that must hold between pairs of positions of the motifs. We will hence denote them as relational motifs. For this class of problems we present an algorithm that is a suitable extension of the KMR paradigm (Karp et al., 1972) and, in particular, of KMRC (Soldano et al., 1995), as it uses a degenerate alphabet. Our algorithm contains several improvements with respect to KMRC that become especially useful when, as is required for relational motifs, the inference is made by partially overlapping shorter motifs rather than by concatenating them as in KMR. The efficiency, correctness and completeness of the algorithm are ensured by several non-trivial properties that are proven in this paper. The algorithm has been applied in the important field of protein common 3D substructure searching. The methods implemented have been tested on several examples of protein families, such as serine proteases, globins and, additionally, cytochromes P450. The detected motifs have been compared to those found by multiple structural alignment methods.
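To make the KMR idea concrete, here is a minimal sketch of the classic doubling principle on a plain string: positions are labelled so that two positions share a label exactly when the substrings of length k starting there are equal, and labels for longer substrings are built from pairs of shorter ones. It covers neither the degenerate alphabet of KMRC nor the relational constraints of the paper. The function names are invented for this example, and the `step < k` case is only a simple illustration of combining overlapping pieces rather than concatenating them.

```python
from collections import defaultdict


def kmr_labels(text, length):
    """Label positions so that labels[i] == labels[j] exactly when
    text[i:i+length] == text[j:j+length] (KMR-style doubling)."""
    if not 1 <= length <= len(text):
        raise ValueError("length must be between 1 and len(text)")
    # Level k = 1: one label per distinct character.
    ranks = {c: r for r, c in enumerate(sorted(set(text)))}
    labels = [ranks[c] for c in text]
    k = 1
    while k < length:
        # step == k concatenates two pieces; step < k overlaps them.
        step = min(k, length - k)
        k += step
        pairs = [(labels[i], labels[i + step])
                 for i in range(len(text) - k + 1)]
        ranks = {p: r for r, p in enumerate(sorted(set(pairs)))}
        labels = [ranks[p] for p in pairs]
    return labels


def repeated_substrings(text, length):
    """Map each substring of the given length that occurs at least
    twice to the list of its start positions."""
    groups = defaultdict(list)
    for i, label in enumerate(kmr_labels(text, length)):
        groups[label].append(i)
    return {text[pos[0]:pos[0] + length]: pos
            for pos in groups.values() if len(pos) > 1}


repeats = repeated_substrings("abcabcab", 3)
```

Each level only compares pairs of integer labels, never the substrings themselves, which is what keeps the doubling scheme cheap as k grows.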

DISCO: A New Algorithm for Detecting 3D Protein Structure Similarity

IFIP Advances in Information and Communication Technology, 2012

Detecting protein structure similarity is one of the most important aims currently pursued by bioinformatics and structural biology. Although quite a few similarity methods have been proposed lately, fresh algorithms that fulfill new preconditions are still needed to serve this purpose. In this paper, we provide a new similarity measure for 3D protein structures that detects not only similar structures but also similar substructures to a query protein, supporting both multiple and pairwise comparison procedures and combining many comparison characteristics. In order to handle similarity queries we utilize efficient and effective indexing techniques such as M-trees, and we provide interesting results using real, previously tested protein data sets.

Smoothing 3D Protein Structure Motifs Through Graph Mining and Amino Acid Similarities

Journal of Computational Biology, 2014

One of the most powerful techniques to study proteins is to look for recurrent fragments (also called substructures or spatial motifs), then use them as patterns to characterize the proteins under study. An emerging trend consists of parsing proteins' three-dimensional (3D) structures into graphs of amino acids. Hence, the search for recurrent substructures is formulated as a process of frequent subgraph discovery where each subgraph represents a 3D-motif. In this scope, several efficient approaches for frequent 3D-motif discovery have been proposed in the literature. However, the set of discovered 3D-motifs is too large to be efficiently analyzed and explored in any further process. In this paper, we propose a novel pattern selection approach that shrinks the large number of discovered frequent 3D-motifs by selecting the representative ones. Existing pattern selection approaches do not exploit domain knowledge. In our approach, by contrast, we incorporate the evolutionary information of amino acids defined in the substitution matrices in order to select the representative 3D-motifs. We show the effectiveness of our approach on a number of real datasets. The results of our experiments show that our approach detects relations between patterns that current subgraph selection approaches fail to detect, and that it is able to considerably decrease the number of motifs while enhancing their interestingness.
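The step of parsing a 3D structure into a graph of amino acids can be pictured with a small sketch: each residue becomes a node, and two residues are linked when their representative points lie within a distance cutoff. The abstract does not say how the graphs are built, so the one-point-per-residue representation, the 7.0 cutoff, and the toy coordinates below are assumptions chosen purely for illustration.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available since Python 3.8


def contact_graph(residues, cutoff=7.0):
    """Build an undirected graph over residues.

    `residues` is a list of (amino_acid_label, (x, y, z)) pairs.
    Returns an adjacency dict mapping residue index to the set of
    indices closer than `cutoff` (a hypothetical threshold).
    """
    graph = {i: set() for i in range(len(residues))}
    for (i, (_, a)), (j, (_, b)) in combinations(enumerate(residues), 2):
        if dist(a, b) <= cutoff:
            graph[i].add(j)
            graph[j].add(i)
    return graph


# Toy chain: three residues spaced 3.8 apart, plus one far away.
residues = [
    ("ALA", (0.0, 0.0, 0.0)),
    ("GLY", (3.8, 0.0, 0.0)),
    ("SER", (7.6, 0.0, 0.0)),
    ("LYS", (20.0, 0.0, 0.0)),
]
graph = contact_graph(residues)
```

Node labels (the amino acid names) are kept alongside the coordinates so that a subgraph miner, or a selection step based on substitution matrices, could later compare motifs by residue type as well as by shape.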