Three-Dimensional Searching for Recurrent Structural Motifs in Data Bases of Protein Structures

A Relational Extension of the Notion of Motifs: Application to the Common 3D Protein Substructures Searching Problem

Journal of Computational Biology, 2009

The geometric configurations of atoms in protein structures can be viewed as approximate relations among them. Finding similar common substructures within a set of protein structures then belongs to a new class of problems that generalizes that of finding repeated motifs. The novelty lies in the addition of constraints on the motifs in terms of relations that must hold between pairs of positions of the motifs. We hence denote them as relational motifs. For this class of problems we present an algorithm that is a suitable extension of the KMR (Karp et al., 1972) paradigm and, in particular, of the KMRC (Soldano et al., 1995), as it uses a degenerate alphabet. Our algorithm contains several improvements over these earlier approaches that become especially useful when, as is required for relational motifs, the inference is made by partially overlapping shorter motifs rather than by concatenating them. The efficiency, correctness and completeness of the algorithm are ensured by several non-trivial properties that are proven in this paper. The algorithm has been applied in the important field of protein common 3D substructure searching. The methods implemented have been tested on several examples of protein families, such as serine proteases, globins and cytochromes P450. The detected motifs have been compared to those found by multiple structural alignment methods.
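As a rough illustration of the KMR paradigm this abstract builds on, the sketch below labels all substrings of a plain string so that equal substrings share a label, building longer labels from pairs of shorter ones. It is a minimal sketch under simplifying assumptions: exact matching on ordinary characters, with no degenerate alphabet and no relational constraints, so it is not the algorithm of the paper. The function names `kmr_labels` and `repeated_positions` are hypothetical.

```python
def kmr_labels(text, length):
    """Label each substring of `length` so that equal substrings share a label.

    Labels for longer substrings are built from pairs of labels for shorter
    ones (the KMR doubling idea). When a full doubling would overshoot
    `length`, the two shorter substrings are allowed to overlap.
    """
    if length < 1 or length > len(text):
        return []
    # Length-1 labels: rank of each character in the sorted alphabet.
    alphabet = {ch: rank for rank, ch in enumerate(sorted(set(text)))}
    labels = [alphabet[ch] for ch in text]
    k = 1
    while k < length:
        step = min(k, length - k)  # step < k means the two halves overlap
        new_k = k + step
        # Substring [i, i + new_k) is determined by the length-k substrings
        # starting at i and at i + step.
        pairs = [(labels[i], labels[i + step])
                 for i in range(len(text) - new_k + 1)]
        ranks = {pair: r for r, pair in enumerate(sorted(set(pairs)))}
        labels = [ranks[pair] for pair in pairs]
        k = new_k
    return labels


def repeated_positions(text, length):
    """Return groups of start positions whose substrings of `length` are equal."""
    groups = {}
    for pos, label in enumerate(kmr_labels(text, length)):
        groups.setdefault(label, []).append(pos)
    return sorted(g for g in groups.values() if len(g) > 1)


if __name__ == "__main__":
    print(repeated_positions("abcabcab", 3))
```

In this toy version the overlap is only used to reach a target length that is not a power of two; the abstract describes overlapping as the main inference step for relational motifs, which is a substantially richer setting.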

Identifying Structural Motifs in Proteins

2003

In biological macromolecules, structural patterns (motifs) are often repeated across different molecules. Detection of these common motifs in a new molecule can provide useful clues to the functional properties of such a molecule. We formulate the problem of identifying a given structural motif (pattern) in a target protein (example) and discuss the notion of complete matches vis-a-vis partial matches. We describe the precise error criterion that has to be minimized and also discuss different metrics for evaluating the quality of partial matches. We then present a new polynomial time algorithm for the problem of matching a given motif in a target protein. We also use the sequence and (if available) secondary structure information to annotate the different points in the motif and the target protein, thus reducing the search space size. Our algorithm guarantees the detection of a perfect match, if present. When no perfect match exists, the algorithm still computes very good matches. Unlike other methods, the error minimized by our algorithm directly translates to root mean square deviation (RMSD), the most commonly accepted metric for structure matching in biological macromolecules. The algorithm does not involve any preprocessing and is suitable for the detection of both small and large motifs in the target protein. We also present experiments exploring the quality of matches found by the algorithm. We examine its performance in matching (both full and partial) active sites in proteins.
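Since the abstract above uses RMSD as its quality metric, a small worked example may help. The sketch below computes the RMSD between a motif and a candidate match given as paired 3D points. It assumes the two point sets are already paired and already superposed; the optimal superposition step that a real comparison needs is deliberately left out, and the function name `rmsd` is hypothetical.

```python
import math


def rmsd(motif, match):
    """Root mean square deviation between two paired lists of 3D points.

    Assumes the points are already in one-to-one correspondence and already
    superposed; no rotation or translation is fitted here.
    """
    if not motif or len(motif) != len(match):
        raise ValueError("point lists must be non-empty and of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(motif, match))
    return math.sqrt(squared / len(motif))


if __name__ == "__main__":
    motif = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    match = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    print(rmsd(motif, match))
```

A partial match can be scored the same way by passing only the matched subset of motif points, which is one reason the number of matched points has to be reported alongside the RMSD value.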

Discovery of Recurrent Structural Motifs for Approximating Three-Dimensional Protein Structures

Journal of the Chinese Chemical Society, 2004

The scope of conformation space that protein molecules can adopt is a problem of significant interest. Previous studies by other groups have shown that there are stereochemical constraints that confine local protein structures to a limited range of conformations. Furthermore, the results of many groups have demonstrated that the sequence-to-structure relationship remains detectable to some extent on a local level. By studying the conformational space of local protein structures, we may obtain more information concerning the constraints on local structural space and the sequence-to-structure mapping, and hence facilitate ab initio structure prediction. In this study, we propose a novel algorithm that automatically discovers recurrent pentamer structures in proteins. The algorithm starts by applying Expectation-Maximization (EM) clustering to the distances between non-adjacent backbone Cα atoms in a large set of pentamer fragments. A rough partition of the conformation space can thus be derived. In the second stage, by applying a split-and-merge algorithm, we can obtain a finite number of clusters and guarantee the homogeneity and distinctiveness of each one. Each cluster of protein structures is represented by a centroid structure. The results show that, with 40 major representative structures, we can approximate most of the protein fragments with an error of 0.378 Å. With only 20 types of structures, the fragment structures can still be modeled at 0.44 Å, which is comparable to or better than the performance of previous methods. We term the representatives "building blocks." On the global level, we demonstrate that by concatenating different combinations of building blocks, we can model whole protein structures at high resolution: a resolution of 2.54 Å can be achieved simply by using 10 types of building blocks. This finding suggests that the study of molecular structures can be hugely simplified using this reduced representation.
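The clustering input described above can be illustrated with a short sketch. For a pentamer there are six pairs of non-adjacent residues, so each fragment can be summarized as six inter-atomic distances. The code below extracts overlapping pentamers from a chain of backbone alpha-carbon coordinates and computes that distance vector. It covers only this feature step, not the EM or split-and-merge stages, and the names `pentamer_features` and `sliding_pentamers` are hypothetical.

```python
import math
from itertools import combinations


def pentamer_features(ca_coords):
    """Distances between non-adjacent alpha-carbon atoms of one pentamer.

    Pairs (i, j) with j - i >= 2 are kept, which gives 6 distances for
    5 residues: (0,2), (0,3), (0,4), (1,3), (1,4), (2,4).
    """
    if len(ca_coords) != 5:
        raise ValueError("a pentamer needs exactly 5 coordinates")
    return [math.dist(ca_coords[i], ca_coords[j])
            for i, j in combinations(range(5), 2)
            if j - i >= 2]


def sliding_pentamers(chain):
    """Yield every window of 5 consecutive coordinates from a chain."""
    for start in range(len(chain) - 4):
        yield chain[start:start + 5]


if __name__ == "__main__":
    # Toy chain: points spaced 1 unit apart along the x axis.
    chain = [(float(i), 0.0, 0.0) for i in range(7)]
    for fragment in sliding_pentamers(chain):
        print(pentamer_features(fragment))
```

Because the features are distances rather than raw coordinates, they do not change when a fragment is rotated or translated, so fragments can be compared without superposing them first.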

Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs

Biocomputing 2005 - Proceedings of the Pacific Symposium, 2004

The comparison of structural subsites in proteins is increasingly relevant to the prediction of their biological function. To address this problem, we present the Match Augmentation algorithm (MA). Given a structural motif of interest, such as a functional site, MA searches a target protein structure for a match: the set of atoms with the greatest geometric and chemical similarity. MA is extremely efficient because it exploits the fact that the amino acids in a structural motif are not equally important to function. Using motif residues ranked on functional significance via the Evolutionary Trace (ET), MA prioritizes its search by initially forming matches with functionally significant residues, then, guided by ET, it augments this partial match stepwise until the whole motif is found. With this hierarchical strategy, MA runs considerably faster than other methods, and almost always identifies matches in homologs known to have cognate functional sites. To interpret matches, we further introduce a statistical method using nonparametric density estimation of the frequency distribution of structural matches. Our results show that the hierarchy of functional importance within structural motifs speeds up the search within targets, and points to a new method to score their statistical significance.
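The idea of growing a partial match in priority order can be sketched as a small backtracking search. The code below is an illustrative simplification, not the published MA procedure: motif points are assumed to arrive already sorted from most to least important, chemical similarity is reduced to an exact label match, and geometric similarity is reduced to agreement of pairwise distances within a tolerance. The function name `augment_match` and the parameter `tol` are hypothetical.

```python
import math


def augment_match(motif, target, tol=0.5, partial=()):
    """Extend a partial match one motif point at a time, in priority order.

    motif:   list of (label, xyz), sorted from most to least important.
    target:  list of (label, xyz) for the structure being searched.
    partial: target indices already matched to motif[0], motif[1], ...
    Returns a tuple of target indices covering the whole motif, or None.
    """
    k = len(partial)
    if k == len(motif):
        return partial
    label, xyz = motif[k]
    for idx, (t_label, t_xyz) in enumerate(target):
        if t_label != label or idx in partial:
            continue
        # The new point must keep every distance to the points matched so far.
        compatible = all(
            abs(math.dist(xyz, motif[m][1])
                - math.dist(t_xyz, target[partial[m]][1])) <= tol
            for m in range(k)
        )
        if compatible:
            found = augment_match(motif, target, tol, partial + (idx,))
            if found is not None:
                return found
    return None


if __name__ == "__main__":
    motif = [("H", (0, 0, 0)), ("D", (3, 0, 0)), ("S", (0, 4, 0))]
    target = [("S", (10, 14, 0)), ("G", (1, 1, 1)), ("H", (10, 10, 0)),
              ("D", (13, 10, 0)), ("D", (20, 20, 20))]
    print(augment_match(motif, target))
```

Matching the highest-priority points first means that most unpromising candidates are rejected after one or two distance checks. Note that distance agreement alone cannot tell a match from its mirror image, which a superposition-based score would distinguish.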

Integrated search and alignment of protein structures

Bioinformatics, 2008

Motivation: Identification and comparison of similar three-dimensional (3D) protein structures has become an even greater challenge in the face of the rapidly growing structure databases. Here, we introduce Vorometric, a new method that provides efficient search and alignment of a query protein against a database of protein structures. Voronoi contacts of the protein residues are enriched with the secondary structure information and a metric substitution matrix is developed to allow efficient indexing. The contact hits obtained from a distance-based indexing method are extended to obtain high-scoring segment pairs, which are then used to generate structural alignments. Results: Vorometric is the first to address both search and alignment problems in the protein structure databases. The experimental results show that Vorometric is simultaneously effective in retrieving similar protein structures, producing high-quality structure alignments, and identifying cross-fold similarities. Vorometric outperforms current structure retrieval methods in search accuracy, while requiring comparable running times. Furthermore, the structural superpositions produced are shown to have better quality and coverage, when compared with those of the popular structure alignment tools.

SPRITE and ASSAM: web servers for side chain 3D-motif searching in protein structures

Nucleic Acids Research, 2012

Similarities in the 3D patterns of amino acid side chains can provide insights into their function despite the absence of any detectable sequence or fold similarities. Search for protein sites (SPRITE) and amino acid pattern search for substructures and motifs (ASSAM) are graph theoretical programs that can search for 3D amino acid side chain matches in protein structures, by representing the amino acid side chains as pseudo-atoms. The geometric relationship of the pseudo-atoms to each other as a pattern can be represented as a labeled graph where the pseudo-atoms are the graph's nodes while the edges are the inter-pseudo-atomic distances. Both programs require the input file to be in the PDB format. The objective of using SPRITE is to identify matches of side chains in a query structure to patterns with characterized function. In contrast, a 3D pattern of interest can be searched for existing occurrences in available PDB structures using ASSAM. Both programs are freely accessible without any login requirement.
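The labeled-graph representation described above can be made concrete with a small sketch. The code below builds a graph in which each pseudo-atom is a node labeled with its residue type and each pair of pseudo-atoms is joined by an edge holding their distance. It is an illustration of the data structure only: the input layout, the function name `pattern_graph`, and the example residue names and coordinates are assumptions, and nothing here reflects the internal formats of SPRITE or ASSAM.

```python
import math
from itertools import combinations


def pattern_graph(pseudo_atoms):
    """Build a labeled distance graph from side chain pseudo-atoms.

    pseudo_atoms: dict mapping a node name to (residue_type, xyz).
    Returns (nodes, edges), where nodes maps name -> residue_type and
    edges maps frozenset({name_a, name_b}) -> distance between the two.
    """
    nodes = {name: residue for name, (residue, _) in pseudo_atoms.items()}
    edges = {
        frozenset((a, b)): math.dist(pseudo_atoms[a][1], pseudo_atoms[b][1])
        for a, b in combinations(sorted(pseudo_atoms), 2)
    }
    return nodes, edges


if __name__ == "__main__":
    # Made-up coordinates for a three-residue pattern.
    pattern = {
        "His57": ("HIS", (0.0, 0.0, 0.0)),
        "Asp102": ("ASP", (3.0, 0.0, 0.0)),
        "Ser195": ("SER", (0.0, 4.0, 0.0)),
    }
    nodes, edges = pattern_graph(pattern)
    print(nodes)
    print(edges[frozenset(("Asp102", "Ser195"))])
```

Using a `frozenset` as the edge key makes the edge undirected, so the distance can be looked up with the two node names in either order.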

Search strategies in structural bioinformatics

Current Protein & Peptide Science, 2008

Optimisation problems pervade structural bioinformatics. In this review, we describe recent work addressing a selection of bioinformatics challenges. We begin with a discussion of research into protein structure comparison, and highlight the utility of Kolmogorov complexity as a measure of structural similarity. We then turn to research into de novo protein structure prediction, in which structures are generated from first principles. In this endeavour, there is a compromise between the detail of the model and the extent to which the conformational space of the protein can be sampled. We discuss some developments in this area, including off-lattice structure prediction using the great deluge algorithm. One strategy to reduce the size of the search space is to restrict the protein chain to sites on a regular lattice. In this context, we highlight the use of memetic algorithms, which combine genetic algorithms with local optimisation, in the study of simple protein models on the two-dimensional square lattice and the face-centred cubic lattice.