Analyzing patterns between regular secondary structures using short structural building blocks defined by a hidden Markov model

Hidden Markov model approach for identifying the modular framework of the protein backbone

Protein Engineering Design and Selection, 1999

The hidden Markov model (HMM) was used to identify recurrent short 3D structural building blocks (SBBs) describing protein backbones, independently of any a priori knowledge. Polypeptide chains are decomposed into a series of short segments defined by their inter-α-carbon distances. Basically, the model takes into account the sequentiality of the observed segments and assumes that each one corresponds to one of several possible SBBs. Fitting the model to a database of non-redundant proteins allowed us to decode proteins in terms of 12 distinct SBBs with different roles in protein structure. Some SBBs correspond to classical regular secondary structures. Others correspond to a significant subdivision of their bounding regions, previously considered to be a single pattern. The major contribution of the HMM is that the model implicitly takes into account the sequential connections between SBBs and thus describes the most probable pathways by which the blocks are connected to form the framework of the protein structures. The SBB code was validated by extracting SBB series repeated in the recoded proteins and examining their structural similarities. Preliminary results on the sequence specificity of SBBs suggest promising perspectives for the prediction of SBBs, or series of SBBs, from protein sequences.
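As a rough illustration of the class of model involved (not the paper's exact formulation), the joint probability of a series of observed segment descriptors o_1…o_T, here assumed to be vectors of inter-α-carbon distances, and hidden SBB labels s_1…s_T under a standard HMM is

```latex
P(o_{1:T}, s_{1:T}) \;=\; \pi_{s_1}\, b_{s_1}(o_1) \prod_{t=2}^{T} a_{s_{t-1} s_t}\, b_{s_t}(o_t)
```

where π is the initial-state distribution, the transition probabilities a_{ij} encode the sequential connections between SBBs, and b_i(o) is the emission density of SBB i (a multivariate Gaussian over the distance descriptors would be one natural, though assumed, choice). Fitting such a model to the protein database yields both the SBB geometries and their most probable connection pathways.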

A Hidden Markov Model applied to the protein 3D structure analysis

Computational Statistics & Data Analysis, 2008

Understanding and predicting protein structures depend on the complexity and the accuracy of the models used to represent them. A Hidden Markov Model has been set up to optimally compress the 3D conformation of proteins into a structural alphabet (SA), corresponding to a library of a limited number of representative SA-letters. Each SA-letter corresponds to a set of short local fragments of four Cα that are similar both in terms of geometry and in the way these fragments are concatenated to make a protein. The discretization of protein backbone local conformation as a series of SA-letters results in a simplification of protein 3D coordinates into a unique 1D representation. Some evidence is presented that such an approach can constitute a very relevant way to analyze protein architecture, in particular for protein structure comparison or prediction.
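As a small illustration of the encoding step, the sketch below slides a four-Cα window along a chain and summarizes each fragment by its non-consecutive inter-Cα distances; the exact descriptors used by the authors may differ, so treat these three distances as an assumption.

```python
import numpy as np

def fragment_descriptors(ca_coords):
    """Slide a four-Calpha window along the backbone and describe each
    fragment by its three non-consecutive inter-Calpha distances
    (Ca1-Ca3, Ca1-Ca4, Ca2-Ca4). Returns an (N-3, 3) array."""
    ca = np.asarray(ca_coords, dtype=float)   # shape (N, 3): one row per Calpha
    descriptors = []
    for i in range(len(ca) - 3):
        c1, c2, c3, c4 = ca[i:i + 4]
        descriptors.append([np.linalg.norm(c3 - c1),
                            np.linalg.norm(c4 - c1),
                            np.linalg.norm(c4 - c2)])
    return np.array(descriptors)
```

Each row then plays the role of one observation to be labelled with an SA-letter by the trained model.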

A hidden Markov model applied to the analysis of protein 3D-structures

Understanding and predicting protein structures depends on the complexity and the accuracy of the models used to represent them. We have set up a Hidden Markov Model to optimally compress the three-dimensional (3D) conformation of proteins into a structural alphabet, i.e. a library of exhaustive and representative states (describing short fragments) learnt simultaneously with their connection logic. The discretization of protein backbone local conformation as a series of states results in a simplification of protein 3D coordinates into a unique unidimensional (1D) representation. We present some evidence that such an approach can constitute a very relevant way to analyze protein architecture, in particular for protein structure comparison or prediction.
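A minimal sketch of the 3D-to-1D step, assuming a hypothetical, already-trained model (initial probabilities `pi`, transition matrix `A`, per-state Gaussian emission parameters `means`/`covs`; the Gaussian form and the letter set are assumptions): Viterbi decoding assigns each fragment descriptor its most probable state, turning the backbone into a single string.

```python
import numpy as np
from scipy.stats import multivariate_normal

def viterbi_sa_string(obs, pi, A, means, covs,
                      letters="abcdefghijklmnopqrstuvwxyzA"):
    """Decode a (T, d) series of fragment descriptors into a 1D string of
    structural-alphabet letters using the Viterbi algorithm (log domain)."""
    T, K = len(obs), len(pi)
    # Per-state log emission probabilities for every observation.
    logB = np.array([multivariate_normal.logpdf(obs, means[k], covs[k])
                     for k in range(K)]).T                      # shape (T, K)
    delta = np.log(pi) + logB[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)                     # previous -> current
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return "".join(letters[k] for k in reversed(path))
```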

Hidden Markov Model for protein secondary structure

We address the problem of protein secondary structure prediction with Hidden Markov Models. A 21-state model is built using biological knowledge and statistical analysis of sequence motifs in regular secondary structures. Sequence family information is integrated via the combination of independent predictions of homologous sequences and a weighting scheme. Prediction accuracy with single sequences reaches 65.3% and rises to 72% correct classification with profile information.
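The combination step can be pictured as a weighted average of the per-residue class probabilities predicted independently for each homologous sequence; the normalized-weight scheme below is only an illustrative assumption, not the paper's exact formula.

```python
import numpy as np

def combine_homolog_predictions(per_seq_probs, seq_weights):
    """per_seq_probs: list of (L, 3) arrays of P(helix/strand/coil) per residue,
    one per homologous sequence aligned onto the query; seq_weights: one weight
    per sequence. Returns the consensus class index (0=H, 1=E, 2=C) per residue."""
    w = np.asarray(seq_weights, dtype=float)
    w /= w.sum()                                    # normalize the weights
    combined = sum(wi * p for wi, p in zip(w, per_seq_probs))
    return combined.argmax(axis=1)
```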

A Hidden Markov Model Derived Structural Alphabet for Proteins

Journal of Molecular Biology, 2004

Understanding and predicting protein structures depends on the complexity and the accuracy of the models used to represent them. We have set up a hidden Markov model that discretizes protein backbone conformation as a series of overlapping fragments (states) of four residues in length. This approach learns simultaneously the geometry of the states and their connections. Using a statistical criterion, we obtain an optimal systematic decomposition of the conformational variability of the protein peptidic chain into 27 states with strong connection logic. This result is stable over different protein sets. Our model fits well with previous knowledge of protein architecture organisation and seems able to capture some subtle details of protein organisation, such as helix sub-level organisation schemes. Taking into account the dependence between the states results in a description of local protein structure of low complexity. On average, the model makes use of only 8.3 states among 27 to describe each position of a protein structure. Although we use short fragments, the learning process on entire protein conformations captures the logic of the assembly on a larger scale. Using such a model, the structure of proteins can be reconstructed with an average accuracy close to 1.1 Å root-mean-square deviation and with a complexity of only 3. Finally, we also observe that sequence specificity increases with the number of states of the structural alphabet. Such models can constitute a very relevant approach to the analysis of protein architecture, in particular for protein structure prediction.
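The "number of states used per position" is plausibly a perplexity-like quantity; the sketch below assumes it is the exponential of the entropy of the posterior state distribution at each position (e.g. obtained from the forward-backward algorithm), which ranges from 1 (a single certain state) to 27 (all states equally likely).

```python
import numpy as np

def effective_states_per_position(gamma):
    """gamma: (T, K) posterior state probabilities, one row per position
    (rows sum to 1). Returns exp(entropy) per position, i.e. how many of
    the K states are effectively in play at that position."""
    g = np.clip(gamma, 1e-12, 1.0)
    entropy = -(g * np.log(g)).sum(axis=1)
    return np.exp(entropy)
```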

“Pinning strategy”: a novel approach for predicting the backbone structure in terms of protein blocks from sequence

Journal of Biosciences, 2007

The description of protein 3D structures can be performed through a library of 3D fragments, named a structural alphabet. Our structural alphabet is composed of 16 small protein fragments, each five Cα in length, called Protein Blocks (PBs). It allows an efficient approximation of 3D protein structures and a correct prediction of the local structure. The 72 most frequent series of five consecutive PBs, called Structural Words (SWs), are able to cover more than 90% of the 3D structures. PBs are highly conditioned by the presence of a limited number of transitions between them. In this study, we propose a new method called the "pinning strategy" that uses this specific feature to predict long protein fragments. Its goal is to define highly probable successions of PBs. It starts from the most probable SW and is then extended with overlapping SWs. Starting from an initial prediction rate of 34.4%, the use of SWs instead of PBs allows a gain of 4.5%. The pinning strategy simply applied to the SWs increases the prediction accuracy to 39.9%. In a second step, the sequence–structure relationship is optimized and the prediction accuracy reaches 43.6%.
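A simplified sketch of the pinning idea (not the authors' exact algorithm): seed with the most probable Structural Word, then repeatedly extend the pinned stretch with the best-scoring SW that overlaps it by four letters and agrees with the letters already placed. The score table and how it is obtained from sequence are assumptions here.

```python
def pin_predict(sw_scores, length):
    """sw_scores: dict {(pos, word): score}, where word is a 5-letter string of
    PBs predicted to start at 0-based position pos; length: number of PB
    positions. Returns a list of pinned PB letters (None where nothing is pinned)."""
    # Seed: the single most probable Structural Word anywhere in the sequence.
    (seed_pos, seed_word), _ = max(sw_scores.items(), key=lambda kv: kv[1])
    assigned = {seed_pos + i: c for i, c in enumerate(seed_word)}
    extended = True
    while extended:
        extended = False
        lo, hi = min(assigned), max(assigned)        # current pinned stretch
        candidates = []
        for (pos, word), score in sw_scores.items():
            # Words starting at lo-1 or hi-3 overlap the stretch by four letters.
            if pos in (lo - 1, hi - 3) and all(
                    assigned.get(pos + i, c) == c for i, c in enumerate(word)):
                candidates.append((score, pos, word))
        if candidates:
            _, pos, word = max(candidates)           # highest-scoring compatible SW
            assigned.update({pos + i: c for i, c in enumerate(word)})
            extended = True
    return [assigned.get(i) for i in range(length)]
```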

A Segmental Semi Markov Model for protein secondary structure prediction

Mathematical Biosciences, 2009

Hidden Markov Models (HMMs) are practical tools which provide a probabilistic basis for protein secondary structure prediction. In these models, usually only the information on the left-hand side of an amino acid is considered. Accordingly, these models seem to be inefficient with respect to long-range correlations. In this work we discuss a Segmental Semi Markov Model (SSMM) in which the information on both sides of an amino acid is considered. It is assumed, and seems reasonable, that the information on both sides of an amino acid can provide a suitable tool for measuring dependencies. We consider these dependencies by dividing them into shorter dependencies. Each of these dependency models can be applied to estimate the probability of segments in structural classes. Several conditional probabilities concerning the dependency of an amino acid on the residues appearing on both of its sides are considered. Based on these conditional probabilities, a weighted model is obtained to calculate the probability of each segment in a structure. This results in a 2.27% increase in prediction accuracy in comparison with ordinary Segmental Semi Markov Models (SSMMs). We also compare the performance of our model with that of the Segmental Semi Markov Model introduced by Schmidler et al. [C.S. Schmidler, J.S. Liu, D.L. Brutlag, Bayesian segmentation of protein secondary structure, J. Comp. Biol. 7(1/2) (2000) 233-248]. The calculations show that the overall prediction accuracy of our model is higher than that of the SSMM introduced by Schmidler.
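One plausible reading of the weighted model (a hedged sketch only; the exact formulation is given in the paper): the probability of a segment a_{p:q} assigned to structural type T is built from conditional probabilities that look k residues to the left and to the right of each position, combined with weights summing to one:

```latex
P\!\left(a_{p:q} \mid T\right) \;\approx\; \prod_{i=p}^{q} \sum_{k} w_k \,
P_k\!\left(a_i \mid a_{i-k}, \ldots, a_{i-1},\, a_{i+1}, \ldots, a_{i+k},\, T\right),
\qquad \sum_{k} w_k = 1 .
```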

Choosing the Optimal Hidden Markov Model for Secondary-Structure Prediction

IEEE Intelligent Systems, 2005

Proteins are major constituents of living cells, forming many cellular components and most enzymes. So, knowledge of 3D protein structures is essential to understanding biological mechanisms. The experimental determination of protein structures remains a long and difficult task, however. To contend with this, as well as to take advantage of the numerous protein sequences emerging from genome projects, many international researchers are developing structure prediction methods. To use these methods, researchers must often first predict local structures, using methods such as secondary structure prediction.

Identification of local variations within secondary structures of proteins

Acta crystallographica. Section D, Biological crystallography, 2015

Secondary-structure elements (SSEs) play an important role in the folding of proteins. Identification of SSEs in proteins is a common problem in structural biology. A new method, ASSP (Assignment of Secondary Structure in Proteins), using only the path traversed by the Cα atoms has been developed. The algorithm is based on the premise that the protein structure can be divided into continuous or uniform stretches, which can be defined in terms of helical parameters, and depending on their values the stretches can be classified into different SSEs, namely α-helices, 3₁₀-helices, π-helices, extended β-strands and polyproline II (PPII) and other left-handed helices. The methodology was validated using an unbiased clustering of these parameters for a protein data set consisting of 1008 protein chains, which suggested that there are seven well-defined clusters associated with different SSEs. Apart from α-helices and extended β-strands, 3₁₀-helices and π-helices were also found to occur ...
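ASSP itself fits local helical parameters (twist, rise, radius and so on) to the Cα trace; as a deliberately crude illustration of what "using only the path traversed by the Cα atoms" can look like, the sketch below computes the virtual torsion angle of four consecutive Cα atoms and uses approximate, non-ASSP threshold values to separate helix-like from extended stretches.

```python
import numpy as np

def ca_virtual_torsion(p0, p1, p2, p3):
    """Torsion angle (degrees) defined by four consecutive Calpha positions."""
    b1, b2, b3 = p1 - p0, p2 - p1, p3 - p2
    n1, n2 = np.cross(b1, b2), np.cross(b2, b3)
    y = np.dot(np.cross(n1, n2), b2 / np.linalg.norm(b2))
    x = np.dot(n1, n2)
    return np.degrees(np.arctan2(y, x))

def crude_sse_labels(ca_coords):
    """Very rough per-fragment labelling from Calpha virtual torsions alone:
    roughly +50 deg is typical of an alpha-helix, values near +/-180 deg of an
    extended strand (illustrative ranges only; ASSP distinguishes many more
    classes, including 3-10, pi and PPII helices, via fitted helical parameters)."""
    ca = np.asarray(ca_coords, dtype=float)
    labels = []
    for i in range(len(ca) - 3):
        t = ca_virtual_torsion(*ca[i:i + 4])
        if 30.0 <= t <= 70.0:
            labels.append("H")        # alpha-helix-like
        elif abs(t) >= 150.0:
            labels.append("E")        # extended / strand-like
        else:
            labels.append("-")        # everything else
    return labels
```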

Predicting protein structure using hidden Markov models

Proteins: Structure, Function, and Genetics, 1997

We discuss how methods based on hidden Markov models performed in the fold recognition section of the CASP2 experiment. Hidden Markov models were built for a set of about a thousand structures from the PDB database, and each CASP2 target sequence was scored against this library of hidden Markov models. In addition, a hidden Markov model was built for each of the target sequences, and all of the sequences in PDB were scored against that target model. Having high scores from both methods was found to be highly indicative of the target and the structure being homologous.
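The reciprocal criterion can be sketched as below, assuming two precomputed score tables (the target sequence scored against every structure HMM in the library, and every PDB sequence scored against the HMM built from the target) and assuming higher scores are better; entries ranked near the top of both lists are the flagged fold-recognition hits.

```python
def reciprocal_hits(target_vs_library, library_vs_target, top_n=10):
    """target_vs_library: {pdb_id: score} of the target sequence against each
    structure's HMM; library_vs_target: {pdb_id: score} of each PDB sequence
    against the target's HMM. Returns entries ranked in the top_n of both
    directions, i.e. the candidates most strongly suggested to be homologous."""
    top_forward = sorted(target_vs_library, key=target_vs_library.get,
                         reverse=True)[:top_n]
    top_reverse = set(sorted(library_vs_target, key=library_vs_target.get,
                             reverse=True)[:top_n])
    return [pdb_id for pdb_id in top_forward if pdb_id in top_reverse]
```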