Building a Knowledge-Based Statistical Potential by Capturing High-Order Inter-residue Interactions and its Applications in Protein Secondary Structure Assessment

Sixty-five years of the long march in protein secondary structure prediction: the final stretch?

Briefings in Bioinformatics, 2016

Protein secondary structure prediction began in 1951 when Pauling and Corey predicted helical and sheet conformations for the protein polypeptide backbone even before the first protein structure was determined. Sixty-five years later, powerful new methods breathe new life into this field. The highest three-state accuracy without relying on structure templates is now at 82-84%, a number unthinkable just a few years ago. These improvements came from increasingly larger databases of protein sequences and structures for training, the use of template secondary structure information and more powerful deep learning techniques. As we approach the theoretical limit of three-state prediction (88-90%), alternatives to secondary structure prediction (prediction of backbone torsion angles and Cα-atom-based angles and torsion angles) not only have more room for further improvement but also allow direct prediction of three-dimensional fragment structures with constantly improved accuracy. …

A Novel Methodology for Protein Secondary Structure Prediction Using Physics Principles

Proc. of the 2012 International Conference on Medical Physics and Biomedical Engineering (ICMPBE-2012), 2012

Protein structure prediction (PSP) is one of the most important and challenging problems in bioinformatics today, because the biological function of a protein is determined by its structure. This paper presents a novel methodology for protein secondary structure prediction using physical and chemical properties. The methodology is particularly effective when the consensus prediction approach fails. In addition, it assigns an energy value to each secondary structure conformation and selects the minimum-energy one to build the tertiary structure, as sketched below.
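The abstract does not specify the physical energy terms used, but the selection step it describes reduces to scoring each candidate conformation and keeping the one with the lowest energy. A minimal sketch, with a placeholder energy function standing in for the paper's physics-based terms:

def select_conformation(candidates, energy_fn):
    """Pick the candidate secondary-structure conformation with the lowest
    energy. The energy function is a stand-in; the abstract does not give
    the physical terms actually used."""
    return min(candidates, key=energy_fn)

# Toy example: three alternative three-state strings for one fragment,
# scored by a placeholder energy function.
candidates = ["HHHHCC", "CEEEEC", "CCHHHC"]
toy_energy = lambda conf: -conf.count("H") * 1.0 - conf.count("E") * 0.8
print(select_conformation(candidates, toy_energy))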

On the use of secondary structure in protein structure prediction: a bioinformatic analysis

Polymer

The amount of structural information encoded in secondary structure can be measured by its ability to specify the correct peptide backbone conformation of protein chains. Using methodology derived from information theory, we generate optimized distributions of backbone phi-psi dihedral angle pairs given either correct or predicted three-state secondary structure. Entropy measurements on these distributions provide a means to determine the effect of secondary structure knowledge on identifying the actual 3D conformation of protein chains. We find that only a modest fraction of the total uncertainty in phi-psi conformation (from 14 to 38%, at 20-90° resolutions, respectively) is resolved even with perfect knowledge of secondary structure. We further show that prediction of secondary structures, because of an accuracy ceiling below 80%, degrades structural information substantially. If prediction accuracy is below 50%, virtually no advantage is gained from using the prediction. Moreover, even state-of-the-art prediction accuracy of 75% retains less than one-third of the structural information encoded in secondary structure. We demonstrate that the level of structural description affects the amount of information extracted. The effort to provide as much structural detail as possible, while faced with a limited structural data set, results in an optimum resolution in the vicinity of a 20° partition of the (phi, psi) plane. We show that structural information increases exponentially with prediction accuracy, revealing that even marginal gains in the performance of secondary structure prediction algorithms are important for the retention of structural information. We observe that different kinds of secondary structure prediction outputs (single-state prediction, single-state prediction with a confidence index, and three-state probability prediction) do not differ greatly in the amount of structural information they yield, so long as the methods formulated in this work to generate propensity distributions are applied appropriately. The optimal phi-psi probability distributions developed here may be useful in biasing searches in structure space. We discuss the sources of the degradation of information caused by errors in secondary structure prediction, and their consequences for the prediction of the 3D conformation of protein chains.
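The central quantity in this analysis is the fraction of phi-psi uncertainty resolved by conditioning on the secondary-structure state. A minimal sketch of that entropy calculation, assuming a 20° grid over the (phi, psi) plane and toy count data (the binning, estimation details, and state labels are illustrative, not the authors' exact protocol):

import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a discretized phi-psi distribution."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def structural_information(counts_by_state):
    """Fraction of phi-psi uncertainty resolved by knowing the state:
    (H(phi,psi) - H(phi,psi | state)) / H(phi,psi).

    counts_by_state: dict mapping 'H'/'E'/'C' to an (18, 18) array of
    counts over a 20-degree grid of the (phi, psi) plane (toy input)."""
    total = sum(c.sum() for c in counts_by_state.values())
    marginal = sum(c for c in counts_by_state.values()) / total
    h_marginal = entropy(marginal.ravel())

    # Conditional entropy: weight each state's entropy by its prior.
    h_conditional = 0.0
    for c in counts_by_state.values():
        p_state = c.sum() / total
        h_conditional += p_state * entropy((c / c.sum()).ravel())

    return (h_marginal - h_conditional) / h_marginal

# Toy example: random counts just to exercise the function.
rng = np.random.default_rng(0)
counts = {s: rng.integers(1, 50, size=(18, 18)).astype(float) for s in "HEC"}
print(f"fraction of phi-psi uncertainty resolved: {structural_information(counts):.2f}")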

Protein Secondary Structure Prediction: A Review of Progress and Directions

Current Bioinformatics

Background: Over the last few decades, the search for a theory of protein folding has grown into a full-fledged research field at the intersection of biology, chemistry and informatics. Despite enormous effort, there are still open questions and challenges, like understanding the rules by which amino acid sequence determines protein secondary structure. Objective: In this review, we depict the progress of the prediction methods over the years and identify sources of improvement. Methods: The protein secondary structure prediction problem is described, followed by a discussion of theoretical limitations, a description of the commonly used data sets and features, and a review of three generations of methods with a focus on the most recent advances. Additionally, methods with available online servers are assessed on an independent data set. Results: The state-of-the-art methods currently reach almost 88% for 3-class prediction and 76.5% for 8-class prediction. Conclusion: This ...

New joint prediction algorithm (Q7-JASEP) improves the prediction of protein secondary structure

Biochemistry, 1991

The classical problem of secondary structure prediction is approached by a new joint algorithm (Q7-JASEP) that combines the best aspects of six different methods. The algorithm includes the statistical methods of Chou-Fasman, Nagano, and Burgess-Ponnuswamy-Scheraga, the homology method of Nishikawa, the information theory method of Garnier-Osguthorpe-Robson, and the artificial neural network approach of Qian-Sejnowski. Steps in the algorithm are (i) optimizing each individual method with respect to its correlation coefficient (Q7) for assigning a structural type from the predictive score of the method, (ii) weighting each method, (iii) combining the scores from different methods, and (iv) comparing the scores for α-helix, β-strand, and coil conformational states to assign the secondary structure at each residue position. The present application to 45 globular proteins demonstrates good predictive power in cross-validation testing (with average correlation coefficients per test protein of Q7,α = 0.41, Q7,β = 0.47, Q7,c = 0.41 for α-helix, β-strand, and coil conformations). By the criterion of the correlation coefficient (Q7) for each type of secondary structure, Q7-JASEP performs better than any of the component methods. When all protein classes are included for training and testing (by cross-validation), the results here equal the best in the literature, by the Q7 criterion. More generally, the basic algorithm can be applied to any protein class and to any type of structure/sequence or function/sequence correlation for which multiple predictive methods exist.
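The combination scheme in steps (ii)-(iv) amounts to a weighted sum of per-method scores followed by an argmax over the three states. A minimal sketch of that idea (the weights, score scales, and the subset of method names below are placeholders, not the published parameterization):

# Hypothetical per-residue scores from component methods, one dict per
# method, each giving a score for helix (H), strand (E) and coil (C).
method_scores = {
    "chou_fasman": {"H": 1.10, "E": 0.85, "C": 0.95},
    "gor":         {"H": 0.40, "E": -0.10, "C": 0.05},
    "neural_net":  {"H": 0.70, "E": 0.20, "C": 0.10},
}

# Hypothetical weights, e.g. derived from each method's correlation
# coefficient on a training set (step ii of the algorithm).
weights = {"chou_fasman": 0.8, "gor": 1.0, "neural_net": 1.2}

def combine(method_scores, weights):
    """Steps (iii)-(iv): weighted sum of scores, then argmax over H/E/C."""
    combined = {state: 0.0 for state in "HEC"}
    for name, scores in method_scores.items():
        for state, score in scores.items():
            combined[state] += weights[name] * score
    return max(combined, key=combined.get), combined

state, combined = combine(method_scores, weights)
print(state, combined)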

Review: Protein Secondary Structure Prediction Continues to Rise

Methods predicting protein secondary structure improved substantially in the 1990s through the use of evolutionary information taken from the divergence of proteins in the same structural family. Recently, the evolutionary information resulting from improved searches and larger databases has again boosted prediction accuracy by more than four percentage points to its current height of around 76% of all residues predicted correctly in one of the three states, helix, strand, and other. The past year also brought successful new concepts to the field. These new methods may be particularly interesting in light of the improvements achieved through simple combining of existing methods. Divergent evolutionary profiles contain enough information not only to substantially improve prediction accuracy, but also to correctly predict long stretches of identical residues observed in alternative secondary structure states depending on nonlocal conditions. An example is a method automatically identifying structural switches and thus finding a remarkable connection between predicted secondary structure and aspects of function. Secondary structure predictions are increasingly becoming the workhorse for numerous methods aimed at predicting protein structure and function. Is the recent increase in accuracy significant enough to make predictions even more useful? Because the recent improvement yields a better prediction of segments, and in particular of β strands, I believe the answer is affirmative. What is the limit of prediction accuracy? We shall see.

HYPROSP II-A knowledge-based hybrid method for protein secondary structure prediction based on local prediction confidence

Bioinformatics, 2005

Motivation: In our previous approach, we proposed a hybrid method for protein secondary structure prediction called HYPROSP, which combined our proposed knowledge-based prediction algorithm PROSP and PSIPRED. The knowledge base constructed for PROSP contains small peptides together with their secondary structural information. The hybrid strategy of HYPROSP uses a global quantitative measure, match rate, to determine whether PROSP or PSIPRED is to be used for the prediction of a target protein. HYPROSP made a slight improvement in Q3 over PSIPRED because PROSP predicted well for proteins with match rate >80%. As the portion of proteins with match rate >80% is quite small and as the performance of PSIPRED also improves, the advantage of HYPROSP is diluted. To overcome this limitation and further improve the hybrid prediction method, we present in this paper a new hybrid strategy, HYPROSP II, that is based on a new quantitative measure called local match rate. Results: Local match rate indicates the amount of structural information that each amino acid can extract from the knowledge base. With the local match rate, we are able to define a confidence level of the PROSP prediction results for each amino acid. Our new hybrid approach, HYPROSP II, is proposed as follows: for each amino acid in a target protein, we combine the prediction results of PROSP and PSIPRED using a hybrid function defined on their respective confidence levels. Two datasets, nrDSSP and EVA, are used to perform a 10-fold cross validation. The average Q3 of HYPROSP II is 81.8% and 80.7% on the nrDSSP and EVA datasets, respectively, which is 2.0% and 1.1% better than that of PSIPRED. For local structures with match rate >80%, the average Q3 improvement is 4.4% on the nrDSSP dataset. The use of local match rate improves the accuracy better than the global match rate. There has been a long history of attempts to improve secondary structure prediction. We believe that HYPROSP II has greatly utilized the power of the peptide knowledge base and raised the prediction accuracy to a new high. The method we developed in this paper could have a profound effect on the general use of knowledge base techniques for various prediction algorithms. Availability: The Linux executable file of HYPROSP II, as well as both nrDSSP and EVA datasets, can be downloaded from
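The abstract does not reproduce the hybrid function itself, but the idea of weighting PROSP against PSIPRED per residue by a confidence derived from the local match rate can be sketched as follows (the linear weighting and the 0.8 cutoff are illustrative assumptions, not the published hybrid function):

def hybrid_prediction(prosp_probs, psipred_probs, local_match_rate,
                      threshold=0.8):
    """Per-residue hybrid of two three-state predictions.

    prosp_probs, psipred_probs: lists of dicts {'H': p, 'E': p, 'C': p},
    one per residue. local_match_rate: floats in [0, 1], the amount of
    structural information each residue can extract from the peptide
    knowledge base. Weighting scheme and threshold are assumptions."""
    prediction = []
    for p_kb, p_nn, rate in zip(prosp_probs, psipred_probs, local_match_rate):
        # Trust the knowledge-base prediction in proportion to how well the
        # residue's local peptides are covered by the knowledge base.
        w = rate if rate >= threshold else 0.0
        merged = {s: w * p_kb[s] + (1.0 - w) * p_nn[s] for s in "HEC"}
        prediction.append(max(merged, key=merged.get))
    return "".join(prediction)

# Toy example for a 3-residue fragment.
prosp = [{"H": 0.7, "E": 0.1, "C": 0.2}] * 3
psipred = [{"H": 0.2, "E": 0.5, "C": 0.3}] * 3
print(hybrid_prediction(prosp, psipred, [0.9, 0.5, 0.85]))  # -> "HEH"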

Further developments of protein secondary structure prediction using information theory

Journal of Molecular Biology, 1987

We have re-evaluated the information used in the Garnier-Osguthorpe-Robson (GOR) method of secondary structure prediction with the currently available database. The framework of information theory provides a means to formulate the influence of local sequence upon the conformation of a given residue in a rigorous manner. However, the existing database does not allow the evaluation of parameters required for an exact treatment of the problem. The validity of the approximations drawn from the theory is examined. It is shown that the first-level approximation, involving single-residue parameters, is only marginally improved by an increase in the database. The second-level approximation, involving pairs of residues, provides a better model. However, in this case the database is not big enough, and this method might lead to parameters with deficiencies. Attention is therefore given to overcoming this lack of data. We have determined the significant pairs and the number of dummy observations necessary to obtain the best result for the prediction.
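The dummy-observation device can be illustrated with a toy estimate of a GOR-style information value, I(S; R) = log(P(S|R)/P(S)); the pseudocount size and counts below are made up for illustration and are not taken from the paper:

import math

def information_value(count_s_r, count_r, prior_s, dummy=50):
    """Information I(S; R) = log P(S | R) / P(S), estimated from counts.

    count_s_r: observations of residue (or residue pair) R in conformation S.
    count_r:   total observations of R.
    prior_s:   background probability of conformation S.
    dummy:     number of dummy observations added at the prior frequency,
               a pseudo-count device to stabilize sparse pair statistics
               (the value 50 is illustrative, not the published choice)."""
    p_s_given_r = (count_s_r + dummy * prior_s) / (count_r + dummy)
    return math.log(p_s_given_r / prior_s)

# Single-residue (first-level) parameter with ample data.
print(information_value(count_s_r=900, count_r=2400, prior_s=0.35))

# Sparse pair (second-level) parameter: the dummy observations keep the
# estimate from being dominated by a handful of counts.
print(information_value(count_s_r=3, count_r=5, prior_s=0.35))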