Current protein structure predictors do not produce meaningful folding pathways
Related papers
Current structure predictors are not learning the physics of protein folding
Bioinformatics, 2022
Summary. Motivation: Predicting the native state of a protein has long been considered a gateway problem for understanding protein folding. Recent advances in structural modeling driven by deep learning have achieved unprecedented success at predicting a protein’s crystal structure, but it is not clear if these models are learning the physics of how proteins dynamically fold into their equilibrium structure or are just accurate knowledge-based predictors of the final state. Results: In this work, we compare the pathways generated by state-of-the-art protein structure prediction methods to experimental data about protein folding pathways. The methods considered were AlphaFold 2, RoseTTAFold, trRosetta, RaptorX, DMPfold, EVfold, SAINT2 and Rosetta. We find evidence that their simulated dynamics capture some information about the folding pathway, but their predictive ability is worse than a trivial classifier using sequence-agnostic features like chain length. The folding trajectories p...
ManyFold: an efficient and flexible library for training and validating protein folding models
Bioinformatics
Summary: ManyFold is a flexible library for protein structure prediction with deep learning that (i) supports models that use both multiple sequence alignments (MSAs) and protein language model (pLM) embeddings as inputs, (ii) allows inference of existing models (AlphaFold and OpenFold), (iii) is fully trainable, allowing for both fine-tuning and the training of new models from scratch and (iv) is written in Jax to support efficient batched operation in distributed settings. A proof-of-concept pLM-based model, pLMFold, is trained from scratch to obtain reasonable results with reduced computational overheads in comparison to AlphaFold. Availability and implementation: The source code for ManyFold, the validation dataset and a small sample of training data are available at https://github.com/instadeepai/manyfold. Supplementary information: Supplementary data are available at Bioinformatics online.
Deep Learning-Based Advances in Protein Structure Prediction
International Journal of Molecular Sciences, 2021
Obtaining an accurate description of protein structure is a fundamental step toward understanding the underpinnings of biology. Although recent advances in experimental approaches have greatly enhanced our capabilities to experimentally determine protein structures, the gap between the number of protein sequences and known protein structures is ever increasing. Computational protein structure prediction is one of the ways to fill this gap. Recently, the protein structure prediction field has witnessed many advances due to Deep Learning (DL)-based approaches, as evidenced by the success of AlphaFold2 in the most recent Critical Assessment of protein Structure Prediction (CASP14). In this article, we highlight important milestones and progress in the field of protein structure prediction due to DL-based methods as observed in CASP experiments. We describe advances in various steps of the protein structure prediction pipeline viz. protein contact map prediction, protein distogram predi...
Simplified Protein Models: Predicting Folding Pathways and Structure Using Amino Acid Sequences
Physical Review Letters, 2013
We demonstrate the ability to simultaneously determine a protein's folding pathway and structure using a properly formulated model without prior knowledge of the native structure. Our model employs a natural coordinate system for describing proteins and a search strategy inspired by the observation that real proteins fold in a sequential fashion by incrementally stabilizing native-like substructures or "foldons". Comparable folding pathways and structures are obtained for the twelve proteins recently studied using atomistic molecular dynamics simulations [K. Lindorff-Larsen, S. Piana, R.O. Dror, D. E. Shaw, Science 334, 517 (2011)], with our calculations running several orders of magnitude faster. We find that native-like propensities in the unfolded state do not necessarily determine the order of structure formation, a departure from a major conclusion of the MD study. Instead, our results support a more expansive view wherein intrinsic local structural propensities may be enhanced or overridden in the folding process by environmental context. The success of our search strategy validates it as an expedient mechanism for folding both in silico and in vivo.
A Review of Protein Structure Prediction using Deep Learning
BIO Web of Conferences
Proteins are macromolecules composed of 20 types of amino acids in a specific order. Understanding how proteins fold is vital because a protein's three-dimensional structure determines its function. Prediction of protein structure based on amino acid strands and evolutionary information forms the basis for other studies such as predicting the function, property or behaviour of a protein and modifying or designing new proteins to perform certain desired functions. Machine learning advances, particularly deep learning, are igniting a paradigm shift in scientific study. In this review, we summarize recent work in applying deep learning techniques to tackle problems in protein structure prediction. We discuss various deep learning approaches used to predict protein structure and future achievements and challenges. This review is expected to help provide perspectives on problems in biochemistry that can take advantage of the deep learning approach. Some of the unanswered challenges w...
AlphaFold: Improved protein structure prediction using
2019
Protein structure prediction aims to determine the three-dimensional shape of a protein from its amino acid sequence [1]. This problem is of fundamental importance to biology as the structure of a protein largely determines its function [2] but can be hard to determine experimentally. In recent years, considerable progress has been made by leveraging genetic information: analysing the co-variation of homologous sequences can allow one to infer which amino acid residues are in contact, which in turn can aid structure prediction [3]. In this work, we show that we can train a neural network to accurately predict the distances between pairs of residues in a protein which convey more about structure than contact predictions. With this information we construct a potential of mean force [4] that can accurately describe the shape of a protein. We find that the resulting potential can be optimised by a simple gradient descent algorithm, to realise structures without the nee...
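The idea of optimising a distance-based potential by gradient descent, as described in the abstract above, can be illustrated with a toy example. The sketch below is not AlphaFold's potential of mean force: it assumes a simple harmonic penalty on the deviation from fixed target distances, and the function names, the three-point system, the step size and the iteration count are all hypothetical choices made for illustration.

```python
# Illustrative sketch only: a harmonic distance potential, NOT AlphaFold's potential.
# All names, target distances and hyperparameters are hypothetical.
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Targets = Dict[Tuple[int, int], float]  # (i, j) -> target distance


def pair_distance(p: Point, q: Point) -> float:
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def potential(coords: List[Point], targets: Targets) -> float:
    """Sum of squared deviations between actual and target pair distances."""
    return sum(
        (pair_distance(coords[i], coords[j]) - t) ** 2
        for (i, j), t in targets.items()
    )


def gradient_descent(coords: List[Point], targets: Targets,
                     step: float = 0.05, n_steps: int = 500) -> List[Point]:
    """Minimise the potential by plain gradient descent on the coordinates."""
    pts = [list(p) for p in coords]
    for _ in range(n_steps):
        grads = [[0.0, 0.0] for _ in pts]
        for (i, j), t in targets.items():
            dx = pts[i][0] - pts[j][0]
            dy = pts[i][1] - pts[j][1]
            d = math.hypot(dx, dy)
            if d == 0.0:
                continue  # gradient is undefined for coincident points
            scale = 2.0 * (d - t) / d  # derivative of (d - t)^2 along the pair
            grads[i][0] += scale * dx
            grads[i][1] += scale * dy
            grads[j][0] -= scale * dx
            grads[j][1] -= scale * dy
        for p, g in zip(pts, grads):
            p[0] -= step * g[0]
            p[1] -= step * g[1]
    return [(p[0], p[1]) for p in pts]


# Three points whose pairwise target distances are all 1.0 (an equilateral triangle).
targets: Targets = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0}
start: List[Point] = [(0.0, 0.0), (1.5, 0.1), (0.3, 0.9)]
final = gradient_descent(start, targets)
print(potential(start, targets), potential(final, targets))
```

In this toy setting the target distances are given exactly; in the setting the abstract describes, they would instead come from a neural network's predictions.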
A machine learning approach for predicting kinetic order and rate constant of protein folding
Understanding the basic rules of protein folding is one of the most important challenges of molecular biology. In recent years, several experiments have been carried out in order to study the pathway and stability of protein folding. Empirical models are available for predicting protein folding rates, based on the linear correlation between structural protein features and folding kinetics. However, no direct statistical evaluation of their prediction performance is available. Recently, a significant amount of kinetic data on protein folding was published. This allows the application of machine learning methods for predicting the kinetic order and rate of protein folding starting from structural information. In this paper we describe a support vector machine-based method suited to predict whether a protein is endowed with intermediates in the folding process and also the protein folding rate constants. Using a dataset consisting of 63 experimental protein folding data, our predictor correctly classifies 78% of the folding pathways in the database and supplies an estimate of the logarithm of the folding rate constant with a correlation coefficient of 0.65. The method outperforms previous methods in optimizing the solution of folding-rate predictions. Furthermore, by predicting the presence of putative folding intermediates, it also provides a scheme for highlighting putative protein folding mechanisms.
Directionality in protein fold prediction
2010
Background: Ever since the groundbreaking work of Anfinsen et al. in which a denatured protein was found to refold to its native state, it has been frequently stated by the protein fold prediction community that all the information required for protein folding lies in the amino acid sequence. Recent in vitro experiments and in silico computational studies, however, have shown that cotranslation may affect the folding pathway of some proteins, especially those of ancient folds. In this paper, aspects of cotranslational folding have been incorporated into a protein structure prediction algorithm by adapting the Rosetta program to fold proteins as the nascent chain elongates. This makes it possible to conduct a pairwise comparison of folding accuracy, by comparing folds created sequentially from each end of the protein. Results: A single main result emerged: in 94% of proteins analyzed, following the sense of translation, from N-terminus to C-terminus, produced better predictions than following the reverse sense of translation, from the C-terminus to N-terminus. Two secondary results emerged. First, this superiority of N-terminus to C-terminus folding was more marked for proteins showing stronger evidence of cotranslation and second, an algorithm following the sense of translation produced predictions comparable to, and occasionally better than, Rosetta. Conclusions: There is a directionality effect in protein fold prediction. At present, prediction methods appear to be too noisy to take advantage of this effect; as techniques refine, it may be possible to draw benefit from a sequential approach to protein fold prediction.
Beating the Best: Improving on AlphaFold2 at Protein Structure Prediction
arXiv (Cornell University), 2023
The goal of the Protein Structure Prediction (PSP) problem is to predict a protein's 3D structure (conformation) from its amino acid sequence. The problem has been a 'holy grail' of science since the Nobel Prize-winning work of Anfinsen demonstrated that protein conformation was determined by sequence. A recent and important step towards this goal was the development of AlphaFold2, currently the best PSP method. AlphaFold2 is probably the highest profile application of AI to science. Both AlphaFold2 and RoseTTAFold (another impressive PSP method) have been published and placed in the public domain (code & models). Stacking is a form of ensemble machine learning (ML) in which multiple baseline models are first learnt, then a meta-model is learnt using the outputs of the baseline models to form a model that outperforms the base models. Stacking has been successful in many applications. We developed the ARStack PSP method by stacking AlphaFold2 and RoseTTAFold. ARStack significantly outperforms AlphaFold2. We rigorously demonstrate this using two sets of nonhomologous proteins, and a test set of protein structures published after that of AlphaFold2 and RoseTTAFold. As more high quality prediction methods are published it is likely that ensemble methods will increasingly outperform any single method.
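The stacking procedure summarised in the abstract above (base models first, then a meta-model learnt on their outputs) can be sketched on toy data. This is not the ARStack method: it assumes two hypothetical base predictors that each output a single number, and a meta-model that is just a least-squares weighted sum of the two. All names and values are invented for illustration.

```python
# Illustrative sketch of stacking on toy scalar data; NOT the ARStack method.
# The base-model outputs and targets below are hypothetical.
from typing import Sequence, Tuple


def fit_meta_weights(preds_a: Sequence[float], preds_b: Sequence[float],
                     targets: Sequence[float]) -> Tuple[float, float]:
    """Least-squares weights (w_a, w_b) minimising sum((w_a*a + w_b*b - y)^2).

    Solves the 2x2 normal equations directly. In practice the meta-model should
    be fit on base-model outputs for data the base models were not trained on.
    """
    saa = sum(a * a for a in preds_a)
    sbb = sum(b * b for b in preds_b)
    sab = sum(a * b for a, b in zip(preds_a, preds_b))
    say = sum(a * y for a, y in zip(preds_a, targets))
    sby = sum(b * y for b, y in zip(preds_b, targets))
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("base-model outputs are collinear; weights are not unique")
    w_a = (say * sbb - sby * sab) / det
    w_b = (sby * saa - say * sab) / det
    return w_a, w_b


def mse(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error between predictions and targets."""
    return sum((p - y) ** 2 for p, y in zip(preds, targets)) / len(targets)


# Toy data: the two base models make errors in opposite directions.
targets = [1.0, 2.0, 3.0, 4.0]
preds_a = [1.2, 1.8, 3.2, 3.8]  # hypothetical outputs of base model A
preds_b = [0.8, 2.2, 2.8, 4.2]  # hypothetical outputs of base model B

w_a, w_b = fit_meta_weights(preds_a, preds_b, targets)
stacked = [w_a * a + w_b * b for a, b in zip(preds_a, preds_b)]
print(w_a, w_b, mse(preds_a, targets), mse(preds_b, targets), mse(stacked, targets))
```

The toy data are constructed so that the base models' errors cancel, which is the favourable case for stacking; with correlated errors the gain would be smaller.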