Protein sequence design by conformational landscape optimization

Protein sequence design by explicit energy landscape optimization

2020

The protein design problem is to identify an amino acid sequence that folds to a desired structure. Given Anfinsen’s thermodynamic hypothesis of folding, this can be recast as finding an amino acid sequence for which the lowest energy conformation is that structure. As this calculation involves not only all possible amino acid sequences but also all possible structures, most current approaches focus instead on the more tractable problem of finding the lowest energy amino acid sequence for the desired structure. They often check by protein structure prediction in a second step that the desired structure is indeed the lowest energy conformation for the designed sequence, and discard the (in many cases large) fraction of designed sequences for which this is not the case. Here we show that by backpropagating gradients through the trRosetta structure prediction network from the desired structure to the input amino acid sequence, we can directly optimize over all possible amino acid seque...
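The idea of optimizing a sequence by following gradients back to the input can be illustrated with a deliberately small sketch. It does not use trRosetta: the function `toy_loss_gradient` and its `preferred` argument are hypothetical stand-ins for the gradient that a real structure prediction network would supply, and the relaxed softmax representation of the sequence is an assumption made for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def softmax(logits):
    """Turn one position's logits into amino acid probabilities."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_loss_gradient(preferred_residue):
    """Hypothetical stand-in for a predictor's gradient at one position.

    The toy loss is -probs[preferred_residue], so its gradient with
    respect to the probabilities is -1 at that residue and 0 elsewhere.
    """
    return [-1.0 if aa == preferred_residue else 0.0 for aa in AMINO_ACIDS]


def design_sequence(preferred, steps=50, learning_rate=1.0):
    """Gradient descent on per-position logits of a relaxed sequence."""
    logits = [[0.0] * len(AMINO_ACIDS) for _ in preferred]
    for _ in range(steps):
        for pos, row in enumerate(logits):
            probs = softmax(row)
            grad_p = toy_loss_gradient(preferred[pos])
            mean_grad = sum(p * g for p, g in zip(probs, grad_p))
            # Chain rule through the softmax: dL/dz_a = p_a * (g_a - sum_b p_b g_b)
            for a in range(len(row)):
                row[a] -= learning_rate * probs[a] * (grad_p[a] - mean_grad)
    # Read out the most probable residue at each position.
    return "".join(
        AMINO_ACIDS[max(range(len(AMINO_ACIDS)), key=row.__getitem__)]
        for row in logits
    )


designed = design_sequence("MKVLA")
print(designed)
```

In this toy setting the loss simply rewards one residue per position, so the optimization recovers that residue string. With a real network the gradient would instead depend on how well the predicted structure matches the target.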

Exploring folding free energy landscapes using computational protein design

Current Opinion in Structural Biology, 2004

Recent advances in computational protein design have allowed exciting new insights into the sequence dependence of protein folding free energy landscapes. Whereas most previous studies have examined the sequence dependence of protein stability and folding kinetics by characterizing naturally occurring proteins and variants of these proteins that contain a small number of mutations, it is now possible to generate and characterize computationally designed proteins that differ significantly from naturally occurring proteins in sequence and/or structure. These computer-generated proteins provide insights into the determinants of protein structure, stability and folding, and make it possible to disentangle the properties of proteins that are the consequence of natural selection from those that reflect the fundamental physical chemistry of polypeptide chains.

Protein Design with Deep Learning

International Journal of Molecular Sciences

Computational Protein Design (CPD) has produced impressive results for engineering new proteins, resulting in a wide variety of applications. In the past few years, various efforts have aimed at replacing or improving existing design methods using Deep Learning technology to leverage the amount of publicly available protein data. Deep Learning (DL) is a very powerful tool to extract patterns from raw data, provided that data are formatted as mathematical objects and the architecture processing them is well suited to the targeted problem. In the case of protein data, specific representations are needed for both the amino acid sequence and the protein structure in order to capture respectively 1D and 3D information. As no consensus has been reached about the most suitable representations, this review describes the representations used so far, discusses their strengths and weaknesses, and details their associated DL architecture for design and related tasks.

End-to-End deep structure generative model for protein design

Designing proteins with desirable structural and functional properties is the pinnacle of computational protein design, with unlimited potential across the scientific community, from therapeutic development to combating the global climate crisis. However, designing protein macromolecules at scale remains challenging due to hard-to-realize structures and low sequence design success rates. Recently, many generative models have been proposed for protein design, but they come with many limitations. Here, we present a VAE-based universal protein structure generative model that can model proteins in a large fold space and generate high-quality, realistic 3-dimensional protein structures. We illustrate how our model can enable robust and efficient protein design pipelines with generated conformational decoys that bridge the gap in designing structure-conforming sequences. Specifically, sequences generated from our design pipeline outperform native fixed backbone design in 856 out of the 1,016 tested targe...

Protein Structure and Energy Landscape Dependence on Sequence Using a Continuous Energy Function

Journal of Computational Biology, 1997

We have recently described a new conformational search strategy for protein folding algorithms, called the CGU (convex global underestimator) method. Here we use a simplified protein chain representation and a differentiable form of the Sun/Thomas/Dill energy function to test the CGU method. Standard search methods, such as Monte Carlo and molecular dynamics, are slowed by kinetic traps. That is, the computer time depends more strongly on the shape of the energy landscape (dictated by the amino acid sequence) than on the number of degrees of freedom (dictated by the chain length). The CGU method is not subject to this limitation, since it explores the underside of the energy landscape, not the top. We find that the CGU computer time is largely independent of the monomer sequence, for different chain folds, and scales as O(n^4) with chain length. By using different starting points, we show that the method appears to find global minima. Since we can currently find stable states of 36-residue chains in 2.4 hours, the method may be practical for small proteins.

SPIN2: Predicting sequence profiles from protein structures using deep neural networks

Proteins, 2018

Designing protein sequences that can fold into a given structure is a well-known inverse protein-folding problem. One important characteristic to attain for a protein design program is the ability to recover wild-type sequences given their native backbone structures. The highest average sequence identity accuracy achieved by current protein-design programs in this problem is around 30%, achieved by our previous system, SPIN. SPIN is a program that predicts sequences compatible with a provided structure using a neural network with fragment-based local and energy-based nonlocal profiles. Our new model, SPIN2, uses a deep neural network and additional structural features to improve on SPIN. SPIN2 achieves over 34% in sequence recovery in 10-fold cross-validation and independent tests, a 4% improvement over the previous version. The sequence profiles generated from SPIN2 are expected to be useful for improving existing fold recognition and protein design techniques. SPIN2 is available a...
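The sequence recovery figure quoted above can be read as the fraction of positions at which a designed sequence matches the wild-type one. The sketch below shows that calculation under the assumption of equal-length, already aligned sequences. The function name `sequence_recovery` and the example sequences are illustrative and are not taken from SPIN or SPIN2.

```python
def sequence_recovery(designed, native):
    """Fraction of positions where the designed residue equals the native one.

    Assumes both sequences are aligned position by position and have
    the same, non-zero length.
    """
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must not be empty")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


# Illustrative sequences only: 9 of 10 positions agree.
print(sequence_recovery("ACDEFGHIKL", "ACDEFGHIKV"))
```

A benchmark value such as the one reported would then be an average of this per-protein fraction over a test set.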

Current protein structure predictors do not produce meaningful folding pathways

2021

Protein structure prediction has long been considered a gateway problem for understanding protein folding. Recent advances in deep learning have achieved unprecedented success at predicting a protein’s crystal structure, but whether this achievement relates to a better modelling of the folding process remains an open question. In this work, we compare the pathways generated by state-of-the-art protein structure prediction methods to experimental folding data. The methods considered were AlphaFold 2, RoseTTAFold, trRosetta, RaptorX, DMPfold, EVfold, SAINT2 and Rosetta. We find evidence that their simulated dynamics capture some information about the folding pathway, but their predictive ability is worse than a trivial classifier using sequence-agnostic features like chain length. The folding trajectories produced are also uncorrelated with parameters such as intermediate structures and the folding rate constant. These results suggest that recent advances in protein structure...

Analyzing energy landscapes for folding model proteins

The Journal of Chemical Physics, 2006

A new benchmark 20-bead HP model protein sequence (on a square lattice), which has 17 distinct but degenerate global minimum (GM) energy structures, has been studied using a genetic algorithm (GA). The relative probabilities of finding particular GM conformations are determined and related to the theoretical probability of generating these structures using a recoil growth constructor operator. It is found that for longer successful GA runs, the GM probability distribution is generally very different from the constructor probability, as other GA operators have had time to overcome any initial bias in the originally generated population of structures. Structural and metric relationships (e.g., Hamming distances) between the 17 distinct GM are investigated and used, in conjunction with data on the connectivities of the GM and the pathways that link them, to explain the GM probability distributions obtained by the GA. A comparison is made of searches where the sequence is defined in the normal (forward) and reverse directions. The ease of finding mirror image solutions is also compared. Finally, this approach is applied to rationalize the ease or difficulty of finding the GM for a number of standard benchmark HP sequences on the square lattice. It is shown that the relative probabilities of finding particular members of a set of degenerate global minima depend critically on the topography of the energy landscape in the vicinity of the GM, the connections and distances between the GM, and the nature of the operators used in the chosen search method.
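For readers unfamiliar with the HP model, the following minimal sketch evaluates the energy of one conformation on a square lattice. It assumes the common convention of -1 for every pair of H beads that are lattice neighbours but not adjacent along the chain. The function name `hp_energy` and the four-bead example are illustrative and are not the benchmark sequence from the paper.

```python
def hp_energy(sequence, coords):
    """Energy of an HP chain on a square lattice.

    sequence: string of 'H' and 'P' beads.
    coords:   list of (x, y) lattice sites, one per bead, in chain order.
    Convention assumed here: -1 per non-bonded H-H lattice contact.
    """
    if len(sequence) != len(coords):
        raise ValueError("need one coordinate per bead")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive beads must be lattice neighbours")

    index_at = {site: i for i, site in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each pair is counted once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbour)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return -contacts


# Four beads folded into a unit square: the two H ends touch.
print(hp_energy("HPPH", [(0, 0), (1, 0), (1, 1), (0, 1)]))
```

Degenerate global minima, as discussed in the abstract, are distinct conformations for which a function of this kind returns the same lowest value.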

AWSEM-Suite: a protein structure prediction server based on template-guided, coevolutionary-enhanced optimized folding landscapes

The accurate and reliable prediction of the 3D structures of proteins and their assemblies remains difficult even though the number of solved structures soars and prediction techniques improve. In this study, we describe AWSEM-Suite, a free and open access web server whose goal is to predict monomeric protein tertiary structures from sequence. The model underlying the server's predictions is a coarse-grained protein force field which has its roots in neural network ideas and has been optimized using energy landscape theory. Employing physically motivated potentials and knowledge-based local structure biasing terms, the addition of homologous template and co-evolutionary restraints to AWSEM-Suite greatly improves the predictive power of pure AWSEM structure prediction. From the independent evaluation metrics released in the CASP13 experiment, AWSEM-Suite proves to be a reasonably accurate algorithm for free modeling, standing at the eighth position in the free modeling category of CASP13. The AWSEM-Suite server also features a front end with a user-friendly interface. The AWSEM-Suite server is a powerful tool for predicting monomeric protein tertiary structures that is most useful when a suitable structure template is not available. The AWSEM-Suite server is freely available at: https://awsem.rice.edu.