Native protein sequences are close to optimal for their structures.

Automatic protein design with all atom force-fields by exact and heuristic optimization (Edited by J. Thornton)

Journal of Molecular Biology, 2000

A fully automatic procedure for predicting the amino acid sequences compatible with a given target structure is described. It is based on the CHARMM package, and uses an all atom force-field and rotamer libraries to describe and evaluate side-chain types and conformations. Sequences are ranked by a quantity akin to the free energy of folding, which incorporates hydration effects. Exact (Branch and Bound) and heuristic optimisation procedures are used to identify high-scoring sequences from an astronomical number of possibilities. These sequences include the minimum free energy sequence, as well as all amino acid sequences whose free energy lies within a specified window from the minimum. Several applications of our procedure are illustrated. Prediction of side-chain conformations for a set of ten proteins yields results comparable to those of established side-chain placement programs. Applications to sequence optimisation comprise the re-design of the protein cores of c-Crk SH3 domain, the B1 domain of protein G and Ubiquitin, and of surface residues of the SH3 domain. In all calculations, no restrictions are imposed on the amino acid composition and identical parameter settings are used for core and surface residues. The best scoring sequences for the protein cores are virtually identical to wild-type. They feature no more than one to three mutations in a total of 11-16 variable positions. Tests suggest that this is due to the balance between various contributions in the force-field rather than to overwhelming influence from packing constraints. The effectiveness of our force-field is further supported by the sequence predictions for surface residues of the SH3 domain. More mutations are predicted than in the core, seemingly in order to optimise the network of complementary interactions between polar and charged groups. This appears to be an important energetic requirement in the absence of the partner molecules with which the SH3 domain interacts, which were not included in the calculations. Finally, a detailed comparison between the sequences generated by the heuristic and exact optimisation algorithms calls for a note of caution concerning the efficiency of heuristic procedures in exploring sequence space.
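
The idea of an exact Branch and Bound search that returns every sequence within an energy window of the minimum can be illustrated on a toy problem. The sketch below is not the CHARMM-based procedure of the paper: it assumes a purely pairwise-decomposable energy with random values, and the names `toy_problem`, `branch_and_bound`, and `brute_force` are hypothetical helpers introduced only for illustration.

```python
import itertools
import random


def toy_problem(n=5, k=4, seed=0):
    """Random self and pair energies for n positions with k states each (toy data)."""
    rng = random.Random(seed)
    self_e = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(n)]
    pair_e = {(i, j): [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(k)]
              for i in range(n) for j in range(i + 1, n)}
    return self_e, pair_e


def total_energy(seq, self_e, pair_e):
    """Sum of per-position terms plus all pairwise terms."""
    n = len(seq)
    energy = sum(self_e[i][seq[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            energy += pair_e[i, j][seq[i]][seq[j]]
    return energy


def branch_and_bound(self_e, pair_e, window=0.0):
    """Return sorted (energy, sequence) pairs within `window` of the minimum."""
    n, k = len(self_e), len(self_e[0])
    # tail[j][b]: most favourable pair energy of state b at j with all later positions
    tail = [[sum(min(pair_e[j, l][b]) for l in range(j + 1, n)) for b in range(k)]
            for j in range(n)]
    best = float("inf")
    hits = []

    def bound(prefix, partial):
        # Lower bound on the energy of any completion of `prefix`
        d = len(prefix)
        total = partial
        for j in range(d, n):
            total += min(
                self_e[j][b] + tail[j][b]
                + sum(pair_e[i, j][prefix[i]][b] for i in range(d))
                for b in range(k))
        return total

    def recurse(prefix, partial):
        nonlocal best
        if bound(prefix, partial) > best + window + 1e-9:
            return  # no completion can fall inside the window
        d = len(prefix)
        if d == n:
            best = min(best, partial)
            hits.append((partial, tuple(prefix)))
            return
        for b in range(k):
            step = self_e[d][b] + sum(pair_e[i, d][prefix[i]][b] for i in range(d))
            recurse(prefix + [b], partial + step)

    recurse([], 0.0)
    # The best energy may have improved after early hits were stored, so filter again
    return sorted(h for h in hits if h[0] <= best + window + 1e-9)


def brute_force(self_e, pair_e):
    """Exhaustive minimum, used only to check the search on small problems."""
    n, k = len(self_e), len(self_e[0])
    return min((total_energy(s, self_e, pair_e), s)
               for s in itertools.product(range(k), repeat=n))
```

On small instances the result can be checked against `brute_force`; a heuristic search, by contrast, offers no guarantee that every sequence inside the window has been found, which is the caution the abstract raises.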

Solution structure of a de novo protein from a designed combinatorial library


Combinatorial libraries of de novo amino acid sequences can provide a rich source of diversity for the discovery of novel proteins. Randomly generated sequences, however, rarely fold into well ordered protein-like structures. To enhance the quality of a library, diversity must be focused into those regions of sequence space most likely to yield well folded structures. We have constructed focused libraries of de novo sequences by designing the binary pattern of polar and nonpolar amino acids to favor structures that contain abundant secondary structure, while simultaneously burying hydrophobic side chains in the protein interior and exposing hydrophilic side chains to solvent. Because binary patterning specifies only the polar/nonpolar periodicity, but not the identities of the side chains, detailed structural features, including packing interactions, cannot be designed a priori. Can binary patterned libraries nonetheless encode well folded proteins? An unambiguous answer to this question requires determination of a 3D structure. We used NMR spectroscopy to determine the structure of S-824, a novel protein from a recently constructed library of 102-residue sequences. This library is "naïve" in that it has not been subjected to high-throughput screens or directed evolution. The experimentally determined structure of S-824 is a four-helix bundle, as specified by the design. As dictated by the binary-code strategy, nonpolar side chains are buried in the protein interior, and polar side chains are exposed to solvent. The polypeptide backbone and buried side chains are well ordered, demonstrating that S-824 is not a molten globule and forms a unique structure. These results show that amino acid sequences that have neither been selected by evolution, nor designed by computer, nor isolated by high-throughput screening, can form native-like structures. 
These findings validate the binary-code strategy as an effective method for producing vast collections of well folded de novo proteins.
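
A minimal sketch of binary patterning follows. The residue sets and the 14-residue helix pattern are assumptions made for illustration (the abstract names neither), and `sample_sequence` and `library_size` are hypothetical helpers. The point is only that the pattern fixes polar versus nonpolar at each position while leaving the residue identity free.

```python
import random

# Assumed residue sets; the abstract does not list the amino acids used.
POLAR = "KHEQDN"
NONPOLAR = "LIVMF"

# Illustrative pattern for an amphipathic helix: P = polar, N = nonpolar.
HELIX_PATTERN = "PNPPNNPPNPPNNP"


def sample_sequence(pattern, rng):
    """Draw one sequence that obeys the polar/nonpolar pattern."""
    pools = {"P": POLAR, "N": NONPOLAR}
    return "".join(rng.choice(pools[symbol]) for symbol in pattern)


def library_size(pattern):
    """Number of distinct sequences consistent with the pattern."""
    return len(POLAR) ** pattern.count("P") * len(NONPOLAR) ** pattern.count("N")


if __name__ == "__main__":
    rng = random.Random(1)
    print(sample_sequence(HELIX_PATTERN, rng))
    print(library_size(HELIX_PATTERN))
```

Even for this short pattern the library is far too large to enumerate experimentally, which is why the quality of the pattern matters more than any individual sequence.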

Amino-acid site variability among natural and designed proteins


Computational protein design attempts to create protein sequences that fold stably into pre-specified structures. Here we compare alignments of designed proteins to alignments of natural proteins and assess how closely designed sequences recapitulate patterns of sequence variation found in natural protein sequences. We design proteins using RosettaDesign, and we evaluate both fixed-backbone designs and variable-backbone designs with different amounts of backbone flexibility. We find that proteins designed with a fixed backbone tend to underestimate the amount of site variability observed in natural proteins while proteins designed with an intermediate amount of backbone flexibility result in more realistic site variability. Further, the correlation between solvent exposure and site variability in designed proteins is lower than that in natural proteins. This finding suggests that site variability is too uniform across different solvent exposure states (i.e., buried residues are too variable or exposed residues too conserved). When comparing the amino acid frequencies in the designed proteins with those in natural proteins we find that in the designed proteins hydrophobic residues are underrepresented in the core. From these results we conclude that intermediate backbone flexibility during design results in more accurate protein design and that either scoring functions or backbone sampling methods require further improvement to accurately replicate structural constraints on site variability.
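
The abstract does not state how site variability is scored. One common choice, assumed here, is the Shannon entropy of each alignment column; the sketch below computes it for a small made-up alignment. The function names are hypothetical, and gap characters are simply ignored.

```python
import math
from collections import Counter


def site_entropy(column, gap="-"):
    """Shannon entropy (in nats) of one alignment column, ignoring gaps."""
    counts = Counter(residue for residue in column if residue != gap)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log(p)
    return entropy


def alignment_entropies(alignment):
    """Per-site entropies for a list of equal-length aligned sequences."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all sequences must have the same length")
    return [site_entropy(column) for column in zip(*alignment)]


if __name__ == "__main__":
    toy_alignment = ["ACDA", "ACEA", "ACFA", "AGGA"]  # made-up sequences
    for site, value in enumerate(alignment_entropies(toy_alignment), start=1):
        print(site, round(value, 3))
```

A fully conserved column scores zero and a column with four different residues in four sequences scores ln 4. Comparing such per-site values between designed and natural alignments is one way to ask whether a design protocol under- or over-estimates variability.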

Protein design by optimization of a sequence-structure quality function

Proceedings / ... International Conference on Intelligent Systems for Molecular Biology; ISMB, 1994

An automated procedure for protein design by optimization of a sequence-structure quality function has been developed. The method selects a statistically optimal sequence for a particular structure, on the assumption that such a protein will adopt the desired structure. We present two optimization algorithms: one provides an exact optimization while the other uses a combinatorial technique for comparatively rapid results. Both are suitable for massively parallel computers. A prototype system was used to design sequences which should adopt the four-helix bundle conformation of myohemerythrin. These appear satisfactory to secondary structure and profile analysis. Detailed inspection reveals that the sequences are generally plausible but, as expected, lack some specific structural features. The design parameters provide some insight into the general determinants of protein structure.

Evaluating the accuracy of protein design using native secondary sub-structures

BMC Bioinformatics, 2016

Background: Because protein function depends on structure, two challenging problems are studied: Protein Structure Prediction (PSP) and Inverse Protein Folding (IPF). Despite its essential applications, IPF has not been investigated as much as PSP. The ultimate goal of IPF, or protein design, is to create proteins with enhanced properties or even novel functions. One of the major computational challenges in protein design is its large sequence space: searching through all plausible sequences is impossible. Because protein secondary structure represents an appropriate primary scaffold of the protein conformation, studying the Protein Secondary Structure Inverse Folding (PSSIF) problem is a significant step forward in protein design, as it can reduce the search space. In this paper, a novel genetic algorithm that uses native secondary sub-structures is proposed to solve the PSSIF problem. In essence, evolutionary information can guide the algorithm to design amino acid sequences appropriate to the target secondary structures. Furthermore, these sequences can be folded into tertiary structures similar to their reference 3D structures. Results: The proposed algorithm, called GAPSSIF, benefits from evolutionary information obtained from solved proteins in the PDB. We therefore construct a repository of protein secondary sub-structures to accelerate convergence of the algorithm. The secondary structures of sequences designed by GAPSSIF are comparable with those obtained by Evolver and EvoDesign. Although tertiary structure features are not explicitly considered in the algorithm, the structural similarity between native and designed sequences reaches acceptable values. Conclusions: Using the evolutionary information of native structures can significantly improve the quality of designed sequences. Combining this information with effective features such as solvent accessibility and torsion angles leads the IPF problem to an efficient solution. GAPSSIF can be downloaded at http://bioinformatics.

A quantitative methodology for the de novo design of proteins

Protein Science, 1994

We have developed a general quantitative methodology for designing proteins de novo, which automatically produces sequences for any given plausible protein structure. The method incorporates statistical information, a theoretical description of protein structure, and motifs described in the literature. A model system embodying a portion of the quantitative methodology has been used to design many protein sequences for the phage 434 Cro and fibronectin type III domain folds, as well as several other structures. Residue sequences selected by this prototype share no significant identity with any natural protein. Nonetheless, 3-dimensional models of the designed sequences appear generally plausible. When examined using secondary structure prediction methods and profile analysis, the designed sequences generally score considerably better than the natural ones. The designed sequences are also in reasonable agreement with a sequence template. This quantitative methodology is likely to be capable of successfully designing new proteins and yielding fundamental insights about the determinants of protein structure.

Theoretical and Computational Protein Design

Annual Review of Physical Chemistry, 2011

From exponentially large numbers of possible sequences, protein design seeks to identify the properties of those that fold to predetermined structures and have targeted structural and functional properties. The interactions that confer structure and function involve intermolecular forces and large numbers of interacting amino acids. As a result, the identification of sequences can be subtle and complex. Sophisticated methods for characterizing sequences consistent with a particular structure have been developed, assisting the design of novel proteins. Developments in such computational protein design are discussed, along with recent accomplishments, ranging from the redesign of existing proteins to the design of new functionalities and nonbiological applications.

A Large Scale Test of Computational Protein Design: Folding and Stability of Nine Completely Redesigned Globular Proteins

A previously developed computer program for protein design, RosettaDesign, was used to predict low free energy sequences for nine naturally occurring protein backbones. RosettaDesign had no knowledge of the naturally occurring sequences and on average 65% of the residues in the designed sequences differ from wild-type. Synthetic genes for ten completely redesigned proteins were generated, and the proteins were expressed, purified, and then characterized using circular dichroism, chemical and temperature denaturation and NMR experiments. Although high-resolution structures have not yet been determined, eight of these proteins appear to be folded and their circular dichroism spectra are similar to those of their wild-type counterparts. Six of the proteins have stabilities equal to or up to 7 kcal/mol greater than their wild-type counterparts, and four of the proteins have NMR spectra consistent with a well-packed, rigid structure. These encouraging results indicate that the computational protein design methods can, with significant reliability, identify amino acid sequences compatible with a target protein backbone.