Yifan Song | University of Washington
Papers by Yifan Song
The Rosetta software suite for macromolecular modeling, docking, and design is widely used in pharmaceutical, industrial, academic, non-profit, and government laboratories. Despite its broad modeling capabilities, Rosetta remains consistently among leading software suites when compared to other methods created for highly specialized protein modeling and design tasks. Developed for over two decades by a global community of over 60 laboratories, Rosetta has undergone multiple refactorings, and now comprises over three million lines of code. Here we discuss methods developed in the last five years in Rosetta, involving the latest protocols for structure prediction; protein–protein and protein–small molecule docking; protein structure and interface design; loop modeling; the incorporation of various types of experimental data; modeling of peptides, antibodies and proteins in the immune system, nucleic acids, non-standard chemistries, carbohydrates, and membrane proteins. We briefly disc...
Software to predict the change in protein stability upon point mutation is a valuable tool for a number of biotechnological and scientific problems. To facilitate the development of such software and provide easy access to the available experimental data, the ProTherm database was created. Biases in the methods and types of information collected have led to disparity in the types of mutations for which experimental data is available. For example, mutations to alanine are hugely overrepresented whereas those involving charged residues, especially from one charged residue to another, are underrepresented. ProTherm subsets created as benchmark sets that do not account for this often underrepresent certain mutational types. This issue introduces systematic biases into previously published protocols’ ability to accurately predict the change in folding energy on these classes of mutations. To resolve this issue, we have generated a new benchmark set with these problems corrected. We have...
Nature, 2016
Author Contribution: CDB, GB, VKM and DB designed the study. VKM developed algorithms with help from AW, EC, YS, GB, RB, CDB, GJR, and TWL. CDB and JMG designed canonical peptides with help from DB, GJR, and TWL. GB designed heterochiral and backbone-cyclized peptides with help from VKM, DB, PG, and PSH. CDB expressed and characterized designed canonical peptides from E. coli with help from JMG and SAR. JMG performed MS analysis. WAG and CEC purified canonical peptides via Daedalus and determined X-ray crystal structures. GWB, SVSRKP, AE, and TS determined NMR solution structures of canonical peptides, purified with isotopic labeling by CDB. OC and GB synthesized, purified and characterized designed noncanonical peptides. PJH and DJC determined NMR solution structures of noncanonical peptides. PJH, QK and DJC analysed data from structure determination of noncanonical peptides. CDB, GB, VKM, and DB wrote the manuscript with help from all authors. NMR solution structures are deposited to RCSB Protein Data Bank with accession codes 5JG9,
Methods in molecular biology (Clifton, N.J.), 2016
Predicting the outcome of engineered and naturally occurring sequence perturbations to protein-DNA interfaces requires accurate computational modeling technologies. It has been well established that computational design to accommodate small numbers of DNA target site substitutions is possible. This chapter details the basic method of design used in the Rosetta macromolecular modeling program that has been successfully used to modulate the specificity of DNA-binding proteins. More recently, combining computational design and directed evolution has become a common approach for increasing the success rate of protein engineering projects. The power of such high-throughput screening depends on computational methods producing multiple potential solutions. Therefore, this chapter describes several protocols for increasing the diversity of designed output. Lastly, we describe an approach for building comparative models of protein-DNA complexes in order to utilize information from homologous...
ProQuest Dissertations and Theses; Thesis, City University of New York, 2006; Publication Number: AAI3232000; ISBN: 9780542850783; Source: Dissertation Abstracts International, Volume 67-08, Section B, page 4299; 149 p., 2006
The proton gradient across the biological membrane is important for biological systems. Bacteriorhodopsin and cytochrome c oxidase convert different energy sources into this gradient. The focus of this thesis is to understand the mechanism of these proteins using computational methods. In bacteriorhodopsin, residue ionization states were calculated in 9 crystal structures trapped in bR, early M and late M
Cryo-EM has revealed many challenging yet exciting macromolecular assemblies at near-atomic resolution (3-4.5 Å), providing biological phenomena with molecular descriptions. However, at these resolutions accurately positioning individual atoms remains challenging and may be error-prone. Manually refining thousands of amino acids, typical in a macromolecular assembly, is tedious and time-consuming. We present an automated method that can improve the atomic details in models manually built in near-atomic-resolution cryo-EM maps. Applying the method to three systems recently solved by cryo-EM, we are able to improve model geometry while maintaining or improving the fit-to-density. Backbone placement errors are automatically detected and corrected, and the refinement shows a large radius of convergence. The results demonstrate the method is amenable to structures with symmetry, of very large size, and containing RNA as well as covalently bound ligands. The method should strea...
Science (New York, N.Y.), Jan 20, 2015
The fleeting lifetimes of the transition states (TSs) of chemical reactions make determination of their three-dimensional structures by diffraction methods a challenge. Here, we used packing interactions within the core of a protein to stabilize the planar TS conformation for rotation around the central carbon-carbon bond of biphenyl so that it could be directly observed by X-ray crystallography. The computational protein design software Rosetta was used to design a pocket within threonyl-transfer RNA synthetase from the thermophile Pyrococcus abyssi that forms complementary van der Waals interactions with a planar biphenyl. This latter moiety was introduced biosynthetically as the side chain of the noncanonical amino acid p-biphenylalanine. Through iterative rounds of computational design and structural analysis, we identified a protein in which the side chain of p-biphenylalanine is trapped in the energetically disfavored, coplanar conformation of the TS of the bond rotation react...
Nature methods, Jan 23, 2015
We describe a general approach for refining protein structure models on the basis of cryo-electron microscopy maps with near-atomic resolution. The method integrates Monte Carlo sampling with local density-guided optimization, Rosetta all-atom refinement and real-space B-factor fitting. In tests on experimental maps of three different systems with 4.5-Å resolution or better, the method consistently produced models with atomic-level accuracy largely independently of starting-model quality, and it outperformed the molecular dynamics-based MDFF method. Cross-validated model quality statistics correlated with model accuracy over the three test systems.
Nucleic acids research, Jan 16, 2014
We describe the identification and characterization of novel homing endonucleases using genome database mining to identify putative target sites, followed by high-throughput activity screening in a bacterial selection system. We characterized the substrate specificity and kinetics of these endonucleases by monitoring DNA cleavage events with deep sequencing. The endonuclease specificities revealed by these experiments can be partially recapitulated using 3D structure-based computational models. Analysis of these models together with genome sequence data provides insights into how alternative endonuclease specificities were generated during natural evolution.
Proceedings of the National Academy of Sciences, 2014
The proton gradient across the biological membrane is important for biological systems. Bacteriorhodopsin and cytochrome c oxidase convert different energy sources into this gradient. The focus of this thesis is to understand the mechanism of these proteins using computational methods. In bacteriorhodopsin, residue ionization states were calculated in 9 crystal structures trapped in bR, early M and late M
Cell, 2014
Because apoptosis of infected cells can limit virus production and spread, some viruses have co-opted prosurvival genes from the host. This includes the Epstein-Barr virus (EBV) gene BHRF1, a homolog of human Bcl-2 proteins that block apoptosis and are associated with cancer. Computational design and experimental optimization were used to generate a novel protein called BINDI that binds BHRF1 with picomolar affinity. BINDI recognizes the hydrophobic cleft of BHRF1 in a manner similar to other Bcl-2 protein interactions but makes many additional contacts to achieve exceptional affinity and specificity. BINDI induces apoptosis in EBV-infected cancer lines, and when delivered with an antibody-targeted intracellular delivery carrier, BINDI suppressed tumor growth and extended survival in a xenograft disease model of EBV-positive human lymphoma. High-specificity-designed proteins that selectively kill target cells may provide an advantage over the toxic compounds used in current-generation antibody-drug conjugates.
Proteins, 2014
A number of methods have been described for identifying pairs of contacting residues in protein three-dimensional structures, but it is unclear how many contacts are required for accurate structure modeling. The CASP10 assisted contact experiment provided a blind test of contact-guided protein structure modeling. We describe the models generated for these contact-guided prediction challenges using the Rosetta structure modeling methodology. For nearly all cases, the submitted models had the correct overall topology, and in some cases, they had near atomic-level accuracy; for example, the model of the 384-residue homo-oligomeric tetramer (Tc680o) had only 2.9 Å root-mean-square deviation (RMSD) from the crystal structure. Our results suggest that experimental and bioinformatic methods for obtaining contact information may need to generate only one correct contact for every 12 residues in the protein to allow accurate topology-level modeling.
Structure (London, England : 1993), Jan 8, 2013
We describe an improved method for comparative modeling, RosettaCM, which optimizes a physically realistic all-atom energy function over the conformational space defined by homologous structures. Given a set of sequence alignments, RosettaCM assembles topologies by recombining aligned segments in Cartesian space and building unaligned regions de novo in torsion space. The junctions between segments are regularized using a loop closure method combining fragment superposition with gradient-based minimization. The energies of the resulting models are optimized by all-atom refinement, and the most representative low-energy model is selected. The CASP10 experiment suggests that RosettaCM yields models with more accurate side-chain and backbone conformations than other methods when the sequence identity to the templates is greater than ∼15%.
Methods in Enzymology, 2013
Accurate energy functions are critical to macromolecular modeling and design. We describe new tools for identifying inaccuracies in energy functions and guiding their improvement, and illustrate the application of these tools to the improvement of the Rosetta energy function. The feature analysis tool identifies discrepancies between structures deposited in the PDB and low-energy structures generated by Rosetta; these likely arise from inaccuracies in the energy function. The optE tool optimizes the weights on the different components of the energy function by maximizing the recapitulation of a wide range of experimental observations. We use the tools to examine three proposed modifications to the Rosetta energy function: improving the unfolded state energy model (reference energies), using bicubic spline interpolation to generate knowledge-based torsional potentials, and incorporating the recently developed Dunbrack 2010 rotamer library (Shapovalov & Dunbrack, 2011).
Parasitology, 2014
SUMMARY: Specific roles of individual CDPKs vary, but in general they mediate essential biological functions necessary for parasite survival. A comparative analysis of the structure-activity relationships (SAR) of Neospora caninum, Eimeria tenella and Babesia bovis calcium-dependent protein kinases (CDPKs) together with those of Plasmodium falciparum, Cryptosporidium parvum and Toxoplasma gondii was performed by screening against 333 bumped kinase inhibitors (BKIs). Structural modelling and experimental data revealed that residues other than the gatekeeper influence compound–protein interactions resulting in distinct sensitivity profiles. We subsequently defined potential amino-acid structural influences within the ATP-binding cavity for each orthologue necessary for consideration in the development of broad-spectrum apicomplexan CDPK inhibitors. Although the BKI library was developed for specific inhibition of glycine gatekeeper CDPKs combined with low inhibition of threonine gatekee...
Proteins: Structure, Function, and Bioinformatics, 2011