Retrosynthesis
Related papers
Decomposing Retrosynthesis into Reactive Center Prediction and Molecule Generation
2019
Chemical retrosynthesis has been a crucial and challenging task in organic chemistry for several decades. In the early years, retrosynthesis was accomplished by the disconnection approach, which is labor-intensive and requires expert knowledge. Afterward, rule-based methods dominated retrosynthesis for years. In this study, we revisit the disconnection approach by leveraging deep learning (DL) to boost its performance and increase the explainability of DL. Concretely, we propose a novel graph-based deep-learning framework, named DeRetro, to predict the set of reactants for a target product by executing the processes of disconnection and reactant generation in order. Experimental results show that DeRetro achieves new state-of-the-art performance in predicting the reactants. In-depth analyses also demonstrate that even without the reaction type as input, DeRetro retains its retrosynthesis performance while other methods show a significant decrease, resulting in a large margin of 19%...
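The two-step pipeline the abstract describes (predict the reactive center, then generate reactants) can be illustrated with a small RDKit sketch. The trained models are replaced by stand-ins here: a hard-coded bond index plays the role of the learned reactive-center predictor, and the synthons produced by the disconnection stand in for the inputs to the reactant-generation step. The molecule, the bond choice, and the function names are all illustrative assumptions, not DeRetro's actual interface.

```python
# Minimal sketch of disconnection-style retrosynthesis, assuming a
# reactive-center model that names one bond to break.
from rdkit import Chem

def predict_reactive_center(product: Chem.Mol) -> int:
    """Stand-in for the learned reactive-center model: returns the
    index of the bond to disconnect (hard-coded for this demo)."""
    return product.GetBondBetweenAtoms(2, 3).GetIdx()  # assumed: ester O-C(=O) bond

def disconnect(product_smiles: str) -> list[str]:
    mol = Chem.MolFromSmiles(product_smiles)
    bond_idx = predict_reactive_center(mol)
    # Break the predicted bond; dummy atoms (*) mark the open valences
    # that a generation step would complete into real reactants.
    fragmented = Chem.FragmentOnBonds(mol, [bond_idx], addDummies=True)
    synthons = Chem.GetMolFrags(fragmented, asMols=True)
    return [Chem.MolToSmiles(s) for s in synthons]

print(disconnect("CCOC(=O)c1ccccc1"))  # ethyl benzoate -> two synthons
```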
Computational Biology and Chemistry, 2024
Retrosynthesis is vital in synthesizing target products, guiding the reaction pathway design crucial for drug and material discovery. Current models often neglect multi-scale feature extraction, limiting their efficacy in leveraging molecular descriptors. Our proposed SB-Net model, a deep-learning architecture tailored for retrosynthesis prediction, addresses this gap. SB-Net combines CNN and Bi-LSTM architectures, excelling at capturing multi-scale molecular features. It integrates parallel branches for processing one-hot encoded descriptors and ECFP, merging them through dense layers. Experimental results demonstrate SB-Net's superiority, achieving 73.6% top-1 and 94.6% top-10 accuracy on USPTO-50k data. Versatility is validated on MetaNetX, with 52.8% top-1, 74.3% top-3, 79.8% top-5, and 83.5% top-10 accuracy. SB-Net's success in bioretrosynthesis prediction tasks indicates its efficacy. This research advances computational chemistry, offering a robust deep-learning model for retrosynthesis prediction. With implications for drug discovery and synthesis planning, SB-Net promises innovative and efficient pathways.
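The abstract pins down a concrete architectural pattern: a CNN branch and a Bi-LSTM branch over the one-hot sequence, merged with an ECFP input through dense layers. A minimal Keras sketch of that pattern follows; all layer sizes, sequence lengths, and the generic classification head are illustrative assumptions, not the published SB-Net configuration.

```python
# Parallel-branch model sketch, assuming illustrative hyperparameters.
from tensorflow.keras import layers, Model

SEQ_LEN, VOCAB, ECFP_BITS, N_CLASSES = 120, 40, 2048, 50  # assumed sizes

seq_in = layers.Input((SEQ_LEN, VOCAB), name="onehot_smiles")
cnn = layers.Conv1D(64, 3, activation="relu")(seq_in)   # local n-gram features
cnn = layers.GlobalMaxPooling1D()(cnn)
rnn = layers.Bidirectional(layers.LSTM(64))(seq_in)     # long-range context

fp_in = layers.Input((ECFP_BITS,), name="ecfp")
fp = layers.Dense(128, activation="relu")(fp_in)

merged = layers.Concatenate()([cnn, rnn, fp])           # fuse multi-scale features
hidden = layers.Dense(256, activation="relu")(merged)
out = layers.Dense(N_CLASSES, activation="softmax")(hidden)

model = Model([seq_in, fp_in], out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```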
RetroGNN: Approximating Retrosynthesis by Graph Neural Networks for De Novo Drug Design
2020
De novo molecule generation often results in chemically unfeasible molecules. A natural idea to mitigate this problem is to bias the search process towards more easily synthesizable molecules using a proxy for synthetic accessibility. However, using currently available proxies still results in highly unrealistic compounds. We investigate the feasibility of training deep graph neural networks to approximate the outputs of a retrosynthesis planning software, and their use to bias the search process. We evaluate our method on a benchmark involving searching for drug-like molecules with antibiotic properties. Compared to enumerating over five million existing molecules from the ZINC database, our approach finds molecules predicted to be more likely to be antibiotics while maintaining good drug-like properties and being easily synthesizable. Importantly, our deep neural network can successfully filter out hard-to-synthesize molecules while achieving a 10^5 times speed-up over using the...
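The core engineering idea here is substituting a fast learned proxy for an expensive planner call during candidate screening. A sketch of that loop, assuming a hypothetical `gnn_synth_score` standing in for RetroGNN's trained network and an illustrative threshold:

```python
# Screening sketch: use a cheap learned synthesizability proxy instead
# of invoking a full retrosynthesis planner per molecule.
from typing import Callable, Iterable

def screen(candidates: Iterable[str],
           gnn_synth_score: Callable[[str], float],
           threshold: float = 0.5) -> list[str]:
    """Keep candidates the proxy judges synthesizable; each call replaces
    a planner run that would be orders of magnitude slower."""
    return [smi for smi in candidates if gnn_synth_score(smi) >= threshold]

# Usage with a toy stand-in score (length-based, purely illustrative):
toy_score = lambda smi: 1.0 / (1.0 + 0.01 * len(smi))
print(screen(["CCO", "C1=CC=CC=C1", "C" * 200], toy_score))
```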
G2GT: Retrosynthesis Prediction with Graph to Graph Attention Neural Network and Self-Training
ArXiv, 2022
Retrosynthesis prediction is one of the fundamental challenges in organic chemistry and related fields. The goal is to find reactant molecules that can synthesize product molecules. To solve this task, we propose a new graph-to-graph transformation model, G2GT, in which the graph encoder and graph decoder are built upon the standard transformer structure. We also show that self-training, a powerful data augmentation method that utilizes unlabeled molecule data, can significantly improve the model's performance. Inspired by the reaction type label and ensemble learning, we propose a novel weak-ensemble method to enhance diversity. We combine beam search, nucleus, and top-k sampling methods to further improve inference diversity and propose a simple ranking algorithm to retrieve the final top-10 results. We achieved new state-of-the-art results on both the USPTO-50K dataset, with top-1 accuracy of 54%, and the larger dataset USPTO-full, with top-1 accuracy of 50%, and competitive t...
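Top-k and nucleus (top-p) sampling are standard decoding tricks for diversifying sequence generation, and the abstract describes combining them. A sketch of how the two filters compose on a single next-token distribution; the values of k and p are illustrative, not G2GT's settings:

```python
# Combined top-k + nucleus filtering over one step of decoding.
import numpy as np

def sample_token(logits: np.ndarray, k: int = 20, p: float = 0.9,
                 rng=np.random.default_rng()) -> int:
    order = np.argsort(logits)[::-1][:k]             # keep the k best tokens
    probs = np.exp(logits[order] - logits[order].max())
    probs /= probs.sum()
    # Smallest prefix of the sorted distribution with mass >= p (the nucleus).
    keep = int(np.searchsorted(np.cumsum(probs), p)) + 1
    probs = probs[:keep] / probs[:keep].sum()
    return int(rng.choice(order[:keep], p=probs))

print(sample_token(np.array([2.0, 1.0, 0.5, -1.0, -3.0]), k=3, p=0.8))
```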
Prediction of Organic Reaction Outcomes Using Machine Learning
ACS central science, 2017
Computer assistance in synthesis design has existed for over 40 years, yet retrosynthesis planning software has struggled to achieve widespread adoption. One critical challenge in developing high-quality pathway suggestions is that proposed reaction steps often fail when attempted in the laboratory, despite initially seeming viable. The true measure of success for any synthesis program is whether the predicted outcome matches what is observed experimentally. We report a model framework for anticipating reaction outcomes that combines the traditional use of reaction templates with the flexibility in pattern recognition afforded by neural networks. Using 15,000 experimental reaction records from granted United States patents, a model is trained to select the major (recorded) product by ranking a self-generated list of candidates where one candidate is known to be the major product. Candidate reactions are represented using a unique edit-based representation that emphasizes the fundame...
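The template-based candidate generation that such a ranking model scores can be reproduced with RDKit's reaction SMARTS. In the sketch below, a single Fischer esterification template enumerates candidate products and a length-based sort stands in for the neural ranker; both the template choice and the "ranker" are illustrative assumptions, not the paper's templates or model.

```python
# Enumerate candidate products from a reaction template, then rank them.
from rdkit import Chem
from rdkit.Chem import AllChem

# Illustrative template: Fischer esterification (acid + alcohol -> ester).
TEMPLATES = ["[C:1](=[O:2])[OH:3].[O:4][C:5]>>[C:1](=[O:2])[O:4][C:5]"]

def candidates(reactant_smiles: list[str]) -> set[str]:
    mols = [Chem.MolFromSmiles(s) for s in reactant_smiles]
    out = set()
    for smarts in TEMPLATES:
        rxn = AllChem.ReactionFromSmarts(smarts)
        for products in rxn.RunReactants(tuple(mols)):
            for p in products:
                try:
                    Chem.SanitizeMol(p)
                    out.add(Chem.MolToSmiles(p))
                except Exception:
                    pass  # discard chemically invalid candidates
    return out

cands = candidates(["CC(=O)O", "OCC"])  # acetic acid + ethanol
ranked = sorted(cands, key=len)         # stand-in for the learned ranker
print(ranked)
```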
ReactionPredictor: Prediction of Complex Chemical Reactions at the Mechanistic Level Using Machine Learning
Journal of Chemical Information and Modeling, 2012
Proposing reasonable mechanisms and predicting the course of chemical reactions is important to the practice of organic chemistry. Approaches to reaction prediction have historically used obfuscating representations and manually encoded patterns or rules. Here we present ReactionPredictor, a machine learning approach to reaction prediction that models elementary, mechanistic reactions as interactions between approximate molecular orbitals (MOs). A training data set of productive reactions known to occur at reasonable rates and yields and verified by inclusion in the literature or textbooks is derived from an existing rule-based system and expanded upon with manual curation from graduate level textbooks. Using this training data set of complex polar, hypervalent, radical, and pericyclic reactions, a two-stage machine learning prediction framework is trained and validated. In the first stage, filtering models trained at the level of individual MOs are used to reduce the space of possible reactions to consider. In the second stage, ranking models over the filtered space of possible reactions are used to order the reactions such that the productive reactions are the top ranked. The resulting model, ReactionPredictor, perfectly ranks polar reactions 78.1% of the time and recovers all productive reactions 95.7% of the time when allowing for small numbers of errors. Pericyclic and radical reactions are perfectly ranked 85.8% and 77.0% of the time, respectively, rising to >93% recovery for both reaction types with a small number of allowed errors. Decisions about which of the polar, pericyclic, or radical reaction type ranking models to use can be made with >99% accuracy. Finally, for multistep reaction pathways, we implement the first mechanistic pathway predictor using constrained tree-search to discover a set of reasonable mechanistic steps from given reactants to given products. Webserver implementations of both the single step and pathway versions of ReactionPredictor are available via the chemoinformatics portal http://cdb.ics.uci.edu/.
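The two-stage filter-then-rank structure described above generalizes cleanly. A sketch of the control flow, with both models as hypothetical stand-ins (the real system operates on filled/unfilled molecular-orbital interaction features, which are elided here):

```python
# Two-stage prediction: cheap filter prunes candidates, ranker orders survivors.
from typing import Callable

def predict_reactions(candidates: list[dict],
                      filter_model: Callable[[dict], float],
                      rank_model: Callable[[dict], float],
                      filter_threshold: float = 0.1,
                      top_n: int = 5) -> list[dict]:
    # Stage 1: discard MO interactions the filter deems unreactive.
    survivors = [c for c in candidates if filter_model(c) >= filter_threshold]
    # Stage 2: rank what remains so productive steps surface at the top.
    return sorted(survivors, key=rank_model, reverse=True)[:top_n]

# Usage with toy stand-ins:
cands = [{"score_hint": s} for s in (0.05, 0.4, 0.9)]
f = lambda c: c["score_hint"]
print(predict_reactions(cands, filter_model=f, rank_model=f))
```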
Discovery of Novel Chemical Reactions by Deep Generative Recurrent Neural Network
Here, we report an application of Artificial Intelligence techniques to generate novel chemical reactions of a given type. A sequence-to-sequence autoencoder was trained on the USPTO reaction database. Each reaction was converted into a single Condensed Graph of Reaction (CGR), followed by translation into purpose-developed SMILES/CGR text strings. The autoencoder latent space was visualized on a two-dimensional generative topographic map, from which some zones populated by Suzuki coupling reactions were targeted. These served for the generation of novel reactions by sampling the latent space points and decoding them to SMILES/CGR.
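The generation loop the abstract describes reduces to sampling latent points near a targeted zone and decoding each one. A sketch, where `decode` is a hypothetical stand-in for the trained decoder and the perturbation scale is an illustrative assumption:

```python
# Sample latent points around a region of interest and decode them.
import numpy as np

def sample_reactions(center: np.ndarray, decode, n: int = 10,
                     scale: float = 0.1, rng=np.random.default_rng(0)):
    """Perturb a latent-space center (e.g. inside a zone populated by
    Suzuki couplings) and decode each point to a SMILES/CGR string."""
    points = center + scale * rng.standard_normal((n, center.shape[0]))
    return [decode(z) for z in points]

# Usage with a toy decoder that just reports the point's norm:
print(sample_reactions(np.zeros(8), decode=lambda z: f"z|{np.linalg.norm(z):.2f}"))
```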
Deep learning for molecular generation and optimization - a review of the state of the art
In the space of only a few years, deep generative modeling has revolutionized how we think of artificial creativity, yielding autonomous systems which produce original images, music, and text. Inspired by these successes, researchers are now applying deep generative modeling techniques to the generation and optimization of molecules; in our review we found 45 papers on the subject published in the past two years. These works point to a future where such systems will be used to generate lead molecules, greatly reducing resources spent downstream synthesizing and characterizing bad leads in the lab. In this review we survey the increasingly complex landscape of models and representation schemes that have been proposed. The four classes of techniques we describe are recursive neural networks, autoencoders, generative adversarial networks, and reinforcement learning. After first discussing some of the mathematical fundamentals of each technique, we draw high-level connections and comparisons with other techniques and expose the pros and cons of each. Several important high-level themes emerge as a result of this work, including the shift away from the SMILES string representation of molecules towards more sophisticated representations such as graph grammars and 3D representations, the importance of reward function design, the need for better standards for benchmarking and testing, and the benefits of adversarial training and reinforcement learning over maximum-likelihood-based training.

The average cost to bring a new drug to market is now well over one billion USD [1], with an average time from discovery to market of 13 years [2]. Outside of pharmaceuticals the average time from discovery to commercial production can be even longer; for instance, for energetic molecules it is 25 years [3]. A critical first step in molecular discovery is generating a pool of candidates for computational study or synthesis and characterization. This is a daunting task because the space of possible molecules is enormous: the number of potential drug-like compounds has been estimated to be between 10^23 and 10^60 [4], while the number of all compounds that have been synthesized is on the order of 10^8. Heuristics such as Lipinski's "rule of five" for pharmaceuticals [5] can help narrow the space of possibilities, but the task remains daunting. High-throughput screening (HTS) [6] and high-throughput virtual screening (HTVS) [7] techniques have made larger parts of chemical space accessible to computational and experimental study. Machine learning has been shown to be capable of yielding rapid and accurate property predictions for many properties of interest and is being integrated into screening pipelines, since it is orders of magnitude faster than traditional computational chemistry methods [8]. Techniques for the interpretation and "inversion" of a machine learning model can illuminate structure-property relations that have been learned by the model, which can in turn be used to guide the design of new lead molecules [9,10]. However, even with these new techniques, bad leads still waste limited supercomputer and laboratory resources, so minimizing the number of bad leads generated at the start of the pipeline remains a key priority. The focus of this review is on the use of deep learning techniques for the targeted generation of molecules and guided exploration of chemical space.
We note that machine learning (and more broadly artificial intelligence) is having an impact on accelerating other parts of the chemical discovery pipeline as well, via machine-learning-accelerated ab initio simulation [8], machine learning based reaction prediction [11,12], deep learning based synthesis planning [13], and the development of high-throughput "self-driving" robotic laboratories [14,15].

Deep neural networks, which are often defined as networks with more than three layers, have been around for many decades but until recently were difficult to train and fell behind other techniques for classification and regression. By most accounts, the deep learning revolution in machine learning began in 2012, when deep neural network based models began to win several different competitions for the first time. First came a demonstration by Cireşan et al. of how deep neural networks could achieve near-human performance on the task of handwritten digit classification [16]. Next came groundbreaking work by Krizhevsky et al., which showed how deep convolutional networks achieved superior performance on the 2010 ImageNet image classification challenge [17]. Finally, around the same time in 2012, a multitask neural network developed by Dahl et al. won the "Merck Molecular Activity Challenge" to predict the molecular activities of molecules at 15 different sites in the body, beating out more traditional machine learning approaches such as boosted decision trees [18]. One of the key technical advances published that year and used by both Krizhevsky et al. and Dahl et al. was a novel regularization trick called "dropout".
Learning to Predict Chemical Reactions
Journal of Chemical Information and Modeling, 2011
Being able to predict the course of arbitrary chemical reactions is essential to the theory and applications of organic chemistry. Approaches to the reaction prediction problem can be organized around three poles corresponding to: (1) physical laws; (2) rule-based expert systems; and (3) inductive machine learning. Previous approaches at these poles respectively are not high-throughput, are not generalizable or scalable, or lack sufficient data and structure to be implemented. We propose a new approach to reaction prediction utilizing elements from each pole. Using a physically inspired conceptualization, we describe single mechanistic reactions as interactions between coarse approximations of molecular orbitals (MOs) and use topological and physicochemical attributes as descriptors. Using an existing rule-based system (Reaction Explorer), we derive a restricted chemistry dataset consisting of 1630 full multi-step reactions with 2358 distinct starting materials and intermediates, associated with 2989 productive mechanistic steps and 6.14 million unproductive mechanistic steps. From machine learning, we pose the identification of productive mechanistic steps as a statistical ranking (information retrieval) problem: given a set of reactants and a description of conditions, learn a ranking model over potential filled-to-unfilled MO interactions such that the top-ranked mechanistic steps yield the major products. The machine learning implementation follows a two-stage approach, in which we first train atom-level reactivity filters to prune 94.00% of nonproductive reactions with a 0.01% error rate. Then, we train an ensemble of ranking models on pairs of interacting MOs to learn a relative productivity function over mechanistic steps in a given system. Without the use of explicit transformation patterns, the ensemble perfectly ranks the productive mechanism at the top 89.05% of the time, rising to 99.86% of the time when the top four are considered. Furthermore, the system is generalizable, making reasonable predictions over reactants and conditions which the rule-based expert system does not handle. A web interface to the machine learning based mechanistic reaction predictor is accessible through our chemoinformatics portal (http://cdb.ics.uci.edu) under the Toolkits section.
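The ranking stage described here is pairwise learning-to-rank: a productive mechanistic step should outscore an unproductive one drawn from the same system. A minimal sketch with a RankNet-style pairwise logistic loss over a linear scorer; the feature vectors and learning rate are illustrative assumptions, not the paper's descriptors or model class.

```python
# Pairwise ranking sketch: productive step should outscore unproductive step.
import numpy as np

def pairwise_loss(w, x_pos, x_neg):
    """Logistic loss on the score margin between a productive (x_pos)
    and an unproductive (x_neg) mechanistic step."""
    margin = (x_pos - x_neg) @ w
    return np.log1p(np.exp(-margin))

def grad(w, x_pos, x_neg):
    margin = (x_pos - x_neg) @ w
    return -(x_pos - x_neg) / (1.0 + np.exp(margin))

# One SGD step on a toy pair of MO-interaction feature vectors:
w = np.zeros(4)
x_pos, x_neg = np.array([1., 0., 2., 1.]), np.array([0., 1., 1., 0.])
w -= 0.1 * grad(w, x_pos, x_neg)
print(pairwise_loss(w, x_pos, x_neg))  # loss drops below log(2) after the update
```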