State-of-the-art augmented NLP transformer models for direct and single-step retrosynthesis

Augmented Transformer Achieves 97% and 85% for Top5 Prediction of Direct and Classical Retro-Synthesis

2020

We investigated the effect of different augmentation scenarios on predicting the (retro)synthesis of chemical compounds using the SMILES representation. We showed that augmenting not only the input sequences but also, importantly, the target data eliminated the effect of data memorization by neural networks and improved their generalization performance when predicting new sequences. The Top-5 accuracy for predicting the largest fragment (thus identifying the principal transformation for classical retro-synthesis) on the USPTO-50k test dataset was 85.4%, achieved by a combination of SMILES augmentation and beam search. The same approach also outperformed the best published results for the prediction of direct reactions on the USPTO-MIT test set. Our model achieved 90.4% Top-1 and 96.5% Top-5 accuracy on its most challenging mixed set and 97% Top-5 accuracy on the USPTO-MIT separated set. The appearance frequency of the most abundantly generated SMILES was well correlated with the pr...
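A minimal sketch of the SMILES augmentation idea described above, assuming RDKit is available; the `augment_smiles` helper and the variant count are illustrative, not the paper's exact pipeline.

```python
# Sketch of SMILES augmentation via randomized atom ordering, assuming RDKit.
# In the setup described above, both input and target SMILES would be
# augmented this way before training.
from rdkit import Chem

def augment_smiles(smiles: str, n_variants: int = 5) -> list:
    """Return up to n_variants distinct randomized SMILES for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []
    variants = {Chem.MolToSmiles(mol, doRandom=True) for _ in range(n_variants)}
    return sorted(variants)

# Example: several non-canonical spellings of the same molecule (aspirin).
print(augment_smiles("CC(=O)Oc1ccccc1C(=O)O"))
```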

G2GT: Retrosynthesis Prediction with Graph to Graph Attention Neural Network and Self-Training

ArXiv, 2022

Retrosynthesis prediction is one of the fundamental challenges in organic chemistry and related fields. The goal is to find reactant molecules that can synthesize the product molecules. To solve this task, we propose a new graph-to-graph transformation model, G2GT, in which the graph encoder and graph decoder are built upon the standard transformer structure. We also show that self-training, a powerful data augmentation method that utilizes unlabeled molecule data, can significantly improve the model's performance. Inspired by reaction type labels and ensemble learning, we propose a novel weak ensemble method to enhance diversity. We combine beam search, nucleus, and top-k sampling methods to further improve inference diversity and propose a simple ranking algorithm to retrieve the final top-10 results. We achieved new state-of-the-art results on both the USPTO-50K dataset, with a top-1 accuracy of 54%, and the larger USPTO-full dataset, with a top-1 accuracy of 50%, and competitive t...
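A hedged sketch of the combined top-k / nucleus (top-p) filtering over decoder logits that the abstract mentions; `filter_logits` and its thresholds are illustrative stand-ins, not the paper's inference code.

```python
# Combined top-k and nucleus (top-p) filtering over a logit vector.
import numpy as np

def filter_logits(logits: np.ndarray, top_k: int = 10, top_p: float = 0.9) -> np.ndarray:
    """Mask logits outside the top-k set and outside the nucleus of mass top_p."""
    out = logits.astype(float).copy()
    # Top-k: keep only the k largest logits.
    kth = np.sort(out)[-top_k]
    out[out < kth] = -np.inf
    # Nucleus: keep the smallest prefix whose softmax mass reaches top_p.
    order = np.argsort(out)[::-1]
    probs = np.exp(out[order] - out[order][0])
    probs /= probs.sum()
    cutoff = np.searchsorted(np.cumsum(probs), top_p) + 1
    out[order[cutoff:]] = -np.inf
    return out  # sample from softmax(out) to draw diverse candidates
```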

SB-Net: Synergizing CNN and LSTM networks for uncovering retrosynthetic pathways in organic synthesis

Computational Biology and Chemistry, 2024

Retrosynthesis is vital in synthesizing target products, guiding the design of reaction pathways crucial for drug and material discovery. Current models often neglect multi-scale feature extraction, limiting their efficacy in leveraging molecular descriptors. Our proposed SB-Net model, a deep-learning architecture tailored for retrosynthesis prediction, addresses this gap. SB-Net combines CNN and Bi-LSTM architectures, excelling at capturing multi-scale molecular features. It integrates parallel branches for processing one-hot encoded descriptors and ECFP, merged through dense layers. Experimental results demonstrate SB-Net's superiority, achieving 73.6% top-1 and 94.6% top-10 accuracy on USPTO-50k data. Versatility is validated on MetaNetX, with rates of 52.8% top-1, 74.3% top-3, 79.8% top-5, and 83.5% top-10. SB-Net's success in bioretrosynthesis prediction tasks indicates its efficacy. This research advances computational chemistry, offering a robust deep-learning model for retrosynthesis prediction. With implications for drug discovery and synthesis planning, SB-Net promises innovative and efficient pathways.
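An illustrative PyTorch sketch of the two-branch idea described above: one branch convolves one-hot encoded sequences, the other runs a Bi-LSTM over ECFP features, and dense layers merge them. All layer sizes and names are hypothetical; the published SB-Net configuration may differ.

```python
import torch
import torch.nn as nn

class TwoBranchNet(nn.Module):
    def __init__(self, vocab=64, fp_bits=2048, n_classes=50):
        super().__init__()
        # Branch 1: CNN over one-hot encoded sequence descriptors.
        self.cnn = nn.Sequential(
            nn.Conv1d(vocab, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        # Branch 2: Bi-LSTM over the ECFP fingerprint vector.
        self.lstm = nn.LSTM(input_size=fp_bits, hidden_size=128,
                            bidirectional=True, batch_first=True)
        # Dense merge head over the concatenated branch outputs.
        self.head = nn.Sequential(
            nn.Linear(128 + 256, 256), nn.ReLU(), nn.Linear(256, n_classes))

    def forward(self, onehot, ecfp):
        # onehot: (batch, vocab, seq_len); ecfp: (batch, 1, fp_bits)
        a = self.cnn(onehot).squeeze(-1)      # (batch, 128)
        _, (h, _) = self.lstm(ecfp)           # h: (2, batch, 128)
        b = torch.cat([h[0], h[1]], dim=1)    # (batch, 256)
        return self.head(torch.cat([a, b], dim=1))
```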

Comparison of String Based Molecular Representations for Predicting Chemical Reactions

NA, 2022

Atoms are the fundamental building blocks of everything; bonded together, they form the universe. A single oxygen atom is twenty million times smaller than a millimetre [1], yet it is highly dense with information. Progressing through education, we are shown one representation after another, each containing an insight into the workings of an atom, but never the full story. The more we learn, the more we understand the fundamental flaws in our chemical representations. How do you encapsulate an atom in a representation without losing information? Furthermore, how do you represent a compound, made up of molecules, made up of many atoms? Artificial Intelligence has become increasingly influential in all fields, and yet there has been relatively little research on the use of neural networks to predict the outcomes of different reactions. One of the fundamental reasons behind this is the lack of a single chemical representation. Treating reactions as natural language sequences, this dissertation compares and contrasts the results of models trained on different sequence-based representations, using the Molecular Transformer trained on SMILES data as the basis of comparison. Two modern representations, DeepSMILES and SELFIES, were tested. Whilst both were designed for use in generative models, neither outperforms SMILES in terms of accuracy. DeepSMILES merely replaces syntactic invalidity with semantic invalidity. Despite its inaccuracies, the products predicted by SELFIES were over one hundred times more likely to be valid than those of the next best representation.
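A short sketch of round-tripping the three string representations compared above, assuming the `selfies` and `deepsmiles` Python packages are installed; the example molecule is arbitrary.

```python
# Converting one molecule between SMILES, SELFIES, and DeepSMILES.
import selfies
import deepsmiles

smi = "c1ccccc1C(=O)O"  # benzoic acid

sf = selfies.encoder(smi)    # SMILES -> SELFIES
back = selfies.decoder(sf)   # SELFIES -> SMILES; decoding always yields a valid molecule

conv = deepsmiles.Converter(rings=True, branches=True)
ds = conv.encode(smi)        # SMILES -> DeepSMILES
smi2 = conv.decode(ds)       # DeepSMILES -> SMILES; may raise on malformed strings

print(sf, ds, back, smi2, sep="\n")
```

The asymmetry in the decoders mirrors the dissertation's finding: every SELFIES string decodes to some valid molecule, whereas a generated DeepSMILES string can still be semantically invalid.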

Prediction of Organic Reaction Outcomes Using Machine Learning

ACS central science, 2017

Computer assistance in synthesis design has existed for over 40 years, yet retrosynthesis planning software has struggled to achieve widespread adoption. One critical challenge in developing high-quality pathway suggestions is that proposed reaction steps often fail when attempted in the laboratory, despite initially seeming viable. The true measure of success for any synthesis program is whether the predicted outcome matches what is observed experimentally. We report a model framework for anticipating reaction outcomes that combines the traditional use of reaction templates with the flexibility in pattern recognition afforded by neural networks. Using 15 000 experimental reaction records from granted United States patents, a model is trained to select the major (recorded) product by ranking a self-generated list of candidates where one candidate is known to be the major product. Candidate reactions are represented using a unique edit-based representation that emphasizes the fundame...
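A minimal sketch of the template-based candidate generation that the ranking model above starts from, using RDKit; the SMARTS template and the enumeration loop are illustrative, not the paper's actual edit-based representation.

```python
# Enumerate candidate products by applying a reaction template, then hand
# the candidate list to a learned ranker (not shown).
from rdkit import Chem
from rdkit.Chem import AllChem

# Hypothetical ester-hydrolysis template written as reaction SMARTS.
rxn = AllChem.ReactionFromSmarts(
    "[C:1](=[O:2])[O:3][C:4]>>[C:1](=[O:2])[OH].[OH][C:4]")

mol = Chem.MolFromSmiles("CC(=O)OCC")  # ethyl acetate
candidates = set()
for products in rxn.RunReactants((mol,)):
    try:
        for p in products:
            Chem.SanitizeMol(p)
        candidates.add(".".join(Chem.MolToSmiles(p) for p in products))
    except Exception:
        continue  # skip chemically invalid candidates

# A trained model would rank `candidates`; here we just list them.
print(sorted(candidates))
```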

Discovery of Novel Chemical Reactions by Deep Generative Recurrent Neural Network

Here, we report an application of Artificial Intelligence techniques to generate novel chemical reactions of a given type. A sequence-to-sequence autoencoder was trained on the USPTO reaction database. Each reaction was converted into a single Condensed Graph of Reaction (CGR), followed by translation into purpose-developed SMILES/CGR text strings. The autoencoder latent space was visualized on a two-dimensional generative topographic map, from which some zones populated by Suzuki coupling reactions were targeted. These served for the generation of novel reactions by sampling latent space points and decoding them to SMILES/CGR.
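A sketch of building a Condensed Graph of Reaction, assuming the CGRtools package; the paper's purpose-developed SMILES/CGR string format is only approximated here by CGRtools' own CGR signature, and composing a CGR requires an atom-to-atom mapped reaction.

```python
# Merge reactants and products of a mapped reaction into one CGR graph.
from CGRtools import smiles

# Atom-mapped esterification, written as reaction SMILES.
reaction = smiles(
    "[CH3:1][C:2](=[O:3])[OH:4].[OH:5][CH2:6][CH3:7]"
    ">>[CH3:1][C:2](=[O:3])[O:5][CH2:6][CH3:7].[OH2:4]")
cgr = reaction.compose()  # single graph with dynamic (broken/formed) bonds
print(str(cgr))           # SMILES-like CGR string for sequence models
```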

Predicting retrosynthetic pathways using a combined linguistic model and hyper-graph exploration strategy

ArXiv, 2019

We present an extension of our Molecular Transformer architecture combined with a hyper-graph exploration strategy for automatic retrosynthesis route planning without human intervention. The single-step retrosynthetic model sets a new state of the art for predicting reactants as well as reagents, solvents, and catalysts for each retrosynthetic step. We introduce new metrics (coverage, class diversity, round-trip accuracy, and Jensen-Shannon divergence) to evaluate single-step retrosynthetic models, using a forward prediction model and a reaction classification model, both based on the transformer architecture. The hypergraph is constructed on the fly, and the nodes are filtered and further expanded based on a Bayesian-like probability. We critically assessed the end-to-end framework with several retrosynthesis examples from the literature and academic exams. Overall, the framework has very good performance, with few weaknesses due to the bias induced during the training process...
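A sketch of the round-trip accuracy metric introduced above: a retrosynthesis prediction counts as correct if the forward model maps the predicted precursors back to the original product. `retro_model` and `forward_model` are hypothetical stand-ins for the two transformer models.

```python
from rdkit import Chem

def canonical(smi):
    """Canonical SMILES, or None if the string cannot be parsed."""
    mol = Chem.MolFromSmiles(smi)
    return Chem.MolToSmiles(mol) if mol is not None else None

def round_trip_accuracy(products, retro_model, forward_model):
    """Fraction of products recovered by forward-predicting the predicted precursors."""
    hits = 0
    for product in products:
        precursors = retro_model(product)      # single-step retrosynthesis
        recovered = forward_model(precursors)  # forward reaction prediction
        if canonical(recovered) == canonical(product):
            hits += 1
    return hits / len(products)
```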

A Transformer Model for Retrosynthesis

2019

We describe a Transformer model for the retrosynthetic reaction prediction task. The model is trained on 45 033 experimental reaction examples extracted from US patents. It successfully predicts the reactant set for 42.7% of cases on the external test set. During the training procedure, we applied different learning rate schedules and snapshot learning. These techniques can prevent overfitting and thus can justify dispensing with an internal validation dataset, which is advantageous for deep models with millions of parameters. We thoroughly investigated different approaches to training Transformer models and found that snapshot learning with averaging of weights at learning rate minima works best. When decoding the model output probabilities, the temperature has a strong influence: at T=1.3 it improves model accuracy by up to 1-2%.
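A minimal sketch of the temperature-scaled decoding described above: logits are divided by T before the softmax, so T > 1 flattens the output distribution. T=1.3 is the value the abstract reports as best; the function name is an illustrative placeholder.

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, T: float = 1.3) -> np.ndarray:
    """Softmax over logits / T; larger T spreads probability mass more evenly."""
    z = logits / T
    z -= z.max()          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()
```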

Learning to Predict Chemical Reactions

Journal of Chemical Information and Modeling, 2011

Being able to predict the course of arbitrary chemical reactions is essential to the theory and applications of organic chemistry. Approaches to the reaction prediction problem can be organized around three poles corresponding to: (1) physical laws; (2) rule-based expert systems; and (3) inductive machine learning. Previous approaches at these poles, respectively, are not high-throughput, are not generalizable or scalable, or lack sufficient data and structure to be implemented. We propose a new approach to reaction prediction utilizing elements from each pole. Using a physically inspired conceptualization, we describe single mechanistic reactions as interactions between coarse approximations of molecular orbitals (MOs) and use topological and physicochemical attributes as descriptors. Using an existing rule-based system (Reaction Explorer), we derive a restricted chemistry dataset consisting of 1630 full multi-step reactions with 2358 distinct starting materials and intermediates, associated with 2989 productive mechanistic steps and 6.14 million unproductive mechanistic steps. From machine learning, we pose the identification of productive mechanistic steps as a statistical ranking (information retrieval) problem: given a set of reactants and a description of conditions, learn a ranking model over potential filled-to-unfilled MO interactions such that the top-ranked mechanistic steps yield the major products. The machine learning implementation follows a two-stage approach, in which we first train atom-level reactivity filters to prune 94.00% of nonproductive reactions with a 0.01% error rate. Then, we train an ensemble of ranking models on pairs of interacting MOs to learn a relative productivity function over mechanistic steps in a given system. Without the use of explicit transformation patterns, the ensemble perfectly ranks the productive mechanism at the top 89.05% of the time, rising to 99.86% of the time when the top four are considered. Furthermore, the system is generalizable, making reasonable predictions for reactants and conditions which the rule-based expert system does not handle. A web interface to the machine-learning-based mechanistic reaction predictor is accessible through our chemoinformatics portal (http://cdb.ics.uci.edu) under the Toolkits section.
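A schematic of the two-stage pipeline described above: a cheap reactivity filter prunes most filled/unfilled MO pairs, then a learned scorer ranks the survivors. `reactive_atom`, `score_interaction`, and the MO tuples are hypothetical placeholders, not the paper's actual descriptors or models.

```python
def rank_mechanistic_steps(filled_mos, unfilled_mos, reactive_atom, score_interaction):
    """Two-stage candidate ranking over filled-to-unfilled MO interactions."""
    # Stage 1: atom-level reactivity filters prune non-productive candidates.
    candidates = [(f, u) for f in filled_mos for u in unfilled_mos
                  if reactive_atom(f) and reactive_atom(u)]
    # Stage 2: an ensemble ranking model orders the remaining interactions
    # so that the top-ranked steps should yield the major products.
    return sorted(candidates, key=lambda pair: score_interaction(*pair), reverse=True)
```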

Decomposing Retrosynthesis into Reactive Center Prediction and Molecule Generation

2019

Chemical retrosynthesis has been a crucial and challenging task in organic chemistry for several decades. In the early years, retrosynthesis was accomplished by the disconnection approach, which is labor-intensive and requires expert knowledge. Afterward, rule-based methods dominated retrosynthesis for years. In this study, we revisit the disconnection approach by leveraging deep learning (DL) to boost its performance and increase the explainability of DL. Concretely, we propose a novel graph-based deep-learning framework, named DeRetro, to predict the set of reactants for a target product by executing the processes of disconnection and reactant generation in order. Experimental results report that DeRetro achieves new state-of-the-art performance in predicting the reactants. In-depth analyses also demonstrate that even without the reaction type as input, DeRetro retains its retrosynthesis performance while other methods show a significant decrease, resulting in a large margin of 19%...