In the space of only a few years, deep generative modeling has revolutionized how we think of artificial creativity, yielding autonomous systems which produce original images, music, and text. Inspired by these successes, researchers are now applying deep generative modeling techniques to the generation and optimization of molecules; in our review we found 45 papers on the subject published in the past two years. These works point to a future where such systems will be used to generate lead molecules, greatly reducing resources spent downstream synthesizing and characterizing bad leads in the lab. In this review we survey the increasingly complex landscape of models and representation schemes that have been proposed. The four classes of techniques we describe are recursive neural networks, autoencoders, generative adversarial networks, and reinforcement learning. After first discussing the mathematical fundamentals of each technique, we draw high-level connections and comparisons with other techniques and expose the pros and cons of each. Several important high-level themes emerge from this work: the shift away from the SMILES string representation of molecules towards more sophisticated representations such as graph grammars and 3D representations, the importance of reward function design, the need for better standards for benchmarking and testing, and the benefits of adversarial training and reinforcement learning over maximum-likelihood-based training.

The average cost to bring a new drug to market is now well over one billion USD [1], with an average time from discovery to market of 13 years [2]. Outside of pharmaceuticals, the time from discovery to commercial production can be even longer; for energetic molecules, for instance, it is 25 years [3]. A critical first step in molecular discovery is generating a pool of candidates for computational study or for synthesis and characterization. This is a daunting task because the space of possible molecules is enormous: the number of potential drug-like compounds has been estimated to be between 10^23 and 10^60 [4], while the number of all compounds that have ever been synthesized is on the order of 10^8. Heuristics such as Lipinski's "rule of five" for pharmaceuticals [5] can help narrow the space of possibilities, but the task remains daunting.

High-throughput screening (HTS) [6] and high-throughput virtual screening (HTVS) [7] techniques have made larger parts of chemical space accessible to computational and experimental study. Machine learning has been shown to yield rapid and accurate predictions for many properties of interest and is being integrated into screening pipelines, since it is orders of magnitude faster than traditional computational chemistry methods [8]. Techniques for the interpretation and "inversion" of a machine learning model can illuminate structure-property relations learned by the model, which can in turn be used to guide the design of new lead molecules [9,10]. However, even with these new techniques, bad leads still waste limited supercomputer and laboratory resources, so minimizing the number of bad leads generated at the start of the pipeline remains a key priority. The focus of this review is the use of deep learning techniques for the targeted generation of molecules and guided exploration of chemical space.
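As a concrete illustration of the rule-of-five heuristic mentioned above, the sketch below shows how such a filter might be applied to a candidate pool in practice. It uses the open-source RDKit library, which is an assumption for illustration and not a tool discussed in this review; the function name `passes_rule_of_five` is likewise our own.

```python
# Minimal sketch of a Lipinski "rule of five" filter (assumes RDKit is installed).
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def passes_rule_of_five(smiles: str) -> bool:
    """Return True if a molecule satisfies Lipinski's rule of five:
    MW <= 500, logP <= 5, at most 5 H-bond donors, at most 10 H-bond acceptors."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # unparseable SMILES string: reject the candidate
        return False
    return (Descriptors.MolWt(mol) <= 500
            and Descriptors.MolLogP(mol) <= 5
            and Lipinski.NumHDonors(mol) <= 5
            and Lipinski.NumHAcceptors(mol) <= 10)

print(passes_rule_of_five("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin -> True
```

Filters of this kind are cheap to evaluate, which is why they are typically applied before the more expensive screening stages described next.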
We note that machine learning (and more broadly artificial intelligence) is also accelerating other parts of the chemical discovery pipeline, via machine-learning-accelerated ab initio simulation [8], machine learning based reaction prediction [11,12], deep learning based synthesis planning [13], and the development of high-throughput "self-driving" robotic laboratories [14,15].

Deep neural networks, often defined as networks with more than three layers, have been around for many decades, but until recently they were difficult to train and fell behind other techniques for classification and regression. By most accounts, the deep learning revolution in machine learning began in 2012, when deep neural network based models began to win several different competitions for the first time. First came a demonstration by Cireşan et al. of how deep neural networks could achieve near-human performance on the task of handwritten digit classification [16]. Next came groundbreaking work by Krizhevsky et al., which showed how deep convolutional networks achieved superior performance on the 2010 ImageNet image classification challenge [17]. Finally, around the same time in 2012, a multitask neural network developed by Dahl et al. won the "Merck Molecular Activity Challenge" to predict the activities of molecules at 15 different sites in the body, beating out more traditional machine learning approaches such as boosted decision trees [18]. One of the key technical advances published that year, and used by both Krizhevsky et al. and Dahl et al., was a novel regularization trick called "dropout".
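To make the dropout trick concrete, here is a minimal sketch of the standard "inverted dropout" formulation in NumPy. The function name and the choice of NumPy are illustrative assumptions on our part; the original papers describe dropout inside full network training loops.

```python
# Minimal sketch of inverted dropout (illustrative; names are our own).
import numpy as np

def dropout(x: np.ndarray, p: float = 0.5, training: bool = True,
            rng: np.random.Generator = np.random.default_rng(0)) -> np.ndarray:
    """Zero each activation with probability p during training and scale
    the survivors by 1/(1-p), so the expected activation is unchanged
    and no rescaling is needed at test time."""
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p  # keep each unit with probability 1-p
    return x * mask / (1.0 - p)

activations = np.ones((2, 4))
print(dropout(activations, p=0.5))  # roughly half the entries zeroed, rest scaled to 2.0
```

Randomly silencing units in this way prevents co-adaptation of features, which is what made it so effective as a regularizer in the 2012 systems mentioned above.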