Computer-Aided Drug Design Research Papers

2025, IJPBS

A set of 2-substituted benzimidazoles was successfully synthesized. The benzimidazoles were prepared by condensation of ortho-phenylenediamine with substituted acids in the presence of ring-closing agents such as polyphosphoric acid/HCl. The synthesized compounds were characterized by IR spectroscopy and elemental analysis. All the synthesized compounds were screened for anthelmintic activity using albendazole as the standard.

2025, arXiv (Cornell University)

In this paper, we propose a model that uses a generative adversarial net (GAN) to generate realistic text. Instead of using a standard GAN, we combine a variational autoencoder (VAE) with a generative adversarial net. The use of high-level latent random variables helps to learn the data distribution and to solve the problem that a generative adversarial net always emits similar data. We propose the VGAN model, where the generative model is composed of a recurrent neural network and a VAE. The discriminative model is a convolutional neural network. We train the model via policy gradient. We apply the proposed model to the task of text generation and compare it to other recent neural network based models, such as the recurrent neural network language model and Seq-GAN. We evaluate the performance of the model by calculating the negative log-likelihood and the BLEU score. We conduct experiments on three benchmark datasets, and the results show that our model outperforms previous models.

2024

Chemical drugs have become an inseparable part of human life today. The discovery of penicillin in 1928 by Alexander Fleming marked a turning point in the history of medicine and paved the way for the development of synthetic antibiotics that have saved millions of human lives. Synthetic drugs have successfully treated various diseases that were previously considered fatal. Medicinal chemistry is a science that combines the principles of chemistry, biology, pharmacology, and medicine and plays an important role in the discovery of safe and effective drugs (Antimo Gioiello, et al., 2020).

2024

We propose a new empirical scoring function for binding affinity prediction, modeled on physicochemical and structural descriptors that characterize the nano-environment that encompasses both the ligand and the binding pocket residues. Our hypothesis is that a more detailed characterization of protein-ligand complexes, describing the nano-environment as precisely as possible, can lead to improvements in binding affinity prediction. A similar hypothesis has already been proven valid for the nano-environments of protein-protein interfaces and catalytic site residues (yet to be published). Introduction: In structure-based virtual screening campaigns, in silico protein-ligand complexes are evaluated and ranked according to their estimated binding affinities. Normally the ranking step is performed by using scoring functions, i.e. mathematical models that assess the strength of interaction between two binding partners. However, scoring functions are generally weak predictors of binding ...

2024, Journal of Infrastructure Policy and Development

Accurate demand forecasting is key for companies to optimize inventory management and satisfy customer demand efficiently. This paper investigates the application of generative AI models in demand forecasting. Two models were used, Long Short-Term Memory (LSTM) networks and a Variational Autoencoder (VAE), and their results were compared to select the better model in terms of performance and forecasting accuracy. The difference between actual and predicted demand values also confirms the LSTM's ability to identify latent features and basic trends in the data. Further, part of the work focused on the computational efficiency and scalability of the proposed methods, to provide guidelines for companies implementing these complex techniques in demand forecasting. Based on these results, LSTM networks show promise for improving demand forecasting and, consequently, for supporting decisions on inventory control and other resource allocation.

2024

The Chilean Automatic Supernovae SEarch (CHASE) is a survey designed to detect early Supernovae. In this paper we explore deep autoencoders to obtain a compressed latent space for a large transient candidate database from the CHASE image difference pipeline. Compared to conventional methods, the latent variables obtained with variational autoencoders preserve more information and are more discriminative towards real astronomical transients.

2024, Journal of Cheminformatics

In drug discovery, virtual screening is crucial for identifying potential hit compounds. This study presents a novel pipeline that employs machine learning models to amalgamate various conventional screening methods. A diverse array of protein targets was selected, and their corresponding datasets were subjected to active/decoy distribution analysis prior to scoring using four distinct methods: QSAR, pharmacophore, docking, and 2D shape similarity, which were ultimately integrated into a single consensus score. The fine-tuned machine learning models were ranked using the novel formula "w_new", consensus scores were calculated, and an enrichment study was performed for each target. Distinctively, consensus scoring outperformed other methods for specific protein targets such as PPARG and DPP4, achieving AUC values of 0.90 and 0.84, respectively. Remarkably, this approach consistently prioritized compounds with higher experimental pIC50 values compared to all other screening methodologies. Moreover, the models demonstrated a range of moderate to high performance in terms of R² values during external validation. In conclusion, this novel workflow consistently delivered superior results, emphasizing the significance of a holistic approach in drug discovery, where both quantitative metrics and active enrichment play pivotal roles in identifying the best virtual screening methodology. Scientific contribution: We presented a novel consensus scoring workflow in virtual screening, merging diverse methods for enhanced compound selection. We also introduced "w_new", a metric that refines machine learning model rankings by weighing various model-specific parameters, improving their efficacy in drug discovery as well as in other domains.
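The idea of merging several screening methods into one consensus score and then checking enrichment with ROC AUC can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not give the "w_new" formula or the consensus rule, so this sketch simply averages z-scored method scores, and the six compounds and their scores are hypothetical.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize one method's scores to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def consensus(score_table):
    """Average the z-scored columns; score_table maps method name -> scores."""
    standardized = [zscores(column) for column in score_table.values()]
    return [mean(per_compound) for per_compound in zip(*standardized)]


def roc_auc(scores, labels):
    """Probability that a random active outranks a random decoy (ties = 0.5)."""
    actives = [s for s, y in zip(scores, labels) if y == 1]
    decoys = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((a > d) + 0.5 * (a == d) for a in actives for d in decoys)
    return wins / (len(actives) * len(decoys))


# Hypothetical scores for six compounds; every column is oriented so that
# higher means "more likely active" (docking energies would be sign-flipped).
scores = {
    "qsar":          [7.1, 6.8, 5.0, 4.9, 6.0, 4.2],
    "pharmacophore": [0.9, 0.4, 0.5, 0.2, 0.7, 0.1],
    "docking":       [9.2, 8.8, 7.0, 7.5, 6.1, 6.0],
    "shape":         [0.8, 0.7, 0.3, 0.4, 0.6, 0.2],
}
labels = [1, 1, 0, 0, 1, 0]  # 1 = active, 0 = decoy

combined = consensus(scores)
print("docking-only AUC:", round(roc_auc(scores["docking"], labels), 3))
print("consensus AUC:   ", round(roc_auc(combined, labels), 3))
```

Standardizing each column first keeps a method with a large numeric range, such as docking, from dominating the average; a weighted mean would be the natural next step if per-method weights were available.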

2024

Machine learning is a branch of artificial intelligence concerned with systems that can learn from data. For instance, a machine learning system can learn from received email to distinguish spam from non-spam messages. After training, the system can sort new messages into their folders using classification. At present, we do not know how to program computers to learn as efficiently as humans. Although the techniques that have been found work effectively for specific purposes, they are not suitable for all purposes. For instance, machine learning algorithms are commonly used in data mining. Even in areas where data are concerned, these algorithms work and give much better results than other methods. For instance, in problems such as speech recognition, algorithms based on machine learning have performed much better than the alternative techniques. Evidently, our understanding of computers will improve step by step. Certainly, one might say that machine learning plays a very important part in the field of software engineering and game technology. This paper describes machine learning algorithms, targeted feature selection schemes, and the removal of useless information.

2024, Research and Reviews: Journal of Computational Biology

Objective: Marburg Virus Disease (MVD) is an infectious viral disease originating from African fruit bats (Rousettus aegyptiacus) that has become the root cause of a fatal hemorrhagic viral fever. As per reports from the WHO, MVD has claimed the lives of millions of people worldwide, with a disease fatality rate ranging from 24% in initial outbreaks to 88% in recent times, owing to differences in viral strains and epidemic management across countries. This study is an attempt to recognize the various biochemical characteristics of the phytocompounds present in Alchemilla vulgaris and to document their medicinal value as a possible source of herbal remedy to prevent or cure Marburg Virus Disease. Methodology: The main protein taken from the Marburg virus for this study is the RNA-binding domain of the VP35 protein (PDB ID: 4GH9). The 3D structure of the protein was taken from the PDB site, while the phytocompounds of Alchemilla vulgaris (133 in total) were derived from the PubChem database. The protein was then prepared by removing the water and heteroatom molecules, as well as ligands that showed poor binding sites. The molecular docking process was then carried out using the PyRx tool. Finally, the drug-likeness and toxicity profiles of the top 3 best-docked phytocompounds were created through the SwissADME tool, BOILED-Egg analysis, and the ADMETlab 2.0 web server. Results: The Ramachandran plot analysis predicted the possible conformations of the amino acid residues in the protein through a graphical diagram of phi (φ) versus psi (ψ) values. The results of the molecular docking process revealed that the top 3 phytocompounds of Alchemilla vulgaris showed significant binding affinities (>7 kcal/mol) with the Marburg virus's VP35 protein, thus conclusively preventing various biochemical processes such as proteolytic cleavage formation and viral translation, transcription, and replication within the host cell. Additionally, the ADME profiling and toxicity prediction showed that all of the top 3 phytocompounds, namely hypericin, beta-sitosterol, and cholesterol, were safe and possessed drug-like characteristics. Conclusion: From the results of this study, it can be concluded that hypericin, beta-sitosterol, and cholesterol, the three ethnobotanical compounds of Alchemilla vulgaris, have significant binding affinity with the Marburg virus's VP35 protein and have the potential to inhibit the development of the viral hemorrhagic fever MVD as an alternative herbal remedy.

2024, Research Square (Research Square)

In this work, we develop a method for generating targeted hit compounds by applying deep reinforcement learning and attention mechanisms to predict binding affinity against a biological target while considering stereochemical information. The novelty of this work is a deep model, the Predictor, that can establish the relationship between chemical structures and their corresponding pIC50 values. We thoroughly study the effect of different molecular descriptors such as ECFP4, ECFP6, SMILES and RDKFingerprint. We also demonstrate the importance of attention mechanisms for capturing long-range dependencies in molecular sequences. Due to the importance of stereochemical information for the binding mechanism, this information was employed in both the prediction and generation processes. To identify the most promising hits, we apply a self-adaptive multi-objective optimization strategy. Moreover, to ensure the existence of stereochemical information, we consider all the possible enumerated stereoisomers to provide the most appropriate 3D structures. We evaluated this approach against Ubiquitin-Specific Protease 7 (USP7) by generating putative inhibitors for this target. The predictor that uses SMILES notation as the descriptor, combined with a bidirectional recurrent neural network and an attention mechanism, performs best. Additionally, our methodology identifies the regions of the generated molecules that are important for the interaction with the receptor's active site. The obtained results also demonstrate that it is possible to discover synthesizable molecules with high biological affinity for the target, including an indication of their optimal stereochemical conformation.
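For readers unfamiliar with the pIC50 scale that the Predictor is trained on, the standard definition is pIC50 = -log10(IC50 in mol/L). The short sketch below applies it to an IC50 given in nanomolar; the function name and the example values are illustrative assumptions, not taken from the paper.

```python
import math


def pic50_from_ic50_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 mol/L, so -log10(x * 1e-9) = 9 - log10(x).
    return 9.0 - math.log10(ic50_nm)


# Hypothetical potencies: a tenfold drop in IC50 raises pIC50 by one unit.
for ic50 in (1000.0, 100.0, 10.0):
    print(f"IC50 = {ic50:7.1f} nM  ->  pIC50 = {pic50_from_ic50_nm(ic50):.2f}")
```

Because the scale is logarithmic, regression errors on pIC50 correspond to multiplicative errors on IC50.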

2024, Journal of chemical theory and computation

We have developed SSTMap, a software package for mapping structural and thermodynamic water properties in molecular dynamics trajectories. The package introduces automated analysis and mapping of local measures of frustration and enhancement of water structure. The thermodynamic calculations are based on Inhomogeneous Fluid Solvation Theory (IST), which is implemented using both site-based and grid-based approaches. The package also extends the applicability of solvation analysis calculations to multiple molecular dynamics (MD) simulation programs by using existing cross-platform tools for parsing MD parameter and trajectory files. SSTMap is implemented in Python and contains both command-line tools and a Python module to facilitate flexibility in setting up calculations and for automated generation of large data sets involving analysis of multiple solutes. Output is generated in formats compatible with popular Python data science packages. This tool will be used by the molecular mo...

2024, ACS Omega

Here, we present a Gaussian-based method for estimation of protein–protein binding entropy to augment the molecular mechanics Poisson–Boltzmann surface area (MM/PBSA) method for computational prediction of binding free energy (ΔG). The method is termed f5-MM/PBSA/E, where "E" stands for entropy and f5 for five adjustable parameters. The enthalpy components of ΔG (molecular mechanics, polar and non-polar solvation energies) are computed from a single implicit solvent generalized Born (GB) energy minimized structure of a protein–protein complex, while the binding entropy is computed using independently GB energy minimized unbound and bound structures. It should be emphasized that the f5-MM/PBSA/E method does not use snapshots, just energy minimized structures, and is thus very fast and computationally efficient. The method is trained and benchmarked in a 5-fold validation test over a data set consisting of 46 protein–protein binding cases with experimentally determined dissociation constant (Kd) values. This data set has been used for benchmarking in recently published protein–protein binding studies that apply conventional MM/PBSA and MM/PBSA with an enhanced sampling method. The f5-MM/PBSA/E tested on the same data set achieves similar or better performance than these computationally demanding approaches, making it an excellent choice for high throughput protein–protein binding affinity prediction studies.
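Benchmarks of this kind compare a computed ΔG with an experimental dissociation constant. The sketch below shows only the standard thermodynamic conversion ΔG = RT ln(Kd), with Kd expressed relative to a 1 M standard state; it is not the f5-MM/PBSA/E method, and the temperature default and example Kd values are assumptions chosen for illustration.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def delta_g_from_kd(kd_molar, temperature_k=298.15):
    """Binding free energy in kcal/mol from a dissociation constant in mol/L."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    # Kd < 1 M gives a negative (favorable) binding free energy.
    return R_KCAL * temperature_k * math.log(kd_molar)


# Hypothetical Kd values spanning micromolar to picomolar binders.
for kd in (1e-6, 1e-9, 1e-12):
    print(f"Kd = {kd:.0e} M  ->  dG = {delta_g_from_kd(kd):6.2f} kcal/mol")
```

At room temperature each tenfold decrease in Kd lowers ΔG by roughly 1.4 kcal/mol, which is why errors of a few kcal/mol in predicted ΔG translate into orders of magnitude in predicted affinity.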

2024, Journal For Basic Sciences, Volume - 23, Issue - 4, PP - 1487 - 1509

For more than three decades, the generation of therapeutically significant small molecules has been greatly helped by computer-aided drug discovery and design (CADD) techniques. These techniques can be roughly categorized as structure-based or ligand-based. Structure-based approaches are similar in that both target and ligand structure information are relevant. QSAR, molecular docking, molecular modelling, pharmacophore modelling, and ADME and toxicity prediction are some of the approaches of CADD. QSAR models are theoretical models that relate a quantitative measure of chemical structure to a physical property or a biological activity. A QSAR model can be generated by using the molecular descriptors of a particular structure. For computer-aided drug design, the binding pose and affinity of a ligand and enzyme are crucial pieces of information. Molecular docking techniques are frequently used to gather this information in the early stages of a drug development process. To improve the quality control of drugs, we predicted the absorption, distribution, metabolism, excretion, and toxicity (ADMET). There is a growing demand for effective tools that predict ADMET properties, both to decrease the risk of late-stage attrition during the design of novel compounds and compound libraries and to improve screening and testing by focusing on only the most promising compounds. CADD has already been used in the discovery of compounds that have passed clinical trials and become novel therapeutics in the treatment of a variety of diseases. The following are a few examples of approved medications that were discovered using CADD tools: three medications for the treatment of human immunodeficiency virus (HIV), namely saquinavir (approved in 1995), ritonavir, and indinavir (both approved in 1996), and the ACE inhibitor captopril, approved in 1981 as an antihypertensive drug.
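The statement that a QSAR model relates a quantitative measure of structure to an activity can be made concrete with the simplest possible case: an ordinary least-squares line through one descriptor. This is a minimal sketch under stated assumptions; the descriptor (logP), the activity values, and the function names are hypothetical, and practical QSAR models use many descriptors and proper validation.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def r_squared(x, y, slope, intercept):
    """Coefficient of determination of the fitted line on (x, y)."""
    y_bar = mean(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical training set: one descriptor (logP) against activity (pIC50).
logp = [1.0, 2.0, 3.0, 4.0]
pic50 = [5.1, 5.9, 7.1, 7.9]

slope, intercept = fit_line(logp, pic50)
print(f"pIC50 = {slope:.2f} * logP + {intercept:.2f}")
print(f"R^2 on training data = {r_squared(logp, pic50, slope, intercept):.3f}")
print(f"predicted pIC50 at logP 2.5 = {slope * 2.5 + intercept:.2f}")
```

The R² printed here is computed on the training data only; an external test set is needed before the model's predictive value can be judged.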

2024

With the recent advances in machine learning for quantum chemistry, it is now possible to predict the chemical properties of compounds and to generate novel molecules. Existing generative models mostly use a string- or graph-based representation, but the precise three-dimensional coordinates of the atoms are usually not encoded. First attempts in this direction have been proposed, where autoregressive or GAN-based models generate atom coordinates. Those either lack a latent space in the autoregressive setting, such that a smooth exploration of the compound space is not possible, or cannot generalize to varying chemical compositions. We propose a new approach to efficiently generate molecular structures that are not restricted to a fixed size or composition. Our model is based on the variational autoencoder, which learns a translation-, rotation-, and permutation-invariant low-dimensional representation of molecules. Our experiments yield a mean reconstruction error below 0.05 Å, outperforming the current state-of-the-art methods by a factor of four, and which is even lower than the spatial quantization error of most chemical descriptors. The compositional and structural validity of newly generated molecules has been confirmed by quantum chemical methods in a set of experiments.

2024, arXiv (Cornell University)

Research in machine learning is at a turning point. While supervised deep learning has conquered the field at a breathtaking pace and demonstrated the ability to solve inference problems with unprecedented accuracy, it still does not quite live up to its name if we think of learning as the process of acquiring knowledge about a subject or problem. Major weaknesses of present-day deep learning models are, for instance, their lack of adaptability to changes of environment or their incapability to perform other kinds of tasks than the one they were trained for. While it is still unclear how to overcome these limitations, one can observe a paradigm shift within the machine learning community, with research interests shifting away from increasing the performance of highly parameterized models to exceedingly specific tasks, and towards employing machine learning algorithms in highly diverse domains. This research question can be approached from different angles. For instance, the field of Informed AI investigates the problem of infusing domain knowledge into a machine learning model, by using techniques such as regularization, data augmentation or post-processing. On the other hand, a remarkable number of works in the recent years has focused on developing models that by themselves guarantee a certain degree of versatility and invariance with respect to the domain or problem at hand. Thus, rather than investigating how to provide domain-specific knowledge to machine learning models, these works explore methods that equip the models with the capability of acquiring the knowledge by themselves. This white paper provides an introduction and discussion of this emerging field in machine learning research. To this end, it reviews the role of knowledge in machine learning, and discusses its relation to the concept of invariance, before providing a literature review of the field. Additionally, it gives insight into some historical context.

2024, β-Amyrin and Benzene-1,2,4-trimethyl from Euphorbia hirta L. and Nauclea latifolia (Smith) Leaves Induce Dauer Diapause via Antagonist Inhibition of daf-12 Receptor

Helminths are a prevalent class among the various classes of parasitic organisms and infect over a billion people worldwide. They cause problems ranging from malnutrition, physical disabilities, mental retardation, and stunted growth to, eventually, death. Studies have shown that over half of the more than 40,000 presumed nematode species are parasitic in nature [1,2]. Unfortunately, few treatment regimens exist to combat these parasites or to treat their hosts. The continuous use of anthelmintics for long periods has, over time, resulted in resistance, which is now common in livestock infected with parasitic nematodes [1]. As alternative therapies, e.g. vaccines, are lacking, novel drugs for the treatment of these infections are urgently needed; hence the choice of medicinal plants [3,4] or plant-based bioactive principle(s) targeting a novel receptor known as the nuclear receptor daf-12 (Figure 1A) (2-6).

2024, arXiv (Cornell University)

We introduce generative adversarial models in which the discriminator is replaced by a calibrated (non-differentiable) classifier repeatedly enhanced by domain relevant features. The role of the classifier is to prove that the actual and generated data differ over a controlled semantic space. We demonstrate that such models have the ability to generate objects with strong guarantees on their properties in a wide range of domains. They require less data than ordinary GANs, provide natural stopping conditions, uncover important properties of the data, and enhance transfer learning. Our techniques can be combined with standard generative models. We demonstrate the usefulness of our approach by applying it to several unrelated domains: generating good locations for cellular antennae, molecule generation preserving key chemical properties, and generating and extrapolating lines from very few data points. Intriguing open problems are presented as well.

2024, JKPK (Jurnal Kimia dan Pendidikan Kimia)

Beta-thalassemia therapy is developed by increasing the production of γ-globin, which binds to α-globin to form fetal haemoglobin (HbF). Meanwhile, DNA methyltransferase 1 (DNMT1) and lysine-specific demethylase 1 (LSD1) play an important role in silencing the HbF gene by inhibiting the production of HbF and inducing haemoglobin subunit alpha (HbA) expression. 6-Shogaol and curcumin induce HbF by inhibiting signal transducer and activator of transcription 3 (STAT3) expression. Therefore, this study predicts the interaction of 6-shogaol and curcumin with DNMT1 and LSD1. The protein structures of DNMT1 (3SWR) and LSD1 (6KGP) were prepared by removing the water molecules, while the validation step was performed by separating the protein from its native ligand (sinefungin for 3SWR and flavin adenine dinucleotide (FAD) for 6KGP) into new Protein Data Bank files. Furthermore, the protein was docked with the native ligand to obtain grid box coordinates, while the root-mean-square deviation (RMSD) was ca...
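Redocking validation of this kind usually compares the docked pose of the native ligand with its crystallographic pose through the root-mean-square deviation of matched atoms. The sketch below is a minimal illustration, assuming both poses are already in the same receptor frame (no superposition step) and that atoms are listed in the same order; the coordinates are hypothetical, and the acceptance threshold used in the study is not stated in the truncated abstract.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered sets of 3D points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


# Hypothetical heavy-atom coordinates (in angstroms) for a three-atom fragment.
crystal_pose = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
docked_pose = [(0.2, 0.1, 0.0), (1.6, 0.2, 0.1), (1.4, 1.7, 0.1)]

print(f"RMSD = {rmsd(crystal_pose, docked_pose):.2f} angstrom")
```

A cutoff of about 2 Å is a commonly quoted convention for calling a redocked pose successful, though it is a rule of thumb rather than a fixed standard.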

2024, Russian Chemical Reviews

The review is devoted to the achievements in analysis of information on chemical reactions using machine learning methods. Four large areas that actively use these methods are outlined: computer-assisted planning of synthesis, analysis and visualization of chemical reaction data, prediction of the quantitative characteristics of reactions and computer-aided design of catalysts.

2024, Journal of Chemical Information and Modeling

Here we show that Generative Topographic Mapping (GTM) [1] can be used to explore the latent space of the SMILES-based autoencoders and generate focused molecular libraries of interest. We have built a sequence-to-sequence neural network with Bidirectional Long Short-Term Memory layers and trained it on the SMILES strings from ChEMBL23. Very high reconstruction rates of the test set molecules were achieved (>98%), which are comparable to the ones reported in related publications [2,3]. Using GTM, we have visualized the autoencoder latent space on the two-dimensional topographic map. Targeted map zones can be used for generating novel molecular structures by sampling associated latent space points and decoding them to SMILES. The sampling method based on a genetic algorithm was introduced to optimize compound properties "on the fly". The generated focused molecular libraries were shown to contain original and a priori feasible compounds which, pending actual synthesis and testing, showed encouraging behavior in independent "structure-based" affinity estimation procedures (pharmacophore matching, docking).

2024

Natural products are a rich resource of bioactive compounds for valuable applications across multiple fields such as food, agriculture, and medicine. For natural product discovery, high throughput in silico screening offers a cost-effective alternative to traditional resource-heavy assay-guided exploration of structurally novel chemical space. In this data descriptor, we report a characterized database of 68,113,839 natural product-like molecules generated using a recurrent neural network trained on known natural products, demonstrating a significant 167-fold expansion in library size over the currently estimated 406,919 natural products known. This study highlights the potential of using deep generative models to uncover novel natural product chemical space for high throughput in silico screening toward natural product discovery.
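The 167-fold figure follows directly from the two counts quoted in the abstract, as this short arithmetic check shows (the variable names are illustrative):

```python
generated = 68_113_839   # natural product-like molecules reported
known = 406_919          # currently estimated known natural products

fold = generated / known
print(f"expansion = {fold:.2f}-fold")  # the abstract rounds this down to 167
```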

2024

Although sources of social media data abound, companies are often reluctant to share data, even anonymized or aggregated, for fear of violating user privacy. This paper introduces an approach for learning the probability of link formation from data using generative adversarial neural networks. In our generative adversarial network (GAN) paradigm, one neural network is trained to generate the graph topology, and a second network attempts to discriminate between the synthesized graph and the original data. After the generative network is fully trained, the learned weights can be disseminated and used to "clone" the hidden dataset with minimal risk of privacy breaches. We believe that the learned neural network also has the potential to serve as a more general model of social network evolution.

2024, Scientific Reports

The global prevalence of breast cancer and its rising frequency make it a key area of research in drug discovery programs. This research article describes the development of a field-based 3D-QSAR model based on in vitro anticancer activity against the human breast cancer cell line MCF7, which provides a molecular-level understanding of the structure-activity relationship for the triterpene maslinic acid and its analogs. Key features such as the average shape, hydrophobic regions and electrostatic patterns of active compounds were mined and mapped to virtually screen potential analogs. Then, field-point-based descriptors were used to develop a 3D-QSAR model by aligning known active compounds onto the identified pharmacophore template. The derived LOO-validated PLS regression QSAR model showed acceptable r² (0.92) and q² (0.75) values. After screening through Lipinski's rule of five filter for oral bioavailability, ADMET risk filter for drug like features, and synthetic accessibility for chemical synthe...
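For readers unfamiliar with the q² statistic, the sketch below shows how a leave-one-out (LOO) cross-validated q² is commonly computed: q² = 1 - PRESS / SS_tot, where PRESS sums the squared errors of predictions made for each compound by a model trained without it. As a loudly labeled simplification, a one-descriptor least-squares line stands in for the PLS model used in the paper, SS_tot is taken around the mean of all observations (one common convention), and the data are made up.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x

def loo_q2(xs, ys):
    """Leave-one-out cross-validated q^2 = 1 - PRESS / SS_tot."""
    mean_y = sum(ys) / len(ys)
    press = 0.0
    for i in range(len(xs)):
        # Refit the model with observation i held out, then predict it.
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_tot

# Made-up descriptor values and activities, for illustration only.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
activity = [1.0, 2.1, 2.9, 4.2, 5.0]
print(loo_q2(descriptor, activity))
```

Because every prediction is made for a held-out compound, q² is normally lower than the fitted r², which is consistent with the 0.92 versus 0.75 gap reported above.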

2024

Facultat de Matemàtiques i Informàtica, MSc thesis: "Bioactivity-oriented de novo design of small molecules by conditional variational autoencoders", by Alex Castrelo Cid.

Deep generative networks are an emerging technology in drug discovery. Our work is divided into two parts. In the first, we built a variational autoencoder (VAE) that is able to learn the grammar of molecules, represent them in a latent space, and generate new ones. In the second, we built and trained a conditional variational autoencoder (CVAE) that is capable of generating new molecules based on desired properties. We will see in detail the architecture of both models and how they were...

This MSc thesis would not have been possible without the help and support of many people to whom I would like to dedicate the following words. Firstly, thanks to my supervisor, Dr. Jordi Vitrià Marca, for sharing his knowledge and experience. Secondly, I would like to thank all my labmates for making my stay in the laboratory amazing. In particular, thanks to Dr. Patrick Aloy and Dr. Miquel Duran-Frigola for leading me through the project and being so patient in explaining to me the biological and chemical background needed to proceed. Also, big thanks to Dr. Martino Bertoni for being my drug dealer, providing me with the drug-like molecules and their properties when I needed them. Finally, all my sincere gratitude to Gisela, for her support in the most stressful days during the Master, and for the help in the English writing process.

2024, International Journal of Computing and Digital Systems

Artificial Intelligence (AI) has appeared as a life-changing innovation in recent years transforming the conventional problem-solving strategies adopted so far. ML and DL-based approaches are making a monumental impact in the fields of life sciences and health care. The tremendous amount of biochemical data has set off leading-edge research in health care and Drug Discovery. Molecular Machine Learning has precisely adopted ML techniques to uncover new insights from biochemical data. Biochemical datasets essentially hold text-based sequential information about molecules in several forms. Simplified Molecular Input Line Entry System (SMILES) is a highly efficient format for representing biochemical data that can be suitably utilized for countless relevant applications. This work presents the SMILES molecular representation in a nutshell and is centered on the major applications of ML and DL in health care especially in the drug discovery process using SMILES. This work utilizes a sequence-to-sequence architecture built on Recurrent Neural Networks (RNNs) for generating small drug-like molecules using the benchmark datasets. The experimental results prove that the Long Short Term Memory (LSTM) based RNNs can be trained to encode the raw SMILES strings with nearly perfect accuracy and to generate similar molecular structures with minimal or no feature engineering. The gradient-based optimization strategy is applied to the network and found distinctly suited to assemble the most stable and proficient sequence model. RNNs can thus be employed in Drug Discovery activities like similarity-based virtual screening, lead compound finding, and hit-to-lead optimization.

2023, Digital Discovery

Group SELFIES is a molecular string representation that incorporates tokens which represent substructures while maintaining robustness, which improves the performance of molecular generative models.

2023, Zenodo (CERN European Organization for Nuclear Research)

2023

Recurrent neural networks have been widely used to generate millions of de novo molecules in a known chemical space. These deep generative models are typically set up with LSTM or GRU units and trained with canonical SMILES. In this study, we introduce a new robust architecture, the Generative Examination Network (GEN), based on bidirectional RNNs with concatenated sub-models to learn and generate molecular SMILES within a trained target space. GENs autonomously learn the target space in a few epochs while being subjected to an independent online examination to measure the quality of the generated set. Here we have used online statistical quality control (SQC) on the percentage of valid molecular SMILES as the examination measure to select the earliest available stable model weights. Very high levels of valid SMILES (95-98%) can be generated using multiple parallel encoding layers in combination with SMILES augmentation using unrestricted SMILES randomization. Our architecture combines an exce...

2023, Journal of Molecular Recognition

Investigation of protein-ligand interactions obtained from experiments plays a crucial part in the design of new and effective drugs. Analyzing the data extracted from known interactions could help scientists to predict the binding affinities of promising ligands before conducting experiments. The objective of this study is to advance the CIFAP (compressed images for affinity prediction) method, which is relevant to a protein-ligand model, identifying 2D electrostatic potential images by separating the binding site of protein-ligand complexes and using the images for predicting the computational affinity information represented by pIC50 values. The CIFAP method has 2 phases, namely, data modeling and prediction. In the data modeling phase, the separated 3D structure of the binding pocket with the ligand inside is fitted into an electrostatic potential grid box, which is then compressed through 3 orthogonal directions into three 2D images for each protein-ligand complex. The sequential floating forward selection technique is performed for acquiring prediction patterns from the images. In the prediction phase, support vector regression (SVR) and partial least squares regression are used for testing the quality of the CIFAP method for predicting the binding affinity of 45 CHK1 inhibitors derived from 2-aminothiazole-4-carboxamide. The results show that the CIFAP method using both support vector regression and partial least squares regression is very effective for predicting the binding affinities of CHK1-ligand complexes with low error values and high correlation. As future work, the results could be improved by working on the pose of the ligands inside the grid.
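The compression step described above, turning one 3D grid into three 2D images, can be sketched as follows. The abstract does not say how values are aggregated along each direction, so summation is an assumption here, as are the function name and the tiny example grid.

```python
def project_grid(grid):
    """Collapse a 3D grid (nested lists indexed [x][y][z]) into three 2D images.

    Each image sums the grid values along one of the three orthogonal axes.
    """
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    # Sum along z: image indexed [x][y].
    xy = [[sum(grid[i][j][k] for k in range(nz)) for j in range(ny)] for i in range(nx)]
    # Sum along y: image indexed [x][z].
    xz = [[sum(grid[i][j][k] for j in range(ny)) for k in range(nz)] for i in range(nx)]
    # Sum along x: image indexed [y][z].
    yz = [[sum(grid[i][j][k] for i in range(nx)) for k in range(nz)] for j in range(ny)]
    return xy, xz, yz

# A 2 x 2 x 2 toy "potential" grid.
grid = [[[1, 2], [3, 4]],
        [[5, 6], [7, 8]]]
xy, xz, yz = project_grid(grid)
print(xy, xz, yz)
```

In the method as described, the resulting images would then feed feature selection and regression; other aggregations (mean, maximum) would fit the same structure.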

2023

With drug resistance becoming extensively pervasive in Plasmodium falciparum infections, research into alternative drugs is becoming mandatory for the prevention and cure of malaria. Increased resistance against antimalarials such as chloroquine and sulfadoxine/pyrimethamine has driven the development of new drug therapies. Aspartic proteases called plasmepsins are present in different species of Plasmodium. Using an in silico structure-based drug design approach, the differences in binding energies of the substrate and inhibitor between the target sites of parasite and human were exploited. The docking studies show several promising molecules from the GSK library that bind more effectively than the already known inhibitors of the drug targets. Several molecules show stronger interactions than the reference molecules and therefore have potential as drug candidates.

2023, arXiv (Cornell University)

Existing drug discovery pipelines take 5-10 years and cost billions of dollars. Computational approaches aim to sample from regions of chemical space, the set of all molecular and solid-state compounds, which could be on the order of 10^60 in size. Deep generative models can model the underlying probability distribution of both the physical structures and properties of drugs and relate them nonlinearly. By exploiting patterns in massive datasets, these models can distill salient features that characterize the molecules. Generative Adversarial Networks (GANs) discover drug candidates by generating molecular structures that obey chemical and physical properties and show affinity towards binding with the receptor for a target disease. However, classical GANs cannot explore certain regions of the chemical space and suffer from the curse of dimensionality. A full quantum GAN may require more than 90 qubits even to generate QM9-like small molecules. We propose a qubit-efficient quantum GAN with a hybrid generator (QGAN-HG) to learn a richer representation of molecules by searching exponentially large chemical space with few qubits more efficiently than a classical GAN. The QGAN-HG model is composed of a hybrid quantum generator that supports various numbers of qubits and quantum circuit layers, and a classical discriminator. QGAN-HG with only 14.93% retained parameters can learn the molecular distribution as efficiently as its classical counterpart. The QGAN-HG variation with patched circuits considerably accelerates our standard QGAN-HG training process and avoids the potential vanishing-gradient issue of deep neural networks. Code is available on GitHub: https://github.com/jundeli/quantum-gan.

2023, Advances in Knowledge Discovery and Data Mining

In this paper, we propose a model using a generative adversarial net (GAN) to generate realistic text. Instead of using a standard GAN, we combine a variational autoencoder (VAE) with a generative adversarial net. The use of high-level latent random variables helps to learn the data distribution and addresses the problem that a generative adversarial net tends to emit similar data. We propose the VGAN model, where the generative model is composed of a recurrent neural network and a VAE. The discriminative model is a convolutional neural network. We train the model via policy gradient. We apply the proposed model to the task of text generation and compare it to other recent neural-network-based models, such as the recurrent neural network language model and Seq-GAN. We evaluate the performance of the model by calculating negative log-likelihood and the BLEU score. We conduct experiments on three benchmark datasets, and results show that our model outperforms other previous models.

2023, Journal of Chemical Information and Modeling

We present a simple, modular graph-based convolutional neural network that takes structural information from protein-ligand complexes as input to generate models for activity and binding mode prediction. Complex structures are generated by a standard docking procedure and fed into a dual-graph architecture that includes separate sub-networks for the ligand bonded topology and the ligand-protein contact map. This network division allows contributions from ligand identity to be distinguished from effects of protein-ligand interactions on classification. We show, in agreement with recent literature, that dataset bias drives many of the promising results on virtual screening that have previously been reported. However, we also show that our neural network is capable of learning from protein structural information when, as in the case of binding mode prediction, an unbiased dataset is constructed. We develop a deep learning model for binding mode prediction that uses docking ranking as input in combination with docking structures. This strategy mirrors past consensus models and outperforms the baseline docking program in a variety of tests, including on cross-docking datasets that mimic real-world docking use cases. Furthermore, the magnitudes of network predictions serve as reliable measures of model confidence.

2023, Journal of King Saud University - Computer and Information Sciences

2023, arXiv (Cornell University)

The Teacher Forcing algorithm trains recurrent networks by supplying observed sequence values as inputs during training and using the network's own one-step-ahead predictions to do multi-step sampling. We introduce the Professor Forcing algorithm, which uses adversarial domain adaptation to encourage the dynamics of the recurrent network to be the same when training the network and when sampling from the network over multiple time steps. We apply Professor Forcing to language modeling, vocal synthesis on raw waveforms, handwriting generation, and image generation. Empirically we find that Professor Forcing acts as a regularizer, improving test likelihood on character-level Penn Treebank and sequential MNIST. We also find that the model qualitatively improves samples, especially when sampling for a large number of time steps. This is supported by human evaluation of sample quality. Trade-offs between Professor Forcing and Scheduled Sampling are discussed. We produce t-SNE plots showing that Professor Forcing successfully makes the dynamics of the network during training and sampling more similar.
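The mismatch that motivates Professor Forcing can be illustrated with a toy example: under teacher forcing each prediction starts from an observed value, while in free-running sampling each prediction starts from the model's previous output, so errors compound. The one-line "model" below is a deliberately biased stand-in, not a recurrent network, and all names are hypothetical.

```python
def teacher_forced(model, observed):
    """One-step-ahead predictions, each conditioned on the observed previous value."""
    return [model(x) for x in observed[:-1]]

def free_running(model, first_value, steps):
    """Multi-step sampling, each step conditioned on the model's own last output."""
    outputs, current = [], first_value
    for _ in range(steps):
        current = model(current)
        outputs.append(current)
    return outputs

observed = [1, 2, 4, 8]            # true sequence doubles at each step

def biased_model(x):
    return 2 * x + 1               # slightly wrong next-step predictor

print(teacher_forced(biased_model, observed))   # error stays at +1 per step
print(free_running(biased_model, 1, 3))         # error grows with each step
```

With observed inputs the predictions are each off by one, whereas the free-running rollout drifts further from the true sequence at every step; this is the train-versus-sample gap in dynamics that the abstract describes.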

2023

A recent outbreak of a new strain of Coronavirus (SARS-CoV-2) has become a global health burden, which has resulted in deaths. No proven drug has been found to effectively cure this fast-spreading infection, hence the need to explore old drugs with known profiles in tackling this pandemic. A computer-aided drug design approach involving virtual screening was used to obtain the binding scores and inhibiting efficiencies of previously known antibiotics against SARS-CoV-2 main protease (Mpro). The drug-likeness analysis of the repurposed drugs was done using the Molinspiration chemoinformatics tool, while the Absorption, Distribution, Metabolism, Excretion, and Toxicity (ADMET) analysis was carried out using the ADMET SAR-2 webserver. Other analyses performed include bioactivities of the repurposed drugs as probable anti-SARS-CoV-2 agents and oral bioavailability analyses, among others. The results were compared with those of drugs currently involved in clinical trials in the ongoing pandemic. Although antibiotics have been speculated to be of no use in the treatment of viral infections, literature has emerged lately to reveal the antiviral potential and immune-boosting ability of antibiotics. This study identified Tarivid and Ciprofloxacin, with binding affinities of -8.3 kcal/mol and -8.1 kcal/mol, respectively, as significant inhibitors of SARS-CoV-2 Mpro with better pharmacokinetics, drug-likeness and oral bioavailability, bioactivity properties, ADMET properties and inhibitory strength compared to Remdesivir (-7.6 kcal/mol) and Azithromycin (-6.3 kcal/mol). These observations will provide insight for further research (clinical trials) in the cure and management of COVID-19.

Keywords: COVID-19 · SARS-CoV-2 main protease (Mpro) · Molecular docking · Antibiotics · ADMET profiling

2023, Frontiers in Signal Processing

Developing models for identifying mild traumatic brain injury (mTBI) has often been challenging due to large variations in data from subjects, resulting in difficulties for the mTBI-identification models to generalize to data from unseen subjects. To tackle this problem, we present a long short-term memory-based adversarial variational autoencoder (LSTM-AVAE) framework for subject-invariant mTBI feature extraction. In the proposed model, first, an LSTM variational autoencoder (LSTM-VAE) combines the representation learning ability of the variational autoencoder (VAE) with the temporal modeling characteristics of the LSTM to learn the latent space representations from neural activity. Then, to detach the subject’s individuality from neural feature representations, and make the model proper for cross-subject transfer learning, an adversary network is attached to the encoder in a discriminative setting. The model is trained using the 1 held-out approach. The trained encoder is then use...

2023, Journal of Chemical Information and Modeling

Relative binding free energy calculations in drug design are becoming a useful tool in facilitating lead binding affinity optimization in a cost- and time-efficient manner. However, they have been limited by technical challenges such as the manual creation of large numbers of input files to set up, run, and analyze free energy simulations. In this Application Note, we describe FEPrepare, a novel web-based tool, which automates the setup procedure for relative binding FEP calculations for the dual-topology scheme of NAMD, one of the major MD engines, using OPLS-AA force field topology and parameter files. FEPrepare provides the user with all necessary files needed to run a FEP/MD simulation with NAMD. FEPrepare can be accessed and used at https://feprepare.vi-seem.eu/.

2023, ACM Transactions on Asian and Low-Resource Language Information Processing

Text processing techniques in Natural Language Processing (NLP) find applications in many industries such as pharmaceutical, automation, and automotive. Drug design using variational autoencoders is a popular data-assisted technique to design drug molecules with control over molecular properties. It generates a continuous latent space, which can be optimized. This paper introduces a constrained variational autoencoder-based molecular generation structure using the SMILES format. The proposal is accompanied by the generation of molecules, filtering them based on scores, and subsequently determining the optimal molecules by using mature NLP techniques. To generate a more meaningful latent space, a condition vector of molecular properties is combined with the SMILES representation of molecules. A tunable parameter (diversity, D) is also used to control the diversity of the generated molecules. The proposed architecture is evaluated using standard datasets. Validity, uniqueness, and FCD are...

2023, Journal of Biomechanics

2023

ChemML is an open machine learning and informatics program suite that is designed to support and advance the data-driven research paradigm that is currently emerging in the chemical and materials domain. ChemML allows its users to perform various data science tasks and execute machine learning workflows that are adapted specifically for the chemical and materials context. Key features are automation, general-purpose utility, versatility, and user-friendliness in order to make the application of modern data science a viable and widely accessible proposition in the broader chemistry and materials community. ChemML is also designed to facilitate methodological innovation, and it is one of the cornerstones of the software ecosystem for data-driven in silico research outlined in our recent publication [1].

2023, Journal of Chemical Theory and Computation

Protein−protein docking typically consists of the generation of putative binding conformations, which are subsequently ranked by fast heuristic scoring functions. The simplicity of these functions allows for computational efficiency but has severe repercussions on their discrimination capabilities. In this work, we show the effectiveness of suitable descriptors calculated along short scaled molecular dynamics runs in recognizing the nearest-native bound conformation among a set of putative structures generated by the HADDOCK tool for eight protein−protein systems.

2023, Journal of Chemical Theory and Computation

We have developed SSTMap, a software package for mapping structural and thermodynamic water properties in molecular dynamics trajectories. The package introduces automated analysis and mapping of local measures of frustration and enhancement of water structure. The thermodynamic calculations are based on Inhomogeneous Fluid Solvation Theory (IST), which is implemented using both site-based and grid-based approaches. The package also extends the applicability of solvation analysis calculations to multiple molecular dynamics (MD) simulation programs by using existing cross-platform tools for parsing MD parameter and trajectory files. SSTMap is implemented in Python and contains both command-line tools and a Python module to facilitate flexibility in setting up calculations and for automated generation of large data sets involving analysis of multiple solutes. Output is generated in formats compatible with popular Python data science packages. This tool will be used by the molecular mo...

2023, Journal of Chemical Theory and Computation

In the context of drug-receptor binding affinity calculations using molecular dynamics techniques, we implemented a combination of Hamiltonian replica exchange (HREM) and a novel nonequilibrium alchemical methodology, called virtual double-system single-box, with increased accuracy, precision, and efficiency with respect to the standard nonequilibrium approaches. The method has been applied to determine the absolute binding free energies of 16 newly designed noncovalent ligands of the main protease (3CLpro) of SARS-CoV-2. The core structures of the 3CLpro ligands were previously identified using a multimodal structure-based ligand design in combination with docking techniques. The calculated binding free energies for four additional ligands with known activity (for either the SARS-CoV or the SARS-CoV-2 main protease) are also reported. The nature of binding in the 3CLpro active site and the residues involved besides the Cys-His catalytic dyad have been thoroughly characterized by enhanced sampling simulations of the bound state. We have identified several noncongeneric compounds with predicted low micromolar activity for 3CLpro inhibition, which may constitute possible lead compounds for the development of antiviral agents for COVID-19 treatment.
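The link between an absolute binding free energy and a "low micromolar" activity can be sketched with the standard thermodynamic relation ΔG° = RT ln(Kd / c°). The snippet below is only an illustration of that arithmetic: it assumes a 1 M standard state, a temperature of 298.15 K, and that the reported activity can be approximated by a dissociation constant Kd (which is not the same quantity as an IC50). The function names are hypothetical and no values from the paper are used.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def dg_to_kd(dg_kcal_per_mol, temperature_k=298.15):
    """Dissociation constant in mol/L from a standard binding free energy.

    Uses Kd / c0 = exp(dG / RT) with the standard concentration c0 = 1 M.
    """
    return math.exp(dg_kcal_per_mol / (R_KCAL * temperature_k))


def kd_to_dg(kd_molar, temperature_k=298.15):
    """Standard binding free energy in kcal/mol from Kd in mol/L."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)


if __name__ == "__main__":
    # Illustrative inputs only, not results from the paper.
    for dg in (-6.0, -8.0, -10.0):
        print(f"dG = {dg:6.1f} kcal/mol  ->  Kd = {dg_to_kd(dg):.2e} M")
```

Under these assumptions a Kd of 1 µM corresponds to roughly -8.2 kcal/mol, and each additional 1.4 kcal/mol of binding free energy changes Kd by about a factor of ten.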

2023, ArXiv

Chemistry42 is a software platform for de novo small molecule design that integrates Artificial Intelligence (AI) techniques with computational and medicinal chemistry methods. Chemistry42 is unique in its ability to generate novel molecular structures with predefined properties validated through in vitro and in vivo studies. Chemistry42 is a core component of Insilico Medicine’s Pharma.ai drug discovery suite that also includes target discovery and multi-omics data analysis (PandaOmics) and clinical trial outcomes predictions (InClinico).

2023, arXiv (Cornell University)

Graph neural networks are emerging as promising methods for modeling molecular graphs, in which nodes and edges correspond to atoms and chemical bonds, respectively. Recent studies show that when 3D molecular geometries, such as bond lengths and angles, are available, molecular property prediction tasks can be made more accurate. However, computing 3D molecular geometries requires quantum calculations that are computationally prohibitive. For example, accurately calculating the 3D geometry of a small molecule requires hours of computing time using density functional theory (DFT). Here, we propose to predict the ground-state 3D geometries from molecular graphs using machine learning methods. To make this feasible, we develop a benchmark, known as Molecule3D, that includes a dataset with precise ground-state geometries of approximately 4 million molecules derived from DFT. We also provide a set of software tools for data processing, splitting, training, and evaluation. Specifically, we propose to assess the error and validity of predicted geometries using four metrics. We implement two baseline methods that predict either the pairwise distances between atoms or the atom coordinates in 3D space. Experimental results show that, compared with generating 3D geometries with RDKit, our method can achieve comparable prediction accuracy at a much lower computational cost. Our Molecule3D is available as a module of the MoleculeX software library (https://github.com/divelab/MoleculeX).
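One reason pairwise distances are a natural prediction target is that they do not change when a molecule is translated or rotated, whereas raw coordinates do. The sketch below illustrates this with a simple mean absolute error over interatomic distances. It is a hypothetical example: the abstract does not name its four metrics, so this error measure is an assumption rather than the benchmark's definition, and the function names and coordinates are made up for illustration.

```python
import math
from itertools import combinations


def pairwise_distances(coords):
    """Euclidean distance for every unordered atom pair (i, j) with i < j."""
    return {
        (i, j): math.dist(coords[i], coords[j])
        for i, j in combinations(range(len(coords)), 2)
    }


def distance_mae(pred_coords, ref_coords):
    """Mean absolute error between predicted and reference pair distances."""
    if len(pred_coords) != len(ref_coords):
        raise ValueError("both geometries must have the same number of atoms")
    pred = pairwise_distances(pred_coords)
    ref = pairwise_distances(ref_coords)
    if not ref:  # fewer than two atoms: no pairs to compare
        return 0.0
    return sum(abs(pred[pair] - ref[pair]) for pair in ref) / len(ref)


if __name__ == "__main__":
    # Made-up three-atom geometry, coordinates in angstroms.
    reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    # The same geometry shifted rigidly: all pair distances are unchanged.
    shifted = [(x + 5.0, y - 2.0, z + 1.0) for x, y, z in reference]
    # A geometry with one bond stretched from 1.0 to 2.0 angstroms.
    stretched = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

    print(distance_mae(shifted, reference))
    print(distance_mae(stretched, reference))
```

A coordinate-based error would penalize the shifted geometry even though it is physically identical to the reference, which is why coordinate predictions are usually aligned or converted to distances before being scored.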