Toxicity prediction of small drug molecules of aryl hydrocarbon receptor using a proposed ensemble model

Screening of 397 Chemicals and Development of a Quantitative Structure−Activity Relationship Model for Androgen Receptor Antagonism

Chemical Research in Toxicology, 2008

We have screened 397 chemicals for human androgen receptor (AR) antagonism by a sensitive reporter gene assay to generate data for the development of a quantitative structure-activity relationship (QSAR) model. A total of 523 chemicals comprising data on 292 chemicals from our laboratory and data on 231 chemicals from the literature constituted the training set for the model. The chemicals were selected with the purpose of representing a wide range of chemical structures (e.g., organochlorines and polycyclic aromatic hydrocarbons) and various functions (e.g., natural hormones, pesticides, plasticizers, plastic additives, brominated flame retardants, and roast mutagens). In addition, the intention was to obtain an equal number of positive and negative chemicals. Among our own data for the training set, 45.7% exhibited inhibitory activity against the transcriptional activity induced by the synthetic androgen R1881. The MultiCASE expert system was used to construct a QSAR model for AR antagonizing potential. A "5 Times, 2-Fold 50% Cross Validation" of the model showed a sensitivity of 64%, a specificity of 84%, and a concordance of 76%. Data for 102 chemicals were generated for an external validation of the model, resulting in a sensitivity of 57%, a specificity of 98%, and a concordance of 92%. The model was run on a set of 176103 chemicals, and 47% were within the domain of the model. Approximately 8% of the chemicals were predicted active for AR antagonism. We conclude that the predictability of the global QSAR model for this end point is good. This most comprehensive QSAR model may become a valuable tool for screening large numbers of chemicals for AR antagonism.
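
The sensitivity, specificity, and concordance quoted in this abstract follow from the standard confusion-matrix definitions. A minimal Python sketch is given here for readers who want to reproduce such figures; the function name and the counts are hypothetical illustrations and are not taken from the paper.

```python
def screening_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, concordance) as fractions.

    tp/fn: active chemicals predicted active/inactive.
    tn/fp: inactive chemicals predicted inactive/active.
    """
    sensitivity = tp / (tp + fn)                   # share of actives found
    specificity = tn / (tn + fp)                   # share of inactives found
    concordance = (tp + tn) / (tp + fn + tn + fp)  # overall agreement
    return sensitivity, specificity, concordance


# Hypothetical counts, chosen only to illustrate the arithmetic.
sens, spec, conc = screening_metrics(tp=40, fn=10, tn=45, fp=5)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} concordance={conc:.2f}")
```

When inactives greatly outnumber actives, concordance is dominated by specificity, which is one way a model can combine modest sensitivity with high concordance.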

Voting-based ensemble method for prediction of bioactive molecules

2017 2nd International Conference on Knowledge Engineering and Applications (ICKEA), 2017

Machine learning has its tentacles spread over all major areas of science. The current rise in the amount of data being generated has necessitated its adoption in virtually all aspects including chemoinformatics. Several machine learning methods have been applied to the drug discovery process due to the importance of prediction of bioactivity before the release of drug into the market. The need for the most accurate method is hence evident. Majority voting ensemble is a method whose application is rare in predicting bioactive molecules. This study applies the method using different combinations of commonly used classifiers as the base classifier on a chemical dataset of 8294 instances and 1024 attributes retrieved from the MDL Drug Data Report (MDDR). The accuracy of majority voting with the best combination of classifiers is found to be higher than the accuracy of the commonly used classifiers in the field, and makes it suitable for large chemical datasets.
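
Majority voting, as described in this abstract, can be sketched in a few lines of Python. The function name, the three stand-in classifiers, and their predictions are assumptions made for illustration; the paper's actual base classifiers and data are not reproduced.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine hard labels from several classifiers by majority vote.

    predictions: one list of labels per classifier, all of equal length.
    Ties are resolved in favour of the label seen first for that sample.
    """
    n_samples = len(predictions[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(clf_preds[i] for clf_preds in predictions)
        voted.append(votes.most_common(1)[0][0])
    return voted


# Hypothetical predictions (1 = active, 0 = inactive) for four molecules.
clf_a = [1, 0, 1, 1]
clf_b = [1, 1, 0, 1]
clf_c = [0, 0, 1, 1]
ensemble = majority_vote([clf_a, clf_b, clf_c])
print(ensemble)
```

Using an odd number of base classifiers on a binary task avoids ties altogether.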

Ensemble learning method for the prediction of new bioactive molecules

PloS one, 2018

Pharmacologically active molecules can provide remedies for a range of different illnesses and infections. Therefore, the search for such bioactive molecules has been an enduring mission. As such, there is a need to employ a more suitable, reliable, and robust classification method for enhancing the prediction of the existence of new bioactive molecules. In this paper, we adopt a recently developed combination of different boosting methods (Adaboost) for the prediction of new bioactive molecules. We conducted the research experiments utilizing the widely used MDL Drug Data Report (MDDR) database. The proposed boosting method generated better results than other machine learning methods. This finding suggests that the method is suitable for inclusion among the in silico tools for use in cheminformatics, computational chemistry and molecular biology.

Stacked Ensemble for Bioactive Molecule Prediction

IEEE Access

Bioactive molecular compounds are essential for drug discovery. The biological activity of these compounds needs to be predicted as this is used to determine the drug-target ability. As ineffective drugs are discarded after production, leading to resource and time wastage, it is important to predict bioactive molecules with models having high predictive performance. This study utilizes the stacked ensemble, which uses the predictions of multiple base classifiers as features to train a meta classifier that makes the final prediction. Using three datasets, DS1, DS2, and DS3, obtained from the MDL Drug Data Report (MDDR) database, the performance of the stacked ensemble was compared to three other ensembles (Adaboost, bagging, and vote ensemble) based on different evaluation criteria and also a statistical method, Kendall's W test. The accuracy of the Stacked ensemble was 96.7002%, 98.2260%, and 94.9007% for the three datasets respectively, although Vote had the best accuracy using dataset DS2, which consists of structurally homogeneous bioactive molecules. Also, using Kendall's W test to rank the ensembles, the Stacked ensemble was ranked best with datasets DS1 and DS3, with both having a mean average of 4.00 and an overall level of agreement, W, of 0.986 and 1.000 respectively. Using dataset DS2, it was ranked after Vote and Adaboost with a mean average of 2.33 and an overall level of agreement, W, of 0.857. The Stacked ensemble is recommended for the prediction of heterogeneous bioactive molecules during drug discovery and can also be implemented in other research areas.
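
Kendall's W, the agreement statistic used in this abstract to rank the ensembles, can be computed directly from rank sums. The Python sketch below uses the textbook formula without tie correction; the function name and the example rankings are hypothetical and do not come from the paper.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance W, without tie correction.

    rankings: m lists (one per rater or evaluation criterion), each
    assigning ranks 1..n to the same n items.
    """
    m = len(rankings)
    n = len(rankings[0])
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Three hypothetical criteria ranking four ensembles (4 = best).
full_agreement = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
partial_agreement = [[1, 2, 3, 4], [1, 2, 4, 3], [2, 1, 3, 4]]
w_full = kendalls_w(full_agreement)
w_partial = kendalls_w(partial_agreement)
print(w_full, round(w_partial, 3))
```

With four items ranked so that 4 is best, a method placed first by every criterion has a mean rank of 4.00 and W equals 1, which is consistent with the pattern the abstract reports for DS3.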

Integration of the Butina algorithm and ensemble learning strategies for the advancement of a pharmacophore ligand-based model: an in silico investigation of apelin agonists

Frontiers in chemistry, 2024

Introduction: 3D pharmacophore models describe the ligand's chemical interactions in their bioactive conformation. They offer a simple but sophisticated approach to decipher the chemically encoded ligand information, making them a valuable tool in drug design. Methods: Our research summarized the key studies for applying 3D pharmacophore models in virtual screening for 6,944 compounds of APJ receptor agonists. Recent advances in clustering algorithms and ensemble methods have enabled classical pharmacophore modeling to evolve into more flexible and knowledge-driven techniques. Butina clustering categorizes molecules based on their structural similarity (indicated by the Tanimoto coefficient) to create a structurally diverse training dataset. The learning method combines various individual pharmacophore models into a set of pharmacophore models for pharmacophore space optimization in virtual screening. Results: This approach was evaluated on Apelin datasets and afforded good screening performance, as proven by Receiver Operating Characteristic (AUC score of 0.994 ± 0.007), enrichment factor (EF1% of 50.07 ± 0.211), Güner-Henry score of 0.956 ± 0.015, and F-measure of 0.911 ± 0.031. Discussion: Although one of the high-scoring models achieved statistically superior results in each dataset (AUC of 0.82; an EF1% of 19.466; GH of 0.131 and F1-score of 0.071), the ensemble learning method including voting and stacking methods balanced the shortcomings of each model and passed with close performance measures.
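
The clustering step mentioned in this abstract can be illustrated with a simplified, pure-Python version of Butina clustering driven by Tanimoto similarity. Fingerprints are represented here as sets of on-bit indices; the function names, the toy fingerprints, and the 0.5 cutoff are assumptions for illustration and are not the paper's settings.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def butina_clusters(fingerprints, cutoff):
    """Simplified Butina clustering on a similarity cutoff.

    Molecules with the most neighbours become centroids first; every
    still-unassigned neighbour of a centroid joins its cluster.
    """
    n = len(fingerprints)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto(fingerprints[i], fingerprints[j]) >= cutoff:
                neighbors[i].add(j)
                neighbors[j].add(i)
    order = sorted(range(n), key=lambda i: len(neighbors[i]), reverse=True)
    assigned, clusters = set(), []
    for centroid in order:
        if centroid in assigned:
            continue
        members = [centroid] + sorted(neighbors[centroid] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


# Toy fingerprints (hypothetical on-bit sets).
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5},
       {10, 11, 12}, {10, 11, 13}, {20, 21}]
clusters = butina_clusters(fps, cutoff=0.5)
diverse_subset = [members[0] for members in clusters]  # one per cluster
print(clusters, diverse_subset)
```

Taking one representative per cluster, as in `diverse_subset`, is one simple way to obtain a structurally diverse training set.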

Anti-cancer Drug Activity Prediction by Ensemble Learning

Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, 2016

Personalized cancer treatment is an ever-evolving approach due to the complexity of cancer. As a part of personalized therapy, the effectiveness of a drug on a cell line is measured. However, these experiments are laborious and costly. To surmount these difficulties, computational methods are used with the provided data sets. In the present study, we considered this as a regression problem and designed an ensemble model by combining three different regression models to reduce prediction error for each drug-cell line pair. Two major data sets were used to evaluate our method. Results of this evaluation show that the predictions of the ensemble method are significantly better than those of the individual models. Furthermore, we report the cytotoxicity predictions of our model for the drug-cell line pairs that do not appear in the original data sets.

Classifier Ensemble Based on Feature Selection and Diversity Measures for Predicting the Affinity of A2B Adenosine Receptor Antagonists

Journal of Chemical Information and Modeling, 2013

A2B adenosine receptor antagonists may be beneficial in treating diseases like asthma, diabetes, diabetic retinopathy, and certain cancers. This has stimulated research for the development of potent ligands for this subtype, based on quantitative structure-affinity relationships. In this work, a new ensemble machine learning algorithm is proposed for classification and prediction of the ligand-binding affinity of A2B adenosine receptor antagonists. This algorithm is based on the training of different classifier models with multiple training sets (composed of the same compounds but represented by diverse features). The k-nearest neighbor, decision trees, neural networks, and support vector machines were used as single classifiers. To select the base classifiers for combining into the ensemble, several diversity measures were employed. The final multiclassifier prediction results were computed by combining the outputs of the selected base classifiers using different mathematical functions, including majority vote, maximum probability, and average probability. In this work, 10-fold cross-validation and external validation were used. The strategy led to the following results: i) the single classifiers, together with prior feature selection, resulted in good overall accuracy, ii) a comparison between single classifiers and their combinations in the multiclassifier model showed that using our ensemble gave a better performance than the single classifier model, and iii) our multiclassifier model performed better than the most widely used multiclassifier models in the literature. The results and statistical analysis demonstrated the superiority of our multiclassifier approach for predicting the affinity of A2B adenosine receptor antagonists, and it can be used to develop other QSAR models.
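
The three combination functions named in this abstract (majority vote, maximum probability, and average probability) can give different answers for the same compound. The Python sketch below shows one common binary-class reading of each rule; the function name, the 0.5 threshold, and the probabilities are assumptions, and the paper's exact formulation may differ.

```python
def combine(probabilities, rule):
    """Fuse per-classifier P(active) values for one compound into a label.

    Returns 1 (active) or 0 (inactive).
    """
    if rule == "majority":
        votes = sum(p >= 0.5 for p in probabilities)
        return 1 if votes > len(probabilities) / 2 else 0
    if rule == "average":
        mean_p = sum(probabilities) / len(probabilities)
        return 1 if mean_p >= 0.5 else 0
    if rule == "maximum":
        # Follow whichever class has the single most confident supporter.
        best_active = max(probabilities)
        best_inactive = max(1 - p for p in probabilities)
        return 1 if best_active >= best_inactive else 0
    raise ValueError(f"unknown rule: {rule}")


# Hypothetical outputs of three base classifiers for one compound.
probs = [0.9, 0.4, 0.45]
results = {rule: combine(probs, rule)
           for rule in ("majority", "average", "maximum")}
print(results)
```

Here one very confident classifier outweighs two mildly negative ones under the average and maximum rules, while the majority vote sides with the two negatives.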

T-Ensemble Approach for Drug Toxicity Prediction

People are unwittingly exposed to numerous chemicals through disparate sources such as food and medicines. These chemicals can be toxic. Assessing the toxicity of chemical compounds has the power to improve environmental chemicals. Machine learning has received much attention in predictive analytics. But drug data is complex and consists of millions of features or chemical descriptors. To properly analyze the overall impact of these features on the prediction model, an efficient feature selection method should be adopted. This work proposes a T-Ensemble framework that employs a triple ensemble feature selection technique and an ensemble classifier for the classification of drug toxicity molecules. The experiments were carried out on high-dimensional drug data and achieved an accuracy of 97%. Promising results are found when the performance of the proposed T-Ensemble framework is compared with standard classifiers such as SVM, random forest, and bagging, using evaluation metrics such as accuracy and sensitivity. With the tremendous growth of toxic chemicals, machine learning will play a big part in improving the quality of chemical compounds in the future.

Assessment of Prediction Confidence and Domain Extrapolation of Two Structure-Activity Relationship Models for Predicting Estrogen Receptor Binding Activity

Environmental Health Perspectives, 2004

Quantitative structure-activity relationship (QSAR) methods have been widely applied in drug discovery, lead optimization, toxicity prediction, and regulatory decisions. Despite major advances in algorithms and software, QSAR models have inherent limitations associated with the size and chemical-structure diversity of the training set, experimental error, and many characteristics of structure representation and correlation algorithms. Whereas excellent fit to the training data may be readily attainable, often models fail to predict accurately chemicals that are outside their domain of applicability. A QSAR's utility and, in the case of regulatory decisions, justification for usage increasingly depend on the ability to quantify a model's potential for predicting unknown chemicals with some known degree of certainty. It is never possible to predict an unknown chemical with absolute certainty. Here we report on two QSAR models based on different data sets for classification of chemicals according to their ability to bind to the estrogen receptor. The models were developed by using a novel QSAR method, Decision Forest, which combines the results of multiple heterogeneous but comparable Decision Tree models to produce a consensus prediction. We used an extensive cross-validation process to define an applicability domain for model predictions based on two quantitative measures: prediction confidence and domain extrapolation. Together, these measures quantify the accuracy of each prediction within and outside of the training domain. Despite being based on large and diverse training sets, both QSAR models had poor accuracy for chemicals within the domain of low confidence, whereas good accuracy was obtained for those within the domain of high confidence. For prediction in the high confidence domain, accuracy was inversely proportional to the degree of domain extrapolation. The model with a larger training set of 1,092, compared with 232 for the other, was more accurate in predicting chemicals at larger domain extrapolation, and could be particularly useful for rapidly prioritizing potential endocrine disruptors from a large chemical universe.
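
The idea of a consensus prediction with an attached confidence can be sketched as follows. This is an assumption-laden illustration, not the paper's stated procedure: it assumes the consensus is the mean of the per-tree probabilities and that confidence is the scaled distance of that mean from 0.5. The function name and the tree outputs are hypothetical.

```python
def consensus_prediction(tree_probs):
    """Combine per-tree P(binder) values into a label and a confidence.

    Assumed scheme: label from the mean probability; confidence is
    |mean - 0.5| / 0.5, so 0 means undecided and 1 means unanimous.
    """
    mean_p = sum(tree_probs) / len(tree_probs)
    label = 1 if mean_p > 0.5 else 0
    confidence = abs(mean_p - 0.5) / 0.5
    return label, confidence


# Hypothetical outputs of four trees for two chemicals.
confident = consensus_prediction([0.9, 0.8, 0.95, 0.85])
borderline = consensus_prediction([0.55, 0.4, 0.6, 0.5])
print(confident, borderline)
```

Both chemicals receive the same label here, but only the first would fall into a high-confidence domain under a confidence threshold.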

Evaluation of Quantitative Structure-Activity Relationship Methods for Large-Scale Prediction of Chemicals Binding to the Estrogen Receptor

Journal of Chemical Information and Modeling, 1998

A thorough comparison between different QSAR modeling strategies is presented. The comparison is conducted for local versus global modeling strategies, risk assessment, and computational cost. The strategies are implemented using random forests, support vector machines, and partial least squares. Results are presented for simulated data, as well as for real data, generally indicating that a global modeling strategy is preferred over a local strategy. Furthermore, the results also show that there is a pronounced risk and a comparatively high computational cost when using the local modeling strategies.