Random forest models of the retention constants in the thin layer chromatography

arXiv (Cornell University), 2011

In this study we examine the application of machine learning methods to modelling retention constants in thin layer chromatography (TLC). The problem can be described with hundreds or even thousands of descriptors covering various molecular properties, most of them redundant or irrelevant for predicting the retention constant. We therefore employed feature selection to significantly reduce the number of attributes. We also tested applying the bagging procedure to the feature selection itself. Random forest regression models were then built using the selected variables. The resulting models correlate better with the experimental data than the reference models obtained with linear regression. Cross-validation confirms the robustness of the models.

A performance comparison of modern statistical techniques for molecular descriptor selection and retention prediction in chromatographic QSRR studies

Chemometrics and Intelligent Laboratory Systems, 2005

As datasets become larger, finding a solution to the problem of variable selection becomes harder. The problem is to define which subset of variables produces optimum predictions. The example studied aims to predict the chromatographic retention of 83 basic drugs on a Unisphere PBD column at pH 11.7 using 1272 molecular descriptors. The goal of this paper is to compare the relative performance of recently developed data mining methods, specifically classification and regression trees (CART), stochastic gradient boosting for tree-based models (Treeboost), and random forests (RF), with statistical techniques common in chemometrics: genetic algorithms on multiple linear regression (GA-MLR), uninformative variable elimination partial least squares (UVE-PLS), and SIMPLS. The comparison is made primarily on predictive performance, but also on the variables found to be most important for the predictions. The results of this study indicated that, individually, GA-MLR (R² = 0.93) outperformed all other models. Further analysis found that a combined approach of GA-MLR and Treeboost (R² = 0.98) further improved these results.

Evaluating the performances of quantitative structure-retention relationship models with different sets of molecular descriptors and databases for high-performance liquid chromatography predictions

Journal of Chromatography A, 2009

Quantitative structure-retention relationship (QSRR) models were studied for two databases: one with 151 compounds and the other with 1719 compounds. In both cases, the three modeling methods employed (multiple linear regression, partial least squares, and random forests) provided similar prediction results with regard to root-mean-square error of prediction. Seven molecular descriptors related to reversed-phase retention provided better models for the smaller dataset, while the use of over 2000 molecular descriptors generated better models for the larger dataset. The QSRR models were then validated with a mixture of an active pharmaceutical ingredient and its four process/degradation impurities. Finally, classification of compounds based on similar log D profiles before QSRR modeling improved chromatographic predictability for the models used. The results showed that database composition had a desirable effect on prediction accuracy for certain input molecules.

Use of Random forest in the identification of important variables

Microchemical Journal, 2019

The Random Forest (RF) technique has shown promise for supervised classification across different matrices. However, approaches to identifying the significant variables that carry weight in the model are scarce for classification problems. In this paper, we propose a methodology for selecting the most relevant variables in the construction of RF models. To apply this methodology, classification models were developed to discriminate crude oil samples according to their maximum pour point (MPP). To this end, MPP data (ASTM D5853) for 105 crude oil samples, together with their hydrogen (¹H) and carbon (¹³C) NMR spectra, were acquired. With MPP ranging from −54°C to 39°C, two classes were assigned: the first containing 43 samples with MPP ≤ −9°C, and the second containing 62 samples with MPP > −9°C. The ¹H NMR model, with 90% accuracy, and the ¹³C NMR model, with 71% accuracy, were used in the variable selection method. The results showed that the proposed methodology was effective in distinguishing the variables that contributed most to the discrimination of the oils. This new tool therefore enabled a greater understanding of the chemical information of interest contained in the spectra and its relationship with the MPP of the crude oil samples.

Variable selection in random forest with application to quantitative structure-activity relationship

Proceedings of the 7th Course on Ensemble Methods for Learning Machines, 2004

A wrapper variable selection procedure is proposed for use with learning machines that generate a measure of variable importance, such as Random Forest. The procedure is based on iteratively removing low-ranking variables and assessing the learning machine performance by cross-validation. The procedure is implemented for Random Forest on some QSAR modeling examples from drug discovery and development. It is shown that the non-recursive version of the procedure outperforms the recursive version, and that the default ...
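The iterative removal of low-ranking variables described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `wrapper_selection`, the callables `rank_fn` and `cv_error_fn`, and the choice to drop half of the variables per step are all assumptions made for the example. In practice `rank_fn` would return variables ordered by a model's importance measure and `cv_error_fn` would return a cross-validated error for a model fitted on the given subset; here both are toy stand-ins.

```python
def wrapper_selection(features, rank_fn, cv_error_fn,
                      drop_fraction=0.5, recursive=False):
    """Return (best_subset, best_error) found by repeatedly dropping
    the lowest-ranked features and re-checking the error estimate."""
    ranked = rank_fn(list(features))          # most important first
    best_subset, best_error = list(ranked), cv_error_fn(ranked)
    while len(ranked) > 1:
        if recursive:
            # Recursive variant: re-rank using only the surviving features.
            ranked = rank_fn(ranked)
        n_keep = max(1, int(len(ranked) * (1 - drop_fraction)))
        ranked = ranked[:n_keep]              # discard the low-ranking tail
        error = cv_error_fn(ranked)
        if error <= best_error:               # prefer smaller subsets on ties
            best_subset, best_error = list(ranked), error
    return best_subset, best_error


# Toy stand-ins (hypothetical): "a" and "b" are informative, the rest are noise.
IMPORTANCE = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.1,
              "e": 0.1, "f": 0.1, "g": 0.1, "h": 0.1}

def toy_rank(feats):
    # Stands in for a model-derived importance ranking.
    return sorted(feats, key=lambda f: -IMPORTANCE[f])

def toy_cv_error(feats):
    # Stands in for a cross-validated error: missing an informative
    # feature costs 10, each included noise feature costs 1.
    missing = len({"a", "b"} - set(feats))
    noise = len(set(feats) - {"a", "b"})
    return 10 * missing + noise

subset, error = wrapper_selection(list(IMPORTANCE), toy_rank, toy_cv_error)
```

With `recursive=False` the ranking is computed once on the full variable set and only truncated afterwards; with `recursive=True` it is recomputed at every step, which corresponds to the two versions the abstract compares.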

A QSRR Study of Liquid Chromatography Retention Time of

2012

A quantitative structure-retention relationship (QSRR) approach was employed to predict the retention time (RT, in min) of pesticides using five molecular descriptors selected by a genetic algorithm (GA) as the feature selection technique. The data set was then randomly divided into training and prediction sets. The selected descriptors were used as inputs to multiple linear regression (MLR), multilayer perceptron neural network (MLP-NN) and generalized regression neural network (GR-NN) modeling techniques to build QSRR models. Both linear and nonlinear models showed good predictive ability, with the GR-NN model performing better than the MLR and MLP-NN models. For the GR-NN model, the root mean square errors of cross validation for the training and prediction sets were 1.245 and 2.210, and the correlation coefficients (R) were 0.975 and 0.937, respectively, while the leave-one-out cross-validated squared correlation coefficient (Q²_LOO) was 0.951, indicating the reliability of this model. The results indicate that GR-NN could be used as a predictive tool for the RT (min) values of the pesticides under study.
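For readers unfamiliar with the statistics quoted in this abstract, the sketch below shows how a root mean square error and a leave-one-out Q² are commonly computed. It assumes one widely used definition, Q² = 1 − PRESS / SS_tot, and uses a one-descriptor least-squares line as a stand-in model; the abstract does not state which formula or software the authors used, and the numbers here are made up for illustration.

```python
from math import sqrt
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def loo_predictions(xs, ys):
    """Predict each point from a model fitted on all the other points."""
    preds = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(slope * xs[i] + intercept)
    return preds


def rmse(ys, preds):
    """Root mean square error between observed and predicted values."""
    return sqrt(mean((y - p) ** 2 for y, p in zip(ys, preds)))


def q2(ys, preds):
    """Cross-validated Q^2 = 1 - PRESS / SS_tot (assumed definition)."""
    my = mean(ys)
    press = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - press / ss_tot


# Made-up descriptor values and retention times, for illustration only.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
retention = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

cv_preds = loo_predictions(descriptor, retention)
rmse_cv = rmse(retention, cv_preds)
q2_loo = q2(retention, cv_preds)
```

A Q² close to 1 means the held-out predictions explain most of the variance in the observed values; a value near or below 0 means the model predicts no better than the mean.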

A comparison of three liquid chromatography (LC) retention time prediction models

Talanta, 2018

High-resolution mass spectrometry (HRMS) data has revolutionized the identification of environmental contaminants through non-targeted analysis (NTA). However, chemical identification remains challenging due to the vast number of unknown molecular features typically observed in environmental samples. Advanced data processing techniques are required to improve chemical identification workflows. The ideal workflow brings together a variety of data and tools to increase the certainty of identification. One such tool is chromatographic retention time (RT) prediction, which can be used to reduce the number of possible suspect chemicals within an observed RT window. This paper compares the relative predictive ability and applicability to NTA workflows of three RT prediction models: (1) a logP (octanol-water partition coefficient)-based model using EPI Suite™ logP predictions; (2) a commercially available ACD/ChromGenius model; and, (3) a newly developed Quantitative Structure Retention Re...

Performance comparison of partial least squares-related variable selection methods for quantitative structure retention relationships modelling of retention times in reversed-phase liquid chromatography

Journal of Chromatography A, 2015

The relative performance of six multivariate data analysis methods derived from or combined with partial least squares (PLS) has been compared in the context of quantitative structure-retention relationships (QSRR). These methods are genetic algorithm PLS (GA-PLS), Monte Carlo uninformative variable elimination (MC-UVE), competitive adaptive reweighted sampling (CARS), iteratively retaining informative variables (IRIV), variable iterative space shrinkage approach (VISSA) and PLS with automated backward selection of predictors (autoPLS). A set of 825 molecular descriptors was computed for 86 suspected sports doping compounds and used for predicting their gradient retention times in reversed-phase liquid chromatography (RPLC). The correlation between molecular descriptors selected by each technique and the retention time was established using the PLS method. All models derived from a selected subset of descriptors outperformed the reference PLS model derived from all descriptors, wit...

Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling

Journal of Chemical Information and Computer Sciences, 2003

A new classification and regression tool, Random Forest, is introduced and investigated for predicting a compound's quantitative or categorical biological activity based on a quantitative description of the compound's molecular structure. Random Forest is an ensemble of unpruned classification or regression trees created by using bootstrap samples of the training data and random feature selection in tree induction. Prediction is made by aggregating (majority vote or averaging) the predictions of the ensemble. We built predictive models for six cheminformatics data sets. Our analysis demonstrates that Random Forest is a powerful tool capable of delivering performance that is among the most accurate methods to date. We also present three additional features of Random Forest: built-in performance assessment, a measure of relative importance of descriptors, and a measure of compound similarity that is weighted by the relative importance of descriptors. It is the combination of relatively high prediction accuracy and its collection of desired features that makes Random Forest uniquely suited for modeling in cheminformatics.
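The ensemble construction described in this abstract (bootstrap samples, random feature selection, averaging of predictions) can be illustrated with a deliberately simplified sketch. It is not the method of the paper: each "tree" here is a single-split stump rather than an unpruned tree, the candidate features are drawn once per stump, and all names (`fit_stump`, `fit_forest`, `predict_forest`, `n_candidates`) and the toy data are assumptions made for the example.

```python
import random
from statistics import mean


def fit_stump(rows, targets, feature_ids):
    """Fit a one-split regression tree using only the candidate features."""
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for f in feature_ids:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for r, t in zip(rows, targets) if r[f] <= thr]
            right = [t for r, t in zip(rows, targets) if r[f] > thr]
            lm, rm = mean(left), mean(right)
            sse = (sum((t - lm) ** 2 for t in left)
                   + sum((t - rm) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, f, thr, lm, rm)
    if best is None:  # no split possible: always predict the sample mean
        m = mean(targets)
        return (0, float("inf"), m, m)
    return best[1:]


def predict_stump(stump, row):
    f, thr, left_mean, right_mean = stump
    return left_mean if row[f] <= thr else right_mean


def fit_forest(rows, targets, n_trees=50, n_candidates=1, seed=0):
    """Train stumps on bootstrap samples with random candidate features."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]           # bootstrap sample
        feats = rng.sample(range(n_features), n_candidates)  # random features
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Aggregate by averaging the individual predictions (regression)."""
    return mean(predict_stump(s, row) for s in forest)


# Made-up toy data: the target steps from 0 to 10 when descriptor 0
# reaches 5; descriptor 1 carries no information about the target.
rows = [[x, (x * 7) % 3] for x in range(10)]
targets = [0 if x < 5 else 10 for x in range(10)]
forest = fit_forest(rows, targets)
low = predict_forest(forest, [0, 1])
high = predict_forest(forest, [9, 1])
```

For classification, the averaging step would be replaced by a majority vote over the trees, as the abstract notes.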