A Discretization Method Based on Maximizing the Area Under Receiver Operating Characteristic Curve

ESTIMATING THE ROC CURVE AND ITS SIGNIFICANCE FOR CLASSIFICATION MODELS' ASSESSMENT

The article presents the ROC (receiver operating characteristic) curve and its application to the assessment of classification models. The ROC curve, along with the area under the curve (AUC), is frequently used as a diagnostic measure in many fields, including medicine, marketing, finance and technology. In this article, we discuss and compare parametric and non-parametric estimation procedures, since these are constantly being developed, adjusted and extended.

ROC curve estimation: an overview

This work reviews developments in the estimation of the Receiver Operating Characteristic (ROC) curve. Estimation methods in this area are constantly being developed, adjusted and extended, and it is thus impossible to cover all topics and areas of application in a single paper. Here, we focus on some frequentist and Bayesian methods that have mostly been employed in the medical setting. Although we emphasize the medical domain, we also describe links with other fields where related developments have been made, and where some modeling concepts are often known under other designations.
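
As a concrete illustration of the non-parametric approach mentioned above, the sketch below estimates an empirical ROC curve directly from scores of the "with condition" and "without condition" groups. It is only a minimal sketch; the function name and the two small score arrays are hypothetical and serve to show the mechanics, not any specific estimator from the paper.

```python
import numpy as np

def empirical_roc(pos_scores, neg_scores):
    """Non-parametric ROC estimate: sweep every observed score as a threshold
    and record the (false positive rate, true positive rate) pairs."""
    thresholds = np.unique(np.concatenate([pos_scores, neg_scores]))[::-1]
    fpr = [(neg_scores >= t).mean() for t in thresholds]
    tpr = [(pos_scores >= t).mean() for t in thresholds]
    # prepend (0, 0) and append (1, 1) so the curve spans the whole unit square
    return np.array([0.0] + fpr + [1.0]), np.array([0.0] + tpr + [1.0])

# hypothetical score samples for the two groups
pos = np.array([0.9, 0.8, 0.75, 0.6, 0.55])
neg = np.array([0.7, 0.5, 0.4, 0.35, 0.2])
fpr, tpr = empirical_roc(pos, neg)
auc = np.trapz(tpr, fpr)  # trapezoidal AUC over the empirical curve
print(f"empirical AUC = {auc:.3f}")
```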

Improvement of Decision Accuracy Using Discretization of Continuous Attributes

Lecture Notes in Computer Science, 2006

The naïve Bayes classifier has been widely applied to decision making and classification. Because the naïve Bayes classifier is better suited to discrete values, a novel discretization approach is proposed in this paper to improve the naïve Bayes classifier and enhance decision accuracy. Based on the statistical information of the naïve Bayes classifier, a distributional index is defined in the new discretization approach. The distributional index can be used to find a good discretization of continuous attributes so that the naïve Bayes classifier reaches high decision accuracy on instance information systems with continuous attributes. Experimental results on benchmark data sets show that the naïve Bayes classifier with the new discretizer reaches higher accuracy than the C5.0 tree.
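
The abstract above does not define the distributional index itself, so the sketch below only illustrates the general discretize-then-classify pattern it relies on, using a simple equal-width binner in place of the paper's index-driven cut points; the scikit-learn calls and the benchmark data set are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.naive_bayes import CategoricalNB

# a benchmark data set with continuous attributes
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# discretize the continuous attributes; the paper's distributional index would
# replace this simple equal-width strategy when choosing the cut points
disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
X_tr_d = disc.fit_transform(X_tr)
X_te_d = disc.transform(X_te)

# naive Bayes on the discretized (now categorical) attributes;
# min_categories guards against bins unseen in the training split
nb = CategoricalNB(min_categories=5)
nb.fit(X_tr_d, y_tr)
print(f"accuracy with discretized attributes: {nb.score(X_te_d, y_te):.3f}")
```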

Two New Parameters Based on Distances in a Receiver Operating Characteristic Chart for the Selection of Classification Models

Journal of Chemical Information and Modeling, 2011

Traditionally, the techniques used to measure the performance of a discriminant analysis QSAR model are derived from Wilks' lambda or from indices based on the confusion matrix [1–5]. It is well known that choosing a model by Wilks' lambda involves the same problems as choosing the best regression model based only on the determination coefficient R2 [6]. Among the indices based on the confusion matrix, accuracy, sensitivity, specificity, precision, enrichment factor, and the Matthews correlation coefficient should be highlighted [7]. In the field of toxicology, Benigni et al. [8] quite successfully introduced the Receiver Operating Characteristic (ROC) chart, in which the true positive rate (or sensitivity) is plotted against the false positive rate (1 - specificity). This chart has the advantage of comparing simultaneously the different aspects of the performance of several systems or models [8]. It has also been observed that ROC curves visually convey the same information as the confusion matrix in a much more intuitive and robust fashion. The area under the ROC curve (AUC) can be computed directly for any classification model that attaches a probability, such as discriminant analysis [10–14], and is also widely used in many disciplines [15–20]. The AUC is the probability of active compounds being ranked earlier than decoy compounds, and it can take values between 1 (perfect classifier) and 0.5 (useless random classifier). The AUC is not sensitive to early recognition (the ability to recognize positives early in the ranked list). Truchon and Bayly [21] have discussed several methods to address this problem using a parameter named the Boltzmann-Enhanced Discrimination of ROC (BEDROC), based on the Robust Initial Enhancement (RIE) [22], which provides good early recognition of actives. Recently, McGaughey et al. used the Enrichment Factor (EF) [24], the AUC, the RIE and the BEDROC parameters to evaluate different virtual screening (VS) methods. The RIE and BEDROC ...

ABSTRACT: There are several indices that provide an indication of different aspects of the performance of QSAR classification models, with the area under a Receiver Operating Characteristic (ROC) curve still being the most powerful test for assessing such performance overall. All ROC-related parameters can be calculated for both the training and test sets; nevertheless, none of them constitutes an absolute indicator of classification performance by itself. Moreover, one of the biggest drawbacks is the computing time needed to obtain the area under the ROC curve, which naturally slows down any calculation algorithm. The present study proposes two new parameters based on distances in a ROC curve for the selection of classification models with an appropriate balance in both training and test sets, namely the ROC graph Euclidean distance (ROCED) and the ROC graph Euclidean distance corrected with the Fitness Function (FIT(λ)) (ROCFIT). The behavior of these indices was observed in a study of the mutagenicity, for four genotoxicity end points, of a number of nonaromatic halogenated derivatives. It was found that the ROCED parameter achieves a better balance between sensitivity and specificity for both the training and prediction sets than other indices such as the Matthews correlation coefficient, Wilks' lambda, or parameters like the area under the ROC curve. However, when the ROCED parameter was used, the resulting linear discriminant models showed lower statistical significance. The other parameter, ROCFIT, maintains the ROCED capabilities while improving the significance of the models owing to the inclusion of FIT(λ).
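
The interpretation of the AUC quoted above, as the probability that an active compound is ranked ahead of a decoy, has a direct computational counterpart in the Mann-Whitney U statistic. The sketch below illustrates only that identity with made-up scores; it is not an implementation of the ROCED or ROCFIT parameters proposed in the paper.

```python
import numpy as np

def auc_rank(active_scores, decoy_scores):
    """AUC as the probability that a randomly chosen active outscores a randomly
    chosen decoy (ties count 1/2), i.e. the Mann-Whitney U divided by n_a * n_d."""
    wins = 0.0
    for a in active_scores:
        for d in decoy_scores:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(active_scores) * len(decoy_scores))

# hypothetical classification scores
actives = np.array([0.92, 0.81, 0.66, 0.58])
decoys = np.array([0.74, 0.49, 0.33, 0.21])
print(f"AUC = {auc_rank(actives, decoys):.3f}")  # 1.0 = perfect, 0.5 = random
```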

Binary classification using multivariate receiver operating characteristic curve for continuous data

Journal of Biopharmaceutical Statistics, 2015


Discretization as the enabling technique for the Naïve Bayes and semi-Naïve Bayes-based classification

The Knowledge Engineering Review, 2010

Current classification problems that concern data sets of large and increasing size require scalable classification algorithms. In this study, we concentrate on several scalable, linear-complexity classifiers that include one of the top 10 voted data mining methods, Naïve Bayes (NB), and several recently proposed semi-NB classifiers. These algorithms perform front-end discretization of the continuous features since, by design, they work only with nominal or discrete features. We address the lack of studies that investigate the benefits and drawbacks of discretization in the context of the subsequent classification. Our comprehensive empirical study considers 12 discretizers (two unsupervised and 10 supervised), seven classifiers (two classical NB and five semi-NB), and 16 data sets. We investigate the scalability of the discretizers and show that the fastest supervised discretizers, fast class-attribute interdependency maximization (FCAIM) and class-attribute interdependency maximization ...

An introduction to ROC analysis

Receiver operating characteristics (ROC) graphs are useful for organizing classifiers and visualizing their performance. ROC graphs are commonly used in medical decision making, and in recent years have been used increasingly in machine learning and data mining research. Although ROC graphs are apparently simple, there are some common misconceptions and pitfalls when using them in practice. The purpose of this article is to serve as an introduction to ROC graphs and as a guide for using them in research.

A Comparison of the ROC Curve Estimators

The ROC (Receiver Operating Characteristic) curves are frequently used for different diagnostic purposes. There are several different approaches to finding a suitable estimate of the ROC curve in the binormal model. Effective methods that can be used when the sample sizes are small are still in high demand in different applications. In this paper the binormal model is assumed, and parametric, semiparametric and nonparametric estimators are compared in a simulation study. The parametric approach is based on the method of weighted least squares, the semiparametric approach is based on functional modelling, and the nonparametric approach is based on the sample (empirical) cumulative distribution function (cdf).

2. The ROC and ODC curves. The receiver operating characteristic (ROC) curve is used for classification between two groups of subjects. One will be called the group with condition and the other the group without condition. The ROC is defined as a plot of prob...
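
For reference, under the binormal model assumed above the ROC curve has a closed form, ROC(t) = Φ(a + b·Φ⁻¹(t)) with a = (μ1 − μ0)/σ1 and b = σ0/σ1, where the two groups have normal score distributions N(μ0, σ0²) and N(μ1, σ1²). The sketch below evaluates it for assumed parameter values; the numbers are hypothetical and only illustrate the model, not any of the estimators compared in the paper.

```python
import numpy as np
from scipy.stats import norm

def binormal_roc(t, mu0, sigma0, mu1, sigma1):
    """Binormal ROC curve: scores are N(mu0, sigma0^2) without the condition
    and N(mu1, sigma1^2) with the condition; t is the false positive rate."""
    a = (mu1 - mu0) / sigma1
    b = sigma0 / sigma1
    return norm.cdf(a + b * norm.ppf(t))

# hypothetical binormal parameters
t = np.linspace(0.001, 0.999, 200)
roc = binormal_roc(t, mu0=0.0, sigma0=1.0, mu1=1.5, sigma1=1.2)

# the corresponding AUC also has a closed form: Phi(a / sqrt(1 + b^2))
a, b = (1.5 - 0.0) / 1.2, 1.0 / 1.2
print(f"binormal AUC = {norm.cdf(a / np.sqrt(1 + b**2)):.3f}")
```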

Improving Accuracy and Cost of Two-class and Multi-class Probabilistic Classifiers Using ROC Curves

2003

The probability estimates of a naive Bayes classifier are inaccurate if some of its underlying independence assumptions are violated. The decision criterion for using these estimates for classification therefore has to be learned from the data. This paper proposes the use of ROC curves for this purpose. For two classes, the algorithm is a simple adaptation of the algorithm for tracing a ROC curve by sorting the instances according to their predicted probability of being positive. As there is no obvious way to upgrade this algorithm to the multi-class case, we propose a hill-climbing approach which adjusts the weights for each class in a pre-defined order. Experiments on a wide range of datasets show that the proposed method leads to significant improvements over the naive Bayes classifier's accuracy. Finally, we discuss a method to find the global optimum, and show how its computational complexity would make it intractable.
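
As a hedged sketch of the two-class idea described above (not the paper's exact algorithm), the code below traces the ROC ordering by sorting instances on their predicted probability of being positive and then selects the cut-off with the highest accuracy; the toy probabilities, labels and function name are made up for illustration.

```python
import numpy as np

def best_threshold_by_roc(probs, labels):
    """Sort instances by predicted probability (descending), sweep that order as
    in ROC-curve tracing, and return the cut-off with the highest accuracy."""
    order = np.argsort(-probs)
    probs, labels = probs[order], labels[order]
    n_neg = len(labels) - labels.sum()
    best_acc, best_thr = 0.0, 1.0
    tp = fp = 0
    for i in range(len(labels)):
        tp += labels[i]        # instance i is now predicted positive
        fp += 1 - labels[i]
        acc = (tp + (n_neg - fp)) / len(labels)
        if acc > best_acc:
            best_acc, best_thr = acc, probs[i]
    return best_thr, best_acc

# hypothetical naive Bayes probability estimates and true labels
p = np.array([0.95, 0.9, 0.8, 0.7, 0.45, 0.4, 0.3, 0.1])
y = np.array([1, 1, 0, 1, 1, 0, 0, 0])
thr, acc = best_threshold_by_roc(p, y)
print(f"decision threshold = {thr:.2f}, training accuracy = {acc:.3f}")
```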

Unsupervised Discretization: An Analysis of Classification Approaches for Clinical Datasets

Research Journal of Applied Sciences, Engineering and Technology, 2017

Discretization is a frequently used data preprocessing technique for enhancing the performance of data mining tasks in knowledge discovery from clinical data. It is used to transform real-world quantitative data into qualitative data. The aim of this study is to present an experimental analysis of the variation in performance of two basic unsupervised discretization methods with respect to different classification approaches. Equal-width discretization and equal-frequency discretization were applied to four benchmark clinical datasets obtained from the University of California, Irvine, machine learning repository. Both methods were used to transform quantitative attributes into qualitative attributes with three, five, seven and ten intervals. Six classification approaches were evaluated using four evaluation measures. From the results of this experimental analysis, it can be observed that the performance of the classification algorithms varies: classification accuracy depends on the discretization method used and on the number of discretization intervals. Moreover, it can be inferred that different classification approaches require different discretization methods. No single method can be deemed best suited for all applications; hence the choice of an appropriate discretization method depends on data distribution, data interpretability, correlation, classification performance and domain of application.
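
To make the two unsupervised methods concrete, the sketch below implements equal-width and equal-frequency binning with NumPy; the cut-point definitions are the standard ones, while the sample attribute values are hypothetical.

```python
import numpy as np

def equal_width_bins(x, k):
    """Cut the attribute range [min, max] into k intervals of equal width."""
    edges = np.linspace(x.min(), x.max(), k + 1)
    return np.clip(np.digitize(x, edges[1:-1]), 0, k - 1)

def equal_frequency_bins(x, k):
    """Choose cut points at quantiles so each interval holds roughly the same
    number of instances."""
    edges = np.quantile(x, np.linspace(0, 1, k + 1))
    return np.clip(np.digitize(x, edges[1:-1]), 0, k - 1)

# hypothetical clinical attribute (e.g. a lab measurement) discretized into 5 intervals
x = np.array([4.2, 5.1, 5.3, 6.0, 6.8, 7.4, 8.9, 9.5, 11.2, 13.0])
print("equal width:    ", equal_width_bins(x, 5))
print("equal frequency:", equal_frequency_bins(x, 5))
```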