A novel performance metric for building an optimized classifier

Improving Accuracy Metric with Precision and Recall Metrics for Optimizing Stochastic Classifier

All stochastic classifiers attempt to improve their classification performance by constructing an optimized classifier. Typically, stochastic classification algorithms employ the accuracy metric to discriminate between candidate solutions. However, the accuracy metric can steer the search towards sub-optimal solutions because of its limited discriminating power, and it also performs poorly when dealing with imbalanced class distributions. In this study, we propose a new evaluation metric that combines the accuracy metric with extended precision and recall metrics to negate these detrimental effects. We refer to the new evaluation metric as optimized accuracy with recall-precision (OARP). Using a simple counter-example, this paper demonstrates that the OARP metric is more discriminating than the accuracy metric and performs well when dealing with imbalanced class distributions. We also demonstrate empirically that a naïve stochastic classification...

A hybrid evaluation metric for optimizing classifier

2011

The accuracy metric has been widely used for discriminating and selecting an optimal solution when constructing an optimized classifier. However, the accuracy metric steers the search towards sub-optimal solutions because of its limited ability to discriminate between values. In this study, we propose a hybrid evaluation metric that combines the accuracy metric with the precision and recall metrics. We call this new performance metric Optimized Accuracy with Recall-Precision (OARP). This paper demonstrates that the OARP metric is more discriminating than the accuracy metric using two counter-examples. To verify this advantage, we conduct an empirical verification using a statistical discriminative analysis to prove that the OARP metric is statistically more discriminating than the accuracy metric. We also empirically demonstrate that a naive stochastic classification algorithm trained with the OARP metric obtains better predictive results than the same algorithm trained with the conventional accuracy metric. The experiments show that the OARP metric is a better evaluator and optimizer for constructing an optimized classifier.
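Neither abstract reproduces the OARP formula, so the sketch below only illustrates the confusion-matrix quantities it combines (accuracy, precision, recall) and why plain accuracy is a weak discriminator under class imbalance; the OARP weighting itself is left to the cited papers, and the example counts are hypothetical.

```python
# Minimal sketch of the confusion-matrix quantities that OARP builds on
# (accuracy, precision, recall). The exact OARP combination is defined in the
# papers above and is not reproduced here.

def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if (tp + fn) else 0.0

if __name__ == "__main__":
    # Hypothetical imbalanced test set: 10 positives, 990 negatives.
    tp, fp, fn, tn = 5, 20, 5, 970
    print(accuracy(tp, fp, fn, tn))   # 0.975 even though half the positives are missed
    print(precision(tp, fp))          # 0.2
    print(recall(tp, fn))             # 0.5
```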

Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation

Commonly used evaluation measures, including Recall, Precision, F-Measure and Rand Accuracy, are biased and should not be used without a clear understanding of the biases and a corresponding identification of chance or base-case levels of the statistic. Under these measures, a system that performs worse in the objective sense of Informedness can appear to perform better. We discuss several concepts and measures that reflect the probability that a prediction is informed versus chance, Informedness, and introduce Markedness as a dual measure for the probability that a prediction is marked versus chance. Finally, we demonstrate elegant connections between the concepts of Informedness, Markedness, Correlation and Significance, as well as their intuitive relationships with Recall and Precision, and outline the extension from the dichotomous case to the general multi-class case.
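In the dichotomous case these quantities reduce to simple confusion-matrix expressions: Informedness = TPR + TNR - 1 and Markedness = PPV + NPV - 1, with the Matthews correlation satisfying MCC^2 = Informedness x Markedness. A minimal sketch, using hypothetical counts:

```python
# Dichotomous Informedness and Markedness as described in the abstract above,
# computed from a 2x2 confusion matrix (tp, fp, fn, tn).
import math

def informedness(tp, fp, fn, tn):
    tpr = tp / (tp + fn)          # recall / sensitivity
    tnr = tn / (tn + fp)          # inverse recall / specificity
    return tpr + tnr - 1.0

def markedness(tp, fp, fn, tn):
    ppv = tp / (tp + fp)          # precision
    npv = tn / (tn + fn)          # inverse precision
    return ppv + npv - 1.0

def matthews_correlation(tp, fp, fn, tn):
    # Standard MCC; its square equals Informedness * Markedness.
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

print(informedness(5, 20, 5, 970))
print(markedness(5, 20, 5, 970))
print(matthews_correlation(5, 20, 5, 970))
```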

PERFORMANCE ANALYSIS OF LEARNING AND CLASSIFICATION

Different learning and classification algorithms are used to learn patterns and categorize data according to the frequencies of, and relationships between, attributes in a given dataset. The desired result has always been higher accuracy in predicting future values or events from the given dataset. These algorithms are crucial in the field of knowledge discovery and data mining and have been extensively researched and improved for accuracy.

Data mining in metric space: an empirical analysis of supervised learning performance criteria

2004

Many criteria can be used to evaluate the performance of supervised learning. Different criteria are appropriate in different settings, and it is not always clear which criteria to use. A further complication is that learning methods that perform well on one criterion may not perform well on other criteria. For example, SVMs and boosting are designed to optimize accuracy, whereas neural nets typically optimize squared error or cross entropy. We conducted an empirical study using a variety of learning methods (SVMs, neural nets, k-nearest neighbor, bagged and boosted trees, and boosted stumps) to compare nine boolean classification performance metrics: Accuracy, Lift, F-Score, Area under the ROC Curve, Average Precision, Precision/Recall Break-Even Point, Squared Error, Cross Entropy, and Probability Calibration. Multidimensional scaling (MDS) shows that these metrics span a low dimensional manifold. The three metrics that are appropriate when predictions are interpreted as probabilities (squared error, cross entropy, and calibration) lie in one part of metric space, far away from the metrics that depend on the relative order of the predicted values (ROC area, average precision, break-even point, and lift). In between them fall two metrics that depend on comparing predictions to a threshold: accuracy and F-score. As expected, maximum margin methods such as SVMs and boosted trees have excellent performance on metrics like accuracy, but perform poorly on probability metrics such as squared error. What was not expected was that the margin methods have excellent performance on ordering metrics such as ROC area and average precision. We introduce a new metric, SAR, that combines squared error, accuracy, and ROC area into one metric. MDS and correlation analysis show that SAR is centrally located and correlates well with other metrics, suggesting that it is a good general purpose metric to use when more specific criteria are not known.
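The abstract names SAR but does not spell out its formula; the sketch below assumes the commonly cited form SAR = (Accuracy + AUC + (1 - RMSE)) / 3 and uses scikit-learn for the component metrics, so treat the exact combination as an assumption and consult the paper for the definitive definition.

```python
# Hedged sketch of the SAR combination metric, assuming
# SAR = (Accuracy + AUC + (1 - RMSE)) / 3 over probabilistic predictions.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def sar(y_true: np.ndarray, y_prob: np.ndarray, threshold: float = 0.5) -> float:
    y_pred = (y_prob >= threshold).astype(int)
    acc = accuracy_score(y_true, y_pred)
    auc = roc_auc_score(y_true, y_prob)
    rmse = np.sqrt(np.mean((y_true - y_prob) ** 2))
    return (acc + auc + (1.0 - rmse)) / 3.0

# Example with hypothetical probabilistic predictions.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.65, 0.3, 0.2])
print(round(sar(y_true, y_prob), 3))
```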

A Comparative Framework for Evaluating Classification Algorithms

Data mining methods have been widely used for extracting valuable knowledge from large amounts of data, and classification algorithms are among the most popular models. A model is selected with respect to its classification accuracy; therefore, the performance of each classifier plays a crucial role. This paper discusses the application of several classification models to multiple datasets and compares the accuracy of the results. The relationship between dataset characteristics and accuracy is also discussed, and finally, a regression model is introduced for predicting classifier accuracy on a given dataset.

Precision-recall operating characteristic (P-ROC) curves in imprecise environments

18th International Conference on Pattern Recognition (ICPR'06), 2006

Traditionally, machine learning algorithms have been evaluated in applications where assumptions can be reliably made about class priors and/or misclassification costs. In this paper, we consider the case of imprecise environments, where little may be known about these factors and they may well vary significantly when the system is applied. Specifically, the use of precision-recall analysis is investigated and compared to better-known performance measures such as error rate and the receiver operating characteristic (ROC). We argue that while ROC analysis is invariant to variations in class priors, this invariance in fact hides an important factor of the evaluation in imprecise environments. Therefore, we develop a generalised precision-recall analysis methodology in which variation due to prior class probabilities is incorporated into a multi-way analysis of variance (ANOVA). The increased sensitivity and reliability of this approach is demonstrated in a remote sensing application.
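The prior-sensitivity argument can be made concrete with the identity precision = pi*TPR / (pi*TPR + (1-pi)*FPR): a fixed ROC operating point (fixed TPR and FPR) yields very different precision as the positive-class prior pi changes. A minimal sketch, with a hypothetical operating point:

```python
# For a classifier with fixed TPR and FPR (a fixed ROC operating point),
# precision still changes with the positive-class prior pi.
def precision_at_prior(tpr: float, fpr: float, pi: float) -> float:
    return (pi * tpr) / (pi * tpr + (1.0 - pi) * fpr)

tpr, fpr = 0.80, 0.05          # hypothetical operating point
for pi in (0.5, 0.1, 0.01):
    print(f"prior={pi:>4}: precision={precision_at_prior(tpr, fpr, pi):.3f}")
# prior= 0.5: precision=0.941
# prior= 0.1: precision=0.640
# prior=0.01: precision=0.139
```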

Accuracy Measures for the Comparison of Classifiers

2011

The selection of the best classification algorithm for a given dataset is a very widespread problem. It is also a complex one, in the sense that it requires making several important methodological choices. Among them, in this work we focus on the measure used to assess classification performance and rank the algorithms. We present the most popular measures and discuss their properties. Despite the numerous measures proposed over the years, many of them turn out to be equivalent in this specific case, while others can lead to interpretation problems or be unsuitable for our purpose. Consequently, the classic overall success rate or marginal rates should be preferred for this specific task.

Assessing and comparing classification algorithms

2001

Machine learning algorithms induce classifiers that depend on the training set and hyperparameters, and there is a need for statistical testing for (i) assessing the expected error rate of a classifier, and (ii) comparing the expected error rates of two classifiers. We review interval estimation and hypothesis testing and discuss three tests for error rate assessment and four tests for error rate comparison.
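As an illustration of the two tasks distinguished above, the sketch below implements a normal-approximation confidence interval for a single error rate and McNemar's test for comparing two classifiers on the same test set; these are standard procedures in this family, not necessarily the exact tests reviewed in the paper, and the counts are hypothetical.

```python
# (i) Interval estimate for one classifier's error rate; (ii) McNemar's test
# for comparing the error rates of two classifiers on the same test set.
import math
from scipy.stats import chi2

def error_rate_interval(errors: int, n: int, z: float = 1.96):
    """Normal-approximation confidence interval for the error rate."""
    p = errors / n
    half = z * math.sqrt(p * (1.0 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

def mcnemar(n01: int, n10: int) -> float:
    """n01/n10: test cases misclassified by only one of the two classifiers.
    Returns the p-value of the continuity-corrected chi-square statistic."""
    stat = (abs(n01 - n10) - 1.0) ** 2 / (n01 + n10)
    return chi2.sf(stat, df=1)

print(error_rate_interval(errors=30, n=500))   # roughly (0.039, 0.081)
print(mcnemar(n01=18, n10=7))                  # ~0.046: error rates likely differ
```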

Performance Measures in Binary Classification

International Journal of Statistics in Medical Research, 2012

We give a brief overview of common performance measures for binary classification. We cover sensitivity, specificity, positive and negative predictive value, positive and negative likelihood ratio, as well as the ROC curve and AUC.
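For reference, the listed point measures can all be computed from a 2x2 confusion matrix; the sketch below does so for hypothetical counts (the ROC curve and AUC additionally require ranked scores and are omitted here).

```python
# Common binary-classification measures computed from a 2x2 confusion matrix
# (tp, fp, fn, tn): sensitivity, specificity, PPV, NPV, LR+ and LR-.
def binary_measures(tp: int, fp: int, fn: int, tn: int) -> dict:
    sens = tp / (tp + fn)                 # sensitivity / recall
    spec = tn / (tn + fp)                 # specificity
    ppv = tp / (tp + fp)                  # positive predictive value
    npv = tn / (tn + fn)                  # negative predictive value
    return {
        "sensitivity": sens,
        "specificity": spec,
        "PPV": ppv,
        "NPV": npv,
        "LR+": sens / (1.0 - spec),       # positive likelihood ratio
        "LR-": (1.0 - sens) / spec,       # negative likelihood ratio
    }

print(binary_measures(tp=90, fp=30, fn=10, tn=870))
```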