A Random Forest with Minority Condensation and Decision Trees for Class Imbalanced Problems

Enhancing techniques for learning decision trees from imbalanced data

Advances in Data Analysis and Classification, 2019

Several machine learning techniques assume that the number of objects in the considered classes is approximately similar. Nevertheless, in real-world applications, the class of interest to be studied is generally scarce. The data imbalance may still allow high global accuracy with most standard learning algorithms, but it poses a real challenge when considering the minority class accuracy. To deal with this issue, we introduce in this paper a novel adaptation of the decision tree algorithm to imbalanced data situations. A new asymmetric entropy measure is proposed. It adjusts the most uncertain class distribution to the a priori class distribution and involves it in the node-splitting process. Unlike most competitive split criteria, which include only the maximum uncertainty vector in their formula, the proposed entropy is customizable with an adjustable concavity to better comply with the system expectations. The experimental results across thirty-five differently class-imbalanced datasets show significant improvements over various split criteria adapted for imbalanced situations. Furthermore, when combined with sampling strategies and ensemble-based methods, our entropy yields significant enhancements in minority class prediction, along with good handling of the data difficulties related to the class imbalance problem.

Keywords: Asymmetric decision trees • Imbalanced data • Entropy measures • Classification problem • Index of balanced accuracy
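The abstract does not reproduce the entropy itself. As a hedged illustration of the family it belongs to, one well-known asymmetric entropy from the literature (Marcellin et al.) moves the point of maximal uncertainty from 1/2 to a tunable value $w$, typically the minority prior:

$$H_w(p) \;=\; \frac{p\,(1-p)}{(1 - 2w)\,p + w^{2}}, \qquad 0 < w < 1,$$

which reaches its maximum of 1 at $p = w$ and decays asymmetrically on either side. The adjustable concavity described in the abstract is an additional degree of freedom not captured by this basic form.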

Learning Decision Trees for Unbalanced Data

Lecture Notes in Computer Science, 2008

Learning from unbalanced datasets presents a convoluted problem in which traditional learning algorithms may perform poorly. The objective functions used for learning the classifiers typically tend to favor the larger, less important classes in such problems. This paper compares the performance of several popular decision tree splitting criteria (information gain, Gini measure, and DKM) and identifies a new skew-insensitive measure in Hellinger distance. We outline the strengths of Hellinger distance in class imbalance, propose its application in forming decision trees, and perform a comprehensive comparative analysis between each decision tree construction method. In addition, we consider the performance of each tree within a powerful sampling wrapper framework to capture the interaction of the splitting metric and sampling. We evaluate over a wide range of datasets and determine which methods operate best under class imbalance.
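For concreteness, here is a minimal sketch of the Hellinger-distance split value for a candidate binary split on a two-class node, in the spirit of the paper's decision-tree construction; the function name and interface are my own.

```python
import numpy as np

def hellinger_split_value(y_left, y_right, pos_label=1):
    """Hellinger distance between the two class-conditional partition
    distributions induced by a candidate binary split.
    y_left / y_right are the label arrays of the two child nodes."""
    y_left, y_right = np.asarray(y_left), np.asarray(y_right)
    n_pos = np.sum(np.concatenate([y_left, y_right]) == pos_label)
    n_neg = y_left.size + y_right.size - n_pos
    if n_pos == 0 or n_neg == 0:          # pure node: nothing to split
        return 0.0
    d = 0.0
    for part in (y_left, y_right):
        p_pos = np.sum(part == pos_label) / n_pos   # P(branch | +)
        p_neg = np.sum(part != pos_label) / n_neg   # P(branch | -)
        d += (np.sqrt(p_pos) - np.sqrt(p_neg)) ** 2
    return np.sqrt(d)
```

Because only the within-class branch proportions enter the formula, the class priors cancel out, which is what makes the measure skew-insensitive.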

ImbTree: Minority Class Sensitive Weighted Decision Tree for Classification of Unbalanced Data

International Journal of Intelligent Systems and Applications in Engineering, 2021

A reliable and precise tool for medical machine learning is in demand. Diagnosis datasets are mostly unbalanced. To propose an accurate prediction tool for medical data, we need an accurate machine-learning algorithm for unbalanced data classification. In a binary-class unbalanced medical dataset, accurate prediction of the minority class is important. Traditional classifiers are designed to improve accuracy by giving more weight to the majority class. Existing techniques give good results by accurately classifying the majority class; although they misclassify the minority cases, the total accuracy value does not reflect this. When the misclassification cost of the minority class is high, research should focus on reducing the total misclassification cost. This paper presents a new cost-sensitive classification algorithm that classifies unbalanced data accurately without compromising the accuracy of the minority class. Our proposed minority-sensitive decision tree algorithm employs a new splitting criterion called MSplit to ensure accurate prediction of the minority class. The proposed splitting criterion MSplit is derived from the exclusive causes of the minority class. For our experiment, we mainly focused on the breast cancer dataset, considering its importance in women's health. Our proposed model shows good results compared to recent studies of breast cancer detection: it achieves a misclassification cost of 0.074, the lowest among the compared methods. Our model improves performance on other unbalanced medical datasets as well.
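The abstract does not give the MSplit formula, so the sketch below only illustrates the general idea of a cost-sensitive splitting criterion with a stand-in: a Gini impurity computed on cost-weighted class proportions, where minority errors are made more expensive. All names and the cost value are illustrative, not the paper's.

```python
import numpy as np

def cost_weighted_gini(y, cost_minority=5.0, minority_label=1):
    """Gini impurity with the minority class up-weighted by its
    misclassification cost (an illustrative stand-in for MSplit)."""
    y = np.asarray(y)
    w = np.where(y == minority_label, cost_minority, 1.0)
    p_min = w[y == minority_label].sum() / w.sum()  # cost-weighted minority share
    return 2.0 * p_min * (1.0 - p_min)

def split_gain(y_parent, y_left, y_right, **kw):
    """Impurity decrease of a candidate split under the weighted Gini."""
    n, nl = len(y_parent), len(y_left)
    return cost_weighted_gini(y_parent, **kw) - (
        nl / n * cost_weighted_gini(y_left, **kw)
        + (n - nl) / n * cost_weighted_gini(y_right, **kw))
```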

AECID: Asymmetric entropy for classifying imbalanced data

Information Sciences, 2018

In class imbalance problems, it is often more important and expensive to recognize examples from the minority class than from the majority. Standard entropies are known to exhibit poor performance towards the rare class since they take their maximal value for the uniform distribution. To deal with this issue, the present paper introduces a novel adaptation of the decision-tree algorithm to imbalanced data situations. We focus, more specifically, on how to let the split criterion discriminate the minority-class examples in a binary-classification problem. Our algorithm uses a new asymmetric entropy measure, termed AECID, which adjusts the most uncertain class distribution to the prior class distribution and includes it in the evaluation of a node's impurity. Unlike most competitive split criteria, which include only the prior imbalanced class distribution in their formula, the proposed entropy is customizable with an adjustable concavity to take into account the specificities of each dataset and to better comply with users' requirements. Extensive experiments were conducted on thirty-six real-life imbalanced datasets to assess the effectiveness of the proposed approach. Furthermore, the comparative results show that the new proposal outperforms various algorithmic, data-level, and ensemble approaches that have already been proposed for imbalanced learning.
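A minimal sketch of how such a measure plugs into node evaluation, using the Marcellin-style asymmetric entropy quoted after the first abstract as a stand-in for AECID (whose exact form and concavity parameter are not reproduced here); the anchor w is set once from the global minority prior.

```python
import numpy as np

def asymmetric_entropy(p, w):
    """Asymmetric entropy peaking at p == w (Marcellin-style form;
    AECID adds an adjustable concavity not reproduced in the abstract)."""
    return p * (1.0 - p) / ((1.0 - 2.0 * w) * p + w * w)

def node_impurity(y, prior_w, minority_label=1):
    """Impurity of a node: asymmetric entropy of the node's minority
    proportion, with uncertainty maximal at the global prior prior_w."""
    p = np.mean(np.asarray(y) == minority_label)
    return asymmetric_entropy(p, prior_w)

# usage sketch: the prior is computed once on the full training labels
# prior_w = np.mean(y_train == 1)
# impurity = node_impurity(y_node, prior_w)
```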

Biased Random Forest For Dealing With the Class Imbalance Problem

IEEE Transactions on Neural Networks and Learning Systems, 2018

The class imbalance issue has been a persistent problem in machine learning that hinders the accurate predictive analysis of data in many real-world applications. The class imbalance problem exists when the number of instances present in a class (or classes) is significantly fewer than the number of instances belonging to another class (or classes). Sufficiently recognising the minority class during classification is a problem, as most algorithms employed to learn from data are biased towards the majority class. The underlying issue is made more complex by the presence of data difficulty factors embedded in such data. This paper presents a novel and effective ensemble-based method for dealing with the class imbalance problem. This study is motivated by the idea of moving oversampling from the data level to the algorithm level: instead of increasing the minority instances in the datasets, the algorithm in this paper aims to "oversample the classification ensemble" by increasing the number of classifiers that represent the minority class in the ensemble, i.e., the random forest. The proposed Biased Random Forest (BRAF) algorithm employs the nearest-neighbour algorithm to identify the critical areas in a given dataset. The standard random forest is then fed with more random trees generated based on the critical areas. The results show that the proposed algorithm is very effective in dealing with the class imbalance problem.
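A compact sketch of the two stages as described: k-nearest-neighbour identification of the critical area (minority instances plus their nearest majority neighbours), then a standard forest augmented with trees grown only on that area. Parameter names and the 50/50 tree allocation are illustrative, not the paper's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors

def braf_fit(X, y, size=100, p_ratio=0.5, k=10, minority_label=1):
    """Sketch of Biased Random Forest: a standard forest plus extra trees
    trained only on the 'critical area' around the minority class."""
    X, y = np.asarray(X), np.asarray(y)
    min_mask = y == minority_label
    # critical area = minority instances + their k nearest majority neighbours
    nn = NearestNeighbors(n_neighbors=k).fit(X[~min_mask])
    _, idx = nn.kneighbors(X[min_mask])
    maj_idx = np.unique(idx.ravel())
    Xc = np.vstack([X[min_mask], X[~min_mask][maj_idx]])
    yc = np.concatenate([y[min_mask], y[~min_mask][maj_idx]])
    rf_main = RandomForestClassifier(n_estimators=int(size * (1 - p_ratio))).fit(X, y)
    rf_crit = RandomForestClassifier(n_estimators=int(size * p_ratio)).fit(Xc, yc)
    return rf_main, rf_crit

def braf_predict_proba(forests, X):
    """Average the two forests' probability estimates."""
    return np.mean([f.predict_proba(X) for f in forests], axis=0)
```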

A Survey on Applicability of Decision Trees on Class Imbalance Learning

Immense volumes of data are populated into repositories from various applications. In order to find desired information and knowledge in large datasets, data mining techniques are very helpful. Classification is one of the knowledge discovery techniques. In classification, decision trees are very popular in the research community due to their simplicity and easy comprehensibility. This paper presents an updated review of recent developments in the field of decision trees.

Oblique Decision Tree Algorithm with Minority Condensation for Class Imbalanced Problem

Engineering Journal, 2020

In recent years, a significant issue in classification has been handling datasets containing an imbalanced number of instances in each class. Classifier modification is one of the well-known techniques to deal with this particular issue. In this paper, an effective classification model based on an oblique decision tree is enhanced to work with an imbalanced dataset; the result is called the oblique minority condensed decision tree (OMCT). Initially, it selects the best axis-parallel hyperplane based on the decision tree algorithm using the minority entropy of instances within the minority inner fence selection. Then it perturbs this hyperplane along each axis to improve its minority entropy. Finally, it stochastically perturbs this hyperplane to escape the local solution. From the experimental results, OMCT significantly outperforms six state-of-the-art decision tree algorithms, namely CART, C4.5, OC1, AE, DCSM, and ME, on 18 real-world datasets from UCI in terms of precision, recall, and F1 score. Moreover, the size of a decision tree from OMCT is significantly smaller than the others.
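A rough sketch of the hyperplane-refinement loop the abstract describes: a deterministic per-axis perturbation followed by stochastic perturbation to escape local optima. The paper's minority entropy within the inner fence is not reproduced; any impurity function can be passed in, and all names and step sizes are illustrative.

```python
import numpy as np

def split_impurity(X, y, w, b, impurity):
    """Weighted impurity of the split induced by the hyperplane w.x + b >= 0."""
    side = X @ w + b >= 0
    if side.all() or not side.any():             # degenerate split
        return impurity(y)
    n = len(y)
    return (side.sum() / n) * impurity(y[side]) + \
           ((~side).sum() / n) * impurity(y[~side])

def perturb_hyperplane(X, y, w, b, impurity, step=0.1, n_random=20, rng=None):
    """Coordinate perturbation of an oblique split, then random jumps
    to escape local optima (a sketch of the search the abstract outlines)."""
    rng = rng or np.random.default_rng(0)
    best = split_impurity(X, y, w, b, impurity)
    for j in range(len(w)):                      # deterministic axis sweep
        for delta in (step, -step):
            w2 = w.copy(); w2[j] += delta
            val = split_impurity(X, y, w2, b, impurity)
            if val < best:
                w, best = w2, val
    for _ in range(n_random):                    # stochastic escape
        w2 = w + rng.normal(scale=step, size=w.shape)
        val = split_impurity(X, y, w2, b, impurity)
        if val < best:
            w, best = w2, val
    return w, b, best
```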

Cost-sensitive decision tree ensembles for effective imbalanced classification

Applied Soft Computing, 2014

Real-life datasets are often imbalanced, that is, there are significantly more training samples available for some classes than for others, and consequently the conventional aim of maximising overall classification accuracy is not appropriate when dealing with such problems. Various approaches have been introduced in the literature to deal with imbalanced datasets, and are typically based on oversampling, undersampling or cost-sensitive classification. In this paper, we introduce an effective ensemble of cost-sensitive decision trees for imbalanced classification. Base classifiers are constructed according to a given cost matrix, but are trained on random feature subspaces to ensure sufficient diversity of the ensemble members. We employ an evolutionary algorithm for simultaneous classifier selection and assignment of committee member weights for the fusion process. Our proposed algorithm is evaluated on a variety of benchmark datasets, and is confirmed to lead to improved recognition of the minority class, to be capable of outperforming other state-of-the-art algorithms, and hence to represent a useful and effective approach for dealing with imbalanced datasets.
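A minimal sketch of the base-classifier construction described here: cost-sensitive trees (via a class-weight cost matrix) trained on random feature subspaces. The evolutionary member selection and weight tuning are omitted; uniform vote weights stand in for them, and all parameter values are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_subspace_ensemble(X, y, n_trees=50, subspace_frac=0.5,
                          cost=None, rng=None):
    """Cost-sensitive trees on random feature subspaces (the paper's
    evolutionary member selection/weighting is omitted from this sketch)."""
    rng = rng or np.random.default_rng(0)
    cost = cost or {0: 1.0, 1: 5.0}              # illustrative cost matrix
    d = X.shape[1]
    k = max(1, int(subspace_frac * d))
    ensemble = []
    for _ in range(n_trees):
        feats = rng.choice(d, size=k, replace=False)
        tree = DecisionTreeClassifier(class_weight=cost).fit(X[:, feats], y)
        ensemble.append((feats, tree))
    return ensemble

def ensemble_predict(ensemble, X, weights=None):
    """Weighted majority vote over 0/1 labels; uniform weights stand in
    for the evolutionarily tuned committee weights."""
    votes = np.array([t.predict(X[:, f]) for f, t in ensemble], dtype=float)
    w = np.ones(len(ensemble)) if weights is None else np.asarray(weights)
    return (np.average(votes, axis=0, weights=w) >= 0.5).astype(int)
```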

Classification of an Imbalanced Data Set Using Decision Tree Algorithms

U.P.B. Scientific Bulletin, Series C, 2017

Machine learning algorithms have recently become very popular for different tasks involving data analysis, classification or prediction. They can provide valuable knowledge for very large sets of data and can reach very good accuracy. However, most algorithms are sensitive to the nature of the datasets, as well as to different calibrations, which can lead to large differences in performance, accuracy, or false positives. In this paper, a classification solution for imbalanced datasets containing information about defects of various trees is presented. The experimental results present a comparison that evaluates the classification performance of the Decision Tree, Random Forest, and Extremely Randomized Trees classifiers. The measures used in the comparison take into account weighted accuracy, precision, and recall for binary and multi-class classification.
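A short sketch of the kind of comparison the paper reports, using scikit-learn's implementations of the three classifiers with imbalance-aware metrics (balanced accuracy stands in for the paper's weighted accuracy; the split and seeds are illustrative).

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def compare_classifiers(X, y):
    """Evaluate the three tree-based classifiers from the study with
    imbalance-aware metrics on a stratified hold-out split."""
    Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)
    models = {
        "DecisionTree": DecisionTreeClassifier(random_state=0),
        "RandomForest": RandomForestClassifier(random_state=0),
        "ExtraTrees": ExtraTreesClassifier(random_state=0),
    }
    for name, model in models.items():
        pred = model.fit(Xtr, ytr).predict(Xte)
        print(name,
              "bal_acc=%.3f" % balanced_accuracy_score(yte, pred),
              "precision=%.3f" % precision_score(yte, pred, average="weighted"),
              "recall=%.3f" % recall_score(yte, pred, average="weighted"))
```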

A Study with Class Imbalance and Random Sampling for a Decision Tree Learning System

2008

Sampling methods are a direct approach to tackling the problem of class imbalance. These methods sample a dataset in order to alter the class distributions, usually to obtain a more balanced distribution. An open question about sampling methods is which class distribution provides the best results, if any. In this work we develop a broad empirical study aiming to provide more insight into this question. Our results suggest that altering the class distribution can improve the classification performance of classifiers when AUC is used as the performance metric. Furthermore, as a general recommendation, random over-sampling to balance the distribution is a good starting point for dealing with class imbalance.
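A minimal sketch of the recommended starting point: random over-sampling to a balanced distribution, evaluated by AUC. Function names are mine, and duplication with replacement is the simplest variant of the technique.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

def random_oversample(X, y, rng=None):
    """Duplicate minority examples at random until the classes are balanced."""
    rng = rng or np.random.default_rng(0)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    need = counts.max() - counts.min()
    extra = rng.choice(np.flatnonzero(y == minority), size=need, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

# usage: measure the effect on AUC with a decision tree
# Xb, yb = random_oversample(X_train, y_train)
# clf = DecisionTreeClassifier(random_state=0).fit(Xb, yb)
# print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```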