Tackling Class Imbalance with Deep Convolutional Neural Networks

Cost-Sensitive Learning of Deep Feature Representations From Imbalanced Data

IEEE Transactions on Neural Networks and Learning Systems, 2017

Class imbalance is a common problem in the case of real-world object detection and classification tasks. Data of some classes are abundant, making them an overrepresented majority, and data of other classes are scarce, making them an underrepresented minority. This imbalance makes it challenging for a classifier to appropriately learn the discriminating boundaries of the majority and minority classes. In this paper, we propose a cost-sensitive (CoSen) deep neural network, which can automatically learn robust feature representations for both the majority and minority classes. During training, our learning procedure jointly optimizes the class-dependent costs and the neural network parameters. The proposed approach is applicable to both binary and multiclass problems without any modification. Moreover, as opposed to data-level approaches, we do not alter the original data distribution, which results in a lower computational cost during the training process. We report the results of ou...
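
The abstract does not reproduce the exact cost-sensitive formulation, so the sketch below only illustrates the general idea of class-dependent costs in a cross-entropy loss, assuming PyTorch; it uses fixed inverse-frequency costs, whereas the paper learns the costs jointly with the network parameters. The class counts are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical class counts: one majority class and two minority classes.
class_counts = torch.tensor([900.0, 50.0, 50.0])
# Fixed inverse-frequency costs; the paper instead learns class-dependent costs
# jointly with the network parameters, which this sketch does not reproduce.
class_costs = class_counts.sum() / (len(class_counts) * class_counts)

criterion = nn.CrossEntropyLoss(weight=class_costs)  # cost-sensitive cross-entropy

logits = torch.randn(8, 3, requires_grad=True)        # stand-in for network outputs
targets = torch.randint(0, 3, (8,))
loss = criterion(logits, targets)                     # minority-class errors are penalised more
loss.backward()
```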

OVERVIEW ON STATE-OF-THE-ART DEEP LEARNING-BASED MODELS FOR IMBALANCE CLASSIFICATION

IJARW, 2024

In recent years, the problem of data imbalance has become a major challenge, significantly impacting the process of mining information from data. It often arises when some classes have significantly more samples than others. With the development of deep learning, there have been significant advances in representing and understanding information from images. However, when applying deep learning to practical image recognition tasks, the problem of the "deep long tail" becomes apparent. Training models to face even rare cases helps create robust and flexible models that can adapt well to real-world data fluctuations. This paper aims to comprehensively analyze the long-tail problem in image recognition, summarize the highlights and limitations of previous methods, and provide a view on future research directions.

Sensitivity of Modern Deep Learning Neural Networks to Unbalanced Datasets in Multiclass Classification Problems

One of the critical problems in multiclass classification tasks is dataset imbalance. This is especially true when using contemporary pre-trained neural networks, where, in effect, only the last layers of the network are retrained. Large datasets with highly unbalanced classes are therefore poorly suited for training such models, since their use leads to overfitting and, accordingly, poor metrics on test and validation datasets. In this paper, the sensitivity to dataset imbalance of Xception, ViT-384, ViT-224, VGG19, ResNet34, ResNet50, ResNet101, Inception_v3, DenseNet201, DenseNet161, and DeIT was studied using a highly imbalanced dataset of 20,971 images sorted into 7 classes. It is shown that the best metrics were obtained when using a cropped dataset with augmentation of missing images in classes up to 15% of the initial number. As a result, the metrics can be increased by 2-6% compared to the metrics of the models on the initial unbalanced dataset. Moreover, the met...
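
As a rough illustration of the augmentation strategy described above (the exact cropping and augmentation procedure is not given in the abstract), the following sketch tops up an under-represented class with augmented copies until a target count is reached, assuming torchvision; the transform choices and the helper function are hypothetical.

```python
import random
from PIL import Image
from torchvision import transforms

# Illustrative augmentation pipeline; the transforms used in the paper are not specified.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2),
])

def top_up_class(image_paths, target_size):
    """Create augmented copies of a minority class until it reaches target_size images."""
    extra = []
    while len(image_paths) + len(extra) < target_size:
        src = Image.open(random.choice(image_paths)).convert("RGB")
        extra.append(augment(src))  # augmented PIL image; save to disk in a real pipeline
    return extra
```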

Dynamically Weighted Balanced Loss: Class Imbalanced Learning and Confidence Calibration of Deep Neural Networks

IEEE Transactions on Neural Networks and Learning Systems

Imbalanced class distribution is an inherent problem in many real-world classification tasks where the minority class is the class of interest. Many conventional statistical and machine learning classification algorithms are subject to frequency bias, and learning discriminating boundaries between the minority and majority classes can be challenging. To address class distribution imbalance in deep learning, we propose a class re-balancing strategy based on a class-balanced dynamically weighted loss function, where weights are assigned based on the class frequency and the predicted probability of the ground-truth class. The ability of the dynamic weighting scheme to self-adapt its weights depending on the prediction scores allows the model to adjust for instances with varying levels of difficulty, resulting in gradient updates driven by hard minority-class samples. We further show that the proposed loss function is classification calibrated. Experiments conducted on highly imbalanced data across different applications of cyber intrusion detection (CICIDS2017 dataset) and medical imaging (ISIC2019 dataset) show robust generalization. Theoretical results supported by superior empirical performance provide justification for the validity of the proposed Dynamically Weighted Balanced (DWB) loss function.
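
The abstract does not spell out the exact DWB formula; the sketch below only captures the stated ingredients, a class-frequency weight combined with a weight driven by the predicted probability of the ground-truth class, assuming PyTorch. The inverse-frequency term and the gamma exponent are illustrative choices, not necessarily those of the paper.

```python
import torch
import torch.nn.functional as F

def dynamically_weighted_loss(logits, targets, class_counts, gamma=1.0):
    """Cross-entropy scaled by an inverse-class-frequency term and by (1 - p_t)^gamma."""
    probs = F.softmax(logits, dim=1)
    p_t = probs.gather(1, targets.unsqueeze(1)).squeeze(1)   # probability of the true class
    freq_w = (class_counts.sum() / class_counts)[targets]    # rarer class -> larger weight
    dyn_w = (1.0 - p_t).pow(gamma)                           # harder sample -> larger weight
    ce = F.cross_entropy(logits, targets, reduction="none")
    return (freq_w * dyn_w * ce).mean()
```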

Effective Class-Imbalance learning based on SMOTE and Convolutional Neural Networks

arXiv, 2022

Imbalanced Data (ID) is a problem that prevents Machine Learning (ML) models from achieving satisfactory results. ID arises when the number of samples belonging to one class outnumbers that of the other by a wide margin, making such models' learning process biased towards the majority class. In recent years, to address this issue, several solutions have been put forward, which opt for either synthetically generating new data for the minority class or reducing the number of majority-class samples to balance the data. Hence, in this paper, we investigate the effectiveness of methods based on Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs), combined with a variety of well-known imbalanced-data solutions, namely oversampling and undersampling. To evaluate our methods, we used the KEEL, breast cancer, and Z-Alizadeh Sani datasets. To achieve reliable results, we conducted our experiments 100 times with randomly shuffled data distributions. The classification results demonstrate that the mixed Synthetic Minority Oversampling Technique (SMOTE)-Normalization-CNN approach outperforms the other methodologies, achieving 99.08% accuracy on the 24 imbalanced datasets. Therefore, the proposed mixed model can be applied to imbalanced binary classification problems on other real datasets.
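
A minimal sketch of the SMOTE-then-normalization preprocessing step described above, assuming scikit-learn and imbalanced-learn and using the scikit-learn breast-cancer data as a stand-in for the paper's datasets; the CNN architecture itself is omitted, and the scaler choice is an assumption.

```python
# Oversample the training split with SMOTE, then normalize, before handing the
# data to a CNN/DNN classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_tr, y_tr = SMOTE(random_state=0).fit_resample(X_tr, y_tr)  # balance only the training set
scaler = MinMaxScaler().fit(X_tr)                            # fit normalization on training data
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
# X_tr / y_tr can now be fed to a CNN or DNN classifier.
```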

Comparative Performance of Deep Learning and Machine Learning Algorithms on Imbalanced Handwritten Data

International Journal of Advanced Computer Science and Applications, 2018

Imbalanced data is one of the challenges in classification tasks in machine learning. Data disparity produces a biased output of a model regardless of how recent the technology is. However, deep learning algorithms, such as deep belief networks, have shown promising results in many domains, especially in image processing. Therefore, in this paper, we review the effect of imbalanced class distributions using deep belief networks as the benchmark model and compare them with conventional machine learning algorithms, such as backpropagation neural networks, decision trees, naïve Bayes, and support vector machines, on the MNIST handwritten dataset. The experiment shows that although the algorithm is stable and suitable for multiple domains, the imbalanced data distribution still manages to affect the outcome of the conventional machine learning algorithms.

Tackling the Imbalance for GNNs

arXiv, 2021

Different from deep neural networks for non-graph data classification, graph neural networks (GNNs) leverage the information exchange between nodes (or samples) when representing nodes. The category distribution shows an imbalanced or even highly skewed trend on nearly all existing benchmark GNN datasets. The imbalanced distribution will cause misclassification of nodes in the minority classes, and can even cause the classification performance on the entire dataset to decrease. This study explores the effects of the imbalance problem on the performance of GNNs and proposes new methodologies to solve it. First, a node-level index, namely the label difference index (LDI), is defined to quantitatively analyze the relationship between imbalance and misclassification. The fewer samples in a class, the higher the value of its average LDI; the higher the LDI of a sample, the more likely the sample will be misclassified. We define a new loss and propose four new methods based on LDI. Expe...

Improving Model Accuracy for Imbalanced Image Classification Tasks by Adding a Final Batch Normalization Layer: An Empirical Study

2021

Some real-world domains, such as Agriculture and Healthcare, comprise early-stage disease indications whose recording constitutes a rare event, and yet, whose precise detection at that stage is critical. In this type of highly imbalanced classification problem, which encompasses complex features, deep learning (DL) is much needed because of its strong detection capabilities. At the same time, DL is observed in practice to favor majority over minority classes and consequently to suffer from inaccurate detection of the targeted early-stage indications. To simulate such scenarios, we artificially generate skewness (99% vs. 1%) for certain plant types out of the PlantVillage dataset as a basis for classification of scarce visual cues through transfer learning. By randomly and unevenly picking healthy and unhealthy samples from certain plant types to form a training set, we take as a base experiment the fine-tuning of ResNet34 and VGG19 architectures, and then test the model performance on a balanced dataset of healthy and unhealthy images. We empirically observe that the initial F1 test score jumps from 0.29 to 0.95 for the minority class upon adding a final Batch Normalization (BN) layer just before the output layer in VGG19. We demonstrate that utilizing an additional BN layer before the output layer in modern CNN architectures has a considerable impact in terms of minimizing the training time and testing error for minority classes in highly imbalanced datasets. Moreover, when the final BN is employed, minimizing the loss function may not be the best way to assure a high F1 test score for minority classes in such problems. That is, the network might perform better even if it is not 'confident' enough while making a prediction, leading to another discussion about why softmax output is not a good uncertainty measure for DL models. We also report on the corroboration of these findings on the ISIC Skin Cancer as well as the Wall Crack datasets.
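
A minimal sketch of the architectural change described above, assuming PyTorch and torchvision's VGG19 implementation; the two-class setting and the exact placement within torchvision's classifier head are illustrative assumptions, not details taken from the paper.

```python
import torch.nn as nn
from torchvision import models

num_classes = 2  # hypothetical binary healthy/unhealthy setting

# Load a pre-trained VGG19 (torchvision >= 0.13 syntax; older versions use pretrained=True).
model = models.vgg19(weights="IMAGENET1K_V1")

# Insert a BatchNorm layer immediately before the final classification layer.
in_features = model.classifier[6].in_features  # 4096 in torchvision's VGG19 head
model.classifier[6] = nn.Sequential(
    nn.BatchNorm1d(in_features),               # final BN just before the output layer
    nn.Linear(in_features, num_classes),
)
```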

Convolutional Neural Networks and Deep Belief Networks for Analysing Imbalanced Class Issue in Handwritten Dataset

International Journal on Advanced Science, Engineering and Information Technology, 2017

Class imbalance is one of the challenges in classifying big data. Data disparity produces a biased output of a model regardless of how recent the technology is. However, deep learning algorithms such as convolutional neural networks and deep belief networks have proven to provide promising results in many research domains, especially in image processing as well as time series forecasting, intrusion detection, and classification. Therefore, this paper investigates the effect of imbalanced class distributions in the MNIST handwritten dataset using convolutional neural networks and deep belief networks. Based on the experiments conducted, the results show that although these algorithms are suitable for multiple domains and have shown stability, an imbalanced distribution of data is still able to affect the overall performance of the models.

Detection and Mitigation of Rare Subclasses in Deep Neural Network Classifiers

2021 IEEE International Conference on Artificial Intelligence Testing (AITest)

Regions of high-dimensional input spaces that are underrepresented in training datasets reduce machine-learnt classifier performance, and may lead to corner cases and unwanted bias for classifiers used in decision-making systems. When these regions belong to otherwise well-represented classes, their presence and negative impact are very hard to identify. We propose an approach for the detection and mitigation of such rare subclasses in deep neural network classifiers. The new approach is underpinned by an easy-to-compute commonality metric that supports the detection of rare subclasses, and comprises methods for reducing the impact of these subclasses during both model training and model exploitation. We demonstrate our approach using two well-known datasets, MNIST's handwritten digits and Kaggle's cats/dogs, identifying rare subclasses and producing models that compensate for subclass rarity. In addition, we demonstrate how our run-time approach increases the ability of users to identify samples likely to be misclassified at run time.