Mass Ratio Variance Majority Undersampling and Minority Oversampling Technique for Class Imbalance

MWMOTE--Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning

IEEE Transactions on Knowledge and Data Engineering, 2014

Imbalanced learning problems contain an unequal distribution of data samples among different classes and pose a challenge to any classifier, as it becomes hard to learn the minority class samples. Synthetic oversampling methods address this problem by generating synthetic minority class samples to balance the distribution between the samples of the majority and minority classes. This paper identifies that most of the existing oversampling methods may generate wrong synthetic minority samples in some scenarios, making the learning task harder. To this end, a new method, called the Majority Weighted Minority Oversampling TEchnique (MWMOTE), is presented for efficiently handling imbalanced learning problems. MWMOTE first identifies the hard-to-learn informative minority class samples and assigns them weights according to their Euclidean distance from the nearest majority class samples. It then generates the synthetic samples from the weighted informative minority class samples using a clustering approach, in such a way that all the generated samples lie inside some minority class cluster. MWMOTE has been evaluated extensively on four artificial and twenty real-world data sets. The simulation results show that our method is better than or comparable with some other existing methods in terms of various assessment metrics, such as the geometric mean (G-mean) and the area under the receiver operating characteristic (ROC) curve, usually known as the area under the curve (AUC).
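
The abstract names three steps: find the hard-to-learn minority samples, weight them by their distance to the majority class, and synthesize inside minority clusters. Below is a minimal Python sketch of that flow, not the authors' reference implementation; the inverse-distance weighting, the use of k-means, and all parameter defaults are my assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def mwmote_sketch(X_min, X_maj, n_synth, n_clusters=3, seed=0):
    """Toy MWMOTE-style oversampler: weighted seeds, in-cluster interpolation."""
    rng = np.random.default_rng(seed)

    # Weight each minority sample by closeness to the majority class:
    # borderline samples are harder to learn and receive larger weights.
    d_maj, _ = NearestNeighbors(n_neighbors=1).fit(X_maj).kneighbors(X_min)
    w = 1.0 / (d_maj.ravel() + 1e-12)
    p = w / w.sum()

    # Cluster the minority class so every synthetic point stays inside
    # some minority cluster, as the abstract requires.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X_min)

    synth = []
    for _ in range(n_synth):
        i = rng.choice(len(X_min), p=p)                   # weighted seed pick
        j = rng.choice(np.where(labels == labels[i])[0])  # partner in same cluster
        lam = rng.random()
        synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synth)
```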

SMOUTE: Synthetic Minority Over-sampling and Under-sampling TEchniques for the class imbalance problem

Proceedings of the Annual International Conference on Computer Science Education: Innovation & Technology CSEIT 2010 & Proceedings of the Annual International Conference on Software Engineering SE 2010, 2010

SMOTE is an over-sampling technique for handling the class imbalance problem. It improves the precision of minority class prediction by generating more minority class instances near the existing ones. Nevertheless, the large number of synthesized minority class instances may outweigh the majority class instances. In this paper, we introduce a mixture of over-sampling by SMOTE and under-sampling by reduction around centroids. Our algorithm, the Synthetic Minority Over-sampling and Under-sampling TEchnique, called SMOUTE, avoids synthesizing a large number of minority class instances while balancing both classes. We perform experiments based on three classifiers: C4.5, Naïve Bayes, and the multilayer perceptron. Our results show that classifiers using SMOUTE classify the minority class more correctly than those using SMOTE. Moreover, SMOUTE is much faster than SMOTE for large datasets.
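
As a rough illustration of the two halves the abstract describes, the sketch below pairs SMOTE-style interpolation for the minority class with "reduction around centroids" approximated by keeping k-means centroids of the majority class. This is my reading of the abstract, not the published algorithm; the function names and defaults are invented.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, n_synth, k=5, seed=0):
    """Classic SMOTE interpolation between a minority point and a neighbour."""
    rng = np.random.default_rng(seed)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X_min).kneighbors(X_min)
    seeds = rng.integers(0, len(X_min), n_synth)
    partners = idx[seeds, rng.integers(1, k + 1, n_synth)]  # column 0 is self
    lam = rng.random((n_synth, 1))
    return X_min[seeds] + lam * (X_min[partners] - X_min[seeds])

def centroid_undersample(X_maj, n_keep, seed=0):
    """'Reduction around centroids': keep n_keep k-means centroids of the majority."""
    return KMeans(n_clusters=n_keep, n_init=10,
                  random_state=seed).fit(X_maj).cluster_centers_
```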

A novel imbalanced data classification approach using both under and over sampling

Bulletin of Electrical Engineering and Informatics, 2021

The performance of data classification suffers when the data distribution is imbalanced: classifiers tend to be biased toward the majority class, which contains most of the instances. One popular approach is to balance the dataset using over- and under-sampling methods. This paper presents a novel pre-processing technique that performs both over-sampling and under-sampling on an imbalanced dataset. The proposed method uses the SMOTE algorithm to increase the minority class, and a cluster-based approach is then performed to decrease the majority class, taking into consideration the new size of the minority class. Experimental results on 10 imbalanced datasets show that the suggested algorithm performs better than previous approaches.
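
One way to approximate this two-step recipe with the off-the-shelf imbalanced-learn library is shown below. The composition is mine, not the paper's implementation, and the 0.5 oversampling ratio is an arbitrary placeholder.

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import ClusterCentroids

# A skewed toy dataset (90% majority / 10% minority).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Step 1: oversample the minority class to half the majority size ...
X_mid, y_mid = SMOTE(sampling_strategy=0.5, random_state=0).fit_resample(X, y)
# Step 2: ... then replace majority samples with k-means centroids until
# both classes have the same size as the enlarged minority class.
X_res, y_res = ClusterCentroids(random_state=0).fit_resample(X_mid, y_mid)
```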

Cluster-Based Minority Over-Sampling for Imbalanced Datasets

IEICE Transactions on Information and Systems, 2016

Synthetic over-sampling is a well-known method of solving class imbalance by modifying the class distribution and generating synthetic samples. A large number of synthetic over-sampling techniques have been proposed; however, most of them suffer from the over-generalization problem, whereby synthetic minority class samples are generated inside the majority class region. Learning from an over-generalized dataset, a classifier could misclassify a majority class member as belonging to a minority class. In this paper, a method called TRIM is proposed to overcome the over-generalization problem. The idea is to identify minority class regions that compromise between generalization and overfitting. TRIM identifies all the minority class regions in the form of clusters, then merges a large number of small minority class clusters into more generalized clusters. To enhance the generalization ability, a cluster connection step is proposed that avoids over-generalization toward the majority class while increasing generalization of the minority class. As a result, the classifier is able to correctly classify more minority class samples while maintaining its precision. Experimental results show that, compared with SMOTE and extended versions such as Borderline-SMOTE, TRIM exhibits significant performance improvement in terms of F-measure and AUC. TRIM can also be used as a preprocessing step for synthetic over-sampling methods such as SMOTE and its extended versions.
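
A simplified take on the cluster-merge idea, with assumptions of my own: start from many small minority clusters and merge two clusters only if the region between their centroids is not dominated by majority samples, which limits over-generalization toward the majority class. The purity test along the connecting segment is a stand-in for TRIM's actual criterion.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def merge_minority_clusters(X_min, X_maj, n_start=8, purity=0.8):
    """Greedily merge minority clusters while the region between them stays pure."""
    centroids = list(KMeans(n_clusters=n_start, n_init=10, random_state=0)
                     .fit(X_min).cluster_centers_)
    nn_min = NearestNeighbors(n_neighbors=1).fit(X_min)
    nn_maj = NearestNeighbors(n_neighbors=1).fit(X_maj)

    def segment_is_pure(a, b, n_probe=10):
        # Probe the segment between two centroids and require most probe points
        # to lie closer to the minority class than to the majority class.
        probes = a + np.linspace(0, 1, n_probe)[:, None] * (b - a)
        d_min, _ = nn_min.kneighbors(probes)
        d_maj, _ = nn_maj.kneighbors(probes)
        return (d_min.ravel() < d_maj.ravel()).mean() >= purity

    merged = True
    while merged and len(centroids) > 1:
        merged = False
        for i in range(len(centroids)):
            for j in range(i + 1, len(centroids)):
                if segment_is_pure(centroids[i], centroids[j]):
                    centroids[i] = (centroids[i] + centroids[j]) / 2.0
                    del centroids[j]
                    merged = True
                    break
            if merged:
                break
    return np.array(centroids)  # one representative per generalized cluster
```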

A New Hybrid Under-sampling Approach to Imbalanced Classification Problems

Applied Artificial Intelligence

Among the many machine learning applications, classification is one of the most important tasks. Most classification algorithms have been designed under the assumption that the number of samples in each class is approximately balanced. However, if conventional classification approaches are applied to a class-imbalanced dataset, misclassification is likely, and the classification performance results may be distorted. Thus, in this study, we consider imbalanced classification problems and adopt an efficient preprocessing technique to improve classification performance. In particular, we focus on borderline noise and outlier samples that belong to the majority class, since they may influence classification performance. To this end, we propose a hybrid resampling method, called BOD-based undersampling, which is based on the density-based spatial clustering of applications with noise (DBSCAN) approach as well as on noise and outlier detection methods, namely the borderline noise factor (BNF) and outlierness based on neighborhood (OBN), to divide the majority class samples into four distinctive categories: safe, borderline noise, rare, and outlier. Specifically, we first determine the borderline noise samples in the overlapped region using the BNF method. Secondly, we use the OBN method to detect outlier samples and apply the DBSCAN approach to cluster the samples. Based on the results of this sample identification analysis, we then segregate the safe-category samples, which are not abnormal, while keeping the rest as rare samples. Finally, we remove some of the safe samples using the random under-sampling (RUS) method and verify the effectiveness of the proposed algorithm through a comprehensive experimental analysis on several class-imbalanced datasets.
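
The sketch below mimics the categorize-then-thin structure of the abstract with stand-in definitions: DBSCAN noise points play the role of outliers, a mixed-class neighbourhood marks borderline points, and only the remaining "safe" points are randomly thinned. The real BNF and OBN measures are not reproduced here, and the eps/min_samples values are placeholders that need tuning per dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def bod_style_undersample(X_maj, X_min, keep_frac=0.5, k=5, seed=0):
    """Categorize majority points, then randomly thin only the 'safe' ones."""
    rng = np.random.default_rng(seed)

    # Stand-in for outlier detection: DBSCAN noise points (label -1).
    outlier = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_maj) == -1

    # Stand-in for borderline detection: a majority point is 'borderline'
    # if any of its k nearest neighbours in the pooled data is a minority point.
    X_all = np.vstack([X_maj, X_min])
    is_min = np.arange(len(X_all)) >= len(X_maj)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X_all).kneighbors(X_maj)
    borderline = is_min[idx[:, 1:]].any(axis=1) & ~outlier

    safe = ~outlier & ~borderline
    # Random under-sampling (RUS) applied to the safe category only.
    keep = ~safe | (rng.random(len(X_maj)) < keep_frac)
    return X_maj[keep]
```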

A Synthetic Minority Based on Probabilistic Distribution (SyMProD) Oversampling for Imbalanced Datasets

IEEE Access

Handling the imbalanced class problem is a challenging task in real-world applications. The problem causes prediction models to predict only the majority class and fail to identify the minority class because of the skewed data. Oversampling is one promising solution to the imbalanced class problem; however, several existing oversampling methods do not consider the distribution of the target variable and cause an overlapping class problem. Therefore, this study introduces a new oversampling technique, namely Synthetic Minority based on Probabilistic Distribution (SyMProD), to handle skewed datasets. Our technique normalizes the data using a Z-score and removes noisy data. The proposed method then selects minority samples based on the probability distribution of both classes, and the synthetic instances are generated from the selected points and several minority nearest neighbors. Our technique aims to create synthetic instances that cover the minority class distribution, avoid noise generation, and reduce the possibility of the overlapping class and overgeneralization problems. The proposed technique is validated on 14 benchmark datasets with three classifiers, and its performance is compared with seven conventional oversampling algorithms. The empirical results show that our method achieves better performance than the other oversampling techniques.
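
A loose sketch of the pipeline stages named in the abstract (normalize, drop noisy points, pick seeds probabilistically, synthesize from several minority neighbours) follows; the noise criterion, seed distribution, and convex-combination scheme are my assumptions, not SyMProD's published definitions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def symprod_sketch(X_min, X_maj, n_synth, k=5, seed=0):
    """Normalize, drop noisy minority points, pick seeds, synthesize from k neighbours."""
    rng = np.random.default_rng(seed)

    # Z-score normalization over both classes.
    X_all = np.vstack([X_min, X_maj])
    mu, sd = X_all.mean(0), X_all.std(0) + 1e-12
    Z_min, Z_maj = (X_min - mu) / sd, (X_maj - mu) / sd

    # Noise removal (my criterion): drop minority points that are closer to
    # the majority class than to any other minority point.
    d_min, _ = NearestNeighbors(n_neighbors=2).fit(Z_min).kneighbors(Z_min)
    d_maj, _ = NearestNeighbors(n_neighbors=1).fit(Z_maj).kneighbors(Z_min)
    clean = Z_min[d_min[:, 1] < d_maj[:, 0]]

    # Seed probability favours points deep inside the minority region.
    d2maj, _ = NearestNeighbors(n_neighbors=1).fit(Z_maj).kneighbors(clean)
    p = d2maj.ravel() / d2maj.sum()

    # Each synthetic point is a convex combination of a seed and its k neighbours.
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(clean).kneighbors(clean)
    synth = []
    for _ in range(n_synth):
        i = rng.choice(len(clean), p=p)
        group = clean[idx[i]]                      # seed plus its k neighbours
        weights = rng.dirichlet(np.ones(len(group)))
        synth.append(weights @ group)
    return np.array(synth) * sd + mu               # back to the original scale
```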

Synthetic minority oversampling technique for multiclass imbalance problems

Pattern Recognition, 2017

Multiclass imbalanced data learning has attracted increasing interest from the research community. Unfortunately, existing oversampling solutions, when facing this problem, which is more challenging than the two-class imbalance case, have shown their respective deficiencies, such as causing serious over-generalization or not actively improving the class imbalance in data space. We propose a k-nearest neighbors (k-NN)-based synthetic minority oversampling algorithm, termed SMOM, to handle multiclass imbalance problems. Different from previous k-NN-based oversampling algorithms, where for any original minority instance the synthetic instances are randomly generated in the directions of its k nearest neighbors, SMOM assigns a selection weight to each neighbor direction. Neighbor directions that can produce serious over-generalization are given small selection weights. In this way, SMOM forms a mechanism for avoiding over-generalization, as the safer neighbor directions are more likely to be selected to yield the synthetic instances. Owing to this, SMOM can aggressively explore the regions of minority classes by configuring a high value for the parameter k without resulting in severe over-generalization. Extensive experiments using 27 real-world data sets demonstrate the effectiveness of our algorithm.
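
A condensed illustration of the direction-weighting idea, shown for the binary case for brevity (the paper targets multiclass). The weight heuristic is an assumption of mine: a neighbour direction is down-weighted when the midpoint of its interpolation segment is closer to the majority class than to the minority class.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smom_sketch(X_min, X_maj, n_synth, k=7, seed=0):
    """Pick interpolation directions by safety weights instead of uniformly."""
    rng = np.random.default_rng(seed)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X_min).kneighbors(X_min)
    nn_min = NearestNeighbors(n_neighbors=1).fit(X_min)
    nn_maj = NearestNeighbors(n_neighbors=1).fit(X_maj)

    synth = []
    for _ in range(n_synth):
        i = rng.integers(len(X_min))
        # Midpoints of the k candidate neighbour directions.
        mids = (X_min[i] + X_min[idx[i, 1:]]) / 2.0
        d_min, _ = nn_min.kneighbors(mids)
        d_maj, _ = nn_maj.kneighbors(mids)
        # Safer directions (far from the majority class) get larger weights.
        w = d_maj.ravel() / (d_min.ravel() + d_maj.ravel() + 1e-12)
        j = idx[i, 1 + rng.choice(k, p=w / w.sum())]
        synth.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.array(synth)
```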

Handling Imbalanced Data: A Case Study for Binary Class Problems

2020

For several years now, one of the major issues in solving classification problems has been imbalanced data. Because the majority of machine learning algorithms assume by default that all data are balanced, they do not take into consideration the distribution of the sample classes. The results tend to be unsatisfactory and skewed toward the majority class distribution. This implies that the conclusions drawn from a model built on imbalanced data, without handling the imbalance, could be misleading both in practice and in theory. Most researchers have focused on applying the Synthetic Minority Oversampling Technique (SMOTE) and the Adaptive Synthetic (ADASYN) sampling approach to handle data imbalance independently in their works, and have failed to explain the algorithms behind these techniques with computed examples. This paper focuses on both synthetic oversampling techniques and manually computes synthetic...
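
Since the abstract emphasizes hand-computed examples, here is one SMOTE interpolation step worked end to end; the numbers are mine, chosen so the arithmetic can be checked mentally.

```python
import numpy as np

x_seed      = np.array([1.0, 2.0])   # existing minority sample
x_neighbour = np.array([3.0, 6.0])   # one of its k nearest minority neighbours
lam = 0.5                            # random draw in [0, 1]

# SMOTE formula: x_new = x_seed + lam * (x_neighbour - x_seed)
x_new = x_seed + lam * (x_neighbour - x_seed)
print(x_new)                         # [2. 4.] -- on the segment between the two
```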

A Novel Approach to Handle Class Imbalance: A Survey

International Journal of Engineering Development and Research, 2019

Machine learning is the study of the algorithms that a system uses to effectively perform a specific task, relying on patterns and inference instead of explicit instructions. In machine learning, there is almost always some level of class imbalance in real-world classification. This problem arises when the classes do not make up equal divisions of a data set. It is essential to properly adjust the metrics and methods to balance the data set; otherwise, many machine learning algorithms have low predictive accuracy for the infrequently occurring class. In this paper, we discuss this problem and look into the different approaches used to solve the class imbalance issue. The paper surveys the different approaches taken to mitigate class imbalance in data sets, covering both data-level approaches and algorithm-level approaches, and discusses the oversampling and undersampling methods used to overcome the data imbalance problem.

A Novel Approach to Handle Class Imbalance in Machine Learning

International journal of engineering research and technology, 2019

Machine learning is the study of the algorithms that a system uses to effectively perform a specific task, relying on patterns and inference instead of explicit instructions. In machine learning, there is always some level of class imbalance in real-world classification. This problem arises when the classes do not make up equal divisions of a data set. It is important to properly adapt the metrics and methods to balance the data set; otherwise, many machine learning algorithms have low predictive accuracy for the infrequently occurring class. In this paper, we discuss this problem and look into the different approaches used to solve the class imbalance issue. The paper surveys the different approaches taken to mitigate class imbalance in data sets, covering both data-level approaches and algorithm-level approaches, and discusses the oversampling and under-sampling methods used to overcome the data imbalance problem. ...