Cold Start Active Learning Strategies in the Context of Imbalanced Classification
Related papers
Active learning for imbalanced data under cold start
2021
Modern systems that rely on Machine Learning (ML) for predictive modelling may suffer from the cold-start problem: supervised models work well but, initially, there are no labels, which are costly or slow to obtain. This problem is even worse in imbalanced data scenarios, where labels of the positive class take longer to accumulate. We propose an Active Learning (AL) system for datasets with orders of magnitude of class imbalance, in a cold start streaming scenario. We present a computationally efficient Outlier-based Discriminative AL approach (ODAL) and design a novel 3-stage sequence of AL labeling policies where ODAL is used as warm-up. Then, we perform empirical studies on four real-world datasets with various magnitudes of class imbalance. The results show that our method can reach a high-performance model more quickly than standard AL policies without ODAL warm-up. Its observed gains over random sampling can reach 80% and be competitive with policies with an unlimited annot...
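The abstract does not spell out how ODAL scores candidates; purely as a rough sketch, the snippet below ranks unlabeled points with an off-the-shelf isolation forest and sends the most outlying ones to annotators first, on the intuition that, under extreme imbalance, rare positives are more likely to sit among outliers. Function and parameter names are ours, not the paper's.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def outlier_warmup_batch(X_unlabeled, batch_size=10, random_state=0):
    """Rank unlabeled points by anomaly score and return the indices of
    the most outlying ones to annotate first (ODAL-style warm-up)."""
    iso = IsolationForest(random_state=random_state).fit(X_unlabeled)
    scores = iso.score_samples(X_unlabeled)  # higher = more normal
    return np.argsort(scores)[:batch_size]   # lowest scores = outliers
```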
Active Learning for Imbalanced Datasets
2020
Active learning increases the effectiveness of labeling when only subsets of unlabeled datasets can be processed manually. To our knowledge, existing algorithms are designed under the assumption that datasets are balanced. However, many real-life datasets are actually imbalanced and we propose two adaptations of active learning to tackle imbalance. First, we modify acquisition functions to select samples by taking advantage of a deep model pretrained on a source domain. Second, we introduce a balancing step in the acquisition process to reduce the imbalance of the labeled subset. Evaluation is done with four imbalanced datasets using existing active learning methods and their modifications introduced here. Results show that our adaptations are useful as long as knowledge from the source domain is transferable to target domains.
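The abstract only says that a balancing step is added to the acquisition process; one plausible reading, sketched below under our own assumptions, fills a per-class quota with the highest-scoring candidates of each class predicted by the model pretrained on the source domain:

```python
import numpy as np

def balanced_batch(scores, predicted, n_classes, batch_size):
    """Fill a per-class quota with the highest-scoring candidates of
    each predicted class; top up by score if some class runs out."""
    order = np.argsort(-np.asarray(scores))   # most informative first
    quota = {c: batch_size // n_classes for c in range(n_classes)}
    batch, leftover = [], []
    for i in order:
        if quota.get(predicted[i], 0) > 0:
            batch.append(int(i))
            quota[predicted[i]] -= 1
        else:
            leftover.append(int(i))
    batch += leftover[: batch_size - len(batch)]  # fall back to score order
    return batch[:batch_size]
```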
Minority Class Oriented Active Learning for Imbalanced Datasets
2020 25th International Conference on Pattern Recognition (ICPR)
Active learning aims to optimize the dataset annotation process when resources are constrained. Most existing methods are designed for balanced datasets. Their practical applicability is limited by the fact that a majority of real-life datasets are actually imbalanced. Here, we introduce a new active learning method which is designed for imbalanced datasets. It favors samples likely to be in minority classes so as to reduce the imbalance of the labeled subset and create a better representation for these classes. We also compare two training schemes for active learning: (1) the one commonly deployed in deep active learning, using model fine-tuning at each iteration, and (2) a scheme inspired by transfer learning, which exploits generic pre-trained models and trains shallow classifiers at each iteration. Evaluation is run with three imbalanced datasets. Results show that the proposed active learning method outperforms competitive baselines. Equally interesting, they indicate that the transfer learning training scheme outperforms model fine-tuning if features are transferable from the generic dataset to the unlabeled one. This last result is surprising and should encourage the community to explore the design of deep active learning methods.
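The second training scheme (frozen generic features plus a shallow classifier per iteration), combined with the minority-oriented selection, could look roughly like the sketch below; `minority_label` and the logistic-regression choice are our assumptions, not the paper's specification.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def al_iteration(F_labeled, y_labeled, F_pool, minority_label, k=100):
    """One iteration of the transfer-style scheme: train a shallow
    classifier on frozen pretrained features, then query the k pool
    samples most likely to belong to the minority class."""
    clf = LogisticRegression(max_iter=1000).fit(F_labeled, y_labeled)
    col = list(clf.classes_).index(minority_label)
    p_minority = clf.predict_proba(F_pool)[:, col]
    return np.argsort(-p_minority)[:k]  # pool indices to annotate
```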
Active learning for online training in imbalanced data streams under cold start
ArXiv, 2021
Labeled data is essential in modern systems that rely on Machine Learning (ML) for predictive modelling. Such systems may suffer from the cold-start problem: supervised models work well but, initially, there are no labels, which are costly or slow to obtain. This problem is even worse in imbalanced data scenarios. Online financial fraud detection is an example where labeling is either i) expensive or ii) subject to long delays if it relies on victims filing complaints. The latter may not be viable if a model has to be in place immediately, so an option is to ask analysts to label events while minimizing the number of annotations to control costs. We propose an Active Learning (AL) annotation system for datasets with orders of magnitude of class imbalance, in a cold start streaming scenario. We present a computationally efficient Outlier-based Discriminative AL approach (ODAL) and design a novel 3-stage sequence of AL labeling policies where it is used as warm-up. Then, we perform emp...
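The excerpt names a 3-stage sequence of labeling policies but cuts off before detailing the stages; purely as an illustration, a scheduler in that spirit might hand over from an outlier-based warm-up to uncertainty sampling once enough positives have accumulated. Thresholds and stage names below are hypothetical, not the paper's.

```python
def choose_policy(n_labeled, n_positives, warmup_budget=200, min_pos=10):
    """Hypothetical 3-stage scheduler: outlier warm-up while positives
    are scarce, then uncertainty sampling, then plain exploitation."""
    if n_positives < min_pos and n_labeled < warmup_budget:
        return "outlier_warmup"   # no usable supervised signal yet
    if n_positives < 10 * min_pos:
        return "uncertainty"      # model exists but is still fragile
    return "exploit"              # mature model, spend budget greedily
```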
A Novel Low-Query-Budget Active Learner with Pseudo-Labels for Imbalanced Data
Mathematics, 2022
Despite the availability of a large amount of free unlabeled data, collecting sufficient training data for supervised learning models is challenging due to the time and cost involved in the labeling process. The active learning technique we present here provides a solution by querying a small but highly informative set of unlabeled data. It ensures high generalizability across the instance space, improving classification performance on previously unseen test data. Most active learners query either the most informative or the most representative data for annotation. The proposed algorithm combines these two criteria in two phases: exploration and exploitation. The former explores the instance space by visiting new regions at each iteration; the latter selects highly informative points in uncertain regions. Without any predefined knowledge, such as initial training data, these two phases improve the search strategy of the proposed algori...
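A compact way to render the two phases, under our own assumptions about the criteria (max-min distance for exploration, smallest classification margin for exploitation):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def explore(X_pool, X_labeled):
    """Exploration phase: query the pool point farthest from every
    labeled point (max-min distance), so new regions get visited."""
    d_min = pairwise_distances(X_pool, X_labeled).min(axis=1)
    return int(np.argmax(d_min))

def exploit(proba):
    """Exploitation phase: query the pool point with the smallest margin
    between its two most probable classes (most uncertain region)."""
    top2 = np.sort(proba, axis=1)[:, -2:]
    return int(np.argmin(top2[:, 1] - top2[:, 0]))
```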
Many machine learning datasets are noisy, with a substantial number of mislabeled instances. This noise yields sub-optimal classification performance. In this paper we study a large, low-quality annotated dataset, created quickly and cheaply using Amazon Mechanical Turk to crowdsource annotations. We describe computationally cheap feature weighting techniques and a novel non-linear distribution spreading algorithm that can be used to iteratively and interactively correct mislabeled instances, significantly improving annotation quality at low cost. Eight different emotion extraction experiments on Twitter data demonstrate that our approach is just as effective as more computationally expensive techniques. Our techniques save a considerable amount of time.
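The paper's distribution spreading algorithm is not reproduced in the abstract; as a generic stand-in, out-of-fold confidence can flag crowd labels that a model confidently contradicts, which is the usual starting point for this kind of iterative relabeling loop:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def flag_suspect_labels(X, y, threshold=0.9):
    """Flag instances whose out-of-fold prediction confidently disagrees
    with the crowd label (assumes labels are encoded 0..k-1). Flagged
    items go back to annotators for review, then the loop repeats."""
    proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                              method="predict_proba", cv=5)
    pred, conf = proba.argmax(axis=1), proba.max(axis=1)
    return np.where((pred != y) & (conf >= threshold))[0]
```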
Reducing class imbalance during active learning for named entity annotation
In many natural language processing tasks, the classes of interest are heavily imbalanced in the underlying data set, and classifiers trained on such skewed data tend to exhibit poor performance for low-frequency classes. We introduce and compare different approaches to reduce class imbalance by design within the context of active learning (AL). Our goal is to compile more balanced data sets up front, at annotation time, when AL is used as a strategy to acquire training material. We situate our approach in the context of named entity recognition. Our experiments reveal that we can indeed reduce class imbalance and increase the performance of classifiers on minority classes while preserving good overall performance in terms of macro F-score.
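One simple way to bias selection toward low-frequency entity classes, sketched under our own assumptions (BIO-style tags, with `class_freq` counting tags already present in the labeled data):

```python
def rare_class_scores(predicted_tags, class_freq):
    """Score each sentence by how strongly its predicted entity tags
    point at classes that are still rare in the labeled data; the
    highest-scoring sentences are queried to rebalance the set."""
    def score(tags):
        return sum(1.0 / (1 + class_freq.get(t, 0))
                   for t in tags if t != "O")
    return [score(tags) for tags in predicted_tags]
```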
Active Learning for Reducing Labeling Effort in Text Classification Tasks
Communications in Computer and Information Science, 2022
Labeling data can be an expensive task as it is usually performed manually by domain experts. This is cumbersome for deep learning, as it depends on large labeled datasets. Active learning (AL) is a paradigm that aims to reduce labeling effort by only using the data which the model deems most informative. Little research has been done on AL in a text classification setting, and next to none has involved the more recent, state-of-the-art NLP models. Here, we present an empirical study that compares different uncertainty-based algorithms with BERT base as the classifier. We evaluate the algorithms on two NLP classification datasets: Stanford Sentiment Treebank and KvK-Frontpages. Additionally, we explore heuristics that aim to solve presupposed problems of uncertainty-based AL, namely that it is unscalable and prone to selecting outliers. Furthermore, we explore the influence of the query-pool size on the performance of AL. Although the proposed heuristics did not improve the performance of AL, our results show that uncertainty-based AL with BERT base outperforms random sampling of data. This difference in performance can decrease as the query-pool size gets larger.
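A minimal sketch of the uncertainty-based selection step with a BERT classifier, assuming Hugging Face `transformers` and a model already fine-tuned on the current labeled set (`bert-base-uncased` is a placeholder checkpoint, not the paper's exact setup):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

@torch.no_grad()
def entropy_scores(texts, batch_size=32):
    """Predictive entropy per unlabeled text; higher = more uncertain,
    hence more informative to send to an annotator."""
    scores = []
    for i in range(0, len(texts), batch_size):
        enc = tok(texts[i:i + batch_size], padding=True,
                  truncation=True, return_tensors="pt")
        probs = model(**enc).logits.softmax(dim=-1)
        ent = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        scores.extend(ent.tolist())
    return scores  # annotate the top-k indices each AL round
```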
Importance-Weighted Label Prediction for Active Learning with Noisy Annotations
This paper presents a practical method for pool-based active learning that is robust to annotation noise. Our work is inspired by recent approaches to active learning in two different noise-free settings: importance-weighted methods for streams and unbiased pool-based techniques. In our proposed method, we employ an ensemble of classifiers to guide the label requests from a pool of unlabeled training data. We demonstrate, using several standard datasets, that the proposed approach, which employs label prediction in combination with importance-weighting, significantly improves active learning in the presence of annotation noise. Moreover, the ease with which the proposed method can be implemented should make it widely applicable to a broad range of real-world applications.
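The combination of ensemble label prediction and importance weighting could be sketched as follows; the disagreement-to-probability mapping and the 1/p weights reflect the general recipe as we read it, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def query_or_predict(votes, p_min=0.05):
    """The more the committee disagrees on an instance, the higher the
    probability of asking the oracle for its label; confident instances
    keep the ensemble's predicted label. Queried labels carry weight
    1/p so training on them stays approximately unbiased."""
    counts = np.bincount(votes)                       # votes: class ids
    p = max(p_min, 1.0 - counts.max() / len(votes))   # disagreement -> p
    if rng.random() < p:
        return {"source": "oracle", "weight": 1.0 / p}
    return {"source": "ensemble", "label": int(counts.argmax())}
```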
d-Confidence: an active learning strategy which efficiently identifies small classes
In some classification tasks, such as those related to the automatic building and maintenance of text corpora, it is expensive to obtain labeled examples to train a classifier. In such circumstances it is common to have massive corpora where a few examples are labeled (typically a minority) while others are not. Semi-supervised learning techniques try to leverage the intrinsic information in unlabeled examples to improve classification models. However, these techniques assume that the labeled examples cover all the classes to be learned, which might not hold. In the presence of an imbalanced class distribution, getting labeled examples from minority classes might be very costly if queries are selected at random. Active learning allows asking an oracle to label new, judiciously selected examples and does not assume prior knowledge of all classes. d-Confidence is an active learning approach that is effective in the presence of imbalanced training sets. In this paper we discu...
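The excerpt truncates before defining the criterion; one reading of a d-Confidence-style score, sketched with hypothetical names, divides each class posterior by the distance to that class's labeled centroid and queries the instance whose best distance-weighted confidence is lowest, i.e. one that is both uncertain and far from every known class:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def d_confidence_query(proba, X_pool, centroids):
    """Query the instance with the lowest distance-weighted confidence.
    Row c of `centroids` must correspond to column c of `proba`."""
    d = pairwise_distances(X_pool, centroids) + 1e-12  # avoid div by 0
    return int(np.argmin((proba / d).max(axis=1)))
```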