Optimal Bayesian classification in nonstationary streaming environments

Semi-supervised learning in nonstationary environments

Neural Networks (IJCNN), The 2011 …, 2011

Learning in nonstationary environments, also called learning concept drift, has been receiving increasing attention due to an increasingly large number of applications that generate data with drifting distributions. These applications are usually associated with streaming data, either online or in batches, and concept drift algorithms are trained to detect and track the drifting concepts. While concept drift itself is a significantly more complex problem than the traditional machine learning paradigm of data coming from a fixed ...

Incremental Learning of Concept Drift from Streaming Imbalanced Data

IEEE Transactions on Knowledge and Data Engineering, 2012

Learning in nonstationary environments, also known as learning concept drift, is concerned with learning from data whose statistical characteristics change over time. Concept drift is further complicated if the dataset is class-imbalanced. While these two issues have been independently addressed, their joint treatment has been mostly underexplored. We describe two ensemble-based approaches for learning concept drift from imbalanced data. Our first approach is a logical combination of our previously introduced ...

Hellinger distance based drift detection for nonstationary environments

Computational Intelligence in Dynamic …, 2011

Most machine learning algorithms, including many online learners, assume that the data distribution to be learned is fixed. There are many real-world problems where the distribution of the data changes as a function of time. Changes in nonstationary data distributions can significantly reduce the generalization ability of the learning algorithm on new or field data, if the algorithm is not equipped to track such changes. When the stationary data distribution assumption does not hold, the learner must take appropriate actions to ...
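As a minimal sketch of the kind of distance-based drift signal the paper's title refers to: the Hellinger distance between normalized histograms of a reference batch and a new batch grows as the two distributions diverge. The bin edges and batch values below are illustrative, not from the paper.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same bins."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

def histogram(values, edges):
    """Normalized histogram of `values` over bins defined by `edges`."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1] or (i == len(counts) - 1 and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts) or 1
    return [c / total for c in counts]

# Two batches drawn from shifted distributions: the distance grows under drift.
edges = [0, 1, 2, 3, 4]
batch_ref = [0.2, 0.5, 1.1, 1.4, 0.9, 1.8]  # mostly in [0, 2)
batch_new = [2.2, 2.9, 3.5, 3.1, 2.4, 3.8]  # mostly in [2, 4]
d = hellinger(histogram(batch_ref, edges), histogram(batch_new, edges))
print(round(d, 3))  # disjoint supports give the maximum distance, 1.0
```

A drift detector built on this would flag drift when `d` exceeds an adaptive threshold learned from recent batch-to-batch distances.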

Active learning in nonstationary environments

Abstract—An increasing number of practical applications that involve streaming nonstationary data have led to a recent surge in algorithms designed to learn from such data. One challenging version of this problem that has not received as much attention, however, is learning streaming nonstationary data when a small initial set of data are labeled, with unlabeled data being available thereafter. We have recently introduced the COMPOSE algorithm for learning in such scenarios, which we refer to as initially labeled nonstationary streaming data. COMPOSE works remarkably well; however, it requires limited (gradual) drift, and cannot address special cases such as the introduction of a new class or significant overlap of existing classes, as such scenarios cannot be learned without additional labeled data. Scenarios that provide occasional or periodic limited labeled data are not uncommon, however, and in such settings many of COMPOSE's restrictions can be lifted. In this contribution, we describe a new version of COMPOSE as a proof-of-concept algorithm that can identify the instances whose labels — if available — would be most beneficial, and then combine those instances with unlabeled data to actively learn from streaming nonstationary data, even when the distribution of the data experiences abrupt changes. On two carefully designed experiments that include abrupt changes as well as the addition of new classes, we show that COMPOSE.AL significantly outperforms the original COMPOSE, while closely tracking the optimal Bayes classifier performance.
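The abstract's notion of "instances whose labels — if available — would be most beneficial" is the standard active-learning query-selection step. A common, hedged stand-in (not the paper's actual criterion) is margin-based uncertainty sampling: query the instances whose top two class posteriors are closest.

```python
def margin(probs):
    """Margin between the two most probable classes; a small margin = uncertain."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]

def select_queries(posteriors, budget):
    """Pick the `budget` instances whose labels would help most (smallest margin)."""
    ranked = sorted(range(len(posteriors)), key=lambda i: margin(posteriors[i]))
    return ranked[:budget]

# Predicted class posteriors for five unlabeled streaming instances (illustrative).
posteriors = [
    [0.9, 0.1],    # confident
    [0.55, 0.45],  # ambiguous
    [0.5, 0.5],    # maximally ambiguous
    [0.8, 0.2],
    [0.6, 0.4],
]
print(select_queries(posteriors, 2))  # → [2, 1]
```

With a labeling budget per batch, the queried labels are folded back into the labeled pool before the next semi-supervised step.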

Learning from Data Streams with Concept Drift

2008

Summary Increasing access to incredibly large, nonstationary datasets and corresponding demands to analyse these data have led to the development of new online algorithms for performing machine learning on data streams. An important feature of real-world data streams is “concept drift,” whereby the distributions underlying the data can change arbitrarily over time.

COMPOSE: A Semisupervised Learning Framework for Initially Labeled Nonstationary Streaming Data

An increasing number of real-world applications are associated with streaming data drawn from drifting and nonstationary distributions that change over time. These applications demand new algorithms that can learn and adapt to such changes, also known as concept drift. Proper characterization of such data with existing approaches typically requires a substantial amount of labeled instances, which may be difficult, expensive, or even impractical to obtain. In this paper, we introduce compacted object sample extraction (COMPOSE), a computational geometry-based framework to learn from nonstationary streaming data, where labels are unavailable (or presented very sporadically) after initialization. We introduce the algorithm in detail, and discuss its results and performance on several synthetic and real-world data sets, which demonstrate the ability of the algorithm to learn under several different scenarios of initially labeled streaming environments. On carefully designed synthetic data sets, we compare the performance of COMPOSE against the optimal Bayes classifier, as well as the arbitrary subpopulation tracker algorithm, which addresses a similar environment referred to as extreme verification latency. Furthermore, using the real-world National Oceanic and Atmospheric Administration weather data set, we demonstrate that COMPOSE is competitive even with a well-established and fully supervised nonstationary learning algorithm that receives labeled data in every batch.
Index Terms—Alpha shape, concept drift, nonstationary environment, semisupervised learning (SSL), verification latency.
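A highly simplified sketch of one COMPOSE-style iteration, under stated assumptions: the semi-supervised labeling step is reduced to 1-NN, and the alpha-shape compaction is replaced by keeping each class's points nearest its centroid. All names and data here are illustrative, not the paper's implementation.

```python
import math

def nearest_label(x, labeled):
    """1-NN classification: label of the closest labeled instance (Euclidean)."""
    return min(labeled, key=lambda p: math.dist(x, p[0]))[1]

def core_support(points, keep=0.5):
    """Keep the fraction of points closest to the class centroid.
    A crude stand-in for COMPOSE's alpha-shape compaction step."""
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    ranked = sorted(points, key=lambda p: math.dist(p, (cx, cy)))
    return ranked[:max(1, int(keep * len(points)))]

def compose_step(labeled, unlabeled_batch):
    """One COMPOSE-style step: label the new batch semi-supervised (here 1-NN),
    then compact each class to its core to seed the next time step."""
    predicted = [(x, nearest_label(x, labeled)) for x in unlabeled_batch]
    next_labeled = []
    for cls in {y for _, y in predicted}:
        pts = [x for x, y in predicted if y == cls]
        next_labeled += [(p, cls) for p in core_support(pts)]
    return next_labeled

# Initial labels at t=0; later batches arrive unlabeled and drift gradually.
labeled = [((0.0, 0.0), "A"), ((5.0, 5.0), "B")]
batch = [(0.2, 0.5), (0.1, 0.4), (5.2, 5.4), (4.9, 5.3)]
labeled = compose_step(labeled, batch)
print(sorted({y for _, y in labeled}))  # → ['A', 'B']: both classes survive
```

The compaction step is what lets the algorithm tolerate gradual drift: the retained core of each class at time t still overlaps that class's distribution at time t+1.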

Learning in Nonstationary Environments: A Survey

Applications that generate data from nonstationary environments, where the underlying phenomena change over time, are becoming increasingly prevalent. Examples of these applications include making inferences or predictions based on financial data, energy demand and climate data analysis, web usage or sensor network monitoring, and malware/spam detection, among many others. In nonstationary environments, particularly those that generate streaming or multi-domain data, the probability density function of the data-generating process may change (drift) over time. Therefore, the fundamental and rather naïve assumption made by most computational intelligence approaches – that the training and testing data are sampled from the same fixed, albeit unknown, probability distribution – is simply not true. Learning in nonstationary environments requires adaptive or evolving approaches that can monitor and track the underlying changes, and adapt a model to accommodate those changes accordingly. In this effort, we provide a comprehensive survey and tutorial of established as well as state-of-the-art approaches, while highlighting two primary perspectives, active and passive, for learning in nonstationary environments. Finally, we also provide an inventory of existing real and synthetic datasets, as well as tools and software for getting started, evaluating and comparing different approaches.

An Ensemble Based Incremental Learning Framework for Concept Drift and Class Imbalance

… The 2010 International Joint Conference on, 2010

Abstract—We have recently introduced an incremental learning algorithm, Learn++.NSE, designed to learn in nonstationary environments, which has been shown to provide an attractive solution to a number of concept drift problems under different drift scenarios. However, Learn++.NSE relies on each classifier's error on the most recent data to weight the ensemble members. For balanced class distributions, this approach works very well, but when faced with imbalanced data, error is no longer an acceptable measure of performance. On the ...
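The error-based weighting the abstract refers to can be sketched with a log-odds rule: members with low error on the latest batch get large vote weights, and members at or above chance (error ≥ 0.5) get weight zero. This is a simplified, hedged stand-in; Learn++.NSE itself uses a sigmoid-weighted average of errors over time, which is omitted here.

```python
import math

def classifier_weights(errors, eps_min=1e-6):
    """Log-odds weighting of ensemble members by their error on the most
    recent batch: low error -> high vote weight (simplified from Learn++.NSE)."""
    weights = []
    for e in errors:
        e = min(max(e, eps_min), 0.5)  # clip: error >= 0.5 yields weight 0
        weights.append(math.log((1 - e) / e))
    return weights

def weighted_vote(predictions, weights):
    """Combine class predictions by summed classifier weight."""
    tally = {}
    for pred, w in zip(predictions, weights):
        tally[pred] = tally.get(pred, 0.0) + w
    return max(tally, key=tally.get)

# Three ensemble members; errors measured on the latest (balanced) batch.
errors = [0.05, 0.25, 0.45]
w = classifier_weights(errors)
print(weighted_vote(["spam", "ham", "ham"], w))  # → spam (strongest member wins)
```

The abstract's point is exactly that this breaks under class imbalance: a classifier can have low raw error while misclassifying the entire minority class, so a class-aware measure must replace plain error in the weighting.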