Data Streams Research Papers - Academia.edu
Victims of road traffic accidents face severe health problems on-site, or afterwards when they arrive at hospital late in their emergency cycle. Road traffic accidents have a negative effect on the physical, social and emotional security of human lives, often leading to mortality, illness, pain, grief and even disability. This paper proposes a scheme that reduces the severity of road traffic accidents, given their inevitable occurrence. The rationale is to search for the hospitals nearest to the accident location using Dijkstra's algorithm, and to use fuzzy logic to recommend suitable hospitals from that list to attend to the emergency in a timely manner, considering factors such as distance, severity of the accident, available facilities in the hospitals and other factors. The results obtained show the practicability of the system for recommending quick solutions to accident emergencies.
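As a rough illustration of the nearest-hospital search described in this abstract, the following sketch runs Dijkstra's algorithm from the accident location and stops once the k closest hospitals have been settled. The road graph, node names, edge weights and the function name nearest_hospitals are all hypothetical and not taken from the paper, and the fuzzy-logic ranking step is not shown.

```python
import heapq

def nearest_hospitals(graph, source, hospitals, k=3):
    """Return up to k (distance, hospital) pairs, closest first.

    graph maps a node to a list of (neighbor, road_length) pairs.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    visited = set()
    found = []
    while heap and len(found) < k:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue  # stale heap entry, a shorter path was already settled
        visited.add(node)
        if node in hospitals:
            found.append((d, node))
        for neighbor, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return found

# Hypothetical road network: distances are illustrative only.
ROADS = {
    "accident": [("a", 2.0), ("b", 5.0)],
    "a": [("accident", 2.0), ("h1", 4.0), ("b", 1.0)],
    "b": [("accident", 5.0), ("a", 1.0), ("h2", 1.0)],
    "h1": [("a", 4.0)],
    "h2": [("b", 1.0), ("h3", 6.0)],
    "h3": [("h2", 6.0)],
}

print(nearest_hospitals(ROADS, "accident", {"h1", "h2", "h3"}, k=2))
```

Because Dijkstra's algorithm settles nodes in order of increasing distance, the search can stop as soon as k hospitals have been popped from the queue. The resulting shortlist is what a separate ranking step would then score against severity and facilities.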
Data streams are massive, dynamic and unbounded, which makes data stream clustering a challenging problem. Data streams are observed in network monitoring, critical scientific applications, weather monitoring, astronomical applications, electronic business, stock trading, etc. Data stream clustering puts additional constraints on clustering algorithms: streams must be processed in a single pass with limited memory and little processing time, yet they can be highly dynamic. Most existing clustering algorithms are distance based and unable to handle interwoven clusters, and it is impossible to store the data streams because they are unbounded. The proposed work focuses on density-based clustering algorithms using micro-clusters. The process is divided into two phases, online and offline: micro-clusters are created in the online phase and final clusters are generated in the offline phase.
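To make the online phase more concrete, here is a deliberately simplified single-pass sketch: each arriving point is absorbed by the nearest micro-cluster if it lies within a fixed radius, otherwise it starts a new micro-cluster. The class MicroCluster, the function online_phase and the radius parameter are assumptions for illustration. The sketch omits the weight decay and density thresholds that density-based stream clustering methods normally use, and it does not cover the offline phase.

```python
import math

class MicroCluster:
    """Summary of nearby points: a count and a per-dimension linear sum."""

    def __init__(self, point):
        self.n = 1
        self.linear_sum = list(point)

    def center(self):
        return [s / self.n for s in self.linear_sum]

    def absorb(self, point):
        self.n += 1
        for i, v in enumerate(point):
            self.linear_sum[i] += v

def online_phase(stream, radius):
    """Single pass over the stream; only the summaries are kept in memory."""
    clusters = []
    for point in stream:
        best, best_dist = None, float("inf")
        for mc in clusters:
            d = math.dist(mc.center(), point)
            if d < best_dist:
                best, best_dist = mc, d
        if best is not None and best_dist <= radius:
            best.absorb(point)
        else:
            clusters.append(MicroCluster(point))
    return clusters

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
for mc in online_phase(points, radius=1.0):
    print(mc.n, mc.center())
```

The raw points are never stored, only the summaries, which is what allows a single pass with bounded memory. An offline step would then group the micro-cluster centers into final clusters.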
- by Albert Bifet
- Data Streams
10th International Conference on Data Mining & Knowledge Management Process (DKMP 2022) will provide an excellent international forum for sharing knowledge and results in theory, methodology and applications of Data Mining and knowledge management process. The goal of this conference is to bring together researchers and practitioners from academia and industry to focus on understanding modern data mining concepts and establishing new collaborations in these areas. Authors are solicited to contribute to the conference by submitting articles that illustrate research results, projects, surveying works and industrial experiences that describe significant advances in the areas of Computer Science, Engineering and Applications.
Due to the enormous amount of data and opinions being produced, shared and transferred every day across the internet and other media, sentiment analysis has become vital for developing opinion mining systems. This paper introduces a sentiment classification approach using deep learning networks and presents comparative results for different deep learning networks. A Multilayer Perceptron (MLP) was developed as a baseline for the other networks' results. A Long Short-Term Memory (LSTM) recurrent neural network, a Convolutional Neural Network (CNN), and a hybrid model of LSTM and CNN were developed and applied to the IMDB dataset, which consists of 50K movie review files. The dataset was divided into 50% positive reviews and 50% negative reviews. The data was initially pre-processed using Word2Vec, and word embedding was applied accordingly. The results show that the hybrid CNN_LSTM model outperformed the MLP and the individual CNN and LSTM networks. CNN_LSTM reported an accuracy of 89.2%, CNN an accuracy of 87.7%, and MLP and LSTM accuracies of 86.74% and 86.64% respectively. Moreover, the results show that the proposed deep learning models also outperformed the SVM, Naïve Bayes and RNTN results published in other works using English datasets.
Mining big data represents a big challenge nowadays. Much research is concerned with mining massive amounts of data and big data streams. Mining big data faces many challenges, including scalability, speed, heterogeneity, accuracy, provenance and privacy. In the telecommunication industry, mining big data is like mining for gold; it represents a big opportunity for maximizing the revenue streams in this industry. This paper discusses the characteristics of big data (volume, variety, velocity and veracity), data mining techniques and tools for handling very large data sets, mining big data in telecommunication, and the benefits and opportunities gained from it.
Internet of Things (IoT) is changing the physical world. Billions of smart objects across the world generate sensed data about environmental changes and report them. These sensed data create a lot of network traffic, redundancy and duplicated records, and increase the storage size of Big Data in cloud computing. Some of the sensed data are required to perform a real-time action. In this paper, we propose a framework to provide decision makers in today's organizations with real-time data that could help them make better decisions. In addition, it can filter and fuse the data with an adaptive sampling rate using a Feedback Control Mechanism (FCM) in the network layer. The proposed framework reduces the number of IoT records. Moreover, it avoids a lot of network traffic, extends the network lifetime, and minimizes energy consumption. The proposed approach can help most business people in their decisions, increase performance and reduce cost. It can also be applied to any business system, such as ERP, CRM, and SCM. We have implemented and tested it using SQL and CQL. The experimental results show a reduction of 82.34 percent compared to the baseline. This reduction improves the energy consumption of traffic in the network layer, producing efficient Big Data in terms of noise, storage capacity and cost.
Due to recent advances in data collection techniques, massive amounts of data are being collected at an extremely fast pace. Also, these data are potentially unbounded. Boundless streams of data collected from sensors, equipment, and other data sources are referred to as data streams. Various data mining tasks can be performed on data streams in search of interesting patterns. This paper studies a particular data mining task, clustering, which can be used as the first step in many knowledge discovery processes. By grouping data streams into homogeneous clusters, data miners can learn about data characteristics which can then be developed into classification models for new data or predictive models for unknown events. Recent research addresses the problem of data-stream mining to deal with applications that require processing huge amounts of data such as sensor data analysis and financial applications. For such analysis, single-pass algorithms that consume a small amount of memory are critical.
Mining of data streams must balance three evaluation dimensions: accuracy, time and memory. Excellent accuracy on data streams has been obtained with Naive Bayes Hoeffding Trees—Hoeffding Trees with naive Bayes models at the leaf nodes—albeit with increased runtime compared to standard Hoeffding Trees. In this paper, we show that runtime can be reduced by replacing naive Bayes with perceptron classifiers, while maintaining highly competitive accuracy. We also show that accuracy can be increased even further by combining majority vote, naive Bayes, and perceptrons. We evaluate four perceptron-based learning strategies and compare them against appropriate baselines: simple perceptrons, Perceptron Hoeffding Trees, hybrid Naive Bayes Perceptron Trees, and bagged versions thereof. We implement a perceptron that uses the sigmoid activation function instead of the threshold activation function and optimizes the squared error, with one perceptron per class value. We test our methods by performing an evaluation study on synthetic and real-world datasets comprising up to ten million examples.
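The perceptron variant described in this abstract (sigmoid activation, squared-error objective, one perceptron per class value) can be sketched as an online learner as follows. This is a minimal illustration, not the authors' implementation: the class name SigmoidPerceptron, the method names, the learning rate and the toy stream are all assumptions, and the Hoeffding Tree in which such perceptrons would sit at the leaves is not shown.

```python
import math

class SigmoidPerceptron:
    """One sigmoid unit per class, trained one example at a time."""

    def __init__(self, n_features, n_classes, learning_rate=0.5):
        self.lr = learning_rate
        # One weight vector per class; the last weight is the bias.
        self.weights = [[0.0] * (n_features + 1) for _ in range(n_classes)]

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def _outputs(self, x):
        xb = list(x) + [1.0]
        return [self._sigmoid(sum(w * v for w, v in zip(ws, xb)))
                for ws in self.weights]

    def predict(self, x):
        outs = self._outputs(x)
        return max(range(len(outs)), key=outs.__getitem__)

    def learn_one(self, x, y):
        xb = list(x) + [1.0]
        for c, (ws, out) in enumerate(zip(self.weights, self._outputs(x))):
            target = 1.0 if c == y else 0.0
            # Gradient step on 0.5 * (target - out)^2 through the sigmoid.
            delta = (target - out) * out * (1.0 - out)
            for i, v in enumerate(xb):
                ws[i] += self.lr * delta * v

# Toy stream (hypothetical): class 1 when the single feature is large.
model = SigmoidPerceptron(n_features=1, n_classes=2)
for _ in range(200):
    for x, y in [([0.0], 0), ([0.2], 0), ([0.8], 1), ([1.0], 1)]:
        model.learn_one(x, y)
print(model.predict([0.0]), model.predict([1.0]))
```

Each update touches only the weight vectors, so the cost per example is proportional to the number of features times the number of classes.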
- by Albert Bifet
- Data Streams
Knowledge of the largest traffic flows in a network is important for many network management applications. The problem of finding these flows is known as the heavy-hitter problem and has been the subject of many studies in the past years. One of the most efficient and well-known algorithms for finding heavy hitters is lossy counting (29). In this work we introduce probabilistic lossy counting (PLC), which enhances lossy counting in computing network traffic heavy hitters. PLC uses a tighter error bound on the estimated sizes of traffic flows and provides probabilistic rather than deterministic guarantees on its accuracy. The probabilistic-based error bound substantially improves the memory consumption of the algorithm. In addition, PLC reduces the rate of false positives of lossy counting and achieves a low estimation error, although slightly higher than that of lossy counting. We compare PLC with state-of-the-art algorithms for finding heavy hitters. Our experiments using real tr...
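For context, the baseline that PLC builds on can be sketched briefly. The code below is a minimal version of standard (deterministic) lossy counting, not of PLC itself: it keeps a count and a maximum undercount per item and prunes low-frequency entries at each bucket boundary. The class and method names and the toy stream are assumptions for illustration.

```python
import math

class LossyCounter:
    """Deterministic lossy counting with error parameter epsilon."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1.0 / epsilon)  # items per bucket
        self.n = 0                             # items seen so far
        self.entries = {}                      # item -> [count, max_undercount]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # The item may have been seen and pruned in earlier buckets.
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # Bucket boundary: drop entries that cannot be frequent.
            self.entries = {k: v for k, v in self.entries.items()
                            if v[0] + v[1] > bucket}

    def heavy_hitters(self, support):
        threshold = (support - self.epsilon) * self.n
        return {k: v[0] for k, v in self.entries.items() if v[0] >= threshold}

# Toy stream (hypothetical): two large flows plus many one-off flows.
counter = LossyCounter(epsilon=0.1)
for i in range(10):
    for flow in ["a"] * 5 + ["b"] * 3 + [f"r{i}", f"r{i}x"]:
        counter.add(flow)
print(counter.heavy_hitters(support=0.2))
```

The stored counts underestimate true frequencies by at most epsilon times the stream length, which is the deterministic guarantee that the abstract says PLC relaxes to a probabilistic one.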
- by Albert Bifet
- Data Streams
We propose and illustrate a method for developing algorithms that can adaptively learn from data streams that drift over time. As an example, we take Hoeffding Tree, an incremental decision tree inducer for data streams, and use it as a basis to build two new methods that can deal with distribution and concept drift: a sliding window-based algorithm, Hoeffding Window Tree, and an adaptive method, Hoeffding Adaptive Tree. Our methods are based on using change detectors and estimator modules at the right places; we choose implementations with theoretical guarantees in order to extend such guarantees to the resulting adaptive learning algorithm. A main advantage of our methods is that they require no guess about how fast or how often the stream will drift; other methods typically have several user-defined parameters to this effect. In our experiments, the new methods never do worse, and in some cases do much better, than CVFDT, a well-known method for tree induction on data streams with drift.
- by Albert Bifet
- Data Streams
Concept drift primarily refers to an online supervised learning scenario when the relation between the input data and the target variable changes over time. Assuming a general knowledge of supervised learning in this article, we characterize adaptive learning processes; categorize existing strategies for handling concept drift; overview the most representative, distinct, and popular techniques and algorithms; discuss evaluation methodology of adaptive algorithms; and present a set of illustrative applications. The survey covers the different facets of concept drift in an integrated way to reflect on the existing scattered state of the art. Thus, it aims at providing a comprehensive introduction to the concept drift adaptation for researchers, industry analysts, and practitioners.
We propose two new improvements for bagging methods on evolving data streams. Recently, two new variants of Bagging were proposed: ADWIN Bagging and Adaptive-Size Hoeffding Tree (ASHT) Bagging. ASHT Bagging uses trees of different sizes, and ADWIN Bagging uses ADWIN as a change detector to decide when to discard underperforming ensemble members. We improve ADWIN Bagging using Hoeffding Adaptive Trees, trees that can adaptively learn from data streams that change over time. To speed up the time for adapting to change of Adaptive-Size Hoeffding Tree (ASHT) Bagging, we add an error change detector for each classifier. We test our improvements by performing an evaluation study on synthetic and real-world datasets comprising up to ten million examples.
- by Albert Bifet
- Data Streams
- by Mohammad M. Masud and +1
- Data Mining, Cloud Computing, Data Streams, Intrusion Detection
Supervisory processes are fundamental when running data center operations striving for fault resilience: any downtime can directly affect the business's income and definitely its reputation. Current monitoring tools rely on experts to configure constant thresholds on single streams, which is not appropriate for dynamic systems and is insufficient to capture complex patterns. We present HOLMES, built to help data center experts anticipate failures with a solution that combines Event Driven Architecture, Complex Event Processing (CEP) and an unsupervised machine learning algorithm. Based on rules created by the users, the system continuously checks for known problems. Meanwhile, for the unknown ones, we leverage the CEP engine for aggregating and joining streams of real-time data to feed normalized input to FRAHST, our machine learning algorithm that detects anomalous patterns across multivariate numerical streams. We describe how the UI module also operates within the publish/subscribe paradigm to enhance situational awareness. The system was very well received and was successfully implemented at one of the largest Internet Service Providers in South America.
- by Albert Bifet
- Data Streams
This thesis is devoted to the design of data mining algorithms for evolving data streams and for the extraction of closed frequent trees. First, we deal with each of these tasks separately, and then we deal with them together, developing classification methods for data streams containing items that are trees. In the data stream model, data arrive at high speed,
- by Albert Bifet
- Data Streams
This work studies the issues related to dynamic memory management in Data Stream Processing, an emerging paradigm enabling the real-time processing of live data streams. In this paper we consider two streaming parallel patterns and we discuss different implementation variants related to how dynamic memory is managed. The results show that the standard mechanisms provided by modern C++ are not entirely adequate for maximizing performance. Instead, the combined use of an efficient general-purpose memory allocator, a custom allocator optimized for the pattern considered and a custom variant of the C++ shared pointer mechanism provides a performance improvement of up to 16% in the best case.
- by Stefan Schiffer and +1
- Stream Mining (Data Mining), Data Streams
Current online social networks are massive and still growing. For example, Facebook has over 500 million active users sharing over 30 billion items per month. The scale within these data streams has outstripped traditional graph analysis methods. Real-time monitoring for anomalies may require dynamic analysis rather than repeated static analysis. The massive state behind multiple persistent queries requires shared data structures and flexible representations. We present a framework based on the STINGER data structure ...
Most data stream classification methods need plenty of labeled samples to achieve a reasonable result. However, in a real data stream environment, labeled samples are crucial but expensive to obtain, unlike unlabeled ones. Although active learning is one way to tackle this challenge, it ignores the use of unlabeled instances, which can help strengthen supervised learning. This paper proposes a hybrid framework named "DSeSAL", which combines active learning and dynamic self-training to achieve the strengths of both. The framework also introduces variance-based self-training, which uses minimal variance as a confidence measure. Since an early mistake by the base classifier in self-training can reinforce itself by generating incorrectly labeled data, especially in multi-class conditions, a dynamic approach is considered to avoid deterioration of classifier accuracy. Another capability of the proposed framework is controlling the accuracy reduction by specifying a tolerance measure. To overcome data stream challenges, i.e., infinite length and evolving nature, we use the chunking method along with a classifier ensemble. A classifier is trained on each chunk and, together with previous classifiers, forms an ensemble of M such classifiers. Experimental results on synthetic and real-world data indicate the performance of the proposed framework in comparison with other approaches.
Techniques to handle traffic bursts and out-of-order arrivals are of paramount importance to provide real-time sensor data analytics in domains like traffic surveillance, transportation management, healthcare and security applications. In these systems the amount of raw data coming from sensors must be analyzed by continuous queries that extract value-added information used to make informed decisions in real-time. To perform this task with timing constraints, parallelism must be exploited in the query execution in order to enable the real-time processing on parallel architectures. In this paper we focus on continuous preference queries, a representative class of continuous queries for decision making, and we propose a parallel query model targeting the efficient processing over out-of-order and bursty data streams. We study how to integrate punctuation mechanisms in order to enable out-of-order processing. Then, we present advanced scheduling strategies targeting scenarios with different burstiness levels, parameterized using the index of dispersion quantity. Extensive experiments have been performed using synthetic datasets and real-world data streams obtained from an existing real-time locating system. The experimental evaluation demonstrates the efficiency of our parallel solution and its effectiveness in handling the out-of-orderness degrees and burstiness levels of real-world applications.
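The burstiness measure mentioned in this abstract can be illustrated with the usual definition of the index of dispersion: the variance of the arrival counts per time window divided by their mean. The sketch below assumes that definition (the paper's exact formulation may differ), and the function name and sample counts are hypothetical.

```python
from statistics import mean, pvariance

def index_of_dispersion(counts):
    """Variance-to-mean ratio of arrival counts per time window.

    Values near 1 resemble Poisson arrivals; larger values indicate bursts.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("no arrivals observed")
    return pvariance(counts) / m

steady = [5.0, 5.0, 5.0, 5.0]   # same load in every window
bursty = [0.0, 0.0, 20.0, 0.0]  # same total load, concentrated in one window
print(index_of_dispersion(steady), index_of_dispersion(bursty))
```

Both sample streams carry the same total load, so the ratio separates them purely by how unevenly the arrivals are spread over time.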