Cloud-based data stream processing

StreamCloud: An Elastic and Scalable Data Streaming System

IEEE Transactions on Parallel and Distributed Systems, 2012

Many applications in domains such as telecommunications, network security, and large-scale sensor networks require online processing of continuous data flows. These applications produce very high loads that require aggregating the processing capacity of many nodes. Current Stream Processing Engines do not scale with the input load due to single-node bottlenecks. Additionally, they are based on static configurations that lead to either under- or over-provisioning.

StreamCloud: A Large Scale Data Streaming System

2010 IEEE 30th International Conference on Distributed Computing Systems, 2010

Data streaming has become an important paradigm for the real-time processing of continuous data flows in domains such as telecommunications, networking, and others. Some applications in these domains require processing massive data flows that current data streaming technology is unable to manage, that is, streams that require the capacity of potentially many nodes even for a single query operator. Current research efforts have mainly focused on scaling in the number of queries and/or query operators, overlooking scalability with respect to the stream volume. In this paper we present StreamCloud, a large-scale data streaming system for processing large data stream volumes. The focus of the paper is on how to parallelize continuous queries to attain a highly scalable data streaming infrastructure. StreamCloud goes beyond the state of the art by using a novel parallelization technique that splits queries into subqueries that are allocated to independent sets of nodes in a way that minimizes the distribution overhead. StreamCloud is implemented as a middleware and is largely independent of the underlying data streaming engine. We explore and evaluate different strategies to parallelize data streaming and identify and tackle the main bottlenecks and overheads to achieve large scalability. The paper presents the system design, implementation, and a thorough evaluation of the scalability of the fully implemented system.
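
The core idea behind this parallelization, splitting a continuous query into subqueries and routing each tuple to the node that owns its partitioning key, can be sketched roughly as follows. This is a minimal illustration rather than StreamCloud's implementation; the hash-based router and the per-key counting subquery are assumptions made for the example.

```python
import hashlib
from collections import defaultdict

# Minimal sketch of key-based subquery partitioning (illustrative only).
# Each "node" runs the same stateful subquery (here, a per-key count) on
# its own partition; a content-aware router assigns tuples by key hash,
# mirroring the idea of splitting a query into subqueries allocated to
# independent sets of nodes.

NUM_NODES = 4

class SubqueryNode:
    """One parallel instance of a stateful subquery (a per-key counter)."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = defaultdict(int)   # operator state local to this node

    def process(self, key, value):
        self.counts[key] += value
        return key, self.counts[key]

def route(key, num_nodes=NUM_NODES):
    """Content-aware routing: the same key always reaches the same node."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_nodes

nodes = [SubqueryNode(i) for i in range(NUM_NODES)]

# Feed a toy stream of (key, value) tuples through the partitioned subquery.
stream = [("alice", 1), ("bob", 1), ("alice", 1), ("carol", 1), ("alice", 1)]
for key, value in stream:
    node = nodes[route(key)]
    print(node.node_id, node.process(key, value))
```

Because tuples with the same key always land on the same node, each node's state stays disjoint, which is what allows a stateful subquery to scale out without cross-node coordination.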

Efficient Stream Processing in the Cloud

Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, 2012

In recent years, many emerging online data analysis applications require real-time delivery of streaming data while dealing with unpredictable increases in data volume. In this paper we propose a novel approach for efficiently processing stream bursts in the Cloud. Our approach uses two queues to schedule requests pending execution. When bursts occur, incoming requests that exceed the maximum processing capacity of the node, instead of being dropped, are diverted to a secondary queue. Requests in the secondary queue are scheduled concurrently with the primary queue, so that they can be executed immediately whenever the node has processing power left unused as a result of burst fluctuations. With this mechanism, the processing power of nodes is fully utilized and bursts are efficiently accommodated. Our experimental results illustrate the efficiency of our approach.
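
The two-queue mechanism can be sketched as follows. This is a hedged illustration under assumed names (CAPACITY, handle(), a tick-based scheduling loop); the paper's actual scheduler and admission policy may differ.

```python
from collections import deque

# Illustrative two-queue admission and scheduling loop (not the paper's code).
# Requests beyond the node's per-tick capacity are diverted to a secondary
# queue instead of being dropped; spare capacity in later ticks drains it.

CAPACITY = 3  # assumed maximum requests the node can process per tick

primary = deque()
secondary = deque()

def admit(request):
    """Admit a request: overflow goes to the secondary queue, not the floor."""
    if len(primary) < CAPACITY:
        primary.append(request)
    else:
        secondary.append(request)

def handle(request):
    print("processed", request)

def tick():
    """One scheduling round: serve the primary queue first, then spend any
    leftover capacity on requests waiting in the secondary queue."""
    budget = CAPACITY
    while primary and budget > 0:
        handle(primary.popleft())
        budget -= 1
    while secondary and budget > 0:
        handle(secondary.popleft())
        budget -= 1

# A burst of 5 requests against capacity 3: two are deferred, none dropped.
for i in range(5):
    admit(f"req-{i}")
tick()   # processes req-0 .. req-2
tick()   # spare capacity now drains req-3 and req-4
```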

Elastic stream processing in the Cloud

Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2013

Stream processing is a computing paradigm that has emerged from the necessity of handling high volumes of data in real time. In contrast to traditional databases, stream processing systems perform continuous queries and handle data on-the-fly. Today, a wide range of application areas relies on efficient pattern detection and queries over streams. The advent of Cloud computing fosters the development of elastic stream processing platforms which are able to dynamically adapt based on different cost-benefit tradeoffs. This article provides an overview of the historical evolution and the key concepts of stream processing, with special focus on adaptivity and Cloud-based elasticity.

Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters

2012

Many important "big data" applications need to process data arriving in real time. However, current programming models for distributed stream processing are relatively low-level, often leaving the user to worry about consistency of state across the system and fault recovery. Furthermore, the models that provide fault recovery do so in an expensive manner, requiring either hot replication or long recovery times. We propose a new programming model, discretized streams (D-Streams), that offers a high-level functional programming API, strong consistency, and efficient fault recovery. D-Streams support a new recovery mechanism that improves efficiency over the traditional replication and upstream backup solutions in streaming databases: parallel recovery of lost state across the cluster. We have prototyped D-Streams in an extension to the Spark cluster computing framework called Spark Streaming, which lets users seamlessly intermix streaming, batch and interactive queries.

Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing

2012

Many "big data" applications need to act on data arriving in real time. However, current programming models for distributed stream processing are relatively low-level, often leaving the user to worry about consistency of state across the system and fault recovery. Furthermore, the models that provide fault recovery do so in an expensive manner, requiring either hot replication or long recovery times. We propose a new programming model, discretized streams (D-Streams), that offers a high-level functional API, strong consistency, and efficient fault recovery. D-Streams support a new recovery mechanism that improves efficiency over the traditional replication and upstream backup schemes in streaming databasesparallel recovery of lost state-and unlike previous systems, also mitigate stragglers. We implement D-Streams as an extension to the Spark cluster computing engine that lets users seamlessly intermix streaming, batch and interactive queries. Our system can process over 60 million records/second at sub-second latency on 100 nodes.

MillWheel: Fault-Tolerant Stream Processing at Internet Scale

MillWheel is a framework for building low-latency data-processing applications that is widely used at Google. Users specify a directed computation graph and application code for individual nodes, and the system manages persistent state and the continuous flow of records, all within the envelope of the framework's fault-tolerance guarantees.
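
MillWheel itself is internal to Google, so the following is only a toy illustration of the "directed computation graph with per-node application code and persistent state" model the abstract describes; every class and name here is hypothetical and not MillWheel's API.

```python
# Toy sketch of a directed computation graph with per-node state.
# NOT MillWheel's API; all names are hypothetical.

class Computation:
    """A graph node: user code goes in process(), outputs flow downstream."""
    def __init__(self, downstream=None):
        self.downstream = downstream or []

    def emit(self, record):
        for node in self.downstream:
            node.process(record)

    def process(self, record):
        raise NotImplementedError

class Parser(Computation):
    def process(self, record):
        self.emit(record.strip().lower())

class Deduplicator(Computation):
    """Keeps state (keys already seen) across records; in a real system this
    state would be persisted by the framework, not held in memory."""
    def __init__(self, downstream=None):
        super().__init__(downstream)
        self.seen = set()

    def process(self, record):
        if record not in self.seen:
            self.seen.add(record)
            self.emit(record)

class Printer(Computation):
    def process(self, record):
        print("output:", record)

# Wire the graph Parser -> Deduplicator -> Printer and push records through.
graph = Parser([Deduplicator([Printer()])])
for rec in ["Click ", "click", "View"]:
    graph.process(rec)
```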

Esc: Towards an Elastic Stream Computing Platform for the Cloud

2011 IEEE 4th International Conference on Cloud Computing, 2011

Today, most tools for processing big data are batch-oriented. However, many scenarios require continuous, online processing of data streams and events. We present ESC, a new stream computing engine. It is designed for computations with real-time demands, such as online data mining. It offers a simple programming model in which programs are specified by directed acyclic graphs (DAGs). The DAG defines the data flow of a program; vertices represent operations applied to the data. The data streaming through the graph are expressed as key/value pairs. ESC allows programmers to focus on the problem at hand and deals with distribution and fault tolerance. Furthermore, it is able to adapt to changing computational demands. In the cloud, ESC can dynamically attach and release machines to adjust its computational capacity to the current needs. This is crucial for stream computing since the amount of data fed into the system is not under the platform's control. We substantiate the concepts proposed in this paper with an evaluation based on a high-frequency trading scenario.
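
The elasticity aspect, attaching and releasing machines as the input rate changes, can be sketched with a toy scaling policy. The thresholds and the one-worker-per-step rule below are assumptions made for the illustration, not ESC's actual algorithm.

```python
# Toy autoscaling policy: add a worker when the backlog per worker grows too
# large, release one when it shrinks. Thresholds are illustrative assumptions.

MIN_WORKERS, MAX_WORKERS = 1, 8
SCALE_UP_AT = 1000    # pending tuples per worker before attaching a machine
SCALE_DOWN_AT = 200   # pending tuples per worker before releasing a machine

def rescale(pending_tuples, workers):
    per_worker = pending_tuples / workers
    if per_worker > SCALE_UP_AT and workers < MAX_WORKERS:
        return workers + 1   # attach a machine
    if per_worker < SCALE_DOWN_AT and workers > MIN_WORKERS:
        return workers - 1   # release a machine
    return workers

workers = 2
for load in [500, 3000, 6000, 4000, 800, 300]:   # pending tuples over time
    workers = rescale(load, workers)
    print(f"load={load:5d}  workers={workers}")
```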

Towards Elastic Stream Processing: Patterns and Infrastructure

Distributed, highly parallel processing frameworks such as Hadoop are deemed to be state of the art for handling big data today. But they burden application developers with the task of manually implementing program logic using low-level batch processing APIs. Thus, a movement can be observed towards high-level languages that allow dataflows to be modeled declaratively and then automatically optimized and mapped to the batch-processing backends. However, most of these systems are based on programming models such as MapReduce that provide elasticity and fault tolerance in a natural manner, since intermediate results are materialized and, therefore, processes can simply be restarted and scaled by partitioning input datasets. For continuous query processing on data streams, these concepts cannot be applied directly, since it must be guaranteed that no data is lost when nodes fail. Usually, these long-running queries contain operators that maintain state information which depends on the data that has already been processed, and hence they cannot be restarted without information loss. This is also an issue when streaming tasks should be scaled. Therefore, integrating elasticity and fault tolerance in this context is a challenging task, which is the subject of this paper. We show how common patterns from parallel and distributed algorithms can be applied to tackle these problems and how they are mapped to the Mesos cluster management system.
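
One such pattern, checkpointing operator state so that a stateful streaming operator can be restarted or repartitioned without information loss, can be sketched as follows. This is a generic illustration rather than the paper's Mesos-based implementation; the file path, pickle format, and checkpoint interval are assumptions.

```python
import os
import pickle

# Generic "checkpoint operator state" pattern: a stateful streaming operator
# periodically persists its state so that recovery or rescaling does not lose
# information about data that has already been processed.

CHECKPOINT_PATH = "/tmp/windowed_sum.ckpt"   # assumed location
CHECKPOINT_EVERY = 100                       # tuples between checkpoints

class WindowedSum:
    def __init__(self):
        self.state = {}        # per-key running sums
        self.processed = 0

    def process(self, key, value):
        self.state[key] = self.state.get(key, 0) + value
        self.processed += 1
        if self.processed % CHECKPOINT_EVERY == 0:
            self.checkpoint()

    def checkpoint(self):
        with open(CHECKPOINT_PATH, "wb") as f:
            pickle.dump(self.state, f)

    @classmethod
    def recover(cls):
        op = cls()
        if os.path.exists(CHECKPOINT_PATH):
            with open(CHECKPOINT_PATH, "rb") as f:
                op.state = pickle.load(f)
        return op

# After a failure (or when scaling), a fresh instance resumes from the last
# checkpoint instead of reprocessing the entire stream.
op = WindowedSum.recover()
op.process("sensor-1", 4.2)
```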

Flexible Data Streaming In Stream Cloud

International Journal of Innovative Research in Science, Engineering and Technology, 2013

Many applications in domains such as telecommunication systems, stock markets, fraud detection, and network security require online processing of incoming data. They produce very high incoming loads that need to be processed by multiple nodes. Current systems suffer from single-node bottlenecks and static configurations, and hence cannot scale with the input load. In this paper we present StreamCloud, a highly flexible data stream processing engine for processing bulky data streams. It is particularly suitable for applications such as online transactions, financial data monitoring, and fraud detection systems that require timely processing of continuous data. StreamCloud uses a novel parallelization technique that splits a query into subqueries which are independently allocated to individual nodes for execution. It also provides elastic protocols for dynamic resource management and load balancing of the incoming load.