Esc: Towards an Elastic Stream Computing Platform for the Cloud

Building A Massive Stream Computing Platform For Flexible Applications

Driven by the rapid growth of large-scale real-time data mining applications for personalized ads and content recommendations, distributed stream processing systems are widely used in modern big-data architectures. Existing stream computing systems focus mostly on scalability and availability. Other issues that are essential to actual cost and productivity, such as handling fluctuating workloads, the efficiency of stream topology alterations, and the overlapping of computing topologies, are not well studied. To address these issues in a live, production environment, this paper proposes a new stream processing architecture based on a scalability-enhanced subscription model. We also present Vortex, a system implemented using this architecture. Vortex is a distributed stream computing system engineered to support flexible applications at Baidu. The new architecture enables Vortex to scale well under highly fluctuating workloads and to perform on-demand stream topology alterations with minimal overhead. Furthermore, Vortex's dynamic message routing mechanism allows one processing node to serve different stream topologies, maximizing computing resource utilization when topologies overlap. With these features, Vortex is a powerful platform for both real-time data processing and Map-Reduce job acceleration. Finally, we discuss applications at Baidu to demonstrate how Vortex can be deployed for stream computing workloads ranging from real-time analytics to efficient large-scale data mining.
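The abstract does not spell out the subscription model or the routing mechanism; as a rough, hypothetical sketch of how subscription-driven dynamic routing can let one processing node serve several overlapping topologies, the Python fragment below keys delivery off per-(topology, stream) subscriptions rather than a fixed wiring (all class and node names are invented for illustration).

```python
from collections import defaultdict

class SubscriptionRouter:
    """Toy router: nodes subscribe to (topology, stream) keys, so one
    node can receive messages from several overlapping topologies."""

    def __init__(self):
        self._subs = defaultdict(set)  # (topology, stream) -> set of node ids

    def subscribe(self, node_id, topology, stream):
        self._subs[(topology, stream)].add(node_id)

    def unsubscribe(self, node_id, topology, stream):
        self._subs[(topology, stream)].discard(node_id)

    def route(self, topology, stream, message):
        """Return (node, message) pairs for every current subscriber."""
        return [(node_id, message) for node_id in self._subs[(topology, stream)]]

# One worker node serving two topologies that share an input stream.
router = SubscriptionRouter()
router.subscribe("worker-1", topology="ads-ctr", stream="clicks")
router.subscribe("worker-1", topology="reco-rank", stream="clicks")

print(router.route("ads-ctr", "clicks", {"user": 42}))
print(router.route("reco-rank", "clicks", {"user": 42}))
```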

Cloud-based data stream processing

Proceedings of the 8th ACM International Conference on Distributed Event-Based Systems - DEBS '14, 2014

In this tutorial we present the results of recent research on the cloud enablement of data streaming systems. We illustrate, based on both industrial and academic prototypes, newly emerging use cases and research trends. Specifically, we focus on novel approaches to scalability and fault tolerance in large-scale distributed streaming systems. In general, new fault tolerance mechanisms strive to be more robust while introducing less overhead. Novel load balancing approaches focus on elastic scaling over hundreds of instances based on the data and query workload. Finally, we present open challenges for the next generation of cloud-based data stream processing engines.

StreamCloud: An Elastic and Scalable Data Streaming System

IEEE Transactions on Parallel and Distributed Systems, 2012

Many applications in domains such as telecommunications, network security, and large-scale sensor networks require online processing of continuous data flows. These applications produce very high loads that require aggregating the processing capacity of many nodes. Current Stream Processing Engines do not scale with the input load due to single-node bottlenecks. Additionally, they are based on static configurations that lead to either under- or over-provisioning.

Elastic stream computing with clouds

Cloud Computing (CLOUD), 2011 IEEE …, 2011

Stream computing, also known as data stream processing, has emerged as a new processing paradigm that processes incoming data streams from tremendous numbers of sensors in real time. Data stream applications must maintain low latency even when the incoming data rate fluctuates wildly. This is almost impossible in a local stream computing environment because its computational resources are finite. To address this problem, we have devised a method and an architecture that transfers data stream processing to a Cloud environment as required, in response to changes in the rate of the input data stream. Since a trade-off exists between an application's latency and the economic cost of using the Cloud environment, we treat it as an optimization problem that minimizes the economic cost of using the Cloud. We implemented a prototype system using Amazon EC2 and an IBM System S stream computing system to evaluate the effectiveness of our approach. Our experimental results show that our approach reduces costs by 80% while keeping the application's response latency low.
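The paper frames Cloud bursting as a cost-minimization problem under a latency constraint; the toy planner below only illustrates the flavour of such a decision (capacities, prices, and the function itself are assumptions, not the authors' formulation). It rents the fewest cloud instances needed to cover the load the local environment cannot absorb.

```python
import math

def plan_offload(input_rate, local_capacity, cloud_instance_capacity,
                 cloud_cost_per_hour):
    """Rent the fewest cloud instances that cover the load the local
    environment cannot absorb; returns (instances, hourly_cost)."""
    overflow = max(0.0, input_rate - local_capacity)
    instances = math.ceil(overflow / cloud_instance_capacity)
    return instances, instances * cloud_cost_per_hour

# Example: the local node handles 10k msg/s, input spikes to 23k msg/s,
# and each rented instance absorbs 5k msg/s at $0.50/hour.
print(plan_offload(23_000, 10_000, 5_000, 0.50))   # -> (3, 1.5)
```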

Elastic stream processing in the Cloud

Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2013

Stream processing is a computing paradigm that has emerged from the necessity of handling high volumes of data in real time. In contrast to traditional databases, stream processing systems perform continuous queries and handle data on-the-fly. Today, a wide range of application areas relies on efficient pattern detection and queries over streams. The advent of Cloud computing fosters the development of elastic stream processing platforms which are able to dynamically adapt based on different cost-benefit tradeoffs. This article provides an overview of the historical evolution and the key concepts of stream processing, with special focus on adaptivity and Cloud-based elasticity.

Efficient Stream Processing in the Cloud

Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, 2012

In recent years, many emerging online data analysis applications have required real-time delivery of streaming data while dealing with unpredictable increases in data volume. In this paper we propose a novel approach for efficiently processing bursts of stream data in the Cloud. Our approach uses two queues to schedule requests pending execution. When bursts occur, incoming requests that exceed the maximum processing capacity of the node are diverted to a secondary queue instead of being dropped. Requests in the secondary queue are scheduled concurrently with the primary queue, so that they can be executed immediately whenever the node has unused processing power as a result of burst fluctuations. With this mechanism, the processing power of the nodes is fully utilized and bursts are efficiently accommodated. Our experimental results illustrate the efficiency of our approach.
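A minimal sketch of the two-queue idea, assuming a fixed per-tick processing capacity (the class and its parameters are hypothetical, not the paper's implementation): overflow requests go to a secondary queue instead of being dropped and drain whenever spare capacity appears.

```python
from collections import deque

class TwoQueueScheduler:
    """Illustrative two-queue scheduler: requests beyond the node's
    per-tick capacity are diverted to a secondary queue rather than
    dropped, and execute whenever spare capacity appears."""

    def __init__(self, capacity_per_tick):
        self.capacity = capacity_per_tick
        self.primary = deque()
        self.secondary = deque()

    def submit(self, requests):
        room = max(0, self.capacity - len(self.primary))
        self.primary.extend(requests[:room])
        self.secondary.extend(requests[room:])

    def tick(self):
        """Execute up to `capacity` requests, preferring the primary queue."""
        executed = []
        while len(executed) < self.capacity and (self.primary or self.secondary):
            queue = self.primary if self.primary else self.secondary
            executed.append(queue.popleft())
        return executed

sched = TwoQueueScheduler(capacity_per_tick=3)
sched.submit(["r1", "r2", "r3", "r4", "r5"])  # burst of 5, capacity 3
print(sched.tick())  # ['r1', 'r2', 'r3']
print(sched.tick())  # ['r4', 'r5'] executed once the burst subsides
```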

Flexible Data Streaming In Stream Cloud

International Journal of Innovative Research in Science, Engineering and Technology, 2013

Applications in domains such as telecommunications, share markets, fraud detection, and network security require online processing of incoming data. They produce very high incoming loads that need to be processed by multiple nodes. Current systems suffer from single-node bottlenecks and static configurations, and hence cannot scale with the input load. In this paper we present StreamCloud, a highly flexible data stream processing engine for processing large data streams. It is particularly suitable for applications such as online transactions, financial data monitoring, and fraud detection systems that require timely processing of continuous data. StreamCloud uses a novel parallelization technique that splits a query into subqueries, which are independently allocated to individual nodes for execution. It also provides elastic protocols for dynamic resource management and load balancing of the incoming load.

Towards Elastic Stream Processing: Patterns and Infrastructure

Distributed, highly parallel processing frameworks such as Hadoop are deemed state-of-the-art for handling big data today, but they burden application developers with manually implementing program logic using low-level batch processing APIs. Thus, a movement can be observed towards high-level languages that allow developers to declaratively model dataflows which are automatically optimized and mapped to batch-processing backends. However, most of these systems are based on programming models such as MapReduce that provide elasticity and fault tolerance in a natural manner, since intermediate results are materialized and processes can therefore simply be restarted and scaled by partitioning input datasets. For continuous query processing on data streams, these concepts cannot be applied directly, since it must be guaranteed that no data is lost when nodes fail. Usually, these long-running queries contain operators that maintain state information which depends on the data that has already been processed, and hence they cannot be restarted without information loss. This is also an issue when streaming tasks should be scaled. Integrating elasticity and fault tolerance in this context is therefore a challenging task, and it is the subject of this paper. We show how common patterns from parallel and distributed algorithms can be applied to tackle these problems and how they are mapped to the Mesos cluster management system.
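One pattern this line of work relies on is checkpointing operator state so that a stateful operator can be restarted or re-partitioned without information loss. The sketch below illustrates that pattern in isolation; it is not the paper's implementation, and the word-count operator and its names are invented.

```python
import pickle

class CountingOperator:
    """Stateful operator (word counts) that checkpoints its state so it
    can be restarted, or its keyspace re-partitioned, without losing
    the information accumulated so far."""

    def __init__(self, state=None):
        self.counts = state or {}

    def process(self, word):
        self.counts[word] = self.counts.get(word, 0) + 1

    def checkpoint(self):
        return pickle.dumps(self.counts)

    @classmethod
    def restore(cls, snapshot):
        return cls(state=pickle.loads(snapshot))

op = CountingOperator()
for w in ["a", "b", "a"]:
    op.process(w)
snapshot = op.checkpoint()            # taken before a failure or rescale
recovered = CountingOperator.restore(snapshot)
print(recovered.counts)               # {'a': 2, 'b': 1}
```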

Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters

2012

Many important "big data" applications need to process data arriving in real time. However, current programming models for distributed stream processing are relatively low-level, often leaving the user to worry about consistency of state across the system and fault recovery. Furthermore, the models that provide fault recovery do so in an expensive manner, requiring either hot replication or long recovery times. We propose a new programming model, discretized streams (D-Streams), that offers a high-level functional programming API, strong consistency, and efficient fault recovery. D-Streams support a new recovery mechanism that improves efficiency over the traditional replication and upstream backup solutions in streaming databases: parallel recovery of lost state across the cluster. We have prototyped D-Streams in an extension to the Spark cluster computing framework called Spark Streaming, which lets users seamlessly intermix streaming, batch and interactive queries.
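The core idea of D-Streams is to cut the live stream into small, deterministic batches and run each as a short batch computation; the toy loop below imitates that idea on a plain iterator (it is not the Spark Streaming API, and the word-count job is an invented example).

```python
from itertools import islice

def discretize(records, batch_size):
    """Cut a (possibly endless) record iterator into small deterministic
    batches, the micro-batching idea behind D-Streams."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def batch_word_count(batch):
    counts = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    return counts

stream = ["a", "b", "a", "c", "b", "a", "c"]
for interval, batch in enumerate(discretize(stream, batch_size=3)):
    print(interval, batch_word_count(batch))
```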

StreamCloud: A Large Scale Data Streaming System

2010 IEEE 30th International Conference on Distributed Computing Systems, 2010

Data streaming has become an important paradigm for the real-time processing of continuous data flows in domains such as telecommunications, networking, . . . Some applications in these domains need to process massive data flows that current data streaming technology is unable to manage, that is, streams that require the capacity of potentially many nodes even for a single query operator. Current research efforts have mainly focused on scaling in the number of queries and/or query operators and have overlooked scalability with respect to the stream volume. In this paper we present StreamCloud, a large-scale data streaming system for processing large data stream volumes. The focus of the paper is on how to parallelize continuous queries to attain a highly scalable data streaming infrastructure. StreamCloud goes beyond the state of the art by using a novel parallelization technique that splits queries into subqueries which are allocated to independent sets of nodes in a way that minimizes the distribution overhead. StreamCloud is implemented as a middleware and is highly independent of the underlying data streaming engine. We explore and evaluate different strategies for parallelizing data streaming, and identify and tackle the main bottlenecks and overheads to achieve large scalability. The paper presents the system design, implementation, and a thorough evaluation of the scalability of the fully implemented system.
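As a loose illustration of the parallelization strategy described above, the sketch below hash-partitions tuples on the aggregation key so that each node-local subquery instance sees all tuples for the keys it owns; the example query, field names, and node count are invented, not taken from StreamCloud.

```python
def partition_by_key(tuples, key, num_nodes):
    """Hash-partition tuples on `key` so each downstream subquery node
    receives every tuple for the keys it owns."""
    shards = [[] for _ in range(num_nodes)]
    for t in tuples:
        shards[hash(t[key]) % num_nodes].append(t)
    return shards

def aggregate_subquery(shard):
    """Per-node subquery: sum traffic volume per source IP."""
    totals = {}
    for t in shard:
        totals[t["src"]] = totals.get(t["src"], 0) + t["bytes"]
    return totals

tuples = [{"src": "10.0.0.1", "bytes": 120},
          {"src": "10.0.0.2", "bytes": 80},
          {"src": "10.0.0.1", "bytes": 40}]

shards = partition_by_key(tuples, key="src", num_nodes=2)
print([aggregate_subquery(s) for s in shards])
```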