Stream processing platforms for analyzing big dynamic data

Real-time stream processing for Big Data

it - Information Technology, 2016

With the rise of Web 2.0 and the Internet of Things, it has become feasible to track all kinds of information over time, in particular fine-grained user activities, sensor data on users' environment, and even their biometrics. However, while efficiency remains mandatory for any application trying to cope with huge amounts of data, only part of the potential of today's Big Data repositories can be exploited using traditional batch-oriented approaches, as the value of data often decays quickly and high latency becomes unacceptable in some applications. In the last couple of years, several distributed data processing systems have emerged that deviate from the batch-oriented approach and tackle data items as they arrive, thus acknowledging the growing importance of timeliness and velocity in Big Data analytics. In this article, we give an overview of the state of the art of stream processors for low-latency Big Data analytics and conduct a qualitative comparison of the most po...

A stream processing abstraction framework

Frontiers in Big Data

Real-time analysis of large multimedia streams is nowadays made practical by several Big Data streaming platforms, such as Apache Flink and Samza. However, such platforms remain difficult to use, because the facilities they offer are often too raw to be effectively exploited by analysts. We describe the evolution of RAM3S, a software infrastructure for the integration of Big Data stream processing platforms, into SPAF, an abstraction framework able to provide programmers with a simple but powerful API that eases the development of stream processing applications. By using SPAF, the programmer can easily implement complex real-time analyses of massive streams on top of a distributed computing infrastructure able to manage the volume and velocity of Big Data streams, thus effectively transforming data into value.
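
The abstract above does not spell out the SPAF API itself; purely as an illustration of the kind of abstraction such a framework provides, the following sketch shows a minimal map/filter/sink stream interface in Java, with a trivial in-memory backend standing in for the distributed runtime. All names here are hypothetical, not taken from SPAF.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch (not the actual SPAF API): the analyst composes
// per-item steps; the framework maps them onto a distributed backend
// such as Flink or Samza. Here a trivial in-memory "backend" stands in.
interface DataStream<T> {
    <R> DataStream<R> map(Function<T, R> fn);   // transform each item
    DataStream<T> filter(Predicate<T> keep);    // drop unwanted items
    void sink(Consumer<T> out);                 // terminal operation
}

final class LocalStream<T> implements DataStream<T> {
    private final List<T> items;                // in-memory stand-in for an unbounded stream
    LocalStream(List<T> items) { this.items = items; }
    public <R> DataStream<R> map(Function<T, R> fn) {
        return new LocalStream<>(items.stream().map(fn).toList());
    }
    public DataStream<T> filter(Predicate<T> keep) {
        return new LocalStream<>(items.stream().filter(keep).toList());
    }
    public void sink(Consumer<T> out) { items.forEach(out); }
}

public class SpafSketch {
    public static void main(String[] args) {
        new LocalStream<>(List.of("ok", "ALERT: fire", "ok", "alert: smoke"))
            .map(String::toLowerCase)
            .filter(s -> s.contains("alert"))
            .sink(System.out::println);         // prints the two alert lines
    }
}
```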

PiCo: A Novel Approach to Stream Data Analytics

Lecture Notes in Computer Science, 2018

In this paper, we present a new C++ API with a fluent interface called PiCo (Pipeline Composition). PiCo's programming model aims at making data analytics applications easier to program while preserving or enhancing their performance. This is attained through three key design choices: 1) unifying batch and stream data access models, 2) decoupling processing from data layout, and 3) exploiting a stream-oriented, scalable, efficient C++11 runtime system. PiCo proposes a programming model based on pipelines and operators that are polymorphic with respect to data types, in the sense that it is possible to re-use the same algorithms and pipelines on different data models (e.g., streams, lists, sets, etc.). Preliminary results show that PiCo can attain better execution times and hugely improved memory utilization when compared to Spark and Flink in both batch and stream processing.
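
PiCo itself is a C++11 fluent API; the following Java sketch is only an analogy to illustrate the design idea named above, namely a pipeline that is defined once and reused unchanged on both a batch (finite) and a stream (unbounded) input. Names and structure are ours, not PiCo's.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Stream;

// Illustrative Java analogue of PiCo's unified batch/stream model:
// the pipeline is defined once, independent of the data model, and
// applied to both a finite collection and an unbounded generator.
public class PicoStyleSketch {
    // The pipeline: tokenize, keep non-empty words, normalize case.
    static Function<Stream<String>, Stream<String>> pipeline =
        lines -> lines.flatMap(l -> Stream.of(l.split("\\s+")))
                      .filter(w -> !w.isEmpty())
                      .map(String::toLowerCase);

    public static void main(String[] args) {
        // Batch: a finite collection.
        pipeline.apply(List.of("Hello Streams", "Hello Batch").stream())
                .forEach(System.out::println);

        // Stream: an unbounded generator (limited here so the demo ends).
        pipeline.apply(Stream.generate(() -> "Tick Tock").limit(3))
                .forEach(System.out::println);
    }
}
```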

M3: Stream Processing on Main-Memory MapReduce

2012 IEEE 28th International Conference on Data Engineering, 2012

The continuous growth of social web applications, along with the development of sensor capabilities in electronic devices, is creating countless opportunities to analyze the enormous amounts of data continuously streaming from these applications and devices. To process large-scale data on large-scale computing clusters, MapReduce has been introduced as a framework for parallel computing. However, most current implementations of the MapReduce framework support only the execution of fixed-input jobs. This restriction makes these implementations inapplicable to most streaming applications, in which queries are continuous in nature and input data streams are continuously received at high arrival rates. In this demonstration, we showcase M3, a prototype implementation of the MapReduce framework in which continuous queries over streams of data can be efficiently answered. M3 extends Hadoop, the open-source implementation of MapReduce, bypassing the Hadoop Distributed File System (HDFS) to support main-memory-only processing. Moreover, M3 supports continuous execution of the Map and Reduce phases, where individual Mappers and Reducers never terminate.
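
To make the "never-terminating Mappers and Reducers" idea concrete, here is a toy Java sketch (not M3's actual code, which extends Hadoop): map and reduce run as long-lived loops over in-memory queues, with no file-system materialization between the phases. A sentinel value is used only so the demo terminates.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Toy sketch of continuous MapReduce: long-lived Map and Reduce loops
// connected by in-memory queues, producing a running aggregate.
public class ContinuousMapReduceSketch {
    static final String EOS = "\u0000EOS";  // sentinel so the demo terminates

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> input = new LinkedBlockingQueue<>();
        BlockingQueue<String> shuffled = new LinkedBlockingQueue<>();
        Map<String, Integer> counts = new ConcurrentHashMap<>();

        Thread mapper = new Thread(() -> {          // continuously running Map phase
            try {
                for (String line = input.take(); !line.equals(EOS); line = input.take())
                    for (String w : line.split("\\s+")) shuffled.put(w);
                shuffled.put(EOS);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread reducer = new Thread(() -> {         // continuously running Reduce phase
            try {
                for (String w = shuffled.take(); !w.equals(EOS); w = shuffled.take())
                    counts.merge(w, 1, Integer::sum);   // running, incremental aggregate
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        mapper.start(); reducer.start();
        input.put("a b a"); input.put("b c");
        input.put(EOS);
        mapper.join(); reducer.join();
        System.out.println(counts);                 // e.g. {a=2, b=2, c=1}
    }
}
```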

Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience

Proceedings of The Vldb Endowment

Increasingly, organizations capture, transform and analyze enormous data sets. Prominent examples include internet companies and e-science. The Map-Reduce scalable dataflow paradigm has become popular for these applications. Its simple, explicit dataflow programming model is favored by some over the traditional high-level declarative approach: SQL. On the other hand, the extreme simplicity of Map-Reduce leads to much low-level hacking to deal with the many-step, branching dataflows that arise in practice. Moreover, users must repeatedly code standard operations such as join by hand. These practices waste time, introduce bugs, harm readability, and impede optimizations.
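
As a concrete instance of the "standard operations coded by hand" the abstract mentions: a join that Pig expresses in roughly one line (JOIN users BY id, clicks BY uid;) must otherwise be written out manually. The following in-memory Java hash join is an illustrative stand-in for that boilerplate (a real hand-coded MapReduce join would be longer still).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// The kind of hand-coded join a high-level dataflow layer like Pig
// eliminates: build an index on one relation, probe with the other.
public class HandCodedJoin {
    record User(int id, String name) {}
    record Click(int uid, String url) {}

    public static void main(String[] args) {
        List<User> users = List.of(new User(1, "ann"), new User(2, "bob"));
        List<Click> clicks = List.of(new Click(1, "/a"), new Click(1, "/b"), new Click(2, "/c"));

        // Build phase: index the smaller relation by the join key.
        Map<Integer, String> byId = new HashMap<>();
        for (User u : users) byId.put(u.id(), u.name());

        // Probe phase: scan the larger relation and emit matches.
        for (Click c : clicks) {
            String name = byId.get(c.uid());
            if (name != null) System.out.println(name + " -> " + c.url());
        }
    }
}
```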

Watershed-ng: an extensible distributed stream processing framework

Concurrency and Computation: Practice and Experience, 2016

Most high-performance data processing (a.k.a. big data) systems allow users to express their computation using abstractions (like MapReduce) that simplify the extraction of parallelism from applications. Most frameworks, however, do not allow users to specify how communication takes place: that element is deeply embedded into the run-time system abstractions, making changes hard to implement. In this work, we describe Watershed-ng, our re-engineering of the Watershed system, a framework based on the filter-stream paradigm and originally focused on continuous stream processing. Like other big-data environments, Watershed provided object-oriented abstractions to express computation (filters), but the implementation of streams was a run-time system element. By isolating stream functionality into appropriate classes, combining communication patterns and reusing common message-handling functions (like compression and blocking) become possible. The new architecture even allows the design of new communication patterns, for example allowing users to choose MPI, TCP, or shared-memory implementations of communication channels as their problem demands. Applications designed for the new interface showed reductions in code size on the order of 50% and above in some cases. The performance results also showed significant improvements, because some implementation bottlenecks were removed in the re-engineering process.
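
To illustrate the design move described above, the sketch below isolates the channel behind an ordinary interface; the class names are illustrative and are not Watershed-ng's actual API. Once the channel is a class, implementations (shared memory here; MPI or TCP analogously) become swappable, and message-handling concerns can be layered as decorators.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Communication as a first-class, swappable abstraction rather than a
// fixed run-time element (names illustrative, not Watershed-ng's).
interface Channel<T> {
    void send(T msg);
    T receive();
}

// One possible implementation: an in-process shared-memory queue.
// MPI- or TCP-backed variants would implement the same interface.
final class SharedMemoryChannel<T> implements Channel<T> {
    private final Queue<T> q = new ArrayDeque<>();
    public synchronized void send(T msg) { q.add(msg); }
    public synchronized T receive() { return q.poll(); }
}

// A reusable decorator layering a message-handling concern over any channel,
// in the spirit of the compression/blocking functions the abstract mentions.
final class LoggingChannel<T> implements Channel<T> {
    private final Channel<T> inner;
    LoggingChannel(Channel<T> inner) { this.inner = inner; }
    public void send(T msg) { System.out.println("send: " + msg); inner.send(msg); }
    public T receive() { return inner.receive(); }
}

public class WatershedStyleSketch {
    public static void main(String[] args) {
        Channel<String> ch = new LoggingChannel<>(new SharedMemoryChannel<>());
        ch.send("record-1");
        System.out.println("got: " + ch.receive());
    }
}
```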

Towards Elastic Stream Processing: Patterns and Infrastructure

Distributed, highly parallel processing frameworks such as Hadoop are deemed state-of-the-art for handling big data today, but they burden application developers with the task of manually implementing program logic against low-level batch-processing APIs. In response, high-level languages have been developed that allow dataflows to be modeled declaratively and then automatically optimized and mapped to batch-processing backends. Most of these systems, however, are based on programming models such as MapReduce, which provide elasticity and fault tolerance naturally: intermediate results are materialized, so processes can simply be restarted, and scaling is achieved by partitioning input datasets. For continuous query processing on data streams, these concepts cannot be applied directly, since it must be guaranteed that no data is lost when nodes fail. Long-running queries usually contain operators that maintain state depending on the data already processed, and hence they cannot be restarted without information loss; the same issue arises when streaming tasks are to be scaled. Integrating elasticity and fault tolerance in this context is therefore a challenging task, and it is the subject of this paper. We show how common patterns from parallel and distributed algorithms can be applied to tackle these problems and how they are mapped to the Mesos cluster management system.

Data stream management systems (DSMS) such as IBM InfoSphere Streams or our own AnduIN engine provide abstractions to process continuous and possibly infinite streams of data instead of disk-resident datasets. Typically, this includes standard (relational) query operators, window-based operators for computing joins and aggregations, as well as more advanced data analytics and data mining operators working on portions of the stream, e.g., windows or synopses of data. Complex event processing (CEP) systems particularly support the identification of event patterns in (temporal) streams of data, such as a sequence of specific event types within a given time interval. Typically, systems of both classes provide a declarative interface, either in the form of SQL-like query languages such as CQL for DSMS, event languages such as SASE, or in the form of dataflow specifications such as SPL in IBM InfoSphere Streams. Recently, several new distributed stream computing platforms have been developed, aiming to provide scalable and fault-tolerant operation in cluster environments; examples are Apache S4 and Storm. In contrast to DSMS or CEP engines, these platforms do not (yet) provide declarative interfaces and therefore require programming applications rather than writing queries. Developers of these systems argue that they provide for stream processing what Hadoop did for batch processing, which raises the hope of a similar movement towards higher-level languages, as we can see with Pig, Jaql, etc. for MapReduce. However, some challenges in scalable and elastic stream processing differ from batch processing with Hadoop. Whereas in Hadoop input data as well as intermediate results are materialized on disk and, therefore, ...
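
The following minimal Java sketch illustrates why stateful streaming operators resist the restart-based recovery that works for batch jobs, and the checkpoint/restore shape that elasticity patterns build on. The code is ours, for illustration only; it is not from the paper.

```java
import java.util.HashMap;
import java.util.Map;

// A windowed counter carries state that depends on everything seen so
// far, so a naive restart loses it; checkpointing and restoring that
// state is what makes fault tolerance and scaling possible.
public class StatefulOperatorSketch {
    private Map<String, Long> countsInWindow = new HashMap<>();

    void process(String key) {                        // called per stream element
        countsInWindow.merge(key, 1L, Long::sum);
    }

    Map<String, Long> checkpoint() {                  // persist before migration/failure
        return new HashMap<>(countsInWindow);
    }

    void restore(Map<String, Long> snapshot) {        // resume on another node
        countsInWindow = new HashMap<>(snapshot);
    }

    public static void main(String[] args) {
        StatefulOperatorSketch op = new StatefulOperatorSketch();
        op.process("a"); op.process("a"); op.process("b");
        Map<String, Long> snap = op.checkpoint();     // {a=2, b=1}

        StatefulOperatorSketch replacement = new StatefulOperatorSketch();
        replacement.restore(snap);                    // state survives the "failure"
        replacement.process("a");
        System.out.println(replacement.checkpoint()); // {a=3, b=1}
    }
}
```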

Defining the execution semantics of stream processing engines

Journal of Big Data, 2017

Several modern data-intensive applications need to process large volumes of data on the fly, as they are produced. Examples range from credit card fraud detection systems, which analyze massive streams of credit card transactions to identify suspicious patterns, to environmental monitoring applications that continuously analyze sensor data, to click-stream analysis of Web sites to identify frequent patterns of interaction. More generally, stream processing is a central requirement in today's information systems. This state of affairs has pushed the development of several stream processing engines (SPEs) that continuously analyze streams of data to produce new results as new elements enter the streams. Unfortunately, existing SPEs adopt different processing models, and standardized execution semantics have not yet emerged. This severely hampers the usability...

Scalable and Low-Latency Data Processing with Stream MapReduce

2011 IEEE Third International Conference on Cloud Computing Technology and Science, 2011

We present StreamMapReduce, a data processing approach that combines ideas from the popular MapReduce paradigm with recent developments in Event Stream Processing. We adopted the simple and scalable programming model of MapReduce and added continuous, low-latency data processing capabilities previously found only in Event Stream Processing systems. This combination leads to a system that is efficient and scalable but, at the same time, simple from the user's point of view. For latency-critical applications, our system allows a hundred-fold improvement in response time. When throughput is considered, our system nonetheless offers a tenfold per-node throughput increase compared to Hadoop. As a result, we show that our approach addresses classes of applications not supported by any other existing system and that the MapReduce paradigm is indeed suitable for scalable processing of real-time data streams.
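
As an illustration of where the latency advantage over batch Hadoop comes from (our sketch, not StreamMapReduce's code): a continuous reducer can emit a result per time window as soon as the window closes, instead of a single result at job end.

```java
import java.util.HashMap;
import java.util.Map;

// Tumbling-window aggregation: results are emitted continuously,
// one per closed window, rather than once when the job finishes.
public class WindowedReduceSketch {
    public static void main(String[] args) {
        long windowMillis = 1_000;                    // tumbling 1-second windows
        long[][] events = {                           // (timestamp, key)
            {100, 0}, {300, 1}, {700, 0},             // fall in window [0, 1000)
            {1200, 1}, {1900, 1},                     // fall in window [1000, 2000)
        };

        long currentWindow = 0;
        Map<Long, Long> counts = new HashMap<>();
        for (long[] e : events) {
            long window = e[0] / windowMillis;
            if (window != currentWindow) {            // window closed: emit result early
                System.out.println("window " + currentWindow + ": " + counts);
                counts.clear();
                currentWindow = window;
            }
            counts.merge(e[1], 1L, Long::sum);
        }
        System.out.println("window " + currentWindow + ": " + counts);
    }
}
```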

Beyond Hadoop: The Paradigm Shift of Data From Stationary to Streaming Data for Data Analytics

2017

The paradigm shift of data from static to fast-flowing data is an important move in the industry, to accommodate the growing size of data. The velocity and volume of data continue to expand, which has started to make an impact in business and other applications of Big Data. The paper describes the paradigm shift from static data to streaming data for data analytics beyond Hadoop. It describes how the first generation of Hadoop applications was largely built for a batch-oriented paradigm. Streaming data is essentially different from traditional data handling patterns and comes with its own set of challenges and requirements. New technologies such as Storm, Flume, and Kafka are evolving to bring in an era of real-time analytics. Data is generated incessantly from thousands of sources simultaneously and can be of various types, such as log files, mobile and web data, transactions, etc. The sections of my paper are Introduction, followed by Streaming data, Had...