Defining the execution semantics of stream processing engines
Related papers
SECRET: A Model for Analysis of the Execution Semantics of Stream Processing Systems
There are many academic and commercial stream processing engines (SPEs) today, each of them with its own execution semantics. This variation may lead to seemingly inexplicable differences in query results. In this paper, we present SECRET, a model of the behavior of SPEs. SECRET is a descriptive model that allows users to analyze the behavior of systems and understand the results of window-based queries for a broad range of heterogeneous SPEs. The model is the result of extensive analysis and experimentation with several commercial and academic engines. In the paper, we describe the types of heterogeneity found in existing engines, and show with experiments on real systems that our model can explain the key differences in windowing behavior.
2015
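To make the kind of divergence SECRET explains concrete, the sketch below contrasts two plausible window-alignment policies for the same time-based tumbling-window query. The policies and names are illustrative assumptions, not SECRET's formal tick/scope/content/report functions.

```python
# Illustrative only: two plausible window-alignment policies that real
# engines differ on; names and policies are assumptions, not SECRET's
# formal tick/scope/content/report functions.

def tumbling_scopes(t_first, t_query, size, align_to_epoch):
    """Return the [start, end) intervals of time-based tumbling windows
    covering application time up to t_query."""
    start = 0 if align_to_epoch else t_first
    scopes = []
    while start <= t_query:
        scopes.append((start, start + size))
        start += size
    return scopes

def window_contents(stream, scopes):
    """Assign (timestamp, value) tuples to each window scope."""
    return [[v for (t, v) in stream if lo <= t < hi] for (lo, hi) in scopes]

# Same input stream, same "size 10" tumbling-window query.
stream = [(3, 'a'), (8, 'b'), (12, 'c'), (19, 'd')]

for align in (True, False):
    scopes = tumbling_scopes(t_first=3, t_query=19, size=10, align_to_epoch=align)
    print(align, list(zip(scopes, window_contents(stream, scopes))))
# Epoch-aligned windows:  [0,10) -> [a, b],    [10,20) -> [c, d]
# First-tuple-aligned:    [3,13) -> [a, b, c], [13,23) -> [d]
```

The same query over the same input yields different window contents under the two policies, which is exactly the sort of behavioral difference the model is built to explain.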
An Analysis of Stream Processing Languages
2009
Stream processing languages and stream processing engines have become more popular as they emerged from several modern data-stream-intensive applications such as sensor monitoring, stock markets, and network monitoring. This study discusses the characteristics and features of stream processing technology to provide in-depth, high-level guidance and comparison for stream processing systems and their underlying languages and technology with respect to the characteristics and features used by certain applications. The overall aim of this paper is to analyze and identify the desired features of stream processing languages and to evaluate a few representative stream processing systems and languages on the basis of those desired features. The analysis could help in identifying a suitable stream processing technology for particular applications, as well as aiding the design and development of such languages for new, emerging applications.
Design principles for developing stream processing applications
Software: Practice and Experience, 2010
Stream processing applications are used to ingest, process, and analyze continuous data streams from heterogeneous sources of live and stored data, generating streams of output results. These applications are, in many cases, complex, large-scale, low-latency, and distributed in nature. In this paper, we describe the design principles and architectural underpinnings of stream processing applications. These principles are distilled from our experience in building real-world applications both for internal use and with customers from several industrial and academic domains. We provide principles and guidelines, as well as appropriate implementation examples, to highlight the different aspects of stream processing application design and development.
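As a minimal illustration of the ingest/process/output pattern the abstract describes, here is a sketch of a three-stage pipeline; the source, operator, and sink are hypothetical stand-ins, not the paper's reference architecture.

```python
# A minimal sketch of the ingest -> process -> output pattern described
# above; the source, operator, and sink here are hypothetical stand-ins.

def ingest(lines):
    """Source: parse a raw feed into (sensor_id, reading) tuples."""
    for line in lines:
        sensor_id, reading = line.split(",")
        yield sensor_id, float(reading)

def process(tuples, threshold):
    """Operator: keep only readings above a threshold."""
    for sensor_id, reading in tuples:
        if reading > threshold:
            yield sensor_id, reading

def output(tuples):
    """Sink: emit a continuous stream of results."""
    for sensor_id, reading in tuples:
        print(f"ALERT {sensor_id}: {reading}")

feed = ["s1,21.5", "s2,38.0", "s1,40.2"]   # stands in for a live source
output(process(ingest(feed), threshold=30.0))
# ALERT s2: 38.0
# ALERT s1: 40.2
```

Because each stage is a generator, tuples flow through one at a time, mirroring the continuous, low-latency processing the paper's principles target.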
A stream processing abstraction framework
Frontiers in Big Data
Real-time analysis of large multimedia streams is nowadays made efficient by the existence of several Big Data streaming platforms, like Apache Flink and Samza. However, such platforms are difficult to use because the facilities they offer are often too raw to be effectively exploited by analysts. We describe the evolution of RAM3S, a software infrastructure for the integration of Big Data stream processing platforms, into SPAF, an abstraction framework able to provide programmers with a simple but powerful API that eases the development of stream processing applications. By using SPAF, the programmer can easily implement real-time complex analyses of massive streams on top of a distributed computing infrastructure able to manage the volume and velocity of Big Data streams, thus effectively transforming data into value.
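The abstract does not spell out SPAF's API, so the following is a hypothetical sketch of what a platform-agnostic stream-processing abstraction of this kind could look like: analysis logic written once against a small interface, with adapters hiding the concrete engine. SPAF's actual interface may differ.

```python
# Hypothetical sketch of a platform-agnostic stream-processing API in the
# spirit of what the abstract describes; SPAF's real interface may differ.

from abc import ABC, abstractmethod

class StreamProcessor(ABC):
    """Analyst-facing logic, written once, independent of the platform."""
    @abstractmethod
    def on_item(self, item):
        """Process one stream item; yield zero or more results."""

class Platform(ABC):
    """Adapter hiding a concrete engine (e.g., Flink or Samza)."""
    @abstractmethod
    def run(self, processor: StreamProcessor, source):
        ...

class LocalPlatform(Platform):
    """Trivial single-process adapter, useful for testing the logic."""
    def run(self, processor, source):
        for item in source:
            for result in processor.on_item(item):
                print(result)

class FaceDetector(StreamProcessor):        # hypothetical analysis task
    def on_item(self, frame):
        if "face" in frame:
            yield f"face found in {frame}"

LocalPlatform().run(FaceDetector(), ["frame1", "frame2:face"])
# face found in frame2:face
```

The point of the design is that swapping `LocalPlatform` for a distributed adapter would not change `FaceDetector` at all, which is the kind of decoupling the abstract attributes to SPAF.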
Stream processing platforms for analyzing big dynamic data
it - Information Technology, 2016
Nowadays, data is produced in every aspect of our lives, leading to a massive amount of information generated every second. However, this vast amount is often too large to be stored, and for many applications the information contained in these data streams is only useful while it is fresh. Batch processing platforms like Hadoop MapReduce do not fit these needs, as they require data to be collected on disk and processed repeatedly. Therefore, modern data processing engines combine the scalability of distributed architectures with the one-pass semantics of traditional stream engines. In this paper, we survey the current state of the art in scalable stream processing from a user perspective. We examine and describe the architecture, execution model, programming interface, and data analysis support of these engines, and discuss the challenges and limitations of their APIs. In this connection, we introduce Piglet, an extended Pig Latin language and code generator that compiles (extended) Pig Latin code ...
Harnessing Sliding-Window Execution Semantics for Parallel Stream Processing
Journal of Parallel and Distributed Computing, Elsevier, 2017
Following the recent trend in data acquisition and processing technology, big data are increasingly available in the form of unbounded streams of elementary data items to be processed in real time. In this paper we study in detail the paradigm of sliding windows, a well-known technique for approximate queries that update their results continuously as new fresh data arrive from the stream. In this work we focus on the relationship between the various existing sliding-window semantics and the way query processing is performed from the parallelism perspective. From this study two alternative parallel models are identified, each covering semantics with very precise properties. Each model is described in terms of its pros and cons, and parallel implementations in the FastFlow framework are analyzed by discussing the layout of the concurrent data structures used for the efficient representation of windows in each model.
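As a reference point for the semantics the paper analyzes, here is a minimal count-based sliding window with independent window size w and slide s; the paper's parallel models and FastFlow implementations are not reproduced here.

```python
# Minimal count-based sliding window (size w, slide s): emit the current
# window contents every s arrivals once w items are buffered. A sketch of
# the sequential semantics only; the paper's parallel FastFlow
# implementations are not reproduced here.

from collections import deque

def sliding_windows(stream, w, s):
    buf = deque(maxlen=w)          # keeps only the w most recent items
    for i, item in enumerate(stream, start=1):
        buf.append(item)
        if i >= w and (i - w) % s == 0:
            yield tuple(buf)       # one window instance

for win in sliding_windows(range(1, 8), w=4, s=2):
    print(win)
# (1, 2, 3, 4)
# (3, 4, 5, 6)
```

When s < w, consecutive windows overlap and share items, which is precisely what makes the parallel evaluation of such queries non-trivial.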
Sequences, yet Functions: The Dual Nature of Data-Stream Processing
2018
Data-stream processing has continuously risen in importance as the amount of available data has been steadily increasing over the last decade. Besides traditional domains such as data-center monitoring and click analytics, there is an increasing number of network-enabled production machines that generate continuous streams of data. Due to their continuous nature, queries on data-streams can be more complex, and distinctly harder to understand than database queries. As users have to consider operational details, maintenance and debugging become challenging. Current approaches model data-streams as sequences, because this is the way they are physically received. These models result in an implementation-focused perspective. We explore an alternate way of modeling data-streams by focusing on time-slicing semantics. This focus results in a model based on functions, which is better suited for reasoning about query semantics. By adapting the definitions of relevant concepts in stream p...
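The dual view the abstract contrasts can be sketched directly: the same stream represented as the sequence it physically arrives in, and as a function from application time to the slice of data visible at that time. The encoding below is an assumption made for illustration.

```python
# Sketch of the dual view the abstract contrasts: the same stream as the
# sequence it physically arrives in, and as a function from time to the
# slice of data visible at that time. The encoding is an assumption made
# for illustration, not the paper's formalism.

stream_as_sequence = [(1, 'a'), (4, 'b'), (6, 'c')]   # (timestamp, value)

def stream_as_function(t):
    """Time-slicing view: what a query sees at application time t."""
    return [v for (ts, v) in stream_as_sequence if ts <= t]

# Reasoning about a query becomes reasoning about this function:
assert stream_as_function(0) == []
assert stream_as_function(4) == ['a', 'b']
assert stream_as_function(9) == ['a', 'b', 'c']
```

In the functional view, the arrival order of tuples disappears from the semantics; only what is visible at each point in time matters, which is what makes it easier to reason about.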
SPADE: the System S declarative stream processing engine
2008
In this paper, we present Spade, the System S declarative stream processing engine. System S is a large-scale, distributed data stream processing middleware under development at IBM T. J. Watson Research Center. As a front-end for rapid application development for System S, Spade provides (1) an intermediate language for flexible composition of parallel and distributed data-flow graphs, (2) a toolkit of type-generic, built-in stream processing operators that support scalar as well as vectorized processing and can seamlessly inter-operate with user-defined operators, and (3) a rich set of stream adapters to ingest/publish data from/to outside sources. More importantly, Spade automatically brings performance optimization and scalability to System S applications. To that end, Spade employs a code generation framework to create highly optimized applications that run natively on the Stream Processing Core (SPC), the execution and communication substrate of System S, and take full advantage of other System S services. Spade allows developers to construct their applications with fine-granular stream operators without worrying about the performance implications that might exist, even in a distributed system. Spade's optimizing compiler automatically maps applications into appropriately sized execution units in order to minimize communication overhead, while at the same time exploiting available parallelism. By virtue of the scalability of the System S runtime and Spade's effective code generation and optimization, we can scale applications to a large number of nodes. Currently, we can run Spade jobs on ≈ 500 processors within more than 100 physical nodes in a tightly connected cluster environment. Spade has been in use at IBM Research to create real-world streaming applications.
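One of the optimizations the abstract credits to Spade's compiler, mapping fine-granular operators into larger execution units to cut communication overhead, can be sketched generically as operator fusion; this is an illustration of the idea, not Spade's intermediate language or code generator.

```python
# Generic sketch of operator fusion: chaining adjacent per-tuple operators
# into one execution unit so tuples pass by function call instead of
# inter-process communication. Illustrates the optimization the abstract
# describes; it is not SPADE's intermediate language or code generator.

def fuse(*operators):
    """Collapse a linear chain of per-tuple operators into one callable."""
    def fused(item):
        for op in operators:
            item = op(item)
            if item is None:       # an operator may drop the tuple
                return None
        return item
    return fused

parse   = lambda line: tuple(line.split(","))
project = lambda rec: (rec[0], float(rec[1]))
filt    = lambda rec: rec if rec[1] > 10.0 else None

unit = fuse(parse, project, filt)  # one execution unit, no IPC between ops
for line in ["s1,3.5", "s2,12.0"]:
    out = unit(line)
    if out is not None:
        print(out)                 # ('s2', 12.0)
```

The trade-off a fusing compiler balances is the one the abstract names: fewer execution units mean less communication overhead, while more units expose more parallelism.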
2011
Today, there is a diverse range of stream processing engines available for use. However, due to a lack of standardization, they differ greatly in semantics, syntax, and execution model, which may lead to differences in query results. The SECRET model [1] was proposed to explain such behavioral differences. Yet exploring relationships between heterogeneous stream processing engines remains an important task. This thesis investigates how SECRET can be used to explore execution model relationships between heterogeneous stream processing engines. We define a methodology and propose a technique to predict relationships between any given engine configurations with high efficiency. We further show the validity of our technique through extensive experiments. We present the design and architecture of a simulation and analysis software package to serve as a multipurpose auxiliary tool in the exploration of relationships. We also provide a prototype implementation of the proposed technique as part of this software.