SECRET: A Model for Analysis of the Execution Semantics of Stream Processing Systems

"SECRET: A Model for Analysis of the Execution Semantics of Stream Processing Systems," November 2009 (under conference submission)

2015

There are many academic and commercial stream processing engines (SPEs) today, each of them with its own execution semantics. This variation may lead to seemingly inexplicable differences in query results. In this paper, we present SECRET, a model of the behavior of SPEs. SECRET is a descriptive model that allows users to analyze the behavior of systems and understand the results of window-based queries for a broad range of heterogeneous SPEs. The model is the result of extensive analysis and experimentation with several commercial and academic engines. In the paper, we describe the types of heterogeneity found in existing engines, and show with experiments on real systems that our model can explain the key differences in windowing behavior.

Tools and techniques for exploring execution model relationships across heterogeneous stream processing engines

2011

Today, there is a diverse range of stream processing engines available for use. However, due to a lack of standardization, they differ greatly in semantics, syntax, and execution model, which may lead to differences in query results. The SECRET model [1] was proposed to explain such behavioral differences. Yet exploring relationships between heterogeneous stream processing engines remains an important task. This thesis investigates how SECRET can be used to explore execution model relationships between heterogeneous stream processing engines. We define a methodology and propose a technique to predict relationships between any given engine configurations with high efficiency. We further show the validity of our technique through extensive experiments. We present the design and architecture of a simulation and analysis software to serve as a multipurpose auxiliary tool in the exploration of relationships. We also provide a prototype implementation of the proposed technique as part of this software.

Defining the execution semantics of stream processing engines

Journal of Big Data, 2017

Several modern data-intensive applications need to process large volumes of data on the fly as they are produced. Examples range from credit card fraud detection systems, which analyze massive streams of credit card transactions to identify suspicious patterns, to environmental monitoring applications that continuously analyze sensor data, to click stream analysis of Web sites that identifies frequent patterns of interactions. More generally, stream processing is a central requirement in today's information systems. This situation has driven the development of several stream processing engines (SPEs) that continuously analyze streams of data to produce new results as new elements enter the streams. Unfortunately, existing SPEs adopt different processing models, and standardized execution semantics have not yet emerged. This severely hampers the usability

Efficient execution of sliding-window queries over data streams

2003

Emerging data stream processing systems rely on windowing to enable on-the-fly processing of continuous queries over unbounded streams. As a result, several recent efforts have developed window-aware implementations of query operators such as joins and aggregates. This focus on individual operators, however, ignores the larger issue of how to coordinate the pipelined execution of such operators when combined into a full windowed query plan. In this paper, we first show how the straightforward application of traditional pipelined query processing techniques to sliding window queries can result in inefficient and incorrect behavior. We then present three alternative execution techniques that guarantee correct behavior for pipelined sliding window queries and develop new algorithms for correctly evaluating window-based duplicate-elimination, Group-By and Set operators in this context. We implemented all of these techniques in a prototype data stream system and report the results of a detailed performance study of the system.

Harnessing Sliding-Window Execution Semantics for Parallel Stream Processing

Journal of Parallel and Distributed Computing, Elsevier, 2017

Following the recent trend in data acquisition and processing technology, big data are increasingly available in the form of unbounded streams of elementary data items to be processed in real time. In this paper we study in detail the paradigm of sliding windows, a well-known technique for approximate queries that update their results continuously as fresh data arrive from the stream. In this work we focus on the relationship between the various existing sliding window semantics and the way the query processing is performed from the parallelism perspective. From this study two alternative parallel models are identified, each covering semantics with very precise properties. Each model is described in terms of its pros and cons, and parallel implementations in the FastFlow framework are analyzed by discussing the layout of the concurrent data structures used for the efficient representation of windows in each model.

STREAM: The Stanford Stream Data Manager

IEEE Data(base) Engineering Bulletin, 2003

We propose to demonstrate a Data Stream Management System (DSMS) called STREAM, for STanford stREam datA Manager. The challenges in building a DSMS instead of a traditional DBMS arise from two fundamental differences: (1) in addition to managing traditional stored data such as relations, a DSMS must handle multiple continuous, unbounded, possibly rapid and time-varying data streams; (2) due to the continuous nature of the data, a DSMS typically supports long-running continuous queries, which are expected to produce answers in a continuous and timely fashion.

Scalability via summaries: Stream query processing using promising tuples

2005

In many data streaming applications, streams may contain data tuples that are either redundant, repetitive, or that are not "interesting" to any of the standing continuous queries. Processing such tuples may waste system resources without producing useful answers. To the contrary, some other tuples can be categorized as promising. This paper proposes that stream query engines can have the option to execute on promising tuples only and not on all tuples. We propose to maintain intermediate stream summaries and indices that can direct the stream query engine to detect and operate on promising tuples. As an illustration, the proposed intermediate stream summaries are tuned towards capturing promising tuples that (1) maximize the number of output tuples, (2) contribute to producing a faithful representative sample of the output tuples (compared to the output produced when assuming infinite resources), or (3) produce the outlier or deviant results. Experiments are conducted in the context of Nile [24], a prototype stream query processing engine developed at Purdue University.

Systems Group, Department of Computer Science, ETH Zurich

Today, there is a diverse range of stream processing engines available for use. However, due to a lack of standardization, they differ greatly in semantics, syntax, and execution model, which may lead to differences in query results. The SECRET model [1] was proposed to explain such behavioral differences. Yet exploring relationships between heterogeneous stream processing engines remains an important task.

Network-Aware Query Processing for Stream-based Applications

Proceedings 2004 VLDB Conference, 2004

This paper investigates the benefits of network awareness when processing queries in widely distributed environments such as the Internet. We present algorithms that leverage knowledge of network characteristics (e.g., topology, bandwidth, etc.) when deciding on the network locations where the query operators are executed. Using a detailed emulation study based on realistic network models, we analyse and experimentally evaluate the proposed approaches for distributed stream processing. Our results quantify the significant benefits of the network-aware approaches and reveal the fundamental trade-off between bandwidth efficiency and result latency that arises in networked query processing.