Stream schema: providing and exploiting static metadata for data stream processing
Related papers
Proceedings of the 13th International Conference on Extending Database Technology - EDBT '10, 2010
Schemas, and more generally metadata specifying structural and semantic constraints, are invaluable in data management. They facilitate conceptual design and enable checking of data consistency. They also play an important role in permitting semantic query optimization, that is, optimization and processing strategies that are often highly effective, but only correct for data conforming to a given schema. While the use of metadata is well-established in relational and XML databases, the same is not true for data streams. The existing work mostly focuses on the specification of dynamic information. In this paper, we consider the specification of static metadata for streams in a model called Stream Schema. We show how Stream Schema can be used to validate the consistency of streams. By explicitly modeling stream constraints, we show that stream queries can be simplified by removing predicates or subqueries that check for consistency. This can greatly enhance programmability of stream processing systems. We also present a set of semantic query optimization strategies that both permit compile-time checking of queries (for example, to detect empty queries) and new runtime processing options, options that would not have been possible without a Stream Schema specification. Case studies on two stream processing platforms (covering different applications and underlying stream models), along with an experimental evaluation, show the benefits of Stream Schema.
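To make the idea of compile-time checking concrete, here is a minimal Python sketch. It is not the paper's Stream Schema model: the `RangeConstraint` and `Predicate` classes, the `is_provably_empty` and `is_redundant` helpers, and the `temperature` attribute are all hypothetical names chosen for illustration. The sketch assumes a schema that declares a closed value range for one numeric attribute and a query with a single comparison predicate. Under those assumptions it shows how a query could be flagged as empty before it runs, and how a predicate already guaranteed by the schema could be dropped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeConstraint:
    """Hypothetical schema constraint: every value of `attr` lies in [low, high]."""
    attr: str
    low: float
    high: float


@dataclass(frozen=True)
class Predicate:
    """Hypothetical query predicate of the form `attr > bound` or `attr < bound`."""
    attr: str
    op: str  # ">" or "<"
    bound: float


def is_provably_empty(pred: Predicate, schema: RangeConstraint) -> bool:
    """Return True if no schema-conforming tuple can satisfy the predicate."""
    if pred.attr != schema.attr:
        return False  # the schema says nothing about this attribute
    if pred.op == ">":
        return schema.high <= pred.bound
    if pred.op == "<":
        return schema.low >= pred.bound
    raise ValueError(f"unsupported operator: {pred.op}")


def is_redundant(pred: Predicate, schema: RangeConstraint) -> bool:
    """Return True if every schema-conforming tuple satisfies the predicate."""
    if pred.attr != schema.attr:
        return False
    if pred.op == ">":
        return schema.low > pred.bound
    if pred.op == "<":
        return schema.high < pred.bound
    raise ValueError(f"unsupported operator: {pred.op}")


# Assumed schema: temperature readings always fall within [-50, 60].
schema = RangeConstraint("temperature", -50.0, 60.0)

# A filter that can never match: the query is empty and need not be run.
empty_query = is_provably_empty(Predicate("temperature", ">", 100.0), schema)

# A consistency check the schema already guarantees: the predicate can be removed.
removable = is_redundant(Predicate("temperature", "<", 100.0), schema)
```

Both helpers answer conservatively: when the schema does not mention the attribute, they report that nothing can be proven, so the query is left unchanged.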
Supporting views in data stream management systems
ACM Transactions on Database Systems, 2010
In relational database management systems, views supplement basic query constructs to cope with the demand for "higher-level" views of data. Moreover, in traditional query optimization, answering a query using a set of existing materialized views can yield a more efficient query execution plan. Due to their effectiveness, views are attractive to data stream management systems.
Unifying the Processing of XML Streams and Relational Data Streams
2006
Relational data streams and XML streams have previously provided two separate research foci, but their unified support by a single Data Stream Management System (DSMS) is very desirable from an application viewpoint. In this paper, we propose a simple approach to extend relational DSMSs to support both kinds of streams efficiently. In our Stream Mill system, XML streams, expressed as SAX events, can be easily transformed into relational streams, and vice versa. This enables a close cooperation of their query languages, resulting in great power and flexibility. For instance, XQuery can call functions defined in our SQL-based Expressive Stream Language (ESL) using the logical/physical windows that have proved so useful on relational data streams. Many benefits are also gained at the system level, since relational DSMS techniques for load shedding, memory management, query scheduling, approximate query answering, and synopsis maintenance can now be applied to XML streams. Moreover, the many FSA-based optimization techniques developed for XPath and XQuery can be easily and efficiently incorporated in our system. Indeed, we show that YFilter, which is capable of efficiently processing multiple complex XML queries, can be easily integrated in Stream Mill via ESL user-defined and system-defined aggregates. This approach produces a powerful and flexible system where relational and XML streams are unified and processed efficiently.
Defining the execution semantics of stream processing engines
Journal of Big Data, 2017
Several modern data-intensive applications need to process large volumes of data on the fly as they are produced. Examples range from credit card fraud detection systems, which analyze massive streams of credit card transactions to identify suspicious patterns, to environmental monitoring applications that continuously analyze sensor data, to click stream analysis of Web sites that identify frequent patterns of interactions. More generally, stream processing is a central requirement in today's information systems. This has driven the development of several stream processing engines (SPEs) that continuously analyze streams of data to produce new results as new elements enter the streams. Unfortunately, existing SPEs adopt different processing models and standardized execution semantics have not yet emerged. This severely hampers the usability
Semantic stream query optimization exploiting dynamic metadata
2011 IEEE 27th International Conference on Data Engineering, 2011
Data stream management systems (DSMS) processing long-running queries over large volumes of stream data must typically deliver time-critical responses. We propose the first semantic query optimization (SQO) approach that utilizes dynamic substream metadata at runtime to find a more efficient query plan than the one selected at compilation time. We identify four SQO techniques guaranteed to result in performance gains. Based on classic satisfiability theory we then design a lightweight query optimization algorithm that efficiently detects SQO opportunities at runtime. At the logical level, our algorithm instantiates multiple concurrent SQO plans, each processing different partially overlapping substreams. Our novel execution paradigm employs multi-modal operators to support the execution of these concurrent SQO logical plans in a single physical plan. This highly agile execution strategy reduces resource utilization while supporting lightweight adaptivity. Our extensive experimental study in the CAPE stream processing system using both synthetic and real data confirms that our optimization techniques significantly reduce query execution times, up to 60%, compared to the traditional approach.
Revisiting formal ordering in data stream querying
Proceedings of the 27th Annual ACM Symposium on Applied Computing - SAC '12, 2012
The use of stream-based applications is expanding in many contexts, and easy and efficient data stream management is crucial for such applications. That is why numerous solutions for stream query processing have been proposed by the scientific community. Several query processors exist and offer heterogeneous querying capabilities. This paper reports a formal work on the operators behind such query processing solutions. It points out the semantic heterogeneity of some important operators and how this leads to semantic ambiguity that may affect application semantics. This paper revisits the definition of the main operators used for stream query processing and proposes definitions which are semantically unambiguous. The main issue is the positional order of data items in a stream and its propagation across the operators. The proposed formalization deepens the understanding of stream queries and facilitates the comparison of the semantics implemented by existing systems. This paper also presents the prototype implementing our formal proposal.
Proceedings of the 2019 International Conference on Management of Data
Real-time data analysis and management are increasingly critical for today's businesses. SQL is the de facto lingua franca for these endeavors, yet support for robust streaming analysis and management with SQL remains limited. Many approaches restrict semantics to a reduced subset of features and/or require a suite of non-standard constructs. Additionally, use of event timestamps to provide native support for analyzing events according to when they actually occurred is not pervasive, and often comes with important limitations. We present a three-part proposal for integrating robust streaming into the SQL standard, namely: (1) time-varying relations as a foundation for classical tables as well as streaming data, (2) event time semantics, (3) a limited set of optional keyword extensions to control the materialization of time-varying query results. Motivated and illustrated using examples and lessons learned from implementations in Apache Calcite, Apache Flink, and Apache Beam, we show how with these minimal additions it is possible to utilize the complete suite of standard SQL semantics to perform robust stream processing.
A data stream language and system designed for power and extensibility
2006
By providing integrated and optimized support for user-defined aggregates (UDAs), data stream management systems (DSMS) can achieve superior power and generality while preserving compatibility with current SQL standards. This is demonstrated by the Stream Mill system that, through its Expressive Stream Language (ESL), efficiently supports a wide range of applications, including very advanced ones such as data stream mining, streaming XML processing, time-series queries, and RFID event processing. ESL supports physical and logical windows (with optional slides and tumbles) on both built-in aggregates and UDAs, using a simple framework that applies uniformly to both aggregate functions written in external procedural languages and those natively written in ESL. The constructs introduced in ESL extend the power and generality of DSMS, and are conducive to UDA-specific optimization and efficient execution as demonstrated by several experiments.
SECRET: A Model for Analysis of the Execution Semantics of Stream Processing Systems
There are many academic and commercial stream processing engines (SPEs) today, each of them with its own execution semantics. This variation may lead to seemingly inexplicable differences in query results. In this paper, we present SECRET, a model of the behavior of SPEs. SECRET is a descriptive model that allows users to analyze the behavior of systems and understand the results of window-based queries for a broad range of heterogeneous SPEs. The model is the result of extensive analysis and experimentation with several commercial and academic engines. In the paper, we describe the types of heterogeneity found in existing engines, and show with experiments on real systems that our model can explain the key differences in windowing behavior.
2015
There are many academic and commercial stream processing engines (SPEs) today, each of them with its own execution semantics. This variation may lead to seemingly inexplicable differences in query results. In this paper, we present SECRET, a model of the behavior of SPEs. SECRET is a descriptive model that allows users to analyze the behavior of systems and understand the results of window-based queries for a broad range of heterogeneous SPEs. The model is the result of extensive analysis and experimentation with several commercial and academic engines. In the paper, we describe the types of heterogeneity found in existing engines, and show with experiments on real systems that our model can explain the key differences in windowing behavior.
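As a small illustration of how windowing choices alone can change query results, the Python sketch below runs the same tumbling-window sum under two window-start conventions. It is not the SECRET model and does not reproduce any real engine: the two "engines", the `tumbling_sums` helper, and the sample events are hypothetical, and window alignment is only one plausible source of heterogeneity, chosen here for illustration.

```python
from collections import defaultdict


def tumbling_sums(events, size, origin):
    """Sum values per tumbling window of length `size`, with windows anchored at `origin`.

    `events` is a list of (timestamp, value) pairs. Window k covers the
    half-open interval [origin + k*size, origin + (k+1)*size), and the
    result maps each window's start time to the sum of its values.
    """
    sums = defaultdict(int)
    for ts, value in events:
        k = (ts - origin) // size          # index of the window containing ts
        sums[origin + k * size] += value   # key the result by window start
    return dict(sorted(sums.items()))


# Hypothetical input stream of (timestamp, value) pairs.
events = [(3, 10), (4, 20), (7, 30), (9, 40), (12, 50)]

# Hypothetical engine A: windows are aligned to time 0.
aligned = tumbling_sums(events, size=5, origin=0)

# Hypothetical engine B: the first window opens at the first tuple's timestamp.
first_tuple = tumbling_sums(events, size=5, origin=events[0][0])

print(aligned)      # {0: 30, 5: 70, 10: 50}
print(first_tuple)  # {3: 60, 8: 90}
```

The query, the window size, and the input are identical in both runs, yet the number of windows and the per-window sums differ, which is the kind of discrepancy a descriptive model of execution semantics is meant to explain.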