Luping Ding - Academia.edu
Papers by Luping Ding
2007 IEEE 23rd International Conference on Data Engineering, 2007
Proceedings of the thirteenth ACM international conference on Information and knowledge management, 2004
We explore join optimizations in the presence of both time-based constraints (sliding windows) and value-based constraints (punctuations). We present the first join solution, named PWJoin, that exploits such combined constraints to shrink the runtime join state and to propagate punctuations to benefit downstream operators. We design a state structure for PWJoin that facilitates the exploitation of both constraint types. We also explore optimizations enabled by the interactions between windows and punctuations, e.g., early punctuation propagation. The costs of PWJoin are analyzed using a cost model. We also conduct an experimental study using the CAPE continuous query system. The experimental results show that in most cases, by exploiting punctuations, PWJoin outperforms the pure window join with regard to both memory overhead and throughput. Our technique complements existing joins in the literature, such as the symmetric hash join and the window join, enabling them to require fewer runtime resources without compromising the accuracy of the result.
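The interaction of the two purge triggers described above can be sketched in a few lines of hypothetical code (our illustration, not the paper's PWJoin implementation): a symmetric hash join whose state shrinks both when tuples fall out of the sliding window and when a punctuation closes a join-key value.

```python
from collections import defaultdict

class PunctuatedWindowJoin:
    """Toy symmetric hash join over two streams L and R with a
    time-based sliding window. A punctuation on one input declares
    that no further tuples with a given key will arrive there, so
    matching state on the *other* input can be purged early."""

    def __init__(self, window):
        self.window = window  # window size in time units
        self.state = {"L": defaultdict(list), "R": defaultdict(list)}

    def _expire(self, now):
        # Window-based purge: drop tuples that fell out of the window.
        for side in ("L", "R"):
            for key in list(self.state[side]):
                kept = [ts for ts in self.state[side][key]
                        if now - ts <= self.window]
                if kept:
                    self.state[side][key] = kept
                else:
                    del self.state[side][key]

    def insert(self, side, key, ts):
        """Process one tuple; returns join results as (key, l_ts, r_ts)."""
        self._expire(ts)
        other = "R" if side == "L" else "L"
        matches = self.state[other].get(key, [])
        if side == "L":
            results = [(key, ts, m) for m in matches]
        else:
            results = [(key, m, ts) for m in matches]
        self.state[side][key].append(ts)
        return results

    def punctuate(self, side, key):
        # Punctuation-based purge: no more tuples with `key` on `side`,
        # so tuples with that key on the other side can never match again.
        other = "R" if side == "L" else "L"
        self.state[other].pop(key, None)
```

In this sketch, punctuations purge state immediately, while the window alone would retain those tuples until they age out; this is the memory saving the abstract refers to.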
Proceedings of the 2nd international workshop on Scalable stream processing system, 2008
Group-by queries are prevalent in stream processing applications. We propose the parameterized streaming group-by query (or PSGB query) to serve the need for providing customized results based on widely varying user requests. The PSGB query is resource-efficient by assuming the pull execution model. Targeted applications exhibit fast-arriving streaming data and selective user requests. This raises the need for an indexing strategy that can effectively organize data in the quickly evolving group-by state to support heavy and fluctuating query workloads. In this paper, we tackle this problem by proposing an efficient yet lightweight index solution, named the IMP index, for the PSGB group-by operator. We propose the EPrune algorithm, which is guaranteed to find the optimal IMP index configuration for a given query workload. To support the frequent index tuning required for coping with dynamic stream environments, the efficiency of index selection may become more important than guaranteed optimality. We thus design a greedy index selection algorithm named RGreedy and equip it with three alternative heuristics: OWL, RCL, and Hybrid. Our experiments conducted in a real stream processing system show that the RGreedy algorithm with the Hybrid heuristic finds the optimal IMP configuration in a variety of test cases with xx% confidence interval. And while EPrune takes hours to finish, RGreedy always terminates within seconds.
Query 1:
SELECT srcIP, destIP, SUM(len) FROM Packets WHERE collectorID = 'B' GROUP BY srcIP, destIP WINDOW 20 Minutes
Query 2:
SELECT categoryID, buyer_state, buyer_job, COUNT(*) FROM Bid_info GROUP BY categoryID, buyer_state, buyer_job WINDOW 24 Hours
2008 IEEE 24th International Conference on Data Engineering, 2008
Detecting complex patterns in event streams, i.e., complex event processing (CEP), has become increasingly important for modern enterprises that need to react quickly to critical situations. In many practical cases business events are generated based on pre-defined business logic. Hence constraints, such as occurrence and order constraints, often hold among events. Reasoning using these known constraints enables us to predict the non-occurrence of certain future events, thereby helping us to identify, and then terminate, long-running query processes that are guaranteed not to lead to successful matches. In this work, we focus on exploiting event constraints to optimize CEP over large volumes of business transaction streams. Since the optimization opportunities arise at runtime, we develop a runtime query unsatisfiability (RunSAT) checking technique that detects optimal points for terminating query evaluation. To ensure the efficiency of RunSAT checking, we propose mechanisms to precompute the query failure conditions to be checked at runtime. This guarantees a constant-time RunSAT reasoning cost, making our technique highly scalable. We realize our optimal query termination strategies by augmenting the query with Event-Condition-Action rules encoding the pre-computed failure conditions. This results in an event processing solution compatible with state-of-the-art CEP architectures. Extensive experimental results demonstrate that significant performance gains are achieved, while the optimization overhead is small.
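The flavor of this idea can be illustrated with a toy sketch (hypothetical code of our own, not the paper's RunSAT encoding): failure conditions are precomputed into a set of "fatal" event types whose arrival guarantees that in-flight partial matches cannot succeed, so the runtime check reduces to a constant-time set lookup.

```python
class RunSATMonitor:
    """Toy runtime-unsatisfiability check: if the constraints imply that
    seeing an event of type F rules out any successful match, every
    partial match can be terminated the moment F arrives.
    (Illustrative sketch only -- not the paper's actual encoding.)"""

    def __init__(self, fatal_types):
        # Precomputed at compile time from the known event constraints.
        self.fatal_types = frozenset(fatal_types)
        self.active = set()  # ids of in-flight partial matches

    def start_match(self, match_id):
        self.active.add(match_id)

    def on_event(self, event_type):
        # Constant-time check against the precomputed failure condition.
        if event_type in self.fatal_types:
            terminated = self.active
            self.active = set()  # all partial matches are unsatisfiable
            return terminated
        return set()
```

The expensive reasoning happens once, at compile time, when `fatal_types` is derived; the per-event cost is a single hash lookup, which is what makes the approach scale.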
Proceedings of the 4th international workshop on Web information and data management, 2002
XML has become an important medium for representing and exchanging data over the Internet. Data sources, including structured and semi-structured sources, often export XML views over base data and then materialize the view by storing its XML query result to provide faster data access. Upon a change to the base data, it is typically more efficient to maintain a view by incrementally propagating the delta changes to the view than by re-computing it from scratch. Techniques for the incremental maintenance of relational views have been extensively studied in the literature. However, the maintenance of views created using XQuery has so far remained unexplored. In this paper we propose an algebraic approach for incremental XQuery view maintenance. In our approach, an update to the XML source is transformed into a set of well-defined update primitives which are propagated through the XML algebra tree. This algebraic update propagation process generates incremental update primitives to be applied to the result view. We present our update propagation strategy, based on the XAT XML algebra, together with the propagation rules for the individual algebra operators. We briefly discuss our XQuery view maintenance system implementation, and present experiments confirming that incremental view maintenance is indeed faster than re-computation.
Practical and Scalable Semantic Systems, 2003
In order to realize the vision of the Semantic Web, a semantic model for encoding content in the World Wide Web, efficient storage and retrieval of large RDF data sets is required. A common technique for storing RDF data (graphs) is to use a single relational database table, a triple store, for the graph. However, we believe a single triple store cannot scale for large-scale applications.
Lecture Notes in Computer Science, 2004
Proceedings of the Third ACM International Conference on Distributed Event-Based Systems, 2009
Event stream processing (ESP) [7][2][5][6] technologies enable enterprise applications such as algorithmic trading, RFID data processing, fraud detection, and location-based services in telecommunications. The key applications of ESP technologies rely on the detection ...
Proceedings of the 2003 ACM SIGMOD international conference on Management of data - SIGMOD '03, 2003
Outlook. We present multiple-XQuery optimization based on materialized XML view technology in the Rainbow system. In this demo we show in particular: (1) Rainbow's support for defining and incrementally maintaining materialized XQuery views; (2) XQuery optimization by query rewriting to use materialized views; (3) multiple query optimization by merging multiple XML queries (XATs) into one global access plan to decide on the materialization of intermediate results as views; and (4) processing of updates issued on XML views that wrap relational data by decomposing the updates into SQL update statements and consistency checks on the relational base data. The Rainbow System. We have extended Rainbow [1], our existing XML data management system, as shown in Figure 1. Rainbow accepts an XQuery query or an update request in an extended XQuery syntax from the user. The XQuery is parsed into an algebraic representation called the XML Algebra Tree (XAT) [3]. The XAT is then optimized by the global query optimizer using algebraic rewrite rules [2]. We have introduced a separate phase of XAT cleanup [2], which includes XAT table schema cleanup and the removal of unnecessary XML operators. This optimization often significantly improves query performance. The optimized XAT is then executed by the query manager.
9th International Database Engineering & Application Symposium (IDEAS'05)
Adaptive operator scheduling algorithms for continuous query processing are usually designed to serve a single performance objective, such as minimizing memory usage or maximizing query throughput. We observe that different performance objectives may sometimes conflict with each other. Also, due to the dynamic nature of streaming environments, the performance objective may need to change dynamically. Furthermore, the performance specification defined by users may itself be multi-dimensional. Therefore, utilizing a single scheduling algorithm optimized for a single objective is no longer sufficient. In this paper, we propose a novel adaptive scheduling-algorithm selection framework named AMoS. It is able to leverage the strengths of existing scheduling algorithms to meet multiple performance objectives. AMoS employs a lightweight learning mechanism to assess the effectiveness of each algorithm. The learned knowledge can then be used to select the algorithm that probabilistically has the best chance of improving performance. In addition, AMoS has the flexibility to add, and adapt to, new scheduling algorithms, query plans, and data sets during execution. Our experimental results show that AMoS significantly outperforms the existing scheduling algorithms with regard to satisfying both uni-objective and multi-objective performance requirements.
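As a rough illustration of this kind of mechanism (a probability-matching selector of our own devising, not the actual AMoS algorithm), one might score each candidate scheduler by its observed reward and sample the next scheduler proportionally to that score:

```python
import random

class AdaptiveSelector:
    """Toy adaptive algorithm selector in the spirit described above:
    keep a score per scheduling algorithm and pick the next one with
    probability proportional to its observed effectiveness.
    (Illustrative only -- not the AMoS framework from the paper.)"""

    def __init__(self, algorithms, smoothing=1.0):
        # Start every algorithm with a small positive score so each
        # retains a nonzero chance of being tried.
        self.scores = {name: smoothing for name in algorithms}

    def pick(self, rng=random.random):
        # Sample an algorithm with probability proportional to its score.
        total = sum(self.scores.values())
        r = rng() * total
        for name, score in self.scores.items():
            r -= score
            if r <= 0:
                return name
        return name  # numerical edge case: return the last algorithm

    def feedback(self, name, reward):
        # Lightweight learning: exponential moving average of rewards.
        self.scores[name] = 0.8 * self.scores[name] + 0.2 * max(reward, 0.0)
```

The exponential moving average lets the selector forget stale performance, which matters when the best scheduler changes as stream conditions drift.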
Advances in Database Systems
Proceedings 2004 VLDB Conference, 2004
27th International Conference on Distributed Computing Systems Workshops (ICDCSW'07), 2007
Complex event processing has become increasingly important in modern applications, ranging from supply chain management for RFID tracking to real-time intrusion detection. The goal is to extract patterns from such event streams in order to make informed decisions in real time. However, networking latencies and even machine failures may cause events to arrive out of order at the event stream processing engine. In this work, we address the problem of processing event pattern queries specified over event streams that may contain out-of-order data. First, we analyze the problems state-of-the-art event stream processing technology would experience when faced with out-of-order data arrival. We then propose a new solution of physical implementation strategies for the core stream algebra operators, such as sequence scan and pattern construction, including stack-based data structures and associated purge algorithms. Optimizations for sequence scan and construction as well as state purging to minimize CPU cost and memory consumption are also introduced. Lastly, we conduct an experimental study demonstrating the effectiveness of our approach.
2.1 Events, Event Stream and Sequence Query
Events. An event is defined to be an instantaneous occurrence of interest at a point in time. It can be a primitive event or a composite event [2]. Throughout this report, we use capitalized letters to represent event types and lowercase letters to represent event instances. A schema is associated with each event type. It includes the event type ID, a set of application-specific attributes, and the timestamp which records the time when the event is generated. Event stream. In most event processing scenarios, it is assumed that the input to the query system is a potentially infinite event stream that contains all events that might be of interest [10, 2, 1].
Therefore, the event stream is heterogeneously populated with event instances of different event types, and hence with different schemas. For example, in the RFID-based retail management scenario explained in [10], all the RFID readings are merged into a single stream and sorted by their timestamps. Hence the stream will contain the SHELF-READING events, the COUNTER-READING events, and the EXIT-READING events. Event Sequence Query. Event sequence queries are queries on the sequential event stream. [10] defines a language that can specify how individual events are filtered and how multiple events are correlated via time-based and value-based constraints. In this work, we utilize the SASE query language to express sequence queries with sliding windows. The following is an example using the SASE language for our previous case study, which finds out if any book is being taken out of a book store without going through the store's check-out counter:
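The example query itself is truncated above; a query of this shape in SASE syntax, sketched from the language as described here (the attribute names are illustrative reconstructions, not quoted from the paper), would read roughly:

```
EVENT  SEQ(SHELF-READING x, !(COUNTER-READING y), EXIT-READING z)
WHERE  x.tag_id = y.tag_id AND x.tag_id = z.tag_id
WITHIN 12 hours
```

The negated COUNTER-READING term expresses non-occurrence: a match fires only if an item's tag is seen on a shelf and then at the exit, with no intervening check-out reading for that tag inside the sliding window.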
IEEE Data(base) Engineering Bulletin - DEBU, 2003
To realize the vision of the Semantic Web, efficient storage and retrieval of large RDF data sets is required. A common technique for persisting RDF data (graphs) is to use a single relational database table, a triple store. But we believe a single triple store cannot scale for large-scale applications. This paper describes storing and querying persistent RDF graphs in Jena, a Semantic Web programmers' toolkit. Jena augments the triple store with property tables that cluster multiple property values in a single table row. We also describe two tools to assist in designing application-specific RDF storage schemas. The first is a synthetic data generator that generates RDF graphs consistent with an underlying ontology. The second mines an RDF graph or an RDF query log for frequently occurring patterns. These patterns can be applied to schema design or caching strategies to improve performance. We also briefly describe Jena inferencing and a new approach to context in RDF which w...
2011 IEEE 27th International Conference on Data Engineering, 2011
Data stream management systems (DSMS) processing long-running queries over large volumes of stream data must typically deliver time-critical responses. We propose the first semantic query optimization (SQO) approach that utilizes dynamic substream metadata at runtime to find a more efficient query plan than the one selected at compilation time. We identify four SQO techniques guaranteed to result in performance gains. Based on classic satisfiability theory, we then design a lightweight query optimization algorithm that efficiently detects SQO opportunities at runtime. At the logical level, our algorithm instantiates multiple concurrent SQO plans, each processing different, partially overlapping substreams. Our novel execution paradigm employs multi-modal operators to support the execution of these concurrent SQO logical plans in a single physical plan. This highly agile execution strategy reduces resource utilization while supporting lightweight adaptivity. Our extensive experimental study in the CAPE stream processing system, using both synthetic and real data, confirms that our optimization techniques significantly reduce query execution times, by up to 60%, compared to the traditional approach.
Proceedings of the 2nd international workshop on Distributed event-based systems, 2003
Join algorithms must be redesigned when processing stream data instead of persistently stored data. Data streams are potentially infinite, and the query result is expected to be generated incrementally instead of only once. Data arrival patterns are often unpredictable, and the statistics of the data and other relevant metadata are often only known at runtime. In some cases they are supplied interleaved with the actual data in the form of stream markers. Recently, stream join algorithms such as the Symmetric Hash Join and XJoin have been designed to perform in a pipelined fashion to cope with the latent delivery of data. However, none of them to date takes metadata, especially runtime metadata, into consideration. Hence, the join execution logic defined statically before runtime may not be well suited to deal with varying types of dynamic runtime scenarios. Also, potentially unbounded state needs to be maintained by the join operator to guarantee the precision of the result. In this paper, we propose a metadata-aware stream join operator called MJoin which is able to exploit metadata to (1) detect and purge useless materialized data to save computation resources and (2) optimize the execution logic to target different optimization goals. We have implemented the MJoin operator. The experimental results validate our metadata-driven join optimization strategies.