Meichun Hsu - Academia.edu

Papers by Meichun Hsu

Proceedings of the 16th International Database Engineering & Applications Sysmposium on - IDEAS '12, 2012

ABSTRACT The current generation of stream processing systems is in general built separately from the query engine, and thus lacks the expressive power of SQL and incurs significant overhead in data access and movement. This situation has motivated us to leverage the query engine for stream processing. Stream-join is a window operation where the key issue is how to punctuate and pair two or more correlated streams. In this work we tackle this issue in the specific context of query-engine-supported stream processing. We focus on the following problems: a SQL query is definable on bounded relation data, but stream data are unbounded; joining multiple streams is a stateful (thus history-sensitive) operation, but a SQL query only cares about the current state; further, a relational join typically requires re-scanning a relation in a nested loop, but by nature a stream cannot be re-captured, as reading a stream always gets newly incoming data. To leverage query processing for analyzing unbounded streams, we defined the Epoch-based Continuous Query (ECQ) model, which allows a SQL query to be executed epoch by epoch to process the stream data chunk by chunk. However, unlike multiple one-time queries, an ECQ is a single, continuous query instance across execution epochs, which keeps the continuity of the application state as required by history-sensitive operations such as sliding-window join. To join multiple streams, we further developed techniques to cache one or more consecutive data chunks falling in a sliding window across query execution epochs in the ECQ instance, so that they can be re-delivered from the cache. In this way, joining multiple streams and self-joining a single stream in a chunk-based window or sliding window, with various pairing schemes, are made possible. We extended the PostgreSQL engine to support the proposed approach. Our experience has demonstrated its value.
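
The epoch-by-epoch execution with a chunk cache described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' PostgreSQL extension: the class name `EpochJoin`, the `(key, payload)` record shape, and the choice to re-join the whole cached window on every epoch are all assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List, Tuple

Record = Tuple[str, int]  # (join key, payload); hypothetical record shape


class EpochJoin:
    """One long-lived object standing in for a single continuous query instance."""

    def __init__(self, window_chunks: int) -> None:
        # Per-stream cache of the most recent chunks; deque(maxlen=...) silently
        # drops the oldest chunk once the sliding window is full.
        self.left: Deque[List[Record]] = deque(maxlen=window_chunks)
        self.right: Deque[List[Record]] = deque(maxlen=window_chunks)

    def run_epoch(
        self, left_chunk: Iterable[Record], right_chunk: Iterable[Record]
    ) -> List[Tuple[str, int, int]]:
        # Each epoch consumes exactly one new chunk per stream.
        self.left.append(list(left_chunk))
        self.right.append(list(right_chunk))

        # Index the cached right-hand window by key, so cached chunks are
        # "re-delivered" from memory rather than re-read from the stream.
        index: Dict[str, List[int]] = {}
        for chunk in self.right:
            for key, payload in chunk:
                index.setdefault(key, []).append(payload)

        # Equi-join every cached left record against the cached right window.
        return [
            (key, left_payload, right_payload)
            for chunk in self.left
            for key, left_payload in chunk
            for right_payload in index.get(key, [])
        ]
```

Because the object outlives each call to `run_epoch`, a record that arrived in one epoch can still pair with a record that arrives in a later epoch, as long as both chunks remain inside the window.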

Proceedings of the International Conference on Management of Emergent Digital EcoSystems - MEDES '09, 2009

... Machine, SVM, model for classifying video frames to concepts) for multilevel, multidimensional feature ... and analysis, our OpBI video platform supports in-DB, multi-level, multi-dimensional ... contexts, models and features (Fig.6), yielding model-based classification expressed by ...

Lecture Notes in Computer Science, 2002

Lecture Notes in Computer Science, 2009

Lecture Notes in Computer Science, 2010

ABSTRACT Scaling out data-intensive analytics is generally achieved by means of parallel computation for gaining CPU bandwidth, and incremental computation for balancing workload. Combining these two mechanisms is the key to supporting large-scale stream analytics. Map-Reduce (M-R) is a programming model for supporting parallel computation over vast amounts of data on large clusters of commodity machines. Through a simple interface with two functions, map and reduce, this model facilitates parallel implementation of data-intensive applications. In-DB M-R allows these functions to be embedded within standard queries to exploit the expressive power of SQL, and allows them to be executed by the query engine with fast data access and reduced data movement. However, when the data form infinite streams, the semantics and scale-out capability of M-R are challenged. To solve this problem, we propose to integrate M-R with the continuous query model characterized by Cut-Rewind (C-R), i.e., cut a query execution based on some granule of the stream data and then rewind the state of the query without shutting it down, for processing the next chunk of stream data. This approach allows an M-R query with full SQL expressive power to be applied to dynamic stream data chunk by chunk for continuous, window-based stream analytics. Our experience shows that integrating M-R and C-R can provide a powerful combination for parallelized and granulized stream processing. This combination enables us to scale out stream analytics “horizontally” based on the M-R model, and “vertically” based on the C-R model. The proposed approach has been prototyped on a commercial and proprietary parallel database engine. Our preliminary experiments reveal the merit of using a query engine for near-real-time parallel and incremental stream analytics.
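
The cut-then-rewind idea can be sketched in a few lines of Python. This is a single-process illustration only, not the prototype on the parallel database engine: the `(timestamp, word)` event shape, the word-count workload, and the function names `cut`, `map_fn`, `reduce_fn`, and `continuous_word_count` are assumptions, and the stream is assumed to arrive ordered by timestamp.

```python
from collections import defaultdict
from itertools import groupby
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Tuple[int, str]  # (timestamp in seconds, word); hypothetical event shape


def cut(stream: Iterable[Event], granule: int) -> Iterator[Tuple[int, List[Event]]]:
    """Cut a time-ordered stream into chunks of `granule` seconds each."""
    for epoch, events in groupby(stream, key=lambda e: e[0] // granule):
        yield epoch, list(events)


def map_fn(event: Event) -> Tuple[str, int]:
    """Map step: emit (word, 1) for each event."""
    return event[1], 1


def reduce_fn(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce step: sum the counts per word within one chunk."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def continuous_word_count(
    stream: Iterable[Event], granule: int
) -> Iterator[Tuple[int, Dict[str, int], Dict[str, int]]]:
    """Run map and reduce once per chunk while keeping state between chunks."""
    running: Dict[str, int] = defaultdict(int)  # survives every "rewind"
    for epoch, events in cut(stream, granule):
        chunk_result = reduce_fn(map(map_fn, events))
        for key, value in chunk_result.items():
            running[key] += value
        # After yielding, the loop continues with the next chunk: the
        # computation is never shut down, only re-applied to fresh input.
        yield epoch, chunk_result, dict(running)
```

In this sketch the generator plays the role of the long-running query: each iteration yields one per-chunk result together with the accumulated state, which is what lets window-based analytics span several chunks.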

Lecture Notes in Computer Science, 2010

[Figure residue from "Data Stream Analytics as Cloud Service for Mobile Applications", p. 723: table indices T1 to T9 assigned to the current hour h and preceding hours, with loading and archiving at hour h; a query generator; a retrieve request at hour h+1.]

Lecture Notes in Computer Science, 2011

ABSTRACT As cloud services become popular, how an enterprise application can efficiently consume a cloud service, as the client of the cloud service either on a device or on the application tier of the enterprise software stack, is an important issue. Focusing on the consumption of the real-time events service, in this work we extend the Data Access Object (DAO) pattern of enterprise applications for on-demand access and analysis of real-time events. We introduce the notion of Operational Event Pipe for caching the most recent events delivered by an event service, and the on-demand data analysis pattern based on this notion. We implemented the operational event pipe as a special kind of continuous query referred to as Event Pipe Query (EPQ). An EPQ is a long-standing SQL query with User Defined Functions (UDFs) that provides a pipe for the stream data to be buffered and to flow continuously within the boundary of a sliding window; when not requested, the EPQ just maintains and updates the buffer but returns nothing; once requested, it returns the query processing results on the selected part of the sliding window buffer, under the request-and-rewind mechanism. Integrating event buffering and analysis in a single continuous query leverages the SQL expressive power and the query engine's data processing capability, and reduces the data movement overhead. By extending the PostgreSQL query engine, we implement this operation pattern as the Continuous Data Access Object (CDAO) - an extension to the J2EE DAO. While DAO provides static data access interfaces, CDAO adds dynamic event processing interfaces with one or more EPQs.
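
The buffer-silently, answer-on-request behaviour of an event pipe can be illustrated with a small Python class. This is a sketch under stated assumptions rather than the EPQ or CDAO implementation: the class `EventPipe`, its method names, the `(timestamp, value)` event shape, and the averaging query are hypothetical.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class EventPipe:
    """Buffers recent events in a sliding window and answers only on request."""

    def __init__(self, window_seconds: int) -> None:
        self.window_seconds = window_seconds
        self.buffer: Deque[Tuple[int, float]] = deque()

    def ingest(self, timestamp: int, value: float) -> None:
        # Not requested: update the buffer and return nothing.
        self.buffer.append((timestamp, value))
        cutoff = timestamp - self.window_seconds
        # Evict events that have slid out of the window.
        while self.buffer and self.buffer[0][0] <= cutoff:
            self.buffer.popleft()

    def request_average(self, last_seconds: int) -> Optional[float]:
        # Requested: analyse only the selected part of the buffered window.
        if not self.buffer:
            return None
        newest = self.buffer[-1][0]
        selected = [v for t, v in self.buffer if t > newest - last_seconds]
        if not selected:
            return None
        return sum(selected) / len(selected)
```

A caller would feed events through `ingest` as they arrive and call `request_average` only when an answer is needed, which mirrors the on-demand access pattern the abstract contrasts with static data access.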

Lecture Notes in Computer Science, 2009

Most conventional video processing platforms treat the database merely as a storage engine rather than a computation engine, which causes inefficient data access and massive amounts of data movement. Motivated by providing a convergent platform, we push down video processing to the database engine using User Defined Functions (UDFs). However, the existing UDF technology suffers from two major limitations. First, a

Lecture Notes in Computer Science, 2009

ABSTRACT

Lecture Notes in Computer Science, 2011

ABSTRACT Many enterprise applications are based on continuous analytics of data streams. Integrating data-intensive stream processing with query processing allows us to take advantage of SQL's expressive power and DBMS's data management capability. However, it also raises serious challenges in dealing with complex dataflow, applying queries to unbounded stream data, and providing highly scalable, dynamically configurable, elastic infrastructure. In this project we tackle these problems in three dimensions. First, we model the general graph-structured, continuous dataflow analytics as a SQL Streaming Process with multiple connected and stationed continuous queries. Next, we extend the query engine to support cycle-based query execution for processing unbounded stream data in bounded chunks with sound semantics. Finally, we develop the Query Engine Grid (QE-Grid) over the Distributed Caching Platforms (DCP) as a dynamically configurable elastic infrastructure for parallel and distributed execution of SQL Streaming Processes. The proposed infrastructure is preliminarily implemented using PostgreSQL engines. Our experience shows its merit in leveraging SQL and query engines to analyze real-time, graph-structured and unbounded streams. Integrating it with a commercial and proprietary MPP based database cluster is being investigated.

SFL (pronounced as Sea-Flow) is an analytics system that supports a declarative language that extends SQL for specifying the dataflow of data-intensive analytics. The extended SQL language is motivated by providing a top-level representation of the converged platform for analytics and data management. Due to fast data access and reduced data transfer, such convergence has become the key to speed

The massively growing data volume and the pressing need for low latency are pushing the traditional store-first-query-later data warehousing technologies beyond their limits. Many enterprise applications are now based on continuous analytics of data streams. While integrating stream processing with query processing takes advantage of SQL's expressive power and DBMS's data management capability, it raises serious challenges in dealing with complex dataflow, applying queries to unbounded stream data, and providing highly scalable, dynamically configurable, elastic infrastructure. To solve these problems, we model the general graph-structured, continuous dataflow analytics as a SQL Streaming Process with multiple connected and stationed continuous queries; then we extend the query engine to support cycle-based query execution for processing unbounded stream data chunk-wise with sound semantics; and finally, we develop the Query Engine Net (QE-Net) over the Distributed Caching P...

With the booming of microblogs on the Web, people have begun to express their opinions on a wide variety of topics on Twitter and other similar services. Sentiment analysis on entities (e.g., products, organizations, people, etc.) in tweets (posts on Twitter) thus becomes a rapid and effective way of gauging public opinion for business marketing or social studies. However, Twitter's unique characteristics give rise to new problems for current sentiment analysis methods, which originally focused on large opinionated corpora such as product reviews. In this paper, we propose a new entity-level sentiment analysis method for Twitter. The method first adopts a lexicon-based approach to perform entity-level sentiment analysis. This method can give high precision, but low recall. To improve recall, additional tweets that are likely to be opinionated are identified automatically by exploiting the information in the result of the lexicon-based method. A classifier is then trained to assig...
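
As a rough illustration of the lexicon-based first step only, the following Python sketch scores a tweet that mentions an entity by counting opinion words. It is a toy under stated assumptions, not the method from the paper: the tiny `POSITIVE` and `NEGATIVE` word sets, the function name `entity_sentiment`, the single-token entity match, and the simplification of attributing every opinion word in the tweet to the entity are all hypothetical.

```python
import re
from typing import Optional

# Toy opinion lexicon; a real lexicon would be far larger (assumption).
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}


def entity_sentiment(tweet: str, entity: str) -> Optional[str]:
    """Label the sentiment of a tweet toward a single-token entity, or None."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    if entity.lower() not in tokens:
        return None  # the tweet does not mention the entity
    # Count matching opinion words; every hit is attributed to the entity.
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A tweet containing no lexicon word is labelled neutral here even if it is opinionated, which is one way to see the low recall that the abstract attributes to a purely lexicon-based method.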

Lecture Notes in Computer Science, 1988

Without Abstract

Lecture Notes in Computer Science, 2008

ABSTRACT

Lecture Notes in Computer Science, 2008

ABSTRACT

Lecture Notes in Computer Science, 2008

A technical trend in supporting large scale scientific applications is converging data intensive computation and data management for fast data access and reduced data flow. In a combined cluster platform, co-locating computation and data is the key to efficiency and scalability; and to make it happen, data must be partitioned in a way consistent with the computation model. However, with
