Philip Yu | University of Illinois at Chicago
Papers by Philip Yu
Abstract: A parallel hash join algorithm based on the concept of hierarchical hashing is proposed to address the problem of data skew. The proposed algorithm adds an extra scheduling phase to the usual hash and join phases. During the scheduling phase, a heuristic optimization algorithm, using the output of the hash phase, attempts to balance the load across the multiple processors in the subsequent join phase.
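A minimal sketch of the kind of load-balancing heuristic such a scheduling phase could apply, assuming per-partition size estimates are available from the hash phase; the greedy largest-first rule below is an illustrative choice, not the paper's exact optimization algorithm.

```python
import heapq

def schedule_partitions(partition_sizes, num_processors):
    """Greedily assign hash partitions to processors so the estimated
    join-phase load stays balanced (largest partitions placed first)."""
    # Min-heap of (current_load, processor_id).
    loads = [(0, p) for p in range(num_processors)]
    heapq.heapify(loads)
    assignment = {p: [] for p in range(num_processors)}
    # Largest-partition-first: a classic LPT-style heuristic.
    for pid, size in sorted(partition_sizes.items(), key=lambda kv: -kv[1]):
        load, proc = heapq.heappop(loads)
        assignment[proc].append(pid)
        heapq.heappush(loads, (load + size, proc))
    return assignment

# Example: skewed partition sizes measured during the hash phase.
sizes = {0: 900, 1: 120, 2: 110, 3: 100, 4: 95, 5: 90}
print(schedule_partitions(sizes, 3))
```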
Abstract: Analyzing the executions of a buggy software program is essentially a data mining process. Although many interesting methods have been developed to trace crashing bugs (such as memory violations and core dumps), it is still difficult to analyze non-crashing bugs (such as logical errors). In this paper, we develop a novel method to classify the structured traces of program executions using software behavior graphs.
Abstract: Due to the rapid growth of Internet technology and new scientific and technological advances, the number of applications that model data as graphs is increasing, because graphs have high expressive power to model complicated structures. The dominance of graphs in real-world applications calls for new graph data management so that users can access graph data effectively and efficiently. In this paper, we study a graph pattern matching problem over a large data graph.
Abstract: There are numerous applications that need to deal with a large graph and to query reachability between nodes in the graph. A 2-hop cover can compactly represent the whole edge transitive closure of a graph in O(|V| · |E|^(1/2)) space, and can be used to answer reachability queries efficiently. However, it is challenging to compute a 2-hop cover. The existing approaches suffer from either large resource consumption or low compression rate.
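For context, a 2-hop cover answers a reachability query by intersecting label sets: u reaches v iff some center node appears in both u's out-label and v's in-label. A minimal sketch with hypothetical toy labels:

```python
def reachable(u, v, l_out, l_in):
    """u can reach v iff some center w is in both Lout(u) and Lin(v),
    i.e. u reaches w and w reaches v."""
    if u == v:
        return True
    return bool(l_out.get(u, set()) & l_in.get(v, set()))

# Toy labels for the chain a -> b -> c, with b chosen as the center.
l_out = {"a": {"b"}, "b": {"b"}, "c": set()}
l_in = {"a": set(), "b": {"b"}, "c": {"b"}}
print(reachable("a", "c", l_out, l_in))  # True
print(reachable("c", "a", l_out, l_in))  # False
```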
Abstract: In this paper, we devise efficient algorithms for mining association rules with adjustable accuracy. It is noted that several applications require mining the transaction data to capture customer behavior frequently. In those applications, the efficiency of data mining can be a more important factor than the requirement for complete accuracy of the mining results. Allowing imprecise results can significantly improve the data mining efficiency.
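One common way to trade accuracy for efficiency is to estimate itemset support from a sample of the transactions; the sampling scheme below only illustrates that trade-off and is not necessarily the adjustable-accuracy mechanism developed in the paper.

```python
import random

def estimate_support(transactions, itemset, sample_rate=0.1, seed=7):
    """Estimate itemset support from a random sample of transactions,
    trading a controlled loss of accuracy for speed."""
    random.seed(seed)
    sample = [t for t in transactions if random.random() < sample_rate]
    if not sample:
        return 0.0
    hits = sum(1 for t in sample if itemset <= t)
    return hits / len(sample)

transactions = [{"milk", "bread"}, {"milk", "eggs"},
                {"bread"}, {"milk", "bread", "eggs"}]
print(estimate_support(transactions, {"milk", "bread"}, sample_rate=1.0))  # 0.5
```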
Abstract: Mining data streams of changing class distributions is important for real-time business decision support. The stream classifier must evolve to reflect the current class distribution. This poses a serious challenge. On the one hand, relying on historical data may increase the chances of learning obsolete models. On the other hand, learning only from the latest data may lead to biased classifiers, as the latest data is often an unrepresentative sample of the current class distribution.
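One way to balance historical and recent data is a bounded ensemble whose members are re-weighted by their accuracy on the newest chunk; the weighting scheme sketched below is an assumption for illustration, not necessarily the method proposed here.

```python
from sklearn.tree import DecisionTreeClassifier

class ChunkEnsemble:
    """Bounded ensemble: one classifier per data chunk, each weighted by
    its accuracy on the newest chunk (an illustrative weighting scheme)."""

    def __init__(self, max_members=5):
        self.max_members = max_members
        self.members = []  # list of (model, weight) pairs

    def update(self, X_chunk, y_chunk):
        # Re-weight existing members by how well they fit the newest chunk,
        # so obsolete models fade while accurate ones keep contributing.
        self.members = [(m, m.score(X_chunk, y_chunk)) for m, _ in self.members]
        new_model = DecisionTreeClassifier(max_depth=5).fit(X_chunk, y_chunk)
        self.members.append((new_model, 1.0))
        self.members.sort(key=lambda mw: mw[1], reverse=True)
        self.members = self.members[: self.max_members]

    def predict(self, X):
        # Weighted vote; assumes binary labels 0/1 appear in every chunk.
        score = sum(w * m.predict(X) for m, w in self.members)
        total = sum(w for _, w in self.members)
        return (score / total >= 0.5).astype(int)
```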
Abstract: In recent years, the World Wide Web has shown enormous growth in size. Vast repositories of information are available on practically every possible topic. In such cases, it is valuable to perform topical resource discovery effectively. Consequently, several new ideas have been proposed in recent years; among them a key technique is focused crawling, which is able to crawl particular topical portions of the World Wide Web quickly, without having to explore all web pages.
This paper discusses a framework and provides an overview of general methods for optimizing the management of advertisements on web servers. We discuss the major issues which arise in web advertisement management, and describe basic mathematical techniques which can be employed to handle such problems. These include a number of statistical, optimization and scheduling models.
Abstract: Data stream values are often associated with multiple aspects. For example, each value from environmental sensors may have an associated type (e.g., temperature, humidity, etc.) as well as location. Aside from timestamp, type and location are the two additional aspects. How to model such streams? How to simultaneously find patterns within and across the multiple aspects? How to do it incrementally in a streaming fashion?
Abstract: We present data representations, distance measures and organizational structures for fast and efficient retrieval of similar shapes in image databases. Using the Hough Transform we extract shape signatures that correspond to important features of an image. The new shape descriptor is robust against line discontinuities and takes into consideration not only the shape boundaries, but also the content inside the object perimeter.
Abstract: Unlike traditional clustering methods that focus on grouping objects with similar values on a set of dimensions, clustering by pattern similarity finds objects that exhibit a coherent pattern of rise and fall in subspaces. Pattern-based clustering extends the concept of traditional clustering and benefits a wide range of applications, including large scale scientific data analysis, target marketing, Web usage analysis, etc.
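To make "coherent pattern of rise and fall" concrete, a pairwise shift-coherence test of the following form can be used; the delta threshold and the pScore-style measure are illustrative assumptions rather than this paper's exact definition.

```python
def coherent(obj_x, obj_y, columns, delta):
    """Check whether two objects rise and fall together on the given
    columns: for every pair of columns, the difference of their changes
    must stay within delta (a pScore-style test)."""
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            a, b = columns[i], columns[j]
            change_x = obj_x[a] - obj_x[b]
            change_y = obj_y[a] - obj_y[b]
            if abs(change_x - change_y) > delta:
                return False
    return True

# Two rows that differ only by a constant offset are perfectly coherent.
x = {"c1": 10, "c2": 14, "c3": 9}
y = {"c1": 100, "c2": 104, "c3": 99}
print(coherent(x, y, ["c1", "c2", "c3"], delta=1.0))  # True
```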
A clear trend of the Web is that a variety of new consumer devices with diverse processing powers, display capabilities, and network connections is gaining access to the Internet. Tailoring Web content to match the device characteristics requires functionalities for content transformation, namely transcoding, that are typically carried out by the content provider or by some proxy server at the edge.
Abstract: DNA microarray technology is about to bring an explosion of gene expression data that may dwarf even the human sequencing projects. Researchers are motivated to identify genes whose expression levels rise and fall coherently under a set of experimental perturbations, that is, they exhibit fluctuation of a similar shape when conditions change.
Technology breakthroughs are needed to: manage and analyze continuous streams for knowledge extraction; adapt system management rapidly based on changes of the data and the environment; make numerous real-time decisions about priorities of what inputs to examine, what analyses to execute, etc.; operate over physically distributed sites; be highly secure and support protection of private information; and be scalable in many dimensions.
Abstract: In many classification and data-mining applications the user does not know a priori which distance measure is the most appropriate for the task at hand without examining the produced results. Also, in several cases, different distance functions can provide diverse but equally intuitive results (according to the specific focus of each measure).
Abstract: Many patterns have been discovered to explain and analyze how people make friends. Among them is the triadic closure, supported by the principle of the transitivity of friendship, which means that, for an individual, the friends of her friend are more likely to become her new friends. However, people's motivations under this principle have not been well studied, and it is still unknown how this principle works in diverse situations.
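As a small illustration of the principle itself (not of this paper's analysis), the candidate links that triadic closure predicts are the friends-of-friends who are not yet direct friends:

```python
import networkx as nx

def triadic_closure_candidates(g, user):
    """Friends-of-friends who are not yet friends of `user`: the links
    the triadic-closure principle predicts are likely to form."""
    friends = set(g.neighbors(user))
    candidates = set()
    for f in friends:
        for fof in g.neighbors(f):
            if fof != user and fof not in friends:
                candidates.add(fof)
    return candidates

g = nx.Graph([("alice", "bob"), ("bob", "carol"), ("alice", "dave")])
print(triadic_closure_candidates(g, "alice"))  # {'carol'}
```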
Abstract: In recent years, a number of indirect data collection methodologies have led to the proliferation of uncertain data. Such data points are often represented in the form of a probabilistic function, since the corresponding deterministic value is not known. This increases the challenge of mining and managing uncertain data, since the precise behavior of the underlying data is no longer known. In this paper, we provide a survey of uncertain data mining and management applications.
Abstract: Monitoring continual queries or subscriptions means determining the subset of all queries or subscriptions whose predicates match a given event. Predicates contain not only equality but also non-equality clauses. Event matching is usually accomplished by first identifying a "small" candidate set of subscriptions for an event and then determining the matched subscriptions from the candidate set. Prior work has focused on using equality clauses to identify the candidate set.
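A minimal sketch of the two-step matching pattern described above, assuming each subscription is indexed under one of its equality clauses; the class and attribute names here are hypothetical.

```python
from collections import defaultdict

class SubscriptionIndex:
    """Index subscriptions by one equality clause (attribute, value) so a
    small candidate set can be fetched per event; the remaining predicates,
    including non-equality clauses, are then checked exactly."""

    def __init__(self):
        self.index = defaultdict(list)

    def add(self, sub_id, eq_attr, eq_value, predicate):
        # `predicate` is a callable that evaluates the full subscription.
        self.index[(eq_attr, eq_value)].append((sub_id, predicate))

    def match(self, event):
        matched = []
        for attr, value in event.items():
            for sub_id, predicate in self.index.get((attr, value), []):
                if predicate(event):  # verify all remaining clauses
                    matched.append(sub_id)
        return matched

idx = SubscriptionIndex()
idx.add("s1", "symbol", "IBM",
        lambda e: e.get("symbol") == "IBM" and e.get("price", 0) > 100)
print(idx.match({"symbol": "IBM", "price": 120}))  # ['s1']
```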
This paper presents distributed divergence control algorithms for epsilon serializability for both homogeneous and heterogeneous distributed databases. Epsilon serializability allows for more concurrency by permitting non-serializable interleavings of database operations among epsilon transactions.
Abstract: There has been a good deal of progress made recently toward the efficient parallelization of individual phases of single queries in multiprocessor database systems.