sam sung - Academia.edu

Papers by sam sung

Text Retrieval from Document Images Based on Word Shape Analysis

Applied Intelligence, 2003

In this paper, we propose a method of text retrieval from document images using a similarity measure based on word shape analysis. We extract image features directly instead of using optical character recognition. Document images are segmented into word units, and features called vertical bar patterns are then extracted from these word units through detection of local extrema points. All vertical bar patterns are used to build document vectors. Lastly, we obtain the pair-wise similarity of document images by means of the scalar product of the document vectors. Four corpora of news articles were used to test the validity of our method. During the test, the similarity of document images obtained with this method was compared with the results for the ASCII versions of those documents, based on the N-gram algorithm for text documents.
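The last step of the abstract above, pair-wise similarity as a scalar product of document vectors, can be sketched as follows. This is only an illustration under stated assumptions: the labels `p1` to `p4` are hypothetical stand-ins for vertical bar patterns, the function names are invented here, and the length normalization is an assumption (the abstract does not say whether the vectors are normalized).

```python
import math
from collections import Counter


def document_vector(features, vocabulary):
    """Count each feature in the document and scale the vector to unit length.

    Unit-length scaling is an assumption of this sketch, not stated in the abstract.
    """
    counts = Counter(features)
    vec = [counts[f] for f in vocabulary]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def similarity(u, v):
    """Scalar (dot) product of two document vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical feature labels standing in for extracted vertical bar patterns.
doc_a = ["p1", "p2", "p2", "p3"]
doc_b = ["p2", "p3", "p3", "p4"]
vocabulary = sorted(set(doc_a) | set(doc_b))

vec_a = document_vector(doc_a, vocabulary)
vec_b = document_vector(doc_b, vocabulary)
score = similarity(vec_a, vec_b)
print(round(score, 4))
```

With unit-length vectors the score lies between 0 and 1, and a document compared with itself scores 1.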

A trimmed mean approach to finding spatial outliers

Intelligent Data Analysis, 2004


Virtual Card Payment Protocol and Risk Analysis Using Performance Scoring

... Lilia Yerosheva, Shannon K. Kuntz, Peter M. Kogge, Jay B. Brockman. Page: 3. Influence of Array Allocation Mechanisms on Memory System Energy. R. Athavale, Narayanan Vijaykrishnan, Mahmut T. Kandemir, Mary Jane Irwin. Page: 3. A PIM-based Multiprocessor System. ...

Forecasting Association Rules Using Existing Data Sets

IEEE Transactions on Knowledge and Data Engineering, 2003

An important issue that needs to be addressed when using data mining tools is the validity of the rules outside of the data set from which they are generated. Rules are typically derived from the patterns in a particular data set. When a new situation occurs, the change in the set of rules obtained from the new data set could be significant. In this paper, we provide a novel model for understanding how the differences between two situations affect the changes of the rules, based on the concept of fine partitioned groups that we call caucuses. Using this model, we provide a simple technique called combination data set, to get a good estimate of the set of rules for a new situation. Our approach works independently of the core mining process and it can be easily implemented with all variations of rule mining techniques. Through experiments with real-life and synthetic data sets, we show the effectiveness of our technique in finding the correct set of rules under different situations.

Efficient Wrapper Reinduction from Dynamic Web Sources

Discovery of maximum length frequent itemsets

Information Sciences, 2008

The use of frequent itemsets has been limited by the high computational cost as well as the large number of resulting itemsets. In many real-world scenarios, however, it is often sufficient to mine a small representative subset of frequent itemsets with low computational cost. To that end, in this paper, we define a new problem of finding the frequent itemsets with a maximum length and present a novel algorithm to solve this problem. Indeed, maximum length frequent itemsets can be efficiently identified in very large data sets and are useful in many application domains. Our algorithm generates the maximum length frequent itemsets by adapting a pattern fragment growth methodology based on the FP-tree structure. Also, a number of optimization techniques have been exploited to prune the search space. Finally, extensive experiments on real-world data sets validate the proposed algorithm.
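To make the problem statement above concrete, here is a brute-force sketch that returns the frequent itemsets of maximum length for a tiny data set. It is not the FP-tree-based algorithm the abstract describes; it only illustrates what the output of such an algorithm would be. The function name and the use of an absolute support count (`min_support`) are assumptions of this sketch.

```python
from itertools import combinations


def max_length_frequent_itemsets(transactions, min_support):
    """Return all frequent itemsets of the greatest possible length.

    Brute force: try candidate lengths from longest to shortest and stop at the
    first length that has at least one itemset contained in >= min_support
    transactions. Exponential in the number of items; for illustration only.
    """
    transactions = [frozenset(t) for t in transactions]
    if not transactions:
        return []
    items = sorted(set().union(*transactions))
    longest = max(len(t) for t in transactions)
    for k in range(longest, 0, -1):
        found = [
            candidate
            for candidate in combinations(items, k)
            if sum(1 for t in transactions if t.issuperset(candidate)) >= min_support
        ]
        if found:
            return found
    return []


transactions = [
    {"a", "b", "c"},
    {"a", "b", "c", "d"},
    {"a", "b"},
    {"b", "c"},
]
result = max_length_frequent_itemsets(transactions, min_support=2)
print(result)
```

Searching from the longest candidates downward means the function can stop at the first length with a frequent itemset, which is the point of the problem definition: only the longest itemsets are reported, not every frequent one.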

Performance Analysis of Disk Modulo Allocation Method for Cartesian Product Files

IEEE Transactions on Software Engineering, 1987

Cartesian product files have been shown to exhibit attractive properties for partial match queries. The Disk Modulo (DM) allocation method has been shown to perform well in distributing Cartesian product files over an m-disk system. However, no explicit expression had previously been given for the DM method's response time to a given partial match query. In this paper, based upon the discrete Fourier transform, we derive a formula for this computation. With this representation, the performance characteristics of the DM method can be given an analytic interpretation. Some theoretical results are derived from this formula. We also use our formula to analyze the performance of several popular Disk Modulo algorithms.
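For readers unfamiliar with the DM method, the following sketch shows the allocation rule as it is usually stated, bucket (i1, ..., in) goes to disk (i1 + ... + in) mod m, and computes a query's response time by enumeration. This is a brute-force illustration, not the closed-form formula derived in the paper. The function names, the convention that `None` marks an unspecified field, and the definition of response time as the largest number of qualifying buckets on any single disk are assumptions of this sketch.

```python
from collections import Counter
from itertools import product


def dm_disk(bucket, m):
    """Disk Modulo rule: the bucket's coordinates are summed modulo m."""
    return sum(bucket) % m


def response_time(field_sizes, query, m):
    """Largest number of qualifying buckets that land on any one of m disks.

    field_sizes[i] is the number of partitions of field i; query[i] is a fixed
    partition index, or None when the field is left unspecified.
    """
    ranges = [
        range(size) if q is None else [q]
        for size, q in zip(field_sizes, query)
    ]
    load = Counter(dm_disk(bucket, m) for bucket in product(*ranges))
    return max(load.values())


# Two fields with 4 partitions each, spread over 4 disks.
print(response_time((4, 4), (None, None), m=4))  # 16 buckets, evenly spread
print(response_time((4, 4), (1, None), m=4))     # 4 buckets, one per disk
print(response_time((2, 2), (None, None), m=4))  # sums 0, 1, 1, 2 collide on disk 1
```

Enumeration costs time proportional to the number of qualifying buckets, which is why an explicit expression of the kind the abstract describes is useful for analysis.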

A comprehensive comparative study on term weighting schemes for text categorization with support vector machines

The term weighting scheme, which is used to convert documents into vectors in the term space, is a vital step in automatic text categorization. In this paper, we conducted comprehensive experiments to compare various term weighting schemes with SVM on two widely used benchmark data sets. We also presented a new term weighting scheme, tf.rf, to improve the term's discriminating power. The controlled experimental results showed that the newly proposed tf.rf scheme is significantly better than other widely used term weighting schemes. Compared with schemes based on the tf factor alone, the idf factor does not improve, and may even decrease, the term's discriminating power for text categorization.
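The abstract does not give the tf.rf formula. The sketch below uses the form in which the relevance frequency is commonly stated, rf = log2(2 + a / max(1, c)), where a is the number of positive-category documents containing the term and c is the number of negative-category documents containing it. Treat that formula, the function names, and the example counts as assumptions of this illustration rather than a quotation from the paper.

```python
import math


def relevance_frequency(a, c):
    """rf = log2(2 + a / max(1, c)).

    a: positive-category documents containing the term (assumed definition)
    c: negative-category documents containing the term (assumed definition)
    max(1, c) avoids division by zero when the term never occurs in negatives.
    """
    return math.log2(2 + a / max(1, c))


def tf_rf(tf, a, c):
    """Term weight: raw term frequency scaled by relevance frequency."""
    return tf * relevance_frequency(a, c)


# A term concentrated in the positive category gets a larger weight than one
# that appears only in the negative category, for the same term frequency.
print(tf_rf(2, a=6, c=1))
print(tf_rf(2, a=0, c=5))
```

Unlike idf, this weight depends on how a term is distributed across the positive and negative categories rather than on how many documents contain it overall.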

Caucus-based Transaction Clustering

Proceedings of the Eighth International Conference on Database Systems for Advanced Applications (DASFAA'03), © 2003 IEEE

Performance Evaluation and Analysis of Protocols for IP Mobility Support: A Quantitative Study

Peng Sun, Sam Y. Sung, Zhao Li. Department of Computer Science, National University of Singapore. {sunpeng1, ssung, zhaoli}@comp.nus.edu.sg ...
