lan anh vu - Academia.edu
Papers by lan anh vu
IEEE International Conference on High Performance Computing, Data, and Analytics, Apr 12, 2015
In this paper, we present a new parallel method named SDFEM that enables high-performance frequent pattern mining (FPM) on clusters with multiple multi-core compute nodes. SDFEM is distinguished from previous parallel FPM work by three advanced features that provide high mining performance for large-scale data analytics applications. First, SDFEM combines the shared memory and distributed memory computational models to leverage the benefits of shared memory within each node of the cluster. Second, it employs a multi-strategy load balancing approach to address the most challenging issue of parallel FPM: balancing the mining workload among all cores of the cluster. Finally, its self-adaptive mining solution dynamically adjusts to the characteristics of the database so that it performs efficiently on both sparse and dense data. For performance evaluation, we implement SDFEM using a hybrid of OpenMP and MPI, in which OpenMP implements the shared memory model and MPI handles message passing. SDFEM has been tested on a cluster of 12-core shared memory compute nodes. Our experimental results on real databases show that SDFEM is up to 329.5% faster than the parallel FPM approach that uses only the distributed memory model with message passing (i.e., pure MPI). In addition, SDFEM achieves a speedup of up to 45.4-64.8 on 120 cores (i.e., 10 compute nodes with 12 cores per node).
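The reported speedups can be put in context by converting them to parallel efficiency (speedup divided by core count). The short Python sketch below does only that arithmetic on the figures quoted in the abstract; the function name `parallel_efficiency` is an illustrative choice and is not taken from the paper.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Parallel efficiency = speedup / number of cores (1.0 means perfect scaling)."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return speedup / cores


# Figures quoted in the abstract: 10 compute nodes with 12 cores each.
CORES = 10 * 12

for speedup in (45.4, 64.8):
    eff = parallel_efficiency(speedup, CORES)
    print(f"speedup {speedup} on {CORES} cores -> efficiency {eff:.1%}")
```

Under this standard definition, the quoted speedups correspond to roughly 38% to 54% efficiency on 120 cores.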
IEEE International Conference on High Performance Computing, Data, and Analytics, Apr 12, 2015
Frequent pattern mining (FPM) is an important and computationally intensive task in data mining. We present a novel method, CGMM (CPU & GPU based Multi-strategy Mining), that combines the computing power of the CPU and GPU to speed up frequent pattern mining. CGMM employs two different mining strategies and dynamically switches between them: the CPU-based strategy uses the FP-tree data structure to perform the mining task on the CPU, while the GPU-based strategy converts the allocated data portions to bit vectors and works mainly on the GPU. This approach has the following advantages over existing methods: (1) it uses the parallel processing capability of the GPU for computationally intensive portions and the flexibility and low memory latency of the CPU for the sophisticated processing needed to manipulate the more complex data structures, which enhances overall performance; (2) it applies two mining strategies to efficiently mine both sparse and dense databases. The performance evaluation of CGMM on a machine with AMD CPUs and NVIDIA Tesla GPUs shows that, in the best cases, the proposed method runs up to 229 times faster than well-known sequential FPM algorithms and 7.2 to 13.9 times faster than GPApriori, a GPU-based algorithm for FPM. In addition to outperforming them, CGMM has more stable performance on both dense and sparse test datasets.
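To illustrate the general idea behind a bit-vector representation, the sketch below counts the support of an itemset by ANDing per-item bit vectors and counting the set bits. It is a minimal CPU-only illustration in plain Python, not CGMM's GPU implementation; the helper names `build_bit_vectors` and `support` and the toy transactions are assumptions made for this example.

```python
def build_bit_vectors(transactions):
    """Map each item to an int whose bit i is set when transaction i contains the item."""
    vectors = {}
    for tid, items in enumerate(transactions):
        for item in items:
            vectors[item] = vectors.get(item, 0) | (1 << tid)
    return vectors


def support(itemset, vectors):
    """Support of a non-empty itemset = popcount of the AND of its items' bit vectors."""
    items = list(itemset)
    if not items:
        raise ValueError("itemset must be non-empty")
    acc = vectors.get(items[0], 0)
    for item in items[1:]:
        acc &= vectors.get(item, 0)  # an unseen item contributes an all-zero vector
    return bin(acc).count("1")


# Toy database (hypothetical data, for illustration only).
transactions = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
vectors = build_bit_vectors(transactions)

print(support({"a", "c"}, vectors))       # transactions 0, 1 and 3 contain both
print(support({"a", "b", "c"}, vectors))  # transactions 0 and 3 contain all three
```

Because the intersection reduces to bitwise AND plus a population count over fixed-width words, this kind of representation maps naturally onto data-parallel hardware.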
Proceedings of the 2013 International Workshop on Data-Intensive Scalable Computing Systems - DISCS-2013, 2013
Frequent pattern mining is an important problem in data mining with many practical applications. Current parallel methods for mining frequent patterns perform inconsistently across database types and under-utilize the benefits of multi-core shared memory machines. We present ShaFEM, a novel parallel frequent pattern mining method, to address these issues. Our method dynamically adapts to the data characteristics to perform efficiently on both sparse and dense databases. Its lock-free parallel mining approach minimizes synchronization needs and maximizes data independence to enhance scalability. Its structure lends itself well to dynamic job scheduling, resulting in a well-balanced load on new multi-core shared memory architectures. We evaluate ShaFEM on a 12-core multi-socket server and find that our method runs 2.1 to 5.8 times faster than the state-of-the-art parallel method. For some test cases, we show that ShaFEM saves 4.9 days and 12.8 hours of execution time over the compared method.
Parallel Computing, 2014
Association rule mining (ARM) is an important task in data mining with many practical applications. Current methods for association rule mining show unstable performance across database types and under-utilize the benefits of multi-core shared memory machines. In this paper, we address these issues by presenting a novel parallel method for finding frequent patterns, the most computationally intensive phase of ARM. Our proposed method, named ShaFEM, combines two mining strategies and applies the most appropriate one to each data subset of the database, so that it adapts efficiently to the data characteristics and runs fast on both sparse and dense databases. In addition, our new lock-free design minimizes synchronization needs and maximizes data independence to enhance scalability. The new structure lends itself well to dynamic job scheduling, resulting in a well-balanced load on new multi-core shared memory architectures. We have evaluated ShaFEM on 12-core multi-socket servers and found that our method runs up to 5.8 times faster and consumes up to 7.1 times less memory than the state-of-the-art parallel method. For some test cases, ShaFEM saves up to 4.9 days of execution time over the compared method.
2012 International Conference on Collaboration Technologies and Systems (CTS), 2012
Mining frequent patterns is a fundamental data mining task with numerous practical applications such as consumer market-basket analysis, web mining, and network intrusion detection. When the database is large, executing this mining task on a personal computer is non-trivial because of the huge computation time and memory consumption. In our previous research, we proposed a novel algorithm named FEM, which is more efficient than well-known algorithms such as Apriori, Eclat, and FP-growth in discovering frequent patterns from both dense and sparse databases. However, to apply FEM to applications with large-scale databases, it is essential to develop new parallel algorithms that are based on FEM and to deploy this mining task on high performance computer systems. In this paper, we present a new method named PFEM that parallelizes the FEM algorithm for a cluster of multi-core machines. Our proposed method allows each machine in the cluster to execute an independent mining workload to improve scalability. Computations within a multi-core machine use the shared memory model to reduce communication overhead and maintain load balance. By combining the distributed memory and shared memory computational models, PFEM adapts well to large computer systems with many multi-core machines.
Journal of Intelligent Systems, 2015
In this article, we present a new approach for frequent pattern mining (FPM) that runs fast on both sparse and dense databases. We also introduce two algorithms based on this approach, FEM and DFEM. FEM applies a fixed threshold as the condition for switching between the two mining strategies, whereas DFEM adapts this threshold dynamically at runtime to best fit the characteristics of the database during the mining process, especially when the minimum support threshold is low. Additionally, we present optimization techniques for the proposed algorithms that speed up the mining process, reduce memory usage, and optimize the I/O cost. We also analyze the performance of FEM and DFEM in depth and compare them with several existing algorithms. The experimental results show that FEM and DFEM achieve a significant improvement in execution time and consume less memory than many popular FPM algorithms, including the well-known Apriori, FP-growth, and Eclat.
2014 World Congress on Computer Applications and Information Systems (WCCAIS), 2014
Traditional association rule mining based on the support-confidence framework provides an objective measure of the rules that are of interest to users. However, it does not reflect the semantic measure among the items. The semantic measure of an itemset is characterized by utility values that are typically associated with transaction items, where a user is interested in an itemset only if it satisfies a given utility constraint. In this paper, we first define the problem of finding association rules using the utility-confidence framework, which is a generalization of the amount-confidence measure. Using this semantic concept of rules, we then propose a compressed representation for association rules having minimal antecedent and maximal consequent. This representation is generated with the help of high utility closed itemsets (HUCI) and their generators. We propose algorithms to generate the utility-based non-redundant association rules and methods for reconstructing all association rules. Furthermore, we describe the algorithms that generate high utility itemsets (HUI) and high utility closed itemsets with their generators. The proposed algorithms are evaluated on both synthetic and real datasets. The results demonstrate better efficiency and effectiveness of the proposed HUCI-Miner algorithm compared to other well-known existing algorithms. In addition, the experimental results show better quality in the compressed representation of the entire rule set under the considered framework.
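For readers unfamiliar with the traditional support-confidence framework that this abstract takes as its starting point, the sketch below computes the support count of an itemset and the confidence of a rule X -> Y as support(X ∪ Y) / support(X). It illustrates only that baseline, not the utility-confidence measure or the HUCI-Miner algorithm proposed in the paper; the function names and the toy transactions are assumptions made for this example.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(antecedent, consequent, transactions):
    """Confidence of antecedent -> consequent: support(X ∪ Y) / support(X)."""
    base = support_count(antecedent, transactions)
    if base == 0:
        return 0.0  # convention chosen here: a rule with an unseen antecedent gets 0
    return support_count(antecedent | consequent, transactions) / base


# Toy market-basket data (hypothetical, for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support_count({"bread", "milk"}, transactions))
print(confidence({"bread"}, {"milk"}, transactions))
```

In this toy database, bread appears in three transactions and bread together with milk in two, so the rule bread -> milk has confidence 2/3. Neither quantity takes item quantities or profits into account, which is the gap that utility-based measures aim to fill.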