A Comparative Study of Association Rule Mining Algorithms on Grid and Cloud Platform
Related papers
A Survey on Association Rule Mining Algorithm and Architecture for Distributed Processing
2014
Association rule mining is a data mining technique used to uncover previously unknown patterns or rules in huge databases, usually terabytes or petabytes of data. There are many popular algorithms for mining association rules, such as Apriori, partitioning, and dynamic itemset counting. The main drawback of these algorithms is their sequential nature: processing large databases sequentially is time-consuming and leads to scalability and performance problems. To avoid these problems, we look to parallel or distributed association rule mining to provide scalability and better performance.
Mining Association Rules in Various Computing Environments: A Survey
2016
Association Rule Mining (ARM) is one of the best-known and most researched techniques of data mining. A very large number of ARM algorithms have been designed. In this paper we survey ARM algorithms in four computing environments: sequential computing, parallel and distributed computing, grid computing, and cloud computing. With the emergence of each new computing paradigm, researchers have designed ARM algorithms that improve efficiency by exploiting that paradigm. This paper presents the journey of ARM algorithms from sequential algorithms, through parallel, distributed, and grid-based algorithms, to the current state of the art, along with the motives for adopting each new platform.
Parallel and Distributed Association Rule Mining Algorithms: A Recent Survey
Information Management and Computer Science, 2019
Data analysis is essential nowadays because of rapidly growing electronic technology, which generates large volumes of transactional data logs from a range of source devices. Parallel and distributed computing is a useful approach for enhancing the data mining process. The aim of this research is to present a systematic review of parallel association rule mining (PARM) and distributed association rule mining (DARM) approaches. We have observed that parallelized approaches based on Apriori, equivalence classes, Hadoop (MapReduce), and Spark prove very efficient in PARM and DARM environments. We conclude that this comprehensive review and the references cited in it convey the foremost hypothetical issues and offer researchers a guide to interesting research directions. The most important hypothetical issues and challenges include the large size of databases, the dimensionality of data, indexing schemes for data in the database, data skewness, database location, load balancing strategies, methods of adapting to incremental databases, and the orientation of the database.
Executing association rule mining algorithms under a Grid computing environment
Proceedings of the Workshop on Parallel and Distributed Systems Testing, Analysis, and Debugging - PADTAD '11, 2011
Grids are now regarded as promising platforms for data- and computation-intensive applications like data mining. However, the exploration of such large-scale computing resources necessitates the development of new distributed algorithms. The major challenge facing the developers of distributed data mining algorithms is how to correct the load imbalance that occurs during execution. This load imbalance is due to the dynamic nature of data mining algorithms (i.e., the load cannot be predicted before execution) and the heterogeneity of Grid computing systems. In this paper, we propose a dynamic load balancing strategy for distributed association rule mining algorithms under a Grid computing environment. We evaluate the performance of the proposed strategy using Grid'5000, a Grid infrastructure distributed across nine sites in France for research in large-scale parallel and distributed systems.
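The load imbalance described in this abstract can be illustrated with a small simulation. The sketch below is not the strategy proposed in the paper; it only contrasts a generic pull-based scheme (each task goes to whichever worker becomes free first) with a static round-robin split on workers of different speeds. The function names, task costs, and speeds are all assumptions chosen for illustration.

```python
import heapq

def simulate_dynamic_balancing(task_costs, worker_speeds):
    """Assign each task to whichever worker becomes free first.

    task_costs: work units per task (hypothetical; in real mining these are
                not known in advance, they are used here only to advance time).
    worker_speeds: work units per second for each heterogeneous worker.
    Returns (makespan, assignment), where assignment[i] is the worker for task i.
    """
    # Heap of (time_when_free, worker_index); every worker is idle at t = 0.
    free_at = [(0.0, w) for w in range(len(worker_speeds))]
    heapq.heapify(free_at)
    assignment = []
    for cost in task_costs:
        t, w = heapq.heappop(free_at)  # earliest-available worker
        assignment.append(w)
        heapq.heappush(free_at, (t + cost / worker_speeds[w], w))
    makespan = max(t for t, _ in free_at)
    return makespan, assignment

def simulate_static_split(task_costs, worker_speeds):
    """Baseline: tasks are dealt out round-robin before execution starts."""
    finish = [0.0] * len(worker_speeds)
    for i, cost in enumerate(task_costs):
        w = i % len(worker_speeds)
        finish[w] += cost / worker_speeds[w]
    return max(finish)

if __name__ == "__main__":
    costs = [4, 4, 4, 4, 4, 4]
    speeds = [2.0, 1.0]  # the first worker is twice as fast
    print(simulate_dynamic_balancing(costs, speeds))
    print(simulate_static_split(costs, speeds))
```

With one worker twice as fast as the other, the pull-based scheme gives the faster worker more tasks, so the slowest worker finishes sooner than under the static split.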
2012
Decreasing hardware costs and advances in computer networking technologies have led to exponential growth in the use of large-scale parallel and distributed computing systems. One of the biggest issues in such systems is the development of effective techniques and algorithms for distributing the processes or load of a parallel program across multiple hosts to achieve goals such as minimizing execution time, minimizing communication delays, maximizing resource utilization, and maximizing throughput. Algorithms known as load balancing algorithms help to achieve these goals. The objective of this paper is to identify the challenges of dynamic load balancing for association rule mining algorithms in a distributed computing environment. In future, this work can be extended to analyze the efficiency of existing algorithms and, if they prove inefficient, to develop an algorithm for dynamic load balancing of association rule mining (ARM) in a distributed computing environment.
Parallel Association Rule Mining by Data De-Clustering to Support Grid Computing
Most association rule mining algorithms suffer from the time-consuming work of finding all candidates that satisfy the user-specified conditions. We believe the most effective remedy is to develop parallel algorithms to improve performance. However, prior parallel architectures and algorithms suffer from overhead in inter-site communication or require a large amount of space to maintain the local support counts of a large number of candidate sets. In this paper, we propose a parallel approach that completely eliminates the inter-site communication cost for the most influential algorithm, Apriori, and its variations. This merit makes our approach easy to deploy in a grid computing environment. Our work is based on the idea of data de-clustering: the transaction database is de-clustered into partitions for all participating sites, which guarantees that all subgroups are not only quite similar to each other but also quite similar to the original group. To balance the workload of the most time-consuming subtask (i.e., the candidate itemset generation process) across all participating sites, elements in the frequent 1-itemset are dispatched in row-prime order to each processor to execute in parallel. We have conducted experiments to show that the result obtained by our approach is almost the same as that obtained by running the Apriori algorithm on a single site. Moreover, with m processors in our parallel approach, the total speedup can reach up to m², which makes our work a very efficient and effective approach.
SeaRum: A Cloud-Based Service for Association Rule Mining
2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 2013
Large volumes of data are being produced by various modern applications at an ever-increasing rate. These applications range from wireless sensor networks to social networks. The automatic analysis of such huge data volumes is a challenging task, since a large amount of interesting knowledge can be extracted. Association rule mining is an exploratory data analysis method able to discover interesting and hidden correlations among data. Since this data mining process is characterized by computationally intensive tasks, efficient distributed approaches are needed to increase its scalability. This paper proposes a novel cloud-based service, named SEARUM, to efficiently mine association rules on a distributed computing model. SEARUM consists of a series of distributed MapReduce jobs run in the cloud. Each job performs a different step in the association rule mining process. As a case study, the proposed approach has been applied to the network data scenario. The experimental validation, performed on two real network datasets, shows the effectiveness and the efficiency of SEARUM in mining association rules on a distributed computing model.
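To make the idea of a MapReduce step in association rule mining concrete, the following single-process sketch imitates one such job: counting the support of k-item sets. It is not SEARUM's actual job definition, and it does not use Hadoop. The function names (`map_phase`, `shuffle`, `reduce_phase`, `count_itemsets`) are hypothetical, and enumerating every k-subset of a transaction is a simplification that only suits short transactions.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(transaction, k):
    """Mapper: emit (itemset, 1) for every k-item subset of one transaction."""
    for itemset in combinations(sorted(set(transaction)), k):
        yield itemset, 1

def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped, min_count):
    """Reducer: sum the counts per itemset and keep those meeting min_count."""
    totals = {key: sum(values) for key, values in grouped.items()}
    return {key: n for key, n in totals.items() if n >= min_count}

def count_itemsets(transactions, k, min_count):
    """Run the map, shuffle, and reduce phases in sequence over all transactions."""
    pairs = (pair for t in transactions for pair in map_phase(t, k))
    return reduce_phase(shuffle(pairs), min_count)

if __name__ == "__main__":
    data = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"]]
    print(count_itemsets(data, k=2, min_count=2))
```

In a real deployment the mapper calls would run on different nodes over different blocks of the transaction file, and the shuffle would be performed by the framework rather than by an in-memory dictionary.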
Mining of Association Rules on Large Database Using Distributed and Parallel Computing
Procedia Computer Science, 2016
Nowadays, due to the rapid growth of data in organizations, extensive data processing is a central concern of information technology. Mining association rules in a large database is a challenging task. The Apriori algorithm is widely used to find frequent itemsets in a database, but it is inefficient for large databases because it incurs a heavy I/O load. Many later algorithms and parallel algorithms (models) address this drawback of Apriori, but they too are inefficient at finding frequent itemsets in large databases quickly. Hence a hybrid architecture is proposed that integrates distributed and parallel computing. The main idea of the new architecture is to combine distributed and parallel computing so that frequent itemsets can be found in large databases in less time. It also handles large databases more efficiently than existing algorithms.
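The I/O cost attributed to Apriori comes from its level-wise structure: each level of candidate itemsets requires another full pass over the database. A minimal in-memory sketch of the textbook sequential algorithm is shown below; it is not the hybrid architecture proposed in the paper, and the function name `apriori` and the absolute `min_count` threshold are choices made for this illustration.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: union pairs of frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        # One full scan of the database per level; on disk this is the I/O cost.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent

if __name__ == "__main__":
    baskets = [["bread", "milk"], ["bread", "butter", "milk"],
               ["butter", "milk"], ["bread", "butter"]]
    for itemset, count in apriori(baskets, min_count=2).items():
        print(sorted(itemset), count)
```

The `while` loop runs once per itemset size, so a database whose longest frequent itemset has n items is scanned about n times, which is why the scan count matters for databases that do not fit in memory.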
Performance Analysis of Distributed Association Rule Mining with Apriori Algorithm
International Journal of Computer Theory and Engineering, 2011
One of the most crucial problems in data mining is association rule mining. It requires large computation and I/O traffic capacity. One approach to resolving this problem is the use of distributed data mining algorithms in a grid, which offers an effective way to mine large data sets. Therefore, we implemented distributed data mining with the Apriori algorithm in a grid environment. However, the use of a grid environment raises some issues about the optimization of the Apriori algorithm, especially the cost of node-to-node communication and data distribution. In this paper, an Optimized Distributed Association rule mining approach for geographically distributed data is introduced for a parallel and distributed environment, which reduces communication costs.
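One common way to limit node-to-node traffic in distributed Apriori is to keep transactions where they are and exchange only support counts for a shared candidate list. The sketch below illustrates that general count-exchange idea in a single process; it is an assumption made for illustration, not the optimized approach introduced in the paper. The names `local_counts` and `global_frequent` are hypothetical.

```python
from collections import Counter

def local_counts(partition, candidates):
    """Each site scans only its own transactions and counts the shared candidates."""
    counts = Counter()
    for transaction in partition:
        items = set(transaction)
        for candidate in candidates:
            if candidate <= items:  # candidate is a subset of the transaction
                counts[candidate] += 1
    return counts

def global_frequent(partitions, candidates, min_count):
    """Merge per-site counts; only these small count tables would cross the network."""
    total = Counter()
    for partition in partitions:
        # Stands in for receiving one site's count table from a remote node.
        total.update(local_counts(partition, candidates))
    return {c: n for c, n in total.items() if n >= min_count}

if __name__ == "__main__":
    site_a = [["a", "b"], ["a", "c"]]
    site_b = [["a", "b"], ["b", "c"]]
    candidates = [frozenset("ab"), frozenset("ac"), frozenset("bc")]
    print(global_frequent([site_a, site_b], candidates, min_count=2))
```

The amount of data exchanged per level depends on the number of candidates rather than on the number of transactions, which is the property that makes this style attractive when the data is geographically distributed.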