Association Rule Mining using Apriori Algorithm for Distributed System: a Survey

Performance Analysis of Association Rule Mining Using Hadoop

Association rule mining plays a major role in discovering associations between different items in large datasets. There are several association algorithms, among which the Apriori algorithm is the most suitable one. However, the Apriori algorithm is designed to run on a single node or computer, which limits its use on large datasets. There have been various studies on parallelizing the algorithm. In this paper, Apache G-Hadoop was chosen as the distributed framework to implement the algorithm and to evaluate its performance. The performance analysis shows the most suitable platform for distributed association rule mining.

An Efficient Implementation of Apriori Algorithm Based on Hadoop-Mapreduce Model

2012

Finding frequent itemsets is one of the most important fields of data mining. The Apriori algorithm is the most established algorithm for finding frequent itemsets from a transactional dataset; however, it needs to scan the dataset many times and to generate many candidate itemsets. When the dataset is huge, both memory use and computational cost can be very expensive. In addition, a single processor's memory and CPU resources are very limited, which makes the algorithm inefficient. Parallel and distributed computing are effective strategies for accelerating algorithm performance. In this paper, we have implemented an efficient MapReduce Apriori algorithm (MRApriori) based on the Hadoop-MapReduce model, which needs only two phases (MapReduce jobs) to find all frequent k-itemsets, and compared our proposed MRApriori algorithm with two existing algorithms, which need either one or k phases (k is the maximum length of frequent itemsets) to find the same freq...
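The repeated scans and candidate generation mentioned in this abstract can be illustrated with a minimal single-node sketch of the classic level-wise Apriori procedure. This is not the MRApriori implementation from the paper; the function name `apriori` and the use of an absolute support count (rather than a fraction) are assumptions made for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): count} for itemsets with count >= min_support.

    min_support is an absolute transaction count (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]

    # First scan: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # One full scan of the dataset per level: this is the repeated-scan cost.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent
```

Calling `apriori(baskets, 3)` on a list of item lists returns every itemset contained in at least three baskets. Each iteration of the `while` loop is one pass over the whole dataset, which is the cost that the one-, two-, and k-phase MapReduce variants try to distribute.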

Review of Apriori Based Algorithms on MapReduce Framework

The Apriori algorithm, which mines frequent itemsets, is one of the most popular and widely used data mining algorithms. Nowadays, many algorithms have been proposed on parallel and distributed platforms to enhance the performance of the Apriori algorithm. They differ from each other in the load balancing technique, memory system, data decomposition technique and data layout used to implement them. The problems with most distributed frameworks are the overhead of managing the distributed system and the lack of a high-level parallel programming language. Also, with grid computing there is always a chance of node failures, which cause multiple re-executions of tasks. These problems can be overcome by the MapReduce framework introduced by Google. MapReduce is an efficient, scalable and simplified programming model for large-scale distributed data processing on a large cluster of commodity computers, and it is also used in cloud computing. In this paper, we present an overview of parallel Apriori algorithms implemented on the MapReduce framework. They are categorized on the basis of the Map and Reduce functions used to implement them, e.g. 1-phase vs. k-phase, the I/O of the Mapper, Combiner and Reducer, use of Combiner functionality inside the Mapper, etc. This survey discusses and analyzes the various implementations of Apriori on the MapReduce framework on the basis of their distinguishing characteristics. Moreover, it also covers the advantages and limitations of the MapReduce framework.
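One phase of a k-phase design, with Combiner functionality folded into the Mapper, might look roughly like the following. This is a plain-Python simulation rather than Hadoop code, and the names `map_count`, `reduce_count` and `run_phase` are hypothetical; the sketch assumes each input split fits in memory and that the candidate itemsets for the current level are already known.

```python
from collections import Counter


def map_count(split, candidates):
    """Mapper with in-mapper combining: emit (itemset, local_count) per split."""
    local = Counter()
    for transaction in split:
        items = frozenset(transaction)
        for candidate in candidates:
            if candidate <= items:
                local[candidate] += 1
    # Emitting local sums instead of (itemset, 1) per match reduces shuffle volume.
    return list(local.items())


def reduce_count(pairs, min_support):
    """Reducer: sum the partial counts and keep itemsets meeting min_support."""
    totals = Counter()
    for itemset, count in pairs:
        totals[itemset] += count
    return {s: c for s, c in totals.items() if c >= min_support}


def run_phase(splits, candidates, min_support):
    """Simulate one MapReduce job: map every split, then reduce all emitted pairs."""
    emitted = []
    for split in splits:
        emitted.extend(map_count(split, candidates))
    return reduce_count(emitted, min_support)
```

In a k-phase scheme, a driver would call something like `run_phase` once per itemset length, generating the next level's candidates from the previous output. A 1-phase scheme would instead have the mapper emit every subset of each transaction in a single job.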

Apriori Association Rule Algorithms using VMware Environment

Research Journal of Applied Sciences, Engineering and Technology, 2014

The aim of this study is to carry out research in distributed data mining using a cloud platform. Distributed data mining has become a vital component of big data analytics due to the development of network and distributed technology. The MapReduce Hadoop framework is a very familiar concept in big data analytics. Association rule mining is one of the popular data mining techniques, and it finds relationships between different transactions. Weighted Apriori and hash-T Apriori algorithms for association rule mining were run on a MapReduce Hadoop framework using a retail data set of transactions. This study describes the above concepts, explains the experiment carried out with the retail data set in a VMware environment, and compares the performance of the weighted Apriori and hash-T Apriori algorithms in terms of memory and time.

A Multi-Nodal Implementation of Apriori Algorithm for Big Data Analytics using MapReduce Framework

2020

This paper developed a distributed algorithm for Big Data Analytics to address the delay in the processing of big data. To achieve the aim of this research, an inspection of organizational documents, direct observation, and collection of existing data from the National Health Insurance Scheme (NHIS) in Nigeria were carried out. The algorithm was formulated using Apriori Association Rule Mining and was specified using the enterprise application diagram. The prototype of the algorithm was implemented using MongoDB as the big data storage mechanism for the input. Comma Separated Values (CSV) files were used as the storage facility for the intermediate results generated during processing, and MySQL was used as the storage mechanism for the final output. Finally, Apache MapReduce was used as the big data multi-nodal processing platform and the Java programming language as the implementation technology. This prototype was able to analyze different formats of data (i.e., pdf, excel, csv and images) wi...

Performance Analysis of Distributed Association Rule Mining with Apriori Algorithm

International Journal of Computer Theory and Engineering, 2011

One of the most crucial problems in data mining is association rule mining. It requires large computation and I/O traffic capacity. One approach to resolving this problem is the use of distributed data mining algorithms in a grid, which offers an effective way to mine large data sets. Therefore, we implemented distributed data mining with the Apriori algorithm in a grid environment. However, using a grid environment raises some issues about the optimization of the Apriori algorithm, especially the cost of node-to-node communication and data distribution. In this paper, an Optimized Distributed Association rule mining approach for geographically distributed data is introduced in a parallel and distributed environment, which reduces communication costs.

Modified Apriori Algorithm For Predefind Support And Confidence In Cloud Computing Environment For Frequent Pattern Mining

2013

Cloud computing has demonstrated that processing very large datasets over commodity clusters can be done given the right programming model. A cloud can be understood as an infrastructure that provides resources and/or services over the internet. A cloud can be a storage cloud that provides block- or file-based storage services, or it can be a compute cloud that provides computational services. Data mining is the process of analyzing data from different perspectives and summarizing it into useful information. Mining association rules is one of the most important aspects of data mining. Association rules are dependency rules which predict the occurrence of an item based on occurrences of other items. Apriori is the best-known algorithm for mining association rules. The Apriori algorithm has a major problem: it makes multiple scans through the entire data, which requires a lot of space and time. The modification in our paper suggests that we do not scan the whole database to count the support for every attribute. This is possible by keeping the minimum support count and comparing it with the support of every attribute. The support of an attribute is counted only until it reaches the minimum support value. This is done using the Sector/Sphere framework with association rules.
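One reading of the modification described in this abstract is an early-stopping support count: the scan for an itemset ends as soon as its count reaches the minimum support. The sketch below follows that reading only. It is not the authors' implementation, it does not involve Sector/Sphere, and the function name `reaches_min_support` and its return shape are hypothetical.

```python
def reaches_min_support(itemset, transactions, min_support):
    """Return (is_frequent, transactions_scanned).

    Scanning stops as soon as the running count reaches min_support,
    so frequent itemsets usually need only part of the dataset.
    """
    target = frozenset(itemset)
    if min_support <= 0:
        # Trivially satisfied; nothing needs to be scanned.
        return True, 0
    count = 0
    scanned = 0
    for transaction in transactions:
        scanned += 1
        if target <= frozenset(transaction):
            count += 1
            if count >= min_support:
                return True, scanned  # early exit: exact support is not needed
    return False, scanned
```

The saving applies only to itemsets that turn out to be frequent; an infrequent itemset still requires a full scan before it can be rejected. The exact support is also no longer available afterwards, which matters if rule confidence is computed from exact counts.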

A Survey on Association Rule Mining Algorithm and Architecture for Distributed Processing

2014

Association rule mining is a data mining technique used to uncover previously unknown hidden patterns or rules in huge databases, usually terabytes or petabytes of data. There are many popular algorithms for mining various association rules, such as Apriori, partitioning, dynamic itemset counting, etc. But the main drawback of these algorithms is their sequential nature. Processing large databases in sequential order has many disadvantages, such as long running times and scalability and performance issues. To avoid these problems, we look to parallel or distributed association rule mining to provide scalability and better performance.

Association Rule Mining in Big Data using MapReduce Approach in Hadoop

Association rule mining is an important task in data mining. In the case of big data, the large volume of data makes it impossible to generate rules quickly. By making use of parallel execution in Hadoop using the MapReduce framework, the rules can be generated much faster and more efficiently. The existing method transforms the input dataset into a binomial representation before processing it using MapReduce. But binomial conversion is not user-friendly, since it is complex in the case of continuous values. In this paper, an improved and scalable algorithm is proposed for association rule mining that converts the input dataset into key-value pairs instead of a binomial representation. All the stages of the proposed association rule mining algorithm are parallelized using MapReduce. The proposed algorithm works on high-cardinality features, so no dimension detection is needed.
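The abstract does not say how the key-value pairs are formed. One plausible reading, sketched below, is that each record is turned into a set of `attribute=value` items, which are then counted as (key, count) pairs, instead of expanding every attribute value into its own true/false column. The `attribute=value` encoding and the names `to_key_value_items` and `count_items` are assumptions of this sketch, not details from the paper.

```python
from collections import Counter


def to_key_value_items(record):
    """Turn one record (a dict of attribute -> value) into 'attribute=value' items."""
    return {f"{attribute}={value}" for attribute, value in record.items()}


def count_items(records):
    """Count how many records contain each 'attribute=value' item.

    In a MapReduce setting the inner loop would be the map step emitting
    (item, 1), and the Counter would be replaced by the reduce-side sum.
    """
    counts = Counter()
    for record in records:
        for item in to_key_value_items(record):
            counts[item] += 1
    return counts
```

With this encoding, a feature with many distinct values adds items only for the values that actually occur in the data, rather than a fixed column per possible value.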