A Survey on Association Rule Mining Algorithm and Architecture for Distributed Processing

Mining of Association Rules on Large Database Using Distributed and Parallel Computing

Procedia Computer Science, 2016

Nowadays, due to the rapid growth of data in organizations, extensive data processing is a central concern of information technology. Mining association rules in a large database is a challenging task. The Apriori algorithm is widely used to find frequent item sets in a database, but it is inefficient on large databases because it requires a heavy I/O load. Many later algorithms, including parallel algorithms, address this drawback of Apriori, yet they still cannot find frequent item sets in large databases quickly and efficiently. Hence a hybrid architecture is proposed that integrates distributed and parallel computing. The main idea of the new architecture is to combine distributed and parallel computing so that frequent item sets can be found in large databases in less time. It also handles large databases more efficiently than existing algorithms.
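
To make the I/O cost mentioned in the abstract above concrete, here is a minimal, single-machine sketch of level-wise Apriori. It is not the paper's hybrid architecture; the function name `apriori` and the use of an absolute `min_support` count are assumptions made for illustration. Note the one full database scan per level, which is the cost the abstract refers to.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support_count} for every frequent itemset.

    Hypothetical helper for illustration; min_support is an absolute count.
    """
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(s) in frequent for s in combinations(union, k - 1)):
                candidates.add(union)

        # One full scan of the database per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result
```

For example, calling `apriori([["bread", "milk"], ["bread", "butter"], ["bread", "milk", "butter"], ["milk"]], 2)` keeps the pairs {bread, milk} and {bread, butter} but drops {milk, butter}, which appears in only one transaction.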

Parallel and Distributed Association Rule Mining Algorithms: A Recent Survey

Information Management and Computer Science, 2019

Data investigation is an essential factor nowadays due to rapidly growing electronic technology, which generates a large number of transactional data logs from a range of source devices. Parallel and distributed computing is a useful approach for enhancing the data mining process. The aim of this research is to present a systematic review of parallel association rule mining (PARM) and distributed association rule mining (DARM) approaches. We have observed that the parallelized nature of Apriori, Equivalence class, Hadoop (MapReduce), and Spark proves to be very efficient in PARM and DARM environments. We conclude that this comprehensive review and the references cited in this article will convey the foremost hypothetical issues and guide researchers toward interesting research directions. The most important hypothetical issues and challenges include the large size of databases, dimensionality of data, indexing schemes of data in the database, data skewness, database location, load balancing strategies, methods of adaptability in incremental databases, and orientation of the database.

A new parallel association rule mining algorithm on distributed shared memory system

International Journal of Business Intelligence and Data Mining, 2012

Finding frequent itemsets is the most time-consuming step in analysing large transactional databases. Sequential algorithms cannot provide adequate analytical ability for such very large databases, especially in terms of run-time performance. Therefore, we must rely on high-performance parallel computing. In this paper, we present a new parallel algorithm for frequent itemset mining, called the 'HorVertical' algorithm. This algorithm introduces a new database partitioning called 'HorVertical' partitioning. This partitioning technique reduces the dependency in the parallel computation and provides new properties that reduce the computations. The algorithm passes over the database only once and starts a new stage with the finished itemsets while other itemsets in the same stage have not yet finished. We present results on the performance of our algorithm on various databases and compare it against well-known algorithms.

An Efficient Approach of Association Rule Mining on Distributed Database Algorithm

International Journal of Computer Applications, 2013

Applications that require huge data processing face two main problems: massive storage and its management, and processing time, which grows as the quantity of data increases. Distributed databases largely solve the first problem, but they make the second worse. Because the current era is one of networking and communication, and communities maintain huge amounts of data on networks, researchers are proposing a range of novel algorithms to raise the throughput of results over distributed databases. In our research, we propose a novel algorithm that processes large quantities of data at various servers and collects the processed data on the client machine as needed.

A Comprehensive Survey of Non-Apriori Parallel Association Rule Mining Algorithms

There are three main parallel association rule mining algorithms: the Count Distribution algorithm, the Data Distribution algorithm, and the Candidate Distribution algorithm. Existing parallel association rule mining algorithms suffer from many problems when mining huge transactional datasets. One major problem is that most parallel algorithms for a shared-nothing environment are Apriori-based. Apriori-based algorithms have been shown not to scale for several reasons, mainly: (1) repeated disk I/O scans, (2) heavy computation and communication, and (3) a great deal of redundant calculation during candidate generation. Since the databases to be mined are often very large, and association rule mining is computationally and I/O intensive, we must rely on high-performance parallel mining methods. This paper presents different parallel algorithms proposed by various researchers for generating association rules. We provide a comparative analysis of these algorithms on various parameters.
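
As an illustration of the first algorithm named above, here is a minimal sketch of one Count Distribution pass, simulated sequentially on one machine. It assumes the usual formulation: every processor holds the full candidate set, counts it against its own data partition, and the local counts are then summed globally. The function names and the list-of-partitions input are hypothetical, and the summation loop stands in for a real all-reduce between nodes.

```python
from collections import Counter


def local_counts(partition, candidates):
    """Count each candidate itemset against one processor's local partition."""
    counts = Counter()
    for transaction in partition:
        items = frozenset(transaction)
        for candidate in candidates:
            if candidate <= items:
                counts[candidate] += 1
    return counts


def count_distribution_pass(partitions, candidates, min_support):
    """One pass: count locally per partition, then sum the counts globally."""
    # Each "processor" scans only its own partition (run in sequence here).
    per_node = [local_counts(p, candidates) for p in partitions]

    # Stand-in for the all-reduce step: only counts are exchanged, not data.
    total = Counter()
    for counts in per_node:
        total.update(counts)

    return {c: n for c, n in total.items() if n >= min_support}
```

Only the count vectors cross the network in this scheme, which is why its communication cost depends on the number of candidates and not on the size of the database.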

Parallel implementation of association rule in data mining

2006 Proceeding of the Thirty Eighth Southeastern Symposium on System Theory, 2006

This paper discusses a parallel data mining architecture for large volumes of data, which eventually involves scanning billions of rows of data per record. We compare the different parallel algorithms for association rule mining and discuss the advantages and disadvantages of each method. We also compare the computational time of serial and parallel algorithms for association rule mining.

Performance Analysis of Distributed Association Rule Mining with Apriori Algorithm

International Journal of Computer Theory and Engineering, 2011

One of the most crucial problems in data mining is association rule mining. It requires large computation and I/O traffic capacity. One approach to this problem is the use of distributed data mining algorithms on a grid, which offers an effective way to mine large data sets. Therefore, we implemented distributed data mining with the Apriori algorithm in a grid environment. However, using a grid environment raises some issues about the optimization of the Apriori algorithm, especially the cost of node-to-node communication and data distribution. In this paper, an Optimized Distributed Association rule mining approach for geographically distributed data is introduced for parallel and distributed environments; it reduces communication costs.

Association Rule Mining using Apriori Algorithm for Distributed System: a Survey

IOSR Journal of Computer Engineering, 2014

Data mining technologies provided through cloud computing are essential for today's businesses to make proactive, knowledge-driven decisions, as they help predict future trends and behaviors. Implementing data mining techniques in the cloud allows users to retrieve meaningful information from a virtually integrated data repository, which reduces resource costs. Research in data mining will continue to grow in business and in learning organizations over the coming decades. Association rule mining is one of the most important areas in the data mining domain, and the Apriori algorithm is a basic and important algorithm from a research point of view. It has some disadvantages: it becomes expensive because it scans the database frequently, it does not support large amounts of raw data, and the resources available for implementing a scalable algorithm are limited. To implement a scalable Apriori algorithm, the MapReduce programming model is used. MapReduce is a programming model for processing raw data at scale, and Hadoop provides an open-source platform for MapReduce implementations. However, Hadoop's default task scheduler is limited. In this paper we survey how to implement the Apriori algorithm for huge raw data and how to overcome this Hadoop limitation by using an enhanced scheduling algorithm.
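
The MapReduce formulation described above can be sketched in plain Python. This is a simulation of the map, shuffle, and reduce phases for a single counting pass; it does not use Hadoop's API, and all function names are hypothetical. As a simplification, the mapper emits every k-item subset of a transaction instead of checking against a pruned candidate set.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(transaction, k):
    """Mapper: emit (itemset, 1) for every k-item subset of one transaction."""
    for itemset in combinations(sorted(set(transaction)), k):
        yield itemset, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values, min_support):
    """Reducer: sum the counts and keep the itemset only if it is frequent."""
    total = sum(values)
    return (key, total) if total >= min_support else None


def mapreduce_pass(transactions, k, min_support):
    """Run one simulated MapReduce job that finds frequent k-itemsets."""
    pairs = (pair for t in transactions for pair in map_phase(t, k))
    frequent = {}
    for key, values in shuffle(pairs).items():
        reduced = reduce_phase(key, values, min_support)
        if reduced is not None:
            frequent[reduced[0]] = reduced[1]
    return frequent
```

A full Apriori run would chain one such job per itemset size k, which is one reason job scheduling overhead matters in this setting.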

A Comparative Study of Association Rule Mining Algorithms on Grid and Cloud Platform

2014

Association rule mining is a time-consuming process because it is both data intensive and computation intensive. To mine large volumes of data and to enhance the scalability and performance of existing sequential association rule mining algorithms, parallel and distributed algorithms have been developed. These traditional parallel and distributed algorithms are based on homogeneous platforms and are not well suited to heterogeneous platforms such as grid and cloud. This requires the design of new algorithms that address data set partitioning and distribution, load balancing strategy, and optimization of communication and synchronization among processors in such heterogeneous systems. Grid and cloud are emerging platforms for distributed data processing, and various association rule mining algorithms have been proposed for them. This survey article brings together a brief architectural view of distributed systems and various recent grid-based and cloud-based association rule mining algorithms, with a comparative perspective. We differentiate between association rule mining algorithms developed on these architectures on the basis of data locality, programming paradigm, fault tolerance, communication cost, and partitioning and distribution of data sets. Although it does not cover all algorithms, it can be very useful for new researchers working on distributed association rule mining algorithms.

Scalable Parallel Data Mining for Association Rules

Sigmod Record, 1997

One of the important problems in data mining is discovering association rules from databases of transactions where each transaction consists of a set of items. The most time consuming operation in this discovery process is the computation of the frequency of the occurrences of interesting subset of items (called candidates) in the database of transactions. To prune the exponentially large space of candidates, most existing algorithms consider only those candidates that have a user defined minimum support. Even with the pruning, the task of finding all association rules requires a lot of computation power and time. Parallel computers offer a potential solution to the computation requirement of this task, provided efficient and scalable parallel algorithms can be designed. In this paper, we present two new parallel algorithms for mining association rules. The Intelligent Data Distribution algorithm efficiently uses aggregate memory of the parallel computer by employing intelligent candidate partitioning scheme and uses efficient communication mechanism to move data among the processors. The Hybrid Distribution algorithm further improves upon the Intelligent Data Distribution algorithm by dynamically partitioning the candidate set to maintain good load balance. The experimental results on a Cray T3D parallel computer show that the Hybrid Distribution algorithm scales linearly and exploits the aggregate memory better and can generate more association rules with a single scan of database per pass.
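
The abstract above does not spell out how the candidate set is partitioned. The sketch below assumes one plausible scheme for illustration only: candidates are grouped by their first item, and whole groups are assigned greedily to the least-loaded processor. The function name, the sorted-tuple representation of candidates, and the load measure (number of candidates) are all assumptions, not details taken from the paper.

```python
import heapq
from collections import defaultdict


def partition_candidates(candidates, num_processors):
    """Assign candidate itemsets to processors, balancing candidate counts.

    Assumes each candidate is a tuple of items in sorted order, so that
    candidate[0] is its first item. Returns one candidate list per processor.
    """
    # Group candidates that share a first item; a group is never split.
    groups = defaultdict(list)
    for candidate in candidates:
        groups[candidate[0]].append(candidate)

    # Min-heap of (current load, processor id).
    heap = [(0, p) for p in range(num_processors)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_processors)]

    # Place the largest groups first, each on the least-loaded processor.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for _first_item, group in ordered:
        load, proc = heapq.heappop(heap)
        assignment[proc].extend(group)
        heapq.heappush(heap, (load + len(group), proc))
    return assignment
```

Keeping all candidates with the same first item on one processor means a processor can skip any transaction item that does not start one of its candidates, which is the kind of saving a partitioning scheme of this sort aims for.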