Efficient Large Scale Frequent Itemset Mining with Hybrid Partitioning Approach

Data Partitioning for Fast Mining of Frequent Itemsets in Massively Distributed Environments

Frequent itemset mining (FIM) is one of the fundamental cornerstones of data mining. While the problem of FIM has been thoroughly studied, few standard or improved solutions scale. This is mainly the case when i) the amount of data tends to be very large and/or ii) the minimum support (MinSup) threshold is very low. In this paper, we propose a highly scalable, parallel frequent itemset mining (PFIM) algorithm, namely Parallel Absolute Top Down (PATD). The PATD algorithm renders the mining process of very large databases (up to terabytes of data) simple and compact. Its mining process is made up of only one parallel job, which dramatically reduces the mining runtime, the communication cost and the energy consumption overhead in a distributed computational platform. Based on a clever and efficient data partitioning strategy, namely Item Based Data Partitioning (IBDP), the PATD algorithm mines each data partition independently, relying on an absolute minimum support (AMinSup) instead of a relative one. PATD has been extensively evaluated using real-world data sets. Our experimental results suggest that the PATD algorithm is significantly more efficient and scalable than alternative approaches.
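The difference between a relative and an absolute minimum support can be illustrated with a small sketch. The abstract does not spell out how IBDP builds its partitions, so the `item_based_partition` helper below is a hypothetical stand-in (it simply gathers every transaction containing a given item), and all function names and the toy database are assumptions made for illustration, not the authors' implementation.

```python
import math

def absolute_min_support(relative_min_sup, num_transactions):
    """Turn a relative threshold (fraction of the database) into a transaction count."""
    return math.ceil(relative_min_sup * num_transactions)

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def item_based_partition(transactions, item):
    """Hypothetical partitioning: keep all transactions that contain `item`."""
    return [t for t in transactions if item in t]

database = [
    {"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"},
    {"a", "b"}, {"c"}, {"b"}, {"a", "b", "c"},
]

# The absolute threshold is computed once, from the size of the whole database.
abs_min_sup = absolute_min_support(0.5, len(database))

# Every transaction containing "a" lands in this partition, so the support of
# any itemset containing "a" can be counted locally without other partitions.
partition_a = item_based_partition(database, "a")
local_support = support({"a", "b"}, partition_a)
global_support = support({"a", "b"}, database)
is_frequent = local_support >= abs_min_sup
```

Under this assumed partitioning, the locally counted support equals the global one, which is why the partition can be compared against an absolute count rather than a fraction of its own (smaller) size.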

Efficient Parallel Mining Of Frequent Itemset Using MapReduce

International Journal of Information Systems and Computer Sciences, 2019

Big data refers to extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interaction; data mining is used to dig deep into the patterns and relationships in such data. Frequent itemset mining is a data mining method that was developed for market basket analysis. This project proposes efficient data processing using the Lshfp growth algorithm, grouping similar objects into clusters identified by a group id. Traditional data mining based on the FP-growth algorithm focuses on load balancing, with the work distributed among the nodes of the cluster. The process is mainly based on MapReduce, which is well supported by Hadoop. Hadoop is an efficient, popular framework that supports MapReduce and itemset mining. MapReduce consists of a map phase and a reduce phase: the map phase produces key-value pairs, and the reduce phase produces the reduced results. The approach aims to decrease network overhead and make processing efficient.
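The map and reduce phases described above can be imitated in a few lines of plain Python. This is only an in-memory sketch of the programming model applied to item counting; it does not use Hadoop, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `count_items`) are illustrative assumptions.

```python
from collections import defaultdict

def map_phase(transaction):
    """Map: emit one (item, 1) key-value pair per distinct item in a transaction."""
    for item in sorted(set(transaction)):
        yield item, 1

def shuffle(pairs):
    """Group the emitted values by key, as the framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse all values of one key into a single count."""
    return key, sum(values)

def count_items(transactions):
    pairs = (pair for t in transactions for pair in map_phase(t))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())

baskets = [["milk", "bread"], ["milk"], ["bread", "eggs", "milk"]]
item_counts = count_items(baskets)
```

In a real cluster the mappers and reducers would run on different nodes and the shuffle step would move data over the network, which is where the network overhead mentioned in the abstract arises.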

Improving Efficiency of Parallel Mining of Frequent Itemsets using Fidoop-hd

2016

Present parallel mining algorithms for frequent itemsets lack a mechanism that enables automatic parallelization, load balancing, data distribution, and fault tolerance on large clusters. As a solution to this problem, this work designs a parallel frequent itemset mining algorithm called Fidoop using the MapReduce programming model. To achieve compressed storage and avoid building conditional pattern bases, Fidoop incorporates the frequent items ultrametric tree rather than regular FP trees. In Fidoop, two MapReduce jobs are implemented to complete the mining task. In the complex MapReduce job, the mappers independently decompose itemsets and the reducers perform combination operations by compressing data. The system implements Fidoop on a private Hadoop cluster and shows that Fidoop on the cluster is sensitive to data distribution and dimensions, because itemsets with distinct lengths have different decomposition and construction costs. In this paper, the system improves Fidoop’s p...

ParallelCharMax: An Effective Maximal Frequent Itemset Mining Algorithm Based on MapReduce Framework

Nowadays, the explosive growth in data collection in business and scientific areas has created the need to analyze and mine useful knowledge residing in these data. Recourse to data mining techniques seems inescapable in order to extract useful and novel patterns/models from large datasets. In this context, frequent itemsets (patterns) play an essential role in many data mining tasks that try to find interesting patterns in datasets. However, conventional approaches for mining frequent itemsets in the Big Data era encounter significant challenges when computing power and memory space are limited. This paper proposes an efficient distributed frequent itemset mining algorithm, called ParallelCharMax, that is based on a powerful sequential algorithm, called Charm, and computes the maximal frequent itemsets, which are considered perfect summaries of the frequent ones. The proposed algorithm has been implemented using the MapReduce framework. The experimental component of the study shows the efficiency and the performance of the proposed algorithm compared with well-known algorithms such as MineWithRounds and HMBA.
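To make the notion of a maximal frequent itemset concrete: a frequent itemset is maximal when none of its proper supersets is frequent. The sketch below filters an already mined collection down to its maximal members. It is a naive quadratic filter written for illustration, not the Charm or ParallelCharMax algorithm, and the name `maximal_itemsets` is an assumption.

```python
def maximal_itemsets(frequent):
    """Keep only the itemsets that have no proper superset in `frequent`."""
    frequent = [frozenset(s) for s in frequent]
    return {s for s in frequent if not any(s < other for other in frequent)}

frequent = [{"a"}, {"b"}, {"c"}, {"a", "b"}, {"a", "c"}]
maximal = maximal_itemsets(frequent)

# Every frequent itemset is a subset of at least one maximal itemset,
# which is the sense in which the maximal ones summarise the rest.
covered = all(any(frozenset(f) <= m for m in maximal) for f in frequent)
```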

The MapReduce Model on Cascading Platform for Frequent Itemset Mining

IJCCS (Indonesian Journal of Computing and Cybernetics Systems), 2018

Implementing parallel algorithms has recently become an active research topic. Parallelism is well suited to large-scale data processing. MapReduce is one of the parallel and distributed programming models. Implementing parallel programs faces many difficulties. Cascading provides a simple scheme on top of the Hadoop system, which implements the MapReduce model. Frequent itemsets are the objects that appear most often in a dataset. Frequent Itemset Mining (FIM) requires complex computation and is a complicated problem when applied to large-scale data. This paper discusses the implementation of the MapReduce model on Cascading for FIM. The experiment uses the Amazon product co-purchasing network metadata dataset. The experiment shows that the simple mechanism of Cascading can be used to solve the FIM problem. It gives time complexity O(n), more efficient than the nonparallel version, which has complexity O(n^2/m).

SPLIT AND RULE ALGORITHM TO MINE FREQUENT ITEMSETS IN BIG DATA

IAEME PUBLICATION, 2020

The discovery of frequent items from big data or very large datasets is not a new technique, but many of the existing algorithms and approaches need some fine tuning. This paper deals with very large data using a divide and conquer approach, in which the raw dataset is partitioned into many parts based on the size of the input data and the number of processes the algorithm uses to unearth the frequent itemsets. The proposed approach computes the count (native support) of each item present in the individual partitions, and no pruning is carried out at that point; the discovered itemsets are then combined in the next stage, and the universal support is computed to prune away the unpromising itemsets, after which the data is divided again to calculate the native support. This process continues until all the frequent itemsets are unearthed. The proposed Split and Rule algorithm (SR algorithm) is compared with many existing algorithms to demonstrate its versatility and efficiency in terms of execution time and memory consumption.
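The split, count, combine and prune cycle described in this abstract can be sketched as follows. This is one possible reading of the description, not the published SR algorithm: the round-robin `split`, the level-wise loop, the `max_k` cap and all function names are assumptions introduced for the example.

```python
from collections import Counter
from itertools import combinations

def split(transactions, num_parts):
    """Divide the data into `num_parts` partitions (round-robin, an assumption)."""
    return [transactions[i::num_parts] for i in range(num_parts)]

def native_support(partition, k):
    """Count every k-itemset inside one partition, without any pruning."""
    counts = Counter()
    for t in partition:
        for combo in combinations(sorted(t), k):
            counts[combo] += 1
    return counts

def split_and_prune(transactions, num_parts, min_sup, max_k=3):
    frequent = {}
    current = [set(t) for t in transactions]
    for k in range(1, max_k + 1):
        # Combine the native supports of all partitions into the universal support.
        universal = Counter()
        for part in split(current, num_parts):
            universal.update(native_support(part, k))
        level = {c: n for c, n in universal.items() if n >= min_sup}
        if not level:
            break
        frequent.update(level)
        # Prune: an item absent from every frequent k-itemset cannot appear
        # in a frequent (k+1)-itemset, so it is dropped before the next split.
        kept = {item for c in level for item in c}
        current = [t & kept for t in current]
    return frequent

data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = split_and_prune(data, num_parts=2, min_sup=3)
```

Because pruning uses the combined (universal) counts only, the result does not depend on how many partitions the data is split into.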

Article ID: IJEET_11_08_009 Frequent Itemsets in Big Data

Data Partitioning In Frequent Item Set Mining on Hadoop Cluster Using Map reduce Report

Parallel mining algorithms for frequent itemsets lack a mechanism that enables automatic parallelization, load balancing, data distribution, and fault tolerance on large clusters. As a solution to this problem, we design a parallel frequent itemsets mining algorithm called FiDoop using the MapReduce programming model. To achieve compressed storage and avoid building conditional pattern bases, FiDoop incorporates the frequent items ultrametric tree, rather than conventional Frequent Pattern trees. In FiDoop, three MapReduce jobs are implemented to complete the mining task. In the crucial third MapReduce job, the mappers independently decompose itemsets, the reducers perform combination operations by constructing small ultrametric trees, and the actual mining of these trees is performed separately. We implement FiDoop on our in-house Hadoop cluster. We show that FiDoop on the cluster is sensitive to data distribution and dimensions, because itemsets with different lengths have different decomposition and construction costs. To improve FiDoop performance, we develop a workload balance metric to measure load balance across the cluster’s computing nodes. We develop FiDoop-HD, an extension of FiDoop, to speed up the mining performance for high-dimensional data analysis. Extensive experiments using real-world celestial spectral data demonstrate that our proposed solution is efficient and scalable.
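The remark that itemsets of different lengths have different decomposition costs can be illustrated with a short sketch. It assumes that "decomposing" an itemset means emitting its sub-itemsets keyed by length, so that each length could be handled by its own reducer; the abstract does not give this detail, the ultrametric trees are not modelled here, and `decompose` and `group_by_length` are hypothetical names.

```python
from collections import Counter
from itertools import combinations

def decompose(itemset, min_len=2):
    """Mapper-like step: emit (length, sub-itemset) pairs for one itemset."""
    items = sorted(itemset)
    for length in range(min_len, len(items) + 1):
        for subset in combinations(items, length):
            yield length, subset

def group_by_length(itemsets):
    """Reducer-like step: count the emitted sub-itemsets separately per length."""
    groups = {}
    for itemset in itemsets:
        for length, subset in decompose(itemset):
            groups.setdefault(length, Counter())[subset] += 1
    return groups

groups = group_by_length([{"a", "b", "c"}, {"a", "b"}])

# The number of emitted pairs grows quickly with the itemset length,
# which is one source of uneven load across mappers.
cost_3 = len(list(decompose({"a", "b", "c"})))
cost_4 = len(list(decompose({"a", "b", "c", "d"})))
```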

IJERT-An Efficient Approach for Frequent Pattern Mining Using Parallel Computing

International Journal of Engineering Research and Technology (IJERT), 2014

https://www.ijert.org/an-efficient-approach-for-frequent-pattern-mining-using-parallel-computing https://www.ijert.org/research/an-efficient-approach-for-frequent-pattern-mining-using-parallel-computing-IJERTV3IS071244.pdf Frequent itemset mining is a heavily researched field of data mining. Apriori and FP-Growth are the most traditional algorithms for it. Developing a fast and efficient algorithm for frequent pattern mining is a most challenging task. In this paper, we improve the efficiency of the Apriori algorithm using Hadoop concepts and techniques to handle the big data problem.
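For reference, the classic sequential Apriori procedure that such work builds on can be sketched in a few lines. This is a minimal single-machine version for illustration only; it is not the Hadoop-based variant proposed in the paper, and the function name and toy data are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise search: count candidates, keep the frequent ones, extend by one item."""
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while candidates:
        # Count the support of every candidate of size k.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
        k += 1
        # Join step: merge frequent itemsets that differ by exactly one item.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
found = apriori(baskets, min_sup=2)
```

Each level requires a full pass over the transactions, which is the cost that parallel and Hadoop-based variants try to spread across machines.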