Article ID: IJEET_11_08_009. Frequent Itemsets in Big Data

SPLIT AND RULE ALGORITHM TO MINE FREQUENT ITEMSETS IN BIG DATA

IAEME PUBLICATION, 2020

The discovery of frequent items from big data or very large datasets is not a new technique, but many of the existing algorithms and approaches need fine-tuning. This paper deals with very large data by utilizing a divide-and-conquer approach: the raw dataset is partitioned into many parts based on the size of the input data and the number of processes the algorithm uses to unearth the frequent itemsets. The proposed approach computes the count (native support) of each item present in the individual partitions without any pruning; the discovered itemsets are then combined in the next stage, where a universal support is computed to prune away the unpromising itemsets, after which the data is divided again to calculate the native support. This process continues until all frequent itemsets are unearthed. The proposed Split and Rule (SR) algorithm is compared with many existing algorithms to demonstrate its versatility and efficiency in terms of execution time and memory consumption.
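The two-stage native/universal support idea can be sketched on a single machine as follows. The function name, the fixed partition count, and the enumeration of itemsets only up to `max_len` are illustrative assumptions, not the paper's actual implementation:

```python
from collections import Counter
from itertools import combinations

def split_and_rule(transactions, min_support, num_parts=2, max_len=2):
    """Sketch of the split-and-rule idea: count itemsets locally in each
    partition without pruning, then merge counts and prune globally."""
    # Split the raw dataset into roughly equal partitions.
    size = max(1, len(transactions) // num_parts)
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]

    # Stage 1: native (local) support counts per partition, no pruning.
    local_counts = []
    for part in parts:
        c = Counter()
        for t in part:
            for k in range(1, max_len + 1):
                for iset in combinations(sorted(set(t)), k):
                    c[iset] += 1
        local_counts.append(c)

    # Stage 2: combine local counts into the universal (global) support,
    # then prune itemsets below the minimum support threshold.
    total = Counter()
    for c in local_counts:
        total.update(c)
    return {iset: n for iset, n in total.items() if n >= min_support}
```

Because no pruning happens inside a partition, the global combine step sees exact counts, so no frequent itemset up to `max_len` can be missed.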

Efficient Large Scale Frequent Itemset Mining with Hybrid Partitioning Approach

International Journal of Scientific Research in Computer Science, Engineering and Information Technology, 2019

In today’s world, voluminous data are generated from various sources in various forms. Mining or analyzing this large-scale data efficiently enough to make it useful is difficult with existing approaches. Frequent itemset mining is one such analysis technique, used in many fields such as finance and health care, where the main focus is gathering frequent patterns and grouping them meaningfully in order to extract useful insights from the data. Major applications include customer segmentation in marketing, shopping cart analysis, customer relationship management, web usage mining, player tracking and so on. Many parallel algorithms, such as the Dist-Eclat and BigFIM algorithms, are available to perform large-scale frequent itemset mining. In the Dist-Eclat algorithm, datasets are partitioned using a round-robin technique; this work adopts a hybrid partitioning approach, which can improve the overall efficiency of the system. The system works as follows: initially, the collected data are distributed by MapReduce. Then the local frequent k-itemsets are computed using an FP-tree and sent to the map phase. Later, the mining results are combined at the center node. Finally, the global frequent itemsets are gathered by MapReduce. The proposed system is expected to improve efficiency by using a hybrid partitioning approach based on the identification of frequent items.
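The round-robin distribution step mentioned above can be sketched as follows; this is a minimal illustration of the partitioning pattern, not the Dist-Eclat implementation:

```python
def round_robin_partition(transactions, n_workers):
    """Distribute transactions across workers in round-robin order, so
    each worker receives a partition of nearly equal size."""
    buckets = [[] for _ in range(n_workers)]
    for i, t in enumerate(transactions):
        buckets[i % n_workers].append(t)
    return buckets
```

Each bucket would then be handed to one mapper for local frequent-itemset computation.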

Data Partitioning for Fast Mining of Frequent Itemsets in Massively Distributed Environments

Frequent itemset mining (FIM) is one of the fundamental cornerstones of data mining. While the problem of FIM has been thoroughly studied, few standard or improved solutions scale. This is mainly the case when (i) the amount of data is very large and/or (ii) the minimum support (MinSup) threshold is very low. In this paper, we propose a highly scalable parallel frequent itemset mining (PFIM) algorithm, namely Parallel Absolute Top Down (PATD). PATD renders the mining process of very large databases (up to terabytes of data) simple and compact. Its mining process consists of only one parallel job, which dramatically reduces the mining runtime, the communication cost and the energy consumption overhead on a distributed computational platform. Based on a clever and efficient data partitioning strategy, namely Item Based Data Partitioning (IBDP), the PATD algorithm mines each data partition independently, relying on an absolute minimum support (AMinSup) instead of a relative one. PATD has been extensively evaluated using real-world datasets. Our experimental results suggest that PATD is significantly more efficient and scalable than alternative approaches.
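A minimal sketch of the absolute-versus-relative support idea follows; single-item counting per partition is an assumption made here for brevity, and the real IBDP partitioning and mining process is far more involved:

```python
import math
from collections import Counter

def absolute_min_sup(rel_min_sup, n_transactions):
    """Convert a relative MinSup (a fraction of the whole database) into
    an absolute transaction count, usable unchanged in every partition."""
    return math.ceil(rel_min_sup * n_transactions)

def mine_partition(partition, a_min_sup):
    """Each partition is mined independently against the same absolute
    threshold, so a single parallel job over all partitions suffices."""
    counts = Counter(item for t in partition for item in set(t))
    return {item: n for item, n in counts.items() if n >= a_min_sup}
```

Because the threshold is an absolute count fixed from the whole database, no second pass is needed to reconcile per-partition relative frequencies.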

Efficient Algorithm for Frequent Item Set Generation in Big Data

International Journal of Scientific Research in Science, Engineering and Technology, 2019

Data mining faces many challenges in the big data era. Association rule mining is an important area of research in the field of data mining, but classical association rule mining algorithms are not sufficient to process large datasets. The Apriori algorithm has limitations such as a high I/O load and low performance, and the FP-Growth algorithm is constrained by the limited internal memory available. Mining frequent itemsets in dynamic scenarios is a challenging task. To overcome these issues, a parallelized approach using the MapReduce framework has been used, and the mining algorithm has been implemented using Hadoop.
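The level-wise database rescanning that makes Apriori I/O-heavy can be illustrated with a minimal single-machine sketch; this is not the parallelized Hadoop implementation the abstract describes:

```python
from collections import Counter

def apriori(transactions, min_support):
    """Level-wise Apriori sketch: pass k rescans the whole database to
    count candidate k-itemsets -- the repeated I/O cost referred to above."""
    tsets = [set(t) for t in transactions]
    candidates = [frozenset([i]) for i in {x for t in tsets for x in t}]
    frequent, k = {}, 1
    while candidates:
        counts = Counter()
        for t in tsets:                      # one full database scan per level
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets to form candidate (k+1)-itemsets.
        candidates = list({a | b for a in level for b in level
                           if len(a | b) == k + 1})
        k += 1
    return frequent
```

A MapReduce version would shard the inner counting loop across mappers and merge the partial `counts` in the reducers.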

A new approximate method for mining frequent itemsets from big data

Computer Science and Information Systems, 2021

Mining frequent itemsets in transaction databases is an important task in many applications. It becomes more challenging when dealing with a large transaction database because traditional algorithms are not scalable due to limited main memory. In this paper, we propose a new approach for the approximate mining of frequent itemsets in a big transaction database. Our approach is suitable for mining big transaction databases since it uses the frequent itemsets from a subset of the entire database to approximate the result for the whole data, and it can be implemented in a distributed environment. Our algorithm efficiently produces highly accurate results; however, it misses some true frequent itemsets. To address this problem and reduce the number of false-negative frequent itemsets, we introduce an additional parameter to the algorithm to discover most of the frequent itemsets contained in the entire dataset. In this article, we show an empirical evaluation of the results of ...
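The sample-then-relax idea, where a lowered threshold (the extra slack parameter) reduces false negatives, might be sketched like this for single items; all names and default values here are illustrative assumptions:

```python
import random
from collections import Counter

def approx_frequent_items(transactions, min_support_frac,
                          sample_frac=0.5, slack=0.9, seed=0):
    """Mine a random sample with a slightly lowered threshold (slack < 1)
    so that fewer true frequent items are missed, then report the single
    items estimated to be frequent in the whole database."""
    rng = random.Random(seed)
    sample = [t for t in transactions if rng.random() < sample_frac]
    if not sample:
        return set()
    # Relaxing the per-sample threshold trades extra false positives
    # for fewer false negatives.
    threshold = slack * min_support_frac * len(sample)
    counts = Counter(item for t in sample for item in set(t))
    return {item for item, n in counts.items() if n >= threshold}
```

In a distributed setting, each node could mine its own sample and a final step would merge and re-check the union of candidates.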

A further study in the data partitioning approach for frequent itemsets mining

Frequent itemsets mining is well explored for various data types, and its computational complexity is well understood. Building on our previous work, this paper presents an extension of the data pre-processing approach to further improve the performance of frequent itemsets computation. The methods focus on potential reduction of the size of the input data required for deployment of the partitioning-based algorithms. We have devised a series of data pre-processing methods such that the final step of the Partition algorithm, where a combination of all local candidate sets must be processed, is executed on substantially smaller input data. Moreover, we have compared these methods based on experiments with particular datasets.

P-BBA: A Master/Slave Parallel Binary-based Algorithm for Mining Frequent Itemsets in Big Data

2020

Frequent itemset mining is a data mining technique for discovering frequent patterns in a collection of databases. However, it becomes a computationally expensive task when used to mine large volumes of data; hence, there is a need for a scalable algorithm that can handle bigger datasets. The Binary-based Technique Algorithm (BBT) simplifies the process of generating frequent patterns by using bitwise operations and a binary database representation. However, it still suffers from low performance when dealing with high data volumes and low support threshold values, because its design runs in a single thread of execution. This research proposes a Parallel Binary-Based Algorithm (P-BBA) to solve this problem. The proposed algorithm is designed with collaborative threads that work together simultaneously to generate frequent itemsets in a big data environment, and a master/slave architecture is used to fit the algorithm to a distributed computing platform. The obtained results show significant reductions in execution time when using the proposed parallel binary-based algorithm.
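The binary database representation and bitwise support counting that the binary-based approach builds on can be sketched as follows; this is a single-threaded, purely illustrative version:

```python
def to_bitmaps(transactions, items):
    """Binary database representation: one integer bitmap per item,
    with bit i set when transaction i contains the item."""
    bitmaps = {}
    for item in items:
        bits = 0
        for i, t in enumerate(transactions):
            if item in t:
                bits |= 1 << i
        bitmaps[item] = bits
    return bitmaps

def support(bitmaps, itemset, n_transactions):
    """The support of an itemset is the popcount of the bitwise AND of
    its item bitmaps."""
    bits = (1 << n_transactions) - 1
    for item in itemset:
        bits &= bitmaps.get(item, 0)
    return bin(bits).count("1")
```

A parallel version would split the transaction index range among threads and sum the partial popcounts.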

FIN ALGORITHM FOR GENERATING FREQUENT ITEMSET IN BIG DATA

Big data being an emerging research area, handling issues like storing, searching, sorting, retrieving, securing, analyzing and visualizing data is of immense importance. Association rule mining helps to analyze customer behavior in the market, thereby enhancing business intelligence. In this paper we propose a novel FIN algorithm with the map-reduce concept to mine frequent itemsets in big data, which helps increase performance through parallel processing. The map-reduce function performs the parallel execution of frequent itemset identification using the FIN algorithm. FIN is faster than the traditional Apriori and FP-Growth algorithms, and this advantage also holds when it is applied to big data. The effectiveness of the algorithm is increased by parallelization, and the proposed strategy can recommend closely related products to customers.

An efficient approach based on selective partitioning for maximal frequent itemsets mining

Sādhanā

We present a maximal frequent itemset (MFI) mining algorithm based on selective partitioning, called SelPMiner. It makes use of a novel data format named the Itemset-count tree, a compact and optimized partition representation that reduces the memory requirement. It also performs selective partitioning of the database, which reduces the runtime needed to scan the database. As the algorithm progressively searches for longer frequent itemsets in a depth-first manner, it creates new, even smaller partitions with fewer dimensions and unique data instances, which results in faster support counting. SelPMiner uses a number of optimizations to prune the search space. We also prove upper bounds on the amount of memory consumed by these partitions. Experimental comparisons of SelPMiner with the fastest popular existing MFI mining algorithms on different types of datasets show significant speedups in computation time in many cases. SelPMiner works especially well when the minimum support is low, and it consumes less memory.
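The defining property of MFI output, keeping only frequent itemsets with no frequent proper superset, can be sketched as a post-processing filter; this illustrates what "maximal" means and is not SelPMiner's actual depth-first search:

```python
def maximal_itemsets(frequent):
    """Keep only the maximal frequent itemsets: those with no frequent
    proper superset among the mined results."""
    fsets = [frozenset(s) for s in frequent]
    # s < t is the proper-subset test for frozensets.
    return [s for s in fsets if not any(s < t for t in fsets)]
```

Real MFI miners avoid this quadratic filter by pruning non-maximal branches during the search itself.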

D-GENE: Deferring the GENEration of Power Sets for Discovering Frequent Itemsets in Sparse Big Data

IEEE Access, 2020

Sparseness is a distinctive aspect of the big data generated by numerous applications at present. Furthermore, several similar records exist in real-world sparse datasets. Based on the Iterative Trimmed Transaction Lattice (ITTL), the recently proposed TRICE algorithm learns frequent itemsets efficiently from sparse datasets. TRICE stores identical transactions once and eliminates the infrequent part of each distinct transaction afterward. However, removing the infrequent part of two or more distinct transactions may result in similar trimmed transactions. TRICE repeatedly generates the ITTLs of similar trimmed transactions, which induces redundant computations and eventually affects runtime efficiency. This paper presents D-GENE, a technique that optimizes TRICE by introducing a deferred ITTL generation mechanism. D-GENE suspends the process of ITTL generation until the transaction pruning phase completes. The deferral strategy enables D-GENE to generate the ITTLs of similar trimmed transactions only once. Experimental results show that, by avoiding the redundant computations, D-GENE achieves better runtime efficiency. D-GENE comprehensively beats TRICE, FP-growth, and optimized versions of the SaM and RElim algorithms, especially when the difference between the number of distinct transactions and distinct trimmed transactions is high.

INDEX TERMS: Big data applications, pattern recognition, association rules, frequent itemset mining, IoT.
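The trim-then-deduplicate idea behind the deferral strategy can be sketched as follows; this is a simplified illustration, and D-GENE's actual ITTL machinery is not modeled:

```python
from collections import Counter

def trim_and_dedupe(transactions, min_support):
    """Trim infrequent items from each transaction, then deduplicate the
    trimmed transactions so that each distinct trimmed result is kept
    once with a multiplicity count (dedupe after trimming)."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i for i, n in counts.items() if n >= min_support}
    trimmed = Counter()
    for t in transactions:
        key = frozenset(i for i in t if i in frequent)
        if key:
            trimmed[key] += 1
    return trimmed
```

Deferring the expensive per-transaction work until after this deduplication means transactions that become identical once trimmed are processed only once.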