Efficient frequent itemsets mining by sampling
Lecture Notes in Computer Science, 2012
The tasks of extracting (top-K) Frequent Itemsets (FI's) and Association Rules (AR's) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need to scan the entire dataset, possibly multiple times. High-quality approximations of FI's and AR's are sufficient for most practical uses, and a number of recent works have explored the application of sampling for fast discovery of approximate solutions to these problems. However, these works do not provide satisfactory performance guarantees on the quality of the approximation, due to the difficulty of bounding the probability of under- or over-sampling any one of an unknown number of frequent itemsets.
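The sampling idea this abstract builds on can be sketched as follows: estimate an itemset's support from a uniform random sample of transactions, with the sample size chosen so that a Hoeffding-style bound controls the estimation error for a single itemset. (Controlling the error for all itemsets simultaneously is exactly the hard part the abstract alludes to.) This is a minimal illustrative sketch; the dataset, function names, and parameter values are assumptions, not from the paper.

```python
import math
import random

def sample_size(epsilon, delta):
    # Hoeffding bound for one itemset:
    # P(|estimate - true support| > epsilon) <= 2 * exp(-2 * n * epsilon**2) <= delta.
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

def estimated_support(itemset, transactions, epsilon=0.05, delta=0.01, rng=random):
    # Estimate the relative support of `itemset` from a uniform sample.
    n = min(sample_size(epsilon, delta), len(transactions))
    sample = rng.sample(transactions, n)
    itemset = set(itemset)
    return sum(itemset <= t for t in sample) / n

# Toy dataset: true support of {"a", "b"} is 3/5 = 0.6.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}, {"a", "b", "c"}] * 200
est = estimated_support({"a", "b"}, transactions, epsilon=0.05, delta=0.01)
```

Here the required sample size (1060 for epsilon = 0.05, delta = 0.01) exceeds the dataset size, so the "sample" covers the whole dataset and the estimate is exact; on genuinely large datasets the sample is much smaller than the data.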
A New Approach for Approximately Mining Frequent Itemsets
2019
Mining frequent itemsets in transaction databases is an important task in many applications. This task becomes challenging when dealing with a very large transaction database because traditional algorithms are not scalable due to the memory limit. In this paper, we propose a new approach for the approximate mining of frequent itemsets in a transaction database. First, we partition the set of transactions in the database into disjoint subsets such that the distribution of frequent itemsets in each subset is similar to that of the entire database. Then, we randomly select a set of subsets and independently mine the frequent itemsets in each of them. After that, each frequent itemset discovered from these subsets is voted on, and any itemset appearing in a majority of the subsets is designated a frequent itemset, called a popular frequent itemset. All popular frequent itemsets are compared with the frequent itemsets discovered directly from the entire database using the same frequency threshold. The r...
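The partition-and-vote scheme described above can be sketched as follows. The exhaustive per-partition miner and all names here are illustrative stand-ins (the paper would use a scalable miner per subset); only the majority-voting step reflects the abstract's idea.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    # Exhaustive miner, usable only on tiny examples: return every itemset
    # whose relative support is at least minsup.
    items = sorted({i for t in transactions for i in t})
    result = set()
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            sup = sum(set(cand) <= t for t in transactions) / len(transactions)
            if sup >= minsup:
                result.add(cand)
    return result

def popular_frequent_itemsets(partitions, minsup, majority=0.5):
    # Mine each partition independently, then keep the itemsets that are
    # frequent in a majority of partitions ("popular" frequent itemsets).
    votes = Counter()
    for part in partitions:
        for itemset in frequent_itemsets(part, minsup):
            votes[itemset] += 1
    return {s for s, v in votes.items() if v > majority * len(partitions)}

parts = [
    [{"a", "b"}, {"a"}, {"b"}],
    [{"a"}, {"a", "c"}, {"c"}],
    [{"a", "b"}, {"b"}, {"a"}],
]
popular = popular_frequent_itemsets(parts, minsup=0.6)
```

In this toy example, ("a",) is frequent in all three partitions and ("b",) in two, so both pass the majority vote, while ("c",) is frequent in only one partition and is rejected.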
A new approximate method for mining frequent itemsets from big data
Computer Science and Information Systems, 2021
Mining frequent itemsets in transaction databases is an important task in many applications. It becomes more challenging when dealing with a large transaction database because traditional algorithms are not scalable due to the limited main memory. In this paper, we propose a new approach for the approximate mining of frequent itemsets in a big transaction database. Our approach is suitable for mining big transaction databases since it uses the frequent itemsets from a subset of the entire database to approximate the result on the whole data, and can be implemented in a distributed environment. Our algorithm efficiently produces highly accurate results; however, it misses some true frequent itemsets. To address this problem and reduce the number of false-negative frequent itemsets, we introduce an additional parameter to the algorithm to discover most of the frequent itemsets contained in the entire data set. In this article, we show an empirical evaluation of the results of ...
CBW: an efficient algorithm for frequent itemset mining
Proceedings of the 37th Annual Hawaii International Conference on System Sciences, 2004
Frequent itemset generation is the prerequisite and most time-consuming process in association rule mining. Most efficient Apriori-like algorithms rely heavily on the minimum support constraint to prune a vast number of non-candidate itemsets. This pruning technique, however, becomes less useful in some real applications where the supports of interesting itemsets are extremely small, such as medical diagnosis and fraud detection, among others. In this paper, we propose a new algorithm that maintains its performance even at relatively low supports. Empirical evaluations show that our algorithm is, on average, more than an order of magnitude faster than Apriori-like algorithms.
Mining Approximate Frequent Itemsets In the Presence of Noise: Algorithm and Analysis
Proceedings of the 2006 SIAM International Conference on Data Mining, 2006
Frequent itemset mining is a popular and important first step in the analysis of data arising in a broad range of applications. The traditional "exact" model for frequent itemsets requires that every item occur in each supporting transaction. However, real data is typically subject to noise and measurement error. To date, the effect of noise on exact frequent pattern mining algorithms has been addressed primarily through simulation studies, and there has been limited attention to the development of noise-tolerant algorithms.
Efficiently Mining Frequent Itemsets using Various Approaches: A Survey
International Journal of Computer Applications, 2012
In this paper, we present the various elementary traversal approaches for mining association rules. We start with a formal definition of an association rule and its basic algorithm. We then discuss association rule mining algorithms from several perspectives, such as the breadth-first, depth-first, and hybrid approaches. The various approaches are compared in terms of time complexity and I/O overhead on the CPU. Finally, this paper surveys the prospects of association rule mining and discusses the areas where there is scope for improved scalability.
Mining top-K frequent itemsets through progressive sampling
Data Mining and Knowledge Discovery, 2010
We study the use of sampling for efficiently mining the top-K frequent itemsets of cardinality at most w. To this purpose, we define an approximation to the top-K frequent itemsets to be a family of itemsets which includes (resp., excludes) all very frequent (resp., very infrequent) itemsets, together with an estimate of these itemsets' frequencies with a bounded error. Our first result is an upper bound on the sample size which guarantees that the top-K frequent itemsets mined from a random sample of that size approximate the actual top-K frequent itemsets, with probability larger than a specified value. We show that the upper bound is asymptotically tight when w is constant. Our main algorithmic contribution is a progressive sampling approach, combined with suitable stopping conditions, which on appropriate inputs is able to extract approximate top-K frequent itemsets from samples whose sizes are smaller than the general upper bound. In order to test the stopping conditions, this approach maintains the frequency of all itemsets encountered, which is practical only for small w. However, we show how this problem can be mitigated by using a variation of Bloom filters. A number of experiments conducted on both synthetic and real benchmark datasets show that using samples substantially smaller than the original dataset (i.e., of the size defined by the upper bound or reached through the progressive sampling approach) enables approximating the actual top-K frequent itemsets with accuracy much higher than what is analytically proven.
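The counting step underlying this setting can be sketched as follows: enumerate the itemsets of cardinality at most w in a sample and rank them by frequency. This is only the bookkeeping the paper's stopping conditions need; the paper's contributions (sample-size bounds, progressive sampling, the Bloom-filter variation) are not reproduced here, and the function name is illustrative.

```python
from collections import Counter
from itertools import combinations

def topk_itemsets(sample, k, w=2):
    # Count every itemset of cardinality <= w occurring in the sample,
    # then return the k most frequent with their sample counts.
    counts = Counter()
    for t in sample:
        items = sorted(t)
        for size in range(1, w + 1):
            for cand in combinations(items, size):
                counts[cand] += 1
    return counts.most_common(k)
```

For large w this table of all encountered itemsets blows up, which is precisely why the paper resorts to a Bloom-filter-like summary.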
Efficient Algorithms for Mining Frequent Itemsets with Constraint
2011
An important problem in interactive data mining is to find the frequent itemsets contained in a subset C of the set of all items in a given database. Reducing the database to C, or incorporating C into an algorithm for mining frequent itemsets (such as Charm-L or Eclat) and re-solving the problem, is very time-consuming, especially when C changes often. In this paper, we propose an efficient approach for mining them as follows. First, it is necessary to mine, only once, the class LGA containing the closed itemsets together with their generators. After that, when C changes, the class of all frequent closed itemsets and their generators on C is determined quickly from LGA by our algorithm MINE_CG_CONS. We obtain the algorithm MINE_FS_CONS to efficiently mine and classify all frequent itemsets with the constraint from that class. Theoretical results and experiments demonstrate the efficiency of our approach.
A LITERATURE SURVEY ON FREQUENT ITEMSET MINING – AN ARM PERSPECTIVE
Association Rule Mining (ARM), which finds relationships between distinct itemsets, plays an essential role in itemset mining. Frequent itemset mining is one of the popular data mining techniques and can be used in many data mining fields for finding highly correlated itemsets. Frequent items are those that occur frequently in the database. Infrequent itemset mining, the inverse of frequent itemset mining, finds the rarely occurring itemsets in the database. Several existing techniques for mining frequent and infrequent itemsets have high computation time and scale poorly as the database size increases. This paper focuses on reviewing the existing algorithms that mine frequent and infrequent itemsets, helping future researchers find their way in the domain of association rule mining. Keywords—Association Rule Mining (ARM), Apriori, Frequent items, FP-growth, Infrequent items, performance.
Result Analysis of Mining Fast Frequent Itemset Using Compacted Data
International Journal of Information Sciences and Techniques, 2014
Data mining and knowledge discovery in databases attract a wide array of non-trivial research, ease the construction of industrial decision support systems, and continue to expand into promising fields such as Artificial Intelligence while facing real-world challenges. Association rules form an important paradigm in data mining for various databases, such as transactional, time-series, spatial, and object-oriented databases. The burgeoning amount of data in multiple heterogeneous sources, combined with the difficulty of building and maintaining central repositories, compels the need for effective distributed mining techniques. The majority of previous studies rely on an Apriori-like candidate set generation-and-test approach. For these applications, such techniques prove expensive and slow, particularly when long patterns exist.
IMPROVING PERFORMANCE OF FREQUENT ITEMSET ALGORITHM
Frequent itemset mining leads to the discovery of associations among items in a large transactional database. The Apriori algorithm adopts candidate generation and testing, which is easy to implement, but candidate generation and support counting are very expensive.
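The candidate-generation-and-test loop mentioned above can be sketched in a minimal form; the sketch makes visible where the cost lies, since every level requires a full pass over the data to count candidate supports. This is a generic textbook Apriori outline, not the specific improvement the paper proposes.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    # Level-wise Apriori: generate candidates of size k from frequent
    # (k-1)-itemsets, then count their supports with a pass over the data.
    freq = {}
    items = {i for t in transactions for i in t}
    level = []
    for i in sorted(items):
        sup = sum(i in t for t in transactions)
        if sup >= minsup_count:
            freq[(i,)] = sup
            level.append((i,))
    k = 2
    while level:
        # Join step: combine frequent (k-1)-itemsets sharing a (k-2)-prefix.
        cands = set()
        for a in level:
            for b in level:
                if a[:-1] == b[:-1] and a[-1] < b[-1]:
                    cands.add(a + (b[-1],))
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        cands = {c for c in cands
                 if all(s in freq for s in combinations(c, k - 1))}
        # Test step: the expensive support counting pass over all transactions.
        nxt = []
        for c in sorted(cands):
            sup = sum(set(c) <= t for t in transactions)
            if sup >= minsup_count:
                freq[c] = sup
                nxt.append(c)
        level, k = nxt, k + 1
    return freq
```

Each iteration of the while loop scans the whole database once, which is the cost the abstract identifies as the target for improvement.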
A Hybrid Approach for Mining Frequent Itemsets
2013 IEEE International Conference on Systems, Man, and Cybernetics, 2013
Frequent itemset mining is a fundamental element of many data mining problems. Recently, PrePost, a new algorithm for mining frequent itemsets based on the idea of N-lists, has been proposed; PrePost in most cases outperforms other current state-of-the-art algorithms. In this paper, we present an improved version of PrePost that uses a hash table to enhance the process of creating the N-lists associated with 1-itemsets, together with an improved N-list intersection algorithm. Furthermore, two new theorems are proposed for determining the "subsume index" of frequent 1-itemsets based on the N-list concept. The experimental results show that the performance of the proposed algorithm improves on that of PrePost.
A Survey on Approaches for Mining Frequent Itemsets
Data mining is gaining importance due to the huge amount of data available. Retrieving information from a data warehouse is not only tedious but also difficult in some cases. The most important uses of data mining include customer segmentation in marketing, shopping cart analysis, customer relationship management, campaign management, Web usage mining, text mining, player tracking, and so on. In data mining, association rule mining is one of the important techniques for discovering meaningful patterns from large collections of data. Discovering frequent itemsets plays an important role in mining association rules, sequence rules, web log mining, and many other interesting patterns among complex data. This paper presents a literature review of different techniques for mining frequent itemsets.
A New Approach for Mining Frequent K-itemset
2007
Discovery of frequent itemsets is an important problem in data mining. Most previous research is based on Apriori, which suffers from the generation of a huge number of candidate itemsets and performs repeated passes to find frequent itemsets. To address this problem, we propose an algorithm for finding frequent K-itemsets in which the itemsets whose length is less than K are pruned from the database and not considered for further processing, which reduces the database size and the number of comparisons to be performed. In addition, it generates 1-itemsets as a data preprocessing step, which saves time and makes execution fast. Experimental results are included.
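Read as pruning transactions that are too short to contain any K-itemset (one plausible interpretation of the abstract), the core idea can be sketched as follows; the function name and the direct counting of K-subsets are illustrative assumptions, not the paper's algorithm.

```python
from collections import Counter
from itertools import combinations

def frequent_k_itemsets(transactions, k, minsup_count):
    # Prune: a transaction with fewer than k items cannot contain any
    # k-itemset, so it can be dropped before counting.
    pruned = [t for t in transactions if len(t) >= k]
    counts = Counter()
    for t in pruned:
        for cand in combinations(sorted(t), k):
            counts[cand] += 1
    return {c: s for c, s in counts.items() if s >= minsup_count}
```

The pruning step shrinks both the data scanned and the number of candidate comparisons, which is the saving the abstract claims.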
Proceedings of the 1st …, 2005
We study the relative effectiveness and the efficiency of computing support-bounding rules that can be used to prune the search space in algorithms for the frequent itemset mining problem (FIM). We develop a formalism wherein these rules can be stated and analyzed using the concepts of differentials and density functions of the support function. We derive a general bounding theorem, which provides lower and upper bounds on the supports of itemsets in terms of the supports of their subsets. Since, in general, many lower and upper bounds exist for the support of an itemset, we show how to determine the best bounds. The result of this optimization shows that the best bounds are among those that involve the supports of all the strict subsets of an itemset of a particular size q. These bounds are determined on the basis of so-called q-rules. In this way, we derive the bounding theorem established by Calders. For these types of bounds, we consider how they compare relative to each other, and in so doing determine the best bounds. Since determining these bounds is combinatorially expensive, we study heuristics that efficiently produce bounds that are usually the best. These heuristics always produce the best bounds on the support of itemsets for basket databases that satisfy independence properties. In particular, we show that, for an itemset I, determining which bounds to compute to obtain the best lower and upper bounds on freq(I) can be done in time O(|I|). Even though, in practice, basket databases do not have these independence properties, we argue that our analysis carries over to a much larger set of basket databases where local "near" independence holds. Finally, we conduct an experimental study using real basket databases, where we compute upper bounds in the context of generalizing the Apriori algorithm.
Both the analysis and the study confirm that the q-rule (q odd and larger than 1) will almost always do better than the 1-rule (Apriori rule) on large, dense basket databases. Our experiment re… (The first two authors were supported by NSF Grant IIS-0082407.)
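Two concrete instances of such support bounds can be sketched from the anti-monotonicity of support and simple inclusion-exclusion; these generic examples illustrate the kind of rules the bounding theorem generalizes, and the function names are illustrative, not from the paper.

```python
from itertools import combinations

def upper_bound_1rule(support, itemset):
    # 1-rule (Apriori rule): support is anti-monotone, so
    # freq(I) <= min over (|I|-1)-subsets J of freq(J).
    itemset = tuple(sorted(itemset))
    return min(support[s] for s in combinations(itemset, len(itemset) - 1))

def lower_bound_example(support, a, b, c):
    # An inclusion-exclusion lower bound: the transactions containing {a,b}
    # and those containing {a,c} are both subsets of those containing {a},
    # hence freq({a,b,c}) >= freq({a,b}) + freq({a,c}) - freq({a}).
    return support[(a, b)] + support[(a, c)] - support[(a,)]
```

On a dataset where {a,b}, {a,c}, and {b,c} have supports 3, 3, 2 and {a} has support 4, both bounds pin freq({a,b,c}) to exactly 2, which is the kind of tight pruning the paper exploits.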
Optimizing inductive queries in frequent itemsets mining
2004
In [6] we presented new classes of constraints, called context-dependent constraints (CDC), whose satisfaction for a given pattern depends on the context in the database in which the pattern occurs. In contrast, traditionally studied constraints [3, 4, 2, 7, 8] are satisfied for a given pattern either in all of its occurrences in the database or in none of them. We call these constraints item-dependent constraints (IDC). We present new algorithms to deal with context-dependent constraints.
Extraction of Frequent Itemsets
Journal of Mathematics Research, 2020
Frequent pattern extraction is the most important step in association rule mining. The time required to generate frequent itemsets plays an important role. This paper provides a comparative study of the Eclat, Apriori, and FP-Growth algorithms. The performance of these algorithms is compared in terms of running time and memory usage.
Fast Algorithms for Mining Interesting Frequent Itemsets without Minimum Support
2009
Real-world datasets are sparse, dirty, and contain hundreds of items. In such situations, discovering interesting rules (results) using the traditional frequent itemset mining approach with a user-defined input support threshold is not appropriate, since without domain knowledge, setting the support threshold too small or too large can output nothing or a large number of redundant, uninteresting results. Recently, a novel approach of mining only the N-most/Top-K interesting frequent itemsets has been proposed, which discovers the top N interesting results without any user-defined support threshold. However, mining interesting frequent itemsets without a minimum support threshold is more costly in terms of itemset search space exploration and processing cost. Thus, the efficiency of their mining highly depends upon three main factors: (1) the database representation approach used for itemset frequency counting, (2) the projection of relevant transactions to lower-level nodes of the search space, and (3) the algorithm implementation technique. Therefore, to improve the efficiency of the mining process, in this paper we present two novel algorithms, N-MostMiner and Top-K-Miner, using the bit-vector representation approach, which is very efficient in terms of itemset frequency counting and transaction projection. In addition, several efficient implementation techniques for N-MostMiner and Top-K-Miner are also presented, drawn from our implementation experience. Our experimental results on benchmark datasets suggest that N-MostMiner and Top-K-Miner are very efficient in terms of processing time compared to the current best algorithms, BOMO and TFP.
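The bit-vector representation mentioned in the abstract can be sketched as follows: each item's set of transaction ids is packed into the bits of an integer, so counting the support of an itemset reduces to bitwise ANDs and a popcount. This is the generic technique only; the function names are illustrative, not the paper's API.

```python
from functools import reduce

def build_bitvectors(transactions):
    # Encode each item's tidset as an integer: bit t is set iff
    # transaction number t contains the item.
    vectors = {}
    for t_idx, t in enumerate(transactions):
        for item in t:
            vectors[item] = vectors.get(item, 0) | (1 << t_idx)
    return vectors

def bitvector_support(vectors, itemset):
    # Tidset intersection is a bitwise AND; support is the number of set bits.
    v = reduce(lambda x, y: x & y, (vectors[i] for i in itemset))
    return bin(v).count("1")
```

Bitwise AND over machine words is what makes this representation fast for frequency counting compared to scanning raw transactions per candidate.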
CC-IFIM: an efficient approach for incremental frequent itemset mining based on closed candidates
The Journal of Supercomputing
Frequent itemset mining (FIM) is the crucial task in mining association rules that finds all frequent k-itemsets in the transaction dataset from which all association rules are extracted. In the big-data era, the datasets are huge and rapidly expanding, so adding new transactions as time advances results in periodic changes in correlations and frequent itemsets present in the dataset. Re-mining the updated dataset is impractical and costly. This problem is solved via incremental frequent itemset mining. Numerous researchers view the new transactions as a distinct dataset (partition) that may be mined to obtain all of its frequent item sets. The extracted local frequent itemsets are then combined to create a collection of global candidates, where it is possible to estimate the support count of the combined candidates to avoid re-scanning the dataset. However, these works are hampered by the growth of a huge number of candidates, and the support count estimation is still imprecise. In...
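The incremental setting described above can be sketched in a simplified exact-count form: mine only the new partition, merge its counts with the stored counts, and re-apply the threshold, so the old data is never re-scanned. Note this toy version keeps exact counts for all small itemsets (feasible only for small alphabets), whereas the works the abstract discusses estimate supports of combined candidates; all names here are illustrative.

```python
from collections import Counter
from itertools import combinations

def mine_counts(transactions, w=2):
    # Exact support counts of every itemset of cardinality <= w.
    counts = Counter()
    for t in transactions:
        for size in range(1, w + 1):
            for cand in combinations(sorted(t), size):
                counts[cand] += 1
    return counts

def incremental_frequent(old_counts, old_n, new_transactions, minsup, w=2):
    # Mine only the increment, merge with the stored counts, and
    # re-apply the relative support threshold over the enlarged dataset.
    merged = old_counts + mine_counts(new_transactions, w)
    n = old_n + len(new_transactions)
    frequent = {c for c, s in merged.items() if s / n >= minsup}
    return frequent, merged, n
```

Because only the new transactions are scanned, appending a batch costs time proportional to the batch, not to the full history, which is the point of incremental FIM.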