Efficient frequent itemsets mining by sampling
Related papers
Lecture Notes in Computer Science, 2012
The tasks of extracting (top-K) Frequent Itemsets (FI's) and Association Rules (AR's) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need to scan the entire dataset, possibly multiple times. High-quality approximations of FI's and AR's are sufficient for most practical uses, and a number of recent works have explored the application of sampling for fast discovery of approximate solutions to these problems. However, these works do not provide satisfactory performance guarantees on the quality of the approximation, due to the difficulty of bounding the probability of under- or over-sampling any one of an unknown number of frequent itemsets.
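To make the sampling idea concrete, here is a minimal sketch (not the paper's algorithm) in which frequent itemsets are mined from a uniform random sample of the transactions rather than from the full dataset. The function names, the naive level-wise counter, and the `max_len` cap are illustrative assumptions; the approximation error then depends on the sample size, which is exactly the quantity the sampling-based works above try to bound.

```python
import random
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_freq, max_len=3):
    """Naive level-wise counting: report every itemset of size <= max_len whose
    relative frequency in the given transactions is at least min_freq."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_len + 1):
            for itemset in combinations(items, k):
                counts[itemset] += 1
    return {itemset: c / n for itemset, c in counts.items() if c / n >= min_freq}

def sample_and_mine(transactions, sample_size, min_freq):
    """Mine a uniform random sample instead of the full dataset; the result is
    only an approximation of the exact frequent itemsets."""
    sample = random.sample(transactions, min(sample_size, len(transactions)))
    return frequent_itemsets(sample, min_freq)
```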
A New Approach for Approximately Mining Frequent Itemsets
2019
Mining frequent itemsets in transaction databases is an important task in many applications. This task becomes challenging when dealing with a very large transaction database because traditional algorithms are not scalable due to the memory limit. In this paper, we propose a new approach for approximate mining of frequent itemsets in a transaction database. First, we partition the set of transactions in the database into disjoint subsets and make the distribution of frequent itemsets in each subset similar to that of the entire database. Then, we randomly select a number of these subsets and independently mine the frequent itemsets in each of them. After that, each frequent itemset discovered from these subsets is voted on, and any itemset appearing in a majority of the subsets is reported as a frequent itemset, called a popular frequent itemset. All popular frequent itemsets are compared with the frequent itemsets discovered directly from the entire database using the same frequency threshold. The r...
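A hedged sketch of the partition-and-vote idea described in this abstract: transactions are split into subsets, a random selection of subsets is mined independently, and itemsets frequent in a majority of the mined subsets are reported as "popular". The round-robin partitioning below is a simple stand-in for the paper's distribution-preserving partitioning, and `mine` can be any function mapping (transactions, threshold) to a dictionary of frequent itemsets, such as the sketch after the first abstract above.

```python
import random
from collections import Counter

def partition_round_robin(transactions, k):
    """Split the transactions into k disjoint subsets (a simple stand-in for the
    paper's distribution-preserving partitioning)."""
    parts = [[] for _ in range(k)]
    for i, t in enumerate(transactions):
        parts[i % k].append(t)
    return parts

def popular_frequent_itemsets(transactions, k, m, min_freq, mine):
    """Mine m randomly chosen subsets independently and vote: an itemset that is
    frequent in a majority of the mined subsets is reported as 'popular'."""
    parts = partition_round_robin(transactions, k)
    chosen = random.sample(parts, m)          # requires m <= k
    votes = Counter()
    for subset in chosen:
        votes.update(mine(subset, min_freq).keys())
    return {itemset for itemset, v in votes.items() if v > m / 2}
```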
A new approximate method for mining frequent itemsets from big data
Computer Science and Information Systems, 2021
Mining frequent itemsets in transaction databases is an important task in many applications. It becomes more challenging when dealing with a large transaction database because traditional algorithms are not scalable due to the limited main memory. In this paper, we propose a new approach for the approximate mining of frequent itemsets in a big transaction database. Our approach is suitable for mining big transaction databases since it uses the frequent itemsets from a subset of the entire database to approximate the result for the whole data, and it can be implemented in a distributed environment. Our algorithm efficiently produces highly accurate results; however, it misses some true frequent itemsets. To address this problem and reduce the number of false-negative frequent itemsets, we introduce an additional parameter to the algorithm so that it discovers most of the frequent itemsets contained in the entire data set. In this article, we show an empirical evaluation of the results of ...
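The "additional parameter" mentioned here serves, in spirit, to relax the threshold used when mining the subset so that borderline itemsets are not lost. The sketch below assumes a simple multiplicative relaxation factor, which is an illustration rather than the paper's exact definition; `mine` is any mining function with the interface used in the earlier sketches.

```python
def mine_subset_with_relaxation(subset, min_freq, relax, mine):
    """Mine a subset with the lowered threshold min_freq * relax (0 < relax <= 1)
    so that borderline itemsets are not missed, then split the result into those
    that also pass the original threshold on the subset and the borderline ones."""
    candidates = mine(subset, min_freq * relax)
    confirmed = {i: f for i, f in candidates.items() if f >= min_freq}
    borderline = {i: f for i, f in candidates.items() if f < min_freq}
    return confirmed, borderline
```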
CBW: an efficient algorithm for frequent itemset mining
37th Annual Hawaii International Conference on System Sciences, 2004. Proceedings of the, 2004
Frequent itemset generation is the prerequisite and most time-consuming step of association rule mining. Nowadays, most efficient Apriori-like algorithms rely heavily on the minimum support constraint to prune a vast number of non-candidate itemsets. This pruning technique, however, becomes less useful in some real applications where the supports of interesting itemsets are extremely small, such as medical diagnosis and fraud detection, among others. In this paper, we propose a new algorithm that maintains its performance even at relatively low supports. Empirical evaluations show that our algorithm is, on average, more than an order of magnitude faster than Apriori-like algorithms.
Mining Approximate Frequent Itemsets In the Presence of Noise: Algorithm and Analysis
Proceedings of the 2006 SIAM International Conference on Data Mining, 2006
Frequent itemset mining is a popular and important first step in the analysis of data arising in a broad range of applications. The traditional "exact" model for frequent itemsets requires that every item occur in each supporting transaction. However, real data is typically subject to noise and measurement error. To date, the effect of noise on exact frequent pattern mining algorithms has been addressed primarily through simulation studies, and there has been limited attention to the development of noise-tolerant algorithms.
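To make the contrast concrete, here is a small sketch comparing exact support (every item of the itemset must appear in a supporting transaction) with one common noise-tolerant relaxation, in which a transaction supports an itemset if it contains at least a fraction 1 - eps of its items. The eps-based formulation is an assumption for illustration, not necessarily the model analyzed in the paper.

```python
def exact_support(itemset, transactions):
    """Exact model: a transaction supports the itemset only if it contains
    every one of its items."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def noise_tolerant_support(itemset, transactions, eps=0.2):
    """Relaxed model: a transaction supports the itemset if it contains at least
    a fraction (1 - eps) of its items, so a few noisy omissions are tolerated."""
    itemset = set(itemset)
    needed = (1 - eps) * len(itemset)
    hits = sum(1 for t in transactions if len(itemset & set(t)) >= needed)
    return hits / len(transactions)
```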
Efficiently Mining Frequent Itemsets using Various Approaches: A Survey
International Journal of Computer Applications, 2012
In this paper we present the various elementary traversal approaches for mining association rules. We start with a formal definition of association rules and their basic algorithm. We then discuss association rule mining algorithms from several perspectives, such as the breadth-first, depth-first, and hybrid approaches. The various approaches are compared in terms of time complexity and I/O overhead on the CPU. Finally, the paper discusses the prospects of association rule mining and the areas where there is scope for improved scalability.
Mining top-K frequent itemsets through progressive sampling
Data Mining and Knowledge Discovery, 2010
We study the use of sampling for efficiently mining the top-K frequent itemsets of cardinality at most w. To this purpose, we define an approximation to the top-K frequent itemsets to be a family of itemsets which includes (resp., excludes) all very frequent (resp., very infrequent) itemsets, together with an estimate of these itemsets' frequencies with a bounded error. Our first result is an upper bound on the sample size which guarantees that the top-K frequent itemsets mined from a random sample of that size approximate the actual top-K frequent itemsets, with probability larger than a specified value. We show that the upper bound is asymptotically tight when w is constant. Our main algorithmic contribution is a progressive sampling approach, combined with suitable stopping conditions, which on appropriate inputs is able to extract approximate top-K frequent itemsets from samples whose sizes are smaller than the general upper bound. In order to test the stopping conditions, this approach maintains the frequency of all itemsets encountered, which is practical only for small w. However, we show how this problem can be mitigated by using a variation of Bloom filters. A number of experiments conducted on both synthetic and real benchmark datasets show that using samples substantially smaller than the original dataset (i.e., of the size defined by the upper bound or reached through the progressive sampling approach) enables approximating the actual top-K frequent itemsets with accuracy much higher than what is analytically proven.
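A hedged sketch of a progressive-sampling loop in the spirit of this abstract: the sample grows geometrically until either a stopping condition holds or the analytical upper bound on the sample size is reached. The stopping condition used below (the top-K set is unchanged between consecutive samples) and the parameter names are illustrative assumptions, not the paper's conditions; `mine_topk` is any routine returning the set of top-K itemsets of a sample.

```python
import random

def progressive_topk(transactions, K, start_size, upper_bound, mine_topk, growth=2.0):
    """Grow the sample geometrically; stop when the mined top-K set is unchanged
    between consecutive samples (illustrative stopping condition) or when the
    analytical upper bound on the sample size is reached."""
    size = start_size
    previous = None
    while True:
        size = min(int(size), upper_bound, len(transactions))
        sample = random.sample(transactions, size)
        current = mine_topk(sample, K)        # expected to return a set of itemsets
        if previous is not None and current == previous:
            return current
        if size >= upper_bound or size >= len(transactions):
            return current
        previous = current
        size *= growth
```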
A new approach for the discovery of frequent itemsets
1999
The discovery of the most recurrent association rules in a large database of sales transactions requires that the sets of items bought together by a sufficiently large population of customers be identified. This is a critical task, since the number of generated itemsets grows exponentially with the total number of items. Most algorithms start by identifying the sets with the lowest cardinality and then increase it progressively.
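The level-wise strategy sketched in this abstract is the classical Apriori scheme: frequent k-itemsets are joined into (k+1)-candidates, which are then counted against the database. A minimal illustrative implementation, assuming transactions are given as collections of hashable items (subset-based candidate pruning is omitted for brevity):

```python
from collections import Counter
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise mining: count candidate k-itemsets, keep the frequent ones,
    and join them into (k+1)-candidates until no candidates remain."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    candidates = {frozenset([i]) for i in items}
    frequent = {}
    while candidates:
        counts = Counter()
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # join step: two frequent k-itemsets sharing k-1 items give a (k+1)-candidate
        keys = list(level)
        candidates = {a | b for a, b in combinations(keys, 2)
                      if len(a | b) == len(a) + 1}
    return frequent
```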
Efficient Algorithms for Mining Frequent Itemsets with Constraint
2011
An important problem in interactive data mining is to find the frequent itemsets contained in a subset C of the set of all items of a given database. Reducing the database to C, or incorporating C into an algorithm for mining frequent itemsets (such as Charm-L or Eclat) and re-solving the problem, is very time-consuming, especially when C changes often. In this paper, we propose an efficient approach for mining such itemsets. First, the class LGA, containing the closed itemsets together with their generators, is mined from the database only once. After that, whenever C changes, the class of all frequent closed itemsets and their generators on C is determined quickly from LGA by our algorithm MINE_CG_CONS. We then obtain the algorithm MINE_FS_CONS, which efficiently mines and classifies all frequent itemsets with the constraint from that class. Theoretical results and experiments demonstrate the efficiency of our approach.
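For intuition only: the constraint in question restricts attention to frequent itemsets contained in a given item subset C, so once the itemsets have been mined a single time, changing C only requires filtering, not a new database scan. The sketch below shows that filtering step on a flat dictionary of itemsets; it is not the paper's MINE_CG_CONS/MINE_FS_CONS algorithms, which operate on the closed-itemset/generator class LGA.

```python
def frequent_itemsets_with_constraint(frequent, C):
    """Given an already-mined dictionary mapping itemsets (frozensets) to their
    supports, keep only the itemsets contained in the constraint set C; changing
    C therefore requires no new database scan."""
    C = frozenset(C)
    return {itemset: support for itemset, support in frequent.items() if itemset <= C}
```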
Proceedings of the 1st …, 2005
We study the relative effectiveness and the efficiency of computing support-bounding rules that can be used to prune the search space in algorithms that solve the frequent item-sets mining problem (FIM). We develop a formalism wherein these rules can be stated and analyzed using the concepts of differentials and density functions of the support function. We derive a general bounding theorem, which provides lower and upper bounds on the supports of item-sets in terms of the supports of their subsets. Since, in general, many lower and upper bounds exist for the support of an item-set, we show how to determine the best bounds. The result of this optimization shows that the best bounds are among those that involve the supports of all the strict subsets of an item-set of a particular size q. These bounds are determined on the basis of so-called q-rules. In this way, we derive the bounding theorem established by Calders. For these types of bounds, we consider how they compare relative to each other, and in so doing determine the best bounds. Since determining these bounds is combinatorially expensive, we study heuristics that efficiently produce bounds that are usually the best. These heuristics always produce the best bounds on the support of item-sets for basket databases that satisfy independence properties. In particular, we show that, for an item-set I, determining which bounds to compute in order to obtain the best lower and upper bounds on freq(I) can be done in time O(|I|). Even though, in practice, basket databases do not have these independence properties, we argue that our analysis carries over to a much larger set of basket databases where local "near" independence holds. Finally, we conduct an experimental study using real basket databases, where we compute upper bounds in the context of generalizing the Apriori algorithm. Both the analysis and the study confirm that the q-rule (q odd and larger than 1) will almost always do better than the 1-rule (Apriori rule) on large dense basket databases. Our experiment re...
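For reference, the "1-rule" (the Apriori rule) mentioned above is the elementary monotonicity bound on support, which the q-rules generalize through deeper inclusion-exclusion arguments; in the notation used here (an assumption for illustration, not the paper's own):

\[
\operatorname{freq}(I) \;\le\; \min_{i \in I} \operatorname{freq}\bigl(I \setminus \{i\}\bigr)
\]

so an item-set I can be pruned without counting it as soon as one of its (|I| - 1)-item subsets is known to be infrequent.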