On the Effectiveness and Efficiency of Computing Bounds on the Support of Item-Sets in the Frequent Item-Sets Mining Problem
Related papers
Proceedings of the 1st …, 2005
We study the relative effectiveness and the efficiency of computing support-bounding rules that can be used to prune the search space in algorithms to solve the frequent item-sets mining problem (FIM). We develop a formalism wherein these rules can be stated and analyzed using the concept of differentials and density functions of the support function. We derive a general bounding theorem, which provides lower and upper bounds on the supports of item-sets in terms of the supports of their subsets. Since, in general, many lower and upper bounds exist for the support of an item-set, we show how to find the best bounds. The result of this optimization shows that the best bounds are among those that involve the supports of all the strict subsets of an item-set of a particular size q. These bounds are determined on the basis of so-called q-rules. In this way, we derive the bounding theorem established by Calders. For these types of bounds, we consider how they compare relative to each other, and in so doing determine the best bounds. Since determining these bounds is combinatorially expensive, we study heuristics that efficiently produce bounds that are usually the best. These heuristics always produce the best bounds on the support of item-sets for basket databases that satisfy independence properties. In particular, we show that for an item-set I, determining which bounds to compute in order to obtain the best lower and upper bounds on freq(I) can be done in time O(|I|). Even though, in practice, basket databases do not have these independence properties, we argue that our analysis carries over to a much larger set of basket databases where local "near" independence holds. Finally, we conduct an experimental study using real basket databases, where we compute upper bounds in the context of generalizing the Apriori algorithm.
Both the analysis and the study confirm that the q-rule (q odd and larger than 1) will almost always do better than the 1-rule (Apriori rule) on large dense basket databases. Our experiment re-
* The first two authors were supported by NSF Grant IIS-0082407.
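To make the idea of bounding a support from the supports of subsets concrete, here is a minimal Python sketch. It uses the standard inclusion-exclusion deduction rules: for every strict subset X of I, a signed sum over the sets J with X ⊆ J ⊂ I gives an upper bound when |I \ X| is odd and a lower bound when it is even. This is an illustration under that assumption, not the paper's q-rule formalism or its O(|I|) heuristic, and the names `support` and `bounds` are hypothetical. The case X = I minus one item reproduces the Apriori rule.

```python
from itertools import combinations

def support(db, itemset):
    """Number of transactions in db that contain every item of itemset."""
    s = frozenset(itemset)
    return sum(1 for t in db if s <= t)

def bounds(db, itemset):
    """Best (lower, upper) bounds on support(itemset) from strict subsets only."""
    I = frozenset(itemset)
    items = sorted(I)
    lower, upper = 0, float("inf")
    for k in range(len(items)):                 # X is a strict subset of I
        for X in combinations(items, k):
            X = frozenset(X)
            rest = sorted(I - X)
            total = 0
            for r in range(len(rest)):          # J = X ∪ Y with Y ⊂ I \ X
                for Y in combinations(rest, r):
                    J = X | frozenset(Y)
                    sign = (-1) ** (len(I) - len(J) + 1)
                    total += sign * support(db, J)
            if len(rest) % 2 == 1:              # |I \ X| odd: upper bound
                upper = min(upper, total)
            else:                               # |I \ X| even: lower bound
                lower = max(lower, total)
    return lower, upper

db = [set("abc"), set("ab"), set("ac"), set("bc"), set("a")]
print(bounds(db, "ab"), support(db, "ab"))
```

The enumeration is exponential in |I|, which is the combinatorial cost the abstract refers to; a practical implementation would look up subset supports already counted at earlier levels instead of rescanning the database.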
Efficient Algorithms for Mining Frequent Itemsets with Constraint
2011
An important problem of interactive data mining is "to find frequent item sets contained in a subset C of the set of all items in a given database". Reducing the database on C, or incorporating C into an algorithm for mining frequent item sets (such as Charm-L or Eclat), and then solving the problem is very time consuming, especially when C changes often. In this paper, we propose an efficient approach for mining them as follows. First, the class LGA, containing the closed item sets together with their generators, needs to be mined from the database only once. After that, when C changes, the class of all frequent closed item sets and their generators on C is determined quickly from LGA by our algorithm MINE_CG_CONS. From that class, the algorithm MINE_FS_CONS efficiently mines and classifies all frequent item sets satisfying the constraint. Theoretical results and experiments demonstrate the efficiency of our approach.
Adaptive and resource-aware mining of frequent sets
2002
The performance of an algorithm that mines frequent sets from transactional databases may depend heavily on the specific features of the data being analyzed. Moreover, some architectural characteristics of the computational platform used (e.g., the available main memory) can dramatically change the runtime behavior of the algorithm. In this paper we present DCI (Direct Count & Intersect), an efficient data mining algorithm for discovering frequent sets from large databases, which effectively addresses the issues mentioned above. DCI adopts a classical level-wise approach based on candidate generation to extract frequent sets, but uses a hybrid method to determine candidate supports. The most innovative contribution of DCI lies in the multiple heuristic strategies employed, which permit DCI to adapt its behavior not only to the features of the specific computing platform, but also to the features of the dataset being mined, so that it is effective in mining both short and long patterns from sparse and dense datasets. The large number of tests conducted permits us to state that DCI significantly outperforms state-of-the-art algorithms for both synthetic and real-world datasets. Finally, we also discuss the parallelization strategies adopted in the design of ParDCI, a distributed and multi-threaded implementation of DCI.
The segment support map: Scalable mining of frequent itemsets
2000
Since its introduction, frequent set mining has been generalized to many forms, including online mining with Carma, and constrained mining with CAP. Regardless of the form, scalability is always an important aspect of the development. In this paper, we propose a novel structure called the segment support map to help mine frequent itemsets in these various forms. A light-weight structure, the segment support map improves the performance of frequent-set mining algorithms by: (i) obtaining sharper bounds on the support of itemsets, and/or (ii) better exploiting properties of constraints. Our experimental results show the effectiveness of the segment support map.
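One way such a structure can sharpen support bounds is sketched below in Python. It assumes the map stores, for each segment of the database, the support of every single item; the bound on an itemset is then the sum over segments of the smallest per-segment item support. The function names and the equal-size segmentation are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter

def build_ssm(db, n_segments):
    """Split db into contiguous segments and count single-item supports in each."""
    size = max(1, -(-len(db) // n_segments))    # ceiling division, at least 1
    return [
        Counter(item for t in db[start:start + size] for item in t)
        for start in range(0, len(db), size)
    ]

def ssm_upper_bound(ssm, itemset):
    """Upper bound on the support of itemset from per-segment item counts."""
    return sum(min(seg[i] for i in itemset) for seg in ssm)

db = [{"a", "b"}, {"a"}, {"b"}, {"a", "b"}]
print(ssm_upper_bound(build_ssm(db, 1), {"a", "b"}))   # one segment: global bound
print(ssm_upper_bound(build_ssm(db, 2), {"a", "b"}))   # two segments: tighter
```

With a single segment the bound reduces to the minimum of the global item supports; splitting into more segments can only keep the bound the same or lower it, at the cost of storing more counters.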
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2012
Frequent item set mining is one of the best known and most popular data mining methods. Originally developed for market basket analysis, it is used nowadays for almost any task that requires discovering regularities between (nominal) variables. This paper provides an overview of the foundations of frequent item set mining, starting from a definition of the basic notions and the core task. It continues by discussing how the search space is structured to avoid redundant search, how it is pruned with the a priori property, and how the output is reduced by confining it to closed or maximal item sets or generators. In addition, it reviews some of the most important algorithmic techniques and data structures that were developed to make the search for frequent item sets as efficient as possible.
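The a priori property says that every subset of a frequent item set is itself frequent, so a candidate with an infrequent subset can be discarded without counting it. The following Python sketch shows a plain level-wise search that uses this pruning step. It is a textbook-style illustration with hypothetical names, not code from the reviewed paper, and it counts supports by rescanning the database rather than using any optimized data structure.

```python
from itertools import combinations

def apriori(db, min_support):
    """Return {itemset: support} for all item sets with support >= min_support."""
    db = [frozenset(t) for t in db]
    level = {frozenset([i]) for t in db for i in t}     # candidate 1-item sets
    frequent = {}
    k = 1
    while level:
        # Count the support of each candidate of size k.
        counts = {c: sum(1 for t in db if c <= t) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
        # Join step: unions of two frequent (k-1)-sets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step (a priori property): every (k-1)-subset must be frequent.
        level = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
    return frequent

db = [set("abc"), set("ab"), set("ac"), set("bc"), set("a")]
result = apriori(db, min_support=2)
print(sorted(("".join(sorted(s)), n) for s, n in result.items()))
```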
Simple Algorithms for Frequent Item Set Mining
Advances in Machine Learning II, 2010
In this paper I introduce SaM, a split and merge algorithm for frequent item set mining. Its core advantages are its extremely simple data structure and processing scheme, which not only make it quite easy to implement, but also very convenient to execute on external storage, thus rendering it a highly useful method if the transaction database to mine cannot be loaded into main memory. Furthermore, I review RElim (an algorithm I proposed in an earlier paper and improved in the meantime) and discuss different optimization options for both SaM and RElim. Finally, I present experiments comparing SaM and RElim with classical frequent item set mining algorithms (like Apriori, Eclat and FP-growth).
Efficient frequent itemsets mining by sampling
2006
As the first stage of discovering association rules, frequent itemset mining is an important and challenging task for large databases. Sampling provides an efficient way to get approximate answers in much shorter time. Based on the characteristics of frequent itemset counting, a new bound for sampling is proposed, with which fewer samples are necessary to achieve the required accuracy, and the efficiency is much improved over traditional Chernoff bounds.
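For reference, the traditional baseline can be computed directly. The Python sketch below uses the two-sided Hoeffding form of the Chernoff bound, P(|p̂ − p| ≥ ε) ≤ 2·exp(−2nε²), to find how many sampled transactions guarantee that the estimated relative support of one itemset is within ε of the true value with probability at least 1 − δ. The paper's improved bound is not stated in the abstract and is not reproduced here; the function name is hypothetical.

```python
import math

def hoeffding_sample_size(epsilon, delta):
    """Smallest n with 2 * exp(-2 * n * epsilon**2) <= delta."""
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

# Sample size for a 1% additive error with 95% confidence.
print(hoeffding_sample_size(0.01, 0.05))
```

The required size grows with 1/ε² and does not depend on the size of the database, which is why sampling pays off on very large inputs.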
Structure of frequent itemsets with extended double constraints
Vietnam Journal of Computer Science, 2016
Frequent itemset discovery has been an essential task in data mining. In the worst case, the cardinality of the class of all frequent itemsets is exponential, which causes many difficulties for users. Therefore, a model of constraint-based mining is necessary when their needs and interests are the top priority. This paper aims to find a structure of frequent itemsets that satisfy the following conditions: they include a subset C10, contain no items of a subset C11, and have at least one item belonging to a subset C21. The first new point of the paper is the proposed theoretical result, which generalizes our earlier research (Hai et al. in Adv Comput Methods Knowl Eng Sci 479:367-378, 2013). Second, based on new necessary and sufficient conditions discovered just for closed itemsets and their generators, in association with the methods of creating borders and eliminating branches and nodes on the lattice, we can effectively and quickly eliminate not only a class of frequent itemsets but also one or more branches of equivalence classes whose elements do not satisfy the constraints. Third, a structure and a unique representation of frequent itemsets with extended double constraints are shown by representative closed itemsets and their generators. Finally, all theoretical results in this paper are proven to be reliable, and they form a firm basis to guarantee the correctness and efficiency of a new algorithm, MFS-EDC, which is used to effectively mine all constrained frequent itemsets. Experiments show the outstanding efficiency of this