New parallel algorithms for frequent itemset mining in very large databases
Related papers
A highly parallel algorithm for frequent itemset mining
Advances in Pattern …, 2010
Abstract. Mining frequent itemsets in large databases is a widely used technique in Data Mining. Several sequential and parallel algorithms have been developed, although, when dealing with high data volumes, the execution of those algorithms takes more time and resources ...
Parallel and distributed methods for incremental frequent itemset mining
2004
Traditional methods for data mining typically make the assumption that the data is centralized, memory-resident, and static. This assumption is no longer tenable. Such methods waste computational and input/output (I/O) resources when data is dynamic, and they impose excessive communication overhead when data is distributed. Efficient implementation of incremental data mining methods is, thus, becoming crucial for ensuring system scalability and facilitating knowledge discovery when data is dynamic and distributed. In this paper, we address this issue in the context of the important task of frequent itemset mining. We first present an efficient algorithm which dynamically maintains the required information even in the presence of data updates without examining the entire dataset. We then show how to parallelize this incremental algorithm. We also propose a distributed asynchronous algorithm, which imposes minimal communication overhead for mining distributed dynamic datasets. Our distributed approach is capable of generating local models (in which each site has a summary of its own database) as well as the global model of frequent itemsets (in which all sites have a summary of the entire database). This ability permits our approach not only to generate frequent itemsets, but also to generate high-contrast frequent itemsets, which allows one to examine how the data is skewed over different sites.
Index Terms: distributed computing, grid computing, incremental data mining, parallel computing.
I. INTRODUCTION. The field of knowledge discovery and data mining (KDD), spurred by advances in data collection technology, is concerned with the process of deriving interesting and useful patterns from large datasets. Frequent itemset mining is a core data mining task. It has an elegantly simple problem statement: to find the set of all subsets of items that frequently occur together in database records or transactions.
Although this task has a simple statement, it is CPU and input/output (I/O) intensive, mainly because of the large number of itemsets that are typically generated and the large size of the datasets involved in the process.
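The problem statement quoted above can be made concrete with a brute-force sketch. It is only an illustration of the definition, not a method from any of the listed papers; the function name `frequent_itemsets_bruteforce`, the absolute `min_support` count, and the toy transactions are all assumptions made for this example. Enumerating every subset of items is exponential, which is the cost the abstract refers to.

```python
from itertools import combinations


def frequent_itemsets_bruteforce(transactions, min_support):
    """Return {itemset: count} for every itemset that occurs in at least
    `min_support` transactions (an absolute count, assumed here)."""
    items = sorted({item for t in transactions for item in t})
    result = {}
    # Enumerate every non-empty subset of the item universe: 2^n - 1 candidates.
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            # One full scan of the transactions per candidate.
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= min_support:
                result[frozenset(candidate)] = count
    return result


# Hypothetical toy database of four transactions.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
frequent = frequent_itemsets_bruteforce(transactions, min_support=2)
```

With this toy input, every single item and every pair is frequent, while the triple {a, b, c} occurs only once and is rejected.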
Parallel and distributed frequent itemset mining on dynamic datasets
2003
Traditional methods for data mining typically make the assumption that data is centralized and static. This assumption is no longer tenable. Such methods waste computational and I/O resources when data is dynamic, and they impose excessive communication overhead when data is distributed. As a result, the knowledge discovery process is harmed by slow response times. Efficient implementation of incremental data mining ideas in distributed computing environments is thus becoming crucial for ensuring scalability and facilitating knowledge discovery when data is dynamic and distributed. In this paper we address this issue in the context of frequent itemset mining, an important data mining task. Frequent itemsets are most often used to generate correlations and association rules, but more recently they have also been used in such far-reaching domains as bioinformatics and e-commerce applications. We first present an efficient algorithm which dynamically maintains the required information even in the presence of data updates without examining the entire dataset. We then show how to parallelize the incremental algorithm, so that it can asynchronously mine frequent itemsets. Further, we also propose a distributed algorithm, which imposes low communication overhead for mining distributed datasets. Several experiments confirm that our algorithm results in excellent execution time improvements.
A fast parallel algorithm for frequent itemsets mining
IFIP The International Federation for Information Processing
Mining frequent itemsets from large databases is an important computational task with many applications. The best known among them is the market-basket problem, which assumes that we have a large number of items and we want to know which items are bought together. A more recent application is that of web pages (baskets) and linked pages (items): pages with many common references may be about the same topic. In this paper we present a parallel algorithm for mining frequent itemsets. We provide experimental evidence that our algorithm scales quite well and we discuss the merits of parallelization for this problem.
Data Partitioning for Fast Mining of Frequent Itemsets in Massively Distributed Environments
Frequent itemset mining (FIM) is one of the fundamental cornerstones of data mining. While the problem of FIM has been thoroughly studied, few standard or improved solutions scale. This is mainly the case when i) the amount of data tends to be very large and/or ii) the minimum support (MinSup) threshold is very low. In this paper, we propose a highly scalable, parallel frequent itemset mining (PFIM) algorithm, namely Parallel Absolute Top Down (PATD). The PATD algorithm renders the mining process of very large databases (up to terabytes of data) simple and compact. Its mining process is made up of only one parallel job, which dramatically reduces the mining runtime, the communication cost and the energy consumption overhead in a distributed computational platform. Based on a clever and efficient data partitioning strategy, namely Item Based Data Partitioning (IBDP), the PATD algorithm mines each data partition independently, relying on an absolute minimum support (AMinSup) instead of a relative one. PATD has been extensively evaluated using real-world data sets. Our experimental results suggest that the PATD algorithm is significantly more efficient and scalable than alternative approaches.
A generalized parallel algorithm for frequent itemset mining
A parallel algorithm for finding the frequent itemsets in a set of transactions is presented. The frequent individual items are identified by their index. We assume that the number of processors (m) is less than the number of frequent items (n). In the first stage, every processor Pi, i ∈ {1, ..., m - 1}, sequentially computes the frequent itemsets from the interval Ii = [(i - 1)·p + 1, i·p], where p = ⌊n/m⌋. The processor Pm computes the frequent itemsets from the interval Im = [(m - 1)·p + 1, n]. In the second stage, the parallel algorithm is applied. The processor Pi computes, step by step, the sets FIi,Ij of the frequent itemsets with individual items from the intervals Ii,j = Ii ∪ Ii+1 ∪ ... ∪ Ij, j = i+1, ..., m. In order to compute the set FIi,Ij, the processor Pi uses FIi,Ij-1 obtained in the previous step and FIi+1,Ij received from the processor Pi+1. The main advantage of our parallel algorithm is that it uses a communication pattern known before algorithm start,...
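The interval assignment in the abstract above is plain arithmetic, so it can be checked with a short sketch. This covers only the partitioning of item indices among processors, not the mining or the communication stage; the function names `item_intervals` and `merged_interval` are hypothetical and chosen for this example.

```python
def item_intervals(n, m):
    """Split item indices 1..n among m processors, with p = floor(n / m).

    Processors 1..m-1 get [(i - 1) * p + 1, i * p]; processor m gets the
    remainder [(m - 1) * p + 1, n]. Bounds are inclusive.
    """
    if not 0 < m < n:
        raise ValueError("the abstract assumes fewer processors than items")
    p = n // m
    intervals = [((i - 1) * p + 1, i * p) for i in range(1, m)]
    intervals.append(((m - 1) * p + 1, n))
    return intervals


def merged_interval(intervals, i, j):
    """Bounds of the union of the contiguous intervals i..j (1-based)."""
    return (intervals[i - 1][0], intervals[j - 1][1])


intervals = item_intervals(n=10, m=3)   # [(1, 3), (4, 6), (7, 10)]
span = merged_interval(intervals, 1, 2)  # (1, 6)
```

Because the last processor absorbs the remainder, its interval can be larger than the others (here four items instead of three).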
Parallel Binary Approach for Frequent Itemsets Mining
Association rule discovery is one of the best known and most explored techniques of data mining. This technique has two main phases: the first is to extract all the frequent itemsets and the second is to generate association rules from these frequent itemsets. The first phase is the most expensive, given the large number of accesses to the transaction database and the large number of candidate itemsets. As databases are generally very large, a solution to avoid the repetitive and costly accesses is to represent them by compact structures. In this paper, we propose a parallel binary approach for extracting frequent itemsets, to deal with the great number of candidates and to take advantage of multicore architectures. This approach is implemented using a compact data structure, based on a signature tree, that represents the database so that it is accessed only once.
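To illustrate why a compact binary representation avoids repeated database accesses, here is a minimal sketch that uses one integer bitmap per item. This is a generic vertical-bitmap technique swapped in for illustration; it is not the signature tree described in the abstract, and the names `build_item_bitmaps` and `support` are assumptions.

```python
def build_item_bitmaps(transactions):
    """Read the database once; bit `tid` of bitmaps[item] is set when
    transaction number `tid` contains `item`."""
    bitmaps = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << tid)
    return bitmaps


def support(itemset, bitmaps, n_transactions):
    """Count transactions containing every item of `itemset` by AND-ing
    the item bitmaps and counting the remaining set bits."""
    mask = (1 << n_transactions) - 1  # start with all transactions selected
    for item in itemset:
        mask &= bitmaps.get(item, 0)  # unknown items match no transaction
    return bin(mask).count("1")


# Hypothetical toy database.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
bitmaps = build_item_bitmaps(transactions)
pair_support = support({"a", "b"}, bitmaps, len(transactions))
```

After the single pass in `build_item_bitmaps`, every support query works on the bitmaps alone, so candidate counting no longer touches the original transactions.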
A fast Parallel Association Rule Mining Algorithm Based on the Probability of Frequent Itemsets
Frequent itemset finding is the most costly processing step in analyzing large transactional databases. At each stage of discovering frequent itemsets, a huge number of candidate itemsets is produced. If we can predict which candidate itemsets will be frequent and which will not, we can reduce the time wasted on processing infrequent itemsets. In this paper we propose a new parallel algorithm for frequent itemset mining, called the probability of frequent itemset (PFI) mining algorithm. The PFI algorithm predicts the frequency of a candidate based on the probability of its subset and prioritizes candidate itemsets based on that probability. Moreover, the PFI algorithm passes over the database only once, dividing it horizontally and distributing it over the system nodes. Also, while finding the k-itemsets, the algorithm can start a new stage (finding (k+1)-itemsets) with the frequent k-itemsets discovered so far, while other itemsets in the same stage have not been finished yet. In addition, we introduce a method for reducing the number of transactions. We present performance results for our algorithm on various datasets and compare it against well-known algorithms.
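The horizontal division mentioned in the abstract above can be sketched as follows: each partition of transactions is counted independently and the local counts are summed. This shows only the generic partition-and-merge idea, not the PFI algorithm's probability-based prioritization, which the abstract does not specify in enough detail to reproduce. The function names are hypothetical, and a thread pool stands in for the system nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def split_horizontally(transactions, n_parts):
    """Assign whole transactions to partitions in round-robin order."""
    return [transactions[i::n_parts] for i in range(n_parts)]


def count_local(partition, k):
    """Count every k-itemset occurring in one partition."""
    counts = Counter()
    for transaction in partition:
        for itemset in combinations(sorted(transaction), k):
            counts[itemset] += 1
    return counts


def count_k_itemsets(transactions, k, n_parts=2):
    """Count k-itemsets per partition in worker threads, then merge."""
    parts = split_horizontally(transactions, n_parts)
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        for local in pool.map(lambda part: count_local(part, k), parts):
            total.update(local)  # Counter.update adds counts together
    return total


# Hypothetical toy database.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
pair_counts = count_k_itemsets(transactions, k=2)
```

Summing local counts is valid because each transaction lives in exactly one partition. In CPython, threads illustrate the structure rather than deliver a speedup for this CPU-bound loop; separate processes or machines would be needed for that.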
Efficient Data Mining for Frequent Itemsets in Dynamic and Distributed Databases
2003
Data mining is one of the central activities associated with understanding and exploiting the world of digital data. It is the mechanized process of modeling large databases by means of discovering useful patterns. A frequent itemset is a pattern describing a relevant subset of the data, and a collection of frequent itemsets is particularly useful because it is an extremely compact model of the database. Discovering frequent itemsets in large databases is usually a hard computational task, which can be even harder when data is dynamic and distributed. Applying traditional algorithms to such data results in high communication overhead, excessive waste of CPU and I/O resources, and privacy violations, and often fails to meet the stringent response times required by what is essentially an interactive process of exploiting the data. Hence, there is an urgent need for non-trivial algorithms that can effectively mine frequent itemsets in dynamic and distributed databases. Such algorithms are presented in this master's thesis.
Mining of Association Rules on Large Database Using Distributed and Parallel Computing
Procedia Computer Science, 2016
Nowadays, due to the rapid growth of data in organizations, extensive data processing is a central concern of information technology. Mining association rules in a large database is a challenging task. The Apriori algorithm is widely used to find the frequent itemsets in a database, but it is inefficient on large databases because it requires a high I/O load. This drawback has since been addressed by many sequential and parallel algorithms, but those are also inefficient at finding frequent itemsets in large databases quickly. Hence a hybrid architecture is proposed that integrates distributed and parallel computing. The main idea of the new architecture is to combine distributed and parallel computing so that frequent itemsets can be found in large databases in less time. It also handles large databases more efficiently than existing algorithms.
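For readers unfamiliar with the Apriori algorithm named in the abstract above, a minimal in-memory sketch follows. It is the textbook level-wise procedure, not the hybrid architecture the paper proposes; the function name `apriori`, the absolute `min_support` count, and the toy data are assumptions. The sketch scans all transactions once per level, which is the source of the I/O load the abstract mentions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for all itemsets that occur in at least
    `min_support` transactions (absolute count, assumed here)."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join frequent (k-1)-itemsets, then prune any candidate that has
        # an infrequent (k-1)-subset (the Apriori property).
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # One pass over the whole database per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy database.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
frequent = apriori(transactions, min_support=2)
```

On this input the third level generates the single candidate {a, b, c}, counts it once, and stops because nothing at that level is frequent.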