Mining Association Rules: A Case Study on Benchmark Dense Data
Abstract
Data mining is the process of discovering knowledge and previously unknown patterns from large amounts of data. Association rule mining (ARM) has been a continuing trend, since newly discovered patterns can support important predictions about many issues. Since the first introduction of frequent itemset mining, it has received major attention among researchers, and various efficient and sophisticated algorithms have been proposed for it. Among the best-known algorithms are Apriori and FP-Growth. In this paper, we explore these algorithms and compare their results in generating association rules on benchmark dense datasets taken from the frequent itemset mining data repository. The two algorithms are implemented in RapidMiner 5.3.007 and their performance results are compared. FP-Growth is found to be the better algorithm under the support-confidence framework.
The rest of this paper is organized as follows. Section 2 describes the rudiments of association rules. Section 3 describes the Apriori and FP-Growth algorithms. Section 4 presents the experimental results. Finally, Section 5 concludes the work.

1.1. Association Rules

The following is the formal definition of the problem in [3]. Let I = {i1, i2, ..., im} be the set of items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. An association rule is an implication of the form X ⇒ Y, where X represents the antecedent part of the rule and Y represents the consequent part, with X ⊆ I, Y ⊆ I and X ∩ Y = ∅. An itemset that satisfies the minimum support is called a frequent itemset. The rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y. The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y. These support-confidence notions can be written as:

support(X ⇒ Y) = (number of transactions in D containing X ∪ Y) / |D|
confidence(X ⇒ Y) = support(X ∪ Y) / support(X)
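The support and confidence definitions above can be sketched directly in code. The following is a minimal illustration on a hypothetical toy transaction set (not one of the paper's benchmark datasets):

```python
# Hypothetical toy transaction database D; each transaction T is a set of items.
D = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    """confidence(X => Y) = support(X union Y) / support(X)."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

print(support({"diapers", "beer"}, D))       # 3 of 5 transactions -> 0.6
print(confidence({"diapers"}, {"beer"}, D))  # (3/5) / (4/5) -> 0.75
```

Here {diapers} ⇒ {beer} has support 0.6 and confidence 0.75, so it would qualify as a frequent, high-confidence rule under, e.g., min_sup = 0.5 and min_conf = 0.7.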
Figure 2. Construct FP-tree from a Transaction Database [5]. Figure 3. Finding Patterns Having P from P-conditional Database [5]
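The FP-tree construction shown in Figure 2 can be sketched as follows. This is a minimal illustration of the two-pass construction in [5] (count item frequencies, then insert each transaction with items ordered by descending frequency); the header table and node-links used for the mining phase are omitted, and the transactions are a hypothetical example:

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}   # item -> FPNode

def build_fp_tree(transactions, min_support_count):
    # Pass 1: count item frequencies and keep only frequent items.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    frequent = {i: c for i, c in counts.items() if c >= min_support_count}

    # Pass 2: insert each transaction with its frequent items sorted by
    # descending frequency (ties broken alphabetically), sharing prefixes.
    root = FPNode(None, None)
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root, frequent

T = [["a", "b"], ["b", "c", "d"], ["a", "c", "d", "e"],
     ["a", "d", "e"], ["a", "b", "c"]]
root, frequent = build_fp_tree(T, min_support_count=3)
print(frequent)  # {'a': 4, 'b': 3, 'c': 3, 'd': 3}  ('e' is infrequent)
```

Because frequent items are sorted before insertion, transactions sharing a prefix share a path, which is what keeps the FP-tree compact relative to the raw database.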
The W-Apriori process is an extension of Weka's Apriori into the RM tool. First, the benchmark data (in CSV) is retrieved by calling the retrieve() process. Then a data transformation is constructed through the discretizebyfrequency() process. This operator converts the selected numerical attributes into nominal attributes by discretizing each numerical attribute into a user-specified number of bins. Bins of equal frequency are generated automatically, so the ranges of different bins may vary. The data is then converted by nominaltonumerical() followed by numericaltopolynominal(). The nominaltonumerical() process changes nominal attributes to numerical attributes, while numericaltopolynominal() changes numerical attributes to polynominal attributes, which are accepted by the Apriori algorithm. Finally, we call the Weka extension W-Apriori() to generate the best rules, with its parameters left at their default values. Figure 6 depicts the processes involved in deploying the Weka extension W-FPGrowth.

2.2. RM Development and Results
The root process starts with retrieving the CSV dataset. Then discretizebyfrequency() is selected to change the real attributes to nominal. Next, the NominaltoBinominal() process converts the nominal attributes to binominal attributes, as required by the FP-Growth operator.
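The equal-frequency discretization step used in both pipelines can be sketched in plain code. This is an approximation of what RapidMiner's Discretize by Frequency operator does, not its actual implementation; the function name and bin labels are hypothetical:

```python
def discretize_by_frequency(values, n_bins):
    """Equal-frequency binning: rank the values, split the ranks into
    n_bins groups of (nearly) equal size, and map each value to a nominal
    bin label. Bin sizes stay balanced while bin ranges may vary."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [None] * len(values)
    for rank, idx in enumerate(order):
        labels[idx] = f"bin{rank * n_bins // len(values) + 1}"
    return labels

values = [3.2, 7.7, 1.0, 9.4, 5.5, 2.1, 8.8, 4.3]
print(discretize_by_frequency(values, 2))
# ['bin1', 'bin2', 'bin1', 'bin2', 'bin2', 'bin1', 'bin2', 'bin1']
```

With two bins on eight values, the four smallest values land in bin1 and the four largest in bin2; the resulting nominal attributes can then be converted onward (to polynominal for Apriori, or to binominal for FP-Growth) as described above.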
Figure 7. W-Apriori vs. W-FPGrowth: Execution time (in seconds) when min_conf = 0.9

Figures 7-12 illustrate the graphs of the results obtained. From these figures, which represent three different values of min_conf (i.e. 0.9, 0.5 and 0.1), it can be seen that the patterns plotted are almost similar. The graphs show that W-FPGrowth outperforms W-Apriori: more rules are generated within less execution time. From the detailed results in RM, W-Apriori and W-FPGrowth interpret almost the same attributes as antecedents and consequents. With W-FPGrowth, more attributes are found to form interesting rules compared to W-Apriori. Any mining algorithm should find the same set of rules, although their computational efficiency and memory requirements may differ [5].
Figure 9. W-Apriori vs. W-FPGrowth: Execution time (in seconds) when min_conf = 0.5
Figure 8. W-Apriori vs. W-FPGrowth: Rules generated when min_conf = 0.9
Figure 10. W-Apriori vs. W-FPGrowth: Rules generated when min_conf = 0.5
Figure 13. Scalability of Apriori vs. data size on execution time when min_conf = 0.9
In reviewing the scalability of Apriori and FP-Growth over five (5) datasets of varying size, there is only a small variation in execution time and in the number of rules generated. On the whole, these algorithms scale well with data size, as shown in Figures 13-16; good scalability of an algorithm to data size is highly desirable in real data mining applications [8].
Figure 14. Scalability of FP-Growth vs. data size on execution time when min_conf = 0.9. Figure 16. Scalability of FP-Growth vs. data size on rules generated when min_conf = 0.9
References (8)
- Tan PN, Steinbach M, Kumar V. Introduction to Data Mining. First Edition. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc. 2005.
- Trieu TA, Kunieda Y. An improvement for declat algorithm. Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication (ICUIMC'12). 2012; 54: 1-6.
- Agrawal R, Srikant R, et al. Fast algorithms for mining association rules. In Proc. 20th int. conf. very large data bases, VLDB. 1994; 1215: 487-499.
- Agrawal R, Imieliński T, Swami A. Mining association rules between sets of items in large databases. In SIGMOD Rec. 1993; 22: 207-216.
- Han J, Kamber M, Pei J. Data mining: concepts and techniques. Morgan Kaufmann. 2006.
- Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation. In ACM SIGMOD Record. ACM. 2000; 29(2): 1-12.
- Han J, Pei J, Yin Y, Mao R. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. In Data mining and knowledge discovery. 2004; 8(1): 53-87.
- Mamat R, Herawan T, Deris MM. MAR: Maximum attribute relative of soft set for clustering attribute selection. In Knowledge-Based Systems. 2013; 52: 11-20.