International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 5, Issue 1, January 2015)

An Extensive Survey on Association Rule Mining Algorithms

Mihir R Patel and Dipak Dabhi, Assistant Professors, Department of Computer Engineering, CGPIT, Bardoli, India

Abstract— Association rule mining (ARM) aims at extracting hidden relations and interesting associations between the items in a transactional database. The purpose of this study is to present the fundamentals of association rule mining, the main approaches to mining association rules, the various algorithms, and a comparison between those algorithms.

Keywords— Association Rule Mining; Bottom-Up Approach; Top-Down Approach.

I. INTRODUCTION

Association rule mining, one of the most important and well-researched techniques of data mining, was introduced by Agrawal et al. (1993). It aims to extract interesting correlations, frequent patterns, associations, or causal structures among sets of items in transaction databases or other data repositories.

Let I = {I1, I2, ..., Im} be a set of m distinct attributes, T be a transaction that contains a set of items such that T ⊆ I, and D be a database of transaction records. An association rule is an implication of the form X⇒Y, where X, Y ⊆ I are sets of items called itemsets and X ∩ Y = ∅. Here X is called the antecedent and Y the consequent; the rule means X implies Y [1].

To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best-known constraints are minimum thresholds on support and confidence. Since the database is large and users care only about frequently purchased items, thresholds of support and confidence are usually predefined by users to drop the rules that are not interesting or useful; the two thresholds are called minimal support and minimal confidence, respectively.

The rule X⇒Y holds in the transaction set D with support s, where s is the fraction of records that contain X∪Y out of the total number of records n in the database [2]. The support count of an itemset X, denoted X.count, in a data set D is the number of transactions in D that contain X. Then,

support(X⇒Y) = (X∪Y).count / n

Confidence of the association rule X⇒Y is the fraction of transactions that contain X∪Y among the transactions that contain X; an interesting association rule X⇒Y can be generated whenever this fraction exceeds the confidence threshold:

confidence(X⇒Y) = (X∪Y).count / X.count

Association rule mining thus finds all association rules that satisfy the predefined minimum support and confidence in a given database. The problem is usually decomposed into two sub-problems [2]:
1. Find the itemsets whose occurrence in the database exceeds a predefined threshold; these are called frequent (or large) itemsets.
2. Generate association rules from those large itemsets under the minimal confidence constraint.

II. ASSOCIATION RULE MINING

Association rule mining approaches can be divided into two classes:

A. Bottom-Up Approach

The bottom-up approach looks for frequent itemsets in the given dataset that satisfy the predefined constraint, building large frequent itemsets through the combination and pruning of small frequent itemsets. The principle of the algorithm is: first calculate the support of every itemset in the candidate set Ck obtained from Lk-1; every candidate k-itemset whose support is greater than or equal to the minimum support is a frequent k-itemset, giving Lk; then combine the frequent k-itemsets into a new candidate set Ck+1, and proceed level by level until no larger frequent itemsets are found.
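The level-wise procedure just described can be sketched in a few lines of Python. This is a minimal illustration on a made-up basket dataset, not an optimized implementation of any particular published algorithm:

```python
def apriori(transactions, min_support):
    """Level-wise bottom-up search: frequent (k-1)-itemsets L[k-1] are
    combined into candidate k-itemsets C[k], which are pruned against
    the minimum support to give the frequent k-itemsets L[k]."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # L1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items
                if support(frozenset([i])) >= min_support}
    all_frequent = set(frequent)
    k = 2
    while frequent:
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        # Prune step: keep only candidates that meet the minimum support
        frequent = {c for c in candidates if support(c) >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent

baskets = [{'bread', 'milk'}, {'bread', 'butter'},
           {'bread', 'milk', 'butter'}, {'milk'}]
print(sorted(map(sorted, apriori(baskets, 0.5))))
```

With min_support = 0.5 the loop stops at k = 3, since the only candidate 3-itemset {bread, milk, butter} occurs in just one of the four baskets.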
B. Top-Down Approach

A major challenge in mining frequent itemsets from a large data set is that such mining often generates a huge number of itemsets satisfying the minimum support threshold, especially when min_sup is set low; this is because if an itemset is frequent, each of its subsets is frequent as well. It is therefore useful to identify a small representative set of itemsets from which all other frequent itemsets can be derived; such an approach is known as the top-down approach. Rather than finding more general frequent itemsets, the top-down approach looks for more specific ones. Two such representations are presented here:
1) Maximal Frequent Itemsets
2) Closed Frequent Itemsets

An itemset X is a maximal frequent itemset (or max-itemset) in a set S if X is frequent and there exists no super-itemset Y such that X ⊆ Y and Y is frequent in S [2]. An itemset X is closed in a data set S if there exists no proper super-itemset Y that has the same support count as X in S; X is a closed frequent itemset in S if X is both closed and frequent [2]. The precise definition of a closed itemset rests on two functions. Given

f(T) = {i ∈ I | ∀ t ∈ T, i ∈ t},

which returns the itemset common to all transactions in the set T, and

g(I) = {t ∈ T | ∀ i ∈ I, i ∈ t},

which returns the set of transactions supporting a given itemset I (its tid-list), the composite function f∘g is called the Galois operator or closure operator.

III. CLASSIFICATION OF ASSOCIATION RULE MINING ALGORITHMS

A. Classifying Frequent Itemset Mining Algorithms

Frequent itemset mining algorithms can be classified into two classes, candidate generation and pattern growth, with further division based upon the underlying data structures and the dataset organization.

1) Candidate Generation: Candidate generation algorithms identify candidate itemsets before validating them against the incorporated constraints, where the generation of candidates is based upon previously identified valid itemsets.

2) Pattern Growth: In contrast to the more prolific candidate generation techniques, pattern growth algorithms eliminate the need for candidate generation through the creation of complex hyper-structures (the data storage structures used in pattern growth algorithms). Such a structure is composed of two linked parts, a pattern frame and an item list, which together provide a concise representation of the relevant information contained in the database D.

3) Based on Itemset Data Structure: As itemsets are generated, different data structures can be used to keep track of them. The most common approach is a hash tree; alternatively, a trie or a lattice may be used.

4) Based on Dataset Organization: A set of transactions can be stored in TID-itemset format (that is, {TID : itemset}), where TID is a transaction identifier and itemset is the set of items bought in transaction TID; this is known as the horizontal data format. Alternatively, the data can be presented in item-TID-set format (that is, {item : TID-set}), where item is an item name and TID-set is the set of identifiers of the transactions containing the item; this is known as the vertical data format.

Table I summarizes and provides a means to briefly compare the various frequent itemset mining algorithms based on the maximum number of scans, the data structures proposed, and their contribution.
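The two dataset organizations are easy to contrast in code. The following sketch (with invented transaction IDs) converts a horizontal {TID : itemset} layout into the vertical {item : TID-set} layout; in the vertical format, the support count of an itemset is simply the size of the intersection of its items' TID-sets:

```python
# Horizontal data format: one record per transaction
horizontal = {
    'T1': {'bread', 'milk'},
    'T2': {'bread', 'butter'},
    'T3': {'milk'},
}

# Vertical data format: one TID-set per item
vertical = {}
for tid, itemset in horizontal.items():
    for item in itemset:
        vertical.setdefault(item, set()).add(tid)

# Support count of {bread, milk} = |TIDs(bread) ∩ TIDs(milk)|
support_count = len(vertical['bread'] & vertical['milk'])
print(vertical)
print(support_count)  # 1: only T1 contains both bread and milk
```

This intersection-based counting is what vertical-format algorithms such as Eclat exploit to avoid repeated scans of the whole dataset.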
Table I: Comparison of Frequent Itemset Mining Algorithms

Algorithm | Scans | Data Structure | Type | Dataset | Contribution
AIS [1] | M+1 | Not specified | Candidate generation | Horizontal | First algorithm for association rule mining
Apriori [3] | M+1 | Lk-1: hash table; Ck: hash tree | Candidate generation | Horizontal | Direct count and prune
Apriori-Tid [4] | 1 | Lk-1: hash table; Ck: array indexed by TID | Candidate generation | Horizontal | TID lists
DIC [5] | Depends on interval size | Trie (sequential structure) | Candidate generation | Horizontal | Partitioning & checkpoints
Eclat [6] | 2 | Lattice | Candidate generation | Vertical | Prefix-based equivalence classes
FP-Growth [7] | 2 | FP-tree (trie-based structure) | Pattern growth | Horizontal | Mining frequent patterns without candidate generation
FPL [8] | 2 | Frequent pattern list (FPL) | Pattern growth | Horizontal | Frequent pattern list and its operations

B. Classifying Closed Frequent Itemset Mining Algorithms

Closed frequent itemset mining algorithms can be categorized as follows.

1) Type of Search Strategy: The closed itemset mining algorithms fall into two classes according to their search strategy.
Breadth-first-search strategy: in the breadth-first (or level-wise) strategy, all candidate k-itemsets are generated as set unions of two frequent (k-1)-itemsets.
Depth-first-search strategy: given a frequent k-itemset X, candidate (k+1)-itemsets are generated by adding an item i ∉ X to X; if {X ∪ i} is also found to be frequent, the process is repeated recursively on {X ∪ i}.

2) Type of Generator Selection: All the closed itemsets can be identified by calculating the closure of each generator.
Generator: an itemset p is a generator of a closed itemset y if p is one of the (possibly several) itemsets whose Galois closure is y, i.e. h(p) = y. Intuitively, the closure operator defines a set of equivalence classes over the lattice of frequent itemsets: two itemsets belong to the same equivalence class if and only if they have the same closure, that is, their support is the same and is given by the same set of transactions. The relationship between equivalence classes and closed itemsets follows directly: the maximal itemsets of the equivalence classes are the closed itemsets, so mining all these maximal elements means mining all closed itemsets. Two generator-selection strategies are used:
Minimum Elements Method: some algorithms choose the minimal elements (or key patterns) of each equivalence class as closure generators. Key patterns form a lattice, and this lattice can be traversed with a simple Apriori-like algorithm.
Closure Climbing: once a generator is found, its closure is calculated, and new generators are created from the closed itemset just discovered. The generators are then not necessarily minimal elements, and the computation of their closures is faster.

3) Type of Closure Calculation: the closure of the generators can be calculated either off-line or on-line. In the off-line case, the complete set of generators is retrieved first and their closures are calculated afterwards; in the on-line case, each time a new generator is mined, its closure is immediately computed.

Table II summarizes and provides a means to briefly compare the various closed itemset mining algorithms based on their search strategy, data format, generator selection, and type of closure calculation.
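The generator and closure definitions above can be made concrete with a toy database (transactions invented for illustration). Here g returns the tid-list of an itemset, f returns the items common to a set of transactions, and the closure is the composite f∘g:

```python
D = {1: {'a', 'b', 'c'}, 2: {'a', 'b'}, 3: {'a', 'c'}}

def g(itemset):
    """Tid-list: the set of transactions supporting the itemset."""
    return {tid for tid, t in D.items() if itemset <= t}

def f(tids):
    """The itemset common to all transactions in tids."""
    return set.intersection(*(D[tid] for tid in tids)) if tids else set()

def closure(itemset):
    """Galois closure operator h = f o g."""
    return f(g(itemset))

# {b} is a generator of the closed itemset {a, b}: every transaction
# containing b also contains a, so both share the tid-list {1, 2}.
print(sorted(closure({'b'})))       # ['a', 'b']
print(sorted(closure({'a', 'b'})))  # ['a', 'b']: closed, equal to its own closure
```

{b} and {a, b} thus belong to the same equivalence class, whose maximal element {a, b} is the closed itemset.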
Table II: Comparison of Closed Frequent Itemset Mining Algorithms

Algorithm | Search Strategy | Data Format | Generator Selection | Closure Calculation
A-Close [9] | Breadth-first | Horizontal | Minimum elements | Off-line
CHARM [10] | Depth-first | Vertical | Closure climbing | On-line
CLOSET [11] | Depth-first | Horizontal | Closure climbing | On-line
CLOSET+ [12] | Depth-first | Horizontal | Closure climbing | On-line
FP-Close [13] | Depth-first | Horizontal | Closure climbing | On-line
DCI-Closed [14] | Depth-first | Vertical | Closure climbing | On-line

C. Classifying Maximal Frequent Itemset Mining Algorithms

As with the frequent itemset mining algorithms of Section III.A, maximal frequent itemset mining algorithms can be classified into the candidate generation and pattern growth classes, with further division based on the underlying itemset data structure (hash tree, trie, or lattice) and the dataset organization (horizontal or vertical).
Table III: Comparison of Maximal Frequent Itemset Mining Algorithms

Algorithm | Data Structure | Contribution | Dataset
Pincer-Search [15] | Maximum frequent candidate set | Two-way search based on the maximum frequent candidate set | Horizontal
MaxMiner [16] | Maximum frequent candidate set | Upward-closure principle | Horizontal
Counting mining algorithm based on matrix [17] | Associated matrix | Counting operation and elimination method to remove rows and columns | Vertical bitmap representation
MAFIA [18] | Maximum frequent candidate set | Effective pruning mechanisms | Vertical bitmap representation
FPmax [19] | FP-tree | MFI-tree to keep track of all maximal frequent itemsets | Horizontal
MMFI [20] | FP-tree, matrix, array | Matrix to trim the conditional FP-tree and subsets of MFIs; array to reduce the frequency of FP-tree traversals | Horizontal
FPL [21] | Frequent pattern list | Bit counting, bit-string trimming, and migration | Horizontal

Table III summarizes and provides a means to briefly compare the various maximal frequent itemset mining algorithms based on the data structures proposed and their contribution.
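The maximal and closed representations compared in Tables II and III can be illustrated with a naive brute-force filter over all frequent itemsets. This sketch is for illustration only and does not correspond to any of the optimized algorithms surveyed:

```python
from itertools import combinations

D = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'a'}]
min_count = 2

def count(itemset):
    return sum(1 for t in D if itemset <= t)

items = sorted({i for t in D for i in t})
frequent = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if count(frozenset(c)) >= min_count]

# Closed: no proper superset has the same support count
closed = [x for x in frequent
          if not any(x < y and count(y) == count(x) for y in frequent)]
# Maximal: no proper superset is frequent at all
maximal = [x for x in frequent if not any(x < y for y in frequent)]

print(sorted(map(sorted, closed)))   # [['a'], ['a', 'b'], ['a', 'c']]
print(sorted(map(sorted, maximal)))  # [['a', 'b'], ['a', 'c']]
```

Note that every maximal frequent itemset is closed, but not the other way around: {a} here is closed (no superset matches its support count of 4) yet not maximal, since its superset {a, b} is still frequent.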
IV. CONCLUSION

In this paper, we have discussed the two approaches to association rule mining and a classification of association rule mining algorithms. The paper gives a brief survey, classification, and comparison of frequent, closed, and maximal itemset mining algorithms.

REFERENCES
[1] R. Agrawal, T. Imielinski, and A. N. Swami, "Mining association rules between sets of items in large databases," Proc. 1993 ACM SIGMOD Int'l Conf. on Management of Data, pp. 207-216, 1993.
[2] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann, March 2006.
[3] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," Proc. 20th Int'l Conf. on Very Large Data Bases, pp. 487-499, 1994.
[4] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," Proc. 20th Int'l Conf. on Very Large Data Bases, pp. 487-499, 1994.
[5] S. Brin et al., "Dynamic itemset counting and implication rules for market basket analysis," Proc. ACM SIGMOD Int'l Conf. on Management of Data, pp. 255-264, 1997.
[6] M. J. Zaki, "Scalable algorithms for association mining," IEEE Trans. on Knowledge and Data Engineering, pp. 372-390, 2000.
[7] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," Proc. SIGMOD, 2000.
[8] F. C. Tseng and C. C. Hsu, "Generating frequent patterns with the frequent pattern list," Proc. 5th Pacific-Asia Conf. on Knowledge Discovery and Data Mining, pp. 376-386, April 2001.
[9] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, "Discovering frequent closed itemsets for association rules," Proc. ICDT '99, pp. 398-416, 1999.
[10] M. J. Zaki and C. Hsiao, "CHARM: An efficient algorithm for closed itemset mining," Proc. SIAM Int'l Conf. on Data Mining, pp. 457-473, 2002.
[11] J. Pei, J. Han, and R. Mao, "CLOSET: An efficient algorithm for mining frequent closed itemsets," ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 21-30, 2000.
[12] J. Wang, J. Han, and J. Pei, "CLOSET+: Searching for the best strategies for mining frequent closed itemsets," Proc. Int'l Conf. on Knowledge Discovery and Data Mining, pp. 236-245, 2003.
[13] G. Grahne and J. Zhu, "Efficiently using prefix-trees in mining frequent itemsets," IEEE ICDM Workshop on Frequent Itemset Mining Implementations, 2003.
[14] C. Lucchese, S. Orlando, and R. Perego, "DCI_CLOSED: A fast and memory efficient algorithm to mine frequent closed itemsets," IEEE Trans. on Knowledge and Data Engineering, vol. 18, no. 1, pp. 21-35, 2006.
[15] D. Lin and Z. M. Kedem, "Pincer-Search: A new algorithm for discovering the maximum frequent itemset," Proc. EDBT, pp. 105-110, 1998.
[16] R. J. Bayardo, "Efficiently mining long patterns from databases," Proc. ACM SIGMOD Int'l Conf. on Management of Data, pp. 85-93, 1998.
[17] H. Jin, "A counting mining algorithm of maximum frequent itemset based on matrix," 7th Int'l Conf. on Fuzzy Systems and Knowledge Discovery (FSKD), vol. 3, pp. 1418-1422, 2010.
[18] D. Burdick, M. Calimlim, J. Flannick, J. Gehrke, and T. Yiu, "MAFIA: A maximal frequent itemset algorithm," IEEE Trans. on Knowledge and Data Engineering, pp. 1490-1504, Nov. 2005.
[19] G. Grahne and J. F. Zhu, "High performance mining of maximal frequent itemsets," Proc. SIAM Int'l Conf. on High Performance Data Mining, pp. 135-143, 2003.
[20] H. Peng and Y. Shu, "A new FP-tree-based algorithm MMFI for mining the maximal frequent itemsets," IEEE Int'l Conf. on Computer Science and Automation Engineering (CSAE), vol. 2, pp. 61-65, 2012.
[21] J. Qian and F. Ye, "Mining maximal frequent itemsets with frequent pattern list," 4th Int'l Conf. on Fuzzy Systems and Knowledge Discovery (FSKD), vol. 1, pp. 628-632, 2007.