Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
International Journal of Electrical & Computer Science IJECS-IJENS Vol:14 No:03 31 Apriori Based: Mining Infrequent and Non-Present Item Sets from Transactional Data Bases Sujatha Kamepalli1, Raja Sekhara Rao Kurra2 and Sundara Krishna.Y.K.3 1 Research Scholar, CSE Department, Krishna University Machilipatnam, Andhra Pradesh, India sujatha101012@gmail.com 2 Director, Sri Prakash College of Engineering, Thuni , Andhra Pradesh, India. krr_it@yahoo.co.in 3 Professor, CSE Department, Krishna University Machilipatnam, Andhra Pradesh, India. yksk2010@gmail.com Abstract-- Item set mining has been an active area of research due to its successful application in various data mining scenarios including finding association rules. Though most of the past work has been on finding frequent item sets, infrequent item set mining has demonstrated its utility in web mining, bioinformatics and other fields. In this paper, we propose a new method based on Apriori algorithm to find infrequent item sets and non-present item sets. Finally, we analyze the behavior of our proposed method by considering a transactional data base. Index Term-- Data mining, frequent item sets, infrequent item sets, non-present item sets, Apriori. 1. INTRODUCTION Frequent pattern mining was first proposed by Agrawal et al. (1993) [2] for market basket analysis in the form of association rule mining. It analyses customer buying habits by finding associations between the different items that customers place in their “shopping baskets”. For instance, if customers are buying milk, how likely are they going to also buy cereal (and what kind of cereal) on the same trip to the supermarket? Such information can lead to increased sales by helping retailers do selective marketing and arrange their shelf space [3]. Patterns that are rarely found in database are often considered to be uninteresting and are eliminated using the support measure. Such patterns are known as infrequent patterns [4].An infrequent pattern is an item set or a rule whose support is less than the minimum support threshold. Mining frequent item sets has found extensive utilization in various data mining applications including consumer market-basket analysis [13], inference of patterns from web page access logs [17], and iceberg-cube computation [14], Extensive research has, therefore, been conducted in finding efficient algorithms for frequent item set mining, especially in finding association rules [15], However, significantly less attention has been paid to mining of infrequent item sets, even though it has got important usage in (i) mining of negative association rules from infrequent item sets [18], (ii) statistical disclosure risk assessment where rare patterns in anonymous census data can lead to statistical disclosure [16], (iii) fraud detection where rare patterns in financial or tax data may suggest unusual activity associated with fraudulent behavior [16], and (iv) bioinformatics where rare patterns in micro array data may suggest genetic disorders [16]. Although a vast majority of infrequent patterns are uninteresting, some of them might be useful to the analysis, particularly those that correspond to negative correlations in data. For example, the sale of DVDs and VCRs together is low because any customer who buys a DVD will most likely not buy a VCR and vice versa. Such negative correlated patterns are useful to help identify competing items, which are items that can be substituted for one another. Examples of competing items include tea versus coffee, butter versus margarine, regular versus diet soda, and desktop versus laptop computers. Some infrequent patterns may also suggest the occurrence of interesting rare events or exceptional situations in the data. For example, if {Fire = Yes} is frequent but {Fire = Yes, Alarm = On} is infrequent, then the latter is an interesting infrequent pattern because it may indicate faulty alarm systems. To detect such unusual situations, the expected support of a pattern must be determined, so that, if a pattern turns out to have a considerably lower support than expected, it is declared as an interesting infrequent pattern [5]. The mining task that focuses on discovering frequent patterns from the databases is called frequent pattern mining. In frequent pattern mining, only frequent patterns are returned while infrequent patterns are simply discarded without further consideration. This is because the most valuable information is carried by the frequent patterns and the infrequent patterns cannot adequately reflect the typical characteristics from the data because of their rare occurrence [8]. However, since the late 1990s, more and more researchers have realized the importance of infrequent patterns with the increasing demands from applications of anomaly detection, especially in medicine [6], genetics [10], molecular biology [7] and network security [11]. In these 1412503-7474-IJECS-IJENS © June 2014 IJENS IJENS International Journal of Engineering & Computer Science IJECS-IJENS Vol:14 No:03 areas, infrequent patterns are considered significant due to the huge influence they may have. In the study of finding a better treatment approach for a special disease, researchers would like spend more time on studying an abnormal case rather than reading the millions of records of healthy people [9]. In this scenario, more effort has been put into the development of infrequent pattern mining and non-present patterns [12]. 2. FREQUENT, INFREQUENT AND NON -PRESENT ITEM SETS Frequent, Infrequent and non-present item set mining are special cases of pattern mining. The problem of frequent item set mining consists at looking only for patterns with support at least equal to a fixed threshold. However, in infrequent item set mining we are exclusively looking for noncommon patterns: patterns that are present in the data transactions with a frequency smaller than a fixed threshold. In the case of non-present pattern mining, we are only interested by patterns that are not present [1]. Mining infrequent patterns is a challenging endeavor because there are enormous numbers of such patterns that can be derived from a given data set. More specifically, the key issues in mining infrequent patterns are: (1) How to identify interesting infrequent patterns, and (2) How to efficiently discover them in large data sets. 3. RELATED W ORK The problem of mining frequent item sets was first introduced by Agrawal et al. [2], who proposed the Apriori algorithm. Mehdi Adda at all[1] explains about the frame work for rare and non-present pattern mining. Jiawei Han at all[3] explains about frequent pattern mining and its future directions. Ashish Gupta at all proposed algorithms for Minimally Infrequent Itemset Mining using Pattern-Growth Paradigm and Residual Trees. Alex Tze Hiang Sim at all explains about Mining Infrequent and Interesting Rules from Transaction Records. Laszlo Szathmary at all describes about Generating Rare Association Rules Using the Minimal Rare Item sets Family. The research done on Infrequent and nonpresent item set mining is not restricted to the above said ,but the above said papers does not explains about only infrequent and non-present item set mining but also they combined these with frequent pattern mining. 4. P ROPOSED METHOD In the general Apriori algorithm the frequent item sets are found by generating the candidate item sets based on minimum support threshold. In that process the frequent item sets are considered while discarding the infrequent item sets. Since mining infrequent item sets is also very important in many applications such as medical field, fraud detection and anomaly detection in our proposed algorithm only infrequent item sets and non-present item sets are found. For finding the infrequent item sets the support is to be considered that is minimum negative count is considered here 32 and for non-present item sets it is not necessary to consider the minimum support threshold. Table I Infrequent 1-Item sets Infrequent 1-itemsets Data base ITEM INFRE QUENC Y TID TID ITEM SET 1 3 20,40,60 10 1,3,4,7 2 2 10,50 20 2,3,5,6,8 3 1 40 30 1,2,3,5,7 4 5 20,30,40,50,60 40 2,5,8 5 3 10,50,60 50 1,3,7,8 6 5 10,30,40,50,60 60 2,3,7 7 2 20,40 8 3 10,30,60 If the support frame work is considered let us take the minimum support threshold that is minimum negative count as 3, in the above example items 1, 4, 5, 6, 8 are considered as infrequent. In the general Apriori algorithm since the frequent patterns are found and infrequent patterns are discarded, a pruning strategy is applied based on Apriori principle i.e. if an item set is found to be infrequent then all its super sets are also infrequent. In the proposed algorithm also in finding infrequent and non-present item sets the Apriori property is applicable that is if we find an item set as infrequent then all its supersets are considered as infrequent but here these super sets are not to be pruned as in Apriori and they are considered in to the solution as infrequent k- item sets. But here we cannot prune the frequent item sets since if an item set is frequent then its super sets may or may not be frequent. So it is necessary to go through all possible 1-item, 2-item …sets to find all possible infrequent item sets. The non-present item sets are found directly from infrequent 1-item sets. Suppose for example if a 2-item set {1, 6} is considered, from the above infrequent 1-item set table it is clear that the item 1 is not present in TID 20, 40, 60 and item 6 is not present in TID10,30,40,50,60 , by taking the union of TIDs the 2-item set {1,6} is not present in all the TIDs 10,20,30,40,50,60. So the set {1, 6} is said as nonpresent item set. By following the same procedure all the infrequent and non-present item sets are found. 1412503-7474-IJECS-IJENS © June 2014 IJENS IJENS International Journal of Engineering & Computer Science IJECS-IJENS Vol:14 No:03 Table II Infrequent 2-Item sets 33 Table V Infrequent 2-Item sets Infrequent 2-item sets corresponding to item 1 ITEM INFREQUENCY TID (1,2) 5 10,20,40,50,60 (1,3) 3 20,40,60 (1,4) 5 20,30,40,50,60 (1,5) 5 10,20,40,50,60 (1,6) 6 10,20,30,40,50,60 (1,7) 3 20,40,60 (1,8) 5 10,20,30,40,60 In the above table it is clear that the item set (1, 6) is non-present item set and the remaining are infrequent item sets. These infrequent item sets are found from the infrequent 1-item set table. Table III Infrequent 2-Item sets ITEM SET INFREQUENCY TIDS (4,5) (4,6) 6 6 10,20,30,40,50,60 10,20,30,40,50,60 (4,7) 5 20,30,40,50,60 (4,8) 6 10,20,3040,50,60 Table VI Infrequent 2-Item sets ITEM SET INFREQUENCY TIDS (5,6) (5,7) 5 5 10,30,40,50,60 10,20,40,50,60 (5,8) 4 10,30,,50,60 ITEM SET INFREQUENCY TIDS (2,3) 3 10,40,50 (2,4) 6 10,20,30,40,50,60 ITEM SET INFREQUENCY TIDS (2,5) 3 10,50,60 (6,7) (6,8) 6 5 10,20,30,40,50,60 10,30,40,50,60 (2,6) 5 10,30,40,50,60 (2,7) 4 10,20,40,50 (2,8) 4 10,30,50,60 Table IV Infrequent 2-Item sets ITEM SET INFREQUENCY TIDS (3,4) (3,5) 5 4 20,30,40,50,60 10,40,50,60 (3,6) 5 10,30,40,50,60 (3,7) 2 30,40 (3,8) 4 10,30,40,60 Table VII Infrequent 2-Item sets Table VIII Infrequent 2-Item sets ITEM SET INFREQUENCY TIDS (7,8) 5 10,20,30,40,60 The remaining k-item sets can be calculated in the same procedure. 4.1. Algorithm: Input: A Transactional Data base, minimum negative-count. Output: Infrequent k-Item sets and Non-present Item sets. Method: 1. start 2. scan the given transactional data base 3. for(k=1 ton) do //n is the max no.of items 4. for (each k-item set i) do 5. for (each transaction T) do 6. if(i does not belongs to T) 7. return TID; 8. end for1 9. end for2 10. end for3 1412503-7474-IJECS-IJENS © June 2014 IJENS IJENS International Journal of Engineering & Computer Science IJECS-IJENS Vol:14 No:03 11. for(each k-item set i) 12. count number of TIDs //count gives infrequency 13. if(count>= minimum negative count) 14. return infrequent k-item set; 15. if(count=total number of transactions) 16. return non-present item set; 17. end for 18. end The above described algorithm gives the pseudo code for the process that I have described. The practical code is not yet implemented, it is under process. 5. CONCLUSIONS Since the late 1990s, more and more researchers have realized the importance of infrequent patterns with the increasing demands from applications of anomaly detection, especially in medicine [6], genetics [10], molecular biology [7] and network security [11]. In these areas, infrequent patterns are considered significant due to the huge influence they may have. In the study of finding a better treatment approach for a special disease, researchers would like spend more time on studying an abnormal case rather than reading the millions of records of healthy people [9]. In this scenario, more effort has been put into the development of infrequent pattern mining and nonpresent patterns [12]. The proposed algorithm is used to find the infrequent patterns and non-present patterns within one data base scan that is by finding the infrequent 1-item sets from that we can find all the remaining sets of infrequent item sets and nonpresent item sets. 6. FUTURE W ORK The proposed algorithm works efficiently to find the infrequent item sets and non-present item sets. It finds all the infrequent item sets and non-present item sets within one data base scan. But the proposed method not considers any pruning strategy. It is better to implement any pruning strategy to improve the complexity of the proposed method. [1] [2] [3] [4] [5] REFERENCES A. Sehgal, A. Presente, L. Dudus, and J.F. Engelhardt. Isolation of differentially expressed DNAs during ferret tracheal development: Application of differential display PCR.Experimental Lung Research, 22(4):419– 434, 1996. Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of the 1993ACM- SIGMODinternational conference on management of data (SIGMOD’93), Washington, DC, pp 207– 216. B. Saha, M. Lazarescu, and S Venkatesh. Infrequent item mining in multiple data streams. In Proceedings of IEEE International Conference on Data Mining (ICDM 2007), pages 569–574, Omaha, NE, October 2007. D. J. Haglin and A. M. Manning. On minimal Infrequent itemset mining. In DMIN, pages 141–147, 2007. F. Medici, M.I. Hawa, A. Giorgini, A. Panelo, C.M. Solfelix, R.D.G. Leslie, and P.Pozzilli. Antibodies to GAD65 and a tyrosine [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] 34 phosphatase-like molecule IA-2ic in Filipino Type I diabetic patients. Diabetes Care, 22(9):1458–1461, 1999. INFREQUENT PATTERNS, CHAPTER- 5 Introduction to data mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar Pearson Education, book. Introduction to data mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar Pearson Education, book. J. Srivastava and R. Cooley. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1:12–23, 2000. J. Yang and J. Logan. A data mining and survey study On diseases associated with paraesophageal hernia. In AMIA Annual Symposium Proceedings, pages 829– 833, 2006. Jiawei Han · Hong Cheng · Dong Xin · Xifeng Yan “Frequent pattern mining: current status and future Directions” Data Min Knowl Disc (2007) 15:55–86 DOI 10.1007/s10618-006-0059-1. K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cube.SIGMOD Rec., 28:359–370, 1999. Mehdi Adda, Lei Wu, Sharon White(2012), Yi Feng ” pattern detection with rare item-set mining” International Journal on Soft Computing, Artificial Intelligence and Applications (IJSCAI), Vol.1, No.1, August 2012. R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487– 499, 1994. R. Agrawal, T. Imieinski, and A. Swami. Mining association rules between sets of items in large databases. In P. Buneman and S. Jajodia, editors, Proceedings of the ACM SIGMOD International Conference on the Management of Data, pages 207– 216, Washington DC, 1993. ACM Press. W. Shi, F.K. Ngok, and D.R. Zusman. Cell density regulates cellular reversal frequency in Myxococcus xanthus. In Proceedings of the National Academy of Sciences of the United States of America, volume 93(9), pages 4142–4146,1996. X. Dong, Z. Niu, X. Shi, X. Zhang, and D. Zhu. Mining both positive and negative association rules from frequent and infrequent itemsets. In ADMA, pages 122–133,2007. X. Wu, C. Zhang, and S. Zhang. Efficient mining of both positive and negative association rules. ACM Transactions on Information Systems, 22(3):381–405, 2004. K. Sujatha is pursuing her Ph.D. in Krishna University,Machilipatnam, A.P. She is interested doing research in data mining. Present she is carrying her research in Infrequent Pattern Mining. She has seven international journal publications with high impact factors and indexing. She has two national journal publications in peer refereed journals. She presented papers in two national conferences. She attended for several workshops and faculty development programs. Recently she attended for two AICTE sponsored 2-week workshops. She is member in Indian Association of Engineers (IAE).She has a total of 10 years experience in teaching. She is working as an Associate Professor in CSE Department, Malineni Lakshmaiah Engineering College, Singaraya konda, Prakasam District. A.P. Prof. K. Rajasekhara Rao is director of Sri Prakash College of Engineering, Thuni, East Godavari District, Andhra Pradesh. He hold several key positions in K.L.University, as Dean (Administration) & Principal, K L College of Engineering (Autonomous). Having more than 26 years of teaching and research experience, Prof. Rao is actively engaged in the research related to Embedded Systems, Software Engineering and Knowledge Management. He had obtained Ph.D in Computer Science & Engineering from Acharya Nagarjuna University (ANU), Guntur, Andhra Pradesh and produced 58 publications in various International/National Journals and Conferences. Prof.KRR was awarded with “Patron Award” for his outstanding contribution, by India’s prestigious professional society Computer Society of India (CSI) for the year 2011 in Ahemadabad. He has been adjudged as best teacher and has been honored with “Best Teacher Award”, seven times. Dr. Rajasekhar is a Fellow of IETE, Life Member’s of IE, ISTE, ISCA & CSI (Computer 1412503-7474-IJECS-IJENS © June 2014 IJENS IJENS International Journal of Engineering & Computer Science IJECS-IJENS Vol:14 No:03 35 Society of India). Dr.Rajasekhar is nominated as sectional committee member for Engineering Sciences of 100 th Annual Convention of Indian Science Congress Association. He has been the past Chairman of the Koneru Chapter of CSI. Dr. Y. K. Sundara Krishna qualified in Ph.D. in Computer Science and Engineering from Osmania University, Hyderabad. Now, he is working as Professor in the Department of Computer Science, Krishna University, Machilipatnam and presently holding several key positions in Krishna University. His research interests are Mobile Computing, Service Oriented Architecture and Geographical Information Systems and having practical work experience in the areas of Computing Systems including Developing Simulators for Distributed Dynamic Cellular Computing Systems, Applications of Embedded and Win32 clients, Maintenance of Multi-user System Software. He has about 24 international publications and about 2 national publications. Has attended about 5 international conferences and 25 national conferences. 1412503-7474-IJECS-IJENS © June 2014 IJENS IJENS