Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
ISSN : 2229-6093 S.P.Deshpande,V.M.Thakare, Int. J. Comp. Tech. Appl., Vol 2 (1), 165-170 Intelligence Ingrained Data Mining Engine Architecture S. P. Deshpande#1, Dr. V. M. Thakare *2 # P.G. Department of Computer Science and Technology, D.C.P.E., Amravati (MS), India. 1 Shrinivasdeshpande68@gmail.com * Department of Computer Science, S.G.B. Amravati University, Amravati (MS), India. 2 vilthakare@yahoo.co.in Abstract: Information plays vital role in every field. To take complete advantage of data; it requires a tool for automatic summarization of data, extraction of the essence of information stored, and the discovery of patterns in raw data. „Data mining‟ is a tool to accomplish the above mentioned needs. Several attempts have been made to design and develop the generic data mining system but no system found completely generic. The domain experts play important role in the different stages of data mining. The decisions at different stages are influenced by the factors like domain and data details, aim of the data mining, and the context parameters. This paper present the architecture of a Data mining engine that work for specific problem domain. The engine is developed with some prerequisite domain knowledge therefore a non expert can use the system effectively for data mining. Key Words: Data Mining, Engine Architecture, problem specific data mining engine. I. Introduction Data mining is the process of extraction of hidden predictive information from large databases; it is a powerful technology with great potential to help organizations focus on the most important information in their data warehouses [1,2,3,4]. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining is one of the tasks in the process of knowledge discovery from the data. The data stored in the database is used to discover the patterns of data, which then interpreted by applying the domain knowledge. The data mining applications can be generic or domain specific. The generic application is required to be an intelligent system that by its own can takes certain decisions like: selection of data, selection of data mining method, presentation and interpretation of the result. Some generic data mining applications cannot take its own these decisions but guide users for selection of data, selection of data mining method and for the interpretation of the results. The domain specific applications are focused to use the domain specific data and data mining algorithm that targeted for specific objective. The data generating sources generate different type of data. Data can be from a simple text, numbers to more complex audio-video data. To mine the patterns and thus knowledge from this data, different types of data mining algorithms are used. The collection and selection of context specific data and applying the data mining algorithm to generate the context specific knowledge is thus a skillful job. In many domains specific data mining applications the domain experts plays vital role to mine useful knowledge. II. The Problem Definition Most of the previous studies on data mining applications in various fields like: Language Science [11,12,11], E-learning[34], Medical science and Health care [12,13,15,18], Education [14,33], Banking [29], Market research [26], E-commerce [31], Engineering [15], Sports Science [24,28,36], Detection of terrorist activities [17,27,30,35] Intrusion detection [21], Quality Control [39], Software Engineering [19,20], Library Science [32], Bio-informatics [22,23] etc. uses the variety of data types range from text to images and stores in variety of databases and data structures. The different methods of data mining are used to extract the patterns and thus the knowledge from this variety databases. Selection of data and methods for data mining is an important task in this process and needs the knowledge of the domain. Several attempts have been made to design and develop the generic data mining system but no system found completely generic 165 ISSN : 2229-6093 S.P.Deshpande,V.M.Thakare, Int. J. Comp. Tech. Appl., Vol 2 (1), 165-170 [6,7,8,9,10]. Thus, for every domain the domain expert‟s assistant is mandatory. The domain experts shall be guided by the system to effectively apply their knowledge for the use of data mining systems to generate required knowledge [5]. The domain experts are required to select specific data for data mining, cleaning and transformation of data, extracting patterns for knowledge generation and finally interpretation of the patterns for knowledge generation. Most of the domain specific data mining applications show accuracy above 90%. The generic data mining applications are having the limitations. From the study of various data mining applications it is observed that, no application called generic application is 100 % generic. The intelligent interfaces and intelligent agents up to some extent make the application generic but have limitations. The domain experts play important role in the different stages of data mining. The decisions at different stages are influenced by the factors like domain and data details, aim of the data mining, and the context parameters [37]. The domain specific applications are aimed to extract specific knowledge. The domain experts by considering the user‟s interest and other context parameters guide the system to select the data, preprocess the data, apply the data mining algorithm and to discover the knowledge. The results yields from the domain specific applications are more accurate and useful. Therefore it is concluded that the domain specific applications are more specific for data mining. From above study it seems very difficult to design and develop a data mining system, which can work dynamically for any domain. Designing a domain specific system is also a challenging task. Data mining engine that work for specific problem domain and can be used by the user do not have much knowledge of the domain is thus the central idea of this study. III. The Proposed System Architecture In this research work a new architectural model of a data-mining engine is proposed. As the problem domain is known and well defined, the environment in which the system run is also well defined. The user‟s requirements are clearly defined and target processing on the data is also known, in this situation system design become easy. The essential domain knowledge is maintained in the knowledgebase. The system has four major components: The user Interface, The data store, Data mining engine and Knowledgebase. The problem related data selected from the data repository is preprocessed and then copied in to the data mart. The user interface for selection of specific data mining algorithm enables user in the selection of specific algorithm. The domain knowledge required for actual mining purpose shall be referenced from the domain knowledge component. The discovered knowledge shall be presented to the users through the user interfaces. The generated knowledge is also useful for the future DM applications and the knowledge discovery process therefore it shall be stored in the knowledge base also. The Proposed Data Mining Engine Architecture is as given below: IV. Implementation of Proposed Architecture The proposed architecture is implemented for generating players profile from sports related text. In the pilot application „Profile Generator‟ the system contains four main components: The user interface, Data mining engine, data store and knowledge base. The data store is simple folder contain text of the interest. This text is preprocessed; the interfaces are available to accomplish the preprocessing work. The 166 ISSN : 2229-6093 S.P.Deshpande,V.M.Thakare, Int. J. Comp. Tech. Appl., Vol 2 (1), 165-170 preprocessing includes conversion of text in lower case letters, removal of punctuation marks, removal of commonly used words using stop-list of words, and tokenization of the text. The knowledge base contain prerequisite knowledge like: Name of countries, Name of the sports/games, commonly used words (stop words) etc. Rules like: Word presided with, word followed by, if_word_is etc. Ex. IF_PRESIDED_WIT H Mr or Ms or Miss or Mrs or Master or Shri or Smt or Ku. Any word IF_TEXT_IS <Name> IF_TEXT_IS THEN_VALUE Not Name_of_Country Not Commonly_used_W ord Not Numeric string Not date string Not Name_of_Country Not Commonly_used_W ord Not Numeric string Not date string Name May be Name IF_FOLLOWED_BY THEN_VALUE Of <Country> Country He/She representing The rule base for other attributes like: Date of Birth, Country he/she representing, Playing hobby, Favorite food, Favorite Movie, Favorite clothing and Sports Record etc. is created and made available with the system. The knowledge base used with the data mining engine is lace with domain and problem specific knowledge. The training phase also fulfills the knowledge base with relevant knowledge and therefore a user can use data mining tool without domain expertise. This knowledge base is used for the extraction of values for the above-mentioned variables. The concept of use of knowledge base in the data mining application is new. The domain experts are required to create the initial knowledge base. The derived values for different attributes are also stored in the knowledge base. This table is used to train the system and to use in the next pass of the system. The Data mining engine is design to apply different data mining algorithm for the extraction of values for the specific attributes. In the training phase of the system the association rule mining algorithm is used to generate knowledge like associated keywords in any game This algorithm with 5% support and 50% confidence generate the rules those specifies the set of key words associated with the specific game. Ex. Game Cricket associated words: Runs, Wicket, Catch, Bold, Wicketkeeper, Batsman, Bowler, Fielder etc., Game Hockey associated words: Goal, Goalkeeper, Penalty corner, Free hit, Hockey stick etc. In the process of extraction of values of the profile attributes the N-Gram analysis along with association rules is used. The 1tokan, 2-tokans and 3-tokans are used interchangeably for the extraction of different attribute values. The interfaces available in this system provide opportunity to the user to interact at each level in the data mining process. They are user friendly and designed to use effectively by novice user. V. Result and Discussion The results obtained from the application developed using proposed architecture are more convincing. In the pilot application „Profile Generator‟ the values for few attributes of a player profile are derived. The attributes of which values are derived: Name, Date of Birth, Game he/she plays, Country he/she representing, other Sports he/she like, favorite food, favorite movie, favorite clothing. It has derived values successfully for these attributes. The results obtained are as given below: Sr. Attribute Value retrieve No. Success Rate 1 Full Name of 86 % Player 2 Date of Birth 92 % 3 Game He/She Plays 4 6 Country He/She 93% Representing Other Sports 84% He/She Likes Favorite Food 82% 7 Favorite Movie 84% 8 Favorite Clothing 79% 9 Sports Record 61% 5 99% 167 ISSN : 2229-6093 S.P.Deshpande,V.M.Thakare, Int. J. Comp. Tech. Appl., Vol 2 (1), 165-170 For the attributes for which success rate is above 90% the algorithm works good and very rare retrieves false value. For the attributes for which success rate is between 80% to 90 % the system prompts the user with few options and asks for selection of most correct value. In the extraction of sports record, the text document plays important role, the algorithm requires modification to extract sport record as it has different parameters in different games. This architecture of a data mining engine is using domain and problem relevant knowledge. This knowledge is partially loaded by domain experts and partially generated in the training phase of the algorithm. Due to the use of knowledge base this system can be used by novice users. VI. Conclusion The application developed using proposed architecture shown success rate 84.44 % in average therefore it can be comfortably used for data mining. The major advantage of this architecture is the presence of domain specific knowledge along with data. This facilitates the non domain-expert users in data mining. Existing data mining tools are required domain expertise and also training of using the tool. They are limited to extraction of patterns of data of features which need further interpretation by the domain experts to generate knowledge. Using this architecture system can generate the knowledge but developing knowledgebase is the main issue in this context. [5] [6] [7] [8] References [9] [1] [2] [3] [4] Introduction to Data Mining and Knowledge Discovery, Third Edition ISBN: 1-892095-02-5, Publication of Two Crows Corporation, 10500 Falls Road, Potomac, MD 20854 (U.S.A.), 1999. Daniel T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, ISBN 0-471-66657-2, John Wiley & Sons, Inc, 2005. Dunham, M. H., Sridhar S., Data Mining: Introductory and Advanced Topics, Pearson Education, New Delhi, ISBN: 817758-785-4, 1st Edition, 2006. Pete Chapman, Julian Clinton, Randy Kerber, Thomas Khabaza, Thomas Reinartz, Colin Shearer and Rüdiger Wirth, CRISP-DM 1.0 : Step-by-step [10] [11] data mining guide, NCR Systems Engineering Copenhagen (USA and Denmark), DaimlerChrysler AG (Germany), SPSS Inc. (USA) and OHRA Verzekeringenen Bank Group B.V (The Netherlands), 2000. Abraham Bernstein and Foster Provost, An Intelligent Assistant for the Knowledge Discovery Process, Working Paper of the Center for Digital Economy Research, New York University and also presented at the IJCAI 2001 Workshop on Wrappers for Performance Enhancement in Knowledge Discovery in Databases. H. Baazaoui Zghal, S. Faiz, and H. Ben Ghezala, A Framework for Data Mining Based Multi-Agent: An Application to Spatial Data, volume 5, ISSN 1307-6884, Proceedings of World Academy of Science, Engineering and Technology, April 2005. Ralf Rantzau and Holger Schwarz, A Multi-Tier Architecture for HighPerformance Data Mining, A Technical Project Report of ESPRIT project, The consortium of CRITIKAL project, Attar Software Ltd. (UK), Gehe AG (Denmark); Lloyds TSB Group (UK), Parallel Applications Centre, University of Southampton (UK), BWI, University of Stuttgart (Denmark), IPVR, University of Stuttgart (Denmark). J. A., Botia, M. y Garijo, J. R Velasco,., A. F Skarmeta., A Generic Data mining System basic design and implementation guidelines, A Technical Project Report of CYCYT project of Spanish Government. http://citeseerx.ist.psu.edu/viewdoc/summ ary?doi=10.1.1.53.1935 Marcos M. Campos, Peter J. Stengard, Boriana L. Milenova, Data-Centric Automated Data Mining, , Publication of Oracle Corporation, Web Site. www.oracle.com/technology/products/bi/o dm/pdf/automated_data_mining_paper_12 05.pdf J. Sirgo, A. Lopez, R. Janez, R. Blanco, N. Abajo, M. Tarrio, R. Perez, A Data Mining Engine based on Internet, Emerging Technologies and Factory Automation ETFA 2003, Proceedings of IEEE Conference, 16-19 Sept. 2003.Web Sitewww.citeseerx.ist.psu.edu/viewdoc/su mmary?doi=10.1.1.11.8955 Identification of foreign-accented French using data mining techniques, By Bianca Vieru-Dimulescu, Philippe Boula de 168 ISSN : 2229-6093 S.P.Deshpande,V.M.Thakare, Int. J. Comp. Tech. Appl., Vol 2 (1), 165-170 [12] [13] [14] [15] [18] [19] [21] Mareüil and Martine Adda-Decker, Computer Sciences Laboratory for Mechanics and Engineering Sciences (LIMSI).WebSitewww.limsi.fr/Individu/bi anca/article/Vieru&Boula&Madda_ParaLi ng07.pdf Halteren, H. van, Linguistic Profiling for Author Recognition and Verification, Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics USA, Barcelona, Spain, Article No. 199, Year of Publication: 2004. Linguistic profiling of texts for the purpose of language verification, By Hans Van Halteren and Nelleke Oostdijk, The ILK research group, Tilburg centre for Creative Computing and the Department of Communication and Information Sciences of the Faculty of Humanities, Tilburg University, The Netherlands. WebSite : www.ilk.uvt.nl/~antalb/textmining/LingPr ofColingDef.pdf Application of Data Mining Techniques for Medical Image Classification, By Maria-Luiza Antonie, Osmar R. Zaiane, Alexandru Coman, Proceedings of the Second International Workshop on Multimedia Data Mining (MDM/KDD 2001) in conjunction with ACM SIGKDD conference, San Francisco, August 26, 2001. Data Mining: Medical and Engineering Case Studies, By A. Kusiak, K.H. Kernstine, J.A. Kern, K.A. McLaughlin, and T.L. Tseng, Proceedings of the Industrial Engineering Research 2000 Conference, Cleveland, Ohio, pp. 1-7,May 21-23, 2000. Data Warehousing and Data Mining System Applied to E-Learning By R.Luis, J.Redol, D.Simoes, N.Horta, Proceedings of the II International Conference on Multimedia and Information & Communication Technologies in Education, Badajoz, Spain, December 36th 2003. Crime Data Mining: An Overview and Case Studies, By Hsinchun Chen, Wingyan Chung, Yi Qin, Michael Chau, Jennifer Jie Xu, Gang Wang, Rong Zheng, Homa Atabakhsh, A project under NSF Digital Government Programme, USA, “COPLINK Center: Information and Knowledge Management for Law Enforcement,”, July 2000 – June 2003. Rao, R. B., Krishnan, S. and Niculescu, R. S., Data Mining for Improved Cardiac Care, SIGKDD Explorations Volume 8, Issue 1. [24] Kanellopoulos, Y., Dimopulos, T., Tjortjis, C., Makris, C., Mining Source Code Elements for Comprehending Object-Oriented Systems and Evaluating Their Maintainability, SIGKDD Explorations Volume 8, Issue 1. [25] Schultz, M. G., Eskin, Eleazar, Zadok, Erez, and Stolfo, Salvatore, J., Data Mining Methods for Detection of New Malicious Executables, Proceedings of the 2001 IEEE Symposium on Security and Privacy, IEEE Computer Society Washington, DC, USA , ISSN:1081-6011, 2001. [26] Cai, W. and Li L., Anomaly Detection using TCP Header Information, STAT753 Class Project Paper, May 2004. Web Site: http://www.scs.gmu.edu/~wcai/stat753/sta t753report.pdf. [27] Comparative genomics using data mining tools, By Tannistha Nandi, Chandrika BRao and Srinivasan Ramchandran, Journal of Bio-Science, Indian Academy of Sciences, Vol. 27,No. 1, Suppl. 1, page No. 15-25, February 2002. [29] A survey of data mining methods for linkage dis-equilibrium mapping, By Paivi Onkamo and Hannu Toivonen, Henry Stewart Publications 1473 – 9542. Human Genomics. VOL 2, NO 5, Page No. 336– 340, MARCH 2006. [30] Data Mining in Sports: Predicting Cy Young Award Winners, By Lloyd Smith, Bret Lipscomb, and Adam Simkins, Journal of Computer Science, Vol. 22, Page No. 115-121,April 2007. [31] Deng, B., Liu, X., Data Mining in Quality Improvement, Proceedings of the Twenty-Seventh Annual SAS Users Group International Conference 2002 by SAS Institute Inc., Cary, NC, USA. ISBN 1-59047-061-3. Web Site : http://www2.sas.com/proceedings/sugi27/ Proceed27.pdf [32] Cohen, J. J., Olivia, C., Rud, P., Data Mining of Market Knowledge in The Pharmaceutical Industry, Proceeding of 13th Annual Conference of North-East SAS Users Group Inc., NESUG2000, Philadelphia Pennsilvania, September 2426 2000. [33] Elovici, Y., Kandel, A., Last, M., Shapira, B., Zaafrany, O.,Using Data Mining Techniques for Detecting TerrorRelated Activities on the Web, Website: www.ise.bgu.ac.il/faculty/mlast/papers/JI W_Paper.pdf 169 ISSN : 2229-6093 S.P.Deshpande,V.M.Thakare, Int. J. Comp. Tech. Appl., Vol 2 (1), 165-170 [34] [36] [37] [39] [40] Data Mining in Sports: A Research Overview, By Osama K. Solieman, A Technical Report, MIS Masters Project, August 2006. Foster, D. P. and Stine, R. A., Variable Selection in Data Mining: Building a Predictive Model for Bankruptcy, Journal of the American Statistical Association, Alexandria, VA, ETATSUNIS, vol. 99, ISSN 0162-1459, pp. 303313 January 15, 2004. Kraft, M. R., Desouza, K. C., Androwich, I., Data Mining in Healthcare Information Systems: Case Study of a Veterans’ Administration Spinal Cord Injury Population, IEEE, Proceedings of the 36th Hawaii International Conference on System Sciences, 0-7695-1874-5/03, 2002. Integrating E-Commerce and Data Mining: Architecture and Challenges, By Suhail Ansari, Ron Kohavi, Llew Mason, and Zijian Zheng, IEEE International Conference on Data Mining, 2001. Jadhav, S. R., and Kumbargoudar, P., Multimedia Data Mining in Digital Libraries: Standards and Features, Proceedings of conference Recent advances in Information Science and Technology READIT – 2007, pp 54-59, Organized by Madras Library Association - Kalpakkam Chapter & Scintific information Resource Division, Indira Gandhi Center for Atomic research, Department of Atomic Energy, Kalpakkam, Tamilnadu,India. 12-13 July 2007. [41] [42] [43] [44] [60] Anjewierden, A., Koll¨offel, B., and Hulshof C., Towards educational data mining: Using data mining methods for automated chat analysis to understand and support inquiry learning processes, International Workshop on Applying Data Mining in e-Learning, ADML'07, Vol305, Page No 23-32, Sissi, Lassithi - Crete Greece, 18 September, 2007. Knowledge Discovery with Genetic Programming for Providing Feedback to Courseware Authors, By Cristobal Romero, Sebastian Ventura and Paul De Bra, Kluwer Academic Publishers, Printed in the Netherlands, 30/08/2004. Crime Data Mining: A General Framework and Some Examples, By Hsinchun Chen, Wingyan Chung, Jennifer Jie Xu, Gang Wang, Yi Qin, Michael Chau, Technical Report, Published by the IEEE Computer Society, 0018-9162/04, pp 50-56, April 2004. Chodavarapu Y., Using data-mining for effective (optimal) sports squad selections, Web Site: ttp://insightory.com/view/74//using_datamining_for_effective_(optimal)_sports_sq uad_selections Vajirkar Pravin, Singh Sachin, and Lee Yugyung, Context-Aware Data Mining Framework for Wireless Medical Application, Lecture Notes in Computer Science (LNCS), Volume 2736, SpringerVerlag. ISBN 3-540-40806-1, pp. 381 – 391. 170