Application of Data Mining in Medical Decision Support System

Habib Shariff Mahmud
School of Engineering & Computing Sciences, University of East London - FTMS College, Technology Park Malaysia, Bukit Jalil, Kuala Lumpur, Malaysia
mahmudhabibshariff@gmail.com

Mohamed Ismail Z
Senior Lecturer, School of Engineering & Computing Sciences, FTMS College, Technology Park Malaysia, Bukit Jalil, Kuala Lumpur, Malaysia
ismail@ftms.edu.my

Abstract

Medical decision support systems (MDSS) are now used in many health care institutions across the globe. These institutions hold large amounts of medical data stored in different formats, and these data may contain relevant knowledge that is hidden. Data mining is used to extract such hidden knowledge from data. The main aim of this paper is to show how data mining methods can be applied in a medical decision support system, and to design a web-based expert system that can predict heart condition using a neural network. The design of the system is based on the VA Medical Center, Long Beach database, collected from the University of California, Irvine (UCI) machine learning repository. After analyzing several medical decision support systems in the relevant literature, three algorithms were identified: multilayer perceptron, decision tree and Naïve Bayes. These algorithms were tested under different configurations in order to find the best on the two medical datasets. Thereafter, their performance was compared on a set of performance metrics. The analysis was done using the Waikato Environment for Knowledge Analysis (WEKA) software on the two medical datasets, which are the diabetes and heart disease databases. The analysis showed that it is quite difficult to name one algorithm as more suitable than the others, because the neural network was found to be the best on the heart disease dataset while the decision tree was found to be more suitable on the diabetes dataset.
Keywords: Medical Decision Support System, Machine Learning, Neural Network, Naïve Bayes, Decision Tree.

1. Introduction

In the last few decades, medical disciplines have become increasingly data-intensive. Advances in digital technology have led to unprecedented growth in the size, complexity, and quantity of collected data, which include medical reports and associated images. The National electronic decision support taskforce (2002) highlighted that modern health centers comprise not only doctors, patients and medical staff but also various processes, including the patient's treatment. Modern systems and techniques have also been introduced in health care institutions to facilitate their operations. A huge number of medical records are stored in databases and data warehouses. Such databases and applications differ from one another. The basic ones store only primary information about patients such as name, age, address and blood type. The more advanced ones let the medical staff record patients' visits and store detailed information concerning their health condition. Some systems also facilitate patients' registration, units' finances and scheduling of visits.

The National electronic decision support taskforce (2002) explained that in recent years a new type of medical system has emerged, called the medical decision support system. It originates in business intelligence and is intended to support medical decisions. In their introduction, Turkoglu I., Arslan A., Ilkay E. (2006) explained that this situation is one of the reasons that call for closer collaboration between computer scientists and those in the medical field. Data mining, according to Witten I. H., Frank E. (2006), is a research area that seeks methods to find knowledge in data. It is also called knowledge discovery. It makes use of different types of data mining algorithms to analyze databases.
Data mining is not a single technique; it deploys various machine learning algorithms and any technique that can help to extract useful information from massive data. Different algorithms serve different purposes, and each has its own advantages and disadvantages. The most commonly used methods for data mining are based on neural networks, decision trees, a-priori, regression, k-means, Bayesian networks, and so forth.

Newman D.J. et al. (2000) cited, among other reasons, that the UCI medical data repository is chosen in research to allow others to conduct similar experiments and compare their results. This is one of the main reasons why the UCI data repository databases were considered. The chosen databases differ from each other, since they come from two different medical domains. This allows the algorithms' performance to be evaluated under various real medical conditions (attribute features). The UCI Repository of Machine Learning Databases and Domain Theories is a free Internet repository of analytical datasets from several areas. All datasets are in text file format and are provided with a short description. For the analyses, two medical datasets concerning two different medical fields were selected.

This research work aims to answer only the specific research questions stated below, because the medical field is very vast, with many new discoveries every day. There are many algorithms in data mining but, due to time constraints, only the three selected algorithms are tested on the two datasets. The algorithms are decision tree, multilayer perceptron and Naïve Bayes.

Objectives of the study

The objectives of the research work are as follows:
i. To show that data mining can be applied to medical databases to predict or classify the data with a reasonable degree of accuracy.
ii. To evaluate the performance of three selected data mining algorithms that are commonly implemented in medical decision support systems. The evaluation is performed on two medical datasets obtained from the UCI Repository.

Research questions

The research work aims to answer the following specific questions:
i. Can a web-based expert system be developed to predict heart related ailments using a data mining technique?
ii. Which of the three data mining algorithms provides the most accurate result in the medical diagnosis of heart disease and diabetes?

Outline

The research work is outlined as follows. Section one introduces the research work and states its aims and objectives, its limitations and the research questions. Section two reviews related work in the field of data mining and medical decision support systems. Section three is the methodology and design section; it explains the sources of data, the system architecture and how the design was carried out. Section four discusses the analysis of the experiments and the results obtained. Finally, section five summarizes and concludes the paper.

2. Literature Review

Application of data mining methods for medical decision support system

According to Krzysztof J. Cios et al. (2007), the aim of data mining is to make sense of large amounts of mostly unsupervised data in some domain. To make sense here means that the discovered knowledge should be understandable, novel and useful to the user. Probably the most important requirement is that the newly discovered knowledge is understandable, so that the user can use it to some advantage.
Data mining deals with large amounts of data, not small quantities that can be supervised manually. For example, NASA's Earth Observing System mission generates tens of gigabytes per hour; likewise Walmart, the NSA and many other organizations generate terabytes or petabytes of data. The main aim of data mining, as suggested by J. Hardin and D. Chhieng (2008), is to discover new patterns for the future. Discovering new patterns can serve two purposes: prediction and description. Description aims at finding new patterns that the user can understand, while prediction is the process of identifying variables in the database that can be used to predict future events or the behavior of some entities. Due to the large volume of data generated in medical settings, it is imperative for medical organizations to utilize data mining techniques in order to improve the quality of health care in general.

Another application of data mining in medical decision support systems, as suggested by Durairaj, M, Ranjani, V. (2013), is in health care management, where data mining tools can be used to identify and track chronic diseases. They can also be used to develop and design appropriate interventions that reduce the number of hospital admissions, and to support health care management in general. Durairaj, M, Ranjani, V. (2013) also suggested that data mining tools can be used to detect a bioterrorist attack, by analyzing massive amounts of data in search of patterns that might suggest something is wrong.

Medical decision support system

The complexity of diagnosing is what makes diagnosis very hard to understand. However, symptoms are the primary input in medical diagnoses, as suggested by Witten I. H., Frank E., (2005).
When these symptoms are processed, they produce an output that indicates whether a patient is sick or at risk of certain health related issues. The process of medical diagnosis is given in figure 1 below.

Fig. 1 The process of medical diagnosis (symptoms, knowledge base, diagnoses)

After diagnosing the patient, the next step is for the physician to make a decision. The process of decision making was analyzed by Mora M., Forgionne G.A., Jupta J., (2012). They said that this process is continuous and cyclical and involves the following phases:
i. Intelligence
ii. Design
iii. Choice
iv. Implementation

According to Marakas GM. (2002), a typical decision support system consists of five components:
i. the data management
ii. the model management
iii. the knowledge engine
iv. the user interface
v. the user(s)

3. Research Design and Methodology

Methodology

According to Creswell J. W. (2002), research can be categorized into three types: quantitative, qualitative or mixed. In view of this categorization, this research can be viewed as a qualitative one, because the analyses are based on the qualitative aspects of medical data mining. This means that the performance of the data mining algorithms is the driver of the evaluation. There are also other types of research: for instance, Dawson C.W. (2010) describes evaluation research, i.e. a study which involves evaluation. This research can therefore also be classified as such.

Sources of data

The data was collected from two sources. A questionnaire was used to collect the data for building the expert system, and the data from the UCI repository was used for the analysis of the two datasets. The medical datasets were chosen from the UCI repository to allow other researchers to conduct similar experiments and compare their findings. This is one of the main reasons for choosing the UCI medical dataset repository.
The UCI repository database is a free internet repository of analytical datasets from different areas. Almost all the datasets are provided as text files. The UCI datasets have gained recognition across the world and are said to be a reliable and valuable source of data. In this research two medical datasets were selected and analyzed.

Data preprocessing

Data preprocessing involves removing duplicate records, normalizing the database and removing unwanted fields. The data is preprocessed in order to make the data mining more efficient. The preprocessed data is then clustered using the k-means clustering algorithm with a value of k = 2. This produces two clusters: one that is relevant to heart disease and one that holds the remaining data. The frequent patterns are chosen as those whose significant weightage is greater than a predefined threshold; the frequent patterns are then mined according to the relevant heart diseases. S. Oyyathevan and A. Askarunisa (2014) noted that it might be appropriate to combine data in order to minimize the number of datasets and reduce the storage and processing time required by the data mining algorithms. For missing data, the substitution method was used: each missing value was replaced with the mean value computed from the same data. Erkki et al. (1998) suggested that this method is very accurate when compared with the artificial neural network based approach.

Determining the significant frequency pattern

Determining the significant frequency pattern can aid in designing a heart disease prediction system, and according to S. Oyyathevan and A.
Askarunisa (2014), the significant weightage has to be found before the significant frequency pattern is defined. It can be found with the following formula:

$S_w = \sum_{i=1}^{n} w_i f_i$

where $w_i$ is the weight of each attribute and $f_i$ is the frequency of each rule. The patterns whose significant weightage is greater than the predefined threshold are chosen to help in predicting heart disease. The significant frequency pattern (SFP) is given as:

$\mathrm{SFP} = \{\, x : S_w(x) > \Phi \,\}$

where SFP is the significant frequent pattern, $S_w$ is the significant weightage and $\Phi$ denotes the predefined threshold.

The feed forward network is determined based on the weight and frequency of each attribute and pattern respectively. Consider a cell as in fig 2, which may belong to an output layer or a hidden layer. Each input is given a weight. Suppose there are N + 1 inputs, with the last input having a value of 1, so that input $In_0$ has weight $W_0$, $In_1$ has weight $W_1$, ..., $In_{N-1}$ has weight $W_{N-1}$, and the constant input 1.0 has weight $W_N$. The sum of the weighted inputs determines the output as follows:

$sum = \sum_{i=0}^{N-1} In_i W_i + 1.0 \cdot W_N, \qquad output = f(sum)$

Fig. 2 A single cell neuron (weighted inputs, sum, f(sum), output)

Building the expert system

One of the aims of this research paper is to build a web-based expert system using a data mining technique, namely a neural network, that can aid in predicting heart disease for a given patient. As stated in the previous sections, data mining is part of machine learning, where rules are extracted from a given dataset in order to obtain more meaningful data. To implement the rules, the back propagation algorithm is adopted as shown below.

Back propagation algorithm

Input: D, a dataset; l, the learning rate; ntwk, a feed forward network
Output: a trained neural network

begin
1. Initialize all weights and biases in ntwk
2. while the terminating condition is not satisfied {
3.   for each tuple t in D {
       for each input layer unit a {
         $O_a = I_a$;
       }
4.     for each hidden or output layer unit a {
         $I_a = \sum_i w_{ia} O_i + \theta_a$;   // net input of unit a with respect to the previous layer, i
5.       $O_a = 1 / (1 + e^{-I_a})$;   // output of each unit
       }
       // propagate the errors
       for each unit j in the output layer
         $Err_j = O_j (1 - O_j)(T_j - O_j)$;   // compute the error
       for each unit j in the hidden layers
         $Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk}$;   // error with respect to the next higher layer, k
       for each weight $w_{ij}$ in the network {
         $\Delta w_{ij} = l \, Err_j \, O_i$;   // weight increment
         $w_{ij} = w_{ij} + \Delta w_{ij}$;   // weight update
       }
       for each bias $\theta_j$ in the network {
         $\Delta \theta_j = l \, Err_j$;   // bias increment
         $\theta_j = \theta_j + \Delta \theta_j$;   // bias update
       }
     }
   }
end

Initial Input Screen

After a successful login, the user is taken to the next page, which guides the prediction of heart disease. The attributes are chest pain, blood pressure, maximum heart rate, blood sugar, cholesterol level and old peak. It is assumed that the user has knowledge about his or her condition, which is why he or she is using the system. Fig. 3 shows the initial input screen where a user is expected to input the relevant information.

Fig. 3 Default Screen

After entering the inputs, the user can submit the values to the system and will instantly receive a response as to whether he or she is at risk or not.

4. Results and Discussion

In this section the results of the experiments performed with the three data mining algorithms are presented. The three algorithms are decision tree, Naïve Bayes and multilayer perceptron. The algorithms were applied to the heart disease and diabetes medical datasets, and the experiments were conducted with WEKA (Waikato Environment for Knowledge Analysis). The analyses are as follows:
• Several parameters are used to calibrate the algorithms.
• The parameters of all of the algorithms are used on each of the two datasets.
• The results are presented in the form of tables.

The Diabetes Database

The diabetes database consists of five attributes and 768 different cases.
Out of these, 66% are used for the training set while the remainder is used for the testing set, since it is good practice for the training set to be larger than the testing set. The decisional attribute takes a binary value of 0 or 1. The figure below shows the WEKA software that was explained.

Fig 3 The WEKA software showing the attributes

The distribution of the attributes is shown above in figure 3, where all five attributes are presented. The five attributes are pedigree, mass, age, plasma and class, which is the decisional attribute. The outputs obtained from the WEKA software after running the experiments are displayed in figure 4.1 together with their accuracy in terms of percentages. A comparison is now made in order to find which of the algorithms has the better classification accuracy on the diabetic dataset.

Figure 4.1 Analyses of the algorithms on diabetic dataset (accuracy in percent for training set, 66% split and 50% split)

From figure 4.1 it can be seen that, on the training set, the decision tree algorithm has the highest classification accuracy with 82.03%, followed by Naïve Bayes with 77.60% and then multilayer perceptron with 76.82%. When the percentage split is 66%, however, Naïve Bayes outperformed the others with an accuracy of almost 80%, followed by multilayer perceptron with 78.54% and then decision tree with 75.86%. The graph also shows that a percentage split of 66% always performs better than smaller percentages. The confusion matrices are presented next, together with the graphical representation and analysis of the algorithms run under the WEKA software with different parameter settings.
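The percentage split described above can be illustrated with a short sketch. It is not WEKA's implementation: the function name `percentage_split`, the fixed seed and the round-half-up rule for the training size are assumptions made for illustration only. Under these assumptions, 768 cases with a 66% split give 507 training cases and 261 testing cases.

```python
import math
import random


def percentage_split(cases, train_percent, seed=1):
    """Shuffle the cases and split them into training and testing sets.

    Assumption: the training size is n * train_percent / 100 rounded
    half up; the remaining cases form the testing set.
    """
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    train_size = math.floor(len(shuffled) * train_percent / 100 + 0.5)
    return shuffled[:train_size], shuffled[train_size:]


# 768 placeholder case identifiers standing in for the diabetes records.
train, test = percentage_split(range(768), 66)
print(len(train), len(test))  # 507 261
```

Shuffling before the split keeps the class distribution of the two parts roughly similar; a different seed gives different members but the same set sizes.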
Fig. 4.2 Graphical view of the confusion matrix on training set (bars A and B for decision tree, Naïve Bayes and multilayer perceptron)
Legend: A represents the number of cases that tested negative; B represents the number of cases that tested positive.

From the graphical representation of the confusion matrix above, we can conclude that decision tree has the best classification.

Fig. 4.3 Graphical view of confusion matrix based on 66% split (bars A and B for decision tree, Naïve Bayes and multilayer perceptron)

From the confusion matrix based on the 66% split it can be deduced that Naïve Bayes performs better, followed by multilayer perceptron and then decision tree.

Fig 4.4 Graphical view of confusion matrix based on 50% split (bars A and B for decision tree, Naïve Bayes and multilayer perceptron)

Finally, based on the experiments conducted with the settings provided, it can be seen that Naïve Bayes gives a better prediction, with 33 and 49 incorrect predictions, when the percentage split is 66%, followed by multilayer perceptron and then decision tree. When the percentage split is 50%, both Naïve Bayes and multilayer perceptron give a better classification than decision tree.

Findings from the Heart Disease Database

After running the experiments for the three algorithms with different parameters on the heart disease database, the following table summarizes the findings.

                 Decision Tree   Naïve Bayes   Multilayer Perceptron
Training set     87.46%          84.49%        95.38%
66% split        82.52%          84.47%        80.58%
50% split        74.83%          82.78%        78.80%

Table 4.1 Accuracy of the algorithms on heart disease database

From table 4.1 it can be seen that multilayer perceptron performed the best on the training set with an accuracy of 95.38%, followed by decision tree and then Naïve Bayes. But on the percentage splits of 66% and 50%, Naïve Bayes was found to perform better than the other two.
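The accuracies in Table 4.1 can be recomputed from a confusion matrix as the share of cases that lie on its main diagonal. The sketch below does this for the training-set matrices reported in Table 4.2. The function name `accuracy` and the nested-list layout are illustrative assumptions and are not WEKA output.

```python
def accuracy(matrix):
    """Return the percentage of cases on the main diagonal of a square matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


# Training-set confusion matrices from Table 4.2 (class A first, then class B).
training_matrices = {
    "Decision tree": [[156, 9], [29, 109]],
    "Naive Bayes": [[149, 16], [31, 107]],
    "Multilayer perceptron": [[160, 5], [9, 129]],
}

for name, matrix in training_matrices.items():
    print(f"{name}: {accuracy(matrix):.2f}%")
# Decision tree: 87.46%
# Naive Bayes: 84.49%
# Multilayer perceptron: 95.38%
```

Each matrix sums to 303 cases, so for example the decision tree figure is (156 + 109) / 303.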
The high accuracy of multilayer perceptron in predicting heart disease motivated us to use it in developing the expert system. The confusion matrix for each of the algorithms is presented in the following tables.

      Decision tree    Naïve Bayes    Multilayer perceptron
      A      B         A      B       A      B
A     156    9         149    16      160    5
B     29     109       31     107     9      129

Table 4.2 Confusion matrix based on training set

From the analysis above it can be seen that the multilayer perceptron has the most accurate result, followed by decision tree and then Naïve Bayes, in line with the training set accuracies in Table 4.1.

      Decision tree    Naïve Bayes    Multilayer perceptron
      A      B         A      B       A      B
A     68     13        72     9       67     14
B     25     45        17     53      18     52

Table 4.3 Confusion matrix based on 50% split

With the 50% split configuration, Naïve Bayes has the most accurate classification, followed by multilayer perceptron and then decision tree.

      Decision tree    Naïve Bayes    Multilayer perceptron
      A      B         A      B       A      B
A     45     5         45     6       39     18
B     13     39        10     42      8      44

Table 4.4 Confusion matrix based on 66% split

With the 66% split configuration, Naïve Bayes still outperforms the rest, followed by decision tree and then multilayer perceptron, in line with the accuracies in Table 4.1.

5. Conclusion

The conclusions that can be drawn from developing the expert system using neural networks are as follows:

(1) A dataset of heart disease from the UCI machine learning repository, obtained from previous research, was used. The dataset consists of 303 patients with varying forms of symptoms. The data was preprocessed and then clustered using the k-means clustering algorithm with a value of k = 2. Also, a questionnaire was designed and administered online in order to develop the web-based expert system that can predict or classify heart related risk.
(2) Developing the expert system using JavaServer Pages (JSP) consisted of several steps, among which are feasibility studies, design, and knowledge acquisition and representation. The result is a web-based expert system to predict heart disease. The system has been tested using validation and prototyping.

(3) The implementation of the system is an application that provides the user with a set of attributes to fill in and gives an instant response as to whether the user is at risk or not. The application has been tested by a doctor, who recommended that some improvements be made in future work.

The experiments conducted on the two datasets using different configurations produced some very interesting results. It was found that it is very difficult to say which configuration is the best, although, because of time constraints, only very few configurations were used. It was, however, found that the 50% split was the worst configuration on the two datasets used in the experiments.

The analysis also produced some very interesting results. On the diabetes dataset under the percentage split configurations, Naïve Bayes performed best, followed by multilayer perceptron and then decision tree, while on the heart disease dataset the multilayer perceptron had the best performance on the training set, followed by decision tree and then Naïve Bayes. It can therefore be concluded that, because of the nature and complexity of medical data, it is very difficult to say which method has the best overall performance on medical datasets; different methods work better on different specific datasets. The results obtained showed the applicability of data mining algorithms to medical datasets, but care should be taken in choosing the algorithm for a particular dataset and in not generalizing the results.
References

[1] Chae Y. M., Kim H. S., Tark K. C., Park H. J., Ho S. H., (2003) 'Analysis of healthcare quality indicator using data mining and decision support system'. Expert Systems with Applications, 167–172.
[2] Creswell J. W., (2012) Research Design: Qualitative, Quantitative and Mixed Methods Approaches. 2nd edn. Sage Publications, Thousand Oaks CA.
[3] Dawson C. W., (2010) The Essence of Computing Projects – a Student's Guide. Prentice Hall, Harlow UK.
[4] Detmer W., Barnett G., Hersh W., (2002) 'MedWeaver: Integrating Decision Support, Literature Searching and Web Exploration using the UMLS Metathesaurus'.
[5] Durairaj, M, Ranjani, V. (2013) 'Data Mining Applications In Healthcare Sector: A Study', 2(10).
[6] DXp HST Lec 05.pdf, (2009) Computer Science and Artificial Intelligence Laboratory (CSAIL), MIT.
[7] Electronic Decision Support for Australia's Health Sector, National electronic decision support taskforce, 2002.
[8] http://groups.csail.mit.edu/medg/courses/6872/2004/DXp%20HST%20Lec%2005.pdf, (retrieved on 1.05.2014)
[9] J. Hardin, D. Chhieng (2008) 'support system' pp. 44-63.
[10] Mitchell T. M., (2007) Machine Learning, Redmond, McGraw-Hill.
[11] Mora M., Forgionne G.A., Jupta J., (2012) Decision Making Support Systems: achievements, trends and challenges for the next decade. Idea-Group: Hershey, P.A.
[12] Newman D.J., Hettich S., Blake C.L., Merz C.J., UCI Repository of machine learning databases. [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science (retrieved on 2.05.2014)
[13] Nong Y., (2003) 'The Handbook of Data Mining'. Lawrence Erlbaum Associates.
[14] S. Oyyathevan, A. Askarunisa. (2014) 'An Expert System for Heart Disease Prediction Using Data Mining Technique', Research Paper, 1(4), pp.1-6.
[15] Shu-Mei W., Yu C., Yu-Mei, Cheng-Fang Y., Hui-Lian C., (2005) 'Decision-making tree for women considering hysterectomy', Journal of Advanced Nursing, Blackwell Publishing, pp.361-368.
[16] Tang Z., MacLennan J., (2005) Data Mining with SQL Server 2005. Indianapolis, Indiana, USA, Wiley Publishing Inc.
[17] Teach R. and Shortliffe E., (2001) 'An analysis of physician attitudes regarding computer-based clinical consultation systems'. Computers and Biomedical Research, 14, pp. 542-558.
[18] Turkoglu I., Arslan A., Ilkay E., (2006) 'An expert system for diagnosis of the heart valve diseases'. Expert Systems with Applications, 23(3), pp.229–236.
[19] WHO, Fact sheet No. 297: Cancer, 2006, Retrieved on 26 04 2014.
[20] Witten I. H., Frank E., (2005) 'Data Mining, Practical Machine Learning Tools and Techniques', 2nd edn. Elsevier.