Application of Data Mining in Medical
Decision Support System
Habib Shariff Mahmud
School of Engineering & Computing Sciences
University of East London - FTMS College
Technology Park Malaysia
Bukit Jalil, Kuala Lumpur, Malaysia
mahmudhabibshariff@gmail.com
Mohamed Ismail Z
Senior lecturer, School of Engineering & Computing Sciences
FTMS College,
Technology Park Malaysia,
Bukit Jalil, Kuala Lumpur, Malaysia
ismail@ftms.edu.my
Abstract
Medical decision support systems (MDSS) are now used in many health care institutions across
the globe. These institutions hold large amounts of medical data stored in different formats, which
may contain relevant knowledge that is hidden. Data mining is used to extract such hidden knowledge
from the relevant data. The main aim of this paper is to show how data mining methods can be
applied in a medical decision support system, and also to design a web-based expert system that can
predict heart condition using a neural network. The design of the system is based on the VA Medical
Center, Long Beach database, obtained from the University of California, Irvine (UCI) machine
learning repository. After analyzing several medical decision support systems in the relevant
literature, three algorithms were identified: multilayer perceptron, decision tree and Naïve Bayes.
These algorithms were tested under different configurations in order to find the best one on two
medical datasets. A comparison was then made of their performance based on a set of performance
metrics. The analysis was done using the Waikato Environment for Knowledge Analysis (WEKA)
software on the two medical datasets, the diabetes and heart disease databases. The analysis showed
that it is difficult to name one algorithm as more suitable than the others, because the neural
network was found to be the best on the heart disease dataset while the decision tree was found to
be more suitable on the diabetes dataset.
Keywords: Medical Decision Support System, Machine Learning, Neural Network, Naïve Bayes, Decision Tree.
1. Introduction
In the last few decades, medical disciplines have become increasingly data-intensive. The
advances in digital technology have led to an unprecedented growth in the size, complexity, and
quantity of collected data that include medical reports and associated images.
The National Electronic Decision Support Taskforce (2002) highlighted that modern health
centers comprise not only doctors, patients and medical staff but also various processes,
including the patient's treatment. Modern systems and techniques have also been introduced in
health-care institutions to facilitate their operations. A huge number of medical records are
stored in databases and data warehouses. Such databases and applications differ from one another.
The basic ones store only primary information about patients such as name, age, address, blood
type, etc. The more advanced ones let the medical staff record patients' visits and store detailed
information concerning their health condition. Some systems also facilitate patients'
registration, units' finances and scheduling of visits. The National Electronic Decision Support
Taskforce (2002) explained that in recent years a new type of medical system has emerged, called
the medical decision support system. It originates in business intelligence and is intended to
support medical decisions.
In their introduction, Turkoglu I., Arslan A., Ilkay E. (2006) explained that the situation
described above is one of the reasons that call for closer collaboration between computer
scientists and those in the medical field.
Data mining, according to Witten I. H., Frank E. (2005), is a research area which seeks
methods to find knowledge in data. It is also called knowledge discovery. It makes use of
different types of data mining algorithms to analyze databases.
Data mining is not a single technique; it deploys various machine learning algorithms and
any other technique that can help to extract useful information from massive data. Different
algorithms serve different purposes, and each algorithm has its own advantages and
disadvantages. The most commonly used methods for data mining are based on neural networks,
decision trees, Apriori, regression, k-means, Bayesian networks, and so forth.
Newman D.J., et al. (2000) cited, among other reasons, that the UCI medical data repository
is chosen in research to allow others to conduct similar experiments and compare their results.
This is one of the main reasons why the UCI data repository databases were considered. The
chosen databases are different from each other, as they come from two different medical domains.
This allows the algorithms' performance to be evaluated under various real medical conditions
(attributes and features). The UCI Repository of Machine Learning Databases and Domain Theories
is a free Internet repository of analytical datasets from several areas. All datasets are in text
file format and are provided with a short description. For the analyses, two medical datasets
concerning two different medical fields were selected.
This research work aims only to answer the specific research questions stated in the research
questions section, because the medical field is a very vast field with new discoveries every
day. There are many algorithms in data mining, but due to time constraints only the three
selected algorithms are tested on the two datasets. The algorithms are decision trees, the
multilayer perceptron and Naïve Bayes.
Objectives of the study
The objectives of the research work are as follows:
i. To show that data mining can be applied to medical databases to predict or classify the
data with a reasonable degree of accuracy.
ii. To evaluate three selected data mining algorithms, which are commonly implemented in
medical decision support systems, with regard to their performance. The evaluation is
performed on two medical datasets obtained from the UCI Repository.
Research questions
The research work will aim to answer the following specific questions:
• Can a web-based expert system be developed to predict heart-related ailments using a
data mining technique?
• Which of the three data mining algorithms provides the most accurate result in the
medical diagnosis of heart disease and diabetes?
Outline
The research work is outlined as follows. Section one introduces the research work and
states its aims and objectives, its limitations and the research questions. Section two reviews
related work in the field of data mining and medical decision support systems. Section three is
the methodology and design section, in which the sources of data, the system architecture and
the way the design was carried out are explained. Section four discusses the analysis of the
experiments and the results obtained. Finally, the paper is summarized and concluded.
2. Literature Review
Application of data mining methods for medical decision support system
According to Krzysztof J. Cios et al. (2007), the aim of data mining is to make sense of
large amounts of mostly unsupervised data in some domain. To make sense here means that the
discovered knowledge should be understandable, novel and useful to the user. Probably the most
important of these is that the user understands the new knowledge, so that it can be used to
some advantage. Data mining deals with large amounts of data, not small quantities that can be
inspected manually. For example, NASA generates tens of gigabytes per hour in its Earth
Observing System mission, and Walmart, the NSA and many others likewise generate terabytes or
petabytes of data.
The main aim of data mining as suggested by J. Hardin and D. Chhieng (2008) is to discover new
patterns for the future. Discovering new patterns serves two purposes: prediction and
description. Description aims at finding new patterns that the user can understand, while
prediction is the process of identifying variables in the database that can be used to predict
future events or the behavior of some entities.
Due to the large volume of data being generated in medical settings, it is imperative for
medical organizations to use data mining techniques in order to improve the quality of health
care in general.
Another application of data mining in medical decision support systems, as suggested by
Durairaj, M, Ranjani, V. (2013), is in health care management, where data mining tools can be
used to identify and track chronic diseases. They can also be used to design appropriate
interventions that reduce the number of hospital admissions and to support healthcare
management in general. Durairaj, M, Ranjani, V. (2013) also suggested that data mining tools
can be used to detect a bioterrorist attack, by analyzing the massive amount of data in order to
search for patterns that might suggest something is wrong.
Medical decision support system
Diagnosis is hard to understand because of its complexity. However, symptoms are the
primary input in medical diagnosis, as suggested by Witten I. H., Frank E. (2005). When these
symptoms are processed they produce an output which indicates whether a patient is sick or at
risk of certain health-related issues. The process of medical diagnosis is given in Fig. 1 below.
Symptoms -> Knowledge base -> Diagnoses
Fig. 1 The process of medical diagnosis
After diagnosing the patient, the next step is for the physician to make a decision. The
process of decision making was analyzed by Mora M., Forgionne G.A., Jupta J. (2012), who
described it as a continuous, cyclic process involving the following phases:
i. Intelligence
ii. Design
iii. Choice
iv. Implementation
According to Marakas G. M. (2002), a typical decision support system consists of five
components:
i. the data management
ii. the model management
iii. the knowledge engine
iv. the user interface
v. the user(s)
3. Research Design and Methodology
Methodology
According to Creswell J. W. (2002), research can be categorized into three types:
quantitative, qualitative or mixed. In view of this categorization, this research can be viewed
as a qualitative one, because the analyses are based on the qualitative aspects of medical data
mining. This means that the performance of the data mining algorithms is the driver of the
evaluation.
There are also other types of research: for instance, Dawson C. W. (2010) describes
evaluation research, i.e. a study which involves evaluation. This research can therefore also be
classified as such.
Sources of data
The data was collected from two sources: a questionnaire was used to collect the data for
building the expert system, and the data from the UCI repository was used for the analysis of
the two datasets.
Medical datasets from the UCI repository were chosen to allow other researchers to conduct
similar experiments and compare their findings. This is one of the main reasons for choosing the
UCI medical dataset repository. The UCI repository is a free Internet repository of analytical
datasets from different areas. Most of the datasets are in text file format. The UCI datasets
have gained recognition across the world and are regarded as a reliable and valuable source of
data. In this research two medical datasets were selected and analyzed.
Data preprocessing
Data preprocessing involves removing duplicate records, normalizing the database and
removing unwanted fields. The data is preprocessed in order to make the data mining more
efficient. The preprocessed data is then clustered using the k-means clustering algorithm with
a value of k = 2. This produces two clusters: one that is relevant to heart disease and one
containing the remaining data. The frequent patterns are chosen as those whose significant
weightage is greater than a predefined threshold, after which the frequent patterns are mined
according to the relevant heart diseases.
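The preprocessing and clustering steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the study: the toy records (age, resting blood pressure, cholesterol), the function names, the choice of min-max scaling as the normalization and the random initialization are all hypothetical.

```python
import math
import random


def drop_duplicates(rows):
    """Remove exact duplicate records, keeping the first occurrence."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(list(row))
    return unique


def min_max_normalize(rows):
    """Rescale every column to [0, 1]; a constant column maps to 0.0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def k_means(rows, k=2, iterations=100, seed=0):
    """Plain k-means: returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(row) for row in rng.sample(rows, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each record goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(row, centroids[c]))
            for row in rows
        ]
        if new_labels == labels:  # no record changed cluster: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Hypothetical records: (age, resting blood pressure, cholesterol).
records = [
    [40, 120, 200], [42, 118, 205], [45, 122, 198],
    [65, 160, 290], [65, 160, 290],  # exact duplicate of the previous record
    [67, 158, 280], [63, 162, 285],
]
clean = drop_duplicates(records)
scaled = min_max_normalize(clean)
centroids, labels = k_means(scaled, k=2)
print(labels)
```

Which of the two resulting clusters corresponds to the records relevant to heart disease has to be decided by inspecting the cluster members; the algorithm itself only assigns arbitrary cluster indices.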
S. Oyyathevan and A. Askarunisa (2014) state that it might be appropriate to combine data
in order to minimize the number of datasets and also to reduce the amount of storage and
processing time required by the data mining algorithms.
For the missing data, the substitution method was used; that is, each missing value was
replaced with the mean value computed from the same data. Erkki et al. (1998) suggested that
this method is very accurate when compared with the artificial neural network based approach.
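A minimal sketch of mean substitution is given below. It assumes that missing entries are represented by `None` and that the mean is taken per attribute (column) over the observed values; the function name and the sample values are hypothetical.

```python
def impute_column_means(rows, missing=None):
    """Replace each missing entry with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not missing]
        # Assumes every column has at least one observed value.
        means.append(sum(observed) / len(observed))
    return [
        [means[j] if value is missing else value for j, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical records: (age, cholesterol) with two missing entries.
raw = [[63.0, 233.0], [41.0, None], [None, 250.0], [56.0, 236.0]]
filled = impute_column_means(raw)
print(filled)
```

Mean substitution keeps every record in the dataset, at the cost of shrinking the variance of the imputed attribute.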
Determining the significant frequency pattern
Determining the significant frequency pattern can aid in designing a heart disease
prediction system. S. Oyyathevan and A. Askarunisa (2014) state that before defining the
significant frequency pattern, the significant weightage has to be found, using the following
formula:
$$S_w = \sum_{i=1}^{n} w_i f_i$$
where $w_i$ is the weight of each attribute and $f_i$ is the frequency of each rule. The
patterns with a significant weightage greater than the predefined threshold are chosen in order
to help in predicting heart disease.
The significant frequency pattern (SFP) is given as:
$$\mathrm{SFP} = \{\, x : S_w(x) > \Phi \,\}$$
where SFP is the significant frequent pattern, $S_w$ is the significant weightage and $\Phi$ is
the predefined threshold.
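The selection of significant frequent patterns can be illustrated with a short sketch. It assumes that the significant weightage of a pattern is the sum of weight times frequency over its attributes and that a pattern is kept when this value is strictly greater than the threshold; the pattern names, weights, frequencies and threshold are invented for illustration.

```python
def significant_weightage(pairs):
    """Sum of w_i * f_i over the (weight, frequency) pairs of one pattern."""
    return sum(w * f for w, f in pairs)


def significant_frequent_patterns(patterns, threshold):
    """Keep the patterns whose significant weightage exceeds the threshold."""
    selected = {}
    for name, pairs in patterns.items():
        sw = significant_weightage(pairs)
        if sw > threshold:
            selected[name] = sw
    return selected


# Hypothetical patterns: each maps to (attribute weight, rule frequency) pairs.
patterns = {
    "chest_pain+high_bp": [(3, 4), (2, 5)],   # 3*4 + 2*5 = 22
    "high_sugar+old_peak": [(1, 2), (1, 3)],  # 1*2 + 1*3 = 5
}
sfp = significant_frequent_patterns(patterns, threshold=10)
print(sfp)
```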
The feed forward network is determined based on the weight and frequency of each
attribute and pattern respectively. Consider a cell as in Fig. 2, which may be in an output
layer or a hidden layer. Each input is given a weight. Suppose, for example, that there are
N + 1 inputs, with the last input having a constant value of 1. The weighted sum of the inputs
determines the output:
In_0 ---- W_0, In_1 ---- W_1, ..., In_(N-1) ---- W_(N-1), 1.0 ---- W_N
$$\mathrm{sum} = \sum_{i=0}^{N-1} W_i\,\mathrm{In}_i + W_N \cdot 1.0, \qquad \mathrm{output} = f(\mathrm{sum})$$
Sum ---- f(sum) ---- output (o/p)
Fig. 2 A single cell neuron
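The single cell described above can be written as a short function. This is a sketch under assumptions: the text does not name the function f, so the logistic sigmoid is used here as a common default, and the function and parameter names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic activation; an assumed choice for f."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, activation=sigmoid):
    """Output of one cell with N inputs plus a constant input of 1.0.

    `weights` holds N + 1 values; the last one multiplies the constant input.
    """
    if len(weights) != len(inputs) + 1:
        raise ValueError("expected one more weight than inputs")
    extended = list(inputs) + [1.0]
    total = sum(w * x for w, x in zip(weights, extended))
    return activation(total)


# Two inputs, three weights (W_0, W_1 and the weight W_2 of the constant input).
print(neuron_output([0.0, 0.0], [0.5, -0.5, 0.0]))
```

The constant input of 1.0 lets the last weight act as a bias, so the cell can produce a non-zero sum even when all other inputs are zero.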
Building the expert system
One of the aims of this research paper is to build a web-based expert system using a data
mining technique, namely a neural network, that can aid in predicting heart disease for a given
patient. As stated in the previous sections, data mining is part of machine learning, where
rules are extracted from a given dataset in order to obtain more meaningful data.
To implement the rules, the back propagation algorithm is adopted as shown below:
Back propagation algorithm
Input:
D - a dataset of training tuples
l - the learning rate
ntwk - a feed forward network
Output: a trained neural network
begin
1. Initialize all weights and biases in ntwk
2. While the terminating condition is not satisfied {
3. For each tuple t in D {
4. For each input layer unit j { O_j = I_j; } // the output of an input unit is its input value
5. For each hidden or output layer unit j {
I_j = sum_i (w_ij O_i) + theta_j; // net input of unit j with respect to the previous layer i
6. O_j = 1 / (1 + e^(-I_j)); } // to compute the output of each unit
The next step is to propagate the errors, and the steps are as follows:
• For each unit j in the output layer: Err_j = O_j (1 - O_j)(T_j - O_j); // compute the error
• For each unit j in the hidden layers: Err_j = O_j (1 - O_j) sum_k (Err_k w_jk); // compute the
error with respect to the next higher layer k
• For each weight w_ij in the network {
delta_w_ij = (l) Err_j O_i; // weight increment
w_ij = w_ij + delta_w_ij; } // update weight
• For each bias theta_j in the network {
delta_theta_j = (l) Err_j; // bias increment
theta_j = theta_j + delta_theta_j; } // update bias
} }
end
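For readers who prefer runnable code, the back propagation steps above can be sketched in plain Python as follows. This is an illustrative sketch, not the system's actual implementation: it assumes a single hidden layer, sigmoid units, updates after every tuple and a fixed number of epochs as the terminating condition. The class name, the toy OR dataset, the learning rate and the epoch count are all hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class FeedForwardNet:
    """One hidden layer of sigmoid units, trained tuple by tuple."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)

        def small():
            return rng.uniform(-0.5, 0.5)

        # w_hidden[i][j]: weight from input i to hidden unit j (w_out likewise).
        self.w_hidden = [[small() for _ in range(n_hidden)] for _ in range(n_in)]
        self.b_hidden = [small() for _ in range(n_hidden)]
        self.w_out = [[small() for _ in range(n_out)] for _ in range(n_hidden)]
        self.b_out = [small() for _ in range(n_out)]

    def forward(self, x):
        """Propagate the inputs forward; returns (hidden outputs, network outputs)."""
        hidden = [
            sigmoid(sum(x[i] * self.w_hidden[i][j] for i in range(len(x)))
                    + self.b_hidden[j])
            for j in range(len(self.b_hidden))
        ]
        out = [
            sigmoid(sum(hidden[j] * self.w_out[j][k] for j in range(len(hidden)))
                    + self.b_out[k])
            for k in range(len(self.b_out))
        ]
        return hidden, out

    def train_tuple(self, x, target, rate):
        hidden, out = self.forward(x)
        # Output layer error: Err_k = O_k (1 - O_k)(T_k - O_k)
        err_out = [o * (1 - o) * (t - o) for o, t in zip(out, target)]
        # Hidden layer error: Err_j = O_j (1 - O_j) * sum_k Err_k w_jk
        err_hidden = [
            h * (1 - h) * sum(err_out[k] * self.w_out[j][k] for k in range(len(out)))
            for j, h in enumerate(hidden)
        ]
        # Weight and bias updates: w_ij += l * Err_j * O_i, theta_j += l * Err_j
        for j, h in enumerate(hidden):
            for k in range(len(out)):
                self.w_out[j][k] += rate * err_out[k] * h
        for k in range(len(out)):
            self.b_out[k] += rate * err_out[k]
        for i, xi in enumerate(x):
            for j in range(len(hidden)):
                self.w_hidden[i][j] += rate * err_hidden[j] * xi
        for j in range(len(hidden)):
            self.b_hidden[j] += rate * err_hidden[j]


def total_squared_error(net, data):
    return sum(
        (t - o) ** 2
        for x, target in data
        for t, o in zip(target, net.forward(x)[1])
    )


# Toy dataset (logical OR), used only to show that training reduces the error.
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [1])]
net = FeedForwardNet(n_in=2, n_hidden=3, n_out=1, seed=1)
error_before = total_squared_error(net, data)
for _ in range(2000):  # terminating condition: fixed number of epochs
    for x, target in data:
        net.train_tuple(x, target, rate=0.5)
error_after = total_squared_error(net, data)
predictions = [round(net.forward(x)[1][0]) for x, _ in data]
print(error_before, error_after, predictions)
```

In the expert system the inputs would be the patient attributes rather than two binary values; the update rules stay the same.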
Initial Input Screen
After a successful login, the user is taken to the next page, which guides the prediction
of heart disease. The attributes are chest pain, blood pressure, maximum heart rate, blood
sugar, cholesterol level and old peak. It is assumed that the user has knowledge of his or her
condition, which is why he or she is using the system. Fig. 3 shows the initial input screen,
where the user is expected to input the relevant information.
Fig. 3 Default Screen
After entering the input, the user can submit the values to the system and will instantly
receive a response as to whether he or she is at risk or not.
4. Results and Discussion
In this section the results of the experiments performed with the three data mining
algorithms are presented. The three algorithms are decision tree, Naïve Bayes and multilayer
perceptron. The algorithms were applied to the heart disease and diabetes medical datasets, and
the experiments were conducted with WEKA (Waikato Environment for Knowledge Analysis).
The analyses are as follows:
• Several parameters are used to calibrate the algorithms.
• The parameters of all of the algorithms are used on each of the two datasets.
• The results are presented in the form of tables.
The Diabetes Database
The diabetes database consists of five attributes and 768 different cases. Of these, 66%
are used for the training set while the remainder are used for the testing set, as it is good
practice to have a larger training set than testing set. The decision attribute takes a binary
value of 0 or 1. The figure below shows the attributes as displayed in the WEKA software.
Fig 3 The WEKA software showing the attributes
The distribution of the attributes is shown above in Fig 3, where all five attributes are
presented. The five attributes are pedigree, mass, age, plasma and class, which is the decision
attribute.
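The percentage split described above can be sketched as follows. This is an illustration under assumptions rather than WEKA's own code: the records are shuffled with a fixed seed and the training size is rounded to the nearest integer. That rounding choice is an assumption; it is used here because it yields the 103 and 151 test instances that the heart disease confusion matrices (303 cases) sum to for the 66% and 50% splits.

```python
import random


def percentage_split(instances, train_percent, seed=1):
    """Shuffle the instances and split them into (training set, testing set)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    # Round the training size to the nearest integer (assumed convention).
    n_train = int(len(shuffled) * train_percent / 100 + 0.5)
    return shuffled[:n_train], shuffled[n_train:]


# Stand-in for the 768 diabetes cases: only the record indices are used here.
diabetes_cases = list(range(768))
train, test = percentage_split(diabetes_cases, 66)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so that different algorithms are compared on exactly the same training and testing records.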
The outputs obtained from the WEKA software after running the experiments are displayed in
Figure 4.1 together with their accuracy in percent. A comparison is now made in order to find
which of the algorithms gives the better classification accuracy on the diabetic dataset.
Figure 4.1 Analyses of the algorithms on the diabetic dataset (accuracy in percent on the training set, 66% split and 50% split)
From Figure 4.1 it can be seen that the decision tree algorithm has the highest
classification accuracy with 82.03%, followed by Naïve Bayes with 77.60% and then the
multilayer perceptron with 76.82%. In terms of percentage split, however, when the percentage
split is 66%, Naïve Bayes outperforms the others with an accuracy of almost 80%, followed by the
multilayer perceptron with 78.54% and then the decision tree with 75.86%. The graph also shows
that a percentage split of 66% always performs better than a smaller percentage.
The confusion matrices are now presented, together with a graphical representation and
analysis of the algorithms run in the WEKA software under different parameter settings.
Fig. 4.2 Graphical view of the confusion matrix on the training set (bars A and B for decision tree, Naïve Bayes and multilayer perceptron)
Legend: A - represents the number of cases that tested negative
B - represents the number of cases that tested positive
From the graphical representation of the confusion matrix above, we can conclude that
decision tree has the best classification.
Fig. 4.3 Graphical view of the confusion matrix based on 66% split (bars A and B for decision tree, Naïve Bayes and multilayer perceptron)
From the confusion matrix based on the 66% split it can be deduced that Naïve Bayes
performs better, followed by the multilayer perceptron and then the decision tree.
Fig 4.4 Graphical view of the confusion matrix based on 50% split (bars A and B for decision tree, Naïve Bayes and multilayer perceptron)
Finally, based on the experiments conducted with the settings provided, it can be seen that
Naïve Bayes gives a better prediction, with 33 and 49 incorrect predictions, when the
percentage split is 66%, followed by the multilayer perceptron and then the decision tree. When
the percentage split is 50%, both Naïve Bayes and the multilayer perceptron give a better
classification than the decision tree.
Findings from the Heart Disease Database
After running the experiments for the three algorithms with different parameters on the
heart disease database, the following table summarizes the findings.
                Decision Tree    Naïve Bayes    Multilayer Perceptron
Training set    87.46%           84.49%         95.38%
66% split       82.52%           84.47%         80.58%
50% split       74.83%           82.78%         78.80%
Table 4.1 Accuracy of the algorithms on the heart disease database
From Table 4.1 it can be seen that the multilayer perceptron performed best on the training
set with an accuracy of 95.38%, followed by the decision tree and then Naïve Bayes. On the
percentage splits of 66% and 50%, however, Naïve Bayes was found to perform better than the
other two. The high accuracy of the multilayer perceptron in predicting heart disease motivated
its use in developing the expert system in this research.
The confusion matrix for each of the algorithms and the corresponding graphical
presentation are summarized in the following tables and figures.
        Decision tree    Naïve Bayes    Multilayer perceptron
        A      B         A      B       A      B
A       156    9         149    16      160    5
B       29     109       31     107     9      129
Table 4.2 Confusion matrix based on training set
From the analysis above it can be seen that the multilayer perceptron has the most
accurate result, followed by the decision tree and then Naïve Bayes.
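The accuracies reported for the training set can be recomputed directly from the confusion matrices in Table 4.2, as the share of cases on the diagonal. The sketch below does this; the function and variable names are illustrative, and no assumption is needed about whether rows or columns hold the actual classes, since accuracy only uses the diagonal and the total.

```python
def accuracy(matrix):
    """Fraction of correctly classified cases: diagonal sum over total."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Confusion matrices on the training set, copied from Table 4.2.
training_set = {
    "Decision tree": [[156, 9], [29, 109]],
    "Naïve Bayes": [[149, 16], [31, 107]],
    "Multilayer perceptron": [[160, 5], [9, 129]],
}
for name, matrix in training_set.items():
    print(f"{name}: {accuracy(matrix):.2%}")
# Decision tree: 87.46%
# Naïve Bayes: 84.49%
# Multilayer perceptron: 95.38%
```

These values agree with the training set row of Table 4.1.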
        Decision tree    Naïve Bayes    Multilayer perceptron
        A      B         A      B       A      B
A       68     13        72     9       67     14
B       25     45        17     53      18     52
Table 4.3 Confusion matrix based on 50% split
With the 50% split configuration above, Naïve Bayes gives the most accurate classification,
followed by the multilayer perceptron and then the decision tree.
        Decision tree    Naïve Bayes    Multilayer perceptron
        A      B         A      B       A      B
A       45     5         45     6       39     18
B       13     39        10     42      8      44
Table 4.4 Confusion matrix based on 66% split
With the 66% configuration, Naïve Bayes still outperforms the rest, followed, according to
the accuracies in Table 4.1, by the decision tree and then the multilayer perceptron.
5. Conclusion
The conclusions that can be drawn from developing the expert system using neural networks
are as follows: (1) A dataset of heart disease from the UCI machine learning repository,
obtained from previous research, was used. The dataset consists of 303 patients with varying
symptoms. The data was preprocessed and then clustered using the k-means clustering algorithm
with a value of k = 2. Also, a questionnaire was designed and administered online in order to
develop the web-based expert system that can predict or classify heart-related risk. (2)
Developing the expert system using JavaServer Pages (JSP) consisted of several steps, among
which are feasibility studies, design, and knowledge acquisition and representation; the result
is a web-based expert system to predict heart disease. The system has been tested using
validation and prototyping. (3) The implementation of the system is an application that
provides the user with a set of attributes, which the user is expected to fill in to get an
instant response as to whether he or she is at risk or not. The application has been tested by
a doctor, who recommends that some improvements be made in future work.
The experiments conducted on the two datasets using different configurations produced some
very interesting results. It was found that it is very difficult to say which is the best
configuration, although because of time constraints only a few configurations were used.
However, it was found that the 50% split was the worst configuration on the two datasets used
in the experiments.
The analysis also produced some very interesting results. It was found that, based on the
two datasets, that is, the diabetes and heart disease databases, Naïve Bayes performed better
than the multilayer perceptron, followed by the decision tree, while on the heart disease
dataset the multilayer perceptron had the best performance, followed by Naïve Bayes and then
the decision tree.
It can therefore be concluded that, because of the nature and complexity of medical data, it is
very difficult to say which method has the best overall performance on medical datasets; rather,
different methods work better on different specific datasets.
The results obtained show the applicability of data mining algorithms to medical datasets,
but care should be taken in choosing the algorithm for a particular dataset, and the results
should not be generalized.
References
[1] Chae Y. M., Kim H. S., Tark K. C., Park H. J., Ho S. H. (2003) ‘Analysis of healthcare quality
indicator using data mining and decision support system’. Expert Systems with Applications,
pp. 167–172.
[2] Creswell J. W. (2012) Research Design: Qualitative, Quantitative and Mixed Methods
Approaches. 2nd edn. Sage Publications, Thousand Oaks CA.
[3] Dawson C. W. (2010) The Essence of Computing Projects – a Student's Guide. Prentice Hall,
Harlow UK.
[4] Detmer W., Barnett G., Hersh W. (2002) ‘MedWeaver: Integrating Decision Support,
Literature Searching and Web Exploration using the UMLS Metathesaurus’.
[5] Durairaj M., Ranjani V. (2013) ‘Data Mining Applications In Healthcare Sector: A Study’,
2(10).
[6] DXp HST Lec 05.pdf (2009) Computer Science and Artificial Intelligence Laboratory
(CSAIL), MIT.
[7] Electronic Decision Support for Australia’s Health Sector, National Electronic Decision
Support Taskforce, 2002.
[8] http://groups.csail.mit.edu/medg/courses/6872/2004/DXp%20HST%20Lec%2005.pdf
(retrieved on 1.05.2014)
[9] Hardin J., Chhieng D. (2008) ‘support system’, pp. 44-63.
[10] Mitchell T. M. (2007) Machine Learning, Redmond, McGraw-Hill.
[11] Mora M., Forgionne G. A., Jupta J. (2012) Decision Making Support Systems: Achievements,
Trends and Challenges for the Next Decade. Idea-Group, Hershey, PA.
[12] Newman D. J., Hettich S., Blake C. L., Merz C. J., UCI Repository of machine learning
databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University
of California, Department of Information and Computer Science (retrieved on 2.05.2014)
[13] Nong Y. (2003) The Handbook of Data Mining. Lawrence Erlbaum Associates.
[14] Oyyathevan S., Askarunisa A. (2014) ‘An Expert System for Heart Disease Prediction Using
Data Mining Technique’, Research Paper, 1(4), pp. 1-6.
[15] Shu-Mei W., Yu C., Yu-Mei, Cheng-Fang Y., Hui-Lian C. (2005) ‘Decision-making tree for
women considering hysterectomy’, Journal of Advanced Nursing, Blackwell Publishing,
pp. 361-368.
[16] Tang Z., MacLennan J. (2005) Data Mining with SQL Server 2005. Indianapolis, Indiana,
USA, Wiley Publishing Inc.
[17] Teach R., Shortliffe E. (2001) ‘An analysis of physician attitudes regarding computer-based
clinical consultation systems’. Computers and Biomedical Research, 14, pp. 542-558.
[18] Turkoglu I., Arslan A., Ilkay E. (2006) ‘An expert system for diagnosis of the heart valve
diseases’. Expert Systems with Applications, 23(3), pp. 229–236.
[19] WHO, Fact sheet No. 297: Cancer, 2006. Retrieved on 26.04.2014.
[20] Witten I. H., Frank E. (2005) Data Mining: Practical Machine Learning Tools and
Techniques, 2nd edn. Elsevier.