Data Mining with Oracle using Clustering and Classification Algorithms
Presented by Nhamo Mdzingwa
Supervisor: John Ebden

Overview of Presentation
- Objective of Research
- Background
- Methodology
  - Approach
  - Implementation
- Results
- Conclusions
- Questions

Objective of Research: Problem statement 1
- Evaluate two types of algorithms available in Oracle10g for data mining (ODM)
- Determine which algorithm builds the most effective model, and under what circumstances
- Determine which model produces the most accurate results when applied to new data

Objective of Research: Problem statement 2
- Gather information from the mined dataset
- Find prevention predictors of HIV/AIDS
- To do this, distinguish clusters
- Or use other mining algorithms to achieve the goal

Background: Introduction
- Data mining is a powerful and new technology.
- It is steered by the revolutionary progress in digital data acquisition and storage, which has resulted in the creation of huge databases.

Background: Definition
- Data mining is a process of extracting knowledge from large amounts of data, or simply knowledge discovery in databases
- It is the finding of interesting patterns in data

Methodology: Data mining tool
- Oracle10g database release 1 was installed and configured
- Oracle Data Miner 10g (ODM) was also installed and configured for use with the database

Methodology: Algorithms in ODM
- Classification
  - Adaptive Bayes Network
  - Naive Bayes
  - Model Seeker
- Association rules
  - Apriori
- Clustering
  - k-Means
  - O-Cluster

Methodology: Clustering Algorithms
- Clustering algorithms support identifying naturally occurring groupings within the data population.
- K-Means
  - Minimum Error Tolerance and Maximum Iterations
  - Maximum number of Clusters (k)
- O-Cluster
  - Sensitivity
  - Maximum number of Clusters (k)

Methodology: Dataset used
- Obtained from the Centre for AIDS Development, Research and Evaluation, Institute for Social and Economic Research, Rhodes University
- Based on a questionnaire survey
  - HIV/AIDS related
  - Tsha Tsha - HIV/AIDS awareness program

Methodology: Dataset used
- 2 data sets put into database tables
- TSHA_TSHA_BUILD1
  - 500 records
  - Used to build and test models
- TSHA_TSHA_APPLY1
  - 399 records
  - Used to validate models

Methodology: Determining model accuracy
- Confidence is a measure of the homogeneity of the cluster; that is, how close together the cluster members are
- Support is a measure of the relative size of a cluster (the total need not be 1.00); the higher the value, the larger the cluster

Methodology: Building and Testing the Models
- 20 models built in total
- The building was done in 2 phases
  1) Distinct number of clusters
  2) Equal number of clusters
- Algorithm settings: based on trial and error

Methodology: 1st phase model building (settings)

Methodology: 1st phase model accuracy

Methodology: 2nd phase model building
- To overcome the problem (bias), I decided to set k, the maximum number of clusters, to a fixed value.
- I set k to 7 for all cluster builds in this phase

Methodology: 2nd phase model results

Methodology: Applying the best models
- The most accurate models were applied to the new data TSHA_TSHA_APPLY1:
  - BUILD3_OC_TSHATSHA2 from O-Cluster
  - BUILD5_KM_TSHATSHA2 from K-Means

Methodology: Determining Cluster Quality
- Adopt and implement the evaluation technique by [Roiger et al, 2003], which involves employing supervised learning to evaluate unsupervised learning.
- Decided to use classification (ABN)
  - ODM has classification algorithms
  - The ABN algorithm has been identified as the most accurate in previous research

Methodology: Technique
Supervised Learning for Unsupervised Model Evaluation
1. Designate each formed cluster as a class and assign each class an arbitrary name.
2. Choose a random sample of instances from each class for supervised learning.
3. Build a supervised model from the chosen instances.
4. Employ the remaining instances to test the correctness of the model.

Methodology: Technique
- Build classification model using ABN
- Apply ABN model to remaining instances

Methodology: Comparison of ClusterIDs

CLASSIFICATION TABLE        CLUSTER TABLE
OC_APPLY_ABN          vs    APPLY_OC3_TSHATSHA (remaining instances, O-Cluster model results)
KM_APPLY_ABN          vs    APPLY_KM5_TSHATSHA (remaining instances, K-Means model results)

Methodology: Comparison of ClusterIDs

DATA SOURCE              ClusterIDs in BOTH TABLES    PERCENTAGE of ClusterIDs in both models
For O-Cluster results    42 out of 107                39%
For K-Means results      18 out of 107                17%

Determining HIV Predictors: defining predictors
- HIV/AIDS predictors of prevention behavior are attributes within our dataset that influence an individual to:
  (A) use a condom when he/she decides to be sexually active
  (B) abstain from sexual intercourse for a year or more
  (C) have fewer sexual partners

Methodology: Determining HIV Predictors
- 2 techniques were used to find these predictors
  - Distinguishing the clusters found by the O-Cluster model
  - Employing association rules (Apriori)
- Applied to 2 datasets
  - Clusters found by the O-Cluster model
  - Dataset the O-Cluster model was applied to

Determining HIV Predictors: predictors found
- On distinguishing the clusters found, the attributes HIV test and Know Aids were identified as predictors of condom use and abstinence
- From the associations, the attributes HIV test and Talk openly were identified as predictors of condom use
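As a rough illustration of how an association rule such as "HIV test implies condom use" can be scored, the sketch below computes the standard support and confidence of a rule over a handful of records. The field names (hiv_test, talk_openly, condom_use) and the eight records are invented for illustration only; they are not taken from the Tsha Tsha dataset, and the slides do not report rule metrics. This is a plain counting sketch, not ODM's Apriori implementation.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy records; NOT taken from the Tsha Tsha dataset.
RECORDS: List[Dict[str, bool]] = [
    {"hiv_test": True,  "talk_openly": True,  "condom_use": True},
    {"hiv_test": True,  "talk_openly": True,  "condom_use": True},
    {"hiv_test": True,  "talk_openly": False, "condom_use": True},
    {"hiv_test": True,  "talk_openly": False, "condom_use": False},
    {"hiv_test": False, "talk_openly": True,  "condom_use": True},
    {"hiv_test": False, "talk_openly": False, "condom_use": False},
    {"hiv_test": False, "talk_openly": False, "condom_use": False},
    {"hiv_test": False, "talk_openly": True,  "condom_use": False},
]


def rule_support_confidence(
    records: Sequence[Dict[str, bool]],
    antecedent: Sequence[str],
    consequent: str,
) -> Tuple[float, float]:
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = fraction of all records where antecedent and consequent hold
    confidence = fraction of antecedent-matching records where consequent holds
    """
    if not records:
        raise ValueError("records must not be empty")
    matches_antecedent = [r for r in records if all(r[a] for a in antecedent)]
    matches_both = [r for r in matches_antecedent if r[consequent]]
    support = len(matches_both) / len(records)
    confidence = (
        len(matches_both) / len(matches_antecedent) if matches_antecedent else 0.0
    )
    return support, confidence


if __name__ == "__main__":
    for lhs in (["hiv_test"], ["hiv_test", "talk_openly"]):
        s, c = rule_support_confidence(RECORDS, lhs, "condom_use")
        print(f"{lhs} -> condom_use: support={s:.3f}, confidence={c:.3f}")
```

An attribute would be a candidate predictor in this sense when rules with that attribute on the left-hand side have high confidence and non-trivial support; the thresholds to use are a modelling choice the slides do not specify.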

Determining HIV Predictors: The predictors
- HIV test – whether one has had an HIV test
- Know Aids – whether one knows about AIDS
- Talk openly – whether one talks openly about HIV/AIDS

Conclusions: Regarding the evaluation
- The O-Cluster algorithm produced the most effective model
  - Accuracy: 95.5%
  - When applied to new data: 39%
- Most effective model by K-Means
  - Accuracy: 86.9%
  - When applied to new data: 17%

Conclusions: Regarding ODM Algorithms
- Classification
  1. Adaptive Bayes Network
  2. Naive Bayes
  3. Model Seeker
- Clustering
  1. k-Means
  2. O-Cluster
- Association rules
  1. Apriori

Conclusions: observations
- Model accuracy gives some indication of how the model will perform on new data
- It is therefore recommended to find the most accurate model in order to obtain accurate results
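
The 39% and 17% figures for new data follow directly from the cluster-ID comparison (42 and 18 agreeing IDs out of 107 instances). The sketch below reproduces that arithmetic. It assumes that "ClusterIDs in both tables" means the ID predicted by the ABN classifier equals the ID assigned by the clustering model for the same instance; the function names and the short example lists are hypothetical.

```python
from typing import Sequence


def count_matching(cluster_ids: Sequence[int], predicted_ids: Sequence[int]) -> int:
    """Count instances whose clustering ID equals the classifier-predicted ID."""
    if len(cluster_ids) != len(predicted_ids):
        raise ValueError("both ID lists must cover the same instances")
    return sum(1 for c, p in zip(cluster_ids, predicted_ids) if c == p)


def agreement_percentage(matching: int, total: int) -> float:
    """Percentage of instances on which the two tables agree."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * matching / total


if __name__ == "__main__":
    # Counts reported in the comparison of ClusterIDs.
    print(round(agreement_percentage(42, 107)))  # O-Cluster
    print(round(agreement_percentage(18, 107)))  # K-Means

    # Hypothetical ID lists, only to show how a count could be obtained.
    print(count_matching([1, 2, 3, 3], [1, 2, 2, 3]))
```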