Department of Computer Engineering
Lab Manual
Final Year, Semester VIII
Subject: Data Warehouse and Mining
Even Semester

Institutional Vision, Mission and Quality Policy

Our Vision
To foster and permeate higher and quality education with value-added engineering and technology programs, providing all facilities in terms of technology and platforms for all-round development with societal awareness, and to nurture the youth with international competencies and an exemplary level of employability even under a highly competitive environment, so that they are innovative, adaptable and capable of handling problems faced by our country and the world at large.
RAIT's firm belief in a new form of engineering education, one that lays equal stress on academics and on leadership-building extracurricular skills, has been a major contributor to the success of RAIT as one of the most reputed institutions of higher learning. The challenges faced by our country and the world in the 21st century need a whole new range of thought and action leaders, which a conventional educational system in the engineering disciplines is ill equipped to produce. Our reputation for providing good engineering education with additional life skills ensures that high-grade and highly motivated students join us. Our laboratories and practical sessions reflect the latest practices followed in industry. The project work and summer projects make our students adept at handling real-life problems and industry ready. Our students are well placed in industry, and their performance makes reputed companies visit us with renewed demand and vigour.

Our Mission
The Institution is committed to mobilize the resources and equip itself with men and materials of excellence, thereby ensuring that the Institution becomes a pivotal center of service to industry, academia and society with the latest technology. RAIT engages different platforms such as technology-enhancing student technical societies, cultural platforms, sports excellence centers, an Entrepreneurial Development Center and a Societal Interaction Cell. To develop the college into an autonomous institution and deemed university at the earliest, with facilities for advanced research and development programs on par with international standards. To invite international and reputed national institutions and universities to collaborate with our institution on issues of common interest in teaching and learning sophistication.
RAIT's mission is to produce engineering and technology professionals who are innovative and inspiring thought leaders, adept at solving problems faced by our nation and the world, by providing quality education. The Institute works closely with all stakeholders, such as industry and academia, to foster knowledge generation, acquisition and dissemination using the best available resources to address the great challenges being faced by our country and the world. RAIT is fully dedicated to providing its students skills that make them leaders and solution providers, and industry ready when they graduate from the Institution.
We at RAIT assure our main stakeholders, our students, of 100% quality for the programmes we deliver. This quality assurance stems from the teaching and learning processes we have at work on our campus and from the teachers, who are handpicked from reputed institutions (IIT/NIT/MU, etc.) and who inspire the students to be innovative in thinking and practical in approach.
We have installed internal procedures to improve the skill set of our instructors by sending them to training courses, workshops, seminars and conferences. We also have a full-fledged course curriculum, with deliveries planned in advance, for a structured semester-long programme. We have a well-developed feedback system, taking input from employers, alumni, students and parents, to fine-tune the learning and teaching processes. These tools help us ensure the same quality of teaching independent of any individual instructor. Each classroom is equipped with Internet access and other digital learning resources. The effective learning process on the campus comprises a clean and stimulating classroom environment and the availability of lecture notes and digital resources prepared by the instructor, accessible from the comfort of home. In addition, the student is provided with a good number of assignments that trigger the thinking process. The testing process involves an objective test paper that gauges the students' understanding of concepts. The quality assurance process also ensures that the learning process is effective. The summer internships and project-based training ensure that the learning process includes practical and industry-relevant aspects. Various technical events, seminars and conferences make the student's learning complete.

Our Quality Policy
It is our earnest endeavour to produce high-quality engineering professionals who are innovative and inspiring, thought and action leaders, competent to solve problems faced by society, the nation and the world at large, by striving towards very high standards in learning, teaching and training methodologies.
Our Motto: If it is not of quality, it is NOT RAIT!
Dr. Vijay D. Patil, President, RAES

Departmental Vision, Mission

Vision
To impart higher and quality education in computer science with value-added engineering and technology programs, to prepare technically sound, ethically strong engineers with social awareness. To extend the facilities to meet the fast-changing requirements and to nurture the youth with international competencies and an exemplary level of employability and research under highly competitive environments.

Mission
To mobilize the resources and equip the institution with men and materials of excellence to provide knowledge and develop technologies in the thrust areas of computer science and engineering. To provide diverse platforms of sports, technical, co-curricular and extracurricular activities for the overall development of students with an ethical attitude. To prepare the students to sustain the impact of computer education for social needs encompassing industry, educational institutions and public service. To collaborate with IITs, reputed universities and industries for the technical and overall upliftment of students, for continued learning and entrepreneurship.

Departmental Program Educational Objectives (PEOs)
1. Learn and Integrate: To provide Computer Engineering students with a strong foundation in the mathematical, scientific and engineering fundamentals necessary to formulate, solve and analyze engineering problems, and to prepare them for graduate studies.
2. Think and Create: To develop an ability to analyze the requirements of software and hardware, understand the technical specifications, create a model, and design, implement and verify a computing system that meets specified requirements while considering real-world constraints, in order to solve real-world problems.
3. Broad Base: To provide the broad education necessary to understand the science of computer engineering and its impact in a global and social context.
4. Techno-leader: To provide exposure to emerging cutting-edge technologies, and adequate training and opportunities to work as teams on multidisciplinary projects, with effective communication skills and leadership qualities.
5. Practice Citizenship: To provide knowledge of professional and ethical responsibility and to encourage contribution to society through active engagement with professional societies, schools, civic organizations or other community activities.
6. Clarify Purpose and Perspective: To provide strong in-depth education through electives, and to promote student awareness of life-long learning in order to adapt to innovation and change and to be successful in professional work or graduate studies.

Departmental Program Outcomes (POs)
Pa. Foundation of Computing: An ability to apply knowledge of computing, applied mathematics, and fundamental engineering concepts appropriate to the discipline.
Pb. Experiments and Data Analysis: An ability to understand, identify, analyze and design the problem, and implement and validate the solution, including both hardware and software.
Pc. Current Computing Techniques: An ability to use current techniques, skills, and tools necessary for computing practice.
Pd. Teamwork: An ability to apply leadership and management skills to accomplish a common goal.
Pe. Engineering Problems: An ability to identify, formulate, and solve engineering problems.
Pf. Professional Ethics: An understanding of professional, ethical, legal, security and social issues and responsibilities.
Pg. Communication: An ability to communicate effectively with a range of audiences in both verbal and written form.
Ph. Impact of Technology: An ability to analyze the local and global impact of computing on individuals, organizations, and society.
Pi. Life-long Learning: An ability to recognize the need for, and an ability to engage in, life-long learning.
Pj. Contemporary Issues: An ability to exploit gained skills and knowledge of contemporary issues.
Pk. Professional Development: Recognition of the need for, and an ability to engage in, continuing professional development and higher studies.
Pl. Employment: An ability to gain employment in industries of international repute through training programs, internships, projects, workshops and seminars.

Index
1. List of Experiments
2. Course Objectives, Course Outcomes and Experiment Plan
3. CO-PO Mapping
4. Study and Evaluation Scheme
5. Experiment No. 1
6. Experiment No. 2
7. Experiment No. 3
8. Experiment No. 4
9. Experiment No. 5
10. Experiment No. 6
11. Experiment No. 7
12. Experiment No. 8
13. Experiment No. 9
14. Experiment No. 10
15. Experiment No. 11

List of Experiments
1. Case study on a data mart / data warehouse system (dimensional modelling, fact and dimension tables, OLAP operations).
2. Implementation of the decision tree (ID3) algorithm in Java.
3. Implementation of the ID3 algorithm using the WEKA tool.
4. Implementation of K-means clustering in Java.
5. Implementation of the K-means clustering algorithm using the WEKA tool.
6. Study and implementation of the Apriori algorithm in Java.
7. Implementation of the Apriori algorithm in WEKA.
8. Study of the R tool.
9. Study of a BI tool (SPSS Clementine).
10. Study of different OLAP operations.
11. Study of different pre-processing steps for a data warehouse.
Course Objectives, Course Outcomes and Experiment Plan

Course Objectives:
1. To study the methodology of engineering legacy databases for data warehousing.
2. To study the design modeling of a data warehouse.
3. To study the pre-processing and online analytical processing of data.
4. To study the methodology of engineering legacy data mining to derive business rules for decision support systems.
5. To analyze the data, identify the problems, and choose the relevant models and algorithms to apply.

Course Outcomes:
CO1: Students will be able to understand the data warehouse and the design model of a data warehouse.
CO2: Students will be able to learn the steps of pre-processing.
CO3: Students will be able to understand the analytical operations on data.
CO4: Students will be able to discover patterns and knowledge from a data warehouse.
CO5: Students will be able to understand and implement classical algorithms in data mining and data warehousing.

Experiment Plan (week, experiment, course outcome):
1. W1-W2: One case study given to a group of 3/4 students on a data mart / data warehouse (CO1, weightage 10).
2. W3: Implementation of a classifier such as a decision tree using Java (CO5).
3. W4: Use WEKA to implement a decision tree (CO5).
4. W5: Implementation of a clustering algorithm such as K-means using Java (CO5).
5. W6: Use WEKA to implement the K-means clustering algorithm (CO5).
6. W7: Implementation of association mining such as Apriori using Java (CO5).
7. W8: Use WEKA to implement association mining such as Apriori (CO5).
8. W9: Use the R tool to implement clustering / association rule / classification algorithms (CO3).
9. W10: Detailed study of a BI tool - SPSS Clementine (CO3).
10. W11: Study different OLAP operations (CO4).
11. W12: Study different pre-processing steps for the DWH (CO2).

Mapping of Course Outcomes (CO) to Program Outcomes (PO)
Subject weight: Practical 50%. Each course outcome contributes, with a strength between 1 and 3, to several of the program outcomes Pa to Pl:
CO1: Students will be able to understand the data warehouse and the design model of a data warehouse.
CO2: Students will be able to learn the steps of pre-processing.
CO3: Students will be able to understand the analytical operations on data.
CO4: Students will be able to discover patterns and knowledge from a data warehouse.
CO5: Students will be able to understand and implement classical algorithms in data mining and data warehousing; students will be able to assess the strengths and weaknesses of the algorithms, identify the application areas of the algorithms, and apply them.

Study and Evaluation Scheme
Course Code: CPC801. Course Name: Data Warehouse and Mining.
Teaching Scheme: Theory 04, Practical 02, Tutorial --.
Credits Assigned: Theory 04, Practical 01, Tutorial --, Total 05.
Examination Scheme: Term Work 25, Practical 25, Total 50.

Term Work:
Internal assessment consists of two tests. Test 1, an institution-level central test, is for 20 marks and is to be based on a minimum of 40% of the syllabus. Test 2 is also for 20 marks and is to be based on the remaining syllabus.
Test 2 may be either a class test, an assignment on live problems, or a course project.

Practical & Oral:
The oral examination is to be conducted by a pair of internal and external examiners, based on the syllabus.

Data Warehouse and Mining
Experiment No. 1: Case study on a Data Warehouse System

Experiment No. 1
1. Aim: One case study on a data warehouse system.
A. Write a detailed problem statement and create the dimensional model (star and snowflake schema).
B. Implement all dimension tables and the fact table.
C. Implement OLAP operations.

2. Objectives: From this experiment, the student will be able to
• Understand the basics of a data warehouse
• Understand the design model of a data warehouse
• Study the methodology of engineering legacy databases for data warehousing

3. Outcomes: The learner will be able to
• Apply knowledge of legacy databases in creating a data warehouse
• Understand, identify, analyse and design the warehouse
• Use current techniques, skills and tools necessary for designing a data warehouse

4. Software Required: Oracle 11g

5. Theory:
In computing, online analytical processing (OLAP) is an approach to answering multi-dimensional analytical (MDA) queries swiftly. OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining. Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and similar areas, with new applications coming up, such as agriculture. The term OLAP was created as a slight modification of the traditional database term online transaction processing (OLTP).

Dimensional modeling: Dimensional modeling (DM) names a set of techniques and concepts used in data warehouse design. It is considered to be different from entity-relationship (ER) modeling. Dimensional modeling does not necessarily involve a relational database; the same modeling approach, at the logical level, can be used for any physical form, such as a multidimensional database or even flat files. DM is a design technique for databases intended to support end-user queries in a data warehouse. It is oriented around understandability and performance.

Star Schema: The fact table is in the middle and the dimension tables are arranged around the fact table.

Snowflake Schema: Normalization and expansion of the dimension tables in a star schema result in a snowflake design. Snowflaking the dimensional model can impact understandability and result in a decrease in performance, because more tables need to be joined to satisfy queries. A small sketch of a star schema follows this section.
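To make the star schema concrete outside the database, the following is a minimal Java sketch (assuming Java 16 or later for record support; the SalesFact, ProductDim and TimeDim names and all values are invented for illustration and are not part of the case study). The fact table holds only foreign keys and measures; a roll-up of sales by product category is then a simple join-and-aggregate over the fact rows.

import java.util.*;

// Minimal illustration of a star schema: one fact table, two dimension tables.
public class StarSchemaSketch {
    record ProductDim(int productKey, String name, String category) {}
    record TimeDim(int timeKey, int year, int month) {}
    // A fact row holds only foreign keys and measures.
    record SalesFact(int productKey, int timeKey, double amount) {}

    public static void main(String[] args) {
        Map<Integer, ProductDim> product = Map.of(
            1, new ProductDim(1, "Pen", "Stationery"),
            2, new ProductDim(2, "Notebook", "Stationery"),
            3, new ProductDim(3, "Mouse", "Electronics"));
        Map<Integer, TimeDim> time = Map.of(
            10, new TimeDim(10, 2023, 1),
            11, new TimeDim(11, 2023, 2));
        List<SalesFact> facts = List.of(
            new SalesFact(1, 10, 120.0), new SalesFact(2, 10, 80.0),
            new SalesFact(3, 11, 450.0), new SalesFact(1, 11, 60.0));

        // Roll-up: total sales per product category (join fact -> product dimension, then aggregate).
        Map<String, Double> byCategory = new TreeMap<>();
        for (SalesFact f : facts) {
            String category = product.get(f.productKey()).category();
            byCategory.merge(category, f.amount(), Double::sum);
        }
        System.out.println(byCategory);   // {Electronics=450.0, Stationery=260.0}
    }
}

Running the sketch prints the total sales per category, which is exactly the kind of summary query a star schema is designed to answer quickly.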
6. Conclusion: We studied the different schemas of a data warehouse and, using the methodology of engineering a legacy database, built a new data warehouse. Normalization was applied wherever required on the star schema, and a snowflake schema was designed.

7. Viva Questions:
• What is a data warehouse?
• What is multi-dimensional data?
• What is the difference between a star and a snowflake schema?

8. References:
• Paulraj Ponniah, "Data Warehousing: Fundamentals for IT Professionals", Wiley India.
• Reema Thareja, "Data Warehousing", Oxford University Press.

Data Warehouse and Mining
Experiment No. 2: Implementation of the decision tree algorithm in Java

Experiment No. 2
1. Aim: Implementation of the decision tree algorithm in Java.

2. Objectives: From this experiment, the student will be able to
• Analyse the data, identify the problem and choose the relevant algorithm to apply
• Understand and implement classical algorithms in data mining
• Identify the applications of classification algorithms in data mining

3. Outcomes: The learner will be able to
• Assess the strengths and weaknesses of algorithms
• Identify, formulate and solve engineering problems
• Analyse the local and global impact of data mining on individuals, organizations and society

4. Software Required: JDK for Java

5. Theory:
Decision tree learning is one of the most widely used and practical methods for inductive inference over supervised data. A decision tree represents a procedure for classifying categorical data based on its attributes. It is also efficient for processing large amounts of data, so it is often used in data mining operations. The construction of a decision tree does not require any domain knowledge or parameter setting, and is therefore appropriate for exploratory knowledge discovery. A decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. The core algorithm for building decision trees, called ID3, is due to J. R. Quinlan; it employs a top-down, greedy search through the space of possible branches with no backtracking. ID3 uses entropy and information gain to construct a decision tree.

Entropy: A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous the entropy is zero, and if the sample is equally divided it has an entropy of one. To build a decision tree, we need to calculate two types of entropy using frequency tables.

Information Gain: The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).

6. Procedure/Program:
1. Calculate the entropy of the target.
2. Split the dataset on the different attributes. Calculate the entropy for each branch and add the entropies proportionally to get the total entropy for the split. Subtract the resulting entropy from the entropy before the split. The result is the information gain, or decrease in entropy.
3. Choose the attribute with the largest information gain as the decision node.
4. A branch with an entropy of 0 is a leaf node; a branch with an entropy greater than 0 needs further splitting.
5. Run the ID3 algorithm recursively on the non-leaf branches until all data is classified.

7. Results: Decision Tree to Decision Rules. A decision tree can easily be transformed into a set of rules by mapping the paths from the root node to the leaf nodes one by one. A sketch of the entropy and information-gain computation appears below.
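As a sketch of steps 1 to 3 of the procedure, the following Java program computes the entropy of a node from its class frequency counts and the information gain of a candidate split. The counts in main are hypothetical, and the class and method names are illustrative only.

// Sketch of the entropy and information-gain calculations used by ID3.
public class InfoGainSketch {
    // Entropy of a node, given the class frequency counts.
    static double entropy(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        double e = 0.0;
        for (int c : classCounts) {
            if (c == 0) continue;
            double p = (double) c / total;
            e -= p * (Math.log(p) / Math.log(2));   // log base 2
        }
        return e;
    }

    // Information gain = entropy(parent) - weighted sum of the branch entropies.
    static double informationGain(int[] parentCounts, int[][] branchCounts) {
        int total = 0;
        for (int c : parentCounts) total += c;
        double weighted = 0.0;
        for (int[] branch : branchCounts) {
            int n = 0;
            for (int c : branch) n += c;
            weighted += ((double) n / total) * entropy(branch);
        }
        return entropy(parentCounts) - weighted;
    }

    public static void main(String[] args) {
        // Hypothetical target with 9 "yes" and 5 "no" examples,
        // split by a candidate attribute into three branches.
        int[] parent = {9, 5};
        int[][] branches = {{2, 3}, {4, 0}, {3, 2}};
        System.out.printf("Entropy(parent) = %.3f%n", entropy(parent));
        System.out.printf("Information gain = %.3f%n", informationGain(parent, branches));
    }
}

The attribute with the largest value returned by informationGain would be chosen as the decision node in step 3.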
8. Conclusion: The different classification algorithms of data mining were studied and one of them, the decision tree (ID3) algorithm, was implemented using Java. The need for classification algorithms was recognized and understood.

9. Viva Questions:
• What are the various classification algorithms?
• What is entropy?
• How do you find information gain?

10. References:
• Han and Kamber, "Data Mining: Concepts and Techniques", 3rd Edition, Morgan Kaufmann.
• M. H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education.

Data Warehouse and Mining
Experiment No. 3: Implementation of the ID3 algorithm using the WEKA tool

Experiment No. 3
1. Aim: Implementation of the ID3 algorithm using the WEKA tool.

2. Objectives: From this experiment, the student will be able to
• Analyse the data, identify the problem and choose the relevant algorithm to apply
• Understand and implement classical algorithms in data mining
• Identify the applications of classification algorithms in data mining

3. Outcomes: The learner will be able to
• Assess the strengths and weaknesses of algorithms
• Identify, formulate and solve engineering problems
• Analyse the local and global impact of data mining on individuals, organizations and society

4. Software Required: WEKA tool

5. Theory:
Decision tree learning is a method for assessing the most likely outcome value by taking into account the known values of the stored data instances. This learning method is among the most popular inductive inference algorithms and has been successfully applied to a broad range of tasks, such as assessing the credit risk of applicants and improving the loyalty of regular customers.

6. Procedure:
1. Download a dataset for the implementation of the ID3 algorithm (.csv or .arff file). Here the bankdata.csv dataset has been taken for decision tree analysis.
2. Load the data in the WEKA tool.
3. Select the "Classify" tab and click the "Choose" button to select the ID3 classifier.
4. Specify the various parameters. These can be specified by clicking in the text box to the right of the "Choose" button. In this example we accept the default values. The default version does perform some pruning (using the subtree raising approach), but does not perform error pruning.
5. Under "Test options" in the main panel we select 10-fold cross-validation as our evaluation approach. Since we do not have a separate evaluation data set, this is necessary to get a reasonable idea of the accuracy of the generated model. We now click "Start" to generate the model.
6. We can view this information in a separate window by right-clicking the last result set (inside the "Result list" panel on the left) and selecting "View in separate window" from the pop-up menu.
7. WEKA also provides a graphical rendition of the classification tree. This can be obtained by right-clicking the last result set (as before) and selecting "Visualize tree" from the pop-up menu. We will now use our model to classify new instances. However, in the data section of the new file, the value of the "pep" attribute is "?" (i.e., unknown). In the main panel, under "Test options", click the "Supplied test set" radio button, and then click the "Set..." button. This pops up a window which allows you to open the file containing the test instances. In this case, we open the file "bank-new.arff" and, upon returning to the main window, we click the "Start" button. This once again generates the model from our training data, but this time it applies the model to the new, unclassified instances in the "bank-new.arff" file in order to predict the value of the "pep" attribute. The summary of the results in the right panel does not show any statistics. This is because in our test instances the value of the class attribute ("pep") was left as "?", so WEKA has no actual values to which it can compare the predicted values of the new instances. The GUI version of WEKA is used to create a file containing all the new instances along with their predicted class values resulting from the application of the model.
First, right-click the most recent result set in the left "Result list" panel. In the resulting pop-up window select the menu item "Visualize classifier errors". This brings up a separate window containing a two-dimensional graph.
8. To save the file: In the new window, we click on the "Save" button and save the result as the file "bank-predicted.arff". This file contains a copy of the new instances along with an additional column for the predicted value of "pep".

7. Conclusion: The different classification algorithms of data mining were studied and one of them, the decision tree (ID3) algorithm, was implemented using the WEKA tool. The need for classification algorithms was recognized and understood.

8. Viva Questions:
• What is the use of the WEKA tool?

9. References:
• Han and Kamber, "Data Mining: Concepts and Techniques", 3rd Edition, Morgan Kaufmann.
• M. H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education.

Data Warehouse and Mining
Experiment No. 4: Implementation of K-means clustering in Java

Experiment No. 4
1. Aim: Implementation of K-means clustering in Java.

2. Objectives: From this experiment, the student will be able to
• Analyse the data, identify the problem and choose the relevant algorithm to apply
• Understand and implement classical clustering algorithms in data mining
• Identify the applications of clustering algorithms in data mining

3. Outcomes: The learner will be able to
• Assess the strengths and weaknesses of algorithms
• Identify, formulate and solve engineering problems
• Analyse the local and global impact of data mining on individuals, organizations and society

4. Software Required: JDK for Java

5. Theory:
Clustering is dividing data points into homogeneous classes or clusters: points in the same group are as similar as possible, and points in different groups are as dissimilar as possible. When a collection of objects is given, we put objects into groups based on similarity.
Clustering Algorithms: A clustering algorithm tries to analyse natural groups of data on the basis of some similarity. It locates the centroid of each group of data points. To carry out effective clustering, the algorithm evaluates the distance between each point and the centroid of the cluster. The goal of clustering is to determine the intrinsic grouping in a set of unlabelled data.
K-means Clustering: K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining.

6. Procedure:
Input: K, the number of clusters; D, a data set containing n objects.
Output: A set of K clusters.
1. Arbitrarily choose K objects from D as the initial cluster centers.
2. Partition the objects into K non-empty subsets.
3. Identify the cluster centroids (mean points) of the current partition.
4. Assign each point to a specific cluster.
5. Compute the distance from each point to each centroid and allot each point to the cluster whose centroid is nearest.
6. After re-allotting the points, find the centroids of the newly formed clusters.
A compact Java sketch of this procedure is given below.
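The following is a minimal, self-contained Java sketch of the above procedure for one-dimensional points. The data values and K = 2 are arbitrary illustrations; a full implementation would use Euclidean distance over all attributes and a better initialization.

import java.util.*;

// Minimal K-means for 1-D points, following the procedure above.
public class KMeansSketch {
    public static void main(String[] args) {
        double[] points = {1.0, 1.5, 2.0, 8.0, 8.5, 9.0, 25.0};
        int k = 2;
        // Step 1: arbitrarily take the first k points as the initial centroids.
        double[] centroids = Arrays.copyOf(points, k);
        int[] assign = new int[points.length];

        boolean changed = true;
        while (changed) {                       // iterate until no point changes its cluster
            changed = false;
            // Steps 4-5: assign each point to the nearest centroid.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (Math.abs(points[i] - centroids[c]) < Math.abs(points[i] - centroids[best]))
                        best = c;
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            // Step 6: recompute each centroid as the mean of its cluster.
            for (int c = 0; c < k; c++) {
                double sum = 0; int n = 0;
                for (int i = 0; i < points.length; i++)
                    if (assign[i] == c) { sum += points[i]; n++; }
                if (n > 0) centroids[c] = sum / n;
            }
        }
        System.out.println("Centroids:   " + Arrays.toString(centroids));
        System.out.println("Assignments: " + Arrays.toString(assign));
    }
}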
7. Conclusion: The different clustering algorithms of data mining were studied and one of them, the K-means clustering algorithm, was implemented using Java. The need for clustering algorithms was recognized and understood.

8. Viva Questions:
• What are the different clustering techniques?
• What is the difference between K-means and K-medoids?
• What is a dendrogram?

9. References:
• Han and Kamber, "Data Mining: Concepts and Techniques", 3rd Edition, Morgan Kaufmann.
• M. H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education.

Data Warehouse and Mining
Experiment No. 5: To implement the K-means clustering algorithm using the WEKA tool

Experiment No. 5
1. Aim: To implement the clustering algorithm K-means using the WEKA tool.

2. Objectives: From this experiment, the student will be able to
• Analyse the data, identify the problem and choose the relevant algorithm to apply
• Understand and implement classical clustering algorithms in data mining
• Identify the applications of clustering algorithms in data mining

3. Outcomes: The learner will be able to
• Assess the strengths and weaknesses of algorithms
• Identify, formulate and solve engineering problems
• Analyse the local and global impact of data mining on individuals, organizations and society

4. Software Required: WEKA tool

5. Theory:
Weka is a landmark system in the history of the data mining and machine learning research communities, because it is the only toolkit that has gained such widespread adoption and survived for an extended period of time. The key features responsible for Weka's success are:
• It provides many different algorithms for data mining and machine learning.
• It is open source and freely available.
• It is platform-independent.
• It is easily usable by people who are not data mining specialists.
• It provides flexible facilities for scripting experiments.
• It has kept up to date, with new algorithms being added regularly.

WEKA Interface: The GUI Chooser consists of four buttons, one for each of the four major Weka applications, and four menus. The buttons can be used to start the following applications:
• Explorer: An environment for exploring data with WEKA.
• Experimenter: An environment for performing experiments and conducting statistical tests between learning schemes.
• KnowledgeFlow: This environment supports essentially the same functions as the Explorer but with a drag-and-drop interface. One advantage is that it supports incremental learning.
• SimpleCLI: Provides a simple command-line interface that allows direct execution of WEKA commands, for operating systems that do not provide their own command-line interface.

WEKA Clusterer: It contains "clusterers" for finding groups of similar instances in a dataset. Some implemented schemes are k-means, EM, Cobweb, X-means and FarthestFirst. Clusters can be visualized and compared to "true" clusters.

6. Procedure:
The basic steps of K-means clustering are simple. In the beginning, we determine the number of clusters K and assume the centroids or centers of these clusters. We can take any random objects as the initial centroids, or the first K objects can also serve as the initial centroids. The K-means algorithm then performs the three steps below until convergence. Iterate until stable (no object moves group):
1. Determine the centroid coordinates.
2. Determine the distance of each object to the centroids.
3. Group the objects based on minimum distance (find the closest centroid).

K-means in WEKA 3.7: The sample data set used is based on the "bank data" available in comma-separated format as bank-data.csv. The resulting data file is "bank.arff" and includes 600 instances. As an illustration of performing clustering in WEKA, we will use its implementation of the K-means algorithm to cluster the customers in this bank data set and to characterize the resulting customer segments.
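For reference, the same clustering can also be driven from Java code through WEKA's API instead of the Explorer GUI. The sketch below is an assumption-laden outline: it presumes weka.jar is on the classpath and bank.arff is in the working directory, and it omits details such as removing the customer id attribute before clustering. The Explorer walkthrough continues after the sketch.

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Hedged sketch: running SimpleKMeans on bank.arff through WEKA's Java API.
public class WekaKMeansSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank.arff");   // same data set as in the GUI steps

        SimpleKMeans kMeans = new SimpleKMeans();
        kMeans.setNumClusters(2);      // two customer segments, as in the GUI example
        kMeans.setSeed(10);            // seed used for the random initial assignment
        kMeans.buildClusterer(data);

        System.out.println(kMeans);    // prints the cluster centroids and sizes
        for (int i = 0; i < data.numInstances(); i++)
            System.out.println("Instance " + i + " -> cluster "
                    + kMeans.clusterInstance(data.instance(i)));
    }
}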
To perform clustering, select the "Cluster" tab in the Explorer and click on the "Choose" button. This results in a drop-down list of available clustering algorithms. In this case we select "SimpleKMeans". Next, click on the text box to the right of the "Choose" button to get the pop-up window for editing the clustering parameters. In the pop-up window we enter 2 as the number of clusters and we leave the value of "seed" as is. The seed value is used in generating a random number which is, in turn, used for making the initial assignment of instances to clusters. Once the options have been specified, we can run the clustering algorithm. Here we make sure that in the "Cluster Mode" panel the "Use training set" option is selected, and we click "Start". We can right-click the result set in the "Result list" panel and view the results of clustering in a separate window. We can even visualize the assigned clusters: you can choose the cluster number and any of the other attributes for each of the three different dimensions available (x-axis, y-axis, and colour). Different combinations of choices will result in a visual rendering of different relationships within each cluster. Note that in addition to the "instance_number" attribute, WEKA has also added a "Cluster" attribute to the original data set. In the data portion, each instance now has its assigned cluster as the last attribute value.

7. Conclusion: The different clustering algorithms of data mining were studied and one of them, the K-means clustering algorithm, was implemented using the WEKA tool. The need for clustering algorithms was recognized and understood.

8. References:
• Han and Kamber, "Data Mining: Concepts and Techniques", 3rd Edition, Morgan Kaufmann.
• M. H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education.

Data Warehouse and Mining
Experiment No. 6: To study and implement the Apriori algorithm

Experiment No. 6
1. Aim: To study and implement the Apriori algorithm.

2. Objectives: From this experiment, the student will be able to
• Analyse the data, identify the problem and choose the relevant algorithm to apply
• Understand and implement classical association mining algorithms
• Identify the applications of association mining algorithms

3. Outcomes: The learner will be able to
• Assess the strengths and weaknesses of algorithms
• Identify, formulate and solve engineering problems
• Analyse the local and global impact of data mining on individuals, organizations and society

4. Software Required: JDK for Java

5. Theory:
The Apriori algorithm is a well-known association rule algorithm used in most commercial products. It uses the large-itemset property: any subset of a large itemset must be large.

6. Procedure:
Input: I (the set of items), D (the database of transactions), s (the support threshold).
Output: L (the set of large itemsets).
Apriori algorithm:
  k = 0; L = empty; C1 = I;              // initial candidates are the single items
  repeat
    k = k + 1;
    Lk = empty;
    for each Ii in Ck do ci = 0;         // reset the support count of each candidate
    for each transaction tj in D do
      for each Ii in Ck do
        if Ii is contained in tj then ci = ci + 1;
    for each Ii in Ck do
      if ci >= (s * |D|) then Lk = Lk U {Ii};
    L = L U Lk;
    Ck+1 = Apriori-Gen(Lk);
  until Ck+1 = empty;
A runnable Java sketch along the same lines is given below.
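The following is a runnable Java sketch along the lines of the pseudocode (the grocery items and the support threshold are invented for illustration). It performs candidate generation and support counting level by level, but omits the subset-pruning refinement of Apriori-Gen and the rule-generation phase.

import java.util.*;

// Compact Apriori sketch: finds frequent itemsets in a tiny transaction list.
public class AprioriSketch {
    public static void main(String[] args) {
        List<Set<String>> db = List.of(
            Set.of("bread", "milk"),
            Set.of("bread", "butter", "milk"),
            Set.of("bread", "butter"),
            Set.of("milk", "butter"),
            Set.of("bread", "milk", "butter"));
        int minSupportCount = 3;    // minimum number of transactions (60% of 5)

        Map<Set<String>, Integer> frequent = new LinkedHashMap<>();
        // C1: candidate single items taken from the transactions.
        Set<Set<String>> current = new HashSet<>();
        db.forEach(t -> t.forEach(item -> current.add(Set.of(item))));

        while (!current.isEmpty()) {
            // Count the support of each candidate and keep the frequent ones (Lk).
            Set<Set<String>> levelFrequent = new HashSet<>();
            for (Set<String> cand : current) {
                int count = (int) db.stream().filter(t -> t.containsAll(cand)).count();
                if (count >= minSupportCount) {
                    frequent.put(cand, count);
                    levelFrequent.add(cand);
                }
            }
            // Apriori-Gen (join step only): build candidates of size k+1 from Lk.
            Set<Set<String>> next = new HashSet<>();
            for (Set<String> a : levelFrequent)
                for (Set<String> b : levelFrequent) {
                    Set<String> union = new TreeSet<>(a);
                    union.addAll(b);
                    if (union.size() == a.size() + 1) next.add(union);
                }
            current.clear();
            current.addAll(next);
        }
        frequent.forEach((itemset, count) -> System.out.println(itemset + " support=" + count));
    }
}

With a minimum support count of 3, the frequent itemsets printed are the three single items and the three pairs; the triple {bread, butter, milk} occurs in only two transactions and is therefore not large.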
7. Conclusion: The different association mining algorithms of data mining were studied and one of them, the Apriori association mining algorithm, was implemented using Java. The need for association mining algorithms was recognized and understood.

8. Viva Questions:
• What are support and confidence?
• What are the different types of association mining algorithms?
• What is the disadvantage of the Apriori algorithm?

9. References:
• Han and Kamber, "Data Mining: Concepts and Techniques", 3rd Edition, Morgan Kaufmann.
• M. H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education.

Data Warehouse and Mining
Experiment No. 7: Implementation of the Apriori algorithm in WEKA

Experiment No. 7
1. Aim: Implementation of the Apriori algorithm in WEKA.

2. Objectives: From this experiment, the student will be able to
• Analyse the data, identify the problem and choose the relevant algorithm to apply
• Understand and implement classical association mining algorithms
• Identify the applications of association mining algorithms

3. Outcomes: The learner will be able to
• Assess the strengths and weaknesses of algorithms
• Identify, formulate and solve engineering problems
• Analyse the local and global impact of data mining on individuals, organizations and society

4. Software Required: WEKA tool

5. Theory:
The Apriori algorithm is an influential algorithm for mining frequent itemsets for boolean association rules. Some key concepts for the Apriori algorithm are:
• Frequent itemsets: the sets of items which have minimum support (denoted by Li for the i-th itemset).
• Apriori property: any subset of a frequent itemset must be frequent.
• Join operation: to find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.

6. Procedure (WEKA implementation):
To learn the system, the file TEST_ITEM_TRANS.arff has been used. Using the Apriori algorithm, we want to find the association rules that have minSupport = 50% and minimum confidence = 50%. We launch the WEKA application and open the TEST_ITEM_TRANS.arff file, then move to the Associate tab and set up the configuration. After the algorithm has finished, we get the following results:

=== Run information ===
Scheme: weka.associations.Apriori -N 20 -T 0 -C 0.5 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: TEST_ITEM_TRANS
Instances: 15
Attributes: 8 (A, B, C, D, E, F, G, H)
=== Associator model (full training set) ===
Apriori
Minimum support: 0.5 (7 instances)
Minimum metric: 0.5
Number of cycles performed: 10
Generated sets of large itemsets:
Size of set of large itemsets L(1): 10
Size of set of large itemsets L(2): 12
Size of set of large itemsets L(3): 3
Best rules found:
1. E=TRUE 11 ==> H=TRUE 11 conf:(1)
2. B=TRUE 10 ==> H=TRUE 10 conf:(1)
3. C=TRUE 10 ==> H=TRUE 10 conf:(1)
4. A=TRUE 9 ==> H=TRUE 9 conf:(1)
5. G=FALSE 9 ==> H=TRUE 9 conf:(1)
6. D=TRUE 8 ==> H=TRUE 8 conf:(1)
7. F=FALSE 8 ==> H=TRUE 8 conf:(1)
8. D=FALSE 7 ==> H=TRUE 7 conf:(1)
9. F=TRUE 7 ==> H=TRUE 7 conf:(1)
10. B=TRUE E=TRUE 7 ==> H=TRUE 7 conf:(1)
11. C=TRUE G=FALSE 7 ==> H=TRUE 7 conf:(1)
12. E=TRUE G=FALSE 7 ==> H=TRUE 7 conf:(1)
13. G=FALSE 9 ==> C=TRUE 7 conf:(0.78)
14. G=FALSE 9 ==> E=TRUE 7 conf:(0.78)
15. G=FALSE H=TRUE 9 ==> C=TRUE 7 conf:(0.78)
16. G=FALSE 9 ==> C=TRUE H=TRUE 7 conf:(0.78)
17. G=FALSE H=TRUE 9 ==> E=TRUE 7 conf:(0.78)
18. G=FALSE 9 ==> E=TRUE H=TRUE 7 conf:(0.78)
19. H=TRUE 15 ==> E=TRUE 11 conf:(0.73)
20. B=TRUE 10 ==> E=TRUE 7 conf:(0.7)
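The same run can also be reproduced programmatically. The following is a hedged sketch that assumes weka.jar is on the classpath and the TEST_ITEM_TRANS.arff file used above is in the working directory; it sets the same 50% support and confidence thresholds as the GUI configuration.

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Hedged sketch: mining association rules from TEST_ITEM_TRANS.arff via WEKA's Java API.
public class WekaAprioriSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("TEST_ITEM_TRANS.arff");

        Apriori apriori = new Apriori();
        apriori.setLowerBoundMinSupport(0.5);   // minimum support = 50%
        apriori.setMinMetric(0.5);              // minimum confidence = 50%
        apriori.setNumRules(20);                // report the 20 best rules
        apriori.buildAssociations(data);

        System.out.println(apriori);            // prints the large itemsets and the best rules
    }
}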
7. Conclusion: The different association mining algorithms of data mining were studied and one of them, the Apriori association mining algorithm, was implemented using the WEKA tool. The need for association mining algorithms was recognized and understood.

8. References:
• Han and Kamber, "Data Mining: Concepts and Techniques", 3rd Edition, Morgan Kaufmann.
• M. H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education.

Data Warehouse and Mining
Experiment No. 8: Study of the R Tool

Experiment No. 8
1. Aim: Study of the R tool.

2. Objectives: From this experiment, the student will be able to
• Learn the basics of a mining tool
• Use the R environment for common data mining tasks
• Study the methodology of engineering legacy data mining

3. Outcomes: The learner will be able to
• Use current techniques, skills and tools for mining
• Engage in life-long learning
• Match industry requirements in the domain of data mining

4. Software Required: R tool

5. Theory:
R is a programming environment, object-oriented and similar to S-Plus, available as free software. It provides calculations on matrices and excellent graphics capabilities, and it is supported by a large user network.

Installing R:
1) Download from CRAN.
2) Select a download site.
3) Download the base package at a minimum.
4) Download contributed packages as needed.

R Basics / Components of R: objects, naming convention, assignment, functions, workspace, history.

Objects:
• Types of objects: vector, factor, array, matrix, data.frame, ts, list.
• Attributes:
  o mode: numeric, character, complex, logical
  o length: number of elements in the object
• Creation: assign a value, or create a blank object.

Naming Convention:
• Must start with a letter (A-Z or a-z).
• Can contain letters, digits (0-9), and/or periods ".".
• Case-sensitive, e.g. mydata is different from MyData.
• Do not use the underscore "_".

Assignment:
• "<-" is used to indicate assignment, e.g.
  o x <- c(1,2,3,4,5,6,7)
  o x <- c(1:7)
  o x <- 1:4

Functions:
• Actions can be performed on objects using functions (note: a function is itself an object).
• Functions have arguments and options, and often there are defaults.
• They provide a result.
• Parentheses () are used to specify that a function is being called.

Workspace:
• During an R session, all objects are stored in a temporary working memory.
• List objects: ls()
• Remove objects: rm()
• Objects that you want to access later must be saved in a "workspace":
  o from the menu bar: File -> Save Workspace
  o from the command line: save(x, file="MyData.Rdata")

History:
• The command line history can be saved, loaded, or displayed:
  o savehistory(file="MyData.Rhistory")
  o loadhistory(file="MyData.Rhistory")
  o history(max.show=Inf)
• During a session you can use the arrow keys to review the command history.

Two of the most common object types for statistics:
A. Matrix: a matrix is a vector with an additional attribute (dim) that defines the number of columns and rows. Only one mode (numeric, character, complex, or logical) is allowed. It can be created using matrix():
  x <- matrix(data=0, nr=2, nc=2)   or   x <- matrix(0, 2, 2)
B. Data Frame: several modes are allowed within a single data frame. It can be created using data.frame():
  L <- LETTERS[1:4]     # A B C D
  x <- 1:4              # 1 2 3 4
  data.frame(x, L)      # create data frame
attach() and detach():
  o The database is attached to the R search path so that the database is searched by R when it is evaluating a variable.
  o Objects in the database can then be accessed by simply giving their names.

Data Elements:
• Select only one element: x[2]
• Select a range of elements: x[1:3]
• Select all but one element: x[-3]
• Slicing (including only part of the object): x[c(1,2,5)]
• Select elements based on a logical operator: x[x>3]

Data Import & Entry:
• read.table(): reads in data from an external file.
• data.entry(): create the object first, then enter data.
• c(): concatenate.
• scan(): prompted data entry.
• R has ODBC for connecting to other programs.
Data entry & editing:
• Start the editor and save changes: data.entry(x)
• Start the editor, changes not saved: de(x)
• Start a text editor: edit(x)

Useful Functions:
• length(object)    # number of elements or components
• str(object)       # structure of an object
• class(object)     # class or type of an object
• names(object)     # names
• c(object, object, ...)        # combine objects into a vector
• cbind(object, object, ...)    # combine objects as columns
• rbind(object, object, ...)    # combine objects as rows
• ls()              # list current objects
• rm(object)        # delete an object
• newobject <- edit(object)     # edit a copy and save it as newobject
• fix(object)       # edit an object in place

Exporting Data:
• To a tab-delimited text file:
  write.table(mydata, "c:/mydata.txt", sep="\t")
• To an Excel spreadsheet:
  library(xlsReadWrite)
  write.xls(mydata, "c:/mydata.xls")
• To SAS:
  library(foreign)
  write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sas", package="SAS")

Viewing Data: There are a number of functions for listing the contents of an object or dataset:
• List objects in the working environment: ls()
• List the variables in mydata: names(mydata)
• List the structure of mydata: str(mydata)
• List levels of factor v1 in mydata: levels(mydata$v1)
• Dimensions of an object: dim(object)
• Class of an object (numeric, matrix, data frame, etc.): class(object)
• Print mydata: mydata
• Print the first 10 rows of mydata: head(mydata, n=10)
• Print the last 5 rows of mydata: tail(mydata, n=5)

Interfacing with R: CSV files, Excel files, binary files, XML files, JSON files, web data, databases.
We can also create pie charts, bar charts, box plots, histograms, line graphs and scatterplots.

Data Types in R: vectors, lists, matrices, arrays, factors, data frames.

Input: The source() function runs a script in the current session. If the filename does not include a path, the file is taken from the current working directory.
  # input a script
  source("myfile")

Output: The sink() function defines the direction of the output.
  # direct output to a file
  sink("myfile", append=FALSE, split=FALSE)
  # return output to the terminal
  sink()
The append option controls whether output overwrites or adds to a file. The split option determines whether output is also sent to the screen as well as to the output file.

Creating new variables: Use the assignment operator <- to create new variables. A wide array of operators and functions are available. Three examples for doing the same computations:
1. mydata$sum <- mydata$x1 + mydata$x2
   mydata$mean <- (mydata$x1 + mydata$x2)/2
2. attach(mydata)
   mydata$sum <- x1 + x2
   mydata$mean <- (x1 + x2)/2
   detach(mydata)
3. mydata <- transform(mydata, sum = x1 + x2, mean = (x1 + x2)/2)

Renaming variables: You can rename variables programmatically or interactively.
  # rename interactively
  fix(mydata)    # results are saved on close
  # rename programmatically
  library(reshape)
  mydata <- rename(mydata, c(oldname="newname"))

Sorting: To sort a data frame in R, use the order() function. By default, sorting is ascending. Prepend the sorting variable with a minus sign to indicate descending order.

Merging: To merge two data frames (datasets) horizontally, use the merge() function. In most cases, you join two data frames by one or more common key variables (i.e., an inner join). Examples:
  # merge two data frames by ID
  total <- merge(dataframeA, dataframeB, by="ID")
  # merge two data frames by ID and Country
  total <- merge(dataframeA, dataframeB, by=c("ID","Country"))

6. Conclusion: The R tool, a free software environment for statistical computing and graphics, was studied. Using the R tool, various data mining algorithms were implemented.
R and its packages, functions and task views for the data mining process, as well as popular data mining techniques, were learnt.

7. Viva Questions:
• How is the R tool used for mining big data?

8. References:
• Han and Kamber, "Data Mining: Concepts and Techniques", 3rd Edition, Morgan Kaufmann.
• M. H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education.

Data Warehouse and Mining
Experiment No. 9: Study of a BI Tool

Experiment No. 9
1. Aim: Study of a Business Intelligence tool (SPSS Clementine, XLMiner, etc.).

2. Objectives: From this experiment, the student will be able to
• Learn the basics of business intelligence
• Use a BI tool for common data mining tasks
• Study the methodology of engineering legacy data mining

3. Outcomes: The learner will be able to
• Use current techniques, skills and tools for mining
• Engage in life-long learning
• Match industry requirements in the domain of data mining

4. Software Required: BI tool - SPSS Clementine

5. Theory:
IBM SPSS Modeler is a data mining and text analytics software application built by IBM. It is used to build predictive models and conduct other analytic tasks. It has a visual interface which allows users to leverage statistical and data mining algorithms without programming. IBM SPSS Modeler was originally named Clementine by its creators.

Applications: SPSS Modeler has been used in these and other industries:
• Customer analytics and customer relationship management (CRM)
• Fraud detection and prevention
• Optimizing insurance claims
• Risk management
• Manufacturing quality improvement
• Healthcare quality improvement
• Forecasting demand or sales
• Law enforcement and border security
• Education
• Telecommunications
• Entertainment, e.g. predicting movie box office receipts

SPSS Modeler is available in two separate bundles of features, called editions:
1. SPSS Modeler Professional
2. SPSS Modeler Premium, which additionally includes text analytics, entity analytics and social network analysis.
Both editions are available in desktop and server configurations.

Earlier it was Unix-based and designed as a consulting tool, not for sale to customers. It was originally developed by a UK company called Integral Solutions in collaboration with artificial intelligence researchers at Sussex University. It mainly uses two of the Poplog languages, Pop-11 and Prolog. It was the first data mining tool to use an icon-based graphical user interface rather than requiring users to write in a programming language. Clementine is data mining software for business solutions. The previous version was a stand-alone application architecture, while the new version uses a distributed architecture.
Fig.: Previous version (stand-alone architecture).
Fig.: New version (distributed architecture).

Multiple model-building techniques in Clementine: rule induction, graph clustering, association rules, linear regression, neural networks.

Functionalities:
• Classification: rule induction, neural networks
• Association: rule induction, Apriori
• Clustering: Kohonen networks, rule induction
• Sequence: rule induction, neural networks, linear regression
• Prediction: rule induction, neural networks

Applications:
• Predict market share
• Detect possible fraud
• Locate new retail sites
• Assess financial risk
• Analyze demographic trends and patterns

6. Conclusion: IBM SPSS Modeler, a data mining and text analytics software application, was studied. It was understood that it has a visual interface which allows users to leverage statistical and data mining algorithms without programming.
7. Viva Questions:
• What are the functionalities of SPSS Clementine?

8. References:
• Han and Kamber, "Data Mining: Concepts and Techniques", 3rd Edition, Morgan Kaufmann.
• M. H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education.

Data Warehouse and Mining
Experiment No. 10: Study different OLAP operations

Experiment No. 10
1. Aim: Study different OLAP operations.

2. Objectives: From this experiment, the student will be able to
• Discover patterns from a data warehouse
• Perform online analytical processing of data
• Obtain knowledge from a data warehouse

3. Outcomes: The learner will be able to
• Recognize the need for online analytical processing
• Identify, formulate and solve engineering problems
• Match industry requirements in the domain of data warehousing

4. Theory:
The different OLAP operations are the following:
• Roll-up (drill-up): summarize data, by climbing up a hierarchy or by dimension reduction.
• Drill-down (roll-down): the reverse of roll-up, going from a higher-level summary to a lower-level summary or detailed data, or introducing new dimensions.
• Slice and dice: project and select.
• Pivot (rotate): reorient the cube; visualization; 3D to a series of 2D planes.
Fig.: Fact table view and multi-dimensional cube (dimension = 3).
Fig.: Example - cube aggregation (roll-up and drill-down).
Fig.: Example - slicing.
Fig.: Example - slicing and pivoting.
A small in-memory illustration of roll-up and slicing follows.
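To make roll-up and slicing concrete, the sketch below (assuming Java 16 or later for records; the cities, products and sales figures are invented) treats a small in-memory fact table as a three-dimensional cube over (city, product, month). Roll-up aggregates the month dimension away, and a slice fixes month = "Jan" to obtain a two-dimensional sub-cube.

import java.util.*;

// In-memory illustration of two OLAP operations on a tiny 3-D cube
// with dimensions (city, product, month) and measure "sales".
public class OlapSketch {
    record Cell(String city, String product, String month, double sales) {}

    public static void main(String[] args) {
        List<Cell> cube = List.of(
            new Cell("Mumbai", "Pen", "Jan", 100), new Cell("Mumbai", "Pen", "Feb", 120),
            new Cell("Mumbai", "Ink", "Jan", 80),  new Cell("Pune",   "Pen", "Jan", 90),
            new Cell("Pune",   "Ink", "Feb", 60));

        // Roll-up: climb the time hierarchy by summing the months away -> sales per (city, product).
        Map<String, Double> rollUp = new TreeMap<>();
        for (Cell c : cube) rollUp.merge(c.city() + "/" + c.product(), c.sales(), Double::sum);
        System.out.println("Roll-up (city, product): " + rollUp);

        // Slice: fix month = "Jan", keeping a 2-D sub-cube over (city, product).
        Map<String, Double> slice = new TreeMap<>();
        for (Cell c : cube)
            if (c.month().equals("Jan")) slice.merge(c.city() + "/" + c.product(), c.sales(), Double::sum);
        System.out.println("Slice (month = Jan):     " + slice);
    }
}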
5. Conclusion: OLAP, which performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modeling, was studied.

6. Viva Questions:
• What are the OLAP operations?
• What is the difference between OLTP and OLAP?
• What is the difference between slicing and dicing?

7. References:
• Han and Kamber, "Data Mining: Concepts and Techniques", 3rd Edition, Morgan Kaufmann.
• M. H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education.

Data Warehouse and Mining
Experiment No. 11: Study different pre-processing steps of a data warehouse

Experiment No. 11
1. Aim: Study different pre-processing steps of a data warehouse.

2. Objectives: From this experiment, the student will be able to
• Discover patterns from a data warehouse
• Learn the steps of pre-processing of data
• Obtain knowledge from a data warehouse

3. Outcomes: The learner will be able to
• Recognize the need for data pre-processing
• Identify, formulate and solve engineering problems
• Match industry requirements in the domain of data warehousing

4. Theory:
Data pre-processing is an often neglected but important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning. Data gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income: -100), impossible data combinations (e.g., Gender: Male, Pregnant: Yes), missing values, and so on. If there is much irrelevant and redundant information present, or noisy and unreliable data, then knowledge discovery during the training phase is more difficult. Data preparation and filtering steps can take a considerable amount of processing time. Data pre-processing includes cleaning, normalization, transformation, feature extraction and selection, etc. The product of data pre-processing is the final training set.

Data Pre-processing Methods: Raw data is highly susceptible to noise, missing values, and inconsistency. The quality of the data affects the data mining results. In order to help improve the quality of the data and, consequently, of the mining results, raw data is pre-processed so as to improve the efficiency and ease of the mining process. Data pre-processing is one of the most critical steps in a data mining process; it deals with the preparation and transformation of the initial dataset. Data pre-processing methods are divided into the following categories:
1) Data Cleaning
2) Data Integration
3) Data Transformation
4) Data Reduction
Fig.: Forms of data pre-processing.

Data Cleaning: Data to be analyzed by data mining techniques can be incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data), noisy (containing errors, or outlier values which deviate from the expected), and inconsistent (e.g., containing discrepancies in the department codes used to categorize items). Incomplete, noisy, and inconsistent data are commonplace properties of large, real-world databases and data warehouses. Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, such as customer information for sales transaction data. Other data may not be included simply because it was not considered important at the time of entry. Relevant data may not be recorded due to a misunderstanding, or because of equipment malfunctions. Data that were inconsistent with other recorded data may have been deleted. Furthermore, the recording of the history of modifications to the data may have been overlooked. Missing data, particularly for tuples with missing values for some attributes, may need to be inferred. Data can be noisy, having incorrect attribute values, owing to the following: the data collection instruments used may be faulty; there may have been human or computer errors at data entry; errors in data transmission can also occur; there may be technology limitations, such as a limited buffer size for coordinating synchronized data transfer and consumption; and incorrect data may also result from inconsistencies in the naming conventions or data codes used. Duplicate tuples also require data cleaning. Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. Dirty data can cause confusion for the mining procedure. Although most mining routines have some procedures for dealing with incomplete or noisy data, they are not always robust; instead, they may concentrate on avoiding overfitting the data to the function being modelled. Therefore, a useful pre-processing step is to run your data through some data cleaning routines.

Missing Values: If it is noted that there are many tuples that have no recorded value for several attributes, then the missing values can be filled in for the attribute by the various methods described below:
1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification or description). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
2. Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown", or minus infinity. If missing values are replaced by, say, "Unknown", then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, that of "Unknown".
Hence, although this method is simple, it is not recommended.
4. Use the attribute mean to fill in the missing value.
5. Use the attribute mean for all samples belonging to the same class as the given tuple.
6. Use the most probable value to fill in the missing value: This may be determined with inference-based tools using a Bayesian formalism or decision tree induction.

Inconsistent Data: There may be inconsistencies in the data recorded for some transactions. Some data inconsistencies may be corrected manually using external references. For example, errors made at data entry may be corrected by performing a paper trace. This may be coupled with routines designed to help correct the inconsistent use of codes. Knowledge engineering tools may also be used to detect the violation of known data constraints. For example, known functional dependencies between attributes can be used to find values contradicting the functional constraints.

Data Integration: It is likely that your data analysis task will involve data integration, which combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files. There are a number of issues to consider during data integration. Schema integration can be tricky. How can equivalent real-world entities from multiple data sources be matched up? This is referred to as the entity identification problem. For example, how can the data analyst or the computer be sure that customer_id in one database and cust_number in another refer to the same entity? Databases and data warehouses typically have metadata, that is, data about the data. Such metadata can be used to help avoid errors in schema integration. Redundancy is another important issue. An attribute may be redundant if it can be derived from another table, such as annual revenue. Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.

Data Transformation: In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following:
1. Normalization, where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0 to 1.0.
2. Smoothing, which works to remove the noise from the data. Such techniques include binning, clustering, and regression.
3. Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual totals. This step is typically used in constructing a data cube for analysis of the data at multiple granularities.
4. Generalization of the data, where low-level or "primitive" (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or country. Similarly, values for numeric attributes, like age, may be mapped to higher-level concepts, like young, middle-aged, and senior.
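As a concrete illustration of two of the steps described above, the following Java sketch (with made-up income values, using -1 as the missing-value marker) first fills the missing entries with the attribute mean (data cleaning) and then rescales the attribute to the range 0 to 1 by min-max normalization (data transformation).

import java.util.Arrays;

// Sketch of two pre-processing steps on one numeric attribute:
// mean imputation of missing values, then min-max normalization to [0, 1].
public class PreprocessingSketch {
    public static void main(String[] args) {
        double[] income = {25000, 40000, -1, 32000, -1, 58000};   // -1 marks a missing value

        // Data cleaning: replace missing values with the mean of the known values.
        double sum = 0; int n = 0;
        for (double v : income) if (v >= 0) { sum += v; n++; }
        double mean = sum / n;
        for (int i = 0; i < income.length; i++) if (income[i] < 0) income[i] = mean;

        // Data transformation: min-max normalization so that values fall within [0, 1].
        double min = Arrays.stream(income).min().getAsDouble();
        double max = Arrays.stream(income).max().getAsDouble();
        double[] scaled = new double[income.length];
        for (int i = 0; i < income.length; i++) scaled[i] = (income[i] - min) / (max - min);

        System.out.println("After imputation:    " + Arrays.toString(income));
        System.out.println("After normalization: " + Arrays.toString(scaled));
    }
}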
Data Reduction: Data reduction techniques are helpful in analyzing a reduced representation of the dataset without compromising the integrity of the original data, while still producing quality knowledge. The concept of data reduction is commonly understood as either reducing the volume or reducing the dimensions (the number of attributes). There are a number of methods that facilitate analyzing a reduced volume or dimension of the data and yet yield useful knowledge. Certain partition-based methods work on partitions of the data tuples; that is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results. Strategies for data reduction include the following:
1. Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
2. Dimension reduction, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.
3. Data compression, where encoding mechanisms are used to reduce the data set size. The methods used for data compression are the wavelet transform and principal component analysis.
4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data, e.g. regression and log-linear models), or non-parametric methods such as clustering, sampling, and the use of histograms.
5. Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels. Concept hierarchies allow the mining of data at multiple levels of abstraction and are a powerful tool for data mining.

5. Conclusion: Data pre-processing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data pre-processing is a proven method of resolving such issues. These pre-processing steps were thus studied.

6. Viva Questions:
• What is pre-processing of data?
• What is the need for data pre-processing?
• What kinds of data can be cleaned?

7. References:
• Han and Kamber, "Data Mining: Concepts and Techniques", 3rd Edition, Morgan Kaufmann.
• M. H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education.