BIS4435 Lecture 10: Data Mining
Dr. Nawaz Khan, School of Computing Science
E-mail: n.x.khan@mdx.ac.uk
BIS4229 – Industrial Data Management Technologies

Reading Assignment
- Core Text:
  - GC DL materials on the WebCT: Unit 11
  - Connolly, T. and Begg, C., 2002, Database Systems: A Practical Approach to Design, Implementation, and Management, Addison Wesley, Harlow, England
- Additional Reading:
  - Fundamentals of Database Systems, R. Elmasri and S. B. Navathe, 4th Edition, 2004, Addison-Wesley, ISBN 0-32112226-7: Chapter 27
  - Data Warehousing, Data Mining, and OLAP, Alex Berson and Stephen J. Smith, McGraw-Hill, 1997, ISBN 0-07006272-2 (Chapters 17, 18)
  - Other resources on the Internet

Outline
- DW & DM: differences
- The Definition
- Application areas
- Comparison with query and Web site analysis tools
- DM Process
- Applications, Models and Algorithms
- Summary
- Q&A

DW & DM: differences
[Figure: data warehouse architecture. Labels: Operational Data, Data Transformation, Data Warehouse, Metadata, Data Mart, Access Tools, Information Delivery System]

DW & DM: differences
- They have the same purpose - decision support
- DW assembles, formats, and organises historical data to answer user queries as it is - depends on the content of the DW
- DW will not attempt to extract further information or predict trends and patterns from data
- DM will extract previously unknown and useful information as well as predict trends and patterns
- DM can be performed on a DW and/or traditional DBs and files

The Definition
DM is the process of extracting previously unknown, valid and actionable information from large sets of data
- Unknown - look for things that are not intuitive
- Valid - useful
- Actionable - translate into business advantage
Example:
- Rule 1: people don't buy shares when the political situation is not stable
- Rule 2: the share market is less active when people don't want to spend
- Outcome statement 1, based on rules 1 and 2: the share market is less active when the political situation is not stable
- Outcome statement 2, based on rules 1 and 2: people don't want to spend when the political situation is not stable

Application areas
- Direct Marketing: the ability to predict who is most likely to be interested in what products can save companies immense amounts in marketing expenditures
- Trend Analysis: understanding trends in the marketplace is a strategic advantage, because it is useful in reducing costs and time to market
- Security
  - Fraud detection: data mining techniques can help discover which insurance claims, cellular phone calls, or credit card purchases are likely to be fraudulent
  - IDS (intrusion detection systems)
- Forecasting in Financial Markets
- Mining Online – WebKDD: Web sites today find themselves competing for customer loyalty. It costs little for a customer to switch to competitors
- Text Mining - intelligent document analysis

Comparison with query and Web site analysis tools
Query Tools vs. DM Tools
- Both allow the user to ask questions of a DBMS/DW - find out facts
- Query tool - users make an assumption; the query is based on a hypothesis
- Data mining tool - no assumption when making the query (goal)
- Example queries:
  1. What is the number of white shirts sold in the north vs the south?
  2. What are the most significant factors involved in high, medium, and low sales volumes of white shirts?
- Data mining tool - discovers relationships and hidden patterns that are not obvious
- Trend - integrate data mining in query tools

Comparison with query and Web site analysis tools
OLAP Tools vs. DM Tools
- OLAP - designed to answer top-down queries
- OLAP - provides multidimensional data analysis; data can be broken down and summarised
- OLAP - query-driven, user-driven, verification-driven
- Data mining - bottom-up, requires no assumption
- Data mining - focuses on finding patterns
- Data mining - data-driven, discovery-driven, identifies facts/conclusions based on patterns discovered
For example, OLAP may tell a bookseller the total number of books it sold in a region during a quarter. Statistics can provide another dimension about these sales. Data mining, on the other hand, can tell you the patterns of these sales, i.e., the factors influencing the sales.

DM Technologies (see Unit 20 - WebCT)
[Figure: technologies surrounding Data Mining. Labels: Database Management and Warehousing, Statistics, Parallel Processing, Machine Learning, Visualisation, Decision Support]

DM Process - Overview
[Figure: DM process. Data labels: Data Sources, Selected data, Pre-processed data, Transformed data, Extracted data, Assimilated knowledge. Phase labels: Business objectives, data preparation, DM, results analysis & knowledge assimilation]
- Mining data is only one step in the overall process
- Business objectives drive the entire process
- Data preparation requires the most effort
- Iterative process with many loop backs over one or more steps
- Labour-intensive exercise, far from autonomous

DM Process – Data Preparation
Steps: Data Selection, Data Pre-processing, Data Transformation

Data Selection - identify data sources and extract data for preliminary analysis in preparation for further mining
Process of choosing data to analyse:
- decide dependent variable - data (field) to be analysed
- decide active variable - data actively used in mining
- decide useful data dimension
- choose useful (descriptive) fields in the dimension
- consider adding other useful dimensions

Data Pre-processing - ensure quality of the selected data
- Data mining is at best as good as the data it is representing
- Data quality
  - redundant data
  - incorrect or inconsistent data
  - noisy data - outliers - values that are significantly out of line; bad outliers & good outliers
  - missing values - value not present or deleted
    - eliminate observations that have missing values - loses information
    - replace missing values
    - predict value using a predictive model

Data Transformation – pre-processed data is converted to an analytical data model. Data is refined to suit the input format required by DM algorithms
Techniques for data conversion:
- simple calculation (SQL) to derive new data fields
- data reduction: combine several existing variables into one new variable to reduce the total number of variables
- continuous values are scaled/normalised to the same order of magnitude
- discretisation: quantitative variables into categorical variables
- one-of-N: convert a categorical variable to a numeric representation
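
Three of the data conversion techniques listed above (scaling, discretisation, one-of-N) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names, the age cut points, and the helper names `min_max_scale`, `discretise_age` and `one_of_n` are all invented for the example and do not come from the lecture.

```python
# Illustrative records; field names and values are made up for this sketch.
records = [
    {"age": 23, "income": 18000.0, "region": "north"},
    {"age": 41, "income": 52000.0, "region": "south"},
    {"age": 65, "income": 35000.0, "region": "east"},
]


def min_max_scale(values):
    """Scale continuous values into [0, 1] so fields share an order of magnitude."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretise_age(age):
    """Turn a quantitative variable into a categorical one (cut points are assumptions)."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle"
    return "senior"


def one_of_n(value, categories):
    """One-of-N encoding: one slot per category, 1 for the match, 0 elsewhere."""
    return [1 if value == c else 0 for c in categories]


# Fixed category order so every record is encoded consistently.
regions = sorted({r["region"] for r in records})
scaled_income = min_max_scale([r["income"] for r in records])

transformed = [
    {
        "income_scaled": s,
        "age_band": discretise_age(r["age"]),
        "region_vec": one_of_n(r["region"], regions),
    }
    for r, s in zip(records, scaled_income)
]
```

Min-max scaling is just one possible choice for bringing values to the same order of magnitude; other normalisations would serve the same purpose.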

DM Process – Data Mining & Results Analysis
- DM - apply selected DM algorithm(s) to the pre-processed data
- Inseparable from results analysis - done by data & business analysts
- The two are linked in an interactive process - DM definition
- Results analysis - depends on the application developed
  - Segmentation - changing the base variable may improve the result
  - Prediction - accuracy and input sensitivity analysis, overtraining
  - Association - iteration required for discovering actionable rules

DM Process – Knowledge Assimilation
- Close the loop
- Objective - take action according to the new, valid and actionable information discovered
- Challenges
  - present the discovery in a convincing, business-oriented way
  - formulate ways to best exploit the discovery

Applications, Models and Algorithms
Typical Applications
- Market Management: Target marketing, Customer relationship management, Market basket analysis, Cross selling, Market segmentation
- Risk Management: Forecasting, Customer retention, Quality control, Competitive analysis
- Fraud Management: Fraud detection
Models
- Predictive Modelling (Classification)
- Segmentation (Clustering)
- Link Analysis
- Deviation Detection
Techniques
- Decision tree
- Geometric
- Memory-based learning
- Neural networks
- Associations discovery (Market Basket Analysis)
- Visualisation
- Statistics

Predictive Modelling – Classification
- Human learning experience - observations form a model of the essential, underlying characteristics of some phenomenon; generalisation ability
- In DM, a predictive model can analyse a DB to determine some essential characteristics about the data and make predictions

Predictive Modelling – Classification
- Supervised learning - correct answers to some already solved cases must be given to the model before it can make predictions about new observations
- Model developed in 2 phases
  - Training - build a model based on a large proportion (90%) of the available data
  - Testing - try out the model on previously unseen data (10%) to determine its accuracy and performance characteristics
- 2 types of predictive modelling
  - Classification - classify data into some pre-defined classes
  - Value prediction - predict a continuous numeric value for a database record
- Algorithms – decision trees, neural networks, rule induction

Segmentation – Clustering
- Segmentation can discover homogeneous sub-populations - customer profiling/target marketing
- Segmentation (Clustering) - partition the DB into segments (clusters) of similar records; segments (clusters) are the resulting groups of data records
- Similarity is defined by a measure that depends on the distance of records from the centre of the cluster - Euclidean distance
  For records $A = (a_1, a_2, \dots, a_n)$ and $B = (b_1, b_2, \dots, b_n)$:
  $\mathrm{Dist}(A, B) = \left( (a_1 - b_1)^2 + (a_2 - b_2)^2 + \dots + (a_n - b_n)^2 \right)^{1/2}$
- Clustering is unsupervised learning - the types of clusters or number of clusters are not given - true discovery nature of DM
- Algorithm – neural networks

Link Analysis / Deviation Detection
- Link analysis seeks to establish links between individual records or sets of records in the DB
  - Association discovery - market basket analysis - one transaction
  - Sequential pattern discovery - sequence information over time
- Deviation detection - further investigate outliers
- Applications - fraud detection
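
As a small illustration of the Euclidean distance measure from the clustering slide, the sketch below computes Dist(A, B) and assigns each record to its nearest cluster centre. The records and the two centres are hypothetical, and the centres are simply assumed to be known here; in real clustering they are discovered by the algorithm, which this sketch does not attempt.

```python
import math


def euclidean_distance(a, b):
    """Dist(A, B): square root of the summed squared differences, field by field."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of fields")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_centre(record, centres):
    """Return the index of the cluster centre closest to the record."""
    distances = [euclidean_distance(record, c) for c in centres]
    return distances.index(min(distances))


# Hypothetical, already-normalised records and two assumed cluster centres.
centres = [(0.0, 0.0), (1.0, 1.0)]
records = [(0.1, 0.2), (0.9, 0.8), (0.4, 0.3)]

# One cluster index per record.
assignments = [nearest_centre(r, centres) for r in records]
```

Because the distance sums raw differences, fields on very different scales would dominate the result, which is one reason continuous values are normalised during data transformation.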

Decision Trees
- Decision trees (IF - THEN), a commonly used machine learning algorithm, are powerful and popular tools for classification and prediction
- Attempt to split the DB among desired categories and identify important cluster features
- Tree construction
  - choose an attribute (field) for testing - root node of the tree
  - number of values of the attribute - branches from the root node
    - binary - yes/no type of questions
    - multiple - complex questions with more than two answers
- Algorithms - ID3 (Iterative Dichotomiser 3), C4.5, C5.0, CART, CHAID (Chi-squared Automatic Interaction Detection)
  - rank all features in terms of effectiveness in partitioning the set of classifications - information gain
  - make the most effective feature the root node
  - recur on each branch
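
The ranking step can be made concrete with a short Python sketch that computes the information gain of each attribute. It uses the seven animal records from the decision tree example in this lecture; the entropy-based definition of information gain is the standard one used by ID3, and the function and variable names are assumptions made for the illustration.

```python
import math
from collections import Counter

# The seven animal records from the lecture's example table.
FIELDS = ("diet", "size", "colour", "habitat")
ANIMALS = [
    ("meat", "large", "striped", "jungle", "tiger"),
    ("meat", "large", "tawny", "jungle", "lion"),
    ("meat", "small", "striped", "house", "tabby"),
    ("meat", "small", "brown", "jungle", "weasel"),
    ("grass", "large", "striped", "plains", "zebra"),
    ("grass", "small", "grey", "plains", "rabbit"),
    ("grass", "large", "tawny", "plains", "antelope"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, field_index):
    """Entropy of the species labels minus the weighted entropy after splitting."""
    labels = [row[-1] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[field_index], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


gains = {name: information_gain(ANIMALS, i) for i, name in enumerate(FIELDS)}
root = max(gains, key=gains.get)  # attribute chosen for the root node
```

On these records the highest gain belongs to `colour`, which agrees with the root node reported for the ID3 tree in the example. Repeating the same calculation on the rows of each branch gives the next attribute to test there.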

Decision Trees

Diet   Size   Colour   Habitat  Species
meat   large  striped  jungle   tiger
meat   large  tawny    jungle   lion
meat   small  striped  house    tabby
meat   small  brown    jungle   weasel
grass  large  striped  plains   zebra
grass  small  grey     plains   rabbit
grass  large  tawny    plains   antelope

Optimal tree produced by ID3
- root node - "Colour", most information gain
- 4 branches - "striped", "tawny", "brown" & "grey"
- recur on branches "striped" & "tawny"

Colour
  striped -> Habitat
    jungle -> tiger
    house -> tabby
    plains -> zebra
  tawny -> Diet
    meat -> lion
    grass -> antelope
  brown -> weasel
  grey -> rabbit

Neural Networks
- An NN is used to simulate the operation of the brain
- An NN consists of a large number of processors (neurons/nodes) and links (connections) - representing knowledge
- An NN is trained with a large amount of data and rules about data relationships - memorise
- A well trained NN can learn association and similarity – generalise
- Supervised learning: the NN is trained with sets of inputs and desired outputs
  - If the actual output is different from the desired output, the network adjusts its internal connection strengths (weights) to reduce the difference
  - This process continues until the network gets the I/O patterns correct or until an acceptable error rate is attained
- Unsupervised learning - Self-Organising Map (SOM)
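
The supervised training loop described above (compare actual and desired output, adjust the weights, repeat until the patterns are correct) can be sketched with a single artificial neuron. This is a deliberately minimal stand-in, not a full neural network: it uses the classic perceptron learning rule, and the training set (logical AND), the learning rate and the epoch limit are assumptions chosen for the illustration.

```python
def step(x):
    """Threshold activation: the neuron fires (1) when its input is non-negative."""
    return 1 if x >= 0 else 0


def predict(weights, bias, inputs):
    """Actual output of the neuron for one input pattern."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_perceptron(samples, learning_rate=1.0, max_epochs=100):
    """Adjust the weights until every I/O pattern is reproduced or epochs run out."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for inputs, desired in samples:
            delta = desired - predict(weights, bias, inputs)
            if delta != 0:
                # Output differs from the desired one: nudge weights to reduce the gap.
                errors += 1
                weights = [w + learning_rate * delta * x for w, x in zip(weights, inputs)]
                bias += learning_rate * delta
        if errors == 0:  # all patterns correct, stop training
            break
    return weights, bias


# Sets of inputs with desired outputs (logical AND, chosen for illustration).
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(AND_SAMPLES)
outputs = [predict(weights, bias, inputs) for inputs, _ in AND_SAMPLES]
```

A single neuron can only learn patterns that are linearly separable, which is why practical networks connect many neurons in layers and use more elaborate weight-update rules.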

Summary
- DW & DM: differences
- The definition
- Application areas
- Comparison with query and Web site analysis tools
- DM Process
  - Data preparation (60% of the whole time)
  - DM (~10% of the time)
- Applications, Models and Algorithms (decision trees, neural networks, etc.)

Next week: Revision