Data Mining

Chapter 1: INTRODUCTION

What is Pattern Recognition?
- Pattern recognition by humans
  - perceptual
  - specialized decision making
- Pattern recognition by computers
  - benefit of automated pattern recognition
  - advantage in complex calculations
- Pattern recognition from data (data mining)

Pattern Recognition from Data
Pattern recognition from data is the process of learning from historical data by finding data dependencies and extracting knowledge from the data.

What is Data?

No.   Studies    Education   Works   Income (D)
1     Poor       SPM         Poor    None
2     Poor       SPM         Good    Low
3     Moderate   SPM         Poor    Low
4     Moderate   Diploma     Poor    Low
5     Poor       SPM         Poor    None
6     Moderate   Diploma     Poor    Low
7     Good       MSc         Good    Medium
:
99    Poor       SPM         Good    Low
100   Moderate   Diploma     Poor    Low

What is Knowledge?
- studies(Poor) AND work(Poor) => income(None)
- studies(Poor) AND work(Good) => income(Low)
- education(Diploma) => income(Low)
- education(MSc) => income(Medium) OR income(High)
- studies(Mod) => income(Low)
- studies(Good) => income(Medium) OR income(High)
- education(SPM) AND work(Good) => income(Low)

Why is Data Mining prevalent?
1. Lots of data is collected and stored in data warehouses
- Business: Wal-Mart logs nearly 20 million transactions per day
- Astronomy: telescopes collect large amounts of data
- Space: NASA is collecting petabytes of data from satellites
- Physics: high-energy physics experiments are expected to generate 100 to 1000 terabytes in the next decade

Why is Data Mining prevalent?
2. Quality and richness of the data collected is improving
- Retailers: scanner data is much more accurate than other means
- E-commerce: rich data on customer browsing
- Science: the accuracy of sensors is improving

Why is Data Mining prevalent?
3.
The gap between data and analysts is increasing
- Existence of hidden information
- High cost of human labor
- Much of the data is never analyzed at all

Origins of Data Mining
Draws ideas from machine learning, pattern recognition, statistics, and database systems for applications that have:
- Enormous amounts of data
- High dimensionality of data
- Heterogeneous data
- Unstructured data

Data Mining: confluence of multiple disciplines
- Database technology
- Statistics
- High-performance computing
- Visualization
- Pattern recognition
- Machine learning
- Spatial data analysis
- Information retrieval
- Information science
- Neural networks

Data Mining: What it isn't
- Small scale
  - Data mining methods are designed for large data sets
- Foolproof
  - Data mining techniques will discover patterns in any data
  - The patterns discovered may be meaningless
  - It is up to the user to determine how to interpret the results
  - "Make it foolproof and they'll just invent a better fool"
- Magic
  - Data mining techniques cannot generate information that is not present in the data
  - They can only find the patterns that are already there

Example: Data Mining is not ...
- Generating multidimensional cubes of a relational table
- Searching for a phone number in a phone book
- Searching for keywords on Google (information retrieval)
- Generating a histogram of salaries for different age groups
- Issuing an SQL query to a database and reading the reply

Data Mining: What it is
- Extracting knowledge from large amounts of data
- Uses techniques from:
  - Pattern recognition
  - Machine learning
  - Statistics
- Plus techniques unique to data mining (association rules)
- Data mining methods must be efficient and scalable

Example: Data mining is ...
- What goods should be promoted to this customer?
- What is the probability that a certain customer will respond to a planned promotion?
- Can one predict the most profitable securities to buy or sell during the next trading session?
- Will this customer default on a loan or pay back on schedule?
- What medical diagnosis should be assigned to this patient?
- What kind of cars should be sold this year?
- Finding groups of people with similar hobbies
- Are the chances of getting cancer higher if you live near a power line?

Data Mining is simply...
- Finding relationships
- Making predictions

Data Mining: Definition
The non-trivial extraction of implicit, previously unknown, and potentially useful information from data (William J. Frawley, Gregory Piatetsky-Shapiro and Christopher J. Matheus)

Data Mining: One Step of KDD
KDD = Knowledge Discovery in Databases
Stages shown in the figure, from source data to knowledge:
Databases / Flat files -> Cleaning and Integration -> Data Warehouse -> Selection and Transformation -> Data Mining -> Patterns -> Evaluation & Presentation -> Knowledge

Data Mining: One Step of KDD (cont'd)
- Data cleaning
  - To remove noise and inconsistent data
- Data integration
  - Multiple data sources may be combined
- Data selection
  - Data relevant to the analysis task are retrieved from the database
- Data transformation
  - Data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations

Data Mining: One Step of KDD (cont'd)
- Data mining
  - An essential process where intelligent methods are applied in order to extract data patterns
- Pattern evaluation
  - To identify the truly interesting patterns representing knowledge, based on some interestingness measures
- Knowledge presentation
  - Visualization and knowledge representation techniques are used to present the mined knowledge to the users

Early Steps of Data Mining
- Data preprocessing
  - Handling incomplete data, noisy data, uncertain data
- Data discretization/representation
  - Transforms data into suitable values for the mining algorithm to find patterns
- Data selection
  - Selects the suitable data for mining purposes

Database Systems
- Kinds of DB: relational, data warehouse, transactional DB, advanced DB system, flat files, WWW
- Kinds of knowledge: classification, association, clustering, prediction, ...

Data Mining: Types of Data
Mining can be performed on data in a variety of forms.
Relational Database
- The traditional DBMS everyone is familiar with
- Data is stored in a series of tables (a collection of tables)
- Data is extracted via queries, typically with SQL
- Example SQL queries:
  - "Show me a list of items that were sold in the last quarter"
  - "Show me the total sales of the last month, grouped by branch"
  - "How many transactions occurred in the month of December?"
  - "Which salesperson had the highest amount of sales?"
- Relational languages provide aggregate functions such as sum, avg, count, max, min

Data Mining: Types of Data
- Applying data mining goes further
  - Searching for trends or data patterns
  - Analyzing customer data to predict the credit risk of new customers based on their income
  - Detecting deviations: items whose sales are far from those expected in comparison with the previous year (to be investigated further: change in packaging, increase in price?)
Transaction Database
- Similar to a relational database (transactions are stored in a table)
- Each row (record) is a transaction with an id and a list of the items in the transaction
- This is a nested relation
- It can be unfolded into a relational database or stored in flat files, since nested relational structures are not supported by relational database systems
- Which items sold well together?
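The question "Which items sold well together?" can be explored by counting how often pairs of items appear in the same transaction. The following is a minimal sketch only: the transactions and the function names `unfold` and `pair_counts` are hypothetical and chosen for illustration. It also shows how a nested transaction record can be unfolded into flat (id, item) rows.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction database: transaction id -> list of items (a nested relation).
transactions = {
    "T100": ["bread", "milk", "butter"],
    "T200": ["bread", "butter"],
    "T300": ["milk", "coffee"],
    "T400": ["bread", "milk", "butter"],
}

def unfold(nested):
    """Flatten the nested relation into (transaction_id, item) rows."""
    return [(tid, item) for tid, items in nested.items() for item in items]

def pair_counts(nested):
    """Count how many transactions contain each unordered pair of items."""
    counts = Counter()
    for items in nested.values():
        # sorted(set(...)) gives each pair one canonical order and ignores duplicates.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts

flat_rows = unfold(transactions)
top_pair, top_count = pair_counts(transactions).most_common(1)[0]
print(top_pair, top_count)
```

Counting every pair is only practical for small item sets; association rule mining algorithms prune this search using a minimum support threshold.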
Data Mining: Types of Data
Data Warehouse
- Stores historical data, potentially from multiple sources
- Organized around major subjects
- Contains summary statistics
Object / Object-Relational Databases
- Database consisting of objects
- Object = set of variables + associated methods
- E.g. Intel uses regularity extraction in automatic circuit layout
Images
- Can mine features extracted from images, or
- Can use mining techniques to extract features
- Content-based image retrieval

Data Mining: Types of Data
Vector Geometries (spatial databases)
- Include GIS and CAD data
- Raster data: n-dimensional bit maps / pixel maps
- Vector format: point, line, polygon
- Can find spatial patterns between features
- Describing the characteristics of houses located near a specified kind of location
- Describing the climate of mountainous areas located at various altitudes
Text
- Can be unstructured, semi-structured, or structured
- Documentation, newspaper articles, web sites, etc.
- Can facilitate search by linking related documents / concepts

Data Mining: Types of Data
Video / Audio
- Speech recognition: recognizing spoken commands
- Security applications
- Integrated with standard data mining methods (storage and searching)
Temporal Databases / Time Series
- Global change databases (temperature records)
- Space shuttle telemetry
- Stock market data (stock exchange)
- Usually store relational data that include time-related attributes
- Find the trend of changes for objects, to support decision making and strategy planning

Data Mining: Types of Data
- Stock exchange data can be mined to uncover trends that could help in planning investment strategies (when is the best time to purchase TNB stock?)
Legacy Databases
- Group of heterogeneous databases (relational, object-oriented, network, multimedia databases, etc.)
- Connected by intra- or inter-computer networks
- Information exchange is very difficult, e.g. student academic performance among different schools or universities
- Data mining: transforming the given data into higher, more generalized, conceptual levels

The evolution of database technology
Data mining can be viewed as a result of the natural evolution of database technology (Fig. 1.1). The figure shows 5 stages of functionality:
- data collection and database creation
- database management systems
- advanced database systems
- web-based database systems
- data warehousing and data mining

The evolution of database technology (cont'd)
- Database systems provide data storage and retrieval, and transaction processing.
- Data warehousing and data mining provide data analysis and understanding.
- A data warehouse is a database architecture that acts as a repository of multiple heterogeneous data sources, storing data from many different types of databases. The sources are organized under a unified schema at a single site in order to facilitate management decision making.

The evolution of database technology (cont'd)
Data warehouse technology includes:
- data cleansing
- data integration, and
- On-Line Analytical Processing (OLAP)
OLAP is an analysis technique for performing summarization, consolidation, and aggregation, as well as providing the ability to view information from different angles.
OLAP tools support data analysis, but not in-depth analysis such as data classification, clustering, and the characterization of data changes over time.

DBMS, OLAP & Data Mining

Task
- DBMS: extraction of detailed and summary data
- OLAP: summaries, trends and forecasts
- Data mining: knowledge discovery of hidden patterns and insights
Type of result
- DBMS: information
- OLAP: analysis
- Data mining: insight and prediction
Method
- DBMS: deduction (ask the question, verify with data)
- OLAP: multidimensional data modeling, aggregation, statistics
- Data mining: induction (build the model, apply it to new data, get the result)
Example question
- DBMS: Who purchased mutual funds in the last 3 years?
- OLAP: What is the average income of mutual fund buyers by region by year?
- Data mining: Who will buy a mutual fund in the next 6 months and why?

Example: Weather data
- A record of the weather conditions during a two-week period, along with the decisions of a tennis player whether or not to play tennis on each particular day
- Generates tuples (or examples, instances) consisting of values of 4 independent variables:
  - Outlook
  - Temperature
  - Humidity
  - Windy
- One dependent variable: play

Example: Weather data (cont'd)

Day   outlook    temperature   humidity   windy   play
1     sunny      85            85         false   no
2     sunny      80            90         true    no
3     overcast   83            86         false   yes
4     rainy      70            96         false   yes
5     rainy      68            80         false   yes
6     rainy      65            70         true    no
7     overcast   64            65         true    yes
8     sunny      72            95         false   no
9     sunny      69            70         false   yes
10    rainy      75            80         false   yes
11    sunny      75            70         true    yes
12    overcast   72            90         true    yes
13    overcast   81            75         false   yes
14    rainy      71            91         true    no

DBMS
We may answer questions by querying a DBMS containing the above table:
- What was the temperature on the sunny days?
- On which days was the humidity less than 75?
- On which days was the temperature greater than 70?
- On which days was the temperature greater than 70 and the humidity less than 75?

OLAP (On-Line Analytical Processing)
Using OLAP, create a multidimensional model (data cube). Example:
Dimensions: time, outlook, play. These give the model below (each cell shows the number of days with play = yes / play = no; the top-left cell is the overall total):

9/5      sunny   rainy   overcast
Week 1   0/2     2/1     2/0
Week 2   2/1     1/1     2/0

OLAP (cont'd)
- Observing the data cube makes it easy to identify some important properties of the data
- Find regularities or patterns
- E.g. the third column: if the outlook is overcast, the play attribute is always yes
- If outlook = overcast then play = yes

Drill-down: time dimension (concept hierarchy)

9/5   sunny   rainy   overcast
1     0/1     0/0     0/0
2     0/1     0/0     0/0
3     0/0     0/0     1/0
4     0/0     1/0     0/0
5     0/0     1/0     0/0
6     0/0     0/1     0/0
7     0/0     0/0     1/0
8     0/1     0/0     0/0
9     1/0     0/0     0/0
10    0/0     1/0     0/0
11    1/0     0/0     0/0
12    0/0     0/0     1/0
13    0/0     0/0     1/0
14    0/0     0/1     0/0

Roll-up (reverse of drill-down)

9/5      sunny   rainy   overcast
Week 1   0/2     2/1     2/0
Week 2   2/1     1/1     2/0

Data Mining Tasks
- Prediction methods
  - Use some variables to predict unknown or future values of the same or other variables
  - Inference on the current data in order to make predictions
- Description methods
  - Find human-interpretable patterns that describe the data
  - Characterize the general properties of the data in the database
- Descriptive mining is complementary to predictive mining, but it is closer to decision support than decision making

Data Mining Tasks (cont'd)
- Association rule mining (descriptive)
- Classification and prediction (predictive)
- Clustering (descriptive)
- Sequential pattern discovery (descriptive)
- Regression (predictive)
- Deviation detection (predictive)

Association Rule Mining
- Initially developed for market basket analysis
- The goal is to discover relationships between attributes
- Data is typically stored in very large databases, sometimes in flat files or images
- Uses include decision support, classification and clustering
- Application areas include business, medicine and engineering

Association Rule Mining
Given a set of transactions, each of which is a set of items, find all rules (X -> Y) that satisfy user-specified minimum support and confidence constraints, where #T denotes a number of transactions:
- Support = (#T containing X and Y) / (#T)
- Confidence = (#T containing X and Y) / (#T containing X)
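The two measures can be computed directly from their definitions. The sketch below applies them to the five transactions T1 to T5 given in this slide's example; the function names `support` and `confidence` are illustrative choices, not part of the slides.

```python
# Transactions T1..T5 from the association rule example on this slide.
transactions = [
    {"Bread", "Jelly", "Jem"},   # T1
    {"Bread", "Jem"},            # T2
    {"Bread", "Milk", "Jem"},    # T3
    {"Coffee", "Bread"},         # T4
    {"Coffee", "Milk"},          # T5
]

def support(x, y, db):
    """Fraction of all transactions that contain every item of X and of Y."""
    both = set(x) | set(y)
    return sum(1 for t in db if both <= t) / len(db)

def confidence(x, y, db):
    """Among transactions containing X, the fraction that also contain Y."""
    both = set(x) | set(y)
    n_x = sum(1 for t in db if set(x) <= t)
    n_both = sum(1 for t in db if both <= t)
    return n_both / n_x if n_x else 0.0

for x, y in [({"Bread"}, {"Jem"}), ({"Jelly"}, {"Bread"}), ({"Jelly"}, {"Milk"})]:
    print(sorted(x), "->", sorted(y),
          f"sup={support(x, y, transactions):.0%}",
          f"conf={confidence(x, y, transactions):.0%}")
```

For Bread -> Jem this gives a support of 60% (3 of 5 transactions) and a confidence of 75% (3 of the 4 transactions that contain Bread).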
Applications
- Cross-selling and up-selling
- Supermarket shelf management

Transaction   Items
T1            Bread, Jelly, Jem
T2            Bread, Jem
T3            Bread, Milk, Jem
T4            Coffee, Bread
T5            Coffee, Milk

Some rules discovered
- Bread -> Jem: sup = 60%, conf = 75%
- Jelly -> Bread: sup = 20%, conf = 100%
- Jelly -> Jem: sup = 20%, conf = 100%
- Jelly -> Milk: sup = 0%

Association Rule Mining: Definition
Given a set of records, each of which contains some number of items from a given collection:
- Produce dependency rules which will predict the occurrence of an item based on occurrences of other items
- Examples:
  - {Bread} -> {Jem}
  - {Jelly} -> {Jem}

Association Rule Mining: Marketing and sales promotion
Say the rule discovered is {Bread, …} -> {Jem}
- Jem as a consequent: can be used to determine what products will boost its sales
- Bread as an antecedent: can be used to see which products will be impacted if the store stops selling bread
- Bread as an antecedent and Jem as a consequent: can be used to see what products should be stocked along with Bread to promote the sale of Jem

Association Rule Mining: Supermarket shelf management
- Goal: to identify items that are bought together by a reasonable fraction of customers so that they can be shelved together
- Data used: point-of-sale data collected with barcode scanners, to find dependencies among products
- Example
  - If a customer buys Jelly, then he is very likely to buy Jem
  - So don't be surprised if you find Jem next to Jelly on an aisle in the supermarket. Also salsa next to tortilla chips.

Association Rule Mining
- Association rule mining will produce LOTS of rules
- How can you tell which ones are important?
  - High support
  - High confidence
  - Rules involving certain attributes of interest
  - Rules with a specific structure
  - Rules with support / confidence higher than expected
- Completeness: generating all interesting rules
- Efficiency: generating only rules that are interesting

Clustering
- Determine object groupings such that objects within the same cluster are similar to each other, while objects in different groups are not
- Typically objects are represented by data points in a multidimensional space, with each dimension corresponding to one or more attributes. The clustering problem in this case reduces to the following:
  - Given a set of data points, each having a set of attributes, and a similarity measure, find clusters such that
    - Data points in one cluster are more similar to one another
    - Data points in separate clusters are less similar to one another

Clustering (cont'd)
- Similarity measures:
  - Euclidean distance (continuous attributes)
  - Other problem-specific measures
- Types of clustering
  - Group-based clustering
  - Hierarchical clustering

Clustering Example
Euclidean distance based clustering in 3D space
- Intra-cluster distances are minimised
- Inter-cluster distances are maximised

Clustering: Market Segmentation
- Goal: to subdivide a market into distinct subsets of customers, where each subset can be targeted with a distinct marketing mix
- Approach:
  - Collect different attributes of customers based on their geographical and lifestyle-related information
  - Find clusters of similar customers
  - Measure the clustering quality by observing the buying patterns of customers in the same cluster vs. those from different clusters

Clustering: Document Clustering
- Goal: to find groups of documents that are similar to each other based on the important terms appearing in them
- Approach: identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to generate clusters.
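The document clustering approach can be sketched as follows. This is an illustration under stated assumptions, not the method used in the slides: cosine similarity over raw term counts stands in for "a similarity measure based on the frequencies of different terms", the greedy single-pass grouping and its `threshold` are hypothetical choices, and the four documents are invented for the example.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lower-cased word occurs in a document."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (0 = no shared terms)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(docs, threshold=0.3):
    """Greedy single pass: a document joins the first cluster whose first
    member is similar enough, otherwise it starts a new cluster."""
    vectors = [term_frequencies(d) for d in docs]
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine_similarity(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Hypothetical documents, for illustration only.
docs = [
    "stock market shares fall",
    "market shares rise as stock rally continues",
    "team wins football match",
    "football team loses match",
]
groups = cluster(docs)
print(groups)
```

A real system would normally filter common words first and use a standard clustering algorithm; the sketch only shows how a term-frequency similarity measure drives the grouping.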
- Gain: information retrieval can use the clusters to relate a new document or search to the clustered documents

Clustering: Document Clustering Example
- Clustering points: 3204 articles from the LA Times
- Similarity measure: number of common words in documents (after some word filtering)

Category        Total articles   Correctly placed articles
Financial       555              364
Foreign         341              260
National        273              36
Metro           943              746
Sports          738              573
Entertainment   354              278

Classification: Definition
- Given a set of records (called the training set)
  - Each record contains a set of attributes; one of the attributes is the class
- Find a model for the class attribute as a function of the values of the other attributes
- Goal: previously unseen records should be assigned to a class as accurately as possible
  - Usually, the given data set is divided into a training set and a test set, with the training set used to build the model and the test set used to validate it. The accuracy of the model is determined on the test set.

Classification (cont'd)
- Classifiers are created using labeled training samples
- Classifiers are evaluated using independent labeled samples (the test set)
- Training samples are created from ground truth or by experts
- The classifier is later used to classify unknown samples
- Measurements must be able to predict the phenomenon!
- Examples
  - Direct marketing
  - Fraud detection
  - Customer churn
  - Sky survey cataloging
  - Classifying galaxies

Classification Example
Attribute types: Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class.

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Figure: the training set is used to learn a classifier; the resulting model is then applied to the test set.

Classification: Direct Marketing
- Goal: reduce the cost of mailing by targeting a set of consumers likely to buy a new cell phone product
- Approach:
  - Use the data collected for a similar product introduced in the recent past
  - Use the profiles of consumers along with their {buy, didn't buy} decision. The latter becomes the class attribute.
  - The profile information may consist of demographic, lifestyle and company interaction data
    - Demographic: age, gender, geography, salary
    - Psychographic: hobbies
    - Company interaction: recency, frequency, monetary
  - Use this information as input attributes to learn a classifier model

Classification: Fraud Detection
- Goal: predict fraudulent cases in credit card transactions
- Approach:
  - Use credit card transactions and the information on their account holders as attributes (important: when and where the card was used)
  - Label past transactions as {fraud, fair} transactions. This forms the class attribute.
  - Learn a model for the class of transactions
  - Use this model to detect fraud by observing credit card transactions on an account
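The fraud detection steps above (label past transactions, learn a model, apply it to new transactions) can be sketched with a very simple classifier. Everything specific here is an assumption made for illustration: the nearest-centroid technique is a stand-in that the slides do not name, the two attributes (amount and distance from home) are hypothetical, and all transactions are invented.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def train(records):
    """Learn one centroid per class label from (features, label) records."""
    by_label = {}
    for features, label in records:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(pts) for label, pts in by_label.items()}

def predict(model, features):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(model, key=lambda label: math.dist(model[label], features))

def accuracy(model, records):
    """Fraction of labeled records the model classifies correctly."""
    correct = sum(1 for features, label in records if predict(model, features) == label)
    return correct / len(records)

# Hypothetical labeled transactions: (amount, km from home) -> class.
training_set = [
    ((20.0, 5.0), "fair"), ((35.0, 2.0), "fair"), ((50.0, 10.0), "fair"),
    ((900.0, 400.0), "fraud"), ((1200.0, 800.0), "fraud"),
]
test_set = [
    ((30.0, 3.0), "fair"),
    ((1000.0, 500.0), "fraud"),
    ((60.0, 700.0), "fraud"),   # small amount far from home: hard for this model
]

model = train(training_set)
test_accuracy = accuracy(model, test_set)
print(f"accuracy on the test set: {test_accuracy:.2f}")
```

The third test transaction is misclassified, which is why accuracy is measured on records that were not used to build the model.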
Regression
- Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or non-linear model of dependency
- Extensively studied in the fields of statistics and neural networks
- Examples:
  - Predicting the sales numbers of a new product based on advertising expenditure
  - Predicting wind velocities based on temperature, humidity, air pressure, etc.
  - Time series prediction of stock market indices

Deviation/Anomaly Detection
- Some data objects do not comply with the general behavior or model of the data. Data objects that are different from or inconsistent with the remaining set are called outliers.
- Outliers can be caused by measurement or execution errors, or they may represent some kind of fraudulent activity
- The goal of deviation/anomaly detection is to detect significant deviations from normal behavior

Deviation/Anomaly Detection: Definition
- Given a set of n points or objects, and k, the expected number of outliers, find the top k objects that are considerably dissimilar, exceptional or inconsistent with respect to the remaining data
- This can be viewed as two subproblems:
  - Define what data can be considered inconsistent in a given data set
  - Find an efficient method to mine the outliers

Deviation: Credit Card Fraud Detection
- Goal: to detect fraudulent credit card transactions
- Approach:
  - Based on past usage patterns, develop a model for authorized credit card transactions
  - Check for deviation from the model before authenticating new credit card transactions
  - Hold payment and verify the authenticity of "doubtful" transactions by other means (phone call, etc.)
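The top-k formulation of outlier detection can be sketched in a few lines. This is a minimal illustration under assumptions: "dissimilar" is defined here as absolute distance from the median of the data, which is only one possible answer to the first subproblem, and the transaction amounts are hypothetical.

```python
import statistics

def top_k_outliers(values, k):
    """Return the k values that lie farthest from the median of the data."""
    center = statistics.median(values)
    return sorted(values, key=lambda v: abs(v - center), reverse=True)[:k]

# Hypothetical card transaction amounts for one account.
amounts = [42.0, 38.5, 45.0, 40.0, 41.5, 39.0, 980.0, 43.0, 37.5, 5.0]
outliers = top_k_outliers(amounts, k=2)
print(outliers)
```

The median is used instead of the mean because a single extreme value shifts the mean toward itself and can hide other outliers. Sorting all n points is acceptable for small data sets; the second subproblem, mining outliers efficiently, matters once n is large.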
Anomaly detection: Network Intrusion Detection
- Goal: to detect intrusions into a computer network
- Approach:
  - Define and develop a model for normal user behavior on the computer network
  - Continuously monitor the behavior of users to check whether it deviates from the defined normal behavior
  - Raise an alarm if such a deviation is found

Sequential pattern discovery: Definition
- Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events
- Sequence discovery aims at extracting sets of events that commonly occur over a period of time
  - (A B) (C) -> (D E)

Sequential pattern discovery: Telecommunication Alarm Logs
- Telecommunication alarm logs
  - (Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) -> (Fire_Alarm)

Sequential pattern discovery: Point-of-Sale Up-Sell / Cross-Sell
- Point-of-sale transaction sequences
  - Computer bookstore
    - (Intro_to_Visual_C) (C++ Primer) -> (Perl_For_Dummies, Tcl_Tk)
    - 60% of customers who buy Intro to Visual C and C++ Primer also buy Perl for Dummies and Tcl Tk within a month
  - Athletic apparel store
    - (Shoes) (Racket, Racket ball) -> (Sport_Jacket)

Example: Data Mining (Weather data)
By applying various data mining techniques, we can:
- Find associations and regularities in our data
- Extract knowledge in the form of rules, decision trees, etc.
- Predict the value of the dependent variable in new situations
Some examples:
- Mining association rules
- Classification by decision trees and rules
- Prediction methods

Mining association rules
- First, discretize the numeric attributes (a part of the data preprocessing stage)
- Group the temperature values into three intervals (hot, mild, cool) and the humidity values into two (high, normal)
- Substitute the values in the data with the corresponding names
- Apply the Apriori algorithm and get the rules listed after the table below

Discretized weather data

Day   outlook    temperature   humidity   windy   play
1     sunny      hot           high       false   no
2     sunny      hot           high       true    no
3     overcast   hot           high       false   yes
4     rainy      mild          high       false   yes
5     rainy      cool          normal     false   yes
6     rainy      cool          normal     true    no
7     overcast   cool          normal     true    yes
8     sunny      mild          high       false   no
9     sunny      cool          normal     false   yes
10    rainy      mild          normal     false   yes
11    sunny      mild          normal     true    yes
12    overcast   mild          high       true    yes
13    overcast   hot           normal     false   yes
14    rainy      mild          high       true    no

Rules found (the pair after each rule is its support count and confidence)
1. humidity=normal windy=false -> play=yes (4, 1)
2. temperature=cool -> humidity=normal (4, 1)
3. outlook=overcast -> play=yes (4, 1)
4. temperature=cool play=yes -> humidity=normal (3, 1)
5. outlook=rainy windy=false -> play=yes (3, 1)
6. outlook=rainy play=yes -> windy=false (3, 1)
7. outlook=sunny humidity=high -> play=no (3, 1)
8. outlook=sunny play=no -> humidity=high (3, 1)
9. temperature=cool windy=false -> humidity=normal play=yes (2, 1)
10. temperature=cool humidity=normal windy=false -> play=yes (2, 1)
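The discretization step and the (support, confidence) pairs attached to the rules can be checked with a short script. This is a sketch, not the Apriori algorithm itself: it only evaluates given rules. The interval boundaries are an assumption inferred by comparing the numeric and discretized weather tables (the slides do not state them), and the function names are illustrative.

```python
# Numeric weather data: (outlook, temperature, humidity, windy, play).
weather = [
    ("sunny", 85, 85, False, "no"),     ("sunny", 80, 90, True, "no"),
    ("overcast", 83, 86, False, "yes"), ("rainy", 70, 96, False, "yes"),
    ("rainy", 68, 80, False, "yes"),    ("rainy", 65, 70, True, "no"),
    ("overcast", 64, 65, True, "yes"),  ("sunny", 72, 95, False, "no"),
    ("sunny", 69, 70, False, "yes"),    ("rainy", 75, 80, False, "yes"),
    ("sunny", 75, 70, True, "yes"),     ("overcast", 72, 90, True, "yes"),
    ("overcast", 81, 75, False, "yes"), ("rainy", 71, 91, True, "no"),
]

def discretize_temperature(t):
    # Assumed cut points: below 70 is cool, 70 to 79 is mild, 80 and above is hot.
    return "hot" if t >= 80 else "mild" if t >= 70 else "cool"

def discretize_humidity(h):
    # Assumed cut point: above 80 is high, otherwise normal.
    return "high" if h > 80 else "normal"

records = [
    {"outlook": o, "temperature": discretize_temperature(t),
     "humidity": discretize_humidity(h), "windy": w, "play": p}
    for o, t, h, w, p in weather
]

def rule_stats(antecedent, consequent, rows):
    """Return (support count, confidence) for the rule antecedent -> consequent."""
    matches_a = [r for r in rows if all(r[k] == v for k, v in antecedent.items())]
    matches_both = [r for r in matches_a if all(r[k] == v for k, v in consequent.items())]
    confidence = len(matches_both) / len(matches_a) if matches_a else 0.0
    return len(matches_both), confidence

# Rule 3 from the list: outlook=overcast -> play=yes
print(rule_stats({"outlook": "overcast"}, {"play": "yes"}, records))
```

With these cut points the script reproduces the discretized table, and rules such as rule 3 come out with a support count of 4 and a confidence of 1.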
Mining association rules (cont'd)
- These rules show some sets of attribute values (itemsets) that appear frequently in the data
- Support: the number of occurrences of the itemset in the data
- Confidence: the accuracy of the rule
- Rule 3 is the same as the one produced by observing the data cube

Classification by Decision Trees and Rules
Using the ID3 algorithm, the following decision tree is produced:

Outlook=sunny
    Humidity=high: no
    Humidity=normal: yes
Outlook=overcast: yes
Outlook=rainy
    Windy=true: no
    Windy=false: yes

Classification by Decision Trees and Rules (cont'd)
- A decision tree consists of:
  - Decision nodes that test the values of their corresponding attribute
  - Each value of this attribute leads to a subtree, and so on, until the leaves of the tree are reached
  - The leaves determine the value of the dependent variable
- Using a decision tree we can classify new tuples

Classification by Decision Trees and Rules (cont'd)
- A decision tree can be presented as a set of rules
  - Each rule represents a path through the tree from the root to a leaf
- Other data mining techniques can produce rules directly, e.g. the Prism algorithm:
  - If outlook=overcast then yes
  - If humidity=normal and windy=false then yes
  - If temperature=mild and humidity=normal then yes
  - If outlook=rainy and windy=false then yes
  - If outlook=sunny and humidity=high then no
  - If outlook=rainy and windy=true then no

Prediction methods
- Data mining offers techniques to predict the value of the dependent variable directly, without first generating a model
- The most popular approaches are based on statistical methods
- They use Bayes' rule to predict the probability of each value of the dependent variable given the values of the independent variables

Prediction methods (cont'd)
Example: applying Bayes' rule to the new tuple (sunny, mild, normal, false, ?)
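One way to carry out this kind of prediction is naive Bayes, sketched below on the discretized weather data. The sketch assumes that the attributes are conditionally independent given the class, an assumption the slides do not state, and it uses plain relative frequencies without smoothing. The function name `posterior` is illustrative.

```python
from collections import Counter

# Discretized weather data: (outlook, temperature, humidity, windy, play).
rows = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]

def posterior(data, new_tuple):
    """Naive Bayes: P(class | attributes) is proportional to
    P(class) times the product of P(attribute value | class)."""
    class_counts = Counter(r[-1] for r in data)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(data)  # prior probability of the class
        for i, value in enumerate(new_tuple):
            n_match = sum(1 for r in data if r[-1] == label and r[i] == value)
            score *= n_match / n_label  # relative frequency, no smoothing
        scores[label] = score
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

result = posterior(rows, ("sunny", "mild", "normal", "false"))
print({label: round(p, 2) for label, p in result.items()})
```

Under these assumptions the probabilities come out at roughly 0.80 for play=yes and 0.20 for play=no.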
- P(play=yes | outlook=sunny, temperature=mild, humidity=normal, windy=false) = 0.8
- P(play=no | outlook=sunny, temperature=mild, humidity=normal, windy=false) = 0.2
- The predicted value is therefore "yes"

Data Mining: Problems and Challenges
- Noisy data
- Large databases
- Dynamic databases

Noisy data
- Many of the attribute values will be inexact or incorrect
  - Erroneous instruments measuring some property
  - Human errors occurring at data entry
- Two forms of noise in the data
  - Corrupted values: some of the values in the training set are altered from their original form
  - Missing values: one or more of the attribute values may be missing, both for examples in the training set and for objects which are to be classified

Difficult Training Set
- Non-representative data
  - Learning is based on only a few examples
  - Using a large database, the rules are probably representative
- Absence of boundary cases
  - Boundary cases are needed to find the real differences between two classes
- Limited information
  - Two objects to be classified have the same conditional attributes but are classified into different classes
  - There is not enough information to distinguish the two types of objects

Dynamic databases
- Databases change continually
- Rules that reflect the content of the database at all times are preferred
- If some changes are made, the whole learning process may have to be conducted again

Large databases
- The size of databases is ever increasing
- Machine learning algorithms were designed to handle small training sets (a few hundred examples)
- Much care is needed when using similar techniques on larger databases
- Large databases provide more knowledge (e.g. the number of rules may be enormous)

Data Mining: Issues in Data Mining
- User interaction / visualization
- Incorporation of background knowledge
- Noisy or incomplete data
- Determining the interestingness of patterns
- Efficiency and scalability
- Parallel and distributed mining
- Incremental learning / mining time-changing phenomena
- Mining from image / video / audio data
- Mining unstructured data
