Introduction to Data Mining
Dr. Sushil Kulkarni, Jai Hind College (sushiltry@yahoo.co.in)

Road Map
• Introduction to database
• A Problem and A Solution
• What Is Data Mining?
• Goal of Data Mining
• What is (not) Data Mining?
• Convergence of 3 key Technologies
• Data mining Functions
• Kinds of Data Mining Problems

What is Database?
A database is any organized collection of data.

Examples
• Co-workers
• Patient information
• Airline reservation system

Data vs. information
• What is data?
  – Data is unprocessed information.
• What is information?
  – Information is data that have been organized and communicated in a coherent and meaningful manner.
  – Data is converted into information, and information is converted into knowledge.
  – Knowledge: information evaluated and organized so that it can be used purposefully.

Why do we need a database?
• Keep records of our:
  – Clients
  – Staff
  – Volunteers
• Keep a record of activities and interventions
• Keep sales records
• Develop reports
• Perform research

Purpose of Database system
To transform: Data → Information → Knowledge → Action

Database
• Database: Shared collection of logically related data (and a description of this data), designed to meet the information needs of an organization.
• Database Management System: A software system that enables users to define, create, and maintain the database and that provides controlled access to this database.

Who and How to do it?
• Database Management System (DBMS) does this job.
• Using software tools: Access, FileMaker, Lotus Notes, Oracle or SQL Server, …….
• It includes tools to add, modify or delete data from the database, ask questions (or queries) about the data stored in the database and produce reports summarizing selected contents.

Hmm... Let’s jump to Data Mining
• With this background we will now see what data mining is.

A Problem …
• You are a marketing manager of a brokerage company
• Problem: Churn is too high
  > Turnover is 40% (after the six-month introductory period ends)
• Customers receive incentives (average cost: ₹160) when an account is opened
• Giving new incentives to everyone who might leave is very expensive (as well as wasteful)
• Bringing back a customer after they leave is both difficult and costly

A Solution …
• One month before the introductory period ends, predict which customers will leave
• If you want to keep a customer who is predicted to churn, offer them something based on their predicted value
  > The ones that are not predicted to churn need no attention
• If you don’t want to keep the customer, do nothing
• How can you predict future behavior?
  > Tarot Cards
  > Magic 8 Ball

KDD Process
• Knowledge discovery in databases (KDD) is a multi-step process of finding useful information and patterns in data.
• Data Mining is the use of algorithms to extract information and patterns derived by the KDD process.
• Many texts treat KDD and Data Mining as the same process, but it is also possible to think of Data Mining as the discovery part of KDD.

Steps of KDD Process
1. Selection (data extraction)
   – Obtain data from heterogeneous data sources: databases, data warehouses, the World Wide Web or other information repositories.
2. Preprocessing (data cleaning)
   – Incomplete, noisy or inconsistent data are cleaned.
   – Missing data may be ignored or predicted; erroneous data may be deleted or corrected.
3. Transformation (data integration)
   – Combine data from multiple sources into a coherent store.
   – Data can be encoded in common formats, normalized, reduced.
4. Data mining
   – Apply algorithms to the transformed data and extract patterns.
5. Pattern interpretation/evaluation
   – Pattern evaluation: evaluate the interestingness of the resulting patterns, or apply interestingness measures to filter the discovered patterns.
   – Knowledge presentation: present the mined knowledge; visualization techniques can be used.

What Is Data Mining? Some Definitions
• “The nontrivial extraction of implicit, previously unknown, and potentially useful information from data” (Piatetsky-Shapiro)
• “...the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, ... or data streams.” (Han, pg xxi)
• “...the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful...” (Witten, pg 5)
• “...finding hidden information in a database.” (Dunham, pg 3)
• “...the process of employing one or more computer learning techniques to automatically analyse and extract knowledge from data contained within a database.” (Roiger, pg 4)

Why Data Mining?
• That all sounds ... complicated. Why should I learn about Data Mining?
• What's wrong with just a relational database? Why would I want to go through these extra [complicated] steps?
• Isn't it expensive?
  It sounds like it takes a lot of skill, programming, computational time and storage space.
• Where's the benefit?
• Data Mining isn't just a cute academic exercise; it has very profitable real-world uses. Practically all large companies and many governments perform data mining as part of their planning and analysis.

Goal of Data Mining
• Simplification and automation of the overall statistical process, from data source(s) to model application
• Changed over the years
  > Statistician replace data to a model
  > Many different data mining algorithms / tools available
  > Statistical expertise required to build intelligence into the software

Data Mining is …

What is (not) Data Mining?
What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”
What is Data Mining?
– Certain names are more common in certain locations of Mumbai (Kulkarni, Shah, Iyer, …)
– Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)

DB vs. DM Processing
Database processing
• Query
  – Well defined
  – SQL
• Data
  – Operational data
• Output
  – Precise
  – Subset of database
Data mining processing
• Query
  – Poorly defined
  – No precise query language
• Data
  – Not operational data
• Output
  – Fuzzy
  – Not a subset of database

Convergence of 3 key Technologies

1. Increasing Computing Power
• Moore’s law: computing power doubles roughly every 18 months
• Powerful workstations became common
• Cost-effective servers (SMPs) provide parallel processing to the mass market
• Interesting tradeoff:
  > Small number of large analyses vs. large number of small analyses

1. The Data Explosion
• The rate of data creation is accelerating each year.
  In 2003, UC Berkeley estimated that the previous year generated 5 exabytes of data, of which 92% was stored on electronically accessible media.
  Mega < Giga < Tera < Peta < Exa ...
  All the data in all the books in the US Library of Congress is ~136 terabytes, so 2002 produced the equivalent of about 37,000 new Libraries of Congress.
• VLBI telescopes produce 16 gigabytes of data every second.
• Google searches 18 billion+ accessible web pages.

1. The Data Explosion: Implications
• As the amount of data increases, the proportion of information decreases.
• As more and more data is generated automatically, we need to find automatic solutions to turn those stored raw results into information.
• Companies need to turn stored data into profit ... Otherwise why are they storing it?

2. Improved Data Collection and Management
• Data Collection → Access → Navigation → Mining
• The more data the better (usually)

3. Statistical & Machine Learning Algorithms
• Techniques have often been waiting for computing technology to catch up
• Statisticians were already doing “manual data mining”
• Good machine learning is just the intelligent application of statistical processes
• A lot of data mining research has focused on tweaking existing techniques to get small percentage gains

3. Data/Information/Knowledge/Wisdom
• For example, a data mining application may tell you that there is a correlation between buying music magazines and beer, but it doesn't tell you how to use that knowledge. Should you put the two close together to reinforce the tendency, or should you put them far apart, as people will buy them anyway and thus stay in the store longer?
• Data mining can help managers plan strategies for a company; it does not give them the strategies.

Data mining Functions
• All Data Mining functions can be thought of as attempting to find a model to fit the data.
• Each function needs criteria to prefer one model over another.
• Each function needs a technique to compare the data.
• Two types of model:
  – Predictive models predict unknown values based on known data
  – Descriptive models identify patterns in data

Predictive Model
• A “black box” that makes predictions about the future based on information from the past and present
• Large number of inputs usually available

Kinds of Data Mining problems
Database
– Find all credit applicants with Aditi as first name
– Identify customers who have made purchases of more than ₹10,000 in the last month
– Find all customers who have purchased milk
Data Mining
– Find all credit applicants who are poor credit risks. (Classification)
– Identify customers with similar buying habits. (Clustering)
– Find all items which are frequently purchased with milk. (Association rules)

Kinds of Data Mining problems
• Classification
• Clustering
• Association Rule

Classification
Classification Model

Definition of Classification Problem
Given a database D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm}, the Classification Problem is to define a mapping f: D → C where each ti is assigned to one class.

Example: Credit Card

Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125 Cr          No
2    No      Married         100 Cr          No
3    No      Single          70 Cr           No
4    Yes     Married         120 Cr          No
5    No      Divorced        95 Cr           Yes
6    No      Married         60 Cr           No
7    Yes     Divorced        220 Cr          No
8    No      Single          85 Cr           Yes
9    No      Married         75 Cr           No
10   No      Single          90 Cr           Yes

Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75 Cr           ?
Yes     Married         50 Cr           ?
No      Married         150 Cr          ?
Yes     Divorced        90 Cr           ?
No      Single          40 Cr           ?
No      Married         80 Cr           ?

Training Set → Learn Classifier → Model (the model is then applied to the Test Set)

Another Example ...
• To which group does this object belong?
  Target object; Group 1: Delia; Group 2: Roses
  (Experiment reported on in Cognitive Science, 2002)

Resemblance
• People classify things by finding other, similar items which have already been classified.
• For example: Is a new species a bird? Does it have the same attributes as lots of other birds? If so, then it's probably a bird too.
• A combination of rote memorization and the notion of 'resembles'.
• Although kiwis can't fly like most other birds, they resemble birds more than they resemble other types of animals.
• So the problem is to find which instances most closely resemble the instance to be classified.

Few More Examples
• Loan companies can “give you results in minutes” by classifying you into a good credit risk or a bad risk, based on your personal information and a large supply of previous, similar customers.
• Cell phone companies can classify customers into those likely to leave, and hence need enticement, and those that are likely to stay regardless.
• The data generated by airplane engines can be used to determine when an engine needs to be serviced. By discovering the patterns that are indicative of problems, companies can service working engines less often (increasing profit) and discover faults before they materialise (increasing safety).

Clustering
• Classification is supervised learning -- the supervision comes from labeling the instances with the class.
• Clustering is unsupervised learning -- there are no predefined class labels, no training set.
• So our clustering algorithm needs to assign a cluster to each instance such that objects in the same cluster are more similar to one another than to objects in other clusters.
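As a contrast with clustering, the supervised "resemblance" idea described earlier (classify a new instance by finding the labelled instances it most resembles) can be sketched as a k-nearest-neighbour classifier. This is only an illustrative sketch, not a method named in the slides: the rows are copied from the training set of the credit card example, while the `distance` function (categorical mismatches plus a scaled income gap) and the choice k = 3 are assumptions made here.

```python
from collections import Counter

# Labelled rows from the credit card example: (refund, marital_status, income_cr, cheat)
TRAINING = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Spread of incomes in the training data, used to scale income differences.
INCOME_RANGE = max(row[2] for row in TRAINING) - min(row[2] for row in TRAINING)


def distance(a, b):
    """Hypothetical dissimilarity: count of categorical mismatches plus scaled income gap."""
    mismatches = (a[0] != b[0]) + (a[1] != b[1])
    income_gap = abs(a[2] - b[2]) / INCOME_RANGE
    return mismatches + income_gap


def classify(record, k=3):
    """Return the majority class among the k training rows that most resemble `record`."""
    neighbours = sorted(TRAINING, key=lambda row: distance(record, row))[:k]
    votes = Counter(row[3] for row in neighbours)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Two unlabelled rows from the test set of the same example.
    print(classify(("No", "Single", 75)))
    print(classify(("Yes", "Married", 50)))
```

The labels come from the training rows (supervision); a clustering algorithm would receive the same rows without the final column.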
Clustering
• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
• The goal is to find the most 'natural' groupings of the instances.
  - Within a cluster: maximize similarity between instances (intra-cluster distances are minimized).
  - Between clusters: minimize similarity between instances (inter-cluster distances are maximized).
• For example, we might have data plotted on two axes, where the axes are two dimensions and shape is a third, nominal attribute.
• A clustering algorithm might find three clusters, even though there are some squares and circles mixed together.

Outliers
Figure labels: Outliers, Cluster 1, Cluster 2

What is a natural grouping among these objects?
• Clustering is subjective
• Possible groupings: Tatkare’s family, school employees, females, males

What is Similarity?
• “The quality or state of being similar; likeness; resemblance; as, a similarity of features.” (Webster's Dictionary)
• Similarity is hard to define, but… “We know it when we see it”
• The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.

Clustering Problem
• Given a database D = {t1, t2, …, tn} of tuples and an integer value k, the Clustering Problem is to define a mapping f: D → {1, …, k} where each ti is assigned to one cluster Kj, 1 <= j <= k.
• A cluster, Kj, contains precisely those tuples mapped to it.
• Unlike the classification problem, the clusters are not known a priori.
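The mapping from tuples to cluster numbers 1..k defined above can be produced by many algorithms. The following is a minimal sketch using k-means, which the slides do not name; the 2-D points, the Euclidean distance and the naive "first k points" initialisation are all assumptions chosen for illustration.

```python
import math


def kmeans(points, k, iterations=20):
    """Return a list f where f[i] in 1..k is the cluster assigned to points[i]."""
    # Naive initialisation (assumption of this sketch): the first k points are the centroids.
    centroids = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Shift to 1-based cluster numbers to match the definition f: D -> {1, ..., k}.
    return [a + 1 for a in assignment]


if __name__ == "__main__":
    # Hypothetical data: two visibly separate groups of 2-D points.
    data = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
    print(kmeans(data, k=2))
```

No class labels are supplied: the groups emerge only from the distances between points, which is what distinguishes this from the classification problem.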
Applications
• Marketing: Discover consumer groups based on their purchasing habits
• City Planning: Identify groups of buildings by type, value, location
• Image Processing: Identify clusters of similar images (e.g. horses)
• Biological: Discover groups of plants/animals with similar properties
• Document clustering
  – Given: a source of textual documents and a similarity measure (e.g., how many words are common in these documents)
  – Find: several clusters of documents that are relevant to each other
  – Flow: documents source + similarity measure → clustering system → clusters of documents

Association Rules
• A common application is market basket analysis, which (1) finds items that are frequently sold together at a supermarket and (2) helps decide how to arrange items on shelves and which items should be promoted together.

Association Rule Discovery
• Given a set of records, each of which contains some number of items from a given collection:
  – Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

Association Rule Discovery
Market basket rule form: “Body --> Head [support, confidence]”
buys(X, 'beer') --> buys(X, 'snacks') [1%, 60%]
(a) Of the customers X who purchased 'beer', 60% also purchased 'snacks' (confidence)
(b) 1% of all transactions contain the items 'beer' and 'snacks' together (support)

A weka is a strong brown bird which is native to New Zealand and grows to be about the same size as a chicken.
The weka was once fairly common on the North and South Islands of New Zealand, but over the years it has declined heavily on the North Island due to major damage to its habitat.

• WEKA has three graphical user interfaces:
  – “The Explorer” (exploratory data analysis)
  – “The Experimenter” (experimental environment)
  – “The KnowledgeFlow” (new process model inspired interface)
• WEKA is available at http://www.cs.waikato.ac.nz/ml/weka

References
• Witten, Ian and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, Second Edition, Morgan Kaufmann, 2005
• Dunham, Margaret H., Data Mining: Introductory and Advanced Topics, Prentice Hall, 2003

References: Yahoo Group
• ‘dbmsnotes’ http://tech.groups.yahoo.com/group/dbmsnotes/