Information Visualization: Data Mining - 1
Matt Cooper

Big Data

Books
1. “Principles of Data Mining”
  • David Hand, Heikki Mannila, Padhraic Smyth
  • Mostly about data mining algorithms
2. “Data Preparation for Data Mining”
  • Dorian Pyle
  • Concentrates on data preparation

Part 1: What is the problem?
• Motivation: what is the goal of data mining?
• What is data mining?
• How is it used?
• How does data mining relate to:
  • InfoViz
  • Knowledge discovery
  • VDM – Visual Data Mining

What is InfoViz
• Q. What is Visualization?
• A. Using some medium/media to convey a representation of some data so that the user can form a cognitive understanding of the data
• It is *not* making pictures!

Visualization
• Often displayed like this:
  [Diagram: Data -> Transform -> New data -> Mapping -> Representation -> Display -> Perception]
• Transform = data filtering
• Mapping?
• Representation?

For Scientific Visualization:
• Representation: false ‘picture’ of physical qualities
  • Molecules
  • Fluid flows
  • Body bits
• Primarily 3D -> volume displays
• Very occasionally higher dimensionality
• Sometimes with time -> ‘animation’

For InfoViz
• Data has no ‘real’ representation
• Data isn’t 3D - it’s often quite abstract
• No ‘spatial’ relationships at all
• Data items comprise many different fields
• Imagine characterizing a person
  • SciViz – 3D or maybe 4D
  • InfoViz – a zillion dimensions
• What representation?

Data Mining
• Having an (enormous) amount of data
• Wonder what it can tell us
• Isolate (unexpected) relationships
• (Hopefully) find some which are:
  • Interesting
  • Novel
  • Informative
  • Helpful
• “Secondary data analysis”

Data gathering
• We generate enormous amounts of data. Every time we:
  • Bank
  • Shop
  • Vote
  • Drive
  • Fly
  • Phone…
• This data is collected.

Data gathering (2)
• All this data is collectable!
  • Easy to collect and believed to have value
• We never throw anything away!
  • Easy to keep and believed to have value
• Technologies to gather new information are growing rapidly.

e.g. census data
• 2011 UK census
  • ~63 million people
  • ~35 questions each (more than three pages)
  • ~2+ billion data items

What is ‘Data Mining’
• ‘Statistics’ versus ‘data mining’
• Statistics:
  • Want to know the answer to a question
  • Gather suitable data (ask the question)
  • Analyse the answers
  • Gain (probabilistic?) insight into the answer

Database Query & Data mining
• Given a database of shoe-buyers…
  • Database: What size shoes do people in the income bracket 20000Kr-25000Kr buy?
  • Data mining: What common factors (if any) affect the size of shoes people buy?

Motivation
• “Everyone spoke of an information overload but what there was in fact was a non-information overload”
  • Richard Saul Wurman, “What-If, Could-be”, Philadelphia, 1976.
  • (Wrote the book “Information Anxiety”)

What is data mining?
• Extraction of interesting (non-trivial), previously unknown (and potentially useful) information or patterns from data in ((very) large) databases.
  • Inmon

Alternative names
• Knowledge discovery in databases (KDD)
• Knowledge extraction
• Data/pattern analysis
• Data archeology
• Information harvesting
• Business intelligence

What is not data mining?
• (Deductive) query processing
• Expert systems
• Statistical analysis

Data Mining: What Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories:
  • Object-oriented and object-relational databases
  • Time-series data and temporal data
  • Text databases and multimedia databases
  • Heterogeneous and legacy databases
  • WWW
  • Security data (images? video?...)

• Each of a (large) number (n) of data items is a ‘tuple’
• Tuple: a (large?) number (p) of items
  • Sometimes called a ‘feature vector’
• Each item may be:
  • Numeric
  • Textual
  • Other tuple (e.g. fingerprints, images, etc.)
• May be discrete or continuous
• Result is n points in a p-dimensional space

Example data set

ID   AGE  SEX  Education   Income
248  54   M    School      100 000
249  ??   F    Degree      127 831
250  9    M    Incomplete  0
251  85   F    PhD         56 348
252  32   ??   Degree      48 326
253  45   M    ??          ??

Problems with data
• What are the characteristics of the data?
• Holes
  • Missing data values
• Errors and ‘estimates’
  • Income of *exactly* 100000?
• Sample inconsistencies:
  • E.g. medical records with different numbers of readings for the same person

Objectives of DM
• Identifying patterns in data:
  • For representation
  • Because they are ‘interesting’
  • Unexpected!

Data Mining tasks
1. Exploratory Data Analysis
2. Descriptive Modelling
3. Predictive Modelling
  • Classification and Regression
4. Discovering Patterns and Rules
5. Retrieval by content

Aside: Models and Patterns
• Model:
  • A global summary of an entire data set.
  • Makes statements about any point in the full measurement space.
• Pattern:
  • Makes statements about relationships between variables only in localized regions of the measurement space.

1. Exploratory Data Analysis
• Pure data mining
• “Explore the data with no clear idea of what we are looking for”
• Typically a very visual approach
  • Very tied to ‘Visual Data Mining’
• Problems with:
  • Large numbers of data points
  • Large numbers of dimensions in data

2. Descriptive Modelling
• Attempt to describe all of the data
• Perhaps use:
  • Model of overall probability distribution in the p-dimensional space

Descriptive modelling (2)
• Partitioning into groups, e.g.:
  • Cluster analysis for natural grouping
  • Segmentation for user-desired groups

3. Predictive modelling
• Form a model of the data set which allows prediction of a variable based on the known values of the others
• (Prediction does not mean future here)

Predictive modelling (2)
• Classification
  • Prediction of a discrete variable
• Regression analysis
  • Prediction of a continuous variable

Descriptive and Predictive Modelling
• Q: “Why is PM (predictive modelling) not the same as DM (descriptive modelling)?”
  • Strong similarities, some similar methods
• A: The goals are subtly different:
  • DM is associated with the grouping in the variable space itself and identifying the groups.
  • PM is associated with predicting one variable.

4. Discovering Rules and Patterns
• Concerned with the identification of local patterns in sub-sets of the space.
• Examples:
  • Frequently occurring sets of transactions
  • Finding patterns of action indicating fraud

5. Retrieval by content
• Using a pattern of interest to locate similar patterns
  • Automatically…
• Examples:
  • Finding images with similar content
  • Finding text documents with similar content

Score functions
• All of the preceding classes of task share a common feature:
  • The notion of “is like” or “similarity”
  • Or difference (dissimilarity)
• Defined through a ‘scoring function’
• In numerical or categorical data this is often easy
• In general it is not…

Scoring functions (2)
• Is an orange like an apple?
• Yes:
  • Both are fruit.
  • Both grow on trees.
• No:
  • One is citrus, one isn’t.
  • One is orange, one is green/red.

Scoring functions (3)
• Is this picture
• Like this one?
  [Two images for comparison]

Scoring functions (4)
• Specification of the scoring function(s) is crucial to the effectiveness of the system.
• One of the biggest contributions the user has to make!

Example applications (1)
• Segmentation of sales data is extensively used to classify customers by purchasing patterns and demographic data (age, income etc.)
• Used to target marketing
• Example of descriptive modelling

Example applications (2)
• The Advanced Scout system
  • Analyses basketball game logs
  • Identifies features of players’ behaviour
    • Circumstances when they play well/badly
    • Which opposing players they are good or bad against
• An example of discovering rules and patterns

Example applications (3)
• Dr. John Snow’s cholera diagram
  • Example of Exploratory Data Analysis
  • Also Visual Data Mining
  • Done without knowing what caused cholera!

Example applications (4)
• SKICAT
  • Classifies stars and galaxies automatically from digital image data
  • Uses a 40-dimensional feature vector
  • Works as well as human experts
• Predictive modelling

Example applications (5)
• Image searching on the web
  • Both Altavista and Google had such functions ~2000
  • Both removed them
  • Google now has one again (2014)
• Face recognition for security (spotting terrorists)
  • Been trialled at several airports in the US
  • Very limited success to date
• Both examples of retrieval by content.

[Screenshots: Altavista Image Search (2000); Google Image Search (2015), with further screenshots labelled 2nd, 5th and 15th]

Example applications (6)
• Searching text documents for lies on CVs
  • Example of a retrieval-by-content method

Fraud Detection and Management
• Detecting inappropriate medical treatment
  • Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving Australia $1m/yr).
• Example of Descriptive/Predictive modelling

Summary (1)
• Data mining: discovering interesting models and patterns in data
  • ‘Simplifications’ enabling understanding!
• A natural evolution of database technology, in great demand, with wide applications
• Mining can be performed in a variety of information repositories

Summary (2)
• Information expert’s input still vital
  • Defining methods
  • Defining scoring functions

End of Part 1
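Worked sketch: a scoring function for the example data set

The slides say that a scoring function is "often easy" to define for numerical or categorical data, and that the data has holes. The sketch below is one possible illustration, not a method given in the slides: it treats each row of the example data set as a feature vector and computes a simplified Gower-style dissimilarity. Numeric fields contribute their absolute difference divided by the observed range, categorical fields contribute 0 if equal and 1 otherwise, and any field missing in either record is skipped. The names `RECORDS`, `field_ranges` and `dissimilarity`, and the decision to skip missing values rather than impute them, are assumptions made for this example.

```python
# Rows from the "Example data set" slide as (age, sex, education, income).
# None stands for the "??" holes in the slide's table.
RECORDS = {
    248: (54, "M", "School", 100000),
    249: (None, "F", "Degree", 127831),
    250: (9, "M", "Incomplete", 0),
    251: (85, "F", "PhD", 56348),
    252: (32, None, "Degree", 48326),
    253: (45, "M", None, None),
}


def field_ranges(rows):
    """Return max - min for each numeric column, or None for other columns."""
    ranges = []
    for column in zip(*rows):
        present = [v for v in column if v is not None]
        if present and all(isinstance(v, (int, float)) for v in present):
            ranges.append(float(max(present) - min(present)))
        else:
            ranges.append(None)  # treated as categorical
    return ranges


def dissimilarity(a, b, ranges):
    """Simplified Gower-style score in [0, 1]; 0 means identical on shared fields.

    Returns None when the two records have no field that is present in both.
    """
    total = 0.0
    used = 0
    for x, y, r in zip(a, b, ranges):
        if x is None or y is None:
            continue  # assumption: skip holes instead of imputing a value
        if r is None:
            total += 0.0 if x == y else 1.0  # categorical field
        else:
            total += abs(x - y) / r if r > 0 else 0.0  # numeric field
        used += 1
    return total / used if used else None


RANGES = field_ranges(list(RECORDS.values()))

if __name__ == "__main__":
    for other in (250, 251, 253):
        score = dissimilarity(RECORDS[248], RECORDS[other], RANGES)
        print(f"248 vs {other}: {score:.3f}")
```

Even in this small case the analyst has to make choices: whether to skip or impute holes, how to scale each field, and whether all fields should count equally. That is the sense in which specifying the scoring function is one of the biggest contributions the user has to make.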