Outline
- Motivation
- Data mining primitives
- Data mining query languages
- Designing GUIs for data mining systems
- Architectures
Data Mining Primitives
CS 5331 by Rattikorn Hewett
Texas Tech University
Motivations: Why primitives?
- Data mining systems uncover a large set of patterns: not all are interesting.
- Data mining should be an interactive process: the user directs what is to be mined.
- Users need data mining primitives to communicate with the data mining system, by incorporating them in a data mining query language.

Data mining primitives
- Data mining tasks can be specified in the form of data mining queries by five data mining primitives:
  - Task-relevant data (input)
  - The kinds of knowledge to be mined (function & output)
  - Background knowledge (interpretation)
  - Interestingness measures (evaluation)
  - Visualization of the discovered patterns (presentation)
- Benefits:
  - More flexible user interaction
  - Foundation for the design of graphical user interfaces
  - Standardization of data mining industry and practice
Task-relevant data
- Specify the data to be mined:
  - Database, data warehouse, relation, cube
  - Conditions for selection & grouping
  - Relevant attributes

Knowledge to be mined
- Specify the data mining "functions":
  - Characterization/discrimination
  - Association
  - Classification/prediction
  - Clustering
Background Knowledge
- Typically, in the form of concept hierarchies:
  - Schema hierarchy: e.g., street < city < state < country
  - Set-grouping hierarchy: e.g., {30..49} = low, {50..100} = high, {low, high} = all
  - Operation-derived hierarchy: e.g., an email address such as dmbook@cs.ttu.edu yields login-name < department < university < organization
  - Rule-based hierarchy: e.g., 87 <= temperature < 90 -> normal_temperature

Interestingness
- Objective measures:
  - Simplicity: (association) rule length, (decision) tree size; simpler rules are easier to understand and more likely to be interesting
  - Certainty: validity of the rule. A rule A => B has confidence P(B|A) = #(A and B) / #(A); related measures include classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
  - Support: a rule A => B has support #(A and B) / sample size; noise threshold (for description)
  - Utility: potential usefulness
  - Novelty: not previously known, surprising (used to remove redundant rules)
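The support and confidence measures above can be computed directly from transaction data. A minimal sketch, assuming a hypothetical set of market-basket transactions and the rule {bread} => {butter}:

```python
# Hypothetical toy transactions (not from the slides), used to
# illustrate support = #(A and B)/n and confidence = #(A and B)/#(A).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter"},
    {"bread", "butter", "jam"},
]

n = len(transactions)
count_a = sum(1 for t in transactions if "bread" in t)               # #(A)
count_ab = sum(1 for t in transactions if {"bread", "butter"} <= t)  # #(A and B)

support = count_ab / n           # #(A and B) / sample size
confidence = count_ab / count_a  # P(B|A) = #(A and B) / #(A)

print(support, confidence)  # 0.6 0.75
```

Here {bread} appears in 4 of 5 transactions and {bread, butter} in 3, giving support 3/5 and confidence 3/4.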
Visualization of Discovered Patterns
- Specify the form in which to view the patterns: e.g., rules, tables, charts, decision trees, cubes, reports, etc.
- Specify operations for data exploration at multiple levels of abstraction: e.g., drill-down, roll-up, etc.

DMQL (data mining query language)
- A DMQL can provide the ability to support ad-hoc and interactive data mining
- By providing a standardized language, the hope is to achieve an effect similar to that of SQL on relational databases:
  - Foundation for system development and evolution
  - Facilitates information exchange, technology transfer, commercialization and wide acceptance
- DMQL is designed with the primitives described earlier
Languages & Standardization Efforts
- Association rule language specifications:
  - MSQL (Imielinski & Virmani '99)
  - MineRule (Meo, Psaila and Ceri '96)
  - Query flocks based on Datalog syntax (Tsur et al. '98)
- OLEDB for DM (Microsoft 2000)
  - Based on OLE, OLE DB, OLE DB for OLAP
  - Integrating DBMS, data warehouse and data mining
- CRISP-DM (CRoss-Industry Standard Process for Data Mining)
  - Provides a platform and process structure for effective data mining
  - Emphasizes deploying data mining technology to solve business problems

Designing GUIs based on DMQL
- What tasks should be considered in the design of GUIs based on a data mining query language?
  - Data collection and data mining query composition
  - Presentation of discovered patterns
  - Hierarchy specification and manipulation
  - Manipulation of data mining primitives
  - Interactive multilevel mining
  - Other information
Architectures
Coupling the data mining system with a DB/DW system:
- No coupling: flat file processing; not recommended
- Loose coupling: fetching data from the DB/DW
- Semi-tight coupling: enhanced DM performance
  - Provide efficient implementations of a few data mining primitives in the DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some statistical functions
- Tight coupling: a uniform information processing environment
  - DM is smoothly integrated into the DB/DW system; mining queries are optimized using mining query analysis, indexing, query processing methods, etc.
Concept Description
Outline
- Review terms
- Characterization
  - Summarization
  - Hierarchical generalization
  - Attribute relevance analysis
- Comparison/discrimination
- Descriptive statistical measures

Review terms
- Descriptive vs. predictive data mining
  - Descriptive: describes the data set in concise, summarative, informative, discriminative forms
  - Predictive: constructs models representing the data set, and uses them to predict behaviors of unknown data
- Concept description involves:
  - Characterization: provides a concise and succinct summarization of the given collection of data
  - Comparison (discrimination): provides descriptions comparing two or more collections of data
Concept Description vs. OLAP
- Concept description:
  - can handle complex data types (e.g., text, image) for the attributes and their aggregations
  - a more automated process
- OLAP:
  - restricted to a small number of dimension and measure data types
  - a user-controlled process
Characterization methods
- One approach to characterization is to transform data from low conceptual levels to high ones: "data generalization"
  - E.g., daily sales -> annual sales; Biology -> Science
- Two methods:
  - Summarization, as in the data cube's OLAP
  - Hierarchical generalization: attribute-oriented induction

Summarization by OLAP
- Data are stored in data cubes
- Identify summarization computations, e.g., count(), sum(), average(), max()
- Perform the computations and store the results in data cubes
- Generalization and specialization can be performed on a data cube by roll-up and drill-down
- An efficient implementation of data generalization
- Limitations:
  - Can handle only simple non-numeric data types for dimensions
  - Can handle only summarization of numeric data
  - Does not guide users on which dimensions to explore or which levels to reach
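A roll-up over a tiny fact table can be sketched in plain Python (the sales data and dimension values below are hypothetical, chosen to mirror the daily sales -> annual sales example; no OLAP engine is assumed):

```python
from collections import defaultdict

# Hypothetical fact table: (city, month, sales). Rolling up the time
# dimension from month to year generalizes the data one level.
facts = [
    ("Lubbock", "2003-01", 100),
    ("Lubbock", "2003-02", 150),
    ("Dallas",  "2003-01", 200),
    ("Dallas",  "2004-03", 300),
]

rollup = defaultdict(int)
for city, month, sales in facts:
    year = month.split("-")[0]     # climb the time hierarchy: month -> year
    rollup[(city, year)] += sales  # sum() is the summarization computation

print(dict(rollup))
```

Drill-down is the reverse: returning to the finer (city, month) cells stored in the cube.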
Attribute-Oriented Induction
- Proposed in 1989 (KDD '89 workshop)
- Not confined to categorical data nor particular measures
- How is it done?
  - Collect the task-relevant data (initial relation) using a relational database query
  - Perform generalization by attribute removal or attribute generalization
  - Apply aggregation by merging identical, generalized tuples and accumulating their respective counts
  - Interactive presentation with users
Basic Elements
- Data focusing: task-relevant data, including dimensions; the result is the initial relation
- Attribute removal and attribute generalization, when attribute A has a large set of distinct values:
  - If there is no generalization operator on A, or A's higher-level concepts are expressed in terms of other attributes (giving redundancy): remove A
  - If there exists a set of generalization operators on A: select an operator to generalize A
- Generalization threshold controls:
  - Attribute generalization threshold: controls the size of an attribute's value set for generalization or removal (~2-8, specified/default)
  - Relation generalization threshold: controls the final relation/rule size (~10-30)

General Steps
1. InitialRel: query processing of the task-relevant data, deriving the initial relation
2. PreGen: based on the analysis of the number of distinct values in each attribute, determine a generalization plan for each attribute: removal? or how high to generalize?
3. PrimeGen: based on the PreGen plan, perform generalization to the right level to derive a "prime generalized relation", accumulating the counts
4. Presentation: user interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations
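The generalize-and-merge core of the steps above can be sketched as follows (the concept hierarchies and tuples are toy assumptions for illustration, loosely modeled on the student example):

```python
from collections import Counter

# Hypothetical concept hierarchies acting as generalization operators.
major_hier = {"CS": "Science", "Physics": "Science", "EE": "Eng"}
city_hier = {"Vancouver": "Canada", "Montreal": "Canada", "Seattle": "Foreign"}

# Initial relation: (name, major, birth_city). "name" has many distinct
# values and no generalization operator, so it is removed.
tuples = [
    ("Jim", "CS", "Vancouver"),
    ("Scott", "CS", "Montreal"),
    ("Laura", "Physics", "Seattle"),
    ("Amy", "EE", "Seattle"),
]

# Generalize each retained attribute one level up its hierarchy, then
# merge identical generalized tuples, accumulating their counts.
prime = Counter(
    (major_hier[major], city_hier[city]) for _, major, city in tuples
)

print(dict(prime))
```

The resulting Counter plays the role of the prime generalized relation: each key is a generalized tuple and each value its accumulated count.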
Example
DMQL: describe general characteristics of graduate students in the Big-University database:

  use Big_University_DB
  mine characteristics as "Science_Students"
  in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
  from student
  where status in "graduate"

Transform to the corresponding SQL statement:

  select name, gender, major, birth_place, birth_date, residence, phone#, gpa
  from student
  where status in {"Msc", "MBA", "PhD"}

Initial relation:

  Name           | Gender | Major   | Birth_Place           | Birth_date | Residence                | Phone #  | GPA
  Jim Woodman    | M      | CS      | Vancouver, BC, Canada | 8-12-76    | 3511 Main St., Richmond  | 687-4598 | 3.67
  Scott Lachance | M      | CS      | Montreal, Que, Canada | 28-7-75    | 345 1st Ave., Richmond   | 253-9106 | 3.70
  Laura Lee      | F      | Physics | Seattle, WA, USA      | 25-8-70    | 125 Austin Ave., Burnaby | 420-5232 | 3.83
  ...            | ...    | ...     | ...                   | ...        | ...                      | ...      | ...

Example (cont.)
Generalization plan: Name is removed; Gender is retained; Major is generalized to {Sci, Eng, Bus}; Birth_Place is generalized to country (Birth_country); Birth_date is generalized to Age_range; Residence is generalized to city; Phone # is removed; GPA is generalized to {Excl, VG, ...}.

Prime generalized relation:

  Gender | Major   | Birth_country | Age_range | Residence | GPA       | Count
  M      | Science | Canada        | 20-25     | Richmond  | Very-good | 16
  F      | Science | Foreign       | 25-30     | Burnaby   | Excellent | 22
  ...    | ...     | ...           | ...       | ...       | ...       | ...

Presentation of results
- Generalized relation: relations where some or all attributes are generalized, with counts or other aggregation values accumulated
- Cross tabulation: mapping results into a cross tabulation, e.g., counts of attribute Birth_Region by Gender:

  Birth_Region | M  | F  | Total
  Canada       | 16 | 10 | 26
  Foreign      | 14 | 22 | 36
  Total        | 30 | 32 | 62

- Visualization techniques: pie charts, bar charts, curves, cubes, and other visual forms
- Quantitative characteristic rules: mapping the generalized result into characteristic rules with quantitative information associated with them (e.g., t = typicality):

  grad(x) ∧ male(x) => birth_region(x) = "Canada" [t: 53%] ∨ birth_region(x) = "foreign" [t: 47%]
Analysis of Attribute Relevance
- Goal: filter out statistically irrelevant attributes, or rank attributes for mining
  - Irrelevant attributes lead to inaccurate or unnecessarily complex patterns
- Idea: compute a measure that quantifies the relevance of an attribute with respect to a given class or concept
  - An attribute is highly relevant for classifying/predicting a class if its values are likely to distinguish that class from others
  - E.g., to describe cheap vs. expensive cars, is "color" a relevant attribute? What about using "color" to compare bananas and apples?

Methods
- These measures can be:
  - Information gain
  - The Gini index
  - Uncertainty
  - Correlation coefficients
Example
How relevant is the attribute "Major" to the classification of graduate/undergraduate students?
Relevance measure: information gain

Review formulae:
- For an attribute value set S, where each data point is labeled with a class in C and pi is the probability that class i occurs in S, the entropy is

  Ent(S) = - Σ_{i ∈ C} pi log2 pi

- The expected information needed to classify a sample if it is partitioned into sets Si, one per value i of attribute A, is

  I(A) = Σ_{i ∈ dom(A)} (|Si| / |S|) Ent(Si)

- Information gain: Gain(A) = Ent(S) - I(A)

Data (the first six rows are the 120 graduates; the last six are the 130 undergraduates):

  Gender | Major    | Birth_country | Age_range | GPA       | Count
  M      | Science  | Canada        | 20-25     | Very-good | 16
  F      | Science  | Foreign       | 25-30     | Excellent | 22
  M      | Eng      | Foreign       | ...       | ...       | 18
  F      | Science  | Foreign       | ...       | ...       | 25
  M      | Science  | Canada        | ...       | ...       | 21
  F      | Eng      | Canada        | ...       | ...       | 18
  M      | Science  | Foreign       | ...       | ...       | 18
  F      | Business | Canada        | ...       | ...       | 20
  M      | Business | Canada        | ...       | ...       | 22
  F      | Science  | Canada        | ...       | ...       | 24
  M      | Eng      | Foreign       | ...       | ...       | 22
  F      | Eng      | Canada        | ...       | ...       | 24

Dom(Major) = {Science, Eng, Business}
Partition the data into S_Sc, S_Eng, S_Bus: the sets of data points whose Major is Science, Eng, and Business, respectively.
Example (cont.)
120 Graduates: Science = 84 (= 16+22+25+21), Eng = 36, Business = 0
130 Undergraduates: Science = 42, Eng = 46, Business = 42

Class information captured from S:

  Ent(S) = -(120/250) log2(120/250) - (130/250) log2(130/250) = 0.9988

Expected class information induced by the attribute Major:

  Ent(S_Sc)  = -(84/126) log2(84/126) - (42/126) log2(42/126) = 0.9183
  Ent(S_Eng) = -(36/82) log2(36/82) - (46/82) log2(46/82) = 0.9892
  Ent(S_Bus) = -(0/42) log2(0/42) - (42/42) log2(42/42) = 0   (taking 0 log2 0 = 0)

  I(Major) = (126/250) Ent(S_Sc) + (82/250) Ent(S_Eng) + (42/250) Ent(S_Bus) = 0.7873

Information gain:

  Gain(Major) = Ent(S) - I(Major) = 0.9988 - 0.7873 = 0.2115

Similarly, find Gain(Gender), Gain(Birth_country), Gain(Age_range), and Gain(GPA).
- We can rank the "importance" or degree of "relevance" of attributes by their Gain values
- We can use a threshold to prune out attributes that are less "relevant"
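The entropy and gain computation above is mechanical enough to check in a few lines of Python, using the class counts given in the example:

```python
from math import log2

def ent(counts):
    """Entropy of a class distribution given as a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Class distribution over S: 120 graduates vs. 130 undergraduates.
ent_s = ent([120, 130])  # ~0.9988

# Partitions of S by Major, as (graduate, undergraduate) counts.
partitions = {"Science": [84, 42], "Eng": [36, 46], "Business": [0, 42]}

n = 250
i_major = sum(sum(c) / n * ent(c) for c in partitions.values())  # ~0.7873
gain_major = ent_s - i_major                                     # ~0.2115

print(round(ent_s, 4), round(i_major, 4))
```

The `if c > 0` guard implements the 0 log2 0 = 0 convention needed for the empty Business class among graduates.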
Class comparison
- Goal: mine properties (or rules) that compare a target class with a contrasting class
- The two classes must be comparable:
  - E.g., address and gender are not comparable
  - store_address and home_address are comparable
  - CS students and Eng students are comparable
- Comparable classes should be generalized to the same conceptual level
- Approaches:
  - Use attribute-oriented induction or a data cube to generalize the data for the two contrasting classes, then compare the results
  - Pattern recognition approach: approximate discriminating rules from a data set, repeatedly fine-tuning until the errors are small enough
Descriptive statistical measures
Data characteristics that can be computed:
- Central tendency:
  - mean: when is "mean" not an appropriate measure?
  - median: for a very large data set, how do we compute the median?
- Dispersion:
  - five-number summary: Min, Quartile1, Median, Quartile3, Max
  - variance, standard deviation: spread about the mean. What does var = 0 mean?
- Outliers:
  - detected by rules of thumb: values falling at least 1.5 × (Q3 - Q1) above Q3 or below Q1
- Useful displays: boxplots, quantile-quantile plot (q-q plot), scatter plot, loess curve
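The five-number summary and the 1.5 × IQR outlier rule of thumb can be sketched with the standard library's quantile function (the sample below is hypothetical, with one value planted to be flagged):

```python
import statistics

# Hypothetical sample; 40 is deliberately far above the rest.
data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 11, 12, 40]

q1, median, q3 = statistics.quantiles(data, n=4)  # Quartile1, Median, Quartile3
summary = (min(data), q1, median, q3, max(data))  # five-number summary

iqr = q3 - q1  # interquartile range, Q3 - Q1
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print(summary)
print(outliers)  # [40]
```

Note that `statistics.quantiles` defaults to the exclusive (n+1-based) method, so the exact quartile values can differ slightly from other conventions; the outlier rule is a rule of thumb either way.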
References
E. Baralis and G. Psaila. Designing templates for mining association rules. Journal of Intelligent
Information Systems, 9:7-32, 1997.
Microsoft Corp., OLEDB for Data Mining, version 1.0, http://www.microsoft.com/data/oledb/dm,
Aug. 2000.
J. Han, Y. Fu, W. Wang, K. Koperski, and O. R. Zaiane, “DMQL: A Data Mining Query Language
for Relational Databases”, DMKD'96, Montreal, Canada, June 1996.
T. Imielinski and A. Virmani. MSQL: A query language for database mining. Data Mining and
Knowledge Discovery, 3:373-408, 1999.
M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding interesting
rules from large sets of discovered association rules. CIKM’94, Gaithersburg, Maryland, Nov.
1994.
R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96,
pages 122-133, Bombay, India, Sept. 1996.
A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems.
IEEE Trans. on Knowledge and Data Engineering, 8:970-974, Dec. 1996.
S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational
database systems: Alternatives and implications. SIGMOD'98, Seattle, Washington, June 1998.
D. Tsur, J. D. Ullman, S. Abitboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A
generalization of association-rule mining. SIGMOD'98, Seattle, Washington, June 1998.