Data Mining: A Database Perspective
Presented by YC Liu

Reference
• Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Chapter 6.
• M.S. Chen, J. Han, and P.S. Yu, "Data Mining: An Overview from a Database Perspective", IEEE Transactions on Knowledge and Data Engineering, 8(6): 866-883, 1996.
• J. Liu, Y. Pan, K. Wang, and J. Han, "Mining Frequent Item Sets by Opportunistic Projection", in Proc. 2002 Int. Conf. on Knowledge Discovery in Databases (KDD'02), Edmonton, Canada, July 2002.

Outline
• Introduction
• Mining Association Rules
• Multilevel Data Generalization, Summarization, and Characterization
• Data Classification
• Clustering Analysis
• (Pattern-Based Similarity Search)
• (Mining Path Traversal Patterns)
• (Recommendation)
• (Web Mining)
• (Text Mining)

Introduction (1/5)
• Knowledge Discovery in Databases
  – A process of nontrivial extraction of implicit, previously unknown, and potentially useful information.

Introduction (2/5)
• Main uses (Data → Knowledge)
  – Mine knowledge from databases
  – Understand user behavior
  – Support business decision making
  – Create new business opportunities
• Why has data mining emerged?
  – Widespread use of product bar codes
  – Computerization of the business world
  – Millions of databases in active use
  – Large volumes of business transaction data accumulated over many years

Introduction (3/5)
• Data Mining: A KDD Process
  – Data mining is the core of the knowledge discovery process.
  – Typical flow: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation

Introduction (4/5): Challenges of Data Mining (1/2)
• Handling of different types of data
• Efficiency and scalability of data mining algorithms
• Usefulness, certainty, and expressiveness of data mining results
• Expression of various kinds of data mining requests and results

Introduction (5/5): Challenges of Data Mining (2/2)
• Interactive mining of knowledge at multiple abstraction levels
• Mining information from different sources of data
• Protection of privacy and data security

An Overview of Data Mining Techniques
• Classifying data mining techniques
  – What kinds of databases to work on
    • Relational database, transaction database, spatial database, temporal database, ...
  – What kinds of knowledge to be mined
    • Association rules, classification, clustering, ...
  – What kinds of techniques to be utilized
    • Generalization-based mining, pattern-based mining, mining based on statistics or mathematical theories

Mining Different Kinds of Knowledge from Databases
• Association rules
• Data generalization, summarization, and characterization
• Data classification
• Data clustering
• Pattern-based similarity search
• Path traversal patterns
• Recommendation
• Web mining
• Text mining

Mining Association Rules
• An association rule is an implication of the form X ⇒ Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅.
• The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y.
• The rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y.

What Is Association Mining?
• Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Applications: cross-marketing and attached mailing; also catalog design, add-on sales, store layout, and customer segmentation based on buying patterns.
• Examples, in the rule form "Body ⇒ Head [support, confidence]":
  – buys(x, "diapers") ⇒ buys(x, "beers") [0.5%, 60%]
  – major(x, "CS") ^ takes(x, "DB") ⇒ grade(x, "A") [1%, 75%]
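The support and confidence definitions above can be illustrated with a few lines of code. This is a minimal sketch added for clarity, not part of the original slides; the transactions are the toy data used later in this deck.

    # Minimal sketch: support and confidence of a rule X => Y over a list of
    # transactions (toy data matching the example used later in these slides).
    transactions = [
        {"A", "B", "C"},
        {"A", "C"},
        {"A", "D"},
        {"B", "E", "F"},
    ]

    def support(itemset, transactions):
        """Fraction of transactions that contain every item in `itemset`."""
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def confidence(x, y, transactions):
        """Fraction of the transactions containing X that also contain Y."""
        return support(x | y, transactions) / support(x, transactions)

    print(support({"A", "C"}, transactions))       # 0.5   -> 50% support
    print(confidence({"A"}, {"C"}, transactions))  # 0.666 -> 66.6% confidence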
Association Rule: Basic Concepts
• Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
• Find: all rules that correlate the presence of one set of items with that of another set of items
  – E.g., 98% of people who purchase tires and auto accessories also get automotive services done
• Applications
  – * ⇒ Maintenance Agreement (what should the store do to boost Maintenance Agreement sales?)
  – Home Electronics ⇒ * (what other products should the store stock up on?)

Rule Measures: Support and Confidence
• Find all the rules X & Y ⇒ Z with minimum confidence and support
  – support, s: probability that a transaction contains {X ∪ Y ∪ Z}
  – confidence, c: conditional probability that a transaction having {X ∪ Y} also contains Z
• [Venn diagram: "customer buys beer", "customer buys diaper", and their overlap "customer buys both"]
• Example transactions:

  Transaction ID | Items Bought
  2000           | A, B, C
  1000           | A, C
  4000           | A, D
  5000           | B, E, F

• With minimum support 50% and minimum confidence 50%, we have
  – A ⇒ C (50%, 66.6%)

Association Rule Mining: A Road Map
• Boolean vs. quantitative associations (based on the types of values handled)
  – buys(x, "SQLServer") ^ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]
  – age(x, "30..39") ^ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
• Single-dimension vs. multidimensional associations
  – age(x, "30..39") ^ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%] is multidimensional
• Single-level vs. multiple-level analysis
  – What brands of beers are associated with what brands of diapers?
• Various extensions
  – Correlation, causality analysis

Mining Association Rules: An Example
• Transactions (min. support 50%, min. confidence 50%):

  Transaction ID | Items Bought
  2000           | A, B, C
  1000           | A, C
  4000           | A, D
  5000           | B, E, F

• Frequent itemsets and their support: {A} 75%, {B} 50%, {C} 50%, {A, C} 50%
• For the rule A ⇒ C:
  – support = support({A, C}) = 50%
  – confidence = support({A, C}) / support({A}) = 66.6%
• The Apriori principle: any subset of a frequent itemset must be frequent.

Mining Association Rules
• Steps for mining association rules
  – Discover all large itemsets
  – Use the large itemsets to generate the association rules for the database
• To identify the large itemsets
  – Algorithm Apriori (a sketch appears at the end of this section)

Mining Generalized and Multilevel Association Rules
• Interesting associations among data items often occur at a relatively high concept level

Interestingness of Discovered Association Rules
• Example 1 (Aggarwal & Yu, PODS98): among 5000 students
  – 3000 play basketball
  – 3750 eat cereal
  – 2000 both play basketball and eat cereal
• play basketball ⇒ eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
• play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence.

             | basketball | not basketball | sum (row)
  cereal     | 2000       | 1750           | 3750
  not cereal | 1000       | 250            | 1250
  sum (col.) | 3000       | 2000           | 5000

Interestingness of Discovered Association Rules
• An association rule "A ⇒ B" is interesting if its confidence exceeds a certain measure, i.e.
  P(A ∪ B) / P(A) - P(B) > d,
  where d is a suitable constant.

Improving the Efficiency of Mining Association Rules
• Database scan reduction
  – FP-tree, ...
• Sampling
• Incremental updating of discovered association rules
• Parallel data mining
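Here is the promised sketch of Apriori, the algorithm named above for identifying the large (frequent) itemsets. It is a minimal, illustrative Python version that assumes an in-memory transaction list rather than a database-resident implementation; the `minsup` value and data mirror the example slide.

    from itertools import combinations

    def apriori(transactions, minsup):
        """Return all frequent itemsets (frozensets) with support >= minsup.
        Uses the Apriori principle: every subset of a frequent itemset is
        frequent, so level-k candidates are built only from frequent (k-1)-itemsets."""
        n = len(transactions)

        def support(itemset):
            return sum(1 for t in transactions if itemset <= t) / n

        items = {i for t in transactions for i in t}
        frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= minsup}
        result, k = set(frequent), 2

        while frequent:
            # Candidate generation: join frequent (k-1)-itemsets into size-k sets
            candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
            # Prune candidates with an infrequent (k-1)-subset, then count support
            candidates = {c for c in candidates
                          if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
            frequent = {c for c in candidates if support(c) >= minsup}
            result |= frequent
            k += 1
        return result

    # Example from the slides, with minimum support 50%
    txns = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
    print(apriori(txns, 0.5))   # frequent itemsets: {A}, {B}, {C}, {A, C}

The rules themselves are then generated from each frequent itemset by checking the confidence of every split into body and head, as described in the steps above.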
Classification
• A process of learning a function that maps a data item into one of several predefined classes.
• Every classification based on inductive-learning algorithms takes as input a set of samples, each consisting of a vector of attribute values and a corresponding class.
• Predicts categorical class labels.
• Classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data.

Classification Process (1): Model Construction
• Training data → classification algorithm → classifier (model)

  NAME | RANK           | YEARS | TENURED
  Mike | Assistant Prof | 3     | no
  Mary | Assistant Prof | 7     | yes
  Bill | Professor      | 2     | yes
  Jim  | Associate Prof | 7     | yes
  Dave | Assistant Prof | 6     | no
  Anne | Associate Prof | 3     | no

• Resulting model: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction
• Apply the classifier first to the testing data, then to unseen data.

  NAME    | RANK           | YEARS | TENURED
  Tom     | Assistant Prof | 2     | no
  Merlisa | Associate Prof | 7     | no
  George  | Professor      | 5     | yes
  Joseph  | Assistant Prof | 7     | yes

• Unseen data: (Jeff, Professor, 4) → Tenured? The model predicts 'yes'.

Data Classification
• Decision-tree-based classification method
  – Decision Tree Learning System, ID3
  – Evaluation functions (a small computation sketch follows the example tree below)
    • Information gain (entropy-based): I = - Σi pi ln(pi)
    • Gini index: gini(T) = 1 - Σj=1..n pj²

Training Dataset
• This follows an example from Quinlan's ID3:

  age    | income | student | credit_rating | buys_computer
  <=30   | high   | no      | fair          | no
  <=30   | high   | no      | excellent     | no
  31..40 | high   | no      | fair          | yes
  >40    | medium | no      | fair          | yes
  >40    | low    | yes     | fair          | yes
  >40    | low    | yes     | excellent     | no
  31..40 | low    | yes     | excellent     | yes
  <=30   | medium | no      | fair          | no
  <=30   | low    | yes     | fair          | yes
  >40    | medium | yes     | fair          | yes
  <=30   | medium | yes     | excellent     | yes
  31..40 | medium | no      | excellent     | yes
  31..40 | high   | yes     | fair          | yes
  >40    | medium | no      | excellent     | no

Output: A Decision Tree for "buys_computer"
• age?
  – <=30 → student?
    • no → no
    • yes → yes
  – 31..40 → yes
  – >40 → credit_rating?
    • excellent → no
    • fair → yes
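As a small illustration of the evaluation functions above, the following sketch computes the information gain of each candidate split attribute on the training table, which is how ID3 picks "age" as the root of the tree just shown. It is an illustrative Python snippet added here, not part of the slides; it uses log base 2 (bits) rather than the natural log, which rescales the numbers but not the ranking of attributes.

    from collections import Counter
    from math import log2

    # Training table from the slides: (age, income, student, credit_rating, buys_computer)
    rows = [
        ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
        ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
        (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
        ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
        ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
        ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
        ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
    ]
    attrs = ["age", "income", "student", "credit_rating"]

    def entropy(labels):
        """I = -sum_i p_i * log2(p_i) over the class distribution."""
        counts, n = Counter(labels), len(labels)
        return -sum(c / n * log2(c / n) for c in counts.values())

    def info_gain(col):
        """Gain(A) = I(D) - sum_v |D_v|/|D| * I(D_v) for attribute column `col`."""
        base = entropy([r[-1] for r in rows])
        by_value = Counter(r[col] for r in rows)
        remainder = sum(cnt / len(rows) * entropy([r[-1] for r in rows if r[col] == v])
                        for v, cnt in by_value.items())
        return base - remainder

    for i, name in enumerate(attrs):
        print(name, round(info_gain(i), 3))   # 'age' has the highest gain (about 0.246)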
Performance Improvement
• Database indices
• Attribute-oriented induction
• Two-phase multiattribute extraction
  – Inference power
  – Feature extraction phase
  – Feature combination phase

Clustering Analysis
• Clustering: the process of grouping physical or abstract objects into classes of similar objects.
• Clustering analysis: constructing a meaningful partitioning of a large set of objects based on a "divide and conquer" methodology.
• Methods:
  – Statistical analysis (Bayesian classification method)
  – Probability analysis

Clustering Based on Randomized Search
• PAM (Partitioning Around Medoids)
• CLARA (CLustering LARge Applications)
• CLARANS (Clustering Large Applications Based Upon RANdomized Search)

PAM (Partitioning Around Medoids) (1987)
• PAM (Kaufman and Rousseeuw, 1987), built into S-Plus
• Uses real objects to represent the clusters
  – Select k representative objects arbitrarily
  – For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TCih
  – For each pair of i and h,
    • If TCih < 0, i is replaced by h
    • Then assign each non-selected object to the most similar representative object
  – Repeat steps 2-3 until there is no change

PAM Clustering: Total Swapping Cost TCih = Σj Cjih
• [Four diagrams illustrate the cost contribution Cjih of a non-selected object j when medoid i is swapped with h; t denotes j's second-closest medoid. The four cases are:]
  – Cjih = d(j, h) - d(j, i)
  – Cjih = 0
  – Cjih = d(j, t) - d(j, i)
  – Cjih = d(j, h) - d(j, t)

CLARA (Clustering LARge Applications) (1990)
• CLARA (Kaufmann and Rousseeuw, 1990)
  – Built into statistical analysis packages, such as S-Plus
• It draws multiple samples of the data set, applies PAM to each sample, and gives the best clustering as the output
• Strength: deals with larger data sets than PAM
• Weaknesses:
  – Efficiency depends on the sample size
  – A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased

Focusing Methods
• CLARANS assumes that all the objects to be clustered are stored in main memory
• The most computationally expensive step of CLARANS is calculating the total distances between the two clusters
• Reducing the number of objects considered
  – Only the most central object of a leaf node of the R*-tree is used to compute the medoids of the clusters
• Restricting the access
  – Focus on relevant clusters
  – Focus on a cluster

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
• An incremental method whose memory requirements can be adjusted to the amount of memory available
• Clustering features
  – Summarize information about subclusters of points instead of storing all points
• CF trees
  – Branching factor B and threshold T
  – By changing the threshold value we can change the size of the tree
  – Use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree

Clustering Feature Vector
• Clustering Feature: CF = (N, LS, SS)
  – N: number of data points
  – LS: Σ(i=1..N) Xi, the linear sum of the points
  – SS: Σ(i=1..N) Xi², the square sum of the points
• Example: the points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190))

CF Tree
• [Diagram: a CF tree with branching factor B = 7 and leaf capacity L = 6]
  – Root: entries CF1 ... CF6, each with a child pointer (child1 ... child6)
  – Non-leaf node: entries CF1 ... CF5 with child pointers
  – Leaf nodes: sequences of CF entries linked by prev/next pointers
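The clustering-feature triple above is easy to compute and, importantly, additive, which is what lets BIRCH summarize subclusters incrementally as points arrive. The following is a minimal illustrative sketch for 2-D points, not the original BIRCH code; the function names are invented for this example.

    # Minimal sketch of BIRCH-style clustering features for 2-D points.
    # CF = (N, LS, SS): count, per-coordinate linear sum, per-coordinate square sum.

    def cf_from_points(points):
        n = len(points)
        ls = tuple(sum(p[d] for p in points) for d in range(2))
        ss = tuple(sum(p[d] ** 2 for p in points) for d in range(2))
        return (n, ls, ss)

    def cf_merge(cf1, cf2):
        """CF vectors are additive: merging two subclusters just adds the triples."""
        n1, ls1, ss1 = cf1
        n2, ls2, ss2 = cf2
        return (n1 + n2,
                tuple(a + b for a, b in zip(ls1, ls2)),
                tuple(a + b for a, b in zip(ss1, ss2)))

    pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
    print(cf_from_points(pts))                  # (5, (16, 30), (54, 190)), as in the slide
    print(cf_merge(cf_from_points(pts[:2]),
                   cf_from_points(pts[2:])))    # same triple, built incrementally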
Data Generalization, Summarization, and Characterization
• Data generalization: a process that abstracts a large set of relevant data in a database from a low concept level to relatively high ones
• Approaches
  1. Data cube approach
  2. Attribute-oriented induction approach

Data Cube Approach
• Multidimensional databases, OLAP, ...
• The general idea of the approach is to materialize certain expensive computations that are frequently inquired
  – such as count, sum, average, max, min, ...
  – Fast response time and flexible views of data from different angles and at different abstraction levels

Attribute-oriented Induction Approach
• Essential background knowledge: concept hierarchy
• Steps (a small sketch of concept-tree climbing and vote propagation appears at the end of this section):
  – Retrieve the initial relation
  – Attribute removal
  – Concept-tree climbing
  – Vote propagation
  – Threshold control
  – Rule transformation

Concept Hierarchy and Concept-Tree
• Concept hierarchies must be clearly defined before induction. The most general concept is denoted by "ANY" or "ALL", and the most specific concepts correspond to the particular values of the attribute in the database. For example, the concept hierarchy of the attribute Birth place can be represented as a concept tree.

Example
• Suppose we want to find the characteristic rules of graduate students.

Example
• Concept hierarchy table for the attributes.

Example
• Filter out the tuples whose Status attribute is "Graduate". At the same time, add a "Vote" column to every tuple in the table to record, during induction, the number of original tuples that match that tuple.

Example: Attribute Removal
• Among all the attributes, remove those for which no higher-level concept exists in the concept hierarchy.

Example: Concept-Tree Climbing and Vote Propagation
• If a higher-level concept exists for an attribute in the concept hierarchy, the attribute value is replaced by its higher-level value; in this example, history, physics, math, ... are replaced by science.
• After the attribute values climb up, if identical tuples are produced, the identical tuples are merged into one and their vote values are accumulated into the generalized tuple.

Example: Threshold Control and Rule Transformation
• Threshold control
• Induction is complete
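Here is the promised sketch of concept-tree climbing with vote propagation. It is a hypothetical Python illustration: the one-level concept hierarchy, the attribute names, and the student tuples are invented stand-ins, not the table from the original slides.

    from collections import Counter

    # Hypothetical one-level concept hierarchy: specific value -> higher-level concept.
    hierarchy = {
        "history": "science", "physics": "science", "math": "science",
        "Vancouver": "Canada", "Toronto": "Canada", "Seattle": "USA",
    }

    # Initial relation after selecting Status = "Graduate": (Major, Birth place) tuples.
    initial = [
        ("physics", "Vancouver"), ("math", "Toronto"),
        ("history", "Seattle"), ("physics", "Vancouver"),
    ]

    def generalize(tuples):
        """Climb each attribute one level in the concept tree, then merge identical
        generalized tuples and accumulate their votes (vote propagation)."""
        climbed = [tuple(hierarchy.get(v, v) for v in t) for t in tuples]
        votes = Counter(climbed)   # identical generalized tuples merge; counts are the votes
        return [t + (vote,) for t, vote in votes.items()]

    for row in generalize(initial):
        print(row)
    # ('science', 'Canada', 3)
    # ('science', 'USA', 1)

Threshold control would then continue generalizing until the number of distinct generalized tuples falls below a chosen threshold, after which the final tuples are rewritten as characteristic rules.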