Turning Clusters into Patterns: Rectangle-based Discriminative Data Description
Byron J. Gao and Martin Ester
IEEE ICDM 2006
Adviser: Koh Jia-Ling
Speaker: Liu Yu-Jiun
Date: 2006/11/8

Introduction
• The goal of data mining is to discover useful knowledge.
• Clustering results are presented as sets of points.
• The aim is to interpret clusters as human-comprehensible patterns.
• Previous work considered only the length of patterns and described the cluster C directly.

SOR description
• Sum of Rectangles (SOR) is the canonical format for cluster descriptions.
• SOR± for C: either SOR+ or SOR-.
• Figure legend: black = cluster C (R1 and R2); red = other cluster (R1'); green = B_C.
• SOR+ description: R1 + R2
• SOR- description: B_C - R1'
• $SOR^{+} \subset SOR^{\pm} \subset kSOR^{\pm}$

Notations
• $E_{SOR^{+}}(C) = R_1 + R_2$

Example
• Figure labels: R1, R2, R3, R4, R5 (cluster C); R1', R2', R3' (other clusters).
• $E_{SOR^{+}}(C) = R_1 + R_2 + R_3 + R_4 + R_5$
• $E_{SOR^{-}}(C) = B_C - (R_1' + R_2' + R_3')$

Problems
• Maximum Description Accuracy (MDA)
• Minimum Description Length (MDL)
• A novel description format: the kSOR± description

Accuracy Formula
• $recall = |E \cap C| / |C|$
• $precision = |E \cap C| / |E|$
• $f = \frac{2 \cdot recall \cdot precision}{recall + precision}$
Two additional measures:
1. Recall at fixed precision (precision fixed at 1).
2. Precision at fixed recall (recall fixed at 1).

Three Heuristic Algorithms
• Learn2Cover: addresses MDL, approximating max length.
  • Length of rectangle.
• DesTree: addresses MDA, approximating the Pareto front.
• FindClans: transforms the output from DesTree into a shorter final description.

Learn2Cover
• o_x is the next point from B_C in the sorted order.

Cost of Learn2Cover
• l_j(R): the length of rectangle R along dimension D_j.
• R': the rectangle R expanded to cover o_x.

DesTree
• DesTree takes the output from Learn2Cover, R or R^-, as input.
• It builds the tree bottom-up.
• Child nodes are merged into parent nodes until a single node is left.
• Each node represents a rectangle.
• The higher in the tree we cut, the shorter the description and the lower the accuracy.

merge

FindClans
• FindClans takes a cut from DesTree as input and outputs a kSOR± description.
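The bottom-up merging described on the DesTree slide can be illustrated with a small sketch. This is not the authors' DesTree: the merge criterion used here (smallest extra volume of the bounding rectangle) is an assumption made for illustration, and the names `bounding`, `merge_cost`, and `build_levels` are hypothetical. Rectangles are assumed to be tuples of (low, high) intervals, one per dimension.

```python
from itertools import combinations

def bounding(r1, r2):
    """Smallest axis-parallel rectangle containing both r1 and r2."""
    return tuple((min(a[0], b[0]), max(a[1], b[1])) for a, b in zip(r1, r2))

def volume(r):
    """Product of the side lengths of rectangle r."""
    v = 1.0
    for lo, hi in r:
        v *= hi - lo
    return v

def merge_cost(r1, r2):
    # Assumed criterion (not taken from the slides): extra volume
    # introduced when r1 and r2 are replaced by their bounding rectangle.
    return volume(bounding(r1, r2)) - volume(r1) - volume(r2)

def build_levels(leaves):
    """Merge rectangles pairwise until one is left.

    Returns every intermediate set of rectangles, from the leaves
    (longest description) up to the single root rectangle (shortest).
    """
    level = list(leaves)
    levels = [level]
    while len(level) > 1:
        # Pick the pair of rectangles that is cheapest to merge.
        i, j = min(combinations(range(len(level)), 2),
                   key=lambda p: merge_cost(level[p[0]], level[p[1]]))
        merged = bounding(level[i], level[j])
        level = [r for k, r in enumerate(level) if k not in (i, j)] + [merged]
        levels.append(level)
    return levels

# Toy input: two adjacent rectangles and one far away.
leaves = [((0, 1), (0, 1)), ((1, 2), (0, 1)), ((5, 6), (5, 6))]
levels = build_levels(leaves)
for cut in levels:
    print(len(cut), cut)
```

Each entry of `levels` plays the role of one cut: later entries contain fewer rectangles, so the description is shorter, but the merged rectangles cover more space outside the original leaves, which is the length versus accuracy trade-off stated on the slide.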
Algorithm -- FindClans

Experimental
• Compared with CART and BP.
• Real datasets from the UCI repository, where data records with the same class label were treated as a cluster.

Comparisons with CART
• Concerns both MDA and MDL.

DesTree vs. CART
• (Figure: accuracy and length.)

Comparisons with BP
• BP addresses the MDL problem only.
• Synthetic datasets.
• A 20% to 50% length reduction is gained.
• Learn2Cover does no violation checking, so it is faster than BP.

Conclusions
• kSOR± provides enhanced expressive power.
• MDA allows trading accuracy for interpretability.
• A paradigm for query-based "second-generation" database mining systems.
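The accuracy measures from the Accuracy Formula slide (recall, precision, and f) can be checked on a toy example that evaluates one SOR+ and one SOR- description. This is a minimal sketch under stated assumptions: points are tuples, rectangles are tuples of closed (low, high) intervals, E is taken to be the set of dataset points covered by the description, and B_C is taken to be a box containing C. The function names and the toy data are hypothetical and do not come from the paper.

```python
def in_rect(point, rect):
    """True if the point lies inside the closed axis-parallel rectangle."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(point, rect))

def cover_sor_plus(points, rects):
    """Points covered by a SOR+ description: R1 + R2 + ..."""
    return {p for p in points if any(in_rect(p, r) for r in rects)}

def cover_sor_minus(points, b_c, neg_rects):
    """Points covered by a SOR- description: B_C - (R1' + R2' + ...)."""
    return {p for p in points
            if in_rect(p, b_c) and not any(in_rect(p, r) for r in neg_rects)}

def accuracy(E, C):
    """Return (recall, precision, f) for covered set E and cluster C."""
    hit = len(E & C)
    recall = hit / len(C) if C else 0.0
    precision = hit / len(E) if E else 0.0
    total = recall + precision
    f = 2 * recall * precision / total if total else 0.0
    return recall, precision, f

# Toy data (assumed): a cluster C and one point from another cluster.
C = {(1, 1), (2, 1), (4, 4)}
others = {(3, 3)}
data = C | others
b_c = ((1, 4), (1, 4))  # assumed box around C

# SOR+ with a single rectangle that misses the point (4, 4).
E_plus = cover_sor_plus(data, [((1, 2), (1, 1))])
# SOR- that cuts the foreign point (3, 3) out of B_C.
E_minus = cover_sor_minus(data, b_c, [((3, 3), (3, 3))])

plus_scores = accuracy(E_plus, C)
minus_scores = accuracy(E_minus, C)
print("SOR+:", plus_scores)
print("SOR-:", minus_scores)
```

In this toy case both descriptions use a single rectangle besides B_C, yet the SOR- description covers the whole cluster while the SOR+ description misses one point, which illustrates why allowing both formats can help.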