* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download presentation file
		                    
		                    
								Survey							
                            
		                
		                
                            
                            
								Document related concepts							
                        
                        
                    
						
						
							Transcript						
					
					1 39 Christian Böhm University for Health Informatics and Technology, Innsbruck Similarity Search and Data Mining: Database Techniques Supporting Next Decade's Applications Keynote at iiWAS 2002 2 39 Similarity Search 3 39 Feature Based Similarity 4 39 Simple Similarity Queries  Specify query object and • Find similar objects – range query • Find the k most similar objects – nearest neighbor q. 5 39 Multidimensional Index Structure (R-tree) Directory Data Page: Page: rectangle 1, address1 point 1: x11, x12, x13, ... rectangle 2,xaddress 2 point : x , 2 21 22, x23, ... rectangle 3,xaddress 3 point : x , 3 31 32, x33, ... rectangle4, address4 6 39 Range Query with Depth-First Traversal 7 39 Nearest Neighbor: Priority Algorithm [Hjaltason, Samet: Ranking in Spatial Databases, SSD 1995] 4 page accesses 8 39 Problems of High-Dim. Index Structures  „Curse of dimensionality“: • Search performance of index deteriorates in high dim. • Outperformed by sequential scan  Solution • Optimize various parameters of index structures  Needed: Cost model for queries How many pages are expected to be accessed for • Range queries (with given e) • Nearest neighbor queries (with given k) 9 39 Cost Estimation (Uniformity/Independence)  Minkowski sum: Estimation of the access probability of a page [Böhm: A Cost Model for Query Processing in High-Dimensional Data Spaces, TODS 25(2), 2000] Nearest neighbor: Estimate distance by point density 10 39 Cost Estimation  Boundary and saturation effects in high dim. space (considered by our model extension)  Correlation between attributes (considered by the concept of fractal dimension)  Cluster structure has also impact on performance • Currently neglected by our model • Histograms and similar data descriptions difficult in high-dimensional space (number of histo-bins exponential in dimensionality) • Other descriptions of cluster structure (dendrograms)  Subject to future work 11 39 Optimization of Index Structures  To avoid the possibility to outperform index based query processing by the sequential scan:  Optimize various parameters such as • • • • Logical block size of the index pages Indexed dimension I/O schedule optimization (fast index scan) Data quantization 12 39 Page Size Optimization 13 39 Page Size Optimization [Böhm, Kriegel: Dynamically Optimizing High Dimensional Index Structures , EDBT 2000] Optimized Dimension Assignment Hi-dim. Index R-tree Problem in hi-dim: Too few splits in each dimension Inverted List B-tree Problem in hi-dim: Too many results in each dimension Matching [Berchtold, Böhm, Keim, Kriegel, Xu: Optimal...Tree Striping, DaWaK 2000] 14 39 Optimized Dimension Assignment Hi-dim. Index R-tree Inverted List B-tree Compromise: A moderate number of R-trees each indexing a few dimensions Matching OPTIMIZE! [Berchtold, Böhm, Keim, Kriegel, Xu: Optimal...Tree Striping, DaWaK 2000] 15 39 16 39 Schedule Optimization (Fast Index Scan) Range Query: Required Pages are known from the directory 17 39 Schedule Optimization (NN Queries) [Berchtold, Böhm, Jagadish, Kriegel, Sander: Independent Quantization..., ICDE 2000]  Current expenses are traded for possible later savings  Start at 100% page and extend forward and backward  Optimize the cumulated cost balance (CCB): 18 39 Quantization  Approximate the points by quantization grid based on quantiles  Benefit:fewer bits for representation  Cost: Grid cell partially intersected  access the original point data  How to choose grid resolution ??? [Weber, Schek, Blott: A Quantitative Analysis and Performance Study..., VLDB 1998] 19 39 Independent Quantization (IQ tree) Combines index, scan, and quantization [Berchtold, Böhm, Jagadish, Kriegel, Sander: Independent Quantization..., ICDE 2000] Grid resolution optimized by cost model 20 39 Open Research Problems in Optimization  Multi-Parameter Optimization: • How can parameters be optimized simultaneously? • Are there conflicts between optimization goals? Example: Uniform data: Quantization Correlated data: Tree Striping 21 39 Open Research Problems in Optimization  Consider Insert/Delete/Update:  If the data set faces heavy update, the constructed index should look differently compared with more static data sets • Update-bound: Construct index rather simple • Query-bound: Spend more effort to organize data  Can be considered as an optimization problem 22 39 Data Mining 23 39 KDD Algorithms Based on Similarity Queries LOF Simultan. Nearest Dist. Neighbor OPTICS Based Classific. Outliers DBSCAN .... .... .... Spatial Trend Detect. Spatial Assoc. Rules 24 39 Similarity Join  Catalogue Matching R S 25 39 Clustering  Clustering (e.g. DBSCAN) [Ester, Kriegel, Sander, Xu: A Density Based Algorithm for Discovering Clusters, KDD 1996] 26 39 Cache Behavior 27 39 Clustering and Similarity Join  DBSCAN uses similarity join as basic operations [Böhm, Braunmüller, Breunig, Kriegel: High Perf. Clustering based on the Sim. Join, CIKM 2000] 28 39 k-Nearest Neighbor Classification  Example: k=3 Objects with known class New objects • New objects Known objects 29 39 Distance Range Join (e-Join) • • Most widespread and best evaluated join Often also called the similarity join 30 39 k-Closest Pair Query In SQL notation: SELECT * FROM R, S ORDER BY ||R.obj - S.obj|| STOP AFTER k 31 39 k-Nearest Neighbor Join In SQL notation: SELECT * FROM R, S (limited to k = 1) GROUP BY R.obj ORDER BY ||R.obj - S.obj|| STOP AFTER K (*  k *) 32 39 R-tree Spatial Join (RSJ) [Brinkhoff, Kriegel, Seeger: Efficient Processing of Spatial Joins using R-trees, SIGMOD 1993] procedure r_tree_sim_join (R, S, e) if IsDirpg (R)  IsDirpg (S) then foreach r  R.children do foreach s  S.children do if mindist (r,s)  e then CacheLoad(r); CacheLoad(s); r_tree_sim_join (r,s,e) ; else (* assume R,S both DataPg *) foreach p  R.points do foreach q  S.points do if |p - q| e then report (p,q); R e S 33 39 Modeling and Optimization [Böhm, Kriegel: A Cost Model and Index Architecture for the Similarity Join, ICDE 2001]  Mating probability of index pages:  Probability that distance between two pages e  Two-fold application of Minkowski sum 34 39 Modeling and Optimization  I/O cost: • High const. cost per page • Large capacity optimum  CPU cost: • Low const. cost per page • Low capacity optimum  CPU-performance like CPU optimized index  I/O- performance like I/O optimized index 35 39 Open Problems for Research (Sim. Join)  Modeling and Optimization: • • • • Dimension Quantization Page scheduling Caching strategies  Nearest Neighbor Join • Applications • Algorithms  General • Integration into object-relational DBMS 36 39 New Challenges New Challenges Incertain Features:  Application: • Biometric Identification  Particularities: Relative probability 37 39 • Features individually associated with incertainty (e.g. as Gaussian distributions)  Queries: • Probability of match • Find objects with highes probability of match • Find objects with probability of match >= e Feature a1 38 39 New Challenges Support of e-commerce in all phases • Marketing • Sales and booking • Add-on products  Advanced Similarity • • • • Adaptable Multimodal models Relevance-feedback Convex hull  customer segmentation  advanced similarity search  Sales transaction analysis 39 39 New Challenges Stock quota: Technical chart analysis  Known: Database techniques for similarity search in time sequences (DFT, etc.) 40 39 New Challenges  Professional analyst tools use: • Trading signals generated by indicators (etc. MACD) • Formations indicating trends in charts • Relationships to the market and to derivatives 41 39 Conclusion  Database primitives: abstraction from application: Similarity Search  Range Queries Nearest Neighbor Queries Clustering Classification  Similarity Join Outlier Detection  Advantages • General solution, reuse • Separately optimizable
 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
                                             
                                             
                                             
                                             
                                             
                                             
                                             
                                             
                                             
                                             
                                            