CLOPE: A Fast and Effective Clustering Algorithm for Transactional Data

Advisor: Dr. Hsu
Graduate: Sheng-Hsuan Wang
Authors: Yiling Yang, Xudong Guan, Jinyuan You

Outline
- Motivation
- Objective
- Introduction
- Clustering with CLOPE
- Implementation
- Experiments
- Conclusions
- Personal opinion

Motivation
- This paper studies the problem of categorical data clustering, especially for transactional data characterized by high dimensionality and large volume.

Objective
- To present CLOPE, a fast and effective clustering algorithm for transactional data.

Introduction
- Clustering is an important data mining technique. Its objective is to group data into sets such that intra-cluster similarity is maximized and inter-cluster similarity is minimized.
- The basic idea behind CLOPE:
  - Use a global criterion function that tries to increase the intra-cluster overlapping of transaction items by increasing the height-to-width ratio of the cluster histogram.
  - A parameter controls the tightness of the clusters.

Clustering with CLOPE
- The criterion function can be defined locally or globally.
  - Locally: the criterion function is built on the pair-wise similarity between transactions.
  - Globally: clustering quality is measured at the cluster level, utilizing information such as the sets of large and small items in the clustering.
- We define the size S(C) and the width W(C) of a cluster C as follows, where D(C) is the set of distinct items in C and Occ(i, C) is the number of occurrences of item i in C:

    S(C) = Σ_{i ∈ D(C)} Occ(i, C) = Σ_{t ∈ C} |t|
    W(C) = |D(C)|

- The height of a cluster is defined as H(C) = S(C) / W(C).
- We use the gradient G(C) = H(C) / W(C) = S(C) / W(C)² instead of H(C) as the quality measure for cluster C.
- For a clustering C = {C_1, …, C_k}, the criterion function is

    profit(C) = ( Σ_{i=1}^{k} G(C_i) × |C_i| ) / ( Σ_{i=1}^{k} |C_i| )
              = ( Σ_{i=1}^{k} (S(C_i) / W(C_i)²) × |C_i| ) / ( Σ_{i=1}^{k} |C_i| )

  Generalizing the exponent gives

    profit_r(C) = ( Σ_{i=1}^{k} (S(C_i) / W(C_i)^r) × |C_i| ) / ( Σ_{i=1}^{k} |C_i| )

  Here, r is a positive real number called repulsion, used to control the level of intra-cluster similarity.
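The criterion function profit_r follows directly from the definitions of S(C) and W(C) above. Below is a minimal Python sketch, not the paper's implementation; the function name profit_r and the list-of-lists cluster representation are illustrative assumptions.

```python
from collections import Counter

def profit_r(clusters, r):
    """Criterion function profit_r(C) for a clustering.

    Each cluster is a list of transactions; each transaction is a
    list of items (illustrative in-memory representation).
    """
    numerator = 0.0
    total_transactions = 0
    for cluster in clusters:
        occ = Counter(item for t in cluster for item in t)  # Occ(i, C)
        S = sum(occ.values())   # size: total item occurrences in the cluster
        W = len(occ)            # width: number of distinct items
        numerator += (S / W ** r) * len(cluster)
        total_transactions += len(cluster)
    return numerator / total_transactions
```

With r = 2 this reduces to the original profit(C); larger r demands denser clusters and typically yields more, tighter clusters.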
Clustering with CLOPE (cont.)
- Problem definition: given a database D and a repulsion r, find a clustering C that maximizes profit_r(C).
- Sketch of the CLOPE algorithm: an initialization pass assigns each transaction to the existing or new cluster that maximizes profit, followed by refinement passes that keep moving transactions to better clusters until no transaction changes cluster.

Implementation
- RAM data structure:
  - We keep in memory only the current transaction and a small amount of information for each cluster, called the cluster features.
  - Using arrays of 4-byte integers, the total memory required for the item occurrence counts is approximately M × K × 4 bytes, where M is the number of distinct items and K the number of clusters.
- The computation of profit:
  - To pick the best cluster for a transaction t, it suffices to compute the delta value of adding t to each cluster C.
- Time and space complexity:
  - Let A be the average length of a transaction, N the total number of transactions, and K the maximum number of clusters.
  - The time complexity of one iteration is O(N × K × A).
  - The space requirement of CLOPE is approximately the memory size of the cluster features.

Experiments
- We analyze the effectiveness and execution speed of CLOPE with two real-life datasets:
  - For effectiveness, we compare the clustering quality of CLOPE on a labeled dataset with that of LargeItem and ROCK.
  - For execution speed, we compare CLOPE with LargeItem on a large web log dataset.

Mushroom
- The mushroom dataset from the UCI machine learning repository has been used by both ROCK and LargeItem for effectiveness tests.
- It contains 8124 records with two classes: 4208 edible mushrooms and 3916 poisonous mushrooms.
- LargeItem's cost function, used for comparison: Cost_w(C) = w × Intra + Inter.
- The result of CLOPE on mushroom is better than that of LargeItem and close to that of ROCK.
- Sensitivity to data order: clustering the data in different orders gives results that differ but stay very close to the original ones, showing that CLOPE is not very sensitive to the order of the input data.

Berkeley web logs
- Web log data is another typical category of transactional databases.
- We use the web log files from http://www.cs.berkeley.edu/log/ as the dataset for our second experiment and test the scalability as well as the performance of CLOPE.
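The cluster features and the delta-value computation described above can be sketched as follows. This is an illustrative Python sketch, not the authors' code: the names (Cluster, delta_add, clope) and the max_iter cap are assumptions. The delta value tracks the per-cluster term S(C)·|C| / W(C)^r of profit_r; since the denominator Σ|C_i| is the same for every candidate cluster, comparing these deltas is enough to pick the most profitable cluster.

```python
from collections import defaultdict

class Cluster:
    """Cluster features kept in RAM: |C|, S(C), and Occ(i, C)."""
    def __init__(self):
        self.n = 0                   # |C|: number of transactions
        self.S = 0                   # S(C): total item occurrences
        self.occ = defaultdict(int)  # Occ(i, C) for each item i

    def delta_add(self, t, r):
        """Change in S(C)*|C|/W(C)^r caused by adding transaction t."""
        S_new = self.S + len(t)
        W_new = len(set(self.occ).union(t))
        new_val = S_new * (self.n + 1) / W_new ** r
        old_val = self.S * self.n / len(self.occ) ** r if self.n > 0 else 0.0
        return new_val - old_val

    def add(self, t):
        self.n += 1
        self.S += len(t)
        for item in t:
            self.occ[item] += 1

    def remove(self, t):
        self.n -= 1
        self.S -= len(t)
        for item in t:
            self.occ[item] -= 1
            if self.occ[item] == 0:
                del self.occ[item]

def clope(transactions, r, max_iter=10):
    """Two-phase CLOPE loop: initialization pass, then refinement passes."""
    clusters, labels = [], []
    # Phase 1: assign each transaction to the cluster (existing or new)
    # whose delta value is largest.
    for t in transactions:
        candidates = clusters + [Cluster()]
        best = max(range(len(candidates)),
                   key=lambda i: candidates[i].delta_add(t, r))
        if best == len(clusters):
            clusters.append(candidates[best])
        candidates[best].add(t)
        labels.append(best)
    # Phase 2: move transactions to better clusters until none moves.
    for _ in range(max_iter):
        moved = False
        for idx, t in enumerate(transactions):
            clusters[labels[idx]].remove(t)
            candidates = clusters + [Cluster()]
            best = max(range(len(candidates)),
                       key=lambda i: candidates[i].delta_add(t, r))
            if best == len(clusters):
                clusters.append(candidates[best])
            candidates[best].add(t)
            if best != labels[idx]:
                labels[idx] = best
                moved = True
        if not moved:
            break
    return labels
```

Each pass touches every transaction once and evaluates at most K clusters per transaction, each evaluation costing O(A) for the width update, matching the O(N × K × A) per-iteration bound stated above.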
- There are about 7 million entries in the raw log file, and about 2 million of them remain after non-HTML entries are removed.

Conclusions
- The CLOPE algorithm is proposed based on the intuitive idea of increasing the height-to-width ratio of the cluster histogram.
- The idea is generalized with a repulsion parameter that controls the tightness of the transactions in a cluster.
- Experiments show that CLOPE is quite effective at finding interesting clusterings.
- Moreover, CLOPE is not very sensitive to data order and requires little domain knowledge to control the number of clusters.

Personal Opinion
- The idea behind CLOPE is very simple, which makes it easy for beginners to learn.