ALGORITHM FOR CLUSTERING OF LARGE DATASETS
Major Project Report
Submitted in Partial fulfillment of the requirements for the award of the degree
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
BY
Lokesh Kumar Meena (08CO38)
Mhetre Shardul (08CO44)
Udit Agrawal (08CO80)
Under The Guidance of
Dr. Mahendra Pratap Singh
Assistant Professor
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA, SURATHKAL
SRINIVASNAGAR- 575025 KARNATAKA, INDIA
APRIL, 2012
DECLARATION
We hereby declare that the Project Work Report entitled “ALGORITHM FOR
CLUSTERING OF LARGE DATASETS” which is being submitted to National Institute of
Technology Karnataka, Surathkal for the award of the degree of Bachelor of Technology in
Computer Science and Engineering is a bona fide report of the work carried out by us. The
material contained in this report has not been submitted to any university or institution for the
award of any degree.
NAME                      REGISTRATION NO.     SIGNATURE
1. Lokesh Kumar Meena     08CO38
2. Mhetre Shardul         08CO44
3. Udit Agrawal           08CO80
Department of Computer Science and Engineering
Place: NITK, Surathkal
Date:
CERTIFICATE
This is to certify that the B.Tech Project Work Report entitled “ALGORITHM FOR
CLUSTERING OF LARGE DATASETS” submitted by
Sl. No.   Registration No.   Name
1.        08CO38             Lokesh Kumar Meena
2.        08CO44             Mhetre Shardul
3.        08CO80             Udit Agrawal
as the record of the work carried out by them, is accepted as the B.Tech Project Work Report
submission in partial fulfillment of the requirements for the award of the degree of Bachelor of
Technology in Computer Science & Engineering.
Project Guide:
Mr. Mahendra Pratap Singh
Assistant Professor
Department of Computer Science and Engineering
NITK Surathkal
Chairman- DUGC
ACKNOWLEDGEMENT
We take this opportunity to express our deepest gratitude and appreciation to all those
who have helped us directly or indirectly towards the successful completion of this project.
First and foremost, we would like to express our sincere appreciation and gratitude to our
esteemed guides Mr. M. P. Singh, Assistant Professor, and Mr. Vinay Kumar, Associate Professor,
Department of Computer Science & Engineering, NITK Surathkal, for their insightful advice,
encouragement, guidance, criticism, and valuable suggestions throughout the course of our project
work. Without their continued support and interest, this thesis would not have been the same as
presented here.
We express our deep gratitude to Dr. Santhi Thilagam P., Associate Professor and Head,
Department of Computer Science & Engineering, National Institute of Technology Karnataka,
Surathkal for her constant co-operation, support and for providing necessary facilities throughout
the B.Tech program.
We would like to take this opportunity to express our thanks towards the teaching and non-teaching staff in the Department of Computer Science & Engineering, NITK for their invaluable
help and support in these four years of our study. We are also grateful to all our classmates for
their help, encouragement and invaluable suggestions.
Our special thanks to our parents, supporting families and friends who continuously supported
and encouraged us in every possible way for the successful completion of this thesis.
Last but not least, we thank God Almighty for his blessings without which the completion of this
project work would not have been possible.
Lokesh Kumar Meena, 08CO38
Mhetre Shardul, 08CO44
Udit Agrawal, 08CO80
ABSTRACT
Data clustering is an important technique for exploratory data analysis and has been studied
for several years. It has been shown to be useful in many practical domains such as data
classification and image processing. However, existing data clustering methods do not
adequately address the problem of processing large datasets with a limited amount of
resources (e.g., memory and CPU cycles). As the dataset size increases, they do not scale up
well in terms of memory requirement, running time, and result quality.
Using these limitations as our motivation, we propose a scheme based on prefix trees for
compact storage of patterns in data mining. The prefix tree is generated from the dataset in a
single scan and groups patterns that share common prefixes, giving a compact storage of
patterns. We propose a clustering algorithm based on this storage and show experimentally
that this type of storage reduces both space and time requirements.
KEYWORDS: Algorithm, Clustering, Prefix Tree, Data Mining.
TABLE OF CONTENTS
Declaration ........................................................................................................................ i
Certificate .......................................................................................................................... ii
Acknowledgement ............................................................................................................. iii
Abstract ............................................................................................................................. iv
Table of contents ................................................................................................................v
List of Figures ....................................................................................................................viii
List of Tables ..................................................................................................................... ix
1. INTRODUCTION ......................................................................................................... 1
1.1 Application of Clustering ........................................................................................... 2
1.1.1 Educational research analysis .............................................................................. 2
1.1.2 Information retrieval............................................................................................ 2
1.1.3 Biology................................................................................................................ 2
1.2 Constraints of Clustering ........................................................................................... 2
1.2.1 Scalability............................................................................................................ 2
1.2.2 Ability to deal with different types of attributes ................................................... 2
1.2.3 Discovery of clusters with arbitrary shape............................................................ 2
1.2.4 Minimal requirements for domain knowledge to determine input parameters ....... 2
1.2.5 Ability to deal with noisy data ............................................................................. 3
1.2.6 High dimensionality ............................................................................................ 3
1.3 Organisation of Report ........................................................................................... 3
2. LITERATURE SURVEY .............................................................................................. 4
2.1 Typical clustering models .......................................................................................... 4
2.1.1 Centroid models .................................................................................................. 4
2.1.2 Density models: ................................................................................................... 4
2.1.3 Hierarchical Clustering ........................................................................................ 4
2.1.4 Distribution models: ............................................................................................ 5
2.1.5 Graph-Based Clustering: ...................................................................................... 5
2.2 Related Work ............................................................................................................. 5
2.2.1 K-Means Clustering [8] ....................................................................................... 5
2.2.2 K-Medoids Clustering [9] .................................................................................... 6
2.2.3 CLARANS [4]..................................................................................................... 6
2.2.4 DBSCAN [1] ....................................................................................................... 6
2.2.5 CURE [2] ............................................................................................................ 7
2.2.6 Chameleon [5] ..................................................................................................... 7
2.2.7 BIRCH [7] ........................................................................................................... 8
3. PROBLEM STATEMENT ........................................................................................... 9
3.1 Problem Statement ..................................................................................................... 9
4. PROPOSED SYSTEM DESIGN .................................................................................. 10
4.1 Terminology and techniques: ..................................................................................... 10
4.1.1 The Data Matrix .................................................................................................. 10
4.1.2 The Dissimilarity Matrix...................................................................................... 10
4.2 Abstract view of our system ....................................................................................... 10
4.3 Data standardization module ...................................................................................... 11
4.3.1 Data discretization using Equal Width Interval Discretization .............................. 11
4.3.2 Algorithm ............................................................................................................ 11
4.4 Data compression module .......................................................................................... 12
4.4.1 Compact Representation using Prefix tree ............................................................ 12
4.4.2 Algorithm ............................................................................................................ 12
4.5 Data Clustering Module ............................................................................................. 13
4.5.1 Clustering using Relevant Feature algorithm ........................................................ 13
4.5.2 Algorithm ............................................................................................................ 13
4.6 RFBC: Relevant Feature Based Clustering Algorithm ................................................ 13
4.7 Design Constraints ..................................................................................................... 14
5. PROJECT IMPLEMENTATION ................................................................................ 15
5.1 Data Discretization – Module 1 .................................................................................. 15
5.1.1 Class: PreProc.java .............................................................................................. 15
5.2 Prefix Tree – Module 2 .............................................................................................. 15
5.2.1 Class: Node.java .................................................................................................. 15
5.2.2 Class: Prefix.java ................................................................................................. 16
5.3 Scanning Prefix tree and Clustering– Module 3 .......................................................... 16
5.3.1 Class: Cluster.java ............................................................................................... 16
5.3.2 Class: ObjectCluster.java ..................................................................................... 17
5.4 Implementation Setup ............................................................................................. 17
6. RESULTS AND ANALYSIS ......................................................................................... 18
6.1 Results ....................................................................................................................... 18
6.1.1 Sample Files ........................................................................................................ 18
6.1.2 Data Discretization Module ................................................................................. 20
6.1.3 Data Compression Module................................................................................... 21
6.1.4 Data Clustering Module ....................................................................................... 22
6. 2 Analysis of Algorithm ............................................................................................... 24
6.3 Comparison of Results ............................................................................................... 25
7. CONCLUSION AND FUTURE WORK ...................................................................... 27
7.1 Conclusion ................................................................................................................. 27
7.2 Future Work............................................................................................................... 27
REFERENCES .................................................................................................................. 28
LIST OF FIGURES
Fig No.    Description                                        Page No.
Fig. 1.1   Clusters – Example                                 1
Fig. 2.1   Density Based Clustering – Model                   4
Fig. 4.1   Flow Diagram of Cluster Formation                  10
Fig. 4.2   Abstract View of System                            11
Fig. 4.3   Prefix Tree – Example                              12
Fig. 6.1   Small Sample Data                                  18
Fig. 6.2   Medium Sample Data                                 19
Fig. 6.3   Large Sample Data                                  19
Fig. 6.4   Processed Small Data File                          20
Fig. 6.5   Processed Medium Data File                         20
Fig. 6.6   Processed Large Data File                          21
Fig. 6.7   Prefix Tree – Small Data                           21
Fig. 6.8   Prefix Tree – Medium Data                          22
Fig. 6.9   Prefix Tree – Large Data                           22
Fig. 6.10  Final Output – Small Data                          23
Fig. 6.11  Final Output – Medium Data                         23
Fig. 6.12  Final Output – Large Data                          24
Fig. 6.13  Comparison of RFBC and BIRCH at Threshold = 10     25
Fig. 6.14  Comparison of RFBC and BIRCH at Threshold = 20     25
LIST OF TABLES
Table No.   Description                                                    Page No.
Table 5.1   Implementation Environment Specification                       17
Table 6.1   Comparison table between RFBC and BIRCH when Threshold = 10    25
Table 6.2   Comparison table between RFBC and BIRCH when Threshold = 20    25
CHAPTER 1
INTRODUCTION
Cluster analysis is a data mining technique used to place data elements into related groups
without advance knowledge of the group definitions.
It groups objects based on the information found in the data describing the objects or their
relationships. The goal is that the objects in a group will be similar (or related) to one another
and different from (or unrelated to) the objects in other groups. The greater the similarity or
homogeneity within a group and the greater the dissimilarity between groups, the “better” or
more distinct the clustering.
Cluster analysis divides data into meaningful or useful groups (clusters). If meaningful
clusters are the goal, then the resulting clusters should capture the “natural” structure of the
data. For example, cluster analysis has been used to group related documents for browsing, to
find genes and proteins that have similar functionality, and to provide a grouping of spatial
locations prone to earthquakes. However, in other cases, cluster analysis is only a useful
starting point for other purposes, e.g., data compression or efficiently finding the nearest
neighbours of points. Figure 1.1 shows an example of clustering in which objects are grouped into 3
clusters. Whether for understanding or utility, cluster analysis has long been used in a wide
variety of fields: psychology and other social sciences, biology, statistics, pattern recognition,
information retrieval, machine learning, and data mining.
Figure 1.1 An example of Clusters
Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be
achieved by various algorithms that differ significantly in their notion of what constitutes a
cluster and how to efficiently find them. Popular notions of clusters include groups with low
distances among the cluster members, dense areas of the data space, intervals or particular
statistical distributions. The appropriate clustering algorithm and parameter settings depend
on the individual data set and intended use of the results.
1.1 Application of Clustering
There are wide applications of clustering to real-life problems. The following are a few fields in
which clustering is used very often:
1.1.1 Educational research analysis
Data for clustering can be students, parents, sex, or test scores. Clustering is an important
method for understanding and evaluating grouping or streaming in educational research.
Cluster analysis in educational research can be used for data exploration, cluster confirmation
and hypothesis testing. Data exploration is used when there is little information about which
schools or students will be grouped together. It aims at discovering any meaningful clusters
of units based on measures on a set of response variables.
1.1.2 Information retrieval
A search engine query may return thousands of pages. Clustering can be used to group
these search results into small clusters, each of which captures a particular aspect of the query.
1.1.3 Biology
Clustering can be used for creating the taxonomy of all living things: kingdom, phylum,
family, genus, etc.
1.2 Constraints of Clustering
Clustering is a challenging and interesting field; its potential applications pose their own
special requirements. The following are typical requirements of clustering:
1.2.1 Scalability
Many clustering algorithms work well on small data sets containing fewer than 200 data
objects. However, a large database may contain millions of objects. Clustering on a sample of
a given large data set may lead to biased results. Highly scalable clustering algorithms are
needed.
1.2.2 Ability to deal with different types of attributes
Many algorithms are designed to cluster interval-based (numerical) data. However,
applications may require clustering other types of data, such as binary, categorical (nominal),
and ordinal data, or mixtures of these data types.
1.2.3 Discovery of clusters with arbitrary shape
Many clustering algorithms determine clusters based on Euclidean or Manhattan distance
measures. Algorithms based on such distance measures tend to find spherical clusters with
similar size and density. However, a cluster could be of any shape. It is important to develop
algorithms that can detect clusters of arbitrary shape.
1.2.4 Minimal requirements for domain knowledge to determine input parameters
Many clustering algorithms require users to input certain parameters in cluster analysis (such
as the number of desired clusters). The clustering results can be quite sensitive to input
parameters. Parameters are often hard to determine, especially for data sets containing
high-dimensional objects. This not only burdens users, but also makes the quality of clustering
difficult to control.
1.2.5 Ability to deal with noisy data
Most real-world databases contain outliers or missing, unknown, or erroneous data. Some
clustering algorithms are sensitive to such data and may lead to clusters of poor quality.
1.2.6 High dimensionality
A database or a data warehouse can contain several dimensions or attributes. Many clustering
algorithms are good at handling low-dimensional data, involving only two to three
dimensions. Human eyes are good at judging the quality of clustering for up to three
dimensions. It is challenging to cluster data objects in high-dimensional space, especially
considering that such data can be very sparse and highly skewed.
1.3 Organisation of Report
Chapter 2 documents various clustering models developed over time, and some selected
clustering algorithms from these models. Chapter 3 states the problem statement of the project,
and Chapter 4 explains the proposed system design, including all design modules and their
constraints. Chapter 5 includes implementation details of all modules. Chapter 6 presents the
results, analysis and comparison of the results with standard algorithms. Chapter 7 concludes
the report with future work.
CHAPTER 2
LITERATURE SURVEY
We studied various clustering algorithms, i.e. K-Means, K-Medoids, DBSCAN, CURE,
BIRCH, CLARANS, CHAMELEON, and their variants which use expectation–maximization
techniques.
2.1 Typical clustering models
Many different clustering techniques have been proposed over the years. These techniques
can be described using the following criteria or models.
2.1.1 Centroid models
Partition clustering techniques create a one-level partitioning of the data points. K-means and
K-medoids are two examples of this model. Both these techniques are based on the idea that
a centre point can represent a cluster. For K-means we use the notion of a centroid, which is
the mean or median point of a group of points. For K-medoids we use the notion of a medoid,
which is the most representative (central) point of a group of points. By its definition a
medoid is required to be an actual data point. CLARANS is a more efficient version of the
basic K-medoids algorithm and is used in spatial data mining problems.
2.1.2 Density models:
The most popular density-based clustering methods are DBSCAN and OPTICS, which define
clusters as connected dense regions in the data space. In density-based clustering, clusters are
defined as areas of higher density than the remainder of the data set. In contrast to many
newer methods, DBSCAN features a well-defined cluster model called "density-reachability".
Figure 2.1 shows an example of density-based clustering.
Figure 2.1 Density based clustering model
2.1.3 Hierarchical Clustering
In hierarchical clustering the goal is to produce a hierarchical series of nested clusters,
ranging from clusters of individual points at the bottom to an all-inclusive cluster at the top.
A diagram called a dendrogram graphically represents this hierarchy and is an inverted tree
that describes the order in which points are merged (bottom-up view) or clusters are split
(top-down view).
Hierarchical clustering builds models based on distance connectivity. Connectivity based
clustering, also known as hierarchical clustering, is based on the core idea of objects being
more related to nearby objects than to objects farther away. As such, these algorithms connect
"objects" to form "clusters" based on their distance. A cluster can be described largely by the
maximum distance needed to connect parts of the cluster. At different distances, different
clusters will form, which can be represented using a dendrogram, which explains where the
common name "hierarchical clustering" comes from.
2.1.4 Distribution models:
The clustering model most closely related to statistics is based on distribution models.
Clusters can then easily be defined as objects belonging most likely to the same distribution.
A nice property of this approach is that this closely resembles the way artificial data sets are
generated: by sampling random objects from a distribution. Clusters are modelled using
statistic distributions, such as multivariate normal distributions used by the Expectationmaximization algorithm.
2.1.5 Graph-Based Clustering:
The hierarchical clustering algorithms can be viewed as operating on a proximity graph.
However, they are most commonly viewed in terms of merging or splitting clusters, and often
there is no mention of graph related concepts. There are some clustering techniques, however,
that are explicitly cast in terms of a graph or a hypergraph. Many of these algorithms are
based on the idea of looking at the nearest neighbours of a point.
2.2 Related Work
There are many clustering algorithms based on the models listed above. We studied some
selected algorithms from these clustering models to understand the concepts behind each
algorithm. Through this study we tried to find out the limitations of these algorithms.
2.2.1 K-Means Clustering [8]
The K-means clustering technique is very simple. K-means uses the notion of a centroid,
which is the mean or median point of a group of points. A centroid almost never corresponds
to an actual data point.
Basic Algorithm for finding K clusters. Given k, the k-means algorithm is implemented in 4
steps:
1. Partition objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition. The
centroid is the centre (mean point) of the cluster.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when there are no new assignments.
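As an illustration of these steps, the following is a minimal Java sketch of the K-means loop (our own simplified example assuming one-dimensional points, two clusters and a fixed number of iterations; it is not the implementation used in this project):

import java.util.Arrays;

public class KMeansSketch {
    public static void main(String[] args) {
        double[] points = {1.0, 1.5, 2.0, 10.0, 10.5, 11.0};   // toy one-dimensional data set
        double[] centroids = {1.0, 10.0};                      // k = 2 initial seed points
        int[] assignment = new int[points.length];

        for (int iteration = 0; iteration < 10; iteration++) {
            // Step 3: assign each point to the cluster with the nearest centroid.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(points[p] - centroids[c]) < Math.abs(points[p] - centroids[best])) {
                        best = c;
                    }
                }
                assignment[p] = best;
            }
            // Step 2: recompute each centroid as the mean of the points assigned to it.
            for (int c = 0; c < centroids.length; c++) {
                double sum = 0;
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] == c) { sum += points[p]; count++; }
                }
                if (count > 0) centroids[c] = sum / count;
            }
        }
        System.out.println("Centroids: " + Arrays.toString(centroids));      // [1.5, 10.5]
        System.out.println("Assignments: " + Arrays.toString(assignment));   // [0, 0, 0, 1, 1, 1]
    }
}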
2.2.2 K-Medoids Clustering [9]
The objective of K-medoid clustering is to find a non-overlapping set of clusters such that
each cluster has a most representative point, i.e., a point that is most centrally located with
respect to some measure, e.g., distance. These representative points are called medoids.
Basic K-medoid Algorithm for finding K clusters:
1. Select K initial points. These points are the candidate medoids and are intended to be
the most central points of their clusters.
2. Consider the effect of replacing one of the selected objects (medoids) with one of the
non-selected objects. The distance of each non-selected point from the closest
candidate medoid is calculated, and this distance is summed over all points. This
distance represents the “cost” of the current configuration. All possible swaps of a
non-selected point for a selected one are considered, and the cost of each
configuration is calculated.
3. Select the configuration with the lowest cost. If this is a new configuration, then
repeat step 2.
4. Otherwise, associate each non-selected point with its closest selected point (medoid)
and stop.
2.2.3 CLARANS [4]
CLARANS grew out of two clustering algorithms, PAM and CLARA, and was developed
specifically for use in spatial data mining.
PAM (Partitioning Around Medoids) is a “K-medoid” based clustering algorithm that
attempts to cluster a set of m points into K clusters.
CLARA (Clustering LARge Applications) is an adaptation of PAM, for handling larger data
sets. It works by repeatedly sampling a set of data points, calculating the medoids of the
sample, and evaluating the cost of the configuration that consists of these “sample-derived”
medoids and the entire data set. The set of medoids that minimizes the cost is selected.
The algorithm can be stated as follows:
1. Randomly pick K candidate medoids.
2. Randomly consider a swap of one of the selected points for a non-selected point.
3. If the new configuration is better, i.e., has lower cost, then repeat step 2 with the new
configuration.
4. Otherwise, repeat step 2 with the current configuration unless a parameterized limit
has been exceeded. (For example, the limit may be set to max(250, K * (m - K)).)
5. Compare the current solution with any previous solutions and keep track of the best.
6. Return to step 1 unless a parameterized limit has been exceeded.
2.2.4 DBSCAN [1]
DBSCAN is a density-based clustering algorithm that works with a number of different
distance metrics. When DBSCAN has processed a set of data points, a point will either be in
a cluster or will be classified as noise. DBSCAN is based on the concepts of a point being
“density reachable” and “density connected”.
Conceptually, data points fall into three classes:
1. Core points. These are points that are at the interior of a cluster. A point is an interior
point if there are enough points in its neighbourhood, i.e., if the number of points
within a given neighbourhood around the point exceeds a certain threshold, as
determined by the distance function and a supplied distance parameter. If two core
points belong to each other's neighbourhoods, then the core points belong to the same
cluster.
2. Border points. A border point is a point that is not a core point, i.e., there are not
enough points in its neighbourhood, but it falls within the neighbourhood of a core
point.
3. Noise points. A noise point is any point that is not a core point or a border point.
Thus, for DBSCAN, a cluster is the set of all core points whose neighbourhoods transitively
connect them together, along with some border points.
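The three point classes can be illustrated with a short Java sketch (our own simplified example on one-dimensional points, with assumed values for the radius eps and the density threshold minPts; it is not the original DBSCAN implementation):

import java.util.Arrays;

public class DbscanPointClasses {
    public static void main(String[] args) {
        double[] pts = {1.0, 1.2, 1.4, 1.9, 5.0, 9.0, 9.1, 9.2};
        double eps = 0.5;      // neighbourhood radius (assumed value)
        int minPts = 3;        // density threshold, counting the point itself

        // First pass: a point with at least minPts points in its eps-neighbourhood is a core point.
        boolean[] core = new boolean[pts.length];
        for (int i = 0; i < pts.length; i++) {
            int neighbours = 0;
            for (double q : pts) if (Math.abs(pts[i] - q) <= eps) neighbours++;
            core[i] = neighbours >= minPts;
        }
        // Second pass: a non-core point inside some core point's neighbourhood is a border point;
        // anything else is noise.
        String[] label = new String[pts.length];
        for (int i = 0; i < pts.length; i++) {
            if (core[i]) { label[i] = "core"; continue; }
            label[i] = "noise";
            for (int j = 0; j < pts.length; j++) {
                if (core[j] && Math.abs(pts[i] - pts[j]) <= eps) { label[i] = "border"; break; }
            }
        }
        System.out.println(Arrays.toString(label));   // [core, core, core, border, noise, core, core, core]
    }
}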
2.2.5 CURE [2]
CURE (Clustering Using Representatives) is a clustering algorithm that uses a variety of
different techniques to create an approach which can handle large data sets, outliers, and
clusters with non-spherical shapes and non-uniform sizes.
We summarize our description of CURE by explicitly listing the different steps:
1. Draw a random sample from the data set.
2. Partition the sample into p equal sized partitions.
3. Cluster the points in each partition using the hierarchical clustering algorithm to obtain
m/pq clusters in each partition and a total of m/q clusters. Some outlier elimination
occurs during this process.
4. Eliminate outliers. This is the second phase of outlier elimination.
5. Assign all data to the nearest cluster to obtain a complete clustering.
2.2.6 Chameleon [5]
Chameleon is a clustering algorithm that combines an initial partitioning of the data using an
efficient graph partitioning algorithm with a novel hierarchical clustering scheme that
dynamically models clusters. The key idea is that two clusters will be merged only if the
resulting cluster is similar to the original clusters, i.e. self-similarity is preserved.
In short, the steps of the Chameleon algorithm are:
1. Build a k-nearest neighbour graph.
2. Partition the graph into partitions using a multilevel graph-partitioning algorithm.
3. Perform a hierarchical clustering starting with the partitions. This hierarchical
clustering will merge the clusters which best preserve the cluster self-similarity with
respect to relative interconnectivity and relative closeness.
2.2.7 BIRCH [7]
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is based on the
notion of a clustering feature (CF) and a CF tree. The idea is that a cluster of data points
(vectors) can be represented by a triple of numbers (N, LS, SS), where N is the number of
points in the cluster, LS is the linear sum of the points, and SS is the sum of squares of the
points.
A CF tree is built as the data is scanned. As each data point is encountered, the CF tree is
traversed, starting from the root and choosing the closest node at each level. When the closest
“leaf” cluster for the current data point is finally identified, a test is performed to see if
adding the data item to the candidate cluster will result in a new cluster with a diameter
greater than the given threshold, T. If not, then the data point is “added” to the candidate
cluster by updating the CF information. The cluster information for all nodes from the leaf to
the root is also updated.
BIRCH consists of a number of phases beyond the initial creation of the CF tree. The phases
of BIRCH are as follows:
1. Load the data into memory by creating a CF tree that “summarizes” the data.
2. Build a smaller CF tree if it is necessary for phase 3. T is increased, and then the leaf
node entries (clusters) are reinserted. Since T has increased, some clusters will be
merged.
3. Perform global clustering. Different forms of global clustering (clustering which uses
the pairwise distances between all the clusters) can be used.
4. Redistribute the data points using the centroids of clusters discovered in step 3 and
thus discover a new set of clusters. By repeating this phase multiple times, the
process converges to a local minimum.
Because of page size constraints and the T parameter, points that should be in one cluster are
sometimes split, and points that should be in different clusters are sometimes combined. Also,
if the data set contains duplicate points, these points can sometimes be clustered differently,
depending on the order in which they are encountered.
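The clustering feature idea can be illustrated with a small Java sketch (our own example restricted to one-dimensional points; BIRCH itself keeps such (N, LS, SS) triples for multi-dimensional vectors in every CF-tree node):

// Clustering feature (CF) for one-dimensional points:
// n = number of points, ls = linear sum, ss = sum of squares.
class ClusteringFeature {
    long n;
    double ls;
    double ss;

    void add(double x) { n++; ls += x; ss += x * x; }

    // Merging two CFs is component-wise addition; no original points are needed.
    void merge(ClusteringFeature other) { n += other.n; ls += other.ls; ss += other.ss; }

    double centroid() { return ls / n; }

    double radius() { return Math.sqrt(Math.max(0, ss / n - Math.pow(ls / n, 2))); }
}

public class CfDemo {
    public static void main(String[] args) {
        ClusteringFeature cf = new ClusteringFeature();
        for (double x : new double[] {2.0, 4.0, 6.0}) cf.add(x);
        System.out.println("N=" + cf.n + " LS=" + cf.ls + " SS=" + cf.ss
                + " centroid=" + cf.centroid() + " radius=" + cf.radius());
    }
}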
CHAPTER 3
PROBLEM STATEMENT
In this chapter we state the problem statement and the major objectives of the project.
3.1 Problem Statement
The problem statement of the project is to “Develop an algorithm for clustering of large
datasets”.
The objectives of the project are:
- The algorithm developed must be scalable.
- It must be able to handle objects with high dimensionality.
- It must have better time and space complexity than current similar algorithms such as BIRCH.
CHAPTER 4
PROPOSED SYSTEM DESIGN
Clustering is not one specific algorithm, but the general task to be solved. It can be achieved
by various algorithms that differ significantly in their notion of what constitutes a cluster and
how to efficiently find them.
The process of forming clusters from the raw data is shown in the following flow diagram (Fig. 4.1):
Figure 4.1 Flow diagram of cluster formation
In Figure 4.1, raw data represents the data obtained from any source or database. Generally
this data consists of millions of entries, each representing a different object. The clustering
algorithm performs the task of converting the raw data into clusters of data. Currently, there
are numerous clustering algorithms available, chosen based on the input, the size of the data, etc.
Finally, the clusters generated as the result of the second step contain objects of similar type.
Popular notions of clusters include groups with low distances among the cluster members,
dense areas of the data space, intervals or particular statistical distributions.
4.1 Terminology and techniques:
The following is the terminology used in our system design.
4.1.1 The Data Matrix
Objects (samples, measurements, patterns, events) are usually represented as points (vectors)
in a multi-dimensional space, where each dimension represents a distinct attribute (variable,
measurement) describing the object. It is represented in the form of a two-dimensional M x N array.
Each row of the data matrix represents an object and the columns represent the attributes
of the object.
4.1.2 The Dissimilarity Matrix
The dissimilarity matrix (also called the distance matrix) describes pairwise distinction between
M objects. It is a square symmetric M x M matrix with the (i,j)-th element equal to the value
of a chosen measure of distinction between the i-th and the j-th object. The diagonal
elements are either not considered or are usually equal to zero, i.e. the distinction between an
object and itself is postulated as zero.
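A short Java sketch of how such a matrix could be computed, assuming Euclidean distance as the measure of distinction (an illustration only, not part of the project code):

import java.util.Arrays;

public class DissimilarityMatrixSketch {

    // Returns an M x M symmetric matrix of Euclidean distances between the rows of the data matrix.
    static double[][] build(double[][] data) {
        int m = data.length;
        double[][] d = new double[m][m];
        for (int i = 0; i < m; i++) {
            for (int j = i + 1; j < m; j++) {
                double sum = 0;
                for (int a = 0; a < data[i].length; a++) {
                    double diff = data[i][a] - data[j][a];
                    sum += diff * diff;
                }
                d[i][j] = d[j][i] = Math.sqrt(sum);   // the diagonal stays 0
            }
        }
        return d;
    }

    public static void main(String[] args) {
        double[][] data = { {1, 2}, {4, 6}, {1, 3} };   // three objects with two attributes each
        System.out.println(Arrays.deepToString(build(data)));
    }
}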
4.2 Abstract view of our system
The abstract view shows the flow chart of all the modules present in the system and displays
how they are linked together. The following diagram (Figure 4.2) shows the abstract view of
our system:
Figure 4.2 Abstract view of system
4.3 Data standardization module
This module deals with the standardization of data, removing all redundancies in the data
types and then assigning values to the attributes.
4.3.1 Data discretization using Equal Width Interval Discretization
The algorithm first sorts the attribute according to its values, and then finds the minimum
value, Xmin, and the maximum value, Xmax, of that attribute. The interval width, w, is
then computed by
    w = (Xmax - Xmin) / f          (Formula 4.1)
where f is a user-defined parameter giving the total number of intervals needed (and is equal to
the branching factor). The values of the attribute are then discretized according to the formula
    Vfinal = floor((Vinitial - Xmin) / w)          (Formula 4.2)
where Vfinal is the final (discretized) value of the attribute and Vinitial is the initial value.
4.3.2 Algorithm
Input: A raw data matrix of size M x N and branching factor f.
Output: A processed data matrix of size M x N having discrete values for attributes.
Method:
1) For each column in the data matrix:
2)     Find the max and min value of that column;
3)     Calculate the width w of the interval using Formula 4.1;
4)     For each element in the column:
5)         Reassign the value of the element using Formula 4.2;
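The following stand-alone Java sketch illustrates the module on a single attribute column (a simplified illustration that assumes the discretized value is the interval index; the actual PreProc class described in Chapter 5 operates on the whole data matrix and writes the result to a file):

import java.util.Arrays;

public class EqualWidthDiscretization {

    // Discretizes one attribute column into f equal-width intervals (Formulas 4.1 and 4.2).
    static int[] discretize(double[] column, int f) {
        double min = column[0], max = column[0];
        for (double v : column) { min = Math.min(min, v); max = Math.max(max, v); }
        double w = (max - min) / f;                        // Formula 4.1: interval width
        int[] out = new int[column.length];
        for (int i = 0; i < column.length; i++) {
            int bin = (w == 0) ? 0 : (int) ((column[i] - min) / w);
            out[i] = Math.min(bin, f - 1);                 // the maximum value falls into the last interval
        }
        return out;
    }

    public static void main(String[] args) {
        double[] fat = {28.0, 39.0, 10.0, 17.0, 3.0};      // toy values, e.g. a "fat" column
        System.out.println(Arrays.toString(discretize(fat, 4)));   // prints [2, 3, 0, 1, 0]
    }
}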
4.4 Data compression module
This module deals with the conversion of the data into a reduced data set. This is done using a
prefix tree. At the end of the module implementation a prefix tree is formed that preserves
the properties of the original set.
4.4.1 Compact Representation using Prefix tree
A prefix tree is an ordered tree data structure that is used to store an associative array where
the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key
associated with that node; instead, its position in the tree defines the key it is associated with.
All the descendants of a node have a common prefix of the string associated with that node,
and the root is associated with the empty string.
Figure 4.3 An Example of Prefix tree for keys "A", "to", "tea", "ted", "ten", "i", "in" and "inn".
4.4.2 Algorithm
Input: A pre-processed data matrix of size M x N obtained from Module 1.
Output: A prefix tree representing the processed data matrix.
Method:
1) For each row in the data matrix:
2)     Insert the row in the prefix tree;
3) Function prefixTreeInsert(row):
4)     If (root == NULL):
5)         Create root node;
6)     For each attribute i in the row:
7)         If row[i] is not present in the prefix tree:
8)             Create a child node and insert it in the tree;
9)     return;
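A compact stand-alone Java sketch of the insertion procedure (a simplified illustration; the Node and PrefixTree classes described in Chapter 5 follow the same idea but expose separate getter and insert methods):

import java.util.ArrayList;
import java.util.List;

public class PrefixTreeSketch {

    static class Node {
        float value;
        List<Node> children = new ArrayList<Node>();
        Node(float value) { this.value = value; }
    }

    Node root = new Node(Float.NaN);   // the root corresponds to the empty prefix

    // Inserts one row of the processed data matrix; rows that share a prefix reuse existing nodes.
    void insert(float[] row) {
        Node current = root;
        for (float attribute : row) {
            Node next = null;
            for (Node child : current.children) {
                if (child.value == attribute) { next = child; break; }   // prefix already present
            }
            if (next == null) {                                          // otherwise create a child node
                next = new Node(attribute);
                current.children.add(next);
            }
            current = next;
        }
    }

    public static void main(String[] args) {
        PrefixTreeSketch tree = new PrefixTreeSketch();
        tree.insert(new float[] {1, 2, 3});
        tree.insert(new float[] {1, 2, 4});   // shares the prefix (1, 2) with the first row
        System.out.println("Children of root: " + tree.root.children.size());                 // prints 1
        System.out.println("Children of node (1,2): "
                + tree.root.children.get(0).children.get(0).children.size());                 // prints 2
    }
}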
4.5 Data Clustering Module
This is the final module of our algorithm and it deals with the clustering of objects.
We use our Relevant Feature algorithm for finding the dissimilarity between the objects. At
the end of this module, clusters containing similar objects are generated.
4.5.1 Clustering using Relevant Feature algorithm
In this approach the relative distance between the nodes of the prefix tree is found. This
serves as the basic criterion for finding the dissimilarity between the different features of
objects. Features having distance less than a threshold are put together in a feature cluster.
Once we get the feature clusters, the prefix tree is scanned again and objects having these
features are grouped in the same cluster.
4.5.2 Algorithm
Input: A prefix tree obtained from Module 2, the feature threshold tf and the object cluster
threshold toc.
Output: A set of K object clusters which maximizes the similarity between the individual
elements in each cluster.
Method:
1) For each branch in the prefix tree:
       Calculate the difference between the values of each pair of individual nodes of that branch;
       If difference < tf and none of them is present in a Feature Cluster FCi then
           Create a new Feature Cluster FC and put both the node values in FC;
       Else
           Put both the values in FCi;
2) For each pair of branches i and j in the prefix tree:
       Dissimilarity = 0;
       Repeat:
           If the values of the nodes at the same level are not present in the same FCi then
               Dissimilarity = Dissimilarity + (vi - vj)^2;
       Until a leaf node is reached;
       Dissimilarity = sqrt(Dissimilarity);
       If Dissimilarity < toc and none of them is present in an Object Cluster OCi then
           Create a new Object Cluster OC and put both the objects in OC;
       Else
           Put both the objects in OCi;
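The dissimilarity computation in step 2 can be illustrated with a short Java sketch (our own simplified version that operates on two branches given as arrays and a single feature cluster; levels whose values fall in the same feature cluster are skipped, as in the pseudocode above):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class BranchDissimilarity {

    // Dissimilarity between two branches of equal length; a level is skipped when both
    // values already belong to the same feature cluster (modelled here as a set of values).
    static double dissimilarity(float[] branchI, float[] branchJ, Set<Float> featureCluster) {
        double sum = 0;
        for (int level = 0; level < branchI.length; level++) {
            boolean sameCluster = featureCluster.contains(branchI[level])
                    && featureCluster.contains(branchJ[level]);
            if (!sameCluster) {
                double diff = branchI[level] - branchJ[level];
                sum += diff * diff;
            }
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        float[] a = {1f, 5f, 9f};
        float[] b = {2f, 5f, 3f};
        // Assume the values 1 and 2 were grouped into one feature cluster in step 1.
        Set<Float> featureCluster = new HashSet<Float>(Arrays.asList(1f, 2f));
        System.out.println(dissimilarity(a, b, featureCluster));   // sqrt(0 + 0 + 36) = 6.0
    }
}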
4.6 RFBC: Relevant Feature Based Clustering Algorithm
All the three modules discussed above are combined to give the final algorithm.
Input: A database containing n objects, the branching factor f, the feature threshold tf and the
object cluster threshold toc.
Output: A set of K clusters which maximizes the similarity between the individual elements in
each cluster.
Method: The RFBC algorithm is implemented as follows:
1) Pre-process the data using the data discretization technique;
2) Represent the pre-processed data obtained in the first step in the form of a prefix tree;
3) Apply the Relevant Feature algorithm;
4) Display the clusters Ci's;
4.7 Design Constraints
1. All the modules should execute with minimum space and time complexity.
2. The data reduction/compression should preserve the properties of the data.
3. The data reduction should be independent of the input data type.
4. The algorithm should not end up creating fuzzy clusters.
CHAPTER 5
PROJECT IMPLEMENTATION
In the project the three modules discussed in the proposed system design are
implemented. The focus of this implementation is on minimum space and time complexity.
The design and abstract view of the project served as the guideline during implementation.
5.1 Data Discretization – Module 1
This module converts the raw data into processed data. In order to implement the data
discretization module we created the following class:
- PreProc.java
5.1.1 Class: PreProc.java
The following sections discuss the constructor, data structures and functions used in
the class.
5.1.1.1 Constructor
PreProc <variable name> = new PreProc(File <variable name>); It accepts a parameter
of type File, which is the raw data file.
5.1.1.2 Data Structure
- float table[][]: Stores the raw data matrix and the final processed data matrix.
- int rows, int Count: Represent the number of rows and columns in the raw data. The
  number of rows represents the number of elements and the number of columns represents
  the number of attributes.
5.1.1.3 Functions
- void GetSize(): Calculates the size/width of the interval using Formula 4.1.
- void Process(): Calculates and assigns the final values to the attributes using Formula 4.2.
- void Save(): Saves the processed data matrix (table[][]) in a file proc.txt.
5.2 Prefix Tree – Module 2
In this module we represent the data obtained after the data discretization in Module 1 in the
form of a prefix tree. The following are the classes used for the implementation of this module:
- Node.java
- PrefixTree.java
5.2.1 Class: Node.java
The following sections discuss the constructor, data structures and functions used in
the class.
5.2.1.1 Constructor
Node <variable name> = new Node(Float <variable name>); It accepts a parameter of
type Float, which is the value of an attribute.
5.2.1.2 Data Structure
- Collection<Node> child: Stores all the child nodes of the parent.
- Float content: The value of the node.
5.2.1.3 Functions
- Float getValue(node): Returns the value of the node.
- Collection<Node> getChild(node): Returns all the child nodes of the current node.
5.2.2 Class: PrefixTree.java
The following sections discuss the constructor, data structures and functions used in
the class.
5.2.2.1 Constructor
PrefixTree <variable name> = new PrefixTree(Float[] <variable name>); It accepts a
parameter of type Float[], which is a row of the pre-processed data matrix.
5.2.2.2 Data Structure
- Node root: The root of the prefix tree.
5.2.2.3 Functions
- createPrefixTree(): Creates the root node of the prefix tree.
- void insert(float[] row, int numberOfElements): Inserts the row of the data matrix into
  the prefix tree.
5.3 Scanning Prefix Tree and Clustering – Module 3
In this module we generate the final clusters of similar objects. The following are the classes
used for its implementation:
- ObjectCluster.java
- Cluster.java
5.3.1 Class: Cluster.java
The following sections discuss the constructor, data structures and functions used in
the class.
5.3.1.1 Constructor
Cluster<Variable Name> = new Cluster(" ");
5.3.1.2 Data Structure
- int featureThreshold: If the distance (difference between the node values) between
  two features is less than this threshold, then they will be put in the same cluster.
- LinkedList featureClusters: Stores the feature clusters.
5.3.1.3 Functions
- LinkedList createFeatureCluster(File): Creates clusters of features and returns a linked
  list of feature clusters.
- boolean inRange(int, int): Returns true if the distance between the values is less than the
  threshold, else returns false.
5.3.2 Class: ObjectCluster.java
The following sections discuss the constructor, data structures and functions used in
the class.
5.3.2.1 Constructor
ObjectCluster<Variable Name> = new ObjectCluster(" ")
5.3.2.2 Data Structure
- HashMap<Integer, ArrayList> clusters: The final object clusters.
5.3.2.3 Functions
- void createObjectCluster(LinkedList): Accepts as input the linked list of feature
  clusters and generates object clusters.
- void dfsTree(Node): Performs a DFS of the tree, compares two objects based on their
  features, and assigns them to object clusters.
5.4 Implementation Setup
We simulated and ran our algorithm on a system with the following specifications:

Environmental Variable    Value
Processor                 Pentium(R) Dual-Core CPU 2.00 GHz
Operating System          32-bit Windows 7
RAM                       3.00 GB
Language                  Java 6
IDE                       NetBeans 6.9.1
JRE                       1.6

Table 5.1 Implementation Environment Specification
CHAPTER 6
RESULTS AND ANALYSIS
6.1 Results
In order to check the results of our algorithm and the validity of its output, we tested the
algorithm on various datasets and obtained successful results.
6.1.1 Sample Files
We have used three different data files (data courtesy: http://www.uni-koeln.de/themen/statistik/data/cluster/)
for obtaining the results. These files represent data on the nutrients in meat, fish and fowl.
The data sets contain the nutrients in different kinds of meat, fish or fowl. The first column
shows the serial number of the meat, fish or fowl item. The remaining columns show the
Food Energy (Calories), Protein (Grams), Fat (Grams), Calcium (Milligrams) and Iron
(Milligrams), in that order.
We have classified the three dataset files into:
- Small input (27 elements)
- Medium input (1500 elements)
- Large input (15000 elements)
Figure 6.1 shows the sample dataset file consisting of 27 elements. The first column shows
the serial number of the meat, fish or fowl item. The remaining columns show the Food Energy
(Calories), Protein (Grams), Fat (Grams), Calcium (Milligrams) and Iron (Milligrams), in that
order.
Figure 6.1 Small Sample Data
Figure 6.2 shows the sample dataset file consisting of 1500 elements. The first column shows
the serial number of the meat, fish or fowl item. The remaining columns show the Food Energy
(Calories), Protein (Grams), Fat (Grams), Calcium (Milligrams) and Iron (Milligrams), in that
order.
Figure 6.2 Medium Sample Data
Figure 6.3 shows the sample dataset file consisting of 15000 elements. The first column shows
the serial number of the meat, fish or fowl item. The remaining columns show the Food Energy
(Calories), Protein (Grams), Fat (Grams), Calcium (Milligrams) and Iron (Milligrams), in that
order.
Figure 6.3 Large Sample Data
6.1.2 Data Discretization Module
The output of this module for the different dataset files is as follows:
Figure 6.4 shows that after pre-processing of the small data file, the highlighted attribute
(fats) has discretized values.
Figure 6.4 Resulting Processed small datafile
Figure 6.5 shows that after pre-processing of the medium data file, the highlighted attribute
(fats) has discretized values.
Figure 6.5 Resulting Processed medium datafile
Figure 6.6 shows that after pre-processing of the large data file, the highlighted attribute (fats)
has discretized values.
Figure 6.6 Resulting Processed large datafile
6.1.3 Data Compression Module
This module shows the prefix tree which is formed from the 3 dataset files.
Figure 6.7 shows the prefix tree formed after the data compression module is applied to
the processed small dataset. Each i-th row of the figure shows the i-th level of the prefix tree. The
height of the tree is equal to the number of attributes; in this case the height of the tree is 5.
Figure 6.7 Prefix Tree – Small Data
Figure 6.8 shows the prefix tree formed after the data compression module is applied to
the processed medium dataset. Each i-th row of the figure shows the i-th level of the prefix tree.
The height of the tree is equal to the number of attributes; in this case the height of the tree is 5.
Figure 6.8 Prefix Tree – Medium Data
Figure 6.9 shows the prefix tree formed after the data compression module is applied to
the processed large dataset. Each i-th row of the figure shows the i-th level of the prefix tree. The
height of the tree is equal to the number of attributes; in this case the height of the tree is 5.
Figure 6.9 Prefix Tree – Large Data
6.1.4 Data Clustering Module
This module shows the final output of the algorithm for the 3 dataset files.
Figure 6.10 shows the final output of the algorithm for the small dataset. Each line
in the figure represents a cluster; in total 8 clusters are formed. Each cluster is a collection
of the serial numbers of all those elements which are similar. Numerical values in the output
indicate the serial number of the object in the raw data file.
Figure 6.10 Final output of the Algorithm for small dataset
Figure 6.11 shows the final output of the algorithm for the medium dataset. Each line
in the figure represents a cluster; in total 8 clusters are formed. Each cluster is a collection
of the serial numbers of all those elements which are similar. Numerical values in the output
indicate the serial number of the object in the raw data file.
Figure 6.11 Final output of the Algorithm for medium dataset
Figure 6.12 shows the final output of the algorithm for the large dataset. Each line
in the figure represents a cluster; in total 8 clusters are formed. Each cluster is a collection
of the serial numbers of all those elements which are similar. Numerical values in the output
indicate the serial number of the object in the raw data file.
Figure 6.12 Final output of the Algorithm for large dataset
6.2 Analysis of Algorithm
A formal analysis of the algorithm is done to theoretically obtain its time complexity.
Phase 1: Data Pre-processing Module (Step 1)
- During this phase the whole data set is standardized and normalized.
- Each element of the raw data is processed only once.
- Hence, the time complexity of this phase is T(n) = O(m x n), where m is the number of
  elements in the data set and n is the number of attributes/features of each object.
Phase 2: Conversion of processed data into a prefix tree (Step 2)
- Inserting one element with n attributes takes O(n) time.
- Hence the total time complexity of converting m elements, each having n
  attributes/features, is O(m x n).
- As we are creating a prefix tree, the total space utilized is O(k x n), where k is
  the branching factor given by the user.
Phase 3: Calculating the dissimilarity between objects and assigning them to clusters (Steps 3 onwards)
- The time complexity of Step 3 is O(k x n^2), where k is the branching factor and n is
  the number of attributes/features.
- The time complexity of Step 4 is O(k^2 x n), where k is the branching factor and n is
  the number of attributes/features.
Hence the total time complexity of the algorithm is T1 + T2 + T3, which is equal to
T(m) = O(k n^2 + k^2 n + m n), where m >>> k and m >>> n.
Hence, the total time complexity is T(M) = O(M x N), which is linear in terms of M (here M
is the number of elements and N the number of attributes).
6.3 Comparison of Results
We have coded the BIRCH algorithm, which also uses the concept of a tree for finding
clusters; moreover, it is a good example of an algorithm which generates clusters for large
datasets. We compared the results for the 3 types of sample input (as mentioned in section
6.1.1) and measured the performance of both algorithms on these inputs. We kept the same
threshold for both algorithms and obtained the clusters of each dataset.
Table 6.1 below shows the comparison of our algorithm with BIRCH for the 3 different
datasets (small, medium and large) at threshold = 10.
Algorithm    Time taken (small input)    Time taken (medium input)    Time taken (large input)
RFBC         0.69 sec                    1.05 sec                     4.55 sec
BIRCH        0.49 sec                    1.15 sec                     19.78 sec
Table 6.1 Comparison table between RFBC and BIRCH when Threshold = 10
Table 6.2 below shows the comparison of our algorithm with BIRCH for the 3 different
datasets (small, medium and large) at threshold = 20.
Algorithm    Time taken (small input)    Time taken (medium input)    Time taken (large input)
RFBC         0.71 sec                    0.99 sec                     3.41 sec
BIRCH        0.47 sec                    0.94 sec                     12.64 sec
Table 6.2 Comparison table between RFBC and BIRCH when Threshold = 20
Figure 6.13 below graphically shows the comparison of our algorithm with the BIRCH
algorithm and plots Table 6.1. The X-axis in the figure denotes the size of the data set and the
Y-axis shows the time taken by the algorithm in seconds. The threshold here is 10.
Figure 6.13 Comparison of RFBC and BIRCH at Threshold = 10. The X-axis represents the size of the
dataset and the Y-axis represents the time taken in seconds.
Figure 6.14 below graphically shows the comparison of our algorithm with the BIRCH
algorithm and plots Table 6.2. The X-axis in the figure denotes the size of the data set and the
Y-axis shows the time taken by the algorithm in seconds. The threshold here is 20.
Figure 6.14 Comparison of RFBC and BIRCH at Threshold = 20. The X-axis represents the size of the
dataset and the Y-axis represents the time taken in seconds.
CHAPTER 7
CONCLUSION AND FUTURE WORK
7.1 Conclusion
We successfully demonstrated the correctness of our algorithm and showed that it performs
better than the BIRCH algorithm. Our algorithm performs exceptionally well on large
datasets: in our experiments, clustering the large (15000 objects) dataset took nearly
one-fifth of the time taken by BIRCH.
7.2 Future Work
Although we achieved better time complexity, there is further scope for improvement.
The following are improvements which can be made to the current algorithm:
- The algorithm works fine for objects having numerical data values, but the
  dissimilarity function has to be changed whenever a boolean or interval-scaled variable
  appears. Hence the dissimilarity function can be made more generic.
- Currently, pre-processing of the data is done in a serial manner. Future work may include
  parallelization of the pre-processing step using GPU computing or threads.
References
[1] Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A Density-Based Algorithm for
Discovering Clusters in Large Spatial Databases with Noise. (pp. 226-231).
[2] Guha, S., Rastogi, R., & Shim, K. (1998). CURE: An Efficient Clustering Algorithm for
Large Databases. In SIGMOD '98: Proceedings of the 1998 ACM SIGMOD International
Conference on Management of Data. New York: ACM.
[3] Guha, S., Rastogi, R., & Shim, K. (July 2000). Rock: A robust clustering algorithm for
categorical attributes. Information Systems, 345-366.
[4] Raymond T. Ng and Jiawei Han (2002). CLARANS: A Method for Clustering Objects for
Spatial Data Mining. IEEE Trans. on Knowl. and Data Eng. 14, 5 (September 2002), 1003-1016.
DOI=10.1109/TKDE.2002.1033770.
[5] George Karypis, Eui-Hong Han, and Vipin Kumar, (1999) CHAMELEON: A
Hierarchical Clustering Algorithm Using Dynamic Modeling, IEEE Computer, Vol. 32, No.
8, August, 1999. pp. 68-75.
[6] Radhika M. Pai and V. S. Ananthanarayana (2008). Prefix-Suffix Trees: A Novel Scheme
for Compact Representation of Large Datasets. Springer.
[7] Zhang, T., Ramakrishnan, R., & Livny, M. (1997). BIRCH: A New Data Clustering
Algorithm and Its Applications. Data Mining and Knowledge Discovery, 1 (2), 141-182.
[8] MacQueen, J. (1967). Some Methods for Classification and Analysis of Multivariate
Observations (pp. 281-297). In: Le Cam, L.M., Neyman, J. (Eds.), Proc. 5th Berkeley Symp.
on Mathematical Statistics and Probability, vol. I. University of California Press.
[9] Kaufman, L., & Rousseeuw, P. J. (1987). Clustering by Means of Medoids.
[10] Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and Techniques.
Elsevier.
[11] Using Tries. http://www.topcoder.com/AlgorithmTutorial//tries.html