ALGORITHM FOR CLUSTERING OF LARGE DATASETS
Major Project Report
Submitted in Partial fulfillment of the requirements for the award of the degree
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
BY
Lokesh Kumar Meena (08CO38)
Mhetre Shardul (08CO44)
Udit Agrawal (08CO80)
Under The Guidance of
Dr. Mahendra Pratap Singh
Assistant Professor
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA, SURATHKAL
SRINIVASNAGAR- 575025 KARNATAKA, INDIA
APRIL, 2012
DECLARATION
We hereby declare that the Project Work Report entitled “ALGORITHM FOR
CLUSTERING OF LARGE DATASETS” which is being submitted to National Institute of
Technology Karnataka, Surathkal for the award of the degree of Bachelor of Technology in
Computer Science and Engineering is a bona fide report of the work carried out by us. The
material contained in this report has not been submitted to any university or institution for the
award of any degree.
NAME                      REGISTRATION NO.     SIGNATURE
1. Lokesh Kumar Meena     08CO38
2. Mhetre Shardul         08CO44
3. Udit Agrawal           08CO80
Department of Computer Science and Engineering
Place: NITK, Surathkal
Date:
CERTIFICATE
This is to certify that the B.Tech Project Work Report entitled “ALGORITHM FOR
CLUSTERING OF LARGE DATASETS” submitted by
Sl. No.   Registration No.   Name
1.        08CO38             Lokesh Kumar Meena
2.        08CO44             Mhetre Shardul
3.        08CO80             Udit Agrawal
as the record of the work carried out by them, is accepted as the B.Tech Project Work Report
submission in partial fulfillment of the requirements for the award of the degree of Bachelor of
Technology in Computer Science & Engineering.
Project Guide:
Mr. Mahendra Pratap Singh
Assistant Professor
Department of Computer Science and Engineering
NITK Surathkal
Chairman- DUGC
ACKNOWLEDGEMENT
We take this opportunity to express our deepest gratitude and appreciation to all those
who have helped us directly or indirectly towards the successful completion of this project.
First and foremost, we would like to express our sincere appreciation and gratitude to our
esteemed guides Mr. M. P. Singh, Assistant Professor, and Mr. Vinay Kumar, Associate Professor,
Department of Computer Science & Engineering, NITK Surathkal, for their insightful advice,
encouragement, guidance, criticism, and valuable suggestions throughout the course of our project
work. Without their continued support and interest, this thesis would not have been the same as
presented here.
We express our deep gratitude to Dr. Santhi Thilagam P., Associate Professor and Head,
Department of Computer Science & Engineering, National Institute of Technology Karnataka,
Surathkal for her constant co-operation, support and for providing necessary facilities throughout
the B.Tech program.
We would like to take this opportunity to express our thanks towards the teaching and non-teaching staff in the Department of Computer Science & Engineering, NITK for their invaluable
help and support in these four years of our study. We are also grateful to all our classmates for
their help, encouragement and invaluable suggestions.
Our special thanks to our parents, supporting families and friends who continuously supported
and encouraged us in every possible way for the successful completion of this thesis.
Last but not least, we thank God Almighty for his blessings without which the completion of this
project work would not have been possible.
Lokesh Kumar Meena, 08CO38
Mhetre Shardul, 08CO44
Udit Agrawal, 08CO80
ABSTRACT
Data clustering is an important technique for exploratory data analysis and has been studied
for several years. It has been shown to be useful in many practical domains such as data
classification and image processing. However, existing data clustering methods do not
adequately address the problem of processing large datasets with a limited amount of
resources (e.g., memory and CPU cycles). As the dataset size increases, they do not scale up
well in terms of memory requirement, running time, and result quality.
Using these limitations as our motivation, we propose a scheme based on prefix trees for
compact storage of patterns in data mining. The prefix tree is generated from the dataset in a
single scan and groups patterns that share common prefixes, giving a compact storage of
patterns. We propose a clustering algorithm based on this storage and show experimentally
that this type of storage reduces both space and time requirements.
KEYWORDS: Algorithm, Clustering, Prefix Tree, Data Mining.
TABLE OF CONTENTS
Declaration ........................................................................................................................ i
Certificate .......................................................................................................................... ii
Acknowledgement ............................................................................................................. iii
Abstract ............................................................................................................................. iv
Table of contents ................................................................................................................v
List of Figures ....................................................................................................................viii
List of Tables ..................................................................................................................... ix
1. INTRODUCTION ......................................................................................................... 1
1.1 Application of Clustering ........................................................................................... 2
1.1.1 Educational research analysis .............................................................................. 2
1.1.2 Information retrieval............................................................................................ 2
1.1.3 Biology................................................................................................................ 2
1.2 Constraints of Clustering ........................................................................................... 2
1.2.1 Scalability............................................................................................................ 2
1.2.2 Ability to deal with different types of attributes ................................................... 2
1.2.3 Discovery of clusters with arbitrary shape............................................................ 2
1.2.4 Minimal requirements for domain knowledge to determine input parameters ....... 2
1.2.5 Ability to deal with noisy data ............................................................................. 3
1.2.6 High dimensionality ............................................................................................ 3
1.3 Organisation of Report ........................................................................................... 3
2. LITERATURE SURVEY .............................................................................................. 4
2.1 Typical clustering models .......................................................................................... 4
2.1.1 Centroid models .................................................................................................. 4
2.1.2 Density models: ................................................................................................... 4
2.1.3 Hierarchical Clustering ........................................................................................ 4
2.1.4 Distribution models: ............................................................................................ 5
2.1.5 Graph-Based Clustering: ...................................................................................... 5
2.2 Related Work ............................................................................................................. 5
2.2.1 K-Means Clustering [8] ....................................................................................... 5
2.2.2 K-Medoids Clustering [9] .................................................................................... 6
2.2.3 CLARANS [4]..................................................................................................... 6
2.2.4 DBSCAN [1] ....................................................................................................... 6
2.2.5 CURE [2] ............................................................................................................ 7
2.2.6 Chameleon [5] ..................................................................................................... 7
2.2.7 BIRCH [7] ........................................................................................................... 8
3. PROBLEM STATEMENT ........................................................................................... 9
3.1 Problem Statement ..................................................................................................... 9
4. PROPOSED SYSTEM DESIGN .................................................................................. 10
4.1 Terminology and techniques: ..................................................................................... 10
4.1.1 The Data Matrix .................................................................................................. 10
4.1.2 The Dissimilarity Matrix...................................................................................... 10
4.2 Abstract view of our system ....................................................................................... 10
4.3 Data standardization module ...................................................................................... 11
4.3.1 Data discretization using Equal Width Interval Discretization .............................. 11
4.3.2 Algorithm ............................................................................................................ 11
4.4 Data compression module .......................................................................................... 12
4.4.1 Compact Representation using Prefix tree ............................................................ 12
4.4.2 Algorithm ............................................................................................................ 12
4.5 Data Clustering Module ............................................................................................. 13
4.5.1 Clustering using Relevant Feature algorithm ........................................................ 13
4.5.2 Algorithm ............................................................................................................ 13
4.6 RFBC: Relevant Feature Based Clustering Algorithm ................................................ 13
4.7 Design Constraints ..................................................................................................... 14
5. PROJECT IMPLEMENTATION ................................................................................ 15
5.1 Data Discretization – Module 1 .................................................................................. 15
5.1.1 Class: PreProc.java .............................................................................................. 15
5.2 Prefix Tree – Module 2 .............................................................................................. 15
5.2.1 Class: Node.java .................................................................................................. 15
5.2.2 Class: Prefix.java ................................................................................................. 16
5.3 Scanning Prefix tree and Clustering– Module 3 .......................................................... 16
5.3.1 Class: Cluster.java ............................................................................................... 16
5.3.2 Class: ObjectCluster.java ..................................................................................... 17
5.4 Implementation Setup ............................................................................................. 17
6. RESULTS AND ANALYSIS ......................................................................................... 18
6.1 Results ....................................................................................................................... 18
6.1.1 Sample Files ........................................................................................................ 18
6.1.2 Data Discretization Module ................................................................................. 20
6.1.3 Data Compression Module................................................................................... 21
6.1.4 Data Clustering Module ....................................................................................... 22
6. 2 Analysis of Algorithm ............................................................................................... 24
6.3 Comparison of Results ............................................................................................... 25
7. CONCLUSION AND FUTURE WORK ...................................................................... 27
7.1 Conclusion ................................................................................................................. 27
7.2 Future Work............................................................................................................... 27
REFERENCES .................................................................................................................. 28
LIST OF FIGURES
Fig No.    Description                                        Page No.
Fig. 1.1   Clusters – Example                                 1
Fig. 2.1   Density Based Clustering – Model                   4
Fig. 4.1   Flow Diagram of Cluster Formation                  10
Fig. 4.2   Abstract View of System                            11
Fig. 4.3   Prefix Tree – Example                              12
Fig. 6.1   Small Sample Data                                  18
Fig. 6.2   Medium Sample Data                                 19
Fig. 6.3   Large Sample Data                                  19
Fig. 6.4   Processed Small Data File                          20
Fig. 6.5   Processed Medium Data File                         20
Fig. 6.6   Processed Large Data File                          21
Fig. 6.7   Prefix Tree – Small Data                           21
Fig. 6.8   Prefix Tree – Medium Data                          22
Fig. 6.9   Prefix Tree – Large Data                           22
Fig. 6.10  Final Output – Small Data                          23
Fig. 6.11  Final Output – Medium Data                         23
Fig. 6.12  Final Output – Large Data                          24
Fig. 6.13  Comparison of RFBC and BIRCH at Threshold = 10     25
Fig. 6.14  Comparison of RFBC and BIRCH at Threshold = 20     25
LIST OF TABLES
Table No.   Description                                                    Page No.
Table 5.1   Implementation Environment Specification                       17
Table 6.1   Comparison table between RFBC and BIRCH when Threshold = 10    25
Table 6.2   Comparison table between RFBC and BIRCH when Threshold = 20    25
CHAPTER 1
INTRODUCTION
Cluster analysis is a data mining technique used to place data elements into related groups
without advance knowledge of the group definitions.
It groups objects based on the information found in the data describing the objects or their
relationships. The goal is that the objects in a group will be similar (or related) to one another
and different from (or unrelated to) the objects in other groups. The greater the similarity or
homogeneity within a group and the greater the dissimilarity between groups, the “better” or
more distinct the clustering.
Cluster analysis divides data into meaningful or useful groups (clusters). If meaningful
clusters are the goal, then the resulting clusters should capture the “natural” structure of the
data. For example, cluster analysis has been used to group related documents for browsing, to
find genes and proteins that have similar functionality, and to provide a grouping of spatial
locations prone to earthquakes. However, in other cases, cluster analysis is only a useful
starting point for other purposes, e.g., data compression or efficiently finding the nearest
neighbours of points. Figure 1.1 shows an example of clustering in which objects are grouped into 3
clusters. Whether for understanding or utility, cluster analysis has long been used in a wide
variety of fields: psychology and other social sciences, biology, statistics, pattern recognition,
information retrieval, machine learning, and data mining.
Figure 1.1 An example of Clusters
Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be
achieved by various algorithms that differ significantly in their notion of what constitutes a
cluster and how to efficiently find them. Popular notions of clusters include groups with low
distances among the cluster members, dense areas of the data space, intervals or particular
statistical distributions. The appropriate clustering algorithm and parameter settings depend
on the individual data set and intended use of the results.
1.1 Application of Clustering
There are wide applications of clustering to real-life problems. The following are a few fields in
which clustering is used very often:
1.1.1 Educational research analysis
Data for clustering can be students, parents, sex, or test scores. Clustering is an important
method for understanding and evaluating grouping or streaming in educational research.
Cluster analysis in educational research can be used for data exploration, cluster confirmation
and hypothesis testing. Data exploration is used when there is little information about which
schools or students will be grouped together. It aims at discovering any meaningful clusters
of units based on measures on a set of response variables.
1.1.2 Information retrieval
A search engine query may return thousands of pages. Clustering can be used to group
these search results into small clusters, each of which captures a particular aspect of the query.
1.1.3 Biology
Clustering can be used for creating the taxonomy of all living things: kingdom, phylum,
family, genus, etc.
1.2 Constraints of Clustering
Clustering is a challenging and interesting field; its potential applications pose their own
special requirements. The following are typical requirements of clustering:
1.2.1 Scalability
Many clustering algorithms work well on small data sets containing fewer than 200 data
objects. However, a large database may contain millions of objects. Clustering on a sample of
a given large data set may lead to biased results. Highly scalable clustering algorithms are
needed.
1.2.2 Ability to deal with different types of attributes
Many algorithms are designed to cluster interval-based (numerical) data. However,
applications may require clustering other types of data, such as binary, categorical (nominal),
and ordinal data, or mixtures of these data types.
1.2.3 Discovery of clusters with arbitrary shape
Many clustering algorithms determine clusters based on Euclidean or Manhattan distance
measures. Algorithms based on such distance measures tend to find spherical clusters with
similar size and density. However, a cluster could be of any shape. It is important to develop
algorithms that can detect clusters of arbitrary shape.
1.2.4 Minimal requirements for domain knowledge to determine input parameters
Many clustering algorithms require users to input certain parameters in cluster analysis (such
as the number of desired clusters). The clustering results can be quite sensitive to input
parameters. Parameters are often hard to determine, especially for data sets containing
high-dimensional objects. This not only burdens users, but also makes the quality of clustering
difficult to control.
1.2.5 Ability to deal with noisy data
Most real-world databases contain outliers or missing, unknown, or erroneous data. Some
clustering algorithms are sensitive to such data and may lead to clusters of poor quality.
1.2.6 High dimensionality
A database or a data warehouse can contain several dimensions or attributes. Many clustering
algorithms are good at handling low-dimensional data, involving only two to three
dimensions. Human eyes are good at judging the quality of clustering for up to three
dimensions. It is challenging to cluster data objects in high-dimensional space, especially
considering that such data can be very sparse and highly skewed.
1.3 Organisation of Report
Chapter 2 documents various clustering models developed over time, and some selected
clustering algorithms from these models. Chapter 3 states the problem statement of the project,
and Chapter 4 explains the proposed system design, including all design modules and their
constraints. Chapter 5 includes implementation details of all modules. Chapter 6 presents the
results, analysis and comparison of the results with standard algorithms. Chapter 7 concludes
the report with future work.
CHAPTER 2
LITERATURE SURVEY
We studied various clustering algorithms, i.e. K-Means, K-Medoids, DBSCAN, CURE,
BIRCH, CLARANS, CHAMELEON, and their variants which use expectation–maximization
techniques.
2.1 Typical clustering models
Many different clustering techniques have been proposed over the years. These techniques
can be described using the following criteria or models.
2.1.1 Centroid models
Partition clustering techniques create a one-level partitioning of the data points. K-means and
K-medoids are two examples of this model. Both these techniques are based on the idea that
a centre point can represent a cluster. For K-means we use the notion of a centroid, which is
the mean or median point of a group of points. For K-medoids we use the notion of a medoid,
which is the most representative (central) point of a group of points. By its definition a
medoid is required to be an actual data point. CLARANS is a more efficient version of the
basic K-medoids algorithm and is used in spatial data mining problems.
2.1.2 Density models:
The most popular density-based clustering methods are DBSCAN and OPTICS, which define
clusters as connected dense regions in the data space. In density-based clustering, clusters are
defined as areas of higher density than the remainder of the data set. In contrast to many
newer methods, DBSCAN features a well-defined cluster model called "density-reachability".
Figure 2.1 shows an example of density-based clustering.
Figure 2.1 Density based clustering model
2.1.3 Hierarchical Clustering
In hierarchical clustering the goal is to produce a hierarchical series of nested clusters,
ranging from clusters of individual points at the bottom to an all-inclusive cluster at the top.
A diagram called a dendrogram graphically represents this hierarchy and is an inverted tree
that describes the order in which points are merged (bottom-up view) or clusters are split
(top-down view).
Hierarchical clustering builds models based on distance connectivity. Connectivity based
clustering, also known as hierarchical clustering, is based on the core idea of objects being
more related to nearby objects than to objects farther away. As such, these algorithms connect
"objects" to form "clusters" based on their distance. A cluster can be described largely by the
maximum distance needed to connect parts of the cluster. At different distances, different
clusters will form, which can be represented using a dendrogram, which explains where the
common name "hierarchical clustering" comes from.
2.1.4 Distribution models:
The clustering model most closely related to statistics is based on distribution models.
Clusters can then easily be defined as objects belonging most likely to the same distribution.
A nice property of this approach is that this closely resembles the way artificial data sets are
generated: by sampling random objects from a distribution. Clusters are modelled using
statistic distributions, such as multivariate normal distributions used by the Expectationmaximization algorithm.
2.1.5 Graph-Based Clustering:
The hierarchical clustering algorithms can be viewed as operating on a proximity graph.
However, they are most commonly viewed in terms of merging or splitting clusters, and often
there is no mention of graph related concepts. There are some clustering techniques, however,
that are explicitly cast in terms of a graph or a hypergraph. Many of these algorithms are
based on the idea of looking at the nearest neighbours of a point.
2.2 Related Work
There are many clustering algorithms based on the models listed above. We studied some
selected algorithms from these clustering models to understand the concepts behind each
algorithm. Through this study we tried to find out the limitations of these algorithms.
2.2.1 K-Means Clustering [8]
The K-means clustering technique is very simple. K-means uses the notion of a centroid,
which is the mean or median point of a group of points. A centroid almost never corresponds
to an actual data point.
Basic Algorithm for finding K clusters. Given k, the k-means algorithm is implemented in 4
steps:
1. Partition objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition. The
centroid is the centre (mean point) of the cluster.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when there are no new assignments.
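As an illustration of these steps, the following is a minimal Java sketch of the K-means loop (our own simplified example assuming one-dimensional points, two clusters and a fixed number of iterations; it is not the implementation used in this project):

import java.util.Arrays;

public class KMeansSketch {
    public static void main(String[] args) {
        double[] points = {1.0, 1.5, 2.0, 10.0, 10.5, 11.0};   // toy one-dimensional data set
        double[] centroids = {1.0, 10.0};                      // k = 2 initial seed points
        int[] assignment = new int[points.length];

        for (int iteration = 0; iteration < 10; iteration++) {
            // Step 3: assign each point to the cluster with the nearest centroid.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(points[p] - centroids[c]) < Math.abs(points[p] - centroids[best])) {
                        best = c;
                    }
                }
                assignment[p] = best;
            }
            // Step 2: recompute each centroid as the mean of the points assigned to it.
            for (int c = 0; c < centroids.length; c++) {
                double sum = 0;
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] == c) { sum += points[p]; count++; }
                }
                if (count > 0) centroids[c] = sum / count;
            }
        }
        System.out.println("Centroids: " + Arrays.toString(centroids));      // [1.5, 10.5]
        System.out.println("Assignments: " + Arrays.toString(assignment));   // [0, 0, 0, 1, 1, 1]
    }
}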
2.2.2 K-Medoids Clustering [9]
The objective of K-medoid clustering is to find a non-overlapping set of clusters such that
each cluster has a most representative point, i.e., a point that is most centrally located with
respect to some measure, e.g., distance. These representative points are called medoids.
Basic K-medoid Algorithm for finding K clusters:
1. Select K initial points. These points are the candidate medoids and are intended to be
the most central points of their clusters.
2. Consider the effect of replacing one of the selected objects (medoids) with one of the
non-selected objects. The distance of each non-selected point from the closest
candidate medoid is calculated, and this distance is summed over all points. This
distance represents the “cost” of the current configuration. All possible swaps of a
non-selected point for a selected one are considered, and the cost of each
configuration is calculated.
3. Select the configuration with the lowest cost. If this is a new configuration, then
repeat step 2.
4. Otherwise, associate each non-selected point with its closest selected point (medoid)
and stop.
2.2.3 CLARANS [4]
CLARANS grew out of two clustering algorithms, PAM and CLARA, and was developed
specifically for use in spatial data mining.
PAM (Partitioning Around Medoids) is a “K-medoid” based clustering algorithm that
attempts to cluster a set of m points into K clusters.
CLARA (Clustering LARge Applications) is an adaptation of PAM, for handling larger data
sets. It works by repeatedly sampling a set of data points, calculating the medoids of the
sample, and evaluating the cost of the configuration that consists of these “sample-derived”
medoids and the entire data set. The set of medoids that minimizes the cost is selected.
The algorithm can be stated as follows:
1. Randomly pick K candidate medoids.
2. Randomly consider a swap of one of the selected points for a non-selected point.
3. If the new configuration is better, i.e., has lower cost, then repeat step 2 with the new
configuration.
4. Otherwise, repeat step 2 with the current configuration unless a parameterized limit
has been exceeded. (For example, the limit may be set to max(250, K * (m - K)).)
5. Compare the current solution with any previous solutions and keep track of the best.
6. Return to step 1 unless a parameterized limit has been exceeded.
2.2.4 DBSCAN [1]
DBSCAN is a density-based clustering algorithm that works with a number of different
distance metrics. When DBSCAN has processed a set of data points, a point will either be in
a cluster or will be classified as noise. DBSCAN is based on the concepts of a point being
“density reachable” and “density connected”.
Conceptually, data points fall into three classes:
1. Core points. These are points that are at the interior of a cluster. A point is an interior
point if there are enough points in its neighbourhood, i.e., if the number of points
within a given neighbourhood around the point exceeds a certain threshold, as
determined by the distance function and a supplied distance parameter. If two core
points belong to each other's neighbourhoods, then the core points belong to the same
cluster.
2. Border points. A border point is a point that is not a core point, i.e., there are not
enough points in its neighbourhood, but it falls within the neighbourhood of a core
point.
3. Noise points. A noise point is any point that is not a core point or a border point.
Thus, for DBSCAN, a cluster is the set of all core points whose neighbourhoods transitively
connect them together, along with some border points.
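The three point classes can be illustrated with a short Java sketch (our own simplified example on one-dimensional points, with assumed values for the radius eps and the density threshold minPts; it is not the original DBSCAN implementation):

import java.util.Arrays;

public class DbscanPointClasses {
    public static void main(String[] args) {
        double[] pts = {1.0, 1.2, 1.4, 1.9, 5.0, 9.0, 9.1, 9.2};
        double eps = 0.5;      // neighbourhood radius (assumed value)
        int minPts = 3;        // density threshold, counting the point itself

        // First pass: a point with at least minPts points in its eps-neighbourhood is a core point.
        boolean[] core = new boolean[pts.length];
        for (int i = 0; i < pts.length; i++) {
            int neighbours = 0;
            for (double q : pts) if (Math.abs(pts[i] - q) <= eps) neighbours++;
            core[i] = neighbours >= minPts;
        }
        // Second pass: a non-core point inside some core point's neighbourhood is a border point;
        // anything else is noise.
        String[] label = new String[pts.length];
        for (int i = 0; i < pts.length; i++) {
            if (core[i]) { label[i] = "core"; continue; }
            label[i] = "noise";
            for (int j = 0; j < pts.length; j++) {
                if (core[j] && Math.abs(pts[i] - pts[j]) <= eps) { label[i] = "border"; break; }
            }
        }
        System.out.println(Arrays.toString(label));   // [core, core, core, border, noise, core, core, core]
    }
}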
2.2.5 CURE [2]
CURE (Clustering Using Representatives) is a clustering algorithm that uses a variety of
different techniques to create an approach which can handle large data sets, outliers, and
clusters with non-spherical shapes and non-uniform sizes.
We summarize our description of CURE by explicitly listing the different steps:
1. Draw a random sample from the data set.
2. Partition the sample into p equal sized partitions.
3. Cluster the points in each partition using the hierarchical clustering algorithm to obtain
m/pq clusters in each partition and a total of m/q clusters. Some outlier elimination
occurs during this process.
4. Eliminate outliers. This is the second phase of outlier elimination.
5. Assign all data to the nearest cluster to obtain a complete clustering.
2.2.6 Chameleon [5]
Chameleon is a clustering algorithm that combines an initial partitioning of the data using an
efficient graph partitioning algorithm with a novel hierarchical clustering scheme that
dynamically models clusters. The key idea is that two clusters will be merged only if the
resulting cluster is similar to the original clusters, i.e. self-similarity is preserved.
In short, the steps of the Chameleon algorithm are:
1. Build a k-nearest neighbour graph.
2. Partition the graph into partitions using a multilevel graph-partitioning algorithm.
3. Perform a hierarchical clustering starting with the partitions. This hierarchical
clustering will merge the clusters which best preserve the cluster self-similarity with
respect to relative interconnectivity and relative closeness.
2.2.7 BIRCH [7]
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is based on the
notion of a clustering feature (CF) and a CF tree. The idea is that a cluster of data points
(vectors) can be represented by a triple of numbers (N, LS, SS), where N is the number of
points in the cluster, LS is the linear sum of the points, and SS is the sum of squares of the
points.
A CF tree is built as the data is scanned. As each data point is encountered, the CF tree is
traversed, starting from the root and choosing the closest node at each level. When the closest
“leaf” cluster for the current data point is finally identified, a test is performed to see if
adding the data item to the candidate cluster will result in a new cluster with a diameter
greater than the given threshold, T. If not, then the data point is “added” to the candidate
cluster by updating the CF information. The cluster information for all nodes from the leaf to
the root is also updated.
BIRCH consists of a number of phases beyond the initial creation of the CF tree. The phases
of BIRCH are as follows:
1. Load the data into memory by creating a CF tree that “summarizes” the data.
2. Build a smaller CF tree if it is necessary for phase 3. T is increased, and then the leaf
node entries (clusters) are reinserted. Since T has increased, some clusters will be
merged.
3. Perform global clustering. Different forms of global clustering (clustering which uses
the pairwise distances between all the clusters) can be used.
4. Redistribute the data points using the centroids of clusters discovered in step 3 and
thus discover a new set of clusters. By repeating this phase multiple times, the
process converges to a local minimum.
Because of page size constraints and the T parameter, points that should be in one cluster are
sometimes split, and points that should be in different clusters are sometimes combined. Also,
if the data set contains duplicate points, these points can sometimes be clustered differently,
depending on the order in which they are encountered.
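The clustering feature idea can be illustrated with a small Java sketch (our own example restricted to one-dimensional points; BIRCH itself keeps such (N, LS, SS) triples for multi-dimensional vectors in every CF-tree node):

// Clustering feature (CF) for one-dimensional points:
// n = number of points, ls = linear sum, ss = sum of squares.
class ClusteringFeature {
    long n;
    double ls;
    double ss;

    void add(double x) { n++; ls += x; ss += x * x; }

    // Merging two CFs is component-wise addition; no original points are needed.
    void merge(ClusteringFeature other) { n += other.n; ls += other.ls; ss += other.ss; }

    double centroid() { return ls / n; }

    double radius() { return Math.sqrt(Math.max(0, ss / n - Math.pow(ls / n, 2))); }
}

public class CfDemo {
    public static void main(String[] args) {
        ClusteringFeature cf = new ClusteringFeature();
        for (double x : new double[] {2.0, 4.0, 6.0}) cf.add(x);
        System.out.println("N=" + cf.n + " LS=" + cf.ls + " SS=" + cf.ss
                + " centroid=" + cf.centroid() + " radius=" + cf.radius());
    }
}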
CHAPTER 3
PROBLEM STATEMENT
In this chapter we state the problem statement and the major objectives of the project.
3.1 Problem Statement
The problem statement of the project is to “Develop an algorithm for clustering of large
datasets”.
The objectives of the project are:
- The algorithm developed must be scalable.
- It must be able to handle objects with high dimensionality.
- It must have better time and space complexity than current similar algorithms such as BIRCH.
CHAPTER 4
PROPOSED SYSTEM DESIGN
Clustering is not one specific algorithm, but the general task to be solved. It can be achieved
by various algorithms that differ significantly in their notion of what constitutes a cluster and
how to efficiently find them.
The process of forming clusters from the raw data is shown in the following flow diagram (Fig. 4.1):
Figure 4.1 Flow diagram of cluster formation
In Figure 4.1, raw data represents the data obtained from any source or database. Generally
this data consists of millions of entries, each representing a different object. The clustering
algorithm performs the task of converting the raw data into clusters of data. Currently, there
are numerous clustering algorithms available, chosen based on the input, the size of the data, etc.
Finally, the clusters generated as the result of the second step contain objects of similar type.
Popular notions of clusters include groups with low distances among the cluster members,
dense areas of the data space, intervals or particular statistical distributions.
4.1 Terminology and techniques:
The following is the terminology used in our system design.
4.1.1 The Data Matrix
Objects (samples, measurements, patterns, events) are usually represented as points (vectors)
in a multi-dimensional space, where each dimension represents a distinct attribute (variable,
measurement) describing the object. It is represented in the form of a two-dimensional M x N array.
Each row of the data matrix represents an object and the columns represent the attributes
of the object.
4.1.2 The Dissimilarity Matrix
The dissimilarity matrix (also called the distance matrix) describes pairwise distinction between
M objects. It is a square symmetric M x M matrix with the (i,j)-th element equal to the value
of a chosen measure of distinction between the i-th and the j-th object. The diagonal
elements are either not considered or are usually equal to zero, i.e. the distinction between an
object and itself is postulated as zero.
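A short Java sketch of how such a matrix could be computed, assuming Euclidean distance as the measure of distinction (an illustration only, not part of the project code):

import java.util.Arrays;

public class DissimilarityMatrixSketch {

    // Returns an M x M symmetric matrix of Euclidean distances between the rows of the data matrix.
    static double[][] build(double[][] data) {
        int m = data.length;
        double[][] d = new double[m][m];
        for (int i = 0; i < m; i++) {
            for (int j = i + 1; j < m; j++) {
                double sum = 0;
                for (int a = 0; a < data[i].length; a++) {
                    double diff = data[i][a] - data[j][a];
                    sum += diff * diff;
                }
                d[i][j] = d[j][i] = Math.sqrt(sum);   // the diagonal stays 0
            }
        }
        return d;
    }

    public static void main(String[] args) {
        double[][] data = { {1, 2}, {4, 6}, {1, 3} };   // three objects with two attributes each
        System.out.println(Arrays.deepToString(build(data)));
    }
}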
4.2 Abstract view of our system
The abstract view shows the flow chart of all the modules present in the system and displays
how they are linked together. The following diagram (Figure 4.2) shows the abstract view of
our system:
Figure 4.2 Abstract view of system
4.3 Data standardization module
This module deals with the standardization of data, removing all redundancies in the data
types and then assigning values to the attributes.
4.3.1 Data discretization using Equal Width Interval Discretization
The algorithm first sorts the attribute according to its values, and then finds the minimum
value, Xmin, and the maximum value, Xmax, of that attribute. The interval width, w, is
then computed by
    w = (Xmax - Xmin) / f          (Formula 4.1)
where f is a user-defined parameter giving the total number of intervals needed (and is equal to
the branching factor). The values of the attribute are then discretized according to the formula
    Vfinal = floor((Vinitial - Xmin) / w)          (Formula 4.2)
where Vfinal is the final (discretized) value of the attribute and Vinitial is the initial value.
4.3.2 Algorithm
Input: A raw data matrix of size M x N and branching factor f.
Output: A processed data matrix of size M x N having discrete values for attributes.
Method:
1) For each column in the data matrix:
2)     Find the max and min value of that column;
3)     Calculate the width w of the interval using Formula 4.1;
4)     For each element in the column:
5)         Reassign the value of the element using Formula 4.2;
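The following stand-alone Java sketch illustrates the module on a single attribute column (a simplified illustration that assumes the discretized value is the interval index; the actual PreProc class described in Chapter 5 operates on the whole data matrix and writes the result to a file):

import java.util.Arrays;

public class EqualWidthDiscretization {

    // Discretizes one attribute column into f equal-width intervals (Formulas 4.1 and 4.2).
    static int[] discretize(double[] column, int f) {
        double min = column[0], max = column[0];
        for (double v : column) { min = Math.min(min, v); max = Math.max(max, v); }
        double w = (max - min) / f;                        // Formula 4.1: interval width
        int[] out = new int[column.length];
        for (int i = 0; i < column.length; i++) {
            int bin = (w == 0) ? 0 : (int) ((column[i] - min) / w);
            out[i] = Math.min(bin, f - 1);                 // the maximum value falls into the last interval
        }
        return out;
    }

    public static void main(String[] args) {
        double[] fat = {28.0, 39.0, 10.0, 17.0, 3.0};      // toy values, e.g. a "fat" column
        System.out.println(Arrays.toString(discretize(fat, 4)));   // prints [2, 3, 0, 1, 0]
    }
}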
4.4 Data compression module
This module deals with the conversion of the data into a reduced data set. This is done using a
prefix tree. At the end of the module implementation a prefix tree is formed that preserves
the properties of the original set.
4.4.1 Compact Representation using Prefix tree
A prefix tree is an ordered tree data structure that is used to store an associative array where
the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key
associated with that node; instead, its position in the tree defines the key it is associated with.
All the descendants of a node have a common prefix of the string associated with that node,
and the root is associated with the empty string.
Figure 4.3 An Example of Prefix tree for keys "A", "to", "tea", "ted", "ten", "i", "in" and "inn".
4.4.2 Algorithm
Input: A pre-processed data matrix of size M x N obtained from Module 1.
Output: A prefix tree representing the processed data matrix.
Method:
1) For each row in the data matrix:
2)     Insert the row in the prefix tree;
3) Function prefixTreeInsert(row):
4)     If (root == NULL):
5)         Create root node;
6)     For each attribute i in the row:
7)         If row[i] is not present in the prefix tree:
8)             Create a child node and insert it in the tree;
9)     return;
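A compact stand-alone Java sketch of the insertion procedure (a simplified illustration; the Node and PrefixTree classes described in Chapter 5 follow the same idea but expose separate getter and insert methods):

import java.util.ArrayList;
import java.util.List;

public class PrefixTreeSketch {

    static class Node {
        float value;
        List<Node> children = new ArrayList<Node>();
        Node(float value) { this.value = value; }
    }

    Node root = new Node(Float.NaN);   // the root corresponds to the empty prefix

    // Inserts one row of the processed data matrix; rows that share a prefix reuse existing nodes.
    void insert(float[] row) {
        Node current = root;
        for (float attribute : row) {
            Node next = null;
            for (Node child : current.children) {
                if (child.value == attribute) { next = child; break; }   // prefix already present
            }
            if (next == null) {                                          // otherwise create a child node
                next = new Node(attribute);
                current.children.add(next);
            }
            current = next;
        }
    }

    public static void main(String[] args) {
        PrefixTreeSketch tree = new PrefixTreeSketch();
        tree.insert(new float[] {1, 2, 3});
        tree.insert(new float[] {1, 2, 4});   // shares the prefix (1, 2) with the first row
        System.out.println("Children of root: " + tree.root.children.size());                 // prints 1
        System.out.println("Children of node (1,2): "
                + tree.root.children.get(0).children.get(0).children.size());                 // prints 2
    }
}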
4.5 Data Clustering Module
This is the final module of our algorithm and it deals with the clustering of objects.
We use our Relevant Feature algorithm for finding the dissimilarity between the objects. At
the end of this module, clusters containing similar objects are generated.
4.5.1 Clustering using Relevant Feature algorithm
In this approach the relative distance between the nodes of the prefix tree is found. This
serves as the basic criterion for finding the dissimilarity between the different features of
objects. Features having distance less than a threshold are put together in a feature cluster.
Once we get the feature clusters, the prefix tree is scanned again and objects having these
features are grouped in the same cluster.
4.5.2 Algorithm
Input: A prefix tree obtained from Module 2, the feature threshold tf and the object cluster
threshold toc.
Output: A set of K object clusters which maximizes the similarity between the individual
elements in each cluster.
Method:
1) For each branch in the prefix tree:
       Calculate the difference between the values of each pair of individual nodes of that branch;
       If difference < tf and none of them is present in a Feature Cluster FCi then
           Create a new Feature Cluster FC and put both the node values in FC;
       Else
           Put both the values in FCi;
2) For each pair of branches i and j in the prefix tree:
       Dissimilarity = 0;
       Repeat:
           If the values of the nodes at the same level are not present in the same FCi then
               Dissimilarity = Dissimilarity + (vi - vj)^2;
       Until a leaf node is reached;
       Dissimilarity = sqrt(Dissimilarity);
       If Dissimilarity < toc and none of them is present in an Object Cluster OCi then
           Create a new Object Cluster OC and put both the objects in OC;
       Else
           Put both the objects in OCi;
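The dissimilarity computation in step 2 can be illustrated with a short Java sketch (our own simplified version that operates on two branches given as arrays and a single feature cluster; levels whose values fall in the same feature cluster are skipped, as in the pseudocode above):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class BranchDissimilarity {

    // Dissimilarity between two branches of equal length; a level is skipped when both
    // values already belong to the same feature cluster (modelled here as a set of values).
    static double dissimilarity(float[] branchI, float[] branchJ, Set<Float> featureCluster) {
        double sum = 0;
        for (int level = 0; level < branchI.length; level++) {
            boolean sameCluster = featureCluster.contains(branchI[level])
                    && featureCluster.contains(branchJ[level]);
            if (!sameCluster) {
                double diff = branchI[level] - branchJ[level];
                sum += diff * diff;
            }
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        float[] a = {1f, 5f, 9f};
        float[] b = {2f, 5f, 3f};
        // Assume the values 1 and 2 were grouped into one feature cluster in step 1.
        Set<Float> featureCluster = new HashSet<Float>(Arrays.asList(1f, 2f));
        System.out.println(dissimilarity(a, b, featureCluster));   // sqrt(0 + 0 + 36) = 6.0
    }
}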
4.6 RFBC: Relevant Feature Based Clustering Algorithm
All the three modules discussed above are combined to give the final algorithm.
Input: A database containing n objects, the branching factor f, the feature threshold tf and the
object cluster threshold toc.
Output: A set of K clusters which maximizes the similarity between the individual elements in
each cluster.
Method: The RFBC algorithm is implemented as follows:
1) Pre-process the data using the data discretization technique;
2) Represent the pre-processed data obtained in the first step in the form of a prefix tree;
3) Apply the Relevant Feature algorithm;
4) Display the clusters Ci's;
4.7 Design Constraints
1. All the modules should execute with minimum space and time complexity.
2. The data reduction/compression should preserve the properties of the data.
3. The data reduction should be independent of the input data type.
4. The algorithm should not end up creating fuzzy clusters.
CHAPTER 5
PROJECT IMPLEMENTATION
In the project the three modules discussed in the proposed system design are
implemented. The focus of this implementation is on minimum space and time complexity.
The design and abstract view of the project served as the guideline during implementation.
5.1 Data Discretization – Module 1
This module converts the raw data into processed data. In order to implement the data
discretization module we created the following class:
- PreProc.java
5.1.1 Class: PreProc.java
The following sections discuss the constructor, data structures and functions used in
the class.
5.1.1.1 Constructor
PreProc <variable name> = new PreProc(File <variable name>); It accepts a parameter
of type File, which is the raw data file.
5.1.1.2 Data Structure
- float table[][]: Stores the raw data matrix and the final processed data matrix.
- int rows, int Count: Represent the number of rows and columns in the raw data. The
  number of rows represents the number of elements and the number of columns represents
  the number of attributes.
5.1.1.3 Functions
- void GetSize(): Calculates the size/width of the interval using Formula 4.1.
- void Process(): Calculates and assigns the final values to the attributes using Formula 4.2.
- void Save(): Saves the processed data matrix (table[][]) in a file proc.txt.
5.2 Prefix Tree – Module 2
In this module we represent the data obtained after the data discretization in Module 1 in the
form of a prefix tree. The following are the classes used for the implementation of this module:
- Node.java
- PrefixTree.java
5.2.1 Class: Node.java
The following sections discuss the constructor, data structures and functions used in
the class.
5.2.1.1 Constructor
Node <variable name> = new Node(Float <variable name>); It accepts a parameter of
type Float, which is the value of an attribute.
5.2.1.2 Data Structure
- Collection<Node> child: Stores all the child nodes of the parent.
- Float content: The value of the node.
5.2.1.3 Functions
- Float getValue(node): Returns the value of the node.
- Collection<Node> getChild(node): Returns all the child nodes of the current node.
5.2.2 Class: PrefixTree.java
The following sections discuss the constructor, data structures and functions used in
the class.
5.2.2.1 Constructor
PrefixTree <variable name> = new PrefixTree(Float[] <variable name>); It accepts a
parameter of type Float[], which is a row of the pre-processed data matrix.
5.2.2.2 Data Structure
- Node root: The root of the prefix tree.
5.2.2.3 Functions
- createPrefixTree(): Creates the root node of the prefix tree.
- void insert(float[] row, int numberOfElements): Inserts the row of the data matrix into
  the prefix tree.
5.3 Scanning Prefix Tree and Clustering – Module 3
In this module we generate the final clusters of similar objects. The following are the classes
used for its implementation:
- ObjectCluster.java
- Cluster.java
5.3.1 Class: Cluster.java
The following sections discuss the constructor, data structures and functions used in
the class.
5.3.1.1 Constructor
Cluster<Variable Name> = new Cluster(" ");
5.3.1.2 Data Structure
- int featureThreshold: If the distance (difference between the node values) between
  two features is less than this threshold, then they will be put in the same cluster.
- LinkedList featureClusters: Stores the feature clusters.
5.3.1.3 Functions
- LinkedList createFeatureCluster(File): Creates clusters of features and returns a linked
  list of feature clusters.
- boolean inRange(int, int): Returns true if the distance between the values is less than the
  threshold, else returns false.
5.3.2 Class: ObjectCluster.java
The following sections discuss the constructor, data structures and functions used in
the class.
5.3.2.1 Constructor
ObjectCluster<Variable Name> = new ObjectCluster(" ")
5.3.2.2 Data Structure
- HashMap<Integer, ArrayList> clusters: The final object clusters.
5.3.2.3 Functions
- void createObjectCluster(LinkedList): Accepts as input the linked list of feature
  clusters and generates object clusters.
- void dfsTree(Node): Performs a DFS of the tree, compares two objects based on their
  features, and assigns them to object clusters.
5.4 Implementation Setup
We simulated and ran our algorithm on a system with the following specifications:

Environmental Variable    Value
Processor                 Pentium(R) Dual-Core CPU 2.00 GHz
Operating System          32-bit Windows 7
RAM                       3.00 GB
Language                  Java 6
IDE                       NetBeans 6.9.1
JRE                       1.6

Table 5.1 Implementation Environment Specification
CHAPTER 6
RESULTS AND ANALYSIS
6.1 Results
In order to check the results of our algorithm and the validity of its output, we tested the
algorithm on various datasets and obtained successful results.
6.1.1 Sample Files
We have used three different data files (data courtesy: http://www.uni-koeln.de/themen/statistik/data/cluster/)
for obtaining the results. These files represent data on the nutrients in meat, fish and fowl.
The data sets contain the nutrients in different kinds of meat, fish or fowl. The first column
shows the serial number of the meat, fish or fowl item. The remaining columns show the
Food Energy (Calories), Protein (Grams), Fat (Grams), Calcium (Milligrams) and Iron
(Milligrams), in that order.
We have classified the three dataset files into:
- Small input (27 elements)
- Medium input (1500 elements)
- Large input (15000 elements)
Figure 6.1 shows the sample dataset file consisting of 27 elements. The first column shows
the serial number of the meat, fish or fowl item. The remaining columns show the Food Energy
(Calories), Protein (Grams), Fat (Grams), Calcium (Milligrams) and Iron (Milligrams), in that
order.
Figure 6.1 Small Sample Data
Figure 6.2 shows the sample dataset file consisting of 1500 elements. The first column shows
the serial number of the meat, fish or fowl item. The remaining columns show the Food Energy
(Calories), Protein (Grams), Fat (Grams), Calcium (Milligrams) and Iron (Milligrams), in that
order.
Figure 6.2 Medium Sample Data
Figure 6.3 shows the sample dataset file consisting of 15000 elements. The first column shows
the serial number of the meat, fish or fowl item. The remaining columns show the Food Energy
(Calories), Protein (Grams), Fat (Grams), Calcium (Milligrams) and Iron (Milligrams), in that
order.
Figure 6.3 Large Sample Data
6.1.2 Data Discretization Module
The output of this module for the different dataset files is as follows:
Figure 6.4 shows that after pre-processing of the small data file, the highlighted attribute
(fats) has discretized values.
Figure 6.4 Resulting Processed small datafile
Figure 6.5 shows that after pre-processing of the medium data file, the highlighted attribute
(fats) has discretized values.
Figure 6.5 Resulting Processed medium datafile
Figure 6.6 shows that after pre-processing of the large data file, the highlighted attribute (fats)
has discretized values.
Figure 6.6 Resulting Processed large datafile
6.1.3 Data Compression Module
This module shows the prefix tree which is formed from the 3 dataset files.
Figure 6.7 shows the prefix tree formed after the data compression module is applied to
the processed small dataset. Each i-th row of the figure shows the i-th level of the prefix tree. The
height of the tree is equal to the number of attributes; in this case the height of the tree is 5.
Figure 6.7 Prefix Tree – Small Data
Figure 6.8 shows the prefix tree formed after the data compression module is applied to
the processed medium dataset. Each i-th row of the figure shows the i-th level of the prefix tree.
The height of the tree is equal to the number of attributes; in this case the height of the tree is 5.
Figure 6.8 Prefix Tree – Medium Data
Figure 6.9 shows the prefix tree formed after the data compression module is applied to
the processed large dataset. Each i-th row of the figure shows the i-th level of the prefix tree. The
height of the tree is equal to the number of attributes; in this case the height of the tree is 5.
Figure 6.9 Prefix Tree – Large Data
6.1.4 Data Clustering Module
This module shows the final output of the algorithm for the 3 dataset files.
Figure 6.10 shows the final output of the algorithm for the small dataset. Each line
in the figure represents a cluster; in total 8 clusters are formed. Each cluster is a collection
of the serial numbers of all those elements which are similar. Numerical values in the output
indicate the serial number of the object in the raw data file.
Figure 6.10 Final output of the Algorithm for small dataset
Figure 6.11 shows the final output of the algorithm for the medium dataset. Each line
in the figure represents a cluster; in total 8 clusters are formed. Each cluster is a collection
of the serial numbers of all those elements which are similar. Numerical values in the output
indicate the serial number of the object in the raw data file.
Figure 6.11 Final output of the Algorithm for medium dataset
Figure 6.12 shows the final output of the algorithm for the large dataset. Each line
in the figure represents a cluster; in total 8 clusters are formed. Each cluster is a collection
of the serial numbers of all those elements which are similar. Numerical values in the output
indicate the serial number of the object in the raw data file.
Figure 6.12 Final output of the Algorithm for large dataset
6.2 Analysis of Algorithm
A formal analysis of the algorithm is done to theoretically obtain its time complexity.
Phase 1: Data Pre-processing Module (Step 1)
- During this phase the whole data set is standardized and normalized.
- Each element of the raw data is processed only once.
- Hence, the time complexity of this phase is T(n) = O(m x n), where m is the number of
  elements in the data set and n is the number of attributes/features of each object.
Phase 2: Conversion of processed data into a prefix tree (Step 2)
- Inserting one element with n attributes takes O(n) time.
- Hence the total time complexity of converting m elements, each having n
  attributes/features, is O(m x n).
- As we are creating a prefix tree, the total space utilized is O(k x n), where k is
  the branching factor given by the user.
Phase 3: Calculating the dissimilarity between objects and assigning them to clusters (Steps 3 onwards)
- The time complexity of Step 3 is O(k x n^2), where k is the branching factor and n is
  the number of attributes/features.
- The time complexity of Step 4 is O(k^2 x n), where k is the branching factor and n is
  the number of attributes/features.
Hence the total time complexity of the algorithm is T1 + T2 + T3, which is equal to
T(m) = O(k n^2 + k^2 n + m n), where m >>> k and m >>> n.
Hence, the total time complexity is T(M) = O(M x N), which is linear in terms of M (here M
is the number of elements and N the number of attributes).
6.3 Comparison of Results
We have coded the BIRCH algorithm, which also uses the concept of a tree for finding
clusters; moreover, it is a good example of an algorithm which generates clusters for large
datasets. We compared the results for the 3 types of sample input (as mentioned in section
6.1.1) and measured the performance of both algorithms on these inputs. We kept the same
threshold for both algorithms and obtained the clusters of each dataset.
Table 6.1 below shows the comparison of our algorithm with BIRCH for the 3 different
datasets (small, medium and large) at threshold = 10.
Algorithm    Time taken (small input)    Time taken (medium input)    Time taken (large input)
RFBC         0.69 sec                    1.05 sec                     4.55 sec
BIRCH        0.49 sec                    1.15 sec                     19.78 sec
Table 6.1 Comparison table between RFBC and BIRCH when Threshold = 10
Table 6.2 below shows the comparison of our algorithm with BIRCH for the 3 different
datasets (small, medium and large) at threshold = 20.
Algorithm    Time taken (small input)    Time taken (medium input)    Time taken (large input)
RFBC         0.71 sec                    0.99 sec                     3.41 sec
BIRCH        0.47 sec                    0.94 sec                     12.64 sec
Table 6.2 Comparison table between RFBC and BIRCH when Threshold = 20
Figure 6.13 below graphically shows the comparison of our algorithm with the BIRCH
algorithm and plots Table 6.1. The X-axis in the figure denotes the size of the data set and the
Y-axis shows the time taken by the algorithm in seconds. The threshold here is 10.
Figure 6.13 Comparison of RFBC and BIRCH at Threshold = 10. The X-axis represents the size of the
dataset and the Y-axis represents the time taken in seconds.
Figure 6.14 below graphically shows the comparison of our algorithm with the BIRCH
algorithm and plots Table 6.2. The X-axis in the figure denotes the size of the data set and the
Y-axis shows the time taken by the algorithm in seconds. The threshold here is 20.
Figure 6.14 Comparison of RFBC and BIRCH at Threshold = 20. The X-axis represents the size of the
dataset and the Y-axis represents the time taken in seconds.
CHAPTER 7
CONCLUSION AND FUTURE WORK
7.1 Conclusion
We successfully demonstrated the correctness of our algorithm and showed that it performs
better than the BIRCH algorithm. Our algorithm performs exceptionally well on large
datasets: in our experiments, clustering the large (15000 objects) dataset took nearly
one-fifth of the time taken by BIRCH.
7.2 Future Work
Although we achieved better time complexity, there is further scope for improvement.
The following are improvements which can be made to the current algorithm:
- The algorithm works fine for objects having numerical data values, but the
  dissimilarity function has to be changed whenever a boolean or interval-scaled variable
  appears. Hence the dissimilarity function can be made more generic.
- Currently, pre-processing of the data is done in a serial manner. Future work may include
  parallelization of the pre-processing step using GPU computing or threads.
References
[1] Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A Density-Based Algorithm for
Discovering Clusters in Large Spatial Databases with Noise. (pp. 226-231).
[2] Guha, S., Rastogi, R., & Shim, K. (1998). CURE: An Efficient Clustering Algorithm for
Large Databases. In SIGMOD '98: Proceedings of the 1998 ACM SIGMOD International
Conference on Management of Data. New York: ACM.
[3] Guha, S., Rastogi, R., & Shim, K. (July 2000). Rock: A robust clustering algorithm for
categorical attributes. Information Systems, 345-366.
[4] Raymond T. Ng and Jiawei Han (2002). CLARANS: A Method for Clustering Objects for
Spatial Data Mining. IEEE Trans. on Knowl. and Data Eng. 14, 5 (September 2002), 1003-1016.
DOI=10.1109/TKDE.2002.1033770.
[5] George Karypis, Eui-Hong Han, and Vipin Kumar, (1999) CHAMELEON: A
Hierarchical Clustering Algorithm Using Dynamic Modeling, IEEE Computer, Vol. 32, No.
8, August, 1999. pp. 68-75.
[6] Radhika M. Pai and V. S. Ananthanarayana (2008). Prefix-Suffix Trees: A Novel Scheme
for Compact Representation of Large Datasets. Springer.
[7] Zhang, T., Ramakrishnan, R., & Livny, M. (1997). BIRCH: A New Data Clustering
Algorithm and Its Applications. Data Mining and Knowledge Discovery, 1 (2), 141-182.
[8] MacQueen, J. (1967). Some Methods for Classification and Analysis of Multivariate
Observations (pp. 281-297). In: Le Cam, L.M., Neyman, J. (Eds.), Proc. 5th Berkeley Symp.
on Mathematical Statistics and Probability, vol. I. University of California Press.
[9] Kaufman, L., & Rousseeuw, P. J. (1987). Clustering by Means of Medoids.
[10] Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and Techniques.
Elsevier.
[11] Using Tries. http://www.topcoder.com/AlgorithmTutorial//tries.html