MIS 451
Building Business Intelligence Systems
Clustering (2)
Problem

Target Marketing
Diapers, baby food, toys
Swiss cheese and Belgian chocolate
French wine
Clustering


Clustering is a data mining method for grouping
objects such that objects within the same cluster are
similar and objects in different clusters are dissimilar.
Why clustering


SQL-based OLAP is not suitable for clustering objects whose attributes have a large number of possible values
SQL-based OLAP is not suitable for clustering objects with a large number of attributes
Clustering

Steps in clustering objects

Compute similarity between objects

Clustering based on similarity between objects
Similarity



An object (e.g., a customer) has a list of variables
(e.g., attributes of a customer such as age, spending,
gender etc.)
When measuring similarity between objects we
measure similarity between variables of objects.
Instead of measuring similarity between variables,
we use distance to measure dissimilarity between
variables.
Dissimilarity

Continuous variable

Manhattan distance

Euclidean distance
Dissimilarity

For two objects X and Y with continuous
variables 1,2,…n, Manhattan distance is
defined as:
$$d(X, Y) = |x_1 - y_1| + |x_2 - y_2| + \dots + |x_n - y_n|$$
where x1 ... xn are values of variables of object X
and y1 ... yn are values of variables of object Y
Dissimilarity

Example of Manhattan distance
NAME    AGE    SPENDING($)
Sue     21     2300
Carl    27     2600
Tom     45     5400
Jack    52     6000
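
The Manhattan distances for the table above can be checked with a few lines of Python. This is a minimal sketch: the `customers` dictionary and the `manhattan` helper are names chosen here for illustration and are not part of the original slides.

```python
from itertools import combinations

# (age, spending) for each customer, copied from the example table
customers = {
    "Sue": (21, 2300),
    "Carl": (27, 2600),
    "Tom": (45, 5400),
    "Jack": (52, 6000),
}

def manhattan(x, y):
    """Sum of absolute differences over all variables."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Print the distance for every pair of customers
for a, b in combinations(customers, 2):
    print(a, b, manhattan(customers[a], customers[b]))

# Sue-Carl: |21 - 27| + |2300 - 2600| = 6 + 300 = 306
# Tom-Jack: |45 - 52| + |5400 - 6000| = 7 + 600 = 607
```

Sue and Carl are close to each other, as are Tom and Jack, while pairs across the two groups are thousands apart.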
Dissimilarity

For two objects X and Y with continuous
variables 1,2,…n, Euclidean distance is
defined as:
$$d(X, Y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2}$$
where x1 ... xn are values of variables of object X
and y1 ... yn are values of variables of object Y
Dissimilarity

Example of Euclidean distance
NAME    AGE    SPENDING($)
Sue     21     23200
Carl    27     23330
Tom     45     23260
Jack    52     23400
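
A short Python sketch of the Euclidean distance on this table follows. The `customers` dictionary and `euclidean` helper are illustrative names, not part of the original slides.

```python
import math

# (age, spending) for each customer, copied from the example table
customers = {
    "Sue": (21, 23200),
    "Carl": (27, 23330),
    "Tom": (45, 23260),
    "Jack": (52, 23400),
}

def euclidean(x, y):
    """Square root of the sum of squared differences over all variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Sue-Carl: sqrt(6^2 + 130^2) = sqrt(16936)
print(euclidean(customers["Sue"], customers["Carl"]))
# Sue-Tom: sqrt(24^2 + 60^2) = sqrt(4176)
print(euclidean(customers["Sue"], customers["Tom"]))
```

By this measure Sue is closer to Tom than to Carl, even though Tom is 24 years older and Carl only 6, because differences in spending are numerically much larger than differences in age. This is one reason to put variables on a comparable scale before computing distances.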
Dissimilarity

Standardize the values of a variable

Calculate the mean value
Calculate the mean absolute deviation
Standardize the values of the variable using the formula:
new value = (old value - mean value) / mean absolute deviation
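
The three steps can be sketched in Python as below. This assumes the divisor is the mean absolute deviation computed in the second step. The `standardize` function and the use of the AGE column as input are illustrative choices, not part of the original slides.

```python
def standardize(values):
    """Standardize a variable using its mean and mean absolute deviation."""
    mean = sum(values) / len(values)                       # step 1: mean value
    mad = sum(abs(v - mean) for v in values) / len(values)  # step 2: mean absolute deviation
    return [(v - mean) / mad for v in values]               # step 3: apply the formula

ages = [21, 27, 45, 52]  # AGE column from the earlier examples
standardized_ages = standardize(ages)
print(standardized_ages)

# mean = 145 / 4 = 36.25
# mean absolute deviation = (15.25 + 9.25 + 8.75 + 15.75) / 4 = 12.25
# Sue's age becomes (21 - 36.25) / 12.25, roughly -1.245
```

After standardization every variable has mean 0 and mean absolute deviation 1, so no single variable dominates the distance simply because of its units.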
Dissimilarity

Binary variable
distance = number of mismatched variables / total number of variables
NAME    Married(Y/N)    Gender    Internet connection at home
Sue     Y               M         Y
Carl    Y               F         Y
Tom     N               M         N
Jack    N               F         N
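
A minimal Python sketch for the binary case follows. It assumes the distance is the fraction of variables on which two objects disagree, so that identical objects have distance 0. The `people` dictionary and `binary_distance` helper are illustrative names, not part of the original slides.

```python
# (married, gender, internet at home) for each person, copied from the table
people = {
    "Sue": ("Y", "M", "Y"),
    "Carl": ("Y", "F", "Y"),
    "Tom": ("N", "M", "N"),
    "Jack": ("N", "F", "N"),
}

def binary_distance(x, y):
    """Fraction of variables whose values differ between the two objects."""
    mismatches = sum(a != b for a, b in zip(x, y))
    return mismatches / len(x)

print(binary_distance(people["Sue"], people["Carl"]))  # differ on gender only: 1/3
print(binary_distance(people["Sue"], people["Tom"]))   # differ on married and internet: 2/3
print(binary_distance(people["Sue"], people["Jack"]))  # differ on all three: 1.0
```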
Clustering based on dissimilarity

After calculating dissimilarity between
objects, a dissimilarity matrix can be created
with objects as indexes and dissimilarities
between objects as elements.
Clustering based on dissimilarity
        Sue    Tom    Carl    Jack    Mary
Sue     0      6      8       2       7
Tom     6      0      1       5       3
Carl    8      1      0       10      9
Jack    2      5      10      0       4
Mary    7      3      9       4       0
Clustering based on dissimilarity
Step 1: Initially, place each object in its own cluster
Step 2: Calculate the dissimilarity between clusters
Dissimilarity between two clusters is the minimum dissimilarity between two objects, one from each cluster
Step 3: Merge the two clusters with the least dissimilarity
Step 4: Repeat steps 2-3 until all objects are in one cluster
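
The steps above can be sketched in Python using the dissimilarity matrix for Sue, Tom, Carl, Jack and Mary. Taking the minimum dissimilarity between clusters is commonly called single linkage. The function and variable names below are illustrative, and the code is a simple sketch rather than an efficient implementation.

```python
from itertools import combinations

names = ["Sue", "Tom", "Carl", "Jack", "Mary"]
# Dissimilarity matrix from the earlier slide, rows and columns in the order of `names`
matrix = [
    [0, 6, 8, 2, 7],
    [6, 0, 1, 5, 3],
    [8, 1, 0, 10, 9],
    [2, 5, 10, 0, 4],
    [7, 3, 9, 4, 0],
]
index = {name: i for i, name in enumerate(names)}

def cluster_distance(a, b):
    """Step 2: minimum dissimilarity between one object from each cluster."""
    return min(matrix[index[x]][index[y]] for x in a for y in b)

def single_linkage(objects):
    clusters = [(name,) for name in objects]  # step 1: one cluster per object
    merges = []
    while len(clusters) > 1:                  # step 4: repeat until one cluster remains
        # step 3: find and merge the two least dissimilar clusters
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        merges.append((a, b, cluster_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

merges = single_linkage(names)
for a, b, d in merges:
    print(a, "+", b, "at dissimilarity", d)
```

On this matrix the merges happen at dissimilarities 1, 2, 3 and 4: Tom with Carl, then Sue with Jack, then Mary with {Tom, Carl}, and finally the two remaining clusters.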