Cluster Analysis
Mark Stamp
Cluster Analysis
 Grouping objects in a meaningful way
o Clustered data fits together in some way
o Can help to make sense of (big) data
o Useful analysis technique in many fields
 Many different clustering strategies
 Overview, then details on 2 methods
o K-means  simple and can be effective
o EM clustering  not as simple
Intrinsic vs Extrinsic
 Intrinsic clustering relies on unsupervised learning
o No predetermined labels on objects
o Apply analysis directly to data
 Extrinsic relies on category labels
o Requires pre-processing of data
o Can be viewed as a form of supervised learning
Agglomerative vs Divisive
 Agglomerative
o Each object starts in its own cluster
o Clustering merges existing clusters
o A “bottom up” approach
 Divisive
o All objects start in one cluster
o Clustering process splits existing clusters
o A “top down” approach
Hierarchical vs Partitional
 Hierarchical clustering
o “Child” and “parent” clusters
o Can be viewed as dendrograms
 Partitional clustering
o Partition objects into disjoint clusters
o No hierarchical relationship
 We consider K-means and EM in detail
o These are both partitional
Hierarchical Clustering
 Example of a hierarchical approach... (a small code sketch follows below)
1. start: Every point is its own cluster
2. while number of clusters exceeds 1
   o Find 2 nearest clusters and merge
3. end while
 OK, but no real theoretical basis
o And some find that “disconcerting”
o Even K-means has some theory behind it
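The merge loop above can be sketched directly. A minimal illustration (not code from these slides), assuming 2-d numpy points and single-linkage distance between clusters:

import numpy as np

def agglomerate(points, target_clusters=1):
    # start: every point is its own cluster
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > target_clusters:
        # find the 2 nearest clusters (single linkage: closest pair of points)
        best = (0, 1, float("inf"))
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        # merge the two nearest clusters
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

data = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [9, 9]])
print(agglomerate(data, target_clusters=2))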
Dendrogram
 Example
 Obtained by hierarchical clustering
o Maybe…
Distance
 Distance between data points?
 Suppose x = (x1,x2,…,xn) and y = (y1,y2,…,yn), where each xi and yi are real numbers
 Euclidean distance is
d(x,y) = sqrt((x1-y1)^2 + (x2-y2)^2 + … + (xn-yn)^2)
 Manhattan (taxicab) distance is
d(x,y) = |x1-y1| + |x2-y2| + … + |xn-yn|
(A short code sketch of both distances follows.)
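As a quick illustration (not from the slides), both distances can be computed as follows, assuming points are equal-length numeric sequences:

import math

def euclidean(x, y):
    # d(x,y) = sqrt((x1-y1)^2 + ... + (xn-yn)^2)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # d(x,y) = |x1-y1| + ... + |xn-yn|
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

print(euclidean((0, 0), (3, 4)))   # 5.0
print(manhattan((0, 0), (3, 4)))   # 7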
Distance
 Euclidean distance  red line
 Manhattan distance  blue or yellow
o Or any similar right-angle only path
(Figure on original slide: two points a and b on a grid, comparing the straight-line path with right-angle paths)
Distance
 Lots and lots more distance measures
 Other examples include
o Mahalanobis distance  takes mean and covariance into account
o Simple substitution distance  measure of “decryption” distance
o Chi-squared distance  statistical
o Or just about anything you can think of…
One Clustering Approach
 Given data points x1,x2,x3,…,xm
 Want to partition into K clusters
o I.e., each point in exactly one cluster
 A centroid specified for each cluster
o Let c1,c2,…,cK denote current centroids
 Each xi associated with one centroid
o Let centroid(xi) be centroid for xi
o If cj = centroid(xi), then xi is in cluster j
Clustering
 Two crucial questions
1. How to determine centroids, cj?
2. How to determine clusters, that is, how to assign xi to centroids?
 But first, what makes a cluster “good”?
o For now, focus on one individual cluster
o Relationship between clusters later…
 What do you think?
Distortion
 Intuitively, “compact” clusters are good
o Depends on data and K, which are given
o And depends on centroids and assignment of xi to clusters, which we can control
 How to measure “goodness”?
 Define distortion = Σ d(xi,centroid(xi))
o Where d(x,y) is a distance measure
 Given K, let’s try to minimize distortion (a small computation sketch follows)
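A minimal sketch (not the slides' code) of the distortion measure, assuming Euclidean distance and numpy arrays; the points, centroids, and assignment below are made up for illustration:

import numpy as np

def distortion(points, centroids, assignment):
    # assignment[i] is the index of the centroid for point i
    return sum(np.linalg.norm(points[i] - centroids[assignment[i]])
               for i in range(len(points)))

pts = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
cents = np.array([[0.5, 0.0], [10.0, 10.0]])
print(distortion(pts, cents, [0, 0, 1]))   # 1.0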
Distortion
 Consider this 2-d data
o Choose K = 3 clusters
 Same data for both
o Which has smaller distortion?
 How to minimize distortion?
o Good question…
(Figure on original slide: two different 3-cluster groupings of the same 2-d data)
Distortion
 Note, distortion depends on K
o So, should probably write distortionK
 Problem we want to solve…
o Given: K and x1,x2,x3,…,xm
o Minimize: distortionK
 Best choice of K is a different issue
o Briefly considered later
 For now, assume K is given and fixed
How to Minimize Distortion?
 Given m data points and K
 Minimize distortion via exhaustive search?
o Try all “m choose K” different cases?
o Too much work for realistic size data set
 An approximate solution will have to do
o Exact solution is NP-complete problem
 Important Observation: For min distortion…
o Each xi grouped with nearest centroid
o Centroid must be center of its group
K-Means
 Previous slide implies that we can improve a suboptimal clustering by either…
1. Re-assign each xi to nearest centroid
2. Re-compute centroids so they’re centers
 No improvement from applying either 1 or 2 more than once in succession
 But alternating might be useful
o In fact, that is the K-means algorithm
K-Means Algorithm
 Given dataset…
1. Select a value for K (how?)
2. Select initial centroids (how?)
3. Group data by nearest centroid
4. Recompute centroids (cluster centers)
5. If “significant” change, then goto 3; else stop
(A code sketch of these steps follows.)
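A minimal K-means sketch (not the slides' code) following steps 1-5 above, assuming 2-d numpy data; initial centroids are chosen at random for step 2:

import numpy as np

def kmeans(points, K, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), K, replace=False)]
    for _ in range(max_iter):
        # step 3: group data by nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        # step 4: recompute centroids as cluster centers
        new_centroids = np.array([
            points[assignment == j].mean(axis=0) if np.any(assignment == j)
            else centroids[j]   # keep the old centroid if a cluster goes empty
            for j in range(K)])
        # step 5: stop when there is no significant change
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, assignment

data = np.vstack([np.random.default_rng(1).normal(c, 0.5, (20, 2))
                  for c in ([0, 0], [5, 5], [0, 5])])
centers, labels = kmeans(data, K=3)
print(centers)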
K-Means Animation
 Very good animation here
http://shabal.in/visuals/kmeans/2.html
 Nice animations of movement of centroids in different cases here
http://www.ccs.neu.edu/home/kenb/db/examples/059.html
(near bottom of web page)
 Other?
K-Means
 Are we assured of an optimal solution?
o Definitely not
 Why not?
o For one, initial centroid locations are critical
o There is a (sensitive) dependence on initial conditions
o This is a common issue in iterative processes (HMM training is an example)
K-Means Initialization
 Recall, K is the number of clusters
 How to choose K?
 No obvious “best” way to do so
 But K-means is fast
o So trial and error may be OK
o That is, experiment with different K
o Similar to choosing N in HMM
 Is there a better way to choose K?
Optimal K?
 Even for trial and error, we need a way to measure “goodness” of results
 Choosing optimal K is tricky
 Most intuitive measures will tend to improve for larger K
 But K “too big” may overfit data
 So, when is K “big enough”?
o But not too big…
Schwarz Criterion
 Choose K that minimizes
f(K) = distortionK + λdK log m
o Where d is the dimension, m is the number of data points, and λ is ???
 Recall that distortion depends on K
o Tends to decrease as K increases
o Essentially, adding a penalty as K increases
 Related to Bayes Information Criterion (BIC)
o And some other similar things
 Consider choice of K in more detail later…
(Figure on original slide: f(K) plotted against K)
K-Means Initialization
 How to choose initial centroids?
 Again, no best way to do this
o Counterexamples to any “best” approach
 Often just choose at random
 Or uniform/maximum spacing
o Or some variation on this idea
 Other?
K-Means Initialization
 In practice, we might do the following
 Try several different choices of K
o For each K, test several initial centroids
 Select the result that is best
o How to measure “best”?
o We look at that next
 May not be very scientific
o But often it’s good enough
K-Means Variations
 One variation is K-medoids
o Each centroid must be an actual data point
 Fuzzy K-means
o In K-means, any data point is in one cluster and not in any other
o In fuzzy case, a data point can be partly in several different clusters
o “Degree of membership” vs distance
 Many other variations…
Measuring Cluster Quality
 How can we judge clustering results?
o In general, that is, not just for K-means
 Compare to typical training/scoring…
o Suppose we test new scoring method
o E.g., score malware and benign files
o Compute ROC curves, AUC, etc.
o Many tools to measure success/accuracy
 Clustering is different (Why? How?)
Clustering Quality
 Clustering is a fishing expedition
o Not sure what we are looking for
o Hoping to find structure, “data discovery”
o If we know the answer, no point to clustering
 Might find something that’s not there
o Even random data can be clustered
 Some things to consider on next slides
o Relative to the data to be clustered
Cluster-ability?
 Clustering tendency
o How suitable is the dataset for clustering?
o Which dataset below is cluster-friendly?
o We can always apply clustering…
o …but expect better results in some cases
Validation
 External validation
o Compare clusters based on data labels
o Similar to usual training/scoring scenario
o Good idea if we know something about the data
 Internal validation
o Determine quality based only on clusters
o E.g., spacing between and within clusters
o Harder to do, but always applicable
It’s All Relative
 Comparing clustering results
o That is, compare one clustering result with others for the same dataset
o Can be very useful in practice
o Often, lots of trial and error
o Could enable us to “hill climb” to better clustering results…
o …but still need a way to quantify things
How Many Clusters?
 Optimal number of clusters?
o Already mentioned this wrt K-means
o But what about the general case?
o I.e., not dependent on cluster technique
o Can the data tell us how many clusters?
o Or the topology of the clusters?
 Next, we consider relevant measures
Internal Validation
 Direct measurement of clusters
o Might call it “topological” validation
 We’ll consider the following
o Cluster correlation
o Similarity matrix
o Sum of squares error
o Cohesion and separation
o Silhouette coefficient
Correlation Coefficient
 For X=(x1,x2,…,xn) and Y=(y1,y2,…,yn)
 Correlation coefficient rXY is
rXY = cov(X,Y)/(σXσY) = Σ(xi−μX)(yi−μY) / sqrt(Σ(xi−μX)^2 Σ(yi−μY)^2)
 Can show -1 ≤ rXY ≤ 1
o If rXY > 0 then positive correlation (and vice versa)
o Magnitude is strength of correlation
Examples of rXY in 2-d
Cluster Correlation
 Given data x1,x2,…,xm, and clusters, define 2 matrices
 Distance matrix D = {dij}
o Where dij is distance between xi and xj
 Adjacency matrix A = {aij}
o Where aij is 1 if xi and xj in same cluster
o And aij is 0 otherwise
 Now what?
Cluster Correlation
 Compute correlation between D and A
 High inverse correlation implies nearby things clustered together
o Why inverse?
(A small code sketch of this check follows.)
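A small sketch (not the slides' code) of this check, assuming numpy points and integer cluster labels; the example data is made up:

import numpy as np

def cluster_correlation(points, labels):
    m = len(points)
    D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    A = (labels[:, None] == labels[None, :]).astype(float)
    # correlate the off-diagonal entries of D and A
    iu = np.triu_indices(m, k=1)
    return np.corrcoef(D[iu], A[iu])[0, 1]

pts = np.array([[0, 0], [0, 1], [5, 5], [5, 6]], dtype=float)
labels = np.array([0, 0, 1, 1])
print(cluster_correlation(pts, labels))   # strongly negative: nearby points share a cluster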
Correlation
 Correlation examples (figures on original slide)
Similarity Matrix
 Form “similarity matrix”
o Could be based on just about anything
o Typically, distance matrix D = {dij}, where dij = d(xi,xj)
 Group rows and columns by cluster
 Heat map for resulting matrix
o Provides visual representation of similarity within clusters (so look at it…)
Similarity Matrix
 Same examples as above
 Corresponding heat maps on next slide
Heat Maps
(Figure on original slide: heat maps of the similarity matrices)
Residual Sum of Squares
 Residual Sum of Squares (RSS)
o Aka Sum of Squared Errors (SSE)
o RSS is the sum of squared “error” terms
o Definition of error depends on problem
 What is “error” when clustering?
o Distance from centroid?
o Then it’s the same as distortion
o But, could use other measures instead
Cohesion and Separation
 Cluster cohesion
o How “tightly packed” is a cluster
o The more cohesive a cluster, the better
 Cluster separation
o Distance between clusters
o The more separation, the better
 Can we measure these things?
o Yes, easily
Notation
 Same notation as K-means
o Let ci, for i=1,2,…,K, be centroids
o Let x1,x2,…,xm be data points
o Let centroid(xi) be centroid of xi
o Clusters determined by centroids
 Following results apply generally
o Not just for K-means
Cohesion
 Lots of measures of cohesion
o Previously defined distortion is useful
o Recall, distortion = Σ d(xi,centroid(xi))
 Or, could use distance between all pairs
Separation
 Again, many ways to measure this
o Here, using distances to other centroids
 Or distances between all points in clusters
 Or distance from centroids to a “midpoint”
 Or distance between centroids, or…
Silhouette Coefficient
 Essentially, combines cohesion and separation into a single number
 Let Ci be the cluster that xi belongs to
o Let a be average of d(xi,y) for all y in Ci
o For Cj ≠ Ci, let bj be avg d(xi,y) for y in Cj
o Let b be minimum of bj
 Then let S(xi) = (b – a) / max(a,b)
o What the … ?
(A code sketch follows below.)
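A small sketch (not the slides' code) of S(xi) = (b - a) / max(a, b) as defined above, assuming Euclidean distance and at least 2 points per cluster:

import numpy as np

def silhouette(points, labels, i):
    d = np.linalg.norm(points - points[i], axis=1)
    same = (labels == labels[i])
    a = d[same & (np.arange(len(points)) != i)].mean()   # avg distance within own cluster
    b = min(d[labels == c].mean() for c in set(labels) if c != labels[i])
    return (b - a) / max(a, b)

pts = np.array([[0, 0], [0, 1], [5, 5], [5, 6]], dtype=float)
labels = np.array([0, 0, 1, 1])
print([round(silhouette(pts, labels, i), 2) for i in range(4)])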
Silhouette Coefficient
 The idea...
(Figure on original slide: point xi in its cluster Ci, with a = avg distance to points in Ci, and b = min over the other clusters Cj, Ck of the avg distance to their points)
 Usually, S(xi) = 1 - a/b
Silhouette Coefficient
 For given point xi …
o Let a be avg distance to points in its cluster
o Let b be dist to nearest other cluster (in a sense)
 Usually, a < b and hence S(xi) = 1 – a/b
 If a is a lot less than b, then S(xi) ≈ 1
o Points inside cluster much closer together than nearest other cluster (this is good)
 If a is almost same as b, then S(xi) ≈ 0
o Some other cluster is almost as close as things inside cluster (this is bad)
Silhouette Coefficient
 Silhouette coefficient is defined for each point
 Avg silhouette coefficient for a cluster
o Measure of how “good” a cluster is
 Avg silhouette coefficient for all points
o Measure of overall clustering “goodness”
 Numerically, what is a good result?
o Rule of thumb on next slide
Silhouette Coefficient
 Average coefficient (to 2 decimal places)
o 0.71 to 1.00  strong structure found
o 0.51 to 0.70  reasonable structure found
o 0.26 to 0.50  weak or artificial structure
o 0.25 or less  no significant structure
 Bottom line on silhouette coefficient
o Combines cohesion and separation in one number
o A useful measure of cluster quality
External Validation
 “External” implies that we measure quality based on data in clusters
o Not relying on cluster topology (“shape”)
 Suppose clustering data is of several different types
o Say, different malware families
 We can compute statistics on clusters
o We only consider 2 stats here
Entropy and Purity
 Entropy
o Standard measure of uncertainty or randomness
o High entropy implies clusters less uniform
 Purity
o Another measure of uniformity
o Ideally, cluster should be more “pure”, that is, more uniform
Entropy
 Suppose total of m data elements
o As usual, x1,x2,…,xm
 Denote cluster j as Cj
o Let mj be number of elements in Cj
o Let mij be count of type i in cluster Cj
 Compute probabilities based on relative frequencies
o That is, pij = mij / mj
Entropy
 Then entropy of cluster Cj is
Ej = − Σ pij log pij, where sum is over i
 Compute entropy Ej for each cluster Cj
 Overall (weighted) entropy is then
E = Σ (mj/m) Ej, where sum is from 1 to K and K is number of clusters
 Smaller E is better
o Implies clusters less uncertain/random
Purity
 Ideally, each cluster is all one type
 Using same notation as in entropy…
o Purity of Cj defined as Uj = max pij
o Where max is over i (different types)
 If Uj is 1, then Cj all one type of data
o If Uj is near 0, no dominant type
 Overall (weighted) purity is
U = Σ (mj/m) Uj, where sum is from 1 to K
(A code sketch of E and U follows.)
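A small sketch (not the slides' code) of weighted entropy E and purity U as defined above, assuming a list of true types and a parallel list of cluster assignments:

import numpy as np
from collections import Counter

def entropy_and_purity(types, clusters):
    m = len(types)
    E, U = 0.0, 0.0
    for c in set(clusters):
        members = [t for t, cl in zip(types, clusters) if cl == c]
        mj = len(members)
        p = np.array(list(Counter(members).values())) / mj   # pij = mij / mj
        Ej = -(p * np.log2(p)).sum()    # entropy of cluster Cj
        Uj = p.max()                    # purity of cluster Cj
        E += (mj / m) * Ej
        U += (mj / m) * Uj
    return E, U

types    = ["A", "A", "A", "B", "B", "B"]
clusters = [ 0,   0,   1,   1,   1,   1 ]
print(entropy_and_purity(types, clusters))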
Entropy and Purity
 Examples (worked examples shown on original slide)
EM Clustering
 Data might be from different probability distributions
o Then “distance” might be poor measure
o Maybe better to use mean and variance
 Cluster on probability distributions?
o But distributions are unknown…
 Expectation Maximization (EM)
o Technique to determine unknown parameters of probability distributions
EM Clustering Animation
 Good animation on Wikipedia page
http://en.wikipedia.org/wiki/Expectation–maximization_algorithm
 Another animation here
http://www.cs.cmu.edu/~alad/em/
 Probably others too…
EM Clustering Example
 Old Faithful in Yellowstone NP
 Measure “wait” and duration
 Two clusters
o Centers are means
o Shading based on standard deviation
Maximum Likelihood Estimator
 Maximum Likelihood Estimator (MLE)
 Suppose you flip a coin and obtain X = HHHHTHHHTT
 What is most likely value of p = P(H)?
 Coin flips follow binomial distribution:
P(k heads in N flips) = C(N,k) p^k (1−p)^(N−k)
 Where p is prob of “success” (heads)
MLE
 Suppose X = HHHHTHHHTT (7 heads, 3 tails)
 Maximum likelihood function for X is
L(θ) = θ^7 (1−θ)^3
 And log likelihood function is
log L(θ) = 7 log θ + 3 log(1−θ)
 Optimize log likelihood function
o Setting d/dθ log L(θ) = 7/θ − 3/(1−θ) = 0 gives θ = 0.7
 In this case, MLE is θ = P(H) = 0.7
Coin Experiment
 Given 2 biased coins, A and B
o Randomly select coin, then…
o Flip selected coin 10 times, and…
o Repeat 5 times, for 50 total coin flips
 Can we determine P(H) for each coin?
 Easy, if you know which coin was selected
o For each coin, just divide number of heads by number of flips of that coin
Coin Example
 For example, suppose
Coin B: HTTTHHTHTH  5 H and 5 T
Coin A: HHHHTHHHHH  9 H and 1 T
Coin A: HTHHHHHTHH  8 H and 2 T
Coin B: HTHTTTHHTT  4 H and 6 T
Coin A: THHHTHHHTH  7 H and 3 T
 Then maximum likelihood estimate is
PA(H) = 24/30 = 0.80 and PB(H) = 9/20 = 0.45
Coin Example
 Suppose we have the same data, but we don’t know which coin was selected
Coin ??: 5 H and 5 T
Coin ??: 9 H and 1 T
Coin ??: 8 H and 2 T
Coin ??: 4 H and 6 T
Coin ??: 7 H and 3 T
 Can we estimate PA(H) and PB(H)?
Coin Example
 We do not know which coin was flipped
 So, there is “hidden” information
o This sounds familiar…
 Train HMM on sequence of H and T ??
o Using 2 hidden states
o Use resulting model to find most likely hidden state sequence (HMM “problem 2”)
o Use sequence to estimate PA(H) and PB(H)
Coin Example
 HMM is very “heavy artillery”
o And HMM needs lots of data to converge (or lots of different initializations)
o EM gives us info we need, less work/data
 EM algorithm: Initial guess for params
o Then alternate between these 2 steps:
o Expectation: Recompute “expected values”
o Maximization: Recompute params via MLEs
EM for Coin Example
 Start with a guess (initialization)
o Say, PA(H) = 0.6 and PB(H) = 0.5
 Compute expectations (E-step)
 First, from current PA(H) and PB(H) we find
5 H, 5 T  P(A) = .45, P(B) = .55
9 H, 1 T  P(A) = .80, P(B) = .20
8 H, 2 T  P(A) = .73, P(B) = .27
4 H, 6 T  P(A) = .35, P(B) = .65
7 H, 3 T  P(A) = .65, P(B) = .35
Why? See next slide…
EM for Coin Example
 Suppose PA(H) = 0.6 and PB(H) = 0.5
o And in 10 flips of 1 coin, we find 8 H and 2 T
 Assuming coin A was flipped, we have
a = 0.6^8 × 0.4^2 = 0.0026874
 Assuming coin B was flipped, we have
b = 0.5^8 × 0.5^2 = 0.0009766
 Then by Bayes’ Formula
P(A) = a/(a + b) = 0.73 and P(B) = b/(a + b) = 0.27
(A small code sketch of this computation follows.)
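A small sketch (not from the slides) of this Bayes computation: the posterior probability that coin A (rather than B) produced a given heads/tails count, with the binomial coefficients canceling in the ratio:

def posterior_A(heads, tails, pA=0.6, pB=0.5):
    a = pA ** heads * (1 - pA) ** tails   # likelihood under coin A
    b = pB ** heads * (1 - pB) ** tails   # likelihood under coin B
    return a / (a + b)

for h, t in [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]:
    print(h, t, round(posterior_A(h, t), 2))   # matches the P(A) column above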
E-step for Coin Example
Assuming PA(H) = 0.6 and PB(H) = 0.5, we have
5 H, 5 T  P(A) = .45, P(B) = .55
9 H, 1 T  P(A) = .80, P(B) = .20
8 H, 2 T  P(A) = .73, P(B) = .27
4 H, 6 T  P(A) = .35, P(B) = .65
7 H, 3 T  P(A) = .65, P(B) = .35
 Next, compute expected (weighted) H and T
 For example, in 1st line
o For A we have 5 x .45 = 2.25 H and 2.25 T
o For B we have 5 x .55 = 2.75 H and 2.75 T
E-step for Coin Example
Assuming PA(H) = 0.6 and PB(H) = 0.5, we have
5 H, 5 T  P(A) = .45, P(B) = .55
9 H, 1 T  P(A) = .80, P(B) = .20
8 H, 2 T  P(A) = .73, P(B) = .27
4 H, 6 T  P(A) = .35, P(B) = .65
7 H, 3 T  P(A) = .65, P(B) = .35
 Compute expected (weighted) H and T
 For example, in 2nd line
o For A, we have 9 x .8 = 7.2 H and 1 x .8 = .8 T
o For B, we have 9 x .2 = 1.8 H and 1 x .2 = .2 T
E-step for Coin Example
Rounded to nearest 0.1:
5 H, 5 T  P(A) = .45, P(B) = .55  Coin A: 2.2 H, 2.2 T  Coin B: 2.8 H, 2.8 T
9 H, 1 T  P(A) = .80, P(B) = .20  Coin A: 7.2 H, 0.8 T  Coin B: 1.8 H, 0.2 T
8 H, 2 T  P(A) = .73, P(B) = .27  Coin A: 5.9 H, 1.5 T  Coin B: 2.1 H, 0.5 T
4 H, 6 T  P(A) = .35, P(B) = .65  Coin A: 1.4 H, 2.1 T  Coin B: 2.6 H, 3.9 T
7 H, 3 T  P(A) = .65, P(B) = .35  Coin A: 4.5 H, 1.9 T  Coin B: 2.5 H, 1.1 T
Expected totals: Coin A: 21.2 H, 8.5 T  Coin B: 11.8 H, 8.5 T
 This completes the E-step
 We computed these expected values based on current PA(H) and PB(H)
M-step for Coin Example
 M-step
 Re-estimate PA(H) and PB(H) using results from E-step:
PA(H) = 21.2/(21.2+8.5) ≈ 0.71
PB(H) = 11.8/(11.8+8.5) ≈ 0.58
 Next? E-step with these PA(H), PB(H)
o Then M-step, then E-step, then…
o …until convergence (or we get tired)
(A sketch of the full EM loop follows.)
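A sketch (not the slides' code) of the full EM loop for the two-coin example, alternating the E-step (posteriors) and M-step (re-estimated P(H)) described above:

data = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]   # (heads, tails) per experiment

def em_coins(data, pA=0.6, pB=0.5, iterations=20):
    for _ in range(iterations):
        expA = [0.0, 0.0]   # expected heads, tails attributed to coin A
        expB = [0.0, 0.0]
        for h, t in data:
            a = pA ** h * (1 - pA) ** t
            b = pB ** h * (1 - pB) ** t
            wA = a / (a + b)              # E-step: P(A | this experiment)
            expA[0] += wA * h; expA[1] += wA * t
            expB[0] += (1 - wA) * h; expB[1] += (1 - wA) * t
        # M-step: new MLEs from the expected counts
        pA = expA[0] / (expA[0] + expA[1])
        pB = expB[0] / (expB[0] + expB[1])
    return pA, pB

print(em_coins(data))   # roughly (0.80, 0.52) for this data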
EM for Clustering
 How is EM relevant to clustering?
 Can use EM to obtain parameters of K “hidden” distributions
o That is, means and variances, μi and σi^2
 Then, use μi as centers of clusters
o And σi (standard deviations) as “radii”
o Often use Gaussian (normal) distributions
 Is this better than K-means?
EM vs K-Means
 Whether it is better or not, EM is obviously different than K-means…
o …or is it?
 Actually, K-means is special case of EM
o Using “distance” instead of “probabilities”
 In K-means, we re-assign points to centroids
o Like “E” in EM, which “re-shapes” clusters
 In K-means, we recompute centroids
o Like “M” of EM, where we recompute parameters
EM Algorithm
 Now, we give the EM algorithm in general
o Eventually, consider a realistic example
 For simplicity, we assume data is a mixture from 2 distributions
o Easy to generalize to 3 or more
 Choose probability distribution type
o Like choosing distance function in K-means
 Then iterate EM to determine params
EM Algorithm
 Assume 2 distributions (2 clusters)
o Let θ1 be parameter(s) of 1st distribution
o Let θ2 be parameter(s) of 2nd distribution
 For binomial, θ1 = PA(H), θ2 = PB(H)
 For Gaussian distribution, θi = (μi,σi^2)
 Also, mixture parameters, τ = (τ1,τ2)
o Fraction of samples from distribution i is τi
o Since 2 distributions, we have τ1 + τ2 = 1
EM Algorithm
 Let f(xi,θj) be the probability function for distribution j
o For now, assuming 2 distributions
o And xi are “experiments” (data points)
 Make initial guess for parameters
o That is, θ = (θ1, θ2) and τ
 Let pji be probability of xi assuming distribution j
EM Algorithm
 Initial guess for parameters θ and τ
o If you know something, use it
o If not, “random” may be OK
o In any case, choose reasonable values
 Next, apply E-step, then M-step…
o …then E then M then E then M ….
 So, what are E and M steps?
o Want to state these for the general case
E-Step
 Using Bayes’ Formula, we compute
pji = τj f(xi,θj) / (τ1 f(xi,θ1) + τ2 f(xi,θ2))
o Where j = 1,2 and i = 1,2,…,n
o Assuming n data points and 2 distributions
 Then pji is the prob. that xi is in cluster j
o Or the “part” of xi that’s in cluster j
 Note p1i + p2i = 1 for i=1,2,…,n
M-Step
 Use probabilities from E-step to re-estimate parameters θ and τ
 Best estimate for τj given by
τj = Σi pji / Σi (p1i + p2i)
 This simplifies to
τj = (1/n) Σi pji
M Step
 Params θ and τ are functions of μi and σi^2
o Depends on specific distributions used
 Based on pji, we have…
 Means: μj = Σi pji xi / Σi pji
 Variances: σj^2 = Σi pji (xi − μj)^2 / Σi pji
(A sketch instantiating these formulas follows.)
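A sketch (not the slides' code) instantiating the E and M formulas above for a mixture of two 1-d Gaussians; the data below is synthetic, for illustration only:

import numpy as np

def gaussian(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_1d(x, mu, var, tau, iterations=200):
    for _ in range(iterations):
        # E-step: p[j,i] = tau_j f(x_i, theta_j) / sum_k tau_k f(x_i, theta_k)
        f = np.array([tau[j] * gaussian(x, mu[j], var[j]) for j in range(2)])
        p = f / f.sum(axis=0)
        # M-step: re-estimate tau, means, and variances
        for j in range(2):
            w = p[j]
            tau[j] = w.mean()                                  # tau_j = (1/n) sum_i p_ji
            mu[j] = (w * x).sum() / w.sum()                    # weighted mean
            var[j] = (w * (x - mu[j]) ** 2).sum() / w.sum()    # weighted variance
    return mu, var, tau

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1.5, 300)])
print(em_1d(x, mu=[1.0, 4.0], var=[1.0, 1.0], tau=[0.5, 0.5]))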
EM Example: Binomial
 Mean of binomial is μ = Np
o Where p = P(H) and N trials per experiment
 Suppose N = 10, and 5 experiments:
x1: 8 H and 2 T
x2: 5 H and 5 T
x3: 9 H and 1 T
x4: 4 H and 6 T
x5: 7 H and 3 T
 Assuming 2 coins, determine PA(H), PB(H)
Binomial Example
 Let X=(x1,x2,x3,x4,x5)=(8,5,9,4,7)
o Have N = 10, and means are μ = Np
o We want to determine p for both coins
 Initial guesses for parameters:
o Probabilities PA(H)=0.6 and PB(H)=0.5
o That is, θ = (θ1,θ2) = (0.6,0.5)
o Mixture params τ1 = 0.7 and τ2 = 0.3
o That is, τ = (τ1,τ2) = (0.7,0.3)
Binomial Example: E Step
 Compute pji using current guesses for parameters θ and τ
o (Computed values shown on original slide)
Binomial Example: M Step
 Recompute τ using the new pji
 So, τ = (τ1,τ2) = (0.7593,0.2407)
Binomial Example: M Step
 Recompute θ = (θ1,θ2)
 First, compute means μ = (μ1,μ2)
 Obtain μ = (μ1,μ2) = (6.9180,5.5969)
 So, θ = (θ1,θ2) = (0.6918,0.5597)
Binomial Example: Convergence
 Next E-step:
o Compute new pji using the τ and θ
 Next M-step:
o Compute τ and θ using pji from E-step
 And so on…
 In this example, EM converges to
τ = (τ1,τ2) = (0.5228, 0.4772)
θ = (θ1,θ2) = (0.7934, 0.5139)
Gaussian Mixture Example
 We’ll consider 2-d data
o That is, each data point is of form (x,y)
 Suppose we want 2 clusters
o That is, 2 distributions to determine
 We’ll assume Gaussian distributions
o Recall, Gaussian is normal distribution
o Since 2-d data, bivariate Gaussian
 Gaussian mixture problem
Data
 “Old Faithful” geyser, Yellowstone NP (data plot on original slide)
Bivariate Gaussian
 Gaussian (normal) dist. of 1 variable
f(x) = (1/(σ sqrt(2π))) exp(−(x−μ)^2/(2σ^2))
 Bivariate Gaussian distribution
f(x,y) = (1/(2π σx σy sqrt(1−ρ^2))) exp(−z/(2(1−ρ^2)))
o Where z = (x−μx)^2/σx^2 − 2ρ(x−μx)(y−μy)/(σxσy) + (y−μy)^2/σy^2
o And ρ = cov(x,y)/σxσy
Bivariate Gaussian: Matrix Form
 Bivariate Gaussian can be written as
f(x) = (1/(2π sqrt(det(S)))) exp(−(1/2)(x−μ)^T S^−1 (x−μ))
 Where x = (x,y) and μ = (μx,μy)
 And S is the covariance matrix
S = [ s11  s12 ; s12  s22 ], where s11 = σx^2, s22 = σy^2, s12 = cov(x,y)
 Also det(S) = s11s22 − s12^2 and S^−1 = (1/det(S)) [ s22  −s12 ; −s12  s11 ]
Why Use Matrix Form?
 Generalizes to multivariate Gaussian
o Formulas for det(S) and S^−1 change
o Can cluster data that’s more than 2-d
 Re-estimation formulas in E and M steps have the same form as before
o Simply replace scalars with vectors
 In matrix form, params of (bivariate) Gaussian: θ=(μ,S) where μ=(μx,μy)
Old Faithful Data
 Data is 2-d, so bivariate Gaussians
 Parameters of a bivariate Gaussian:
θ=(μ,S) where μ=(μx,μy)
 We want 2 clusters, so must determine
θ1=(μ1,S1) and θ2=(μ2,S2)
o Where Si are 2x2 and μi are 2x1
 Make initial guesses, then iterate EM (a code sketch follows below)
o What to use as initial guesses?
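A sketch (not the slides' code) of EM for a 2-component bivariate Gaussian mixture in matrix form; it is run on synthetic 2-d data here, since the actual Old Faithful measurements are not included in these notes:

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm2(X, mu, S, tau, iterations=100):
    for _ in range(iterations):
        # E-step: p[j,i] = tau_j f(x_i; mu_j, S_j) / sum_k tau_k f(x_i; mu_k, S_k)
        f = np.array([tau[j] * multivariate_normal.pdf(X, mu[j], S[j]) for j in range(2)])
        p = f / f.sum(axis=0)
        # M-step: re-estimate tau, the mean vectors, and the covariance matrices
        for j in range(2):
            w = p[j]
            tau[j] = w.mean()
            mu[j] = (w[:, None] * X).sum(axis=0) / w.sum()
            diff = X - mu[j]
            S[j] = (w[:, None, None] * np.einsum("ni,nj->nij", diff, diff)).sum(axis=0) / w.sum()
    return mu, S, tau

rng = np.random.default_rng(0)
X = np.vstack([rng.multivariate_normal([2.0, 55.0], [[0.1, 0.5], [0.5, 30.0]], 100),
               rng.multivariate_normal([4.5, 80.0], [[0.2, 0.8], [0.8, 35.0]], 100)])
mu0 = [X.mean(axis=0) - 1, X.mean(axis=0) + 1]
S0 = [np.cov(X.T), np.cov(X.T)]
print(em_gmm2(X, mu0, S0, [0.6, 0.4]))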
Old Faithful Example
 Initial guesses?
 Recall that S is 2x2 and symmetric
o Which implies that ρ = s12/sqrt(s11s22)
o And must always have -1 ≤ ρ ≤ 1
 This imposes restriction on S
 We’ll use mean and variance of x and y components when initializing
Old Faithful: Initialization
 Want 2 clusters
 Initialize 2 bivariate Gaussians
o (Initial values of μi and Si given on original slide)
 For both Si, can verify that ρ = 0.5
 Also, initialize τ = (τ1,τ2) = (0.6, 0.4)
Old Faithful: E-Step
 First E-step yields
o (Values shown on original slide)
Old Faithful: M-Step
 First M-step yields
τ = (τ1,τ2) = (0.5704, 0.4296)
 And the re-estimated μi and Si (shown on original slide)
 Easy to verify that -1 ≤ ρ ≤ 1
o For both distributions
Old Faithful: Convergence
 After 100 iterations of EM
τ = (τ1,τ2) = (0.5001, 0.4999)
 With the converged μi and Si (shown on original slide)
 Again, we have -1 ≤ ρ ≤ 1
Old Faithful Clusters
 Centroids are means: μ1, μ2
 Shading is standard devs
o Darker area is within one σ
o Lighter is two standard dev’s
EM Clustering
 Clusters based on probabilities
o Each data point has a probability related to each cluster
o Point is assigned to the cluster that gives highest probability
 K-means uses distance, not probability
 But can view probability as a “distance”
 So, K-means and EM not so different…
Conclusion
 Clustering is fun, entertaining, very useful
o Can explore mysterious data, and more…
 And K-means is really simple
o EM is powerful and not that difficult either
 Measuring success is not so easy
o “Good” clusters? And useful information?
o Or just random noise? Anything can be clustered
 Clustering is often just a starting point
o Helps us decide if any “there” is there
References: K-Means
 A.W. Moore, K-means and hierarchical clustering
 P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison-Wesley, 2006, Chapter 8, Cluster analysis: Basic concepts and algorithms
 R. Jin, Cluster validation
 M.J. Norusis, IBM SPSS Statistics 19 Statistical Procedures Companion, Chapter 17, Cluster analysis
References: EM Clustering
 C.B. Do and S. Batzoglou, What is the expectation maximization algorithm?, Nature Biotechnology, 26(8):897-899, 2008
 J.A. Bilmes, A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models, ICSI Report TR-97-021, 1998