Mining Graph Data
Marina Meila
University of Washington
Department of Statistics
www.stat.washington.edu
Graph Data—An example
edge weight Sij = number of reports co-authored by i, j
(Dept. of Statistics technical reports)
Examples of graph data
Social networks
• friendships, work relationships
• AIDS epidemiology
• transactions between economic agents
• internet communities (e.g usenet, chat rooms)
Document databases, the web
• Citations, hyperlinks (not symmetric)
Computer networks
Image segmentation
• Data points are pixels
• features are distance, contour, color, texture
• Natural images, medical images, satellite images, etc
Protein-protein interactions, similarities
Linguistics
Vector data can be transformed into pairwise data
by nearest neighbor graphs
by “kernelization” (as in SVMs)
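A minimal sketch of these two constructions, assuming NumPy is available; the function names and the Gaussian (RBF) choice of kernel are illustrative, not from the slides:

```python
import numpy as np

def gaussian_similarity(X, sigma=1.0):
    """Dense similarity matrix S_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(S, 0.0)                 # no self-loops
    return S

def knn_graph(S, k=5):
    """Sparsify: keep the k largest similarities per node, then symmetrize."""
    W = np.zeros_like(S)
    idx = np.argsort(S, axis=1)[:, -k:]      # indices of the k nearest neighbours
    rows = np.arange(S.shape[0])[:, None]
    W[rows, idx] = S[rows, idx]
    return np.maximum(W, W.T)                # Sij = Sji >= 0
```

Either way, the result is a symmetric similarity matrix of the kind used in the rest of the talk.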
Graph data and the similarity matrix
Graph data can be
Symmetric similarities between nodes, Sij = Sji ≥ 0  (the focus for most of this talk)
• e.g. number of papers co-authored
Asymmetric affinities Aij ≥ 0
• e.g. number of links from site i to site j
Node attributes
• e.g. age, university
[Symmetric dis-similarities]
Overview
Graph data
The problem
what does it mean to do classification or clustering on a
graph?
three approaches to grouping
Clustering
Semisupervised learning
Kernels on graphs
Other and future directions
The main difference
In standard tasks, data are independent vectors
x = (age, number publications, ...)
Training set { x1, x2, ... xn} = a set of persons sampled
independently from the population
In graph mining tasks, the data are the (weighted) links
between graph nodes
S(x, x’) = number of papers co-authored by x, x’
“Training set” = the whole co-authorship network
The problem
Standard data mining tasks
data = independent vectors (x1,...,xn) in R^d
[labels (y1,...,yn) in {-1,1}]
Classification
supervised learning
Semisupervised learning
Clustering
unsupervised learning
Graph mining tasks
data = graph on n nodes
node similarities Sij
[labels (y1,...,yn) in {-1,1}]
Classification
supervised learning
Semisupervised learning
Clustering and embedding
unsupervised learning
Clustering
[Figure: the same graph partitioned into 3 clusters vs. 2 clusters]
Embedding
Semisupervised learning
(transductive classification)
Three paradigms for grouping nodes in graphs
Both clustering and classification can be seen as grouping
Graph cuts
remove some edges ⇒ disconnected graph
the groups are the connected components
By “similar behavior”
nodes i, j in the same group iff i,j “have the same pattern
of connections” w.r.t other nodes
By Embedding
map nodes {1,2,...,n} → {x1, x2, ..., xn} in R^d
then use standard classification and clustering methods
1. Graph cuts
Definitions
node degree (or volume): Di = Σj Sij
volume of cluster C: Vol(C) = Σ_{i in C} Di
cut between clusters C, C': Cut(C, C') = Σ_{i in C, j in C'} Sij
MinCut vs. Multiway Normalized Cut (MNCut)
MinCut: minimize Cut(C, C') over all partitions C, C'
polynomial
BUT: the resulting partition can be imbalanced
MNCut: MNCut(C) = Σ_{k=1..K} Cut(Ck, V \ Ck) / Vol(Ck)
For K = 2: MNCut(C) = Cut(C, C') ( 1/Vol(C) + 1/Vol(C') )
Motivation for MNCut
MNCut is smallest for the “best” clustering in many situations
Sij ∝ 1/dist(i, j)
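As a concrete reading of the definitions above, here is a small NumPy sketch that evaluates MNCut for a given partition; the function name and the encoding of clusters as an integer label vector are illustrative assumptions:

```python
import numpy as np

def mncut(S, labels):
    """Multiway Normalized Cut of the partition given by `labels`.

    MNCut = sum_k Cut(C_k, V \\ C_k) / Vol(C_k), with
    D_i = sum_j S_ij, Vol(C) = sum_{i in C} D_i,
    Cut(C, C') = sum_{i in C, j in C'} S_ij.
    """
    degrees = S.sum(axis=1)
    total = 0.0
    for k in np.unique(labels):
        in_k = labels == k
        vol = degrees[in_k].sum()
        cut = S[np.ix_(in_k, ~in_k)].sum()
        total += cut / vol
    return total
```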
2. “Patterns of behavior” : The random walks view
[Figure: a node i with neighbors j, k, l; edge weights Sij, Sik, Sil and transition probabilities Pij, Pik, Pil]
volume (degree) of node i: Di = Σj Sij
transition probability: Pij = Sij / Di
matrix notation: D = diag( D1, D2, ..., Dn )  ⇒  P = D^{-1} S
Idea:
nodes i, j are grouped together iff they transition in the same way to other clusters
[Figure: example graph with two clusters (red and yellow); each node i is annotated with its aggregated transition probabilities (Pi,red, Pi,yellow), which equal (1/5, 4/5) for every node of one cluster and (2/3, 1/3) for every node of the other]
3. Embedding
mapping from nodes to R^d, given by d vectors [f^(1) f^(2) ... f^(d)], each with n elements (one per node)
where fi^(k) represents the k-th coordinate of node i
wanted
nodes that are similar mapped near each other
ideally: all nodes in a group map to the same point
Another look at Pi,C
[Figure: the same two-cluster example; the aggregated transition probabilities Pi,red define a piecewise constant function f_red over the nodes, taking the value 2/3 on one cluster and 1/5 on the other (f_yellow takes the values 1/3 and 4/5 respectively)]
not all graphs produce perfect embeddings by this method
need to know the groups to obtain the embedding
Three approaches to grouping, summarized
1. Minimize MNCut (how?)
2. Random walks: group by similarity of the “aggregated” transitions Pi,C
3. Embedding (how?)
Will show that
1. 1-2-3 are equivalent
2. a spectral algorithm to solve the problem
Overview
Graph data
The problem
Clustering
Random walks: a spectral clustering algorithm
spectral clustering as optimization
a stability result
Semisupervised learning
Kernels on graphs
Other and future directions
Theorem 1. Lumpability
Let S = n x n similarity matrix
C = {C1, C2, ... CK} a clustering
Then
the transition probabilities Pi,C are piecewise constant
iff the transition matrix P = D^{-1}S has K piecewise constant
eigenvectors
Why is this important?
suggests algorithm to find the grouping C
• spectral algorithm
grouping by the similarity of connections is a form of
embedding
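A quick numerical illustration of the lumpability statement, under the assumption of an idealized block-structured similarity matrix (the block sizes and weights below are made up for the example): the first K = 2 eigenvectors of P = D^{-1}S come out piecewise constant on the two blocks.

```python
import numpy as np

# Idealized two-cluster similarity matrix: within-cluster weight 1, between 0.01.
n1, n2 = 5, 7
S = np.full((n1 + n2, n1 + n2), 0.01)
S[:n1, :n1] = 1.0
S[n1:, n1:] = 1.0
np.fill_diagonal(S, 0.0)

D = S.sum(axis=1)
P = S / D[:, None]                       # transition matrix P = D^{-1} S

eigvals, eigvecs = np.linalg.eig(P)
order = np.argsort(-eigvals.real)
top2 = eigvecs[:, order[:2]].real        # eigenvectors of the 2 largest eigenvalues

# Each column is (numerically) constant within each block of nodes.
print(np.round(top2, 4))
```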
A spectral clustering algorithm
Algorithm SC
(Meila & Shi, 01)
(there are many
other variants)
INPUT: number of clusters K
symmetric similarity matrix S
1. Compute transition matrix P
2. Compute the K largest eigenvalues of P and their eigenvectors:
λ1 ≥ λ2 ≥ ... ≥ λK, with eigenvectors v1, v2, ..., vK
3. Spectral mapping: map nodes to R^K by
node i ↦ xi = ( v1i v2i ... vKi )
4. Cluster the data {xi} in R^K by e.g. min diameter or k-means
OUTPUT : clustering C
(Dasgupta & Schulman 02)
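A compact sketch of Algorithm SC in NumPy / scikit-learn terms; k-means is used for step 4, and the function name is illustrative (the slides also allow e.g. min-diameter clustering):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering_sc(S, K, seed=0):
    """Algorithm SC: spectral mapping via P = D^{-1} S, then clustering in R^K."""
    D = S.sum(axis=1)
    P = S / D[:, None]                         # 1. transition matrix
    eigvals, eigvecs = np.linalg.eig(P)        # 2. P is not symmetric, so take real parts
    order = np.argsort(-eigvals.real)[:K]      #    K largest eigenvalues lambda_1..lambda_K
    X = eigvecs[:, order].real                 # 3. spectral mapping: x_i = (v_1i, ..., v_Ki)
    return KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X)   # 4.
```

In practice one often computes the eigenvectors of the symmetric matrix D^{-1/2} S D^{-1/2} instead, which is numerically better behaved and related to the eigenvectors of P by a diagonal rescaling.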
Spectral clustering in a nutshell
weighted graph (n vertices to cluster; the observations are the pairwise similarities)
→ similarity matrix S (n x n, symmetric, Sij ≥ 0)
→ [normalize rows] transition matrix P
→ [spectral mapping] first K eigenvectors of P
→ [clustering in R^K] K clusters
Theorem 2. Multicut
Let S = n x n similarity matrix
L = I − D^{-1/2} S D^{-1/2} and P = D^{-1}S
C = {C1, C2, ..., CK} a clustering, Y its indicator matrix
Then
MNCut(C) = K − Σ_k (Yk^T S Yk)/(Yk^T D Yk)  ≥  K − (λ1 + λ2 + ... + λK)
with equality iff the eigenvectors (v1 v2 ... vK) of P are piecewise constant
Why is this important?
MNCut has quadratic expression (used later)
non-trivial lower bound for MNCut (used later)
for (nearly) perfect P the Spectral Clustering Algorithm
minimizes MNCut
Hence the SC algorithm can be viewed in three different ways
Theorem 3. Stability
The eigengap of P
measures the stability of the K-th principal subspace w.r.t
perturbations of P
Definition: the eigengap of P is δ = λK − λK+1
Theorem
Let C, C' be two clusterings, each with MNCut close to the lower bound K − (λ1 + ... + λK)
Then the distance between C and C' is small, with a bound that depends on the eigengap δ
Significance
If a stability theorem holds
any two “good” clusterings are close
in particular, no “good” clustering can be too far from the
optimal C*
Gap Corollary If the eigengap δ is large and MNCut(C) is close to the lower bound,
then C is close to the optimal clustering C*
Is the bound ever informative?
An experiment: S perfect + additive noise
Overview
Graph data
The problem
Clustering
Semisupervised learning
Kernels on graphs
Other and future directions
Semisupervised grouping
Data
(i1,y1), (i2,y2), ..., (il,yl) = l labeled nodes
i(l+1), ..., i(l+u) = u unlabeled nodes
l + u = n
Assumed that groups (classes)
agree with graph structure
[Figure: predicted labels ignoring the unlabeled data vs. using the unlabeled data]
MNCut as smoothness
Let f ∈ R^n = the labeling, fi = class( node i )
In R^d, the smoothness functional is ∫ ||∇f||^2 dP
On the graph: grad f ↦ (fi − fj), P = the discrete measure on the nodes, so
smoothness(f) = ½ Σ_ij Sij (fi − fj)^2 = f^T L f   (with the Laplacian L defined next)
The Laplace operator(s) on a graph
Unnormalized Laplacian
L = D − S
intuitive
Normalized Laplacian
L = I − D^{-1/2} S D^{-1/2}
scale invariant
compact operator
• better convergence properties
Graph regularized Least Squares
Belkin & Niyogi ’05
For simplicity assume K = 2
Criterion: minimize smoothness + labeling error
min_f  Σ_{i labeled} ( fi − yi )^2 + γ f^T L f
γ = regularization parameter (to be chosen)
Solution
quadratic criterion ⇒ linear gradient
⇒ solution f* obtained by solving a linear system
label node i by y(i) = sign( f*i )
Approach extends to K>2 classes
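A small NumPy sketch of this criterion for binary labels in {−1, +1}; the function name, the choice of the normalized Laplacian, and the labeled-node indicator J are illustrative assumptions:

```python
import numpy as np

def graph_rls(S, y_labeled, labeled_idx, gamma=0.1, normalized=True):
    """Graph-regularized least squares (in the spirit of the slide):
    minimize sum_{labeled} (f_i - y_i)^2 + gamma * f^T L f.

    The quadratic criterion has a linear gradient, so f* solves
    (J + gamma * L) f = J y, with J the diagonal 0/1 labeled-node indicator.
    """
    n = S.shape[0]
    D = S.sum(axis=1)
    if normalized:
        d_inv_sqrt = 1.0 / np.sqrt(D)
        L = np.eye(n) - (S * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    else:
        L = np.diag(D) - S
    J = np.zeros(n)
    J[labeled_idx] = 1.0
    y = np.zeros(n)
    y[labeled_idx] = y_labeled            # labels in {-1, +1}
    f = np.linalg.solve(np.diag(J) + gamma * L, J * y)
    return np.sign(f)                     # predicted class for every node
```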
Overview
Graph data
The problem
Clustering
Semisupervised learning
Kernels on graphs
graph regularized SVM
heat kernels
Other and future directions
Kernel machines
Kernel machines / Support Vector Machines solve the problem
min_f  Σ_i cost( yi, f(xi) ) + λ ||f||^2
in an elegant way
• when the cost and ||f|| can be expressed in terms of a scalar product between data points
• the scalar product <x, x'> = K(x, x') defines the kernel K
Our problem: define a kernel between nodes of a graph
has to reflect the graph topology
Kernels on graphs
1. “Manifold regularization”
kernel K is given
• e.g. data are vectors in R^N
graph + S given
• e.g nearest neighbors graph
task = classification
adds regularization (=smoothness penalty) based on
unlabeled data
2. “Heat kernel”
graph + S given
task
• find a kernel on the finite set of graph nodes
• [it will be used to label the nodes as in a regular SVM]
Graph regularized SVM
Graph given
e.g nearest neighbor graph
Kernel K given
Problem formulation
min_f  Σ_{i labeled} V( yi, f(xi) ) + γK ||f||K^2 + γI ||f||I^2   with ||f||I^2 = f^T L f
Representer theorem
f*(x) = Σ_{i=1..l+u} αi K(x, xi), if || ||I is smooth enough w.r.t. || ||K
Belkin & Niyogi ’05
The Heat kernel
Kondor & Lafferty 03
The [heat] diffusion equation:  ∂f/∂t = −Δ f
f(x,t) = “temperature”
Δ = Laplace operator
solution: f(·, t) = Kt f(·, 0)
• with Kt = e^{−tΔ} the heat kernel
On graph: replace Δ by the graph Laplacian L
heat kernel (discrete time)
continuous time: Kt = e^{−tL}
• t = smoothing parameter for the kernel
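A minimal sketch of the continuous-time kernel, assuming SciPy's matrix exponential and the normalized Laplacian (the unnormalized L = D − S would work the same way); the function name is illustrative:

```python
import numpy as np
from scipy.linalg import expm

def heat_kernel(S, t=1.0):
    """Continuous-time heat (diffusion) kernel K_t = exp(-t L) on the graph,
    with L the normalized Laplacian; t acts as the smoothing parameter."""
    D = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(D)
    L = np.eye(S.shape[0]) - (S * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    return expm(-t * L)          # symmetric positive definite, hence a valid kernel
```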
Generalized Heat Kernel
(Smola & Kondor 03)
Theorem The only linear, permutation-invariant mappings S ↦ T(S) ∈ R^{n×n} are of the form αS + βD + γI, with D = diag( Di )
Idea:
1. choose a regularization norm ||f||^2 = <f, Qf>
2. with Q = q(L) = Σ_i q(λi) vi vi^T
3. define the kernel from Q
where (λi, vi) are the eigenvalues / eigenvectors of L
Theorem
<f, Qf'> defines a reproducing kernel Hilbert space (RKHS)
the kernel is K = Q^{-1} = q(L)^{-1} (pseudo-inverse when Q is singular)
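A sketch of this recipe via the eigendecomposition of L; the function name and the example choices of q are illustrative (q(λ) = 1 + σ²λ gives the regularized Laplacian kernel, q(λ) = e^{tλ} recovers the heat kernel above):

```python
import numpy as np

def kernel_from_regularizer(S, q, normalized=True):
    """Kernel K = q(L)^+ built from a spectral regularization function q:
    Q = q(L) = sum_i q(lambda_i) v_i v_i^T, and K is its (pseudo-)inverse."""
    n = S.shape[0]
    D = S.sum(axis=1)
    if normalized:
        d = 1.0 / np.sqrt(D)
        L = np.eye(n) - (S * d[:, None]) * d[None, :]
    else:
        L = np.diag(D) - S
    lam, V = np.linalg.eigh(L)                    # L symmetric => real eigenpairs
    q_lam = q(lam)
    inv = np.where(np.abs(q_lam) > 1e-12, 1.0 / q_lam, 0.0)   # pseudo-inverse
    return (V * inv) @ V.T                        # K = sum_i q(lambda_i)^{-1} v_i v_i^T
```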
Overview
Graph data
The problem
Clustering
Semisupervised learning
Kernels on graphs
Other and future directions
Other aspects and future directions
Computation
Selecting number of clusters K
Obtaining / Learning the similarities Sij
Other tasks
ranking, influence, communication
Incorporating
constraints (prior knowledge)
statistical models
vector data
Directed graphs/ asymmetric S matrix
Computation
Algorithms are polynomial but intensive
all eigenvectors: ~ n^3
K eigenvectors: ~ nK × (number of iterations)
SVM solver
• quadratic optimization problem
Numerical stability
Good: many graphs are sparse
saves memory and computation
Perfect (P,C) pair
[Figure: two clusters C1 and C2; nodes A, B, C with within-cluster transition probabilities PBA, PAC, PCB and between-cluster transitions R12, R21]
The “chain” over clusters is generally not Markov
i.e., knowing past states gives information about the future
Definition: (P, C) is a perfect pair iff the aggregated chain is Markov
The spectral mapping
If (P, C) is a perfect pair
v1, v2, ..., vK = first K eigenvectors of P
[Figure: the eigenvectors v1, v2, v3 plotted over the nodes]
The spectral mapping: data plotted by their coordinates in v2, v3
These eigenvectors are called piecewise constant (PC)
[Figure: scatter plot of the (v2, v3) coordinates]
The “classification error” distance
computed by the maximal bipartite matching algorithm between clusters
[Figure: confusion matrix with entries Dkk' counting the points assigned to cluster k in one clustering and cluster k' in the other; the classification error is the mass left unmatched]
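A sketch of this distance using SciPy's Hungarian (linear sum assignment) solver for the maximal bipartite matching; the function name is illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def classification_error_distance(labels_a, labels_b):
    """Classification-error distance between two clusterings: build the
    confusion matrix D_kk', match clusters by maximal bipartite matching,
    and count the mass left unmatched."""
    a_ids, a = np.unique(labels_a, return_inverse=True)
    b_ids, b = np.unique(labels_b, return_inverse=True)
    n = len(labels_a)
    confusion = np.zeros((len(a_ids), len(b_ids)))
    np.add.at(confusion, (a, b), 1)                  # D_kk' = # points in C_k and C'_k'
    rows, cols = linear_sum_assignment(-confusion)   # maximize the matched mass
    return 1.0 - confusion[rows, cols].sum() / n
```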