Mining Graph Data
Marina Meila
University of Washington
Department of Statistics
www.stat.washington.edu
Graph Data—An example
edge weight Sij = number of reports co-authored by i, j
(Dept. of Statistics technical reports)
Examples of graph data
Social networks
• friendships, work relationships
• AIDS epidemiology
• transactions between economic agents
• internet communities (e.g usenet, chat rooms)
Document databases, the web
• Citations, hyperlinks (not symmetric)
Computer networks
Image segmentation
• Data points are pixels
• features are distance, contour, color, texture
• Natural images, medical images, satellite images, etc
Protein-protein interactions, similarities
Linguistics
Vector data can be transformed into pairwise data
by nearest neighbor graphs
by “kernelization” (as in SVMs)
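A minimal sketch of these two constructions, assuming NumPy is available; the function names and the Gaussian (RBF) choice of kernel are illustrative, not from the slides:

```python
import numpy as np

def gaussian_similarity(X, sigma=1.0):
    """Dense similarity matrix S_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(S, 0.0)                 # no self-loops
    return S

def knn_graph(S, k=5):
    """Sparsify: keep the k largest similarities per node, then symmetrize."""
    W = np.zeros_like(S)
    idx = np.argsort(S, axis=1)[:, -k:]      # indices of the k nearest neighbours
    rows = np.arange(S.shape[0])[:, None]
    W[rows, idx] = S[rows, idx]
    return np.maximum(W, W.T)                # Sij = Sji >= 0
```

Either way, the result is a symmetric similarity matrix of the kind used in the rest of the talk.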
Graph data and the similarity matrix
Graph data can be
Symmetric similarities between nodes, Sij = Sji ≥ 0  (the focus for most of this talk)
• e.g. number of papers co-authored
Asymmetric affinities Aij ≥ 0
• e.g. number of links from site i to site j
Node attributes
• e.g. age, university
[Symmetric dis-similarities]
Overview
Graph data
The problem
what does it mean to do classification or clustering on a
graph?
three approaches to grouping
Clustering
Semisupervised learning
Kernels on graphs
Other and future directions
The main difference
In standard tasks, data are independent vectors
x = (age, number publications, ...)
Training set { x1, x2, ... xn} = a set of persons sampled
independently from the population
In graph mining tasks, the data are the (weighted) links
between graph nodes
S(x, x’) = number of papers co-authored by x, x’
“Training set” = the whole co-authorship network
The problem
Standard data mining tasks
data = independent vectors (x1,...,xn) in R^d
[labels (y1,...,yn) in {-1,1}]
Classification
supervised learning
Semisupervised learning
Clustering
unsupervised learning
Graph mining tasks
data = graph on n nodes
node similarities Sij
[labels (y1,...,yn) in {-1,1}]
Classification
supervised learning
Semisupervised learning
Clustering and embedding
unsupervised learning
Clustering
[Figure: the same graph partitioned into 3 clusters vs. 2 clusters]
Embedding
Semisupervised learning
(transductive classification)
Three paradigms for grouping nodes in graphs
Both clustering and classification can be seen as grouping
Graph cuts
remove some edges ⇒ disconnected graph
the groups are the connected components
By “similar behavior”
nodes i, j in the same group iff i,j “have the same pattern
of connections” w.r.t other nodes
By Embedding
map nodes {1,2,...,n} → {x1, x2, ..., xn} in R^d
then use standard classification and clustering methods
1. Graph cuts
Definitions
node degree (or volume): Di = Σj Sij
volume of cluster C: Vol(C) = Σ_{i in C} Di
cut between clusters C, C': Cut(C, C') = Σ_{i in C, j in C'} Sij
MinCut vs. Multiway Normalized Cut (MNCut)
MinCut: minimize Cut(C, C') over all partitions C, C'
polynomial
BUT: the resulting partition can be imbalanced
MNCut: MNCut(C) = Σ_{k=1..K} Cut(Ck, V \ Ck) / Vol(Ck)
For K = 2: MNCut(C) = Cut(C, C') ( 1/Vol(C) + 1/Vol(C') )
Motivation for MNCut
MNCut is smallest for the “best” clustering in many situations
Sij ∝ 1/dist(i, j)
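As a concrete reading of the definitions above, here is a small NumPy sketch that evaluates MNCut for a given partition; the function name and the encoding of clusters as an integer label vector are illustrative assumptions:

```python
import numpy as np

def mncut(S, labels):
    """Multiway Normalized Cut of the partition given by `labels`.

    MNCut = sum_k Cut(C_k, V \\ C_k) / Vol(C_k), with
    D_i = sum_j S_ij, Vol(C) = sum_{i in C} D_i,
    Cut(C, C') = sum_{i in C, j in C'} S_ij.
    """
    degrees = S.sum(axis=1)
    total = 0.0
    for k in np.unique(labels):
        in_k = labels == k
        vol = degrees[in_k].sum()
        cut = S[np.ix_(in_k, ~in_k)].sum()
        total += cut / vol
    return total
```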
2. “Patterns of behavior” : The random walks view
[Figure: a node i with neighbors j, k, l; edge weights Sij, Sik, Sil and transition probabilities Pij, Pik, Pil]
volume (degree) of node i: Di = Σj Sij
transition probability: Pij = Sij / Di
matrix notation: D = diag( D1, D2, ..., Dn )  ⇒  P = D^{-1} S
Idea:
nodes i, j are grouped together iff they transition in the same way to other clusters
[Figure: example graph with two clusters (red and yellow); each node i is annotated with its aggregated transition probabilities (Pi,red, Pi,yellow), which equal (1/5, 4/5) for every node of one cluster and (2/3, 1/3) for every node of the other]
3. Embedding
mapping from nodes to R^d, given by d vectors [f^(1) f^(2) ... f^(d)], each with n elements (one per node)
where fi^(k) represents the k-th coordinate of node i
wanted
nodes that are similar mapped near each other
ideally: all nodes in a group map to the same point
Another look at Pi,C
[Figure: the same two-cluster example; the aggregated transition probabilities Pi,red define a piecewise constant function f_red over the nodes, taking the value 2/3 on one cluster and 1/5 on the other (f_yellow takes the values 1/3 and 4/5 respectively)]
not all graphs produce perfect embeddings by this method
need to know the groups to obtain the embedding
Three approaches to grouping, summarized
1. Minimize MNCut (how?)
2. Random walks: group by similarity of the “aggregated” transitions Pi,C
3. Embedding (how?)
Will show that
1. 1-2-3 are equivalent
2. a spectral algorithm to solve the problem
Overview
Graph data
The problem
Clustering
Random walks: a spectral clustering algorithm
spectral clustering as optimization
a stability result
Semisupervised learning
Kernels on graphs
Other and future directions
Theorem 1. Lumpability
Let S = n x n similarity matrix
C = {C1, C2, ... CK} a clustering
Then
the transition probabilities Pi,C are piecewise constant
iff the transition matrix P = D^{-1}S has K piecewise constant
eigenvectors
Why is this important?
suggests algorithm to find the grouping C
• spectral algorithm
grouping by the similarity of connections is a form of
embedding
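A quick numerical illustration of the lumpability statement, under the assumption of an idealized block-structured similarity matrix (the block sizes and weights below are made up for the example): the first K = 2 eigenvectors of P = D^{-1}S come out piecewise constant on the two blocks.

```python
import numpy as np

# Idealized two-cluster similarity matrix: within-cluster weight 1, between 0.01.
n1, n2 = 5, 7
S = np.full((n1 + n2, n1 + n2), 0.01)
S[:n1, :n1] = 1.0
S[n1:, n1:] = 1.0
np.fill_diagonal(S, 0.0)

D = S.sum(axis=1)
P = S / D[:, None]                       # transition matrix P = D^{-1} S

eigvals, eigvecs = np.linalg.eig(P)
order = np.argsort(-eigvals.real)
top2 = eigvecs[:, order[:2]].real        # eigenvectors of the 2 largest eigenvalues

# Each column is (numerically) constant within each block of nodes.
print(np.round(top2, 4))
```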
A spectral clustering algorithm
Algorithm SC
(Meila & Shi, 01)
(there are many
other variants)
INPUT: number of clusters K
symmetric similarity matrix S
1. Compute transition matrix P
2. Compute the K largest eigenvalues of P and their eigenvectors:
λ1 ≥ λ2 ≥ ... ≥ λK, with eigenvectors v1, v2, ..., vK
3. Spectral mapping: map nodes to R^K by
node i ↦ xi = ( v1i v2i ... vKi )
4. Cluster the data {xi} in R^K by e.g. min diameter or k-means
OUTPUT : clustering C
(Dasgupta & Schulman 02)
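A compact sketch of Algorithm SC in NumPy / scikit-learn terms; k-means is used for step 4, and the function name is illustrative (the slides also allow e.g. min-diameter clustering):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering_sc(S, K, seed=0):
    """Algorithm SC: spectral mapping via P = D^{-1} S, then clustering in R^K."""
    D = S.sum(axis=1)
    P = S / D[:, None]                         # 1. transition matrix
    eigvals, eigvecs = np.linalg.eig(P)        # 2. P is not symmetric, so take real parts
    order = np.argsort(-eigvals.real)[:K]      #    K largest eigenvalues lambda_1..lambda_K
    X = eigvecs[:, order].real                 # 3. spectral mapping: x_i = (v_1i, ..., v_Ki)
    return KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X)   # 4.
```

In practice one often computes the eigenvectors of the symmetric matrix D^{-1/2} S D^{-1/2} instead, which is numerically better behaved and related to the eigenvectors of P by a diagonal rescaling.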
Spectral clustering in a nutshell
weighted graph (n vertices to cluster; the observations are the pairwise similarities)
→ similarity matrix S (n x n, symmetric, Sij ≥ 0)
→ [normalize rows] transition matrix P
→ [spectral mapping] first K eigenvectors of P
→ [clustering in R^K] K clusters
Theorem 2. Multicut
Let S = n x n similarity matrix
L = I − D^{-1/2} S D^{-1/2} and P = D^{-1}S
C = {C1, C2, ..., CK} a clustering, Y its indicator matrix
Then
MNCut(C) = K − Σ_k (Yk^T S Yk)/(Yk^T D Yk)  ≥  K − (λ1 + λ2 + ... + λK)
with equality iff the eigenvectors (v1 v2 ... vK) of P are piecewise constant
Why is this important?
MNCut has quadratic expression (used later)
non-trivial lower bound for MNCut (used later)
for (nearly) perfect P the Spectral Clustering Algorithm
minimizes MNCut
Hence the SC algorithm can be viewed in three different ways
Theorem 3. Stability
The eigengap of P
measures the stability of the K-th principal subspace w.r.t
perturbations of P
Definition: the eigengap of P is δ = λK − λK+1
Theorem
Let C, C' be two clusterings, each with MNCut close to the lower bound K − (λ1 + ... + λK)
Then the distance between C and C' is small, with a bound that depends on the eigengap δ
Significance
If a stability theorem holds
any two “good” clusterings are close
in particular, no “good” clustering can be too far from the
optimal C*
Gap Corollary If the eigengap δ is large and MNCut(C) is close to the lower bound,
then C is close to the optimal clustering C*
Is the bound ever informative?
An experiment: S perfect + additive noise
Overview
Graph data
The problem
Clustering
Semisupervised learning
Kernels on graphs
Other and future directions
Semisupervised grouping
Data
(i1,y1), (i2,y2), ..., (il,yl) = l labeled nodes
i(l+1), ..., i(l+u) = u unlabeled nodes
l + u = n
Assumed that groups (classes)
agree with graph structure
[Figure: predicted labels ignoring the unlabeled data vs. using the unlabeled data]
MNCut as smoothness
Let f ∈ R^n = the labeling, fi = class( node i )
In R^d, the smoothness functional is ∫ ||∇f||^2 dP
On the graph: grad f ↦ (fi − fj), P = the discrete measure on the nodes, so
smoothness(f) = ½ Σ_ij Sij (fi − fj)^2 = f^T L f   (with the Laplacian L defined next)
The Laplace operator(s) on a graph
Unnormalized Laplacian
L = D − S
intuitive
Normalized Laplacian
L = I − D^{-1/2} S D^{-1/2}
scale invariant
compact operator
• better convergence properties
Graph regularized Least Squares
Belkin & Niyogi ’05
For simplicity assume K = 2
Criterion: minimize smoothness + labeling error
min_f  Σ_{i labeled} ( fi − yi )^2 + γ f^T L f
γ = regularization parameter (to be chosen)
Solution
quadratic criterion ⇒ linear gradient
⇒ solution f* obtained by solving a linear system
label node i by y(i) = sign( f*i )
Approach extends to K>2 classes
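A small NumPy sketch of this criterion for binary labels in {−1, +1}; the function name, the choice of the normalized Laplacian, and the labeled-node indicator J are illustrative assumptions:

```python
import numpy as np

def graph_rls(S, y_labeled, labeled_idx, gamma=0.1, normalized=True):
    """Graph-regularized least squares (in the spirit of the slide):
    minimize sum_{labeled} (f_i - y_i)^2 + gamma * f^T L f.

    The quadratic criterion has a linear gradient, so f* solves
    (J + gamma * L) f = J y, with J the diagonal 0/1 labeled-node indicator.
    """
    n = S.shape[0]
    D = S.sum(axis=1)
    if normalized:
        d_inv_sqrt = 1.0 / np.sqrt(D)
        L = np.eye(n) - (S * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    else:
        L = np.diag(D) - S
    J = np.zeros(n)
    J[labeled_idx] = 1.0
    y = np.zeros(n)
    y[labeled_idx] = y_labeled            # labels in {-1, +1}
    f = np.linalg.solve(np.diag(J) + gamma * L, J * y)
    return np.sign(f)                     # predicted class for every node
```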
Overview
Graph data
The problem
Clustering
Semisupervised learning
Kernels on graphs
graph regularized SVM
heat kernels
Other and future directions
Kernel machines
Kernel machines / Support Vector Machines solve the problem
min_f  Σ_i cost( yi, f(xi) ) + λ ||f||^2
in an elegant way
• when the cost and ||f|| can be expressed in terms of a scalar product between data points
• the scalar product <x, x'> = K(x, x') defines the kernel K
Our problem: define a kernel between nodes of a graph
has to reflect the graph topology
Kernels on graphs
1. “Manifold regularization”
kernel K is given
• e.g. data are vectors in R^N
graph + S given
• e.g nearest neighbors graph
task = classification
adds regularization (=smoothness penalty) based on
unlabeled data
2. “Heat kernel”
graph + S given
task
• find a kernel on the finite set of graph nodes
• [it will be used to label the nodes as in a regular SVM]
Graph regularized SVM
Graph given
e.g nearest neighbor graph
Kernel K given
Problem formulation
min_f  Σ_{i labeled} V( yi, f(xi) ) + γK ||f||K^2 + γI ||f||I^2   with ||f||I^2 = f^T L f
Representer theorem
f*(x) = Σ_{i=1..l+u} αi K(x, xi), if || ||I is smooth enough w.r.t. || ||K
Belkin & Niyogi ’05
The Heat kernel
Kondor & Lafferty 03
The [heat] diffusion equation:  ∂f/∂t = −Δ f
f(x,t) = “temperature”
Δ = Laplace operator
solution: f(·, t) = Kt f(·, 0)
• with Kt = e^{−tΔ} the heat kernel
On graph: replace Δ by the graph Laplacian L
heat kernel (discrete time)
continuous time: Kt = e^{−tL}
• t = smoothing parameter for the kernel
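A minimal sketch of the continuous-time kernel, assuming SciPy's matrix exponential and the normalized Laplacian (the unnormalized L = D − S would work the same way); the function name is illustrative:

```python
import numpy as np
from scipy.linalg import expm

def heat_kernel(S, t=1.0):
    """Continuous-time heat (diffusion) kernel K_t = exp(-t L) on the graph,
    with L the normalized Laplacian; t acts as the smoothing parameter."""
    D = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(D)
    L = np.eye(S.shape[0]) - (S * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    return expm(-t * L)          # symmetric positive definite, hence a valid kernel
```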
Generalized Heat Kernel
(Smola & Kondor 03)
Theorem The only linear, permutation-invariant mappings S ↦ T(S) ∈ R^{n×n} are of the form αS + βD + γI, with D = diag( Di )
Idea:
1. choose a regularization norm ||f||^2 = <f, Qf>
2. with Q = q(L) = Σ_i q(λi) vi vi^T
3. define the kernel from Q
where (λi, vi) are the eigenvalues / eigenvectors of L
Theorem
<f, Qf'> defines a reproducing kernel Hilbert space (RKHS)
the kernel is K = Q^{-1} = q(L)^{-1} (pseudo-inverse when Q is singular)
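A sketch of this recipe via the eigendecomposition of L; the function name and the example choices of q are illustrative (q(λ) = 1 + σ²λ gives the regularized Laplacian kernel, q(λ) = e^{tλ} recovers the heat kernel above):

```python
import numpy as np

def kernel_from_regularizer(S, q, normalized=True):
    """Kernel K = q(L)^+ built from a spectral regularization function q:
    Q = q(L) = sum_i q(lambda_i) v_i v_i^T, and K is its (pseudo-)inverse."""
    n = S.shape[0]
    D = S.sum(axis=1)
    if normalized:
        d = 1.0 / np.sqrt(D)
        L = np.eye(n) - (S * d[:, None]) * d[None, :]
    else:
        L = np.diag(D) - S
    lam, V = np.linalg.eigh(L)                    # L symmetric => real eigenpairs
    q_lam = q(lam)
    inv = np.where(np.abs(q_lam) > 1e-12, 1.0 / q_lam, 0.0)   # pseudo-inverse
    return (V * inv) @ V.T                        # K = sum_i q(lambda_i)^{-1} v_i v_i^T
```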
Overview
Graph data
The problem
Clustering
Semisupervised learning
Kernels on graphs
Other and future directions
Other aspects and future directions
Computation
Selecting number of clusters K
Obtaining / Learning the similarities Sij
Other tasks
ranking, influence, communication
Incorporating
constraints (prior knowledge)
statistical models
vector data
Directed graphs/ asymmetric S matrix
Computation
Algorithms are polynomial but intensive
all eigenvectors: ~ n^3
K eigenvectors: ~ nK × (number of iterations)
SVM solver
• quadratic optimization problem
Numerical stability
Good: many graphs are sparse
saves memory and computation
Perfect (P,C) pair
[Figure: two clusters C1 and C2; nodes A, B, C with within-cluster transition probabilities PBA, PAC, PCB and between-cluster transitions R12, R21]
The “chain” over clusters is generally not Markov
i.e., knowing past states gives information about the future
Definition: (P, C) is a perfect pair iff the aggregated chain is Markov
The spectral mapping
If (P, C) is a perfect pair
v1, v2, ..., vK = first K eigenvectors of P
[Figure: the eigenvectors v1, v2, v3 plotted over the nodes]
The spectral mapping: data plotted by their coordinates in v2, v3
These eigenvectors are called piecewise constant (PC)
[Figure: scatter plot of the (v2, v3) coordinates]
The “classification error” distance
computed by the maximal bipartite matching algorithm between clusters
[Figure: confusion matrix with entries Dkk' counting the points assigned to cluster k in one clustering and cluster k' in the other; the classification error is the mass left unmatched]
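A sketch of this distance using SciPy's Hungarian (linear sum assignment) solver for the maximal bipartite matching; the function name is illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def classification_error_distance(labels_a, labels_b):
    """Classification-error distance between two clusterings: build the
    confusion matrix D_kk', match clusters by maximal bipartite matching,
    and count the mass left unmatched."""
    a_ids, a = np.unique(labels_a, return_inverse=True)
    b_ids, b = np.unique(labels_b, return_inverse=True)
    n = len(labels_a)
    confusion = np.zeros((len(a_ids), len(b_ids)))
    np.add.at(confusion, (a, b), 1)                  # D_kk' = # points in C_k and C'_k'
    rows, cols = linear_sum_assignment(-confusion)   # maximize the matched mass
    return 1.0 - confusion[rows, cols].sum() / n
```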