Efficient Learning in High Dimensions
with
Trees and Mixtures
Marina Meila
Carnegie Mellon University
Multidimensional data
· Multidimensional (noisy) data
Learning
· Learning tasks - intelligent data analysis
  · categorization (clustering)
  · classification
  · novelty detection
  · probabilistic reasoning
· Data is changing, growing
· Tasks change
=> need to make learning automatic, efficient
Combining probability and algorithms
· Automatic - probability and statistics
· Efficient - algorithms
· This talk - the tree statistical model
Talk overview
· Introduction: statistical models
· Perspective: generative models and decision tasks
· The tree model
· Mixtures of trees
· Learning
· Experiments
· Accelerated learning
· Bayesian learning
A multivariate domain
· Data: patient records over the variables Smoker, Bronchitis, Lung cancer, Cough, X ray
  [Figure: Patient1, Patient2, ... feed into a statistical model]
· Queries
  · Diagnose a new patient
    [Figure: Cough and X ray observed; Smoker, Bronchitis, Lung cancer? unobserved]
  · Is smoking related to lung cancer?
    [Figure: Smoker, Bronchitis, Lung cancer?]
  · Understand the "laws" of the domain
Probabilistic approach
· Smoker, Bronchitis, ... are (discrete) random variables
· Statistical model (joint distribution)
P( Smoker, Bronchitis, Lung cancer, Cough, X ray )
summarizes knowledge about domain
· Queries:
· inference
e.g. P( Lung cancer = true | Smoker = true, Cough = false )
· structure of the model
• discovering relationships
• categorization
Probability table representation
  v1v2     00   01   11   10
  v3 = 0  .01  .14  .22  .01
  v3 = 1  .23  .03  .33  .03

· Query (see the sketch after this slide):

  P(v1=0 | v2=1) = P(v1=0, v2=1) / P(v2=1)
                 = (.14 + .03) / (.14 + .03 + .22 + .33)
                 = .17 / .72 ≈ .24
· Curse of dimensionality
  if v1, v2, ..., vn are binary variables, the table P(V1, V2, ..., Vn) has 2^n entries!
  · How to represent?
  · How to query?
  · How to learn from data?
  · Structure?
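As a concrete illustration of the table query above, here is a minimal sketch (mine, not from the talk) in Python/numpy; the array layout P[v1, v2, v3] is my choice, and the entries are the ones from the example table.

```python
import numpy as np

# Hypothetical joint table for 3 binary variables, indexed as P[v1, v2, v3].
P = np.empty((2, 2, 2))
P[:, :, 0] = [[.01, .14], [.01, .22]]   # v3 = 0; rows are v1, columns are v2
P[:, :, 1] = [[.23, .03], [.03, .33]]   # v3 = 1

# P(v1 = 0 | v2 = 1) = P(v1 = 0, v2 = 1) / P(v2 = 1)
num = P[0, 1, :].sum()          # marginalize out v3
den = P[:, 1, :].sum()          # marginalize out v1 and v3
print(num / den)                # ≈ 0.236
```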
Graphical models
· Structure
· vertices = variables
· edges = “direct dependencies”
· Parametrization
· by local probability tables
[Figure: Bayesian network for a galaxy domain, with variables Galaxy type, spectrum, dust, distance, size, Z (redshift), observed spectrum, observed size, photometric measurement]
· compact parametric representation
· efficient computation
· learning parameters by simple formula
· learning structure is NP-hard
The tree statistical model
· Structure
  · tree (graph with no cycles)
· Parameters
  · probability tables associated to edges
  [Figure: tree over variables 1-5, with tables T_3, T_34, T_4|3 attached to nodes and edges]

  T(x) = ∏_{uv∈E} T_uv(x_u, x_v) / ∏_{v∈V} T_v(x_v)^(deg v − 1)

  or, equivalently,

  T(x) = ∏_{uv∈E} T_{v|u}(x_v | x_u)

· T(x) factors over tree edges (see the sketch after this slide)
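A minimal sketch (my own illustration, not code from the talk) of evaluating the factored form of T(x). The tree, the pairwise tables T_uv, and the node marginals T_v are hypothetical, chosen to be mutually consistent.

```python
import numpy as np

edges = [(0, 1), (1, 2), (1, 3)]                 # a small tree over 4 binary variables
# Hypothetical pairwise marginals T_uv (each sums to 1); all node marginals are [0.6, 0.4].
T_uv = {e: np.array([[0.4, 0.2], [0.2, 0.2]]) for e in edges}
T_v = {v: np.array([0.6, 0.4]) for v in range(4)}
deg = {v: sum(v in e for e in edges) for v in range(4)}

def tree_prob(x):
    """Likelihood of configuration x under the tree factorization."""
    p = 1.0
    for (u, v) in edges:
        p *= T_uv[(u, v)][x[u], x[v]]
    for v in range(4):
        p /= T_v[v][x[v]] ** (deg[v] - 1)
    return p

print(tree_prob((0, 1, 0, 1)))
# the probabilities of all 2^4 configurations sum to 1:
print(sum(tree_prob(x) for x in np.ndindex(2, 2, 2, 2)))
```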
Examples
· Splice junction domain
  [Figure: learned tree over the junction type and sequence positions -7 ... +8]
· Premature babies' Broncho-Pulmonary Disease (BPD)
  [Figure: learned tree over PulmHemorrh, Coag, HyperNa, Acidosis, Gestation, Thrombocyt, Weight, Hypertension, Temperature, BPD, Neutropenia, Suspect, Lipid]
Trees - basic operations
T(x) = ∏_{uv∈E} T_uv(x_u, x_v) / ∏_{v∈V} T_v(x_v)^(deg v − 1),    |V| = n

· Querying the model
  · computing the likelihood T(x) ~ n
  · conditioning T_{V−A|A} (junction tree algorithm) ~ n
  · marginalization T_uv for arbitrary u, v ~ n
  · sampling ~ n (see the sketch after this slide)
· Estimating the model
  · fitting to a given distribution ~ n^2
  · learning from data ~ n^2 N_data
· The tree is a simple model
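A rough sketch (mine, not from the talk) of ancestral sampling using the directed form T_{v|u}: root the tree, then draw each variable from its conditional given its parent. Rooting at node 0 and the parent/order bookkeeping are my assumptions; the tables are the same hypothetical ones as above.

```python
import numpy as np

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (1, 3)]                  # same hypothetical tree as above
T_uv = {e: np.array([[0.4, 0.2], [0.2, 0.2]]) for e in edges}
T_v = {v: np.array([0.6, 0.4]) for v in range(4)}

parent = {1: 0, 2: 1, 3: 1}                       # tree rooted at node 0
order = [0, 1, 2, 3]                              # parents come before children

def sample():
    x = {}
    for v in order:
        if v not in parent:                       # root: draw from its marginal
            x[v] = rng.choice(2, p=T_v[v])
        else:
            u = parent[v]
            pair = T_uv[(u, v)] if (u, v) in T_uv else T_uv[(v, u)].T
            cond = pair[x[u]] / T_v[u][x[u]]      # T_{v|u}( . | x_u)
            x[v] = rng.choice(2, p=cond)
    return [x[v] for v in order]

print(sample())
```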
The mixture of trees
Q(x) = Σ_{k=1..m} λ_k T_k(x)    (see the sketch after this slide)

h = "hidden" variable,  P(h = k) = λ_k,  k = 1, 2, ..., m
· NOT a graphical model
· computational efficiency preserved
(Meila 97)
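A tiny sketch (mine, not from the talk) of evaluating the mixture Q(x) = Σ_k λ_k T_k(x); any callables returning T_k(x), such as the tree_prob function sketched earlier, can serve as components. The two dummy components below are purely illustrative.

```python
def mixture_prob(x, lambdas, components):
    """lambdas: mixture weights summing to 1; components: callables returning T_k(x)."""
    return sum(lam * T(x) for lam, T in zip(lambdas, components))

# Toy usage with two dummy distributions over 4 binary variables standing in for trees:
uniform = lambda x: 1.0 / 16
spiky = lambda x: 0.5 if x == (0, 0, 0, 0) else 0.5 / 15
print(mixture_prob((0, 0, 0, 0), [0.3, 0.7], [uniform, spiky]))
```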
Learning - problem formulation
· Maximum Likelihood learning
· given a data set D = { x1, . . . xN }
· find the model that best predicts the data
T^opt = argmax_T T(D),  where T(D) = ∏_{i=1..N} T(x_i)
· Fitting a tree to a distribution
· given a data set D = { x1, . . . xN }
and distribution P that weights each data point,
· find
T^opt = argmin_T KL( P || T )
· KL is the Kullback-Leibler divergence
· includes Maximum likelihood learning as a special case
Fitting a tree to a distribution
T^opt = argmin_T KL( P || T )
· optimization over structure + parameters
· sufficient statistics
· pairwise probability tables P_uv = N_uv / N, for all u, v ∈ V
· mutual informations I_uv (see the sketch after this slide)

  I_uv = Σ_{x_u, x_v} P_uv(x_u, x_v) log [ P_uv(x_u, x_v) / (P_u(x_u) P_v(x_v)) ]
(Chow & Liu 68)
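A minimal sketch (assumptions mine) of the sufficient statistics: given a table of co-occurrence counts N_uv(x_u, x_v), form P_uv = N_uv / N and compute the mutual information I_uv. The count matrices below are hypothetical.

```python
import numpy as np

def mutual_information(N_uv):
    """I_uv from a matrix of co-occurrence counts N_uv(x_u, x_v)."""
    P_uv = N_uv / N_uv.sum()
    P_u = P_uv.sum(axis=1, keepdims=True)
    P_v = P_uv.sum(axis=0, keepdims=True)
    mask = P_uv > 0                                   # convention: 0 log 0 = 0
    return float((P_uv[mask] * np.log(P_uv[mask] / (P_u @ P_v)[mask])).sum())

print(mutual_information(np.array([[40., 10.], [10., 40.]])))   # dependent pair: > 0
print(mutual_information(np.array([[25., 25.], [25., 25.]])))   # independent pair: 0.0
```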
Fitting a tree to a distribution - solution
· Structure

  E^opt = argmax_E Σ_{uv∈E} I_uv

  · found by the Maximum Weight Spanning Tree algorithm with edge weights I_uv (see the sketch after this slide)
  [Figure: complete graph with edge weights I_12, I_23, I_34, I_45, I_56, I_61, I_63]
· Parameters
  · copy marginals of P:  T_uv = P_uv for uv ∈ E
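A rough sketch (my own Prim's-algorithm implementation, not the talk's code) of the structure step: a maximum weight spanning tree under mutual-information edge weights. The weight matrix is hypothetical.

```python
import numpy as np

def max_weight_spanning_tree(I):
    """Prim's algorithm; I is a symmetric n x n matrix of edge weights I_uv."""
    n = I.shape[0]
    in_tree = [0]
    edges = []
    while len(in_tree) < n:
        best = None
        for u in in_tree:
            for v in range(n):
                if v not in in_tree and (best is None or I[u, v] > I[best[0], best[1]]):
                    best = (u, v)
        edges.append(best)
        in_tree.append(best[1])
    return edges

# Hypothetical mutual-information weights for 4 variables:
I = np.array([[0., .5, .1, .1],
              [.5, 0., .4, .2],
              [.1, .4, 0., .3],
              [.1, .2, .3, 0.]])
print(max_weight_spanning_tree(I))   # [(0, 1), (1, 2), (2, 3)]
```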
Learning mixtures by the EM algorithm
Meila & Jordan ‘97
· E step: which x_i come from T^k?  →  weighted distribution P^k(x) per component
· M step: fit T^k to its weighted set of points:  min KL( P^k || T^k )
  (see the sketch after this slide)
· Initialize randomly
· converges to local maximum of the likelihood
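A schematic EM loop (my own sketch). To keep it short, the M-step below fits independent Bernoulli factors per component rather than a full weighted Chow-Liu tree; the E-step responsibilities and the weighted M-step refit have the same structure as in the mixture-of-trees algorithm. All data and sizes are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10))                # synthetic binary data, N x n
m = 3                                                 # number of mixture components
lam = np.full(m, 1.0 / m)                             # mixture weights
theta = rng.uniform(0.3, 0.7, size=(m, X.shape[1]))   # P(x_v = 1 | component k)

for it in range(20):
    # E step: responsibilities gamma[i, k] proportional to lam_k * T_k(x_i)
    logp = X @ np.log(theta.T) + (1 - X) @ np.log(1 - theta.T) + np.log(lam)
    logp -= logp.max(axis=1, keepdims=True)
    gamma = np.exp(logp)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M step: refit each component to its weighted data (here: weighted means)
    lam = gamma.mean(axis=0)
    theta = (gamma.T @ X + 1e-6) / (gamma.sum(axis=0)[:, None] + 2e-6)

print(lam)
```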
Remarks
· Learning a tree
  · solution is globally optimal over structures and parameters
  · tractable: running time ~ n^2 N
· Learning a mixture by the EM algorithm
  · both E and M steps are exact, tractable
  · running time
    • E step ~ mnN
    • M step ~ mn^2 N
  · assumes m known
  · converges to a local optimum
Finding structure - the bars problem
[Figure: bars data (n = 25) and the learned structure]
· Structure recovery: 19 out of 20 trials
· Hidden variable accuracy: 0.85 +/- 0.08 (ambiguous), 0.95 +/- 0.01 (unambiguous)
· Data likelihood [bits/data point]: true model 8.58, learned model 9.82 +/- 0.95
Experiments - density estimation
· Digits and digit pairs
Ntrain = 6000 Nvalid = 2000 Ntest = 5000
n = 64 variables ( m = 16 trees )
n = 128 variables ( m = 32 trees )
DNA splice junction classification
· n = 61 variables
· class = Intron/Exon, Exon/Intron, Neither
[Figure: classification accuracy - Tree vs. TANB, NB, and supervised methods from DELVE]
Discovering structure
[Figure: tree adjacency matrix; the class variable connects to positions around the junction]

IE junction (Intron | Exon):
  position: 15  16  ...  25  26  27  28  29  30  31
  Tree:      -  CT  ...  CT  CT   -  CT   A   G   G
  True:     CT  CT  ...  CT  CT   -  CT   A   G   G
  (Watson, "The Molecular Biology of the Gene", 1987)

EI junction (Exon | Intron):
  position: 28  29  30  31  32  33  34  35  36
  Tree:     CA   A   G   G   T  AG   A   G   -
  True:     CA   A   G   G   T  AG   A   G   T
Irrelevant variables
61 original variables + 60 “noise” variables
[Figure: results on the original data vs. data augmented with irrelevant variables]
Accelerated tree learning
· Running time for the tree learning algorithm ~ n^2 N
· Quadratic running time may be too slow.
  Example: document classification
  · document = data point  →  N = 10^3 - 10^4
  · word = variable  →  n = 10^3 - 10^4
  · sparse data  →  # words in a document ≈ s, with s << n, N
· Can sparsity be exploited to create faster algorithms?
Meila ‘99
Sparsity
· assume a special value "0" that occurs frequently
· sparsity s = # non-zero variables in each data point,  s << n, N
· Idea: "do not represent / count zeros" (see the sketch after this slide)
  [Figure: sparse binary data rows (e.g. 010000100001000) stored as linked lists of length ~ s]
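A minimal sketch (assumptions mine) of the sparsity idea: each data point is stored as the list of its non-zero variables, and the single and pairwise non-zero counts are accumulated in ~ s and ~ s^2 operations per point instead of ~ n and ~ n^2. The data points are made up.

```python
from collections import Counter

data = [
    [1, 6, 11],        # a data point = indices of its non-zero variables
    [3, 9],
    [1, 14],
]

N_v = Counter()        # single-variable non-zero counts
N_uv = Counter()       # pairwise non-zero co-occurrence counts
for nz in data:
    N_v.update(nz)
    N_uv.update((u, v) for i, u in enumerate(nz) for v in nz[i + 1:])

print(N_v)             # e.g. Counter({1: 2, 6: 1, 11: 1, ...})
print(N_uv)            # e.g. Counter({(1, 6): 1, (1, 11): 1, (6, 11): 1, ...})
```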
Presort mutual informations
Theorem (Meila '99) If v, v' are variables that do not co-occur with u
in the data (i.e. N_uv = N_uv' = 0), then
      N_v > N_v'  ==>  I_uv > I_uv'
· Consequences (see the check after this slide)
  · sort the N_v => all edges uv with N_uv = 0 are implicitly sorted by I_uv
  · these edges need not be represented explicitly
  · construct a black box that outputs the next "largest" edge
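A quick numeric check (my own, not from the paper) of the presorting property: when u and v never co-occur, I_uv depends on v only through N_v and grows with N_v, so sorting variables by N_v orders these edges by I_uv. The counts below are hypothetical.

```python
import numpy as np

def I_given_counts(N, N_u, N_v):
    """Mutual information for binary u, v that never co-occur (N_uv = 0)."""
    P = np.array([[N - N_u - N_v, N_v], [N_u, 0.0]]) / N   # cells (0,0),(0,1),(1,0),(1,1)
    Pu = P.sum(axis=1, keepdims=True)
    Pv = P.sum(axis=0, keepdims=True)
    mask = P > 0
    return float((P[mask] * np.log(P[mask] / (Pu @ Pv)[mask])).sum())

N, N_u = 1000, 50
print([round(I_given_counts(N, N_u, N_v), 5) for N_v in (10, 20, 40, 80)])
# the values increase with N_v, as the theorem states
```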
The black box data structure
[Figure: for each vertex v, a list of the u with N_uv > 0 sorted by I_uv, plus a virtual list of the u with N_uv = 0 sorted by N_v; an F-heap of size ~ n delivers the next edge uv]

· Total running time ~ n log n + s^2 N + nK log n
  (standard algorithm running time ~ n^2 N)
Experiments - sparse binary data
[Figure: running time of the standard vs. accelerated algorithm]
· N = 10,000
· s = 5, 10, 15, 100
Remarks
· Realistic assumption
· Exact algorithm, provably efficient time bounds
· Degrades slowly to the standard algorithm if the data are not sparse
· General
  · non-integer counts
  · multi-valued discrete variables
Bayesian learning of trees
Meila & Jaakkola ‘00
· Problem
· given prior distribution over trees P0(T)
data D = { x1, . . . xN }
· find posterior distribution P(T|D)
· Advantages
· incorporates prior knowledge
· regularization
· Solution
· Bayes' formula

  P(T|D) = (1/Z) P_0(T) ∏_{i=1..N} T(x_i)

· practically hard
  • the distribution over structure E and parameters θ_E is hard to represent
  • computing Z is intractable in general
  • exception: conjugate priors
Decomposable priors
· want priors that factor over tree edges

  P_0(T) = ∏_{uv∈E} f( u, v, θ_{u|v} )

· prior for structure E

  P_0(E) ∝ ∏_{uv∈E} β_uv

· prior for tree parameters

  P_0(θ_E) = ∏_{uv∈E} D( θ_{u|v} ; N'_uv )

  · (hyper-)Dirichlet with hyper-parameters N'_uv(x_u, x_v), for all u, v ∈ V
  · posterior is also Dirichlet, with hyper-parameters N_uv(x_u, x_v) + N'_uv(x_u, x_v), for all u, v ∈ V
Decomposable posterior
· Posterior distribution

  P(T|D) ∝ ∏_{uv∈E} W_uv,    W_uv = β_uv D( θ_{u|v} ; N'_uv + N_uv )

· factored over edges
· same form as the prior
· Remains to compute the normalization constant
  • discrete part: graph theory (matrix tree theorem)
  • continuous part: Meila & Jaakkola '99
The Matrix tree theorem
· Matrix tree theorem

  If  P_0(E) = (1/Z) ∏_{uv∈E} β_uv,  β_uv ≥ 0,
  then  Z = det M(β),

  where M(β) has diagonal entries M_vv = Σ_{v'} β_vv' and off-diagonal entries
  M_uv = −β_uv (a first minor of the weighted graph Laplacian); see the sketch below.
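A minimal sketch (mine, not from the talk) of the matrix tree computation: Z as the determinant of a first minor of the weighted Laplacian. With all-ones weights on 4 vertices this recovers Cayley's count of 4^(4−2) = 16 spanning trees.

```python
import numpy as np

def tree_partition_function(beta):
    """beta: symmetric n x n matrix of non-negative edge weights, zero diagonal."""
    L = np.diag(beta.sum(axis=1)) - beta       # weighted graph Laplacian
    return np.linalg.det(L[1:, 1:])            # any first minor gives Z

beta = np.ones((4, 4)) - np.eye(4)             # all edge weights equal to 1
print(tree_partition_function(beta))           # 16.0
```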
Remarks on the decomposable prior
· Is a conjugate prior for the tree distribution
· Is tractable
  · defined by ~ n^2 parameters
  · computed exactly in ~ n^3 operations
  · posterior obtained in ~ n^2 N + n^3 operations
  · derivatives w.r.t. parameters, averaging, ... ~ n^3
· Mixtures of trees with decomposable priors
  · MAP estimation with the EM algorithm is tractable
· Other applications
  · ensembles of trees
  · maximum entropy distributions on trees
So far . .
· Trees and mixtures of trees are structured statistical models
· Algorithmic techniques enable efficient learning
• mixture of trees
• accelerated algorithm
• matrix tree theorem & Bayesian learning
· Examples of usage
· Structure learning
· Compression
· Classification
Generative models and discrimination
· Trees are generative models
· descriptive
· can perform many tasks suboptimally
· Maximum Entropy discrimination (Jaakkola, Meila, Jebara '99)
  · optimize for specific tasks
  · use generative models
  · combine simple models into ensembles
  · complexity control by an information-theoretic principle
· Discrimination tasks
· detecting novelty
· diagnosis
· classification
Bridging the gap
[Figure: bridging descriptive learning and discriminative learning toward the tasks]
Future . . .
· Tasks have structure
  · multi-way classification
  · multiple indexing of documents
  · gene expression data
  · hierarchical, sequential decisions
· Learn structured decision tasks
  · sharing information between tasks (transfer)
  · modeling dependencies between decisions