Efficient Learning in High Dimensions
with
Trees and Mixtures
Marina Meila
Carnegie Mellon University
Multidimensional data
· Multidimensional (noisy) data
Learning
· Learning tasks - intelligent data analysis
  · categorization (clustering)
  · classification
  · novelty detection
  · probabilistic reasoning
· Data is changing, growing
· Tasks change
=> need to make learning automatic, efficient
Combining probability and algorithms
· Automatic - probability and statistics
· Efficient - algorithms
· This talk - the tree statistical model
Talk overview
· Introduction: statistical models
· Perspective: generative models and decision tasks
· The tree model
· Mixtures of trees
· Learning
· Experiments
· Accelerated learning
· Bayesian learning
A multivariate domain
· Data: patient records over the variables Smoker, Bronchitis, Lung cancer, Cough, X ray
  [Figure: Patient1, Patient2, ... feed into a statistical model]
· Queries
  · Diagnose a new patient
    [Figure: Cough and X ray observed; Smoker, Bronchitis, Lung cancer? unobserved]
  · Is smoking related to lung cancer?
    [Figure: Smoker, Bronchitis, Lung cancer?]
  · Understand the "laws" of the domain
Probabilistic approach
· Smoker, Bronchitis, ... are (discrete) random variables
· Statistical model (joint distribution)
P( Smoker, Bronchitis, Lung cancer, Cough, X ray )
summarizes knowledge about domain
· Queries:
· inference
e.g. P( Lung cancer = true | Smoker = true, Cough = false )
· structure of the model
• discovering relationships
• categorization
Probability table representation
  v1v2     00   01   11   10
  v3 = 0  .01  .14  .22  .01
  v3 = 1  .23  .03  .33  .03

· Query (see the sketch after this slide):

  P(v1=0 | v2=1) = P(v1=0, v2=1) / P(v2=1)
                 = (.14 + .03) / (.14 + .03 + .22 + .33)
                 = .17 / .72 ≈ .24
· Curse of dimensionality
  if v1, v2, ..., vn are binary variables, the table P(V1, V2, ..., Vn) has 2^n entries!
  · How to represent?
  · How to query?
  · How to learn from data?
  · Structure?
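As a concrete illustration of the table query above, here is a minimal sketch (mine, not from the talk) in Python/numpy; the array layout P[v1, v2, v3] is my choice, and the entries are the ones from the example table.

```python
import numpy as np

# Hypothetical joint table for 3 binary variables, indexed as P[v1, v2, v3].
P = np.empty((2, 2, 2))
P[:, :, 0] = [[.01, .14], [.01, .22]]   # v3 = 0; rows are v1, columns are v2
P[:, :, 1] = [[.23, .03], [.03, .33]]   # v3 = 1

# P(v1 = 0 | v2 = 1) = P(v1 = 0, v2 = 1) / P(v2 = 1)
num = P[0, 1, :].sum()          # marginalize out v3
den = P[:, 1, :].sum()          # marginalize out v1 and v3
print(num / den)                # ≈ 0.236
```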
Graphical models
· Structure
· vertices = variables
· edges = “direct dependencies”
· Parametrization
· by local probability tables
[Figure: Bayesian network for a galaxy domain, with variables Galaxy type, spectrum, dust, distance, size, Z (redshift), observed spectrum, observed size, photometric measurement]
· compact parametric representation
· efficient computation
· learning parameters by simple formula
· learning structure is NP-hard
The tree statistical model
· Structure
  · tree (graph with no cycles)
· Parameters
  · probability tables associated to edges
  [Figure: tree over variables 1-5, with tables T_3, T_34, T_4|3 attached to nodes and edges]

  T(x) = ∏_{uv∈E} T_uv(x_u, x_v) / ∏_{v∈V} T_v(x_v)^(deg v − 1)

  or, equivalently,

  T(x) = ∏_{uv∈E} T_{v|u}(x_v | x_u)

· T(x) factors over tree edges (see the sketch after this slide)
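A minimal sketch (my own illustration, not code from the talk) of evaluating the factored form of T(x). The tree, the pairwise tables T_uv, and the node marginals T_v are hypothetical, chosen to be mutually consistent.

```python
import numpy as np

edges = [(0, 1), (1, 2), (1, 3)]                 # a small tree over 4 binary variables
# Hypothetical pairwise marginals T_uv (each sums to 1); all node marginals are [0.6, 0.4].
T_uv = {e: np.array([[0.4, 0.2], [0.2, 0.2]]) for e in edges}
T_v = {v: np.array([0.6, 0.4]) for v in range(4)}
deg = {v: sum(v in e for e in edges) for v in range(4)}

def tree_prob(x):
    """Likelihood of configuration x under the tree factorization."""
    p = 1.0
    for (u, v) in edges:
        p *= T_uv[(u, v)][x[u], x[v]]
    for v in range(4):
        p /= T_v[v][x[v]] ** (deg[v] - 1)
    return p

print(tree_prob((0, 1, 0, 1)))
# the probabilities of all 2^4 configurations sum to 1:
print(sum(tree_prob(x) for x in np.ndindex(2, 2, 2, 2)))
```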
Examples
· Splice junction domain
  [Figure: learned tree over the junction type and sequence positions -7 ... +8]
· Premature babies' Broncho-Pulmonary Disease (BPD)
  [Figure: learned tree over PulmHemorrh, Coag, HyperNa, Acidosis, Gestation, Thrombocyt, Weight, Hypertension, Temperature, BPD, Neutropenia, Suspect, Lipid]
Trees - basic operations
T(x) = ∏_{uv∈E} T_uv(x_u, x_v) / ∏_{v∈V} T_v(x_v)^(deg v − 1),    |V| = n

· Querying the model
  · computing the likelihood T(x) ~ n
  · conditioning T_{V−A|A} (junction tree algorithm) ~ n
  · marginalization T_uv for arbitrary u, v ~ n
  · sampling ~ n (see the sketch after this slide)
· Estimating the model
  · fitting to a given distribution ~ n^2
  · learning from data ~ n^2 N_data
· The tree is a simple model
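A rough sketch (mine, not from the talk) of ancestral sampling using the directed form T_{v|u}: root the tree, then draw each variable from its conditional given its parent. Rooting at node 0 and the parent/order bookkeeping are my assumptions; the tables are the same hypothetical ones as above.

```python
import numpy as np

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (1, 3)]                  # same hypothetical tree as above
T_uv = {e: np.array([[0.4, 0.2], [0.2, 0.2]]) for e in edges}
T_v = {v: np.array([0.6, 0.4]) for v in range(4)}

parent = {1: 0, 2: 1, 3: 1}                       # tree rooted at node 0
order = [0, 1, 2, 3]                              # parents come before children

def sample():
    x = {}
    for v in order:
        if v not in parent:                       # root: draw from its marginal
            x[v] = rng.choice(2, p=T_v[v])
        else:
            u = parent[v]
            pair = T_uv[(u, v)] if (u, v) in T_uv else T_uv[(v, u)].T
            cond = pair[x[u]] / T_v[u][x[u]]      # T_{v|u}( . | x_u)
            x[v] = rng.choice(2, p=cond)
    return [x[v] for v in order]

print(sample())
```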
The mixture of trees
Q(x) = Σ_{k=1..m} λ_k T_k(x)    (see the sketch after this slide)

h = "hidden" variable,  P(h = k) = λ_k,  k = 1, 2, ..., m
· NOT a graphical model
· computational efficiency preserved
(Meila 97)
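A tiny sketch (mine, not from the talk) of evaluating the mixture Q(x) = Σ_k λ_k T_k(x); any callables returning T_k(x), such as the tree_prob function sketched earlier, can serve as components. The two dummy components below are purely illustrative.

```python
def mixture_prob(x, lambdas, components):
    """lambdas: mixture weights summing to 1; components: callables returning T_k(x)."""
    return sum(lam * T(x) for lam, T in zip(lambdas, components))

# Toy usage with two dummy distributions over 4 binary variables standing in for trees:
uniform = lambda x: 1.0 / 16
spiky = lambda x: 0.5 if x == (0, 0, 0, 0) else 0.5 / 15
print(mixture_prob((0, 0, 0, 0), [0.3, 0.7], [uniform, spiky]))
```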
Learning - problem formulation
· Maximum Likelihood learning
· given a data set D = { x1, . . . xN }
· find the model that best predicts the data
T^opt = argmax_T T(D),  where T(D) = ∏_{i=1..N} T(x_i)
· Fitting a tree to a distribution
· given a data set D = { x1, . . . xN }
and distribution P that weights each data point,
· find
T^opt = argmin_T KL( P || T )
· KL is the Kullback-Leibler divergence
· includes Maximum likelihood learning as a special case
Fitting a tree to a distribution
T^opt = argmin_T KL( P || T )
· optimization over structure + parameters
· sufficient statistics
· pairwise probability tables P_uv = N_uv / N, for all u, v ∈ V
· mutual informations I_uv (see the sketch after this slide)

  I_uv = Σ_{x_u, x_v} P_uv(x_u, x_v) log [ P_uv(x_u, x_v) / (P_u(x_u) P_v(x_v)) ]
(Chow & Liu 68)
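A minimal sketch (assumptions mine) of the sufficient statistics: given a table of co-occurrence counts N_uv(x_u, x_v), form P_uv = N_uv / N and compute the mutual information I_uv. The count matrices below are hypothetical.

```python
import numpy as np

def mutual_information(N_uv):
    """I_uv from a matrix of co-occurrence counts N_uv(x_u, x_v)."""
    P_uv = N_uv / N_uv.sum()
    P_u = P_uv.sum(axis=1, keepdims=True)
    P_v = P_uv.sum(axis=0, keepdims=True)
    mask = P_uv > 0                                   # convention: 0 log 0 = 0
    return float((P_uv[mask] * np.log(P_uv[mask] / (P_u @ P_v)[mask])).sum())

print(mutual_information(np.array([[40., 10.], [10., 40.]])))   # dependent pair: > 0
print(mutual_information(np.array([[25., 25.], [25., 25.]])))   # independent pair: 0.0
```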
Fitting a tree to a distribution - solution
· Structure

  E^opt = argmax_E Σ_{uv∈E} I_uv

  · found by the Maximum Weight Spanning Tree algorithm with edge weights I_uv (see the sketch after this slide)
  [Figure: complete graph with edge weights I_12, I_23, I_34, I_45, I_56, I_61, I_63]
· Parameters
  · copy marginals of P:  T_uv = P_uv for uv ∈ E
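A rough sketch (my own Prim's-algorithm implementation, not the talk's code) of the structure step: a maximum weight spanning tree under mutual-information edge weights. The weight matrix is hypothetical.

```python
import numpy as np

def max_weight_spanning_tree(I):
    """Prim's algorithm; I is a symmetric n x n matrix of edge weights I_uv."""
    n = I.shape[0]
    in_tree = [0]
    edges = []
    while len(in_tree) < n:
        best = None
        for u in in_tree:
            for v in range(n):
                if v not in in_tree and (best is None or I[u, v] > I[best[0], best[1]]):
                    best = (u, v)
        edges.append(best)
        in_tree.append(best[1])
    return edges

# Hypothetical mutual-information weights for 4 variables:
I = np.array([[0., .5, .1, .1],
              [.5, 0., .4, .2],
              [.1, .4, 0., .3],
              [.1, .2, .3, 0.]])
print(max_weight_spanning_tree(I))   # [(0, 1), (1, 2), (2, 3)]
```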
Learning mixtures by the EM algorithm
Meila & Jordan ‘97
· E step: which x_i come from T^k?  →  weighted distribution P^k(x) per component
· M step: fit T^k to its weighted set of points:  min KL( P^k || T^k )
  (see the sketch after this slide)
· Initialize randomly
· converges to local maximum of the likelihood
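A schematic EM loop (my own sketch). To keep it short, the M-step below fits independent Bernoulli factors per component rather than a full weighted Chow-Liu tree; the E-step responsibilities and the weighted M-step refit have the same structure as in the mixture-of-trees algorithm. All data and sizes are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10))                # synthetic binary data, N x n
m = 3                                                 # number of mixture components
lam = np.full(m, 1.0 / m)                             # mixture weights
theta = rng.uniform(0.3, 0.7, size=(m, X.shape[1]))   # P(x_v = 1 | component k)

for it in range(20):
    # E step: responsibilities gamma[i, k] proportional to lam_k * T_k(x_i)
    logp = X @ np.log(theta.T) + (1 - X) @ np.log(1 - theta.T) + np.log(lam)
    logp -= logp.max(axis=1, keepdims=True)
    gamma = np.exp(logp)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M step: refit each component to its weighted data (here: weighted means)
    lam = gamma.mean(axis=0)
    theta = (gamma.T @ X + 1e-6) / (gamma.sum(axis=0)[:, None] + 2e-6)

print(lam)
```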
Remarks
· Learning a tree
  · solution is globally optimal over structures and parameters
  · tractable: running time ~ n^2 N
· Learning a mixture by the EM algorithm
  · both E and M steps are exact, tractable
  · running time
    • E step ~ mnN
    • M step ~ mn^2 N
  · assumes m known
  · converges to a local optimum
Finding structure - the bars problem
[Figure: bars data (n = 25) and the learned structure]
· Structure recovery: 19 out of 20 trials
· Hidden variable accuracy: 0.85 +/- 0.08 (ambiguous), 0.95 +/- 0.01 (unambiguous)
· Data likelihood [bits/data point]: true model 8.58, learned model 9.82 +/- 0.95
Experiments - density estimation
· Digits and digit pairs
Ntrain = 6000 Nvalid = 2000 Ntest = 5000
n = 64 variables ( m = 16 trees )
n = 128 variables ( m = 32 trees )
DNA splice junction classification
· n = 61 variables
· class = Intron/Exon, Exon/Intron, Neither
[Figure: classification accuracy - Tree vs. TANB, NB, and supervised methods from DELVE]
Discovering structure
[Figure: tree adjacency matrix; the class variable connects to positions around the junction]

IE junction (Intron | Exon):
  position: 15  16  ...  25  26  27  28  29  30  31
  Tree:      -  CT  ...  CT  CT   -  CT   A   G   G
  True:     CT  CT  ...  CT  CT   -  CT   A   G   G
  (Watson, "The Molecular Biology of the Gene", 1987)

EI junction (Exon | Intron):
  position: 28  29  30  31  32  33  34  35  36
  Tree:     CA   A   G   G   T  AG   A   G   -
  True:     CA   A   G   G   T  AG   A   G   T
Irrelevant variables
61 original variables + 60 “noise” variables
[Figure: results on the original data vs. data augmented with irrelevant variables]
Accelerated tree learning
· Running time for the tree learning algorithm ~ n^2 N
· Quadratic running time may be too slow.
  Example: document classification
  · document = data point  →  N = 10^3 - 10^4
  · word = variable  →  n = 10^3 - 10^4
  · sparse data  →  # words in a document ≈ s, with s << n, N
· Can sparsity be exploited to create faster algorithms?
Meila ‘99
Sparsity
· assume a special value "0" that occurs frequently
· sparsity s = # non-zero variables in each data point,  s << n, N
· Idea: "do not represent / count zeros" (see the sketch after this slide)
  [Figure: sparse binary data rows (e.g. 010000100001000) stored as linked lists of length ~ s]
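A minimal sketch (assumptions mine) of the sparsity idea: each data point is stored as the list of its non-zero variables, and the single and pairwise non-zero counts are accumulated in ~ s and ~ s^2 operations per point instead of ~ n and ~ n^2. The data points are made up.

```python
from collections import Counter

data = [
    [1, 6, 11],        # a data point = indices of its non-zero variables
    [3, 9],
    [1, 14],
]

N_v = Counter()        # single-variable non-zero counts
N_uv = Counter()       # pairwise non-zero co-occurrence counts
for nz in data:
    N_v.update(nz)
    N_uv.update((u, v) for i, u in enumerate(nz) for v in nz[i + 1:])

print(N_v)             # e.g. Counter({1: 2, 6: 1, 11: 1, ...})
print(N_uv)            # e.g. Counter({(1, 6): 1, (1, 11): 1, (6, 11): 1, ...})
```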
Presort mutual informations
Theorem (Meila '99) If v, v' are variables that do not co-occur with u
in the data (i.e. N_uv = N_uv' = 0), then
      N_v > N_v'  ==>  I_uv > I_uv'
· Consequences (see the check after this slide)
  · sort the N_v => all edges uv with N_uv = 0 are implicitly sorted by I_uv
  · these edges need not be represented explicitly
  · construct a black box that outputs the next "largest" edge
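A quick numeric check (my own, not from the paper) of the presorting property: when u and v never co-occur, I_uv depends on v only through N_v and grows with N_v, so sorting variables by N_v orders these edges by I_uv. The counts below are hypothetical.

```python
import numpy as np

def I_given_counts(N, N_u, N_v):
    """Mutual information for binary u, v that never co-occur (N_uv = 0)."""
    P = np.array([[N - N_u - N_v, N_v], [N_u, 0.0]]) / N   # cells (0,0),(0,1),(1,0),(1,1)
    Pu = P.sum(axis=1, keepdims=True)
    Pv = P.sum(axis=0, keepdims=True)
    mask = P > 0
    return float((P[mask] * np.log(P[mask] / (Pu @ Pv)[mask])).sum())

N, N_u = 1000, 50
print([round(I_given_counts(N, N_u, N_v), 5) for N_v in (10, 20, 40, 80)])
# the values increase with N_v, as the theorem states
```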
The black box data structure
[Figure: for each vertex v, a list of the u with N_uv > 0 sorted by I_uv, plus a virtual list of the u with N_uv = 0 sorted by N_v; an F-heap of size ~ n delivers the next edge uv]

· Total running time ~ n log n + s^2 N + nK log n
  (standard algorithm running time ~ n^2 N)
Experiments - sparse binary data
[Figure: running time of the standard vs. accelerated algorithm]
· N = 10,000
· s = 5, 10, 15, 100
Remarks
· Realistic assumption
· Exact algorithm, provably efficient time bounds
· Degrades slowly to the standard algorithm if the data are not sparse
· General
  · non-integer counts
  · multi-valued discrete variables
Bayesian learning of trees
Meila & Jaakkola ‘00
· Problem
· given prior distribution over trees P0(T)
data D = { x1, . . . xN }
· find posterior distribution P(T|D)
· Advantages
· incorporates prior knowledge
· regularization
· Solution
· Bayes' formula

  P(T|D) = (1/Z) P_0(T) ∏_{i=1..N} T(x_i)

· practically hard
  • the distribution over structure E and parameters θ_E is hard to represent
  • computing Z is intractable in general
  • exception: conjugate priors
Decomposable priors
· want priors that factor over tree edges

  P_0(T) = ∏_{uv∈E} f( u, v, θ_{u|v} )

· prior for structure E

  P_0(E) ∝ ∏_{uv∈E} β_uv

· prior for tree parameters

  P_0(θ_E) = ∏_{uv∈E} D( θ_{u|v} ; N'_uv )

  · (hyper-)Dirichlet with hyper-parameters N'_uv(x_u, x_v), for all u, v ∈ V
  · posterior is also Dirichlet, with hyper-parameters N_uv(x_u, x_v) + N'_uv(x_u, x_v), for all u, v ∈ V
Decomposable posterior
· Posterior distribution

  P(T|D) ∝ ∏_{uv∈E} W_uv,    W_uv = β_uv D( θ_{u|v} ; N'_uv + N_uv )

· factored over edges
· same form as the prior
· Remains to compute the normalization constant
  • discrete part: graph theory (matrix tree theorem)
  • continuous part: Meila & Jaakkola '99
The Matrix tree theorem
· Matrix tree theorem

  If  P_0(E) = (1/Z) ∏_{uv∈E} β_uv,  β_uv ≥ 0,
  then  Z = det M(β),

  where M(β) has diagonal entries M_vv = Σ_{v'} β_vv' and off-diagonal entries
  M_uv = −β_uv (a first minor of the weighted graph Laplacian); see the sketch below.
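A minimal sketch (mine, not from the talk) of the matrix tree computation: Z as the determinant of a first minor of the weighted Laplacian. With all-ones weights on 4 vertices this recovers Cayley's count of 4^(4−2) = 16 spanning trees.

```python
import numpy as np

def tree_partition_function(beta):
    """beta: symmetric n x n matrix of non-negative edge weights, zero diagonal."""
    L = np.diag(beta.sum(axis=1)) - beta       # weighted graph Laplacian
    return np.linalg.det(L[1:, 1:])            # any first minor gives Z

beta = np.ones((4, 4)) - np.eye(4)             # all edge weights equal to 1
print(tree_partition_function(beta))           # 16.0
```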
Remarks on the decomposable prior
· Is a conjugate prior for the tree distribution
· Is tractable
  · defined by ~ n^2 parameters
  · computed exactly in ~ n^3 operations
  · posterior obtained in ~ n^2 N + n^3 operations
  · derivatives w.r.t. parameters, averaging, ... ~ n^3
· Mixtures of trees with decomposable priors
  · MAP estimation with the EM algorithm is tractable
· Other applications
  · ensembles of trees
  · maximum entropy distributions on trees
So far . .
· Trees and mixtures of trees are structured statistical models
· Algorithmic techniques enable efficient learning
• mixture of trees
• accelerated algorithm
• matrix tree theorem & Bayesian learning
· Examples of usage
· Structure learning
· Compression
· Classification
Generative models and discrimination
· Trees are generative models
· descriptive
· can perform many tasks suboptimally
· Maximum Entropy discrimination (Jaakkola, Meila, Jebara '99)
  · optimize for specific tasks
  · use generative models
  · combine simple models into ensembles
  · complexity control by an information-theoretic principle
· Discrimination tasks
· detecting novelty
· diagnosis
· classification
Bridging the gap
[Figure: bridging descriptive learning and discriminative learning toward the tasks]
Future . . .
· Tasks have structure
  · multi-way classification
  · multiple indexing of documents
  · gene expression data
  · hierarchical, sequential decisions
· Learn structured decision tasks
  · sharing information between tasks (transfer)
  · modeling dependencies between decisions