A brief maximum entropy tutorial
Overview
• Statistical modeling addresses the problem of
modeling the behavior of a random process
• In constructing this model, we typically have at
our disposal a sample of output from the process.
From the sample, which constitutes an incomplete
state of knowledge about the process, the
modeling problem is to parlay this knowledge into
a succinct, accurate representation of the process
• We can then use this representation to make
predictions of the future behavior of the process
Motivating example
• Suppose we wish to model an expert translator’s
decisions concerning the proper French rendering
of the English word in.
• A model p of the expert’s decisions assigns to each
French word or phrase f an estimate, p(f), of the
probability that the expert would choose f as a
translation of in.
• To develop p, collect a large sample of instances of
the expert’s decisions
Motivating example
• Our goal is to
– Extract a set of facts about the decision-making
process from the sample (the first task of
modeling)
– Construct a model of this process (the second
task)
Motivating example
• One obvious clue we might glean from the sample
is the list of allowed translations
– in → {dans, en, à, au cours de, pendant}
• With this information in hand, we can impose our
first constraint on our model p:
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
This equation represents our first statistic of the
process; we can now proceed to search for a
suitable model which obeys this equation
– There are an infinite number of models p for which this identity holds
Motivating example
• One model which satisfies the above equation is
p(dans)=1; in other words, the model always
predicts dans.
• Another model which obeys this constraint
predicts pendant with a probability of ½, and à
with a probability of ½.
• But both of these models offend our sensibilities:
knowing only that the expert always chose from
among these five French phrases, how can we
justify either of these probability distributions?
Motivating example
• Knowing only that the expert chose exclusively from
among these five French phrases, the most intuitively
appealing model is
p(dans) = 1/5
p(en) = 1/5
p(à) = 1/5
p(au cours de) = 1/5
p(pendant) = 1/5
This model, which allocates the total probability evenly among
the five possible phrases, is the most uniform model subject to
our knowledge
It is not, however, the most uniform overall; that model would
grant an equal probability to every possible French phrase.
Motivating example
• We might hope to glean more clues about the expert’s
decisions from our sample.
• Suppose we notice that the expert chose either dans or en
30% of the time
p(dans) + p(en) = 3/10
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
Once again there are many probability distributions
consistent with these two constraints.
• In the absence of any other knowledge, a reasonable choice for p is again the most uniform – that is, the distribution which allocates its probability as evenly as possible, subject to the constraints:
p(dans) = 3/20
p(en) = 3/20
p(à) = 7/30
p(au cours de) = 7/30
p(pendant) = 7/30
Motivating example
• Say we inspect the data once more, and this time notice
another interesting fact: in half the cases, the expert chose
either dans or à. We can incorporate this information into
our model as a third constraint:
p(dans) + p(en) = 3/10
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
p(dans) + p(à) = 1/2
• We can once again look for the most uniform p satisfying
these constraints, but now the choice is not as obvious.
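The sketch below (not part of the original tutorial) finds this "most uniform" distribution numerically, assuming NumPy and SciPy are available: it maximizes the entropy of a five-outcome distribution subject to the three constraints above.

```python
# Not part of the original tutorial: a minimal numerical sketch, assuming NumPy
# and SciPy are available, of finding the most uniform distribution over the
# five candidate translations subject to the three constraints above.
import numpy as np
from scipy.optimize import minimize

PHRASES = ["dans", "en", "à", "au cours de", "pendant"]

def neg_entropy(p):
    """Negative entropy sum p log p; minimizing it maximizes H(p)."""
    p = np.clip(p, 1e-12, 1.0)           # avoid log(0)
    return float(np.sum(p * np.log(p)))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},           # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 3.0 / 10},  # p(dans) + p(en) = 3/10
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 1.0 / 2},   # p(dans) + p(à)  = 1/2
]

result = minimize(neg_entropy, x0=np.full(5, 0.2),            # start from uniform
                  method="SLSQP", bounds=[(0.0, 1.0)] * 5, constraints=constraints)
print(dict(zip(PHRASES, result.x.round(4))))
```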
Motivating example
• As we have added complexity, we have
encountered two problems:
– First, what exactly is meant by “uniform,” and how can
one measure the uniformity of a model?
– Second, having determined a suitable answer to these
questions, how does one find the most uniform model
subject to a set of constraints like those we have
described?
Motivating example
• The maximum entropy method answers both these
questions.
• Intuitively, the principle is simple:
– model all that is known and assume nothing about that
which is unknown
– In other words, given a collection of facts, choose a
model which is consistent with all the facts, but
otherwise as uniform as possible.
• This is precisely the approach we took in selecting
our model p at each step in the above example
Maxent Modeling
• Consider a random process which produces an
output value y, a member of a finite set Y.
– y may be any word in the set {dans, en, à, au cours de,
pendant}
• In generating y, the process may be influenced by
some contextual information x, a member of a
finite set X.
– x could include the words in the English sentence
surrounding in
• Our task is to construct a stochastic model that accurately represents the behavior of the random process
– i.e., a model of the probability that, given a context x, the process will output y
Training data
• Collect a large number of samples (x1, y1), (x2, y2),…,
(xN, yN)
– Each sample would consist of a phrase x containing the
words surrounding in, together with the translation y of
in which the process produced
$\tilde{p}(x, y) \equiv \frac{1}{N} \times \text{number of times that } (x, y) \text{ occurs in the sample}$
• Typically, a particular pair (x, y) will either not occur at all in the sample, or will occur at most a few times
– this sparsity is usually addressed by smoothing
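As a small illustrative sketch (the sample pairs below are invented), the empirical distribution above can be computed from a list of (context, translation) pairs as follows:

```python
# Illustrative sketch (the sample pairs are invented): estimating the empirical
# distribution p~(x, y) from a list of (context, translation) pairs.
from collections import Counter

samples = [("in April", "en"), ("in the box", "dans"),
           ("in April", "en"), ("in two weeks", "dans")]   # hypothetical (x, y) pairs
N = len(samples)
counts = Counter(samples)

def p_tilde(x, y):
    """p~(x, y) = (1/N) * number of times (x, y) occurs in the sample."""
    return counts[(x, y)] / N

print(p_tilde("in April", "en"))   # 0.5 for this toy sample
```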
Features and constraints
• The goal is to construct a statistical model of the process which generated the training sample $\tilde{p}(x, y)$
• The building blocks of this model will be a set of
statistics of the training sample
– The frequency that in translated to either dans or en
was 3/10
– The frequency that in translated to either dans or au
cours de was ½
– And so on
Statistics of the training sample $\tilde{p}(x, y)$
Features and constraints
• Conditioning information x
– E.g., in the training sample, if April is the word
following in, then the translation of in is en with
frequency 9/10
• Indicator function
$f(x, y) \equiv \begin{cases} 1 & \text{if } y = en \text{ and April follows in} \\ 0 & \text{otherwise} \end{cases}$
• Expected value of f with respect to the empirical distribution $\tilde{p}(x, y)$:

$\tilde{p}(f) \equiv \sum_{x, y} \tilde{p}(x, y)\, f(x, y) \qquad (1)$
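The following sketch, on invented toy data, shows the indicator feature of this slide and its empirical expectation from equation (1):

```python
# Illustrative sketch of equation (1): a binary feature f(x, y) and its empirical
# expectation p~(f) = sum_{x,y} p~(x, y) f(x, y).  The toy sample is invented.
from collections import Counter

samples = [("in April", "en"), ("in April", "en"), ("in the box", "dans"),
           ("in two weeks", "dans"), ("in April", "dans")]
p_tilde = {pair: n / len(samples) for pair, n in Counter(samples).items()}

def f(x, y):
    """1 if y = 'en' and 'April' follows 'in' in the context x, 0 otherwise."""
    return 1 if y == "en" and "April" in x else 0

empirical_f = sum(p_xy * f(x, y) for (x, y), p_xy in p_tilde.items())
print(empirical_f)   # the fraction of sample pairs on which the feature fires
```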
Features and constraints
• We can express any statistic of the sample as the
expected value of an appropriate binary-valued
indicator function f
– We call such a function a feature function, or feature for short
Features and constraints
• When we discover a statistic that we feel is useful,
we can acknowledge its importance by requiring
that our model accord with it
• We do this by constraining the expected value that
the model assigns to the corresponding feature
function f
• The expected value of f with respect to the model
p(y | x) is
$p(f) \equiv \sum_{x, y} \tilde{p}(x)\, p(y \mid x)\, f(x, y) \qquad (2)$

where $\tilde{p}(x)$ is the empirical distribution of x in the training sample
Features and constraints
• We constrain this expected value to be the same as
the expected value of f in the training sample. That
is, we require
$p(f) = \tilde{p}(f) \qquad (3)$
– We call the requirement (3) a constraint equation or
simply a constraint
• Combining (1), (2) and (3) yields
$\sum_{x, y} \tilde{p}(x)\, p(y \mid x)\, f(x, y) = \sum_{x, y} \tilde{p}(x, y)\, f(x, y)$
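A short sketch of equations (2) and (3) on toy data follows; the sample and the conditional model are invented for illustration.

```python
# Illustrative sketch of equations (2) and (3): the model expectation p(f) for an
# arbitrary conditional model p(y|x), and the constraint p(f) = p~(f).  The toy
# sample and the model values are invented.
from collections import Counter

samples = [("in April", "en"), ("in April", "en"), ("in the box", "dans")]
p_tilde_xy = {pair: n / len(samples) for pair, n in Counter(samples).items()}
p_tilde_x = Counter()
for (x, _), p in p_tilde_xy.items():
    p_tilde_x[x] += p                      # marginal empirical distribution p~(x)

model = {"in April":   {"en": 0.9, "dans": 0.1},   # a hypothetical model p(y|x)
         "in the box": {"en": 0.1, "dans": 0.9}}

def f(x, y):
    return 1 if y == "en" and "April" in x else 0

empirical_f = sum(p * f(x, y) for (x, y), p in p_tilde_xy.items())       # eq. (1)
model_f = sum(p_tilde_x[x] * p_y * f(x, y)                               # eq. (2)
              for x, dist in model.items() for y, p_y in dist.items())

print(abs(model_f - empirical_f))   # constraint (3) asks that this be zero
```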
Features and constraints
• To sum up so far, we now have
– A means of representing statistical phenomena inherent in a sample of data (namely, $\tilde{p}(f)$)
– A means of requiring that our model of the process exhibit these phenomena (namely, $p(f) = \tilde{p}(f)$)
• Feature:
– Is a binary-valued function of (x, y)
• Constraint
– Is an equation between the expected value of the feature
function in the model and its expected value in the
training data
The maxent principle
• Suppose that we are given n feature functions fi,
which determine statistics we feel are important in
modeling the process. We would like our model to
accord with these statistics
• That is, we would like p to lie in the subset C of P
defined by
$C \equiv \left\{\, p \in P \;\middle|\; p(f_i) = \tilde{p}(f_i) \text{ for } i = 1, 2, \ldots, n \,\right\} \qquad (4)$
Figure 1: (a) the space P of all probability models; (b) one linear constraint C1; (c) two consistent constraints C1 and C2; (d) two inconsistent constraints C1 and C2
• If we impose no constraints, then all probability models p ∈ P are allowable
• Imposing one linear constraint C1 restricts us to those p ∈ P which lie in the region defined by C1
• A second linear constraint could determine p exactly, if the two constraints are satisfiable; when the intersection of C1 and C2 is non-empty, p ∈ C1 ∩ C2
• Alternatively, a second linear constraint could be inconsistent with the first (i.e., C1 ∩ C2 = ∅); no p ∈ P can satisfy them both
The maxent principle
• In the present setting, however, the linear
constraints are extracted from the training sample
and cannot, by construction, be inconsistent
• Furthermore, the linear constraints in our
applications will not even come close to
determining p ∈ P uniquely as they do in (c); instead, the set C = C1 ∩ C2 ∩ … ∩ Cn of allowable models will be infinite
The maxent principle
• Among the models p ∈ C, the maximum entropy
philosophy dictates that we select the distribution
which is most uniform
• A mathematical measure of the uniformity of a
conditional distribution p(y|x) is provided by the
conditional entropy
$H(p) \equiv -\sum_{x, y} \tilde{p}(x)\, p(y \mid x) \log p(y \mid x) \qquad (5)$
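A small sketch of equation (5) follows; the empirical distribution and the model values are illustrative.

```python
# Illustrative sketch of equation (5): the conditional entropy H(p) of a model
# p(y|x).  The empirical p~(x) and the model values are invented.
import math

p_tilde_x = {"in April": 0.6, "in the box": 0.4}
model = {"in April":   {"en": 0.9, "dans": 0.1},
         "in the box": {"en": 0.2, "dans": 0.8}}

def conditional_entropy(p_tilde_x, model):
    """H(p) = -sum_{x,y} p~(x) p(y|x) log p(y|x)."""
    return -sum(px * p * math.log(p)
                for x, px in p_tilde_x.items()
                for p in model[x].values() if p > 0)

print(conditional_entropy(p_tilde_x, model))
```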
The maxent principle
• The principle of maximum entropy
– To select a model from a set C of allowed probability
distributions, choose the model p★ ∈ C with maximum
entropy H(p):
$p^\star = \operatorname*{arg\,max}_{p \in C} H(p) \qquad (6)$
Exponential form
• The maximum entropy principle presents us with a
problem in constrained optimization: find the
p★ ∈ C which maximizes H(p)
• Find

$p^\star = \operatorname*{arg\,max}_{p \in C} H(p) = \operatorname*{arg\,max}_{p \in C} \left( -\sum_{x, y} \tilde{p}(x)\, p(y \mid x) \log p(y \mid x) \right) \qquad (7)$
Exponential form
• We refer to this as the primal problem; it is a
succinct way of saying that we seek to maximize
H(p) subject to the following constraints:
– 1. $p(y \mid x) \geq 0$ for all x, y.
– 2. $\sum_y p(y \mid x) = 1$ for all x.
• This and the previous condition guarantee that p is a conditional probability distribution
– 3. $\sum_{x, y} \tilde{p}(x)\, p(y \mid x)\, f_i(x, y) = \sum_{x, y} \tilde{p}(x, y)\, f_i(x, y)$ for $i = 1, 2, \ldots, n$.
• In other words, p ∈ C, and so p satisfies the active constraints C
Exponential form
• To solve this optimization problem (equivalently, the minimization of −H(p) subject to the same constraints), introduce the Lagrangian

$\Xi(p, \Lambda, \gamma) \equiv \sum_{x, y} \tilde{p}(x)\, p(y \mid x) \log p(y \mid x) + \sum_i \lambda_i \left( \sum_{x, y} \tilde{p}(x, y)\, f_i(x, y) - \sum_{x, y} \tilde{p}(x)\, p(y \mid x)\, f_i(x, y) \right) + \gamma \left( \sum_y p(y \mid x) - 1 \right) \qquad (8)$
Exponential form
• Taking the derivative of the Lagrangian with respect to p(y|x) gives

$\frac{\partial \Xi}{\partial p(y \mid x)} = \tilde{p}(x)\left(1 + \log p(y \mid x)\right) - \sum_i \lambda_i\, \tilde{p}(x)\, f_i(x, y) + \gamma \qquad (9)$

Setting (9) equal to zero and solving for p(y|x):

$\tilde{p}(x)\left(1 + \log p(y \mid x)\right) - \sum_i \lambda_i\, \tilde{p}(x)\, f_i(x, y) + \gamma = 0$

$\tilde{p}(x)\left(1 + \log p(y \mid x)\right) = \sum_i \lambda_i\, \tilde{p}(x)\, f_i(x, y) - \gamma$

$\log p(y \mid x) = \sum_i \lambda_i\, f_i(x, y) - \frac{\gamma}{\tilde{p}(x)} - 1$

$p(y \mid x) = \exp\left( \sum_i \lambda_i\, f_i(x, y) \right) \exp\left( -\frac{\gamma}{\tilde{p}(x)} - 1 \right) \qquad (10)$
Exponential form
• We have thus found the parametric form of p★, and so we now take up the task of solving for the optimal values Λ★, γ★.
• Recognizing that the second exponential factor in (10) plays the role of a normalizing constant (it enforces the second of the constraints listed above), we can rewrite (10) as

$p^\star(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_i \lambda_i\, f_i(x, y) \right) \qquad (11)$

where Z(x), the normalizing factor, is given by

$Z(x) = \sum_y \exp\left( \sum_i \lambda_i\, f_i(x, y) \right) \qquad (12)$
Proof of (12):
The second constraint requires that $\sum_y p(y \mid x) = 1$ for every x.

$\sum_y p(y \mid x) = \sum_y \exp\left( \sum_i \lambda_i f_i(x, y) \right) \exp\left( -\frac{\gamma}{\tilde{p}(x)} - 1 \right) = 1$

$\Rightarrow \exp\left( 1 + \frac{\gamma}{\tilde{p}(x)} \right) = \sum_y \exp\left( \sum_i \lambda_i f_i(x, y) \right)$

$\Rightarrow Z(x) \equiv \exp\left( 1 + \frac{\gamma}{\tilde{p}(x)} \right) = \sum_y \exp\left( \sum_i \lambda_i f_i(x, y) \right)$
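The following sketch evaluates the exponential form of equations (11) and (12) for given parameters λi; the feature function and the parameter value are illustrative.

```python
# Illustrative sketch of equations (11) and (12): evaluating the exponential-form
# model for given parameters lambda_i.  The feature and parameter value are toy.
import math

def p_exponential(x, y, features, lambdas, Y):
    """p(y|x) = exp(sum_i lambda_i f_i(x, y)) / Z(x), with Z(x) as in (12)."""
    score = lambda yy: math.exp(sum(l * f(x, yy) for l, f in zip(lambdas, features)))
    Z_x = sum(score(yy) for yy in Y)       # normalizing factor Z(x)
    return score(y) / Z_x

Y = ["dans", "en", "à", "au cours de", "pendant"]
features = [lambda x, y: 1 if y == "en" and "April" in x else 0]
print(p_exponential("in April", "en", features, lambdas=[1.5], Y=Y))
```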
Exponential form
• We have found γ★ but not yet Λ★. Towards this end we introduce some further notation. Define the dual function Ψ(Λ) as

$\Psi(\Lambda) \equiv \Xi(p^\star, \Lambda, \gamma^\star) \qquad (13)$

and the dual optimization problem as

Find $\Lambda^\star = \operatorname*{arg\,max}_{\Lambda} \Psi(\Lambda) \qquad (14)$

• Since p★ and γ★ are determined by Λ (through (10)–(12)), the right-hand side of (14) has only the free variables Λ = {λ1, λ2, …, λn}.
Exponential form
• Final result
– The maximum entropy model subject to the constraints
C has the parametric form p★ of (11), where Λ★ can be
determined by maximizing the dual function Ψ(Λ)
Maximum likelihood
The log-likelihood $L_{\tilde{p}}(p)$ of the empirical distribution $\tilde{p}$ as predicted by a model p is defined by

$L_{\tilde{p}}(p) \equiv \log \prod_{x, y} p(y \mid x)^{\tilde{p}(x, y)} = \sum_{x, y} \tilde{p}(x, y) \log p(y \mid x) \qquad (15)$

It is easy to check that the dual function Ψ(Λ) of the previous section is, in fact, just the log-likelihood for the exponential model p; that is,

$\Psi(\Lambda) = L_{\tilde{p}}(p) \qquad (16)$

where p has the parametric form of (11). With this interpretation, the result of the previous section can be rephrased as:

The model p★ ∈ C with maximum entropy is the model in the parametric family p(y | x) of (11) that maximizes the likelihood of the training sample $\tilde{p}$.
Maximum likelihood
To verify (16), evaluate the Lagrangian (8) at the parametric form (11); the γ term vanishes because p★ is normalized:

$\Psi(\Lambda) = \Xi(p^\star, \Lambda, \gamma^\star)$

$= \sum_{x, y} \tilde{p}(x)\, p^\star(y \mid x) \log p^\star(y \mid x) + \sum_i \lambda_i \left( \tilde{p}(f_i) - \sum_{x, y} \tilde{p}(x)\, p^\star(y \mid x)\, f_i(x, y) \right)$

$= \sum_{x, y} \tilde{p}(x)\, p^\star(y \mid x) \left( \sum_i \lambda_i f_i(x, y) - \log Z(x) \right) + \sum_i \lambda_i \tilde{p}(f_i) - \sum_i \lambda_i \sum_{x, y} \tilde{p}(x)\, p^\star(y \mid x)\, f_i(x, y)$

$= -\sum_x \tilde{p}(x) \log Z(x) + \sum_i \lambda_i\, \tilde{p}(f_i)$

On the other hand, substituting (11) into the definition (15) of the log-likelihood gives

$L_{\tilde{p}}(p^\star) = \sum_{x, y} \tilde{p}(x, y) \log p^\star(y \mid x) = \sum_{x, y} \tilde{p}(x, y) \left( \sum_i \lambda_i f_i(x, y) - \log Z(x) \right) = \sum_i \lambda_i\, \tilde{p}(f_i) - \sum_x \tilde{p}(x) \log Z(x)$

The two expressions coincide, which establishes (16).
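The identity just derived can also be checked numerically. In the sketch below (toy data and an arbitrary parameter vector, all invented), the log-likelihood of equation (15) agrees with the expression involving Z(x) and the empirical feature expectations:

```python
# Numerical sketch checking the identity above: for a toy sample and an arbitrary
# parameter vector, sum_{x,y} p~(x,y) log p(y|x) equals
# sum_i lambda_i p~(f_i) - sum_x p~(x) log Z(x).  All data are invented.
import math
from collections import Counter

samples = [("in April", "en"), ("in April", "dans"), ("in the box", "dans")]
Y = ["dans", "en", "à", "au cours de", "pendant"]
features = [lambda x, y: 1 if y == "en" and "April" in x else 0,
            lambda x, y: 1 if y == "dans" else 0]
lambdas = [0.7, -0.3]

p_xy = {pair: n / len(samples) for pair, n in Counter(samples).items()}
p_x = Counter()
for (x, _), p in p_xy.items():
    p_x[x] += p

def score(x, y):
    return math.exp(sum(l * f(x, y) for l, f in zip(lambdas, features)))

def Z(x):
    return sum(score(x, y) for y in Y)

log_likelihood = sum(p * math.log(score(x, y) / Z(x)) for (x, y), p in p_xy.items())
dual = (sum(l * sum(p * f(x, y) for (x, y), p in p_xy.items())
            for l, f in zip(lambdas, features))
        - sum(p * math.log(Z(x)) for x, p in p_x.items()))
print(math.isclose(log_likelihood, dual))   # True
```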
Outline (Maxent Modeling summary)
• We began by seeking the conditional distribution p(y|x)
which had maximal entropy H(p) subject to a set of linear
constraints (7)
• Following the traditional procedure in constrained
optimization, we introduced the Lagrangian Ξ(p, Λ, γ), where Λ, γ are a set of Lagrange multipliers for the constraints we imposed on p(y|x)
• To find the solution to the optimization problem, we appealed to the Kuhn-Tucker theorem, which states that we can (1) first solve Ξ(p, Λ, γ) for p to get a parametric form for p★ in terms of Λ, γ; (2) then plug p★ back into Ξ(p, Λ, γ), this time solving for Λ★, γ★.
Outline (Maxent Modeling summary)
• The parametric form for p★ turns out to have the
exponential form (11)
• The γ★ gives rise to the normalizing factor Z(x), given in (12)
• The Λ★ will be solved for numerically using the dual function (14). Furthermore, it so happens that this function, Ψ(Λ), is the log-likelihood for the exponential model p of (11). So what started as the maximization of entropy
subject to a set of linear constraints turns out to be
equivalent to the unconstrained maximization of likelihood
of a certain parametric family of distributions.
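As an aside (not part of the tutorial, which presents Improved Iterative Scaling below), this unconstrained view suggests fitting the λi with any general-purpose optimizer. A sketch assuming SciPy and NumPy are available:

```python
# Not part of the tutorial (which presents Improved Iterative Scaling below):
# a sketch, assuming SciPy is available, of fitting the lambdas by unconstrained
# maximization of the log-likelihood with a general-purpose optimizer.
import math
from collections import Counter
import numpy as np
from scipy.optimize import minimize

samples = [("in April", "en"), ("in April", "en"), ("in the box", "dans")]
Y = ["dans", "en", "à", "au cours de", "pendant"]
features = [lambda x, y: 1 if y == "en" and "April" in x else 0,
            lambda x, y: 1 if y == "dans" else 0]
p_xy = {pair: n / len(samples) for pair, n in Counter(samples).items()}

def neg_log_likelihood(lambdas):
    def log_p(x, y):
        scores = {yy: sum(l * f(x, yy) for l, f in zip(lambdas, features)) for yy in Y}
        log_Z = math.log(sum(math.exp(s) for s in scores.values()))
        return scores[y] - log_Z
    return -sum(p * log_p(x, y) for (x, y), p in p_xy.items())

result = minimize(neg_log_likelihood, x0=np.zeros(len(features)), method="L-BFGS-B")
print(result.x)   # fitted lambda_1, lambda_2
```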
Outline (Maxent Modeling summary)
• Table 1 summarizes the primal-dual framework

Table 1
                 Primal                         Dual
problem          argmax_{p∈C} H(p)              argmax_Λ Ψ(Λ)
description      maximum entropy                maximum likelihood
type of search   constrained optimization       unconstrained optimization
search domain    p ∈ C                          real-valued vectors {λ1, λ2, …, λn}
solution         p★                             Λ★

Kuhn-Tucker theorem: p★ = p_Λ★
Computing the parameters
Algorithm 1: Improved Iterative Scaling

Input: feature functions f1, f2, …, fn; empirical distribution $\tilde{p}(x, y)$
Output: optimal parameter values λi★; optimal model p★

1. Start with λi = 0 for all i ∈ {1, 2, …, n}
2. Do for each i ∈ {1, 2, …, n}:
   a. Let Δλi be the solution to
      $\sum_{x, y} \tilde{p}(x)\, p(y \mid x)\, f_i(x, y) \exp\left( \Delta\lambda_i\, f^{\#}(x, y) \right) = \tilde{p}(f_i) \qquad (18)$
      where
      $f^{\#}(x, y) \equiv \sum_{i=1}^{n} f_i(x, y) \qquad (19)$
   b. Update the value of λi according to: λi ← λi + Δλi
3. Go to step 2 if not all the λi have converged
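A compact Python sketch of Algorithm 1 follows, assuming binary-valued features that each fire on at least one sampled (x, y) pair; equation (18) is solved for each Δλi by simple bisection here, and the data structures and helper names are illustrative.

```python
# A compact sketch of Algorithm 1, assuming binary-valued features that each fire
# on at least one sampled (x, y) pair; equation (18) is solved for each delta_i by
# simple bisection.  Data structures and helper names are illustrative.
import math
from collections import Counter

def train_iis(samples, features, Y, iterations=50):
    p_xy = {pair: n / len(samples) for pair, n in Counter(samples).items()}
    p_x = Counter()
    for (x, _), p in p_xy.items():
        p_x[x] += p                                        # empirical p~(x)

    lambdas = [0.0] * len(features)                        # step 1: lambda_i = 0
    f_sharp = lambda x, y: sum(f(x, y) for f in features)  # eq. (19)
    p_f_tilde = [sum(p * f(x, y) for (x, y), p in p_xy.items()) for f in features]

    def p_model(x, y):                                     # exponential form (11)-(12)
        score = lambda yy: math.exp(sum(l * f(x, yy) for l, f in zip(lambdas, features)))
        return score(y) / sum(score(yy) for yy in Y)

    for _ in range(iterations):                            # steps 2-3
        for i, f in enumerate(features):
            # g(delta) = sum_{x,y} p~(x) p(y|x) f_i(x,y) exp(delta f#(x,y)) - p~(f_i)
            def g(delta):
                return sum(p_x[x] * p_model(x, y) * f(x, y) * math.exp(delta * f_sharp(x, y))
                           for x in p_x for y in Y) - p_f_tilde[i]
            lo, hi = -10.0, 10.0                           # bisection for the root of g
            for _ in range(50):
                mid = (lo + hi) / 2
                lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
            lambdas[i] += (lo + hi) / 2                    # step 2b
    return lambdas
```

Bisection is used here only to keep the sketch short; Newton's method is a common alternative for solving (18).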