Statistics 550 Notes 5
Reading: Sections 1.4, 1.5
I. Prediction (Chapter 1.4)
A common decision problem is that we want to predict a
variable Y based on a covariate vector Z .
Examples: (1) Predict whether a patient, hospitalized due to
a heart attack, will have a second heart attack. The
prediction is to be based on demographic, diet and clinical
measurements for that patient; (2) Predict the price of a
stock 6 months from now, on the basis of company
performance measures and economic data; (3) Predict the
numbers in a handwritten ZIP code, from a digitized image.
We typically have a "training" sample $(Y_1, Z_1), \dots, (Y_n, Z_n)$ available from the joint distribution of $(Y, Z)$, and we want to predict $Y_{new}$ for a new observation from the distribution of $(Y, Z)$ for which we know $Z = Z_{new}$.
In Section 1.4, we consider how to make predictions when
we know the joint distribution of (Y , Z ) ; in practice, we
often have only an estimate of the joint distribution based
on the training sample.
Let $g(Z)$ be a rule for predicting $Y$ based on $Z$. A criterion that is often used for judging different prediction rules is the mean squared prediction error
$$\Delta^2(Y, g(Z)) = E[\{g(Z) - Y\}^2 \mid Z]$$
-- this is the average squared prediction error when $g(Z)$ is used to predict $Y$ for a particular $Z$. We want $\Delta^2(Y, g(Z))$ to be as small as possible.

Theorem 1.4.1: Let $\mu(Z) = E(Y \mid Z)$. Then $\mu(Z)$ is the best prediction rule under mean squared prediction error.

Proof: For any prediction rule $g(Z)$,
$$
\begin{aligned}
\Delta^2(Y, g(Z)) &= E[\{g(Z) - Y\}^2 \mid Z] = E[\{(g(Z) - E(Y \mid Z)) + (E(Y \mid Z) - Y)\}^2 \mid Z] \\
&= E[(g(Z) - E(Y \mid Z))^2 \mid Z] + 2\,E[(g(Z) - E(Y \mid Z))(E(Y \mid Z) - Y) \mid Z] + E[(E(Y \mid Z) - Y)^2 \mid Z] \\
&= E[(g(Z) - E(Y \mid Z))^2 \mid Z] + E[(E(Y \mid Z) - Y)^2 \mid Z] \\
&\geq E[(E(Y \mid Z) - Y)^2 \mid Z] = \Delta^2(Y, E(Y \mid Z)),
\end{aligned}
$$
where the cross term vanishes because, given $Z$, $g(Z) - E(Y \mid Z)$ is a constant and $E[E(Y \mid Z) - Y \mid Z] = 0$.
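The following small simulation (an illustration, not part of the notes; the joint distribution $Y = Z^2 + \varepsilon$ is an assumed example) sketches Theorem 1.4.1: the conditional-mean rule $g(Z) = E(Y \mid Z) = Z^2$ has smaller Monte Carlo mean squared prediction error than competing rules.

```python
import numpy as np

# Assumed joint distribution for illustration: Z ~ N(0,1), Y = Z^2 + noise,
# so the conditional mean is E(Y | Z) = Z^2.
rng = np.random.default_rng(0)
n = 100_000
Z = rng.normal(size=n)
Y = Z**2 + rng.normal(scale=0.5, size=n)

def mspe(g):
    """Monte Carlo estimate of E[{g(Z) - Y}^2]."""
    return np.mean((g(Z) - Y) ** 2)

print("conditional mean g(Z) = Z^2 :", mspe(lambda z: z**2))                 # ~ 0.25 (the noise variance)
print("linear rule      g(Z) = 1+Z :", mspe(lambda z: 1 + z))                # larger
print("constant rule    g(Z) = E[Y]:", mspe(lambda z: np.full_like(z, Y.mean())))  # larger still
```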
II. Sufficient Statistics
Our usual setting: We observe data $X$ from a distribution $P$, where we do not know the true $P$ but only know that $P \in \mathcal{P} = \{P_\theta, \theta \in \Theta\}$ (the statistical model).
The observed sample of data X may be very complicated
(e.g., in the handwritten zip code example from Notes 1,
the data consist of 500 digitized image matrices). An experimenter may
wish to summarize the information in a sample by
determining a few key features of the sample values, e.g.,
the sample mean, the sample variance or the largest
observation. These are all examples of statistics.
Recall: A statistic $Y = T(X)$ is a random variable or random vector that is a function of the data.

A statistic is sufficient if it carries all the information in the data about the parameter vector $\theta$. $T(X)$ can be a scalar or a vector. If $T(X)$ is of lower "dimension" than $X$, then we have a good summary of the data that does not discard any important information.
For example, consider a sequence of independent Bernoulli trials with unknown probability of success $\theta$. We may have the intuitive feeling that the total number of successes contains all the information about $\theta$ that there is in the sample, and that the order in which the successes occurred, for example, does not give any additional information. The following definition formalizes this idea:
Definition: A statistic $Y = T(X)$ is sufficient for $\theta$ if the conditional distribution of $X$ given $Y = y$ does not depend on $\theta$ for any value of $y$.¹
¹ This definition is not quite precise. Difficulties arise when $P_\theta(Y = y) = 0$, so that the conditioning event has probability zero. The definition of conditional probability can then be changed at one or more values of $y$ (in fact at any set of $y$ values which has probability zero) without affecting the distribution of $X$, which is the result of combining the distribution of $Y$ with the conditional distribution of $X$ given $Y$. In general, there can be more than one version of the conditional probability distribution $P(X \mid Y)$ which together with the distribution of $Y$ leads back to the distribution of $X$. We define a statistic as sufficient if there exists at least one version of the conditional probability distributions $P(X \mid Y)$ which are the same for all $\theta$. See Lehmann and Casella, Theory of Point Estimation, 2nd Edition, Chapter 1.6, pp. 34-35, for further discussion. For our purposes, we define $Y$ to be a sufficient statistic if (i) for discrete distributions of the data, for each $y$ that has positive probability for at least one $\theta$, the conditional probability $P_\theta(X \mid Y = y)$ does not depend on $\theta$ for all $\theta$ for which $P_\theta(Y = y) > 0$; and (ii) for continuous distributions of the data, for each $y$ that has positive density for at least one $\theta$, the conditional probability density $f_\theta(X \mid Y = y)$ does not depend on $\theta$ for all $\theta$ for which $f_\theta(Y = y) > 0$.

Implication: If a statistic $Y = T(X)$ is sufficient, then if we already know the value of the statistic, knowing the full data $X$ does not provide any additional information about $\theta$.

Example 1: Let $X_1, \dots, X_n$ be a sequence of independent Bernoulli random variables with $P(X_i = 1) = \theta$. We will verify that $Y = \sum_{i=1}^n X_i$ is sufficient for $\theta$. Consider
$$P(X_1 = x_1, \dots, X_n = x_n \mid \textstyle\sum_{i=1}^n X_i = y).$$
For $x_1, \dots, x_n \in \{0, 1\}$ with $\sum_{i=1}^n x_i = y$, we have
$$
P(X_1 = x_1, \dots, X_n = x_n \mid \textstyle\sum_{i=1}^n X_i = y)
= \frac{P(X_1 = x_1, \dots, X_n = x_n, Y = y)}{P(Y = y)}
= \frac{\theta^y (1-\theta)^{n-y}}{\binom{n}{y}\, \theta^y (1-\theta)^{n-y}}
= \frac{1}{\binom{n}{y}}.
$$
The conditional distribution thus does not involve $\theta$ at all, and thus $Y = \sum_{i=1}^n X_i$ is sufficient for $\theta$.
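A quick simulation can illustrate this conclusion (this is an assumed illustrative setup, not part of the notes): conditional on $\sum_i X_i = y$, every arrangement of $y$ ones among $n$ trials should appear with frequency close to $1/\binom{n}{y}$, no matter the value of $\theta$.

```python
import numpy as np
from collections import Counter
from math import comb

rng = np.random.default_rng(1)
n, y = 4, 2  # assumed small example: 4 trials, condition on 2 successes

for theta in (0.2, 0.7):
    draws = rng.binomial(1, theta, size=(200_000, n))
    # Keep only samples whose sufficient statistic equals y.
    kept = draws[draws.sum(axis=1) == y].tolist()
    freqs = Counter(map(tuple, kept))
    probs = {k: round(v / len(kept), 3) for k, v in sorted(freqs.items())}
    print(f"theta={theta}: target 1/C(n,y) = {1/comb(n, y):.3f}")
    print(probs)  # all arrangements with two 1's appear with frequency ~ 1/6
```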
Example 2: Let $X_1, \dots, X_n$ be iid Uniform$(0, \theta)$. Consider the statistic $Y = \max_{1 \le i \le n} X_i$. We showed in Notes 4 that
$$
f_Y(y) = \begin{cases} \dfrac{n y^{n-1}}{\theta^n} & 0 \le y \le \theta \\[4pt] 0 & \text{elsewhere.} \end{cases}
$$
$Y$ must be less than or equal to $\theta$. For $y \le \theta$, we have
$$
f(x_1, \dots, x_n \mid Y = y)
= \frac{f(x_1, \dots, x_n, Y = y)}{f_Y(y)}
= \frac{\dfrac{1}{\theta^n}\, I\{y \le \theta\}\, I\{\min_i x_i \ge 0\}\, I\{\max_i x_i = y\}}{\dfrac{n y^{n-1}}{\theta^n}}
= \frac{I\{\min_i x_i \ge 0\}\, I\{\max_i x_i = y\}}{n y^{n-1}},
$$
which does not depend on $\theta$ (for all $y$ for which $f_Y(y) > 0$).
It is often hard to verify or disprove sufficiency of a statistic directly, because we need to find the distribution of the statistic and the conditional distribution of the data given the statistic. The following theorem is often helpful.
Factorization Theorem: A statistic $T(X)$ is sufficient for $\theta$ if and only if there exist functions $g(t, \theta)$ and $h(x)$ such that
$$p(x \mid \theta) = g(T(x), \theta)\, h(x)$$
for all $x$ and all $\theta$ (where $p(x \mid \theta)$ denotes the probability mass function for discrete data given the parameter $\theta$ and the probability density function for continuous data).
Proof: We prove the theorem for discrete data; the proof for continuous distributions is similar. First, suppose that the probability mass function factors as given in the theorem. We have
$$P_\theta(T(X) = t) = \sum_{x':\, T(x') = t} p(x' \mid \theta),$$
so that, for any $x$ with $T(x) = t$,
$$
P_\theta(X = x \mid T(X) = t)
= \frac{P_\theta(X = x, T(X) = t)}{P_\theta(T(X) = t)}
= \frac{P_\theta(X = x)}{P_\theta(T(X) = t)}
= \frac{g(T(x), \theta)\, h(x)}{\sum_{x':\, T(x') = t} g(T(x'), \theta)\, h(x')}
= \frac{h(x)}{\sum_{x':\, T(x') = t} h(x')},
$$
since $g(T(x'), \theta) = g(t, \theta)$ for every $x'$ in the sum and therefore cancels. Thus, $T(X)$ is sufficient for $\theta$ because the conditional distribution $P_\theta(X = x \mid T(X) = t)$ does not depend on $\theta$.

Conversely, suppose $T(X)$ is sufficient for $\theta$. Then the conditional distribution of $X \mid T(X)$ does not depend on $\theta$. Let $k(x, t) = P(X = x \mid T(X) = t)$. Then
$$p(x \mid \theta) = k(x, T(x))\, P_\theta(T(X) = T(x)).$$
Thus, we can take $h(x) = k(x, T(x))$ and $g(t, \theta) = P_\theta(T(X) = t)$.
Example 1 continued: Let $X_1, \dots, X_n$ be a sequence of independent Bernoulli random variables with $P(X_i = 1) = \theta$. To show that $Y = \sum_{i=1}^n X_i$ is sufficient for $\theta$, we factor the probability mass function as follows:
$$
P(X_1 = x_1, \dots, X_n = x_n \mid \theta)
= \prod_{i=1}^n \theta^{x_i} (1 - \theta)^{1 - x_i}
= \theta^{\sum_{i=1}^n x_i} (1 - \theta)^{\,n - \sum_{i=1}^n x_i}.
$$
The pmf is of the form $g\!\left(\sum_{i=1}^n x_i, \theta\right) h(x_1, \dots, x_n)$ where $h(x_1, \dots, x_n) = 1$.
Example 2 continued: Let $X_1, \dots, X_n$ be iid Uniform$(0, \theta)$. To show that $Y = \max_{1 \le i \le n} X_i$ is sufficient, we factor the pdf as follows:
$$
f(x_1, \dots, x_n \mid \theta)
= \prod_{i=1}^n \frac{1}{\theta}\, I\{0 \le x_i \le \theta\}
= \frac{1}{\theta^n}\, I\{\max_{1 \le i \le n} x_i \le \theta\}\, I\{\min_{1 \le i \le n} x_i \ge 0\}.
$$
The pdf is of the form $g(\max_{1 \le i \le n} x_i, \theta)\, h(x_1, \dots, x_n)$ where
$$g(t, \theta) = \frac{1}{\theta^n}\, I\{t \le \theta\}, \qquad h(x_1, \dots, x_n) = I\{\min_{1 \le i \le n} x_i \ge 0\}.$$
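As a numerical illustration of this factorization (an assumed check, not from the notes): two Uniform$(0, \theta)$ samples that share the same maximum (and both have nonnegative entries, so the same $h$) have identical likelihoods at every $\theta$, since the likelihood depends on the data only through $\max_i x_i$.

```python
import numpy as np

def uniform_lik(x, theta):
    """Likelihood of an iid Uniform(0, theta) sample: theta^{-n} if min(x) >= 0 and max(x) <= theta, else 0."""
    x = np.asarray(x)
    inside = (x.min() >= 0) and (x.max() <= theta)
    return inside / theta ** len(x)

# Two assumed samples with the same maximum (0.9) but different other values:
x1 = [0.10, 0.40, 0.90]
x2 = [0.85, 0.20, 0.90]
for theta in (0.95, 1.5, 3.0):
    print(theta, uniform_lik(x1, theta), uniform_lik(x2, theta))  # equal for every theta
```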
Example 3: Let $X_1, \dots, X_n$ be iid Normal$(\mu, \sigma^2)$. The pdf factors as
$$
\begin{aligned}
f(x_1, \dots, x_n; \mu, \sigma^2)
&= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(x_i - \mu)^2\right) \\
&= \frac{1}{\sigma^n (2\pi)^{n/2}} \exp\!\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\right) \\
&= \frac{1}{\sigma^n (2\pi)^{n/2}} \exp\!\left(-\frac{1}{2\sigma^2}\left(\sum_{i=1}^n x_i^2 - 2\mu \sum_{i=1}^n x_i + n\mu^2\right)\right).
\end{aligned}
$$
The pdf is thus of the form $g\!\left(\sum_{i=1}^n x_i, \sum_{i=1}^n x_i^2, \mu, \sigma^2\right) h(x_1, \dots, x_n)$ where $h(x_1, \dots, x_n) = 1$. Thus, $\left(\sum_{i=1}^n x_i, \sum_{i=1}^n x_i^2\right)$ is a two-dimensional sufficient statistic for $(\mu, \sigma^2)$, i.e., the conditional distribution of $X_1, \dots, X_n$ given $\left(\sum_{i=1}^n x_i, \sum_{i=1}^n x_i^2\right)$ does not depend on $(\mu, \sigma^2)$.
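A small sketch of what this sufficiency buys in practice (illustrative only; the function names and sample are assumptions): the normal log-likelihood can be computed from $(n, \sum x_i, \sum x_i^2)$ alone, and it agrees with the log-likelihood computed from the full sample.

```python
import numpy as np

def normal_loglik_full(x, mu, sigma2):
    """Log-likelihood of an iid N(mu, sigma2) sample, computed from the full data."""
    x = np.asarray(x)
    n = len(x)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

def normal_loglik_suff(n, sum_x, sum_x2, mu, sigma2):
    """Same log-likelihood, computed only from the sufficient statistic (sum x_i, sum x_i^2)."""
    ss = sum_x2 - 2 * mu * sum_x + n * mu ** 2      # equals sum (x_i - mu)^2
    return -0.5 * n * np.log(2 * np.pi * sigma2) - ss / (2 * sigma2)

rng = np.random.default_rng(2)
x = rng.normal(loc=1.0, scale=2.0, size=50)
print(normal_loglik_full(x, 1.0, 4.0))
print(normal_loglik_suff(len(x), x.sum(), (x**2).sum(), 1.0, 4.0))  # identical value
```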
A theorem for proving that a statistic is not sufficient:
Theorem 1: Let $T(X)$ be a statistic. If there exist some $\theta_1, \theta_2$ and $x, y$ such that
(i) $T(x) = T(y)$;
(ii) $p(x \mid \theta_1)\, p(y \mid \theta_2) \neq p(x \mid \theta_2)\, p(y \mid \theta_1)$,
then $T(X)$ is not a sufficient statistic.

Proof: Assume that (i) and (ii) hold. Suppose that $T(X)$ is a sufficient statistic. Then by the factorization theorem, $p(x \mid \theta) = g(T(x), \theta)\, h(x)$. Thus,
$$
p(x \mid \theta_1)\, p(y \mid \theta_2) = g(T(x), \theta_1) h(x)\, g(T(y), \theta_2) h(y) = g(T(x), \theta_1) h(x)\, g(T(x), \theta_2) h(y),
$$
where the last equality follows from (i). Also,
$$
p(x \mid \theta_2)\, p(y \mid \theta_1) = g(T(x), \theta_2) h(x)\, g(T(y), \theta_1) h(y) = g(T(x), \theta_2) h(x)\, g(T(x), \theta_1) h(y),
$$
where the last equality follows from (i). Thus, $p(x \mid \theta_1)\, p(y \mid \theta_2) = p(x \mid \theta_2)\, p(y \mid \theta_1)$. This contradicts (ii). Hence the supposition that $T(X)$ is a sufficient statistic is impossible, and $T(X)$ must not be a sufficient statistic when (i) and (ii) hold.
■
Example 4: Consider a series of three independent Bernoulli trials $X_1, X_2, X_3$ with probability of success $p$. Let $T = X_1 + 2X_2 + 3X_3$. Show that $T$ is not sufficient.

Let $x = (X_1 = 0, X_2 = 0, X_3 = 1)$ and $y = (X_1 = 1, X_2 = 1, X_3 = 0)$. We have $T(x) = T(y) = 3$. But
$$
f(x \mid p = 1/3)\, f(y \mid p = 2/3) = \left((2/3)^2 \cdot (1/3)\right) \cdot \left((2/3)^2 \cdot (1/3)\right) = 16/729,
$$
$$
f(x \mid p = 2/3)\, f(y \mid p = 1/3) = \left((1/3)^2 \cdot (2/3)\right) \cdot \left((1/3)^2 \cdot (2/3)\right) = 4/729.
$$
Thus, by Theorem 1, $T$ is not sufficient.
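The arithmetic in Example 4 is easy to confirm numerically; a minimal sketch (the helper function is an assumption for illustration) checks conditions (i) and (ii) of Theorem 1 directly.

```python
def bern3_pmf(x, p):
    """Joint pmf of three independent Bernoulli(p) trials evaluated at the 0/1 triple x."""
    out = 1.0
    for xi in x:
        out *= p if xi == 1 else (1 - p)
    return out

x, y = (0, 0, 1), (1, 1, 0)
T = lambda v: v[0] + 2 * v[1] + 3 * v[2]
print(T(x), T(y))                               # both 3, so condition (i) holds

lhs = bern3_pmf(x, 1/3) * bern3_pmf(y, 2/3)     # = 16/729
rhs = bern3_pmf(x, 2/3) * bern3_pmf(y, 1/3)     # = 4/729
print(lhs, rhs, lhs != rhs)                     # condition (ii) holds, so T is not sufficient
```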
III. Implications of Sufficiency
We have said that reducing the data to a sufficient statistic does not sacrifice any information about $\theta$. We now justify this statement in two ways:
(1) We show that for any decision procedure, we can
find a randomized decision procedure that is based
only on the sufficient statistic and that has the same
risk function.
(2) We show that any point estimator that is not a
function of the sufficient statistic can be improved
upon for a strictly convex loss function.
(1) Let $\delta(X)$ be a decision procedure and $T(X)$ be a sufficient statistic. Consider the following randomized decision procedure [call it $\delta'(T(X))$]: Based on $T(X)$, randomly draw $X'$ from the conditional distribution of $X \mid T(X)$ (which does not depend on $\theta$ and is hence known) and take action $\delta(X')$.

Since $X'$ is generated from the conditional distribution of $X$ given $T(X)$, $X'$ has the same (marginal) distribution as $X$, so $\delta(X)$ has the same distribution as $\delta'(T(X)) = \delta(X')$. Since $\delta(X)$ and $\delta(X')$ have the same distribution, they have the same risk function.
Example 5: Let $X \sim N(0, \sigma^2)$. $T(X) = |X|$ is sufficient because, given $T(X) = t$, $X$ is equally likely to be $+t$ or $-t$ for every $\sigma^2$. Given $T = t$, construct $X'$ to be $+t$ or $-t$ with probability 0.5 each. Then $X' \sim N(0, \sigma^2)$.
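A short simulation (illustrative, with an assumed value of $\sigma$) sketches this reconstruction: starting from the sufficient statistic $|X|$ and a fair coin flip for the sign, the reconstructed $X'$ matches the original $N(0, \sigma^2)$ distribution.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 2.0                                    # assumed value for illustration
X = rng.normal(scale=sigma, size=200_000)
T = np.abs(X)                                  # the sufficient statistic |X|

# Randomized reconstruction: given T = t, set X' = +t or -t with probability 1/2 each.
signs = rng.choice([-1.0, 1.0], size=T.shape)
X_prime = signs * T

print(X.mean(), X_prime.mean())                          # both ~ 0
print(X.std(), X_prime.std())                            # both ~ sigma
print(np.mean(X <= 1.0), np.mean(X_prime <= 1.0))        # empirical CDFs agree at a point
```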
(2) The Rao-Blackwell Theorem.
Convex functions: A real-valued function $\phi$ defined on an open interval $I = (a, b)$ is convex if for any $a < x < y < b$ and $0 \le \alpha \le 1$,
$$\phi[\alpha x + (1 - \alpha) y] \le \alpha \phi(x) + (1 - \alpha) \phi(y).$$
$\phi$ is strictly convex if the inequality is strict for $0 < \alpha < 1$. If $\phi''$ exists, then $\phi$ is convex if and only if $\phi'' \ge 0$ on $I = (a, b)$. A convex function lies above all its tangent lines.
Convexity of loss functions for point estimation:

Squared error loss is strictly convex.

Absolute error loss is convex but not strictly convex.

Huber's loss function,
$$
l(\theta, a) = \begin{cases} (q(\theta) - a)^2 & \text{if } |q(\theta) - a| \le k \\ 2k\,|q(\theta) - a| - k^2 & \text{if } |q(\theta) - a| > k, \end{cases}
$$
for some constant $k$, is convex but not strictly convex.

The zero-one loss function,
$$
l(\theta, a) = \begin{cases} 0 & \text{if } |q(\theta) - a| \le k \\ 1 & \text{if } |q(\theta) - a| > k, \end{cases}
$$
is nonconvex.
Jensen's Inequality (Appendix B.9): Let $X$ be a random variable. (i) If $\phi$ is convex in an open interval $I$, $P(X \in I) = 1$, and $E(X)$ is finite, then
$$\phi(E[X]) \le E[\phi(X)].$$
(ii) If $\phi$ is strictly convex, then $\phi(E[X]) < E[\phi(X)]$ unless $X$ equals a constant with probability one.

Proof of (i): Let $L(x)$ be a tangent line to $\phi(x)$ at the point $(E[X], \phi(E[X]))$. Write $L(x) = a + bx$. By the convexity of $\phi$, $\phi(x) \ge a + bx$. Since expectations preserve inequalities,
$$E[\phi(X)] \ge E[a + bX] = a + bE[X] = L(E[X]) = \phi(E[X]),$$
as was to be shown.
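Jensen's inequality is easy to see numerically; a minimal check (with an assumed distribution and the strictly convex choice $\phi(x) = x^2$) illustrates the strict gap in part (ii).

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.exponential(scale=1.0, size=500_000)   # assumed non-degenerate distribution, E[X] = 1

phi = lambda x: x ** 2                          # strictly convex
print(phi(X.mean()))                            # phi(E[X]) ~ 1
print(np.mean(phi(X)))                          # E[phi(X)] ~ 2, strictly larger
```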
Rao-Blackwell Theorem: Let $T(X)$ be a sufficient statistic. Let $\delta(X)$ be a point estimator of $q(\theta)$ and assume that the loss function $l(\theta, d)$ is strictly convex in $d$. Also assume that $R(\theta, \delta) < \infty$. Let $\eta(t) = E[\delta(X) \mid T(X) = t]$. Then $R(\theta, \eta) < R(\theta, \delta)$ unless $\delta(X) = \eta(T(X))$ with probability one.

Proof: Fix $\theta$. Apply Jensen's inequality with $\phi(d) = l(\theta, d)$ and let $X$ have the conditional distribution of $X \mid T(X) = t$ for a particular choice of $t$. By Jensen's inequality,
$$l(\theta, \eta(t)) \le E[l(\theta, \delta(X)) \mid T(X) = t] \qquad (0.1)$$
with the inequality being strict unless $\delta(X) = \eta(t)$ with probability one given $T(X) = t$. Taking the expectation over $T(X)$ on both sides of this inequality yields $R(\theta, \eta) \le R(\theta, \delta)$, with strict inequality unless $\delta(X) = \eta(T(X))$ with probability one.
Comments:
(1) Sufficiency ensures that $\eta(t) = E[\delta(X) \mid T(X) = t]$ is an estimator (i.e., it depends only on $t$ and not on $\theta$).
(2) If the loss is convex rather than strictly convex, we get $\le$ in (0.1).
(3) The theorem is not true without convexity of the loss function.
Example 4 continued: Consider a series of three independent Bernoulli trials $X_1, X_2, X_3$ with probability of success $p$. We have shown that $T(X) = X_1 + X_2 + X_3$ is a sufficient statistic and that $T'(X) = X_1 + 2X_2 + 3X_3$ is not a sufficient statistic. The unbiased estimator
$$\delta(X) = \frac{X_1 + 2X_2 + 3X_3}{6}$$
is a function of the insufficient statistic $T'(X) = X_1 + 2X_2 + 3X_3$ and can thus be improved for a strictly convex loss function by using the Rao-Blackwell theorem:
$$\eta(t) = E(\delta(X) \mid T(X) = t) = E\left[\frac{X_1 + 2X_2 + 3X_3}{6} \,\Big|\, X_1 + X_2 + X_3 = t\right].$$
Note that, for $x \in \{0, 1\}$,
$$
P_p(X_1 = x \mid X_1 + X_2 + X_3 = t)
= \frac{P_p(X_1 = x,\ X_2 + X_3 = t - x)}{P_p(X_1 + X_2 + X_3 = t)}
= \frac{p^x (1-p)^{1-x}\, \binom{2}{t-x} p^{t-x} (1-p)^{2-(t-x)}}{\binom{3}{t} p^t (1-p)^{3-t}}
= \frac{\binom{2}{t-x}}{\binom{3}{t}}
= \begin{cases} t/3 & \text{if } x = 1 \\ 1 - t/3 & \text{if } x = 0, \end{cases}
$$
so that $E[X_1 \mid T = t] = t/3$, and by the same argument $E[X_2 \mid T = t] = E[X_3 \mid T = t] = t/3$. Thus,
$$
\eta(t) = E\left[\frac{X_1 + 2X_2 + 3X_3}{6} \,\Big|\, X_1 + X_2 + X_3 = t\right]
= \frac{1}{6}\cdot\frac{t}{3} + \frac{2}{6}\cdot\frac{t}{3} + \frac{3}{6}\cdot\frac{t}{3} = \frac{t}{3}.
$$
For squared error loss we have
$$R(p, \eta) = \mathrm{Bias}_p(\eta)^2 + \mathrm{Var}_p(\eta) = 0 + \frac{p(1-p)}{3}$$
and
$$R(p, \delta) = \mathrm{Bias}_p(\delta)^2 + \mathrm{Var}_p(\delta) = 0 + \frac{14\, p(1-p)}{36},$$
so that $R(p, \eta) < R(p, \delta)$.
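A Monte Carlo check of this risk comparison (illustrative only; the value of $p$ is an assumption) confirms that the Rao-Blackwellized estimator $\eta(T) = T/3$ has smaller squared error risk than $\delta(X) = (X_1 + 2X_2 + 3X_3)/6$.

```python
import numpy as np

rng = np.random.default_rng(5)
p = 0.4                                                # assumed true success probability
X = rng.binomial(1, p, size=(1_000_000, 3))

delta = (X[:, 0] + 2 * X[:, 1] + 3 * X[:, 2]) / 6      # original unbiased estimator
T = X.sum(axis=1)                                      # sufficient statistic
eta = T / 3                                            # Rao-Blackwellized estimator E[delta | T]

risk_delta = np.mean((delta - p) ** 2)                 # should be ~ 14 p(1-p)/36
risk_eta = np.mean((eta - p) ** 2)                     # should be ~ p(1-p)/3 = 12 p(1-p)/36
print(risk_delta, 14 * p * (1 - p) / 36)
print(risk_eta, p * (1 - p) / 3)
```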
Consequence of the Rao-Blackwell theorem: For convex loss functions, we can dispense with randomized estimators. A randomized estimator randomly chooses the estimate $Y(x)$, where the distribution of $Y(x)$ is known. A randomized estimator can be obtained as a nonrandomized estimator $\delta^*(X, U)$ where $X$ and $U$ are independent and $U$ is uniformly distributed on $(0, 1)$: this is achieved by observing $X = x$ and then using $U$ to generate a draw from the distribution of $Y(x)$. For the data $(X, U)$, $X$ is sufficient. Thus, by the Rao-Blackwell theorem, the nonrandomized estimator $E[\delta^*(X, U) \mid X]$ dominates $\delta^*(X, U)$ for strictly convex loss functions.