CENTRAL LIMIT THEOREM
FREDERICK VU
Abstract. This expository paper provides a short introduction to probability theory before proving a central theorem of the subject, the central limit theorem. The theorem concerns the convergence to a normal distribution of the suitably rescaled average of independent, identically distributed random variables with finite mean and variance. The paper uses Lévy's continuity theorem to prove the central limit theorem.
Contents
1. Introduction
2. Convergence
3. Variance Matrices
4. Multivariate Normal Distribution
5. Characteristic Functions and the Lévy Continuity Theorem
Acknowledgments
References
1. Introduction
Before we state the central limit theorem, we must first define several terms.
An understanding of the terms relies on basic functional analysis fitted with new
probability terminology.
Definition 1.1. A probability space is a triple (Ω, F, P) where Ω is a non-empty set, F is a σ-algebra of subsets of Ω (a collection of subsets containing Ω and closed under complements and countable unions, hence also countable intersections), called the measurable sets, and P is a finite measure on the measurable space (Ω, F) with P(Ω) = 1. P is referred to as a probability measure.
Definition 1.2. A random variable X is a measurable function from a probability space (Ω, F, P) to a measurable space (S, S), where S is a σ-algebra of measurable subsets of S. Normally (S, S) is the real numbers with the Borel σ-algebra. We will maintain the general notation, but conform to this norm throughout the paper. A random vector is a column vector whose components are real-valued random variables defined on the same probability space. In many places in this paper, a statement concerning random variables will presume the existence of some general probability space.
Definition 1.3. The expected value of a real-valued random variable X is defined as the Lebesgue integral of X with respect to the measure P,
E(X) ≡ ∫_Ω X dP.
For a random vector X, the expected value E(X) is the vector whose components are E(X_i).
Definition 1.4. Because independence is such a central notion in probability, it is best to define it early. First, define the distribution of a random variable X as Q ≡ P ∘ X⁻¹, defined on (S, S) by
Q(B) := P(X⁻¹(B)) ≡ P(X ∈ B) ≡ P(ω ∈ Ω : X(ω) ∈ B), B ∈ S.
This possibly confusing notation can be understood as the pushforward measure of P to (S, S).
Definition 1.5. A set of random variables X_1, ..., X_n with X_i a map from (Ω, F, P) to (S_i, S_i) is called independent if the distribution Q of X := (X_1, ..., X_n) on the product space (S = S_1 × ··· × S_n, S = S_1 × ··· × S_n) is the product measure Q = Q_1 × ··· × Q_n, where Q_i is the distribution of X_i, or more compactly,
Q(B_1 × ··· × B_n) = ∏_{i=1}^n Q_i(B_i).
Two random vectors are said to be independent if their components are pairwise independent as above.
Since the (multivariate) central limit theorem won't be stated until much further along, due to the required definitions of normal distributions and many lemmas along the way, we pause here to give an informal statement of the central theorem before continuing with a few basic lemmas from probability theory. The central limit theorem says, roughly, that if one repeatedly and independently samples from a fixed distribution with finite mean and variance, then the average value approaches the expected value of the corresponding random variable, and its fluctuations about that expected value, rescaled by the square root of the number of samples, follow an approximately bell-shaped (normal) curve: plotting a histogram of such rescaled averages produces the familiar bell shape. A short simulation illustrating this is sketched below.
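To make this informal statement concrete, here is a short Python sketch (an illustration only, not part of the formal development; it assumes numpy and matplotlib are available). It repeatedly averages independent draws from a fixed, visibly non-normal distribution and plots a histogram of the standardized averages against the standard normal density.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    n, trials = 500, 10_000       # draws per average, number of averages

    # A decidedly non-normal distribution: Exponential(1), with mean 1 and variance 1.
    samples = rng.exponential(scale=1.0, size=(trials, n))

    # Standardized averages: sqrt(n) * (sample mean - mean) / standard deviation.
    z = np.sqrt(n) * (samples.mean(axis=1) - 1.0) / 1.0

    # Histogram of z against the N(0, 1) density.
    x = np.linspace(-4, 4, 200)
    plt.hist(z, bins=60, density=True, alpha=0.5, label="standardized averages")
    plt.plot(x, np.exp(-x**2 / 2) / np.sqrt(2 * np.pi), label="N(0, 1) density")
    plt.legend()
    plt.show()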
The following are simple inequalities used often in the paper.
Lemma 1.6. (Markov's Inequality) If X is a nonnegative random variable and a > 0, then
P(X ≥ a) ≤ E(X)/a.
Proof. Denote by I_U the indicator function of U ⊆ Ω. Then, since X ≥ a I_{X≥a}, by linearity of the integral and the definition of the probability,
E(X) ≥ E(a I_{X≥a}) = a E(I_{X≥a}) = a P(X ≥ a).
Corollary 1.7. (Chebyshev's Inequality) For any random variable X and a > 0,
P(|X − E(X)| ≥ a) ≤ E((X − E(X))^2)/a^2.
Proof. Consider the random variable (X − E(X))^2 and apply Markov's inequality with a^2 in place of a.
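As a quick numerical sanity check of these two inequalities (an illustration only; the Exponential(1) distribution and the threshold a = 3 are arbitrary choices), the following Python sketch compares Monte Carlo estimates of the tail probabilities with the bounds.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=1.0, size=1_000_000)   # E(X) = 1, Var(X) = 1
    a = 3.0

    print("P(X >= a)         ~", (x >= a).mean())            # exact value is e^{-3} ~ 0.050
    print("Markov bound      =", x.mean() / a)               # E(X)/a ~ 0.333

    print("P(|X - EX| >= a)  ~", (np.abs(x - x.mean()) >= a).mean())
    print("Chebyshev bound   =", x.var() / a**2)             # Var(X)/a^2 ~ 0.111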
There are many ways to understand probability measures, and it is from these different points of view and their interrelations that one can derive the multitude of theorems that follow.
Definitions 1.8. The cumulative distribution function (cdf) of a random vector X = (X_1, ..., X_n) is the function F_X : R^n → R,
F_X(x) = P(X_1 ≤ x_1, ..., X_n ≤ x_n).
For a continuous random vector X, define the probability density function as
f_X(x) = ∂^n F_X(x_1, ..., x_n) / (∂x_1 ··· ∂x_n).
This provides us with another way to write the distribution of a random vector X. For A ⊆ R^n,
P(X ∈ A) = ∫_A f_X(x) dx.
Remark 1.9. For a continuous random variable X, there is also another way to express the expected value of powers of X,
(1.10)  E(X^n) = ∫_R x^n f_X(x) dx.
This is just a specific case of
(1.11)  E(g(X)) = ∫_R g(x) f_X(x) dx,
where g is a measurable function.
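Equation (1.11) is easy to check numerically; the following minimal Python sketch (an illustration only, with the standard normal density and g(x) = x² chosen as the example, so that E(g(X)) = 1) compares a Riemann sum for the integral with a Monte Carlo average.

    import numpy as np

    # Standard normal density f_X and test function g(x) = x**2, so E(g(X)) = 1.
    x = np.linspace(-10, 10, 200_001)
    dx = x[1] - x[0]
    f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

    integral = np.sum(x**2 * f) * dx                 # Riemann sum for equation (1.11)

    rng = np.random.default_rng(0)
    monte_carlo = np.mean(rng.standard_normal(1_000_000) ** 2)

    print(integral, monte_carlo)                     # both are close to 1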
2. Convergence
Definition 2.1. A sequence of cumulative distribution functions {F_n} is said to converge in distribution, or converge weakly, to the cumulative distribution function F, denoted F_n ⇒ F, if
(2.2)  lim_n F_n(x) = F(x)
for every continuity point x of F. If Q_n and Q are the corresponding distributions, then we may equivalently define Q_n ⇒ Q if for every A = (−∞, x] for which Q({x}) = 0,
lim_n Q_n(A) = Q(A).
Similarly, if X_n and X are the respective random variables corresponding to F_n and F, we write X_n ⇒ X, defined equivalently.
Since distributions are just measures on some measurable space (S, S), which again is generally the reals, we have a similar notion of convergence for measures rather than just distributions. The following theorem allows the representation of weakly convergent measures as the distributions of random variables defined on a common probability space.
Theorem 2.3. Suppose that µn and µ are probability measures on (R, R) and
µn ⇒ µ. Then there exist random variables Xn and X on some (Ω, F, P ) such that
Xn , X have respective distributions µn , µ, and Xn (ω) → X(ω) for each ω ∈ Ω.
Proof. Take (Ω, F, P ) to be the set (0, 1) with Borel subsets of (0, 1) and the
Lebesgue measure. Denote the cumulative distributions associated with µn , µ by
Fn , F , and put
Xn (ω) = inf[x : ω ≤ Fn (x)]
and
X(ω) = inf[x : ω ≤ F (x)].
The set [x : ω ≤ F(x)] is closed on the left, since F is right-continuous as are all cumulative distributions, and therefore it is the set [X(ω), ∞). Hence ω ≤ F(x) if and only if X(ω) ≤ x, and P[ω : X(ω) ≤ x] = P[ω : ω ≤ F(x)] = F(x). Thus X has cumulative distribution F; similarly, X_n has cumulative distribution F_n.
To prove pointwise convergence, fix ω and, for a given ε > 0, choose x so that X(ω) − ε < x < X(ω) and µ({x}) = 0. Then F(x) < ω, and F_n(x) → F(x) implies that for large enough n, F_n(x) < ω, and therefore X(ω) − ε < x < X_n(ω). Thus
lim inf_n X_n(ω) ≥ X(ω) − ε, and since ε was arbitrary, lim inf_n X_n(ω) ≥ X(ω).
Now for ω′ > ω, we may similarly choose y with X(ω′) < y < X(ω′) + ε and µ({y}) = 0, so that ω < ω′ ≤ F(y); then for large enough n, ω ≤ F_n(y) and hence X_n(ω) ≤ y < X(ω′) + ε. Thus
lim sup_n X_n(ω) ≤ X(ω′).
Therefore, if X is continuous at ω, letting ω′ ↓ ω gives X_n(ω) → X(ω).
Since X is increasing on (0, 1), it has at most countably many discontinuities. For any point of discontinuity ω, redefine X_n(ω) = X(ω) = 0. Since the set of discontinuities has Lebesgue measure 0, the distributions remain unchanged.
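The construction in this proof is exactly inverse-cdf (quantile) sampling, and it can be imitated numerically. The Python sketch below (an illustration only; the Exponential(1) cdf and the grid resolution are arbitrary choices) evaluates X(ω) = inf[x : ω ≤ F(x)] for ω drawn uniformly from (0, 1) and checks that the resulting samples have cdf F.

    import numpy as np

    def quantile(F, omega, grid):
        """X(omega) = inf{x in grid : omega <= F(x)}, evaluated on a fine grid."""
        values = F(grid)                                   # nondecreasing in x
        idx = np.searchsorted(values, omega, side="left")  # first index with F(x) >= omega
        return grid[np.clip(idx, 0, len(grid) - 1)]

    F = lambda x: 1.0 - np.exp(-np.clip(x, 0.0, None))     # Exponential(1) cdf
    grid = np.linspace(0.0, 20.0, 100_001)

    rng = np.random.default_rng(0)
    omega = rng.uniform(size=100_000)                      # points of (0, 1) under Lebesgue measure
    x = quantile(F, omega, grid)

    print(x.mean(), (x <= 1.0).mean())                     # ~1 and ~1 - e^{-1} ~ 0.632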
At the heart of many theorems in probability are the properties of convergence of distribution functions. We now come to several fundamental convergence theorems in probability, though in essence they are rehashings of conventional proofs from functional analysis. The first theorem essentially says that measurable maps preserve weak limits.
Theorem 2.4. Let h : R → R be measurable and let the set D_h of its discontinuities be measurable. If µ_n ⇒ µ as before and µ(D_h) = 0, then µ_n ∘ h⁻¹ ⇒ µ ∘ h⁻¹.
Proof. Using the random variables X_n, X defined in the previous proof, and noting that µ(D_h) = 0 means X(ω) is a continuity point of h for almost every ω, we see that h(X_n(ω)) → h(X(ω)) almost everywhere. Therefore h(X_n) ⇒ h(X), where such notation means the composition h ∘ X. For A ⊆ R, since
P[h(X) ∈ A] = P[X ∈ h⁻¹(A)] = µ(h⁻¹(A)),
h ∘ X has distribution µh⁻¹; similarly, h ∘ X_n has distribution µ_n h⁻¹, again abusing notation of composition. Thus h(X_n) ⇒ h(X) is equivalent to µ_n h⁻¹ ⇒ µh⁻¹.
Corollary 2.5. If X_n ⇒ X and P[X ∈ D_h] = 0, then h(X_n) ⇒ h(X).
Lemma 2.6. µ_n ⇒ µ if and only if ∫ f dµ_n → ∫ f dµ for every bounded, continuous function f.
Proof. For the forward direction, by the same process as in the proof of theorem 2.3, we have f(X_n) → f(X) almost everywhere. By change of variables and the dominated convergence theorem,
∫ f dµ_n = E(f(X_n)) → E(f(X)) = ∫ f dµ.
Conversely, consider the cumulative distribution functions F_n, F associated with µ_n, µ and suppose x < y. Define the function f by f(t) = 1 for t ≤ x, f(t) = 0 for t ≥ y, and f(t) = (y − t)/(y − x) for x ≤ t ≤ y. Since F_n(x) ≤ ∫ f dµ_n and ∫ f dµ ≤ F(y), if we let y ↓ x it follows from the assumption that
lim sup_n F_n(x) ≤ F(x).
If we consider u < x and define a function g similar to f (equal to 1 up to u, 0 after x, and linear in between), we have
F(x−) ≤ lim inf_n F_n(x),
which implies convergence at continuity points.
Theorem 2.7. (Helly's Selection Theorem) For every sequence of cdf's F_n, there exists a subsequence F_{n_k} and a nondecreasing, right-continuous function F such that lim_k F_{n_k}(x) = F(x) at continuity points x of F.
Proof. Enumerate the rationals by t_1, t_2, .... Since cumulative distribution functions are bounded, the sequence F_n(t_1) contains a convergent subsequence; denote it by F_{n_k^{(1)}}. Similarly, we may find a subsequence of this subsequence, denoted F_{n_k^{(2)}}, such that n_1^{(2)} > n_1^{(1)} and F_{n_k^{(2)}}(t_2) converges. Continuing in this way and letting n_k = n_1^{(k)}, the first element of the k-th sub-subsequence, the diagonal sequence F_{n_k} converges at every rational. Denote by G(t_m) the limit at the rational t_m, and define F(x) = inf{G(t_m) : x < t_m}, which is clearly nondecreasing.
For any given x and ε > 0, there exists an r > x so that G(r) < F(x) + ε. If x < y < r, then F(y) ≤ G(r) < F(x) + ε. Hence F is right-continuous.
If F is continuous at x, choose y < x so that F(x) − ε < F(y), and choose rationals r, s so that y < r < x < s and G(s) < F(x) + ε. From F(x) − ε < G(r) ≤ G(s) < F(x) + ε and the monotonicity of the F_n, it follows that as k → ∞, F_{n_k}(x) has lim sup and lim inf within ε of F(x).
Note that the function F described above does not have to be a cdf; if F_n is the unit jump at n, then F ≡ 0. To make the theorem useful, we need a condition that ensures F is a cdf.
Definition 2.8. A sequence of distributions µ_n on (R, R) is tight if for each positive ε there exists a finite interval (a, b] such that µ_n((a, b]) > 1 − ε for all n.
Theorem 2.9. A sequence of distributions µ_n is tight if and only if for every subsequence µ_{n_k} there is a further subsequence µ_{n_{k_j}} and a probability measure µ such that µ_{n_{k_j}} ⇒ µ.
Proof. For the forward direction, apply Helly's theorem to the corresponding subsequence of cdf's F_{n_k} to obtain a further subsequence with lim_j F_{n_{k_j}}(x) = F(x) at continuity points of F. As in the proof of theorem 2.3, a measure µ on (R, R) may be defined so that µ((a, b]) = F(b) − F(a). As a consequence of tightness, given ε > 0, we may choose a, b so that µ_n((a, b]) > 1 − ε for all n. We may also decrease a and increase b so that they are continuity points of F. Then µ((a, b]) ≥ 1 − ε, so that µ is a probability measure and µ_{n_{k_j}} ⇒ µ.
Conversely, assume µ_n is not tight; then there exists ε > 0 such that for any finite interval (a, b], µ_n((a, b]) ≤ 1 − ε for some n. Choose a subsequence n_k so that µ_{n_k}((−k, k]) ≤ 1 − ε. Now suppose there is a subsequence µ_{n_{k_j}} of µ_{n_k} that converges weakly to some probability measure µ. Choose (a, b] so that µ({a}) = µ({b}) = 0 and µ((a, b]) > 1 − ε. Then for large enough j, (a, b] ⊂ (−k_j, k_j], and so
1 − ε ≥ µ_{n_{k_j}}((−k_j, k_j]) ≥ µ_{n_{k_j}}((a, b]) → µ((a, b]).
Thus µ((a, b]) ≤ 1 − ε, a contradiction.
Corollary 2.10. If µ_n is a tight sequence of probability measures, and if each weakly convergent subsequence converges to the probability measure µ, then µ_n ⇒ µ.
Proof. By the theorem, every subsequence µ_{n_k} contains a further subsequence µ_{n_{k_j}} that converges weakly, and by hypothesis the limit is µ. Suppose that µ_n ⇒ µ is false. Then there exists x with µ({x}) = 0 for which µ_n((−∞, x]) → µ((−∞, x]) fails, so there is an ε > 0 and a subsequence n_k with |µ_{n_k}((−∞, x]) − µ((−∞, x])| ≥ ε; no further subsequence of this subsequence can converge weakly to µ, a contradiction.
Like many fundamental theorems in analysis, the following results concern the interaction between limits and integration. This is very important when we are dealing with sequences of random variables and their expected values.
Definition 2.11. The random variables X_n are uniformly integrable if
lim_{a→∞} sup_n ∫_{|X_n|≥a} |X_n| dP = 0,
which implies that
sup_n E(|X_n|) < ∞.
Theorem 2.12. If Y_n ⇒ Y and the Y_n are uniformly integrable, then Y is integrable and
E(Y_n) → E(Y).
Proof. The integrability of Y follows from Fatou's lemma. From the distributions associated with the Y_n and Y, construct as in the proof of theorem 2.3 the random variables X_n, X. Since they have the same distributions and X_n → X almost everywhere,
E(Y_n) = E(X_n) → E(X) = E(Y)
by Vitali's convergence theorem.
Corollary 2.13. For a positive integer r, if X_n ⇒ X and sup_n E(|X_n|^{r+ε}) < ∞ for some ε > 0, then E(|X|^r) < ∞ and E(X_n^r) → E(X^r).
Proof. The X_n^r are uniformly integrable because
∫_{|X_n|^r ≥ a} |X_n|^r dP ≤ a^{−ε/r} E(|X_n|^{r+ε}).
By corollary 2.5, X_n ⇒ X implies X_n^r ⇒ X^r, so the result follows from theorem 2.12.
3. Variance Matrices
Definitions 3.1. The covariance of two random variables X, Y is
Cov(X, Y) ≡ E[(X − µ_X)(Y − µ_Y)],
where µ_X = E(X) and µ_Y = E(Y).
The covariance matrix of two random vectors X = (X_1, ..., X_n), Y = (Y_1, ..., Y_m) is the n × m matrix defined by
[Cov(X, Y)]_{ij} = Cov(X_i, Y_j).
The variance matrix of a random vector X is the square matrix M_X defined by
[M_X]_{ij} = [var(X)]_{ij} = Cov(X_i, X_j).
Let’s examine some properties of the expected value (mean) and variance matrix
of random vectors a little more closely.
Theorem 3.2. Let Y = a + BX, where a is any fixed vector, B is any fixed matrix, and X is a random vector. Then
(3.3)  E(Y) = a + BE(X),
(3.4)  var(Y) = B var(X) B′.
Proof. To prove (3.3), it is enough to note the linearity of the expectation operator. To prove (3.4), we note that the variance matrix may be written
var(Y) = E[(Y − µ_Y)(Y − µ_Y)′].
Thus, evaluating the variance of Y, we get
var(a + BX) = E[(a + BX − µ_Y)(a + BX − µ_Y)′]
= E[(BX − Bµ_X)(BX − Bµ_X)′]
= E[B(X − µ_X)(X − µ_X)′B′]
= B E[(X − µ_X)(X − µ_X)′] B′
= B var(X) B′.
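A quick numerical confirmation of (3.3) and (3.4) (an illustration only; the vectors a, µ_X, the matrices B, M_X, and the normal sampling distribution are arbitrary choices): the empirical mean and covariance of Y = a + BX should match a + BE(X) and B var(X) B′.

    import numpy as np

    rng = np.random.default_rng(0)

    # A random vector X with known mean and variance matrix.
    mu_X = np.array([1.0, -2.0, 0.5])
    M_X = np.array([[2.0, 0.3, 0.0],
                    [0.3, 1.0, 0.4],
                    [0.0, 0.4, 1.5]])
    X = rng.multivariate_normal(mu_X, M_X, size=500_000)

    a = np.array([1.0, 2.0])
    B = np.array([[1.0, 0.0, 2.0],
                  [0.5, -1.0, 0.0]])
    Y = a + X @ B.T

    print(Y.mean(axis=0), a + B @ mu_X)            # equation (3.3)
    print(np.cov(Y.T))                             # empirical var(Y)
    print(B @ M_X @ B.T)                           # equation (3.4): B var(X) B'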
Definition 3.5. An n × n matrix A is positive semi-definite if for every vector c ∈ R^n,
c′Ac ≥ 0,
and positive definite if for every nonzero vector c ∈ R^n,
c′Ac > 0.
Theorem 3.6. The variance matrix of a random vector X is symmetric and positive semi-definite.
Proof. Symmetry is immediate, since Cov(X_i, X_j) = Cov(X_j, X_i). Define a scalar random variable Y = a + c′X, where a is a constant scalar and c is a constant vector; then by theorem 3.2,
var(Y) = c′M_X c,
and since the variance of a random variable is non-negative by definition, we see that M_X is positive semi-definite.
4. Multivariate Normal Distribution
The standard multivariate normal distribution is the distribution of a random vector Z = (Z_1, ..., Z_n) whose components are independent and identically distributed with the standard normal density (1/√(2π)) e^{−x²/2}; for notation, we write Z ∼ N_n(0, I). The distribution is thus defined by its probability density function
f_Z(x) = ∏_{i=1}^n (1/√(2π)) e^{−x_i²/2} = (1/(2π)^{n/2}) e^{−x′x/2}.
Remark 4.1. The value of a Gaussian integral is
I(a) = ∫_R e^{−ax²} dx = √(π/a),
and differentiating under the integral sign,
dI(a)/da = −∫_R x² e^{−ax²} dx = −(√π/2) a^{−3/2}.
Evaluating at a = 1/2 and using definitions 3.1, the variance matrix of Z is the n-dimensional identity matrix, and the mean of Z is 0.
Corollary 4.2. For a random vector X = a + BZ,
E(X) = a + BE(Z) = a,
var(X) = B var(Z) B′ = BB′.
We say that X has a multivariate normal distribution with mean a and variance BB′ = M. For notation, we write X ∼ N_n(a, M), dropping the n if the dimension is implied by context.
It turns out that every symmetric, positive semi-definite matrix arises as the variance matrix of a normal random vector. Some more properties of matrices will have to be introduced before this can be proven.
Definition 4.3. For a given symmetric, positive semi-definite matrix A, we know by the spectral theorem that it may be written as A = ODO′, where O is orthogonal and D is the diagonal matrix of (non-negative) eigenvalues of A. Define the symmetric square root of A by
A^{1/2} = OD^{1/2}O′.
Theorem 4.4. For a given symmetric, positive semi-definite matrix M and vector µ, there is a normal random vector X such that X ∼ N(µ, M).
Proof. Define X = µ + M^{1/2}Z, where Z is multivariate standard normal. Using corollary 4.2 and the properties of the symmetric square root of M, this finishes the proof.
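The proof is constructive and is essentially how one simulates from N(µ, M). The following Python sketch (an illustration only; the particular µ and M are arbitrary choices) forms the symmetric square root of M by the eigendecomposition of definition 4.3, generates X = µ + M^{1/2}Z, and checks the empirical mean and variance matrix.

    import numpy as np

    def symmetric_sqrt(M):
        """Symmetric square root O D^{1/2} O' of a symmetric positive semi-definite M."""
        eigvals, O = np.linalg.eigh(M)
        return O @ np.diag(np.sqrt(np.clip(eigvals, 0.0, None))) @ O.T

    rng = np.random.default_rng(0)
    mu = np.array([1.0, -1.0])
    M = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

    Z = rng.standard_normal((500_000, 2))       # standard multivariate normal samples
    X = mu + Z @ symmetric_sqrt(M).T            # X = mu + M^{1/2} Z, as in theorem 4.4

    print(X.mean(axis=0))                       # ~ mu
    print(np.cov(X.T))                          # ~ M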
5. Characteristic Functions and the Lévy Continuity Theorem
The power of the following function stems from its relation with the probability
density function of a random variable, i.e. they are essentially the Fourier transform
of one another. It turns out that this is just the tool needed to go about proving
stronger convergence theorems about distributions.
Definition 5.1. The characteristic function of a random vector X is the function φ : R^n → C,
φ_X(t) = E(e^{it′X}),
where the expectation is taken with respect to the distribution of X. The characteristic function is sometimes written φ(t) without the index X.
Some basic properties of φ(t) follow from the boundedness of the integrand (|e^{it′X}| = 1) and the fact that the exponential is a homomorphism from the additive group to the multiplicative group of the complex numbers:
1. φ(0) = 1;
2. |φ(t)| ≤ 1;
3. φ_{a+bX}(t) = e^{it′a} φ_X(bt) for a constant vector a and scalar b;
4. φ_{X_1+···+X_N}(t) = ∏_{n=1}^N φ_{X_n}(t) for independent X_1, ..., X_N.
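These properties can be observed empirically. The small Python sketch below (an illustration only, for a univariate standard normal sample) compares a Monte Carlo estimate of φ_X(t) = E(e^{itX}) with e^{−t²/2}, the characteristic function computed in theorem 5.4 below.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal(1_000_000)

    t = np.linspace(-3.0, 3.0, 7)
    phi_hat = np.array([np.mean(np.exp(1j * s * X)) for s in t])   # empirical E(e^{itX})

    print(np.abs(phi_hat - np.exp(-t**2 / 2)).max())               # small Monte Carlo error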
Characteristic functions provide us with a new way of determining convergence
of distributions.
Theorem 5.2. Distributions are uniquely determined by their characteristic functions.
Proof. We shall prove this by giving an inversion formula: whenever µ({a}) = µ({b}) = 0,
(5.3)  µ((a, b]) = lim_{M→∞} (1/2π) ∫_{−M}^{M} [(e^{−ita} − e^{−itb})/(it)] φ(t) dt.
Writing out the characteristic function and applying Fubini's theorem, the truncated integral on the right becomes
I_M = (1/2π) ∫_R [ ∫_{−M}^{M} (e^{it(x−a)} − e^{it(x−b)})/(it) dt ] dµ.
Since sin is odd and cos is even, this simplifies to
I_M = ∫_R ψ_M(x) dµ,  where  ψ_M(x) = (1/π) [ ∫_0^M sin(t(x − a))/t dt − ∫_0^M sin(t(x − b))/t dt ].
As M → ∞, ψ_M is bounded and converges pointwise to
0 for x < a or x > b,  1/2 for x = a or x = b,  1 for a < x < b,
so by bounded convergence, lim_{M→∞} I_M = µ((a, b]) + (1/2)(µ({a}) + µ({b})). Since µ({a}) = µ({b}) = 0, equation (5.3) holds.
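The inversion formula (5.3) can also be checked numerically. In the Python sketch below (an illustration only; the standard normal characteristic function, the interval (a, b], and the truncation level M are arbitrary choices), the truncated integral is compared with the exact probability P(a < X ≤ b) for X ∼ N(0, 1).

    import numpy as np
    from math import erf, sqrt

    a, b = -1.0, 0.5
    phi = lambda t: np.exp(-t**2 / 2)                     # characteristic function of N(0, 1)

    M = 50.0
    t = np.linspace(-M, M, 400_000)                       # even point count, so the grid avoids t = 0
    dt = t[1] - t[0]
    integrand = (np.exp(-1j * t * a) - np.exp(-1j * t * b)) / (1j * t) * phi(t)
    inversion = (np.sum(integrand) * dt / (2 * np.pi)).real

    exact = 0.5 * (erf(b / sqrt(2)) - erf(a / sqrt(2)))   # P(a < X <= b) for X ~ N(0, 1)
    print(inversion, exact)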
Theorem 5.4. The characteristic function of a random vector X ∼ N(µ, M) is
φ_X(t) = e^{it′µ − t′Mt/2}.
Proof. First we prove the theorem in the univariate standard normal case and then use property 4 above to generalize. For a standard normal random variable X,
φ(t) = E(e^{itX}) = E(cos(tX) + i sin(tX)) = E(cos(tX)) + 0
= ∫ (1/√(2π)) cos(tx) e^{−x²/2} dx.
The third equality comes from sin(tx) being an odd function. Now we differentiate with respect to t to get
φ′(t) = −(1/√(2π)) ∫ x sin(tx) e^{−x²/2} dx
= (1/√(2π)) ∫ sin(tx) d(e^{−x²/2})
= (1/√(2π)) [ sin(tx) e^{−x²/2} |_{−∞}^{∞} − ∫ t cos(tx) e^{−x²/2} dx ]
= −t φ(t).
With the initial condition φ(0) = 1, this differential equation has the unique solution φ(t) = e^{−t²/2}. Now if X ∼ N(0, I), property 4 gives
φ_X(t) = e^{−t′t/2}.
For an arbitrary X ∼ N(µ, M), by theorem 4.4 we may write X = µ + M^{1/2}Z, where Z is multivariate standard normal. Now the characteristic function of X is
E(e^{it′X}) = E(e^{it′(M^{1/2}Z + µ)})
= e^{it′µ} E(e^{i(M^{1/2}t)′Z})
= e^{it′µ} e^{−(M^{1/2}t)′(M^{1/2}t)/2}
= e^{it′µ − t′Mt/2}.
The following lemma provides bounds for the error in the Taylor approximation of the exponential function. While this may seem quite a bit away from the goal of the paper, it is one of many lemmas needed to prove Lévy's theorem, from which the central limit theorem follows more easily.
Lemma 5.5. Suppose X is a random variable such that E(|X|^m) < ∞. Then
| E(e^{itX}) − ∑_{k=0}^{m} (it)^k E(X^k)/k! | ≤ E[ min( |tX|^{m+1}/(m+1)!, 2|tX|^m/m! ) ].
Proof. Let f_m(x) = e^{ix} − ∑_{k=0}^{m} (ix)^k/k! and note that f_m(x) = i ∫_0^x f_{m−1}(y) dy. Iterating this reduction m times yields an m-fold iterated integral whose innermost integrand e^{iy} − 1 has modulus at most 2; thus |f_m(x)| ≤ 2|x|^m/m!. For the other bound, consider the following identity obtained by integration by parts:
∫_0^x (x − y)^m e^{iy} dy = x^{m+1}/(m + 1) + i/(m + 1) ∫_0^x (x − y)^{m+1} e^{iy} dy.
This recursion leads by induction to the formula
e^{ix} = ∑_{k=0}^{m} (ix)^k/k! + (i^{m+1}/m!) ∫_0^x (x − y)^m e^{iy} dy.
Since |∫_0^x (x − y)^m e^{iy} dy| ≤ |∫_0^x |x − y|^m dy| = |x|^{m+1}/(m + 1) for both nonnegative and negative values of x, the modulus of the last term is at most |x|^{m+1}/(m + 1)!. Replace x with tX and take expected values.
Before we prove the Lévy continuity theorem, we need one more algebraic lemma
concerning Gaussian integrals.
Lemma 5.6. For a > 0,
∫ e^{−ax² + bx} dx = √(π/a) · e^{b²/(4a)}.
Proof. Rewrite −ax² + bx = −a(x − b/(2a))² + b²/(4a). Then we have
∫ e^{−ax² + bx} dx = e^{b²/(4a)} ∫ e^{−a(x − b/(2a))²} dx = e^{b²/(4a)} √(π/a).
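Lemma 5.6 admits an immediate numerical check (an illustration only; the values of a and b are arbitrary): a Riemann sum for the left-hand side should agree with the closed form on the right.

    import numpy as np

    a, b = 1.3, 0.7
    x = np.linspace(-30.0, 30.0, 600_001)
    dx = x[1] - x[0]

    numerical = np.sum(np.exp(-a * x**2 + b * x)) * dx        # Riemann sum
    closed_form = np.sqrt(np.pi / a) * np.exp(b**2 / (4 * a))

    print(numerical, closed_form)                             # agree to many digits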
Theorem 5.7. (Lévy's Continuity Theorem) X_n ⇒ X if and only if
φ_{X_n}(t) → φ_X(t) for all t ∈ R^m.
Proof. For the forward direction, since e^{it′x} is bounded and continuous in x, the result follows from lemma 2.6.
Conversely, assume φ_{X_n}(t) → φ_X(t) for all t ∈ R^m. We first show E(g(X_n)) → E(g(X)) for continuous g with compact support, and we shall later show that this implies convergence for bounded, continuous functions on all of R^m. Since g is uniformly continuous, for any ε > 0 we can find a δ > 0 such that ‖x − y‖ < δ implies |g(x) − g(y)| < ε. Let Z be a N(0, σ²I) random vector independent of X and {X_n}; then
|E(g(X_n)) − E(g(X))| = |E(g(X_n)) − E(g(X_n + Z)) + E(g(X_n + Z)) − E(g(X + Z)) + E(g(X + Z)) − E(g(X))|
≤ |E(g(X_n)) − E(g(X_n + Z))| + |E(g(X_n + Z)) − E(g(X + Z))| + |E(g(X + Z)) − E(g(X))|.
The first term above is bounded by 2ε, because for sufficiently small σ
|E(g(X_n)) − E(g(X_n + Z))| ≤ E(|g(X_n) − g(X_n + Z)| I_{‖Z‖≤δ}) + E(|g(X_n) − g(X_n + Z)| I_{‖Z‖>δ})
≤ ε + 2 sup_w |g(w)| · P(‖Z‖ > δ)
≤ 2ε,
where the last line follows from Chebyshev's inequality. Similarly, the third term in the above expression is bounded by 2ε. We wish to show that the second term goes to 0, i.e. E(g(X_n + Z)) → E(g(X + Z)). We have
E(g(X_n + Z)) = (1/(√(2π)σ)^m) ∫∫ g(x + z) e^{−z′z/(2σ²)} dz dF_{X_n}(x)
= (1/(√(2π)σ)^m) ∫∫ g(u) e^{−(u−x)′(u−x)/(2σ²)} du dF_{X_n}(x)
= (1/(√(2π)σ)^m) ∫∫ g(u) ∏_{j=1}^m e^{−(u_j − x_j)²/(2σ²)} du dF_{X_n}(x)
= (1/(√(2π)σ)^m) ∫∫ g(u) ∏_{j=1}^m [ (σ/√(2π)) ∫ e^{it_j(u_j − x_j) − σ²t_j²/2} dt_j ] du dF_{X_n}(x)
= (1/(2π)^m) ∫∫∫ g(u) e^{it′(u − x) − σ²t′t/2} dt du dF_{X_n}(x)
= (1/(2π)^m) ∫∫ g(u) e^{it′u − σ²t′t/2} φ_{X_n}(−t) dt du.
The first equality comes from a multivariate form of equation (1.11), the fourth equality comes from lemma 5.6, and the last equality comes from the definition of the characteristic function. Since g is continuous with compact support, we can add a constant and rescale, considering s(g(u) + r) for constants r, s that make it a permissible probability density; the above expression may then be viewed, up to this affine change, as an expectation with respect to two random vectors, one having a normal density in t and the other having density s(g(u) + r) in u. The integrand e^{it′u} φ_{X_n}(−t) is bounded in modulus by 1, and thus by the dominated convergence theorem,
(1/(2π)^m) ∫∫ g(u) e^{it′u − σ²t′t/2} φ_{X_n}(−t) dt du → (1/(2π)^m) ∫∫ g(u) e^{it′u − σ²t′t/2} φ_X(−t) dt du.
Repeating the above derivation with X in place of X_n, we have
E(g(X_n + Z)) → E(g(X + Z)).
Now it only remains to extend this to bounded, continuous functions defined on all of R^m. Take g : R^m → R such that |g(x)| ≤ A for some A ∈ R. For any ε > 0, we shall show that |E(g(X_n)) − E(g(X))| ≤ ε for all large n.
We may find c ∈ R such that P(‖X‖ ≥ c) < ε/(2A), and a continuous function 0 ≤ g′(x) ≤ 1 such that g′(x) = 0 if ‖x‖ ≥ c + 1 and g′(x) = 1 if ‖x‖ ≤ c. It follows that E(g′(X)) ≥ 1 − ε/(2A), and
|E(g(X_n)) − E(g(X))| = |E(g(X_n)) − E(g(X_n)g′(X_n)) + E(g(X_n)g′(X_n)) − E(g(X)g′(X)) + E(g(X)g′(X)) − E(g(X))|
≤ |E(g(X_n)) − E(g(X_n)g′(X_n))| + |E(g(X_n)g′(X_n)) − E(g(X)g′(X))| + |E(g(X)g′(X)) − E(g(X))|
→ |E(g(X_n)) − E(g(X_n)g′(X_n))| + 0 + |E(g(X)g′(X)) − E(g(X))|
≤ ε/2 + ε/2 = ε.
The first convergence follows from the first half of the proof and the fact that g · g′ is continuous with compact support. The bound on the remaining terms follows from
|E(g(X_n)) − E(g(X_n)g′(X_n))| ≤ E(|g(X_n)| · |1 − g′(X_n)|)
≤ A E(|1 − g′(X_n)|)
= A (1 − E(g′(X_n)))
→ A (1 − E(g′(X)))
≤ A · ε/(2A) = ε/2,
and a bound for |E(g(X)) − E(g(X)g′(X))| is found in the same fashion.
The following theorem along with the law of large numbers is the basis for much
of the beauty (subjectively) in statistics.
Theorem 5.8. (The Classical Central Limit Theorem) Let {X_n} be a sequence of independent and identically distributed m-dimensional random vectors with mean µ and finite covariance matrix M. Then, denoting by Φ_M the N_m(0, M) distribution,
(X_1 + ··· + X_n − nµ)/√n ⇒ Φ_M.
Proof. We shall prove the theorem for µ = 0 and M = I, since the general result follows by a linear transformation. Consider first the case m = 1, and let Y_n = (X_1 + ··· + X_n)/√n. By lemma 5.5 (a Taylor expansion of the characteristic function),
φ_{Y_n}(t) = φ_X(t/√n)^n = (1 − t²/(2n) + o(1/n))^n,
where lim_{n→∞} n · o(1/n) = 0. This converges to e^{−t²/2}, the characteristic function of N(0, 1), and Lévy's continuity theorem proves the result for m = 1.
For m > 1 (still with mean 0 and M = I), fix t ∈ R^m and define the random variable sequence Y_n = t′X_n. Then the Y_n are independent and identically distributed with mean 0 and variance t′t. From the preceding, the random variable Z_n := (Y_1 + ··· + Y_n)/√n converges in distribution to the normal distribution with mean 0 and variance t′t. By Lévy's continuity theorem,
φ_{Z_n}(ξ) → e^{−(t′t)ξ²/2}, ξ ∈ R.
Evaluating this expression at ξ = 1 gives φ_{(X_1+···+X_n)/√n}(t) → e^{−t′t/2} for every t ∈ R^m, and applying Lévy's continuity theorem once again, the proof is complete.
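The key step of the proof, φ_{Y_n}(t) = φ_X(t/√n)^n → e^{−t²/2}, can be watched numerically. In the Python sketch below (an illustration only; the centered uniform distribution on [−√3, √3], which has mean 0, variance 1 and characteristic function sin(√3 t)/(√3 t), is an arbitrary choice), the n-th power of the rescaled characteristic function approaches the standard normal one as n grows.

    import numpy as np

    def phi_uniform(t):
        """Characteristic function sin(sqrt(3) t) / (sqrt(3) t) of Uniform[-sqrt(3), sqrt(3)]."""
        return np.sinc(np.sqrt(3) * np.asarray(t) / np.pi)    # np.sinc(x) = sin(pi x)/(pi x)

    t = np.linspace(-3.0, 3.0, 13)
    target = np.exp(-t**2 / 2)                    # characteristic function of N(0, 1)

    for n in (1, 10, 100, 1000):
        phi_Yn = phi_uniform(t / np.sqrt(n)) ** n # phi_{Y_n}(t) = phi_X(t / sqrt(n))^n
        print(n, np.abs(phi_Yn - target).max())   # the error shrinks as n grows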
Acknowledgments. It is a pleasure to thank my mentor, Mohammad Rezaei, for leading me to this time-consuming, though beautiful, topic.
References
[1] Rabi Bhattacharya and Edward Waymire. A Basic Course in Probability Theory. Springer, 2007. Print.
[2] Guy Lebanon. The Analysis of Data. http://theanalysisofdata.com/