Chapter 2: Basics from Probability Theory
and Statistics
2.1 Probability Theory
Events, Probabilities, Random Variables, Distributions, Moments
Generating Functions, Deviation Bounds, Limit Theorems
Basics from Information Theory
2.2 Statistical Inference: Sampling and Estimation
Moment Estimation, Confidence Intervals
Parameter Estimation, Maximum Likelihood, EM Iteration
2.3 Statistical Inference: Hypothesis Testing and Regression
Statistical Tests, p-Values, Chi-Square Test
Linear and Logistic Regression
mostly following L. Wasserman Chapters 1-5, with additions from other textbooks on stochastics
2.1 Basic Probability Theory
A probability space is a triple (Ω, E, P) with
• a set Ω of elementary events (sample space),
• a family E of subsets of Ω with Ω ∈ E which is closed under
  ∪, ∩, and ¬ (complement) with a countable number of operands
  (with finite Ω usually E = 2^Ω), and
• a probability measure P: E → [0,1] with P[Ω] = 1 and
  P[∪i Ai] = ∑i P[Ai] for countably many, pairwise disjoint Ai
Properties of P:
P[A] + P[¬A] = 1
P[A ∪ B] = P[A] + P[B] − P[A ∩ B]
P[∅] = 0 (null/impossible event)
P[Ω] = 1 (true/certain event)
Independence and Conditional Probabilities
Two events A, B of a prob. space are independent
if P[A ∩ B] = P[A] P[B].
A finite set of events A = {A1, ..., An} is independent
if for every subset S ⊆ A the equation
P[∩_{Ai∈S} Ai] = ∏_{Ai∈S} P[Ai]
holds.
The conditional probability P[A | B] of A under the
condition (hypothesis) B is defined as:
P[A | B] = P[A ∩ B] / P[B]
Event A is conditionally independent of B given C
if P[A | B ∩ C] = P[A | C].
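As an aside (not from the original slides), the following small Python sketch shows why the product condition must hold for every subset S and not just for pairs: over two fair coin tosses, the events "first coin heads", "second coin heads", and "both coins agree" are pairwise independent but not mutually independent.

    # Pairwise independence does not imply mutual independence.
    from itertools import product
    from fractions import Fraction

    omega = list(product("HT", repeat=2))          # sample space, all outcomes equally likely
    P = lambda event: Fraction(sum(1 for w in omega if event(w)), len(omega))

    A = lambda w: w[0] == "H"                      # first coin heads
    B = lambda w: w[1] == "H"                      # second coin heads
    C = lambda w: w[0] == w[1]                     # both coins agree

    both = lambda e1, e2: (lambda w: e1(w) and e2(w))
    all3 = lambda w: A(w) and B(w) and C(w)

    print(P(both(A, B)) == P(A) * P(B))            # True  (pairwise independent)
    print(P(both(A, C)) == P(A) * P(C))            # True
    print(P(both(B, C)) == P(B) * P(C))            # True
    print(P(all3) == P(A) * P(B) * P(C))           # False (not mutually independent)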
Total Probability and Bayes’ Theorem
Total probability theorem:
For a partitioning of Ω into events B1, ..., Bn:
P[A] = ∑_{i=1..n} P[A | Bi] P[Bi]
Bayes' theorem:
P[A | B] = P[B | A] P[A] / P[B]
P[A|B] is called posterior probability
P[A] is called prior probability
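A small worked example (the numbers are made up for illustration, not from the slides): computing a posterior with Bayes' theorem, using the total probability theorem for the denominator.

    # A = "person has the disease", B = "test is positive"; illustrative numbers only.
    p_A      = 0.01          # prior P[A]
    p_B_A    = 0.99          # P[B | A]      (sensitivity)
    p_B_notA = 0.05          # P[B | not A]  (false-positive rate)

    # total probability: P[B] = P[B|A] P[A] + P[B|not A] P[not A]
    p_B = p_B_A * p_A + p_B_notA * (1 - p_A)

    # Bayes: posterior P[A|B] = P[B|A] P[A] / P[B]
    p_A_B = p_B_A * p_A / p_B
    print(round(p_A_B, 3))   # ~0.167: the posterior stays modest despite the accurate test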
Random Variables
A random variable (RV) X on the prob. space (Ω, E, P) is a function
X: Ω → M with M ⊆ ℝ s.t. {e | X(e) ≤ x} ∈ E for all x ∈ M
(X is measurable).
FX: M → [0,1] with FX(x) = P[X ≤ x] is the
(cumulative) distribution function (cdf) of X.
With countable set M the function fX: M → [0,1] with fX(x) = P[X = x]
is called the (probability) density function (pdf) of X;
in general fX(x) is F′X(x).
For a random variable X with distribution function F, the inverse function
F⁻¹(q) := inf{x | F(x) > q} for q ∈ [0,1] is called the quantile function of X.
(the 0.5 quantile (50th percentile) is called the median)
Random variables with countable M are called discrete,
otherwise they are called continuous.
For discrete random variables the density function is also
referred to as the probability mass function.
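A quick numeric illustration of the cdf/pdf/quantile relationships, assuming SciPy is available; the standard normal is used only as a convenient continuous example.

    from scipy.stats import norm

    x = 1.0
    print(norm.cdf(x))          # F_X(x) = P[X <= x]  ~ 0.8413
    print(norm.pdf(x))          # f_X(x) = F'_X(x)    ~ 0.2420
    print(norm.ppf(0.5))        # quantile function at q = 0.5: the median, 0.0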
Important Discrete Distributions
• Bernoulli distribution with parameter p:
  P[X = x] = p^x (1−p)^(1−x) for x ∈ {0,1}
• Uniform distribution over {1, 2, ..., m}:
  P[X = k] = fX(k) = 1/m for 1 ≤ k ≤ m
• Binomial distribution (coin toss n times repeated; X: #heads):
  P[X = k] = fX(k) = (n choose k) p^k (1−p)^(n−k)
• Poisson distribution (with rate λ):
  P[X = k] = fX(k) = e^(−λ) λ^k / k!
• Geometric distribution (#coin tosses until first head):
  P[X = k] = fX(k) = (1−p)^k p
• 2-Poisson mixture (with a1+a2=1):
  P[X = k] = fX(k) = a1 e^(−λ1) λ1^k / k! + a2 e^(−λ2) λ2^k / k!
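As a sanity check (assuming SciPy is available), the closed-form pmfs above can be compared against library implementations, here for the binomial and Poisson cases with arbitrary example parameters.

    from math import comb, exp, factorial
    from scipy.stats import binom, poisson

    n, p, k, lam = 10, 0.3, 4, 2.5
    print(comb(n, k) * p**k * (1 - p)**(n - k), binom.pmf(k, n, p))    # same value
    print(exp(-lam) * lam**k / factorial(k),    poisson.pmf(k, lam))   # same value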
Important Continuous Distributions
• Uniform distribution in the interval [a,b]:
  fX(x) = 1/(b−a) for a ≤ x ≤ b (0 otherwise)
• Exponential distribution (e.g. time until next event of a
  Poisson process) with rate λ = lim_{Δt→0} (# events in Δt) / Δt:
  fX(x) = λ e^(−λx) for x ≥ 0 (0 otherwise)
• Hyperexponential distribution:
  fX(x) = p λ1 e^(−λ1 x) + (1−p) λ2 e^(−λ2 x)
• Pareto distribution:
  fX(x) = (a/b) (b/x)^(a+1) for x > b (0 otherwise),
  an example of a "heavy-tailed" distribution with fX(x) ~ c / x^(1+a)
• Logistic distribution: FX(x) = 1 / (1 + e^(−x))
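A minimal simulation sketch (assuming NumPy): the exponential distribution with rate λ has mean 1/λ, so the empirical mean of many samples should be close to it. Note that NumPy parameterizes the exponential by the scale 1/λ rather than the rate.

    import numpy as np

    rng = np.random.default_rng(0)
    lam = 2.0
    samples = rng.exponential(scale=1.0 / lam, size=100_000)   # scale = 1/rate
    print(samples.mean(), 1.0 / lam)                            # both ~0.5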
Normal Distribution (Gaussian Distribution)
• Normal distribution N(μ,σ²) (Gauss distribution;
  approximates sums of independent,
  identically distributed random variables):
  fX(x) = 1/√(2πσ²) · e^(−(x−μ)² / (2σ²))
• Distribution function of N(0,1):
  Φ(z) = ∫_{−∞}^{z} 1/√(2π) · e^(−x²/2) dx
Theorem:
Let X be normally distributed with
expectation μ and variance σ².
Then Y := (X − μ) / σ
is normally distributed with expectation 0 and variance 1.
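A short check of the standardization theorem (assuming SciPy): probabilities under N(μ,σ²) can be computed by applying Φ to the standardized bounds.

    from scipy.stats import norm

    mu, sigma, a, b = 10.0, 2.0, 9.0, 13.0                            # illustrative values
    lhs = norm.cdf(b, loc=mu, scale=sigma) - norm.cdf(a, loc=mu, scale=sigma)
    rhs = norm.cdf((b - mu) / sigma) - norm.cdf((a - mu) / sigma)     # via N(0,1)
    print(lhs, rhs)                                                   # identical values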
Multidimensional (Multivariate) Distributions
Let X1, ..., Xm be random variables over the same prob. space
with domains dom(X1), ..., dom(Xm).
The joint distribution of X1, ..., Xm has a density function
fX1,...,Xm(x1, ..., xm) with
∑_{x1∈dom(X1)} ... ∑_{xm∈dom(Xm)} fX1,...,Xm(x1, ..., xm) = 1   or
∫_{dom(X1)} ... ∫_{dom(Xm)} fX1,...,Xm(x1, ..., xm) dxm ... dx1 = 1
The marginal distribution of Xi in the joint distribution
of X1, ..., Xm has the density function
∑_{x1} ... ∑_{xi−1} ∑_{xi+1} ... ∑_{xm} fX1,...,Xm(x1, ..., xm)   or
∫_{X1} ... ∫_{Xi−1} ∫_{Xi+1} ... ∫_{Xm} fX1,...,Xm(x1, ..., xm) dxm ... dxi+1 dxi−1 ... dx1
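For the discrete case, marginalization is just summing the joint pmf over the other variables. A tiny NumPy sketch with a made-up 2×3 joint table:

    import numpy as np

    joint = np.array([[0.10, 0.20, 0.10],     # rows: values of X1
                      [0.05, 0.25, 0.30]])    # cols: values of X2
    print(joint.sum())           # 1.0  (a valid joint pmf)
    print(joint.sum(axis=1))     # marginal of X1: [0.40, 0.60]
    print(joint.sum(axis=0))     # marginal of X2: [0.15, 0.45, 0.40]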
Important Multivariate Distributions
multinomial distribution (n trials with m-sided dice):
P[X1 = k1 ∧ ... ∧ Xm = km] = fX1,...,Xm(k1, ..., km) = (n choose k1 ... km) p1^k1 ... pm^km
with (n choose k1 ... km) := n! / (k1! ... km!)
multidimensional normal distribution:
fX1,...,Xm(x⃗) = 1/√((2π)^m |Σ|) · e^(−(1/2) (x⃗−μ⃗)^T Σ⁻¹ (x⃗−μ⃗))
with covariance matrix Σ with Σij := Cov(Xi, Xj)
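A small check of the multinomial pmf formula against scipy.stats.multinomial (assumed available), for a 3-sided die rolled n = 6 times with made-up probabilities:

    from math import factorial
    from scipy.stats import multinomial

    n, ks, ps = 6, [1, 2, 3], [0.2, 0.3, 0.5]
    coef = factorial(n) // (factorial(ks[0]) * factorial(ks[1]) * factorial(ks[2]))
    manual = coef * ps[0]**ks[0] * ps[1]**ks[1] * ps[2]**ks[2]
    print(manual, multinomial.pmf(ks, n=n, p=ps))   # same value (~0.135)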
Moments
For a discrete random variable X with density fX
E[X] = ∑_{k∈M} k fX(k) is the expectation value (mean) of X
E[X^i] = ∑_{k∈M} k^i fX(k) is the i-th moment of X
V[X] = E[(X − E[X])²] = E[X²] − E[X]² is the variance of X
For a continuous random variable X with density fX
E[X] = ∫_{−∞}^{∞} x fX(x) dx is the expectation value (mean) of X
E[X^i] = ∫_{−∞}^{∞} x^i fX(x) dx is the i-th moment of X
V[X] = E[(X − E[X])²] = E[X²] − E[X]² is the variance of X
Theorem: Expectation values are additive: E[X + Y] = E[X] + E[Y]
(distributions are not)
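A direct computation sketch for the discrete case, using a made-up pmf over M = {1, 2, 3, 4}:

    ks = [1, 2, 3, 4]
    f  = [0.1, 0.2, 0.3, 0.4]                     # pmf values, sum to 1
    E  = sum(k * p for k, p in zip(ks, f))        # E[X]   = 3.0
    E2 = sum(k**2 * p for k, p in zip(ks, f))     # E[X^2] = 10.0
    var = E2 - E**2                               # V[X]   = 1.0
    print(E, E2, var)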
Properties of Expectation and Variance
E[aX+b] = aE[X]+b for constants a, b
E[X1+X2+...+Xn] = E[X1] + E[X2] + ... + E[Xn]
(i.e. expectation values are generally additive, but distributions are not!)
E[X1+X2+...+XN] = E[N] E[X]
if X1, X2, ..., XN are independent and identically distributed (iid RVs)
with mean E[X] and N is a stopping-time RV
Var[aX+b] = a² Var[X] for constants a, b
Var[X1+X2+...+Xn] = Var[X1] + Var[X2] + ... + Var[Xn]
if X1, X2, ..., Xn are independent RVs
Var[X1+X2+...+XN] = E[N] Var[X] + E[X]² Var[N]
if X1, X2, ..., XN are iid RVs with mean E[X] and variance Var[X]
and N is a stopping-time RV
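A Monte-Carlo sanity check (assuming NumPy) of the two random-sum identities, with N ~ Poisson(λ) independent of iid exponential Xi; all parameters are arbitrary illustrative values.

    import numpy as np

    rng = np.random.default_rng(1)
    lam, mu, runs = 4.0, 2.0, 100_000                 # E[N]=Var[N]=lam, E[X]=1/mu, Var[X]=1/mu^2
    ns = rng.poisson(lam, size=runs)
    sums = np.array([rng.exponential(1.0 / mu, size=n).sum() for n in ns])

    print(sums.mean(), lam * (1.0 / mu))                        # E[S]  ~ 2.0
    print(sums.var(),  lam / mu**2 + (1.0 / mu)**2 * lam)       # Var[S] ~ 2.0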
Correlation of Random Variables
Covariance of random variables Xi and Xj:
Cov(Xi, Xj) := E[(Xi − E[Xi]) (Xj − E[Xj])]
Var(Xi) = Cov(Xi, Xi) = E[Xi²] − E[Xi]²
Correlation coefficient of Xi and Xj:
ρ(Xi, Xj) := Cov(Xi, Xj) / √(Var(Xi) Var(Xj))
Conditional expectation of X given Y=y:
E[X | Y=y] = ∑_x x fX|Y(x | y)      (discrete case)
E[X | Y=y] = ∫ x fX|Y(x | y) dx     (continuous case)
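A short demonstration (assuming NumPy) of covariance and the correlation coefficient on synthetic data where Y is a noisy linear function of X:

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(size=10_000)
    y = 3.0 * x + rng.normal(scale=0.5, size=10_000)

    cov = np.cov(x, y)[0, 1]
    rho = cov / np.sqrt(x.var(ddof=1) * y.var(ddof=1))
    print(rho, np.corrcoef(x, y)[0, 1])    # both close to the theoretical ~0.986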
Transformations of Random Variables
Consider expressions r(X,Y) over RVs such as X+Y, max(X,Y), etc.
1. For each z find Az = {(x,y) | r(x,y) ≤ z}
2. Find cdf FZ(z) = P[r(X,Y) ≤ z] = ∫∫_{Az} fX,Y(x,y) dx dy
3. Find pdf fZ(z) = F′Z(z)
Important case: sum of independent (non-negative) RVs Z = X+Y:
FZ(z) = P[X+Y ≤ z] = ∫∫_{x+y≤z} fX(x) fY(y) dx dy
      = ∫_{y=0}^{z} ∫_{x=0}^{z−y} fX(x) fY(y) dx dy
      = ∫_{x=0}^{z} fX(x) FY(z−x) dx
or in the discrete case:
FZ(z) = ∑∑_{x+y≤z} fX(x) fY(y)
(convolution)
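In the discrete case the convolution can be computed directly; a classic example (assuming NumPy) is the pmf of the sum of two independent fair dice:

    import numpy as np

    die = np.full(6, 1 / 6)                    # pmf of one die on {1,...,6}
    pmf_sum = np.convolve(die, die)            # pmf of the sum on {2,...,12}
    print(pmf_sum[5])                          # P[X+Y = 7] = 6/36 ~ 0.1667
    print(pmf_sum.sum())                       # 1.0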
Generating Functions and Transforms
X, Y, ...: continuous random variables with non-negative real values
A, B, ...: discrete random variables with non-negative integer values

Moment-generating function of X:
MX(s) = ∫_{0}^{∞} e^(sx) fX(x) dx = E[e^(sX)]
Laplace-Stieltjes transform (LST) of X:
f*X(s) = ∫_{0}^{∞} e^(−sx) fX(x) dx = E[e^(−sX)]
Generating function of A (z transform):
GA(z) = ∑_{i≥0} z^i fA(i) = E[z^A]
f*A(s) = MA(−s) = GA(e^(−s))

Examples:
exponential: fX(x) = λ e^(−λx),  f*X(s) = λ / (λ + s)
Erlang-k: fX(x) = λk (λkx)^(k−1) e^(−λkx) / (k−1)!,  f*X(s) = (λk / (λk + s))^k
Poisson: fA(k) = e^(−λ) λ^k / k!,  GA(z) = e^(λ(z−1))
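A symbolic spot check (assuming SymPy is available) of the exponential example: integrating e^(−sx) fX(x) over [0,∞) indeed gives λ/(λ+s).

    import sympy as sp

    x, s, lam = sp.symbols('x s lam', positive=True)
    f_X = lam * sp.exp(-lam * x)                              # exponential density
    lst = sp.integrate(sp.exp(-s * x) * f_X, (x, 0, sp.oo))   # Laplace-Stieltjes transform
    print(sp.simplify(lst))                                   # lam/(lam + s)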
Properties of Transforms
MX(s) = 1 + s E[X] + s² E[X²] / 2! + s³ E[X³] / 3! + ...
E[X^n] = (d^n MX(s) / ds^n)(0)
fA(n) = (1/n!) (d^n GA(z) / dz^n)(0)
E[A] = (dGA(z) / dz)(1)
fX(x) = a g(x) + b h(x)  ⇒  f*(s) = a g*(s) + b h*(s)
fX(x) = g′(x)  ⇒  f*(s) = s g*(s) − g(0)
fX(x) = ∫_{0}^{x} g(t) dt  ⇒  f*(s) = g*(s) / s
Convolution of independent random variables:
FX+Y(z) = ∫_{0}^{z} fX(x) FY(z−x) dx
f*X+Y(s) = f*X(s) f*Y(s)
MX+Y(s) = MX(s) MY(s)
FA+B(k) = ∑_{i=0}^{k} fA(i) FB(k−i)
GA+B(z) = GA(z) GB(z)
Inequalities and Tail Bounds
Markov inequality: P[X ≥ t] ≤ E[X] / t
for t > 0 and non-neg. RV X
Chebyshev inequality: P[|X − E[X]| ≥ t] ≤ Var[X] / t²
for t > 0 and any RV X with finite variance
Chernoff-Hoeffding bound: P[X ≥ t] ≤ inf { e^(−λt) MX(λ) | λ > 0 }
Corollary: P[ |(1/n) ∑_{i=1..n} Xi − p| > t ] ≤ 2 e^(−2nt²)
for Bernoulli(p) iid RVs X1, ..., Xn and any t > 0
Mill's inequality: P[ |Z| > t ] ≤ √(2/π) · e^(−t²/2) / t
for N(0,1) distr. RV Z and t > 0
Cauchy-Schwarz inequality: E[XY] ≤ √(E[X²] E[Y²])
Jensen's inequality: E[g(X)] ≥ g(E[X]) for convex function g
E[g(X)] ≤ g(E[X]) for concave function g
(g is convex if for all c ∈ [0,1] and x1, x2: g(c x1 + (1−c) x2) ≤ c g(x1) + (1−c) g(x2))
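An empirical sketch (assuming NumPy) comparing the Chernoff-Hoeffding corollary with the observed deviation frequency of a Bernoulli sample mean; the bound is loose here but valid.

    import numpy as np

    rng = np.random.default_rng(3)
    n, p, t, runs = 100, 0.5, 0.1, 50_000
    means = rng.binomial(n, p, size=runs) / n          # sample means of n Bernoulli(p) RVs
    empirical = np.mean(np.abs(means - p) > t)         # observed deviation frequency (~0.035)
    bound = 2 * np.exp(-2 * n * t**2)                  # 2*exp(-2) ~ 0.271
    print(empirical, bound)                            # empirical frequency <= bound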
Convergence of Random Variables
Let X1, X2, ... be a sequence of RVs with cdf's F1, F2, ...,
and let X be another RV with cdf F.
• Xn converges to X in probability, Xn →P X, if for every ε > 0
  P[|Xn − X| > ε] → 0 as n → ∞
• Xn converges to X in distribution, Xn →D X, if
  lim_{n→∞} Fn(x) = F(x) at all x for which F is continuous
• Xn converges to X in quadratic mean, Xn →qm X, if
  E[(Xn − X)²] → 0 as n → ∞
• Xn converges to X almost surely, Xn →as X, if P[Xn → X] = 1
Weak law of large numbers (for X̄n := ∑_{i=1..n} Xi / n):
if X1, X2, ..., Xn, ... are iid RVs with mean E[X], then X̄n →P E[X],
that is: lim_{n→∞} P[|X̄n − E[X]| > ε] = 0
Strong law of large numbers:
if X1, X2, ..., Xn, ... are iid RVs with mean E[X], then X̄n →as E[X],
that is: P[lim_{n→∞} |X̄n − E[X]| > ε] = 0
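A quick illustration of the law of large numbers (assuming NumPy): the running sample mean of iid Uniform(0,1) draws drifts toward E[X] = 0.5.

    import numpy as np

    rng = np.random.default_rng(4)
    xs = rng.uniform(size=100_000)
    running_mean = np.cumsum(xs) / np.arange(1, xs.size + 1)
    print(running_mean[[9, 99, 9_999, 99_999]])   # approaches 0.5 as n grows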
Poisson Approximates Binomial
Theorem:
Let X be a random variable with binomial distribution with
parameters n and p := λ/n with large n and small constant λ << 1.
Then lim_{n→∞} fX(k) = e^(−λ) λ^k / k!
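A numeric look (assuming SciPy) at how close the approximation is for example values n = 1000 and λ = 2:

    from scipy.stats import binom, poisson

    n, lam, k = 1000, 2.0, 3
    print(binom.pmf(k, n, lam / n))   # ~0.18062  (exact binomial)
    print(poisson.pmf(k, lam))        # ~0.18045  (Poisson approximation)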
Central Limit Theorem
Theorem:
Let X1, ..., Xn be independent, identically distributed random variables
with expectation μ and variance σ².
The distribution function Fn of the random variable Zn := X1 + ... + Xn
converges to a normal distribution N(nμ, nσ²)
with expectation nμ and variance nσ²:
lim_{n→∞} P[ a ≤ (Zn − nμ) / (σ√n) ≤ b ] = Φ(b) − Φ(a)
Corollary:
X̄ := (1/n) ∑_{i=1..n} Xi
converges to a normal distribution N(μ, σ²/n)
with expectation μ and variance σ²/n.
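A CLT simulation sketch (assuming NumPy and SciPy): standardized sums of iid Uniform(0,1) RVs behave approximately like N(0,1).

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(5)
    n, runs = 50, 100_000
    mu, sigma = 0.5, np.sqrt(1 / 12)                 # mean and std of Uniform(0,1)
    z = (rng.uniform(size=(runs, n)).sum(axis=1) - n * mu) / (sigma * np.sqrt(n))
    print(np.mean(z <= 1.0), norm.cdf(1.0))          # both ~0.841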
Elementary Information Theory
Let f(x) be the probability (or relative frequency) of the x-th symbol
in some text d. The entropy of the text
(or the underlying prob. distribution f) is:
H(d) = ∑_x f(x) log2 (1 / f(x))
H(d) is a lower bound for the bits per symbol
needed with optimal coding (compression).
For two prob. distributions f(x) and g(x) the
relative entropy (Kullback-Leibler divergence) of f to g is:
D(f || g) := ∑_x f(x) log (f(x) / g(x))
Relative entropy is a measure for the (dis-)similarity of
two probability or frequency distributions.
It corresponds to the average number of additional bits
needed for coding information (events) with distribution f
when using an optimal code for distribution g.
The cross entropy of f(x) to g(x) is:
H(f, g) := H(f) + D(f || g) = − ∑_x f(x) log g(x)
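A direct computation of entropy, KL divergence, and cross entropy for two made-up distributions over three symbols (logarithms base 2, i.e. bits):

    from math import log2

    f = [0.5, 0.25, 0.25]
    g = [0.4, 0.4, 0.2]

    H_f  = sum(p * log2(1 / p) for p in f)                 # entropy H(f) = 1.5 bits
    D_fg = sum(p * log2(p / q) for p, q in zip(f, g))      # KL divergence D(f||g) ~ 0.072
    H_fg = -sum(p * log2(q) for p, q in zip(f, g))         # cross entropy H(f,g) ~ 1.572
    print(H_f, D_fg, H_fg, H_f + D_fg)                     # H(f,g) == H(f) + D(f||g)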
Compression
• Text is sequence of symbols (with specific frequencies)
• Symbols can be
  • letters or other characters from some alphabet Σ
  • strings of fixed length (e.g. trigrams)
  • or words, bits, syllables, phrases, etc.
Limits of compression:
Let pi be the probability (or relative frequency)
of the i-th symbol in text d.
Then the entropy of the text, H(d) = ∑_i pi log2 (1 / pi),
is a lower bound for the average number of bits per symbol
in any compression (e.g. Huffman codes).
Note:
compression schemes such as Ziv-Lempel (used in zip)
are better because they consider context beyond single symbols;
with appropriately generalized notions of entropy
the lower-bound theorem does still hold.
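An illustrative comparison using only the Python standard library: the per-symbol entropy of a repetitive text versus the bits per symbol actually achieved by a Ziv-Lempel-style compressor (zlib), which can go below the symbol-level bound precisely because it exploits context beyond single symbols.

    import zlib
    from collections import Counter
    from math import log2

    text = "abracadabra " * 200                               # highly repetitive sample text
    counts = Counter(text)
    n = len(text)
    H = sum((c / n) * log2(n / c) for c in counts.values())   # symbol entropy, ~2.28 bits/symbol

    compressed = zlib.compress(text.encode("ascii"), level=9)
    print(H)                                # lower bound for symbol-by-symbol codes
    print(8 * len(compressed) / n)          # zlib: far fewer bits/symbol, thanks to repetition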