Chapter 2: Basics from Probability Theory
and Statistics
2.1 Probability Theory
Events, Probabilities, Random Variables, Distributions, Moments
Generating Functions, Deviation Bounds, Limit Theorems
Basics from Information Theory
2.2 Statistical Inference: Sampling and Estimation
Moment Estimation, Confidence Intervals
Parameter Estimation, Maximum Likelihood, EM Iteration
2.3 Statistical Inference: Hypothesis Testing and Regression
Statistical Tests, p-Values, Chi-Square Test
Linear and Logistic Regression
mostly following L. Wasserman Chapters 1-5, with additions from other textbooks on stochastics
2.1 Basic Probability Theory
A probability space is a triple (Ω, E, P) with
• a set Ω of elementary events (sample space),
• a family E of subsets of Ω with Ω ∈ E which is closed under
  ∪, ∩, and ¬ (complement) with a countable number of operands
  (with finite Ω usually E = 2^Ω), and
• a probability measure P: E → [0,1] with P[Ω] = 1 and
  P[∪i Ai] = ∑i P[Ai] for countably many, pairwise disjoint Ai
Properties of P:
P[A] + P[¬A] = 1
P[A ∪ B] = P[A] + P[B] − P[A ∩ B]
P[∅] = 0 (null/impossible event)
P[Ω] = 1 (true/certain event)
Independence and Conditional Probabilities
Two events A, B of a prob. space are independent
if P[A ∩ B] = P[A] P[B].
A finite set of events A = {A1, ..., An} is independent
if for every subset S ⊆ A the equation
P[∩_{Ai∈S} Ai] = ∏_{Ai∈S} P[Ai]
holds.
The conditional probability P[A | B] of A under the
condition (hypothesis) B is defined as:
P[A | B] = P[A ∩ B] / P[B]
Event A is conditionally independent of B given C
if P[A | B ∩ C] = P[A | C].
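As an aside (not from the original slides), the following small Python sketch shows why the product condition must hold for every subset S and not just for pairs: over two fair coin tosses, the events "first coin heads", "second coin heads", and "both coins agree" are pairwise independent but not mutually independent.

    # Pairwise independence does not imply mutual independence.
    from itertools import product
    from fractions import Fraction

    omega = list(product("HT", repeat=2))          # sample space, all outcomes equally likely
    P = lambda event: Fraction(sum(1 for w in omega if event(w)), len(omega))

    A = lambda w: w[0] == "H"                      # first coin heads
    B = lambda w: w[1] == "H"                      # second coin heads
    C = lambda w: w[0] == w[1]                     # both coins agree

    both = lambda e1, e2: (lambda w: e1(w) and e2(w))
    all3 = lambda w: A(w) and B(w) and C(w)

    print(P(both(A, B)) == P(A) * P(B))            # True  (pairwise independent)
    print(P(both(A, C)) == P(A) * P(C))            # True
    print(P(both(B, C)) == P(B) * P(C))            # True
    print(P(all3) == P(A) * P(B) * P(C))           # False (not mutually independent)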
Total Probability and Bayes’ Theorem
Total probability theorem:
For a partitioning of Ω into events B1, ..., Bn:
P[A] = ∑_{i=1..n} P[A | Bi] P[Bi]
Bayes' theorem:
P[A | B] = P[B | A] P[A] / P[B]
P[A|B] is called posterior probability
P[A] is called prior probability
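A small worked example (the numbers are made up for illustration, not from the slides): computing a posterior with Bayes' theorem, using the total probability theorem for the denominator.

    # A = "person has the disease", B = "test is positive"; illustrative numbers only.
    p_A      = 0.01          # prior P[A]
    p_B_A    = 0.99          # P[B | A]      (sensitivity)
    p_B_notA = 0.05          # P[B | not A]  (false-positive rate)

    # total probability: P[B] = P[B|A] P[A] + P[B|not A] P[not A]
    p_B = p_B_A * p_A + p_B_notA * (1 - p_A)

    # Bayes: posterior P[A|B] = P[B|A] P[A] / P[B]
    p_A_B = p_B_A * p_A / p_B
    print(round(p_A_B, 3))   # ~0.167: the posterior stays modest despite the accurate test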
Random Variables
A random variable (RV) X on the prob. space (Ω, E, P) is a function
X: Ω → M with M ⊆ ℝ s.t. {e | X(e) ≤ x} ∈ E for all x ∈ M
(X is measurable).
FX: M → [0,1] with FX(x) = P[X ≤ x] is the
(cumulative) distribution function (cdf) of X.
With countable set M the function fX: M → [0,1] with fX(x) = P[X = x]
is called the (probability) density function (pdf) of X;
in general fX(x) is F′X(x).
For a random variable X with distribution function F, the inverse function
F⁻¹(q) := inf{x | F(x) > q} for q ∈ [0,1] is called the quantile function of X.
(the 0.5 quantile (50th percentile) is called the median)
Random variables with countable M are called discrete,
otherwise they are called continuous.
For discrete random variables the density function is also
referred to as the probability mass function.
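A quick numeric illustration of the cdf/pdf/quantile relationships, assuming SciPy is available; the standard normal is used only as a convenient continuous example.

    from scipy.stats import norm

    x = 1.0
    print(norm.cdf(x))          # F_X(x) = P[X <= x]  ~ 0.8413
    print(norm.pdf(x))          # f_X(x) = F'_X(x)    ~ 0.2420
    print(norm.ppf(0.5))        # quantile function at q = 0.5: the median, 0.0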
Important Discrete Distributions
• Bernoulli distribution with parameter p:
  P[X = x] = p^x (1−p)^(1−x) for x ∈ {0,1}
• Uniform distribution over {1, 2, ..., m}:
  P[X = k] = fX(k) = 1/m for 1 ≤ k ≤ m
• Binomial distribution (coin toss n times repeated; X: #heads):
  P[X = k] = fX(k) = (n choose k) p^k (1−p)^(n−k)
• Poisson distribution (with rate λ):
  P[X = k] = fX(k) = e^(−λ) λ^k / k!
• Geometric distribution (#coin tosses until first head):
  P[X = k] = fX(k) = (1−p)^k p
• 2-Poisson mixture (with a1+a2=1):
  P[X = k] = fX(k) = a1 e^(−λ1) λ1^k / k! + a2 e^(−λ2) λ2^k / k!
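As a sanity check (assuming SciPy is available), the closed-form pmfs above can be compared against library implementations, here for the binomial and Poisson cases with arbitrary example parameters.

    from math import comb, exp, factorial
    from scipy.stats import binom, poisson

    n, p, k, lam = 10, 0.3, 4, 2.5
    print(comb(n, k) * p**k * (1 - p)**(n - k), binom.pmf(k, n, p))    # same value
    print(exp(-lam) * lam**k / factorial(k),    poisson.pmf(k, lam))   # same value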
Important Continuous Distributions
• Uniform distribution in the interval [a,b]:
  fX(x) = 1/(b−a) for a ≤ x ≤ b (0 otherwise)
• Exponential distribution (e.g. time until next event of a
  Poisson process) with rate λ = lim_{Δt→0} (# events in Δt) / Δt:
  fX(x) = λ e^(−λx) for x ≥ 0 (0 otherwise)
• Hyperexponential distribution:
  fX(x) = p λ1 e^(−λ1 x) + (1−p) λ2 e^(−λ2 x)
• Pareto distribution:
  fX(x) = (a/b) (b/x)^(a+1) for x > b (0 otherwise),
  an example of a "heavy-tailed" distribution with fX(x) ~ c / x^(1+a)
• Logistic distribution: FX(x) = 1 / (1 + e^(−x))
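A minimal simulation sketch (assuming NumPy): the exponential distribution with rate λ has mean 1/λ, so the empirical mean of many samples should be close to it. Note that NumPy parameterizes the exponential by the scale 1/λ rather than the rate.

    import numpy as np

    rng = np.random.default_rng(0)
    lam = 2.0
    samples = rng.exponential(scale=1.0 / lam, size=100_000)   # scale = 1/rate
    print(samples.mean(), 1.0 / lam)                            # both ~0.5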
Normal Distribution (Gaussian Distribution)
• Normal distribution N(μ,σ²) (Gauss distribution;
  approximates sums of independent,
  identically distributed random variables):
  fX(x) = 1/√(2πσ²) · e^(−(x−μ)² / (2σ²))
• Distribution function of N(0,1):
  Φ(z) = ∫_{−∞}^{z} 1/√(2π) · e^(−x²/2) dx
Theorem:
Let X be normally distributed with
expectation μ and variance σ².
Then Y := (X − μ) / σ
is normally distributed with expectation 0 and variance 1.
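A short check of the standardization theorem (assuming SciPy): probabilities under N(μ,σ²) can be computed by applying Φ to the standardized bounds.

    from scipy.stats import norm

    mu, sigma, a, b = 10.0, 2.0, 9.0, 13.0                            # illustrative values
    lhs = norm.cdf(b, loc=mu, scale=sigma) - norm.cdf(a, loc=mu, scale=sigma)
    rhs = norm.cdf((b - mu) / sigma) - norm.cdf((a - mu) / sigma)     # via N(0,1)
    print(lhs, rhs)                                                   # identical values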
Multidimensional (Multivariate) Distributions
Let X1, ..., Xm be random variables over the same prob. space
with domains dom(X1), ..., dom(Xm).
The joint distribution of X1, ..., Xm has a density function
fX1,...,Xm(x1, ..., xm) with
∑_{x1∈dom(X1)} ... ∑_{xm∈dom(Xm)} fX1,...,Xm(x1, ..., xm) = 1   or
∫_{dom(X1)} ... ∫_{dom(Xm)} fX1,...,Xm(x1, ..., xm) dxm ... dx1 = 1
The marginal distribution of Xi in the joint distribution
of X1, ..., Xm has the density function
∑_{x1} ... ∑_{xi−1} ∑_{xi+1} ... ∑_{xm} fX1,...,Xm(x1, ..., xm)   or
∫_{X1} ... ∫_{Xi−1} ∫_{Xi+1} ... ∫_{Xm} fX1,...,Xm(x1, ..., xm) dxm ... dxi+1 dxi−1 ... dx1
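For the discrete case, marginalization is just summing the joint pmf over the other variables. A tiny NumPy sketch with a made-up 2×3 joint table:

    import numpy as np

    joint = np.array([[0.10, 0.20, 0.10],     # rows: values of X1
                      [0.05, 0.25, 0.30]])    # cols: values of X2
    print(joint.sum())           # 1.0  (a valid joint pmf)
    print(joint.sum(axis=1))     # marginal of X1: [0.40, 0.60]
    print(joint.sum(axis=0))     # marginal of X2: [0.15, 0.45, 0.40]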
Important Multivariate Distributions
multinomial distribution (n trials with m-sided dice):
P[X1 = k1 ∧ ... ∧ Xm = km] = fX1,...,Xm(k1, ..., km) = (n choose k1 ... km) p1^k1 ... pm^km
with (n choose k1 ... km) := n! / (k1! ... km!)
multidimensional normal distribution:
fX1,...,Xm(x⃗) = 1/√((2π)^m |Σ|) · e^(−(1/2) (x⃗−μ⃗)^T Σ⁻¹ (x⃗−μ⃗))
with covariance matrix Σ with Σij := Cov(Xi, Xj)
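A small check of the multinomial pmf formula against scipy.stats.multinomial (assumed available), for a 3-sided die rolled n = 6 times with made-up probabilities:

    from math import factorial
    from scipy.stats import multinomial

    n, ks, ps = 6, [1, 2, 3], [0.2, 0.3, 0.5]
    coef = factorial(n) // (factorial(ks[0]) * factorial(ks[1]) * factorial(ks[2]))
    manual = coef * ps[0]**ks[0] * ps[1]**ks[1] * ps[2]**ks[2]
    print(manual, multinomial.pmf(ks, n=n, p=ps))   # same value (~0.135)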
Moments
For a discrete random variable X with density fX
E[X] = ∑_{k∈M} k fX(k) is the expectation value (mean) of X
E[X^i] = ∑_{k∈M} k^i fX(k) is the i-th moment of X
V[X] = E[(X − E[X])²] = E[X²] − E[X]² is the variance of X
For a continuous random variable X with density fX
E[X] = ∫_{−∞}^{∞} x fX(x) dx is the expectation value (mean) of X
E[X^i] = ∫_{−∞}^{∞} x^i fX(x) dx is the i-th moment of X
V[X] = E[(X − E[X])²] = E[X²] − E[X]² is the variance of X
Theorem: Expectation values are additive: E[X + Y] = E[X] + E[Y]
(distributions are not)
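A direct computation sketch for the discrete case, using a made-up pmf over M = {1, 2, 3, 4}:

    ks = [1, 2, 3, 4]
    f  = [0.1, 0.2, 0.3, 0.4]                     # pmf values, sum to 1
    E  = sum(k * p for k, p in zip(ks, f))        # E[X]   = 3.0
    E2 = sum(k**2 * p for k, p in zip(ks, f))     # E[X^2] = 10.0
    var = E2 - E**2                               # V[X]   = 1.0
    print(E, E2, var)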
Properties of Expectation and Variance
E[aX+b] = aE[X]+b for constants a, b
E[X1+X2+...+Xn] = E[X1] + E[X2] + ... + E[Xn]
(i.e. expectation values are generally additive, but distributions are not!)
E[X1+X2+...+XN] = E[N] E[X]
if X1, X2, ..., XN are independent and identically distributed (iid RVs)
with mean E[X] and N is a stopping-time RV
Var[aX+b] = a² Var[X] for constants a, b
Var[X1+X2+...+Xn] = Var[X1] + Var[X2] + ... + Var[Xn]
if X1, X2, ..., Xn are independent RVs
Var[X1+X2+...+XN] = E[N] Var[X] + E[X]² Var[N]
if X1, X2, ..., XN are iid RVs with mean E[X] and variance Var[X]
and N is a stopping-time RV
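A Monte-Carlo sanity check (assuming NumPy) of the two random-sum identities, with N ~ Poisson(λ) independent of iid exponential Xi; all parameters are arbitrary illustrative values.

    import numpy as np

    rng = np.random.default_rng(1)
    lam, mu, runs = 4.0, 2.0, 100_000                 # E[N]=Var[N]=lam, E[X]=1/mu, Var[X]=1/mu^2
    ns = rng.poisson(lam, size=runs)
    sums = np.array([rng.exponential(1.0 / mu, size=n).sum() for n in ns])

    print(sums.mean(), lam * (1.0 / mu))                        # E[S]  ~ 2.0
    print(sums.var(),  lam / mu**2 + (1.0 / mu)**2 * lam)       # Var[S] ~ 2.0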
Correlation of Random Variables
Covariance of random variables Xi and Xj:
Cov(Xi, Xj) := E[(Xi − E[Xi]) (Xj − E[Xj])]
Var(Xi) = Cov(Xi, Xi) = E[Xi²] − E[Xi]²
Correlation coefficient of Xi and Xj:
ρ(Xi, Xj) := Cov(Xi, Xj) / √(Var(Xi) Var(Xj))
Conditional expectation of X given Y=y:
E[X | Y=y] = ∑_x x fX|Y(x | y)      (discrete case)
E[X | Y=y] = ∫ x fX|Y(x | y) dx     (continuous case)
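A short demonstration (assuming NumPy) of covariance and the correlation coefficient on synthetic data where Y is a noisy linear function of X:

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(size=10_000)
    y = 3.0 * x + rng.normal(scale=0.5, size=10_000)

    cov = np.cov(x, y)[0, 1]
    rho = cov / np.sqrt(x.var(ddof=1) * y.var(ddof=1))
    print(rho, np.corrcoef(x, y)[0, 1])    # both close to the theoretical ~0.986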
Transformations of Random Variables
Consider expressions r(X,Y) over RVs such as X+Y, max(X,Y), etc.
1. For each z find Az = {(x,y) | r(x,y) ≤ z}
2. Find cdf FZ(z) = P[r(X,Y) ≤ z] = ∫∫_{Az} fX,Y(x,y) dx dy
3. Find pdf fZ(z) = F′Z(z)
Important case: sum of independent (non-negative) RVs Z = X+Y:
FZ(z) = P[X+Y ≤ z] = ∫∫_{x+y≤z} fX(x) fY(y) dx dy
      = ∫_{y=0}^{z} ∫_{x=0}^{z−y} fX(x) fY(y) dx dy
      = ∫_{x=0}^{z} fX(x) FY(z−x) dx
or in the discrete case:
FZ(z) = ∑∑_{x+y≤z} fX(x) fY(y)
(convolution)
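In the discrete case the convolution can be computed directly; a classic example (assuming NumPy) is the pmf of the sum of two independent fair dice:

    import numpy as np

    die = np.full(6, 1 / 6)                    # pmf of one die on {1,...,6}
    pmf_sum = np.convolve(die, die)            # pmf of the sum on {2,...,12}
    print(pmf_sum[5])                          # P[X+Y = 7] = 6/36 ~ 0.1667
    print(pmf_sum.sum())                       # 1.0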
Generating Functions and Transforms
X, Y, ...: continuous random variables with non-negative real values
A, B, ...: discrete random variables with non-negative integer values

Moment-generating function of X:
MX(s) = ∫_{0}^{∞} e^(sx) fX(x) dx = E[e^(sX)]
Laplace-Stieltjes transform (LST) of X:
f*X(s) = ∫_{0}^{∞} e^(−sx) fX(x) dx = E[e^(−sX)]
Generating function of A (z transform):
GA(z) = ∑_{i≥0} z^i fA(i) = E[z^A]
f*A(s) = MA(−s) = GA(e^(−s))

Examples:
exponential: fX(x) = λ e^(−λx),  f*X(s) = λ / (λ + s)
Erlang-k: fX(x) = λk (λkx)^(k−1) e^(−λkx) / (k−1)!,  f*X(s) = (λk / (λk + s))^k
Poisson: fA(k) = e^(−λ) λ^k / k!,  GA(z) = e^(λ(z−1))
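A symbolic spot check (assuming SymPy is available) of the exponential example: integrating e^(−sx) fX(x) over [0,∞) indeed gives λ/(λ+s).

    import sympy as sp

    x, s, lam = sp.symbols('x s lam', positive=True)
    f_X = lam * sp.exp(-lam * x)                              # exponential density
    lst = sp.integrate(sp.exp(-s * x) * f_X, (x, 0, sp.oo))   # Laplace-Stieltjes transform
    print(sp.simplify(lst))                                   # lam/(lam + s)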
Properties of Transforms
MX(s) = 1 + s E[X] + s² E[X²] / 2! + s³ E[X³] / 3! + ...
E[X^n] = (d^n MX(s) / ds^n)(0)
fA(n) = (1/n!) (d^n GA(z) / dz^n)(0)
E[A] = (dGA(z) / dz)(1)
fX(x) = a g(x) + b h(x)  ⇒  f*(s) = a g*(s) + b h*(s)
fX(x) = g′(x)  ⇒  f*(s) = s g*(s) − g(0)
fX(x) = ∫_{0}^{x} g(t) dt  ⇒  f*(s) = g*(s) / s
Convolution of independent random variables:
FX+Y(z) = ∫_{0}^{z} fX(x) FY(z−x) dx
f*X+Y(s) = f*X(s) f*Y(s)
MX+Y(s) = MX(s) MY(s)
FA+B(k) = ∑_{i=0}^{k} fA(i) FB(k−i)
GA+B(z) = GA(z) GB(z)
Inequalities and Tail Bounds
Markov inequality: P[X ≥ t] ≤ E[X] / t
for t > 0 and non-neg. RV X
Chebyshev inequality: P[|X − E[X]| ≥ t] ≤ Var[X] / t²
for t > 0 and any RV X with finite variance
Chernoff-Hoeffding bound: P[X ≥ t] ≤ inf { e^(−λt) MX(λ) | λ > 0 }
Corollary: P[ |(1/n) ∑_{i=1..n} Xi − p| > t ] ≤ 2 e^(−2nt²)
for Bernoulli(p) iid RVs X1, ..., Xn and any t > 0
Mill's inequality: P[ |Z| > t ] ≤ √(2/π) · e^(−t²/2) / t
for N(0,1) distr. RV Z and t > 0
Cauchy-Schwarz inequality: E[XY] ≤ √(E[X²] E[Y²])
Jensen's inequality: E[g(X)] ≥ g(E[X]) for convex function g
E[g(X)] ≤ g(E[X]) for concave function g
(g is convex if for all c ∈ [0,1] and x1, x2: g(c x1 + (1−c) x2) ≤ c g(x1) + (1−c) g(x2))
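An empirical sketch (assuming NumPy) comparing the Chernoff-Hoeffding corollary with the observed deviation frequency of a Bernoulli sample mean; the bound is loose here but valid.

    import numpy as np

    rng = np.random.default_rng(3)
    n, p, t, runs = 100, 0.5, 0.1, 50_000
    means = rng.binomial(n, p, size=runs) / n          # sample means of n Bernoulli(p) RVs
    empirical = np.mean(np.abs(means - p) > t)         # observed deviation frequency (~0.035)
    bound = 2 * np.exp(-2 * n * t**2)                  # 2*exp(-2) ~ 0.271
    print(empirical, bound)                            # empirical frequency <= bound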
Convergence of Random Variables
Let X1, X2, ... be a sequence of RVs with cdf's F1, F2, ...,
and let X be another RV with cdf F.
• Xn converges to X in probability, Xn →P X, if for every ε > 0
  P[|Xn − X| > ε] → 0 as n → ∞
• Xn converges to X in distribution, Xn →D X, if
  lim_{n→∞} Fn(x) = F(x) at all x for which F is continuous
• Xn converges to X in quadratic mean, Xn →qm X, if
  E[(Xn − X)²] → 0 as n → ∞
• Xn converges to X almost surely, Xn →as X, if P[Xn → X] = 1
Weak law of large numbers (for X̄n := ∑_{i=1..n} Xi / n):
if X1, X2, ..., Xn, ... are iid RVs with mean E[X], then X̄n →P E[X],
that is: lim_{n→∞} P[|X̄n − E[X]| > ε] = 0
Strong law of large numbers:
if X1, X2, ..., Xn, ... are iid RVs with mean E[X], then X̄n →as E[X],
that is: P[lim_{n→∞} |X̄n − E[X]| > ε] = 0
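A quick illustration of the law of large numbers (assuming NumPy): the running sample mean of iid Uniform(0,1) draws drifts toward E[X] = 0.5.

    import numpy as np

    rng = np.random.default_rng(4)
    xs = rng.uniform(size=100_000)
    running_mean = np.cumsum(xs) / np.arange(1, xs.size + 1)
    print(running_mean[[9, 99, 9_999, 99_999]])   # approaches 0.5 as n grows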
Poisson Approximates Binomial
Theorem:
Let X be a random variable with binomial distribution with
parameters n and p := λ/n with large n and small constant λ << 1.
Then lim_{n→∞} fX(k) = e^(−λ) λ^k / k!
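A numeric look (assuming SciPy) at how close the approximation is for example values n = 1000 and λ = 2:

    from scipy.stats import binom, poisson

    n, lam, k = 1000, 2.0, 3
    print(binom.pmf(k, n, lam / n))   # ~0.18062  (exact binomial)
    print(poisson.pmf(k, lam))        # ~0.18045  (Poisson approximation)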
Central Limit Theorem
Theorem:
Let X1, ..., Xn be independent, identically distributed random variables
with expectation μ and variance σ².
The distribution function Fn of the random variable Zn := X1 + ... + Xn
converges to a normal distribution N(nμ, nσ²)
with expectation nμ and variance nσ²:
lim_{n→∞} P[ a ≤ (Zn − nμ) / (σ√n) ≤ b ] = Φ(b) − Φ(a)
Corollary:
X̄ := (1/n) ∑_{i=1..n} Xi
converges to a normal distribution N(μ, σ²/n)
with expectation μ and variance σ²/n.
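A CLT simulation sketch (assuming NumPy and SciPy): standardized sums of iid Uniform(0,1) RVs behave approximately like N(0,1).

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(5)
    n, runs = 50, 100_000
    mu, sigma = 0.5, np.sqrt(1 / 12)                 # mean and std of Uniform(0,1)
    z = (rng.uniform(size=(runs, n)).sum(axis=1) - n * mu) / (sigma * np.sqrt(n))
    print(np.mean(z <= 1.0), norm.cdf(1.0))          # both ~0.841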
Elementary Information Theory
Let f(x) be the probability (or relative frequency) of the x-th symbol
in some text d. The entropy of the text
(or the underlying prob. distribution f) is:
H(d) = ∑_x f(x) log2 (1 / f(x))
H(d) is a lower bound for the bits per symbol
needed with optimal coding (compression).
For two prob. distributions f(x) and g(x) the
relative entropy (Kullback-Leibler divergence) of f to g is:
D(f || g) := ∑_x f(x) log (f(x) / g(x))
Relative entropy is a measure for the (dis-)similarity of
two probability or frequency distributions.
It corresponds to the average number of additional bits
needed for coding information (events) with distribution f
when using an optimal code for distribution g.
The cross entropy of f(x) to g(x) is:
H(f, g) := H(f) + D(f || g) = − ∑_x f(x) log g(x)
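A direct computation of entropy, KL divergence, and cross entropy for two made-up distributions over three symbols (logarithms base 2, i.e. bits):

    from math import log2

    f = [0.5, 0.25, 0.25]
    g = [0.4, 0.4, 0.2]

    H_f  = sum(p * log2(1 / p) for p in f)                 # entropy H(f) = 1.5 bits
    D_fg = sum(p * log2(p / q) for p, q in zip(f, g))      # KL divergence D(f||g) ~ 0.072
    H_fg = -sum(p * log2(q) for p, q in zip(f, g))         # cross entropy H(f,g) ~ 1.572
    print(H_f, D_fg, H_fg, H_f + D_fg)                     # H(f,g) == H(f) + D(f||g)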
Compression
• Text is sequence of symbols (with specific frequencies)
• Symbols can be
  • letters or other characters from some alphabet Σ
  • strings of fixed length (e.g. trigrams)
  • or words, bits, syllables, phrases, etc.
Limits of compression:
Let pi be the probability (or relative frequency)
of the i-th symbol in text d.
Then the entropy of the text, H(d) = ∑_i pi log2 (1 / pi),
is a lower bound for the average number of bits per symbol
in any compression (e.g. Huffman codes).
Note:
compression schemes such as Ziv-Lempel (used in zip)
are better because they consider context beyond single symbols;
with appropriately generalized notions of entropy
the lower-bound theorem does still hold.
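An illustrative comparison using only the Python standard library: the per-symbol entropy of a repetitive text versus the bits per symbol actually achieved by a Ziv-Lempel-style compressor (zlib), which can go below the symbol-level bound precisely because it exploits context beyond single symbols.

    import zlib
    from collections import Counter
    from math import log2

    text = "abracadabra " * 200                               # highly repetitive sample text
    counts = Counter(text)
    n = len(text)
    H = sum((c / n) * log2(n / c) for c in counts.values())   # symbol entropy, ~2.28 bits/symbol

    compressed = zlib.compress(text.encode("ascii"), level=9)
    print(H)                                # lower bound for symbol-by-symbol codes
    print(8 * len(compressed) / n)          # zlib: far fewer bits/symbol, thanks to repetition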