Lecture 1: Basic Statistical Tools

Discrete and Continuous Random Variables
A random variable (RV) is an outcome (realization) that is not a set value, but rather is drawn from some probability distribution.
A discrete RV x takes on values X_1, X_2, ..., X_k. Its probability distribution is given by P_i = Pr(x = X_i), with P_i > 0 and \sum_i P_i = 1.
A continuous RV x can take on any value in some interval (or set of intervals). Its probability distribution is defined by the probability density function p(x), where

Pr(x_1 \le x \le x_2) = \int_{x_1}^{x_2} p(x)\, dx, \qquad p(x) \ge 0, \qquad \int_{-\infty}^{+\infty} p(x)\, dx = 1

Joint and Conditional Probabilities
The probability for a pair (x, y) of random variables is specified by the joint probability density function p(x, y):

Pr(y_1 \le y \le y_2,\; x_1 \le x \le x_2) = \int_{y_1}^{y_2} \int_{x_1}^{x_2} p(x, y)\, dx\, dy

The marginal density of x:

p(x) = \int_{-\infty}^{+\infty} p(x, y)\, dy

p(y | x), the conditional density of y given x:

Pr(y_1 \le y \le y_2 \mid x) = \int_{y_1}^{y_2} p(y \mid x)\, dy

Relationships among p(x), p(x, y), and p(y | x):

p(x, y) = p(y \mid x)\, p(x), \qquad \text{hence} \qquad p(y \mid x) = \frac{p(x, y)}{p(x)}

x and y are said to be independent if p(x, y) = p(x) p(y). Note that p(y | x) = p(y) if x and y are independent.

Bayes' Theorem
Suppose an unobservable RV takes on values b_1, ..., b_n. Suppose that we observe the outcome A of an RV correlated with b. What can we say about b given A? Bayes' theorem:

Pr(b_j \mid A) = \frac{Pr(b_j)\, Pr(A \mid b_j)}{Pr(A)} = \frac{Pr(b_j)\, Pr(A \mid b_j)}{\sum_{i=1}^{n} Pr(b_i)\, Pr(A \mid b_i)}

A typical application in genetics is that A is some phenotype and b indexes some underlying (but unknown) genotype (a short numerical check of this calculation appears after the Normal distribution slide below):

Genotype                        QQ     Qq     qq
Freq(genotype)                  0.5    0.3    0.2
Pr(height > 70 | genotype)      0.3    0.6    0.9

Pr(height > 70) = 0.3*0.5 + 0.6*0.3 + 0.9*0.2 = 0.51
Pr(QQ | height > 70) = Pr(QQ) * Pr(height > 70 | QQ) / Pr(height > 70) = 0.5*0.3 / 0.51 = 0.294

Expectations of Random Variables
The expected value, E[f(x)], of some function f of the random variable x is just the average value of that function:

E[f(x)] = \sum_i Pr(x = X_i)\, f(X_i) \quad (x \text{ discrete}), \qquad E[f(x)] = \int_{-\infty}^{+\infty} f(x)\, p(x)\, dx \quad (x \text{ continuous})

E[x] is the (arithmetic) mean, μ, of the random variable x:

E(x) = \mu = \int_{-\infty}^{+\infty} x\, p(x)\, dx

E[(x − μ)²] = σ², the variance of x:

E\left[(x - \mu)^2\right] = \sigma^2 = \int_{-\infty}^{+\infty} (x - \mu)^2\, p(x)\, dx

More generally, the rth moment about the mean is given by E[(x − μ)^r]:
r = 2: variance; r = 3: skew; r = 4: (scaled) kurtosis
Useful properties of expectations:

E[g(x) + f(y)] = E[g(x)] + E[f(y)], \qquad E(c\, x) = c\, E(x)

The Normal (or Gaussian) Distribution
A normal RV with mean μ and variance σ² has density:

p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( - \frac{(x - \mu)^2}{2 \sigma^2} \right)

The mean (μ) is the peak of the distribution. The variance is a measure of spread about the mean.
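To make the Bayes' theorem genotype example above concrete, here is a minimal Python sketch that reproduces the calculation; the variable names are mine, and only the genotype frequencies and conditional probabilities come from the table on that slide.

    # Prior genotype frequencies and Pr(height > 70 | genotype), from the slide
    prior = {"QQ": 0.5, "Qq": 0.3, "qq": 0.2}
    p_tall_given = {"QQ": 0.3, "Qq": 0.6, "qq": 0.9}

    # Marginal probability of the phenotype (law of total probability)
    p_tall = sum(prior[g] * p_tall_given[g] for g in prior)

    # Posterior probability of each genotype given height > 70 (Bayes' theorem)
    posterior = {g: prior[g] * p_tall_given[g] / p_tall for g in prior}

    print(round(p_tall, 2))              # 0.51
    print(round(posterior["QQ"], 3))     # 0.294, matching the slide

The same arithmetic applies to any discrete prior and observed outcome A: multiply each prior probability by Pr(A | state), sum to get Pr(A), and divide.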
The smaller σ², the narrower the distribution about the mean.

The Truncated Normal
Only consider values of T or above in a normal distribution. The density function of the truncated distribution is

p(x \mid x > T) = \frac{p(x)}{Pr(x > T)}

Mean of the truncated distribution:

E[z \mid z > T] = \int_T^{\infty} z\, \frac{p(z)}{Pr(z > T)}\, dz = \mu + \frac{\sigma^2\, p_T}{\pi_T}

Here p_T is the height of the normal density at the truncation point T,

p_T = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( - \frac{(T - \mu)^2}{2 \sigma^2} \right)

and \pi_T = Pr(z > T).
Variance of the truncated distribution:

\sigma^2(z \mid z > T) = \sigma^2 \left[\, 1 + (T - \mu)\, \frac{p_T}{\pi_T} - \left( \frac{\sigma\, p_T}{\pi_T} \right)^2 \right]

Covariances
• Cov(x, y) = E[(x − μ_x)(y − μ_y)]
• Cov(x, y) = E[x y] − E[x] E[y]
• Cov(x, y) > 0: positive (linear) association between x and y
• Cov(x, y) < 0: negative (linear) association between x and y
• Cov(x, y) = 0: no linear association between x and y
• Cov(x, y) = 0 DOES NOT imply no association

[Figure: scatterplots of Y against X illustrating cov(X,Y) > 0, cov(X,Y) < 0, and two cases with cov(X,Y) = 0]

Correlation
Cov = 10 tells us nothing about the strength of an association. What is needed is an absolute measure of association. This is provided by the correlation, r(x, y):

r(x, y) = \frac{Cov(x, y)}{\sqrt{Var(x)\, Var(y)}}

r = 1 implies a perfect (positive) linear association.
r = −1 implies a perfect (negative) linear association.

Useful Properties of Variances and Covariances
• Symmetry: Cov(x, y) = Cov(y, x)
• The covariance of a variable with itself is the variance: Cov(x, x) = Var(x)
• If a is a constant, then Cov(a x, y) = a Cov(x, y)
• Var(a x) = Cov(a x, a x) = a² Cov(x, x) = a² Var(x)
• Cov(x + y, z) = Cov(x, z) + Cov(y, z)
More generally,

Cov\left( \sum_{i=1}^{n} x_i,\; \sum_{j=1}^{m} y_j \right) = \sum_{i=1}^{n} \sum_{j=1}^{m} Cov(x_i, y_j)

Var(x + y) = Var(x) + Var(y) + 2\, Cov(x, y)

Hence, the variance of a sum equals the sum of the variances ONLY when the elements are uncorrelated.

Regressions
Consider the best (linear) predictor of y given we know x:

\hat{y} = \bar{y} + b_{y|x}\, (x - \bar{x})

The slope of this linear regression is a function of Cov:

b_{y|x} = \frac{Cov(x, y)}{Var(x)}

The fraction of the variation in y accounted for by knowing x, i.e., Var(\hat{y})/Var(y), is r².

[Figure: scatterplots illustrating regressions with r² = 0.6 and r² = 0.9]

Relationship between the correlation and the regression slope:

r(x, y) = \frac{Cov(x, y)}{\sqrt{Var(x)\, Var(y)}} = b_{y|x} \sqrt{\frac{Var(x)}{Var(y)}}

If Var(x) = Var(y), then b_{y|x} = b_{x|y} = r(x, y). In this case, the fraction of variation accounted for by the regression is b².

Properties of Least-squares Regressions
The slope and intercept obtained by least squares minimize the sum of squared residuals:

\sum_i e_i^2 = \sum_i (y_i - \hat{y}_i)^2 = \sum_i (y_i - a - b x_i)^2

• The regression line passes through the means of both x and y
• The average value of the residual is zero
• The LS solution maximizes the amount of variation in y that can be explained by a linear regression on x
• The fraction of variance in y accounted for by the regression is r²
• The residual errors around the least-squares regression are uncorrelated with the predictor variable x
• Homoscedastic vs. heteroscedastic residual variances

Maximum Likelihood
p(x_1, ..., x_n | θ) = density of the observed data (x_1, ..., x_n) given the (unknown) distribution parameter(s) θ.
Fisher (yup, the same one) suggested the method of maximum likelihood: given the data (x_1, ..., x_n), find the value(s) of θ that maximize p(x_1, ..., x_n | θ).
We usually express p(x_1, ..., x_n | θ) as a likelihood function ℓ(θ | x_1, ..., x_n) to remind us that it is dependent on the observed data.
The Maximum Likelihood Estimator (MLE) of θ is the value (or values) of θ that maximizes the likelihood function ℓ given the observed data x_1, ..., x_n.
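As an illustration, the following Python sketch finds the MLE of the mean of a normal distribution by direct maximization of the log-likelihood over a grid and compares it with the closed-form answer, the sample mean. The simulated data, the known-variance assumption, and all variable names are mine, not part of the lecture.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(loc=5.0, scale=2.0, size=200)   # simulated data, true mean = 5

    def log_lik(mu, data, sigma2=4.0):
        """Log-likelihood of a normal sample with mean mu and known variance sigma2."""
        return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (data - mu) ** 2 / (2 * sigma2))

    # Evaluate L(mu | x) on a grid of candidate means and take the maximizing value
    grid = np.linspace(3.0, 7.0, 2001)
    L = np.array([log_lik(mu, x) for mu in grid])
    mle = grid[np.argmax(L)]

    # The grid maximizer agrees (to the grid spacing) with the analytic MLE, the sample mean
    print(mle, x.mean())

The curvature point made on the next slide can be seen in the same plot of L against the grid: the flatter the log-likelihood is near its peak, the less precisely the data pin down θ.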
[Figure: the likelihood curve ℓ(θ | x), with the MLE of θ marked at its peak]

The curvature of the likelihood surface in the neighborhood of the MLE informs us as to the precision of the estimator: a narrow peak = high precision, a broad peak = lower precision. This is formalized by looking at the log-likelihood surface, L = ln[ℓ(θ | x)]. Since ln is a monotonic function, the value of θ that maximizes ℓ also maximizes L.

Var(MLE) = \frac{-1}{\partial^2 L(\theta \mid z) / \partial \theta^2}

The second derivative is the curvature. The curvature is negative at a maximum, and the larger the curvature, the smaller the variance.

Likelihood Ratio Tests
Hypothesis testing in the ML framework occurs through likelihood-ratio (LR) tests:

LR = -2 \ln\left[ \frac{\ell(\hat{\theta}_r \mid z)}{\ell(\hat{\theta} \mid z)} \right] = -2\left[ L(\hat{\theta}_r \mid z) - L(\hat{\theta} \mid z) \right]

Here \ell(\hat{\theta}_r \mid z) is the maximum value of the likelihood function under the null hypothesis (typically r parameters assigned fixed values) and \ell(\hat{\theta} \mid z) is the maximum value of the likelihood function under the alternative. For large sample sizes, the LR statistic (generally) approaches a Chi-square distribution with r df (r = number of parameters assigned fixed values under the null).

Bayesian Statistics
An extension of likelihood is Bayesian statistics. Instead of simply obtaining a point estimate (e.g., the MLE), the goal is to estimate the entire distribution for the unknown parameter θ given the data x:

p(\theta \mid x) = C\, \ell(x \mid \theta)\, p(\theta)

Here p(θ | x) is the posterior distribution of θ given x, ℓ(x | θ) is the likelihood function, p(θ) is the prior distribution for θ, and C is the appropriate constant so that the posterior for θ given x integrates to one.

Why Bayesian?
• Exact for any sample size
• Marginal posteriors
• Efficient use of any prior information
• MCMC (such as Gibbs sampling) methods
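A minimal sketch of the posterior = constant × likelihood × prior relationship above, assuming simulated normal data with known variance, a normal prior on the mean, and a simple grid approximation; all of these choices and names are mine, not part of the lecture.

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(loc=1.0, scale=1.0, size=50)     # observed data, known variance = 1

    theta = np.linspace(-3.0, 3.0, 1201)            # grid of candidate values for the mean
    dtheta = theta[1] - theta[0]

    # log prior: a normal(0, 2^2) prior on theta (constants dropped)
    log_prior = -0.5 * (theta / 2.0) ** 2
    # log likelihood of the full sample at each candidate theta (constants dropped)
    log_lik = np.array([np.sum(-0.5 * (x - t) ** 2) for t in theta])

    # posterior is proportional to likelihood * prior; normalize so it integrates to one
    log_post = log_prior + log_lik
    post = np.exp(log_post - log_post.max())
    post /= post.sum() * dtheta

    print(theta[np.argmax(post)])    # posterior mode, pulled slightly from the sample mean toward the prior mean of 0

For models where the posterior cannot simply be evaluated on a grid like this, the MCMC methods listed above (such as Gibbs sampling) are used to draw samples from it instead.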