Statistical Genomics
Lecture 4: Statistical inference
Zhiwu Zhang
Washington State University

Administration
  Homework 1, due Wednesday, Feb 1, 3:10 PM

Outline
  χ² test on contingency table
  Empirical null distribution
  χ² test on variance
  t test
  Hypothesis test
  Two types of error
  Power

Observed and expected frequency

  Observed
                  Transgenic   Non-transgenic   SUM
  Herbicide               35                5    40
  No herbicide            35               25    60
  SUM                     70               30   100

  Expected (row total × column total / grand total)
                  Transgenic   Non-transgenic   SUM
  Herbicide               28               12    40
  No herbicide            42               18    60
  SUM                     70               30   100

Approximate distributions
  Poisson distribution: Mean = Var = Expected
  (Observed - Expected) / sqrt(Expected) ~ N(0,1), approximately
  SUM (Observed - Expected)^2 / Expected ~ χ²(df), approximately
  df = number of independent cells (1 for a 2 × 2 table with fixed margins)

Observed and expected frequency (continued)
  Every cell has |Observed - Expected| = 7, so
  χ² = 49/28 + 49/12 + 49/42 + 49/18 = 9.72

Distribution of χ²(1)
    k=10000   # matches N = 10000 reported on the density panel
    x=rchisq(k,1)
    par(mfrow=c(2,2),mar=c(3,4,1,1))
    d=density(x)
    plot(x)
    plot(d)
    hist(x)
    plot(ecdf(x))
    quantile(x,.99)
  [Figure: four panels for x: index plot, kernel density (N = 10000, bandwidth = 0.1299), histogram, and ECDF]
  99th percentile of the simulated values: 6.97
  Observed 9.72, P < 1%

Tests on samples
  A sample of 10 observations has a mean of 103.6 and a variance of 27.82.
  Q1: How probable is a sample variance this large if the sample came from a normal distribution with variance of 25?
  Q2: How probable is a sample mean this large if the sample came from a normal distribution with mean of 100?

Q1: distribution with variance of 25
  Empirical solution:
  Sample ten observations from a normal distribution with variance of 25.
  Calculate the observed variance.
  Repeat the sampling to get the null distribution of the sample variances.
  Find the percentile of the observed variance on the null distribution.

Q1: distribution with variance of 25 (empirical)
    x=replicate(10000, {s=rnorm(10,0,5); var(s)})
    par(mfrow=c(2,2),mar=c(3,4,1,1))
    d=density(x)
    plot(x)
    plot(d)
    hist(x)
    plot(ecdf(x))
    quantile(x,.75)
    > length(x[x>27.82])/10000
    [1] 0.3516
  [Figure: four panels for x: index plot, kernel density (N = 10000, bandwidth = 1.642), histogram, and ECDF]
  75th percentile: 31.6
  Observed 27.82, P > 25%

Q1: distribution with variance of 25 (theoretical)
  v = (10-1)*27.82/25 = 10.015
  1-pchisq(v,9) is approximately 0.349, vs. 0.3516 from the empirical distribution

Q2: distribution with mean of 100
  Empirical solution:
  Sample ten observations from N(100, 25).
  Calculate the mean.
  Repeat the process 10,000 times.
  The 10,000 means form the null distribution.
  Determine the percentile of the tested mean (103.6) on the null distribution.

Q2: distribution with mean of 100 (empirical)
    x=replicate(10000, {s=rnorm(10,100,5); mean(s)})
    par(mfrow=c(2,2),mar=c(3,4,1,1))
    d=density(x)
    plot(x)
    plot(d)
    hist(x)
    plot(ecdf(x))
    quantile(x,.95)
    quantile(x,.99)
    > length(x[x>103.6])/10000
    [1] 0.0132
  [Figure: four panels for x: index plot, kernel density (N = 10000, bandwidth = 0.2281), histogram, and ECDF]
  95th percentile: 102.6
  99th percentile: about 103.7 (theoretical value, 100 + 2.326 × 5/sqrt(10))
  Observed 103.6, 1% < P < 5%

t test
  Let Z ~ N(0,1) and V ~ χ²(k), with Z and V independent. Define

  \[ T = \frac{Z}{\sqrt{V/k}} \sim t_k \]

  Application: X_1, ..., X_n iid N(\mu, \sigma^2)

  \[ Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} = \sqrt{n}\left(\frac{\bar{X}-\mu}{\sigma}\right) \sim N(0,1) \]

  \[ V = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1} \]

  Z and V are independent, so

  \[ T = \frac{\sqrt{n}\left(\frac{\bar{X}-\mu}{\sigma}\right)}{\sqrt{\frac{(n-1)S^2}{\sigma^2}\Big/(n-1)}} = \sqrt{n}\left(\frac{\bar{X}-\mu}{S}\right) = \frac{\bar{X}-\mu}{S/\sqrt{n}} \sim t_{n-1} \]

t test
  \[ T = \frac{\bar{X}-\mu}{S/\sqrt{n}} \]
    # the slide uses 5 in place of S; the sample standard deviation is sqrt(27.82)
    T=(103.6-100)/(5/sqrt(10))
    P=1-pt(T,9)
    c(T,P)
    2.27683992 0.02440704
  At the 5% threshold, reject the hypothesis that the sample came from a distribution with mean of 100.

Hypothesis test
  Null hypothesis (H0): initial assumption
  Alternative hypothesis (Ha): opposite of the assumption
  Find the probability of a result at least as extreme as the observed one, assuming H0 is true
  If the probability is below the threshold (e.g. 5%), reject H0 and accept Ha
  Otherwise, accept (fail to reject) H0

Two types of errors and power
  Type I error: reject a true H0 (false positive); its probability is the threshold used, e.g. α = 5%
  Type II error: accept a false H0 (false negative); its probability is β
  Power: probability of rejecting a false H0, 1-β

Summary
  Test                   H0 is true                  H0 is false
  Positive (reject H0)   False positive, Type I: α   Power = 1-β
  Negative (accept H0)   Specificity = 1-α           False negative, Type II: β
  Sum                    100%                        100%

Highlights
  χ² test on contingency table
  Empirical null distribution
  χ² test on variance
  t test
  Hypothesis test
  Two types of error
  Power
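
Worked example: χ² test on the contingency table
  The expected counts and the value 9.72 from the "Observed and expected frequency" slides can be reproduced in a few lines of R. This is a minimal sketch, not the lecture's own script: the variable names (obs, expected, chi2, p_value, builtin) are chosen here for illustration, and the cross-check assumes the comparison should be made without the continuity correction, since the slides do not apply one.

```r
# Observed counts from the herbicide-by-genotype table in the slides
obs <- matrix(c(35,  5,
                35, 25),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Herbicide", "No herbicide"),
                              c("Transgenic", "Non-transgenic")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson chi-square statistic and its upper-tail probability (df = 1)
chi2 <- sum((obs - expected)^2 / expected)
p_value <- 1 - pchisq(chi2, df = 1)

# Cross-check against the built-in test, with the continuity correction
# turned off (an assumption made here to match the hand calculation)
builtin <- chisq.test(obs, correct = FALSE)

print(expected)
print(c(chi2 = chi2, p_value = p_value))
```

  The statistic exceeds the 99th percentile of χ²(1), which agrees with the "P < 1%" reading of the empirical null distribution.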
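
Worked example: theoretical tests on the sample
  The two theoretical solutions (χ² test on the variance, t test on the mean) can be collected in one short script. This is a sketch under stated assumptions: the variable names are illustrative, and the second t statistic, which plugs in the sample standard deviation sqrt(27.82) as the definition of T requires, is an addition made here for comparison with the slide's calculation that uses 5.

```r
# Sample summary and hypothesized values taken from the slides
n      <- 10
xbar   <- 103.6
s2     <- 27.82   # sample variance
sigma0 <- 5       # hypothesized standard deviation (variance 25)
mu0    <- 100     # hypothesized mean

# Q1: chi-square test on the variance, (n-1) * S^2 / sigma0^2 ~ chi-square(n-1)
v     <- (n - 1) * s2 / sigma0^2
p_var <- 1 - pchisq(v, df = n - 1)

# Q2: one-sided t test on the mean
# (a) as computed on the slide, with 5 in place of S
t_slide <- (xbar - mu0) / (sigma0 / sqrt(n))
p_slide <- 1 - pt(t_slide, df = n - 1)

# (b) with the sample standard deviation (comparison added here)
t_sample <- (xbar - mu0) / (sqrt(s2) / sqrt(n))
p_sample <- 1 - pt(t_sample, df = n - 1)

print(c(v = v, p_var = p_var))
print(c(t_slide = t_slide, p_slide = p_slide))
print(c(t_sample = t_sample, p_sample = p_sample))
```

  Both versions of the t statistic fall below the 5% threshold, so the conclusion about the mean does not depend on which standard deviation is used.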
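
Worked example: Type I error and power by simulation
  The error rates in the summary table can be estimated with the same replicate() approach used for the empirical null distributions. This is a hypothetical sketch: the alternative mean of 103.6, the seed, the number of replicates, and the helper reject_rate() are all assumptions made for illustration and do not come from the lecture.

```r
set.seed(1)        # arbitrary seed so the run is repeatable

n      <- 10       # observations per sample, as in the slides
sigma  <- 5        # standard deviation used in the slides' simulations
mu0    <- 100      # mean under H0
mu_alt <- 103.6    # hypothetical true mean when H0 is false (assumption)
alpha  <- 0.05     # rejection threshold
reps   <- 10000    # number of simulated samples

# Proportion of simulated samples in which the one-sided t test rejects H0
reject_rate <- function(true_mean) {
  rejected <- replicate(reps, {
    s <- rnorm(n, true_mean, sigma)
    t_stat <- (mean(s) - mu0) / (sd(s) / sqrt(n))
    (1 - pt(t_stat, df = n - 1)) < alpha
  })
  mean(rejected)
}

type1_hat <- reject_rate(mu0)     # H0 true: estimates alpha
power_hat <- reject_rate(mu_alt)  # H0 false: estimates 1 - beta

print(c(type1 = type1_hat, power = power_hat, type2 = 1 - power_hat))
```

  When H0 is true the rejection rate should land close to the 5% threshold; when the true mean is shifted, the rejection rate estimates the power, and one minus that value estimates β.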