Achim Tresch, UoC / MPIPZ Cologne
Statistics, treschgroup.de/OmicsModule1415.html, tresch@mpipz.mpg.de

II. Testing

Induction from the sample to the population
- Significance testing: a difference in the sample. Is there a difference in the population? What is the probability of a false call?
- Estimation, regression: a measure in the sample. What is the measure in the population? Its variance? Confidence intervals?

What allows us to conclude from the sample to the population?
The sample has to be representative (figures about drug abuse among students cannot be generalized to the whole population of Germany).
How is representativity achieved?
- Large sample numbers
- Random recruitment of samples from the population, e.g. dial a random phone number, or choose a random name from the register of births (advantages/disadvantages?)
- Randomization: random allocation of the samples to the different experimental groups

A non-sheep detector
Training: Measure the length of all sheep that cross your way. Determine the distribution of the quantity of interest.
[Figure: histogram of sheep lengths, length (cm) from 70 to 140]

Testing: For any unknown animal, test the hypothesis that it is a sheep. Measure its length and compare it to the learned length distribution of the sheep. If its length is "out of bounds", the animal is called a non-sheep (rejection of the hypothesis). Otherwise, we cannot say much (non-rejection).
[Figure: an animal far to the right of the learned length distribution is called "not a sheep"]

Advantage of the method: one does not need to know much about sheep.
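The training/testing scheme just described can be sketched in a few lines. This is a minimal illustration, not the course's code: the sheep lengths and the three-standard-deviation bound are assumptions made up for the example.

```python
import statistics

# Training: learn the length distribution of sheep (illustrative data in cm)
sheep_lengths = [98, 102, 105, 110, 95, 100, 107, 103, 99, 101]
mean = statistics.mean(sheep_lengths)
sd = statistics.stdev(sheep_lengths)

def is_non_sheep(length, k=3):
    """Testing: reject the hypothesis 'this animal is a sheep' if its
    length lies more than k standard deviations from the learned mean
    (k=3 is an arbitrary decision boundary for this sketch)."""
    return abs(length - mean) > k * sd

print(is_non_sheep(180))  # far out of bounds: called a non-sheep
print(is_non_sheep(104))  # within bounds: no rejection (not proof of sheep-ness!)
```

Note that a non-rejection is not a proof: a 104 cm animal might still be a goat.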
Disadvantage: it produces errors…
[Figure: length (cm) axis from 70 to 140 with a decision boundary; negative calls (true negatives, false negatives) lie on one side, positive calls (false positives, true positives) on the other]

Statistical hypothesis testing
- State a null hypothesis H0 ("nothing happens, there is no difference…").
- Choose an appropriate test statistic (the data-derived quantity that finally leads to the decision). This implicitly determines the null distribution (the distribution of the test statistic under the null hypothesis).
- State an alternative hypothesis (e.g. "the test statistic is higher than expected under the null hypothesis").
- Determine a decision boundary d. This is equivalent to the choice of a significance level α, i.e. the fraction of false positive calls you are willing to accept.
[Figure: null distribution of the test statistic, split by the boundary d into an acceptance region and a rejection region of probability α]
- Calculate the actual value of the test statistic in the sample, and make your decision according to the prespecified(!) decision boundary.
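The step from significance level α to decision boundary d is just a quantile computation. As an illustration we assume a standard normal null distribution (an assumption for this sketch, not something the slides specify): for a one-sided test, d is the (1 - α) quantile of the null distribution.

```python
from statistics import NormalDist

# Assumed null distribution of the test statistic d: standard normal.
alpha = 0.05
d_boundary = NormalDist(0, 1).inv_cdf(1 - alpha)

# One-sided rule: reject H0 whenever the observed statistic exceeds d_boundary.
print(round(d_boundary, 3))  # 1.645
```

Under H0 the statistic exceeds 1.645 with probability exactly 0.05, so by construction the test produces false positives at rate α.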
Keep H0 (no rejection) if the test statistic falls into the acceptance region; reject H0 (assume the alternative hypothesis) if it falls into the rejection region.

Good test statistics, bad test statistics
A good statistic: the distribution of the test statistic under the null hypothesis is well separated from its distribution under the alternative hypothesis.
A bad statistic: the two distributions overlap strongly, so the test can hardly discriminate.
In either case the possible outcomes are:
- Null hypothesis is true: accepting H0 is the right decision; rejecting H0 is a Type I error (false positive).
- Alternative is true: accepting H0 is a Type II error (false negative); rejecting H0 is the right decision.

The Offenbach Oracle
(Toni, 29, Offenbach, mechanic and moral philosopher)
Throw a 20-sided die.
Score = 20: reject the null hypothesis.
Score ≠ 20: keep the null hypothesis.
This is (independent of the null hypothesis) a valid statistical test at a 5% type I error level!
But: the distribution of the test statistic under the null and the alternative hypothesis is identical, so this test cannot discriminate between the two. 95% of the positives (as well as of the negatives) will be missed.
[Figure: identical bar plots of the die-score distribution under H0 and under H1]

The p-value
Given a test statistic and its actual value t in a sample, a p-value can be calculated: each test value t maps to a p-value, namely the probability of observing a value of the test statistic at least as extreme as the actual value t, under the assumption of the null hypothesis.
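The mapping from t to a p-value can be made concrete. For illustration we again assume a standard normal null distribution and a one-sided alternative "higher than expected" (the slides use a different null distribution, so the numbers here differ from theirs):

```python
from statistics import NormalDist

def p_value(t):
    """One-sided p-value: P(T >= t) under the assumed standard normal H0."""
    return 1 - NormalDist(0, 1).cdf(t)

t = 1.75                 # hypothetical observed test value
p = p_value(t)
print(round(p, 3))       # 0.04
print(p < 0.05)          # True: at alpha = 0.05 we would reject H0
```

A completely unremarkable value such as t = 0 maps to p = 0.5, the least extreme p-value possible for this one-sided test.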
[Figure: null density with the observed value marked; t = 4.2 gives the tail area p = 0.08, t = 0.75 gives p = 0.42]

Test decisions according to the p-value
Given the decision boundary d, the significance level α, and the observed test statistic t with its p-value:
- p ≥ α (t less extreme than d, e.g. p = 0.83): keep H0 (no rejection).
- p < α (t more extreme than d, e.g. p = 0.02 at α = 0.05): reject H0 (assume the alternative hypothesis).

One- and two-sided hypotheses
One-sided alternative:
H0: the value of a quantity of interest in group A is not higher than in group B.
H1: the value of the quantity of interest in group A is higher than in group B.
[Figure: blood pressure reduction (mmHg), axis from -10 to 15; acceptance region up to the boundary, rejection region beyond it]

Two-sided alternative:
H0: the quantity of interest has the same value in group A and group B.
H1: the quantity of interest differs between group A and group B.
Generally, two-sided alternatives are more conservative: deviations in both directions are detected.
[Figure: blood pressure reduction (mmHg), axis from -10 to 15; rejection regions in both tails, acceptance region in the middle]

Example "Testing": colon carcinoma
Variable: vaccine, binary scale. Endpoint: 4-year survival, binary scale.
What about this fact? 94% of the 32 vaccinated patients, i.e. about 30, survived; 77% of the (62 - 32) = 30 unvaccinated patients, i.e. about 23, survived.
Alternative hypothesis H1: for the whole population, the survival rates in the vaccine and non-vaccine groups are different.
Choose the significance level α (usually α = 5%, 1%, or 0.1%).
Interpretation of the significance level α: if there is no difference between the groups, one obtains a false positive result with probability α.

Choice of test statistic: Fisher's exact test
(Sir Ronald Aylmer Fisher, 1890-1962: theoretical biology, evolution theory, statistics)

The value of the test statistic t after the experiment has been carried out can be converted into a p-value: p = 0.0766, about 7.7%.
Since we have chosen the significance level α = 5%, and p > α, we cannot reject the null hypothesis, so we keep it.
Formulation of the result: at a 5% significance level (and using Fisher's exact test), no significant effect of vaccination on survival could be detected.
Consequence: we are not (yet) sufficiently convinced of the utility of this therapy. But this does not mean that there is no difference at all!

Non-significance ≠ equivalence
Statistics can never prove a hypothesis, it can only provide evidence.
"No test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of a hypothesis." - Neyman J, Pearson E (1933) Phil Trans R Soc A
(Egon Pearson, 1895-1980; Jerzy Neyman, 1894-1981)

Confidence intervals
A 95% confidence interval is an estimated interval which contains the "true value" of a quantity with a probability of 95% (an interval estimate), e.g. the interval [20.5, 29.5] around the point estimate 24.3 (say, the percentage of votes for the SPD in the EU elections).
A (1 - α) confidence interval is an estimated interval which contains the "true value" of a quantity with a probability of (1 - α). Here 1 - α is the confidence level and α the error probability.
Use confidence intervals with caution!
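The p = 0.0766 of the colon carcinoma example can be reproduced from the 2x2 table (vaccine: 30 survived, 2 died; no vaccine: 23 survived, 7 died). The helper below is a small stdlib-only sketch of the two-sided Fisher test; in practice one would call a statistics package rather than roll one's own.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]]:
    sum the hypergeometric probabilities of all tables with the same
    margins that are at most as likely as the observed table."""
    r1, r2 = a + b, c + d          # row sums
    c1 = a + c                     # first column sum
    n = r1 + r2
    def prob(k):                   # P(top-left cell = k) under H0
        return comb(r1, k) * comb(r2, c1 - k) / comb(n, c1)
    p_obs = prob(a)
    return sum(prob(k)
               for k in range(max(0, c1 - r2), min(r1, c1) + 1)
               if prob(k) <= p_obs + 1e-12)

p = fisher_exact_two_sided(30, 2, 23, 7)
print(round(p, 4))  # 0.0766: p > 0.05, so H0 is kept, as on the slide
```

The tiny tolerance 1e-12 guards against floating-point noise when a competing table has exactly the observed probability.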
Specific statistical tests

Comparison of two group means
Which gene (gene A or gene B) is expressed at a higher level? (Gene expression measurements in group 1 and group 2.)

Two-group comparison
Data: expression of gene g in different samples.
Hypothesis: the expression of gene g in group 1 is lower than in group 2.
Test statistic, e.g. the difference of the group means, d = mean(group 2) - mean(group 1). Decide for "lower expression" if d ≥ d0.
Problem: d is not scaling invariant.
Solution: divide d by an estimate s(d) of the standard deviation in the two groups, t = d / s(d). This is the t-statistic, giving rise to the (unpaired) t-test.

Wilcoxon (rank sum) test (equivalent to the Mann-Whitney test)
Question: given independent samples in group 1 and group 2, are the values in group 1 smaller than in group 2?
Measurements (raw scale): group 1: 18, 3, 6, 9, 5; group 2: 15, 10, 8, 7, 12.
Rank scale: the pooled sorted values 3, 5, 6, 7, 8, 9, 10, 12, 15, 18 receive the ranks 1 to 10.
Rank sum of group 1: 1 + 2 + 3 + 6 + 10 = 22. Rank sum of group 2: 4 + 5 + 7 + 8 + 9 = 33.
Choose the rank sum of group 1 as the test statistic W. The p-value corresponding to W can be computed exactly for small sample numbers; for large numbers there exist good approximations.
P(W ≤ 22, given the groups do not differ in their location) = 0.15.
[Figure: rank sum distribution of W for group 1, with |group 1| = 5, |group 2| = 5]

Summary: two-group comparison of a continuous variable
Question: do the measurements in the two groups differ in their location?
- Gaussian data, unpaired samples: unpaired two-sample t-test
- Gaussian data, paired samples: paired two-sample t-test
- Non-Gaussian data, paired samples: Wilcoxon signed rank test
- Non-Gaussian data, unpaired samples: Wilcoxon rank sum test

Comparison of two binary variables
Unpaired data: Fisher's exact test.
Question: do the distributions in group 1 and group 2 differ?
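The rank sums 22 and 33 and the exact p-value P(W ≤ 22) = 0.15 from the slide can be checked directly. Under H0 every choice of 5 out of the 10 ranks for group 1 is equally likely, C(10,5) = 252 choices in all; this minimal sketch enumerates them (no tie handling).

```python
from itertools import combinations

group1 = [18, 3, 6, 9, 5]
group2 = [15, 10, 8, 7, 12]

# Rank the pooled values: rank 1 = smallest (values here are distinct)
rank = {v: i + 1 for i, v in enumerate(sorted(group1 + group2))}
W1 = sum(rank[v] for v in group1)
W2 = sum(rank[v] for v in group2)
print(W1, W2)  # 22 33, the rank sums from the slide

# Exact null distribution of W1: enumerate all 252 equally likely
# assignments of 5 ranks to group 1 and count those with sum <= W1.
count = sum(1 for s in combinations(range(1, 11), 5) if sum(s) <= W1)
p = count / 252
print(round(p, 2))  # 0.15, as on the slide
```

For larger samples this brute-force enumeration becomes infeasible, which is where the normal approximation mentioned on the slide comes in.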
Example: clinical trial, unpaired design (each test person receives only one treatment).
Medication effect:
             effect   no effect
verum          65         7
placebo        44        13

Odds and odds ratio
             heads   tails
fair coin      54      46
bent coin      82      18
Odds (= chances): odds(fair coin) = 54 : 46 ≈ 1.17; odds(bent coin) = 82 : 18 ≈ 4.56.
Odds ratio: OR = (54/46) / (82/18) = 1.17 / 4.56 ≈ 0.26.

Comparison of two categorical variables
Unpaired data: chi-square test (χ2 test)
Example: 5-year survival by tumor size.
tumor size   no   yes
    1        10     8
    2        20    23
    3        19    10
    4        32    18
Null hypothesis: 5-year survival is independent of tumor size. In this example, p < 0.001.

Requirements of the chi-square test (unpaired data):
- the sample number is sufficiently large (n ≥ 60)
- the expected number is not too small (≥ 5) for all possible observations
Note that for binary data and large n, the chi-square test and the Fisher test are equivalent.

Summary: comparison of two categorical variables
Question: do there exist differences in the distribution of one variable if grouped by the second variable?
- Binary data, paired: McNemar test
- Binary data, unpaired: Fisher's exact test
- Categorical data, paired: (Bowker symmetry test)
- Categorical data, unpaired: chi-square (χ2) test

Summary: description and testing
variable     design       description (numerical)             description (graphical)   test
continuous   two sample   medians, quartiles                  2 boxplots                Wilcoxon rank sum test, t-test*
continuous   paired       medians, quartiles of differences   boxplot of differences    Wilcoxon signed rank test, paired t-test*
binary       two sample   cross table, odds ratio             barplot                   Fisher's exact test
binary       paired       cross table                         barplot                   McNemar test
categorical  two sample   cross table                         3D barplot                χ2 test
* if the differences follow a normal distribution

Remarks on testing
- Data description is the mandatory first step of every statistical analysis/test.
- Test results should report the outcome (significant/not significant) together with the p-value that has been obtained.
- Never report a p-value of exactly 0! (Why?)
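The odds and odds-ratio arithmetic for the coin table is a one-liner each; spelling it out makes the definition concrete:

```python
# Coin table from the slide: counts of heads and tails
fair_heads, fair_tails = 54, 46
bent_heads, bent_tails = 82, 18

odds_fair = fair_heads / fair_tails    # odds of heads with the fair coin
odds_bent = bent_heads / bent_tails    # odds of heads with the bent coin
OR = odds_fair / odds_bent             # odds ratio, fair relative to bent

print(round(odds_fair, 2), round(odds_bent, 2), round(OR, 2))  # 1.17 4.56 0.26
```

An odds ratio of 1 would mean identical chances in both rows; 0.26 says the fair coin's odds of heads are about a quarter of the bent coin's.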
Statistical significance ≠ relevance
- For large sample numbers, even tiny differences may produce significant findings.
- For small sample numbers, an observed relevant difference can be statistically insignificant.

Multiple testing
Examples of multiple tests:
- testing several endpoints (systolic and diastolic blood pressure, pulse, …)
- comparison of several groups (e.g., 4 groups require 6 pairwise two-group comparisons)
Let us set a significance level of 5%, and suppose the null hypothesis holds in all cases. If we perform 6 tests, the probability of reporting at least one false positive finding can increase to 30%!

Multiple testing: Bonferroni correction
Remedy: for m tests and a target significance level α, perform each individual test at the significance level α/m (the local significance level). The probability of producing a false positive finding in at least one of the m tests is then at most α (the multiple / global significance level).
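These numbers are easy to verify. For m independent tests the family-wise error rate is 1 - (1 - α)^m, about 26.5% for m = 6; the slide's "up to 30%" is the worst-case union bound m·α, which also holds for dependent tests. The Bonferroni correction brings the rate back below α:

```python
alpha, m = 0.05, 6

# Uncorrected: probability of at least one false positive among
# m independent tests, each performed at level alpha.
fwer_uncorrected = 1 - (1 - alpha) ** m
print(round(fwer_uncorrected, 3))   # 0.265; the union bound m*alpha is 0.30

# Bonferroni: run each test at the local level alpha/m.
local_alpha = alpha / m
fwer_bonferroni = 1 - (1 - local_alpha) ** m
print(round(fwer_bonferroni, 3))    # 0.049, i.e. at most alpha globally
```

The price of the correction is power: each individual test must now clear the stricter threshold α/m ≈ 0.0083 instead of 0.05.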