Applied Data Analysis, Spring 2017
Karen Albert (kalbert2@ur.rochester.edu)
Office hours: Thursdays, 4-5 PM (Hark 302)

Lecture outline
1. One- and two-tailed tests
2. Types of errors

The alternative hypothesis
When the null hypothesis is H0 : µ = 10, we have three choices for the alternative hypothesis:
H1 : µ > 10
H1 : µ < 10
H1 : µ ≠ 10

One v. two-tailed
A “greater than” or “less than” test is a one-tailed test. A “not equal to” test is a two-tailed test. The difference affects the p-value.

The p-value
Remember that the p-value is the probability of seeing a result as extreme as, or more extreme than, the observed result, given that the null hypothesis is true. Let’s say that we have calculated the test statistic and the z-score is -2. The p-value depends on whether the test is one-tailed or two-tailed.

The “less than” test
When the alternative hypothesis is H1 : µ < 10, it means that if the null hypothesis of chance is wrong, the true mean is smaller than the value given by the null. In this case, the p-value is the area on the left-hand side of the curve.

The “greater than” test
When the alternative hypothesis is H1 : µ > 10, it means that if the null hypothesis of chance is wrong, the true mean is larger than the value given by the null. This time, assume that the test statistic is 2. The p-value is the area on the right-hand side of the curve.

The “not equal to” test
When the alternative hypothesis is H1 : µ ≠ 10, it means that if the null hypothesis of chance is wrong, the true mean is either larger or smaller than the value given by the null. In cases such as these, we have to remember to multiply the one-tailed p-value by 2.

How can you tell which alternative? The problem tells us...
Suppose a test has been given to all high school students in a certain state. The mean test score for the entire state is 70, with an SD of 10. Members of the school board suspect that female students have a higher mean score on the test than male students. The mean score of a random sample of 64 female students is 73. Does this provide strong evidence that the overall mean for female students is higher?

Answer
H0 : µ = 70
H1 : µ > 70

pnorm(73, 70, 10/sqrt(64), lower.tail = FALSE)
## [1] 0.008197536

The p-value is 0.008. The p-value is small, so we reject the null hypothesis.
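The same answer can be reached by standardizing first. Here is a minimal sketch (added for illustration; not part of the original slides) that redoes the test by hand through the z-score, using the numbers of the example above.

se <- 10 / sqrt(64)    # standard error of the mean: 1.25
z  <- (73 - 70) / se   # z-score: 2.4
pnorm(z, lower.tail = FALSE)   # one-tailed ("greater than") p-value
## [1] 0.008197536

This matches the p-value computed directly from the sampling distribution of the mean, as it must: standardizing just moves the same calculation onto the standard normal curve.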
Answer if the alternative had been “not equal to”
Members of the school board suspect that female students have a different mean score on the test than male students. Does this provide strong evidence that the overall mean for female students is different from that of the male students? The p-value would have been 0.008 × 2, or 0.016. Note that this p-value is still small, so we would still reject the null hypothesis.

Two-tailed tests and CIs
For z’s and t’s, two-tailed tests and confidence intervals are equivalent. If the value given by the null hypothesis falls outside the confidence interval, we decide to reject the null hypothesis. If the null value falls inside the confidence interval, we fail to reject the null. Why don’t we accept the null? Think of all the values contained in the confidence interval, and remember that the null hypothesis is about only one value.
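We can check this equivalence on the test-score example with a short sketch (added here; not from the original slides): build the 95% confidence interval for the female students’ mean and see whether the null value of 70 falls inside it.

se <- 10 / sqrt(64)
73 + c(-1.96, 1.96) * se   # 95% CI: estimate +/- 1.96 SE
## [1] 70.55 75.45

The null value 70 falls outside the interval, so the two-tailed test rejects at the 5% level, consistent with the two-sided p-value of 0.016 computed above.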
Types of errors
If we reject the null hypothesis when it is true, we make an error. If we fail to reject the null hypothesis when it is false, we make an error.

                          World
                     H0          H1
Decision   H0     no error    type II
           H1     type I      no error

Error probabilities

                          World
                     H0          H1
Decision   H0        -           β
           H1        α           -

• α is the probability of a type I error: the probability of rejecting the null hypothesis when it is true.
• It’s a number we agree upon as a community.
• If the p-value is less than α, we decide to reject the null hypothesis.

The probability of a type I error
α is the long-run probability of rejecting the null hypothesis when it is true. If we want to protect against a type I error, why not set α to a really low number? Look at the picture that I am drawing on the board: as α decreases, β (the probability of a type II error) increases.

Limitations of testing 1
There is nothing special about 5% or 1%. If our α level is 5%, what is the difference between a p-value of 4.9% and a p-value of 5.1%? One is statistically significant, and one is not. But does that make sense? Always report the p-value, not just the conclusion.

Limitations of testing 2: data snooping
What does a significance level of 5% mean? There is a 5% chance of rejecting the null hypothesis when it is true. If our significance level is 5%, how many results would be “statistically significant” just by chance if we ran 100 tests? We would expect 5 to be “statistically significant,” and 1 to be “highly significant.”

Limitations of testing 2.1
So what can we do?
1. Always state how many tests were run before statistically significant results turned up.
2. Always test your conclusions on an independent set of data, if possible.

Limitations of testing 3
Was the result important? If we increase the sample size, what happens to:
• the standard error?
• the test statistic?
• the p-value?
The standard error shrinks, so the test statistic grows and the p-value shrinks. A statistically significant difference may not be important, and an important difference may not be statistically significant.
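A small sketch (added here for illustration; not from the original slides) makes this concrete with the test-score example: hold the observed difference fixed at 3 points and let the hypothetical sample size grow.

n  <- c(16, 64, 256)
se <- 10 / sqrt(n)                 # the SE shrinks as n grows...
z  <- (73 - 70) / se               # ...so the test statistic grows...
pnorm(z, lower.tail = FALSE)       # ...and the p-value shrinks

With n = 16 the p-value is about 0.12; with n = 64 it is 0.008; with n = 256 it is below one in a million, even though the difference of 3 points is the same in every case.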
Limitations of testing 4: the role of the model
Significance tests only make sense when we can talk about them in the context of a box model. Give me two examples of when we would not have a box model.

Limitations of testing 4.1
Two possibilities:
• We have the entire population. There is no such thing as sampling variability in this case. All data are subject to many small errors, but these are not like draws from a box.
• We do not have a probability sample. With a sample of convenience, the concept of chance is hard to define, the phrase “the difference is due to chance” is hard to interpret, and p-values are nearly meaningless.

Limitations of testing 5: does the difference prove the point?
Consider an ESP experiment in which a die is rolled, and the subject tries to make it land showing 6 pips. This is repeated 720 times, and the die lands showing 6 in 143 of these trials. If the die is fair and the subject does not have ESP, we would expect 720 × 1/6 = 120 sixes. The observed difference is 143 − 120 = 23.

Limitations of testing 5.1
The standard error is

se <- sqrt(720) * sqrt((1/6) * (5/6))
se
## [1] 10

and the p-value is

pnorm(143, 120, se, lower.tail = FALSE)
## [1] 0.01072411

So is ESP real, or are there alternative explanations for these findings?

Limitations of testing 5.2
The die isn’t fair! The take-home message: significance tests do not check the design of the study.

What did we learn?
• The difference between one- and two-tailed tests.
• Types of errors and their probabilities.
• The limitations of hypothesis testing.
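One closing sketch (added here; not part of the original slides): the ESP p-value above came from the normal approximation to the count of sixes. As a cross-check, the exact binomial tail probability can be computed directly.

# exact P(at least 143 sixes in 720 rolls of a fair die)
pbinom(142, size = 720, prob = 1/6, lower.tail = FALSE)

The exact tail probability is of the same size as the 0.011 given by the normal curve. Either way, the arithmetic is only as trustworthy as the assumption it rests on: that the die is fair.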