7 Testing for differences: Student's t-test

In the previous sections of these notes we developed the idea of hypothesis testing and used it to decide whether experimentally observed data were consistent with the mean of some model distribution (e.g. did the coin come up Heads as often as it should if it were fair?). The Central Limit Theorem played a supporting role in that it showed us how to handle the statistics of averaged quantities: we thought of them as approximately normally distributed. In this section we will examine two other hypothesis tests that are applicable when one has fewer data or a less explicit null hypothesis.

7.1 Comparing the means of two large samples

The most straightforward way to compare the means of two samples is to see if their difference is large compared to their standard deviations. Here is a formal setting where that is the right course. We will imagine that we have two sets of experimental results: one with mean m1 ± s1 based on a total of N1 trials and a second with mean m2 ± s2 based on N2 trials. Each trial will have involved the averaging of a large number of measurements. We also imagine both N1 and N2 to be large, so that s1/√N1 and s2/√N2 are good estimates of SEM1 and SEM2, the standard errors of the means.

Recipe 7.1 (Difference of means: large samples) The null hypothesis is that the means of the two underlying normal distributions are the same; there is no assumption about their variances. The test statistic is

$$ z = \frac{m_1 - m_2}{\sqrt{(s_1^2/N_1) + (s_2^2/N_2)}} \qquad (7.1) $$

and it should be distributed normally, with zero mean and standard deviation one. As previously, we check z for significance by consulting tables of the standard normal distribution.

This test would be appropriate if, say, one were interested in comparing the prevalence of some childhood disorder in Scotland and the UK. One might then visit a large number of schools in both London and Glasgow and, for each, estimate the frequency of the trait in a random sample of, say, 100 students. This sampling procedure meets both of the requirements for the test above.

a) As one is taking 100 children per school, the number of has-the-disorder/doesn't results going into the prevalence calculation for each sample point is large: on central-limit-theorem grounds, one would then expect the prevalence data per school to be approximately normally distributed.

b) As one is visiting a lot of schools and pooling the results, the net rates for Scotland and the UK would involve averages over large amounts of normally distributed data.

One seldom has such a vast quantity of data, so the rest of this lecture is devoted to tests appropriate for smaller samples.
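Before moving on, here is a minimal Python sketch of Recipe 7.1 (not part of the original notes; the summary statistics fed to it are invented purely for illustration). It computes z and converts it to a two-sided p-value using the standard normal CDF.

```python
from math import erf, sqrt

def z_test_two_means(m1, s1, n1, m2, s2, n2):
    """Recipe 7.1: z-statistic for comparing two large-sample means."""
    z = (m1 - m2) / sqrt(s1**2 / n1 + s2**2 / n2)
    # Standard normal CDF: Phi(x) = (1 + erf(x / sqrt(2))) / 2.
    phi = 0.5 * (1.0 + erf(abs(z) / sqrt(2.0)))
    p = 2.0 * (1.0 - phi)  # two-sided p-value
    return z, p

# Invented summary data: mean prevalence (%) per school, its spread
# across schools, and the number of schools sampled in each region.
z, p = z_test_two_means(m1=4.1, s1=1.9, n1=250, m2=3.8, s2=2.1, n2=400)
print(f"z = {z:.3f}, two-sided p = {p:.3f}")
```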
Figure 7.1: Student's t-distribution for several values of ν: the curve with the smallest value at t = 0 is the one for ν = 2, the one above that has ν = 4 and the one above that ν = 8. The dashed curve at the top is the standard normal distribution (µ = 0, σ = 1). Notice how, as ν increases, the t-distribution becomes more and more similar to a normal distribution.

7.2 Student's t-test

Our first test was invented by a statistician working for the Guinness brewing company, W.S. Gosset (1876–1937). Employees of the firm were not allowed to publish under their own names, so he wrote under the pseudonym 'Student'. His statistic is similar to a z-score, but his contribution was to work out its distribution when the sample is so small that the Central Limit Theorem does not apply and one cannot expect to have a good estimate of the mean.

His test statistic, called Student's t-statistic, is perhaps the most commonly used of all those we will study because it enables one to test for nonzero differences between two means even when the samples are small. But this advantage comes at a small cost: the t-distribution (and hence the tables one consults to use it) is less straightforward than that for the normal distribution in that the t-table depends on the amount of data available. Figure 7.1 shows t-distributions for various values of the number of degrees of freedom, ν, about which more is said in the recipe below.

Example 7.1 (First half of an old exam problem) A trial was conducted to determine whether students performed better in examinations if they drank coffee just before sitting the paper. Students were divided randomly into two groups of 10; one group was given coffee before each of three exams and the other was not. The mean marks for the three papers are recorded below:

With Coffee
  Student Num.    1    2    3    4    5    6    7    8    9   10
  Mean mark      47   57   59   67   38   78   65   59   68   49

Without
  Student Num.   11   12   13   14   15   16   17   18   19   20
  Mean mark      57   62   41   39   53   72   45   46   58   60

Did the coffee seem to have a significant effect on student performance?

The null hypothesis in trials of this kind is that there was no effect or, in statistical terms, that the two samples are drawn from distributions (assumed to be normal) with the same mean and variance. One interesting alternative hypothesis says that the samples are drawn from distributions with different means, but makes no prediction about which of the two is bigger. In words, we are testing the null hypothesis against the alternative that "coffee has some effect." One tests this using the t-statistic, as described in the recipe below.

Recipe 7.2 (Two-sample t-test) The ingredients are two lists of numbers, say {x1, x2, ..., xN1} and {y1, y2, ..., yN2}, each of which we imagine to be normally distributed. The null hypothesis is that the two samples are drawn from distributions having the same mean and variance, that is to say, from the same normal distribution. To perform the test one:

• Computes the two sample means, m1 and m2. Recall that, for example,
  $$ m_1 = \frac{1}{N_1} \sum_{j=1}^{N_1} x_j . $$

• Computes the two standard deviations, s1 and s2. Recall that, for example,
  $$ s_2^2 = \frac{1}{N_2 - 1} \sum_{j=1}^{N_2} (y_j - m_2)^2 . $$

• Computes the pooled standard deviation, s, which satisfies
  $$ s^2 = \frac{(N_1 - 1)s_1^2 + (N_2 - 1)s_2^2}{(N_1 - 1) + (N_2 - 1)} . $$

• Last, one computes
  $$ t = \frac{m_1 - m_2}{s} \sqrt{\frac{N_1 N_2}{N_1 + N_2}} . $$

Finally, one looks at tables of the t-statistic to decide if the observed value is significant. The distribution of t is more complicated than that of the normal in that it depends on ν, the number of degrees of freedom. In the experiment described here ν = N1 + N2 − 2.

Returning to the example, the two means are m1 = 58.7 and m2 = 53.3; the standard deviations are s1 = 11.69 and s2 = 10.46, so the pooled variance is s² ≈ 123. This leads to t = 1.0887 with 18 degrees of freedom. Consultation of the attached tables says that this is not a sufficiently large difference to reject the null hypothesis.
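The arithmetic in this recipe is easy to check by machine. The following Python sketch (mine, not part of the original notes) follows Recipe 7.2 step by step and reproduces the numbers quoted above; in practice one could instead use a library routine such as scipy.stats.ttest_ind, which performs the same pooled-variance test and also returns a p-value.

```python
from math import sqrt

def two_sample_t(xs, ys):
    """Recipe 7.2: pooled two-sample t-statistic and degrees of freedom."""
    n1, n2 = len(xs), len(ys)
    m1, m2 = sum(xs) / n1, sum(ys) / n2             # sample means
    v1 = sum((x - m1) ** 2 for x in xs) / (n1 - 1)  # sample variances,
    v2 = sum((y - m2) ** 2 for y in ys) / (n2 - 1)  # (N - 1) denominator
    # Pooled variance: weighted average of the two sample variances.
    s2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / ((n1 - 1) + (n2 - 1))
    t = (m1 - m2) / sqrt(s2) * sqrt(n1 * n2 / (n1 + n2))
    return t, n1 + n2 - 2

with_coffee = [47, 57, 59, 67, 38, 78, 65, 59, 68, 49]
without     = [57, 62, 41, 39, 53, 72, 45, 46, 58, 60]
t, nu = two_sample_t(with_coffee, without)
print(f"t = {t:.4f}, nu = {nu}")  # t = 1.0887, nu = 18, as in the text
```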
7.2.1 Paired Samples

In the experiment above there may have been a small effect masked by large intra-sample variation among the students. In other words, the spread in ability between the best-prepared and worst-prepared students may have been so huge that it prevented us from seeing the effect of the coffee. To address this possibility one does a slightly different experiment and tests it with a slightly different test. This is best made clear with an example.

Example 7.2 (Second half of the exam problem) The experiment was repeated but the design was altered. Ten students were chosen at random and this time sat six exams, before three of which they received coffee. Again the mean of the marks was taken and the data are shown below:

  Student Num.    1    2    3    4    5    6    7    8    9   10
  With coffee    57   62   41   39   53   72   45   46   58   60
  Without        59   68   39   43   58   73   48   56   55   67

Did the coffee have any effect on student performance in this experiment?

The null hypothesis is still that there was no effect or, in other words, that the two samples come from normal distributions with the same mean and variance. We test this using the paired-sample t-test:

Recipe 7.3 (Paired-sample t-test) The only ingredient is a list of pairs of numbers {(x1, y1), ..., (xN, yN)}; here there are N pairs. The null hypothesis is that the two members of each pair are drawn from normal distributions having the same mean. All the distributions for the x's are assumed to share the same variance, as are all those for the y's, but the variance shared by the x's need not equal that shared by the y's. To perform the test one then:

• computes the list of differences δj = (xj − yj);

• computes the mean of the differences
  $$ m = \frac{1}{N} \sum_{j=1}^{N} \delta_j ; $$

• estimates the variance of the differences
  $$ s^2 = \frac{1}{N - 1} \sum_{j=1}^{N} (\delta_j - m)^2 ; $$

• computes the paired-sample t-statistic
  $$ t = \frac{m \sqrt{N}}{s} . $$

Then, in the usual way, one looks up this statistic in standard tables. This time we use the one with N − 1 degrees of freedom.

The table below summarises the computations required by the recipe above (in the first row, for example, δ − m = −2 − (−3.3) = 1.3):

  With Coffee   Without      δ    δ − m   (δ − m)²
       57          59       −2     1.3      1.69
       62          68       −6    −2.7      7.29
       41          39        2     5.3     28.09
       39          43       −4    −0.7      0.49
       53          58       −5    −1.7      2.89
       72          73       −1     2.3      5.29
       45          48       −3     0.3      0.09
       46          56      −10    −6.7     44.89
       58          55        3     6.3     39.69
       60          67       −7    −3.7     13.69
  Totals                   −33             144.10

so that m = −33/10 = −3.3 and s = √(144.1/(10 − 1)) ≈ 4.0. This yields m = −3.3, s ≈ 4 and t ≈ −2.6. The attached table shows that |t| exceeds the critical value for α = 0.025, so our result is significant at the standard 95% confidence level for the two-sided test, which addresses the question "Does coffee affect exam performance?" (alternative hypothesis: "the mean score with coffee differs from that without"). In light of the previous study we might also wish to test the hypothesis that coffee improves exam performance; here one should do a one-sided test to see if the mean of the "with" sample is unexpectedly high, but for these data we needn't bother: the mean of the δ's is negative.
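For completeness, here is a matching Python sketch of Recipe 7.3 (again mine, not from the original notes), checked against the worked example above; the library routine scipy.stats.ttest_rel implements the same test and also returns a p-value.

```python
from math import sqrt

def paired_t(xs, ys):
    """Recipe 7.3: paired-sample t-statistic with N - 1 degrees of freedom."""
    n = len(xs)
    deltas = [x - y for x, y in zip(xs, ys)]          # per-pair differences
    m = sum(deltas) / n                               # mean difference
    s2 = sum((d - m) ** 2 for d in deltas) / (n - 1)  # variance of differences
    return m * sqrt(n) / sqrt(s2), n - 1

with_coffee = [57, 62, 41, 39, 53, 72, 45, 46, 58, 60]
without     = [59, 68, 39, 43, 58, 73, 48, 56, 55, 67]
t, nu = paired_t(with_coffee, without)
print(f"t = {t:.2f}, nu = {nu}")  # t = -2.61, nu = 9, as in the text
```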