7 Testing for differences: Student’s t-test
In the previous sections of these notes we developed the idea of hypothesis testing
and used it to decide whether experimentally observed data was consistent with the
mean of some model distribution (e.g. did the coin come up Heads as often as it
should if it were fair?). The Central Limit Theorem played a supporting role in that
it showed us how to handle the statistics of averaged quantities: we thought of them
as approximately normally distributed. In this section we will examine two other
hypothesis tests that are applicable when one has fewer data or a less explicit null
hypothesis.
7.1 Comparing the means of two large samples
The most straightforward way to compare the means of two samples is to see if their
difference is large compared to their standard deviations. Here is a formal setting
where that is the right course. We will imagine that we have two sets of experimental
results: one with mean m1 ± s1 based on a total of N1 trials and a second with mean
m2 ± s2 based on N2 trials. Each trial will have involved the averaging of a large
number of measurements. We also imagine both N1 and N2 to be large, so that
s1/√N1 and s2/√N2 are good estimates of SEM1 and SEM2, the standard errors of
the means.
Recipe 7.1 (Difference of means: large samples) The null hypothesis is that the
means of the two underlying normal distributions are the same; there is no assumption
about their variances. The test statistic is
$$ z = \frac{m_1 - m_2}{\sqrt{(s_1^2/N_1) + (s_2^2/N_2)}} \qquad (7.1) $$
and it should be distributed normally, with zero mean and standard deviation one.
As previously, we check z for significance by consulting tables of the standard normal
distribution.
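The recipe above can be sketched in a few lines of Python. This is only an illustration: the function name large_sample_z and the summary figures passed to it are invented for the purpose and do not come from any study. In place of a table lookup, the two-sided tail probability of the standard normal is computed from the complementary error function.

```python
import math

def large_sample_z(m1, s1, n1, m2, s2, n2):
    """Return (z, two-sided p-value) for the statistic of equation (7.1)."""
    sem1_sq = s1 ** 2 / n1          # squared standard error of the first mean
    sem2_sq = s2 ** 2 / n2          # squared standard error of the second mean
    z = (m1 - m2) / math.sqrt(sem1_sq + sem2_sq)
    # P(|Z| >= |z|) for a standard normal Z, replacing the table lookup
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Hypothetical summary figures, made up purely for illustration
z, p = large_sample_z(m1=5.2, s1=2.0, n1=100, m2=4.6, s2=2.2, n2=120)
print(f"z = {z:.3f}, two-sided p = {p:.4f}")
```

A small p-value here plays the same role as finding that z lies beyond the tabulated critical value.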
This test would be appropriate if, say, one were interested in comparing the prevalence of some childhood disorder in Scotland and the UK. One might then visit a large
number of schools in both London and Glasgow and, for each, estimate the frequency
of the trait in a random sample of, say, 100 students. This sampling procedure meets
both of the requirements for the test above.
a) As one is taking 100 children per school, the number of has-the-disorder/doesn’t
results going into the prevalence calculation for each sample point is large: then
on central-limit-theorem grounds, one would expect the prevalence data per
school to be approximately normally distributed.
b) As one is visiting a lot of schools and pooling the results, the net rates for
Scotland and the UK would involve averages over large amounts of normally distributed data.
One seldom has such a vast quantity of data, so the rest of this lecture is devoted
to tests appropriate for smaller samples.
Figure 7.1: Student’s t-distribution for several values of ν: the curve with the smallest
value at t = 0 is the one for ν = 2, the one above that has ν = 4 and the one above
that ν = 8. The dashed curve at the top is the standard normal distribution (µ = 0, σ
= 1). Notice how, as ν increases, the t-distribution becomes more and more similar
to a normal distribution.
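The ordering of the curves in Figure 7.1 can be checked numerically. The sketch below evaluates the height of each curve at t = 0. It relies on the standard closed-form density of the t-distribution, which is not stated in these notes, and the function names t_pdf and normal_pdf are chosen here for illustration.

```python
import math

def t_pdf(t, nu):
    """Density of Student's t-distribution with nu degrees of freedom."""
    coeff = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return coeff * (1 + t * t / nu) ** (-(nu + 1) / 2)

def normal_pdf(x):
    """Density of the standard normal distribution (mean 0, sd 1)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Peak heights at t = 0 for the values of nu plotted in the figure,
# plus a much larger nu to show the approach to the normal curve
for nu in (2, 4, 8, 100):
    print(f"nu = {nu:3d}: height at t = 0 is {t_pdf(0.0, nu):.4f}")
print(f"normal  : height at 0     is {normal_pdf(0.0):.4f}")
```

The heights increase with ν and stay below the normal value, in line with the caption.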
7.2 Student’s t-test
Our first test was invented by a statistician working for the Guinness brewing company,
W.S. Gosset (1876-1937). Employees of the firm were not allowed to publish under
their own names so he wrote under the pseudonym ‘Student’. His statistic is similar to
a z-score, but his contribution was to work out its distribution when the sample is so
small that the Central Limit Theorem does not apply and one cannot expect to have
a good estimate of the mean. His test-statistic, called Student’s t-statistic, is perhaps
the most commonly used of all those we will study because it enables one to test for
nonzero differences between two means even when the samples are small. But this
advantage comes at a small cost: the t-distribution (and hence the table one consults
to use it) is less straightforward than the normal distribution in that it
depends on the amount of data available. Figure 7.1 shows t-distributions for
various values of the number of degrees of freedom, ν, about which more is said in the
recipe below.
Example 7.1 (First half of an old exam problem) A trial was conducted to determine whether students performed better in examinations if they drank coffee just
before sitting the paper. Students were divided randomly into two groups of 10; one
group was given coffee before each of three exams and the other was not. The mean
marks for the three papers are recorded below:
       With coffee                  Without coffee
Student Num.   Mean mark     Student Num.   Mean mark
      1           47              11           57
      2           57              12           62
      3           59              13           41
      4           67              14           39
      5           38              15           53
      6           78              16           72
      7           65              17           45
      8           59              18           46
      9           68              19           58
     10           49              20           60
Did the coffee seem to have a significant effect on student performance?
The null hypothesis in trials of this kind is that there was no effect or, in statistical
terms, the two samples are drawn from distributions (assumed to be normal) with
the same mean and variance. One interesting alternative hypothesis says that the
samples are drawn from distributions with different means, but makes no prediction
about which of the two is bigger. In words, we are testing the null hypothesis against
the alternative that “coffee has some effect.” One tests this using the t-statistic, as
described in the recipe below.
Recipe 7.2 (Two-sample t-test) The ingredients are two lists of numbers, say,
{x1, x2, ..., xN1} and {y1, y2, ..., yN2}, each of which we imagine to be normally
distributed. The null hypothesis is that the two samples are drawn from distributions having the same mean and variance, that is to say, from the same normal
distribution. To perform the test one:
• Computes the two sample means, m1 and m2. Recall that, for example,
$$ m_1 = \frac{\sum_{j=1}^{N_1} x_j}{N_1} . $$
• Computes the two standard deviations, s1 and s2. Recall that, for example,
$$ s_2^2 = \frac{\sum_{j=1}^{N_2} (y_j - m_2)^2}{N_2 - 1} . $$
• Computes the pooled standard deviation, s, which satisfies
$$ s^2 = \frac{(N_1 - 1) s_1^2 + (N_2 - 1) s_2^2}{(N_1 - 1) + (N_2 - 1)} . $$
• Last, one computes
$$ t = \frac{m_1 - m_2}{s} \sqrt{\frac{N_1 N_2}{N_1 + N_2}} . $$
Finally, one looks at tables of the t-statistic to decide if the observed value is
significant. The distribution of t is more complicated than that of the normal in that
it depends on ν, the number of degrees of freedom. In the experiment described here
ν = N1 + N2 − 2.
Returning to the example, the two means are m1 = 58.7 and m2 = 53.3; the
standard deviations are s1 = 11.69 and s2 = 10.46, so the pooled variance is s² ≈ 123.
This leads to t = 1.0887 with 18 degrees of freedom. Consultation of the attached
tables says that this is not a sufficiently large difference to reject the null hypothesis.
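The arithmetic of Example 7.1 can be reproduced with a short Python sketch that follows Recipe 7.2 step by step using only the standard library. The function name pooled_t is chosen here for illustration, and the critical value 2.101 is the usual tabulated two-sided 5% point for 18 degrees of freedom, entered by hand as an assumption rather than computed.

```python
import math
from statistics import mean, stdev

# Mean marks from Example 7.1
with_coffee = [47, 57, 59, 67, 38, 78, 65, 59, 68, 49]
without_coffee = [57, 62, 41, 39, 53, 72, 45, 46, 58, 60]

def pooled_t(x, y):
    """Return (t, degrees of freedom, pooled variance) as in Recipe 7.2."""
    n1, n2 = len(x), len(y)
    m1, m2 = mean(x), mean(y)
    s1, s2 = stdev(x), stdev(y)  # sample standard deviations (N - 1 divisor)
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / ((n1 - 1) + (n2 - 1))
    t = (m1 - m2) / math.sqrt(pooled_var) * math.sqrt(n1 * n2 / (n1 + n2))
    return t, n1 + n2 - 2, pooled_var

t, nu, s_sq = pooled_t(with_coffee, without_coffee)

# Assumed table value: two-sided 5% critical point of t for nu = 18
T_CRIT_18 = 2.101
print(f"t = {t:.4f} with {nu} degrees of freedom, pooled variance {s_sq:.1f}")
print("reject null hypothesis" if abs(t) > T_CRIT_18 else "cannot reject null hypothesis")
```

Because the sketch uses the pooled variance, it assumes, as the recipe does, that the two underlying distributions share a common variance.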
7.2.1 Paired Samples
In the experiment above there may have been a small effect masked by large intrasample variation among the students. In other words, the spread in ability between
the best-prepared and worst-prepared students may have been so huge that it prevented us seeing the effect of the coffee. To address this possibility one does a slightly
different experiment and tests it with a slightly different test. This is best made clear
with an example.
Example 7.2 (Second half of the exam problem) The experiment was repeated
but the design was altered. Ten students were chosen at random and this time sat six
exams, before three of which they received coffee. Again the mean of the marks was
taken and the data are shown below:
Student Num.   With coffee   Without
      1            57           59
      2            62           68
      3            41           39
      4            39           43
      5            53           58
      6            72           73
      7            45           48
      8            46           56
      9            58           55
     10            60           67
Did the coffee have any effect on student performance in this experiment? The null
hypothesis is still that there was no effect or, in other words, the two samples come
from normal distributions with the same mean and variance. We test this using the
paired-sample t-test:
Recipe 7.3 (Paired-sample t-test) The only ingredient is a list of pairs of numbers {(x1 , y1 ), . . . , (xN , yN )}; here there are N pairs. The null hypothesis is that
the two members of each pair are drawn from normal distributions having the same
mean. All the distributions for the x’s are assumed to share the same variance as are
all the y’s, but the variance shared by the x’s need not equal that shared by the y’s.
To perform the test one then:
• computes the list of differences δj = (xj − yj);
• computes the mean of the differences
$$ m = \frac{\sum_{j=1}^{N} \delta_j}{N} ; $$
• estimates the variance of the differences
$$ s^2 = \frac{\sum_{j=1}^{N} (\delta_j - m)^2}{N - 1} ; $$
• computes the paired-sample t-statistic
$$ t = \frac{m \sqrt{N}}{s} . $$
Then, in the usual way, one looks up this statistic in standard tables. This time
we use the one with N − 1 degrees of freedom.
The table below summarises the computations required by the recipe above:
With coffee   Without     δ     (δ − m)                (δ − m)²
    57          59        -2    (-2 - (-3.3)) = 1.3      1.69
    62          68        -6    -2.7                     7.29
    41          39         2     5.3                    28.09
    39          43        -4    -0.7                     0.49
    53          58        -5    -1.7                     2.89
    72          73        -1     2.3                     5.29
    45          48        -3     0.3                     0.09
    46          56       -10    -6.7                    44.89
    58          55         3     6.3                    39.69
    60          67        -7    -3.7                    13.69
Totals                   -33                           144.1

⇒ m = −33/10 = −3.3
⇒ s = √(144.1/(10 − 1)) ≈ 4.0
and yields m = −3.3, s ≈ 4 and t ≈ −2.6. The attached table shows that the magnitude of t exceeds the
critical value for α = 0.025 with 9 degrees of freedom, so our result is significant at the standard 95% confidence
level for the two-sided test, which addresses the question “Does coffee affect exam
performance?” (alternative hypothesis: “mean score with coffee differs from that
without”). In light of the previous study we might also wish to test the hypothesis
that coffee improves exam performance; here one should do a one-sided test to see if
the mean “with” sample is unexpectedly high, but for these data we needn’t bother:
the mean of the δ’s is negative.
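As a check on the table of computations for Example 7.2, here is a minimal Python sketch of Recipe 7.3 using only the standard library. The function name paired_t is chosen here for illustration, and the critical value 2.262 is the usual tabulated two-sided 5% point for 9 degrees of freedom, typed in by hand as an assumption rather than computed.

```python
import math
from statistics import mean, stdev

# Mean marks from Example 7.2, one pair per student
with_coffee = [57, 62, 41, 39, 53, 72, 45, 46, 58, 60]
without_coffee = [59, 68, 39, 43, 58, 73, 48, 56, 55, 67]

def paired_t(x, y):
    """Return (t, degrees of freedom, mean difference) as in Recipe 7.3."""
    deltas = [a - b for a, b in zip(x, y)]  # delta_j = x_j - y_j
    n = len(deltas)
    m = mean(deltas)                        # mean of the differences
    s = stdev(deltas)                       # sample sd of the differences (N - 1 divisor)
    return m * math.sqrt(n) / s, n - 1, m

t, nu, m = paired_t(with_coffee, without_coffee)

# Assumed table value: two-sided 5% critical point of t for nu = 9
T_CRIT_9 = 2.262
print(f"m = {m:.1f}, t = {t:.3f} with {nu} degrees of freedom")
print("significant (two-sided)" if abs(t) > T_CRIT_9 else "not significant (two-sided)")
```

The sign of t simply reflects the choice δ = (with coffee) − (without); the two-sided comparison uses its magnitude.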