HYPOTHESIS TESTING: SEEING GENERAL PRINCIPLES THROUGH THE ONE-SAMPLE T-TEST

It’s often helpful to look again at class material but in a slightly different way. In this case I want to explore, again, the general idea of hypothesis testing but from a different direction. The goal is to help you establish your intuition and understanding more firmly by reviewing the material in a different order than it was presented in class.

Let’s begin by thinking about how we might construct a test of the idea that the true mean body length of adult killifish, Fundulus grandis, in the marshes near the FSU Marine Lab (FSUML) is 100 mm. Now, clearly, if we catch and measure 50 adult F. grandis at the FSUML and they range in size from 55 mm to 75 mm with an average of 64 mm, we’d probably be willing to bet, without doing any statistics, that the average size is NOT 100 mm. We come to this conclusion because we find it hard to believe that, in a sample of 50 fish, we wouldn’t catch even one fish that was 100 mm, and that the average of those 50 fish could be 64 mm, if the real mean were 100 mm. In other words, a difference of 36 mm between the observed average (64 mm) and our hypothesized average (100 mm) is too large, compared to the range of variation we found, to be consistent with the hypothesis. Sure, it could happen, but we’d have to have caught a very aberrant sample of adults, which might happen only once in a very, very large number of trials. In this case we’d reject the hypothesis as being extremely unlikely. This is of course the essence of hypothesis testing - we encounter a result that is so unlikely under the original, or null, hypothesis that we cannot believe that hypothesis to be true. All statistical hypothesis tests in the frequentist mode of operation (as opposed to a purely Bayesian mode) work in this manner.

But what if our sample of 50 adults ranged from 80 to 110 mm with an average of 90 mm? Now we might have several adults whose size was greater than 100 mm even though the average size was below 100 mm. It could be that a sample with this average might occur reasonably often even if the true mean were 100 mm. In other words, we might believe that this result is not sufficiently odd under the null hypothesis; it’s not sufficiently unlikely, if you will, to cause us to reject the null hypothesis.

In thinking about the problem purely at this intuitive level, we might be critical of our reasoning at this point, wondering whether we have a clear idea of when a result is sufficiently unlikely to cause us to reject our null hypothesis. That is, how unlikely does a result have to be before we reject the original (= null) hypothesis? Now, we could, without reference to any numerical method, decide on a criterion for rejection. That is, we might say that we’ll reject the null hypothesis if the probability of a result like the one we have, or a result even more extreme than the one we have, is no greater than one-in-twenty, or 0.05. This gives us a veneer of objectivity except for the untidy fact that we don’t have a way to estimate this probability. Clearly what we need is some function of the data whose distribution under the null hypothesis can be found. If we had that distribution, then we would know how unlikely any particular result would be under that null hypothesis. So let’s build an intuitively appealing function of the data. That is, let’s build a function of the data that has the intuitive property of revealing something about the likely truth of the null hypothesis.
First, we could agree that the further the sample average deviates from the one predicted by the null hypothesis, the less likely we are to believe that the null hypothesis is true. So our function should include some variation on that difference, X̄ − μ₀, where X̄ refers to the sample average and μ₀ refers to the mean value specified by the null hypothesis. But the credence we’d give a particular absolute difference might depend on how variable the data appeared. A deviation of 10 mm from the prediction might be more convincing if our sample of 50 adult fish varied only over a range of 10 mm, say from 85 to 95 mm with an average of 90 mm. If they varied more widely, say from 80 to 110 mm with an average of 90 mm, we’d likely be less confident in declaring the null hypothesis false, as we stated earlier. So we might like to weight X̄ − μ₀ by some measure of the spread in the data. Using the range (maximum − minimum) might not be a good idea because it is very sensitive to one or two odd points in the sample. But we know that the variance is a good measure of the spread of a distribution, so we could use the sample variance, s², as a weight. The higher the variance, the less confident we are in a given difference; the lower the variance, the more confident we are. We could then weight X̄ − μ₀ by the variance; more specifically, we would weight by the inverse of the variance, which would give us an index with properties we like (the index rises as X̄ − μ₀ rises, and rises as s² falls). All seems well, except that variances are in squared units while a difference is in unsquared units, and it’s always nice to work in the same unit of measurement. So we can weight by the sample standard deviation, s. Our function, (X̄ − μ₀) / s, is appealing. Now all we need do is figure out its distribution under the null hypothesis.

Now, if we think about it, we realize that this looks like something we’ve seen before. The numerator is a linear combination of a random variable and a constant, so we know how to find its expected value, at least. It’s a function of a sample average and a constant, and we know something about the distribution of sample averages and linear combinations of averages, at least as sample size goes up (central limit theorem). In fact, we know that an average will have a normal distribution with expected value μ and variance σ²/n. And so we can see that we’re dangerously close to a quantity whose distribution we, or some smart statistician, could find. One way to approach this is to realize that

(X̄ − μ₀) / (s / √n)

looks like it ought to be close to some variation on a normal distribution. Another way to approach it is to realize that if we squared all of this we’d be close to something we called an F-distribution a few weeks ago (see the section on squares of centered normal distributions). One way or another, it turns out that this index, which we call a t-statistic, has a distribution that can be found. It’s also a dimensionless quantity, which is a nice property for a general index.

Of course, we could find that distribution in a practical way. We could erect a distribution of F. grandis body sizes (any distribution would do) with a true mean of 100 mm and some variance, simulate 1,000 samples of 50 from this original distribution (thereby obtaining a distribution of sample averages), and then find the resulting distribution of our t-statistic by calculating a t-statistic for each of our 1,000 samples.
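Here is a minimal sketch of that simulation. The null mean of 100 mm and the sample size of 50 come from the example in the text; the population standard deviation of 10 mm and the use of a normal population are illustrative assumptions, chosen only to make the sketch runnable.

```python
# Monte Carlo sketch of the t-statistic's distribution under the null hypothesis.
# Assumed: sigma = 10 mm and a normal population (not specified in the text).
import numpy as np

rng = np.random.default_rng(1)

mu_null = 100.0   # mean specified by the null hypothesis (mm)
sigma = 10.0      # assumed population standard deviation (mm)
n = 50            # fish per sample
n_sims = 1000     # number of simulated samples

t_null = np.empty(n_sims)
for i in range(n_sims):
    sample = rng.normal(mu_null, sigma, size=n)
    # t = (sample mean - hypothesized mean) / (s / sqrt(n))
    t_null[i] = (sample.mean() - mu_null) / (sample.std(ddof=1) / np.sqrt(n))

# The simulated values should track the analytic t-distribution with n - 1 df.
print(np.percentile(t_null, [2.5, 97.5]))
```

The printed 2.5% and 97.5% points of the simulated distribution are, in effect, the two-sided rejection thresholds discussed next.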
Clearly the numerator of our t-statistic will be centered on 0 because we will have about as many sample averages above 100 mm as below 100 mm. We would reject the null hypothesis that the true mean is 100 mm if we take a real sample whose t-statistic fell into the “rare” range of possible t-statistics. What is the “rare” range? Well, if we say that we’re not sure whether the true mean would be greater or less than 100 mm, then the “rare” range of t-statistics consists of those values that are so large or so small that they would occur only rarely, say no more than 5% of the time. If we mean “no more than 5% of the time,” then we mean that the cumulative probability associated with all “rare” values must not exceed 5%. So we find the 2.5% thresholds on either side of our t-statistic’s distribution and name these values as our thresholds: we reject the null hypothesis if we obtain a t-statistic greater than the upper threshold or less than the lower one. If we choose a directional alternative hypothesis, that is, we want to test our null against an alternative that the true mean is larger (or smaller) than the null value, we would adjust our threshold. In this case, we are looking for a result in one direction that would occur no more than 5% of the time, so we find the critical value beyond which no more than 5% of the observations will fall. Of course, the distributions of t-statistics can be derived analytically for different sample sizes (related to the degrees of freedom used as a parameter of the t-distribution), so we don’t have to do simulations. But you can see how the whole process unfolds by thinking about the simulated distributions.

Consider what happens if the true adult size distribution were such that the true mean adult body size was 110 mm. We could simulate samples of size 50 and find the distribution of the t-statistic when the true mean is 110 mm. Remember that we’re still calculating the numerator of our t-statistic as X̄ − 100 mm because that’s our null hypothesis. You can see, I hope, that with a true mean of 110 mm and a null hypothesis of 100 mm, we will accumulate more t-values whose numerator is in excess of 0 than we did under the null hypothesis, so our t-distribution will shift to the right of the distribution under the null hypothesis. We can calculate the probability of a type II error (accepting the null hypothesis when it is wrong) and power (the probability of rejecting the null hypothesis when it is false) from that distribution simply by counting the proportion of samples falling to either side of the critical t-value that we chose when we specified the null hypothesis and the type of alternative (directional or not). In general, with analytic solutions, we could calculate these probabilities without counting up our simulation results, but you get the idea. You can also see immediately, I hope, as we showed in class, that choosing a directional or non-directional alternative changes the power of the test. A simulation sketch of this power calculation follows below.

Now consider what happens as the true mean adult size increases. If the true mean were 120 mm and we repeated this process, we’d accumulate even more positive t-values than we did when we simulated a true mean of 110 mm, the distribution would shift even further to the right of our distribution under the null hypothesis, and our power would increase. And this would happen even more as the true mean moved to 130 mm, 140 mm, and so on. So as the true mean increases, so does the power of our test. This is a critical lesson: power is a function of the true alternative hypothesis.
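The sketch below continues the previous one and estimates power under the specific alternative of a true mean of 110 mm, as in the text, by counting the proportion of simulated t-statistics beyond the two-sided critical values. The standard deviation of 10 mm and the normal population remain illustrative assumptions.

```python
# Sketch: estimating type II error and power by simulation under a specific
# alternative (true mean 110 mm), keeping the null mean of 100 mm in the numerator.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

mu_null, mu_alt = 100.0, 110.0
sigma, n, n_sims = 10.0, 50, 1000   # sigma is an assumed value

# Two-sided 5% critical value from the analytic t-distribution with n - 1 df.
t_crit = stats.t.ppf(0.975, df=n - 1)

t_alt = np.empty(n_sims)
for i in range(n_sims):
    sample = rng.normal(mu_alt, sigma, size=n)
    # Numerator still uses the null mean of 100 mm, as in the text.
    t_alt[i] = (sample.mean() - mu_null) / (sample.std(ddof=1) / np.sqrt(n))

power = np.mean(np.abs(t_alt) > t_crit)   # proportion of samples rejecting the null
type_II = 1.0 - power                     # proportion (wrongly) accepting it
print(f"estimated power = {power:.3f}, type II error rate = {type_II:.3f}")
```

Swapping the two-sided critical value for a one-sided one (stats.t.ppf(0.95, df=n - 1), applied in one direction only) shows directly how the choice of a directional alternative changes the power.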
In practical terms, we can see that, for a given sample size and variance within the data, we have more power to detect a large difference from the null hypothesis than a small one. Put another way, we can draw a curve of power on the vertical axis against magnitude of difference on the horizontal axis. The curve would rise and eventually hit an asymptote: at some point, when the true mean is a lot higher than 100 mm, almost all of the distribution of the t-statistic under the alternative hypothesis will lie to the right of the distribution under the null hypothesis, and we can’t increase power much more. I suppose this is a good juncture at which to state that this idea of power vs. magnitude of difference is NOT something we discussed in class. What we did discuss was how power is affected by the sample size and the variance in the data. We saw that as the sample size increases, the distribution of the t-statistic under any particular alternative hypothesis shifts to the right. To make sure you see why this happens, remember that we can rewrite the t-statistic as

(X̄ − μ₀) √n / s

So you can see that with a specific alternative, the numerator of the t-statistic increases as the square root of the sample size increases, moving the distribution of the t-statistic to the right. As the variation within the data increases, the t-statistic decreases because s is in the denominator. What happens with decreases in s is NOT that the whole distribution shifts location; rather, the width of the t-distribution under both the null and alternative hypotheses decreases. As the shape of the distribution changes, so do the probability of a type II error and the power: the higher the variance, the lower the power.

It’s easy to see these effects when thinking about a one-sample t-test because the effects of each factor on the test statistic and its distribution under the null and alternative hypotheses are relatively easy to visualize. The effects are the same for any and all test statistics, whether we talk about analysis of variance, regression, or other testing machinery. The key is to realize that power is a function of the specific alternative hypothesis, the sample size, and the variance of the data. Knowing these facts, you can then study how different choices of experiment and experimental design affect each of these elements and design the optimal experiment or observational hypothesis test for the problem you’re investigating.
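A rough sketch of the power curve described above, again by simulation: power is traced as the true mean moves away from the 100 mm null, for two sample sizes. The particular true means, sample sizes, and the 10 mm standard deviation are illustrative choices, not values from the text.

```python
# Sketch: power as a function of the magnitude of the true difference, for two
# assumed sample sizes. Power rises toward an asymptote at 1 as the true mean
# moves away from the null, and rises faster for the larger sample size.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def sim_power(mu_true, mu_null=100.0, sigma=10.0, n=50, n_sims=2000, alpha=0.05):
    """Estimate power of the two-sided one-sample t-test by simulation."""
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    samples = rng.normal(mu_true, sigma, size=(n_sims, n))
    t = (samples.mean(axis=1) - mu_null) / (samples.std(axis=1, ddof=1) / np.sqrt(n))
    return np.mean(np.abs(t) > t_crit)

for n in (25, 50):
    powers = [sim_power(mu, n=n) for mu in (100, 102, 104, 106, 108, 110)]
    print(f"n = {n}:", [f"{p:.2f}" for p in powers])
```

Raising sigma in the call to sim_power lowers every entry in the curve, which is the variance effect described above: the higher the variance, the lower the power.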