Chapter Five
Hypothesis Testing: Concepts
The Purpose of Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
An Initial Look at Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Formal Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
    Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
    Null and Alternate Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
    Procedure for Formal Hypothesis Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
    Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Errors in Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
    Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
    False Positive Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
    False Negative Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
    Summary: Choosing the Confidence Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Chapter Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
The Purpose of Hypothesis Testing
The purpose of obtaining measurements of a chemical system is usually to draw some
conclusions about the properties of the system. One of the simplest uses of statistics, and the one
that has largely concerned us to this point, is to obtain an estimate of the system properties
through the use of confidence intervals. This is an aspect of statistical estimation theory. Now,
however, we turn our attention to decision theory, where we learn how we can use measurement
statistics to draw general conclusions about chemical systems.
The following are examples of situations where we want to draw some kind of conclusion based
on measurements:
• two reactants are mixed, and the concentrations of the products are monitored as a function of
time in order to determine the rate constant, k, of the reaction. You want to compare the results
of your measurement with a value calculated from theory.
• you have just come up with a new synthetic procedure for a certain commercial product that
you believe increases the yield over the currently accepted method. You measure the yield by
both methods, and you find that your method gives a 65% yield while the older method gave a
60% yield. You must prove that your method is actually superior to the older method, and that
the increase in yield is not due to the uncertainty in the measured values.
For a more detailed example, consider the following situation. Let’s say we obtain the following
measurements of the pH of a particular solution
pH measurements: 9.5, 9.9, 9.8
Now we wish to know whether it is possible to state, with confidence, that the pH of the solution
is less than 10. If we can assume that the measurements are unbiased, we can restate this question
in a form that can be evaluated with statistics, namely:
“is it true that pH < 10?”
Now, assuming no measurement bias, the fact that none of the measurements of pH are greater
than 10 seems to support the notion that the true pH of the solution is less than ten. However,
since the measurement of pH is a random variable, there is always a chance that the actual pH is
indeed greater than ten, and that the three measurements, by random chance, all happen to be less
than 10 – just like there is a chance that three coin flips in a row will come up tails, even though
there is a fifty-fifty chance of getting heads on any single coin toss.
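To make the coin-flip analogy concrete: if the true pH were exactly 10, an unbiased measurement would be equally likely to fall above or below it, so the chance that all three measurements land below 10 by luck alone is (1/2)³ = 1/8. A one-line sketch in Python:

```python
# Chance that all n measurements fall below the true value purely by luck,
# assuming each unbiased measurement is equally likely to fall above or below it.
def prob_all_below(n: int) -> float:
    return 0.5 ** n

print(prob_all_below(3))  # 0.125 -- about a 1-in-8 chance
```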
Our problem is this: at what point can we say that random variability is an unlikely explanation
for the difference between the measured pH values and a fixed value (e.g., a pH of 10)? In other
words, when do the measured values “differ significantly” from the fixed value? The meaning of
the word “significantly” must be very clear: a statistically significant difference in the values
would be a greater difference than could be reasonably explained by random error.
This is exactly the type of question that hypothesis testing answers. Hypothesis tests are
sometimes called significance tests, since they detect “significant” differences in numbers,
differences that are unlikely to be due to random chance.
Page 110
Chapter 5
Hypothesis Testing: Concepts
An Initial Look at Hypothesis Testing
Let’s use an example to help us to see how we might derive conclusions using random variables
(i.e., measurements).
Example 5.1
A cigarette manufacturer states that the nicotine level of its cigarettes is 14 mg per cigarette.
You wish to test this claim. You collect a random sample of 5 cigarettes and test for nicotine
content. The measured nicotine level (in mg) of the cigarettes in the sample are
14.05, 14.33, 16.36, 18.55, 14.76.
Do these measurements indicate a nicotine level different than that claimed by the
manufacturer?
Basically, what we would like to do is test the following statement:
Hypothesis: The true nicotine level of the cigarettes is different from that claimed
(14 mg) by the manufacturer.
Let’s calculate the mean of the measured nicotine level.
x = (14.05 14.33 16.36 18.55 14.76) mg     measurements
x_bar = mean(x)
x_bar = 15.61 mg
So the mean measured level of nicotine in the five cigarettes was 15.61 mg/cigarette. Obviously,
this value is somewhat larger than the nicotine level stated by the manufacturer. The question is,
however, is the difference between the nicotine levels “significant?” Do we have any
justification for challenging the nicotine levels claimed by the manufacturer?
In order to answer this question, we need more information than simply the measurement
average: we must also make use of the observed variability of the five measurements to construct
a confidence interval.
s_x = stdev(x)
se = s_x / sqrt(5)           se = 0.837 mg        standard error of mean value
t = 2.7765                                        critical t-value for 4 df's at the 5% level
width = t · se               width = 2.32 mg
x_lower = x_bar − t · se     x_lower = 13.29 mg   lower boundary of CI
x_upper = x_bar + t · se     x_upper = 17.93 mg   upper boundary of CI
In this instance, the 95% confidence interval is 15.61 ± 2.32 mg/cigarette. Recall exactly what
this interval represents: assuming no bias, this range of values (13.29 → 17.93 mg) contains the
true amount of nicotine in the cigarettes analyzed, with 95% probability.
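The interval above can be reproduced in a few lines of Python (a sketch using scipy.stats; the variable names mirror the worksheet):

```python
from statistics import mean, stdev

from scipy import stats

x = [14.05, 14.33, 16.36, 18.55, 14.76]      # measured nicotine levels, mg

x_bar = mean(x)                              # sample mean, 15.61 mg
se = stdev(x) / len(x) ** 0.5                # standard error of the mean, ~0.837 mg
t_crit = stats.t.ppf(0.975, df=len(x) - 1)   # ~2.7765 for 4 df (95% CI, two-tailed)
width = t_crit * se                          # half-width of the interval, ~2.32 mg

print(f"95% CI: {x_bar - width:.2f} to {x_bar + width:.2f} mg")
```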
Since the confidence interval calculated from the measurements on five cigarettes includes 14
mg, we cannot support the original hypothesis that the manufacturer’s claimed nicotine level is
incorrect. In other words, the difference between the measurement mean of 15.61 mg and the
manufacturer’s stated level of 14 mg is not significant.
Note that we must be very careful in how we phrase our conclusion. Even though the confidence
interval includes the value 14 mg, we have not proven that the manufacturer’s claim is true. In
other words
• we do not prove that [nicotine] = 14 mg/cigarette. We can only state that there is a 95%
probability that the true nicotine content is somewhere between 13.29 and 17.93 mg; our best
estimate of the nicotine content is 15.61 mg.
• we cannot prove (with 95% probability) that [nicotine] ≠ 14 mg/cigarette, since the 95%
confidence interval contains this value.
We have just had our first brush with hypothesis testing, where we use data (containing random
error) from an experiment to test an assertion. This is obviously an important area of statistics,
and one that we will discuss in detail.
Formal Hypothesis Testing
Introduction
In the last section, a confidence interval was constructed in order to test a specific hypothesis. In
scientific endeavors, there are a wide variety of different types of hypotheses that may need to be
tested using the results of one or more experiments. In this section, we will formalize the
procedure to be used in hypothesis testing. Although the procedure may seem a little rigid, it can
be adapted to almost any situation. The price for the general applicability of the procedure is the
use of somewhat abstract language and concepts.
Null and Alternate Hypotheses
All hypothesis tests actually involve at least two statements, called the null hypothesis (H0) and
the alternate (or working) hypothesis (H1). A statistical hypothesis is an assertion or conjecture
concerning one or more population parameters. Basically, this step is a translation from words to
population parameters. The null hypothesis, H0, will generally involve an equality and one or
more population parameters. In our nicotine example, the null hypothesis would be:
H0: µx = 14 mg/cigarette
null hypothesis
In other words, we accept as the null hypothesis the manufacturer’s claim that each cigarette
contains 14 mg of nicotine. If the null hypothesis is true, and if there is no bias in the
measurements, then the population mean µx of all measurements will be 14 mg. As you can see,
the null hypothesis involves a population parameter (µx, the population mean of the
measurements) and a statement of equality. As we will stress time and again, the null hypothesis
cannot be proven. It is assumed as fact unless the data proves otherwise.
The alternate hypothesis, H1, will be a statement involving the same population parameters, in
such a way that H1 and H0 cannot both be true. Usually the alternate hypothesis involves one of
the following relational operators: ≠, <, or >. For our example,
H1: µx ≠ 14 mg/cigarette
alternate hypothesis
(two-tailed test)
Alternate hypotheses such as this one, with a “not equals” (≠) relationship, result in two-tailed
tests. This statement claims that the measurement population mean is not 14 mg; if we assume no
measurement bias, this hypothesis disputes the manufacturer’s claim of nicotine level.
The form of both hypotheses is very important, particularly that of the alternate hypothesis. This
is because we are testing the alternate hypothesis in the hypothesis test procedure.
Suppose we actually suspect that the manufacturer is underestimating the nicotine level in the
cigarettes; in this case, we would use the following alternate hypothesis:
H1: µx > 14 mg/cigarette     a different alternate hypothesis
                             (one-tailed test)
or, equivalently,
H1: “the true nicotine content is greater than 14 mg/cigarette”
This form of H1 would result in a slightly different hypothesis test. Alternate hypotheses such as
this one, with a greater than (>) or less than (<) relationship, result in one-tailed tests.
In the hypothesis testing procedure, we assume that the null hypothesis is true, and it is not
tested. The goal of the procedure is to test the assertion embodied by the alternate hypothesis, H1.
If H1 is proven to be true, then obviously H0 will be false. This format is exactly the same as that
of the US criminal legal system, as represented in the famous statement “innocent until proven
guilty.” In statistical hypothesis testing, H0 is assumed to be true unless H1 can be proven to be
true with reasonable certainty.
Procedure for Formal Hypothesis Tests
For easy reference, here is a list of the steps in hypothesis testing; each step will be discussed in
detail.
1. Form the null hypothesis, H0, and the alternative hypothesis, H1, in terms of statistical
population parameters.
2. Choose the desired confidence level (or, equivalently, the significance level).
3. Choose a test statistic and calculate it.
4. Calculate the critical values; alternately, determine the P-value of the test statistic.
5. State the conclusion clearly, avoiding statistical jargon.
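The five-step procedure can be sketched as a small Python helper (illustrative only; `scipy.stats.ttest_1samp` computes the studentized mean and its two-tailed P-value in one call):

```python
from scipy import stats

# Steps 1-5 for a one-sample, two-tailed test of H0: mu_x = k,
# sketched as a single helper. This is an illustration, not the text's worksheet.
def two_tailed_test(data, k, confidence=0.95):
    t_obs, p_value = stats.ttest_1samp(data, popmean=k)  # steps 3 and 4
    accept_h1 = p_value < (1 - confidence)               # decision criterion
    return t_obs, p_value, accept_h1                     # step 5: report

t_obs, p, h1 = two_tailed_test([14.05, 14.33, 16.36, 18.55, 14.76], 14.0)
print(f"T = {t_obs:.4f}, P = {p:.4f}, accept H1? {h1}")
```

For the nicotine data this reproduces the T and P values derived below for example 5.1, and the test fails to accept H1 at the 95% level.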
Step 1: State the null hypothesis (H0) and the alternate hypothesis (H1)
We have described the null and alternate hypotheses. Formulating these is the most difficult but
crucial part of the test procedure. Remember that we begin with an assumption that H0 is true,
and that we are trying to test H1. We may be interested in either proving or disproving H1.
The following table gives the null hypotheses for three common statistical tests. Note that the
null hypothesis always involves population parameters, and (in these cases) is expressed as an
equality.
Situation                                  Form of the null hypothesis     Answers the question:

comparison of a random variable, x,        H0: µx = k                      Is there a significant difference between the
and a fixed value, k                                                       mean of some measurements, and some fixed value?

comparison of the means of two             H0: µx = µy                     Is there a significant difference between the
variables, x and y                                                         means of two sets of measurements?

comparison of the variances of two         H0: σx² = σy²                   Is there a significant difference between the
variables, x and y                                                         variances of two sets of measurements?
The alternate hypotheses, H1, in these cases may involve an inequality (≠) or a relational operator
(< or >). As discussed previously, the form of H1 determines whether we use a one-tailed or a
two-tailed test.
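For reference, each situation in the table above has a counterpart in Python's scipy.stats (a sketch with made-up data; scipy offers no single F-test call, so the variance ratio is formed directly):

```python
from statistics import variance

from scipy import stats

x = [9.5, 9.9, 9.8]          # hypothetical data set
y = [9.7, 10.0, 9.6, 9.9]    # a second hypothetical data set

# 1. mean of x vs. a fixed value k       (H0: mu_x = k)
t1, p1 = stats.ttest_1samp(x, popmean=10.0)

# 2. mean of x vs. mean of y             (H0: mu_x = mu_y)
t2, p2 = stats.ttest_ind(x, y)

# 3. variance of x vs. variance of y     (H0: sigma_x^2 = sigma_y^2);
#    form the F ratio and take the smaller tail twice for a two-tailed test
F = variance(x) / variance(y)
p3 = 2 * min(stats.f.sf(F, len(x) - 1, len(y) - 1),
             stats.f.cdf(F, len(x) - 1, len(y) - 1))
```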
Step 2: Choose the desired level of confidence/significance
Remember that any confidence interval has an associated confidence level. The purpose of a
confidence interval is to “bracket” the possible values for a population parameter such as µx.
Random variables always add a little “spice” (i.e., uncertainty) to any conclusion; there is always
a chance that we are wrong, since random variables are, well, random. So the confidence level is
needed to state the probability that the population parameter is truly contained within our
confidence interval. It is a measure of how much we trust the interval, how “confident” we are in
our result.
Since confidence intervals play a crucial role in hypothesis testing, it is not surprising that we
generally choose a confidence level when testing assertions using the results of experiments,
which are almost always random variables. The meaning of the confidence level in hypothesis
testing is slightly different than in confidence intervals, however.
Consider our example. We have two competing hypotheses: H0: µx = 14 mg and H1: µx ≠ 14 mg.
We are testing the alternate hypothesis, H1, and there are two possible outcomes:
1. We succeed in proving that H1 is true, in which case H0 is known to be false.
2. We fail to prove that H1 is true. [Remember! We cannot prove that H0 is true.]
The confidence level in hypothesis testing measures our certainty when we succeed in
proving H1. It is the probability that the conclusion that H1 is true and H0 is false is correct. Let’s
assume that we want to test at the 95% level for our example. That means that, if our test proves
that the nicotine level is not 14 mg, there is a 95% probability that our data has led us to the
proper conclusion.
You might wonder: why wouldn’t I want to be very certain in my conclusion? In other words,
shouldn’t I always choose a high confidence level in hypothesis testing (at least 95%, and maybe
99% or even 99.9%!). We will defer a discussion of the appropriate confidence level in testing to
later in the chapter. But for now, ask yourself this question: why don’t you similarly always
choose a high confidence level in constructing confidence intervals? A 95% confidence interval
is commonly given; why not always use 99%, or 99.9%? What effect would that have on the
confidence interval? There are both advantages and disadvantages in choosing high confidence
levels, as we will discover.
In statistics, the term significance level is probably more common than confidence level in
hypothesis testing. The significance level (SL) is directly related to the confidence level (CL):
SL = 100% − CL. Thus, instead of testing at the 95% confidence level, we may instead test at the
5% significance level and arrive at the same conclusions. Although we will tend to use the term
“confidence level” in this text, you should be familiar with both terms.
Step 3: Choose a test statistic and calculate its value
The next step in hypothesis testing is to choose a statistic (the test statistic) appropriate for
testing the hypotheses. The test statistic (like any statistic) is a value calculated in some manner
from the data. Since the data presumably contain random error, the test statistic will likewise be a
random variable. There are two requirements for a test statistic:
1. Its probability distribution must be known; preferably, tables of critical values exist for the
statistic.
2. The test statistic should result in a reasonably “good” (or “efficient”) hypothesis test. What
factors might make one test better than another? Let’s come back to that point in a little bit.
In example 5.1, the null and alternate hypotheses both deal with the population mean µx of the
measurements, so it would seem that we could certainly use the sample mean of the
measurements as the basis for the test statistic. In constructing a confidence interval for µx, the
t-distribution is used (when σx is not known). This suggests that the following test statistic, T,
could be used in this hypothesis test:
T = (x_bar − 14 mg) / s(x_bar)     possible test statistic
The test statistic is the studentized sample mean. It has a t-distribution; if H0 is true, then µT = 0.
The sample mean is not the only possible basis of the test statistic. Instead, we could use the
sample median, or some other form of weighted average. It turns out that for normally distributed
data, the studentized sample mean is the best test statistic to use for hypothesis tests such as for
example 5.1.
Let’s calculate the observed value for the test statistic for the five measurements in example 5.1:
T_obs = (x_bar − 14 mg) / se     T_obs = 1.9243     This is the "studentized" mean: the number of
                                                    std devs of the mean from 14 mg
In this equation, “se” is the standard error of the sample mean, xbar. According to the observed
test statistic, the mean of the measurements, 15.61 mg/cigarette, is 1.92 standard deviations from
the manufacturer’s claimed value of 14 mg/cigarette.
Step 4: Calculate the critical value(s) or the P-value
It is important to keep in mind that the null hypothesis, H0, is “innocent until proven guilty.” The
probability distribution of the test statistic, T, assuming that the null hypothesis is true, is called
the null distribution. The next step in hypothesis testing is to calculate the critical value(s) of the
null distribution.
For two-tailed tests, such as the one we must use for example 5.1, there are two critical values.
(One-tailed tests only have a single critical value). The null distribution of T is a t-distribution
with four degrees of freedom and a mean of zero. Recalling that we choose 95% as our
confidence level, the critical values are the values such that
Tcrit = ± t4,0.025 = ± 2.7765
[Figure: the null t-distribution (4 degrees of freedom) with 95% of its area between the lower
and upper critical values, T = −2.7765 and T = +2.7765. Between the critical values H0 is
accepted; in either tail, H0 is rejected and H1 is accepted.]
Figure 5.1: Decision criteria for the hypothesis test for example 5.1. If the observed test
statistic is above the upper critical value, or below the lower critical value, then we accept
the alternate hypothesis, H1, and reject the null hypothesis, H0.
The critical values are the boundaries between two decision-making regions:
• the acceptance region, between the two critical values. If the test statistic assumes a value in
this region, then the null hypothesis, H0, is accepted. We cannot prove the alternate hypothesis,
H1, with the desired confidence level.
• the rejection region, where Tobs > Tupper or Tobs < Tlower. If the test statistic is in this region, then
H0 is rejected and H1 is accepted. We have proven that H1 is true at the desired confidence
level.
By inspecting the null distribution, we can see how the critical values are chosen, and we can
understand the role of the confidence level in hypothesis testing. Figure 5.1 shows the situation
for a two-tailed test at the 95% confidence level. We choose the critical values so that 95% of
the area under the null distribution is between the critical values. What this means is that, if the
null hypothesis is true, there is a 95% probability that the observed test statistic will be within the
acceptance region.
It is not strictly necessary to calculate the critical values. An alternative approach makes use of
the concept of the P-value, which has been mentioned before. The P-value can be interpreted in
terms of the null distribution; in particular, for a two-tailed test, the P-value is
P_obs = P(T > T_obs) + P(T < −T_obs) = 2 · P(T > T_obs)     two-tailed P-value
Consider example 5.1: the mean of five measurements of nicotine content was 15.61
mg/cigarette, which is 1.92 standard deviations from the manufacturer’s claimed value. Most
statistical programs and spreadsheets will also calculate the P-value; for example 5.1, the
two-tailed P-value is
Pobs = 0.1266
In other words, if the null hypothesis were true, there is a 12.66% probability that we would
obtain a sample mean that is farther than 1.92 standard deviations from 14 mg/cigarette (in either
direction).
The P-value is used instead of (or in addition to) critical values. It indicates the weight of the
evidence in favor of the alternate hypothesis: the smaller the P-value, the less likely it is that
random variability can account for the observed data.
To tie the P-value approach to the “critical region” approach, consider this: the P-value tells us
the maximum value of the confidence level that we can adopt and still prove the alternate
hypothesis. We calculate this value by
CL = 100% · (1 − P_obs)     maximum confidence level
where CL is the confidence level as a percentage. For example 5.1, if we choose a confidence
level of 87.34% or less, then we can prove that the alternate hypothesis is true. Of course, a
smaller confidence level means that we are less confident of our conclusion, so we want a
P-value as small as possible.
We may more directly interpret the P-value in terms of the significance level. The P-value is the
largest significance level at which we may accept the alternate hypothesis. Thus, in this example,
we can prove H1 at the 12.66% significance level, at best. Remember: a smaller significance
level means we are more certain of this conclusion.
Aside: calculating P-values in Excel
When the null distribution is a t-distribution, the P-value is calculated in Excel by using the
TDIST() function:
P_obs = TDIST(T_obs, df, tails)     calculating P-values in Excel
where T_obs is the observed value of the test statistic, df is the degrees of freedom of the
t-distribution, and tails is either one or two (for 1- or 2-tailed P-values). For example 5.1, you
would enter “=TDIST(1.9243, 4, 2)” into any cell to obtain the 2-tailed P-value.
Other Excel functions would be needed when the null distribution does not follow a
t-distribution.
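Outside Excel, the same P-value is available from any t-distribution survival function. A Python sketch (using scipy.stats; the function name `p_value` is ours, not Excel's):

```python
from scipy import stats

def p_value(t_obs: float, df: int, tails: int = 2) -> float:
    """Equivalent of Excel's TDIST(t_obs, df, tails) for t_obs >= 0."""
    return tails * stats.t.sf(t_obs, df)   # sf() is the upper-tail probability

print(f"{p_value(1.9243, 4, 2):.4f}")  # two-tailed P-value for example 5.1, ~0.1266
```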
Step 5: State the conclusion
After we decide whether to accept H0 or H1, we must state our conclusion in a manner that is
accurate and yet can be understood by anyone who does not have a background in statistics.
Essentially, we must translate our conclusions from “statistic-ese” (e.g., “reject H0, accept H1”) to
normal language. We should give both our conclusion and the confidence level, even though the
confidence level is most properly understood in a statistics framework.
For example 5.1, we accepted H0; we couldn’t prove H1. In other words, our conclusion would
be:
We cannot prove with 95% confidence that the nicotine level in the cigarettes is different
than 14 mg/cigarette.
This statement sounds like poor English (basically a double negative), but the wording was very
carefully chosen. We begin with the assumption that the cigarettes have 14 mg of nicotine, and
we fail to prove otherwise. This is similar to a jury returning a verdict of “Not Guilty” in a
criminal trial. Notice that the verdict is not that the defendant was innocent, simply that guilt was
not proven beyond a “reasonable doubt.” In hypothesis testing, the level of “reasonable doubt” is
determined when the confidence level is set.
Examples
Let’s try another two-tailed test. This test is similar in nature to example 5.1.
Example 5.2
A certain analytical procedure is being tested for the presence of measurement bias. Twenty
measurements are made on a solution whose concentration has been certified at 1.000 µM.
The sample mean is 1.010 µM, with an RSD of 5.0% for the individual measurements. Is
there any evidence of measurement bias?
First let's set up the null and alternate hypotheses:

H0: µx = 1.000 µM     There is no bias in the measurements.
H1: µx ≠ 1.000 µM     Bias exists; two-tailed test.

ξx = 1.000 µM                       certified concentration
x_bar = 1.010 µM
RSD = 5.0 %
s_x = RSD · x_bar                   s_x = 0.0505 µM
std_err = s_x / sqrt(19)            std_err = 0.0116 µM

Let's use the studentized mean as the test statistic, and calculate the observed test statistic:

T_obs = (x_bar − ξx) / std_err      T_obs = 0.8631     sample mean is this many std devs from the true value
P_obs = 0.3988                      two-tailed P-value of the observed test statistic
Now we look up the critical values from the t-tables. For 19 degrees of freedom, a 95% confidence
level and a two-tailed test, the critical values are -2.0930 and +2.0930. Since the observed value of
the test statistic is within the acceptance region, we must accept the null hypothesis.
Thus, we cannot prove bias in these measurements at the 95% confidence level.
Note: from the observed P-value for this example, we see that we can only prove H1 with 60.12%
confidence, at best.
Now let’s try a one-tailed test.
Example 5.3
It is suspected that a series of blood alcohol tests proves that the alcohol level is
above the legal limit of 0.10%. The measurements are:
0.106 0.118 0.097 0.127 0.134 0.141
Do these measurements prove legal intoxication with 95% confidence?
As always, the first step is to set up the null and alternate hypotheses. In this case, we should use
the following:
null:        H0: µx = 0.10 %     “blood alcohol level at the legal limit (assuming no bias)”
alternate:   H1: µx > 0.10 %     “blood alcohol level above the legal limit”
It may be a little difficult to see why the null hypothesis should be that the blood alcohol level is
exactly 0.10 %. In setting up the hypotheses, it is best to always ask yourself, what is it that I
want to test? What are the possible conclusions? The answers to these questions determine the
form of the alternate hypothesis; the null hypothesis will follow.
For this example, we want to test whether or not the alcohol level is above the legal limit.
Remember that the purpose of the statistical test procedure is actually to test the alternate
hypothesis, so that we would propose as the alternate hypothesis that the alcohol level is too
high. The nature of the testing procedure is such that we either prove or fail to prove this
hypothesis; i.e., our conclusion will be either that we can prove that the alcohol level is too high (a
“guilty” verdict) or that we cannot prove an excessive alcohol level (“not guilty”). These
conclusions are proper for our intentions in this example. Since we propose that µx > 0.10 % is
our alternate hypothesis statement, the corresponding null hypothesis is µx = 0.10 %.
The other thing to notice about the form of H1 in this example is that it results in a one-tailed test.
This will affect the critical values (and the P-value, if we calculate it). Let’s continue with our
testing procedure. We can proceed by calculating the observed test statistic.
x = (0.106 0.118 0.097 0.127 0.134 0.141) %
x_bar = mean(x)                     x_bar = 0.1205 %
std_err = stdev(x) / sqrt(6)        std_err = 0.0069 %

Let's calculate the observed test statistic:

T_obs = (x_bar − 0.10 %) / std_err  T_obs = 2.9865     studentized measurement mean
P_obs = 0.0153                      Probability of seeing a value larger than T_obs is 1.53%.
The P-value is standard output for many statistical programs. In this case, the one-tailed P-value
is 1.53%, which means that we could prove H1 at the 98.47% confidence level, if we desired;
certainly at the 95% level we may reject H0 and accept H1. However, it is difficult to use t-tables
to calculate Pobs, so we will confirm this decision using the critical value approach.
For a one-tailed test, there is only a single critical value, as shown in the next figure.
[Figure: top, the null t-distribution with 95% of its area to the left of the single critical
value; bottom, the decision regions for H0: µ = k versus H1: µ > k, with H0 accepted below the
critical value and H1 accepted above it.]
Figure 5.2: An example of a one-tailed test. There is only a single critical value. The top
figure shows the null distribution. The critical value is chosen such that the area under
the curve to the left of the critical value equals the confidence level (95% for
this example). The lower figure shows the decision process: if the observed test statistic is
larger than the critical value, Tobs > Tcrit, then the null hypothesis is rejected and the
alternate hypothesis is proven.
Recall that the null distribution is the probability distribution of the test statistic, T, assuming that
H0 is true. As the upper figure shows, we must choose the critical value such that, for the null
distribution,
P(Tobs < Tcrit) = CL
where “CL” is the chosen confidence level. For our example, we have chosen a confidence level
of 95%. We can determine the critical value from the t-tables:
T_crit = t_ν,α = t_5,0.05     one-tailed critical value
where ν is the appropriate degrees of freedom, and α is the area in the right tail of the t
distribution. We determine the value of α from the confidence level: CL = (1 − α) · 100%.
For our example, the t-tables tell us that the critical value is Tcrit = 2.0150. If you recall, the
observed value of the test statistic was 2.9865; since this is larger than the critical value, we
reject the null hypothesis and accept the alternate hypothesis. Our conclusion is:
Assuming no measurement bias, the data show that the blood alcohol level is above
the legal limit (at the 95% confidence level).
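Example 5.3 can be checked end to end with a few lines of Python (a sketch; `alternative='greater'` requests the one-tailed test):

```python
from scipy import stats

x = [0.106, 0.118, 0.097, 0.127, 0.134, 0.141]  # blood alcohol levels, %
t_obs, p = stats.ttest_1samp(x, popmean=0.10, alternative='greater')
t_crit = stats.t.ppf(0.95, df=len(x) - 1)       # one-tailed critical value, ~2.015

print(f"T = {t_obs:.4f}, one-tailed P = {p:.4f}, T_crit = {t_crit:.4f}")
# H0 is rejected at the 95% level because t_obs > t_crit (equivalently, p < 0.05)
```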
Errors in Hypothesis Testing
Introduction
Since they involve random variables, there is always an element of uncertainty in hypothesis
tests. Specifically, there is always a chance that the conclusion of a test is in error. This
uncertainty is the reason that you must specify a confidence level when you perform statistical
tests. Choosing the confidence level allows you to determine the degree of the uncertainty in
your test: basically, you can control the likelihood that your conclusion is correct. As we will see,
the confidence level also indirectly determines the ability of the statistical test to detect and label
small differences as “significant.”
How can the conclusion from a hypothesis test be in error? For tests with a single null
hypothesis, H0, and a single alternate hypothesis, H1, then the following table shows all the
possibilities:
                         decision:
                         accept H0                accept H1
reality                  (“negative” result)      (“positive” result)

H1 is not true:          correct                  false positive
H1 is true:              false negative           correct
Let’s illustrate with an example. Say someone undergoes a pregnancy test. The reality of the
matter is that the person is either pregnant or she isn’t. The test will either decide in favor
of pregnancy (called a “positive” test result) or will decide that the subject is not pregnant (a
“negative” result).
We can draw an analogy to statistical hypothesis tests. We begin with the assumption (the null
hypothesis) that the subject is “not pregnant.” The alternate hypothesis, the one we want to test,
is that the subject is pregnant. A conclusion in favor of pregnancy (H1 is accepted) is considered a
positive test result; however, if the subject actually is not pregnant (H0 is actually true), then our
conclusion is in error. This situation (an incorrect acceptance of H1) is called a false positive.
On the other hand, if the conclusion of the test is that the subject is not pregnant (H0 is accepted),
and this conclusion is in error (H1 is actually true), then the test gives a false negative.
In the remainder of this section, we will describe how to calculate the probability that the result
of a hypothesis test is in error (either a false positive or false negative).
False Positive Errors
All of the hypothesis tests presented so far in this chapter have been of the following type: the
null hypothesis is
H0: µx = k      (the true measurement mean is some fixed value, k)
while the alternate hypothesis is one of the following:
H1: µx ≠ k      (the true measurement mean is not the fixed value k; a two-tailed test)
H1: µx > k      (the true measurement mean is larger than the fixed value k; a one-tailed test)
H1: µx < k      (the true measurement mean is smaller than the fixed value k; a one-tailed test)
The decision criterion of the test is the following: if the observed test statistic, Tobs, is outside of
the interval defined by the critical value(s), then we reject H0 and accept H1. A false positive
occurs when Tobs is outside the H0 acceptance region when, in fact, H0 is true. The probability of a
false positive is controlled by choosing the appropriate confidence level in a statistical test. To
be exact,
CL = 1 − α
where CL is the chosen confidence level, and α is the probability of a false positive. In other
words, when testing at the 90% confidence level, there is a 10% chance of falsely accepting H1.
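The claim that testing at the 90% confidence level carries a 10% false positive rate can be checked by simulation. A sketch, assuming a one-tailed t test on six measurements drawn under a true null hypothesis (the sample size and random seed are illustrative choices, not from the text):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)
n, alpha = 6, 0.10                      # 90% confidence level -> alpha = 0.10
t_crit = t.ppf(1 - alpha, n - 1)        # one-tailed critical value

# Simulate many experiments in which H0 is true (true mean = 0)
reps = 100_000
x = rng.normal(0.0, 1.0, size=(reps, n))
t_obs = x.mean(axis=1) / (x.std(axis=1, ddof=1) / np.sqrt(n))

# Fraction of experiments that (wrongly) reject H0 -- should be close to alpha
false_positive_rate = np.mean(t_obs > t_crit)
print(f"simulated false positive rate: {false_positive_rate:.3f}")
```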
Let’s imagine that we are comparing a mean value, µx, to a fixed value k. Unknown to us, the
null hypothesis is actually true. The following figure shows the null distribution of the test
statistic, i.e., the probability distribution of the test statistic when the null hypothesis is actually
true.
[Figure: plot of the null distribution vs. the test statistic, with the two critical values marked and an area of α/2 shaded in each tail]
Figure 5.3: choosing the critical values for a two-tailed test. If Tobs occurs between the
critical values, then the null hypothesis is accepted; if not, then H1 is accepted. The
shaded area in both tails is the probability of a false positive: it is the probability that Tobs
does not fall between the critical values, even though it “should,” since H0 is true.
Now we can see how the critical values are chosen for two-tailed tests: each tail must contain an
area of α/2, so that the total probability of a false positive is α, the desired value.
Now let’s consider the probability of false positive error for a one-tailed test. In such a test, there
is only a single critical value. Let’s imagine that we are testing for values that are greater than a
fixed value, k; in other words, our alternate hypothesis is H1: µx > k. The next figure shows the
null distribution, together with the critical value and the probability of false positive.
[Figure: plot of the null distribution vs. the test statistic for a one-tailed test, with a single critical value and a shaded right-tail area of α]
Figure 5.4: choosing the critical value for a one-tailed test. If Tobs is less than the critical
value, then the null hypothesis is accepted; if not, then H1 is accepted. The shaded area in
the single tail is the probability of a false positive. Note that the critical value was chosen
such that the probability of false positive, α, is the same as in figure 5.3.
To summarize, we set the probability of false positive error when we choose the confidence
level. We must then choose the critical values according to our desired value of α. This means
that, for a two-tailed test, the area in each tail of the null distribution must be α/2; for a
one-tailed test, the area in the single tail (since there is only one critical value) will be α.
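The rule above (α/2 in each tail for a two-tailed test, all of α in one tail for a one-tailed test) can be sketched with scipy; the choices ν = 5 and α = 0.05 are illustrative:

```python
from scipy.stats import t

alpha, nu = 0.05, 5

# Two-tailed test: each tail holds alpha/2, so the critical values are +/- t_{nu, alpha/2}
t_two = t.ppf(1 - alpha / 2, nu)        # ~2.5706

# One-tailed test: the single tail holds all of alpha
t_one = t.ppf(1 - alpha, nu)            # ~2.0150

# For the same alpha, the two-tailed critical value sits further out
print(f"two-tailed: +/-{t_two:.4f}, one-tailed: {t_one:.4f}")
```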
False Negative Errors
A false negative occurs when we incorrectly accept H0 when we should actually reject H0 and
accept H1. In other words, the alternate hypothesis is actually true, but the test statistic still falls
within the critical region (so that the null hypothesis is accepted). The next figure shows the
probability distribution of the test statistic when the alternate hypothesis is true.
[Figure: plot of the distribution of the test statistic when H1 is true, divided into “accept H0” and “accept H1” regions at the critical value, with the area β (the probability of a false negative) shaded below the critical value]
Figure 5.5: This figure shows the probability distribution (not the null distribution) of the
test statistic in a situation when the alternate hypothesis is actually true (in this case, µx >
k). However, if the test statistic is less than the critical value, shown in the figure, then the
null hypothesis will be accepted: this would be a false negative error. The shaded area
shows the probability, β, of this occurring.
As we see in the figure, even when the alternate hypothesis is true, there is some chance (β) that
the test statistic will be less than the critical value. This chance is the probability of false
negative error, β.
In order to calculate β, we must know the value of the population parameter, µx. We can always
calculate the value of β for some hypothetical situation in which we postulate a value for the
population parameter. This type of exercise would give us some idea of how “sensitive” our
testing procedure is to situations in which the alternate hypothesis is false. The next example
illustrates this point.
Example 5.4
You wish to develop a procedure to test for bias in the analysis of fluoride in water. During
the analytical procedure, three independent measurements are obtained on a sample, and
averaged to determine the fluoride concentration. The standard solution to be used in the test
is known to contain 0.45 w/w% F, and the RSD of the entire analytical procedure is known to
be 0.10 (i.e., 10% RSD for the average of the three measurements).
(a) What are the critical values that can be used to determine if there is bias in a
measurement?
(b) What values of population measurement mean, µx, would result in a 90% probability that
bias will be detected? In other words, what bias would result in acceptance (with 90%
probability) of the alternate hypothesis in part (a)?
The true fluoride concentration, ξx, of the standard solution is 0.45 w/w%. The analytical
procedure in this situation consists of obtaining three measurements and averaging them to
obtain a point estimate of the fluoride concentration. We can calculate the standard error of the
mean of three measurements:
σoverall = RSD × ξx = (0.10)(0.45 w/w%) = 0.0450 w/w%
the true standard error (a population parameter) is known
The null and alternate hypotheses will be
H0: µx = ξx
there is no measurement bias
H1: µx ≠ ξx
measurement bias exists (two-tailed test)
One thing is different about this hypothesis test, compared to all the others we have done: the
true (i.e., population) standard deviation of the mean, σ(x̄n), is known. Thus, the test statistic will
be the standardized difference between the mean of three measurements and the true concentration
of the solution:
test statistic
T = (x̄3 − ξx) / σ(x̄3)
where x̄3 is the mean of 3 measurements. Assuming that x̄3 is distributed normally, T will follow
a normal distribution with a standard deviation of one. The null distribution, which assumes that
µx = ξx, will follow a z-distribution (i.e., a standard normal distribution).
Let’s set our confidence level at 99%; in other words, we are limiting the probability of false
positives to 1%: α = 0.01. Now we can find the critical values. From the z-tables, we see that
z0.005 ≈ 2.575 (you should verify this; the actual value is 2.5758, as reported by Excel). Our
decision rules for this hypothesis test are:
• if −2.5758 < Tobs < 2.5758, then accept H0. We cannot prove measurement bias with 99%
confidence.
• if Tobs < −2.5758 or Tobs > 2.5758, then reject H0 and accept H1. We can prove bias with 99%
confidence.
In this instance, it is useful to note that there is an equivalent way of stating these decision rules:
if the observed measurement mean, x 3 , is more than 2.5758 standard errors from the true
concentration, ξx, then we have evidence of bias.
critlower = ξx − zcrit·σoverall = 0.3341 w/w%
critupper = ξx + zcrit·σoverall = 0.5659 w/w%
Alternate decision rules:
• if 0.3341 w/w% < x̄3 < 0.5659 w/w%, then we must accept H0
• if x̄3 < 0.3341 w/w% or x̄3 > 0.5659 w/w%, then we reject H0 and accept H1 at the 99% confidence
level
You should realize that these rules are no different from the first ones; they would result in
exactly the same conclusion for a given set of data. These rules just give another way of looking
at the hypothesis test process.
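The alternate decision rules can be computed directly. A sketch using the example’s numbers (ξx = 0.45 w/w%, σoverall = 0.045 w/w%, 99% confidence level):

```python
from scipy.stats import norm

xi = 0.45          # true concentration, w/w%
se = 0.045         # standard error of the mean of three measurements, w/w%
alpha = 0.01       # 99% confidence level, two-tailed

z_crit = norm.ppf(1 - alpha / 2)        # ~2.5758

# Acceptance region for the measurement mean, in concentration units
crit_lower = xi - z_crit * se           # ~0.3341 w/w%
crit_upper = xi + z_crit * se           # ~0.5659 w/w%
print(f"accept H0 if {crit_lower:.4f} < mean < {crit_upper:.4f} w/w%")
```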
Now let’s look at part (b). We want to find the measurement population mean, µx, that would
result in a 90% chance that measurement bias would be detected. Let’s imagine that there is
actually a certain amount of positive bias in the measurements. The probability that the bias will
actually be detected is the area under the probability distribution curve that is greater than the
upper critical value. In other words, if we want to find the minimum amount of positive bias that
will be detected with a 90% probability, we need to find the measurement mean, µx, that satisfies:
P(x 3 > 0.5659 w/w%) = 0.90
This situation is shown in the following figure.
[Figure: probability distribution of the measurement mean (w/w%), with the “accept H0” region between the two critical values and a shaded area β = 0.10 below the upper critical value]
Figure 5.6: The critical values associated with the decision rules for two-tailed bias
detection at the 99% confidence level are represented by the dashed vertical lines. The
probability distribution describes the mean of three positively biased measurements, and
results in β = 0.10; in other words, for measurements described by this distribution, there
is a 10% chance of a false negative result in bias testing at the 99% confidence level.
From the z-tables, we know that z0.90 = −1.2816 gives a right-tailed area of 0.90. We must solve
for µx in the following expression:
(xcrit − µx) / σ(x̄3) = −1.2816
where xcrit is the upper critical value for the testing procedure, and σ(x̄3) is the standard error of
the mean of three measurements. Solving for µx gives
µx = xcrit + 1.2816·σ(x̄3)
This is the mean of the probability distribution shown in the figure. Substituting 0.5659 w/w% for
the critical value, and a standard error of 0.0450 w/w%, gives µx = 0.6236 w/w%. This corresponds
to a bias, γx, of
γx = µx − ξx = 0.6236 w/w% − 0.45 w/w% = 0.1736 w/w%
If you repeat this procedure to find the negative bias that gives β = 0.10, you will find that a bias
of γx = −0.1736 w/w% will give the desired false negative probability value.
In other words, our calculations tell us that when testing for bias at the 99% confidence level
under these conditions, we have a 90% chance of detecting bias of 0.1736 w/w%. This is useful
information. If, for example, the “sensitivity” of our hypothesis test for bias detection is
unacceptable, then we have two options: lower our confidence level from 99% (which would
decrease our critical values) or average more measurements to decrease our standard error. We
could also try to improve the precision of our method, so that the standard deviation of the
individual measurements is smaller.
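The part (b) calculation — the smallest positive bias detected with 90% probability — can be sketched as follows, using the same numbers as the example:

```python
from scipy.stats import norm

xi = 0.45          # true concentration, w/w%
se = 0.045         # standard error of the mean of three measurements, w/w%

# Upper critical value for two-tailed bias detection at the 99% confidence level
crit_upper = xi + norm.ppf(0.995) * se          # ~0.5659 w/w%

# Find mu such that P(mean of 3 > crit_upper) = 0.90:
# (crit_upper - mu) / se = z_0.90 = -1.2816, so mu = crit_upper + 1.2816 * se
mu = crit_upper - norm.ppf(0.10) * se           # ~0.6236 w/w%
bias = mu - xi                                  # ~0.1736 w/w%
print(f"minimum bias detected with 90% probability: {bias:.4f} w/w%")
```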
Summary: Choosing the Confidence Level
Choosing the confidence level directly determines the critical values and the value of α, the
probability of a false positive error. Let’s consider a two-tailed test:
H0: µx = k
H1: µx ≠ k
for which there are two critical values, one on either side of zero.
Choosing a larger confidence level will cause the critical values to move further apart. True,
this means that there is less chance of a false positive error; however, the power of the test to
detect small differences between µx and k has been decreased. In other words, there is a greater
chance of a false negative error (i.e., β has increased).
Thus there is always a compromise to consider in choosing the confidence level; values of 95%
and 99% are very common. The value chosen may depend on the potential consequences of
errors. Consider the following situations:
• in employee drug testing, no employer wants to deal with false accusations. In such a situation,
a high confidence level (99% or even higher) might be appropriate, because the consequences
of a false positive (wrongly accusing an employee of taking drugs) are perceived to be more
severe than missing the borderline cases.
• in screening patients for HIV, the consequences of a false negative (incorrectly concluding that
the patient is not infected) are very severe. In this case, the confidence level might be set
relatively low. To be sure, there will be an increase in false positives, but a separate,
independent test can be performed on these patients.
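The trade-off can be made concrete by recomputing β from Example 5.4 at two confidence levels. A sketch, assuming the same bias (0.1736 w/w%) and standard error (0.045 w/w%) as the example:

```python
from scipy.stats import norm

xi, se = 0.45, 0.045       # w/w%, from Example 5.4
mu = xi + 0.1736           # biased measurement mean, w/w%

def beta(cl):
    # Probability of a false negative: the (biased) measurement mean still
    # lands below the upper critical value of the two-tailed test.
    # (The chance of falling below the LOWER critical value is negligible here.)
    alpha = 1 - cl
    crit_upper = xi + norm.ppf(1 - alpha / 2) * se
    return norm.cdf((crit_upper - mu) / se)

# Raising the confidence level lowers alpha but raises beta
print(f"beta at 95% CL: {beta(0.95):.3f}")   # ~0.03
print(f"beta at 99% CL: {beta(0.99):.3f}")   # ~0.10
```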
Chapter Checkpoint
The following terms/concepts were introduced in this chapter:
acceptance region
alternate hypothesis
critical value
false positive
false negative
hypothesis test
null hypothesis
null distribution
one-tailed test
P-value
rejection region
significance level
significance test
statistical hypothesis
statistical significance
test statistic
two-tailed test
In addition to being able to understand and use these terms, after mastering this chapter, you
should
• use formal hypothesis testing procedures to determine if there is a significant difference
between a normally-distributed random variable and a fixed value, using either a one- or
two-tailed test
• interpret P-values from a hypothesis test
• explain trade-offs in choosing a confidence level