Hypotheses and Sample Size for Showing Equivalence
Jim Ashton, SAS Institute Inc., Cary, NC
ABSTRACT

If you are conducting an equivalence trial, there is an approach within the Neyman-Pearson theory of hypothesis testing that reformulates the hypotheses, with equivalence of treatments as the alternative rather than the null hypothesis. This approach swaps the roles of the type I and type II errors. It also means you can explicitly control the probability of making the more serious error of finding no difference in treatments when in fact the standard is superior.
This paper investigates the properties of this approach. Sample sizes are calculated for both forms of tests. The relative efficiency of one form to the other depends on the specific assumptions made. In the appropriate setting, the sample size requirements for the equivalence approach can be substantially smaller.

All sample size calculations and graphics were done using SAS/IML software.
KEYWORDS

equivalence tests, hypothesis tests, proportions, sample size, SAS/IML software, type I error, type II error
INTRODUCTION

Suppose that both a standard treatment and an experimental treatment are available for treating a disease. A clinical trial designed to determine whether the experimental treatment is as effective as, but not necessarily better than, the standard is referred to as an equivalence trial. A typical setting would have the standard be a severe treatment (for example, radiation treatments) and the experimental a therapy with fewer side effects. It is hoped that the experimental is as effective as the standard. In this setting, the standard treatment should be a highly effective therapy, or else there is little value in finding another treatment equally effective.
THE CONVENTIONAL APPROACHES
One approach you can take is to state the problem as a two-sided test of hypotheses, with the null hypothesis being that the two treatments are equally effective. For example, let π_s and π_e be the true success rates for the standard and experimental treatments, respectively, and p_s and p_e be the corresponding sample proportions. You can then state the hypotheses as

   H: π_s = π_e   versus   A: π_s ≠ π_e

Table 1 classifies the possible decisions and the corresponding types of errors you can make with this test. Let δ represent the difference between the true treatment success rates, that is, δ = π_s − π_e. You make a type I error, rejecting the null hypothesis when it is true, if you erroneously declare the standard to be superior. You make a type II error, not rejecting the null hypothesis when it is false, if you find no difference in treatments when the standard is in fact superior.

   Table 1 Classification of Possible Decisions

                          δ = 0           δ > 0
   fail to reject H       correct         type II error
   reject H               type I error    correct

You collect your data and observe the sample proportions for each treatment. If the experimental treatment's success rate differs sufficiently from the standard's in either direction, you reject the null hypothesis and conclude that the treatments are not equally effective.

Most clinical trials comparing two treatments are conducted to determine if one treatment is significantly different from the other. The traditional approach to this problem tests the null hypothesis that the success rates for the two treatments are equal against a two-sided alternative that they are not equal. However, equivalence trials are conducted with the intent of showing that two treatments are equally effective, that is, showing that an experimental treatment is as good as, but not necessarily better than, a standard treatment. This approach is inconsistent with the intent of an equivalence trial. In an equivalence trial, you are interested in a difference in a single direction only, that is, when the experimental treatment happens to be inferior.

A second approach that is consistent with the intent of an equivalence trial is to state the problem as a one-sided test. In this case, you can state the hypotheses as

   H: π_s ≤ π_e   versus   A: π_s > π_e

The usual test statistic is the normal approximation to the binomial given by

   z = (p_s − p_e) / se,   where   se = sqrt( p_s(1−p_s)/n_s + p_e(1−p_e)/n_e )

You reject the null hypothesis of equality of treatments when the test statistic gets too large, that is, larger than your reference value (usually z_{1−α}). See Blackwelder (1982), Donner (1984), and Makuch and Simon (1978).

Figure 1 shows how the sample space is divided into two parts, above and below the line π_s = π_e. For points (π_e, π_s) above the line, the standard treatment is superior, while for points below the line, the experimental treatment is superior.

[Figure 1 Sample Space for Conventional Hypotheses: the unit square of (π_e, π_s) points divided by the line π_s = π_e]
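As a small illustration of this test (this sketch and its counts are hypothetical, not from the original paper), the computation can be carried out in SAS/IML as follows:

   proc iml;
      /* One-sided z test of H: pi_s <= pi_e versus A: pi_s > pi_e          */
      /* by the normal approximation given above.                           */
      x_s = 170;  n_s = 200;        /* successes and group size, standard     */
      x_e = 158;  n_e = 200;        /* successes and group size, experimental */
      p_s = x_s / n_s;
      p_e = x_e / n_e;
      se  = sqrt( p_s#(1-p_s)/n_s + p_e#(1-p_e)/n_e );
      z   = (p_s - p_e) / se;
      reject = ( z > probit(0.95) );   /* reference value z_{1-alpha}, alpha=.05 */
      print z reject;
   quit;

Here reject is 1 when the test declares the standard significantly better at α=.05.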
The more serious error in an equivalence trial is a type II error, calling the treatments equivalent when the standard is superior. Figure 2 shows the probability of making a type II error. For a fixed level of α, the power of this test depends on the sample size n, π_s, and π_e, and must be calculated for each possible value of δ. Note that the probability of making a type II error decreases as δ increases.
[Figure 2 Probability of a Type II Error]
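To make the dependence on δ concrete, here is a small SAS/IML sketch (added in this revision, not from the paper; the values of n and π_s are arbitrary) that evaluates the type II error rate of the conventional one-sided test over a grid of true differences:

   proc iml;
      /* Type II error of the one-sided test H: pi_s <= pi_e at alpha=.05:  */
      /* beta(delta) = PHI( z_{1-alpha} - delta/se ), normal approximation. */
      pi_s  = 0.90;   n = 100;               /* illustrative values      */
      delta = do(0.02, 0.20, 0.02);          /* grid of true differences */
      pi_e  = pi_s - delta;
      se    = sqrt( (pi_s#(1-pi_s) + pi_e#(1-pi_e)) / n );
      beta  = probnorm( probit(0.95) - delta/se );
      print (delta` || beta`)[colname={"delta" "beta"}];
   quit;

The printed β values fall steadily as δ grows, which is the behavior Figure 2 depicts.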
The conventional approach has two major problems. First, a nonsignificant test can be difficult to interpret. For example, you may have concerns over sufficient sample size. An inherent problem with this approach, pointed out by Blackwelder (1982), is that it is easier to fail to find a significant difference with a small study than with a large one. Statisticians agree that the null hypothesis cannot be proved and that failure to reject the null hypothesis cannot be interpreted as permission to accept it. Again, as Blackwelder puts it, the p-value is a measure of evidence against H, not for it. Insufficient evidence to reject H does not imply sufficient evidence to accept it. Second, the more serious error in this setting is the type II error. The conventional approach does not, per se, take the type II error rate into account. The basic problem is that the type II error rate must be calculated for each possible difference in treatments which is of interest. You cannot make a global statement about the type II error rate as you can for the type I error rate.
NULL HYPOTHESIS OF A SPECIFIED DIFFERENCE

Now consider a situation where you have a standard therapy having a success rate of 80% and you want to determine whether the success rate for an experimental therapy is within .10 of the standard. The conventional approach formulates the null hypothesis of equality of success rates, π_s = π_e. You determine the power of the test, 1−β, for each specific alternative, that is, for each value of π_s − π_e = δ, where δ is considered to be a clinically significant observable difference.
The point here is that even with the conventional approach, you must sooner or later specify a value for δ. In an equivalence trial, you select δ as the minimal difference such that, if the treatments truly differed by at least δ, you would consider them to be different. The appropriate null hypothesis is that the treatments differ by at least δ. The alternative, then, is that the success rates do not differ by as much as δ; that is, the treatments are what may be called δ-equivalent. This trick of making the alternative hypothesis the one of equivalence gives you control over the error rate you want to control. The null and alternative hypotheses are

   H′: π_s ≥ π_e + δ   versus   A′: π_s < π_e + δ

The statistic for testing this hypothesis is

   z′ = (p_s − p_e − δ) / se

For this test, you reject the null hypothesis if the test statistic gets too small, that is, smaller than your reference value (usually −z_{1−α}). Dunnett and Gent (1977) give alternative test statistics for testing H′ and discuss their properties.

Figure 3 shows how the sample space is divided when you use a test of a specified difference. The sample space is divided along the line π_s = π_e + δ. The treatments differ by an amount equal to or greater than δ for points in H′, and they differ by an amount less than δ for points in A′.

[Figure 3 Sample Space under Hypothesis of a Specified Difference: the unit square of (π_e, π_s) points divided along the line π_s = π_e + δ, with H′ above the line and A′ below it]

Table 2 gives the possible decisions and the corresponding errors you can make with the test of a specified difference.

   Table 2 Classification of Possible Decisions

                          δ = 0           δ > 0
   reject H′              correct         type I error
   fail to reject H′      type II error   correct

Here the roles of the type I and type II errors are reversed from the previous setting. A type I error occurs when you reject H′ and conclude that the treatments are equivalent when in fact the standard is superior. A type II error occurs when you do not reject H′ and erroneously conclude that the standard is superior. As stated before, in an equivalence trial, the more serious error is claiming the treatments to be equivalent when the experimental treatment is inferior. Within this setting, you can control the type I error rate explicitly. When you select α, you know that the probability of making a type I error is less than α.
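A matching SAS/IML sketch for the specified-difference test (again with hypothetical data and illustrative variable names, not code from the paper):

   proc iml;
      /* Test of a specified difference: reject H' when z' < -z_{1-alpha},  */
      /* in which case the treatments are declared delta-equivalent.        */
      p_s = 0.86;  n_s = 160;          /* observed proportions (illustrative) */
      p_e = 0.84;  n_e = 160;
      delta = 0.10;                    /* difference to be ruled out          */
      se = sqrt( p_s#(1-p_s)/n_s + p_e#(1-p_e)/n_e );
      zp = (p_s - p_e - delta) / se;
      rejectHp = ( zp < -probit(0.95) );   /* alpha = .05 */
      print zp rejectHp;
   quit;

With these numbers zp is about -2.0, so H′ is rejected and the treatments are declared δ-equivalent.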
SAMPLE SIZES
You are probably familiar with the formula for estimating the required sample size for the conventional hypothesis. The simplest version is

   n = (z_{1−α} + z_{1−β})² [π_s(1−π_s) + π_e(1−π_e)] / (π_s − π_e)²

The corresponding formula for the test of a specified difference was presented by Makuch and Simon (1978). It differs from the conventional formula only by a term δ in the denominator, which becomes (π_s − π_e − δ)².
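For illustration, both formulas can be evaluated with a few lines of SAS/IML. This sketch is not from the paper, and the module name nreq is mine; setting delta=0 recovers the conventional formula:

   proc iml;
      /* Required sample size per group by the normal approximation.        */
      /* delta = 0 gives the conventional n; delta > 0 gives n' for the     */
      /* test of a specified difference (Makuch and Simon 1978).            */
      start nreq(pi_s, pi_e, delta, alpha, beta);
         z = probit(1-alpha) + probit(1-beta);    /* z_{1-a} + z_{1-b} */
         v = pi_s#(1-pi_s) + pi_e#(1-pi_e);       /* variance term     */
         return( ceil( z##2 # v / (pi_s - pi_e - delta)##2 ) );
      finish;

      n  = nreq(0.90, 0.80, 0,    0.05, 0.10);  /* conventional test:        215 */
      np = nreq(0.90, 0.90, 0.10, 0.05, 0.10);  /* specified difference, n': 155 */
      print n np;
   quit;

These two calls reproduce the first row of Table 3 below.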
30"
m
15"
Although there are better fonnulas lor determining sample sizes
when comparing two proportions. lor simplicity 01 calculations and
ease of oornparability the normal approximations are used. See the
excellent treafrnent of this subject by casagrande and Pike (1978).
You usually calculate n' by setting '" ~ "e and treating 6 as the difference In treatment efficacy that you want to rule out wHh probability
(I-P). as suggested by Donner (1964). Makuch and Simon recommend setting p-.l0.
,to
Table 3 presents sample size calculations for varying values of π_s and π_e, determined with α=.05 and β=.10. The data are a subset taken from the data used in the graphs in Figures 4 through 7. The relationship between the null and alternative hypotheses for the two tests creates a symmetry; the hypothesis of equivalence is the null for the conventional test and the alternative for the test of a specified difference. Because of this symmetry, the α for one test is the β for the other and vice versa. The first three columns give the values for π_s, π_e, and n under the conventional hypothesis. The next four columns give the values for π_s, π_e, δ, and n′ under the hypothesis of a specified difference δ. The last column gives the ratio of n to n′, a measure of the relative efficiency of the two methods. When the ratio is greater than unity, the equivalence approach is more efficient. When the ratio is less than unity, the conventional approach is more efficient.
   Table 3 Sample Size Calculations for α=.05 and β=.10

   Conventional test     Test of a specified difference
   π_s    π_e     n      π_s    π_e    δ      n′       n/n′
   .90    .80    215     .90    .90    .10    155      1.39
   .90    .75    106     .90    .90    .15     69      1.54
   .90    .70     65     .90    .90    .20     39      1.67
   .90    .65     44     .90    .90    .25     25      1.76
   .80    .70    317     .80    .80    .10    275      1.15
   .80    .65    148     .80    .80    .15    122      1.21
   .80    .60     86     .80    .80    .20     69      1.25
   .80    .55     56     .80    .80    .25     44      1.27
   .60    .50    420     .60    .60    .10    412      1.02
   .60    .45    186     .60    .60    .15    183      1.02
   .60    .40    103     .60    .60    .20    103      1.00
   .60    .35     65     .60    .60    .25     66      0.98
   .40    .30    386     .40    .40    .10    412      0.94
   .40    .25    163     .40    .40    .15    183      0.89
   .40    .20     86     .40    .40    .20    103      0.83
   .40    .15     51     .40    .40    .25     66      0.77

Figures 4 through 7 graphically present these sample size calculations. As you can see, the advantage n′ has over n diminishes as the success rate of the standard treatment decreases. When the standard treatment has a success rate of 90%, the advantage of n′ is apparent. For a success rate of 60% for the standard, the formulas give similar results. When the success rate for the standard drops to 40%, n′ is less efficient. Makuch and Simon found that the sample size for testing H′ is less than that for testing H when π_s is greater than (1+δ)/2. When π_s is less than (1+δ)/2, the sample size for testing H′ is greater than that for testing H. You can then determine which approach is more economical for a particular situation.

[Figure 4 Standard Treatment of π_s=.90: required sample size versus difference to detect (.10 to .25) for tests H and H′]
[Figure 5 Standard Treatment of π_s=.80: required sample size versus difference to detect for H and H′]
[Figure 6 Standard Treatment of π_s=.60: required sample size versus difference to detect for H and H′]
[Figure 7 Standard Treatment of π_s=.40: required sample size versus difference to detect for H and H′]
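A short algebraic check of the Makuch and Simon crossover (this derivation is added here for completeness and is not in the original paper): under the usual assumptions (π_e = π_s − δ for the conventional test, π_e = π_s for the test of a specified difference, and the same α and β), both formulas share the factor (z_{1−α} + z_{1−β})² and the denominator δ², so comparing n with n′ reduces to comparing the variance terms

   n  ∝ π_s(1−π_s) + (π_s − δ)(1 − π_s + δ)
   n′ ∝ 2 π_s(1−π_s)

Expanding the second term of n gives

   (π_s − δ)(1 − π_s + δ) = π_s(1−π_s) + δ(2π_s − 1 − δ)

so n exceeds n′ exactly when δ(2π_s − 1 − δ) > 0, that is, when π_s > (1+δ)/2.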
Table 4 and Figure 8 show what happens to sample size requirements using n′ when you assume π_s ≠ π_e. The sample sizes in Table 4 are a subset of the data used to create Figure 8.

   Table 4 Sample Sizes when π_s ≠ π_e (α=.05, β=.10)

   π_s    π_e    δ      n′
   .90    .85    .10    746
   .90    .85    .15    187
   .90    .85    .20     83
   .90    .80    .15    857
   .90    .80    .20    215
Figure 8 shows the increases in sample sizes needed when the assumptions are that π_s ≠ π_e for π_s=.90 and π_e ranging from .80 to .90. The customary method for calculating sample sizes is to assume that π_s = π_e, so the line corresponding to π_e=.90 serves as the reference sample sizes.
[Figure 8 Sample Sizes for π_s ≠ π_e (α=.05, β=.10): n′ versus difference to detect for π_s=.90 and π_e ranging from .80 to .90]

On a slightly different note, consider a case where you know the standard treatment is marginally superior but want to rule out a difference as large as δ with probability 1−β. For example, suppose that π_s=.90. You believe that π_e=.85 is a good guess for the experimental treatment, and you want to rule out a difference of .10 between π_s and π_e with probability .90. The required sample size for this situation is 746. Compare this to n′=155 when you can assume that π_s = π_e.
DISCUSSION
Testing a hypothesis of a specified difference is not new. It fits nicely within the Neyman-Pearson theory for testing hypotheses, which should make statisticians feel comfortable with it. And there can be substantial savings in terms of sample size requirements compared to the conventional approach.

In gaining any advantage, you must sacrifice something. In this case, what you sacrifice is power; the test of a specified difference is a conservative test. You give up power to reject the null hypothesis when the treatments are really equivalent in order to minimize the chances of rejecting the null hypothesis when the standard is actually superior.
Table 5 shows what happens when π_s=.9 and you set α=.05 and β=.10 to be the error rates for the hypothesis of a specified difference. The sample size required to detect a difference δ=.1, from Table 3, is n′=155. (By symmetry, this means that α=.1 for the conventional test. The corresponding sample size for the conventional hypothesis for β=.05, also taken from Table 3, is n=215.) Table 5 gives the error rates for each test for values of π_e ranging from .9 (equivalent), through .89 to .81 (δ-equivalent), down to .8 and .75 (definitely not equivalent), all calculated using a fixed sample size of n = n′ = 155.
   Table 5 Example of Error Rates of the Two Tests

   π_s    π_e     H      H′
   .9     .90    .10    .10      equivalent
   .9     .89    .16    .17      δ-equivalent
   .9     .87    .33    .39      δ-equivalent
   .9     .85    .52    .62      δ-equivalent
   .9     .83    .70    .81      δ-equivalent
   .9     .81    .84    .92      δ-equivalent
   .9     .80    .11    .05      not equivalent
   .9     .75    .01    .002     not equivalent

(The H and H′ columns give the probability that each test reaches the wrong conclusion: declaring the treatments not equivalent in the equivalent and δ-equivalent regions, and declaring them equivalent in the definitely not equivalent region.)
The test of a specified difference clearly minimizes the chances of making an error when the treatments differ by more than .1 (the definitely not equivalent area). It also performs well when π_e is equal to or greater than π_s (the equivalent area). In the δ-equivalent area, where the standard is better by a small margin, its conservative nature shows. The chance of not rejecting H′ is larger than you would like to see. For instance, when π_e=.85, you have a 62% chance of not rejecting H′ and making the mistake of declaring the treatments not equivalent. With the conventional test, you have a 52% chance of declaring the treatments not equivalent.
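The entries in Table 5 can be reproduced with a short SAS/IML sketch (mine, not the paper's; it assumes the normal approximation throughout):

   proc iml;
      /* P(declare "not equivalent") for both tests at fixed n = n' = 155.  */
      /* For pi_e <= .80 the tabulated error is the complement,             */
      /* P(declare "equivalent") = 1 - the probabilities computed here.     */
      pi_s  = 0.90;   n = 155;   delta = 0.10;
      pi_e  = {0.90, 0.89, 0.87, 0.85, 0.83, 0.81, 0.80, 0.75};
      se    = sqrt( (pi_s#(1-pi_s) + pi_e#(1-pi_e)) / n );
      d     = (pi_s - pi_e) / se;                      /* standardized true diff  */
      pH    = 1 - probnorm( probit(0.90) - d );        /* conventional, alpha=.10 */
      pHp   = probnorm( probit(0.95) + d - delta/se ); /* P(fail to reject H')    */
      print pi_e pH pHp;
   quit;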
In appropriate situations and under appropriate assumptions, this approach to testing for equivalence protects you against erroneously finding that the treatments are equally effective. Specifically, it is very efficient when

• you have a standard treatment with a high success rate
• your intent is to demonstrate equivalence of an experimental treatment with the standard
• you have reason to believe that the success rate for the experimental treatment is at least as large as the standard's
• it is imperative that you not declare the treatments equivalent when the standard is truly the superior treatment.

REFERENCES

Blackwelder, W.C. (1982), "Proving the Null Hypothesis in Clinical Trials," Controlled Clinical Trials 3, 345-353.

Casagrande, J.T. and Pike, M.C. (1978), "An Improved Formula for Calculating Sample Sizes for Comparing Two Binomial Distributions," Biometrics 34, 483-486.

Donner, A. (1984), "Approaches to Sample Size Estimation in the Design of Clinical Trials - a Review," Statistics in Medicine 3, 199-214.

Dunnett, C.W. and Gent, M. (1977), "Significance Testing to Establish Equivalence between Treatments, with Special Reference to Data in the Form of 2x2 Tables," Biometrics 33, 593-602.

Makuch, R.W. and Simon, R. (1978), "Sample Size Requirements for Evaluating a Conservative Therapy," Cancer Treatment Reports 62, 1037-1040.

SAS/IML is a registered trademark of SAS Institute Inc., Cary, NC, USA.