Annual Reviews
www.annualreviews.org/aronline
Ann. Rev. Psychol. 1995. 46:561-84
Copyright © 1995 by Annual Reviews Inc. All rights reserved

MULTIPLE HYPOTHESIS TESTING

Juliet Popper Shaffer
Department of Statistics, University of California, Berkeley, California 94720

KEY WORDS: multiple comparisons, simultaneous testing, p-values, closed test procedures, pairwise comparisons
CONTENTS
INTRODUCTION
ORGANIZING CONCEPTS
  Primary Hypotheses, Closure, Hierarchical Sets, and Minimal Hypotheses
  Families
  Type I Error Control
  Power
  P-Values and Adjusted P-Values
  Closed Test Procedures
METHODS BASED ON ORDERED P-VALUES
  Methods Based on the First-Order Bonferroni Inequality
  Methods Based on the Simes Equality
  Modifications for Logically Related Hypotheses
  Methods Controlling the False Discovery Rate
COMPARING NORMALLY DISTRIBUTED MEANS
OTHER ISSUES
  Tests vs Confidence Intervals
  Directional vs Nondirectional Inference
  Robustness
  Others
CONCLUSION
INTRODUCTION

Multiple testing refers to the testing of more than one hypothesis at a time. It is a subfield of the broader field of multiple inference, or simultaneous inference, which includes multiple estimation as well as testing. This review concentrates on testing and deals with the special problems arising from the multiple aspect. The term "multiple comparisons" has come to be used synonymously with "simultaneous inference," even when the inferences do not deal with comparisons. It is used in this broader sense throughout this review.
In general, in testing any single hypothesis, conclusions based on statistical evidence are uncertain. We typically specify an acceptable maximum probability of rejecting the null hypothesis when it is true, thus committing a Type I error, and base the conclusion on the value of a statistic meeting this specification, preferably one with high power. When many hypotheses are tested, and each test has a specified Type I error probability, the probability that at least some Type I errors are committed increases, often sharply, with the number of hypotheses. This may have serious consequences if the set of conclusions must be evaluated as a whole. Numerous methods have been proposed for dealing with this problem, but no one solution will be acceptable for all situations. Three examples are given below to illustrate different types of multiple testing problems.
SUBPOPULATIONS: A HISTORICAL EXAMPLE  Cournot (1843) described vividly the multiple testing problem resulting from the exploration of effects within different subpopulations of an overall population. In his words, as translated from the French, "...it is clear that nothing limits...the number of features according to which one can distribute [natural events or social facts] into several groups or distinct categories." As an example he mentions investigating the chance of a male birth: "One could distinguish first of all legitimate births from those occurring out of wedlock.... one can also classify births according to birth order, according to the age, profession, wealth, or religion of the parents...usually these attempts through which the experimenter passed don't leave any traces; the public will only know the result that has been found worth pointing out; and as a consequence, someone unfamiliar with the attempts which have led to this result completely lacks a clear rule for deciding whether the result can or can not be attributed to chance." (See Stigler 1986 for further discussion of the historical context; see also Shafer & Olkin 1983, Nowak 1994.)
LARGE SURVEYS AND OBSERVATIONAL STUDIES  In large social science surveys, thousands of variables are investigated, and participants are grouped in myriad ways. The results of these surveys are often widely publicized and have potentially large effects on legislation, monetary disbursements, public behavior, etc. Thus, it is important to analyze results in a way that minimizes misleading conclusions. Some type of multiple error control is needed, but it is clearly impractical, if not impossible, to control errors at a small level over the entire set of potential comparisons.
FACTORIAL DESIGNS  The standard textbook presentation of multiple comparison issues is in the context of a one-factor investigation, where there is evidence from an overall test that the means of the dependent variable for the different levels of a factor are not all equal, and more specific inferences are desired to delineate which means are different from which others. Here, in contrast to many of the examples above, the family of inferences for which error control is desired is usually clearly specified and is often relatively small. On the other hand, in multifactorial studies, the situation is less clear. The typical approach is to treat the main effects of each factor as a separate family for purposes of error control, although both Tukey (1953) and Hartley (1955) gave examples of 2 x 2 x 2 factorial designs in which they treated all seven main effect and interaction tests as a single family. The probability of finding some significances may be very large if each of many main effect and interaction tests is carried out at a conventional level in a multifactor design. Furthermore, it is important in many studies to assess the effects of a particular factor separately at each level of other factors, thus bringing in another layer of multiplicity (see Shaffer 1991).
As noted above, Cournot clearly recognized the problems involved in multiple inference, but he considered them insoluble. Although there were a few isolated earlier relevant publications, sustained statistical attacks on the problems did not begin until the late 1940s. Mosteller (1948) and Nair (1948) dealt with extreme value problems; Tukey (1949) presented a more comprehensive approach. Duncan (1951) treated multiple range tests. Related work on ranking and selection was published by Paulson (1949) and Bechhofer (1952). Scheffé (1953) introduced his well-known procedures, and work by Roy & Bose (1953) developed another simultaneous confidence interval approach. Also in 1953, a book-length unpublished manuscript by Tukey presented a general framework covering a number of aspects of multiple inference. This manuscript remained unpublished until recently, when it was reprinted in full (Braun 1994). Later, Lehmann (1957a,b) developed a decision-theoretic approach, and Duncan (1961) developed a Bayesian decision-theoretic approach shortly afterward. For additional historical material, see Tukey (1953), Harter (1980), Miller (1981), Hochberg & Tamhane (1987), and Shaffer (1988).
The first published book on multiple inference was Miller (1966), which was reissued in 1981, with the addition of a review article (Miller 1977). Except in the ranking and selection area, there were no other book-length treatments until 1986, when a series of book-length publications began to appear: 1. Multiple Comparisons (Klockars & Sax 1986); 2. Multiple Comparison Procedures (Hochberg & Tamhane 1987; for reviews, see Littell 1989, Peritz 1989); 3. Multiple Hypothesenprüfung (Multiple Hypotheses Testing) (Bauer et al 1988; for reviews, see Läuter 1990, Holm 1990); 4. Multiple Comparisons for Researchers (Toothaker 1991; for reviews, see Gaffan 1992, Tatsuoka 1992) and Multiple Comparison Procedures (Toothaker 1993); 5. Multiple Comparisons, Selection, and Applications in Biometry (Hoppe 1993b; for a review, see Ziegel 1994); 6. Resampling-based Multiple Testing (Westfall & Young 1993; for reviews, see Chaubey 1993, Booth 1994); 7. The Collected Works of John W. Tukey, Volume VII: Multiple Comparisons: 1948-1983 (Braun 1994); and 8. Multiple Comparisons: Theory and Methods (Hsu 1996).
This review emphasizes conceptual issues and general approaches. In particular, two types of methods are discussed in detail: (a) methods based on ordered p-values and (b) comparisons among normally distributed means. The literature cited offers many examples of the application of techniques discussed here.
ORGANIZING CONCEPTS

Primary Hypotheses, Closure, Hierarchical Sets, and Minimal Hypotheses
Assume some set of null hypotheses of primary interest to be tested. Sometimes the number of hypotheses in the set is infinite (e.g. hypothesized values of all linear contrasts among a set of population means), although in most practical applications it is finite (e.g. values of all pairwise contrasts among a set of population means). It is assumed that there is a set of observations with joint distribution depending on some parameters and that the hypotheses specify limits on the values of those parameters. The following examples use a primary set based on differences among the means μ1, μ2, ..., μm of m populations, although the concepts apply in general. Let δij be the difference μi - μj; let δijk be the set of differences among the means μi, μj, and μk, etc. The hypotheses are of the form Hijk...: δijk... = 0, indicating that all subscripted means are equal; e.g. H1234 is the hypothesis μ1 = μ2 = μ3 = μ4. The primary set need not consist of the individual pairwise hypotheses Hij. If m = 4, it may, for example, be the set H12, H123, H1234, etc, which would signify a lack of interest in including inference concerning some of the pairwise differences (e.g. H23) and therefore no need to control errors with respect to those differences.
The closure of the set is the collection of the original set together with all distinct hypotheses formed by intersections of hypotheses in the set; such a collection is called a closed set. For example, an intersection of the hypotheses Hij and Hik is the hypothesis Hijk: μi = μj = μk. The hypotheses included in an intersection are called components of the intersection hypothesis. Technically, a hypothesis is a component of itself; any other component is called a proper component. In the example above, the proper components of Hijk are Hij, Hik, and, if it is included in the set of primary interest, Hjk, because its intersection with either Hij or Hik also gives Hijk. Note that the truth of a hypothesis implies the truth of all its proper components.
Any set of hypotheses in which some are proper components of others will be called a hierarchical set. (That term is sometimes used in a more limited way, but this definition is adopted here.) A closed set (with more than one hypothesis) is therefore a hierarchical set. In a closed set, the top of the hierarchy is the intersection of all hypotheses: in the examples above, it is the hypothesis H12...m, or μ1 = μ2 = ... = μm. The set of hypotheses that have no proper components represents the lowest level of the hierarchy; these are called the minimal hypotheses (Gabriel 1969). Equivalently, a minimal hypothesis is one that does not imply the truth of any other hypothesis in the set. For example, if all the hypotheses state that there are no differences among sets of means, and the set of primary interest includes all hypotheses Hij for all i ≠ j = 1, ..., m, these pairwise equality hypotheses are the minimal hypotheses.
Families

The first and perhaps most crucial decision is what set of hypotheses to treat as a family, that is, as the set for which significance statements will be considered and errors controlled jointly. In some of the early multiple comparisons literature (e.g. Ryan 1959, 1960), the term "experiment" rather than "family" was used in referring to error control. Implicitly, attention was directed to relatively small and limited experiments. As a dramatic contrast, consider the example of large surveys and observational studies described above. Here, because of the inverse relationship between control of Type I errors and power, it is unreasonable if not impossible to consider methods controlling the error rate at a conventional level, or indeed any level, over all potential inferences from such surveys. An intermediate case is a multifactorial study (see above example), in which it frequently seems unwise from the point of view of power to control error over all inferences. The term "family" was introduced by Tukey (1952, 1953). Miller (1981), Diaconis (1985), Hochberg & Tamhane (1987), and others discuss the issues involved in deciding on a family. Westfall & Young (1993) give explicit advice on methods for approaching complex experimental studies.
Because a study can be used for different purposes, the results may have to be considered under several different family configurations. This issue came up in reporting state and other geographical comparisons in the National Assessment of Educational Progress (see Ahmed 1991). In a recent national report, each of the 780 pairwise differences among the 40 jurisdictions involved (states, territories, and the District of Columbia) was tested for significance at level .05/780 in order to control Type I errors for that family. However, from the point of view of a single jurisdiction, the family of interest is the 39 comparisons of itself with each of the others, so it would be reasonable to test those differences each at level .05/39, in which case some differences would be declared significant that were not so designated in the national report. See Ahmed (1991) for a discussion of this example and other issues in the context of large surveys.
Type I Error Control

In testing a single hypothesis, the probability of a Type I error, i.e. of rejecting the null hypothesis when it is true, is usually controlled at some designated level α. The choice of α should be governed by considerations of the costs of rejecting a true hypothesis as compared with those of accepting a false hypothesis. Because of the difficulty of quantifying these costs and the subjectivity involved, α is usually set at some conventional level, often .05. A variety of generalizations to the multiple testing situation are possible.
Some multiple comparison methods control the Type I error rate only when all null hypotheses in the family are true. Others control this error rate for any combination of true and false hypotheses. Hochberg & Tamhane (1987) refer to these as weak control and strong control, respectively. Examples of methods with only weak error control are the Fisher protected least significant difference (LSD) procedure, the Newman-Keuls procedure, and some nonparametric procedures (see Fligner 1984, Keselman et al 1991a). The multiple comparison literature has been confusing because the distinction between weak and strong control is often ignored. In fact, weak error rate control without other safeguards is unsatisfactory. This review concentrates on procedures with strong control of the error rate. Several different error rates have been considered in the multiple testing literature. The major ones are the error rate per hypothesis, the error rate per family, and the familywise error rate.
The error rate per hypothesis (usually called PCE, for per-comparison error rate, although the hypotheses need not be restricted to comparisons) is defined for each hypothesis as the probability of Type I error or, when the number of hypotheses is finite, the average PCE can be defined as the expected value of (number of false rejections/number of hypotheses), where a false rejection means the rejection of a true hypothesis. The error rate per family (PFE) is defined as the expected number of false rejections in the family. This error rate does not apply if the family size is infinite. The familywise error rate (FWE) is defined as the probability of at least one error in the family. A fourth type of error rate, the false discovery rate, is described below.
To make the three definitions above clearer, consider what they imply in a simple example in which each of n hypotheses H1, ..., Hn is tested individually at a level αi, and the decision on each is based solely on that test. (Procedures of this type are called single-stage; other procedures have a more complicated structure.) If all the hypotheses are true, the average PCE equals the average of the αi, the PFE equals the sum of the αi, and the FWE is a function not of the αi alone, but involves the joint distribution of the test statistics; it is smaller than or equal to the PFE, and larger than or equal to the largest αi.
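These relationships among the error rates are easy to verify numerically in the independent case. The sketch below is illustrative (the per-test levels are invented, not from the article) and uses the fact that for independent tests the FWE equals one minus the product of the probabilities of no error on each test.

```python
import math

# Illustrative: three independent single-stage tests at per-test levels
# alpha_i, with all null hypotheses true (levels chosen for the example).
alphas = [0.01, 0.02, 0.02]

avg_pce = sum(alphas) / len(alphas)          # average per-comparison error rate
pfe = sum(alphas)                            # expected number of false rejections
fwe = 1 - math.prod(1 - a for a in alphas)   # P(at least one false rejection)

print(round(avg_pce, 5), round(pfe, 5), round(fwe, 5))
# The FWE lies between the largest alpha_i and the PFE
assert max(alphas) <= fwe <= pfe
```

Here the PFE is .05 while the FWE is about .0492, illustrating that the FWE is smaller than the PFE but exceeds every individual αi.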
A common misconception of the meaning of an overall error rate α applied to a family of tests is that on the average, only a proportion α of the rejected hypotheses are true ones, i.e. are falsely rejected. To see why this is not so, consider the case in which all the hypotheses are true; then 100% of rejected hypotheses are true, i.e. are rejected in error, in those situations in which any rejections occur. This misconception, however, suggests considering the proportion of rejected hypotheses that are falsely rejected and trying to control this proportion in some way. Letting V equal the number of false rejections (i.e. rejections of true hypotheses) and R equal the total number of rejections, the proportion of false rejections is Q = V/R. Some interesting early work related to this ratio is described by Seeger (1968), who credits the initial investigation to unpublished papers of Eklund. Sorić (1989) describes a different approach to this ratio. These papers (Seeger, Eklund, and Sorić) advocated informal consideration of the ratio; the following new approach is more formal. The false discovery rate (FDR) is the expected value of Q = (number of false significances/number of significances) (Benjamini & Hochberg 1994).
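The Benjamini & Hochberg step-up procedure that controls this rate (discussed further below) can be sketched as follows; the p-values are illustrative, not from the article.

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini & Hochberg (1994) step-up rule controlling the FDR at q:
    find the largest j with ordered p_(j) <= j*q/n, then reject H_(1)...H_(j)."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    reject = [False] * n
    for j in range(n, 0, -1):                 # step up from the largest p-value
        if pvals[order[j - 1]] <= j * q / n:
            for i in order[:j]:
                reject[i] = True
            break
    return reject

print(benjamini_hochberg([0.01, 0.30, 0.02, 0.40], q=0.05))
```

With these four p-values the largest qualifying j is 2 (p(2) = .02 ≤ 2 × .05/4), so the two smallest p-values are declared significant.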
Power

As shown above, the error rate can be generalized in different ways when moving from single to multiple hypothesis testing. The same is true of power. Three definitions of power have been common: the probability of rejecting at least one false hypothesis, the average probability of rejecting the false hypotheses, and the probability of rejecting all false hypotheses. When the family consists of pairwise mean comparisons, these have been called, respectively, any-pair power (Ramsey 1978), per-pair power (Einot & Gabriel 1975), and all-pairs power (Ramsey 1978). Ramsey (1978) showed that the difference in power between single-stage and multistage methods is much greater for all-pairs than for any-pair or per-pair power (see also Gabriel 1978, Hochberg & Tamhane 1987).
P-Values and Adjusted P-Values

In testing a single hypothesis, investigators have moved away from simply accepting or rejecting the hypothesis, giving instead the p-value connected with the test, i.e. the probability of observing a test statistic as extreme or more extreme in the direction of rejection as the observed value. This can be conceptualized as the level at which the hypothesis would just be rejected, and therefore both allows individuals to apply their own criteria and gives more information than merely acceptance or rejection. Extension of this concept in its full meaning to the multiple testing context is not necessarily straightforward. A concept that allows generalization from the test of a single hypothesis to the multiple context is the adjusted p-value (Rosenthal & Rubin 1983). Given any test procedure, the adjusted p-value corresponding to the test of a single hypothesis Hi can be defined as the level of the entire test procedure at which Hi would just be rejected, given the values of all test statistics involved. Application of this definition in complex multiple comparison procedures is discussed by Wright (1992) and by Westfall & Young (1993), who base their methodology on the use of such values. These values are interpretable on the same scale as those for tests of individual hypotheses, making comparison with single hypothesis testing easier.
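For the simple procedures treated in this review, adjusted p-values have closed forms; a minimal sketch for the Bonferroni and Holm procedures (defined in the next section) follows, with illustrative p-values.

```python
def bonferroni_adjusted(pvals):
    """Bonferroni adjusted p-values: the smallest overall level alpha at which
    the unweighted Bonferroni procedure would reject each hypothesis."""
    n = len(pvals)
    return [min(1.0, n * p) for p in pvals]

def holm_adjusted(pvals):
    """Holm adjusted p-values; the running maximum keeps them monotone in the
    ordered p-values, matching the stagewise structure of the procedure."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adj = [0.0] * n
    running = 0.0
    for j, i in enumerate(order):
        running = max(running, min(1.0, (n - j) * pvals[i]))
        adj[i] = running
    return adj

print(bonferroni_adjusted([0.01, 0.04, 0.03]))
print(holm_adjusted([0.01, 0.04, 0.03]))
```

An adjusted p-value below the chosen FWE level corresponds exactly to rejection by the procedure, so results can be read on the familiar single-test scale.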
Closed Test Procedures

Most of the multiple comparison methods in use are designed to control the FWE. The most powerful of these methods are in the class of closed test procedures, described in Marcus et al (1976). To define this general class, assume a set of hypotheses of primary interest, add hypotheses as necessary to form the closure of this set, and recall that the closed set consists of a hierarchy of hypotheses. The closure principle is as follows: A hypothesis is rejected at level α if and only if it and every hypothesis directly above it in the hierarchy (i.e. every hypothesis that includes it in an intersection and thus implies it) is rejected at level α. For example, given four means, with the six hypotheses Hij, i ≠ j = 1, ..., 4 as the minimal hypotheses, the highest hypothesis in the hierarchy is H1234, and no hypothesis below H1234 can be rejected unless it is rejected at level α. Assuming it is rejected, the hypothesis H12 cannot be rejected unless the three other hypotheses above it in the hierarchy, H123, H124, and the intersection of hypotheses H12 and H34 (i.e. the single hypothesis μ1 = μ2 and μ3 = μ4), are rejected at level α, and then H12 is rejected if its associated test statistic is significant at that level. Any tests can be used at each of these levels, provided the choice of tests does not depend on the observed configuration of the means. The proof that closed test procedures control the FWE involves a simple logical argument. Consider every possible true situation, each of which can be represented as an intersection of null and alternative hypotheses. Only one of these situations can be the true one, and under a closed testing procedure the probability of rejecting that one true configuration is ≤ α. All true null hypotheses in the primary set are contained in the intersection corresponding to the true configuration, and none of them can be rejected unless that configuration is rejected. Therefore, the probability of one or more of these true primary hypotheses being rejected is ≤ α.
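The closure principle can be illustrated with a small brute-force sketch (an illustration under assumptions, not from the article): here each intersection hypothesis over an index set S is tested with a Bonferroni test at level α, i.e. rejected when the smallest p-value in S is at most α/|S|. With this choice of component tests, the closed procedure is known to coincide with Holm's method treated later in the review.

```python
from itertools import combinations

def closed_bonferroni(pvals, alpha=0.05):
    """Closure principle, brute force: reject minimal hypothesis H_i only if
    every intersection hypothesis containing index i is rejected, where the
    intersection over S is rejected when min_{k in S} p_k <= alpha/|S|."""
    n = len(pvals)

    def intersection_rejected(S):
        return min(pvals[k] for k in S) <= alpha / len(S)

    reject = []
    for i in range(n):
        sets = (S for r in range(1, n + 1)
                for S in combinations(range(n), r) if i in S)
        reject.append(all(intersection_rejected(S) for S in sets))
    return reject

print(closed_bonferroni([0.01, 0.04, 0.03]))
```

Brute-force closure examines 2^n - 1 intersections, which is why shortcut procedures such as Holm's are used in practice.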
METHODS BASED ON ORDERED P-VALUES

The methods discussed in this section are defined in terms of a finite family of hypotheses Hi, i = 1, ..., n, consisting of minimal hypotheses only. It is assumed that for each hypothesis Hi there is a corresponding test statistic Ti with a distribution that depends only on the truth or falsity of Hi. It is further assumed that Hi is to be rejected for large values of Ti. (The Ti are absolute values for two-sided tests.) Then the (unadjusted) p-value pi of Hi is defined as the probability that Ti is larger than or equal to ti, where T refers to the random variable and t to its observed value. For simplicity of notation, assume the hypotheses are numbered in the order of their p-values so that p1 ≤ p2 ≤ ... ≤ pn, with arbitrary ordering in case of ties. With the exception of the subsection on Methods Controlling the FDR, all methods in this section are intended to provide strong control of the FWE.
Methods Based on the First-Order Bonferroni Inequality

The first-order Bonferroni inequality states that, given any set of events A1, A2, ..., An, the probability of their union (i.e. of the event A1 or A2 or ... or An) is smaller than or equal to the sum of their probabilities. Letting Ai stand for the rejection of Hi, i = 1, ..., n, this inequality is the basis of the Bonferroni methods discussed in this section.
THE SIMPLE BONFERRONI METHOD  This method takes the form: Reject Hi if pi ≤ αi, where the αi are chosen so that their sum equals α. Usually, the αi are chosen to be equal (all equal to α/n), and the method is then called the unweighted Bonferroni method. This procedure controls the PFE to be ≤ α and to be exactly α if all hypotheses are true. The FWE is usually < α.
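The unweighted method reduces to a single comparison of each p-value with a common cutoff; a minimal sketch with illustrative p-values:

```python
def bonferroni(pvals, alpha=0.05):
    """Unweighted Bonferroni: reject H_i when p_i <= alpha/n, so the per-test
    levels alpha_i = alpha/n sum to alpha and the PFE is at most alpha."""
    n = len(pvals)
    return [p <= alpha / n for p in pvals]

# Three hypothetical p-values; the common cutoff here is 0.05/3
print(bonferroni([0.004, 0.02, 0.30]))
```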
This simple Bonferroni method is an example of a single-stage testing procedure. In single-stage procedures, control of the FWE has the consequence that the larger the number of hypotheses in the family, the smaller the average power for testing the individual hypotheses. Multistage testing procedures can partially overcome this disadvantage. Some multistage modifications of the Bonferroni method are discussed below.
HOLM'S SEQUENTIALLY-REJECTIVE BONFERRONI METHOD  The unweighted method is described here; for the weighted method, see Holm (1979). This method is applied in stages as follows: At the first stage, H1 is rejected if p1 ≤ α/n. If H1 is accepted, all hypotheses are accepted without further test; otherwise, H2 is rejected if p2 ≤ α/(n - 1). Continuing in this fashion, at any stage Hj is rejected if and only if all Hi have been rejected, i < j, and pj ≤ α/(n - j + 1).
To prove that this method controls the FWE, let k be the number of hypotheses that are true, where k is some number between 0 and n. If k = n, the test at the first stage will result in a Type I error with probability ≤ α. If k = n - 1, an error might occur at the first stage but will certainly occur if there is a rejection at the second stage, so again the probability of a Type I error is ≤ α [because there are n - 1 true hypotheses and none can be rejected unless at least one has an associated p-value ≤ α/(n - 1)]. Similarly, whatever the value of k, a Type I error may occur at an early stage but will certainly occur if there is a rejection at stage n - k + 1, in which case the probability of a Type I error is ≤ α. Thus, the FWE is ≤ α for every possible configuration of true and false hypotheses.
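The stagewise rule can be sketched directly (illustrative p-values, not from the article):

```python
def holm(pvals, alpha=0.05):
    """Holm's sequentially-rejective Bonferroni method: compare the j-th
    smallest p-value with alpha/(n - j + 1); stop at the first failure."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    reject = [False] * n
    for j, i in enumerate(order):            # stage j + 1
        if pvals[i] <= alpha / (n - j):
            reject[i] = True
        else:
            break                            # accept this and all later hypotheses
    return reject

print(holm([0.01, 0.04, 0.03]))
```

Here the smallest p-value (.01 ≤ .05/3) is rejected, but the procedure stops at the second stage because .03 > .05/2.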
A MODIFICATION FOR INDEPENDENT AND SOME DEPENDENT STATISTICS  If test statistics are independent, the Bonferroni procedure and the Holm modification described above can be improved slightly by replacing α/k for any k = 1, ..., n by 1 - (1 - α)^(1/k), always > α/k, although the difference is small for small values of α. These somewhat higher levels can also be used when the test statistics are positive orthant dependent, a class that includes the two-sided t statistics for pairwise comparisons of normally distributed means in a one-way layout. Holland & Copenhaver (1988) note this fact and give examples of other positive orthant dependent statistics.
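A quick numerical check (illustrative, with α = .05) confirms that the improved levels exceed α/k, but only slightly:

```python
# The improved per-stage levels 1 - (1 - alpha)**(1/k) always exceed alpha/k;
# the gap is small for small alpha.
alpha = 0.05
for k in (2, 5, 10):
    improved = 1 - (1 - alpha) ** (1 / k)
    assert improved > alpha / k
    print(k, round(improved, 5), round(alpha / k, 5))
```

For k = 2, for instance, the improved level is about .02532 versus the Bonferroni level .025.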
Methods Based on the Simes Equality

Simes (1986) proved that if a set of hypotheses H1, H2, ..., Hn are all true, and the associated test statistics are independent, then with probability 1 - α, pi > iα/n for i = 1, ..., n, where the pi are the ordered p-values, and α is any number between 0 and 1. Furthermore, although Simes noted that the probability of this joint event could be smaller than 1 - α for dependent test statistics, this appeared to be true only in rather pathological cases. Simes and others (Hommel 1988, Holland 1991, Klockars & Hancock 1992) have provided simulation results suggesting that the probability of the joint event is larger than 1 - α for many types of dependence found in typical testing situations, including the usual two-sided t test statistics for all pairwise comparisons among normally distributed treatment means.

Simes suggested that this result could be used in multiple testing but did not provide a formal procedure. As Hochberg (1988) and Hommel (1988) pointed out, on the assumption that the inequality applies in a testing situation, more powerful procedures than the sequentially rejective Bonferroni can be obtained by invoking the Simes result in combination with the closure principle. Because carrying out a full Simes-based closure procedure testing all possible hypotheses would be tedious with a large closed set, Hochberg (1988) and Hommel (1988) each give simplified, conservative methods of utilizing the Simes result.
HOCHBERG'S MULTIPLE TEST PROCEDURE  Hochberg's (1988) procedure can be described as a "step-up" modification of Holm's procedure. Consider the set of primary hypotheses H1, ..., Hn. If pj ≤ α/(n - j + 1) for any j = 1, ..., n, reject all hypotheses Hi for i ≤ j. In other words, if pn ≤ α, reject all Hi; otherwise, if pn-1 ≤ α/2, reject H1, ..., Hn-1, etc.
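The step-up rule can be sketched as follows; the two p-values in the example are invented to show a case where Hochberg's procedure rejects but Holm's does not.

```python
def hochberg(pvals, alpha=0.05):
    """Hochberg's step-up procedure: starting from the largest p-value, find
    the first j with p_(j) <= alpha/(n - j + 1); reject H_(1), ..., H_(j)."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    reject = [False] * n
    for j in range(n, 0, -1):
        if pvals[order[j - 1]] <= alpha / (n - j + 1):
            for i in order[:j]:
                reject[i] = True
            break
    return reject

# With p-values (0.03, 0.04), Hochberg rejects both (since p_(2) = 0.04 <= 0.05),
# while Holm's step-down rule rejects neither (0.03 > 0.05/2).
print(hochberg([0.03, 0.04]))
```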
HOMMEL'S MULTIPLE TEST PROCEDURE  Hommel's (1988) procedure is more powerful than Hochberg's but is more difficult to understand and apply. Let j be the largest integer for which pn-j+k > kα/j for all k = 1, ..., j. If no such j exists, reject all hypotheses; otherwise, reject all Hi with pi ≤ α/j.
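A direct transcription of this rule (with illustrative p-values) may make it more concrete:

```python
def hommel(pvals, alpha=0.05):
    """Hommel's procedure: let j be the largest integer with
    p_(n-j+k) > k*alpha/j for all k = 1, ..., j; if no such j exists,
    reject everything; otherwise reject all H_i with p_i <= alpha/j."""
    n = len(pvals)
    p = sorted(pvals)                        # ordered p-values p_(1) <= ... <= p_(n)
    for j in range(n, 0, -1):                # largest qualifying j first
        if all(p[n - j + k - 1] > k * alpha / j for k in range(1, j + 1)):
            return [pi <= alpha / j for pi in pvals]
    return [True] * n                        # no such j: reject all hypotheses

print(hommel([0.01, 0.30, 0.40]))
```

In this example j = 2, so hypotheses with p-values at most .05/2 = .025 are rejected.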
ROM'S MODIFICATION OF HOCHBERG'S PROCEDURE  Rom (1990) gave slightly higher critical p-value levels that can be used with Hochberg's procedure, making it somewhat more powerful. The values must be calculated; see Rom (1990) for details and a table of values for small n.
Modifications for Logically Related Hypotheses

Shaffer (1986) pointed out that Holm's sequentially-rejective multiple test procedure can be improved when hypotheses are logically related; the same considerations apply to multistage methods based on Simes' equality. In many testing situations, it is not possible to get all combinations of true and false hypotheses. For example, if the hypotheses refer to pairwise differences among treatment means, it is impossible to have μ1 = μ2 and μ2 = μ3 but μ1 ≠ μ3. Using this reasoning, with four means and six possible pairwise equality null hypotheses, if all six are not true, then at most three are true. Therefore, it is not necessary to protect against error in the event that five hypotheses are true and one is false, because this combination is impossible. Let tj be the maximum number of hypotheses that are true given that at least j - 1 hypotheses are false. Shaffer (1986) gives recursive methods for finding the values tj for several types of testing situations (see also Holland & Copenhaver 1987, Westfall & Young 1993). The methods discussed above can be modified to increase power when the hypotheses are logically related; all methods in this section are intended to control the FWE at a level ≤ α.
MODIFIED METHODS As is clear from the proof that it maintains FWE control, the Holm procedure can be modified as follows: At stage j, instead of rejecting Hj only if pj ≤ α/(n - j + 1), Hj can be rejected if pj ≤ α/tj. Thus, when the hypotheses of primary interest are logically related, as in the example above, the modified sequentially-rejective Bonferroni method is more powerful than the unmodified method. For some simple applications, see Levin et al (1994).
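As an illustrative sketch (the function and variable names and the numerical example are mine, not the paper's), the modification amounts to swapping the Holm divisors n, n-1, ..., 1 for the tj sequence; for pairwise comparisons of four means, the tj values are (6, 3, 3, 3, 2, 1):

```python
def seq_rejective(pvals, divisors, alpha=0.05):
    """Generic sequentially-rejective test: with p-values taken in ascending
    order, reject the hypothesis at stage j iff its p-value <= alpha/divisors[j-1],
    stopping at the first failure. Holm uses divisors n, n-1, ..., 1; the
    Shaffer modification substitutes the t_j values."""
    idx = sorted(range(len(pvals)), key=lambda i: pvals[i])
    reject = [False] * len(pvals)
    for stage, i in enumerate(idx):
        if pvals[i] <= alpha / divisors[stage]:
            reject[i] = True
        else:
            break                          # first failure ends the procedure
    return reject

# Pairwise equality hypotheses among four means: n = 6.
holm_divisors    = [6, 5, 4, 3, 2, 1]
shaffer_divisors = [6, 3, 3, 3, 2, 1]      # t_j values for this situation
```

With p-values (0.008, 0.013, 0.013, 0.02, 0.2, 0.3) at alpha = 0.05, the Holm divisors yield one rejection while the modified divisors yield three, illustrating the gain in power.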
Hochberg & Rom (1994) and Hommel (1988) describe modifications of their Simes-based procedures for logically related hypotheses. The simpler of the two modifications the former describes is to proceed from i = n, n - 1, n - 2, etc until for the first time pi ≤ α/(n - i + 1). Then reject all Hi for
which pi ≤ α/t(i+1). [The Rom (1990) modification of the Hochberg procedure can be improved in a similar way.] In the Hommel modification, let j be the largest integer in the set n, t2, ..., tn, and proceed as in the unmodified Hommel procedure.
Still further modifications at the expense of greater complexity can be achieved, since it can also be shown (Shaffer 1986) that for FWE control it is necessary to consider only the number of hypotheses that can be true given that the specific hypotheses that have been rejected are false. Hommel (1986), Conforti & Hochberg (1987), Rasmussen (1993), Rom & Holland (1994), and Hochberg & Rom (1994) consider more general procedures.
COMPARISON OF PROCEDURES Among the unmodified procedures, Hommel's and Rom's are more powerful than Hochberg's, which is more powerful than Holm's; the latter two, however, are the easiest to apply (Hommel 1988, 1989; Hochberg 1988; Hochberg & Rom 1994). Simulation results using the unmodified methods suggest that the differences are usually small (Holland 1991). Comparisons among the modified procedures are more complex (see Hochberg & Rom 1994).
A CAUTION All methods based on Simes's results rest on the assumption that the equality he proved for independent tests results in a conservative multiple comparison procedure for dependent tests. Thus, the use of these methods in atypical multiple test situations should be backed up by simulation or further theoretical results (see Hochberg & Rom 1994).
Methods Controlling the False Discovery Rate
The ordered p-value methods described above provide strong control of the FWE. When the test statistics are independent, the following less conservative step-up procedure controls the FDR (Benjamini & Hochberg 1994): If j is the largest index for which pj ≤ jα/n, reject all Hi for i ≤ j. A recent simulation study (Y Benjamini, Y Hochberg, & Y Kling, manuscript in preparation) suggests that the FDR is also controlled at this level for the dependent tests involved in pairwise comparisons. VS Williams, LV Jones, & JW Tukey (manuscript in preparation) show in a number of real data examples that the Benjamini-Hochberg FDR-controlling procedure may result in substantially more rejections than other multiple comparison methods. However, to obtain an expected proportion of false rejections, Benjamini & Hochberg have to define a value when the denominator, i.e. the number of rejections, equals zero; they define the ratio then as zero. As a result, the expected proportion, given that some rejections actually occur, is greater than α in some situations (it necessarily equals one when all hypotheses are true), so more investigation of the error properties of this procedure is needed.
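A minimal sketch of the Benjamini-Hochberg step-up rule (illustrative code, not taken from the paper):

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up FDR procedure: find the largest j with p_(j) <= j*alpha/n
    (p's in ascending order) and reject the hypotheses with the j smallest
    p-values; if no such j exists, reject nothing."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    j_max = 0
    for j, i in enumerate(order, start=1):
        if pvals[i] <= j * alpha / n:
            j_max = j                      # remember the largest qualifying j
    reject = [False] * n
    for j, i in enumerate(order, start=1):
        reject[i] = j <= j_max
    return reject
```

For p-values (0.01, 0.02, 0.03, 0.2) at alpha = 0.05 this rejects the first three hypotheses, whereas Holm's FWE-controlling procedure stops after the first; this illustrates the extra rejections the FDR criterion allows.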
COMPARING NORMALLY DISTRIBUTED MEANS
The methods in this section differ from those of the last in three respects: They deal specifically with comparisons of means, they are derived assuming normally distributed observations, and they are based on the joint distribution of all observations. In contrast, the methods considered in the previous section are completely general, both with respect to the types of hypotheses and the distributions of test statistics, and except for some results related to independence of statistics, they utilize only the individual marginal distributions of those statistics.
Contrasts among treatment means are linear functions of the form Σci μi, where Σci = 0. The pairwise differences among means are called simple contrasts; a general contrast can be thought of as a weighted average of some subset of means minus a weighted average of another subset. The reader is presumably familiar with the most commonly used methods for testing the hypotheses that sets of linear contrasts equal zero with FWE control in a one-way analysis of variance layout under standard assumptions. They are described briefly below.
Assume m treatments with N observations per treatment and a total of T observations over all treatments, let Ȳi be the sample mean for treatment i, and let MSW be the within-treatment mean square.
If the primary hypotheses consist of all linear contrasts among treatment means, the Scheffé method (1953) controls the FWE. Using the Scheffé method, a contrast hypothesis Σci μi = 0 is rejected if |Σci Ȳi| > √[Σci² (MSW/N) (m - 1) F(m-1, T-m; α)], where F(m-1, T-m; α) is the α-level critical value of the F distribution with m - 1 and T - m degrees of freedom.
If the primary hypotheses consist of the pairwise differences, i.e. the simple contrasts, the Tukey method (1953) controls the FWE over this set. Using this method, any simple contrast hypothesis δij = 0 is rejected if |Ȳi - Ȳj| > √(MSW/N) q(m, T-m; α), where q(m, T-m; α) is the α-level critical value of the studentized range statistic for m means and T - m error degrees of freedom.
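The single-stage Tukey test for equal sample sizes can be sketched as follows (illustrative Python, assuming SciPy 1.7 or later for scipy.stats.studentized_range; the function and variable names are mine):

```python
import numpy as np
from scipy.stats import studentized_range

def tukey_pairwise(groups, alpha=0.05):
    """Reject delta_ij = 0 when |Ybar_i - Ybar_j| > sqrt(MSW/N) * q(m, T-m; alpha),
    for m treatments with N observations each (T = m*N)."""
    m, N = len(groups), len(groups[0])
    T = m * N
    means = [np.mean(g) for g in groups]
    msw = np.mean([np.var(g, ddof=1) for g in groups])   # within-treatment mean square
    crit = np.sqrt(msw / N) * studentized_range.ppf(1 - alpha, m, T - m)
    return {(i, j): abs(means[i] - means[j]) > crit
            for i in range(m) for j in range(i + 1, m)}
```

For three groups in which only the third mean is far from the other two, the procedure flags the (0, 2) and (1, 2) differences but not (0, 1).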
If the primary hypotheses consist of comparisons of each of the first m - 1 means with the mth mean (e.g. of m - 1 treatments with a control), the Dunnett method (1955) controls the FWE over this set. Using this method, any hypothesis δim = 0 is rejected if |Ȳi - Ȳm| > √(2MSW/N) d(m-1, T-m; α), where d(m-1, T-m; α) is the α-level critical value of the appropriate distribution for this test.
Both the Tukey and Dunnett methods can be generalized to test the hypotheses that all linear contrasts among the means equal zero, so that the three procedures can be compared in power on this whole set of tests (for discussion of these extended methods and specific comparisons, see Shaffer 1977). Richmond (1982) provides a more general treatment of the extension of confidence intervals for a finite set to intervals for all linear functions of the set.
All three methods can be modified to multistage methods that give more power for hypothesis testing. In the Scheffé method, if the F test is significant, the FWE is preserved if m - 1 is replaced by m - 2 everywhere in the expression for Scheffé significance tests (Scheffé 1970). The Tukey method can be improved by a multiple range test using significance levels described by Tukey (1953) and sometimes referred to as Tukey-Welsch-Ryan levels (see also Einot & Gabriel 1975, Lehmann & Shaffer 1979). Begun & Gabriel (1981) describe an improved but more complex multiple range procedure based on a suggestion by E Peritz [unpublished manuscript (1970)] using closure principles, denoted the Peritz-Begun-Gabriel method by Grechanovsky (1993). Welsch (1977) and Dunnett & Tamhane (1992) proposed step-up methods (looking first at adjacent differences) as opposed to the step-down methods in the multiple range procedures just described. The step-up methods have some desirable properties (see Ramsey 1981, Dunnett & Tamhane 1992, Keselman & Lix 1994) but require heavy computation or special tables for application. The Dunnett test can be treated in a sequentially-rejective fashion, where at stage j the smaller value d(m-j, T-m; α) can be substituted for d(m-1, T-m; α).
Because the hypotheses in a closed set may each be tested at level α by a variety of procedures, there are many other possible multistage procedures. For example, results of Ramsey (1978), Shaffer (1981), and Kunert (1990) suggest that for most configurations of means, a multiple F-test multistage procedure is more powerful than the multiple range procedures described above for testing pairwise differences, although the opposite is true with single-stage procedures. Other approaches to comparing means based on ranges have been investigated by Braun & Tukey (1983), Finner (1988), and Royen (1989, 1990).
The Scheffé method and its multistage version are easy to apply when sample sizes are unequal; simply substitute Ni for N in the Scheffé formula given above, where Ni is the number of observations for treatment i. Exact solutions for the Tukey and Dunnett procedures are possible in principle but involve evaluation of multidimensional integrals. More practical approximate methods are based on replacing MSW/N, which is half the estimated variance of Ȳi - Ȳj in the equal-sample-size case, with (1/2) MSW (1/Ni + 1/Nj), which is half its estimated variance in the unequal-sample-size case. The common value MSW/N is thus replaced by a different value for each pair of subscripts i and j. The Tukey-Kramer method (Tukey 1953, Kramer 1956) uses the single-stage Tukey studentized range procedure with these half-variance estimates substituted for MSW/N. Kramer (1956) proposed a similar multistage method; a preferred, somewhat less conservative method proposed by Duncan (1957)
modifies the Tukey multiple range method to allow for the fact that a small difference may be more significant than a large difference if it is based on larger sample sizes. Hochberg & Tamhane (1987) discuss the implementation of the Duncan modification and show that it is conservative in the unbalanced one-way layout. For modifications of the Dunnett procedure for unequal sample sizes, see Hochberg & Tamhane (1987).
The methods must be modified when it cannot be assumed that within-treatment variances are equal. If variance heterogeneity is suspected, it is important to use a separate variance estimate for each sample mean difference or other contrast. The multiple comparison procedure should be based on the set of values of each mean difference or contrast divided by the square root of its estimated variance. The distribution of each can be approximated by a t distribution with estimated degrees of freedom (Welch 1938, Satterthwaite 1946). Tamhane (1979) and Dunnett (1980) compared a number of single-stage procedures based on these approximate t statistics; several of the procedures provided satisfactory error control.
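The separate-variance statistic and its Welch-Satterthwaite estimated degrees of freedom for a single mean difference can be sketched as follows (illustrative code; SciPy's ttest_ind with equal_var=False computes the same quantities):

```python
import numpy as np
from scipy import stats

def welch_pair(x, y):
    """Mean difference divided by the square root of its separate-variance
    estimate, referred to a t distribution with Welch-Satterthwaite
    estimated degrees of freedom."""
    nx, ny = len(x), len(y)
    vx, vy = np.var(x, ddof=1), np.var(y, ddof=1)
    se2 = vx / nx + vy / ny                 # estimated variance of the difference
    t = (np.mean(x) - np.mean(y)) / np.sqrt(se2)
    df = se2**2 / ((vx / nx)**2 / (nx - 1) + (vy / ny)**2 / (ny - 1))
    p = 2 * stats.t.sf(abs(t), df)          # two-sided p-value
    return t, df, p
```

In a multiple comparison procedure each mean difference or contrast would get its own statistic and estimated degrees of freedom in this way.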
In one-way repeated measures designs (one factor within-subjects or subjects-by-treatments designs), the standard mixed model assumes sphericity of the treatment covariance matrix, equivalent to the assumption of equality of the variance of each difference between sample treatment means. Standard models for between-subjects-within-subjects designs have the added assumption of equality of the covariance matrices among the levels of the between-subjects factor(s). Keselman et al (1991b) give a detailed account of the calculation of appropriate test statistics when both these assumptions are violated and show in a simulation study that simple multiple comparison procedures based on these statistics have satisfactory properties (see also Keselman & Lix 1994).
OTHER ISSUES
Tests vs Confidence Intervals
The simple Bonferroni and the basic Scheffé, Tukey, and Dunnett methods described above are single-stage methods, and all have associated simultaneous confidence interval interpretations. When a confidence interval for a difference does not include zero, the hypothesis that the difference is zero is rejected, but the confidence interval gives more information by indicating the direction and something about the magnitude of the difference or, if the hypothesis is not rejected, the power of the procedure can be gauged by the width of the interval. In contrast, the multistage or stepwise procedures have no such straightforward confidence-interval interpretations, but more complicated intervals can sometimes be constructed. The first confidence-interval interpretation of a multistage procedure was given by Kim et al (1988), and Hayter & Hsu (1994) have described a general method for obtaining these intervals. The intervals are complicated in structure, and more assumptions are required for them to be valid than for conventional confidence intervals. Furthermore, although as a testing method a multistage procedure might be uniformly more powerful than a single-stage procedure, the confidence intervals corresponding to the former are sometimes less informative than those corresponding to the latter. Nonetheless, these are interesting results, and more along this line are to be expected.
Directional vs Nondirectional Inference
In the examples discussed above, most attention has been focused on simple contrasts, testing hypotheses H0: δij = 0 vs HA: δij ≠ 0. However, in most cases, if H0 is rejected, it is crucial to conclude either μi > μj or μi < μj. Different types of testing problems arise when direction of difference is considered: 1. Sometimes the interest is in testing one-sided hypotheses of the form μi ≤ μj vs μi > μj, e.g. if a new treatment is being tested to see whether it is better than a standard treatment, and there is no interest in pursuing the matter further if it is inferior. 2. In a two-sided hypothesis test, as formulated above, rejection of the hypothesis is equivalent to the decision μi ≠ μj. Is it appropriate to further conclude μi > μj if Ȳi > Ȳj and the opposite otherwise? 3. Sometimes there is an a priori ordering assumption μ1 ≤ μ2 ≤ ... ≤ μm, or some subset of these means are considered ordered, and the interest is in deciding whether some of these inequalities are strict.
Each of these situations is different, and different considerations arise. An important issue in connection with the second and third problems mentioned above is whether it makes sense to even consider the possibility that the means under two different experimental conditions are equal. Some writers contend that a priori no difference is ever zero (for a recent defense of this position, see Tukey 1991, 1993). Others, including this author, believe that it is not necessary to assume that every variation in conditions must have an effect. In any case, even if one believes that a mean difference of zero is impossible, an intervention can have an effect so minute that it is essentially undetectable and unimportant, in which case the null hypothesis is reasonable as a practical way of framing the question. Whatever the views on this issue, the hypotheses in the second case described above are not correctly specified if directional decisions are desired. One must consider, in addition to Type I and Type II errors, the probably more severe error of concluding a difference exists but making the wrong choice of direction. This has sometimes been called a Type III error and may be the most important or even the only concern in the second testing situation.
For methods with corresponding simultaneous confidence intervals, inspection of the intervals yields a directional answer immediately. For many multistage methods, the situation is less clear. Shaffer (1980) showed that an additional decision on direction in the second testing situation does not control the FWE of Type III for all test statistic distributions. Hochberg & Tamhane (1987) describe these results and others found by S Holm [unpublished manuscript (1979)] (for newer results, see Finner 1990). Other less powerful methods with guaranteed Type I and/or Type III FWE control have been developed by Spjøtvoll (1972), Holm [1979; improved and extended by Bauer et al (1986)], Bohrer (1979), Bofinger (1985), and Hochberg (1987). Some writers have considered methods for testing one-sided hypotheses of the third type discussed above (e.g. Marcus et al 1976, Spjøtvoll 1977, Berenson 1982). Budde & Bauer (1989) compare a number of such procedures both theoretically and via simulation.
In another type of one-sided situation, Hsu (1981, 1984) introduced a method that can be used to test the set of primary hypotheses of the form Hi: μi is the largest mean. The tests are closely related to a one-sided version of the Dunnett method described above. They also relate the multiple testing literature to the ranking and selection literature.
Robustness
This is a necessarily brief look at robustness of methods based on the homogeneity of variance and normality assumptions of standard analysis of variance. Chapter 10 of Scheffé (1959) is a good source for basic theoretical results concerning these violations.
As Tukey (1993) has pointed out, an amount of variance heterogeneity that affects an overall F test only slightly becomes a more serious concern when multiple comparison methods are used, because the variance of a particular comparison may be badly biased by use of a common estimated value. Hochberg & Tamhane (1987) discuss the effects of variance heterogeneity on the error properties of tests based on the assumption of homogeneity.
With respect to nonnormality, asymptotic theory ensures that with sufficiently large samples, results on Type I error and power in comparisons of means based on normally distributed observations are approximately valid under a wide variety of nonnormal distributions. (Results assuming normally distributed observations often are not even approximately valid under nonnormality, however, for inference on variances, covariances, and correlations.) This leaves the question of how large is large? In addition, alternative methods are more powerful than normal theory-based methods under many nonnormal distributions. Hochberg & Tamhane (1987, Chap. 9) discuss distribution-free and robust procedures and give references to many studies of the robustness of normal theory-based methods and of possible alternative methods for
multiple comparisons. In addition, Westfall & Young (1993) give detailed guidance for using robust resampling methods to obtain appropriate error control.
Others
FREQUENTIST METHODS, BAYESIAN METHODS, AND META-ANALYSIS Frequentist methods control error without any assumptions about possible alternative values of parameters except for those that may be implied logically. Meta-analysis in its simplest form assumes that all hypotheses refer to the same parameter and it combines results into a single statement. Bayes and empirical Bayes procedures are intermediate in that they assume some connection among parameters and base error control on that assumption. A major contributor to the Bayesian methods is Duncan (see e.g. Duncan 1961, 1965; Duncan & Dixon 1983). Hochberg & Tamhane (1987) describe Bayesian approaches (see also Berry 1988). Westfall & Young (1993) discuss the relations among these three approaches.
DECISION-THEORETIC OPTIMALITY Lehmann (1957a,b), Bohrer (1979), and Spjøtvoll (1972) defined optimal multiple comparison methods based on frequentist decision-theoretic principles, and Duncan (1961, 1965) and coworkers developed optimal procedures from the Bayesian decision-theoretic point of view. Hochberg & Tamhane (1987) discuss these and other results.
RANKING AND SELECTION The methods of Dunnett (1955) and Hsu (1981, 1984), discussed above, form a bridge between the selection and multiple testing literatures, and are discussed in relation to that literature in Hochberg & Tamhane (1987). Bechhofer et al (1989) describe another method that incorporates aspects of both approaches.
GRAPHS AND DIAGRAMS As with all statistical results, the results of multiple comparison procedures are often most clearly and comprehensively conveyed through graphs and diagrams, especially when a large number of tests is involved. Hochberg & Tamhane (1987) discuss a number of procedures. Duncan (1955) includes several illuminating geometric diagrams of acceptance regions, as do Tukey (1953) and Bohrer & Schervish (1980). Tukey (1953, 1991) gives a number of graphical methods for describing differences among means (see also Hochberg et al 1982, Gabriel & Gheva 1982, Hsu & Peruggia 1994). Tukey (1993) suggests graphical methods for displaying interactions. Schweder & Spjøtvoll (1982) illustrate a graphical method for plotting large numbers of ordered p-values that can be used to help decide on the number of true hypotheses; this approach is used by Y Benjamini & Y Hochberg (manuscript submitted
for publication) to develop a more powerful FDR-controlling method. See Hochberg & Tamhane (1987) for further references.
HIGHER-ORDER BONFERRONI AND OTHER INEQUALITIES One way to use partial knowledge of joint distributions is to consider higher-order Bonferroni inequalities in testing some of the intersection hypotheses, thus potentially increasing the power of FWE-controlling multiple comparison methods. The Bonferroni inequalities are derived from a general expression for the probability of the union of a number of events. The simple Bonferroni methods using individual p-values are based on the upper bound given by the first-order inequality. Second-order approximations use joint distributions of pairs of test statistics, third-order approximations use joint distributions of triples of test statistics, etc, thus forming a bridge between methods requiring only univariate distributions and those requiring the full multivariate distribution (see Hochberg & Tamhane 1987 for further references to methods based on second-order approximations; see also Bauer & Hackl 1985). Hoover (1990) gives results using third-order or higher approximations, and Glaz (1993) includes an extensive discussion of these inequalities (see also Naiman & Wynn 1992, Hoppe 1993a, Seneta 1993). Some approaches are based on the distribution of combinations of p-values (see Cameron & Eagleson 1985, Buckley & Eagleson 1986, Maurer & Mellein 1988, Rom & Connell 1994). Other types of inequalities are also useful in obtaining improved approximate methods (see Hochberg & Tamhane 1987, Appendix 2).
WEIGHTS In the description of the simple Bonferroni method it was noted that each hypothesis Hi can be tested at any level αi with the FWE controlled at α = Σαi. In most applications, the αi are equal, but there may be reasons to prefer unequal allocation of error protection. For methods controlling FWE, see Holm (1979), Rosenthal & Rubin (1983), DeCani (1984), and Hochberg & Liberman (1994). Y Benjamini & Y Hochberg (manuscript submitted for publication) extend the FDR method to allow for unequal weights and discuss various purposes for differential weighting and alternative methods of achieving it.
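The weighted allocation can be sketched in a few lines (an illustrative sketch, assuming weights that sum to one):

```python
def weighted_bonferroni(pvals, weights, alpha=0.05):
    """Test each H_i at level alpha_i = w_i * alpha; since the alpha_i sum
    to alpha, the FWE is controlled at alpha by the Bonferroni inequality."""
    assert abs(sum(weights) - 1.0) < 1e-12   # weights must sum to one
    return [p <= w * alpha for p, w in zip(pvals, weights)]
```

With p-values (0.02, 0.04), equal weights reject only the first hypothesis, while shifting most of the error allocation to the second hypothesis (e.g. weights 0.2 and 0.8) rejects only the second, showing how the allocation expresses differing priorities among hypotheses.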
OTHER AREAS OF APPLICATION Hypotheses specifying values of linear combinations of independent normal means other than contrasts can be tested jointly using the distribution of either the maximum modulus or the augmented range (for details, see Scheffé 1959). Hochberg & Tamhane (1987) discuss methods in analysis of covariance, methods for categorical data, methods for comparing variances, and experimental design issues in various areas. Cameron & Eagleson (1985) and Buckley & Eagleson (1986) consider multiple tests for significance of correlations. Gabriel (1968) and Morrison (1990) deal with methods for
multivariate multiple comparisons. Westfall & Young (1993, Chap. 4) discuss resampling methods in a variety of situations. The large literature on model selection in regression includes many papers focusing on the multiple testing aspects of this area.
CONCLUSION
The field of multiple hypothesis testing is too broad to be covered entirely in a review of this length; apologies are due to many researchers whose contributions have not been acknowledged. The problem of multiplicity is gaining increasing recognition, and research in the area is proliferating. The major challenge is to devise methods that incorporate some kind of overall control of Type I error while retaining reasonable power for tests of the individual hypotheses. This review, while sketching a number of issues and approaches, has emphasized recent research on relatively simple and general multistage testing methods that are providing progress in this direction.
ACKNOWLEDGMENTS
Research supported in part through the National Institute of Statistical Sciences by NSF Grant RED-9350005. Thanks to Yosef Hochberg, Lyle V. Jones, Erich L. Lehmann, Barbara A. Mellers, Seth D. Roberts, and Valerie S. L. Williams for helpful comments and suggestions.
Literature Cited
Ahmed SW. 1991. Issues arising in the application of Bonferroni procedures in federal surveys. 1991 ASA Proc. Surv. Res. Methods Sect., pp. 344-49
Bauer P, Hackl P. 1985. The application of Hunter's inequality to simultaneous testing. Biometr. J. 27:25-38
Bauer P, Hackl P, Hommel G, Sonnemann E. 1986. Multiple testing of pairs of one-sided hypotheses. Metrika 33:121-27
Bauer P, Hommel G, Sonnemann E, eds. 1988. Multiple Hypothesenprüfung. (Multiple Hypotheses Testing.) Berlin: Springer-Verlag (In German and English)
Bechhofer RE. 1952. The probability of a correct ranking. Ann. Math. Stat. 23:139-40
Bechhofer RE, Dunnett CW, Tamhane AC. 1989. Two-stage procedures for comparing treatments with a control: elimination at the first stage and estimation at the second stage. Biometr. J. 31:545-61
Begun J, Gabriel KR. 1981. Closure of the Newman-Keuls multiple comparison procedure. J. Am. Stat. Assoc. 76:241-45
Benjamini Y, Hochberg Y. 1994. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B. In press
Berenson ML. 1982. A comparison of several k sample tests for ordered alternatives in completely randomized designs. Psychometrika 47:265-80 (Corr. 535-39)
Berry DA. 1988. Multiple comparisons, multiple tests, and data dredging: a Bayesian perspective (with discussion). In Bayesian Statistics, ed. JM Bernardo, MH DeGroot, DV Lindley, AFM Smith, 3:79-94. London: Oxford Univ. Press
Bofinger E. 1985. Multiple comparisons and Type III errors. J. Am. Stat. Assoc. 80:433-37
Bohrer R. 1979. Multiple three-decision rules for parametric signs. J. Am. Stat. Assoc. 74:432-37
Bohrer R, Schervish MJ. 1980. An optimal multiple decision rule for signs of parameters. Proc. Natl. Acad. Sci. USA 77:52-56
Booth JG. 1994. Review of "Resampling Based Multiple Testing." J. Am. Stat. Assoc. 89:354-55
Braun HI, ed. 1994. The Collected Works of John W. Tukey. Vol. VIII: Multiple Comparisons: 1948-1983. New York: Chapman & Hall
Braun HI, Tukey JW. 1983. Multiple comparisons through orderly partitions: the maximum subrange procedure. In Principals of Modern Psychological Measurement: A Festschrift for Frederic M. Lord, ed. H Wainer, S Messick, pp. 55-65. Hillsdale, NJ: Erlbaum
Buckley MJ, Eagleson GK. 1986. Assessing large sets of rank correlations. Biometrika 73:151-57
Budde M, Bauer P. 1989. Multiple test procedures in clinical dose finding studies. J. Am. Stat. Assoc. 84:792-96
Cameron MA, Eagleson GK. 1985. A new procedure for assessing large sets of correlations. Aust. J. Stat. 27:84-95
Chaubey YP. 1993. Review of "Resampling Based Multiple Testing." Technometrics 35:450-51
Conforti M, Hochberg Y. 1987. Sequentially rejective pairwise testing procedures. J. Stat. Plan. Infer. 17:193-208
Cournot AA. 1843. Exposition de la Théorie des Chances et des Probabilités. Paris: Hachette. Reprinted 1984 as Vol. 1 of Cournot's Oeuvres Complètes, ed. B Bru. Paris: Vrin
DeCani JS. 1984. Balancing Type I risk and loss of power in ordered Bonferroni procedures. J. Educ. Psychol. 76:1035-37
Diaconis P. 1985. Theories of data analysis: from magical thinking through classical statistics. In Exploring Data Tables, Trends, and Shapes, ed. DC Hoaglin, F Mosteller, JW Tukey, pp. 1-36. New York: Wiley
Duncan DB. 1951. A significance test for differences between ranked treatments in an analysis of variance. Va. J. Sci. 2:172-89
Duncan DB. 1955. Multiple range and multiple F tests. Biometrics 11:1-42
Duncan DB. 1957. Multiple range tests for correlated and heteroscedastic means. Biometrics 13:164-76
Duncan DB. 1961. Bayes rules for a common multiple comparisons problem and related Student-t problems. Ann. Math. Stat. 32:1013-33
Duncan DB. 1965. A Bayesian approach to multiple comparisons. Technometrics 7:171-222
Duncan DB, Dixon DO. 1983. k-ratio t tests, t intervals, and point estimates for multiple comparisons. In Encyclopedia of Statistical Sciences, ed. S Kotz, NL Johnson, 4:403-10. New York: Wiley
Dunnett CW. 1955. A multiple comparison procedure for comparing several treatments with a control. J. Am. Stat. Assoc. 50:1096-1121
Dunnett CW. 1980. Pairwise multiple comparisons in the unequal variance case. J. Am. Stat. Assoc. 75:796-800
Dunnett CW, Tamhane AC. 1992. A step-up multiple test procedure. J. Am. Stat. Assoc. 87:162-70
Einot I, Gabriel KR. 1975. A study of the powers of several methods in multiple comparisons. J. Am. Stat. Assoc. 70:574-83
Finner H. 1988. Abgeschlossene Spannweitentests (Closed multiple range tests). See Bauer et al 1988, pp. 10-32 (In German)
Finner H. 1990. On the modified S-method and directional errors. Commun. Stat. Part A: Theory Methods 19:41-53
Fligner MA. 1984. A note on two-sided distribution-free treatment versus control multiple comparisons. J. Am. Stat. Assoc. 79:208-11
Gabriel KR. 1968. Simultaneous test procedures in multivariate analysis of variance. Biometrika 55:489-504
Gabriel KR. 1969. Simultaneous test procedures: some theory of multiple comparisons. Ann. Math. Stat. 40:224-50
Gabriel KR. 1978. Comment on the paper by Ramsey. J. Am. Stat. Assoc. 73:485-87
Gabriel KR, Gheva D. 1982. Some new simultaneous confidence intervals in MANOVA and their geometric representation and graphical display. In Experimental Design, Statistical Models, and Genetic Statistics, ed. K Hinkelmann, pp. 239-75. New York: Dekker
Gaffan EA. 1992. Review of "Multiple Comparisons for Researchers." Br. J. Math. Stat. Psychol. 45:334-35
Glaz J. 1993. Approximate simultaneous confidence intervals. See Hoppe 1993b, pp. 149-66
Grechanovsky E. 1993. Comparing stepdown multiple comparison procedures. Presented at Annu. Jt. Stat. Meet., 153rd, San Francisco
Harter HL. 1980. Early history of multiple comparison tests. In Handbook of Statistics, ed. PR Krishnaiah, 1:617-22. Amsterdam: North-Holland
Hartley HO. 1955. Somerecent developments
in analysis of variance. Commun.Pure
Appl. Math. 8:47-72
Hayter AJ, HsuJC. 1994. On the relationship
between stepwise decision procedures and
confidence sets. J. Am. Stat. Assoc. 89:
128-36
Hochberg Y. 1987. Multiple classification
rules for signs of parameters.J. Stat. Plan.
Infer. 15:177-88
HochbergY. 1988. A sharper Bonferroni procedure for multiple tests of significance.
Biometrika 75:800-3
Hochberg Y, Liberman U. 1994. An extended
Simestest. Stat. Prob.Lett. In press
Hochberg Y, RomD. 1994. Extensions of multiple testing procedures based on Simes’
test. J. Stat. Plan.Infer. In press
Hochberg Y, Tamhane AC. 1987. Multiple
Comparison Procedures.
New York:
Wiley
Hochberg Y, Weiss G, Hart S. 1982. On
graphical procedures for multiple comparisons. J. Am. Stat. Assoc. 77:767-72
blolland B. 1991. Onthe application of three
modified Bonferroni procedures to pairwise multiple comparisons in balanced repeated measuresdesigns. Comput.Stat. Q.
6:21%31.(Corr. 7:223)
Holland BS, Copenhaver MD. 1987. An improved sequentially rejective Bonferroni
test procedure. Biometrics 43:417-23.
(Corr:43:737)
Holland BS, Copenhaver MD.1988. Improved
Bonferroni-type multiple testing procedures. Psychol. Bull 104:145-49
HolmS. 1979. A simple sequentially rejective
multiple test procedure. Scand. J. Stat. 6:
65-70
HolmS. 1990. Review of "Multiple Hypothesis Testing." Metrika 37:206
Hommel
G. 1986. Multiple test procedures for
arbitrary dependence structures. Metrika
33:321-36
Hommel
G. 1988. A stagewise rejective multiple test procedure based on a modified
Bonferroni test. Biometrika 75:383-86
HommelG. 1989. A comparison of two modified Bonferroni procedures. Biometrika 76:
624-25
Hoover DR. 1990. Subset complememaddition upper bounds--an improved inclusion-exclusion method.J. Stat. Plan. Infer.
24:195-202
Hoppe FM. 1993a. Beyond inclusion-and-exclusion: natural identities for P[exactly t
events] and Plat least t evems]and resulting inequalities. Int. Stat. Rev. 61:435-46
Hoppe FM, ed. 1993b. Multiple Comparisons, Selection, and Applications in Biometry. New York: Dekker
Hsu JC. 1981. Simultaneous confidence intervals for all distances from the 'best'. Ann. Stat. 9:1026-34
Hsu JC. 1984. Constrained simultaneous confidence intervals for multiple comparisons with the best. Ann. Stat. 12:1136-44
Hsu JC. 1996. Multiple Comparisons: Theory and Methods. New York: Chapman & Hall. In press
Hsu JC, Peruggia M. 1994. Graphical representations of Tukey's multiple comparison method. J. Comput. Graph. Stat. 3:143-61
Keselman HJ, Keselman JC, Games PA. 1991a. Maximum familywise Type I error rate: the least significant difference, Newman-Keuls, and other multiple comparison procedures. Psychol. Bull. 110:155-61
Keselman HJ, Keselman JC, Shaffer JP. 1991b. Multiple pairwise comparisons of repeated measures means under violation of multisample sphericity. Psychol. Bull. 110:162-70
Keselman HJ, Lix LM. 1994. Improved repeated-measures stepwise multiple comparison procedures. J. Educ. Stat. In press
Kim WC, Stefansson G, Hsu JC. 1988. On confidence sets in multiple comparisons. In Statistical Decision Theory and Related Topics IV, ed. SS Gupta, JO Berger, 2:89-104. New York: Academic
Klockars AJ, Hancock GR. 1992. Power of recent multiple comparison procedures as applied to a complete set of planned orthogonal contrasts. Psychol. Bull. 111:505-10
Klockars AJ, Sax G. 1986. Multiple Comparisons. Newbury Park, CA: Sage
Kramer CY. 1956. Extension of multiple range tests to group means with unequal numbers of replications. Biometrics 12:307-10
Kunert J. 1990. On the power of tests for multiple comparison of three normal means. J. Am. Stat. Assoc. 85:808-12
Läuter J. 1990. Review of "Multiple Hypotheses Testing." Comput. Stat. Q. 5:333
Lehmann EL. 1957a. A theory of some multiple decision problems. I. Ann. Math. Stat. 28:1-25
Lehmann EL. 1957b. A theory of some multiple decision problems. II. Ann. Math. Stat. 28:547-72
Lehmann EL, Shaffer JP. 1979. Optimum significance levels for multistage comparison procedures. Ann. Stat. 7:27-45
Levin JR, Serlin RC, Seaman MA. 1994. A controlled, powerful multiple-comparison strategy for several situations. Psychol. Bull. 115:153-59
Littell RC. 1989. Review of "Multiple Comparison Procedures." Technometrics 31:261-62
Marcus R, Peritz E, Gabriel KR. 1976. On closed testing procedures with special reference to ordered analysis of variance. Biometrika 63:655-60
Maurer W, Mellein B. 1988. On new multiple tests based on independent p-values and the assessment of their power. See Bauer et al 1988, pp. 48-66
Miller RG. 1966. Simultaneous Statistical Inference. New York: Wiley
Miller RG. 1977. Developments in multiple comparisons 1966-1976. J. Am. Stat. Assoc. 72:779-88
Miller RG. 1981. Simultaneous Statistical Inference. New York: Wiley. 2nd ed.
Morrison DF. 1990. Multivariate Statistical Methods. New York: McGraw-Hill. 3rd ed.
Mosteller F. 1948. A k-sample slippage test for an extreme population. Ann. Math. Stat. 19:58-65
Naiman DQ, Wynn HP. 1992. Inclusion-exclusion-Bonferroni identities and inequalities for discrete tube-like problems via Euler characteristics. Ann. Stat. 20:43-76
Nair KR. 1948. Distribution of the extreme deviate from the sample mean. Biometrika 35:118-44
Nowak R. 1994. Problems in clinical trials go far beyond misconduct. Science 264:1538-41
Paulson E. 1949. A multiple decision procedure for certain problems in the analysis of variance. Ann. Math. Stat. 20:95-98
Peritz E. 1989. Review of "Multiple Comparison Procedures." J. Educ. Stat. 14:103-6
Ramsey PH. 1978. Power differences between pairwise multiple comparisons. J. Am. Stat. Assoc. 73:479-85
Ramsey PH. 1981. Power of univariate pairwise multiple comparison procedures. Psychol. Bull. 90:352-66
Rasmussen JL. 1993. Algorithm for Shaffer's multiple comparison tests. Educ. Psychol. Meas. 53:329-35
Richmond J. 1982. A general method for constructing simultaneous confidence intervals. J. Am. Stat. Assoc. 77:455-60
Rom DM. 1990. A sequentially rejective test procedure based on a modified Bonferroni inequality. Biometrika 77:663-65
Rom DM, Connell L. 1994. A generalized family of multiple test procedures. Commun. Stat. Part A: Theory Methods, 23. In press
Rom DM, Holland B. 1994. A new closed multiple testing procedure for hierarchical families of hypotheses. J. Stat. Plan. Infer. In press
Rosenthal R, Rubin DB. 1983. Ensemble-adjusted p values. Psychol. Bull. 94:540-41
Roy SN, Bose RC. 1953. Simultaneous confidence interval estimation. Ann. Math. Stat. 24:513-36
Royen T. 1989. Generalized maximum range tests for pairwise comparisons of several populations. Biometr. J. 31:905-29
Royen T. 1990. A probability inequality for ranges and its application to maximum range test procedures. Metrika 37:145-54
Ryan TA. 1959. Multiple comparisons in psychological research. Psychol. Bull. 56:26-47
Ryan TA. 1960. Significance tests for multiple comparison of proportions, variances, and other statistics. Psychol. Bull. 57:318-28
Satterthwaite FE. 1946. An approximate distribution of estimates of variance components. Biometrics 2:110-14
Scheffé H. 1953. A method for judging all contrasts in the analysis of variance. Biometrika 40:87-104
Scheffé H. 1959. The Analysis of Variance. New York: Wiley
Scheffé H. 1970. Multiple testing versus multiple estimation. Improper confidence sets. Estimation of directions and ratios. Ann. Math. Stat. 41:1-19
Schweder T, Spjøtvoll E. 1982. Plots of P-values to evaluate many tests simultaneously. Biometrika 69:493-502
Seeger P. 1968. A note on a method for the analysis of significances en masse. Technometrics 10:586-93
Seneta E. 1993. Probability inequalities and Dunnett's test. See Hoppe 1993b, pp. 29-45
Shafer G, Olkin I. 1983. Adjusting p values to account for selection over dichotomies. J. Am. Stat. Assoc. 78:674-78
Shaffer JP. 1977. Multiple comparisons emphasizing selected contrasts: an extension and generalization of Dunnett's procedure. Biometrics 33:293-303
Shaffer JP. 1980. Control of directional errors with stagewise multiple test procedures. Ann. Stat. 8:1342-48
Shaffer JP. 1981. Complexity: an interpretability criterion for multiple comparisons. J. Am. Stat. Assoc. 76:395-401
Shaffer JP. 1986. Modified sequentially rejective multiple test procedures. J. Am. Stat. Assoc. 81:826-31
Shaffer JP. 1988. Simultaneous testing. In Encyclopedia of Statistical Sciences, ed. S Kotz, NL Johnson, 8:484-90. New York: Wiley
Shaffer JP. 1991. Probability of directional errors with disordinal (qualitative) interaction. Psychometrika 56:29-38
Simes RJ. 1986. An improved Bonferroni procedure for multiple tests of significance. Biometrika 73:751-54
Sorić B. 1989. Statistical "discoveries" and effect-size estimation. J. Am. Stat. Assoc. 84:608-10
Spjøtvoll E. 1972. On the optimality of some multiple comparison procedures. Ann. Math. Stat. 43:398-411
Spjøtvoll E. 1977. Ordering ordered parameters. Biometrika 64:327-34
Stigler SM. 1986. The History of Statistics. Cambridge: Harvard Univ. Press
Tamhane AC. 1979. A comparison of procedures for multiple comparisons of means with unequal variances. J. Am. Stat. Assoc. 74:471-80
Tatsuoka MM. 1992. Review of "Multiple Comparisons for Researchers." Contemp. Psychol. 37:775-76
Toothaker LE. 1991. Multiple Comparisons for Researchers. Newbury Park, CA: Sage
Toothaker LE. 1993. Multiple Comparison Procedures. Newbury Park, CA: Sage
Tukey JW. 1949. Comparing individual means in the analysis of variance. Biometrics 5:99-114
Tukey JW. 1952. Reminder sheets for "Multiple Comparisons." See Braun 1994, pp. 341-45
Tukey JW. 1953. The problem of multiple comparisons. See Braun 1994, pp. 1-300
Tukey JW. 1991. The philosophy of multiple comparisons. Stat. Sci. 6:100-16
Tukey JW. 1993. Where should multiple comparisons go next? See Hoppe 1993b, pp. 187-207
Welch BL. 1938. The significance of the difference between two means when the population variances are unequal. Biometrika 29:350-62
Welsch RE. 1977. Stepwise multiple comparison procedures. J. Am. Stat. Assoc. 72:566-75
Westfall PH, Young SS. 1993. Resampling-based Multiple Testing. New York: Wiley
Wright SP. 1992. Adjusted p-values for simultaneous inference. Biometrics 48:1005-13
Ziegel ER. 1994. Review of "Multiple Comparisons, Selection, and Applications in Biometry." Technometrics 36:230-31