Psychological Methods
2001, Vol. 6, No. 2, 135-146
Copyright 2001 by the American Psychological Association, Inc.
1082-989X/01/$5.00 DOI: 10.1037//1082-989X.6.2.135
Review of Assumptions and Problems in the Appropriate
Conceptualization of Effect Size
Robert J. Grissom and John J. Kim
San Francisco State University
Estimation of the effect size parameter, D, the standardized difference between population means, is sensitive to heterogeneity of variance (heteroscedasticity), which seems to abound in psychological data. Pooling s²s assumes homoscedasticity, as do methods for constructing a confidence interval for D, estimating D from t or analysis of variance results, formulas that adjust estimates for inflation by main effects or covariates, and the Q statistic. The common language effect size statistic as an estimate of Pr(X1 > X2), the probability that a randomly sampled member of Population 1 will outscore a randomly sampled member of Population 2, also assumes normality and homoscedasticity. Various proposed solutions are reviewed, including measures that do not make these assumptions, such as the probability of superiority estimate of Pr(X1 > X2). Ways to reconceptualize effect size when treatments may affect moments such as the variance are also discussed.
Robert J. Grissom and John J. Kim, Department of Psychology, San Francisco State University.
We gratefully acknowledge Rebecca Ray and Julie Gorecki for assistance with the preparation of an earlier version of this article, and Jeffrey Keyes for assistance with data analysis.
Correspondence concerning this article should be addressed to Robert J. Grissom, 564 Apple Grove Lane, Santa Barbara, California 93105. Electronic mail may be sent to rgrissom@sfsu.edu.

In univariate research that compares the means of two independent groups, the parameter that is typically estimated to measure effect size is D = (μ1 − μ2)/σ (Cohen, 1988). Estimation of D is problematic when there is heterogeneity of variance (heteroscedasticity) because there is no common σ, and different estimators of σ can lead to very different estimates of D. Moreover, heteroscedasticity should prompt different conceptions of effect size. As discussed later, if two treatments produce distributions with different variances and/or different shapes, measures of effect size other than the standardized difference between means would be informative. This article describes the problems and reviews proposed solutions.

Numerous results in samples suggest heteroscedasticity in a wide variety of areas of research (Grissom, 2000; Keppel, 1991; Lix, Cribbie, & Keselman, 1996; Tomarken & Serlin, 1986; Wilcox, 1987). When group means differ, group s²s often differ in the same direction (Fisher & van Belle, 1993; Fleiss, 1986; Norusis, 1995; Raudenbush & Bryk, 1987; Sawilowsky & Blair, 1992; Snedecor & Cochran, 1989). Samples with greater positive skew often have the larger means and variances. Reaction time, latency, difference-threshold, and galvanic skin response data provide some examples. In a randomized design, homoscedasticity is to be expected a priori, but the treatment after randomization can produce heteroscedasticity, as described in Bryk and Raudenbush (1988). However, research using nonrandomly formed groups may well involve a priori heteroscedasticity (Grissom, 2000).

In educational research, Wilcox (1987) reported that ratios of largest to smallest sample variances (maximum sample variance ratios [VRs]) exceeding 16 are not uncommon, and VR values exceeding 400 have been reported (Lix et al., 1996). Maximum VRs of up to 12 are considered to be realistic by Tomarken and Serlin (1986). In clinical outcome research with children and adolescents, variances have been found to be significantly different in therapy and control groups (Weisz, Weiss, Han, Granger, & Morton, 1995). In comparing a systematic desensitization group with an implosive therapy and control group, Hekmat (1973) found sample VRs of over 12 and nearly 29, respectively, on the Behavior Avoidance
Test. Inspection of several articles in one recent issue of the Journal of Consulting and Clinical Psychology yielded sample VR values of 3.24, 4.00 (several), 6.48, 6.67, 7.32, 7.84, 25.00, and 281.79. The latter VR involved skewed distributions of the number of drinks per day under two different treatments for depression in alcoholism, data that the authors transformed to normalize (R. A. Brown, Evans, Miller, Burgess, & Mueller, 1997). When control and therapy conditions for number of posttest panic attacks were compared, VRs of 8.02 and 6.65 were found for control s²/treated s² and Therapy 1 s²/Therapy 2 s², respectively (Feske & Goldstein, 1997). When comparing activity levels of drugged rats with those of a control group, Conti and Musty (1984) found a maximum sample VR of 6.87 across dosages of the active component of marijuana.
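As a computational point of reference, a maximum sample VR of the kind reported above is simply the largest unbiased sample variance divided by the smallest. A minimal sketch in Python (the function name and data are invented for illustration):

```python
import numpy as np

def max_variance_ratio(*groups):
    """Ratio of the largest to the smallest unbiased sample variance
    across two or more groups (a maximum sample VR)."""
    variances = [np.var(g, ddof=1) for g in groups]
    return max(variances) / min(variances)

# Hypothetical control and treated scores.
control = np.array([4.0, 5.5, 6.1, 5.0, 4.8, 5.9])
treated = np.array([3.0, 9.5, 12.2, 2.1, 8.8, 14.0])
print(max_variance_ratio(control, treated))
```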
The magnitudes of reported sample VRs strongly
suggest heteroscedasticity. However, because of the
great sampling variability of s², estimates of population VRs are needed. It would also be useful to have
Monte Carlo studies of the "robustness" to heteroscedasticity of the many measures of effect size that assume homoscedasticity. However, under heteroscedasticity, a major issue beyond robustness of
traditional measures of effect size is the conceptualization of appropriate alternative measures of effects
of treatments on moments about the mean. It is hoped
that the present review will prompt (a) more cautious
interpretation of current measures and conversion formulas of effect size that assume homoscedasticity; (b)
use of measures of overall effect of treatments on
centers, tails, spreads, and shapes of distributions; and
(c) research on the estimation of population VRs.
Choice of Denominator for Effect Size When Comparing Two Means and Assuming Normality
In cases where there is heteroscedasticity and a control group, Glass, McGaw, and Smith (1981) proposed using sc, the standard deviation of the control group, to estimate σ. In this case the effect size estimator, dc = (X̄e − X̄c)/sc, is estimating (μe − μc)/σc, an informative parameter under normality that indicates the location of an average experimental population (e) participant with respect to the control population's (c) distribution. Unfortunately for the measures of effect size that do so, assuming normality may not be realistic (Micceri, 1989; Wilcox, 1996). When estimating De = (μe − μc)/σe, using the standard deviation of an experimental group, se, instead of sc can result in a numerically very different estimate as well as a different definition of effect size. Treatment × Subject interaction or a ceiling or floor effect could cause heteroscedasticity and, therefore, Dc ≠ De. The problem of widely varying values of d depending on the choice of s exists even if there is equality of population variances because sample variances are still likely to vary greatly. With df < 20, point estimates of σ can be in error by hundreds of percent (Hoaglin, Mosteller, & Tukey, 1991). In addition, although there is a case for using some standard measure to estimate σ, such as sc, one could question whether it is always appropriate to use σc just because there is a comparison (control) group. The comparison group may represent the current standard treatment, but it may not turn out to be essentially different from the "experimental" treatment.
Under heteroscedasticity, Cohen (1988) suggested estimating the parameter D′ that uses the denominator σ′ = [(σ1² + σ2²)/2]^{1/2}. Huynh's (1989) Monte Carlo simulations showed that d′ = (X̄1 − X̄2)/[(s1² + s2²)/2]^{1/2}, which uses ni − 1 in the denominator to calculate each s², is generally a biased estimator of D′, but is only slightly biased when D′ = 1 and is not at all biased when D′ = 0. (There is also some bias when the usual pooled estimator is used.) The estimator d′ was also found to be unstable, the variance of d′ increasing with D′ and, not surprisingly, decreasing monotonically with increasing ns. The simulation supported the use of h = c_f d′, an adjusted estimator of D′ that corrects bias, where c_f = Γ(f/2)/[(f/2)^{1/2} Γ((f − 1)/2)], Γ is the gamma function, and f = [(n1 − 1)(n2 − 1)(s1² + s2²)²]/[(n1 − 1)s1⁴ + (n2 − 1)s2⁴]. The simulation assumed normality of X1 and X2. See Huynh (1989) for a detailed discussion of a method for addressing the instability of d′. However, issues of bias and instability aside, D′ is difficult to interpret because it uses the σ of a contrived population.
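A sketch of Huynh's adjusted estimator as reconstructed above; the pairing of the sample sizes with the s⁴ terms in f follows that reconstruction and should be verified against Huynh (1989):

```python
import numpy as np
from math import lgamma, exp, sqrt

def huynh_h(x1, x2):
    """Bias-adjusted estimate h = c_f * d', where d' standardizes by
    sqrt[(s1^2 + s2^2)/2]; a sketch following the reconstruction of
    Huynh (1989) given in the text above."""
    n1, n2 = len(x1), len(x2)
    s1_sq, s2_sq = np.var(x1, ddof=1), np.var(x2, ddof=1)
    d_prime = (np.mean(x1) - np.mean(x2)) / sqrt((s1_sq + s2_sq) / 2.0)
    # Approximate degrees of freedom f for the variance combination.
    f = ((n1 - 1) * (n2 - 1) * (s1_sq + s2_sq) ** 2) / (
        (n1 - 1) * s1_sq ** 2 + (n2 - 1) * s2_sq ** 2)
    # c_f = Gamma(f/2) / [(f/2)^(1/2) * Gamma((f - 1)/2)];
    # log-gamma avoids overflow for large f.
    c_f = exp(lgamma(f / 2.0) - lgamma((f - 1.0) / 2.0)) / sqrt(f / 2.0)
    return c_f * d_prime
```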
In the case of homoscedasticity, the pooled estimator, sp, is a less biased and less variable estimator of the common σ than is sc, an estimator that is only available when there is a control group. Thus, in two-group studies that rightly or wrongly assume homoscedasticity, Hedges' g = (X̄1 − X̄2)/sp (or the version that removes small-sample bias) is the most widely used estimator of D (Hedges & Olkin, 1984). However, one could argue that, if μs differ, homoscedasticity is unlikely and pooling is inappropriate. The formulas provided by Hedges and Olkin (1985)
for confidence limits for D also assume homoscedasticity.
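For reference, a minimal sketch of Hedges' g with its bias-removing version; the small-sample correction factor used here, J ≈ 1 − 3/(4df − 1), is the standard approximation and is an assumption of this sketch, not a formula given in the text:

```python
import numpy as np
from math import sqrt

def hedges_g(x1, x2, unbiased=True):
    """Hedges' g = (mean1 - mean2)/s_p; optionally apply the usual
    small-sample correction J ~ 1 - 3/(4*df - 1). Assumes
    homoscedasticity, as the text notes."""
    n1, n2 = len(x1), len(x2)
    df = n1 + n2 - 2
    s_p = sqrt(((n1 - 1) * np.var(x1, ddof=1)
                + (n2 - 1) * np.var(x2, ddof=1)) / df)
    g = (np.mean(x1) - np.mean(x2)) / s_p
    if unbiased:
        g *= 1.0 - 3.0 / (4.0 * df - 1.0)
    return g
```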
When comparing two groups by estimating a D in a multigroup study that assumes homoscedasticity, the commendable goal of attempting to obtain a less biased and less variable estimate of σ results in pooling within-group variances, sp = (MSW)^{1/2}. However, some researchers may be assuming homoscedasticity without testing for it, or having tested for it with one of the relatively low-power tests that are often provided in statistical packages (e.g., Levene, 1960) or with one of the possibly more powerful but still not very powerful tests (M. B. Brown & Forsythe, 1974; R. G. O'Brien, 1978; Weston & Hopkins, 1998) that nevertheless failed to detect heteroscedasticity. Moreover, the amount of heteroscedasticity that would be troublesome for estimating D is likely much less than what would be troublesome for F testing in ANOVA. Under heteroscedasticity, the estimator of Dp based on pooling, dp = (X̄1 − X̄2)/(MSW)^{1/2}, is again problematic because there is no common σ to be estimated (the noncentrality parameter is not defined).
In summary, when homoscedasticity is assumed, the most efficient estimator of the common σ is sp. However, using an sp that weights the two ss (approximately) proportionally to n1 and n2 is problematic under heteroscedasticity. In the latter case, one can use the s of whichever group is to be the comparison group to estimate that population's σ, or use s1 and s2 to estimate the square root of the mean of σ1² and σ2² so as to measure D′ by using σ′ (or the mean of σ1 and σ2). Finally, under heteroscedasticity, the choice of denominator should depend on the research context. For example, if the two genders are being compared, it may not be sensible to use σM or σF for the denominator for the purpose of estimating a DM or DF. On the other hand, using σ′ results in a D′ in a contrived population whose σ is the average of the male and female populations' σs.
Multigroup and Factorial Designs

Two overall estimators of standardized effect size for one-way ANOVA, dmax = (X̄max − X̄min)/sp and f = s_X̄/sp, where s_X̄ is the standard deviation of all of the sample means, assume homoscedasticity. The f is related to estimating η, the correlation between group membership and the dependent variable (Winer, Brown, & Michels, 1991), which also assumes homoscedasticity. Maxwell (1998) presented a promising measure of effect size for a powerful randomized longitudinal design for comparing two or more groups, but this method also assumes homoscedasticity, and the sensitivity of this measure to heteroscedasticity also is not known.
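A minimal sketch of these two pooled-denominator estimators (the function name is invented; both assume homoscedasticity, as noted above):

```python
import numpy as np

def overall_effect_sizes(groups):
    """d_max = (largest mean - smallest mean)/s_p and f = s_meanbar/s_p
    for a one-way design; both pool within-group variance and so
    assume homoscedasticity."""
    means = np.array([np.mean(g) for g in groups])
    n_total = sum(len(g) for g in groups)
    ss_within = sum((len(g) - 1) * np.var(g, ddof=1) for g in groups)
    s_p = np.sqrt(ss_within / (n_total - len(groups)))
    d_max = (means.max() - means.min()) / s_p
    f = means.std(ddof=0) / s_p  # SD of the k group means, divisor k
    return d_max, f
```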
When factorial designs are used, various choices for estimating σ arise, depending on which two means are being compared. For example, consider a 2 × 2 design in which one factor compares treatment and control and the other factor compares men and women. One might want to estimate D for treatment versus control overall. Alternatively, one might want to estimate treatment effect sizes separately for men and women as if only the single treatment factor had been involved. The latter makes sense if there is interaction. However, in a factorial design in which there is a classification factor, such as gender, and a treatment factor with respect to which effect sizes are to be estimated as if treatment were the only factor, a main effect of the classification factor reduces within-cell MS values. Oliver and Hyde (1995) cite some meta-analyses that have inflated estimates of D by not using formulas to correct for such reduction of MSW values. Correction formulas are provided by Glass et al. (1981) and Smith, Glass, and Miller (1980), and are extended to larger designs by Ray and Shadish (1996). However, the correction formulas assume homoscedasticity. The correction formulas by Glass and his coworkers that attempt to render estimates of D from factorial designs comparable to estimates from single-factor designs require SS values that are typically not available in meta-analysis. Morris and DeShon (1997) provided a modified correction formula that requires values of F and df instead of SS values, but the modification assumes homoscedasticity. Abelson and Prentice (1997) presented a measure of effect size for contrasts in two-way designs, but homoscedasticity was again assumed, and sensitivity to heteroscedasticity is not known.
Meta-Analysis: Converting Statistics to Estimates of D and Other Problems of Synthesis
The problem of heteroscedasticity is compounded
in meta-analysis. When attempting to cumulate effect
sizes, meta-analysts are often averaging different
kinds of estimates of different kinds of D arising from
the underlying primary studies. This is because primary researchers usually provide the values of test
statistics, such as t or F, but do not provide sufficient
information to calculate dc or dp directly. When using the primary researcher's t or two-group F to estimate D by means of t(1/ne + 1/nc)^{1/2} or [F(1/ne + 1/nc)]^{1/2}, meta-analysts indirectly assume homoscedasticity because these formulas so assume. When s²e > s²c, dc > dp, and when s²e < s²c, dc < dp. (Glass et al., 1981, provide numerical values for the relationship between s²e/s²c and dc/dp.) Therefore, whether attempting to estimate Dc or Dp, meta-analysts are often unwittingly averaging values of d, among some of which dp > dc and, among others, dp < dc. Unfortunately, the averaging procedure is not likely to result in overestimations canceling underestimations (Glass et al., 1981). The problem is compounded by the fact that one often has to convert to, and average, values of dp from a mix of primary studies that collectively provide a variety of forms of results, such as dp itself (rare), t, two-group F, one-way or factorial ANOVA F, one-way or factorial repeated measures ANOVA F, analysis of covariance F, change-score t, change-score F, reported p value, and inferable-only p value. Ray and Shadish (1996) empirically demonstrated that different calculated conversions to dp arising from the different forms of results reported in primary studies can vary, sometimes greatly. Also, heteroscedasticity can affect the power and rate of Type I error of the Hedges and Olkin (1985) Q statistic that is often used to test for the significance of the variability of the set of estimates of D that arise from the underlying primary studies. See Harwell's (1997) Monte Carlo study for details on the sensitivity of Q to heteroscedasticity under varying conditions, within the primary studies, of population VR, distributional shape, and ns.
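For reference, the two conversions mentioned at the start of this paragraph, in a minimal sketch (function names invented; both assume homoscedasticity, as the text stresses):

```python
from math import sqrt

def d_from_t(t, n_e, n_c):
    """Convert a two-group t to an estimate of D:
    d = t * (1/n_e + 1/n_c)**0.5."""
    return t * sqrt(1.0 / n_e + 1.0 / n_c)

def d_from_f(f_stat, n_e, n_c):
    """Two-group F: |d| = [F * (1/n_e + 1/n_c)]**0.5; the sign must be
    recovered from the direction of the mean difference."""
    return sqrt(f_stat * (1.0 / n_e + 1.0 / n_c))
```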
When different primary studies underlying a meta-analysis use different measures of the same latent dependent variable, the primary effect sizes may vary simply because varying measurement reliabilities result in varying values of sc. The dependent variable can be corrected for the unreliability that can reduce an uncorrected estimate of D by increasing s (Becker, 1996; Schmidt & Hunter, 1996). However, the correction yields an estimate of a D that is theoretically possible but not currently realizable in practice. Different estimates of primary Dp or De values can also arise because of varying values of se when, in some primary studies, subjects may show greater variation in responsiveness to a treatment than in other studies of the same treatment.
It is not surprising that meta-analysts rarely explicitly state the nature of the D parameter that they are
estimating with the various kinds of d values that are
being averaged. Moreover, if the type of estimate of D
is confounded with varying substantive characteristics
of the primary studies, interpretation of statistically
significant moderators of effect size will be misleading. Editors of journals would greatly enhance the
quality of primary studies and meta-analyses if they
would require that primary studies include all ns, X̄s, and ss. Because of heteroscedasticity, the Publication Manual of the American Psychological Association (American Psychological Association [APA], 1994) oversimplifies when it states that estimates of effect size are readily obtainable from statistics such as t or F with ns or dfs. The APA should consider endorsing the movement toward requiring researchers to archive data as a condition for research funding and publication (Eagly, 1997). Finally, because values of s can vary greatly when dfs are small, varying ns across studies can exacerbate the indeterminacy of σ and, therefore, of D.
Some Additional Proposed Solutions to Heteroscedasticity in Two-Group Comparisons
There are various additional proposed solutions to
the problem of heteroscedasticity. Some solutions attempt to improve estimation of D, whereas more radical solutions offer measures of effect size that are
conceptually different from D. A modest suggestion is
that researchers in the two-sample case reduce the
problem of heteroscedasticity by using ns > 10 and
either equal ns or ns in which neither group contains
more than 60% of the total number of participants
(Huynh, 1989; Kraemer, 1983). In research domains
in which the dependent variable is a score on a test
that has been normed on a vast sample, such as some
clinical and educational outcome studies, a possible
solution to the estimation problem exists. In this case
one can divide (X̄e − X̄n) by the sn of the normative (n) group to estimate Dn = (μe − μn)/σn (Kendall & Grove, 1988; Kendall, Marss-Garcia, Nath, & Sheldrick, 1999). The use of such a constant s is beneficial because, otherwise, values of (X̄1 − X̄2) may actually vary only slightly among primary studies whose values of d vary greatly simply because of varying values of s.
In the two-group case Rosenthal (1991) suggested transforming the data to equate variances before calculating d, but he came to prefer a reconceptualization of effect size as a point-biserial correlation, ρpb, a measure of effect size in terms of the strength and direction of the relationship between the grouping variable and the dependent variable. (A variance-stabilizing transformation of the data may greatly alter the values of D and possibly their interpretation, but this may be less of a problem for primary researchers who are using scales that are already contrived.) Trimming to reduce heteroscedasticity is possible before calculating rpb, but this may result in a value of rpb that estimates an ill-defined parameter (Wilcox, 1994). Although heteroscedasticity does not necessarily directly influence the value of ρpb, it can influence the t test that is used to test the significance of rpb, and converting d or t to rpb by the usual formulas assumes homoscedasticity. Also, if heteroscedasticity is associated with one or more outliers, the latter can greatly affect ρpb. Even slightly heavy tails in the distribution of the dependent variable scores can greatly affect ρpb and its associated confidence interval. Although the formula for rpb should be corrected for attenuation caused by unequal ns (yielding rpbc), this seems to be rarely done in practice. The attenuation increases with increasing disproportionality of ns and increasing magnitude of ρpb (McNemar, 1969). The corrected rpbc = a·rpb/[(a² − 1)r²pb + 1]^{1/2}, where a = [.25/(pq)]^{1/2} and p and q are the proportions of total sample size in each group (Hunter & Schmidt, 1990). When a meta-analyst averages uncorrected rpbs, some of the variability in rpbs may arise artifactually from varying degrees of inequality of sample sizes from primary study to primary study. Wilcox (1996) critiqued alternative measures of correlation in terms of resistance to outliers and provided Minitab macros for them. These rs may prove to be useful alternative robust estimators of correlational effect size measures.
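A sketch of the attenuation correction; the expression a = [.25/(pq)]^{1/2} is a reconstruction from the garbled original and should be checked against Hunter and Schmidt (1990):

```python
from math import sqrt

def r_pb_corrected(r_pb, p):
    """Correct a point-biserial r for attenuation due to an unequal
    group split; p is the proportion of the total N in Group 1.
    Uses a = [.25/(p*q)]**0.5, a reconstruction of the text's formula
    (Hunter & Schmidt, 1990)."""
    q = 1.0 - p
    a = sqrt(0.25 / (p * q))
    return a * r_pb / sqrt((a ** 2 - 1.0) * r_pb ** 2 + 1.0)

# With an equal split (p = .5), a = 1 and r is unchanged.
print(r_pb_corrected(0.30, 0.5))   # -> 0.30
print(r_pb_corrected(0.30, 0.8))   # larger in magnitude, about .37
```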
Measures Using Estimators That Are More Robust
We have seen that, if normality and homoscedasticity are assumed, D can appropriately be estimated by using X̄c, X̄e, and sp. Assuming normality, the usual interpretation of Dc locates μe as Dc × σc units away from μc. Thus, if μe > μc and, for example, dc = +1, it is estimated that the average experimental population subject scores at the 84th percentile (1 σc unit above μc) of the control group's population distribution. However, if there is heteroscedasticity associated with one or more outliers, normality does not hold and such interpretation of the estimate dc is invalid. Also, because σ can be very sensitive to shape, nonnormality can greatly affect the value of a D type of effect size, as compellingly demonstrated by Wilcox and Muska (1999) for heavy-tailed distributions. One of the solutions offered for this problem of estimation and interpretation is from Hedges and Olkin (1985), who suggested trimming the highest and lowest scores from the control group, replacing (X̄e − X̄c) with (Mdne − Mdnc), and replacing sc with the range of the trimmed data or some other measure of variability.
One such alternative measure of variability is the median absolute deviation from the median (MAD). A statistic should not only be resistant to outliers, as is MAD, but should also have relatively small sampling variability, to increase power and narrow confidence intervals. In this latter regard, the biweight standard deviation, sbw (Goldberg & Iglewicz, 1992; Lax, 1985), is superior to MAD. Calculation of s²bw by hand is laborious. First, calculate Yi = (Xi − Mdn)/[9(MAD)] and set ai = 1 if |Yi| < 1 and ai = 0 if |Yi| ≥ 1. Then

s²bw = n Σ ai (Xi − Mdn)² (1 − Yi²)⁴ / [Σ ai (1 − Yi²)(1 − 5Yi²)]².   (1)
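A direct transcription of Equation 1 into Python (the equation itself is a reconstruction of the garbled original, following the standard biweight midvariance; the function name is invented):

```python
import numpy as np

def biweight_midvariance(x):
    """Biweight midvariance s2_bw per Equation 1 (Lax, 1985):
    resistant to outliers, with smaller sampling variability
    than MAD."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    med = np.median(x)
    mad = np.median(np.abs(x - med))        # median absolute deviation
    y = (x - med) / (9.0 * mad)
    a = (np.abs(y) < 1.0).astype(float)     # a_i = 1 if |Y_i| < 1, else 0
    num = n * np.sum(a * (x - med) ** 2 * (1.0 - y ** 2) ** 4)
    den = np.sum(a * (1.0 - y ** 2) * (1.0 - 5.0 * y ** 2)) ** 2
    return num / den

# s_bw is the square root; (Mdn_e - Mdn_c)/s_bw,c is the suggested estimate.
```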
Wilcox (1997) presented an S-PLUS software function for calculating biweight midvariances (s²bw). Because raw scores are required to calculate s²bw, primary researchers, but not meta-analysts who are synthesizing unarchived data, can thus use (Mdne − Mdnc)/sbw,c as an estimate of effect size. Laird and Mosteller (1990) suggested dividing (Mdne − Mdnc) by .75(interquartile range of the control group) to provide resistance to outliers while maintaining a denominator that approximates sc. Under normality, .75(interquartile range) approximates s. However, the interquartile range is just one example of an interquantile range. Quantiles may be generally defined as scores that are equal to or greater than certain specified proportions of the other scores in the distribution. Results from Shoemaker (1999) indicate that a superior robust measure of spread may sometimes be obtained by using interquantile ranges of more extreme quantiles than are used in the interquartile range. These results suggest additional possible denominators for measuring effect size, but the method may perform poorly under extreme skew.
Kraemer and Andrews (1982) presented a nonparametric estimator of effect size, but this estimator requires a pretest-posttest design. Hedges and Olkin (1984, 1985) presented several nonparametric estimators, one of which, d*c, does not require homoscedasticity or pretest scores. Assuming normality, this method estimates Dc by using d*c = Φ⁻¹(pc), where Φ⁻¹ is the inverse normal cumulative distribution function and pc is the proportion of the control group scores that are below Mdne. Because it requires raw scores, this method is of use to primary researchers but, again, not to meta-analysts who are synthesizing unarchived data. Moreover, the sampling distribution of this estimator is not known, so significance testing and construction of confidence intervals are not developed.
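A minimal sketch of this estimator (function name invented); note that a pc of 0 or 1 yields an infinite estimate, a case the sketch does not handle:

```python
import numpy as np
from scipy.stats import norm

def d_star_c(x_e, x_c):
    """Hedges-Olkin nonparametric estimator d*_c = Phi^{-1}(p_c), where
    p_c is the proportion of control scores below the experimental
    group's median."""
    p_c = np.mean(np.asarray(x_c) < np.median(x_e))
    return norm.ppf(p_c)
```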
The methods reviewed in this section address heteroscedasticity and nonnormality as problems to be
circumvented when estimating a D rather than as
characteristics of data that can be accommodated by
new conceptual approaches to measuring effect size.
All of the suggested formulas are estimating parameters that are conceptually similar to D. Replacing σ with a more resistant measure of scale is only a computational solution. A more radical solution would
involve reconceptualization of effect size as a measure of the effect of a treatment throughout a distribution, not just at its center. Heteroscedasticity would
then no longer be a mere nuisance, as it is when
attempting to measure effect size in the traditional
manner, but a usable reflection of a treatment's effect on moments. The next two sections address this
issue.
Robust Estimation of Pr(X1 > X2)

An intuitively appealing measure of effect size in the two-group case is the probability that a randomly sampled member of a population given one treatment will have a score (X1) that is superior to the score (X2) of a randomly sampled member of a population given another treatment. A theoretically unbiased and consistent estimator of this probability has been called the probability of superiority (PS; Grissom, 1994a, 1994b, 1996). Theoretically, the PS has the smallest variance of all unbiased estimators of Pr(X1 > X2). The PS can be based on raw data that are continuous, ordinal, or ordinal-categorical (Grissom, 1994a, 1994b). The applicability of the PS estimator to ordinal data is an advantage in much psychological research, in which dependent variables are often likely to be monotonically, but not necessarily linearly, related to underlying latent variables. Monotonic transformations leave Pr(X1 > X2) invariant. The PS = U/(mn), where U is the Mann-Whitney statistic and m and n are the sample sizes. The value of U indicates the number of times that the m subjects given Treatment 1 have scores that outrank those of the n subjects given Treatment 2, assuming no ties or equal allocation of ties, in all possible comparisons of X1 and X2 values. There are mn such head-to-head comparisons possible. Therefore, the proportion of times that subjects given Treatment 1 are superior in their scores to subjects given Treatment 2 is U/(mn). For example, if .7 of the comparisons of all treated subjects with all control subjects result in better scores being observed in the treated group, PS = .7. A proportion in a sample estimates a probability in a population. Estimation of Pr(X1 > X2) from the PS is robust to nonnormality. Wilcox presented a Minitab macro (Wilcox, 1996) and S-PLUS software functions (Wilcox, 1997) for constructing a confidence interval for Pr(X1 > X2) based on Fligner and Policello's (1981) adjusted U′ statistic and on a method by Mee (1990) that appears to yield fairly accurate confidence levels.
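A minimal sketch of the PS computed directly from its definition as a proportion of head-to-head wins, with ties split evenly as described above (function name invented):

```python
import numpy as np

def probability_of_superiority(x1, x2):
    """PS = U/(m*n): the proportion of all m*n head-to-head comparisons
    in which a Group 1 score exceeds a Group 2 score, with ties (if any)
    counted as half, per the usual U-statistic convention."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    wins = (x1[:, None] > x2[None, :]).sum()   # Group 1 outscores Group 2
    ties = (x1[:, None] == x2[None, :]).sum()  # split ties evenly
    u = wins + 0.5 * ties
    return u / (len(x1) * len(x2))
```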
The Mann-Whitney U test is nonrobust to heteroscedasticity when testing for the equality of centers (Murphy, 1976; Pratt, 1964; Zimmerman & Zumbo, 1993) because values of U are functions of higher moments as well as means. However, sensitivity to the effect of a treatment on variance and on higher moments may be considered to be an advantage for a broader measure of effect size. In this context we are not testing H0: Mdn1 = Mdn2 against H1: Mdn1 ≠ Mdn2 but a broader H0: Pr(X1 > X2) = .5 against H1: Pr(X1 > X2) ≠ .5. The alternative H1: Mdn1 ≠ Mdn2 (any other location parameter could have been used) is an example of a special-case model, a shift model in which a treatment merely adds a constant to what the score of each participant would have been without the treatment. In this model, homoscedasticity is assumed because adding a constant to each score merely shifts a distribution without changing its spread. However, the alternative H1: Pr(X1 > X2) ≠ .5, tested against H0: Pr(X1 > X2) = .5, is the more general case of stochastic superiority of one treatment over another. Under this H1, the effect of a treatment need not be constant, so that the treatment may affect spread and, therefore, homoscedasticity need not be assumed. This general case is perhaps more realistic because treatments may well affect spread, and possibly shape, as well as the center of a distribution. Using the PS to test H0: Pr(X1 > X2) = .5 against H1: Pr(X1 > X2) ≠ .5, or against a one-tailed alternative, is a consistent test in the sense that its power approaches unity as sample sizes approach infinity.
When raw data are not available, the parameter Pr(X1 > X2) can be estimated by using the common language effect size statistic (CL; McGraw & Wong, 1992), which is calculated from sample means and variances. The CL is based on a z score, zCL = (X̄1 − X̄2)/(s1² + s2²)^{1/2}. The value of Pr(X1 > X2) is estimated by the proportion of the area of the normal curve below zCL. For example, if zCL = +1.00 or −1.00, CL = .84 or .16, respectively. The value of Pr(X1 > X2) is estimated to be .84 or .16 in these cases. This CL method assumes normality of X1 and X2, but it may not seem to some readers to require homoscedasticity because the variance of (X1 − X2) would be σ²(X1 − X2) = σ1² + σ2² regardless of the values of σ1² and σ2². However, in theory, the CL only estimates Pr(X1 > X2) under normality and homoscedasticity (Pratt & Gibbons, 1981). McGraw and Wong (1992), who introduced the CL to psychologists, assumed homoscedasticity and reported Monte Carlo simulations that indicated fairly good performance in the face of nonnormality but somewhat poorer performance under joint nonnormality and heteroscedasticity. In theory, the CL is not quite an unbiased estimator unless it is adjusted (Pratt & Gibbons, 1981).
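A minimal sketch of the CL as defined above (function name invented; scipy's normal CDF supplies the normal-curve area):

```python
import numpy as np
from math import sqrt
from scipy.stats import norm

def common_language_es(x1, x2):
    """CL (McGraw & Wong, 1992): estimate Pr(X1 > X2) as the
    normal-curve area below z_CL = (mean1 - mean2)/sqrt(s1^2 + s2^2).
    Assumes normality and, in theory, homoscedasticity."""
    z_cl = (np.mean(x1) - np.mean(x2)) / sqrt(
        np.var(x1, ddof=1) + np.var(x2, ddof=1))
    return norm.cdf(z_cl)
```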
Various statistics and estimates of effect size can be converted to approximate values of zCL (Grissom, 1994a). For example, if n1 = n2 = n, zCL = −t/n^{1/2}, zCL = −.707d, and zCL = −rpb[2(n − 1)]^{1/2}/[n(1 − r²pb)]^{1/2}. For conversion formulas when n1 ≠ n2, see Grissom, 1994a. All of these conversion formulas assume homoscedasticity and normality, and their sensitivity to violation of these assumptions is unknown.
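The three equal-n conversions, transcribed as printed (the negative signs follow the orientation convention of Grissom, 1994a, and should be checked against that source; function names invented):

```python
from math import sqrt

def z_cl_from_t(t, n):
    return -t / sqrt(n)

def z_cl_from_d(d):
    return -0.707 * d          # i.e., -d/sqrt(2)

def z_cl_from_r_pb(r_pb, n):
    return -r_pb * sqrt(2 * (n - 1)) / sqrt(n * (1 - r_pb ** 2))
```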
Estimates of Pr(X1 > X2) have been used in a meta-analysis (Mosteller & Chalmers, 1992), as well as in a meta-meta-analysis that was compelled to rely on the previously mentioned possibly nonrobust conversion formulas for zCL because of the unavailability of raw data (Grissom, 1996). Again, the archiving of raw data would benefit meta-analyses that attempt to estimate Pr(X1 > X2). For a meta-analysis using the PS, the mean of the values of the PS should be obtained by weighting each PS by the reciprocal of its variance. For the variance, use (1/12)[(1/m) + (1/n) + (1/mn)]. For an alternative approach, see Colditz, Miller, and Mosteller (1988).
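A minimal sketch of the suggested weighting scheme (function name invented):

```python
import numpy as np

def weighted_mean_ps(ps_values, ms, ns):
    """Inverse-variance weighted mean of PS estimates across studies,
    using Var(PS) = (1/12)*(1/m + 1/n + 1/(m*n)) as given in the text."""
    ps = np.asarray(ps_values, dtype=float)
    ms = np.asarray(ms, dtype=float)
    ns = np.asarray(ns, dtype=float)
    var = (1.0 / 12.0) * (1.0 / ms + 1.0 / ns + 1.0 / (ms * ns))
    w = 1.0 / var
    return np.sum(w * ps) / np.sum(w)
```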
In the two-sample case, Pr(X1 > X2) is a useful measure of effect size, but this measure may not be transitive in the k > 2 case. Group 1 may score stochastically higher than Group 2, and Group 2 may score higher than Group 3, but it is possible in this case that Pr(X3 > X1) > .5. For an approximate CL approach to estimating Pr(X1 > X2 and X1 > X3 and . . .), see McGraw and Wong (1992). Also see Whitney's (1951) extension of the U test for k > 2 and a discussion and tables of critical values for this extension in Mosteller and Bush (1954). For more discussion of Pr(X1 > X2) and its estimators, see Lehmann (1975), Laird and Mosteller (1990), Pratt and Gibbons (1981), and Vargha and Delaney (1998). Pratt and Gibbons (1981) discuss various methods for dealing with tied scores.
Empirical Comparisons of PS and CL
Wolfe and Hogg (1971) asserted that the estimates from the PS and the CL do not differ "too much" in practice. To test this assertion with real and then simulated data, we first searched two published manuals of actual data to find sets of data that compared two independent groups with respect to some psychologically relevant dependent variable. Eleven such sets were found, and the PS and the CL were used to calculate estimates of Pr(X1 > X2) for each set. In the four least discrepant of the 11 pairs of estimates from the PS and the CL, the larger estimate in each case was less than 3% larger than the smaller of the two estimates. However, in the three comparisons that resulted in the greatest discrepancies, the larger estimate was 12.1%, 15.8%, or 15.9% larger than the smaller estimate. Next, we inspected preliminary results from Monte Carlo comparisons of the behaviors of the PS and the CL. Results as of this writing are based on n1 = n2, normality, population VRs ranging from 1 to 8, and 25,000 samples for each simulated condition. Generally, when μ1 = μ2, the means of the sampling distributions of the PS and the CL are very similar and the correlation between the two sets of estimates is well over .9. However, as the difference between μ1 and μ2 increases, this correlation sometimes decreases to a value as low as approximately .2, and the CL tends to exhibit more sampling error than does the PS. Even when assuming normality and homoscedasticity, primary researchers who are estimating Pr(X1 > X2) should consider reporting both the CL and the PS.
Multiple-Valued Measures
Measures of effect size for comparing two groups
can be misleading when there is heteroscedasticity,
different shapes of distributions, or both. For example, suppose that a treatment makes some experimental subjects perform better and some perform
worse than they would have had they been in the
comparison group. In this case, the experimental
group's variability will be greater or less than that of
the comparison group, depending on whether it is the
better or poorer performing subjects who are helped
or harmed by the experimental treatment, but the two
means or medians may be nearly the same. In this
case, using (X̄e − X̄c) or (Mdne − Mdnc) in the numerator yields an estimated effect size near zero, but the treatment clearly has had an effect in the tails, if not in the center, of the distribution. In another case, the two centers and the two variabilities may differ. For example, if the more variable group has the higher mean (not uncommon), the proportions of its members among the high scorers and among the low scorers can be different from what is implied by the estimate of D. Consider a case in which, compared with controls, experimental participants have a small superiority, say, D = +.3 (using the σ of the combined distribution), and a σe² that is only 15% greater than that of the control group. Even in this unexceptional case, if normality is assumed, approximately 2.5 times as many treated as control subjects would be in the highest 5% of the combined distribution (Hedges & Nowell, 1995). For further discussion and examples see Feingold (1992b, 1995) and P. C. O'Brien (1988). The kinds of problems that have been discussed can arise even under homoscedasticity if there is nonhomomerity (inequality of shapes of distributions).
Informative methods have been proposed for expressing group differences at different portions of a distribution, but at the cost of greater complexity. For example, Hedges and Friedman (1993), assuming normality for both populations, defined effect size in a portion of a tail beyond a fixed value, Xα, in a distribution of scores combined from Group 1 and Group 2. This Dα is the difference between μα1 and μα2, for those scores that are beyond Xα, divided by the σ for those scores, σα. The value of Xα is chosen as a score that is exceeded by some k% of the scores in the combined distribution. Thus, if Xα is the score at the 100α percentile point of the combined distribution, then the effect in the tail beyond Xα is Dα = (μα1 − μα2)/σα. Extensive computational details are found in the appendix of Hedges and Friedman (1993). The computations of the estimates of Dα are repeated for various values of k that are of interest to the researcher. Note that when the overall μ1 ≠ μ2, the combined distribution departs from normality.
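A rough sketch of the idea, not the published estimator: it standardizes the mean difference within the top 100k% of the combined distribution and assumes both groups have scores beyond the cutoff; Hedges and Friedman's (1993) appendix gives the proper computations:

```python
import numpy as np

def tail_effect_size(x1, x2, k=0.05):
    """Sketch of a tail effect size in the spirit of Hedges and
    Friedman (1993): standardize the mean difference within the top
    100*k% of the combined distribution."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    cutoff = np.quantile(np.concatenate([x1, x2]), 1.0 - k)
    t1, t2 = x1[x1 > cutoff], x2[x2 > cutoff]    # tail scores per group
    sigma_tail = np.concatenate([t1, t2]).std(ddof=1)
    return (t1.mean() - t2.mean()) / sigma_tail
```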
Wilcox (1995, 1996, 1997) presented references for a variety of graphical methods and illustrated a method, based on the work of Doksum (1977), for comparing two groups separately at various quantiles. This method results in a graph of "shift functions" that indicates whether a treatment becomes more or less effective as we observe from the poorer performing to the better performing of the control group's members. Each shift function indicates how far the comparison group has to be shifted to attain the outcome of the treated group at each quantile of interest. The control group's scores at their various qth quantiles, the Xqc values, are plotted against the differences between Xqc and Xqe, the corresponding quantiles of the experimental group's scores, defining the shift function ΔXqc = Xqe − Xqc. It is more informative to compare distributions at various quantiles, such as deciles, than to compare them only at their centers, such as their means or medians (.50 quantile, 5th decile). Wilcox (1996) provided a Minitab macro for estimating shift functions and another for constructing confidence intervals for the differences between the two groups' deciles at various deciles throughout the comparison group's distribution. Wilcox (1997) also provided S-PLUS software functions for making robust inferences about shift functions and for constructing robust simultaneous confidence intervals for them. Lunneborg (1986) provided discussion and a method for constructing a confidence interval for the difference between corresponding quantiles of two distributions.
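A minimal sketch of a decile-level shift function in the spirit of Doksum (1977); the function name is invented, and the robust inference Wilcox describes is not attempted here:

```python
import numpy as np

def decile_shift_function(x_c, x_e):
    """For each control decile X_qc, how far must the control group be
    shifted to match the experimental group at that quantile
    (delta_q = X_qe - X_qc)?"""
    qs = np.arange(0.1, 1.0, 0.1)        # deciles .1 through .9
    x_qc = np.quantile(x_c, qs)
    x_qe = np.quantile(x_e, qs)
    return qs, x_qc, x_qe - x_qc

# Plotting the third return value against the second gives the
# shift-function graph described above.
```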
A general definition of quantiles suits the purpose of
the present article. However, statistical packages use a
variety of definitions of sample quantiles (Hyndman
& Fan, 1996). For an introduction to quantiles, see
Hoaglin, Mosteller, and Tukey (1985). See Dielman,
Lowry, and Pfaffenberger (1994) for results of Monte
Carlo studies of the behavior of 10 supposedly distribution-free estimators of quantiles for a variety of
distributions. For discussion of distribution-free univariate and multivariate methods for comparing the
probability density functions of two groups, see Silverman (1986) and Izenman (1991).
Descriptive graphical methods for depicting differences between two distributions at levels in addition to their centers are the Wilk and Gnanadesikan (1968) percentile comparison graph and the Tukey (1984) sum-difference graph, both discussed by Cleveland (1985), and the ordinal dominance curve (Darlington, 1973), which is similar to the percentile comparison graph. Development of additional methods for graphic depiction of effect sizes may promote the greater use of measures of effect size. A simple example is the depiction of two boxplots, one above the other, using the Minitab STACK command.
Postscripts
There is an embarrassment of riches in the variety
of solutions to the problem of heteroscedasticity in
significance testing (Grissom, 2000; Wilcox, 1996,
1997). Therefore, future primary studies may become
a mix of reports of means, transformed means,
trimmed means, medians, various other resistant measures of location, shift-functions, CLs, and pointbiserial and other measures of correlation. If so, some
of these studies' tests of significance may be better at
controlling rates of Type I and Type II error under
heteroscedasticity by avoiding the use of means, as
encouraged by Wilcox (1996, 1997), but the effect
size synthesizing problems of meta-analysts will be
greatly exacerbated.
Workers in each area of research should learn about
the degree of heteroscedasticity typically arising from
its types of participants, measures, and treatments.
Therefore, estimation of population VRs should be
encouraged. In addition, development of a method for
constructing simultaneous confidence intervals for
population VRs that is robust to nonnormality would
be very useful. Descriptive meta-analyses of sample
VRs may be conducted (Feingold, 1992a). However, one should not average values of s²a/s²b directly because, even if there is equality of population variances, variability of sample variances will cause the mean of the ratio s²a/s²b to be equal to or greater than 1. For example, suppose that σ²a = σ²b and, because of sampling variability, s²a = 2s²b and s²b = 2s²a equally often across primary studies, yielding primary VR values of 1/2 = .5 and 2/1 = 2.0 equally often. The arithmetic mean of .5 and 2 is not 1 but 1.25. Feingold (1992a) and Shaffer (1992) discussed proper methods of cumulating VRs, such as using the median VR or the mean of the logarithms of the variance ratios.
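A minimal sketch of these two cumulation summaries (function name invented; the log-based summary amounts to a geometric mean):

```python
import numpy as np

def cumulate_vrs(vrs):
    """Summarize sample VRs across studies without the upward bias of
    the arithmetic mean: report the median VR and the back-transformed
    mean of log(VR)."""
    vrs = np.asarray(vrs, dtype=float)
    return np.median(vrs), np.exp(np.mean(np.log(vrs)))

# VRs of .5, 1.0, and 2.0: the arithmetic mean is about 1.17, but both
# summaries below equal 1.0, as they should under equal population variances.
print(cumulate_vrs([0.5, 1.0, 2.0]))
```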
Sampling distributions for additional heteroscedastic estimators of effect size that do not assume normality are also needed so that significance of estimates can be tested and confidence intervals can be constructed, as they can be for the Pr(X1 > X2) measure. Monte Carlo simulations show that coverage for lower confidence bounds for Pr(X1 > X2) is generally very close to the nominal .95 level when the PS estimator is used (Mee, 1990).
References
Abelson, R. P., & Prentice, D. A. (1997). Contrast tests of interaction hypotheses. Psychological Methods, 2, 315-328.
American Psychological Association. (1994). Publication
manual of the American Psychological Association (4th
ed.). Washington, DC: Author.
Becker, G. (1996). Bias in the assessment of gender differences. American Psychologist, 51, 154-155.
Brown, M. B., & Forsythe, A. B. (1974). Robust tests for
equality of variances. Journal of the American Statistical
Association, 69, 364-367.
Brown, R. A., Evans, D. M., Miller, I. W., Burgess, E. S., &
Mueller, T. I. (1997). Cognitive-behavioral treatment for
depression in alcoholism. Journal of Consulting and
Clinical Psychology, 65, 715-726.
Bryk, A. S., & Raudenbush, S. W. (1988). Heterogeneity of
variance in experimental studies: A challenge to conventional interpretations. Psychological Bulletin, 104, 396-404.
Cleveland, W. S. (1985). The elements of graphing data.
Monterey, CA: Wadsworth.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New York: Academic Press.
Colditz, G. A., Miller, J. N., & Mosteller, F. (1988). Measuring gain in the evaluation of medical technology: The
probability of a better outcome. International Journal of
Technology Assessment in Health Care, 4, 637-642.
Conti, L., & Musty, R. E. (1984). The effects of delta-9-tetrahydrocannabinol injections to the nucleus accumbens on the locomotor activity of rats. In S. Agurell, W.
L. Dewey, & R. E. White (Eds.), The cannabinoids:
Chemical, pharmacologic, and therapeutic aspects. New
York: Academic Press.
Darlington, M. L. (1973). Comparing two groups by simple
graphs. Psychological Bulletin, 79, 110-116.
Dielman, T., Lowry, C., & Pfaffenberger, R. (1994). A comparison of quantile estimators. Communications in Statistics B: Simulation and Computation, 23, 355-371.
Doksum, K. A. (1977). Some graphical methods in statistics: A review and some extensions. Statistica Neerlandica, 31, 53-68.
Eagly, A. (1997). Data archiving in psychology. Science
Agenda, 10, 14.
Feingold, A. (1992a). Cumulation of variance ratios. Review
of Educational Research, 62, 433-434.
Feingold, A. (1992b). Sex differences in variability in intellectual abilities: A new look at an old controversy.
Review of Educational Research, 62, 61-84.
Feingold, A. (1995). The additive effects of differences in
central tendency and variability are important in comparisons between groups. American Psychologist, 50, 5-13.
Feske, U., & Goldstein, A. J. (1997). Eye movement desensitization and reprocessing treatment for panic disorder:
A controlled outcome and partial dismantling study.
144
GRISSOM AND KIM
Journal of Consulting and Clinical Psychology, 65,
1026-1035.
Fisher, L. D., & van Belle, G. (1993). Biostatistics: A methodology for the health sciences. New York: Wiley.
Fleiss, J. L. (1986). The design and analysis of clinical
experiments. New York: Wiley.
Fligner, M. A., & Policello II, G. E. (1981). Robust rank
procedures for the Behrens-Fisher problem. Journal of
the American Statistical Association, 76, 162-168.
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Thousand Oaks, CA: Sage.
Goldberg, K. M., & Iglewicz, B. (1992). Bivariate extensions of the boxplot. Technometrics, 34, 307-320.
Grissom, R. J. (1994a). Probability of the superior outcome
of one treatment over another. Journal of Applied Psychology, 79, 314-316.
Grissom, R. J. (1994b). Statistical analysis of ordinal categorical status after therapies. Journal of Consulting and
Clinical Psychology, 62, 281-284.
Grissom, R. J. (1996). The magical number .7 ± .2: Meta-meta-analysis of the probability of superior outcome in comparisons involving therapy, placebo, and control. Journal of Consulting and Clinical Psychology, 64, 973-982.
Grissom, R. J. (2000). Heterogeneity of variance in clinical
data. Journal of Consulting and Clinical Psychology, 68,
155-165.
Harwell, M. (1997). An empirical study of Hedges's homogeneity test. Psychological Methods, 2, 219-231.
Hedges, L. V., & Friedman, L. (1993). Gender differences
in variability in intellectual abilities: A reanalysis of Feingold's results. Review of Educational Research, 63, 94-105.
Hedges, L. V., & Nowell, A. (1995). Sex differences in
mental test scores, variability, and numbers of high-scoring individuals. Science, 269, 41-45.
Hedges, L. V., & Olkin, I. (1984). Nonparametric estimators
of effect size in meta-analysis. Psychological Bulletin,
96, 573-580.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for
meta-analysis. San Diego, CA: Academic Press.
Hekmat, H. (1973). Systematic versus semantic desensitization and implosive therapy: A comparative study. Journal of Consulting and Clinical Psychology, 40, 202-209.
Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (Eds.). (1985).
Exploring data tables, trends, and shapes. New York:
Wiley.
Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (Eds.). (1991).
Fundamentals of exploratory analysis of variance. New
York: Wiley.
Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis. Newbury Park, CA: Sage.
Huynh, C. L. (1989, March). A unified approach to the
estimation of effect size in meta-analysis. Paper presented
at the Annual Meeting of the American Educational Research Association, San Francisco. (ERIC Document Reproduction Service No. ED 306 248).
Hyndman, R. J., & Fan, Y. (1996). Sample quantiles in
statistical packages. The American Statistician, 50, 361-365.
Izenman, A. J. (1991). Recent developments in nonparametric density estimation. Journal of the American Statistical
Association, 86, 205-224.
Kendall, P. C., & Grove, W. M. (1988). Normative comparisons in therapy outcome. Behavioral Assessment, 10,
147-158.
Kendall, P. C., Marss-Garcia, A., Nath, S. R., & Sheldrick,
R. C. (1999). Normative comparisons for the evaluation
of clinical significance. Journal of Consulting and Clinical Psychology, 67, 285-299.
Keppel, G. (1991). Design and analysis: A researcher's
handbook (3rd ed.). Englewood Cliffs, NJ: Prentice-Hall.
Kraemer, H. C. (1983). Theory of estimation and testing of
effect sizes: Use in meta-analysis. Journal of Educational
Statistics, 8, 93-101.
Kraemer, H. C., & Andrews, G. (1982). A non-parametric
technique for meta-analysis effect size calculation. Psychological Bulletin, 91, 404-412.
Laird, N. M., & Mosteller, F. (1990). Some statistical methods for combining experimental results. International
Journal of Technology Assessment in Health Care, 6,
5-30.
Lax, D. A. (1985). Robust estimators of scale: Finite sample
performance in long-tailed symmetric distributions. Journal of the American Statistical Association, 80, 736-741.
Lehmann, E. L. (1975). Nonparametrics. San Francisco:
Holden-Day.
Levene, H. (1960). Robust tests for equality of variances. In
I. Olkin (Ed.), Contributions to probability and statistics:
Essays in honor of Harold Hotelling (pp. 278-292). Stanford, CA: Stanford University Press.
Lix, L. M., Cribbie, R., & Keselman, H. J. (1996, June). The
analysis of completely randomized univariate designs.
Paper presented at the annual meeting of the Psychometric Society, Banff, Alberta, Canada.
Lunneborg, C. E. (1986). Confidence intervals for a quantile
contrast: Application of the bootstrap. Journal of Applied
Psychology, 71, 451-456.
Maxwell, S. E. (1998). Longitudinal designs in randomized
group comparisons: When will intermediate observations
increase statistical power? Psychological Methods, 3,
275-290.
McGraw, K. O., & Wong, S. P. (1992). A common language effect size statistic. Psychological Bulletin, 111,
361-365.
McNemar, Q. (1969). Psychological statistics (4th ed.).
New York: Wiley.
Mee, R. W. (1990). Confidence intervals for probabilities
and tolerance regions based on a generalization of the
Mann-Whitney statistic. Journal of the American Statistical Association, 85, 793-800.
Micceri, T. (1989). The unicorn, the normal curve, and other
improbable creatures. Psychological Bulletin, 105, 156-166.
Morris, S. B., & DeShon, R. P. (1997). Correcting effect
sizes computed from factorial analysis of variance for use
in meta-analysis. Psychological Methods, 2, 192-199.
Mosteller, F., & Bush, R. R. (1954). Selected quantitative
techniques. In G. Lindzey & E. Aronson (Eds.), Handbook of social psychology (Vol. 1, pp. 289-334). Cambridge, MA: Addison-Wesley.
Mosteller, F., & Chalmers, T. C. (1992). Some progress and
problems in meta-analysis of clinical trials. Statistical
Science, 7, 227-236.
Murphy, B. P. (1976). Comparison of some two sample
means tests by simulation. Communications in Statistics
B: Simulation and Computation, 5, 23-32.
Norusis, M. J. (1995). SPSS 6.1 guide to data analysis.
Englewood Cliffs, NJ: Prentice Hall.
O'Brien, P. C. (1988). Comparing two samples: Extensions
of the t, rank-sum, and log-rank tests. Journal of the
American Statistical Association, 83, 52-61.
O'Brien, R. G. (1978). Robust techniques for testing heterogeneity of variance. Psychometrika, 43, 327-342.
Oliver, M. B., & Hyde, J. S. (1995). Gender differences in
attitudes toward homosexuality: A reply to Whitely and
Kite. Psychological Bulletin, 117, 155-158.
Pratt, J. W. (1964). Obustness [sic] of some procedures for
the two-sample location problem. American Statistical
Association Journal, 59, 665-680.
Pratt, J. W., & Gibbons, J. D. (1981). Concepts of nonparametric theory. New York: Springer-Verlag.
Raudenbush, S. W., & Bryk, A. S. (1987). Examining correlates of diversity. Journal of Educational Statistics, 12,
241-269.
Ray, J. W., & Shadish, W. R. (1996). How interchangeable
are different measures of effect size? Journal of Consulting and Clinical Psychology, 64, 1316-1325.
Rosenthal, R. (1991). Meta-analytic procedures for social
research. Newbury Park, CA: Sage.
Sawilowsky, S. S., & Blair, R. C. (1992). A more realistic
look at the robustness and Type II error properties of the
t test to departures from population normality. Psychological Bulletin, 111, 352-360.
Schmidt, F. L., & Hunter, J. E. (1996). Measurement error
in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1, 199-223.
Shaffer, J. P. (1992). Caution on the use of variance ratios:
A comment. Review of Educational Research, 62, 429-432.
Shoemaker, L. H. (1999). Interquantile tests for dispersion
in skewed distributions. Communications in Statistics B:
Simulation and Computation, 28, 189-205.
Silverman, B. W. (1986). Density estimation for statistics
and data analysis. New York: Chapman & Hall.
Smith, M. L., Glass, G. V., & Miller, T. I. (1980). The
benefits of psychotherapy. Baltimore: Johns Hopkins
University Press.
Snedecor, G. W., & Cochran, W. G. (1989). Statistical
methods (8th ed.). Ames: Iowa State University Press.
Tomarken, A. J., & Serlin, R. C. (1986). Comparison of
ANOVA alternatives under variance heterogeneity and
specific noncentrality structures. Psychological Bulletin,
99, 90-99.
Tukey, J. W. (1984). The collected works of John W. Tukey.
Monterey, CA: Wadsworth.
Vargha, A., & Delaney, H. D. (1998). The Kruskal-Wallis
test and stochastic homogeneity. Journal of Educational
and Behavioral Statistics, 23, 170-192.
Weisz, J. R., Weiss, B., Han, S. S., Granger, D. A., &
Morton, T. (1995). Effects of psychotherapy with children and adolescents revisited: A meta-analysis of treatment outcome studies. Psychological Bulletin, 117, 450-468.
Weston, T., & Hopkins, K. D. (1998). Testing for homogeneity of variance: An evaluation of current practice. Unpublished manuscript, University of Colorado at Boulder.
Whitney, D. R. (1951). A bivariate extension of the U statistic. Annals of Mathematical Statistics, 22, 274-282.
Wilcox, R. R. (1987). New designs in analysis of variance.
Annual Review of Psychology, 38, 29-60.
Wilcox, R. R. (1994). The percentage bend correlation coefficient. Psychometrika, 59, 601-616.
Wilcox, R. R. (1995). Comparing two independent groups
via multiple quantiles. The Statistician, 44, 91-99.
Wilcox, R. R. (1996). Statistics for the social sciences. San
Diego, CA: Academic Press.
Wilcox, R. R. (1997). Introduction to robust estimation and
hypothesis testing. San Diego, CA: Academic Press.
Wilcox, R. R., & Muska, J. (1999). Measuring effect size: A
non-parametric analogue of ω². British Journal of Mathematical and Statistical Psychology, 52, 93-110.
Wilk, M. B., & Gnanadesikan, R. (1968). Probability plotting methods for the analysis of data. Biometrika, 55,
1-17.
Winer, B. J., Brown, D. R., & Michels, K. M. (1991). Statistical principles in experimental design (3rd ed.). New
York: McGraw-Hill.
Wolfe, D. A., & Hogg, R. V. (1971). On constructing statistics and reporting data. The American Statistician, 25, 27-30.
Zimmerman, D. M., & Zumbo, B. D. (1993). Rank transformations and the power of the Student t test and Welch t′ test for non-normal populations with unequal variances. Canadian Journal of Experimental Psychology, 47, 523-539.
Received January 20, 1999
Revision received August 28, 2000
Accepted December 21, 2000