J. Paediatr. Child Health (2003) 39, 309–311
Statistics for Clinicians
8: Non-parametric methods for continuous or ordered data
JB Carlin1,3 and LW Doyle2,3,4
1Clinical Epidemiology and Biostatistics Unit, Murdoch Childrens Research Institute, Departments of 2Obstetrics and
Gynaecology and 3Paediatrics, University of Melbourne and 4Division of Newborn Services, Royal Women’s Hospital,
Melbourne, Victoria, Australia
In previous articles in this series, we have discussed the use of
the t-test and related confidence intervals for comparing (two)
groups on the basis of a continuous outcome measure.1,2 These
procedures rely on certain assumptions: in particular, that the
data follow a normal distribution – unless we are dealing with
large samples, where normality of the sample means can be
assumed regardless of the distribution of the individual values.
When we use the t-test methods with small samples, we are
making a parametric assumption that the outcome values can
be modelled by a normal distribution, characterized by two
parameters: the mean and standard deviation (SD). In general,
non-parametric methods avoid specific modelling assumptions
such as this.
In the present article, we discuss non-parametric methods
commonly used for comparing continuous outcomes between
two groups. These are generally used with variables that are
badly skewed. As a quick check, if the SD of a continuous
variable that cannot take negative values is similar to or greater
than its mean, the distribution must be skewed. However, there
are times when the mean exceeds the SD, but the data are still
skewed (this is nearly always obvious when the data are plotted,
as recommended as a first step in the second article in the
series).3 The most common
non-parametric method is the Mann–Whitney U-test (or
Wilcoxon rank sum test, which is equivalent), a version of
which had appeared in 15% of all original articles in our
previous survey of statistics appearing in the Journal of Paediatrics and Child Health.4 The underlying logic of this test
involves ranking the data values, usually from lowest to
highest, and then comparing the sums of the ranks in each
group. This comparison provides an intuitive measure of the
extent to which one group tends to have higher values than
another, and can be regarded as the ‘signal’ component of the
signal : noise ratio for this procedure.
We illustrate the Wilcoxon rank sum computation using the
data in Table 1, which gives the days of assisted ventilation in
boys and girls in a modified subset of very low birthweight
(VLBW) infants from the dataset described in the second
article in the series.3 Note that boys needed longer durations of
assisted ventilation than girls. This is reflected in the mean
values for the two groups, but we also see that the SDs are high
relative to the means, indicating substantial skewness in the
distributions. To calculate the test statistic, we first need to rank
the observations across both groups; the ranks are shown in
the table adjacent to the data values. The difference between
the groups is reflected in the fact that the ranks for boys are
generally higher than the ranks for girls. This is summarized in
the sums of the ranks, although these need to be interpreted in
light of the number of observations in each group. The sum of the ranks
in one of the groups is used to calculate a test statistic for which
a P-value can be computed, under the null hypothesis that the
two groups have the same distribution of values. The ‘signal’ is
the difference between the observed sum of the ranks and its
expected value if there were no difference between the groups.
For moderate to large sample sizes, a P-value is obtained by
comparing the ‘signal’ with its standard error (or ‘noise’
measure); for small samples, an exact probability is available in
tables. We omit the details of these calculations, but they are
available in standard texts5 and statistical packages. Although
we have described the Wilcoxon version of the non-parametric
test that was independently developed in a different form by
Mann and Whitney, we recommend referring to the method as
the Mann–Whitney U-test to avoid confusion with the paired
Wilcoxon signed-rank test that we describe below. Interested readers are referred
elsewhere for details of the Mann–Whitney U version of the
calculation.6
For the data in Table 1, the sum of the ranks for the boys is
128. It can be shown that, with nine observations in the first
group (boys) and 12 in the second (girls), the expected sum of ranks
for boys under the null hypothesis is 99. The difference, therefore, is 29. Dividing by its
standard error, which turns out to be 14.1, gives a z-value of
2.06, with a corresponding P-value of 0.04. The conclusion
would be that we have moderately strong evidence that boys
need longer durations of assisted ventilation.
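The arithmetic in this worked example can be reproduced with a short sketch. The Python below is illustrative only: it assumes there are no tied values (true for the Table 1 data), uses the large-sample normal approximation without a continuity correction, and the helper name `rank_sum_test` is hypothetical rather than part of any statistical package.

```python
import math

# Days of assisted ventilation from Table 1
boys = [13, 17, 26, 27, 40, 51, 62, 116, 204]
girls = [1, 2, 3, 5, 8, 19, 20, 23, 29, 30, 41, 52]


def rank_sum_test(group1, group2):
    """Wilcoxon rank sum test by normal approximation (assumes no ties)."""
    combined = sorted(group1 + group2)
    # Rank 1 is the lowest value; valid only because every value is distinct
    rank_of = {value: rank for rank, value in enumerate(combined, start=1)}
    n1, n2 = len(group1), len(group2)
    observed = sum(rank_of[value] for value in group1)
    # Sum of ranks expected for group1 if the groups do not differ
    expected = n1 * (n1 + n2 + 1) / 2
    # Standard error of the rank sum (the 'noise')
    se = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (observed - expected) / se
    # Two-sided tail area of the standard normal distribution
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return observed, expected, se, z, p_two_sided


observed, expected, se, z, p = rank_sum_test(boys, girls)
print(observed, expected, round(se, 1), round(z, 2), round(p, 2))
# prints: 128 99.0 14.1 2.06 0.04
```

With tied values the ranks would need to be averaged and the standard error adjusted, which this sketch does not attempt.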
The Mann–Whitney U-test is often presented as a comparison of medians. Although with skewed data the median
generally represents a better summary of the distribution, as
already stated, the test actually compares distributions, not
medians.7 Unlike other tests described in previous articles in
the series, such as the t-test and the χ2 test, we cannot check
the accuracy of someone else’s Mann–Whitney U-test using
summary statistics (such as medians) without the raw data.
As with means, the difference between medians can be
expressed along with a 95% confidence interval (CI). This
conveys both the size of the difference, reflecting its clinical
significance, and its statistical significance at the 5% level. In
the present example, the difference in medians is 20.5 days, and
the 95% CI ranges from 3 to 60 days. Such intervals should be reported
more often in clinical journals, and the calculation is available in
some statistical packages, as well as in some stand-alone
programs.8
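One way to obtain an interval of this kind is a rank-based method built on all pairwise differences between the two groups. The sketch below is an assumption about the approach, not a record of how the interval quoted above was computed; the helper name `median_difference_ci` is hypothetical, and the cut-off K uses a normal approximation rather than exact tables. For the Table 1 data it returns limits of 3 and 60 days.

```python
import math
from statistics import NormalDist, median

# Days of assisted ventilation from Table 1
boys = [13, 17, 26, 27, 40, 51, 62, 116, 204]
girls = [1, 2, 3, 5, 8, 19, 20, 23, 29, 30, 41, 52]


def median_difference_ci(group1, group2, level=0.95):
    """Approximate rank-based CI for the shift between two groups."""
    n1, n2 = len(group1), len(group2)
    # All n1 * n2 differences (group1 value minus group2 value), sorted
    diffs = sorted(a - b for a in group1 for b in group2)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    # Normal approximation to the rank cut-off K, rounded to the nearest integer
    k = round(n1 * n2 / 2 - z * math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12))
    k = max(k, 1)  # guard for very small samples
    # Kth smallest and Kth largest of the pairwise differences
    return diffs[k - 1], diffs[-k]


difference = median(boys) - median(girls)
lower, upper = median_difference_ci(boys, girls)
print(difference, lower, upper)  # prints: 20.5 3 60
```

Because the interval is built from ranks of the pairwise differences, it needs the raw data and cannot be derived from the two medians alone.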
Correspondence: Associate Professor LW Doyle, Department of Obstetrics & Gynaecology, University of Melbourne, Parkville, Vic. 3010,
Australia. Fax: +61 3 9347 1761; email: lwd@unimelb.edu.au
Accepted for publication 3 February 2003.
Table 1 Data on days of assisted ventilation for very low birthweight
infants, modified for illustrative purposes

Group     Ventilation (days)   Rank†
Boys‡     13                   6
          17                   7
          26                   11
          27                   12
          40                   15
          51                   17
          62                   19
          116                  20
          204                  21
Girls§    1                    1
          2                    2
          3                    3
          5                    4
          8                    5
          19                   8
          20                   9
          23                   10
          29                   13
          30                   14
          41                   16
          52                   18

† Sum of ranks (boys) 128, (girls) 103; ‡ median 40, mean 61.8, SD
61.9; § median 19.5, mean 19.4, SD 16.5.
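The summary statistics in the footnote to Table 1, and the quick check of the SD against the mean, can be reproduced as follows. This is a minimal sketch using only the Python standard library; the SD/mean ratio is printed simply as an informal indicator of skewness.

```python
from statistics import mean, median, stdev

# Days of assisted ventilation from Table 1
boys = [13, 17, 26, 27, 40, 51, 62, 116, 204]
girls = [1, 2, 3, 5, 8, 19, 20, 23, 29, 30, 41, 52]

for label, days in (("Boys", boys), ("Girls", girls)):
    m = mean(days)
    sd = stdev(days)  # sample SD, using the n - 1 denominator
    print(f"{label}: median {median(days)}, mean {m:.1f}, "
          f"SD {sd:.1f}, SD/mean {sd / m:.2f}")
```

In both groups the SD is close to the mean, which for a variable that cannot be negative points to a skewed distribution.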
PAIRED COMPARISONS: SIGN TEST AND
WILCOXON SIGNED-RANK TEST
Just as was discussed in a previous article in relation to the
t-test for comparing means,1 when a study uses a paired design
to make comparisons between two treatments or conditions, the
analysis should reflect the pairing, as this removes a potentially
important source of variation. For example, it is natural to
expect that observations taken on the same patient before and
after administration of a drug will be more alike than observations taken on different patients.
There are two commonly used non-parametric tests for
making paired comparisons of continuous outcome data: the
sign test and the Wilcoxon signed-rank test.
The logic of the sign test is very simple: it examines the
number of pairs in which one member of the pair (say the first)
has a higher value than the other (say the second), and
compares this number of pairs with what would be expected
under the null hypothesis of no difference, which is just half the
total number of pairs. The variation or ‘noise’ surrounding this
difference is determined by the variance of the binomial
distribution with probability 0.5, and a P-value can be obtained
directly using this binomial distribution. An example of paired
data arising from our VLBW study is in twins, where we might
be interested in asking if the second twin has a need for longer
durations of assisted ventilation than the first twin. Figure 1
illustrates the data for 11 pairs of VLBW twins (note the
logarithmic plot on the vertical axis – we will explain the logic
of this plot in the next article in the series). In four sets, the first
twin required a longer duration of assisted ventilation, and in
seven sets, the second twin required more assisted ventilation.
To work out the probability of finding at least seven sets where
one twin had a longer duration of assisted ventilation, we
calculate the probability from the binomial distribution of
finding a 7:4 split, and add to it the probabilities of 8:3, 9:2,
10:1 and 11:0 splits. Fortunately, computer programs will do
this readily, and the answer happens to be a total probability
(P-value) of 0.27.

Fig. 1 Days of assisted ventilation for 11 sets of very low birthweight
twins. Note logarithmic scale on vertical axis. Separate symbols mark twin 1 and twin 2.
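The binomial sum for the sign test is short enough to write out directly. The sketch below follows the calculation exactly as described, adding the probabilities of the 7:4 to 11:0 splits under a probability of 0.5; the variable names are illustrative.

```python
from math import comb

n_pairs = 11          # sets of twins shown in Fig. 1
n_second_longer = 7   # sets in which the second twin was ventilated longer

# Under the null hypothesis each split has a binomial probability with p = 0.5,
# so every split of k versus (n_pairs - k) has probability C(n_pairs, k) / 2**n_pairs
p_value = sum(
    comb(n_pairs, k) for k in range(n_second_longer, n_pairs + 1)
) / 2 ** n_pairs
print(round(p_value, 2))  # prints: 0.27
```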
One problem with the sign test is that it ignores the size
of the difference between the pairs of twins, whereas the
Wilcoxon signed-rank test uses this information as well as the
direction of the difference between pairs. In the Wilcoxon test
the differences within pairs are ranked by size, conventionally
from smallest to largest, ignoring the direction of the effect.
The sum of the ranks for the differences in one direction is then
considered and compared with what
would be expected from chance alone. In the present example,
the expected sum of ranks is 33 for each direction, and the
observed sum of ranks where the second twin received a
longer duration of assisted ventilation was 45. An approximate
P-value can be obtained from tables, or statistical programs can
calculate an exact probability; in the present example, this is
0.29, which is similar to the P-value for the sign test.
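The raw twin data are not tabulated here, so the following sketch works only from the two reported figures: 11 pairs and an observed rank sum of 45. It shows where the expected sum of 33 comes from, and applies the usual large-sample normal approximation. That approximation is an assumption made for illustration; it is not the exact calculation a statistical program would perform, and it presumes no tied or zero differences.

```python
import math

n_pairs = 11    # sets of twins
observed = 45   # reported rank sum where the second twin was ventilated longer

# The ranks 1..n sum to n(n + 1)/2, and under the null hypothesis
# half of that total is expected in each direction
expected = n_pairs * (n_pairs + 1) / 4
# Standard error of the signed-rank sum (no ties assumed)
se = math.sqrt(n_pairs * (n_pairs + 1) * (2 * n_pairs + 1) / 24)
z = (observed - expected) / se
# Two-sided tail area of the standard normal distribution
p_approx = math.erfc(abs(z) / math.sqrt(2))
print(expected, round(z, 2), round(p_approx, 2))  # prints: 33.0 1.07 0.29
```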
In the present article we have focused on non-parametric
methods for continuous or ordered variables. Commonly used
tests for dichotomous variables, such as the χ2 and Fisher’s
exact test, which we have discussed previously,9 can also be
regarded as non-parametric. There is a wide range of other non-parametric methods for which interested readers are referred
elsewhere.5,6
In the next article in the series, we will discuss alternative
approaches to handling skewed distributions based on transforming data. In particular, we will explain the central role of
the logarithmic transformation in much statistical analysis.
REFERENCES
1 Carlin JB, Doyle LW. Statistics for clinicians 4: Basic concepts of statistical reasoning: Hypothesis tests and the t-test. J. Paediatr. Child Health 2001; 37: 72–7.
2 Carlin JB, Doyle LW. Statistics for clinicians 6: Comparison of means and proportions using confidence intervals. J. Paediatr. Child Health 2001; 37: 583–6.
3 Carlin J, Doyle L. Statistics for clinicians 2: Describing and displaying data. J. Paediatr. Child Health 2000; 36: 270–4.
4 Doyle LW, Carlin JB. Statistics for clinicians 1: Introduction. J. Paediatr. Child Health 2000; 36: 74–5.
5 Altman DG. Practical Statistics for Medical Research. Chapman & Hall, London, 1991.
6 Bland M. An Introduction to Medical Statistics. Oxford University Press, Oxford, 1995.
7 Hart A. Mann–Whitney test is not just a test of medians: differences in spread can be important. BMJ 2001; 323: 391–3.
8 Altman DG, Machin D, Bryant TN, Gardner MJ. Statistics with Confidence. BMJ Books, London, 2000.
9 Carlin JB, Doyle LW. Statistics for clinicians 5: Comparing proportions using the chi-squared test. J. Paediatr. Child Health 2001; 37: 392–4.