Statistical Analysis and Data Interpretation
important
What is significant for the athlete, the statistician and team doctor?
Will Hopkins will@clear.net.nz sportsci.org/will
What is a Statistic?
Simple, effect, and inferential statistics.
Making Clinical and Non-clinical Inferences
Sampling variation; true effects; confidence limits; null-hypothesis
significance test; magnitude-based inference; individual
differences and responses.
Clinically Important Effects
For differences and changes in means; correlations; slopes or
gradients; ratios of proportions, risks, odds, hazards, counts.
Monitoring Individual Athletes
Subjective and objective assessments; error of measurement.
What is a Statistic?
Definition: a number summarizing an aspect of many numbers.
Examples: mean, correlation, confidence limit…
If the many numbers all represent different values of the same kind
of thing, we call the numbers values of a numeric variable.
• Example: 57, 73, 61, 60 kg are values of the variable body mass.
• Values of a variable all have the same units.
A nominal or grouping variable has levels or labels rather than
numeric values.
• Example: union, league, touch… are levels of the variable rugby.
Utility: a statistic usually represents the big picture or some other
important aspect of the original numbers.
The aspect is often not obvious in the original numbers.
One number is better than many.
• Most people hate numbers. The fewer, the better!
Simple statistic: an aspect of a set of values of one variable.
Sample size (n): the number of values.
Mean: the average value or center of the values.
Standard deviation (SD): the average scatter around the mean.
• Used to evaluate magnitudes of differences in means.
Standard error of the mean (SD/√n): the expected variation in the
mean with resampling.
• A tricky statistical dinosaur. Avoid!
• Convert back to the SD when you see it.
Quantiles (median, tertiles, quartiles, quintiles…): values that divide
the ranked set up into 2, 3, 4, 5… equal-sized subsets.
• Used when the set is skewed by large values (e.g., salaries).
• Also used to compare subgroups. Example: systolic pressure in the
quintile of lowest physical activity vs each quintile of higher activity.
Proportion or risk: the number of "events" (e.g., injured players)
divided by the number of "trials" (total number of players).
• Often expressed as a percent (proportion×100).
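The simple statistics listed above can be sketched with Python's standard library. The body masses are the values from the earlier example; the injury counts are hypothetical, and the SD here is the usual sample SD (n - 1 denominator), which is an assumption about what is intended.

```python
import math
import statistics

# Body masses (kg) from the example of a numeric variable.
body_mass = [57, 73, 61, 60]

n = len(body_mass)                     # sample size
mean = statistics.mean(body_mass)      # center of the values
sd = statistics.stdev(body_mass)       # sample SD (n - 1 denominator)
sem = sd / math.sqrt(n)                # standard error of the mean
median = statistics.median(body_mass)  # value dividing the ranked set in two

# Converting a reported SEM back to the SD.
sd_from_sem = sem * math.sqrt(n)

# Proportion: events divided by trials (hypothetical counts).
injured, players = 9, 25
proportion_pct = 100 * injured / players

print(n, mean, round(sd, 2), round(sem, 2), median, proportion_pct)
```

Note how much smaller the SEM is than the SD even with n = 4; the gap widens as the sample grows.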
Effect statistic: a relationship between a predictor or
independent variable and a dependent or outcome variable.
Difference (or change) in mean: the predictor is a grouping variable
and the dependent is numeric.
Slope (or gradient): the difference or change in the mean per
difference in a numeric predictor.
Correlation coefficient: another form of the slope.
Ratio of proportions, risks, odds or hazards: statistics for comparing
the occurrence (presence or absence) of something in two groups.
Ratio of counts: statistics for comparing counts or occurrences of
something in two groups.
Other variables can be included in the analysis as covariates.
• Moderators are interacted with the predictor to estimate how the
effect differs between subjects.
• Mediators are added to adjust for effects of subject characteristics,
which means: "for subjects of the same age…, the effect was…".
Such adjustment also deals with potential confounding (by age…).
Inferential statistic: an aspect of the "true" value of a simple or
effect statistic derived from a sample.
Confidence interval or limits: the likely range of the true value.
P value: provides evidence about the zero or null value of an effect.
Chance of benefit, risk of harm: provide evidence about the true
value for making clinical decisions.
T, F, chi-squared statistics: "test" statistics used to get the above.
• Only the statistician needs to know about these.
• They shouldn’t be shown in publications.
Making Clinical Inferences (Decisions or Conclusions)
Every sample gives a different value for a statistic, owing to
sampling variation.
So, the value of a sample statistic is only an estimate of the true
(right, real, actual, very large sample, or population) value.
But people want to make an inference about the true value.
The best inferential statistic for this purpose is the confidence
interval: the range within which the true value is likely to fall.
"Likely" is usually 95%, so there is a 95% chance the true value is
included in the confidence interval (and a 5% chance it is not).
Confidence limits are the lower and upper ends of the interval.
The limits represent how small and how large the effect "could" be.
All effects should be shown with a confidence interval or limits.
Example: the dietary treatment produced an average weight loss of
3.2 kg (95% confidence interval 1.6 to 4.8 kg).
• The confidence interval is NOT a range of individual responses!
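As a rough sketch of how confidence limits relate to the standard error, the code below back-calculates the standard error implied by the weight-loss example and recomputes the limits at other levels. It assumes a normal sampling distribution (a t distribution would give slightly wider limits with small samples); the function names are illustrative only.

```python
from statistics import NormalDist

effect = 3.2             # mean weight loss (kg) from the example
lower, upper = 1.6, 4.8  # reported 95% confidence limits (kg)

def z_for(level):
    """Two-sided z multiplier for a confidence level given as a fraction."""
    return NormalDist().inv_cdf(0.5 + level / 2)

# Standard error implied by the 95% limits (normal approximation).
se = (upper - lower) / (2 * z_for(0.95))

def confidence_limits(effect, se, level):
    """Lower and upper limits at the given level."""
    margin = z_for(level) * se
    return effect - margin, effect + margin

for level in (0.90, 0.95, 0.99):
    lo, hi = confidence_limits(effect, se, level)
    print(f"{level:.0%} limits: {lo:.1f} to {hi:.1f} kg")
```

A higher level of confidence gives a wider interval for the same data.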
But confidence limits alone don't provide a clinical inference.
Statistical significance is the traditional way to make inferences.
Also known as the null-hypothesis significance test.
The inference is all about whether the effect could be zero or "null".
If the 95% confidence interval includes zero, the effect "could be
zero". The effect is "statistically non-significant (at the 5% level)":
[Figure: three 95% confidence intervals plotted against the value of the effect statistic (e.g., change in weight), with negative, zero (null) and positive values marked: statistically non-significant (p=0.31); statistically significant (p=0.02); statistically significant (p=0.003).]
Researchers using p values should show exact values.
If the confidence interval does not include zero, the effect "couldn't
be zero". The effect is "statistically significant (at the 5% level)".
Stats packages calculate a probability or p value for deciding
whether an effect is significant.
• p>0.05 means non-significant; p<0.05 means significant.
The exact definition of the p value is hard to understand.
• Useful interpretation: half the p value is the probability the true effect
is negative when the sample effect is positive (and vice versa).
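The "half the p value" interpretation can be checked numerically. This is a minimal sketch under a normal approximation, with the uncertainty in the true effect assumed to be centred on the sample effect; the effect and standard error are hypothetical.

```python
from statistics import NormalDist

def p_value(effect, se):
    """Two-sided p value for the null (zero) effect, normal approximation."""
    z = abs(effect) / se
    return 2 * (1 - NormalDist().cdf(z))

def chance_opposite_sign(effect, se):
    """Probability the true effect has the opposite sign to the sample effect,
    assuming normal uncertainty centred on the sample effect."""
    return NormalDist(mu=abs(effect), sigma=se).cdf(0)

effect, se = 1.0, 0.98   # hypothetical sample effect and standard error
p = p_value(effect, se)
print(round(p, 2), round(chance_opposite_sign(effect, se), 3))
```

Under these assumptions the second number is exactly half the first.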
People usually interpret non-significant as "no real effect" and
significant as "a real effect".
These interpretations apply only if the study was done with the right
sample size.
Even then they are misleading: they don't convey the uncertainty.
And you hardly ever know if the sample size is right.
Attempts to address this problem with post-hoc power calculations
are rare, generally wrong, and too hard to understand.
So the only safe interpretation is whether the effect could be zero.
But the issue for the practitioner is not whether the effect could be
zero, but whether the effect could be important.
• Important has two meanings: beneficial and harmful.
The confidence interval addresses this issue, when clinically
important values for benefit and harm are taken into account.
Clinical inferences with the confidence interval
The smallest clinically important effects define values of the effect
that are beneficial, harmful and trivial.
• Smallest effects for benefit and harm are equal and opposite.
Infer (decide) the outcome from the confidence interval, as follows:
[Figure: confidence intervals plotted against the value of the effect statistic (e.g., change in weight), which is divided into harmful, trivial and beneficial regions by the smallest clinically harmful effect and the smallest clinically beneficial effect. Clinical decision for each interval in turn:]
• Clear: use it.
• Clear: use it.
• Clear: use it. But p>0.05!
• Clear: depends.
• Clear: don't use it. But p<0.05!
• Clear: don't use it.
• Clear: don't use it.
• Unclear: more data needed.
P values fail for the intervals marked "But p…!".
This approach eliminates statistical significance.
The only issue is what level to make the confidence interval.
To be careful about avoiding harm, you can make a conservative
99% confidence interval on the harm side.
And to use effects only when there is a reasonable chance of
benefit, you can make a 50% interval on the benefit side.
But that's hard to understand. Consider this equivalent approach…
Clinical inferences with probabilities of benefit and harm.
The uncertainty in an effect can be expressed as chances that the
true effect is beneficial and the risk that it is actually harmful.
You would decide to use an effect with a reasonable chance of
benefit, provided it had a sufficiently low risk of harm.
I have opted for possibly beneficial (>25% chance of benefit) and
most unlikely harmful (<0.5% chance of harm).
An effect with >25% chance of benefit and >0.5% risk of harm is
therefore unclear. You'd like to use it, but you daren't.
• Everything else is either clearly useful or clearly not worth using.
If the chance of benefit is high (e.g., 80%), you could accept a
higher risk of harm (e.g., 5%).
• This less conservative approach has been formalized using a
threshold odds ratio of 66 (odds of benefit to odds of harm).
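The decision rule above can be sketched as follows. The code assumes positive effects are beneficial, that harm and benefit thresholds are equal and opposite, and that the uncertainty in the true effect is normal and centred on the sample effect; the function names and the example numbers are hypothetical. The last line shows that the default thresholds (25% benefit, 0.5% harm) correspond to an odds ratio of about 66.

```python
from statistics import NormalDist

def chances(effect, se, smallest):
    """Chances (fractions) that the true effect is beneficial, trivial, harmful."""
    dist = NormalDist(mu=effect, sigma=se)
    harm = dist.cdf(-smallest)
    benefit = 1 - dist.cdf(smallest)
    return benefit, 1 - benefit - harm, harm

def clinical_decision(benefit, harm, min_benefit=0.25, max_harm=0.005):
    """Apply the >25% chance of benefit, <0.5% risk of harm rule."""
    if benefit > min_benefit and harm < max_harm:
        return "use it"
    if benefit > min_benefit:
        return "unclear"        # you'd like to use it, but you daren't
    return "don't use it"

def odds(p):
    return p / (1 - p)

benefit, trivial, harm = chances(effect=1.5, se=0.8, smallest=1.0)
print(clinical_decision(benefit, harm))
print(round(odds(0.25) / odds(0.005), 1))  # odds of benefit to odds of harm
```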
When an effect has no obvious benefit or harm (e.g., a comparison
of males and females), the inference is only about whether the
effect could be substantially positive or negative.
• For such non-clinical inferences, use a symmetrical confidence
interval, usually 90% or 99%, to decide whether the effect is clear.
• Equivalently, one or other of the chances of being substantially
positive or negative has to be <5% for the effect to be clear ("a clear
non-clinical effect can't be substantially positive and negative").
Ways to report inferences for clear effects: possibly small benefit,
likely moderately harmful, a large difference (clear at 99% level), a
trivial-moderate increase [the lower and upper confidence limits]…
• Whatever, researchers should make a magnitude-based
inference by showing confidence limits and interpreting the
uncertainty in a (clinically) relevant way readers can understand.
A caution about making an inference…
Whatever method you use, the inference is about the one and only
mean effect in the population.
The confidence interval represents the uncertainty in the true effect,
not a range of individual differences or individual responses.
• For example, with a large-enough sample size, a treatment could be
clearly beneficial (a mean beneficial effect with a narrow confidence
interval), yet the treatment could be harmful for a substantial
proportion of the population.
Individual differences between groups and individual responses to a
treatment are best summarized with a standard deviation to go with
the mean effect.
• The mean effect and the SD both need confidence limits.
Individual differences between groups and individual responses to a
treatment may be accounted for by including subject characteristics
as modifying covariates in the analysis.
Researchers generally neglect this important issue.
Clinically Important Magnitudes of Effects
Researchers and practitioners need to know about clinically
important magnitudes to interpret research findings.
Researchers need the smallest clinically important magnitude of
an effect statistic to estimate sample size for a study.
For those who use the null-hypothesis significance test, the right
sample size has 80% power (80% chance of statistical significance,
p<0.05) if the true effect has the smallest important value.
For those who use clinical magnitude-based inference, the right
sample size gives a 0.5% risk of harm and a 25% chance of benefit
if the true effect has the smallest important beneficial value.
Practitioners need to know about clinically important magnitudes
to monitor their athletes or patients.
So the next few slides are all about values for various
magnitudes of various effect statistics.
Differences or Changes in the Mean
The most common effect statistic, for numbers
with decimals (continuous variables).
Difference when comparing
different groups, e.g., patients vs healthy.
In population-health studies, groups are often
subdivided into quartiles or quintiles (e.g., of age).
Change when tracking the same subjects.
Difference in the changes in controlled trials.
The between-subject standard deviation
provides default thresholds for important
differences and changes.
[Figure: strength of patients vs healthy subjects. Data are means & SD.]
[Figure: strength at trials pre, post1 and post2. Data are means & SD.]
You think about the effect (mean) in terms of a fraction or multiple of the SD (mean/SD).
The effect is said to be standardized.
The smallest important effect is ±0.20 (±0.20 of an SD).
Example: the effect of a treatment on strength
[Figure: pre and post strength for a trivial effect (0.1x SD) and for a very large effect (3.0x SD).]
Interpretation of standardized difference or change in means:
                   Cohen      Hopkins
trivial            <0.2       <0.2
small              0.2-0.5    0.2-0.6
moderate           0.5-0.8    0.6-1.2
large              >0.8       1.2-2.0
very large         ?          2.0-4.0
extremely large    ?          >4.0
Complete scale:
trivial 0.2 small 0.6 moderate 1.2 large 2.0 very large 4.0 ext. large
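A minimal sketch of standardizing an effect and reading it off the complete scale above. The function names are illustrative, and a value exactly on a threshold is assigned to the larger category here, which is an assumption.

```python
from bisect import bisect_right

# Thresholds and labels of the complete scale for standardized effects.
THRESHOLDS = [0.2, 0.6, 1.2, 2.0, 4.0]
LABELS = ["trivial", "small", "moderate", "large", "very large", "extremely large"]

def standardize(difference, sd):
    """Express a difference or change in means as a multiple of the SD."""
    return difference / sd

def magnitude(standardized):
    """Qualitative magnitude of a standardized effect (sign ignored)."""
    return LABELS[bisect_right(THRESHOLDS, abs(standardized))]

# Hypothetical: strength differs by 4.5 kg and the between-subject SD is 9 kg.
d = standardize(4.5, 9.0)
print(d, magnitude(d))
```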
Cautions with standardizing
Standardizing works only when the SD comes from a sample that
is representative of a well-defined population.
• The resulting magnitude applies only to that population.
In a controlled trial, use the baseline (pre) SD, never the SD of
change scores.
Beware of authors who show standard errors of the mean (SEM)
rather than standard deviations (SD).
• SEM = SD/√(sample size), so SEMs on graphs make effects look a
lot bigger than they really are.
• Very rarely, overlap of SEM of two groups indicates that the
difference between the means is not statistically significant.
• But you won't know when that applies, and you're not using or
trusting statistical significance anymore anyway, right?
Standardization may not be best for effects on means of some
special variables: visual-analog scales, Likert scales, solo athletic
performance…
Visual-analog scales
The respondents indicate a perception on a line like this:
Rate your pain by placing a mark on this scale:
none |----------------------------------------| unbearable
Score the response as percent of the length of the line.
Magnitude thresholds: 10%, 30%, 50%, 70%, 90% for small,
moderate, large, very large, extremely large differences or changes.
Likert scales
These are used for responses to questions like this:
Over the last four weeks, how often did you train in a gym?
not at all | once only | 2-3 times | once a week | twice or more a week
Most Likert-type questions have four to seven choices.
Code them as integers (1, 2, 3, 4, 5…) and analyze as numerics.
Magnitude thresholds are up for debate.
• If you use the thresholds of the visual-analog scale as a guide, the
thresholds for a 6-pt scale would be ~0.5, 1.5, 2.5, 3.5 and 4.5.
Solo athletic performance
For fitness tests and performance indicators of team-sport athletes,
use standardization.
But for top solo athletes, an enhancement that results in one extra
medal per 10 competitions is the smallest important effect.
• The within-athlete variability that athletes show from one
competition to the next determines this effect. Here's why…
• Owing to this variability, each of the top athletes has a good chance
of winning at each competition:
[Figure: outcomes for the top athletes in Race 1, Race 2 and Race 3.]
Your athlete needs an enhancement that overcomes this variability
to give her or him a bigger chance of a medal.
Simulations show an enhancement
of 0.3 of an athlete's typical variability
from competition to competition gives
one extra win every 10 competitions.
• Example: if the variability is an SD
(coefficient of variation) of 1%, the
smallest important enhancement is 0.3%.
• In some early publications I have mistakenly referred to 0.5 of the
variability as the smallest effect.
Small, moderate, large, very large and extremely large effects result
in an extra 1, 3, 5, 7 and 9 medals in every 10 competitions.
The corresponding enhancements as factors of the variability are:
trivial 0.3 small 0.9 moderate 1.6 large 2.5 very large 4.0 ext. large
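The scale above converts directly into percent enhancements once the athlete's competition-to-competition variability is known. A small sketch, with an illustrative function name; the 1% variability is the value from the example.

```python
# Factors of the within-athlete variability for each magnitude threshold.
FACTORS = {
    "small": 0.3,
    "moderate": 0.9,
    "large": 1.6,
    "very large": 2.5,
    "extremely large": 4.0,
}

def performance_thresholds(cv_percent):
    """Threshold enhancements (%) for a given within-athlete variability (CV, %)."""
    return {label: factor * cv_percent for label, factor in FACTORS.items()}

thresholds = performance_thresholds(1.0)
for label, value in thresholds.items():
    print(f"{label}: {value:.1f}%")
```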
Beware: smallest effect on athletic performance in performance
tests depends on method of measurement, because…
A percent change in an athlete's ability to output power results in
different percent changes in performance in different tests.
These differences are due to the power-duration relationship for
performance and the power-speed relationship for different modes
of exercise.
Example: a 1% change in endurance power output produces the
following changes…
• 1% in running time-trial speed or time;
• ~0.4% in road-cycling time-trial time;
• 0.3% in rowing-ergometer time-trial time;
• ~15% in time to exhaustion in a constant-power test.
• A hard-to-interpret change in any test following a fatiguing pre-load.
(But such tests can be interpreted for cycling road races: see
Bonetti and Hopkins, Sportscience 14, 63-70, 2010.)
Slope (or Gradient)
[Figure: physical activity vs age with a straight-line fit; 2 SD of age marked.]
Used when the predictor and dependent are both numeric and a straight line fits the trend.
The unit of the predictor is arbitrary.
Example: a 2% per year decline in activity seems trivial… yet 20% per decade seems large.
So it's best to express a slope as the difference in the dependent per two SDs of predictor.
• It gives the difference in the dependent (physical activity) between
a typically low and high subject.
• The SD for standardizing the resulting effect is the standard error of
the estimate (the scatter about the line).
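A sketch of expressing a slope per 2 SD of the predictor, using an ordinary least-squares fit written out by hand. The ages and activity scores are hypothetical, and the function name is illustrative.

```python
import statistics

def slope_per_2sd(x, y):
    """Least-squares slope of y on x, and the same effect per 2 SD of x."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx                            # change in y per unit of x
    return slope, slope * 2 * statistics.stdev(x)

age = [20, 30, 40, 50, 60]           # years (hypothetical)
activity = [100, 82, 58, 41, 19]     # arbitrary units (hypothetical)

slope, effect_2sd = slope_per_2sd(age, activity)
print(round(slope, 2), round(effect_2sd, 1))
```

The per-2-SD value does not depend on whether age is recorded in years or decades, which is the point of the rescaling.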
Correlation Coefficient
Closely related to the slope, this represents the overall linearity in
a scatterplot. Examples:
[Figure: scatterplots illustrating r = 0.00, 0.10, 0.30, 0.50, 0.70, 0.90 and 1.00.]
Negative values represent negative slopes.
The value is unaffected by the scaling of the two variables or by
the sample size.
And it's much easier to calculate than a slope.
But a properly calculated slope is easier to interpret clinically.
Smallest important correlation is ±0.1. Complete scale:
trivial 0.1 low 0.3 moderate 0.5 high 0.7 very high 0.9 ext. high
Differences and Ratios of Proportions, Risks, Odds, Hazards
Example: percent of male and female players injured at all in a season of touch rugby.
[Figure: bar chart of proportion injured (%) by sex: males a = 75%, females b = 36%.]
Risk difference or proportion difference
A common measure.
Example: a - b = 75% - 36% = 39%.
Problem: the sense of magnitude of a given difference depends on how big the proportions are.
• Example: for the same 10% difference, 90% vs 80% doesn't seem big, but…
11% vs 1% can be interpreted as a huge "difference" (11x the risk).
So there is no scale of magnitudes for a risk or proportion difference.
And analyses (models) don't work properly with proportions.
• We have to use odds or hazards instead of proportions. Stay tuned.
Number needed to treat (NNT) = 100/(risk difference (%)).
The number you would have to treat or sample for one subject to
have an outcome attributable to the effect.
• Example: one male in 2.6 (=1/0.39) is injured because he’s a male.
Has been promoted in some clinical journals, but not widely used.
Hard to analyze properly, and problems with its confidence limits.
Avoid!
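The arithmetic for the risk difference and NNT, using the touch-rugby proportions, plus the two pairs of proportions that share a 10% difference but have very different ratios. Function names are illustrative.

```python
def risk_difference_pct(risk_a_pct, risk_b_pct):
    """Difference between two risks or proportions, in percent units."""
    return risk_a_pct - risk_b_pct

def number_needed_to_treat(risk_diff_pct):
    """NNT = 100 / (risk difference in %)."""
    return 100 / risk_diff_pct

diff = risk_difference_pct(75, 36)
nnt = number_needed_to_treat(diff)
print(diff, round(nnt, 1))

# Same 10% difference, very different sense of magnitude.
for a, b in [(90, 80), (11, 1)]:
    print(a - b, round(a / b, 1))
```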
Risk ratio (relative risk) or proportion ratio
[Figure: bar chart of proportion injured (%) by sex: males a = 75%, females b = 36%.]
Another common measure.
Example: a/b = 75/36 = 2.1, which means males are "2.1 times more likely" to be injured, or "a 110% increase in risk" of injury for males.
Problem: if it's a time dependent measure, the risk ratio changes.
• If you wait long enough, everyone gets affected, so risk ratio = 1.00.
But it works for rare time-dependent risks and for time-independent
classifications (e.g., proportion playing a sport).
Hence we need values for the smallest and other important ratios for
risks and proportions.
The smallest ratio is when one event or case in every 10 is due to
the effect.
• Example: one in 10 injuries is due to being male.
• That is, for every 10 injured males, there are 9 injured females.
• If there are N males and N females (injured and uninjured), the injury
risks are 10/N and 9/N, and the risk ratio = (10/N)/(9/N) = 10/9.
For moderate, large, very large and extremely large ratios, for every
10 injured males, there are 7, 5, 3 and 1 injured females.
• Corresponding risk ratios are 10/7, 10/5, 10/3 and 10/1.
Hence this scale for proportion ratio and low-risk ratio:
trivial 1.11 small 1.43 moderate 2.0 large 3.3 very large 10 ext. large
• and the inverses for reductions in proportions: 0.9, 0.7, 0.5, 0.3, 0.1.
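The thresholds follow directly from the "10 injured males vs 9, 7, 5, 3, 1 injured females" reasoning. A sketch that derives them and classifies a ratio; reductions (ratios below 1) are judged by their inverse, and the function name is illustrative.

```python
from bisect import bisect_right

# For every 10 cases in one group: 9, 7, 5, 3 and 1 cases in the other.
CASES_PER_10 = [9, 7, 5, 3, 1]
THRESHOLDS = [10 / k for k in CASES_PER_10]
LABELS = ["trivial", "small", "moderate", "large", "very large", "extremely large"]

def ratio_magnitude(ratio):
    """Magnitude of a ratio of proportions or low risks (ratio must be > 0)."""
    r = ratio if ratio >= 1 else 1 / ratio   # treat reductions via the inverse
    return LABELS[bisect_right(THRESHOLDS, r)]

print([round(t, 2) for t in THRESHOLDS])
print(ratio_magnitude(75 / 36))   # males vs females in the example
```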
But there is still the problem of analyzing proportions properly.
• Two solutions: hazards instead of risks; odds instead of proportions.
Hazard ratio for time-dependent events.
[Figure: proportion injured (%) vs time (months) for males and females; an inset shows the slopes a and b of the two curves over 1 day.]
To understand hazards, consider the increase in proportion or risk with time.
The hazard is the tiny proportion that gets affected per a tiny interval of time.
Example: hazard for males = a = 0.28% per day, hazard for females = b = 0.11% per day.
So hazard ratio = a/b = 0.28/0.11 = 2.5.
• That is, males are 2.5x more likely to get injured per unit time, whatever the (small) unit of time.
So you could call it the "right-now risk ratio".
It's also known as incidence rate ratio, which is the ratio of the slopes.
It can also be interpreted as the ratio of the times taken for the
same proportion to get affected in two groups.
• Example: females take 2.5x as long to get injured as males.
Hazard ratios work over long periods, when a substantial proportion of males or females is injured, and the observed risk ratio drops below the initial hazard ratio.
[Figure: proportion injured (%) vs time (months) for males and females, with proportions a and b marked.]
• Example: at 5 weeks, the risk ratio = a/b = 75/36 = 2.1.
But the hazard ratio for those still uninjured is usually assumed to stay the same, even if the hazards change with time.
• Example: the risk of injury might increase later in the season for both sexes, but the right-now risk ratio for new injuries (the hazard ratio) doesn't change. A big plus!
And hazards and hazard ratios can be modeled (analyzed)!
Magnitude thresholds must be the same as for the proportion ratio,
even for frequent events, because such events start off rare.
Hence this scale for the hazard ratio:
trivial 1.11 small 1.43 moderate 2.0 large 3.3 very large 10 ext. large
• and the inverses 0.9, 0.7, 0.5, 0.3, 0.1.
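To see why the risk ratio falls toward 1 while the hazard ratio stays put, here is a sketch that assumes constant hazards (an exponential model, which is a simplifying assumption and not something the slides require). The hazards are the per-day values from the example; the time points are arbitrary.

```python
import math

def risk(hazard_per_day, days):
    """Proportion affected by a given time, assuming a constant hazard."""
    return -math.expm1(-hazard_per_day * days)   # 1 - exp(-hazard * time)

hazard_m, hazard_f = 0.0028, 0.0011   # 0.28% and 0.11% per day
hazard_ratio = hazard_m / hazard_f

time_points = (1, 30, 180, 720)       # days (arbitrary)
risk_ratios = [risk(hazard_m, t) / risk(hazard_f, t) for t in time_points]

print(round(hazard_ratio, 2))
for t, rr in zip(time_points, risk_ratios):
    print(t, round(rr, 2))
```

Early on, when injuries are rare, the risk ratio is close to the hazard ratio; it then shrinks as more of both groups become affected.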
Odds ratio for time-independent classifications.
Classifications refer to prevalence; risks refer to incidence.
Odds are the awkward but only way to model classifications.
Example: proportions of boys and girls playing a sport.
[Figure: stacked bar chart of proportion playing (%) by sex: boys a = 75% playing, c = 25% not playing; girls b = 36% playing, d = 64% not playing.]
• Odds of a boy playing = a/c = 75/25.
• Odds of a girl playing = b/d = 36/64.
• Odds ratio = (75/25)/(36/64) = 5.3.
Interpret the ratio as "…times more likely" only when the proportions in both groups are small (<10%).
• The odds ratio is then approximately equal to the proportion ratio.
To assess magnitude, authors should convert the odds ratio and its
confidence limits to the proportion ratio and its confidence limits.
• Unfortunately they often just leave effects as odds ratios.
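A sketch of the odds-ratio arithmetic and of converting an odds ratio back to a proportion ratio. The conversion needs the proportion in the reference group (here, girls), which is an input the analyst has to supply; the function names are illustrative.

```python
def odds(p):
    """Odds corresponding to a proportion p (as a fraction)."""
    return p / (1 - p)

def odds_ratio(p_a, p_b):
    return odds(p_a) / odds(p_b)

def proportion_ratio_from_odds_ratio(odds_ratio_value, p_ref):
    """Proportion ratio implied by an odds ratio, given the reference proportion."""
    return odds_ratio_value / (1 - p_ref + p_ref * odds_ratio_value)

boys, girls = 0.75, 0.36
or_value = odds_ratio(boys, girls)
pr_value = proportion_ratio_from_odds_ratio(or_value, girls)
print(round(or_value, 1), round(pr_value, 1))

# With small proportions the odds ratio approximates the proportion ratio.
print(round(odds_ratio(0.02, 0.01), 2))
```

The same conversion can be applied to each confidence limit of the odds ratio.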
Ratio of Counts
Example: 93 vs 69 injuries per 1000 player-hours of match play in
sport A vs sport B.
The effect is expressed as a ratio: 93/69 = 1.35x more injuries.
Can also be expressed as 35% more injuries.
The scale of magnitudes is the same as for ratio of proportions:
trivial 1.11 small 1.43 moderate 2.0 large 3.3 very large 10 ext. large
and the inverses 0.9, 0.7, 0.5, 0.3, 0.1.
––––––––––
Effects of numeric linear predictors (slopes) for ratio outcomes are
expressed as risk, odds, hazard or count ratios per unit of the
predictor and evaluated as the effect per 2 SD of the predictor.
Modeling Effects
Estimates and inferential statistics for mean effects and slopes
come from various kinds of general linear model…
t tests, simple and multiple linear regression, ANOVA…
Use mixed linear models for repeated measures and clustering.
Testing for normality is pointless; uniformity is the real issue.
• Many effects are more uniform when estimated as percents or ratios
via analysis of the log-transformed dependent variable.
Bootstrapping of confidence limits works with difficult data.
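The log-transformation point above can be sketched for a simple pre-post design: analyze the change in the log of the dependent variable, then back-transform the mean to a percent effect. The data are hypothetical and the function name is illustrative; a real analysis would use a linear model and add confidence limits.

```python
import math
import statistics

def mean_percent_change(pre, post):
    """Mean change as a percent, via the mean of log-transformed ratios."""
    log_changes = [math.log(b / a) for a, b in zip(pre, post)]
    mean_log = statistics.mean(log_changes)
    return 100 * (math.exp(mean_log) - 1)    # back-transform to percent

# Hypothetical power outputs (W) before and after a treatment.
pre = [250, 310, 280, 400]
post = [262, 322, 297, 418]
print(round(mean_percent_change(pre, post), 1))
```

Each athlete contributes a ratio, so a 5% gain counts the same for a weak athlete as for a strong one.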
Ratios of odds, hazards and counts need various kinds of
generalized linear model…
All include log transformation to estimate ratios.
Logistic (log-odds) regression for odds, log-hazard and Cox
regression for hazards, Poisson regression for counts.
And don't forget that covariates in all these models estimate and
adjust for effects of moderators and mediators or confounders.
Monitoring Individual Athletes
It’s all about a substantial change since the last assessment.
The subjective assessments (perceptions) of the athlete, coach,
and support personnel provide important evidence.
One-off assessments often differ between individual practitioners,
but assessments of change usually have high validity.
Objective assessments of change with an instrument or test are
contaminated with error or "noise".
The noise is represented by the standard deviation of repeated
measurements, the standard (or typical) error of measurement.
Think of ± the error as the equivalent of confidence limits for the
athlete's true change.
Take into account clinically or practically important changes.
• Wow, you've made a moderate improvement!
• No real change either way. [A good instrument needed for this.]
• Uh… unclear whether you’re getting better or worse.
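One way to turn the three messages above into a rule is sketched below. It treats change ± error as limits for the athlete's true change, as suggested, and assumes positive changes are improvements. The "possible" outcomes for limits that straddle a single threshold are an addition for completeness, not something stated in the slides, and the numbers are hypothetical.

```python
def assess_change(change, error, smallest_important):
    """Interpret an observed change given measurement error and the smallest
    important change (positive = better is an assumption)."""
    lower, upper = change - error, change + error
    could_be_better = upper > smallest_important
    could_be_worse = lower < -smallest_important
    if could_be_better and could_be_worse:
        return "unclear"                      # noise too large: better or worse?
    if lower > smallest_important:
        return "improvement"
    if upper < -smallest_important:
        return "decline"
    if not could_be_better and not could_be_worse:
        return "no real change"               # needs a low-noise instrument
    return "possible improvement" if could_be_better else "possible decline"

# Hypothetical changes in a test score; smallest important change = 1.0.
for change, error in [(2.0, 0.5), (0.1, 0.3), (0.2, 2.0), (1.2, 0.5)]:
    print(change, error, assess_change(change, error, 1.0))
```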
Summary
Inferential statistics are used to make conclusions about the true
value of a simple or effect statistic derived from a sample.
The inference from a null-hypothesis significance test is about
whether the true value of an effect statistic could be null (zero).
Magnitude-based inference addresses the issue of whether the true
value could be important (beneficial and harmful, or substantial).
Effect magnitudes have key roles in research and practice.
Effects for continuous dependents are mean differences, slopes
(expressed per 2 SD of the predictor), and correlations.
Thresholds for small, moderate, large, very large and extremely
large standardized mean differences: 0.20, 0.60, 1.2, 2.0, 4.0.
Thresholds for correlations: 0.10, 0.30, 0.50, 0.70, 0.90.
Magnitude thresholds for ratios of proportions, hazards, counts:
1.11, 1.43, 2.0, 3.3, 10 and their inverses 0.9, 0.7, 0.5, 0.3, 0.1.
Take noise and thresholds into account when monitoring athletes.