Download at this link - Sportscience

Statistical Analysis and Data Interpretation
What is significant (important) for the athlete, the statistician and team doctor?
Will Hopkins, will@clear.net.nz, sportsci.org/will

What is a Statistic?
- Simple, effect, and inferential statistics.
Making Clinical and Non-clinical Inferences
- Sampling variation; true effects; confidence limits; null-hypothesis significance test; magnitude-based inference; individual differences and responses.
Clinically Important Effects
- For differences and changes in means; correlations; slopes or gradients; ratios of proportions, risks, odds, hazards, counts.
Monitoring Individual Athletes
- Subjective and objective assessments; error of measurement.

What is a Statistic?
- Definition: a number summarizing an aspect of many numbers.
- Examples: mean, correlation, confidence limit…
- If the many numbers all represent different values of the same kind of thing, we call the numbers values of a numeric variable.
  • Example: 57, 73, 61, 60 kg are values of the variable body mass.
  • Values of a variable all have the same units.
- A nominal or grouping variable has levels or labels rather than numeric values.
  • Example: union, league, touch… are levels of the variable rugby.
- Utility: a statistic usually represents the big picture or some other important aspect of the original numbers.
- The aspect is often not obvious in the original numbers.
- One number is better than many.
  • Most people hate numbers. The fewer, the better!

- Simple statistic: an aspect of a set of values of one variable.
- Sample size (n): the number of values.
- Mean: the average value or center of the values.
- Standard deviation (SD): the average scatter around the mean.
  • Used to evaluate magnitudes of differences in means.
- Standard error of the mean (SD/√n): the expected variation in the mean with resampling.
  • A tricky statistical dinosaur. Avoid!
  • Convert back to the SD when you see it.
- Quantiles (median, tertiles, quartiles, quintiles…): values that divide the ranked set up into 2, 3, 4, 5… equal-sized subsets.
  • Used when the set is skewed by large values (e.g., salaries).
  • Also used to compare subgroups. Example: systolic pressure in the quintile of lowest physical activity vs each quintile of higher activity.
- Proportion or risk: the number of "events" (e.g., injured players) divided by the number of "trials" (total number of players).
  • Often expressed as a percent (proportion × 100).

- Effect statistic: a relationship between a predictor or independent variable and a dependent or outcome variable.
- Difference (or change) in mean: the predictor is a grouping variable and the dependent is numeric.
- Slope (or gradient): the difference or change in the mean per difference in a numeric predictor.
- Correlation coefficient: another form of the slope.
- Ratio of proportions, risks, odds or hazards: statistics for comparing the occurrence (presence or absence) of something in two groups.
- Ratio of counts: statistics for comparing counts or occurrences of something in two groups.
- Other variables can be included in the analysis as covariates.
  • Moderators are interacted with the predictor to estimate how the effect differs between subjects.
  • Mediators are added to adjust for effects of subject characteristics, which means: "for subjects of the same age…, the effect was…". Such adjustment also deals with potential confounding (by age…).

- Inferential statistic: an aspect of the "true" value of a simple or effect statistic derived from a sample.
- Confidence interval or limits: the likely range of the true value.
- P value: provides evidence about the zero or null value of an effect.
- Chance of benefit, risk of harm: provide evidence about the true value for making clinical decisions.
- T, F, chi-squared statistics: "test" statistics used to get the above.
  • Only the statistician needs to know about these.
  • They shouldn't be shown in publications.

Making Clinical Inferences (Decisions or Conclusions)
- Every sample gives a different value for a statistic, owing to sampling variation.
- So, the value of a sample statistic is only an estimate of the true (right, real, actual, very large sample, or population) value.
- But people want to make an inference about the true value.
- The best inferential statistic for this purpose is the confidence interval: the range within which the true value is likely to fall.
- "Likely" is usually 95%, so there is a 95% chance the true value is included in the confidence interval (and a 5% chance it is not).
- Confidence limits are the lower and upper ends of the interval.
- The limits represent how small and how large the effect "could" be.
- All effects should be shown with a confidence interval or limits.
- Example: the dietary treatment produced an average weight loss of 3.2 kg (95% confidence interval 1.6 to 4.8 kg).
  • The confidence interval is NOT a range of individual responses!
- But confidence limits alone don't provide a clinical inference.

- Statistical significance is the traditional way to make inferences.
- Also known as the null-hypothesis significance test.
- The inference is all about whether the effect could be zero or "null".
- If the 95% confidence interval includes zero, the effect "could be zero". The effect is "statistically non-significant (at the 5% level)".
[Figure: 95% confidence intervals plotted against the value of the effect statistic (e.g., change in weight), from negative through zero (null) to positive: statistically non-significant (p=0.31); statistically significant (p=0.02); statistically significant (p=0.003). Researchers using p values should show exact values.]
- If the confidence interval does not include zero, the effect "couldn't be zero". The effect is "statistically significant (at the 5% level)".
- Stats packages calculate a probability or p value for deciding whether an effect is significant.
  • p>0.05 means non-significant; p<0.05 means significant.
- The exact definition of the p value is hard to understand.
  • Useful interpretation: half the p value is the probability the true effect is negative when the sample effect is positive (and vice versa).

- People usually interpret non-significant as "no real effect" and significant as "a real effect".
- These interpretations apply only if the study was done with the right sample size.
- Even then they are misleading: they don't convey the uncertainty.
- And you hardly ever know if the sample size is right.
- Attempts to address this problem with post-hoc power calculations are rare, generally wrong, and too hard to understand.
- So the only safe interpretation is whether the effect could be zero.
- But the issue for the practitioner is not whether the effect could be zero, but whether the effect could be important.
  • Important has two meanings: beneficial and harmful.
- The confidence interval addresses this issue, when clinically important values for benefit and harm are taken into account.

Clinical inferences with the confidence interval
- The smallest clinically important effects define values of the effect that are beneficial, harmful and trivial.
  • Smallest effects for benefit and harm are equal and opposite.
- Infer (decide) the outcome from the confidence interval, as follows:
[Figure: confidence intervals plotted against the value of the effect statistic (e.g., change in weight), which is divided into harmful, trivial and beneficial regions by the smallest clinically harmful effect and the smallest clinically beneficial effect. Clinical decisions for the intervals shown, in order: Clear: use it. Clear: use it. Clear: use it. But p>0.05! Clear: depends. Clear: don't use it. But p<0.05! Clear: don't use it. Clear: don't use it. Unclear: more data needed. P values fail where marked.]
- This approach eliminates statistical significance.
- The only issue is what level to make the confidence interval.
- To be careful about avoiding harm, you can make a conservative 99% confidence interval on the harm side.
- And to use effects only when there is a reasonable chance of benefit, you can make a 50% interval on the benefit side.
- But that's hard to understand. Consider this equivalent approach…

Clinical inferences with probabilities of benefit and harm
- The uncertainty in an effect can be expressed as the chance that the true effect is beneficial and the risk that it is actually harmful.
- You would decide to use an effect with a reasonable chance of benefit, provided it had a sufficiently low risk of harm.
- I have opted for possibly beneficial (>25% chance of benefit) and most unlikely harmful (<0.5% chance of harm).
- An effect with >25% chance of benefit and >0.5% risk of harm is therefore unclear. You'd like to use it, but you daren't.
  • Everything else is either clearly useful or clearly not worth using.
- If the chance of benefit is high (e.g., 80%), you could accept a higher risk of harm (e.g., 5%).
  • This less conservative approach has been formalized using a threshold odds ratio of 66 (odds of benefit to odds of harm).

- When an effect has no obvious benefit or harm (e.g., a comparison of males and females), the inference is only about whether the effect could be substantially positive or negative.
  • For such non-clinical inferences, use a symmetrical confidence interval, usually 90% or 99%, to decide whether the effect is clear.
  • Equivalently, one or other of the chances of being substantially positive or negative has to be <5% for the effect to be clear ("a clear non-clinical effect can't be substantially positive and negative").
- Ways to report inferences for clear effects: possibly small benefit, likely moderately harmful, a large difference (clear at 99% level), a trivial-moderate increase [the lower and upper confidence limits]…
  • Whatever the method, researchers should make a magnitude-based inference by showing confidence limits and interpreting the uncertainty in a (clinically) relevant way that readers can understand.
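The chances of benefit and harm described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's own spreadsheet or method: it assumes a normal sampling distribution (a t distribution may be more appropriate for small samples), it assumes positive values of the effect are beneficial, and the function names and the smallest important effect of 1.0 kg in the usage lines are hypothetical.

```python
from statistics import NormalDist


def chances(effect, lower, upper, smallest, level=0.95):
    """Return (chance of benefit, risk of harm) for an effect with confidence limits.

    Assumption: normal sampling distribution; positive effects are beneficial;
    `smallest` is the smallest clinically important effect (equal and opposite
    for benefit and harm).
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)   # e.g. about 1.96 for 95% limits
    se = (upper - lower) / (2 * z)              # standard error recovered from the limits
    true_value = NormalDist(mu=effect, sigma=se)
    benefit = 1 - true_value.cdf(smallest)      # chance the true effect exceeds +smallest
    harm = true_value.cdf(-smallest)            # chance the true effect is below -smallest
    return benefit, harm


def clinical_decision(benefit, harm, min_benefit=0.25, max_harm=0.005):
    """Apply the >25% chance of benefit and <0.5% risk of harm rule."""
    if benefit > min_benefit and harm < max_harm:
        return "use"
    if benefit > min_benefit:
        return "unclear"          # you'd like to use it, but you daren't
    return "don't use"


# Weight-loss example from the slides (3.2 kg, 95% limits 1.6 to 4.8 kg),
# with a hypothetical smallest important loss of 1.0 kg.
benefit, harm = chances(3.2, 1.6, 4.8, smallest=1.0)
print(clinical_decision(benefit, harm))
```

With a wider interval, such as an effect of 0.5 with limits of -2.0 to 3.0, the same rule returns "unclear", because the chance of benefit exceeds 25% while the risk of harm exceeds 0.5%.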
A caution about making an inference…
- Whatever method you use, the inference is about the one and only mean effect in the population.
- The confidence interval represents the uncertainty in the true effect, not a range of individual differences or individual responses.
  • For example, with a large-enough sample size, a treatment could be clearly beneficial (a mean beneficial effect with a narrow confidence interval), yet the treatment could be harmful for a substantial proportion of the population.
- Individual differences between groups and individual responses to a treatment are best summarized with a standard deviation to go with the mean effect.
  • The mean effect and the SD both need confidence limits.
- Individual differences between groups and individual responses to a treatment may be accounted for by including subject characteristics as modifying covariates in the analysis.
- Researchers generally neglect this important issue.

Clinically Important Magnitudes of Effects
- Researchers and practitioners need to know about clinically important magnitudes to interpret research findings.
- Researchers need the smallest clinically important magnitude of an effect statistic to estimate sample size for a study.
- For those who use the null-hypothesis significance test, the right sample size has 80% power (80% chance of statistical significance, p<0.05) if the true effect has the smallest important value.
- For those who use clinical magnitude-based inference, the right sample size gives a 0.5% risk of harm and a 25% chance of benefit if the true effect has the smallest important beneficial value.
- Practitioners need to know about clinically important magnitudes to monitor their athletes or patients.
- So the next few slides are all about values for various magnitudes of various effect statistics.

Differences or Changes in the Mean
- The most common effect statistic, for numbers with decimals (continuous variables).
- Difference when comparing different groups, e.g., patients vs healthy.
- In population-health studies, groups are often subdivided into quartiles or quintiles (e.g., of age).
- Change when tracking the same subjects.
- Difference in the changes in controlled trials.
[Figures: strength for patients vs healthy; strength at trials pre, post1 and post2. Data are means & SD.]
- The between-subject standard deviation provides default thresholds for important differences and changes.
- You think about the effect (the difference or change in the mean) in terms of a fraction or multiple of the SD (difference in mean/SD).
- The effect is said to be standardized.
- The smallest important effect is ±0.20 (±0.20 of an SD).
- Example: the effect of a treatment on strength.
[Figure: distributions of strength pre and post treatment for a trivial effect (0.1× SD) and a very large effect (3.0× SD).]
- Interpretation of standardized difference or change in means:

             trivial   small     moderate   large     very large   extremely large
  Cohen      <0.2      0.2-0.5   0.5-0.8    >0.8      ?            ?
  Hopkins    <0.2      0.2-0.6   0.6-1.2    1.2-2.0   2.0-4.0      >4.0

- Complete scale: trivial 0.2 small 0.6 moderate 1.2 large 2.0 very large 4.0 ext. large

Cautions with standardizing
- Standardizing works only when the SD comes from a sample that is representative of a well-defined population.
  • The resulting magnitude applies only to that population.
- In a controlled trial, use the baseline (pre) SD, never the SD of change scores.
- Beware of authors who show standard errors of the mean (SEM) rather than standard deviations (SD).
  • SEM = SD/√(sample size), so SEMs on graphs make effects look a lot bigger than they really are.
  • Very rarely, overlap of SEM of two groups indicates that the difference between the means is not statistically significant.
  • But you won't know when that applies, and you're not using or trusting statistical significance anymore anyway, right?
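As a minimal sketch of the two calculations above, the following Python converts a reported SEM back to an SD and classifies a standardized difference on the Hopkins scale (0.2, 0.6, 1.2, 2.0, 4.0). The function names and the numbers in the usage lines are hypothetical; the thresholds come from the scale above.

```python
import math

# Upper bound of each magnitude band on the Hopkins scale for standardized effects.
THRESHOLDS = [
    (0.2, "trivial"),
    (0.6, "small"),
    (1.2, "moderate"),
    (2.0, "large"),
    (4.0, "very large"),
]


def sd_from_sem(sem, n):
    """Convert a standard error of the mean back to the SD: SD = SEM x sqrt(n)."""
    return sem * math.sqrt(n)


def standardized(difference, sd):
    """Difference or change in the mean as a fraction or multiple of the SD."""
    return difference / sd


def magnitude(std_effect):
    """Label the size of a standardized effect, ignoring its sign."""
    size = abs(std_effect)
    for upper, label in THRESHOLDS:
        if size < upper:
            return label
    return "extremely large"


# Hypothetical example: a paper reports SEM = 2.0 for n = 25 at baseline,
# and a difference in means of 5.0 in the same units.
sd = sd_from_sem(2.0, 25)                   # 10.0
print(magnitude(standardized(5.0, sd)))     # small
```

In a controlled trial the SD passed to `standardized` would be the baseline (pre) SD, in keeping with the caution above.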
- Standardization may not be best for effects on means of some special variables: visual-analog scales, Likert scales, solo athletic performance…

Visual-analog scales
- The respondents indicate a perception on a line like this: "Rate your pain by placing a mark on this scale", with the line anchored by "none" at one end and "unbearable" at the other.
- Score the response as percent of the length of the line.
- Magnitude thresholds: 10%, 30%, 50%, 70%, 90% for small, moderate, large, very large, extremely large differences or changes.

Likert scales
- These are used for responses to questions like this: "Over the last four weeks, how often did you train in a gym?" with the choices: not at all; once only; 2-3 times; once a week; twice or more a week.
- Most Likert-type questions have four to seven choices.
- Code them as integers (1, 2, 3, 4, 5…) and analyze as numerics.
- Magnitude thresholds are up for debate.
  • If you use the thresholds of the visual-analog scale as a guide, the thresholds for a 6-pt scale would be ~0.5, 1.5, 2.5, 3.5 and 4.5.

Solo athletic performance
- For fitness tests and performance indicators of team-sport athletes, use standardization.
- But for top solo athletes, an enhancement that results in one extra medal per 10 competitions is the smallest important effect.
  • The within-athlete variability that athletes show from one competition to the next determines this effect. Here's why…
  • Owing to this variability, each of the top athletes has a good chance of winning at each competition.
[Figure: Race 1, Race 2, Race 3.]
- Your athlete needs an enhancement that overcomes this variability to give her or him a bigger chance of a medal.
- Simulations show an enhancement of 0.3 of an athlete's typical variability from competition to competition gives one extra win every 10 competitions.
  • Example: if the variability is an SD (coefficient of variation) of 1%, the smallest important enhancement is 0.3%.
  • In some early publications I have mistakenly referred to 0.5 of the variability as the smallest effect.
- Small, moderate, large, very large and extremely large effects result in an extra 1, 3, 5, 7 and 9 medals in every 10 competitions.
- The corresponding enhancements as factors of the variability are: trivial 0.3 small 0.9 moderate 1.6 large 2.5 very large 4.0 ext. large
- Beware: smallest effect on athletic performance in performance tests depends on method of measurement, because…
- A percent change in an athlete's ability to output power results in different percent changes in performance in different tests.
- These differences are due to the power-duration relationship for performance and the power-speed relationship for different modes of exercise.
- Example: a 1% change in endurance power output produces the following changes…
  • 1% in running time-trial speed or time;
  • ~0.4% in road-cycling time-trial time;
  • 0.3% in rowing-ergometer time-trial time;
  • ~15% in time to exhaustion in a constant-power test.
  • A hard-to-interpret change in any test following a fatiguing pre-load. (But such tests can be interpreted for cycling road races: see Bonetti and Hopkins, Sportscience 14, 63-70, 2010.)

Slope (or Gradient)
- Used when the predictor and dependent are both numeric and a straight line fits the trend.
[Figure: physical activity vs age, with a range of 2 SD of age marked.]
- The unit of the predictor is arbitrary.
- Example: a 2% per year decline in activity seems trivial… yet 20% per decade seems large.
- So it's best to express a slope as the difference in the dependent per two SDs of predictor.
  • It gives the difference in the dependent (physical activity) between a typically low and high subject.
  • The SD for standardizing the resulting effect is the standard error of the estimate (the scatter about the line).

Correlation Coefficient
- Closely related to the slope, this represents the overall linearity in a scatterplot.
[Figure: example scatterplots for r = 0.00, 0.10, 0.30, 0.50, 0.70, 0.90 and 1.00.]
- Negative values represent negative slopes.
- The value is unaffected by the scaling of the two variables or by the sample size.
- And it's much easier to calculate than a slope.
- But a properly calculated slope is easier to interpret clinically.
- Smallest important correlation is ±0.1. Complete scale: trivial 0.1 low 0.3 moderate 0.5 high 0.7 very high 0.9 ext. high

Differences and Ratios of Proportions, Risks, Odds, Hazards
- Example: percent of male and female players injured at all in a season of touch rugby.
[Figure: proportion injured (%) by sex: males a = 75%, females b = 36%.]
- Risk difference or proportion difference
- A common measure. Example: a - b = 75% - 36% = 39%.
- Problem: the sense of magnitude of a given difference depends on how big the proportions are.
  • Example: for the same 10% difference, 90% vs 80% doesn't seem big, but 11% vs 1% can be interpreted as a huge "difference" (11x the risk).
- So there is no scale of magnitudes for a risk or proportion difference.
- And analyses (models) don't work properly with proportions.
  • We have to use odds or hazards instead of proportions. Stay tuned.
- Number needed to treat (NNT) = 100/(risk difference (%)).
- The number you would have to treat or sample for one subject to have an outcome attributable to the effect.
  • Example: one male in 2.6 (=1/0.39) is injured because he's a male.
- Has been promoted in some clinical journals, but not widely used.
- Hard to analyze properly, and problems with its confidence limits.
- Avoid!

- Risk ratio (relative risk) or proportion ratio
[Figure: proportion injured (%) by sex: males a = 75%, females b = 36%.]
- Another common measure. Example: a/b = 75/36 = 2.1, which means males are "2.1 times more likely" to be injured, or "a 110% increase in risk" of injury for males.
- Problem: if it's a time-dependent measure, the risk ratio changes.
  • If you wait long enough, everyone gets affected, so risk ratio = 1.00.
- But it works for rare time-dependent risks and for time-independent classifications (e.g., proportion playing a sport).
- Hence we need values for the smallest and other important ratios for risks and proportions.
- The smallest ratio is when one event or case in every 10 is due to the effect.
  • Example: one in 10 injuries is due to being male.
  • That is, for every 10 injured males, there are 9 injured females.
  • If there are N males and N females (injured and uninjured), the injury risks are 10/N and 9/N, and the risk ratio = (10/N)/(9/N) = 10/9.
- For moderate, large, very large and extremely large ratios, for every 10 injured males, there are 7, 5, 3 and 1 injured females.
  • Corresponding risk ratios are 10/7, 10/5, 10/3 and 10/1.
- Hence this scale for proportion ratio and low-risk ratio: trivial 1.11 small 1.43 moderate 2.0 large 3.3 very large 10 ext. large
  • and the inverses for reductions in proportions: 0.9, 0.7, 0.5, 0.3, 0.1.
- But there is still the problem of analyzing proportions properly.
  • Two solutions: hazards instead of risks; odds instead of proportions.

- Hazard ratio for time-dependent events.
[Figure: proportion injured (%) vs time (months) for males and females; the slopes over 1 day give the hazards a (males) and b (females).]
- To understand hazards, consider the increase in proportion or risk with time.
- The hazard is the tiny proportion that gets affected per tiny interval of time.
- Example: hazard for males = a = 0.28% per day, hazard for females = b = 0.11% per day. So hazard ratio = a/b = 0.28/0.11 = 2.5.
  • That is, males are 2.5x more likely to get injured per unit time, whatever the (small) unit of time.
- So you could call it the "right-now risk ratio".
- It's also known as incidence rate ratio, which is the ratio of the slopes.
- It can also be interpreted as the ratio of the times taken for the same proportion to get affected in two groups.
  • Example: females take 2.5x as long to get injured as males.
- Hazard ratios work over long periods, when a substantial proportion of males or females is injured, and the observed risk ratio drops below the initial hazard ratio.
[Figure: proportion injured (%) vs time (months) for males (a) and females (b).]
  • Example: at 5 weeks, the risk ratio = a/b = 75/36 = 2.1.
- But the hazard ratio for those still uninjured is usually assumed to stay the same, even if the hazards change with time.
  • Example: the risk of injury might increase later in the season for both sexes, but the right-now risk ratio for new injuries (the hazard ratio) doesn't change. A big plus!
- And hazards and hazard ratios can be modeled (analyzed)!
- Magnitude thresholds must be the same as for the proportion ratio, even for frequent events, because such events start off rare.
- Hence this scale for the hazard ratio: trivial 1.11 small 1.43 moderate 2.0 large 3.3 very large 10 ext. large
  • and the inverses 0.9, 0.7, 0.5, 0.3, 0.1.

- Odds ratio for time-independent classifications.
- Classifications refer to prevalence; risks refer to incidence.
- Odds are the awkward but only way to model classifications.
- Example: proportions of boys and girls playing a sport.
[Figure: proportion playing (%) by sex: boys a = 75% playing and c = 25% not playing; girls b = 36% playing and d = 64% not playing.]
  • Odds of a boy playing = a/c = 75/25.
  • Odds of a girl playing = b/d = 36/64.
  • Odds ratio = (75/25)/(36/64) = 5.3.
- Interpret the ratio as "…times more likely" only when the proportions in both groups are small (<10%).
  • The odds ratio is then approximately equal to the proportion ratio.
- To assess magnitude, authors should convert the odds ratio and its confidence limits to the proportion ratio and its confidence limits.
  • Unfortunately they often just leave effects as odds ratios.

Ratio of Counts
- Example: 93 vs 69 injuries per 1000 player-hours of match play in sport A vs sport B.
- The effect is expressed as a ratio: 93/69 = 1.35x more injuries.
- Can also be expressed as 35% more injuries.
- The scale of magnitudes is the same as for ratio of proportions: trivial 1.11 small 1.43 moderate 2.0 large 3.3 very large 10 ext. large
- and the inverses 0.9, 0.7, 0.5, 0.3, 0.1.
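The ratio calculations in the preceding slides can be reproduced with a short Python sketch. It uses the numbers from the examples above and the thresholds 10/9, 10/7, 10/5, 10/3 and 10/1; the function names are hypothetical, and ratios below 1 are inverted before classification, which matches the listed inverses (0.9, 0.7, 0.5, 0.3, 0.1).

```python
# Upper bound of each magnitude band for ratios of proportions, hazards and counts.
RATIO_THRESHOLDS = [
    (10 / 9, "trivial"),
    (10 / 7, "small"),
    (10 / 5, "moderate"),
    (10 / 3, "large"),
    (10 / 1, "very large"),
]


def odds(proportion):
    """Odds corresponding to a proportion between 0 and 1."""
    return proportion / (1 - proportion)


def ratio_magnitude(ratio):
    """Label a ratio on the scale above; reductions (ratio < 1) are inverted first."""
    if ratio <= 0:
        raise ValueError("a ratio must be positive")
    size = ratio if ratio >= 1 else 1 / ratio
    for upper, label in RATIO_THRESHOLDS:
        if size < upper:
            return label
    return "extremely large"


risk_ratio = 0.75 / 0.36                 # injured males vs females
odds_ratio = odds(0.75) / odds(0.36)     # boys vs girls playing a sport
hazard_ratio = 0.28 / 0.11               # % per day, males vs females
count_ratio = 93 / 69                    # injuries per 1000 player-hours

print(round(risk_ratio, 1), ratio_magnitude(risk_ratio))      # 2.1 large
print(round(odds_ratio, 1))                                   # 5.3
print(round(hazard_ratio, 1), ratio_magnitude(hazard_ratio))  # 2.5 large
print(round(count_ratio, 2), ratio_magnitude(count_ratio))    # 1.35 small
```

The odds ratio is deliberately not classified here, because the slides recommend converting it to a proportion ratio before assessing its magnitude.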
- Effects of numeric linear predictors (slopes) for ratio outcomes are expressed as risk, odds, hazard or count ratios per unit of the predictor and evaluated as the effect per 2 SD of the predictor.

Modeling Effects
- Estimates and inferential statistics for mean effects and slopes come from various kinds of general linear model…
- t tests, simple and multiple linear regression, ANOVA…
- Use mixed linear models for repeated measures and clustering.
- Testing for normality is pointless; uniformity is the real issue.
  • Many effects are more uniform when estimated as percents or ratios via analysis of the log-transformed dependent variable.
- Bootstrapping of confidence limits works with difficult data.
- Ratios of odds, hazards and counts need various kinds of generalized linear model…
- All include log transformation to estimate ratios.
- Logistic (log-odds) regression for odds, log-hazard and Cox regression for hazards, Poisson regression for counts.
- And don't forget that covariates in all these models estimate and adjust for effects of moderators and mediators or confounders.

Monitoring Individual Athletes
- It's all about a substantial change since the last assessment.
- The subjective assessments (perceptions) of the athlete, coach, and support personnel provide important evidence.
- One-off assessments often differ between individual practitioners, but assessments of change usually have high validity.
- Objective assessments of change with an instrument or test are contaminated with error or "noise".
- The noise is represented by the standard deviation of repeated measurements, the standard (or typical) error of measurement.
- Think of ± the error as the equivalent of confidence limits for the athlete's true change.
- Take into account clinically or practically important changes.
  • Wow, you've made a moderate improvement!
  • No real change either way. [A good instrument needed for this.]
  • Uh… unclear whether you're getting better or worse.
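One way to turn the monitoring advice above into a rule is sketched below in Python. It treats the observed change ± the typical error as the likely range of the athlete's true change and compares that range with the smallest important change. The function name, the four outcome labels, the assumption that positive changes are improvements, and the numbers in the usage lines are all hypothetical simplifications rather than the author's procedure.

```python
def assess_change(observed_change, typical_error, smallest_important):
    """Classify an athlete's change since the last assessment.

    Assumptions: positive change = improvement; observed change +/- typical
    error is treated as the likely range of the true change.
    """
    lower = observed_change - typical_error
    upper = observed_change + typical_error
    if lower > smallest_important:
        return "improved"
    if upper < -smallest_important:
        return "worse"
    if lower >= -smallest_important and upper <= smallest_important:
        return "no real change"     # needs an instrument with little noise
    return "unclear"


# Hypothetical test scores in arbitrary units.
print(assess_change(3.0, typical_error=1.0, smallest_important=1.5))  # improved
print(assess_change(0.2, typical_error=0.5, smallest_important=1.0))  # no real change
print(assess_change(0.5, typical_error=2.0, smallest_important=1.0))  # unclear
```

The second call shows why a good instrument is needed to conclude "no real change": the typical error has to be small relative to the smallest important change for the whole range to stay inside the trivial zone.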
Summary
- Inferential statistics are used to draw conclusions about the true value of a simple or effect statistic derived from a sample.
- The inference from a null-hypothesis significance test is about whether the true value of an effect statistic could be null (zero).
- Magnitude-based inference addresses the issue of whether the true value could be important (beneficial and harmful, or substantial).
- Effect magnitudes have key roles in research and practice.
- Effects for continuous dependents are mean differences, slopes (expressed per 2 SD of the predictor), and correlations.
- Thresholds for small, moderate, large, very large and extremely large standardized mean differences: 0.20, 0.60, 1.2, 2.0, 4.0.
- Thresholds for correlations: 0.10, 0.30, 0.50, 0.70, 0.90.
- Magnitude thresholds for ratios of proportions, hazards, counts: 1.11, 1.43, 2.0, 3.3, 10 and their inverses 0.9, 0.7, 0.5, 0.3, 0.1.
- Take noise and thresholds into account when monitoring athletes.
