Achim Tresch
UoC / MPIPZ
Cologne
Statistics
treschgroup.de/OmicsModule1415.html
tresch@mpipz.mpg.de
II. Testing
Induction from the sample to the population
Significance testing: from a difference observed in the sample to the questions: Is there a difference in the population? What is the probability of a false call?
Estimation, regression: from a measure computed in the sample to the questions: What is the measure in the population? What is its variance? What are the confidence intervals?
What allows us to draw conclusions about the population from the sample?
The sample has to be representative (figures about drug abuse among students cannot be generalized to the whole population of Germany).
How is representativity achieved?
Large sample numbers.
Random recruitment of samples from the population, e.g. dial a random phone number, or choose a random name from the register of births (advantages/disadvantages?).
Randomization: random allocation of the samples to the different experimental groups.
A non-sheep detector
Training: Measure the length of all sheep that cross your way. Determine the distribution of the quantity of interest.
(Figure: histogram of sheep lengths, length [cm], roughly 70-140)
A non-sheep detector
Testing: For any unknown animal, test the hypothesis that it is a sheep. Measure its length and compare it to the learned length distribution of the sheep. If its length is "out of bounds", the animal will be called a non-sheep (rejection of the hypothesis). Otherwise, we cannot say much (non-rejection).
(Figure: sheep length distribution, length [cm] 70-140; an animal far in the tail is called "not a sheep")
A non-sheep detector
Advantage of the method: One does not need to know much about sheep.
Disadvantage: It produces errors…
(Figure: sheep length distribution, length [cm] 70-140, split by a decision boundary into positive and negative calls; the regions contain true positives, false positives, true negatives and false negatives)
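To make this concrete, here is a minimal sketch (not from the lecture) of such a one-sided threshold detector in Python: the decision boundary is taken as a low quantile of the lengths seen during training, and anything shorter is called a non-sheep. All numbers and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Training": lengths (cm) of sheep that crossed our way (illustrative data)
sheep_lengths = rng.normal(loc=105, scale=10, size=500)

# Decision boundary: call everything below the 5% quantile a non-sheep,
# i.e. accept that 5% of true sheep will be falsely rejected
alpha = 0.05
boundary = np.quantile(sheep_lengths, alpha)

def is_non_sheep(length_cm: float) -> bool:
    """Reject the 'it is a sheep' hypothesis if the length is out of bounds."""
    return length_cm < boundary

print(f"decision boundary: {boundary:.1f} cm")
print(is_non_sheep(72.0))   # very short animal: called a non-sheep
print(is_non_sheep(110.0))  # consistent with the learned sheep distribution
```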
Statistical Hypothesis Testing
State a null hypothesis H0 ("nothing happens, there is no difference…").
Choose an appropriate test statistic (the data-derived quantity that finally leads to the decision). This implicitly determines the null distribution (the distribution of the test statistic under the null hypothesis).
Statistical Hypothesis Testing
State an alternative hypothesis (e.g. "the test statistic is higher than expected under the null hypothesis").
Determine a decision boundary d. This is equivalent to the choice of a significance level α, i.e. the fraction of false positive calls you are willing to accept.
(Figure: null distribution with decision boundary d separating the acceptance region from the rejection region, which has probability α)
Statistical Hypothesis Testing
Calculate the actual value of the test statistic in the sample, and make your decision according to the pre-specified(!) decision boundary.
(Figure: values of the test statistic below d keep H0 (no rejection); values above d reject H0 (assume the alternative hypothesis))
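As a concrete illustration (not part of the slides), here is a minimal Python sketch assuming a one-sided alternative and a standard normal null distribution: the decision boundary d is the (1 − α) quantile of the null distribution, and the observed statistic is compared against it.

```python
from scipy.stats import norm

alpha = 0.05              # pre-specified significance level
d = norm.ppf(1 - alpha)   # decision boundary: (1 - alpha) quantile of the null distribution

t_observed = 2.3          # illustrative value of the test statistic
if t_observed > d:
    print(f"t = {t_observed} > d = {d:.2f}: reject H0")
else:
    print(f"t = {t_observed} <= d = {d:.2f}: keep H0")
```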
Good test statistics, bad test statistics
Good statistic: the distribution of the test statistic under the null hypothesis and its distribution under the alternative hypothesis are well separated by the decision boundary d.
(Figure: two clearly separated distributions with decision boundary d)
Decision table: if the null hypothesis is true, accepting it is the right decision and rejecting it is a Type I error (False Positive); if the alternative is true, accepting the null hypothesis is a Type II error (False Negative) and rejecting it is the right decision.
Good test statistics, bad test statistics
Bad statistic: the distribution of the test statistic under the null hypothesis and its distribution under the alternative hypothesis overlap almost completely, so the decision boundary d cannot separate them.
(Figure: two strongly overlapping distributions with decision boundary d)
The same decision table applies: Type I errors (False Positives) and Type II errors (False Negatives) as above.
The Offenbach Oracle
(Toni, 29, Offenbach, mechanic and moral philosopher)
Throw the 20-sided die.
Score = 20: reject the null hypothesis.
Score ≠ 20: keep the null hypothesis.
This is (independent of the null hypothesis) a valid statistical test at a 5% type I error level!
The Offenbach Oracle
But: the distribution of the test statistic under the null hypothesis and under the alternative hypothesis is identical. This test cannot discriminate between the two alternatives!
(Figure: the die score is uniformly distributed on 1-20 both under H0 and under H1)
95% of the Positives (as well as of the Negatives) receive a negative call, i.e. 95% of the true Positives will be missed.
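A quick simulation (not from the slides) makes this concrete: the oracle rejects in about 5% of cases regardless of whether H0 or the alternative is true, so it keeps the promised type I error level but misses about 95% of true positives.

```python
import numpy as np

rng = np.random.default_rng(1)
n_experiments = 100_000

# The oracle: throw a 20-sided die, reject H0 iff the score is 20
scores = rng.integers(1, 21, size=n_experiments)
rejections = scores == 20

# The rejection rate is about 5% no matter whether H0 or H1 is true,
# because the die knows nothing about the data
print(f"rejection rate: {rejections.mean():.3f}")                         # about 0.05
print(f"fraction of true positives missed: {1 - rejections.mean():.3f}")  # about 0.95
```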
The p-value
Given a test statistic and its actual value t in a sample, a p-value can be calculated: each test value t maps to a p-value, the probability of observing a value of the test statistic which is at least as extreme as the actual value t [under the assumption of the null hypothesis].
(Figure: null distribution with the observed value t = 4.2 in the right tail; the tail area gives p = 0.08)
The p-value
(Figure: the same null distribution with a less extreme observed value t = 0.75; the tail area gives p = 0.42)
Test decisions according to the p-value
The decision boundary d corresponds to the significance level α; the observed test statistic t corresponds to the p-value.
If t is less extreme than d, i.e. p ≥ α: keep H0 (no rejection). Example: p = 0.83 at α = 0.05.
If t is more extreme than d, i.e. p < α: reject H0 (assume the alternative hypothesis). Example: p = 0.02 at α = 0.05.
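As an illustration (again assuming a one-sided alternative and a standard normal null distribution, which is not the distribution drawn on the slides), the p-value is the tail probability beyond the observed statistic and can be compared directly to α:

```python
from scipy.stats import norm

alpha = 0.05
t_observed = 1.8               # illustrative value of the test statistic

# One-sided p-value: probability of a value at least as extreme as t under H0
p_value = norm.sf(t_observed)  # survival function = 1 - CDF

if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.3f} >= {alpha}: keep H0")
```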
One- and two-sided hypotheses
One-sided alternative:
H0: The value of a quantity of interest in group A is not higher than in group B.
H1: The value of a quantity of interest in group A is higher than in group B.
(Figure: blood pressure reduction [mmHg] from -10 to 15; acceptance region below the boundary, rejection region above it)
One- and two-sided hypotheses
Two-sided alternative:
H0: The quantity of interest has the same value in group A and group B.
H1: The quantity of interest is different in group A and group B.
Generally, two-sided alternatives are more conservative: deviations in both directions are detected.
(Figure: blood pressure reduction [mmHg] from -10 to 15; rejection regions on both sides, acceptance region in the middle)
Example "Testing": Colon Carcinoma
Variable: vaccine (scale: binary)
Endpoint: 4-year survival (scale: binary)
What about these figures? 32 · 94% ≈ 30 and (62 − 32) · 77% ≈ 23.
Example "Testing": Colon Carcinoma

Vaccine       4-year survival: yes    4-year survival: no
yes (n=32)    30 (94%)                2 (6%)
no (n=30)     23 (77%)                7 (23%)

Interesting questions:
Does the vaccine yield any effect?
Is this effect "significant"?
Example "Testing": Colon Carcinoma
Null hypothesis H0: Vaccination has no impact (neither positive nor negative) on the patients. The survival rates in the vaccine and non-vaccine group in the whole population are the same.
Alternative hypothesis H1: For the whole population, the survival rates in the vaccine and non-vaccine group are different.
Choose the significance level α (usually α = 5%, 1% or 0.1%).
Interpretation of the significance level α: if there is no difference between the groups, one obtains a false positive result with a probability of α.
Example "Testing": Colon Carcinoma
Choice of test statistic: Fisher's Exact Test.
Sir Ronald Aylmer Fisher, 1890-1962: theoretical biology, evolutionary theory, statistics.
Example "Testing": Colon Carcinoma
After the experiment has been carried out, the value of the test statistic t is calculated. This value can be converted into a p-value: p = 0.0766 ≈ 7.7%.
Since we have chosen a significance level α = 5% and p > α, we cannot reject the null hypothesis, thus we keep it.
Formulation of the result: At a 5% significance level (and using Fisher's Exact Test), no significant effect of vaccination on survival could be detected.
Consequence: We are not (yet) sufficiently convinced of the utility of this therapy.
But this does not mean that there is no difference at all!
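Such a p-value can be reproduced with standard software. A minimal sketch in Python using scipy's implementation of Fisher's exact test on the 2×2 table from this example:

```python
from scipy.stats import fisher_exact

# 2x2 table from the slide: rows = vaccine yes/no, columns = survived yes/no
table = [[30, 2],
         [23, 7]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")  # p should be close to 0.0766
```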
Non-significance ≠ equivalence
Statistics can never prove a hypothesis, it can only provide evidence.
"No test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of a hypothesis."
Jerzy Neyman (1894-1981), Egon Pearson (1895-1980)
Neyman J, Pearson E (1933) Phil Trans R Soc A
I. Description
Confidence intervals
95% confidence interval: an estimated interval which contains the "true value" of a quantity with a probability of 95%.
(Figure: interval estimate (20.5, 29.5) around the point estimate 24.3, e.g. the percentage of votes for the SPD in the EU elections)
(1 − α) confidence interval: an estimated interval which contains the "true value" of a quantity with a probability of (1 − α).
1 − α = confidence level, α = error probability.
Use confidence intervals with caution!
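As a small illustration (not from the slides, counts invented), a normal-approximation 95% confidence interval for such a vote share could be computed like this:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical poll: 243 of 1000 respondents intend to vote for the party
successes, n = 243, 1000
p_hat = successes / n

alpha = 0.05
z = norm.ppf(1 - alpha / 2)                       # about 1.96 for a 95% interval
half_width = z * np.sqrt(p_hat * (1 - p_hat) / n)

print(f"point estimate: {p_hat:.3f}")
print(f"95% CI: ({p_hat - half_width:.3f}, {p_hat + half_width:.3f})")
```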
Specific statistical tests
Comparison of two group means
(Figure: gene expression measurements of gene A, …, gene B in group 1 and group 2. Which gene is expressed at a higher level?)
Two group comparison
Data: expression of gene g in different samples from group 1 and group 2.
Hypothesis: the expression of gene g in group 1 is lower than in group 2.
Test statistic, e.g. the difference of the group means, d = μ1 − μ2.
Decision for "lower expression" if d < d0.
(Figure: measurements of gene g in group 1 and group 2 with group means μ1, μ2 and their difference d)
Two group comparison
Bad idea: difference of the group means, d = μ1 − μ2.
Problem: d is not scaling invariant.
(Figure: two scenarios with the same mean difference d but very different spread within the groups)
Solution: divide d by an estimate of the standard deviation s(d) in the two groups:
t = d / s(d)
This is the t-statistic giving rise to the (unpaired) t-test.
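A minimal sketch of such an unpaired two-sample t-test in Python, using scipy's ttest_ind; the expression values are invented, and the one-sided alternative matches the hypothesis "group 1 is lower than group 2":

```python
from scipy.stats import ttest_ind

# Invented expression values of gene g in the two groups
group1 = [4.1, 3.8, 4.5, 3.9, 4.2]
group2 = [5.0, 4.8, 5.4, 4.6, 5.1]

# One-sided unpaired t-test: is the mean of group 1 lower than that of group 2?
t_stat, p_value = ttest_ind(group1, group2, alternative="less")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```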
Wilcoxon (rank sum) test (equivalent to the Mann-Whitney test)
Question: Given independent samples in group 1 and group 2, are the values in group 1 smaller than in group 2?
Measurements (raw scale):
group 1: 18, 3, 6, 9, 5
group 2: 15, 10, 8, 7, 12
Rank scale: the pooled values 3, 5, 6, 7, 8, 9, 10, 12, 15, 18 receive the ranks 1 to 10.
Rank sum of group 1: 1 + 2 + 3 + 6 + 10 = 22
Rank sum of group 2: 4 + 5 + 7 + 8 + 9 = 33
Wilcoxon (rank sum) test (equivalent to the Mann-Whitney test)
Choose the rank sum of group 1 as the test statistic W. The p-value corresponding to W can be computed exactly for small sample numbers; for large numbers, there exist good approximations.
P(W ≤ 22, given the groups do not differ in their location) = 0.15
(Figure: rank sum distribution of group 1 for |group 1| = 5, |group 2| = 5, with the observed W = 22 marked)
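The same result can be obtained with scipy's implementation of the Wilcoxon rank sum / Mann-Whitney test on the data above. Note that scipy reports the equivalent U statistic rather than the rank sum W (they differ by a constant), but the one-sided p-value should come out at about 0.15:

```python
from scipy.stats import mannwhitneyu

group1 = [18, 3, 6, 9, 5]
group2 = [15, 10, 8, 7, 12]

# One-sided test: are the values in group 1 smaller than in group 2?
u_stat, p_value = mannwhitneyu(group1, group2, alternative="less", method="exact")
print(f"U = {u_stat}, p = {p_value:.3f}")  # p should be about 0.15
```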
Summary: Two-group comparison of a continuous variable
Question: Do the measurements in the two groups differ in their location?
Gaussian data, unpaired samples: unpaired two-sample t-test
Gaussian data, paired samples: paired two-sample t-test
Non-Gaussian data, unpaired samples: Wilcoxon rank sum test
Non-Gaussian data, paired samples: Wilcoxon signed rank test
Comparison of two binary variables
Unpaired data: Fisher's exact test
Question: Do the distributions in group 1 and group 2 differ?
Example: clinical trial, unpaired design (each test person receives only one treatment).

Medication    effect    no effect
Verum         65        7
Placebo       44        13
Odds and Odds Ratio

Coin         heads    tails
Fair coin    54       46
Bent coin    82       18

Odds (= chances):
Odds(fair coin) = 54 : 46 = 1.17
Odds(bent coin) = 82 : 18 = 4.56
Odds ratio: OR = (54/46) / (82/18) = 1.17 / 4.56 ≈ 0.26
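The arithmetic can be checked in a few lines (numbers taken from the table above):

```python
# Odds of heads for each coin and their ratio
odds_fair = 54 / 46            # about 1.17
odds_bent = 82 / 18            # about 4.56
odds_ratio = odds_fair / odds_bent
print(f"OR = {odds_fair:.2f} / {odds_bent:.2f} = {odds_ratio:.2f}")  # about 0.26
```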
Comparison of two categorical variables
Unpaired data: chi-square test (χ2 test)

Tumor size    5yr survival: no    5yr survival: yes
1             10                  8
2             20                  23
3             19                  10
4             32                  18

Null hypothesis: 5-year survival is independent of tumor size.
In this example, p < 0.001.
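For reference, a minimal sketch of how such a contingency-table test can be run in Python with scipy's chi2_contingency; the 4×2 table here is invented for illustration:

```python
from scipy.stats import chi2_contingency

# Illustrative 4x2 contingency table: rows = tumor size 1-4,
# columns = 5-year survival (no, yes); counts are made up
table = [[25,  5],
         [22, 18],
         [15, 20],
         [10, 30]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p_value:.4g}")
```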
Comparison of two categorical variables
Unpaired data: chi-square test (χ2 test)
Requirements:
Sample number sufficiently large (n ≥ 60).
Expected count not too small (≥ 5) for all possible observations.
Note that for binary data and large n, the chi-square test and Fisher's exact test are equivalent.
Summary: Comparison of two categorical variables
Question: Do there exist differences in the distribution of one variable if grouped by the second variable?
Binary data, unpaired: Fisher's exact test
Binary data, paired: McNemar test
Categorical (non-binary) data, unpaired: chi-square (χ2) test
Categorical (non-binary) data, paired: (Bowker symmetry test)
Summary: Description and Testing

Variable      Design        Description (numerical)             Description (graphical)    Test
Continuous    two sample    medians, quartiles                  2 boxplots                 Wilcoxon rank sum test, t-test*
Continuous    paired        medians, quartiles of differences   boxplot of differences     Wilcoxon signed rank test, paired t-test*
Binary        two sample    cross table, odds ratio             barplot                    Fisher's exact test
Binary        paired        cross table                         barplot                    McNemar test
Categorical   two sample    cross table                         3D barplot                 χ2 test

* if the differences follow a normal distribution
Remarks on Testing
Data description is the mandatory first step of every statistical analysis / test.
Test results should report the outcome (significant / not significant) together with the p-value that has been obtained.
Never report a p-value of exactly 0! (Why?)
Statistical significance ≠ relevance:
For large sample numbers, even tiny differences may produce significant findings.
For small sample numbers, an observed relevant difference can be statistically insignificant.
Multiple Testing
Examples of multiple tests:
Testing of several endpoints (systolic and diastolic blood pressure, pulse, …)
Comparison of several groups (e.g., 4 groups require 6 pairwise two-group comparisons)
Let us set a significance level of 5%, and suppose the null hypothesis holds in all cases.
→ If we perform 6 tests, the probability of reporting at least one false positive finding can increase to 30%!
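For m independent tests at level α, the probability of at least one false positive is 1 − (1 − α)^m, which for 6 tests at the 5% level is already about 26%:

```python
alpha = 0.05   # per-test significance level
m = 6          # number of tests (e.g. all pairwise comparisons of 4 groups)

# Probability of at least one false positive among m independent tests
family_wise_error = 1 - (1 - alpha) ** m
print(f"{family_wise_error:.2f}")   # about 0.26 for m = 6, alpha = 0.05
```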
Multiple Testing, Bonferroni Correction
Remedy: Bonferroni correction
For m tests and a target significance level α, perform each individual test at a significance level of α/m (local significance level).
The probability of producing a false positive finding in at least one of the m tests is then at most α (multiple / global significance level).
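A minimal sketch of applying the correction to a list of p-values (the p-values here are invented):

```python
# Invented p-values from m = 6 pairwise comparisons
p_values = [0.003, 0.020, 0.048, 0.150, 0.400, 0.012]

alpha = 0.05
m = len(p_values)
local_alpha = alpha / m          # Bonferroni-corrected per-test level

# Reject only those hypotheses whose p-value falls below alpha / m
rejected = [p < local_alpha for p in p_values]
print(f"local significance level: {local_alpha:.4f}")
print(rejected)                  # [True, False, False, False, False, False]
```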