Unit 3: Foundations for inference
Lecture 3: Decision errors, significance levels,
sample size, power, and bootstrapping
Statistics 101
Thomas Leininger
June 3, 2013
Decision errors
Type 1 and Type 2 errors
Decision errors
Hypothesis tests are not flawless.
In the court system innocent people are sometimes wrongly
convicted and the guilty sometimes walk free.
Similarly, we can make wrong decisions in statistical hypothesis tests.
The difference is that we have the tools necessary to quantify
how often we make errors in statistics.
Statistics 101 (Thomas Leininger)
U3 - L4: Decision errors, significance levels, sample size, and power June 3, 2013
Decision errors (cont.)
There are two competing hypotheses: the null and the alternative. In
a hypothesis test, we make a decision about which might be true, but
our choice might be incorrect.
                          Truth
Decision              H0 true           HA true
fail to reject H0     ✓                 Type 2 Error
reject H0             Type 1 Error      ✓
A Type 1 Error is rejecting the null hypothesis when H0 is true.
A Type 2 Error is failing to reject the null hypothesis when HA is
true.
We (almost) never know if H0 or HA is true, but we need to
consider all possibilities.
Hypothesis Test as a trial
If we again think of a hypothesis test as a criminal trial then it makes
sense to frame the verdict in terms of the null and alternative
hypotheses:
H0 : Defendant is innocent
HA : Defendant is guilty
Which type of error is being committed in the following circumstances?
Declaring the defendant innocent when they are actually guilty
Declaring the defendant guilty when they are actually innocent
Which error do you think is the worse error to make?
“better that ten guilty persons escape than that one innocent suffer”
– William Blackstone
Error rates & power
Type 1 error rate
As a general rule we reject H0 when the p-value is less than
0.05, i.e. we use a significance level of 0.05, α = 0.05.
This means that, for those cases where H0 is actually true, we do
not want to incorrectly reject it more than 5% of those times.
In other words, when using a 5% significance level there is about a
5% chance of making a Type 1 error.
P (Type 1 error) = α
This is why we prefer small values of α – increasing α
increases the Type 1 error rate.
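The relationship P(Type 1 error) = α can be checked by simulation. A minimal sketch, where the standard-normal population, n = 30, and the one-sided test are illustrative assumptions, not from the lecture:

```python
import random
from statistics import NormalDist, mean

random.seed(1)
alpha, n, sims = 0.05, 30, 10_000
# One-sided test of H0: mu = 0 against HA: mu > 0, population N(0, 1).
cutoff = NormalDist().inv_cdf(1 - alpha) / n**0.5  # rejection cutoff for x-bar

# Draw many samples from a population where H0 is true and count how
# often we wrongly reject.
false_rejects = sum(
    mean(random.gauss(0, 1) for _ in range(n)) > cutoff
    for _ in range(sims)
)
print(false_rejects / sims)  # close to alpha = 0.05
```

The simulated rejection rate hovers around 0.05, matching α.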
Filling in the table...
                          Truth
Decision              H0 true            HA true
fail to reject H0     1 − α              Type 2 Error, β
reject H0             Type 1 Error, α    Power, 1 − β
Type 1 error is rejecting H0 when you shouldn’t have, and the
probability of doing so is α (significance level)
Type 2 error is failing to reject H0 when you should have, and the
probability of doing so is β (a little more complicated to calculate)
Power of a test is the probability of correctly rejecting H0 , and the
probability of doing so is 1 − β
In hypothesis testing, we want to keep α and β low, but there are
inherent trade-offs.
A quick example
In a cancer screening, what happens if we conclude a patient
has cancer and they do in fact have cancer?
What if they didn’t have cancer (but we concluded that they did)?
What if the patient does have cancer, but we conclude that they do not?
Type 2 error rate
If the alternative hypothesis is actually true, what is the chance that
we make a Type 2 Error, i.e. we fail to reject the null hypothesis even
when we should reject it?
The answer is not obvious.
If the true population average is very close to the null hypothesis
value, it will be difficult to detect a difference (and reject H0 ).
If the true population average is very different from the null
hypothesis value, it will be easier to detect a difference.
Clearly, β depends on the effect size (δ).
Power
Example - Blood Pressure
Blood pressure oscillates with the beating of the heart, and the systolic pressure is
defined as the peak pressure when a person is at rest. The average systolic blood
pressure for people in the U.S. is about 130 mmHg with a standard deviation of about
25 mmHg.
We are interested in finding out if the average blood pressure of employees at a
certain company is greater than the national average, so we collect a random sample
of 100 employees and measure their systolic blood pressure. What are the
hypotheses?
We’ll start with a very specific question – “What is the power of this hypothesis test to
correctly detect an increase of 2 mmHg in average blood pressure?”
Problem 1
Which values of x̄ represent sufficient evidence to reject H0 ?
(Remember H0 : µ = 130, HA : µ > 130)
Problem 2
What is the probability that we would reject H0 if x̄ did come from
N (mean = 132, SE = 2.5)?
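With the lecture's numbers (σ = 25, n = 100, so SE = 2.5, and a one-sided cutoff of z* = 1.65), Problems 1 and 2 can be checked numerically; a sketch:

```python
from statistics import NormalDist

sigma, n = 25, 100
se = sigma / n**0.5            # standard error: 2.5
cutoff = 130 + 1.65 * se       # Problem 1: reject H0 when x-bar > 134.125

# Problem 2: probability of rejecting if x-bar ~ N(132, 2.5)
power = 1 - NormalDist(132, se).cdf(cutoff)
print(cutoff, round(power, 3))  # 134.125, about 0.198
```

So even if the true mean were 132 mmHg, this test would detect it only about a fifth of the time.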
Putting it all together
[Figure: the null distribution N(130, 2.5) and the power distribution N(132, 2.5), plotted over systolic blood pressure from 120 to 140 mmHg. The α = 0.05 rejection region begins at x̄ = 134.125; the power is the area of the power distribution beyond that cutoff.]
Achieving desired power
There are several ways to increase power (and hence decrease the Type 2
error rate):
1. Increase the sample size.
2. Decrease the standard deviation of the sample, which essentially has
   the same effect as increasing the sample size (it will decrease the
   standard error).
3. Increase α, which will make it more likely to reject H0 (but note
   that this has the side effect of increasing the Type 1 error rate).
4. Consider a larger effect size δ.
Choosing sample size for a particular margin of error
If I want to predict the proportion of US voters who approve of President Obama and I want to have a margin of error of 2% or less, how
many people do I need to sample?
1. Given desired margin of error m, we need m ≥ ME = z* · σ/√n.
2. To get m ≥ z* · σ/√n, I need n ≥ (z* · σ/m)².
Note: This requires an estimate of σ.
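For the approval-rating question the parameter is a proportion, so a common choice (an assumption here, since the slide leaves σ unspecified) is the worst-case σ = √(0.5 · 0.5) = 0.5, with z* = 1.96 for 95% confidence:

```python
from math import ceil

z_star = 1.96   # 95% confidence
sigma = 0.5     # worst-case SD for a proportion (p = 0.5) -- an assumption
m = 0.02        # desired margin of error, 2%

n = ceil((z_star * sigma / m) ** 2)
print(n)  # 2401
```

This is why national polls quoting a ±2% margin typically sample a few thousand respondents.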
Bootstrapping
Rent in Durham
A random sample of 10 housing units was chosen on http://raleigh.craigslist.org
after subsetting posts with the keyword “durham”. The dot plot below shows
the distribution of the rents of these units.
Can we apply the methods we have learned so far to construct a confidence
interval using these data? Why or why not?
[Dot plot: the rents of the 10 sampled units, ranging from roughly $600 to $1800.]
Bootstrapping
An alternative approach to constructing confidence intervals is
bootstrapping.
This term comes from the phrase “pulling oneself up by one’s
bootstraps”, which is a metaphor for accomplishing an
impossible task without any outside help.
In this case the impossible task is estimating a population
parameter, and we’ll accomplish it using data from only the given
sample.
Bootstrapping works as follows:
(1) take a bootstrap sample - a random sample taken with
replacement from the original sample, of the same size as the
original sample
(2) calculate the bootstrap statistic - a statistic such as the mean,
median, or proportion, computed on the bootstrap sample
(3) repeat steps (1) and (2) many times to create a bootstrap
distribution - a distribution of bootstrap statistics
The 95% bootstrap confidence interval is estimated by the cutoff
values for the middle 95% of the bootstrap distribution.
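The three steps can be sketched with the Python standard library. The rents below are hypothetical, since the slide does not list the 10 individual values, and the function name is ours:

```python
import random
from statistics import mean

def bootstrap_ci(sample, stat=mean, reps=10_000, level=0.95, seed=1):
    """Percentile bootstrap: resample with replacement (step 1), compute
    the statistic on each resample (step 2), repeat many times (step 3),
    then take the cutoffs for the middle `level` of the distribution."""
    random.seed(seed)
    boots = sorted(stat(random.choices(sample, k=len(sample)))
                   for _ in range(reps))
    tail = (1 - level) / 2
    return boots[int(tail * reps)], boots[int((1 - tail) * reps) - 1]

rents = [600, 750, 800, 950, 1000, 1100, 1200, 1250, 1400, 1600]  # hypothetical
lo, hi = bootstrap_ci(rents)
print(lo, hi)
```

Because the resampling is done from the observed data alone, no normality assumption about the population is needed.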
Rent in Durham - bootstrap interval
The dot plot below shows the distribution of means of 100 bootstrap
samples from the original sample. Estimate the 90% bootstrap confidence interval based on this bootstrap distribution.
[Dot plot: the means of the 100 bootstrap samples, ranging from roughly $900 to $1400.]
Randomization testing
Randomization testing for a mean
We can also use a simulation method to conduct the same test.
This is very similar to bootstrapping, i.e. we randomly sample
with replacement from the sample, but this time we shift the
bootstrap distribution to be centered at the null value.
The p-value is then defined as the proportion of simulations that
yield a sample mean at least as favorable to the alternative
hypothesis as the observed sample mean.
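A sketch of this shift-and-resample procedure for a one-sided “greater than” alternative (the function name and the rent data are illustrative, not from the lecture):

```python
import random
from statistics import mean

def randomization_pvalue(sample, null_value, reps=10_000, seed=1):
    """Shift the sample so its mean equals the null value, resample with
    replacement, and report the share of shifted resample means at least
    as large as the observed mean (one-sided, HA: mu > null_value)."""
    random.seed(seed)
    observed = mean(sample)
    shifted = [x - observed + null_value for x in sample]
    hits = sum(mean(random.choices(shifted, k=len(sample))) >= observed
               for _ in range(reps))
    return hits / reps

rents = [600, 750, 800, 950, 1000, 1100, 1200, 1250, 1400, 1600]  # hypothetical
print(randomization_pvalue(rents, 854))
```

A small p-value means a sample mean this large would rarely arise if the null value were the true population mean.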
Rent in Durham - randomization testing
According to rentjungle.com the average rent for an apartment in
Durham is $854. Your random sample had a mean of $1143.2. Does
this sample provide convincing evidence that the article’s estimate is
an underestimate?
H0 : µ = $854
HA : µ > $854
p-value: proportion of simulations where the simulated sample mean is at
least as extreme as the one observed → 3 / 100 = 0.03
[Dot plot: the 100 randomization means, centered at the null value of $854 and ranging from roughly $600 to $1200; 3 of the 100 means are at least as extreme as the observed sample mean.]
Extra Notes - Calculating Power
Begin by picking a meaningful effect size δ and a significance
level α.
Calculate the range of values for the point estimate beyond
which you would reject H0 at the chosen α level.
Calculate the probability of observing a value in that range if the
sample were drawn from a population where x̄ ∼ N (µH0 + δ, SE).
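These steps translate directly into a short function for a one-sided “greater than” test (the function name is ours):

```python
from statistics import NormalDist

def power(delta, alpha, sigma, n, mu0=0.0):
    """Power of a one-sided test of H0: mu = mu0 vs HA: mu > mu0 when
    the true mean is mu0 + delta: find the rejection cutoff for x-bar,
    then the probability of exceeding it under x-bar ~ N(mu0 + delta, SE)."""
    se = sigma / n**0.5
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    return 1 - NormalDist(mu0 + delta, se).cdf(cutoff)

# Blood pressure example: detect a 2 mmHg increase with sigma = 25, n = 100
print(round(power(2, 0.05, 25, 100, mu0=130), 3))
```

This reproduces the roughly 20% power found earlier for detecting a 2 mmHg increase.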
Example - Using power to determine sample size
Going back to the blood pressure example, how large a sample would you need if
you wanted 90% power to detect a 4 mmHg increase in average blood pressure for
the hypothesis that the population average is greater than 130 mmHg at α = 0.05?
Given: H0 : µ = 130, HA : µ > 130, α = 0.05, β = 0.10, σ = 25, δ = 4
Step 1: Determine the cutoff – in order to reject H0 at α = 0.05, we need a sample
mean that will yield a Z score of at least 1.65:

x̄ > 130 + 1.65 · 25/√n

Step 2: Set the probability of obtaining the above x̄, if the true population is centered
at 130 + 4 = 134, equal to the desired power, and solve for n:

P(x̄ > 130 + 1.65 · 25/√n) = 0.9

P(Z > (130 + 1.65 · 25/√n − 134) / (25/√n)) = P(Z > 1.65 − 4√n/25) = 0.9
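Setting P(Z > 1.65 − 4√n/25) = 0.9 means 1.65 − 4√n/25 must equal the z-value with 90% of the distribution above it, about −1.28. Solving for n with the slide's rounded z-values:

```python
from math import ceil

z_alpha = 1.65   # rounded one-sided cutoff for alpha = 0.05
z_beta = 1.28    # rounded z for 90% power (10th percentile, negated)
sigma, delta = 25, 4

# 1.65 - delta*sqrt(n)/sigma = -1.28  =>  sqrt(n) = (z_alpha + z_beta)*sigma/delta
n = ceil(((z_alpha + z_beta) * sigma / delta) ** 2)
print(n)  # 336
```

Using exact normal quantiles rather than the rounded 1.65 and 1.28 gives a nearly identical answer.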
Example - Using power to determine sample size (cont.)
You can either directly solve for n, or use computation to calculate
power for various n and determine the sample size that yields the
desired power:

[Plot: power as a function of sample size n, for n from 0 to 1000; the curve reaches the desired power of 0.9 near n = 336.]
For n = 336, power = 0.9002, therefore we need 336 subjects in our
sample to achieve the desired level of power for the given
circumstance.
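The claimed power at n = 336 can be verified directly, reusing the slide's rounded cutoff z* = 1.65 (the function name is ours):

```python
from statistics import NormalDist

def power_at(n, mu0=130, delta=4, sigma=25, z_star=1.65):
    """Power of the one-sided blood pressure test at sample size n."""
    se = sigma / n**0.5
    cutoff = mu0 + z_star * se          # rejection cutoff at alpha = 0.05
    return 1 - NormalDist(mu0 + delta, se).cdf(cutoff)

print(round(power_at(336), 4))  # 0.9002, matching the slide
```

Evaluating this function over a range of n is exactly how the power curve above would be generated.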