Confidence intervals and hypothesis tests

Unit 3: Foundations for inference
Lecture 2: Hypothesis testing and confidence intervals
Statistics 101
Gary Larson
July 10, 2015
Sta 101 (Gary Larson)
U3 - L2: HT and CI

Announcements
Review the Learning Objectives. I promised you Quizzes!
Midterm: Monday July 20. Practice midterm with solutions will be posted soon.
I am out of town Friday July 17 through Sunday 19th – start studying now! Bring questions next week to Office Hours.

Announcements
Exercise 2.18(d)
how the solution makes sense with what we’ve learned so far
the ambiguity in the way the question is asked (“overweight” to the book meant the overweight or obese column)
students may get some credit back if they did it correctly for only the overweight column.
but note that we’re interested in VARIABLES being related, so that may have given you a clue to think about all the categories of weight rather than just the overweight column.

Today
“Hypothesis testing” ≡ “testing” ≡ “tests”
Hypothesis testing using theoretical methods. (Simulation-based was 7/2 lecture)
p-values
Single-sided and two-sided hypothesis tests
Using confidence intervals for testing
Close relationship between confidence intervals and hypothesis tests
Error rates in hypothesis testing

Hypothesis testing
Hypothesis testing framework

Remember when...
Speed skating lane advantage:

Rank          Inner lane   Outer lane   Total
Top 10             6            4         10
Not Top 10        14           16         30
Total             20           20         40

Possible explanations: A null hypothesis and alternative hypothesis.
H0: Rank and lane are independent. No lane advantage. Observed difference in proportions is due to chance.
HA: Rank and lane are dependent, there is lane advantage, observed difference in proportions is not due to chance.

Result of hypothesis testing by simulation / randomization
p̂I = 0.3, p̂O = 0.2 ⇒ p̂diff = 0.1
We failed to reject the null hypothesis, since values equal to or more extreme than our observed data p̂diff = 0.1 weren’t extremely unlikely under the null distribution (a.k.a. the randomization distribution).
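
The randomization result for the speed skating data can be reproduced with a short simulation. This is a minimal sketch in Python, not the code used in the 7/2 lecture: the function name, the number of shuffles, the seed, and the choice of a one-sided comparison are all assumptions made for illustration. It shuffles the 40 rank labels across the two lanes, recomputes the difference in Top 10 proportions each time, and reports how often a shuffled difference is at least the observed 0.1.

```python
import random

def randomization_p_value(reps=10_000, seed=1):
    """One-sided randomization test for the speed skating table.

    reps and seed are illustrative choices, not values from the lecture.
    """
    rng = random.Random(seed)
    # 1 = finished in the Top 10, 0 = did not; 10 ones and 30 zeros in total.
    ranks = [1] * 10 + [0] * 30
    observed_diff = 6 / 20 - 4 / 20  # inner minus outer proportion, 0.1
    at_least_as_extreme = 0
    for _ in range(reps):
        rng.shuffle(ranks)
        # After shuffling, treat the first 20 skaters as the inner lane.
        inner_top10 = sum(ranks[:20])
        outer_top10 = 10 - inner_top10
        diff = inner_top10 / 20 - outer_top10 / 20
        # Small tolerance guards against floating-point ties.
        if diff >= observed_diff - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / reps

p_value = randomization_p_value()
print(f"simulated one-sided p-value: {p_value:.3f}")
```

The simulated p-value comes out far above 5%, which is consistent with failing to reject H0 for these data.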

Recap: hypothesis testing framework (from 7/2 slides)
We state a null hypothesis (H0) that represents the status quo, or the absence of an effect, or the hypothesis that “nothing is going on.”
We also state an alternative hypothesis (HA) that represents our research question, i.e. what we’re testing for.
Is there enough evidence in the data to reject H0? To find out, we conduct a hypothesis test under the assumption that the null hypothesis is true, by one of two methods:
1. simulation of additional data collection (today), e.g. using randomization
2. theoretical methods (later in the course).
Let’s introduce the method of hypothesis testing (theory version) using an example which tests a claim about a population mean.

Number of college applications
A survey asked how many colleges Duke students applied to, and 206 students responded. This sample yielded an average of x̄ = 9.7 college applications with a standard deviation of s = 7. College Board says counselors recommend students apply to 8 colleges. Does our survey provide convincing evidence that the average number of colleges all Duke students apply to is higher than recommended?
http://www.collegeboard.com/student/apply/the-application/151680.html

Setting the hypotheses
The parameter of interest µ is the mean number of schools applied to by all Duke students.
The sample statistic, x̄ = 9.7, is the average number of schools applied to by Duke students in our sample.
There are two possible explanations why our sample mean is higher than the recommended 8 schools.
The true population mean is different. Duke students on average truly apply to more than 8 schools.
The true population mean is 8. Duke students on average truly apply to exactly 8 schools. Our sample mean is higher than 8 simply due to natural sampling variability.

We start with the assumption the average number of colleges Duke students apply to is 8 (as recommended)
H0: µ = 8
We test the claim that the average number of colleges Duke students apply to is greater than 8
HA: µ > 8

Formal testing using p-values
With hypotheses in place, assume H0 is true.
So, we pretend that µ = 8. Then how unusual is a sample statistic like x̄ = 9.7?
To answer that, we need the probability that if we took a random sample of n = 206 students (with H0: µ = 8 true), we would obtain a sample mean x̄ ≥ 9.7.
Great! We did this yesterday. To calculate that probability, start with: what is the sampling distribution of x̄ if H0 is true? (Use the CLT, if conditions are met!)

Central limit theorem
Under certain conditions,
x̄ ∼ N(mean = µ, SE = s/√n)
Make sure to check conditions for CLT to hold: (1) independence, and (2) sample size/skew.

Number of college applications - conditions
Participation question
Which of the following is not a condition that needs to be met to proceed with this hypothesis test?
(a) Students in the sample should be independent of each other with respect to how many colleges they applied to.
(b) Sampling should have been done randomly.
(c) The sample size should be less than 10% of the population of all Duke students.
(d) There should be at least 10 successes and 10 failures in the sample.
(e) The distribution of the number of colleges students apply to should not be extremely skewed.

Applying the CLT
n = 206 students, x̄ = 9.7, s = 7, and we’re assuming H0: µ = 8
Here the conditions for inference for the CLT are met. So:
x̄ ∼ N(µ = 8, σx̄ = SE = 7/√206)
x̄ ∼ N(µ = 8, σx̄ ≈ 0.5)
How unusual is x̄ = 9.7 in this distribution?
If it’s very unusual, that’s probably not the right distribution! (i.e. µ isn’t really 8)

We’re getting closer
To determine if our observed sample mean is unusual if H0 is true, we determine how many standard errors it is from the null value (µ = 8), i.e. we calculate the Z-score.
[Figure: distribution with µ = 8 and x̄ = 9.7 marked]
x̄ ∼ N(µ = 8, SE = 7/√206 ≈ 0.5)
Z = (9.7 − 8) / 0.5 = 3.4
The sample mean is 3.4 standard errors away from the hypothesized value. Is this considered unusually (significantly) high?

Number of college applications - p-value
With the Z-score calculated, we use it to calculate the p-value.
p-value: probability under H0 (µ = 8) of observing data at least as extreme as what we observed (a sample mean greater than 9.7)
[Figure: distribution with µ = 8 and x̄ = 9.7 marked]
P(x̄ > 9.7 | µ = 8) = P(Z > 3.4) = 0.0003
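
The arithmetic for the college applications example can be checked in a few lines. The following is a sketch using Python's standard library in place of the probability tables; the variable names are illustrative. It computes the Z-score both with the unrounded standard error and with the SE rounded to 0.5 as on the slides.

```python
from math import sqrt
from statistics import NormalDist

# Values from the survey: n = 206, sample mean 9.7, sample SD 7, null value 8.
n, x_bar, s, mu_0 = 206, 9.7, 7, 8

se = s / sqrt(n)                   # standard error, about 0.49
z = (x_bar - mu_0) / se            # Z-score with the unrounded SE
z_slides = (x_bar - mu_0) / 0.5    # Z-score with SE rounded to 0.5, as on the slides

# Upper-tail probability P(Z > z) under the standard normal distribution.
p_value = 1 - NormalDist().cdf(z)
p_value_slides = 1 - NormalDist().cdf(z_slides)

print(f"SE = {se:.3f}, Z = {z:.2f}, p-value = {p_value:.4f}")
print(f"with SE rounded to 0.5: Z = {z_slides:.1f}, p-value = {p_value_slides:.4f}")
```

Rounding the SE changes the Z-score slightly, but either way the p-value is far below 0.05.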

p-values
We then use this test statistic to calculate the p-value
If the p-value is low (lower than the significance level, α, which is usually 5%) we say that it would be very unlikely to observe the data if the null hypothesis were true, and hence reject H0.
If the p-value is high (higher than α) we say that it is likely to observe the data even if the null hypothesis were true, and hence do not reject H0.

Number of college applications - Making a decision
p-value = 0.0003
If the true average of the number of colleges Duke students applied to is 8, there is only 0.03% chance of observing a random sample of 206 Duke students who on average apply to 9.7 or more schools.
This is a pretty low probability for us to think that a sample mean of 9.7 or more schools is likely to happen simply by chance.
Since p-value is low (lower than 5%) we reject H0.
The data provide convincing evidence that Duke students on average apply to more than 8 schools.
The difference between the null value of 8 schools and observed sample mean of 9.7 schools is not due to chance or sampling variability.

the next slide is provided as a brief summary of hypothesis testing...

Recap: Hypothesis testing for a population mean
1. Set the hypotheses
H0: µ = null value
HA: µ < or > or ≠ null value
2. Check assumptions and conditions
Independence: random sample/assignment, 10% condition when sampling without replacement
Normality: nearly normal population or n ≥ 30, no extreme skew
3. Calculate a test statistic and a p-value (draw a picture!)
Z = (x̄ − µ) / SE, where SE = s/√n
4. Make a decision, and interpret it in context of the research question
If p-value < α, reject H0, data provide evidence for HA
If p-value > α, do not reject H0, data do not provide evidence for HA
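
Steps 1, 3 and 4 of the recap can be collected into one small function. This is a sketch under stated assumptions, not a standard library routine: the function name `z_test_mean` and the labels "less", "greater" and "two-sided" are made up for illustration, and the conditions in step 2 still have to be checked by hand.

```python
from math import sqrt
from statistics import NormalDist

def z_test_mean(x_bar, s, n, null_value, alternative, alpha=0.05):
    """Z-test for a population mean, following the recap steps.

    alternative is one of "less", "greater", or "two-sided" (labels chosen
    for this sketch). Conditions (step 2) must be checked separately.
    """
    se = s / sqrt(n)
    z = (x_bar - null_value) / se
    lower_tail = NormalDist().cdf(z)
    if alternative == "less":
        p_value = lower_tail
    elif alternative == "greater":
        p_value = 1 - lower_tail
    elif alternative == "two-sided":
        p_value = 2 * min(lower_tail, 1 - lower_tail)
    else:
        raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")
    decision = "reject H0" if p_value < alpha else "do not reject H0"
    return z, p_value, decision

# College applications example: H0: µ = 8, HA: µ > 8.
z, p_value, decision = z_test_mean(9.7, 7, 206, 8, "greater")
print(f"Z = {z:.2f}, p-value = {p_value:.4f}, decision: {decision}")
```

Returning the Z-score and p-value alongside the decision keeps the intermediate quantities visible, which makes it easier to draw the picture the recap asks for.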

the next slide is also provided as a brief summary of hypothesis testing...

Hypothesis testing for µ, from beginning to end
You want to make a statement about a population parameter. So, state your H0 and HA and the significance level of this test. Then, observe a point estimate of the population parameter you’re interested in (i.e. collect data for testing the hypothesis). After verifying the CLT’s conditions are met, use the CLT to state what the sampling distribution for the sample mean would be if H0 were true. Calculate the Z-score of your point estimate, and use the probability tables to find the associated p-value. If the p-value is lower than the significance level, reject H0 in favor of HA. Otherwise fail to reject H0. Never “accept” H0.
You should (and can!) understand every aspect of this process, before the midterm. If not, come to Office Hours or email me :)
Come see me, even (especially) if you’re thinking “I really don’t understand most of this or what’s happening”.

Example
A poll by the National Sleep Foundation found that college students average about 7 hours of sleep per night. A sample of 169 Duke students yielded an average of 6.88 hours, with a standard deviation of 0.94 hours. Assuming that this is a random sample representative of all Duke students (bit of a leap of faith?), conduct a hypothesis test to evaluate if Duke students on average sleep less than 7 hours per night.
Edit by Gary: This seems totally false... can we do a survey now?

Participation question
The p-value for this hypothesis test is 0.0485. Which of the following is correct?
(a) Fail to reject H0, the data provide convincing evidence that Duke students sleep less than 7 hours on average.
(b) Reject H0, the data provide convincing evidence that Duke students sleep less than 7 hours on average.
(c) Reject H0, the data prove that Duke students sleep more than 7 hours on average.
(d) Fail to reject H0, the data do not provide convincing evidence that Duke students sleep less than 7 hours on average.
(e) Reject H0, the data provide convincing evidence that Duke students in this sample sleep less than 7 hours on average.
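
The p-value of 0.0485 quoted in the question can be reproduced from the sample summary. A minimal sketch, assuming the one-sided alternative HA: µ < 7 from the example and using Python's standard library instead of the probability tables:

```python
from math import sqrt
from statistics import NormalDist

# Sleep example: H0: µ = 7 versus HA: µ < 7.
n, x_bar, s, null_value = 169, 6.88, 0.94, 7

se = s / sqrt(n)                  # 0.94 / 13, about 0.07
z = (x_bar - null_value) / se     # negative: the sample mean is below 7
p_value = NormalDist().cdf(z)     # lower-tail probability P(Z < z)

print(f"SE = {se:.3f}, Z = {z:.2f}, p-value = {p_value:.4f}")
```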

Two-sided hypothesis testing with p-values
If the research question was “Do the data provide convincing evidence that the average amount of sleep Duke students get per night is different from the national average?”, the alternative hypothesis would be different.
H0: µ = 7
HA: µ ≠ 7
Hence the p-value would change as well:
p-value = 0.0485 × 2 = 0.097
[Figure: distribution centered at 7.00 with 6.88 and 7.12 marked]
Do we reject now?

Confidence Intervals and Hypothesis Tests
Confidence Intervals
Construct a 95% confidence interval for the number of hours Duke students sleep on average.
x̄ = 6.88, s = 0.94, n = 169, SE ≈ 0.07
Confidence interval, a general formula
point estimate ± z* × SE = point estimate ± ME
For a 95% confidence interval, z* = 1.96.
6.88 ± 1.96 × 0.07 = (6.74, 7.02)
We are 95% confident that the true average number of hours Duke students sleep is between 6.74 and 7.02 hours.

Connect HT and CI
We are 95% confident that the average number of hours Duke students sleep is between 6.738 and 7.022 hours.
6.88 ± 1.96 × 0.07 = (6.74, 7.02)
Is the null value in this interval? (i.e. is µ = 7 plausible?) Yes!
Did we fail to reject the null? (i.e. did we decide we did not have enough evidence to claim that µ is something other than 7?) Yes!

Agreement of CI and HT
Confidence intervals and hypothesis tests (almost) always agree, as long as the two methods use equivalent levels of significance / confidence.
A two sided hypothesis with threshold of α is equivalent to a confidence interval with CL = 1 − α.
A one sided hypothesis with threshold of α is equivalent to a confidence interval with CL = 1 − (2 × α).
If H0 is rejected, an “agreeing” confidence interval does not include the null value; the null value wasn’t plausible.
If H0 is not rejected, an “agreeing” confidence interval does include the null value; the null value was plausible.
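
The agreement between the two-sided test and the 95% interval for the sleep data can be checked directly. This is a sketch with illustrative variable names; it uses the unrounded standard error and takes z* from the normal distribution rather than a table.

```python
from math import sqrt
from statistics import NormalDist

x_bar, s, n, null_value = 6.88, 0.94, 169, 7
alpha = 0.05

se = s / sqrt(n)
z = (x_bar - null_value) / se

# Two-sided p-value: probability in both tails beyond |z|.
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

# Confidence interval with CL = 1 - alpha; z* is about 1.96 for alpha = 0.05.
z_star = NormalDist().inv_cdf(1 - alpha / 2)
lower, upper = x_bar - z_star * se, x_bar + z_star * se

reject = p_two_sided < alpha
null_in_interval = lower <= null_value <= upper
print(f"two-sided p-value = {p_two_sided:.3f}, reject H0: {reject}")
print(f"95% CI = ({lower:.3f}, {upper:.3f}), contains 7: {null_in_interval}")
```

The test does not reject H0 and the interval contains 7, so the two methods agree, as the slide states.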

Confidence Intervals
What confidence level do we need to create a confidence interval which will agree with our one-sided hypothesis test?
H0: µ = 7
HA: µ < 7
A one sided hypothesis with threshold of α is equivalent to a confidence interval with CL = 1 − (2 × α).
CL = 1 − 2(0.05) = .90 ⇒ 90% CI

Significance level vs. confidence level
Two sided
[Figure: normal curve centered at 0, area 0.95 between −1.96 and 1.96, area 0.025 in each tail]
Two sided HT with α = 0.05 is equivalent to 95% confidence interval.
One sided
[Figure: normal curve centered at 0, area 0.9 in the middle, area 0.05 in each tail, 1.65 marked]
One sided HT with α = 0.05 is equivalent to 90% confidence interval.

Construct a 90% confidence interval for the number of hours Duke students sleep on average.
x̄ = 6.88, SE ≈ 0.07
For a 90% confidence interval, z* = 1.64.
6.88 ± 1.64 × 0.07 = (6.76, 7.00)
We are 90% confident that the average number of hours Duke students sleep is between 6.76 and 7.00 hours.
We rejected the null hypothesis with a p-value of 0.0485.
What is the connection between the p-value and the interval?
Note the importance of rounding properly!

Participation question
A 95% confidence interval for the average waiting time at an emergency room is (128 minutes, 147 minutes). Which of the following is false?
(a) A hypothesis test of HA: µ ≠ 120 min at α = 0.05 is equivalent to this CI.
(b) A hypothesis test of HA: µ > 120 min at α = 0.025 is equivalent to this CI.
(c) This interval does not support the claim that the average wait time is 120 minutes.
(d) The claim that the average wait time is 120 minutes would not be rejected using a 90% confidence interval.
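
The rule CL = 1 − (2 × α) and the note about rounding can both be seen by computing the 90% interval for the sleep data without rounding the standard error. A sketch, with a hypothetical helper name chosen for illustration:

```python
from math import sqrt
from statistics import NormalDist

def ci_matching_one_sided_test(x_bar, s, n, alpha):
    """Confidence interval whose level CL = 1 - 2*alpha matches a one-sided test."""
    confidence_level = 1 - 2 * alpha
    se = s / sqrt(n)
    z_star = NormalDist().inv_cdf(1 - alpha)   # leaves alpha in each tail
    return confidence_level, x_bar - z_star * se, x_bar + z_star * se

cl, lower, upper = ci_matching_one_sided_test(6.88, 0.94, 169, alpha=0.05)
print(f"CL = {cl:.2f}, interval = ({lower:.3f}, {upper:.3f})")
print(f"null value 7 inside the interval: {lower <= 7 <= upper}")
```

With the unrounded SE the upper endpoint is about 6.999, just below 7, so the interval excludes the null value and agrees with rejecting H0 at a p-value of 0.0485. Rounded to two decimals the endpoint displays as 7.00, which hides that agreement.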

Type 1 and Type 2 errors
Decision errors
Hypothesis tests are not flawless.
In the court system innocent people are sometimes wrongly convicted and the guilty sometimes walk free.
Similarly, we can make a wrong decision in statistical hypothesis tests as well.
The difference is that we have the tools necessary to quantify how often we make errors in statistics.

Decision errors (cont.)
When conducting a hypothesis test, there are two ways we could be right, and two ways we could be wrong!

                        Decision
Truth         fail to reject H0    reject H0
H0 true       ✓                    Type 1 Error
HA true       Type 2 Error         ✓

A Type 1 Error is rejecting the null hypothesis when H0 is true.
A Type 2 Error is failing to reject the null hypothesis when HA is true.
We (almost) never know if H0 or HA is true, but we can still look at these error rates!

Hypothesis test as a trial
H0: Defendant is innocent
HA: Defendant is guilty
Which type of error is being committed in the following circumstances?
Declaring the defendant innocent when they are actually guilty
Declaring the defendant guilty when they are actually innocent
Which error do you think is the worse error to make?
“better that ten guilty persons escape than that one innocent suffer”
– William Blackstone

Error rates & power
Type 1 error rate
We usually use a significance level of 0.05, α = 0.05, i.e. reject H0 when p < 0.05
One way to think about the significance level: when H0 is true and we use a 5% significance level, there is about a 5% chance of making a Type 1 error (incorrectly rejecting a true H0). Why? Because we defined rare data to be the rarest 5% of data!
P(Type 1 error | H0 true) = α
This is why we like small values of α: increasing α increases the Type 1 error rate.
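
The claim that the Type 1 error rate equals α can be illustrated by simulation: generate many samples from a population where H0 really is true, test each one, and count how often H0 is rejected. The sketch below is an illustration only. The normal population with mean 8 and SD 7, the sample size, the number of repetitions, and the seed are all assumptions, borrowed loosely from the college applications example.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, stdev

def type1_error_rate(alpha=0.05, n=206, reps=2_000, seed=1):
    """Fraction of one-sided Z-tests that reject H0: µ = 8 when H0 is true.

    The population (normal, mean 8, SD 7), reps, and seed are assumptions
    made for this illustration only.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)   # reject when Z exceeds this
    rejections = 0
    for _ in range(reps):
        sample = [rng.gauss(8, 7) for _ in range(n)]
        z = (fmean(sample) - 8) / (stdev(sample) / sqrt(n))
        if z > z_crit:
            rejections += 1
    return rejections / reps

rate = type1_error_rate()
print(f"simulated Type 1 error rate: {rate:.3f}")
```

The simulated rate should land close to α, up to simulation noise.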

Filling in the table...

                        Decision
Truth         fail to reject H0    reject H0
H0 true       1 − α                Type 1 Error, α
HA true       Type 2 Error, β      Power, 1 − β

Type 1 error is rejecting H0 when you shouldn’t have, and the
probability of doing so is α (significance level)
Type 2 error is failing to reject H0 when you should have, and
the probability of doing so is β (a little more complicated to
calculate)
Power of a test is the probability of correctly rejecting H0 , and the
probability of doing so is 1 − β.
In hypothesis testing, we want to keep α and β low, but there is
an inherent trade-off.
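
The trade-off between α and β can be made concrete with a small power calculation. This is a sketch under strong assumptions: the true mean of 9 is a hypothetical alternative chosen for illustration, the SD of 7 is treated as the known population SD, and the function name is made up. It uses the upper-tail test of H0: µ = 8 with n = 206 from the college applications example.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided(mu_0, mu_true, sd, n, alpha):
    """Power of the upper-tail Z-test of H0: µ = mu_0 when the mean is mu_true.

    Treats sd as the known population SD, a simplifying assumption.
    """
    se = sd / sqrt(n)
    # Reject H0 when the sample mean exceeds this cutoff.
    cutoff = mu_0 + NormalDist().inv_cdf(1 - alpha) * se
    # Probability of exceeding the cutoff when the true mean is mu_true.
    return 1 - NormalDist(mu_true, se).cdf(cutoff)

# mu_true = 9 is a hypothetical alternative, not a value from the lecture.
for alpha in (0.01, 0.05, 0.10):
    power = power_one_sided(mu_0=8, mu_true=9, sd=7, n=206, alpha=alpha)
    print(f"alpha = {alpha:.2f}: beta = {1 - power:.3f}, power = {power:.3f}")
```

Lowering α pushes the rejection cutoff further from the null value, which raises β and lowers power for the same alternative.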