Module III
Lecture 3
Two-Sample Problems
Suppose that you are a manager in a plant that makes micro-chips. One of
the key materials is quite expensive, so you want to minimize the inevitable
wastage of this material. A vendor claims that his manufacturing equipment will
result in less wasted material than your present process. Should you replace your
existing equipment?
The above question is, of course, a complicated trade-off of the cost of lost
material against the cost of new equipment. However, the question is not worth
your time unless the claim of less wasted material can be substantiated. Further, if
there is a reduction in waste, an estimate is needed of how much material will be
saved so that it can be factored into the cost trade-offs.
Suppose you get the vendor to perform 25 runs of his new equipment and to
measure the wastage. You then take a random sample of 25 runs using your present
equipment with the results shown below:
Wasted Material          Wasted Material
Present Process          New Process
196.00                   178.48
202.32                   201.42
220.89                   231.03
198.16                   174.99
188.51                   169.49
210.12                   183.02
204.64                   170.24
191.68                   195.64
186.46                   173.80
200.18                   191.37
201.49                   195.32
207.66                   194.18
187.93                   177.13
218.81                   223.99
192.34                   198.45
214.71                   188.50
206.06                   189.08
195.77                   197.06
209.65                   186.54
211.53                   189.55
221.05                   204.53
196.04                   174.46
196.17                   181.79
216.05                   197.14
201.24                   194.95
How can we determine if a reduction in wastage has occurred?
It turns out that more information is needed to perform the analysis since
there are two distinct ways the data could have been collected.
The first way would have been to take a random sample of 25 runs from the
present process and then an independent sample of 25 runs from the new process.
This is called the two independent sample case.
Since the process is under the control of an operator, the second way the data
could have been generated is if, let us say, operator 1 made a run under the old
process and the same operator then made a run under the new process. This would
then be repeated for 24 other operators. This is called the paired sample case,
because for each operator we have one run under the old system and one run under
the new system.
Which of these methods is better? In general, given a choice, the paired
sample case is better in that it has a higher probability of detecting a difference if
there is one. To see this, assume that there is variability in the way an operator runs
the process. In the two independent sample case, any variability between runs is a
mixture of the potential difference in the process results and the variability from
operator to operator. In the paired sample case, since the same operator runs both
the new and old processes, these two measurements would differ primarily on the
basis of the potential difference in the processes only. Sometimes this is called
"controlling for the operator".
The key point is that you cannot tell by looking at the data whether you have
two independent samples or paired samples. You must ask how the data was
collected. The distinction is important since it affects how the analysis is performed.
The Paired Two Sample Case
In this case, we view the data as n pairs of observations (x_{1,i}, x_{2,i}), arrayed
as follows:
Pair    Group 1      Group 2
1       x_{1,1}      x_{2,1}
2       x_{1,2}      x_{2,2}
.       .            .
.       .            .
.       .            .
n       x_{1,n}      x_{2,n}
The null hypothesis is that the two groups have the same mean as opposed to
having different means. Formally,
H_0: \mu_1 = \mu_2
H_A: \mu_1 \neq \mu_2

Note that this can also be re-written as either:

H_0: \mu_1 - \mu_2 = 0
H_A: \mu_1 - \mu_2 \neq 0

or

H_0: \mu_2 - \mu_1 = 0
H_A: \mu_2 - \mu_1 \neq 0
Now since the data is paired, we can define for each pair i,

d_i = x_{1,i} - x_{2,i}    or    d_i = x_{2,i} - x_{1,i}

the choice depending on whether subtracting the values in Group 2 from the values
in Group 1 is easier to interpret than the reverse. Note that whichever way is
chosen, it must be the same for all n pairs.
For example, in our case, since we are expecting the New Process (Group 2)
to have less wastage than the Old Process (Group 1), it makes sense to subtract the
values of Group 2 from the values in Group 1, giving the decrease in wastage
directly. This is shown in the data below:
Operator    Wasted Material       Wasted Material       Difference
            Old Process (1)       New Process (2)
1           196.00                178.48                17.52
2           202.32                201.42                0.91
3           220.89                231.03                -10.14
4           198.16                174.99                23.17
5           188.51                169.49                19.02
6           210.12                183.02                27.10
7           204.64                170.24                34.39
8           191.68                195.64                -3.96
9           186.46                173.80                12.65
10          200.18                191.37                8.81
11          201.49                195.32                6.17
12          207.66                194.18                13.48
13          187.93                177.13                10.80
14          218.81                223.99                -5.18
15          192.34                198.45                -6.11
16          214.71                188.50                26.22
17          206.06                189.08                16.97
18          195.77                197.06                -1.28
19          209.65                186.54                23.11
20          211.53                189.55                21.97
21          221.05                204.53                16.52
22          196.04                174.46                21.58
23          196.17                181.79                14.38
24          216.05                197.14                18.91
25          201.24                194.95                6.28
By this process, we now have one sample of differences d_1, d_2, ..., d_n.
Now one can show theoretically that

\mu_d = E(d_i) = E(x_{1,i} - x_{2,i}) = E(x_{1,i}) - E(x_{2,i}) = \mu_1 - \mu_2

This means that:

\mu_d = 0 \iff \mu_1 - \mu_2 = 0

and

\mu_d \neq 0 \iff \mu_1 - \mu_2 \neq 0
Because of this equivalency, we can test the hypothesis that the two groups
have the same mean by testing the hypothesis:
H_0: \mu_d = 0
H_A: \mu_d \neq 0

The confidence interval is given by:

\bar{d} - t_{\alpha/2} \frac{s_d}{\sqrt{n}} \le \mu_d \le \bar{d} + t_{\alpha/2} \frac{s_d}{\sqrt{n}}
where the t-distribution with n - 1 degrees of freedom is used. This is exactly the
same form as the confidence interval for the mean constructed in the previous
lecture, except that now we use the mean and standard deviation of the d's as our
basis. That is:
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i

s_d = \sqrt{\frac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n - 1}}
If 0 is inside the confidence interval, then we accept the null hypothesis that
the two groups have the same mean. If 0 is not inside the interval, then we reject the
null hypothesis and the confidence interval gives us an estimate of how much the
means of the two groups differ.
In our case we can use the EXCEL functions "=average(range of data)" and
"=stdev(range of data)" to compute \bar{d} and s_d as shown below:
The spreadsheet repeats the table of Old Process, New Process, and Difference
values shown above, with two summary rows added below the Difference column:

mean =    12.53
stdev =   11.71
If we choose to do our test at the .05 level, we can use our automatic template
from the EXCEL file “onesam.xls” to obtain the confidence interval as shown
below:
Template for Confidence Interval

              Sample Mean    Sample SD    Sample n    alpha
Enter ===>    12.53          11.71        25          0.05

Confidence Interval is    7.696351    to    17.36365
Since zero is not in the confidence interval, we reject the null hypothesis that
the groups have the same mean in favor of the alternative that the means are
different. How much would wastage be reduced? The confidence interval indicates
that, on average, the reduction would be somewhere between 7.69 and 17.36.
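The same interval can be checked outside of EXCEL. The short Python sketch below is only an illustration (it is not part of the course templates and assumes the SciPy library is installed); it uses the summary values \bar{d} = 12.53, s_d = 11.71, and n = 25 from above:

# Minimal sketch of the paired-sample confidence interval,
# using the summary statistics reported above.
from scipy import stats

d_bar = 12.53   # mean of the 25 paired differences
s_d = 11.71     # standard deviation of the differences
n = 25          # number of pairs
alpha = 0.05

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)    # t value with n - 1 = 24 df
margin = t_crit * s_d / n ** 0.5

print(f"{d_bar - margin:.4f} to {d_bar + margin:.4f}")   # about 7.70 to 17.36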
Incidentally, if you had instead looked at d_i = x_{2,i} - x_{1,i} (the reverse of what we did),
then the only difference would be that we would have \bar{d} = -12.53 and the
confidence interval would be given by:
Template for Confidence Interval

              Sample Mean    Sample SD    Sample n    alpha
Enter ===>    -12.53         11.71        25          0.05

Confidence Interval is    -17.3636    to    -7.69635
Since zero is not in the confidence interval we would also reject the null hypothesis.
EXCEL will compute the appropriate test in the paired sample case directly.
To do so, click on "Tools", "Data Analysis", and then "t-test: Paired Two-Sample for
Means" to open the dialog box. The data for Process 1 is in cells C7:C31, the data
for Process 2 is in cells E7:E31, the hypothesized mean difference is 0, and the alpha
level is .05.
By hitting “OK” you get the following result:
t-Test: Paired Two Sample for Means

                                  Variable 1    Variable 2
Mean                              203.018       190.486
Variance                          107.958       225.864
Observations                      25            25
Pearson Correlation               0.6298674
Hypothesized Mean Difference      0
df                                24
t Stat                            5.3512423
P(T<=t) one-tail                  8.561E-06
t Critical one-tail               1.7108823
P(T<=t) two-tail                  1.712E-05
t Critical two-tail               2.0638981
The two-sided p-value is .0000171, which is much less than .05, so we would
reject the null hypothesis.
If you now wished to compute the confidence interval directly, the above
table gives you the appropriate value of t_{\alpha/2} = 2.063898. The confidence interval
would then be:

12.53 - 2.063898 \times \frac{11.71}{\sqrt{25}} \le \mu_d \le 12.53 + 2.063898 \times \frac{11.71}{\sqrt{25}}

yielding

7.69635 \le \mu_d \le 17.36365

just as before.
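For comparison, the paired test itself can be reproduced in Python. This is only a sketch (SciPy assumed available, and it is not part of the course materials); it uses the 25 paired runs listed earlier and should give essentially the same t statistic (about 5.35) and two-sided p-value (about 1.7E-05) as the EXCEL output:

# Sketch: paired two-sample t-test on the wastage data (SciPy assumed available).
from scipy import stats

old_process = [196.00, 202.32, 220.89, 198.16, 188.51, 210.12, 204.64, 191.68,
               186.46, 200.18, 201.49, 207.66, 187.93, 218.81, 192.34, 214.71,
               206.06, 195.77, 209.65, 211.53, 221.05, 196.04, 196.17, 216.05,
               201.24]
new_process = [178.48, 201.42, 231.03, 174.99, 169.49, 183.02, 170.24, 195.64,
               173.80, 191.37, 195.32, 194.18, 177.13, 223.99, 198.45, 188.50,
               189.08, 197.06, 186.54, 189.55, 204.53, 174.46, 181.79, 197.14,
               194.95]

result = stats.ttest_rel(old_process, new_process)   # paired test
print(result.statistic, result.pvalue)               # roughly 5.35 and 1.7e-05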
The Two Independent Sample Case
In this situation, we have two totally independent samples taken from two
different populations. The basic hypothesis, as before, is:

H_0: \mu_1 = \mu_2
H_A: \mu_1 \neq \mu_2

which, as before, can be written as:

H_0: \mu_1 - \mu_2 = 0
H_A: \mu_1 - \mu_2 \neq 0

or

H_0: \mu_2 - \mu_1 = 0
H_A: \mu_2 - \mu_1 \neq 0
The structure of the data is:

Group 1        Group 2
x_{1,1}        x_{2,1}
x_{1,2}        x_{2,2}
.              .
.              .
.              .
x_{1,n_1}      x_{2,n_2}
Notice that there is no pairing of the observations; indeed, the groups do not
even have to have the same number of observations.
Since we can formulate the hypothesis as the difference in population means,
the natural statistic to use is the difference in the sample means. It makes no
difference whether we look at the sample mean of the first group minus the sample
mean of the second group, or the reverse. Accordingly, we need to know
the sampling distribution of:

\bar{x}_1 - \bar{x}_2

where

\bar{x}_1 = \frac{1}{n_1}\sum_{i=1}^{n_1} x_{1i}    and    \bar{x}_2 = \frac{1}{n_2}\sum_{i=1}^{n_2} x_{2i}
Theoretically, one can show that the standard deviation of the sampling
distribution of the difference in the means is:

SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}

Since we do not know the population variances, the natural estimate of the standard
error of the difference in the sample means is:

\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}
where

s_1 = \sqrt{\frac{\sum_{i=1}^{n_1} (x_{1i} - \bar{x}_1)^2}{n_1 - 1}}    and    s_2 = \sqrt{\frac{\sum_{i=1}^{n_2} (x_{2i} - \bar{x}_2)^2}{n_2 - 1}}
Theoretical statisticians then studied the sampling distribution of the
statistic:
t_{obs} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
Although no exact solution has been found, it has been established that it can
be closely approximated by a t-distribution with degrees of freedom = df given by a
rather complicated formula.
The exact procedure for finding df is a two-stage procedure. First compute:

f = \frac{s_1^2 / n_1}{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}

Finally, compute df using the formula:

df = \frac{1}{\frac{f^2}{n_1 - 1} + \frac{(1 - f)^2}{n_2 - 1}}
One can show mathematically that the following inequality will always hold:

\min(n_1 - 1,\, n_2 - 1) \le df \le n_1 + n_2 - 2
If the standard deviations in the two groups are very different, then df will
tend to the lower end of the inequality. If the standard deviations in the two groups
are approximately the same, then df will tend to the upper end of the inequality.
Also, if n_1 and n_2 are both over 30 (which is often the case in business
situations), then df will be greater than 30, so one can simply use the normal
distribution. However, if either sample size (or both) is less than 30, then the
computation must be made.
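As an illustration of the two-stage df calculation, here is a small Python sketch (the function name welch_df is my own label, not something from the course files); the sample standard deviations 10.39 and 15.03 are the ones computed for the wastage data later in this lecture:

# Sketch of the two-stage degrees-of-freedom calculation described above.
def welch_df(s1, s2, n1, n2):
    v1 = s1 ** 2 / n1                  # s1^2 / n1
    v2 = s2 ** 2 / n2                  # s2^2 / n2
    f = v1 / (v1 + v2)                 # first stage
    return 1 / (f ** 2 / (n1 - 1) + (1 - f) ** 2 / (n2 - 1))   # second stage

print(welch_df(10.39, 15.03, 25, 25))  # about 42.7, which rounds to 43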
For a given alpha level, the confidence interval is then given by:
(\bar{x}_1 - \bar{x}_2) - t_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + t_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}

where the t_{\alpha/2} value is based on the df computed above. If 0 is in the interval one
accepts the null hypothesis. If 0 is not in the interval, reject the null hypothesis and
then the confidence interval provides bounds on the magnitude of the difference in
the mean values.
Fortunately, these computations have been automated. In the EXCEL file
“twosamp.xls”, I have included a template which will compute the confidence
interval for you. In order to use it you need the means and standard deviations
from the two samples. These are obtained using the EXCEL functions “average”
and “stdev” as before. The results are shown below:
The spreadsheet again lists the 25 Old Process values (column 1) and the 25 New
Process values (column 2) shown earlier, with the mean and standard deviation
computed below each column:

           Wasted Material       Wasted Material
           Old Process (1)       New Process (2)
mean =     203.02                190.49
stdev =    10.39                 15.03
Both samples contain 25 observations, and we will work at alpha level .05.
Substituting these values into the template yields:
Template for Confidence Interval on Positive Difference in Means

                     Group 1      Group 2
input ==> mean       203.0200     190.4900
input ==> sd         10.3900      15.0300
input ==> n          25           25

Approx Degrees of Freedom =    43
Alpha                          0.05

Confidence Interval on Positive Difference in Means:    5.1603    to    19.8997
Since the confidence interval does not contain zero, we would reject the null
hypothesis and conclude that the new process would reduce wastage somewhere
between 5.16 and 19.90.
Note that this confidence interval is wider than the one that would be
obtained if the data were paired.
Also note that since the standard deviations of the two samples are quite
different, we obtain a value of 43 degrees of freedom, which is below the value of 48
(= n_1 + n_2 - 2) that would apply if the standard deviations were approximately equal.
If the roles of groups 1 and 2 were reversed, the confidence interval would
have been:

-19.8997 \le \mu_2 - \mu_1 \le -5.1603
Since zero is not in the interval, we would again reject the null hypothesis.
EXCEL has a built-in function to compute the p-value of this test from raw
data. It can be accessed by clicking on "Tools", then "Data Analysis", and then
"t-test: Two-Sample Assuming Unequal Variances". In the resulting dialog box, the
data for Group 1 is in cells C7:C31, the data for Group 2 is in cells E7:E31, the
hypothesized mean difference is 0, and alpha is set at .05.
By hitting “OK” the following output is produced:
t-Test: Two-Sample Assuming Unequal Variances

                                  Variable 1    Variable 2
Mean                              203.018       190.486
Variance                          107.958       225.864
Observations                      25            25
Hypothesized Mean Difference      0
df                                43
t Stat                            3.429510
P(T<=t) one-tail                  0.000673
t Critical one-tail               1.681071
P(T<=t) two-tail                  0.001345
t Critical two-tail               2.016691
With a two-sided p-value of .001345, we would reject the null hypothesis.
The appropriate value of t_{\alpha/2} is 2.016691 (the two-tail critical value in the
output above), so one could construct the confidence interval as:
(203.02 - 190.49) \pm 2.016691 \times \sqrt{\frac{(10.39)^2}{25} + \frac{(15.03)^2}{25}}

which yields

5.1603 \le \mu_1 - \mu_2 \le 19.8997
exactly as before.
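The same analysis can be cross-checked in Python. The sketch below (SciPy assumed available; it is an illustration, not the course template) uses the summary statistics from the EXCEL output, with the unequal-variances option corresponding to the tool used above:

# Sketch: Welch's two-sample t-test from the summary statistics above
# (values taken from the EXCEL output; SciPy assumed available).
from scipy import stats

result = stats.ttest_ind_from_stats(mean1=203.018, std1=10.39, nobs1=25,
                                    mean2=190.486, std2=15.03, nobs2=25,
                                    equal_var=False)   # unequal variances
print(result.statistic, result.pvalue)   # roughly 3.43 and 0.0013

# Confidence interval using t(alpha/2) with df = 43:
se = (10.39 ** 2 / 25 + 15.03 ** 2 / 25) ** 0.5
t_crit = stats.t.ppf(0.975, df=43)                # about 2.0167
diff = 203.02 - 190.49
print(diff - t_crit * se, diff + t_crit * se)     # about 5.16 to 19.90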
Comparing Proportions From Two Independent Samples
Suppose the Human Resources Department of a company does a study of
men's and women's salaries. They find that, on average, women's salaries are
substantially lower than those of men. When this is brought to the attention of senior
management, one vice-president points out that he thinks that, on average, more men
have graduate degrees (MBAs) and thus would tend to have higher salaries than
women.
The director of Human Resources then draws two random samples, one of
male employees and one of female employees, with the following results:
                     Males      Females
Number with MBA      20         8
Total Sample         100        75
Proportion           0.2        0.106667
In the samples 20% of the men have an MBA while only a little less than
11% of the women have an MBA. Does this indicate that the proportion differs in
the populations of all male and female employees?
The general situation is as follows.
                          Random Sample from      Random Sample from
                          Population 1            Population 2
Number of Successes       x_1                     x_2
Sample Size               n_1                     n_2
Sample Proportion         \hat{p}_1 = x_1 / n_1   \hat{p}_2 = x_2 / n_2
Population Proportion     p_1                     p_2
The basic hypothesis is:
H_0: p_1 = p_2
H_A: p_1 \neq p_2

which, as in the case of means, is equivalent to:

H_0: p_1 - p_2 = 0
H_A: p_1 - p_2 \neq 0

or

H_0: p_2 - p_1 = 0
H_A: p_2 - p_1 \neq 0
The natural test statistic to use is:

\hat{p}_1 - \hat{p}_2

It can be shown theoretically that the standard deviation of the sampling
distribution of the difference of two independent proportions is given by the
equation:

SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}
Finally, it can be shown that if n_1 p_1 > 5 and n_2 p_2 > 5, then the statistic

z_{obs} = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}}

approximately follows the normal distribution.
This leads directly to a formula for a (1 - \alpha)100% confidence interval:

(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}
If 0 is inside this confidence interval, then one would accept the null
hypothesis that the two population proportions are equal. If 0 is not inside this
interval, then one would reject the null hypothesis and the confidence interval would
provide an estimate of the range of the difference between the two proportions.
EXCEL does not compute the confidence interval directly. However, the file
“twosamp.xls” contains a template as shown below.
Template for Approximate Confidence Interval on Positive Difference

                                 Sample 1    Sample 2
Input ==> Number of Successes    20          8
Input ==> Sample Size            100         75
          P-hat                  0.2000      0.1067

Alpha                            0.05

Approximate Confidence Interval on Positive Difference:    -0.0117    to    0.1983

Working with alpha at the .05 level, one enters the number of successes in the
first and second samples as well as the two sample sizes. The template then
computes the appropriate confidence interval using the appropriate z value.
Since this confidence interval includes the value of zero, the data is
insufficient to reject the hypothesis that males and females, at this company, have
MBAs in the same proportion, even though the male proportion of 20% is almost
twice as large as the female proportion of almost 11%!
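For completeness, here is a minimal Python sketch of the same approximate interval (SciPy assumed available for the normal quantile; this mirrors the template rather than any built-in EXCEL routine):

# Sketch: approximate confidence interval for the difference of two proportions.
from scipy import stats

x1, n1 = 20, 100          # males with an MBA, male sample size
x2, n2 = 8, 75            # females with an MBA, female sample size
alpha = 0.05

p1_hat, p2_hat = x1 / n1, x2 / n2
se = (p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2) ** 0.5
z_crit = stats.norm.ppf(1 - alpha / 2)           # about 1.96

diff = p1_hat - p2_hat
print(diff - z_crit * se, diff + z_crit * se)    # about -0.0117 to 0.1983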
Two Sample Structural Hypotheses
There is another way of approaching the previous problem of comparing two
proportions which leads to a general method for dealing with two sample structural
hypotheses.
Using the same data, array the information in the following 2 x 2 table:
            Male (Sample 1)    Female (Sample 2)    Total
MBA         20                 8                    28
No MBA      80                 67                   147
Total       100                75                   175
Call this the observed table. Now if the null hypothesis is true and both
males and females have the same probability of having an MBA, I could pool the
male and female results. Then, since 28 of the total of 175 employees have MBAs, I
would estimate
P(has MBA) = 28 / 175 = .16.
In other words I would estimate that 16% of my employees have an MBA.
Of 100 males I would expect 100*(.16) = 16 to have an MBA, and therefore
100 – 16 = 84 to not have an MBA. Further, I would expect 75*(.16) = 12 of the
females to have an MBA and therefore 75 – 12 = 63 not to have an MBA.
Accordingly, if the two groups had the same probability of having an MBA, I could
construct the following expected table:
Expected

            Male (Sample 1)    Female (Sample 2)    Total
MBA         16.00              12.00                28
No MBA      84.00              63.00                147
Total       100                75                   175
The question now becomes “Does the observed table agree enough with the
expected table constructed based on the assumption that both groups have the same
probability of having an MBA?”
This comparison is exactly the same kind of comparison we made when we
studied structural hypotheses in the one sample case, where we introduced the
chi-square statistic.
The same logic works here. In order to test the hypotheses that males and
females have the same probability distribution of having an MBA or not having an
MBA, we compute the expected table. Then we compute the chi-square statistic as:

\chi^2_{obs} = \sum_{i,j} \frac{(OBS_{ij} - EXP_{ij})^2}{EXP_{ij}}
where i indexes the rows of the tables and j indexes the columns. The degrees of
freedom in this case is given by (#rows –1 ) x (#cols – 1). One then uses the EXCEL
function “=chidist”, just as we did earlier, to get the p-value for this problem.
In our case, we get:
\chi^2_{obs} = \frac{(20 - 16)^2}{16} + \frac{(8 - 12)^2}{12} + \frac{(80 - 84)^2}{84} + \frac{(67 - 63)^2}{63} = 2.78
The degrees of freedom is (2 - 1) x (2 - 1) = 1. Therefore the p-value is:
p-value = chidist(2.78, 1) = .095581.
Since .095581 is greater than our alpha level of .05, we accept the null
hypothesis that the two proportions do not differ, that is, that males and females have
the same probability of having an MBA.
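The same test can be sketched in Python (SciPy assumed available; this is a cross-check, not part of the course files). Note that correction=False is needed so the routine matches the uncorrected statistic computed above, since by default SciPy applies a continuity correction to 2 x 2 tables:

# Sketch: chi-square test of the observed 2 x 2 table (SciPy assumed available).
from scipy import stats

observed = [[20, 8],      # MBA:    male, female
            [80, 67]]     # No MBA: male, female

chi2, p_value, df, expected = stats.chi2_contingency(observed, correction=False)
print(chi2, df, p_value)  # roughly 2.78, 1, 0.0956
print(expected)           # the expected table: [[16, 12], [84, 63]]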
The following table shows how the expected values were computed:

Expected

            Male             Female          Total
MBA         100*28/175       75*28/175       28
No MBA      100*147/175      75*147/175      147
Total       100              75              175
Notice that the expected value in a cell is nothing more than the total for that
row multiplied by the total for that column, divided by the grand total.
This leads to the result that the expected value for the entry in Row i and
Column j is given by the formula:
(Row i Total) x (Column j Total) / Grand Total.
This simplification will be useful for the next problem.
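This row-total times column-total rule is easy to automate. The following is a small illustrative sketch (NumPy assumed available), applied to the MBA table above:

# Sketch: expected counts as (row total x column total) / grand total.
import numpy as np

observed = np.array([[20, 8],      # MBA
                     [80, 67]])    # No MBA
row_totals = observed.sum(axis=1)  # [ 28, 147]
col_totals = observed.sum(axis=0)  # [100,  75]
grand_total = observed.sum()       # 175

expected = np.outer(row_totals, col_totals) / grand_total
print(expected)    # [[16. 12.] [84. 63.]]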
When we were studying one sample structural hypotheses, we looked at the
case where your company had a group of regular customers, your competitor had a
group of regular customers, and the remaining people bought opportunistically
from either you or your competitor. Suppose we did a market survey at a one-year
interval and observed the following results:
Observed

               Time 1    Time 2    Total
Your Company   3,946     4,119     8,065
Opportunity    879       600       1,479
Competitor     1,653     1,913     3,566
Total          6,478     6,632     13,110
Is there any indication that the market structure has changed?
This is equivalent to asking whether or not the market is following the same
probability distribution at Time 1 and Time 2.
Let us test the hypothesis that the probability distribution of customers is the
same at Times 1 and 2 at the .01 level.
The first step is to construct the expected table using the algorithm described
above. This results in the expected table:
Expected

               Time 1      Time 2      Total
Your Company   3,985.13    4,079.87    8,065.00
Opportunity    730.81      748.19      1,479.00
Competitor     1,762.06    1,803.94    3,566.00
Total          6,478.00    6,632.00    13,110.00
One then needs to compute (OBS - EXP)^2 / EXP for each cell, which would
result in the table below:

Contributions to Chi-Square

               Time 1    Time 2
Your Company   0.38      0.38
Opportunity    30.05     29.35
Competitor     6.75      6.59

chi-square obs = 73.50
Based on the observed chi-square statistic of 73.50, we compute the p-value with
(3 - 1) x (2 - 1) = 2 degrees of freedom as:
p-value = chidist(73.5, 2) = 1.0958E-16.
This means the observed data would have less than a one-in-a-quadrillion chance of
occurring by chance if the probability distributions had remained unchanged.
Since this is much less than our alpha level of .01, we reject the null hypothesis that
the probability distribution has not changed in favor of it having altered over time.
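The same calculation for this 3 x 2 table can be reproduced with the sketch below (SciPy assumed available; no continuity correction is applied for tables larger than 2 x 2):

# Sketch: chi-square test on the 3 x 2 market-structure table (SciPy assumed).
from scipy import stats

observed = [[3946, 4119],    # Your Company:  Time 1, Time 2
            [879, 600],      # Opportunity
            [1653, 1913]]    # Competitor

chi2, p_value, df, expected = stats.chi2_contingency(observed)
print(chi2, df, p_value)     # roughly 73.5, 2, 1.1e-16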
What has changed? To answer this question, first look only at the cells with
contributions over 3.5: the Opportunity cells (30.05 at Time 1 and 29.35 at Time 2)
and the Competitor cells (6.75 at Time 1 and 6.59 at Time 2).
Now compare the observed values and the expected values to see how they
have changed from Time 1 to Time 2. This results in the table below:

Observed compared to Expected

               Time 1       Time 2
Your Company   No Change    No Change
Opportunity    Higher       Lower
Competitor     Lower        Higher
The cells identified above are where significant differences have occurred
between Times 1 and 2. The Opportunity group has gone from above expectation to
below expectation. Your competitor has gone from below expectation to above
expectation. Your company has remained the same. This indicates that the
opportunity buyer group has decreased as a proportion of the population and that
your competitor seems to be getting a disproportionate number of them!
This can be seen directly by looking at the empirical probability distributions
at Times 1 and 2 as shown below:
Market Share

               Time 1      Time 2
Your Company   60.91%      62.11%
Opportunity    13.57%      9.05%
Competitor     25.52%      28.84%
Total          100.00%     100.00%
Of the drop of (13.57% - 9.05%) = 4.52% of Opportunity buyers, you got
(62.11% - 60.91%) = 1.20% while your competitor increased his market share by
(28.84% - 25.52%) = 3.32%.