Module III Lecture 3

Two-Sample Problems

Suppose that you are a manager in a plant that makes micro-chips. One of the key materials is quite expensive, so you want to minimize the inevitable wastage of this material. A vendor claims that his manufacturing equipment will result in less wasted material than your present process. Should you replace your existing equipment?

The above question is of course a complicated trade-off of the cost of lost material against the cost of new equipment. However, the question is not even worth your time unless the claim of less wasted material is true. Further, if there is a reduction in waste, an estimate is needed of how much material will be saved so that it can be factored into the cost trade-offs. Suppose you get the vendor to perform 25 runs of his new equipment and to measure the wastage. You then take a random sample of 25 runs using your present equipment, with the results shown below:

Run   Wasted Material (Present Process)   Wasted Material (New Process)
 1    196.00                              178.48
 2    202.32                              201.42
 3    220.89                              231.03
 4    198.16                              174.99
 5    188.51                              169.49
 6    210.12                              183.02
 7    204.64                              170.24
 8    191.68                              195.64
 9    186.46                              173.80
10    200.18                              191.37
11    201.49                              195.32
12    207.66                              194.18
13    187.93                              177.13
14    218.81                              223.99
15    192.34                              198.45
16    214.71                              188.50
17    206.06                              189.08
18    195.77                              197.06
19    209.65                              186.54
20    211.53                              189.55
21    221.05                              204.53
22    196.04                              174.46
23    196.17                              181.79
24    216.05                              197.14
25    201.24                              194.95

How can we determine if a reduction in wastage has occurred? It turns out that more information is needed to perform the analysis, since there are two distinct ways the data could have been collected. The first way would have been to take a random sample of 25 runs from the present process and then an independent sample of 25 runs from the new process. This is called the two independent sample case. Since the process is under the control of an operator, the second way the data could have been generated is if, say, operator 1 ran under the old process and this same operator then ran under the new process.
Then this process would be repeated for 24 other operators. This is called the paired sample case. That is because for each operator we have a run under the old system and a run under the new system. Which of these methods is better? In general, given a choice, the paired sample case is better in that it has a higher probability of detecting a difference if there is one. To see this, assume that there is variability in the way an operator runs the process. In the two independent sample case, any variability between runs is a mixture of the potential difference in the process results and the variability from operator to operator. In the paired sample case, since the same operator runs both the new and old process, these two measurements would differ primarily on the basis of the potential difference in the process only. Sometimes this is called "controlling for the operator". The key point is that you cannot tell by looking at the data whether you have two independent samples or paired samples. You must ask how the data was collected. The distinction is important since it affects how the analysis is performed.

The Paired Two Sample Case

In this case, we view the data as n pairs of observations (x1,i, x2,i) arrayed as follows:

Pair   Group 1   Group 2
1      x1,1      x2,1
2      x1,2      x2,2
...    ...       ...
n      x1,n      x2,n

The null hypothesis is that the two groups have the same mean, as opposed to having different means. Formally,

H0: μ1 = μ2    HA: μ1 ≠ μ2

Note that this can also be re-written as either:

H0: μ1 − μ2 = 0    HA: μ1 − μ2 ≠ 0

or

H0: μ2 − μ1 = 0    HA: μ2 − μ1 ≠ 0

Now since the data is paired we can define, for each pair i, either

di = x1,i − x2,i    or    di = x2,i − x1,i

the choice depending on which direction of subtraction is easier to interpret. Note that whichever way is chosen, it must be the same for all n values.
For example, in our case, since we are expecting the New Process (Group 2) to have less wastage than the Old Process (Group 1), it makes sense to subtract the values of Group 2 from the values in Group 1, giving the decrease in wastage directly. This is shown in the data below:

Operator   Wasted Material (Old Process)   Wasted Material (New Process)   Difference
 1         196.00                          178.48                           17.52
 2         202.32                          201.42                            0.91
 3         220.89                          231.03                          -10.14
 4         198.16                          174.99                           23.17
 5         188.51                          169.49                           19.02
 6         210.12                          183.02                           27.10
 7         204.64                          170.24                           34.39
 8         191.68                          195.64                           -3.96
 9         186.46                          173.80                           12.65
10         200.18                          191.37                            8.81
11         201.49                          195.32                            6.17
12         207.66                          194.18                           13.48
13         187.93                          177.13                           10.80
14         218.81                          223.99                           -5.18
15         192.34                          198.45                           -6.11
16         214.71                          188.50                           26.22
17         206.06                          189.08                           16.97
18         195.77                          197.06                           -1.28
19         209.65                          186.54                           23.11
20         211.53                          189.55                           21.97
21         221.05                          204.53                           16.52
22         196.04                          174.46                           21.58
23         196.17                          181.79                           14.38
24         216.05                          197.14                           18.91
25         201.24                          194.95                            6.28

By this process, we now have one sample of differences d1, d2, ..., dn. Now one can show theoretically that

μd = E(di) = E(x1,i − x2,i) = E(x1,i) − E(x2,i) = μ1 − μ2

This means that:

μd = 0 if and only if μ1 − μ2 = 0,    and    μd ≠ 0 if and only if μ1 − μ2 ≠ 0

Because of this equivalence, we can test the hypothesis that the two groups have the same mean by testing the hypothesis:

H0: μd = 0    HA: μd ≠ 0

The confidence interval is given by:

d̄ − tα/2 · sd/√n  ≤  μd  ≤  d̄ + tα/2 · sd/√n

where the t-distribution with n − 1 degrees of freedom is used. This is exactly the same form as the confidence interval for the mean in the previous lecture, except that now we use the mean and standard deviation of the d's as our basis. That is:

d̄ = ( Σ i=1..n di ) / n

sd = √[ Σ i=1..n (di − d̄)² / (n − 1) ]

If 0 is inside the confidence interval, then we accept the null hypothesis that the two groups have the same mean. If 0 is not inside the interval, then we reject the null hypothesis, and the confidence interval gives us an estimate of how much the means of the two groups differ.
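Although the lecture carries out these computations in EXCEL, the same arithmetic can be sketched in a few lines of Python as a check on the numbers above. This is a minimal sketch: the two-sided critical value t(.025, 24) ≈ 2.0639 is hard-coded rather than looked up from a t-table.

```python
from math import sqrt

# The 25 paired differences (Old - New) from the table above.
diffs = [17.52, 0.91, -10.14, 23.17, 19.02, 27.10, 34.39, -3.96, 12.65,
         8.81, 6.17, 13.48, 10.80, -5.18, -6.11, 26.22, 16.97, -1.28,
         23.11, 21.97, 16.52, 21.58, 14.38, 18.91, 6.28]

n = len(diffs)
d_bar = sum(diffs) / n                                       # mean of the differences
s_d = sqrt(sum((d - d_bar) ** 2 for d in diffs) / (n - 1))   # their standard deviation

t_crit = 2.0639                      # two-sided t critical value, 24 df, alpha = .05
margin = t_crit * s_d / sqrt(n)
lower, upper = d_bar - margin, d_bar + margin

print(round(d_bar, 2), round(s_d, 2))    # 12.53 11.71
print(f"{lower:.2f} to {upper:.2f}")     # 7.70 to 17.36
```

These are the same mean, standard deviation, and interval that the EXCEL template produces in the next section.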
In our case we can use the EXCEL functions "=AVERAGE(range of data)" and "=STDEV(range of data)" to compute d̄ and sd. Applying them to the Difference column above gives:

mean  = 12.53
stdev = 11.71

If we choose to do our test at the .05 level, we can use our automatic template from the EXCEL file "onesam.xls" to obtain the confidence interval as shown below:

Template for Confidence Interval
Enter ===>   Sample Mean   Sample SD   Sample n   alpha
             12.53         11.71       25         0.05
Confidence Interval is 7.696351 to 17.36365

Since zero is not in the confidence interval, we reject the null hypothesis that the groups have the same mean in favor of the alternative that the means are different. How much would wastage be reduced? The confidence interval indicates that on average the reduction would be between 7.69 and as much as 17.36.

Incidentally, if you had looked at di = x2,i − x1,i (the reverse of what we did), then the only difference would be that we would have d̄ = −12.53 and the confidence interval would be given by:

Template for Confidence Interval
Enter ===>   Sample Mean   Sample SD   Sample n   alpha
             -12.53        11.71       25         0.05
Confidence Interval is -17.3636 to -7.69635

Since zero is not in the confidence interval, we would also reject the null hypothesis. EXCEL will compute the appropriate test in the paired sample case directly.
To do so click on "Tools", "Data Analysis", and then "t-Test: Paired Two Sample for Means" to bring up the input screen. The data for Process 1 is in cells C7:C31. The data for Process 2 is in cells E7:E31. The hypothesized mean difference is 0 and the alpha level is .05. By hitting "OK" you get the following result:

t-Test: Paired Two Sample for Means
                               Variable 1   Variable 2
Mean                           203.018      190.486
Variance                       107.958      225.864
Observations                   25           25
Pearson Correlation            0.6298674
Hypothesized Mean Difference   0
df                             24
t Stat                         5.3512423
P(T<=t) one-tail               8.561E-06
t Critical one-tail            1.7108823
P(T<=t) two-tail               1.712E-05
t Critical two-tail            2.0638981

The two-sided p-value is .0000171, which is much less than .05, so we would reject the null hypothesis. If you now wished to compute the confidence interval directly, the above table gives you the appropriate value of tα/2 = 2.063898. The confidence interval would then be:

12.53 − 2.063898 · (11.71/√25)  ≤  μd  ≤  12.53 + 2.063898 · (11.71/√25)

yielding 7.69635 ≤ μd ≤ 17.36365, just as before.

The Two Independent Sample Case

In this situation, we have two totally independent samples taken from two different populations. The basic hypothesis, as before, is:

H0: μ1 = μ2    HA: μ1 ≠ μ2

which, as before, can be written as:

H0: μ1 − μ2 = 0    HA: μ1 − μ2 ≠ 0

or

H0: μ2 − μ1 = 0    HA: μ2 − μ1 ≠ 0

The structure of the data is:

Group 1   Group 2
x1,1      x2,1
x1,2      x2,2
...       ...
x1,n1     x2,n2

Notice that there is no pairing of the observations; indeed, the groups do not even have to have the same number of observations. Since we can formulate the hypothesis as the difference in population means, the natural statistic to use is the difference in the sample means. It makes no difference if we look at the difference of the sample mean of the first group minus the sample mean of the second group, or the reverse.
Accordingly we need to know the sampling distribution of x̄1 − x̄2, where

x̄1 = ( Σ i=1..n1 x1,i ) / n1    and    x̄2 = ( Σ i=1..n2 x2,i ) / n2

Theoretically, one can show that the standard deviation of the sampling distribution of the difference in the means is:

SE(x̄1 − x̄2) = √( σ1²/n1 + σ2²/n2 )

Since we do not know the population variances, the natural estimate of the standard error of the difference in the sample means is:

√( s1²/n1 + s2²/n2 )

where

s1 = √[ Σ i=1..n1 (x1,i − x̄1)² / (n1 − 1) ]    and    s2 = √[ Σ i=1..n2 (x2,i − x̄2)² / (n2 − 1) ]

Theoretical statisticians then studied the sampling distribution of the statistic:

tobs = ( x̄1 − x̄2 ) / √( s1²/n1 + s2²/n2 )

Although no exact solution has been found, it has been established that it can be closely approximated by a t-distribution with degrees of freedom df given by a rather complicated formula. The exact procedure for finding df is a two-stage procedure. First compute:

f = ( s1²/n1 ) / ( s1²/n1 + s2²/n2 )

Then compute df using the formula:

df = 1 / [ f²/(n1 − 1) + (1 − f)²/(n2 − 1) ]

One can show mathematically that the following inequality will always hold:

min( n1 − 1, n2 − 1 )  ≤  df  ≤  n1 + n2 − 2

If the standard deviations in the two groups are very different, then df will tend to the lower end of the inequality. If the standard deviations in the two groups are approximately the same, then df will tend to the upper end. Also, if n1 and n2 are both over 30 (which is often the case in business situations), then df will be greater than 30, so that one can just use the normal distribution. However, if either sample size (or both) is less than 30, then the computation must be made. For a given alpha level, the confidence interval is then given by:

( x̄1 − x̄2 ) − tα/2 √( s1²/n1 + s2²/n2 )  ≤  μ1 − μ2  ≤  ( x̄1 − x̄2 ) + tα/2 √( s1²/n1 + s2²/n2 )

where the tα/2 value is based on the df computed above. If 0 is in the interval, one accepts the null hypothesis. If 0 is not in the interval, reject the null hypothesis, and then the confidence interval provides bounds on the magnitude of the difference in the mean values.
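The two-stage degrees-of-freedom formula and the confidence interval can be checked numerically. The sketch below plugs in the summary statistics from the wastage example (means 203.02 and 190.49, standard deviations 10.39 and 15.03, n1 = n2 = 25); the critical value t(.025, 43) = 2.016691 is hard-coded from a t-table rather than computed.

```python
from math import sqrt

# Summary statistics for the two wastage samples.
m1, s1, n1 = 203.02, 10.39, 25   # old process
m2, s2, n2 = 190.49, 15.03, 25   # new process

v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2   # s1^2/n1 and s2^2/n2

# Two-stage degrees-of-freedom computation.
f = v1 / (v1 + v2)
df = 1 / (f ** 2 / (n1 - 1) + (1 - f) ** 2 / (n2 - 1))
print(round(df))   # 43  -- between min(24, 24) = 24 and 25 + 25 - 2 = 48

# Confidence interval, using t(.025, 43) = 2.016691.
t_crit = 2.016691
margin = t_crit * sqrt(v1 + v2)
print(f"{m1 - m2 - margin:.4f} to {m1 - m2 + margin:.4f}")   # 5.1603 to 19.8997
```

Note that df lands near the lower end of its range here because the two standard deviations are quite different, exactly as the inequality discussion above predicts.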
Fortunately, these computations have been automated. In the EXCEL file "twosamp.xls", I have included a template which will compute the confidence interval for you. In order to use it you need the means and standard deviations from the two samples. These are obtained using the EXCEL functions "average" and "stdev" as before. For the wastage data this gives:

          Old Process (1)   New Process (2)
mean  =   203.02            190.49
stdev =   10.39             15.03

Both samples contain 25 observations, and we will work at alpha level .05. Substituting these values into the template yields:

Template for Confidence Interval on Positive Difference in Means
                  Group 1    Group 2
input ==> mean    203.0200   190.4900
input ==> sd      10.3900    15.0300
input ==> n       25         25
Approx Degrees of Freedom = 43          Alpha 0.05
Confidence Interval on Positive Difference in Means: 5.1603 to 19.8997

Since the confidence interval does not contain zero, we would reject the null hypothesis and conclude that the new process would reduce wastage somewhere between 5.16 and 19.90. Note that this confidence interval is wider than the one obtained when the data were treated as paired. Also note that since the standard deviations of the two samples are relatively different, we obtain a value of 43 degrees of freedom, which is below the value of 48 that would apply if the standard deviations were approximately equal. If the roles of groups 1 and 2 were reversed, the confidence interval would have been:

−19.8997  ≤  μ2 − μ1  ≤  −5.1603

Since zero is not in the interval, we would again reject the null hypothesis. EXCEL has a built-in function to compute the p-value of this test from raw data.
It can be accessed by clicking on "Tools", then "Data Analysis", and then "t-Test: Two-Sample Assuming Unequal Variances". The data for Group 1 is in cells C7:C31, the data for Group 2 is in cells E7:E31, the hypothesized mean difference is 0, and alpha is set at .05. By hitting "OK" the following output is produced:

t-Test: Two-Sample Assuming Unequal Variances
                               Variable 1   Variable 2
Mean                           203.018      190.486
Variance                       107.958      225.864
Observations                   25           25
Hypothesized Mean Difference   0
df                             43
t Stat                         3.429510
P(T<=t) one-tail               0.000673
t Critical one-tail            1.681071
P(T<=t) two-tail               0.001345
t Critical two-tail            2.016691

With a two-sided p-value of .001345, we would reject the null hypothesis. The appropriate value of tα/2 is 2.016691, so that one could construct the confidence interval as:

( 203.02 − 190.49 ) ± 2.016691 · √( (10.39)²/25 + (15.03)²/25 )

which yields 5.1603 ≤ μ1 − μ2 ≤ 19.8997, exactly as before.

Comparing Proportions From Two Independent Samples

Suppose the Human Resources Department of a company does a study on men's and women's salaries. They find that on average women's salaries are substantially lower than those of men. When brought to the attention of senior management, one vice-president points out that he thinks that on average more men have graduate degrees (MBAs) and thus would tend to have higher salaries than women. The director of Human Resources then conducts two random samples, one of male employees and one of female employees, with the following results:

                  Males   Females
Number with MBA   20      8
Total Sample      100     75
Proportion        0.2     0.106667

In the samples, 20% of the men have an MBA while only a little less than 11% of the women have an MBA. Does this indicate that the proportion differs in the populations of all male and female employees? The general situation is as follows.
                        Random Sample from Population 1   Random Sample from Population 2
Number of Successes     x1                                x2
Sample Size             n1                                n2
Sample Proportion       p̂1 = x1/n1                        p̂2 = x2/n2
Population Proportion   p1                                p2

The basic hypothesis is:

H0: p1 = p2    HA: p1 ≠ p2

which, as in the case of means, is equivalent to:

H0: p1 − p2 = 0    HA: p1 − p2 ≠ 0

or

H0: p2 − p1 = 0    HA: p2 − p1 ≠ 0

The natural test statistic to use is p̂1 − p̂2. It can be shown theoretically that the standard deviation of the sampling distribution of the difference of two independent proportions is given by the equation:

SE( p̂1 − p̂2 ) = √( p1(1 − p1)/n1 + p2(1 − p2)/n2 )

Finally, it can be shown that if n1·p1 > 5 and n2·p2 > 5, then the statistic

zobs = ( p̂1 − p̂2 ) / √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 )

approximately follows the normal distribution. This leads directly to a formula for a (1 − α)·100% confidence interval:

( p̂1 − p̂2 ) ± zα/2 · √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 )

If 0 is inside this confidence interval, then one would accept the null hypothesis that the two population proportions are equal. If 0 is not inside this interval, then one would reject the null hypothesis, and the confidence interval would provide an estimate of the range of the difference between the two proportions. EXCEL does not compute the confidence interval directly. However, the file "twosamp.xls" contains a template as shown below:

Template for Approximate Confidence Interval
                               Sample 1   Sample 2
Input ==> Number of Successes  20         8
Input ==> Sample Size          100        75
P-hat                          0.2000     0.1067
Alpha: 0.05
Approximate Confidence Interval on Positive Difference: -0.0117 to 0.1983

Working with alpha at the .05 level, one enters the number of successes in the first and second samples as well as the two sample sizes. The computer then computes the appropriate confidence interval using the appropriate z value. Since this confidence interval includes the value of zero, the data is insufficient to reject the hypothesis that male and female employees at this company have MBAs in the same proportion.
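The template's arithmetic can be reproduced directly. This is a minimal sketch using the MBA counts and the hard-coded normal critical value z(.025) ≈ 1.96:

```python
from math import sqrt

x1, n1 = 20, 100   # males with an MBA, male sample size
x2, n2 = 8, 75     # females with an MBA, female sample size
z = 1.959964       # z(.025) for a 95% interval

p1_hat, p2_hat = x1 / n1, x2 / n2
se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
lower = (p1_hat - p2_hat) - z * se
upper = (p1_hat - p2_hat) + z * se
print(f"{lower:.4f} to {upper:.4f}")   # -0.0117 to 0.1983
```

The interval straddles zero, so the sample difference of roughly nine percentage points is not statistically significant at the .05 level.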
This is so even though the male sample proportion of 20% is almost twice the female proportion of about 11%!

Two Sample Structural Hypotheses

There is another way of approaching the previous problem of comparing two proportions which leads to a general method for dealing with two sample structural hypotheses. Using the same data, array the information in the following 2 x 2 table:

Observed   Male (Sample 1)   Female (Sample 2)   Total
MBA        20                8                   28
No MBA     80                67                  147
Total      100               75                  175

Call this the observed table. Now if the null hypothesis is true and both males and females have the same probability of having an MBA, I could pool the male and female results. Then of the total of 175 employees, since 28 have MBAs, I would estimate P(has MBA) = 28/175 = .16. In other words, I would estimate that 16% of my employees have an MBA. Of 100 males I would expect 100 × .16 = 16 to have an MBA, and therefore 100 − 16 = 84 not to have an MBA. Further, I would expect 75 × .16 = 12 of the females to have an MBA and therefore 75 − 12 = 63 not to have an MBA. Accordingly, if the two groups had the same probability of having an MBA, I could construct the following expected table:

Expected   Male (Sample 1)   Female (Sample 2)   Total
MBA        16.00             12.00               28
No MBA     84.00             63.00               147
Total      100               75

The question now becomes: "Does the observed table agree enough with the expected table constructed under the assumption that both groups have the same probability of having an MBA?" This is exactly the same kind of comparison we made when we studied structural hypotheses in the one sample case, where we introduced the chi-square statistic. The same logic works here. In order to test the hypothesis that males and females have the same probability distribution of having an MBA or not having an MBA, we compute the expected table. Then we compute the chi-square statistic as:

χ²obs = Σ i,j ( OBSij − EXPij )² / EXPij

where i indexes the rows of the tables and j indexes the columns.
The degrees of freedom in this case is given by (#rows − 1) × (#cols − 1). One then uses the EXCEL function "=CHIDIST", just as we did earlier, to get the p-value for this problem. In our case, we get:

χ²obs = (20 − 16)²/16 + (8 − 12)²/12 + (80 − 84)²/84 + (67 − 63)²/63 = 2.78

The degrees of freedom is (2 − 1) × (2 − 1) = 1. Therefore the p-value is:

p-value = CHIDIST(2.78, 1) = .095581

Since .095581 is greater than our alpha level of .05, we accept the null hypothesis that the two proportions do not disagree, i.e., that males and females have the same probability of having an MBA. The following table shows how the expected values were computed:

Expected   Male            Female          Total
MBA        100·28/175      75·28/175       28
No MBA     100·147/175     75·147/175      147
Total      100             75              175

Notice that the expected value in a cell is nothing more than the total for that row multiplied by the total for that column, divided by the grand total. This leads to the result that the expected value for the entry in Row i and Column j is given by the formula:

(Row i Total) × (Column j Total) / Grand Total

This simplification will be useful for the next problem. When we were studying one sample structural hypotheses, we looked at the case where your company had a group of regular customers, your competitor had a group of regular customers, and the remaining people bought opportunistically between you and your competitor. Suppose we did a market survey at a one-year interval and observed the following results:

Observed       Time 1   Time 2   Total
Your Company   3,946    4,119    8,065
Opportunity    879      600      1,479
Competitor     1,653    1,913    3,566
Total          6,478    6,632    13,110

Is there any indication that the market structure has changed? This is equivalent to asking whether or not the market is following the same probability distribution at Time 1 and Time 2. Let us test the hypothesis that the probability distribution of customers is the same at Times 1 and 2 at the .01 level.
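Before working the larger market table, the 2 x 2 MBA computation above can be verified with a short sketch. The expected counts use the row-total × column-total / grand-total rule, and for one degree of freedom the chi-square tail probability reduces exactly to erfc(√(χ²/2)), so no table lookup or special library is needed:

```python
from math import erfc, sqrt

# Observed 2 x 2 table from the MBA example.
observed = [[20, 8],     # MBA:    male, female
            [80, 67]]    # No MBA: male, female

row = [sum(r) for r in observed]            # row totals: 28, 147
col = [sum(c) for c in zip(*observed)]      # column totals: 100, 75
grand = sum(row)                            # grand total: 175

# Chi-square statistic: sum of (OBS - EXP)^2 / EXP over all four cells,
# with EXP = (row total)(column total)/(grand total).
chi2 = sum((observed[i][j] - row[i] * col[j] / grand) ** 2
           / (row[i] * col[j] / grand)
           for i in range(2) for j in range(2))

p_value = erfc(sqrt(chi2 / 2))   # tail probability for df = (2-1)(2-1) = 1
print(round(chi2, 2), round(p_value, 4))   # 2.78 0.0956
```

This reproduces the p-value of about .0956 found with CHIDIST above, so the null hypothesis is again accepted at the .05 level.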
The first step is to construct the expected table using the algorithm described above. This results in the expected table:

Expected       Time 1     Time 2     Total
Your Company   3,985.13   4,079.87   8,065.00
Opportunity    730.81     748.19     1,479.00
Competitor     1,762.06   1,803.94   3,566.00
Total          6,478.00   6,632.00   13,110.00

One then needs to compute (OBS − EXP)²/EXP for each cell, which results in the table below:

Contributions to Chi-Square   Time 1   Time 2
Your Company                  0.38     0.38
Opportunity                   30.05    29.35
Competitor                    6.75     6.59
chi-square obs = 73.50

Based on the chi-square obs statistic of 73.50, we compute the p-value with (3 − 1) × (2 − 1) = 2 degrees of freedom as:

p-value = CHIDIST(73.5, 2) = 1.0958E-16

This means the observed data has roughly a 1 in 10 quadrillion chance of occurring by chance if the probability distributions have remained unchanged. Since this is much less than our alpha level of .01, we reject the null hypothesis that the probability distribution has not changed, in favor of its having altered over time. What has changed? To answer this question, first look only at the cells with contributions over 3.5: the Opportunity and Competitor cells at both times. Now compare the observed values and the expected values to see how they have changed from Time 1 to Time 2. This results in the table below:

Observed compared to Expected   Time 1      Time 2
Your Company                    No Change   No Change
Opportunity                     Higher      Lower
Competitor                      Lower       Higher

The Opportunity and Competitor rows are where significant differences have occurred between Times 1 and 2. The Opportunity group has gone from over expectation to below expectation. Your competitor has gone from below expectation to above expectation. Your company has remained the same.
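The same machinery verifies the market-survey computation; for two degrees of freedom the chi-square tail probability simplifies exactly to exp(−χ²/2):

```python
from math import exp

# Observed counts (Time 1, Time 2) from the market survey.
observed = [[3946, 4119],   # Your Company
            [879,  600],    # Opportunity
            [1653, 1913]]   # Competitor

row = [sum(r) for r in observed]          # 8065, 1479, 3566
col = [sum(c) for c in zip(*observed)]    # 6478, 6632
grand = sum(row)                          # 13110

# Sum of (OBS - EXP)^2 / EXP over all six cells.
chi2 = sum((observed[i][j] - row[i] * col[j] / grand) ** 2
           / (row[i] * col[j] / grand)
           for i in range(3) for j in range(2))

p_value = exp(-chi2 / 2)   # tail probability for df = (3-1)(2-1) = 2
print(f"{chi2:.2f}")       # 73.50
```

The resulting p-value is on the order of 1E-16, matching the CHIDIST value above, so the null hypothesis is overwhelmingly rejected.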
This indicates that the opportunity buyer group has decreased as a proportion of the population and that your competitor seems to be getting a disproportionate share of them! This can be seen directly by looking at the empirical probability distributions at Times 1 and 2, as shown below:

Market Share   Time 1    Time 2
Your Company   60.91%    62.11%
Opportunity    13.57%    9.05%
Competitor     25.52%    28.84%
Total          100.00%   100.00%

Of the drop of 13.57% − 9.05% = 4.52% in Opportunity buyers, you picked up 62.11% − 60.91% = 1.20%, while your competitor increased his market share by 28.84% − 25.52% = 3.32%.
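The closing share arithmetic can be reproduced from the observed counts, with percentages rounded to two decimals as in the table:

```python
# Observed counts (Time 1, Time 2) for each group.
counts = {
    "Your Company": (3946, 4119),
    "Opportunity":  (879,  600),
    "Competitor":   (1653, 1913),
}
tot1 = sum(t1 for t1, _ in counts.values())   # 6,478 surveyed at Time 1
tot2 = sum(t2 for _, t2 in counts.values())   # 6,632 surveyed at Time 2

# Market shares as percentages, rounded to two decimals as in the table.
share = {k: (round(100 * t1 / tot1, 2), round(100 * t2 / tot2, 2))
         for k, (t1, t2) in counts.items()}

drop = round(share["Opportunity"][0] - share["Opportunity"][1], 2)   # Opportunity's loss
you  = round(share["Your Company"][1] - share["Your Company"][0], 2) # your gain
comp = round(share["Competitor"][1] - share["Competitor"][0], 2)     # competitor's gain
print(drop, you, comp)   # 4.52 1.2 3.32
```

As the text notes, of the 4.52-point drop in Opportunity buyers, your company picked up 1.20 points while the competitor picked up 3.32.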