Module III
Lecture 2
One Sample Situations
Testing Hypotheses About the Mean
In the last lecture, we studied intensively the situation where we tested hypotheses of the form:

H0: μ = μ0
HA: μ ≠ μ0

in the circumstance where we took a random sample of size n and computed the sample mean. However, we assumed that we knew σ! This assumption is usually only tenable in the quality control case where we have a great deal of information on the underlying process. But suppose we don't know σ, which is the usual case.
For example, suppose you worked for Bull's Eye, a corporation which owns mid-scale department stores. Suppose you took a random sample of 25 customers and determined that the average purchase they made was $35.00 worth of goods with a standard deviation of $30.00, using the formulae from Module I:

x̄ = Σ x_i / n = 35.00

s = √[ Σ (x_i − x̄)² / (n − 1) ] = 30.00
Since you don't know σ, you would be tempted to substitute the sample standard deviation s in the formulae of the preceding lecture. Would this make a difference?
The answer is maybe yes and maybe no. Modifying Method 2 from the preceding lecture, we could compute the statistic:

t_obs = √n (x̄ − μ0) / s

which differs from z_obs only by substituting s for σ.
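For readers following along outside of EXCEL, here is a minimal Python sketch of this statistic (the function name t_obs is purely illustrative, not part of the lecture's spreadsheet):

from math import sqrt

def t_obs(xbar, s, n, mu0):
    # t statistic: substitute the sample SD s where z_obs used the known sigma
    return sqrt(n) * (xbar - mu0) / s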
In the early twentieth century, William Gosset (publishing under the pseudonym "Student") showed that t_obs follows a distribution called the "t" distribution, which is symmetric and bell shaped and indexed by a term called the degrees of freedom.
Degrees of freedom is a mathematical term which has to do with the dimension of certain spaces which enter into theoretical derivations. It will vary from problem to problem.
How does the t distribution with df degrees of freedom compare with the
standard normal distribution?
The answer depends on how large df is. For example, below is a graph showing the standard normal distribution (mean = 0 and standard deviation = 1) and the t distribution with 1 degree of freedom. (This example can be found in the file tdist.xls.)
[Figure: Comparison of Normal and t with 1 df — pdf plotted against the t or z-score, from −3 to 3, with the Normal and t distribution curves overlaid.]
Notice that the t distribution is flatter in the middle but has more probability in the
tails.
Now let us increase the degrees of freedom to 5. The resulting comparison is
shown below:
[Figure: Comparison of Normal and t with 5 df — pdf plotted against the t or z-score.]
Notice that although the t distribution is still lower in the middle and has heavier
tails, the two distributions are much closer.
Now let us look at the situation with 20 degrees of freedom.
[Figure: Comparison of Normal & t with 20 df — pdf plotted against the t or z-score.]
The two distributions are even closer although close inspection will still show that
the t distribution is a little lower in the middle and has slightly heavier tails.
Finally, let us look at the comparison when the degrees of freedom are 30.
[Figure: Comparison of Normal & t with 30 df — pdf plotted against the t or z-score.]
As can be seen the two distributions are almost indistinguishable.
We will therefore use the following rule: use the t distribution when the standard deviation is estimated from the sample and the degrees of freedom are fewer than 30; otherwise use the normal values as before.
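As a rough check on this rule, here is a small Python sketch (assuming the scipy library is available) that compares the density at zero and the tail probability beyond 2 for the t distribution at several degrees of freedom against the standard normal; by df = 30 the two nearly agree:

from scipy import stats

print("normal:", stats.norm.pdf(0), stats.norm.sf(2))
for df in (1, 5, 20, 30):
    # density at 0 (lower for t) and P(T > 2) (larger for t, i.e. heavier tails)
    print("t, df =", df, ":", stats.t.pdf(0, df), stats.t.sf(2, df))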
A 100(1 − α)% confidence interval based on a sample of size n would then be given by:

x̄ − t_{α/2} s/√n ≤ μ ≤ x̄ + t_{α/2} s/√n

The appropriate t value can be found from the EXCEL function "tinv". The form of "tinv" is:

=tinv(two-sided α, df)

Since in this case df = n − 1, one gets:

t_{α/2} = tinv(α, n − 1)
In our sample case, if we want a 95% confidence interval so that α = .05, we have for df = 25 − 1 that:

t_{.025} = tinv(.05, 24) = 2.0639

Therefore our 95% confidence interval is given by:

35 − 2.0639 × 30/√25 ≤ μ ≤ 35 + 2.0639 × 30/√25

or,

$22.62 ≤ μ ≤ $47.38

Any hypothesized value between $22.62 and $47.38 would be accepted at the 5% level of significance.
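The same interval can be reproduced with a short Python sketch (assuming scipy; scipy's t.ppf plays the role of EXCEL's tinv, but takes the one-sided probability 1 − α/2 rather than the two-sided α):

from math import sqrt
from scipy import stats

xbar, s, n, alpha = 35.00, 30.00, 25, 0.05
t_crit = stats.t.ppf(1 - alpha / 2, n - 1)          # about 2.0639, matching tinv(.05, 24)
half_width = t_crit * s / sqrt(n)                   # about 12.38
print(xbar - half_width, "to", xbar + half_width)   # roughly 22.62 to 47.38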
In the EXCEL file "onesam.xls" I have included a section that will
automatically compute the confidence interval if you enter the sample mean, the
sample standard deviation, the sample size and the alpha level. For our example the
result is shown below:
Template for Confidence Interval
Enter ===>    Sample Mean = 35    Sample SD = 30    Sample n = 25    alpha = 0.05
Confidence Interval is 22.61661 to 47.38339
If you wanted a 99% confidence interval, one need only change alpha to .01
to obtain:
Template for Confidence Interval
Enter ===>    Sample Mean = 35    Sample SD = 30    Sample n = 25    alpha = 0.01
Confidence Interval is 18.21829 to 51.78171
As you can see, the 95% confidence interval is very wide (±$12.38). Can it be made smaller? From the formula for the confidence interval, the only things under the control of the manager are the significance level and the sample size. Suppose we wanted to know the average purchase to within $5.00; how big a sample should we take?
Let W be the desired half-width of the interval (W = 5.00 in our case). Mathematically, we want a confidence interval of the form:

μ = x̄ ± W

But our confidence interval is:

μ = x̄ ± t_{α/2} s/√n .

We can ensure that we achieve the width W if:

t_{α/2} s/√n ≤ W .

Rearranging and solving for n, one obtains:

n ≥ t²_{α/2} s² / W² .

Since this sample size is usually over 30, one usually uses z_{α/2} instead of t_{α/2}, and σ instead of s, so that the formula is usually given as:

n ≥ z²_{α/2} σ² / W² .
A practical problem in using this formula is that it requires knowledge of σ. However, it is easy to get around this problem by taking a pre-sample and using the standard deviation of this pre-sample to estimate σ.
In our case we want to know the average purchase to within $5.00. We have already taken a sample of size 25 and obtained s = $30.00. For a 95% confidence interval we take z_{α/2} = 1.96. Plugging into the formula we obtain:

n ≥ (1.96)²(30)² / (5)² = 138.2976 ≈ 138
Now since we already have 25 observations, we would take a further 138 – 25 = 113
random observations. We would then have a total of 138 and we would re-compute
the sample mean and sample standard deviation to obtain the confidence interval.
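A minimal Python sketch of this sample-size calculation (assuming scipy for the normal quantile; with z_{α/2} = 1.96 it reproduces the 138.3 above):

from scipy import stats

s, W, alpha = 30.00, 5.00, 0.05
z = stats.norm.ppf(1 - alpha / 2)     # about 1.96
n_needed = z**2 * s**2 / W**2         # about 138.3, which the text rounds to 138
print(n_needed)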
In our case suppose that the mean of all 138 observations is $35.25 and the
standard deviation is $29.85. Plugging into the formula given earlier yields:
Template for Confidence Interval
Enter ===>    Sample Mean = 35.25    Sample SD = 29.85    Sample n = 138    alpha = 0.05
Confidence Interval is 30.22534 to 40.27466
The confidence interval is $35.25 +/- $5.02 which is very close to our desired
accuracy.
Technical Note: (Not Required)
The formulae above all assume that the ratio of sample size to population size,
n'/N, is small (say less than 5%). If that is not the case, then the correct formula for a
100(1-α)% confidence interval is:
x̄ ± t_{α/2} (s/√n') √(1 − n'/N)
If N, the population size, is small, it is possible to use the formula for determining the size of a sample to get a confidence interval of size ±W, and get a value of n which is greater than N. That is, the sample size is bigger than the size of the population. If that should happen, the formula below gives the correct sample size n' which should be used in the confidence interval formula given immediately above.
Let n be given by the formula:

n = z²_{α/2} σ² / W²

as before; then n' is given by the formula:

n' = n / (1 + n/N)
Although I feel that confidence intervals are by far the most practical method
of inference (when they can be computed), it is possible to apply Methods 2 and 3 if
one wishes to test a specific hypothesis.
Let us return to the case that we started with:

x̄ = 35.00, s = 30.00, n = 25

Suppose we wish to test the hypothesis:

H0: μ = 45
HA: μ ≠ 45

(which we know will be accepted, since 45 is inside the confidence interval).
Using Method 2 we would compute:

t_obs = √n (x̄ − μ0) / s = √25 (35 − 45) / 30 = −1.66666

Since this value is within the range ±2.0639 we would accept the hypothesis.
To use Method 3 and determine the p-value, one need only use the formula:

two-sided p-value = tdist(abs(t_obs), n − 1, 2).

The absolute value sign is necessary due to the programming assumptions made in EXCEL. The second entry is the degrees of freedom. The final entry is 2 for a two-sided p-value and 1 for a one-sided p-value.
In our case we get:

two-sided p-value = tdist(abs(−1.66666), 24, 2) = .10858.

Since .10858 > .05, we again would accept the null hypothesis that μ = 45.
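A small Python sketch of Methods 2 and 3 from these summary statistics (assuming scipy; t.sf gives the upper-tail probability, so doubling it matches EXCEL's two-sided tdist):

from math import sqrt
from scipy import stats

xbar, s, n, mu0 = 35.00, 30.00, 25, 45.00
t_obs = sqrt(n) * (xbar - mu0) / s                 # about -1.6667
p_two_sided = 2 * stats.t.sf(abs(t_obs), n - 1)    # about 0.1086
print(t_obs, p_two_sided)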
Testing Hypotheses About Proportions
Another common one sample problem deals with proportions. Suppose we
are concerned about the gender distribution of our middle managers. Assuming
that this position now requires an MBA at the entry level, and also assuming that
the proportion of men and women who are obtaining an MBA is approximately
50:50, does our work force reflect this gender distribution?
One might formulate this as a hypothesis in the following way. Let p
represent the probability of a middle manager being female. If we take a random
sample of n of our middle managers and determine the sample proportion of
females, say p̂ , does this value provide evidence that our proportion of female
employees differs from .5?
Let x be the number of females in the sample of size n and define:
p̂ = x / n .
Then as we have shown before, the sampling distribution of p̂ is approximately
normal (this requires np>5 and n(1 – p) >5) with
E(p̂) = p

SD(p̂) = √[ p(1 − p) / n ]
Based on this information we can formally test the hypothesis:
H0: p = p0
HA: p ≠ p0
where p0 = .5 in this specific case.
All four of our previous methods can be applied to this problem. For purposes of illustration we will use the example where n = 25, x = 10, p0 = .5, and α = .05.
Method 1, the quality control method, would yield the following 100(1 − α)% quality control limits:

P( p0 − z_{α/2} √[p0(1 − p0)/n] ≤ p̂ ≤ p0 + z_{α/2} √[p0(1 − p0)/n] ) = 1 − α

by an argument similar to what we developed in the case of the sample mean. This leads to the rule:

Accept H0 if p̂ is in the range p0 ± z_{α/2} √[p0(1 − p0)/n];
Reject otherwise.
In our case the limits become:

.5 ± 1.96 √(.5 × .5 / 25) = .5 ± .196

or

.304 to .696.

Since p̂ = 10/25 = .4 falls inside the interval, we accept the null hypothesis that p = .5.
Method 2 also directly applies to this situation. We would first compute z_obs using the formula:

z_obs = √n (p̂ − p0) / √[p0(1 − p0)]

We would accept the null hypothesis if:

−z_{α/2} ≤ z_obs ≤ z_{α/2}

otherwise we would reject the null hypothesis.
In our specific example, we have:

z_obs = √25 (.4 − .5) / √(.5 × .5) = −1.00 .

Since this falls within the limits of ±1.96 we accept the hypothesis.
Method 3, the p-value method, can also be applied. As in the case of the sample mean, we compute the two-sided p-value as:

two-sided p-value = 2*(1 − normsdist(abs(z_obs))).

In this case we get:

two-sided p-value = 2*(1 − normsdist(abs(−1))) = .317311
Since this value is greater than .05, we accept the null hypothesis.
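Methods 2 and 3 for the proportion can be sketched in a few lines of Python (assuming scipy; norm.sf is the upper-tail normal probability, so doubling it matches 2*(1 − normsdist(abs(z_obs)))):

from math import sqrt
from scipy import stats

x, n, p0 = 10, 25, 0.5
p_hat = x / n
z_obs = sqrt(n) * (p_hat - p0) / sqrt(p0 * (1 - p0))   # -1.0
p_two_sided = 2 * stats.norm.sf(abs(z_obs))            # about 0.317
print(z_obs, p_two_sided)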
Finally, we come to my preferred method, the confidence interval. It turns out, theoretically, that the formula for the exact confidence interval is somewhat complex; however, a very good approximation to the exact result is given by the formula:

p̂ − z_{α/2} √[p̂(1 − p̂)/n] ≤ p ≤ p̂ + z_{α/2} √[p̂(1 − p̂)/n]

This is equivalent to interchanging the roles of p̂ and p in the quality control formula.
In our case the confidence interval becomes:

.4 − 1.96 √(.4 × .6 / 25) ≤ p ≤ .4 + 1.96 √(.4 × .6 / 25)

which gives the interval:

.208 ≤ p ≤ .592
Since .5 is in the confidence interval we would accept the null hypothesis.
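A Python sketch of this approximate confidence interval (assuming scipy for the normal quantile):

from math import sqrt
from scipy import stats

x, n, alpha = 10, 25, 0.05
p_hat = x / n
z = stats.norm.ppf(1 - alpha / 2)                     # about 1.96
half_width = z * sqrt(p_hat * (1 - p_hat) / n)        # about 0.192
print(p_hat - half_width, "to", p_hat + half_width)   # roughly 0.208 to 0.592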
In the EXCEL file "onesam.xls", I have also included a template for the
computation of the approximate confidence interval for the population proportion.
By entering x, n, and α, the confidence interval is computed. In our case the result
looks like:
Template for Confidence Interval
Enter ==>    Sample Successes x = 10    Sample Size n = 25    Alpha = 0.05
Sample Proportion = 0.4
Confidence Interval is 0.2080 to 0.5920
A 99% confidence interval could be obtained by changing .05 to .01 with the
result:
Template for Confidence Interval
Enter ==>    Sample Successes x = 10    Sample Size n = 25    Alpha = 0.01
Sample Proportion = 0.4
Confidence Interval is 0.1476 to 0.6524
You may have noticed that these confidence intervals are quite wide. Just as in the case of the mean, there is little the manager can do to make the confidence intervals narrower other than to increase the sample size n. By doing so, we can determine the proportion to any desired precision.
Suppose we want to determine p to within ±W. That is:

p = p̂ ± W .

The confidence interval is given by:

p̂ ± z_{α/2} √[p̂(1 − p̂)/n]

therefore we will achieve the goal if:

z_{α/2} √[p̂(1 − p̂)/n] ≤ W

This leads to the equation:

n ≥ z²_{α/2} p(1 − p) / W²
As in the case of the mean, this result is problematic since it requires
knowledge of p to determine how large a sample we will need to determine p. This is
a classic case of circular reasoning.
However, we can take a pre-sample as we did in the case of the mean. Let us assume that we wish to determine p to within ±.025 (i.e. W = .025). Let us suppose that we wish to construct a 95% confidence interval so that z_{α/2} = 1.96. Now we already have a random sample of size 25 with an estimate of p as .4. Therefore, we would estimate the total sample size necessary as:

n ≥ (1.96)²(.4)(.6) / (.025)² = 1,475.17 ≈ 1,475

Since we already have 25, we would need to sample 1,450 more persons. Now once we have all 1,475, let us suppose that 578 are female, so that

p̂ = 578/1475 = .3919

Then our 95% confidence interval would be:

.3919 ± 1.96 √(.3919 × .6081 / 1475) = .3919 ± .0249

This gives a confidence interval of .367 to .417, which is very close to our desired accuracy.
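The pre-sample sample-size calculation can be sketched in Python as follows (assuming scipy; with z_{α/2} = 1.96 this reproduces the 1,475 above):

from scipy import stats

p_hat, W, alpha = 0.4, 0.025, 0.05
z = stats.norm.ppf(1 - alpha / 2)                # about 1.96
n_needed = z**2 * p_hat * (1 - p_hat) / W**2     # about 1,475
print(n_needed)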
Actually, in the case of estimating the sample size for a proportion, we are in a slightly better position than in estimating the mean, since we can get a worst-case estimate.
Notice that the numerator of the formula for n contains the term p(1 − p):

n ≥ z²_{α/2} p(1 − p) / W²

Since p is always between 0 and 1, one can plot the function p(1 − p) as shown below:
[Figure: Plot of p*(1 − p) against p, for p ranging from 0 to 1.]

Notice that this function reaches a maximum value of .25 = 1/4 when p = 1/2.
This means that the following inequality always holds:

p(1 − p) ≤ 1/4, so that z²_{α/2} p(1 − p) / W² ≤ z²_{α/2} / (4W²)

Therefore, if we choose n so that

n = z²_{α/2} / (4W²)

the value of n may be larger than necessary for any value of p, but it cannot be smaller!
In our particular case, the equation becomes:

n = (1.96)² / [4(.025)²] = 1,536.64 ≈ 1,537
This worst-case estimate does not require knowledge of p. In our case it would
require taking 1,537 – 1,475 = 62 more sample values than using the pre-sample
method. If the cost of an individual sample is not large, the worst-case analysis is
often used.
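The worst-case version, which replaces p(1 − p) with its maximum value of 1/4, can be sketched the same way (again assuming scipy):

from scipy import stats

W, alpha = 0.025, 0.05
z = stats.norm.ppf(1 - alpha / 2)
n_worst_case = z**2 / (4 * W**2)     # about 1,537, regardless of the true p
print(n_worst_case)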
Technical Point (Not Required):
As in the case of the sample mean, it could happen that the sample size chosen, n,
based on the above formulas could be greater than the population size N. In that case,
compute n' using the formula:

n' = n / (1 + n/N)

and use the following formula for the approximate confidence interval on p:

p̂ ± z_{α/2} √[p̂(1 − p̂)/n'] √(1 − n'/N)
Testing Hypotheses in Regression
Consider the following data:
Case   Height   Weight
  1       3      118
  2       4      128
  3       5      124
  4       6      144
  5       6      155
  6       6      138
  7       6      130
  8       7      138
  9       7      163
 10       8      162
 11       8      133
 12       8      142
 13       9      172
 14       9      154
 15       9      178
 16      10      150
 17      10      185
 18      10      175
 19      11      171
 20      13      200
This data represents a random sample of 20 high school boys, where their height is measured in inches above 5 feet ("3" = 5 foot 3 inches) and their weight is given in pounds.
The plot of the raw data is shown below:

[Figure: Raw Data Plot — weight (Y) plotted against height (X).]

Clearly a linear relationship seems to exist.
Running our regression program, as we did in Module I, yields the following
results:
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.851856311
R Square            0.725659175
Adjusted R Square   0.710418018
Standard Error      12.02911235
Observations        20

ANOVA
             df    SS            MS         F          Significance F
Regression    1    6889.408207   6889.408   47.61182   1.88207E-06
Residual     18    2604.591793   144.6995
Total        19    9494

             Coefficients   Standard Error   t Stat     P-value
Intercept    93.20950324    9.073001258      10.27328   5.89E-09
Height       7.714902808    1.118080521      6.900132   1.88E-06
This indicates that there is a correlation of r = .8519 between x and y. As you know, we square the value to interpret it, giving r² = .7257. Therefore approximately 72.57% of the variability in weight can be "explained" by using height as a predictor.
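If you are not working in EXCEL, the same regression can be reproduced with a short Python sketch (assuming scipy; linregress reports the slope, intercept, correlation, and the p-value for the slope):

from scipy import stats

height = [3, 4, 5, 6, 6, 6, 6, 7, 7, 8, 8, 8, 9, 9, 9, 10, 10, 10, 11, 13]
weight = [118, 128, 124, 144, 155, 138, 130, 138, 163, 162,
          133, 142, 172, 154, 178, 150, 185, 175, 171, 200]

fit = stats.linregress(height, weight)
print(fit.intercept, fit.slope)      # about 93.21 and 7.71
print(fit.rvalue, fit.rvalue**2)     # about 0.852 and 0.726
print(fit.pvalue)                    # about 1.88E-06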
You may have wondered: what value of r² is enough? I cannot answer that question. However, there is a test for the hypothesis that ρ, the population correlation between x and y, is equal to zero.
The formal statement of the hypothesis to be tested is:

H0: ρ = 0
HA: ρ ≠ 0

If you reject the null hypothesis and conclude that ρ ≠ 0, then the correlation is said to be statistically significant. If you accept the null hypothesis that ρ = 0, then the correlation is said to be not statistically significant.
Although it is possible to construct a confidence interval for ρ, the process is complex and approximate. Traditionally, only Methods 2 and 3 are used.
If you have all of the raw data, as in the case above, then one can test the hypothesis by simply running the regression and using Method 3, the p-value method. The required p-value appears in the regression output above, in the P-value column of the Height row (equivalently, the Significance F of the ANOVA table).
The observed correlation coefficient is r = .851856, and the two-sided p-value is .00000188. Since this is much lower than α = .05 or even α = .01, we would say that there is a statistically significant correlation between height and weight.
If you do not have the raw data, but only the actual value of the correlation coefficient r, then one can use Method 2, the t test method. Let us work with α = .01. The test statistic that will be used is given by the formula:

t_obs = r √(n − 2) / √(1 − r²)

which follows the t distribution with degrees of freedom = n − 2.
In our case n = 20, so the t distribution has df = 20 − 2 = 18 degrees of freedom.
The appropriate cut-off point, using the EXCEL function tinv, is:

t_{α/2} = tinv(.01, 18) = 2.878442

Then compute:

t_obs = (.851856) √(20 − 2) / √(1 − (.851856)²) = 6.900132

Since this value falls outside the range ±2.878442, we would reject the null hypothesis and conclude that there is a statistically significant correlation between height and weight.
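A Python sketch of this test, starting from r alone (assuming scipy; t.ppf plays the role of tinv and t.sf gives the tail probability):

from math import sqrt
from scipy import stats

r, n, alpha = 0.851856, 20, 0.01
t_obs = r * sqrt(n - 2) / sqrt(1 - r**2)           # about 6.90
t_cut = stats.t.ppf(1 - alpha / 2, n - 2)          # about 2.878
p_two_sided = 2 * stats.t.sf(abs(t_obs), n - 2)    # about 1.9E-06
print(t_obs, t_cut, p_two_sided)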
The choice of the English word "significant" is unfortunate, since the impression one has is that if something is significant it is important. Consider a situation where a random sample of size 400 is taken and the correlation computed between two variables based on this sample is .1. Suppose we work with α = .05 so that our cut-off points are ±1.96. Then the test statistic would be:

t_obs = (.1) √(400 − 2) / √(1 − (.1)²) = 2.005

Since this is outside the ±1.96 bounds, we would reject the null hypothesis and say that there is a statistically significant correlation between the two variables. However, all this means is that the correlation is probably not zero. It does not mean that the relationship is useful for forecasting!
In order to determine the practical use of any relationship, we still need to square r. In this case r² = (.1)² = .01. This indicates that only about 1% of the variability in the y variable can be explained by using x as a predictor, leaving almost 99% unexplained.
Whenever you hear someone claim that there is a statistically significant
correlation between two variables, remember that only means the correlation is
probably not zero. Ask what the value of r is and then square it to determine if
there is something of practical value in the relationship.
We can also test hypotheses about regression coefficients using the theory we have developed. Consider the regression model:

y_i = b0 + b1 x_{1i} + b2 x_{2i} + ..... + b_p x_{pi} + e_i

as we studied in Module I. EXCEL provides the p-values to test the hypothesis:

H0: b_i = 0
HA: b_i ≠ 0

In our current example, these values appear in the coefficient portion of the regression output, repeated below:
             Coefficients   Standard Error   t Stat     P-value
Intercept    93.20950324    9.073001258      10.27328   5.89E-09
Height       7.714902808    1.118080521      6.900132   1.88E-06
In this case both b0 and b1 are significantly different from zero even with alpha as low as .01.
In fact comparing the p-value to .05 formed the basis of the Backwards
Elimination procedure that we developed in the first module.
Structural Hypotheses
Assume your firm is in an area with only one major competitor. The
marketing department has determined that buyers fall into one of three groups.
They are either loyal buyers of your product, loyal to your competitor's product, or
opportunity buyers who will purchase from either of you depending on their whim.
Last year you held 60% of the market, your competitor held 25%, and 15%
of the market were opportunistic buyers. A recent survey gave the following
results:
Sample:
Your Company     3,946     60.9%
Opportunity        879     13.6%
Competitor       1,653     25.5%
Total            6,478    100.0%
Has there been a change?
Notice that in this situation there really is no "statistic" like the mean or the proportion to formulate a hypothesis for. Rather, the question is structural: does the data conform to a fixed pattern, in this case the market share distribution of last year?
If we can specify the structural pattern as a probability distribution, then a very useful test based on the Chi-Squared Distribution can often be used.
Formally we need the following set-up:
Category    Probability    Observed Number
   1            π1               x1
   2            π2               x2
   .            .                .
   .            .                .
   K            πK               xK
            __________      _____________
              1.00                n

Define

EXP_i = n π_i .
Now if EXPi >3.5 for each of the K categories, then the Chi-Squared Distribution with
K – 1 degrees of freedom can be used to test whether the observed data conforms to
the structure of the probability distribution.
Formally, the hypothesis being tested is:
H0 : Data Conforms To The Specified Probability Distribution
HA : Data Does Not Conform To The Specified Probability Distribution
Again notice that no specific value is specified in the hypothesis. This means that we
cannot approach this problem using the confidence interval approach since there is
nothing to put a confidence interval on.
The actual test statistic is:

χ²_obs = Σ_{i=1}^{K} (x_i − EXP_i)² / EXP_i = Σ_{i=1}^{K} (OBS_i − EXP_i)² / EXP_i

where OBS_i is the observed value in category i, i.e. x_i.
Notice that whether the observed value in a category differs from the expected value in a positive or a negative direction, it contributes a positive amount when the chi-square statistic is computed, since the difference between the observed and expected values is squared.
Accordingly, we shall use a one-sided p-value when testing these kinds of
structural hypotheses.
The Chi-Square distribution is a right-skewed distribution. There are two
functions in EXCEL associated with its use.
The first is
=chidist(value, degrees of freedom)
For the given value and degrees of freedom, this function will give us the one-sided
p-value of being greater than or equal to the observed value.
The second is

=chiinv(p, degrees of freedom).

This gives the value which has a probability p of being exceeded for a chi-square distribution with the given degrees of freedom.
Unfortunately, EXCEL does not perform the Chi-Square test directly.
However it is very easy to set up as shown below:
                Last Year      x        EXP      x−EXP    (x−EXP)^2   ((x−EXP)^2)/EXP
Your Company       60%       3,946    3,886.8     59.2     3504.64         0.90
Opportunity        15%         879      971.7    −92.7     8593.29         8.84
Competitor         25%       1,653    1,619.5     33.5     1122.25         0.69
Total             100%       6,478    6,478.0      0.0                    10.44

The "EXP" column is obtained by simply multiplying last year's probability for each of the categories by 6,478. The value of χ²_obs = 10.44 appears as the total of the last column.
I can get the p-value by using the EXCEL function Chidist as follows with
(3 –1) = 2 degrees of freedom:
one sided p-value = chidist( 10.44, 2) = .005412
Using either α = .05 or α = .01, we would reject the hypothesis that the data conform to last year's pattern, thus concluding that the pattern has changed.
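Although EXCEL has no single chi-square-test function, the whole calculation is one call in a Python sketch (assuming scipy; chisquare returns the statistic and the one-sided p-value with K − 1 degrees of freedom):

from scipy import stats

observed = [3946, 879, 1653]                 # your company, opportunity, competitor
last_year = [0.60, 0.15, 0.25]               # hypothesized structure
expected = [sum(observed) * p for p in last_year]

chi2, p_value = stats.chisquare(observed, expected)
print(chi2, p_value)                         # about 10.44 and 0.0054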
If one rejects a structural hypothesis, the next question is: where does the structure differ from what was hypothesized? An empirical procedure suggests that one look for cells where:

(OBS_i − EXP_i)² / EXP_i > 3.5
Examining the table above, the major source of change is in the Opportunity category, whose contribution to chi-square is 8.84. The data suggest that the Opportunity group is shrinking and that people are becoming more loyal customers of either your company or your competitor.
As another example, consider the pseudo-random numbers that we have
been using throughout this course. How could I test if they really are close to
random? One way is to see if they behave like random numbers and have an equal
probability of taking on any value between 0 and 1.
Below I have generated 100 random numbers:
0.103639  0.260748  0.785453  0.859472  0.069325  0.0674    0.91183   0.345949  0.142841  0.665597
0.8929    0.057035  0.68365   0.747171  0.788218  0.756522  0.378581  0.759218  0.729493  0.511406
0.814838  0.015725  0.149028  0.090306  0.050667  0.148827  0.295621  0.77713   0.054504  0.042992
0.381002  0.347441  0.33664   0.182379  0.305068  0.014612  0.255795  0.490767  0.561867  0.726386
0.739218  0.643481  0.810773  0.102478  0.814651  0.836788  0.502936  0.403509  0.49807   0.432307
0.02988   0.427461  0.355017  0.350722  0.279455  0.712044  0.375799  0.035562  0.302697  0.050712
0.073917  0.842951  0.817831  0.648994  0.896142  0.309024  0.545441  0.075328  0.758163  0.898943
0.665451  0.307257  0.83563   0.653133  0.734595  0.197073  0.935106  0.466816  0.996782  0.254434
0.148243  0.836328  0.127853  0.984712  0.219108  0.140383  0.988508  0.010277  0.126001  0.729527
0.601552  0.783166  0.934277  0.633395  0.673409  0.666446  0.888458  0.811172  0.603228  0.008411
Grouping the 100 numbers into the ranges 0–.10, .10–.20, ..., .80–.90, .90–1.00 results in the histogram below:
[Figure: Histogram of the 100 pseudo-random numbers — frequency by bin, with bins of width 0.1 from 0 to 1.]
Does this data conform to approximately 10% of the data in each bin?
The table below shows the computation as before:
Bins          Obs    Exp    Contrib
0 to 0.1       16     10      3.6
0.1 to 0.2     11     10      0.1
0.2 to 0.3      6     10      1.6
0.3 to 0.4     12     10      0.4
0.4 to 0.5      6     10      1.6
0.5 to 0.6      4     10      3.6
0.6 to 0.7     11     10      0.1
0.7 to 0.8     14     10      1.6
0.8 to 0.9     14     10      1.6
0.9 to 1        6     10      1.6
Total         100    100

Chi Square = 15.8
p-value = 0.071177
Conclusion: Not Significant
As can be seen, the p-value is .071177, which is the value given by chidist(15.8, 9). The result is not significant at the .05 level, so there is no reason to doubt that the pseudo-random numbers are behaving like genuine random numbers.
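The same uniformity check can be sketched in Python directly from the bin counts (assuming scipy):

from scipy import stats

observed = [16, 11, 6, 12, 6, 4, 11, 14, 14, 6]   # counts in the ten bins above
expected = [10] * 10                              # 100 numbers spread evenly over 10 bins
chi2, p_value = stats.chisquare(observed, expected)
print(chi2, p_value)                              # about 15.8 and 0.071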
Finally, note that in two of the cells the contribution to chi-square exceeds 3.5. If the result had been significant, we would have focused on these cells. However, we only look at the individual contributions if the overall result is significant!
In other words, we only look for deviations in individual categories if the overall pattern does not seem to conform to the hypothesized structure.