STP231 Brief Class Notes, Instructor: Ela Jackiewicz

Chapter 9 Categorical Data: One-Sample Distributions

Estimating population proportion:

In the first part of this chapter we will consider a dichotomous categorical variable (2 classes: A, not A) in a large population. We will discuss the sampling distribution of an estimate of the population proportion p = P(A). Suppose we take a random sample of size n and let y = # of subjects with characteristic A in our sample. Then we can estimate p by using two different sample statistics:

Ordinary sample proportion (p-hat): $\hat{p} = \frac{y}{n}$

Wilson-adjusted proportion (p-tilde): $\tilde{p} = \frac{y+2}{n+4}$, which gives CIs that are more reliable than those based on p-hat.

We will only use p-tilde in our computations.

Sampling Distribution of $\tilde{p}$

Ex1. Suppose a certain population has 39% mutants, so p = population proportion of mutants = 0.39. If we take a random sample of 6 individuals from that population, obtain the sampling distribution of $\tilde{p}$.

Let Y = # of mutants in our sample. Y has a binomial distribution with n = 6 and p = 0.39. The possible values of Y are 0 to 6, and the probability of each value can be computed using the binomial model. Values are displayed in the table below:

Y    probability                      $\tilde{p}$
0    binompdf(6, 0.39, 0) = 0.0515    (0+2)/(6+4) = 0.2
1    binompdf(6, 0.39, 1) = 0.1976    (1+2)/(6+4) = 0.3
2    binompdf(6, 0.39, 2) = 0.3159    (2+2)/(6+4) = 0.4
3    binompdf(6, 0.39, 3) = 0.2693    (3+2)/(6+4) = 0.5
4    binompdf(6, 0.39, 4) = 0.1291    (4+2)/(6+4) = 0.6
5    binompdf(6, 0.39, 5) = 0.0330    (5+2)/(6+4) = 0.7
6    binompdf(6, 0.39, 6) = 0.0035    (6+2)/(6+4) = 0.8

We can use the above probability distribution to assess the following probabilities:

a) Probability that $\tilde{p}$ will estimate p within 4 percentage points:
$P(0.39 - 0.04 \le \tilde{p} \le 0.39 + 0.04) = P(0.35 \le \tilde{p} \le 0.43) = P(\tilde{p} = 0.4) = 0.3159$

b) Probability that $\tilde{p}$ will overestimate p by more than 5 percentage points:
$P(\tilde{p} \ge 0.39 + 0.05) = P(\tilde{p} \ge 0.44) = 0.2693 + 0.1291 + 0.0330 + 0.0035 = 0.4349$

c) What is the % of samples for which $\tilde{p}$ will overestimate p by more than 5 percentage points?
The answer is 43.49%, the same as in part b).

As n increases, the sampling distribution of $\tilde{p}$ becomes more compressed around the value p = 0.39, so the probability that $\tilde{p}$ is within ±4 percentage points of p becomes greater, and overestimating p by 5 or more percentage points using $\tilde{p}$ becomes less likely.

For large n, the sampling distribution of $\tilde{p}$ is approximately normal, with mean p and standard deviation $\sqrt{\frac{p(1-p)}{n+4}}$. We will use that fact in constructing a CI for p. The approximation gets better with increasing n.

95% Confidence Interval for p = unknown population proportion.

Standard error of $\tilde{p}$: $SE_{\tilde{p}} = \sqrt{\frac{\tilde{p}(1-\tilde{p})}{n+4}}$

95% CI for p: $\tilde{p} \pm 1.96\, SE_{\tilde{p}}$

We can use this CI if the sample size is at least 5.
__________________________________________________________________________________
Optional: for other confidence levels,

$\tilde{p} = \frac{y + 0.5\, z_{\alpha/2}^2}{n + z_{\alpha/2}^2}$, $SE_{\tilde{p}} = \sqrt{\frac{\tilde{p}(1-\tilde{p})}{n + z_{\alpha/2}^2}}$,

and the $(1-\alpha) \cdot 100\%$ CI for p is $\tilde{p} \pm z_{\alpha/2}\, SE_{\tilde{p}}$.
__________________________________________________________________________________

Sample size considerations for a desired standard error:

Selecting sample size: $n = \frac{(\text{Guessed } \tilde{p})(1 - \text{Guessed } \tilde{p})}{(\text{Desired SE})^2} - 4$, rounded up to the next integer.

If no suitable guess is available for $\tilde{p}$, use 50% (0.5).

Ex2. Gene mutations have been found in patients with MD. In one study of patients with MD, 23 out of 180 patients had a certain defect in the gene coding.

a. Construct and interpret a 95% CI for the true proportion of all patients with MD who have that defect.

Answer: $\tilde{p} = 25/184 = 0.1359$, $SE_{\tilde{p}} = \sqrt{\frac{0.1359(1-0.1359)}{184}} = 0.0253$
$0.1359 \pm 1.96(0.0253)$ gives $0.1359 \pm 0.0496$, CI: (.0863, .1855)
We have 95% confidence that the true proportion of MD patients with that gene mutation is in the above interval.

b. Compute the sample size needed for the standard error to be cut in half; assume a reasonable guess for p is 0.14.
Answer: SE = .0253, .5(SE) = .01265, $n = \frac{0.14(1-0.14)}{0.01265^2} - 4 = 748.39$, so n = 749.

Inference for proportions: Goodness-of-Fit Test:

In this part of the chapter we will consider one categorical variable with k categories, not necessarily dichotomous. The distribution of that variable in a random sample is compared to a specified fixed distribution, and the hypotheses are:

$H_0$: The variable has the specified distribution (the probability $p_i$ in each category i is specified)
$H_a$: The variable does not have the specified distribution (the probability $p_i$ in some or all categories is not as specified)

O = observed counts are the counts of sampled observations in each category.
E = expected counts are: $E_i = n p_i$
We assume that all E are 1 or greater, and all E are > 5.

Test statistic: $\chi^2_s = \sum \frac{(O-E)^2}{E}$ has a chi-square distribution with k-1 degrees of freedom (under the null hypothesis), where k = number of categories.

P-value: To obtain the P-value (P) of a hypothesis test, we compute, assuming the null hypothesis is true, the probability of observing a value of the test statistic as extreme or more extreme than that observed. By extreme we mean far from what we would expect to observe if the null hypothesis were true.

P-value = area to the right of the observed test statistic under the chi-square curve with df = k-1.

Note: Our alternative is nondirectional, but in the case of a dichotomous variable we can also have a directional hypothesis. We can test a hypothesis specifying that the probability in one category is smaller/larger than the null value. We have to check first that the data deviate in the direction stated by the alternative. In that case the P-value as computed above is divided by 2. Check example #2.

Ex1. The offspring produced by a cross between two given types of plants can be any one of three genotypes A, B or C. A simple inheritance model suggests that the offspring of types A, B and C should be in a ratio 1:2:1 respectively.
An experiment was conducted in which 100 plants were bred by crossing the two parent types. The genetic classifications of the offspring are recorded below. Do these data support the hypothesis that the offspring follow the predicted ratio? Test using $\alpha = .05$.

Genotype:                  A     B     C
O = observed frequency:    18    55    27

This is a GOF test, variable = genotype of offspring, 3 classes.
Notice that if the predicted ratio is a:b:c, then $p_1 = \frac{a}{a+b+c}$, $p_2 = \frac{b}{a+b+c}$ and $p_3 = \frac{c}{a+b+c}$.

$H_0$: $p_1 = 1/4$, $p_2 = 1/2$, $p_3 = 1/4$ (data follow the predicted ratio)
$H_a$: not all probabilities are as stated in the null hypothesis (data do not follow the predicted ratio)

E = 25, 50 and 25, $\chi^2_s = \frac{(18-25)^2}{25} + \frac{(55-50)^2}{50} + \frac{(27-25)^2}{25} = 2.62$, df = 2
p-value = χ²cdf(2.62, 10^6, 2) = .27 > .05
Do not reject the null; the data are consistent with the hypothesis that the offspring follow the predicted ratio.

Ex2. People who harvest wild mushrooms sometimes accidentally eat the toxic ones. In reviewing 205 European cases of mushroom poisoning from 1971 through 1980, researchers found that 45 of the victims had died. Does this present evidence that mortality has decreased since 1970, when it was recorded to be 30%? Use a chi-square test with an appropriate directional hypothesis and $\alpha = .05$.

GOF test again. We have 1 variable, status after eating toxic mushrooms (Dead or Alive), and we compare its distribution after 1970 to the fixed distribution P(dead) = 0.3, P(alive) = 0.7 as recorded in 1970.

Let p = proportion of deaths since 1970. Our hypotheses are then: $H_0$: p = 0.3 vs $H_a$: p < 0.3.
We can have a directional hypothesis here, since there are only 2 classes and 45/205 = .22 < 0.3, so the directionality is correct. We can, but do not have to, specify both probabilities.

We have:
      Dead              Alive
O:    45                160
E:    .3(205) = 61.5    .7(205) = 143.5

$\chi^2_s = \frac{(45-61.5)^2}{61.5} + \frac{(160-143.5)^2}{143.5} = 6.32$
p = (1/2) * χ²cdf(6.32, 10^6, 1) = .5(.012) = .006 < .05
Reject $H_0$; there is evidence that mortality has decreased since 1970.

Ex3.
In a study of spatial orientation of certain fish, 50 individuals were caught in various locations and later tested in an artificial pool to see which direction they would choose when released. Use the following data and a chi-square test to test the null hypothesis that the directional choice of these fish is random. Use $\alpha = .05$.

Directional choice:     # of fish = O
Toward shore            18
Away from shore         12
Along shore (right)     13
Along shore (left)      7

GOF test.
$H_0$: $p_1 = p_2 = p_3 = p_4 = .25$, i.e. directions are randomly selected (all equally likely)
$H_a$: not all $p_i$ are as stated in the null hypothesis (selections are not random, some directions are preferred over others)

E = np = .25(50) = 12.5 for each category
$\chi^2_s = 4.88$, p = χ²cdf(4.88, 10^6, 3) = 0.180; do not reject $H_0$, no evidence that the choices are not random.

Ex4. In 2000, workplace accidents were distributed on workdays as follows:

Day:    Monday    Tuesday    Wednesday    Thursday    Friday
%:      25        15         15           15          30

In 2005, a random sample of 120 workplace accidents yielded the following data:

Day          Number of accidents = O    E = expected number of accidents under H0
Monday       33                         .25(120) = 30
Tuesday      20                         .15(120) = 18
Wednesday    12                         .15(120) = 18
Thursday     17                         .15(120) = 18
Friday       38                         .3(120) = 36

Do the data present sufficient evidence to indicate that the distribution of workplace accidents in 2005 differs from the 2000 distribution? Test the appropriate hypotheses by means of a chi-square test and $\alpha = .05$.

$H_0$: $p_1 = .25$, $p_2 = p_3 = p_4 = .15$, $p_5 = .30$, i.e. the distributions are the same in both years
$H_a$: not all $p_i$ are as stated in the null hypothesis, i.e. the distributions are different
This is again a GOF test.
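The goodness-of-fit computation for the accident data can also be checked away from the calculator. The following is a minimal Python sketch, not part of the original notes. The function names `chi2_sf` and `gof_test` are hypothetical, and the right-tail area is obtained from a standard recurrence that is valid for whole-number degrees of freedom, using only the standard library.

```python
import math

def chi2_sf(x, df):
    """Area to the right of x under the chi-square curve (integer df >= 1).

    Starts from the closed forms for df = 1 and df = 2 and applies
    Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1).
    """
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        tail, k = math.exp(-x / 2), 2                 # df = 2 closed form
    else:
        tail, k = math.erfc(math.sqrt(x / 2)), 1      # df = 1 closed form
    while k < df:
        tail += (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
        k += 2
    return tail

def gof_test(observed, probs):
    """Return (chi-square statistic, df, nondirectional P-value)."""
    n = sum(observed)
    expected = [n * p for p in probs]                 # E_i = n * p_i
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1                            # k - 1 categories
    return stat, df, chi2_sf(stat, df)

# Accident counts for Monday..Friday in 2005 and the 2000 proportions (H0).
observed = [33, 20, 12, 17, 38]
probs = [0.25, 0.15, 0.15, 0.15, 0.30]

stat, df, p_value = gof_test(observed, probs)
print(f"chi-square = {stat:.2f}, df = {df}, P-value = {p_value:.3f}")
```

Replacing `observed` and `probs` with the counts and null probabilities of another example (for instance `[18, 55, 27]` with `[0.25, 0.5, 0.25]`) runs the same test. For a directional alternative with two categories, the returned P-value would be halved after checking the direction, as described in the note on directional hypotheses.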
$\chi^2_s = 2.69$, df = 4, p = .611, so do not reject $H_0$; no evidence that the distributions are different.

Using Calculator (TI 83, 84)

1-Proportion Z interval: use the STAT menu, then TESTS; option A is 1-PropZInt. It uses the p-hat method; just input x and n. If we want a 95% CI using p-tilde, we can input x+2 in place of x and n+4 in place of n. For other confidence levels this will not work.

Chi-square GOF Test (only newer calculators):
1. Place the observed and expected frequencies on 2 different lists (STAT EDIT option).
2. Use χ²GOF-Test; make sure to set the appropriate degrees of freedom. The P-value computed by the test is for a nondirectional alternative.

Alternatively, if you have an older TI: STAT EDIT, input O on L1 and E on L2, then compute the test statistic as follows: (L1-L2)^2/L2, STO L3, 1-Var Stats L3; test statistic = ∑x.
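The 1-PropZInt workaround only reproduces the p-tilde interval at the 95% level. As a hedged sketch in Python (standard library only), the interval formulas and the sample-size rule from the first part of the chapter could be coded as follows. The function names `wilson_ci_95`, `wilson_ci` and `sample_size` are illustrative and do not come from the notes; `wilson_ci` assumes the optional general formula with $z_{\alpha/2}$, so at 95% it differs slightly from the y+2, n+4 version.

```python
import math
from statistics import NormalDist

def wilson_ci_95(y, n):
    """95% interval as in the notes: p~ = (y + 2)/(n + 4), SE uses n + 4."""
    p_tilde = (y + 2) / (n + 4)
    se = math.sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    return p_tilde, se, (p_tilde - 1.96 * se, p_tilde + 1.96 * se)

def wilson_ci(y, n, confidence):
    """General version from the optional formula, using z_{alpha/2}."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_tilde = (y + 0.5 * z ** 2) / (n + z ** 2)
    se = math.sqrt(p_tilde * (1 - p_tilde) / (n + z ** 2))
    return p_tilde, se, (p_tilde - z * se, p_tilde + z * se)

def sample_size(guessed_p, desired_se):
    """n = guess * (1 - guess) / SE^2 - 4, rounded up to the next integer."""
    return math.ceil(guessed_p * (1 - guessed_p) / desired_se ** 2 - 4)

# MD example from the notes: 23 of 180 patients had the defect.
p_tilde, se, (low, high) = wilson_ci_95(23, 180)
print(f"p-tilde = {p_tilde:.4f}, SE = {se:.4f}, 95% CI = ({low:.4f}, {high:.4f})")

# Same data at the 90% level, which the calculator trick does not cover.
print(wilson_ci(23, 180, 0.90))

# Sample size for half the standard error, with a guess of 0.14.
print(sample_size(0.14, 0.01265))
```

For the MD data the 95% interval should land close to (.0863, .1855); small differences in the fourth decimal come from the rounding used in the hand calculation.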