CLASS NOTES on SAMPLING DISTRIBUTION and Central Limit Theorem (CLT)

Why “Sample” the Population? Why not study the whole population?
• The physical impossibility of checking all items in the population.
• The cost of studying all the items in a population.
• The sample results are usually adequate.
• Contacting the whole population would often be time-consuming.
• The destructive nature of certain tests (e.g., study of light bulb life).

Statisticians advocate Probability Sampling (not judgment sampling)
• A probability sample is a sample selected in such a way that each item or person in the population being studied has a known likelihood of being included in the sample.
If we use judgment sampling, we have no idea about the accuracy of our estimates, since we have no idea about the quality of the judgments. Probability sampling enables us to construct probabilistic error bounds (to be studied in a second course in Statistics). The aim of sampling is to get a sample that is representative of the population.

Methods of Probability Sampling
• Simple Random Sample (SRS): A sample selected so that each item or person, and each subset of size n, in the population has the same chance of being included (e.g., when one item is drawn from N items, the probability that any particular item is selected is 1/N). A simple way to implement this is to use a lottery or a computer program. For example, we can write the names of the N items on N cards, shuffle the cards, and select n cards. This yields a simple random sample of size n.
• Systematic Random Sampling (SysRS): The items or individuals of the population are arranged in some order. A random starting point is selected (by lottery) and then every kth member of the population is selected.
If there are N=1000 stores along Fifth Avenue and we want to select n=100 stores for the sample, then k=N/n=10. We shuffle only the first k stores and select one, say #4. From then on we systematically select stores by adding k, 2k, 3k, 4k, etc. to 4. So the systematic sample will have stores #4, 14, 24, 34, 44, 54, etc.
• Stratified Random Sampling (StrRS): A population is first divided into subgroups, called strata, and a sample is selected from each stratum (e.g., 70% males, 30% females). If a sample of 10 is selected (n=10), 70% of n = 7, so select 7 males and 3 females. In general, let N=population size, N1=size of stratum 1 (females), N2=size of stratum 2 (males), and n=desired sample size. The sample should have (N1/N)*n items from stratum 1, and so on. Thus females in the sample = (N1/N)*n and males in the sample = (N2/N)*n.
Example: A population has 25 students, of whom 15 are white and 10 are black. How many whites and blacks should a stratified sample of size 10 have? Answer: Let N=population size=25, N1=blacks=10, N2=whites=15, n=sample size=10. Blacks in the sample = (N1/N)*n = (10/25)*10 = 4. Whites in the sample = (N2/N)*n = (15/25)*10 = 6. Verify that 6+4=10. We have a representative sample.
• Cluster Sampling: A population is first divided into clusters and a sample of the clusters is selected (used in marketing). It works if each cluster is about as heterogeneous as the population. For a large country like the US it is convenient to use cluster sampling and choose some geographical locations (e.g., Oshkosh, Wisconsin).

A sampling error is the difference between a sample statistic and its corresponding population parameter. We can make probabilistic statements about this sampling error only if we have a probability sample (not a judgment sample).

In general, a sampling distribution can be defined for any sample statistic (mean, median, mode, standard deviation, etc.). It is defined over a sample space consisting of all possible samples of size n from the available population of size N. Let us first study the sampling distribution of the sample mean as an example.
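Before turning to the sample mean, the systematic and stratified selections described above can be sketched in a few lines. This is only an illustrative Python sketch using the standard library. The function names `systematic_sample` and `proportional_allocation` are our own (not from the notes), the interval is taken as k = N // n on the assumption that n divides N, and rounding the stratum shares is also an assumption of the sketch.

```python
import random

def systematic_sample(N, n, start=None):
    """Return the 1-based item numbers of a systematic sample.

    N is the population size and n the desired sample size; the interval
    is k = N // n (assumes n divides N, as in the Fifth Avenue example).
    """
    k = N // n
    if start is None:
        start = random.randint(1, k)   # lottery among the first k items
    return list(range(start, N + 1, k))

def proportional_allocation(strata_sizes, n):
    """Allocate a sample of size n across strata as (N_i / N) * n.

    Rounding is an assumption of this sketch; when the shares are not
    whole numbers the rounded counts may not add up to n exactly.
    """
    N = sum(strata_sizes.values())
    return {name: round(size * n / N) for name, size in strata_sizes.items()}

stores = systematic_sample(1000, 100, start=4)   # start fixed at #4 as in the notes
print(stores[:6])                                # [4, 14, 24, 34, 44, 54]
print(proportional_allocation({"black": 10, "white": 15}, 10))
# {'black': 4, 'white': 6}
```

Leaving `start` unset draws the starting store at random from the first k stores, which is the lottery step of the method.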
Sampling Distribution of the Sample Mean
• The sampling distribution of the sample mean is a probability distribution consisting of all possible sample means for a specified sample size selected from the population. It gives the probability of occurrence associated with each possible value of the sample mean.

EXAMPLE 1
• The law firm of Hoya and Associates has five partners (A, B, C, D, E). At their weekly partners' meeting each reported the number of hours they charged clients for their services last week: A 22, B 26, C 30, D 26, E 22 (e.g., Mr. E charged 22 hrs).
• If n=2 partners are selected randomly, how many different samples are possible? This is the number of combinations of 5 objects taken 2 at a time. That is, 5C2 = 5!/(2!3!) = 10. There are 10 possible samples.
The ten sample means are given below. For example, if the sample has A and B, then A=22 and B=26 give the average Av(AB) = (22+26)/2 = 24.
Av(AB)=24, Av(AC)=26, Av(AD)=24, Av(AE)=22, Av(BC)=28, Av(BD)=26, Av(BE)=24, Av(CD)=28, Av(CE)=26, Av(DE)=24
Exercise: draw a picture with frequency on the vertical axis for the sampling distribution of means.
Note above that the mean of A and C is 26, of B and D is 26, and of C and E is also 26, which means that x̄=26 repeats three times (has frequency 3). We find the following list of frequencies:
x̄=22 with freq=1, x̄=24 with freq=4, x̄=26 with freq=3, x̄=28 with freq=2.
This is almost the sampling distribution of means. Total frequency = 10. If we divide the individual frequencies by the total frequency we get the “relative frequency” or probability. These probabilities add up to one, so we have a probability distribution. For example, the probability that the sample mean is 22 is 1 out of 10, or 0.1, and the probability that it is 28 is 2 out of 10, or 0.2. The sampling distribution is simply this probability distribution defined over all possible samples of size n from the population of size N. In real-world problems N will be large (e.g.,
200 million for the US population) and n will also be large (e.g., 1000 people surveyed), so the number of possible samples (N C n) will be astronomical. Then the sampling distribution can only be imagined. We have chosen a simple example with N=5, n=2 so that the entire sampling distribution can be explicitly computed and visualized. This is a sampling distribution of all possible sample means. Now the random variable is x̄; it is no longer just X.

What are the properties of the sampling distribution of sample means x̄? Properties include the mean and variance of x̄.
• Compute the mean of the sample means and compare it with the population mean. For our simple example we can explicitly calculate the mean of means, or expected value of means, E(x̄).
• The mean of the sample means is obtained by weighting each sample mean by its frequency: [(22)(1) + (24)(4) + (26)(3) + (28)(2)]/10 = 25.2 [Read page 214 of your text]
• Since we know the value of every observation in the population in this (impractical) simple example, we can directly calculate the population mean μ = (22+26+30+26+22)/5 = 25.2. Note that in the real world we usually cannot find μ; we can only make inferences about it from the sample mean x̄.
• Observe that the grand mean of all 10 sample means (25.2) is equal to the population mean (25.2).
• Since E(x̄) = μ, we say that the sample mean x̄ is an UNBIASED estimator of the population mean μ.
We verified this property above for the simple example of lawyer hours. In general, such verification is difficult and one needs to use advanced theory.

Now we turn to the variance of x̄. It is possible to verify intuitively that the larger the sample size, the smaller the variance. For example, if X is height (known to be a Normal random variable), suppose we want to estimate the average height of all Fordham students from a small sample of only 10 students. When we consider all possible samples we cannot rule out a sample of very tall people (e.g., all 10 from the Fordham basketball team who are, say, 7 ft tall).
For that sample the average height is about seven feet, so the upper limit of the range of sample averages will be about seven feet. Similarly, the average for the shortest 10 students will be smaller than five feet (say). Thus the averages based on n=10, from the smallest to the largest, will be spread over a wide range. Recall that a wide range means a large variance. By contrast, if we choose n=100, the average height for the tallest 100 will not be seven feet, but smaller. Similarly, the average height of the shortest 100 will be higher than that of the shortest 10, so the range for n=100 will not be as large as the range for n=10. Thus the spread of the sampling distribution decreases as n increases. In fact, the variance can be proved to be inversely proportional to n, as we see below.

Standard Error (SE) of the Sample Means
(The SE is the square root of the sampling variance, that is, a standard deviation. It is customary to distinguish between the usual standard deviation (SD) and that of a sampling distribution (SE).)
• The standard error of the sample means is the standard deviation of the sampling distribution of the sample means.
• n is the size of the sample.
• σ is the standard deviation of the population (assumed known).
• It is computed by σ_x̄ = σ/√n, as a first approximation if N is not known or N is large (almost infinite).
• σ_x̄ is the symbol for the standard error of the sample means.
• If σ is not known and n ≥ 30, the standard deviation of the sample, denoted by s, is used to approximate the population standard deviation. Then the formula for the standard error becomes SE(x̄) = s_x̄ = s/√n.
Always think of SE(x̄) as the standard deviation of the random variable x̄.

What is the shape of the probability distribution of x̄? The following theorem says that it is (approximately) Normal, and hence the theorem enables us to solve all kinds of practical problems.

Central Limit Theorem (CLT)
See page 391 of the Hawkes textbook. [“Central” means it is of central importance to Statistics.
“Limit theorem” because it studies the behavior as n becomes large, namely as n tends to infinity; in practice, n ≥ 30.] This powerful result, which the mathematician Pólya named the “central limit theorem” in 1920, shows that EVEN IF X is NOT NORMAL, if n ≥ 30 the process of averaging is so helpful that it yields approximate normality of the sampling distribution of x̄, with the variance given below.
• For a population with mean μ and variance σ², the sampling distribution of the means of all possible samples of size n generated from that population will be approximately normally distributed:
• x̄ ~ N{ μ, (σ²/n)[(N−n)/(N−1)] }, assuming sufficiently large n (n ≥ 30).
If N is large, the finite population correction term [(N−n)/(N−1)] is close to 1 and can be ignored. Then this formula simplifies to x̄ ~ N{ μ, σ²/n }.

Even if we start with a bimodal, exponential decay, or uniform distribution, which are decidedly not normal to begin with, the process of averaging gives us an approximately normal distribution for the sample mean, provided the sample size is at least 30. We may know that human intelligence or human height is normally distributed, but we have no reason to think that lawyers' hours are normally distributed. The central limit theorem says that as long as you are averaging over 30 or more lawyers, normality of the average can be assumed. This is very useful since we do not have to verify the underlying shape of the distribution.

A good practice example, which highlights the difference between the ordinary distribution of X and the sampling distribution of x̄ with separate word problems, follows.
IQ = X ~ N(110, 10²). Find P(IQ<80).
Intelligence Quotient (IQ) is normally distributed with mean 110 and standard deviation 10. A “moron” is a person with IQ less than 80. Find the probability that a randomly chosen person is a “moron”. (Hint: this random variable is for a single person, X.)
Let “idiot” be defined as a person with an IQ less than 90. Find the probability that a randomly chosen person is an “idiot”.
(Hint: this random variable is for a single person, X.)
If a sample of 25 students is available, what is the probability that the average IQ exceeds 105? (Hint: this random variable is for an average over 25 persons, or x̄.)
What is the probability that the average IQ exceeds 115? (Hint: this random variable is for an average over 25 persons, or x̄.)

Answers are given below.

X = IQ ~ N(110, 10²)
mean = μ = 110
standard deviation = sd = σ = 10
4 times sd = 4σ = 40
The plausible range of X has the lower limit μ−4σ = 110−40 = 70 and the upper limit μ+4σ = 110+40 = 150.
This corresponds to the plausible range of the standard normal z (−4 to 4).

EXERCISE 1: Given that X = IQ ~ N(110, 10²). If a “moron” has an IQ of 80 or less, find the probability that a randomly chosen person is a “moron”.
ANSWER 1: This is just a normal distribution word problem. In symbols, we want to find P(X<80). Recall that a probability is an area under the Normal bell-shaped curve. We want to evaluate the shaded area from −∞ to 80, that is, with lower limit −∞ and upper limit 80. The mapping of −∞ to the z scale is −4 for all practical purposes, hence we need not bother with the lower limit of the desired shaded area. We still need to map the upper limit 80 to the z scale by using the z transform z = (x−μ)/σ. For our upper limit x=80, with μ=110 and σ=10, we get z = (80−110)/10 = −3. When z=3, the area between 0 and 3 is 0.4987 from Table A of your text. The tail area is 0.5−0.4987, hence the answer is 0.0013. In R we compute pnorm(-3) to get 0.0013 for the left tail.

EXERCISE 2: X = IQ ~ N(110, 10²) is given. If an “idiot” has an IQ of 90 or less, find the probability that a randomly chosen person is an “idiot”. In symbols, find P(X<90).
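The left-tail calculation in Answer 1 can be checked numerically. The following is a minimal Python sketch using only the standard library, in which `statistics.NormalDist().cdf` plays the role that `pnorm` plays in R. The helper name `left_tail_prob` is our own and not part of the notes.

```python
from statistics import NormalDist

def left_tail_prob(x, mu, sigma):
    """P(X < x) for X ~ N(mu, sigma^2), computed via the z transform."""
    z = (x - mu) / sigma            # map x to the standard normal scale
    return NormalDist().cdf(z)      # area to the left of z, like pnorm(z) in R

p_80 = left_tail_prob(80, mu=110, sigma=10)   # here z = (80 - 110) / 10 = -3
print(round(p_80, 4))                         # 0.0013
```

The same helper works for any other cutoff on the IQ scale, since only the z value changes.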
ANSWER 2: For our upper limit x=90, with μ=110 and σ=10, mapping 90 to the z scale gives z = (90−110)/10 = −2. The tail area to the left of z=−2 is 0.5−0.4772 = 0.0228. In R we compute pnorm(-2) to get 0.0228 for the left tail.

EXERCISE 3: X = IQ ~ N(110, 10²) is given. Find the probability that the average IQ of 25 students exceeds 105.
ANSWER 3: Since the sample size n=25 is given, this is not a run-of-the-mill normal distribution word problem. The random variable under consideration here is the average. Hence a sampling distribution is relevant: the variable of interest is not the IQ of an individual student but the average IQ over 25 students. (Because X itself is normal, x̄ is exactly normal for any n, so n=25 < 30 causes no difficulty.)
Standard deviation of the sampling distribution = Standard Error = SE = σ/√n
n=25, √n = √25 = 5
SE = σ/√n = 10/5 = 2
4SE = 8
The plausible range for x̄ = average IQ is 110−8 to 110+8, or 102 to 118.
The area to the right of 105 is to be found, so we must map 105 to the z scale. The mapping now is z = (x̄−μ)/SE = (105−110)/2 = −2.5. The area between 0 and 2.5 is 0.4938. Total area = 0.5+0.4938 = 0.9938 = probability that the average IQ exceeds 105. In R we compute pnorm(-2.5,lower.tail=FALSE) to get 0.9938.
Now find the probability that the average IQ exceeds 115. This is the tail area to the right of z = (115−110)/2 = 2.5, which is 0.5−0.4938 = 0.0062. In R we compute pnorm(2.5,lower.tail=FALSE) to get 0.0062.

EXAMPLE 4: A library usually has 13% of its books checked out. Find the probability that in a sample of 588 books more than 14% are checked out.
ANSWER 4: We have percentages here, so it is not the simple normal distribution word problem. It uses the fact that p^ ~ N(p, pq/n) with q = 1−p, which says that the sampling distribution of the proportion p^ is (approximately) Normal with mean p and variance p(1−p)/n.
E(p^) = 0.13, n = 588
Var(p^) = σ²(p^) = (0.13)(1−0.13)/588 = 0.00019235
We need the square root of this variance for use in our z transform.
SE(p^) = sqrt(0.00019235) = 0.01386903 = 0.0139 (here we round to 4 places)
4*SE is 0.0556, so the plausible range is 0.13 ± 4*0.0139, or [0.0744 to 0.1856].
We want the probability that in a sample of 588 books more than 14% are checked out. Hence the desired area is to the right of 0.14, which is itself to the right of the center at 0.13. In symbols, we want to compute P(p^ > 0.14). Now let us apply the z transform to both sides of the inequality:
P(p^ > 0.14) = P(z > (0.14 − 0.13)/SE), so we have to compute P(z > 0.7194) = P(z > 0.72).
We must round to 2 places to the right of the decimal since z tables are set up that way. We want the tail area, but we can look up only the area from 0 to 0.72, which is 0.2642. The tail area is 0.5 MINUS 0.2642, or ANS = 0.2358.
> pnorm(.72,lower.tail=FALSE)
[1] 0.2357625

Copyright: Hrishikesh D. Vinod
Last updated 4/29/17 6:38 PM
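The two sampling-distribution calculations in these notes (the average IQ of 25 students, and the proportion of 588 library books checked out) follow one pattern: compute the standard error, then take a right-tail normal area. The following is a hedged Python sketch using only the standard library. The function names `prob_mean_exceeds` and `prob_proportion_exceeds` are our own, and the sketch assumes that the normal model for x̄ and p^ applies.

```python
from math import sqrt
from statistics import NormalDist

def prob_mean_exceeds(c, mu, sigma, n):
    """P(xbar > c) when xbar ~ N(mu, sigma^2 / n)."""
    se = sigma / sqrt(n)                     # standard error of the mean
    return 1 - NormalDist(mu, se).cdf(c)     # right-tail area

def prob_proportion_exceeds(c, p, n):
    """P(p_hat > c) using the normal model p_hat ~ N(p, p(1-p)/n)."""
    se = sqrt(p * (1 - p) / n)               # standard error of the proportion
    return 1 - NormalDist(p, se).cdf(c)

print(round(prob_mean_exceeds(105, 110, 10, 25), 4))   # 0.9938
print(round(prob_mean_exceeds(115, 110, 10, 25), 4))   # 0.0062
print(prob_proportion_exceeds(0.14, 0.13, 588))
```

The last value comes out near 0.235, slightly below the table answer 0.2358, because the sketch does not round SE to 0.0139 or z to 0.72 before taking the tail area.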