1 Descriptive Statistics

Suppose we have two data sets x and y.

\[ x = \{x_1, x_2, x_3, x_4, x_5\} = \{3, 6, 7, 4, 1\} \]
\[ y = \{y_1, y_2, y_3, y_4, y_5, y_6\} = \{8, 2, 1, 5, 4, 7\} \]

The sample mean, $\bar{x}$, is the arithmetic average of the observed values. It is a measure of central location. Add up all the observed values and divide by the number of observations:

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

So, for the data set x, where n = 5, we have

\[ \bar{x} = \frac{1}{5} \sum_{i=1}^{5} x_i = \frac{1}{5}(3 + 6 + 7 + 4 + 1) = \frac{21}{5} = 4.2. \]

For the data set y, where n = 6, we have

\[ \bar{y} = \frac{1}{6} \sum_{i=1}^{6} y_i = \frac{1}{6}(8 + 2 + 1 + 5 + 4 + 7) = \frac{9}{2} = 4.5. \]

The sample variance, $s^2$, is the sum of the squared deviations from the mean divided by the number of observations minus 1. It is a measure of variability of the data about the mean.

\[ s_x^2 = \left(\frac{1}{n-1}\right) \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

The sample variance will always be zero or positive. It will be zero if and only if every observed value equals the mean. For the data set x, we have

\[
\begin{aligned}
s_x^2 &= \left(\frac{1}{5-1}\right) \sum_{i=1}^{5} (x_i - 4.2)^2 \\
&= \frac{1}{4}\left( (3 - 4.2)^2 + (6 - 4.2)^2 + (7 - 4.2)^2 + (4 - 4.2)^2 + (1 - 4.2)^2 \right) \\
&= \frac{1}{4}\left( (-1.2)^2 + (1.8)^2 + (2.8)^2 + (-0.2)^2 + (-3.2)^2 \right) \\
&= \frac{1}{4}(1.44 + 3.24 + 7.84 + 0.04 + 10.24) \\
&= 5.7
\end{aligned}
\]

The sample standard deviation is the square root of the sample variance. It also measures the variability of the data around the mean, but in terms of the original units. If the variable were measured in inches, the variance would be in inches squared but the standard deviation would be in inches.

\[ s_x = \sqrt{s_x^2} = \sqrt{\left(\frac{1}{n-1}\right) \sum_{i=1}^{n} (x_i - \bar{x})^2} \]

For the data set x, we have

\[ s_x = \sqrt{s_x^2} = \sqrt{5.7} \approx 2.38747. \]

Some of the descriptive statistics rely on having the data ordered from smallest to largest observations. For our data sets x and y, we get

\[ x = \{x_{(1)}, x_{(2)}, x_{(3)}, x_{(4)}, x_{(5)}\} = \{1, 3, 4, 6, 7\} \]
\[ y = \{y_{(1)}, y_{(2)}, y_{(3)}, y_{(4)}, y_{(5)}, y_{(6)}\} = \{1, 2, 4, 5, 7, 8\} \]

The minimum of a set of data is the smallest observed value (the first observation when the observations are arranged in ascending order).
For the data set x, the smallest observed value is 1. For the data set y, the smallest observed value is 1.

The maximum of a set of data is the largest observed value (the last observation when the observations are arranged in ascending order). For the data set x, the largest observed value is 7. For the data set y, the largest observed value is 8.

The range of a set of data is the difference between the maximum and minimum values. For the data set x, the range is 7 − 1 = 6. For the data set y, the range is 8 − 1 = 7.

The median of a set of data is a value such that half of the observations are smaller and half of the observations are larger. Like the mean, the median is a measure of central location.

If the number of observations, n, is odd, the median is the middle value in the ordered data. This will be the ordered observation number $(n + 1)/2$. For the data set x, there are 5 observations, so the median is the middle observation (which is ordered observation $(5 + 1)/2 = 3$). That is, the median is $x_{(3)} = 4$.

If the number of observations, n, is even, the median will be the average of the two middle observations. This will be the average of the ordered observations numbered $\frac{n}{2}$ and $\frac{n}{2} + 1$. For the data set y, there are 6 observations, so the median is

\[ \frac{1}{2}\left(y_{(3)} + y_{(4)}\right) = \frac{1}{2}(4 + 5) = 4.5. \]

The first quartile ($Q_1$, lower quartile, LQ) of a set of data is a value such that one-fourth of the observations are smaller and three-fourths of the observations are larger. The first quartile is the median of the ordered data created from the original data by deleting any observations greater than or equal to the original median. For the data set x this would leave {1, 3}, which has median $\frac{1}{2}(1 + 3) = 2$. That is, for the data set x, $Q_1$ is 2. For the data set y this would leave {1, 2, 4}, which has median 2. For the data set y, $Q_1$ is 2.
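The mean, variance, standard deviation, range, and median defined so far can be checked numerically. The following is a minimal Python sketch; the helper name `describe` and the dictionary keys are illustrative choices, not part of these notes.

```python
import math

x = [3, 6, 7, 4, 1]
y = [8, 2, 1, 5, 4, 7]

def describe(data):
    """Return the summary measures defined above for a list of numbers."""
    n = len(data)
    mean = sum(data) / n
    # Sample variance: squared deviations from the mean, divided by n - 1.
    variance = sum((v - mean) ** 2 for v in data) / (n - 1)
    ordered = sorted(data)
    mid = n // 2
    # Odd n: the middle ordered value. Even n: average of the two middle values.
    if n % 2 == 1:
        median = ordered[mid]
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2
    return {
        "mean": mean,
        "variance": variance,
        "std_dev": math.sqrt(variance),
        "min": ordered[0],
        "max": ordered[-1],
        "range": ordered[-1] - ordered[0],
        "median": median,
    }

summary_x = describe(x)
summary_y = describe(y)
for name, summary in (("x", summary_x), ("y", summary_y)):
    print(name, {key: round(value, 5) for key, value in summary.items()})
```

For x this should agree with the hand calculation: mean 4.2, variance 5.7, and standard deviation about 2.38747.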
The third quartile ($Q_3$, upper quartile, UQ) of a set of data is a value such that one-fourth of the observations are larger and three-fourths of the observations are smaller. The third quartile is the median of the ordered data created from the original data by deleting any observations less than or equal to the original median. For the data set x this would leave {6, 7}, which has median $\frac{1}{2}(6 + 7) = 6.5$. That is, for the data set x, $Q_3$ is 6.5. For the data set y this would leave {5, 7, 8}, which has median 7. For the data set y, $Q_3$ is 7.

The second quartile ($Q_2$) is the median.

The 5-number summary of a set of data is the collection of the minimum, $Q_1$, $Q_2$, $Q_3$, and the maximum of the observed values.

The interquartile range (IQR) is given by $IQR = Q_3 - Q_1$. For the data set x, the IQR is 6.5 − 2 = 4.5. For the data set y, the IQR is 7 − 2 = 5.

An observation is considered an outlier if its value departs significantly from the rest of the data. For this class consider an observation to be an outlier if:
(i) it is greater than $Q_3 + 1.5\,IQR$,
(ii) it is less than $Q_1 - 1.5\,IQR$.

2 More data and graphs

Suppose our data set x is as follows:

{26, 33, 28, 25, 30, 28, 25, 29, 28, 30, 31, 32, 35, 31, 28, 35, 27, 33, 35, 29, 28, 31, 34, 32, 33, 30, 28, 37, 37, 41, 31, 29, 33, 33, 23, 28, 27, 29, 42, 28}.

Sorted in ascending order we have:

{23, 25, 25, 26, 27, 27, 28, 28, 28, 28, 28, 28, 28, 28, 29, 29, 29, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 33, 33, 33, 33, 33, 34, 35, 35, 35, 37, 37, 41, 42}

It can be verified that

\[
\begin{aligned}
\bar{x} &= 30.8 \\
s_x^2 &= 16.78 \\
s_x &= \sqrt{s_x^2} = 4.10 \\
\min &= 23 \\
Q_1 &= 28 \\
Q_2 \ (\text{median}) &= 30 \\
Q_3 &= 33 \\
\max &= 42 \\
\text{range} &= 42 - 23 = 19 \\
IQR &= Q_3 - Q_1 = 33 - 28 = 5
\end{aligned}
\]

So, any values less than $Q_1 - 1.5\,IQR = 28 - (1.5)(5) = 20.5$ will be considered outliers. Any values greater than $Q_3 + 1.5\,IQR = 33 + (1.5)(5) = 40.5$ will be considered outliers. There are then 2 outliers in the data: $x_{(39)} = 41$ and $x_{(40)} = 42$.
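The quartile rule and the outlier fences above can be sketched in Python as follows. The function names (`median`, `five_number_summary`, `outliers`) are hypothetical helpers written for this illustration. Note that the quartile rule coded here is the one stated in these notes; statistical software often uses other quartile conventions and may return slightly different values.

```python
data = [26, 33, 28, 25, 30, 28, 25, 29, 28, 30, 31, 32, 35, 31, 28, 35, 27, 33, 35, 29,
        28, 31, 34, 32, 33, 30, 28, 37, 37, 41, 31, 29, 33, 33, 23, 28, 27, 29, 42, 28]

def median(values):
    """Middle ordered value (odd n) or average of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (min, Q1, Q2, Q3, max) using the deletion rule from these notes."""
    q2 = median(values)
    # Q1: median after deleting observations >= the original median.
    q1 = median([v for v in values if v < q2])
    # Q3: median after deleting observations <= the original median.
    q3 = median([v for v in values if v > q2])
    return min(values), q1, q2, q3, max(values)

def outliers(values):
    """Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, in ascending order."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return sorted(v for v in values if v < low_fence or v > high_fence)

print(five_number_summary(data))
print(outliers(data))  # -> [41, 42]
```

Applied to the small data sets from Section 1, the same functions should reproduce $Q_1 = 2$, $Q_3 = 6.5$ for x and $Q_1 = 2$, $Q_3 = 7$ for y.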
A histogram is a chart that shows the counts of observations that have been aggregated into intervals (bins, classes, ...). The following is a histogram of the data with bins 22.5 ≤ x < 25, 25 ≤ x < 27.5, ..., 40 ≤ x < 42.5.

[Figure: histogram of the data. Horizontal axis marked at 25, 30, 35, 40; vertical axis (counts) marked from 2 to 12.]

There are a number of things to watch out for when interpreting a histogram. First, be aware of how the endpoints of intervals (bins, classes, ...) are treated. For this histogram, values falling exactly on the left endpoint of an interval are included but not those on the right endpoint. The first bar is of height one as it counts the value of 23 but not 25.

Unless n = 100 or you are given a relative frequency histogram, heights of the bars are not percentages. Here, n = 40, so the percent of observations strictly less than thirty is

\[ \frac{1 + 5 + 12}{40} = \frac{18}{40} = 45\%. \]

Hence 55% are greater than or equal to thirty.

This histogram is right-skewed (right-tailed). For a skewed distribution, the median is often a better measure of central location than the mean. A single extreme value can greatly affect the mean but have little effect on the median and related statistics.

A boxplot (box-and-whisker plot) puts the 5-number summary into graphical form.

[Figure: boxplot of the data marking the minimum (23), LQ (28), median (30), UQ (33), and maximum (42).]

3 Normal

The normal curve is symmetric about the mean. Hence the mean and median are the same. Two parameters define any normal variable, the mean µ and variance σ² (or the standard deviation σ). The proportion of values to the right of the mean (median) is one-half, as is the proportion to the left of the mean.

Typically we are interested in finding out proportions from one of three types of regions. If X is normal with mean µ and standard deviation σ, and a and b are any real numbers, we can ask

\[ P(X < a) \qquad P(X > b) \qquad P(a < X < b) \]

We only have (only need) tables for the standard normal random variable. A standard normal variable has mean zero and variance one.
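As a stand-in for a printed standard normal table, the proportion to the left of a value z can be computed from the error function, using the identity $\Phi(z) = \frac{1}{2}\left(1 + \operatorname{erf}(z/\sqrt{2})\right)$. This is only a sketch: the function name `phi` is an illustrative choice, and the notes themselves rely on tables rather than code.

```python
import math

def phi(z):
    """Proportion of a standard normal distribution lying to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# By symmetry, half of the distribution lies to the left of the mean (z = 0).
print(phi(0.0))             # -> 0.5

# Typical table entries, rounded to four decimal places as in a printed table.
print(round(phi(-1.0), 4))  # -> 0.1587
print(round(phi(-2.0), 4))  # -> 0.0228
```

A right-tail proportion is one minus the left-tail value, for example `1 - phi(-2.0)`.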
For X normal with mean µ and standard deviation σ,

\[ Z = \frac{X - \mu}{\sigma} \]

is a standard normal variable. Then,

\[
\begin{aligned}
P(X < a) &= P\left(\frac{X - \mu}{\sigma} < \frac{a - \mu}{\sigma}\right) = P\left(Z < \frac{a - \mu}{\sigma}\right) \\
P(X > b) &= P\left(\frac{X - \mu}{\sigma} > \frac{b - \mu}{\sigma}\right) = P\left(Z > \frac{b - \mu}{\sigma}\right) = 1 - P\left(Z < \frac{b - \mu}{\sigma}\right) \\
P(a < X < b) &= P(X < b) - P(X < a) = P\left(Z < \frac{b - \mu}{\sigma}\right) - P\left(Z < \frac{a - \mu}{\sigma}\right)
\end{aligned}
\]

The probabilities (proportions) can then be found in standard normal tables.

As an example, suppose heights are normal with mean 65 inches and standard deviation 3 inches.

(i) What proportion are less than 62 inches?

\[ P(X < 62) = P\left(\frac{X - \mu}{\sigma} < \frac{62 - \mu}{\sigma}\right) = P\left(Z < \frac{62 - 65}{3}\right) = P(Z < -1) = 0.1587 \]

where the proportion to the left of Z = −1 can be found in a table or using the graph below.

(ii) What proportion are taller than 59 inches?

\[ P(X > 59) = P\left(\frac{X - \mu}{\sigma} > \frac{59 - \mu}{\sigma}\right) = P\left(Z > \frac{59 - 65}{3}\right) = P(Z > -2) = 1 - 0.0228 = 0.9772. \]

Some standard facts to know for X normal with mean µ and standard deviation σ are:
(i) about 68.26% of observations are within one standard deviation of the mean,
(ii) about 95.44% of observations are within two standard deviations of the mean,
(iii) about 99.74% of observations are within three standard deviations of the mean.

The tables can also be used in reverse: given a proportion less than (greater than, between) an unknown value x, find x if X is normal with mean µ and variance σ².

\[ Z = \frac{X - \mu}{\sigma} \ \Rightarrow\ X = \mu + Z\sigma. \]

Given a proportion, find the corresponding z and plug it into the formula $x = \mu + z\sigma$.

Example: If heights are normal with mean 65 inches and standard deviation 3 inches, 95% are less than what height? From the normal table, 95% are smaller than z when z = 1.64. Hence x = 65 + (1.64)(3) = 69.92.

4 Confidence intervals and Margin of Error

The margin of error for an approximate 95% confidence interval (CI) for a population proportion p is given by

\[ MoE = 2\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}, \]

where n is the sample size and $\hat{p}$ is the sample proportion, which estimates the parameter p. A 95% confidence interval for p is then

\[ \left( \hat{p} - 2\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}},\ \hat{p} + 2\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \right). \]

Example: In a survey of 200 people, 40% were for Proposition Q. What is an approximate 95% CI for the true proportion supporting Proposition Q?

\[ \hat{p} = 0.40, \qquad n = 200 \]

\[ MoE = 2\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} = 2\sqrt{\frac{0.4\,(0.6)}{200}} = 0.0693. \]

Hence, an approximate 95% CI for the true proportion supporting Proposition Q is given by

\[ \left( \hat{p} - 2\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}},\ \hat{p} + 2\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \right) = (0.40 - 0.0693,\ 0.40 + 0.0693) = (0.3307,\ 0.4693). \]
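The margin of error and interval for the Proposition Q example can be reproduced with a few lines of Python. This is a minimal sketch; the function names `margin_of_error` and `approx_95_ci` are illustrative, and the multiplier 2 is the approximate value used in these notes.

```python
import math

def margin_of_error(p_hat, n):
    """Approximate 95% margin of error for a sample proportion: 2 * sqrt(p(1 - p) / n)."""
    return 2 * math.sqrt(p_hat * (1 - p_hat) / n)

def approx_95_ci(p_hat, n):
    """Approximate 95% confidence interval (lower, upper) for a population proportion."""
    moe = margin_of_error(p_hat, n)
    return p_hat - moe, p_hat + moe

moe = margin_of_error(0.40, 200)
low, high = approx_95_ci(0.40, 200)
print(round(moe, 4))                  # -> 0.0693
print(round(low, 4), round(high, 4))  # -> 0.3307 0.4693
```

Because n sits under the square root, quadrupling the sample size halves the margin of error for the same sample proportion.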