Lecture 3, January 12, 2004
A point from previous lecture

Last Friday, I told you that when you create a histogram with classes such as 10-20, 20-30, and so on, the value 20 should be tallied in the class 10-20. That is not right. You should tally 20 in the class 20-30. If you have any questions, please ask me.

Statistical Measures

Although the frequency distribution arranges the raw data into a meaningful pattern, that summary cannot by itself answer many important statistical questions. For example, an industrial engineer wishing to select the faster of two production methods might obtain sample completion times from pilot runs and then try to reach a decision by comparing the two resulting sample frequency distributions.

The faster procedure ought to be more clearly indicated by the "average" completion times under the two production methods. Averages are one class of statistical measures. These quantities (statistical measures) express various properties of the statistical data.

The first kind of measures in this discussion is measures of location. There are two types of location measures. One group expresses central tendency. The other group measures variability or dispersion.

Statistics and Parameter

Summary data measures fall into two major groupings, depending on whether the observations they describe are a population or a sample.

Population Parameter

When the data constitute a population, each summary measure is referred to as a population parameter. Ordinarily, however, not all possible population observations are made.

Sample Statistic

A measure that summarizes sample data is called a sample statistic. It is the statistic that is computed from those observations actually made. Important population parameters have counterpart sample statistics that measure the same characteristic.

NUMERICAL DESCRIPTIVE MEASURES

Numerical descriptive measures are numbers computed from a data set to help us create a mental image of its relative frequency histogram.
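Returning to the correction at the start of this lecture: the rule that a boundary value such as 20 belongs to the class 20-30 amounts to treating each class as including its lower limit and excluding its upper limit. The following is a minimal sketch of that rule, not a definitive implementation. The function name `tally_classes`, its parameters, and the data values are all hypothetical and chosen only for illustration.

```python
def tally_classes(values, lower, width, n_classes):
    """Count values per class, where class k covers
    [lower + k*width, lower + (k+1)*width): lower limit included,
    upper limit excluded. Values outside all classes are ignored."""
    counts = [0] * n_classes
    for v in values:
        k = int((v - lower) // width)  # index of the class that contains v
        if 0 <= k < n_classes:
            counts[k] += 1
    return [
        (lower + k * width, lower + (k + 1) * width, counts[k])
        for k in range(n_classes)
    ]


# Hypothetical data; note the two boundary values 20 and the boundary value 30.
data = [12, 20, 20, 25, 30, 10, 19]
for low, high, count in tally_classes(data, lower=10, width=10, n_classes=3):
    print(f"{low}-{high}: {count}")
```

With this convention each of the two 20s is counted in the class 20-30 and the value 30 is counted in the class 30-40, so no value is ever tallied twice.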
Measures of Central Location: mean, median, mode
Relative Standing: percentiles, box plots
Measures of Variability: range, variance, standard deviation
Measures of Association: covariance, coefficient of correlation

MEASURES OF CENTRAL LOCATION: MEAN

The arithmetic mean is the most commonly used and best understood measure of central tendency. The mean is defined as follows:

$$\text{Mean} = \frac{\text{Sum of the measurements}}{\text{Number of measurements}}$$

In the following, the sample mean and the population mean are discussed separately. Note the difference in notation: the sample mean is denoted by $\bar{x}$ and the population mean is denoted by $\mu$. The number of values in a sample is denoted by $n$ and the number of values in the population is denoted by $N$.

If the data set is a sample, its mean is the sample mean. If the data set is a population, its mean is the population mean.

MEASURES OF CENTRAL LOCATION: SAMPLE MEAN

The sample mean is the sum of all the sample values divided by the number of sample values:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

where $\bar{x}$ stands for the sample mean, $n$ is the total number of values in the sample, $x_i$ is the value of the $i$-th observation, and $\sum$ represents a summation.

A sample of five executives received the following amounts of bonus last year: $14,000, $15,000, $17,000, $16,000, and $15,000. Find the average bonus for these five executives. Since these values represent a sample of size 5, the sample mean is (14,000 + 15,000 + 17,000 + 16,000 + 15,000)/5 = $15,400.

MEASURES OF CENTRAL LOCATION: POPULATION MEAN

The population mean is the sum of all the population values divided by the number of population values:

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

where $\mu$ stands for the population mean, $N$ is the total number of values in the population, $x_i$ is the value of the $i$-th observation, and $\sum$ represents a summation.

The Keller family owns four cars. The following is the mileage attained by each car: 56,000, 23,000, 42,000, and 73,000. Find the average miles covered by each car.
The mean is (56,000 + 23,000 + 42,000 + 73,000)/4 = 48,500 miles.

MEASURES OF CENTRAL LOCATION: PROPERTIES OF MEAN

Data possessing an interval scale or a ratio scale usually have a mean.
All the values are included in computing the mean.
A set of data has a unique mean.
The arithmetic mean is the only measure of central tendency where the sum of the deviations of each value from the mean is zero.

Consider the set of values: 3, 8, and 4. The mean is 5. Illustrating the last property, (3 - 5) + (8 - 5) + (4 - 5) = -2 + 3 - 1 = 0. In other words,

$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$

MEASURES OF CENTRAL LOCATION: MEDIAN

Median: the midpoint of the values after they have been ordered from the smallest to the largest, or from the largest to the smallest. There are as many values above the median as below it in the data array. For an even number of values, the median is the arithmetic average of the two middle values.

The median is the most appropriate measure of central location to use when the data under consideration are ranked data, rather than quantitative data. For example, if 13 universities are ranked according to reputation, university 7 is the one of median reputation.

Compute the median for the following data.

The ages of a sample of five college students are: 21, 25, 19, 20, and 22. Arranging the data in ascending order gives: 19, 20, 21, 22, 25. Thus the median is 21.

The heights of four basketball players, in inches, are 76, 73, 80, and 75. Arranging the data in ascending order gives: 73, 75, 76, 80. Thus the median is (75 + 76)/2 = 75.5.

MEASURES OF CENTRAL LOCATION: MODE

The mode is the value of the observation that appears most frequently. The mode is most useful when an important aspect of describing the data involves determining the number of times each value occurs. If the data are qualitative (e.g., the number of graduates in various disciplines: accounting, finance, etc.),
then the mode is useful (e.g., the modal class is accounting).

EXAMPLE 6: The exam scores for ten students are: 81, 93, 84, 75, 68, 87, 81, 75, 81, 87. Since the score of 81 occurs most often (three times), the modal score is 81.

MEASURES OF CENTRAL LOCATION: MEAN, MEDIAN, MODE

Mean: affected by unusually large or small data values; may be used if the data are quantitative (ratio or interval scale).
Median: most appropriate if the data are ranked (ordinal scale).
Mode: most appropriate if the data are qualitative (nominal scale).

Appropriate measures if the data are
quantitative: mean, median, mode
ranked: median, mode
qualitative: mode

MEASURES OF CENTRAL LOCATION: RELATIVE VALUES OF MEAN, MEDIAN, MODE

If the distribution is positively skewed: Mode < Median < Mean
If the distribution is symmetric: Mode = Median = Mean
If the distribution is negatively skewed: Mean < Median < Mode

RELATIVE STANDING: PERCENTILES

Percentiles divide the distribution into 100 groups. The p-th percentile is defined to be that numerical value such that at most p% of the values are smaller than that value and at most (100 - p)% are larger than that value in an ordered data set.

For example, if the 78th percentile of GMAT scores is 600, then at most 78% of the scores are below 600 and at most 22% of the scores are above 600 (it is also true that at least 22% of the scores are 600 or above). A percentile gives you an idea about your relative standing in a group.

Two questions:
Find the percentile of a given value
Find the value of a given percentile

RELATIVE STANDING: PERCENTILES: FIND PERCENTILE OF A GIVEN VALUE

The percentile corresponding to a given value (X) is computed by using the formula:

$$\text{Percentile} = \frac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \times 100\%$$

A teacher gives a 20-point test to 10 students. Scores are as follows: 18, 15, 12, 6, 8, 2, 3, 5, 20, 10. Find the percentile rank of the score of 12.

Ordered set of scores: 2, 3, 5, 6, 8, 10, 12, 15, 18, 20.
There are 6 values below 12: 2, 3, 5, 6, 8, 10.
Percentile = [(6 + 0.5)/10](100%) = 65th percentile. The student did better than 65% of the class.

RELATIVE STANDING: PERCENTILES: FIND VALUE OF A GIVEN PERCENTILE

Procedure: Let p be the percentile and n the sample size.
Step 1: Arrange the data in ascending order.
Step 2: Compute c = (n × p)/100.
Step 3: If c is not a whole number, round up to the next whole number. If c is a whole number, use the value halfway between the c-th and (c+1)-th values.
Step 4: The c-th value (or, when c was a whole number, the halfway value from Step 3) is the required percentile.

Example: Consider the data set 2, 3, 5, 6, 8, 10, 12, 15, 18, 20. Note: the data set is already ordered.

Find the value of the 25th percentile: n = 10, p = 25, so c = (10 × 25)/100 = 2.5. Hence round up to c = 3. Thus, the value of the 25th percentile is the 3rd value, X = 5.

Find the value of the 80th percentile: n = 10, p = 80, so c = (10 × 80)/100 = 8. Since c is a whole number, the value of the 80th percentile is the average of the 8th and 9th values. Thus, the 80th percentile for the data set is (15 + 18)/2 = 16.5.

RELATIVE STANDING: PERCENTILES: DECILES AND QUARTILES

Deciles divide the data set into 10 groups. Deciles are denoted by D1, D2, …, D9, with the corresponding percentiles being P10, P20, …, P90.
Quartiles divide the data set into 4 groups. Quartiles are denoted by Q1, Q2, and Q3, with the corresponding percentiles being P25, P50, and P75.
The median is the same as P50 or Q2.

RELATIVE STANDING: PERCENTILES: INTERQUARTILE RANGE AND OUTLIERS

An outlier is an extremely high or an extremely low data value when compared with the rest of the data values. The interquartile range is IQR = Q3 - Q1.

To determine whether a data value can be considered as an outlier:
Step 1: Compute Q1 and Q3.
Step 2: Find the IQR = Q3 - Q1.
Step 3: Compute (1.5)(IQR).
Step 4: Compute Q1 - (1.5)(IQR) and Q3 + (1.5)(IQR).
Step 5: Compare the data value (say X) with Q1 - (1.5)(IQR) and Q3 + (1.5)(IQR). If X < Q1 - (1.5)(IQR) or if X > Q3 + (1.5)(IQR), then X is considered an outlier.

Given the data set 5, 6, 12, 13, 15, 18, 22, 50, can the value of 50 be considered as an outlier?
Q1 = 9, Q3 = 20, IQR = 11. (Verify.)
(1.5)(IQR) = (1.5)(11) = 16.5.
9 - 16.5 = -7.5 and 20 + 16.5 = 36.5.
The value of 50 is outside the range -7.5 to 36.5, hence 50 is an outlier.

RELATIVE STANDING: BOX PLOTS

When the data set contains a small number of values, a box plot is used to graphically represent the data set. These plots involve five values:
the minimum value (S)
the lower quartile (Q1)
the median (Q2)
the upper quartile (Q3)
the maximum value (L)

RELATIVE STANDING: BOX PLOTS: EXAMPLE

Example: Construct a box plot with the following data, which show the assets of the 15 largest North American banks, rounded off to the nearest hundred million dollars: 111, 135, 217, 108, 51, 98, 65, 85, 75, 75, 93, 64, 57, 56, 98

RELATIVE STANDING: BOX PLOTS: RANKING AND SUMMARIZING

Rank: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
Data: 217, 135, 111, 108, 98, 98, 93, 85, 75, 75, 65, 64, 57, 56, 51

Smallest = 51
Q1 = 64
Median = 85
Q3 = 108
Largest = 217
IQR = 44
Outliers: 217

[Figure: box plot of the bank data. Horizontal axis: Assets (in 100 million dollars), from 0 to 250.]

RELATIVE STANDING: BOX PLOTS: INTERPRETATION

If the median is near the center of the box, the distribution is approximately symmetric.
If the median falls to the left of the center of the box, the distribution is positively skewed.
If the median falls to the right of the center of the box, the distribution is negatively skewed.
If the lines are about the same length, the distribution is approximately symmetric.
If the line segment to the right of the box is longer than the one to the left, the distribution is positively skewed.
If the line segment to the left of the box is longer than the one to the right, the distribution is negatively skewed.

[Figure: SYMMETRIC BOX PLOT. Horizontal axis: Number of units sold, from 0 to 300.]

[Figure: POSITIVELY SKEWED BOX PLOT. Horizontal axis: Number of units sold, from 0 to 300.]
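The percentile, quartile, and outlier procedures from this lecture feed directly into the five values a box plot needs, so it may help to see them as code. The sketch below follows the lecture's rules (c = np/100, round up if c is fractional, average the c-th and (c+1)-th values if c is whole), which differ from the interpolation rules used by many software packages. All function names are hypothetical, and `percentile_value` assumes p is an integer strictly between 0 and 100.

```python
def percentile_rank(data, x):
    """Percentile of a given value: (values below x + 0.5) / n * 100."""
    below = sum(1 for v in data if v < x)
    return (below + 0.5) * 100 / len(data)


def percentile_value(data, p):
    """Value of the p-th percentile by the lecture's procedure.
    Assumes p is an integer with 0 < p < 100."""
    values = sorted(data)                 # Step 1: ascending order
    c, rem = divmod(len(values) * p, 100)  # Step 2: c = np/100, exact integer math
    if rem == 0:
        # c is whole: halfway between the c-th and (c+1)-th values (1-based)
        return (values[c - 1] + values[c]) / 2
    # c is fractional: round up, i.e. take the (c+1)-th value (1-based)
    return values[c]


def five_number_summary(data):
    """Smallest, Q1, median, Q3, largest: the values a box plot displays."""
    return {
        "smallest": min(data),
        "q1": percentile_value(data, 25),
        "median": percentile_value(data, 50),
        "q3": percentile_value(data, 75),
        "largest": max(data),
    }


def find_outliers(data):
    """Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1 = percentile_value(data, 25)
    q3 = percentile_value(data, 75)
    margin = 1.5 * (q3 - q1)
    return [v for v in data if v < q1 - margin or v > q3 + margin]


# Data sets taken from the lecture examples.
scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
banks = [111, 135, 217, 108, 51, 98, 65, 85, 75, 75, 93, 64, 57, 56, 98]

print(percentile_rank(scores, 12))
print(percentile_value(scores, 25), percentile_value(scores, 80))
print(five_number_summary(banks))
print(find_outliers(banks))
```

Run on the lecture's data, this reproduces the worked results: the score of 12 is at the 65th percentile, the 25th and 80th percentiles of the test scores are 5 and 16.5, and the bank data give Q1 = 64, median = 85, Q3 = 108 with 217 flagged as the only outlier.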