Describing Data: Summary Measures

Identifying the Scale of Measurement
• Before you analyze the data, identify the measurement scale for each variable (continuous, nominal, or ordinal).
[Figure: survey item with a VARIABLE column and response options AGREE, NO OPINION, DISAGREE]

Nominal Variables
• Variable: Type of Beverage, coded 1, 2, 3 (or in any other order; the codes are labels only).

Ordinal Variables
• Variable: Size of Beverage (Small, Medium, Large).

Continuous Variables
• Variable: Volume of Beverage (ratio level).
• Variable: Temperature of Beverage (interval level).

Central Tendencies
• Defined as the tendency of the data to cluster around or center about certain numerical values, such as the:
– Mean (arithmetic mean),
– Median, and
– Mode.

Central Tendency – Mean, Median, and Mode
Data: 1 1 1 2 3 10
• Mean = 3: the sum of all the values in the data set divided by the number of values, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
• Median = 1.5: the middle value (also known as the 50th percentile).
• Mode = 1: the most common or frequent data value.

Mean (or Arithmetic Mean)
• Sum of the values of all of the observations in a data set divided by the number of observations:
– The sample mean is $\bar{x}$.
– The population mean is $\mu$.
– The formula for calculating the sample mean is $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$.

Median
• Defined as the middle point of the set of data, i.e. half of the data points are above the median and half are below:
– If the number of data points is odd, it is the middle point of the ordered set of data.
– If the number of data points is even, it is the average (mean) of the two middle points of the ordered set of data.

Mode
• Defined as the measurement(s) which occur with the greatest frequency in the sample, i.e. the most common point(s):
– A unimodal data set contains only one mode.
– A bimodal data set contains two modes.
– And so on….
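The three measures defined above can be checked on the slide's small data set (1 1 1 2 3 10). The following is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the original slides.

```python
from statistics import mean, median, multimode

# Data set from the "Central Tendency" slide.
data = [1, 1, 1, 2, 3, 10]

sample_mean = mean(data)        # sum of the values divided by the number of values
sample_median = median(data)    # n is even, so this averages the two middle values
sample_modes = multimode(data)  # list of every value tied for the highest frequency

print(sample_mean, sample_median, sample_modes)
```

`multimode` is used rather than `mode` because it returns all modes, which matters for bimodal data sets.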
Examples and Calculations

Company  Sales Revenue    Company  Sales Revenue
1        3.1              19       6.2
2        7.4              20       8.4
3        2.2              21       1.9
4        10.9             22       5.8
5        4.5              23       4.9
6        8.6              24       6.4
7        3.7              25       3.6
8        6.3              26       7.9
9        7.6              27       3.2
10       5.4              28       8.5
11       2.3              29       6.2
12       5.8              30       9.7
13       4.2              31       7.1
14       6.1              32       5.9
15       9.1              33       5.7
16       5.5              34       4.4
17       4.8              35       2.9
18       8.9

Stem | Leaf
   1 | 9
   2 | 2 3 9
   3 | 1 2 6 7
   4 | 2 4 5 8 9
   5 | 4 5 7 8 8 9
   6 | 1 2 2 3 4
   7 | 1 4 6 9
   8 | 4 5 6 9
   9 | 1 7
  10 | 9
Key: Leaf units are tenths.

Examples and Calculations
• Mean: $\bar{x} = \frac{\sum x_i}{n} = (3.1 + 7.4 + 2.2 + 10.9 + \dots + 2.9)/35 = 205.1/35 = 5.86$
• Median:
– From stem and leaf: the median is 5.8 (the 18th of the 35 ordered values).
• Mode:
– From stem and leaf: the modes are 5.8 and 6.2 (each occurs twice).

Skewing (Mean and Median Comparisons)
• If the median is less than the mean, the data set is skewed right (extreme data in the right tail, which increases the mean).
• If the median is greater than the mean, the data set is skewed left (extreme data in the left tail, which decreases the mean).
• If the median equals the mean, the data set is said to be symmetrical.

Other Items
• Notes:
– The mean is sensitive to outliers (extreme data) while the median and mode are not.
– Outliers can have significant implications for the ability to draw inferences from the data set. For example, consider:
• Average home sales price.
• Average exam score.
• Average jury award amount.
– Mean or median may not be feasible.

Picturing Distributions: Histogram
[Figure: histogram; vertical axis PERCENT, horizontal axis Bins]
– Each bar in the histogram represents a group of values (a bin).
– The height of the bar represents the frequency or percent of values in the bin.

Data Distributions Compared to Normal
[Figure panels: A Normal Distribution; Skewness]

Measures of Data Variability
• Knowing central tendencies (mean, median, mode) isn’t enough. We also need a method for determining how closely the data are clustered around their center point(s).
• Three typical measures of data variability:
– Range,
– Variance, and
– Standard Deviation.
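Returning to the sales revenue example and the mean/median skew comparison above, the sketch below recomputes the slide's figures with Python's standard `statistics` module. The helper name `skew_direction` is hypothetical; it applies only the simple median-versus-mean rule from the slides and is not a formal skewness statistic.

```python
from statistics import mean, median, multimode

# Sales revenue for companies 1..35, in the order given in the table.
revenue = [
    3.1, 7.4, 2.2, 10.9, 4.5, 8.6, 3.7, 6.3, 7.6, 5.4, 2.3, 5.8,
    4.2, 6.1, 9.1, 5.5, 4.8, 8.9, 6.2, 8.4, 1.9, 5.8, 4.9, 6.4,
    3.6, 7.9, 3.2, 8.5, 6.2, 9.7, 7.1, 5.9, 5.7, 4.4, 2.9,
]


def skew_direction(values):
    """Classify skew by comparing the median with the mean (rule from the slides)."""
    m, med = mean(values), median(values)
    if med < m:
        return "right"      # extreme values in the right tail pull the mean up
    if med > m:
        return "left"       # extreme values in the left tail pull the mean down
    return "symmetric"


print(round(mean(revenue), 2))     # mean
print(median(revenue))             # median
print(sorted(multimode(revenue)))  # all modes
print(skew_direction(revenue))     # skew direction by the median/mean rule
```

Because the comparison is exact, data whose mean and median differ only by rounding noise would still be labeled skewed; a tolerance could be added if that matters.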
The Spread of a Distribution: Dispersion

Measure              Definition
Range                the difference between the maximum and minimum data values
Interquartile Range  the difference between the 25th and 75th percentiles
Variance             a measure of dispersion of the data around the mean
Standard Deviation   a measure of dispersion expressed in the same units of measurement as your data (the square root of the variance)

Range
• Simplest measure of variability.
• Calculated by subtracting the smallest measurement from the largest measurement.
• GPA Examples:
– Data Set (4.0, 2.7, 3.3, 3.2, 2.1, 3.7, 3.5, 1.9): the range equals 4.0 minus 1.9, which is 2.1.
– Data Set (2.7, 3.2, 3.4, 2.9, 1.8, 2.2, 3.0, 2.1): the range equals 3.4 minus 1.8, which is 1.6.

Is Range A Sufficient Measure of Variability?
• No. Consider the following two stem and leaf diagrams, both with a range of 9.0.

Stem | Leaf
   1 | 9
   2 | 3
   3 | 1 6
   4 | 2 3 5 5 6 7
   5 | 4 4 5 6 7 8 8 9 9
   6 | 1 2 2 3 4 5 8
   7 | 1 4 6 8 9
   8 | 1 8
   9 | 7
  10 | 9
Key: Leaf units are tenths.

Stem | Leaf
   1 | 9
   2 | 2 3 9
   3 | 1 2 6 7
   4 | 2 4 5 8 9
   5 | 4 5 7 8 8 9
   6 | 1 2 2 3 4
   7 | 1 4 6 9
   8 | 4 5 6 9
   9 | 1 7
  10 | 9
Key: Leaf units are tenths.

Another Method for Measuring Data Variability (or Spread)
• A more sensitive measurement of variation uses the difference between each measurement in the sample and the sample mean, also known as the deviation from the mean.
• Each deviation between a sample member and the mean is first calculated and then squared. These results are then summed.

Variance
• Equal to the sum of the squared distances from the mean divided by (n-1) for a sample:
– Sample Variance: $s^2$
– Population Variance: $\sigma^2$
• Deviations are squared to remove the effects of negative differences.

Standard Deviation
• The variance is expressed in squared units (“units squared”), which are hard to interpret. Taking the positive square root of the variance gives a measure in the same units as the data itself (“units”):
– Sample Standard Deviation: $s$
– Population Standard Deviation: $\sigma$

Variance Formulas
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{n}$

Shortcut Variance Formulas
$s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1}$
$\sigma^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n}$

Standard Deviation Formulas
$s = \sqrt{s^2} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}$

Shortcut Standard Deviation Formulas
$s = \sqrt{s^2} = \sqrt{\frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1}}$
$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n}}$

Notes on Variability
• The sample variance (and hence the sample standard deviation) is computed by dividing by one less than the sample size (n-1) rather than by the sample size itself (n).
• The sample variance (standard deviation) is used to estimate the population variance (standard deviation). We divide by (n-1) rather than (n) so that the sample variance is an unbiased estimator of the population variance.

Notes on Variability (continued)
• A data set with larger spread about its mean will have a larger standard deviation.
• A data set with smaller spread about its mean will have a smaller standard deviation.
• It is the calculation of the standard deviation that allows for the comparison of the spread (variability) of the two data sets.

Examples
• GPA Data Set One: (4.0, 2.7, 3.3, 3.2, 2.1, 3.7, 3.5, 1.9).
– Range equals 4.0 minus 1.9, which is 2.1.
– Mean equals: $\bar{x} = (4.0 + 2.7 + 3.3 + \dots + 1.9)/8 = 24.4/8 = 3.05$
• GPA Data Set Two: (2.7, 3.2, 3.4, 2.9, 1.8, 2.2, 3.0, 2.1).
– Range equals 3.4 minus 1.8, which is 1.6.
– Mean equals: $\bar{x} = (2.7 + 3.2 + 3.4 + \dots + 2.1)/8 = 21.3/8 = 2.66$

Variance and Standard Deviation Calculations
• GPA Data Set One:

Member  X    X Minus Mean  Result Squared
1       4.0   0.95         0.9025
2       2.7  -0.35         0.1225
3       3.3   0.25         0.0625
4       3.2   0.15         0.0225
5       2.1  -0.95         0.9025
6       3.7   0.65         0.4225
7       3.5   0.45         0.2025
8       1.9  -1.15         1.3225
Sum                        3.9600

• GPA Data Set Two:

Member  X    X Minus Mean  Result Squared
1       2.7   0.04         0.0016
2       3.2   0.54         0.2916
3       3.4   0.74         0.5476
4       2.9   0.24         0.0576
5       1.8  -0.86         0.7396
6       2.2  -0.46         0.2116
7       3.0   0.34         0.1156
8       2.1  -0.56         0.3136
Sum                        2.2788

– Variance (Data Set One): $s^2 = 3.9600/(8 - 1) = 0.566$
– Variance (Data Set Two): $s^2 = 2.2788/(8 - 1) = 0.326$
– Standard Deviation (Data Set One): $s = \sqrt{s^2} = \sqrt{3.9600/(8 - 1)} = 0.752$
– Standard Deviation (Data Set Two): $s = \sqrt{s^2} = \sqrt{2.2788/(8 - 1)} = 0.571$

Variance/Standard Deviation Calculations using Shortcut
• GPA Data Set One:

Member  X     X Squared
1       4.0   16.00
2       2.7    7.29
3       3.3   10.89
4       3.2   10.24
5       2.1    4.41
6       3.7   13.69
7       3.5   12.25
8       1.9    3.61
Sum     24.4  78.38

– Variance: $s^2 = \frac{78.38 - \frac{(24.4)^2}{8}}{8 - 1} = 0.566$

• GPA Data Set Two:

Member  X     X Squared
1       2.7    7.29
2       3.2   10.24
3       3.4   11.56
4       2.9    8.41
5       1.8    3.24
6       2.2    4.84
7       3.0    9.00
8       2.1    4.41
Sum     21.3  58.99

– Variance: $s^2 = \frac{58.99 - \frac{(21.3)^2}{8}}{8 - 1} = 0.326$

Standard Deviation
• Useful for comparing the variability of two data sets.
• The data set with the larger standard deviation is the data set with more variability.
• From GPA Example:
– Data Set One: Mean=3.05, St. Dev.=0.752.
– Data Set Two: Mean=2.66, St. Dev.=0.571.

Relative vs. Absolute Comparison
• Deviation, or error, has been standardized. Thus, for a single data set, variability can be discussed in terms of how many members of the data set fall within one, two, three, or more standard deviations of the mean.
• A theorem and a rule describe this behavior:
– Chebyshev’s Theorem;
– Empirical Rule.

Chebyshev’s Theorem
• In general, at least $(1 - 1/k^2)$ of the sample members will fall within k standard deviations of the mean (for k > 1).
• So…
– For k=1, the bound is 0, so it is possible for no members to fall within 1 standard deviation of the mean.
– For k=2, 75% (3/4) or more of the members will fall within 2 standard deviations of the mean.

Chebyshev’s Theorem (cont’d)
• From Table 2.10, Part 1 (cont’d):
– For k=3, 88.9% (8/9) or more of the members will fall within 3 standard deviations of the mean.
– For k=4, 93.75% (15/16) or more of the members will fall within 4 standard deviations of the mean.
– And so on….
• This theorem holds true regardless of the frequency distribution of the data set, i.e. no matter what the histogram looks like.
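The GPA calculations and the Chebyshev bound above can be reproduced with a short sketch. The function names below are illustrative, not from the slides. The code implements the definitional and shortcut sample variance formulas (both dividing by n-1) and the lower bound 1 - 1/k², which it clamps to 0 for k ≤ 1 as an assumption so that the k=1 case matches the slide.

```python
from math import sqrt


def sample_variance(xs):
    """Definitional formula: sum of squared deviations from the mean over (n - 1)."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


def sample_variance_shortcut(xs):
    """Shortcut formula: (sum of x^2 - (sum of x)^2 / n) over (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)


def chebyshev_lower_bound(k):
    """Minimum fraction of members within k standard deviations of the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    return max(0.0, 1 - 1 / k ** 2)


gpa_one = [4.0, 2.7, 3.3, 3.2, 2.1, 3.7, 3.5, 1.9]
gpa_two = [2.7, 3.2, 3.4, 2.9, 1.8, 2.2, 3.0, 2.1]

for name, data in (("one", gpa_one), ("two", gpa_two)):
    var = sample_variance(data)
    print(name, round(var, 3), round(sqrt(var), 3))

for k in (1, 2, 3, 4):
    print(k, round(chebyshev_lower_bound(k), 4))
```

The two variance functions agree up to floating-point rounding. The shortcut form subtracts two large, similar numbers, so it can lose precision on data with a large mean and small spread; the definitional form is usually the safer choice in code.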
Empirical Rule
• Based on empirical evidence for mound or bell shaped frequency distributions:
– Approximately 68% (0.6826) of the sample members will fall within 1 standard deviation of the mean.
– Approximately 95% (0.9544) of the sample members will fall within 2 standard deviations of the mean.
– Almost all (0.9974) of the sample members will fall within 3 standard deviations of the mean.

Normal Distributions
[Figure: Useful Probabilities for Normal Distributions, with regions labeled 68%, 95%, 99%]

Chebyshev’s Theorem vs. Empirical Rule
• Percent of sample members within k standard deviations of the mean:

k  Chebyshev’s Theorem (at least)  Empirical Rule
1  0%                              68.26%
2  75%                             95.44%
3  88.9%                           99.74%
4  93.75%                          > 99.8%
5  96%                             > 99.9%
And so on….

Example using Toyota 4Runner Data
MPG Data for 1995 Toyota 4Runner (ordered)
14.6 14.6 14.7 14.8 14.8 14.9 14.9 15.0 15.1 15.1
15.2 15.2 15.2 15.3 15.3 15.4 15.4 15.5 15.5 15.5
15.6 15.6 15.7 15.7 15.7 15.8 15.9 16.0 16.1 16.1
Mean = 15.34, Standard Deviation = 0.4352

k  Mean Minus ks  Mean Plus ks  Chebyshev’s Thm Estimate  Actual  Empirical Rule Estimate  Actual
1  14.9048        15.7752       30*0% = 0                 18      30*68.26% = 20.5         18
2  14.4696        16.2104       30*75% = 22.5             30      30*95.44% = 28.6         30
3  14.0344        16.6456       30*88.9% = 26.7           30      30*99.74% = 29.9         30

Measures of Relative Standing

Percentile Ranking
• Describes how a member of the data set compares to the rest of the data.
• Percentile ranking: the pth percentile is a number x such that p% of the measurements fall below x.
• Example:
– SAT Scores: a score at the 90th percentile means that 90 percent of the scores are below it.

Percentiles
Data (ordered): 98 95 92 90 85 81 79 70 63 55 47 42
Quartiles divide your data into quarters.
– 75th Percentile = 91 (third quartile)
– 50th Percentile = 80 (median)
– 25th Percentile = 59 (first quartile)

Box Plots
[Figure: box plot with the following elements labeled, from top to bottom]
– outliers: points more than 1.5 IQR from the box
– the largest point <= 1.5 IQR from the box
– the 75th percentile
– the 50th percentile (median)
– the 25th percentile
– the smallest point <= 1.5 IQR from the box
• The mean is denoted by a ◊.

Excel and StatPro Add-in Demonstration
• Pivot tables
• Summary measures
– Excel
– Add-ins
• Covariance and correlation
• Boxplots
• Applications
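For the percentile and box plot slides above, here is a minimal sketch that reproduces the quartiles 59, 80, and 91 for the twelve data values and derives the box plot elements. It assumes the "median of each half" quartile convention, which matches the slide's numbers; other conventions (including those in Python's `statistics.quantiles`) give slightly different quartiles for this data. The function names are hypothetical.

```python
from statistics import median


def quartiles_by_halves(values):
    """Q1, median, Q3 using the median of the lower and upper halves (assumed convention)."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # For odd n, the middle value is excluded from both halves.
    upper = xs[half:] if n % 2 == 0 else xs[half + 1:]
    return median(lower), median(xs), median(upper)


def box_plot_summary(values):
    """Box, whiskers, and outliers using the 1.5 * IQR rule from the slide."""
    q1, q2, q3 = quartiles_by_halves(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in values if low_fence <= x <= high_fence]
    outliers = sorted(x for x in values if x < low_fence or x > high_fence)
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "iqr": iqr,
        "whiskers": (min(inside), max(inside)),  # extreme points within the fences
        "outliers": outliers,
    }


scores = [98, 95, 92, 90, 85, 81, 79, 70, 63, 55, 47, 42]
summary = box_plot_summary(scores)
print(summary)
```

With an IQR of 32, the fences sit at 11 and 139, so every value lies inside them and the whiskers extend to the minimum and maximum.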