Univariate Descriptive Statistics
Dr. Shane Nordyke
University of South Dakota

This material is distributed under an Attribution-NonCommercial-ShareAlike 3.0 Unported Creative Commons License, the full details of which may be found online here: http://creativecommons.org/licenses/by-nc-sa/3.0/ . You may re-use, edit, or redistribute the content provided that the original source is cited, it is for noncommercial purposes, and it is distributed under a similar license.

CC BY-NC-SA Nordyke 2010

Why do we need descriptive statistics?
• We use the label univariate descriptive statistics to refer to a variety of measures of center and variation that are useful for understanding the nature and distribution of a single variable.
• They can allow us to quickly understand a large amount of information about a single variable.
• They make data meaningful!

Making Data Meaningful
Age of Volunteer: 15 19 22 17 39 17 26
A relatively small sample of the ages of volunteers at a local nonprofit agency in the community. What does this list tell us about the age of volunteers in the agency?

Making Data Meaningful
Age of Volunteer: 15 17 17 19 22 26 39
Sorting the list can provide a starting place. What do we know now?

Making Data Meaningful
What if the sample is larger?
39 16 43 38 39 36 16 31 24 22 17 35 32 28 47 49
25 31 27 43 27 41 30 16 41 47 49 34 33 31 15 16
22 50 42 40 35 25 40 26 42 44 33 20 18 19 39 19
40 46 43 22 28 38 21 49 49 20 44 26 24 16 49 23
37 30 17 19 26 25 16 24 44 31 27 29 45 26 33 34
15 15 16 15 28 36 48 44 24 24 43 44 50 26 29 37
30 25 33 24 41 38 48 39 18 24 49 17 21 16 40 18
16 20 26 19 43 38 46 15 28 27 16 42 39 45 20 15
25 17 31 26 47 18 30 21 22 34 23 43 40 22 18 19
28 22 30 40 22 45 31 24 38 33 25 29 21 21 37 41

The Menu of Basic Descriptive Statistics
• Measures of central tendency
  – Mean, Median, Mode, Midrange
• Measures of distribution
  – Range, Min, Max, Percentiles
• Measures of variation
  – Standard Deviation, Variance, Coefficient of Variation

Some initial notation
• $\Sigma$ indicates the addition of a set of values
• $y$ is the variable used to represent the individual data values
• $n$ represents the number of values in a sample
• $N$ represents the number of values in a population

Measures of Central Tendency - Mean
The sample mean is the mathematical average of the data and is the measure of central tendency we use most often.

Measures of Central Tendency - Mean
Observation # | Age of Volunteer
1             | 15
2             | 17
3             | 17
4             | 19
5             | 22
6             | 26
7             | 39
Sum           | 155

Sample Mean: $\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$
The numerator is the sum of all of the observations; $n$ is the number of observations.
$\bar{y} = \frac{155}{7}$
$\bar{y} \approx 22.14$

Measures of Central Tendency - Median
The sample median is the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude. If there isn't one value in the middle, we take the average of the two middle values. The median is not affected by extreme values.

Measures of Central Tendency - Median
Median (when the series has two middle values): $\tilde{y} = \frac{\text{sum of the two middle values in the series}}{2}$
The median is often denoted by $\tilde{y}$, which is pronounced "y-tilde".

Measures of Central Tendency - Median
Sample ages arranged in ascending order:
15 17 17 19 22 26 39
The middle value is the median.
$\tilde{y} = 19$

Measures of Central Tendency - Median
If there are two values in the middle, we take the average of the two.
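The mean and median calculations described so far can be sketched in a few lines of Python. This is only an illustration: the slides do not use any programming language, and the names `ages`, `sample_mean`, and `sample_median` are assumptions introduced here. The eight-value case simply adds one more age (34) to the seven-value sample so that there are two middle values.

```python
# Illustrative sketch only; names are hypothetical and not taken from the slides.

# Ages of the seven volunteers in the small sample
ages = [15, 19, 22, 17, 39, 17, 26]

# Sample mean: the sum of the observations divided by the number of observations
sample_mean = sum(ages) / len(ages)


def sample_median(values):
    """Return the middle value, or the average of the two middle values."""
    ordered = sorted(values)          # arrange in ascending order first
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # odd count: a single middle value
        return ordered[middle]
    # even count: average the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2


print(round(sample_mean, 2))          # 22.14
print(sample_median(ages))            # 19
print(sample_median(ages + [34]))     # 20.5
```

Python's standard `statistics` module also offers `mean` and `median`, which should give the same values for these lists.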
For example, with eight ages in ascending order, the two middle values are 19 and 22:
15 17 17 19 22 26 34 39
$\tilde{y} = \frac{\text{sum of the two middle values in the series}}{2}$
$\tilde{y} = \frac{19 + 22}{2}$
$\tilde{y} = 20.5$

Measures of Central Tendency - Median
Note that the presence of an extreme value doesn't change the median.
15 17 17 19 22 26 34 99
$\tilde{y} = \frac{\text{sum of the two middle values in the series}}{2}$
$\tilde{y} = \frac{19 + 22}{2}$
$\tilde{y} = 20.5$

Measures of Central Tendency - Mode
The mode is the value that occurs most frequently.
– Not every sample has a distinct mode. Sometimes a sample is bimodal (two modes) or multimodal (three or more modes), and sometimes there is no mode at all.
– The mode is the only measure of central tendency we can use for nominal data.

Measures of Central Tendency - Mode
15 17 17 19 22 26 39
M = 17
17 is the only value that occurs more than once, so it is the value that occurs most frequently and is the mode.
The mode is often denoted with the symbol M.

Measures of Central Tendency - Mode
Blue, Green, Green, Purple, Purple, Red, Red, Red, Red, Yellow, Yellow, Yellow
M = Red

20 29 33 33 34 41 41 42 43 45 45
Multimodal (33, 41, and 45 each occur twice)

1.1 2.3 4.1 5.3 4.3 6.7 8.2 8.3 8.7 8.9 10.3
No mode (no value occurs more than once)

Measures of Central Tendency - Midrange
The midrange, or middle of the range, is the average of the highest and lowest values. There is no distinct symbol for the midrange.
$\text{Midrange} = \frac{\text{Lowest value} + \text{Highest value}}{2}$

Measures of Central Tendency - Midrange
15 17 17 19 22 26 39
$\text{Midrange} = \frac{\text{Lowest value} + \text{Highest value}}{2}$
$\text{Midrange} = \frac{15 + 39}{2}$
$\text{Midrange} = 27$

Comparing Measures of Central Tendency
15 17 17 19 22 26 39
Mean = 22.14
Median = 19
Mode = 17
Midrange = 27

Comparing Measures of Center
(Measures are listed from most used to least used.)
Measure of Center | Does it always exist?                   | Does it take into account every value? | Is it affected by extreme values?
Mean              | Always                                  | Yes                                    | Yes
Median            | Always                                  | No                                     | No
Mode              | Might not exist, may have more than one | No                                     | No
Midrange          | Always                                  | No                                     | Yes

The Range
• The range of a sample is the difference between the highest value and the lowest value.
15 17 17 19 22 26 39
In our example, the range = 39 - 15 = 24; there are 24 years between our youngest and oldest volunteers in the sample.

Measures of Variance
• Where measures of central tendency try to give us an idea of where the middle of the data lies, measures of variance (or variation) tell us how the data are distributed around that center.
• Our three primary measures of variance are:
  – Standard Deviation,
  – Variance, and
  – Coefficient of Variation

Measures of Variance – Standard Deviation
The standard deviation is a measure of the variation of values around the mean.
Sample Standard Deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$
Population Standard Deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \bar{y})^2}{N}}$
(In the population formula, $\bar{y}$ is the mean of all $N$ values in the population.)

Some Key Points for Understanding Standard Deviation
• The standard deviation is never negative; it is zero only when every value is identical.
• The standard deviation of a sample will always be in the same units as the observations in the sample.
• Extreme values or outliers can change the value of the standard deviation substantially.
• The size of the sample affects how stable the standard deviation is, not how large it is: the $n - 1$ divisor matters most in small samples, and as the sample grows, $s$ tends toward the population standard deviation $\sigma$ rather than shrinking toward zero.

Measures of Variance - Variance
• The variance of a sample is just the standard deviation of the sample squared.
Sample Variance: $s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}$
Population Variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (y_i - \bar{y})^2}{N}$

Standard Deviation and Variance Notation
Sample: $s$ = standard deviation, $s^2$ = variance
Population: $\sigma$ = standard deviation, $\sigma^2$ = variance

Seeing Standard Deviations
• Once I figure out how to draw the curves, this will be a slide that shows the difference between a distribution with a small standard deviation (tall and narrow) and a large one (broad and flat).

Back to our example
• In our sample of volunteer ages, the mean was 22.14 years.
15 17 17 19 22 26 39
• We can calculate the standard deviation to better understand how the values are distributed around that mean.

Back to our example
Sample Standard Deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$
$y$ | $\bar{y}$ | $(y - \bar{y})$ | $(y - \bar{y})^2$
15  | 22.14     | -7.14           | 50.9796
17  | 22.14     | -5.14           | 26.4196
17  | 22.14     | -5.14           | 26.4196
19  | 22.14     | -3.14           | 9.8596
22  | 22.14     | -0.14           | 0.0196
26  | 22.14     | 3.86            | 14.8996
39  | 22.14     | 16.86           | 284.2596
Sum |           |                 | 412.8572

Back to our example
Sample Standard Deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$
$s = \sqrt{\frac{412.86}{7-1}}$
$s \approx 8.3$

How are standard deviations helpful?
The Empirical Rule
When data sets have distributions that are approximately bell shaped, the following is true:
• About 68% of all values fall within 1 standard deviation of the mean
• About 95% of all values fall within 2 standard deviations of the mean
• About 99.7% of all values fall within 3 standard deviations of the mean
Copyright © 2004 Pearson Education, Inc.
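The standard deviation calculation and the intervals named by the Empirical Rule can be sketched in Python as follows. This is a minimal illustration, not part of the original slides; the function names `sample_std` and `share_within` are assumptions made up for the example. Because the seven volunteer ages are a tiny sample that is not bell shaped, the shares it reports should not be expected to match 68%, 95%, and 99.7%.

```python
# Illustrative sketch only; function names are hypothetical and not from the slides.
import math

ages = [15, 17, 17, 19, 22, 26, 39]


def sample_std(values):
    """Sample standard deviation: sqrt of squared deviations summed over n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(y - mean) ** 2 for y in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))


def share_within(values, k):
    """Fraction of values lying within k sample standard deviations of the mean."""
    mean = sum(values) / len(values)
    s = sample_std(values)
    inside = [y for y in values if abs(y - mean) <= k * s]
    return len(inside) / len(values)


print(round(sample_std(ages), 1))   # 8.3
for k in (1, 2, 3):
    # Compare each share with the 68% / 95% / 99.7% guideline for bell-shaped data
    print(k, share_within(ages, k))
```

The standard library's `statistics.stdev` uses the same n - 1 divisor and should agree with `sample_std`, while `statistics.pstdev` divides by the number of values, as in the population formula.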
The Empirical Rule
[Figure: a bell curve built up over three slides. 68% of values fall within 1 standard deviation of the mean (34% on each side). 95% of values fall within 2 standard deviations of the mean (a further 13.5% on each side). 99.7% of values fall within 3 standard deviations of the mean (a further 2.4% on each side).]

Measures of Variance – Coefficient of Variation
• The Coefficient of Variation (CV) is a measure of the standard deviation of a sample relative to its mean.
• CVs can be useful when you are comparing the standard deviations of variables that are in two different units.

Measures of Variance – Coefficient of Variation
An example: You are comparing the heights and weights of fourth graders.
Height: $\bar{y}$ = 52", $s$ = 4"
Weight: $\bar{y}$ = 80 lbs., $s$ = 10 lbs.
Which variable has greater variation? How can we compare 4" to 10 lbs.?

Measures of Variance – Coefficient of Variation
$CV = \frac{s}{\bar{y}} \times 100\%$

Height: $\bar{y}$ = 52", $s$ = 4"
$CV = \frac{4}{52} \times 100\%$
$CV \approx 8\%$

Weight: $\bar{y}$ = 80 lbs., $s$ = 10 lbs.
$CV = \frac{10}{80} \times 100\%$
$CV = 12.5\%$

The standard deviation of height is about 8% of the mean of height, whereas the standard deviation of weight is 12.5% of the mean of weight, so there is greater variation in the weight of the fourth graders than in their height.
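The height and weight comparison can be checked with a short Python sketch. It is offered only as an illustration under the assumption that the means and standard deviations are the ones given on the slide; the function name `coefficient_of_variation` is hypothetical.

```python
# Illustrative sketch only; the function name is hypothetical and not from the slides.

def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return std_dev / mean * 100


# Fourth graders' height in inches: mean 52, standard deviation 4
height_cv = coefficient_of_variation(4, 52)
# Fourth graders' weight in pounds: mean 80, standard deviation 10
weight_cv = coefficient_of_variation(10, 80)

print(round(height_cv, 1))    # 7.7, which the slide rounds to 8%
print(weight_cv)              # 12.5
print(weight_cv > height_cv)  # True: weight varies more relative to its mean
```

Because the CV divides out the units, the two percentages can be compared directly even though inches and pounds cannot.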