Describing Data Numerically
EC 233.01/02, Extra Lecture Notes 3
Copyright ©2011 Pearson Education, Inc. publishing as Prentice Hall

Describing Data Numerically

- Central Tendency: Arithmetic Mean, Median, Mode
- Variation: Range, Variance, Standard Deviation, Coefficient of Variation

Numerical Descriptive Measures: Sample statistics versus population parameters

The summary measures we will learn about are relevant for a population as well as a sample. Yet, there is a distinction:

- Summary measures describing a population, called parameters, are denoted with Greek letters. Population parameters are unique, and in many instances unknown.
- Summary measures describing observations in a sample are called sample statistics. With each new sample, you obtain new sample statistics.
- By using sample statistics, we try to make inferences about population parameters. Please think about what this means!

Measure              Population Parameter   Sample Statistic
Mean                 $\mu$                  $\bar{x}$
Variance             $\sigma^2$             $s^2$
Standard Deviation   $\sigma$               $s$

Measures of Central Tendency

- Mean: arithmetic average
- Median: midpoint of ranked values
- Mode: most frequently observed value

Arithmetic Mean

The arithmetic mean (mean) is the most common measure of central tendency.

For a population of N values:

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$$

where the $x_i$ are the population values and $N$ is the population size.

For a sample of size n:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

where the $x_i$ are the observed values and $n$ is the sample size.

Arithmetic Mean (continued)

- The most common measure of central tendency
- Mean = sum of values divided by the number of values
- Affected by extreme values (outliers)

Example: for the values 1, 2, 3, 4, 5, the mean is (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3. For the values 1, 2, 3, 4, 10, the mean is (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4.

Median

- In an ordered list, the median is the "middle" number (50% above, 50% below)
- Not affected by extreme values

Example: the median is 3 for both 1, 2, 3, 4, 5 and 1, 2, 3, 4, 10.

Finding the Median

The location of the median:

$$\text{Median position} = \frac{n+1}{2} \text{ position in the ordered data}$$

- If the number of values is odd, the median is the middle number
- If the number of values is even, the median is the average of the two middle numbers

Note that $\frac{n+1}{2}$ is not the value of the median, only the position of the median in the ranked data.

Mode

- A measure of central tendency
- Value that occurs most often
- Not affected by extreme values
- Used for either numerical or categorical data
- There may be no mode
- There may be several modes

[Figure: two dot plots. On an axis from 0 to 14, Mode = 9. On an axis from 0 to 6, No Mode.]

Review Example: Summary Statistics

Five houses on a hill by the beach.

House Prices: $2,000,000; $500,000; $300,000; $100,000; $100,000
Sum: $3,000,000

- Mean: $3,000,000/5 = $600,000
- Median: middle value of ranked data = $300,000
- Mode: most frequent value = $100,000

Which measure of location is the "best"?

- Mean is generally used, unless extreme values (outliers) exist
- If outliers exist, then the median is often used, since the median is not sensitive to extreme values
- Example: median home prices may be reported for a region, since they are less sensitive to outliers

Shape of a Distribution

- Describes how data are distributed
- Measures of shape: symmetric or skewed

Left-Skewed       Symmetric         Right-Skewed
Mean < Median     Mean = Median     Median < Mean

Measures of Variability

- Variation: Range, Variance, Standard Deviation, Coefficient of Variation
- Measures of variation give information on the spread or variability of the data values.

Range

- Simplest measure of variation
- Difference between the largest and the smallest observations:

$$\text{Range} = x_{\text{largest}} - x_{\text{smallest}}$$
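As a quick check of the review example, the sketch below recomputes the mean, median, mode, and range of the five house prices with Python's standard library. The helpers `modes` and `median_position` are hypothetical names introduced only for this illustration, and `modes` assumes the convention that data in which no value repeats has no mode.

```python
from collections import Counter
from statistics import mean, median


def modes(values):
    """Return all most-frequent values, or [] when no value repeats.

    Hypothetical helper: assumes a non-empty list and treats data with
    no repeated value as having "no mode".
    """
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)


def median_position(n):
    """Position (1-based) of the median in the ranked data: (n + 1) / 2."""
    return (n + 1) / 2


house_prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

house_mean = mean(house_prices)        # sum of values / number of values
house_median = median(house_prices)    # middle value of the ranked data
house_modes = modes(house_prices)      # most frequent value(s)
house_range = max(house_prices) - min(house_prices)  # largest - smallest

print("mean:", house_mean)
print("median:", house_median, "at position", median_position(len(house_prices)))
print("mode(s):", house_modes)
print("range:", house_range)

# Effect of an outlier: the mean moves, the median does not.
base = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 10]
print("mean:", mean(base), "->", mean(with_outlier))
print("median:", median(base), "->", median(with_outlier))
```

The single $2,000,000 house pulls the mean well above the median, which is the reason given above for preferring the median when outliers are present.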
Range (example)

For data plotted on an axis from 0 to 14, with smallest value 1 and largest value 14:

Range = 14 - 1 = 13

[Figure: distributions with the same center, different variation]

Disadvantages of the Range

- Ignores the way in which data are distributed

  [Figure: two dot plots on an axis from 7 to 12 with differently distributed values; in both cases Range = 12 - 7 = 5]

- Sensitive to outliers

  1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
  Range = 5 - 1 = 4

  1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
  Range = 120 - 1 = 119

Quartiles

Quartiles split the ranked data into 4 segments with an equal number of values per segment:

25% | Q1 | 25% | Q2 | 25% | Q3 | 25%

- The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
- Q2 is the same as the median (50% are smaller, 50% are larger)
- Only 25% of the observations are greater than the third quartile, Q3

Population Variance

Average of squared deviations of values from the mean.

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

where $\mu$ = population mean, $N$ = population size, and $x_i$ = ith value of the variable x.

Sample Variance

Average (approximately) of squared deviations of values from the mean (why n - 1?).

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

where $\bar{x}$ = arithmetic mean, $n$ = sample size, and $x_i$ = ith value of the variable x.

Population Standard Deviation

- Most commonly used measure of variation
- Shows variation about the mean
- Has the same units as the original data

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$

Sample Standard Deviation

- Most commonly used measure of variation
- Shows variation about the mean
- Has the same units as the original data

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Measuring variation

[Figure: two distributions, one with a small standard deviation and one with a large standard deviation]

Calculation Example: Sample Standard Deviation

Sample Data ($x_i$): 10 12 14 15 17 18 18 24

n = 8, Mean = $\bar{x}$ = 16

$$s = \sqrt{\frac{(10 - \bar{x})^2 + (12 - \bar{x})^2 + (14 - \bar{x})^2 + \cdots + (24 - \bar{x})^2}{n - 1}}$$

$$= \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}} = \sqrt{\frac{130}{7}} = 4.3095$$

This is a measure of the "average" scatter around the mean.

Comparing Standard Deviations

[Figure: three dot plots on an axis from 11 to 21]

- Data A: Mean = 15.5, s = 3.338
- Data B: Mean = 15.5, s = 0.926
- Data C: Mean = 15.5, s = 4.570

Advantages of Variance and Standard Deviation

- Each value in the data set is used in the calculation
- Values far from the mean are given extra weight (because deviations from the mean are squared)

Measures of Variation: Summary Characteristics

- The more the data are spread out, the greater the range, variance, and standard deviation.
- The more the data are concentrated, the smaller the range, variance, and standard deviation.
- If the values are all the same (no variation), all these measures will be zero.
- None of these measures are ever negative.
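The sample standard deviation for the eight-value sample (10, 12, 14, 15, 17, 18, 18, 24) can be recomputed step by step. The eight squared deviations from the mean of 16 are 36, 16, 4, 1, 1, 4, 4, and 64, which sum to 130, so $s = \sqrt{130/7} \approx 4.3095$. The sketch below follows the sample variance formula directly and compares the result with `statistics.stdev`; the variable names are chosen only for this illustration.

```python
from math import sqrt
from statistics import stdev

sample = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(sample)                      # sample size
x_bar = sum(sample) / n              # arithmetic mean
squared_devs = [(x - x_bar) ** 2 for x in sample]
ss = sum(squared_devs)               # sum of squared deviations

s2 = ss / (n - 1)                    # sample variance (divisor n - 1)
s = sqrt(s2)                         # sample standard deviation

# If the same eight values were the whole population, the divisor would be N.
sigma_if_population = sqrt(ss / n)

data_range = max(sample) - min(sample)

print("mean:", x_bar)
print("sum of squared deviations:", ss)
print("sample variance:", round(s2, 4))
print("sample standard deviation:", round(s, 4))
print("population-style standard deviation:", round(sigma_if_population, 4))
print("range:", data_range)

# The library routine uses the same n - 1 divisor.
assert abs(s - stdev(sample)) < 1e-9
```

Dividing by n - 1 rather than n always gives the larger of the two values, which is visible here in the gap between the sample and population-style results.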
Coefficient of Variation

- Measures relative variation
- Always in percentage (%)
- Shows variation relative to mean
- Can be used to compare two or more sets of data measured in different units

$$CV = \left(\frac{s}{\bar{x}}\right) \cdot 100\%$$

Comparing Coefficient of Variation

Stock A:
- Average price last year = $50
- Standard deviation = $5

$$CV_A = \left(\frac{s}{\bar{x}}\right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\%$$

Stock B:
- Average price last year = $100
- Standard deviation = $5

$$CV_B = \left(\frac{s}{\bar{x}}\right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\%$$

Both stocks have the same standard deviation, but stock B is less variable relative to its price.

Using Microsoft Excel

- Descriptive statistics can be obtained from Microsoft® Excel
- Use menu choice: tools / data analysis / descriptive statistics
- Enter details in dialog box

Using Excel (continued)

- Use menu choice: tools / data analysis / descriptive statistics
- Enter dialog box details
- Check box for summary statistics
- Click OK

Excel output

Microsoft Excel descriptive statistics output, using the house price data:

House Prices: $2,000,000; $500,000; $300,000; $100,000; $100,000
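Where Excel is not at hand, a comparable summary of the house price data can be sketched with Python's `statistics` module. This is an illustration under assumptions, not a reproduction of Excel's output: the selection of measures, the dictionary layout, and the helper `coefficient_of_variation` are choices made here. The same helper also rechecks the two stock examples.

```python
from statistics import mean, median, mode, stdev, variance


def coefficient_of_variation(s, x_bar):
    """CV = (s / x_bar) * 100%, returned as a percentage (hypothetical helper)."""
    return 100 * s / x_bar


# Stock examples: same standard deviation, different average price.
cv_a = coefficient_of_variation(5, 50)    # Stock A
cv_b = coefficient_of_variation(5, 100)   # Stock B
print(f"CV_A = {cv_a:.0f}%, CV_B = {cv_b:.0f}%")

# Summary measures for the five house prices.
house_prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

summary = {
    "count": len(house_prices),
    "sum": sum(house_prices),
    "mean": mean(house_prices),
    "median": median(house_prices),
    "mode": mode(house_prices),
    "minimum": min(house_prices),
    "maximum": max(house_prices),
    "range": max(house_prices) - min(house_prices),
    "sample variance": variance(house_prices),          # divisor n - 1
    "sample standard deviation": stdev(house_prices),   # divisor n - 1
}
summary["cv (%)"] = coefficient_of_variation(
    summary["sample standard deviation"], summary["mean"]
)

for name, value in summary.items():
    print(f"{name:>26}: {value:,.2f}")
```

For these five prices the sample standard deviation exceeds the mean, so the coefficient of variation is above 100%, which reflects how strongly the single $2,000,000 house dominates the spread.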