Download Computer Hardware - Computer Science@IUPUI

CSCI N207 Data Analysis Using Spreadsheet
10b. Univariate Analysis Part 2
Lingma Acheson
linglu@iupui.edu
Department of Computer and Information Science, IUPUI

The Range
• The difference between the minimum and maximum values in a data set.
• A larger range usually (but not always) indicates a larger spread, or deviation, in the values of the data set.
  (73, 66, 69, 67, 49, 60, 81, 71, 78, 62, 53, 87, 74, 65, 74, 50, 85, 45, 63, 100)
  Range: 100 – 45 = 55
• A single extreme low or high value can throw off the range, e.g.
  (20, 76, 77, 80, 82, 82, 84, 88, 90, 93, 99, 100)
  Range: 100 – 20 = 80

Variance
• One measure of the dispersion (deviation from the mean) of a data set: how far is each datum from the mean?
• Variance is the average squared distance to the mean.
• The larger the variance, the greater the average deviation of each datum from the mean (the values lie farther from the mean).

$$\text{Variance} = \frac{1}{N}\sum_{i=1}^{N}(m_i - \bar{m})^2 = \frac{(m_1 - \bar{m})^2 + (m_2 - \bar{m})^2 + (m_3 - \bar{m})^2 + \dots + (m_N - \bar{m})^2}{N}$$

where $\bar{m}$ is the average value of the data set.

• E.g. 73, 67, 70, 67, 49, 60, 81, 71, 78, 62, 53, 87, 72, 65, 74, 50, 84, 45, 62, 100 (average 68.5)
  Variance = ((73-68.5)^2 + (67-68.5)^2 + (70-68.5)^2 + … + (100-68.5)^2)/20

Excel Functions:
VARP() – variance of the whole population (the data set is complete)
VAR() – variance estimated from a sample (the data set is a sample of the population; divides by N-1 instead of N)

Standard Deviation
• The square root of the variance, which undoes the squaring of the distances in the variance.
• Its magnitude is therefore more in line with the values in the data set.
• Can be thought of as roughly the average deviation from the mean of a data set.

$$\text{Standard Deviation} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(m_i - \bar{m})^2}$$

Excel Functions:
STDEVP() – use this when the data set is complete
STDEV() – use this when the data set is a sample

Frequency Tables
• Use a frequency table to observe the distribution of the data.
• E.g. consider the following data set:
  {45, 49, 50, 53, 60, 62, 63, 65, 66, 67, 69, 71, 73, 74, 74, 78, 81, 85, 87, 100}
• First decide how to group the data into bins.
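The grouping step can be sketched in Python. This is a minimal illustration rather than part of the original slides: the inclusive bin boundaries are an assumption chosen to match the category labels used in this section (0-50, 51-60, ..., >90), and `bins` and `frequency_table` are hypothetical names.

```python
# Scores from the example data set in this section.
scores = [45, 49, 50, 53, 60, 62, 63, 65, 66, 67,
          69, 71, 73, 74, 74, 78, 81, 85, 87, 100]

# Assumed bin layout: (label, lowest score, highest score), both ends inclusive.
bins = [
    ("0-50", 0, 50),
    ("51-60", 51, 60),
    ("61-70", 61, 70),
    ("71-80", 71, 80),
    ("81-90", 81, 90),
    (">90", 91, float("inf")),
]


def frequency_table(values, bin_edges):
    """Count how many values fall into each inclusive (low, high) bin."""
    return {
        label: sum(1 for v in values if low <= v <= high)
        for label, low, high in bin_edges
    }


table = frequency_table(scores, bins)
for label, count in table.items():
    print(f"{label:>6} {count}")
```

Changing the boundaries in `bins` changes the counts, which is why the choice of bins has to be made before the table is built.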
Category Labels   Frequency
0-50              3
51-60             2
61-70             6
71-80             5
81-90             3
>90               1

Histogram
• A histogram is simply a column chart of the frequency table.
[Figure: histogram of the frequency table above; x-axis: Scores (0-50 through >90), y-axis: Frequency (0 to 7)]

Data Distribution
Normal Distribution

Category Labels   Frequency
0-50              1
51-60             3
61-70             5
71-80             5
81-90             3
>90               1

[Figure: column chart of this table; y-axis from 0 to 6]

Normal Distributions
• The bell curve
  – Symmetrical
  – Mean ≈ Median

Skewed Distributions
• Most of the time, distributions are skewed.
• Positively skewed distribution: mean > median
• Negatively skewed distribution: mean < median

Data Distribution
{45, 49, 50, 53, 60, 62, 63, 65, 66, 67, 69, 71, 73, 74, 74, 78, 81, 85, 87, 100}
Average (68.6), Mode (74), and Median (68)
[Figure: histogram of these scores with markers at 55.14 (-1 SD) and 82.06 (+1 SD); x-axis: Scores, y-axis: Frequency (0 to 7)]

Standard Deviation
With a normal distribution:
mean ± 1 SD covers about 68% of the data
mean ± 2 SD covers about 95% of the data
mean ± 3 SD covers about 99.7% of the data
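As a rough check on the figures in this section, the sketch below recomputes the range, variance, and standard deviation of the example scores with Python's standard `statistics` module, then counts how much of the data falls within 1, 2, and 3 standard deviations of the mean. It is an illustration only: the slides use Excel, the pairing of Python functions with Excel functions in the comments is my own mapping, and `share_within` is a hypothetical helper name.

```python
import statistics

# Scores from the example data set in this section.
scores = [45, 49, 50, 53, 60, 62, 63, 65, 66, 67,
          69, 71, 73, 74, 74, 78, 81, 85, 87, 100]

value_range = max(scores) - min(scores)    # maximum minus minimum
mean = statistics.mean(scores)
pop_var = statistics.pvariance(scores)     # divides by N, like VARP()
sample_var = statistics.variance(scores)   # divides by N - 1, like VAR()
pop_sd = statistics.pstdev(scores)         # square root of pop_var, like STDEVP()


def share_within(values, center, spread, k):
    """Fraction of values lying within k * spread of center (inclusive)."""
    low, high = center - k * spread, center + k * spread
    return sum(1 for v in values if low <= v <= high) / len(values)


print(f"range = {value_range}, mean = {mean:.1f}")
print(f"population variance = {pop_var:.2f}, sample variance = {sample_var:.2f}")
print(f"population SD = {pop_sd:.2f}")
for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(scores, mean, pop_sd, k):.0%}")
```

For these 20 scores the shares come out to 65%, 95%, and 100%, close to but not exactly the 68%, 95%, and 99.7% of a true normal distribution, since the data set is small and only approximately bell-shaped.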
