Measures of Variability (Dispersion, Spread)
1. Range
2. Inter-Quartile Range
3. Variance, standard deviation
4. Pseudo-standard deviation

Measures of Central Location
1. Mean
2. Median

1. Range
R = Range = max - min

2. Inter-Quartile Range (IQR)
Inter-Quartile Range = IQR = Q3 - Q1

Example
The data on Verbal IQ for n = 23 students, arranged in increasing order, are:
80 82 84 86 86 89 90 94 94 95 95 96 99 99 102 102 104 105 105 109 111 118 119
min = 80, Q1 = 89, Q2 = 96, Q3 = 105, max = 119

Range and IQR
Range = max - min = 119 - 80 = 39
Inter-Quartile Range = IQR = Q3 - Q1 = 105 - 89 = 16

3. Sample Variance
Let $x_1, x_2, x_3, \ldots, x_n$ denote a set of n numbers. Recall that the mean of the n numbers is defined as:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + x_3 + \cdots + x_{n-1} + x_n}{n} \]
The numbers
\[ d_1 = x_1 - \bar{x},\quad d_2 = x_2 - \bar{x},\quad d_3 = x_3 - \bar{x},\quad \ldots,\quad d_n = x_n - \bar{x} \]
are called deviations from the mean.
The sum
\[ \sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
is called the sum of squares of deviations from the mean. Writing it out in full:
\[ d_1^2 + d_2^2 + d_3^2 + \cdots + d_n^2 \quad\text{or}\quad (x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2 \]

The Sample Variance
is defined as the quantity:
\[ s^2 = \frac{\sum_{i=1}^{n} d_i^2}{n-1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \]
and is denoted by the symbol $s^2$.

The Sample Standard Deviation s
Definition: The Sample Standard Deviation is defined by:
\[ s = \sqrt{\frac{\sum_{i=1}^{n} d_i^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \]
Hence the Sample Standard Deviation, s, is the square root of the sample variance.

Example
Let $x_1, x_2, x_3, x_4, x_5$ denote the set of 5 numbers in the following table.

i      1    2    3    4    5
x_i    10   15   21   7    13

Then
\[ \sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5 = 10 + 15 + 21 + 7 + 13 = 66 \]
and
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{66}{5} = 13.2 \]
The deviations from the mean $d_1, d_2, d_3, d_4, d_5$ are given in the following table.
i    x_i    $d_i = x_i - \bar{x}$    $d_i^2 = (x_i - \bar{x})^2$
1    10     -3.2                     10.24
2    15      1.8                      3.24
3    21      7.8                     60.84
4     7     -6.2                     38.44
5    13     -0.2                      0.04

The sum
\[ \sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 = (-3.2)^2 + (1.8)^2 + (7.8)^2 + (-6.2)^2 + (-0.2)^2 = 10.24 + 3.24 + 60.84 + 38.44 + 0.04 = 112.80 \]
and
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{112.8}{4} = 28.2 \]
Also the standard deviation is:
\[ s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{112.8}{4}} = \sqrt{28.2} = 5.31 \]

Interpretations of s
• In Normal distributions
  – Approximately 2/3 of the observations will lie within one standard deviation of the mean
  – Approximately 95% of the observations lie within two standard deviations of the mean
  – In a histogram of the Normal distribution, the standard deviation is approximately the distance from the mode to the inflection point
[Figure: Normal curve with the mode and the inflection point marked; the horizontal distance between them is s. About 2/3 of the area lies within s of the centre and about 95% within 2s.]

Example
A researcher collected data on 1500 males aged 60-65. The variables measured were cholesterol and blood pressure.
  – The mean blood pressure was 155 with a standard deviation of 12.
  – The mean cholesterol level was 230 with a standard deviation of 15.
  – In both cases the data were normally distributed.

Interpretation of these numbers
• Blood pressure levels vary about the value 155 in males aged 60-65.
• Cholesterol levels vary about the value 230 in males aged 60-65.
• About 2/3 of males aged 60-65 have blood pressure within 12 of 155, i.e. between 155 - 12 = 143 and 155 + 12 = 167.
• About 2/3 of males aged 60-65 have cholesterol within 15 of 230, i.e. between 230 - 15 = 215 and 230 + 15 = 245.
• About 95% of males aged 60-65 have blood pressure within 2(12) = 24 of 155, i.e. between 155 - 24 = 131 and 155 + 24 = 179.
• About 95% of males aged 60-65 have cholesterol within 2(15) = 30 of 230, i.e. between 230 - 30 = 200 and 230 + 30 = 260.

A computing formula for the sum of squares of deviations from the mean:
\[ \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
The difficulty with this formula is that $\bar{x}$ will usually have many decimals. As a result, each term in the above sum will also have many decimals.
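The sum of squares of deviations from the mean can also be computed using the following identity:
\[ \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \]
To use this identity we need to compute:
\[ \sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n \quad\text{and}\quad \sum_{i=1}^{n} x_i^2 = x_1^2 + x_2^2 + \cdots + x_n^2 \]
Then:
\[ \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \]
and
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1} \]
and
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}} \]

Example
The data on Verbal IQ for n = 23 students, arranged in increasing order, are:
80 82 84 86 86 89 90 94 94 95 95 96 99 99 102 102 104 105 105 109 111 118 119
\[ \sum_{i=1}^{n} x_i = 80 + 82 + 84 + 86 + 86 + 89 + 90 + 94 + 94 + 95 + 95 + 96 + 99 + 99 + 102 + 102 + 104 + 105 + 105 + 109 + 111 + 118 + 119 = 2244 \]
\[ \sum_{i=1}^{n} x_i^2 = 80^2 + 82^2 + 84^2 + 86^2 + 86^2 + 89^2 + 90^2 + 94^2 + 94^2 + 95^2 + 95^2 + 96^2 + 99^2 + 99^2 + 102^2 + 102^2 + 104^2 + 105^2 + 105^2 + 109^2 + 111^2 + 118^2 + 119^2 = 221494 \]
Then:
\[ \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = 221494 - \frac{(2244)^2}{23} = 2557.652 \]
You will obtain exactly the same answer if you use the left hand side of the equation.
and
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{221494 - \frac{(2244)^2}{23}}{22} = \frac{2557.652}{22} = 116.26 \]
Also
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{221494 - \frac{(2244)^2}{23}}{22}} = \sqrt{\frac{2557.652}{22}} = \sqrt{116.26} = 10.782 \]

A quick (rough) calculation of s
\[ s \approx \frac{\text{Range}}{4} \]
The reason for this is that nearly all (about 95%) of the observations are between $\bar{x} - 2s$ and $\bar{x} + 2s$.
Thus $\max \approx \bar{x} + 2s$ and $\min \approx \bar{x} - 2s$,
and $\text{Range} = \max - \min \approx (\bar{x} + 2s) - (\bar{x} - 2s) = 4s$.
Hence $s \approx \text{Range}/4$.

Example
Verbal IQ on n = 23 students: min = 80 and max = 119
\[ s \approx \frac{119 - 80}{4} = \frac{39}{4} = 9.75 \]
This compares with the exact value of s, which is 10.782. The rough method is useful for checking your calculation of s.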
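The Verbal IQ calculation above can be checked with a short Python sketch. It is only an illustration: the function names are made up for this example and are not part of the original slides. It computes the sum of squared deviations both from the definition and from the computing identity, then the sample standard deviation and the Range/4 rough check.

```python
import math

# Verbal IQ scores for the 23 students, as listed in the example.
verbal_iq = [80, 82, 84, 86, 86, 89, 90, 94, 94, 95, 95, 96,
             99, 99, 102, 102, 104, 105, 105, 109, 111, 118, 119]


def sum_sq_dev_definition(xs):
    """Sum of squared deviations from the mean, straight from the definition."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)


def sum_sq_dev_computing(xs):
    """Same quantity via the computing identity: sum(x^2) - (sum x)^2 / n."""
    return sum(x * x for x in xs) - sum(xs) ** 2 / len(xs)


def sample_sd(xs):
    """Sample standard deviation s (divisor n - 1)."""
    return math.sqrt(sum_sq_dev_computing(xs) / (len(xs) - 1))


def rough_sd(xs):
    """Rough check value for s: Range / 4."""
    return (max(xs) - min(xs)) / 4


ss = sum_sq_dev_computing(verbal_iq)
print(round(ss, 3))                          # 2557.652
print(round(ss / (len(verbal_iq) - 1), 2))   # 116.26
print(round(sample_sd(verbal_iq), 3))        # 10.782
print(rough_sd(verbal_iq))                   # 9.75
```

Both versions of the sum of squares agree up to floating-point rounding. The computing form needs only the two running totals, the sum of the values and the sum of their squares.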
The Pseudo Standard Deviation (PSD)
Definition: The Pseudo Standard Deviation (PSD) is defined by:
\[ \text{PSD} = \frac{\text{IQR}}{1.35} = \frac{\text{Inter-Quartile Range}}{1.35} \]

Properties
• For Normal distributions the pseudo standard deviation (PSD) and the standard deviation (s) will be approximately the same value
• For leptokurtic distributions the standard deviation (s) will be larger than the pseudo standard deviation (PSD)
• For platykurtic distributions the standard deviation (s) will be smaller than the pseudo standard deviation (PSD)

Example
Verbal IQ on n = 23 students
Inter-Quartile Range = IQR = Q3 - Q1 = 105 - 89 = 16
Pseudo standard deviation:
\[ \text{PSD} = \frac{\text{IQR}}{1.35} = \frac{16}{1.35} = 11.85 \]
This compares with the standard deviation s = 10.782.

Box-whisker Plots showing outliers
• An outlier is a “wild” observation in the data
• Outliers occur because
  – of errors (typographical and computational)
  – of extreme cases in the population
• We will now consider the drawing of box plots where outliers are identified

To draw a box plot we need to:
• Compute the Hinge (Median, Q2) and the Mid-hinges (first and third quartiles, Q1 and Q3)
• To identify outliers we will compute the inner and outer fences

The fences are like the fences at a prison. We expect the entire population to be within both sets of fences. If a member of the population is between the inner and outer fences it is a mild outlier. If a member of the population is outside of the outer fences it is an extreme outlier.

Lower outer fence F1 = Q1 - (3)IQR
Upper outer fence F2 = Q3 + (3)IQR
Lower inner fence f1 = Q1 - (1.5)IQR
Upper inner fence f2 = Q3 + (1.5)IQR

• Observations that are between the lower and upper inner fences are considered to be non-outliers.
• Observations that are outside the inner fences but not outside the outer fences are considered to be mild outliers.
• Observations that are outside the outer fences are considered to be extreme outliers.
• Mild outliers are plotted individually in a box plot using one symbol
• Extreme outliers are plotted individually in a box plot using a different symbol
• Non-outliers are represented with the box and whiskers, with
  – Max = largest observation within the fences
  – Min = smallest observation within the fences
[Figure: Box-whisker plot. The box and whiskers represent the data that are not outliers; mild outliers lie between the inner and outer fences, and an extreme outlier lies beyond the outer fence.]

Example
Data collected on n = 109 countries in 1995. Data collected on k = 25 variables.

The variables
1. Population Size (in 1000s)
2. Density = Number of people/Sq kilometer
3. Urban = percentage of population living in cities
4. Religion
5. lifeexpf = Average female life expectancy
6. lifeexpm = Average male life expectancy
7. literacy = % of population who read
8. pop_inc = % increase in population size (1995)
9. babymort = Infant mortality (deaths per 1000)
10. gdp_cap = Gross domestic product/capita
11. Region = Region or economic group
12. calories = Daily calorie intake
13. aids = Number of aids cases
14. birth_rt = Birth rate per 1000 people
15. death_rt = Death rate per 1000 people
16. aids_rt = Number of aids cases/100000 people
17. log_gdp = log10(gdp_cap)
18. log_aidsr = log10(aids_rt)
19. b_to_d = birth to death ratio
20. fertility = average number of children in family
21. log_pop = log10(population)
22. cropgrow = ??
23. lit_male = % of males who can read
24. lit_fema = % of females who can read
25. Climate = predominant climate

[Figure: The data file as it appears in SPSS]

Consider the data on infant mortality

Stem-Leaf diagram (stem = 10s, leaf = unit digit)
 0 | 4455555666666666777778888899
 1 | 0122223467799
 2 | 0001123555577788
 3 | 45567999
 4 | 135679
 5 | 011222347
 6 | 03678
 7 | 4556679
Stems 8 to 16 hold the remaining leaves, in order: 5, 4, 1569, 0022378, 46, 7, 8. Two of these nine stems are empty, and the last leaf is the largest observation, 168.

Summary Statistics
median = Q2 = 27
Quartiles
Lower quartile = Q1 = the median of the lower half
Upper quartile = Q3 = the median of the upper half
\[ Q_1 = \frac{12 + 12}{2} = 12, \qquad Q_3 = \frac{66 + 67}{2} = 66.5 \]
Interquartile range (IQR)
IQR = Q3 - Q1 = 66.5 - 12 = 54.5

The Outer Fences
lower = Q1 - 3(IQR) = 12 - 3(54.5) = -151.5
upper = Q3 + 3(IQR) = 66.5 + 3(54.5) = 230.0
No observations are outside of the outer fences

The Inner Fences
lower = Q1 - 1.5(IQR) = 12 - 1.5(54.5) = -69.75
upper = Q3 + 1.5(IQR) = 66.5 + 1.5(54.5) = 148.25
Only one observation (168, Afghanistan) is outside of the inner fences (a mild outlier)

[Figure: Box-Whisker Plot of Infant Mortality; horizontal axis Infant Mortality, 0 to 200]

Example 2
In this example we are looking at the weight gains (grams) for rats under six diets differing in level of protein (High or Low) and source of protein (Beef, Cereal, or Pork).
  – Ten test animals for each diet

Table: Gains in weight (grams) for rats under six diets differing in level of protein (High or Low) and source of protein (Beef, Cereal, or Pork)

Level      High    High    High    Low     Low     Low
Source     Beef    Cereal  Pork    Beef    Cereal  Pork
Diet       1       2       3       4       5       6
           73      98      94      90      107     49
           102     74      79      76      95      82
           118     56      96      90      97      73
           104     111     98      64      80      86
           81      95      102     86      98      81
           107     88      102     51      74      97
           100     82      108     72      74      106
           87      77      91      90      67      70
           117     86      120     95      89      61
           111     92      105     78      58      82
Median     103.0   87.0    100.0   82.0    84.5    81.5
Mean       100.0   85.9    99.5    79.2    83.9    78.7
IQR        24.0    18.0    11.0    18.0    23.0    16.0
PSD        17.78   13.33   8.15    13.33   17.04   11.85
Variance   229.11  225.66  119.17  192.84  246.77  273.79
Std. Dev.  15.14   15.02   10.92   13.89   15.71   16.55
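The quartile, PSD and fence rules used in these examples can be sketched in Python as follows. This is a minimal illustration, not the slides' own procedure. The helper names are hypothetical. The quartile rule assumes that Q1 and Q3 are the medians of the lower and upper halves, with the middle value left out of both halves when n is odd. That assumption reproduces Q1 = 89 and Q3 = 105 for the Verbal IQ data.

```python
def median(sorted_xs):
    """Median of an already sorted list."""
    n = len(sorted_xs)
    mid = n // 2
    if n % 2:
        return sorted_xs[mid]
    return (sorted_xs[mid - 1] + sorted_xs[mid]) / 2


def quartiles(xs):
    """Return (Q1, Q2, Q3), with Q1 and Q3 the medians of the two halves.

    Assumption: for odd n the middle value belongs to neither half.
    """
    s = sorted(xs)
    half = len(s) // 2
    return median(s[:half]), median(s), median(s[-half:])


def psd(q1, q3):
    """Pseudo standard deviation: IQR / 1.35."""
    return (q3 - q1) / 1.35


def classify(x, q1, q3):
    """Label x using the inner (1.5 IQR) and outer (3 IQR) fences."""
    iqr = q3 - q1
    if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
        return "extreme outlier"
    if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
        return "mild outlier"
    return "non-outlier"


# Diet 1 (high protein, beef) from the weight-gain table.
diet1 = [73, 102, 118, 104, 81, 107, 100, 87, 117, 111]
q1, q2, q3 = quartiles(diet1)
print(q1, q2, q3)               # 87 103.0 111
print(round(psd(q1, q3), 2))    # 17.78

# Infant mortality example: Q1 = 12, Q3 = 66.5 as given in the text.
print(classify(168, 12, 66.5))  # mild outlier
print(classify(27, 12, 66.5))   # non-outlier
print(classify(250, 12, 66.5))  # extreme outlier (250 is a made-up value)
```

Other quartile conventions exist and can give slightly different values of Q1 and Q3, so the fences depend on which rule is used.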
[Figure: Box Plots: Weight Gains for Six Diets. Vertical axis: Weight Gain (40 to 130). Horizontal axis: Diet 1 to 6, with diets 1-3 High Protein (Beef, Cereal, Pork) and diets 4-6 Low Protein (Beef, Cereal, Pork). Legend: whiskers = non-outlier max and non-outlier min; box = 25% and 75%; centre mark = median.]

Conclusions
• Weight gain is higher for the high protein meat diets
• Increasing the level of protein increases weight gain, but only if the source of protein is a meat source

Measures of Shape
• Skewness
[Figure: three densities labelled Symmetric, Positively skewed, Negatively skewed]
• Kurtosis
[Figure: three densities labelled Normal (mesokurtic), Platykurtic, Leptokurtic]

• Measure of Skewness, based on the sum of cubes
\[ \sum_{i=1}^{n} (x_i - \bar{x})^3 \]
• Measure of Kurtosis, based on the sum of 4th powers
\[ \sum_{i=1}^{n} (x_i - \bar{x})^4 \]

The Measure of Skewness
\[ g_1 = \frac{\sqrt{n}\,\sum_{i=1}^{n} (x_i - \bar{x})^3}{\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]^{3/2}} \]
The Measure of Kurtosis
\[ g_2 = \frac{n \sum_{i=1}^{n} (x_i - \bar{x})^4}{\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]^{2}} - 3 \]
The 3 is subtracted so that $g_2$ is zero for the normal distribution.

Interpretations of Measures of Shape
• Skewness
  – $g_1 > 0$: positively skewed
  – $g_1 = 0$: symmetric
  – $g_1 < 0$: negatively skewed
• Kurtosis
  – $g_2 < 0$: platykurtic
  – $g_2 = 0$: normal (mesokurtic)
  – $g_2 > 0$: leptokurtic

Descriptive techniques for Multivariate data
In most research situations data are collected on more than one variable (usually many variables).

Graphical Techniques
• The scatter plot
• The two dimensional Histogram

The Scatter Plot
For two variables X and Y we will have a measurement of each variable on each case: $(x_i, y_i)$, where
$x_i$ = the value of X for case i and $y_i$ = the value of Y for case i.
To construct a scatter plot we plot the points $(x_i, y_i)$ for each case on the X-Y plane.
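Returning briefly to the measures of shape defined earlier in this section, the sketch below computes $g_1$ and $g_2$ in Python. It assumes the moment-based forms $g_1 = \sqrt{n}\,\sum d_i^3 / (\sum d_i^2)^{3/2}$ and $g_2 = n \sum d_i^4 / (\sum d_i^2)^2 - 3$, where $d_i = x_i - \bar{x}$. The function name and both small data sets are invented for illustration.

```python
import math


def shape_measures(xs):
    """Return (g1, g2): skewness and kurtosis from sums of powers of deviations."""
    n = len(xs)
    mean = sum(xs) / n
    dev = [x - mean for x in xs]
    s2 = sum(d ** 2 for d in dev)   # sum of squares
    s3 = sum(d ** 3 for d in dev)   # sum of cubes
    s4 = sum(d ** 4 for d in dev)   # sum of 4th powers
    g1 = math.sqrt(n) * s3 / s2 ** 1.5
    g2 = n * s4 / s2 ** 2 - 3
    return g1, g2


# Made-up symmetric data: the cubes cancel, so g1 is 0.
print(shape_measures([1, 2, 3, 4, 5]))

# Made-up data with one large value on the right: g1 comes out positive.
print(shape_measures([1, 1, 1, 2, 10])[0] > 0)   # True
```

Because the cubes keep the sign of each deviation, a long right tail pushes $g_1$ above zero and a long left tail pushes it below zero.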
[Figure: a single point $(x_i, y_i)$ plotted on the X-Y plane, with $x_i$ marked on the horizontal axis and $y_i$ on the vertical axis]

Data Set #3
The following table gives data on Verbal IQ, Math IQ, Initial Reading Achievement Score, and Final Reading Achievement Score for 23 students who have recently completed a reading improvement program.

Student   Verbal IQ   Math IQ   Initial Reading   Final Reading
                                Achievement       Achievement
1         86          94        1.1               1.7
2         104         103       1.5               1.7
3         86          92        1.5               1.9
4         105         100       2.0               2.0
5         118         115       1.9               3.5
6         96          102       1.4               2.4
7         90          87        1.5               1.8
8         95          100       1.4               2.0
9         105         96        1.7               1.7
10        84          80        1.6               1.7
11        94          87        1.6               1.7
12        119         116       1.7               3.1
13        82          91        1.2               1.8
14        80          93        1.0               1.7
15        109         124       1.8               2.5
16        111         119       1.4               3.0
17        89          94        1.6               1.8
18        99          117       1.6               2.6
19        94          93        1.4               1.4
20        99          110       1.4               2.0
21        95          97        1.5               1.3
22        102         104       1.7               3.1
23        102         93        1.6               1.9

[Figure: Scatter Plot of Math IQ (vertical axis) against Verbal IQ (horizontal axis), shown first on axes from 0 to 140, then with the point (84, 80) labelled, then on axes from 60 to 130]

Some Scatter Patterns
[Figure: example scatter plots]
• Circular
• No relationship between X and Y
• Unable to predict Y from X

[Figure: example scatter plots]
• Ellipsoidal
• Positive relationship between X and Y
• Increases in X correspond to increases in Y (but not always)
• Major axis of the ellipse has positive slope

Example: Verbal IQ, Math IQ
[Figure: Scatter Plot of Math IQ against Verbal IQ, axes from 60 to 130]

Some More Patterns
[Figure: example scatter plots]
• Ellipsoidal (thinner ellipse)
• Stronger positive relationship between X and Y
• Increases in X correspond to increases in Y (more frequently)
• Major axis of the ellipse has positive slope
• Minor axis of the ellipse much smaller

[Figure: example scatter plot]
• Increased strength in the positive relationship between X and Y
• Increases in X correspond to increases in Y (almost always)
• Minor axis of the ellipse extremely small in relation to the major axis of the ellipse

[Figure: example scatter plots]
• Perfect positive relationship between X and Y
• Y perfectly predictable from X
• Data fall exactly along a straight line with positive slope

[Figure: example scatter plots]
• Ellipsoidal
• Negative relationship between X and Y
• Increases in X correspond to decreases in Y (but not always)
• Major axis of the ellipse has negative slope

[Figure: example scatter plots of increasing strength]
• The strength of the relationship can increase until changes in Y can be perfectly predicted from X

Some Non-Linear Patterns
[Figure: example scatter plots with curved patterns]
• In a linear pattern Y increases with respect to X at a constant rate
• In a non-linear pattern the rate at which Y increases with respect to X is variable

Growth Patterns
[Figure: example growth curves]
• Growth patterns frequently follow a sigmoid curve
[Figure: sigmoid growth curve]
• Growth at the start is slow
• It then speeds up
• It slows down again as it reaches its limiting size

Measures of strength of a relationship (Correlation)
• Pearson’s correlation coefficient (r)
• Spearman’s rank correlation coefficient (rho, ρ)
Assume that we have collected data on two variables X and Y.
Let $(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$ denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population).
From this data we can compute summary statistics for each variable.
The means
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \quad\text{and}\quad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} \]
The standard deviations
\[ s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \quad\text{and}\quad s_y = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} \]
These statistics, $\bar{x}$, $s_x$, $\bar{y}$ and $s_y$,
• give information for each variable separately but
• give no information about the relationship between the two variables

Consider the statistics:
\[ S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]
The first two statistics,
\[ S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 \quad\text{and}\quad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \]
• are used to measure variability in each variable
• are used to compute the sample standard deviations
\[ s_x = \sqrt{\frac{S_{xx}}{n-1}}, \qquad s_y = \sqrt{\frac{S_{yy}}{n-1}} \]
The third statistic,
\[ S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]
• is used to measure correlation
• If two variables are positively related, the sign of $x_i - \bar{x}$ will agree with the sign of $y_i - \bar{y}$
  – When $x_i - \bar{x}$ is positive, $y_i - \bar{y}$ will be positive.
  – When $x_i$ is above its mean, $y_i$ will be above its mean.
  – When $x_i - \bar{x}$ is negative, $y_i - \bar{y}$ will be negative.
  – When $x_i$ is below its mean, $y_i$ will be below its mean.
The product $(x_i - \bar{x})(y_i - \bar{y})$ will be positive for most cases.
This implies that the statistic $S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
• will be positive
• Most of the terms in this sum will be positive

On the other hand
• If two variables are negatively related, the sign of $y_i - \bar{y}$ will be opposite to the sign of $x_i - \bar{x}$
  – When $x_i - \bar{x}$ is positive, $y_i - \bar{y}$ will be negative.
  – When $x_i$ is above its mean, $y_i$ will be below its mean.
  – When $x_i - \bar{x}$ is negative, $y_i - \bar{y}$ will be positive.
  – When $x_i$ is below its mean, $y_i$ will be above its mean.
The product $(x_i - \bar{x})(y_i - \bar{y})$ will be negative for most cases.
This again implies that the statistic $S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
• will be negative
• Most of the terms in this sum will be negative

Pearson’s correlation coefficient is defined as below:
\[ r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \,\sum_{i=1}^{n} (y_i - \bar{y})^2}} \]
The denominator,
\[ \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \,\sum_{i=1}^{n} (y_i - \bar{y})^2}, \]
is always positive (provided neither variable is constant).
The numerator,
\[ \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \]
• is positive if there is a positive relationship between X and Y and
• negative if there is a negative relationship between X and Y.
• This property carries over to Pearson’s correlation coefficient r

Properties of Pearson’s correlation coefficient r
1. The value of r is always between -1 and +1.
2. If the relationship between X and Y is positive, then r will be positive.
3. If the relationship between X and Y is negative, then r will be negative.
4. If there is no linear relationship between X and Y, then r will be close to zero.
5. The value of r will be +1 if the points $(x_i, y_i)$ lie on a straight line with positive slope.
6. The value of r will be -1 if the points $(x_i, y_i)$ lie on a straight line with negative slope.
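The definition of r and properties 5 and 6 can be illustrated with a small Python sketch. The function name `pearson_r` and the three tiny data sets are hypothetical, chosen only so that the arithmetic is easy to follow. The function computes $S_{xx}$, $S_{yy}$ and $S_{xy}$ from the deviations and returns $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from the definitional sums S_xx, S_yy and S_xy."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)


x = [1, 2, 3, 4, 5]                              # made-up X values

# Points on a line with positive slope: r = +1 (up to rounding).
print(pearson_r(x, [2 * v + 1 for v in x]))

# Points on a line with negative slope: r = -1 (up to rounding).
print(pearson_r(x, [10 - 3 * v for v in x]))

# A positive but imperfect relationship: r lies strictly between 0 and 1.
print(pearson_r(x, [2, 1, 4, 3, 5]))
```

In the third case most of the products $(x_i - \bar{x})(y_i - \bar{y})$ are positive, so $S_{xy}$ and r are positive, but the points do not lie on a line and r stays below 1.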
[Figure: example scatter plots for r = 1, r = 0.95, r = 0.7, r = 0.4, r = 0, r = -0.4, r = -0.7, r = -0.8, r = -0.95 and r = -1]

Computing formulae for the statistics:
\[ S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]
\[ S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \]
\[ S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} \]
\[ S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} \]
To compute $S_{xx}$, $S_{yy}$ and $S_{xy}$, first compute
\[ A = \sum_{i=1}^{n} x_i, \quad B = \sum_{i=1}^{n} y_i, \quad C = \sum_{i=1}^{n} x_i^2, \quad D = \sum_{i=1}^{n} y_i^2, \quad E = \sum_{i=1}^{n} x_i y_i \]
Then
\[ S_{xx} = C - \frac{A^2}{n}, \qquad S_{yy} = D - \frac{B^2}{n}, \qquad S_{xy} = E - \frac{A \cdot B}{n} \]

Example: Verbal IQ, Math IQ
Data Set #3, tabulated earlier, gives Verbal IQ, Math IQ, Initial Reading Achievement Score, and Final Reading Achievement Score for 23 students who have recently completed a reading improvement program. Here x = Verbal IQ and y = Math IQ.
[Figure: Scatter Plot of Math IQ against Verbal IQ, axes from 60 to 130]

Now
\[ \sum_{i=1}^{n} x_i = 2244, \quad \sum_{i=1}^{n} x_i^2 = 221494, \quad \sum_{i=1}^{n} y_i = 2307, \quad \sum_{i=1}^{n} y_i^2 = 234363, \quad \sum_{i=1}^{n} x_i y_i = 227199 \]
Hence
\[ S_{xx} = 221494 - \frac{(2244)^2}{23} = 2557.652 \]
\[ S_{yy} = 234363 - \frac{(2307)^2}{23} = 2960.87 \]
\[ S_{xy} = 227199 - \frac{(2244)(2307)}{23} = 2116.043 \]
Thus Pearson’s correlation coefficient is:
\[ r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{2116.043}{\sqrt{(2557.652)(2960.87)}} = 0.769 \]
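The worked example can be reproduced with the computing formulae in a few lines of Python. This is a sketch under the assumption that x is Verbal IQ and y is Math IQ, taken in student order from Data Set #3. The names `correlation_sums` and `math_iq` are illustrative only.

```python
import math

# Verbal IQ (x) and Math IQ (y) for students 1 to 23 of Data Set #3.
verbal = [86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119,
          82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102]
math_iq = [94, 103, 92, 100, 115, 102, 87, 100, 96, 80, 87, 116,
           91, 93, 124, 119, 94, 117, 93, 110, 97, 104, 93]


def correlation_sums(xs, ys):
    """Return (S_xx, S_yy, S_xy) using the sums A, B, C, D and E."""
    n = len(xs)
    A = sum(xs)                               # sum of x
    B = sum(ys)                               # sum of y
    C = sum(x * x for x in xs)                # sum of x squared
    D = sum(y * y for y in ys)                # sum of y squared
    E = sum(x * y for x, y in zip(xs, ys))    # sum of products x*y
    return C - A * A / n, D - B * B / n, E - A * B / n


s_xx, s_yy, s_xy = correlation_sums(verbal, math_iq)
r = s_xy / math.sqrt(s_xx * s_yy)

print(round(s_xx, 3), round(s_yy, 3), round(s_xy, 3))   # 2557.652 2960.87 2116.043
print(round(r, 3))                                      # 0.769
```

Only the five running totals A to E are needed, so the calculation can be done in a single pass over the data.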