Lecture 4

Measures of Variability
(Dispersion, Spread)
1. Range
2. Inter-Quartile Range
3. Variance, standard deviation
4. Pseudo-standard deviation
Measure of Central Location
1. Mean
2. Median
1. Range
R = Range = max - min
2. Inter-Quartile Range (IQR)
Inter-Quartile Range = IQR = Q3 - Q1
Example
The data Verbal IQ on n = 23 students
arranged in increasing order is:
80 82 84 86 86 89 90 94 94 95 95 96 99 99 102 102 104 105 105 109 111 118 119
min = 80
Q1 = 89
Q2 = 96
Q3 = 105
max = 119
Range and IQR
Range = max – min = 119 – 80 = 39
Inter-Quartile Range
= IQR = Q3 - Q1 = 105 – 89 = 16
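The range and IQR above can be reproduced with a minimal Python sketch. It assumes the quartiles are found by the median-of-halves rule (with the middle observation left out of both halves when n is odd), which matches the values Q1 = 89 and Q3 = 105 quoted above; the function name `quartiles` is illustrative only.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule.

    Assumption: for odd n the middle observation is excluded from both halves.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)

verbal_iq = [80, 82, 84, 86, 86, 89, 90, 94, 94, 95, 95, 96,
             99, 99, 102, 102, 104, 105, 105, 109, 111, 118, 119]

q1, q2, q3 = quartiles(verbal_iq)
data_range = max(verbal_iq) - min(verbal_iq)  # max - min
iqr = q3 - q1                                 # Q3 - Q1
print(q1, q2, q3, data_range, iqr)
```

Other quartile conventions exist and can give slightly different values on small samples.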
3. Sample Variance
Let x1, x2, x3, …, xn denote a set of n numbers.
Recall that the mean of the n numbers is defined as:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + x_3 + \cdots + x_{n-1} + x_n}{n}$$
The numbers
$$d_1 = x_1 - \bar{x}, \quad d_2 = x_2 - \bar{x}, \quad d_3 = x_3 - \bar{x}, \quad \ldots, \quad d_n = x_n - \bar{x}$$
are called deviations from the mean.
The sum
$$\sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2$$
is called the sum of squares of deviations from the mean.
Writing it out in full:
$$d_1^2 + d_2^2 + d_3^2 + \cdots + d_n^2$$
or
$$(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2$$
The Sample Variance
is defined as the quantity:
$$\frac{\sum_{i=1}^{n} d_i^2}{n-1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
and is denoted by the symbol $s^2$.
The Sample Standard Deviation s
Definition: The Sample Standard Deviation is defined by:
$$s = \sqrt{\frac{\sum_{i=1}^{n} d_i^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$
Hence the Sample Standard Deviation, s, is the square root of the sample variance.
Example
Let x1, x2, x3, x4, x5 denote the set of 5 numbers in the following table.

i     1    2    3    4    5
xi   10   15   21    7   13

Then
$$\sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5 = 10 + 15 + 21 + 7 + 13 = 66$$
and
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{66}{5} = 13.2$$
The deviations from the mean d1, d2, d3, d4, d5 are given in the following table.

i    xi    di = xi - x̄    di^2 = (xi - x̄)^2
1    10    -3.2            10.24
2    15     1.8             3.24
3    21     7.8            60.84
4     7    -6.2            38.44
5    13    -0.2             0.04
The sum
$$\sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 = (-3.2)^2 + (1.8)^2 + (7.8)^2 + (-6.2)^2 + (-0.2)^2$$
$$= 10.24 + 3.24 + 60.84 + 38.44 + 0.04 = 112.80$$
and
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{112.8}{4} = 28.2$$
Also the standard deviation is:
$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{112.8}{4}} = \sqrt{28.2} = 5.31$$
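As a check on the arithmetic, the worked example can be sketched in Python directly from the definitions (deviations, sum of squares, division by n - 1). The function name `sample_variance` is illustrative.

```python
from math import sqrt

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]
    return sum(d ** 2 for d in deviations) / (n - 1)

x = [10, 15, 21, 7, 13]
s2 = sample_variance(x)  # sample variance
s = sqrt(s2)             # sample standard deviation
print(round(s2, 2), round(s, 2))
```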
Interpretations of s
• In Normal distributions
– Approximately 2/3 of the observations will lie
within one standard deviation of the mean
– Approximately 95% of the observations lie
within two standard deviations of the mean
– In a histogram of the Normal distribution, the
standard deviation is approximately the
distance from the mode to the inflection point
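[Figure: Normal density curve labelled with the mode, an inflection point, and the distance s between them; the region within s on either side of the mean is labelled 2/3, and the distance 2s is also marked.]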
Example
A researcher collected data on 1500 males
aged 60-65.
The variables measured were cholesterol and blood pressure.
– The mean blood pressure was 155 with a
standard deviation of 12.
– The mean cholesterol level was 230 with a
standard deviation of 15
– In both cases the data was normally distributed
Interpretation of these numbers
• Blood pressure levels vary about the value
155 in males aged 60-65.
• Cholesterol levels vary about the value 230
in males aged 60-65.
• About 2/3 of males aged 60-65 have blood pressure within 12 of 155, i.e. between 155 - 12 = 143 and 155 + 12 = 167.
• About 2/3 of males aged 60-65 have cholesterol within 15 of 230, i.e. between 230 - 15 = 215 and 230 + 15 = 245.
• About 95% of males aged 60-65 have blood pressure within 2(12) = 24 of 155, i.e. between 155 - 24 = 131 and 155 + 24 = 179.
• About 95% of males aged 60-65 have cholesterol within 2(15) = 30 of 230, i.e. between 230 - 30 = 200 and 230 + 30 = 260.
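The "2/3" and "95%" figures used in these interpretations can be checked with a short sketch, assuming the data follow an exact Normal distribution with the stated mean and standard deviation. It uses `statistics.NormalDist` from the Python standard library; the helper name `fraction_within` is illustrative.

```python
from statistics import NormalDist

def fraction_within(mean, sd, k):
    """Fraction of a Normal(mean, sd) population within k standard deviations of the mean."""
    dist = NormalDist(mean, sd)
    return dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)

# Blood pressure: mean 155, standard deviation 12
bp_one = fraction_within(155, 12, 1)   # between 143 and 167
bp_two = fraction_within(155, 12, 2)   # between 131 and 179

# Cholesterol: mean 230, standard deviation 15, two-standard-deviation interval
chol_interval = (230 - 2 * 15, 230 + 2 * 15)

print(round(bp_one, 3), round(bp_two, 3), chol_interval)
```

The one-standard-deviation fraction is about 0.68, which the lecture rounds to 2/3.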
A computing formula for the sum of squares of deviations from the mean:
$$\sum_{i=1}^{n} (x_i - \bar{x})^2$$
The difficulty with this formula is that $\bar{x}$ will have many decimals.
The result will be that each term in the above sum will also have many decimals.
The sum of squares of deviations from the mean can also be computed using the following identity:
$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$
To use this identity we need to compute:
$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n \quad \text{and} \quad \sum_{i=1}^{n} x_i^2 = x_1^2 + x_2^2 + \cdots + x_n^2$$
Then:
$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$
and
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}$$
and
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}}$$
Example
The data Verbal IQ on n = 23 students
arranged in increasing order is:
80 82 84 86 86 89 90 94
94 95 95 96 99 99 102 102
104 105 105 109 111 118 119
$$\sum_{i=1}^{n} x_i = 80 + 82 + 84 + 86 + 86 + 89 + 90 + 94 + 94 + 95 + 95 + 96 + 99 + 99 + 102 + 102 + 104 + 105 + 105 + 109 + 111 + 118 + 119 = 2244$$
$$\sum_{i=1}^{n} x_i^2 = 80^2 + 82^2 + 84^2 + 86^2 + 86^2 + 89^2 + 90^2 + 94^2 + 94^2 + 95^2 + 95^2 + 96^2 + 99^2 + 99^2 + 102^2 + 102^2 + 104^2 + 105^2 + 105^2 + 109^2 + 111^2 + 118^2 + 119^2 = 221494$$
Then:
$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = 221494 - \frac{2244^2}{23} = 2557.652$$
You will obtain exactly the same answer if you use the left hand side of the equation
and
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1} = \frac{221494 - \frac{2244^2}{23}}{22} = \frac{2557.652}{22} = 116.26$$
Also
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}} = \sqrt{\frac{221494 - \frac{2244^2}{23}}{22}} = \sqrt{\frac{2557.652}{22}} = \sqrt{116.26} = 10.782$$
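The agreement between the computing formula and the definition can be sketched in Python on the Verbal IQ data. Variable names such as `ss_computing` are illustrative; the two routes agree up to floating-point rounding.

```python
from math import sqrt

verbal_iq = [80, 82, 84, 86, 86, 89, 90, 94, 94, 95, 95, 96,
             99, 99, 102, 102, 104, 105, 105, 109, 111, 118, 119]

n = len(verbal_iq)
sum_x = sum(verbal_iq)                    # sum of x_i
sum_x2 = sum(x * x for x in verbal_iq)    # sum of x_i squared

# Computing formula: sum of x_i^2 minus (sum of x_i)^2 / n
ss_computing = sum_x2 - sum_x ** 2 / n

# Definition: sum of squared deviations from the mean
mean = sum_x / n
ss_definition = sum((x - mean) ** 2 for x in verbal_iq)

s2 = ss_computing / (n - 1)
s = sqrt(s2)
print(sum_x, sum_x2, round(ss_computing, 3), round(s2, 2), round(s, 3))
```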
A quick (rough) calculation of s
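$$s \approx \frac{\text{Range}}{4}$$
The reason for this is that approximately all (95%) of the observations are between $\bar{x} - 2s$ and $\bar{x} + 2s$.
Thus $\max \approx \bar{x} + 2s$ and $\min \approx \bar{x} - 2s$,
and $\text{Range} = \max - \min \approx (\bar{x} + 2s) - (\bar{x} - 2s) = 4s$.
Hence
$$s \approx \frac{\text{Range}}{4}$$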
Example
Verbal IQ on n = 23 students
min = 80 and max = 119
$$s \approx \frac{119 - 80}{4} = \frac{39}{4} = 9.75$$
This compares with the exact value of s
which is 10.782.
The rough method is useful for checking
your calculation of s.
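The rough check is simple enough to write as a one-line helper. This is only a sketch of the Range/4 rule of thumb; the function name `rough_sd` is illustrative.

```python
def rough_sd(minimum, maximum):
    """Rough estimate of s: the range divided by 4."""
    return (maximum - minimum) / 4

rough = rough_sd(80, 119)   # Verbal IQ: min = 80, max = 119
exact = 10.782              # exact value computed earlier in the lecture
print(rough, exact)
```

A large gap between the rough and exact values is a hint that the calculation of s should be rechecked.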
The Pseudo Standard Deviation (PSD)
Definition: The Pseudo Standard Deviation
(PSD) is defined by:
$$\text{PSD} = \frac{\text{IQR}}{1.35} = \frac{\text{Inter-Quartile Range}}{1.35}$$
Properties
• For Normal distributions the magnitude of the
pseudo standard deviation (PSD) and the standard
deviation (s) will be approximately the same value
• For leptokurtic distributions the standard deviation
(s) will be larger than the pseudo standard
deviation (PSD)
• For platykurtic distributions the standard deviation
(s) will be smaller than the pseudo standard
deviation (PSD)
Example
Verbal IQ on n = 23 students
Inter-Quartile Range
= IQR = Q3 - Q1 = 105 – 89 = 16
Pseudo standard deviation
$$\text{PSD} = \frac{\text{IQR}}{1.35} = \frac{16}{1.35} = 11.85$$
This compares with the standard deviation
s = 10.782
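A short sketch of the PSD calculation follows. The lecture does not derive the divisor 1.35; one plausible reading, shown as an assumption here, is that it is the IQR of a standard Normal distribution, which can be checked with `statistics.NormalDist.inv_cdf`.

```python
from statistics import NormalDist

def pseudo_sd(q1, q3):
    """Pseudo standard deviation: IQR divided by 1.35."""
    return (q3 - q1) / 1.35

# IQR of a standard Normal distribution (mean 0, standard deviation 1)
std_normal = NormalDist()
normal_iqr = std_normal.inv_cdf(0.75) - std_normal.inv_cdf(0.25)

psd = pseudo_sd(89, 105)   # Verbal IQ quartiles Q1 = 89, Q3 = 105
print(round(normal_iqr, 3), round(psd, 2))
```

Under that reading, dividing the IQR by 1.35 puts it on the same scale as s whenever the data are roughly Normal.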
Box-whisker Plots showing
outliers
• An outlier is a “wild” observation in the
data
• Outliers occur because
– of errors (typographical and computational)
– Extreme cases in the population
• We will now consider the drawing of boxplots where outliers are identified
To Draw a Box Plot we need to:
• Compute the Hinge (Median, Q2) and the
Mid-hinges (first & third quartiles – Q1
and Q3 )
• To identify outliers we will compute the
inner and outer fences
The fences are like the fences at a prison. We
expect the entire population to be within both
sets of fences.
If a member of the population is between the
inner and outer fences it is a mild outlier.
If a member of the population is outside of the
outer fences it is an extreme outlier.
Lower outer fence
F1 = Q1 - (3)IQR
Upper outer fence
F2 = Q3 + (3)IQR
Lower inner fence
f1 = Q1 - (1.5)IQR
Upper inner fence
f2 = Q3 + (1.5)IQR
• Observations that are between the lower and upper inner fences are considered to be non-outliers.
• Observations that are outside the inner
fences but not outside the outer fences are
considered to be mild outliers.
• Observations that are outside outer fences
are considered to be extreme outliers.
• mild outliers are plotted individually in a
box-plot using the symbol
• extreme outliers are plotted individually in
a box-plot using the symbol
• non-outliers are represented with the box
and whiskers with
– Max = largest observation within the fences
– Min = smallest observation within the fences
[Figure: box-whisker plot representing the data that are not outliers, with the inner fences, an outer fence, mild outliers, and an extreme outlier labelled.]
Example
Data collected on n = 109 countries in 1995.
Data collected on k = 25 variables.
The variables
1. Population Size (in 1000s)
2. Density = Number of people/Sq kilometer
3. Urban = percentage of population living in
cities
4. Religion
5. lifeexpf = Average female life expectancy
6. lifeexpm = Average male life expectancy
7. literacy = % of population who read
8. pop_inc = % increase in popn size (1995)
9. babymort = Infant mortality (deaths per 1000)
10. gdp_cap = Gross domestic product/capita
11. Region = Region or economic group
12. calories = Daily calorie intake.
13. aids = Number of aids cases
14. birth_rt = Birth rate per 1000 people
15. death_rt = death rate per 1000 people
16. aids_rt = Number of aids cases/100000
people
17. log_gdp = log10(gdp_cap)
18. log_aidsr = log10(aids_rt)
19. b_to_d =birth to death ratio
20. fertility = average number of children in
family
21. log_pop = log10(population)
22. cropgrow = ??
23. lit_male = % of males who can read
24. lit_fema = % of females who can read
25. Climate = predominant climate
The data file as it appears in SPSS
Consider the data on infant mortality
Stem-Leaf diagram stem = 10s, leaf = unit digit
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
4455555666666666777778888899
0122223467799
0001123555577788
45567999
135679
011222347
03678
4556679
5
4
1569
0022378
46
7
8
Summary Statistics
median = Q2 = 27
Quartiles
Lower quartile = Q1 = the median of lower half
Upper quartile = Q3 = the median of upper half
$$Q_1 = \frac{12 + 12}{2} = 12, \quad Q_3 = \frac{66 + 67}{2} = 66.5$$
Interquartile range (IQR)
IQR = Q3 - Q1 = 66.5 - 12 = 54.5
The Outer Fences
lower = Q1 - 3(IQR) = 12 - 3(54.5) = -151.5
upper = Q3 + 3(IQR) = 66.5 + 3(54.5) = 230.0
No observations are outside of the outer fences
The Inner Fences
lower = Q1 - 1.5(IQR) = 12 - 1.5(54.5) = -69.75
upper = Q3 + 1.5(IQR) = 66.5 + 1.5(54.5) = 148.25
Only one observation (168 – Afghanistan) is
outside of the inner fences – (mild outlier)
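The fence rules can be sketched as two small Python functions, using the quartiles Q1 = 12 and Q3 = 66.5 from this example. The function names `fences` and `classify` are illustrative, not part of any package.

```python
def fences(q1, q3):
    """Return the inner and outer fences as (lower, upper) pairs."""
    iqr = q3 - q1
    return {
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3 * iqr, q3 + 3 * iqr),
    }

def classify(value, q1, q3):
    """Label an observation as a non-outlier, mild outlier, or extreme outlier."""
    f = fences(q1, q3)
    inner_low, inner_high = f["inner"]
    outer_low, outer_high = f["outer"]
    if value < outer_low or value > outer_high:
        return "extreme outlier"
    if value < inner_low or value > inner_high:
        return "mild outlier"
    return "non-outlier"

print(fences(12, 66.5))
print(classify(168, 12, 66.5))   # the Afghanistan observation
print(classify(27, 12, 66.5))    # the median
```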
Box-Whisker Plot of Infant Mortality
[Figure: box-whisker plot; horizontal axis: Infant Mortality, from 0 to 200.]
Example 2
In this example we are looking at the weight
gains (grams) for rats under six diets differing
in level of protein (High or Low) and source
of protein (Beef, Cereal, or Pork).
– Ten test animals for each diet
Table
Gains in weight (grams) for rats under six diets differing in level of protein (High or Low) and source of protein (Beef, Cereal, or Pork)

Level      High Protein                Low Protein
Source     Beef    Cereal  Pork        Beef    Cereal  Pork
Diet       1       2       3           4       5       6
           73      98      94          90      107     49
           102     74      79          76      95      82
           118     56      96          90      97      73
           104     111     98          64      80      86
           81      95      102         86      98      81
           107     88      102         51      74      97
           100     82      108         72      74      106
           87      77      91          90      67      70
           117     86      120         95      89      61
           111     92      105         78      58      82
Median     103.0   87.0    100.0       82.0    84.5    81.5
Mean       100.0   85.9    99.5        79.2    83.9    78.7
IQR        24.0    18.0    11.0        18.0    23.0    16.0
PSD        17.78   13.33   8.15        13.33   17.04   11.85
Variance   229.11  225.66  119.17      192.84  246.77  273.79
Std. Dev.  15.14   15.02   10.92       13.89   15.71   16.55
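The summary rows of the table can be reproduced for a single diet with the Python standard library. The sketch below uses the ten weight gains for Diet 1 (high protein, beef) and assumes the quartiles are the medians of the lower and upper halves of the sorted data; the helper name `iqr_by_halves` is illustrative.

```python
from statistics import mean, median, stdev, variance

diet1 = [73, 102, 118, 104, 81, 107, 100, 87, 117, 111]

def iqr_by_halves(values):
    """IQR using the median of the lower half and of the upper half."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(upper) - median(lower)

summary = {
    "median": median(diet1),
    "mean": mean(diet1),
    "iqr": iqr_by_halves(diet1),
    "psd": iqr_by_halves(diet1) / 1.35,
    "variance": variance(diet1),   # divides by n - 1
    "std_dev": stdev(diet1),
}
for name, value in summary.items():
    print(name, round(value, 2))
```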
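Box Plots: Weight Gains for Six Diets
[Figure: box plots of Weight Gain (vertical axis, 40 to 130) for Diets 1 to 6 (Beef, Cereal, and Pork under High Protein, then Beef, Cereal, and Pork under Low Protein). Each box shows the median and the 25% and 75% points; the whiskers show the non-outlier minimum and maximum.]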
Conclusions
• Weight gain is higher for the high protein
meat diets
• Increasing the level of protein - increases
weight gain but only if source of protein is a
meat source
Measures of Shape
• Skewness
[Figure: three density curves labelled Symmetric, Positively skewed, and Negatively skewed.]
• Kurtosis
[Figure: three density curves labelled Normal (mesokurtic), Platykurtic, and Leptokurtic.]
• Measure of Skewness, based on the sum of cubes
$$\sum_{i=1}^{n} (x_i - \bar{x})^3$$
• Measure of Kurtosis, based on the sum of 4th powers
$$\sum_{i=1}^{n} (x_i - \bar{x})^4$$
The Measure of Skewness
$$g_1 = \frac{\sqrt{n} \, \sum_{i=1}^{n} (x_i - \bar{x})^3}{\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]^{3/2}}$$
The Measure of Kurtosis
$$g_2 = \frac{n \sum_{i=1}^{n} (x_i - \bar{x})^4}{\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]^{2}} - 3$$
The 3 is subtracted so that g2 is zero for the
normal distribution
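A minimal sketch of the two shape measures follows. It assumes the usual moment-based forms, g1 = sqrt(n) * sum(d^3) / (sum(d^2))^(3/2) and g2 = n * sum(d^4) / (sum(d^2))^2 - 3, where d is a deviation from the mean. The function name `shape_measures` and the two small data sets are made up for illustration.

```python
def shape_measures(values):
    """Return (g1, g2): moment-based skewness and excess kurtosis."""
    n = len(values)
    mean = sum(values) / n
    d = [x - mean for x in values]
    sum2 = sum(v ** 2 for v in d)   # sum of squared deviations
    sum3 = sum(v ** 3 for v in d)   # sum of cubed deviations
    sum4 = sum(v ** 4 for v in d)   # sum of 4th powers of deviations
    g1 = n ** 0.5 * sum3 / sum2 ** 1.5
    g2 = n * sum4 / sum2 ** 2 - 3
    return g1, g2

# Illustrative data only: a symmetric set and a set with a long right tail
g1_sym, g2_sym = shape_measures([1, 2, 3, 4, 5])
g1_right, g2_right = shape_measures([1, 1, 1, 2, 10])
print(round(g1_sym, 3), round(g2_sym, 3), round(g1_right, 3))
```

The symmetric set gives g1 = 0, and the set with a long right tail gives g1 > 0.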
Interpretations of Measures of Shape
• Skewness
[Figure: three density curves labelled g1 > 0 (positively skewed), g1 = 0 (symmetric), and g1 < 0 (negatively skewed).]
• Kurtosis
[Figure: three density curves labelled g2 < 0 (platykurtic), g2 = 0 (Normal), and g2 > 0 (leptokurtic).]
Descriptive techniques for
Multivariate data
In most research situations data is collected
on more than one variable (usually many
variables)
Graphical Techniques
• The scatter plot
• The two dimensional Histogram
The Scatter Plot
For two variables X and Y we will have a measurement for each variable on each case:
xi, yi
xi = the value of X for case i
and
yi = the value of Y for case i.
To construct a scatter plot we plot the points:
(xi, yi)
for each case on the X-Y plane.
[Figure: a single point (xi, yi) plotted on the X-Y plane, with xi marked on the horizontal axis and yi on the vertical axis.]
Data Set #3
The following table gives data on Verbal IQ, Math IQ, Initial Reading Achievement Score, and Final Reading Achievement Score for 23 students who have recently completed a reading improvement program.

Student  Verbal IQ  Math IQ  Initial Reading Achievement  Final Reading Achievement
1        86         94       1.1                          1.7
2        104        103      1.5                          1.7
3        86         92       1.5                          1.9
4        105        100      2.0                          2.0
5        118        115      1.9                          3.5
6        96         102      1.4                          2.4
7        90         87       1.5                          1.8
8        95         100      1.4                          2.0
9        105        96       1.7                          1.7
10       84         80       1.6                          1.7
11       94         87       1.6                          1.7
12       119        116      1.7                          3.1
13       82         91       1.2                          1.8
14       80         93       1.0                          1.7
15       109        124      1.8                          2.5
16       111        119      1.4                          3.0
17       89         94       1.6                          1.8
18       99         117      1.6                          2.6
19       94         93       1.4                          1.4
20       99         110      1.4                          2.0
21       95         97       1.5                          1.3
22       102        104      1.7                          3.1
23       102        93       1.6                          1.9
Scatter Plot
[Figure: scatter plot of Math IQ (vertical axis) against Verbal IQ (horizontal axis) for the 23 students, shown first on axes from 0 to 140, then with the point (84, 80) labelled, then on axes from 60 to 130.]
Some Scatter Patterns
• Circular
• No relationship between X and Y
• Unable to predict Y from X
• Ellipsoidal
• Positive relationship between X and Y
• Increases in X correspond to increases in Y
(but not always)
• Major axis of the ellipse has positive slope
Example
Verbal IQ, MathIQ
Scatter Plot
[Figure: scatter plot of Math IQ (vertical axis) against Verbal IQ (horizontal axis), both on axes from 60 to 130.]
Some More Patterns
• Ellipsoidal (thinner ellipse)
• Stronger positive relationship between X
and Y
• Increases in X correspond to increases in Y (more frequently)
• Major axis of the ellipse has positive slope
• Minor axis of the ellipse much smaller
• Increased strength in the positive
relationship between X and Y
• Increases in X correspond to increases in Y
(almost always)
• Minor axis of the ellipse extremely small in
relationship to the Major axis of the ellipse.
• Perfect positive relationship between X and
Y
• Y perfectly predictable from X
• Data falls exactly along a straight line with
positive slope
• Ellipsoidal
• Negative relationship between X and Y
• Increases in X correspond to decreases in Y
(but not always)
• Major axis of the ellipse has negative slope
• The strength of the relationship can increase
until changes in Y can be perfectly
predicted from X
Some Non-Linear Patterns
• In a Linear pattern Y increases with respect to X at a constant rate
• In a Non-linear pattern the rate that Y
increases with respect to X is variable
Growth Patterns
• Growth patterns frequently follow a
sigmoid curve
• Growth at the start is slow
• It then speeds up
• Slows down again as it reaches its limiting size
Measures of strength of a
relationship (Correlation)
• Pearson’s correlation coefficient (r)
• Spearman’s rank correlation
coefficient (rho, ρ)
Assume that we have collected data on two
variables X and Y. Let
(x1, y1) (x2, y2) (x3, y3) … (xn, yn)
denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population)
From this data we can compute summary
statistics for each variable.
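The means
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \quad \text{and} \quad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$
The standard deviations
$$s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \quad \text{and} \quad s_y = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$$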
These statistics: $\bar{x}$, $s_x$, $\bar{y}$, $s_y$
• give information for each variable separately
but
• give no information about the relationship
between the two variables
Consider the statistics:
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \quad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \quad S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
The first two statistics:
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 \quad \text{and} \quad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
• are used to measure variability in each variable
• they are used to compute the sample standard deviations
$$s_x = \sqrt{\frac{S_{xx}}{n-1}}, \quad s_y = \sqrt{\frac{S_{yy}}{n-1}}$$
The third statistic:
$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
• is used to measure correlation
• If two variables are positively related, the sign of $(x_i - \bar{x})$ will tend to agree with the sign of $(y_i - \bar{y})$.
• When $(x_i - \bar{x})$ is positive, $(y_i - \bar{y})$ will tend to be positive: when xi is above its mean, yi will be above its mean.
• When $(x_i - \bar{x})$ is negative, $(y_i - \bar{y})$ will tend to be negative: when xi is below its mean, yi will be below its mean.
The product $(x_i - \bar{x})(y_i - \bar{y})$ will be positive for most cases.
This implies that the statistic
$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
• will be positive
• Most of the terms in this sum will be positive
On the other hand
• If two variables are negatively related, the sign of $(y_i - \bar{y})$ will tend to be opposite to the sign of $(x_i - \bar{x})$.
• When $(x_i - \bar{x})$ is positive, $(y_i - \bar{y})$ will tend to be negative: when xi is above its mean, yi will be below its mean.
• When $(x_i - \bar{x})$ is negative, $(y_i - \bar{y})$ will tend to be positive: when xi is below its mean, yi will be above its mean.
The product $(x_i - \bar{x})(y_i - \bar{y})$ will be negative for most cases.
This again implies that the statistic
$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
• will be negative
• Most of the terms in this sum will be negative
Pearson’s correlation coefficient is defined as below:
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
The denominator:
$$\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}$$
is always positive.
The numerator:
$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
• is positive if there is a positive relationship between X and Y and
• negative if there is a negative relationship between X and Y.
• This property carries over to Pearson’s correlation coefficient r
Properties of Pearson’s correlation
coefficient r
1. The value of r is always between –1 and +1.
2. If the relationship between X and Y is positive, then
r will be positive.
3. If the relationship between X and Y is negative,
then r will be negative.
4. If there is no linear relationship between X and Y, then r will be close to zero.
5. The value of r will be +1 if the points, (xi, yi) lie on
a straight line with positive slope.
6. The value of r will be -1 if the points, (xi, yi) lie on
a straight line with negative slope.
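Properties 5 and 6 can be illustrated with a short sketch that computes r directly from its definition, r = Sxy / sqrt(Sxx * Syy). The function name `pearson_r` and the two straight-line data sets are made up for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient from the definitional sums."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2 * x + 1 for x in xs])     # points on a line with positive slope
r_down = pearson_r(xs, [10 - 3 * x for x in xs])  # points on a line with negative slope
print(r_up, r_down)
```

Apart from floating-point rounding, the two results are +1 and -1.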
[Figure: ten scatter plots illustrating r = 1, r = 0.95, r = 0.7, r = 0.4, r = 0, r = -0.4, r = -0.7, r = -0.8, r = -0.95, and r = -1.]
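Computing formulae for the statistics:
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \quad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \quad S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$
$$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$$
$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$
To compute $S_{xx}$, $S_{yy}$ and $S_{xy}$, first compute
$$A = \sum_{i=1}^{n} x_i, \quad B = \sum_{i=1}^{n} y_i, \quad C = \sum_{i=1}^{n} x_i^2, \quad D = \sum_{i=1}^{n} y_i^2, \quad E = \sum_{i=1}^{n} x_i y_i$$
Then
$$S_{xx} = C - \frac{A^2}{n}, \quad S_{yy} = D - \frac{B^2}{n}, \quad S_{xy} = E - \frac{A \cdot B}{n}$$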
Example
Verbal IQ, MathIQ
Scatter Plot
[Figure: scatter plot of Math IQ (vertical axis) against Verbal IQ (horizontal axis), both on axes from 60 to 130.]
Now
$$\sum_{i=1}^{n} x_i = 2244, \quad \sum_{i=1}^{n} x_i^2 = 221494, \quad \sum_{i=1}^{n} y_i = 2307, \quad \sum_{i=1}^{n} y_i^2 = 234363, \quad \sum_{i=1}^{n} x_i y_i = 227199$$
Hence
$$S_{xx} = 221494 - \frac{2244^2}{23} = 2557.652$$
$$S_{yy} = 234363 - \frac{2307^2}{23} = 2960.87$$
$$S_{xy} = 227199 - \frac{(2244)(2307)}{23} = 2116.043$$
Thus Pearson’s correlation coefficient is:
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{2116.043}{\sqrt{(2557.652)(2960.87)}} = 0.769$$
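The whole calculation can be sketched in Python from the five sums A, B, C, D and E, using the Verbal IQ (x) and Math IQ (y) values for the 23 students in Data Set #3. Variable names follow the lecture's notation where possible; the rest are illustrative.

```python
from math import sqrt

# Verbal IQ (x) and Math IQ (y) for students 1 to 23, in table order
verbal = [86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119,
          82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102]
math_iq = [94, 103, 92, 100, 115, 102, 87, 100, 96, 80, 87, 116,
           91, 93, 124, 119, 94, 117, 93, 110, 97, 104, 93]

n = len(verbal)
A = sum(verbal)                                   # sum of x_i
B = sum(math_iq)                                  # sum of y_i
C = sum(x * x for x in verbal)                    # sum of x_i squared
D = sum(y * y for y in math_iq)                   # sum of y_i squared
E = sum(x * y for x, y in zip(verbal, math_iq))   # sum of x_i * y_i

S_xx = C - A ** 2 / n
S_yy = D - B ** 2 / n
S_xy = E - A * B / n
r = S_xy / sqrt(S_xx * S_yy)
print(A, B, C, D, E)
print(round(S_xx, 3), round(S_yy, 2), round(S_xy, 3), round(r, 3))
```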