Describing Data:
Summary Measures
Identifying the Scale of Measurement
• Before you analyze the data, identify the measurement scale for each
variable (continuous, nominal, or ordinal).
• Example survey variable with the responses: Agree, No Opinion, Disagree.
Nominal Variables
• Variable: Type of Beverage
– The categories can be coded 1, 2, 3 (or listed in any other order); the
codes carry no ranking.
Ordinal Variables
• Variable: Size of Beverage
– Small, Medium, Large (ordered categories).
Continuous Variables
• Variable: Volume of Beverage (ratio level).
• Variable: Temperature of Beverage (interval level).
Central Tendencies
• Defined as the tendency of the data to
cluster around or center about certain
numerical values, such as the:
– Mean (arithmetic mean),
– Median, and
– Mode.
Central Tendency – Mean, Median, and Mode
• Example data set: 1, 1, 1, 2, 3, 10 (n = 6).
– Mean = 3: the sum of all the values in the data set divided by the number
of values, $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$.
– Median = 1.5: the middle value (also known as the 50th percentile).
– Mode = 1: the most common or frequent data value.
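The following is a minimal sketch in Python (standard-library statistics module) that verifies the three summary measures for the example data set above; the variable name data is illustrative.

```python
# Verify the mean, median, and mode for the example data set {1, 1, 1, 2, 3, 10}.
from statistics import mean, median, mode

data = [1, 1, 1, 2, 3, 10]          # n = 6

print(mean(data))    # 3    (18 / 6)
print(median(data))  # 1.5  (average of the two middle values, 1 and 2)
print(mode(data))    # 1    (appears three times)
```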
Mean (or Arithmetic Mean)
• Sum of the values of all of the observations in a data set divided by the
number of observations:
– The Sample Mean is $\bar{X}$.
– The Population Mean is $\mu$.
– The formula for calculating the sample mean is:
$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$
Median
• Defined as the middle point of the set of
data, i.e. exactly half of the data points are
above the median and exactly half are
below:
– If the number of data points is odd, it is the
middle point of the ordered set of data.
– If the number of data points is even, it is the
average (mean) of the two middle points of
the ordered set of data.
Mode
• Defined as the measurement(s) which
occurs with the greatest frequency in the
sample, i.e. the most common point(s):
– A unimodal data set contains only one mode.
– A bimodal data set contains two modes.
– And so on….
Examples and Calculations
• Sales revenue for 35 companies:

Company  Revenue    Company  Revenue
1        3.1        19       6.2
2        7.4        20       8.4
3        2.2        21       1.9
4        10.9       22       5.8
5        4.5        23       4.9
6        8.6        24       6.4
7        3.7        25       3.6
8        6.3        26       7.9
9        7.6        27       3.2
10       5.4        28       8.5
11       2.3        29       6.2
12       5.8        30       9.7
13       4.2        31       7.1
14       6.1        32       5.9
15       9.1        33       5.7
16       5.5        34       4.4
17       4.8        35       2.9
18       8.9

• Stem-and-leaf display of the revenues:

Stem | Leaf
  1  | 9
  2  | 2 3 9
  3  | 1 2 6 7
  4  | 2 4 5 8 9
  5  | 4 5 7 8 8 9
  6  | 1 2 2 3 4
  7  | 1 4 6 9
  8  | 4 5 6 9
  9  | 1 7
 10  | 9
Key: Leaf units are tenths.
Examples and Calculations
• Mean:
$\bar{X} = (3.1 + 7.4 + 2.2 + 10.9 + \ldots + 2.9)/35 = 205.1/35 = 5.86$
• Median:
– From the stem-and-leaf display: the median (the 18th ordered value) is 5.8.
• Mode:
– From the stem-and-leaf display: the modes are 5.8 and 6.2 (bimodal).
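A minimal Python sketch (standard-library statistics module; multimode needs Python 3.8+) that reproduces these results from the revenue table above; the list name revenue is illustrative.

```python
# Reproduce the mean, median, and mode(s) of the 35 sales-revenue values.
from statistics import mean, median, multimode   # multimode: Python 3.8+

revenue = [3.1, 7.4, 2.2, 10.9, 4.5, 8.6, 3.7, 6.3, 7.6, 5.4,
           2.3, 5.8, 4.2, 6.1, 9.1, 5.5, 4.8, 8.9, 6.2, 8.4,
           1.9, 5.8, 4.9, 6.4, 3.6, 7.9, 3.2, 8.5, 6.2, 9.7,
           7.1, 5.9, 5.7, 4.4, 2.9]

print(round(sum(revenue), 1))      # 205.1
print(round(mean(revenue), 2))     # 5.86
print(median(revenue))             # 5.8
print(multimode(revenue))          # [5.8, 6.2]  (bimodal)
```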
Skewing (Mean and Median Comparisons)
• If the median is less than the mean, the data set is skewed right (extreme
data in the right tail increase the mean).
• If the median is greater than the mean, the data set is skewed left
(extreme data in the left tail decrease the mean).
• If the median equals the mean, the data set is said to be symmetrical.
Other Items
• Notes:
– The mean is sensitive to outliers (extreme data) while the median and
mode are not.
– Outliers can have significant implications for the ability to draw
inferences from the data set. For example, consider:
• Average home sales price.
• Average exam score.
• Average jury award amount.
– Depending on the data, the mean or the median may not be the appropriate
summary.
Picturing Distributions: Histogram
– Each bar in the histogram represents a group of values (a bin).
– The height of the bar represents the frequency or percent of values in
the bin.
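As an illustration of the binning idea, here is a minimal sketch assuming NumPy is available; it groups the sales-revenue values from the earlier example into bins of width 1.0 and reports each bar height as a percent. The bin width and variable names are illustrative choices, not part of the original slides.

```python
# Build histogram bins and report each bar's height as a percent of the data.
import numpy as np

revenue = [3.1, 7.4, 2.2, 10.9, 4.5, 8.6, 3.7, 6.3, 7.6, 5.4,
           2.3, 5.8, 4.2, 6.1, 9.1, 5.5, 4.8, 8.9, 6.2, 8.4,
           1.9, 5.8, 4.9, 6.4, 3.6, 7.9, 3.2, 8.5, 6.2, 9.7,
           7.1, 5.9, 5.7, 4.4, 2.9]

bins = np.arange(1.0, 12.0, 1.0)               # bin edges 1.0, 2.0, ..., 11.0
counts, edges = np.histogram(revenue, bins=bins)
percents = 100 * counts / len(revenue)         # bar heights as percents

for lo, hi, pct in zip(edges[:-1], edges[1:], percents):
    print(f"[{lo:4.1f}, {hi:4.1f}): {pct:5.1f}%")
```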
Data Distributions Compared to Normal
A Normal Distribution
Skewness
Measures of Data Variability
• Knowing the central tendencies (mean, median, mode) isn’t enough. We also
need a way to determine how closely the data cluster around the center
point(s).
• Three typical measures of data variability:
– Range,
– Variance, and
– Standard Deviation.
The Spread of a Distribution: Dispersion
• Range: the difference between the maximum and minimum data values.
• Interquartile Range: the difference between the 25th and 75th percentiles.
• Variance: a measure of dispersion of the data around the mean.
• Standard Deviation: a measure of dispersion expressed in the same units of
measurement as your data (the square root of the variance).
Range
• Simplest measure of variability.
• Calculated by subtracting the smallest measurement from the largest
measurement.
• GPA Examples:
– Data Set One: (4.0, 2.7, 3.3, 3.2, 2.1, 3.7, 3.5, 1.9).
Range equals 4.0 minus 1.9, which is 2.1.
– Data Set Two: (2.7, 3.2, 3.4, 2.9, 1.8, 2.2, 3.0, 2.1).
Range equals 3.4 minus 1.8, which is 1.6.
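A minimal Python sketch computing the two ranges above; the rounding only guards against floating-point representation of the decimals.

```python
# A quick check of the two GPA ranges.
set_one = [4.0, 2.7, 3.3, 3.2, 2.1, 3.7, 3.5, 1.9]
set_two = [2.7, 3.2, 3.4, 2.9, 1.8, 2.2, 3.0, 2.1]

print(round(max(set_one) - min(set_one), 1))   # 2.1
print(round(max(set_two) - min(set_two), 1))   # 1.6
```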
Is Range A Sufficient Measure of Variability?
• No. Consider the following two stem-and-leaf diagrams, where both data
sets have a range of 9.0.

First data set:
Stem | Leaf
  1  | 9
  2  | 3
  3  | 1 6
  4  | 2 3 5 5 6 7
  5  | 4 4 5 6 7 8 8 9 9
  6  | 1 2 2 3 4 5 8
  7  | 1 4 6 8 9
  8  | 1 8
  9  | 7
 10  | 9
Key: Leaf units are tenths.

Second data set:
Stem | Leaf
  1  | 9
  2  | 2 3 9
  3  | 1 2 6 7
  4  | 2 4 5 8 9
  5  | 4 5 7 8 8 9
  6  | 1 2 2 3 4
  7  | 1 4 6 9
  8  | 4 5 6 9
  9  | 1 7
 10  | 9
Key: Leaf units are tenths.
Another Method for Measuring
Data Variability (or Spread)
• A more sensitive measurement of variation uses
the difference between the sample mean and
each of the measurements of the sample, also
known as the deviation from the mean.
• Each deviation between the sample member
and the mean is first calculated and then
squared. These results are then summed.
Variance
• Equal to the sum of the squared distances from the mean divided by (n − 1)
for a sample:
– Sample Variance: $s^2$
– Population Variance: $\sigma^2$
• Deviations are squared to remove the effects of negative differences.
Standard Deviation
• While the variance does not provide a useful metric (i.e. it is in “units
squared”), taking the positive square root of the variance provides a metric
in the same units as the data itself:
– Sample Standard Deviation: $s$
– Population Standard Deviation: $\sigma$
Variance Formulas
• Sample: $s^2 = \frac{\sum_{i} (x_i - \bar{x})^2}{n - 1}$
• Population: $\sigma^2 = \frac{\sum_{i} (x_i - \mu)^2}{n}$
Shortcut Variance Formulas
• Sample: $s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1}$
• Population: $\sigma^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n}$
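A minimal Python sketch, written to check that the definitional formula and the shortcut formula give the same sample variance; GPA Data Set One from the later examples is used as test data, and the function names are illustrative.

```python
# Check that the definitional and shortcut forms of the sample variance agree.
def sample_variance(x):
    """Sum of squared deviations from the mean, divided by (n - 1)."""
    n = len(x)
    xbar = sum(x) / n
    return sum((xi - xbar) ** 2 for xi in x) / (n - 1)

def sample_variance_shortcut(x):
    """(sum of x_i**2  -  (sum of x_i)**2 / n) / (n - 1)."""
    n = len(x)
    return (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n) / (n - 1)

gpa = [4.0, 2.7, 3.3, 3.2, 2.1, 3.7, 3.5, 1.9]   # GPA Data Set One
print(round(sample_variance(gpa), 3))            # 0.566
print(round(sample_variance_shortcut(gpa), 3))   # 0.566
```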
Standard Deviation Formulas
• Sample: $s = \sqrt{s^2} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
• Population: $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}$
Shortcut Standard Deviation Formulas
• Sample: $s = \sqrt{s^2} = \sqrt{\frac{\sum x_i^2 - (\sum x_i)^2 / n}{n - 1}}$
• Population: $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum x_i^2 - (\sum x_i)^2 / n}{n}}$
Notes on Variability
• The sample variance (standard deviation) is
divided by one less than the sample size (n-1)
rather than by sample size itself (n).
• The sample variance (standard deviation) is
used to estimate the population variance
(standard deviation). We divide by (n-1) rather
than (n) so that this estimator is unbiased.
Notes on Variability
(continued)
• A data set with larger spread about its mean will
have a larger standard deviation.
• A data set with smaller spread about its mean
will have a smaller standard deviation.
• It is the calculation of the standard deviation that
allows for the comparison of the spread
(variability) of the two data sets.
Examples
• GPA Data Set One: (4.0, 2.7, 3.3, 3.2, 2.1, 3.7, 3.5, 1.9).
– Range equals 4.0 minus 1.9, which is 2.1.
– Mean: $\bar{X} = (4.0 + 2.7 + 3.3 + \ldots + 1.9)/8 = 24.4/8 = 3.05$
• GPA Data Set Two: (2.7, 3.2, 3.4, 2.9, 1.8, 2.2, 3.0, 2.1).
– Range equals 3.4 minus 1.8, which is 1.6.
– Mean: $\bar{X} = (2.7 + 3.2 + 3.4 + \ldots + 2.1)/8 = 21.3/8 = 2.66$
Variance and Standard Deviation Calculations
• GPA Data Set One:

Member | X   | X Minus Mean | Result Squared
1      | 4.0 |  0.95        | 0.9025
2      | 2.7 | -0.35        | 0.1225
3      | 3.3 |  0.25        | 0.0625
4      | 3.2 |  0.15        | 0.0225
5      | 2.1 | -0.95        | 0.9025
6      | 3.7 |  0.65        | 0.4225
7      | 3.5 |  0.45        | 0.2025
8      | 1.9 | -1.15        | 1.3225
Sum    |     |              | 3.9600

– Variance: $s^2 = 3.9600 / (8 - 1) = 0.566$
– Standard Deviation: $s = \sqrt{s^2} = \sqrt{3.9600 / (8 - 1)} = 0.752$

• GPA Data Set Two:

Member | X   | X Minus Mean | Result Squared
1      | 2.7 |  0.04        | 0.0016
2      | 3.2 |  0.54        | 0.2916
3      | 3.4 |  0.74        | 0.5476
4      | 2.9 |  0.24        | 0.0576
5      | 1.8 | -0.86        | 0.7396
6      | 2.2 | -0.46        | 0.2116
7      | 3.0 |  0.34        | 0.1156
8      | 2.1 | -0.56        | 0.3136
Sum    |     |              | 2.2788

– Variance: $s^2 = 2.2788 / (8 - 1) = 0.326$
– Standard Deviation: $s = \sqrt{s^2} = \sqrt{2.2788 / (8 - 1)} = 0.571$
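A minimal sketch using Python's statistics module to reproduce the slide's results for both GPA data sets.

```python
# Reproduce the variance and standard deviation for both GPA data sets.
from statistics import mean, variance, stdev

set_one = [4.0, 2.7, 3.3, 3.2, 2.1, 3.7, 3.5, 1.9]
set_two = [2.7, 3.2, 3.4, 2.9, 1.8, 2.2, 3.0, 2.1]

for name, data in (("Set One", set_one), ("Set Two", set_two)):
    print(name,
          round(mean(data), 2),       # 3.05 and 2.66
          round(variance(data), 3),   # 0.566 and 0.326
          round(stdev(data), 3))      # 0.752 and 0.571
```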
Variance/Standard Deviation Calculations using Shortcut
• GPA Data Set One:

Member | X    | X Squared
1      | 4.0  | 16.00
2      | 2.7  |  7.29
3      | 3.3  | 10.89
4      | 3.2  | 10.24
5      | 2.1  |  4.41
6      | 3.7  | 13.69
7      | 3.5  | 12.25
8      | 1.9  |  3.61
Sum    | 24.4 | 78.38

– Variance: $s^2 = \frac{78.38 - \frac{(24.4)^2}{8}}{8 - 1} = 0.566$

• GPA Data Set Two:

Member | X    | X Squared
1      | 2.7  |  7.29
2      | 3.2  | 10.24
3      | 3.4  | 11.56
4      | 2.9  |  8.41
5      | 1.8  |  3.24
6      | 2.2  |  4.84
7      | 3.0  |  9.00
8      | 2.1  |  4.41
Sum    | 21.3 | 58.99

– Variance: $s^2 = \frac{58.99 - \frac{(21.3)^2}{8}}{8 - 1} = 0.326$
Standard Deviation
• Useful for comparing the variability of two data
sets.
• The data set with the larger standard deviation is
the data set with more variability.
• From GPA Example:
– Data Set One: Mean = 3.05, St. Dev. = 0.752.
– Data Set Two: Mean = 2.66, St. Dev. = 0.571.
Relative vs. Absolute
Comparison
• Deviation, or error, has been standardized.
Thus, for a single data set, variability can be
discussed in terms of how many members of the
data set fall within one, two, three, or more
standard deviations of the mean.
• A theorem and a rule describe this behavior:
– Chebyshev’s Theorem;
– Empirical Rule.
Chebyshev’s Theorem
• In general, at least (1-1/k2) of the sample
members will fall within k standard deviations of
the mean (for k >1).
• So…
– For k=1, it is possible for no members to fall within 1
standard deviation of the mean.
– For k=2, 75% (3/4) or more of the members will fall
within 2 standard deviations of the mean.
Chebyshev’s Theorem (cont’d)
• Continuing:
– For k=3, 88.9% (8/9) or more of the members will fall within 3 standard
deviations of the mean.
– For k=4, 93.75% (15/16) or more of the members will fall within 4 standard
deviations of the mean.
– And so on….
• This theorem holds true regardless of the frequency distribution of the
data set, i.e. no matter what the histogram looks like.
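A minimal Python sketch of the Chebyshev lower bound 1 − 1/k²; the function name is illustrative.

```python
# Chebyshev's lower bound: at least 1 - 1/k**2 of any data set lies within
# k standard deviations of the mean (informative only for k > 1).
def chebyshev_lower_bound(k):
    return 1 - 1 / k**2

for k in range(1, 6):
    print(k, f"{chebyshev_lower_bound(k):.4f}")
# 1 0.0000, 2 0.7500, 3 0.8889, 4 0.9375, 5 0.9600
```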
Empirical Rule
• Based on empirical evidence for mound or
bell shaped frequency distributions:
– Approximately 68% (0.6826) of the sample
members will fall within 1 standard deviation of the
mean.
– Approximately 95% (0.9544) of the sample
members will fall within 2 standard deviations of
the mean.
– Almost all (0.9974) of the sample members will fall
within 3 standard deviations of the mean.
Normal Distributions
• Useful probabilities for normal distributions (figure): approximately 68%
of values lie within one standard deviation of the mean, 95% within two, and
99% within three.
Chebyshev’s Theorem vs. Empirical Rule
• Percent of sample members using Chebyshev’s Theorem:
– For k=1, 0% (+)
– For k=2, 75% (+)
– For k=3, 88.9% (+)
– For k=4, 93.75% (+)
– For k=5, 96% (+)
– And so on….
• Percent of sample members using Empirical Rule:
– For k=1, 68.26%
– For k=2, 95.44%
– For k=3, 99.74%
– For k=4, > 99.8%
– For k=5, > 99.9%
– And so on….
Example using Toyota 4Runner Data
• MPG Data for 1995 Toyota 4Runner (ordered):
14.6, 14.6, 14.7, 14.8, 14.8, 14.9, 14.9, 15.0, 15.1, 15.1,
15.2, 15.2, 15.2, 15.3, 15.3, 15.4, 15.4, 15.5, 15.5, 15.5,
15.6, 15.6, 15.7, 15.7, 15.7, 15.8, 15.9, 16.0, 16.1, 16.1
• Mean = 15.34, Standard Deviation = 0.4352.

Chebyshev's Theorem:
k | Mean Minus ks | Mean Plus ks | Estimate        | Actual
1 | 14.9048       | 15.7752      | 30*0% = 0       | 18
2 | 14.4696       | 16.2104      | 30*75% = 22.5   | 30
3 | 14.0344       | 16.6456      | 30*88.9% = 26.7 | 30

Empirical Rule:
k | Estimate          | Actual
1 | 30*68.26% = 20.5  | 18
2 | 30*95.44% = 28.6  | 30
3 | 30*99.74% = 29.9  | 30
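A minimal Python sketch (statistics module) that reproduces the "Actual" counts in the tables above by counting how many of the 30 MPG values fall within k standard deviations of the mean.

```python
# Count the MPG values within k standard deviations of the mean and compare
# with the Chebyshev and Empirical Rule estimates.
from statistics import mean, stdev

mpg = [14.6, 14.6, 14.7, 14.8, 14.8, 14.9, 14.9, 15.0, 15.1, 15.1,
       15.2, 15.2, 15.2, 15.3, 15.3, 15.4, 15.4, 15.5, 15.5, 15.5,
       15.6, 15.6, 15.7, 15.7, 15.7, 15.8, 15.9, 16.0, 16.1, 16.1]

m, s = mean(mpg), stdev(mpg)                 # ~15.34 and ~0.4352
for k in (1, 2, 3):
    lo, hi = m - k * s, m + k * s
    within = sum(lo <= x <= hi for x in mpg)
    print(f"k={k}: [{lo:.4f}, {hi:.4f}]  actual = {within}")
# k=1: 18 of 30,  k=2: 30 of 30,  k=3: 30 of 30
```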
Measures of Relative Standing
Percentile Ranking
• Describes how a member of the data set
compares to the rest of the data.
• The pth percentile is a number x such that p% of the measurements fall
below it.
• Example:
– SAT Scores: a score at the 90th percentile means that 90 percent of the
scores are below it.
Percentiles
• Data (ordered): 98, 95, 92, 90, 85, 81, 79, 70, 63, 55, 47, 42.
• Quartiles divide your data into quarters:
– 75th Percentile = 91 (third quartile).
– 50th Percentile = 80 (median).
– 25th Percentile = 59 (first quartile).
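A minimal Python sketch reproducing the slide's quartiles by taking the median of each half of the ordered data (one common quartile convention; statistical packages may interpolate and give slightly different values), so the numbers match 59, 80, and 91.

```python
# Quartiles as the median of each half of the ordered data (n = 12, even).
from statistics import median

scores = [98, 95, 92, 90, 85, 81, 79, 70, 63, 55, 47, 42]
ordered = sorted(scores)

half = len(ordered) // 2
q1 = median(ordered[:half])    # 25th percentile (first quartile) = 59
q2 = median(ordered)           # 50th percentile (median)         = 80
q3 = median(ordered[-half:])   # 75th percentile (third quartile) = 91
print(q1, q2, q3)
```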
Box Plots
• The box spans the 25th percentile to the 75th percentile, with a line at
the 50th percentile (median).
• Whiskers extend to the largest and smallest points within 1.5 * IQR of
the box.
• Points more than 1.5 * IQR from the box are plotted as outliers.
• The mean is denoted by a ◊.
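A minimal Python sketch of the 1.5 * IQR fences described above, reusing the percentile example data purely for illustration; here no point falls outside the fences, so there are no outliers.

```python
# The 1.5 * IQR rule for box-plot whiskers and outlier flagging.
from statistics import median

scores = [98, 95, 92, 90, 85, 81, 79, 70, 63, 55, 47, 42]
ordered = sorted(scores)
half = len(ordered) // 2
q1, q3 = median(ordered[:half]), median(ordered[-half:])   # 59 and 91
iqr = q3 - q1                                              # 32

lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # 11.0 and 139.0
outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
print(lower_fence, upper_fence, outliers)                  # no outliers here
```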
Excel and StatPro Add-in
Demonstration
• Pivot tables
• Summary measures
– Excel
– Add-ins
• Covariance and correlation
• Boxplots
• Applications