Univariate Descriptive Statistics
Dr. Shane Nordyke
University of South Dakota
This material is distributed under an Attribution-NonCommercial-ShareAlike 3.0
Unported Creative Commons License, the full details of which may be found online
here: http://creativecommons.org/licenses/by-nc-sa/3.0/ . You may re-use, edit, or
redistribute the content provided that the original source is cited, it is for noncommercial purposes, and provided it is distributed under a similar license.
CC BY-NC-SA Nordyke 2010
Why do we need descriptive statistics?
• We use the label univariate descriptive
statistics to refer to a variety of measures of
center and variation that are useful for
understanding the nature and distribution of a
single variable.
• They can allow us to quickly understand a
large amount of information about a single
variable.
• They make data meaningful!
Making Data Meaningful
Age of Volunteer
15
19
22
17
39
17
26
A relatively small sample of the
ages of volunteers at a local nonprofit agency in the community.
What does this list tell us about the age of
volunteers in the agency?
Making Data Meaningful
Age of Volunteer
15
17
17
19
22
26
39
Sorting the list can provide a
starting place.
What do we know now?
Making Data Meaningful
39
16
43
38
39
36
16
31
24
22
17
35
32
28
47
49
25
31
27
43
27
41
30
16
41
47
49
34
33
31
15
16
22
50
42
40
35
25
40
26
42
44
33
20
18
19
39
19
40
46
43
22
28
38
21
49
49
20
44
26
24
16
49
23
37
30
17
19
26
25
16
24
44
31
27
29
45
26
33
34
15
15
16
15
28
36
48
44
24
24
43
44
50
26
29
37
30
25
33
24
41
38
48
39
18
24
49
17
21
16
40
18
16
20
26
19
43
38
46
15
28
27
16
42
39
45
20
15
What if the sample is larger?
25
17
31
26
47
18
30
21
22
34
23
43
40
22
18
19
28
22
30
40
22
45
31
24
38
33
25
29
21
21
37
41
The Menu of Basic Descriptive
Statistics
• Measures of central tendency
– Mean, Median, Mode, Midrange
• Measures of distribution
– Range, Min, Max, Percentiles
• Measures of Variation
– Standard Deviation, Variance, Coefficient of
Variation
Some initial notation
Σ indicates the addition of a set of values
y is the variable used to represent the individual data values
n represents the number of values in a sample
N represents the number of values in a population
Measures of Central Tendency - Mean
The sample mean is the mathematical average
of the data and is the measure of central
tendency we use most often.
Measures of Central Tendency - Mean
Observation #    Age of Volunteer
1                15
2                17
3                17
4                19
5                22
6                26
7                39
Sum              155
Sample Mean: $\bar{y} = \dfrac{\sum_{i=1}^{n} y_i}{n}$
$\bar{y} = \dfrac{155}{7}$
$\bar{y} = 22.14$
$\sum_{i=1}^{n} y_i$ = the sum of all of the observations
n = the number of observations
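The same calculation can be sketched in a few lines of Python. This is only an illustration of the formula above; the variable names (`ages`, `total`, `mean`) are chosen here and do not come from the slides.

```python
# Ages of the seven volunteers from the example above.
ages = [15, 17, 17, 19, 22, 26, 39]

n = len(ages)       # number of observations
total = sum(ages)   # sum of all of the observations
mean = total / n    # sample mean

print(f"sum = {total}, n = {n}, mean = {mean:.2f}")
# sum = 155, n = 7, mean = 22.14
```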
Measures of Central Tendency - Median
The sample median is the middle value when
the original data values are arranged in order
of increasing (or decreasing) magnitude. If
there isn’t one value in the middle we take the
average of the two middle values.
The median is not affected by extreme values.
Measures of Central Tendency - Median
Median (when there are two middle values):
$\tilde{y} = \dfrac{\text{sum of the two middle values in the series}}{2}$
Median is often denoted by ỹ which is pronounced “y-tilde”
Measures of Central Tendency - Median
Sample ages are arranged in ascending order
15, 17, 17, 19, 22, 26, 39
The middle value is the median.
ỹ = 19
Measures of Central Tendency - Median
If there are two values in the middle, we take the average of the two.
15, 17, 17, 19, 22, 26, 34, 39
$\tilde{y} = \dfrac{\text{sum of the two middle values in the series}}{2}$
$\tilde{y} = \dfrac{19 + 22}{2}$
$\tilde{y} = 20.5$
Measures of Central Tendency - Median
Note that the presence of an extreme value doesn’t change the median.
15, 17, 17, 19, 22, 26, 34, 99
$\tilde{y} = \dfrac{\text{sum of the two middle values in the series}}{2}$
$\tilde{y} = \dfrac{19 + 22}{2}$
$\tilde{y} = 20.5$
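The three median examples can be checked with a short Python sketch. The helper `median` below is written for illustration only (Python's standard library offers `statistics.median` as well); it sorts the values, then returns the middle value or the average of the two middle values.

```python
def median(values):
    """Return the middle value, or the average of the two middle values."""
    if not values:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

seven_ages = [15, 17, 17, 19, 22, 26, 39]
eight_ages = [15, 17, 17, 19, 22, 26, 34, 39]
with_extreme = [15, 17, 17, 19, 22, 26, 34, 99]

print(median(seven_ages))    # 19
print(median(eight_ages))    # 20.5
print(median(with_extreme))  # 20.5
```

Replacing 39 with 99 leaves the two middle values untouched, which is why the result stays at 20.5.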
Measures of Central Tendency - Mode
The mode is the value that occurs most
frequently.
– Not every sample has a distinct mode. Sometimes a sample is
bimodal (two modes) or multimodal (three or more
modes), and sometimes there is no mode at all.
– The mode is the only measure of central tendency we
can use for nominal data.
Measures of Central Tendency - Mode
15
17
17
19
22
26
39
M = 17
17 is the only value that occurs more than
once, so it is the value that occurs most
frequently and the mode.
Mode is often denoted with the symbol M
Measures of Central Tendency - Mode
Blue, Green, Green, Purple, Purple, Red, Red, Red, Red, Yellow, Yellow, Yellow
M = Red

20, 29, 33, 33, 34, 41, 41, 42, 43, 45, 45
Multimodal (33, 41, and 45 each occur twice)

1.1, 2.3, 4.1, 5.3, 4.3, 6.7, 8.2, 8.3, 8.7, 8.9, 10.3
No Mode
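A rough Python sketch of the three cases above follows. The helper `modes` is hypothetical and follows the convention used on these slides: when no value repeats, the sample is treated as having no mode, so the function returns an empty list.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest count, or [] if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # Every value occurs once: no mode under the slide's convention.
        return []
    return [value for value, count in counts.items() if count == highest]

colors = ["Blue", "Green", "Green", "Purple", "Purple", "Red", "Red",
          "Red", "Red", "Yellow", "Yellow", "Yellow"]
numbers = [20, 29, 33, 33, 34, 41, 41, 42, 43, 45, 45]
decimals = [1.1, 2.3, 4.1, 5.3, 4.3, 6.7, 8.2, 8.3, 8.7, 8.9, 10.3]

print(modes(colors))    # ['Red']
print(modes(numbers))   # [33, 41, 45]
print(modes(decimals))  # []
```

Because the function only counts occurrences, it works for nominal data such as colors as well as for numbers.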
Measures of Central Tendency - Midrange
The midrange, or middle of the range, is the average of the highest and lowest values.
There is no distinct symbol for the midrange.
$\text{Midrange} = \dfrac{\text{lowest value} + \text{highest value}}{2}$
Measures of Central Tendency - Midrange
15, 17, 17, 19, 22, 26, 39
$\text{Midrange} = \dfrac{\text{lowest value} + \text{highest value}}{2}$
$\text{Midrange} = \dfrac{15 + 39}{2}$
$\text{Midrange} = 27$
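As a quick check, the midrange of the sample can be computed in Python with the built-in `min` and `max` functions. The variable names are illustrative only.

```python
ages = [15, 17, 17, 19, 22, 26, 39]

# Average of the lowest and highest values.
midrange = (min(ages) + max(ages)) / 2

print(midrange)  # 27.0
```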
Comparing Measures of Central Tendency
15, 17, 17, 19, 22, 26, 39
Mean = 22.14
Median = 19
Mode = 17
Midrange = 27
Comparing Measures of Center
Measure of Center           Does it always exist?       Does it take into       Is it affected by
(most used to least used)                               account every value?    extreme values?
Mean                        Always                      Yes                     Yes
Median                      Always                      No                      No
Mode                        Might not exist, may        No                      No
                            have more than one
Midrange                    Always                      No                      Yes
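The last column of this comparison can be illustrated with a small Python sketch. It assumes, purely for illustration, that the oldest volunteer is 99 instead of 39 (the same extreme value used in the median example) and compares each measure before and after. `mean`, `median`, and `mode` come from Python's `statistics` module; `midrange` is a helper defined here.

```python
from statistics import mean, median, mode

def midrange(values):
    """Average of the lowest and highest values."""
    return (min(values) + max(values)) / 2

original = [15, 17, 17, 19, 22, 26, 39]
extreme = [15, 17, 17, 19, 22, 26, 99]   # hypothetical extreme value

measures = [("mean", mean), ("median", median),
            ("mode", mode), ("midrange", midrange)]

for name, measure in measures:
    before = measure(original)
    after = measure(extreme)
    print(f"{name:9s} {before:6.2f} -> {after:6.2f}")
```

The mean and midrange move when the extreme value is introduced, while the median and mode stay at 19 and 17.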
The Range
• The range of a sample is the difference
between the highest value and the lowest
value.
15
17
17
19
22
26
39
In our example the Range = 39 – 15 or 24;
there are 24 years between our youngest
and oldest volunteers in the sample.
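In Python, the minimum, maximum, and range of the sample could be found as follows. This is a minimal sketch, and the variable names are chosen here.

```python
ages = [15, 17, 17, 19, 22, 26, 39]

lowest = min(ages)
highest = max(ages)
value_range = highest - lowest   # difference between highest and lowest

print(lowest, highest, value_range)  # 15 39 24
```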
Measures of Variance
• Where measures of central tendency try to
give us an idea of where the middle of the
data lies, measures of variance (or variation)
tell us about how the data is distributed
around that center.
• Our three primary measures of variance are:
– Standard Deviation,
– Variance and
– Coefficient of Variation
Measures of Variance – Standard Deviation
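The Standard Deviation is a measure of the variation of values around the mean.
Sample Standard Deviation: $s = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$
Population Standard Deviation: $\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$
Here $\bar{y}$ is the sample mean and $\mu$ is the population mean.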
Some Key Points for Understanding
Standard Deviation
• The standard deviation is never negative; it is zero only when every value is the same.
• The standard deviation of a sample will always be
in the same units as the observations in the
sample.
• Extreme values or outliers can change the value
of the standard deviation substantially.
• The size of the sample affects how precisely the sample
standard deviation estimates the population value; a larger
sample gives a more stable estimate, but it does not by itself
make the standard deviation smaller.
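The point about outliers can be illustrated with a short Python sketch using `statistics.stdev`, which computes the sample standard deviation. The second list is hypothetical: it swaps the oldest volunteer's age (39) for 99 to show how much one extreme value can change the result.

```python
from statistics import stdev

ages = [15, 17, 17, 19, 22, 26, 39]
ages_with_outlier = [15, 17, 17, 19, 22, 26, 99]   # hypothetical outlier

s_original = stdev(ages)               # sample standard deviation, in years
s_outlier = stdev(ages_with_outlier)

print(f"{s_original:.1f} years -> {s_outlier:.1f} years")
```

Both results are in years, the same units as the observations.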
Measures of Variance - Variance
• The variance of a sample is just the standard
deviation of the sample squared.
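Sample Variance: $s^2 = \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$
Population Variance: $\sigma^2 = \dfrac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}$
Here $\bar{y}$ is the sample mean and $\mu$ is the population mean.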
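As a sketch, Python's `statistics` module provides both versions: `variance` and `stdev` divide by n - 1 (sample), while `pvariance` and `pstdev` divide by N (population). Treating the seven volunteer ages as a whole population in the second half is an assumption made only to show the difference in the divisor.

```python
from math import isclose
from statistics import pstdev, pvariance, stdev, variance

ages = [15, 17, 17, 19, 22, 26, 39]

s2 = variance(ages)        # sample variance, divides by n - 1
sigma2 = pvariance(ages)   # population variance, divides by N

print(f"sample variance     = {s2:.2f}")      # 68.81
print(f"population variance = {sigma2:.2f}")  # 58.98

# The variance is the standard deviation squared in both cases.
print(isclose(s2, stdev(ages) ** 2))       # True
print(isclose(sigma2, pstdev(ages) ** 2))  # True
```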
Standard Deviation and Variance Notation
Sample
s = standard deviation
s² = variance
Population
σ = standard deviation
σ² = variance
Seeing Standard Deviations
• A distribution with a small standard deviation is tall
and narrow; a distribution with a large standard deviation
is broad and flat.
Back to our example
• In our sample of volunteer ages, the mean
was 22.14 years.
15
17
17
19
22
26
39
• We can calculate the standard deviation to
better understand how the values are
distributed around that mean.
Back to our example
Sample Standard Deviation: $s = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$
y       ȳ        (y - ȳ)    (y - ȳ)²
15      22.14    -7.14      50.9796
17      22.14    -5.14      26.4196
17      22.14    -5.14      26.4196
19      22.14    -3.14      9.8596
22      22.14    -0.14      0.0196
26      22.14    3.86       14.8996
39      22.14    16.86      284.2596
Sum                         412.8572
Back to our example
Sample Standard Deviation: $s = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$
$s = \sqrt{\dfrac{412.86}{7 - 1}}$
$s = 8.3$
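The worked example can be reproduced step by step in Python. This is a sketch with illustrative variable names. It uses the unrounded mean rather than 22.14, so the individual squared deviations differ slightly from the table in the later decimal places, but the total and the final result agree after rounding.

```python
from math import sqrt

ages = [15, 17, 17, 19, 22, 26, 39]
n = len(ages)
y_bar = sum(ages) / n                       # sample mean (unrounded)

deviations = [y - y_bar for y in ages]      # (y - y_bar) for each value
squared = [d ** 2 for d in deviations]      # (y - y_bar)^2 for each value
sum_sq = sum(squared)                       # sum of squared deviations
s = sqrt(sum_sq / (n - 1))                  # sample standard deviation

for y, d, d2 in zip(ages, deviations, squared):
    print(f"{y:3d} {d:7.2f} {d2:9.4f}")
print(f"sum of squares = {sum_sq:.2f}, s = {s:.1f}")
# sum of squares = 412.86, s = 8.3
```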
How are standard deviations helpful?
The Empirical Rule
When data sets have distributions that are
approximately bell shaped, the following is true:
• About 68% of all values fall within 1 standard
deviation of the mean
• About 95% of all values fall within 2 standard
deviations of the mean
• About 99.7% of all values fall within 3 standard
deviations of the mean
Copyright © 2004 Pearson Education,
Inc.
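The three percentages can be checked against an exactly normal (bell-shaped) distribution. The sketch below uses `statistics.NormalDist` from Python's standard library (Python 3.8 or later) and assumes a standard normal distribution with mean 0 and standard deviation 1; real data that are only approximately bell shaped will match these figures only roughly.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

shares = {}
for k in (1, 2, 3):
    # Share of values between k standard deviations below and above the mean.
    shares[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} standard deviation(s): {shares[k]:.1%}")
# within 1 standard deviation(s): 68.3%
# within 2 standard deviation(s): 95.4%
# within 3 standard deviation(s): 99.7%
```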
The Empirical Rule
[Figure: bell curve. 68% of values fall within 1 standard deviation of the mean, 34% on each side.]
The Empirical Rule
[Figure: bell curve. 68% of values fall within 1 standard deviation of the mean (34% on each side); 95% of values fall within 2 standard deviations of the mean (a further 13.5% on each side).]
The Empirical Rule
[Figure: bell curve. 68% of values fall within 1 standard deviation of the mean (34% on each side); 95% of values fall within 2 standard deviations of the mean (a further 13.5% on each side); 99.7% of values fall within 3 standard deviations of the mean (a further 2.4% on each side).]
Measures of Variance – Coefficient of Variation
• The Coefficient of Variation (CV) is a measure
of the standard deviation of a sample relative
to its mean.
• CVs can be useful when you are comparing
the standard deviations of variables that are in
two different units.
Measures of Variance – Coefficient of Variation
An example: You are comparing the heights and
weights of fourth graders.
Height: $\bar{y}$ = 52", s = 4"
Weight: $\bar{y}$ = 80 lbs., s = 10 lbs.
Which variable has greater variation? How can
we compare 4" to 10 lbs.?
Measures of Variance – Coefficient of Variation
$CV = \dfrac{s}{\bar{y}} \times 100\%$
Height: $\bar{y}$ = 52", s = 4"
$CV = \dfrac{4}{52} \times 100\% \approx 8\%$
Weight: $\bar{y}$ = 80 lbs., s = 10 lbs.
$CV = \dfrac{10}{80} \times 100\% = 12.5\%$
The standard deviation of height is about 8% of the mean of height,
whereas the standard deviation of weight is 12.5% of the mean of
weight, so there is greater variation in the weight of the fourth
graders than in the height.
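The comparison can be sketched in Python as follows. The function name `coefficient_of_variation` is chosen here for illustration; it simply applies the CV formula to the means and standard deviations given in the example.

```python
def coefficient_of_variation(s, y_bar):
    """Standard deviation as a percentage of the mean."""
    return s / y_bar * 100

cv_height = coefficient_of_variation(s=4, y_bar=52)    # inches
cv_weight = coefficient_of_variation(s=10, y_bar=80)   # pounds

print(f"height CV = {cv_height:.1f}%")  # height CV = 7.7%
print(f"weight CV = {cv_weight:.1f}%")  # weight CV = 12.5%
```

Because the units cancel in the ratio, the two percentages can be compared directly even though height and weight are measured differently.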