Statistics 511: Statistical Methods
Dr. Levine
Purdue University
Fall 2011
Lecture 2: Measures of Location and Variability
Devore: Section 1.3-1.4
Oct, 2011
Sample mean
• The sample mean of the n numbers $x_1, \ldots, x_n$ is
$\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$
• Alternative notation:
$\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
• The following dataset shows the crack length in µm for smooth
bar tensile specimens subjected to stress corrosion tests. The
data are $x_1 = 16.1$, $x_2 = 9.6$, $x_3 = 24.9$, ..., $x_{21} = 28.5$.
• The mean is clearly
$\bar{x} = \dfrac{444.8}{21} = 21.18.$
• Physical interpretation of the mean: the balance point for a
system of weights.
• R commands: with(xmp01.12, mean(crackLength)) and
with(xmp01.12, sum(crackLength))
• Lack of robustness is the main shortcoming of the mean as a
measure of center. In the example before, there is a value
$x_{14} = 45.0$, which is an outlier. Without it, we have
$\bar{x} = \dfrac{399.8}{20} = 19.99.$
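The effect of the outlier can be checked directly from the totals quoted in the example. This is a minimal Python sketch: only the sum 444.8, the count 21, and the outlier 45.0 come from the slides; the variable names are illustrative, and the full dataset is not reproduced here.

```python
# Totals quoted in the crack-length example (the individual values are not listed).
total = 444.8     # sum of all 21 crack lengths
n = 21
outlier = 45.0    # x14, the observation flagged as an outlier

mean_all = total / n
mean_without_outlier = (total - outlier) / (n - 1)

print(round(mean_all, 2))              # mean with every observation
print(round(mean_without_outlier, 2))  # mean after dropping x14
```

A single observation out of 21 moves the mean by more than one unit, which is the lack of robustness described above.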
Sample median
• The sample median x̃ is the middle value in the set of data that
has been arranged in ascending order. For an even number of
data points the median is the average of the middle two.
• More precisely, suppose the number of observations n is odd.
Then the median $\tilde{x}$ is observation number $\frac{n+1}{2}$ in the ordered list.
• In the same way, if n is even, the median is defined as the
average of the $\frac{n}{2}$th and $\left(\frac{n}{2} + 1\right)$th ordered observations.
• The median is a robust measure of the center of the data, unlike the mean.
• The mean and the median are generally not the same.
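The odd/even rule above translates almost line by line into code. The following Python sketch is one possible implementation; the function name `sample_median` and the toy inputs are illustrative assumptions, not part of the lecture.

```python
def sample_median(values):
    """Median computed by the odd/even rule for ordered observations."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n odd: observation number (n + 1) / 2, which is 0-based index n // 2
        return ordered[mid]
    # n even: average of observations n/2 and n/2 + 1 (0-based mid - 1 and mid)
    return (ordered[mid - 1] + ordered[mid]) / 2


# Toy values chosen for illustration only (not from the lecture).
print(sample_median([3, 1, 2]))         # odd n
print(sample_median([4, 1, 3, 2]))      # even n
print(sample_median([4, 1, 3, 1000]))   # a huge value barely moves the median
```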
Median calculation example
• The following data give the concentration for a specific receptor
for a sample of women with evidence of iron-deficiency anemia.
When ordered, they are 7.6 8.3 9.3 9.4 9.4 9.7 10.4 11.5 11.9
15.2 16.2 20.4.
• As n = 12 is even, the median is
$\tilde{x} = \dfrac{9.7 + 10.4}{2} = 10.05.$
• An R command: with(xmp01.13, median(concentration))
• What would happen if the largest observation 20.4 was not
there?
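One way to explore the question is numerically. The sketch below uses Python's standard `statistics` module on the twelve ordered values listed above; the variable names are illustrative.

```python
import statistics

# Ordered receptor concentrations from the example above.
concentration = [7.6, 8.3, 9.3, 9.4, 9.4, 9.7, 10.4, 11.5, 11.9, 15.2, 16.2, 20.4]
without_max = concentration[:-1]   # drop the largest observation, 20.4

median_full = statistics.median(concentration)
median_without = statistics.median(without_max)
mean_full = statistics.mean(concentration)
mean_without = statistics.mean(without_max)

print("median:", median_full, "->", median_without)
print("mean:  ", round(mean_full, 2), "->", round(mean_without, 2))
```

With n = 11 the median becomes the sixth ordered value, so it shifts only from 10.05 to 9.7, while the mean drops by roughly 0.8.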
Population mean and median
• The population mean is defined as the sum of the N
population values divided by N.
• The sample mean is commonly used as a point estimate of the
population mean.
• The population median $\tilde{\mu}$ is defined as the "middle value" (in
the same way as before) for the entire population. Again, the
sample median is commonly used as a point estimate of the
population median.
Trimmed mean
• Often the median and the mean are just two extremes. How do
we claim the middle ground? Consider the trimmed mean.
• The mean does not discard any observations while the median
discards almost everything. We can discard a predetermined
number or percentage of observations as an alternative.
• The following dataset gives the copper percentages in a sample
of 26 Bidri artifacts:
2.0 2.4 2.5 2.6 2.6 2.7 2.7 2.8 3.0 3.1 3.2 3.3 3.3
3.4 3.4 3.6 3.6 3.6 3.6 3.7 4.4 4.6 4.7 4.8 5.3 10.1
• The regular mean is $\bar{x} = 3.65$ and the median is $\tilde{x} = 3.35$.
The difference is due to the single large observation, 10.1%.
• The 7.7% trimmed mean is the result of removing the 2 smallest and 2
largest observations (2/26 ≈ 7.7% from each end): $\bar{x}_{tr(7.7)} = 3.42$.
• The 10% trimmed mean is an appropriate weighted average of the
7.7% trimmed mean (trimming two values at each end) and the
11.5% trimmed mean (trimming three values at each end).
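• R command: with(xmp01.14, mean(copper, trim = 0.1)). Note that R drops
floor(n × trim) observations from each end, so with n = 26 this call trims
2 values per side and returns the 7.7% trimmed mean, not an interpolated value.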
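The trimmed means for the Bidri data can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the helper `trimmed_mean` is a hypothetical name, and the "appropriate weighted average" is read here as linear interpolation in the trimming fraction, which is one plausible interpretation rather than a formula given on the slides.

```python
# Copper percentages for the 26 Bidri artifacts listed above.
copper = [2.0, 2.4, 2.5, 2.6, 2.6, 2.7, 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.3,
          3.4, 3.4, 3.6, 3.6, 3.6, 3.6, 3.7, 4.4, 4.6, 4.7, 4.8, 5.3, 10.1]


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and the k largest observations."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("trimming would remove every observation")
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


n = len(copper)
tm_2 = trimmed_mean(copper, 2)   # 2/26, about 7.7% from each end
tm_3 = trimmed_mean(copper, 3)   # 3/26, about 11.5% from each end

# Assumed interpolation: weight by how far 10% sits between 2/26 and 3/26.
w = (0.10 - 2 / n) / (3 / n - 2 / n)
tm_10 = (1 - w) * tm_2 + w * tm_3

print(round(tm_2, 2), round(tm_3, 3), round(tm_10, 2))
```

Under this reading the weight on the 11.5% trimmed mean is 0.6, and the interpolated 10% trimmed mean lands between the two neighbouring values.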
Measures of spread
• Variance measures the spread of the data
• The sample variance is
$s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}.$
• The sample standard deviation is $s = \sqrt{s^2}$, and n − 1 is
referred to as the number of degrees of freedom.
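The definition can be written out directly. The Python sketch below is illustrative: `sample_variance` is a hypothetical helper, and the receptor concentration values from the median example are reused purely as convenient input. The standard library's `statistics.variance` and `statistics.stdev`, which also divide by n − 1, serve as a cross-check.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


# Receptor concentrations from the median example, reused as sample input.
concentration = [7.6, 8.3, 9.3, 9.4, 9.4, 9.7, 10.4, 11.5, 11.9, 15.2, 16.2, 20.4]

s2 = sample_variance(concentration)
s = math.sqrt(s2)

print(s2, statistics.variance(concentration))  # the two should agree closely
print(s, statistics.stdev(concentration))
```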
• Consider the prefabricated plate example; there are n = 11
plate elements that have been subjected to a severe stress test.
If x is the length of the resulting cracks,
$\sum_{i=1}^{n} x_i = 18.349$ and $\bar{x} = \dfrac{18.349}{11} = 1.6681$. Thus,
$s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{11 - 1} = \dfrac{11.9359}{10} = 1.19359.$
• R command: with(xmp01.15, var(Strength))
• An alternative computing formula for $s^2$ is based on the fact
that
$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{n}$
• Thus, we can write
$s^2 = \dfrac{\sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{n}}{n - 1}$
• This formula is sensitive to rounding because it subtracts two large,
nearly equal quantities; carry as many decimal places as possible
through the intermediate sums.
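The agreement between the two formulas, and the accuracy concern, can be illustrated numerically. In this Python sketch the function names are hypothetical, the receptor concentration values from the median example are reused as input, and the shift by 1e9 is an artificial stress case invented for illustration (it is not from the lecture).

```python
def variance_definition(values):
    """s^2 from squared deviations about the mean."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


def variance_shortcut(values):
    """s^2 from the computing formula: (sum x^2 - (sum x)^2 / n) / (n - 1)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


data = [7.6, 8.3, 9.3, 9.4, 9.4, 9.7, 10.4, 11.5, 11.9, 15.2, 16.2, 20.4]
print(variance_definition(data), variance_shortcut(data))  # agree to many digits

# Artificial stress case: adding a constant should leave s^2 unchanged, but
# the shortcut now subtracts two numbers near 1e19 that differ only slightly.
shifted = [x + 1e9 for x in data]
print(variance_definition(shifted))  # stays close to the unshifted value
print(variance_shortcut(shifted))    # can be badly off in double precision
```

The definitional form works with small deviations and so keeps its accuracy; the shortcut is convenient for hand calculation but depends on the intermediate sums being carried at full precision.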
Some properties of standard deviation
• Let $x_1, \ldots, x_n$ be a sample of n observations and c any
non-zero constant. Denote by $s_x^2$ the sample variance of the x's.
• 1. If $y_1 = x_1 + c, \ldots, y_n = x_n + c$, then $s_y^2 = s_x^2$.
2. If $y_1 = c x_1, \ldots, y_n = c x_n$, then $s_y^2 = c^2 s_x^2$ and $s_y = |c| s_x$.
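Both properties are easy to confirm numerically. The sketch below uses Python's `statistics` module; the data are the receptor concentrations from the median example, and the constant c = −2.5 is an arbitrary choice made only for illustration.

```python
import math
import statistics

x = [7.6, 8.3, 9.3, 9.4, 9.4, 9.7, 10.4, 11.5, 11.9, 15.2, 16.2, 20.4]
c = -2.5   # arbitrary non-zero constant (an assumption for this example)

shifted = [xi + c for xi in x]   # y_i = x_i + c
scaled = [c * xi for xi in x]    # y_i = c * x_i

var_x = statistics.variance(x)
sd_x = statistics.stdev(x)

# Property 1: adding a constant leaves the variance unchanged.
print(math.isclose(statistics.variance(shifted), var_x, rel_tol=1e-9))

# Property 2: scaling multiplies the variance by c^2 and the sd by |c|.
print(math.isclose(statistics.variance(scaled), c ** 2 * var_x, rel_tol=1e-9))
print(math.isclose(statistics.stdev(scaled), abs(c) * sd_x, rel_tol=1e-9))
```

A negative c is used deliberately: it shows why the standard deviation scales by |c| rather than by c.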
The fourth spread
• To define an alternative measure of spread, again order the n
observations in a data set from smallest to largest. Then the lower
(upper) fourth is the median of the smallest (largest) half of the
data, where the median is included in both halves if n is odd.
• Then, the fourth spread is defined as
$f_s = \text{upper fourth} - \text{lower fourth}$
• Any observation farther than $1.5 f_s$ from the closest fourth is an
outlier. An outlier is extreme if it is more than $3 f_s$ from the
nearest fourth, and it is mild otherwise.
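The fourth spread and the outlier rule can be sketched in Python as follows. The helper names `fourths` and `classify_outliers` are hypothetical, and the copper percentages from the trimmed-mean example are reused as input because they contain both kinds of outlier.

```python
import statistics


def fourths(values):
    """Lower and upper fourths: medians of the smallest and largest halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2   # for odd n the median belongs to both halves
    lower = statistics.median(ordered[:half])
    upper = statistics.median(ordered[n - half:])
    return lower, upper


def classify_outliers(values):
    """Split outliers into mild (> 1.5 fs) and extreme (> 3 fs)."""
    lower, upper = fourths(values)
    fs = upper - lower
    mild, extreme = [], []
    for x in values:
        # Distance beyond the nearest fourth; negative for points between them.
        distance = max(lower - x, x - upper)
        if distance > 3 * fs:
            extreme.append(x)
        elif distance > 1.5 * fs:
            mild.append(x)
    return mild, extreme


copper = [2.0, 2.4, 2.5, 2.6, 2.6, 2.7, 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.3,
          3.4, 3.4, 3.6, 3.6, 3.6, 3.6, 3.7, 4.4, 4.6, 4.7, 4.8, 5.3, 10.1]

lower, upper = fourths(copper)
mild, extreme = classify_outliers(copper)
print(lower, upper, round(upper - lower, 2))
print("mild:", mild, "extreme:", extreme)
```

For these data the fourths are 2.7 and 3.7, so the fourth spread is 1.0; the value 5.3 lies just past the 1.5 f_s cutoff and 10.1 lies well past the 3 f_s cutoff.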
Boxplots
• The simplest boxplot is based on a five-number summary: 1) minimum,
2) lower fourth, 3) median, 4) upper fourth, 5) maximum.
• The "whiskers" mark the location of the smallest and the largest
observations.
• Example: corrosion data on the thickness of the floor plate of a crude oil
storage tank; each observation is the largest pit depth in milli-in.
• R command: with(xmp01.17, boxplot(depth, horizontal =
TRUE))
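The five numbers behind the simplest boxplot can be computed without any plotting library. In this Python sketch the function name `five_number_summary` is hypothetical; because the pit-depth values are not listed on the slides, the receptor concentrations from the median example are used as stand-in input.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower fourth, median, upper fourth, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2   # the median joins both halves when n is odd
    return (
        ordered[0],
        statistics.median(ordered[:half]),
        statistics.median(ordered),
        statistics.median(ordered[n - half:]),
        ordered[-1],
    )


# Stand-in data: receptor concentrations from the median example.
concentration = [7.6, 8.3, 9.3, 9.4, 9.4, 9.7, 10.4, 11.5, 11.9, 15.2, 16.2, 20.4]

summary = five_number_summary(concentration)
print([round(v, 2) for v in summary])
```

The box spans the lower and upper fourths with a line at the median, and the whiskers run out to the minimum and maximum.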