Statistics 511: Statistical Methods
Dr. Levine, Purdue University, Fall 2011
Lecture 2: Measures of Location and Variability
Devore: Sections 1.3-1.4
Oct, 2011

Sample mean
• The sample mean of the n numbers $x_1, \ldots, x_n$ is $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$.
• Alternative notation: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$.
• The following dataset shows the crack length in µm for smooth bar tensile specimens subjected to stress corrosion tests. The data are $x_1 = 16.1$, $x_2 = 9.6$, $x_3 = 24.9$, ..., $x_{21} = 28.5$.
• The mean is $\bar{x} = \frac{444.8}{21} = 21.18$.
• Physical interpretation of the mean: the balance point for a system of weights.
• R commands: with(xmp01.12, mean(crackLength)) and with(xmp01.12, sum(crackLength))
• Lack of robustness is the main shortcoming of the mean as a measure of center. In the example above, the value $x_{14} = 45.0$ is an outlier. Without it, we have $\bar{x} = \frac{399.8}{20} = 19.99$.

Sample median
• The sample median $\tilde{x}$ is the middle value in the set of data after it has been arranged in ascending order. For an even number of data points, the median is the average of the middle two.
• More precisely, suppose the number of observations n is odd. Then the median $\tilde{x}$ is observation number $\frac{n+1}{2}$ in the ordered list.
• In the same way, if n is even, the median is defined as the average of the $\frac{n}{2}$th and $\left(\frac{n}{2} + 1\right)$th ordered observations.
• The median is a robust measure of the data center, unlike the mean.
• The mean and the median are generally not the same.
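The positional definition of the sample median given above can be sketched in R as follows. This is a minimal illustration rather than part of the original slides: the helper name median_by_position and the two small vectors are hypothetical (made-up numbers, not lecture data), and base R's median() serves only as a cross-check.

```r
# Hypothetical helper: sample median computed from the positional definition.
median_by_position <- function(x) {
  x_sorted <- sort(x)            # arrange the data in ascending order
  n <- length(x_sorted)
  if (n %% 2 == 1) {
    # n odd: the median is observation number (n + 1) / 2
    x_sorted[(n + 1) / 2]
  } else {
    # n even: average of observations number n / 2 and n / 2 + 1
    (x_sorted[n / 2] + x_sorted[n / 2 + 1]) / 2
  }
}

# Illustrative (made-up) data, not taken from the lecture
odd_sample  <- c(5, 1, 9, 3, 7)
even_sample <- c(5, 1, 9, 3, 7, 100)

median_by_position(odd_sample)    # 5
median_by_position(even_sample)   # (5 + 7) / 2 = 6

# Cross-check against base R
all.equal(median_by_position(even_sample), median(even_sample))
```

Adding the extreme value 100 moves the median only from 5 to 6, while the mean of the same values moves from 5 to about 20.8, which illustrates the robustness claim in the bullets above.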

Median calculation example
• The following data give the concentration of a specific receptor for a sample of women with evidence of iron-deficiency anemia. When ordered, they are
7.6 8.3 9.3 9.4 9.4 9.7 10.4 11.5 11.9 15.2 16.2 20.4
• As n = 12 is even, the median is $\tilde{x} = \frac{9.7 + 10.4}{2} = 10.05$.
• An R command: with(xmp01.13, median(concentration))
• What would happen if the largest observation, 20.4, were not there?

Population mean and median
• The population mean is defined as the sum of the N population values divided by N.
• The sample mean is commonly used as a point estimate of the population mean.
• The population median $\tilde{\mu}$ is defined as the "middle value" (in the same way as before) for the entire population. Again, the sample median is commonly used as a point estimate of the population median.

Trimmed mean
• The mean and the median are often two extremes. How do we claim the middle ground? Consider the trimmed mean.
• The mean does not discard any observations, while the median discards almost everything. As an alternative, we can discard a predetermined number or percentage of observations from each end.
• The following dataset gives the copper percentages in a sample of 26 Bidri artifacts:
2.0 2.4 2.5 2.6 2.6 2.7 2.7 2.8 3.0 3.1 3.2 3.3 3.3 3.4 3.4 3.6 3.6 3.6 3.6 3.7 4.4 4.6 4.7 4.8 5.3 10.1
• The regular mean is $\bar{x} = 3.65$ and the median is $\tilde{x} = 3.35$. The difference is due to the single large observation, 10.1%.
• The 7.7% trimmed mean is the result of removing the 2 smallest and the 2 largest observations (2/26 ≈ 7.7% at each end): $\bar{x}_{tr(7.7)} = 3.42$.
• The 10% trimmed mean is an appropriately weighted average of the 7.7% trimmed mean (trimming two values at each end) and the 11.5% trimmed mean (trimming three values at each end).
• R command: with(xmp01.14, mean(copper, trim = 0.1)). Note that R's mean() drops floor(n * trim) observations from each end, so with n = 26 this call trims two values per end and returns the 7.7% trimmed mean rather than an interpolated 10% value.

Measures of spread
• Variance measures the spread of the data.
• The sample variance is $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$.
• The sample standard deviation is $s = \sqrt{s^2}$, and n − 1 is referred to as the number of degrees of freedom.
• Consider the prefabricated plate example: there are n = 11 plate elements that have been subjected to a severe stress test. If x is the length of the resulting cracks, then $\sum_{i=1}^{n} x_i = 18.349$ and $\bar{x} = \frac{18.349}{11} = 1.6681$. Thus, $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{11 - 1} = \frac{11.9359}{10} = 1.19359$.
• R command: with(xmp01.15, var(Strength))
• An alternative computing formula for $s^2$ is based on the fact that $S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}$.
• Thus, we can write $s^2 = \frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}{n - 1}$.
• This formula is sensitive to rounding, so it should be used with as many decimal places as possible in the intermediate sums.

Some properties of standard deviation
• Let $x_1, \ldots, x_n$ be a sample of n observations and c any non-zero constant. Denote by $s_x^2$ the sample variance of the x's.
• 1. If $y_1 = x_1 + c, \ldots, y_n = x_n + c$, then $s_y^2 = s_x^2$.
  2. If $y_1 = c x_1, \ldots, y_n = c x_n$, then $s_y^2 = c^2 s_x^2$ and $s_y = |c| s_x$.

The fourth spread
• To define an alternative measure of spread, again order the n observations in a data set from smallest to largest.
Then the lower (upper) fourth is the median of the smallest (largest) half of the data, where the median $\tilde{x}$ is included in both halves if n is odd.
• The fourth spread is then defined as $f_s = \text{upper fourth} - \text{lower fourth}$.
• Any observation farther than $1.5 f_s$ from the closest fourth is an outlier. An outlier is extreme if it is more than $3 f_s$ from the nearest fourth, and it is mild otherwise.

Boxplots
• The simplest boxplot is based on the five-number summary: 1) minimum, 2) lower fourth, 3) median, 4) upper fourth, 5) maximum.
• The "whiskers" mark the location of the smallest and the largest observations.
• Example: corrosion data on the thickness of the floor plate of a crude oil storage tank; each observation is the largest pit depth in milli-in.
• R command: with(xmp01.17, boxplot(depth, horizontal = TRUE))
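
The fourth spread and the outlier rules above can be illustrated with the copper percentages of the 26 Bidri artifacts quoted earlier in the lecture. This is a sketch under stated assumptions, not code from the slides: the data are typed in by hand instead of loaded from xmp01.14, the variable names are illustrative, and it relies on base R's fivenum(), whose hinges follow the same "median of each half, with the median in both halves when n is odd" rule as the fourths.

```r
# Copper percentages for the 26 Bidri artifacts, as listed in the lecture
copper <- c(2.0, 2.4, 2.5, 2.6, 2.6, 2.7, 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.3,
            3.4, 3.4, 3.6, 3.6, 3.6, 3.6, 3.7, 4.4, 4.6, 4.7, 4.8, 5.3, 10.1)

# fivenum() returns: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(copper)
lower_fourth <- five[2]
upper_fourth <- five[4]
fs <- upper_fourth - lower_fourth   # fourth spread

# Distance of each observation beyond its nearest fourth
# (0 for observations lying between the two fourths)
beyond <- pmax(lower_fourth - copper, copper - upper_fourth, 0)

# Apply the 1.5 * fs (outlier) and 3 * fs (extreme outlier) rules
status <- ifelse(beyond > 3 * fs, "extreme outlier",
          ifelse(beyond > 1.5 * fs, "mild outlier", "not an outlier"))

# Show only the flagged observations
data.frame(copper, status)[status != "not an outlier", ]
```

For these data the fourths are 2.7 and 3.7, so $f_s = 1.0$; the observation 5.3 is flagged as a mild outlier and 10.1 as an extreme outlier.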