MATH 20812: PRACTICAL STATISTICS I
SEMESTER 2

INTRODUCTION TO STATISTICS

1.1 A Definition of Statistics

Statistics is the analysis of numerical data for the purpose of reaching a decision or communicating information in the face of uncertainty (variability).

1.2 Role of Statistics

1. Description: summary calculations and graphical displays (e.g. minimum, maximum, average, histograms) necessary for statistical evaluations.
2. Inference/Induction: generalisation of the whole by a thorough examination of the part (e.g. political polls).
3. Deduction: ascribe properties of specific cases from the general situation (e.g. we can establish 0.04 as the probability that a randomly chosen student will be a mechanical engineer if we know that 4% of all students on campus are taking that major).

1.3 Elements of Statistical Data

1. Observation: the basic statistical element or single data point (e.g. starting salary of a graduating engineer, status of a machine as “defective”/“nondefective”).
2. Population: collection of all possible observations of a specified characteristic of interest (e.g. starting salaries of all graduating engineers).
3. Sample: collection of observations representing only a portion of the population (e.g. starting salaries of all graduating engineers from Plymouth).

1.4 Data Types

1. Quantitative data: numerical values.
2. Qualitative data: attributes (e.g. gender, occupation).

1.5 Data Displays

1.5.1 Histogram (Frequency Distribution)

Place the observations into a series of contiguous blocks called class intervals. Determine the number of observations falling into each interval. The respective count provides the class frequency. Draw a bar chart having frequency as the vertical axis and the observed values as the horizontal axis.
Cost per pound of raw materials used in processing batches of a chemical feedstock:

12.01  14.9   11.24  14.4   12.87  11.3   16.98  15.23  12.38  12.9
12.29  11.51  12.06  15.06  11.16  10.12  14.82  17.02  12.56  13.95
11.13  12.14  13.41  15.67  12.17  12.15  12.57  14.21  13.21  16.5
13.81  15.58  11.1   12.03  11.2   12.27  17.05  12.57  13.11  13.31
13.84  11.62  12.33  16.31  12.74  14.25  18.63  13.34  13.43  14.78

Class interval      Class frequency              Cumulative       Relative
(cost per pound)    (number of raw materials)    frequency        frequency
10.0-under 11.0      1                           1                 1/50
11.0-under 12.0      8                           1 + 8 = 9         8/50
12.0-under 13.0     16                           9 + 16 = 25      16/50
13.0-under 14.0      9                           25 + 9 = 34       9/50
14.0-under 15.0      6                           34 + 6 = 40       6/50
15.0-under 16.0      4                           40 + 4 = 44       4/50
16.0-under 17.0      3                           44 + 3 = 47       3/50
17.0-under 18.0      2                           47 + 2 = 49       2/50
18.0-under 19.0      1                           49 + 1 = 50       1/50
Total               50                                             1

[Figure: Histogram. Horizontal axis: Cost (£); vertical axis: Number of Raw Materials.]

1.5.2 Frequency Polygon

Plot a dot (•) above the midpoint of each interval at a height matching the class frequency. Connect the dots with line segments, and extend the outermost segments so that they touch the horizontal axis one-half of an interval width below the lowest interval and one-half of an interval width above the highest interval.

[Figure: Frequency Polygon. Horizontal axis: Cost (£); vertical axis: Number of Raw Materials.]

1.5.3 Ogive (Cumulative Frequency Distribution)

Plot a dot (•) above the upper class limit of each interval at a height equal to the cumulative frequency for that interval. Connect the dots by line segments, with the lowest line touching the horizontal axis at the lower limit of the smallest class.

[Figure: Cumulative Frequency Distribution. Horizontal axis: Cost (£); vertical axis: Cumulative Sum of Number of Raw Materials.]

1.5.4 Relative Frequency Distribution: same as the frequency polygon with the ordinates divided by the total number of observations.
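The class, cumulative and relative frequencies for the cost data can be reproduced with a short script. The following is a minimal Python sketch, not part of the original notes; the function name `class_frequencies` and the constants `LOWER`, `WIDTH` and `N_CLASSES` are illustrative choices that encode the nine unit-width class intervals used above.

```python
import math
from itertools import accumulate

# Cost-per-pound observations copied from the data listing above.
costs = [
    12.01, 14.9, 11.24, 14.4, 12.87, 11.3, 16.98, 15.23, 12.38, 12.9,
    12.29, 11.51, 12.06, 15.06, 11.16, 10.12, 14.82, 17.02, 12.56, 13.95,
    11.13, 12.14, 13.41, 15.67, 12.17, 12.15, 12.57, 14.21, 13.21, 16.5,
    13.81, 15.58, 11.1, 12.03, 11.2, 12.27, 17.05, 12.57, 13.11, 13.31,
    13.84, 11.62, 12.33, 16.31, 12.74, 14.25, 18.63, 13.34, 13.43, 14.78,
]

# Assumed layout: intervals 10.0-under 11.0, ..., 18.0-under 19.0.
LOWER, WIDTH, N_CLASSES = 10.0, 1.0, 9


def class_frequencies(data, lower=LOWER, width=WIDTH, n_classes=N_CLASSES):
    """Count how many observations fall into each 'a-under b' interval."""
    freq = [0] * n_classes
    for x in data:
        idx = math.floor((x - lower) / width)
        if not 0 <= idx < n_classes:
            raise ValueError(f"observation {x} lies outside the class intervals")
        freq[idx] += 1
    return freq


freq = class_frequencies(costs)               # class frequencies
cum = list(accumulate(freq))                  # cumulative frequencies (ogive heights)
rel = [f / len(costs) for f in freq]          # relative frequencies

for i, (f, c, r) in enumerate(zip(freq, cum, rel)):
    lo = LOWER + i * WIDTH
    print(f"{lo:.1f}-under {lo + WIDTH:.1f}: {f:2d} {c:2d} {r:.2f}")
```

The lists `freq`, `cum` and `rel` give the heights for the histogram, the ogive and the relative frequency distribution respectively.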
[Figure: Relative Frequency Distribution. Horizontal axis: Cost (£); vertical axis: Number of Raw Materials/50.]

1.5.5 Stem-and-Leaf Plot: arrange the data in a table by separating the value of each observation into a stem digit and a leaf digit.

Precipitation levels during the month of April:

2.9  3.0  1.8  4.2  1.2  4.5  4.0  1.6  3.6  3.7
3.2  1.5  5.4  3.6  2.3  3.6  3.6  2.7  3.2  4.0
3.9  2.1  2.9  2.9  1.0  2.2  5.4  3.5  3.6  4.0
4.0  4.0  0.3  2.2  3.3  3.8  4.8  3.3  2.7  1.8
4.4  2.6  3.9  0.8  3.1  3.1  3.7  0.3  1.5  3.4
3.4  3.3  1.2  5.9  5.0  3.4  2.6  3.3  5.8  0.6
0.7  2.9  3.1  2.9  2.0  3.2  3.4  2.9  0.5  2.4
1.1  0.7  2.6  2.9  3.7  3.6  2.9  1.2  0.4  2.8
2.2  2.0  4.1  3.8  2.4  1.8

STEM (number to the left of the decimal point) : LEAF (number to the right of the decimal point)

0 : 334
0 : 56778
1 : 01222
1 : 556888
2 : 001222344
2 : 66677899999999
3 : 011122233334444
3 : 56666667778899
4 : 00000124
4 : 58
5 : 044
5 : 89

1.5.6 Bar Chart: when each category or attribute occurs with some frequency, summarise the frequencies in a bar chart.

Number of professional women employed (in thousands) in 1986

Category                        Frequency
Engineering/Computer Science      347
Health care                      1937
Education                        2833
Social/Legal                      698
Arts/Athletics/Entertainment      901
All others                        355

[Figure: Bar Chart. Horizontal axis: category (Eng, Health, Edu, Social, Arts, Other); vertical axis: Number of Professional Women Employed.]

1.5.7 Pie Chart: size each piece of the pie according to the category's relative frequency, with the angle of the slice corresponding to its proportion of 360 degrees.

Number of professional women employed (in thousands) in 1986

Category                        Frequency    Angle (degrees)
Engineering/Computer Science      347        (347/7071)*360 = 18
Health care                      1937        (1937/7071)*360 = 99
Education                        2833        (2833/7071)*360 = 144
Social/Legal                      698        (698/7071)*360 = 36
Arts/Athletics/Entertainment      901        (901/7071)*360 = 46
All others                        355        (355/7071)*360 = 18

[Figure: Pie Chart with slices Eng, Health, Edu, Social, Arts, Other.]

1.5.8 Scatter Diagram: plot of two quantitative variables (one versus the other).
Year    Average Distance    Average Fuel Consumption
        (1,000 miles)       (gallons)
1960     9,450              661
1965     9,390              667
1970     9,980              735
1975     9,630              712
1976     9,760              711
1978    10,050              715
1979     9,480              664
1980     9,140              603
1981     9,000              579
1982     9,530              587
1983     9,650              578
1984     9,790              553
1985     9,830              549

[Figure: Scatter Diagram. Horizontal axis: Average Distance (Miles); vertical axis: Average Fuel Consumption (Gallons).]

1.5.9 Box Plot: shows the variability of the data (will revisit later).

Number of busy teleports in a computer network:

10  4  15  17  6  12  9  13  15  5

[Figure: Box Plot. Vertical axis: Number of Busy Teleports (4 to 17), with the Minimum, 1/4 way, 1/2 way, 3/4 way and Maximum points marked.]

1.6 Summary Data Measures

1. Parameter: summary of data when data constitute a population (e.g. average salary of all graduating engineers).
2. Statistic: summary of sample data (e.g. average salary of graduating engineers from Plymouth).

1.7 Summary Measures of Location

1.7.1 The Mean: the mean is the most commonly used measure of central tendency, indicating the central point around which observations tend to cluster.

For individual data such as: 755 613 584 693 622,

mean = (sum of observations)/(number of observations) = (755 + 613 + 584 + 693 + 622)/5 = 653.4.

For grouped data such as:

Class interval    Class interval    Frequency
(Age)             midpoint
20-under 25       22.5               5
25-under 30       27.5              13
30-under 35       32.5              12
35-under 40       37.5               8
40-under 80       60                12

mean = (sum of (frequency*midpoint))/(number of observations) = (5*22.5 + 13*27.5 + 12*32.5 + 8*37.5 + 12*60)/50 = 37.6.

1.7.2 The Median: another measure of central tendency, especially useful when data has a skewed frequency distribution.
For individual data such as: 755 613 584 693 622, the median is located as follows:

1. order the data by increasing magnitude to get: 584 613 622 693 755;
2. the median is the value above and below which an equal number of observations lie (the middle value), so median = 622;
3. when there is an even number of observations the median is the average of the middle two.

For grouped data such as:

Class interval    Frequency    Cumulative
(Age)                          frequency
20-under 25        5            5
25-under 30       13           18
30-under 35       12           30
35-under 40        8           38
40-under 80       12           50

use the following scheme:

1. calculate (number of observations + 1)/2 = (50 + 1)/2 = 25.5;
2. find the greatest cumulative class frequency less than or equal to 25.5 (18) and note down the upper limit of the corresponding interval (30);
3. find the cumulative frequency and the upper class limit of the next higher interval (30 and 35 respectively);
4. then, median = 30 + (25.5 - 18)*(35 - 30)/(30 - 18) = 33.1.

1.7.3 The Mode: the most frequently occurring value. For individual data such as: 0 0 3 7 1 0 2, mode = 0. For grouped data it is the midpoint of the class interval with the highest frequency (for the age data, mode = 27.5).

1.7.4 Percentile: a point below which a stated percentage of the observations lie, e.g. the median is the 50th percentile. To find the 100pth percentile, where p is any value between 0 and 1, use the following scheme:

1. let n represent the total number of observations and let k be the largest integer less than or equal to (n + 1)*p;
2. arrange the data values by increasing magnitude;
3. find the kth and the (k + 1)st values in this ordering (counting up from the smallest) and denote them by a and b respectively;
4. then, 100pth percentile = a + ((n + 1)*p - k)*(b - a).
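The percentile scheme for individual data can be sketched in Python as follows. This is an illustrative sketch rather than part of the original notes: the function name `percentile` is hypothetical, and the handling of the edge cases (raising an error when k falls outside the data, and returning the largest value when (n + 1)*p equals n exactly) is an assumption, since the scheme above does not say what to do there.

```python
import math


def percentile(data, p):
    """Return the 100p-th percentile using the (n + 1)*p scheme.

    k is the largest integer <= (n + 1)*p, a and b are the kth and
    (k + 1)st values in increasing order, and the result is
    a + ((n + 1)*p - k)*(b - a).
    """
    values = sorted(data)            # arrange by increasing magnitude
    n = len(values)
    pos = (n + 1) * p
    k = math.floor(pos)
    # Assumption: the scheme is undefined when k < 1 or when it would
    # need a value beyond the largest observation, so reject those cases.
    if k < 1 or k > n or (k == n and pos > n):
        raise ValueError("(n + 1)*p falls outside the range of the data")
    a = values[k - 1]                # kth value (1-based counting)
    b = values[k] if k < n else a    # (k + 1)st value, if it exists
    return a + (pos - k) * (b - a)


observations = [755, 613, 584, 693, 622]
median = percentile(observations, 0.50)   # 50th percentile
q1 = percentile(observations, 0.25)       # 25th percentile
q3 = percentile(observations, 0.75)       # 75th percentile
print(median, q1, q3)
```

For the five observations the 50th percentile coincides with the median of 622 found in section 1.7.2, because (n + 1)*p = 3 is a whole number and the interpolation term vanishes.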
For grouped data a similar scheme applies:

1. let k be the greatest cumulative frequency less than or equal to (n + 1)*p and let a be the upper limit of the corresponding interval;
2. let h and b denote the cumulative frequency and upper limit of the next higher interval;
3. then, 100pth percentile = a + ((n + 1)*p - k)*((b - a)/(h - k)).

1.8 Summary Measures of Variability

1.8.1 The Range: the largest observation minus the smallest observation. For individual data such as: 755 613 584 693 622, the range = 755 - 584 = 171. For grouped data it is the difference between the lowest and highest class limits (for the age data, the range = 80 - 20 = 60). In a box plot the range is the overall length of the plot.

1.8.2 Interquartile Range: the difference between the 75th percentile and the 25th percentile, representing the scatter in the middle 50% of the observations. In a box plot it is the length of the box.

1.8.3 Variance: the most useful measure of variability, based on the deviations of individual observations about the mean. For individual data such as: 755 613 584 693 622, we calculate it as follows:

1. compute the mean; we know it to be 653.4 from above;
2. take the deviation of each observation from the mean:

   755 - 653.4 = 101.6
   613 - 653.4 = -40.4
   584 - 653.4 = -69.4
   693 - 653.4 = 39.6
   622 - 653.4 = -31.4;

3. then, variance = (sum of the (deviation)^2)/(number of observations) = ((101.6)^2 + (-40.4)^2 + (-69.4)^2 + (39.6)^2 + (-31.4)^2)/5 = 3,865.04;
4. if the observations are from a sample, then variance = (sum of the (deviation)^2)/(number of observations - 1).
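The variance calculation for individual data can be checked with a few lines of Python. The sketch below is illustrative only; the function name `variance` and its `sample` flag are assumptions introduced here to switch between the two divisors described above.

```python
def variance(data, sample=False):
    """Variance of individual observations.

    Divides the sum of squared deviations by n for a population,
    or by n - 1 when sample=True.
    """
    n = len(data)
    mean = sum(data) / n
    squared_deviations = [(x - mean) ** 2 for x in data]
    divisor = n - 1 if sample else n
    return sum(squared_deviations) / divisor


observations = [755, 613, 584, 693, 622]
population_variance = variance(observations)            # divisor n = 5
sample_variance = variance(observations, sample=True)   # divisor n - 1 = 4
print(round(population_variance, 2), round(sample_variance, 2))
```

Python's standard `statistics` module provides `pvariance` and `variance`, which compute the population and sample versions respectively and can serve as an independent check.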
For grouped data such as:

Class interval    Class interval    Frequency
(Age)             midpoint
20-under 25       22.5               5
25-under 30       27.5              13
30-under 35       32.5              12
35-under 40       37.5               8
40-under 80       60                12

use the following scheme:

1. compute the mean; we know it to be 37.6;
2. take the deviation of each class interval midpoint from the mean:

   22.5 - 37.6 = -15.1
   27.5 - 37.6 = -10.1
   32.5 - 37.6 = -5.1
   37.5 - 37.6 = -0.1
   60 - 37.6 = 22.4;

3. then, variance = (sum of (class frequency*(deviation)^2))/(number of observations) = (5*(-15.1)^2 + 13*(-10.1)^2 + 12*(-5.1)^2 + 8*(-0.1)^2 + 12*(22.4)^2)/50 = 175.99;
4. replace (number of observations) by (number of observations - 1) if the observations are from a sample.

1.8.4 Standard Deviation: the positive square root of the variance; a more convenient summary than the variance because it is in the same units as the observations themselves. It is useful in describing the frequency distributions of many populations. In particular, when the normal curve fits the frequency distribution, roughly 68% of all population values lie within mean ± standard deviation, roughly 95% within mean ± 2*standard deviation, and over 99% within mean ± 3*standard deviation.

1.9 Composite Measures:

Coefficient of variation = (standard deviation)/(mean) measures the variability relative to the mean.

Coefficient of skewness = 3*(mean - median)/(standard deviation) expresses the direction of and the degree to which a frequency distribution is skewed.

1.10 Formulae for Sample Mean and Variance: mathematical representations of the forms introduced in sections 1.7.1 and 1.8.3. First, some notation:

$X_i$: the ith observation when the data are individual, and the midpoint of the ith class interval when the data are grouped;
$f_i$: the frequency of the ith class interval when the data are grouped;
$n$: the number of observations;
$m$: the number of class intervals when the data are grouped;
$\bar{X}$: the sample mean;
$s^2$: the sample variance.
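We have the following:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \quad \text{for individual data;}$$

$$\bar{X} = \frac{\sum_{i=1}^{m} f_i X_i}{n} \quad \left(\text{where } n = \sum_{i=1}^{m} f_i\right) \quad \text{for grouped data;}$$

$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n - 1} \quad \text{for individual data;}$$

$$s^2 = \frac{\sum_{i=1}^{m} f_i (X_i - \bar{X})^2}{n - 1} = \frac{\sum_{i=1}^{m} f_i X_i^2 - n\bar{X}^2}{n - 1} \quad \text{for grouped data.}$$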
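The grouped-data formulae can be illustrated on the age data from section 1.7.1. The following Python sketch is an illustration under stated assumptions, not part of the original notes: the function names are hypothetical, and the class midpoints stand in for the unknown individual ages, exactly as the formulae prescribe. It evaluates the sample variance both from the definition and from the shortcut form to show that the two agree.

```python
def grouped_mean(midpoints, freqs):
    """Sample mean for grouped data: sum(f_i * X_i) / n, with n = sum(f_i)."""
    n = sum(freqs)
    return sum(f * x for x, f in zip(midpoints, freqs)) / n


def grouped_sample_variance(midpoints, freqs):
    """Sample variance from the definition: sum(f_i * (X_i - mean)^2) / (n - 1)."""
    n = sum(freqs)
    mean = grouped_mean(midpoints, freqs)
    return sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs)) / (n - 1)


def grouped_sample_variance_shortcut(midpoints, freqs):
    """Sample variance from the shortcut: (sum(f_i * X_i^2) - n * mean^2) / (n - 1)."""
    n = sum(freqs)
    mean = grouped_mean(midpoints, freqs)
    sum_fx2 = sum(f * x * x for x, f in zip(midpoints, freqs))
    return (sum_fx2 - n * mean ** 2) / (n - 1)


# Age data: class interval midpoints and class frequencies.
midpoints = [22.5, 27.5, 32.5, 37.5, 60]
freqs = [5, 13, 12, 8, 12]

mean = grouped_mean(midpoints, freqs)
s2_definition = grouped_sample_variance(midpoints, freqs)
s2_shortcut = grouped_sample_variance_shortcut(midpoints, freqs)
print(mean, s2_definition, s2_shortcut)
```

The computed mean matches the value 37.6 found in section 1.7.1. The shortcut form needs only the running totals of f_i*X_i and f_i*X_i^2, which is convenient for hand calculation, but it subtracts two large numbers and can lose precision in floating-point arithmetic when the mean is large relative to the spread.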