Example of empirical statistics of a set of data
BA_FSM 2014/2015

Data: body heights of n = 46 girls, students of the University of Finance and Administration.

Tab. 1.

| No. | body height, cm | No. | body height, cm |
|-----|-----------------|-----|-----------------|
| 89 | 151 | 90 | 167 |
| 34 | 157 | 1 | 168 |
| 51 | 158 | 45 | 168 |
| 94 | 158 | 40 | 168 |
| 32 | 160 | 82 | 168 |
| 41 | 161 | 92 | 168 |
| 83 | 162 | 95 | 170 |
| 31 | 163 | 2 | 170 |
| 81 | 163 | 85 | 170 |
| 4 | 164 | 35 | 170 |
| 33 | 164 | 80 | 170 |
| 37 | 164 | 50 | 171 |
| 87 | 164 | 36 | 172 |
| 88 | 164 | 6 | 173 |
| 7 | 165 | 46 | 173 |
| 3 | 165 | 47 | 173 |
| 39 | 165 | 38 | 175 |
| 84 | 165 | 43 | 176 |
| 96 | 165 | 93 | 176 |
| 49 | 166 | 86 | 176 |
| 44 | 167 | 42 | 177 |
| 91 | 167 | 5 | 180 |
| 48 | 167 | 97 | 185 |

Range of the set

$$R = x_{\max} - x_{\min} = 185 - 151 = 34 \text{ cm}$$

Quantiles, specifically the median and the quartiles

$x_{25} = 164$ cm, $x_{50} = 167$ cm, $x_{75} = 171{,}5$ cm

Arithmetic average (mean value)

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{46} x_i = \frac{151 + 157 + 2\cdot158 + 160 + 161 + 162 + 2\cdot163 + 5\cdot164 + 5\cdot165 + 166 + 4\cdot167 + 5\cdot168 + 5\cdot170 + 171 + 172 + 3\cdot173 + 175 + 3\cdot176 + 177 + 180 + 185}{46} = 167{,}59 \approx 167{,}6 \text{ cm}$$

This calculation and the following procedures can be simplified by dividing the data into a smaller number of groups (intervals). Sturges' rule, which gives the approximate number of intervals k, has the form

$$k = 1 + 3{,}3 \log_{10} n$$

In our case $k = 1 + 3{,}3 \log_{10} 46 = 6{,}49 \approx 6$. We build 6 intervals. Using the range of the set R we estimate the width of one interval as 34/6 = 5,67, which we round to 5 cm (5 cm is preferable to 6 cm because of the distant values at both ends of the data, which shift the mean value toward higher values).

1. interval: up to 157 cm
2. interval: 158-162 cm
3. interval: 163-167 cm
4. interval: 168-172 cm
5. interval: 173-177 cm
6. interval: 178 cm and more

As the representative value of body height in each interval we choose the center of the interval. For the two open outer intervals we keep the same 5 cm spacing, which gives the centers 155 cm and 180 cm.

Tab. 2.
| Interval, cm | Center of interval, cm | $x_i$ | $n_i$ | $n_i/n$ | $\sum n_i/n$ |
|--------------|------------------------|-------|-------|---------|--------------|
| up to 157 | 155 | 1 | 2 | 0,043 | 0,043 |
| 158-162 | 160 | 2 | 5 | 0,109 | 0,152 |
| 163-167 | 165 | 3 | 17 | 0,370 | 0,522 |
| 168-172 | 170 | 4 | 12 | 0,261 | 0,783 |
| 173-177 | 175 | 5 | 8 | 0,174 | 0,957 |
| 178 and more | 180 | 6 | 2 | 0,043 | 1,000 |
| ∑ | | | 46 | 1 | |

column marked $x_i$: scale elements
column marked $n_i$: absolute frequencies of scale elements
column marked $n_i/n$: relative frequencies of scale elements
column marked $\sum n_i/n$: cumulative relative frequencies

Fig. 1. Polygon of the relative frequencies ($n_i/n$ plotted against $x_i$).

Fig. 2. Polygon of the cumulative relative frequencies ($\sum n_i/n$ plotted against $x_i$).

Arithmetic average (mean value), using the representative values of the respective intervals

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{6} n_i x_i = \frac{2\cdot155 + 5\cdot160 + 17\cdot165 + 12\cdot170 + 8\cdot175 + 2\cdot180}{46} = 167{,}72 \approx 167{,}7 \text{ cm}$$

or alternatively in the values of the new scale

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{6} n_i x_i = \frac{1\cdot2 + 2\cdot5 + 3\cdot17 + 4\cdot12 + 5\cdot8 + 6\cdot2}{46} = 3{,}543$$

The two mean values in cm (167,6 from the raw data and 167,7 from the intervals) differ slightly because we used the interval centers instead of the mean values of the intervals.

Variance $S_x^2$ and standard deviation $S_x$ (using the interval centers)

$$S_x^2 = \frac{1}{n}\sum_{i=1}^{k} n_i (x_i - \bar{x})^2 = \frac{2(155-167{,}7)^2 + 5(160-167{,}7)^2 + 17(165-167{,}7)^2 + 12(170-167{,}7)^2 + 8(175-167{,}7)^2 + 2(180-167{,}7)^2}{46} = 33{,}38 \text{ cm}^2$$

alternatively, in the values of the new scale,

$$S_x^2 = \frac{2(1-3{,}543)^2 + 5(2-3{,}543)^2 + 17(3-3{,}543)^2 + 12(4-3{,}543)^2 + 8(5-3{,}543)^2 + 2(6-3{,}543)^2}{46} = 1{,}335$$

Standard deviation: $S_x = \sqrt{S_x^2} = 5{,}777$ cm, alternatively $S_x = 1{,}155$ on the scale 1 to 6.

The standard deviation shows how much weight the arithmetic average carries as a description of the data: if the standard deviation is high, the values are widely spread and the weight of the arithmetic average is low, and vice versa.

Moments

We calculate the moments of our set of data (experimental moments) and compare them with the corresponding moments of the standard normal (Gaussian) distribution. To calculate these moments we use the scale elements and the frequencies of the scale elements shown in Tab. 2. We distinguish:

General moments, i.e. parameter of position (location of the distribution),
Central moments, i.e. parameter of variance (width of the distribution),
Standardized moments, i.e. parameters of skewness and of kurtosis.

In order to calculate the general moments we form Tab. 3, which contains the following columns:

1. column contains $x_i$
2. column contains $n_i$
3. column contains the products $n_i x_i$
4. column contains the products $n_i x_i^2$
5. column contains the products $n_i x_i^3$
6. column contains the products $n_i x_i^4$

Tab. 3.

| $x_i$ | $n_i$ | $n_i x_i$ | $n_i x_i^2$ | $n_i x_i^3$ | $n_i x_i^4$ |
|-------|-------|-----------|-------------|-------------|-------------|
| 1 | 2 | 2 | 2 | 2 | 2 |
| 2 | 5 | 10 | 20 | 40 | 80 |
| 3 | 17 | 51 | 153 | 459 | 1377 |
| 4 | 12 | 48 | 192 | 768 | 3072 |
| 5 | 8 | 40 | 200 | 1000 | 5000 |
| 6 | 2 | 12 | 72 | 432 | 2592 |
| ∑ | 46 | 163 | 639 | 2701 | 12123 |

General moments

$$O_1 = \frac{1}{n}\sum_{i=1}^{6} n_i x_i = \frac{163}{46} = 3{,}543$$

$$O_2 = \frac{1}{n}\sum_{i=1}^{6} n_i x_i^2 = \frac{639}{46} = 13{,}89$$

$$O_3 = \frac{1}{n}\sum_{i=1}^{6} n_i x_i^3 = \frac{2701}{46} = 58{,}72$$

$$O_4 = \frac{1}{n}\sum_{i=1}^{6} n_i x_i^4 = \frac{12123}{46} = 263{,}5$$

The general moment of the first order, $O_1 = 3{,}543$, is the arithmetic average $\bar{x}$ expressed in elements of the scale 1 to 6. It is simple to transform this value to cm:

(a) 3,543 belongs to the 3rd interval, whose central value is 165 cm,
(b) to 165 cm we add the 0,543 part of the interval width, i.e. 0,543 · 5 cm,
(c) adding (a) and (b) gives the average value 167,7 cm.

The general moment of the first order $O_1$ is the parameter of position: the body heights are spread around the mean value 167,7 cm.

The central moments can be calculated simply from the general moments.

Central moments

$$C_1 = \frac{1}{n}\sum_{i=1}^{6} n_i (x_i - \bar{x})^1 = O_1 - O_1 = 0$$

$$C_2 = \frac{1}{n}\sum_{i=1}^{6} n_i (x_i - \bar{x})^2 = O_2 - O_1^2 = 1{,}335$$

$$C_3 = \frac{1}{n}\sum_{i=1}^{6} n_i (x_i - \bar{x})^3 = O_3 - 3 O_2 O_1 + 2 O_1^3 = 0{,}0323$$

$$C_4 = \frac{1}{n}\sum_{i=1}^{6} n_i (x_i - \bar{x})^4 = O_4 - 4 O_3 O_1 + 6 O_2 O_1^2 - 3 O_1^4 = 4{,}846$$

These expressions subtract large numbers of similar size, so the general moments should be inserted unrounded (163/46, 639/46, 2701/46, 12123/46); with the rounded values listed above the results for $C_3$ and $C_4$ shift noticeably.

Central moments are calculated with respect to the central value $\bar{x}$ (the arithmetic average). The sum of the deviations from $\bar{x}$ toward higher values (positive) and the sum of the deviations from $\bar{x}$ toward lower values (negative) have the same absolute value but opposite signs, so the central moment of the first order, $C_1$, is always equal to zero. The central moment of the second order, $C_2$, is the variance $S_x^2$ and it is the parameter of width.
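The general and central moments can be recomputed directly from the frequencies on the scale 1 to 6. The following is a minimal sketch, not part of the original handout: the names `scale`, `freq`, `general_moment` and `central_moment` are illustrative, and the conversion to centimetres assumes that scale element i has the interval center 150 + 5·i cm, as read off Tab. 2.

```python
# Frequencies on the scale 1..6 (columns x_i and n_i of the frequency table).
scale = [1, 2, 3, 4, 5, 6]
freq = [2, 5, 17, 12, 8, 2]
n = sum(freq)  # 46 students


def general_moment(order):
    """General moment O_r = (1/n) * sum(n_i * x_i**r)."""
    return sum(f * x**order for x, f in zip(scale, freq)) / n


def central_moment(order):
    """Central moment C_r = (1/n) * sum(n_i * (x_i - mean)**r)."""
    mean = general_moment(1)
    return sum(f * (x - mean) ** order for x, f in zip(scale, freq)) / n


O1, O2, O3, O4 = (general_moment(r) for r in (1, 2, 3, 4))
C1, C2, C3, C4 = (central_moment(r) for r in (1, 2, 3, 4))

# Shortcut formulas expressing the central moments through the general ones.
C2_short = O2 - O1**2
C3_short = O3 - 3 * O2 * O1 + 2 * O1**3
C4_short = O4 - 4 * O3 * O1 + 6 * O2 * O1**2 - 3 * O1**4

# Assumption: scale element i corresponds to the interval center 150 + 5*i cm.
mean_cm = 150 + 5 * O1
sd_cm = 5 * C2**0.5

print(f"O1={O1:.4f} O2={O2:.4f} O3={O3:.4f} O4={O4:.4f}")
print(f"C1={C1:.4f} C2={C2:.4f} C3={C3:.4f} C4={C4:.4f}")
print(f"mean={mean_cm:.2f} cm, standard deviation={sd_cm:.3f} cm")
```

Computing the central moments both from their definition and from the shortcut formulas is a cheap consistency check. Because the shortcut formulas subtract large, nearly equal numbers, they are sensitive to rounding of the general moments, so the sketch keeps full floating-point precision throughout.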
$S_x = \sqrt{C_2}$ is called the standard deviation. In our case $S_x = \sqrt{C_2} = 1{,}1554$. This value is again expressed in elements of the scale 1 to 6. To transform it to cm we multiply by the interval width: $S_x = 1{,}1554 \cdot 5$ cm $= 5{,}777$ cm. This agrees with the value 5,78 cm obtained above. The central moments of the third and fourth order are used to calculate further empirical parameters.

Standardized moments

The parameter of skewness is calculated from the standardized moment of the third order, $N_3$, and is called the coefficient of skewness.

$$N_3 = \frac{C_3}{C_2 \sqrt{C_2}} = 0{,}021$$

If the coefficient of skewness is positive, the elements of the scale to the left of the average have higher frequencies (positively skewed distribution of frequencies, i.e. a higher concentration of smaller elements of the scale), and vice versa for negative $N_3$. The present distribution is very slightly positively skewed, which means that there are somewhat more girls below the average body height of 167,7 cm than above it (see the input data).

The parameter of pointedness is usually determined from the standardized moment of the fourth order, $N_4$, and is called the coefficient of kurtosis.

$$N_4 = \frac{C_4}{C_2^2} = 2{,}719$$

The higher $N_4$ is, the more pointed the distribution of frequencies is at a given variance. The distribution of frequencies is usually compared with the standardized normal distribution. The standardized moments show in which features our data differ from the standardized normal (Gaussian) distribution, for which $N_3 = 0$ and $N_4 = 3$. The quantity "excess" is also used; it is defined by the relation $E_x = N_4 - 3$. If $E_x$ is positive, the studied empirical distribution is more pointed than the standardized normal distribution; if $E_x$ is negative, it is flatter than the standardized normal distribution. In our case:

$$E_x = N_4 - 3 = -0{,}281$$

This means that our distribution of frequencies is flatter than the standardized normal distribution.
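The standardized moments follow from the central moments in a few lines. The sketch below is illustrative only and recomputes the central moments from the frequencies on the scale 1 to 6 with full floating-point precision; the names `scale`, `freq`, `central_moment`, `skewness`, `kurtosis` and `excess` are not from the original text.

```python
from math import sqrt

# Frequencies on the scale 1..6 (interval number and number of students).
scale = [1, 2, 3, 4, 5, 6]
freq = [2, 5, 17, 12, 8, 2]
n = sum(freq)
mean = sum(f * x for x, f in zip(scale, freq)) / n


def central_moment(order):
    """Central moment C_r = (1/n) * sum(n_i * (x_i - mean)**r)."""
    return sum(f * (x - mean) ** order for x, f in zip(scale, freq)) / n


C2, C3, C4 = (central_moment(r) for r in (2, 3, 4))

skewness = C3 / (C2 * sqrt(C2))  # N3, coefficient of skewness
kurtosis = C4 / C2**2            # N4, coefficient of kurtosis
excess = kurtosis - 3            # Ex, zero for a normal distribution

print(f"N3 = {skewness:.3f}, N4 = {kurtosis:.3f}, Ex = {excess:.3f}")
```

The standardized moments are dimensionless, so they come out the same whether the moments are expressed on the scale 1 to 6 or in centimetres. This is why no conversion step is needed here.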
The meaning of the standard deviation

For the theoretical normal distribution:

the interval $\langle \bar{x} - S_x ;\ \bar{x} + S_x \rangle$ contains about 68 % of all values x,
the interval $\langle \bar{x} - 2S_x ;\ \bar{x} + 2S_x \rangle$ contains about 95 % of all values x,
the interval $\langle \bar{x} - 3S_x ;\ \bar{x} + 3S_x \rangle$ contains about 99,7 % of all values x.

In our case:

167,7 ± 5,8 cm: the body heights of 33 girls belong to this interval, i.e. 71,7 % of the girls.
167,7 ± 11,6 cm: the body heights of 43 girls belong to this interval, i.e. 93,5 % of the girls.
167,7 ± 17,4 cm: the body heights of all 46 girls belong to this interval, i.e. 100 % of the girls.

Conclusion: The empirical statistics of our data lead to the conclusion that our data are close to the theoretical normal distribution.
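The raw-data quantities and the interval counts can be checked against the 46 individual heights. This is a minimal sketch under stated assumptions: the list `heights` is transcribed from Tab. 1, the values 167,7 cm and 5,8 cm are the rounded grouped mean and standard deviation used in the text, and the helper name `count_within` is illustrative.

```python
from math import log10

# Body heights in cm, transcribed from Tab. 1 (46 values, ascending).
heights = [
    151, 157, 158, 158, 160, 161, 162, 163, 163, 164, 164, 164,
    164, 164, 165, 165, 165, 165, 165, 166, 167, 167, 167, 167,
    168, 168, 168, 168, 168, 170, 170, 170, 170, 170, 171, 172,
    173, 173, 173, 175, 176, 176, 176, 177, 180, 185,
]
n = len(heights)

data_range = max(heights) - min(heights)  # R = x_max - x_min
raw_mean = sum(heights) / n               # arithmetic average of the raw data
sturges_k = 1 + 3.3 * log10(n)            # Sturges' rule with base-10 logarithm

# Rounded grouped mean and standard deviation as used in the text (cm).
center, sd = 167.7, 5.8


def count_within(multiple):
    """Number of heights inside center +/- multiple * sd (bounds included)."""
    low, high = center - multiple * sd, center + multiple * sd
    return sum(1 for h in heights if low <= h <= high)


counts = [count_within(m) for m in (1, 2, 3)]
shares = [100 * c / n for c in counts]

print(f"n={n}, R={data_range} cm, mean={raw_mean:.2f} cm, k={sturges_k:.2f}")
for m, c, s in zip((1, 2, 3), counts, shares):
    print(f"within {m} standard deviation(s): {c} girls ({s:.1f} %)")
```

Comparing the three shares with roughly 68 %, 95 % and 99,7 % is only an informal check of normality. With 46 observations a deviation of a few percentage points is to be expected.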