Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged What is biostatisics? Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data. Biostatistics or biometry is the application of statistics to a wide range of topics in biology. It has particular applications to medicine and to agriculture. Krisztina Boda INTERREG 2 Application of biostatistics Krisztina Boda Research Design and analysis of clinical trials in medicine Public health, including epidemiology, INTERREG 3 Krisztina Boda INTERREG 4 Biostatistical methods Descriptive statistics Hypothesis tests (statistical tests) They depend on: the type of data the nature of the problem the statistical model Krisztina Boda INTERREG 5 The data set A data set contains information on a number of individuals. Individuals are objects described by a set of data, they may be people, animals or things. For each individual, the data give values for one or more variables. A variable describes some characteristic of an individual, such as person's age, height, gender or salary. Krisztina Boda INTERREG 6 The data-table Data of one experimental unit (“individual”) must be in one record (row) Data of the answers to the same question (variables) must be in the same field of the record (column) Number SEX AGE .... 1 1 20 .... 2 2 17 .... . . . ... Krisztina Boda INTERREG 7 Variables Categorical (discrete) A discrete random variable X has finite number of possible values Krisztina Boda Continuous A continuous random variable X has takes all values in an interval of numbers. Concentration Temperature … Gender Blood group Number of children … INTERREG 8 Types of data from two aspects Based on the number of values they can have discrete (categorical) Continuous Based on the property they represent Qualitative data nominal data (they can be distinguished by their names ) ordinal data (there are categories of classification it may be possible to order) Sex, blood-group, number of children Age, temperature, concentration Quantitative (or numerical) data Krisztina Boda Example Example Qualitative data Quantitative (or numerical) data INTERREG Sex, blood-group very good-good – acceptable- wrong - very wrong-very wrong, low - normal - high, Age, number of children 9 Distribution of variables Continuous: the distribution of a continuous variable describes what values it takes and how often these values fall into an interval. Discrete: the distribution of a categorical variable describes what values it takes and how often it takes these values. Histogram 10 SEX 14 8 12 10 6 8 4 6 Frequency Frequency 4 2 0 male female 0 5.0 SEX Krisztina Boda 2 15.0 25.0 35.0 45.0 55.0 65.0 age in years INTERREG 10 The distribution of a continuous variable, example 20.00 17.00 22.00 28.00 9.00 5.00 26.00 60.00 35.00 51.00 17.00 50.00 9.00 10.00 19.00 22.00 25.00 29.00 27.00 19.00 Krisztina Boda Categories: 0-10 11-205 21-30 31-40 41-50 51-60 Frequencies 4 7 1 1 2 8 7 6 Frequency Values: 5 4 3 2 1 0 0-10 11-20 21-30 31-40 41-50 51-60 Age INTERREG 11 The length of the intervals (or the number of intervals) affect a histogram 8 10 7 9 8 6 7 count count 5 4 5 4 3 3 2 2 1 1 0 0 0-10 11-20 21-30 31-40 41-50 51-60 age Krisztina Boda 6 0-20 21-40 41-60 age INTERREG 12 The overall pattern of a distribution The center, spread and shape describe the overall pattern of a distribution. Some distributions have simple shape, such as symmetric and skewed. Not all distributions have a simple overall shape, especially when there are few observations. A distribution is skewed to the right if the right side of the histogram extends much farther out then the left side. Krisztina Boda INTERREG 13 Histogram of body weights (kg) Hisztogram Jelenlegi testsúlyok 300 200 100 Std. D ev = 8.74 M ean = 57.0 N = 1090.00 0 32.5 37.5 42.5 47.5 52.5 57.5 62.5 67.5 72.5 77.5 82.5 87.5 Jelenlegi testsúlya /kg/ Krisztina Boda INTERREG 14 Outliers Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them (real data, typing mistake or other). 10 8 6 4 2 Std. Dev = 13.79 Mean = 62.1 N = 4 3.00 0 40.0 50.0 45.0 60.0 55.0 70.0 65.0 80.0 75.0 90.0 85.0 100.0 95.0 110.0 105.0 Jelenlegi testsúlya Krisztina Boda INTERREG 15 Describing distributions with numbers Measures of central tendency: the mean, the mode and the median are three commonly used measures of the center. Measures of variability : the range, the quartiles, the variance, the standard deviation are the most commonly used measures of variability . Measures of an individual: rank, z score Krisztina Boda INTERREG 16 Measures of the center n Krisztina Boda Mean: x x1 x 2 ... x n n x i 1 i n Mode: is the most frequent number Median: is the value that half the members of the sample fall below and half above. In other words, it is the middle number when the sample elements are written in numerical order INTERREG Example: 1,2,4,1 Mean Mode Median 17 Measures of the center n Mean: x Krisztina Boda x1 x 2 ... x n n x i 1 n Mode: is the most frequent number Median: is the value that half the members of the sample fall below and half above. In other words, it is the middle number when the sample elements are written in numerical order i INTERREG Example: 1,2,4,1 Mean=8/4=2 Mode=1 Median First sort data 1124 Then find the element(s) in the middle If the sample size is odd, the unique middle element is the median If the sample size is even, the median is the average of the two central elements 1124 Median=1.5 18 Example The grades of a test written by 11 students were the following: 100 100 100 63 62 60 12 12 6 2 0. A student indicated that the class average was 47, which he felt was rather low. The professor stated that nevertheless there were more 100s than any other grade. The department head said that the middle grade was 60, which was not unusual. The mean is 517/11=47, the mode is 100, the median is 60. Krisztina Boda INTERREG 19 Relationships among the mean(m), the median(M) and the mode(Mo) A symmetric curve m=M=Mo A curve skewed to the right Mo<M< m A curve skewed to the left M < M < Mo Krisztina Boda INTERREG 20 Measures of variability (dispersion) The range is the difference between the largest number (maximum) and the smallest number (minimum). Percentiles (5%-95%): 5% percentile is the value below which 5% of the cases fall. Quartiles: 25%, 50%, 75% percentiles n The variance= SD 2 (x i i 1 x) 2 n 1 n Krisztina Boda The standard deviation: SD INTERREG ( x x) i 1 i n 1 2 var iance 21 Example Data: 1 2 4 1, in ascending order: 1 1 2 4 Range: max-min=4-1=3 Percentiles Quartiles: Standard deviation: Weighted Average(Definition 1) Tukey's Hinges xi xi x Percentiles 25 50 1.0000 1.5000 1.0000 1.5000 75 3.5000 3.0000 ( xi x) 2 n 1 1 2 4 Total Krisztina Boda 1-2=-1 1-2=-1 2-2=0 4-2=2 0 1 1 0 4 6 SD INTERREG ( x x) i 1 i n 1 2 6 2 1.414 3 22 The meaning of the standard deviation Krisztina Boda A measure of dispersion around the mean. In a normal distribution, 68% of cases fall within one standard deviation of the mean and 95% of cases fall within two standard deviations. For example, if the mean age is 45, with a standard deviation of 10, 95% of the cases would be between 25 and 65 in a normal distribution. INTERREG 23 The use of sample characteristics in summary tables Center Dispersion Publish Mean Standard deviation, Standard error Median Min, max 5%, 95%s percentile 25 % , 75% (quartiles) Mean (SD) Mean SD Mean SE Mean SEM Med (min, max) Med(25%, 75%) Krisztina Boda INTERREG 24 Displaying data Categorical data Kördiagram Apja iskolai végzettsége Oszlopdiagram 40 8 ált.-nal kevesebb nincs válasz 30 bar chart pie chart 8 ált. felsőfokú végzettség 20 gimnáziumi érettségi Percent 10 szakmunkásképző szakközépiskolai ére 0 8 ált.-nal kevesebb 8 ált. szakmunkásképző gimnáziumi érettségi nincs válasz szakközépiskolai ére fels őfokú végzettség Apja legmagasabb is kolai végzettsége Histogram (kerd97.STA 20v*43c) 12 10 8 Box Plot (kerd97 20v*43c) 100 4 90 80 2 70 0 35 40 45 50 55 60 65 70 75 80 85 90 95 NEM: fiú SULY 35 40 45 60 50 55 60 65 70 75 80 85 90 95 SULY NEM: lány 50 40 Median 25%-75% Min-Max Extremes Mean Plot (kerd97 20v*43c) 30 fiú 85 80 lány NEM 75 70 65 SULY histogram box-whisker plot mean-standard deviation plot scatter plot 6 60 55 Szóródási diagram 50 120 45 fiú Mean Mean±SD 100 lány NEM 80 Jelenlegi testsúlya /kg/ Continuous data No of obs 60 40 20 0 40 60 Kivánatosnak tartott testsúlya /kg/ Krisztina Boda INTERREG 80 100 25 Distribution of body weights The distribution is skewed in case of girls Histogram (kerd97.STA 20v*43c) 12 10 8 6 No of obs 4 2 0 35 40 45 50 55 60 65 70 75 80 85 90 95 35 40 45 50 55 60 65 70 75 80 85 90 95 NEM: fiú NEM: lány boys 1. Leíró statisztika Krisztina Boda SULY INTERREG girls 26 Histogram (kerd97.STA 20v*43c) 12 10 8 6 No of obs 4 2 0 35 40 45 50 55 60 65 70 75 80 85 90 95 35 40 45 50 55 60 65 70 75 80 85 90 95 NEM = 2.00 NEM: f iú NEM: lány SULY NEM = 1.00 SULY SULY 40 65 70 75 80 80 Jelenlegi testsúlya Jelenlegi testsúlya Krisztina Boda 60 85 INTERREG 27 Mean-dispersion diagrams Mean Plot (kerd97 20v*43c) 85 80 70 65 SULY Mean + SD Mean + SE Mean + 95% CI 75 60 55 50 45 fiú lány Mean Mean±SE NEM Mean SE Mean Plot (kerd97 20v*43c) 85 Mean Plot (kerd97 20v*43c) 85 80 80 75 75 70 70 65 SULY SULY 65 60 60 55 55 50 50 45 45 fiú lány fiú Mean Mean±0.95 Conf. Interval lány Mean Mean±SD NEM NEM Mean 95% CI Krisztina Boda Mean SD INTERREG 28 Box diagram Box Plot (kerd97 20v*43c) Box Plot (kerd97 20v*43c) 100 100 90 90 80 80 70 70 SULY SULY 60 60 50 50 40 40 30 fiú lány Median 25%-75% Non-Outlier Range Extremes 30 fiú lány Median 25%-75% Min-Max Extremes NEM NEM A box plot, sometimes called a box-and-whisker plot displays the median, quartiles, and minimum and maximum observations . Krisztina Boda INTERREG 29 Transformations of data values Addition, subtraction Adding (or subtracting) the same number to each data value in a variable shifts each measures of center by the amount added (subtracted). Adding (or subtracting) the same number to each data value in a variable does not change measures of dispersion. Krisztina Boda INTERREG 30 Transformations of data values Multiplication, division Measures of center and spread change in predictable ways when we multiply or divide each data value by the same number. Multiplying (or dividing) each data value by the same number multiplies (or divides) all measures of center or spread by that value. Krisztina Boda INTERREG 31 Proof. The effect of linear transformations Let the transformation be x ->ax+b Mean: ax b ax b ax b ... ax b a( x x ... x ) nb n i 1 i n 1 2 n 1 2 n n ax b Standard deviation: n ((axi b) (a x b)) n 2 i 1 n 1 n a ( xi x) 2 Krisztina Boda n i 1 n 1 ((axi b a x b)) i 1 n 1 n 2 2 ( ax a x ) i i 1 n 1 n 2 a 2 ( x x ) i i 1 n 1 a SD INTERREG 32 Example: the effect of transformations Sample data (xi) Addition (xi +10) Subtraction (xi -10) Multiplication (xi *10) Division (xi /10) 1 11 -9 10 0.1 2 12 -8 20 0.2 4 14 -6 40 0.4 1 11 -9 10 0.1 Mean=2 12 -8 20 0.2 Median=1.5 11.5 -8.5 15 0.15 Range=3 3 3 30 0.3 St.dev.≈1.414 ≈1 .414 ≈ 1.414 ≈ 14.14 ≈ 0.1414 Krisztina Boda INTERREG 33 Special transformation: standardisation The z score measures how many standard deviations a sample element is from the mean. A formula for finding the z score corresponding to a particular sample element xi is xi x zi s , i=1,2,...,n. We standardize by subtracting the mean and dividing by the standard deviation. The resulting variables (z-scores) will have Zero mean Unit standard deviation No unit Krisztina Boda INTERREG 34 Example: standardisation Sample data (xi) Standardised data (zi) 1 -1 2 0 4 2 1 1 Mean 2 0 St. deviation ≈1 .414 1 Krisztina Boda INTERREG 35 Review questions and exercises Problems to be solved by handcalculations ..\Handouts\Problems hand I.doc Solutions ..\Handouts\Problems hand I solutions.doc Problems to be solved using computer ..\Handouts\Problems comp I.doc Krisztina Boda INTERREG 36 Useful WEB pages Krisztina Boda http://www-stat.stanford.edu/~naras/jsm http://www.ruf.rice.edu/~lane/rvls.html http://my.execpc.com/~helberg/statistics.html http://www.math.csusb.edu/faculty/stanton/m26 2/index.html INTERREG 37