FK6163 Collect, Explore & Summarise
Dr Azmi Mohd Tamil
Dept of Community Health
Universiti Kebangsaan Malaysia
©drtamil@gmail.com 2013

Data Collection
• Data collection begins after deciding on the study design and the sampling strategy.
• Sample subjects are identified and the required individual information is obtained in an item-wise and structured manner.
• Information is collected on the characteristics, attributes and qualities of interest from the samples.
• These data may be quantitative or qualitative in nature.

Data Collection Techniques
• Use available information
• Observation
• Interviews
• Questionnaires
• Focus group discussion

Using Available Information
• Existing records
  - Hospital records (case notes)
  - National registry of births & deaths
  - Census data
  - Data from other surveys

Disadvantages of using existing records
• Incomplete records
• Cause of death may not be verified by a physician/MD
• Missing vital information
• Difficult to decipher
• May not be representative of the target group, e.g. only severe cases go to hospital
• Delayed publication, so the data may be obsolete
• Different methods of data recording between institutions, states and countries make comparison & pooling of data difficult
• Comparisons across time are difficult due to differences in classification, diagnostic tools etc.

Advantages of using existing records
• Cheap
• Convenient
• In some situations, it is the only data source, e.g. accidents & suicides

Observation
• Involves systematically selecting, watching & recording behaviour and characteristics of living beings, objects or phenomena
• Done using defined scales
• Participant observation, e.g. PEF and asthma symptom diary
• Non-participant observation, e.g. cholesterol levels

Interviews
• Oral questioning of respondents, either individually or as a group
• Can be loosely or highly structured, using a questionnaire

Administering Written Questionnaires
• Self-administered
  - via mail
  - by gathering respondents in one place and getting them to fill it in
  - by hand-delivering the questionnaires and collecting them later
• Large non-response can distort results

Questionnaires
• Influenced by the education & attitude of the respondent, especially when self-administered
• Interviewers need to be trained
• Open-ended vs closed-ended questions
• The need for pre-testing or a pilot study

Focus group discussion
• Selecting parties relevant to the research questions at hand and discussing with them in focus groups
• Examples in your own field of interest?

Plan for data collection
• Permission to proceed
• Logistics: who will collect what, when and with what resources
• Quality control

Accuracy & Reliability
• Accuracy: the degree to which a measurement actually measures the characteristic it is supposed to measure
• Reliability: the consistency of replicate measures
[Figure: Reliability & Accuracy]
• Both are reduced by random error and systematic error from the same sources of variability:
  - the data collectors
  - the respondents
  - the instrument

Strategies to enhance precision & accuracy
• Standardise procedures and measurement methods
• Train & certify the data collectors
• Repetition
• Blinding

Introduction
• The method of exploring and summarising data differs according to the types of variables.

Dependent/Independent
• Independent variables: frequency of exercise, food intake
• Dependent variable: obesity

Explore
• It is the first step in the analytic process
• To explore the characteristics of the data
• To screen for errors and correct them
• To look for distribution patterns: normal distribution or not
• May require transformation before further analysis using parametric methods
• Or may need analysis using non-parametric techniques

Data Screening
[Table: frequency output, values not recoverable]
• By running frequencies, we may detect inappropriate responses.
• How many in the audience have 15 children and are currently pregnant with the 16th?
• See whether the data make sense or not, e.g. parity 10 but age only 25.
• By looking at measures of central tendency and range, we can also detect abnormal values for quantitative data.

Interpreting the Box Plot
• Labelled elements, from top to bottom: largest non-outlier, upper quartile, median, lower quartile, smallest non-outlier, outlier
• The whiskers extend up to 1.5 times the box width (the interquartile range) from both ends of the box and end at an observed value.
• Three times the box width marks the boundary between "mild" and "extreme" outliers.
• "Mild" outliers are drawn as closed dots, "extreme" outliers as open dots.

Data Screening
• We can also make use of graphical tools such as the box plot to detect wrong data entry.
[Figure: box plot of pre-pregnancy weight, N = 184, y-axis 0 to 600, with labelled points 73, 141, 181, 198 and 211]

Data Cleaning
• Identify the extreme/wrong values
• Check with the original data source, i.e. the questionnaire
• If incorrect, make the necessary correction
• Correction must be done before transformation, recoding and analysis
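
The screening steps above (running frequencies, checking for implausible values, and applying the box plot rule of 1.5 and 3 box widths) could be sketched in Python roughly as follows. This is only an illustration: the function names, the plausible range for parity and both data sets are invented here and are not from the lecture, and the quartiles come from Python's `statistics.quantiles`, whose convention may differ slightly from the software used for the slides.

```python
from collections import Counter
from statistics import quantiles


def frequency_table(values):
    """Return (value, count) pairs sorted by value, like a frequencies run."""
    return sorted(Counter(values).items())


def out_of_range(values, low, high):
    """Return the values that fall outside the plausible range [low, high]."""
    return [v for v in values if v < low or v > high]


def classify_outliers(values):
    """Split outliers into 'mild' and 'extreme' using the box plot rule."""
    q1, _, q3 = quantiles(values, n=4)  # lower quartile, median, upper quartile
    box_width = q3 - q1                 # interquartile range
    inner_low, inner_high = q1 - 1.5 * box_width, q3 + 1.5 * box_width
    outer_low, outer_high = q1 - 3 * box_width, q3 + 3 * box_width
    mild, extreme = [], []
    for v in values:
        if v < outer_low or v > outer_high:
            extreme.append(v)           # beyond 3 box widths from the box
        elif v < inner_low or v > inner_high:
            mild.append(v)              # between 1.5 and 3 box widths
    return {"mild": mild, "extreme": extreme}


# Hypothetical data, invented for illustration only
parity = [0, 1, 1, 2, 2, 2, 3, 4, 15]
weights_kg = [45, 48, 50, 52, 53, 55, 56, 58, 60, 62, 64, 95, 550]

print(frequency_table(parity))
# The upper limit of 12 is an assumed plausibility threshold, not a rule
print(out_of_range(parity, low=0, high=12))
print(classify_outliers(weights_kg))
```

Any value flagged this way is only a candidate error; as the Data Cleaning slide says, it should be checked against the original questionnaire before being corrected.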

Parameters of Data Distribution
• Mean: the central value of the data
• Standard deviation: a measure of how the data scatter around the mean
• Symmetry (skewness): the degree to which the data pile up on one side of the mean
• Kurtosis: how peaked or flat the distribution is compared with the normal curve

Normal distribution
• The Normal distribution is represented by a family of curves defined uniquely by two parameters, which are the mean and the standard deviation of the population.
• The curves are always symmetrically bell shaped, but the extent to which the bell is compressed or flattened out depends on the standard deviation of the population.
• However, the mere fact that a curve is bell shaped does not mean that it represents a Normal distribution, because other distributions may have a similar sort of shape.

Normal distribution
• If the observations follow a Normal distribution, a range covered by one standard deviation above the mean and one standard deviation below it includes about 68.3% of the observations;
• a range of two standard deviations above and two below (± 2sd) includes about 95.4% of the observations;
• and a range of three standard deviations above and three below (± 3sd) includes about 99.7% of the observations.

Normality
• Why bother with normality?
• Because it dictates the type of analysis that you can run on the data

Normality - Why? Parametric
Variable 1 | Variable 2 | Criteria | Type of Test
Qualitative | Qualitative | Sample size > 20 and no expected value < 5 | Chi-square test (χ²)
Qualitative, dichotomous | Qualitative, dichotomous | Sample size > 30 | Proportionate test
Qualitative, dichotomous | Qualitative, dichotomous | Sample size > 40 but with at least one expected value < 5 | χ² test with Yates correction
Qualitative, dichotomous | Quantitative | Normally distributed data | Student's t test
Qualitative, polytomous | Quantitative | Normally distributed data | ANOVA
Quantitative | Quantitative | Repeated measurement of the same individual & item (e.g. Hb level before & after treatment); normally distributed data | Paired t test
Quantitative, continuous | Quantitative, continuous | Normally distributed data | Pearson correlation & linear regression

Normality - Why? Non-parametric
Variable 1 | Variable 2 | Criteria | Type of Test
Qualitative, dichotomous | Qualitative, dichotomous | Sample size < 20, or < 40 but with at least one expected value < 5 | Fisher test
Qualitative, dichotomous | Quantitative | Data not normally distributed | Wilcoxon rank sum test or Mann-Whitney U test
Qualitative, polytomous | Quantitative | Data not normally distributed | Kruskal-Wallis one-way ANOVA test
Quantitative | Quantitative | Repeated measurement of the same individual & item | Wilcoxon signed rank test
Quantitative, continuous/ordinal | Quantitative, continuous | Data not normally distributed | Spearman/Kendall rank correlation

Normality - How?
• Explored graphically
  - Histogram
  - Stem & leaf
  - Box plot
  - Normal probability plot
  - Detrended normal plot
• Explored statistically
  - Kolmogorov-Smirnov statistic with Lilliefors significance level, and the Shapiro-Wilk statistic
  - Skewness (0 for a normal distribution)
  - Kurtosis (0 for a normal distribution): positive = leptokurtic, 0 = mesokurtic, negative = platykurtic

Kolmogorov-Smirnov
• In the 1930s, Andrei Nikolaevich Kolmogorov (1903-1987) and N.V. Smirnov (his student) came up with an approach for comparing distributions that did not make use of parameters.
• This is known as the Kolmogorov-Smirnov test.

Skewness
• Skewed to the right indicates the presence of large extreme values
• Skewed to the left indicates the presence of small extreme values

Kurtosis
• For symmetrical distributions only
• Describes the shape of the curve
• Mesokurtic: average shaped
• Leptokurtic: narrow & slim
• Platykurtic: flat & wide

Skewness & Kurtosis
• Skewness typically ranges from -3 to 3.
• The acceptable range for normality is skewness lying between -1 and 1.
• Normality should not be based on skewness alone; the kurtosis measures the "peakedness" of the bell curve (see Fig. 4).
• Likewise, the acceptable range for normality is kurtosis lying between -1 and 1.

Normality - Examples Graphically
[Figure: histogram of Height; Std. Dev = 5.26, Mean = 151.6, N = 218]

Q-Q Plot
• This plot compares the quantiles of a data distribution with the quantiles of a standardised theoretical distribution from a specified family of distributions (in this case, the normal distribution).
• If the distributional shapes differ, the points will plot along a curve instead of a line.
• The interest here is the central portion of the line: severe deviations there mean non-normality. Deviations at the "ends" of the curve signify the existence of outliers.

Normality - Examples Graphically
[Figure: Normal Q-Q plot of Height and detrended Normal Q-Q plot of Height; x-axis: Observed Value; y-axis of the detrended plot: Dev from Normal]

Normal distribution
• Mean = median = mode

Normality - Examples Statistically
Descriptives: Height
Statistic | Value | Std. Error
Mean | 151.65 | .356
95% Confidence Interval for Mean, Lower Bound | 150.94 |
95% Confidence Interval for Mean, Upper Bound | 152.35 |
5% Trimmed Mean | 151.59 |
Median | 151.50 |
Variance | 27.649 |
Std. Deviation | 5.258 |
Minimum | 139 |
Maximum | 168 |
Range | 29 |
Interquartile Range | 8.00 |
Skewness | .148 | .165
Kurtosis | .061 | .328
• Normal distribution: mean = median = mode
• Skewness & kurtosis are within ±1

Tests of Normality: Height
Test | Statistic | df | Sig.
Kolmogorov-Smirnov (a) | .060 | 218 | .052
a. Lilliefors Significance Correction
• p > 0.05, so normal distribution
• Shapiro-Wilk: only if the sample size is less than 100

K-S Test
• Very sensitive to the sample size of the data.
• For small samples (n < 20, say), the likelihood of getting p < 0.05 is low.
• For large samples (n > 100), a slight deviation from normality will result in the data being reported as not normally distributed.

Guide to deciding on normality
[Figure: guide to deciding on normality]

Normality Transformation
[Figure: Normal Q-Q plot of PARITY and Normal Q-Q plot of LN_PARIT; x-axis: Observed Value; y-axis: Expected Normal]

Types of Transformations
• Square root
• Reflect and square root
• Logarithm
• Reflect and logarithm
• Inverse
• Reflect and inverse

Summarise
• Summarise a large set of data by a few meaningful numbers.
• Single variable analysis
  - For the purpose of describing the data
  - Example: in one year, what kinds of cases are treated by the Psychiatric Dept?
  - Tables & diagrams are usually used to describe the data
  - For numerical data, measures of central tendency & spread are usually used

Frequency Table
Race | F | %
Malay | 760 | 95.84%
Chinese | 5 | 0.63%
Indian | 0 | 0.00%
Others | 28 | 3.53%
TOTAL | 793 | 100.00%
• Illustrates the frequency observed for each category

Frequency Distribution Table
• With more than 20 observations, data are best presented as a frequency distribution table.
• Columns are divided into class & frequency.
• The modal class can be determined using such tables.
Age | Count | %
0-0.99 | 25 | 3.26%
1-4.99 | 78 | 10.18%
5-14.99 | 140 | 18.28%
15-24.99 | 126 | 16.45%
25-34.99 | 112 | 14.62%
35-44.99 | 90 | 11.75%
45-54.99 | 66 | 8.62%
55-64.99 | 60 | 7.83%
65-74.99 | 50 | 6.53%
75-84.99 | 16 | 2.09%
85+ | 3 | 0.39%
TOTAL | 766 |

Measurement of Central Tendency & Spread

Measures of Central Tendency
• Mean
• Mode
• Median

Measures of Variability
• Standard deviation
• Inter-quartile range
• Skewness & kurtosis

Mean
• The average of the data collected
• To calculate the mean, add up the observed values and divide by the number of them.
• A major disadvantage of the mean is that it is sensitive to outlying points.

Mean: Example
• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• Total of x = 648; n = 20
• Mean = 648/20 = 32.4

Measures of variation: standard deviation
• Tells us how much all the scores in a dataset cluster around the mean. A large S.D. indicates more varied data.
• A summary measure of the differences of each observation from the mean.
• If the differences themselves were added up, the positive would exactly balance the negative and so their sum would be zero.
• Consequently the squares of the differences are added.

sd: Example
• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• Mean = 32.4; n = 20
• Total of (x - mean)^2 = 1405.8 + 1645 = 3050.8
• Variance = 3050.8/19 = 160.5684
• sd = 160.5684^0.5 = 12.67

x | (x-mean)^2 | x | (x-mean)^2
12 | 416.16 | 32 | 0.16
13 | 376.36 | 35 | 6.76
17 | 237.16 | 37 | 21.16
21 | 129.96 | 38 | 31.36
24 | 70.56 | 41 | 73.96
24 | 70.56 | 43 | 112.36
26 | 40.96 | 44 | 134.56
27 | 29.16 | 46 | 184.96
27 | 29.16 | 53 | 424.36
30 | 5.76 | 58 | 655.36
TOTAL | 1405.8 | TOTAL | 1645

Median
• The ranked value that lies in the middle of the data
• The point which has the property that half the data are greater than it, and half the data are less than it
• If n is even, average the (n/2)th and the (n/2 + 1)th ranked observations
• "Robust" to outliers

Median: Example
• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• (20+1)/2 = 10.5, so the median lies midway between the 10th value (30) and the 11th value (32)
• Therefore the median is (30 + 32)/2 = 31

Measures of variation: quartiles
• The range is very susceptible to outliers.
• A more robust approach is to divide the distribution of the data into four, and find the points below which are 25%, 50% and 75% of the distribution.
• These are known as quartiles, and the median is the second quartile.

Quartiles
• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• 25th percentile: 24; position 0.25 × (20+1) = 5.25, between the 5th and 6th values (both 24)
• 50th percentile: 31; (30+32)/2; this is the median
• 75th percentile: 42.5; position 0.75 × (20+1) = 15.75, so 41 + 0.75 × (43 - 41)

Mode
• The most frequently occurring number, e.g. 3, 13, 13, 20, 22, 25: mode = 13.
• It is usually more informative to quote the mode accompanied by the percentage of times it happened; e.g. the mode is 13 with 33% of the occurrences.

Mode: Example
• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• Modes are 24 (10%) & 27 (10%)

Mean or Median?
• Which measure of central tendency should we use?
• If the distribution is normal, the mean and sd should be presented; otherwise the median and IQR are more appropriate.
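
As a quick check, the worked examples above can be reproduced with Python's standard `statistics` module. This is a minimal sketch rather than the method used to prepare the slides; the variable names are chosen here for illustration. `statistics.variance` and `statistics.stdev` divide by n - 1, matching the slide's 3050.8/19, and the default `statistics.quantiles` method interpolates at position p × (n + 1), which gives the slide's quartiles for this data set.

```python
import statistics

# The 20 observations used in the worked examples
data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

mean = statistics.mean(data)                  # sum of values / n
variance = statistics.variance(data)          # sample variance, divides by n - 1
sd = statistics.stdev(data)                   # square root of the sample variance
median = statistics.median(data)              # average of the 10th and 11th values
q1, q2, q3 = statistics.quantiles(data, n=4)  # 25th, 50th and 75th percentiles
modes = statistics.multimode(data)            # every value tied for most frequent

print(f"mean = {mean}, variance = {variance:.4f}, sd = {sd:.2f}")
print(f"median = {median}, quartiles = {q1}, {q2}, {q3}")
print(f"modes = {modes}")
```

Other packages may use a different quartile convention and report slightly different values for the 25th and 75th percentiles, so it is worth stating which method was used when reporting an IQR.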

[Figure: two example distributions. Not normal distribution: use median & IQR. Normal distribution: use mean and SD.]

Presentation: Qualitative & Quantitative Data, Charts & Tables

Presentation: Qualitative Data

Graphing Categorical Data: Univariate Data
• Categorical data
  - Tabulating data: the summary table
  - Graphing data: pie charts, Pareto diagram, bar charts
[Figure: example pie chart, Pareto diagram and bar chart for the categories Stocks, Bonds, Savings and CD]

Bar Chart
[Figure: bar chart of Type of work (Housewife, Office work, Field work); y-axis: Percent; bar labels 69, 20 and 11]

Pie Chart
[Figure: pie chart with segments Malay, Chinese and Others]

Tabulating and Graphing Bivariate Categorical Data
• Contingency tables
Table 1: Contingency table of pregnancy induced hypertension and SGA (counts)
SGA | Pregnancy induced hypertension: No | Pregnancy induced hypertension: Yes | Total
Normal | 103 | 5 | 108
SGA | 94 | 16 | 110
Total | 197 | 21 | 218

Tabulating and Graphing Bivariate Categorical Data
• Side by side charts
[Figure: side-by-side bar chart of Count by pregnancy induced hypertension (No, Yes) for Normal and SGA; bar labels 103, 94 and 16]

Presentation: Quantitative Data

Tabulating and Graphing Numerical Data
• Numerical data: 41, 24, 32, 26, 27, 27, 30, 24, 38, 21
  - Ordered array: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
  - Stem and leaf display:
      2 | 144677
      3 | 028
      4 | 1
  - Frequency distributions and cumulative distributions: tables, histograms, polygons, ogive

Tabulating Numerical Data: Frequency Distributions
• Sort raw data in ascending order: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• Find range: 58 - 12 = 46
• Select number of classes: 5 (usually between 5 and 15)
• Compute class interval (width): 10 (46/5, then round up)
• Determine class boundaries (limits): 10, 20, 30, 40, 50, 60
• Compute class midpoints: 14.95, 24.95, 34.95, 44.95, 54.95
• Count observations & assign to classes

Frequency Distributions and Percentage Distributions
Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Class | Midpoint | Freq | %
10.0 - 19.9 | 14.95 | 3 | 15%
20.0 - 29.9 | 24.95 | 6 | 30%
30.0 - 39.9 | 34.95 | 5 | 25%
40.0 - 49.9 | 44.95 | 4 | 20%
50.0 - 59.9 | 54.95 | 2 | 10%
TOTAL | | 20 | 100%

Graphing Numerical Data: The Histogram
Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
[Figure: histogram of Age; y-axis: Frequency; x-axis: class midpoints 14.95, 24.95, 34.95, 44.95, 54.95; bar heights 3, 6, 5, 4, 2; bars meet at the class boundaries, with no gaps between bars]

Graphing Numerical Data: The Frequency Polygon
Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
[Figure: frequency polygon plotted at the class midpoints 14.95, 24.95, 34.95, 44.95, 54.95]

Linear Regression Line
[Figure: linear regression line]

Survival Function
[Figure: survival function plotted against DURATION (0 to 7), with censored observations marked]

Principles of Graphical Excellence
• Presents data in a way that provides substance, statistics and design
• Communicates complex ideas with clarity, precision and efficiency
• Gives the largest number of ideas in the most efficient manner
• Almost always involves several dimensions
• Tells the truth about the data
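
The steps for building a frequency distribution described earlier (find the range, choose the number of classes, round the class width up, set the boundaries, count the observations) could be sketched in Python as below. This is a hedged illustration, not the lecturer's procedure: the function name is invented, the data are assumed to be whole numbers, and starting the first class at a multiple of the class width is an assumption that happens to give the boundaries 10, 20, ..., 60 used in the example.

```python
import math


def frequency_distribution(values, n_classes):
    """Group whole-number observations into equal-width classes.

    Returns a list of (lower, upper, frequency, percent) tuples, where a
    class holds every x with lower <= x < upper.
    """
    data = sorted(values)
    value_range = data[-1] - data[0]
    # Class width: range / number of classes, rounded up (at least 1)
    width = max(1, math.ceil(value_range / n_classes))
    # Assumption: the first boundary is a multiple of the width at or below the minimum
    start = (data[0] // width) * width
    if start + width * n_classes <= data[-1]:
        raise ValueError("classes do not cover the largest value; use more classes")
    counts = [0] * n_classes
    for x in data:
        counts[(x - start) // width] += 1
    rows = []
    for i, count in enumerate(counts):
        lower = start + i * width
        rows.append((lower, lower + width, count, 100 * count / len(data)))
    return rows


data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

for lower, upper, freq, pct in frequency_distribution(data, n_classes=5):
    print(f"{lower} to <{upper}: {freq} ({pct:.0f}%)")
```

The frequencies returned for this data set are the same ones tabulated in the percentage distribution and drawn as bar heights in the histogram.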