Data Analysis Using R: 2. Descriptive Statistics

Tuan V. Nguyen
Garvan Institute of Medical Research, Sydney, Australia

Overview
• Measurements
• Population vs sample
• Summary of data: mean, variance, standard deviation, standard error
• Graphical analyses
• Transformation

Scales of Measurement
• In general, most observable behaviors can be measured on a ratio scale
• In general, many unobservable psychological qualities (e.g., extraversion) are measured on interval scales
• We will mostly concern ourselves with the simple distinction between categorical (nominal) and continuous (ordinal, interval, ratio) variables

[Diagram: variables divide into categorical and continuous; continuous divides into ordinal, interval, and ratio]

Ordinal Measurement
• Ordinal: designates an ordering; quasi-ranking
  – Does not assume that the intervals between numbers are equal
  – Example: finishing place in a race (first place, second place)

[Figure: finishing places (1st to 4th) shown against a time axis from 1 to 8 hours]

Interval and Ratio Measurement
• Interval: designates an equal-interval ordering
  – The distance between, for example, a 1 and a 2 is the same as the distance between a 4 and a 5
  – Example: common IQ tests are assumed to use an interval metric
• Ratio: designates an equal-interval ordering with a true zero point (i.e., the zero implies an absence of the thing being measured)
  – Example: number of intimate relationships a person has had
    • 0 quite literally means none
    • a person who has had 4 relationships has had twice as many as someone who has had 2

Statistics: Enquiry into the unknown
• Population: parameter
• Sample: estimate

Estimate the population mean
Population height mean = 160 cm
Standard deviation = 5.0 cm

    ht <- rnorm(10, mean=160, sd=5)
    mean(ht)
    ht <- rnorm(10, mean=160, sd=5)
    mean(ht)
    ht <- rnorm(100, mean=160, sd=5)
    mean(ht)
    ht <- rnorm(1000, mean=160, sd=5)
    mean(ht)
    ht <- rnorm(10000, mean=160, sd=5)
    mean(ht)
    hist(ht)

The larger the sample, the closer the estimate tends to be to the population value.
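The effect of sample size on the sample mean can be checked by simulation. The following is a minimal sketch, assuming the population from the slide (mean 160 cm, SD 5 cm). The seed, the 500 replicates, and the names sizes, se_theory and se_sim are illustrative choices, not part of the original material.

```r
# Sketch: how much the sample mean varies from sample to sample,
# for the sample sizes used on the slide.
set.seed(1)                          # arbitrary seed, for reproducibility only
sizes <- c(10, 100, 1000, 10000)

# Theoretical standard error of the mean: SD / sqrt(n)
se_theory <- 5 / sqrt(sizes)

# Simulated standard error: SD of 500 sample means at each sample size
se_sim <- sapply(sizes, function(n) {
  sd(replicate(500, mean(rnorm(n, mean = 160, sd = 5))))
})

# Side-by-side comparison
print(data.frame(n = sizes, se_theory = se_theory, se_sim = se_sim))
```

The simulated spread should track SD/sqrt(n): each tenfold increase in sample size shrinks the standard error by a factor of about 3.2.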
Estimate the population proportion
Population proportion of males = 0.50
Take n samples of a fixed size and record the number of males, k, in each sample: rbinom(n, size, prob)

    males <- rbinom(10, 10, 0.5)       # 10 samples of 10 people each
    males
    mean(males)                        # mean count; divide by 10 for the proportion
    males <- rbinom(20, 100, 0.5)      # 20 samples of 100 people each
    males
    mean(males)                        # mean count; divide by 100 for the proportion
    males <- rbinom(1000, 100, 0.5)    # 1000 samples of 100 people each
    males
    mean(males)

The larger the sample, the closer the estimate tends to be to the population value.

Summary of Continuous Data
• Measures of central tendency:
  – Mean, median, mode
• Measures of dispersion or variability:
  – Variance, standard deviation, standard error
  – Interquartile range

R commands
    length(x), mean(x), median(x), var(x), sd(x)
    summary(x)

R example
    height <- rnorm(1000, mean=55, sd=8.2)
    mean(height)
    [1] 55.30948
    median(height)
    [1] 55.018
    var(height)
    [1] 68.02786
    sd(height)
    [1] 8.2479
    summary(height)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      28.34   49.97   55.02   55.31   60.78   85.05

Graphical Summary: Box plot
    boxplot(height)

[Figure: box plot of height, annotated with the 5%, 25%, 50% (median), 75%, and 95% percentiles]

Strip chart
[Figure: strip chart of height]

Histogram
[Figure: histogram of height; x-axis: height, y-axis: frequency]

Implications of the mean and SD
• “In the Vietnamese population aged 30+ years, the average weight was 55.0 kg, with an SD of 8.2 kg.”
• What does this mean?
• If weight is normally distributed, about 68% of individuals will have a weight within 55 ± 8.2 × 1, i.e. 46.8 to 63.2 kg
• About 95% of individuals will have a weight within 55 ± 8.2 × 1.96, i.e. 38.9 to 71.1 kg

Implications of the mean and SD
• The distribution of weight in the entire population can be shown as:

[Figure: distribution of weight; x-axis: weight (kg), y-axis: percent (%), with the ±1 SD and ±1.96 SD ranges marked]

Summary of Categorical Data
• Categorical data:
  – Gender: male, female
  – Race: Asian, Caucasian, African
• Semi-quantitative data:
  – Severity of disease: mild, moderate, severe
  – Stages of cancer: I, II, III, IV
  – Preference: dislike very much, dislike, equivocal, like, like very much

Mean and variance of a proportion
• For an individual consumer i, the probability that he/she prefers A is $p_i$. Assuming that consumers are independent and share a common preference probability, $p_i = p$.
• The variance of $p_i$ is $\mathrm{var}(p_i) = p(1-p)$
• For a sample of n consumers, the estimated probability of preference for A is
  $\bar{p} = \frac{p_1 + p_2 + p_3 + \dots + p_n}{n}$
  and the variance of $\bar{p}$ is
  $\mathrm{var}(\bar{p}) = \frac{p(1-p)}{n}$

Normal approximation of a binomial distribution
• With the same setup, the estimate $\bar{p}$ has variance
  $\mathrm{var}(\bar{p}) = \frac{p(1-p)}{n}$
  and standard deviation
  $s = \sqrt{\frac{p(1-p)}{n}}$

Normal approximation of a binomial distribution – example
• 10 consumers, 8 preferred product A
• Proportion of preference for A: p = 0.8
• Variance: var(p) = 0.8(0.2)/10 = 0.016
• Standard deviation of p: s = 0.126
• 95% CI of p: 0.8 ± 1.96(0.126) = 0.55 to 1.00 (the upper limit, 1.05, is truncated at 1)

Descriptive Analyses
Continuous data

Paired t-test
• Continuous data
• Normally distributed
• Two samples are NOT independent

Paired t-test – an example
• The problem: Viewing certain meats under red light might enhance judges' preferences for meat.
12 judges were asked to score the redness of meat under red light and white light.

Results:

    Judge   1  2  3  4  5  6  7  8  9 10 11 12
    Red    20 18 19 22 17 20 19 16 21 17 23 18
    White  22 19 17 18 21 23 19 20 22 20 27 24

Paired t-test – analysis

    Judge   Red light   White light   Difference (white − red)
    1       20          22             2
    2       18          19             1
    3       19          17            -2
    4       22          18            -4
    5       17          21             4
    6       20          23             3
    7       19          19             0
    8       16          20             4
    9       21          22             1
    10      17          20             3
    11      23          27             4
    12      18          24             6
    Mean    19.2        21.0           1.83
    SD      2.1         2.8            2.82

Mean difference: 1.83, SD of the differences: 2.82
Standard error (SE): SD/sqrt(n) = 2.82/sqrt(12) = 0.81
t = (1.83 – 0)/0.81 = 2.25 (using unrounded values)
P-value = 0.0459
Conclusion: there was a significant effect of light colour at the 5% level.

Paired t-test – R analysis

    red <- c(20,18,19,22,17,20,19,16,21,17,23,18)
    white <- c(22,19,17,18,21,23,19,20,22,20,27,24)
    t.test(red, white, paired=TRUE)

    data:  red and white
    t = -2.2496, df = 11, p-value = 0.04592
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -3.6270234 -0.0396433
    sample estimates:
    mean of the differences
                  -1.833333

R computes the differences as red − white, so the sign is the reverse of the hand calculation.

Two-sample t-test

    Sample        Group 1   Group 2
    1             x1        y1
    2             x2        y2
    3             x3        y3
    4             x4        y4
    5             x5        y5
    …             …         …
    n             xn        yn
    Sample size   n1        n2
    Mean          x̄         ȳ
    SD            sx        sy

The formulas below are given in the unpooled form, which is what R's default (Welch) t-test uses:
Mean difference: $D = \bar{x} - \bar{y}$
Variance of D: $\mathrm{var}(D) = \frac{s_x^2}{n_1} + \frac{s_y^2}{n_2}$
T-statistic: $t = \frac{D}{\sqrt{\mathrm{var}(D)}}$
95% Confidence interval: $D \pm t_{0.975,\,df}\sqrt{\mathrm{var}(D)}$

Two-group comparison: an example
20 consumers rated their preference for two rice desserts (A and B)

    ID   1  2  3  4  5  6  7  8  9 10
    A    3  7  1  9  3  4  1  2  6  7
    B    3  1  2  4  5  2  2  5  3  2

    ID  11 12 13 14 15 16 17 18 19 20
    A    5  8  5  9  4  6  4  3  9  5
    B    3  4  2  3  5  4  3  1  3  2

Unpaired t-test using R

    a <- c(3,7,1,9,3,4,1,2,6,7,5,8,5,9,4,6,4,3,9,5)
    b <- c(3,1,2,4,5,2,2,5,3,2,3,4,2,3,5,4,3,1,3,2)
    t.test(a, b)

    Welch Two Sample t-test

    data:  a and b
    t = 3.3215, df = 27.478, p-value = 0.002539
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     0.8037895 3.3962105
    sample estimates:
    mean of x mean of y
         5.05      2.95

Transformation of data: multiplicative effects
• The following data represent lysozyme levels in the gastric juice of 29 patients with peptic ulcer and
of 30 normal controls. The question of interest was whether lysozyme levels differed between the two groups.

Group 1: 0.2 0.3 0.4 1.1 2.0 2.1 3.3 3.8 4.5 4.8 4.9 5.0 5.3 7.5 9.8 10.4 10.9 11.3 12.4 16.2 17.6 18.9 20.7 24.0 25.4 40.0 42.2 50.0 60.0
Group 2: 0.2 0.3 0.4 0.7 1.2 1.5 1.5 1.9 2.0 2.4 2.5 2.8 3.6 4.8 4.8 5.4 5.7 5.8 7.5 8.7 8.8 9.1 10.3 15.6 16.1 16.5 16.7 20.0 20.7 33.0

Unpaired t-test by R

    g1 <- c(0.2, 0.3, 0.4, 1.1, 2.0, 2.1, 3.3, 3.8, 4.5, 4.8,
            4.9, 5.0, 5.3, 7.5, 9.8, 10.4, 10.9, 11.3, 12.4, 16.2,
            17.6, 18.9, 20.7, 24.0, 25.4, 40.0, 42.2, 50.0, 60)
    g2 <- c(0.2, 0.3, 0.4, 0.7, 1.2, 1.5, 1.5, 1.9, 2.0, 2.4,
            2.5, 2.8, 3.6, 4.8, 4.8, 5.4, 5.7, 5.8, 7.5, 8.7,
            8.8, 9.1, 10.3, 15.6, 16.1, 16.5, 16.7, 20.0, 20.7, 33.0)
    t.test(g1, g2)

    data:  g1 and g2
    t = 2.0357, df = 40.804, p-value = 0.04831
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
      0.05163216 13.20239083
    sample estimates:
    mean of x mean of y
    14.310345  7.683333

Exploration of data

    par(mfrow=c(1,2))
    hist(g1)
    hist(g2)

Group 1: mean(g1) = 14.3, sd(g1) = 15.7
Group 2: mean(g2) = 7.7, sd(g2) = 7.8

[Figure: two panels, "Histogram of g1" and "Histogram of g2"; x-axes: g1 and g2, y-axes: frequency]

Re-analysis of lysozyme data

    log.g1 <- log(g1)
    log.g2 <- log(g2)
    t.test(log.g1, log.g2)

    data:  log.g1 and log.g2
    t = 1.406, df = 55.714, p-value = 0.1653
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -0.2182472  1.2453165
    sample estimates:
    mean of x mean of y
     1.921094  1.407559

exp(1.921 – 1.407) = 1.67
Group 1's geometric mean is about 67% higher than group 2's.

Descriptive analysis
Categorical data

Comparison of two proportions – theory

    Group                  1     2
    ______________________________
    Sample size            n1    n2
    Number of events       e1    e2
    Proportion of events   p1    p2

Difference: $D = p_1 - p_2$
SE of the difference: $SE = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$
$Z = D / SE$
95% CI: $D \pm 1.96\,SE$
With (n1 + n2) > 20, and if |Z| > 2, it is possible to reject
the null hypothesis.

Comparison of two proportions – example
Thirty-day mortality rate of rats exposed to heroin or cocaine (100 rats per group).

    Group               Heroin   Cocaine
    ____________________________________
    Sample size         100      100
    Number of deaths    90       36
    Mortality rate      0.90     0.36

Analysis
Difference: D = 0.90 – 0.36 = 0.54
SE(D) = sqrt[0.9(0.1)/100 + 0.36(0.64)/100] = 0.057
Z = 0.54 / 0.057 = 9.54 (using the unrounded SE, 0.0566)
95% CI: 0.54 ± 1.96(0.057) = 0.43 to 0.65
Conclusion: reject the null hypothesis.

Comparison of two proportions – R

    deaths <- c(90, 36)
    total <- c(100, 100)
    prop.test(deaths, total)

    2-sample test for equality of proportions with continuity correction

    data:  deaths out of total
    X-squared = 60.2531, df = 1, p-value = 8.341e-15
    alternative hypothesis: two.sided
    95 percent confidence interval:
     0.4190584 0.6609416
    sample estimates:
    prop 1 prop 2
      0.90   0.36

Comparison of >2 proportions – Chi square analysis

    table(sex, ethnicity)
            ethnicity
    sex      African Asian Caucasian Others
      Female       4    43        22      0
      Male         4    17         8      2

    females <- c(4, 43, 22, 0)
    total <- c(8, 60, 30, 2)
    prop.test(females, total)

    4-sample test for equality of proportions without continuity correction

    data:  females out of total
    X-squared = 6.2646, df = 3, p-value = 0.09942
    alternative hypothesis: two.sided
    sample estimates:
       prop 1    prop 2    prop 3    prop 4
    0.5000000 0.7166667 0.7333333 0.0000000

    Warning message:
    Chi-squared approximation may be incorrect in: prop.test(females, total)

Summary
• Examine the distribution of data
  – Mean and variance: systematic difference?
  – Normally distributed?
• Transformation?
• Present confidence intervals (and p-values)
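As an illustration of presenting a confidence interval alongside the test, the hand calculation for the heroin/cocaine example can be reproduced in a few lines of R. This is a minimal sketch using the counts from the example; the variable names (n1, e1, D, SE, Z, ci, ci_r) are illustrative and not from the original slides.

```r
# Sketch: two-proportion comparison by hand, using the example counts.
n1 <- 100; e1 <- 90    # heroin group: sample size, deaths
n2 <- 100; e2 <- 36    # cocaine group: sample size, deaths

p1 <- e1 / n1
p2 <- e2 / n2

D  <- p1 - p2                                        # difference in proportions
SE <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # unpooled standard error
Z  <- D / SE                                         # test statistic
ci <- D + c(-1, 1) * qnorm(0.975) * SE               # 95% confidence interval

print(round(c(D = D, SE = SE, Z = Z, lower = ci[1], upper = ci[2]), 3))

# The same interval from prop.test, with the continuity correction turned off
ci_r <- as.numeric(prop.test(c(e1, e2), c(n1, n2), correct = FALSE)$conf.int)
print(ci_r)
```

The interval in the prop.test output on the slide (0.419 to 0.661) is slightly wider than the hand-calculated one because prop.test applies a continuity correction by default.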