Statistics for Science Journalists
Steve Doig, Cronkite School of Journalism

Journalists hate math
- Definition of journalist: a do-gooder who hates math.
- "Word person, not a numbers person."
- A 1936 Journalism Quarterly article noted habitual numerical errors in newspapers.
- Japanese 6th graders were more accurate on a math test than applicants to Columbia's Graduate School of Journalism.
- 20% of journalists got more than half wrong on a 25-question "math competency test" (Maier).
- 18% of the 5,100 stories examined by Phil Meyer had math errors.

Bad examples abound
- Paulos: a "300% decrease" in murders
- Detroit Free Press (2006): compared ACS to Census data to report a false drop in median income
- Kansas City Star (2000): priests dying of AIDS at 4 times the rate of all Americans
- The Delaware ZIP code of infant death
- NYT: 51% of women without spouses

Common problems
- Numbers that don't add up
- Making the reader do the math
- Failure to ask "Does this make sense?"
- Over-precision
- Ignoring sampling error margins
- Implying that correlation equals causation

Dangers of journalistic innumeracy
- Misleads math-challenged readers and viewers
- Hurts credibility among math-capable readers and viewers
- Leads to charges of bias, even when the cause is ignorance
- Makes reporters vulnerable to being used for the agendas of others

Common Research Methods
- Randomized experiments: measure deliberate manipulation of the environment
- Observational studies: measure the differences that occur naturally
- Meta-analyses: quantitative review of multiple studies
- Case studies: descriptive, in-depth examination of one or a few individuals

Simple Measures...
...don't exist!

Measurement Variability
- Variable measurements include unpredictable errors or discrepancies that aren't easily explained.
- Natural variability results from the fact that individuals and other things differ.

Reasons for variable measures
- Measurement error
- Natural variability between individuals
- Natural variability over time in a single individual

Some Pitfalls in Studies

Deliberate Bias?
If you found a wallet with $20, would you:
- "Keep it?" (23% would keep it)
- "Do the honest thing and return it?" (13% would keep it)

Unintentional Bias?
- "Do you use drugs?"
- "Are you religious?"

Desire to Please?
People routinely say they have voted when they actually haven't, that they don't smoke when they do, and that they aren't prejudiced. In one study six months after an election:
- 96% of actual voters said they voted.
- 40% of non-voters said they voted.

Asking the uninformed?
A Washington Post poll asked: "Some people say the 1975 Public Affairs Act should be repealed. Do you agree or disagree that it should be repealed?" (There is no such act.)
- 24% said yes
- 19% said no
- The rest had no opinion

A later Washington Post poll asked: "President Clinton says the 1975 Public Affairs Act should be repealed. Do you agree or disagree that it should be repealed?"
- 36% of Democrats agreed
- 16% of Republicans agreed
- The rest had no opinion

Unnecessary Complexity?
- "Do you support our soldiers in Iraq so that terrorists won't strike the U.S. again?"

Question Order
- "About how many times a month do you normally go out on a date?"
- "How happy are you with life in general?"

Sampling Margin of Error
95% of the time, a random sample's characteristics will differ from the population's by no more than about 1/√n, where n is the sample size.

Two Important Concepts about Error Margin
- The larger the sample, the smaller the margin of sampling error.
- The size of the population being surveyed doesn't matter (unless the sample is a significant fraction of the population).
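To make the 1/√n rule concrete, here is a minimal Python sketch (mine, not from the deck) that computes the approximate 95% margin of sampling error for several sample sizes; the output matches the n = 100/400/900 figures on the "Sampling realities" slide that follows.

```python
import math

def margin_of_error(n: int) -> float:
    """Approximate 95% margin of sampling error, in percentage
    points, for a simple random sample of size n (the 1/sqrt(n) rule)."""
    return 100 / math.sqrt(n)

for n in (100, 400, 900, 1500):
    print(f"n = {n:5d}: +/- {margin_of_error(n):.1f} percentage points")

# n =   100: +/- 10.0 percentage points
# n =   400: +/- 5.0 percentage points
# n =   900: +/- 3.3 percentage points
# n =  1500: +/- 2.6 percentage points
```

Notice that quadrupling the sample only halves the error margin, which is the "diminishing returns" point made below.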
Sampling realities
- A bigger sample means more cost (money and/or time).
- There are diminishing returns on error-margin improvement as the sample increases:
  n = 100: +/- 10 percentage points
  n = 400: +/- 5 percentage points
  n = 900: +/- 3.3 percentage points
- The sample needs only to be large enough to give a reasonable answer.
- Sampling error affects subsamples, too.

Describing data sets

Three Useful Features of a Set of Data
- The center
- The variability
- The shape

The Center
- Mean (average): the total of the values, divided by the number of values
- Median: the middle value of an ordered list of values
- Mode: the most common value
- Outliers: atypical values far from the center

Yankees' Baseball Salaries
- Average: $7,404,762
- Median: $2,500,000
- Mode: $500,000 (also the minimum)
- Outlier: $27.5 million (Alex Rodriguez)

The Variability
Some measures of variability:
- Maximum and minimum: the largest and smallest values
- Range: the distance between the largest and smallest values
- Quartiles: the medians of each half of the ordered list of values
- Standard deviation: think of it as the average distance of all the values from the mean

What is "normal"?
- Don't consider the average to be "normal."
- Variability is normal.
- Anything within about 3 standard deviations of the mean is "normal."

Some Characteristics of a Normal Distribution
[Figure: bell-shaped "normal" curve, centered at the mean]
- Symmetrical (not skewed)
- One peak in the middle, at the mean
- The wider the curve, the greater the standard deviation
- The area under the curve is 1 (or 100%)

Percentiles
Your percentile for a particular measure (like height or IQ) is the percentage of the population that falls below you. Compared to other American males:
- My height (5'11"): 75th percentile
- My weight (230 lbs.): 85th percentile
- My age (66): 88th percentile
Therefore, I am older and heavier than I am tall.

Standardized Scores
A standardized score (also called the z-score) is simply the number of standard deviations a particular value is above or below the mean. The standardized score is:
- Positive if above the mean
- Negative if below the mean
It is useful for defining data points as outliers.

The Empirical Rule
For any normal curve, approximately:
- 68% of values fall within one standard deviation of the mean
- 95% of values fall within two standard deviations of the mean
- 99.7% of values fall within three standard deviations of the mean

Outlier
- A value that is more than three standard deviations above or below the mean.

Correlation

Strength of Relationship
Correlation (also called the correlation coefficient, or Pearson's r) is the measure of the strength of the linear relationship between two variables. Think of strength as how closely the data points come to falling on a line drawn through the data.

Features of Correlation
- Correlation can range from +1 to -1.
- Positive correlation: as one variable increases, the other increases.
- Negative correlation: as one variable increases, the other decreases.
- Zero correlation means the best line through the data is horizontal.
- Correlation isn't affected by the units of measurement.

[Scatterplot slides: positive correlations (r = +.1, +.4, +.8, +1); negative correlations (r = -.1, -.4, -.8, -1); zero correlation (r = 0); and two plots with different numbers of points but the same r = .8, showing that the number of points doesn't matter.]

Important! Correlation does not imply causation.
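Pearson's r is easy to compute by hand or in code. Here is a short Python sketch (mine, not from the deck; the hours/grades data are hypothetical) that implements the standard formula and confirms the "units don't matter" bullet above.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: sum of products of deviations from the
    means, divided by the product of the root sums of squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    syy = math.sqrt(sum((y - my) ** 2 for y in ys))
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / (sxx * syy)

hours  = [1, 2, 3, 4, 5, 6]        # hypothetical hours studied
grades = [55, 62, 58, 75, 81, 86]  # hypothetical exam grades

print(round(pearson_r(hours, grades), 2))   # 0.95: strong positive
minutes = [h * 60 for h in hours]           # same data, new units
print(round(pearson_r(minutes, grades), 2)) # still 0.95
```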
Correlation of variables
When considering relationships between measurement variables, there are two kinds:
- The explanatory (or independent) variable: the variable that attempts to explain, or is purported to cause (at least partially), differences in...
- ...the response (or dependent, or outcome) variable.
Often, chronology is a guide to distinguishing them (examples: baldness and heart attacks, poverty and test scores).

Some reasons why two variables could be related
- The explanatory variable is the direct cause of the response variable. Examples: pollen counts and the percentage of the population suffering allergies; intercourse and babies.
- The response variable is causing a change in the explanatory variable. Examples: hotel occupancy and advertising spending; divorce and alcohol abuse.
- The explanatory variable is a contributing, but not sole, cause. Examples: birth complications and violence; a gun in the home and homicide; hours studied and grade; diet and cancer.
- Both variables may result from a common cause. Examples: SAT score and GPA; hot chocolate and tissues; storks and babies; fire losses and firefighters; WWII fighter opposition and bombing accuracy.
- Both variables are changing over time. Examples: divorces and drug offenses; divorces and suicides.
- The association may be nothing more than coincidence. Examples: clusters of disease; brain cancer from cell phones.

So how can we confirm causation?
The only way to confirm causation is with a designed (randomized, double-blind) experiment. But non-statistical evidence of a possible connection may include:
- A reasonable explanation of cause and effect
- A connection that happens under varying conditions
- Potential confounding variables ruled out

Regression

Linear Regression
In addition to figuring the strength of the relationship, we can create a simple equation that describes the best-fit line (also called the "least-squares" line) through the data. This equation helps us predict one variable, given the other; a code sketch follows the examples below.

[Figure slides: candidate lines through a scatterplot; a poor fit leaves much variance, while the best-fit ("least-squares") line leaves the least.]

Remember 9th Grade Algebra?
With x on the horizontal axis and y on the vertical axis, the equation for a line is
y = slope * x + intercept, or, as it often is stated, y = mx + b.

Regression in data journalism
- Public school test scores
- Cheating on school tests
- Tenure of white vs. black coaches in the NBA
- Racial bias in picking jurors
- Racial profiling in traffic stops
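As an illustration of the best-fit line (mine, not from the deck; it reuses the hypothetical hours/grades data from the correlation sketch above), this Python snippet computes the least-squares slope m and intercept b, then predicts y for a new x:

```python
def least_squares(xs, ys):
    """Slope m and intercept b of the least-squares line y = m*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    m = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - m * mx   # the best-fit line passes through (mean x, mean y)
    return m, b

hours  = [1, 2, 3, 4, 5, 6]        # hypothetical hours studied
grades = [55, 62, 58, 75, 81, 86]  # hypothetical exam grades

m, b = least_squares(hours, grades)
print(f"y = {m:.1f}x + {b:.1f}")                        # y = 6.5x + 46.6
print(f"Predicted grade after 7 hours: {m*7 + b:.0f}")  # 92
```

The stronger the correlation, the more trustworthy such predictions are; with a weak r, even the "best" line fits the data poorly.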
Confusion of the Inverse
Confusion of the inverse means confusing these two probabilities:
- The probability of actually having a condition, given a positive test for it
- The probability of having a positive test, given actually having the condition
When the incidence of a disease or condition is very low, and the test for it is not perfect, there is a high probability that a positive test result is a false positive.

Definitions
- Base rate: the probability that someone has a disease or condition, without knowing any test results.
- Test sensitivity: the proportion of people who correctly test positive when they have the disease or condition (true positives).
- Test specificity: the proportion of people who correctly test negative when they don't have the disease or condition (true negatives).

Drug Tests
Consider this scenario:
- Base rate: 1% of the population to be tested uses dangerous drugs
- You use a test that is 99% accurate in both sensitivity and specificity
- 10,000 people are tested

Filling in the table (built up step by step on the slides):

              Test positive   Test negative    Total
  Users                  99               1      100
  Non-users              99           9,801    9,900
  Total                 198           9,802   10,000

Of the 198 positive results, 99 (fully 50%) are FALSE! (A short code sketch reproducing this table appears at the end of this handout.)

Confidence intervals and p-values

Confidence Intervals
- Like the error margin around poll results.
- A confidence interval is a tradeoff between certainty and accuracy, like shooting at targets of different sizes.
- The bigger the sample, the smaller the confidence interval at the 95% level.
- When comparing results, if the confidence intervals overlap, the difference is NOT statistically significant.

P-values
- The p-value is the probability that a result at least as extreme as the sample result would occur by chance alone; think of it as the chance that the finding is simply wrong.
- A 95% confidence level (p < 0.05) is the threshold most commonly used in social science research.
- Hard science, particularly medicine, often demands tighter confidence intervals and smaller p-values, such as p < 0.01.
- At p < 0.05, studies are going to be wrong about 5% of the time (and you won't know when).
- On the other hand, they probably won't be very wrong.

How to read a research study
- Pay attention to the method: observational study, randomized double-blind experiment, meta-analysis, or case study.
- Note the sample size.
- Don't ignore the confidence intervals.
- Consider the p-value as the probability that you're writing about something that isn't true.
- Remember that correlation doesn't necessarily mean causation.
- Consider the quality of the journal (is it peer reviewed?).
- Ask who paid for the research.

Newsroom math bibliography
- "Numbers in the Newsroom," by Sarah Cohen (IRE)
- "News and Numbers," by Victor Cohn and Lewis Cope
- "Precision Journalism" (4th edition), by Phil Meyer
- "Innumeracy," by John Allen Paulos
- "A Mathematician Reads the Newspaper," by John Allen Paulos
- "Damned Lies and Statistics," by Joel Best

Questions?
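As a closing sketch (mine, not part of the original deck; the function name is illustrative), here are a few lines of Python that rebuild the drug-test table from the "Confusion of the Inverse" section for any base rate, sensitivity, and specificity:

```python
def screening_table(population, base_rate, sensitivity, specificity):
    """2x2 table of true/false positives and negatives for a screening test."""
    users = population * base_rate
    clean = population - users
    true_pos  = users * sensitivity         # users correctly flagged
    false_neg = users - true_pos            # users the test misses
    false_pos = clean * (1 - specificity)   # clean people wrongly flagged
    true_neg  = clean - false_pos
    return true_pos, false_neg, false_pos, true_neg

tp, fn, fp, tn = screening_table(10_000, 0.01, 0.99, 0.99)
print(tp, fn, fp, tn)                                  # 99.0 1.0 99.0 9801.0
print(f"{fp / (tp + fp):.0%} of positives are false")  # 50%
```

Lowering the base rate makes the problem worse: with the same 99%-accurate test and a 0.1% base rate, about 91% of positives are false.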