Practicing the Concepts #1 – Basic Concepts and Terminology

WSU STUDENT SURVEY - In order to generate data for use in one of my introductory statistics courses a few years back, I had the class develop a short survey and administer this survey to ten WSU students of their choosing. In the end, survey responses were recorded for a total of 348 WSU students (n = 348).

1. What is the population of interest?
WSU students.

2. What is the sample?
The 348 students, selected using a sample of convenience. The sample is not likely to be representative as a result.

3. What are some potential problems with this survey methodology?
Non-random sampling can introduce bias, either intentional or unintentional.

4. The following items comprised the survey. Classify each item (variable) as being either numeric (quantitative) or categorical (qualitative).

Survey Item/Variable | Variable Type
Gender | Cat
Age | Num
Did student have a declared major? | Cat
Major, if declared | Cat
College major program is in, e.g. College of Liberal Arts (CLA) | Cat
Class (Fr, So, Jr, or Sr) | Cat
Hours spent studying per day | Num
GPA | Num
Is student involved in extra-curricular activity, e.g. intramural sports or biology club? | Cat
Is student living on- or off-campus? | Cat
Hours of sleep per night | Num
Number of credits student is currently taking | Num
Does student have a “significant other”? | Cat
Does student skip at least one class per week? | Cat
If they do skip, what is the most common reason for skipping? | Cat
Does student drink alcohol? | Cat
If they do drink, what would be a typical number of “drinks” they would have per night that they drink? | Num
Does student smoke cigarettes? | Cat
If student is a smoker, how many cigarettes do they smoke per day? | Num
Should President Clinton be impeached for his sexual relations with Monica Lewinsky? | Cat
Is student of legal drinking age (21 yrs. old)? | Cat
How much did student spend on textbooks this semester? | Num
Does student think the WSU Laptop Program is a good idea? | Cat

Chapter 2 – Descriptive Statistics

WSU STUDENT SURVEY (cont’d) – We now consider methods for describing the WSU student survey data described above.

Describing Categorical/Qualitative Data

To describe a categorical variable such as College in JMP, select Analyze > Distribution and put College in the Y, Columns box. The following options have also been selected from the College pull-down menu:
Display Options > Horizontal Layout
Mosaic Plot
Histogram Options > Std Err Bars
Histogram Options > Prob Axis
Histogram Options > Show Percents

The mosaic plot is essentially a rectangular pie chart. The Prob Axis adds a vertical axis with relative frequencies. Finally, the standard error bars are \(p \pm SE(p)\). The standard error, \(SE(p)\), is actually the estimated standard error because the true proportion of students in each college (\(\pi\)) is not known, i.e.

\[ SE(p) = \sqrt{\frac{p(1-p)}{n}} \]

Frequency Distribution Table from JMP
[Figure: JMP frequency distribution table for College]

The first numeric column contains the Count or Frequency, which is the number of students in our sample that had declared majors in the college. Notice that 48 students could not be classified according to college because they had not yet declared a major (N Missing 48). The second numeric column contains the estimated probability (Prob) that a randomly selected WSU student would have a major in that college. For example, from this we would estimate that the probability that a randomly selected WSU student would have a declared major in the College of Liberal Arts is .33, or a 33% chance. More practically, we could say that based on our study we estimate that 33% of WSU students are enrolled in the College of Liberal Arts. Alternative labels for this column could be Relative Frequency or Proportion (p).

Exploring the Relationship Between Two Categorical Variables

Suppose that we wish to examine the relationship between a student’s opinion of the WSU Laptop Program and their gender.
Comparative Bar Graphs
[Figure: bar graphs of laptop program opinion, with panels “Opinions of Females” and “Opinions of Males”]

1. Why can’t we directly compare the 142 females who do not think the laptop program is a good idea to the 78 males who feel the same way?
Because the numbers of males and females sampled are different. We would expect the number of females who think the laptop program is a bad idea to be larger, perhaps simply because 56 more females were sampled.

Contingency Table and Mosaic Plot

Opinion of Laptop Program
Gender | No | Undecided | Yes | Row Totals
Female | 142 | 7 | 53 | 202
Male | 78 | 1 | 67 | 146
Column Totals | 220 | 8 | 120 | 348

2. What percentage of females surveyed have a favorable opinion of the laptop program?
\(p = 53/202 = .2624\) or 26.24%

3. What percentage of those students who had a favorable opinion of the laptop program were female?
\(p = 53/120 = .4417\) or 44.17%

4. What percentage of males surveyed had an unfavorable opinion of the laptop program?
\(p = 78/146 = .5342\) or 53.42%

5. What is the estimated probability that a randomly selected student has a favorable opinion of the laptop program?
\(120/348 = .3448\) or a 34.48% chance

Mosaic Plot of Laptop Program Opinion and Gender
[Figure: mosaic plot of laptop program opinion by gender]

How to read a mosaic plot

The thin strip off to the side shows the breakdown of laptop program opinion for all 348 respondents. For example, the percent having a favorable opinion, 34.48%, is shown by the blue shading.

The width of the vertical strips in the main plot is controlled by the number/percent of respondents in each gender. Here more females were sampled, so their strip in the plot is wider.

The shading within each strip shows the laptop opinions conditional on gender. The plot is a graphical representation of the row percentages in the contingency table.

Contingency Table with Row %’s (in parentheses)
Count (Row %) | No | Undecided | Yes | Row Totals
F | 142 (70.30) | 7 (3.47) | 53 (26.24) | 202
M | 78 (53.42) | 1 (0.68) | 67 (45.89) | 146
Column Totals | 220 | 8 | 120 | n = 348

We can clearly see that in our sample male respondents had a more favorable opinion of the laptop program.
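The percentages in questions 2 through 5, and the row percentages that the mosaic plot displays, can be reproduced directly from the counts. The following is a minimal Python sketch rather than part of the JMP workflow; only the counts come from the contingency table, and the function and variable names are illustrative assumptions.

```python
# Counts taken from the gender-by-opinion contingency table in the text.
counts = {
    "Female": {"No": 142, "Undecided": 7, "Yes": 53},
    "Male": {"No": 78, "Undecided": 1, "Yes": 67},
}


def row_percentages(table):
    """Express each row's counts as percentages of that row's total."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: 100 * n / total for col, n in cells.items()}
    return result


def column_totals(table):
    """Sum the counts in each column across all rows."""
    totals = {}
    for cells in table.values():
        for col, n in cells.items():
            totals[col] = totals.get(col, 0) + n
    return totals


row_pct = row_percentages(counts)      # opinion conditional on gender
col_tot = column_totals(counts)
n = sum(col_tot.values())              # all respondents

# Question 3 conditions on opinion instead of gender.
pct_female_given_yes = 100 * counts["Female"]["Yes"] / col_tot["Yes"]
# Question 5 ignores gender altogether.
pct_yes_overall = 100 * col_tot["Yes"] / n

for gender, cells in row_pct.items():
    print(gender, {col: round(p, 2) for col, p in cells.items()})
print(round(pct_female_given_yes, 2), round(pct_yes_overall, 2))
```

Note that questions 2 and 3 use the same count (53) but different denominators (202 versus 120), which is why the sketch keeps the row-based and column-based calculations separate.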
Nearly 70% of females had a negative opinion vs. approximately 50% for males.

Questions to Think About:

Do these results suggest that the proportion of WSU males who favor the Laptop Program is greater than the proportion of WSU females who do?
We will look at ways to determine whether this difference is “real” or statistically significant in Chapters 6, 14, and 16.

Could a difference this large be explained as chance variation? How could we determine this?
We need to somehow determine how likely we are to get sample proportions/percentages this different if in reality the two populations, female and male WSU students in this case, have the same distribution of opinion.

If we did conclude the proportion of males with a favorable opinion exceeds that for females, can we quantify how much larger we think it is?
We will look at how this is done in Chapters 13 and 14.

GRAPHICAL METHODS FOR DESCRIBING NUMERIC DATA (HISTOGRAMS and STEM-AND-LEAF PLOTS)

Histogram of Book Costs Per Semester
[Figure: histogram of book costs per semester]

1. What would be the typical amount a WSU student would spend on books?
Somewhere between $250 and $275.

2. Most students would have textbook costs in what range?
If we interpret “most” to mean more than half, then $200 - $350 would include over 60% of the students surveyed.

3. How much variation in book costs do we see?
This distribution is unimodal and fairly bell-shaped or normal, with a large percentage of the values within about $75 of what we might consider to be typical. However, there are a few extreme values, with the largest values being at least 6 times larger than the smallest values.

4. Estimate the probability that a randomly selected student would spend between $300 and $400 on books.
By adding up the heights of the appropriate bars we find this probability to be approximately .25 + .07 = .32 or a 32% chance.

5. Estimate the percentage of students who spend more than $500 on books.
Approximately .02 + 0 + .01 = .03 or 3%.

Histogram with Density Scaling
[Figure: density-scaled histogram of book costs, horizontal axis in $100s]

6. Estimate the probability that a randomly selected student has book costs between $200 and $300.
Each rectangle or bar in the histogram has width = .5 (i.e. $50). The area of the rectangles or bars corresponding to the desired interval is given by:
\[ (.5 \times .44) + (.5 \times .46) = .22 + .23 = .45 \]
or a 45% chance.
Note: Using the histogram with data in the original scale shown earlier, we arrive at the same result by adding the heights of the appropriate bars, .22 + .23 = .45.

7. Estimate the probability that a randomly selected student has book costs below $200.
The area of the rectangles or bars corresponding to this range of book costs is given by:
\[ (.5 \times .02) + (.5 \times .08) + (.5 \times .18) = .01 + .04 + .09 = .14 \]
or a 14% chance.

The key idea here is that: PROBABILITY = AREA !!!

If we could do it, better estimates of these probabilities or percentages might come from considering the area beneath the smoothed histogram or density curve. This would require two things:
1) knowing the exact mathematical formula for the curve
2) the ability to find the area beneath that curve, i.e. integral calculus!
We will actually be doing this in Chapter 7 when we discuss continuous random variables.

Stem-and-Leaf Plot of Book Costs
[Figure: stem-and-leaf plot of book costs]

8. What advantage, if any, does the stem-and-leaf display of these data provide when compared to the histogram?
The only real advantage is that you can see the raw data as well as how it is distributed. In my opinion this does not offset the disadvantages.

Histogram of Hours Spent Studying Per Day
[Figure: histogram of hours studying per day]

9. Discuss what is learned about the studying time of WSU students from the histogram.
There are a number of comments that could be made:
The most common response was 3 hours.
About half the students reported studying between 2 and 3 hours per day.

10. What interesting feature(s) does this particular histogram have?
Students reported their times to the nearest half hour, with most reporting their study times to the nearest hour.
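The PROBABILITY = AREA calculations for the density-scaled book-cost histogram (questions 6 and 7) can be checked with a short Python sketch. It is only an illustration: the bar heights are the ones quoted in those answers, while the left edges assigned to the three bars below $200 and the helper name `area_between` are assumptions made here.

```python
# Density-scaled bars for book costs, in units of $100.
# Each entry is (left_edge, density). The heights come from the answers to
# questions 6 and 7; the left edges of the three bars below 2.0 are an
# ASSUMPTION, since the text reports only their heights.
BIN_WIDTH = 0.5  # every bar spans $50

bars = [
    (0.5, 0.02),
    (1.0, 0.08),
    (1.5, 0.18),
    (2.0, 0.44),
    (2.5, 0.46),
]


def area_between(bars, lo, hi, width=BIN_WIDTH):
    """Add up width * density for bars lying entirely inside [lo, hi]."""
    return sum(width * density
               for left, density in bars
               if left >= lo and left + width <= hi)


p_200_to_300 = area_between(bars, 2.0, 3.0)  # question 6
p_below_200 = area_between(bars, 0.0, 2.0)   # question 7

print(round(p_200_to_300, 2), round(p_below_200, 2))
```

Because every bar has the same width, multiplying each density by .5 recovers the relative frequencies of the original-scale histogram, which is why both versions of the histogram give the same answers.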
Using Histograms to Compare Two Groups on a Single Numeric Variable

[Figure: histograms titled “GPA’s of Female WSU Students” and “GPA’s of Male WSU Students”]

11. Use these histograms to compare and contrast the GPAs of female and male WSU students.
It appears that female students reported having higher GPAs than male students. The variation in GPAs is similar for both genders, as is the distributional shape.

Hours Studying Per Day (Females vs. Males)
[Figure: histograms of hours studying per day, with panels “Females” and “Males”]

12. What are the differences between males and females in terms of the hours they spend studying per day?
Female respondents reported having studied more per day than male respondents. The most common response, or mode, for females is 3 hours and for males is 2 hours.

Numerical Measures of “Average”

Mean:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Median – middle value when data is ranked from smallest to largest.
Mode – most frequently occurring value.

Hours Spent Studying Per Day
[Figure: JMP distribution output for hours spent studying per day]

1. What are the mean, median, and mode for the time spent studying per day?
Sample mean (\(\bar{x}\)) = 2.65 hrs.
Sample median (m) = 3.0 hrs.
Sample mode = 3.0 hrs.

[Figure: JMP distribution output titled “Hours Spent Studying (WSU Females)” and “Hours Spent Studying (WSU Males)”]

2. Use the measures of central tendency to compare and contrast the hours spent studying for male and female students.
\(\bar{x}_{females} = 2.91\) and \(\bar{x}_{males} = 2.29\), so \(\bar{x}_{females} > \bar{x}_{males}\); similarly for the sample medians, \(m_{females} > m_{males}\), and the sample modes.

Numerical Measures of Variability

Range = max - min

Variance and Standard Deviation
\[ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2} \]
which is an estimate of the population standard deviation
\[ \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2} \]
The sample variance and population variance are the squares of these quantities.

Interquartile Range (IQR) – range of the middle 50% of the data: \(IQR = Q_3 - Q_1\)

Coefficient of Variation (CV)
\[ CV = \frac{s}{\bar{x}} \times 100\% \]
This measures the amount of variation relative to the size of the mean.

GPA of WSU Students
[Figure: JMP distribution output for GPA]

1. What is the range of the GPAs?
Range = 4 – 1.90 = 2.10

2. What are the sample variance and standard deviation?
\(s = .473\) and \(s^2 = .224\)

3. What is the inter-quartile range (IQR)?
IQR = 3.465 – 2.720 = .745

4. What is the coefficient of variation (CV)?
CV = (.473/3.065)*100% = 15.43%

5. Approximately 68% of the students will have GPAs in what range?
Assuming the distribution of GPAs is approximately normal, we estimate that 68% of students will have GPAs in the range 3.065 – .473 to 3.065 + .473, which is 2.592 to 3.538.
Approximately 95% of the students will have GPAs in what range?
3.065 – 2*.473 to 3.065 + 2*.473, which is 2.119 to 4.011. We should truncate the 4.011 to 4.00, as GPAs cannot exceed 4.00.
Approximately 99.73% of the students will have GPAs in what range?
Find the interval given by \(\bar{x} \pm 3s\).

Cost of Textbooks for WSU Students
[Figure: JMP distribution output for textbook costs]

6. Approximately what percent of WSU students spend between $178.09 and $357.19?
This interval represents the sample mean +/- one standard deviation, so assuming the population is approximately normally distributed we would say 68%.
Approximately what percent of WSU students spend between $88.52 and $446.76?
This interval represents the sample mean +/- two standard deviations, so assuming the population is approximately normally distributed we would say 95%.

7. Which has more variation, the GPAs of WSU students or their textbook costs? Explain.

Numerical Measures of Relative Standing

Percentiles/Quantiles and Quartiles
z-scores

1. 25 percent of WSU students have GPAs below what value? 2.720
2. 75 percent of WSU students have GPAs below what value? 3.465
3. What percent of WSU students have GPAs below 3.706? 90%
4. What is the z-score associated with a GPA = 3.75? 1.45
5. What is the z-score associated with a GPA = 2.25? -1.72
6. What is the z-score associated with a GPA = 3.90? 1.76

Histogram of z-scores for GPAs
[Figure: histogram of GPA z-scores; Mean = 0, Standard Deviation = 1]

Comparative Boxplots
[Figure: comparative boxplots of GPA by gender]

7. How do the GPAs of female students compare to the GPAs of males?
The mean and median for WSU female students are approximately .15 grade points higher than those for the male students.
The variation in GPAs for both groups is similar. This is evidenced by the fact that both the sample standard deviations (\(s_{female} = .46\) and \(s_{male} = .48\)) and the IQRs (\(IQR_{female} = 1.0\) and \(IQR_{male} = .90\)) are nearly equal. We will examine later how we can use these results to decide whether or not the true population means or medians significantly differ.
Note: The standard error of the mean for females is smaller because more females were included in the sample (\(SE_{female} = .46/\sqrt{192} = .033\)).

8. How do the GPAs of students who skip at least one class per week compare to those who do not?
The mean and median for students who do not regularly skip classes are larger than those for students who regularly skip classes (\(\bar{x}_{no} = 3.14\) vs. \(\bar{x}_{yes} = 2.87\) and \(m_{no} = 3.2\) vs. \(m_{yes} = 2.8\)). The variation in GPAs for both groups is similar. This is evidenced by the fact that both the sample standard deviations (\(s_{no} = .45\) and \(s_{yes} = .47\)) and the IQRs (\(IQR_{no} = 1.0\) and \(IQR_{yes} = .90\)) are nearly equal. We will examine later how we can use these results to decide whether or not the true population means or medians significantly differ.
Note: The standard error of the mean for “non-skippers” is smaller because there are over twice as many of them in our survey (\(SE_{no} = .45/\sqrt{236} = .030\)).

CDF Plots (ogives)

We will see in Chapter 4 that the cumulative distribution function (CDF) is defined as
\[ F(x) = P(X \le x) \]
which reads as “the probability the random variable X takes on a value less than or equal to x”. For example, let X = the GPA of a randomly selected female WSU student; we might be interested in the probability that her GPA is at or below 3.00, which would be written as \(P(X \le 3.00)\).

We can estimate this using data for a particular value x as follows:
\[ \hat{F}(x) = \hat{P}(X \le x) = \frac{\#\text{ of } x_i\text{'s} \le x}{n} \]

The CDF Plot is a plot of the estimated cumulative distribution function vs. x. The estimated probability only changes at the observed \(x_i\) values. This gives the CDF a step function appearance.

CDF Plot ~ GPA of “Skippers” vs. “Non-Skippers”
[Figure: CDF plot of GPA with one curve for “Yes” and one for “No”]

Skippers – said “Yes” they skip at least one class per week.
Non-skippers – said “No” they do not skip at least one class per week.

10. Estimate the probability that a skipper has a GPA below 2.80.
Approximately .50 or a 50% chance

11. Estimate the probability that a non-skipper has a GPA below 3.00.
Approximately .35 or a 35% chance

12. Which group, skippers or non-skippers, has a greater chance of having a GPA below 3.0?
Skippers

Comparing age at which respondent had their first child across education level (comparative boxplots with histograms added)
[Figure: comparative boxplots with histograms of age at first child by education level]

How does the age distribution differ across education level?
The typical age at which a person had their first child appears to differ greatly across the different education levels. The more educated a person is, the later in life they have their first child. We estimate that those who dropped out of high school had their first child at around 20 years of age, those with a high school diploma only at around 22 years of age, those with a college degree at around 26 years of age, and those with professional degrees at around 28 years of age. The histograms suggest that the distributional shapes also differ. Those with the least amount of education have skewed right age distributions, while those with more education have skewed left distributions. The variability in each distribution is roughly the same, with the exception of those with professional degrees.
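Returning to the CDF plots discussed earlier: the estimated CDF at x is simply the fraction of observations at or below x, and it can be sketched in a few lines of Python. This is an illustrative sketch, not the JMP procedure. The helper name `ecdf` is an assumption, and the GPA values are hypothetical numbers made up for the example; they are not the survey data.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function F where F(x) = (# of observations <= x) / n."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted values are <= x
        return bisect_right(ordered, x) / n

    return F


# HYPOTHETICAL GPAs for illustration only; these are NOT the survey data.
skipper_gpas = [2.1, 2.5, 2.7, 2.8, 2.8, 3.0, 3.2, 3.6]

F_skip = ecdf(skipper_gpas)

print(F_skip(2.8))   # fraction of the sample at or below 2.80
print(F_skip(2.9))   # same value: no observation lies between 2.8 and 2.9
print(F_skip(4.0))   # every observation is at or below 4.00
```

The estimate stays flat between observed values and jumps at each one, which is the step function appearance described in the CDF plots section.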