These are the questions for Masters Comps and Doctoral Prelims for Fall, 2011. Please write your answers in the blue books. Mark each item number and section clearly. Put your name on the outside of each blue book. Do NOT write your name inside the books. Try to write neatly, please. The proctor will pick up the booklets when you are done or when your time is over. Turn off your phone. If you need a calculator, one will be provided to you.

Best of luck,
Drs Lissitz, Jiao, and Stapleton

Statistics

1. This question is related to power and hypothesis testing.

a. Explain the following terms in hypothesis testing: Type I error rate, Type II error rate, and power.

RUBRIC: Type I error rate – probability of rejecting the null hypothesis when the null hypothesis is true. Type II error rate – probability of failing to reject the null hypothesis when the null hypothesis is false. Power – the probability of rejecting the null hypothesis when the null hypothesis is false. The Type II error rate and power sum to 1.0.

b. Use a graph to represent the relations among α, β, and power in each of the following two hypothesis tests: a two-tailed test, and a one-tailed test with the alternative distribution falling at the upper end of the null distribution.

RUBRIC: Graph should have a null and an alternative sampling distribution. The alternative distribution should be to the right of the null (at least for part #2). Correct lines/shading should be used to identify two equal-sized areas under the null distribution for α in part #1, and one area (equal to the two combined) for α in part #2. Student should use two graphs – one for part #1 and one for part #2 – for clarity. Each section should be correctly labeled (power, Type II and Type I).

c. Specify the factors affecting the power in hypothesis testing and state each factor’s impact on power (specify the direction of change in the factor).
RUBRIC: Should specify:
- Sample size – the larger the sample, the smaller the variability in the sampling distributions, the greater the power.
- Effect size – the greater the effect size, the greater the separation of the sampling distributions, the greater the power.
- Tails of the hypothesis test – a one-tailed test has greater power when the effect lies in the hypothesized direction (as seen in the graph in 1b).
- Type I error rate – the greater the rate, the greater the power.
- Variance of the sampling distribution – the greater the variance of the sampling distribution, the smaller the power.

2. The county school board has instituted an incentive program to encourage middle schools to improve their mathematics instruction. If a school’s 7th grade students score on average above the 50th percentile on a standardized exam, then the school is awarded $20,000 for equipment purchases. The standardized exam has a normed mean of 50 with a standard deviation of 10. The Park middle school participated in this program last year and the school board randomly selected ten of the school’s sixty students to take the exam. On average, the students earned a score of 48.2 and because this average was lower than 50, the school did not receive the $20,000. Statistically speaking, do you think it was appropriate that the Park middle school did not receive the $20,000? Why or why not?

RUBRIC: The student should indicate that the sample mean is not necessarily reflective of the population mean, discussing the concept of sampling error. An excellent answer would conduct an actual z-test (z = -1.8/3.16 = -0.57, p > .05) or create a confidence interval, but an answer could pass without doing the actual test(s). There is no evidence that the population student mean is not at least 50.
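The z-test and confidence interval mentioned in the rubric for question 2 can be sketched as follows. This is only an illustration: it assumes the normed standard deviation of 10 applies to the school, uses a 95% interval, and ignores any finite-population correction for sampling 10 of 60 students. The function name `one_sample_z` is hypothetical.

```python
import math

def one_sample_z(sample_mean, pop_mean, pop_sd, n):
    """Return (standard error, z, two-sided p) for a one-sample z-test."""
    se = pop_sd / math.sqrt(n)              # standard error of the mean
    z = (sample_mean - pop_mean) / se       # standardized distance from the norm
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    p_two_sided = 2.0 * min(cdf, 1.0 - cdf)
    return se, z, p_two_sided

# Values from the question: mean 48.2, normed mean 50, SD 10, n = 10
se, z, p_two_sided = one_sample_z(48.2, 50.0, 10.0, 10)

# 95% confidence interval for the school's population mean (assumed level)
ci = (48.2 - 1.96 * se, 48.2 + 1.96 * se)

print(f"SE = {se:.2f}, z = {z:.2f}, two-sided p = {p_two_sided:.3f}")
print(f"95% CI: ({ci[0]:.1f}, {ci[1]:.1f})")
```

Here z is about -0.57 and the interval runs from roughly 42.0 to 54.4, so it contains 50; the sample gives no evidence that the school's population mean is below the norm.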
Note that the question began with “statistically speaking” so there is no reason to get into issues of validity, but a student might discuss the low-stakes nature of the assessment for the student and question the validity of the scores on this assessment. (But this would be beyond the grading of this question.)

3. A researcher is interested in investigating the effect of genres of writing prompts on students’ writing achievement. Three genres were included in the study: narrative, descriptive, and persuasive. Each student was required to complete all three prompts. To reduce the ordering effect of prompts, the researcher assigned a representative sample of the student population randomly to each of the possible orderings of the three writing prompts. Further, the researcher believed that gender may interact with the genre of writing prompts and so included it in the design of the study.

a. Suggest a proper analysis of variance (ANOVA) method for data analyses for this research study. Be sure to reflect the nature of the analysis components.

RUBRIC: A good answer would suggest the name: split-plot design. The student should indicate that there would be one repeated factor and that gender would be treated as a between-subjects factor. Given the counterbalancing, the between-subjects ordering factor is not necessary to include (but it is not problematic to include it as well).

b. What are the assumptions for the identified data analysis model? Explain each assumption and suggest a method for checking potential violation of each assumption.

RUBRIC: Assumptions:
- Homogeneity of variance – the variability of the scores is the same in the population across each of the genders. Levene’s test or some other such test would be an appropriate assessment of the assumption; alternately, a suggestion to examine plots of data would be appropriate.
- Normality of sampling distribution – the distribution of the scores in the population is normal within each cell of the design. Check skew and kurtosis, use the Kolmogorov-Smirnov or Shapiro-Wilk test, or rely on sample sizes (CLT) at each measurement occasion.
- Sphericity – the variance of the difference scores between each pair of repeated scores is the same. Mauchly’s test can be used; if the assumption is violated, use an alternate df/test.
- Independence of observations – conditional on the design, scores are not dependent. Check the research design description. (No information was provided, however.)

4. The standard error of the jth regression coefficient in a multiple regression model can be computed as

$$SE_{\hat{\beta}_j} = \frac{S_Y}{S_{X_j}} \sqrt{\frac{1 - R^2}{(N - p - 1)(1 - R_j^2)}}$$

where $R^2$ is the proportion of variance accounted for by the entire model, and $S_Y$ and $S_{X_j}$ are the standard deviations of Y and the jth variable, $X_j$. $R_j^2$ is defined as the proportion of variance in the jth variable accounted for by the other predictors in the model. Using this formula, discuss in detail what affects the power of the t-test to detect a significant finding when testing the jth regression coefficient.

RUBRIC:
- N – the larger the sample size (N), the greater the power, all other things constant.
- p – the fewer the number of other predictors, the greater the power, all other things constant.
- $R_j^2$ – we want the jth variable to share as little variance with the other variables as possible, so $R_j^2$ should be small (as close to zero as possible) to achieve the greatest power, all other things constant. This measure represents collinearity.
- $R^2$ – we want maximal variance explained in the outcome variable; the closer $R^2$ is to 1, the higher the power, all other things constant.

Research design

1. A researcher wants to examine the relation between amount of time spent on homework and achievement in algebra. He is given two possible study designs to consider.
Study A: 30 randomly-selected students enrolled in algebra classes in School A would be assigned to do 30 minutes of homework per day and 30 randomly-selected students in algebra classes in School B would be assigned to do 2 hours of homework per day. At the end of one month, the 60 students would be given a standardized algebra exam. The average performance on the exam across the two conditions would be compared using a t-test to determine the relation of time spent on homework and algebra performance.

Study B: 60 randomly-selected algebra students across Schools A and B would be surveyed and asked how much algebra homework they do on average per day. Additional data about the students would be obtained on the survey. The 60 students would also be given a standardized algebra exam. To answer the research question, the performance on the exam would be regressed on time on homework, after controlling for other student characteristics believed to influence performance.

Recommend the study design that the researcher should use. In your recommendation, be sure to discuss the advantages and disadvantages of each of the designs, specifically addressing threats to internal validity. You may also want to suggest additional analyses that the researcher should undertake to evaluate any assumptions that are being made in selecting the study design.

RUBRIC: Selection of either study is fine.

Study A – the biggest threat to internal validity is likely differential selection (disadvantage). The advantage is that manipulation of the independent variable did occur. We would want to evaluate the equivalence of student ability in each school prior to the study and examine equivalence of instruction. Possibly include any pre-existing differences in an ANCOVA model. There is a possible threat to external validity, given that students may not generally study 30 min vs 2 hours. (Study B might capture more natural values of study time.)
Study B – this is a correlational design, so it has little internal validity; a third variable might explain any relation between the two. It may be more palatable to the school administration than forcing study time on students. Additionally, will students be truthful in their responses regarding homework time? The design specifically includes consideration of covariates, but there might be a treatment by selection effect.

2. You are consulted to design a comparability study to investigate the equivalence between a paper-and-pencil test and its computer-based version for a large-scale state math test. The researcher is interested in using either a matched pair design or a repeated measures design.

a. Describe how to implement the study using each design. For the matched pair design, be sure to include two matching variables which you think are relevant.

RUBRIC: For matched pair, student should identify how to select participants and how to do the matching. Students should be randomly assigned within matched pairs to each of the conditions. Student should explain the variables used for matching (any variables suggested are reasonable here). For repeated measures, student should explain the ordering (or counterbalancing, if used) of the tests. Should explain how participants are selected and how the tests are administered (time allowed between).

b. Discuss the pros and cons of each design. Make sure to refer to the potential threats to internal validity and external validity.

RUBRIC: For repeated measures, clearly state that testing effects may be problematic. Counterbalancing might solve the problem.
- Each subject serves as his/her own control
- Subjects are exposed to every level of the treatment
- Fewer subjects needed
- Threats to validity: carryover effects, sample attrition
- Counterbalancing is possible, but fatigue remains and the design is time-consuming

Matched pair requires greater sample size and depends on the strength of matching variables for equivalence.
Random assignment within pair would alleviate any differential selection issues.
- After matching based on one or multiple variables, random assignment within the matched group to treatment levels
- Subjects are exposed to one level of the treatment
- Reduced threat of fatigue

c. Recommend possible actions the researcher could take to reduce the threats.

RUBRIC:
- Use between-subject design
- Use more matching variables

Assessment

1. Assume that you are running a carefully constructed assessment program that includes a one-hour session with 50 test questions administered from a carefully constructed item bank of a thousand questions. Suppose, further, that you were required to decrease the testing to 45 minutes and therefore need to decrease the number of test items. Describe three considerations that you would worry about. What are they and why would they potentially be a problem? Is there anything you can do to minimize the problems associated with each?

RUBRIC: Concerns should include 1) content coverage, 2) difficulty level coverage, and 3) loss of reliability in the total score from the assessment. To minimize the problems, work with a test blueprint to address items 1 and 2. For 3, use CAT, or at least select those items with the highest discrimination and/or information across the range of theta that is desired to be measured. The Spearman-Brown equation should be referenced, along with how it might be used and the assumptions for its use. Other points: maintaining test design fidelity; changing the difficulty can be an issue, so take care to match difficulty to the ability of test takers and to where on the scale you are focusing; how to actually select the items to eliminate; reliability; inter-item correlation; purpose of the test.

2. Test item analysis utilizes a number of classical statistical calculations. Suppose you had a set of multiple choice data from a test and each item was scored correct (1 point) and incorrect (0 points).
The total test score is measured both as the number correct and as a pass (1) and fail (0) score. The following are four statistics that are often used to analyze such items to decide if the item behavior is good or bad from the standpoint of developing a test with high reliability:
- Phi
- Tetrachoric
- Point Biserial
- Biserial

Answer each of the questions:

a. Define each statistic.

RUBRIC:
- Phi: 2x2 (two dichotomous variables), equal to Pearson.
- Tetrachoric: 2x2, with the assumption that both dichotomies come from continuous, normally distributed variables.
- Point Biserial: 2 by many (one dichotomous and one many-valued variable), equal to Pearson.
- Biserial: 2 by many, but the dichotomous variable is assumed to come from a normally distributed continuous variable.

b. What are the pros and cons of each of the statistics? Include any concerns with the accuracy of each statistic as a measure of association.

RUBRIC: Issue of differential proportions decreasing the correlation, unless you use the normality assumption and an underlying continuous dimension. Tetrachoric and biserial are more complex and harder to describe to a lay audience, and they are greater than or equal to their related pair (phi and point biserial, respectively). Might talk about dichotomous data issues. Assumptions are different.

Measurement

1. Properly understanding item properties is vital in test form construction. Using the two graphs below, answer the following questions.

a. Identify the most difficult item and the most discriminating item assuming the items fit a two-parameter unidimensional item response theory model.

RUBRIC: Most difficult = Item 2. Most discriminating = Item 3.

b. If two test forms are to be constructed to classify examinees with ability around 2 and 0.5 on the logit scale respectively, which two items should be included on each form? State your reasons.

RUBRIC: For the form targeting 2, Item 2 (blue) and Item 1 (pink). For the form targeting 0.5, Items 1 and 3. These two carry the most information at those two locations.
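The reasoning in the rubric for 1b (pick the items carrying the most information at the target ability) can be sketched with the two-parameter logistic model, where item information is a² · P(θ) · (1 − P(θ)). The item parameters below are hypothetical values chosen only to be qualitatively consistent with the rubric (Item 2 hardest, Item 3 most discriminating); they are not read from the figures, and the function names are illustrative.

```python
import math

def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def info_2pl(theta, a, b):
    """2PL item information: a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

# Hypothetical (a, b) pairs; NOT the parameters behind Figures 1 and 2
items = {
    "Item 1": (1.0, 1.0),
    "Item 2": (1.2, 2.0),   # largest b: most difficult
    "Item 3": (1.6, 0.0),   # largest a: most discriminating
}

def best_items(theta, items, k=2):
    """Return the k item names with the highest information at theta."""
    ranked = sorted(items, key=lambda name: info_2pl(theta, *items[name]),
                    reverse=True)
    return ranked[:k]

for target in (2.0, 0.5):
    print(target, best_items(target, items))
```

With these assumed parameters, the form targeting 2 picks Items 2 and 1, and the form targeting 0.5 picks Items 3 and 1, matching the rubric's selection.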
Figure 1. Item characteristic curves for Items 1, 2, and 3 (probability of a correct response versus theta, from -4.0 to 4.0).

Figure 2. Item information curves for Items 1, 2, and 3 (item information versus theta, from -4.0 to 4.0).

2. In the 3 parameter IRT item analysis, there are a number of statistics that are computed to examine the quality of items. The following are 5 useful statistics:
- The a parameter
- The b parameter
- The c parameter
- Conditional item information
- Conditional item error (standard error of measurement)

a. Define each statistic.

RUBRIC:
- The a parameter: the discrimination parameter, indicating how well an item may distinguish between high- and low-ability students. It is proportional to the slope at the steepest point on the item characteristic curve.
- The b parameter: item difficulty, indicating the point on the ability scale at which the probability of answering the item correctly is (1+c)/2. In the 1PL and 2PL IRT models, item difficulty is the point on the ability scale where examinees have a 50% probability of getting the item correct.
- The c parameter: the guessing parameter, the lower asymptote of the item characteristic curve, indicating the probability that lower-ability students get the item correct.
- Conditional item information: indicates the measurement precision at a certain ability level. It depends on the probabilities of a correct and an incorrect response and on the first derivative of the probability of a correct response with respect to theta, and is determined by all item parameters involved in a particular IRT model, conditional on theta.
- Conditional item error (standard error of measurement): an index of measurement precision/imprecision. It is the square root of the inverse of the conditional item information, conditional on theta.

b. Explain how (in what context) each statistic is useful or not useful.

RUBRIC: Talk about the usefulness of each. The a parameter reflects the quality of the item, b shows where the item is most likely useful, c gives us an idea of how useless the item is for low-ability examinees, and item information is used for planning and for making decisions. Information and SE are related; explain how.

c. If you were interested in designing a test so that it is the most useful with a certain ability level of student, which one(s) of these statistics would you pay closest attention to and why?

RUBRIC: Information or item error statistics or the b parameter. Pick one and defend its use. Talk about targeting the design of a test.

3. Validity is often classified into three types – content, criterion and construct. Suppose you were going to use a test to select workers in a company and six months later you were going to measure their success on the job using a criterion measure. You plan to compare each employee’s initial test results to his or her criterion measure. What concerns might you have with this study? Be sure to talk about issues that would impact on criterion validity in this context. How do you think this approach to validating the test for selection compares to content validity? Talk about the differences between these two approaches to validating a selection test in a work environment.

RUBRIC: The existence of a criterion and the utilitarian nature of the study make it a natural criterion validity study. Concerns include the quality of the criterion and the degree to which it is really an indication of the success of an employee, and how much learning occurs on the job so that the entry exam becomes less relevant. Scale truncation and restriction of range are also issues.

4. An evaluation team has created a scale to assess the satisfaction of trainees in a workshop. The following scale was created: 1 means the workshop was much too hard, 2 means it was too hard. 3 means the workshop was just right.
4 means it was too easy and 5 means it was much too easy. Discuss this scale design. Does it pose any issues for using the results in a correlation study that the researchers were planning to conduct? If you were asked to improve the design of the scale, what would you suggest and why? If you like the scale the way it is, tell us why you like it. What level of measurement scale is it and why (nominal, ordinal, interval or ratio)? What impact does the scale of measurement have upon data analysis? Be specific about the importance of the scale upon an analysis. RUBRIC: Note that the scale is not monotonic. 3 is the best number and 1 and 5 are worse, so you can’t use the scale as is for any statistical work. Suggest a fix, such as scale distance from 3 being the transformation, like an unfolding scale. The transformed scale is more ordinal than interval, although you might treat it as interval for practical purposes. Not good for correlation analysis using Pearson product moment correlation which requires at least interval scaled variables. May use multiple items to measure the different facets of trainees’ satisfaction to get a sum score which is continuous.
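The fix suggested in the rubric (rescoring each response as its distance from 3) can be illustrated with a small sketch. The satisfaction values below are hypothetical, made up only to show why the raw scale can hide a relation that the folded scale reveals; `fold` and `pearson` are illustrative helper names.

```python
import math

def fold(rating, ideal=3):
    """Rescore a rating as its distance from the ideal point (3 = just right)."""
    return abs(rating - ideal)

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

ratings = [1, 2, 3, 4, 5]             # raw scale: much too hard ... much too easy
satisfaction = [1, 3, 5, 3, 1]        # hypothetical outcome, highest at "just right"

folded = [fold(r) for r in ratings]   # distance from "just right"

r_raw = pearson(ratings, satisfaction)
r_folded = pearson(folded, satisfaction)
print(folded, round(r_raw, 3), round(r_folded, 3))
```

In this made-up data the raw ratings correlate zero with satisfaction, while the folded scores correlate perfectly (negatively), which is the non-monotonicity problem the rubric describes. The folded scores are still best treated as ordinal, as the rubric notes.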