A quick reference for symbols and formulas covered in COGS14:

MEAN OF SAMPLE:

$\bar{x} = \frac{\sum x_i}{n}$

• $\bar{x}$: "x bar," the mean (i.e., average) of a sample
• $\sum$: "capital sigma," the sum of everything that comes after it
• $x_i$: "x sub i," each individual value in your sample. For example, when you're finding the mean of the values 3, 4, and 5, you substitute 3 into the $x_i$ spot, then 4, then 5, and then add these together.
• n: the number of observations in your sample; for the above example of finding the mean of 3, 4, and 5, n = 3 observations.

MEAN OF POPULATION:

$\mu = \frac{\sum x_i}{N}$

• $\mu$: "mu," the mean of a population
• Notice that this equation is very similar to the one for the mean of a sample; the only difference is that you know you have observed the ENTIRE population (this is rare in real life).

ESTIMATED POPULATION VARIANCE / VARIANCE OF A SAMPLE:

$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

• $s^2$: "s squared," the variance of a sample, also known as the estimated variance of a population
• $\sum$: "capital sigma," the sum of everything that comes after it
• $x_i$: "x sub i," each individual value in your sample. For example, when you're finding the variance within the sample of values 3, 4, and 5, you substitute 3 into the $x_i$ spot, subtract the mean from 3, and then square this value. Repeat this step for as many values of $x_i$ as you have, then add those results together.
• n − 1: the number of observations in your sample minus 1; for the example of observations 3, 4, and 5, n = 3, so n − 1 = 2.

ESTIMATED POPULATION STANDARD DEVIATION / STANDARD DEVIATION OF A SAMPLE:

$s = \sqrt{s^2}$

• s: the standard deviation of a sample, also known as the estimated standard deviation of the population
• See above for how to calculate $s^2$, then take the square root of your answer to find the standard deviation.

POPULATION VARIANCE:

$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$

• Note that this equation is very similar to the equation for estimated population variance above. The difference is that you divide by N in the denominator to find population variance, which is equal to the total number of members of your population, whereas you divide by n − 1 to find the ESTIMATED population variance.
• $\sigma^2$: "sigma squared," the population variance
• $\sum$: "capital sigma," the sum of everything that comes after it
• $x_i$: "x sub i," each individual value. Substitute each value in, subtract the mean, square the result, and add the results together.
• N: the number of members of your population (i.e., the number of observations)
• **This equation will only be used when you can observe the ENTIRE population, which is commonly not feasible in real life. But you should understand how to find population variance, and how it is related to (and different from) ESTIMATED population variance.

POPULATION STANDARD DEVIATION:

$\sigma = \sqrt{\sigma^2}$

• $\sigma$: "sigma," the population standard deviation
• See above for how to calculate sigma squared, then take the square root of your answer to find sigma.
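All of the formulas above differ only in their denominators (n, n − 1, or N). Here is a minimal Python sketch (my own illustration, not part of the course materials) working through the 3, 4, 5 example:

```python
# A minimal sketch showing how the sample and population formulas above
# differ only in their denominators (n - 1 vs. N).
import math

data = [3, 4, 5]  # the example sample used above

n = len(data)
x_bar = sum(data) / n                               # sample mean
s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)  # sample variance (n - 1)
s = math.sqrt(s2)                                   # sample standard deviation

# If these three values were the ENTIRE population, divide by N instead:
sigma2 = sum((x - x_bar) ** 2 for x in data) / n    # population variance (N)
sigma = math.sqrt(sigma2)

print(x_bar, s2, s, sigma2, sigma)  # 4.0, 1.0, 1.0, 0.666..., 0.816...
```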
ENTROPY:

$H = -\sum f(x_i)\log_2(f(x_i))$

• H: the symbol that denotes entropy
• $\sum$: "capital sigma," the sum of everything that comes after it
• $f(x_i)$: the relative frequency of an outcome occurring. For example, you flip a coin 10 times, and 4 times it comes up heads; the relative frequency of heads = 0.4.
• For each outcome, figure out the relative frequency, find $\log_2$ of that frequency, and then multiply that value by the relative frequency itself. Once you have done this for each outcome, add all your answers together and take the negative of the sum to find entropy. (A worked sketch follows the relative-entropy section below.)

MAXIMUM POSSIBLE ENTROPY:

$H_{max} = -\log_2(1/k) = \log_2(k)$

• k: the number of possible outcomes. For example, with a coin toss, there are 2 possible outcomes. With a die roll, there are 6.

RELATIVE ENTROPY:

$J = \frac{H}{H_{max}}$

• A value close to 1 indicates close to maximum possible entropy. A value close to 0 indicates close to minimum possible entropy.
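A minimal Python sketch (my own illustration, using the coin example above: 10 flips, 4 heads and 6 tails) computing entropy, maximum possible entropy, and relative entropy:

```python
import math

rel_freqs = [0.4, 0.6]   # relative frequency of each outcome (heads, tails)
k = len(rel_freqs)       # number of possible outcomes

H = -sum(f * math.log2(f) for f in rel_freqs if f > 0)  # entropy (skip f = 0)
H_max = math.log2(k)                                    # maximum possible entropy
J = H / H_max                                           # relative entropy

print(H, H_max, J)  # ~0.971, 1.0, ~0.971: close to maximum entropy
```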
EXPECTED VALUE OF A RANDOM VARIABLE:

$E(X) = \sum P(X = x_i)\,x_i$

• E(X): notation for "expected value"
• $\sum$: "capital sigma," the sum of everything that comes after it
• P: probability
• $x_i$: "x sub i," each possible observed value. For example, you are trying to find the expected value for a die that has 5 sides showing "1" and 1 side showing "0"; then 1 and 0 are the values you plug in for $x_i$. You would first figure out the probability of rolling a 1 (P = 5/6) and multiply that P by the value 1. Then repeat with the probability of rolling a 0 (P = 1/6) times the value 0, add these results together, and find your expected value: E(X) = 5/6.

VARIANCE OF A RANDOM VARIABLE:

$Var(X) = \sum P(X = x_i)(x_i - E(X))^2$

• Var(X): notation for "variance of a random variable"
• $\sum$: "capital sigma," the sum of everything that comes after it
• E(X): see above; this is the expected value of the random variable, so to find the variance you will first need to find the expected value.
• $x_i$: "x sub i," each possible observed value. To find the variance, plug in each possible value for $x_i$, subtract the expected value from this observed value, and square the answer. Then multiply this answer by the probability of getting that observed value. For example, assume we roll a fair die and want to know the variance of the random variable. We find that the expected value = 3.5. For each possible value of the die (1, 2, 3, 4, 5, 6), plug the value in for $x_i$, subtract the expected value of 3.5, square the answer, and then multiply it by the probability of rolling that value (in this case each number has a 1/6 chance of being rolled). Calculate this for all 6 numbers, and sum those components together to find the variance.

STANDARD DEVIATION OF A RANDOM VARIABLE:

$Std(X) = \sqrt{Var(X)}$

• Std(X): standard deviation of a random variable
• Once you compute the variance as in the above example, take its square root to get the standard deviation of the random variable.

BINOMIAL DISTRIBUTION:

$P(k \mid n, p) = \binom{n}{k} p^k (1-p)^{n-k}$

EXPANDED TO:

$P(k \mid n, p) = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}$

• k: the number of "successful" outcomes. You define what you think a success is; it could be something like getting heads on a coin flip.
• n: the number of trials. This might be listed as the number of times you flip the coin, reach into a bag, etc.
• p: the probability of getting a successful outcome on any one trial. If you are flipping a coin and have defined success as getting heads, then p = the probability of getting a head when you flip the coin.
• $P(k \mid n, p)$: "the probability of getting k successes, given n trials and p probability of success"
• $\binom{n}{k}$: "n choose k," the number of ways of getting k successes out of n trials. $\frac{n!}{k!(n-k)!}$ is the expansion of "n choose k." n! means "n factorial": you take n and multiply it by all positive whole numbers smaller than n. For example, to find 4!, you multiply 4 × 3 × 2 × 1.
• The rest of the equation is just plugging in values to figure out the correct probability of getting k successes across n trials, given that you have a p probability of success on any given trial.
• **Define n, k, and p before you start the problem. It might help to write them next to the binomial equation and then just go back and plug them in where needed.
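The expected value, variance, and binomial formulas above can be checked with a few lines of Python. A minimal sketch of my own (the fair-die and 10-flip coin numbers are assumptions for illustration); math.comb computes "n choose k" exactly as the n!/(k!(n−k)!) expansion does:

```python
import math

# Expected value and variance of a fair six-sided die roll
outcomes = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

EX = sum(p * x for p, x in zip(probs, outcomes))                # E(X) = 3.5
VarX = sum(p * (x - EX) ** 2 for p, x in zip(probs, outcomes))  # ~2.917
StdX = math.sqrt(VarX)                                          # ~1.708

# Binomial: probability of k = 4 heads in n = 10 fair coin flips (p = 0.5)
n, k, p = 10, 4, 0.5
P = math.comb(n, k) * p ** k * (1 - p) ** (n - k)               # ~0.205

print(EX, VarX, StdX, P)
```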
THE SAMPLING DISTRIBUTION OF THE MEAN:

$\mu_{\bar{x}} = E(\bar{X}) = E(X) = \mu_x$

• The sampling distribution of the mean is a very helpful and crucial concept for statistics. In short, it is a hypothetical distribution that represents what you would get if you took infinite samples of size n, took the mean of each of those samples, and then graphed those means. Some things we know about the sampling distribution of the mean:
  o For a large enough n (roughly 25-100), the sampling distribution of the mean will be approximately normally distributed.
  o The mean of the sampling distribution of the mean = the mean of the population.
• $\mu_{\bar{x}}$: mean of the sampling distribution of the mean
• $\mu_x$: mean of the population
• $E(\bar{X})$: expected value of the sampling distribution of the mean
• $E(X)$: expected value of the population

BUT:

$\sigma_{\bar{x}} = \sqrt{\frac{Var(X)}{n}} = \frac{\sigma_x}{\sqrt{n}}$

• $\sigma_{\bar{x}}$: standard deviation of the sampling distribution of the mean
• $Var(X)$: variance of the population
• n: number of observations
• $\sigma_x$: standard deviation of the population
• SO, we know that the standard deviation of the sampling distribution of the mean will always be smaller than the standard deviation of the population by a specific amount (namely, the population standard deviation divided by the square root of the number of observations in a sample).

COHEN'S D:

$d = \frac{\bar{x} - \mu}{\sigma}$

• $\bar{x}$: mean of your sample
• $\mu$: mean of the null hypothesis
• $\sigma$: standard deviation of the null hypothesis
• Cohen's d is a measure of effect size, or how large an effect your sample had in comparison to the null hypothesis.
• d = 0.20 (small effect), d = 0.50 (medium effect), d = 0.80 (large effect)

OBSERVED Z-SCORE:

$z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}$

EXPANDED TO:

$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

• $\bar{x}$: mean of your sample
• $\mu$: mean of the null hypothesis
• $\sigma_{\bar{x}}$: standard error of the mean (also known as the standard deviation of the population divided by the square root of the number of observations)

CONFIDENCE INTERVALS (FOR A Z-TEST):

$\bar{x} \pm (z_{conf})\,\sigma_{\bar{x}}$

• $\bar{x}$: observed mean of your sample
• $z_{conf}$: the critical z-scores for your level of confidence. For purposes of this class, think of these like the critical z-scores for a two-tailed z-test. If you have a 95% confidence interval, you will have the same $z_{conf}$ as you would for a 2-tailed z-test with an alpha level of 0.05. To find your $z_{conf}$, subtract your level of confidence from 100 (i.e., 100 − 95% confidence = 5%). Divide this 5% by 2 to get 2.5% or 0.025, find 0.025 in the "C" column of the z-table, then find the corresponding z-score in the "A" column.
• $\sigma_{\bar{x}}$: standard error of the mean (also known as the standard deviation of the population divided by the square root of the number of observations)
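A minimal Python sketch (with made-up numbers for the sample mean, null mean, and population SD) tying together Cohen's d, the observed z-score, and a 95% confidence interval; scipy.stats.norm.ppf stands in for the paper z-table lookup:

```python
import math
from scipy.stats import norm

x_bar = 104.0  # observed sample mean (hypothetical)
mu = 100.0     # mean under the null hypothesis
sigma = 15.0   # population standard deviation under the null
n = 25         # sample size

se = sigma / math.sqrt(n)        # standard error of the mean
z = (x_bar - mu) / se            # observed z-score
d = (x_bar - mu) / sigma         # Cohen's d (effect size)

z_conf = norm.ppf(1 - 0.05 / 2)  # ~1.96 for a 95% CI (alpha = 0.05, two-tailed)
ci = (x_bar - z_conf * se, x_bar + z_conf * se)

print(z, d, ci)  # 1.333..., 0.266..., (98.12..., 109.87...)
```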
ONE-SAMPLE T-TEST (3 related formulas):

1) $t = \frac{\bar{x} - \mu}{s_{\bar{x}}}$

2) $s_{\bar{x}} = \hat{\sigma}_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{\hat{\sigma}}{\sqrt{n}}$

3) $s = \hat{\sigma} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

• $\bar{x}$: your sample mean
• x: each individual observation in your sample
• n − 1: the number of observations in your sample, minus 1
• $\mu$: the population mean (usually what you are comparing your sample mean to, to see if there is a difference)
• $s_{\bar{x}} / \hat{\sigma}_{\bar{x}}$: the estimated standard error of the mean. Note this is also represented as the Greek letter sigma ($\sigma$) with a "hat," so we can call it "sigma hat"; the hat indicates it is an estimate.
• $s = \hat{\sigma}$: the estimated standard deviation of the population.
• $\sum$: "capital sigma," the sum of everything that comes after it
• To estimate the population standard deviation, we need to find s (or $\hat{\sigma}$), which we find in a similar way to how we always calculate standard deviation. Take each individual score (x) and subtract the mean ($\bar{x}$). Square that value. Repeat for each individual score and then add up what you get. Then divide that sum by the number of observations minus 1 (n − 1), and finally take the square root of your answer to find s (or $\hat{\sigma}$).
• **Note that you will use n at least 2 times in the t-score formula: once to find the estimated standard deviation (formula 3 above) and again when finding the estimated standard error of the mean (formula 2). You will also need n to find your critical t-score on your t-score chart: your degrees of freedom (df) equal the number of observations minus 1 for a one-sample t-test (so df = n − 1 for this test).

CONFIDENCE INTERVAL FOR A ONE-SAMPLE T-TEST:

$\bar{x} \pm t_{conf}(s_{\bar{x}})$

• $\bar{x}$: observed mean of your sample
• $t_{conf}$: the critical t-scores for your level of confidence. For purposes of this class, think of these like the critical t-scores for a two-tailed t-test. If you have a 95% confidence interval, you will have the same $t_{conf}$ as you would for a 2-tailed t-test with an alpha level of 0.05. To find your $t_{conf}$, subtract your level of confidence from 100 (i.e., 100 − 95% confidence = 5%). Go to the 2-tailed side of the t-table, find the column for 0.05, and go down to your df to find the correct $t_{conf}$.
• $s_{\bar{x}}$: estimated standard error of the mean (see above to calculate)

INDEPENDENT SAMPLES T-TEST (3 related formulas):

1) $t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_{\bar{x}_1 - \bar{x}_2}}$

2) $s_{\bar{x}_1 - \bar{x}_2} = \hat{\sigma}_{\bar{x}_1 - \bar{x}_2} = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$

3) $s_p = \hat{\sigma}_p = \sqrt{\frac{(n_1 - 1)\hat{\sigma}_1^2 + (n_2 - 1)\hat{\sigma}_2^2}{n_1 + n_2 - 2}}$

• $\bar{x}_1$: the observed mean of one of your groups
• $\bar{x}_2$: the observed mean of the other group
• $\mu_1$: the hypothetical mean of population 1
• $\mu_2$: the hypothetical mean of population 2
• **Note that $\mu_1 - \mu_2$ will usually be set to 0. This is because your null hypothesis is usually that there is no difference between the two populations you are trying to compare.
• $s_{\bar{x}_1 - \bar{x}_2} / \hat{\sigma}_{\bar{x}_1 - \bar{x}_2}$: pooled estimate of the standard error of the mean. This is what goes in the denominator of your independent samples t-test formula, and it is computed by multiplying the pooled standard deviation (see the next step) by the square root of (1/number of samples in group 1) + (1/number of samples in group 2).
• $n_1$: number of samples in group 1
• $n_2$: number of samples in group 2
• $s_p = \hat{\sigma}_p$: pooled standard deviation. This is computed by entering the correct number of subjects in each group for $n_1$ and $n_2$, respectively, and by entering the correct values for $\hat{\sigma}_1^2$ and $\hat{\sigma}_2^2$ (see below).
• $\hat{\sigma}_1^2$: estimated variance of group 1. To calculate, take each individual score (x) and subtract the group mean ($\bar{x}$). Square that value. Repeat for each individual score and then add up what you get. Then divide that sum by the number of observations minus 1 (n − 1). Repeat these steps with the data from group 2 to get the estimated variance of group 2, $\hat{\sigma}_2^2$.
• Finding the correct df for the independent samples t-test: $n_1 + n_2 - 2$, the total number of samples in group 1, plus the total number of samples in group 2, minus 2.

COHEN'S D FOR INDEPENDENT SAMPLES T-TEST:

$d = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p}$

• To find Cohen's d for an independent samples t-test, subtract the mean of group 2 from the mean of group 1. Then subtract the hypothetical mean of population 2 from the hypothetical mean of population 1 (this step will usually give 0; see the explanation above). Then divide by the pooled standard deviation, which you found above. Remember to take the absolute value of your answer.

CONFIDENCE INTERVAL FOR THE INDEPENDENT SAMPLES T-TEST:

$(\bar{x}_1 - \bar{x}_2) \pm t_{conf}(s_{\bar{x}_1 - \bar{x}_2})$

• You have already calculated most of the important parts of this formula. You found $(\bar{x}_1 - \bar{x}_2)$ earlier as the numerator of your t-score formula, and $s_{\bar{x}_1 - \bar{x}_2}$ as its denominator.
• $t_{conf}$: the critical t-scores for your level of confidence, found exactly as for the one-sample t-test above (for a 95% confidence interval, use the 2-tailed column for alpha = 0.05 at your df).
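Below is a minimal Python sketch (invented data) of the one-sample and independent-samples formulas above, including the pooled standard deviation and Cohen's d; scipy.stats.t.ppf stands in for the t-table lookup:

```python
import math
from scipy.stats import t as t_dist

# --- One-sample t-test ---
sample = [5, 7, 8, 6, 9]
n = len(sample)
x_bar = sum(sample) / n
s = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))  # formula 3
se = s / math.sqrt(n)                                           # formula 2
mu = 6.0                                                        # null-hypothesis mean
t_obs = (x_bar - mu) / se                                       # formula 1
df = n - 1

t_conf = t_dist.ppf(1 - 0.05 / 2, df)  # critical t for a 95% CI
ci = (x_bar - t_conf * se, x_bar + t_conf * se)

# --- Independent-samples t-test (null: mu1 - mu2 = 0) ---
g1, g2 = [5, 7, 8, 6, 9], [4, 5, 6, 5, 5]
n1, n2 = len(g1), len(g2)
m1, m2 = sum(g1) / n1, sum(g2) / n2
v1 = sum((x - m1) ** 2 for x in g1) / (n1 - 1)  # estimated variance, group 1
v2 = sum((x - m2) ** 2 for x in g2) / (n2 - 1)  # estimated variance, group 2
sp = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))  # pooled SD
se_diff = sp * math.sqrt(1 / n1 + 1 / n2)
t_ind = (m1 - m2) / se_diff           # df = n1 + n2 - 2
d = abs(m1 - m2) / sp                 # Cohen's d for this test

print(t_obs, ci, t_ind, d)
```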
REPEATED MEASURES T-TEST (3 related formulas):

1) $t = \frac{\bar{D} - \mu_D}{\hat{\sigma}_{\bar{D}}}$

2) $\hat{\sigma}_D = \sqrt{\frac{\sum (D_i - \bar{D})^2}{n_D - 1}}$

3) $\hat{\sigma}_{\bar{D}} = \frac{\hat{\sigma}_D}{\sqrt{\#\ pairs}}$

• $\bar{D}$: mean (average) of the differences for each pair of observations. Since repeated measures tests use the same subjects across 2 conditions, we are really only interested in the difference in performance between one condition and the other. To find the average of the differences, take the score from one condition minus the score from the other condition. Repeat this for all pairs of observations, add up the differences, and divide by the number of pairs you have. So if you had 10 people who each participated in 2 conditions (such as placebo vs. a drug trial), you will have 10 pairs of scores.
• $n_D$: number of pairs of scores
• $\mu_D$: hypothesized mean difference between the 2 conditions. **Like the independent samples t-test, this will usually be 0.
• $D_i$: difference score for an individual. Subtract the person's score in one condition from their score in the other condition. For example, if someone got a score of 5 in the placebo condition and 0 in the drug condition, their difference score is 5 (5 − 0).
• $\hat{\sigma}_D$: estimated standard deviation for the repeated measures test. This is necessary for the estimated standard error of the mean, which is in turn necessary for the t-score. To calculate, take each individual difference score and subtract the mean of the differences ($\bar{D}$) from it. Square each of these values, then add them together. Divide by the number of pairs of scores minus 1, and take the square root.
• $\hat{\sigma}_{\bar{D}}$: estimated standard error of the mean for the repeated measures test. To calculate, divide the estimated standard deviation (see above) by the square root of the number of pairs of scores.
• Finding the correct df for a repeated measures t-test: number of pairs minus 1.

CONFIDENCE INTERVAL FOR A REPEATED MEASURES T-TEST:

$\bar{D} \pm t_{conf}(\hat{\sigma}_{\bar{D}})$

• From solving the t-test, you already have most of the information for the confidence interval formula. You found $\bar{D}$, the mean of the differences in pairs of scores, and $\hat{\sigma}_{\bar{D}}$, the estimated standard error of the mean. To find $t_{conf}$, follow the guidelines from the independent samples t-test above, **but** remember that the degrees of freedom will be different for these 2 tests.

T-TEST OF A PEARSON'S R SCORE:

$t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}}$

• r: the Pearson's r that is either given to you or that you have calculated
• n: number of sample pairs
• degrees of freedom = n − 2
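A minimal Python sketch (invented placebo/drug scores) of the repeated-measures formulas above, plus the t-test of a Pearson's r score with an assumed r and n:

```python
import math

placebo = [5, 6, 4, 7, 5]
drug    = [3, 5, 2, 4, 4]

diffs = [a - b for a, b in zip(placebo, drug)]  # D_i, one per pair
n_d = len(diffs)                                # number of pairs
D_bar = sum(diffs) / n_d                        # mean difference
s_D = math.sqrt(sum((d - D_bar) ** 2 for d in diffs) / (n_d - 1))  # formula 2
se_D = s_D / math.sqrt(n_d)                                        # formula 3
mu_D = 0.0                                      # null: no mean difference
t_rep = (D_bar - mu_D) / se_D                   # formula 1, df = n_d - 1

# t-test of a Pearson's r score (r assumed already computed), df = n - 2
r, n = 0.6, 20
t_r = r / math.sqrt((1 - r ** 2) / (n - 2))

print(t_rep, t_r)
```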
ONE-FACTOR ANOVA:

$F = \frac{MS_{between}}{MS_{within}}$

$MS_{between} = \frac{SS_{between}}{df_{between}}$    $MS_{within} = \frac{SS_{within}}{df_{within}}$

• $df_{between}$: to find the degrees of freedom between, take the number of groups minus 1. So if you were testing college students to see which class year had the most hours of homework per week, you would have 4 groups (freshman, sophomore, junior, senior), and 3 degrees of freedom between.
• $df_{within}$: to find the degrees of freedom within, rely on the fact that the degrees of freedom between + the degrees of freedom within = the total degrees of freedom. To find the total degrees of freedom, take the total number of observations minus 1. Continuing the college student example from above, let's say you asked 4 people from each class year about their homework. Then you would have 16 total observations (4 × 4), so the total degrees of freedom = 15. To find the degrees of freedom within, do 15 − 3 = 12 (df total − df between = df within).
• A worked sketch of these pieces appears at the end of this reference.

ETA SQUARED FOR ONE-FACTOR ANOVA:

$\eta^2 = \frac{SS_{between}}{SS_{total}}$

• $\eta^2$: "eta squared," the way to measure effect size for a one-factor ANOVA test (like how we use Cohen's d to measure the effect size of z and t tests).
• Eta squared will be between 0 and 1. The closer to 1, the bigger the effect.
• Guidelines for interpreting eta squared:
  o 0.01: small effect
  o 0.09: medium effect
  o 0.25: large effect

TUKEY'S HONESTLY SIGNIFICANT DIFFERENCE (HSD) TEST:

$HSD = q\sqrt{\frac{MS_{within}}{n}}$

• HSD: the value that you find for HSD is like a critical value for judging which pair or pairs of means are different, after performing an ANOVA test.
• $MS_{within}$: you already found this while calculating F.
• n: the sample size of each group
• q: find this value in an HSD table, using the degrees of freedom within (which you found during the F calculation) and the number of means (also known as the number of groups).
• If the difference between any pair of group means that you tested with the ANOVA is greater than the value you find for the HSD, you can conclude that the difference between those 2 groups is significant. For example, assume you derive an HSD = 6.5 and have the following group means for some test you administered:
  o Group 1: 4
  o Group 2: 6
  o Group 3: 12
  o Take the difference between each possible pair of means: the difference between Group 1 vs. Group 2 is 2, the difference between Group 2 vs. Group 3 is 6, and the difference between Group 1 vs. Group 3 is 8.
  o With your HSD = 6.5, only the difference in group means between Group 1 vs. Group 3 is honestly significantly different, because the group difference of 8 exceeds the HSD of 6.5.

ONE-FACTOR REPEATED MEASURES ANOVA:

$F = \frac{MS_{between}}{MS_{error}}$

$MS_{between} = \frac{SS_{between}}{df_{between}}$    $MS_{error} = \frac{SS_{error}}{df_{error}}$

• $df_{between}$: to find the degrees of freedom between, take the number of groups minus 1, exactly as in the one-factor ANOVA above.
• $df_{error}$: the easiest way to find this is to work backward. df error is one sub-component of df within; the other sub-component is df subjects. We know from the one-factor ANOVA above that df total = df within + df between. To find the total degrees of freedom, take the total number of observations minus 1. You can then subtract df between from df total to get df within. Then, to find df subjects (as a part of df within), take the total number of subjects minus 1. Once you find df subjects, you can do the following: df within − df subjects = df error.

PARTIAL ETA SQUARED FOR REPEATED MEASURES ANOVA:

$\eta_p^2 = \frac{SS_{between}}{SS_{between} + SS_{error}}$

• See above for how to interpret eta squared (the same standards apply as for eta squared for the one-factor ANOVA).
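Finally, a minimal Python sketch (toy data of my own, with a placeholder q value where you would normally consult an HSD table) of the one-factor ANOVA pieces above: the sums of squares, degrees of freedom, F, eta squared, and the HSD cutoff:

```python
# Three groups of 4 observations each (invented for illustration)
groups = [[4, 5, 3, 4], [6, 7, 5, 6], [12, 11, 13, 12]]

all_obs = [x for g in groups for x in g]
grand_mean = sum(all_obs) / len(all_obs)
group_means = [sum(g) / len(g) for g in groups]

# Sums of squares between and within groups
ss_between = sum(len(g) * (m - grand_mean) ** 2
                 for g, m in zip(groups, group_means))
ss_within = sum((x - m) ** 2
                for g, m in zip(groups, group_means) for x in g)
ss_total = ss_between + ss_within

df_between = len(groups) - 1               # number of groups - 1
df_within = len(all_obs) - 1 - df_between  # df total - df between

F = (ss_between / df_between) / (ss_within / df_within)
eta_sq = ss_between / ss_total             # effect size

q = 3.95          # placeholder: look this up in an HSD table for your df/groups
n_per_group = 4
hsd = q * ((ss_within / df_within) / n_per_group) ** 0.5

print(F, eta_sq, hsd)
```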