A quick reference for symbols and formulas covered in COGS14:
MEAN OF SAMPLE:
\bar{x} = \frac{\sum x_i}{n}

• \bar{x}: "x bar" Mean (i.e. average) of a sample
• ∑: "Capital Sigma" Sum of everything that comes after it
• x_i: "x sub i" This stands for each individual value you have in your sample. For example, when you're finding the mean of the values 3, 4, and 5, you substitute 3 into the x_i spot, then 4, then 5, and then add these together.
• n: the number of observations in your sample; for the above example of finding the mean of 3, 4, and 5, n = 3 observations.
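As a quick check of this formula, here is a minimal Python sketch (the function name sample_mean is just illustrative, not course notation):

    def sample_mean(values):
        # x-bar = (sum of the x_i values) / n
        return sum(values) / len(values)

    print(sample_mean([3, 4, 5]))   # prints 4.0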
MEAN OF POPULATION:
\mu = \frac{\sum x_i}{N}

• \mu: "mu" Mean of a population
• N: the total number of members of the population
• Notice that this equation is very similar to the one for the mean of a sample; the only difference is that here you have observed the ENTIRE population (which is rare in real life).
ESTIMATED POPULATION VARIANCE/VARIANCE OF A SAMPLE:
s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}

• s^2: "s squared" the term for the variance of a sample, also known as the estimated variance of a population
• ∑: "Capital Sigma" Sum of everything that comes after it
• x_i: "x sub i" This stands for each individual value you have in your sample. For example, when you're finding the variance within the sample of values 3, 4, and 5, you substitute 3 into the x_i spot, subtract the mean from 3, and then square this value. Repeat this step for as many values of x_i as you have, then add those results together.
• n - 1: the number of observations in your sample minus 1; for the example of observations equaling 3, 4, and 5, n = 3, so n - 1 = 2.
ESTIMATED POPULATION STANDARD DEVIATION/STANDARD
DEVIATION OF A SAMPLE:
s = \sqrt{s^2}

• s: the standard deviation of a sample, also known as the estimated standard deviation of the population
• See above for how to calculate s^2, then take the square root of your answer to find the standard deviation.
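A minimal Python sketch of s^2 and s for the 3, 4, 5 example; Python's statistics module also divides by n - 1, so it can serve as a cross-check:

    import statistics

    data = [3, 4, 5]
    n = len(data)
    xbar = sum(data) / n                                  # sample mean = 4.0
    s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)     # sample variance = 1.0
    s = s2 ** 0.5                                         # sample standard deviation = 1.0

    assert s2 == statistics.variance(data)                # statistics divides by n - 1 too
    assert s == statistics.stdev(data)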
POPULATION VARIANCE:
\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}

• Note that this equation is very similar to the equation for estimated population variance above. The difference is that you divide by N, the total number of members of your population, to find population variance, whereas you divide by n - 1 to find the ESTIMATED population variance.
• \sigma^2: "sigma squared" the term used for population variance
• ∑: "Capital Sigma" Sum of everything that comes after it
• x_i: "x sub i" This stands for each individual value in your population. For example, when you're finding the variance of the values 3, 4, and 5, you substitute 3 into the x_i spot, subtract the mean from 3, and then square this value. Repeat this step for as many values of x_i as you have, then add those results together.
• N: the number of members (observations) of your population
**This equation will only be used when you can observe the ENTIRE population, which is commonly not feasible in real life. But you should understand how to find population variance, and how it relates to (and differs from) ESTIMATED population variance.
POPULATION STANDARD DEVIATION:
\sigma = \sqrt{\sigma^2}

• \sigma: "sigma" the term for population standard deviation
• See above for how to calculate sigma squared, then take the square root of your answer to find sigma.
ENTROPY:
H = -\sum f(x_i) \log_2(f(x_i))

• H: the symbol used to denote entropy
• ∑: "Capital Sigma" Sum of everything that comes after it
• f(x_i): the relative frequency of an outcome occurring. For example, if you flip a coin 10 times and 4 times it comes up heads, the relative frequency of heads = 0.4.
• For each outcome, figure out the relative frequency, then find log_2 of that frequency, and then multiply that value by the relative frequency itself. Once you have done this for each outcome, add all your answers together and take the negative of the sum to find entropy.
MAXIMUM POSSIBLE ENTROPY:
H_{max} = -\log_2(1/k) = \log_2(k)

• k: the number of possible outcomes. For example, with a coin toss there are 2 possible outcomes; with a die roll there are 6.
RELATIVE ENTROPY:
J = \frac{H}{H_{max}}

• J: relative entropy, the ratio of the observed entropy to the maximum possible entropy. A value close to 1 indicates entropy near the maximum possible; a value close to 0 indicates entropy near the minimum possible.
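A minimal Python sketch of H, H_max, and J, using the coin example above (relative frequencies 0.4 and 0.6):

    import math

    freqs = [0.4, 0.6]                               # relative frequency of each outcome
    H = -sum(f * math.log2(f) for f in freqs)        # entropy, about 0.971 bits
    H_max = math.log2(len(freqs))                    # log2(k) with k = 2 outcomes = 1.0
    J = H / H_max                                    # relative entropy, about 0.971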
EXPECTED VALUE OF A RANDOM VARIABLE:
E(X) = \sum P(X = x_i) \, x_i

• E(X): notation for "expected value"
• ∑: "Capital Sigma" Sum of everything that comes after it
• P: probability
• x_i: "x sub i" This again stands for each possible observed value. For example, suppose you are trying to find the expected value for a die that has 5 sides showing "1" and 1 side showing "0"; then 1 and 0 are the values you plug in for x_i. You would first figure out the probability of rolling a 1 (P = 5/6) and then multiply that P by the value 1. Then repeat with the probability of rolling a 0 (P = 1/6) times the value 0, add these results together, and you have your expected value: E(X) = 5/6.
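A one-line Python version of the die example above (values paired with their probabilities):

    outcomes = [(1, 5/6), (0, 1/6)]                  # (value, probability) pairs
    EX = sum(p * x for x, p in outcomes)             # E(X) = 5/6, about 0.833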
VARIANCE OF A RANDOM VARIABLE:
Var(X) = \sum P(X = x_i)(x_i - E(X))^2

• Var(X): notation for "variance of a random variable"
• ∑: "Capital Sigma" Sum of everything that comes after it
• E(X): see above; this means "expected value of a random variable." So to find the variance of a random variable, you will first need to find the expected value.
• x_i: "x sub i" This stands for each possible observed value. To find the variance, plug in each possible value for x_i, subtract the expected value from this observed value, and square this answer. Then multiply this answer by the probability of getting that observed value.
• For example, assume we roll a fair die and want to know the variance of the random variable. We find that the expected value = 3.5. For each possible value of the die (1, 2, 3, 4, 5, 6) we plug the value in for x_i, subtract the expected value of 3.5, square the answer, and then multiply it by the probability of rolling that value (in this case each number has a 1/6 chance of being rolled). Calculate this for all 6 numbers, and sum those components together to find the variance.
STANDARD DEVIATION OF A RANDOM VARIABLE:
Std(X) = \sqrt{Var(X)}

• Std(X): standard deviation of a random variable
• Once you compute the variance as in the above example, take its square root to get the standard deviation of the random variable.
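A minimal Python sketch of Var(X) and Std(X) for the fair-die example above:

    EX = sum((1/6) * x for x in range(1, 7))                    # E(X) = 3.5
    VarX = sum((1/6) * (x - EX) ** 2 for x in range(1, 7))      # Var(X), about 2.917
    StdX = VarX ** 0.5                                          # Std(X), about 1.708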
BINOMIAL DISTRIBUTION:
P(k \mid n, p) = \binom{n}{k} p^k (1-p)^{n-k}

EXPANDED TO:

P(k \mid n, p) = \frac{n!}{k!(n-k)!} \, p^k (1-p)^{n-k}

• k: the number of "successful" outcomes. You define what you think a success is; it could be something like getting heads on a coin flip.
• n: the number of trials. When you are doing a binomial equation, this might be listed as the number of times you flip the coin, reach into a bag, etc.
• p: the probability of getting a successful outcome. If you are flipping a coin and have defined success as getting heads, then p = the probability of getting a head when you flip the coin.
• P(k | n, p): "the probability of getting k successes, given n trials and a probability p of success"
• \binom{n}{k}: "n choose k" the number of ways of getting k successes out of n trials (see below to calculate)
• \frac{n!}{k!(n-k)!}: the expansion of "n choose k". n! means "n factorial", which means you take n and multiply it by all whole numbers smaller than n. For example, to find 4!, you multiply 4 x 3 x 2 x 1.
• The rest of the equation is just plugging in values to figure out the correct probability of getting k successes across n trials, given that you have a probability p of success on any given trial.
**Define n, k, and p before you start the problem. It might help to write them next to the binomial equation and then just go back and plug them in where needed.
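A short Python sketch of the expanded formula, evaluated for a made-up example (the probability of k = 3 heads in n = 10 fair coin flips):

    from math import comb

    def binomial_pmf(k, n, p):
        # "n choose k" * p^k * (1 - p)^(n - k)
        return comb(n, k) * p ** k * (1 - p) ** (n - k)

    print(binomial_pmf(3, 10, 0.5))   # about 0.117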
THE SAMPLING DISTRIBUTION OF THE MEAN:
\mu_{\bar{x}} = E(\bar{X}) = E(X) = \mu_x

• The concept of the sampling distribution of the mean is a very helpful and crucial concept for statistics. In short, the sampling distribution of the mean is a hypothetical distribution that represents what you would get if you took infinite samples of size n, took the mean of each of those samples, and then graphed those means. Some things we know about the sampling distribution of the mean are:
o For a large enough n (25-100), the sampling distribution of the mean will be normally distributed
o The mean of the sampling distribution of the mean = the mean of the population
• \mu_{\bar{x}}: mean of the sampling distribution of the mean
• \mu_x: mean of the population
• E(\bar{X}): expected value of the sampling distribution of the mean
• E(X): expected value of the population

BUT:

\sigma_{\bar{x}} = \sqrt{\frac{Var(X)}{n}} = \frac{\sigma_x}{\sqrt{n}}

• \sigma_{\bar{x}}: standard deviation of the sampling distribution of the mean (also called the standard error of the mean)
• Var(X): variance of the population
• \sigma_x: standard deviation of the population
• n: number of observations in each sample
• SO, we know that the standard deviation of the sampling distribution of the mean will always be smaller than the standard deviation of the population by a specific factor: it equals the population standard deviation divided by the square root of the number of observations in a sample.
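A small simulation sketch (standard library only) that illustrates this: the standard deviation of many sample means comes out close to sigma divided by the square root of n. The population parameters and seed here are arbitrary choices for illustration:

    import random, statistics

    random.seed(1)
    n = 25
    means = [statistics.mean(random.gauss(100, 15) for _ in range(n))
             for _ in range(10_000)]

    print(statistics.stdev(means))    # close to 15 / sqrt(25) = 3.0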
COHEN’S D:
d = \frac{\bar{x} - \mu}{\sigma}

• \bar{x}: mean of your sample
• \mu: mean of the null hypothesis
• \sigma: standard deviation of the null hypothesis
• Cohen's d is a measure of effect size, or how large an effect your sample had in comparison to the null hypothesis.
• Guidelines: d = 0.20 (small effect), d = 0.50 (medium effect), d = 0.80 (large effect)
OBSERVED Z-SCORE:
z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}

EXPANDED TO:

z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}

• \bar{x}: mean of your sample
• \mu: mean of the null hypothesis
• \sigma_{\bar{x}}: standard error of the mean (i.e. the standard deviation of the population divided by the square root of the number of observations)
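A worked z-score in Python with made-up numbers (sample mean 105, null mean 100, population SD 15, n = 25); the last line also computes Cohen's d from the section above:

    import math

    xbar, mu, sigma, n = 105, 100, 15, 25
    sem = sigma / math.sqrt(n)        # standard error of the mean = 3.0
    z = (xbar - mu) / sem             # observed z, about 1.67
    d = (xbar - mu) / sigma           # Cohen's d, about 0.33 (small-to-medium effect)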
CONFIDENCE INTERVALS (FOR A Z-TEST):
\bar{x} \pm (z_{conf}) \sigma_{\bar{x}}

• \bar{x}: observed mean of your sample
• z_{conf}: the critical z-scores for your level of confidence. For purposes of this class, think of these like the critical z-scores for a two-tailed z-test. If you have a 95% confidence interval, you will have the same "z conf" as you would have for a 2-tailed z-test with an alpha level of 0.05. To find your "z conf", subtract your level of confidence from 100 (i.e. 100 - 95% confidence = 5). Divide this 5% by 2 to get 2.5% or 0.025, find 0.025 in the "C" column of the z-table, then find the corresponding z-score in the "A" column.
• \sigma_{\bar{x}}: standard error of the mean (i.e. the standard deviation of the population divided by the square root of the number of observations)
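Continuing the made-up numbers from the z-score sketch above, the 95% confidence interval works out as:

    z_conf = 1.96                     # from the z-table (scipy.stats.norm.ppf(0.975) gives the same value)
    sem = 3.0                         # standard error from the previous sketch
    lower = 105 - z_conf * sem        # about 99.12
    upper = 105 + z_conf * sem        # about 110.88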
ONE SAMPLE T-TEST (3 related formulas):
1) t = \frac{\bar{x} - \mu}{s_{\bar{x}}}

2) s_{\bar{x}} = \hat{\sigma}_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{\hat{\sigma}}{\sqrt{n}}

3) s = \hat{\sigma} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}

• \bar{x}: your sample mean
• x: each individual observation in your sample
• n - 1: the number of observations in your sample, minus 1
• \mu: the population mean (usually what you are comparing your sample mean to, to see if there is a difference)
• s_{\bar{x}} / \hat{\sigma}_{\bar{x}}: the estimated standard error of the mean. Note this is also represented as the Greek letter sigma with a "hat" (\hat{\sigma}, "sigma hat"); the hat indicates it's an estimate.
• s = \hat{\sigma}: the estimated standard deviation of the population
• ∑: "Capital Sigma" Sum of everything that comes after it
• To estimate the population standard deviation, we need to find s (or \hat{\sigma}), which we find in a similar way to how we always calculate standard deviation. Take each individual score (x) and subtract the mean (\bar{x}). Square that value. Repeat for each individual score and then add up what you get. Then divide that value by the number of observations minus 1 (n - 1), and finally take the square root of your answer to find s (or \hat{\sigma}).
**Note that you will use n at least 2 times in the t-score formula: once to find the estimated standard deviation (formula #3 above) and again when finding the estimated standard error of the mean (formula #2). You will also need n to find your critical t-score on your t-score chart: your degrees of freedom (df) equals the number of observations minus 1 for a one-sample t-test (so df = n - 1 for this test).
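A sketch of the three formulas in Python on made-up data; the commented lines show scipy's built-in version as an optional cross-check:

    import math

    data = [101, 107, 99, 112, 104, 108]    # made-up sample
    mu = 100                                # population mean under the null
    n = len(data)
    xbar = sum(data) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))   # formula 3
    sem = s / math.sqrt(n)                                        # formula 2
    t = (xbar - mu) / sem                                         # formula 1; df = n - 1 = 5

    # Optional cross-check if scipy is installed:
    # from scipy.stats import ttest_1samp
    # print(ttest_1samp(data, mu))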
CONFIDENCE INTERVAL FOR A ONE-SAMPLE T-TEST:
\bar{x} \pm t_{conf}(s_{\bar{x}})

• \bar{x}: observed mean of your sample
• t_{conf}: the critical t-scores for your level of confidence. For purposes of this class, think of these like the critical t-scores for a two-tailed t-test. If you have a 95% confidence interval, you will have the same "t conf" as you would have for a 2-tailed t-test with an alpha level of 0.05. To find your "t conf", subtract your level of confidence from 100 (i.e. 100 - 95% confidence = 5). Go to the 2-tailed side of the t-table, find the column for 0.05, and go down to your df to find the correct "t conf".
• s_{\bar{x}}: estimated standard error of the mean (see above to calculate)
INDEPENDENT SAMPLES T-TEST (3 related formulas):
1) t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_{\bar{x}_1 - \bar{x}_2}}

2) s_{\bar{x}_1 - \bar{x}_2} = \hat{\sigma}_{\bar{x}_1 - \bar{x}_2} = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}

3) s_p = \hat{\sigma}_p = \sqrt{\frac{(n_1 - 1)\hat{\sigma}_1^2 + (n_2 - 1)\hat{\sigma}_2^2}{n_1 + n_2 - 2}}

• \bar{x}_1: the observed mean of one of your groups
• \bar{x}_2: the observed mean of the other group
• \mu_1: the hypothetical mean of population 1
• \mu_2: the hypothetical mean of population 2
**Note that \mu_1 - \mu_2 will usually be set to 0. This is because your null hypothesis is usually that there is no difference between the two populations you are trying to compare.
• s_{\bar{x}_1 - \bar{x}_2} / \hat{\sigma}_{\bar{x}_1 - \bar{x}_2}: pooled estimate of the standard error of the mean. This is what goes in the denominator of your independent samples t-test formula, and is computed by multiplying the pooled standard deviation (see next step to calculate) by the square root of (1/number of samples in group 1) + (1/number of samples in group 2).
• n_1: number of samples in group 1
• n_2: number of samples in group 2
• s_p = \hat{\sigma}_p: pooled standard deviation. This is computed by entering the correct number of subjects in each group for n_1 and n_2, respectively (see above), and by entering the correct values for \hat{\sigma}_1^2 and \hat{\sigma}_2^2 (see below).
• \hat{\sigma}_1^2: estimated variance of group 1. To calculate, take each individual score (x) and subtract the group mean (\bar{x}_1). Square that value. Repeat for each individual score and then add up what you get. Then divide that value by the number of observations minus 1 (n_1 - 1). Repeat these steps with the data from group 2 to get the estimated variance of group 2, \hat{\sigma}_2^2.
• Finding the correct df for the independent samples t-test: n_1 + n_2 - 2, i.e. the total number of samples in group 1, plus the total number of samples in group 2, minus 2.
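A sketch of the three formulas in Python with two made-up groups; the commented lines show scipy's pooled-variance test as an optional cross-check:

    import math

    g1 = [12, 15, 11, 14]                              # made-up group 1
    g2 = [9, 10, 12, 8]                                # made-up group 2
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2

    var1 = sum((x - m1) ** 2 for x in g1) / (n1 - 1)   # estimated variance of group 1
    var2 = sum((x - m2) ** 2 for x in g2) / (n2 - 1)   # estimated variance of group 2
    sp = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))   # formula 3
    se = sp * math.sqrt(1 / n1 + 1 / n2)               # formula 2
    t = ((m1 - m2) - 0) / se                           # formula 1; df = n1 + n2 - 2 = 6

    # Optional cross-check if scipy is installed (equal_var=True, the pooled default):
    # from scipy.stats import ttest_ind
    # print(ttest_ind(g1, g2))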
COHEN’S D FOR INDEPENDENT SAMPLES T-TEST:
d = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p}

• To find Cohen's d for an independent samples t-test, subtract the mean of group 2 from the mean of group 1. Then subtract the difference between the hypothetical population means, \mu_1 - \mu_2 (this step will usually be 0; see the explanation above). Then divide by the pooled standard deviation, which you found above. Remember to take the absolute value of your answer.
CONFIDENCE INTERVAL FOR THE INDEPENDENT SAMPLES T-TEST:
(\bar{x}_1 - \bar{x}_2) \pm t_{conf}(s_{\bar{x}_1 - \bar{x}_2})

• You've already calculated most of the important parts of this formula. You found (\bar{x}_1 - \bar{x}_2) earlier as the numerator of your t-score formula, and (s_{\bar{x}_1 - \bar{x}_2}) as the denominator of the t-score formula.
• t_{conf}: the critical t-scores for your level of confidence. For purposes of this class, think of these like the critical t-scores for a two-tailed t-test. If you have a 95% confidence interval, you will have the same "t conf" as you would have for a 2-tailed t-test with an alpha level of 0.05. To find your "t conf", subtract your level of confidence from 100 (i.e. 100 - 95% confidence = 5). Go to the 2-tailed side of the t-table, find the column for 0.05, and go down to your df to find the correct "t conf".
REPEATED MEASURES T-TEST (3 related formulas):
1) t = \frac{\bar{D} - \mu_D}{\hat{\sigma}_{\bar{D}}}

2) \hat{\sigma}_D = \sqrt{\frac{\sum (D_i - \bar{D})^2}{n_D - 1}}

3) \hat{\sigma}_{\bar{D}} = \frac{\hat{\sigma}_D}{\sqrt{\#pairs}}

• \bar{D}: mean (average) of the differences of each pair of observations. Since repeated measures tests use the same subjects across 2 conditions, we are really only interested in the difference in performance from one condition vs. another. To find the average of the differences, take the score from one condition minus the score from the other condition. Repeat this for all pairs of observations, add the differences up, and divide by the number of pairs you have. So if you had 10 people who each participated in 2 conditions (such as placebo vs. a drug trial), you will have 10 pairs of scores.
• n_D: number of pairs of scores
• \mu_D: hypothesized mean difference between the 2 conditions. **Like the independent samples t-test, this will usually be 0.
• D_i: difference score for an individual. Subtract the person's score in one condition from their score in the other condition. For example, if someone got a score of 5 in the placebo condition and 0 in the drug condition, their difference score is 5 (5 - 0).
• \hat{\sigma}_D: estimated standard deviation for the repeated measures test. This is necessary for the estimated standard error of the mean, which is then necessary for the t-score. To calculate, subtract the mean of the differences (\bar{D}) from each individual difference score. Square each of these values, then add them together. Then divide by the number of pairs of scores minus 1, and take the square root.
• \hat{\sigma}_{\bar{D}}: estimated standard error of the mean for the repeated measures test. To calculate, divide the estimated standard deviation (see above) by the square root of the number of pairs of scores.
• Finding the correct df for a repeated measures t-test: number of pairs minus 1.
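A sketch of the three formulas in Python, using made-up placebo/drug scores for 5 subjects; scipy's paired test is an optional cross-check:

    import math

    placebo = [5, 7, 4, 6, 8]                          # made-up condition 1 scores
    drug = [3, 4, 4, 2, 5]                             # made-up condition 2 scores
    diffs = [a - b for a, b in zip(placebo, drug)]     # difference scores D_i
    nD = len(diffs)
    Dbar = sum(diffs) / nD                             # mean of the differences
    sD = math.sqrt(sum((d - Dbar) ** 2 for d in diffs) / (nD - 1))   # formula 2
    seD = sD / math.sqrt(nD)                           # formula 3
    t = (Dbar - 0) / seD                               # formula 1; df = nD - 1 = 4

    # Optional cross-check if scipy is installed:
    # from scipy.stats import ttest_rel
    # print(ttest_rel(placebo, drug))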
CONFIDENCE INTERVAL FOR A REPEATED MEASURES T-TEST:
\bar{D} \pm t_{conf}(\hat{\sigma}_{\bar{D}})

• From solving the t-test, you already have most of the information for the confidence interval formula. You found \bar{D}, the mean of the differences in pairs of scores, and \hat{\sigma}_{\bar{D}}, the estimated standard error of the mean. To find t_{conf}, follow the guidelines from the independent samples t-test above, **but** remember that the degrees of freedom will be different for these 2 tests.
T-TEST OF A PEARSON'S R SCORE:

t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}}

• r: Pearson's r, either given to you or calculated by you
• n: number of sample pairs
• degrees of freedom = n - 2
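A quick Python check of this formula, assuming a made-up r of 0.6 from n = 20 pairs:

    import math

    r, n = 0.6, 20
    t = r / math.sqrt((1 - r ** 2) / (n - 2))   # about 3.18, with df = n - 2 = 18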
ONE-FACTOR ANOVA:
F = \frac{MS_{between}}{MS_{within}}

MS_{between} = \frac{SS_{between}}{df_{between}} \qquad MS_{within} = \frac{SS_{within}}{df_{within}}

• df_{between}: to find the degrees of freedom between, take the number of groups minus 1. So if you were testing college students to see which class year had the most hours of homework per week, you would have 4 groups (freshman, sophomore, junior, senior), and 3 degrees of freedom between.
• df_{within}: to find the degrees of freedom within, rely on the fact that degrees of freedom between + degrees of freedom within = total degrees of freedom. To find the total degrees of freedom, take the total number of observations minus 1. Continuing the college student example from above, let's say you asked 4 people from each class year about their homework. Then you would have 16 total observations (4 * 4), so the total degrees of freedom = 15. To find the degrees of freedom within, do 15 - 3 (df total - df between = df within), giving 12.
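If scipy is available, the F for the (made-up) homework-hours example can be computed directly; the df values match the hand calculation above:

    from scipy.stats import f_oneway

    freshman  = [10, 12, 9, 11]       # made-up hours of homework per week
    sophomore = [14, 13, 15, 12]
    junior    = [16, 15, 17, 18]
    senior    = [20, 18, 19, 21]

    F, p = f_oneway(freshman, sophomore, junior, senior)
    # df between = 4 - 1 = 3; df within = (16 - 1) - 3 = 12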
ETA SQUARED FOR ONE-FACTOR ANOVA:
\eta^2 = \frac{SS_{between}}{SS_{total}}

• \eta^2: "eta squared" the way to measure effect size for a one-factor ANOVA test (like how we use Cohen's d to measure effect size of z- and t-tests)
• Eta squared will be between 0 and 1. The closer to 1, the bigger the effect.
• Guidelines for interpreting eta squared:
o 0.01: small effect
o 0.09: medium effect
o 0.25: large effect
TUKEY’S HONESTLY SIGNIFICANT DIFFERENCE (HSD) TEST:
HSD = q \sqrt{\frac{MS_{within}}{n}}

• HSD: the value that you find for HSD is like a critical value for judging which pair or pairs of means are different, after performing an ANOVA test.
• MS_{within}: you already found this while calculating F.
• n: the sample size of each group
• q: find this value in an HSD table, using the degrees of freedom within (which you found during the F calculation) and the number of means (also known as the number of groups)
• If the difference between a pair of means for 2 of the groups you tested with the ANOVA is greater than the value you find for HSD, you can conclude that the difference between those 2 groups is significant.
• For example, assume you derive an HSD = 6.5. You have the following group means for some test you administered:
o Group 1: 4
o Group 2: 6
o Group 3: 12
o Take the difference between each possible pair of means. For example, the difference in means between Group 1 vs. Group 2 is 2, the difference between Group 2 vs. Group 3 is 6, and the difference between Group 1 vs. Group 3 is 8.
o With your HSD = 6.5, only the difference in group means between Group 1 vs. Group 3 is honestly significantly different, because the group difference of 8 exceeds the HSD of 6.5.
ONE-FACTOR REPEATED MEASURES ANOVA:
F = \frac{MS_{between}}{MS_{error}}

MS_{between} = \frac{SS_{between}}{df_{between}} \qquad MS_{error} = \frac{SS_{error}}{df_{error}}

• df_{between}: to find the degrees of freedom between, take the number of groups minus 1. So if you were testing college students to see which class year had the most hours of homework per week, you would have 4 groups (freshman, sophomore, junior, senior), and 3 degrees of freedom between.
• df_{error}: the easiest way to find this is to work backward. First, df error is one sub-component of df within; the other sub-component is df subjects. We know from the one-factor ANOVA above that df total = df within + df between. To find the total degrees of freedom, take the total number of observations minus 1. You can then subtract the df between from the df total to get the df within. Then, to find the df subjects (as a part of df within), take the total number of subjects minus 1. Once you find the df subjects, you can do the following: df within - df subjects = df error.
PARTIAL ETA SQUARED FOR REPEATED MEASURES ANOVA:
\eta_p^2 = \frac{SS_{between}}{SS_{between} + SS_{error}}

o See above for how to interpret eta squared (the same standards apply as for eta squared in the one-factor ANOVA).