An Introduction to Statistical Inference
Glossary
2×2 table
A two-way table where the explanatory and response variables each have two categories ........ 5-6
2SD Method
Approximating a confidence interval by taking the statistic and the standard deviation of
the statistic (from simulation or formula) and extending two standard deviations in each
direction from the statistic. ........................................................................................................ 2-18
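The 2SD method can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's own code: the function name and the example numbers are hypothetical, and the resulting interval is only an approximate (roughly 95%) interval when the distribution of the statistic is about bell-shaped.

```python
def two_sd_interval(statistic, sd_of_statistic):
    """Return (lower, upper) by extending two SDs in each direction."""
    margin = 2 * sd_of_statistic  # "two standard deviations in each direction"
    return statistic - margin, statistic + margin


# Hypothetical example: sample proportion 0.60, SD of the statistic 0.05
# (the SD could come from a simulation or from a formula).
lower, upper = two_sd_interval(0.60, 0.05)
print(round(lower, 2), round(upper, 2))
```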
3S Strategy
A framework for evaluating the strength of evidence against the chance model (null
hypothesis). The 3 S’s are Statistic, Simulate, and Strength of Evidence .......................... 1-8, 1-16
90% confidence interval for π
An interval of values of π that would not be rejected by a two-sided test of significance
with a .10 level of significance. .................................................................................................... 2-10
alternative hypothesis
The “not by chance” or “there is an effect” explanation, often the research conjecture. ........ 1-18, 1-25
anecdotal evidence
Evidence for a conclusion that consists of only one or a few observations that may not be
a good representation of the larger situation ................................................................................ 0-1
ANOVA test
Analysis of variance test; an overall test of multiple means that compares the variation
between groups to the variation within groups ......................................................... 9-19
association
Two variables are associated or related if the distribution of the response variable differs
across the values of the explanatory variable ....................................................................... 4-3, 4-5
biased
A sampling method is biased if the results from different samples consistently
overestimate or consistently underestimate the population parameter of interest. ..... 3-1, 3-3, 3-12
binary variable
Categorical variable with only two outcomes ................................................................... 1-17, 1-18
categorical variable (alt. Qualitative)
A variable that places each observational unit into a category, and arithmetic operations
(e.g., adding, subtracting) don’t make sense ..................................................................... 0-12, 0-17
causation
In randomized experiments, we can potentially conclude that the explanatory variable is causing
the effect seen in the response variable ............................................................................... 4-3, 6-23
cell contributions
Contribution of a cell in the two-way table to the chi-square statistic. Helpful in determining
where large differences occur between the observed data and what would be expected if the null
hypothesis were true .................................................................................................................. 8-19
cells
Entries in two-way tables ............................................................................................................. 5-3
census
A study in which data are gathered on all individuals in the population ..................................... 3-2
center
A typical value in the data. Usually calculated as the average or the median .................. 0-14, 0-19
chi-square distribution
A non-negative, right-skewed distribution used in theory-based test for an association
between two categorical variables ............................................................................................. 8-19
chi-square statistic
Theory-based test statistic used to evaluate the strength of evidence for an association
between two categorical variables with multiple categories...................................................... 8-18
coefficient of determination
The percentage of the variability in the response variable that is explained by the least-squares
regression on the explanatory variable. The coefficient of determination is equal
to the square of the correlation coefficient. Denoted by r² or R². .......................................... 10-24
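As a small illustration of the relationship stated above, the following Python sketch computes the correlation coefficient r from its usual sums-of-products formula and squares it to obtain the coefficient of determination. The function name and the five data points are made up for illustration.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient r between two quantitative variables."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, chosen only to keep the arithmetic simple.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
r_squared = r ** 2  # coefficient of determination
print(round(r, 4), round(r_squared, 4))
```

For these made-up data, about 60% of the variability in the response is explained by the regression on the explanatory variable.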
conditional proportion
Proportion of response variable outcomes for a given category of the explanatory variable .... 5-4, 5-7
confidence intervals
An inference tool used to estimate the value of the parameter, with an associated measure
of uncertainty due to the randomness in the sample data ............................................................ 2-2
confidence level
How confident we are that the population parameter is contained in our confidence
interval. Represents the reliability of the procedure. .................................................. 0-4, 2-6, 2-10
confounding variable
A confounding variable is a variable that is related to both the explanatory and response
variable in such a way that its effects on the response variable cannot be separated from
the explanatory variable. ....................................................................................................... 4-4, 4-6
convenience samples
A non-random sample of a population......................................................................................... 3-4
correlation coefficient
Statistic that measures direction and strength of a linear relationship between two
quantitative variables ................................................................................................................. 10-5
critical value
Multiplier of the standard deviation of the statistic (e.g., z*); the product of the two makes
up the margin of error ................................................................................................................ 2-24
cumulative proportions
Proportion of occurrences of an event after a set number of trials ............................................ 0-22
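A running (cumulative) proportion can be computed as below. This is only a sketch: the function name is hypothetical, and outcomes are assumed to be coded as 1 when the event occurs and 0 otherwise.

```python
def cumulative_proportions(outcomes):
    """Proportion of trials on which the event occurred, after each trial."""
    proportions = []
    successes = 0
    for trial_number, outcome in enumerate(outcomes, start=1):
        successes += outcome  # outcome is 1 if the event occurred, else 0
        proportions.append(successes / trial_number)
    return proportions


# Four hypothetical trials: event, no event, event, event.
print(cumulative_proportions([1, 0, 1, 1]))
```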
data
Values measured or categories recorded on individual entities of interest ................. 0-5, 0-7, 0-11
data file (alt. spreadsheet; data table)
A way to organize and store the data (the measurements of each observational unit on
each variable) ............................................................................................................................. 0-11
descriptive statistics
Numbers like averages and percentages or graphs like bar charts............................................... 0-4
direction
Direction of association between two quantitative variables can be either positive (as one
increases, so does the other) or negative (as one increases, the other decreases). ..................... 10-1
distribution
The characteristics of a variable’s behavior...................................................................... 0-12, 0-18
dotplots
A basic way to graphically summarize quantitative data........................................................... 0-14
estimate
A statistic which is our best guess for the size of the tendency (or difference) in the
general population or underlying process. ................................................................................... 0-4
estimation
Using the sample statistic to create a confidence interval to estimate the parameter of
interest ........................................................................................................................................ 6-23
expected counts
Number of observational units you would expect to observe in each cell of the two-way
table if the null hypothesis of no association were true ............................................................. 8-26
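The sketch below shows one common way to compute expected counts, cell contributions, and the chi-square statistic for a two-way table. It assumes the standard formulas (expected count = row total × column total / grand total; cell contribution = (observed − expected)² / expected); the function names and the example table are hypothetical.

```python
def expected_counts(table):
    """Expected cell counts if there were no association."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def cell_contributions(table):
    """Contribution of each cell to the chi-square statistic."""
    expected = expected_counts(table)
    return [
        [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]


def chi_square_statistic(table):
    """Sum of all cell contributions."""
    return sum(sum(row) for row in cell_contributions(table))


# Hypothetical 2x2 table of observed counts (rows: groups, columns: outcomes).
observed = [[30, 10], [20, 40]]
print(expected_counts(observed))
print(round(chi_square_statistic(observed), 3))
```

Cells with the largest contributions are where the observed counts differ most from what the null hypothesis would predict.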
experimental units
What observational units are called in an experimental study ...................................................... 4-11
explanatory variable
The variable that, if the alternative hypothesis is true, is explaining changes in the
response variable; sometimes known as the independent or predictor variable ......... 4-2, 4-3, 4-5
extrapolation
Predicting values for the response variable for given values of the explanatory variable
that are outside of the range of the original data ...................................................................... 10-20
F-distribution
Theory-based approximation for the simulated null distribution of the F-statistic; it is non-negative
and skewed right ............................................................................................................... 9-18, 9-26
follow-up analysis
A second step in the analysis process that follows a significant ANOVA or chi-square
test. A follow-up test tells where significant differences between pairs of groups are
found. This is usually presented as confidence intervals for the difference in each pair of
means or each pair of proportions ..................................................................................... 8-25, 9-21
form
The form of association between two quantitative variables can be linear or can follow a
more complicated curve ............................................................................................................. 10-1
F-statistic
Ratio of variation between the groups to the variation within the groups .............. 9-17, 9-22, 9-25
generalization (alt. generalize)
To extend conclusions from a sample to a larger population or process; this is only valid
when the sample is representative of the population. ................................... 0-2, 0-9, 3-1, 3-6, 6-23
graphical summary
Summary of a distribution that is a graph .................................................................................. 0-13
H0
Denotes null hypothesis ............................................................................................................. 1-33
Ha
Denotes alternative hypothesis .................................................................................................. 1-33
independent samples
The data recorded on one sample are unrelated to those recorded on the other sample. In
other words, if the data from the samples can be rearranged without altering the structure
of the data then the samples are independent...................................................................... 7-4, 7-11
influential observations
An observation is considered influential if removing it from the data set dramatically
changes the correlation coefficient or regression line. Often have extreme x values. ... 10-6, 10-23
interval of plausible values
An interval of values that have been tested under the null and have resulted in p-values
higher than the significance level, or an interval of values that have been tested under
the null and do not put the observed data in the tail of the simulated null distribution.
These values are concluded to be plausible values for the population parameter. ...................... 2-6
logic of inference
Involves two components: significance and estimation............................................................... 0-4
MAD
A statistic testing an association between variables of more than two groups. (M)ean of
the (A)bsolute values of the (D)ifferences in the sample averages or conditional
proportions. ........................................................................................................ 8-6, 8-11, 9-5, 9-10
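A minimal Python sketch of the MAD statistic follows. It assumes that "the differences" means the differences between every pair of groups, which the glossary entry does not spell out; the function name and the group averages are hypothetical.

```python
from itertools import combinations


def mad_statistic(group_values):
    """Mean of the absolute pairwise differences between group statistics.

    group_values holds one sample average (or conditional proportion)
    per group.
    """
    diffs = [abs(a - b) for a, b in combinations(group_values, 2)]
    return sum(diffs) / len(diffs)


# Hypothetical sample averages for three groups.
group_means = [10.0, 12.0, 17.0]
print(round(mad_statistic(group_means), 4))
```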
margin-of-error
How much we expect in our statistic to vary from the parameter from the random
sampling process alone (roughly two standard deviations); half-width of confidence
interval ....................................................................................................................................... 2-15
matched-pairs design
Observations are paired and dependent, such as repeat observations on the same
individual or observations are paired naturally (e.g., identical twins) ......................................... 7-3
mean
A measure of center in a distribution, also called the average................................................... 0-14
mean squares error
Denominator of the F-statistic. Measures the within group variation. It is similar to
averaging the standard deviations across the groups being compared. ..................................... 9-19
mean squares for treatment
Numerator of the F-statistic. Measures the variation between the groups. ............................... 9-19
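The F-statistic, with its numerator (mean squares for treatment) and denominator (mean squares error), can be sketched as follows. The code assumes the usual one-way ANOVA degrees of freedom (number of groups minus 1, and total sample size minus number of groups); the function name and data are hypothetical.

```python
def f_statistic(groups):
    """Ratio of between-group variation to within-group variation."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total sample size
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation between the groups (numerator pieces).
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)
    )
    # Variation within the groups (denominator pieces).
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, means) for x in g
    )

    ms_treatment = ss_between / (k - 1)
    ms_error = ss_within / (n - k)
    return ms_treatment / ms_error


# Three hypothetical groups of three observations each.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(round(f_statistic(groups), 4))
```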
median
The 50th percentile of a distribution or the middle number in a sorted list ................................ 0-14
model
A mathematical or probabilistic conceptualization meant to closely match reality, but
always making assumptions about the reality which may or may not be true ............................ 1-4
n
A symbol used to indicate the sample size ................................................................................ 1-19
no association
General statement of the null hypothesis when two or more variables are involved. ............... 5-12
non-sampling error
Reasons why the statistic may not be close to the parameter that are separate from how
the sample was selected from the population. ........................................................................... 3-27
null distribution
Distribution of simulated statistics that represent what could have happened in the study
assuming the null hypothesis was true .............................................................................. 1-20, 1-27
null hypothesis
The “by chance alone” or “no effect” explanation; a hypothesis that can be modeled by
simulation.......................................................................................................................... 1-18, 1-25
numerical summary
Summary of a distribution that is a number ............................................................................... 0-13
observational studies
Studies in which researchers observe individuals and measure variables of interest, but do
not intervene in order to attempt to influence responses. ............................................................ 4-2
observational units
The individual entities on which data are recorded .................................................... 0-5, 0-7, 0-11
one proportion z-interval
Theory-based confidence interval for π .................................................................................... 2-24
outcomes
All possible values a variable can assume ................................................................................... 0-7
outlier
An unusual observation. A value of a variable that differs substantially from the general
pattern of the other observations in the data set ......................................................................... 0-20
outliers
Observations with large residuals; not necessarily influential .................................................. 10-6
paired design
Study design that allows for the comparison of two groups on a response variable but by
comparing two measurements on each observational unit instead of on completely
separate groups of individuals. This serves to reduce variability in the response variable. ...... 4-16
paired
Data collected on paired samples consist of two sets of observations on the response
variable that are recorded on the same set of observational units................................................ 7-4
parameter
A number calculated from the underlying process or population from which the sample
was selected .................................................................................................................. 0-6, 0-7, 3-2
p-hat
The proportion or percentage of observational units that have a particular characteristic
based on a measured variable. A statistic .................................................................................. 1-19
plausible value
A parameter value tested under the null hypothesis for which, based on the data gathered, we
do not find strong evidence against the null ................................................................................ 2-4
plausible
A term used to indicate that the chance model is a reasonable/believable explanation for
the data we observed .................................................................................................................... 1-9
population
The entire set of observational units we want to know about ............................................... 0-6, 3-1
practically significant
Large enough, based on the context, to be meaningful.............................................................. 2-15
predictor
Another word for explanatory variable, often used in correlation/regression settings .............. 10-2
probability
Long run proportion (relative frequency) of times an event would occur if the random
process were repeated over and over again under identical conditions ............................ 0-24, 0-28
process
A situation which we think of as a random selection from an underlying set of possible
outcomes ...................................................................................................................................... 3-7
p-value
The proportion of statistics in the null distribution that are at least as extreme as the value
of the statistic actually observed in the study. .................................................................. 1-21, 1-27
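A simulation-based p-value for a single proportion might be computed along these lines. This is a sketch under stated assumptions: the null value 0.5, the sample size, the observed proportion, the number of repetitions, and the choice of a one-sided "greater than" alternative are all hypothetical.

```python
import random


def simulate_null_proportions(pi_null, n, reps, seed=0):
    """Simulated sample proportions, assuming the null hypothesis is true."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [
        sum(rng.random() < pi_null for _ in range(n)) / n
        for _ in range(reps)
    ]


def one_sided_p_value(null_stats, observed):
    """Proportion of simulated statistics at least as large as observed."""
    return sum(s >= observed for s in null_stats) / len(null_stats)


# Hypothetical study: 14 successes in 20 trials, null value pi = 0.5.
null_stats = simulate_null_proportions(pi_null=0.5, n=20, reps=1000)
p_value = one_sided_p_value(null_stats, observed=14 / 20)
print(p_value)
```

A two-sided test would instead count simulated statistics that are at least as extreme as the observed one in either direction.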
quantitative variable
Measures on an observational unit for which arithmetic operations (e.g., adding,
subtracting) make sense ............................................................................................ 0-12, 0-17, 3-2
quasi-experiments
Experiments that manipulate the explanatory variable, but not randomly. ............................... 4-12
r
Symbol for correlation coefficient, values range from -1 to 1 and are unit-less. Values
close to -1 and 1 denote a strong linear relationship while values close to 0 denote a weak
or no linear relationship ............................................................................................................. 10-5
random digit dialing
A common sampling technique when a sampling frame is unavailable. It involves a
computer randomly dialing phone numbers within a certain area code by randomly
selecting the digits to be dialed after the area code. .................................................................. 3-25
random events
An event with an unknown short-term outcome, but a known long-term relative
frequency.................................................................................................................................... 0-22
random sampling
Using a probability device to select observational units from a population or process ...... 3-6, 3-30
randomized, comparative experiment
An experiment where experimental units are randomly assigned to two or more treatment
conditions and the explanatory variable is actively imposed on the subjects. ........................... 4-11
range
Distance from the largest to the smallest value in a data set ..................................................... 0-15
regression equation
Least squares regression equation, ŷ = a + bx, where a is the y-intercept, b is the slope, x represents
the explanatory variable, and ŷ (pronounced y-hat) is the predicted value for the response
variable. .................................................................................................................................... 10-19
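The least squares line ŷ = a + bx can be fit with the standard formulas b = Sxy / Sxx and a = ȳ − b·x̄. The sketch below is illustrative only: the function names and data are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (a, b): y-intercept and slope of the least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx               # slope
    a = mean_y - b * mean_x     # y-intercept
    return a, b


def predict(a, b, x):
    """Predicted response (y-hat) for explanatory value x."""
    return a + b * x


# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)
print(round(a, 2), round(b, 2), round(predict(a, b, 3), 2))
```

Calling `predict` with an x value outside the range of `xs` would be extrapolation.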
relative frequency
Long run proportion ................................................................................................................... 0-24
representative
Describes a sample with statistics similar to the parameters in the entire population.
Simple random samples are representative; convenience samples may not be
representative ........................................................................................................................ 3-1, 3-3
residuals
Same as prediction errors. The vertical distances between a point and the least squares
regression line ............................................................................................................... 10-17, 10-22
response rate
Of those selected to be in the sample, the percent that respond. ............................................... 3-26
response variable
The variable that, if the alternative hypothesis is true, is impacted by the explanatory
variable; sometimes known as the dependent variable. ............................................................... 4-2
sample size
Number of observational units ..................................................................................................... 0-7
sample
The subgroup of the population on which we record data ............................................ 0-6, 3-1, 3-2
sampling frame
A list of all of the members of the population of interest ................................................... 3-4, 3-13
sampling variability
The amount that a value changes as it is observed repeatedly ..................................................... 3-5
scatterplot
Graphical display of the relationship between two quantitative variables ................................ 10-1
scope of inference
Involves two components that depend on how the data was gathered: generalizability and
cause-and-effect ........................................................................................................................... 0-4
segmented bar graphs
Graphical display of conditional proportion from two-way table ............................................... 5-4
shape
The form of a graph of quantitative data ................................................................................... 0-14
significance level
A value used as a criterion for deciding how small a p-value needs to be to provide
convincing evidence against the null hypothesis .................................................................. 2-5, 2-9
significance
The sample results are unlikely to have arisen by chance alone................................................ 6-23
simple random sample
Selecting individuals from the sampling frame, so that each individual has the same
probability of being selected into the sample ............................................................. 3-4, 3-6, 3-13
simulate
Artificially represent a real life situation by generating observations from a model................... 1-5
simulated data
Data that are generated by a chance model .................................................................................. 1-5
simulation
Artificial representation of a random process used to study the process’s long-term
properties.................................................................................................................................... 0-24
skewed distribution
A distribution with the bulk of the data on one side and a tail on the other, the direction of
the skew is the side the tail is on ................................................................................................ 0-14
slope
Change in predicted response variable divided by change in explanatory variable ................ 10-19
SSE
Sum of squared errors. It is the sum of all the squared residuals............................................. 10-22
standard deviation (SD)
A typical deviation or distance of the data values from their average .................... 0-15, 0-20, 2-14
standard deviation of p-hat
The standard deviation of the distribution of sample proportions can be shown
mathematically to follow the formula √(π(1 − π)/n) ................................................................... 1-54
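The standard deviation of a sample proportion is the square root of π(1 − π)/n. A one-function Python sketch, with a hypothetical function name and example values:

```python
import math


def sd_of_p_hat(pi, n):
    """Standard deviation of the sample proportion: sqrt(pi * (1 - pi) / n)."""
    return math.sqrt(pi * (1 - pi) / n)


# Hypothetical example: pi = 0.5 and sample size n = 100.
print(sd_of_p_hat(0.5, 100))
```

Quadrupling the sample size halves this standard deviation, because n sits inside the square root.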
standard error
Approximate estimate for the standard deviation of the null distribution ........................ 5-35, 7-23
standardize
To standardize an observation, compute the distance of the observation from the mean
and divide by the standard deviation of the distribution. .................................................. 1-35, 1-38
statistic ± margin-of-error
Most confidence intervals can be written in this form ............................................................... 2-15
statistic
A number calculated from the observed data which summarizes information about the
variable or variables of interest .............................................................................. 0-2, 0-6, 0-7, 3-2
statistical inference
Drawing conclusions beyond the sample data to a larger population or process ........................ 0-8
statistical significance
When the sample results are unlikely to have arisen by chance alone ........................................ 1-2
statistical spreadsheet
The file of individual data values with variables as columns and observational units as
rows ............................................................................................................................................ 0-11
statistical tendency
Not a hard-and-fast rule, rather something that is typically observable .................................... 0-19
statistically significant
Sample results that are unlikely to have arisen by chance alone are considered
statistically significant ......................................................................................................... 0-4, 0-8
Statistics
A discipline that guides us in collecting, exploring and drawing conclusions from data ............ 0-1
strength of evidence
Determining whether the observed statistic provides convincing evidence against the null
hypothesis ........................................................................................................................... 1-2, 6-23
strength of association
Strength of association between two quantitative variables tells how closely data follow a
particular pattern, be it linear or a more complicated curve ...................................................... 10-1
subjects
Study participants that are human .............................................................................................. 1-24
symmetric distribution
A distribution with a vertical line of symmetry through it ........................................................ 0-14
test of significance
A procedure for measuring the strength of evidence against a null hypothesis about the
parameter of interest .................................................................................................................. 1-17
theory-based approach for a single proportion
A test of significance that uses the Central Limit Theorem to predict the simulated null
distribution of sample proportions. If the sample size is large enough the predicted
distribution will be bell-shaped (or normal), centered at the underlying probability (π),
with a standard deviation of √(π(1 − π)/n). ............................................................................... 1-54
transform
Express data on a different scale, such as logarithmic, often used to meet validity
conditions ................................................................................................................................. 10-36
t-standardized statistic
Similar to the standardized z-statistic for proportions, this is a standardized statistic for
means. Typically the distribution is bell-shaped and symmetric, but it is not exactly
normally distributed, being less peaked with fatter tails. As the sample size increases, the
distribution becomes more normal............................................................................................. 6-32
two by two (2 × 2)
A two-way table where the explanatory and response variables each have two categories ........ 5-4
two-sided test
Estimates the p-value by considering results that are at least as extreme as our observed
result in either direction ............................................................................................................. 1-45
two-way table
A tabular summary of two categorical variables, also called a contingency table ...................... 5-3
unbiased
A sampling method that, on average across many random samples, produces statistics
whose average is the value of the population parameter ............................................................. 3-6
unusual observations
Any observations that do not fit in with the pattern of the rest of the observations .................. 10-5
validity check
Check to see that certain conditions are met that render the theory-based approach valid.
Often these conditions deal with sample size, shape, and variability of distributions............... 1-54
validity conditions for ANOVA
For the F-statistic to follow an F distribution, each population distribution needs to be
normal and the population standard deviations need to be equal ................................................. 9-20
validity conditions for chi-square test
Each cell of the two-way table must have at least 10 observations ................................ 8-19, 8-24
validity conditions for comparing two averages
For the theory-based method for means to apply, want sample sizes in each group of at
least 20, OR distributions of quantitative variable in both groups are normal (bell-shaped
and symmetric)........................................................................................................................... 6-34
validity conditions for matched pairs
Sample size of pairs is at least 20 OR the differences follow a normal distribution ................. 7-28
validity conditions for one proportion
To use theory-based approach for a single proportion, must have at least 10 observations
in each category ......................................................................................................................... 2-26
variability
Fluctuations in data. In Statistics one needs to consider sources of variability ............... 0-15, 0-19
variables
Recorded characteristics of the observational units .................................................... 0-5, 0-7, 0-11
y-intercept
Predicted value of the response variable when the explanatory variable has a value of
zero ........................................................................................................................................... 10-19
z-statistic
z-statistic, also called the standardized statistic, compares an observed statistic with a
hypothesized parameter value, mostly used with proportions (and t-statistic with means)....... 1-54
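For a single proportion, the z-statistic is the observed proportion minus the hypothesized value, divided by the standard deviation of the sample proportion under the null. The sketch below assumes that standard deviation is √(π(1 − π)/n); the function name and numbers are hypothetical.

```python
import math


def z_statistic(p_hat, pi_null, n):
    """Standardized statistic for one proportion."""
    sd = math.sqrt(pi_null * (1 - pi_null) / n)  # SD of p-hat under the null
    return (p_hat - pi_null) / sd


# Hypothetical example: observed proportion 0.60, null value 0.5, n = 100.
print(round(z_statistic(0.60, 0.5, 100), 2))
```

Here the observed proportion lies about two standard deviations above the hypothesized value.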
α
Parameter for y-intercept of a regression line .......................................................................... 10-30
β
Parameter for slope of a regression line ................................................................................... 10-30
ρ
Symbol for population correlation coefficient ........................................................................... 10-7
π
Symbol used for the unknown underlying process probability or true population
proportion. ................................................................................................................................ 1-19
d̄
Observed sample average of the differences ............................................................................... 7-6
µd
Population or process parameter for the average of the differences ............................................ 7-6