Download statistics - Bio-Link

lseidman@matcmadison.edu
Biotechnology Laboratory
Technician Program
Course:
Basic Biotechnology Laboratory Skills
for a Regulated Workplace
Lisa Seidman, Ph.D.
STATISTICS
A BRIEF INTRODUCTION
WHY LEARN ABOUT
STATISTICS?

Statistics provides tools that are used in



Quality control
Research
Measurements

Sports
IN THIS COURSE

We will use some of these tools



Ideas
Vocabulary
A few calculations
VARIATION

There is variation in the natural world




People vary
Measurements vary
Plants vary
Weather varies

Variation among organisms is the basis of
natural selection and evolution
EXAMPLE



100 people take a drug and 75 of them get
better
100 people don’t take the drug but 68 get
better without it
Did the drug help?
VARIABILITY IS A PROBLEM



There is variation in response to the illness
There is variation in response to the drug
So it’s difficult to figure out if the drug helped
STATISTICS

Provides mathematical tools to help arrive at
meaningful conclusions in the presence of
variability


Might help researchers decide if a drug is
helpful or not
This is a more advanced application of
statistics than we will get into
DESCRIPTIVE STATISTICS


Chapter 16 in your textbook
Descriptive statistics is one area within
statistics
DESCRIPTIVE STATISTICS

Provides tools to DESCRIBE, organize and
interpret variability in our observations of the
natural world
DEFINITIONS

Population:

Entire group of events, objects, results, or
individuals, all of whom share some unifying
characteristic
POPULATIONS

Examples:

All of a person’s red blood cells
All the enzyme molecules in a test tube

All the college students in the U.S.

SAMPLE

Sample: Portion of the whole population that
represents the whole population

Example: It is virtually impossible to measure
the level of hemoglobin in every cell of a
patient

Rather, take a sample of the patient’s blood
and measure the hemoglobin level
MORE ABOUT SAMPLES

Representative sample: a sample that truly
represents the variability in the population (a good sample)
TWO VOCABULARY WORDS



A sample is random if all members of the population
have an equal chance of being drawn
A sample is independent if the choice of one
member does not influence the choice of another
Samples need to be taken randomly and
independently in order to be representative
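A minimal Python sketch of drawing a simple random sample. The batch of 500 numbered vials and the sample size of 15 are made up for illustration; `random.sample` gives every vial the same chance of being chosen and never picks the same vial twice.

```python
import random

# Hypothetical population: a batch of 500 vials, labeled 1 to 500.
batch = list(range(1, 501))

# Fix the seed only so the example is repeatable.
rng = random.Random(42)

# Simple random sample: every vial has the same chance of being chosen.
sample = rng.sample(batch, k=15)

print(sorted(sample))
```

Picking the first 15 vials off the filling line, by contrast, would not be a random sample, because vials made later would have no chance of being drawn.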
SAMPLING


How we take a sample is critical and often
complex
If sample is not taken correctly, it will not be
representative
EXAMPLE

How would you sample a field of corn?
VARIABLES

Variables:


Characteristics of a population (or a sample) that
can be observed or measured
Called variables because they can vary among
individuals
VARIABLES

Examples:



Blood hemoglobin levels
Activity of enzymes
Test scores of students


A population or sample can have many
variables that can be studied
Example

Same population of six year old children can be
studied for




Height
Shoe size
Reading level
Etc.
DATA

Data: Observations of a variable (singular is datum)


May or may not be numerical
Examples:



Heights of all the children in a sample (numerical)
Lengths of insects (numerical)
Pictures of mouse kidney cells (not numerical)
ALWAYS UNCERTAINTY

Even if you take a sample correctly, there is
uncertainty when you use a sample to represent the
whole population


Various samples from the same population are unlikely to
be identical
So, need to be careful about drawing conclusions
about a population, based on a sample – there is
always some uncertainty
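The point that different samples from the same population are unlikely to be identical can be sketched in Python. The population here is simulated (10,000 made-up values scattered around 22), so the numbers are illustrative only.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical population: 10,000 values scattered around 22 (made-up numbers).
population = [rng.gauss(22, 2) for _ in range(10_000)]

# Draw five separate random samples of 20 and compare their means.
sample_means = [statistics.mean(rng.sample(population, 20)) for _ in range(5)]

for m in sample_means:
    print(round(m, 2))
```

Each sample was drawn correctly, yet the five means are not the same. That difference is the uncertainty the slide describes.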
SAMPLE SIZE



If a sample is drawn correctly, then, the larger
the sample, the more likely it is to accurately
reflect the entire population
If it is not done correctly, then a bigger
sample may not be any better
How does this apply to the corn field?
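A rough Python sketch of the effect of sample size, assuming a simulated population (the values, the sample sizes of 5 and 500, and the helper name `typical_error` are all made up for illustration). It repeatedly draws random samples and records how far each sample mean lands from the population mean.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical population with a known mean (made-up numbers).
population = [rng.gauss(100, 15) for _ in range(20_000)]
true_mean = statistics.mean(population)

def typical_error(sample_size, repeats=200):
    """Average distance between a sample mean and the population mean."""
    errors = []
    for _ in range(repeats):
        sample = rng.sample(population, sample_size)
        errors.append(abs(statistics.mean(sample) - true_mean))
    return statistics.mean(errors)

small = typical_error(5)
large = typical_error(500)
print(round(small, 2), round(large, 2))
```

With correctly drawn samples, the larger sample size should land much closer to the population mean on average. This says nothing about badly drawn samples, where a bigger sample may be no better.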
INFERENTIAL STATISTICS



Another branch of statistics
Won’t talk about it much
Deals with tools to handle the uncertainty of
using a sample to represent a population
EXAMPLE PROBLEM

In a quality control setting, 15 vials of product
from a batch are tested. What is the sample?
What is the population?

In an experiment, the effect of a carcinogenic
compound was tested on 2000 lab rats.
What is the sample? What is the population?

In a clinical study, a new drug was tested on
fifty patients. What is the sample? What is
the population?
ANSWERS

15 vials, the sample, were tested for QC. The
population is all the vials in the batch.

The sample is the rats that were tested. The
population is probably all lab rats.

The sample is the 50 patients tested in the trial. The
population is all patients with the same condition.
EXAMPLE PROBLEM

An advertisement says that 2 out of 3 doctors
recommend Brand X.



What is the sample? What is the population?
Is the sample representative?
Does this statement ensure that Brand X is better
than competitors?
ANSWER

Many abuses of statistics relate to poor sampling.
The population of interest is all doctors. No way to
know what the sample is. The sample could have
included only relatives of employees at Brand X
headquarters, or only doctors in a certain area.
Therefore the statement does not ensure that the
majority of doctors recommend Brand X. It certainly
does not ensure that Brand X is best.
DESCRIBING DATA SETS



Draw a sample from a population
Measure values for a particular variable
Result is a data set
DATA SETS


Individuals vary, therefore the data set has
variation
Data without organization is like letters that
aren’t arranged into words

Numerical data can be arranged in ways that
are meaningful – or that are confusing or
deceptive
DESCRIPTIVE STATISTICS


Provides tools to organize, summarize, and
describe data in meaningful ways
Example:



The exam scores for a class are the data set
What is the variable of interest?
Can summarize with the class “average”; what
does this tell you?


A measure that describes a data set, such as
the average, is sometimes called a “statistic”
Average gives information about the center of
the data
MEDIAN AND MODE



Two other statistics that give information
about the center of a set of data
Median is the middle value
Mode is most frequent value
MEASURES OF CENTRAL
TENDENCY


Measures that describe the center of a data
set are called: Measures of Central Tendency
Mean, median, and the mode
HYPOTHETICAL DATA SET
2 5 6 7 8 3 9 3 10 4 7 4 6 11 9
Simplest way to organize them is to put in
order:
2 3 3 4 4 5 6 6 7 7 8 9 9 10 11
By inspection they center around 6 or 7
MEAN
Mean is basically the same as the average
Add all the numbers together and divide by the
number of values
2 3 3 4 4 5 6 6 7 7 8 9 9 10 11
What is the mean for this data set?

NOMENCLATURE



Mean = 6.3 = $\bar{X}$, read “X bar”
The observations are called $X_1$, $X_2$, etc.
There are 15 observations in this example, so the
last one is $X_{15}$
Mean = $\bar{X} = \frac{\sum X_i}{n}$
where n = number of values
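The mean of the 15-value hypothetical data set can be checked with a few lines of Python. This is only a sketch of the "add them up and divide by n" rule; the variable names are chosen for illustration.

```python
data = [2, 3, 3, 4, 4, 5, 6, 6, 7, 7, 8, 9, 9, 10, 11]

n = len(data)        # number of values
total = sum(data)    # sum of all the observations
mean = total / n     # X bar

print(n, total, round(mean, 1))  # 15 94 6.3
```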
EXAMPLE

Data set
2 3 3 4 5 6 7 8 9
What is the mode?
What is the median?
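One way to check your answers is Python's standard `statistics` module. This sketch applies it to the nine-value data set above.

```python
import statistics

data = [2, 3, 3, 4, 5, 6, 7, 8, 9]

mode = statistics.mode(data)      # most frequent value
median = statistics.median(data)  # middle value of the sorted data

print(mode, median)  # 3 5
```

With nine values, the median is the fifth value in sorted order. With an even number of values, `statistics.median` returns the average of the two middle values.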
MEAN OF A POPULATION
VERSUS THE MEAN OF A
SAMPLE




Statisticians distinguish between the mean of
a sample and the mean of a population
The sample mean is $\bar{X}$
The population mean is μ
It is rare to know the population mean, so the
sample mean is used to represent it
DISPERSION

Data sets A and B both have the same
average:
A 4 5 5 5 6 6
B 1 2 4 7 8 9

But are not the same:


A is more clumped around the
central value
B is more dispersed, or spread out
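A short Python sketch comparing data sets A and B. It shows that the means match while the spread does not. The spread is measured here simply as the highest value minus the lowest.

```python
import statistics

a = [4, 5, 5, 5, 6, 6]
b = [1, 2, 4, 7, 8, 9]

# Both sets have the same mean (31 / 6, about 5.17).
print(round(statistics.mean(a), 2), round(statistics.mean(b), 2))

# The spread differs: highest value minus lowest value.
print(max(a) - min(a), max(b) - min(b))  # 2 8
```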
MEASURES OF DISPERSION


Measures of central tendency do not describe
how dispersed a data set is
Measures of dispersion do; they describe
how much the values in a data set vary from
one another
MEASURES OF DISPERSION

Common measures of dispersion are:




Range
Variance
Standard deviation
Coefficient of variation
CALCULATIONS OF
DISPERSION


Measures of dispersion, like measures of
central tendency, are calculated
Range is the difference between the lowest
and highest values in a data set
Example:
2 3 3 4 4 5 6 6 7 7 8 9 9 10 11
Range: 11 - 2 = 9, or 2 to 11
Range is not particularly informative because
it is based on only two values from the data
set
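To see why the range says little, this Python sketch compares the data set above with a second, made-up data set (`clustered`, invented for illustration). Both have the same lowest and highest values, but the second has every other value bunched at 6.

```python
data = [2, 3, 3, 4, 4, 5, 6, 6, 7, 7, 8, 9, 9, 10, 11]
# Made-up comparison set: same lowest and highest values, but tightly clustered.
clustered = [2, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 11]

def value_range(values):
    """Difference between the highest and lowest values."""
    return max(values) - min(values)

print(value_range(data), value_range(clustered))  # 9 9
```

The two data sets are spread out very differently, yet the range is 9 for both.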

CALCULATING VARIANCE AND
STANDARD DEVIATION


Variance and standard deviation are measures of
the average amount by which each
observation varies from the mean
Example:
4cm 5cm 6cm 7cm 7cm 7cm 9cm 11cm
Data set, lengths of 8 insects
4cm 5cm 6cm 7cm 7cm 7cm 9cm 11cm



The mean is 7 cm
How much do they vary from one another?
Intuitively might see how much each point
varies from the mean

This is called the deviation
CALCULATION OF
DEVIATIONS FROM MEAN
4cm 5cm 6cm 7cm 7cm 7cm 9cm 11cm
Value - Mean (cm)    Deviation (cm)
(4 - 7)              -3
(5 - 7)              -2
(6 - 7)              -1
(7 - 7)               0
(7 - 7)               0
(7 - 7)               0
(9 - 7)              +2
(11 - 7)             +4

Sum of deviations = 0



Sum of the deviations from the mean is
always zero
Therefore, cannot use the average deviation
Therefore, mathematicians decided to square
each deviation so they will get positive
numbers
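The same steps in a short Python sketch, using the eight insect lengths. It shows the deviations cancelling to zero and the squared deviations adding up to a positive number.

```python
lengths_cm = [4, 5, 6, 7, 7, 7, 9, 11]

mean = sum(lengths_cm) / len(lengths_cm)      # 7.0 cm
deviations = [x - mean for x in lengths_cm]   # each value minus the mean
squared = [d ** 2 for d in deviations]        # squaring removes the signs

print(deviations)
print(sum(deviations))   # 0.0
print(sum(squared))      # 34.0
```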
Value - Mean (cm)    Deviation (cm)    Squared deviation (cm²)
(4 - 7)              -3                 9
(5 - 7)              -2                 4
(6 - 7)              -1                 1
(7 - 7)               0                 0
(7 - 7)               0                 0
(7 - 7)               0                 0
(9 - 7)              +2                 4
(11 - 7)             +4                16

Total squared deviation = sum of squares = 34 cm²
VARIANCE

Total squared deviation (sum of squares)
divided by the number of measurements:
34 cm² / 8 = 4.25 cm²
STANDARD DEVIATION

Square root of the variance:
√(4.25 cm²) = 2.06 cm


Note that the SD has the same units as the data
Note also that the larger the variance and SD, the
more dispersed are the data
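A Python sketch of the variance and SD calculation for the insect lengths. It divides the sum of squares by the number of measurements, as on the slides. The last line compares the result with `statistics.pstdev`, which uses the same divisor.

```python
import math
import statistics

lengths_cm = [4, 5, 6, 7, 7, 7, 9, 11]

n = len(lengths_cm)
mean = sum(lengths_cm) / n
sum_of_squares = sum((x - mean) ** 2 for x in lengths_cm)  # 34 cm^2

variance = sum_of_squares / n    # 4.25 cm^2 (divided by n)
sd = math.sqrt(variance)         # about 2.06 cm, same units as the data

print(variance, round(sd, 2))                     # 4.25 2.06
print(round(statistics.pstdev(lengths_cm), 2))    # 2.06
```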
VARIANCE AND SD OF
POPULATION VS SAMPLE



Statisticians distinguish between the mean
and SD of a population and a sample
The variance of a population is called sigma
squared, σ²
Variance of a sample is S²


The standard deviation of a population is
called sigma, σ
Standard deviation of a sample is S or SD
STANDARD DEVIATION OF A
SAMPLE
(Xi - )2
n -1
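For a sample, the sum of squares is divided by n - 1 rather than n. A Python sketch using the eight insect lengths, treated here as a sample, shows the calculation by hand and with `statistics.stdev`, which uses the n - 1 divisor.

```python
import math
import statistics

lengths_cm = [4, 5, 6, 7, 7, 7, 9, 11]

n = len(lengths_cm)
x_bar = sum(lengths_cm) / n
sum_of_squares = sum((x - x_bar) ** 2 for x in lengths_cm)  # 34 cm^2

s = math.sqrt(sum_of_squares / (n - 1))   # sample SD, divisor n - 1

print(round(s, 2))                               # 2.2
print(round(statistics.stdev(lengths_cm), 2))    # 2.2
```

The result, about 2.20 cm, is slightly larger than the 2.06 cm obtained when dividing by n.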
EXAMPLE PROBLEM
A biotechnology company sells cultures of E.
coli. The bacteria are grown in batches that
are freeze dried and packaged into vials.
Each vial is expected to have 200 mg of
bacteria. A QC technician tests a sample of
vials from each batch and reports the mean
weight and SD.

Batch Q-21 has a mean weight of 200 mg
and an SD of 12 mg. Batch P-34 has a mean
weight of 200 mg and an SD of 4 mg. Which
lot appears to have been packaged in a more
controlled fashion?
ANSWER

The SD can be interpreted as an indication of
consistency. The SD of the weights of Batch
P-34 is lower than of Batch Q-21. Therefore,
the weights for vials for Batch P-34 are less
dispersed than those for Batch Q-21 and
Batch P-34 appears to have been better
controlled.
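The two batches can also be compared with the coefficient of variation, one of the measures of dispersion listed earlier. The slides do not define it; the sketch below assumes the usual definition, the SD expressed as a percentage of the mean. The function name is chosen for illustration.

```python
def coefficient_of_variation(mean, sd):
    """SD expressed as a percentage of the mean."""
    return sd * 100 / mean

cv_q21 = coefficient_of_variation(mean=200, sd=12)   # Batch Q-21
cv_p34 = coefficient_of_variation(mean=200, sd=4)    # Batch P-34

print(cv_q21, cv_p34)  # 6.0 2.0
```

Because both batches have the same mean, the comparison comes out the same as comparing the SDs directly: P-34 varies less.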
FREQUENCY DISTRIBUTIONS


So far, talked about calculations to describe
data sets
Now talk about graphical methods
TABLE 5
THE WEIGHTS OF 175 FIELD MICE
(in grams)
19 21 19 20 19 20 22 22 23 21
20 22 25 25 24 26 22 21 24 20
24 20 22 22 21 20 22 21 22 26
20 22 21 23 21 21 21 21 23 22
21 22 21 22 20 20 20 21 23 22
25 21 21 22 23 20 22 19 23 22
21 23 23 21 23 21 24 22 23 25
22 23 22 24 24 25 21 22 22 19
22 24 19 24 22 23 20 21 22 24
25 21 25 21 23 23 23 21 19 19
24 21 23 20 20 20 24 26 20 23
19 24 22 22 22 24 20 21 18 23
21 22 21 23 28 21 26 21 21 21
21 22 27 21 19 27 24 19 23 25
20 22 24 24 22 22 20 23 22 23
22 22 25 20 25 17 22 23 21 22
20 23 24 20 20 23 22 23 20 20
22 24 23 22
FREQUENCY DISTRIBUTION TABLE OF THE
WEIGHTS OF FIELD MICE
Weight (g)    Frequency
17             1
18             1
19            11
20            25
21            34
22            40
23            27
24            19
25            10
26             4
27             2
28             1
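A frequency table carries enough information to recover summary statistics. This Python sketch copies the weights and frequencies from the table above and computes the total number of mice, the mode, and the mean. The variable names are illustrative.

```python
# Weight in grams -> number of mice, copied from the frequency table.
frequency = {17: 1, 18: 1, 19: 11, 20: 25, 21: 34, 22: 40,
             23: 27, 24: 19, 25: 10, 26: 4, 27: 2, 28: 1}

total_mice = sum(frequency.values())
mode = max(frequency, key=frequency.get)   # weight with the highest frequency
mean = sum(w * f for w, f in frequency.items()) / total_mice

print(total_mice, mode, round(mean, 2))  # 175 22 22.01
```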
FREQUENCY TABLE


Tells us that most mice have weights in the
middle of the range, a few are lighter or
heavier
The word distribution refers to a pattern of
variation for a given variable


It is important to be aware of patterns, or
distributions, that emerge when data are
organized by frequency
The frequency distribution can be illustrated
as a frequency histogram
FREQUENCY HISTOGRAM




X axis is units of measurement, in this
example, weight in grams
Y axis is the frequency of a particular value
For example, 11 mice weighed 19 g
The values for these 11 mice are illustrated
as a bar


Note that when the mouse data were
collected, a mouse recorded as 19 grams
actually weighed at least 18.5 g and less than 19.5 g.
Therefore the bar spans an interval of 1 gram
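A rough text version of the histogram can be printed in Python from the mouse frequency table. This is only a sketch: the bars run sideways, so the row length plays the role of the bar height.

```python
# Weight in grams -> number of mice, copied from the frequency table.
frequency = {17: 1, 18: 1, 19: 11, 20: 25, 21: 34, 22: 40,
             23: 27, 24: 19, 25: 10, 26: 4, 27: 2, 28: 1}

# One row per weight; each '#' stands for one mouse.
rows = [f"{weight} g | {'#' * count} ({count})"
        for weight, count in sorted(frequency.items())]

print("\n".join(rows))
```

The row for 19 g contains 11 marks, matching the 11 mice that weighed 19 g.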
FIRST FOUR BARS
[Figure: the first four bars of the frequency histogram. X axis: weights in grams (17, 18, 19, 20). Y axis: frequency.]
CONSTRUCTING A
FREQUENCY HISTOGRAM




Divide the range of the data into intervals
It is simplest to make each interval (class) the
same width
No set rule as to how many intervals to have
For example, length data might be 0-9 cm,
10-19 cm, 20-29 cm and so on



Count the number of observations that are in
each interval
Make a frequency table with each interval
and the frequency of values in that interval
Label the axes of a graph with the intervals
on the X axis and the frequency on the Y axis



Draw in bars where the height of a bar
corresponds to the frequency of the value
Center the bars above the midpoint of the
class interval
For example, if the interval is 0-9 cm, then
the bar should be centered at 4.5 cm
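The steps above can be sketched in Python. The length data are made up for illustration, and the classes are 10 cm wide (0-9 cm, 10-19 cm, 20-29 cm). The code counts the observations in each class and reports the midpoint where each bar would be centered.

```python
# Made-up length data in cm, for illustration only.
lengths_cm = [3, 7, 12, 15, 18, 21, 22, 25, 8, 14, 16, 27]
class_width = 10

# Count how many observations fall in each class: 0-9, 10-19, 20-29, ...
counts = {}
for x in lengths_cm:
    lower = (x // class_width) * class_width   # lower limit of the class
    counts[lower] = counts.get(lower, 0) + 1

for lower in sorted(counts):
    upper = lower + class_width - 1
    midpoint = (lower + upper) / 2             # where the bar is centered
    print(f"{lower}-{upper} cm  midpoint {midpoint}  frequency {counts[lower]}")
```

The printed rows form the frequency table, and the frequencies are the bar heights.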
NORMAL FREQUENCY
DISTRIBUTION

If weights of very many lab mice were
measured, would likely have a frequency
distribution that looks like a bell shape, also
called the “normal distribution”
NORMAL DISTRIBUTION
[Figure: bell-shaped normal distribution curve. X axis: weight. Y axis: frequency.]
NORMAL DISTRIBUTION


Very important
Examples:


Heights of humans
If you measure the same thing over and over, the
measurements will tend to have this distribution
CALCULATIONS AND
GRAPHICAL METHODS



Related
The center of the peak of a normal curve is
the mean, the median and the mode
Values are evenly spread out on either side
of that high point



The width of the normal curve is related to
the SD
The more dispersed the data, the higher the
SD and the wider the normal curve
The exact relationship is in the textbook; we will not go into it this
semester
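For the curious, the link between the SD and the width of an ideal normal curve can be checked with Python's `statistics.NormalDist`. This is a sketch for a perfectly normal distribution, not something required for the course; the helper name `fraction_within` is made up.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Fraction of a normal distribution within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

print(round(fraction_within(1), 3))  # about 0.683
print(round(fraction_within(2), 3))  # about 0.954
```

Roughly 95% of the values in an ideal normal distribution lie within two SDs of the mean. That is why a value more than two SDs from the mean is often treated as worth a second look.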
EXAMPLE PROBLEM
A technician customarily performs a certain assay.
The results of 8 typical assays are:
32.0 mg 28.9 mg 23.4 mg 30.7 mg
23.6 mg 21.5 mg 29.8 mg 27.4 mg
a.
If the technician obtains a value of 18.1 mg,
should he be concerned? Base your answer on
estimation.
b.
Perform statistical calculations to see if the value
is outside the range of two SDs.
ANSWER


The average appears to be in the
mid-twenties, and the values hover within about ± 5 mg of it.
Therefore, 18.1 mg appears a bit low.
Mean = 27.16 mg, SD = 3.87 mg. The mean
– 2SD is 19.4 mg, so 18.1 mg appears to be
outside the range and should be investigated
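The calculation in part b can be reproduced with a short Python sketch. It uses `statistics.stdev`, which divides by n - 1, matching the sample standard deviation.

```python
import statistics

assays_mg = [32.0, 28.9, 23.4, 30.7, 23.6, 21.5, 29.8, 27.4]
new_value = 18.1

mean = statistics.mean(assays_mg)
sd = statistics.stdev(assays_mg)             # sample SD, divisor n - 1
lower, upper = mean - 2 * sd, mean + 2 * sd  # range of two SDs around the mean

print(round(mean, 2), round(sd, 2))     # 27.16 3.87
print(round(lower, 1), round(upper, 1))
print(lower <= new_value <= upper)      # False
```

The new value falls below the lower limit, which agrees with the conclusion that it should be investigated.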