Lecture 1: Measurements, Statistics,
Probability, and Data Display
Karen Bandeen-Roche, PhD
Department of Biostatistics
Johns Hopkins University
July 11, 2011
Introduction to Statistical
Measurement and Modeling
What is statistics?
The study of …
(i.) … populations
(ii.) …variation
(iii.) … methods of the reduction of data.
“The original meaning of the word … suggests
that it was the study of populations of human
beings living in political union.”
Sir R. A. Fisher
What is statistics?
 “… Statistical Science [is] the particular aspect
of human progress which gives the 20th century
its special character…. It is to the statistician
that the present age turns for what is most
essential in all its more important activities.”
Sir R. A. Fisher
What is statistics?
Less complimentary views
 “Science is difficult. You need mathematics and
statistics, which is dull like learning a language.”
Richard Gregory
 “There are three kinds of lies: lies, damned lies
and statistics.”
Mark Twain, quoting Disraeli
What is statistics?
 “Statistics is concerned with METHODS for
COLLECTING & DESCRIBING DATA and
then for ASSESSING STRENGTH OF
EVIDENCE in DATA FOR/AGAINST
SCIENTIFIC IDEAS!”
Scott L. Zeger
What is statistics?
 “The art and science of gathering, analyzing,
and making inferences from data.”
Encyclopaedia Britannica
Poetry
Music
Statistics
Mathematics
Physics
What is biostatistics?
 The science of learning from biomedical data
involving appreciable variability or
uncertainty.
Amalgam
Data examples
 Osteoporosis screening
 Importance: Osteoporosis afflicts millions of older
adults (particularly women) worldwide
 Lowers quality of life, heightens risk of falls etc.
 Scientific question: Can we detect osteoporosis earlier
and more safely?
 Method: ultrasound versus dual photon
absorptiometry (DPA) tried out on 42 older women
 Implications: Treatment to slow / prevent onset
Osteoporosis data
[Figure: two panels, “DPA scores by osteoporosis groups” and “Ultrasound scores by osteoporosis groups”, each comparing control and case]
Data examples
 Temperature modeling
 Importance: Climate change is suspected. Heat waves,
increased particle pollution, etc. may harm health.
 Scientific question: Can we accurately and precisely
model geographic variation in temperature?
 Method: Maximum January-average temperature over
30 years in 62 United States cities
 Implications: Valid temperature models can support
future policy planning
United States temperature map
http://green-enb150.blogspot.com/2011/01/isorhythmic-map-united-states-weather.html
Modeling geographical variation:
Latitude and Longitude
http://www.enchantedlearning.com/usa/activity/latlong/
Temperature data
[Figure: pairwise plots of temp, longitude, and latitude]
Data examples
 Boxing and neurological injury
 Importance: (1) Boxing and sources of brain jarring
may cause neurological harm. (2) In ~1986 the IOC
considered replacing Olympic boxing with golf.
 Scientific question: Does amateur boxing lead to
decline in neurological performance?
 Method: “Longitudinal” study of 593 amateur boxers
 Implications: Prevention for brain injury from
subconcussive blows.
Boxing data
[Figure: blkdiff versus blbouts with lowess smoother, bandwidth = .8]
Course objectives
 Demonstrate familiarity with statistical tools for
characterizing population measurement properties
 Distinguish procedures for deriving estimates from
data and making associated scientific inferences
 Describe “association” and describe its importance in
scientific discovery
 Understand, apply and interpret findings from
 methods of data display
 standard statistical regression models
 standard statistical measurement models
 Appreciate roles of statistics in health science
Basic paradigm of statistics
 We wish to learn about populations
 All about which we wish to make an inference
 “True” experimental outcomes and their mechanisms
 We do this by studying samples
 A subset of a given population
 “Represents” the population
 Sample features are used to infer population features
 Method of obtaining the sample is important
 Simple random sample: All population elements /
outcomes have equal probability of inclusion
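A minimal sketch of simple random sampling, assuming a hypothetical population of 100 labeled units (the population, the sample size, and the helper name `simple_random_sample` are illustrative, not from the lecture):

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements; each element has inclusion probability n/N."""
    rng = random.Random(seed)
    return rng.sample(population, n)

population = list(range(1, 101))   # hypothetical population, units labeled 1..100
sample = simple_random_sample(population, 10, seed=1)
print(sample)
```

Sample features (for example the sample mean) would then be used to infer the corresponding population features.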
Basic paradigm of statistics
Probability
Truth for
Population
Observed Value for a
Representative Sample
Statistical inference
Tools for description
Populations              Samples
Probability              Probability
Parameters               Statistics / Estimates
Values, distributions    Data displays
Hypotheses               Statistical tests
Models                   Analyses
Probability
 Way for characterizing random experiments
 Experiments whose outcome is not determined
beforehand
 Sample space: Ω := {all possible outcomes}
 Event = A ⊆ Ω := collection of some outcomes
 Probability = “measure” on Ω
 Our course: measure of relative frequency of occurrence
 “Bayesian”: measure of relative belief in occurrence
Probability measures
 Satisfy following axioms:
i) P{Ω} = 1: reads "probability of Ω"
ii) 0 ≤ P{A} ≤ 1 for each A
> 0 = “can’t happen”; 1 = “must happen”
iii) Given disjoint events {A_k}, k = 1, …, K: P{∪_{k=1}^{K} A_k} = Σ_{k=1}^{K} P{A_k}
> “disjoint” = “mutually exclusive”; no two can
happen at the same time
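The axioms can be checked on a small sample space. The sketch below assumes the two-fair-coin-toss experiment with equally likely outcomes; the function name `prob` and the two example events are illustrative choices.

```python
from fractions import Fraction
from itertools import product

# Sample space for two fair coin tosses: four equally likely outcomes.
omega = set(product("HT", repeat=2))

def prob(event):
    """Relative-frequency probability of an event (a subset of omega)."""
    return Fraction(len(event & omega), len(omega))

at_least_one_head = {w for w in omega if "H" in w}
no_heads = {w for w in omega if "H" not in w}

print(prob(omega))                  # axiom i: P{omega} = 1
print(prob(at_least_one_head))      # axiom ii: a value between 0 and 1
# axiom iii: the two events are disjoint, so their probabilities add
print(prob(at_least_one_head | no_heads)
      == prob(at_least_one_head) + prob(no_heads))
```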
Random variable (RV)
 A function which assigns numbers to outcomes of a
random experiment - X:Ω → ℝ
 Measurements
 Support:= SX = range of RV X
 Two fundamental types of measurements
 Discrete: SX is countable (“gaps” in possible values)
 Binary: Two possible outcomes
 Continuous: SX is an interval in ℝ
 “No gaps” in values
Random variable (RV)
 Example 1: X = number of heads in two fair coin tosses
 SX = {0,1,2}
 Example 2: Draw one of your names out of a hat.
X=age (in years) of the person whose name I draw.
 SX =
 Mass function:
Probability distributions
 Heuristic: Summarizes possible values of a random
variable and the probabilities with which each occurs
 Discrete X: Probability mass function = list exactly as
the heuristic: p:x → P(X=x)
 Example = 2 fair coin tosses:
 P{HH} = P{HT} = P{TH} = P{TT} = ¼
 Mass function:
x              p(x) = P(X=x)
0              ¼
1              ½
2              ¼
y ∉ {0,1,2}    0
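As an illustration, the mass function for X = number of heads in two fair tosses can be built by enumerating the four equally likely outcomes. This is a sketch; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

outcomes = list(product("HT", repeat=2))            # HH, HT, TH, TT
counts = Counter(o.count("H") for o in outcomes)    # X = number of heads
pmf = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

for x, p in pmf.items():
    print(x, p)
```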
Cumulative probability distributions
 F: x → P(X ≤ x) = cumulative distribution function CDF
 Discrete X: Probability mass function = list exactly as
the heuristic
 Example = 2 fair coin tosses:
x       (-∞,0)   0     (0,1)   1     (1,2)   2     (2,∞)
F(x)    0        1/4   1/4     3/4   3/4     1     1
p(x)    0        1/4   0       1/2   0       1/4   0
> draw picture of p, F
Cumulative probability distributions
 Example = 2 fair coin tosses:
 Notice: p(x) recovered as differences in values of F(x)
 Suppose x_1 ≤ x_2 ≤ … and SX = {x_1, x_2, …}
 p(x_i) = F(x_i) - F(x_{i-1}), each i (define x_0 = -∞ and F(x_0) = 0)
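A short sketch of the relationship between p and F for the two-coin example: cumulative sums of the mass function give F on the support, and successive differences of F give p back.

```python
from fractions import Fraction
from itertools import accumulate

support = [0, 1, 2]
p = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]

# F(x_i) is the running total of p(x_j) over j <= i.
F = list(accumulate(p))

# p(x_i) = F(x_i) - F(x_{i-1}), with F(x_0) = 0 before the first support point.
recovered = [F[0]] + [F[i] - F[i - 1] for i in range(1, len(F))]

print(F)
print(recovered == p)
```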
Cumulative probability distributions
 Draw one of your names out of a hat. X=age (in
years) of the person whose name I draw
What about continuous RVs?
 Can we list the possible values of a random variable
and the probabilities with which each occurs?
 NO. If SX is uncountable, we can’t list the values!
 The CDF is the fundamental distributional quantity
 F(x) = P{X≤x}, with F(x) satisfying
i) a ≤ b ⇒ F(a) ≤ F(b);
ii) lim (b→∞) F(b) = 1;
iii) lim (b→-∞) F(b) = 0;
iv) lim (bn ↓ b) F(bn) = F(b) (right-continuity)
v) P{a<X≤b} = F(b) - F(a)
Two continuous CDFs
[Figure: two panels, “Normal” and “Exponential”, each plotting P(US<=score) against ultrasound scores]
Mass function analog: Density
 Defined when F is differentiable everywhere
(“absolutely continuous”)
 The density f(x) is defined as
lim(ε↓0) P{X є [x-ε/2,x+ε/2]}/ε
= lim(ε↓0) [F(x+ε/2)-F(x-ε/2)]/ε
= d/dy F(y) |y=x
 Properties
i) f ≥ 0
ii) P{a≤X≤b} = ∫_a^b f(x)dx
iii) P{X ∈ A} = ∫_A f(x)dx
iv) ∫_{-∞}^{∞} f(x)dx = 1
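These properties can be checked numerically. The sketch below assumes an exponential distribution, F(x) = 1 - e^(-λx) for x ≥ 0, with a hypothetical rate λ = 0.5; the helper names and the midpoint-rule integrator are illustrative choices.

```python
import math

lam = 0.5   # hypothetical rate, chosen only for illustration

def F(x):
    """Exponential CDF."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

def f(x):
    """Exponential density, the derivative of F for x > 0."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def central_difference(x, eps=1e-6):
    """[F(x + eps/2) - F(x - eps/2)] / eps, the ratio in the density definition."""
    return (F(x + eps / 2) - F(x - eps / 2)) / eps

def integrate(g, a, b, steps=100_000):
    """Midpoint-rule approximation to the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (k + 0.5) * h) for k in range(steps))

print(central_difference(2.0), f(2.0))     # density as a limit of CDF differences
print(integrate(f, 1, 3), F(3) - F(1))     # property ii
print(integrate(f, 0, 60))                 # property iv, truncated at 60
```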
Two densities
[Figure: two panels, “Normal” and “Exponential”, each plotting density against ultrasound score]
Probability model parameters
 Fundamental distributional quantities:
 Location: ‘central’ value(s)
 Spread: variability
 Shape: symmetric versus skewed, etc.
Location and spread
[Figure: two panels, distributions with different locations and distributions with different spreads]
Probability model parameters
 Location
 Mean: E[X] = ∫ xdF(x) = µ
 Discrete RV: E[X] = Σ_{x ∈ SX} x p(x)
 Continuous case: E[X] = ∫ x f(x) dx
 Linearity property: E[a+bX] = a + bE[X]
 Physical interpretation: Center of mass
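A worked sketch for the two-coin example: the mean as a probability-weighted sum, plus a check of the linearity property. The constants a = 3 and b = 2 are arbitrary illustrative values.

```python
from fractions import Fraction

pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def expectation(pmf, g=lambda x: x):
    """E[g(X)] for a discrete RV: sum of g(x) * p(x) over the support."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expectation(pmf)

a, b = 3, 2   # arbitrary constants for the linearity check
lhs = expectation(pmf, lambda x: a + b * x)   # E[a + bX]
rhs = a + b * mean                            # a + b E[X]

print(mean, lhs, rhs)
```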
Probability model parameters
 Location
 Median
 Heuristic: Value so that ½ of probability weight above, ½
below
Definition: median is m such that F(m) ≥ 1/2, P{X≥m} ≥ ½
 Quantile ("more generally"...)
Definition: Q(p) = q: FX(q) ≥ p, P{X≥q} ≥ 1-p
Median = Q(1/2)
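The quantile definition can be applied directly to a discrete distribution. The sketch below checks each support point of the two-coin mass function against both conditions; it only searches the support, so it can miss non-support values that also qualify (for p = 1/4, every q in [0, 1] satisfies the definition).

```python
from fractions import Fraction

pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def quantiles(pmf, p):
    """Support points q with F(q) >= p and P{X >= q} >= 1 - p."""
    result = []
    for q in sorted(pmf):
        cdf_at_q = sum(w for x, w in pmf.items() if x <= q)
        upper_at_q = sum(w for x, w in pmf.items() if x >= q)
        if cdf_at_q >= p and upper_at_q >= 1 - p:
            result.append(q)
    return result

print(quantiles(pmf, Fraction(1, 2)))   # median
print(quantiles(pmf, Fraction(1, 4)))   # lower quartile need not be unique
```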
Probability model parameters
 Spread
 Variance: Var[X] = ∫ (x-E[X])² dF(x) = σ²
 Shortcut formula: Var[X] = E[X²] - (E[X])²
 Var[a+bX] = b²Var[X]
 Physical interpretation: Moment of inertia
 Standard deviation: SD[X] = σ = √(Var[X])
 Interquartile range (IQR) = Q(.75) - Q(.25)
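A sketch checking the variance formulas on the two-coin example. The constants a = 3 and b = 2 are arbitrary illustrative values.

```python
import math
from fractions import Fraction

pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def expectation(pmf, g=lambda x: x):
    """E[g(X)] for a discrete RV."""
    return sum(g(x) * p for x, p in pmf.items())

mu = expectation(pmf)
var_definition = expectation(pmf, lambda x: (x - mu) ** 2)   # E[(X - E[X])^2]
var_shortcut = expectation(pmf, lambda x: x ** 2) - mu ** 2  # E[X^2] - (E[X])^2

a, b = 3, 2   # arbitrary constants
mu_lin = expectation(pmf, lambda x: a + b * x)
var_lin = expectation(pmf, lambda x: (a + b * x - mu_lin) ** 2)   # Var[a + bX]

sd = math.sqrt(var_definition)
print(var_definition, var_shortcut, var_lin, sd)
```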
Pause / Recapitulation
 We learn about populations through representative
samples
 Probability provides a way to characterize populations
 Possibly unseen (models, hypotheses)
 Random experiment mechanisms
 We will now turn to the characterization of samples
 Formal: probability
 Informal: exploratory data analysis (EDA)
Describing samples
 Empirical CDF
 Given data X1,...,Xn, Fn(x) = {#Xi's ≤ x}/n
 Define indicator 1{A}:= 1 if A true
= 0 if A false
 ECDF = Fn = (1/n)Σ 1{Xi≤x}
= probability (proportion) of values ≤ x in sample
 Notice: Fn is a genuine CDF with the correct properties
 Mass function p(x) = 1/n if x ∈ {X1,...,Xn} (all Xi distinct);
= 0 otherwise.
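A minimal sketch of the empirical CDF, using the first five control ages from the lecture's osteoporosis data as a small example (the function name `ecdf` is illustrative). Note that 68 appears twice, so Fn jumps by 2/5 there rather than 1/5.

```python
def ecdf(data):
    """Return Fn, where Fn(x) = (number of data values <= x) / n."""
    n = len(data)

    def Fn(x):
        return sum(1 for xi in data if xi <= x) / n

    return Fn

ages = [58, 68, 53, 68, 54]   # first five control ages in the osteoporosis data
Fn = ecdf(ages)

for x in (50, 53, 58, 68):
    print(x, Fn(x))
```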
Sample statistics
 Statistic = Function of data
 As defined in probability section, with F=Fn
 Mean = X̄n = ∫ x dFn(x)
= (1/n) Σ_{i=1}^{n} Xi
 Variance = s² = (1/(n-1)) Σ_{i=1}^{n} (Xi - X̄n)²
 Standard deviation = s
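A sketch computing the sample mean, variance (with the n-1 divisor), and standard deviation by hand, using the first five control ultrasound scores from the lecture's osteoporosis data. The comparison against Python's `statistics` module is only a cross-check.

```python
import statistics

us = [1606, 1650.25, 1659.75, 1662, 1760.75]   # first five control US scores
n = len(us)

mean = sum(us) / n
s2 = sum((x - mean) ** 2 for x in us) / (n - 1)   # sample variance, divisor n - 1
s = s2 ** 0.5                                     # sample standard deviation

print(mean, s2, s)
print(statistics.mean(us), statistics.variance(us), statistics.stdev(us))
```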
Sample statistics - Percentiles
 “Order statistics” (sorted values):
 X(1) = min(X1,...,Xn)
 X(n) = max(X1,...,Xn)
 X(j) = jth smallest value, etc.
 Median = mn = {x: Fn(x) ≥ 1/2 and PFn{X≥x} ≥ 1/2}
= X((n+1)/2) = middle if n odd;
= [X(n/2)+X(n/2+1)]/2 = mean of middle two if n even
 Quantile Qn(p) = {x: Fn(x) ≥ p and PFn{X≥x} ≥ 1-p}
 Outlier = data value "far" from bulk of data
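A sketch of the sample median computed from the order statistics, following the odd and even cases above. The data are control ages from the lecture's osteoporosis table; the function name is illustrative.

```python
import statistics

def sample_median(data):
    """Median from order statistics X(1) <= ... <= X(n)."""
    xs = sorted(data)
    n = len(xs)
    if n % 2 == 1:
        return xs[(n + 1) // 2 - 1]            # X((n+1)/2), shifted to 0-based indexing
    return (xs[n // 2 - 1] + xs[n // 2]) / 2   # mean of X(n/2) and X(n/2+1)

odd_sample = [58, 68, 53, 68, 54]
even_sample = [58, 68, 53, 68]

print(sample_median(odd_sample), sample_median(even_sample))
```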
Describing samples - Plots
 Stem and leaf plot: Easy “density” display
Steps
 Split into leading digits, trailing digits
 Stems: Write down all possible leading digits in
order, including “might have occurred's”
 Leaves: For each data value, write down first trailing
digit by appropriate value (one leaf per datum).
Issue: # stems
 Chiefly science
 Rules of thumb: root-n, 1+3.2log10n
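The steps above can be sketched for two-digit integer data, taking the tens digit as the stem and the units digit as the leaf. The data are the first eight case ages from the lecture's osteoporosis table; the function name and the fixed tens/units split are simplifying assumptions.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Display lines for non-negative integers: stem = tens, leaf = units."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    lines = []
    # Walk every stem from smallest to largest, including empty ones
    # (the "might have occurred" stems).
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {digits}")
    return lines

ages = [73, 63, 61, 75, 25, 64, 69, 62]
for line in stem_and_leaf(ages):
    print(line)
```

The empty stems 3 to 5 make the gap between age 25 and the rest of the batch visible.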
Describing samples - Plots
 Boxplot
 Draw box whose "ends" are Q(1/4) and Q(3/4)
 Draw line through box at median
 Boxplot criterion for "outlier": beyond "inner fences"
= hinges +/- 1.5*IQR
 Draw lines ("Whiskers") from ends of box to last
points inside inner fences
 Show all outliers individually
 Note: perhaps greatest use = with multiple batches
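A sketch of the boxplot quantities for the 20 case ages in the lecture's osteoporosis data. Quartile conventions differ between textbooks and software; this example assumes the "inclusive" method of Python's `statistics.quantiles`, which may not match the hinges used in the lecture.

```python
import statistics

ages = [73, 63, 61, 75, 25, 64, 69, 62, 68, 58,
        57, 62, 64, 66, 58, 70, 62, 67, 70, 42]

# Box ends and median line (quartile convention is an assumption, see above).
q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1

# Inner fences: 1.5 * IQR beyond the box ends.
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

inside = [x for x in ages if lower_fence <= x <= upper_fence]
outliers = sorted(x for x in ages if x < lower_fence or x > upper_fence)
whiskers = (min(inside), max(inside))   # last points inside the inner fences

print(q1, median, q3, iqr)
print(whiskers, outliers)
```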
Osteoporosis data
Osteo   Age   US Score   DPA
0       58    1606       0.837
0       68    1650.25    0.841
0       53    1659.75    0.917
0       68    1662       0.975
0       54    1760.75    0.722
0       56    1770.25    0.801
0       77    1773.5     1.213
0       54    1789       1.027
0       62    1808.25    1.045
0       59    1812.5     0.988
0       72    1822.38    0.907
0       53    1826       0.971
0       61    1828       0.88
0       51    1868.5     0.898
0       61    1898.25    0.806
0       52    1908.88    0.994
0       66    1911.75    1.045
0       53    1935.75    0.869
0       62    1937.75    0.968
0       59    1946       0.957
0       50    2004.5     0.954
0       61    2043.08    1.072
Osteoporosis data
Osteo   Age   US Score   DPA
1       73    1588.66    0.785
1       63    1596.83    0.839
1       61    1608.16    0.786
1       75    1610.5     0.825
1       25    1617.75    0.916
1       64    1626.5     0.839
1       69    1658.33    1.191
1       62    1663.88    0.648
1       68    1674.8     0.906
1       58    1690.5     0.688
1       57    1695.15    0.834
1       62    1703.88    0.6
1       64    1704       0.762
1       66    1704.8     0.977
1       58    1715.75    0.704
1       70    1716.33    0.916
1       62    1739.41    0.86
1       67    1756.75    0.776
1       70    1800.75    0.799
1       42    1884.13    0.879
Introduction: Statistical Modeling
 Statistical models: systematic + random
 Probability modeling involves random part
 Often a few parameters “Θ” left to be estimated by data
 Scientific questions are expressed in terms of Θ
 Model is tool / lens / function for investigating scientific
questions
 "Right" versus "wrong" misguided
 Better: “effective” versus “not effective”
Modeling: Parametric Distributions
 Exponential distribution
F(x) = 1 - e^(-λx) if x ≥ 0
     = 0 otherwise
 Model parameter: λ = rate
 E[X] = 1/λ
 Var[X] = 1/λ²
 Uses
 Time-to-event data
 “Memoryless”
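“Memoryless” means P{X > s + t | X > s} = P{X > t}: having already waited s units does not change the distribution of the remaining wait. A small numerical sketch, assuming a hypothetical rate λ = 0.5 and arbitrary times s and t:

```python
import math

lam = 0.5   # hypothetical rate

def survival(x):
    """P{X > x} = 1 - F(x) = e^(-lam * x) for x >= 0."""
    return math.exp(-lam * x) if x >= 0 else 1.0

s, t = 2.0, 3.0   # arbitrary waiting times
conditional = survival(s + t) / survival(s)   # P{X > s + t | X > s}

print(conditional, survival(t))
```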
Modeling: Parametric Distributions
 Normal distribution
f(x) = (1/(σ√(2π))) exp{-(x-μ)²/(2σ²)}
on support SX = (-∞, ∞).
 Distribution function has no closed form:
 F(x) := ∫_{-∞}^{x} f(t)dt, f given above
 F(x) tabulated, available from software packages
 Model parameters: μ=mean; σ²=variance
Normal distribution
 Characteristics
a) f(x) is symmetric about μ
b) P{μ-σ≤X≤μ+σ} ≈ .68
c) P{μ-2σ≤X≤μ+2σ} ≈ .95
 Why is the normal distribution so popular?
a) If X distributed as (“~”) Normal with parameters
(μ,σ) then (X-μ)/σ = “Z” ~ Normal (μ=0,σ=1)
b) Central limit theorem: Distributions of sample means
converge to normal as n →∞
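The standardization in (a) means any normal probability can be read off the standard normal CDF. The sketch below computes that CDF from the error function and checks the 68% and 95% figures; μ and σ are set to the ultrasound values quoted in the lecture's application, but any values would give the same two probabilities.

```python
import math

def std_normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    """P{X <= x} for X ~ Normal(mu, sigma), via Z = (X - mu) / sigma."""
    return std_normal_cdf((x - mu) / sigma)

mu, sigma = 1761.43, 120.31   # ultrasound mean and SD from the application slide

within_1sd = normal_cdf(mu + sigma, mu, sigma) - normal_cdf(mu - sigma, mu, sigma)
within_2sd = normal_cdf(mu + 2 * sigma, mu, sigma) - normal_cdf(mu - 2 * sigma, mu, sigma)

print(round(within_1sd, 4), round(within_2sd, 4))
```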
Normal distributions
Application
 Question: Is the normal distribution or exponential
distribution a good model for ultrasound measurements
in older women?
 If so, then comparisons between cases, controls reduce to
comparisons of mean, variance
 Method
 Each model predicts the distribution of measurements
 ECDF Fn characterizes the distribution in our sample
 Compare Fn to
 Normal CDF with mean= 1761.43, SD=120.31
 Exponential CDF with rate = 1/1761.43
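The comparison can be sketched numerically with the 42 ultrasound scores from the lecture's osteoporosis tables. The largest vertical gap between Fn and each model CDF is used here as a simple summary of fit; that summary is an illustrative choice, since the lecture itself compares the curves graphically.

```python
import math

controls_us = [1606, 1650.25, 1659.75, 1662, 1760.75, 1770.25, 1773.5, 1789,
               1808.25, 1812.5, 1822.38, 1826, 1828, 1868.5, 1898.25, 1908.88,
               1911.75, 1935.75, 1937.75, 1946, 2004.5, 2043.08]
cases_us = [1588.66, 1596.83, 1608.16, 1610.5, 1617.75, 1626.5, 1658.33,
            1663.88, 1674.8, 1690.5, 1695.15, 1703.88, 1704, 1704.8, 1715.75,
            1716.33, 1739.41, 1756.75, 1800.75, 1884.13]
us_scores = controls_us + cases_us

n = len(us_scores)
mean = sum(us_scores) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in us_scores) / (n - 1))

def normal_cdf(x):
    """Normal CDF with the sample mean and SD plugged in."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def exponential_cdf(x):
    """Exponential CDF with rate = 1 / sample mean."""
    return 1.0 - math.exp(-x / mean) if x >= 0 else 0.0

def sup_distance(model_cdf):
    """Largest gap between the ECDF and a model CDF (data values assumed distinct)."""
    gap = 0.0
    for i, x in enumerate(sorted(us_scores)):
        # Fn jumps from i/n to (i+1)/n at x, so check both sides of the jump.
        gap = max(gap, abs((i + 1) / n - model_cdf(x)), abs(i / n - model_cdf(x)))
    return gap

print(round(mean, 2), round(sd, 2))
print(sup_distance(normal_cdf), sup_distance(exponential_cdf))
```

A much smaller gap for the normal model than for the exponential model would support the normal distribution as the more effective description of these scores.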
Aside
 When is the proposed method a good idea?
 Need Fn to well approximate F if the sample is
representative of a population distributed as F
 Glivenko-Cantelli theorem: Let X1, . . . ,Xn be a
sequence of random variables obtained through
simple random sampling from a population distributed
as F. Then P( lim (n→∞) sup_x |Fn(x) − F(x)| = 0 ) = 1.
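A simulation sketch of the theorem, assuming samples from an exponential distribution with hypothetical rate λ = 1 and a fixed random seed. The largest gap between Fn and F should shrink as n grows; the exact values depend on the seed.

```python
import math
import random

lam = 1.0   # hypothetical rate

def F(x):
    """True CDF of the simulated population (exponential)."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

def sup_distance(sample, cdf):
    """sup over x of |Fn(x) - F(x)|, attained at a jump of Fn for continuous F."""
    n = len(sample)
    gap = 0.0
    for i, x in enumerate(sorted(sample)):
        gap = max(gap, abs((i + 1) / n - cdf(x)), abs(i / n - cdf(x)))
    return gap

rng = random.Random(0)
distances = {}
for n in (10, 100, 1000, 10000):
    sample = [rng.expovariate(lam) for _ in range(n)]
    distances[n] = sup_distance(sample, F)
    print(n, distances[n])
```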
Application – Two models
[Figure: two panels, “Normal” and “Exponential”, each plotting P(US<=score) against ultrasound scores]
Application – Two models
[Figure: two panels, “Normal” and “Exponential”, each plotting P(US<=score) against ultrasound scores]
Main points
 The goal of biostatistics is to learn from biomedical
data involving appreciable variability or uncertainty
 We do this by inferring features of populations from
representative samples of them
 Probability is a tool for characterizing populations,
samples and the uncertainty of our inferences from
samples to populations
 Definitions
 Random variables
 Distributions
 Parameters: Location, spread, other
Main points
 Describing sample distributions is a key step to
making inferences about populations
 If the sample “is” the population: The only step
 ECDF, Summary statistics, data displays
 Models are lenses to focus questions for statistical
analysis
 Parametric distributions
 Normal distribution