Lecture 10. Random Sampling
and Sampling Distributions
David R. Merrell
90-786 Intermediate Empirical
Methods for Public Policy and
Management
Agenda
Normal Approximation to Binomial
Poisson Process
Random sampling
Sampling statistics and sampling
distributions
Expected values and standard errors of
sample sums and sample means
Binomial Random Variable
Binomial random variable X is the number
of “successes” in n trials, where
Probability of success remains the same
from trial to trial
Trials are independent
Binomial Probability Distribution
Discrete distribution with:
$P(X = x) = \frac{n!}{x!\,(n-x)!}\, p^{x} q^{n-x}$
n is number of trials
x is number of successes in n trials
(x = 0, 1, 2, ..., n)
p is the probability of success on a single trial
q is the probability of failure on a single trial (q = 1 - p)
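The formula above can be evaluated directly. The following is a minimal Python sketch, using only the standard library; the function name `binomial_pmf` is chosen here for illustration and is not part of the lecture.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for n independent trials with success probability p."""
    q = 1.0 - p  # probability of failure on a single trial
    return comb(n, x) * p**x * q ** (n - x)


# Example: n = 10 trials, p = 0.4
pmf = [binomial_pmf(x, 10, 0.4) for x in range(11)]
for x, prob in enumerate(pmf):
    print(f"{x:2d}  {prob:.6f}")
```

The probabilities over x = 0, 1, ..., n should sum to 1, which is a quick sanity check on any implementation of the formula.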
Properties of the Binomial RV
Mean: $\mu = np$
Variance: $\sigma^2 = npq$
Standard Deviation: $\sigma = \sqrt{npq}$
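As a check on these properties, the mean and variance can be computed from the probability distribution itself and compared with np and npq. This is a small sketch for n = 10 and p = 0.4; the variable names are illustrative.

```python
from math import comb, sqrt

n, p = 10, 0.4
q = 1 - p

# Probability of each possible number of successes
pmf = [comb(n, x) * p**x * q ** (n - x) for x in range(n + 1)]

# Mean and variance computed from the definition of expected value
mean = sum(x * px for x, px in enumerate(pmf))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))
std_dev = sqrt(variance)

print(f"mean     = {mean:.4f}   (np  = {n * p:.4f})")
print(f"variance = {variance:.4f}   (npq = {n * p * q:.4f})")
print(f"std dev  = {std_dev:.4f}")
```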
Binomial(n = 10, p = .4)
  x    P(X=x)
  0    0.006047
  1    0.040311
  2    0.120932
  3    0.214991
  4    0.250823
  5    0.200658
  6    0.111477
  7    0.042467
  8    0.010617
  9    0.001573
 10    0.000105
[Figure: chart of P(X=x) against x for Binomial(n = 10, p = .4)]
Approximation to Binomial
Distribution
Use normal distribution when:
n is large
np > 10
n(1 - p) > 10
Parameters of the approximating
normal distribution are the mean and
standard deviation from the binomial
distribution
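The rule of thumb above can be written as a small helper. This is a sketch only: the function name `normal_approximation` is hypothetical, and returning `None` when the conditions fail is a design choice made here, not something the lecture prescribes.

```python
from math import sqrt
from statistics import NormalDist


def normal_approximation(n: int, p: float):
    """Return the approximating normal distribution, or None if the
    rule of thumb (np > 10 and n(1 - p) > 10) is not satisfied."""
    if n * p > 10 and n * (1 - p) > 10:
        # Mean and standard deviation are taken from the binomial
        return NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
    return None


approx = normal_approximation(80, 0.4)     # np = 32, n(1 - p) = 48
too_small = normal_approximation(10, 0.4)  # np = 4, rule not satisfied
print(approx, too_small)
```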
Approximation of Binomial
Distribution
[Figure: binomial probabilities (C2, 0.00 to 0.09) plotted against x (C1), n = 80, p = .4]
How Good is the Approximation?
P(X < 29)
Binomial with n = 80 and p = 0.400000
       x    P( X <= x)
   28.00        0.2131
Normal with mean = 32.0000 and standard deviation = 4.38000
       x    P( X <= x)
 28.0000        0.1806
       x    P( X <= x)
 28.5000        0.2121
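The comparison can be reproduced outside Minitab. The sketch below computes the exact binomial probability P(X <= 28) and the normal approximation evaluated at 28 and at 28.5; evaluating at 28.5 is the usual continuity correction, which is the interpretation assumed here for the third output.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 80, 0.4

# Exact binomial: P(X < 29) = P(X <= 28)
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(29))

# Normal distribution with the binomial mean and standard deviation
normal = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
plain = normal.cdf(28.0)      # no continuity correction
corrected = normal.cdf(28.5)  # with continuity correction

print(f"exact binomial      : {exact:.4f}")
print(f"normal at 28.0      : {plain:.4f}")
print(f"normal at 28.5      : {corrected:.4f}")
```

The value at 28.5 is much closer to the exact probability than the value at 28, because the normal curve is approximating a distribution that only takes whole-number values.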
Application 1
The Chicago Equal Employment Commission
believes that the Chicago Transit Authority
(CTA) discriminates against Republicans. The
records show that 37.5% of the individuals
listed as passing the CTA exam were
Republicans; the remainder were Democrats
(no one registers as an independent in
Illinois). CTA hired 30 people last year; 25 of
them were Democrats. What is the
probability that this situation could exist if
CTA did not discriminate?
Application 1 (cont.)
Success: a Republican is hired
The probability of success, p = 0.375
The number of trials, n = 30
The number of successes, x = 5
P(X ≤ 5) = ???
Application 1 (cont.)
Mean: $\mu = np$ = 30*.375 = 11.25
Variance: $\sigma^2 = npq$ = 30*.375*.625 = 7.03
Standard Deviation: $\sigma = \sqrt{npq}$ = 2.65
Normal with mean = 11.25 and standard deviation = 2.65
       x    P( X <= x)
  5.5000        0.0150
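The calculation for this application can be sketched in Python. The normal approximation is evaluated at 5.5, as in the Minitab output; the exact binomial probability is also computed for comparison, which is an addition here and not part of the original slides.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 30, 0.375  # 30 hires, 37.5% of passing candidates are Republicans
q = 1 - p

mean = n * p              # 11.25
std_dev = sqrt(n * p * q) # about 2.65

# Normal approximation to P(X <= 5), evaluated at 5.5
approx = NormalDist(mu=mean, sigma=std_dev).cdf(5.5)

# Exact binomial probability of 5 or fewer Republicans among 30 hires
exact = sum(comb(n, k) * p**k * q ** (n - k) for k in range(6))

print(f"normal approximation: {approx:.4f}")
print(f"exact binomial      : {exact:.4f}")
```

Either way the probability is small, so hiring only 5 Republicans out of 30 would be unusual if party played no role in hiring.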
Poisson Process
[Figure: time axis starting at 0, with events (x) occurring at random times at rate $\lambda$]
Assumptions
time homogeneity
independence
no clumping
Poisson Process
Earthquakes strike randomly over time
with a rate of $\lambda$ = 4 per year.
Model time of earthquake strike as a
Poisson process
Count: How many earthquakes will
strike in the next six months?
Duration: How long will it take before
the next earthquake hits?
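Both questions can be explored by simulation. The sketch below generates event times by adding up independent exponential waiting times, which is one standard way to simulate a Poisson process; the function name, the seed, and the number of repetitions are arbitrary choices made for illustration.

```python
import random


def simulate_event_times(rate: float, horizon: float, rng: random.Random):
    """Event times of a Poisson process on [0, horizon], built from
    independent exponential gaps between successive events."""
    times = []
    t = rng.expovariate(rate)
    while t <= horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times


rng = random.Random(0)  # fixed seed so the run is repeatable
rate = 4.0              # earthquakes per year

# Count: number of earthquakes in the next six months (0.5 year)
counts = [len(simulate_event_times(rate, 0.5, rng)) for _ in range(20000)]
mean_count = sum(counts) / len(counts)

# Duration: waiting time (in years) until the next earthquake
waits = [rng.expovariate(rate) for _ in range(20000)]
mean_wait = sum(waits) / len(waits)

print(f"average count in six months: {mean_count:.3f}")
print(f"average wait in years      : {mean_wait:.3f}")
```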
Count: Poisson Distribution
What is the probability that 3
earthquakes will strike during the next
six months?
Poisson Distribution
Count in time period t
$P(Y = y) = \frac{e^{-\lambda t} (\lambda t)^{y}}{y!}, \quad y = 0, 1, \ldots$
Minitab Probability Calculation
Click: Calc > Probability Distributions >
Poisson
Enter: For mean 2, input constant 3
Output:
Probability Density Function
Poisson with mu = 2.00000
       x    P( X = x)
    3.00       0.1804
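The same probability can be computed from the Poisson formula. This is a minimal sketch; `poisson_pmf` is an illustrative name, and the mean of 2 comes from a rate of 4 per year over half a year.

```python
from math import exp, factorial


def poisson_pmf(y: int, rate: float, t: float) -> float:
    """P(Y = y) for the count in a period of length t at the given rate."""
    mean = rate * t
    return exp(-mean) * mean**y / factorial(y)


# Three earthquakes in six months at 4 per year: mean = 4 * 0.5 = 2
prob = poisson_pmf(3, rate=4.0, t=0.5)
print(f"P(Y = 3) = {prob:.4f}")
```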
Duration: Exponential Distribution
Time between occurrences in a Poisson
process
Continuous probability distribution
Mean = $1/(\lambda t)$
Exponential Probability Problem
What is the probability that 9 months
will pass with no earthquake?
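t = 1/12 (one month, in years), $\lambda t$ = 1/3 (earthquakes per month)
$1/(\lambda t)$ = 3 (mean of 3 months between earthquakes)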
Minitab Probability Calculation
Click: Calc > Probability Distributions >
Exponential
Enter: For mean 3, input constant 9
Output:
Cumulative Distribution Function
Exponential with mean = 3.00000
       x    P( X <= x)
  9.0000        0.9502
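Note that the Minitab output is the cumulative probability that the next earthquake arrives within 9 months; the probability that 9 months pass with no earthquake is its complement. A short sketch of both values, with an illustrative function name:

```python
from math import exp


def exponential_cdf(x: float, mean: float) -> float:
    """P(X <= x) for an exponential distribution with the given mean."""
    return 1.0 - exp(-x / mean)


mean_months = 3.0  # mean time between earthquakes, in months
p_within_9 = exponential_cdf(9.0, mean_months)  # earthquake within 9 months
p_none_in_9 = 1.0 - p_within_9                  # no earthquake for 9 months

print(f"P(X <= 9) = {p_within_9:.4f}")
print(f"P(X > 9)  = {p_none_in_9:.4f}")
```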
Exponential Probability Density
Function
MTB > set c1
DATA > 0:12000
DATA > end
MTB > let c1 = c1/1000
Click: Calc > Probability distributions > Exponential
> Probability density > Input column
Enter: Input column c1 > Optional storage c2
Click: OK > Graph > Plot
Enter: Y c2 > X c1
Click: Display > Connect > OK
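The Minitab steps above build a grid of values from 0 to 12 and evaluate the exponential density on it. A rough Python equivalent is sketched below, assuming the mean of 3 from the earlier example (the slide does not state which mean was used for the plot); plotting is left out so the sketch needs only the standard library.

```python
from math import exp

mean = 3.0  # assumed mean; not stated on the slide

# Grid 0.000, 0.001, ..., 12.000 (same as 0:12000 divided by 1000)
c1 = [i / 1000 for i in range(12001)]

# Exponential probability density f(x) = (1/mean) * exp(-x/mean)
c2 = [exp(-x / mean) / mean for x in c1]

# Print a few points of the curve
for x, fx in list(zip(c1, c2))[::3000]:
    print(f"x = {x:6.3f}   f(x) = {fx:.4f}")
```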
Exponential Probability Density
Function
[Figure: exponential probability density (C2, 0.0 to 0.3) plotted against C1]
Sampling
Population - entire set of objects that
we are interested in studying
Sample - a chosen subset of a
population
Some Samples Are ...
random -- each item in the population
has an equal chance of being selected
to be part of the sample
representative -- has the same
characteristics as the population under
study, a microcosm of the population
Population Parameters and Sample
Statistics
Population Parameter
Numerical descriptor of a population
Values usually uncertain
e.g., population mean ($\mu$), population standard
deviation ($\sigma$)
Sample Statistics
Numerical descriptor of a sample
Calculated from observations in the sample
e.g., sample mean $\bar{X}$, sample standard deviation S
What is a sampling distribution?
Sample statistics are random variables
Sample statistics have probability
distributions
“Sampling distribution” is the probability
distribution of a sample statistic
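A simulation can make the idea concrete. The sketch below draws many random samples from a population and records the sample mean each time; the collection of those means approximates the sampling distribution of the sample mean. The population here is synthetic and purely hypothetical (it is not the restaurant data), and the seed and sizes are arbitrary.

```python
import random
from math import sqrt
from statistics import mean, pstdev, stdev

rng = random.Random(1)

# Hypothetical population of 10,000 values (synthetic, for illustration only)
population = [rng.gauss(5.0, 1.0) for _ in range(10000)]

sample_size = 50
num_samples = 2000

# Each random sample gives one value of the sample mean
sample_means = [mean(rng.sample(population, sample_size))
                for _ in range(num_samples)]

pop_mean = mean(population)
pop_sd = pstdev(population)

print(f"population mean          : {pop_mean:.3f}")
print(f"average of sample means  : {mean(sample_means):.3f}")
print(f"std dev of sample means  : {stdev(sample_means):.3f}")
print(f"population sd / sqrt(n)  : {pop_sd / sqrt(sample_size):.3f}")
```

The sample means cluster around the population mean, and their spread is close to the population standard deviation divided by the square root of the sample size.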
MTB > Retrieve 'C:\MTBWIN\DATA\RESTRNT.MTW'.
Retrieving worksheet from file: C:\MTBWIN\DATA\RESTRNT.MTW
Worksheet was saved on 5/31/1994
MTB > info
Information on the Worksheet
Column   Name       Count   Missing
C1       ID           279         0
C2       OUTLOOK      279         1
C3       SALES        279        25
C4       NEWCAP       279        55
C5       VALUE        279        39
C6       COSTGOOD     279        42
C7       WAGES        279        44
C8       ADS          279        44
C9       TYPEFOOD     279        12
C10      SEATS        279        11
C11      OWNER        279        10
C12      FT.EMPL      279        14
C13      PT.EMPL      279        13
C14      SIZE         279        16
MTB > desc 'sales'
Descriptive Statistics
Variable      N    N*     Mean   Median   TrMean    StDev   SEMean
SALES       254    25    332.6    200.0    248.9    650.5     40.8
Variable    Min      Max       Q1       Q3
SALES       0.0   8064.0     83.7    382.7
MTB > boxp 'sales'
* NOTE * N missing = 25
[Figure: boxplot of SALES, vertical axis from 0 to 8000]
MTB > hist 'sales'
* NOTE * N missing = 25
[Figure: histogram of SALES, Frequency (0 to 200) against SALES (0 to 8000)]
MTB > let c15 = loge('sales')
MTB > let c15 = loge('sales')
J
*** Values out of bounds during operation at J
Missing returned 1 times
MTB > let c15 = loge('sales' + 1)
MTB > name c15 'logsales'
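The out-of-bounds message arises because at least one restaurant has sales of 0, and the logarithm of 0 is undefined; adding 1 before taking the log avoids this. A small Python sketch of the same transformation, using only the minimum, median, and maximum SALES values reported earlier (not the full data set):

```python
import math

# Summary values of SALES from the descriptive statistics: min, median, max
sales = [0.0, 200.0, 8064.0]

# The natural log of 0 is undefined; Python raises ValueError
try:
    math.log(sales[0])
    log_of_zero_ok = True
except ValueError:
    log_of_zero_ok = False

# Shifting by 1 keeps every value in the domain of the logarithm
logsales = [math.log(value + 1) for value in sales]

print("log(0) defined:", log_of_zero_ok)
print([round(v, 4) for v in logsales])
```

The shifted transformation maps sales of 0 to a log value of 0, which is consistent with the minimum of 0.0000 in the logsales summary.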
MTB > desc 'logsales'
Descriptive Statistics
Variable      N    N*     Mean   Median   TrMean    StDev   SEMean
logsales    254    25   5.1830   5.3033   5.2134   1.1387   0.0715
Variable       Min      Max       Q1       Q3
logsales    0.0000   8.9953   4.4394   5.9500
MTB > boxp 'logsales'
* NOTE * N missing = 25
[Figure: boxplot of logsales, vertical axis from 0 to 9]
[Figure: histogram of logsales, Frequency (0 to 90) against logsales (0 to 9)]
Four Samples of Size 50 From Restaurant “Logsales” Data--Histograms
[Figure: four histograms (Frequency against value), one for each of the samples stored in C16, C17, C18, and C19]
Random Samples from Restaurant “Logsales” Data--Summary
MTB > Desc c16-c19
Descriptive Statistics
Variable     N    N*    Mean   Median   TrMean   StDev   SEMean
C16         43     7   5.246    5.375    5.280   0.867    0.132
C17         43     7   5.351    5.352    5.383   1.223    0.186
C18         48     2   5.366    5.461    5.388   0.888    0.128
C19         43     7   5.244    5.198    5.253   0.937    0.143
Variable     Min     Max      Q1      Q3
C16        2.773   6.621   4.625   5.787
C17        1.099   8.456   4.710   6.176
C18        2.485   7.091   4.961   5.994
C19        3.434   6.868   4.595   6.089
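The SEMean values in the summary are the standard errors of the sample means, computed as StDev divided by the square root of N (the number of non-missing observations). The sketch below recomputes them from the rounded N and StDev values shown for the four samples, so small differences in the last digit are expected.

```python
from math import sqrt

# (N, StDev) for each sample, as reported in the Minitab summary
samples = {
    "C16": (43, 0.867),
    "C17": (43, 1.223),
    "C18": (48, 0.888),
    "C19": (43, 0.937),
}

# Standard error of the sample mean: s / sqrt(n)
standard_errors = {name: sd / sqrt(n) for name, (n, sd) in samples.items()}

for name, se in standard_errors.items():
    print(f"{name}: SEMean = {se:.3f}")
```

The four sample means (5.246, 5.351, 5.366, 5.244) differ from one another by amounts comparable to these standard errors, which is what sampling variation would lead one to expect.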
Next Time ...
Central Limit Theorem--“Sample
averages are approximately normally
distributed”