Short Course in
Statistics
Learning Statistics through Computer
 Note: some slides need Microsoft Chinese
Windows to display correctly
1
Random Sampling
To obtain information through sampling
 Population and Sample
 Parameter and Statistic
2
Population versus
Sample
Population
– The entire group of
individuals about
which we want
information
Sample
– A part of the
population from
which we actually
collect information,
used to draw
conclusions about
the whole
population.
3
Example
Population = the
measurements of
weights of all
children under 18
Sample = the
measurements of
weights of students
in 20 secondary
and primary
schools
4
Parameter versus
Statistic
Parameter
– A number that
describes the
population.
Statistic
– A number that
describes a sample.
5
Drawing balls from a
box
A box contains 10 balls: 5 red, 5 black
 Population: 10 balls
 Parameter: proportion of red balls
 Draw a random sample of size 3
 Statistic: proportion of red balls in the sample,
e.g. 2/3
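
The box example can be tried on a computer. The following is a minimal sketch, assuming Python's standard `random` module stands in for physically drawing the balls; the helper name `draw_sample` and the seed are illustrative choices, not part of the course material.

```python
import random

def draw_sample(box, size, rng):
    """Draw `size` balls from the box without replacement."""
    return rng.sample(box, size)

box = ["red"] * 5 + ["black"] * 5          # population: 10 balls
parameter = box.count("red") / len(box)    # proportion of red balls in the box

rng = random.Random(1)                     # fixed seed (assumption) for a repeatable run
sample = draw_sample(box, 3, rng)          # random sample of size 3
statistic = sample.count("red") / len(sample)

print("parameter:", parameter)
print("sample:", sample, "statistic:", statistic)
```

The parameter stays fixed at 0.5, while the statistic changes whenever a different sample is drawn.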
6
Statistical Science
Statistics provides methodology to
estimate the parameter through the
(random) sample
7
How to draw a random
sample
Construct a sampling frame---give a
number (name) to each individual in
the population
 Use “random number table” to draw a
random sample of prescribed size
8
Random Number Table
Imagine a box containing 10
identical balls numbered 0, 1, 2, 3,
4, 5, 6, 7, 8 and 9.
 Each time, draw a ball, record its
number, return it to the box and
draw the next ball --- the resulting list
(record) is the “random number table”
9
Example
Objective---draw a sample of size 5
from a class of 30 students
 Sampling frame---label each student
with the numbers 00, 01,…29.
 Read the random number table at line
130 ---- 69051 64817 87174 09517
 Split the digits into pairs:
69 05 16 48 17 87 17 40 95 17
10
Multiple Label
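Give each student several labels so that fewer pairs are wasted:
00=30=60, 01=31=61, 02=32=62, etc.
 Pairs 90 to 99 are not used, so that every student has the same number of labels
 Notice 01 will correspond to the
second individual (the labels start at 00)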
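
The pairing and multiple-label rules can be written out as a short program. This is a sketch under the assumptions stated in the comments: the function name `sample_from_digits` is hypothetical, repeated students are skipped, and pairs 90 to 99 are discarded so that each student keeps exactly three labels.

```python
def sample_from_digits(digits, population_size, sample_size):
    """Pick distinct labels 0..population_size-1 from a string of random digits."""
    clean = [ch for ch in digits if ch.isdigit()]
    # Read the digits two at a time: 69, 05, 16, ...
    pairs = [int(clean[i] + clean[i + 1]) for i in range(0, len(clean) - 1, 2)]
    # Largest multiple of the population size below 100 (90 for 30 students);
    # pairs at or above it are skipped so all labels are equally likely.
    limit = (100 // population_size) * population_size
    chosen = []
    for value in pairs:
        if value >= limit:
            continue
        label = value % population_size   # multiple labels: 00 = 30 = 60
        if label not in chosen:           # a student can be chosen only once
            chosen.append(label)
        if len(chosen) == sample_size:
            break
    return chosen

line_130 = "69051 64817 87174 09517"
sample = sample_from_digits(line_130, population_size=30, sample_size=5)
print(sample)  # [9, 5, 16, 18, 17]
```

With multiple labels the first five pairs (69, 05, 16, 48, 17) already give five different students, whereas using only the labels 00 to 29 this line would yield just 05, 16 and 17.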
11
Measurements in the
Laboratory
Each measurement in the physics lab
or chemistry lab can be regarded as
an element in a random sample
12
http://www.cuhk.edu.hk/webct
 User ID & Password
=STA2103(Surname)(Initials)
Go to the above website and learn sample
survey, design of experiment and regression
13
Henry,Chau,STA2103chauh
Ka Ho Enoch,Chan,STA2103chankhe
Jane,Tang,STA2103tangj
Vincent,Pong,STA2103pongv
Clara,Yip,STA2103yipc
14
Why Random Sampling
To be representative
 Some laws govern the statistic (its sampling distribution)
 Probability---the chance of the
occurrence of an event in n
independent samplings---can be
computed
15
Not representative
Call in
 Voluntary response on the Web
 Telephone survey asking the
respondents to respond with the
number keys
 Readers’ letters to the newspaper
16
Sampling Distribution
Random sampling → the statistic
changes as the sample varies
 That is, different samples might
lead to different conclusions
 But, if the samples are randomly
drawn, we can predict the result with
high probability
17
Example
Population: Hong Kong adult residents
 Sample (random): 600 persons
 Parameter: proportion of the
population supporting one more public
holiday
 Statistic: proportion in the sample
18
Consequence of
Random Sampling
If we draw 1000 samples (each
of size 600) and compute
the statistic for each sample, the
histogram of these 1000 sample
proportions is approximately a bell-shaped curve---the normal density
19
Normal and Probability
The normal density has 2 parameters:
 Mean --- true proportion (p)
 Variance --- var=p(1-p)/n
 Standard deviation (std)=sqrt(var)
 The sample proportion from the one sample
we draw falls in the interval (p-1.96
std, p+1.96 std) with probability 0.95
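
The claims about the sampling distribution can be checked by simulation. The sketch below assumes a true proportion p = 0.6 (an illustrative value, not from the survey) and uses only the Python standard library; the helper name `sample_proportion` is hypothetical.

```python
import math
import random

def sample_proportion(p, n, rng):
    """Proportion of supporters in one random sample of size n."""
    return sum(rng.random() < p for _ in range(n)) / n

p, n, repeats = 0.6, 600, 1000   # p = 0.6 is an assumed value for illustration
rng = random.Random(2103)        # fixed seed (assumption) for a repeatable run
props = [sample_proportion(p, n, rng) for _ in range(repeats)]

mean = sum(props) / repeats
var = sum((x - mean) ** 2 for x in props) / (repeats - 1)
std = math.sqrt(p * (1 - p) / n)  # standard deviation from the formula
inside = sum(p - 1.96 * std < x < p + 1.96 * std for x in props) / repeats

print(f"mean of the sample proportions: {mean:.4f} (true p = {p})")
print(f"variance of the sample proportions: {var:.6f} (p(1-p)/n = {std ** 2:.6f})")
print(f"fraction inside (p-1.96 std, p+1.96 std): {inside:.3f}")
```

The mean should be close to p, the variance close to p(1-p)/n, and roughly 95% of the sample proportions should fall inside the interval.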
20
Mean of normal=true
parameter
If you draw a sample 1000 times, you
have 1000 sample proportions.
 The average of these 1000 sample
proportions would be approximately
the true proportion --- sample
proportion is an unbiased estimate of
the population proportion
21
Variance=p(1-p)/n
If the sampling is truly random, the
variance of these 1000 sample
proportions depends only on p (the parameter) and n.
 So with only one sample that gives an accurate
estimate of p, the variance of the
1000 sample proportions can be
computed without actually drawing the 1000
samples
22
Intuition behind the
formula p(1-p)/n
Symmetric about ½
 It is maximized at p=1/2 (very
uncertain)
 When p is closer to 0 or 1, i.e., things
are more definite, the variance gets
smaller
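
A few values make the shape of the formula visible. This is a small sketch; the sample size n = 600 is borrowed from the Hong Kong survey example and the function name is hypothetical.

```python
def proportion_variance(p, n):
    """Variance of a sample proportion: p(1-p)/n."""
    return p * (1 - p) / n

n = 600  # sample size taken from the survey example (assumption for illustration)
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p = {p:.1f}  variance = {proportion_variance(p, n):.6f}")
```

The printed values rise towards p = 0.5 and fall symmetrically on either side.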
23
Confidence Interval
Conversely, p will be covered by the
interval (p̂-1.96 std, p̂+1.96 std), centred
at the sample proportion p̂, about 95
times out of 100 such experiments.
 Notice std=sqrt(p(1-p)/n); in practice p is replaced by p̂
24
95% Confidence
Interval
Applying the formula to 100 surveys, we
obtain 100 different interval estimates
 About 95 out of these 100 intervals would
contain the true p
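
The "95 out of 100" statement can be illustrated by simulating 100 surveys. A minimal sketch, assuming a true proportion p = 0.6 and a sample size of 600 (both illustrative), with the interval centred at each survey's own sample proportion; the function names are hypothetical.

```python
import math
import random

def confidence_interval(p_hat, n, z=1.96):
    """Approximate 95% interval centred at the sample proportion p_hat."""
    std = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * std, p_hat + z * std

def count_covering(p, n, surveys, rng):
    """Number of simulated surveys whose interval contains the true p."""
    covered = 0
    for _ in range(surveys):
        p_hat = sum(rng.random() < p for _ in range(n)) / n
        low, high = confidence_interval(p_hat, n)
        if low < p < high:
            covered += 1
    return covered

rng = random.Random(95)  # fixed seed (assumption) for a repeatable run
covered = count_covering(p=0.6, n=600, surveys=100, rng=rng)
print(f"{covered} of 100 intervals contain the true p")
```

The count varies from run to run but should stay near 95.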
25
Opinion Polls
People may not give the true
response --- response error
 People may not answer the
questions --- nonresponse error
– Unit nonresponse (the person does not
respond at all)
– Item nonresponse (the person does not
respond to some questions)
26
Response rate
If the response rate is less than 80%,
we would doubt the validity of
the inference
27
Election Polls
The respondent may not be a voter
 The respondent may not vote even if
he/she has registered
 The respondent may lie (response
error)
28
Questionnaire
The way the questions are worded affects
the response (a well-known effect)
29
Other Data Collection
Methods
Experimental Design
 Observational Data (e.g. registry Data)
30
How to know the effect
of vaccine in
preventing polio
We cannot simply give the vaccine to all
children and compare the results with those of the
past
 We need two groups:
control group (no “real” treatment)
treatment group (apply the vaccine)
31
We should compare the
two groups under
“equal” conditions
People are different from each other
 By random assignment of participants
into the two groups, we can make the
two groups have almost identical
conditions – e.g., around the same on
average
32
Design of an
Experiment
To compare one treatment (A) with
another treatment (B), we need to
randomly assign each patient to the group
receiving one of the treatments
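
Random assignment is easy to carry out by computer. The sketch below shuffles a list of participants and splits it in half; the participant IDs, the seed and the function name `randomize` are all hypothetical.

```python
import random

def randomize(participants, rng):
    """Shuffle the participants and split them into two groups."""
    shuffled = list(participants)   # copy, so the original list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

participants = [f"patient{i:02d}" for i in range(1, 21)]  # hypothetical IDs
rng = random.Random(7)                                    # fixed seed (assumption)
group_a, group_b = randomize(participants, rng)

print("treatment A:", group_a)
print("treatment B:", group_b)
```

Because chance alone decides who goes where, the two groups should be similar on average, including on characteristics nobody measured.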
33
Some possible
mistakes
Data---from hospital record
 Death rates of surgical patients are
different for operations with different
anesthetics
 Halothane (1.7%), Pentothal (1.7%),
Cyclopropane (3.4%), Ether (1.9%)
 Can we say that cyclopropane is more
dangerous than the other anesthetics?
34
Answer
No! The patients in the worst condition were
the ones receiving cyclopropane.
35
The vaccine can
prevent Polio
1954---USA---over two million children
involved
 Should they all receive the vaccine?
 Should the boys receive the vaccine while
the girls receive a placebo?
36
Placebo
In this case, the placebo is another
liquid, similar to the vaccine in
its appearance, injected into the children.
 It is used so that all children
receive the “same” treatment, so that
a difference in the results cannot
be explained as a psychological effect
37
Data
Group               Polio (after half a year)   No polio (after half a year)
Control (placebo)   a = 115                     b = 201,114
Treatment           c = 33                      d = 200,712
38
Analysis
The proportion of the control group having
polio after ½ year --- a/(a+b)=0.00057
 The proportion of the treatment group
having polio after ½ year --- c/(c+d)=0.00016
 The effect of treatment---
– RD (risk difference)=a/(a+b) - c/(c+d)
=0.00041
39
Formulation of the
Hypotheses
Null Hypothesis: no difference in the
proportions
 Alternative Hypothesis: the two
proportions are different
40
Analysis
We need to compare RD with its
variation
 That is, if we have different
experiments, the results are different.
The variation of these results can be
measured by its variance.
 But we have only one experiment
41
Estimate the variation
If there is no effect of the vaccine,
the true risk (probability) of getting
polio is estimated by pr=(a+c)/(a+b+c+d)=0.00037
 Under the above hypothesis, the variance
of RD is given by
pr(1-pr) x (1/(a+b)+1/(c+d))
 The standard deviation is 0.000061.
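
The arithmetic can be reproduced from the four counts in the data table. This is a sketch, taking RD as the control risk minus the treatment risk and multiplying pr(1-pr) by the sum of the reciprocal group sizes, which is what reproduces the standard deviation of about 0.000061; the variable names are illustrative.

```python
import math

a, b = 115, 201_114   # control (placebo): polio, no polio
c, d = 33, 200_712    # treatment (vaccine): polio, no polio

risk_control = a / (a + b)
risk_treatment = c / (c + d)
rd = risk_control - risk_treatment            # risk difference

pooled = (a + c) / (a + b + c + d)            # common risk if the vaccine has no effect
variance = pooled * (1 - pooled) * (1 / (a + b) + 1 / (c + d))
std = math.sqrt(variance)
ratio = rd / std

print(f"control risk   = {risk_control:.5f}")
print(f"treatment risk = {risk_treatment:.5f}")
print(f"RD = {rd:.5f}, pooled risk = {pooled:.5f}")
print(f"std = {std:.6f}, ratio RD/std = {ratio:.2f}")
```

With unrounded inputs the ratio comes out at about 6.7, slightly below the value obtained from the rounded figures on the slides.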
42
Contd.
Thus the ratio 0.00041/0.000061=6.76
measures the effect of vaccine.
 Does 6.76 indicate a large, a small or no
effect?
 We need a yardstick.
43
Intuition
Thus the ratio (RD/std) measures the
effect of the vaccine.
 That is, if it is large in absolute value,
the effect of vaccine is significant
 How large is large?
44
Random assignment of
patients to treatments
If we do the experiment 1000 times
and each time we calculate the ratio
 We also assume that the effect of
vaccine is zero..
 Then we plot the histogram of the
1000 ratios. We find the histogram is
close to a bell-shape curve---normal
density curve.
45
Normality
We know that the ratio is approximately normal
under the hypothesis of no effect, and we have observed 6.76.
 We can compute the area to the right
of 6.76 --- the probability that the ratio
is larger than 6.76 under the
hypothesis of no effect. We find the
area is very small (6.9 x 10^{-12})
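
The area to the right of a value under the standard normal curve can be computed with the complementary error function in Python's `math` module. A minimal sketch; the function name `normal_upper_tail` is hypothetical.

```python
import math

def normal_upper_tail(z):
    """Area to the right of z under the standard normal density."""
    return 0.5 * math.erfc(z / math.sqrt(2))

print(f"area beyond 6.76: {normal_upper_tail(6.76):.2e}")
print(f"area beyond 1.96: {normal_upper_tail(1.96):.4f}")
```

The second line is a sanity check: the area beyond 1.96 is about 0.025, which is where the 1.96 in the 95% interval comes from.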
46
P value
The area corresponds to the probability
of observing a result at least as extreme as
the observed value when the null hypothesis is true
 The usual rule --- if the p-value < 0.05, reject
the null hypothesis
 0.05 can be interpreted as follows: when the null hypothesis is true,
about 5 out of 100 experiments would wrongly reject it
47
Chi Square Test: Another approach
We can apply the chi square test to the
same data set.
 The chi square test is used to test
whether the proportion of getting polio
is the same for the two groups
(homogeneity). Equivalently, whether
the occurrence of polio is independent
of the treatment (group)
48
Analysis
The chi square test statistic is given by
N(ad - bc)**2/((a+b)(a+c)(b+d)(c+d))
 N=a+b+c+d
 When the statistic is large, the
hypothesis is likely to be wrong
49
Statistical Reasoning
The above statistic can be expressed
as the summation of the quantities
 (observed counts-expected counts)**2
divided by the expected counts
 Here expected counts means the
average counts under the hypothesis
that the two groups are the same
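
Both descriptions of the chi square statistic can be computed from the polio counts and compared. This sketch uses the counts from the data table; the variable names are illustrative, and the expected count of a cell is taken as (row total x column total)/N.

```python
a, b = 115, 201_114   # control (placebo): polio, no polio
c, d = 33, 200_712    # treatment (vaccine): polio, no polio
n = a + b + c + d

# Shortcut formula for a 2 x 2 table.
shortcut = n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))

# Sum of (observed - expected)^2 / expected over the four cells.
observed = [[a, b], [c, d]]
row_totals = [a + b, c + d]
col_totals = [a + c, b + d]
by_cells = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / n
        by_cells += (observed[i][j] - expected) ** 2 / expected

print(f"shortcut formula: {shortcut:.2f}")
print(f"sum over cells:   {by_cells:.2f}")
```

The two numbers agree, and the statistic (about 45) is also the square of the ratio RD/std computed from the same counts.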
50
Chi Square distribution
Chi square distribution with one
degree of freedom
 Significance level=0.05
 Cutoff point 3.84, i.e., reject if the chi
square statistic is larger than 3.84.
Otherwise, do not reject the null hypothesis.
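
The cutoff 3.84 can be checked without tables. A chi square variable with one degree of freedom is the square of a standard normal variable, so its upper tail area follows from the error function; the function name below is hypothetical.

```python
import math

def chi_square_1df_upper_tail(x):
    """P(chi square with 1 df > x), which equals P(|Z| > sqrt(x)) for standard normal Z."""
    return math.erfc(math.sqrt(x / 2))

print(f"area beyond 3.84: {chi_square_1df_upper_tail(3.84):.4f}")
```

The area beyond 3.84 is about 0.05, and 3.84 is approximately 1.96 squared.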
51
T-test (Two-Sample
unpaired)
Randomize female rats into two
groups (high-protein and low-protein diets)
 Response variable: gain in weight
between the 28th and 84th days of age
52
Data
High protein
134 146 104 119
124 161 107 83 113
129 97 123
– Mean=120
– Variance=457.5
Low protein
70 118 101 85 107
132 94
– Mean=101
– Variance=425.3
53
Hypotheses
Null hypothesis: no difference in the
two means
 Alternative hypothesis: the means are
different
54
Analysis
The difference of the two means
=120 - 101=19
 19 measures the difference in weight
gains between two groups
 Is it large or small? By chance?
 We need to compare with its standard
deviation
55
Variance and standard
deviation
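Standard deviation=square root of
variance
$S = \sqrt{S^2}$
56
$$S_1^2 = \frac{\sum_{i=1}^{n_1} (X_{1i} - \bar{X}_1)^2}{n_1 - 1}, \qquad S_2^2 = \frac{\sum_{i=1}^{n_2} (X_{2i} - \bar{X}_2)^2}{n_2 - 1}$$
$$S_p^2 = \frac{\sum_{i=1}^{n_1} (X_{1i} - \bar{X}_1)^2 + \sum_{i=1}^{n_2} (X_{2i} - \bar{X}_2)^2}{n_1 + n_2 - 2}$$
57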
Indicator
$$t = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{1/n_1 + 1/n_2}}$$
This is a better indicator of the
difference between the two
groups
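
The indicator can be computed directly from the rat data. The sketch below assumes the usual pooled two-sample t statistic, which divides the difference of the means by S_p times sqrt(1/n1 + 1/n2); the helper names are illustrative.

```python
import math

high = [134, 146, 104, 119, 124, 161, 107, 83, 113, 129, 97, 123]  # high protein
low = [70, 118, 101, 85, 107, 132, 94]                             # low protein

def mean(xs):
    return sum(xs) / len(xs)

def sum_of_squares(xs):
    """Sum of squared deviations from the group mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)

n1, n2 = len(high), len(low)
s1_sq = sum_of_squares(high) / (n1 - 1)
s2_sq = sum_of_squares(low) / (n2 - 1)
sp_sq = (sum_of_squares(high) + sum_of_squares(low)) / (n1 + n2 - 2)  # pooled variance
t = (mean(high) - mean(low)) / (math.sqrt(sp_sq) * math.sqrt(1 / n1 + 1 / n2))

print(f"means: {mean(high):.0f} and {mean(low):.0f}")
print(f"variances: {s1_sq:.1f} and {s2_sq:.1f}, pooled: {sp_sq:.1f}")
print(f"t = {t:.2f} with {n1 + n2 - 2} degrees of freedom")
```

Dividing by the standard deviation of the difference turns the raw gap of 19 into a unit-free number that can be compared against a reference distribution.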
58
Statistical reasoning
Indicator and yardstick
 If we repeat the experiment 1000 times
and compute 1000 t statistics
 Plot the histogram for these 1000 t
statistics
 The histogram is similar to normal but
with heavier tails
59
Analysis
We call it a t distribution
 There are many t distributions, one for
each combination of sample sizes
 The number (the sum of the two group
sizes – 2) is called the degrees of
freedom of the t distribution
(e.g. 12+7-2=17)
60
DF>= 30
When the degrees of freedom are larger
than or equal to 30, the t distribution
is very close to a normal distribution
61
Statistical Reasoning
Given the degrees of freedom, we can
find the area (probability)
 If there is no difference between the
two groups, the t distribution is
symmetric about zero.
 If the data really arise from two
treatments with the same results, the t
statistic should be small
62
Statistical Reasoning
If the t statistic is small, the area
(probability) of observing the actual
statistic or a larger one must be large.
 Conversely, if the area is small, the
data tell us that the hypothesis is
likely to be wrong
63
Statistical Reasoning
In this case, t=1.89
 The area for |t| beyond 1.89 (when
degrees of freedom=17) is 0.076.
 This area is called the p-value
 Usually, when the p-value is less than 0.05,
we will reject the hypothesis
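
The area beyond |t| = 1.89 can be approximated without a statistics library by integrating the t density numerically. This is a sketch using Simpson's rule; the function names and the number of integration steps are illustrative choices.

```python
import math

def t_density(x, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p_value(t, df, steps=2000):
    """P(|T| > |t|), using Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t)
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(k * h, df)
    area_zero_to_t = total * h / 3
    return 1 - 2 * area_zero_to_t   # the density is symmetric about zero

print(f"p-value for t = 1.89, df = 17: {two_sided_p_value(1.89, 17):.3f}")
```

Since the p-value is above 0.05, the usual rule does not reject the hypothesis of no difference between the two diets.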
64
Interactive Statistical Pages
Try the t-test (go to the procedure) and the chi square test
(2 x 2 table for sample comparison) here.
65
Regression
Finding the mean of y for each x
To see whether x and y are
associated
66
Data
Country         Alcohol consumption   Heart disease death rate
Australia       2.5                   211
Austria         3.9                   167
Belgium         2.9                   131
Canada          2.4                   191
Denmark         2.9                   220
Finland         0.8                   297
France          9.1                   71
Iceland         0.8                   211
Ireland         0.7                   300
Italy           7.9                   107
Netherlands     1.8                   167
New Zealand     1.9                   266
Norway          0.8                   227
Spain           6.5                   86
Switzerland     5.8                   115
Sweden          1.6                   207
United Kingdom  1.3                   285
United States   1.2                   199
West Germany    2.7                   172
67
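The shape of the scatterplot shows that alcohol consumption and
the heart disease death rate are negatively correlated.
But this only reflects the relationship between country-level figures.
We cannot infer that the more an individual drinks, the lower
his or her chance of dying of heart disease; doing so would
commit the ecologic bias.
[Figure: scatterplot of heart disease death rate against alcohol consumption]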
68
The trade-off between rigorous derivation and applied analysis!
[Figure: plot with a vertical axis running from 50 to 300]
69
Analysis
Y (death rate) = 260.56 - 22.97 x (alcohol consumption)
The negative sign indicates that Y and x
move in opposite directions.
More alcohol, lower heart disease death
rate?
The result cannot be extended to the
individual level --- ecologic bias
71
Analysis
The variance of the error (the residual variance
about the fitted line) is 1434.79
 The variance of Y itself is
4678.05, so the line accounts for a large part
of the variation in Y.
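
The fitted line and the two variances can be reproduced from the 19 countries in the data table. This is a sketch of an ordinary least-squares fit using only the standard library; the variable names are illustrative, and the error variance is taken as the residual sum of squares divided by n - 2.

```python
alcohol = [2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
           1.8, 1.9, 0.8, 6.5, 5.8, 1.6, 1.3, 1.2, 2.7]
death_rate = [211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
              167, 266, 227, 86, 115, 207, 285, 199, 172]

n = len(alcohol)
x_bar = sum(alcohol) / n
y_bar = sum(death_rate) / n

sxx = sum((x - x_bar) ** 2 for x in alcohol)
syy = sum((y - y_bar) ** 2 for y in death_rate)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(alcohol, death_rate))

slope = sxy / sxx
intercept = y_bar - slope * x_bar
residual_var = (syy - slope * sxy) / (n - 2)   # variance of the error
y_var = syy / (n - 1)                          # variance of Y

print(f"Y = {intercept:.2f} + ({slope:.2f}) x")
print(f"variance of the error = {residual_var:.2f}")
print(f"variance of Y = {y_var:.2f}")
```

The output matches the coefficients and variances quoted on the slides, which confirms that the table and the fitted line describe the same data.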
72
Questions
Email address:
tslau@sparc2.sta.cuhk.edu.hk
 Telephone:
 2609-7927
73
Exercises
1.(Sample survey)
 Population=(Adults in Hong Kong)
Sample=(random sample, telephone
survey)
 Parameter=proportion supporting the
government in handling the protest
 Statistic=
74