Lecture 2. Data Compression for One Variable
George Duncan
90-786 Intermediate Empirical Methods for Public Policy and Management




Forms of data compression
Complex thinking about simple means
Links between centers and spreads
Use of Minitab
Forms of Data Compression: Relation to Level of Measurement

                          Level of Measurement
Description               Nominal              Ordinal               Interval
Summary of Observations   Frequency table      Frequency table       Frequency table
                          Bar chart            Bar chart             Histogram
                          Pie chart                                  Box plot
                                                                     One-way scatterplot
Central Tendency          Mode                 Median                Mean, Median
Dispersion                Relative frequency   Interquartile range   Standard deviation
                          of the mode
Example

How prevalent is the mayor-council
form of government?





What are the units of analysis?
How many units have been observed?
How many cases are in the sample?
What type of analysis do we have?
What variables are being measured?
What is the level of measurement?
Form of Government in Cities Under 25,000 Population in Kansas

                     Form of Government
No.   City           Symbolic Code   Numerical Code
1     Abilene        CM              1
2     Andale         MC              2
3     Andover        MC              2
4     Atchison       CM              1
5     Beloit         MC              2
6     Cherryvale     CO              3
...   ...            ...             ...
74    Winfield       CM              1

CM = 1, council-manager
MC = 2, mayor-council
CO = 3, commission
Governance Frequency Table

                              Absolute Frequency       Relative Frequency
Value   Form of Government    Number of Observations   Proportion   Percentage
1       Council-Manager       37                       0.50         50%
2       Mayor-Council         32                       0.43         43.2%
3       Commission            5                        0.07         6.8%
        Total                 74                       1.00         100%
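The relative frequencies and the mode in the governance table can be recomputed directly from the three absolute frequencies. The following is a minimal sketch in Python; the variable names are illustrative and not part of the lecture, which uses Minitab.

```python
# Absolute frequencies taken from the governance frequency table.
counts = {"Council-Manager": 37, "Mayor-Council": 32, "Commission": 5}

total = sum(counts.values())  # 74 cities in the sample

# Relative frequency = absolute frequency / total number of observations.
proportions = {form: n / total for form, n in counts.items()}

# For nominal data the mode is the category with the largest frequency.
mode = max(counts, key=counts.get)

for form, n in counts.items():
    p = proportions[form]
    print(f"{form:16s} {n:3d}  {p:.2f}  {100 * p:.1f}%")
print("Total:", total)
print("Mode:", mode)
```

The relative frequency of the mode (here 37 of 74) is the dispersion measure listed for nominal data: the closer it is to 1, the more the cases concentrate in a single category.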
Governance Bar Chart
[Bar chart: number of cities (vertical axis, 0 to 40) by form of government: Council-Manager, Mayor-Council, Commission]
Governance Pie Chart
[Pie chart with three slices]
1. Council-manager 50% (37)
2. Mayor-council 43.2% (32)
3. Commission 6.8% (5)
Quality of Fire Departments

Fire Insurance Class   Number   Relative Frequency (%)   Cumulative Frequency (%)
1                      1        0.30                     0.30
2                      45       13.35                    13.65
3                      148      43.92                    57.57
4                      98       29.08                    86.65
5                      35       10.39                    97.03
6                      8        2.37                     99.41
7                      1        0.30                     99.70
8                      1        0.30                     100.00
9                      0        0.00                     100.00
10                     0        0.00                     100.00
Total                  337      100.00
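For ordinal data such as the fire insurance classes, the cumulative frequency column also locates the median: it is the first class at which the cumulative percentage reaches 50%. A minimal sketch in Python, using only the counts from the table (the variable names are illustrative):

```python
from itertools import accumulate

# Number of cities in each fire insurance class (1 through 10), from the table.
counts = [1, 45, 148, 98, 35, 8, 1, 1, 0, 0]
total = sum(counts)  # 337

# Running totals, converted to cumulative percentages.
cumulative_counts = list(accumulate(counts))
cumulative_pct = [100 * c / total for c in cumulative_counts]

# Median class: the first class whose cumulative percentage reaches 50%.
median_class = next(
    k for k, pct in enumerate(cumulative_pct, start=1) if pct >= 50
)

for k, (n, pct) in enumerate(zip(counts, cumulative_pct), start=1):
    print(f"class {k:2d}: {n:3d}  {100 * n / total:6.2f}%  {pct:6.2f}%")
print("Median class:", median_class)
```

The cumulative percentage is still below 50% after class 2 and passes 50% within class 3, so the median city falls in class 3.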
Fire Insurance Bar Chart
[Bar chart: number of cities (vertical axis, 0 to 160) by fire insurance class (horizontal axis, 1 to 10)]
Garbage Collection
Tons of Trash Collected by the City of Normal, Oklahoma for the Week of June 8, 1992

Tons of Garbage   Number of Observations
50-60             15
60-70             25
70-80             30
80-90             20
90-100            10
Total             100
Garbage Histogram
[Histogram: frequency (vertical axis, 0 to 30) by tons of garbage (horizontal axis, classes 50-60, 60-70, 70-80, 80-90, 90-100)]
Measures of Central Tendency

Median = 73 tons
Mode = 75 tons
Mean (average of all observed values): $\bar{x} = 72.97$
where $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
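The reported mean of 72.97 comes from the 100 individual observations, which are not reproduced here. When only the grouped frequency table is available, the mean can be approximated by treating every observation as if it sat at its class midpoint. The sketch below shows that approximation in Python; the midpoint assumption is mine, not the lecture's, so the result differs slightly from 72.97.

```python
# Class intervals (tons) and observation counts from the garbage collection table.
classes = [(50, 60, 15), (60, 70, 25), (70, 80, 30), (80, 90, 20), (90, 100, 10)]

n = sum(count for _, _, count in classes)  # 100 observations

# Assumption: every observation in a class sits at the class midpoint.
grouped_mean = sum((low + high) / 2 * count for low, high, count in classes) / n

# Modal class = the class with the most observations; report its midpoint.
modal_low, modal_high, _ = max(classes, key=lambda c: c[2])
modal_midpoint = (modal_low + modal_high) / 2

print(f"n = {n}")
print(f"grouped mean estimate = {grouped_mean:.2f} tons")
print(f"midpoint of modal class = {modal_midpoint:.0f} tons")
```

The midpoint of the modal class, 70-80, is 75 tons, which agrees with the mode reported above, although the lecture does not state how that figure was obtained.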
Measures of Dispersion
Range = Max - Min
Variance = $S^2$
Standard Deviation = $S$
where $S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Coefficient of Variation = $\frac{S}{\bar{x}}$
Measure of Dispersion: Garbage Example

Range = 97 - 50 = 47

Variance = 151.3

Standard Deviation = 12.3

Coefficient of Variation = 0.17
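The reported figures are mutually consistent: the standard deviation is the square root of the variance, and the coefficient of variation is the standard deviation divided by the mean. The Python sketch below checks this from the summary values alone, then illustrates the n - 1 variance formula on a small hypothetical sample (the five values are invented for illustration and are not the lecture's data).

```python
import math
import statistics

# Summary values reported for the garbage example.
mean_tons = 72.97
variance = 151.3

sd = math.sqrt(variance)  # standard deviation = square root of the variance
cv = sd / mean_tons       # coefficient of variation = S / mean

print(f"standard deviation = {sd:.1f}")
print(f"coefficient of variation = {cv:.2f}")

# Hypothetical five-value sample, used only to show the n - 1 formula.
sample = [50, 64, 73, 82, 97]
sample_mean = statistics.mean(sample)
manual_var = sum((x - sample_mean) ** 2 for x in sample) / (len(sample) - 1)

# statistics.variance also divides by n - 1 (the sample variance).
library_var = statistics.variance(sample)
print(math.isclose(manual_var, library_var))
```

Because the coefficient of variation is a ratio of two quantities measured in tons, it has no units, which makes it useful for comparing spread across variables measured on different scales.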
Box Plot
[Diagram of a box plot, labeled from top to bottom]
Outer fence = Q3 + 3.0 * IQR
o  Outlier (extreme data value)
Inner fence = Q3 + 1.5 * IQR
Whisker
Q3, 75th percentile
Median
Q1, 25th percentile
Whisker
Inner fence = Q1 - 1.5 * IQR
Outer fence = Q1 - 3.0 * IQR
Interquartile range, IQR = (Q3 - Q1)
Garbage Box Plot
Max = 97
Q3 = 82.25
Median = 73
Q1 = 64
Min = 50
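Applying the fence definitions to the garbage five-number summary shows whether any observation would be flagged as an outlier. A minimal sketch in Python, using only the values listed above (the variable names are illustrative):

```python
# Five-number summary from the garbage box plot.
minimum, q1, median, q3, maximum = 50.0, 64.0, 73.0, 82.25, 97.0

iqr = q3 - q1  # interquartile range

# Inner fences sit 1.5 IQRs beyond the quartiles, outer fences 3.0 IQRs.
inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr

# If the minimum and maximum both lie inside the inner fences,
# no observation is flagged as an outlier.
has_outliers = minimum < inner_low or maximum > inner_high

print(f"IQR = {iqr}")
print(f"inner fences = ({inner_low}, {inner_high})")
print(f"outer fences = ({outer_low}, {outer_high})")
print("outliers present:", has_outliers)
```

Both the minimum (50) and the maximum (97) fall inside the inner fences, so the whiskers extend to the extremes and no points are plotted separately.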
Shapes of Distribution

Positive skewness: Mean > Median
Symmetric distribution: Mean = Median
Negative skewness: Mean < Median
Complex Thinking about Simple Means



The mean time served for drug law
violation by prisoners released from
U.S. Federal prisons during 1965 to
1980 was 22.4 months.
The median family income in Texas in
1975 was $12,672.
The modal number of commercial TV
stations in 1980 among the fifty U.S.
states was 12 per state.
Applications of a Mean



Earnings of workers in the automobile
industry averaged $577.30 per week in
the U.S. for 1986.
The mean temperature in Minneapolis-St. Paul during January is minus 12
degrees Celsius.
The U.S. national rate of motor-vehicle
traffic deaths per 100,000 population in
1985 was 18.8.
Means can be tricky!
Quality of Life Index

          1965                 1975
Country   Population   Index   Population   Index
A         20           100     22           104
B         30           70      34           76
C         10           20      32           33
Calculate the average (per capita) quality of life, separately for 1965
and 1975.
Explain why the 1975 average is lower than the 1965 average, even
though the quality of life has increased in every country.
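One way to work the exercise is to compute the per capita average as a population-weighted mean of the country indices. The Python sketch below does this with the table values; the population units are not stated in the source, and they cancel out of the weighted mean in any case.

```python
# (population, index) for each country, taken from the table.
data_1965 = {"A": (20, 100), "B": (30, 70), "C": (10, 20)}
data_1975 = {"A": (22, 104), "B": (34, 76), "C": (32, 33)}

def per_capita_average(data):
    """Population-weighted mean of the quality-of-life index."""
    total_population = sum(pop for pop, _ in data.values())
    weighted_sum = sum(pop * index for pop, index in data.values())
    return weighted_sum / total_population

def population_share(data, country):
    """Fraction of the total population living in one country."""
    total_population = sum(pop for pop, _ in data.values())
    return data[country][0] / total_population

avg_1965 = per_capita_average(data_1965)
avg_1975 = per_capita_average(data_1975)
share_c_1965 = population_share(data_1965, "C")
share_c_1975 = population_share(data_1975, "C")

print(f"1965 average = {avg_1965:.2f}")
print(f"1975 average = {avg_1975:.2f}")
print(f"Country C share of population: {share_c_1965:.1%} -> {share_c_1975:.1%}")
```

The weights explain the drop: country C, with by far the lowest index, grows from about a sixth of the combined population to more than a third, so its low value counts for much more in 1975 even though every country's own index rose.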
Links between Centers and Spreads
Data = Fit + Residual
[Diagram: data values labeled X, Y, and Z, shown together with the Fit]
Locate Fit to Minimize a Function
of the Residuals
Mean and Standard Deviation


Average Deviation is Zero
Sum of Squared Deviations is Minimized
Median and Average Absolute
Deviation


No more than half of the residuals are
less than zero and no more than half of
the residuals are greater than zero.
The sum of the absolute values of the
residuals is as small as possible.
Mode and Percentage of Misses

As many as possible of the residuals are
zero.
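The three pairings above can be checked numerically: choose a criterion on the residuals, search over candidate fits, and see which center wins. The Python sketch below uses a small hypothetical data set and a simple grid search; both are my own illustrative choices, not part of the lecture.

```python
import statistics

# Hypothetical data, chosen so that mean, median, and mode are easy to see.
data = [2, 3, 3, 5, 12]

# Candidate fits on a grid from 0.0 to 15.0 in steps of 0.1.
candidates = [i / 10 for i in range(0, 151)]

def sum_squared(fit):
    """Sum of squared residuals (data - fit)."""
    return sum((x - fit) ** 2 for x in data)

def sum_absolute(fit):
    """Sum of absolute residuals."""
    return sum(abs(x - fit) for x in data)

def misses(fit):
    """Number of residuals that are not exactly zero."""
    return sum(1 for x in data if x != fit)

best_squared = min(candidates, key=sum_squared)
best_absolute = min(candidates, key=sum_absolute)
best_misses = min(candidates, key=misses)

print("least squares fit:", best_squared, "mean:", statistics.mean(data))
print("least absolute fit:", best_absolute, "median:", statistics.median(data))
print("fewest misses fit:", best_misses, "mode:", statistics.mode(data))

# Residuals around the mean sum to zero.
print("sum of residuals about the mean:", sum(x - best_squared for x in data))
```

In this example the single large value (12) pulls the least-squares fit up to 5, while the least-absolute and fewest-misses fits both stay at 3, which illustrates why the choice of center and the choice of spread measure go together.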
Next Time ...


Friday Workshop--Minitab Applications
Lecture 3--Data Compression for Two
Variables: Scatterplots