CSCI N207 Data Analysis Using Spreadsheet
10b. Univariate Analysis Part 2
Lingma Acheson
linglu@iupui.edu
Department of Computer and Information Science, IUPUI
The Range
• The difference between the maximum and minimum values in
a data set.
• A larger range usually (but not always) indicates a
large spread or deviation in the values of the data
set.
(73, 66, 69, 67, 49, 60, 81, 71, 78, 62, 53, 87, 74,
65, 74, 50, 85, 45, 63, 100)
Range: 100 – 45 = 55
• A single extremely low or high value can distort the
range, e.g. (20, 76, 77, 80, 82, 82, 84, 88, 90, 93,
99, 100)
Range: 100 – 20 = 80
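The two range calculations above can be reproduced with a short Python sketch. The helper name data_range is an assumption made for illustration; it is not part of the slides. The last line shows how much the single low value (20) inflates the range of the second data set.

```python
# Range = maximum - minimum, using the two data sets from the slide.
scores = [73, 66, 69, 67, 49, 60, 81, 71, 78, 62, 53, 87, 74,
          65, 74, 50, 85, 45, 63, 100]
with_outlier = [20, 76, 77, 80, 82, 82, 84, 88, 90, 93, 99, 100]


def data_range(values):
    """Return the difference between the largest and smallest value.

    Hypothetical helper, introduced here only for illustration.
    """
    return max(values) - min(values)


range_scores = data_range(scores)
range_with_outlier = data_range(with_outlier)
# Drop the first value (20) to see the range of the remaining data.
range_without_outlier = data_range(with_outlier[1:])

print("Range of scores:", range_scores)
print("Range with the value 20:", range_with_outlier)
print("Range without the value 20:", range_without_outlier)
```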
Variance
• One measure of dispersion (deviation from the mean) of a
data set. How far away is each datum from the mean?
• Variance – the average squared distance to the mean
• The larger the variance, the greater the average deviation
of each datum from the mean (the values lie farther from
the mean).
$$
\text{Variance} = \frac{1}{N}\sum_{i=1}^{N}(m_i - \bar{m})^2
= \frac{(m_1 - \bar{m})^2 + (m_2 - \bar{m})^2 + (m_3 - \bar{m})^2 + \dots + (m_N - \bar{m})^2}{N}
$$
$\bar{m}$ = average value of the data set
• E.g. 73, 67, 70, 67, 49, 60, 81, 71, 78, 62, 53, 87, 72, 65, 74, 50, 84, 45, 62, 100
$$
\text{Variance} = \frac{(73 - 68.5)^2 + (67 - 68.5)^2 + (70 - 68.5)^2 + \dots + (100 - 68.5)^2}{20}
$$
Excel Functions:
VARP() – variance for the whole population (data set is complete); divides by N
VAR() – variance estimated from a sample (data set is a sample of the population); divides by N – 1
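As a rough cross-check of the variance formula, the sketch below computes both variants by hand in Python for the example data set and compares them with the standard library's statistics.pvariance and statistics.variance. Treating these as the counterparts of VARP() and VAR() is an assumption based on the usual definitions (divide by N versus N – 1), not something stated in the slides.

```python
import statistics

# Data set from the variance example; its mean is 68.5.
data = [73, 67, 70, 67, 49, 60, 81, 71, 78, 62,
        53, 87, 72, 65, 74, 50, 84, 45, 62, 100]

n = len(data)
mean = sum(data) / n

# Sum of squared distances from the mean.
squared_deviations = sum((x - mean) ** 2 for x in data)

population_variance = squared_deviations / n    # divide by N (whole population)
sample_variance = squared_deviations / (n - 1)  # divide by N - 1 (sample)

print(f"mean = {mean}")
print(f"population variance = {population_variance:.2f}")
print(f"sample variance     = {sample_variance:.2f}")

# The standard library should agree with the hand computation.
assert abs(population_variance - statistics.pvariance(data)) < 1e-9
assert abs(sample_variance - statistics.variance(data)) < 1e-9
```

The sample variance is always slightly larger than the population variance for the same numbers, because it divides the same sum by a smaller denominator.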
Standard Deviation
• The square root of the variance. Taking the root undoes
the squaring of the distances in the variance.
• Its magnitude is therefore on the same scale as the
values in the data set.
• Can be thought of, roughly, as the typical deviation from
the mean of a data set.
$$
\text{Standard Deviation} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(m_i - \bar{m})^2}
$$
Excel Functions:
STDEVP() – use this when the data set is complete
STDEV() – use this when the data set is a sample
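The relationship between variance and standard deviation can be sketched in Python as follows. The data set is the one from the variance example; pairing statistics.pstdev and statistics.stdev with STDEVP() and STDEV() is an assumption based on their standard definitions.

```python
import math
import statistics

# Data set from the variance example.
data = [73, 67, 70, 67, 49, 60, 81, 71, 78, 62,
        53, 87, 72, 65, 74, 50, 84, 45, 62, 100]

# Standard deviation is the square root of the variance.
population_sd = math.sqrt(statistics.pvariance(data))  # complete data set
sample_sd = math.sqrt(statistics.variance(data))       # data set is a sample

print(f"population SD = {population_sd:.2f}")
print(f"sample SD     = {sample_sd:.2f}")

# The dedicated library functions should give the same values.
assert abs(population_sd - statistics.pstdev(data)) < 1e-9
assert abs(sample_sd - statistics.stdev(data)) < 1e-9
```

Unlike the variance, both results are on the same scale as the scores themselves, which makes them easier to interpret.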
Frequency Tables
• Use a frequency table to observe the distribution of the data.
• E.g. Consider the following data set:
{45, 49, 50, 53, 60, 62, 63, 65, 66, 67, 69, 71, 73, 74,
74, 78, 81, 85, 87, 100}
• First decide how to group the data into different
bins.
Category Labels    Frequency
0-50               3
51-60              2
61-70              6
71-80              5
81-90              3
>90                1
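One way to build this frequency table programmatically is sketched below in Python. The variable names and the use of bisect_left are illustrative choices, not part of the slides; each bin is assumed to include its upper bound (for example, 50 falls in 0-50 and 60 in 51-60), which matches the category labels above.

```python
from bisect import bisect_left

# Data set from the frequency table example.
scores = [45, 49, 50, 53, 60, 62, 63, 65, 66, 67, 69, 71, 73, 74,
          74, 78, 81, 85, 87, 100]

# Inclusive upper bound of each bin; anything above 90 goes in the last bin.
upper_bounds = [50, 60, 70, 80, 90]
labels = ["0-50", "51-60", "61-70", "71-80", "81-90", ">90"]

counts = [0] * len(labels)
for score in scores:
    # bisect_left returns the index of the first bound >= score,
    # or len(upper_bounds) when the score exceeds every bound.
    counts[bisect_left(upper_bounds, score)] += 1

frequency_table = dict(zip(labels, counts))

# Print the table with a simple text bar per bin.
for label, count in frequency_table.items():
    print(f"{label:>6}  {count}  {'#' * count}")
```

The printed bars give a quick text preview of the shape of the distribution.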
Histogram
• A histogram is simply a column chart of
the frequency table.
[Figure: histogram (column chart) of the frequency table. X-axis: Scores, with bins 0-50, 51-60, 61-70, 71-80, 81-90, >90; y-axis: Frequency, 0 to 7. Bar heights: 3, 2, 6, 5, 3, 1.]
Data Distribution
Normal Distribution
Category Labels    Frequency
0-50               1
51-60              3
61-70              5
71-80              5
81-90              3
>90                1
[Figure: column chart of this frequency table. X-axis: bins 0-50, 51-60, 61-70, 71-80, 81-90, >90; y-axis: Frequency, 0 to 6.]
Normal Distributions
• The Bell curve
– Symmetrical
– Mean ≈ Median
Skewed Distributions
• Most of the time, distributions are skewed.
• Positively skewed distribution: mean > median
• Negatively skewed distribution: mean < median
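The mean-versus-median comparison can be turned into a small check, sketched here in Python. The function name skew_direction is hypothetical, and the comparison is only the rule of thumb from the slide, not a formal skewness statistic. The two data sets are taken from earlier slides.

```python
import statistics


def skew_direction(values):
    """Classify skew by comparing mean and median (rule of thumb only).

    Hypothetical helper, introduced here only for illustration.
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "positive"
    if mean < median:
        return "negative"
    return "none"


# Scores from the frequency table example (mean 68.6, median 68).
scores = [45, 49, 50, 53, 60, 62, 63, 65, 66, 67, 69, 71, 73, 74,
          74, 78, 81, 85, 87, 100]

# Data set from the range example, with one very low value (20).
low_outlier = [20, 76, 77, 80, 82, 82, 84, 88, 90, 93, 99, 100]

print("scores:", skew_direction(scores))
print("low_outlier:", skew_direction(low_outlier))
```

The single low value in the second data set pulls the mean below the median, which is why it is classified as negatively skewed.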
Data Distribution
{45, 49, 50, 53, 60, 62, 63, 65, 66, 67, 69, 71, 73, 74, 74, 78, 81,
85, 87, 100}
Average (68.6), Mode (74), and Median (68)
[Figure: histogram of the data set. X-axis: Scores, with bins 0-50, 51-60, 61-70, 71-80, 81-90, >90; y-axis: Frequency, 0 to 7. Vertical markers at -1SD = 55.14 and +1SD = 82.06.]
Standard Deviation
With a normal distribution:
mean ± 1*SD covers about 68% of the data
mean ± 2*SD covers about 95% of the data
mean ± 3*SD covers about 99.7% of the data
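As an informal check of these percentages, the Python sketch below counts how many of the 20 scores from the earlier slide fall within 1, 2, and 3 standard deviations of the mean. It assumes the population standard deviation (the data set is treated as complete); the helper name share_within is hypothetical. With only 20 values that are not exactly normal, the shares are expected to be close to, but not the same as, the theoretical figures.

```python
import statistics

# Data set from the "Data Distribution" slide.
scores = [45, 49, 50, 53, 60, 62, 63, 65, 66, 67, 69, 71, 73, 74,
          74, 78, 81, 85, 87, 100]

mean = statistics.mean(scores)
sd = statistics.pstdev(scores)  # population SD: data set treated as complete


def share_within(k):
    """Return the fraction of scores within k standard deviations of the mean.

    Hypothetical helper, introduced here only for illustration.
    """
    low, high = mean - k * sd, mean + k * sd
    inside = [x for x in scores if low <= x <= high]
    return len(inside) / len(scores)


print(f"mean = {mean:.1f}, SD = {sd:.2f}")
for k in (1, 2, 3):
    low, high = mean - k * sd, mean + k * sd
    print(f"within {k} SD ({low:.2f} to {high:.2f}): {share_within(k):.0%}")
```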