A "real" data set, and how we would work on it as a worksheet or test problem

Examples
“Real” Data
The following is a data set of body temperatures of 106 people, recorded at 12 am, by doctors in
a Maryland institute. No further details are available, so we are merely describing (or, better,
summarizing) the data with no further deductions.
98.6
98.6
98.0
98.0
99.0
98.4
98.4
98.4
98.4
98.6
98.6
98.8
98.6
97.0
97.0
98.8
97.6
97.7
98.8
98.0
98.0
98.3
98.5
97.3
98.7
97.4
98.9
98.6
99.5
97.5
97.3
97.6
98.2
99.6
98.7
99.4
98.2
98.0
98.6
98.6
97.2
98.4
98.6
98.2
98.0
97.8
98.0
98.4
98.6
98.6
97.8
99.0
96.5
97.6
98.0
96.9
97.6
97.1
97.9
98.4
97.3
98.0
97.5
97.6
98.2
98.5
98.8
98.7
97.8
98.0
97.1
97.4
99.4
98.4
98.6
98.4
98.5
98.6
98.3
98.7
98.8
99.1
98.6
97.9
98.8
98.0
98.7
98.5
98.9
98.4
98.6
97.1
97.9
98.8
98.7
97.6
98.2
99.2
97.8
98.0
98.4
97.8
98.4
97.4
98.0
97.0
The same data, sorted, with a rank and a percentile rank assigned to each value:
Data    Rank    Percentile Rank
99.6    1       100.00%
99.5    2       99.05%
99.4    3.5     97.14%
99.4    3.5     97.14%
99.2    5       96.19%
99.1    6       95.24%
99.0    7.5     93.33%
99.0    7.5     93.33%
98.9    9.5     91.43%
98.9    9.5     91.43%
98.8    14      84.76%
98.8    14      84.76%
98.8    14      84.76%
98.8    14      84.76%
98.8    14      84.76%
98.8    14      84.76%
98.8    14      84.76%
98.7    20.5    79.05%
98.7    20.5    79.05%
98.7    20.5    79.05%
98.7    20.5    79.05%
98.7    20.5    79.05%
98.7    20.5    79.05%
98.6    31      64.76%
98.6    31      64.76%
98.6    31      64.76%
98.6    31      64.76%
98.6    31      64.76%
98.6    31      64.76%
98.6    31      64.76%
98.6    31      64.76%
98.6    31      64.76%
98.6    31      64.76%
98.6    31      64.76%
98.6    31      64.76%
98.6    31      64.76%
98.6    31      64.76%
98.6    31      64.76%
98.5    40.5    60.95%
98.5    40.5    60.95%
98.5    40.5    60.95%
98.5    40.5    60.95%
98.4    48.5    49.52%
98.4    48.5    49.52%
98.4    48.5    49.52%
98.4    48.5    49.52%
98.4    48.5    49.52%
98.4    48.5    49.52%
98.4    48.5    49.52%
98.4    48.5    49.52%
98.4    48.5    49.52%
98.4    48.5    49.52%
98.4    48.5    49.52%
98.4    48.5    49.52%
98.3    55.5    47.62%
98.3    55.5    47.62%
98.2    59      42.86%
98.2    59      42.86%
98.2    59      42.86%
98.2    59      42.86%
98.2    59      42.86%
98.0    68      30.48%
98.0    68      30.48%
98.0    68      30.48%
98.0    68      30.48%
98.0    68      30.48%
98.0    68      30.48%
98.0    68      30.48%
98.0    68      30.48%
98.0    68      30.48%
98.0    68      30.48%
98.0    68      30.48%
98.0    68      30.48%
98.0    68      30.48%
97.9    76      27.62%
97.9    76      27.62%
97.9    76      27.62%
97.8    80      22.86%
97.8    80      22.86%
97.8    80      22.86%
97.8    80      22.86%
97.8    80      22.86%
97.7    83      21.90%
97.6    86.5    16.19%
97.6    86.5    16.19%
97.6    86.5    16.19%
97.6    86.5    16.19%
97.6    86.5    16.19%
97.6    86.5    16.19%
97.5    90.5    14.29%
97.5    90.5    14.29%
97.4    93      11.43%
97.4    93      11.43%
97.4    93      11.43%
97.3    96      8.57%
97.3    96      8.57%
97.3    96      8.57%
97.2    98      7.62%
97.1    100     4.76%
97.1    100     4.76%
97.1    100     4.76%
97.0    103     1.90%
97.0    103     1.90%
97.0    103     1.90%
96.9    105     0.95%
96.5    106     0.00%
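The rank and percentile columns can be reproduced with a short script. The following is only a sketch: the name `FREQ` and the function `rank_table` are ours, the frequency table is simply a count of how often each value occurs in the sorted listing, and the percentile rule (number of observations below the value, divided by n - 1) is an assumption inferred from the tabulated percentages, not something the source states.

```python
# Frequency table read off the sorted listing (value -> number of occurrences).
FREQ = {
    99.6: 1, 99.5: 1, 99.4: 2, 99.2: 1, 99.1: 1, 99.0: 2, 98.9: 2,
    98.8: 7, 98.7: 6, 98.6: 15, 98.5: 4, 98.4: 12, 98.3: 2, 98.2: 5,
    98.0: 13, 97.9: 3, 97.8: 5, 97.7: 1, 97.6: 6, 97.5: 2, 97.4: 3,
    97.3: 3, 97.2: 1, 97.1: 3, 97.0: 3, 96.9: 1, 96.5: 1,
}


def rank_table(freq):
    """Return {value: (average rank, percentile rank in %)}, largest value first."""
    n = sum(freq.values())
    table = {}
    above = 0  # observations strictly greater than the current value
    for value in sorted(freq, reverse=True):
        count = freq[value]
        # Tied observations occupy positions above+1 .. above+count;
        # each of them gets the average of those positions.
        avg_rank = above + (count + 1) / 2
        below = n - above - count
        # Assumed rule: share of the other n-1 observations lying below.
        pct = round(100 * below / (n - 1), 2)
        table[value] = (avg_rank, pct)
        above += count
    return table


table = rank_table(FREQ)
for value, (rank, pct) in table.items():
    print(f"{value:5.1f}  {rank:6.1f}  {pct:6.2f}%")
```

Averaging the positions of tied values is what produces fractional ranks such as 3.5 or 48.5.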
Some indexes were computed by software:
Mean                  98.2
Median                98.4
Mode                  98.6
Standard Deviation    0.62290
Sample Variance       0.388
Range                 3.1
Minimum               96.5
Maximum               99.6
1st quartile          97.8
3rd quartile          98.6
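We do not know which software produced these indexes. As a rough check, the same numbers can be obtained with Python's standard `statistics` module. In this sketch the name `FREQ` is ours (a count of each value in the data set), and the quartile convention is an assumption: we use the module's default method, which happens to agree with the values reported here.

```python
import statistics

# Frequency table of the 106 temperatures (value -> number of occurrences).
FREQ = {
    99.6: 1, 99.5: 1, 99.4: 2, 99.2: 1, 99.1: 1, 99.0: 2, 98.9: 2,
    98.8: 7, 98.7: 6, 98.6: 15, 98.5: 4, 98.4: 12, 98.3: 2, 98.2: 5,
    98.0: 13, 97.9: 3, 97.8: 5, 97.7: 1, 97.6: 6, 97.5: 2, 97.4: 3,
    97.3: 3, 97.2: 1, 97.1: 3, 97.0: 3, 96.9: 1, 96.5: 1,
}

# Expand the table back into the full sorted list of observations.
data = sorted(v for v, count in FREQ.items() for _ in range(count))

q1, q2, q3 = statistics.quantiles(data, n=4)  # default quartile convention

summary = {
    "Mean": round(statistics.mean(data), 1),
    "Median": statistics.median(data),
    "Mode": statistics.mode(data),
    "Standard Deviation": round(statistics.stdev(data), 4),  # "sample" version
    "Sample Variance": round(statistics.variance(data), 3),
    "Range": round(max(data) - min(data), 1),
    "Minimum": min(data),
    "Maximum": max(data),
    "1st quartile": round(q1, 1),
    "3rd quartile": round(q3, 1),
}

for name, value in summary.items():
    print(f"{name:20s}{value}")
```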
Here is how the same data would be presented in a worksheet or test for our class:
99.6
99.5
99.4
99.4
99.2
99.1
99.0
99.0
98.9
98.9
98.8
98.8
98.8
98.8
98.8
98.8
98.8
98.7
98.7
98.7
98.7
98.7
98.7
98.6
98.6
98.6
98.6
98.6
98.6
98.6
98.6
98.6
98.6
98.6
98.6
98.6
98.6
98.6
98.5
98.5
98.5
98.5
98.4
98.4
98.4
98.4
98.4
98.4
98.4
98.4
98.4
98.4
98.4
98.4
98.3
98.3
98.2
98.2
98.2
98.2
98.2
98.0
98.0
98.0
98.0
98.0
98.0
98.0
98.0
98.0
98.0
98.0
98.0
98.0
97.9
97.9
97.9
97.8
97.8
97.8
97.8
97.8
97.7
97.6
97.6
97.6
97.6
97.6
97.6
97.5
97.5
97.4
97.4
97.4
97.3
97.3
97.3
97.2
97.1
97.1
97.1
97.0
97.0
97.0
96.9
96.5
Count             106
Sum               10409.2
Sum of Squares    1022224
The last three rows would help us compute the mean and the variances (“population” and
“sample”). The division in columns would help us determine median and quartiles.
In particular, the mean is given by Sum/Count, or 10409.2/106. The “population” variance is given
by “average of squares - square of average”:
$$\frac{1022224}{106} - \left(\frac{10409.2}{106}\right)^2.$$
The “population” standard deviation is the square root of this number, the “sample” variance is
obtained by multiplying the “population” variance by 106/105, and its square root is the “sample”
standard deviation. Don’t worry: a “cheat sheet” comes with every worksheet or test.
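The worksheet computation can be written out step by step, using only the three given totals. This is a sketch with variable names of our choosing. One caveat: the sum of squares on the worksheet appears to be rounded to a whole number (squaring and adding the listed values gives about 1022224.18), so the sample variance computed this way comes out near 0.386 instead of the 0.388 reported by the software.

```python
import math

# The three totals given at the bottom of the worksheet.
count = 106
total = 10409.2
sum_of_squares = 1022224  # apparently rounded to a whole number

mean = total / count

# "Average of squares - square of average".
pop_variance = sum_of_squares / count - mean ** 2
pop_sd = math.sqrt(pop_variance)

# The "sample" versions rescale by count / (count - 1), here 106/105.
sample_variance = pop_variance * count / (count - 1)
sample_sd = math.sqrt(sample_variance)

print(f"mean            = {mean:.1f}")
print(f"pop. variance   = {pop_variance:.4f}")
print(f"pop. std dev    = {pop_sd:.4f}")
print(f"sample variance = {sample_variance:.4f}")
print(f"sample std dev  = {sample_sd:.4f}")
```

The small gap between 0.386 and 0.388 comes entirely from the rounded sum of squares: the subtraction removes almost all of the leading digits, so a change of 0.18 in a number over one million becomes visible in the third decimal place of the result.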
The example above uses “real” data, on which, however, we have very little information as to
how they were collected. This is the usual case when working with information from the web (or
from a textbook, for that matter). Even with more information, it is often difficult to verify that
the assumptions behind the mathematical models we will use to analyze the data are satisfied (if
they are not, many “indexes”, like mean and variance, may have very little useful meaning).
Many examples we will work on in class and at home will instead be “simulated data”, produced
by computer in a way that makes sure that they do indeed satisfy specific models.
Summation (“Sigma”) Notation
Try to familiarize yourself with the following notation for sums. Sums are everywhere in
statistics and, more generally, in mathematics, so we have developed a shorthand symbol for the
operation of summing numbers.
Suppose we have n numbers and need their sum. Denote the numbers by $x_1, x_2, \dots, x_n$. For
example, if the numbers are $2, 4, -5, \frac{1}{2}$, this would correspond to
$n = 4$, $x_1 = 2$, $x_2 = 4$, $x_3 = -5$, $x_4 = \frac{1}{2}$. The sum of our n numbers is written as
$$\sum_{k=1}^{n} x_k,$$
and, for our example, this would mean the computation $2 + 4 + (-5) + \frac{1}{2} = \frac{3}{2}$.
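For readers who find code easier to parse than symbols, the sigma notation corresponds directly to a loop over the index k. The sketch below uses the standard `fractions` module so that 1/2 is kept exact. The list name `x` mirrors the notation above; note that Python lists start counting at 0, so $x_k$ is stored at position k - 1.

```python
from fractions import Fraction

# The four numbers from the example: x_1 = 2, x_2 = 4, x_3 = -5, x_4 = 1/2.
x = [Fraction(2), Fraction(4), Fraction(-5), Fraction(1, 2)]
n = len(x)

# Sum over k = 1, ..., n of x_k (x_k lives at list position k - 1).
total = sum(x[k - 1] for k in range(1, n + 1))

print(n, total)  # 4 3/2
```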
Thus, the mean of a data set of n numbers $x_1, x_2, \dots, x_n$ is given by
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{k=1}^{n} x_k}{n},$$
and the “population” variance by
$$S^2 = \frac{\sum_{k=1}^{n} (x_k - \bar{x})^2}{n}.$$
Incidentally, it is sometimes convenient to compute the “population” variance using the
equivalent¹ formula
$$S^2 = \frac{\sum_{k=1}^{n} x_k^2}{n} - (\bar{x})^2.$$
The “population” standard deviation is then given by $\sqrt{S^2}$, while the “sample” variance is
given by $s^2 = \frac{n}{n-1} S^2$, and the “sample” standard deviation by $s = \sqrt{s^2}$.
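As an illustration, here are these formulas applied to the four numbers of the earlier example (2, 4, -5, 1/2). This is a sketch with function names of our own choosing. Exact fractions are used so that the agreement between the two variance formulas can be checked without any rounding.

```python
import math
from fractions import Fraction


def mean(xs):
    """x-bar: the sum of the numbers divided by how many there are."""
    return sum(xs) / len(xs)


def pop_var_definition(xs):
    """S^2 as the average squared deviation from the mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def pop_var_shortcut(xs):
    """S^2 as 'average of squares - square of average'."""
    return sum(x ** 2 for x in xs) / len(xs) - mean(xs) ** 2


def sample_var(xs):
    """s^2 = n / (n - 1) * S^2."""
    n = len(xs)
    return Fraction(n, n - 1) * pop_var_definition(xs)


xs = [Fraction(2), Fraction(4), Fraction(-5), Fraction(1, 2)]

print(mean(xs))                # 3/8
print(pop_var_definition(xs))  # 715/64
print(pop_var_shortcut(xs))    # 715/64
print(sample_var(xs))          # 715/48
print(math.sqrt(sample_var(xs)))  # the "sample" standard deviation s
```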
1 The two expressions are algebraically equivalent. However, they may behave differently when used with large
data sets, as the rounding approximations that are inevitable may affect the second expression more than the
first. Of course, this issue will not concern us at all, since we will never face big calculations in our class.
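For the curious, the rounding effect mentioned in the footnote can be seen with ordinary floating-point numbers. The data below are made up for the purpose (they are not from the temperature set): four values whose “population” variance is exactly 22.5, and the same four values shifted up by 100 million, which should leave the variance unchanged.

```python
def pop_var_definition(xs):
    """First expression: average squared deviation from the mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def pop_var_shortcut(xs):
    """Second expression: average of squares minus square of average."""
    m = sum(xs) / len(xs)
    return sum(x * x for x in xs) / len(xs) - m * m


small = [4.0, 7.0, 13.0, 16.0]        # made-up illustration data
shifted = [1e8 + x for x in small]    # same spread, much larger values

print(pop_var_definition(small), pop_var_shortcut(small))
print(pop_var_definition(shifted), pop_var_shortcut(shifted))
```

On the small values both expressions give 22.5. On the shifted values the first still gives 22.5, while the second does not: the squares are so large that their last digits are lost to rounding before the subtraction takes place.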