Biostatistics,
statistical software I.
Basic statistical concepts
Krisztina Boda PhD
Department of Medical Informatics,
University of Szeged
What is biostatistics?
• Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data.
• Biostatistics or biometry is the application of statistics to a wide range of topics in biology. It has particular applications to medicine and to agriculture.

Krisztina Boda
INTERREG
Application of biostatistics
• Research
• Design and analysis of clinical trials in medicine
• Public health, including epidemiology
Biostatistical methods
• Descriptive statistics
• Hypothesis tests (statistical tests)
• They depend on:
   - the type of data
   - the nature of the problem
   - the statistical model
The data set
• A data set contains information on a number of individuals.
• Individuals are objects described by a set of data; they may be people, animals or things. For each individual, the data give values for one or more variables.
• A variable describes some characteristic of an individual, such as a person's age, height, gender or salary.
The data table
• Data of one experimental unit ("individual") must be in one record (row).
• Data of the answers to the same question (variables) must be in the same field of the record (column).

Number   SEX   AGE   ....
1        1     20    ....
2        2     17    ....
.        .     .     ....
Variables
• Categorical (discrete): a discrete random variable X has a finite number of possible values.
   - Gender
   - Blood group
   - Number of children
   - …
• Continuous: a continuous random variable X takes all values in an interval of numbers.
   - Concentration
   - Temperature
   - …
Types of data from two aspects
• Based on the number of values they can have
   - Discrete (categorical). Example: sex, blood group, number of children
   - Continuous. Example: age, temperature, concentration
• Based on the property they represent
   - Qualitative data
      nominal data (they can be distinguished by their names). Example: sex, blood group
      ordinal data (there are categories of classification that it may be possible to order). Example: very good - good - acceptable - wrong - very wrong; low - normal - high
   - Quantitative (or numerical) data. Example: age, number of children
Distribution of variables
Continuous: the distribution
of a continuous variable
describes what values it
takes and how often
these values fall into an
interval.
Discrete: the distribution
of a categorical
variable describes what
values it takes and how
often it takes these
values.
[Figures: bar chart of SEX (male, female) against frequency; histogram of age in years against frequency]
The distribution of a continuous variable, example
Values (age in years):
20, 17, 22, 28, 9, 5, 26, 60, 35, 51,
17, 50, 9, 10, 19, 22, 25, 29, 27, 19

Categories   Frequencies
0-10         4
11-20        5
21-30        7
31-40        1
41-50        1
51-60        2

[Figure: histogram of Age (categories 0-10 to 51-60) against Frequency]
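The frequency table for this example can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `frequency_table` is a hypothetical helper, and the class limits are taken from the slide with both ends treated as inclusive.

```python
# Ages from the slide (20 observations).
ages = [20, 17, 22, 28, 9, 5, 26, 60, 35, 51,
        17, 50, 9, 10, 19, 22, 25, 29, 27, 19]

# Class limits as on the slide; both ends are inclusive.
categories = [(0, 10), (11, 20), (21, 30), (31, 40), (41, 50), (51, 60)]


def frequency_table(values, classes):
    """Count how many values fall into each (low, high) class."""
    return {
        f"{low}-{high}": sum(low <= v <= high for v in values)
        for low, high in classes
    }


table = frequency_table(ages, categories)
for label, count in table.items():
    # A crude text histogram: one '#' per observation.
    print(f"{label:>6} {count:2d} {'#' * count}")
```

Each printed row is one bar of the histogram, so the text output has the same shape as the figure.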
The length of the intervals (or the number of intervals) affects a histogram
[Figures: two histograms of the same age data, one with six intervals (0-10, 11-20, 21-30, 31-40, 41-50, 51-60) and one with three intervals (0-20, 21-40, 41-60); vertical axis: count]
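A short sketch of how the interval length changes the counts, using the 20 ages from the earlier example. The helper `grouped_counts` is hypothetical, and it assumes classes of the form 0-w, (w+1)-2w, and so on, as in the two histograms.

```python
from collections import Counter

ages = [20, 17, 22, 28, 9, 5, 26, 60, 35, 51,
        17, 50, 9, 10, 19, 22, 25, 29, 27, 19]


def grouped_counts(values, width):
    """Count values in the classes 0-width, (width+1)-2*width, ..."""
    # Class index: 0 for 0..width, 1 for width+1..2*width, and so on.
    counts = Counter(max(0, (v - 1) // width) for v in values)
    result = {}
    for k in range(max(counts) + 1):
        low = 0 if k == 0 else k * width + 1
        result[f"{low}-{(k + 1) * width}"] = counts.get(k, 0)
    return result


narrow = grouped_counts(ages, 10)  # six intervals
wide = grouped_counts(ages, 20)    # three intervals
print(narrow)
print(wide)
```

With the wider intervals, the dip between the 21-30 peak and the older ages is no longer visible, which is the effect the slide illustrates.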
The overall pattern of a distribution
• The center, spread and shape describe the overall pattern of a distribution.
• Some distributions have a simple shape, such as symmetric or skewed. Not all distributions have a simple overall shape, especially when there are few observations.
• A distribution is skewed to the right if the right side of the histogram extends much farther out than the left side.
Histogram of body weights (kg)
[Figure: histogram of current body weights (kg); N = 1090, mean = 57.0, SD = 8.74]
Outliers
• Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them (real data, typing mistake or other).
[Figure: histogram of current body weight; N = 43, mean = 62.1, SD = 13.79]
Describing distributions with numbers
• Measures of central tendency: the mean, the mode and the median are three commonly used measures of the center.
• Measures of variability: the range, the quartiles, the variance and the standard deviation are the most commonly used measures of variability.
• Measures of an individual: rank, z score
Measures of the center
• Mean: $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$
• Mode: the most frequent number
• Median: the value that half the members of the sample fall below and half above. In other words, it is the middle number when the sample elements are written in numerical order.

Example: 1, 2, 4, 1
• Mean = 8/4 = 2
• Mode = 1
• Median
   - First sort the data: 1 1 2 4
   - Then find the element(s) in the middle. If the sample size is odd, the unique middle element is the median. If the sample size is even, the median is the average of the two central elements.
   - Here the two central elements are 1 and 2, so Median = 1.5
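The three measures for the sample 1, 2, 4, 1 can be checked with Python's standard `statistics` module. The function `median_by_hand` is a hypothetical helper that follows the sort-then-take-the-middle rule described above; it is a sketch, not part of the lecture.

```python
import statistics

sample = [1, 2, 4, 1]

mean = statistics.mean(sample)      # (1 + 2 + 4 + 1) / 4
mode = statistics.mode(sample)      # most frequent value
median = statistics.median(sample)  # average of the two central elements


def median_by_hand(values):
    """Median following the rule above: sort, then take the middle element(s)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd sample size: the unique middle element.
        return ordered[middle]
    # Even sample size: average of the two central elements.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(mean, mode, median, median_by_hand(sample))
```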



Example
The grades of a test written by 11 students were
the following:
100 100 100 63 62 60 12 12 6 2 0.
A student indicated that the class average was
47, which he felt was rather low. The professor
stated that nevertheless there were more 100s
than any other grade. The department head
said that the middle grade was 60, which was
not unusual.
The mean is 517/11=47, the mode is 100, the
median is 60.
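The three statements about the grades can be verified in a few lines. This is a minimal check with the standard `statistics` module; the variable names are illustrative only.

```python
import statistics

grades = [100, 100, 100, 63, 62, 60, 12, 12, 6, 2, 0]

total = sum(grades)
mean = total / len(grades)           # the student's "class average"
mode = statistics.mode(grades)       # the professor's "most frequent grade"
median = statistics.median(grades)   # the department head's "middle grade"

print(total, mean, mode, median)
```

All three people describe the same data correctly; they simply use different measures of the center.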
Relationships among the mean (m), the median (M) and the mode (Mo)
• A symmetric curve: m = M = Mo
• A curve skewed to the right: Mo < M < m
• A curve skewed to the left: m < M < Mo
Measures of variability (dispersion)
• The range is the difference between the largest number (maximum) and the smallest number (minimum).
• Percentiles (5%-95%): the 5% percentile is the value below which 5% of the cases fall.
• Quartiles: the 25%, 50% and 75% percentiles
• The variance: $SD^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
• The standard deviation: $SD = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\text{variance}}$
Example
• Data: 1 2 4 1, in ascending order: 1 1 2 4
• Range: max - min = 4 - 1 = 3
• Quartiles:

Percentiles                        25       50       75
Weighted Average (Definition 1)    1.0000   1.5000   3.5000
Tukey's Hinges                     1.0000   1.5000   3.0000

• Standard deviation:

x_i      x_i - x̄        (x_i - x̄)^2
1        1 - 2 = -1     1
1        1 - 2 = -1     1
2        2 - 2 = 0      0
4        4 - 2 = 2      4
Total    0              6

$SD = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{6}{3}} = \sqrt{2} \approx 1.414$
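The hand calculation of the standard deviation can be followed step by step in Python. The sketch below mirrors the columns of the worked table (deviations, squared deviations, division by n - 1) and then compares the result with `statistics.stdev`, which also divides by n - 1.

```python
import math
import statistics

data = [1, 2, 4, 1]
n = len(data)
mean = sum(data) / n

# Column "x_i - mean": the deviations always sum to zero.
deviations = [x - mean for x in data]

# Column "(x_i - mean)^2": the squared deviations.
squared = [d ** 2 for d in deviations]

variance = sum(squared) / (n - 1)  # sample variance, divisor n - 1
sd = math.sqrt(variance)           # standard deviation

print(sum(deviations), sum(squared), variance, round(sd, 3))
print(round(statistics.stdev(data), 3))
```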
The meaning of the standard deviation
• A measure of dispersion around the mean. In a normal distribution, about 68% of cases fall within one standard deviation of the mean and about 95% of cases fall within two standard deviations.
• For example, if the mean age is 45, with a standard deviation of 10, about 95% of the cases would be between 25 and 65 in a normal distribution.
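The 68% and 95% figures are rounded values that can be checked with `statistics.NormalDist` from the Python standard library. The sketch assumes an exactly normal age distribution with mean 45 and standard deviation 10, as in the example.

```python
from statistics import NormalDist

# Assumed model: ages are exactly normal with mean 45 and SD 10.
age = NormalDist(mu=45, sigma=10)

# Probability of falling within one and within two SDs of the mean.
within_one_sd = age.cdf(55) - age.cdf(35)
within_two_sd = age.cdf(65) - age.cdf(25)

print(f"within 1 SD (35 to 55): {within_one_sd:.3f}")
print(f"within 2 SD (25 to 65): {within_two_sd:.3f}")
```

The exact values are slightly above 68% and 95%, which is why these percentages are usually quoted as approximations.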
The use of sample characteristics in summary tables

Center    Dispersion                 Publish
Mean      Standard deviation,        Mean (SD), Mean ± SD,
          Standard error             Mean ± SE, Mean ± SEM
Median    Min, max                   Med (min, max)
          5%, 95% percentiles
          25%, 75% (quartiles)       Med (25%, 75%)
Displaying data
• Categorical data
   - bar chart
   - pie chart
• Continuous data
   - histogram
   - box-whisker plot
   - mean-standard deviation plot
   - scatter plot
[Figures: pie chart and bar chart (percent) of father's highest level of education; histograms, box plot (Median, 25%-75%, Min-Max, Extremes) and mean ± SD plot of body weight (SULY) by sex (NEM: boy, girl) from the data set kerd97; scatter plot of current body weight (kg) against desired body weight (kg)]
Distribution of body weights
The distribution is skewed in the case of girls.
[Figures: histograms of body weight (SULY) by sex (NEM) from kerd97.STA, one panel for boys and one for girls; horizontal axis from 35 to 95 kg, vertical axis: number of observations]
Mean-dispersion diagrams
• Mean + SD
• Mean + SE
• Mean + 95% CI
[Figures: three mean plots of body weight (SULY) by sex (NEM: boy, girl) from kerd97, showing Mean ± SE, Mean ± 95% confidence interval and Mean ± SD]
Box diagram
[Figures: two box plots of body weight (SULY) by sex (NEM: boy, girl) from kerd97, each showing the median and the 25%-75% box; one uses Min-Max whiskers, the other the non-outlier range with extremes]
A box plot, sometimes called a box-and-whisker plot, displays the median, quartiles, and minimum and maximum observations.
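The five numbers that a box plot displays can be computed directly. The sketch below uses the small sample 1, 2, 4, 1 from the earlier examples; `five_number_summary` is a hypothetical helper. Quartile conventions differ between programs: `statistics.quantiles` is used here with its default "exclusive" method, which is one choice among several.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    # Default method="exclusive"; other conventions can give other quartiles.
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)


summary = five_number_summary([1, 2, 4, 1])
print(summary)
```

The box spans the lower and upper quartiles, the line inside the box marks the median, and the whiskers reach to the minimum and maximum in the Min-Max variant of the plot.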
Transformations of data values
Addition, subtraction
• Adding (or subtracting) the same number to each data value in a variable shifts each measure of center by the amount added (subtracted).
• Adding (or subtracting) the same number to each data value in a variable does not change measures of dispersion.
Transformations of data values
Multiplication, division
• Measures of center and spread change in predictable ways when we multiply or divide each data value by the same number.
• Multiplying (or dividing) each data value by the same positive number multiplies (or divides) the measures of center, and measures of spread such as the range and the standard deviation, by that value.
Proof. The effect of linear transformations
Let the transformation be x -> ax + b.
• Mean:
$\overline{ax+b} = \frac{\sum_{i=1}^{n} (a x_i + b)}{n} = \frac{(a x_1 + b) + (a x_2 + b) + \dots + (a x_n + b)}{n} = \frac{a(x_1 + x_2 + \dots + x_n) + nb}{n} = a\bar{x} + b$
• Standard deviation:
$\sqrt{\frac{\sum_{i=1}^{n} ((a x_i + b) - (a\bar{x} + b))^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} (a x_i + b - a\bar{x} - b)^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} (a x_i - a\bar{x})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} a^2 (x_i - \bar{x})^2}{n-1}} = \sqrt{a^2 \, \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = |a| \cdot SD$
Example: the effect of transformations

            Sample data   Addition    Subtraction   Multiplication   Division
            (xi)          (xi + 10)   (xi - 10)     (xi * 10)        (xi / 10)
            1             11          -9            10               0.1
            2             12          -8            20               0.2
            4             14          -6            40               0.4
            1             11          -9            10               0.1
Mean        2             12          -8            20               0.2
Median      1.5           11.5        -8.5          15               0.15
Range       3             3           3             30               0.3
St.dev.     ≈ 1.414       ≈ 1.414     ≈ 1.414       ≈ 14.14          ≈ 0.1414
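The rows of this table can be recomputed with a short script. This is a minimal sketch using the standard `statistics` module; `summarize` is a hypothetical helper that returns the four summary rows for one column.

```python
import statistics

sample = [1, 2, 4, 1]


def summarize(values):
    """Mean, median, range and sample standard deviation of one column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "sd": statistics.stdev(values),
    }


columns = {
    "x": sample,
    "x + 10": [x + 10 for x in sample],
    "x - 10": [x - 10 for x in sample],
    "x * 10": [x * 10 for x in sample],
    "x / 10": [x / 10 for x in sample],
}

summaries = {name: summarize(values) for name, values in columns.items()}
for name, s in summaries.items():
    print(name, {key: round(value, 4) for key, value in s.items()})
```

Addition and subtraction shift the mean and median but leave the range and standard deviation unchanged, while multiplication and division rescale all four.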
Special transformation: standardisation
• The z score measures how many standard deviations a sample element is from the mean. A formula for finding the z score corresponding to a particular sample element x_i is
  $z_i = \frac{x_i - \bar{x}}{s}$, i = 1, 2, ..., n,
  where s is the sample standard deviation.
• We standardize by subtracting the mean and dividing by the standard deviation.
• The resulting variables (z-scores) will have
   - Zero mean
   - Unit standard deviation
   - No unit
Example: standardisation

                Sample data (x_i)   Standardised data (z_i)
                1                   ≈ -0.707
                2                   0
                4                   ≈ 1.414
                1                   ≈ -0.707
Mean            2                   0
St. deviation   ≈ 1.414             1
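The standardised values can be computed by applying the z score formula to each element. The sketch below uses the sample standard deviation (divisor n - 1), as in the earlier slides, and then checks that the z scores have zero mean and unit standard deviation.

```python
import statistics

sample = [1, 2, 4, 1]

mean = statistics.mean(sample)
sd = statistics.stdev(sample)  # sample standard deviation, divisor n - 1

# z score: distance from the mean, measured in standard deviations.
z_scores = [(x - mean) / sd for x in sample]

print([round(z, 3) for z in z_scores])
print(round(statistics.mean(z_scores), 3), round(statistics.stdev(z_scores), 3))
```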
Review questions and exercises
• Problems to be solved by hand calculations
   - ..\Handouts\Problems hand I.doc
• Solutions
   - ..\Handouts\Problems hand I solutions.doc
• Problems to be solved using computer
   - ..\Handouts\Problems comp I.doc
Useful WEB pages
• http://www-stat.stanford.edu/~naras/jsm
• http://www.ruf.rice.edu/~lane/rvls.html
• http://my.execpc.com/~helberg/statistics.html
• http://www.math.csusb.edu/faculty/stanton/m262/index.html