Lecture 3: Statistics
Achim Tresch
UoC / MPIPZ, Cologne
treschgroup.de/OmicsModule1415.html
tresch@mpipz.mpg.de
Clustering = Partitioning into groups
K-means clustering, Example with k=2
[Figure sequence, taken from Padhraic Smyth, University of California Irvine, 2007: scatter plots of the data (X Variable vs. Y Variable), showing the algorithm alternate between updating memberships and updating centers:]
Initial Cluster Centers at Iteration 1
Updated Memberships and Boundary at Iteration 1
Updated Cluster Centers at Iteration 2
Updated Memberships and Boundary at Iteration 2
Updated Cluster Centers at Iteration 3
Updated Memberships and Boundary at Iteration 3
Updated Cluster Centers at Iteration 4
Updated Memberships and Boundary at Iteration 4
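The alternation shown in the figures (assign memberships, then update centers) is the whole algorithm. A minimal sketch in Python/NumPy, with made-up toy data standing in for the plotted example:

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Minimal k-means: alternate membership and center updates."""
    rng = np.random.default_rng(seed)
    # Initialize cluster centers with k randomly chosen data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Membership update: assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Center update: move each center to the mean of its members
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

# Toy data: two Gaussian blobs, loosely mimicking the 2-D example above
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(10, 1, (50, 2)), rng.normal(16, 1, (50, 2))])
labels, centers = kmeans(X, k=2)
print(centers)
```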
Example (Image Compression)
[Figure: original image; axes show pixel coordinates 20–120]
Example (Image Compression)
[Figures: image segmentation with k=2, k=3, and k=8; pseudocolor display]
Hierarchical clustering
(on the blackboard)
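Since the algorithm is only sketched on the board, here is a hedged code counterpart using SciPy (assuming SciPy is available; the choice of average linkage is illustrative, not prescribed by the lecture):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# Agglomerative clustering: successively merge the two closest clusters.
# 'average' linkage uses the mean pairwise distance between clusters.
Z = linkage(X, method="average", metric="euclidean")

# Cut the dendrogram so that 2 clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```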
Classification
[Figure: microarray of Ms. Smith and her expression profile]
The 30,000 properties of Ms. Smith
The expression profile is a list of 30,000 numbers. Some of them reflect her health problem (e.g., cancer); the profile is an image of Ms. Smith's physiology.
How can these numbers tell us (predict) whether Ms. Smith has tumor type A or tumor type B?
Classification
Looking for similarities
Compare Ms. Smith's profile to the profiles of patients with tumor type A and to those of patients with tumor type B.
Training and Prediction
There are patients of known class, the training samples, and patients of unknown class, the "new" samples (such as Ms. Smith). Use the training samples to learn how to predict the "new" samples.
Prediction using one Gene
[Figure: color-coded expression levels of the training samples for one gene, classes A and B]
Depending on where Ms. Smith's expression value falls, she is predicted as type A, as type B, or she is borderline. Which color shade is a good decision boundary?
Optimal decision rule
Use the cutoff with the fewest misclassifications on the training samples, i.e. with the smallest training error.
[Figure: distributions of expression values in type A and in type B, the decision boundary, and the training error (the overlap that is misclassified)]
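A minimal sketch of this rule for a single gene, on simulated data; it assumes, for illustration, that type B tends to have the higher expression:

```python
import numpy as np

def best_cutoff(x, y):
    """Scan all candidate cutoffs on one gene and return the one
    with the fewest misclassifications on the training samples.
    x: expression values, y: class labels (0 = type A, 1 = type B).
    Assumes class 1 tends to have higher expression."""
    candidates = np.sort(x)
    errors = [np.sum((x > c).astype(int) != y) for c in candidates]
    i = int(np.argmin(errors))
    return candidates[i], errors[i]

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(5, 1, 30), rng.normal(7, 1, 30)])  # A, then B
y = np.array([0] * 30 + [1] * 30)
cutoff, err = best_cutoff(x, y)
print(f"cutoff = {cutoff:.2f}, training errors = {err}")
```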
Optimal decision rule
Training set: the decision boundary was chosen to minimize the training error.
Test set: the two distributions of expression values for type A and B will be similar, but not identical, in a set of new cases. We cannot adjust the decision boundary, because we do not know the class of the new samples. Test errors are therefore usually larger than training errors. This phenomenon is called overfitting.
Combining information across genes
Taking means across genes
[Figure: class separation by the top gene alone vs. by the average of the top 10 genes; ALL vs. AML data, Golub et al.]
Combining information across genes
Using a weighted average
y = x1β1 + x2β2 + … + xnβn
x1, …, xn: expression values; β1, …, βn: weights.
With "good weights" you get an improved separation.
Combining information across genes
The geometry of weighted averages
[Figure: a profile (x1, x2) projected onto the weight vector, giving the score y]
Calculating a weighted average is identical to projecting the expression profiles (orthogonally) onto the line defined by the weight vector (of length 1).
Linear decision rules
Hyperplanes
y = x1β1 + x2β2 + … + xnβn = 0
[Figures: a separating line for 2 genes, a separating plane for 3 genes; classes A and B]
Together with an offset β0, the weight vector defines a hyperplane that cuts the data into two groups.
Linear decision rules
Linear Signatures
y = β0 + x1β1 + x2β2 + … + xnβn
If y ≥ 0 → Disease A
If y < 0 → Disease B
[Figure: 2 genes, classes A and B separated by the signature]
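In code, evaluating a linear signature is a dot product plus an offset. A sketch with made-up weights (in practice β0, β1, …, βn are learned from the training samples):

```python
import numpy as np

def linear_signature(x, beta, beta0):
    """Classify one expression profile x with weights beta and offset beta0.
    y = beta0 + x1*beta1 + ... + xn*betan; y >= 0 -> disease A, else B."""
    y = beta0 + x @ beta
    return "A" if y >= 0 else "B"

beta = np.array([0.8, -0.5, 0.3])    # hypothetical learned weights
beta0 = -1.0                          # hypothetical offset
x_new = np.array([2.0, 1.0, 0.5])     # new patient's expression values
print(linear_signature(x_new, beta, beta0))
```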
Nearest Centroids
Linear Discriminant Analysis
Diagonal Linear Discriminant Analysis (DLDA): rescale the axes according to the variances of the genes.
Linear Discriminant Analysis
Discriminant Analysis
The data often show evidence of non-identical covariances of the genes in the two groups. Hence using LDA, DLDA or NC introduces a model bias (= wrong model assumptions, here due to oversimplification).
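To make the connection concrete: DLDA behaves like nearest centroids after rescaling each gene by its pooled within-class variance. A toy sketch (equal class priors are assumed here, which the lecture does not specify):

```python
import numpy as np

def dlda_fit(X, y):
    """Fit per-class centroids and a pooled per-gene variance
    (the diagonal covariance that gives DLDA its name)."""
    classes = np.unique(y)
    centroids = {c: X[y == c].mean(axis=0) for c in classes}
    resid = np.vstack([X[y == c] - centroids[c] for c in classes])
    var = resid.var(axis=0) + 1e-8   # pooled within-class variance per gene
    return centroids, var

def dlda_predict(x, centroids, var):
    """Assign x to the class with the nearest variance-rescaled centroid."""
    scores = {c: np.sum((x - m) ** 2 / var) for c, m in centroids.items()}
    return min(scores, key=scores.get)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(1, 2, (20, 5))])
y = np.array([0] * 20 + [1] * 20)
centroids, var = dlda_fit(X, y)
print(dlda_predict(X[0], centroids, var))
```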
Feature Reduction
Gene Filtering
- Rank genes according to a score
- Choose the top n genes
- Build a signature with these genes only
Still 30,000 weights, but most of them are zero. Note that the data decide which are zero and which are not.
Limitation: you have no(?) chance to find these two genes [shown in the figure, informative only in combination] among 30,000 non-informative genes.
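A sketch of the filtering step on simulated data; the t-like score below is one common choice, not one fixed by the lecture:

```python
import numpy as np

def top_genes(X, y, n=10):
    """Rank genes by a t-like score |mean_A - mean_B| / pooled sd,
    and return the indices of the top-n genes."""
    A, B = X[y == 0], X[y == 1]
    score = np.abs(A.mean(axis=0) - B.mean(axis=0)) / \
            np.sqrt(A.var(axis=0) / len(A) + B.var(axis=0) / len(B) + 1e-8)
    return np.argsort(score)[::-1][:n]

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 1000))     # 40 samples, 1000 genes (mostly noise)
y = np.array([0] * 20 + [1] * 20)
X[y == 1, :5] += 2.0                 # make the first 5 genes informative
print(top_genes(X, y, n=5))          # should mostly recover genes 0..4
```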
Feature Reduction
How many genes?
Gene expression measurements provide ~30,000 individual expression values per sample. Is the question of how many genes to use a biological or a statistical one?
Biology: how many genes are (causally) involved in the biological process?
Statistics: how many genes should we use for classification?
Feature Reduction
Finding the needle in the haystack
A common myth: classification information in gene expression signatures is restricted to a small number of genes, and the challenge is to find them.
Feature Reduction
The Avalanche
[Figure: aggressive lymphomas with and without a MYC breakpoint, MYC-neg vs. MYC-pos]
Cross Validation
The accuracy of a signature on the data it was learned from (the training error) is biased because of the overfitting phenomenon. Validation of a signature requires independent test data: independent validation on a test set yields the test error.
Cross Validation
Generating Test Sets
Split the data randomly into test and training data. Learn the classifier on the training data, and apply it to the test data.
[Figure: each test prediction is marked "ok" or "mistake"]
Cross validation
Problem: The test error cannot be measured directly.
Idea: Generate artificial test data by splitting the data into k partitions of roughly equal size, e.g. k=5 (partitions D1, …, D5).
For each partition in turn: find the regression function / the classifier using the other k−1 partitions; measure the training error TR on those k−1 partitions, and measure the cross validation error CV on the remaining partition. This yields TR1, …, TR5 and CV1, …, CV5.
TR = mean(TR1, …, TR5) (training error)
CV = mean(CV1, …, CV5) (cross validation error)
CV is a good estimate of the test error.
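The same procedure in code, written against a generic fit/predict pair; the stand-in classifier is a nearest-centroid rule, chosen only to make the sketch self-contained:

```python
import numpy as np

def cv_error(X, y, fit, predict, k=5, seed=0):
    """k-fold CV: train on k-1 partitions, measure error on the held-out one."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    cv_errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        cv_errs.append(np.mean(predict(model, X[test]) != y[test]))
    return np.mean(cv_errs)  # CV = mean(CV1, ..., CVk)

# Stand-in classifier: nearest centroid
fit = lambda X, y: {c: X[y == c].mean(axis=0) for c in np.unique(y)}
predict = lambda m, X: np.array(
    [min(m, key=lambda c: np.linalg.norm(x - m[c])) for x in X])

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (25, 10)), rng.normal(1, 1, (25, 10))])
y = np.array([0] * 25 + [1] * 25)
print(f"CV error estimate: {cv_error(X, y, fit, predict):.2f}")
```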
Bootstrap
Problem: The test error cannot be measured directly.
Idea: Generate artificial (sub)samples from the sample at hand.
[Figure: Baron von Münchhausen, pulling himself out of the swamp by his own hair; in the English literature, he does the same using his own bootstraps. Bradley Efron (*1938), Stanford University]
Bootstrap
Idea: Draw a bootstrap (sub)sample B from the whole sample S (N cases), allowing repetitions, so that B also has N cases. Find a regression function fB on the bootstrap sample B. Calculate the bootstrap error E of fB on S−B.
[Figure: population → sample S (N cases) → bootstrap sample B (N cases); prediction error measured on S−B]
Bootstrap
Repeat the above process many times (bootstrap samples B1, …, BK), e.g. K=1000. The test error can be estimated as the average bootstrap error:
V(f) = mean(E1, …, EK)
This is one of the best methods to estimate the test error. Drawback: it is computationally expensive (many regressions need to be calculated).
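A sketch of this estimator with a plain least-squares line as the stand-in regression (the S−B evaluation is what is often called the out-of-bag error):

```python
import numpy as np

def bootstrap_test_error(x, y, K=1000, seed=0):
    """Estimate the test error as the average error on S-B over K bootstrap samples."""
    rng = np.random.default_rng(seed)
    n, errs = len(x), []
    for _ in range(K):
        boot = rng.integers(0, n, size=n)           # sample B: n cases, with repetition
        oob = np.setdiff1d(np.arange(n), boot)      # S - B: the cases not drawn
        if len(oob) == 0:
            continue
        a, b = np.polyfit(x[boot], y[boot], deg=1)  # regression function f_B
        errs.append(np.mean((y[oob] - (a * x[oob] + b)) ** 2))  # error E on S - B
    return np.mean(errs)  # V(f) = mean(E1, ..., EK)

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 1, 50)
print(f"bootstrap test error estimate: {bootstrap_test_error(x, y, K=200):.2f}")
```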
Cross Validation
Estimators of performance have a variance, which can be high. The chance that a meaningless signature produces 100% accuracy on test data is high if the test data include only a few patients.
[Figure: nested 10-fold CV; variance from 100 random partitions]
Bias & Overfitting
The gap between training error and test error becomes wider as models grow more complex. Overfitting is a good reason for not including hundreds of genes in a model, even if they are biologically affected.
Centroid Shrinkage
The shrunken centroid method and the PAM package (Tibshirani et al. 2002)
[Figure: gene-wise class centroids are shrunken toward the overall centroid by an amount D]
Centroid Shrinkage
How much shrinkage is good in PAM (Prediction Analysis of Microarrays)?
Compute the CV performance for several values of D (cross validation: e.g. Train-Train-Select-Train-Train, rotating the "Select" fold). Pick the D that gives you the smallest number of CV misclassifications. This is adaptive model selection; PAM does this routinely.
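The core of centroid shrinkage is soft thresholding of the per-gene centroid differences. A heavily simplified sketch; the real method (Tibshirani et al. 2002) also standardizes by within-class deviations, and D would be chosen by cross validation as described above:

```python
import numpy as np

def shrunken_centroids(X, y, D):
    """Shrink each class centroid toward the overall centroid by amount D.
    Genes whose difference shrinks to zero drop out of the signature."""
    overall = X.mean(axis=0)
    shrunk = {}
    for c in np.unique(y):
        diff = X[y == c].mean(axis=0) - overall
        # Soft thresholding: reduce |diff| by D, set to 0 if it crosses zero
        diff = np.sign(diff) * np.maximum(np.abs(diff) - D, 0.0)
        shrunk[c] = overall + diff
    return shrunk

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 100))
y = np.array([0] * 20 + [1] * 20)
X[y == 1, :3] += 2.0                        # 3 truly informative genes
cents = shrunken_centroids(X, y, D=0.8)
active = np.any([c != X.mean(axis=0) for c in cents.values()], axis=0)
print(f"genes kept in the signature: {np.where(active)[0]}")
```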
Selection Bias
The test data must not be used for gene selection or for adaptive model selection; otherwise the observed (cross-validation-based) accuracy is biased.
Cross Validation
Small D: many genes, poor performance due to overfitting.
High D: few genes, poor performance due to lack of information (underfitting).
The optimal D is somewhere in the middle.
Predictive genes are not causal genes
Assume protein A binds to protein B and inhibits it, and the clinical phenotype is caused by active protein A. Then the predictive information is in the expression of A minus the expression of B. Calling signature genes markers for a certain disease is misleading!
Optimal decision rules
Naïve idea: don't calculate weights based on single-gene scores, but optimize over all possible hyperplanes. Only one of these two problems exists:
Problem 1: no separating line.
Problem 2: many separating lines. (Why is this a problem?)
Optimal decision rules
This problem is related to overfitting ...
more soon
The p>N problem
With the microarray we have more genes than patients. Think about this in three dimensions: there are three genes, two patients with known diagnosis (red and yellow), and Ms. Smith (green). There is always one plane separating red and yellow with Ms. Smith on the yellow side, and a second separating plane with Ms. Smith on the red side.
OK, if all points fall onto one line this does not always work. However, for measured values this is very unlikely and never happens in practice.
The p>N problem
The overfitting disaster
From the data alone we can neither decide which genes are important for the diagnosis, nor give a reliable diagnosis for a new patient. This has little to do with medicine; it is a geometrical problem. If you find a separating signature, it does not mean (yet) that you have a top publication ... in most cases it means nothing.
Finding meaningful signatures
There always exist separating signatures caused by overfitting (meaningless signatures). Hopefully there is also a separating signature caused by a disease mechanism, or at least one that is predictive for the disease (a meaningful signature). We need to learn how to find and validate meaningful signatures.
Separating hyperplanes
Which hyperplane is the best?
Support Vector Machines (SVMs)
Fat planes: with an infinitely thin plane the data can always be separated correctly, but not necessarily with a fat one. Again, if a large-margin separation exists, chances are good that we have found something relevant.
Large Margin Classifiers
Support Vector Machines (SVMs)
Maximal margin hyperplane: there are theoretical results that the size of the margin correlates with the test (!) error (V. Vapnik). SVMs are thus not only optimized to fit the training data, but directly for predictive performance.
Support Vector Machines (SVMs)
Non-separable training set: penalize each error by its distance to the hyperplane, multiplied by a parameter c. This balances over- and underfitting.
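A hedged sketch with scikit-learn, assuming it is available; its parameter C plays the role of the error penalty c above (large C punishes margin violations hard, small C tolerates them):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(2.5, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)   # slightly overlapping classes

# Linear large-margin classifier; C balances over- and underfitting
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: training accuracy = {clf.score(X, y):.2f}, "
          f"support vectors = {len(clf.support_)}")
```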
External Validation and Documentation
Documenting a signature is conceptually different from giving a list of genes, although a gene list is what most publications give you. In order to validate a signature on external data, or to apply it in practice:
- All model parameters need to be specified.
- The scale of the normalized data to which the model refers needs to be specified.
Establishing a signature
Cross validation: split the data into training and test data.
Training data only (machine learning): select genes, find the optimal number of genes, learn model parameters.
External validation.
Cookbook for good classifiers
1. Decide on your diagnosis model (PAM, SVM, etc.) and don't change your mind later on.
2. Split your profiles randomly into a training set and a test set.
3. Put the data in the test set away ... far away!
4. Train your model only using the data in the training set (select genes, define centroids, calculate normal vectors for large margin separators, perform adaptive model selection ...). Don't even think of touching the test data at this time.
5. Apply the model to the test data ... don't even think of changing the model at this time.
6. Do steps 1-5 only once and accept the result ... don't even think of optimizing this procedure.
Acknowledgements
Rainer Spang, University of Regensburg
Florian Markowetz, Cancer Research UK, Cambridge
Regression
(Estimation of one quantitative endpoint by a function of the covariates)
Population: unknown functional relation between Yi and Xi (up to noise εi).
Sample: fit a regression function f, i.e. Yi = f(Xi) + εi.
[Figure: a sample of individuals drawn from the population]
Regression
Specify a (parametric) family of functions which describes the type of dependence you want to model.
E.g. linear dependence: f(x) = ax + b
Quadratic dependence: f(x) = ax² + bx + c
[Figure: scatter plot of the data with candidate curves]
Goodness-of-fit measures
Specify the loss function = the measure for the goodness of fit = the target function to be minimized. E.g. quadratic loss (= residual sum of squares, RSS).
[Figure: scatter plot with regression line Y = f(X); for a data point (Xj, Yj), the true value is Yj, the prediction is f(Xj), and their difference is the residual]
RSS = Σj (jth true value − jth predicted value)²
    = Σj (Yj − f(Xj))²
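Spelled out in code, with toy numbers and a hypothetical prediction function f(x) = 2x:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y_true = np.array([2.1, 3.9, 6.2, 7.8])
f = lambda x: 2.0 * x                # hypothetical regression function
residuals = y_true - f(x)            # Yj - f(Xj)
rss = np.sum(residuals ** 2)         # RSS = sum of squared residuals
print(f"RSS = {rss:.3f}")
```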
Goodness-of-fit measures
Specify a loss function L which accounts for the difference of the predictions from the observed "true" values.
Ex.: y = true value, f(x) = prediction.
For continuous data:
L(y, f(x)) = (y − f(x))²  (quadratic loss)
L(y, f(x)) = |y − f(x)|  (linear loss)
For binary data:
L(y, f(x)) = 0 if y = f(x), 1 if y ≠ f(x)  (0-1 loss)
Regression
Find the function from the specified family of functions (i.e., find the parameters defining this function) which fits the data best.
[Figure: four candidate fits to the same data, with RSS = 8.0, RSS = 1.1, RSS = 1.7, and RSS = 3]
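Minimizing RSS over the linear or quadratic family is a one-liner with NumPy (a sketch on simulated data; np.polyfit minimizes exactly the RSS defined above):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 60, 40)
y = 1.5 * x + 5 + rng.normal(0, 5, 40)

# Linear family f(x) = ax + b: find a, b minimizing RSS
a, b = np.polyfit(x, y, deg=1)
rss = np.sum((y - (a * x + b)) ** 2)
print(f"a = {a:.2f}, b = {b:.2f}, RSS = {rss:.1f}")

# Quadratic family f(x) = ax^2 + bx + c: more flexible, smaller RSS
coeffs = np.polyfit(x, y, deg=2)
rss2 = np.sum((y - np.polyval(coeffs, x)) ** 2)
print(f"quadratic RSS = {rss2:.1f}")
```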
Univariate linear Regression
Ex.: Relation between body weight and brain weight.
[Figure: body weight vs. brain weight of 62 mammals]
[Figure: log10(body weight) vs. log10(brain weight) of the 62 mammals, with a fitted regression line]
[Figure: residuals of the fit; a notable outlier is Chironectes minimus (the water opossum)]
Linear Regression
Goals
1. Find a good predictor of the endpoint Y, given the covariates.
2. Identify covariates that are relevant (have prognostic value) for the prediction of Y.
Ex.: An increase of the body weight by 1 unit (on the log scale) results in an average increase of the (log) brain weight by 0.75 units. Body weight seems to exert a major influence on brain weight.
Univariate nonlinear Regression
Nonlinear dependencies can also be modeled by a regression.
"True" regression function: Y = aX² + bX + c
Multiple Regression
Univariate regression: only one covariate.
Multiple regression: several (up to thousands of) covariates.
Example:
Y = oxygen consumption (mMol O2/min)
X1 = body temperature
X2 = physical performance
Multivariate linear regression function: Y = a1·X1 + a2·X2 + c
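A sketch of the two-covariate fit by ordinary least squares; the simulated coefficients are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(37.0, 0.5, n)        # body temperature
x2 = rng.uniform(0, 10, n)           # physical performance
y = 0.8 * x1 + 1.5 * x2 + 3.0 + rng.normal(0, 0.5, n)  # oxygen consumption

# Design matrix with a column of ones for the intercept c
A = np.column_stack([x1, x2, np.ones(n)])
(a1, a2, c), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"Y ≈ {a1:.2f}·X1 + {a2:.2f}·X2 + {c:.2f}")
```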
From the treasury of statistics
Olympia 2156: Women win the men's 100m race
Nature 431, 525 (30 September 2004) | doi:10.1038/431525a
Athletics: Momentous sprint at the 2156 Olympics?
Andrew J. Tatem, Carlos A. Guerra, Peter M. Atkinson & Simon I. Hay
"The 2004 Olympic women's 100-metre sprint champion, Yuliya Nesterenko, is assured of fame and fortune. But we show here that — if current trends continue — it is the winner of the event in the 2156 Olympics whose name will be etched in sporting history forever, because this may be the first occasion on which the women's race is won in a faster time than the men's event."
[Figure: world's best 100m times of each year (women, men), 1900–2000, with linear trends extrapolated to 2156; axes: year, time in s]
From the treasury of statistics
[Figure repeated, with the extrapolation extended to 2156 and 2385]
Note: Interpolation is much more reliable than extrapolation!
One needs to be clear about the range of values to which the regression model can be applied sensibly.
Training vs. test error
Problem: The loss function, if applied to the training data, measures the wrong thing.
Sample: Yi = f(Xi) + εi. The training error T(f) measures how well the regression function applies to the sample.
Population: does Yi ≈ f(Xi) hold there as well? The test error V(f) measures how well the regression function applies to the population.
Biases in RNA-Seq data
Aim: to provide you with a brief overview of the literature about biases in RNA-seq data, so that you become aware of this potential problem.
Bias and Variance
[Figure: target diagrams illustrating strong vs. weak noise and bias vs. no bias]
Biases in RNA-Seq data
• Experimental (and computational) biases affect expression estimates and, therefore, subsequent data analysis:
– Differential expression analysis
– Study of alternative splicing
– Transcript assembly
– Gene set enrichment analysis
– Other downstream analyses
• We must attempt to avoid, detect and correct these biases.
Sources of Bias and Variance
Systematic (Bias):
• Similar effects on many (all) data of one sample
• The correction can be estimated and removed from the data (normalization)
Stochastic (Variance):
• Effects on single data points of a sample
• The correction cannot be estimated; the noise can only be quantified and taken into account (error model)
Examples: efficiency of RNA extraction, reverse transcription, background fluorescence, amplification efficiency, tissue contamination, DNA quality, RNA degradation, ACTG signal detection.
Bias-Variance Tradeoff
Which factors influence the quality of predictions?
Ex.: binary classification of points in the plane.
[Figure: a flexible classifier vs. a stable one; too much flexibility leads to overfitting, too much stability to bias. From Hastie, Tibshirani, Friedman: The Elements of Statistical Learning]
Bias-Variance Tradeoff
[Figure: training error T(f) and test error V(f) as a function of model complexity]
Even though increasing flexibility ("complexity") reduces the training error, the test error rises again at some point. This phenomenon is called overfitting.
From: Hastie, Tibshirani, Friedman. The Elements of Statistical Learning
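The tradeoff can be reproduced in a few lines: fit polynomials of increasing degree ("flexibility") on simulated data and compare training and held-out error:

```python
import numpy as np

rng = np.random.default_rng(0)
x_tr, x_te = rng.uniform(0, 1, 30), rng.uniform(0, 1, 200)
f_true = lambda x: np.sin(2 * np.pi * x)
y_tr = f_true(x_tr) + rng.normal(0, 0.3, 30)
y_te = f_true(x_te) + rng.normal(0, 0.3, 200)

for deg in (1, 3, 9, 15):
    coef = np.polyfit(x_tr, y_tr, deg)
    tr = np.mean((y_tr - np.polyval(coef, x_tr)) ** 2)   # training error T(f)
    te = np.mean((y_te - np.polyval(coef, x_te)) ** 2)   # test error V(f)
    print(f"degree {deg:2d}: T(f) = {tr:.3f}, V(f) = {te:.3f}")
```

As the degree grows, the training error keeps falling while the test error eventually rises: overfitting in miniature.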