Statistical challenges in micro-array data
Hans C. van Houwelingen
Department of Medical Statistics
Leiden University Medical Center, The Netherlands
email: jcvanhouwelingen@lumc.nl
jcvh, London , 5 February 2003, page 1
Key paper
Golub TR, Slonim DK, Tamayo P, et al. ,
Molecular classification of cancer: Class discovery and class
prediction by gene expression monitoring,
SCIENCE 286, 531-537, 1999
Task Force Bio-informatics Leiden
Leiden University Medical Center
Human Genetics
Molecular Cell Biology
Pathology
Medical Statistics
Faculty of Mathematics and Natural Sciences
Mathematical Statistics
Collaboration between Medical
Statistics and Mathematical
Statistics led to
Micro-arrays are used for
measuring gene expression profiles
Affymetrix (single colour)
cDNA (two colours)
measuring gene amplification/deletion
Comparative Genome Micro-array (single colour).
In each array experiment measurements are obtained for very
many genes (500-25000).
Practical problems.
techniques are based on hybridization and optical reading,
hence there are problems
to distinguish the signal from the background
to carry out a proper background correction.
The measurements can be far from perfect, leading to a lot of
missing data.
absolute intensity levels vary from array to array (and from
colour to colour).
to make the data comparable, they have to be normalized.
the simplest way is by setting the geometric mean =1
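A minimal sketch of this normalization, assuming a NumPy array of positive raw intensities with genes in rows and arrays in columns (the function name and the toy numbers are illustrative, not from the slides). Setting the geometric mean of each array to 1 is the same as centring the log-intensities of each array at 0; missing values are skipped.

```python
import numpy as np

def normalize_geometric_mean(intensity):
    """Scale each array (column) so that its geometric mean equals 1.

    intensity: 2-D array, genes in rows, arrays in columns; NaN marks missing spots.
    Returns the normalized intensities on the original scale.
    """
    log_int = np.log(intensity)
    # Mean of the logs per array, ignoring missing values.
    log_gm = np.nanmean(log_int, axis=0, keepdims=True)
    return np.exp(log_int - log_gm)

# Toy data: 4 genes on 2 arrays; the second array is twice as bright overall.
raw = np.array([[100.0, 200.0],
                [400.0, 800.0],
                [50.0, np.nan],
                [200.0, 400.0]])
norm = normalize_geometric_mean(raw)
log_means = np.nanmean(np.log(norm), axis=0)   # both entries are 0 up to rounding
```

After this step the arrays share a common overall level, so ln(intensity) values can be compared between arrays.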
Sources of random variation at different levels:
between pixels within spots,
between spots within arrays,
between arrays within samples,
between samples within individuals,
between individuals within groups.
We ignore the variation within the array and assume that each
array gives one measurement per gene, usually expressed as
ln((relative) intensity).
Data sets/ designs
Micro-arrays are still quite costly per array (but not per gene).
Large data sets have about 100 arrays
data sets with only a few arrays are very common
Study designs depend on the field of application
(plants/animals/human).
In non-human applications material is often pooled to
reduce the number of arrays.
In medical research pooling of patients is less natural.
Simplified scheme
X: gene-expression data on many genes (for few patients)
Y: patient characteristic
Designs/Questions/Analyses
1. Sequence of observations X1,...,Xn
Cluster analysis
2. Factorial designs Y → X
Multivariate Analysis of Variance
Searching differentially expressed genes
3. Classification and prediction X → Y
Discriminant analysis/ Multiple regression
Searching influential genes
Remark about the multiplicity problem
Having many genes has advantages and disadvantages.
disadvantage: multiple testing problem. There is a big chance of
false positives. Bonferroni-type corrections could be far too
conservative and too restrictive.
advantage: all genes give similar data. Information from the other
genes can be used to make inference about one particular gene. That
makes micro-array data the ideal playing field for empirical Bayes
methodology.
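To make the size of the multiplicity problem concrete, here is a small worked calculation. It assumes G = 12488 genes (the size of the Affymetrix example later in the talk) and a conventional per-gene level of 0.05; both numbers are only illustrative choices.

```python
G = 12488          # number of genes (size of the Affymetrix example in this talk)
alpha = 0.05       # per-gene significance level (an assumed, conventional choice)

# If no gene is differentially expressed, each test is a false positive with
# probability alpha, so the expected number of false positives is G * alpha.
expected_false_positives = G * alpha

# Bonferroni keeps the chance of at least one false positive below alpha by
# testing each gene at level alpha / G.
bonferroni_level = alpha / G

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"Bonferroni per-gene level: {bonferroni_level:.2e}")
```

With only a handful of arrays per group, a per-gene level this small is practically unreachable, which is why the correction is called too restrictive here.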
General challenges
Computational
not really, as long as the number of experiments is
relatively small
however, standard software has trouble coping with few
rows and many columns
Conceptual
what do we want ?
what can be delivered ?
Specific challenges
Structured cluster analysis (design 1)
Increasing the degrees of freedom when searching for
differentially expressed genes (design 2)
Finding reliable predictors (design 3)
Finding influential genes (design 3)
Design 1. No structure between experiments
Typical data
$X_{ij}$ = ln(ratio experimental/control), i = 1,...,G, j = 1,...,J
i stands for the gene
j stands for the different experiments.
The "natural" value is $X_{ij} = 0$.
Natural "cut-off": ln(2) (two-fold change).
For J=1, classification into under-expressed, normal and over-expressed can be based on
one-dimensional cluster analysis or latent class models.
Example of histogram over all genes
For J>1 cluster analysis (unsupervised learning) can be used to
cluster the genes. Statistical challenges are:
to quantify the uncertainty of the clusters and cluster
membership by proper statistical modelling.
to limit the possible cluster profiles by proper modelling (pre-structured clustering)
Example: CG (comparative genomic) micro-arrays
Used in diagnosis of Down syndrome.
21 individuals, 448 genes. Data per “individual” look like
[Figure: ln(ratio) along the genome for one individual (009000c), with regions labelled amplified, normal and deleted.]
Data per gene look like
[Figure: ln(ratio) for one gene across the patients; x-axis: patient, y-axis: ln(ratio).]
Interest in clustering genes
Probabilistic clustering
True clusters overlap.
Each cluster has its own mean
Variation can depend on sample (patient) (not on cluster)
samples within cluster independent
So, model within cluster k:
$X_{ij} \sim N(\mu_{kj}, \sigma_j^2)$
Statistical procedure estimates
cluster means
standard deviations
prior probabilities of clusters
posterior probabilities of genes
Easily fitted by EM, missing data no problem
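A minimal sketch of such an EM fit, assuming the model stated above (cluster-specific means per sample, one variance per sample shared by all clusters, independence between samples) and assuming that missing values are missing at random, so that they can simply be left out of the likelihood. The function name, the starting values and the simulated data are illustrative and not taken from the talk.

```python
import numpy as np

def fit_mixture_em(X, K, n_iter=100):
    """EM for a K-cluster normal mixture over genes (rows of X).

    Within cluster k, X[i, j] ~ N(mu[k, j], sigma2[j]), independently over
    samples j. NaN entries are treated as missing and skipped.
    Assumes every gene has at least one observed value.
    """
    obs = ~np.isnan(X)                      # mask of observed entries
    X0 = np.where(obs, X, 0.0)              # missing values zeroed; masked out below
    # Starting values: split the genes into K groups by their average level.
    order = np.argsort(np.nanmean(X, axis=1))
    mu = np.vstack([np.nanmean(X[g], axis=0) for g in np.array_split(order, K)])
    sigma2 = np.nanvar(X, axis=0)
    prior = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: posterior cluster probabilities per gene, observed entries only.
        diff2 = (X0[:, None, :] - mu[None, :, :]) ** 2           # G x K x J
        loglik = -0.5 * (np.log(2 * np.pi * sigma2) + diff2 / sigma2)
        logpost = np.log(prior) + (loglik * obs[:, None, :]).sum(axis=2)
        logpost -= logpost.max(axis=1, keepdims=True)
        post = np.exp(logpost)
        post /= post.sum(axis=1, keepdims=True)                  # G x K
        # M-step: weighted cluster means, pooled per-sample variances, priors.
        w = post[:, :, None] * obs[:, None, :]                   # G x K x J weights
        mu = (w * X0[:, None, :]).sum(axis=0) / w.sum(axis=0)
        diff2 = (X0[:, None, :] - mu[None, :, :]) ** 2
        sigma2 = (w * diff2).sum(axis=(0, 1)) / obs.sum(axis=0)
        prior = post.mean(axis=0)
    return prior, mu, sigma2, post

# Simulated example: 200 genes, 5 samples, two clusters, 5% missing values.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1], [120, 80])
X = rng.normal(loc=labels[:, None] * 1.0, scale=0.1, size=(200, 5))
X[rng.random(X.shape) < 0.05] = np.nan
prior, mu, sigma2, post = fit_mixture_em(X, K=2, n_iter=50)
```

The four returned quantities correspond to the four items listed above: prior probabilities of the clusters, cluster means, per-sample variances (the squares of the standard deviations) and posterior probabilities of the genes.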
Cluster 1: prior prob. 0.42 (mean ± 2 st. dev.)
Cluster 2: prior prob. 0.22
Cluster 3: prior prob. 0.24
Cluster 4: prior prob. 0.12
All means in one picture
Structured probabilistic clustering.
Put prior/biological/genetic/medical knowledge in the model
Proposal for this kind of data:
$\mu_{kj} = \alpha_k \beta_j$ with $\alpha_k = 0$ for the cluster of normal genes
This is a one-factor latent-class model.
Cluster means for structured model
cluster          prob.     ln(ratio) (on relative scale)
deleted          0.2477    -0.9319
normal           0.6180     0
amplified        0.1321     2.2988
overamplified    0.0022     6.2428
Posteriors along the genome
[Figure: posterior probability of each cluster (deleted, normal, amplified, overamplified) against distance along the chromosome; panels for chromosomes 1, 7 and 20.]
Points of discussion
check the fit of the model
include dependencies in cluster status along the chromosome
(hidden (semi-) Markov model)
Design 2. Searching differentially expressed genes
example: Affymetrix array
12488 genes
9 experiments
3 wildtype
2 MDX
3 HDMD
1 HDMDxMDX
Question: Which genes show differences between groups?
Data for some of the genes
[Figure: log(intensity) per experiment for a few genes; x-axis: experiment, y-axis: log(intensity).]
Typical data: $X_{icj}$
i = 1,...,G stands for the genes
c = 1,...,C for the different conditions
j = 1,...,$J_c$ for the repetitions within the conditions.
The usual case is C = 2 with $J_1$ and $J_2$ quite small (in the range from 2 to 4).
Let $\mu_{ic} = E[X_{icj} \mid i, c]$ and $\sigma_i^2 = \mathrm{var}[X_{icj} \mid i, c]$
Due to the small sample sizes, it is impossible to carry out
significance tests per gene.
Much can be gained from modeling the variation of $(\mu_{i1}, \ldots, \mu_{iC})$
and $\sigma_i^2$ over all genes.
A simple model that relates $\sigma_i^2$ to $\mu_{i1}$ can help dramatically.
Back to the example. First attempt: F-test per gene.
[Figure: histogram of the p-values per gene; mean = 0.517, std. dev. = 0.28, N = 12488.]
[Figure: Q-Q plot of the p-value per gene against the standardized rank of the p-value per gene.]
It is nearly perfectly uniform, no proof of any significance.
Main reason: too few degrees of freedom in the denominator.
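A minimal sketch of the per-gene F-test, run on simulated data without any group effect (not the real arrays). It assumes the layout of the example, 9 arrays in 4 groups of sizes 3, 2, 3 and 1, which leaves only 9 - 4 = 5 degrees of freedom for the denominator. The function and variable names are illustrative.

```python
import numpy as np
from scipy import stats

def f_test_per_gene(X, groups):
    """One-way ANOVA F-test for every gene (row of X).

    X: G x N array of log-intensities; groups: length-N array of group labels.
    Returns the p-values and the within-group variance per gene.
    """
    groups = np.asarray(groups)
    labels = np.unique(groups)
    N, C = X.shape[1], len(labels)
    grand_mean = X.mean(axis=1, keepdims=True)
    ss_between = np.zeros(X.shape[0])
    ss_within = np.zeros(X.shape[0])
    for g in labels:
        Xg = X[:, groups == g]
        mean_g = Xg.mean(axis=1, keepdims=True)
        ss_between += Xg.shape[1] * (mean_g - grand_mean)[:, 0] ** 2
        ss_within += ((Xg - mean_g) ** 2).sum(axis=1)
    df_between, df_within = C - 1, N - C
    var_within = ss_within / df_within
    F = (ss_between / df_between) / var_within
    return stats.f.sf(F, df_between, df_within), var_within

# Group sizes 3, 2, 3 and 1, as in the example: 9 arrays, 4 groups.
groups = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3])
rng = np.random.default_rng(0)
X_null = rng.normal(size=(12488, 9))    # simulated data without any group effect
p_values, var_within = f_test_per_gene(X_null, groups)
df_within = len(groups) - len(np.unique(groups))   # 9 arrays - 4 groups = 5
```

On such null data the p-values are uniform by construction. With so few denominator degrees of freedom the per-gene variance estimate is very noisy, so even real effects of moderate size would barely move the histogram away from uniform.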
Look at the distribution of the within-group variances.
There is some overdispersion: CV = 1.1 instead of CV = $\sqrt{2/5} \approx 0.63$
[Figure: histogram of the variance within group per gene; mean = 0.10, std. dev. = 0.11, N = 12488.]
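The reference value 0.63 follows from the chi-square distribution of a variance estimate. A short derivation, assuming normally distributed measurements with one common true variance $\sigma^2$ for all genes and $\nu = 9 - 4 = 5$ within-group degrees of freedom (9 arrays in 4 groups); the symbols $s^2$ and $\nu$ are introduced here for illustration:

```latex
\[
\frac{\nu s^2}{\sigma^2} \sim \chi^2_\nu
\quad\Rightarrow\quad
E[s^2] = \sigma^2, \qquad \operatorname{var}[s^2] = \frac{2\sigma^4}{\nu},
\]
\[
\mathrm{CV}(s^2) = \frac{\sqrt{\operatorname{var}[s^2]}}{E[s^2]}
= \sqrt{\frac{2}{\nu}} = \sqrt{\frac{2}{5}} \approx 0.63 .
\]
```

An observed CV of 1.1 (std. dev. 0.11 against mean 0.10) is well above this, which suggests that the true variances differ between genes.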
Second attempt: using the average variance in the denominator
[Figure: histogram of the p-value per gene using the average variance; mean = 0.60, std. dev. = 0.33, N = 12488.]
[Figure: p-value per gene using the average variance against the rank of the p-value.]
Looks like the p-values are computed from the wrong distribution.
Third attempt: relating $\sigma_i^2$ to $\mu_{i1}$.
[Figure: scatter plot of the variance within groups (log scale) against the mean of the wild-types; Rsq = 0.3950.]
Using predicted variances instead of the mean variance:
[Figure: histogram of the p-value based on the estimated variance; mean = 0.453, std. dev. = 0.32, N = 12488.]
[Figure: p-value based on the estimated variance (PHAT) against the rank of the p-value.]
This looks really promising!
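A minimal sketch of the third attempt. It assumes a straight-line relation between the log of the within-group variance and the mean of the reference (wild-type) group, and it treats the predicted variance as known, so that the test statistic is referred to a chi-square distribution. Both choices, the rescaling step and all names are illustrative assumptions; the slides do not specify the exact form of the model or of the reference distribution. The data are simulated.

```python
import numpy as np
from scipy import stats

def predicted_variance_test(X, groups, ref_group):
    """Per-gene test of group differences with a variance predicted from the mean.

    log(within-group variance) is regressed on the mean of the reference group
    over all genes; the fitted variance replaces the per-gene estimate.
    """
    groups = np.asarray(groups)
    labels = np.unique(groups)
    grand_mean = X.mean(axis=1, keepdims=True)
    ss_between = np.zeros(X.shape[0])
    ss_within = np.zeros(X.shape[0])
    for g in labels:
        Xg = X[:, groups == g]
        mean_g = Xg.mean(axis=1, keepdims=True)
        ss_between += Xg.shape[1] * (mean_g - grand_mean)[:, 0] ** 2
        ss_within += ((Xg - mean_g) ** 2).sum(axis=1)
    var_within = ss_within / (X.shape[1] - len(labels))
    ref_mean = X[:, groups == ref_group].mean(axis=1)
    # Straight-line fit on the log scale (an assumed form for the relation).
    slope, intercept = np.polyfit(ref_mean, np.log(var_within), deg=1)
    var_pred = np.exp(intercept + slope * ref_mean)
    # The mean of log(s^2) underestimates log(sigma^2); rescale so that the
    # predicted variances have the same average as the observed ones.
    var_pred *= var_within.mean() / var_pred.mean()
    # Treating the predicted variance as known gives a chi-square reference.
    chi2 = ss_between / var_pred
    return stats.chi2.sf(chi2, df=len(labels) - 1), var_pred

# Simulated null data in which the variance decreases with the expression level.
rng = np.random.default_rng(0)
groups = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3])
mu = rng.uniform(0, 5, size=5000)
sigma2_true = np.exp(-1.0 - 0.5 * mu)
X_null = mu[:, None] + rng.normal(size=(5000, 9)) * np.sqrt(sigma2_true)[:, None]
p_values, var_pred = predicted_variance_test(X_null, groups, ref_group=0)
```

Because the variance is borrowed from all genes, the denominator is far more stable than the 5-degrees-of-freedom estimate per gene. The price is that the test is only as good as the assumed variance model, which is why the histogram and Q-Q plot of the p-values are worth checking.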
The interesting issue is that the collection of p-values per gene
gives a good impression of the validity of the test.
Alternative to the procedure above:
Variance stabilizing transform à la Huber et al.
Further improvement
Hierarchical model for variation in standard deviation à la
Baldi and Long (work in progress)
Design 3. Classification and prediction
Data structure:
expression data Xij,
i=1,..,G again stands for the gene
j=1,..,J for the individual (patient)
outcome Yj per patient
Wanted: a model to predict Y from X
Problem: the high dimension G of the design matrix
Example: Golub data set with dichotomous Y (ALL/AML)
J=38 individuals
G=3571 genes (“bad” genes thrown away)
Histogram of correlations of outcome Y with all gene expressions in
the Golub data-set
[Figure: histogram of the correlations; mean = 0.03, std. dev. = 0.29, N = 3571.]
Natural model: logistic regression
$\ln\left(\frac{\pi(X)}{1 - \pi(X)}\right) = \alpha + (X - \bar{X})\beta$ with $\pi(X) = P(Y = 1 \mid X)$
If you fit this to the complete data-set you get
$\hat{\beta}_i = \infty$ or $\hat{\beta}_i = -\infty$, and $\hat{\pi}(X) = Y$
Way out
penalization (Eilers et al.)
empirical Bayes
Empirical Bayes approach
$\beta_i$ i.i.d. $N(0, \tau^2)$, with unknown remaining parameters $\alpha$ and $\tau^2$
Applications:
simple test for no effect: $\tau^2 = 0$ vs $\tau^2 > 0$
regularized estimate of $\beta$
Score test for effect:
$\tau^2 = 0$ vs $\tau^2 > 0$ (Goeman et al.)
Test statistic $Q = (Y - \bar{Y})^\top (X - \bar{X})(X - \bar{X})^\top (Y - \bar{Y})$
P-value based on the distribution of Y given the X's (and $\bar{Y}$)
P-value easily obtained by random permutation
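A minimal sketch of this permutation test, assuming X is stored as a J x G matrix with patients in rows and genes in columns, so that the centred cross-product matrix is only J x J. The function name, the number of permutations and the simulated data (with the dimensions of the Golub example, but not its values) are illustrative.

```python
import numpy as np

def global_score_test(X, y, n_perm=1000, seed=0):
    """Permutation p-value for Q = (y - ybar)' (X - Xbar)(X - Xbar)' (y - ybar).

    X: J x G matrix (patients in rows, genes in columns); y: length-J outcome.
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)            # centre every gene over the patients
    R = Xc @ Xc.T                      # J x J matrix, cheap even when G is large
    yc = y - y.mean()
    q_obs = yc @ R @ yc
    # Permuting y keeps the X's and ybar fixed, as the reference distribution requires.
    q_perm = np.empty(n_perm)
    for b in range(n_perm):
        yp = rng.permutation(yc)
        q_perm[b] = yp @ R @ yp
    # Add-one correction so that the estimated p-value is never exactly zero.
    p_value = (1 + np.sum(q_perm >= q_obs)) / (n_perm + 1)
    return q_obs, p_value

# Simulated data: 38 patients, 3571 genes, 100 of which differ between classes.
rng = np.random.default_rng(1)
y = np.repeat([0.0, 1.0], [27, 11])
X = rng.normal(size=(38, 3571))
X[y == 1, :100] += 1.5
q_obs, p_value = global_score_test(X, y)
```

The statistic equals the sum over genes of the squared covariance-type term between each gene and the outcome, so it is large when many genes are associated with Y, whatever the signs of the associations.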
Result: awfully significant
(Graphs show permutation distribution of Q and position of the
observed Q)
Estimating $\alpha$, $\tau^2$
Marginal likelihood:
$L(\alpha, \tau^2) = \int L(\alpha, \beta_1, \ldots, \beta_G)\, f(\beta_1, \beta_2, \ldots, \beta_G \mid \tau^2)\, d\beta_1 \cdots d\beta_G$
Integrals very cumbersome to compute.
Approximations far from perfect.
Once the parameters $\alpha$, $\tau^2$ have been estimated, the $\beta$'s are
obtained from the posterior distribution
Posterior mode = penalized likelihood estimator, minimizes
$-\ln(L(\alpha, \beta_1, \ldots, \beta_G)) + 0.5 \sum_i \beta_i^2 / \tau^2$
Posterior distribution gives impression of precision of
linear predictor $X\beta$ (doable)
individual $\beta_i$'s (hopeless)
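A minimal sketch of the posterior mode as a penalized (ridge) logistic regression, assuming the prior variance $\tau^2$ is given rather than estimated from the marginal likelihood. The value of `tau2`, the general-purpose optimizer and the simulated data are illustrative assumptions, not the procedure used in the talk.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def penalized_logistic(X, y, tau2):
    """Posterior mode of (alpha, beta) under beta_i ~ N(0, tau2).

    Minimizes  -ln L(alpha, beta) + 0.5 * sum(beta**2) / tau2  for the logistic
    model  logit P(y = 1 | x) = alpha + (x - xbar) @ beta.
    X: J x G matrix (patients in rows); y: 0/1 vector of length J.
    """
    Xc = X - X.mean(axis=0)
    G = Xc.shape[1]

    def objective(theta):
        alpha, beta = theta[0], theta[1:]
        eta = alpha + Xc @ beta
        # Negative log-likelihood: sum of log(1 + exp(eta)) - y * eta, computed stably.
        nll = np.sum(np.logaddexp(0.0, eta) - y * eta)
        penalty = 0.5 * np.sum(beta ** 2) / tau2
        resid = expit(eta) - y                    # fitted probability minus outcome
        grad = np.concatenate(([resid.sum()], Xc.T @ resid + beta / tau2))
        return nll + penalty, grad

    res = minimize(objective, np.zeros(G + 1), jac=True, method="L-BFGS-B")
    return res.x[0], res.x[1:]

# Simulated data: 38 patients, 500 genes, the first 20 genes carry a class effect.
rng = np.random.default_rng(0)
y = np.repeat([0.0, 1.0], [27, 11])
X = rng.normal(size=(38, 500))
X[y == 1, :20] += 1.0
alpha_hat, beta_hat = penalized_logistic(X, y, tau2=0.01)
p_hat = expit(alpha_hat + (X - X.mean(axis=0)) @ beta_hat)
```

Unlike the unpenalized fit, the estimates stay finite even though there are many more genes than patients, and the fitted probabilities stay strictly between 0 and 1. The individual coefficients remain poorly determined, in line with the remark that the linear predictor is doable and the individual coefficients are hopeless.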
Confidence intervals for the linear predictor look O.K.
Posterior modes look messy (much better for peaked priors)
Posterior Z-values look hopeless (independent of prior)
Big problem: Lack of any prior structure among the different genes
This makes selection of influential genes hopeless
Big challenge: Bring structure in the 20000 genes using either
biology: pathways and the like
statistics: meta-analysis on all data sets from the same platform
References
Baldi P, Long AD, Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes, BIOINFORMATICS 17, 509-519, 2001
Eilers PHC, Boer J, van Ommen GJB, van Houwelingen JC, Classification of microarrays with penalized logistic regression, Proceedings of SPIE, Volume 4266, Progress in Biomedical Optics and Imaging 2, 187-198, 2001
Goeman JJ, van de Geer SA, de Kort F, van Houwelingen JC, A global score test for differential expression of groups of genes, preprint, 2003
Golub TR, Slonim DK, Tamayo P, et al., Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring, SCIENCE 286, 531-537, 1999
Huber W, v. Heydebreck A, Sueltmann H, Poustka A and Vingron M, Variance stabilization applied to microarray data calibration and to the quantification of differential expression, Proceedings of ISMB 2002, Bioinformatics 18, Suppl 1:S96-S104, 2002
Lee MLT, Kuo FC, Whitmore GA, et al., Importance of replication in microarray gene expression studies: Statistical methods and evidence from repetitive cDNA hybridizations, P NATL ACAD SCI USA 97, 9834-9839, 2000
de Menezes RX, Boer JM, van Houwelingen HC, Microarray data analysis: hierarchical modelling to handle heteroscedasticity, preprint, 2003
Tusher VG, Tibshirani R, Chu G, Significance analysis of microarrays applied to the ionizing radiation response, P NATL ACAD SCI USA 98, 5116-5121, 2001
West M, Blanchette C, Dressman H, et al., Predicting the clinical status of human breast cancer by using gene expression profiles, P NATL ACAD SCI USA 98, 11462-11467, 2001