A method for finding molecular
signatures from gene expression data
Ramón Díaz-Uriarte
rdiaz@cnio.es
http://bioinfo.cnio.es/~rdiaz
Unidad de Bioinformática
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Workshop on Statistical Analysis of Gene Expression Data, Wye, UK, July 2003
Gene-expression signatures – p. 1/21
Introduction
Molecular signatures, or gene expression signatures, are a key
feature of many papers in cancer research. For instance,
Alizadeh et al., 2000; Golub et al., 1999; Huang et al., 2002;
Pomeroy et al., 2002; Ramaswamy et al., 2003; Rosenwald et
al., 2002; Shipp et al., 2002; Yeoh et al., 2002.
A possible definition: “(...) a group of genes expressed in a
specific cell lineage or stage of differentiation or during
particular biological response.” (Rosenwald et al., 2002, N.
Eng. J. Med., 346, p. 1942)
Often used as independent variables to model clinically
relevant information (cancer vs. healthy, survival time, etc).
Provide insight into biological mechanisms and processes and
have potential diagnostic use.
However, the search for molecular signatures is often carried out
with very diverse and ad-hoc methodologies.
What we want:
Find groups of genes [“group of genes” = “signature
component”] so that genes within a group are tightly
coexpressed, and the set of groups does a decent predictive job.
Nice if a similar procedure can be applied to different types of
dependent (phenotypic) data (e.g., class membership, survival
data, expression of a relevant protein).
Should help gain some understanding, not necessarily find
the best predictor (flexibility to play around with trade-offs).
What is, operationally, a signature?
Based on the literature, it seems reasonable that the following
conditions should hold:
Genes of a signature component should show tight
co-expression. Thus, the component or profile should capture
most of the variability in the genes that are part of that
signature component or profile.
For a given classification/prediction problem only a few
signature components should be needed to get a decent
predictor.
Signature components could have many genes.
A method: key elements
Tight co-expression: We can use Principal Components
Analysis to characterize a signature component. The first
principal component of a PCA on the genes that belong to a
signature component should capture a large fraction of the
total variance of those genes.
Few signature components: Add new components only if
justified. (Is this always what we want to do?).
Signature components could have many genes: Retain as
many genes per component as possible.
Predictor: Build a predictor using the signature components
(1st PCs), here with Diagonal Linear Discriminant Analysis (DLDA).
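As an illustration of the "tight co-expression" criterion, the following minimal sketch computes the first principal component of a group of genes and the fraction of the group's total variance it captures. It assumes a PCA on the centered (not rescaled) expression values, which the slides do not specify; the function name `first_pc` and the toy data are hypothetical.

```python
import numpy as np

def first_pc(expr):
    """Return (scores, variance fraction) of the 1st principal component.

    expr: array of shape (n_samples, n_genes) with the genes of one
    candidate signature component.
    """
    centered = expr - expr.mean(axis=0)
    # Squared singular values of the centered matrix are proportional to
    # the variance captured by each principal component.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    frac = s[0] ** 2 / np.sum(s ** 2)
    return u[:, 0] * s[0], frac

rng = np.random.default_rng(0)
shared = rng.normal(size=(50, 1))                  # common expression pattern
tight = shared + 0.1 * rng.normal(size=(50, 5))    # 5 tightly co-expressed genes
loose = rng.normal(size=(50, 5))                   # 5 unrelated genes
_, frac_tight = first_pc(tight)
_, frac_loose = first_pc(loose)
print(frac_tight, frac_loose)
```

For the tightly co-expressed group the first PC should capture nearly all of the variance, whereas for unrelated genes it captures only a modest share, which is what the % variance threshold is meant to detect.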
We have: $Y$ ($n \times q$), $X$ ($n \times p$), $p \gg n$.
We want: $Y$, $X^*$ ($n \times k$), $k < n$.
$X = [x_1, \ldots, x_{pr_1,1}, x_{pr_1,2}, x_{pr_1,3}, \ldots, x_{pr_2,1}, x_{pr_2,2}, \ldots]$
$X^* = [pr_1, pr_2, \ldots, pr_k]$.
$pr_i$ is the $i$th signature component or profile, and it is the 1st
PC of a PCA on genes $x_{pr_i,1}, x_{pr_i,2}, \ldots, x_{pr_i,m_i}$.
Each gene belongs either to one (and only one) signature
component or to none.
$Y = f(X^*) + \epsilon$.
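To make the notation concrete, here is a small sketch of how the reduced matrix of profiles could be assembled from the full expression matrix once each gene has been assigned to one component or to none. The `membership` encoding (component index per gene, -1 for "none") and the function name `build_profiles` are assumptions made for illustration only.

```python
import numpy as np

def build_profiles(X, membership):
    """Collapse X (n x p) into X_star (n x k): one 1st-PC score per component.

    membership[j] is the 0-based component index of gene j, or -1 if the
    gene belongs to no component (hypothetical encoding).
    """
    k = membership.max() + 1
    X_star = np.empty((X.shape[0], k))
    for i in range(k):
        genes = X[:, membership == i]
        centered = genes - genes.mean(axis=0)
        u, s, _ = np.linalg.svd(centered, full_matrices=False)
        X_star[:, i] = u[:, 0] * s[0]    # scores on the 1st PC (sign is arbitrary)
    return X_star

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 200))           # n = 30 samples, p = 200 genes
membership = np.full(200, -1)            # most genes belong to no component
membership[:10] = 0                      # genes 0-9 form the first component
membership[10:25] = 1                    # genes 10-24 form the second component
X_star = build_profiles(X, membership)
print(X_star.shape)                      # (30, 2)
```

The predictor is then fitted on the k columns of the reduced matrix instead of the p original genes.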
Addition of signature components
Find a seed gene for a new signature component:
For each gene i among available genes:
model_i = DLDA(previous.components + gene_i)
Select as seed gene: the gene i with the lowest (cross-validated) error,
pred.error.model_i.
Add a new signature component if
pred.error.model_i < last.pred.error − c_1 × s.e. (c_1 = 1).
Initial signature component: all genes with abs. corr. with
seed gene > r_seed (e.g., r_seed = 0.65).
These are the candidate genes to belong to that
component.
But this initial signature component might not fulfill the previous
requirements (% var., predictive performance).
Examine whether elimination of genes is needed.
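The seed search above can be sketched as follows. This is only an illustrative reading of the slide, not the author's R code: DLDA is implemented as nearest class mean with features scaled by the pooled within-class variance, the s.e. is taken to be the standard error of the error rate across cross-validation folds (the slides do not say how it is computed), and all function names are hypothetical.

```python
import numpy as np

def dlda_predict(Xtr, ytr, Xte):
    """Diagonal LDA: nearest class mean, features scaled by pooled variance."""
    classes = np.unique(ytr)
    means = np.array([Xtr[ytr == c].mean(axis=0) for c in classes])
    resid = Xtr - means[np.searchsorted(classes, ytr)]
    var = (resid ** 2).sum(axis=0) / (len(ytr) - len(classes))
    dist = (((Xte[:, None, :] - means[None]) ** 2) / var).sum(axis=2)
    return classes[dist.argmin(axis=1)]

def cv_error(X, y, folds):
    """Mean misclassification rate over the folds, and its standard error."""
    errs = []
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        errs.append(np.mean(dlda_predict(X[train], y[train], X[test]) != y[test]))
    errs = np.array(errs)
    return errs.mean(), errs.std(ddof=1) / np.sqrt(len(errs))

def find_seed(prev_components, X, y, available, folds):
    """Return (gene index, cv error, s.e.) of the best candidate seed gene."""
    best = None
    for g in available:
        design = np.column_stack([prev_components, X[:, g]])
        err, se = cv_error(design, y, folds)
        if best is None or err < best[1]:
            best = (g, err, se)
    return best

def initial_component(X, seed, r_seed=0.65):
    """Genes whose absolute correlation with the seed gene exceeds r_seed."""
    corr = np.corrcoef(X, rowvar=False)[seed]
    return np.flatnonzero(np.abs(corr) > r_seed)

rng = np.random.default_rng(2)
n = 40
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, 50))
X[:, 7] += np.where(y == 1, 3.0, -3.0)       # gene 7 separates the two classes
folds = np.array_split(rng.permutation(n), 5)
prev = np.empty((n, 0))                      # no components selected yet
gene, err, se = find_seed(prev, X, y, range(50), folds)
last_err, c1 = 0.5, 1.0                      # 0.5: stand-in for "no model yet"
print(gene, err, err < last_err - c1 * se)   # seed, its error, accept or not
print(initial_component(X, gene))
```

In this toy data set only gene 7 carries class information, so it should be picked as the seed; the candidate component returned by `initial_component` always contains the seed itself.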
Elimination of genes from components
Eliminate, one by one, genes with the smallest abs. corr. with the 1st
PC until % variance of 1st PC > v (e.g., 85% or 75%).
Ensure that predictive accuracy cannot be improved by
removing any gene from the signature component:
For each gene, i, in the current signature component: model_i =
DLDA(previous.components + current.component_{−i}).
Eliminate gene i from the signature component if
(cross-validated) pred.error.model_i < last.pred.error − c_2 × s.e.
(c_2 = 1).
Repeat until no gene is eliminated.
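A minimal sketch of the first elimination step (the variance threshold) is given below; the second, prediction-based step is not shown. It assumes PCA on centered expression values and recomputes the first PC after every removal, which the slides do not state explicitly. The names `pc1` and `prune_component` are hypothetical.

```python
import numpy as np

def pc1(expr):
    """Scores of the 1st PC and the fraction of variance it explains."""
    centered = expr - expr.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, 0] * s[0], s[0] ** 2 / np.sum(s ** 2)

def prune_component(X, genes, v=0.85):
    """Drop genes one by one until the 1st PC explains more than v."""
    genes = list(genes)
    while len(genes) > 1:
        scores, frac = pc1(X[:, genes])
        if frac > v:
            break
        # Absolute correlation of each remaining gene with the 1st PC scores.
        corr = [abs(np.corrcoef(X[:, g], scores)[0, 1]) for g in genes]
        genes.pop(int(np.argmin(corr)))      # remove the least correlated gene
    return genes

rng = np.random.default_rng(3)
shared = rng.normal(size=(60, 1))
tight = shared + 0.2 * rng.normal(size=(60, 6))   # genes 0-5: co-expressed
noise = 1.5 * rng.normal(size=(60, 3))            # genes 6-8: unrelated
X = np.hstack([tight, noise])
kept = prune_component(X, range(9), v=0.85)
print(kept)
```

With this toy data the three unrelated genes are the ones expected to be dropped, leaving the six co-expressed genes as the component.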
Bootstrap
We use the bootstrap to assess the stability of results and to measure
prediction error (.632+ rule).
Take B (= 100) bootstrap samples, and for each one run the
above procedure.
Common genes: genes that are returned in at least 20% of the
samples.
For each run, eliminate from the signature components those
genes that are not in common genes to obtain “clean signature
components”.
Consensus signature components are obtained as the (most
inclusive) union of all “clean signature components” with a
non-zero intersection.
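One possible reading of the consensus step is sketched below: count in how many bootstrap runs each gene appears, keep the "common genes", clean every component, and then merge all clean components that share at least one gene (transitively). Interpreting "most inclusive union" as this transitive merge is an assumption, and the function name and toy gene labels are hypothetical.

```python
from collections import Counter

def consensus_components(runs, min_freq=0.2):
    """runs: one list of components (sets of gene names) per bootstrap sample."""
    # A gene counts once per run, whichever component it appeared in.
    counts = Counter(g for comps in runs for g in set().union(*comps))
    common = {g for g, c in counts.items() if c >= min_freq * len(runs)}
    # "Clean" components: keep only common genes, drop components left empty.
    clean = [comp & common for comps in runs for comp in comps]
    clean = [c for c in clean if c]
    # Merge components that share at least one gene.
    merged = []
    for comp in clean:
        comp = set(comp)
        keep = []
        for m in merged:
            if m & comp:
                comp |= m
            else:
                keep.append(m)
        merged = keep + [comp]
    return merged

runs = [
    [{"A", "B", "C"}, {"X", "Y"}],
    [{"B", "C", "D"}],
    [{"X", "Y", "Z"}, {"Q"}],
    [{"A", "B"}],
    [{"C", "D"}, {"X"}],
]
# With only 5 toy runs a 40% threshold is used; the slides use 20% of B = 100.
result = consensus_components(runs, min_freq=0.4)
print(sorted(sorted(c) for c in result))
```

Genes Z and Q appear in a single run and are discarded; the remaining clean components collapse into two consensus components, one built around A-D and one around X-Y.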
Can we recover signatures?
Simulation study.
Generate signature data from a multivariate normal distribution.
Correlation between genes within a signature component: 0.9.
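Between genes in different signature components: 0, i.e.,
$$
\Sigma = \begin{pmatrix}
a & 0 & 0 & \cdots & 0 \\
0 & a & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & a
\end{pmatrix},
\qquad
a = \begin{pmatrix}
1 & 0.9 & \cdots & 0.9 \\
0.9 & 1 & \cdots & 0.9 \\
\vdots & \vdots & \ddots & \vdots \\
0.9 & 0.9 & \cdots & 1
\end{pmatrix}.
$$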
Means of classes set so that:
unconditional prediction error rate of a DLDA with a gene
from each signature component is approx. 5%;
each signature component has the same relevance in
separation.
Number of signature components: {1, 2, 3}.
Number of classes: {2, 3, 4}.
Number of genes per signature component: {5, 20, 100}.
Add another 4000 $N(0, 1)$ variables to the matrix of covariates.
Number of subjects: 25 per class.
Generate 20 data sets and run procedure.
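A data set of this kind could be generated roughly as follows. The sketch assumes unit variances, that every gene in a component is shifted by that component's class mean, and that equicorrelated genes are built as a shared factor plus independent noise (which yields the block-diagonal correlation structure with 0.9 inside each block). The function name `simulate` and its parameters are hypothetical.

```python
import numpy as np

def simulate(n_per_class, class_means, genes_per_comp,
             n_noise=4000, rho=0.9, seed=0):
    """class_means: (n_classes, n_components) array of component means."""
    rng = np.random.default_rng(seed)
    class_means = np.asarray(class_means, dtype=float)
    n_classes, n_comp = class_means.shape
    y = np.repeat(np.arange(n_classes), n_per_class)
    n = len(y)
    blocks = []
    for j in range(n_comp):
        # Equicorrelated N(0, 1) genes: sqrt(rho) * shared + sqrt(1 - rho) * own.
        shared = rng.normal(size=(n, 1))
        own = rng.normal(size=(n, genes_per_comp))
        block = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * own
        blocks.append(block + class_means[y, j][:, None])   # add class means
    noise = rng.normal(size=(n, n_noise))                   # irrelevant genes
    return np.hstack(blocks + [noise]), y

# Two classes, two signature components of 20 genes each, 25 subjects per class.
X, y = simulate(25, [[-1.18, -1.18], [1.18, 1.18]], genes_per_comp=20)
print(X.shape)   # (50, 4040)
```

The first 40 columns hold the two signature components and the remaining 4000 columns are pure noise.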
Class means
One signature component:
Two classes: $\mu_1 = -1.65$, $\mu_2 = 1.65$.
Three classes: $\mu_1 = -3.58$, $\mu_2 = 0$, $\mu_3 = 3.58$.
Four classes: $\mu_1 = -3.7$, $\mu_2 = 0$, $\mu_3 = 3.7$, $\mu_4 = 7.4$.
Two signature components:
Two classes: $\mu_1 = [-1.18, -1.18]$, $\mu_2 = [1.18, 1.18]$.
Three classes: $\mu_1 = [0, 0]$, $\mu_2 = [3.88\cos(15), 3.88\sin(15)]$, $\mu_3 = [3.88\cos(75), 3.88\sin(75)]$.
Four classes: $\mu_1 = [1, 1]$, $\mu_2 = [4.95, 1]$, $\mu_3 = [1, 4.95]$, $\mu_4 = [4.95, 4.95]$.
Three signature components:
Two classes: $\mu_1 = [-0.98, -0.98, -0.98]$, $\mu_2 = [0.98, 0.98, 0.98]$.
Three classes: $\mu_1 = [2.76, 0, 0]$, $\mu_2 = [0, 2.76, 0]$, $\mu_3 = [0, 0, 2.76]$.
Four classes: $\mu_1 = [2.96, 0, 0]$, $\mu_2 = [0, 2.96, 0]$, $\mu_3 = [0, 0, 2.96]$, $\mu_4 = [2.96, 2.96, 2.96]$.
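As a rough sanity check on the two-class means, the sketch below computes the error of the optimal rule for two equally likely classes with independent, unit-variance features (one gene per signature component). This is the large-sample error rather than the unconditional error quoted in the slides, so it is only expected to land near 5%; the function names are hypothetical.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def two_class_error(mu1, mu2):
    """Error of the optimal rule: Phi(-d / 2), d = distance between means."""
    dist = sqrt(sum((a - b) ** 2 for a, b in zip(mu1, mu2)))
    return phi(-dist / 2.0)

settings = [
    ([-1.65], [1.65]),                                  # one component
    ([-1.18, -1.18], [1.18, 1.18]),                     # two components
    ([-0.98, -0.98, -0.98], [0.98, 0.98, 0.98]),        # three components
]
errors = [two_class_error(mu1, mu2) for mu1, mu2 in settings]
for (mu1, _), e in zip(settings, errors):
    print(len(mu1), "component(s):", round(e, 3))
```

All three settings give an error a little below 0.05, consistent with means chosen so that the error of a DLDA trained on finite samples is approximately 5%.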
Four classes
[Figure: simulation results for four classes. Panels: Gene overlap: overall; Gene overlap: mean of signatures; Number of returned signatures. One row per simulation condition, labeled I.5 to III.100.]
Three classes
[Figure: simulation results for three classes. Panels: Gene overlap: overall; Gene overlap: mean of signatures; Number of returned signatures. One row per simulation condition, labeled I.5 to III.100.]
Two classes
[Figure: simulation results for two classes. Panels: Gene overlap: overall; Gene overlap: mean of signatures; Number of returned signatures. One row per simulation condition, labeled I.5 to III.100.]
Comparison with standard methods
[Figure: 10-fold CV error rate (axis from 0.0 to 0.5) of svm, knn, dlda, and signature on three data sets: Golub, Van’t Veer, Ramaswamy.]
Stability of results?
Models on all data:
Golub: 1st PC > 85%: 1 comp. (1 gene); 1st PC > 75%: 1
comp. (6 genes); 1st PC > 70%: 2 comp. (11 and 19 genes).
van’t Veer: 1 comp. (1 gene); 1 comp. (3 genes); 5 comp.
(13 genes); . . .
Ramaswamy: 1 comp. (1 gene); 2 comp. (10 and 1 genes);
2 comp. (2 genes); . . .
Bootstrap:
1st PC > 85% var:
Golub: 1 comp. with 19 genes;
van’t Veer and Ramaswamy: no common genes;
1st PC > 75% var:
Golub: 1 comp. with 48 genes;
van’t Veer: no common genes;
Ramaswamy: 1 comp. of 2 genes;
Discussion
?
Appropriate threshold for % var., correlations, etc?
Entry of a component, given previous components?
Within-group heterogeneity?
PCA: between vs. within group patterns.
Easily extended:
Other classifiers (e.g., logistic regression, knn, svm).
Other dependent variables: survival analysis.
Related to
Partial Least Squares (and, to a lesser extent, Principal
Components Regression).
Factor analysis with oblique rotations to obtain clusters of
variables (SAS’s PROC VARCLUS).
“Supergenes” or “metagenes” of West et al.
...
Conclusion
The logic of this method follows directly from what are
considered biologically relevant signature characteristics.
Results that are biologically relevant and interpretable and can
complement other approaches.
This method defines a framework that allows us to find
signatures regardless of the type of dependent variable.
Easy to implement; R code is available.
Acknowledgements
A. Pérez and M. A. Piris for emphasizing the importance of
molecular signatures.
Bioinformatics Unit, CNIO, for discussion.
C. Lázaro-Perea for discussion.