Analysis of gene expression data
(Nominal explanatory variables)
Shyamal D. Peddada
Biostatistics Branch
National Institute of Environmental
Health Sciences (NIH)
Research Triangle Park, NC
Outline of the talk
Two types of explanatory variables
(“experimental conditions”)
Some scientific questions of interest
A brief discussion on false discovery rate (FDR)
analysis
Some existing statistical methods for analyzing
microarray data
Types of explanatory variables
Types of explanatory variables
(“experimental conditions”)
Nominal variables:
– No intrinsic order among the levels of the explanatory
variable(s).
– No loss of information if we permuted the labels of the
conditions.
E.g. Comparison of gene expression of samples from
“normal” tissue with those from “tumor” tissue.
Types of explanatory variables
(“experimental conditions”)
Ordinal/interval variables:
– Levels of the explanatory variables are ordered.
– E.g.
Comparison of gene expression of samples from different
stages of severity of lesions such as “normal”,
“hyperplasia”, “adenoma” and “carcinoma”. (categorically
ordered)
Time-course/dose-response experiments. (numerically
ordered)
Focus of this talk:
Nominal explanatory variables
Types of microarray data
Independent samples
– E.g. comparison of gene expression of independent
samples drawn from normal patients versus independent
samples from tumor patients.
Dependent samples
– E.g. comparison of gene expression of samples drawn
from normal tissues and tumor tissues from the same
patient.
Possible questions of interest
Identify significant “up/down” regulated genes
for a given “condition” relative to another
“condition” (adjusted for other covariates).
Identify genes that discriminate between various
“conditions” and predict the “class/condition” of a
future observation.
Cluster genes according to patterns of expression
over “conditions”.
Other questions?
Challenges
Small sample size but a large number of genes.
Multiple testing – Since each microarray has
thousands of genes/probes, several thousand
hypotheses are being tested. This impacts the
overall Type I error rates.
Complex dependence structure between genes and
possibly among samples.
– Difficult to model and/or account for the underlying
dependence structures among genes.
Multiple Testing:
Type I Errors
- False Discovery Rates …
The Decision Table

                       Number of not    Number of
                       rejected H0      rejected H0    Total
Number of true H0      U                V              m0
Number of true Ha      T                S              m1
Total                  W                R              m

Only the totals W and R (and m) are observable.
Strong and weak control of
type I error rates
Strong control: control the type I error rate under any
combination of true H0 and Ha.
Weak control: control type I error rate only when
all null hypotheses are true
Since we do not know a priori which hypotheses are
true, we will focus on strong control of type I error
rate.
Consequences of multiple testing
Suppose we test each hypothesis at 5% level of
significance.
– Suppose n = 10 independent tests are performed. Then the
probability of declaring at least 1 of the 10 tests
significant is 1 − 0.95^10 ≈ 0.401.
– If 50,000 independent tests are performed, as in
Affymetrix microarray data, then you should expect
about 2,500 false positives!
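The arithmetic above can be reproduced in a few lines. A minimal sketch (the function names are ours, chosen for illustration):

```python
# Consequences of running n independent tests, each at level alpha,
# when every null hypothesis is true.
def prob_at_least_one_false_positive(n, alpha=0.05):
    # P(at least one rejection) = 1 - P(no rejections)
    return 1 - (1 - alpha) ** n

def expected_false_positives(n, alpha=0.05):
    # each true null is falsely rejected with probability alpha
    return n * alpha

print(round(prob_at_least_one_false_positive(10), 3))  # 0.401
print(expected_false_positives(50_000))                # 2500.0
```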
Types of errors in the
context of multiple testing
Per-Family Error “Rate” (PFER): E(V)
– Expected number of false rejections of H0.
Per-Comparison Error Rate (PCER): E(V)/m
– Expected proportion of false rejections of H0 among
all m hypotheses.
Family-Wise Error Rate (FWER): P(V > 0)
– Probability of at least one false rejection of H0
among all m hypotheses.
Types of errors in the
context of multiple testing
False Discovery Rate (FDR):
– Expected proportion of Type I errors among all rejected
hypotheses.
Benjamini-Hochberg (BH): Set V/R = 0 if R = 0.
    FDR = E[(V/R) · 1{R > 0}] = E[V/R | R > 0] · P(R > 0)
Storey: Only interested in the case R > 0. (Positive FDR)
    pFDR = E[V/R | R > 0]
Some useful inequalities
Since V ≤ R ≤ m, therefore
    V/m ≤ (V/R) · 1{R > 0}    (1)
Again, since V ≤ R, and R = 0 implies V = 0, therefore
    (V/R) · 1{R > 0} ≤ 1{V > 0}    (2)
Also
    1{V > 0} ≤ V    (3)
Some useful inequalities
Combining (1), (2) and (3), we have:
    V/m ≤ (V/R) · 1{R > 0} ≤ 1{V > 0} ≤ V    (4)
Taking expectations in (4) we have:
    E(V)/m ≤ E[(V/R) · 1{R > 0}] ≤ P(V > 0) ≤ E(V)    (5)
Some useful inequalities
Thus we have:
    PCER ≤ FDR ≤ FWER ≤ PFER    (6)
Trivially,
    FDR ≤ pFDR    (7)
Conclusion
It is conservative to control FWER rather than
FDR!
It is conservative to control pFDR rather than
FDR!
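The ordering PCER ≤ FDR ≤ FWER ≤ PFER can also be checked by simulation. The sketch below is illustrative only: it assumes independent tests, 90 true nulls with Uniform(0,1) p-values, and 10 false nulls with artificially small p-values.

```python
import random

random.seed(0)
m, m0, alpha, reps = 100, 90, 0.05, 2000   # 90 true nulls, 10 false nulls
pcer = fdr = fwer = pfer = 0.0
for _ in range(reps):
    # p-values: Uniform(0,1) under true nulls, near zero under false nulls
    p_null = [random.random() for _ in range(m0)]
    p_alt = [0.01 * random.random() for _ in range(m - m0)]
    V = sum(p <= alpha for p in p_null)        # false rejections
    R = V + sum(p <= alpha for p in p_alt)     # total rejections
    pcer += V / m
    fdr += V / R if R > 0 else 0.0             # BH convention: V/R = 0 if R = 0
    fwer += 1.0 if V > 0 else 0.0
    pfer += V
pcer, fdr, fwer, pfer = (x / reps for x in (pcer, fdr, fwer, pfer))
print(pcer, fdr, fwer, pfer)   # monotonically increasing, as inequality (6) predicts
assert pcer <= fdr <= fwer <= pfer
```

Because inequality (4) holds sample by sample, the four averages are ordered in every run, not just in expectation.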
Some useful inequalities
Question: Is pFDR ≤ FWER?
Some useful inequalities
Example: Suppose m0 = m.
Note: m0 = m implies m1 = 0, so S = 0 and V = R.
    FDR = E[(V/R) · 1{R > 0}]
        = E(1{V > 0})
        = P(V > 0) = FWER
Some useful inequalities
But pFDR = E[V/R | R > 0]
         = E(1 | R > 0) = 1.
Hence if m0 = m then
    1 = pFDR ≥ FDR = FWER
Some useful inequalities
However, in most applications such as
microarrays, one expects m0 < m.
In general, there is no proof of the statement
    pFDR ≤ FWER
Some popular Type I error
controlling procedures
Let P(1) ≤ P(2) ≤ ... ≤ P(m) denote the ordered
p-values for the m tests that are being performed.
Let α(1) ≤ α(2) ≤ ... ≤ α(m) denote the ordered
levels of significance used for testing the m null
hypotheses H0(1), H0(2), ..., H0(m), respectively.
Some popular controlling procedures
Step-down procedure:
Step 1: If P(1) ≤ α(1) then reject H0(1) and go to Step 2.
Else stop.
Step 2: If P(2) ≤ α(2) then reject H0(2) and go to Step 3.
Else stop.
Step 3: If P(3) ≤ α(3) then reject H0(3) and go to Step 4.
Else stop.
and so on.
Some popular controlling procedures
Step-up procedure:
Step 1: If P(m) ≤ α(m) then reject H0(i), i = 1, 2, ..., m, and stop.
Else go to Step 2.
Step 2: If P(m−1) ≤ α(m−1) then reject H0(i), i = 1, 2, ..., m−1, and stop.
Else go to Step 3.
Step 3: If P(m−2) ≤ α(m−2) then reject H0(i), i = 1, 2, ..., m−2, and stop.
Else go to Step 4.
and so on!
Some popular controlling procedures
Single-step procedure
A stepwise procedure with the same critical
constant for all m hypotheses:
    α(1) = α(2) = ... = α(m)
Some typical stepwise procedures:
FWER controlling procedures
Bonferroni: A single-step procedure with
    α(i) = α/m
Sidak: A single-step procedure with
    α(i) = 1 − (1 − α)^(1/m)
Holm: A step-down procedure with
    α(i) = α/(m − i + 1)
Hochberg: A step-up procedure with
    α(i) = α/(m − i + 1)
minP method: A resampling-based single-step procedure with
    α(i) = c_α, where c_α is the α quantile of the distribution
of the minimum p-value.
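As a concrete sketch, the Holm (step-down) and Hochberg (step-up) procedures with α(i) = α/(m − i + 1) can be coded as below; the example p-values are invented for illustration.

```python
def holm(pvals, alpha=0.05):
    """Holm step-down: reject H0(i) while p(i) <= alpha/(m - i + 1)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    rejected = set()
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha / (m - rank + 1):
            rejected.add(i)
        else:
            break                      # step-down: stop at first failure
    return rejected

def hochberg(pvals, alpha=0.05):
    """Hochberg step-up: find the largest rank i with p(i) <= alpha/(m - i + 1)
    and reject all hypotheses with smaller or equal rank."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank in range(m, 0, -1):       # step-up: scan from the largest p-value
        if pvals[order[rank - 1]] <= alpha / (m - rank + 1):
            k = rank
            break
    return set(order[:k])

p = [0.001, 0.010, 0.030, 0.040, 0.900]
print(sorted(holm(p)))      # [0, 1]
print(sorted(hochberg(p)))  # [0, 1]
```

On this example the two procedures agree; in general Hochberg rejects at least as many hypotheses as Holm.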
Comments on the methods
Bonferroni: Very general but can be too
conservative for a large number of hypotheses.
Sidak: More powerful than Bonferroni, but
applicable only when the test statistics are
independent or have certain types of positive
dependence.
Comments on the methods
Holm: More powerful than Bonferroni and is
applicable for any type of dependence structure
between test statistics.
Hochberg: More powerful than Holm’s procedure,
but the test statistics should either be
independent or have the MTP2 property.
Comments on the methods
Multivariate Total Positivity of Order 2 (MTP2)
A density f(x) is said to be MTP2 if, for all x, y ∈ R^p,
    f(x ∨ y) · f(x ∧ y) ≥ f(x) · f(y)
where ∨ and ∧ denote the componentwise maximum and minimum.
Some typical stepwise procedures:
FDR controlling procedure
Benjamini-Hochberg:
A step-up procedure with
    α(i) = iα/m
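A minimal sketch of the BH step-up rule with α(i) = iα/m (the p-values in the example are invented for illustration):

```python
def benjamini_hochberg(pvals, q=0.05):
    """BH step-up: find the largest rank k with p(k) <= k*q/m and
    reject all hypotheses whose p-value rank is <= k."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    k = 0
    for rank in range(m, 0, -1):
        if pvals[order[rank - 1]] <= rank * q / m:
            k = rank
            break
    return set(order[:k])

p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(sorted(benjamini_hochberg(p)))  # [0, 1]
```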
An Illustration
Lobenhofer et al. (2002) data:
Breast cancer cells were exposed to estradiol for 1, 12, 24,
or 36 hours.
Number of genes on the 2-spot cDNA array: 1,900.
Number of samples per time point: 8.
Compare 1 hour with each of 12, 24 and 36 hours using a two-sided
bootstrap t-test.
Some Popular Methods of Analysis
1. Fold-change
1. Fold-change in gene expression
For gene “g” compute the fold change between
two conditions (e.g. treatment and control):
    f_g = X̄_trt / X̄_cont
1. Fold-change in gene expression
R1, R2: pre-defined constants.
f_g ≥ R1: gene “g” is “up-regulated”.
f_g ≤ R2: gene “g” is “down-regulated”.
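The fold-change rule is easy to code. The sketch below assumes the common choice R1 = 2 and R2 = 0.5 (two-fold up or down); the talk does not fix specific values, so these constants are purely illustrative.

```python
def fold_change(mean_trt, mean_cont):
    # ratio of mean expression under treatment to mean under control
    return mean_trt / mean_cont

def classify(fc, r1=2.0, r2=0.5):
    # r1, r2 play the role of the pre-defined constants R1, R2
    if fc >= r1:
        return "up-regulated"
    if fc <= r2:
        return "down-regulated"
    return "unchanged"

print(classify(fold_change(220.0, 100.0)))  # up-regulated
print(classify(fold_change(40.0, 100.0)))   # down-regulated
```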
1. Fold-change in gene expression
Strengths:
– Simple to implement.
– Biologists find it very easy to interpret.
– It is widely used.
Drawbacks:
– Ignores variability in mean gene expression.
– Genes with subtle changes in expression can be
overlooked, i.e. potentially high false negative rates.
– Conversely, high false positive rates are also possible.
2. t-test type procedures
2.1 Permutation t-test
For each gene “g” compute the standard two-sample
t-statistic:
    t_g = (X̄_g,trt − X̄_g,cont) / (S_g · √(1/n_trt + 1/n_cont))
where X̄_g,trt and X̄_g,cont are the sample means and S_g is the
pooled sample standard deviation.
2.1 Permutation t-test
Statistical significance of a gene is determined by
computing the null distribution of t_g using either a
permutation or a bootstrap procedure.
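A sketch of the permutation t-test for a single gene, on toy expression values; a real analysis would loop this over all genes and feed the resulting p-values into an FDR procedure.

```python
import random
from statistics import mean

def t_stat(x, y):
    # standard pooled two-sample t statistic
    nx, ny = len(x), len(y)
    sp2 = (sum((v - mean(x)) ** 2 for v in x) +
           sum((v - mean(y)) ** 2 for v in y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / (sp2 * (1 / nx + 1 / ny)) ** 0.5

def permutation_p_value(x, y, n_perm=2000, seed=0):
    # two-sided p-value: fraction of label permutations whose |t|
    # is at least as extreme as the observed |t|
    rng = random.Random(seed)
    obs = abs(t_stat(x, y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(t_stat(pooled[:len(x)], pooled[len(x):])) >= obs:
            hits += 1
    return (hits + 1) / (n_perm + 1)   # add-one correction

trt  = [5.1, 5.6, 5.3, 5.8]   # toy expression values for one gene
cont = [4.0, 4.2, 3.9, 4.3]
print(permutation_p_value(trt, cont))
```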
2.1 Permutation t-test
Strengths:
– Simple to implement.
– Biologists find it very easy to interpret.
– It is widely used.
Drawback:
– Potentially, for some genes the pooled sample standard deviation
could be very small, and hence it may result in inflated Type I errors
and inflated false discovery rates.
2.2 SAM procedure
(Significance Analysis of Microarrays)
(Tusher et al., PNAS 2001)
For each gene “g” modify the standard two-sample
t-statistic as:
    d_g = (X̄_g,trt − X̄_g,cont) / (s0 + S_g · √(1/n_trt + 1/n_cont))
The “fudge” factor s0 is obtained such that the
coefficient of variation of the above test statistic is minimized.
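The effect of the fudge factor can be seen on a toy gene whose pooled standard deviation is nearly zero. In the sketch below s0 is a fixed constant chosen purely for illustration; in SAM itself it is estimated by minimizing the coefficient of variation of d_g across genes.

```python
def sam_statistic(x, y, s0):
    # SAM-style moderated t: the constant s0 is added to the standard
    # error so tiny-variance genes cannot dominate the rankings
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    sp2 = (sum((v - mx) ** 2 for v in x) +
           sum((v - my) ** 2 for v in y)) / (nx + ny - 2)
    se = (sp2 ** 0.5) * (1 / nx + 1 / ny) ** 0.5
    return (mx - my) / (s0 + se)

# a gene with a tiny mean difference and near-zero variance
x = [5.000, 5.001, 5.000]
y = [4.990, 4.991, 4.990]
print(sam_statistic(x, y, s0=0.0))   # large: the plain t-statistic flags it
print(sam_statistic(x, y, s0=0.5))   # small: the fudge factor damps it
```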
3. F-test and its variations for
more than 2 nominal conditions
The usual F-test can be applied, with P-values obtained
by a suitable permutation procedure.
Regularized F-test: A generalization of the Baldi and
Long methodology to multiple groups.
– It controls the false discovery rate better, with power
comparable to the F-test.
Cui and Churchill (2003) is a good review paper.
4. Linear fixed effects models
Effects:
– Array (A) - sample
– Dye (D)
– Variety (V) – test groups
– Genes (G)
– Expression (Y)
4. Linear fixed effects models
(Kerr, Martin, and Churchill, 2000)
Linear fixed effects model:
    log(Y_ijkg) = A_i + D_j + G_g + (AD)_ij
                  + (AG)_ig + (DG)_jg + (VG)_kg + ε_ijkg,
    ε_ijkg ~ iid N(0, σ²).
Hypothesis of interest:
    H0: (VG)_kg = 0 for all k = 1, 2, ..., v
4. Linear fixed effects models
All effects are assumed to be fixed effects.
Main drawback – all genes are assumed to have the same variance!
5. Linear mixed effects models
(Wolfinger et al. 2001)
Stage 1 (Global normalization model)
    log(Y_gij) = T_i + A_j + (TA)_ij + ε_gij
Stage 2 (Gene-specific model)
    ε̂_gij = G_g + (GT)_gi + (GA)_gj + γ_gij
5. Linear mixed effects models
Assumptions:
    A_j ~ iid N(0, σ²_A),
    (TA)_ij ~ iid N(0, σ²_TA),
    ε_gij ~ iid N(0, σ²),
    (GA)_gj ~ iid N(0, σ²_GA,g),
    γ_gij ~ iid N(0, σ²_g)
5. Linear mixed effects models
(Wolfinger et al. 2001)
Perform inferences on the interaction term (GT)_gi.
A popular graphical representation:
The Volcano Plots
A scatter plot of
    −log10(p-value) vs log2(fold change)
Genes with large fold change will lie outside a pair of vertical
“threshold” lines. Further, genes which are highly significant with
large fold change will lie in either the upper right-hand or upper
left-hand corner.
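The plot coordinates and the corner test are straightforward to compute. The cutoffs below (two-fold change, p ≤ 0.001) are arbitrary illustrative thresholds, not values from the talk.

```python
import math

def volcano_coords(fold_change, p_value):
    # x and y coordinates of one gene on a volcano plot
    return math.log2(fold_change), -math.log10(p_value)

def flagged(fc, p, fc_cut=2.0, p_cut=0.001):
    # True if the gene falls in the upper-right or upper-left corner:
    # beyond the vertical fold-change lines AND above the significance line
    x, y = volcano_coords(fc, p)
    return abs(x) >= math.log2(fc_cut) and y >= -math.log10(p_cut)

print(flagged(4.0, 1e-5))   # True: large change, highly significant
print(flagged(1.1, 1e-5))   # False: significant but small fold change
```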
A useful review article
Cui, X. and Churchill, G (2003), Genome Biology.
Software:
R package: statistics for microarray analysis.
http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html
SAM: Significance Analysis of Microarray.
http://www-stat.stanford.edu/%7Etibs/SAM
Supervised classification algorithms
Discriminant analysis based methods
A. Linear and Quadratic Discriminant analysis based methods:
Strength:
– Well studied in the classical statistics literature
Limitations:
– Based on normality
– Imposes constraints on the covariance matrices. Need to
be concerned about the singularity issue.
– No convenient strategy has been proposed in the
literature to select the “best” discriminating subset of genes.
Discriminant analysis based methods
B. Nonparametric classification using a Genetic Algorithm and K-nearest neighbors.
– Li et al. (Bioinformatics, 2001)
Strengths:
– Entirely nonparametric
– Takes into account the underlying dependence structure among
genes
– Does not require the estimation of a covariance matrix
Weakness:
– Computationally very intensive
GA/KNN methodology – very brief
description
Computes the Euclidean distance between all pairs of samples
based on a sub-vector of, say, 50 genes.
Clusters each sample into a treatment group (i.e. condition) based
on the K-Nearest Neighbors.
Computes a fitness score for each subset of genes based on how
many samples are correctly classified. This is the objective
function.
The objective function is optimized using Genetic Algorithm
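The fitness evaluation at the heart of the steps above can be sketched as leave-one-out KNN classification on a candidate gene subset. This is a toy illustration (the GA wrapper is omitted, and the data are made up so that only gene 0 separates the groups):

```python
from collections import Counter

def knn_label(sample, others, k=3):
    """Majority label among the k nearest (Euclidean) neighbours.
    `others` is a list of (feature_vector, label) pairs."""
    ranked = sorted(others,
                    key=lambda o: sum((a - b) ** 2 for a, b in zip(sample, o[0])))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def fitness(subset, data):
    """GA objective: leave-one-out count of correctly classified samples,
    using only the genes whose indices are in `subset`."""
    correct = 0
    for i, (x, lab) in enumerate(data):
        proj = [([x2[g] for g in subset], l2)
                for j, (x2, l2) in enumerate(data) if j != i]
        if knn_label([x[g] for g in subset], proj, k=3) == lab:
            correct += 1
    return correct

# toy data: 3 "genes" per sample; treated samples are high on gene 0
data = [([5.0, 1.0, 0.3], "trt"), ([5.2, 0.9, 0.1], "trt"),
        ([4.8, 1.1, 0.2], "trt"), ([5.1, 1.0, 0.4], "trt"),
        ([1.0, 1.0, 0.3], "ctl"), ([1.2, 0.9, 0.2], "ctl"),
        ([0.8, 1.1, 0.1], "ctl"), ([1.1, 1.0, 0.4], "ctl")]
print(fitness([0], data), fitness([1], data))  # gene 0 classifies all 8; gene 1 does not
```

The genetic algorithm then searches over subsets, preferring those with higher fitness.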
K-nearest neighbors classification (k=3)
[Figure: samples plotted by expression levels of gene 1; the unknown
sample “X” is assigned the majority class of its 3 nearest neighbors.]
Subcategories within a class
[Figure: expression levels of gene 1 reveal distinct subtypes within a class.]
Advantages of KNN approach
Simple, performs as well as or better than more
complex methods
Free from assumptions such as normality of the
distribution of expression levels
Multivariate: takes account of dependence in
expression levels
Accommodates or even identifies distinct
subtypes within a class
Expression data: many genes and few samples
There may be many subsets of genes that can
statistically discriminate between the treated and
untreated.
There are too many possible subsets to look at:
with 3,000 genes, there are about 10^72 ways to
make subsets of size 30.
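The count quoted above is easy to verify exactly:

```python
import math

# number of ways to choose a 30-gene subset from 3,000 genes
n = math.comb(3000, 30)
print(len(str(n)))   # 72 -- the count has 72 digits, i.e. about 10**72
```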
The genetic algorithm
A computer algorithm (due to John Holland) that works by
mimicking Darwinian natural selection.
Has been applied to many optimization problems
ranging from engine design to protein folding and
sequence alignment
Effective in searching high dimensional space
GA works by mimicking evolution
Randomly select sets (“chromosomes”) of 30
genes from all the genes on the chip
Evaluate the “fitness” of each “chromosome” –
how well can it separate the treated from the
untreated?
Pass “chromosomes” randomly to next generation,
with preference for the fittest
Summary
Pay attention to multiple testing problem.
– Use FDR over FWER for large data sets such as gene
expression microarrays
Linear mixed effects models may be used for
comparing expression data between groups.
For classification problem, one may want to
consider GA/KNN approach.