
Statistical Design and Analysis of Microarray Experiments
Peng Liu
6/15/2010

Microarray Technology
- Microarray technology measures the expression levels (abundance of mRNA transcripts) of thousands of genes simultaneously.
- Two types of platforms:
  - Affymetrix (single-color)
  - Two-color microarray

Wild-type vs. Myostatin Knockout Mice
- Belgian Blue cattle have a mutation in the myostatin gene.

Design of Affymetrix experiment: one sample per chip

Designing 2-color microarray (3 layers)
[Figure from Churchill, 2002, Nature Genetics]

Example I: Sawers et al., 2007, BMC Bioinformatics
[Figure: bundle sheath strands and mesophyll protoplasts; labels M, B, V]
- The establishment of C4 photosynthesis in maize is associated with differential accumulation of gene transcripts and proteins between bundle sheath and mesophyll photosynthetic cell types.
- Goal: detect genes that are differentially expressed between bundle sheath (B) and mesophyll (M) cells.
- A simple method: isolate the cells and perform a microarray experiment to compare gene expression between the two cell types (treatments).
- A complication: the procedures for extracting mRNA from the two cell types differ, and the procedure for M cells introduces stress.
- Solution: add two more treatment groups, samples containing both M and B cells that go through mRNA extraction with and without stress. This gives 4 treatment groups: B, M, Stress, and Total.

Direct comparison vs indirect comparison
- Direct: comparison within slide
- Indirect: comparison between slides
- Suppose we want to compare gene expression levels between treatment 1 and treatment 2.
[Diagram: direct comparison (samples 1 and 2 on the same slide) and indirect comparison (samples 1 and 2 each compared with a reference R)]

Comments about 2-color Microarray Designs
- A unique and powerful feature of the 2-color microarray is that two samples can be compared directly on the same slide.
- Because the samples are paired on a slide, the variation due to slide can be accounted for.
- When possible, it is more efficient to use direct comparison.
- However, it is sometimes not practical to make direct comparisons of all possible pairs.

Efficiency of comparison
- The efficiency of a comparison between 2 samples is determined by the length and the number of paths connecting them.
[Diagram: direct comparison with dye-swap (samples 1 and 2) and indirect comparison (samples 1 and 2 via reference R)]

Reference vs Loop design
[Diagram: reference design (samples 1, 2, 3 each compared with reference R) and loop design (samples 1, 2, 3 connected in a loop)]

Designing experiment for example I
[Diagram: design connecting B, Total, Stress, and M]
- With 6 biological replicates

Performing the experiment
[Figure from Nature Cell Biol. 2001 3:8]

After the bench work…
[Figures: 2-color microarray image; Affymetrix GeneChip image]

The data table looks like

Header
Begin Raw Data

Row  Column  Gene ID     Field  Meta Row  Meta Column  Flag  Background Median  Signal Median
1    1       MZ00040724  A      1         1            0     533                1645.5
1    2       MZ00040730  A      1         1            2     469                613
1    3       MZ00040748  A      1         1            0     462                741.5
1    4       MZ00040754  A      1         1            0     473                909
1    5       MZ00040772  A      1         1            0     471.5              964
1    6       MZ00040778  A      1         1            2     469                574
1    7       MZ00040796  A      1         1            2     487                579
1    8       MZ00040802  A      1         1            3     614                38051
1    9       MZ00013020  A      1         1            3     516.5              4539
1    10      MZ00013026  A      1         1            3     491.5              597.5
1    11      MZ00013044  A      1         1            3     521.5              16210

Pre-normalization analysis
- Image processing: obtain the intensity measurement of the signal.
- Background correction: remove local background that might be due to nonspecific binding and obtain the target sample intensity.
- Filtration: remove unreliable spots and reduce the dimension of the data.
- Transformation: convert the data into a format that makes data analysis valid or easier.

Normalization
- Normalization describes the process of removing (or minimizing) non-biological variation in measured signal intensity levels so that biological differences in gene expression can be appropriately detected.
- Aim: remove sources of systematic variation.
- Example of non-biological variation: dye difference for 2-color microarray.

Self-self experiment
[Figure from Dudoit et al., 2002, Statistica Sinica]

Normalization: M vs. A Plot (45° rotation)
- M = Log Red - Log Green
- A = (Log Green + Log Red)/2
[Figure: M plotted against A with a LOWESS fit]

After normalization
[Figure: normalized M plotted against A]

Statistical Inference
- Data notation for normalized signal intensities (NSI): Yijk for each gene (g)
  - i: treatment index
  - j: dye index
  - k: slide index
- Examples: Y114 (treatment 1, dye 1, slide 4) and Y224 (treatment 2, dye 2, slide 4)

Fitting linear models to microarray data
- After normalization, we have one observation (normalized signal intensity) for each gene on each channel (a combination of dye and array).
- Together, the data form an array with one row for each gene and one column for each channel or chip.
- We will fit a statistical model for each gene separately.

Mean expressions for 4 treatment groups

Treatment                        Mean
M  (M cell with stress)          μ + v2 + (stress effect)
B  (B cell without stress)       μ + v1
TO (both cells without stress)   μ + c*v2 + (1-c)*v1
ST (both cells with stress)      μ + c*v2 + (1-c)*v1 + (stress effect)

- Note that c is the proportion of M cells in the total leaf sample containing both cell types.
- We are interested in testing H0: v1 = v2, that is, whether or not a given gene is differentially expressed between M and B cells.

Fixed effects
- The parameters on the previous slide (v1, v2, and the stress effect) specify fixed effects.
- Fixed effects are used to specify the mean of the response variable.
- A factor is fixed if the levels of the factor were selected by the investigator with the purpose of comparing the effects of the levels to one another.
- The fixed effects included in the model depend on the experimental design.
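
The table of treatment means implies that the stress effect can be cancelled by combining the four group means: (B - M) + (ST - TO) = v1 - v2. The following is a minimal sketch of that arithmetic for one gene. The function names, the parameter name `stress` for the stress effect, and all numeric values are assumptions made for illustration; they are not taken from the experiment.

```python
def treatment_means(mu, v1, v2, c, stress):
    """Expected normalized signal for the four treatment groups of one gene.

    mu, v1, v2 and c follow the treatment-means table; `stress` is a
    hypothetical name for the stress effect.
    """
    total = mu + c * v2 + (1 - c) * v1  # mixture of M and B cells
    return {
        "M": mu + v2 + stress,   # M cells, extracted with stress
        "B": mu + v1,            # B cells, extracted without stress
        "TO": total,             # both cell types, without stress
        "ST": total + stress,    # both cell types, with stress
    }


def cell_type_contrast(means):
    """Combine group means so that the stress effect cancels.

    B - M equals v1 - v2 - stress, and ST - TO equals stress,
    so their sum is v1 - v2.
    """
    return (means["B"] - means["M"]) + (means["ST"] - means["TO"])


# Hypothetical parameter values, chosen only to illustrate the identity.
means = treatment_means(mu=8.0, v1=1.5, v2=0.5, c=0.5, stress=0.75)
contrast = cell_type_contrast(means)
print(contrast)  # equals v1 - v2 for these inputs
```

The contrast does not depend on c, which is why the proportion of M cells in the total sample does not need to be known to test H0: v1 = v2 under this mean structure.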

Random effects
- Some random effects are unknown:
  - slide effects
  - other effects introduced in the experiment (such as biological replicate effects)
  - residual random effects, which include any sources of variation unaccounted for by other terms
[Diagram: design connecting B, Total, Stress, and M]
- Random factors are used to specify the correlation structure among the response variable observations, e.g., observations on the same slide are more correlated than observations from different slides.
- The random effects included in the model also depend on the experimental design.
- A model that has both fixed and random effects is called a mixed model.

Detecting differentially expressed genes
- Construct statistical tests for the parameters of interest, e.g., what is the difference in gene expression (v1 - v2)? v1 - v2 ≠ 0 means differential expression.
- Model the random effects and perform tests or construct confidence intervals.
- Perform a test for each gene and obtain a p-value. An empirical Bayes test that borrows information across genes is often used because of its higher power.
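
To make the per-gene testing step concrete, here is a deliberately simplified sketch. It is not the mixed model or the empirical Bayes test described above: it assumes a direct comparison in which each slide carries both samples, an additive slide effect that cancels in the within-slide difference, and no dye effect. Under those assumptions an ordinary one-sample t statistic on the per-slide differences can be computed for one gene. The function name and the data values are hypothetical.

```python
import math
import statistics


def paired_t_statistic(differences):
    """One-sample t statistic for within-slide differences of one gene.

    `differences` holds one value per slide (sample 1 minus sample 2 on
    the normalized scale). Requires at least two slides with non-zero
    spread.
    """
    n = len(differences)
    mean_diff = statistics.mean(differences)
    # Standard error of the mean difference (sample standard deviation).
    std_err = statistics.stdev(differences) / math.sqrt(n)
    return mean_diff / std_err


# Hypothetical differences for one gene across 6 slides.
gene_differences = [0.8, 1.1, 0.6, 0.9, 1.3, 0.7]
t_stat = paired_t_statistic(gene_differences)
print(round(t_stat, 3))
```

In a full analysis this statistic would be referred to a t distribution with n - 1 degrees of freedom, and the same computation would be repeated for every gene.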

Results from testing

Set ID  Gene ID     v1-v2      p-value for (v1-v2)  q-value
1       MZ00040724  -4.69E-01  0.33691808           0.4012188
3       MZ00040748   1.01E-01  0.61046054           0.5306277
8       MZ00040802  -4.10E-01  0.18009214           0.2881755
9       MZ00013020  -4.96E-01  0.12907116           0.2438822
11      MZ00013044  -2.77E-01  0.26988092           0.3566803
12      MZ00013050  -7.81E-02  0.77596069           0.5895432
16      MZ00013098  -7.50E-02  0.73097085           0.5752585
18      MZ00000486  -5.16E-01  0.005203899          0.04976865
21      MZ00000528   3.69E-01  0.25837106           0.3488733
22      MZ00000534   4.98E-01  0.041544897          0.1337469
33      MZ00032020   1.98E-01  0.52396675           0.4961501
35      MZ00032044  -6.73E-01  0.000939694          0.02472483
37      MZ00032068  -5.98E-01  0.016160615          0.0844817
38      MZ00032074  -4.17E-01  0.27593771           0.3610925
40      MZ00032098  -1.88E-01  0.28042709           0.3641593
46      MZ00008134   2.11E-01  0.77894787           0.5905477
48      MZ00008158   8.70E-02  0.79905176           0.5954345
50      MZ00024806   1.01E-01  0.73992828           0.5788615
…       …           …          …                    …

- 2536 p-values are below 0.05.
- If no genes were differentially expressed, we would expect around 0.05*40000 = 2000 p-values to be less than 0.05 by chance.
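
The expected count quoted above follows from the fact that a p-value is uniformly distributed when its null hypothesis is true, so each of the m tests falls below the threshold with probability equal to the threshold. A short sketch of the arithmetic, using the numbers from the slide (the variable and function names are chosen here for illustration):

```python
def expected_null_rejections(n_tests, threshold):
    """Expected number of p-values below `threshold` if every null is true."""
    return n_tests * threshold


n_genes = 40000      # number of tests, from the slide
threshold = 0.05     # p-value cutoff, from the slide
observed = 2536      # p-values below the cutoff, from the slide

expected = expected_null_rejections(n_genes, threshold)
excess = observed - expected
print(expected, excess)
```

The observed count exceeds the count expected under the complete null hypothesis, but a raw p-value cutoff alone does not say which of the 2536 genes are the false positives.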

Possible Errors in Testing ONE gene

Hypothesis            Accept Null     Reject Null (sig)
True Null (non-DE)    correct         Type I Error
False Null (DE)       Type II Error   correct (Power)

- Type I Error: false positives
- Type II Error: false negatives (1 - power)
- Power: true positives

Error Rate in Multiple Testing
Outcomes when testing m genes (Benjamini and Hochberg, 1995)

Hypothesis            Accept Null   Reject Null   Total
True Null (Non-DE)    U             V             m0
False Null (DE)       T             S             m1
Total                 W             R             m

- Family-wise error rate, FWER = Pr(V > 0)
- False Discovery Rate, FDR = E(V/R | R > 0) * Pr(R > 0)

Results from testing for example I
(The table of v1-v2 estimates, p-values, and q-values shown under "Results from testing".)

Clustering
- Grouping genes into different "clusters" based on their expression profiles

Other analyses
- Relating gene expression to biological functional categories
- Gene Enrichment Test
- Connecting microarray data with other kinds of data, such as survival data
- More …

Assigned References
- Nettleton, D. (2006) A discussion of statistical methods for design and analysis of microarray experiments for plant scientists. The Plant Cell, 18, 2112–2121.
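
The false discovery rate defined under "Error Rate in Multiple Testing" can be controlled with the step-up procedure of Benjamini and Hochberg (1995). The sketch below computes Benjamini-Hochberg adjusted p-values in plain Python. It is offered as an illustration only: the slides do not state how the q-values in the results table were computed, and the function name and input p-values here are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    For the p-value with rank r (1 = smallest) among m tests, the adjusted
    value is the minimum of p * m / r over that rank and all larger ranks,
    capped at 1.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the running minimum keeps the
    # adjusted values monotone in the original p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes.
example_p = [0.01, 0.04, 0.03, 0.005]
example_adj = bh_adjust(example_p)

# Genes declared differentially expressed at a target FDR of 0.05.
selected = [i for i, q in enumerate(example_adj) if q <= 0.05]
print(example_adj, selected)
```

Rejecting every gene whose adjusted value is at or below a target level gives the same set of rejections as applying the Benjamini-Hochberg step-up rule at that level.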
