
Clustering Gene Expression Data
EMBnet: DNA Microarrays Workshop, Mar. 4 – Mar. 8, 2002, UNIL & EPFL, Lausanne
Gaddy Getz, Weizmann Institute, Israel
• Gene Expression Data
• Clustering of Genes and Conditions
• Methods
  – Agglomerative Hierarchical: Average Linkage
  – Centroids: K-Means
  – Physically motivated: Super-Paramagnetic Clustering
• Coupled Two-Way Clustering
Mar 2002 (GG)

Gene Expression Technologies
• DNA Chips (Affymetrix) and MicroArrays can measure the mRNA concentration of thousands of genes simultaneously
• General scheme: extract RNA, synthesize labeled cDNA, hybridize with DNA on the chip.

Single Experiment
• After hybridization
  – Scan the chip and obtain an image file
  – Image analysis (find spots, measure signal and noise). Tools: ScanAlyze, Affymetrix, …
• Output file
  – Affymetrix chips: for each gene, a reading proportional to the concentration and a present/absent call. (Average Difference, Absent Call)
  – cDNA MicroArrays: competing hybridization of target and control. For each gene, the log ratio of target and control. (CH1I-CH1B, CH2I-CH2B)

Preprocessing: From one experiment to many
• Chip and Channel Normalization
  – Aim: bring the readings of all experiments onto the same scale
  – Cause: different RNA amounts, labeling efficiency and image acquisition parameters
  – Method: multiply the readings of each array/channel by a scaling factor such that:
    • The sum of the scaled readings is the same for all arrays
    • Find the scaling factor by a linear fit of the highly expressed genes
  – Note: in multi-channel experiments, normalize each channel separately.

Preprocessing: From one experiment to many
[Figure: Colon cancer data (Alon et al.)]
• Filtering of Genes
  – Remove genes that are absent in most experiments
  – Remove genes that are constant in all experiments
  – Remove genes with low readings, which are not reliable.
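The normalization and filtering steps above can be sketched in a few lines. This is a minimal illustration, not the workshop's actual procedure: it assumes a genes × experiments matrix of non-negative readings, uses the mean column sum as the common target, and the function names and the thresholds `min_reading` and `min_fraction` are hypothetical.

```python
import numpy as np

def normalize_arrays(readings):
    """Scale each array (column) so that all columns have the same sum.

    The common target (the mean column sum) is an arbitrary choice.
    """
    readings = np.asarray(readings, dtype=float)
    column_sums = readings.sum(axis=0)
    scale = column_sums.mean() / column_sums  # one scaling factor per array
    return readings * scale

def filter_genes(readings, min_reading=20.0, min_fraction=0.5):
    """Drop genes with low readings in most experiments and constant genes.

    min_reading and min_fraction are hypothetical thresholds.
    """
    readings = np.asarray(readings, dtype=float)
    # Fraction of experiments in which the gene is above the reading threshold.
    present = (readings >= min_reading).mean(axis=1) >= min_fraction
    # A gene that never changes carries no information for clustering.
    varying = readings.std(axis=1) > 0
    return readings[present & varying]

raw = np.array([[100.0, 200.0,  50.0],
                [ 10.0,  20.0,   5.0],
                [400.0, 800.0, 200.0]])
scaled = normalize_arrays(raw)
kept = filter_genes(np.array([[50.0, 50.0, 50.0],
                              [10.0, 200.0, 300.0]]))
```

After scaling, every column of `scaled` has the same sum, and `kept` retains only the second gene because the first is constant across experiments.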
[Figure axes: Genes (vertical), Experiments (horizontal)]

Noise and Repeats
[Figure: log – log plot]
• >90% 2 to 3 fold
• Multiplicative noise
• Repeat experiments
• Log scale: dist(4,2) = dist(2,1)

We can ask many questions
Supervised Methods (use predefined labels)
• Which genes are expressed differently in two known types of conditions?
• What is the minimal set of genes needed to distinguish one type of condition from the others?
Unsupervised Methods (use only the data)
• Which genes behave similarly in the experiments?
• How many different types of conditions are there?

Unsupervised Analysis
• Goal A: Find groups of genes that have correlated expression profiles. These genes are believed to belong to the same biological process and/or to be co-regulated.
• Goal B: Divide conditions into groups with similar gene expression profiles. Example: divide drugs according to their effect on gene expression.
Clustering Methods

What is clustering?
[Figure]

Cluster Analysis Yields Dendrogram
[Figure: dendrogram; axis: T (RESOLUTION)]

What is clustering? More Mathematically
• Input: N data points, Xi, i=1,2,…,N, in a D-dimensional space.
• Goal: Find “natural” groups or clusters. Data points of the same cluster are “more similar”.
• Tasks:
  – Determine the number of clusters
  – Generate a dendrogram
  – Identify significant “stable” clusters

Clustering is ill-posed
• Problem-specific definitions
• Similarity: which points should be considered close?
  – Correlation coefficient
  – Euclidean distance
• Resolution: specify/hierarchical results
• Shape of clusters: general, spherical.

Similarity Measure
• Similarity measures
  – Centered Correlation
  – Uncentered Correlation
  – Absolute correlation
  – Euclidean

Agglomerative Hierarchical Clustering
Need to define the distance between the new cluster and the other clusters.
• Single Linkage: distance between the closest pair.
• Complete Linkage: distance between the farthest pair.
• Average Linkage: average distance between all pairs, or distance between cluster centers.
[Figure: data points labeled 1–5 and the resulting dendrogram; axis: Distance between joined clusters]
The dendrogram induces a linear ordering of the data points.

Agglomerative Hierarchical Clustering
• Results depend on the distance update method
  – Single Linkage: elongated clusters
  – Complete Linkage: sphere-like clusters
• Greedy iterative process
• NOT robust against noise
• No inherent measure to choose the clusters

Centroid Methods - K-means
• Start with random positions of K centroids.
• Iterate until the centroids are stable:
  – Assign points to centroids
  – Move centroids to the center of the assigned points
[Figure sequence: successive iterations, from Iteration = 0 to Iteration = 3]

Centroid Methods - K-means
• Result depends on the initial centroids’ positions
• Fast algorithm: compute distances from data points to centroids
• No way to choose K.
• Example: 3 clusters / K=2, 3, 4
• Breaks long clusters

Super-Paramagnetic Clustering (SPC)
M. Blatt, S. Wiseman and E. Domany (1996) Neural Computation
• The idea behind SPC is based on the physical properties of dilute magnets.
• Calculating the correlation between magnet orientations at different temperatures (T).
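For comparison with what follows, the K-means loop from the centroid slides above (random start, assign, move, repeat until stable) can be sketched as below. This is a minimal illustration under stated assumptions: Euclidean distance, initial centroids drawn from the data points, and a hypothetical `kmeans` function with made-up toy data; it is not the implementation used in the talk.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain K-means with Euclidean distance (illustrative sketch)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from K distinct data points chosen at random (one common choice).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points;
        # a centroid with no points stays where it is.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids are stable
        centroids = new_centroids
    return labels, centroids

toy = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(toy, k=2)
```

On this toy set the two well-separated groups end up in different clusters. As the slides note, the result in general depends on the starting centroids, and K has to be supplied by the user.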
[Figures: magnet orientations at T=Low, T=High and T=Intermediate]

Super-Paramagnetic Clustering (SPC)
• The algorithm simulates the magnets’ behavior over a range of temperatures and calculates their correlation
• The temperature (T) controls the resolution
• Example: N=4800 points in D=2

Output of SPC
• A function χ(T) that peaks when stable clusters break
• Size of the largest clusters as a function of T
• Dendrogram
• Stable clusters “live” over a large range of T

Choosing a value for T
[Figure]

Advantages of SPC
• Scans all resolutions (T)
• Robust against noise and initialization: calculates collective correlations.
• Identifies “natural” (χ) and stable clusters (T)
• No need to pre-specify the number of clusters
• Clusters can be any shape

Many clustering methods applied to expression data
• Agglomerative Hierarchical
  – Average Linkage (Eisen et al., PNAS 1998)
• Centroid (representative)
  – K-Means (Golub et al., Science 1999)
  – Self Organized Maps (Tamayo et al., PNAS 1999)
• Physically motivated
  – Deterministic Annealing (Alon et al., PNAS 1999)
  – Super-Paramagnetic Clustering (Getz et al., Physica A 2000)

Available Tools
• Software packages:
  – M. Eisen’s programs for clustering and display of results (Cluster, TreeView)
    • Predefined set of normalizations and filtering
    • Agglomerative, K-means, 1D SOM
• Web sites:
  – Coupled Two-Way Clustering (CTWC) website http://ctwc.weizmann.ac.il (both CTWC and SPC)
  – http://ep.ebi.ac.uk/EP/EPCLUST/
• General mathematical tools
  – MATLAB
    • Agglomerative, public m-files.
  – Statistical programs (SPSS, SAS, S-plus)

Back to gene expression data
[Figure: Colon cancer data (normalized genes); axes: Genes, Experiments]
• 2 Goals: Cluster Genes and Conditions
• 2 independent clusterings:
  – Genes are represented as vectors of expression in all conditions
  – Conditions are represented as vectors of expression of all genes

First clustering - Experiments
1. Identify tissue classes (tumor/normal)

Second Clustering - Genes
2. Find Differentiating And Correlated Genes
[Figure labels: Ribosomal proteins, Cytochrome C, metabolism, HLA2]

Two-way Clustering
[Figure]

Coupled Two-Way Clustering (CTWC)
G. Getz, E. Levine and E. Domany (2000) PNAS
• Motivation: Only a small subset of genes plays a role in a particular biological process; the other genes introduce noise, which may mask the signal of the important players. Only a subset of the samples exhibits the expression patterns of interest.
• New Goal: Use subsets of genes to study subsets of samples (and vice versa)
• A non-trivial task: there is an exponential number of subsets.
• CTWC is a heuristic to solve this problem.
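The basic ingredient of the motivation above, comparing samples using only a chosen subset of genes, can be illustrated as follows. This is not the CTWC algorithm itself, only a sketch of restricting the expression matrix before computing sample-to-sample similarity; the `sample_correlation` helper, the toy values and the choice of Pearson correlation are all assumptions made for the example.

```python
import numpy as np

def sample_correlation(expr, gene_idx=None):
    """Pearson correlation between samples (columns of a genes x samples matrix).

    If gene_idx is given, only those genes (rows) are used.
    """
    expr = np.asarray(expr, dtype=float)
    sub = expr if gene_idx is None else expr[gene_idx]
    # rowvar=False treats each column (sample) as a variable.
    return np.corrcoef(sub, rowvar=False)

# Three "signal" genes separate samples 0,1 from samples 2,3 (made-up values).
signal = np.array([[5.0, 5.2, 1.0, 1.1],
                   [1.0, 0.9, 4.0, 4.2],
                   [3.0, 3.1, 0.5, 0.4]])
# Twenty unrelated genes act as noise (seeded for reproducibility).
noise = np.random.default_rng(0).normal(size=(20, 4))
expr = np.vstack([signal, noise])

corr_all = sample_correlation(expr)                        # all 23 genes
corr_subset = sample_correlation(expr, gene_idx=[0, 1, 2])  # signal genes only
```

With the three signal genes alone, samples 0 and 1 are strongly correlated and samples 0 and 2 are anti-correlated; how much of that structure survives in `corr_all` depends on the noise genes.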
[Figure labels: Booing, Cheering]

CTWC of colon cancer data
[Figure: panels (A) and (B)]

CTWC of Glioblastoma Data – S1(G5)
Godard, Getz, Kobayashi, Nozaki, Diserens, Hamon, Stupp, Janzer, Bucher, de Tribolet, Domany & Hegi (2002) Submitted
[Figure: sample clusters S10, S11, S12, S13, S14]
• Sample labels: Glioma cell line, Low grade astrocytoma, Secondary GBM, Primary GBM, p53 mutation
• Accession numbers: AB004904, M32977, M35410, X51602, M96322, AB004903, X52946, J04111, X79067
• Gene labels: STAT-induced STAT inhibitor 3, VEGF, IGFBP2, VEGFR1, gravin, STAT-induced STAT inhibitor 2, PTN, c-jun, TIS11B
• Group label: ANGIOGENESIS

Biological Work
• Literature search for the genes
• Genomics: search for a common regulatory signal upstream of the genes
• Proteomics: infer functions.
• Design the next experiment: get more data to validate the result.
• Find what sets of experiments/conditions have in common.

Summary
• Clustering methods are used to
  – find genes from the same biological process
  – group the experiments into similar conditions
• Different clustering methods can give different results. The physically motivated ones are more robust.
• Focusing on subsets of the genes and conditions can uncover structure that is masked when using all genes and conditions
http://ctwc.weizmann.ac.il
