Clustering Gene Expression Data
EMBnet: DNA Microarrays Workshop
Mar. 4 – Mar. 8, 2002, UNIL & EPFL, Lausanne
Gaddy Getz, Weizmann Institute, Israel
• Gene Expression Data
• Clustering of Genes and Conditions
• Methods
– Agglomerative Hierarchical: Average Linkage
– Centroids: K-Means
– Physically motivated: Super-Paramagnetic Clustering
• Coupled Two-Way Clustering
Mar 2002 (GG)
Gene Expression Technologies
• DNA Chips (Affymetrix) and MicroArrays can measure
mRNA concentration of thousands of genes simultaneously
• General scheme: Extract RNA, synthesize labeled cDNA,
Hybridize with DNA on chip.
Single Experiment
• After hybridization
– Scan the Chip and obtain an image file
– Image Analysis (find spots, measure signal and noise)
Tools: ScanAlyze, Affymetrix, …
• Output File
– Affymetrix chips: For each gene a reading proportional
to the concentrations and a present/absent call.
(Average Difference, Absent Call)
– cDNA MicroArrays: competing hybridization of target
and control. For each gene the log ratio of target and
control. (CH1I-CH1B, CH2I-CH2B)
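The log ratio for one spot of a cDNA microarray can be sketched as follows. This is a minimal illustration, not the output of any particular image-analysis tool: it assumes CH1I/CH2I are the spot intensities and CH1B/CH2B the local backgrounds of the two channels, that channel 2 holds the target and channel 1 the control, and that a base-2 logarithm is used. The function name is hypothetical.

```python
import math

def log_ratio(ch1i, ch1b, ch2i, ch2b):
    """Return log2(target / control) after background subtraction.

    Assumption: channel 1 is the control and channel 2 is the target.
    Returns None when a background-corrected signal is not positive,
    because the logarithm is undefined there.
    """
    control = ch1i - ch1b
    target = ch2i - ch2b
    if control <= 0 or target <= 0:
        return None
    return math.log2(target / control)

# Target is twice the control after background subtraction.
print(log_ratio(ch1i=600, ch1b=100, ch2i=1100, ch2b=100))  # 1.0
```

Spots for which the function returns None would typically be treated as unreliable readings.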
Preprocessing: From one experiment to many
• Chip and Channel Normalization
– Aim: bring readings of all experiments to be on the
same scale
– Cause: different RNA amounts, labeling efficiency and
image acquisition parameters
– Method: Multiply readings of each array/channel by a
scaling factor, chosen in one of two ways:
• So that the sum of the scaled readings is the same for all arrays
• By a linear fit of the highly expressed genes
– Note: In multi-channel experiments normalize each
channel separately.
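The first scaling option (equal sum of readings on every array) can be sketched as below. This is an illustrative sketch only: the function name and the choice of the mean array sum as the common target are assumptions, and the linear-fit variant is not shown.

```python
def scale_to_common_sum(arrays, target_sum=None):
    """Scale each array (a list of readings) so that all arrays have the same sum.

    If target_sum is not given, the mean of the original sums is used
    (an arbitrary choice made for this sketch).
    Returns the scaled arrays and the scaling factor applied to each.
    """
    sums = [sum(a) for a in arrays]
    if target_sum is None:
        target_sum = sum(sums) / len(sums)
    factors = [target_sum / s for s in sums]
    scaled = [[x * f for x in a] for a, f in zip(arrays, factors)]
    return scaled, factors

# The second array was measured with twice the overall signal.
arrays = [[10.0, 20.0, 70.0], [20.0, 40.0, 140.0]]
scaled, factors = scale_to_common_sum(arrays)
print(factors)  # [1.5, 0.75]
print(scaled)   # [[15.0, 30.0, 105.0], [15.0, 30.0, 105.0]]
```

For a multi-channel experiment, the same function would be applied to each channel separately.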
Preprocessing: From one experiment to many
• Filtering of Genes
– Remove genes that are absent in most experiments
– Remove genes that are constant in all experiments
– Remove genes with low readings which are not reliable.
[Figure: Colon cancer data (Alon et al.); heat map of Genes (rows, ticks 200–2000) by Experiments (columns, ticks 10–60)]
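The three filters can be sketched as a single pass over the genes. The thresholds below (`min_present_fraction`, `min_range`, `min_reading`) and the function name are hypothetical, chosen only for illustration; the slides do not give numeric cut-offs.

```python
def filter_genes(readings, present, min_present_fraction=0.5,
                 min_range=0.0, min_reading=0.0):
    """Return the indices of genes that pass all three filters.

    readings[g][e]: reading of gene g in experiment e
    present[g][e]:  True if gene g was called present in experiment e
    """
    kept = []
    for g, (row, calls) in enumerate(zip(readings, present)):
        if sum(calls) / len(calls) < min_present_fraction:
            continue  # absent in most experiments
        if max(row) - min(row) <= min_range:
            continue  # constant in all experiments
        if max(row) < min_reading:
            continue  # only low, unreliable readings
        kept.append(g)
    return kept

readings = [[100, 400, 250],   # varies, well above the noise level
            [50, 50, 50],      # constant
            [5, 8, 6],         # low readings
            [300, 100, 200]]   # mostly absent
present = [[True, True, True],
           [True, True, True],
           [True, True, False],
           [False, False, True]]
print(filter_genes(readings, present, min_reading=20))  # [0]
```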
Noise and Repeats
[Figure: log–log plot]
• >90% 2 to 3 fold
• Multiplicative noise
• Repeat experiments
• Log scale: dist(4,2)=dist(2,1)
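The point of the log scale can be checked numerically: with multiplicative noise, a two-fold change should count the same whether it is from 1 to 2 or from 2 to 4. A small sketch, assuming a base-2 logarithm (the function name is illustrative):

```python
import math

def log_distance(a, b):
    """Distance between two positive readings on a log2 scale."""
    return abs(math.log2(a) - math.log2(b))

# On a linear scale the two differences are 2 and 1;
# on the log scale both are one two-fold step.
print(abs(4 - 2), abs(2 - 1))                    # 2 1
print(log_distance(4, 2), log_distance(2, 1))    # 1.0 1.0
```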
We can ask many questions
Supervised Methods (use predefined labels)
• Which genes are expressed differently in two
known types of conditions?
• What is the minimal set of genes needed to
distinguish one type of conditions from the others?
Unsupervised Methods (use only the data)
• Which genes behave similarly in the experiments?
• How many different types of conditions are there?
Unsupervised Analysis
• Goal A: Find groups of genes that have correlated
expression profiles.
These genes are believed to belong to the same
biological process and/or are co-regulated.
• Goal B: Divide conditions into groups with similar
gene expression profiles.
Example: divide drugs according to their effect on
gene expression.
Clustering Methods
What is clustering?
Cluster Analysis Yields Dendrogram
T (RESOLUTION)
What is clustering? More Mathematically
• Input: N data points, Xi, i=1,2,…,N in a D
dimensional space.
• Goal: Find “natural” groups or clusters.
Data points in the same cluster are “more similar”
• Tasks:
– Determine number of clusters
– Generate a dendrogram
– Identify significant “stable” clusters
Clustering is ill-posed
• Problem specific definitions
• Similarity: which points should be
considered close?
– Correlation coefficient
– Euclidean distance
• Resolution: specify/hierarchical results
• Shape of clusters: general, spherical.
Similarity Measure
• Similarity measures
– Centered Correlation
– Uncentered Correlation
– Absolute correlation
– Euclidean
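The four measures can be written out as below. The slide's formulas were not preserved, so this sketch follows the usual conventions and these are assumptions: centered correlation is the Pearson coefficient, uncentered correlation is the same expression without subtracting the means, and absolute correlation is taken here as the absolute value of the centered correlation. Vectors are assumed to be non-constant and non-zero so that the norms do not vanish.

```python
import math

def uncentered_correlation(x, y):
    """Dot product of x and y divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def centered_correlation(x, y):
    """Pearson correlation: subtract each vector's mean, then correlate."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return uncentered_correlation([a - mean_x for a in x],
                                  [b - mean_y for b in y])

def absolute_correlation(x, y):
    """Treats correlated and anti-correlated profiles as equally similar."""
    return abs(centered_correlation(x, y))

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

up = [1.0, 2.0, 3.0]
up_scaled = [2.0, 4.0, 6.0]
down = [3.0, 2.0, 1.0]
print(centered_correlation(up, up_scaled))   # close to 1: same shape
print(centered_correlation(up, down))        # close to -1: opposite shape
print(absolute_correlation(up, down))        # close to 1
print(euclidean_distance(up, up_scaled))     # sqrt(14): the scale matters here
```

Note that the correlation measures are similarities (larger means closer), while the Euclidean measure is a distance (smaller means closer).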
Agglomerative Hierarchical Clustering
• Need to define the distance between the new cluster and the other clusters.
– Single Linkage: distance between closest pair.
– Complete Linkage: distance between farthest pair.
– Average Linkage: average distance between all pairs
or distance between cluster centers
[Figure: five data points (1–5) merged step by step, and the resulting Dendrogram; vertical axis: Distance between joined clusters]
• The dendrogram induces a linear ordering
of the data points
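The merge procedure and the three linkage rules can be sketched in a few lines. This is a naive illustration written for clarity rather than speed, using Euclidean distance as an assumed similarity measure; the function names are not from any package, and the "distance between cluster centers" variant is not shown.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cluster_distance(c1, c2, points, linkage):
    """Distance between two clusters, given as lists of point indices."""
    d = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":
        return min(d)            # closest pair
    if linkage == "complete":
        return max(d)            # farthest pair
    if linkage == "average":
        return sum(d) / len(d)   # average over all pairs
    raise ValueError("unknown linkage: " + linkage)

def agglomerate(points, linkage="average"):
    """Greedily merge the two closest clusters until one remains.

    Returns the list of merges (cluster_a, cluster_b, distance) and the
    order of the points in the final cluster, which is one linear
    ordering of the dendrogram leaves.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = cluster_distance(clusters[a], clusters[b], points, linkage)
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((clusters[a], clusters[b], dist))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges, clusters[0]

points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
for linkage in ("single", "complete", "average"):
    merges, order = agglomerate(points, linkage)
    print(linkage, [dist for _, _, dist in merges])
# single   [1.0, 1.0, 4.0, 14.0]
# complete [1.0, 1.0, 6.0, 20.0]
# average  [1.0, 1.0, 5.0, 17.0]
```

The same points give different merge heights under each rule, which is why the resulting dendrogram depends on the distance update method.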
Agglomerative Hierarchical Clustering
• Results depend on distance update method
– Single Linkage: elongated clusters
– Complete Linkage: sphere-like clusters
• Greedy iterative process
• NOT robust against noise
• No inherent measure to choose the clusters
Centroid Methods - K-means
• Start with random position of K centroids.
• Iterate until centroids are stable
– Assign each point to its nearest centroid
– Move centroids to center of assigned points
[Figure: K-means animation shown over successive slides, from Iteration = 0 to Iteration = 3]
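The iteration can be sketched as follows. This is a minimal sketch under stated assumptions: squared Euclidean distance is used, the random starting positions are taken to be K randomly chosen data points (one common choice), and the optional `initial` argument is added only so that a run can be made reproducible. None of these details come from the slides.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, initial=None, seed=0, max_iter=100):
    """Return (centroids, labels) after iterating until centroids are stable."""
    rng = random.Random(seed)
    start = initial if initial is not None else rng.sample(points, k)
    centroids = [tuple(c) for c in start]
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Move each centroid to the center of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # empty cluster: leave in place
        if new_centroids == centroids:
            break  # centroids are stable
        centroids = new_centroids
    return centroids, labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

Running the sketch with different values of `seed` changes the starting centroids, which is the simplest way to see how the result can depend on initialization.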
Centroid Methods - K-means
• Result depends on initial centroids’ position
• Fast algorithm: compute distances from data
points to centroids
• No way to choose K.
• Example: 3 clusters / K=2, 3, 4
• Breaks long clusters
Super-Paramagnetic Clustering (SPC)
M. Blatt, S. Wiseman and E. Domany (1996) Neural Computation
• The idea behind SPC is based on the physical
properties of dilute magnets.
• Calculating correlation between magnet
orientations at different temperatures (T).
[Figure: magnet orientations shown over successive slides at T=Low, T=High and T=Intermediate]
Super-Paramagnetic Clustering (SPC)
• The algorithm simulates the magnets’ behavior over a range of
temperatures and calculates their correlation
• The temperature (T) controls the resolution
• Example: N=4800 points in D=2
Output of SPC
A function (T) that peaks
when stable clusters break
Size of largest clusters as
function of T
Dendrogram
Mar 2002 (GG)
Stable clusters
“live” for large T
27
Choosing a value for T
Advantages of SPC
• Scans all resolutions (T)
• Robust against noise and initialization: calculates collective correlations.
• Identifies “natural” (χ) and stable clusters (T)
• No need to pre-specify number of clusters
• Clusters can be any shape
Many clustering methods applied
to expression data
• Agglomerative Hierarchical
– Average Linkage (Eisen et al., PNAS 1998)
• Centroid (representative)
– K-Means (Golub et al., Science 1999)
– Self-Organizing Maps (Tamayo et al., PNAS 1999)
• Physically motivated
– Deterministic Annealing (Alon et al., PNAS 1999)
– Super-Paramagnetic Clustering (Getz et al., Physica A 2000)
Available Tools
• Software packages:
– M. Eisen’s programs for clustering and display of
results (Cluster, TreeView)
• Predefined set of normalizations and filtering
• Agglomerative, K-means, 1D SOM
• Web sites:
– Coupled Two-Way Clustering (CTWC) website (both CTWC and SPC)
http://ctwc.weizmann.ac.il
– http://ep.ebi.ac.uk/EP/EPCLUST/
• General mathematical tools
– MATLAB
• Agglomerative, public m-files.
– Statistical programs (SPSS, SAS, S-plus)
Back to gene expression data
• 2 Goals: Cluster Genes and Conditions
• 2 independent clusterings:
– Genes are represented as vectors of expression in
all conditions
– Conditions are represented as vectors of
expression of all genes
[Figure: Colon cancer data (normalized genes); heat map of Genes (rows, ticks 200–2000) by Experiments (columns, ticks 10–60), color scale from -0.4 to 0.8]
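The two representations are the rows and the columns of the same expression matrix. A small sketch with made-up numbers (the matrix values and variable names are illustrative only):

```python
# Rows are genes, columns are conditions (made-up readings).
expression = [
    [2.0, 4.0, 6.0],   # gene 0 in conditions 0, 1, 2
    [1.0, 1.5, 2.0],   # gene 1 in conditions 0, 1, 2
]

# Clustering genes: each gene is a vector over all conditions (a row).
gene_vectors = expression

# Clustering conditions: each condition is a vector over all genes (a column).
condition_vectors = [list(column) for column in zip(*expression)]

print(gene_vectors)       # [[2.0, 4.0, 6.0], [1.0, 1.5, 2.0]]
print(condition_vectors)  # [[2.0, 1.0], [4.0, 1.5], [6.0, 2.0]]
```

Any clustering routine that accepts a list of vectors can then be run once on `gene_vectors` and once on `condition_vectors`.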
First clustering - Experiments
1. Identify tissue classes (tumor/normal)
Second Clustering - Genes
2. Find Differentiating And Correlated Genes
Ribosomal proteins
Cytochrome C
metabolism
HLA2
Two-way Clustering
Coupled Two-Way Clustering (CTWC)
G. Getz, E. Levine and E. Domany (2000) PNAS
•Motivation: Only a small subset of genes play a role in
a particular biological process; the other genes
introduce noise, which may mask the signal of the
important players. Only a subset of the samples exhibit
the expression patterns of interest.
•New Goal: Use subsets of genes to study subsets of
samples (and vice versa)
•A non-trivial task – exponential number of subsets.
•CTWC is a heuristic to solve this problem.
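The coupling idea (use each gene cluster found so far to re-cluster each sample cluster, and vice versa, until nothing new appears) can be sketched as a loop. This is only a rough sketch in the spirit of CTWC, not the published algorithm: the clustering step here is a simple stand-in (groups of items linked by a Euclidean distance below a hypothetical `threshold`), whereas CTWC relies on a method that identifies stable clusters, such as SPC. All function names and parameters are assumptions.

```python
import math
from itertools import product

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def stand_in_clusters(vectors, threshold, min_size=2):
    """Stand-in clustering step (NOT SPC): connected groups of items whose
    distance is below threshold. vectors maps an item id to its vector."""
    remaining = set(vectors)
    clusters = []
    while remaining:
        start = remaining.pop()
        component, frontier = {start}, [start]
        while frontier:
            current = frontier.pop()
            near = {j for j in remaining
                    if distance(vectors[current], vectors[j]) < threshold}
            remaining -= near
            component |= near
            frontier.extend(near)
        # Keep proper sub-clusters only: large enough, but not the whole input.
        if min_size <= len(component) < len(vectors):
            clusters.append(frozenset(component))
    return clusters

def ctwc_sketch(matrix, threshold, max_rounds=3):
    """matrix[g][s] is the expression of gene g in sample s."""
    gene_sets = {frozenset(range(len(matrix)))}
    sample_sets = {frozenset(range(len(matrix[0])))}
    done = set()
    for _ in range(max_rounds):
        new_genes, new_samples = set(), set()
        for g, s in product(gene_sets, sample_sets):
            if (g, s) in done:
                continue
            done.add((g, s))
            # Cluster genes g using only samples s, and samples s using only genes g.
            gene_vecs = {i: [matrix[i][j] for j in sorted(s)] for i in g}
            sample_vecs = {j: [matrix[i][j] for i in sorted(g)] for j in s}
            new_genes.update(stand_in_clusters(gene_vecs, threshold))
            new_samples.update(stand_in_clusters(sample_vecs, threshold))
        if new_genes <= gene_sets and new_samples <= sample_sets:
            break  # no new clusters were found
        gene_sets |= new_genes
        sample_sets |= new_samples
    return gene_sets, sample_sets

matrix = [[0, 0, 9, 9],
          [0, 0, 9, 9],
          [9, 9, 0, 0],
          [9, 9, 0, 0]]
gene_sets, sample_sets = ctwc_sketch(matrix, threshold=1.0)
print(sorted(sorted(g) for g in gene_sets))    # [[0, 1], [0, 1, 2, 3], [2, 3]]
print(sorted(sorted(s) for s in sample_sets))  # [[0, 1], [0, 1, 2, 3], [2, 3]]
```

Restricting each step to pairs of already-found clusters is what keeps the search away from the exponential number of possible subsets.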
Booing
Cheering
CTWC of colon cancer data
[Figure: heat map (rows with ticks 200–2000, columns with ticks 10–60) with regions labeled A and B, plus panels (A) and (B) with both axes running from 0 to 60]
CTWC of Glioblastoma Data – S1(G5)
Godard, Getz, Kobayashi, Nozaki, Diserens, Hamon, Stupp, Janzer,
Bucher, de Tribolet, Domany & Hegi (2002) Submitted
[Figure: sample dendrogram with clusters S10, S11, S12, S13 and S14. Legend: Glioma cell line, Low grade astrocytoma, Secondary GBM, Primary GBM, p53 mutation]
Genes listed on the slide (accession number, name):
AB004904  STAT-induced STAT inhibitor 3
M32977    VEGF
M35410    IGFBP2
X51602    VEGFR1
M96322    gravin
AB004903  STAT-induced STAT inhibitor 2
X52946    PTN
J04111    c-jun
X79067    TIS11B
Label on the gene list: ANGIOGENESIS
Biological Work
• Literature search for the genes
• Genomics: search for common regulatory
signal upstream of the genes
• Proteomics: infer functions.
• Design next experiment – get more data to
validate result.
• Find what is in common with sets of
experiments/conditions.
Summary
• Clustering methods are used to
– find genes from the same biological process
– group the experiments into similar conditions
• Different clustering methods can give different
results. The physically motivated ones are more
robust.
• Focusing on subsets of the genes and conditions
can uncover structure that is masked when using
all genes and conditions
http://ctwc.weizmann.ac.il