Machine Learning for Epigenetics:
some initial results
Guido Sanguinetti
School of Informatics, University of Edinburgh
ICMS Big data in Science
Guido Sanguinetti (University of Edinburgh)
ML for epigenetics
ICMS, 06/05/15
Talk outline
1. Background
2. Statistical testing in ChIP-Seq data (G. Schweikert)
3. MMDiff: Results
4. Transcription factors and histone modifications (with D. Sproul)
5. Shape-based testing for methylation profiles (T. Mayo)
The central dogma
Where does variability come into play? What can we measure?
Transcription as a hybrid system
Given the promoter state, the proteins obey the following
dynamical model of transcription (linear SDE)
$dx(t) = (A\mu(t) + b - \lambda x(t))\,dt + \sigma\,dW$    (1)
For a long time, I’ve been interested in inference in this type of
system
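As an illustration of equation (1), the sketch below simulates the SDE with the Euler-Maruyama scheme for a scalar x and a piecewise-constant promoter state µ(t). The parameter values, the switching time and the name `simulate_transcription` are assumptions chosen for illustration; none of them come from the talk.

```python
import math
import random

def simulate_transcription(mu, A=2.0, b=0.1, lam=0.5, sigma=0.05,
                           x0=0.0, dt=0.01, n_steps=2000, seed=0):
    """Euler-Maruyama simulation of dx = (A*mu(t) + b - lam*x) dt + sigma dW.

    mu is a function t -> promoter state; all default values are
    illustrative assumptions.
    """
    rng = random.Random(seed)
    x = x0
    path = [x]
    for step in range(n_steps):
        t = step * dt
        drift = A * mu(t) + b - lam * x
        # A Brownian increment over dt has standard deviation sqrt(dt)
        dW = rng.gauss(0.0, math.sqrt(dt))
        x = x + drift * dt + sigma * dW
        path.append(x)
    return path

# Hypothetical promoter: off for t < 5, on afterwards
promoter = lambda t: 1.0 if t >= 5.0 else 0.0
path = simulate_transcription(promoter)
```

With the noise switched off (sigma=0), the path relaxes towards b/λ while the promoter is off and towards (A + b)/λ once it is on, which is a quick sanity check on the drift term.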
What else is happening around this abstraction?
Epigenetics
Genetics and transcription cannot be the whole story; the spatial
organisation of chromosomes plays a role. This is determined by chemical
modifications to DNA and histones.
A more accurate picture?
[Figure: Genome (DNA sequence, ....CCACCGAACGCGCGCGGGAACGGCACGAGCGGGGCGCCG...), trans-factors (e.g. Cfp-1), Epigenome, Transcriptome (RNA-Seq / Pol-II). Source: Zhou et al., Nat Rev Genet, 2011]
The modelling cycle
Informatics will provide the synthesis!
Epigenetics: what the data looks like
Each row shows a tiny fraction of the data from one next-generation
sequencing experiment; each full row is ≥1 GB of data. How do we determine
relationships between the rows?
What the data looks like
after QC, mapping, alignment,
[Figure panels: Histone modification data; DNA Methylation data]
Obvious problems
Small data, with each data point being very big
Even restricting to regions (e.g. genes), the data is high
dimensional and non-trivial
How can we even determine statistical differences?
What is a suitable probability model for each of these
high-dimensional, non-Gaussian items?
Data associated with different genes may be of intrinsically
different dimensionality. How can I do even basic things like
clustering?
How can we model in the presence of very strong redundancies
(dimensionality reduction)?
Issues in statistical testing
Statistical testing: procedure to assess the significance of
differences between groups of measurements in the light of
normal variation
Given two sets of replicate measurements, we compute a
statistic and determine the probability that a difference as large
or larger in the statistic could be due to chance (p-value)
Boring? Misleading (see controversy in Nature earlier this year)?
How do we carry out (thousands of) tests when each
measurement is high dimensional?
Careful choice of 1-dimensional summaries (statistics) is
essential.
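A minimal sketch of the p-value definition above, using a permutation test on a one-dimensional statistic (the absolute difference of group means). The permutation scheme, the choice of statistic and the name `permutation_p_value` are assumptions for illustration; this is not the testing procedure used in the talk.

```python
import random

def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Estimate P(statistic >= observed) under random relabelling of the groups."""
    def stat(a, b):
        # 1-dimensional summary: absolute difference of the group means
        return abs(sum(a) / len(a) - sum(b) / len(b))

    rng = random.Random(seed)
    observed = stat(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if stat(pooled[:n_a], pooled[n_a:]) >= observed:
            hits += 1
    # The add-one correction keeps the estimate strictly positive
    return (hits + 1) / (n_perm + 1)

# Hypothetical replicate measurements for two conditions
p_value = permutation_p_value([0.0, 0.1, 0.2, 0.1, 0.0, 0.2],
                              [5.0, 5.1, 5.2, 5.1, 5.0, 5.2])
```

The mean difference is exactly the kind of one-dimensional summary that can miss changes in shape, which motivates the kernel-based statistic introduced later.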
Example: ChIP-Seq
[Figure: ChIP-Seq workflow. Cross-linking of the DNA-binding protein to DNA; DNA fragmentation; enrichment with specific antibody (ChIP); profiling of enriched DNA (Seq), showing individual sequencing reads (tags) and read (tag) density. Source: Kim and Park, 2011]
What the data looks like
after QC, mapping, alignment,
The shape of the peak seems highly conserved across replicates.
Formulate the test question
Suppose for a peak i we are given
n observations (i.e. reads) in data set s (e.g. WT),
$X^{s} = \{x^{s}_{1}, \dots, x^{s}_{n}\}$
m observations in data set s' (e.g. Null),
$X^{s'} = \{x^{s'}_{1}, \dots, x^{s'}_{m}\}$
where $x^{s}$, $x^{s'}$ are random variables
drawn i.i.d. from probability distributions $p$ and $p'$.
Can we decide whether $p \neq p'$?
Define test statistic:
should summarize the data, preferably in a single number
should capture higher order moments
→ use the kernel trick
MMD Test statistics
Nonlinear kernel function $k(x^{s}, x^{s'})$ → the mean embedding of a
distribution p (in the RKHS $\mathcal{F}$) contains the information of all
higher-order moments.
The maximum mean discrepancy (MMD) is the distance
between mean embeddings
$\mathrm{MMD}[\mathcal{F}, p, p'] = \sup_{f \in \mathcal{F}} \left( E_{x \sim p}[f(x)] - E_{x' \sim p'}[f(x')] \right)$
Theorem: $\mathrm{MMD}[\mathcal{F}, p, p'] = 0$ if and only if $p = p'$
Finite sample estimates of MMD will be different from zero, but
their distribution can be estimated (by bootstrapping)
MMD can be efficiently computed in terms of kernel functions
$\mathrm{MMD}^{(s,s')} = \left[ \frac{1}{n^{2}} \sum_{i,j=1}^{n} k(x^{s}_{i}, x^{s}_{j}) - \frac{2}{n \cdot m} \sum_{i=1}^{n} \sum_{j=1}^{m} k(x^{s}_{i}, x^{s'}_{j}) + \frac{1}{m^{2}} \sum_{i,j=1}^{m} k(x^{s'}_{i}, x^{s'}_{j}) \right]^{1/2}$
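The kernel form of the MMD can be sketched directly: average the kernel over all within-sample and cross-sample pairs, combine the three averages, and take the square root. This is the standard biased finite-sample estimator written in plain Python; the RBF kernel, the width of 50 and the toy read positions are assumptions for illustration.

```python
import math

def rbf_kernel(x, y, sigma):
    """Gaussian (RBF) kernel on scalar positions."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))

def mmd(xs, ys, kernel):
    """Biased finite-sample MMD estimate between samples xs and ys."""
    n, m = len(xs), len(ys)
    k_xx = sum(kernel(a, b) for a in xs for b in xs) / (n * n)
    k_xy = sum(kernel(a, b) for a in xs for b in ys) / (n * m)
    k_yy = sum(kernel(a, b) for a in ys for b in ys) / (m * m)
    # Clamp at zero to guard against tiny negative values from rounding
    return math.sqrt(max(k_xx - 2.0 * k_xy + k_yy, 0.0))

# Hypothetical read positions within one peak
k = lambda a, b: rbf_kernel(a, b, sigma=50.0)
same = mmd([100, 120, 140], [100, 120, 140], k)
shifted = mmd([100, 120, 140], [300, 320, 340], k)
```

Identical samples give an estimate of zero, while shifting one sample away from the other increases it. The cost is quadratic in the number of reads per peak.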
MMDiff: how we use it
MMD values are computed for each peak independently
Every time, we compare two sets of observations:
e.g. WT vs Null, WT vs Resc etc.
Each read mapping to a given peak is considered an observation
The feature we use is the 5’ end of the alignment
We use RBF kernels to capture neighbourhood information
The kernel width is chosen to be the median distance between
all observations
Empirical p-values are determined on peaks with similar total
counts
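The kernel width rule can be sketched as follows. The sketch assumes that "all observations" means the reads of both data sets pooled for one peak and that distances are taken over distinct pairs; both readings, and the name `median_heuristic_width`, are assumptions rather than details confirmed by the slides.

```python
import statistics

def median_heuristic_width(reads_a, reads_b):
    """Median pairwise distance between pooled 5' read positions."""
    pooled = list(reads_a) + list(reads_b)
    dists = [abs(pooled[i] - pooled[j])
             for i in range(len(pooled))
             for j in range(i + 1, len(pooled))]
    return statistics.median(dists)

# Hypothetical 5' end positions for two conditions at one peak
width = median_heuristic_width([0, 10], [20])
```

Tying the width to the data in this way means no kernel parameter has to be tuned by hand for each peak.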
Experiments on ENCODE data
Compare ChIP-Seq marks across different cell types
Studied two different marks: broad histone mark H3K27ac and
transcription factor CTCF binding
Cell types: human K562 (leukaemia) vs GM12878 for H3K27ac,
mouse brain cortex, cerebellum and liver for CTCF
ENCODE results
[Figure: normalized counts. Panels: H3K27ac, K562; H3K27ac, GM12878; CTCF, Cortex; CTCF, Cerebellum; CTCF, Liver. Regions shown: chr12:56318019−56327372 and chr7:107832815−107834854]
Both regions are called by MMDiff but not by DESeq. DESeq does not find any
differences between cortex and cerebellum.
Main application: H3K4me3 upon deletion of Cfp1
Example Peaks
[Figure panels: AB.1; AB.2]
We use WT and Resc as replicates and treat AB1 and AB2 separately.
Reproducibility
MMDiff gives reproducible results (as much as ChIP-Seq does)
Sequence enrichment
Indirect binding via E2F transcription factors?
Wilson, Molecular Cell, 2007
Tyagi, Molecular Cell, 2007
Mechanistic traces in big data?
If TF binding determined the histone mark signal, it should be
possible to predict histone modifications from TF binding
At a simpler level, it should be possible to predict the presence/
absence of marks from TF ChIP-Seq data
This does NOT provide a mechanistic proof; rather it is a
necessary but not sufficient condition
Isolated examples of interactions between TFs and histone
modifiers are known
Testing the hypothesis: data
We interrogated the ENCODE data sets in the three Tier I cell
lines (GM12878, K562, H1 hESC)
Outputs: five histone modifications (H3K4me1, H3K4me3,
H3K9ac, H3K27ac, H3K27me3) found near transcription start
sites. Genomic regions are defined as positive if they intersect
with a histone peak
Inputs: normalised read counts for all TFs with ChIP-Seq data in
the Tier I cell lines
Prediction method: logistic regression, a probabilistic predictor
which computes the relative importance of input features as a
weight vector
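To make the prediction setup concrete, here is a small logistic regression fitted by batch gradient descent in plain Python. The data are synthetic stand-ins for normalised TF read counts (one informative feature, one noise feature); the learning rate, iteration count and function names are assumptions, and nothing here reproduces the ENCODE analysis.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.1, n_iter=500):
    """Batch gradient descent on the mean logistic loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Synthetic data (hypothetical, not ENCODE): feature 0 tracks the label,
# feature 1 is pure noise
rng = random.Random(0)
X, y = [], []
for _ in range(200):
    label = rng.random() < 0.5
    X.append([rng.gauss(2.0 if label else 0.0, 1.0), rng.gauss(0.0, 1.0)])
    y.append(1 if label else 0)

weights, bias = fit_logistic(X, y)
```

The fitted weight vector is what gives the method its interpretability: the informative feature receives a large weight and the noise feature a small one.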
TFs can predict very accurately
ROC curves for predictions of histone modifications at promoters in
H1 cells (left). TF-based predictions vs sequence based predictions in
H1 cells (right)
TFs can predict genome wide
Table: Predictions of histone modification presence in H1 cells
Mark       Seq. (promoters)   TF promoters      DNase             Enhancers F5
H3K4me1    N.D.               N.D.              0.854 (± 0.001)   0.842 (± 0.003)
H3K4me3    0.918 (± 0.001)    0.950 (± 0.001)   0.974 (± 0.001)   0.962 (± 0.001)
H3K9ac     0.867 (± 0.001)    0.921 (± 0.001)   0.976 (± 0.001)   0.961 (± 0.001)
H3K27ac    0.828 (± 0.002)    0.909 (± 0.001)   0.968 (± 0.001)   0.950 (± 0.001)
H3K27me3   0.808 (± 0.002)    0.877 (± 0.002)   0.916 (± 0.001)   0.918 (± 0.002)
Methylation Data
Bisulfite conversion: unmethylated Cytosine to Uracil
NGS, conversion aware alignment
RRBS: focus on CpG-rich regions
A look at the data
Data exhibits strong spatial correlations conserved across replicates
Choice of Kernel
Each mapped cytosine is an individual data
point: $x_j = (C_j, \mathrm{Meth}_j)$
Composite kernel
$k_{full}(x_i, x_j) = k_{RBF}(x_i, x_j)\, k_{STR}(x_i, x_j)$
$k_{RBF}(x_i, x_j) = \exp[-(C_i - C_j)^2 / 2\sigma^2]$
$k_{STR}(x_i, x_j) = 1$ if $\mathrm{Meth}_i = \mathrm{Meth}_j$, 0 else
σ is modelled from the data as $\sigma^2 = \bar{x}^2/2$ where $\bar{x}$ is the median
observed distance in the region.
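The composite kernel translates almost line by line into code. In this sketch a data point is a `(position, methylation_state)` tuple, and the "median observed distance" is taken to be the median pairwise distance between the cytosine positions passed in; that reading and the function names are assumptions.

```python
import math
import statistics

def composite_kernel(xi, xj, sigma_sq):
    """k_full = k_RBF * k_STR for data points (position, methylation state)."""
    (ci, mi), (cj, mj) = xi, xj
    k_rbf = math.exp(-((ci - cj) ** 2) / (2.0 * sigma_sq))
    # String kernel: 1 when the methylation states agree, 0 otherwise
    k_str = 1.0 if mi == mj else 0.0
    return k_rbf * k_str

def sigma_sq_from_region(positions):
    """sigma^2 = (median distance)^2 / 2, using pairwise distances (assumption)."""
    dists = [abs(a - b) for i, a in enumerate(positions) for b in positions[i + 1:]]
    return statistics.median(dists) ** 2 / 2.0

# Hypothetical region with three mapped cytosines
sigma_sq = sigma_sq_from_region([0, 2, 4])
k_same = composite_kernel((10, 1), (10, 1), sigma_sq)
k_diff = composite_kernel((10, 1), (10, 0), sigma_sq)
```

Because the string kernel is 0 or 1, two observations contribute to the similarity only when they agree in methylation state, with the contribution decaying with genomic distance.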
Handling Coverage
The MMD tests whether samples are drawn from the same
distribution.
The frequency with which data is drawn (the coverage) is independent
of the methylation profile.
We adapt the method by subtracting an appropriate 'coverage
only' metric: the MMD with an RBF kernel on genomic location only
(no methylation considered)
Test-statistic
$M^3D$ test statistic
$M^3D[X, Y] = \mathrm{MMD}[X, Y, k_{full}] - \mathrm{MMD}[X, Y, k_{RBF}]$
The test statistic over all replicate pairs forms our testing
distribution
For a given region, the mean of the inter-group comparisons is
tested against this distribution
This gives the empirical probability of finding the cross-group
difference in methylation profiles among the replicates
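The statistic and the empirical test described above can be sketched as follows. The MMD is the standard biased estimator; the empirical probability is taken as the fraction of replicate-pair statistics at least as large as the observed cross-group value. The tie handling, the toy data and all function names are assumptions.

```python
import math

def mmd(xs, ys, kernel):
    """Biased finite-sample MMD estimate between samples xs and ys."""
    n, m = len(xs), len(ys)
    k_xx = sum(kernel(a, b) for a in xs for b in xs) / (n * n)
    k_xy = sum(kernel(a, b) for a in xs for b in ys) / (n * m)
    k_yy = sum(kernel(a, b) for a in ys for b in ys) / (m * m)
    return math.sqrt(max(k_xx - 2.0 * k_xy + k_yy, 0.0))

def m3d(xs, ys, sigma_sq):
    """MMD with the full kernel minus the coverage-only (RBF) MMD."""
    def k_rbf(a, b):
        return math.exp(-((a[0] - b[0]) ** 2) / (2.0 * sigma_sq))

    def k_full(a, b):
        return k_rbf(a, b) * (1.0 if a[1] == b[1] else 0.0)

    return mmd(xs, ys, k_full) - mmd(xs, ys, k_rbf)

def empirical_p_value(observed, replicate_values):
    """Fraction of replicate-pair statistics at least as large as observed."""
    return sum(v >= observed for v in replicate_values) / len(replicate_values)

# Hypothetical data points (position, methylation state) with equal coverage
methylated = [(100, 1), (110, 1)]
unmethylated = [(100, 0), (110, 0)]
no_change = m3d(methylated, methylated, sigma_sq=50.0)
change = m3d(methylated, unmethylated, sigma_sq=50.0)
```

When two samples cover the same positions, the coverage-only term is zero, so any remaining difference comes from the methylation states.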
$M^3D$ produces nice histograms
$M^3D$ statistic between replicates (left) and between different
conditions (K562 vs H1 cells).
$M^3D$ is robust to low replication / coverage
$M^3D$ test results are robust to low coverage (left) and low replication
(right).
Conclusions
MMD-based statistics enable more powerful tests than currently
used approaches
MMDiff is complementary to count-based methods: changes
that only alter counts (keeping shape fixed) cannot be captured
MMD is potentially of use in other scenarios where distributions
arise naturally, e.g. methylation or metagenomics
Machine learning can help extract patterns from high-throughput
epigenomic data which may suggest biological functions
Thanks
School of Informatics
Gabriele Schweikert
Dan Benveniste
Tom Mayo
Wellcome Trust Centre for Cell Biology
Adrian Bird
IGMM
Duncan Sproul
Hans-Joachim Sonntag
Funding: EU FP7 Marie Curie Actions, ERC, EPSRC.
References
G. Schweikert, B. Cseke, T. Clouaire, A. Bird and G.S., MMDiff:
quantitative testing for shape changes in ChIP-Seq data sets,
BMC Genomics 14:826, 2013
MMDiff bioconductor package
http://www.bioconductor.org/packages/release/bioc/html/MMDiff.html
D. Benveniste, H.-J. Sonntag, G.S. and D. Sproul, Transcription
factor binding predicts histone modifications in human cell lines,
PNAS 111(37), 13367-13372, 2014
T. Mayo, G. Schweikert and G.S., $M^3D$: a kernel-based test for
spatially correlated changes in methylation profiles,
Bioinformatics 31(6), 809-816, 2015
M3D bioconductor package
http://www.bioconductor.org/packages/devel/bioc/html/M3D.html