Statistics methods in bioinformatics
Yiannis Kourmpetis
Eric Boer
29 Jan 2007
Biometris
quantitative methods in life and earth sciences
The analysis of One DNA Sequence
5.2 – Modeling DNA (Eric)
5.3 – Modeling signals in DNA (Eric)
5.4 – Long Repeats (Eric)
5.6 – The analysis of patterns
• 5.7 – Overlaps counted (Yiannis)
• 5.8 – Overlaps not counted (Yiannis)
5.9 – Motifs (Yiannis)
5.1 – Shotgun sequencing (Eric if time left)
Modeling DNA
Hypothesis testing (Paragraph 3.4)
Null hypothesis: p = 0.25
Alternative hypothesis: p > 0.25
Y = number of matches
If null hypothesis is correct, Y ~ Binomial(26, 0.25)
Modeling DNA
Hypothesis testing (Paragraph 3.4)
Calculating the P-value:
11 matches, so Y = 11; P-value = P(Y ≥ 11) = 0.04
Compare 0.04 with the chosen Type I error (significance level) of the test:
• Type I error = 0.05: 0.04 < 0.05, so the null hypothesis is rejected
• Type I error = 0.01: 0.04 > 0.01, so the null hypothesis is not rejected
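The P-value on this slide can be checked with a few lines of Python. This is a minimal sketch that sums the exact upper tail of the Binomial(26, 0.25) distribution; the function name `binomial_upper_tail` is introduced here for illustration only.

```python
from math import comb


def binomial_upper_tail(n, p, y):
    """Return P(Y >= y) for Y ~ Binomial(n, p) by summing the exact pmf."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(y, n + 1))


# Values from the slide: 26 sites, p = 0.25 under H0, 11 observed matches.
p_value = binomial_upper_tail(26, 0.25, 11)
print(f"P(Y >= 11) = {p_value:.2f}")

# Compare with the two Type I error levels mentioned on the slide.
for alpha in (0.05, 0.01):
    decision = "reject H0" if p_value < alpha else "do not reject H0"
    print(f"Type I error {alpha}: {decision}")
```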
Modeling DNA
DNA: long sequence of nucleotides
Test for independence (Table 5.4)
Modeling DNA
H0: no association between the nucleotide at the present site and the nucleotide at the preceding site
Ha: there is an association between the nucleotide at the present site and the nucleotide at the preceding site
See paragraph 3.5.5; Chi-square test
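A sketch of how such a test could be computed: count the (preceding, present) nucleotide pairs along a sequence, build the 4 × 4 table, and compute the Pearson chi-square statistic. The sequence below is a made-up example, not the data of Table 5.4, and the function names are hypothetical. With all four nucleotides present in both margins, the statistic has 9 degrees of freedom, for which the 5% critical value is about 16.92.

```python
from collections import Counter

NUCLEOTIDES = "acgt"


def transition_counts(seq):
    """Count (preceding, present) nucleotide pairs along the sequence."""
    return Counter(zip(seq, seq[1:]))


def chi_square_statistic(counts):
    """Pearson chi-square statistic for the 4 x 4 table of pair counts."""
    total = sum(counts.values())
    row = {x: sum(counts[(x, y)] for y in NUCLEOTIDES) for x in NUCLEOTIDES}
    col = {y: sum(counts[(x, y)] for x in NUCLEOTIDES) for y in NUCLEOTIDES}
    stat = 0.0
    for x in NUCLEOTIDES:
        for y in NUCLEOTIDES:
            expected = row[x] * col[y] / total
            if expected > 0:  # skip cells in empty rows or columns
                stat += (counts[(x, y)] - expected) ** 2 / expected
    return stat


# Hypothetical, strongly periodic sequence (illustration only).
example = "acgt" * 50
statistic = chi_square_statistic(transition_counts(example))
print(f"chi-square = {statistic:.1f}; reject H0 at 5% if above 16.92 (9 df)")
```

For real data the expected counts should not be too small; a short sequence makes the chi-square approximation unreliable.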
Modeling DNA
Chi-square test
Modeling signals in DNA
Signal: short sequence of DNA having a specific
purpose
Signals also occur in nonfunctional DNA
Assumption: all signals have the same length n
Modeling signals in DNA
Signal probabilities; Weight matrices: independence
Signal length = 5
Prob(s | M)
For example:
s = aacct
Prob(s | M) = 0.33 ×…
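Under the independence assumption, Prob(s | M) is the product of one matrix entry per signal position. The sketch below illustrates this for a signal of length 5. The matrix values are hypothetical placeholders chosen for the example and are not the values from the slide; the names `HYPOTHETICAL_MATRIX` and `signal_probability` are likewise introduced here.

```python
# Hypothetical weight matrix M: one probability distribution per position.
# These numbers are illustrative only; each row sums to 1.
HYPOTHETICAL_MATRIX = [
    {"a": 0.4, "c": 0.2, "g": 0.2, "t": 0.2},  # position 1
    {"a": 0.5, "c": 0.1, "g": 0.3, "t": 0.1},  # position 2
    {"a": 0.1, "c": 0.6, "g": 0.2, "t": 0.1},  # position 3
    {"a": 0.2, "c": 0.5, "g": 0.2, "t": 0.1},  # position 4
    {"a": 0.1, "c": 0.1, "g": 0.1, "t": 0.7},  # position 5
]


def signal_probability(s, matrix):
    """Prob(s | M) when the positions of the signal are independent."""
    if len(s) != len(matrix):
        raise ValueError("signal length must equal the number of matrix rows")
    prob = 1.0
    for position_distribution, nucleotide in zip(matrix, s):
        prob *= position_distribution[nucleotide]
    return prob


print(signal_probability("aacct", HYPOTHETICAL_MATRIX))
```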
Modeling signals in DNA
Weight matrices: test of independence
Similar to Table 5.4: a table of the nucleotide at position 1 in the signal against the nucleotide at position 2 in the signal
Perform the test for every combination of positions
Modeling signals in DNA
Markov dependencies:
If the nucleotides at the sites in a signal are not independent, the signal can be modelled by first- or higher-order Markov dependencies (chapter 4).
With higher-order dependencies, it is important to know which dependencies are the most informative in our model:
Maximal Dependence decomposition
Modeling signals in DNA
Maximal Dependence decomposition
[Table with labels Pos. 1 and Pos. 2; contents not recoverable]
Modeling signals in DNA
Maximal Dependence decomposition
The row with the highest sum indicates the position with the greatest influence.
Modeling signals in DNA
Position 4 has the greatest influence:
• The model M then consists of the distributions:
– $\{p_a, p_g, p_c, p_t\}$
– $\{M_a, M_g, M_c, M_t\}$
If nucleotide x occurs in position 4, matrix $M_x$ is used to assign the probabilities $p_k$, k = 1, 2, 3, 5.
$\mathrm{Prob}(s \mid M) = p_x \times p_1 \times p_2 \times p_3 \times p_5$
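The product above can be sketched in code as follows: look up the nucleotide x at position 4, take its probability, and then multiply by the entries of the matrix selected by x for positions 1, 2, 3 and 5. All numbers and names below (`P_POS4`, `MATRICES`, `mdd_probability`) are hypothetical and serve only to show the mechanics.

```python
UNIFORM = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}

# Hypothetical distribution of the nucleotide at position 4.
P_POS4 = {"a": 0.1, "c": 0.5, "g": 0.2, "t": 0.2}

# Hypothetical matrices M_x: for each x, one distribution for each of the
# positions 1, 2, 3 and 5. Only M_c is non-uniform in this illustration.
MATRICES = {x: [UNIFORM] * 4 for x in "acgt"}
MATRICES["c"] = [
    {"a": 0.4, "c": 0.2, "g": 0.2, "t": 0.2},  # position 1
    {"a": 0.5, "c": 0.1, "g": 0.3, "t": 0.1},  # position 2
    {"a": 0.1, "c": 0.6, "g": 0.2, "t": 0.1},  # position 3
    {"a": 0.1, "c": 0.1, "g": 0.1, "t": 0.7},  # position 5
]


def mdd_probability(s, p_pos4, matrices):
    """Prob(s | M) for a length-5 signal when position 4 drives the others."""
    x = s[3]                   # nucleotide at position 4 (index 3)
    prob = p_pos4[x]           # p_x
    others = s[:3] + s[4]      # nucleotides at positions 1, 2, 3 and 5
    for distribution, nucleotide in zip(matrices[x], others):
        prob *= distribution[nucleotide]
    return prob


print(mdd_probability("aacct", P_POS4, MATRICES))
```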
Long Repeats
Repeats of a nucleotide: for example cgtataaaaaagg
Test of randomness
Suppose:
» N is the length of the DNA sequence
» P(a) = p
» An occurrence of a is called a "success"
» An occurrence of any other nucleotide at any site is called a "failure"
» Apply the geometric distribution
Long Repeats
Apply the geometric distribution:
Sequence of successes (for example nucleotide a)
$\mathrm{Prob}(Y = y) = (1-p)\,p^{y}$, y = 0, 1, 2, …
y = 0, for example (a)c: Prob = $1-p$
y = 1, for example (a)ag: Prob = $(1-p)\,p$
y = 2, for example (a)aag: Prob = $(1-p)\,p^{2}$
Long Repeats
Suppose that we have independent random variables $Y_1, \ldots, Y_n$.
In general: $Y_{\max}$ is the maximum of n discrete independent variables with a common cumulative distribution function $F_Y(y)$:
$\mathrm{Prob}(Y_{\max} \ge y) = 1 - (F_Y(y-1))^{n}$
For the geometric distribution, $F_Y(y-1) = 1 - p^{y}$, so:
$\mathrm{Prob}(Y_{\max} \ge y) = 1 - (1 - p^{y})^{n}$
Long Repeats
For the geometric distribution:
$\mathrm{Prob}(Y_{\max} \ge y) = 1 - (1 - p^{y})^{n}$   (5.13)
The P-value of any observed value $y_{\max}$ can be calculated with (5.13), but which number should n be (the number of sequences of successes of length 0 or more)?
Any sequence of successes must be preceded by a failure.
Under H0 (randomness) there are about $(1-p) \times N$ failures, so approximately $n = (1-p)N$.
Long Repeats
So, the P-value:
Suppose N = 100,000, p = 0.25, and the observed length of the longest repeated sequence ($y_{\max}$) is 10.
P-value $\approx 1 - (1 - p^{y_{\max}})^{(1-p)N}$
The P-value is equal to 0.0690; H0 is not rejected.
Suppose $y_{\max} = 12$: the P-value is 0.0045; H0 is rejected.
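The two P-values on this slide can be reproduced with a short calculation. This is a minimal sketch of formula (5.13) with n approximated by (1-p)N; the function name `longest_repeat_p_value` is introduced here, and `log1p`/`expm1` are used only to keep the arithmetic accurate when p^ymax is very small.

```python
from math import expm1, log1p


def longest_repeat_p_value(y_max, p, N):
    """Approximate P-value 1 - (1 - p**y_max)**((1 - p) * N) from (5.13)."""
    n = (1 - p) * N  # approximate number of failures, each starting a run
    # (1 - q)**n = exp(n * log(1 - q)); expm1 and log1p avoid rounding loss.
    return -expm1(n * log1p(-(p**y_max)))


# Values from the slide: N = 100,000 and p = 0.25.
for y_max in (10, 12):
    print(f"y_max = {y_max}: P-value = {longest_repeat_p_value(y_max, 0.25, 100_000):.4f}")
```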
Approximation possible:
Long repeats
Parameters for normal distribution
Euler’s constant ≈ 0.5772
Long repeats
Now, looking at runs of any nucleotide.
p = 0.25 (the probability that any given nucleotide arises at any site)
Mean and variance of the longest repeated sequence:
The mean is 1 higher than on the previous slide
The variance is the same
Long repeats
Suppose the probability (p*) of one nucleotide is larger than that of the others:
When n is large, intuition tells us that the longest sequence will be of the nucleotide with the largest probability
r = 2 (two nucleotides share the largest probability), for example $p_a = 0.3$, $p_c = 0.3$, $p_g = 0.2$, $p_t = 0.2$
Shotgun sequencing
Contigs
N = 17 fragments, 7 contigs, overlapping fragments
N fragments of length L; the full length of the DNA is G (G is chosen large compared to L)
Coverage a = N L / G
Fragments are placed at random on G; the left-hand ends are uniformly distributed over [0, G]
Shotgun sequencing
Binomial and Poisson distribution:
The probability that a left-hand end is in the interval (x, x+h) is equal to h/G.
Number of fragments with their left-hand end in this interval ~ Bin(N, h/G)
When N is large and h is small, the Poisson distribution can be used as an approximation (N > 20 and p = h/G < 0.05) with mean (intensity) equal to Nh/G
The number Y of fragments whose left-hand end is located in an interval of length L to the left of a randomly chosen point is Poisson distributed with parameter NL/G = a
Probability that at least one fragment arises in this interval:
• $1 - \mathrm{Prob}(Y = 0) = 1 - e^{-a}$
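A small numerical check of the Poisson approximation: the exact binomial probability that at least one of the N left-hand ends falls in an interval of length L is compared with 1 - e^(-a). The values of N, L and G below are hypothetical, chosen only so that N is large and L/G is small; the function names are introduced for this sketch.

```python
from math import exp


def coverage(N, L, G):
    """Coverage a = N L / G."""
    return N * L / G


def prob_covered_poisson(a):
    """Poisson approximation: 1 - Prob(Y = 0) = 1 - exp(-a)."""
    return 1 - exp(-a)


def prob_covered_binomial(N, L, G):
    """Exact binomial value: 1 - (1 - L/G)**N."""
    return 1 - (1 - L / G) ** N


# Hypothetical project: 1000 fragments of length 500 on a genome of 100,000.
N, L, G = 1000, 500, 100_000
a = coverage(N, L, G)
print(f"a = {a}")
print(f"Poisson:  {prob_covered_poisson(a):.6f}")
print(f"Binomial: {prob_covered_binomial(N, L, G):.6f}")
```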
Shotgun sequencing
Mean proportion of the genome covered by contigs:
This is equal to $1 - e^{-a}$
Shotgun sequencing
Mean number of contigs:
Each contig has a unique rightmost fragment
N fragments
The probability that no other fragment has its left-hand end point on the fragment in question is $e^{-a}$
Mean number of contigs = $N e^{-a}$
N small: small number of contigs
N large: small number of long contigs
Maximum at a = 1, 1X coverage
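The location of the maximum can be checked numerically. The sketch below evaluates the mean number of contigs N e^(-a), with a = NL/G, over a range of N for hypothetical values of L and G, and reports the N at which it is largest; with these values a = 1 corresponds to N = G/L = 200. The function name `mean_contigs` is introduced for this illustration.

```python
from math import exp


def mean_contigs(N, L, G):
    """Mean number of contigs N * exp(-a) with coverage a = N L / G."""
    return N * exp(-N * L / G)


# Hypothetical fragment length and genome length (illustration only).
L, G = 500, 100_000

# Scan N and find where the mean number of contigs peaks.
best_N = max(range(1, 2001), key=lambda N: mean_contigs(N, L, G))
print(f"maximum at N = {best_N}, coverage a = {best_N * L / G}")
print(f"mean number of contigs there: {mean_contigs(best_N, L, G):.2f}")
```

Both very small and very large N give few contigs: in the first case there are few fragments, in the second the fragments merge into a few long contigs.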
Shotgun sequencing
Mean contig size: