JHU 580.429 SB3
HW7: DNA information content
1. Assume that the length of the human genome is 3 × 109 base pairs, and that each of the 4
base pairs occurs with probability 1/4.
(a) How long in base pairs does a motif have to be to occur approximately once per
genome? A fractional result is fine.
n = number of chance occurrences of a motif per genome = 1
G = genome size in base pairs = 3 × 10^9
p = probability of observing the motif of interest among all motifs of the same length = ?
L = motif length

n = G × p
p = 1/4^L (assuming equal nucleotide probability)

Solve:

1/4^L = 1/(3 × 10^9)
L = log_4(3 × 10^9) = (1/2) log_2(3 × 10^9) = 15.74

Motifs of length 15 and 16 will have an average occurrence of 2.79 and 0.70 per genome, respectively.
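The arithmetic above can be checked with a short snippet; a minimal sketch, using only the genome size from the problem statement:

```python
import math

G = 3e9  # genome size in base pairs

# One expected occurrence per genome: 1/4**L = 1/G  =>  L = log_4(G) = 0.5 * log_2(G)
L = 0.5 * math.log2(G)
print(round(L, 2))  # 15.74

# Expected occurrences per genome for the two nearest integer lengths
for length in (15, 16):
    print(length, round(G / 4**length, 2))
```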
(b) Suppose a motif occurs once on average in the genome. You can model this as a binomial distribution with 3 × 109 attempts and a success rate of p per attempt. What is p?
Modeling this process as a binomial distribution assumes that we have 3 × 10^9 attempts, i.e., 3 × 10^9 observations of motifs of length L. The parameter p is the probability of success in each observation, which is the probability of a random motif of length L being equal to our motif of interest. Therefore,

p = 1/4^L
(c) The binomial distribution reduces to a Poisson distribution. From the Poisson distribution for a motif with λ occurrences on average per genome, what is the probability of
exactly k occurrences? This question is really just asking you to write down the Poisson
distribution.
n = number of occurrences of a motif per genome
n ∼ Poisson(λ)

Pr(n = k) = e^(−λ) λ^k / k!

Version: 2015/10/27 09:31:08
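As a quick sketch of the Poisson model, the pmf with λ = 1 (one expected occurrence per genome) gives the chance of seeing exactly k copies of the motif:

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k occurrences given a mean of lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# With lambda = 1, P(0) = P(1) = 1/e
for k in range(4):
    print(k, round(poisson_pmf(k, 1.0), 4))
```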
2. Coding length and information content. The Shannon entropy for a discrete random variable with n states i ∈ {1, 2, . . . , n} is −∑_{i=1}^{n} p_i log2(p_i), where p_i is the probability of state i and the entropy is in bits.
(a) If all 4 base pairs are equally likely, what is the Shannon entropy in bits for a specific
position in the genome?
H = Shannon entropy per position

H = −∑_{i=1}^{n} p_i log2(p_i) = −4 × (1/4) log2(1/4) = 2 bits
Notice that this is also log2 4. The entropy of a uniform discrete random variable over an alphabet of size n is log2 n.
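A minimal numeric check of both claims (the helper function name is our own):

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.25] * 4))                             # 2.0 bits for 4 equally likely bases
print(math.isclose(entropy_bits([1/6] * 6), math.log2(6)))  # True: log2(n) for a uniform alphabet
```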
(b) Suppose that x1 , x2 , . . . , xT are T total random variables that are independent and identically distributed. Each random variable considered on its own has entropy H. Prove
that the joint distribution has entropy T × H.
Under the independence assumption, we can write:

p(x1, x2, . . . , xT) = p(x1) × p(x2) × · · · × p(xT)

Thus, we can simplify the definition of entropy as:

H(x1, x2, . . . , xT) = ∑_{x1,...,xT ∈ X} −p(x1, . . . , xT) log2 p(x1, . . . , xT)
  = ∑_{x1,...,xT ∈ X} (−1) p(x1) · · · p(xT) log2(p(x1) × · · · × p(xT))
  = ∑_{x1,...,xT ∈ X} (−1) p(x1) · · · p(xT) (log2 p(x1) + · · · + log2 p(xT))
  = −[∑_{x2,...,xT ∈ X} p(x2) · · · p(xT)] [∑_{x1 ∈ X} p(x1) log2 p(x1)] − · · · − [∑_{x1,...,x_{T−1} ∈ X} p(x1) · · · p(x_{T−1})] [∑_{xT ∈ X} p(xT) log2 p(xT)]
  = H(x1) ∑_{x2,...,xT ∈ X} p(x2) · · · p(xT) + · · · + H(xT) ∑_{x1,...,x_{T−1} ∈ X} p(x1) · · · p(x_{T−1})
Also, for any subset of the random variables, we know that:

∑_{xi,...,xj ∈ X} p(xi) · · · p(xj) = [∑_{xi ∈ X} p(xi)] [∑_{x_{i+1} ∈ X} p(x_{i+1})] · · · [∑_{xj ∈ X} p(xj)] = 1
Therefore,

H(x1, x2, . . . , xT) = ∑_{i=1}^{T} H(x_i)

And, because the x_i are all identically distributed, H(x_i) = H for all i, and

H(x1, x2, . . . , xT) = T × H
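The additivity result can be verified numerically by building the joint distribution of T independent, identically distributed positions; a sketch with an arbitrary non-uniform base distribution:

```python
import itertools
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Any single-position distribution works; this one is deliberately non-uniform.
p = [0.4, 0.1, 0.1, 0.4]
T = 3

# Joint distribution of T iid positions: product of the marginals.
joint = [math.prod(combo) for combo in itertools.product(p, repeat=T)]

H = entropy_bits(p)
H_joint = entropy_bits(joint)
print(round(abs(H_joint - T * H), 9))  # 0.0: the joint entropy equals T * H
```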
(c) Suppose an alien form of life has 6 possible base pairs instead of 4. In fact, Steve
Benner is working on creating unnatural nucleotides that would increase the coding
capacity of the genome. What is the entropy per position in bits? What genome size
would have the same entropy as the human genome?
G_j = genome size in bp for organism j
H_j = per-position entropy in the genome of organism j
I_j = genome entropy of organism j

I_j = G_j × H_j
For the alien with 6 equally probable bases:

H_alien = −∑_{i=1}^{n} p_i log2(p_i) = −6 × (1/6) log2(1/6) = log2 6 = 2.585 bits
In order to have the same genome entropy as the human genome, we should have:

G_human × H_human = G_alien × H_alien

G_alien = genome size of the alien in bp = (G_human × H_human) / H_alien = (2 × 3 × 10^9) / 2.585 = 2,321,116,843 bp ≈ 2.3 × 10^9 bp
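The same calculation as a sketch, using only numbers given above:

```python
import math

G_human, H_human = 3e9, 2.0   # bp, bits per position (uniform 4-letter alphabet)
H_alien = math.log2(6)        # bits per position with 6 equally likely bases

# Equal total genome entropy: G_human * H_human = G_alien * H_alien
G_alien = G_human * H_human / H_alien
print(round(H_alien, 3))      # 2.585
print(round(G_alien))         # ~2.3e9 bp
```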
(d) The malaria parasite, Plasmodium falciparum, has a 23-megabase genome that is AT-rich: AT and TA base pairs have about 40% frequency, whereas CG and GC have 10% frequency. How many bits of information are encoded at each position? What genome size would have the same entropy if the 4 base pairs had equal frequency?
The subscript pf indicates Plasmodium falciparum.

H_pf = −∑_{i=1}^{n} p_i log2(p_i) = −2 × 0.4 × log2(0.4) − 2 × 0.1 × log2(0.1) = 1.722 bits
If the 4 bases had equal probability, the entropy per position would be:

H^eq_pf = −∑_{i=1}^{n} p_i log2(p_i) = −4 × (1/4) log2(1/4) = 2 bits
In that case, the same genome entropy can be encoded using:

(G_pf × H_pf) / H^eq_pf = (23 × 10^6 × 1.722) / 2 ≈ 19.8 megabases
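Both numbers for part (d) can be reproduced with a few lines; a sketch using the frequencies from the problem:

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

H_pf = entropy_bits([0.4, 0.4, 0.1, 0.1])  # AT, TA, CG, GC frequencies
G_pf = 23e6                                # genome size in bp

print(round(H_pf, 3))                      # 1.722 bits per position
print(round(G_pf * H_pf / 2.0 / 1e6, 1))   # 19.8 megabases at 2 bits/position
```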
3. Related problems.
(a) The human genome has about 20,000 protein-coding genes. One method used in the 1990s to analyze expressed genes was to sequence the 3′ terminus of a transcript immediately upstream of the poly-A tail, termed an expressed sequence tag (EST). Assuming that each nucleotide in a transcript is equally likely, how long must a tag be to occur once on average among the 3′ ends of 20,000 genes?
N_g = the number of genes in the human genome = 20,000
p = probability of occurrence of the EST per gene
n = the average number of occurrences of the EST

n = N_g × p

Assuming that all tags are of the same length L, the probability of a given tag being equal to the EST of a sample gene is equivalent to the probability of a random sequence of length L being equal to a given sequence of the same length. Therefore:

p = 1/4^L

and we want:

n = N_g × p = 20,000 × (1/4^L) ≈ 1

which results in L = log_4(20,000) = 7.14. This indicates that tags of length 7 will on average be observed in more than 1 gene, and tags of length 8 will on average appear in less than 1 gene.
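A minimal sketch of the tag-length calculation:

```python
import math

N_genes = 20_000

# One expected occurrence among the genes: N_genes / 4**L = 1  =>  L = log_4(N_genes)
L = math.log(N_genes, 4)
print(round(L, 2))  # 7.14

for length in (7, 8):
    print(length, round(N_genes / 4**length, 2))  # expected genes matching a tag
```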
(b) Suppose that the human population is 10 billion, and each gene has 10 equally likely
variants. How many genes must be examined to identify a person by giving a pattern
that occurs on average once among humans? What fraction of the genome is this?

If N is the number of genes and p is the probability of a specific variant, we want:

p^N × (# people) = 1
(1/10)^N × 10^10 = 1
N = 10

Out of the roughly 20,000 protein-coding genes, this is 10/20,000 = 5 × 10^−4, or 0.05% of the genes.
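A sketch of the same calculation (the gene count is carried over from part (a)):

```python
import math

population = 1e10  # 10 billion people
variants = 10      # equally likely variants per gene

# (1/variants)**N * population = 1  =>  N = log_10(population)
N = round(math.log(population, variants))
print(N)           # 10 genes
print(N / 20_000)  # 0.0005 of the ~20,000 protein-coding genes
```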
4. In class we showed that the probability distribution for a geometrically distributed random variable x is p_n = (1 − θ) θ^n for n ∈ {0, 1, 2, . . . , ∞}. Calculate p̃(s), ⟨n⟩, and ⟨n²⟩ − ⟨n⟩². For this question, the Laplace transform for a discrete domain is

L[p_n] = p̃(s) ≡ ∑_{n=0}^{∞} exp(−sn) p_n.

Hint: the result for p̃(s) is an infinite series that you should sum to get a compact closed-form expression. Since p_n is normalized, you can test your result by checking that p̃(s) = 1 when s = 0. Then you can calculate the moments as (−d/ds) ln p̃(s) and (−d/ds)² ln p̃(s).
p_n = θ^n (1 − θ)

We can find the generating function p̃(s) as

p̃(s) = ∑_{n=0}^{∞} p_n e^{−sn}
     = ∑_{n=0}^{∞} θ^n (1 − θ) e^{−sn}
     = (1 − θ) ∑_{n=0}^{∞} (θ e^{−s})^n
     = (1 − θ) / (1 − θ e^{−s})
In order to find ⟨x⟩ and ⟨x²⟩ − ⟨x⟩², we need to take the first and second derivatives of the log of the generating function with respect to s.

ln(p̃(s)) = ln(1 − θ) − ln(1 − θ e^{−s})
The mean of the random variable can be found using:

⟨x⟩ = −(d/ds) ln(p̃(s)) |_{s=0} = θ e^{−s} / (1 − θ e^{−s}) |_{s=0} = θ / (1 − θ)
The variance of the random variable can be found similarly by:

⟨x²⟩ − ⟨x⟩² = (d²/ds²) ln(p̃(s)) |_{s=0} = θ e^{−s} / (1 − θ e^{−s})² |_{s=0} = θ / (1 − θ)²
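All three closed-form results can be checked numerically by truncating the geometric series far into its tail; a sketch with an arbitrary θ in (0, 1):

```python
import math

theta = 0.3  # any value in (0, 1) works

# p_n = (1 - theta) * theta**n for n = 0, 1, 2, ...; truncate deep in the tail.
N = 2000
p = [(1 - theta) * theta**n for n in range(N)]

mean = sum(n * pn for n, pn in enumerate(p))
second = sum(n * n * pn for n, pn in enumerate(p))
var = second - mean**2

print(math.isclose(sum(p), 1.0))                  # True: p_n is normalized, p~(0) = 1
print(math.isclose(mean, theta / (1 - theta)))    # True: matches theta / (1 - theta)
print(math.isclose(var, theta / (1 - theta)**2))  # True: matches theta / (1 - theta)^2
```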