JHU 580.429 SB3 HW7: DNA information content
Version: 2015/10/27 09:31:08

1. Assume that the length of the human genome is $3 \times 10^9$ base pairs, and that each of the 4 base pairs occurs with probability 1/4.

(a) How long in base pairs does a motif have to be to occur approximately once per genome? A fractional result is fine.

Define
$n$ = number of chance occurrences of a motif per genome = 1
$G$ = genome size in base pairs = $3 \times 10^9$
$p$ = probability of observing the motif of interest among all motifs of the same length
$L$ = motif length

Then $n = G \times p$ with $p = 1/4^L$ (assuming equal nucleotide probabilities). Solving for $n = 1$:
\[ p = \frac{1}{3 \times 10^9}, \qquad L = \log_4 (3 \times 10^9) = \frac{1}{2}\log_2 (3 \times 10^9) = 15.74. \]
Motifs of length 15 and 16 will have an average occurrence of 2.79 and 0.70 per genome, respectively.

(b) Suppose a motif occurs once on average in the genome. You can model this as a binomial distribution with $3 \times 10^9$ attempts and a success rate of $p$ per attempt. What is $p$?

Modeling this process as a binomial distribution assumes that we have $3 \times 10^9$ attempts, i.e. $3 \times 10^9$ observations of motifs of length $L$. The parameter $p$ is the probability of success in each observation, which is the probability that a random motif of length $L$ equals our motif of interest. Therefore
\[ p = \frac{1}{4^L}. \]

(c) The binomial distribution reduces to a Poisson distribution. From the Poisson distribution for a motif with $\lambda$ occurrences on average per genome, what is the probability of exactly $k$ occurrences?

This question is really just asking you to write down the Poisson distribution. With $n$ the number of occurrences of a motif per genome, $n \sim \mathrm{Poisson}(\lambda)$, so
\[ \Pr(n = k) = \frac{e^{-\lambda} \lambda^k}{k!}. \]

2. Coding length and information content. The Shannon entropy for a discrete random variable with $n$ states $i \in \{1, 2, \ldots, n\}$ is $-\sum_{i=1}^{n} p_i \log_2(p_i)$, where $p_i$ is the probability of state $i$ and the entropy is in bits.

(a) If all 4 base pairs are equally likely, what is the Shannon entropy in bits for a specific position in the genome?

With $H$ the Shannon entropy per position,
\[ H = -\sum_{i=1}^{n} p_i \log_2(p_i) = -4 \times \frac{1}{4} \log_2 \frac{1}{4} = 2 \text{ bits}. \]
Notice that this is also $\log_2 4$: the entropy of a uniform discrete random variable over an alphabet of size $n$ is $\log_2 n$.

(b) Suppose that $x_1, x_2, \ldots, x_T$ are $T$ total random variables that are independent and identically distributed. Each random variable considered on its own has entropy $H$. Prove that the joint distribution has entropy $T \times H$.

Under the independence assumption we can write $p(x_1, x_2, \ldots, x_T) = p(x_1) \times p(x_2) \times \cdots \times p(x_T)$. Thus the joint entropy simplifies:
\[
H(x_1, \ldots, x_T)
= -\sum_{x_1, \ldots, x_T \in X} p(x_1) \cdots p(x_T) \log_2\big(p(x_1) \cdots p(x_T)\big)
= -\sum_{x_1, \ldots, x_T \in X} p(x_1) \cdots p(x_T) \big(\log_2 p(x_1) + \cdots + \log_2 p(x_T)\big).
\]
Grouping the terms for each $\log_2 p(x_i)$ and summing the remaining variables out,
\[
H(x_1, \ldots, x_T) = \sum_{i=1}^{T} \Big( -\sum_{x_i \in X} p(x_i) \log_2 p(x_i) \Big) \prod_{j \neq i} \sum_{x_j \in X} p(x_j).
\]
Each marginal is normalized, $\sum_{x_j \in X} p(x_j) = 1$, so every product over $j \neq i$ equals 1 and
\[ H(x_1, \ldots, x_T) = \sum_{i=1}^{T} H(x_i). \]
Because the $x_i$ are identically distributed, $H(x_i) = H$ for all $i$, and $H(x_1, \ldots, x_T) = T \times H$.
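The arithmetic in problem 1 and the additivity result in 2(b) are easy to check numerically. The Python sketch below is illustrative only and not part of the original key; `poisson_pmf` and `entropy_bits` are hypothetical helper names, and the non-uniform test distribution is an arbitrary choice.

```python
import itertools
import math

G = 3e9  # human genome size in base pairs

# 1(a): motif length L with one expected occurrence, G / 4**L = 1
L = math.log(G, 4)                           # same as 0.5 * log2(3e9)
print(f"L = {L:.2f}")                        # -> L = 15.74
for length in (15, 16):
    print(length, round(G / 4**length, 2))   # -> 2.79 and 0.7 occurrences

# 1(c): Poisson probability of exactly k occurrences given mean lam
def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

print(round(poisson_pmf(0, 1.0), 3))         # once-per-genome motif absent: 0.368

# 2(b): the joint entropy of T i.i.d. variables is T times the marginal entropy
def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

base = [0.5, 0.25, 0.125, 0.125]             # arbitrary 4-letter distribution
joint = [math.prod(c) for c in itertools.product(base, repeat=3)]
print(entropy_bits(joint), 3 * entropy_bits(base))  # both 5.25
```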
(c) Suppose an alien form of life has 6 possible base pairs instead of 4. In fact, Steve Benner is working on creating unnatural nucleotides that would increase the coding capacity of the genome. What is the entropy per position in bits? What genome size would have the same entropy as the human genome?

Define $G_j$ = genome size in bp of organism $j$, $H_j$ = per-position entropy of organism $j$, and total genome entropy $I_j = G_j \times H_j$. For the alien with 6 equally probable bases,
\[ H_{\mathrm{alien}} = -\sum_{i=1}^{n} p_i \log_2(p_i) = -6 \times \frac{1}{6} \log_2 \frac{1}{6} = \log_2 6 = 2.585 \text{ bits}. \]
To have the same genome entropy as the human genome we need $G_{\mathrm{human}} \times H_{\mathrm{human}} = G_{\mathrm{alien}} \times H_{\mathrm{alien}}$, so
\[ G_{\mathrm{alien}} = \frac{G_{\mathrm{human}} \times H_{\mathrm{human}}}{H_{\mathrm{alien}}} = \frac{2 \times 3 \times 10^9}{2.585} = 2{,}321{,}116{,}843 \text{ bp} \approx 2.3 \times 10^9 \text{ bp}. \]

(d) The malaria parasite, Plasmodium falciparum, has a 23-megabase genome that is AT-rich: AT and TA base pairs have about 40% frequency, whereas CG and GC have 10% frequency. How many bits of information are encoded at each position? What genome size would have the same entropy if the 4 base pairs had equal frequency?

With the subscript $pf$ indicating Plasmodium falciparum,
\[ H_{pf} = -\sum_{i=1}^{n} p_i \log_2(p_i) = -2 \times 0.4 \log_2(0.4) - 2 \times 0.1 \log_2(0.1) = 1.722 \text{ bits}. \]
If the 4 bases had equal probability, the entropy per position would be
\[ H_{pf}^{\mathrm{eq}} = -4 \times \frac{1}{4} \log_2 \frac{1}{4} = 2 \text{ bits}. \]
In that case the same genome entropy could be encoded in
\[ \frac{G_{pf} \times H_{pf}}{H_{pf}^{\mathrm{eq}}} = \frac{23 \times 10^6 \times 1.722}{2} \approx 19.8 \text{ megabases}. \]

3. Related problems.

(a) The human genome has about 20,000 protein-coding genes. One method used in the 1990s to analyze expressed genes was to sequence the 3′ terminus of a transcript immediately upstream of the poly-A tail, termed an expressed sequence tag (EST). Assuming that each nucleotide in a transcript is equally likely, how long must a tag be to occur once on average among the 3′ ends of 20,000 genes?

Define $N_g$ = number of genes in the human genome = 20,000, $p$ = probability of occurrence of the EST per gene, and $n = N_g p$ = average number of occurrences of the EST. Assuming all tags have the same length $L$, the probability that a given tag matches the EST of a sample gene equals the probability that a random sequence of length $L$ matches a given sequence of the same length, so $p = 1/4^L$. We want
\[ n = N_g p = \frac{20{,}000}{4^L} \approx 1, \]
which gives $L = \log_4 20{,}000 = 7.14$. Tags of length 7 will on average be observed in more than one gene, and tags of length 8 will on average appear in less than one gene.

(b) Suppose that the human population is 10 billion, and each gene has 10 equally likely variants. How many genes must be examined to identify a person by giving a pattern that occurs on average once among humans? What fraction of the genome is this?

If $N$ is the number of genes examined and $p = 1/10$ is the probability of a specific variant, we want
\[ p^N \times (\text{number of people}) = \left(\frac{1}{10}\right)^N \times 10^{10} = 1, \]
so $N = 10$. This is $10 / 20{,}000 = 0.05\%$ of the protein-coding genes.
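The entropy bookkeeping in 2(c)–(d) and the counting arguments in problem 3 can be verified the same way. Again this is a sketch and not part of the key; `entropy_bits` is the same hypothetical helper as above.

```python
import math

def entropy_bits(probs):
    """Shannon entropy -sum(p * log2 p) of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# 2(c): per-position entropy of a 6-letter alphabet, and the equivalent genome size
H_human = entropy_bits([1/4] * 4)            # 2.0 bits
H_alien = entropy_bits([1/6] * 6)            # log2(6) = 2.585 bits
print(f"{3e9 * H_human / H_alien:.4g} bp")   # -> 2.321e+09 bp

# 2(d): AT-rich P. falciparum genome and its equal-frequency equivalent
H_pf = entropy_bits([0.4, 0.4, 0.1, 0.1])    # 1.722 bits
print(f"{23e6 * H_pf / 2.0 / 1e6:.1f} Mb")   # -> 19.8 Mb

# 3(a): EST tag length that occurs once on average among 20,000 genes
print(math.log(20_000, 4))                   # -> 7.14
print(20_000 / 4**7, 20_000 / 4**8)          # -> 1.22 and 0.31

# 3(b): genes needed so a variant pattern occurs once among 1e10 people
print(math.log10(1e10))                      # -> 10 genes, at 10 variants each
```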
4. In class we showed that the probability distribution for a geometrically distributed random variable is $p_n = (1 - \theta)\theta^n$ for $n \in \{0, 1, 2, \ldots\}$. Calculate $\tilde{p}(s)$, $\langle n \rangle$, and $\langle n^2 \rangle - \langle n \rangle^2$. For this question, the Laplace transform for a discrete domain is
\[ \mathcal{L}[p_n] = \tilde{p}(s) \equiv \sum_{n=0}^{\infty} \exp(-sn)\, p_n. \]
Hint: the result for $\tilde{p}(s)$ is an infinite series that you should sum to get a compact closed-form expression. Since $p_n$ is normalized, you can test your result by checking that $\tilde{p}(s) = 1$ when $s = 0$. Then you can calculate the moments as $(-d/ds) \ln \tilde{p}(s)$ and $(-d/ds)^2 \ln \tilde{p}(s)$, evaluated at $s = 0$.

With $p_n = \theta^n (1 - \theta)$, the generating function is a geometric series:
\[ \tilde{p}(s) = \sum_{n=0}^{\infty} p_n e^{-sn} = (1 - \theta) \sum_{n=0}^{\infty} (\theta e^{-s})^n = \frac{1 - \theta}{1 - \theta e^{-s}}. \]
At $s = 0$ this gives $\tilde{p}(0) = 1$, as normalization requires. To find $\langle n \rangle$ and $\langle n^2 \rangle - \langle n \rangle^2$, take the first and second derivatives of the log of the generating function with respect to $s$:
\[ \ln \tilde{p}(s) = \ln(1 - \theta) - \ln(1 - \theta e^{-s}). \]
The mean is
\[ \langle n \rangle = -\frac{d}{ds} \ln \tilde{p}(s) \Big|_{s=0} = \frac{\theta e^{-s}}{1 - \theta e^{-s}} \Big|_{s=0} = \frac{\theta}{1 - \theta}. \]
The variance follows similarly:
\[ \langle n^2 \rangle - \langle n \rangle^2 = \frac{d^2}{ds^2} \ln \tilde{p}(s) \Big|_{s=0} = \frac{\theta e^{-s}}{(1 - \theta e^{-s})^2} \Big|_{s=0} = \frac{\theta}{(1 - \theta)^2}. \]
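As a last sanity check on problem 4 (again, not part of the key), here is a short Monte Carlo sketch comparing sampled moments of the geometric distribution against the closed forms $\theta/(1-\theta)$ and $\theta/(1-\theta)^2$; the value $\theta = 0.6$ and the sample size are arbitrary choices.

```python
import random

theta = 0.6            # arbitrary test value of the geometric parameter
N = 200_000            # number of Monte Carlo samples

def sample_geometric(theta):
    """Draw n with P(n) = (1 - theta) * theta**n: count successes before the first failure."""
    n = 0
    while random.random() < theta:
        n += 1
    return n

samples = [sample_geometric(theta) for _ in range(N)]
mean = sum(samples) / N
var = sum((x - mean) ** 2 for x in samples) / N

print(mean, theta / (1 - theta))         # both approximately 1.5
print(var, theta / (1 - theta) ** 2)     # both approximately 3.75
```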