Comp. Genomics Recitation 3
The statistics of database searching

Substitution matrices
• Random model (R): x and y appear at random
• $P(x, y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j}$
• Match model (M): x and y are derived from a common ancestor
• $P(x, y \mid M) = \prod_i p_{x_i y_i}$
• The odds ratio: $\frac{P(x, y \mid M)}{P(x, y \mid R)}$
• Using the log of the odds ratio gives an additive scoring system

Exercise
The following substitution matrix is given:

       T    C    A    G
  T    1    0   -1   -1
  C    0    1   -1   -1
  A   -1   -1    1    0
  G   -1   -1    0    1

What is the average score per nucleotide pair? Assume each nucleotide appears with equal probability.

Solution
• $\sum_{i,j} \frac{1}{4} \cdot \frac{1}{4} \, s(i,j) = \frac{-4}{16} = -\frac{1}{4}$
• What happens if the average score is not negative?
• What is the chance that, in a pair of evolutionarily related sequences, T is replaced by C?

Solution
• $s(i,j) = \log_2 \frac{p_{ij}}{q_i q_j}$
• $s(T,C) = 0 \Rightarrow \log_2 \frac{p_{ij}}{q_i q_j} = 0 \Rightarrow \frac{p_{ij}}{q_i q_j} = 1 \Rightarrow p_{ij} = q_i q_j = \frac{1}{16}$
• Does the optimal alignment change if we multiply the matrix by a constant C?

How significant is my score?
• Create a mathematical model of the alignment of random sequences, and derive the score distribution analytically
• Use simulation to estimate the score distribution of the alignment of:
  • Generated sequences
  • Real sequences that are known to be non-homologous, or that are shuffled

Empirical score distribution
• The picture shows a distribution of scores from a real database search using BLAST.
• This distribution contains scores from non-homologous and homologous pairs.
• The high scores come from homology.

Empirical null score distribution
• This distribution is similar to the previous one, but generated using a randomized sequence database.

Statistical analysis
• What is a null hypothesis?
• An assumption that may be contradicted (but not validated) by the data.
• The purpose of most statistical tests is to determine whether the observed data can be explained by the null hypothesis.

Overview
• What is a p-value?
• The probability of observing an effect as strong as or stronger than the one you observed, given the null hypothesis. I.e., “How likely is this effect to occur by chance?”
• $\Pr(x \ge S \mid \text{null})$

Extreme value distribution
• What is the name of the distribution created by local alignment scores, and what does it look like?
• Extreme value distribution, or Gumbel distribution.
• It looks similar to a normal distribution, but it has a larger tail on the right.
• Ungapped local alignment maximum scores follow this distribution, and gapped alignment scores seem to follow it.

Extreme value distribution
• The expected number of optimal alignments with a score ≥ S is given by the formula $E = K m n e^{-\lambda S}$ (the E-value), where m and n are the sequence lengths, λ is a scaling parameter for the scoring system, and K is a scaling parameter for the search space (e.g. it accounts for overlaps).
• For ungapped local alignments, the parameters can be calculated directly from the substitution matrix scores and the lengths of the aligned sequences.

Exercise
• Assuming that the probability of seeing x optimal alignments with score ≥ S is given by the Poisson distribution $\frac{e^{-\mu} \mu^x}{x!}$, where μ is the mean, what is the p-value of the score S?

Solution
• The p-value is the probability of seeing a score ≥ S by chance.
• The mean is the E-value, $\mu = K m n e^{-\lambda S}$, so the probability of not seeing the score by chance (x = 0) is $e^{-K m n e^{-\lambda S}}$.
• The probability of seeing the score by chance is $1 - \exp(-K m n e^{-\lambda S})$.

What p-value is significant?
• The most common thresholds are 0.01 and 0.05.
• A threshold of 0.05 means that a result explained by the null hypothesis is wrongly called significant 5% of the time (informally, 95% confidence).
• Is 95% enough? It depends upon the cost associated with making a mistake.
• Examples of costs:
  • Doing expensive wet lab validation.
  • Making clinical treatment decisions.
  • Misleading the scientific community.
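The E-value and p-value formulas from the Poisson exercise can be sketched in a few lines of Python. This is only an illustration: the function names `e_value` and `p_value` and the values of K, λ, m and n are hypothetical, chosen to show the arithmetic, and are not taken from any real scoring system.

```python
import math


def e_value(K, lam, m, n, S):
    """Expected number of chance alignments scoring >= S: E = K*m*n*exp(-lam*S)."""
    return K * m * n * math.exp(-lam * S)


def p_value(E):
    """Probability of at least one chance alignment scoring >= S: 1 - exp(-E).

    math.expm1 is used because 1 - exp(-E) loses precision when E is tiny.
    """
    return -math.expm1(-E)


# Hypothetical parameters, for illustration only.
K, lam = 0.1, 0.3
m, n = 250, 1000

for S in (30, 50, 70):
    E = e_value(K, lam, m, n, S)
    print(f"S={S}: E={E:.3g}, p={p_value(E):.3g}")
```

For small E the two quantities nearly coincide, since 1 - exp(-E) ≈ E; they diverge only when E approaches or exceeds 1.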
Database searching
• A database contains many sequences
• Problem: multiple comparisons
• Increased chance of a random high score

Multiple testing
• Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations.
• Assume that all of the observations are explainable by the null hypothesis.
• What is the chance that at least one of the observations will receive a p-value less than 0.05?
• Pr(making a mistake) = 0.05
• Pr(not making a mistake) = 0.95
• Pr(not making any mistake) = $0.95^{20}$ = 0.358
• Pr(making at least one mistake) = 1 - 0.358 = 0.642
• There is a 64.2% chance of making at least one mistake.

Bonferroni correction
• Divide the desired p-value threshold by the number of tests performed.
• For the previous example, 0.05 / 20 = 0.0025.
• Pr(making a mistake) = 0.0025
• Pr(not making a mistake) = 0.9975
• Pr(not making any mistake) = $0.9975^{20}$ = 0.9512
• Pr(making at least one mistake) = 1 - 0.9512 = 0.0488

Corrections for database searching
• Say that you search the non-redundant protein database at NCBI, containing roughly one million sequences. What p-value threshold should you use?
• What is the hidden assumption here?

Example
• Say that you want to use a conservative p-value of 0.001.
• Recall that you would observe such a p-value by chance approximately once in every 1,000 comparisons against a random database.
• A Bonferroni correction would suggest using a p-value threshold of 0.001 / 1,000,000 = 0.000000001 = $10^{-9}$.
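The multiple-testing arithmetic above can be reproduced with a short Python sketch. It assumes, as the calculation above does, that the tests are independent; the function names `familywise_error` and `bonferroni_threshold` are illustrative, not part of any library.

```python
def familywise_error(alpha, n_tests):
    """Chance of at least one false positive in n_tests independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the overall error rate at or below alpha."""
    return alpha / n_tests


alpha, n_tests = 0.05, 20

# Uncorrected: every test uses the 0.05 threshold.
print(f"uncorrected: {familywise_error(alpha, n_tests):.3f}")

# Corrected: every test uses 0.05 / 20 = 0.0025.
corrected = bonferroni_threshold(alpha, n_tests)
print(f"per-test threshold: {corrected}")
print(f"corrected: {familywise_error(corrected, n_tests):.4f}")

# Database example: 0.001 spread over one million comparisons.
print(f"database threshold: {bonferroni_threshold(0.001, 1_000_000):.0e}")
```

The corrected overall error rate lands just under the desired 0.05, which is why the Bonferroni correction is described as conservative.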
Exercise
• A sequence of size m is queried against a database. The database contains k sequences of lengths $n_1, n_2, \ldots, n_k$. The E-value for alignment i is $K m n_i e^{-\lambda S}$.
• What is the query E-value for score S?
• If we know that the p-value for alignment i is $P_i$, what is the query p-value?

Solution
• Let $A_i$ denote the number of optimal alignments against sequence i that scored ≥ S ($A_i$ is either 0 or 1).
• $E(A_1 + A_2 + \ldots + A_k) = E(A_1) + E(A_2) + \ldots + E(A_k) = \sum_{i=1}^{k} K m n_i e^{-\lambda S} = K m e^{-\lambda S} \sum_{i=1}^{k} n_i = K m n e^{-\lambda S}$, where $n = \sum_{i=1}^{k} n_i$ is the total length of the database.

Solution
• The probability of seeing an optimal alignment with score ≥ S by chance in the entire database is $1 - \prod_{i=1}^{k} (1 - P_i) = 1 - \prod_{i=1}^{k} \exp(-K m n_i e^{-\lambda S}) = 1 - \exp\left(-\sum_{i=1}^{k} K m n_i e^{-\lambda S}\right)$
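As a numerical check of the two solutions above, the following Python sketch computes the query p-value both as the product over per-sequence p-values and through the summed E-value. The function names and the values of K, λ, m and the database lengths are hypothetical, chosen only for illustration.

```python
import math


def alignment_e_value(K, lam, m, n_i, S):
    """E-value of the query (length m) against one sequence of length n_i."""
    return K * m * n_i * math.exp(-lam * S)


def query_e_value(K, lam, m, lengths, S):
    """Database-wide E-value: the sum of the per-sequence E-values."""
    return sum(alignment_e_value(K, lam, m, n_i, S) for n_i in lengths)


def query_p_value_product(K, lam, m, lengths, S):
    """1 - prod(1 - P_i), with P_i = 1 - exp(-E_i) for each sequence."""
    prob_no_hit = 1.0
    for n_i in lengths:
        P_i = -math.expm1(-alignment_e_value(K, lam, m, n_i, S))
        prob_no_hit *= 1.0 - P_i
    return 1.0 - prob_no_hit


def query_p_value_closed_form(K, lam, m, lengths, S):
    """1 - exp(-sum of E_i), the closed form derived above."""
    return -math.expm1(-query_e_value(K, lam, m, lengths, S))


# Hypothetical parameters and database, for illustration only.
K, lam, m, S = 0.1, 0.3, 200, 40
lengths = [100, 250, 400]

print(query_e_value(K, lam, m, lengths, S))
print(query_p_value_product(K, lam, m, lengths, S))
print(query_p_value_closed_form(K, lam, m, lengths, S))
```

The two p-value functions should agree up to floating-point rounding, and the summed E-value should equal the single-sequence formula applied to the total database length.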