Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Pairwise Sequence Alignment (II) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 27, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign Many slides are taken/adapted from http://www.bioalgorithms.info/slides.htm Review: Dynamic Programming for LCS -Edit graph representation of alignment -Path = alignment -Incrementally fill in the table -Backtrack to find the best alignment The LCS Recurrence Revisited • The formula can be rewritten by adding zero to the edges that come from an indel, since the penalty of indels are 0: si-1, j-1+1 if vi = wj si,j = max Matching score si-1, j + 0 si, j-1 + 0 Insertion/deletion score How do we improve scoring? How do we improve the scoring of alignments? Can we still find an alignment efficiently? Outline • Improve Scoring – Scoring Matrix – Affine Gap Penalty • Variants of Alignment – Global vs. Local alignment • Assessing Score Significance Scoring Matrices To generalize scoring, consider a (4+1) x(4+1) scoring matrix δ. In the case of an amino acid sequence alignment, the scoring matrix would be a (20+1)x(20+1) size. The addition of 1 is to include the score with comparison of a gap character “-”. This will simplify the scoring algorithm as follows: si-1,j-1 + δ (vi, wj) si,j = max s i-1,j + δ (vi, -) s i,j-1 + δ (-, wj) The same dynamic programming algorithm would still work! The Global Alignment Problem Find the best alignment between two strings under a given scoring matrix Input : Strings v & w and a scoring matrix δ Output : Alignment of maximum score Algorithm: Dynamic programming si,j = max si-1,j-1 + δ (vi, wj) s i-1,j + δ (vi, -) s i,j-1 + δ (-, wj) The only question left is how to define the scoring matrix… Measuring Similarity • Measuring the extent of similarity between two sequences – Based on percent sequence identity – Based on conservation Percent Sequence Identity • The extent to which two nucleotide or amino acid sequences are invariant AC C TG A G – AG AC G TG – G C AG mismatch indel 70% identical Simple Scoring • When mismatches are penalized by some constant –μ, indels are penalized by some other constant –σ, and matches are rewarded with +1, the resulting score is: #matches – μ(#mismatches) – σ (#indels) Making a Better Scoring Matrix • Scoring matrices are created based on biological evidence. • Alignments can be thought of as two sequences that differ due to mutations in the sequence. • Some of these mutations have little effect on the organism’s function, therefore some penalties, δ(vi , wj), will be less harsh than others. Scoring Matrix: Example A R N K A 5 -2 -1 -1 R - 7 -1 3 N - - 7 0 K - - - 6 • Notice that although R and K are different amino acids, they have a positive score. • Why? They are both positively charged amino acids will not greatly change function of protein. Scoring matrices • Amino acid substitution matrices – PAM – BLOSUM • DNA substitution matrices – DNA: less conserved than protein sequences – Less effective to compare coding regions at nucleotide level – Simple scoring is often used PAM • • Point Accepted Mutation (Dayhoff et al.) 1 PAM = PAM1 = 1% average change of all amino acid positions – After 100 PAMs of evolution, not every residue will have changed • some residues may have mutated several times • some residues may have returned to their original state • some residues may have not changed at all PAMX • PAMx = PAM1x Think of PAM1 as 1-step transitions and PAM250 as 250-step transitions – PAM250 = PAM1250 • PAM250 is a widely used scoring matrix: Ala Arg Asn Asp Cys Gln ... Trp Tyr Val A R N D C Q Ala A 13 3 4 5 2 3 Arg R 6 17 4 4 1 5 Asn N 9 4 6 8 1 5 Asp D 9 3 7 11 1 6 Cys C 5 2 2 1 52 1 Gln Q 8 5 5 7 1 10 Glu E 9 3 6 10 1 7 Gly G 12 2 4 5 2 3 His H 6 6 6 6 2 7 Ile I 8 3 3 3 2 2 W Y V 0 1 7 2 1 4 0 2 4 0 1 4 0 3 4 0 1 4 0 1 4 0 1 4 1 3 5 0 2 4 Leu L 6 2 2 2 1 3 Lys ... K ... 7 ... 9 5 5 1 5 1 2 15 0 1 10 BLOSUM • Blocks Substitution Matrix • Scores derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins • Matrix name indicates evolutionary distance – BLOSUMx was created using sequences sharing no more than x% identity – E.g., BLOSUM62 <-> 62% identity The Blosum50 Scoring Matrix Probability of seeing x aligned with y Val(x,y)=log(p(x,y)/p(x)p(y)) Probability of seeing x (or y) alone Deficiency in Scoring of Indels • A fixed penalty σ is given to every indel: – -σ when there is 1 indel, -2σ for 2 consecutive indels, -3σ for 3 consecutive indels, etc. Can be too severe penalty for a series of 100 consecutive indels Deficiency in Scoring of Indels (cont.) • In nature, many times indels come as a unit, not just at 1 nucleotide at a time. In nature, this is more likely. Normal scoring would give the same score for both alignments Accounting for Gaps • Gaps- contiguous sequence of spaces in one of the rows • Score for a gap of length x is: -(ρ + σx), where ρ >0 is the penalty for introducing a gap. ρ will be large relative to σ because you do not want to add too much of a penalty for extending the gap. Affine Gap Penalties • Gap penalties: – -ρ-σ when there is 1 indels, -ρ-2σ when there are 2 indels, -ρ-3σ when there are 3 indels, etc. – -ρ- x * σ (-gap opening - x gap extensions) • Somehow reduced penalties (as compared to naïve scoring) are given to runs of horizontal and vertical edges Affine Gap Penalty Recurrences si,j = max si,j = max si,j = max s i-1,j - σ s i-1,j –(ρ+σ) Continue Gap in w (deletion) Start Gap in w (deletion) s i,j-1 - σ Continue Gap in v (insertion) s i,j-1 –(ρ+σ) Start Gap in v (insertion) si-1,j-1 + δ (vi, wj) Match or Mismatch s i,j End deletion s i,j End insertion Once again, the same dynamic programming algorithm would work! Local vs. Global Alignment • The Global Alignment Problem tries to find the longest path between vertices (0,0) and (n,m) in the edit graph. • The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i,j) and (i’, j’) in the edit graph. Local vs. Global Alignment (cont’d) • Global Alignment --T—-CC-C-AGT—-TATGT-CAGGGGACACG—A-GCATGCAGA-GAC | || | || | | | ||| || | | | | |||| | AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG—T-CAGAT--C • Local Alignment—better alignment to find conserved segment tccCAGTTATGTCAGgggacacgagcatgcagagac |||||||||||| aattgccgccgtcgttttcagCAGTTATGTCAGatc Local Alignments: Why? • Two genes in different species may be similar over short conserved regions and dissimilar over remaining regions. • Example: – Homeobox genes have a short region called the homeodomain that is highly conserved between species. – A global alignment would not find the homeodomain because it would try to align the ENTIRE sequence The Local Alignment Problem • Goal: Find the best local alignment between two strings • Input : Strings v, w and scoring matrix δ • Output : Alignment of substrings of v & w whose alignment score is maximum among all possible alignment of all possible substrings Local Alignment in Edit Graph Local alignment Global alignment Compute a “mini” Global Alignment to get Local The Problem with this Problem • Problem of this, long run time O(n4): - There are ~n2 pairs of vertices (i,j) - For each pair of vertices computing an alignment takes O(n2) time. • Solution: Dynamic programming again! • Question: How do we recursively compute the best score of any local (as opposed to global) alignment for each cell in the edit graph? The Local Alignment Recurrence • The largest value of si,j over the whole edit graph is the score of the best local alignment. • The recurrence is shown below: 0 si,j = max si-1,j-1 + δ (vi, wj) s i-1,j + δ (vi, -) s i,j-1 + δ (-, wj) Notice there is only this change from the original recurrence of a Global Alignment This is the well-known Waterman-Smith local alignment algoirthm Assessing Score Signficance • In general, larger s more significant. The question is how large should s be? • Factors to be considered: – Sequence length: longer sequences are expected to give higher scores – # sequences in the database: the score of the best alignment is expected to be higher for a larger DB – Evolution time: longer evolution causes more mismatches, making a lower score more significant • The Challenge is how to quantify all these… Two Basic Approaches • The classical approach: Extreme value distribution (EVD) – Assume a null (random) model for scores MR – P(Score > s|MR, a(x, y))=? (a(x,y)=alignment of x, y) • The Bayesian approach: Model comparison – Assume two models for a(x,y): random R; aligned: M – P(M|a(x,y))/P(R|a(x,y))=? log p( M | a( x, y )) p ( a ( x, y ) | M ) p( M ) log log p( R | a( x, y )) p ( a ( x, y ) | M R ) p(M R ) Log-odds score of the alignment prior EVD of the Best Score in Ungapped Local Alignment • The number of unrelated local matches with score higher than S is approximately Poisson distributed, with mean Parameters E (S ) Kmne • S e E ( S ) E ( S )n p ( N n) n! Sequence lengths The probability that there is a match of score greater than S is p( x S ) 1 p( N 0) 1 e E ( S ) • • K and can be fit using randomly generated data This gives a way to test statistical significance p(x>21)= 0.01 vs. p(x>21)=0.3 Bayesian Model Comparison Assumptions: log • M is a model for related sequences • R is a model for unrelated sequences (random) • Ungapped alignment n=m • Alignment of each pair is independent p( M | a ( x, y )) p ( a ( x, y ) | M ) p( M ) log log p ( R | a ( x, y )) p ( a ( x, y ) | R ) p( R) n p ( xi , yi | M ) p( M ) log log p ( xi , yi | R) p( R) i 1 n p ( xi , yi | M ) p( M ) log log p ( xi | R ) p ( yi | R ) p( R) i 1 BLOSUM Scoring n log s ( xi , yi ) log i 1 p( M ) p( R) log p ( M | a ( x, y )) 0? p( R | a( x, y )) n log s( x , y ) log i 1 i Score S(x,y) This partially addresses Q1: how to design the scoring function? i p(M ) p( R) Prior (Subjective!) Pairwise Alignment Summary (Modeling evolution) Q1: How should we define s? Q2: How should we define A? (Application-specific) X=x1,…,xn Model: scoring function s: A X=x1,…,xn Possible alignments of X and Y: A ={a1,…,ak} … Find the best alignment(s) Y=y1,…,ym a* arg max a s(a( X , Y )) Q3: How can we find a* quickly? (Dynamic programming) Q1 & Q4 are related! S(a*)= 21 Y=y1,…,ym Q4: Is the alignment biologically Meaningful or just the best alignment of two unrelated sequences? (Models for scores) What You Should Know • Alignment Scoring Methods (Matrix & Gap) • Global vs. Local alignments • How the dynamic programming algorithm solves both local and global alignments with a number of scoring strategies • Basic idea in assessing score significance