CS 5263 Bioinformatics
Lecture 5: Local Sequence Alignment Algorithms

Poll
• Who has learned and still remembers Finite State Machines/Automata, regular grammars, and context-free grammars?

Roadmap
• Review of last lecture
• Local sequence alignment
• Statistics of sequence alignment
  – Substitution matrix
  – Significance of alignment

Bounded Dynamic Programming
[diagram: DP matrix for x1…xM vs y1…yN restricted to a band of width k around the diagonal]
• O(kM) time
• O(kM) memory
  – Possibly O(M+k)

Linear-space alignment
[diagram: divide-and-conquer split at column M/2; the optimal path crosses it at row k*, leaving subproblems of size k* and N-k*]
• O(M+N) memory
• 2MN time

Graph representation of sequence alignment
[diagram: alignment graph with edge weights +1 for a match and -1 for a mismatch or gap; a highlighted path runs from (0,0) to (3,4)]
• An optimal alignment is a longest path from (0,0) to (m,n) on the alignment graph

Question
• If I change the scoring scheme, will it change the alignment?
  – Match = 1, mismatch = gap = -2
  – versus: Match = 2, mismatch = gap = -1?
• Answer: Yes

Proof
• Let F1 be the score of an optimal alignment under the scoring scheme
  – Match = m > 0
  – Mismatch = s < 0
  – Gap = d < 0
• Let a1, b1, c1 be the number of matches, mismatches, and gaps in the alignment
• F1 = a1·m + b1·s + c1·d

Proof (cont'd)
• Let F2 be the score of a sub-optimal alignment under the same scoring scheme
• Let a2, b2, c2 be the number of matches, mismatches, and gaps in that alignment
• F2 = a2·m + b2·s + c2·d
• Let F1 = F2 + k, where k > 0

Proof (cont'd)
• Now we change the scoring scheme, so that
  – Match = m + 1
  – Mismatch = s + 1
  – Gap = d + 1

Proof (cont'd)
• The new scores for the two alignments become:
  F1' = a1·(m+1) + b1·(s+1) + c1·(d+1)
      = a1·m + b1·s + c1·d + (a1+b1+c1)
      = F1 + L1, where L1 = a1+b1+c1 is the length of alignment 1
  F2' = a2·(m+1) + b2·(s+1) + c2·(d+1)
      = F2 + (a2+b2+c2)
      = F2 + L2, where L2 is the length of alignment 2

Proof (cont'd)
• F1' - F2' = F1 - F2 + (a1+b1+c1) - (a2+b2+c2)
            = k + L1 - L2
• In order to have F1' < F2', we need k + L1 - L2 < 0, i.e. L2 - L1 > k

Proof (cont'd)
• This means: if under the original scoring scheme F1 is greater than F2 by k, but alignment 2 is at least k+1 longer than alignment 1, then F2' will be greater than F1' under the new scoring scheme
• We only need to show one example where such a pair of alignments can be found

Example
  AACAG        F1 = 2m + 3s  (2 matches, 3 mismatches)
  | |
  ATCGT

  AA-CAG-      F2 = 3m + 4d  (3 matches, 4 gaps)
   | | |
  -ATC-GT

• With m = 1, s = d = -2:
  F1 = 2×1 - 3×2 = -4,  F2 = 3×1 - 4×2 = -5,  so F1 > F2
• With m = 2, s = d = -1:
  F1' = 2×2 - 3×1 = 1,  F2' = 3×2 - 4×1 = 2,  so F2' > F1'
• On the other hand, if we had doubled our scores, such that m' = 2m, s' = 2s, d' = 2d, then
  F1' = 2·F1 and F2' = 2·F2
• Our alignment won't be changed

Today
• How to model gaps more accurately?
• Local sequence alignment
• Statistics of alignment

What's a better alignment?
  GACGCCGAACG
  |||||   |||
  GACGC---ACG     Score = 8×m - 3×d

  GACGCCGAACG
  |||| | | ||
  GACG-C-A-CG     Score = 8×m - 3×d

• Under a linear gap penalty both alignments get the same score
• However, gaps usually occur in bunches
  – During evolution, chunks of DNA may be lost entirely
  – Aligning genomic sequence vs. cDNA (reverse complementary to mRNA)
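Both the scoring-scheme proof example and the two GACG… alignments above can be re-scored with a few lines of Python. This is a minimal sketch, not part of the lecture; the helper name alignment_score and the concrete match/mismatch/gap values used for the last check are my own choices.

def alignment_score(ax, ay, match, mismatch, gap):
    """Score two equal-length aligned strings; '-' marks a gap."""
    score = 0
    for a, b in zip(ax, ay):
        if a == '-' or b == '-':
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

# Proof example: the better-scoring alignment flips when every score is raised by 1.
aln1 = ("AACAG",   "ATCGT")    # 2 matches, 3 mismatches
aln2 = ("AA-CAG-", "-ATC-GT")  # 3 matches, 4 gaps
for m, s, d in [(1, -2, -2), (2, -1, -1)]:
    f1 = alignment_score(aln1[0], aln1[1], match=m, mismatch=s, gap=d)
    f2 = alignment_score(aln2[0], aln2[1], match=m, mismatch=s, gap=d)
    print(f"m={m}, s={s}, d={d}:  F1={f1}  F2={f2}")
# Prints F1=-4, F2=-5 under the first scheme and F1=1, F2=2 under the second.

# Linear gap penalty: the two GACG... alignments tie (8 matches, 3 gap columns each;
# the scoring values here are my own, chosen only to show the tie).
aln3 = ("GACGCCGAACG", "GACGC---ACG")
aln4 = ("GACGCCGAACG", "GACG-C-A-CG")
print(alignment_score(*aln3, match=2, mismatch=-1, gap=-1),
      alignment_score(*aln4, match=2, mismatch=-1, gap=-1))   # 13 13

An affine gap penalty, introduced next, breaks this tie in favor of the single long gap.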
Model gaps more accurately
• Current model:
  – A gap of length n incurs penalty n·d (linear in n)
• General:
  – A convex gap penalty function γ(n)
  – E.g. γ(n) = c·sqrt(n)

General gap dynamic programming
Initialization: same
Iteration:
                 F(i-1, j-1) + σ(xi, yj)
  F(i, j) = max  max_{k=0..i-1} [ F(k, j) - γ(i-k) ]
                 max_{k=0..j-1} [ F(i, k) - γ(j-k) ]
Termination: same
Running time: O(N²M) (cubic)
Space: O(NM) (the linear-space algorithm is not applicable)

Compromise: affine gaps
  γ(n) = d + (n-1)·e
         |        |
     gap open   gap extension

Example (Match: 2, Gap open: 5, Gap extension: 1)
  GACGCCGAACG
  |||||   |||
  GACGC---ACG     8×2 - 5 - 2×1 = 9

  GACGCCGAACG
  |||| | | ||
  GACG-C-A-CG     8×2 - 3×5 = 1

Additional states
• The amount of state needed increases
  – In scoring a single entry in our matrix, we need to remember an extra piece of information:
    • Are we continuing a gap in x? (if not, starting one is more expensive)
    • Are we continuing a gap in y? (if not, starting one is more expensive)
    • Are we continuing from a match between xi and yj?

Finite State Automaton
[diagram: three states (xi and yj aligned; xi aligned to a gap; yj aligned to a gap), with gap-open transitions weighted d and gap-extension self-loops weighted e]

Dynamic programming
• We encode this information in three different matrices
• For each element (i, j) we use three variables
  – F(i, j): best alignment of x1..xi & y1..yj if xi aligns to yj
  – Ix(i, j): best alignment of x1..xi & y1..yj if yj aligns to a gap
  – Iy(i, j): best alignment of x1..xi & y1..yj if xi aligns to a gap

                              F(i-1, j-1)      continuing alignment
  F(i, j) = σ(xi, yj) + max   Ix(i-1, j-1)     closing a gap in x
                              Iy(i-1, j-1)     closing a gap in y

                  F(i, j-1) - d     opening a gap in x
  Ix(i, j) = max  Iy(i, j-1) - d
                  Ix(i, j-1) - e    gap extension in x

                  F(i-1, j) - d     opening a gap in y
  Iy(i, j) = max  Ix(i-1, j) - d
                  Iy(i-1, j) - e    gap extension in y

• If we stack all three matrices, there is no cyclic dependency
• We can fill in all three matrices in order

Algorithm
• for i = 1..M
    for j = 1..N
      fill in F(i, j), Ix(i, j), Iy(i, j)
• F(M, N) = max( F(M, N), Ix(M, N), Iy(M, N) )
• Time: O(MN)
• Space: O(MN), or O(N) when combined with the linear-space algorithm

To simplify
  F(i, j) = max  F(i-1, j-1) + σ(xi, yj)
                 I(i-1, j-1) + σ(xi, yj)

  I(i, j) = max  F(i, j-1) - d
                 I(i, j-1) - e
                 F(i-1, j) - d
                 I(i-1, j) - e

• I(i, j): best alignment between x1…xi and y1…yj if either xi or yj is aligned to a gap
• This is possible because no alternating gaps are allowed

To summarize
• Global alignment
  – Basic algorithm: Needleman-Wunsch
  – Variants:
    • Overlap detection
    • Longest common subsequence
    • Achieved by varying initial conditions or scoring
  – Bounded DP (pruning the search space)
  – Linear space (divide-and-conquer)
  – Affine gap penalty

Local alignment

The local alignment problem
Given two strings X = x1……xM and Y = y1……yN, find substrings x' and y' whose similarity (optimal global alignment value) is maximum.
e.g. X = abcxdex, Y = xxxcde
     X' = cxde
     Y' = c-de

Why local alignment
• Conserved regions may be a small part of the whole
  – "Active site" of a protein
  – Scattered genes or exons among "junk"
  – We may not have the whole sequence
• Global alignment might miss them if the flanking "junk" outweighs the similar regions
• Genes are shuffled between genomes
  [diagram: two genomes carrying the same gene blocks A, B, C, D in shuffled order]

Naïve algorithm
  for all substrings X' of X and Y' of Y:
    align X' & Y' via dynamic programming
    retain the pair with the maximum value
  output the retained pair
• Time: O(n²) choices for X', O(m²) choices for Y', O(nm) for each DP, so O(n³m³) total
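Returning to the affine-gap recurrences above (the F, Ix, Iy matrices), here is a minimal Python sketch of the fill-in loop. The function name, the boundary conditions, and the mismatch score in the example call are my own assumptions, since the slides only state the match, gap-open, and gap-extension values; treat it as an illustration of the recurrences rather than the lecture's exact code.

NEG_INF = float("-inf")

def affine_global_score(x, y, match=2, mismatch=-1, d=5, e=1):
    """Optimal global alignment score with gap penalty gamma(n) = d + (n-1)*e."""
    M, N = len(x), len(y)
    sigma = lambda a, b: match if a == b else mismatch

    # F[i][j]:  best score with x[i-1] aligned to y[j-1]
    # Ix[i][j]: best score with y[j-1] aligned to a gap
    # Iy[i][j]: best score with x[i-1] aligned to a gap
    F  = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    Ix = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    Iy = [[NEG_INF] * (N + 1) for _ in range(M + 1)]

    # Boundary conditions (my own choice; the slides do not spell them out):
    F[0][0] = 0
    for j in range(1, N + 1):          # leading run of y aligned to gaps
        Ix[0][j] = -d - (j - 1) * e
    for i in range(1, M + 1):          # leading run of x aligned to gaps
        Iy[i][0] = -d - (i - 1) * e

    for i in range(1, M + 1):
        for j in range(1, N + 1):
            F[i][j] = sigma(x[i - 1], y[j - 1]) + max(
                F[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
            Ix[i][j] = max(F[i][j - 1] - d,      # open a gap in x
                           Iy[i][j - 1] - d,
                           Ix[i][j - 1] - e)     # extend a gap in x
            Iy[i][j] = max(F[i - 1][j] - d,      # open a gap in y
                           Ix[i - 1][j] - d,
                           Iy[i - 1][j] - e)     # extend a gap in y

    return max(F[M][N], Ix[M][N], Iy[M][N])

# With match=2, gap open=5, gap extension=1 (and a heavy mismatch penalty of my
# own choosing), the GACGCCGAACG / GACGCACG example above prefers the single
# 3-long gap: 8*2 - 5 - 2*1 = 9.
print(affine_global_score("GACGCCGAACG", "GACGCACG", match=2, mismatch=-10, d=5, e=1))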
Reminder
• The overlap detection algorithm
  – We do not penalize gaps at the ends
  [diagram: two overlapping sequences; end gaps on both sides are free, and only the overlapping ("similar here") region is scored]
• We are free of penalty for the unaligned regions

The big idea
• Whenever we get to some bad region (negative score), we ignore the previous alignment
  – Reset the score to zero

The Smith-Waterman algorithm
Initialization: F(0, j) = F(i, 0) = 0
Iteration:
                 0
  F(i, j) = max  F(i-1, j) - d
                 F(i, j-1) - d
                 F(i-1, j-1) + σ(xi, yj)

Termination:
1. If we want the best local alignment:
   F_OPT = max_{i,j} F(i, j)
2. If we want all local alignments scoring > t:
   for all i, j, find F(i, j) > t and trace back
• The correctness of the algorithm can be proved by induction using the alignment graph

Example (Match: 2, Mismatch: -1, Gap: -1)
The DP array is filled in row by row; the completed array is:

        a  b  c  x  d  e  x
    0   0  0  0  0  0  0  0
x   0   0  0  0  2  1  0  2
x   0   0  0  0  2  1  0  2
x   0   0  0  0  2  1  0  2
c   0   0  0  2  1  1  0  1
d   0   0  0  1  1  3  2  1
e   0   0  0  0  0  2  5  4

Tracing back from the maximum entry (5) gives two optimal local alignments:

  cxde      x-de
  | ||      | ||
  c-de      xcde

• There are no negative values in the local alignment DP array
• An optimal local alignment will never have a gap on either end
• Local alignment: "Smith-Waterman"
• Global alignment: "Needleman-Wunsch"

Analysis
• Time:
  – O(MN) for finding the best alignment
  – Additional time depends on the number of sub-optimal alignments reported
• Memory:
  – O(MN)
  – O(M+N) possible
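The Smith-Waterman recurrence is compact enough to sketch directly. The following Python is my own illustration (the function name and the traceback tie-breaking order are not from the lecture); it reproduces the abcxdex / xxxcde example with match 2, mismatch -1, gap -1, returning the best score and one optimal local alignment (co-optimal alignments such as x-de / xcde are not enumerated).

def smith_waterman(x, y, match=2, mismatch=-1, gap=-1):
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    best, best_ij = 0, (0, 0)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s,   # align x[i-1] with y[j-1]
                          F[i - 1][j] + gap,     # x[i-1] aligned to a gap
                          F[i][j - 1] + gap)     # y[j-1] aligned to a gap
            if F[i][j] > best:
                best, best_ij = F[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    ax, ay = [], []
    i, j = best_ij
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append('-'); i -= 1
        else:
            ax.append('-'); ay.append(y[j - 1]); j -= 1
    return best, ''.join(reversed(ax)), ''.join(reversed(ay))

print(smith_waterman("abcxdex", "xxxcde"))   # expected: (5, 'cxde', 'c-de')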
The statistics of alignment
• Where does σ(xi, yj) come from?
• Are two aligned sequences actually related?

Probabilistic model of alignments
• We'll focus on protein alignments without gaps
• Given an alignment, we can consider two possibilities
  – R: the sequences are related by evolution
  – U: the sequences are unrelated
• How can we distinguish these possibilities?
• How is this view related to the amino-acid substitution matrix?

Model for unrelated sequences
• Assume each position of the alignment is independently sampled from some distribution of amino acids
• ps: probability of amino acid s in the sequences
• The probability of seeing an amino acid s aligned to an amino acid t by chance is
  – Pr(s, t | U) = ps · pt
• The probability of seeing an ungapped alignment between x = x1…xn and y = y1…yn by chance is
  – Pr(x, y | U) = Π_i p_xi · p_yi

Model for related sequences
• Assume each pair of aligned amino acids evolved from a common ancestor
• Let qst be the probability that amino acid s in one sequence is related to t in another sequence
• The probability of an alignment of x and y is given by
  – Pr(x, y | R) = Π_i q_xi,yi

Probabilistic model of alignments
• How can we decide which possibility (U or R) is more likely?
• One principled way is to consider the relative likelihood of the two possibilities (the odds ratio)
  – Pr(x, y | R) / Pr(x, y | U) = Π_i q_xi,yi / (p_xi · p_yi)
  – A higher ratio means that R is more likely than U

Log odds ratio
• Taking the log, we get
  log [ Pr(x, y | R) / Pr(x, y | U) ] = Σ_i log [ q_xi,yi / (p_xi · p_yi) ]
• Recall that the score of an (ungapped) alignment is given by
  F = Σ_i σ(xi, yi)
• Therefore, if we define
  σ(s, t) = log [ qst / (ps · pt) ]
  the alignment score is exactly this log odds ratio
• This is indeed how biologists have defined the substitution matrices for proteins
• We are actually defining the alignment score as the log odds ratio (log likelihood ratio) between the two models R and U
• ps can be counted from the available protein sequences
• But how do we get qst, the probability that s and t have a common ancestor?
  – It is counted from trusted alignments of related sequences

Protein substitution matrices
• Two popular sets of matrices for protein sequences
  – PAM matrices [Dayhoff et al., 1978]
    • Better for aligning closely related sequences
  – BLOSUM matrices [Henikoff & Henikoff, 1992]
    • For both closely and remotely related sequences
• In these matrices:
  – Chemically similar substitutions score positively
  – Common amino acids get low weights
  – Rare amino acids get high weights

BLOSUM-N matrices
• Constructed from a database called BLOCKS
• It contains many closely related sequences
  – Conserved amino acids may be over-counted
• N = 62: the probabilities qst were computed using trusted alignments with no more than 62% identity
  – Identity: % of matched columns
• Using this matrix, the Smith-Waterman algorithm is most effective at detecting real alignments with a similar identity level (i.e. ~62%)
• If you want to detect homologous genes with high identity, you may want a BLOSUM matrix with a higher N, say BLOSUM75
• On the other hand, if you want to detect remote homology, you may want to use a lower N, say BLOSUM50
• BLOSUM62 is the standard

For DNAs
• There is no database of trusted alignments to start with
• Specify the percentage identity you would like to detect
• You can then compute the substitution matrix

For example
• Suppose pA = pC = pG = pT = 0.25
• We want 88% identity
• qAA = qCC = qGG = qTT = 0.88/4 = 0.22; the rest = 0.12/12 = 0.01
• σ(A, A) = σ(C, C) = σ(G, G) = σ(T, T) = log(0.22 / (0.25 × 0.25)) = 1.26
• σ(s, t) = log(0.01 / (0.25 × 0.25)) = -1.83 for s ≠ t

Substitution matrix
        A      C      G      T              A    C    G    T
  A   1.26  -1.83  -1.83  -1.83        A    5   -7   -7   -7
  C  -1.83   1.26  -1.83  -1.83        C   -7    5   -7   -7
  G  -1.83  -1.83   1.26  -1.83        G   -7   -7    5   -7
  T  -1.83  -1.83  -1.83   1.26        T   -7   -7   -7    5

• Scaling won't change the alignment
• Multiply by 4 and then round off to get integers

Arbitrary substitution matrix
• Say you have a substitution matrix provided by someone
• It's important to know what you are actually looking for when you use the matrix

        A   C   G   T              A   C   G   T
  A     1  -2  -2  -2         A    5  -4  -4  -4
  C    -2   1  -2  -2         C   -4   5  -4  -4
  G    -2  -2   1  -2         G   -4  -4   5  -4
  T    -2  -2  -2   1         T   -4  -4  -4   5

• What's the difference?
• Which one should I use?

• We had
  σ(s, t) = log [ qst / (ps · pt) ]
• Scale it, so that
  σ(s, t) = (1/λ) · log [ qst / (ps · pt) ]
• Reorganize:
  qst = ps · pt · e^(λ·σ(s, t))
• Since all probabilities must sum to 1, we have
  Σ_{s,t} ps · pt · e^(λ·σ(s, t)) = 1
• Suppose again ps = 0.25 for any s
• We know σ(s, t) from the substitution matrix
• We can solve the equation for λ
• Plug λ back into qst = ps · pt · e^(λ·σ(s, t)) to get qst

For the match = 1 / mismatch = -2 matrix: λ ≈ 1.33
  qst ≈ 0.24 for s = t, and 0.004 for s ≠ t
  Translation: ~95% identity

For the match = 5 / mismatch = -4 matrix: λ ≈ 0.19
  qst ≈ 0.16 for s = t, and 0.03 for s ≠ t
  Translation: ~65% identity
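As a rough numerical check of the last example, the normalization condition Σ_{s,t} ps·pt·e^(λ·σ(s,t)) = 1 can be solved for λ with a few lines of Python. The helper below is my own sketch (it assumes uniform ps = 0.25 and uses simple bisection, neither of which is prescribed by the lecture); it reproduces λ ≈ 1.33 and λ ≈ 0.19 and the implied ~95% and ~65% identity levels for the two example DNA matrices.

import math

def implied_identity(match_score, mismatch_score, p=0.25):
    """Solve sum_{s,t} p*p*exp(lambda*sigma(s,t)) = 1 for a uniform DNA model."""
    def total(lam):
        # 4 matching pairs and 12 mismatching pairs of nucleotides
        return (4 * p * p * math.exp(lam * match_score)
                + 12 * p * p * math.exp(lam * mismatch_score))

    # Bisection on lambda > 0: total(lam) - 1 changes sign at the nontrivial root.
    lo, hi = 1e-6, 10.0
    while hi - lo > 1e-9:
        mid = (lo + hi) / 2
        if total(mid) > 1.0:
            hi = mid
        else:
            lo = mid
    lam = (lo + hi) / 2
    q_match = p * p * math.exp(lam * match_score)   # q_st for s = t
    identity = 4 * q_match                          # expected % of matched columns
    return lam, q_match, identity

for m, s in [(1, -2), (5, -4)]:
    lam, q_match, identity = implied_identity(m, s)
    print(f"match={m}, mismatch={s}: lambda={lam:.2f}, "
          f"q(s=t)={q_match:.3f}, identity={identity:.0%}")
# Expected (approximately): lambda 1.33 and 0.19, identities ~95% and ~65%.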