
CS 5263 Bioinformatics
Lecture 5: Local Sequence Alignment Algorithms

Poll
• Who has learned and still remembers finite state machines/automata, regular grammars, and context-free grammars?

Roadmap
• Review of last lecture
• Local sequence alignment
• Statistics of sequence alignment
  – Substitution matrix
  – Significance of alignment

Bounded Dynamic Programming
[Figure: DP matrix with x1 … xM on one axis and y1 … yN on the other; a band of width k is marked.]
• O(kM) time
• O(kM) memory
  – Possibly O(M+k)

Linear-space alignment
[Figure: DP matrix divided at M/2, with the crossing point k* and the remainder N-k* marked.]
• O(M+N) memory
• 2MN time

Graph representation of seq alignment
[Figure: alignment graph with edge weights and node scores, from (0,0) to an example node (3,4).]
• An optimal alignment is a longest path from (0, 0) to (m, n) on the alignment graph

Question
• If I change the scoring scheme, will it change the alignment?
  – From: match = 1, mismatch = gap = -2
  – To: match = 2, mismatch = gap = -1
• Answer: Yes, it can

Proof
• Let $F_1$ be the score of an optimal alignment under the scoring scheme
  – Match = m > 0
  – Mismatch = s < 0
  – Gap = d < 0
• Let $a_1, b_1, c_1$ be the number of matches, mismatches, and gaps in the alignment
• $F_1 = a_1 m + b_1 s + c_1 d$

Proof (cont')
• Let $F_2$ be the score of a sub-optimal alignment under the same scoring scheme
• Let $a_2, b_2, c_2$ be the number of matches, mismatches, and gaps in that alignment
• $F_2 = a_2 m + b_2 s + c_2 d$
• Let $F_1 = F_2 + k$, where k > 0

Proof (cont')
• Now we change the scoring scheme, so that
  – Match = m + 1
  – Mismatch = s + 1
  – Gap = d + 1

Proof (cont')
• The new scores for the two alignments become:
  – $F_1' = a_1 (m+1) + b_1 (s+1) + c_1 (d+1) = a_1 m + b_1 s + c_1 d + (a_1 + b_1 + c_1) = F_1 + L_1$, where $L_1 = a_1 + b_1 + c_1$ is the length of alignment 1
  – $F_2' = a_2 (m+1) + b_2 (s+1) + c_2 (d+1) = F_2 + (a_2 + b_2 + c_2) = F_2 + L_2$, where $L_2 = a_2 + b_2 + c_2$ is the length of alignment 2

Proof (cont')
• $F_1' - F_2' = F_1 - F_2 + (a_1 + b_1 + c_1) - (a_2 + b_2 + c_2) = k + L_1 - L_2$
• In order for $F_1' < F_2'$, we need $k + L_1 - L_2 < 0$, i.e.
$L_2 - L_1 > k$

Proof (cont')
• This means: if, under the original scoring scheme, $F_1$ is greater than $F_2$ by k, but alignment 2 is longer than alignment 1 by more than k (at least k+1 for integer scores), then $F_2'$ will be greater than $F_1'$ under the new scoring scheme.
• We only need to show one example where two such alignments exist

Example
• Alignment 1: 2 matches, 3 mismatches, so $F_1 = 2m + 3s$

  AACAG
  | |
  ATCGT

• Alignment 2: 3 matches, 4 gaps, so $F_2 = 3m + 4d$

  AA-CAG-
   | | |
  -ATC-GT

• With m = 1, s = d = -2:
  – $F_1 = 2 \times 1 - 3 \times 2 = -4$
  – $F_2 = 3 \times 1 - 4 \times 2 = -5$
  – $F_1 > F_2$
• With m = 2, s = d = -1:
  – $F_1' = 2 \times 2 - 3 \times 1 = 1$
  – $F_2' = 3 \times 2 - 4 \times 1 = 2$
  – $F_2' > F_1'$

• On the other hand, if we had doubled our scores, such that m' = 2m, s' = 2s, d' = 2d
  – $F_1' = 2 F_1$
  – $F_2' = 2 F_2$
  – Our alignment won't be changed

Today
• How to model gaps more accurately?
• Local sequence alignment
• Statistics of alignment

What's a better alignment?

  GACGCCGAACG
  |||||   |||
  GACGC---ACG
  Score = 8 x m – 3 x d

  GACGCCGAACG
  |||| | | ||
  GACG-C-A-CG
  Score = 8 x m – 3 x d

• Both alignments get the same score. However, gaps usually occur in bunches.
  – During evolution, chunks of DNA may be lost entirely
  – Aligning genomic sequence vs. cDNA (reverse complementary to mRNA)

Model gaps more accurately
[Figure: gap penalty $\gamma(n)$ plotted against gap length n.]
• Current model:
  – Gap of length n incurs penalty $n \times d$
• General:
  – Convex function
  – E.g.
$\gamma(n) = c \sqrt{n}$
[Figure: $\gamma(n)$ plotted against n.]

General gap dynamic programming
• Initialization: same
• Iteration:
  $$F(i, j) = \max \begin{cases} F(i-1, j-1) + \sigma(x_i, y_j) \\ \max_{k = 0 \ldots i-1} F(k, j) - \gamma(i - k) \\ \max_{k = 0 \ldots j-1} F(i, k) - \gamma(j - k) \end{cases}$$
• Termination: same
• Running time: $O(N^2 M)$ (cubic)
• Space: O(NM) (linear-space algorithm not applicable)

Compromise: affine gaps
• $\gamma(n) = d + (n - 1) e$
  – d: gap open penalty
  – e: gap extension penalty
[Figure: $\gamma(n)$ plotted against n, starting at d and rising with slope e.]
• Example with match = 2, gap open = 5, gap extension = 1:

  GACGCCGAACG
  |||||   |||
  GACGC---ACG
  Score: 8x2 - 5 - 2 = 9

  GACGCCGAACG
  |||| | | ||
  GACG-C-A-CG
  Score: 8x2 - 3x5 = 1

Additional states
• The amount of state needed increases
  – In scoring a single entry in our matrix, we need to remember an extra piece of information
    • Are we continuing a gap in x? (if not, starting one is more expensive)
    • Are we continuing a gap in y? (if not, starting one is more expensive)
    • Are we continuing from a match between xi and yj?

Finite State Automaton
[Figure: three-state automaton. The state "Xi and Yj aligned" has transitions labeled d to the states "Xi aligned to a gap" and "Yj aligned to a gap"; each gap state has a self-loop labeled e.]

Dynamic programming
• We encode this information in three different matrices
• For each element (i, j) we use three variables
  – F(i, j): best alignment of x1..xi & y1..yj if xi aligns to yj
  – Ix(i, j): best alignment of x1..xi & y1..yj if yj aligns to a gap
  – Iy(i, j): best alignment of x1..xi & y1..yj if xi aligns to a gap
• Recurrences:
  $$F(i, j) = \sigma(x_i, y_j) + \max \begin{cases} F(i-1, j-1) & \text{continuing alignment} \\ I_x(i-1, j-1) & \text{closing gaps in } x \\ I_y(i-1, j-1) & \text{closing gaps in } y \end{cases}$$
  $$I_x(i, j) = \max \begin{cases} F(i, j-1) - d & \text{opening a gap in } x \\ I_y(i, j-1) - d \\ I_x(i, j-1) - e & \text{gap extension in } x \end{cases}$$
  $$I_y(i, j) = \max \begin{cases} F(i-1, j) - d & \text{opening a gap in } y \\ I_x(i-1, j) - d \\ I_y(i-1, j) - e & \text{gap extension in } y \end{cases}$$

[Figure: the three matrices F, Ix, Iy stacked, with the dependencies between cells.]
• If we stack all three matrices
  – No cyclic dependency
  – We can fill in all three matrices in order

Algorithm
• for i = 1:M
  – for j = 1:N
    • Fill in F(i, j), Ix(i, j), Iy(i, j)
  – end
• end
• F(M, N) = max (F(M, N), Ix(M, N), Iy(M, N))
• Time: O(MN)
• Space: O(MN), or O(N) when combined with the linear-space algorithm

To simplify
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + \sigma(x_i, y_j) \\ I(i-1, j-1) + \sigma(x_i, y_j) \end{cases}$$
$I(i, j) =$
$\max \{\, F(i, j-1) - d,\ I(i, j-1) - e,\ F(i-1, j) - d,\ I(i-1, j) - e \,\}$
• I(i, j): best alignment between x1…xi and y1…yj if either xi or yj is aligned to a gap
• This is possible because alternating gaps (a gap in one sequence directly followed by a gap in the other) are not allowed

To summarize
• Global alignment
  – Basic algorithm: Needleman-Wunsch
  – Variants:
    • Overlapping detection
    • Longest common subsequences
    • Achieved by varying initial conditions or scoring
  – Bounded DP (pruning search space)
  – Linear space (divide-and-conquer)
  – Affine gap penalty

Local alignment

The local alignment problem
• Given two strings X = x1……xM, Y = y1……yN
• Find substrings x', y' whose similarity (optimal global alignment value) is maximum
• e.g. X = abcxdex, Y = xxxcde: x' = cxde and y' = cde, aligned as

  cxde
  c-de

Why local alignment
• Conserved regions may be a small part of the whole
  – "Active site" of a protein
  – Scattered genes or exons among "junk"
  – Don't have the whole sequence
• Global alignment might miss them if flanking "junk" outweighs similar regions
• Genes are shuffled between genomes
[Figure: blocks A, B, C, D appearing in different orders in different genomes.]

Naïve algorithm
• for all substrings X' of X and Y' of Y
  – Align X' & Y' via dynamic programming
  – Retain pair with max value
• end
• Output the retained pair
• Time: $O(n^2)$ choices for X', $O(m^2)$ for Y', O(nm) for DP, so $O(n^3 m^3)$ total.

Reminder
• The overlap detection algorithm
  – We do not penalize gaps at the ends
[Figure: two overlapping sequences with "free gap" regions at the ends.]

Similar here
• We are free of penalty for the unaligned regions

The big idea
• Whenever we get to some bad region (negative score), we ignore the previous alignment
  – Reset score to zero

The Smith-Waterman algorithm
• Initialization: F(0, j) = F(i, 0) = 0
• Iteration:
  $$F(i, j) = \max \begin{cases} 0 \\ F(i-1, j) - d \\ F(i, j-1) - d \\ F(i-1, j-1) + \sigma(x_i, y_j) \end{cases}$$

The Smith-Waterman algorithm
• Termination:
1. If we want the best local alignment: $F_{OPT} = \max_{i,j} F(i, j)$
2.
If we want all local alignments scoring > t: for all i, j with F(i, j) > t, trace back
• The correctness of the algorithm can be proved by induction using the alignment graph

Example
• Match: 2, Mismatch: -1, Gap: -1
• X = abcxdex (columns), Y = xxxcde (rows)
• The first row and column are initialized to 0 and the matrix is then filled in column by column. The completed matrix:

         a   b   c   x   d   e   x
     0   0   0   0   0   0   0   0
  x  0   0   0   0   2   1   0   2
  x  0   0   0   0   2   1   0   2
  x  0   0   0   0   2   1   0   2
  c  0   0   0   2   1   1   0   1
  d  0   0   0   1   1   3   2   1
  e  0   0   0   0   0   2   5   4

Trace back
• Start from the maximum entry (5, at row e, column e) and follow the choices backwards until a 0 is reached
• Two optimal local alignments are found:

  cxde        x-de
  | ||        | ||
  c-de        xcde

• No negative values in the local alignment DP array
• An optimal local alignment will never have a gap on either end
• Local alignment: "Smith-Waterman"
• Global alignment: "Needleman-Wunsch"

Analysis
• Time:
  – O(MN) for finding the best alignment
  – For all alignments above a threshold, it also depends on the number of sub-optimal alignments
• Memory:
  – O(MN)
  – O(M+N) possible

The statistics of alignment
• Where does $\sigma(x_i, y_j)$ come from?
• Are two aligned sequences actually related?
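The Smith-Waterman recurrence and the abcxdex / xxxcde example above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `smith_waterman` and its parameters are assumptions, the gap is passed as a (negative) score as in the example, and when several tracebacks are possible it reports only one of them.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=-1):
    """Return (best_score, aligned_x, aligned_y) for a linear gap score."""
    m, n = len(x), len(y)
    # F[i][j]: best score of a local alignment ending at x[i-1], y[j-1].
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # The extra 0 resets the score whenever the alignment turns negative.
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Trace back from the best cell until a zero entry is reached.
    i, j = best_pos
    ax, ay = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("abcxdex", "xxxcde"))  # (5, 'cxde', 'c-de')
```

The tie-breaking order in the traceback (diagonal, then up, then left) is an arbitrary choice; a different order would report the other optimal alignment, x-de / xcde, with the same score.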
Probabilistic model of alignments
• We'll focus on protein alignments without gaps
• Given an alignment, we can consider two possibilities
  – R: the sequences are related by evolution
  – U: the sequences are unrelated
• How can we distinguish these possibilities?
• How is this view related to amino-acid substitution matrices?

Model for unrelated sequences
• Assume each position of the alignment is independently sampled from some distribution of amino acids
• $p_s$: probability of amino acid s in the sequences
• Probability of seeing an amino acid s aligned to an amino acid t by chance is
  – $\Pr(s, t \mid U) = p_s \, p_t$
• Probability of seeing an ungapped alignment between x = x1…xn and y = y1…yn randomly is
  – $\Pr(x, y \mid U) = \prod_{i=1}^{n} p_{x_i} \prod_{i=1}^{n} p_{y_i}$

Model for related sequences
• Assume each pair of aligned amino acids evolved from a common ancestor
• Let $q_{st}$ be the probability that amino acid s in one sequence is related to t in another sequence
• The probability of an alignment of x and y is given by
  – $\Pr(x, y \mid R) = \prod_{i=1}^{n} q_{x_i y_i}$

Probabilistic model of alignments
• How can we decide which possibility (U or R) is more likely?
• One principled way is to consider the relative likelihood of the two possibilities (the odds ratio)
  – $\frac{\Pr(x, y \mid R)}{\Pr(x, y \mid U)} = \prod_{i=1}^{n} \frac{q_{x_i y_i}}{p_{x_i} p_{y_i}}$
  – A higher ratio means that R is more likely than U

Log odds ratio
• Taking the log, we get
  – $\log \frac{\Pr(x, y \mid R)}{\Pr(x, y \mid U)} = \sum_{i=1}^{n} \log \frac{q_{x_i y_i}}{p_{x_i} p_{y_i}}$
• Recall that the score of an alignment is given by
  – $S = \sum_{i=1}^{n} \sigma(x_i, y_i)$
• Therefore, if we define
  – $\sigma(s, t) = \log \frac{q_{st}}{p_s p_t}$
  the alignment score equals the log odds ratio
• This is indeed how biologists have defined the substitution matrices for proteins

• We are actually defining the alignment score as the log odds ratio (log likelihood ratio) between the two models R and U
• $p_s$ can be counted from the available protein sequences
• But how do we get $q_{st}$?
(the probability that s and t have a common ancestor)
• Counted from trusted alignments of related sequences

Protein Substitution Matrices
• Two popular sets of matrices for protein sequences
  – PAM matrices [Dayhoff et al, 1978]
    • Better for aligning closely related sequences
  – BLOSUM matrices [Henikoff & Henikoff, 1992]
    • For both closely and remotely related sequences

[Figure: a protein substitution matrix, annotated as follows.]
• Positive scores for chemically similar substitutions
• Common amino acids get low weights
• Rare amino acids get high weights

BLOSUM-N matrices
• Constructed from a database called BLOCKS
• Contains many closely related sequences
  – Conserved amino acids may be over-counted
• N = 62: the probabilities $q_{st}$ were computed using trusted alignments with no more than 62% identity
  – identity: % of matched columns
• Using this matrix, the Smith-Waterman algorithm is most effective in detecting real alignments with a similar identity level (i.e. ~62%)

• If you want to detect homologous genes with high identity, you may want a BLOSUM matrix with higher N, say BLOSUM75
• On the other hand, if you want to detect remote homology, you may want to use a lower N, say BLOSUM50
• BLOSUM62 is the standard

For DNAs
• No database of trusted alignments to start with
• Specify the percentage identity you would like to detect
• You can then get the substitution matrix by some calculation

For example
• Suppose pA = pC = pT = pG = 0.25
• We want 88% identity
• qAA = qCC = qTT = qGG = 0.88/4 = 0.22, the rest = 0.12/12 = 0.01
• $\sigma(A, A) = \sigma(C, C) = \sigma(G, G) = \sigma(T, T) = \log (0.22 / (0.25 \times 0.25)) = 1.26$ (natural log)
• $\sigma(s, t) = \log (0.01 / (0.25 \times 0.25)) = -1.83$ for s ≠ t.
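The 88% identity calculation above can be reproduced with a short Python sketch. The function name `dna_substitution_scores` and its parameters are illustrative assumptions; it assumes a uniform background (0.25 per nucleotide), the natural logarithm, and that the mismatch probability is spread evenly over the 12 non-identical ordered pairs.

```python
import math


def dna_substitution_scores(identity, p=0.25):
    """Log-odds scores sigma(s, t) = log(q_st / (p_s * p_t)) for a uniform DNA model.

    identity: target fraction of matched columns, e.g. 0.88.
    p: background frequency of each nucleotide (assumed uniform).
    """
    q_match = identity / 4            # 4 identical pairs share the identity mass
    q_mismatch = (1 - identity) / 12  # 12 non-identical ordered pairs share the rest
    match = math.log(q_match / (p * p))
    mismatch = math.log(q_mismatch / (p * p))
    return match, mismatch


match, mismatch = dna_substitution_scores(0.88)
# Full 4x4 matrix as a dictionary keyed by nucleotide pair.
matrix = {(s, t): (match if s == t else mismatch) for s in "ACGT" for t in "ACGT"}
print(round(match, 2), round(mismatch, 2))  # 1.26 -1.83
```

Changing `identity` shows how the balance between the match reward and the mismatch penalty shifts with the level of similarity one is trying to detect.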
Substitution matrix

        A      C      G      T
  A   1.26  -1.83  -1.83  -1.83
  C  -1.83   1.26  -1.83  -1.83
  G  -1.83  -1.83   1.26  -1.83
  T  -1.83  -1.83  -1.83   1.26

       A   C   G   T
  A    5  -7  -7  -7
  C   -7   5  -7  -7
  G   -7  -7   5  -7
  T   -7  -7  -7   5

• Scale won't change the alignment
• Multiply by 4 and then round off to get integers

Arbitrary substitution matrix
• Say you have a substitution matrix provided by someone
• It's important to know what you are actually looking for when you use the matrix

  Matrix 1:
       A   C   G   T
  A    1  -2  -2  -2
  C   -2   1  -2  -2
  G   -2  -2   1  -2
  T   -2  -2  -2   1

  Matrix 2:
       A   C   G   T
  A    5  -4  -4  -4
  C   -4   5  -4  -4
  G   -4  -4   5  -4
  T   -4  -4  -4   5

• What's the difference?
• Which one should I use?

• We had $\sigma(s, t) = \log \frac{q_{st}}{p_s p_t}$
• Scale it, so that $\lambda \, \sigma(s, t) = \log \frac{q_{st}}{p_s p_t}$
• Reorganize: $q_{st} = p_s \, p_t \, e^{\lambda \sigma(s, t)}$
• Since all probabilities must sum to 1, $\sum_{s,t} q_{st} = 1$
• We have $\sum_{s,t} p_s \, p_t \, e^{\lambda \sigma(s, t)} = 1$
• Suppose again $p_s$ = 0.25 for any s
• We know $\sigma(s, t)$ from the substitution matrix
• We can solve the equation for λ
• Plug λ into $q_{st} = p_s \, p_t \, e^{\lambda \sigma(s, t)}$ to get $q_{st}$

Results
• Matrix 1 (match 1, mismatch -2): λ = 1.33
  – $q_{st}$ = 0.24 for s = t, and 0.004 for s ≠ t
  – Translate: 95% identity
• Matrix 2 (match 5, mismatch -4): λ ≈ 0.19 ($e^{\lambda}$ ≈ 1.21)
  – $q_{st}$ = 0.16 for s = t, and 0.03 for s ≠ t
  – Translate: 65% identity
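The equation for λ has no closed form in general, so it is usually solved numerically. The sketch below uses plain bisection for the uniform DNA case; the function name `solve_lambda`, the bracket `[1e-6, 10]`, and the choice of bisection are assumptions made for illustration, not something the slides prescribe. Note that λ = 0 always satisfies the equation trivially, so the search starts just above zero.

```python
import math


def solve_lambda(match, mismatch, p=0.25, hi=10.0, tol=1e-12):
    """Positive root of sum_{s,t} p_s p_t exp(lambda * sigma(s, t)) = 1.

    Assumes four nucleotides with equal background frequency p = 0.25 and a
    matrix with one match score and one mismatch score.
    """
    def excess(lam):
        same = 4 * p * p * math.exp(lam * match)      # 4 identical pairs
        diff = 12 * p * p * math.exp(lam * mismatch)  # 12 non-identical pairs
        return same + diff - 1.0

    lo = 1e-6  # start just above the trivial root lambda = 0
    if not (excess(lo) < 0 < excess(hi)):
        raise ValueError("no sign change; the expected score must be negative")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if excess(mid) < 0:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    q_match = p * p * math.exp(lam * match)
    q_mismatch = p * p * math.exp(lam * mismatch)
    return lam, q_match, q_mismatch


for scores in [(1, -2), (5, -4)]:
    lam, q_same, q_diff = solve_lambda(*scores)
    # Target identity is the total probability of the 4 identical pairs.
    print(scores, round(lam, 2), round(4 * q_same, 2))
```

For the 1/-2 matrix this gives λ ≈ 1.33 and about 95% identity; for the 5/-4 matrix it gives λ ≈ 0.19 (that is, e^λ ≈ 1.21) and about 65% identity.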
