Download L3: Blast: Keyword match basics

Document related concepts
no text concepts found
Transcript
CSE182-L4:
Keyword matching
Fa05
CSE 182
Backward scoring
• Defin Sb[i,j] : Best scoring alignment of the
suffixes s[i+1..n] and t[j+1..m]
• Q: What is the score of the best alignment of the
two strings s and t?
• HW: Write the recurrences for Sb
Fa05
CSE 182
Forward/Backward computations
• F[j] = Score of the best
scoring alignment of
s[1..n/2] and t[1..j]
– F[j] = S[n/2,j]
• B[j] = Score of the best
scoring alignment of
s[n/2+1..n] and t[j+1..m]
– B[j] = Sb[n/2,j]
Fa05
CSE 182
1
n/2
j
m
Forward/Backward computations
1
• At the optimal
coordinate, j
– F[j]+B[j]=S[n,m]
• In O(nm) time, and
O(m) space, we can
compute one of the
coordinates on the
optimum path.
Fa05
n/2
CSE 182
j
m
Forward, Backward computation
• There exists a
coordinate, j
– F[j]+B[j]=S[n,m]
• In O(nm) time, and
O(m) space, we can
compute one of the
coordinates on the
optimum path.
Fa05
CSE 182
Linear Space Alignment
• Align(1..n,1..m)
– For all 1<=j <= m
• Compute F[j]=S(n/2,j)
– For all 1<=j <= m
–
–
–
–
Fa05
• Compute B[j]=Sb(n/2,j)
j* = maxj {F[j]+B[j] }
X = Align(1..n/2,1..j*)
Y = Align(n/2..n,j*..m)
Return X,j*,Y
CSE 182
Linear Space complexity
• T(nm) = c.nm + T(nm/2) = O(nm)
• Space = O(m)
Fa05
CSE 182
Summary
• We considered the basics of sequence alignment
– Opt score computation
– Reconstructing alignments
– Local alignments
– Affine gap costs
– Space saving measures
• Can we recreate Blast?
Fa05
CSE 182
Blast and local alignment
• Concatenate all of the database sequences to form
one giant sequence.
• Do local alignment computation with the query.
Fa05
CSE 182
Large database search
Database (n)
Query (m)
Fa05
Database size n=100M, Querysize m=1000.
O(nm) = 1011 computations
CSE 182
Why is Blast Fast?
Fa05
CSE 182
Silly Question!
• True or False: No two people in new york city have
the same number of hair
Fa05
CSE 182
Observations
• Much of the database is random from the query’s
perspective
• Consider a random DNA string of length n.
– Pr[A]=Pr[C] = Pr[G]=Pr[T]=0.25
• Assume for the moment that the query is all A’s
(length m).
• What is the probability that an exact match to
the query can be found?
Fa05
CSE 182
Basic probability
• Probability that there is a match starting at a
fixed position i = 0.25m
• What is the probability that some position i has a
match.
• Dependencies confound probability estimates.
Fa05
CSE 182
Basic Probability:Expectation
• Q: Toss a coin: each time it comes up heads, you
get a dollar
– What is the money you expect to get after n
tosses?
– Let Xi be the amount earned in the i-th toss
E(X i ) 1.p  0.(1 p)  p
 Total money you expect to earn

Fa05
E( X i )   E(X i ) np
i
i
CSE 182
Expected number of matches
•
Expected number of matches can still be computed.
i

Let Xi=1 if there is a match starting at position i, Xi=0 otherwise
Pr(Match at Position i )  pi  0.25 m
E(X i )  pi  0.25 m

Expected number of matches =

E( X i )  
i
Fa05
i
 4
E(X i )  n 1
CSE 182
m
Expected number of exact
Matches is small!
• Expected number of matches = n*0.25m
– If n=107, m=10,
• Then, expected number of matches = 9.537
– If n=107, m=11
• expected number of hits = 2.38
– n=107,m=12,
• Expected number of hits = 0.5 < 1
• Bottom Line: An exact match to a substring of the query is
unlikely just by chance.
Fa05
CSE 182
Observation 2
• What is the pigeonhole principle?
 Suppose we are looking for a database string with
greater than 90% identity to the query (length 100)
 Partition the query into size 10 substrings. At least
one much match the database string exactly
Fa05
CSE 182
Why is this important?
•
•
•
Suppose we are looking for sequences that are 80% identical to the query sequence of length 100.
Assume that the mismatches are randomly distributed.
What is the probability that there is no stretch of 10 bp, where the query and the subject match
exactly?
 8

  0.000036
 1
10 

 
10 90
•
Rough calculations show that it is very low. Exact match of a short query substring to a truly similar
subject is very high.

– The above equation does not take dependencies into account
– Reality is better because the matches are not randomly distributed

Fa05
CSE 182
Just the Facts
• Consider the set of all substrings of the query
string of fixed length W.
– Prob. of exact match to a random database
string is very low.
– Prob. of exact match to a true homolog is very
high.
– Keyword Search (exact matches) is MUCH
faster than sequence alignment
Fa05
CSE 182
BLAST
•
•
•
•
Database (n)
Consider all (m-W) query words of size W (Default = 11)
Scan the database for exact match to all such words
For all regions that hit, extend using a dynamic programming alignment.
Can be many orders of magnitude faster than SW over the entire string
Fa05
CSE 182
Why is BLAST fast?
• Assume that keyword searching does not consume any time and
that alignment computation the expensive step.
• Query m=1000, random Db n=107, no TP
• SW = O(nm) = 1000*107 = 1010 computations
• BLAST, W=11
• E(#11-mer hits)= 1000* (1/4)11 * 107=2384
• Number of computations = 2384*100*100=2.384*107
• Ratio=1010/(2.384*107)=420
• Further speed improvements are possible
Fa05
CSE 182
Keyword Matching
• How fast can we match
keywords?
• Hash table/Db index?
What is the size of the
hash table, for m=11
• Suffix trees? What is
the size of the suffix
trees?
• Trie based search. We
will do this in class.
Fa05
CSE 182
AATCA
567
Related notes
• How to choose the alignment region?
– Extend greedily until the score falls below a certain
threshold
• What about protein sequences?
– Default word size = 3, and mismatches are allowed.
• Like sequences, BLAST has been evolving continuously
– Banded alignment
– Seed selection
– Scanning for exact matches, keyword search versus
database indexing
Fa05
CSE 182
P-value computation
• How significant is a score? What happens to
significance when you change the score function
• A simple empirical method:
• Compute a distribution of scores against a random
database.
• Use an estimate of the area under the curve to get
the probability.
• OR, fit the distribution to one of the standard
distributions.
Fa05
CSE 182
Z-scores for alignment
• Initial assumption was that the scores followed a
normal distribution.
• Z-score computation:
– For any alignment, score S, shuffle one of the
sequences many times, and recompute alignment.
Get mean and standard deviation
ZS 
S 

– Look up a table to get a P-value
Fa05
CSE 182
Blast E-value
•
•
•
Initial (and natural) assumption was that scores followed a Normal distribution
1990, Karlin and Altschul showed that ungapped local alignment scores follow an
exponential distribution
Practical consequence:
Fa05
– Longer tail.
– Previously significant hits now not so significant
CSE 182
Exponential distribution
•
•
Random Database, Pr(1) = p
What is the expected number of hits to a sequence of k 1’s
(n  k) p  ne
k
•

k ln p
 ne
 1 
k ln 
 p 
Instead, consider a random binary Matrix. Expected # of diagonals
of k 1s
 1 
k ln 
 p 
  (n  k)(m  k) p k  nmek ln p  nme
Fa05
CSE 182
•
•
As you increase k, the number decreases exponentially.
The number of diagonals of k runs can be approximated by a Poisson process
ue
Pr[u] 
u!
•
•
Pr[u  0]  1 e
In ungapped alignments, we replace the coin tosses by column scores, but
the behaviour does not change (Karlin & Altschul).
As the score increases, the number of alignments that achieve the score
decreases exponentially

Fa05
CSE 182
Blast E-value
•
•
Choose a score such that the expected score between a pair of
residues < 0
Expected number of alignments with a particular score
 S
E  Kmne
 mn2
Pr(S  x)  1 e
•
Sln K 


 ln 2 
Kmne  x
For small values, E-value and P-value are the same
Fa05
CSE 182
Blast Variants
1.
2.
3.
4.
5.
What is mega-blast?
What is discontiguous megablast?
Phi-Blast/Psi-Blast?
BLAT?
PatternHunter?
Fa05
CSE 182
Longer seeds.
Seeds with don’t care values
Later
Database pre-processing
Seeds with don’t care values
Silly Quiz
• Name a famous Bioinformatics Researcher
• Name a famous Bioinformatics Researcher who is a
woman
Fa05
CSE 182
Scoring DNA
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
• DNA has structure.
Fa05
CSE 182
DNA scoring matrices
• So far, we considered a
simple match/mismatch
criterion.
• The nucleotides can be
grouped into Purines
(A,G) and Pyrimidines.
• Nucleotide substitutions
within a group
(transitions) are more
likely than those across
a group (transversions)
Fa05
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
CSE 182
Scoring proteins
• Scoring protein sequence alignments is a
much more complex task than scoring DNA
– Not all substitutions are equal
• Problem was first worked on by Pauling and
collaborators
• In the 1970s, Margaret Dayhoff created
the first similarity matrices.
– “One size does not fit all”
– Homologous proteins which are evolutionarily
close should be scored differently than
proteins that are evolutionarily distant
– Different proteins might evolve at different
rates and we need to normalize for that
Fa05
CSE 182
PAM 1 distance
• Two sequences are 1 PAM apart if they differ in 1
% of the residues.
1% mismatch
• PAM1(a,b) = Pr[residue b substitutes residue a,
when the sequences are 1 PAM apart]
Fa05
CSE 182
PAM1 matrix
• Align many proteins that are very similar
– Is this a problem?
• PAM1 distance is the probability of a substitution
when 1% of the residues have changed
• Estimate the frequency Pb|a of residue a being
substituted by residue b.
• S(a,b) = log10(Pab/PaPb) = log10(Pb|a/Pb)
Fa05
CSE 182
PAM 1
Fa05
CSE 182
PAM distance
• Two sequences are 1 PAM apart when they differ
in 1% of the residues.
• When are 2 sequences 2 PAMs apart?
1 PAM
2 PAM
Fa05
1 PAM
CSE 182
Higher PAMs
• PAM2(a,b) = ∑c PAM1(a,c). PAM1 (c,b)
• PAM2 = PAM1 * PAM1 (Matrix multiplication)
• PAM250
– = PAM1*PAM249
– = PAM1250
Fa05
CSE 182
Note: This is not the score matrix:
What happens as you keep increasing the power?
Fa05
CSE 182
Scoring using PAM matrices
• Suppose we know that two sequences are 250
PAMs apart.
• S(a,b) = log10(Pab/PaPb)= log10(Pb|a/Pb) =
log10(PAM250(a,b)/Pb)
Fa05
CSE 182
BLOSUM series of Matrices
• Henikoff & Henikoff: Sequence substitutions in
evolutionarily distant proteins do not seem to follow the
PAM distributions
• A more direct method based on hand-curated multiple
alignments of distantly related proteins from the BLOCKS
database.
• BLOSUM60 Merge all proteins that have greater than 60%.
Then, compute the substitution probability.
– In practice BLOSUM62 seems to work very well.
Fa05
CSE 182
PAM vs. BLOSUM
• What is the correspondence?
• PAM1
• PAM2
Blosum1
Blosum2
•
Blosum62
• PAM250
Blosum100
Fa05
CSE 182
Dictionary Matching, R.E. matching, and
position specific scoring
Fa05
CSE 182
Keyword search
• Recall: In BLAST, we get a collection of keywords
from the query sequence, and identify all db
locations with an exact match to the keyword.
• Question: Given a collection of strings (keywords),
find all occrrences in a database string where they
keyword might match.
Fa05
CSE 182
Dictionary Matching
1:POTATO
2:POTASSIUM
3:TASTE
P O T A S T P O T A T O
database
dictionary
• Q: Given k words (si has length li), and a database of
size n, find all matches to these words in the
database string.
• How fast can this be done?
Fa05
CSE 182
Dict. Matching & string matching
• How fast can you do it, if you only had one word of length m?
– Trivial algorithm O(nm) time
– Pre-processing O(m), Search O(n) time.
• Dictionary matching
– Trivial algorithm (l1+l2+l3…)n
– Using a keyword tree, lpn (lp is the length of the longest
pattern)
– Aho-Corasick: O(n) after preprocessing O(l1+l2..)
• We will consider the most general case
Fa05
CSE 182
Direct Algorithm
P O P O P O T A S T P O T A T O
P O
P T
O
P
T
O
A
TA A
O
P O T A T O
P A
O T
T O
T TO O
Observations:
• When we mismatch, we (should) know something about where
the next match will be.
• When there is a mismatch, we (should) know something about
other patterns in the dictionary as well.
Fa05
CSE 182
The Trie Automaton
•
Construct an automaton A from the dictionary
– A[v,x] describes the transition from node v to a node w upon
reading x.
– A[u,’T’] = v, and A[u,’S’] = w
– Special root node r
– Some nodes are terminal, and labeled with the index of the
dictionary word.
r
P
O
T
A
T
S
T
A
Fa05
u
S
w
S
T
E
v
3
CSE 182
1:POTATO
2:POTASSIUM
3:TASTE
1
O
I
U
M
2
An O(lpn) algorithm for keyword matching
•
•
•
Fa05
CSE 182
Start with the first position in
the db, and the root node.
If successful transition
– Increment current pointer
– Move to a new node
– If terminal node “success”
Else
– Retract ‘current’ pointer
– Increment ‘start’ pointer
– Move to root & repeat
Illustration:
l c
P O T A S T P O T A T O
v
P
O
T
A
S
T
A
Fa05
T
S
T
1
O
S
E
CSE 182
I
U
M
Idea for improving the time
•
Suppose we have partially matched pattern i (indicated by l, and c),
but fail subsequently. If some other pattern j is to match
– Then prefix(pattern j) = suffix [ first c-l characters of
pattern(i))
l
c
P O T A S T P O T A T O
Pattern i
P O T A S S I U M
T A S T E
Pattern j
Fa05
CSE 182
1:POTATO
2:POTASSIUM
3:TASTE
Improving speed of dictionary matching
•
Every node v corresponds to a string sv that is a prefix of some
pattern.
Define F[v] to be the node u such that su is the longest suffix of sv
If we fail to match at v, we should jump to F[v], and commence
matching from there
Let lp[v] = |su|
•
•
•
P 2 O 3 T 4 A 5 T
1
S6
T
7
Fa05
A
8
O
S 11 I
S 9 T 10 E
CSE 182
U
M
An O(n) alg. For keyword matching
•
•
•
•
Fa05
CSE 182
Start with the first position in the
db, and the root node.
If successful transition
– Increment current pointer
– Move to a new node
– If terminal node “success”
Else (if at root)
– Increment ‘current’ pointer
– Mv ‘start’ pointer
– Move to root
Else
– Move ‘start’ pointer forward
– Move to failure node
Illustration
P O T A S T P O T A T O
lc
P
v
O
A
T
T
S
A
Fa05
T
S
T
1
O
S
E
CSE 182
I
U
M
Time analysis
• In each step, either c is
incremented, or l is incremented
• Neither pointer is ever
decremented (lp[v] < c-l).
• l and c do not exceed n
• Total time <= 2n
l
c
P O T A S T P O T A T O
Fa05
CSE 182
Blast: Putting it all together
• Input: Query of
length m, database of
size n
• Select word-size,
scoring matrix, gap
penalties, E-value
cutoff
Fa05
CSE 182
Blast Steps
1. Generate an automaton of all query keywords.
2. Scan database using a “Dictionary Matching” algorithm
(O(n) time). Identify all hits.
3. Extend each hit using a variant of “local alignment”
algorithm. Use the scoring matrix and gap penalties.
4. For each alignment with score S, compute the bit-score, Evalue, and the P-value. Sort according to increasing E-value
until the cut-off is reached.
5. Output results.
Fa05
CSE 182
Protein Sequence Analysis
• What can you do if BLAST does not return a hit?
– Sometimes, homology (evolutionary similarity) exists at very
low levels of sequence similarity.
• A: Accept hits at higher P-value.
– This increases the probability that the sequence similarity is a
chance event.
– How can we get around this paradox?
– Reformulated Q: suppose two sequences B,C have the same
level of sequence similarity to sequence A. If A& B are related
in function, can we assume that A& C are? If not, how can we
distinguish?
Fa05
CSE 182