Sequence analysis
How to locate rare/important subsequences.
Sequence Analysis Tasks
Representing sequence features, and finding
sequence features using consensus sequences and
frequency matrices
Sequence features
Features following an exact pattern- restriction
enzyme recognition sites
Features with approximate patterns
promoters
transcription initiation sites
transcription termination sites
polyadenylation sites
ribosome binding sites
protein features
Representing uncertainty in
nucleotide sequences
It is often the case that we would like to
represent uncertainty in a nucleotide
sequence, i.e., that more than one base is
“possible” at a given position
to express ambiguity during sequencing
to express variation at a position in a gene
during evolution
to express ability of an enzyme to tolerate
more than one base at a given position of a
recognition site
Representing uncertainty in
nucleotide sequences
To do this for nucleotides, we use a set of
single character codes that represent all
possible combinations of bases
This set was proposed and adopted by the
International Union of Biochemistry and is
referred to as the I.U.B. code
Given the size of the amino acid “alphabet”, it
is not practical to design a set of codes for
ambiguity in protein sequences
The I.U.B. Code
A, C, G, T, U
R = A, G (puRine)
Y = C, T (pYrimidine)
S = G, C (Strong hydrogen bonds)
W = A, T (Weak hydrogen bonds)
M = A, C (aMino group)
K = G, T (Keto group)
B = C, G, T (not A)
D = A, G, T (not C)
H = A, C, T (not G)
V = A, C, G (not T/U)
N = A, C, G, T/U (iNdeterminate); X or - are sometimes used instead
Definitions
A sequence feature is a pattern that is
observed to occur in more than one
sequence and (usually) to be correlated with
some function
A consensus sequence is a sequence that
summarizes or approximates the pattern
observed in a group of aligned sequences
containing a sequence feature
Consensus sequences are regular
expressions
Finding occurrences of consensus
sequences
Example: recognition site for a restriction enzyme
EcoRI recognizes GAATTC
AccI recognizes GTMKAC
Basic Algorithm
Start with first character of sequence to be searched
See if enzyme site matches starting at that position
Advance to next character of sequence to be searched
Repeat previous two steps until all positions have been
tested
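The basic algorithm above can be sketched in a few lines of Python. This is a minimal illustration, not the original course code: the names IUB and find_sites are made up here, U is treated as T, and the two test sequences are invented for the example.

```python
# IUB codes mapped to the bases each one allows (U is treated as T here).
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def find_sites(consensus, sequence):
    """Return 1-based start positions where the consensus matches the sequence."""
    hits = []
    m = len(consensus)
    # Try every start position in turn, as in the basic algorithm.
    for start in range(len(sequence) - m + 1):
        window = sequence[start:start + m]
        if all(base in IUB[code] for code, base in zip(consensus, window)):
            hits.append(start + 1)
    return hits

print(find_sites("GAATTC", "TTGAATTCAAGAATTCG"))   # EcoRI: [3, 11]
print(find_sites("GTMKAC", "AAGTAGACTTGTCTACGG"))  # AccI:  [3, 11]
```

The AccI example shows why IUB codes matter: GTAGAC and GTCTAC both satisfy GTMKAC, because M allows A or C and K allows G or T.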
Block Diagram for Search with a
Consensus Sequence
[Block diagram: consensus sequence (in IUB codes) and sequence to be searched → search engine → list of positions where matches occur]
Statistics of pattern appearance
Goal: Determine the significance of observing a
feature (pattern)
Method: Estimate the probability that that pattern
would occur randomly in a given sequence. Three
different methods
Assume all nucleotides are equally frequent
Use measured frequencies of each nucleotide
(mononucleotide frequencies)
Use measured frequencies with which a given
nucleotide follows another (dinucleotide frequencies)
Determining mononucleotide
frequencies
Count how many times each nucleotide appears in
sequence
Divide (normalize) by total number of nucleotides
Result:
fA = mononucleotide frequency of A
(frequency that A is observed)
Define:
pA = mononucleotide probability that a
nucleotide will be an A
pA is assumed to equal fA
Determining dinucleotide
frequencies
Make 4 x 4 matrix, one element for each
ordered pair of nucleotides
Zero all elements
Go through sequence linearly, adding one to
matrix entry corresponding to the pair of
sequence elements observed at that position
Divide by total number of dinucleotides
Result: fAC = dinucleotide frequency of AC
(frequency that AC is observed out of all
dinucleotides)
Determining conditional
dinucleotide probabilities
Divide each dinucleotide frequency by the
mononucleotide frequency of the first
nucleotide
Result: p*AC = conditional dinucleotide
probability of observing a C given an A
p*AC = fAC / fA
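The three counting procedures above might be sketched as follows. The function names are illustrative assumptions. Dinucleotide counts are divided by the number of dinucleotides (sequence length minus one), as the text describes, so for very short sequences the ratio fXY / fX is only approximate.

```python
from collections import Counter

def mono_freqs(seq):
    """f_X: count of each nucleotide divided by the sequence length."""
    counts = Counter(seq)
    return {x: counts[x] / len(seq) for x in "ACGT"}

def dinuc_freqs(seq):
    """f_XY: count of each ordered pair divided by the number of dinucleotides."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1
    return {x + y: pairs[x + y] / total for x in "ACGT" for y in "ACGT"}

def conditional_probs(seq):
    """p*_XY = f_XY / f_X; left at 0.0 when X never occurs."""
    f1, f2 = mono_freqs(seq), dinuc_freqs(seq)
    return {xy: (f2[xy] / f1[xy[0]] if f1[xy[0]] > 0 else 0.0) for xy in f2}

seq = "TTTAACTGGG"
print(mono_freqs(seq)["A"], dinuc_freqs(seq)["AC"], conditional_probs(seq)["AC"])
```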
Illustration of probability calculation
What is the probability of observing the
sequence feature ART? A followed by a
purine, (either A or G), followed by a T?
Using equal mononucleotide frequencies
pA = pC = pG = pT = 1/4
pART = 1/4 * (1/4 + 1/4) * 1/4 = 1/32
Illustration (continued)
Using observed mononucleotide frequencies:
pART = pA (pA + pG) pT
Using dinucleotide frequencies:
pART = pA (p*AAp*AT + p*AGp*GT)
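The ART calculation generalizes to any pattern written in IUB codes. The sketch below is one possible implementation under the definitions above (p for mononucleotide probabilities, p_cond for the conditional dinucleotide probabilities p*XY); the function names are assumptions made for this example.

```python
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def prob_mono(pattern, p):
    """Product over positions of the summed probabilities of the allowed bases."""
    total = 1.0
    for code in pattern:
        total *= sum(p[b] for b in IUB[code])
    return total

def prob_dinuc(pattern, p, p_cond):
    """Sum over every concrete sequence matching the pattern, chaining p*_XY."""
    # state[b] = probability that the prefix seen so far ends in base b
    state = {b: p[b] for b in IUB[pattern[0]]}
    for code in pattern[1:]:
        state = {
            b: sum(prob * p_cond[a + b] for a, prob in state.items())
            for b in IUB[code]
        }
    return sum(state.values())

uniform = {b: 0.25 for b in "ACGT"}
uniform_cond = {a + b: 0.25 for a in "ACGT" for b in "ACGT"}
print(prob_mono("ART", uniform))                 # 0.03125, i.e. 1/32
print(prob_dinuc("ART", uniform, uniform_cond))  # 0.03125 with uniform values
```

For ART, prob_dinuc expands to pA (p*AA p*AT + p*AG p*GT), the expression given above.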
Another illustration
What is pACT in the sequence
TTTAACTGGG?
fA = 2/10, fC = 1/10
pA = 0.2
fAC = 1/10, fCT = 1/10 (taking the total as the sequence
length, 10; strictly there are only 9 dinucleotides)
p*AC = 0.1/0.2 = 0.5, p*CT = 0.1/0.1 = 1
pACT = pA p*AC p*CT = 0.2 * 0.5 * 1 = 0.1
(would have been 1/5 * 1/10 * 4/10 = 0.008
using mononucleotide frequencies)
Expected number and spacing
Probabilities are per nucleotide
How do we calculate number of expected
features in a sequence of length L?
Expected number (for large L) ≈ Lp
How do we calculate the expected spacing
between features?
For ART: expected spacing between ART
features = 1/pART
Renewals
For greatest accuracy in calculating spacing
of features, need to consider renewals of a
feature (taking into account whether a feature
can overlap with a neighboring copy of that
feature)
For example what is the frequency of GCGC
in :
ACTGCATGCGCGCATGCGCATATGACGA
Renewals
We define a renewal as the end of a non-overlapping
occurrence of the motif, scanning from left to right.
For example: the renewals of GCGC in
ACTGCATGCGCGCATGCGCATATGCGCGCGC
are at 11, 19, 27, 31
The clump sizes are: 2, 1, 2, 1
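A short sketch of how the renewal positions in this example can be found, assuming a left-to-right scan that skips past each occurrence once it has been counted. The function name find_renewals is illustrative.

```python
def find_renewals(motif, sequence):
    """Return 1-based end positions of left-to-right non-overlapping occurrences."""
    renewals = []
    m = len(motif)
    i = 0
    while i <= len(sequence) - m:
        if sequence[i:i + m] == motif:
            renewals.append(i + m)  # 1-based index of the last matched base
            i += m                  # jump past this occurrence: no overlap allowed
        else:
            i += 1
    return renewals

print(find_renewals("GCGC", "ACTGCATGCGCGCATGCGCATATGCGCGCGC"))  # [11, 19, 27, 31]
```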
Renewals and Clump size.
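Let R be a general pattern:
$R = (r_1, \dots, r_m)$
Let us denote its prefix and suffix of length i by:
$R_{(i)} = (r_1, \dots, r_i)$
$R^{(i)} = (r_{m-i+1}, \dots, r_m)$
The clump size is:
$$c = 1 + \sum_{i=1}^{m-1} p_{r_{i+1}} \cdots p_{r_m} \, \mathbf{1}\{R_{(i)} = R^{(i)}\}$$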
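As an illustration, the clump size can be computed by reading the expression as c = 1 + the sum, over every overlap length i from 1 to m-1 for which the length-i prefix of R equals its length-i suffix, of the product p(r_{i+1}) ... p(r_m). That reading is an interpretation of the slide, and the function name is an assumption.

```python
def clump_size(pattern, p):
    """c = 1 + sum of p[r_{i+1}] * ... * p[r_m] over i where prefix(i) == suffix(i)."""
    m = len(pattern)
    c = 1.0
    for i in range(1, m):
        if pattern[:i] == pattern[m - i:]:  # the pattern can overlap itself by i bases
            tail = 1.0
            for base in pattern[i:]:        # bases needed to complete the overlapping copy
                tail *= p[base]
            c += tail
    return c

uniform = {b: 0.25 for b in "ACGT"}
print(clump_size("GCGC", uniform))    # 1.0625: GCGC overlaps itself by "GC"
print(clump_size("GAATTC", uniform))  # 1.0: no self-overlap
```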
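Clump Frequency
Let us assume that the clumps are distributed
randomly. With ν the expected number of clumps in a
sequence of length n, c the clump size, and λ = ν/n the
clump frequency per position, their frequency and the
interval between any two clumps would be:
$$\nu \, c = n \, p_{r_1} \cdots p_{r_m}$$
$$\frac{1}{\lambda} = \frac{c}{p_{r_1} \cdots p_{r_m}} = \frac{1}{p_{r_1} \cdots p_{r_m}} + \sum_{i=1}^{m-1} \frac{1}{p_{r_1} \cdots p_{r_i}} \, \mathbf{1}\{R_{(i)} = R^{(i)}\}$$
where $R_{(i)}$ and $R^{(i)}$ are the prefix and suffix of R of length i.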
Statistical tests
In order to test whether the motif is over- or under-represented
or non-uniformly distributed, we must test the clump
distribution.
To test motif frequency, we can test whether the
clump count has an average and variance of ν
(the expected number of clumps in the sequence).
To test their distribution, we can divide the
entire sequence into k subsequences of size T, with
m < T ≪ 1/λ (λ being the clump frequency per position),
and test that S has a χ² distribution, where T_i is the
clump count in subsequence i and S is:
$$S = \sum_{i=1}^{k} \frac{(T_i - \nu/k)^2}{\nu/k}$$
Frequency of simple
motifs
Statistics of AT- or GC-rich regions
What is the probability of observing a “run” of
the same nucleotide (e.g., 25 A’s)
Let px be the mononucleotide probability of
nucleotide x
The per nucleotide probability of a run of N
consecutive x’s is $p_x^N$
The probability of occurrence in a sequence
of length L much longer than N is ≈ $L \, p_x^N$
Statistics of AT- or GC-rich regions
What if J “mismatches” are allowed?
Let py be the probability of observing a different
nucleotide (normally py = 1 - px)
The probability of observing n-j of nucleotide x
and j of nucleotide y in a region of length n is
$$\binom{n}{j} \, p_x^{\,n-j} \, p_y^{\,j}, \qquad \binom{n}{j} = \frac{n!}{(n-j)! \, j!}$$
Statistics of AT- or GC-rich regions
As before, we can multiply by L to approximate the
probability of observing that combination in a sequence
of length L
Note that this is the probability of observing exactly N-J
matches and exactly J mismatches. We may also wish
to know the probability of finding at least N-J matches,
which requires summing the probability for I=0 to I=J.
$$\sum_{i=0}^{j} \binom{n}{i} \, p_x^{\,n-i} \, p_y^{\,i}$$
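The run and mismatch probabilities can be checked numerically with a small sketch like the one below. The function names are assumptions; py defaults to 1 - px, as noted above.

```python
from math import comb

def run_prob(p_x, n):
    """Per-position probability of n consecutive x's."""
    return p_x ** n

def exactly_j_mismatches(p_x, n, j, p_y=None):
    """Binomial probability of n-j matches and exactly j mismatches."""
    p_y = 1.0 - p_x if p_y is None else p_y
    return comb(n, j) * p_x ** (n - j) * p_y ** j

def at_most_j_mismatches(p_x, n, j, p_y=None):
    """At least n-j matches: sum the exact probabilities for i = 0..j."""
    return sum(exactly_j_mismatches(p_x, n, i, p_y) for i in range(j + 1))

# Approximate probability of a run of 25 A's somewhere in 1,000,000 bases (pA = 0.25):
print(1_000_000 * run_prob(0.25, 25))
print(at_most_j_mismatches(0.25, 25, 2))
```

Multiplying by L, as in the first print, is only a reasonable approximation while the result stays much smaller than 1.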
Frequency matrices
Frequency matrices
Goal: Describe a sequence feature (or motif)
more quantitatively than possible using
consensus sequences
Definition: For a feature of length m using an
alphabet of n characters, a frequency matrix
is an n by m matrix in which each element
contains the frequency at which a given
member of the alphabet is observed at a
given position in an aligned set of sequences
containing the feature
Weight matrix
Probabilistic model:
How likely is each letter at each motif
position?
       1     2     3     4     5     6     7     8     9
A    .89   .02   .38   .34   .22   .27   .02   .03   .02
C    .04   .91   .20   .17   .28   .31   .30   .04   .02
G    .04   .05   .41   .18   .29   .16   .07   .92   .18
T    .03   .02   .01   .31   .21   .26   .61   .01   .78
Nomenclature
Weight matrices are also known as
Position-specific scoring matrices
Position-specific probability matrices
Position-specific weight matrices
Scoring a motif model
A motif is interesting if it is very different from
the background distribution
       1     2     3     4     5     6     7     8     9
A    .89   .02   .38   .34   .22   .27   .02   .03   .02
C    .04   .91   .20   .17   .28   .31   .30   .04   .02
G    .04   .05   .41   .18   .29   .16   .07   .92   .18
T    .03   .02   .01   .31   .21   .26   .61   .01   .78
Slide annotations: “less interesting” (columns close to the background
distribution) and “more interesting” (columns far from it)
Relative entropy
A motif is interesting if it is very different from
the background distribution
Use relative entropy*:
$$\sum_{\text{position } i} \; \sum_{\text{letter } \beta} p_{i,\beta} \, \log \frac{p_{i,\beta}}{b_\beta}$$
$p_{i,\beta}$ = probability of β in matrix position i
$b_\beta$ = background frequency of β (in non-motif sequence)
* Relative entropy is sometimes called information content.
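A minimal sketch of the relative entropy calculation for the 9-column weight matrix shown on these slides. Two assumptions are not from the original: the background is taken as uniform (0.25 per base), and logarithms are base 2, so results are in bits.

```python
from math import log2

# Weight matrix from the slides: one list of 9 column values per base.
PWM = {
    "A": [.89, .02, .38, .34, .22, .27, .02, .03, .02],
    "C": [.04, .91, .20, .17, .28, .31, .30, .04, .02],
    "G": [.04, .05, .41, .18, .29, .16, .07, .92, .18],
    "T": [.03, .02, .01, .31, .21, .26, .61, .01, .78],
}
BACKGROUND = {b: 0.25 for b in "ACGT"}  # assumed uniform background

def column_relative_entropy(pwm, background, i):
    """Relative entropy of column i against the background (zero entries skipped)."""
    return sum(
        pwm[b][i] * log2(pwm[b][i] / background[b])
        for b in pwm if pwm[b][i] > 0
    )

def relative_entropy(pwm, background):
    """Sum of the per-column relative entropies over the whole motif."""
    width = len(next(iter(pwm.values())))
    return sum(column_relative_entropy(pwm, background, i) for i in range(width))

for i in range(9):
    print(i + 1, round(column_relative_entropy(PWM, BACKGROUND, i), 2))
print("total:", round(relative_entropy(PWM, BACKGROUND), 2))
```

Columns dominated by one base (such as 2 and 8) score high; columns close to the background (such as 5) score near zero. With a uniform background and base-2 logs, each column's value equals 2 minus its entropy.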
Scoring motif instances
A motif instance matches if it looks like it was
generated by the weight matrix
       1     2     3     4     5     6     7     8     9
A    .89   .02   .38   .34   .22   .27   .02   .03   .02
C    .04   .91   .20   .17   .28   .31   .30   .04   .02
G    .04   .05   .41   .18   .29   .16   .07   .92   .18
T    .03   .02   .01   .31   .21   .26   .61   .01   .78
Example instance: “A C G G C G C C T”
Slide annotations: “Not likely!”, “Hard to tell”, “Matches weight matrix”
Log likelihood ratio
A motif instance matches if it looks like it was
generated by the weight matrix
Use log likelihood ratio
$$\sum_{\text{position } i} \log \frac{p_{i,\beta_i}}{b_{\beta_i}}$$
$\beta_i$: the character at
position i of the instance
Measures how much more the instance looks like the
weight matrix than like the background.
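The log likelihood ratio can be sketched as below for the weight matrix on these slides. As before, the uniform background and base-2 logarithm are assumptions for illustration, and the function name is made up.

```python
from math import log2

PWM = {
    "A": [.89, .02, .38, .34, .22, .27, .02, .03, .02],
    "C": [.04, .91, .20, .17, .28, .31, .30, .04, .02],
    "G": [.04, .05, .41, .18, .29, .16, .07, .92, .18],
    "T": [.03, .02, .01, .31, .21, .26, .61, .01, .78],
}
BACKGROUND = {b: 0.25 for b in "ACGT"}  # assumed uniform background

def log_likelihood_ratio(instance, pwm, background):
    """Sum over positions of log2(matrix probability / background probability)."""
    return sum(
        log2(pwm[ch][i] / background[ch])
        for i, ch in enumerate(instance)
    )

print(log_likelihood_ratio("ACGGCGCCT", PWM, BACKGROUND))  # the instance on the slide
print(log_likelihood_ratio("ACGAGCTGT", PWM, BACKGROUND))  # most probable base per column
```

A positive score means the instance is more likely under the weight matrix than under the background. A threshold on this score decides which instances count as matches.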
Alternating approach
1. Guess an initial weight matrix
2. Use weight matrix to predict instances in the
input sequences
3. Use instances to predict a weight matrix
4. Repeat 2 & 3 until satisfied.
Examples: Gibbs sampler (Lawrence et al.)
MEME (expectation max. / Bailey, Elkan)
ANN-Spec (neural net / Workman, Stormo)
Expectation-maximization
foreach subsequence of width W
    convert subsequence to a matrix
    do {    // EM
        re-estimate motif occurrences from matrix
        re-estimate matrix model from motif occurrences
    } until (matrix model stops changing)
end
select matrix with highest score
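The loop above might look roughly like this in Python. It is a simplified sketch, not MEME: each sequence contributes only its single best-scoring subsequence (a "hard" assignment), whereas true expectation-maximization weights every subsequence by its probability. All function names are assumptions. The score is the sum of matrix probabilities, and a pseudocount of 0.5 on a single seed gives the 0.50 / 0.17 starting values used in the worked example that follows.

```python
def build_matrix(occurrences, pseudocount=1.0):
    """Column-wise base frequencies from equal-length occurrences, with pseudocounts."""
    width = len(occurrences[0])
    total = len(occurrences) + 4 * pseudocount
    return [
        {b: (sum(o[i] == b for o in occurrences) + pseudocount) / total for b in "ACGT"}
        for i in range(width)
    ]

def score(sub, matrix):
    """Sum of the per-position probabilities of the subsequence."""
    return sum(col[ch] for col, ch in zip(matrix, sub))

def best_occurrence(seq, matrix):
    """Highest-scoring window of the matrix width in one sequence."""
    w = len(matrix)
    windows = (seq[i:i + w] for i in range(len(seq) - w + 1))
    return max(windows, key=lambda s: score(s, matrix))

def refine(seqs, seed, max_iter=50):
    """Alternate occurrence prediction and matrix re-estimation until stable."""
    matrix = build_matrix([seed], pseudocount=0.5)
    occurrences = None
    for _ in range(max_iter):
        new = [best_occurrence(s, matrix) for s in seqs]
        if new == occurrences:  # occurrences, and so the matrix, stopped changing
            break
        occurrences = new
        matrix = build_matrix(occurrences)
    return matrix, occurrences

def find_motif(seqs, width):
    """Try every subsequence as a seed and keep the best-scoring refined matrix."""
    best = None
    for s in seqs:
        for i in range(len(s) - width + 1):
            matrix, occ = refine(seqs, s[i:i + width])
            total = sum(score(o, matrix) for o in occ)
            if best is None or total > best[0]:
                best = (total, matrix, occ)
    return best

total, matrix, occ = find_motif(["AAAATTTGCAAA", "CCTTTGCCCCCC", "GGGGGTTTGCGG"], 5)
print(occ)  # ['TTTGC', 'TTTGC', 'TTTGC']
```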
Sample DNA sequences
>ce1cg
TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATA
GCGCGTGGTGTGAAAGACTGTTTTTTTGATCGTTTTCAC
AAAAATGGAAGTCCACAGTCTTGACAG
>ara
GACAAAAACGCGTAACAAAAGTGTCTATAATCACGGCAG
AAAAGTCCACATTGATTATTTGCACGGCGTCACACTTTG
CTATGCCATAGCATTTTTATCCATAAG
>bglr1
ACAAATCCCAATAACTTAATTATTGGGATTTGTTATATA
TAACTTTATAAATTCCTAAAATTACACAAAGTTAATAAC
TGTGAGCATGGTCATATTTTTATCAAT
>crp
CACAAAGCGAAAGCTATGCTAAAACAGTCAGGATGCTAC
AGTAATACATTGATGTACTGCATGTATGCAAAGGACGTC
ACATTACCGTGCAGTACAGTTGATAGC
Motif occurrences
>ce1cg
taatgtttgtgctggtttttgtggcatcgggcgagaata
gcgcgtggtgtgaaagactgttttTTTGATCGTTTTCAC
aaaaatggaagtccacagtcttgacag
>ara
gacaaaaacgcgtaacaaaagtgtctataatcacggcag
aaaagtccacattgattaTTTGCACGGCGTCACactttg
ctatgccatagcatttttatccataag
>bglr1
acaaatcccaataacttaattattgggatttgttatata
taactttataaattcctaaaattacacaaagttaataac
TGTGAGCATGGTCATatttttatcaat
>crp
cacaaagcgaaagctatgctaaaacagtcaggatgctac
agtaatacattgatgtactgcatgtaTGCAAAGGACGTC
ACattaccgtgcagtacagttgatagc
Starting point
…gactgttttTTTGATCGTTTTCACaaaaatgg…
      T      T      T      G      A      T   C   G   T   T
A   0.17   0.17   0.17   0.17   0.50   ...
C   0.17   0.17   0.17   0.17   0.17
G   0.17   0.17   0.17   0.50   0.17
T   0.50   0.50   0.50   0.17   0.17
Re-estimating motif occurrences
TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATA
      T      T      T      G      A      T   C   G   T   T
A   0.17   0.17   0.17   0.17   0.50   ...
C   0.17   0.17   0.17   0.17   0.17
G   0.17   0.17   0.17   0.50   0.17
T   0.50   0.50   0.50   0.17   0.17
Score = 0.50 + 0.17 + 0.17 + 0.17 + 0.17 + ...
Scoring each subsequence
Sequence: TGTGCTGGTTTTTGTGGCATCGGGCGAGAATA
Subsequence        Score
TGTGCTGGTTTTTGT    2.95
GTGCTGGTTTTTGTG    4.62
TGCTGGTTTTTGTGG    2.31
GCTGGTTTTTGTGGC    ...
Select from each sequence the subsequence with maximal score.
Re-estimating motif matrix
Occurrences
TTTGATCGTTTTCAC
TTTGCACGGCGTCAC
TGTGAGCATGGTCAT
TGCAAAGGACGTCAC
Counts
A 000132011000040
C 001010300200403
G 020301131130000
T 423001002114001
Adding pseudocounts
Counts
A 000132011000040
C 001010300200403
G 020301131130000
T 423001002114001
Counts + Pseudocounts
A 111243122111151
C 112121411311514
G 131412242241111
T 534112113225112
Converting to frequencies
Counts + Pseudocounts
A 111243122111151
C 112121411311514
G 131412242241111
T 534112113225112
      T      T      T      G      A      T   C   G   T   T
A   0.13   0.13   0.13   0.25   0.50   ...
C   0.13   0.13   0.25   0.13   0.25
G   0.13   0.38   0.13   0.50   0.13
T   0.63   0.38   0.50   0.13   0.13
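The counts, pseudocounts, and frequencies on these slides can be reproduced with a short sketch. The function names are assumptions; the pseudocount of 1 per base is the one implied by the "Counts + Pseudocounts" table.

```python
OCCURRENCES = [
    "TTTGATCGTTTTCAC",
    "TTTGCACGGCGTCAC",
    "TGTGAGCATGGTCAT",
    "TGCAAAGGACGTCAC",
]

def count_matrix(occurrences):
    """Per-base list of counts, one entry per motif position."""
    width = len(occurrences[0])
    return {
        b: [sum(o[i] == b for o in occurrences) for i in range(width)]
        for b in "ACGT"
    }

def to_frequencies(counts, pseudocount=1):
    """Add the pseudocount to every cell, then normalize each column to sum to 1."""
    width = len(counts["A"])
    totals = [sum(counts[b][i] + pseudocount for b in "ACGT") for i in range(width)]
    return {
        b: [(counts[b][i] + pseudocount) / totals[i] for i in range(width)]
        for b in "ACGT"
    }

counts = count_matrix(OCCURRENCES)
freqs = to_frequencies(counts)
for b in "ACGT":
    print(b, "".join(str(n) for n in counts[b]), [round(f, 2) for f in freqs[b][:5]])
```

With four occurrences and a pseudocount of 1 per base, every column total is 8, so a count of 4 becomes 5/8 (shown rounded as 0.63 on the slide) and a count of 0 becomes 1/8 (0.13).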
Amino acid weight matrices
A sequence logo is a scaled position-specific
A.A. distribution. Scaling is by a measure of
a position’s information content.
Sequence logos
A visual representation of a position-specific
distribution. Easy for nucleotides, but we
need colour to depict up to 20 amino acid
proportions.
Idea: overall height at position l is proportional
to the information content (2 - H_l, in bits, for
nucleotides); the proportions of each nucleotide (or
amino acid) are in relation to their observed
frequency at that position, with the most frequent
on top, the next most frequent below, etc.
Summary of motif
detection
Block Diagram for Searching
with a PSSM
[Block diagram: PSSM, threshold, and set of sequences to search → PSSM search → sequences that match above threshold; positions and scores of matches]
Block Diagram for Searching for
sequences related to a family with a
PSSM
[Block diagram: set of aligned sequence features and expected frequencies of each sequence element → PSSM builder → PSSM; then PSSM, threshold, and set of sequences to search → PSSM search → sequences that match above threshold; positions and scores of matches]
Consensus sequences vs.
frequency matrices
Should I use a consensus sequence or a frequency
matrix to describe my site?
If all allowed characters at a given position are equally
"good", use IUB codes to create consensus sequence
Example: Restriction enzyme recognition sites
If some allowed characters are "better" than others, use
frequency matrix
Example: Promoter sequences
Advantages of consensus sequences: smaller
description, quicker comparison
Disadvantage: lose quantitative information on
preferences at certain locations
Similarity Functions
Used to facilitate comparison of two
sequence elements
logical valued (true or false, 1 or 0)
test whether first argument matches (or could
match) second argument
numerical valued
test degree to which first argument matches
second
Logical valued similarity functions
Let Search(I)=‘A’ and Sequence(J)=‘R’
A Function to Test for Exact Match
MatchExact(Search(I),Sequence(J)) would
return FALSE since A is not R
A Function to Test for Possibility of a Match
using IUB codes for Incompletely Specified
Bases
MatchWild(Search(I),Sequence(J)) would
return TRUE since R can be either A or G
Numerical valued similarity
functions
return value could be probability (for DNA)
Let Search(I) = 'A' and Sequence(J) = 'R'
SimilarNuc (Search(I),Sequence(J)) could return 0.5
since chances are 1 out of 2 that a purine is
adenine
return value could be similarity (for protein)
Let Seq1(I) = 'K' (lysine) and Seq2(J) = 'R' (arginine)
SimilarProt(Seq1(I),Seq2(J)) could return 0.8
since lysine is similar to arginine
usually use integer values for efficiency
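The three nucleotide similarity functions described above could be sketched like this. The Python names mirror MatchExact, MatchWild, and SimilarNuc but are otherwise assumptions. similar_nuc assumes every base allowed by an ambiguity code is equally likely. A protein version (SimilarProt) would need a scoring table, which is not given here, so it is omitted.

```python
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def match_exact(a, b):
    """True only when the two symbols are identical."""
    return a == b

def match_wild(a, b):
    """True when the two IUB codes share at least one possible base."""
    return bool(set(IUB[a]) & set(IUB[b]))

def similar_nuc(a, b):
    """Chance that a base drawn uniformly from code b is one allowed by code a."""
    allowed_a, allowed_b = set(IUB[a]), set(IUB[b])
    return len(allowed_a & allowed_b) / len(allowed_b)

print(match_exact("A", "R"))  # False
print(match_wild("A", "R"))   # True
print(similar_nuc("A", "R"))  # 0.5
```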
Concluding Notes:
Protein detection
Given a DNA or RNA sequence, find
those regions that code for protein(s)
Direct approach:
Genetic codes
The set of tRNAs that an organism possesses
defines its genetic code(s)
The “universal” (standard) genetic code is common to
nearly all organisms
Prokaryotes, mitochondria and chloroplasts
often use slightly different genetic codes
More than one tRNA may be present for a
given codon, allowing more than one possible
translation product
Genetic codes
Differences in genetic codes occur in start
and stop codons only
Alternate initiation codons: codons that
encode amino acids but can also be used to
start translation (GTG, TTG, ATA, TTA, CTG)
Suppressor tRNA codons: codons that
normally stop translation but are translated as
amino acids (TAG, TGA, TAA)
Reading Frames
Since nucleotide sequences are “read” three
bases at a time, there are three possible
“frames” in which a given nucleotide
sequence can be “read” (in the forward
direction)
Taking the complement of the sequence and
reading in the reverse direction gives three
more reading frames
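A minimal six-frame translation sketch, assuming the standard genetic code and one-letter amino acid symbols (* marks a stop). The names are illustrative, and RF4 to RF6 are read from the reverse complement, starting at the last nucleotide.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(seq):
    """Translate complete codons only; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

def six_frames(seq):
    """Return translations for RF1-RF3 (forward) and RF4-RF6 (reverse complement)."""
    rev = seq.translate(COMPLEMENT)[::-1]
    frames = {f"RF{k + 1}": translate(seq[k:]) for k in range(3)}
    frames.update({f"RF{k + 4}": translate(rev[k:]) for k in range(3)})
    return frames

for name, protein in six_frames("TTCTCATGTTTGACAGCT").items():
    print(name, protein)
```

For TTCTCATGTTTGACAGCT this gives FSCLTA, SHV*Q and LMFDS in the forward frames, and SCQT*E, AVKHE and LSNMR in the reverse frames.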
Reading frames
Frames are shown offset relative to the codon columns; RF4 to RF6 are read right to left.
        TTC   TCA   TGT   TTG   ACA   GCT
RF1     Phe   Ser   Cys   Leu   Thr   Ala>
RF2      Ser   His   Val   ***   Gln   Leu>
RF3       Leu   Met   Phe   Asp   Ser>
        AAG   AGT   ACA   AAC   TGT   CGA
RF4    <Glu   ***   Thr   Gln   Cys   Ser
RF5      <Glu   His   Lys   Val   Ala
RF6     <Arg   Met   Asn   Ser   Leu
Reading frames
To find which reading frame a region is in, take
nucleotide number of lower bound of region, divide by
3 and take remainder (modulus 3)
1=RF1, 2=RF2, 0=RF3
For reverse reading frames, take nucleotide number
of upper bound of region, subtract from total number
of nucleotides, divide by 3 and take remainder
(modulus 3)
0=RF4, 1=RF5, 2=RF6
This is because the convention MacVector uses is
that RF4 starts with the last nucleotide and reads
backwards
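The two modulus rules can be written directly as lookups. This sketch assumes 1-based nucleotide numbering and the frame convention stated above; the function names are made up for the example.

```python
def forward_frame(lower_bound):
    """Frame of a forward-strand region from its 1-based lower bound."""
    return {1: "RF1", 2: "RF2", 0: "RF3"}[lower_bound % 3]

def reverse_frame(upper_bound, total_length):
    """Frame of a reverse-strand region from its 1-based upper bound."""
    return {0: "RF4", 1: "RF5", 2: "RF6"}[(total_length - upper_bound) % 3]

print(forward_frame(1), forward_frame(5), forward_frame(3))  # RF1 RF2 RF3
print(reverse_frame(18, 18), reverse_frame(17, 18))          # RF4 RF5
```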
Open Reading Frames (ORF)
Concept: Region of DNA or RNA sequence
that could be translated into a peptide
sequence (open refers to absence of stop
codons)
Prerequisite: A specific genetic code
Definition:
(start codon) (amino acid coding codon)_n (stop codon), with n coding codons between start and stop
Note: Not all ORFs are actually used
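A direct ORF search following this definition might look like the sketch below. It makes several simplifying assumptions that are not from the original: only the forward strand is scanned, ATG is the only start codon unless others are passed in, stops are TAA, TAG and TGA, and an ORF must have both a start and a stop. Searching the other strand would mean running the same function on the reverse complement.

```python
STOPS = {"TAA", "TAG", "TGA"}  # assumed standard stop codons

def find_orfs(seq, starts=("ATG",), min_codons=0):
    """Return sorted (start, end) spans, 0-based and end-exclusive, stop included."""
    orfs = []
    for frame in range(3):
        start = None  # position of the first start codon of the current open ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in starts:
                start = i
            elif start is not None and codon in STOPS:
                coding = (i - start) // 3 - 1  # codons between start and stop
                if coding >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

print(find_orfs("CCATGAAATGACC"))  # [(2, 11)]: ATG AAA TGA in the third frame
```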
Block Diagram for Direct
Search for ORFs
[Block diagram: genetic code, options (both strands? ends start/stop?), and sequence to be searched → search engine → list of ORF positions]
Statistical Approaches
Calculation Windows
Many sequence analyses require calculating
some statistic over a long sequence looking
for regions where the statistic is unusually
high or low
To do this, we define a window size to be the
width of the region over which each
calculation is to be done
Example: %AT
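For instance, %AT over a sliding window can be computed as in this sketch. The function name, the step of one nucleotide, and the test sequence are assumptions made for the example.

```python
def percent_at(seq, window):
    """%AT for every window position, advancing one nucleotide at a time."""
    values = []
    for i in range(len(seq) - window + 1):
        chunk = seq[i:i + window]
        values.append(100.0 * sum(ch in "AT" for ch in chunk) / window)
    return values

print(percent_at("AATTGGCC", 4))  # [100.0, 75.0, 50.0, 25.0, 0.0]
```

A larger window smooths the curve but blurs the boundaries of a region; a smaller one is noisier but localizes changes better.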
Base Composition Bias
For a protein with a roughly “normal” amino
acid composition, the first 2 positions of all
codons will be about 50% GC
If an organism has a high GC content overall,
the third position of all codons must be mostly
GC
Useful for prokaryotes
Not useful for eukaryotes due to large amount
of noncoding DNA
Fickett’s statistic
Also called TestCode analysis
Looks for asymmetry of base composition
Strong statistical basis for calculations
Method:
For each window on the sequence, calculate
the base composition of nucleotides 1, 4, 7...,
then of 2, 5, 8..., and then of 3, 6, 9...
Calculate statistic from resulting three
numbers
Codon Bias (Codon Preference)
Principle
Different levels of expression of different
tRNAs for a given amino acid lead to pressure
on coding regions to “conform” to the preferred
codon usage
Non-coding regions, on the other hand, feel no
selective pressure and can drift
Codon Bias (Codon Preference)
Starting point: Table of observed codon
frequencies in known genes from a given
organism
best to use highly expressed genes
Method
Calculate “coding potential” within a moving
window for all three reading frames
Look for ORFs with high scores
Codon Bias (Codon Preference)
Works best for prokaryotes or unicellular
eukaryotes because for multicellular
eukaryotes, different pools of tRNA may be
expressed at different stages of development
in different tissues
may have to group genes into sets
Codon bias can also be used to estimate
protein expression level
Portion of D. melanogaster
codon frequency table
Amino Acid   Codon   Number   Freq/1000   Fraction
Gly          GGG        11       2.60       0.03
Gly          GGA        92      21.74       0.28
Gly          GGT        86      20.33       0.26
Gly          GGC       142      33.56       0.43
Glu          GAG       212      50.11       0.75
Glu          GAA        69      16.31       0.25
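The Fraction column can be recomputed from the Number column: it is each codon's count divided by the total count for its amino acid. A small sketch follows, with illustrative names. The Freq/1000 column needs the total codon count of the whole table, which is not shown in this portion, so it is not recomputed.

```python
# Codon counts from the D. melanogaster table portion above.
COUNTS = {
    "Gly": {"GGG": 11, "GGA": 92, "GGT": 86, "GGC": 142},
    "Glu": {"GAG": 212, "GAA": 69},
}

def codon_fractions(counts):
    """Fraction of each codon among the synonymous codons of its amino acid."""
    return {
        aa: {codon: n / sum(codons.values()) for codon, n in codons.items()}
        for aa, codons in counts.items()
    }

for aa, codons in codon_fractions(COUNTS).items():
    for codon, fraction in codons.items():
        print(aa, codon, round(fraction, 2))
```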
Comparison of Glycine codon
frequencies
Codon   E. coli   D. melanogaster
GGG     0.02      0.03
GGA     0.00      0.28
GGT     0.59      0.26
GGC     0.38      0.43