Protein Sequence Motifs
Aalt-Jan van Dijk
Plant Research International, Wageningen UR
Biometris, Wageningen UR
aaltjan.vandijk@wur.nl
www.bioinformatics.nl
Plant Bioinformatics
Genomics
Next Generation Sequencing
Genome assembly & annotation
(Comparative) genome analysis
SNP analysis, marker development
Computational infrastructure
Database development
Web-based analysis tools
Software development
Workflow management systems
machine learning
Data (pre-)processing pipelining
Alternative splicing
Protein interaction networks
Metabolomics
Alternative splicing
EST analysis
Proteomics
Technology
Integrated analysis of omics datasets
Transcriptomics
Database development
Data (pre-)processing pipelining
Metabolite and pathway identification
Systems biology
network modelling (bottom-up)
• Protein interaction networks
My research
Protein complex structures
Protein-protein docking
Correlated mutations
Interaction site prediction/analysis
Protein-protein interactions
Protein-DNA interactions
Motif search
Enzyme active sites
Overview
Protein Motif Searching
Hydrophobicity & Transmembrane Domains
Protein Interactions
Sequence-motifs to predict interaction sites
Secondary Structure Prediction
Protein Motif Searching
What is a motif?
A motif is a description of a particular element of
a protein that contains a specific sequence
pattern
Motifs are identified by
3D structural alignment
Multiple sequence alignment
Pattern searching programs
Protein Motif Searching
Strict consensus pattern
use only strictly conserved residues
C--QASCDGIPLKMNDC
C---VTCEGLPMRMDQC
CERTLGCQPMPVH---C
CxxxxxCxxxPxxxxxC
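The strict consensus above can be derived mechanically from the alignment. A minimal Python sketch, assuming the three aligned sequences from the slide; the function name `strict_consensus` and the choice of `x` as wildcard are illustrative and not part of the original material:

```python
def strict_consensus(alignment, wildcard="x", gap="-"):
    """Keep a residue only where every aligned sequence has the same one."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*alignment):
        residues = set(column)
        # A column is strictly conserved if it holds one residue and no gap.
        if len(residues) == 1 and gap not in residues:
            consensus.append(column[0])
        else:
            consensus.append(wildcard)
    return "".join(consensus)


alignment = [
    "C--QASCDGIPLKMNDC",
    "C---VTCEGLPMRMDQC",
    "CERTLGCQPMPVH---C",
]
print(strict_consensus(alignment))  # CxxxxxCxxxPxxxxxC
```

Only the three cysteines and the proline survive, which is why a strict consensus cannot express variable residues or gaps.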
But what about:
variable residues?
gaps?
Protein Motif Searching
Strict consensus patterns contain
no alternative residues
no flexible regions
no mismatches
no gaps
Protein Motif Searching
Most motifs are defined as regular expressions
Motifs can contain
alternative residues
flexible regions
C-x(2,5)-C-x-[GP]-x-P-x(2,5)-C
  CXXXCXGXPXXXXXC
  |   | | |     |
FGCAKLCAGFPLRRLPCFYG
The PROSITE Syntax
A-[BC]-X-D(2,5)-{EFG}-H
A: A
[BC]: B or C
X: anything
D(2,5): 2-5 D’s
{EFG}: not E, F, or G
H: H
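The syntax elements listed above map almost one-to-one onto ordinary regular expressions. A minimal Python sketch of that translation, assuming only the elements shown on these slides (single residues, `[...]`, `{...}`, `x`, and repeat counts); anchors and other PROSITE features are not handled, and the function name `prosite_to_regex` is illustrative:

```python
import re

# One pattern element: a residue, [set] or {excluded set}, plus optional (n) or (n,m).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"   # {EFG} means: not E, F, or G
        elif core in ("x", "X"):
            regex = "."                       # x means: anything
        else:
            regex = core                      # residue or [alternatives]
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


regex = prosite_to_regex("C-x(2,5)-C-x-[GP]-x-P-x(2,5)-C")
print(regex)                                    # C.{2,5}C.[GP].P.{2,5}C
hit = re.search(regex, "FGCAKLCAGFPLRRLPCFYG")
print(hit.start() + 1, hit.group())             # 3 CAKLCAGFPLRRLPC
```

The hit starts at residue 3 of the example sequence and covers the same 15 residues as the aligned pattern on the previous slide.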
PROSITE entries
Mandatory motifs characterise a protein (super)family
ID SUBTILASE_ASP; PATTERN.
DE Serine proteases, subtilase family, aspartic acid active site.
PA [STAIV]-x-[LIVMF]-[LIVM]-D-[DSTA]-G-[LIVMFC]-x(2,3)-[DNH].
ID SUBTILASE_HIS; PATTERN.
DE Serine proteases, subtilase family, histidine active site.
PA H-G-[STM]-x-[VIC]-[STAGC]-[GS]-x-[LIVMA]-[STAGCLV]-[SAGM].
ID SUBTILASE_SER; PATTERN.
DE Serine proteases, subtilase family, serine active site.
PA G-T-S-x-[SA]-x-P-x(2)-[STAVC]-[AG].
Exercise
Find the three subtilase motifs in PROSITE
(prosite.expasy.org)
Compare the lists of proteins in which the motifs
occur – what does this tell you?
Similarly, compare protein structures in which the
motifs occur
Have a look at the “sequence logo”
Protein Motif Searching
Some motifs occur frequently in proteins; the corresponding feature may
not actually be present, for example
Post-translational modification sites
ID ASN_GLYCOSYLATION; PATTERN.
DE N-glycosylation site.
PA N-{P}-[ST]-{P}.
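Because this pattern is so short, candidate sites can overlap and turn up often. A minimal Python sketch that lists every match, including overlapping ones, using a lookahead; the test sequence is made up for illustration, and a match only marks a candidate site, not a glycosylated residue:

```python
import re

# N-{P}-[ST]-{P} written as a regular expression; the lookahead lets
# overlapping candidate sites be reported as well.
SEQUON = re.compile(r"(?=(N[^P][ST][^P]))")


def find_sequons(sequence):
    """Return (1-based position, matched residues) for every candidate site."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(sequence)]


# Hypothetical sequence, not taken from the slides.
print(find_sequons("MNGTANPSNNSSQ"))
# [(2, 'NGTA'), (9, 'NNSS'), (10, 'NSSQ')]
```

The N at position 6 is skipped because it is followed by P, and the sites at positions 9 and 10 overlap.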
Exercise
Use a glycosylation site predictor such as
http://www.cbs.dtu.dk/services/NetNGlyc/
Input: your favorite set of sequences
Do you observe that some N-{P}-[ST] sites are likely to
be glycosylated and others not?
Profiles
Many motifs cannot be easily defined using
simple patterns
Such motifs can be defined using profiles
A profile is constructed from a multiple sequence
alignment. For each position, each amino acid is
given a score depending on how likely it is to
occur
Calculating a Profile
For each alignment position: take the
(weighted) average of the appropriate rows
from the scoring matrix
An (extremely
simple) example:
seq_01  AAAAAAAAAAW
seq_02  AAAAAAAAAWW
seq_03  AAAAAAAAWWW
seq_04  AAAAAAAWWWW
seq_05  AAAAAAWWWWW
seq_06  AAAAAWWWWWW
seq_07  AAAAWWWWWWW
seq_08  AAAWWWWWWWW
seq_09  AAWWWWWWWWW
seq_10  AWWWWWWWWWW
Excerpt from the EBLOSUM62 matrix:
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3

Profile scores for three alignment columns:
      10A   5A+5W    10W
A     4.0     1.0   -3.0
N    -2.0    -6.0   -4.0
C     0.0    -2.0   -2.0
P    -1.0    -5.0   -4.0
D    -2.0    -6.0   -4.0
Q    -1.0    -3.0   -2.0
E    -1.0    -4.0   -3.0
R    -1.0    -4.0   -3.0
F    -2.0    -1.0    1.0
S     1.0    -2.0   -3.0
G     0.0    -2.0   -2.0
T     0.0    -2.0   -2.0
H    -2.0    -4.0   -2.0
V     0.0    -3.0   -3.0
I    -1.0    -4.0   -3.0
W    -3.0     8.0   11.0
K    -1.0    -4.0   -3.0
Y    -2.0     0.0    2.0
L    -1.0    -3.0   -2.0
M    -1.0    -2.0   -1.0
prophecy (EMBOSS), using Henikoff profile type, and BLOSUM62
matrix;
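The averaging step can be sketched in a few lines of Python. This is a plain (optionally weighted) mean of scoring-matrix rows, using only the A and W rows quoted above; it is not a reimplementation of prophecy. The slide's 5A+5W scores equal the sum of the A and W rows (twice the unweighted mean), so prophecy's scaling evidently differs from this sketch. The names `column_profile` and `BLOSUM62_ROWS` are illustrative:

```python
RESIDUES = "ARNDCQEGHILKMFPSTWYV"

# Only the two matrix rows needed for the A/W example alignment.
BLOSUM62_ROWS = {
    "A": [4, -1, -2, -2, 0, -1, -1, 0, -2, -1, -1, -1, -1, -2, -1, 1, 0, -3, -2, 0],
    "W": [-3, -3, -4, -4, -2, -2, -3, -2, -2, -3, -2, -3, -1, 1, -4, -3, -2, 11, 2, -3],
}


def column_profile(column, weights=None):
    """Weighted average of the matrix rows of the residues in one column."""
    if weights is None:
        weights = [1.0] * len(column)   # assumption: equal sequence weights
    total = sum(weights)
    return {
        residue: sum(w * BLOSUM62_ROWS[aa][idx] for aa, w in zip(column, weights)) / total
        for idx, residue in enumerate(RESIDUES)
    }


# seq_01 .. seq_10 from the example: (11 - k) A's followed by k W's.
alignment = ["A" * (11 - k) + "W" * k for k in range(1, 11)]
profile = [column_profile(column) for column in zip(*alignment)]

print(profile[0]["A"], profile[0]["W"])     # all-A column: 4.0 -3.0
print(profile[5]["A"], profile[5]["W"])     # 5A+5W column: 0.5 4.0
print(profile[10]["A"], profile[10]["W"])   # all-W column: -3.0 11.0
```

Passing a list of per-sequence weights to `column_profile` gives the weighted variant mentioned on the slide.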
Pattern Searching
Short linear motifs: e.g.
http://dilimot.russelllab.org/
Profiles: MEME
http://meme.sdsc.edu/meme/cgi-bin/meme.cgi
Exercise
Use a number of sequences which contain the
PROSITE subtilase motif and find motifs in those
sequences with MEME
Hydropathy Plot
Prediction of hydrophobic and hydrophilic regions in a
protein
Partition Coefficients
Hydrophilic
Hydrophobic
Oil
Water
Hydrophobicity/Hydrophilicity Values
(residues listed from hydrophilic at the top to hydrophobic at the bottom)
     Fauchere & Pliska   Kyte & Doolittle   Hopp & Woods   Eisenberg
R         -1.37               -4.50              3.00         -2.53
K         -1.35               -3.90              3.00         -1.50
D         -1.05               -3.50              3.00         -0.90
Q         -0.78               -3.50              0.20         -0.85
N         -0.85               -3.50              0.20         -0.78
E         -0.87               -3.50              3.00         -0.74
H         -0.40               -3.20             -0.50         -0.40
S         -0.18               -0.80              0.30         -0.18
T         -0.05               -0.70             -0.40         -0.05
P          0.12               -1.60              0.00          0.12
Y          0.26               -1.30             -2.30          0.26
C          0.29                2.50             -1.00          0.29
G          0.48               -0.40              0.00          0.48
A          0.62                1.80             -0.50          0.62
M          0.64                1.90             -1.30          0.64
W          0.81               -0.90             -3.40          0.81
L          1.06                3.80             -1.80          1.06
V          1.08                4.20             -1.50          1.08
F          1.19                2.80             -2.50          1.19
I          1.38                4.50             -1.80          1.38
Hydrophobicity Plot
Sum amino acid hydrophobicity values in a given
window
Plot the value in the middle of the window
Shift the window one position
$$H_i = \frac{1}{2k+1} \sum_{n=i-k}^{i+k} H_n$$
Sliding Window Approach
Calculate property for first sub-sequence
Use the result (plot/print/store)
Move to next residue position, and repeat
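The sliding window steps above can be sketched as follows. This is a minimal Python version that averages over a window of 2k+1 residues and reports the value at the window centre. It uses the Kyte & Doolittle values from the table earlier in these slides; scoring the ambiguity code Z (Glu or Gln) as -3.5 is an assumption, chosen because E and Q share that value. The function name `hydropathy_profile` is illustrative:

```python
KYTE_DOOLITTLE = {
    "R": -4.5, "K": -3.9, "D": -3.5, "Q": -3.5, "N": -3.5, "E": -3.5, "H": -3.2,
    "S": -0.8, "T": -0.7, "P": -1.6, "Y": -1.3, "C": 2.5, "G": -0.4, "A": 1.8,
    "M": 1.9, "W": -0.9, "L": 3.8, "V": 4.2, "F": 2.8, "I": 4.5,
    "Z": -3.5,  # assumption: Z (Glu or Gln) scored like E and Q
}


def hydropathy_profile(sequence, k):
    """Average hydropathy over windows of 2k+1 residues.

    Returns (1-based centre position, mean value) for every full window.
    """
    values = [KYTE_DOOLITTLE[aa] for aa in sequence]
    window = 2 * k + 1
    profile = []
    for i in range(k, len(values) - k):
        mean = sum(values[i - k:i + k + 1]) / window
        profile.append((i + 1, mean))   # plot the value at the window centre
    return profile


for position, value in hydropathy_profile("MEZCALTASTESVERYNICE", k=1):
    print(position, round(value, 2))
```

With k = 1 the first and last residue have no full window, so the 20-residue example yields 18 points; larger k smooths the curve further and shortens it at both ends.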
Hydrophobicity Plot
MEZCALTASTESVERYNICE
[Figure: hydrophobicity plot of the sequence above; x-axis: residue position (0 to 21), y-axis: hydrophobicity value (-2 to 1.5)]
Transmembrane Regions
Rotation is 100 degrees per amino acid
Climb is 1.5 Angstrom
per amino acid residue
Transmembrane Regions
Membrane thickness: approx. 30 angstrom
So we need approx. 30 / 1.5 = 20 amino acids to span the
membrane
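The arithmetic on these two slides can be written out as a small Python sketch. The constants are the helix values quoted above (100 degrees rotation and 1.5 angstrom rise per residue) and the 30 angstrom membrane; the function name `residues_to_span` is illustrative:

```python
import math

ROTATION_PER_RESIDUE = 100.0   # degrees per amino acid
RISE_PER_RESIDUE = 1.5         # angstrom per amino acid
MEMBRANE_THICKNESS = 30.0      # angstrom


def residues_to_span(thickness, rise=RISE_PER_RESIDUE):
    """Smallest number of helical residues whose total rise covers the thickness."""
    return math.ceil(thickness / rise)


n_residues = residues_to_span(MEMBRANE_THICKNESS)
residues_per_turn = 360.0 / ROTATION_PER_RESIDUE
print(n_residues)                       # 20
print(residues_per_turn)                # 3.6
print(n_residues / residues_per_turn)   # about 5.6 helical turns
```

This is why a window of roughly 20 residues is a natural choice when looking for membrane-spanning helices in a hydropathy plot.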
Adapting the window size to
the size of the membrane
spanning segment makes the
picture easier to interpret
[Figure: hydropathy plots computed with window = 1, window = 9, window = 19 and window = 121]
Protein Interactions
Obligatory: hemoglobin
Transient: mitochondrial Cu transporters
Experimental approaches (1)
Yeast two-hybrid (Y2H)
Experimental approaches (2)
Affinity Purification + mass spectrometry (AP-MS)
Interaction Databases
STRING http://string.embl.de/
HPRD http://www.hprd.org/
InteroPorc http://biodev.extra.cea.fr/interoporc/Default.aspx
Many others, e.g. see
http://nar.oxfordjournals.org./content/39/suppl_1.toc
Yeast protein interaction network
Sequence-based Protein Binding Site Prediction
Binding site
Predefined motifs
Motif search in groups of proteins
• Group proteins which have same interaction partner
• Use motif search, e.g. find PWMs
Neduva et al., PLoS Biol 2005
Correlated Motif Search
Interactors
AARLL PLTEQ
MARLT DLTEP
VVRLM MMTER
Non-interactors
AARLL MARLT
VVRLM MARLT
PLTEQ DLTEP
Correlated Motif Pair: (RL,TE)
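A brute-force version of this search fits in a short Python sketch. It is only an illustration of the idea on the toy data above, not the published algorithm: every pair of 2-residue motifs is scored as the number of interacting pairs it explains minus the number of non-interacting pairs it wrongly explains. The scoring rule and the names `kmers`, `covers` and `best_motif_pair` are assumptions made for this sketch:

```python
from itertools import combinations_with_replacement

interactors = [("AARLL", "PLTEQ"), ("MARLT", "DLTEP"), ("VVRLM", "MMTER")]
non_interactors = [("AARLL", "MARLT"), ("VVRLM", "MARLT"), ("PLTEQ", "DLTEP")]


def kmers(sequence, k):
    """All substrings of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def covers(pair, motif_a, motif_b, k):
    """True if one protein of the pair has motif_a and the other has motif_b."""
    first, second = kmers(pair[0], k), kmers(pair[1], k)
    return (motif_a in first and motif_b in second) or (
        motif_a in second and motif_b in first)


def best_motif_pair(positives, negatives, k=2):
    """Exhaustively score all motif pairs; return the best pair and its score."""
    candidates = sorted(set().union(*(kmers(s, k) for pair in positives for s in pair)))
    best, best_score = None, None
    for a, b in combinations_with_replacement(candidates, 2):
        score = (sum(covers(p, a, b, k) for p in positives)
                 - sum(covers(p, a, b, k) for p in negatives))
        if best_score is None or score > best_score:
            best, best_score = (a, b), score
    return best, best_score


print(best_motif_pair(interactors, non_interactors))  # (('RL', 'TE'), 3)
```

Exhaustive enumeration like this scales poorly with motif length and network size, which is the motivation for faster search strategies.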
Experimental validation
van Dijk et al., PLoS Comput Biol 2010
New approach: SLIDER
• Faster approach for genome-wide searching for interaction motifs
• Improve mining algorithm with a priori biological knowledge
(conservation score, surface accessibility)
Boyen et al., IEEE/ACM Trans Comput Biol Bioinform. 2011
THE END…..
Questions?
Secondary Structure Prediction
Traditional methods (statistical and/or rule-based)
E.g. Garnier, Osguthorpe & Robson
• Statistical method
Accuracy ~ 60%
GOR Helix Parameters
      i-8  i-7  i-6  i-5  i-4  i-3  i-2  i-1    i  i+1  i+2  i+3  i+4  i+5  i+6  i+7  i+8
gly    -5  -10  -15  -20  -30  -40  -50  -60  -86  -60  -50  -40  -30  -20  -15  -10   -5
ala     5   10   15   20   30   40   50   60   65   60   50   40   30   20   15   10    5
val     0    0    0    0    0    0    5   10   14   10    5    0    0    0    0    0    0
leu     0    5   10   15   20   25   28   30   32   30   28   25   20   15   10    5    0
ile     5   10   15   20   25   20   15   10    6    0  -10  -15  -20  -25  -20  -10   -5
ser     0   -5  -10  -15  -20  -25  -30  -35  -39  -35  -30  -25  -20  -15  -10   -5    0
thr     0    0    0   -5  -10  -15  -20  -25  -26  -25  -20  -15  -10   -5    0    0    0
asp     0   -5  -10  -15  -20  -15  -10    0    5   10   15   20   20   20   15   10    5
glu     0    0    0    0   10   20   60   70   78   78   78   78   78   70   60   40   20
asn     0    0    0    0  -10  -20  -30  -40  -51  -40  -30  -20  -10    0    0    0    0
gln     0    0    0    0    5   10   20   20   10  -10  -20  -20  -10   -5    0    0    0
lys    20   40   50   55   60   60   50   30   23   10    5    0    0    0    0    0    0
his    10   20   30   40   50   50   50   30   12  -20  -10    0    0    0    0    0    0
arg     0    0    0    0    0    0    0    0   -9  -15  -20  -30  -40  -50  -50  -30  -10
phe     0    0    0    0    0    5   10   15   16   15   10    5    0    0    0    0    0
tyr    -5  -10  -15  -20  -25  -30  -35  -40  -45  -40  -35  -30  -25  -20  -15  -10   -5
trp   -10  -20  -40  -50  -50  -10    0   10   12   10    0  -10  -50  -50  -40  -20  -10
cys     0    0    0    0    0    0   -5  -10  -13  -10   -5    0    0    0    0    0    0
met    10   20   25   30   35   40   45   50   53   50   45   40   35   30   25   20   10
pro   -10  -20  -40  -60  -80 -100 -120 -140  -77  -60  -30  -20  -10    0    0    0    0
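One way to read the GOR helix parameter table is as a windowed sum: to score position i, add up the table entries of the 17 residues from i-8 to i+8, each taken from the column matching its offset. The Python sketch below implements that reading. It is an assumption about how the table is applied rather than a full GOR implementation, and it covers helix only; in practice the helix score would be compared with the corresponding scores for the other states. Residues without a table row (such as X) are skipped, and the name `helix_score` is illustrative:

```python
# Helix parameters per residue for offsets i-8 .. i+8 (17 values per row),
# transcribed from the GOR helix parameter table in these slides.
HELIX = {
    "G": [-5, -10, -15, -20, -30, -40, -50, -60, -86, -60, -50, -40, -30, -20, -15, -10, -5],
    "A": [5, 10, 15, 20, 30, 40, 50, 60, 65, 60, 50, 40, 30, 20, 15, 10, 5],
    "V": [0, 0, 0, 0, 0, 0, 5, 10, 14, 10, 5, 0, 0, 0, 0, 0, 0],
    "L": [0, 5, 10, 15, 20, 25, 28, 30, 32, 30, 28, 25, 20, 15, 10, 5, 0],
    "I": [5, 10, 15, 20, 25, 20, 15, 10, 6, 0, -10, -15, -20, -25, -20, -10, -5],
    "S": [0, -5, -10, -15, -20, -25, -30, -35, -39, -35, -30, -25, -20, -15, -10, -5, 0],
    "T": [0, 0, 0, -5, -10, -15, -20, -25, -26, -25, -20, -15, -10, -5, 0, 0, 0],
    "D": [0, -5, -10, -15, -20, -15, -10, 0, 5, 10, 15, 20, 20, 20, 15, 10, 5],
    "E": [0, 0, 0, 0, 10, 20, 60, 70, 78, 78, 78, 78, 78, 70, 60, 40, 20],
    "N": [0, 0, 0, 0, -10, -20, -30, -40, -51, -40, -30, -20, -10, 0, 0, 0, 0],
    "Q": [0, 0, 0, 0, 5, 10, 20, 20, 10, -10, -20, -20, -10, -5, 0, 0, 0],
    "K": [20, 40, 50, 55, 60, 60, 50, 30, 23, 10, 5, 0, 0, 0, 0, 0, 0],
    "H": [10, 20, 30, 40, 50, 50, 50, 30, 12, -20, -10, 0, 0, 0, 0, 0, 0],
    "R": [0, 0, 0, 0, 0, 0, 0, 0, -9, -15, -20, -30, -40, -50, -50, -30, -10],
    "F": [0, 0, 0, 0, 0, 5, 10, 15, 16, 15, 10, 5, 0, 0, 0, 0, 0],
    "Y": [-5, -10, -15, -20, -25, -30, -35, -40, -45, -40, -35, -30, -25, -20, -15, -10, -5],
    "W": [-10, -20, -40, -50, -50, -10, 0, 10, 12, 10, 0, -10, -50, -50, -40, -20, -10],
    "C": [0, 0, 0, 0, 0, 0, -5, -10, -13, -10, -5, 0, 0, 0, 0, 0, 0],
    "M": [10, 20, 25, 30, 35, 40, 45, 50, 53, 50, 45, 40, 35, 30, 25, 20, 10],
    "P": [-10, -20, -40, -60, -80, -100, -120, -140, -77, -60, -30, -20, -10, 0, 0, 0, 0],
}


def helix_score(sequence, i):
    """Sum the helix parameters of the residues at i-8 .. i+8 (i is 0-based)."""
    total = 0
    for offset in range(-8, 9):
        j = i + offset
        # Positions outside the sequence and unknown residues contribute nothing.
        if 0 <= j < len(sequence) and sequence[j] in HELIX:
            total += HELIX[sequence[j]][offset + 8]
    return total


sequence = "ISGARNIERHELIXPREDICT"
for i, residue in enumerate(sequence):
    print(i + 1, residue, helix_score(sequence, i))
```

For a window of 17 alanines the score at the centre is simply the sum of the ala row, 525.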
Example sequence for the GOR helix parameters:
I S G A R N I E R H E L I X P R E D I C T
GOR Prediction
beta sheet
helix
Secondary Structure Prediction
Recent methods
Neural networks = flexible statistics
Multiple alignments = variability
Heuristics = common sense
Or a combination of the above
Accuracy ~ 70%
Heuristics
Conserved parts are structurally and/or
functionally important
Segments with many gaps must be in loop
regions
Secondary Structure Prediction
Strategy
Use as many methods as possible
Use homologous sequences
Combine predictions into consensus prediction
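Combining predictions into a consensus can be as simple as a per-residue majority vote. A minimal Python sketch, assuming each method returns a string of the same length with H (helix), E (strand) and C (coil); the three example predictions are invented for illustration, and marking ties with `-` is a design choice of this sketch:

```python
from collections import Counter


def consensus_prediction(predictions, tie="-"):
    """Per-residue majority vote over equally long prediction strings."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    consensus = []
    for states in zip(*predictions):
        ranked = Counter(states).most_common()
        # A tie between the two most frequent states is left undecided.
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            consensus.append(tie)
        else:
            consensus.append(ranked[0][0])
    return "".join(consensus)


# Hypothetical output of three prediction methods for the same 9 residues.
predictions = [
    "HHHHCCEEE",
    "HHHCCCEEC",
    "CHHHHCEEE",
]
print(consensus_prediction(predictions))  # HHHHCCEEE
```

Using an odd number of methods reduces ties; weighting each method by its expected accuracy is a natural extension.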
Why can’t it be 100% correct?
All current 2D prediction schemes are based
upon observation of occurrence of 2D elements in
3D structures
Deduction of 2D elements from structures is
ambiguous!
DSSP, Stride, and the PDB (human) annotation do not
always agree upon the assigned elements
Do these residues still belong to the helix?