Parallel Computational
Biochemistry
Frank Dehne
www.dehne.net
Proteins, DNA, etc.
DNA encodes the
information necessary
to produce proteins
Proteins are the main
molecular building blocks of life
(for example, structural proteins,
enzymes)
• Proteins are formed from a chain of
molecules called amino acids
• The DNA sequence encodes the amino acid
sequence that constitutes the protein
• There are twenty amino acids found in proteins,
denoted by A, C, D, E, F, G, H, I, ...
Multiple Sequence Alignment
Databases of Biological
Sequences
Example entry (FASTA format):
>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
Database sizes:
• NCBI: 14,976,310 sequences, 15,849,921,438 nucleotides
• Swiss-Prot: 104,559 sequences, 38,460,707 residues
• PDB: 17,175 structures
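Database entries like the one on this slide are plain text: a header line starting with ">" followed by sequence lines. The following is a minimal sketch of reading such text and counting residues per record. The function name `parse_fasta` and the two toy records are hypothetical and are not taken from any database.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Hypothetical two-record input used only for illustration.
example = """>seq1 first toy record
ACDE
FGH
>seq2 second toy record
IKLM
"""
for name, seq in parse_fasta(example):
    print(name, len(seq))
```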
Sequence comparison
• Compare one sequence (target) to many
sequences (database search)
• Compare more than two sequences
simultaneously
Applications
• Phylogenetic analysis
• Identification of conserved motifs and
domains
• Structure prediction
Phylogenetic Analysis
Structure Prediction
> RICIN GLYCOSIDASE
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
Protein sequences
Protein structures
Genomic
sequences
Our Contributions
• Parallel min vertex cover for improved
sequence alignments
(to appear in Journal of Computer and System Sciences)
• Parallel Clustal W (ICCSA 2003)
• In progress: “Clustal XP” portal at
http://cgm.dehne.net
Clustal W
Progressive Alignment
Pairwise distance matrix:
                  [1]    [2]    [3]    [4]
[1] Scerevisiae
[2] Celegans      0.640
[3] Drosophila    0.634  0.327
[4] Human         0.630  0.408  0.420
[5] Mouse         0.619  0.405  0.469  0.289
Guide tree leaves: Human, Mouse, Drosophila, C.elegans, S.cerevisiae
1. Do pairwise alignment of all
sequences and calculate
distance matrix
2. Create a guide tree based
on this pairwise distance
matrix
3. Align progressively following guide tree.
• start by aligning most closely related pairs of sequences
• at each step align two sequences or one to an existing subalignment
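Steps 1 and 2 can be illustrated with the five-species distance matrix from this slide. The sketch below derives a merge order by average-linkage (UPGMA-style) clustering. That choice is an assumption made for brevity: Clustal W builds its guide tree by neighbor joining, so the topology produced here may differ from the tree drawn on the slide. The function name `guide_tree` is hypothetical.

```python
from itertools import combinations

# Pairwise distances from the slide (lower triangle of the matrix).
NAMES = ["S.cerevisiae", "C.elegans", "Drosophila", "Human", "Mouse"]
LOWER = [
    [],
    [0.640],
    [0.634, 0.327],
    [0.630, 0.408, 0.420],
    [0.619, 0.405, 0.469, 0.289],
]

def guide_tree(names, lower):
    """Average-linkage clustering; returns the guide tree as nested tuples."""
    # dist maps an unordered pair of leaf names to their distance.
    dist = {}
    for i, row in enumerate(lower):
        for j, d in enumerate(row):
            dist[frozenset((names[i], names[j]))] = d

    # Each cluster is (subtree, tuple_of_leaf_names).
    clusters = [(n, (n,)) for n in names]

    def avg(a, b):
        pairs = [(x, y) for x in a[1] for y in b[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters, as in "align most closely related first".
        a, b = min(combinations(clusters, 2), key=lambda ab: avg(*ab))
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]

print(guide_tree(NAMES, LOWER))
```

The first merge is Human with Mouse (distance 0.289), which matches the rule of starting the progressive alignment with the most closely related pair.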
Parallel Clustal
• Parallel pairwise (PW)
alignment matrix
• Parallel guide tree
calculation
• Parallel progressive
alignment
Relative Speedup
Clustal XP vs. SGI
SGI data taken from: Dmitri Mikhailov, Haruna Cofer, and Roberto Gomperts,
"Performance Optimization of Clustal W: Parallel Clustal W, HT Clustal, and MULTICLUSTAL"
Parallel Clustal - Improvements
• Optimization of input parameters
– scoring matrices, gap penalties - requires many
repetitive Clustal W calculations with various
input parameters.
• Minimum Vertex Cover
– use minimum vertex cover to remove erroneous
sequences, and identify clusters of highly
similar sequences.
Minimum Vertex Cover
TASK: remove smallest
number of gene
sequences that
eliminates all conflicts
NP-complete
Conflict Graph
– vertex: sequence
– edge: conflict (e.g.
alignment with very
poor score)
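A minimal sketch of the conflict graph idea, assuming pairwise alignment scores are available as a dictionary and that a score below some threshold counts as a conflict. The score values, the threshold, and all function names are hypothetical. The brute-force search is only meant for tiny examples, since the problem is NP-complete.

```python
from itertools import combinations

def conflict_graph(scores, threshold):
    """Edges join pairs of sequences whose alignment score is below threshold."""
    return {frozenset(pair) for pair, s in scores.items() if s < threshold}

def is_vertex_cover(edges, removed):
    """True if removing the given sequences eliminates every conflict."""
    return all(edge & removed for edge in edges)

def min_vertex_cover_bruteforce(vertices, edges):
    """Try all subsets by increasing size; exponential, for toy inputs only."""
    for size in range(len(vertices) + 1):
        for cand in combinations(sorted(vertices), size):
            if is_vertex_cover(edges, set(cand)):
                return set(cand)

# Hypothetical scores: s2 aligns poorly with every other sequence.
scores = {
    ("s1", "s2"): 5, ("s1", "s3"): 40, ("s2", "s3"): 8,
    ("s3", "s4"): 50, ("s2", "s4"): 3,
}
edges = conflict_graph(scores, threshold=10)
print(min_vertex_cover_bruteforce({"s1", "s2", "s3", "s4"}, edges))
```

In this toy input, removing the single sequence s2 eliminates all three conflicts.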
FPT Algorithms
• Phase 1: Kernelization
Reduce problem to
size f(k)
• Phase 2: Bounded
Tree Search
Exhaustive tree search;
exponential in f(k)
Kernelization
Buss's Algorithm for k-vertex cover
• Let G=(V,E) and let S be the subset of
vertices with degree greater than k.
• Remove S and all incident edges:
G -> G', k -> k' = k - |S|.
• IF G' has more than k · k' edges
THEN no k-vertex cover exists
ELSE start bounded tree search on G'
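The kernelization step can be sketched as follows. This is an illustrative single-pass version, not the authors' code: it treats a vertex as forced into the cover when its degree is strictly greater than k (the usual threshold for Buss's rule, since otherwise all of its more than k neighbors would have to be chosen). The function name `buss_kernel` and the edge-set representation are assumptions.

```python
def buss_kernel(edges, k):
    """Return (S, reduced_edges, k_prime), or None if no k-vertex cover exists."""
    edges = {frozenset(e) for e in edges}

    # Count the degree of every vertex.
    degree = {}
    for e in edges:
        for v in e:
            degree[v] = degree.get(v, 0) + 1

    # S: vertices that must be in every cover of size at most k.
    s = {v for v, d in degree.items() if d > k}

    # G': remove S and all incident edges.
    reduced = {e for e in edges if not (e & s)}
    k_prime = k - len(s)

    # Every remaining vertex has degree at most k, so k' vertices
    # can cover at most k * k' edges.
    if k_prime < 0 or len(reduced) > k * k_prime:
        return None
    return s, reduced, k_prime

# Hypothetical example: a star with four leaves plus one separate edge.
star = [("c", "a1"), ("c", "a2"), ("c", "a3"), ("c", "a4"), ("x", "y")]
print(buss_kernel(star, 2))
```

Here the center c has degree 4 > 2 and is forced into the cover, leaving the single edge x-y to be solved with k' = 1.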
Bounded Tree Search
[Figure: search tree rooted at VC={}; each child node extends the partial vertex cover (VC+=...)]
Case 1: simple path of length 3
in graph G': path v - v1 - v2 - v3
search tree: VC={...} branches into
– VC+={v,v2}
– VC+={v1,v2}
– VC+={v1,v3}
remove selected vertices from G'
k' -= 2
Case 2: 3-cycle
in graph G': cycle v - v1 - v2 - v
search tree: VC={...} branches into
– VC+={v,v1}
– VC+={v1,v2}
– VC+={v,v2}
remove selected vertices from G'
k' -= 2
Case 3: simple path of length 2
in graph G': path v - v1 - v2
search tree: VC={...} has a single child
– VC+={v1}
remove v1, v2 from G'
k' -= 1
Case 4: simple path of length 1
in graph G': path v - v1
search tree: VC={...} has a single child
– VC+={v}
remove v, v1 from G'
k' -= 1
Sequential Tree Search
Depth first search
– backtrack when k'=0 and
G'<>{} ("dead end")
– stop when solution found
(G'={}, k'>=0)
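The backtracking rule above can be sketched with a deliberately simplified branching scheme: pick any remaining edge (u, v) and branch on putting u or v into the cover. This two-way branching is a stand-in for the four-case analysis on the preceding slides, chosen only to keep the example short. The function name `vertex_cover_dfs` is hypothetical.

```python
def vertex_cover_dfs(edges, k):
    """Return a vertex cover of size <= k as a set, or None if none exists."""
    edges = [tuple(e) for e in edges]

    def search(remaining, budget, chosen):
        if not remaining:          # G' = {}: solution found, stop
            return set(chosen)
        if budget == 0:            # k' = 0 but edges remain: dead end, backtrack
            return None
        u, v = remaining[0]
        for pick in (u, v):        # one endpoint of the edge must be in the cover
            rest = [e for e in remaining if pick not in e]
            found = search(rest, budget - 1, chosen + [pick])
            if found is not None:
                return found
        return None

    return search(edges, k, [])

# Hypothetical example: a path a - b - c - d.
path = [("a", "b"), ("b", "c"), ("c", "d")]
print(vertex_cover_dfs(path, 2))
print(vertex_cover_dfs(path, 1))
```

The search tree has depth at most k and two children per node, so the running time is exponential in k only, which is the point of running it after kernelization.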
Parallel Tree Search
Basic Idea:
– Build top log p levels of
the search tree (T')
– every proc. starts depth-first search at one leaf of
T'
– randomize depth-first
search by selecting
random child
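The basic idea can be sketched as a sequential simulation. The code below expands the search tree level by level until it has at least p leaves (about log p levels with two-way branching), then runs a randomized depth-first search from each leaf. Several things here are assumptions: the two-way edge branching replaces the four-case analysis, the "processors" are simulated one after another in a loop rather than run under MPI, and all function names are hypothetical.

```python
import random

def branch(state):
    """Expand one search-tree node into its children (branch on one edge)."""
    edges, budget, chosen = state
    if not edges or budget == 0:
        return []                      # solved node or dead end: no children
    u, v = edges[0]
    return [([e for e in edges if pick not in e], budget - 1, chosen + [pick])
            for pick in (u, v)]

def top_levels(edges, k, p):
    """Build T': expand level by level until there are at least p leaves."""
    level = [([tuple(e) for e in edges], k, [])]
    while len(level) < p:
        nxt = []
        for state in level:
            children = branch(state)
            nxt.extend(children if children else [state])
        if nxt == level:               # nothing left to expand
            break
        level = nxt
    return level

def randomized_dfs(state, rng):
    """Depth-first search that visits the children in random order."""
    edges, budget, chosen = state
    if not edges:
        return set(chosen)
    children = branch(state)
    rng.shuffle(children)
    for child in children:
        found = randomized_dfs(child, rng)
        if found is not None:
            return found
    return None

def parallel_search(edges, k, p, seed=0):
    """Simulate the processors: each leaf of T' is searched independently."""
    for i, leaf in enumerate(top_levels(edges, k, p)):
        rng = random.Random(seed + i)  # one random stream per simulated processor
        found = randomized_dfs(leaf, rng)
        if found is not None:
            return found
    return None

# Hypothetical example: a 5-cycle, whose minimum vertex cover has size 3.
cycle = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "a")]
print(parallel_search(cycle, 3, p=4))
```

In a real parallel run the leaves would be searched concurrently and the first processor to find a solution would stop the others; the loop here only shows how the work is partitioned.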
Analysis: Balls-in-bins
sequential depth-first search path
total length: L, #solutions: m
expected sequential time (rand. distr.): L/(m+1)
parallel search path
expected parallel time (rand. distr.): p + L/(p(m+1))
expected speedup: p / (1 + (m+1)/L)
if m << L then expected speedup = p
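The two expected-time estimates can be checked with a small Monte Carlo experiment. The model below is an assumption, not the authors' simulation: m solutions are placed uniformly at random on a path of L positions, the sequential search scans from the start, and the parallel search is modelled as p scanners that start at the heads of p equal segments and advance in lockstep. The additive term p in the parallel estimate (the cost of building T') is not modelled. All function names are hypothetical.

```python
import random

def sequential_hit(L, m, rng):
    """Position of the first of m random solutions on a path of length L."""
    return min(rng.sample(range(1, L + 1), m))

def parallel_hit(L, m, p, rng):
    """Steps until any of p lockstep scanners, one per segment, hits a solution."""
    seg = L // p                       # assumes p divides L
    return min((pos - 1) % seg for pos in rng.sample(range(1, L + 1), m)) + 1

def mean_time(fn, trials, *args):
    """Average of fn(*args) over the given number of trials."""
    return sum(fn(*args) for _ in range(trials)) / trials

if __name__ == "__main__":
    rng = random.Random(1)
    L, m, p = 10_000, 9, 10
    print("sequential:", mean_time(sequential_hit, 3000, L, m, rng),
          "estimate L/(m+1):", L / (m + 1))
    print("parallel:  ", mean_time(parallel_hit, 3000, L, m, p, rng),
          "estimate L/(p(m+1)):", L / (p * (m + 1)))
```

Under this model the simulated means should land close to L/(m+1) and L/(p(m+1)) respectively, as long as m is small compared with L.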
Simulation Experiment
[Plot: predicted speedup (0 to 200) vs. number of processors (0 to 200) for L = 1,000,000; one curve each for m = 10, 100, 1,000, 10,000, 100,000]
Implementation
• test platform:
– 32 node HPCVL Beowulf cluster
– each node: dual 1.4 GHz Intel Xeon, 512 MB
RAM, 60 GB disk
– gcc and LAM/MPI on Red Hat Linux 7.2
• code-s: Sequential k-vertex cover
• code-p: Parallel k-vertex cover
Test Data
• Protein sequences
• Same protein from several hundred species
• Each protein sequence a few hundred amino
acid residues in length
• Obtained from the National Center for
Biotechnology Information
(http://www.ncbi.nlm.nih.gov/)
• Somatostatin
– neuropeptide involved in the regulation of
many functions in different organ systems
– Clustal Threshold = 10, |V| = 559, |E| = 33652,
k = 273, k' = 255
• WW
– small protein domain that binds proline rich
sequences in other proteins and is involved in
cellular signaling
– Clustal Threshold = 10, |V| = 425, |E| = 40182,
k = 322, k' = 318
• Kinase
– large family of enzymes involved in cellular
regulation
– Clustal Threshold = 16, |V| = 647, |E| = 113122,
k = 497, k' = 397
• SH2 (src-homology domain 2)
– involved in targeting proteins to specific sites in
cells by binding to phosphotyrosine
– Clustal Threshold = 10, |V| = 730, |E| = 95463,
k = 461, k' = 397
• Thrombin
– protease involved in the blood coagulation
cascade; promotes blood clotting by
converting fibrinogen to fibrin
– Clustal Threshold = 15, |V| = 646, |E| = 62731,
k = 413, k' = 413
• PHD (pleckstrin homology domain)
– involved in cellular signaling
– Clustal Threshold = 10, |V| = 670, |E| = 147054,
k = 603, k' = 603
• Random Graph
|V| = 220, |E| = 2155, k = 122, k' = 122
• Grid Graph
|V| = 289, |E| = 544, k = 145, k' = 145
|VC| ~ |V| / 2
k' = k
Sequential Times
Kinase, SH2, Thrombin: n/a
Code-p on Virtual Proc.
Parallel Times
Speedup: Somatostatin
Speedup: WW
Speedup: Rand. Graph
Speedup: Grid Graph
Clustal XP (in progress)
X : Extended
P : Parallel
Components: Web Portal, Parallel FPT MVC + …, Parallel Clustal, Clustal W
http://cgm.dehne.net
Clustal XP