Shape Modeling and Matching in
Protein Structure Identification
Sasakthi Abeysinghe, Tao Ju
Washington University, St. Louis, USA
Matthew Baker, Wah Chiu
Baylor College of Medicine, Houston, USA
Shape Matching
• Shape comparison
– How similar are shape A and shape B?
– Application: 3D model retrieval
• Shape alignment
– What is the best alignment of A onto B?
– Application: object recognition and registration
[Figure labels: 1D Protein Sequence; 3D Protein Image]
Structural Biology
• Protein: a sequence of amino acids
– Folds into a 3D structure in order to interact with other molecules
– Protein function derived from its 3D structure
…
• Identifying protein structure
– Imaging methods: X-ray, NMR
– Drawback: cannot resolve large assemblies, such as viruses.
Domain Problem
• Cryo-electron microscopy (Cryo-EM)
– Produces 3D density volumes
– Drawback: insufficient resolution to resolve atom locations
• How to determine protein structure in a cryo-EM volume?
Shape Matching Formulation
• Matching 1D protein sequence with 3D density volume
• Intermediate goal: Matching alpha-helices
– One of the basic building blocks in a protein
– Identified as cylindrical densities in the volume [Baker 07]
• How to align the protein sequence with the cryo-EM
volume to match the two sets of helices?
Method Overview
• Compatible shape representation
– 1D sequence and 3D volume as attributed relational graphs
• Graph-based shape matching
– A new constrained graph matching problem and an optimal
solution
– Error-tolerant (inexact) matching
Shape Representation
• Protein sequence as attributed relational graph
– An edge: a helix segment or a non-helix segment
• Attribute: number of amino acids in the segment
– A node: end of a helix or end of the sequence
– Add additional edges that skip at most m helix segments
• To allow matching with a cryo-EM volume that has missing helices
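The sequence graph described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `build_sequence_graph`, `Segment`, and `Edge` are hypothetical, and the rule that a skip edge is typed as a non-helix ("loop") edge whose attribute is the total number of amino acids it spans is an assumption.

```python
from typing import List, Tuple

Segment = Tuple[str, int]          # ("helix" or "loop", number of amino acids)
Edge = Tuple[int, int, str, int]   # (from_node, to_node, kind, amino acids)

def build_sequence_graph(segments: List[Segment], m: int) -> List[Edge]:
    """Build edges over nodes 0..len(segments); node i precedes segment i."""
    n = len(segments)
    # Base edges: segment i joins node i to node i + 1.
    edges = [(i, i + 1, kind, count) for i, (kind, count) in enumerate(segments)]
    # Skip edges (assumed form): a loop edge that jumps over 1..m helices.
    for i in range(n + 1):
        # A skip edge starts at the sequence start or right after a helix.
        if not (i == 0 or segments[i - 1][0] == "helix"):
            continue
        for j in range(i + 2, n + 1):
            # It ends at the sequence end or right before a helix.
            ends_ok = j == n or segments[j][0] == "helix"
            skipped = sum(1 for kind, _ in segments[i:j] if kind == "helix")
            if ends_ok and 1 <= skipped <= m:
                span = sum(count for _, count in segments[i:j])
                edges.append((i, j, "loop", span))
    return edges

segments = [("loop", 5), ("helix", 10), ("loop", 4), ("helix", 8), ("loop", 3)]
graph = build_sequence_graph(segments, m=1)
```

With `m=1`, each helix in the example can be bypassed individually, so a matching chain can still be found when one helix is missing from the volume.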
Shape Representation
• Graph representation of Cryo-EM volume via skeletons
– 3D Skeleton [Ju 06] builds connectivity among detected helices
– An edge: a detected helix or a skeleton path between two helices
• Attribute: length of the helix or skeleton path
– A node: end of a helix or end of the protein
– Add additional edges between helix-ends less than d apart
• To account for missing helix connectivity in the skeleton
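A rough sketch of the volume graph, under stated assumptions: helices are given as pairs of 3D end points, skeleton paths are given as already-measured lengths between helix ends, and the extra edges use straight-line (Euclidean) distance as their attribute. The function name and data layout are hypothetical and are not taken from the talk.

```python
import math
from itertools import combinations

def build_volume_graph(helices, skeleton_paths, d):
    """helices: list of (p0, p1) 3D end points.
    skeleton_paths: list of (end_a, end_b, length), where an end is
    (helix_index, 0 or 1). Returns (end_a, end_b, kind, length) edges."""
    edges = []
    # One helix edge per detected helix; attribute is the helix length.
    for h, (p0, p1) in enumerate(helices):
        edges.append(((h, 0), (h, 1), "helix", math.dist(p0, p1)))
    # One loop edge per skeleton path between two helix ends.
    connected = set()
    for a, b, length in skeleton_paths:
        edges.append((a, b, "loop", length))
        connected.add(frozenset((a, b)))
    # Extra loop edges between ends of different helices closer than d,
    # covering connectivity the skeleton may have missed.
    ends = [((h, e), helices[h][e]) for h in range(len(helices)) for e in (0, 1)]
    for (a, pa), (b, pb) in combinations(ends, 2):
        if a[0] == b[0]:
            continue  # both ends belong to the same helix
        gap = math.dist(pa, pb)
        if gap < d and frozenset((a, b)) not in connected:
            edges.append((a, b, "loop", gap))
    return edges

helices = [((0, 0, 0), (0, 0, 10)), ((0, 3, 10), (0, 3, 0)), ((20, 0, 0), (20, 0, 10))]
paths = [((0, 1), (1, 0), 4.0)]
volume_graph = build_volume_graph(helices, paths, d=5)
```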
Shape Matching - Problem
• Finding two matching chains of helices
– Same number of edges
– Alternating types between non-helix and helix
– Minimal attribute matching error
• Uniqueness of this problem:
– Inexact: not all edges/nodes in the two graphs are used in the
matched sequence
– Constrained: the match must have a linear topology
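The three chain conditions can be illustrated with a small cost function. The talk does not state its error formula, so the one below is an assumption: amino-acid counts are converted to lengths using the standard alpha-helix rise of 1.5 Å per residue and an assumed 3.8 Å per residue for non-helix segments, and the error is the sum of absolute length differences. All names are hypothetical.

```python
HELIX_RISE = 1.5  # Å per residue along an alpha-helix axis
LOOP_SPAN = 3.8   # Å per residue for a non-helix segment (assumed)

def to_angstrom(kind, residues):
    """Convert a residue count to an expected length in Å."""
    return residues * (HELIX_RISE if kind == "helix" else LOOP_SPAN)

def is_alternating(chain):
    """True if consecutive edges switch between helix and non-helix."""
    return all(a[0] != b[0] for a, b in zip(chain, chain[1:]))

def chain_error(seq_chain, vol_chain):
    """seq_chain: [(kind, residues)], vol_chain: [(kind, length in Å)]."""
    if len(seq_chain) != len(vol_chain):
        raise ValueError("chains must have the same number of edges")
    if not (is_alternating(seq_chain) and is_alternating(vol_chain)):
        raise ValueError("edge types must alternate")
    total = 0.0
    for (seq_kind, residues), (vol_kind, length) in zip(seq_chain, vol_chain):
        if seq_kind != vol_kind:
            raise ValueError("matched edges must have the same type")
        total += abs(to_angstrom(seq_kind, residues) - length)
    return total

seq_chain = [("helix", 10), ("loop", 2), ("helix", 8)]
vol_chain = [("helix", 14.0), ("loop", 7.6), ("helix", 12.0)]
error = chain_error(seq_chain, vol_chain)
```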
Shape Matching - Review
• Previous work on graph matching
– Exact matching
• Graph mono-morphism [Wong 90]
• Sub-graph isomorphism [Ullmann 76, Cordella 99]
– Inexact matching
• A* search [Nilsson 80], simulated annealing [Herault 90], neural
networks [Feng 94], probabilistic relaxation [Christmas 95],
genetic algorithms [Wang 97], graph decomposition [Messmer 98]
• All designed for un-constrained problems where there is no
restriction on the topology of the matched sub-graphs.
Shape Matching - Method
• Key idea: utilize the linearity of chains.
• Performing depth-first tree-search
– Append matching nodes to the incomplete chain with minimal matching error
• A*-search
– Reduce node expansion by estimating future matching error
– Optimal if the estimated future error never exceeds the actual error
– 3 future error functions are designed
[Figure: search tree of node pairs from the Sequence Graph and the Volume Graph, rooted at {1,1}; each expanded pair is labeled with a matching error (values shown range from 40 to 99)]
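A simplified sketch of A*-search on this kind of chain matching. It is not the authors' algorithm: it assumes every sequence helix is matched to a distinct volume helix (no missing helices), ignores helix direction, and uses one hypothetical future-error estimate rather than any of the three functions from the talk. The estimate sums, for each unmatched sequence helix, the smallest length error against any unused volume helix; because loop errors are never negative, this never exceeds the true remaining error.

```python
import heapq

def astar_match(seq_helices, seq_loops, vol_helices, vol_dist):
    """seq_helices[i]: expected length of sequence helix i.
    seq_loops[i]: expected loop length between sequence helices i and i+1.
    vol_helices[j]: length of volume helix j.
    vol_dist[a][b]: path length between volume helices a and b.
    Returns (total error, volume helix chosen for each sequence helix)."""
    k = len(seq_helices)
    if len(vol_helices) < k:
        raise ValueError("this sketch needs at least as many volume helices")

    def future_error(t, used):
        # Lower bound on the error of matching sequence helices t..k-1.
        free = [v for j, v in enumerate(vol_helices) if j not in used]
        return sum(min(abs(seq_helices[s] - v) for v in free) for s in range(t, k))

    # Heap entries: (error so far + estimate, error so far, partial chain).
    heap = [(future_error(0, ()), 0.0, ())]
    while heap:
        _, g, chain = heapq.heappop(heap)
        t = len(chain)
        if t == k:
            return g, list(chain)  # first complete chain popped is optimal
        for j in range(len(vol_helices)):
            if j in chain:
                continue
            step = abs(seq_helices[t] - vol_helices[j])
            if t > 0:
                step += abs(seq_loops[t - 1] - vol_dist[chain[-1]][j])
            new_chain = chain + (j,)
            new_g = g + step
            heapq.heappush(heap, (new_g + future_error(t + 1, new_chain), new_g, new_chain))
    return None

vol_dist = [[0, 9, 4, 7], [9, 0, 6, 3], [4, 6, 0, 8], [7, 3, 8, 0]]
best = astar_match([15.0, 9.0, 12.0], [6.0, 4.0], [12.5, 15.5, 9.0, 20.0], vol_dist)
```

The priority queue always expands the partial chain with the lowest estimated total error, so branches whose estimate already exceeds the best complete chain are never expanded.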
Experimental Setup
• Test data
– Simulated data: 8 proteins (taken from Protein Data Bank)
– Authentic data: 3 proteins (produced at Baylor)
• Test modes
– Automatic
– With a few user-specified helix correspondences
• Validation with the actual helix correspondence
– Produce a list of candidates sorted by their matching errors
– Find out where the actual correspondence ranks in the list
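The validation step amounts to sorting candidates by matching error and locating the known correspondence. A minimal sketch, with a hypothetical function name and made-up candidate values used purely for illustration:

```python
def rank_of_actual(candidates, actual):
    """candidates: list of (matching error, correspondence).
    Returns the 1-based rank of `actual`, or None if it is not listed."""
    ordered = sorted(candidates, key=lambda c: c[0])
    for rank, (_, correspondence) in enumerate(ordered, start=1):
        if correspondence == actual:
            return rank
    return None

# Illustrative values only, not results from the talk.
candidates = [(58.0, (1, 2, 0)), (51.0, (1, 0, 2)), (99.0, (2, 1, 0))]
rank = rank_of_actual(candidates, (1, 2, 0))
```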
Results - 1
• Bluetongue Virus (simulated, 10 helices, 0 missing)
– Actual correspondence ranks #1
[Figure panels: Sequence; Cryo-EM volume and its skeleton; Top Matching]
Results - 2
• Human Insulin Receptor (simulated, 9 helices, 1 missing)
– Actual correspondence ranks #1
[Figure panels: Sequence; Cryo-EM volume and its skeleton; Top Matching]
Results - 3
• Bacteriophage P22 (authentic, 11 helices, 6 missing)
– Actual correspondence ranks #4
[Figure panels: Sequence; Cryo-EM skeleton; Top Matching; Actual Correspondence]
Results - 4
• Triose Phosphate Isomerase (simulated, 12 helices, 3 missing)
– Before user-specification: actual correspondence not in the candidate list
– Given 2 specified helix pairs: actual correspondence ranks #9
[Figure panels: Sequence; Cryo-EM skeleton with 2 user-specified helix pairs; Top Matching without user specification; Actual Correspondence]
Results - Summary
• Among the 11 proteins, the correct correspondence ranks as follows in the candidate list computed by our method:
– Top 1: 4 proteins
– Within top 10: 2 proteins (1 simulated)
– Top 1 after user-interaction: 2 proteins (both simulated)
• 4 specified helix pairs in each of the 14- and 20-helix proteins
– Within top 10 after user-interaction: 3 proteins
• 2 specified helix pairs in each of the 6-, 9-, and 12-helix proteins
• Performance
– Under 4 seconds for proteins with 20 helices
– Compare: [Wu 05] uses exhaustive search and takes 16 hours for finding
correspondences in proteins with 8 helices
Conclusion
• Formulate protein structure identification as shape
matching
– 1D protein sequence vs. 3D cryo-EM density volume
– Compatible representation of disparate biological data as graphs
• Formulate a constrained inexact matching problem and
propose an optimal solution
– Based on A*-search
• Validation on simulated and authentic data
Future Work (Bio)
• Incorporating beta-sheets for improved accuracy
– Challenge: the match is no longer a linear chain
• Integrating homology and ab initio modeling
– Utilizing known 3D structure of segments
– Refining the alignment by molecular energy minimization
Future Work (CS)
• Faster graph matching algorithm
– Explore variants of A*-search to reduce running time for larger
proteins (>20 helices)
• Better skeleton generation
– Generate skeletons directly from gray-scale density volume for
iso-value-independent representation
– Utilize cell-complex-based skeleton for better skeleton geometry
• Currently used for topology editing, see [Ju, Zhou and Hu. Siggraph 2007]
Pacific Graphics • Hawaii • 2007
• Oct 29 – Nov 2, in Maui, Hawaii
Conference Chair: Ron Goldman
Program co-chairs: Marc Alexa, Steven Gortler, Tao Ju