Randomized Algorithms for Three Dimensional Protein Structures Comparison
Yaw-Ling Lin
Dept Computer Sci and Info Engineering, Providence University, Taiwan
E-mail: yllin@pu.edu.tw
WWW: http://www.cs.pu.edu.tw/~yawlin

Outline
• Introduction
• Protein Structures
• 3D structure comparisons
• Algorithms
• Benchmarking
• Comparing with other systems
• Future Works

Introduction

What are proteins?
• Structural framework (keratin, collagen)
• Transport and storage of small molecules (hemoglobin)
• Transmit information (hormones, receptors)
• Antibodies
• Blood clotting factors
• Enzymes
The protein is created in the cell as a unique sequence of amino acids.

Sequence (ACMVLLCEVEKYP…) → folding → Structure → Function ?

Background and Problem definition
• Many protein sequences are known today (non-redundant database).
• This number keeps growing rapidly (large scale sequencing projects).
• The function of 40-50% of the new proteins is unknown.
Understanding biological function is important for:
• Study of fundamental biological processes
• Drug design
• Genetic engineering
What can bioinformatics do for us?

Drug Discovery
• Target Identification – Which protein to inhibit?
• Lead discovery & optimization – What sort of molecule will bind to this protein?
• Toxicology – Side effects, target specificity
• Pharmacokinetics – Metabolization and transport

Drug Development Life Cycle
• Discovery (2 to 10 Years)
• Preclinical Testing (Lab and Animal Testing)
• Phase I (20-30 Healthy Volunteers used to check for safety and dosage)
• Phase II (100-300 Patient Volunteers used to check for efficacy and side effects)
• Phase III (1000-5000 Patient Volunteers used to monitor reactions to long-term drug use)
• FDA Review & Approval
• Post-Marketing Testing
Timeline figure (axis: years 0 to 16). Annotations: 7 – 15 Years; $600-700 Million; with the aid of bioinformatics.

Drug lead screening
• 5,000 to 10,000 compounds screened
• 250 Lead Candidates in Preclinical Testing
• 5 Drug Candidates enter Clinical Testing
• 80% Pass Phase I
• 30% Pass Phase II
• 80% Pass Phase III
• One drug approved by the FDA

Drug Lead Screening & Docking
Complementarity:
• Shape
• Chemical
• Electrostatic

Protein Structures

Figure slides:
• Levels of structure in proteins
• Myoglobin structure (two slides)
• Myoglobin in solution
• Three dimensional structures of cytochrome c, lysozyme and ribonuclease
• PDB file format
• Rasmol-Structure (PDB: 101M, PDB: 2DHB)
• Rasmol-Group (PDB: 101M, PDB: 2DHB)

Structural classifications
• SCOP http://scop.mrc-lmb.cam.ac.uk/scop/
• CATH http://www.biochem.ucl.ac.uk/bsm/cath_new/index.html
• FSSP http://www.ebi.ac.uk/dali/fssp/fssp.html

Structure comparison algorithms
• Dali
• CE
• Structal
• VAST

Contact matrix and the Dali method
Contact matrix: an $n \times n$ matrix, where $n$ is the number of residues and $d(i, j)$ is the distance between the Cα atoms of residues $i$ and $j$.
Idea: Similar structures have similar contact matrices.

From distance map to structural similarities
• Imagine a transparent distance map of one protein put on top of the map of another protein (Liisa Holm and Chris Sander, J. Mol. Biol. 233):
– Matching patches centered on the diagonal correspond to matching secondary structures.
– Matches of short distances off the diagonal correspond to tertiary conformations.
– Similarity score: unmatched residues do not contribute to the score.

DALI algorithm outline
• Step 1: Consider all possible pairs of 6x6 submatrices of the contact matrices. Such matrices are small enough that the problem can be solved optimally.
• Step 2: Assemble the alignments from step 1. Method – Monte Carlo algorithm.

CE (Shindyalov & Bourne, Protein Eng. 1998)
Protein Structure Alignment by Incremental Combinatorial Extension (CE) of the Optimal Path
• Define an alignment fragment pair (AFP) as a continuous segment of protein A aligned against a continuous segment of protein B (without gaps).
• An alignment is a path of AFPs such that for every two consecutive AFPs there may be gaps inserted into either A or B, but not into both. That is, for every two consecutive AFPs $i$ and $i+1$ (with $m$ the AFP length):

$$p^A_{i+1} = p^A_i + m \quad\text{and}\quad p^B_{i+1} = p^B_i + m$$
or
$$p^A_{i+1} > p^A_i + m \quad\text{and}\quad p^B_{i+1} = p^B_i + m$$
or
$$p^A_{i+1} = p^A_i + m \quad\text{and}\quad p^B_{i+1} > p^B_i + m$$

where $p^A_i$ is the starting position of AFP $i$ in protein A.

CE: What is a “good” AFP?
Define the distance between two different AFPs $i$ and $j$ as:

$$D_{ij} = \frac{1}{m} \sum_{k=1}^{m} \left| d_A\!\left(p^A_i + k - 1,\; p^A_j + m - k\right) - d_B\!\left(p^B_i + k - 1,\; p^B_j + m - k\right) \right|$$

$d_A(p, q)$ represents the distance between the alpha carbon atoms at positions $p$ and $q$ in protein A.

If you already have $n-1$ AFPs and consider adding the $n$-th AFP, do so only if:

$$(1)\quad D_{nn} < D_0$$
$$(2)\quad \frac{1}{n-1} \sum_{i=0}^{n-1} D_{in} < D_1$$
$$(3)\quad \frac{1}{n^2} \sum_{i=0}^{n} \sum_{j=0}^{n} D_{ij} < D_1$$

CE (cont.)
1. Select an initial AFP.
2. Build an alignment path by incrementally adding “good” AFPs that satisfy the conditions of paths.
3. Repeat step (2) until the proteins are completely matched, or until no good AFPs remain.
4. To assess the significance of the alignment, compare it to the alignments of random pairs of structures, and compute the Z-score based on the RMSD and number of gaps in the final alignment.

Structal (Levitt & Gerstein, PNAS 1998)
An initial equivalence is chosen, based on matching the ends of the two structures.
Repeat until convergence:
• Superimpose the two structures so as to minimize the RMS, given the equivalence
• Given the superposition, calculate the distances $d_{ij}$ between any atom $i$ in the first protein and any atom $j$ in the second protein
• Transform distances into similarities $s_{ij} = M / [1 + (d_{ij}/d_0)^2]$, where $M = 20$ and $d_0 = 2.24$ Å
• Apply dynamic programming to define a new set of equivalences

Structal (cont.)
1) Alignment fixed
2) Superimpose to minimize RMS
3) Calculate distances between all atoms
4) Use dynamic programming to find the best set of equivalences
5) Superimpose given the new alignment
6) Recalculate distances between all atoms

Approach based on comparing secondary structure arrangement
Motivation:
• Folds are often defined as arrangements of secondary structure elements (SSEs).
• Why not compare the arrangement of SSEs rather than going down to the atomic level?
Figure: 1EJ9, human topoisomerase

VAST – graph theoretical approach
• http://www2.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml
• Perform the comparison at the level of secondary structures, not residues.
• Treat each secondary structure element as a vector whose direction and length correspond to those of the element. Attributes of such a vector include the type of secondary structure, number of residues, etc.
• For each pair of secondary structure elements, provide a way of describing their relative spatial position – distance, angle, etc.
• VAST finds the maximal subset of secondary structures that are in the same relative positions in the compared protein structures and in the same order within the structure.

SCOP
Structural classification of proteins with a 5-level hierarchy:
• Domains: the individual entries
• Family: homologous proteins with significant sequence similarity
• Superfamily: protein families that share weak sequence similarity but with conserved functional residues (e.g. in active sites) – believed to be evolutionarily related
• Fold: protein superfamilies that share the same fold (not necessarily due to common evolutionary ancestry)
• Class: all-alpha, all-beta, alpha/beta, alpha+beta, membrane proteins, small proteins
The classification is based on manual analysis by experts (Dr. Alexey Murzin).
As of May 2002: 7 main classes, 686 folds, 1073 superfamilies, 1827 families.

CATH
Structural classification of proteins with a 5-level hierarchy:
• Protein chains: the individual entries
• Homologous superfamily: proteins with highly similar structures and functions.
• Topology: clusters according to the topological connections and numbers of secondary structures.
• Architecture: describes the gross orientation of secondary structures, independent of connectivities (assigned manually).
• Class: derived from secondary structure content; assigned automatically for more than 90% of protein structures.
The assignments of structures to topology families and homologous superfamilies are made by sequence and structure comparisons.
As of Jan 2002: 8 main classes, 46 architectures, 1453 topologies, more than 2000 superfamilies.

FSSP
Structural classification of proteins into a tree hierarchy:
• Protein domains: the individual entries (defined using the algorithm of Holm and Sander 1994)
• Start with an all-vs-all structure comparison of protein domains
• Domains are clustered automatically using the single linkage algorithm, based on the Z-scores of the structure similarity scores
• 3242 families of more than 30,000 structures as of June 2002

Algorithms
• Measurement: rmsd.
• Pair atoms of two structures by minimum bipartite matching.
• Fix one structure, and keep several 3-D orientations of the other.
• Randomly perturb these orientations, and shift to better positions until converging.
• Report the best rmsd score and orientation.
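
The steps of the Algorithms slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the MB-Align implementation: the minimum bipartite matching is brute-forced over permutations (feasible only for a handful of points, whereas the slides use LEDA), the matching cost is assumed to be squared distance, orientations are assumed to be three Euler angles, both point sets are assumed to be centred at the origin already, and the names `rotate`, `rmsd_matched` and `random_search` are hypothetical.

```python
import itertools
import math
import random


def rotate(p, ax, ay, az):
    """Rotate point p about the x, then y, then z axis (angles in radians)."""
    x, y, z = p
    y, z = y * math.cos(ax) - z * math.sin(ax), y * math.sin(ax) + z * math.cos(ax)
    z, x = z * math.cos(ay) - x * math.sin(ay), z * math.sin(ay) + x * math.cos(ay)
    x, y = x * math.cos(az) - y * math.sin(az), x * math.sin(az) + y * math.cos(az)
    return (x, y, z)


def rmsd_matched(a, b):
    """RMSD under the cheapest one-to-one pairing of a and b (equal sizes).

    Brute force over all permutations: a stand-in for a real minimum
    bipartite matching routine, usable only for very small inputs.
    """
    n = len(a)
    best = min(
        sum(math.dist(a[i], b[j]) ** 2 for i, j in enumerate(perm))
        for perm in itertools.permutations(range(n))
    )
    return math.sqrt(best / n)


def random_search(fixed, moving, n_orient=4, steps=200, sigma=0.3, seed=0):
    """Keep several orientations of `moving`, perturb each at random,
    and accept a perturbation only when it lowers the matched RMSD."""
    rng = random.Random(seed)

    def score(o):
        return rmsd_matched(fixed, [rotate(p, *o) for p in moving])

    # Start from the identity plus a few random orientations (assumed choice).
    orients = [(0.0, 0.0, 0.0)] + [
        tuple(rng.uniform(-math.pi, math.pi) for _ in range(3))
        for _ in range(n_orient - 1)
    ]
    scored = [(score(o), o) for o in orients]
    for _ in range(steps):
        for k, (s, o) in enumerate(scored):
            cand = tuple(angle + rng.gauss(0.0, sigma) for angle in o)
            cs = score(cand)
            if cs < s:  # shift to the better position
                scored[k] = (cs, cand)
    return min(scored)  # (best rmsd, best orientation)
```

Calling `random_search(fixed, moving)` on two small lists of `(x, y, z)` tuples returns the best score together with its orientation. A fixed number of steps replaces a convergence test here purely to keep the sketch short.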
INIT-S(N)
Figure panels: N=4, N=6, N=8, N=12, N=20

Figure slides:
• MB-Align Algorithm
• MB-Align Descriptions

3D Transformation
• 3D rotation is done around a rotation axis
• Fundamental rotations: about the x, y, or z axes
• Positive rotation: counter-clockwise when looking from the positive end of the axis toward the origin (that is, looking in the negative axis direction)

3D Transformation: rotation about Z
x’ = x cos(q) – y sin(q)
y’ = x sin(q) + y cos(q)
z’ = z

$$R_z(q) = \begin{pmatrix} \cos q & -\sin q & 0 & 0 \\ \sin q & \cos q & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

• OpenGL - glRotatef(q, 0,0,1)

3D Transformation: rotation about Y (z → x, x → y, y → z)
z’ = z cos(q) – x sin(q)
x’ = z sin(q) + x cos(q)
y’ = y

$$R_y(q) = \begin{pmatrix} \cos q & 0 & \sin q & 0 \\ 0 & 1 & 0 & 0 \\ -\sin q & 0 & \cos q & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

• OpenGL - glRotatef(q, 0,1,0)

3D Transformation: rotation about X (y → x, z → y, x → z)
y’ = y cos(q) – z sin(q)
z’ = y sin(q) + z cos(q)
x’ = x

$$R_x(q) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos q & -\sin q & 0 \\ 0 & \sin q & \cos q & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

• OpenGL - glRotatef(q, 1,0,0)

3D Transformation: arbitrary rotation axis
• Arbitrary rotation axis (rx, ry, rz)
• glRotatef(angle, rx, ry, rz)
• So, which way is a positive rotation?

Figure slides:
• Rotation
• Rotation Matrix

Perturbation
The orientation vector is perturbed to its neighborhood.
Figure labels: q; r, the normal vector.

Figure slides:
• Perturbation Algorithm
• MB-Align Algorithm

System Implementations
• OS: Linux/Red Hat 7.2 running on a Pentium-4 2800 MHz CPU with 1 GB RAM.
• Bioperl – PDB file format conversion
• Rotation/perturbation/integration – C programs
• Minimum bipartite matching – LEDA
• Rmsd – PROFIT

Figure slides:
• Benchmarking
• Benchmarks

Efficiencies of Strategies
• Local dice: have a separate die for each $s_i \in S$
• Global dice: share a common die across all $s_i \in S$

The End.
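
As a supplementary sketch for the 3D Transformation slides, the three fundamental rotations can be written as 4×4 homogeneous matrices and applied to a point. The function names `rot_x`, `rot_y`, `rot_z` and `transform` are hypothetical, and angles here are in radians, whereas OpenGL's `glRotatef` takes degrees.

```python
import math


def rot_x(q):
    """Homogeneous matrix for a rotation by q radians about the x axis."""
    c, s = math.cos(q), math.sin(q)
    return [[1, 0, 0, 0], [0, c, -s, 0], [0, s, c, 0], [0, 0, 0, 1]]


def rot_y(q):
    """Homogeneous matrix for a rotation by q radians about the y axis."""
    c, s = math.cos(q), math.sin(q)
    return [[c, 0, s, 0], [0, 1, 0, 0], [-s, 0, c, 0], [0, 0, 0, 1]]


def rot_z(q):
    """Homogeneous matrix for a rotation by q radians about the z axis."""
    c, s = math.cos(q), math.sin(q)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]


def transform(m, p):
    """Apply the 4x4 matrix m to the 3D point p (homogeneous w = 1)."""
    v = (p[0], p[1], p[2], 1.0)
    return tuple(sum(m[r][k] * v[k] for k in range(4)) for r in range(3))
```

With these definitions a positive quarter turn about z carries the x axis onto the y axis, which is one way to check the sign convention asked about on the arbitrary-axis slide.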