Download Protein Sequence Analysis - Bioinformatics Webportal

Protein Sequence Motifs
Aalt-Jan van Dijk
Plant Research International, Wageningen UR
Biometris, Wageningen UR
aaltjan.vandijk@wur.nl
www.bioinformatics.nl

Plant Bioinformatics
[Overview diagram; its labels are listed below]
Areas: Genomics, Transcriptomics, Proteomics, Metabolomics, Technology, Integrated analysis of omics datasets
Topics:
• Next Generation Sequencing
• Genome assembly & annotation
• (Comparative) genome analysis
• SNP analysis, marker development
• Computational infrastructure
• Database development
• Web-based analysis tools
• Software development
• Workflow management systems
• Machine learning
• Data (pre-)processing, pipelining
• Alternative splicing
• EST analysis
• Protein interaction networks
• Metabolite and pathway identification
• Systems biology
• Network modelling (bottom-up)

My research
• Protein complex structures
• Protein-protein docking
• Correlated mutations
• Interaction site prediction/analysis
• Protein-protein interactions
• Protein-DNA interactions
• Motif search
• Enzyme active sites

Overview
• Protein Motif Searching
• Hydrophobicity & Transmembrane Domains
• Protein Interactions
• Sequence motifs to predict interaction sites
• Secondary Structure Prediction

Protein Motif Searching

What is a motif?
• A motif is a description of a particular element of a protein that contains a specific sequence pattern
• Motifs are identified by
  • 3D structural alignment
  • Multiple sequence alignment
  • Pattern searching programs
Protein Motif Searching
• Strict consensus pattern: use only strictly conserved residues
• But what about variable residues? And gaps?

    C--QASCDGIPLKMNDC
    C---VTCEGLPMRMDQC
    CERTLGCQPMPVH---C

    C     C   P     C      (strictly conserved residues)
    CxxxxxCxxxPxxxxxC      (strict consensus pattern)

Protein Motif Searching
• Strict consensus patterns contain
  • no alternative residues
  • no flexible regions
  • no mismatches
  • no gaps

Protein Motif Searching
• Most motifs are defined as regular expressions
• Motifs can contain
  • alternative residues
  • flexible regions

    C-x(2,5)-C-x-[GP]-x-P-x(2,5)-C

      CXXXCXGXPXXXXXC
      |   | | |     |
    FGCAKLCAGFPLRRLPCFYG

The PROSITE Syntax
A-[BC]-X-D(2,5)-{EFG}-H
• A         A
• [BC]      B or C
• X         anything
• D(2,5)    2-5 D's
• {EFG}     not E, F, or G
• H         H

PROSITE entries
• Mandatory motifs characterise a protein (super-)family

ID   SUBTILASE_ASP; PATTERN.
DE   Serine proteases, subtilase family, aspartic acid active site.
PA   [STAIV]-x-[LIVMF]-[LIVM]-D-[DSTA]-G-[LIVMFC]-x(2,3)-[DNH].

ID   SUBTILASE_HIS; PATTERN.
DE   Serine proteases, subtilase family, histidine active site.
PA   H-G-[STM]-x-[VIC]-[STAGC]-[GS]-x-[LIVMA]-[STAGCLV]-[SAGM].

ID   SUBTILASE_SER; PATTERN.
DE   Serine proteases, subtilase family, serine active site.
PA   G-T-S-x-[SA]-x-P-x(2)-[STAVC]-[AG].

Exercise
• Find the three subtilase motifs in PROSITE (prosite.expasy.org)
• Compare the lists of proteins in which the motifs occur. What does this tell you?
• Similarly, compare protein structures in which the motifs occur
• Have a look at the "sequence logo"

Protein Motif Searching
• Some motifs match frequently in proteins even though the feature they describe may not actually be present, e.g. post-translational modification sites

ID   ASN_GLYCOSYLATION; PATTERN.
DE   N-glycosylation site.
PA   N-{P}-[ST]-{P}.

Exercise
• Use a glycosylation site predictor such as http://www.cbs.dtu.dk/services/NetNGlyc/
• Input: your favorite set of sequences
• Do you observe that some N-{P}-[ST] sites are likely to be glycosylated and others not?

Profiles
• Many motifs cannot be easily defined using simple patterns
• Such motifs can be defined using profiles
• A profile is constructed from a multiple sequence alignment.
• For each position, each amino acid is given a score depending on how likely it is to occur

Calculating a Profile
• For each alignment position: take the (weighted) average of the appropriate rows from the scoring matrix
• An (extremely simple) example:

    seq_01  AAAAAAAAAAW
    seq_02  AAAAAAAAAWW
    seq_03  AAAAAAAAWWW
    seq_04  AAAAAAAWWWW
    seq_05  AAAAAAWWWWW
    seq_06  AAAAAWWWWWW
    seq_07  AAAAWWWWWWW
    seq_08  AAAWWWWWWWW
    seq_09  AAWWWWWWWWW
    seq_10  AWWWWWWWWWW

Excerpt from the EBLOSUM62 matrix:

         A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
    A    4  -1  -2  -2   0  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -3  -2   0
    W   -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11   2  -3

Profile scores for three alignment columns (10A: ten A; 5A+5W: five A and five W; 10W: ten W):

    Residue    10A   5A+5W    10W
    A          4.0     1.0   -3.0
    C          0.0    -2.0   -2.0
    D         -2.0    -6.0   -4.0
    E         -1.0    -4.0   -3.0
    F         -2.0    -1.0    1.0
    G          0.0    -2.0   -2.0
    H         -2.0    -4.0   -2.0
    I         -1.0    -4.0   -3.0
    K         -1.0    -4.0   -3.0
    L         -1.0    -3.0   -2.0
    M         -1.0    -2.0   -1.0
    N         -2.0    -6.0   -4.0
    P         -1.0    -5.0   -4.0
    Q         -1.0    -3.0   -2.0
    R         -1.0    -4.0   -3.0
    S          1.0    -2.0   -3.0
    T          0.0    -2.0   -2.0
    V          0.0    -3.0   -3.0
    W         -3.0     8.0   11.0
    Y         -2.0     0.0    2.0

Computed with prophecy (EMBOSS), using the Henikoff profile type and the BLOSUM62 matrix.

Pattern Searching
• Short linear motifs: e.g. http://dilimot.russelllab.org/
• Profiles: MEME http://meme.sdsc.edu/meme/cgi-bin/meme.cgi

Exercise
• Use a number of sequences which contain the PROSITE subtilase motif and find motifs in those sequences with MEME

Hydropathy Plot
• Prediction of hydrophobic and hydrophilic regions in a protein

Partition Coefficients
[Figure: partition between water (hydrophilic) and oil (hydrophobic)]

Hydrophobicity/Hydrophilicity Values
Residues are ordered from hydrophilic (top) to hydrophobic (bottom).

    Residue  Fauchere & Pliska  Kyte & Doolittle  Hopp & Woods  Eisenberg
    R              -1.37             -4.50            3.00        -2.53
    K              -1.35             -3.90            3.00        -1.50
    D              -1.05             -3.50            3.00        -0.90
    Q              -0.78             -3.50            0.20        -0.85
    N              -0.85             -3.50            0.20        -0.78
    E              -0.87             -3.50            3.00        -0.74
    H              -0.40             -3.20           -0.50        -0.40
    S              -0.18             -0.80            0.30        -0.18
    T              -0.05             -0.70           -0.40        -0.05
    P               0.12             -1.60            0.00         0.12
    Y               0.26             -1.30           -2.30         0.26
    C               0.29              2.50           -1.00         0.29
    G               0.48             -0.40            0.00         0.48
    A               0.62              1.80           -0.50         0.62
    M               0.64              1.90           -1.30         0.64
    W               0.81             -0.90           -3.40         0.81
    L               1.06              3.80           -1.80         1.06
    V               1.08              4.20           -1.50         1.08
    F               1.19              2.80           -2.50         1.19
    I               1.38              4.50           -1.80         1.38

Hydrophobicity Plot
• Sum amino acid hydrophobicity values in a given window
• Plot the value in the middle of the window
• Shift the window one position

For a window of 2k+1 residues centred on position i:

    $$H_i = \frac{1}{2k+1} \sum_{n=i-k}^{i+k} H_n$$

Sliding Window Approach
• Calculate property for first sub-sequence
• Use the result (plot/print/store)
• Move to next residue position, and repeat

Hydrophobicity Plot
Example sequence: MEZCALTASTESVERYNICE
[Figure: hydrophobicity plot for MEZCALTASTESVERYNICE, built up window by window over successive slides; x-axis: residue position (0 to 21), y-axis: hydrophobicity (-2 to 1.5)]
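
The sliding-window procedure above can be sketched in Python as follows. This is a minimal illustration, not the code used to produce the plots: it uses the Kyte & Doolittle values from the table, averages over a window of 2k+1 residues, and assumes that the ambiguity code Z in the example sequence scores -3.5, the value shared by E and Q in that scale. The function name `hydropathy_profile` and the default `k=2` are illustrative choices.

```python
# Kyte & Doolittle hydropathy values, copied from the table above.
KYTE_DOOLITTLE = {
    "R": -4.5, "K": -3.9, "D": -3.5, "Q": -3.5, "N": -3.5, "E": -3.5,
    "H": -3.2, "S": -0.8, "T": -0.7, "P": -1.6, "Y": -1.3, "C": 2.5,
    "G": -0.4, "A": 1.8, "M": 1.9, "W": -0.9, "L": 3.8, "V": 4.2,
    "F": 2.8, "I": 4.5,
    # Assumption: Z (Glu or Gln) is scored as -3.5, since E and Q both are.
    "Z": -3.5,
}


def hydropathy_profile(sequence, k=2, scale=KYTE_DOOLITTLE):
    """Return (centre_position, mean_score) pairs for windows of 2k+1 residues.

    Positions are 1-based. The first and last k residues get no value
    because no full window can be centred on them.
    """
    values = [scale[residue] for residue in sequence.upper()]
    profile = []
    for i in range(k, len(values) - k):
        window = values[i - k : i + k + 1]          # residues i-k .. i+k
        profile.append((i + 1, sum(window) / len(window)))
    return profile


if __name__ == "__main__":
    for position, score in hydropathy_profile("MEZCALTASTESVERYNICE", k=2):
        print(f"{position:2d} {score:6.2f}")
```

For longer sequences, calling the same function with `k=9` gives a 19-residue window, which is close to the roughly 20 residues needed to span a membrane.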

Transmembrane Regions (alpha-helical segment)
• Rotation is 100 degrees per amino acid
• Climb is 1.5 Angstrom per amino acid residue

Transmembrane Regions
[Figure: membrane, thickness 30 Angstrom]
• So we need approx. 30 / 1.5 = 20 amino acids to span the membrane
• Adapting the window size to the size of the membrane-spanning segment makes the picture easier to interpret

[Figure: hydrophobicity plots of the same protein with window = 1, window = 9, window = 19 and window = 121]

Protein Interactions

Protein Interactions
• Obligatory: hemoglobin
• Transient: mitochondrial Cu transporters

Experimental approaches (1)
• Yeast two-hybrid (Y2H)

Experimental approaches (2)
• Affinity purification + mass spectrometry (AP-MS)

Interaction Databases
• STRING http://string.embl.de/
• HPRD http://www.hprd.org/
• InteroPorc http://biodev.extra.cea.fr/interoporc/Default.aspx
• Many others, e.g. see http://nar.oxfordjournals.org./content/39/suppl_1.toc

Yeast protein interaction network
[Figure]

Sequence-based Protein Binding Site Prediction

Binding site
[Figure]

Predefined motifs
[Figure]

Motif search in groups of proteins
• Group proteins which have same interaction partner
• Use motif search, e.g. find PWMs
Neduva, PLoS Biol 2005

Correlated Motif Search

    Interactors:
      AARLL  PLTEQ
      MARLT  DLTEP
      VVRLM  MMTER

    Non-interactors:
      AARLL  MARLT
      VVRLM  MARLT
      PLTEQ  DLTEP

    Correlated Motif Pair: (RL,TE)

Experimental validation
Van Dijk et al, PLoS Comp Biol 2010

New approach: slider
• Faster approach -> genome-wide searching for interaction motifs
• Improve mining algorithm with a priori biological knowledge (conservation score, surface accessibility)
Boyen et al, IEEE/ACM Trans Comput Biol Bioinform. 2011

THE END. Questions?
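
As a toy illustration of the correlated motif idea, the sketch below scores every pair of 2-residue motifs on the interactor and non-interactor lists shown earlier, reading each row of those lists as one protein pair. The scoring rule (number of interacting pairs that carry the two motifs on opposite partners, minus the same count for non-interacting pairs) and all function names are assumptions made for this example; it is not the algorithm of Van Dijk et al. 2010 or of the slider method.

```python
from itertools import product

# Toy data from the slide, read row by row as protein pairs (assumption).
INTERACTORS = [("AARLL", "PLTEQ"), ("MARLT", "DLTEP"), ("VVRLM", "MMTER")]
NON_INTERACTORS = [("AARLL", "MARLT"), ("VVRLM", "MARLT"), ("PLTEQ", "DLTEP")]


def kmers(sequence, k=2):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def support(motif_a, motif_b, pairs, k=2):
    """Count pairs where one protein has motif_a and its partner has motif_b."""
    count = 0
    for p, q in pairs:
        kp, kq = kmers(p, k), kmers(q, k)
        if (motif_a in kp and motif_b in kq) or (motif_a in kq and motif_b in kp):
            count += 1
    return count


def best_correlated_pair(interactors, non_interactors, k=2):
    """Return (motif_pair, score) with the highest interactor-specific support."""
    candidates = set()
    for p, q in interactors:
        for a, b in product(kmers(p, k), kmers(q, k)):
            candidates.add(tuple(sorted((a, b))))   # unordered motif pair
    scored = {
        pair: support(*pair, interactors, k) - support(*pair, non_interactors, k)
        for pair in candidates
    }
    return max(scored.items(), key=lambda item: item[1])


if __name__ == "__main__":
    print(best_correlated_pair(INTERACTORS, NON_INTERACTORS))
```

On this toy data the highest-scoring pair is (RL, TE), in agreement with the slide. Exhaustive enumeration like this scales poorly with motif length and proteome size, which is the motivation the slides give for a faster approach.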

Secondary Structure Prediction

Secondary Structure Prediction
• Traditional methods (statistical and/or rule-based)
• E.g. Garnier, Osguthorpe & Robson (GOR)
  • Statistical method
• Accuracy ~ 60%

GOR Helix Parameters
Rows: residue type; columns: position relative to the residue i being predicted.

          i-8  i-7  i-6  i-5  i-4  i-3  i-2  i-1    i  i+1  i+2  i+3  i+4  i+5  i+6  i+7  i+8
    Gly    -5  -10  -15  -20  -30  -40  -50  -60  -86  -60  -50  -40  -30  -20  -15  -10   -5
    Ala     5   10   15   20   30   40   50   60   65   60   50   40   30   20   15   10    5
    Val     0    0    0    0    0    0    5   10   14   10    5    0    0    0    0    0    0
    Leu     0    5   10   15   20   25   28   30   32   30   28   25   20   15   10    5    0
    Ile     5   10   15   20   25   20   15   10    6    0  -10  -15  -20  -25  -20  -10   -5
    Ser     0   -5  -10  -15  -20  -25  -30  -35  -39  -35  -30  -25  -20  -15  -10   -5    0
    Thr     0    0    0   -5  -10  -15  -20  -25  -26  -25  -20  -15  -10   -5    0    0    0
    Asp     0   -5  -10  -15  -20  -15  -10    0    5   10   15   20   20   20   15   10    5
    Glu     0    0    0    0   10   20   60   70   78   78   78   78   78   70   60   40   20
    Asn     0    0    0    0  -10  -20  -30  -40  -51  -40  -30  -20  -10    0    0    0    0
    Gln     0    0    0    0    5   10   20   20   10  -10  -20  -20  -10   -5    0    0    0
    Lys    20   40   50   55   60   60   50   30   23   10    5    0    0    0    0    0    0
    His    10   20   30   40   50   50   50   30   12  -20  -10    0    0    0    0    0    0
    Arg     0    0    0    0    0    0    0    0   -9  -15  -20  -30  -40  -50  -50  -30  -10
    Phe     0    0    0    0    0    5   10   15   16   15   10    5    0    0    0    0    0
    Tyr    -5  -10  -15  -20  -25  -30  -35  -40  -45  -40  -35  -30  -25  -20  -15  -10   -5
    Trp   -10  -20  -40  -50  -50  -10    0   10   12   10    0  -10  -50  -50  -40  -20  -10
    Cys     0    0    0    0    0    0   -5  -10  -13  -10   -5    0    0    0    0    0    0
    Met    10   20   25   30   35   40   45   50   53   50   45   40   35   30   25   20   10
    Pro   -10  -20  -40  -60  -80 -100 -120 -140  -77  -60  -30  -20  -10    0    0    0    0

Example sequence shown alongside the table: ISGARNIERHELIXPREDICT

GOR Prediction
[Figure: GOR prediction with helix and beta sheet curves]

Secondary Structure Prediction
• Recent methods
  • Neural networks = flexible statistics
  • Multiple alignments = variability
  • Heuristics = common sense
  • Or a combination of the above
• Accuracy ~ 70%

Heuristics
• Conserved parts are structurally and/or functionally important
• Segments with many gaps must be in loop regions

Secondary Structure Prediction
• Strategy
  • Use as many methods as possible
  • Use homologous sequences
  • Combine predictions into consensus prediction

Why can't it be 100% correct?
• All current 2D (secondary structure) prediction schemes are based upon observation of occurrence of 2D elements in 3D structures
• Deduction of 2D elements from structures is ambiguous!
• DSSP, Stride, and the PDB (human) annotation do not always agree upon the assigned elements

[Figure: Do these residues still belong to the helix?]
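
The consensus strategy above (combine the output of several predictors) can be sketched as a per-residue majority vote. This is only one simple way to combine predictions and is an assumption of this example, as are the function name, the H/E/C state alphabet and the use of "?" for ties; the slides do not prescribe a particular combination rule.

```python
from collections import Counter


def consensus_prediction(predictions):
    """Majority vote per residue over equal-length strings of state letters.

    Each string is one method's prediction, e.g. H (helix), E (strand),
    C (coil). Positions where the top vote is tied are marked '?'.
    """
    lengths = {len(p) for p in predictions}
    if len(lengths) != 1:
        raise ValueError("all predictions must have the same length")
    consensus = []
    for states in zip(*predictions):            # one column per residue
        counts = Counter(states).most_common()
        top_state, top_count = counts[0]
        tied = len(counts) > 1 and counts[1][1] == top_count
        consensus.append("?" if tied else top_state)
    return "".join(consensus)


if __name__ == "__main__":
    # Three hypothetical predictions for the same five-residue stretch.
    print(consensus_prediction(["HHHEC", "HHCEC", "HCCEE"]))
```

Positions marked "?" are the ones where the methods disagree most, which often coincide with the ambiguous helix or strand ends discussed above.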
