Protein Subcellular Localization
Shan Sundararaj
University of Alberta, Edmonton, AB
ss23@ualberta.ca
Lecture 4.0

Why is Localization Important?
• Function is dependent on context
• Co-localization of proteins of related function
• Valuable annotation for new proteins
• Design of proteins with specific targets
• Drug targeting
  – Accessibility: Membrane-bound > cytoplasmic > nuclear

Why is Localization Important?
• 1974 Nobel Prize in Physiology/Medicine: George Palade
  – “for discoveries concerning the structural and functional organization of the cell”
• 1999 Nobel Prize in Physiology/Medicine: Günter Blobel
  – “for the discovery that proteins have intrinsic signals that govern their transport and localization in the cell”

Bacteria
• Gram positive (3-4 states): cytoplasm, cytoplasmic membrane, cell wall, extracellular
• Gram negative (5 states): cytoplasm, cytoplasmic membrane, periplasm, outer membrane, extracellular

Eukaryotic Cell
• Compartmentalized
• Diverse range of specific organelles:
  – Plants: chloroplasts, chromoplasts, other plastids
  – Muscle: sarcoplasm
  – Various endosomes, vesicles
(Figure modified from Voet & Voet, Biochemistry; Wiley-VCH 1992)

Yet more categories…
[Figure panels: Chloroplast; Mitochondrion; Yeast “specific”]

Level of Annotation
• As simple as two states:
  – membrane protein vs. non-membrane protein
  – secreted protein vs. non-secreted protein
• Gross compartments:
  – cytoplasm, inner membrane, periplasm, cell wall, outer membrane, extracellular
  – nucleus, mitochondria, peroxisome, vacuole…
• Fine compartments:
  – Mitochondrial matrix, bud neck, spindle pole…
  – Any of 1425 GO cellular component terms

Localization signaling
• Proteins must have intrinsic signals for their localization: a cellular address
  – E.g. N-terminal signal sequences
[Figure: a mock postal address reading “321 Nuclear Inner Membrane Lane, Nucleus, Intracellular county, Eukaryotic Cell, CL34V3M3”]

Localization signaling
• Some signals are easily recognizable
  – Signal peptidase cleavage site, consensus sequence for extracellular secretion
  – Address printed neatly, postal code
• Others are difficult to understand
  – Outer membrane β-barrel proteins, no consensus sequence, few sequence constraints
  – Sloppy address, different kind of code that we don’t understand yet

Experimental determination
• Since we don’t fully understand the language of proteins, our knowledge must often come from inference
  – Predicting localization is like sorting mail based only on examples of where some mail has gone before
• Important to have good data sets of proteins with known localizations

Datasets
• Organelle_DB (http://organelledb.lsi.umich.edu/)
  – 25095 eukaryotic proteins from subcellular proteomics studies
• DBSubLoc (http://www.bioinfo.tsinghua.edu.cn/~guotao/download.html)
  – Combines SwissProt and PIR annotations (64051 proteins)
• PSORTDB (http://db.psort.org/)
  – Bacterial. 1591 Gram-negative proteins, 574 Gram-positive proteins
• SignalP (http://www.cbs.dtu.dk/ftp/signalp/)
  – 940 plant and 2738 human proteins
• YPL (http://bioinfo.mbb.yale.edu/genome/localize/)
  – 2956 yeast proteins

Experimental Methods
• Electron microscopy
• GFP tagging / fluorescence microscopy
• Subcellular fractionation + detection
  – Western blotting
  – Mass spectrometry

Electron Microscopy
• Highest resolution, can work at the level of a single protein complex
• Immunolabel proteins of interest in conjunction with colloidal gold, and visualize
• Combined with electron tomography, can even visualize unlabeled complexes
(Figure from Koster and Klumperman, Nat Rev Mol Cell Biol, Sep 2003, S6-10)

Fluorescence Microscopy
• Tag gene at either 3’ or 5’ end
  – Using GFP (or RFP, YFP, CFP, etc.)
  – Using an epitope tag and a fluorescently labeled antibody
  – Careful of removing signal peptides!
• Also use a subcellular-specific marker or stain
• Visualize with confocal fluorescence microscopy and analyze images for colocalization

Specific co-labeling (yeast)
• Early Golgi: Cop1
• Endosome: Snf7
• ER to Golgi: Sec13
• Golgi apparatus: Anp1
• Late Golgi: Chc1
• Lipid particle: Erg6
• Mitochondrion: MitoTracker
• Nucleus: DAPI
• Nucleolus: Sik1
• Nuclear periphery: Nic96
• Peroxisome: Pex3
• Vacuole: FM4-64
[Figure: nuclear-specific DAPI staining]

Subcellular Fractionation
• Tissue homogenate, centrifuge at 1000 g
  – Pellet: unbroken cells, nuclei, chloroplasts
  – Transfer supernatant
• Centrifuge at 10,000 g
  – Pellet: mitochondria
  – Transfer supernatant
• Centrifuge at 100,000 g
  – Pellet: microsomal fraction (ER, Golgi, lysosomes, peroxisomes)
  – Supernatant: cytosol, soluble enzymes

Detergent Fractionation
• Cells: extraction with Digitonin/EDTA
  – Supernatant: cytoplasmic fraction
  – Pellet: carried into the next extraction
• Extraction with Triton X-100/EDTA
  – Supernatant: organelle membranes
  – Pellet: carried into the next extraction
• Extraction with SDS/EDTA
  – Supernatant: nuclear
  – Pellet: cytoskeletal (in SDS)

Fractionation Identification
• Once fractionated, take compartment of interest and separate proteins
  – 2D gel or chromatography
• Identify separated proteins
  – Mass spectrometry for high-throughput
  – Western blot for specific proteins

Fractionation in proteomics
[Figure]

High-Throughput Experiments
• Kumar et al., Genes Dev 2002, 16:707-719
  – Epitope-tagged >60% of ORFs, visualized with fluorescently labeled antibody
  – 2744 localizations (44% of S. cerevisiae genes)
• Huh et al., Nature 2003, 425:686-691
  – GFP tagged all ORFs, RFP tagged compartments
  – 4156 localizations (75% of S. cerevisiae genes)
• Combined, now nearly 87% of yeast proteins have a localization annotation

High-Throughput Experiments
• Lopez-Campistrous et al., Mol Cell Proteomics, 2005
  – Subcellular fractionation of E. coli, 2D-gel separation, MS-MS
  – 2,160 localizations to cytoplasm, inner membrane, periplasm, and outer membrane

Predictions from known data
• Enough experimental data exists to build highly accurate computational predictors of localization

Predictions from known data
• Different information used for predictions:
  1) Sequence motifs
     - N-terminal: secretory signal peptides, mitochondrial targeting peptide, chloroplast transit peptide
     - C-terminal: peroxisome import signal, ER retention signal
     - Mid-sequence: nuclear localization signals
  2) Amino acid composition
     - AA frequency, dipeptide composition
  3) Homology
     - Sequence comparison to proteins of known localization

N-terminal signal peptides
• Common structure of signal peptides:
  – positively charged n-region, followed by a hydrophobic h-region and a neutral but polar c-region.

                     | Eukaryotes               | Prokaryotes: Gram-negative        | Prokaryotes: Gram-positive
  Total length (avg) | 22.6 aa                  | 25.1 aa                           | 32.0 aa
  n-regions          | only slightly Arg-rich   | Lys+Arg-rich                      | Lys+Arg-rich
  h-regions          | short, very hydrophobic  | slightly longer, less hydrophobic | very long, less hydrophobic
  c-regions          | short, no pattern        | short, Ser+Ala-rich               | longer, Pro+Thr-rich
  -3,-1 positions    | small, neutral residues  | almost exclusively Ala            | almost exclusively Ala
  +1 to +5 region    | no pattern               | rich in Ala, Asp/Glu, and Ser/Thr | rich in Ala, Asp/Glu, and Ser/Thr

N-terminal signal peptides
[Figure]

More work to do
• Multiple bacterial secretion pathways
• C-terminal signal peptides
• Internal mitochondrial transit peptides
• Structural aspects of targeting
• Gene re-localization
• Still a lot to discover in how signaling works!
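
The n-region / h-region layout of N-terminal signal peptides described above can be illustrated with a small sketch. It is not how SignalP or any other published predictor works: it simply slides a window over the start of a sequence, scores each window with the Kyte-Doolittle hydropathy scale, and counts Lys/Arg residues upstream of the most hydrophobic window. The window size, the search length, the function names and the example sequence are all assumptions made for illustration.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def most_hydrophobic_window(seq, window=8, search_len=35):
    """Return (start, mean hydropathy) of the most hydrophobic window
    within the first `search_len` residues (a crude h-region candidate)."""
    region = seq[:search_len]
    best_start, best_score = 0, float("-inf")
    for start in range(len(region) - window + 1):
        score = sum(KD[aa] for aa in region[start:start + window]) / window
        if score > best_score:  # strict '>' keeps the earliest window on ties
            best_start, best_score = start, score
    return best_start, best_score


def n_region_positive_charges(seq, h_start):
    """Count Lys/Arg residues upstream of the h-region candidate."""
    return sum(1 for aa in seq[:h_start] if aa in "KR")


# Hypothetical sequence, invented for this example (not a real protein).
example = "MKKRTLLALLLAVAAGCSAQEDTSPTG"
h_start, h_score = most_hydrophobic_window(example)
positives = n_region_positive_charges(example, h_start)
print(f"h-region candidate starts at residue {h_start + 1}, "
      f"mean hydropathy {h_score:.2f}, {positives} Lys/Arg before it")
```

A short, positively charged stretch followed by a strongly hydrophobic window is only suggestive of a signal peptide. Transmembrane helices give the same pattern, which is one reason the real predictors are trained on curated data sets.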

Computational methods for predicting localization
• Expert rule based methods
• Artificial Neural Nets (ANN)
• Hidden Markov Models (HMM)
• Naïve Bayes (NB)
• Support Vector Machines (SVM)
• Combination of above methods

Naïve Bayes
• Assumption:
  – Features are conditionally independent, given class labels
• Structure:
  – 1 level tree
  – Class label C: root
  – Features F1, F2, …, F7: leaf nodes
• Prediction:
  – $\mathrm{class}(f) = \arg\max_{c} P(C = c)\, P(F = f \mid C = c)$
  – With the independence assumption, $P(F = f \mid C = c) = \prod_{i} P(F_i = f_i \mid C = c)$

Artificial Neural Network
• Excellent for modeling nonlinear input/output relationships
• Robust to noise in training data
• Widely used in bioinformatics
[Figure: network with Input, Hidden and Output layers]

Support Vector Machines
• Input vectors are separated into positive vs. negative instances
• Map to new feature space
• Find the hyperplane that best separates the two classes by distance (the largest margin)
[Figure: hyperplane w·x + b = 0; the half-space w·x + b > 0 is class +1 and the half-space w·x + b < 0 is class -1]

Evaluating Predictors - Precision

              Predicted +   Predicted -
  True +      TP            FN
  True -      FP            TN

• # of proteins correctly labeled as “cyt” divided by the total # of proteins labeled as “cyt”: precision = TP / (TP + FP)
• How often the label is correct
• If there are 90 proteins correctly labeled as “cyt”, and 10 proteins incorrectly labeled as “cyt”, then the precision is 90/100 = 0.90.
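
The precision arithmetic above, and the companion sensitivity (recall) measure, can be written as a short sketch. The function names are assumptions chosen for this example. The 90 and 10 counts come from the slide, while the false-negative count of 30 is a made-up value used only to show the recall calculation.

```python
def precision(tp, fp):
    """Fraction of proteins given the label that really have it: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of proteins that truly have the label and received it: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def confusion_counts(true_labels, predicted_labels, positive):
    """Count TP, FP, FN, TN for one label treated as the positive class."""
    tp = fp = fn = tn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == positive and truth == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Slide example: 90 correct "cyt" labels and 10 incorrect ones.
print(precision(tp=90, fp=10))

# Hypothetical: suppose 30 truly cytoplasmic proteins were missed.
print(recall(tp=90, fn=30))

# Tiny illustrative label lists (invented data).
truth = ["cyt", "cyt", "nuc", "cyt", "nuc"]
predicted = ["cyt", "nuc", "cyt", "cyt", "nuc"]
tp, fp, fn, tn = confusion_counts(truth, predicted, positive="cyt")
```

For a multi-class predictor, the same counts are taken once per location, treating that location as positive and all others as negative.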

Evaluating Predictors - Sensitivity

              Predicted +   Predicted -
  True +      TP            FN
  True -      FP            TN

• # of proteins correctly labeled as cytoplasmic divided by the total # of proteins that are cytoplasmic: sensitivity = TP / (TP + FN)
• “How many of the true results were retrieved” (also called “recall” or true positive rate)

Predictions from known data
• Different information used for predictions:
  1) Sequence motifs
     - N-terminal: secretory signal peptides, mitochondrial targeting peptide, chloroplast transit peptide
     - C-terminal: peroxisome import signal, ER retention signal
     - Mid-sequence: nuclear localization signals
  2) Amino acid composition
     - AA frequency, dipeptide composition, hydrophobicity
  3) Homology
     - Sequence comparison to proteins of known localization

TargetP, SignalP, *P
http://www.cbs.dtu.dk/services/
Sequence-based methods
• TargetP (85-90% recall)
  – Predicts mitochondria/chloroplast/secreted
  – Contains SignalP and ChloroP
• LipoP
  – Lipoproteins and signal peptides in Gram-negative bacteria
• SecretomeP
  – Non-classical secretion in eukaryotes

SignalP result
• Common structure of signal peptides:
  – positively charged n-region, followed by a hydrophobic h-region and a neutral but polar c-region.
[Figure: SignalP output plot with the cleavage site marked]
  Prediction: Signal peptide
  Signal peptide probability: 0.945
  Signal anchor probability: 0.000
  Max cleavage site probability: 0.723 between pos. 28 and 29

Organellar Prediction
• Predotar (http://www.inra.fr/predotar/) (80% recall)
  – Mitochondrial and plastid sequences; N-terminal sequences
• MitoPred (http://mitopred.sdsc.edu/) (82% recall)
  – Mitochondrial; PFAM domains, AA composition
• MitoProteome (http://www.mitoproteome.org/)
  – Database of experimentally identified human mitochondrial proteins
• MitoP (http://ihg.gsf.de/mitop2/)
  – Combines data from multiple experimental and computational sources to give a consensus score for each “mitochondrial” protein in yeast and human

The PSORT Family
• PSORT: plant sequences
  – Expert rule-based system
• PSORT II: eukaryotic sequences
  – Probabilistic tree
• iPSORT: eukaryotic N-terminal signal sequences
  – ANN
• PSORT-B: bacterial sequences
• WoLF PSORT: eukaryotic
  – Updated (2005) version of PSORT II

PSORT-B
http://www.psort.org/psortb/

PSORT-B - methods
• Signal peptides: Non-cytoplasmic
• AA composition/patterns
  – SVMs trained for each location vs. all other locations
• Transmembrane helices: Inner membrane
  – HMMTOP
• PROSITE motifs: all localizations
• Outer membrane motifs: Outer membrane
• Homology to proteins of known localization
  – SCL-BLAST
• Integration with a Bayesian network

PSORT-B results
SeqID: Unannotated_bacterial2
Analysis Report:
  CMSVM        Unknown       [No details]
  CytoSVM      Cytoplasmic   [No details]
  ECSVM        Unknown       [No details]
  HMMTOP       Unknown       [No internal helices found]
  Motif        Unknown       [No motifs found]
  OMPMotif     Unknown       [No motifs found]
  OMSVM        Unknown       [No details]
  PPSVM        Unknown       [No details]
  Profile      Unknown       [No matches to profiles found]
  SCL-BLAST    Cytoplasmic   [matched 118438: Cyto. protein]
  SCL-BLASTe   Unknown       [No matches against database]
  Signal       Unknown       [No signal peptide detected]
Localization Scores:
  Cytoplasmic           9.97
  CytoplasmicMembrane   0.01
  Periplasmic           0.01
  OuterMembrane         0.00
  Extracellular         0.00
Final Prediction:
  Cytoplasmic           9.97

Proteome Analyst
http://www.cs.ualberta.ca/~bioinfo/PA/Sub/

Proteome Analyst - Method
• Training: labeled training sequences are given to a machine learning algorithm, which produces a classifier
  – >Extracellular<AFP1_BRANA… MAKSATIVTL…
  – >Extracellular<AFP2_RAPSA… ACRAGMEEP…
• Prediction: an unknown sequence is given to the classifier, which returns a predicted class
  – Input: >?<Fly_01… MDLRATSSND…
  – Output: >Cytoplasm<Fly_01…

Proteome Analyst - Feature Extraction
[Figure: a query sequence (MAKSATIVTL…) is searched with PSI-BLAST against Swiss-Prot; features are extracted from the homologs found (e.g. >AFP1_ARATH, >AFP1_HUMAN, >AFP1_SINAL, …)]

Proteome Analyst: Feature Extraction
• Top 3 homologs
  – AFP1_ARATH
  – AFP1_BRANA
  – AFP2_ARATH
• KW
  – Plant defense; Fungicide; Signal; Multigene Family; Pyrrolidone carboxylic acid
• DR: InterPro
  – IPR002118; IPR003614
• CC: Subcellular location
  – Secreted
• Token set: {Plant defense; Fungicide; Signal; Multigene Family; Pyrrolidone carboxylic acid; IPR002118; IPR003614; Secreted}

PASub - Results
[Figure: bar chart of the contribution of each token (feature), on a log scale]

PASub - Interpretation
• Bars represent -log probability, so a little difference is a lot!
• Naïve Bayes chosen as classifier because of transparency of method
  – Each token contributes a log probability that can be summed and shown graphically
  – Neural network actually has higher recall
• Can change token set, ask to explain with different features

Save Time: Pre-computed Genomes
• PSORTDB
  – http://db.psort.org
  – Browse, search, BLAST, download
  – 103 Gram-negative bacteria, 45 Gram-positive bacteria
• Proteome Analyst (PA-GOSUB)
  – http://www.cs.ualberta.ca/~bioinfo/PA/GOSUB/
  – Browse, search, BLAST, download
  – 15 bacterial and 8 eukaryotic genomes
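
The token-based Naïve Bayes idea behind PASub (each token contributes a log probability, and the contributions are summed per class) can be sketched as follows. This is not Proteome Analyst's actual code: the training examples are invented, only tokens present in the query are scored, and the Laplace smoothing is an assumption made to keep the example small.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (class_label, set_of_tokens) pairs."""
    class_counts = Counter(label for label, _ in examples)
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for label, tokens in examples:
        for token in tokens:
            token_counts[label][token] += 1
        vocabulary |= set(tokens)
    return class_counts, token_counts, vocabulary


def token_log_contributions(tokens, label, model):
    """Per-token log P(token present | class), Laplace-smoothed.
    Tokens never seen in training are skipped."""
    class_counts, token_counts, vocabulary = model
    n = class_counts[label]
    return {
        token: math.log((token_counts[label][token] + 1) / (n + 2))
        for token in tokens
        if token in vocabulary
    }


def predict(tokens, model):
    """Return (best_class, log_scores): argmax of log prior + summed token logs."""
    class_counts, _, _ = model
    total = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        log_prior = math.log(class_counts[label] / total)
        contributions = token_log_contributions(tokens, label, model)
        scores[label] = log_prior + sum(contributions.values())
    return max(scores, key=scores.get), scores


# Hypothetical training data, loosely modeled on the token sets shown earlier.
training = [
    ("Extracellular", {"Signal", "Plant defense", "Secreted"}),
    ("Extracellular", {"Signal", "Fungicide", "Secreted"}),
    ("Cytoplasm", {"Glycolysis", "Kinase"}),
    ("Cytoplasm", {"Kinase", "ATP-binding"}),
]
model = train(training)
label, scores = predict({"Signal", "Secreted", "IPR002118"}, model)
print(label, scores)
```

Because the score is a plain sum, the dictionary returned by `token_log_contributions` is what a bar chart of per-token contributions would plot, which is the transparency argument made for Naïve Bayes above.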