Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Canadian Bioinformatics Workshops www.bioinformatics.ca Module #: Title of Module 2 Module 2 Marker Gene Based Analysis William Hsiao Universal phylogenetic tree based on SSU rRNA sequences -N. Pace, Science, 1997 William.hsiao@bccdc.ca @wlhsiao Learning Objectives of Module At the end of this module, you will be able to: • Understand and Perform marker based microbiome analysis • Analyze 16S rRNA marker gene data • Use 16S rRNA marker gene to profile and compare microbomes • Select suitable parameters for marker gene analysis • Explain the advantages and disadvantages of marker based microbiome analysis Module 2 bioinformatics.ca Marker genes Extract DNA Amplify with targeted primers Filter errors, build clusters Diversity analysis Source: Rob Beiko Module 2 bioinformatics.ca rRNAs – the universal phylogenetic markers • Ribosomal RNAs are present in all living organisms • rRNAs play critical roles in protein translation • rRNAs are relatively conserved and rarely acquired horizontally • Behave like a molecular clock – Useful for phylogenetic analysis – Used to build tree-of-life (placing organisms in a single phylogenetic tree) • 16S rRNA most commonly used Module 2 bioinformatics.ca rRNA – the lens into Life • A tool for placing organisms on a phylogenetic tree – Tree of life • A tool for understanding the composition of a microbial community – Profiling of a community (alpha-diversity) • A tool for relating one microbial community to another – Comparison of communities (beta-diversity) • A tool for reading out the properties of a host or environment – Obese vs. lean; IBD vs. no IBD (classification) -McDonald et al, RNA, 2015 Module 2 bioinformatics.ca Other Marker Genes Used • Eukaryotic Organisms (protists, fungi) – 18S (http://www.arb-silva.de) – ITS (http://www.mothur.org/wiki/UNITE_ITS_database) • Bacteria – CPN60 (http://www.cpndb.ca/cpnDB/home.php) – ITS (Martiny, Env Micro 2009) – RecA gene • Viruses – Gp23 for T4-like bacteriophage – RdRp for picornaviruses Faster evolving markers used for strain-level differentiation Module 2 bioinformatics.ca A Few Considerations for Selecting Marker Genes • The marker should have sufficient resolution to differentiate the different communities you want to study (16S differentiates genera and some species; e.g. HSP65 may be more useful for Mycobacterium) • A reference database is needed for taxonomic assignment and for binning by reference so availability of a good reference DB for the samples you study is necessary • When comparing across studies, need to standardize markers (different V regions of 16S gene), experimental protocols, and bioinformatic pipelines Module 2 bioinformatics.ca DNA Extraction • Many protocols/kits available but kit-bias exists • HMP and other major projects standardized DNA extraction protocol http://www.earthmicrobiome.org/emp-standardprotocols/dna-extraction-protocol/ • DNA Extraction can also be done after fractionation to separate out different organisms based on cell size and other characteristics http://www.jove.com/video/52685/automated-gel-sizeselection-to-improve-quality-next-generation Module 2 bioinformatics.ca Contaminations Lab: • Extraction process can introduce contamination from the lab – Reagents may be contaminated with bacterial DNA • Include Extraction Negative Control in your experiments! • Especially crucial if samples have low DNA yield! Host / Environment (metagenomics sequencing): • Host DNA often ends up in the microbiome sample – http://hmpdacc.org/doc/HumanSequenceRemoval_SOP.pdf • Unwanted fractions (e.g. eukaryotes) can be filtered by cell size-selection prior to DNA extraction • Unwanted DNA can be removed by subtractive hybridization Module 2 bioinformatics.ca Target Amplification Adapter barcode primer Source: Illumina application Note Module 2 bioinformatics.ca Target Amplification • PCR primers designed to amplify specific regions of a genome – In a “dirty” sample, inhibitors can interfere with PCR – In a complex sample, non-specific amplification can occur – Check your products using BioAnalyzer or run a gel • Gel size selection may be necessary to clean up the PCR products – http://www.jove.com/video/52685/automated-gel-sizeselection-to-improve-quality-next-generation • Dilution may be necessary to reduce inhibition Module 2 bioinformatics.ca Target Selection and Bias • 16S rRNA contains 9 hypervariable regions (V1-V9) • V4 was chosen because of its size (suitable for Illumina 150bp paired-end sequencing) and phylogenetic resolution • Different V regions have different phylogenetic resolutions – giving rise to slightly different community composition results – Jumpstart Consortium HMP, PLoS One, 2012 Module 2 bioinformatics.ca Sample Multiplexing • MiSEQ capacity (~20M paired-end reads) allows multiple samples to be combined into a single run – Number of reads needed to differentiate samples depends on the nature of the studies • Unique DNA barcodes can be incorporated into your amplicons to differentiate samples • Amplicon sequences are quite homogeneous – Necessary to combine different amplicon types or spike in PhiX control reads to diversify the sequences Module 2 bioinformatics.ca One Step vs. Two Step Amplification One step amplification Two step amplification Barcodes and sequencing adaptors included in the amplicon primers Barcodes and adaptors separated from the amplicon primers Single step PCR reaction PCR followed by DNA-ligation or more PCR Long primers; barcodes can interfere with amplification primers Target primers tested independently from barcodes/adaptors Not suitable for degenerate primer amplification Compatible with random and degenerate primer amplification Suitable if working with one amplicon type from many samples Suitable if working with many types of amplicons or metagenomics in a single study Rapid protocol and little loss of biomaterial Longer protocols with more biomaterial loss Barcoded primers for 16S available from Rob Knight’s Lab Tested barcodes from Illumina, BioO, NEB, etc. available Module 2 bioinformatics.ca Available Marker Gene Analysis Platforms • QIIME (http://qiime.org) • Mothur (http://www.mothur.org) Module 2 bioinformatics.ca Comparison Between QIIME and Mothur QIIME Mothur A python interface to glue together many programs Single program with minimal external dependency Wrappers for existing programs Reimplementation of popular algorithms Large number of dependencies / VM available; work best on single multi-core server with lots of memory Easy to install and setup; work best on single multi-core server with lots of memory More scalable Less scalable Steeper learning curve but more flexible Easy to learn but workflow works the best workflow if you can write your own scripts with built-in tools http://www.ncbi.nlm.nih.gov/pubmed/240 http://www.mothur.org/wiki/MiSeq_SOP 60131 Module 2 bioinformatics.ca Overall Bioinformatics Workflow Sequence data (fastq) 1) Preprocessing: remove primers, demultiplex, quality filter, decontamination 2) OTU Picking / Representative Sequences 3) Taxonomic Assignment Metadata about samples (mapping file) Inputs 4) Build OTU Table (BIOM file) 6) Phylogenetic Analysis OTU Table Phylogenetic Tree Processed Data Outputs Module 2 5) Sequence Alignment 7) Downstream analysis and Visualization – knowledge discovery bioinformatics.ca 1) Preprocessing Sample de-multiplexing • MiSEQ capacity (~20M paired-end reads) allows multiple samples to be combined into a single run • Reads need to be linked back to the samples they came from using the unique barcodes – Error-correcting barcodes available (~2000 unique barcodes from Caporaso et al ISME 2012) • De-multiplexing removes the barcodes and the primer sequences • QIIME has several different scripts to help you prepare your sequence files for analysis – http://qiime.org/tutorials/processing_illumina_data.html Module 2 bioinformatics.ca 1) Preprocessing Quality Filtering • QIIME filtering by: – Maximal number of consecutive low-quality bases ( -r=3 ) – Minimum Phred quality score ( -q=3 ) – Minimal length of consecutive high-quality bases (as % of total read length) ( -p=0.75 ) – Maximal number of ambigous bases (N’s) (-n=0 ) • Other quality filtering tools available – Cutadapt (https://github.com/marcelm/cutadapt) – Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic) – Sickle (https://github.com/najoshi/sickle) • Sequence quality summary using FASTQC – http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ Module 2 bioinformatics.ca QIIME’s Quality Filtering Workflow and Observations • Adjusting p,q, and c (minimal reads per OTU) significantly affect OTU count • Filtering mainly affect low abundance OTUs – difficult to differentiate sequencing errors and rare OTUs • Downstream analyses more robust to QF changes than raw OTU counts http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531572/ Module bioinformatics.ca 1) Preprocessing Decontamination (in-silico) • In situations where host and other non-bacterial DNA greatly outnumber the bacterial DNA, non-specific PCR amplification can occur • Useful to check a subset of your sequences to make sure they match 16S reference sequences • Two Strategies: – Query a subset of your sequences against 16S database(s) to confirm - use BLAST for higher sensitivity – mapping reads to databases containing suspected contaminants (host sequences, known contaminating sequences, etc.) – use short read aligners Module 2 bioinformatics.ca 2) OTU Picking OTU Picking • OTUs: formed arbitrarily based on sequence identity – 97% of sequence similarity over the entire 16S gene ≈ species • QIIME supports 3 approaches (each with several algorithms) – De novo clustering – Closed-reference – Open-reference • More details about OTU picking – http://qiime.org/tutorials/otu_picking.html – Current recommendation outlined in http://msystems.asm.org/content/1/1/e00003-15 Module 2 bioinformatics.ca De novo Clustering • • • • • • Groups sequences based on sequence identity Pair-wise comparisons needed Hierarchical clustering commonly used Requires a lot of disk/memory (hundreds Gb) and time Suitable if no reference database is available According to Schloss (mothur), average-linkage clustering is the most robust to changes in input data and algorithm parameters – “generated OTUs that were more likely to represent the actual distances between sequences” – Recommend first group input sequences into broad taxonomic bins (e.g. class or family) then cluster within each bin (https://peerj.com/articles/1487/) Navas-Molina et al 2013, Meth. Enzym Module 2 bioinformatics.ca De novo clustering • Exhaustive de novo clustering (pair-wise distance matrix followed by hierarchical clustering) is not scalable (e.g. 1000 sequences requires 1000x1000 comparisons) • Greedy algorithms (e.g. UCLUST, sumaclust) achieves time/space saving by clustering incrementally using a subset of sequences as “centroids” (centroid = seed of a cluster) • Input order of sequences affects centroids picking so input seqs are usually pre-sorted by length or abundance Module 2 bioinformatics.ca Closed-Reference • Matches sequences to an existing database of reference sequences • Unmatched sequences discarded • Fast and can be parallelized • Suitable if accurate and comprehensive reference is available • Allow taxonomic comparison across different markers and different datasets • Novel organisms missed/discarded Navas-Molina et al 2013, Meth. Enzym Module 2 bioinformatics.ca Open-Reference • First matches sequences to reference database (like closed-reference) • Unmatched sequences then go through de-novo clustering • Best of both worlds • Suitable if a mixture of novel sequences and known sequences is expected • Results could be biased by the reference database used • QIIME Recommended approach Navas-Molina et al 2013, Meth. Enzym Module 2 bioinformatics.ca 2) OTU Picking Representative Sequence • Once sequences have been clustered into OTUs, a representative sequence is picked for each OTU. • This means downstream analyses treats all organisms in one OTU as the same! • Representative sequence can be picked based on – – – – – – Abundance (default) Centroid used (open-reference OTU or de novo) Length Existing reference sequence (for closed-reference OTU picking) Random First sequence Module 2 bioinformatics.ca 2) OTU Picking Chimera Sequence Removal • PCR amplification process can generate chimeric sequences (an artificially joined sequence from >1 templates) • Chimeras represent ~1% of the reads • Detection based on the identification of a 3-way alignment of non-overlapping sub-sequences to 2 sequences in the search database Source: http://drive5.com/usearch/manual/chimera_formation.html Module 2 bioinformatics.ca 3) Taxonomic Assignment Taxonomic Assignment • OTUs don’t have names but humans have the desire to refer to things by names. – I.e. we want to refer to a group of organism as Bacteroides rather than OTU1. • Closed-reference: – Transfer taxonomy of the reference sequence to the query read • Open-reference + de-novo clustering: – Match OTUs to a reference data using “similarity” searching algorithms – It is important to report how the matching is done in your publication and which taxonomy database is used Module 2 bioinformatics.ca 3) Taxonomic Assignment Taxonomic Assignment Taxonomy DBs RDP Classifier RDP NCBI BLAST GreenGenes rtax Silva Module 2 Specify a matching algorithm Bacteria; Firmicutes; Bacilli; Lactobacillales; Lactobacillaceae; Lactobacillus Etc. tax2tree OTUs Bacteria; Bacteroidetes/Chl orobi group; Bacteroidetes; Bacteroidia; Bacteroidales; Bacteroidaceae; Bacterroides Specify a target taxonomy database Ranked taxonomy names bioinformatics.ca 3) Taxonomic Assignment Taxonomy Databases • RDP (Cole et al 2009) – Most similar to NCBI Taxonomy – Has a rapid classification tool (RDP-Classifier) • GreenGenes (DeSantis et al 2006) – Preferred by QIIME • Silva (Quast et al. 2013) – Preferred by Mothur • Choose DB based on existing datasets you want to compare to and community preference • Caveat: only 11% of the ~150 human associated bacterial genera have species that fall in the 95%-98.7% 16S rRNA identity (Tamisier et al, IJSEM, 2015) Module 2 bioinformatics.ca 3) Taxonomic Assignment Taxonomy Summary Different body sites have different taxonomy composition shown as stacked bar graphs Source: Navas-Molina et al 2015 Module 2 bioinformatics.ca 4) Build OTU Table OTU Table • OTU table is a sample-by-observation matrix OTU1 OTU2 OTU3 OTU4 OTU5... Sample1 10 14 0 33 23 Sample2 5 0 54 2 12 • OTUs can have corresponding taxonomy information • Extremely rare OTUs (singletons or frequencies <0.005% of total reads) can be filtered out to improve downstream analysis – These OTUs may be due to sequencing errors and chimeras • Encoded in Biological observation Matrix (BIOM) format (http://biom-format.org/ Module 2 bioinformatics.ca BIOM Format • Advantages: – Efficient: Able to handle large amount of data; sparse matrix – Encapsulation: Able to keep metadata (e.g. sample information) and observational data (e.g. OTU counts) in one file – Supported: Have software applications to manipulate the content, check the content to minimize editing errors – Standardization: BIOM is a GSC (http://gensc.org/recognized standard; many tools use this file format (e.g. QIIME, MG-RAST, PICRUSt, Mothur, phyloseq, MEGAN, VAMPS, metagenomeSeq, Phinch, RDP Classifier) Module 2 bioinformatics.ca 5) Sequence Alignment Sequence Alignment • Sequences must be aligned to infer a phylogenetic tree • Phylogenetic trees can be used for diversity analyses • Traditional alignment programs (clustalw, muscle etc) are slow (Larkin et al 2007, Edgar 2004) • Template-based aligners such as PyNAST (Caporaso et al 2010) and Infernal (Nawrocki, 2009) are faster and more scalable for large number of sequences Module 2 bioinformatics.ca 6) Phylogenetic Analysis Phylogenetic Tree Reconstruction • After sequences are aligned, phylogenetic trees can be constructed from the multiple alignments • Examples: – – – – – FastTree (Price et al 2009) – recommended by QIIME Clearcut (Evans et al 2006) Clustalw RAXML (Stamatakis et al 2005) Muscle Module 2 bioinformatics.ca Rarefaction • As we multiplex samples, the sequencing depths can vary from sample to sample • Many richness and diversity measures and downstream analyses are sensitive to sampling depth – “more reads sequenced, more species/OTUs will be found in a given environment” • So we need to rarefy the samples to the same level of sampling depth for more fair comparison • Alternative approach: variance stabilizing transformations – McMurdie and Holmes, PLOS CB, 2013 – Liu, Hsiao et al, Bioinformatics, 2011 Module 2 bioinformatics.ca 7) Downstream Analysis Alpha Diversity Analysis • Alpha-diversity: diversity of organisms in one sample / environment – Richness: # of species/taxa observed or estimated – Evenness: relative abundance of each taxon – α-Diversity: taking both evenness and richness into account • Phylogenetic distance (PD) • Shannon entropy • dozens of different measures in QIIME Module 2 bioinformatics.ca 7) Downstream Analysis Alpha Diversity (PD) Closed-reference Open-reference De-novo Rarefaction curves different based on the OTU-picking method used. De-novo approach and Open-reference generated higher diversity than Closed-reference approach Source: Navas-Molina et al 2015 Module 2 bioinformatics.ca 7) Downstream Analysis Beta Diversity Analysis • Beta-diversity: differences in diversities across samples or environments – – – – UniFrac (Lozupone et al, AEM, 2005) (phylogenetic) Bray-Curtis dissimilarity measure (OTU abundance) Jaccard similarity coefficient (OTU presence/absence) Large number of other measures available in QIIME (and in Mothur) Module 2 bioinformatics.ca 7) Downstream Analysis Beta Diversity Analysis • First, paired-wise distances/similarity of the samples calculated Sample 1 • Sample 2 Sample 3 Sample 1 Sample 2 Sample 3 1 0.4 0.8 1 0.7 1 • Then the matrix (high-dimensional data) is transformed into a lower-dimensional display for visualization – Shows you which samples cluster together in 2D or 3D space – PCoA – Hierarchical clustering Module 2 bioinformatics.ca 7) Downstream Analysis Beta Diversity (UniFrac) • all OTUs are mapped to a phylogenetic tree • summation of branch lengths unique to a sample constitutes the UniFrac score between the samples • shared nodes either ignored (unweighted) or discounted (weighted) • Weighted UniFrac sensitive to experimental bias • Unweighted UniFrac often proven more robust but sensitive to the effects of rare OTUs Module 2 bioinformatics.ca 7) Downstream Analysis Principal Coordinate Analysis (PCoA) • When there are multiple samples, the UniFrac scores are calculated for each pair of samples. In order to view the multidimensional data, the distances are projected onto 2-dimensions using PCoA Module 2 bioinformatics.ca 7) Downstream Analysis PCoA • Each dot represents a sample • Dots are colored based on the attributes provided in the mapping (metadata) file Source: Navas-Molina et al 2015 Module 2 bioinformatics.ca 7) Downstream Analysis Hierarchical Clustering • Each leaf is a sample • Samples “forced” into a bifurcating tree Source: Navas-Molina et al 2015 Module 2 bioinformatics.ca Marker Genes vs. Shotgun Metagenomics Marker Gene Profiling Shotgun Metagenomics Profiling Less expensive (~$100 per sample) Still very expensive (~$1000 per sample) Computational needs can be met by desktop / small server computers Usually requires huge computational resources (cluster of computers) Provides mainly taxonomic profiling Provides both taxonomic and functional profiling For 16S, majority of genes can be assigned at least to phylum level Many more unassigned gene fragments (“wasted” data) Relatively free of host DNA contamination Prone to host DNA contamination Module 2 bioinformatics.ca Functional Stability vs. OTU Stability Source: Turnbaugh, P, et al. 2008. Nature Module 2 bioinformatics.ca PICRUSt • Predicts gene content of microbiome based on marker gene survey data • Analysis showed that in poorlycharacterized community types, 16S profile works better than (predicted) functional profile to classify samples Module 2 Source: Xu et al ISME Journal 2014 bioinformatics.ca We are on a Coffee Break & Networking Session Module bioinformatics.ca