Download PPT - Bioinformatics.ca

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts

United Kingdom National DNA Database wikipedia , lookup

Microsatellite wikipedia , lookup

Helitron (biology) wikipedia , lookup

Transcript
Canadian Bioinformatics Workshops
www.bioinformatics.ca
Module #: Title of Module
2
Module 2
Marker Gene Based Analysis
William Hsiao
Universal phylogenetic
tree based on SSU rRNA
sequences
-N. Pace, Science, 1997
William.hsiao@bccdc.ca
@wlhsiao
Learning Objectives of Module
At the end of this module, you will be able to:
• Understand and Perform marker based microbiome
analysis
• Analyze 16S rRNA marker gene data
• Use 16S rRNA marker gene to profile and compare
microbomes
• Select suitable parameters for marker gene analysis
• Explain the advantages and disadvantages of marker
based microbiome analysis
Module 2
bioinformatics.ca
Marker genes
Extract
DNA
Amplify
with
targeted
primers
Filter
errors,
build
clusters
Diversity
analysis
Source: Rob Beiko
Module 2
bioinformatics.ca
rRNAs – the universal phylogenetic
markers
• Ribosomal RNAs are present in all living organisms
• rRNAs play critical roles in protein translation
• rRNAs are relatively conserved and rarely acquired
horizontally
• Behave like a molecular clock
– Useful for phylogenetic analysis
– Used to build tree-of-life (placing organisms in a single
phylogenetic tree)
• 16S rRNA most commonly used
Module 2
bioinformatics.ca
rRNA – the lens into Life
• A tool for placing organisms on a phylogenetic tree
– Tree of life
• A tool for understanding the composition of a microbial
community
– Profiling of a community (alpha-diversity)
• A tool for relating one microbial community to another
– Comparison of communities (beta-diversity)
• A tool for reading out the properties of a host or environment
– Obese vs. lean; IBD vs. no IBD (classification)
-McDonald et al, RNA, 2015
Module 2
bioinformatics.ca
Other Marker Genes Used
• Eukaryotic Organisms (protists, fungi)
– 18S (http://www.arb-silva.de)
– ITS (http://www.mothur.org/wiki/UNITE_ITS_database)
• Bacteria
– CPN60 (http://www.cpndb.ca/cpnDB/home.php)
– ITS (Martiny, Env Micro 2009)
– RecA gene
• Viruses
– Gp23 for T4-like bacteriophage
– RdRp for picornaviruses
Faster evolving markers used for strain-level differentiation
Module 2
bioinformatics.ca
A Few Considerations for Selecting
Marker Genes
• The marker should have sufficient resolution to
differentiate the different communities you want to study
(16S differentiates genera and some species; e.g. HSP65
may be more useful for Mycobacterium)
• A reference database is needed for taxonomic
assignment and for binning by reference so availability of
a good reference DB for the samples you study is
necessary
• When comparing across studies, need to standardize
markers (different V regions of 16S gene), experimental
protocols, and bioinformatic pipelines
Module 2
bioinformatics.ca
DNA Extraction
• Many protocols/kits available but kit-bias exists
• HMP and other major projects standardized DNA
extraction protocol http://www.earthmicrobiome.org/emp-standardprotocols/dna-extraction-protocol/
• DNA Extraction can also be done after fractionation to
separate out different organisms based on cell size and
other characteristics http://www.jove.com/video/52685/automated-gel-sizeselection-to-improve-quality-next-generation
Module 2
bioinformatics.ca
Contaminations
Lab:
• Extraction process can introduce contamination from the lab
– Reagents may be contaminated with bacterial DNA
• Include Extraction Negative Control in your experiments!
• Especially crucial if samples have low DNA yield!
Host / Environment (metagenomics sequencing):
• Host DNA often ends up in the microbiome sample
– http://hmpdacc.org/doc/HumanSequenceRemoval_SOP.pdf
• Unwanted fractions (e.g. eukaryotes) can be filtered by cell
size-selection prior to DNA extraction
• Unwanted DNA can be removed by subtractive hybridization
Module 2
bioinformatics.ca
Target Amplification
Adapter
barcode
primer
Source: Illumina application Note
Module 2
bioinformatics.ca
Target Amplification
• PCR primers designed to amplify specific regions of a
genome
– In a “dirty” sample, inhibitors can interfere with PCR
– In a complex sample, non-specific amplification can occur
– Check your products using BioAnalyzer or run a gel
• Gel size selection may be necessary to clean up the PCR
products
– http://www.jove.com/video/52685/automated-gel-sizeselection-to-improve-quality-next-generation
• Dilution may be necessary to reduce inhibition
Module 2
bioinformatics.ca
Target Selection and Bias
• 16S rRNA contains 9 hypervariable regions (V1-V9)
• V4 was chosen because of its size (suitable for Illumina
150bp paired-end sequencing) and phylogenetic
resolution
• Different V regions have
different phylogenetic resolutions
– giving rise to slightly different
community composition results
– Jumpstart Consortium HMP,
PLoS One, 2012
Module 2
bioinformatics.ca
Sample Multiplexing
• MiSEQ capacity (~20M paired-end reads) allows multiple
samples to be combined into a single run
– Number of reads needed to differentiate samples depends on
the nature of the studies
• Unique DNA barcodes can be incorporated into your
amplicons to differentiate samples
• Amplicon sequences are quite homogeneous
– Necessary to combine different amplicon types or spike in PhiX
control reads to diversify the sequences
Module 2
bioinformatics.ca
One Step vs. Two Step Amplification
One step amplification
Two step amplification
Barcodes and sequencing adaptors
included in the amplicon primers
Barcodes and adaptors separated from
the amplicon primers
Single step PCR reaction
PCR followed by DNA-ligation or more
PCR
Long primers; barcodes can interfere with
amplification primers
Target primers tested independently from
barcodes/adaptors
Not suitable for degenerate primer
amplification
Compatible with random and degenerate
primer amplification
Suitable if working with one amplicon
type from many samples
Suitable if working with many types of
amplicons or metagenomics in a single
study
Rapid protocol and little loss of
biomaterial
Longer protocols with more biomaterial
loss
Barcoded primers for 16S available from
Rob Knight’s Lab
Tested barcodes from Illumina, BioO, NEB,
etc. available
Module 2
bioinformatics.ca
Available Marker Gene Analysis
Platforms
• QIIME (http://qiime.org)
• Mothur (http://www.mothur.org)
Module 2
bioinformatics.ca
Comparison Between QIIME and Mothur
QIIME
Mothur
A python interface to glue together many
programs
Single program with minimal external
dependency
Wrappers for existing programs
Reimplementation of popular algorithms
Large number of dependencies / VM
available; work best on single multi-core
server with lots of memory
Easy to install and setup; work best on
single multi-core server with lots of
memory
More scalable
Less scalable
Steeper learning curve but more flexible
Easy to learn but workflow works the best
workflow if you can write your own scripts with built-in tools
http://www.ncbi.nlm.nih.gov/pubmed/240 http://www.mothur.org/wiki/MiSeq_SOP
60131
Module 2
bioinformatics.ca
Overall Bioinformatics Workflow
Sequence data
(fastq)
1) Preprocessing:
remove primers,
demultiplex, quality
filter, decontamination
2) OTU Picking /
Representative
Sequences
3) Taxonomic
Assignment
Metadata about
samples (mapping
file)
Inputs
4) Build OTU
Table (BIOM file)
6) Phylogenetic
Analysis
OTU Table
Phylogenetic
Tree
Processed Data
Outputs
Module 2
5) Sequence
Alignment
7) Downstream analysis and
Visualization
– knowledge discovery
bioinformatics.ca
1) Preprocessing
Sample de-multiplexing
• MiSEQ capacity (~20M paired-end reads) allows multiple
samples to be combined into a single run
• Reads need to be linked back to the samples they came
from using the unique barcodes
– Error-correcting barcodes available (~2000 unique barcodes
from Caporaso et al ISME 2012)
• De-multiplexing removes the barcodes and the primer
sequences
• QIIME has several different scripts to help you prepare
your sequence files for analysis
– http://qiime.org/tutorials/processing_illumina_data.html
Module 2
bioinformatics.ca
1) Preprocessing
Quality Filtering
• QIIME filtering by:
– Maximal number of consecutive low-quality bases ( -r=3 )
– Minimum Phred quality score ( -q=3 )
– Minimal length of consecutive high-quality bases (as % of total read
length) ( -p=0.75 )
– Maximal number of ambigous bases (N’s) (-n=0 )
• Other quality filtering tools available
– Cutadapt (https://github.com/marcelm/cutadapt)
– Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic)
– Sickle (https://github.com/najoshi/sickle)
• Sequence quality summary using FASTQC
– http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Module 2
bioinformatics.ca
QIIME’s Quality Filtering
Workflow and Observations
• Adjusting p,q, and c (minimal
reads per OTU) significantly
affect OTU count
• Filtering mainly affect low
abundance OTUs – difficult to
differentiate sequencing errors
and rare OTUs
• Downstream analyses more
robust to QF changes than raw
OTU counts
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531572/
Module
bioinformatics.ca
1) Preprocessing
Decontamination (in-silico)
• In situations where host and other non-bacterial DNA
greatly outnumber the bacterial DNA, non-specific PCR
amplification can occur
• Useful to check a subset of your sequences to make sure
they match 16S reference sequences
• Two Strategies:
– Query a subset of your sequences against 16S database(s) to
confirm - use BLAST for higher sensitivity
– mapping reads to databases containing suspected contaminants
(host sequences, known contaminating sequences, etc.) – use
short read aligners
Module 2
bioinformatics.ca
2) OTU Picking
OTU Picking
• OTUs: formed arbitrarily based on sequence identity
– 97% of sequence similarity over the entire 16S gene ≈ species
• QIIME supports 3 approaches (each with several
algorithms)
– De novo clustering
– Closed-reference
– Open-reference
• More details about OTU picking
– http://qiime.org/tutorials/otu_picking.html
– Current recommendation outlined in
http://msystems.asm.org/content/1/1/e00003-15
Module 2
bioinformatics.ca
De novo Clustering
•
•
•
•
•
•
Groups sequences based on sequence identity
Pair-wise comparisons needed
Hierarchical clustering commonly used
Requires a lot of disk/memory (hundreds Gb) and time
Suitable if no reference database is available
According to Schloss (mothur), average-linkage
clustering is the most robust to changes in input data
and algorithm parameters
– “generated OTUs that were more likely to represent the
actual distances between sequences”
– Recommend first group input sequences into broad
taxonomic bins (e.g. class or family) then cluster within
each bin
(https://peerj.com/articles/1487/)
Navas-Molina et al 2013, Meth. Enzym
Module 2
bioinformatics.ca
De novo clustering
• Exhaustive de novo clustering (pair-wise distance matrix
followed by hierarchical clustering) is not scalable (e.g.
1000 sequences requires 1000x1000 comparisons)
• Greedy algorithms (e.g. UCLUST, sumaclust) achieves
time/space saving by clustering incrementally using a
subset of sequences as “centroids” (centroid = seed of a
cluster)
• Input order of sequences affects
centroids picking so input seqs
are usually pre-sorted by length
or abundance
Module 2
bioinformatics.ca
Closed-Reference
• Matches sequences to an existing
database of reference sequences
• Unmatched sequences discarded
• Fast and can be parallelized
• Suitable if accurate and
comprehensive reference is available
• Allow taxonomic comparison across
different markers and different
datasets
• Novel organisms missed/discarded
Navas-Molina et al 2013, Meth. Enzym
Module 2
bioinformatics.ca
Open-Reference
• First matches sequences to reference
database (like closed-reference)
• Unmatched sequences then go through
de-novo clustering
• Best of both worlds
• Suitable if a mixture of novel sequences
and known sequences is expected
• Results could be biased by the reference
database used
• QIIME Recommended approach
Navas-Molina et al 2013, Meth. Enzym
Module 2
bioinformatics.ca
2) OTU Picking
Representative Sequence
• Once sequences have been clustered into OTUs, a
representative sequence is picked for each OTU.
• This means downstream analyses treats all organisms in
one OTU as the same!
• Representative sequence can be picked based on
–
–
–
–
–
–
Abundance (default)
Centroid used (open-reference OTU or de novo)
Length
Existing reference sequence (for closed-reference OTU picking)
Random
First sequence
Module 2
bioinformatics.ca
2) OTU Picking
Chimera Sequence Removal
• PCR amplification process can
generate chimeric sequences
(an artificially joined sequence
from >1 templates)
• Chimeras represent ~1% of the reads
• Detection based on the identification of a 3-way
alignment of non-overlapping sub-sequences to 2
sequences in the search database
Source: http://drive5.com/usearch/manual/chimera_formation.html
Module 2
bioinformatics.ca
3) Taxonomic Assignment
Taxonomic Assignment
• OTUs don’t have names but humans have the desire to
refer to things by names.
– I.e. we want to refer to a group of organism as Bacteroides
rather than OTU1.
• Closed-reference:
– Transfer taxonomy of the reference sequence to the query read
• Open-reference + de-novo clustering:
– Match OTUs to a reference data using “similarity” searching
algorithms
– It is important to report how the matching is done in your
publication and which taxonomy database is used
Module 2
bioinformatics.ca
3) Taxonomic Assignment
Taxonomic Assignment
Taxonomy DBs
RDP Classifier
RDP
NCBI BLAST
GreenGenes
rtax
Silva
Module 2
Specify a
matching
algorithm
Bacteria;
Firmicutes; Bacilli;
Lactobacillales;
Lactobacillaceae;
Lactobacillus
Etc.
tax2tree
OTUs
Bacteria;
Bacteroidetes/Chl
orobi group;
Bacteroidetes;
Bacteroidia;
Bacteroidales;
Bacteroidaceae;
Bacterroides
Specify a target
taxonomy
database
Ranked
taxonomy
names
bioinformatics.ca
3) Taxonomic Assignment
Taxonomy Databases
• RDP (Cole et al 2009)
– Most similar to NCBI Taxonomy
– Has a rapid classification tool (RDP-Classifier)
• GreenGenes (DeSantis et al 2006)
– Preferred by QIIME
• Silva (Quast et al. 2013)
– Preferred by Mothur
• Choose DB based on existing datasets you want to compare
to and community preference
• Caveat: only 11% of the ~150 human associated bacterial
genera have species that fall in the 95%-98.7% 16S rRNA
identity (Tamisier et al, IJSEM, 2015)
Module 2
bioinformatics.ca
3) Taxonomic Assignment
Taxonomy Summary
Different body sites
have different
taxonomy
composition shown
as stacked bar
graphs
Source: Navas-Molina et al 2015
Module 2
bioinformatics.ca
4) Build OTU Table
OTU Table
• OTU table is a sample-by-observation matrix
OTU1
OTU2
OTU3
OTU4
OTU5...
Sample1
10
14
0
33
23
Sample2
5
0
54
2
12
• OTUs can have corresponding taxonomy information
• Extremely rare OTUs (singletons or frequencies <0.005% of
total reads) can be filtered out to improve downstream
analysis
– These OTUs may be due to sequencing errors and chimeras
• Encoded in Biological observation Matrix (BIOM) format
(http://biom-format.org/
Module 2
bioinformatics.ca
BIOM Format
• Advantages:
– Efficient: Able to handle large amount of data; sparse matrix
– Encapsulation: Able to keep metadata (e.g. sample information)
and observational data (e.g. OTU counts) in one file
– Supported: Have software applications to manipulate the
content, check the content to minimize editing errors
– Standardization: BIOM is a GSC (http://gensc.org/recognized
standard; many tools use this file format (e.g. QIIME, MG-RAST,
PICRUSt, Mothur, phyloseq, MEGAN, VAMPS, metagenomeSeq,
Phinch, RDP Classifier)
Module 2
bioinformatics.ca
5) Sequence Alignment
Sequence Alignment
• Sequences must be aligned to infer a phylogenetic tree
• Phylogenetic trees can be used for diversity analyses
• Traditional alignment programs (clustalw, muscle etc) are
slow (Larkin et al 2007, Edgar 2004)
• Template-based aligners such as PyNAST (Caporaso et al 2010)
and Infernal (Nawrocki, 2009) are faster and more scalable for
large number of sequences
Module 2
bioinformatics.ca
6) Phylogenetic Analysis
Phylogenetic Tree Reconstruction
• After sequences are aligned, phylogenetic trees can be
constructed from the multiple alignments
• Examples:
–
–
–
–
–
FastTree (Price et al 2009) – recommended by QIIME
Clearcut (Evans et al 2006)
Clustalw
RAXML (Stamatakis et al 2005)
Muscle
Module 2
bioinformatics.ca
Rarefaction
• As we multiplex samples, the sequencing depths can vary from
sample to sample
• Many richness and diversity measures and downstream analyses
are sensitive to sampling depth
– “more reads sequenced, more species/OTUs will be found in a given
environment”
• So we need to rarefy the samples to the same level of sampling
depth for more fair comparison
• Alternative approach: variance stabilizing transformations
– McMurdie and Holmes, PLOS CB, 2013
– Liu, Hsiao et al, Bioinformatics, 2011
Module 2
bioinformatics.ca
7) Downstream Analysis
Alpha Diversity Analysis
• Alpha-diversity: diversity of organisms in one
sample / environment
– Richness: # of species/taxa observed or estimated
– Evenness: relative abundance of each taxon
– α-Diversity: taking both evenness and richness into
account
• Phylogenetic distance (PD)
• Shannon entropy
• dozens of different measures in QIIME
Module 2
bioinformatics.ca
7) Downstream Analysis
Alpha Diversity (PD)
Closed-reference
Open-reference
De-novo
Rarefaction curves different based on the OTU-picking method
used. De-novo approach and Open-reference generated higher
diversity than Closed-reference approach
Source: Navas-Molina et al 2015
Module 2
bioinformatics.ca
7) Downstream Analysis
Beta Diversity Analysis
• Beta-diversity: differences in diversities across samples
or environments
–
–
–
–
UniFrac (Lozupone et al, AEM, 2005) (phylogenetic)
Bray-Curtis dissimilarity measure (OTU abundance)
Jaccard similarity coefficient (OTU presence/absence)
Large number of other measures available in QIIME (and in
Mothur)
Module 2
bioinformatics.ca
7) Downstream Analysis
Beta Diversity Analysis
• First, paired-wise distances/similarity of the samples
calculated
Sample 1
•
Sample 2
Sample 3
Sample 1
Sample 2
Sample 3
1
0.4
0.8
1
0.7
1
• Then the matrix (high-dimensional data) is transformed
into a lower-dimensional display for visualization
– Shows you which samples cluster together in 2D or 3D space
– PCoA
– Hierarchical clustering
Module 2
bioinformatics.ca
7) Downstream Analysis
Beta Diversity (UniFrac)
• all OTUs are mapped to a phylogenetic tree
• summation of branch lengths unique to a sample constitutes
the UniFrac score between the samples
• shared nodes either ignored (unweighted) or discounted
(weighted)
• Weighted UniFrac
sensitive to experimental
bias
• Unweighted UniFrac
often proven more robust
but sensitive to the
effects of rare OTUs
Module 2
bioinformatics.ca
7) Downstream Analysis
Principal Coordinate Analysis (PCoA)
• When there are multiple samples, the UniFrac scores are
calculated for each pair of samples. In order to view the
multidimensional data, the distances are projected onto
2-dimensions using PCoA
Module 2
bioinformatics.ca
7) Downstream Analysis
PCoA
• Each dot
represents a
sample
• Dots are colored
based on the
attributes
provided in the
mapping
(metadata) file
Source: Navas-Molina et al 2015
Module 2
bioinformatics.ca
7) Downstream Analysis
Hierarchical Clustering
• Each leaf is a
sample
• Samples “forced”
into a bifurcating
tree
Source: Navas-Molina et al 2015
Module 2
bioinformatics.ca
Marker Genes vs.
Shotgun Metagenomics
Marker Gene Profiling
Shotgun Metagenomics Profiling
Less expensive (~$100 per sample)
Still very expensive (~$1000 per sample)
Computational needs can be met by
desktop / small server computers
Usually requires huge computational
resources (cluster of computers)
Provides mainly taxonomic profiling
Provides both taxonomic and functional
profiling
For 16S, majority of genes can be
assigned at least to phylum level
Many more unassigned gene fragments
(“wasted” data)
Relatively free of host DNA contamination Prone to host DNA contamination
Module 2
bioinformatics.ca
Functional Stability vs. OTU Stability
Source: Turnbaugh, P, et al. 2008. Nature
Module 2
bioinformatics.ca
PICRUSt
• Predicts gene
content of
microbiome based
on marker gene
survey data
• Analysis showed
that in poorlycharacterized
community types,
16S profile works
better than
(predicted) functional
profile to classify
samples
Module 2
Source: Xu et al ISME Journal 2014
bioinformatics.ca
We are on a Coffee Break &
Networking Session
Module
bioinformatics.ca