Tara A. Gianoulis, Jeroen Raes
April 13, 2010
Presenter: Quan Zhang
Introduction
Data collection
Three methods
Linear Model (LM)
Canonical correlation analysis (CCA)
Discriminative partition matching (DPM)
Results : three case studies
Energy conversion strategies
Balancing amino acid Synthesis vs. Import
Lipid and Glycan metabolism
Conclusion
Discussion
It is critical to understand:
Environmental influence on microbial communities VS.
how microbes reshape their environment.
Direct sequencing:
First large-scale technique that allows us to see the
functions of these microbial communities
Evidence for genomic adaptations:
Comparative metagenomics approaches
Sequence composition, genome size, evolutionary rates,
metabolic capabilities in different environments
A one-dimensional
representation of the
environmental metabolic profiles
for microbes sampled from nine
environments.
Dinsdale EA, et al. (2008) Functional metagenomic profiling of
nine biomes. Nature 452:629–632
The previous studies used a rough definition for
environment
For example: marine vs. land
This study treated environments explicitly as a set of
continuous features
For example: temperature, sample depth …
Define metabolic footprint of distinct environments
Footprint– The set of metabolic pathways that depend
on or covary with the environment
Data collection
Global Ocean Sampling (GOS) dataset: filter size 0.1-0.8 µm
Discard Sargasso Sea 11
Remaining 37 sites from CAMERA
Environmental features
temperature, sample depth, water depth, salinity and
monthly average chlorophyll level
Processing feature data
average the salinity over all nonzero sites (except the freshwater site)
corroborate the missing measurements using the World Ocean Database
Assign the peptides to a particular site using a mapping algorithm
that cross-referenced between reads, scaffolds, and peptides based
on predicted gene coordinates.
The “multiple sites” peptide distribution is similar to the distribution of all
peptides, so this implies there are no major differences in assembly quality
Assign the peptide to a pathway
Similarity search tool: BLASTP
Database: STRING 7.0 (current: STRING 8.2)
Threshold: bitscore > 60, 80% consistency among top 5 hits
Assign pathway frequency for each site
Build two matrices
Rows are sites, columns are environmental features
Rows are sites, columns are metabolic features
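The pathway assignment rule and the per-site frequency table can be sketched as below. This is a minimal illustration, not the authors' pipeline: the input layout (a list of `(bitscore, pathway)` tuples per peptide), the function names, measuring consistency over the hits that pass the bitscore filter, and normalizing by the number of assigned peptides are all assumptions.

```python
from collections import Counter

BITSCORE_MIN = 60       # threshold from the slide
CONSISTENCY_MIN = 0.8   # 80% agreement among the top 5 hits
TOP_N = 5

def assign_pathway(hits):
    """hits: list of (bitscore, pathway) tuples for one peptide (hypothetical layout).
    Returns the pathway name, or None when the peptide stays unassigned."""
    # Keep hits above the bitscore threshold, best first, at most TOP_N of them.
    good = sorted((h for h in hits if h[0] > BITSCORE_MIN), reverse=True)[:TOP_N]
    if not good:
        return None
    pathway, count = Counter(p for _, p in good).most_common(1)[0]
    # Assumption: consistency is the share of the retained hits that agree.
    return pathway if count / len(good) >= CONSISTENCY_MIN else None

def pathway_frequencies(peptides_by_site):
    """peptides_by_site: {site: [hits_for_peptide, ...]}.
    Returns {site: {pathway: fraction of assigned peptides}} (normalization assumed)."""
    table = {}
    for site, peptides in peptides_by_site.items():
        counts = Counter()
        for hits in peptides:
            pathway = assign_pathway(hits)
            if pathway is not None:
                counts[pathway] += 1
        total = sum(counts.values())
        table[site] = {p: c / total for p, c in counts.items()} if total else {}
    return table
```

Each inner dictionary corresponds to one row of the sites-by-metabolic-features matrix.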
Determine the first order relationships between each pair of
metabolic and environmental features
Two directions:
Environmental factor as the response variable, predicted from a subset of
pathway frequencies
Pathway frequency as the response variable, predicted from the environmental
factors
Determine the subset of predictive variables:
Stepwise regression
Akaike’s information criterion (AIC)
Top 20 pathways showing the highest pairwise correlation were
used
Limitation:
Views each feature in isolation
There are hidden dependencies among the environmental features
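Stepwise regression with AIC can be sketched as a greedy forward selection. This is only an illustration under stated assumptions: the slides do not say whether selection was forward, backward, or both, the Gaussian least-squares form of AIC is assumed, and the function names and the synthetic data are hypothetical.

```python
import numpy as np

def ols_aic(X, y):
    """AIC of a least-squares fit of y on X (X already contains an intercept column)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    rss = max(rss, 1e-12)  # guard against log(0) on a perfect fit
    return n * np.log(rss / n) + 2 * k

def forward_stepwise(X, y):
    """Add the column that lowers AIC the most; stop when no column lowers it."""
    n, p = X.shape
    intercept = np.ones((n, 1))
    selected = []
    best_aic = ols_aic(intercept, y)
    while len(selected) < p:
        candidates = [j for j in range(p) if j not in selected]
        scores = {j: ols_aic(np.hstack([intercept, X[:, selected + [j]]]), y)
                  for j in candidates}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best_aic:
            break
        selected.append(j_best)
        best_aic = scores[j_best]
    return selected, best_aic

# Synthetic stand-in: 37 sites, 6 pathway frequencies, one environmental feature
# that truly depends on pathways 0 and 2 only.
rng = np.random.default_rng(0)
pathways = rng.normal(size=(37, 6))
temperature = 3.0 * pathways[:, 0] - 2.0 * pathways[:, 2] + rng.normal(scale=0.1, size=37)
selected, aic = forward_stepwise(pathways, temperature)
print(selected, round(float(aic), 1))
```

Swapping the roles of the two matrices gives the other direction, where a pathway frequency is predicted from the environmental factors.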
Ref: http://en.wikipedia.org/wiki
Predicting specific environmental parameters from subsets of metabolic pathways.
Gianoulis T A et al. PNAS 2009;106:1374-1379
©2009 by National Academy of Sciences
Canonical correlation analysis (CCA)
Determines whether a global relationship between
environmental and metabolic features exists
Calculates the relative contribution of each feature to
the global relationship by weighting both sets of
features simultaneously.
Discriminative partition matching (DPM)
Analyzes whether groupings of sites based on similar
environmental features also shared functional
(pathway) similarities
Looks at the relationships between two groups of variables
species variables vs. environment variables (community ecology)
genetic variables vs. environmental variables (population genetics)
[Diagram labels: Units, Variables]
Ref: http://myweb.dal.ca/hwhitehe/BIOL4062/redundancy.ppt
X’s
Y’s
Given a linear combination of X variables:
F = f1X1 + f2X2 + ... + fpXp
and a linear combination of Y variables:
G = g1Y1 + g2Y2 + ... + gqYq
The first canonical correlation is:
Maximum correlation coefficient between F and G,
for all F and G
F1={f11,f12,...,f1p} and G1={g11,g12,...,g1q}
are corresponding canonical variates (dimensions)
The second canonical correlation is:
Maximum correlation coefficient between F and G,
for all F, orthogonal to F1, and G, orthogonal to G1
F2={f21,f22,...,f2p} and G2={g21,g22,...,g2q}
are corresponding second canonical variates (dimensions)
Ref: http://myweb.dal.ca/hwhitehe/BIOL4062/redundancy.ppt
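The canonical correlations and the weights f and g defined above can be computed by whitening both variable sets and taking a singular value decomposition. The sketch below is a textbook formulation, not necessarily the procedure used in the paper; the function name, the small ridge term, and the synthetic data are assumptions, and it presumes more sites than variables in each set.

```python
import numpy as np

def canonical_correlations(X, Y, ridge=1e-8):
    """Canonical correlations and weight matrices for X (n x p) and Y (n x q)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Sample covariance blocks; the tiny ridge keeps the Cholesky factors stable.
    Sxx = X.T @ X / (n - 1) + ridge * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + ridge * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)
    Lx = np.linalg.cholesky(Sxx)  # Sxx = Lx Lx^T
    Ly = np.linalg.cholesky(Syy)  # Syy = Ly Ly^T
    # M = Lx^{-1} Sxy Ly^{-T}; its singular values are the canonical correlations.
    M = np.linalg.solve(Lx, Sxy)
    M = np.linalg.solve(Ly, M.T).T
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    F = np.linalg.solve(Lx.T, U)     # column i holds the weights f_i1..f_ip
    G = np.linalg.solve(Ly.T, Vt.T)  # column i holds the weights g_i1..g_iq
    return s, F, G

# Synthetic stand-in: 37 sites, 3 environmental and 4 metabolic features
# that share one hidden factor.
rng = np.random.default_rng(1)
latent = rng.normal(size=(37, 1))
env = np.hstack([latent + 0.3 * rng.normal(size=(37, 1)), rng.normal(size=(37, 2))])
met = np.hstack([-latent + 0.3 * rng.normal(size=(37, 1)), rng.normal(size=(37, 3))])
r, F, G = canonical_correlations(env, met)
print(np.round(r, 3))
```

The magnitude of each weight in a column of `F` or `G` indicates how much that feature contributes to the corresponding canonical dimension.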
[Figure labels: Amino acid metabolism; Lipid synthesis and glycan metabolism; Energy conversion]
For environmental metadata
Cluster sites based on their quantitative environmental
metadata
Two or more clusters
For metabolism matrices
Partition the sites in the metabolism matrices into 2 site sets
Calculate the mean frequency of each pathway in each site set.
If the means of the pathway frequencies between 2 site
sets were not significantly different:
environment-based partitioning does not reflect
functional differences
If they do differ significantly:
environmental features are related to that specific
aspect of metabolism
Specifically, the Benjamini-Hochberg procedure was used to
correct the p-values
When a two-sample t-test is performed on a gene, the p-value measures
how significant the difference is between the two groups of
samples.
Ref: http://www.silicongenetics.com/Support/GeneSpring/GSnotes/analysis_guides/mtc.pdf
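The DPM comparison step can be sketched as follows: split the sites by an environment-based partition, test each pathway for a difference in mean frequency, then apply the Benjamini-Hochberg correction. To stay within the standard library this sketch swaps the t-test for a permutation test on the difference of means, which is a substitution and not the method on the slide; the function names and the 0/1 site-set encoding are hypothetical.

```python
import random
from statistics import mean

def permutation_pvalue(a, b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in means (stand-in for the t-test)."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean(pooled[:len(a)]) - mean(pooled[len(a):])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def dpm(pathway_freqs, site_set, alpha=0.05):
    """pathway_freqs: {pathway: [frequency per site]}; site_set: 0/1 label per site.
    Returns the pathways whose adjusted p-value is at most alpha."""
    names = list(pathway_freqs)
    pvals = []
    for name in names:
        values = pathway_freqs[name]
        a = [v for v, s in zip(values, site_set) if s == 0]
        b = [v for v, s in zip(values, site_set) if s == 1]
        pvals.append(permutation_pvalue(a, b))
    adjusted = benjamini_hochberg(pvals)
    return {n: q for n, q in zip(names, adjusted) if q <= alpha}
```

Pathways returned by `dpm` are the ones for which the environment-based partition reflects a functional difference.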
Similarities
Both are used to explore relationships between metabolism
and quantitative environmental parameters
Differences
DPM
All environmental variables are equally important when defining
the site sets
Robust to noise
May lose individual differences among sites and their relationships
to the environment
CCA
Weights each environmental feature and each metabolic pathway
independently
More sensitive, but more susceptible to noise
• NMI stands for Normalized Mutual Information
• NMI attempts to determine how well one classification is able to predict the second classification.
• If the NMI and transposed NMI scores are high, then either classification is good at predicting the other.
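A small sketch of NMI and its transposed score for two classifications of the same sites. The slides do not give the normalization, so this assumes the mutual information is divided by the entropy of the classification being predicted, which makes the score asymmetric; the function names are hypothetical.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy (natural log) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two labelings of the same items."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * log(c * n / (pa[x] * pb[y])) for (x, y), c in pab.items())

def nmi(predictor, target):
    """Share of the target's entropy explained by the predictor (assumed normalization)."""
    h = entropy(target)
    return mutual_information(predictor, target) / h if h > 0 else 0.0

# Hypothetical example: a fine environmental clustering and a coarse metabolic one.
fine = [0, 1, 2, 3]
coarse = [0, 0, 1, 1]
print(nmi(fine, coarse), nmi(coarse, fine))
```

In this toy case the fine classification fully predicts the coarse one, while the transposed score is lower, so only one direction is a good predictor.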
Energy conversion strategies
Balancing amino acid Synthesis vs. Import
Lipid and Glycan metabolism
Many of the environmentally-dependent pathways
were associated with energy conversion.
Ample diversification in energy conversion strategies
observed
Helps organisms maintain adequate energy levels
despite changing environmental conditions
[Figure labels: Light capture and electron transport; ATP synthase]
Phenomenon: Metabolic pathways associated with
amino acid and cofactor transport and metabolism
varied greatly with environment
This variation may be a way to cope with the
oligotrophic (nutrient-limited) nature of the oceans
Example: changes in amino acid uptake strategies
Amino acid uptake is sensitive to light availability, which could be an
additional factor in its variation.
Temperature and chlorophyll appear to have influenced
the metabolic pathways the most.
Phenomenon: correlation of amino acid biosynthesis
pathways with the environment was unrelated to the
energetic cost of synthesizing a particular amino acid
Significant positive correlation between the structural
correlation of the amino acid pathways and their
dependence on potentially limiting cofactors
Import of exogenous amino acids may be preferred
when cofactors are limiting
Methionine is a central amino acid in oceanic
microorganisms.
Cobalamin is a methionine cofactor containing cobalt.
Reduction of methionine is caused by cofactor
limitation.
Observation:
synthesis of methionine and cobalamin
amino acid transporters, methionine degradation
Thus, methionine has a significant role in shaping
downstream environmental adaptations.
Lipids and glycans are important components of the
microbial cell membrane
As expected, lipid and glycan metabolism were related to
environmental conditions.
Explanation: Depth contributed significantly to lipid metabolism, since microbes
need to adjust to the optimal buoyancy for growth.
This method associates microbial community
functions with quantitative, continuous features of the
environment
Metabolic pathway footprints can be used to predict
environmental conditions when those data are not
available
Five environmental features alone (temperature,
sample depth, water depth, salinity and monthly
average chlorophyll level) cannot fully describe
real-world environmental complexity
<0.3% of proteins in the GOS dataset were characterized
as viral, but the true fraction is expected to be much higher
Other questions?