Mining Public Data for Insights
into Human Disease
11/16/2009
Baliga Lab Meeting
Chris Plaisier
Utility of Gene Expression for
Human Disease
Microarray Technology
Big Picture
Data Access
Gene Expression Microarray
Repositories
• Gene Expression Omnibus (GEO)
 Hosted by: NCBI
 Platform: All accepted
 Normalization: Experiment by experiment basis
 Access: R (GEOquery), EUtils
 Meta-Information: GEOmetadb
• ArrayExpress
 Hosted by: EMBL-EBI
 Platform: All accepted
 Normalization: Experiment by experiment basis
 Access: Web interface, EMBL API
 Meta-Information: ? (API)
• Many smaller repositories which have more phenotypic information for specific diseases
 Phenotypic information may be hard to access
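As a rough illustration of the "EUtils" access route listed for GEO, the sketch below only builds an ESearch URL against the GEO DataSets database (`db=gds`) and does not contact the network. The helper name `geo_esearch_url` and the example query are assumptions made for illustration. GPL570 is the GEO platform accession for the HG-U133 Plus 2.0 array.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def geo_esearch_url(term, retmax=20):
    """Build an ESearch URL for the GEO DataSets database (db=gds).

    Hypothetical helper: it only formats the request, it does not send it.
    """
    params = {"db": "gds", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Example query (an assumption): series on platform GPL570 that mention glioma.
url = geo_esearch_url("GPL570[ACCN] AND gse[ETYP] AND glioma")
print(url)
```

The returned identifiers would still have to be resolved to series accessions and downloaded, for example with GEOquery in R.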
Gene Expression Omnibus
Samples Per Platform in GEO
HGU133 Plus 2.0
Latest 3’ Affymetrix Array
HGU133A
Affymetrix arrays account for ~67% of human
gene expression data in public repositories.
Affymetrix Probesets
• GeneChip U133 Plus 2.0 Array: >54,000 probesets (image stored as CEL file)
• Probeset = 11 probe pairs
• Probe pair = perfect match probe + mismatch probe
• Each probe is 25 nucleotides long
Pre-Processing 101
Pre-Processing Gene Expression Data
Removing Mis-Targeted and
Non-Specific Probes
• Normally the CDF file comes from Affymetrix: CEL File + CDF File → Intensities
• Alternative CDF file, thoroughly cleaned: CEL File + AltCDF File → Intensities
Zhang, et al. 2005
Pre-Processing Gene Expression Data
What Makes Cells Different?
PANP: Presence/Absence Filtering
• Use Negative Strand Matching Probesets (NSMPs) to determine the true background distribution
 NSMPs are designed to hybridize to the strand opposite the expressed strand
• Use the background distribution from these NSMPs to threshold the entire dataset
• Output is a call for each gene on each array
 Calls are:
• P = present
• M = marginal
• A = absent
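The thresholding idea above can be sketched as follows. This is not the PANP package itself (which is an R/Bioconductor package) but a simplified Python illustration: each array's NSMP values serve as an empirical background, and every probeset gets an empirical p-value that is cut at two thresholds. The function name and the cutoffs 0.01 and 0.02 are assumptions.

```python
import numpy as np


def panp_like_calls(expr, nsmp_rows, tight=0.01, loose=0.02):
    """Return a probesets x arrays matrix of 'P', 'M', 'A' calls.

    expr      : 2-D array (probesets x arrays) of expression values
    nsmp_rows : row indices of the negative strand matching probesets
    tight     : p-value cutoff for 'P' (assumed value)
    loose     : p-value cutoff for 'M' (assumed value)
    """
    expr = np.asarray(expr, dtype=float)
    calls = np.empty(expr.shape, dtype="<U1")
    for j in range(expr.shape[1]):
        background = np.sort(expr[nsmp_rows, j])
        # Number of background values strictly below each observed value.
        n_below = np.searchsorted(background, expr[:, j], side="left")
        # Empirical p-value: fraction of background values >= observed value.
        pvals = 1.0 - n_below / background.size
        calls[:, j] = np.where(pvals < tight, "P",
                               np.where(pvals < loose, "M", "A"))
    return calls
```

A probeset is called present only when almost no background probeset reaches its intensity on that array, so the calls depend on how many NSMPs are available to define the background.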
Identifying Present Genes
• Filter out genes called absent in ≥ 50% of arrays
 Whole dataset
 Subsets
• Only present genes are utilized in future
analyses
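A minimal sketch of the presence filter, assuming the calls are stored as a genes x arrays matrix of 'P', 'M', 'A' strings. The function name is hypothetical, and treating marginal calls as not absent is an assumption. The optional `columns` argument covers the "Subsets" case.

```python
import numpy as np


def present_gene_mask(calls, max_absent_fraction=0.5, columns=None):
    """Return a boolean mask of genes to keep.

    A gene is dropped when the fraction of 'A' calls is >= max_absent_fraction.
    columns : optional list of array indices defining a subset of samples.
    """
    calls = np.asarray(calls)
    if columns is not None:
        calls = calls[:, columns]
    absent_fraction = (calls == "A").mean(axis=1)
    return absent_fraction < max_absent_fraction


calls = np.array([["P", "A", "A", "P"],
                  ["A", "A", "A", "P"],
                  ["P", "P", "M", "A"]])
keep_all = present_gene_mask(calls)                    # whole dataset
keep_subset = present_gene_mask(calls, columns=[0, 3]) # one subset of arrays
```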
Pre-Processing Gene Expression Data
Removing Redundancy
Reason for Removing
Redundancy Before Running
Removing Redundancy
• Collapse Affymetrix Probeset IDs to
EntrezIDs
• Test for correlation between probesets
 If correlation is ≥ 0.8 then combine probesets
 If not then leave them separate
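The collapsing rule can be sketched as below. The slides do not say how probesets are combined or how more than two probesets per gene are handled, so two assumptions are made here: probesets are combined by averaging, and they are combined only when every pairwise correlation within the gene is ≥ 0.8. The function name and the `entrez|probeset` labels are hypothetical.

```python
from collections import defaultdict

import numpy as np


def collapse_probesets(expr, probeset_ids, entrez_ids, min_corr=0.8):
    """Collapse probeset rows to genes; returns {label: expression profile}.

    expr         : 2-D array (probesets x samples)
    probeset_ids : probeset ID for each row
    entrez_ids   : Entrez gene ID for each row
    """
    expr = np.asarray(expr, dtype=float)
    groups = defaultdict(list)
    for row, gene in enumerate(entrez_ids):
        groups[gene].append(row)

    collapsed = {}
    for gene, rows in groups.items():
        if len(rows) == 1:
            collapsed[str(gene)] = expr[rows[0]]
            continue
        corr = np.corrcoef(expr[rows])  # pairwise Pearson correlations
        if corr.min() >= min_corr:
            # All probesets agree: combine by averaging (assumption).
            collapsed[str(gene)] = expr[rows].mean(axis=0)
        else:
            # Disagreement: leave the probesets separate.
            for row in rows:
                collapsed[f"{gene}|{probeset_ids[row]}"] = expr[row]
    return collapsed
```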
Pre-Processing Gene Expression Data
Pre-Processing Pipeline
(Pipeline diagram legend: each step is marked as implemented in R or implemented in Python.)
Big Picture
Glioma:
A Deadly Brain Cancer
Wikimedia commons
Brain Anatomy
Wikimedia commons
What do they do?
Neurophysiology
Hierarchy of
Nervous Tissue Tumors
Glioma
WHO Grade   Tumor Type
I           Pilocytic Astrocytoma
II          Diffuse or Low-Grade Astrocytoma
III         Anaplastic Astrocytoma
IV          Glioblastoma Multiforme
Percentage of CNS
Tumors
9.8%
20.3%
Gliomas account for 40% of all CNS tumors and 78% of malignant CNS tumors.
Buckner et al., 2007
Glioma Survival
10 years
5 years
http://www.neurooncology.ucla.edu/
Repository of Molecular Brain
Neoplasia Data (REMBRANDT)
• REMBRANDT (Madhavan et al., 2009)
 Currently 257 individual specimens
• Glioblastoma multiforme (GBM) = 110
• Astrocytoma = 50
• Oligodendroglioma = 55
• Mixed = 21
• Non-Tumor = 21
 Phenotypes
• Tumor type: GBM, Astrocytoma, etc.
• WHO Grade: 176 individuals
• Age: 253 individuals
• Sex: 250 individuals (partially inferred using Y chromosome genes)
• Survival (days post diagnosis): 169 individuals
REMBRANDT:
Chromosome Y Expression
(Figure: samples clustered by sex specific gene expression, labeled Female and Male)
• 8 males cluster with females
• 4 females cluster with males
Males clustering with females should be more common than females clustering with males,
because females lack a Y chromosome and should not express Y-linked genes.
REMBRANDT:
Chr. Y Expression – Intelligent Reassignment
(Figure: samples clustered by sex specific gene expression, labeled Female and Male)
Intelligent Reassignment – If the recorded sex disagrees with the expression-based group,
the call is set to NA. All samples of unknown sex are given the expression-based call.
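The reassignment rule can be written as a small function. This is a sketch: the names are hypothetical, and the threshold-based inference from Y chromosome expression is an assumption, since the slides infer sex from clustering rather than from a fixed cutoff.

```python
def infer_sex(mean_chr_y_expression, threshold):
    """Expression-based call: 'M' if Y-linked expression exceeds a threshold.

    The threshold is a hypothetical, dataset-specific value.
    """
    return "M" if mean_chr_y_expression > threshold else "F"


def reassign_sex(recorded, inferred):
    """Apply the reassignment rule to one sample.

    recorded : 'M', 'F', or None when the sex is unknown
    inferred : 'M' or 'F' from Y chromosome expression
    Returns the final call, or None (NA) when the two sources conflict.
    """
    if recorded is None:
        return inferred   # unknowns are given the expression-based call
    if recorded != inferred:
        return None       # conflicting calls are turned into NA
    return recorded
```

Setting conflicts to NA rather than overwriting them keeps possible sample mix-ups out of any sex-stratified analysis.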
Progression of Astrocytic Glioma
Furnari, et al. (2007)
Modeling Glioma
• Increasing metastatic potential and severity of glioma could be modeled using this simple schema (stages labeled 0, 1, 2 in the figure)
• Correlation of model to survival post diagnosis is -0.68
Exploring Meta-Information
• Age explains 31% of the variance in survival post diagnosis
• Age explains 25% of the variance in the progression model
• Sex does not have a significant effect on either
survival or the progression model
 Yet it is known that glioblastoma is slightly more
common in men than in women
Summary
• Large dataset with a good amount of
meta-information
• Ready for dimensionality reduction and
network inference!
Big Picture
Clustering as
Dimensionality Reduction
Big Picture
Likely Issues
• Size of eukaryotic genomes
• Added complexity of regulatory regions
• Tissue and cell type heterogeneity
• Patient genetic and environmental
heterogeneity
Relative Genome Sizes
Solutions
• Pre-process genomic sequences
• Reduce data complexity by collapsing
redundancies
• Utilize filters that select for only the most
variant genes
Likely Issues
• Size of eukaryotic genomes
• Added complexity of regulatory regions
• Tissue and cell type heterogeneity
• Patient genetic and environmental
heterogeneity
Eukaryotic Gene Structure
(Figure labels: Transcriptional Start Site, Start Codon, Untranslated Regions, Exons, Introns)
Regulatory Regions
• Promoter: transcription factor binding sites (6-12bp motifs)
 No set length for promoters in eukaryotes. Grabbing 2Kbp, so we can use 2Kbp or smaller.
• 3’ UTR: miRNA binding sites (4-9bp motifs)
 Median 3’ UTR length is 831bp
Three Examples After Capture
85% (n = 36,177) of probesets are associated with a sequence
Solution
• Do motif detection on both promoter and 3’
UTR sequences
• Incorporate both of these regulatory
regions into the cMonkey bi-cluster scoring
matrix
Promoter Sequences
• Looking for transcription factor binding sites
(TFBS)
 Using MEME with 6-12bp motif widths
• Utilized RefSeq gene mapping to identify
putative promoter regions
 2Kbp of sequence upstream of transcriptional start
site (TSS) was grabbed
• If two RefSeq gene mappings did not overlap
then the longest transcript's promoter was taken
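Grabbing sequence upstream of the TSS has to respect the strand of the gene. The sketch below assumes the chromosome is available as a plain string and that `tss` is the 0-based index of the first transcribed base. The function names are hypothetical, and this is not necessarily how the promoters in this analysis were extracted.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def promoter_sequence(chrom_seq, tss, strand, length=2000):
    """Return up to `length` bases upstream of the TSS, in 5'->3' gene orientation.

    chrom_seq : chromosome sequence (forward strand) as a string
    tss       : 0-based index of the first transcribed base
    strand    : '+' or '-'
    """
    if strand == "+":
        start = max(0, tss - length)
        return chrom_seq[start:tss]
    if strand == "-":
        # Upstream of a minus-strand gene lies at higher forward coordinates.
        end = min(len(chrom_seq), tss + 1 + length)
        return reverse_complement(chrom_seq[tss + 1:end])
    raise ValueError("strand must be '+' or '-'")
```

Because the default is 2Kbp, shorter promoter windows can be obtained later by trimming the 5' end of the returned sequence.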
3’ UTR Sequences
• Looking for miRNA binding sites
 miRNAs are ~21 nucleotide RNA molecules that bind to mRNA and alter expression
 Using MEME with 4-9bp motif widths
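For reference, the two motif searches could be expressed as MEME command lines along these lines. This is a sketch that only assembles the argument lists. The file names, the number of motifs, the `zoops` model, and the choice to search both strands only for promoters are assumptions, not settings taken from the slides. The widths are the 6-12bp (TFBS) and 4-9bp (miRNA binding site) ranges given for the two regulatory regions.

```python
def meme_command(fasta_path, out_dir, min_width, max_width,
                 n_motifs=2, both_strands=True):
    """Assemble a MEME command line as an argument list (nothing is executed)."""
    cmd = ["meme", fasta_path, "-dna", "-oc", out_dir,
           "-minw", str(min_width), "-maxw", str(max_width),
           "-nmotifs", str(n_motifs), "-mod", "zoops"]
    if both_strands:
        cmd.append("-revcomp")  # also search the reverse complement strand
    return cmd


# Hypothetical input files for one cluster's sequences.
promoter_cmd = meme_command("promoters.fa", "meme_promoter", 6, 12)
utr_cmd = meme_command("utr3.fa", "meme_utr3", 4, 9, both_strands=False)
print(" ".join(promoter_cmd))
print(" ".join(utr_cmd))
```

Either list could be passed to `subprocess.run` on a machine where MEME is installed.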
Likely Issues
• Size of eukaryotic genomes
• Added complexity of regulatory regions
• Tissue and cell type heterogeneity
• Patient genetic and environmental
heterogeneity
Complexity of
Mammalian Systems
Cellular Heterogeneity
in Tissues
What Makes Cells Different?
Solution
• Filter out genes that are not expressed for
each tissue, leaving only those that are
expressed
• Enhance the capability of the software to
handle missing data
Likely Issues
• Size of eukaryotic genomes
• Added complexity of regulatory regions
• Tissue and cell type heterogeneity
• Patient genetic and environmental
heterogeneity
Intelligent Sample Collection
• Genetic and environmental heterogeneity
are real world issues
• Can try to match for certain confounders
• Or stratify analyses based on particular
confounders
Running cMonkey
• Running cMonkey on the AEGIR cluster
 10 nodes with 8 cores per node
 1 node has 24GB RAM
 2 others have 16GB RAM
• Completion time depends heavily on the size of the run
Beautiful New
Result Interface
Looking at a Cluster
Chris’s Graphics Mods
Original cMonkey Output
Sorted cMonkey Output
Boxplot For All Samples
Boxplot for In Samples
Integrating Phenotypes
What to do when you find a
cluster?
Checking Out PSSM #1
Known Motif?
Motif Known?
What do the genes do?
Functional Enrichment?
Functional Enrichment
Genes?
Interesting Cluster
Phenotype Correlations
• Survival –
 Correlation coefficient = -0.48
 P-value = 3.2 x 10^-11
• Progression Model –
 Correlation coefficient = 0.55
 P-value = 6.7 x 10^-16
• Age –
 Correlation coefficient = 0.32
 P-value = 2.2 x 10^-7
• Sex –
 Correlation coefficient = -0.27
 P-value = 0.0012
Bonferroni corrected significance threshold: p-value ≤ 0.05 / (585 * 4) ≈ 2.1 x 10^-5
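The Bonferroni arithmetic can be checked directly. Reading 585 as the number of clusters tested and 4 as the number of phenotypes is an assumption. The p-values are the ones reported for this cluster.

```python
n_clusters = 585     # assumed meaning of 585
n_phenotypes = 4     # survival, progression model, age, sex
alpha = 0.05

# Bonferroni: divide the significance level by the number of tests.
threshold = alpha / (n_clusters * n_phenotypes)

p_values = {
    "Survival": 3.2e-11,
    "Progression Model": 6.7e-16,
    "Age": 2.2e-7,
    "Sex": 0.0012,
}
significant = {name: p <= threshold for name, p in p_values.items()}

print(f"threshold = {threshold:.2e}")
for name, passed in significant.items():
    print(name, "significant" if passed else "not significant")
```

Under this threshold the correlations with survival, the progression model, and age pass, while the correlation with sex does not.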
Genes from Cluster
AFFY_ID       Gene Symbol   Gene Name
212067_S_AT   C1R           complement component 1, r subcomponent
208747_S_AT   C1S           complement component 1, s subcomponent
201743_AT     CD14          CD14 antigen
215049_X_AT   CD163         CD163 antigen
203854_AT     CFI           complement factor I
213060_S_AT   CHI3L2        chitinase 3-like 2
208146_S_AT   CPVL          carboxypeptidase, vitellogenic-like
201798_S_AT   FER1L3        fer-1-like 3, myoferlin (C. elegans)
206584_AT     LY96          lymphocyte antigen 96
202180_S_AT   MVP           major vault protein
204150_AT     STAB1         stabilin 1
204924_AT     TLR2          toll-like receptor 2
= Previously known to be differentially expressed in GBM.
Motif Matches
PSSM #1
PSSM #2
Summary
• Very promising results
• Need to further develop certain aspects of
cMonkey to better utilize the human data
• Then need to build network inference
component
General Questions
• Biclustering or not?
• How many genes to run?
• How much sequence to feed MEME?
• Can more than one experiment be
included?
Cluster Samples, or Not?
• Bi-clustering clusters not only on genes but also
by experimental conditions (samples)
• Because we are using just one experiment it
may not be necessary to cluster samples
• Although it may be useful again once other
experiments are included
Bi-clustering or Not?
Bi-clustering
Gene Clustering Only
Brief Glance
• Looks like for this dataset it may make
more sense to only cluster genes
 More clusters with significant motifs
• Although this is likely to change once we
add more experiments to the mix
• Need a method to quantify this
General Questions
• Biclustering or not?
• How many genes to run?
• How much sequence to feed MEME?
• Can more than one experiment be
included?
Maxing Out cMonkey
• Can cMonkey handle running all genes?
 Yes, without doing motif finding
 With motif finding this will take a long time (weeks?),
and tends to crash
• Essentially need to balance sequence length for
motif finding with cluster size and number of
clusters
• Need a method to quantify this
General Questions
• Biclustering or not?
• How many genes to run?
• How much sequence to feed MEME?
• Can more than one experiment be
included?
Length for Promoters?
• MEME suggests 1Kbp or less for
sequences as input
• Tried using 500bp, 1Kbp, 2Kbp, 2.5Kbp,
and 5Kbp
Brief Glance
• So far it looks like the 500bp sequences give the most
clusters with motifs
• Need a method to quantify this
General Questions
• Biclustering or not?
• How many genes to run?
• How much sequence to feed MEME?
• Can more than one experiment be
included?
Breast Cancer Metastasis
Bos et al., 2009
cMonkey for Eukaryotes
Future Modifications to cMonkey for
eukaryotes:
 Preprocess sequence data
 Add 3’ UTR miRNA motif detection
 Integrate 3’ UTR miRNA motif scores with
promoter motif scores
Network Inference
• cMonkey software is
utilized to produce the biclusters
• Inferelator can then be
used to identify regulatory
factors
• Simple correlation with
phenotypes can relate biclusters to disease
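The "simple correlation with phenotypes" step might look like the sketch below. It assumes each bicluster is summarized by the mean of its z-scored gene profiles, and that samples with a missing phenotype (NaN) are skipped. Both are assumptions, and the function name is hypothetical.

```python
import numpy as np


def bicluster_phenotype_correlation(expr, gene_rows, phenotype):
    """Pearson correlation between a bicluster summary profile and a phenotype.

    expr      : 2-D array (genes x samples)
    gene_rows : row indices of the genes in the bicluster (non-constant rows)
    phenotype : 1-D array of per-sample values, NaN where unknown
    """
    expr = np.asarray(expr, dtype=float)
    phenotype = np.asarray(phenotype, dtype=float)
    sub = expr[gene_rows]
    # z-score each gene so highly expressed genes do not dominate the mean.
    z = (sub - sub.mean(axis=1, keepdims=True)) / sub.std(axis=1, keepdims=True)
    profile = z.mean(axis=0)
    keep = ~np.isnan(phenotype)  # drop samples without phenotype data
    return np.corrcoef(profile[keep], phenotype[keep])[0, 1]
```

Running this for every bicluster against every phenotype yields the table of correlations that then needs multiple testing correction.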
Acknowledgements
Baliga Lab
• Nitin
• David
• Chris
• Dan
Hood Lab
• Burak Kutlu
• Luxembourg Project
• REMBRANDT