Mining Public Data for Insights into Human Disease
11/16/2009
Baliga Lab Meeting
Chris Plaisier

Utility of Gene Expression for Human Disease

Microarray Technology

Big Picture

Data Access

Gene Expression Microarray Repositories
• Gene Expression Omnibus (GEO)
  • Hosted by: NCBI
  • Platform: All accepted
  • Normalization: Experiment by experiment basis
  • Access: R (GEOquery), EUtils
  • Meta-Information: GEOmetadb
• ArrayExpress
  • Hosted by: EMBL
  • Platform: All accepted
  • Normalization: Experiment by experiment basis
  • Access: Web interface, EMBL API
  • Meta-Information: ? (API)
• Many smaller repositories have more phenotypic information for specific diseases
  • Phenotypic information may be hard to access

Gene Expression Omnibus

Samples Per Platform in GEO
[Figure: samples per platform; labeled platforms are HGU133 Plus 2.0 (latest 3’ Affymetrix array) and HGU133A.]
Affymetrix arrays account for ~67% of human gene expression data in public repositories.

Affymetrix Probesets
[Figure: GeneChip U133 Plus 2.0 Array; the image is stored as a CEL file.]
• >54,000 probesets
• Probeset = 11 probe pairs
• Probe pair = perfect match probe + mismatch probe
• Each probe is 25 nucleotides

Pre-Processing 101

Pre-Processing Gene Expression Data: Removing Mis-Targeted and Non-Specific Probes
• Normally: the CDF file comes from Affymetrix
  • CEL file + CDF file → intensities
• Alternative: a thoroughly cleaned CDF file (Zhang, et al. 2005)
  • CEL file + AltCDF file → intensities

Pre-Processing Gene Expression Data: What Makes Cells Different?
PANP: Presence/Absence Filtering
• Use Negative Strand Matching Probesets (NSMPs) to determine the true background distribution
  • NSMPs are designed to hybridize to the opposite strand from the expressed strand
• Use the background distribution from these NSMPs to threshold the entire dataset
• Output is a call for each array for each gene
  • P = present
  • M = marginal
  • A = absent

Identifying Present Genes
• Filter out genes ≥ 50% absent
  • Whole dataset
  • Subsets
• Only present genes are utilized in future analyses

Pre-Processing Gene Expression Data: Removing Redundancy

Reason for Removing Redundancy Before Running

Removing Redundancy
• Collapse Affymetrix Probeset IDs to EntrezIDs
• Test for correlation between probesets
  • If correlation is ≥ 0.8 then combine probesets
  • If not then leave them separate

Pre-Processing Gene Expression Data: Pre-Processing Pipeline
[Figure: pipeline diagram; legend marks each step as implemented in R or implemented in Python.]

Big Picture

Glioma: A Deadly Brain Cancer
[Image: Wikimedia Commons]

Brain Anatomy
[Image: Wikimedia Commons]

What do they do? Neurophysiology

Hierarchy of Nervous Tissue Tumors

Glioma
WHO Grade   Tumor Type
I           Pilocytic Astrocytoma
II          Diffuse or Low-Grade Astrocytoma
III         Anaplastic Astrocytoma
IV          Glioblastoma Multiforme
Percentage of CNS Tumors (figure labels): 9.8%, 20.3%
Gliomas account for 40% of all tumors and 78% of malignant tumors (Buckner et al., 2007).

Glioma Survival
[Figure: survival at 5 years and 10 years. Source: http://www.neurooncology.ucla.edu/]

Repository of Molecular Brain Neoplasia Data (REMBRANDT)
• REMBRANDT (Madhavan et al., 2009)
• Currently 257 individual specimens
  • Glioblastoma multiforme (GBM) = 110
  • Astrocytoma = 50
  • Oligodendroglioma = 55
  • Mixed = 21
  • Non-Tumor = 21

Phenotypes
• Tumor type: GBM, Astrocytoma, etc.
• WHO Grade: 176 individuals
• Age: 253 individuals
• Sex: 250 individuals (partially inferred using Y chromosome genes)
• Survival (days post diagnosis): 169 individuals

REMBRANDT: Chromosome Y Expression
[Figure: sex-specific gene expression, with female and male clusters.]
• 8 males cluster with females
• 4 females cluster with males
• Conversions of male to female should be more common than the other way, because females are unlikely to express Y chromosome genes.

REMBRANDT: Chr. Y Expression, Intelligent Reassignment
[Figure: sex-specific gene expression, with female and male clusters.]
• Intelligent Reassignment: if the previous sex call is for the other group, the call is turned into an NA.
• All unknowns are given a call.

Progression of Astrocytic Glioma
[Figure: Furnari, et al. (2007)]

Modeling Glioma
• Increasing metastatic potential and severity of glioma could be modeled using this simple schema
  [Figure: progression stages scored 0, 1, 2.]
• Correlation of model to survival post diagnosis is -0.68

Exploring Meta-Information
• Age explains 31% of survival post diagnosis
• Age explains 25% of the progression model
• Sex does not have a significant effect on either survival or the progression model
  • Yet it is known that glioblastoma is slightly more common in men than in women

Summary
• Large dataset with a good amount of meta-information
• Ready for dimensionality reduction and network inference!
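
The Intelligent Reassignment rule from the Chr. Y Expression slide can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the talk: the sample IDs are hypothetical, `inferred` stands in for the sex call obtained from chromosome Y expression clustering, and `None` plays the role of NA.

```python
from typing import Dict, Optional


def reassign_sex(recorded: Optional[str], inferred: str) -> Optional[str]:
    """Apply the reassignment rule to one sample.

    recorded: sex from the meta-information ("M", "F", or None if unknown).
    inferred: sex suggested by chromosome Y expression ("M" or "F").
    """
    if recorded is None:
        # Unknowns are given the expression-based call.
        return inferred
    if recorded != inferred:
        # The recorded call is for the other group: turn it into NA.
        return None
    return recorded


def reassign_all(recorded: Dict[str, Optional[str]],
                 inferred: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Apply the rule to every sample that has an expression-based call."""
    return {sid: reassign_sex(recorded.get(sid), inferred[sid])
            for sid in inferred}


# Hypothetical example data (not from REMBRANDT).
recorded = {"S1": "M", "S2": "F", "S3": None, "S4": "M"}
inferred = {"S1": "M", "S2": "M", "S3": "F", "S4": "F"}
final_calls = reassign_all(recorded, inferred)
```

Under this sketch, S1 keeps its recorded call, S3 (unknown) receives the expression-based call, and S2 and S4 become NA because the recorded and inferred calls disagree.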
Big Picture

Clustering as Dimensionality Reduction

Big Picture

Likely Issues
• Size of eukaryotic genomes
• Added complexity of regulatory regions
• Tissue and cell type heterogeneity
• Patient genetic and environmental heterogeneity

Relative Genome Sizes

Solutions
• Pre-process genomic sequences
• Reduce data complexity by collapsing redundancies
• Utilize filters that select for only the most variant genes

Eukaryotic Gene Structure
[Figure build labels: Transcriptional Start Site, Start Codon, Untranslated Regions, Exons, Introns.]

Regulatory Regions
• Promoter: transcription factor binding sites (6-12bp motifs)
  • No set length for promoters in eukaryotes. Grabbing 2Kbp, so we can use 2Kbp or smaller.
• 3’ UTR: miRNA binding sites (4-9bp motifs)
  • Median 3’ UTR length is 831bp

Three Examples

After Capture
• 85% (n = 36,177) of probesets are associated with a sequence

Solution
• Do motif detection on both promoter and 3’ UTR sequences
• Incorporate both of these regulatory regions into the cMonkey bi-cluster scoring matrix

Promoter Sequences
• Looking for transcription factor binding sites (TFBS)
  • Using MEME with 6-12bp motif widths
• Utilized RefSeq gene mapping to identify putative promoter regions
  • 2Kbp of sequence upstream of the transcriptional start site (TSS) was grabbed
• If two RefSeq gene mappings did not overlap then the longest transcript’s promoter was taken

3’ UTR Sequences
• Looking for miRNA binding sites
  • miRNA are 21bp RNA molecules that bind to mRNA and alter expression
  • Using MEME with 4-9bp motif widths

Complexity of Mammalian Systems

Cellular Heterogeneity in Tissues

What Makes Cells Different?

Solution
• Filter out genes that are not expressed for each tissue, leaving only those that are expressed
• Enhance the capability of the software to handle missing data

Intelligent Sample Collection
• Genetic and environmental heterogeneity are real world issues
• Can try to match for certain confounders
• Or stratify analyses based on particular confounders

Running cMonkey
• Running cMonkey on AEGIR cluster
  • 10 nodes with 8 cores per node
  • 1 node has 24GB RAM
  • 2 others have 16GB RAM
• Completion time depends heavily on the size of the run

Beautiful New Result Interface

Looking at a Cluster

Chris’s Graphics Mods
• Original cMonkey Output
• Sorted cMonkey Output
• Boxplot For All Samples
• Boxplot for In Samples
• Integrating Phenotypes

What to do when you find a cluster?

Checking Out PSSM #1

Known Motif?

Motif Known?

What do the genes do?

Functional Enrichment?

Functional Enrichment Genes?

Interesting Cluster: Phenotype Correlations
• Survival: correlation coefficient = -0.48, P-value = 3.2 x 10^-11
• Progression Model: correlation coefficient = 0.55, P-value = 6.7 x 10^-16
• Age: correlation coefficient = 0.32, P-value = 2.2 x 10^-7
• Sex: correlation coefficient = -0.27, P-value = 0.0012
Bonferroni corrected significant p-value ≤ 0.05 / (585 x 4) ≈ 2.1 x 10^-5

Genes from Cluster
AFFY_ID       Gene Symbol   Gene Name
212067_S_AT   C1R           complement component 1, r subcomponent
208747_S_AT   C1S           complement component 1, s subcomponent
201743_AT     CD14          CD14 antigen
215049_X_AT   CD163         CD163 antigen
203854_AT     CFI           complement factor I
213060_S_AT   CHI3L2        chitinase 3-like 2
208146_S_AT   CPVL          carboxypeptidase, vitellogenic-like
201798_S_AT   FER1L3        fer-1-like 3, myoferlin (C. elegans)
206584_AT     LY96          lymphocyte antigen 96
202180_S_AT   MVP           major vault protein
204150_AT     STAB1         stabilin 1
204924_AT     TLR2          toll-like receptor 2
[Legend: highlight = previously known to be differentially expressed in GBM.]

Motif Matches
[Figure: PSSM #1, PSSM #2]

Summary
• Very promising results
• Need to further develop certain aspects of cMonkey to better utilize the human data
• Then need to build network inference component

General Questions
• Biclustering or not?
• How many genes to run?
• How much sequence to feed MEME?
• Can more than one experiment be included?

Cluster Samples, or Not?
• Bi-clustering clusters not only on genes but also by experimental conditions (samples)
• Because we are using just one experiment it may not be necessary to cluster samples
• Although it may be useful again once other experiments are included

Bi-clustering or Not?
[Figure: Bi-clustering vs. Gene Clustering Only]

Brief Glance
• Looks like for this dataset it may make more sense to only cluster genes
  • More clusters with significant motifs
• Although this is likely to change once we add more experiments to the mix
• Need a method to quantify this

Maxing Out cMonkey
• Can cMonkey handle running all genes?
  • Yes, without doing motif finding
  • With motif finding this will take a long time (weeks?), and tends to crash out
• Essentially need to balance sequence length for motif finding with cluster size and number of clusters
• Need a method to quantify this

Length for Promoters?
• MEME suggests 1Kbp or less for sequences as input
• Tried using 500bp, 1Kbp, 2Kbp, 2.5Kbp, and 5Kbp

Brief Glance
• So far looks like the 500bp sequences give the most clusters with motifs
• Need a method to quantify this

Breast Cancer Metastasis
[Figure: Bos et al., 2009]

cMonkey for Eukaryotes

Future
• Modifications to cMonkey for eukaryotes:
  • Preprocess sequence data
  • Add 3’ UTR miRNA motif detection
  • Integrate 3’ UTR miRNA motif scores with promoter motif scores

Network Inference
• cMonkey software is utilized to produce the biclusters
• Inferelator can then be used to identify regulatory factors
• Simple correlation with phenotypes can relate biclusters to disease

Acknowledgements
Baliga Lab
• Nitin
• David
• Chris
• Dan
Hood Lab
• Burak Kutlu
• Luxembourg Project
• REMBRANDT
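
As a quick arithmetic check on the Bonferroni threshold quoted on the Phenotype Correlations slide, a minimal Python sketch follows. It assumes that 585 is the number of clusters tested and 4 the number of phenotypes (the slide gives only the product); the function and variable names are illustrative, not taken from the talk.

```python
def bonferroni_threshold(alpha: float, n_clusters: int, n_phenotypes: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / (n_clusters * n_phenotypes)


# Assumption: 585 clusters x 4 phenotypes = 2340 tests.
threshold = bonferroni_threshold(0.05, 585, 4)

# P-values reported on the slide for the interesting cluster.
p_values = {
    "Survival": 3.2e-11,
    "Progression model": 6.7e-16,
    "Age": 2.2e-7,
    "Sex": 0.0012,
}

# A correlation is called significant if its p-value is at or below the threshold.
significant = {name: p <= threshold for name, p in p_values.items()}

for name, passed in significant.items():
    print(f"{name}: p = {p_values[name]:.1e}, significant = {passed}")
```

With these numbers, survival, the progression model, and age pass the corrected threshold, while sex (p = 0.0012) does not.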