EECS 800 Research Seminar
Mining Biological Data
Instructor: Luke Huan
Fall 2006
The University of Kansas

Administrative
  Register for 3 hours of credit

8/21/2006 Introduction Mining Biological Data KU EECS 800, Luke Huan, Fall’06

Me
  Luke Huan, assistant professor in Electrical Engineering & Computer Science
  Homepage: http://people.eecs.ku.edu/~jhuan/
  Office: 2304 Eaton Hall
  Email: jhuan@eecs.ku.edu
  Office hours: 10:00 – 11:00am Monday and Wednesday

My Lecture Style
  I may tend to talk fast, especially when excited
  Class materials are highly interdisciplinary
  Use your questions to slow me down
    Ask for clarification, or for repetition of a strange phrase or piece of jargon
    “If in doubt, speak it out”

You
  Introduction:
    Who you are
    What department you are in
    Why you are taking the course

Outline for Today
  What is mining biological data?
  What is this course about?
  Course home page
  Course references
  Paper presentation
  Final project
  Grading
  Forward class reviewing

What is Mining Biological Data
  Goal: understanding the structure of biological data
    Patterns
    Descriptive models
    Predictive models
  Challenges:
    What is the nature of the data?
    What are the computational tasks?
    How to break a task into a group of computational components?
    How to evaluate the computational results?
  Applications
    Experimental design and hypothesis generation
    Synthesis of novel proteins
    Drug design
    …

What is this Course About?
  Learning…
    Problems in mining biological data
    Available techniques, their pros and cons
    How to combine techniques
    Enough perception to avoid pitfalls
  Practicing…
    To present recent papers on a selected topic
    To work on a project that may involve
      A domain expert,
      A driving biological problem, and
      The development of new data mining techniques

Class Information
  Class homepage: http://people.eecs.ku.edu/~jhuan/fall06.html
  Meeting time: 9:00 – 9:45 Monday, Wednesday, Friday
  Meeting place: Eaton Hall 2001
  Prerequisite: none

Textbook & References
  Textbook: none
  References
    Data Mining: Concepts and Techniques, by Han and Kamber, Morgan Kaufmann, 2001. (ISBN: 1-55860-489-8)
    The Elements of Statistical Learning: Data Mining, Inference, and Prediction, by Hastie, Tibshirani, and Friedman, Springer, 2001. (ISBN: 0-387-95284-5)
    Bioinformatics: Genes, Proteins, and Computers, edited by Christine Orengo, David Jones, Janet Thornton, Bios Scientific Publishers, 2003. (ISBN: 1-85996-0545)

Paper Presentation
  One per student
  Research paper(s)
    A list of recommendations will be posted on the class webpage a week from now
    Your own pick (upon approval)
  Three parts
    Review the goal of the paper(s)
    Discuss the research challenges
    Present the techniques and comment on their pros and cons
  Questions and comments from the audience
    Extra credit for active participants in class discussions
  Order of presentation: first come, first pick
  Please send in your choice of paper by September 1st.

Final Project
  Project (due Nov. 27th)
    One project
    I will post some suggestions on the class website.
    I am soliciting projects from researchers on campus
    You are welcome to propose your own
    Discuss with me before you start
  Checkpoints
    Proposal: title and goal (due Sep. 8th)
    Background and related work (due Sep. 29th)
    Outline of approach (due Oct. 20th)
    Implementation & evaluation (due Nov. 10th)
    Class demo (due Nov. 27th)

Grading
  Grading scheme
    Paper presentation and discussion 45%
    Project 45%
    Attendance and participation 10%
  No homework
  No exam

Forward Class Reviewing
  This is for overview, not content
    Don’t worry if you do not understand some of the words; that’s why you want to take this class.
  Gives an idea of what is coming
    Order of presentation might be shuffled to accommodate everyone’s schedule
    Topics may be adjusted as the class progresses

Week 1: Pattern Mining
  Frequent patterns: finding regularities in data
    Frequent patterns (sets of items) are ones that occur frequently in a data set
    Can we automatically profile customers?
    What products are often purchased together?

  Customer shopping baskets
    ID    Items bought
    100   f, a, c, d, g, i, m, p
    200   a, b, c, f, l, m, o
    300   b, f, h, j, o
    400   b, c, k, s, p
    500   a, f, c, e, l, p, m, n

  One hypothesis: {a, c} → {m}

Week 2: Advanced Pattern Mining
  Reducing number of patterns
    Maximal patterns and closed patterns
  Constraint-based mining
  Patterns with concept hierarchy
  Patterns in quantitative data
  Correlation vs. association

Week 3: Mining Microarray
    Gene    CH1I   CH1B   CH1D   CH2I   CH2B
    CTFC3   4392    284   4108    280    228
    VPS8     401    281    120    275    298
    EFB1     318    280     37    277    215
    SSA1     401    292    109    580    238
    FUN14   2857    285   2576    271    226
    SP07     228    290     48    285    224
    MDM10    538    272    266    277    236
    CYS3     322    288     41    278    219
    DEP1     312    272     40    273    232
    NTG1     329    296     33    274    228

  Data from: Spellman, P. T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D. and Futcher, B. (1998), “Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization”, Molecular Biology of the Cell, 9, 3273-3297.

Week 4: Patterns in Sequences, Trees, and Graphs
  [Figure: example sequences, trees, and labeled graphs (G1, G2, G3) with the patterns found in them (P1 through P6) and their frequencies, e.g. f = 2/3 and f = 3/3]

Week 5: Pattern Discovery in Biomolecules
  Protein
    A sequence built from 20 amino acids, e.g. Lys Lys Gly Gly Leu Val Ala His
    Adopts a stable 3D structure that can be measured experimentally
  [Figure: protein structure renderings (cartoon, space filling, surface, ribbon); atom legend: oxygen, nitrogen, carbon, sulfur]

Week 6: Descriptive Models
  Group objects into clusters
    Ones in the same cluster are similar
    Ones in different clusters are dissimilar
  Unsupervised learning: no predefined classes
  [Figure: scatter plot showing Cluster 1, Cluster 2, and outliers]

Week 7: Subspace Clustering
  [Table: ratings given by Viewer 1 through Viewer 5 to Movie 1 through Movie 7]
  [Figure: rating (0 to 8) by viewer 1, viewer 3, and viewer 4 for movie 1, movie 2, movie 4, and movie 6]

Week 8: Mining Microarray (II)
  Apply subspace clustering to microarray analysis
    Find groups of genes that are co-regulated
    May integrate data from protein sequences and functional descriptions of genes
  Apply subgraph mining to microarray analysis

Week 9: Predictive Models
  Two-class version:
    Use “training data” from Class +1 and Class -1
    Develop a “rule” for assigning new data to a class
  Slides from J.S. Marron in Statistics at UNC

Week 10: Classification Algorithms and Applications
  Decision tree
  Fisher’s linear discrimination method
  Kernel methods

Week 11: Text Mining, Gene Ontology, Data Management
  Ontology seeks to describe or posit the basic categories and relationships of being or existence to define entities and types of entities within its framework. Ontology can be said to study conceptions of reality (Wikipedia).
  GO is a database of terms for genes
    Terms are connected as a directed acyclic graph
    Levels represent the specificity of the terms (not normalized)
  GO contains three different sub-ontologies:
    Molecular function
    Biological process
    Cellular component

Week 12: Systems Biology & Proteomics
  A proteome is the set of all proteins in an organism
  [Figure: part of the biological system in a cell at the molecular level; a signaling network from mitogen inputs to apoptosis and cell proliferation, with nodes FAS-L, IGF1, IL-3, IGF1R, FAS, IL-3R, FADD/MORT, IRS1, FLICE, P53, P21, Cyclin D1, RAS, pRb, P16, Cdk4, ICE, PI 3-K, P27, P107, Bin-1, E2F, CPP32, AKT/PKB, Bcl-XL, BAD, Mad, Max, C-Myc, Cyclin E, Cdc25A, Cdk2]
  Source: http://www.ircs.upenn.edu/modeling2001/

Week 13: Analyzing Biological Networks
  Biological networks pose serious challenges and opportunities for data mining research in computer science
    Large volume of data
    Heterogeneous data types
  [Figure: Protein-protein interaction in yeast. Gary D. Bader & Christopher W.V. Hogue, Nature Biotechnology 20, 991 - 997 (2002)]
  [Figure: Growth of Known Structures in Protein Data Bank (PDB); x-axis: Year; y-axis: # of structures (35,000 tick shown)]

Week 14: Bio-Data Integration
  Data are collected from many different sources
  Each piece of data describes part of a complicated (and not directly observable) biological process
  Combine data to achieve better understanding and better prediction

Week 15, 16: Project Presentation
  Check what you have learned from the class
  Celebrate the hard work!
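
As a concrete illustration of the Week 1 shopping-basket example, the following minimal sketch computes how often an itemset occurs (its support) and how reliably one itemset accompanies another (its confidence). It reads the Week 1 hypothesis as the rule "customers who buy a and c also buy m", which is an assumption about the slide's intent. The function names `support` and `confidence` are illustrative, and the item written "I" in basket 100 is taken to be a lowercase "i".

```python
from typing import Collection, Set

# Shopping baskets from the Week 1 slide (customer ID -> items bought).
# Assumption: the item shown as "I" in basket 100 is lowercase "i".
BASKETS = {
    100: {"f", "a", "c", "d", "g", "i", "m", "p"},
    200: {"a", "b", "c", "f", "l", "m", "o"},
    300: {"b", "f", "h", "j", "o"},
    400: {"b", "c", "k", "s", "p"},
    500: {"a", "f", "c", "e", "l", "p", "m", "n"},
}


def support(itemset: Set[str], baskets: Collection[Set[str]]) -> float:
    """Fraction of baskets that contain every item in `itemset`."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def confidence(antecedent: Set[str], consequent: Set[str],
               baskets: Collection[Set[str]]) -> float:
    """Of the baskets containing `antecedent`, the fraction that also
    contain `consequent`. Returns 0.0 if the antecedent never occurs."""
    base = support(antecedent, baskets)
    if base == 0:
        return 0.0
    return support(antecedent | consequent, baskets) / base


if __name__ == "__main__":
    baskets = list(BASKETS.values())
    print(support({"a", "c"}, baskets))            # 0.6 (baskets 100, 200, 500)
    print(confidence({"a", "c"}, {"m"}, baskets))  # 1.0 (m is in all three)
```

On this tiny data set every basket that contains both a and c also contains m, so the rule holds with confidence 1.0. A frequent-pattern miner would enumerate all itemsets above a chosen support threshold rather than test a single hand-picked rule.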

Further References
  Data mining
    Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
    Journals: Data Mining and Knowledge Discovery, IEEE-TKDD
  Bioinformatics
    Conferences: ISMB, RECOMB, PSB, CSB, BIBE, etc.
    Journals: Bioinformatics, J. of Computational Biology, etc.
  AI & Machine Learning
    Conferences: Machine Learning (ICML), AAAI, IJCAI, etc.
    Journals: Machine Learning, Artificial Intelligence, etc.
  Statistics
    Conferences: Joint Stat. Meeting, etc.
    Journals: Annals of Statistics, etc.
  Database systems
    Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, etc.
    Journals: ACM-TODS, IEEE-TKDE, etc.
  Visualization
    Conference proceedings: IEEE Visualization, ACM-SIGGraph, etc.
    Journals: IEEE Trans. Visualization and Computer Graphics, etc.