Advanced Algorithms for Biological Data Analysis

Center for Bioinformation Technology (CBIT) & Biointelligence Laboratory
School of Computer Science and Engineering
Seoul National University
http://bi.snu.ac.kr/
http://cbit.snu.ac.kr/

Lecture Schedule
- Day 1: Introduction to Machine Learning
- Day 2: Neural Networks
- Day 3: Hidden Markov Models
- Day 4: Principal Component Analysis
- Day 5: Clustering Analysis

Introduction to Machine Learning Algorithms in Bioinformatics

Byoung-Tak Zhang
E-mail: btzhang@cse.snu.ac.kr

Outline
- Part I
  - Concept of Machine Learning (ML)
  - Machine Learning Algorithms and Applications
  - Applications in Bioinformatics
- Part II
  - Version Space Learning
  - Decision Tree Learning

What is Artificial Intelligence (AI)?
- Design and study of computer programs that behave intelligently.
- Designing computer programs to make computers smarter.
- Study of how to make computers do things at which, at the moment, people are better.
- (There is no satisfactory definition of AI.)

Research Areas and Approaches (Artificial Intelligence)
- Research: Learning Algorithms, Inference Mechanisms, Knowledge Representation, Intelligent System Architecture
- Application: Intelligent Agents, Information Retrieval, Electronic Commerce, Data Mining, Bioinformatics, Natural Language Processing, Expert Systems
- Paradigm: Rationalism (Logical), Empiricism (Statistical), Connectionism (Neural), Evolutionary (Genetic), Biological (Molecular)

Concept of Machine Learning

Context
- Machine Learning in the context of: Computer Science (AI), Cognitive Science, Statistics, Information Theory

Why Machine Learning?
- Recent progress in algorithms and theory
- Growing flood of online data
- Computational power is available
- Budding industry
- Three niches for machine learning
  - Data mining: using historical data to improve decisions
    - Medical records --> medical knowledge
  - Software applications we can't program by hand
    - Autonomous driving
    - Speech recognition
  - Self-customizing programs
    - Newsreader that learns user interests

Brief History of Machine Learning
- 1950s: Samuel's checkers player
- 1960s: Neural networks, perceptron; pattern recognition; learning-in-the-limit theory; Minsky & Papert
- 1970s: Symbolic concept induction; Winston's arch learner; knowledge acquisition bottleneck; Quinlan's ID3; Michalski's AQ and soybean diagnosis results; scientific discovery with BACON; mathematical discovery with AM
- 1980s: Continued progress on decision-tree and rule learning; explanation-based learning; speedup learning; utility problem; analogy; resurgence of connectionism (PDP, ANN); Valiant's PAC learning; experimental evaluation
- 1990s: Data mining; adaptive software agents & IR; reinforcement learning; theory refinement; inductive logic programming; voting, bagging, boosting, and stacking; learning Bayesian networks

Learning: Definition
- Learning is the improvement of performance in some environment through the acquisition of knowledge resulting from experience in that environment.
- Alternatively: the improvement of behavior on some performance task through the acquisition of knowledge based on partial task experience.

A Learning Problem: EnjoySport

Sky    Temp  Humid   Wind    Water  Forecast  EnjoySport
Sunny  Warm  Normal  Strong  Warm   Same      Yes
Sunny  Warm  High    Strong  Warm   Same      Yes
Rainy  Cold  High    Strong  Warm   Change    No
Sunny  Warm  High    Strong  Cool   Change    Yes

What is the general concept?
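
One way to approach the question "What is the general concept?" is the Find-S procedure from the concept-learning literature. The slide itself does not name a method, so the following is only a minimal sketch under that assumption. It reads the slide's four training rows with the displaced cells put back in their columns, and the names `TRAINING_DATA`, `find_s`, and `matches` are illustrative, not from the lecture.

```python
from typing import List, Optional, Sequence, Tuple

# Attribute order: Sky, Temp, Humid, Wind, Water, Forecast.
# The boolean is the EnjoySport label (True = Yes).
Example = Tuple[Tuple[str, ...], bool]

TRAINING_DATA: List[Example] = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]


def find_s(examples: Sequence[Example]) -> Optional[List[str]]:
    """Return the most specific conjunctive hypothesis covering all positives.

    "?" means "any value is acceptable" for that attribute.
    Returns None when there are no positive examples.
    """
    hypothesis: Optional[List[str]] = None
    for attributes, label in examples:
        if not label:
            continue  # Find-S ignores negative examples (assumed simplification)
        if hypothesis is None:
            hypothesis = list(attributes)  # start from the first positive example
            continue
        for i, value in enumerate(attributes):
            if hypothesis[i] != value:
                hypothesis[i] = "?"  # generalize only where the examples disagree
    return hypothesis


def matches(hypothesis: Sequence[str], attributes: Sequence[str]) -> bool:
    """Check whether a hypothesis classifies the attribute vector as positive."""
    return all(h == "?" or h == a for h, a in zip(hypothesis, attributes))


concept = find_s(TRAINING_DATA)
print(concept)
```

Under these assumptions the learned hypothesis keeps only the attributes shared by every positive example, and `matches(concept, row)` can then be used to classify unseen days.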

Possible Uses of Machine Learning
- Configuration and design
- Diagnostic reasoning
- Planning and scheduling
- Data mining and knowledge discovery
- Language understanding
- Execution and control
- Vision and speech

Metaphors and Methods
- Neurobiology -> Connectionist Learning
- Biological Evolution -> Genetic Learning
- Heuristic Search -> Tree / Rule Induction
- Memory and Retrieval -> Case-Based Learning
- Statistical Inference -> Probabilistic Induction

Learning: Components
- Components of a learning system
  - Performance: accuracy, efficiency, understandability
  - Environment: setting external to the learner
  - Knowledge: internal data structure
  - Experience: perception, action, mental traces
  - Improvement: desirable change in performance

Learning System
[Figure: block diagram linking Environment, Performance, Knowledge, and Learning; arrow labels: problem, solution, get data, improve behavior, get knowledge, acquired knowledge.]

What is the Learning Problem?
- Learning = improving with experience at some task
  - Improve over task T,
  - with respect to performance measure P,
  - based on experience E.
- E.g., learn to play checkers
  - T: play checkers
  - P: % of games won in world tournament
  - E: opportunity to play against self

Machine Learning: Tasks
- Supervised Learning
  - Estimate an unknown mapping from known input-output pairs
  - Learn $f_w$ from training set $D = \{(x, y)\}$ s.t. $f_w(x) = y = f(x)$
  - Classification: $y$ is discrete
  - Regression: $y$ is continuous
- Unsupervised Learning
  - Only input values are provided
  - Learn $f_w$ from $D = \{(x)\}$ s.t. $f_w(x) = x$
  - Compression
  - Clustering
- Reinforcement Learning

Machine Learning: Strategies
- Rote learning
- Concept learning
- Learning from examples
- Learning by instruction
- Inductive learning
- Deductive learning
- Explanation-based learning (EBL)
- Learning by analogy
- Learning by observation

Supervised Learning
- Given a sequence of input/output pairs of the form $\langle x_i, y_i \rangle$, where $x_i$ is a possible input and $y_i$ is the output associated with $x_i$.
- Learn a function $f$ that accounts for the examples seen so far, $f(x_i) = y_i$ for all $i$, and that makes a good guess for the outputs of inputs it has not seen.

Examples of Input-Output Pairs
- Recognition
  - Inputs: descriptions of objects
  - Outputs: classes that the objects belong to
- Action
  - Inputs: descriptions of situations
  - Outputs: actions or predictions
- Janitor robot problem
  - Inputs: descriptions of offices (floor, prof's office)
  - Outputs: Yes or No (indicating whether or not the office contains a recycling bin)

Classification and Concept Learning
- Classification: if the function is discrete-valued, the outputs are called classes.
- Concept learning: the learned function has only two possible outputs.

Unsupervised Learning
- Clustering: a clustering algorithm partitions the inputs into a fixed number of subsets or clusters so that inputs in the same cluster are close to one another.
- Discovery learning: the objective is to uncover new relations in the data.
- Reinforcement learning: uses a feedback signal (not the target output) that gives the learning program an indication of whether or not what it has learned is correct.

Online and Batch Learning
- Batch methods: process large sets of examples all at once.
- Online (incremental) methods: process examples one at a time.
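
The difference between batch and online processing can be illustrated with about the simplest learner there is: estimating the mean of a set of observations. This is a minimal sketch, not something from the lecture. The names `batch_mean` and `OnlineMean` and the sample values are hypothetical. The batch version needs every example up front, while the online version updates its estimate after each example and never stores the data.

```python
from typing import List, Sequence


def batch_mean(values: Sequence[float]) -> float:
    """Batch method: look at the whole data set at once."""
    return sum(values) / len(values)


class OnlineMean:
    """Online (incremental) method: update the estimate one example at a time."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0

    def update(self, value: float) -> float:
        # Move the current estimate toward the new value by 1/count of the gap.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


observations: List[float] = [2.0, 4.0, 6.0, 8.0]  # illustrative data only

learner = OnlineMean()
trajectory = [learner.update(x) for x in observations]

print(batch_mean(observations))  # estimate from all examples at once
print(trajectory)                # estimate after each successive example
```

For this data the two approaches agree once every example has been seen. The online learner additionally gives a usable estimate at every intermediate step, which is the property that matters when examples arrive as a stream.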

Machine Learning Algorithms and Applications

Machine Learning Algorithms (1/2)
- Symbolic Learning (covered on Day 1)
  - Version Space Learning
  - Case-Based Learning
- Neural Learning (covered on Day 2)
  - Multilayer Perceptrons (MLPs)
  - Self-Organizing Maps (SOMs)
  - Support Vector Machines (SVMs)
- Evolutionary Learning (very briefly explained on Day 1)
  - Evolution Strategies
  - Evolutionary Programming
  - Genetic Algorithms
  - Genetic Programming

Machine Learning Algorithms (2/2)
- Probabilistic Learning (covered on Days 3 and 5)
  - Bayesian Networks (BNs)
  - Helmholtz Machines (HMs)
  - Latent Variable Models (LVMs)
  - Generative Topographic Mapping (GTM)
- Other Machine Learning Methods (partially covered on Days 1 and 4)
  - Decision Trees (DTs)
  - Reinforcement Learning (RL)
  - Boosting Algorithms
  - Mixture of Experts (ME)
  - Independent Component Analysis (ICA)

Example Applications of ML (1/2)
- Banking & Investment
  - Credit card fraud
  - Delinquent accounts
  - Authorization of purchases
  - Predicting the stock market
- Health Care
  - Disease diagnosis
  - Managing resources
  - Looking for causal relationships between environment and disease
- Marketing
  - Credit card applications
  - Using past buying habits to predict the likelihood of a customer purchasing some new product
- Textual Data Mining

Example Applications of ML (2/2)
- Astronomy
- Bioinformatics
- Chemistry
- Human resources: evaluating job performance
- Insurance & Finance
- Manufacturing: process control
- Signal and image processing
- Speech recognition
- ...

Neural Nets for Handwritten Digit Recognition
[Figure: pre-processed digit images feed the input units, which connect through hidden units to output units for the digits 0 to 9; shown for both training and test.]

ALVINN System: Neural Network Learning to Steer an Autonomous Vehicle

Learning to Navigate a Vehicle by Observing a Human Expert (1/2)
- Inputs: the images produced by a camera mounted on the vehicle
- Outputs: the actions taken by the human driver to steer the vehicle or adjust its speed
- Result of learning: a function mapping images to control actions

Learning to Navigate a Vehicle by Observing a Human Expert (2/2)

Data Recorrection by a Hopfield Network
[Figure panels: corrupted input data; original target data; recorrected data after 10 iterations; recorrected data after 20 iterations; fully recorrected data after 35 iterations.]

Predicting the Sunspot Number with Neural Networks

ANN for Face Recognition
- A 960 x 3 x 4 network is trained on gray-level images of faces to predict whether a person is looking to their left, right, ahead, or up.

Data Mining
- Database/data warehouse
- -> Selection & Sampling -> Target data
- -> Preprocessing & Cleaning -> Cleaned data
- -> Transformation & Reduction -> Transformed data
- -> Data Mining -> Patterns/model
- -> Interpretation/Evaluation -> Knowledge
- -> Performance system

Customer Relationship Management (CRM)
- Increased Customer Lifetime Value
- Increased Wallet Share
- Improved Customer Retention
- Segmentation of Customers by Profitability
- Segmentation of Customers by Risk of Default
- Integrating Data Mining into the Full Marketing Process

Hot Water Flashing Nozzle with Evolutionary Algorithms
- Hans-Paul Schwefel performed the original experiments.
[Figure: nozzle profile at the start of the experiment; labels: hot water entering; at throat: Mach 1 and onset of flashing; steam and droplets at exit.]

Case-Based Reasoning (Aamodt & Plaza, 1994)
[Figure: the case-based reasoning cycle. A new problem (input) passes through 1. Retrieve, 2. Reuse, 3. Revise, and 4. Retain; other labels: Case Base, General Knowledge, Retrieved Cases, Retrieved Solution, Learned Case, Output.]

Machine Learning Applications in Bioinformatics

Bioinformatics
- What is bioinformatics?
- Bioinformatics is a new term referring to the discipline that employs computers to store, retrieve, analyze, and assist in understanding biological information.
- The application of information technology and computer science to the study of biological systems.
- The analysis of the massive (and constantly increasing) amount of genetic information.
- Sophisticated computer technologies to enable discovery in all fields of the life sciences.

Problems in Bioinformatics
- Sequence analysis
  - Sequence alignment
  - Structure and function prediction
  - Gene finding
- Structure analysis
  - Protein structure comparison
  - Protein structure prediction
  - RNA structure modeling
- Expression analysis
  - Gene expression analysis
  - Gene clustering
- Pathway analysis
  - Metabolic pathway
  - Regulatory networks

Applications of Bioinformatics
- Drug design
- Identification of genetic risk factors
- Gene therapy
- Genetic modification of food crops and animals
- Forensics
- Biological warfare
- Personalized medicine
- E-Doctor

Machine Learning and Bioinformatics
[Figure: machine learning extracts knowledge from a Bio DB; surrounding labels: Drug Development, Medical therapy research, Pharmacology, Ecology.]

Machine Learning Techniques for Bio Data Mining
- Sequence Alignment
  - Simulated Annealing
  - Genetic Algorithms
- Structure and Function Prediction
  - Hidden Markov Models
  - Multilayer Perceptrons
  - Decision Trees
- Molecular Clustering and Classification
  - Support Vector Machines
  - Nearest Neighbor Algorithms
- Expression (DNA Chip Data) Analysis
  - Self-Organizing Maps
  - Bayesian Networks

Structure and Function Prediction
- Protein structure prediction
- Protein modeling
- Gene finding and gene prediction

Effect and Applications of Biological Data Mining
- Biological Data Mining: store, retrieve, analyze, and assist in understanding biological information
- Application areas: Biocomputing; Increase and Improvement of Farm Products; Renewable Energy; Diagnosis with Chip; SNP (Single Nucleotide Polymorphism); Customized Drug

Hidden Markov Models for Protein Modeling
- Alphabet of 20 letters (the 20 amino acids)
- $m_0$: start state; $m_5$: end state; $m_k$: match states
- $i_k$: insertion states; $d_k$: deletion states
- $T(s_2 \mid s_1)$: transition probabilities
- $P(x \mid m_k)$: letter-generating probabilities, where $x$ is a letter (amino acid)

A Simple Example of Hidden Markov Models
[Figure: state diagram from start state S to end state E with transition probabilities of 0.25 and 0.5, the example sequence ATCCTTTTTTTCA, and letter probabilities 0.1, 0.1, 0.1, 0.7.]

Clustering of Related Gene Expressions

Non-negative Matrix Factorization: Clustering Gene Expression Data
[Figure: the gene expression matrix G (38 samples, 7,129 genes g1 ... g7,129) is factorized into the product of W and H with 2 factors (H1, H2); the encoding gives each sample a value per factor.]
- Factors can capture the correlations between the genes using the values of expression level.
- Cluster training samples into 2 groups by NMF.
- Assign each sample to the factor (class) which has the higher encoding value.
- Accuracy: 0 to 1 errors on the training data set.

Bayesian Networks for Gene Expression Analysis
- Learning: Data -> Preprocessing -> Processed data -> Learning algorithm -> network over Gene A, Gene B, Gene C, Gene D, and Target
- Inference
  - The values of Gene C and Gene B are given.
  - Belief propagation
  - Probability for the target is computed.

Multilayer Perceptrons for Gene Finding and Prediction
[Figure: a multilayer perceptron whose inputs include coding potential value, GC composition, length, donor, acceptor, and intron vocabulary, computed from the bases of a sequence; the output is a discrete exon score between 0 and 1.]

Self-Organizing Maps for DNA Microarray Data Analysis
[Figure: input connected by a bundle of synaptic connections to a two-dimensional array of postsynaptic neurons; the winning neurons are marked.]

Biological Information Extraction
[Figure: information extraction pipeline from text data to a database; labels: Data Analysis & Field Identification, Data Classification & Field Extraction, Field Property Identification & Learning, Database Template, DB Record Filling, Location, Date, Information Extraction, DB.]

Biomolecular Computing
[Figure: a binary string, 011001101010001, shown alongside a DNA sequence, ATGCTCGAAGCT.]

More information on biological data mining and related research can be found at
http://cbit.snu.ac.kr/
http://bi.snu.ac.kr/