Advanced Algorithms for Biological
Data Analysis
Center for Bioinformation Technology (CBIT) &
Biointelligence Laboratory
School of Computer Science and Engineering
Seoul National University
http://bi.snu.ac.kr/
http://cbit.snu.ac.kr/
Lecture Schedule
Day 1: Introduction to Machine Learning
Day 2: Neural Networks
Day 3: Hidden Markov Models
Day 4: Principal Component Analysis
Day 5: Clustering Analysis
Introduction to Machine Learning
Algorithms in Bioinformatics
Byoung-Tak Zhang
Center for Bioinformation Technology (CBIT) &
Biointelligence Laboratory
School of Computer Science and Engineering
Seoul National University
E-mail: btzhang@cse.snu.ac.kr
http://bi.snu.ac.kr./
http://cbit.snu.ac.kr/
Outline
Part I
Concept of Machine Learning (ML)
Machine Learning Algorithms and Applications
Applications in Bioinformatics
Part II
Version Space Learning
Decision Tree Learning
What is Artificial Intelligence (AI)?
Design and study of computer programs that
behave intelligently.
Designing computer programs to make computers
smarter.
Study of how to make computers do things at
which, at the moment, people are better.
(No satisfactory definition of AI)
Research Areas and Approaches
Artificial Intelligence
Research
Learning Algorithms
Inference Mechanisms
Knowledge Representation
Intelligent System Architecture
Application
Intelligent Agents
Information Retrieval
Electronic Commerce
Data Mining
Bioinformatics
Natural Language Processing
Expert Systems
Paradigm
Rationalism (Logical)
Empiricism (Statistical)
Connectionism (Neural)
Evolutionary (Genetic)
Biological (Molecular)
Concept of Machine Learning
Context
[Diagram: Machine Learning shown in the context of Computer Science (AI), Cognitive Science, Statistics, and Information Theory]
Why Machine Learning?
Recent progress in algorithms and theory
Growing flood of online data
Computational power is available
Budding industry
Three niches for machine learning
Data mining: using historical data to improve decisions
Medical records --> medical knowledge
Software applications we can’t program by hand
Autonomous driving
Speech recognition
Self-customizing programs
Newsreader that learns user interests
Brief History of Machine Learning
1950s: Samuel’s checkers player
1960s: Neural networks, perceptron; pattern recognition; learning in
the limit theory; Minsky & Papert.
1970s: Symbolic concept induction; Winston’s arch learner;
knowledge acquisition bottleneck; Quinlan’s ID3; Michalski’s AQ and
soybean diagnosis results; scientific discovery with BACON;
mathematical discovery with AM.
1980s: Continued progress on decision-tree and rule learning;
explanation-based learning; speedup learning; utility problem; analogy;
resurgence of connectionism (PDP, ANN); Valiant’s PAC learning;
experimental evaluation.
1990s: Data mining; adaptive software agents & IR; reinforcement
learning; theory refinement; inductive logic programming; voting,
bagging, boosting, and stacking; learning Bayesian networks.
Learning: Definition
Definition
Learning is the improvement of performance in some
environment through the acquisition of knowledge
resulting from experience in that environment.
the improvement of behavior
through acquisition of knowledge
on some performance task
based on partial task experience
A Learning Problem: EnjoySport
Sky    Temp  Humid   Wind    Water  Forecast  EnjoySport
Sunny  Warm  Normal  Strong  Warm   Same      Yes
Sunny  Warm  High    Strong  Warm   Same      Yes
Rainy  Cold  High    Strong  Warm   Change    No
Sunny  Warm  High    Strong  Cool   Change    Yes
What is the general concept?
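One way to answer the question is to look for the most specific description that covers every positive example. The sketch below does this with the Find-S procedure, which is one simple option and not necessarily the method intended by the slide. It assumes the four training rows read as (Sunny, Warm, Normal, Strong, Warm, Same, Yes), (Sunny, Warm, High, Strong, Warm, Same, Yes), (Rainy, Cold, High, Strong, Warm, Change, No), and (Sunny, Warm, High, Strong, Cool, Change, Yes); the names `find_s` and `matches` are illustrative.

```python
ATTRIBUTES = ["Sky", "Temp", "Humid", "Wind", "Water", "Forecast"]

# Training rows as assumed above: (attribute values, EnjoySport label).
EXAMPLES = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]


def find_s(examples):
    """Return the most specific hypothesis covering every positive example.

    A hypothesis is a tuple with one entry per attribute: a required value,
    or "?" meaning that any value is acceptable.
    """
    hypothesis = None  # None stands for "no positive example seen yet"
    for values, label in examples:
        if not label:
            continue  # Find-S ignores negative examples
        if hypothesis is None:
            hypothesis = tuple(values)
        else:
            # Generalize only the attributes that disagree with the new positive.
            hypothesis = tuple(
                h if h == v else "?" for h, v in zip(hypothesis, values)
            )
    return hypothesis


def matches(hypothesis, values):
    """True if the hypothesis classifies the instance as positive."""
    return all(h == "?" or h == v for h, v in zip(hypothesis, values))


concept = find_s(EXAMPLES)
print(dict(zip(ATTRIBUTES, concept)))
```

Under these assumptions the learned concept requires Sky = Sunny, Temp = Warm, and Wind = Strong, and leaves the other three attributes unconstrained; it also rejects the single negative example.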
Possible Uses of Machine Learning
configuration and design
diagnostic reasoning
planning and scheduling
data mining and knowledge discovery
language understanding
execution and control
vision and speech
Metaphors and Methods
Neurobiology --> Connectionist Learning
Biological Evolution --> Genetic Learning
Heuristic Search --> Tree / Rule Induction
Memory and Retrieval --> Case-Based Learning
Statistical Inference --> Probabilistic Induction
Learning: Components
Components of a learning system
Performance: accuracy, efficiency, understandability
Environment: external setting to the learner
Knowledge: internal data structure
Experience: perception, action, mental traces
Improvement: desirable change in performance
Learning System
[Diagram: four components (Environment, Performance, Learning, Knowledge) connected by arrows labeled problem, solution, get data, improve behavior, get knowledge, and acquired knowledge]
What is the Learning Problem?
Learning = improving with experience at some
task
Improve over task T,
With respect to performance measure P,
Based on experience E.
E.g., Learn to play checkers
T: Play checkers
P: % of games won in world tournament
E: opportunity to play against self
Machine Learning: Tasks
Supervised Learning
Estimate an unknown mapping from known input-output pairs
Learn $f_w$ from training set $D = \{(x, y)\}$ s.t. $f_w(x) \approx y = f(x)$
Classification: y is discrete
Regression: y is continuous
Unsupervised Learning
Only input values are provided
Learn $f_w$ from $D = \{(x)\}$ s.t. $f_w(x) \approx x$
Compression
Clustering
Reinforcement Learning
Machine Learning: Strategies
Rote learning
Concept learning
Learning from examples
Learning by instruction
Inductive learning
Deductive learning
Explanation-based learning (EBL)
Learning by analogy
Learning by observation
Supervised Learning
Given a sequence of input/output pairs of
the form <xi, yi>, where xi is a possible
input and yi is the output associated with xi.
Learn a function f that accounts for the
examples seen so far, f(xi) = yi for all i, and
that makes a good guess for the outputs of
the inputs that it has not seen.
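A minimal sketch of such a learner, assuming numeric feature tuples as inputs and class labels as outputs, is the 1-nearest-neighbor rule: it stores the pairs, reproduces f(xi) = yi on every stored input, and guesses the output of an unseen input from its closest stored neighbor. The function name `fit_nearest_neighbor` and the training pairs are hypothetical.

```python
def fit_nearest_neighbor(pairs):
    """Return a function f with f(x_i) = y_i on the given training pairs."""
    stored = list(pairs)

    def dist2(a, b):
        # Squared Euclidean distance between two equal-length feature tuples.
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    def f(x):
        # Answer with the output of the closest stored input.
        _, nearest_y = min(stored, key=lambda pair: dist2(pair[0], x))
        return nearest_y

    return f


# Hypothetical training pairs <x_i, y_i>: two numeric features and a class label.
training = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 4.5), "B"),
]
f = fit_nearest_neighbor(training)

# f accounts for every example seen so far ...
print(all(f(x) == y for x, y in training))
# ... and makes a guess for inputs it has not seen.
print(f((1.2, 1.4)), f((5.5, 5.0)))
```

Memorizing the examples is only one way to satisfy the definition; how good the guesses are on unseen inputs depends on whether nearby inputs really share outputs.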
Examples of Input-Output Pairs
Task                   Inputs                                          Outputs
Recognition            Descriptions of objects                         Classes that the objects belong to
Action                 Descriptions of situations                      Actions or predictions
Janitor robot problem  Descriptions of offices (floor, prof’s office)  Yes or No (indicating whether or not the office contains a recycling bin)
Classification and Concept
Learning
Classification
If the function is discrete valued, then the
outputs are called classes
Concept learning
Learned function has only two possible outputs
Unsupervised Learning
Clustering
A clustering algorithm partitions the inputs into a fixed
number of subsets or clusters so that inputs in the same
cluster are close to one another.
Discovery learning
The objective is to uncover new relations in the data.
Reinforcement learning
Uses a feedback signal (not the target output) that gives
the learning program an indication of whether or not
what it has learned is correct.
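The clustering description on this slide can be sketched with k-means, one common partitioning algorithm (the slide does not name a specific one). The sketch assumes numeric input tuples, squared Euclidean distance as the notion of "close", and initial centers taken from the first k inputs; the data and the function name `kmeans` are hypothetical.

```python
def kmeans(points, k, iterations=20):
    """Partition points (tuples of floats) into k clusters.

    Returns (centers, labels), where labels[i] is the cluster of points[i].
    """
    centers = list(points[:k])  # simple deterministic start: the first k inputs
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each input joins its closest center.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels


# Hypothetical 2-D inputs that form two visibly separate groups.
data = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
centers, labels = kmeans(data, k=2)
print(labels)
```

The number of clusters k is fixed in advance, as in the description; the result of k-means generally depends on the initial centers.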
Online and Batch Learning
Batch methods
Process large sets of examples all at once.
Online (incremental) methods
Process examples one at a time.
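The difference can be illustrated with a one-parameter model y = w x fitted by gradient descent on squared error. This is a sketch under assumed settings (learning rate, number of passes, and toy data are all hypothetical): the batch version computes each update from all examples at once, the online version updates after every single example.

```python
# Hypothetical training examples (x, y) generated by y = 2x.
EXAMPLES = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]


def batch_fit(examples, rate=0.01, epochs=200):
    """Batch method: every update uses the gradient summed over all examples."""
    w = 0.0
    for _ in range(epochs):
        gradient = sum((w * x - y) * x for x, y in examples)
        w -= rate * gradient
    return w


def online_fit(examples, rate=0.01, epochs=200):
    """Online method: the weight is updated after each single example."""
    w = 0.0
    for _ in range(epochs):
        for x, y in examples:
            w -= rate * (w * x - y) * x
    return w


print(batch_fit(EXAMPLES), online_fit(EXAMPLES))
```

On this noise-free toy problem both variants approach w = 2; the online version can also be fed examples as they arrive, without storing the whole set.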
Machine Learning Algorithms and
Applications
Machine Learning Algorithms (1/2)
Symbolic Learning (covered on Day 1)
Version Space Learning
Case-Based Learning
Neural Learning (covered on Day 2)
Multilayer Perceptrons (MLPs)
Self-Organizing Maps (SOMs)
Support Vector Machines (SVMs)
Evolutionary Learning (very briefly explained on Day 1)
Evolution Strategies
Evolutionary Programming
Genetic Algorithms
Genetic Programming
Machine Learning Algorithms (2/2)
Probabilistic Learning (covered on Days 3 and 5)
Bayesian Networks (BNs)
Helmholtz Machines (HMs)
Latent Variable Models (LVMs)
Generative Topographic Mapping (GTM)
Other Machine Learning Methods (partially covered on
Days 1 and 4)
Decision Trees (DTs)
Reinforcement Learning (RL)
Boosting Algorithms
Mixture of Experts (ME)
Independent Component Analysis (ICA)
Example Applications of ML (1/2)
Banking & Investment
Credit card fraud
Delinquent accounts
Authorization of purchases
Predict stock market
Health Care
Disease diagnosis
Managing resources
Look for causal relationships between environment and disease
Marketing
Credit card applications
Use past buying habits to predict likelihood of customer
purchasing some new product
Textual Data Mining
Example Applications of ML (2/2)
Astronomy
Bioinformatics
Chemistry
Human resources: evaluating job performance
Insurance & Finance
Manufacturing: process control
Signal and image processing
Speech recognition
…
Neural Nets for Handwritten Digit
Recognition
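[Figure: digit images pass through pre-processing into a network with input units, hidden units, and output units labeled 0, 1, 2, 3, …, 9; one panel shows training, the other shows a test on an unknown digit (?)]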
ALVINN System: Neural Network Learning to Steer
an Autonomous Vehicle
Learning to Navigate a Vehicle by
Observing a Human Expert (1/2)
Inputs
The images produced by a camera mounted on
the vehicle
Outputs
The actions taken by the human driver to steer
the vehicle or adjust its speed.
Result of learning
A function mapping images to control actions
Learning to Navigate a Vehicle by
Observing a Human Expert (2/2)
Data Reconstruction by a Hopfield Network
[Figure panels: corrupted input data; original target data; reconstructed data after 10 iterations; reconstructed data after 20 iterations; fully reconstructed data after 35 iterations]
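How a Hopfield network restores a corrupted pattern can be sketched in a few lines. The example below is a toy illustration, not the experiment shown in the figure: it assumes units with values +1/-1, Hebbian weights without self-connections, asynchronous updates, and a single hypothetical 12-unit stored pattern, so it converges in far fewer iterations than the figure reports.

```python
def train_hopfield(patterns):
    """Hebbian weights for patterns of +1/-1 values; no self-connections."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += p[i] * p[j] / len(patterns)
    return weights


def recall(weights, state, sweeps=5):
    """Asynchronously set each unit to the sign of its weighted input."""
    state = list(state)
    for _ in range(sweeps):
        for i in range(len(state)):
            total = sum(w * s for w, s in zip(weights[i], state))
            if total != 0:  # leave the unit unchanged on a tie
                state[i] = 1 if total > 0 else -1
    return state


# Hypothetical 12-unit target pattern and a copy with three flipped units.
target = [1, 1, 1, -1, -1, -1, 1, -1, 1, -1, 1, 1]
corrupted = list(target)
for index in (0, 5, 9):
    corrupted[index] = -corrupted[index]

restored = recall(train_hopfield([target]), corrupted)
print(restored == target)
```

With several stored patterns the same code applies, but recall is only reliable while the number of patterns stays small relative to the number of units.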
Predicting the Sunspot Number with
Neural Networks
ANN for Face Recognition
A 960 x 3 x 4 network is trained on gray-level images of faces to predict
whether a person is looking to their left, right, ahead, or up.
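The shape of such a network can be made concrete with a forward pass. The sketch below only illustrates the 960 x 3 x 4 layout; everything else is an assumption: sigmoid units with biases, small random untrained weights, a random vector standing in for the 960 gray-level inputs, and the order of the four output labels.

```python
import math
import random


def make_layer(n_inputs, n_units, rng):
    """One fully connected layer: a weight list plus a bias for each unit."""
    return [
        ([rng.uniform(-0.05, 0.05) for _ in range(n_inputs)], 0.0)
        for _ in range(n_units)
    ]


def forward(layer, inputs):
    """Sigmoid activations of a layer for one input vector."""
    return [
        1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(weights, inputs)) + bias)))
        for weights, bias in layer
    ]


rng = random.Random(0)
hidden = make_layer(960, 3, rng)  # 960 inputs -> 3 hidden units
output = make_layer(3, 4, rng)    # 3 hidden units -> 4 output units

DIRECTIONS = ["left", "right", "ahead", "up"]  # assumed output order
image = [rng.random() for _ in range(960)]     # stand-in for scaled gray levels
scores = forward(output, forward(hidden, image))
prediction = DIRECTIONS[scores.index(max(scores))]

# Each unit has one weight per input plus one bias.
n_parameters = sum(len(weights) + 1 for weights, _ in hidden + output)
print(prediction, n_parameters)
```

Because the weights are untrained, the printed direction is arbitrary; training would adjust the 2,899 parameters so that the largest output matches the labeled head pose.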
Data Mining
Database/data warehouse
--> Selection & Sampling --> Target data
--> Preprocessing & Cleaning --> Cleaned data
--> Transformation & Reduction --> Transformed data
--> Data Mining --> Patterns/model
--> Interpretation/Evaluation --> Knowledge
--> Performance system
Customer Relationship Management
(CRM)
Increased Customer Lifetime Value
Increased Wallet Share
Improved Customer Retention
Segmentation of Customers by Profitability
Segmentation of Customers by Risk of Default
Integrating Data Mining into the Full Marketing Process
Hot Water Flashing Nozzle with
Evolutionary Algorithms
Hans-Paul Schwefel performed the original experiments
[Figure labels: Start; hot water entering; steam and droplets at exit; at throat: Mach 1 and onset of flashing]
Case-Based Reasoning
(Aamodt & Plaza, 1994)
[Diagram: the case-based reasoning cycle. Input: New Problem. Steps: 1. Retrieve (Retrieved Cases), 2. Reuse (Retrieved Solution), 3. Revise (Retrieved Solution), 4. Retain (Learned Case). Other labels: Case Base, General Knowledge, Output]
Machine Learning Applications in
Bioinformatics
Bioinformatics
What is Bioinformatics?
Bioinformatics is a new term referring to the discipline
that employs computers to store, retrieve, analyze and
assist in understanding biological information.
The application of information technology and computer
science to the study of biological systems.
The analysis of the massive (and constantly increasing)
amount of genetic information
Sophisticated computer technologies to enable discovery in
all fields of life sciences.
Problems in Bioinformatics
Sequence analysis
Sequence alignment
Structure and function prediction
Gene finding
Structure analysis
Protein structure comparison
Protein structure prediction
RNA structure modeling
Expression analysis
Gene expression analysis
Gene clustering
Pathway analysis
Metabolic pathway
Regulatory networks
Applications of Bioinformatics
Drug design
Identification of genetic risk factors
Gene therapy
Genetic modification of food crops and animals
Forensics
Biological warfare
Personalized Medicine
E-Doctor
Machine Learning and
Bioinformatics
[Diagram labels: Bio DB; Machine learning; knowledge; Drug Development; Medical therapy research; Pharmacology; Ecology]
Machine Learning Techniques for Bio
Data Mining
Sequence Alignment
Simulated Annealing
Genetic Algorithms
Structure and Function Prediction
Hidden Markov Models
Multilayer Perceptrons
Decision Trees
Molecular Clustering and Classification
Support Vector Machines
Nearest Neighbor Algorithms
Expression (DNA Chip Data) Analysis
Self-Organizing Maps
Bayesian Networks
Structure and Function Prediction
Protein structure prediction
Protein modeling
Gene finding and gene prediction
Effect and Applications of Biological
Data Mining
Biological Data Mining: store, retrieve, analyze and assist in understanding biological information
Biocomputing
Increase and Improvement of Farm Products
Renewable Energy
Diagnosis with Chip
SNP (Single Nucleotide Polymorphism)
Customized Drug
Hidden Markov Models
for Protein Modeling
Alphabet of 20 letters (the 20 amino acids)
m0: start state, m5: end state, mk: match states
ik: insertion states, dk: deletion states
T(s2|s1): transition probabilities
P(x|mk): emission probabilities (x: a letter, i.e. an amino acid)
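Given transition probabilities T(s2|s1) and emission probabilities P(x|s), the probability that a model generates a sequence can be computed with the forward algorithm. The sketch below is a simplified illustration, not the profile HMM of the slide: it assumes every state emits a letter (so there are no silent deletion states), has no explicit end state, and uses a hypothetical two-state model over a two-letter alphabet instead of 20 amino acids.

```python
def forward_probability(sequence, start, transition, emission):
    """Probability that the model generates `sequence` (forward algorithm).

    start[s]           : probability of beginning in state s
    transition[s1][s2] : T(s2|s1)
    emission[s][x]     : P(x|s)
    """
    states = list(start)
    # alpha[s] = probability of the letters read so far, ending in state s.
    alpha = {s: start[s] * emission[s][sequence[0]] for s in states}
    for symbol in sequence[1:]:
        alpha = {
            s2: emission[s2][symbol]
            * sum(alpha[s1] * transition[s1][s2] for s1 in states)
            for s2 in states
        }
    return sum(alpha.values())


# Hypothetical model: one match-like state and one insertion-like state.
start = {"m1": 1.0, "i1": 0.0}
transition = {"m1": {"m1": 0.5, "i1": 0.5}, "i1": {"m1": 0.5, "i1": 0.5}}
emission = {"m1": {"A": 0.9, "G": 0.1}, "i1": {"A": 0.5, "G": 0.5}}

print(forward_probability("AG", start, transition, emission))
```

Summing over all paths this way costs time proportional to the sequence length times the square of the number of states, instead of enumerating every state path.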
A Simple Example of Hidden Markov
Models
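[Figure: a small HMM with a start state S and an end state E; the arcs carry probabilities of 0.25 and 0.5, and one state shows the distribution 0.1, 0.1, 0.1, 0.7. Example sequence: ATCCTTTTTTTCA]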
Clustering of Related Gene
Expressions
Non-negative Matrix Factorization
Clustering Gene Expression Data
$G \approx W H$, where G is the expression matrix (7,129 genes x 38 samples), W is 7,129 genes x 2 factors, and H (the encoding) is 2 factors x 38 samples.
Factors can capture the correlations between genes from their expression
levels.
Cluster training samples into 2 groups by NMF
Assign each sample to the factor (class) with the higher encoding value.
Accuracy: 0 to 1 errors on the training data set
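The procedure above (factor the expression matrix, then assign each sample to the factor with the higher encoding value) can be sketched as follows. The slide does not say which NMF algorithm was used; this sketch assumes the standard multiplicative update rules for squared error, a tiny hypothetical 4 genes x 6 samples matrix instead of 7,129 x 38, and illustrative names such as `nmf` and `expression`.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def squared_error(v, w, h):
    """Sum of squared differences between v and the product w h."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh))


def nmf(v, n_factors, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix v (genes x samples) as w h."""
    rng = random.Random(seed)
    w = [[rng.random() + 0.1 for _ in range(n_factors)] for _ in v]
    h = [[rng.random() + 0.1 for _ in v[0]] for _ in range(n_factors)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(len(h[0]))]
             for i in range(n_factors)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_factors)]
             for i in range(len(w))]
    return w, h


# Hypothetical expression matrix: rows are genes, columns are samples.
expression = [
    [9.0, 8.0, 9.0, 1.0, 1.0, 2.0],
    [8.0, 9.0, 9.0, 2.0, 1.0, 1.0],
    [1.0, 2.0, 1.0, 9.0, 8.0, 9.0],
    [2.0, 1.0, 1.0, 8.0, 9.0, 9.0],
]
w, h = nmf(expression, n_factors=2)

# Assign each sample (column) to the factor with the higher encoding value.
labels = [0 if h[0][j] >= h[1][j] else 1 for j in range(len(expression[0]))]
print(labels)
```

Comparing encoding values across factors implicitly assumes that the columns of W end up on comparable scales; the factor numbering itself is arbitrary and depends on the random start.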
Bayesian Networks
for Gene Expression Analysis
Learning
Data --> Preprocessing --> Processed data --> Learning algorithm --> network over Gene A, Gene B, Gene C, Gene D, and Target
Inference
The values of Gene C and Gene B are given.
Belief propagation
Probability for the target is computed.
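The inference step can be illustrated on a very small network. The sketch below uses exact inference by enumeration, not belief propagation, because enumeration is shorter to write and gives the same answer on a network this small. The structure (Gene A -> Gene B, Gene A -> Target, Gene C -> Target), the binary on/off values, and all probabilities are hypothetical, and Gene D is left out.

```python
# Hypothetical conditional probability tables for binary (on/off) nodes.
P_A = {True: 0.3, False: 0.7}          # P(Gene A)
P_C = {True: 0.6, False: 0.4}          # P(Gene C)
P_B_GIVEN_A = {True: 0.9, False: 0.2}  # P(Gene B = on | Gene A)
P_T_GIVEN_AC = {                       # P(Target = on | Gene A, Gene C)
    (True, True): 0.95,
    (True, False): 0.6,
    (False, True): 0.5,
    (False, False): 0.05,
}


def bernoulli(p_on, value):
    """Probability of `value` for a binary variable that is on with p_on."""
    return p_on if value else 1.0 - p_on


def joint(a, b, c, t):
    """Joint probability as the product of each node's conditional probability."""
    return (
        P_A[a]
        * P_C[c]
        * bernoulli(P_B_GIVEN_A[a], b)
        * bernoulli(P_T_GIVEN_AC[(a, c)], t)
    )


def target_given(b, c):
    """P(Target = on | Gene B = b, Gene C = c), summing out the hidden Gene A."""
    on = sum(joint(a, b, c, True) for a in (True, False))
    off = sum(joint(a, b, c, False) for a in (True, False))
    return on / (on + off)


print(round(target_given(True, True), 4))
```

Observing Gene B changes the belief about the unobserved Gene A, which in turn changes the probability of the target; belief propagation organizes the same computation as local messages so that it scales to larger networks.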
Multilayer Perceptrons for Gene
Finding and Prediction
[Figure labels: inputs: coding potential value, GC composition, length, donor, acceptor, intron vocabulary; output: exon score between 0 and 1 along the sequence (bases), also shown as a discrete exon score]
Self-Organizing Maps for DNA
Microarray Data Analysis
Two-dimensional array of postsynaptic neurons
Winning neurons
Bundle of synaptic connections
Input
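One training step of a self-organizing map can be sketched as: find the winning neuron for an input, then move the winner and its grid neighbors toward that input. The sketch assumes a hypothetical 3 x 3 map with 2-dimensional weight vectors, a square neighborhood of radius 1, and a fixed learning rate; a real expression profile would have one dimension per experiment.

```python
def winner(grid, x):
    """Grid position of the neuron whose weight vector is closest to input x."""
    return min(
        grid,
        key=lambda pos: sum((w - xi) ** 2 for w, xi in zip(grid[pos], x)),
    )


def train_step(grid, x, rate=0.5, radius=1):
    """Move the winner and its grid neighbors toward x; return the winner."""
    best = winner(grid, x)
    for pos, weights in grid.items():
        if abs(pos[0] - best[0]) <= radius and abs(pos[1] - best[1]) <= radius:
            grid[pos] = [w + rate * (xi - w) for w, xi in zip(weights, x)]
    return best


# Hypothetical 3 x 3 map whose weight vectors start on a regular lattice.
som = {(r, c): [r / 2.0, c / 2.0] for r in range(3) for c in range(3)}
profile = [0.9, 0.1]  # stand-in for one scaled expression profile
best = train_step(som, profile)
print(best, som[best])
```

Repeating this step over many inputs, usually with a shrinking neighborhood and learning rate, makes neighboring neurons respond to similar inputs, which is what lets the map group similar expression profiles.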
Biological Information Extraction
Text Data
Data Analysis & Field Identification
Data Classification & Field Extraction
Field Property Identification & Learning
Database Template
DB Record Filling
Location
Date
Information Extraction
DB
Biomolecular Computing
011001101010001
ATGCTCGAAGCT
More information on biological data mining and related research can be found at
http://cbit.snu.ac.kr/
http://bi.snu.ac.kr/