COT 6930
HPC and Bioinformatics
Protein Structure Prediction
Xingquan Zhu
Dept. of Computer Science and Engineering
[Figure: flow from genomic DNA through transcription to RNA and translation to protein and phenotype, annotated with data resources: genomic DNA databases; cDNA, ESTs, UniGene; gene expression database; protein sequence databases; protein structure databases]
Outline
Protein Structure
Why structure
How to predict protein structure
Experimental methods
Computational methods (predictive methods)
Protein Structure Prediction
Secondary structure prediction (2D)
Machine learning methods for protein secondary structure prediction
Tertiary structure prediction (3D)
Ab initio
Homology modeling
Proteins
Proteins play a crucial role in virtually all biological processes with a
broad range of functions.
The activity of an enzyme or the function of a protein is governed by
the three-dimensional structure
Protein Structure is Hierarchical
Protein Structure
Video: http://www.youtube.com/watch?v=lijQ3a8yUYQ
Primary Structure: Sequence
The primary structure of a protein is the amino acid sequence
Protein Structure Prediction Problem
Protein structure prediction
Predict protein 3D structure from (amino acid) sequence
One step closer to useful biological knowledge
Sequence → secondary structure → 3D structure → function
Why Predict Structure?
Structure is more conserved than sequence
Structure determines function
Goals:
1. Predict structure from sequence
2. Predict function based on structure
3. Predict function based on sequence
Molecular function
Why predict structure: Structure is more conserved than sequence
28% sequence identity
Why predict structure: Can Label Proteins by Dominant Structure
SCOP: Structural Classification Of Proteins
Why predict structure: Large number of proteins vs. relatively small number of folds
Small number of unique folds found in practice
90% of proteins fall into fewer than 1000 folds; estimated ~4000 folds in total
http://www.rcsb.org/pdb/home/home.do
As of 02/05/2008
48,878 structures
Examples of Fold Classes
How to Predict Protein Structure
A related biological question: what are the factors that
determine a structure?
Energy
Kinematics
How can we determine structure?
Experimental methods
X-ray crystallography or NMR (nuclear magnetic resonance) spectroscopy
Limitations: protein size; X-ray crystallography requires crystallized proteins
Computational methods (predictive methods)
2-D structure (secondary structure)
3-D structure (tertiary structure)
Geometry of Protein Structure
[Figure: protein backbone with the rotatable bonds labeled]
Types of Inter-atomic Forces
Covalent bond (short range, very strong)
Binds atoms into molecules / macromolecules
Disulfide bond / bridge (short range, very strong)
Covalent bond between sulfhydryl (sulfur + hydrogen) groups
Hydrogen bond (short range, strong)
Binds two polar groups (hydrogen + electronegative atom)
Hydrophobic / hydrophilic interaction (weak)
Hydrogen bonding w/ H2O in solution
van der Waals interaction (very weak)
Nonspecific electrostatic attractive force
Quick Overview of Energy
Bond                          Strength (kcal/mole)
H-bonds                       3-7
Ionic bonds                   10
Hydrophobic interactions      1-2
van der Waals interactions    1
Disulfide bridge              51
Protein Folding Animation
http://www.youtube.com/watch?v=fvBO3TqJ6FE
http://www.youtube.com/watch?v=swEc_sUVz5I
Two Related Problems in Structure Prediction
Directly predicting protein structure from the amino acid sequence has proved elusive
Two sub-problems
Secondary Structure Prediction
Tertiary Structure Prediction
Secondary Structure Prediction (2D)
For each residue in a protein structure, there are three possible states: α (α-helix), β (β-strand), t (others).
[Figure: an amino acid sequence aligned with its secondary structure sequence]
As of 2000, the accuracy of secondary structure prediction methods was nearly 80%.
Secondary structure prediction can provide useful information to improve other sequence and structure analysis methods, such as sequence alignment and 3-D modeling.
http://bioinf.cs.ucl.ac.uk/psipred/psiform.html
PSSP: Protein Secondary Structure Prediction
Three Generations
• Based on statistical information of single amino acids
• Based on local amino acid interaction (segments). Typically a segment contains 11-21 amino acids
• Based on evolutionary information of the homologous sequences
Secondary Structure Preferences for Amino Acids
The normalized frequencies for each conformation were calculated from the fraction of residues of each amino acid that occurred in that conformation, divided by this fraction for all residues.
Random occurrence of a particular amino acid in a conformation would give a value of unity. A value greater than unity indicates a preference for a particular type of secondary structure.
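The normalized frequency described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `propensities` and the toy sequence and structure strings are hypothetical, and the letters "a" and "t" stand in for the α and "others" states.

```python
from collections import Counter

def propensities(sequence, structure):
    """Return {(amino_acid, conformation): normalized frequency}.

    The value is the fraction of residues of one amino acid found in a
    conformation, divided by the fraction of all residues found in that
    conformation. A value above 1 indicates a preference.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    pair_counts = Counter(zip(sequence, structure))
    aa_counts = Counter(sequence)
    conf_counts = Counter(structure)
    total = len(sequence)
    result = {}
    for (aa, conf), n in pair_counts.items():
        fraction_for_aa = n / aa_counts[aa]
        fraction_for_all = conf_counts[conf] / total
        result[(aa, conf)] = fraction_for_aa / fraction_for_all
    return result

# Toy data (hypothetical): three alanines, one glycine.
table = propensities("AAAG", "aatt")
```

In practice the counts would come from a large set of proteins with known structures rather than a single short string.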
Machine learning methods for Protein Secondary Structure Prediction
Introduction to classification
Generalize protein secondary structure prediction as a machine learning problem
Introduction to Neural Network
Classification and Classifiers
We are given a database table DB with a set of attribute values and a special attribute C, called the class label.
Example:
A1   A2   A3   A4   C
1    1    m    g    Tumor
0    1    v    g    Normal
1    0    m    b    Normal
Classification and Classifiers
An algorithm is called a classification algorithm if it uses the data to build a set of patterns
Decision rules or decision trees, etc.
Those patterns are structured in such a way that we can use them to classify unknown objects (unknown records).
For that reason (because of the goal), the classification algorithm is often simply called a classifier.
Classifier Example
Classification and Classifiers
Building a classifier consists of two phases: training and testing.
In both phases we use data (a training data set and a disjoint test data set) for which the class labels are known for ALL of the records.
Training: use the training data set to create patterns (rules, trees, or a trained neural network).
Testing: evaluate the created patterns using test data whose classification is known.
The measure of a trained classifier's accuracy is called predictive accuracy.
Predictive Accuracy Evaluation
The main methods of predictive accuracy evaluation are (training set size ; test set size):
• Re-substitution (N ; N)
• Holdout (2N/3 ; N/3)
• x-fold cross-validation (N-N/x ; N/x)
• Leave-one-out (N-1 ; 1),
where N is the number of instances in the dataset
The process of building and evaluating a classifier is also called supervised learning or, more recently when dealing with large databases, a classification method in data mining.
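The (training ; test) sizes listed above can be made concrete with index splits. The sketch below is a minimal illustration: all function names are hypothetical, and shuffling of the instances (usually done first) is omitted to keep the result deterministic.

```python
def resubstitution_split(n):
    """Train and test on the same N instances (N ; N)."""
    indices = list(range(n))
    return indices, indices

def holdout_split(n):
    """First 2N/3 instances for training, remaining N/3 for testing."""
    indices = list(range(n))
    cut = (2 * n) // 3
    return indices[:cut], indices[cut:]

def cross_validation_splits(n, x):
    """x-fold cross-validation: each fold of about N/x instances is the
    test set once, and the other N - N/x instances form the training set."""
    if not 1 < x <= n:
        raise ValueError("x must satisfy 1 < x <= n")
    splits = []
    for fold in range(x):
        start = fold * n // x
        stop = (fold + 1) * n // x
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n))
        splits.append((train, test))
    return splits

def leave_one_out_splits(n):
    """Leave-one-out is x-fold cross-validation with x = N (N-1 ; 1)."""
    return cross_validation_splits(n, n)
```

For example, `holdout_split(6)` gives four training indices and two test indices, and `leave_one_out_splits(4)` gives four splits with a single test instance each.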
Classification Models: Different Classifiers
Typical classification models
Decision Trees (ID3, C4.5)
Nearest Neighbors
Support Vector Machines
Neural Networks
Most of the best classifiers for PSSP are based on the neural network model
Demonstration
How to generalize protein secondary structure prediction as a machine learning problem?
Use a sliding window that moves along the amino acid sequence
Each window denotes an instance
Each amino acid inside the window denotes an attribute
The known secondary structure of the central amino acid is the class label
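The sliding-window construction above might look like the following sketch. The function name `sliding_windows`, the padding character "-" for positions beyond the sequence ends, and the toy strings are assumptions made for illustration; "a" and "t" stand in for the α and "others" states.

```python
def sliding_windows(sequence, structure, window=13, pad="-"):
    """Turn a labeled sequence into (window, class_label) instances.

    Each instance is the window of residues centered on one position,
    and the class label is the known secondary structure of the central
    residue. The ends are padded so every residue gets a full window.
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd so there is a central residue")
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    half = window // 2
    padded = pad * half + sequence + pad * half
    instances = []
    for i, label in enumerate(structure):
        instances.append((padded[i:i + window], label))
    return instances

# Toy example (hypothetical sequence and labels), window of 3 residues.
examples = sliding_windows("MKVLA", "taaat", window=3)
```

Each character of the window string is one attribute of the instance; a classifier would typically receive a numeric encoding of these characters.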
How to generalize protein secondary structure prediction as a machine learning problem?
A set of “examples” is generated from sequences with known secondary structures
Examples form a training set
Build a neural network classifier
Apply the classifier to a sequence with unknown secondary structure
Introduction to Neural Network
What is an artificial Neural Network?
An extremely simplified model of the brain
Essentially a function approximator
Transforms inputs into outputs to the best of its ability
Introduction to Neural Network
Composed of many “neurons” that co-operate to
perform the desired function
How Do Neural Networks Work?
A neuron (perceptron) is a single layer NN
The output of a neuron is a function of the weighted
sum of the inputs plus a bias
Activation Function
Binary activation function
f(x) = 1 if x >= 0
f(x) = 0 otherwise
The most common sigmoid function used is the logistic function
f(x) = 1/(1 + e^(-x))
The calculation of derivatives is important for neural networks, and the logistic function has a very nice derivative
f'(x) = f(x)(1 - f(x))
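As a small illustration of the formulas above, the sketch below implements the logistic function, its derivative, and a single neuron whose output is the activation of the weighted sum of the inputs plus a bias. The function names and the example weights are hypothetical.

```python
import math

def logistic(x):
    """Logistic (sigmoid) activation: f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def logistic_derivative(x):
    """Derivative expressed through the function itself: f(x)(1 - f(x))."""
    fx = logistic(x)
    return fx * (1.0 - fx)

def neuron_output(inputs, weights, bias):
    """Output of one neuron: activation of (weighted sum of inputs + bias)."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return logistic(weighted_sum + bias)

# Hypothetical weights chosen so that the weighted sum is exactly zero.
y = neuron_output([1, 0, 1], [0.5, -0.3, -0.5], bias=0.0)
```

Because the derivative reuses the value already computed in the forward direction, it costs almost nothing extra during training.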
Where Do The Weights Come From?
The weights in a neural network are the most
important factor in determining its function
Training is the act of presenting the network with
some sample data and modifying the weights to
better approximate the desired function
Supervised Training
Supplies the neural network with inputs and the desired
outputs
Response of the network to the inputs is measured
The weights are modified to reduce the difference between the
actual and desired outputs
Perceptron Example
Simplest neural network with the ability to learn
Made up of only input neurons and output neurons
Output neurons use a simple threshold activation
function
In basic form, can only solve linear problems
Limited applications
Perceptron Example
Perceptron weight updating
If the output is not correct, the weights are adjusted according to the formula:
w_new = w_old + α·(desired - output)·input, where α is the learning rate
Assume the given instance is {(1,0,1), 0}: inputs (1,0,1) with desired output 0
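One update step for the instance {(1,0,1), 0} can be traced in code. The starting weights, the learning rate of 0.5, and the absence of a bias term are assumptions made for this sketch; they are not given on the slide.

```python
def step(x):
    """Binary threshold activation: 1 if x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0

def perceptron_output(weights, inputs):
    """Threshold of the weighted sum of the inputs (no bias in this sketch)."""
    return step(sum(w * x for w, x in zip(weights, inputs)))

def update_weights(weights, inputs, desired, learning_rate):
    """Apply w_new = w_old + learning_rate * (desired - output) * input."""
    output = perceptron_output(weights, inputs)
    error = desired - output
    return [w + learning_rate * error * x for w, x in zip(weights, inputs)]

weights = [0.2, 0.4, 0.3]        # hypothetical starting weights
inputs, desired = (1, 0, 1), 0   # the instance {(1,0,1), 0}
new_weights = update_weights(weights, inputs, desired, learning_rate=0.5)
```

With these starting values the perceptron first outputs 1 instead of 0, so the weights attached to the active inputs are lowered while the weight of the inactive second input is left unchanged. After the update, the same instance is classified correctly.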
Multi-Layer Feedforward NN
An extension of the perceptron
Multiple layers
Activation function is not simply a threshold
Usually a sigmoid function
A general function approximator
The addition of one or more “hidden” layers in between the
input and output layers
Not limited to linear problems
Information flows in one direction
The outputs of one layer act as inputs to the next layer
Multi-Layer Feedforward NN
Example
XOR problem
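To see why a hidden layer helps with XOR, here is a minimal sketch of a two-layer feedforward network with hand-picked weights. The weights are hypothetical and chosen by hand, not learned, and threshold units are used instead of sigmoid units only to keep the arithmetic easy to follow.

```python
def step(x):
    """Binary threshold activation: 1 if x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0

def xor_network(x1, x2):
    """Feedforward pass: inputs -> two hidden units -> one output unit."""
    # Hidden unit 1 fires when at least one input is on (logical OR).
    h1 = step(1.0 * x1 + 1.0 * x2 - 0.5)
    # Hidden unit 2 fires only when both inputs are on (logical AND).
    h2 = step(1.0 * x1 + 1.0 * x2 - 1.5)
    # Output fires for "OR but not AND", which is XOR.
    return step(1.0 * h1 - 1.0 * h2 - 0.5)

truth_table = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
```

No single threshold unit can produce this truth table, because the two classes are not separable by one line in the input plane; the hidden layer remaps the inputs so that the output unit faces a linear problem.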
Back-propagation
Searches for weight values that minimize the
total error of the network over the set of
training examples
Forward pass: Compute the outputs of all units in the
network, and the error of the output layers.
Backward pass: The network error is used for
updating the weights (credit assignment problem).
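The forward and backward passes can be sketched for a tiny network with two inputs, two sigmoid hidden units, and one sigmoid output. Everything specific here is an assumption for illustration: the network size, the squared-error measure 0.5·(output - target)^2, the learning rate, and all function names.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(params, x):
    """Forward pass: compute hidden activations and the network output."""
    W1, b1, W2, b2 = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)
    return hidden, output

def loss(params, x, target):
    """Squared error of the output for one training example."""
    _, output = forward(params, x)
    return 0.5 * (output - target) ** 2

def backprop(params, x, target):
    """Backward pass: gradients of the error with respect to every weight."""
    W1, b1, W2, b2 = params
    hidden, output = forward(params, x)
    # Error signal at the output unit, using f'(net) = f(net)(1 - f(net)).
    delta_out = (output - target) * output * (1.0 - output)
    grad_W2 = [delta_out * h for h in hidden]
    # Each hidden unit receives its share of the output error (credit assignment).
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(W2, hidden)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    return grad_W1, delta_hidden, grad_W2, delta_out

def train_step(params, x, target, lr=0.5):
    """One gradient-descent update of all weights and biases."""
    W1, b1, W2, b2 = params
    gW1, gb1, gW2, gb2 = backprop(params, x, target)
    new_W1 = [[w - lr * g for w, g in zip(row, grow)] for row, grow in zip(W1, gW1)]
    new_b1 = [b - lr * g for b, g in zip(b1, gb1)]
    new_W2 = [w - lr * g for w, g in zip(W2, gW2)]
    return new_W1, new_b1, new_W2, b2 - lr * gb2
```

Training repeats `train_step` over all training examples until the total error stops decreasing; this sketch shows only the per-example update.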
NN for Protein Secondary Structure Prediction
Ab initio Prediction
Sampling the global conformation space
Lattice models / Discrete-state models
Molecular Dynamics
Picking native conformations with an energy function
Solvation model: how protein interacts with water
Pair interactions between amino acids
Lattice String Folding
HP model: main modeled force is hydrophobic attraction
Amino acids are classified into two types
Hydrophobic (H) or Polar (P)
NP-hard on both the 2-D square and 3-D cubic lattices
Constant-factor approximation algorithms
Not so relevant biologically
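A conformation in the 2-D HP model can be scored as in the sketch below. It assumes the commonly used convention of -1 for every pair of H residues that are lattice neighbors but not consecutive in the chain; the function name and the example fold are hypothetical.

```python
def hp_energy(sequence, coords):
    """Energy of an HP chain placed on the 2-D square lattice.

    sequence: string of 'H' and 'P'; coords: one (x, y) site per residue.
    Assumed scoring: -1 per H-H contact between non-consecutive residues.
    """
    if len(sequence) != len(coords):
        raise ValueError("need one lattice site per residue")
    if len(set(coords)) != len(coords):
        raise ValueError("chain is not self-avoiding")
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive residues must be lattice neighbors")
    site_to_index = {site: i for i, site in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = site_to_index.get((x + dx, y + dy))
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return -contacts

# A four-residue chain folded around a unit square (hypothetical example).
folded = hp_energy("HPPH", [(0, 0), (1, 0), (1, 1), (0, 1)])
extended = hp_energy("HPPH", [(0, 0), (1, 0), (2, 0), (3, 0)])
```

Folding in this model means searching over self-avoiding walks for the placement with the lowest energy, which is the part that is NP-hard.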
Lattice String Folding
Energy Minimization
Many forces act on a protein
Hydrophobic: inside of protein wants to avoid water
Hydrophobic molecules associate with each other in water solvent as if the water molecules repel them. It is like oil/water separation.
Packing: atoms can't be too close, nor too far away
van der Waals interactions
Bond angle/length constraints
Long distance, e.g.
Electrostatics & hydrogen bonds
Disulphide bonds
Salt bridges
Can calculate all of these forces, and minimize
Intractable in general case, but can be useful
Molecular Dynamics (MD)
In molecular dynamics simulation, we simulate the motions of atoms as a function of time according to Newton’s equation of motion. The equations for a system consisting of N atoms can be written as
$$ m_i \frac{d^2 \mathbf{r}_i(t)}{dt^2} = \mathbf{F}_i(t), \qquad i = 1, 2, \ldots, N. \tag{1} $$
Here, ri and mi represent the position and mass of atom i and Fi(t) is the force on atom i at time t. Fi(t) is given by
$$ \mathbf{F}_i(t) = -\nabla_i V(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N), \tag{2} $$
where V(r1, r2, …, rN) is the potential energy of the system that depends on the positions of the N atoms in the system. ∇i is
$$ \nabla_i = \mathbf{i}\,\frac{\partial}{\partial x_i} + \mathbf{j}\,\frac{\partial}{\partial y_i} + \mathbf{k}\,\frac{\partial}{\partial z_i}. \tag{3} $$
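Equations (1) and (2) are integrated numerically in small time steps. The sketch below uses the velocity Verlet scheme, one common choice that the slides do not name, on a one-dimensional harmonic potential V = k·x²/2. The function names, the potential, and the step size are all assumptions for illustration.

```python
def velocity_verlet(positions, velocities, masses, force_fn, dt, n_steps):
    """Integrate m_i * d^2 r_i / dt^2 = F_i with the velocity Verlet scheme.

    positions, velocities, masses: one number per degree of freedom.
    force_fn(positions) must return the forces F_i = -dV/dr_i.
    """
    r = list(positions)
    v = list(velocities)
    f = force_fn(r)
    for _ in range(n_steps):
        # Half-step velocity update with the current forces.
        v = [vi + 0.5 * dt * fi / mi for vi, fi, mi in zip(v, f, masses)]
        # Full-step position update.
        r = [ri + dt * vi for ri, vi in zip(r, v)]
        # Recompute forces at the new positions, then finish the velocity update.
        f = force_fn(r)
        v = [vi + 0.5 * dt * fi / mi for vi, fi, mi in zip(v, f, masses)]
    return r, v

def harmonic_force(positions, k=1.0):
    """Force from V = 0.5 * k * x^2 for each coordinate: F = -k * x."""
    return [-k * x for x in positions]

# One particle released from x = 1 at rest, simulated up to t = 10.
r, v = velocity_verlet([1.0], [0.0], [1.0], harmonic_force, dt=0.01, n_steps=1000)
```

For this potential the exact trajectory is x(t) = cos(t), which gives a convenient check that the integrator follows Newton's equation and keeps the total energy close to its initial value.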
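Energy Functions used in Molecular Simulation
$$ V_{total} = \sum_{bonds} K_b (r - r_0)^2 + \sum_{angles} K_\Theta (\Theta - \Theta_0)^2 + \sum_{dihedrals} K_\Phi \left[ 1 + \cos(n\Phi) \right] + \sum_{Hbonds} \left( \frac{C_{ij}}{r_{ij}^{12}} - \frac{D_{ij}}{r_{ij}^{10}} \right) + \sum_{\substack{van\ der\ Waals \\ i,j\ pairs}} \left( \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} \right) + \sum_{\substack{electrostatic \\ i,j\ pairs}} \frac{q_i q_j}{r_{ij}} $$
The terms, in order: bond stretching term (bond length r), angle bending term (bond angle Θ), dihedral term (dihedral angle Φ), H-bonding term (O to H distance r), van der Waals term, and electrostatic term (distance r between charges + and -).
The sums over i, j pairs are the most time-demanding part.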
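The pair terms of the energy function are expensive because every pair of atoms contributes. A minimal sketch of the van der Waals and electrostatic sums follows; it assumes a single A and B shared by all pairs instead of per-pair A_ij and B_ij, uses unit-free toy values, and the function name is hypothetical.

```python
import math

def nonbonded_energy(coords, charges, A, B):
    """Sum A/r^12 - B/r^6 + q_i*q_j/r over all distinct atom pairs.

    coords: list of (x, y, z) positions; charges: one charge per atom.
    A and B are shared by all pairs here, which is a simplification.
    """
    total = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):   # each pair once: n*(n-1)/2 terms
            r = math.dist(coords[i], coords[j])
            van_der_waals = A / r ** 12 - B / r ** 6
            electrostatic = charges[i] * charges[j] / r
            total += van_der_waals + electrostatic
    return total

# Two neutral atoms at the minimum of the 12-6 curve for A = B = 1.
r_min = 2.0 ** (1.0 / 6.0)
e_pair = nonbonded_energy([(0.0, 0.0, 0.0), (r_min, 0.0, 0.0)], [0.0, 0.0], A=1.0, B=1.0)
```

The double loop grows quadratically with the number of atoms, which is why these sums dominate the running time of a simulation.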
Homology-based Prediction
Align query sequence with sequences of known structure,
usually >30% similar
Superimpose the aligned sequence onto the structure
template, according to the computed sequence alignment
Perform local refinement of the resulting structure in 3D
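The similarity threshold in the first step is usually checked on an existing alignment. The sketch below computes percent identity over the aligned (gap-free) columns, which is one of several definitions in use; the function name, the gap character "-", and the toy alignment are assumptions.

```python
def percent_identity(aligned_query, aligned_template, gap="-"):
    """Percent of gap-free alignment columns where both residues match."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    aligned_columns = 0
    matches = 0
    for a, b in zip(aligned_query, aligned_template):
        if a == gap or b == gap:
            continue            # columns with a gap are not counted
        aligned_columns += 1
        if a == b:
            matches += 1
    if aligned_columns == 0:
        return 0.0
    return 100.0 * matches / aligned_columns

# Toy alignment (hypothetical): 5 matches out of 6 gap-free columns.
identity = percent_identity("ACDE-FG", "ACDQWFG")
usable_template = identity > 30.0
```

A template that passes this check would then go through the superposition and refinement steps listed above.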
The number of unique structural folds
is small (possibly a few thousand)
90% of new structures submitted to PDB in the
past three years have similar folds in PDB
Homology-based Prediction
Raw model
Loop modeling
Side chain placement
Refinement
Homology-based Prediction