* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Download Introduction and Preliminaries - Department of Computer and
Site-specific recombinase technology wikipedia , lookup
Gel electrophoresis of nucleic acids wikipedia , lookup
Microevolution wikipedia , lookup
Bisulfite sequencing wikipedia , lookup
Molecular Inversion Probe wikipedia , lookup
Genealogical DNA test wikipedia , lookup
Primary transcript wikipedia , lookup
Genome evolution wikipedia , lookup
SNP genotyping wikipedia , lookup
DNA vaccination wikipedia , lookup
Epigenomics wikipedia , lookup
Cell-free fetal DNA wikipedia , lookup
Human genome wikipedia , lookup
Molecular cloning wikipedia , lookup
Vectors in gene therapy wikipedia , lookup
No-SCAR (Scarless Cas9 Assisted Recombineering) Genome Editing wikipedia , lookup
DNA supercoil wikipedia , lookup
Nucleic acid double helix wikipedia , lookup
Point mutation wikipedia , lookup
Cre-Lox recombination wikipedia , lookup
Non-coding DNA wikipedia , lookup
Extrachromosomal DNA wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Nucleic acid analogue wikipedia , lookup
Genome editing wikipedia , lookup
History of genetic engineering wikipedia , lookup
Deoxyribozyme wikipedia , lookup
Genomic library wikipedia , lookup
Computational Molecular Biology
Introduction and Preliminaries
Preliminaries in Computer Science
Strings and alphabet
Basic notations in graph theory
Algorithms and Complexity
My T. Thai
mythai@cise.ufl.edu
2
Strings
Consist of a sequence of letters:
DNA: four nucleotides A, C, G, T
Proteins: 20 symbol alphabet of animo acids
Given a string s, we have the following
notations:
Length: |s|
Substring: ACT is a substring of ATGACTG
Superstring: ATGACTG is a superstring of ACT
Index and interval: s[i] and s[i..j]
Prefix and suffix: s[1..j] and s[i..|s|]
My T. Thai
mythai@cise.ufl.edu
3
Graphs
G = (V, E) where V is a set of vertices and E is
a set of edges
Undirected graph: edges are undirected
Directed graph: edges are directed
Weighted graph G = (V, E, w) where each edge
has some weight
Some special graphs: complete graph, bipartite
graph, tree, and interval graph
Subgraph, spanning tree, steiner tree
My T. Thai
mythai@cise.ufl.edu
4
Interval Graphs
Intersection graph of a set of intervals on the
real line
A vertex represents an interval and an edge (u,
v) exists if intervals u and v intersect
My T. Thai
mythai@cise.ufl.edu
5
Some Problems in Graphs
Euler circuit: Given a graph, find a cycle that passes
through each edge exactly once
Hamiltonian circuit: Given a graph, find a cycle that
passes through each vertex exactly once
Minimum Spanning Tree: Given a weighted
undirected graph, find a spanning tree with minimum
total weight
Maximum Matching: Given an undirected graph, find
a maximum cardinality matching, which is a subset of
edges such that no two edges in the subset share an
endpoint
My T. Thai
mythai@cise.ufl.edu
6
P vs. NP
Class of P: Set of problems solvable by
polynomial-time algoirthms
Class of NP: Set of problems whose solutions,
once found, can be verified in polynomial time
NP-complete (NP-hard) problems: cannot
obtain an optimal solutions in polynomial time
My T. Thai
mythai@cise.ufl.edu
7
Some approaches for NP-complete
Problems
Special-case method: Work on the problem with a
restricted class of inputs
Exhaustive search: Design an exponential-time
algorithms that may perform well in practice
Approximation algorithms: Design a polynomialtime algorithm that is guaranteed to find near-optimal
solutions (with a good approximation ratio)
Heuristics: Fast algorithms that produce satisfactory
solutions most of the time but without guarantee
My T. Thai
mythai@cise.ufl.edu
8
Preliminaries in Molecular Biology
My T. Thai
mythai@cise.ufl.edu
9
DNA and Base Pairs
Double helix consisting of two
dual strands
Has four types of nucleotides:
Adenine, Thymine, Guanine,
Cytosine
Base Pairs: A↔T, C↔G
Two ends of a strand are
marked with 3’ and 5’
The entire DNA of a living
organism is called its genome
My T. Thai
mythai@cise.ufl.edu
10
DNA Sequences
My T. Thai
mythai@cise.ufl.edu
11
DNA Replication
Strands are separated
Each strand is replicated
using one of the parental
strands as a template
My T. Thai
mythai@cise.ufl.edu
12
Cell, Chromosome, and
DNA
My T. Thai
mythai@cise.ufl.edu
13
Cell Classification
My T. Thai
mythai@cise.ufl.edu
14
Chromosomes
Consists of a DNA molecule associated with proteins
that fold and pack the DNA thread into a more
compact structure and proteins required for the
process of gene expression, DNA replication and
DNA repair.
Human genome is distributed over 24 chromosomes
Each cell contains 46 chromosomes
22 pairs common to both males and females
2 sex chromosome X and Y in males and two Xs in
female
My T. Thai
mythai@cise.ufl.edu
15
Genes
Segments of DNA
Functional and physical
unit of heredity passed
from parent to offspring
Contain the information
for making a specific
protein
My T. Thai
mythai@cise.ufl.edu
16
Proteins
Shorts strings in the amino acid 20-letter
alphabet
Human genome: about 100,000 proteins, with
each protein a few hundred amino acids long
Bacteria make 500-1500 proteins
Made by genes (fragments of DNA) that are
roughly three times longer than the
corresponding proteins.
Why? Every 3 nucleotides in the DNA alphabet
code one letter in the protein alphabet of amino
acids
My T. Thai
mythai@cise.ufl.edu
17
Central Dogma of Molecular
Biology
My T. Thai
mythai@cise.ufl.edu
18
Transcription
My T. Thai
mythai@cise.ufl.edu
19
Translation
Translation
mRNA (after exported out of the nucleus and reaching the
cytosol) directs the synthesis of the protein by joining together
amino acids in the order encoded by the mRNA
Genetic code
Defines a mapping between codons and amino acid.
Codon
Triplet of nucleotides specifies a single amino acid in a
corresponding protein
64 codons and 20 amino acids
Translation is carried out by ribosomes
My T. Thai
mythai@cise.ufl.edu
20
Polymerase Chain Reaction (PCR)
Primer
Nucleic acid
strand
Serves as a
starting point of
DNA replication
My T. Thai
mythai@cise.ufl.edu
21
Plasmid Vector
Vector
an agent that can carry a DNA fragment into a host cell
Plasmid
Circular and doublestranded DNA
Antibiotic resistance
Automatic replication
Exists in bacteria
My T. Thai
mythai@cise.ufl.edu
22
DNA Cloning Using Plasmids as Vectors
(a) DNA recombination
(b) Transformation
My T. Thai
mythai@cise.ufl.edu
23
DNA Cloning Using Plasmids as Vectors (Cont)
(c) Selective amplification
(d) Isolation of desired DNA clones
My T. Thai
mythai@cise.ufl.edu
24
DNA Library Screening
Probe:
Labeled with radioisotope or
fluorescence
Used to detect specific DNA
sequences by hybridization
Hybridization:
Binding of two nucleic acid
chains by base paring
DNA Library Screening
To identify each clone whether it
contains a probe from a given set
of probes
Positive clone: contains a probe
My T. Thai
mythai@cise.ufl.edu
25
Some Computational Problems
Pooling Design
Non-unique probe selection
Sequence Alignment, Multi Sequence
Alignment
DNA sequencing
Genome Rearrangement
Protein Structure Prediction and Recognition
Protein-Protein Interactions
Functional Groups, Modules
My T. Thai
mythai@cise.ufl.edu
26
Pooling Designs
Problem Definition
Given a set of n clones with at most d positive
clones
Identify all positive clones with the minimum
number of tests
Pool: a subset of clones
Positive pool: a pool contains at least one positive
clone
My T. Thai
mythai@cise.ufl.edu
27
Pooling Designs
clones
pools
Mtxn =
c1
p1 0
p2 0
.
.
pi 0
.
.
pt 0
V(D)
c2
cj
cn
0 … 0 … 0 … 0 … 0
1 … 0 … 0 … 0 … 0
Testing
0 … 0 … 1 … 0 … 0
0 … 0 … 0 … 0 … 0
0
1
.
.
1
.
.
0
txn
tx1
M[i, j] = 1 iff the ith pool contains the jth clone
Decoding Algorithm:
Given M and V(D), identify all positive clones
My T. Thai
mythai@cise.ufl.edu
28
Challenges
Challenge 1: How to construct the binary
matrix M such that:
Outputs of any union of d columns are distinct
Challenge 2: How to design a decoding
algorithm with efficient time complexity [O(tn)]
My T. Thai
mythai@cise.ufl.edu
29
Probe Selection
Problem Definition:
Given a biological sample (e.g., blood) and a set of
probes
Identify the presence (or absence) of some
biological objects (e.g., viruses or bacteria) with the
minimum number of probes
My T. Thai
mythai@cise.ufl.edu
30
Unique Probes VS. Non-unique
Probes
Unique probes
Gene-specific probes or signature probes.
Difficult to find
Non-unique probes
Hybridize to more than one target.
Difficult to decode the results
My T. Thai
mythai@cise.ufl.edu
31
Probe-Target Matrix
12 probe candidates.
4 targets (genes).
For target set S, define P(S) as set of
probes reacting to any gene in S.
P({1, 2}) = {1, 2, 3, 4, 7, 8, 9, 10, 12}.
P({2, 3}) = {1, 3, 4, 5, 6, 7, 8, 9, 12}.
Symmetric set difference:
P({1, 2})∆P({2, 3}) = {2, 5, 6, 10}. Probes
that separate two sets.
My T. Thai
mythai@cise.ufl.edu
32
Sequence Alignment
Problem Definition:
Given: 2 DNA or protein sequences
Find: Best match between them
What is an Alignment:
Given: 2 Strings S and S’
Goal: The lengths of S and S’ are the same by inserting
spaces into these strings
A
--
T
C
--
A
--
C
T
C
A
A
My T. Thai
mythai@cise.ufl.edu
33
Matches, Mismatches and Indels
Match: two aligned, identical characters in an
alignment
Mismatch: two aligned, unequal characters
Indel: A character aligned with a space
A A C T A C T -- C C T A A C A C T -- --- -- C T C C T A C C T -- -- T A C T T T
10 matches, 2 mismatches, 7 indels
My T. Thai
mythai@cise.ufl.edu
34
Basic Algorithmic Problem
Find the alignment of the two strings that:
Max m where m = (# matches – mismatches –
indels)
m defines the similarity of the two strings, also
called Optimal Global Alignment
Biologically: a mismatch represents a mutation,
whereas an indel represents a historical
insertion or deletion of a single character
My T. Thai
mythai@cise.ufl.edu
35
Multiple Sequence Alignment
Problem Definition:
Similar to the sequence alignment problem but the
input has more than 2 strings
Challenges:
NP-hard
Guarantee factor: 2 – 2/k where k is the number of
the input sequences.
More work to reduce the time and space complexity
My T. Thai
mythai@cise.ufl.edu
36
DNA Sequencing
Problem Definition:
Given a set of fragments that are contained in a
DNA string S
Goal: Determine the string S
NP-complete
Further complicated due to the existence of
repetitive sequences in the genome
Can cast this as a Hamiltonian path or Euler
path problem (was introduced by Pavel
Pavzner)
My T. Thai
mythai@cise.ufl.edu
37
Genome Rearrangement
Problem Definition:
Given genomes of 2 different species
Goal: Find a sequence of evolutionary events that
turn the first genome to the second one.
Biological reasons: How close between these
species, how much evolution separate these
species.
E.g.: We usually test new drugs on mice before
humans. However, how close is a mouse to a
human?
My T. Thai
mythai@cise.ufl.edu
38
Genome Rearrangement
Can we use the solutions of sequence alignment
to solve this problem?
Answer: NO, because:
Genome is a very long strings (3 million letters for
a human genome
Model of sequence alignment is not appropriate for
human genome comparison since the differences
are not in terms of insertions/deletion/mutations of
a nucleotide, but a rearrangement of a long DNA
regions
The basic comparison is gene
My T. Thai
mythai@cise.ufl.edu
39
An Example
• If we compare these two strings by sequence alignment,
it’s impossible
• However, the second string is the first string after
reverse the fragment AATGGT…CCC.
My T. Thai
mythai@cise.ufl.edu
40
Main Evolutionary Events
Deletions: A fragment is removed
Duplications: create many copies of a fragment
and insert into different positions
Transpositions: A fragment is removed and reinserted into a different position
Inversions: A fragment is removed, reversed,
and then reinserted into the same position
Translocations: A pair of fragments are
exchanged between the ends of two
chromosomes
My T. Thai
mythai@cise.ufl.edu
41
My T. Thai
mythai@cise.ufl.edu
42
Protein Structure Prediction
Problem Definition:
Given: A sequence of amino acids
Goal: Predict the 3D structure of the protein
Some approaches:
Determine the position of a protein’s atoms so as to
minimize the total free energy
Find the similarities to some known proteins
My T. Thai
mythai@cise.ufl.edu
43
Community Structure
Problem Definition:
Given a graph G = (V, E) representing a network
Partition G into a set of subgraph (community structure) so
that nodes in each subgraph are highly connected
Biological reason: Genes with similar expression data
may have similar functions. Identify the community
structure can help us to reduce the number of tests
Others: Community structure is also studied in
different fields
My T. Thai
mythai@cise.ufl.edu
44