Computational Molecular Biology
Introduction and Preliminaries
My T. Thai mythai@cise.ufl.edu

Preliminaries in Computer Science
• Strings and alphabet
• Basic notations in graph theory
• Algorithms and complexity

Strings
• Consist of a sequence of letters:
  • DNA: four nucleotides A, C, G, T
  • Proteins: 20-symbol alphabet of amino acids
• Given a string s, we have the following notations:
  • Length: |s|
  • Substring: ACT is a substring of ATGACTG
  • Superstring: ATGACTG is a superstring of ACT
  • Index and interval: s[i] and s[i..j]
  • Prefix and suffix: s[1..j] and s[i..|s|]

Graphs
• G = (V, E), where V is a set of vertices and E is a set of edges
• Undirected graph: edges are undirected
• Directed graph: edges are directed
• Weighted graph G = (V, E, w), where each edge has a weight
• Some special graphs: complete graph, bipartite graph, tree, and interval graph
• Subgraph, spanning tree, Steiner tree

Interval Graphs
• Intersection graph of a set of intervals on the real line
• A vertex represents an interval, and an edge (u, v) exists if intervals u and v intersect

Some Problems in Graphs
• Euler circuit: given a graph, find a cycle that passes through each edge exactly once
• Hamiltonian circuit: given a graph, find a cycle that passes through each vertex exactly once
• Minimum spanning tree: given a weighted undirected graph, find a spanning tree with minimum total weight
• Maximum matching: given an undirected graph, find a maximum-cardinality matching, that is, a largest subset of edges such that no two edges in the subset share an endpoint

P vs. NP
• Class P: set of problems solvable by polynomial-time algorithms
• Class NP: set of problems whose solutions, once found, can be verified in polynomial time
• NP-complete (NP-hard) problems: no polynomial-time algorithm for finding an optimal solution is known, and none exists unless P = NP
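To illustrate what "verified in polynomial time" means for NP, here is a minimal sketch of a verifier for the Hamiltonian circuit problem. The function name and the input representation (a vertex list, an undirected edge list, and a proposed visiting order) are assumptions chosen for illustration. The check takes time polynomial in the size of the graph, even though no polynomial-time method is known for finding such a circuit.

```python
def is_hamiltonian_circuit(vertices, edges, cycle):
    """Verify a proposed Hamiltonian circuit in an undirected graph.

    `cycle` lists vertices in visiting order; the closing edge from the
    last vertex back to the first one is implied.
    """
    vertex_set = set(vertices)
    # Store each undirected edge as an unordered pair.
    edge_set = {frozenset(e) for e in edges}

    # Every vertex must appear exactly once in the proposed order.
    if len(cycle) != len(vertex_set) or set(cycle) != vertex_set:
        return False
    # A simple cycle in an undirected graph needs at least three vertices.
    if len(cycle) < 3:
        return False

    # Consecutive vertices (wrapping around) must be joined by an edge.
    return all(
        frozenset((cycle[i], cycle[(i + 1) % len(cycle)])) in edge_set
        for i in range(len(cycle))
    )


# A 4-cycle: 1-2-3-4-1.
square_vertices = [1, 2, 3, 4]
square_edges = [(1, 2), (2, 3), (3, 4), (4, 1)]
print(is_hamiltonian_circuit(square_vertices, square_edges, [1, 2, 3, 4]))
print(is_hamiltonian_circuit(square_vertices, square_edges, [1, 3, 2, 4]))
```

The first call accepts the certificate; the second rejects it because the edge (1, 3) is not in the graph.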
Some Approaches for NP-complete Problems
• Special-case method: work on the problem with a restricted class of inputs
• Exhaustive search: design an exponential-time algorithm that may perform well in practice
• Approximation algorithms: design a polynomial-time algorithm that is guaranteed to find near-optimal solutions (with a good approximation ratio)
• Heuristics: fast algorithms that produce satisfactory solutions most of the time, but without guarantees

Preliminaries in Molecular Biology

DNA and Base Pairs
• Double helix consisting of two complementary strands
• Has four types of nucleotides: Adenine, Thymine, Guanine, Cytosine
• Base pairs: A↔T, C↔G
• The two ends of a strand are marked 3' and 5'
• The entire DNA of a living organism is called its genome

DNA Sequences
[Figure]

DNA Replication
• Strands are separated
• Each parental strand serves as the template for a new complementary strand

Cell, Chromosome, and DNA
[Figure]

Cell Classification
[Figure]

Chromosomes
• Consist of a DNA molecule associated with proteins that fold and pack the DNA thread into a more compact structure, and with proteins required for gene expression, DNA replication, and DNA repair
• The human genome is distributed over 24 distinct chromosomes (22 autosomes, X, and Y)
• Each somatic cell contains 46 chromosomes:
  • 22 pairs common to both males and females
  • 2 sex chromosomes: X and Y in males, two Xs in females

Genes
• Segments of DNA
• Functional and physical unit of heredity passed from parent to offspring
• Contain the information for making a specific protein
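The base-pairing rule A↔T, C↔G from the DNA and Base Pairs slide can be written as a small string operation. The sketch below is an illustration; the names `complement` and `reverse_complement` are assumptions, not notation from the slides. Because the two strands of the helix run in opposite 5'-to-3' directions, the partner strand read in its own 5'-to-3' direction is the reverse of the base-by-base complement.

```python
# Watson-Crick base pairs: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement, in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the partner strand read in its own 5'-to-3' direction."""
    return complement(strand)[::-1]


print(complement("ATGACTG"))          # TACTGAC
print(reverse_complement("ATGACTG"))  # CAGTCAT
```

Applying `reverse_complement` twice returns the original strand, which mirrors how each strand can act as the template for the other during replication.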
Proteins
• Short strings over the 20-letter amino acid alphabet
• Human genome: about 100,000 proteins, each a few hundred amino acids long
• Bacteria make 500-1500 proteins
• Made by genes (fragments of DNA) that are roughly three times longer than the corresponding proteins. Why?
  • Every 3 nucleotides in the DNA alphabet code for one letter in the protein alphabet of amino acids

Central Dogma of Molecular Biology
[Figure]

Transcription
[Figure]

Translation
• Translation: mRNA (after being exported from the nucleus and reaching the cytosol) directs the synthesis of the protein by joining together amino acids in the order encoded by the mRNA
• Genetic code: defines a mapping between codons and amino acids
• Codon: a triplet of nucleotides that specifies a single amino acid in the corresponding protein
• 64 codons and 20 amino acids, so most amino acids are encoded by more than one codon
• Translation is carried out by ribosomes

Polymerase Chain Reaction (PCR)
• Primer:
  • Nucleic acid strand
  • Serves as a starting point of DNA replication

Plasmid Vector
• Vector: an agent that can carry a DNA fragment into a host cell
• Plasmid:
  • Circular, double-stranded DNA
  • Can confer antibiotic resistance
  • Replicates autonomously
  • Exists in bacteria

DNA Cloning Using Plasmids as Vectors
(a) DNA recombination
(b) Transformation

DNA Cloning Using Plasmids as Vectors (Cont.)
(c) Selective amplification
(d) Isolation of desired DNA clones

DNA Library Screening
• Probe:
  • Labeled with a radioisotope or fluorescence
  • Used to detect specific DNA sequences by hybridization
• Hybridization: binding of two nucleic acid chains by base pairing
• DNA library screening: determine, for each clone, whether it contains a probe from a given set of probes
• Positive clone: a clone that contains a probe
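Returning to the genetic code from the Translation slide, the codon-to-amino-acid mapping can be made concrete with a short sketch. It assumes the standard genetic code, written with DNA letters (T in place of the U that appears in mRNA), one-letter amino acid symbols, and `*` for a stop codon. The names `CODON_TABLE` and `translate` are illustrative.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (third base varying fastest). '*' marks the three stop codons.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # codons starting with T
    "LLLLPPPPHHQQRRRR"   # codons starting with C
    "IIIMTTTTNNKKSSRR"   # codons starting with A
    "VVVVAAAADDEEGGGG"   # codons starting with G
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_dna: str) -> str:
    """Translate a coding sequence codon by codon, stopping at a stop codon.

    A trailing fragment shorter than three bases is ignored.
    """
    protein = []
    for i in range(0, len(coding_dna) - 2, 3):
        amino_acid = CODON_TABLE[coding_dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(len(CODON_TABLE))        # 64 codons
print(translate("ATGGCCTAA"))  # MA
```

The table has 64 entries but only 20 distinct amino acids plus the stop signal, which is why a gene is roughly three times longer than its protein and why several codons map to the same amino acid.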
Some Computational Problems
• Pooling design
• Non-unique probe selection
• Sequence alignment, multiple sequence alignment
• DNA sequencing
• Genome rearrangement
• Protein structure prediction and recognition
• Protein-protein interactions
• Functional groups, modules

Pooling Designs
• Problem definition:
  • Given a set of n clones with at most d positive clones
  • Identify all positive clones with the minimum number of tests
• Pool: a subset of clones
• Positive pool: a pool that contains at least one positive clone

Pooling Designs
• M is a t × n binary matrix whose rows correspond to pools p1, ..., pt and whose columns correspond to clones c1, ..., cn
• M[i, j] = 1 iff the ith pool contains the jth clone
• Testing the t pools yields the t × 1 outcome vector V(D)
• Decoding algorithm: given M and V(D), identify all positive clones

Challenges
• Challenge 1: how to construct the binary matrix M such that the outcome vectors (unions of columns) produced by any two distinct sets of at most d positive clones are distinct
• Challenge 2: how to design a decoding algorithm with efficient time complexity, O(tn)

Probe Selection
• Problem definition:
  • Given a biological sample (e.g., blood) and a set of probes
  • Identify the presence (or absence) of some biological objects (e.g., viruses or bacteria) with the minimum number of probes

Unique Probes vs. Non-unique Probes
• Unique probes:
  • Gene-specific probes or signature probes
  • Difficult to find
• Non-unique probes:
  • Hybridize to more than one target
  • Difficult to decode the results

Probe-Target Matrix
• 12 probe candidates, 4 targets (genes)
• For a target set S, define P(S) as the set of probes reacting to any gene in S
• P({1, 2}) = {1, 2, 3, 4, 7, 8, 9, 10, 12}
• P({2, 3}) = {1, 3, 4, 5, 6, 7, 8, 9, 12}
• Symmetric set difference: P({1, 2}) ∆ P({2, 3}) = {2, 5, 6, 10}, the probes that separate the two sets
Sequence Alignment
• Problem definition:
  • Given: 2 DNA or protein sequences
  • Find: the best match between them
• What is an alignment?
  • Given: 2 strings S and S'
  • Goal: make the lengths of S and S' the same by inserting spaces into these strings

    A - T C - A
    - C T C A A

Matches, Mismatches and Indels
• Match: two aligned, identical characters in an alignment
• Mismatch: two aligned, unequal characters
• Indel: a character aligned with a space

    A A C T A C T - C C T A A C A C T - -
    - - C T C C T A C C T - - T A C T T T

  10 matches, 2 mismatches, 7 indels

Basic Algorithmic Problem
• Find the alignment of the two strings that maximizes m, where m = (# matches) - (# mismatches) - (# indels)
• The maximum m defines the similarity of the two strings; an alignment that achieves it is called an optimal global alignment
• Biologically: a mismatch represents a mutation, whereas an indel represents a historical insertion or deletion of a single character

Multiple Sequence Alignment
• Problem definition: similar to the sequence alignment problem, but the input has more than 2 strings
• Challenges:
  • NP-hard
  • Guaranteed approximation factor: 2 - 2/k, where k is the number of input sequences
  • More work is needed to reduce the time and space complexity

DNA Sequencing
• Problem definition:
  • Given a set of fragments that are contained in a DNA string S
  • Goal: determine the string S
• NP-complete
• Further complicated by the existence of repetitive sequences in the genome
• Can be cast as a Hamiltonian path or Euler path problem (an approach introduced by Pavel Pevzner)

Genome Rearrangement
• Problem definition:
  • Given the genomes of 2 different species
  • Goal: find a sequence of evolutionary events that turns the first genome into the second
• Biological motivation: how close are these species, and how much evolution separates them?
• E.g., we usually test new drugs on mice before humans. However, how close is a mouse to a human?

Genome Rearrangement
• Can we use the solutions of sequence alignment to solve this problem?
• Answer: no, because:
  • A genome is a very long string (about 3 billion letters for the human genome)
  • The sequence alignment model is not appropriate for genome comparison, since the differences are not insertions, deletions, or mutations of single nucleotides but rearrangements of long DNA regions
  • The basic unit of comparison is the gene

An Example
[Figure: two strings]
• Sequence alignment cannot meaningfully compare these two strings
• However, the second string is the first string after reversing the fragment AATGGT…CCC

Main Evolutionary Events
• Deletions: a fragment is removed
• Duplications: many copies of a fragment are created and inserted into different positions
• Transpositions: a fragment is removed and reinserted into a different position
• Inversions: a fragment is removed, reversed, and then reinserted into the same position
• Translocations: a pair of fragments is exchanged between the ends of two chromosomes

Protein Structure Prediction
• Problem definition:
  • Given: a sequence of amino acids
  • Goal: predict the 3D structure of the protein
• Some approaches:
  • Determine the positions of a protein's atoms so as to minimize the total free energy
  • Find similarities to known proteins

Community Structure
• Problem definition:
  • Given a graph G = (V, E) representing a network
  • Partition G into a set of subgraphs (the community structure) so that the nodes in each subgraph are highly connected
• Biological motivation: genes with similar expression data may have similar functions. Identifying the community structure can help reduce the number of tests
• Others: community structure is also studied in other fields
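Returning to the alignment score m = (# matches) - (# mismatches) - (# indels) from the Basic Algorithmic Problem slide, the sketch below does two things. It first counts matches, mismatches, and indels for a given alignment, using the 19-column example from the Matches, Mismatches and Indels slide with `-` standing for a space. It then computes the optimal global alignment score with a standard dynamic program (in the style of Needleman-Wunsch). The slides do not name an algorithm, so that choice and the function names are assumptions.

```python
def score_alignment(row1, row2, gap="-"):
    """Count (matches, mismatches, indels) column by column."""
    matches = mismatches = indels = 0
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            indels += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, indels


def global_alignment_score(s, t):
    """Maximum of m = matches - mismatches - indels over all alignments."""
    # prev[j] holds the best score for aligning the processed prefix of s
    # with t[:j]; aligning an empty prefix with t[:j] costs j indels.
    prev = [-j for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [-i]  # s[:i] against the empty string: i indels
        for j in range(1, len(t) + 1):
            diagonal = prev[j - 1] + (1 if s[i - 1] == t[j - 1] else -1)
            gap_in_t = prev[j] - 1      # s[i-1] aligned with a space
            gap_in_s = curr[j - 1] - 1  # t[j-1] aligned with a space
            curr.append(max(diagonal, gap_in_t, gap_in_s))
        prev = curr
    return prev[len(t)]


row1 = "AACTACT-CCTAACACT--"
row2 = "--CTCCTACCT--TACTTT"
print(score_alignment(row1, row2))  # (10, 2, 7)

s = row1.replace("-", "")
t = row2.replace("-", "")
print(global_alignment_score(s, t))
```

The slide's alignment scores 10 - 2 - 7 = 1, so the optimal score printed by the second call is at least 1. The dynamic program uses O(|s| · |t|) time and, by keeping only two rows, O(|t|) space; recovering the alignment itself would require storing the full table or using a divide-and-conquer variant.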