Analysis of DNA sequences
MVE235 - Matematisk Orientering 2016
Erik Kristiansson
erik.kristiansson@chalmers.se

The plan for this lecture
• Big data
• DNA and DNA sequencing
• Metagenomics – analysis of the hidden biodiversity
• Application to antibiotic resistance

Large datasets: Internet
• Google search engine: 1 trillion web pages
• Google Maps: >20 petabytes of data
• Facebook: 300 petabytes of daily data

Large datasets: Astronomy
• Hubble: 140 gigabytes/week
• Very Large Array in New Mexico: 100 gigabytes/second

Large datasets: CERN
• Large Hadron Collider: 30 petabytes/year

Genes, RNA and proteins

Large datasets: Molecular biology
• Genome size (in bases: A, C, G or T)
  – Bacteria: 5 million bases
  – Humans: 3.2 billion bases
  – Amoeba: 670 billion bases
• A human chromosome: 10 µm long, containing 10 cm of DNA and 200,000,000 bases (A, C, G, T)
• 1000 times higher information density than a modern hard drive!

1 gram of soil
• 100 million bacteria
• DNA: >100 terabases (10^14)

History of DNA sequencing
Watson & Crick, Fred Sanger
• Structure of DNA discovered in 1953.
• Rapid DNA sequencing developed by Frederick Sanger in 1977.
History of DNA sequencing
Bacteriophage Phi X 174
• 11 genes, 5,386 bases
• Finished 1977
Haemophilus influenzae
• 1800 genes, 1.8 million bases
• Finished 1995

History of DNA sequencing
Saccharomyces cerevisiae
• 6000 genes and 12 million bases
• Finished 1997; the project took 7 years
Homo sapiens
• Genome consists of ~21,000 genes and 3.25 billion bases
• Finished 2003; the project took 13 years

[Figure: number of DNA bases sequenced per year and person (log scale) against time: 1965, 1977, 1986, 1995, 2003, today. Stages marked: First sequence, Sanger sequencing, Partial automation, Next generation DNA sequencing (>100,000,000,000,000 bases/year).]

First generation DNA sequencing
ATTTCCGGCATCTGACGATAGAAGAGGTG
AGGCAACACTCCTACGGGAGGCAGCAGTG
GGGAATTTTGGACAATGGACGCAAGTCTG

Next generation DNA sequencing
ACTCCTACGGGAGGCAGCAGTGGGGAATT
TTGGACAATGGACGCAAGTCTGATCCAGC
CATTCCGTGTGCAGGACGAAGGCCTTCGG
GTTGTAAACTGCTTTTGTACAGAACGAAA
AGGTCTCTATTAATACTAGGGGCTCATGA
CGGTACTGTAAGAATAAGCACCGGCTAAC
CTCAGATCGTCGCTGTCTCTGCCAGTTAA
TCGCCATCTCTGCCAGTTAATCGCCATCT
CTGCCAGTTAATCGCTATCTCTGCCAGTT
AATCGCCATCTCTGCCAGTTAATCGCCAT
CTCTGCCAGTTAATCGCCATCTCTGCCAG
TTAATCGCCATCTCTGACGAAATCCACCG

Introduction of high-throughput sequencing

Genome sequencing and Kryder's Law

Data from DNA sequencing
AAGAGCCTAGCATGACTGCACAGGATAGGTGCCTAGTTAATACTGACCTCTCATTCCCTTCCACCTCTGCTAAATAAAGGGCTCGATTTCTTTAAA
AACCAATCCGCGGCATTTAGTAGCGGTAAAGTTAGACCAAACCATGAAACCAACATAAACATTATTGCCCGGCGTACGGGGAAGGACGTCAATAGT
ATAAGTGTCTTCTTTTGAGAAGTGTCTGTTCATATACTTCGCCCACTTGTTGATGGGGTTGTTTGTTTTTTTCTTGTAAATTTGTTTGAGTTCATT
CAATTTTGAGTTAGTAGGTTTGCCTAAGCAGAAATTGGATCTTTTATATCATCACGATTAAATACTCAAAACAGTATTTAAGCACAGTATTTAAAT
ACTCATGCACATTTCTGAAGCAGGCTTGAATTTCATCCCATAATATGGATTTATCTTTTCTACTATATGATCAGGTTGCGAATTTTCCAAACTTTT
AATTATAAGGGGATTTGGGAAATTTAAAGGGTGGATAGAATATTTATTTTGTAGTTCTGTTTGGGTTTAGATAATGTAAATAACGTGTCATTCAGC
ATCGATTGCATAAAATCATTTTGTTTGTCTGAGCCCAACAACAGGGAATCCATGGCTTGTTCCTCCAGAATGGGCAGCAACATGCAAATAACTGTA
CTGTTTCAGTGGGATCTCAGGGGAAACCCCTCCACTGTAACTCAGATGCCAGTCTCCACATGAGACCAAATCTTCTCAGTAAGAAAAGCACTGCAG
ATATGAATTTGGGCAAGTTATTGATTTGGGCAACCCTTGATCCTCAGTGTCCTCAGCTTTAAAATGGCAATGATAAATAGTACCTGTTACTCAGTT
CCTCTTCTGCCTGCTGGGCGCACCTGCCCACGCGGTATCCATACCCGGCGTTACAACCACAACGACAACGGACTCAACGACTGAACCGGCCCCGGA
ATATGAATTTGGGCAAGTTATTGATTTGGGCAACCCTTGATCCTCAGTGTCCTCAGCTTTAAAATGGCAATGATAAATAGTACCTGTTACTCAGTT
TTCTGAGACTGAACTAGGCAAAGTCAGAAACATTGTTATAATTTGTTAGTGATGTCTGTTATAGAGAAGAAAGTGGGGAATGGGGCATACATGTCA
CTCCATTTTAATAAGATAGAGAGATGAAAGTGTATTACTGGTGGGATTATGTTAGAAAACACATTTCTTGTCCCAGTAGCATTCAAGATCAAGAGT
CTCCATTTTAATAAGATAGAGAGATGAAAGTGTATTACTGGTGGGATTATGTTAGAAAACACATTTCTTGTCCCAGTAGCATTCAAGATCAAGAGT

Analysis of DNA sequences

Applications
• Medicine - personal genomics
• Industrial biotechnology
• Research
  – Sequencing of new genomes
  – Evolution
  – Tumor biology (cancer)
  – Infectious diseases and antibiotic resistance

Example 1 – The reconstruction of a genome

Genome sequencing
Genome → sequencing of random fragments (100 bases each) → DNA fragments
Can we recreate the genome based on the fragments?

Genome assembly
DNA fragments
1. Compare all fragments with each other. Save results!
2. Identify the fragments with the best overlap and merge them.
3. Repeat.
The saved comparisons form an n×n matrix, where n = number of fragments.

Genome assembly
Genome → Fragments → Assembly → Reconstructed genome

Genome assembly - challenges
• Computationally heavy
  – Computational complexity: O(n^2)
  – Memory complexity: O(n^2)
• Sequencing errors
• Area of research: faster and more efficient algorithms needed!

Assembly of the spruce genome
• Large and complex genome
  – 20 gigabases (6 times our genome)
  – Many repetitive regions
• Assembly statistics
  – 1 terabase (10^12), 10 billion fragments
  – Assembly had to be done on a computer with 2 TB RAM
  – Results: 3 million regions corresponding to 30 % of the genome

Metagenomics

Microorganisms are everywhere!
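The three assembly steps listed under "Genome assembly" (compare all fragments, merge the best-overlapping pair, repeat) can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the function names `overlap` and `greedy_assemble` and the `min_len` threshold are made up for this example, the reads are error-free, and the overlaps are recomputed in every round instead of being saved in the n×n matrix. Real assemblers use far more efficient data structures.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the best overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        # Step 1: compare all ordered pairs (this is the n x n comparison).
        best_k, best_i, best_j = 0, -1, -1
        for i, j in permutations(range(len(contigs)), 2):
            k = overlap(contigs[i], contigs[j], min_len)
            if k > best_k:
                best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more
        # Step 2: merge the best pair, keeping the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
        # Step 3: repeat with the shorter list.
    return contigs


# Toy "genome": the first 24 bases of the first read shown in the lecture.
genome = "ATTTCCGGCATCTGACGATAGAAG"
fragments = [genome[0:10], genome[5:15], genome[10:20], genome[14:24]]
print(greedy_assemble(fragments))  # ['ATTTCCGGCATCTGACGATAGAAG']
```

With n fragments the pairwise comparison touches n(n-1) ordered pairs per round, which is where the quadratic cost on the challenges slide comes from.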
1000 species, 10 000 species

Bacteria
Number of microbes on Earth                   10^30
Number of microbes in all humans              10^23
Number of stars in the universe               10^21
Number of bacterial cells in one human gut    10^14
Number of human cells in one human            10^13
Number of bacterial genes in one human gut    3,000,000
Number of genes in the human genome           21,000

Bacteria
Number of species                  10 000 000
Formally named species             50 000
Species with sequenced genomes     10 000
Most bacteria have never been observed!
• 1-5 million bases
• 1000-5000 genes

1 gram of soil
• 10 000 species
• 100 million cells
• DNA: 100 terabases (10^14)
Total sequencing to date: less than 1% of the DNA in a liter of ocean water.

Metagenomics
ATTTCCGGCATCTGACGAT
AACTCCTACGGGAGGCAGC
AGCTCAGATCGTCGCTGTC
TCTCACGAAATCCACCGTC
TCTTGAATTCGGCCATACG
Sample with microorganisms → DNA → Metagenome

The metagenome
ATTTCCGGCATCTGACGATAGAAGAAGGTGAGGCAAC
ACTCCTACGGGAGGCAGCAGTGGGGAATTTTGGACAATG
GACGCAAGTCTGATCCAGCCATTCCGTGTGCAGGACGAA
GGCCTTCGGGTTGTAAACTGCTTTTGTACAGAACGAAAA
GGTCTCTATTAATACTAGGGGCTCATGACGGTACTGTAA
GAATAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGT
CTCAGATCGTCGCTGTCTCTGCCAGTTAATCGCCATCTC
TGCCAGTTAATCGCCATCTCTGCCAGTTAATCGCTATCT
CTGCCAGTTAATCGCCATCTCTGCCAGTTAATCGCCATC
TCTGCCAGTTAATCGCCATCTCTGCCAGTTAATCGCCAT
CTCTG
CACGAAATCCACCGTCTCTTTCTCAATGTCAGAAAGCAT
GAATTCGGCCATACGCTCAAGCCGGGCCTCGGTATAACG
CATGGCCGCTGGCTCATCACCATCCTGGTTGCCGAAGTT
TCCCTGGCCATCAACAAGGGTGTAGCGCATATTCCACTC
CTGCGCCAGGCGCACCATGGCGTCGTAAATGGCTTTATC
GCCATGCGGGTGGTATTTACCCATCACCTCTCCCACAAT
ACGGGCACTCTTCTTATAGGGCTTTCCGTAATCGAGCCC
CAATTCATTCATGGCGTAAAGTACGCGACGGTGTACC
Microorganisms

Metagenomic data revolution
[Figure: sizes of metagenomic projects in gigabases (axis from 1 to 10 000 000) by year, 2006 to 2016.]

Analysis of genes in metagenomes
Metagenome → Gene quantification → Statistical analysis → Biological results

Analysis of genes in metagenomes
1. Raw data
2. Genome reconstruction
3. Identification of genes
4. Mapping and counting

Analysis of genes in metagenomes
Gene 1      173       237       71        209       41
Gene 2      37        72        14        36        24
Gene 3      627       2751      488       691       1522
Gene 4      194       250       86        211       89
Gene 5      2         8         1         11        0
            5.3×10^7  7.9×10^7  2.3×10^8  1.9×10^7  6.6×10^7

Analysis of genes in metagenomes
1. High dimension
  – Thousands of genes
  – Few samples
2. Discrete
3. High variability
  – Sampling of DNA fragments
  – Technical errors
  – Biological variability
4. Vastly undersampled
  – We cannot sequence everything
[Figure: variance plot for the human gut data, both axes running from 0.01 to 10000; caption: Technical and biological variability.]

Analysis of genes in metagenomes
• Global variance structure
• Gene-specific variance
• Sequencing of DNA
• Missing genes

Example 2: Analysis of antibiotic resistance genes in the environment

Antibiotics
Alexander Fleming, Penicillin-producing fungi
• Antibiotic resistance is caused by
  1. Mutations in pre-existing DNA
  2. Acquisition of resistance genes

[Map of sampling sites: Downstream 3, PETL, Discharge site, Downstream 2, Downstream 1, Upstream 1 & 2.]
• Downstream of the Indian plant: high levels of antibiotics
• Upstream of the Indian plant: moderate levels of antibiotics
• A nearby lake: moderate levels of antibiotics
• Swedish sewage treatment plant: no antibiotics
Aim: Investigate abundance of resistance genes in these three places using metagenomics.

[Figure: relative abundance (%) of sulfonamide resistance genes; values shown in the plot: 1060, 475, 122, 0, 0, 11, 9, 475.]
[Figure: relative abundance (%) of aminoglycoside resistance genes.]

Indian lake polluted by antibiotics

Example 3: The hunt for new antibiotic resistance genes

The story of the gene 'NDM'
• First discovered in 2009 in a patient traveling from India to Sweden.
• One year later, the gene had spread globally.
[Figure: global spread of NDM, 2010.]

The story of the gene 'NDM'
[Figure: spread of NDM in Europe, 2015.]

The hunt for new resistance genes
Question
Can we use metagenomes to identify new types of resistance genes?
Challenges
• What does a 'new' gene look like?
• Very big datasets
• Only small pieces of DNA available

The hunt for new resistance genes
1. Probabilistic modeling of resistance genes

The hunt for new resistance genes
2. Use the model to search for new genes
Gene model
Billions of DNA fragments
ACAGGTACTTCCTTTACAGACA
AAAAATGTCAGACAGCCAAGA
AATTGTGATCGCAATTACCCCA
GACTTTTCGATTTAGGAGCTTC
TTCCATCTGCTCAGCGCACCGC
TCTCCTCTACCATCTCTCTTATC
TCTGTTTGGCAAAAACCTGGTT
TCCACACTTCTGCCTGCGCTGA

NoCURE – identification of new resistance genes
3. Recreate and test the gene
Gene fragments → Reconstructed genes → ?
New superbug?

Biostatistics, Biomathematics, Bioinformatics
Mathematics, Biology, Computer Science

Biostatistics, Biomathematics, Bioinformatics
• An interdisciplinary area
• Statistical and mathematical modelling
• Exploration of big and complex datasets
• Applications in biology and medicine
• Describe and understand life - and its randomness!

Biostatistics, Biomathematics, Bioinformatics
• Fun and important questions
• Theoretical and applied topics
• Excellent career opportunities
  – Industry (biotech companies, core facilities, hospitals)
  – Academia (PhD positions)

erik.kristiansson@chalmers.se
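Returning to the gene count table under "Analysis of genes in metagenomes": raw counts from different samples are hard to compare directly because the samples were sequenced to very different depths. A minimal sketch of one common remedy, scaling each count to "counts per million fragments", is given below. It assumes that the unlabeled bottom row of the table holds the total number of sequenced fragments per sample; both that reading and the helper name `per_million` are assumptions made for illustration, not something stated in the lecture.

```python
# Raw counts copied from the table (one list entry per sample).
counts = {
    "Gene 1": [173, 237, 71, 209, 41],
    "Gene 2": [37, 72, 14, 36, 24],
    "Gene 3": [627, 2751, 488, 691, 1522],
    "Gene 4": [194, 250, 86, 211, 89],
    "Gene 5": [2, 8, 1, 11, 0],
}

# ASSUMPTION: the unlabeled bottom row is the total number of fragments per sample.
totals = [5.3e7, 7.9e7, 2.3e8, 1.9e7, 6.6e7]


def per_million(counts, totals):
    """Scale raw counts to counts per million sequenced fragments in each sample."""
    return {
        gene: [1e6 * c / t for c, t in zip(row, totals)]
        for gene, row in counts.items()
    }


scaled = per_million(counts, totals)
print([round(x, 1) for x in scaled["Gene 3"]])  # [11.8, 34.8, 2.1, 36.4, 23.1]
```

After scaling, the third sample no longer looks similar to the first for Gene 3: its raw count of 488 comes from a much deeper sequencing run. The scaling does nothing about the discreteness and high variability listed on the following slide, which is why a statistical model is still needed.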