Evolutionary Genome Biology
Gabor T. Marth, D.Sc.
Department of Biology, Boston College
marth@bc.edu

Lecture overview
1. Inter-species evolution and comparative genomics
2. Intra-species evolution, population genomics, and human origins

1. Inter-species evolution and comparative genomics

Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520-562. 2002

Questions of Evolutionary Biology
• What are the taxonomic relationships between living organisms (which organisms are more or less closely related to each other)?
• How do genes evolve?
• How do genomes evolve?
• How do comparisons with other organisms help us understand our own genome?

Mechanisms of molecular evolution

DNA sequence evolution: mutations

Phylogenetic relationships (1)
[Figure: multiple alignment of mammalian mitochondrial small subunit rRNA sequences. Higgs and Attwood, Bioinformatics and Molecular Evolution, Blackwell Publishing]

Phylogenetic relationships (2)
[Figure: Jukes-Cantor distance matrix for mammalian mitochondrial small subunit rRNA sequences. Higgs and Attwood, Bioinformatics and Molecular Evolution, Blackwell Publishing]

Phylogenetic relationships (3)
[Figure: phylogenetic tree constructed from mammalian mitochondrial small subunit rRNA sequences. Higgs and Attwood, Bioinformatics and Molecular Evolution, Blackwell Publishing]

Gene structure evolution: duplications
Gene duplication: paralogs
Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001

Evolution of chromosome organization
Synteny
Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520-562. 2002

Gene classes across organisms
Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001

Gene conservation across organisms
Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001

Comparative genomics helps gene annotations

2.
Intra-species evolution, population genomics, and human origins

Questions about human evolution
• How do we discover / assess genetic variations?
• What is the level of diversity across humans?
• How can we model the ancestral and mutation processes?
• What do phylogenetic analyses of human mitochondrial sequences tell us about human origins and dispersal?
• Does mitochondrial DNA give us the full picture?
• What do we learn from model-fitting analysis of nuclear DNA?
• A single wave of out-of-Africa migration or multiple waves?

Human genetic diversity
• Average polymorphism rate between a pair of human chromosomes: 1 SNP in 1,300 bp of sequence
• Polymorphism density along chromosomes varies widely

What explains heterogeneity?
[Figure: SNP rate (per 10,000 bp) plotted against G+C content (%), CpG di-nucleotide content (%), and recombination rate (per Mb)]

Functional constraints
  Region           SNP rate
  3' UTR           5.00 x 10^-4
  5' UTR           4.95 x 10^-4
  Exon, overall    4.20 x 10^-4
  Exon, coding     3.77 x 10^-4
  Coding SNPs: synonymous 366 / 653, non-synonymous 287 / 653

The variance is so high that these quantities are poor predictors of nucleotide diversity in local regions. Hence random processes, i.e. (random) genetic drift, are likely to govern the basic shape of the genome variation landscape.

The origin of genetic variations
• Sequence variations are the result of mutation events
• Mutations are propagated down through generations
• and determine present-day variation patterns
[Figure: genealogy traced back to the MRCA; the sequences shown carry either TAAAAAT or TAACAAT]

Recombination messes up phylogenies
[Figure: genealogy of the sequences accgttatgtaga and acggttatgtaga]
• because of recombination, DNA sequences may not have a unique common ancestor,
hence phylogenetic analysis may not apply

What does mtDNA say about human origins?
However, the mitochondrion is only a single locus (~16 kb, short on the scale of the 3 Gb human genome)
Campbell and Heyer. Genomics, Proteomics, Bioinformatics. Cummings.

What does nuclear DNA say?
• Because of recombination, phylogenetic analysis is not feasible (there is not a unique tree that can explain the ancestry of DNA sequences)
• Instead, one uses statistical “genetic analysis”, i.e. one examines the statistical properties of the possible ancestries that produced the nucleotide sequences observed in individuals

Polymorphism data
1. marker density (MD): distribution of number of SNPs in pairs of sequences
  Clone 1    Clone 2    # SNPs
  AL00675    AL00982    8
  AS81034    AK43001    0
  CB00341    AL43234    2
[Figure: histogram of the fraction of sequence pairs by number of SNPs, 0 to 10]

2. allele frequency spectrum (AFS): distribution of SNPs according to allele frequency in a set of samples
  SNP    Minor allele    Allele count
  A/G    A               1
  C/T    T               9
  A/G    G               3
[Figure: histogram of the fraction of SNPs by allele count, 1 (“rare”) to 10 (“common”)]

Population genetic models
[Figure: four population size histories from past to present (stationary, collapse, expansion, bottleneck), each shown with its MD (simulation) and its AFS (direct form)]

Data fitting: polymorphism density
• best model is a bottleneck-shaped population size history
  N3 = 11,000
  N2 = 5,000, T2 = 400 gen.
  N1 = 6,000, T1 = 1,200 gen. (present)
[Figure: observed and fitted marker density; panels labeled 4 kb, 8 kb, 12 kb, and 16 kb]
Marth et al.
PNAS 2003
• Our conclusions from the marker density data are confounded by the unknown ethnicity of the public genome sequence, so we looked at allele frequency data from ethnically defined samples

Data fitting: allele frequency
• model consensus: bottleneck
  N3 = 10,000
  N2 = 2,000, T2 = 400 gen.
  N1 = 20,000, T1 = 3,000 gen. (present)
• bottleneck ~3,000 generations (or 100,000 years) ago
[Figure: observed and fitted allele frequency spectra, allele counts 1 to 10]

Data from other human populations
• European data: bottleneck
• African data: modest but uninterrupted expansion
Marth et al. Genetics 2004

What nuclear DNA tells us
[Figure: our results compared with the Recent African Origin and Multiregional models]
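
Worked example: allele frequency spectrum

The allele frequency spectrum (AFS) defined under "Polymorphism data" can be tabulated in a few lines. The Python sketch below is only an illustration and is not the fitting procedure of Marth et al. The function names, the sample size of 11 chromosomes, and the use of the three allele counts from the example table as toy input are all assumptions made here. The baseline uses the standard result that, under a constant-size neutral model, the expected number of SNPs at derived-allele count i is proportional to 1/i; treating the slide's minor-allele counts as derived-allele counts is a simplification.

```python
from collections import Counter
from fractions import Fraction


def observed_afs(allele_counts, max_count):
    """Fraction of SNPs observed at each allele count 1..max_count."""
    tally = Counter(allele_counts)
    total = len(allele_counts)
    return [Fraction(tally[i], total) for i in range(1, max_count + 1)]


def neutral_expected_afs(n_chromosomes):
    """Expected fraction of SNPs at derived-allele count i = 1..n-1 under a
    constant-size neutral model (expectation proportional to 1/i)."""
    weights = [Fraction(1, i) for i in range(1, n_chromosomes)]
    norm = sum(weights)
    return [w / norm for w in weights]


# Toy input (assumption): the three allele counts from the example table.
counts = [1, 9, 3]
obs = observed_afs(counts, max_count=10)

# Hypothetical sample of 11 chromosomes, so allele counts run from 1 to 10.
exp = neutral_expected_afs(11)

for i, (o, e) in enumerate(zip(obs, exp), start=1):
    print(f"count {i:2d}: observed {float(o):.3f}  expected {float(e):.3f}")
```

With real data, the observed list would be compared against spectra expected under each candidate population history (stationary, collapse, expansion, bottleneck) rather than against the stationary baseline alone.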