
Evolutionary Genome Biology
Gabor T. Marth, D.Sc.
Department of Biology, Boston College
marth@bc.edu
Lecture overview
1. Inter-species evolution and comparative genomics
2. Intra-species evolution, population genomics, and human origins
1. Inter-species evolution and comparative genomics
Initial sequencing and comparative analysis of the mouse genome
Mouse Genome Sequencing Consortium
Nature 420, 520-562. 2002
Questions of Evolutionary Biology
• What are the taxonomic relationships between living organisms (which organisms are more or less closely related to each other)?
• How do genes evolve?
• How do genomes evolve?
• How do comparisons with other organisms help us understand our own genome?
Mechanisms of molecular evolution
DNA sequence evolution: mutations
Phylogenetic relationships (1)
Higgs and Attwood, Bioinformatics and Molecular Evolution, Blackwell Publishing
Multiple alignment of mammalian mitochondrial small subunit rRNA sequences
Phylogenetic relationships (2)
Higgs and Attwood, Bioinformatics and Molecular Evolution, Blackwell Publishing
Jukes-Cantor distance matrix for mammalian mitochondrial small subunit rRNA sequences
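The Jukes-Cantor distance corrects the observed fraction of differing sites, p, for multiple substitutions at the same site: d = -(3/4) ln(1 - 4p/3). The sketch below shows how one entry of such a matrix could be computed. The function names and the toy sequences are illustrative assumptions, not the rRNA data from the slide.

```python
import math

def p_distance(seq_a, seq_b):
    """Fraction of aligned, ungapped sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    return sum(a != b for a, b in pairs) / len(pairs)

def jukes_cantor(p):
    """Jukes-Cantor distance d = -(3/4) ln(1 - 4p/3), in substitutions per site."""
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def jc_matrix(seqs):
    """Jukes-Cantor distance for every unordered pair in a {name: sequence} dict."""
    names = list(seqs)
    return {(a, b): jukes_cantor(p_distance(seqs[a], seqs[b]))
            for i, a in enumerate(names) for b in names[i + 1:]}

# Hypothetical toy alignment (not real rRNA sequences).
toy = {"sp1": "ACGTACGTAC", "sp2": "ACGTACGTAT", "sp3": "ACGTTCGTAT"}
for pair, d in jc_matrix(toy).items():
    print(pair, round(d, 4))
```

The corrected distance is always at least as large as p, and the gap widens as sequences diverge.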
Phylogenetic relationships (3)
Higgs and Attwood, Bioinformatics and Molecular Evolution, Blackwell Publishing
Phylogenetic tree constructed from mammalian mitochondrial small subunit rRNA sequences
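The slide does not say which tree-building method was used. As one simple illustration of going from a distance matrix to a tree, the sketch below uses UPGMA (average-linkage clustering), which is only an assumed stand-in. The taxon labels and distances are hypothetical.

```python
def upgma(names, matrix):
    """Build a tree topology by UPGMA.

    names:  list of taxon labels.
    matrix: square list of lists with pairwise distances, matrix[i][j].
    Returns a Newick-style topology string without branch lengths.
    """
    # Cluster id -> (subtree string, number of taxa in the cluster).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Distances keyed by (smaller id, larger id).
    dist = {(i, j): matrix[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted average of the two old distances.
        new_dist = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            new_dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        # Drop every distance that involves one of the merged clusters.
        dist = {pair: d for pair, d in dist.items()
                if i not in pair and j not in pair}
        dist.update(new_dist)
        clusters[next_id] = ("(" + tree_i + "," + tree_j + ")", n_i + n_j)
        next_id += 1
    (tree, _), = clusters.values()
    return tree + ";"

# Hypothetical four-taxon distance matrix.
labels = ["A", "B", "C", "D"]
d = [[0.0, 0.1, 0.4, 0.5],
     [0.1, 0.0, 0.4, 0.5],
     [0.4, 0.4, 0.0, 0.2],
     [0.5, 0.5, 0.2, 0.0]]
print(upgma(labels, d))
```

UPGMA assumes a constant rate of evolution along all lineages; methods such as neighbor joining relax that assumption.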
Gene structure evolution: duplications
Gene duplication – paralogs
Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
Evolution of chromosome organization
Synteny
Initial sequencing and comparative analysis of the mouse genome
Mouse Genome Sequencing Consortium
Nature 420, 520-562. 2002
Gene classes across organisms
Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
Gene conservation across organisms
Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
Comparative genomics helps gene annotations
2. Intra-species evolution, population genomics, and human origins
Questions about human evolution
• How do we discover / assess genetic variations?
• What is the level of diversity across humans?
• How can we model the ancestral and mutation processes?
• What do phylogenetic analyses of human mitochondrial sequences tell us about human origins and dispersal?
• Does mitochondrial DNA give us the full picture?
• What do we learn from model-fitting analysis of nuclear DNA?
• A single wave of out-of-Africa migration or multiple waves?
Human genetic diversity
• average polymorphism rate between a pair of human chromosomes: 1 SNP in 1,300 bp of sequence
• polymorphism density along chromosomes varies widely
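To put the average rate in other units, the short calculation below converts "1 SNP in 1,300 bp" to a per-bp rate, a rate per 10,000 bp, and an expected number of differences across a genome. The 3 Gb genome size is the round figure the lecture itself uses for the human genome; the variable names are illustrative.

```python
BP_PER_SNP = 1_300               # from the slide: 1 SNP in 1,300 bp
GENOME_SIZE_BP = 3_000_000_000   # ~3 Gb, round figure for the human genome

pairwise_rate = 1 / BP_PER_SNP                # differences per bp
rate_per_10kb = pairwise_rate * 10_000        # differences per 10,000 bp
expected_differences = GENOME_SIZE_BP * pairwise_rate

print(f"rate per bp:        {pairwise_rate:.2e}")
print(f"rate per 10,000 bp: {rate_per_10kb:.2f}")
print(f"expected differences between two genome copies: {expected_differences:,.0f}")
```

This is only the genome-wide average; as the slide notes, local density departs from it widely.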
What explains heterogeneity?
• G+C nucleotide content
• CpG di-nucleotide content
• recombination rate
• functional constraints
[Figure: SNP rate (per 10,000 bp) plotted against G+C content (%), CpG content (%), and recombination rate (per Mb)]
SNP rate by region:
3’ UTR: 5.00 x 10^-4
5’ UTR: 4.95 x 10^-4
Exon, overall: 4.20 x 10^-4
Exon, coding: 3.77 x 10^-4
synonymous: 366 / 653
non-synonymous: 287 / 653
The variance is so high that these quantities are poor predictors of nucleotide diversity in local regions. Hence random processes, i.e. (random) genetic drift, are likely to govern the basic shape of the genome variation landscape.
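Genetic drift can be illustrated with a Wright-Fisher simulation, in which each gene copy in the next generation draws its allele at random from the current generation. This is a minimal sketch under that textbook model, with no mutation or selection; the function name and parameter values are illustrative assumptions.

```python
import random

def wright_fisher(n_copies, start_freq, generations, seed=None):
    """Simulate the frequency of one allele under pure drift.

    n_copies:    number of gene copies in the population (constant).
    start_freq:  initial frequency of the allele.
    generations: number of generations to simulate.
    Returns the list of allele frequencies, one per generation.
    """
    rng = random.Random(seed)
    count = round(start_freq * n_copies)
    trajectory = [count / n_copies]
    for _ in range(generations):
        freq = count / n_copies
        # Each copy in the next generation picks its parent allele at random.
        count = sum(rng.random() < freq for _ in range(n_copies))
        trajectory.append(count / n_copies)
    return trajectory

# Replicate populations started at the same frequency drift apart.
for replicate in range(3):
    path = wright_fisher(n_copies=100, start_freq=0.5, generations=50, seed=replicate)
    print(f"replicate {replicate}: final frequency {path[-1]:.2f}")
```

Smaller populations drift faster, which is why population size history leaves a signature in variation data.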
The origin of genetic variations
• sequence variations are the result of mutation events
• mutations are propagated down through generations
• and determine present-day variation patterns
[Figure: genealogy descending from the MRCA, in which a mutation changes TAAAAAT to TAACAAT; present-day sequences carry either TAAAAAT or TAACAAT]
Recombination messes up phylogenies
[Figure: genealogy of sequences carrying two variants, accgttatgtaga and acggttatgtaga]
• because of recombination, DNA sequences may not have a unique common ancestor, hence phylogenetic analysis may not apply
What does mtDNA say about human origins?
However, the mitochondrion is only a single locus (~16 kb, short on the scale of the 3 Gb human genome)
Campbell and Heyer. Genomics, Proteomics, Bioinformatics. Cummings.
What does nuclear DNA say?
• Because of recombination, phylogenetic analysis is not feasible (there is not a unique tree that can explain the ancestry of DNA sequences)
• Instead, one uses statistical “genetic analysis”, i.e. one examines the statistical properties of the possible ancestries that produced the nucleotide sequences observed in individuals
Polymorphism data
1. marker density (MD): distribution of number of SNPs in pairs of sequences
Clone 1    Clone 2    # SNPs
AL00675    AL00982    8
AS81034    AK43001    0
CB00341    AL43234    2
[Figure: MD histogram; x axis 0 to 10 SNPs, y axis 0 to 0.3]
2. allele frequency spectrum (AFS): distribution of SNPs according to allele frequency in a set of samples
SNP    Minor allele    Allele count
A/G    A               1
C/T    T               9
A/G    G               3
[Figure: AFS histogram; x axis allele counts 1 to 10, running from “rare” to “common”, y axis 0 to 0.1]
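An allele frequency spectrum can be built from a list of per-SNP allele counts by asking what fraction of SNPs fall into each count class. The sketch below uses the three allele counts from the table (1, 9, 3) and bins them into classes 1 to 10 as on the histogram axis; the function name and the `max_count` parameter are illustrative assumptions.

```python
from collections import Counter

def allele_frequency_spectrum(allele_counts, max_count):
    """Fraction of SNPs in each allele count class 1..max_count."""
    tally = Counter(allele_counts)
    total = len(allele_counts)
    return [tally[i] / total for i in range(1, max_count + 1)]

# Allele counts for the three SNPs listed in the table.
spectrum = allele_frequency_spectrum([1, 9, 3], max_count=10)
for count_class, fraction in enumerate(spectrum, start=1):
    print(count_class, round(fraction, 3))
```

Counts are only comparable across SNPs when every SNP is typed in the same number of samples, so real data sets are usually restricted or projected to a common sample size first.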
Population genetic models
[Figure: four population size histories (stationary, collapse, expansion, bottleneck), each drawn from past to present, with the marker density (MD, simulation) and allele frequency spectrum (AFS, direct form) distributions shown for each]
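For the stationary (constant-size) case there is a standard neutral expectation to compare against: the expected number of SNPs at which the derived allele appears i times in a sample is proportional to 1/i. The sketch below computes that baseline as fractions of all SNPs. It is a textbook result rather than the lecture's own calculation, and the function name is an illustrative assumption.

```python
def neutral_afs(n_samples):
    """Expected fraction of SNPs in each derived-allele count class 1..n_samples-1
    under a constant-size neutral model, where E[xi_i] is proportional to 1/i."""
    weights = [1 / i for i in range(1, n_samples)]
    total = sum(weights)
    return [w / total for w in weights]

baseline = neutral_afs(11)  # 11 sampled chromosomes give count classes 1..10
for count_class, fraction in enumerate(baseline, start=1):
    print(count_class, round(fraction, 3))
```

Relative to this baseline, population growth tends to add rare alleles, while a collapse or bottleneck tends to remove them, which is what makes the AFS informative about population history.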
Data fitting: polymorphism density
• best model is a bottleneck-shaped population size history
[Figure: marker density distributions at 4 kb, 8 kb, 12 kb, and 16 kb]
[Figure: best-fitting population size history up to the present, with N3 = 11,000; N2 = 5,000, T2 = 400 gen.; N1 = 6,000, T1 = 1,200 gen.]
Marth et al.
PNAS 2003
• our conclusions from the marker density data are confounded by the unknown ethnicity of the public genome sequence we looked at, so we turned to allele frequency data from ethnically defined samples
Data fitting: allele frequency
[Figure: allele frequency spectra, allele counts 1 to 10]
model consensus: bottleneck
[Figure: consensus population size history up to the present, with N3 = 10,000; N2 = 2,000, T2 = 400 gen.; N1 = 20,000, T1 = 3,000 gen.]
bottleneck ~ 3,000 generations (or 100,000 years) ago
Data from other human populations
European data: bottleneck
African data: modest but uninterrupted expansion
Marth et al.
Genetics 2004
What nuclear DNA tells us
[Figure: our results set against the Recent African Origin and Multiregional models]