The Human Genome
(part 2 of 2)
Wednesday, November 5, 2003
Introduction to Bioinformatics
Johns Hopkins School of Medicine
ME:440.714
J. Pevsner
pevsner@jhmi.edu
Copyright notice
Many of the images in this powerpoint presentation
are from Bioinformatics and Functional Genomics
by J Pevsner (ISBN 0-471-21004-8).
Copyright © 2003 by Wiley.
These images and materials may not be used
without permission from the publisher.
Visit http://www.bioinfbook.org
The human genome from a
bioinformatics perspective:
International Human Genome Sequencing
Consortium (2001). Initial sequencing and
analysis of the human genome.
Nature 409, 860-921.
To freely access this article, visit
http://www.nature.com/genomics/human/
and click the “human genome” link.
From this URL you can also access
Watson and Crick’s 1953 article on the double helix.
Page 617
Human genome publications
Since 1999, articles have been published on seven
individual human chromosomes:
Chromosome 22 (Dunham et al., 1999)
Chromosome 21 (Hattori et al., 2000)
Chromosome 20 (Deloukas et al., 2001)
Chromosome 14 (Heilig et al., 2003)
Chromosome Y (Skaletsky et al., 2003)
Chromosome 7 (Hillier et al., 2003; Scherer et al., 2003)
Chromosome 6 (Mungall et al., 2003)
Background of the human genome project
The Human Genome Project (HGP) was first proposed
by the U.S. National Research Council in 1988. The goals
were to create genetic, physical, and sequence maps
of the human genome. In parallel, genomes of model
organisms were to be studied.
(You can read this report online via http://www.nap.edu;
see page 618 for the full URL.)
Page 618
Human genome project: 8 goals
There have been 8 main goals for the HGP:
1] Human DNA sequence
2] Develop sequencing technology
3] Identify human genome sequence variation
4] Functional genomics technology
5] Comparative genomics
6] ELSI: ethical, legal, and social issues
7] Bioinformatics and computational biology
8] Training and manpower
Table 17.5
Page 624
Goal 8, training and manpower: encourage the establishment of academic career
paths for genomic scientists (job openings).
Goal 6, ELSI: Who owns genetic information? Who has access to it?
To what extent do genes determine behavior?
What is the relation between genes and race?
Human genome project: timeline (to 2001)
Source: IHGSC (2001)
Strategic issues:
Hierarchical / shotgun sequencing
The human genome was sequenced in parallel by
a public consortium (IHGSC) and by Celera Genomics.
These groups applied alternative sequencing strategies.
Page 620
Human genome project: strategies
Whole genome shotgun sequencing (Celera)
-- given the computational capacity, this approach
is far faster than hierarchical shotgun sequencing
-- the approach was largely validated using Drosophila
Hierarchical shotgun sequencing (public consortium)
-- 29,000 BAC clones
-- 4.3 billion base pairs
-- it is helpful to assign chromosomal loci to
sequenced fragments, especially in light of
the large amount of repetitive DNA in the genome
-- individual chromosomes assigned to centers
Sequenced-clone contigs are merged to form
scaffolds of known order and orientation
Source: IHGSC (2001)
Fig. 17.14
Page 626
Features of the genome sequence
The genome sequence (in 2001) included a mixture of
finished, draft, and pre-draft data. The N50 length
describes the largest length L such that 50% of all
nucleotides are contained in contigs or scaffolds
of at least size L. For the 2001 version of the human
genome, N50 is at least 8.4 Mb.
For current N50 statistics, see
http://genome.ucsc.edu/goldenpath/stats.html
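The N50 definition above can be sketched in a few lines of Python. This is a minimal illustration, not the consortium's own calculation; the function name n50 and the contig lengths are hypothetical.

```python
def n50(lengths):
    """Return the largest L such that contigs of length >= L
    together contain at least 50% of all nucleotides."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("need at least one contig")


# Hypothetical contig lengths (arbitrary units), for illustration only.
contig_lengths = [80, 70, 50, 40, 30, 20]
print(n50(contig_lengths))
```

Here the two longest contigs (80 + 70 = 150) already cover at least half of the 290 total bases, so the N50 is 70.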
Page 623
Features of the genome sequence
The quality of genome sequence is assessed by counting
the number of gaps and by measuring the nucleotide
accuracy. About 91% of the unfinished draft sequence
had an error rate less than 1 per 10,000 bases.
This corresponds to a PHRAP score >40 (i.e. an error
probability of 10^(-40/10) = 10^-4, or 99.99% accuracy) (see
Chapter 12).
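The relation between a quality score Q and the error probability, p = 10^(-Q/10), can be checked with a short Python sketch. The function names are hypothetical and chosen only for this illustration.

```python
import math


def score_to_error_probability(q):
    """Convert a quality score Q to an error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_score(p):
    """Convert an error probability p back to a score Q = -10 * log10(p)."""
    return -10 * math.log10(p)


# A score of 40 corresponds to roughly 1 error per 10,000 bases.
p = score_to_error_probability(40)
print(p, 1 - p)
```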
Page 623
Broad genomic landscape
The broad genomic landscape includes the following features:
• Long-range variation in GC content
• CpG islands
• Comparison of genetic and physical distance
• The repeat content of the human genome
• The gene content of the human genome
Page 627
Broad genomic landscape: GC content
The overall GC content of the human genome is 41%.
A plot of GC content versus number of 20 kb windows
shows a broad profile with skewing to the right.
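A windowed GC calculation of this kind might look like the following Python sketch. It assumes non-overlapping windows and ignores ambiguous bases such as N; both choices are assumptions made for illustration, and the function names are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq.
    Returns None if the sequence contains no unambiguous bases."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return None
    return (seq.count("G") + seq.count("C")) / counted


def windowed_gc(seq, window=20_000):
    """GC fraction of each complete, non-overlapping window.
    A trailing partial window is dropped."""
    return [
        gc_fraction(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, window)
    ]


# Toy example with a 4-base window instead of 20 kb.
print(windowed_gc("GGGGAAAACC", window=4))
```

A histogram of the values returned by windowed_gc over a whole chromosome would give a profile of the type described on this slide.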
Page 627
GC content of the human genome: mean 41%
Source: IHGSC (2001)
Fig. 17.15
Page 628
Broad genomic landscape: GC content
Some genomic regions are GC-rich, while some are GC-poor.
Giorgio Bernardi and colleagues described “isochores”
which are large DNA segments (e.g. >300 kb) that are
fairly homogeneous compositionally and can be divided
into GC-rich and GC-poor subtypes.
Page 628
Do isochores exist?
(Perhaps, but not on chromosomes 21, 22)
Source: IHGSC (2001)
Broad genomic landscape: CpG islands
Dinucleotides of CpG are under-represented in
genomic DNA, occurring at one fifth the expected frequency.
CpG dinucleotides are often methylated on cytosine
(and may subsequently be deaminated to thymine).
Methylated CpG residues are often associated with
house-keeping genes in the promoter and exonic regions.
Methyl-CpG binding proteins recruit histone deacetylases
and are thus responsible for transcriptional repression.
They have roles in gene silencing, genomic imprinting,
and X-chromosome inactivation.
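The under-representation of CpG can be expressed as an observed/expected ratio. The sketch below uses a commonly used formula, (number of CpG x sequence length) / (number of C x number of G); this formula is an assumption here, since the slide does not state how the expected frequency was computed. A ratio near 0.2 would correspond to the "one fifth" figure above.

```python
def cpg_observed_over_expected(seq):
    """Observed/expected CpG ratio:
    (count of 'CG' * length) / (count of 'C' * count of 'G')."""
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


# Toy sequences, for illustration only.
print(cpg_observed_over_expected("CCGG"))  # one CpG among two C and two G
print(cpg_observed_over_expected("GGCC"))  # same composition, no CpG
```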
Page 628
Broad genomic landscape: CpG islands
Findings:
50,267 CpG islands in human genome
28,890 after masking repeats with RepeatMasker
5-15 CpG islands per megabase
(about <40 genes per megabase)
Page 628
Broad genomic landscape: CpG islands
Fig. 17.16
Page 629
Human genome: genetic vs physical distance
Genetic maps (linkage maps) measure genetic distance
based on meiotic recombination (DNA exchange).
The units are centimorgans (cM). One cM corresponds
to 1% recombination.
Physical maps describe physical positions of DNA (genes)
in megabases (Mb).
A comparison of the two types of map reveals
the rate of recombination per nucleotide.
Page 629
Genetic distance versus physical distance
Fig. 17.17
Page 630
Rate of recombination per nucleotide:
-- Suppressed near the centromeres
-- Higher near telomeres
-- Higher in males
-- There are deserts (<0.3 cM/Mb) and jungles (>3 cM/Mb)
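The comparison of the two map types reduces to dividing genetic distance by physical distance. A minimal Python sketch, using the desert and jungle thresholds from the list above; the function names, the "intermediate" label, and the interval values are hypothetical.

```python
def recombination_rate(genetic_cm, physical_mb):
    """Recombination rate in cM/Mb for one interval."""
    return genetic_cm / physical_mb


def classify_region(rate_cm_per_mb):
    """Label an interval using the thresholds quoted on the slide."""
    if rate_cm_per_mb < 0.3:
        return "desert"
    if rate_cm_per_mb > 3:
        return "jungle"
    return "intermediate"


# Hypothetical intervals: (genetic distance in cM, physical distance in Mb).
for cm, mb in [(1.0, 5.0), (8.0, 2.0), (1.0, 1.0)]:
    rate = recombination_rate(cm, mb)
    print(cm, mb, rate, classify_region(rate))
```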
Fig. 17.17
Page 630
Repeat content of the human genome
The human genome contains >50% repetitive DNA.
On Monday (Chapter 16), we discussed five classes of
repetitive DNA in humans:
[1] interspersed repeats (transposon-derived)
[2] processed pseudogenes
[3] simple sequence repeats (micro-, minisatellites)
[4] segmental duplications
[5] blocks of tandem repeats (e.g. at centromeres)
Page 629
Human genome: interspersed repeats
Four main classes of interspersed repeats:
[1] LINEs (21% of human genome)              RT (reverse transcriptase)
[2] SINEs (13%)                              RT
[3] Long terminal repeat transposons (8%)    RT
[4] DNA transposons (3%)                     transposase
Page 631
Structure of interspersed repeats
Fig. 17.18
Page 631
Comparison of the age of interspersed repeats
in four eukaryotic genomes
Most interspersed repeats in human are ancient.
There is no evidence of DNA transposon activity in the
past 50 million years; thus DNA transposons are extinct fossils.
Fig. 17.19
Page 632
Interspersed repeats on
chromosomes 2 and 22
Red bars = interspersed repeats
Blue bars = exons of known genes
Scale = about 1 Mb
Source: IHGSC (2001), Fig. 21
Higher substitution rate on the Y
than on the X chromosome (L1 subfamilies)
Source: IHGSC (2001), Fig. 21
Human genome: simple sequence repeats
Simple sequence repeats (SSR) are perfect (or slightly
imperfect) tandem repeats of k-mers. Microsatellites
have k = 1 to 12, while minisatellites have k from about
a dozen to 500 base pairs.
Micro- and minisatellites comprise 3% of the genome.
AC, AT, and AG are the most common dinucleotide
repeats.
Repeat unit length (k)    Elements per megabase
1                         33.7
2                         43.1
3                         11.8
4                         32.5
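Perfect tandem repeats of a k-mer can be located with a regular expression backreference. The following Python sketch is an illustration only: it finds perfect repeats, does not handle the slightly imperfect repeats mentioned above, and the minimum copy number is an arbitrary assumption.

```python
import re


def find_ssrs(seq, k, min_copies=4):
    """Find perfect tandem repeats of a k-mer with at least min_copies copies.
    Returns a list of (start, unit, copies) tuples.
    Note: a unit such as 'ACAC' with k=4 is itself a dinucleotide repeat;
    this sketch does not collapse such cases."""
    # ([ACGT]{k}) captures the unit; \1{n,} requires n further copies.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)) // k)
        for m in pattern.finditer(seq.upper())
    ]


# Toy sequence containing five copies of the dinucleotide AC.
print(find_ssrs("TTACACACACACGG", k=2))
```

Counting the hits per megabase for k = 1 to 4 over a genome would give a table of the same form as the one above.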
Page 632
Human genome: segmental duplications
About 3.6% of the finished human genome sequence
consists of segmental duplications, typically 10-50 kb.
Centromeres contain particularly large amounts of
interchromosomally duplicated DNA.
Page 632
Human genome: segmental duplications
on chromosome 22q
Red = interchromosomal
Blue = intrachromosomal
Fig. 17.20
Page 633
Human genome: gene content
As for any eukaryotic genome, gene prediction is difficult.
• For the human genome, the average exon
is only 150 nucleotides long, so exons are hard to identify.
• Exon/intron borders can be difficult to assign.
• Introns may be many kilobases in length.
• Pseudogenes may be difficult to identify.
• Noncoding RNAs are also difficult to identify.
According to the IHGSC and Celera, there are
~30,000 to 40,000 genes. The gene density is
about 27,000 base pairs per gene
Page 633
Human genome: noncoding RNA
Noncoding RNAs include the following (see Monday’s lecture):
tRNA
rRNA
snoRNA
snRNA
miRNA
They can be difficult to identify because
-- they lack open reading frames
-- they may be extremely small
-- they are not polyadenylated
-- they may not be in cDNA libraries
They are usually identified using blastn.
Page 635
The human genetic code and
associated tRNA genes
Source: IHGSC (2001), Fig. 21
Human genome: protein coding genes
When RefSeq genes (manually curated) were compared
to the draft human genome sequence, 92% could be
aligned at high stringency over part of their length, and
85% could be aligned over more than half their length.
Some RefSeq genes had high stringency matches to
multiple genomic locations. This could be due to paralogs,
pseudogenes, or misassembly of the genome sequence.
Page 636
Human genome: protein coding genes
The basic characteristics of human genes include:
Feature               Size (median)   Size (mean)
Internal exon         122 bp          145 bp
Exon number           7               8.8
Introns               1023 bp         3365 bp
3’ untranslated       400 bp          770 bp
5’ untranslated       240 bp          300 bp
Coding sequence       1100 bp         1340 bp
Coding sequence       367 aa          447 aa
Genomic extent        14 kb           27 kb
Page 636
Values are likely underestimates: many UTRs are incomplete,
and longer introns are more likely to be interrupted by gaps.
Human genome: protein coding genes
Human protein-coding genes have lengths about the same
as those of worms and flies:
Human coding sequence   1340 bp
Worm                    1311 bp
Fly                     1497 bp
However, intron size differs substantially. It is more
variable for humans, with a peak at 87 bp but a mean of
more than 3300 bp.
98.12% of human introns use the canonical dinucleotides
GT at the 5’ splice site and AG at the 3’ splice site.
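The canonical splice-site rule can be checked directly on intron sequences. The Python sketch below assumes each intron is given as DNA on the sense strand, from its first to its last base; the function names and example sequences are hypothetical.

```python
def has_canonical_splice_sites(intron):
    """True if the intron starts with GT (5' splice site)
    and ends with AG (3' splice site)."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def canonical_fraction(introns):
    """Fraction of introns that use the canonical GT...AG dinucleotides."""
    introns = list(introns)
    return sum(has_canonical_splice_sites(i) for i in introns) / len(introns)


# Hypothetical toy introns; three of the four are GT...AG.
toy_introns = ["GTAAAG", "GTCCAG", "ATCCAC", "GTTTAG"]
print(canonical_fraction(toy_introns))
```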
Page 636
Size distribution of exons, introns, and short introns
Fig. 17.21
Page 637
Human genome: protein coding genes
Protein-coding genes are associated with a high
GC content.
AT rich regions tend to be gene-poor with many sprawling
genes containing large introns.
Page 636
Correlation of gene density with GC content
Red = 9315 known genes
Blue = genome
Fig. 17.22
Page 638
Correlation of gene density with GC content
Gene density as a function of GC content:
gene density rises 10-fold as GC content
increases from 30% to 50%
Fig. 17.22
Page 638
Correlation of gene density with GC content
Exon length is constant with increasing GC
content, while intron length decreases
at higher GC content
Fig. 17.22
Page 638
Human genome: protein coding genes
As part of the sequencing effort, the most common
protein families, domains, and motifs have been catalogued.
This also permits comparative proteomic analyses.
Overall, 40% of predicted human proteins could be placed
in InterPro functional categories. A blastp search of every
human protein revealed that 74% had significant matches
to known proteins.
Page 637
Functional protein categories in the proteomes of
yeast, Arabidopsis, worm, fly, and human
Fig. 17.23
Page 640
Taxonomic distribution of protein homologs
of predicted human proteins
Fig. 17.24
Page 641
Human genome: proteome complexity
Although the human genome encodes a number of proteins
comparable to that of other genomes, the human
proteome may display greater complexity.
• There are relatively more domains and protein families
in humans
• The human genome includes relatively more paralogs,
potentially yielding more functional diversity
• Domain architectures tend to be more complex
• Alternative RNA splicing may be more extensive
in humans.
Page 637