The Human Genome (part 2 of 2)
Wednesday, November 5, 2003
Introduction to Bioinformatics
Johns Hopkins School of Medicine, ME:440.714
J. Pevsner, pevsner@jhmi.edu

Copyright notice
Many of the images in this powerpoint presentation are from Bioinformatics and Functional Genomics by J Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by Wiley. These images and materials may not be used without permission from the publisher. Visit http://www.bioinfbook.org

The human genome from a bioinformatics perspective:
International Human Genome Sequencing Consortium (2001). Initial sequencing and analysis of the human genome. Nature 409, 860-921.
To freely access this article, visit http://www.nature.com/genomics/human/ and click the “human genome” link. From this URL you can also access Watson and Crick’s 1953 article on the double helix.
Page 617

Human genome publications
Since 1999, articles have been published on seven individual human chromosomes:
Chromosome 22 (Dunham et al., 1999)
Chromosome 21 (Hattori et al., 2000)
Chromosome 20 (Deloukas et al., 2001)
Chromosome 14 (Heilig et al., 2003)
Chromosome Y (Skaletsky et al., 2003)
Chromosome 7 (Hillier et al., 2003; Scherer et al., 2003)
Chromosome 6 (Mungall et al., 2003)

Background of the human genome project
The Human Genome Project (HGP) was first proposed by the U.S. National Research Council in 1988. The goals were to create genetic, physical, and sequence maps of the human genome. In parallel, genomes of model organisms were to be studied.
(You can read this report on-line via http://www.nap.edu; see page 618 for the full URL.)
Page 618

Human genome project: 8 goals
There have been 8 main goals for the HGP:
1] Human DNA sequence
2] Develop sequencing technology
3] Identify human genome sequence variation
4] Functional genomics technology
5] Comparative genomics
6] ELSI: ethical, legal, and social issues
   -- Who owns genetic information?
   -- Who has access to it?
   -- To what extent do genes determine behavior?
   -- What is the relation between genes and race?
7] Bioinformatics and computational biology
8] Training and manpower
   -- Encourage the establishment of academic career paths for genomic scientists – job openings
Table 17.5 Page 624

Human genome project: timeline (to 2001)
Source: IHGSC (2001)

Strategic issues: Hierarchical / shotgun sequencing
The human genome was sequenced in parallel by a public consortium (IHGSC) and by Celera Genomics. These groups applied alternative sequencing strategies.
Page 620

Human genome project: strategies
Whole genome shotgun sequencing (Celera)
-- given the computational capacity, this approach is far faster than hierarchical shotgun sequencing
-- the approach was ~validated using Drosophila
Hierarchical shotgun sequencing (public consortium)
-- 29,000 BAC clones
-- 4.3 billion base pairs
-- it is helpful to assign chromosomal loci to sequenced fragments, especially in light of the large amount of repetitive DNA in the genome
-- individual chromosomes assigned to centers

Sequenced-clone contigs are merged to form scaffolds of known order and orientation
Source: IHGSC (2001)
Fig. 17.14 Page 626

Features of the genome sequence
The genome sequence (in 2001) included a mixture of finished, draft, and pre-draft data. The N50 length is the largest length L such that 50% of all nucleotides are contained in contigs or scaffolds of size at least L. For the 2001 version of the human genome, N50 is at least 8.4 Mb. For current N50 statistics, see http://genome.ucsc.edu/goldenpath/stats.html
Page 623

Features of the genome sequence
The quality of genome sequence is assessed by counting the number of gaps and by measuring the nucleotide accuracy. About 91% of the unfinished draft sequence had an error rate of less than 1 per 10,000 bases. This corresponds to a PHRAP score >40 (i.e. an error probability of 10^(-40/10) = 10^-4, or 99.99% accuracy) (see Chapter 12).
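The two quality measures on the slides above, N50 and the PHRAP-style quality score, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the example contig lengths are hypothetical and do not come from the lecture or from any assembler's actual code.

```python
def n50(lengths):
    """Return the largest L such that contigs of length >= L
    contain at least half of all nucleotides."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


def error_probability(quality_score):
    """Convert a quality score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-quality_score / 10)


if __name__ == "__main__":
    # Hypothetical contig lengths, for illustration only.
    contigs = [2, 3, 4, 5, 6]
    print("N50:", n50(contigs))
    p = error_probability(40)
    print("error probability at Q=40:", p)
    print("accuracy at Q=40:", 1 - p)
```

With the hypothetical lengths above, the contigs of length 6 and 5 together hold 11 of the 20 bases, so the N50 is 5. A score of 40 gives an error probability of about 1 in 10,000, matching the 99.99% accuracy quoted on the slide.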
Page 623

Broad genomic landscape
The broad genomic landscape includes the following features:
• Long-range variation in GC content
• CpG islands
• Comparison of genetic and physical distance
• The repeat content of the human genome
• The gene content of the human genome
Page 627

Broad genomic landscape: GC content
The overall GC content of the human genome is 41%. A plot of GC content versus number of 20 kb windows shows a broad profile with skewing to the right.
Page 627

GC content of the human genome: mean 41%
Source: IHGSC (2001)
Fig. 17.15 Page 628

Broad genomic landscape: GC content
Some genomic regions are GC-rich, while some are GC-poor. Giorgio Bernardi and colleagues described “isochores”, which are large DNA segments (e.g. >300 kb) that are fairly homogeneous compositionally and can be divided into GC-rich and GC-poor subtypes.
Page 628

Do isochores exist? (Perhaps, but not on chromosomes 21, 22)
Source: IHGSC (2001)

Broad genomic landscape: CpG islands
Dinucleotides of CpG are under-represented in genomic DNA, occurring at one fifth the expected frequency. CpG dinucleotides are often methylated on cytosine (and may subsequently be deaminated to thymine). Methylated CpG residues are often associated with the promoter and exonic regions of house-keeping genes. Methyl-CpG binding proteins recruit histone deacetylases and are thus responsible for transcriptional repression. They have roles in gene silencing, genomic imprinting, and X-chromosome inactivation.
Page 628

Broad genomic landscape: CpG islands
Findings:
50,267 CpG islands in human genome
28,890 after masking repeats with RepeatMasker
5-15 CpG islands per megabase (about <40 genes per megabase)
Page 628

Broad genomic landscape: CpG islands
Fig. 17.16 Page 629

Human genome: genetic vs physical distance
Genetic maps (linkage maps) measure genetic distance based on meiotic recombination (DNA exchange). The units are centimorgans (cM). One cM corresponds to 1% recombination.
Physical maps describe physical positions of DNA (genes) in megabases (Mb). A comparison of the two types of map reveals the rate of recombination per nucleotide.
Page 629

Genetic distance vs physical distance
Fig. 17.17 Page 630

Rate of recombination per nucleotide:
-- Suppressed near the centromeres
-- Higher near telomeres
-- Higher in males
-- There are deserts (<0.3 cM/Mb) and jungles (>3 cM/Mb)
Fig. 17.17 Page 630

Repeat content of the human genome
The human genome contains >50% repetitive DNA. On Monday (Chapter 16), we discussed five classes of repetitive DNA in humans:
[1] interspersed repeats (transposon-derived)
[2] processed pseudogenes
[3] simple sequence repeats (micro-, minisatellites)
[4] segmental duplications
[5] blocks of tandem repeats (e.g. at centromeres)
Page 629

Human genome: interspersed repeats
Four main classes of interspersed repeats:
[1] LINEs (21% of human genome): RT
[2] SINEs (13%): RT
[3] Long terminal repeat transposons (8%): RT
[4] DNA transposons (3%): transposase
Page 631

Structure of interspersed repeats
Fig. 17.18 Page 631

Comparison of the age of interspersed repeats in four eukaryotic genomes
Most interspersed repeats in human are ancient. There is no evidence of DNA transposon activity in the past 50 MY; thus they are extinct fossils.
Fig. 17.19 Page 632

Interspersed repeats on chromosomes 2 and 22
Red bars = interspersed repeats
Blue bars = exons of known genes
Scale = about 1 Mb
Panels: human, mouse
Source: IHGSC (2001), Fig. 21

Higher substitution rate on the Y than on the X chromosome (L1 subfamilies)
Source: IHGSC (2001), Fig. 21

Human genome: simple sequence repeats
Simple sequence repeats (SSR) are perfect (or slightly imperfect) tandem repeats of k-mers. Microsatellites have k = 1 to 12, while minisatellites have k from about a dozen to 500 base pairs. Micro- and minisatellites comprise 3% of the genome. AC, AT, and AG are the most common dinucleotide repeats.
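As a rough illustration of the definition above, perfect tandem repeats of a k-mer can be located with a regular expression. This is a minimal sketch, not the method used by the IHGSC: the function name, the `min_copies` threshold, and the example sequence are assumptions made here, and slightly imperfect repeats are not detected.

```python
import re


def find_ssrs(seq, k, min_copies=3):
    """Find perfect tandem repeats whose repeat unit has length k.

    Returns a list of (start, unit, copies) tuples.
    min_copies is an illustrative threshold, not a value from the lecture.
    """
    # A k-mer followed by at least (min_copies - 1) further copies of itself.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        # Skip units that are themselves repeats of a shorter unit
        # (e.g. "AA" is really a mononucleotide repeat).
        if (unit + unit).find(unit, 1) < k:
            continue
        hits.append((match.start(), unit, len(match.group(0)) // k))
    return hits


if __name__ == "__main__":
    # Hypothetical sequence containing four copies of the dinucleotide AC.
    print(find_ssrs("GGACACACACTT", k=2))
```

Counting hits per megabase of sequence for each k would give figures in the same units as the elements/megabase values reported for this slide.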
Length (k)    Elements/megabase
1             33.7
2             43.1
3             11.8
4             32.5
Page 632

Human genome: segmental duplications
About 3.6% of the finished human genome sequence consists of segmental duplications, typically 10-50 kb. Centromeres contain particularly large amounts of interchromosomally duplicated DNA.
Page 632

Human genome: segmental duplications on chromosome 22q
Red = interchromosomal
Blue = intrachromosomal
Fig. 17.20 Page 633

Human genome: gene content
As for any eukaryotic genome, gene prediction is difficult.
• For the human genome, the average exon is only 150 nucleotides, so exons are hard to identify.
• Exon/intron borders can be difficult to assign.
• Introns may be many kilobases in length.
• Pseudogenes may be difficult to identify.
• Noncoding RNAs are also difficult to identify.
According to the IHGSC and Celera, there are ~30,000 to 40,000 genes. The gene density is about 27,000 base pairs per gene.
Page 633

Human genome: noncoding RNA
Noncoding RNAs include the following (see Monday’s lecture): tRNA, rRNA, snoRNA, snRNA, miRNA.
They can be difficult to identify because
-- they lack open reading frames
-- they may be extremely small
-- they are not polyadenylated
-- they may not be in cDNA libraries
They are usually identified using blastn.
Page 635

The human genetic code and associated tRNA genes
Source: IHGSC (2001), Fig. 21

Human genome: protein coding genes
When RefSeq genes (manually curated) were compared to the draft human genome sequence, 92% could be aligned at high stringency over part of their length, and 85% could be aligned over more than half their length. Some RefSeq genes had high stringency matches to multiple genomic locations. This could be due to paralogs, pseudogenes, or misassembly of the genome sequence.
Page 636

Human genome: protein coding genes
The basic characteristics of human genes include:

Feature            Size (median)    Size (mean)
Internal exon      122 bp           145 bp
Exon number        7                8.8
Introns            1023 bp          3365 bp
3’ untranslated    400 bp           770 bp
5’ untranslated    240 bp           300 bp
Coding sequence    1100 bp          1340 bp
Coding sequence    367 aa           447 aa
Genomic extent     14 kb            27 kb

Values are likely underestimates: many UTRs are incomplete, and longer introns are more likely to be interrupted by gaps.
Page 636

Human genome: protein coding genes
Human protein-coding genes have lengths about the same as those of worms and flies:
Human coding sequence 1340 bp
Worm 1311 bp
Fly 1497 bp
However, intron size differs substantially. It is more variable for humans, with a peak at 87 bp but a mean of more than 3300 bp. 98.12% of human introns use the canonical dinucleotides GT at the 5’ splice site and AG at the 3’ splice site.
Page 636

Size distribution of exons… introns… and short introns
Fig. 17.21 Page 637

Human genome: protein coding genes
Protein-coding genes are associated with a high GC content. AT-rich regions tend to be gene-poor, with many sprawling genes containing large introns.
Page 636

Correlation of gene density with GC content
Red = 9315 known genes
Blue = genome
Gene density as a function of GC content: gene density rises 10-fold as GC content increases from 30% to 50%.
Exon length is constant with increasing GC content, while intron length decreases at higher GC content.
Fig. 17.22 Page 638

Human genome: protein coding genes
As part of the sequencing effort, the most common protein families, domains, and motifs have been catalogued. This also permits comparative proteomic analyses. Overall, 40% of predicted human proteins could be placed in InterPro functional categories. A blastp search of every human protein revealed that 74% had significant matches to known proteins.
Page 637

Functional protein categories in the proteomes of yeast, Arabidopsis, worm, fly, and human
Fig. 17.23 Page 640

Taxonomic distribution of protein homologs of predicted human proteins
Fig. 17.24 Page 641

Human genome: proteome complexity
Although the human genome encodes a number of proteins comparable to that of other genomes, the human proteome may display greater complexity.
• There are relatively more domains and protein families in humans.
• The human genome includes relatively more paralogs, potentially yielding more functional diversity.
• Domain architectures tend to be more complex.
• Alternative RNA splicing may be more extensive in humans.
Page 637
Source: IHGSC (2001)